Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus writes:
> Greg,
>
> > We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that feature.
>
> The block-based algorithms have specific math to cope with this. Stuff which is better grounded in statistical analysis than our code. Please read the papers before you judge the solution.

I'm confused, since Postgres's current setup *was* based on papers on just this topic. I haven't read any of the papers myself; I'm just repeating what was previously discussed when the current block sampling code went in. There was extensive discussion of the pros and cons of different algorithms taken from various papers and how they related to Postgres. I don't recall any of them suggesting that there was any way to take a sample which included every row from a sampled block and then somehow unbias the sample after the fact.

> Right, which is why researchers are currently looking for something better. The Brutlag & Richardson paper claims to be able to produce estimates which are within +/- 3x 90% of the time using a 5% sample, which is far better than our current accuracy. Nobody claims to be able to estimate based on < 0.1% of the table, which is what Postgres tries to do for large tables.
>
> 5% based on block-based sampling is reasonable; that means a straight 5% of the on-disk size of the table, so 5gb for a 100gb table. With random-row sampling, that would require as much as 25% of the table, making it easier to just scan the whole thing.

Postgres's current sample sizes are clearly geared towards the histograms, where they are entirely realistic. All of the distinct estimates are clearly just ad hoc attempts based on the existing sampling. Is a mechanism that is only 5x faster than reading the whole table (assuming random_page_cost of 4), and is off by more than a factor of three 10% of the time, really worth it?

-- greg
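(For scale, the rough arithmetic behind that last question, using the numbers quoted above: a 100 GB table is about 13 million 8 KB blocks, so a 5% block sample means roughly 650,000 randomly placed block reads. At random_page_cost = 4 that costs about as much as 2.6 million sequential block reads, versus 13 million for a full sequential scan -- i.e. only about 5x cheaper.)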
Re: [HACKERS] Warm-up cache may have its virtue
On Thu, 5 Jan 2006, Qingqing Zhou wrote:
>
> Feasibility: Our bufmgr lock rewrite already makes this possible. But to enable it, we may need more work: (w1) make the buffer pool relation-wise, which makes our estimation of data page residence easier and more reliable; (w2) add aggressive pre-read at the buffer pool level. Another benefit of w1 is that it will let our query planner estimate query cost more precisely.
>

"w1" is doable by introducing a shared-memory bitmap indicating which pages of a relation are in the buffer pool (we may want to add a hash to manage the relations). Theoretically, O(shared_buffers) bits are enough, so this will not use a lot of space. When we maintain the SharedBufHash, we maintain this bitmap. When we do query cost estimation or preread, we just need a rough number, so this can be done by scanning the bitmap without taking a lock; thus there is almost no extra cost.

Regards,
Qingqing
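(For illustration only, a hypothetical C sketch of such a residency bitmap. The structure and function names are inventions for this example, not PostgreSQL code; writers would update bits wherever SharedBufHash is maintained, and readers that only need a rough count scan without a lock.)

    #include <stdint.h>

    /* One bit per shared buffer slot: does this relation own a page there? */
    typedef struct RelResidencyMap
    {
        uint32_t    nslots;         /* == number of shared buffers */
        uint8_t     bits[1];        /* variable-length bitmap, nslots bits */
    } RelResidencyMap;

    /* Called (under the buffer-mapping lock) when a buffer is assigned to
     * or reclaimed from this relation. */
    static void
    residency_mark(RelResidencyMap *m, uint32_t slot, int resident)
    {
        if (resident)
            m->bits[slot / 8] |= (uint8_t) (1 << (slot % 8));
        else
            m->bits[slot / 8] &= (uint8_t) ~(1 << (slot % 8));
    }

    /* Lock-free, approximate count of cached pages -- good enough for cost
     * estimation or for deciding whether a pre-read pass is worthwhile. */
    static uint32_t
    residency_estimate(const RelResidencyMap *m)
    {
        uint32_t    n = 0;

        for (uint32_t slot = 0; slot < m->nslots; slot++)
            if (m->bits[slot / 8] & (1 << (slot % 8)))
                n++;
        return n;
    }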
Re: [HACKERS] catalog corruption bug
Jeremy Drake <[EMAIL PROTECTED]> writes:
> Here is some additional information that I have managed to gather today regarding this. It is not really what causes it, so much as what does not.
> ...
> Similar for pg_type, there being 248 index row versions vs 244 row versions in the table.

The ReadBuffer bug I just fixed could result in disappearance of catalog rows, so this observation is consistent with the theory that that's what's biting you. It's not proof though...

			regards, tom lane
Re: [HACKERS] catalog corruption bug
Here is some additional information that I have managed to gather today regarding this. It is not really what causes it, so much as what does not.

I removed all plperl from the loading processes. I did a VACUUM FULL ANALYZE, and then I reindexed everything in the database (including starting the backend in standalone mode and running REINDEX SYSTEM dbname). They still failed. So it is apparently not that plperl issue which was being discussed earlier.

Also, what I said about the corruption not having persisted into other backends was not quite correct. It was leaving behind types in pg_type which were in some of the pg_temp* schemas which corresponded to some of the temp tables. But I took those out and still had issues (slightly different).

Here is some interesting stuff too. I just stopped my processes to start up a batch again to copy the error message I got now, but before doing so I was doing a VACUUM FULL ANALYZE VERBOSE so I could hopefully start from a relatively clean state. I got a few warnings I don't remember seeing before.

INFO: vacuuming "pg_catalog.pg_shdepend"
INFO: "pg_shdepend": found 108 removable, 440 nonremovable row versions in 15 pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 53 to 53 bytes long.
There were 1492 unused item pointers.
Total free space (including removable row versions) is 89780 bytes.
7 pages are or will become empty, including 0 at the end of the table.
12 pages containing 89744 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "pg_shdepend_depender_index" now contains 448 row versions in 33 pages
DETAIL: 108 index row versions were removed.
23 index pages have been deleted, 23 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.10 sec.
WARNING: index "pg_shdepend_depender_index" contains 448 row versions, but table contains 440 row versions
HINT: Rebuild the index with REINDEX.
INFO: index "pg_shdepend_reference_index" now contains 448 row versions in 12 pages
DETAIL: 108 index row versions were removed.
3 index pages have been deleted, 3 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING: index "pg_shdepend_reference_index" contains 448 row versions, but table contains 440 row versions
HINT: Rebuild the index with REINDEX.
INFO: "pg_shdepend": moved 4 row versions, truncated 15 to 4 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "pg_shdepend_depender_index" now contains 448 row versions in 33 pages
DETAIL: 4 index row versions were removed.
23 index pages have been deleted, 23 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING: index "pg_shdepend_depender_index" contains 448 row versions, but table contains 440 row versions
HINT: Rebuild the index with REINDEX.
INFO: index "pg_shdepend_reference_index" now contains 448 row versions in 12 pages
DETAIL: 4 index row versions were removed.
4 index pages have been deleted, 4 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING: index "pg_shdepend_reference_index" contains 448 row versions, but table contains 440 row versions
HINT: Rebuild the index with REINDEX.
INFO: analyzing "pg_catalog.pg_shdepend"
INFO: "pg_shdepend": scanned 4 of 4 pages, containing 440 live rows and 0 dead rows; 440 rows in sample, 440 estimated total rows

Similar for pg_type, there being 248 index row versions vs 244 row versions in the table.
1631 vs 1619 in pg_attribute, 95 vs 94 in pg_index.

Looks like it may be time to start a standalone backend and REINDEX again...

--
Don't smoke the next cigarette. Repeat.
Re: [HACKERS] Warm-up cache may have its virtue
On Thu, 5 Jan 2006, Tom Lane wrote:
>
> The difference between the cached and non-cached states is that the kernel has seen fit to remove those pages from its cache. It is reasonable to suppose that it did so because there was a more immediate use for the memory. Trying to override that behavior will therefore result in de-optimizing the global performance of the machine.
>

Yeah, so in other words, warming up the cache is just a waste of time if the pages are already in the OS cache. I agree with this. But does this mean it may be worth experimenting with another strategy: a big-stomach Postgres, i.e., one with a big shared_buffers value? With this strategy, (1) almost all the buffers are under our control, and we will know when a pre-read is needed; (2) we avoid double-buffering. People are usually advised not to use a very big shared_buffers value, but in practice I have seen people gain performance by increasing it to 20 or more.

Feasibility: Our bufmgr lock rewrite already makes this possible. But to enable it, we may need more work: (w1) make the buffer pool relation-wise, which makes our estimation of data page residence easier and more reliable; (w2) add aggressive pre-read at the buffer pool level. Another benefit of w1 is that it will let our query planner estimate query cost more precisely.

Regards,
Qingqing
Re: [HACKERS] Warm-up cache may have its virtue
Qingqing Zhou <[EMAIL PROTECTED]> writes:
> Hinted by this thread: http://archives.postgresql.org/pgsql-performance/2006-01/msg00016.php
> I wonder if we should really implement file-system-cache-warmup strategy which we have discussed before.

The difference between the cached and non-cached states is that the kernel has seen fit to remove those pages from its cache. It is reasonable to suppose that it did so because there was a more immediate use for the memory. Trying to override that behavior will therefore result in de-optimizing the global performance of the machine.

If the machine is actually dedicated to Postgres, I'd expect disk pages to stay in cache without our taking any heroic measures to keep them there. If they don't, that's a matter for kernel configuration tuning, not "warmup" processes.

			regards, tom lane
Re: [HACKERS] Questions on printtup()
Qingqing Zhou <[EMAIL PROTECTED]> writes:
> So we spend a portion of time preparing tuples in printtup(), converting the tuple format to a network format. I am not quite familiar with that part, so I wonder whether it is possible to send the original tuple format with minimal preparation (say, handling toast) -- which might increase the amount of data sent over the network, but may reduce the CPU load on the server side?

It's called retrieving in binary format ...

			regards, tom lane
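(For illustration: what Tom is pointing at is the per-query binary result format that the protocol already supports. A minimal, hypothetical libpq fragment -- the connection string and table name are placeholders:)

    #include <libpq-fe.h>
    #include <stdio.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("dbname=test");
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        /* last argument = 1 asks for binary format for all result columns */
        res = PQexecParams(conn, "SELECT i FROM test",
                           0, NULL, NULL, NULL, NULL, 1);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("%d rows, first column returned in %s format\n",
                   PQntuples(res),
                   PQfformat(res, 0) == 1 ? "binary" : "text");

        PQclear(res);
        PQfinish(conn);
        return 0;
    }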
[HACKERS] Warm-up cache may have its virtue
Hinted by this thread: http://archives.postgresql.org/pgsql-performance/2006-01/msg00016.php

I wonder if we should really implement file-system-cache-warmup strategy which we have discussed before. There are two natural good places to do this:

(1) sequential scan
(2) bitmap index scan

We can consider (2) as a generalized version of (1). For (1), we have mentioned several heuristics, like keeping a scan interval to avoid competition. These strategies are also applicable to (2).

Question: why at the file-system level, instead of the buffer pool level? For two reasons: (1) notice that in the above thread the user only uses "shared_buffers = 8192", which suggests that the file-system level is already good enough; (2) it is easy to implement.

Use t*h*r*e*a*d? Well, I am a little bit afraid of mentioning this word. But we can have some dedicated backends do this - like bgwriter.

Let's dirty our hands! Comments?

Regards,
Qingqing
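(For illustration, a hypothetical sketch of what a file-system-level warm-up could look like on platforms that have posix_fadvise(); the helper name, the fixed block size, and the idea of prefetching a range ahead of the scan are assumptions for this example, not an actual patch:)

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    #define BLCKSZ 8192             /* PostgreSQL heap block size */

    /* Ask the kernel to start reading a range of heap blocks into its own
     * cache, some distance ahead of where the scan currently is. */
    static void
    prefetch_blocks(int fd, unsigned int first_block, unsigned int nblocks)
    {
        (void) posix_fadvise(fd,
                             (off_t) first_block * BLCKSZ,
                             (off_t) nblocks * BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }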
[HACKERS] Questions on printtup()
I did some profiling related to printtup() with a simple libpq "SELECT *" test program (revised from the libpq programming sample in the documentation, without retrieving the results). There are 260k or so records in table test(i int).

/* original version - prepare tuple and send */
SELECT * TIMING: 0.63 sec

/* Prepare but not send
 * In printtup():
 *   - pq_endmessage(&buf);
 *   + pfree(buf.data);
 *   + buf.data = NULL;
 */
SELECT * TIMING: 0.46 sec

/* No prepare, no send
 * In ExecSelect():
 *   - (*dest->receiveSlot) (slot, dest);
 */
SELECT * TIMING: 0.08 sec

So we spend a portion of time preparing tuples in printtup(), converting the tuple format to a network format. I am not quite familiar with that part, so I wonder whether it is possible to send the original tuple format with minimal preparation (say, handling toast) -- which might increase the amount of data sent over the network, but may reduce the CPU load on the server side?

If this is not a non-starter, I am happy to look into the details.

Regards,
Qingqing
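(For reference, a hypothetical reconstruction of that kind of timing driver; the connection string, table name, and use of wall-clock time are assumptions, not the original test program:)

    #include <libpq-fe.h>
    #include <stdio.h>
    #include <sys/time.h>

    int
    main(void)
    {
        PGconn         *conn = PQconnectdb("dbname=test");
        struct timeval  t0, t1;
        PGresult       *res;

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        gettimeofday(&t0, NULL);
        res = PQexec(conn, "SELECT * FROM test");   /* result not iterated */
        gettimeofday(&t1, NULL);

        printf("status=%s elapsed=%.2f sec\n",
               PQresStatus(PQresultStatus(res)),
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);

        PQclear(res);
        PQfinish(conn);
        return 0;
    }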
Re: [HACKERS] catalog corruption bug
Jeremy Drake <[EMAIL PROTECTED]> writes:
>>> We have encountered a very nasty but apparently rare bug which appears to result in catalog corruption.

I've been fooling around with this report today. In several hours of trying, I've been able to get one Assert failure from running Jeremy's example on CVS tip. (I would've given up long ago, except the Assert happened very soon after I started trying...) The assert was from this line in hio.c:

	Assert(PageIsNew((PageHeader) pageHeader));

which we've seen before in connection with the vacuum-vs-relation-extension race condition found last May. It seems we still have an issue of that sort :-(.

While fruitlessly waiting for the test to fail again, I've been combing through the code looking for possible failure paths, and I've found something that might explain it. I think this is definitely a bug even if it isn't what's happening in Jeremy's test:

mdread() is defined to not fail, but silently return a page of zeroes, if asked to read a page that's at or beyond the end of the table file. (As noted therein, this seems like pretty sucky design, but there are various reasons that make it hard to change the behavior.) Therefore, if for some reason a process tries to read the page just at EOF, it will leave behind a buffer pool entry that is marked BM_VALID but contains zeroes.

There are a number of scenarios that could cause this, but all seem rather low-probability. One way is if a process' rd_targblock field for a relation is pointing at the last page of the file and then VACUUM comes along and truncates off that page because it's empty. The next insertion attempt by the process will try to fetch that page, obtain all-zeroes, decide the page has no free space (PageGetFreeSpace is carefully coded to be sure that happens), and go looking for free space elsewhere.

Now suppose someone tries to obtain a new page in the relation by calling ReadBuffer(rel, P_NEW). The location of the new page is determined by asking lseek() how long the file is. ReadBuffer then obtains a buffer for that file offset --- and it is going to get a hit on the all-zeroes buffer previously left behind. Since the buffer is marked BM_VALID, the test "if it was already in the buffer pool, we're done" succeeds and the buffer is returned as-is. This is fine as far as the caller knows: it's expecting to get back an all-zero page, so it goes merrily along.

The problem is that if that code path is taken, we *have not extended the file on disk*. That means, until we get around to writing the dirty buffer to disk (eg via checkpoint), the kernel thinks the file doesn't contain that block yet. So if someone else comes along and again does ReadBuffer(rel, P_NEW), the lseek computation will return the same offset as before, and we'll wind up handing back the very same buffer as before. Now we get the above-mentioned Assert, if we are lucky enough to be running an assert-enabled build. Otherwise the code in hio.c will just wipe and reinitialize the page, leading to loss of whatever rows were previously placed in it.

Based on this analysis, the logic in ReadBuffer is wrong: if it finds an existing buffer in the P_NEW case, it still has to zero the page and do smgrextend() to be sure that the kernel thinks the page has been added to the file. I'm also thinking that the test for empty page in hio.c ought to be an actual test and elog, not just an Assert.
Now that we've seen two different bugs allowing the "can't happen" case to happen, I'm no longer satisfied with not having any check there in a non-assert build. The consequences of not detecting an overwrite are too severe.

			regards, tom lane
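(For illustration, a rough sketch -- not the committed fix -- of what turning that hio.c Assert into a hard check might look like; the surrounding variable names and the message wording are assumptions based on the description above:)

    /* instead of: Assert(PageIsNew((PageHeader) pageHeader)); */
    if (!PageIsNew((PageHeader) pageHeader))
        elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
             BufferGetBlockNumber(buffer),
             RelationGetRelationName(relation));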
Re: [HACKERS] [Bizgres-general] WAL bypass for INSERT, UPDATE and
On Thu, 2006-01-05 at 12:27 -0500, Bruce Momjian wrote:
> Seems like a nice optimization.

Negative thoughts:

Toast tables have a toast index on them, yes? We have agreed that we cannot use the optimization if we have indexes on the main table. It follows that we cannot use the optimization if we have *any* toasted data, since that would require a pointer between two blocks, which would not be correctly recovered following a crash. If we log the toast table then there could be a mismatch between heap and toast table; if we don't log the toast table there could be a mismatch between toast table and toast index.

We can test to see if the toast table is empty when we do ALTER TABLE, but loading operations may try to create toasted data rows. Presumably that means we must either:

i) abort a COPY if we get a toastable value

ii) if we get a toastable value, insert the row into a new block, which we do logging of, then also log the toast insert and the toast index insert - i.e. some blocks we log, others not

This is still useful for many applications, IMHO, but the list of restrictions seems to be growing.

Worse, we wouldn't know that the toast tables were empty until after we did the COPY TO for a pg_dump, so we wouldn't be able to retrospectively add an ALTER TABLE command ahead of the COPY.

Thoughts? Hopefully there are some flaws in my thinking here,

Best Regards, Simon Riggs
Re: [HACKERS] when can we get better partitioning?
hubert depesz lubaczewski wrote:
> hi
> i was really more than happy when i saw table partitioning in release info for 8.1. then i tried to use it, and hit some serious problem (described on pgsql-general). basically the question is - is anybody at the moment working on improving partitioning capabilities? like improving queries to partitioned tables with "ORDER BY" and "LIMIT" statements? if yes, when can we start thinking about release with this functionality?
> i, my own, do not have enough c knowledge to do it by myself, yet i would *really* like to have this feature improved, as it would save me and my coworkers a lot of work and hassle that we face right now with rewriting queries to ask directly partitions instead of master table, that got partitioned.
>
> best regards,
> depesz

You could consider sponsoring one of the developers to enhance the feature.

Joshua D. Drake

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
[HACKERS] when can we get better partitioning?
hi
i was really more than happy when i saw table partitioning in release info for 8.1. then i tried to use it, and hit some serious problem (described on pgsql-general). basically the question is - is anybody at the moment working on improving partitioning capabilities? like improving queries to partitioned tables with "ORDER BY" and "LIMIT" statements? if yes, when can we start thinking about release with this functionality?
i, my own, do not have enough c knowledge to do it by myself, yet i would *really* like to have this feature improved, as it would save me and my coworkers a lot of work and hassle that we face right now with rewriting queries to ask directly partitions instead of master table, that got partitioned.

best regards,
depesz
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Thu, Jan 05, 2006 at 10:12:29AM -0500, Greg Stark wrote:
> Worse, my recollection from the paper I mentioned earlier was that sampling small percentages like 3-5% didn't get you an acceptable accuracy. Before you got anything reliable you found you were sampling very large percentages of the table. And note that if you have to sample anything over 10-20% you may as well just read the whole table. Random access reads are that much slower.

If I'm reading backend/commands/analyze.c right, the heap is accessed linearly, only reading blocks that get selected but reading them in heap order, which shouldn't be anywhere near as bad as random access.

--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
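(For illustration, a small self-contained sketch of the general technique being described -- choosing n sample blocks out of N while visiting block numbers strictly in ascending order, so the chosen blocks can be read in physical heap order. This is the textbook sequential-sampling idea (Knuth's Algorithm S), not a copy of the analyze.c code:)

    #include <stdio.h>
    #include <stdlib.h>

    /* Choose n of the N blocks, emitting them in ascending block order. */
    static void
    sample_blocks_in_order(unsigned int N, unsigned int n)
    {
        unsigned int chosen = 0;

        for (unsigned int block = 0; block < N && chosen < n; block++)
        {
            /* take this block with probability (n - chosen) / (N - block) */
            if ((double) rand() / ((double) RAND_MAX + 1) <
                (double) (n - chosen) / (double) (N - block))
            {
                printf("sample block %u\n", block);
                chosen++;
            }
        }
    }

    int
    main(void)
    {
        sample_blocks_in_order(1000, 30);
        return 0;
    }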
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Greg,

> We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that feature.

The block-based algorithms have specific math to cope with this. Stuff which is better grounded in statistical analysis than our code. Please read the papers before you judge the solution.

> Worse, my recollection from the paper I mentioned earlier was that sampling small percentages like 3-5% didn't get you an acceptable accuracy. Before you got anything reliable you found you were sampling very large percentages of the table. And note that if you have to sample anything over 10-20% you may as well just read the whole table. Random access reads are that much slower.

Right, which is why researchers are currently looking for something better. The Brutlag & Richardson paper claims to be able to produce estimates which are within +/- 3x 90% of the time using a 5% sample, which is far better than our current accuracy. Nobody claims to be able to estimate based on < 0.1% of the table, which is what Postgres tries to do for large tables.

5% based on block-based sampling is reasonable; that means a straight 5% of the on-disk size of the table, so 5gb for a 100gb table. With random-row sampling, that would require as much as 25% of the table, making it easier to just scan the whole thing.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [HACKERS] Inconsistent syntax in GRANT
Josh Berkus wrote:
> Euler,
>
> > It should but it's not implemented yet. There is no difficulty in doing it. But I want to propose the following idea: if some object depends on another object and its type is 'DEPENDENCY_INTERNAL' we could grant/revoke privileges automagically to it. Or maybe create another type of dependency to do so.
> > Comments?
>
> I think this would be difficult to work out. Personally, the only clear-cut case I can think of is SERIAL columns; other dependancies would require a lot of conditional logic.

Added to TODO:

	* Allow SERIAL sequences to inherit permissions from the base table?

--
  Bruce Momjian                          |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us                 |  (610) 359-1001
  +  If your life is a hard drive,       |  13 Roberts Road
  +  Christ can be your backup.          |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] [Bizgres-general] WAL bypass for INSERT, UPDATE and
Simon Riggs wrote:
> On Thu, 2005-12-29 at 11:37 -0500, Bruce Momjian wrote:
> > Having COPY behave differently because it is in a transaction is fine as long as it is user-invisible, but once you require users to do that to get the speedup, it isn't user-invisible anymore.
>
> Since we're agreed on adding ALTER TABLE rather than COPY LOCK, we have our explicit mechanism for speedup.
>
> However, it costs a single line of code and very very little execution time to add in the optimization to COPY to make it bypass WAL when executed in the same transaction that created the table. Everything else is already there.
>
> As part of the use_wal test:
> +	if (resultRelInfo->ri_NumIndices == 0 &&
> +		!XLogArchivingActive() &&
> +		(cstate->rel->rd_createSubid != InvalidSubTransactionId))
> +		use_wal = false;
>
> the value is already retrieved from cache...
>
> Can anyone see a reason *not* to put that change in also? We just don't advertise it as the "suggested" route to gaining performance, nor would we rely on it for pg_dump/restore performance.

Seems like a nice optimization.

--
  Bruce Momjian                          |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us                 |  (610) 359-1001
  +  If your life is a hard drive,       |  13 Roberts Road
  +  Christ can be your backup.          |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] [Bizgres-general] WAL bypass for INSERT, UPDATE and
Simon Riggs wrote:
> > So, we need a name for EXCLUSIVE mode that suggests how it is different from TRUNCATE, and in this case, the difference is that EXCLUSIVE preserves the previous contents of the table on recovery, while TRUNCATE does not. Do you want to call the mode PRESERVE, or EXCLUSIVE WRITER? Anyway, the keywords are easy to modify, even after the patch is submitted. FYI, I usually go through keywords.c looking for a keyword we already use.
>
> I'm very happy for suggestions on what these new modes are called.
>
> > > > So, to summarize, I think we should add DROP/TRUNCATE, and use that by default (or optionally off?) in pg_dump, and, assuming we want EXCLUSIVE for more than just COPY, we need to add ALTER TABLE EXCLUSIVE.
> > >
> > > Would you mind stating again what you mean, just so I can understand this? Your summary isn't enough.
> >
> > New ALTER TABLE mode, perhaps call it PERSISTENCE:
> >
> > 	ALTER TABLE tab PERSISTENCE DROP ON RECOVERY
> > 	ALTER TABLE tab PERSISTENCE TRUNCATE ON RECOVERY
> >
> > These would drop or truncate all tables with this flag on a non-clean start of the postmaster, and write something in the server logs. However, I don't know that we have the code in place to DROP/TRUNCATE in recovery mode, and it would affect all databases, so it could be quite complex to implement. In this mode, no WAL logs would be written for table modifications, though DDL commands would have to be logged.
>
> Right now, this will be a TODO item... it looks like it will take some thought to implement correctly.

OK, I know my suggestions have made it more complicated. TODO added:

	* Allow control over which tables are WAL-logged

	  Allow tables to bypass WAL writes and just fsync() dirty pages on
	  commit. To do this, only a single writer can modify the table, and
	  writes must happen only on new pages. Readers can continue accessing
	  the table. This would affect COPY, and perhaps INSERT/UPDATE too.
	  Another option is to avoid transaction logging entirely and truncate
	  or drop the table on crash recovery. These should be implemented
	  using ALTER TABLE, e.g. ALTER TABLE PERSISTENCE [ DROP | TRUNCATE |
	  STABLE | DEFAULT ]. Tables using non-default logging should not use
	  referential integrity with default-logging tables, and tables using
	  stable logging probably can not have indexes. [walcontrol]

> > ALTER TABLE tab PERSISTENCE PRESERVE (or STABLE?)
> >
> > Table contents are preserved across recoveries, but data modifications can happen only one at a time. I don't think we have a lock mode that does this, so I am worried a new lock mode will have to be created. A simplified solution at this stage would be to take an exclusive lock on the table, but really we just need a single-writer table lock, which I don't think we have. Initially this can be implemented to only affect COPY but later can be done for other commands.
>
> ExclusiveLock locks out everything apart from readers, no new lock mode AFAICS. Implementing that is little additional work for COPY.

Nice.

> Tom had a concern about setting this for I, U, D commands via the executor. Not sure what the details of that are, as yet.

That is much more complicated than the COPY-only idea, for sure. I am thinking we could add the ALTER syntax and just do COPY at this stage, meaning that I/U/D still do full logging until we get to improving them. The big benefit is that the user API doesn't need to change when we improve the code.

In fact I think we could do the TRUNCATE/DROP easily for I/U/D, but the STABLE option would require work and we don't need to implement it in the first patch.

> We can use either of the unlogged modes for pg_dump, so I'd suggest its this one. Everybody happy with this being the new default in pg_dump, or should it be an option?
>
> > ALTER TABLE tab PERSISTENCE DEFAULT
> >
> > This would be our current default mode, which is full concurrency and persistence.
>
> I'm thinking whether the ALTER TABLE statement might be better with two bool flags rather than a 3-state char.
>
> flag 1: ENABLE LOGGING | DISABLE LOGGING
> flag 2: FULL RECOVERY | TRUNCATE ON RECOVERY
>
> Giving 3 possible sets of options:
>
> -- the default
> ALTER TABLE mytable ENABLE LOGGING FULL RECOVERY; (default)
>
> -- EXCLUSIVE mode
> ALTER TABLE mytable DISABLE LOGGING FULL RECOVERY;
> ...which would be used like this
> ALTER TABLE mytable DISABLE LOGGING;
> COPY or other bulk data manipulation SQL
> ALTER TABLE mytable ENABLE LOGGING;
> ...since FULL RECOVERY is the default.
>
> -- multiuser temp table mode
> ALTER TABLE mytable DISABLE LOGGING TRUNCATE ON RECOVERY;
> ...which would usually be left on all the time
>
> which only uses one new keyword LOGGING and yet all the modes are fairly explicit as to what they do.
>
> An alternative might be the slightly more ve
Re: [HACKERS] postmaster/postgres options assimilation plan
FYI, with the options merged, we still have this TODO item:

	* %Remove behavior of postmaster -o

---------------------------------------------------------------------------

Peter Eisentraut wrote:
> Here's the plan for assimilating the command-line options of the postmaster and postgres options. I reported earlier on a couple of conflict areas; here is the full plan:
>
> * Remove: postmaster -a -b -m -M
>
>   These options have done nothing forever.
>
> * postmaster options added to postgres: -h -i -k -l -n
>
>   These options will not have any useful effects, but their behavior is consistent if you do, say, SHOW listen_addresses.
>
> * postgres options added to postmaster: -e -E -f -O -P -t -W
>
>   Using -E with postmaster does nothing, though.
>
> * Renamed options (because of conflicts):
>
>   postgres -N is now postgres -j (mostly internal use)
>
>   postgres -o is now postgres -r (mostly internal use)
>
>   (postmaster -o is obsolete but still works for compatibility; postgres -o will get you an error.)
>
>   postgres -p is now postgres -y (internal use only)
>
>   postmaster -S now sets work_mem, like postgres -S does. The (deprecated) silent mode can be obtained using the long option --silent-mode=on.
>
>   postmaster -s is now postmaster -T (expert/hacker use only)
>
> For the options -O, -P, and -W I have added three new GUC variables allow_system_table_mods (PGC_SIGHUP), ignore_system_indexes (PGC_BACKEND), connection_startup_delay (PGC_BACKEND); mostly to simplify the postmaster->postgres communication.
>
> Except for a few odd exceptions, all command line arguments now map to setting a GUC variable.
>
> Comments?

--
  Bruce Momjian                          |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us                 |  (610) 359-1001
  +  If your life is a hard drive,       |  13 Roberts Road
  +  Christ can be your backup.          |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] nicer error out in initdb?
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> If we find at the bottom of test_config_settings() that we have not been able to run successfully at all (i.e. status != 0 at about line 1183 of initdb.c) is there any point in continuing? Don't we know that we are bound to fail at the template1 creation stage? Maybe we should just exit nicely when we do know this.

Nope: since we've been suppressing stderr, "exit nicely" here would translate as "refuse to provide any information about the problem". The existing behavior is just fine.

			regards, tom lane
Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges
On Thu, 2006-01-05 at 09:41 -0500, Stephen Frost wrote:
> * Michael Paesold ([EMAIL PROTECTED]) wrote:
> > Stephen Frost wrote:
> > > I'm not a particularly big fan of this though because, while I'd like to be able to give TRUNCATE permissions I'm not a big fan of SET RELIABILITY because it would affect PITR backups.
> >
> > As far as I have understood the discussion... with WAL archiving turned on, the whole RELIABILITY changes would be no-ops, no? Just as the CTAS optimization etc. only skip WAL if WAL archiving is turned off.
>
> Oh, I thought the reliability bit would bypass WAL even with archiving turned on (which could be fine in some cases, just not all cases :). Of

It might be better if this was a setting in postgresql.conf requiring a restart to change, off by default. I don't like the thought of a table owner or even a super-user being able to throw away data because they failed to investigate the full impact of the backup strategy.

I.e. someone missed the memo that backups were changing from pg_dumps to PITR for database environment H.

--
[HACKERS] nicer error out in initdb?
If we find at the bottom of test_config_settings() that we have not been able to run successfully at all (i.e. status != 0 at about line 1183 of initdb.c) is there any point in continuing? Don't we know that we are bound to fail at the template1 creation stage? Maybe we should just exit nicely when we do know this.

cheers

andrew
Re: [HACKERS] Heads up: upcoming back-branch re-releases
Thanks. Updated.

---------------------------------------------------------------------------

Marko Kreen wrote:
> +Fix bug in /contrib/pgcrypto Openwall
> +gen_salt processing (Marko Kreen)
>
> I guess it should be bit more explicit:
>
>   Fix bug in /contrib/pgcrypto gen_salt, which caused it not to use all
>   available salt space for md5 and xdes algorithms (Marko Kreen, Solar Designer)
>   Salts for blowfish and standard des are unaffected
>
> This hopefully makes it clear who is affected and how important is to upgrade. Also the 'Openwall' is confusing, better credit fix author.
>
> --
> marko

--
  Bruce Momjian                          |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us                 |  (610) 359-1001
  +  If your life is a hard drive,       |  13 Roberts Road
  +  Christ can be your backup.          |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus writes:
> > These statements are at odds with my admittedly basic understanding of statistics. Isn't the power of a sample more related to the absolute size of the sample than the sample as fraction of the population? Why not just pick a smallish sample size, say about 3000, and apply it to all the tables, even the ones with just a single row (modify appropriately from block sampling).
>
> Nope, it's definitely proportional. As a simple example, a sample of 500 rows in a table of 1000 rows should yeild stats estimates with 90%+ accuracy. But a sample of 500 rows in a 600,000,000 row table is so small as to be nearly useless; it's quite possible to get all the same value in a random sample of < 0.1% even on a column with a D/N of 0.001. If you look at the papers cited, almost all researchers more recent than Chaudhuri use a proportional sample size.

To be clear, Josh is talking specifically about the estimate of how many distinct values a query will see, not the more usual estimate of how many records the query will see.

For estimating how many records a query like

	SELECT * FROM tab WHERE x BETWEEN ? AND ?

will see, the fixed-size sample is on fairly solid ground. A sample of 600 gives (iirc) +/- 2% 19 times out of 20. That's the same sample size most major opinion polls use...

However this same logic doesn't work for estimating distinct values. Since a single occurrence of a distinct value is just as important as hundreds of occurrences, and your chances of finding that single occurrence are proportional to the percentage of the table you sample, to maintain a given accuracy you're going to have to sample a fixed percentage of the overall table.

Worse, my recollection from the paper I mentioned earlier was that sampling small percentages like 3-5% didn't get you an acceptable accuracy. Before you got anything reliable you found you were sampling very large percentages of the table. And note that if you have to sample anything over 10-20% you may as well just read the whole table. Random access reads are that much slower.

-- greg
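(To put a rough number on that: if a value occurs exactly once in a 600,000,000-row table, a uniform 600-row sample has only a 600/600,000,000 = 0.0001% chance of including it at all, which is why a fixed-size sample that is fine for histograms says almost nothing about n-distinct.)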
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Josh Berkus writes:
> > Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. If you read the entire block and not appropriate samples then you'll introduce systematic sampling errors. For example, if you read an entire block you'll be biasing towards smaller records.
>
> Did you read any of the papers on block-based sampling? These sorts of issues are specifically addressed in the algorithms.

We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that feature.

-- greg
Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges
* Michael Paesold ([EMAIL PROTECTED]) wrote:
> Stephen Frost wrote:
> > I'm not a particularly big fan of this though because, while I'd like to be able to give TRUNCATE permissions I'm not a big fan of SET RELIABILITY because it would affect PITR backups.
>
> As far as I have understood the discussion... with WAL archiving turned on, the whole RELIABILITY changes would be no-ops, no? Just as the CTAS optimization etc. only skip WAL if WAL archiving is turned off.

Oh, I thought the reliability bit would bypass WAL even with archiving turned on (which could be fine in some cases, just not all cases :). Of course, all of this is still up in the air somewhat. :)

If it's a no-op in that case then having the 'bypass' bit control SET RELIABILITY might be alright. I'd rather have the flexibility to have them be separately grantable though.

	Thanks,

		Stephen
Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges
Stephen Frost wrote:
> I'm not a particularly big fan of this though because, while I'd like to be able to give TRUNCATE permissions I'm not a big fan of SET RELIABILITY because it would affect PITR backups.

As far as I have understood the discussion... with WAL archiving turned on, the whole RELIABILITY changes would be no-ops, no? Just as the CTAS optimization etc. only skip WAL if WAL archiving is turned off.

Best Regards,
Michael Paesold
Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges
Tom, et al,

Sorry for the longish email; if you're most interested in a change to the ACL system to allow more privileges then skip to the bottom, where I worked up a change to give us more options without much of a performance impact (I don't think anyway). Personally, I'd prefer that to overloading the bits we have (except perhaps for vacuum/analyze).

* Tom Lane ([EMAIL PROTECTED]) wrote:
> Stephen Frost <[EMAIL PROTECTED]> writes:
> > GRANT ADD COLUMN, etc, aren't maintenance commands, they're DDL statements and as such should be the purview of the owner. TRUNCATE, VACUUM and ANALYZE are DML commands and are commands a user of the table would use through the normal course of inserting, updating or deleteing data in the table.
>
> I find this reasoning fairly dubious. In particular, it's hard to argue that there is no DDL component to TRUNCATE when it effectively does an implicit disable-triggers operation. Another thing setting TRUNCATE apart from run-of-the-mill DDL operations is that it inherently violates MVCC rules (by deleting rows that should still be visible to concurrent transactions).

Kind of makes one wish you could know what tables were going to be touched for a given transaction at the start of the transaction. That's not really here nor there tho. I could see limiting truncate privileges to tables which don't have on-delete triggers; that doesn't help with the MVCC problem though, and ends up being why we can't just use 'delete' privileges for it.

Could we base vacuum and analyze rights off of other privileges though? Vacuum allowed if you have 'delete' privileges, analyze if you have 'insert', 'update', or 'delete'? And for 'truncate' permissions...

> But my real problem with the approach is that I don't see where it stops. If you're allowed to do ANALYZE, why not ALTER TABLE SET STATISTICS? If you're allowed to do TRUNCATE, why not the recently-discussed ALTER TABLE SET RELIABILITY? And how about CLUSTER? All of these could be pretty useful for some applications not too far removed from yours. And there will be someone wanting a bit for DISABLE/ENABLE TRIGGER coming along right afterwards. Must we implement a separate nonstandard privilege bit for every operation that someone comes up and wants a bit for, if they have the necessary cut-and-paste skill to submit a patch for it?

I think analyze is distinct from set statistics. SET STATISTICS, if used improperly (perhaps by mistake or misunderstanding), could cause serious havoc on the system as potentially very poor plans are chosen because the statistics aren't at all correct. I don't see how running analyze a couple times would have a detrimental effect (except for the effort of running the analyze itself, but that's really not all that much, and if they want to DoS the box in that way there are other things they can do). SET STATISTICS is also hopefully something you're not having to change or do every time you add/remove/update data.

SET RELIABILITY is a more interesting question since it could be used in a situation similar to why truncate's popular (mass data loading/reloading). The same is true for disable/enable triggers.

> I'd feel happier about an approach that adds *one* privilege bit covering a range of operations that we agree to be useful. This will avoid chewing a disproportionate amount of ACL storage space, and it will force us to confront the decision about which operations are out as well as which are in.

If we modify VACUUM/ANALYZE to be based off what I suggested above, then we could add just one 'BYPASS' permission bit which would allow TRUNCATE right now and then SET RELIABILITY, and DISABLE/ENABLE TRIGGER later. I'm not a particularly big fan of this though because, while I'd like to be able to give TRUNCATE permissions I'm not a big fan of SET RELIABILITY because it would affect PITR backups. I suppose a user would still have to intentionally do that though, and so they'd have only themselves to blame if the data they loaded wasn't part of the backup. DISABLE/ENABLE TRIGGER has a similar issue though because that could probably be used to bypass CHECK and REFERENCES constraints, which I wouldn't want to allow. A BYPASS bit would be better than having to give ownership rights though.

We could also look at restructuring the ACL system to be able to handle more permissions better. As was suggested elsewhere, one option would be to have a separate set of ACLs (at least internally) such that a given set of commands/permissions would be associated with one set or the other set (and hopefully rarely, both). These could be divided up by either level or frequency (which I think would actually result in the same set). Level would be row/table/database, frequency would be estimation of usage in the real world. This would separate SELECT, INSERT, UPDATE, DELETE into one set of permissions, and then REFERENCES, RULE, TRIGGER, TRUNCATE, SE
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
> > Do you *really* want the median estimate in these case? Are you certain you do not want something with the opposite behavior of Chaudhuri's estimate so that for small sample sizes the bias is toward a high estimate of D? (Converges on D from the right instead of the left.)
> >
> >     Chaudhuri's   <-D-->   needed
> >     Estimate               estimate
>
> Hmmm. Yeah, I see what you mean. True, the ideal approach would to deterime for each query operation whether a too-low D or a too-high D was more risky, and then use the more conservative number. However, that would complicate the query planner enough that I think Tom would leave us. :-p

You could have some specific functions vote themselves out if their cost is shaky. We know that the cost of a miscalculated nestloop is huge, so after calculating the common case it might apply a multiplier for the "risk" involved.

There have been lots of requests for a way to achieve more consistent plans that have a determined worst case performance, even if they never perform as well in the best case as another algorithm might. Perhaps this could be a GUC.

	PlanCost + PlanCost * Risk * RiskGUC

"Risk" is a number that indicates how badly things can go wrong. "RiskGUC" is an integer multiplier. Someone who is risk averse (wants a predictable execution time rather than the best possible time) would set this value high. Others who want the best possible plan in most cases, even if it performs poorly once in a while, will set the value very low, possibly 0.

--
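(For illustration, a tiny hypothetical sketch of the formula above; the type, field, and parameter names are inventions for the example, not planner code:)

    /* Expected cost plus a penalty for how badly the estimate could miss. */
    typedef struct PlanCostEstimate
    {
        double      base_cost;      /* expected cost of the plan */
        double      risk;           /* how badly things can go wrong */
    } PlanCostEstimate;

    static double
    risk_adjusted_cost(const PlanCostEstimate *p, double risk_guc)
    {
        /* PlanCost + PlanCost * Risk * RiskGUC, per the proposal above */
        return p->base_cost + p->base_cost * p->risk * risk_guc;
    }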
Re: [HACKERS] Heads up: upcoming back-branch re-releases
+Fix bug in /contrib/pgcrypto Openwall
+gen_salt processing (Marko Kreen)

I guess it should be bit more explicit:

  Fix bug in /contrib/pgcrypto gen_salt, which caused it not to use all
  available salt space for md5 and xdes algorithms (Marko Kreen, Solar Designer)
  Salts for blowfish and standard des are unaffected

This hopefully makes it clear who is affected and how important is to upgrade. Also the 'Openwall' is confusing, better credit fix author.

--
marko
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
>
> > The approach I suggested uses the existing technique for selecting random blocks, then either an exhaustive check on all of the rows in a block or the existing random row approach, depending upon available memory. We need to check all of the rows in a reasonable sample of blocks otherwise we might miss clusters of rows in large tables - which is the source of the problems identified.
> >
> > The other reason was to increase the sample size, which is a win in any form of statistics.
>
> Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. If you read the entire block and not appropriate samples then you'll introduce systematic sampling errors. For example, if you read an entire block you'll be biasing towards smaller records.

Yes, I discussed that, following Brutlag & Richardson [2000]. The bottom line is if there is no clustering, block sampling is random, which is good; if there is clustering, then you spot it, which is good.

> I think it would be useful to have a knob to increase the sample size separately from the knob for the amount of data retained in the statistics tables. Though I think you'll be disappointed and find you have to read an unreasonably large sample out of the table before you get more useful distinct estimates.

OK, I'll look at doing that.

Best Regards, Simon Riggs