Re: [HACKERS] Running PostGre on DVD
> > This has been proposed before, and always rejected. While you're
> > always welcome to provide a patch, I'm very doubtful it would be
> > accepted into the main product.
>
> The example given in this thread certainly isn't going to change
> anybody's mind. "Hi, I propose reducing everybody's security because
> my local admins insist on an utterly brain-dead security policy."

I think there is still a need for discussion in this area for typical
Windows desktop use:

1. You can run Windows without creating users at all.
2. You may be using a Windows box where you are not allowed to create a user.

Applying Unix practices to Windows is imho not really practicable. For
example, a Windows developer usually uses an account with administrative
privileges and thus cannot run "make check" from that account :-(

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] aix build question re: duplicate symbol warning
> > Hmm. pqStrerror is defined in libpgport (which is linked into the
> > backend) as well as libpq. ISTM that libpq should not be linked with
> > -Wl,-bI:../../../src/backend/postgres.imp, since it's not intended
> > to be loaded into the backend. Without having looked at the code,
> > I'm wondering if the AIX makefiles stick that option into LDFLAGS_SL
> > rather than someplace more restricted.
>
> It seems that the right place to put this is BE_DLLLIBS, which is a
> macro that probably didn't exist when the AIX support was last looked
> at. But both the Windows and Darwin ports use it now, so we may as
> well bring AIX up to speed. Would you try the attached patch and see
> if it gets rid of the warnings? Also check that regression tests and
> contrib build/regression tests still work.

Works perfectly on AIX 4.3.3 with xlc. All regression tests pass,
except horology, which shows 7 extra -infinity rows. In contrib I
tested btree_gist and dblink, which also pass.

There are a few places left in contrib with duplicate-symbol warnings,
some because of -lpgport and -bI:postgres.imp, but the main build is
now free of those warnings.

Thank you for the work, please apply.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend
Re: [HACKERS] [PERFORM] A Better External Sort?
> Now I've asked for the quickest path to detailed understanding of the
> pg IO subsystem. The goal being to get more up to speed on its coding
> details. Certainly not to annoy you or anyone else.

Basically pg does random 8k (compile-time block size) reads/writes
only. Bitmap and sequential scans read 8k blocks in order. Only WAL
does n x 8k writes with one system call. pg relies on the OS readahead
(== larger block IO) to do efficient IO.

Basically the pg scan performance should match a
dd if=file of=/dev/null bs=8k, unless CPU bound.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not match
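The block-at-a-time access pattern described above can be sketched in a few lines. This is a minimal illustration (not pg code): it reads a file front to back in fixed 8k blocks, the same IO pattern as `dd if=file of=/dev/null bs=8k`; the function name and the scratch-file demo are mine.

```python
import os
import tempfile

BLOCKSIZE = 8192  # pg's compile-time default block size

def sequential_read(path, blocksize=BLOCKSIZE):
    """Read a file front to back in fixed-size blocks, like
    `dd if=path of=/dev/null bs=8k`; returns (blocks, bytes)."""
    blocks = total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(blocksize)
            if not chunk:
                break
            blocks += 1
            total += len(chunk)
    return blocks, total

# demo on a scratch file: 4 full blocks plus a 100-byte tail
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (BLOCKSIZE * 4 + 100))
    path = f.name
print(sequential_read(path))   # (5, 32868)
os.remove(path)
```

Since pg issues such reads one block per system call, any larger-block efficiency has to come from the OS readahead underneath, which is the point of the comparison to dd.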
Re: [HACKERS] Vacuum and Transactions
> > Is it reasonable to cancel and restart the vacuum process
> > periodically (say every 12 hours) until it manages to complete the
> > work? It takes about 2 hours to do the table scan, and should get
> > in about 10 hours of index work each round.

If we started the vacuum with the indexes and remembered a lowest xid
per index, we could then vacuum the heap up to the lowest of those
xids, no? We could then also vacuum each index separately.

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] Inherited indexes.
> Another possibility is optimizing for the special case of indexing on
> a partitioning key. In this case, index values would be very
> localized to one table, so just storing the table info on each index
> page (or something similar) would work well.

If you have the partitioning key in the index and the partitions don't
overlap, it is better to create separate [unique] indexes on the
subtables. Building separate indexes per partition is usually
preferred because of:

1. performance of dropping a partition
2. smaller index for CE

Only if you need an "order by" without a sort step that spans more
than one partition do things usually get ugly. Imho the best solution
would be a merge node that merges the results of several index
accesses, avoiding a sort while still using separate indexes. Such a
merge node could probably also detect the case where accessing
partitions in a certain order still produces ordered results.

Usually you will only want the "one big unique index" when the
partitioning is not reflected in the index keys, and then (also in
other db's) such an index is usually a pain ...

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
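The merge-node idea above is just a k-way merge of already-ordered streams. A minimal sketch (the partition contents are made up; in pg this would be C executor code, not Python):

```python
import heapq

# each partition's index scan yields its keys already in order
partition_scans = [
    iter([1, 4, 9]),      # partition A
    iter([2, 3, 10]),     # partition B
    iter([5, 6, 7, 8]),   # partition C
]

# the "merge node": combine the ordered streams without a sort step
merged = list(heapq.merge(*partition_scans))
print(merged)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The merge keeps only one frontier element per partition in memory, which is why it can produce ordered output incrementally where a sort would have to consume all input first.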
Re: [HACKERS] [PERFORM] A Better External Sort?
> In my original example, a sequential scan of the 1TB of 2KB or 4KB
> records, => 250M or 500M records of data, being sorted on a binary
> value key will take ~1000x more time than reading in the ~1GB Btree I
> described that used a Key+RID (plus node pointers) representation of
> the data.

Imho you seem to ignore the final step your algorithm needs: collecting
the data rows. After you have sorted the keys, the collect step will
effectively access the tuples in random order (given a sufficiently
large key range). This random access is bad.

It effectively allows a competing algorithm to read the whole data at
least 40 times sequentially, or write the set 20 times sequentially.
(Those are the random/sequential ratios of modern disks.)

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not match
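The back-of-the-envelope behind that claim can be made explicit. All the disk numbers below are assumptions of mine (roughly the era's hardware), not measurements from the thread; with them, the random collect phase buys the competing plan even more than the 40 sequential passes quoted above:

```python
# Rough cost comparison for the "collect the rows after sorting keys" step.
seq_mb_per_s = 60.0          # assumed sequential throughput
random_ios_per_s = 120.0     # assumed random 8k IOs per second
data_gb = 1000               # 1TB of data
tuples = 250_000_000         # 250M records of ~4KB

# collecting every tuple in key order ~= one random 8k read per tuple
random_secs = tuples / random_ios_per_s

# a competing plan that re-reads the whole table sequentially N times
one_pass_secs = data_gb * 1024 / seq_mb_per_s
passes_affordable = random_secs / one_pass_secs
print(round(passes_affordable))   # 122 sequential passes in the same time
```

The exact ratio depends on the disk, but the shape of the argument does not: per-tuple random IO on a large table is orders of magnitude slower than rescanning sequentially.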
Re: [HACKERS] Call for 7.5 feature completion
> I think the main problem with switching to visual studio project
> files is maintainability. (It's not easy to get all

I think the target should be a way to auto-create those files with
gmake (maybe with mingw for configure). The format of VS6 project and
workspace files is pretty simple. It should be possible to derive them
from the makefiles and simple templates.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend
Re: [HACKERS] Pre-allocated free space for row
> > The bgwriter could "update" all columns of dead heap tuples in heap
> > pages to NULL and thus also gain free space without the need to
> > touch the indexes. The slot would stay used but it would need less
> > space.
>
> Not unless it's running a transaction (consider TOAST updates).

Ok, you could leave all toast pointers and the toast table as is.

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend
Re: [HACKERS] Pre-allocated free space for row
> > My wild guess is that deleting all index pointers for a removed
> > index is more-or-less the same cost as creating new ones for an
> > inserted/updated page.
>
> Only if you are willing to make the removal process recalculate the
> index keys from looking at the deleted tuple.

The bgwriter could "update" all columns of dead heap tuples in heap
pages to NULL and thus also gain free space without the need to touch
the indexes. The slot would stay used but it would need less space.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
Re: [HACKERS] wal_buffer tests in
> I ran a wal_buffer test series. It appears that increasing the
> wal_buffers is indeed very important for OLTP applications,
> potentially resulting in as much as a 15% average increase in
> transaction processing. What's interesting is that this is not just
> true for 8.1, it's true for 8.0.3 as well.

You will want the log writer to write up to at least 256kb per write
(at >= 256kb you reach the maximum MB/s a disk can write), if that is
possible with no commit/fsync in between. You will also want enough
buffers that can be filled while the 256kb are written. Thus a value
of at least 64 with 8k pages seems reasonable.

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
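The arithmetic behind the "at least 64" suggestion is simple enough to spell out (the doubling-for-overlap step is my reading of the "filled while written" argument above):

```python
wal_write_bytes = 256 * 1024   # target size of a single large WAL write
page = 8192                    # WAL block size (compile-time default)

# buffers needed to back one 256kb write
buffers_per_write = wal_write_bytes // page
print(buffers_per_write)       # 32

# double-buffer: leave room to fill new WAL while the previous
# batch is being written out
print(2 * buffers_per_write)   # 64 -> the suggested wal_buffers floor
```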
Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC
>>> The point here is that fsync-off is only realistic for development
>>> or playpen installations. You don't turn it off in a production
>>> machine, and I can't see that you'd turn off the full-page-write
>>> option either. So we have not solved anyone's performance problem.
>>
>> Yes, this is basically another fsync-like option that isn't for
>> production usage in most cases. Sad but true.
>
> Just to make my position perfectly clear: I don't want to see this
> option shipped in 8.1.

Why not? If your filesystem buffer size matches your pg page size, and
you have a persistent write cache, the option makes perfect sense.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC
>> Only workable solution would imho be to write the LSN to each
>> 512-byte block (not that I am advocating that idea).

"Only workable" was a stupid formulation; I meant a solution that
works with an LSN.

> We're not doing anything like that, as it would create an impossible
> space-management problem (or are you happy with limiting tuples to
> 500 bytes?).

To do it, a layer between physical storage and row working memory
would need to be inserted; of course that would add a lot of overhead.
I guess more overhead than computing a page crc.

> We still don't know enough about the situation to know what a
> solution might look like. Is the slowdown Josh is seeing due to the
> extra CPU cost of the CRCs, or the extra I/O cost, or excessive
> locking of the WAL-related data structures while we do this stuff,
> or ???. Need more data.

Yes, especially the 10 sec instead of 1 sec response times look very
suspicious.

Andreas

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC
> Here's an idea:
>
> We read the page that we would have backed up, calc the CRC and write
> a short WAL record with just the CRC, not the block. When we recover
> we re-read the database page, calc its CRC and compare it with the
> CRC from the transaction log. If they differ, we know that the page
> was torn and we know the database needs recovery. (So we calc the CRC
> when we log AND when we recover.)

Won't work, since the page on disk may have x different contents
between 2 checkpoints (bgwriter from lru). The only workable solution
would imho be to write the LSN to each 512-byte block (not that I am
advocating that idea).

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
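To make the LSN-per-sector idea concrete, here is a toy model (the layout, byte widths, and function names are all mine, purely for illustration): every 512-byte sector of an 8k page carries the page's LSN, so a write torn at a sector boundary leaves mixed LSNs that recovery can detect.

```python
SECTOR = 512
PAGE = 8192
N = PAGE // SECTOR   # 16 sectors per page

def stamp_page(payload, lsn):
    """Build an 8k page whose every 512-byte sector carries the page
    LSN in its first 8 bytes (the per-sector-LSN scheme)."""
    sectors = []
    for i in range(N):
        body = payload[i * (SECTOR - 8):(i + 1) * (SECTOR - 8)]
        sectors.append(lsn.to_bytes(8, "big") + body.ljust(SECTOR - 8, b"\0"))
    return b"".join(sectors)

def is_torn(page):
    """A page is torn if its sectors carry more than one LSN."""
    lsns = {page[i * SECTOR:i * SECTOR + 8] for i in range(N)}
    return len(lsns) > 1

old = stamp_page(b"old row data", lsn=100)
new = stamp_page(b"new row data", lsn=101)
# crash mid-write: first 5 sectors are new, the remaining 11 still old
torn = new[:SECTOR * 5] + old[SECTOR * 5:]
print(is_torn(old), is_torn(new), is_torn(torn))   # False False True
```

The space-management objection in the follow-up discussion is exactly the cost visible here: 8 of every 512 bytes vanish into the stamp, and tuples can no longer span the stamp boundaries without an extra remapping layer.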
Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC
>> Are you sure about that? That would probably be the normal case, but
>> are you promised that the hardware will write all of the sectors of
>> a block in order?
>
> I don't think you can possibly assume that. If the block crosses a
> cylinder boundary then it's certainly an unsafe assumption, and even
> within a cylinder (no seek required) I'm pretty sure that disk drives
> have understood "write the next sector that passes under the heads"
> for decades.

A lot of hardware exists that guards against partial writes of single
IO requests (a persistent write cache for an HP raid controller for
intel servers costs ~$500 extra). But the OS usually has a 4k (some
8k) filesystem buffer size, and since we do not use direct io for
datafiles, the OS might decide to schedule the two 4k writes for one
8k page differently. If you do not build pg to match your fs buffer
size, you cannot guard against partial writes with hardware :-(

We could alleviate that problem with direct io for datafiles.

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] NOLOGGING option, or ?
> Escape processing would proceed as before, but the semantics would
> change to allow the use of different characters as the escape
> character, in addition to the special characters for delimiter and
> newline.

If you mean syntax to specify escape and delimiter (and newline?),
that is a great addition that imho everyone would like.

> Also, escape processing would be "false" as the default, so that the
> only special characters by default would be the newline and delimiter
> characters.

I don't see how that would be any advantage. What is so wrong about
having syntax to choose no escape processing, like "escape ''"?

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] Tablespace-level Block Size Definitions
> The problem I see with this proposal is that the buffer manager knows
> how to handle only equally-sized pages. And the shared memory stuff
> gets sized according to size * num_pages. So what happens if a
> certain tablespace A with pagesize=X gets to have a lot of its pages
> cached, evicting pages from tablespace B with pagesize=Y, where
> Y < X?

You could create a separate bufferpool per page size. Of course that
has other disadvantages. Is it really so difficult to create and
attach another shmem segment?

Andreas

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] compiling postgres with Visual Age compiler on OpenPower5 / Linux
> I have not researched any deeper, but currently it fails with
>
> [EMAIL PROTECTED] postgresql-8.0.3]$
> CC=/opt/ibmcmp/vac/7.0/bin/xlc ./configure
> ..A..
> checking for int timezone... yes
> checking types of arguments for accept()... configure: error:
> could not determine argument types

The odds are very high that it instead depends on the system header
file version, or the order of inclusion (e.g. stdio.h before
config.h). I would need to see the snippet of configure output,
though. IIRC there was a fix for this or a similar case; has this been
reverted?

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if
your joining column's datatypes do not match
Re: [HACKERS] Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)
> But to get the estimated cost ratio to match up with the actual cost
> ratio, we'd have to raise random_page_cost to nearly 70, which is a
> bit hard to credit. What was the platform being tested here?

Why? Numbers for modern single disks are 1-2 MB/s for random 8k IO and
50-120 MB/s sequential.

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
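Taking the midpoints of the disk numbers quoted above (my choice of midpoints, not a measurement), the implied per-page cost ratio lands in the same ballpark as the "nearly 70" figure being questioned:

```python
# Implied random:sequential cost ratio from the quoted raw disk numbers.
block = 8192
seq_mb_per_s = 80.0      # mid-range of the 50-120 MB/s quoted
rand_mb_per_s = 1.5      # mid-range of the 1-2 MB/s quoted

seq_pages_per_s = seq_mb_per_s * 1024 * 1024 / block
rand_pages_per_s = rand_mb_per_s * 1024 * 1024 / block
print(round(seq_pages_per_s / rand_pages_per_s, 1))   # 53.3
```

With the pessimistic ends of both ranges (50 MB/s sequential, 1 MB/s random) the ratio reaches 50-120, so a random_page_cost near 70 is entirely plausible for a raw single disk with no caching.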
Re: [HACKERS] Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)
> >Incrementing random_page_cost from 4 (the default) to 5 causes the
> >planner to make a better decision.
>
> We have such a low default random_page_cost primarily to mask other
> problems in the optimizer, two of which are
>
> . multi-column index correlation
>
> . interpolation between min_IO_Cost and max_IO_cost which
> approximates max_IO_cost too fast.

One other important figure here is concurrency. If you have a lot of
backends concurrently doing other IO, your sequential IO numbers will
suffer more than your random IO numbers. It might be that some super
smart OS readahead implementation alleviates that problem, but I have
not yet experienced one. Random IO also tends to see higher cache-hit
rates.

Thus I think that if you are alone, 4 tends to be too low, while with
concurrent load 4 tends to be too high. (All assuming fairly large
tables that don't fit into RAM.)

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
Re: [HACKERS] DO INSTEAD and conditional rules
> > If we did not define it that way, I think your example would have
> > to error out --- how would you choose which INSTEAD rule wins?
>
> The documentation says that they evaluate in alphabetical order by
> name. So I would expect that the first one to have its WHERE
> statement evaluate to true would short-circuit the execution of the
> remaining rules.

Why not simply write non-overlapping WHERE clauses? Imho not executing
them all would be very counterintuitive.

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] locks in CREATE TRIGGER, ADD FK
> > It would keep the old table around while building the new, then
> > grab an exclusive lock to swap the two.
>
> Lock upgrading is right out.

It would need a whole new family of "intent" locks, with different
rules.

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
Re: [HACKERS] We are not following the spec for HAVING without GROUP BY
> What that means is that neither the HAVING clause nor the targetlist
> can use any ungrouped columns except within aggregate calls; that is,
>
> select col from tab having 2>1

Informix:

select tabname from systables having 2 > 1;
294: The column (tabname) must be in the GROUP BY list.

select tabname from systables group by 1 having 2 > 1;
all rows returned

select tabname from systables group by 1 having 1 > 2;
no rows found

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
Re: [pgsql-hackers-win32] [HACKERS] win32 performance - fsync question
> >Are you verifying that all the data that was committed was actually
> >stored? Or just verifying that the database works properly after
> >rebooting?
>
> I verified the data.

Does pg startup increase the xid by some amount (say 1000 xids) after
a crash? Otherwise I think you would also need to roll back a range of
xids after the crash, to see whether you lose data by reusing and
rolling back xids.

The risk is datapages reaching the disk before WAL, because the disk
reorders writes. I think you would not notice such corruption (with
pg_dump) unless you do the range rollback.

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if
your joining column's datatypes do not match
Re: [HACKERS] win32 performance - fsync question
> >> One point that I no longer recall the reasoning behind is that
> >> xlog.c doesn't think O_SYNC is a preferable default over fsync.
> >
> >For larger (>8k) transactions O_SYNC|O_DIRECT is only good with the
> >recent pending patch to group WAL writes together. The fsync method
> >gives the OS a chance to do the grouping. (Of course it does not
> >matter if you have small tx < 8k WAL.)
>
> This would be true for fdatasync() but not for fsync(), I think.

No, it is only worse with fsync, since that adds a mandatory seek.

> On win32 (which started this discussion), fsync will sync the
> directory entry as well, which will lead to *at least* two seeks on
> the disk. Writing two blocks after each other to an O_SYNC opened
> file should give exactly two seeks.

I think you are making the following untenable assumptions:

1. there is no other outstanding IO on that drive that the OS happily
   inserts between your two 8k writes
2. the rotational delay is negligible
3. the per-call overhead is negligible

You will at least wait until the heads reach the write position again,
since you will not be able to supply the next 8k in time for the drive
to continue writing (with the single-backend large tx I was referring
to).

If you doubt what I am saying, do dd blocksize tests on a raw device.
The results are that up to ~256kb blocksize you can increase the drive
performance on a drive that does not have a powerfail-safe cache and
does not lie about write success.

Andreas

---(end of broadcast)---
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
Re: [HACKERS] win32 performance - fsync question
> One point that I no longer recall the reasoning behind is that xlog.c
> doesn't think O_SYNC is a preferable default over fsync.

For larger (>8k) transactions O_SYNC|O_DIRECT is only good with the
recent pending patch to group WAL writes together. The fsync method
gives the OS a chance to do the grouping. (Of course it does not
matter if you have small tx < 8k WAL.)

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Design notes for BufMgrLock rewrite
> Would there be any value in incrementing by 2 for index accesses and
> 1 for seq-scans/vacuums? Actually, it should probably be a ratio
> based on random_page_cost, shouldn't it?

What happens with very small hot tables that are only a few pages and
thus have no index defined? I think it would not be good to treat such
data pages as less important than index pages.

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if
your joining column's datatypes do not match
Re: [HACKERS] enforcing a plan (in brief)
> And the user maintenance of updating those hints for every release of
> PostgreSQL as we improve the database engine.

I don't think so. Basically an optimizer hint simply raises or lowers
the cost of an index, mandates a certain join order, allows or
disallows a seq scan ... Imho it is not so far from the things people
currently do with the "set enable_seqscan=" type of commands. (I don't
think actually giving a certain plan is a good idea.)

A good optimizer hint system would imho not circumvent the optimizer,
but only give it hints. The hints should be very specifically aimed,
like "an index on column x is going to be more expensive than you (the
optimizer) think, if used with this query", like:

select /*+ avoid_index(atab atab_x0) */ * from atab ...

> > The people who are actually doing the work think their time is more
> > usefully spent on improving the planner's intelligence than on
> > devising ways to override it.

The subject of this mail and "override it" imho go too far; I would
like to be able to give advice in the form of hints to the optimizer.

> One consistent problem is the planner not being able to handle this
> or that scenario. At this stage, the *best* way to improve the
> planner is to add the ability to place hints in the sql statement.

I agree.

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] New form of index "persistent reference"
> I asked the question how do you get a record without going through an
> index, the answer was CTID, which unfortunately changes when the row
> is updated.

The ctid is the physical location of the row. On update a new tuple is
written in a new location; that is why the ctid changes. The old tuple
has a system field t_ctid which is then a forward pointer to the new
tuple. Thus you can follow that chain until the visible tuple is
found.

The current "tid = tid" comparison does not do that (I think because
the ODBC driver, which was the first to use it (for result set
modification), needed to notice when the tuple was updated
underneath). But you can use:

select * from atab where ctid = currtid2('atab', '(0,1)');
-- '(0,1)' is the old ctid

Of course the old ctid is only valid until (auto)vacuum marks it free.
Without vacuum you are currently safe to use the currtid functions.

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
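The t_ctid chain-following described above can be modeled in a few lines. This is a toy (the heap contents and function name are invented; real pg does this in C against heap pages): each tuple version records where its successor went, and the current version points at itself.

```python
# Toy tuple-version chain, keyed by ctid = (block, offset).
heap = {
    (0, 1): {"data": "v1", "t_ctid": (0, 7)},   # updated, forwards to (0,7)
    (0, 7): {"data": "v2", "t_ctid": (2, 3)},   # updated again
    (2, 3): {"data": "v3", "t_ctid": (2, 3)},   # points to itself: current
}

def follow_ctid_chain(tid):
    """Walk t_ctid forward pointers until a tuple points at itself,
    i.e. the current version -- the behavior currtid2() provides."""
    while heap[tid]["t_ctid"] != tid:
        tid = heap[tid]["t_ctid"]
    return tid

print(follow_ctid_chain((0, 1)))   # (2, 3)
```

As the mail notes, the chain is only safe to walk while vacuum has not reclaimed the intermediate slots; once (0,1) or (0,7) is reused, the pointer no longer means anything.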
Re: [HACKERS] ARC patent
> > There's a very recent paper at
> > http://carmen.cs.uiuc.edu/~zchen9/paper/TPDS-final.ps on an
> > alternative to ARC which claims superior performance ...
>
> From a quick glance, this doesn't look applicable. The authors are
> discussing buffer replacement strategies for a multi-level cache
> hierarchy (e.g. they would call the DBMS buffer cache "L1", and the

Yes, it might not matter, however. Another algorithm that was written
by university folk (thus probably not patent-encumbered) and that
looks promising is:

http://www.cs.wm.edu/hpcs/WWW/HTML/publications/papers/TR-02-6.pdf
http://parapet.ee.princeton.edu/~sigm2002/papers/p31-jiang.pdf
(same, but better typeset)

It even seems to slightly beat ARC according to the MQ paper.

Andreas

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] ARC patent
> >> FYI, IBM has applied for a patent on ARC (AFAICS the patent
> >> application is still pending, although the USPTO site is a little
> >> hard to grok):
> >>
> >> http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220040098541%22.PGNR.&OS=DN/20040098541&RS=DN/20040098541
> >
> > Ugh. We could hope that the patent wouldn't be granted, but I think
> > it unlikely, unless Jan is aware of prior art (like a publication
> > predating the filing date). I fear we'll have to change or remove
> > that code.
> >
> > regards, tom lane
>
> Unfortunately no. The document that inspired me to adapt ARC for
> PostgreSQL is from the USENIX File & Storage Technologies Conference
> (FAST), March 31, 2003, San Francisco, CA.
>
> I am seriously concerned about this and think we should not knowingly
> release code that is possibly infringing a patent.

I thought IBM granted the right to use these methods in OSS software.
PostgreSQL is OSS software, thus only entities relicensing pg would
need to worry about the patent. Also, the algorithm is probably
already sufficiently altered to not be subject to the patent, no?

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] RC2 and open issues
> If we don't start where we left off, I am thinking if you do a lot of
> writes then do nothing, the next checkpoint would be huge because a
> lot of the LRU will be dirty because the bgwriter never got to it.

I think the problem is that we don't see whether a "read hot" page is
also "write hot". We would want to write dirty "read hot" pages, but
not "write hot" pages. It does not make sense to write a "write hot"
page, since it will be dirty again when the checkpoint comes.

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if
your joining column's datatypes do not match
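The read-hot/write-hot distinction above can be sketched as a filter over per-page access counters. Everything here is hypothetical (the counters, the threshold, the function name); it only illustrates the selection rule being proposed, not anything the bgwriter actually tracks:

```python
# Pages with (made-up) read/write access counts since the last checkpoint.
pages = {
    "A": {"dirty": True,  "reads": 90, "writes": 1},    # read hot, write cold
    "B": {"dirty": True,  "reads": 80, "writes": 70},   # write hot too
    "C": {"dirty": False, "reads": 99, "writes": 0},    # clean
    "D": {"dirty": True,  "reads": 2,  "writes": 0},    # cold
}

WRITE_HOT = 10  # assumed threshold

def bgwriter_candidates(pages):
    """Dirty pages worth writing ahead of the checkpoint: skip
    write-hot pages, since they'd just be dirty again anyway."""
    return sorted(name for name, p in pages.items()
                  if p["dirty"] and p["writes"] < WRITE_HOT)

print(bgwriter_candidates(pages))   # ['A', 'D']
```

Page B stays unwritten under this rule: flushing it early buys nothing because new writes re-dirty it before the checkpoint arrives.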
Re: [HACKERS] Shared row locking
> In general, I agree with Tom: I haven't seen many programs that use
> extended SELECT FOR UPDATE logic. However, the ones I have seen have
> been batch style programs written using a whole-table cursor - these
> latter ones have been designed for the cursor stability approach.

I think if we add shared locks we should by default behave like the
cursor stability isolation level, which only holds one shared lock for
the current cursor row. The semantics are well defined in SQL. If you
want repeatable read you need to change the isolation level.

I know FK checks will need to keep the locks, but I would special-case
that.

Andreas

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] [Testperf-general] BufferSync and bgwriter
> Hmmm, I've not seen this. For example, with people who are having
> trouble with checkpoint spikes on Linux, I've taken to recommending
> that they call sync() (via cron) every 5-10 seconds (thanks, Bruce,
> for the suggestion!). Believe it or not, this does help smooth out
> the spikes and give better overall performance in a many-small-writes
> situation.

The reason is imho that the checkpoint otherwise also syncs all other
writes. These can be writes other backends had to do to replace a
buffer. Linux obviously lacks a mechanism to distribute the IO for
cached writes over time a la bgwriter (or does not do it when already
faced with an IO bottleneck).

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if
your joining column's datatypes do not match
Re: [HACKERS] bgwriter changes
> > Only if you redefine the meaning of bgwriter_percent. At present
> > it's defined by reference to the total number of dirty pages, and
> > that can't be known without collecting them all.
> >
> > If it were, say, a percentage of the total length of the T1/T2
> > lists, then we'd have some chance of stopping the scan early.
>
> The other way around would make sense. In order to avoid writing the
> busiest buffers at all (except for checkpointing), the parameter
> should mean "don't scan the last x% of the queue at all".

Your meaning is (1 - the above meaning), at least as Tom and I meant
it, but it is probably easier to understand (== Informix
LRU_MIN_DIRTY).

> Still, we need to avoid scanning over all the clean blocks of a large
> buffer pool, so there is need for a separate dirty-LRU.

Maybe a "may be dirty" bitmap would be easier to do without being
deadlock-prone?

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] bgwriter changes
> The two alternative algorithms are similar, but have these
> differences: The former (option (2)) finds a constant number of dirty
> pages, though has varying search time.

This has the disadvantage of converging against 0 dirty pages. A
system that has fewer than maxpages dirty will write every page with
every bgwriter run.

> The latter (option (3)) has constant search time, yet finds a varying
> number of dirty pages.

This might have the disadvantage of either leaving too much for the
checkpoint or writing too many dirty pages in one run. Is writing a
lot in one run actually a problem, though? Or does the bgwriter pause
periodically while writing the pages of one run?

If this is expressed in pages, it would naturally need to be more than
the current maxpages (to accommodate the clean pages). The suggested
2% sounded way too low to me (that leaves 98% to the checkpoint).

Also, I think we are doing too-frequent checkpoints with the bgwriter
in place. Every 15-30 minutes should be sufficient, even for
benchmarks. We need a tuned bgwriter for this, though.

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] bgwriter changes
> > > and stops early when either maxpages is reached or
> > > bgwriter_percent pages are scanned ?
> >
> > Only if you redefine the meaning of bgwriter_percent. At present
> > it's defined by reference to the total number of dirty pages, and
> > that can't be known without collecting them all.
> >
> > If it were, say, a percentage of the total length of the T1/T2
> > lists, then we'd have some chance of stopping the scan early.
>
> ...which was exactly what was proposed for option (3).

But the benchmark run was with bgwriter_percent 100. I wanted to point
out that I think 100% is too much (it writes hot pages multiple times
between checkpoints). In the benchmark, the bgwriter obviously falls
behind; the delay is too long. But if you reduce the delay you will
start to see what I mean.

Actually, I think what is really needed is a max number of pages we
want dirty at checkpoint time. Since that would again require scanning
all pages, the next best definition would imho be to stop at a
percentage (or a number of pages short) of the total T1/T2. Then you
can still calculate a worst-case IO for the checkpoint (assume that
all hot pages are dirty).

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your message can
get through to the mailing list cleanly
Re: [HACKERS] bgwriter changes
> (2) Remove bgwriter_percent. I have yet to hear anyone argue that
> there's an actual need for bgwriter_percent in tuning bgwriter
> behavior,

One argument for it is to avoid writing very hot pages.

> (3) Change the meaning of bgwriter_percent, per Simon's proposal.
> Make it mean "the percentage of the buffer pool to scan, at most, to
> look for dirty buffers". I don't think this is workable, at least not
> at this

In the long run I think we want to avoid the checkpoint needing to do
a lot of writing, without writing hot pages too often. This can only
reasonably be defined with a max number of pages we want to allow
dirty at checkpoint time. bgwriter_percent comes close to this
meaning, although in this sense the value would need to be high, like
80%.

I think we do want 2 settings. Think of one as a short-term value (so
the bgwriter does not write everything in one run) and one as a
long-term target over multiple runs.

Is it possible to do a patch that produces a dirty buffer list in LRU
order and stops early when either maxpages is reached or
bgwriter_percent pages are scanned?

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
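The two-stop-condition scan being asked for in that last paragraph is easy to sketch. This is a toy model, not bgwriter code (the list representation and function name are mine); it just shows the control flow: walk the buffers in LRU order, collect dirty ones, and stop at whichever limit hits first.

```python
def collect_dirty(lru, maxpages, percent):
    """Scan the buffer list in LRU order, collecting dirty buffers,
    stopping early once either maxpages dirty buffers are found or
    `percent` of the pool has been scanned."""
    scan_limit = len(lru) * percent // 100
    dirty, scanned = [], 0
    for buf_id, is_dirty in lru:
        if scanned >= scan_limit or len(dirty) >= maxpages:
            break
        scanned += 1
        if is_dirty:
            dirty.append(buf_id)
    return dirty, scanned

# 10-buffer pool, odd-numbered buffers dirty
pool = [(i, i % 2 == 1) for i in range(10)]
print(collect_dirty(pool, maxpages=3, percent=80))   # ([1, 3, 5], 6)
```

With a generous percent the maxpages cap stops the scan (as above); with a small percent the scan-budget cap stops it first, which is exactly the "constant search time, varying dirty pages found" trade-off discussed earlier in the thread.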
Re: [HACKERS] V8 Beta 5 on AIX
>> However, one thing we can do is to try this in Makefile.aix: >> # AIX needs threads for everything that uses libpq >> LIBS += $(PTHREAD_LIBS) >> That is going to enable thread libs for all linking including the >> backend, but it might work. > That is certainly wrong. The correct thing is to add PTHREAD_LIBS to > all and only those links that include libpq. I suspect that the cc_r You mean like the attached patch ? That is safe and can be applied. My machine has problems with two configure tests with --enable-thread-safety. 1. the snprintf long long int check, which uses int64 as a variable name, which is a typedef in sys/inttypes.h 2. the "whether strerror_r returns int" check, which fails with a redefinition of the named function from string.h No idea how to fix those, but the workaround is easy :-( -lpthread can be removed from PTHREAD_LIBS; the lib is a compat symlink to -lpthreads, so I guess it does not hurt either. Andreas pg_aix_pthreads.patch Description: pg_aix_pthreads.patch ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] V8 Beta 5 on AIX
> OK, so does someone want to suggest why a library used to link libpq > would also be needed to link binaries that use libpq? And with no cc_r > it seems I have no idea how to get this working. AIX does not pull in any libraries that a shared lib depends upon, not even libc. You only specify those on the .so link line so the linker can check what is unresolved. If he does not have cc_r, then he either uses gcc or a newer compiler that does not need cc_r any more? I would generally use cc_r if it exists, or gcc, for the threaded build. The performance difference between cc and cc_r should not be noticeable. Andreas ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] [HACKERS] Open Items
> I am going to discard these emails. We haven't solved the Win32 terminal > server problem and I think it needs to be moved to the TODO list instead. Yes, please do that. I do not think there is a problem on TS other than some missing permissions. The patch was only intended to avoid starting 2 postmasters for the same datadir. The patch is not 100% correct, since I was told that my check would fail on Win95 and falsely assume that Win95 allows a \ in the shmem name. I think that would currently not matter, but it is unclean for the future. Andreas ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Stopgap solution for table-size-estimate updatingproblem
>> I think I recall that lseek may have a negative effect on some OS's >> readahead calculations (probably only systems that cannot handle an >> lseek to the next page either) ? Do you think we should cache the >> last value to avoid the syscall ? > > We really can't, since the point of doing it is to find out whether any > other backends have extended the file since we last looked. Also, IIRC Well yes. The value would need to be kept up to date by the backends that extend the file, with an interlocked increment. We would thus need to keep the count in shmem. Andreas ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Stopgap solution for table-size-estimate updatingproblem
>> This is not true in my case, since I only "update statistics"/analyze >> when the tables have representative content (i.e. not empty). > > I'm unsure why you feel you need a knob to defeat this. The only time > when the plan would change from what you think of as the hand-tuned > case is when the physical table size is greatly different from what it > was when you analyzed. Ok, understood. I just need to make sure I don't "vacuum full" in that case, which is fine anyway if I expect the table to soon grow to this size again. I think I recall that lseek may have a negative effect on some OS's readahead calculations (probably only systems that cannot handle an lseek to the next page either) ? Do you think we should cache the last value to avoid the syscall ? Andreas ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Stopgap solution for table-size-estimate updatingproblem
>> One possibility: vacuum already knows how many tuples it removed. We >> could set reltuples equal to, say, the mean of the number-of-tuples- >> after-vacuuming and the number-of-tuples-before. In a steady state >> situation this would represent a fairly reasonable choice. In cases >> where the table size has actually decreased permanently, it'd take a few >> cycles of vacuuming before reltuples converges to the new value, but that >> doesn't seem too bad. > > That sounds good to me. Covers all cases I can see from here. Yes, sounds good to me also. I think that would be a good thing even if viewed in isolation from the rest of the proposal. I am sorry if I made the impression that I don't like a change in this direction in general; I think there is need for both. I am only worried about core OLTP applications where every query is highly tuned (and a different plan is more often than not counterproductive, especially if it comes and goes without intervention). >> A standalone ANALYZE should still do what it does now, though, I think; >> namely set reltuples to its best estimate of the current value. good, imho :-) > A GUC-free solution...but yet manual control is possible. Sounds good to > me - and for you Andreas, also? It is the GUC to keep the optimizer from using the dynamic page count that I would still like to have. I especially liked Simon's name for it: enable_dynamic_statistics=true Tom wrote: >> But I am used to applications >> that prepare a query and hold the plan for days or weeks. If you happen to >> create the plan when the table is by chance empty you lost. > > You lose in either case, since this proposal doesn't change when > planning occurs or doesn't occur. This is not true in my case, since I only "update statistics"/analyze when the tables have representative content (i.e. not empty). Andreas ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Stopgap solution for table-size-estimate updating problem
> rel->pages = RelationGetNumberOfBlocks(relation); Is RelationGetNumberOfBlocks cheap enough that you can easily use it for the optimizer ? I myself have always preferred more stable estimates that only change when told to. I never liked that vacuum (without analyze) and create index change those values; imho only analyze should. But I am used to applications that prepare a query and hold the plan for days or weeks. If you happen to create the plan when the table is by chance empty you lost. But it seems I am quite alone with that opinion. > A variant of this is to set reltuples = density, relpages = 1 instead > of 0, which makes the relpages value a lie but would be even less likely > to confuse client-side code. I would let analyze still fill in the correct values. The 0-or-1-page previous density could be used only when the table is empty. I don't think either variant would confuse client-side code more or less. More confusing for client-side code would be if the two would not show the analyze results any more. I would like to see this configurable. Like "use_analyzed_stats" or "enable_online_statistics" or some such, which could also be used for more future online statistics. One setting would use: pages = RelationGetNumberOfBlocks(relation); count = pages * reltuples / relpages. The other: pages = relpages; count = relcount. Andreas ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Can postgresql accept mutliple connections in the same instance?
>> I am running postgresql database servers with generally 30-50 users >> at a time per server. I have noticed one thing for web based databases >> that they fail to initialise a pg_connection connection every now and >> again and return no error message at all. I am thinking of PG_SOMAXCONN, the maximum number of pending connects in the queue of the listen socket. That would only be visible to the client app, and would mainly happen when load is high, so that postmaster cannot fork new backends fast enough. And if the client does not properly report ... IIRC some systems have a max of 5. Andreas ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pg_resetxlog options
>> Some other time maybe. Meanwhile, this patch ought to make it compile >> more cleanly on Windows - not sure why I get errors there but not >> Linux. > > Because getopt() is normally declared in unistd.h, not getopt.h (Windows > being an exception?). getopt is not in any standard Windows headers. The getopt.h header is from mingw to assist porting (I don't know why they didn't put it in unistd.h?). Andreas ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] psql \e broken again
>> I like Kevin Brown's suggestion of writing out a temporary .txt file and >> 'executing' it. It will follow the principle of least suprise for Windows >> users. > > Note that the current default behavior (assuming you've not set EDITOR) > is "vi foo.txt" which is quite unlikely to be helpful to a Windows user. works for me :-), but agreed, not a good default. > I'm not sure we need to do the "execute a textfile" bit, but at the very I don't think we need that either. And also imho the obvious extension would be .sql, and that might not be registered, or already registered for some other, not really wanted, SQL tool. I think we need our own editor setting; the Windows way would be a private registry setting, but using EDITOR is imho just as good. > least DEFAULT_EDITOR should probably be "notepad.exe" on Windows. Yes that would be a sane default on Windows. Andreas ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] psql \e broken again
>> So I thought I'd try a few scenarios, since I have it installed: >> [ none of which work ] >> So to answer your question, at least in part, there is no current good >> behavior to emulate. At least on this version of CVS: > I think this is fairly graphic proof that (1) a straight port without > doing anything special is essentially nonfunctional on Windows, and that > (2) there isn't any easy way for the user to fix it by inserting quotes > for himself. I am not sure the test is valid, since %EDITOR% was used on the command line, which does its own magic on quotes. Is there a command that would use the envvar EDITOR without putting it on the commandline ? We are talking about directly using the envvar from inside the program, no? Andreas ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] psql \e broken again
> We could maybe compromise with (again, for Windows only) a policy like > "double-quote unless the value already contains at least one > double-quote". This should work conveniently for everybody except the > user who wants EDITOR='gvim.exe -y' on Windows; he'll have to add > some useless-looking double quotes a la EDITOR='"gvim.exe" -y'. Single quotes don't work on Windows (tried with %EDITOR% x.txt). I think we should not worry too much, since the user could use the "good old" EDITOR=C:/PROGRA~1/vim/vim63/gvim.exe -y syntax on Windows if all else fails. Andreas ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] psql \e broken again
>> The EDITOR variable seems to have a fairly standard meaning on Unix >> systems. I've been using that EDITOR value for years without problems, >> only when I use psql's \e once in a while it breaks. I don't think we >> should deviate from what seems to be a standard practice. > > Agreed, no quotes on Unix. I'm just wondering what to do on Windows. I would require the user to add quotes around his executable if it has spaces. set EDITOR="C:/Program Files/vim/vim63/gvim.exe" -y Andreas ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] MAX/MIN optimization via rewrite (plus query rewrites generally)
> How are you planning to represent the association between MIN/MAX and > particular index orderings in the system catalogs? Don't we already have that info to decide whether an index handles an "ORDER BY" without a sort node ? Andreas ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] Open Items
> Just one question about the actual implementation of the patch - why are > you setting the OS version *before* you call GetVersionEx()? The Microsoft example did a memset on the structure before calling GetVersionEx(). Setting it to a version that needs the Global\ prefix is only a safeguard against a failing call. I guess the call cannot fail, and thus setting it before is not really needed. > It'll just get overwritten... Yes, if the call does not fail to work. Andreas ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PATCHES] [HACKERS] Open Items
> > > o fix shared memory on Win2k terminal server > > > > > > We might be able to just mark this as not supported. The shmem code works in a terminal server session with or without the patch. Magnus had a different problem, probably permissions. Since I do not have a non-admin user (on a TS server) I disabled the win32_is_admin check. It is my opinion that we should allow pg to run as Admin on Windows, at least with an override option. Services that run under a specified user are a headache on Win32, because you need to store a password, and a lot of systems only have one user. > > I have attached a patch that I think fixes this. The problem I saw > > and fixed is, that the shmem created in a terminal services client is not > > visible to the console (or services.msc). The decision is simple: if we want the shmem to be global on the machine, we need the patch. I think we want that, but cannot give an authoritative answer. > > I think in addition the system global name "sharemem.1" should be made more > > pg specific, like "PostgreSQL.1". I have not done this since a new compile > > would not detect a running old beta. But now would be the time (or never). I think all agreed that this is good. Andreas ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PATCHES] [HACKERS] Open Items
> It makes no difference on any of my systems, so at least it doesn't > completely solve the problem. I haven't heard any > confirmation on whether > it partially solves it. It certainly does not solve any part of your problem. I think your problem is a permissions problem. It does however make sure the shmem is global, which I think we need for detecting certain parallel-startup (still running backend) situations. Andreas ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] rmtree() failure on Windows
> >>> 300 secs (!) fs timeout is really broken. > >>> Looks more like a locking or network timeout issue. > >>> What error codes does unlink(3) return? > >> > > > >> success. > > > > > > Oops! 5min timeout for success is certainly problematic. > > > > > > You misunderstood. The 300 secs is not in waiting for unlink() to > return, it is in waiting for its effects to be seen by > rmdir() (i.e. for because the bgwriter obviously keeps them open, no ? Andreas ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] Open Items
> >>o fix shared memory on Win2k terminal server > >> > >>We might be able to just mark this as not supported. > > > >I have attached a patch that I think fixes this. The problem I saw > >and fixed is, that the shmem created in a terminal services client is not > >visible to the console (or services.msc). > > Does this actually fix the problem for you? > Because, as I have previously posted I think, it does *not* solve the > problem on any of my test machines. I still get the shmget() error > message when running from a TS session. I think you are having another problem. I can create the shmem segment here (with or without the patch). I am running W2000 5.00.2195 SP4. Maybe you are having a permissions problem? I am using a user with near-Administrator privs. > Also, I don't really see how the visibility of the shmem segment > matters. If it really does not matter, please don't apply my patch, Bruce (still do the rename, though, please). > We can't *create* the first instance of it, which should not > affect this at all. And if we passed that, all backends are still > execute in the same session, so there is no effect on it. Yes, it only matters if a postmaster is started/tries to start from different TS sessions. I think we need to determine global existence of the shm segment to get rid of old processes/segments. > services.msc only interacts with the SCM, it has nothing at all to do > with shmem. I meant when "run as a service", which is the same TS session as the console. Andreas ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Open Items
> o fix shared memory on Win2k terminal server > > We might be able to just mark this as not supported. I have attached a patch that I think fixes this. The problem I saw and fixed is, that the shmem created in a terminal services client is not visible to the console (or services.msc). It was necessary to differentiate OS versions; this might be better put elsewhere. I think in addition the system global name "sharemem.1" should be made more pg specific, like "PostgreSQL.1". I have not done this since a new compile would not detect a running old beta. But now would be the time (or never). Andreas shmem.win32.patch Description: shmem.win32.patch ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] plans for bitmap indexes?
> > create index people_male_gay_ix on people (city) where gender = 'male' and > > orientation = 'gay'; > > You've forgotten part of my premise (based on a real case I discussed on IRC) > that there are EIGHTEEN criteria columns. That is why I said maybe :-) Whether it helps depends on the number of access patterns actually (and often) used. > That would require, by the method > you have above, roughly 18(3rd factorial) indexes, times the number of values > allowed by each column, which if it averaged, say 5 values, would be 24,480 Well, an index only needs to reduce the cost enough so that you can afford your workload and have reasonable response times, so you might only need to create a few of them. I was actually only trying to help optimize this example without the "bitmap index" feature, not trying to say that for this example "partial indexes" are better. Especially since the first example that mentioned partial indexes was not "the way to do it" for a value that represents a large part of your table (here approx. 50%). (don't do: create index people_male on people (gender) where gender='male';) Andreas ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] plans for bitmap indexes?
> > The most nearly comparable thing is be the notion of "partial > > indexes," where, supposing you had 60 region codes (e.g. - 50 US > > states, 10 Canadian provinces), you might set up indices thus: > For example, imagine you have a table on a dating website with 18 columns > representing 18 different characteristics for matching. Imagine that you > index each of those columns seperately. If you do: > > SELECT * FROM people WHERE orientation = 'gay' AND gender = 'male' AND city = > 'San Francisco'; I think bitmap indexes do have valid use cases, but partial indexes are really a wonderful feature with a lot of use cases, maybe including this one. Workable examples of useful partial indexes that help here are: create index people_male_ix on people (city) where gender = 'male'; create index people_gay_ix on people (city) where orientation = 'gay'; create index people_male_gay_ix on people (city) where gender = 'male' and orientation = 'gay'; Note that the indexed column differs from the predicate clause. Note also that the last index will perform way better than a combo of bitmap indexes. Andreas ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Security implications of config-file-location patch
> > If they are using tablespaces is it OK that anyone can see their > > location? > > Good point. Should we obscure pg_tablespace similarly to > what we do for pg_shadow? Hmm, I cannot see how a person with file access could not easily find the file for a specific table without pg_tablespace anyway (since oid names will be quite unique). Without file access, what malicious act is he going to do with that info ? I think hiding that info would not really make things safer, and is thus not worth it. Andreas ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Two-phase commit
> Well, the question is how long must the individual databases retain > state with which to answer "recover" requests. I don't like "forever", > so I'm proposing that there should be an explicit command to say "you > can forget about this gid". I think it would be ok to forget the gid after: 1. the same TX manager session sends the next statement (maybe only BEGIN?) 2. the TX manager session is explicitly disconnected (can we tell the difference from an aborted connection ?) > > Note that this is only really necessary because of Heikki's choice to > make the API work in terms of a user-assigned GID. This was not an arbitrary choice, but is required by most/all TX managers :-( I agree that it would be cleaner for the TX manager to keep such a map. Maybe the idea was for the GIDs to be human-readable, since human intervention is rare but still required with 2PC, especially in the case where the TX manager is unavailable/unrecoverable. Andreas ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings