Re: [HACKERS] Any conclusion on the Xeon context-switching issue?
On Mon, 2006-02-27 at 16:54 +0000, Richard Huxton wrote:
> Tom Lane wrote:
>> Richard Huxton <dev@archonet.com> writes:
>>> Subject says it all really.  I've got a new client who seems to be
>>> suffering from it, and I'm not sure if any conclusion was reached.
>>
>> What's he using?  8.1 seems to have alleviated the problem somewhat,
>> and I've done more work in CVS tip.  It'll never go away entirely,
>> because these chips are just not very good at sharing memory, but
>> we've certainly reduced it quite a bit from where it was in 7.x.
>
> 7.4.12 (.12 as of last week).  I've seen context-switching peak at
> 80,000 on a quad-Xeon that really shouldn't be straining.

Is this causing an actual problem, or do you just like to see lower
numbers under the cs column?  Certainly many people have asked about
this issue but few have pointed to application-level performance
problems that may have resulted.

-jwb
Re: [HACKERS] [GENERAL] New project launched : PostgreSQL GUI
On Tue, 2006-01-31 at 03:50 -0400, Marc G. Fournier wrote:
> On Tue, 31 Jan 2006, Tino Wildenhain wrote:
>> Devrim GUNDUZ schrieb:
>>> Hi,
>>> ...
>>>> Are you going to work with the underlying system's package
>>>> manager, or put everything in /usr/local?
>>>
>>> We'll work with the package manager -- I'm an RPM guy ;)
>>
>> RPM isn't the only packaging system out there ;)
>
> I thought that Linux had this 'Linux Standard File System' or some
> such that described where files were supposed to be installed?  Or is
> this another one of those Standards that nobody follows? :(

Package management goes beyond sticking the files in the right place.
If that were the only requirement, then tar(1) would be a package
manager.

As for the Linux Standards Base, that is little more than Red Hat and
their like-minded friends getting together in a committee and
declaring the way they already do everything to be a standard.
Generally the LSB holds the short-term needs of commercial Linux
distributors above other considerations.

-jwb
Re: [HACKERS] Configuration WAS: New project launched : PostgreSQL
On Tue, 2006-01-31 at 12:11 -0800, Josh Berkus wrote:
> Jeffery,
>
>>> PostgreSQL *desperately* needs a better means of dealing with
>>> configuration (though I guess I shouldn't be pushing too hard for
>>> this since the current state of affairs brings me business).
>>
>> Any improvement in this area would be very welcome.
>> http://pgfoundry.org/projects/configurator/ is something worth
>> looking at.  An ideal facility would be a program that analyzes the
>> workload at runtime and adjusts accordingly.  That doesn't sound too
>> hard, within some unambitious boundary.  If anyone would like to
>> work on this, I'd be happy to contribute.
>
> It seems pretty hard to *me*, compared with static configuration.  If
> you have ideas for runtime analysis of configuration criteria, I'd be
> thrilled to hear them.  From my perspective, most of them depend on
> backend monitoring that we don't have yet (like querying how full the
> FSM is).

I agree that some instrumentation of the backend might be needed.  But
several of the performance-critical parameters seem tractable:

Effective cache size - should be easy to monitor the system for this
Shared buffers       - easy to start from a rule-of-thumb and monitor
                       usage
Work mem             - trace the size and frequency of temp files
Wal buffers          - trace the average or 80th percentile number of
                       pages generated by transactions
Commit delay         - track the concurrency level and avg distance
                       between commits
Checkpoint segments  - should be very easy to auto-adjust
Random page cost     - should be possible to determine this at runtime
Vacuum*              - may be possible to determine vacuum impact on
                       concurrent queries
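Several of these start from a rule of thumb keyed to physical RAM
before any runtime monitoring refines them.  A minimal sketch of that
static starting point; the fractions and the program itself are
illustrative assumptions, not anything proposed in the thread:

    /* Hypothetical sketch: derive rule-of-thumb starting values from
     * total RAM.  Thresholds are illustrative, not project policy. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long pages = sysconf(_SC_PHYS_PAGES);
        long psize = sysconf(_SC_PAGE_SIZE);
        long long ram_mb = (long long) pages * psize / (1024 * 1024);

        /* Classic starting points: ~25% of RAM for shared_buffers,
         * ~75% for effective_cache_size; runtime monitoring would then
         * adjust these from observed buffer hit rates. */
        printf("shared_buffers = %lldMB\n", ram_mb / 4);
        printf("effective_cache_size = %lldMB\n", ram_mb * 3 / 4);
        printf("work_mem = %lldMB\n", ram_mb / 64 > 1 ? ram_mb / 64 : 1);
        return 0;
    }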
Re: [HACKERS] Configuration WAS: New project launched : PostgreSQL
On Tue, 2006-01-31 at 17:06 -0600, Jim C. Nasby wrote:
> On Tue, Jan 31, 2006 at 12:36:31PM -0800, Jeffrey W. Baker wrote:
>> Random page cost - should be possible to determine this at runtime
>
> Before worrying about random_page_cost at all it makes a lot more
> sense to look at the query cost estimates, some of which are pretty
> far out of whack.

I agree, but this problem is orthogonal, no?
Re: [HACKERS] New project launched : PostgreSQL GUI Installer for
On Tue, 2006-01-31 at 13:02 -0600, Jim C. Nasby wrote:
> On Tue, Jan 31, 2006 at 11:46:03AM +0000, Andreas Pflug wrote:
>> Tino Wildenhain wrote:
>>> Figuring out the correct values for some of the buffers and costs
>>> is still a bit tough.  Otoh, I guess there is no easy way to
>>> predict all these.
>>
>> pgAdmin has a mechanism to suggest values (currently for autovacuum
>> and listen_address only), which waits for expansion :-)  I could
>> think of a wizard that asks decent questions, resulting in
>> proposals.  Whether implemented as GUI or not, a questionnaire and
>> suggested algorithms to calculate settings (eyeballed from Core)
>> would be a good starting point.
>
> PostgreSQL *desperately* needs a better means of dealing with
> configuration (though I guess I shouldn't be pushing too hard for
> this since the current state of affairs brings me business).

Any improvement in this area would be very welcome.
http://pgfoundry.org/projects/configurator/ is something worth looking
at.  An ideal facility would be a program that analyzes the workload
at runtime and adjusts accordingly.  That doesn't sound too hard,
within some unambitious boundary.  If anyone would like to work on
this, I'd be happy to contribute.

-jwb
Re: [HACKERS] New project launched : PostgreSQL GUI Installer for
On Mon, 2006-01-30 at 19:52 -0800, Joshua D. Drake wrote:
>> On my Debian systems, I can install PostgreSQL quite readily via the
>> command apt-get install postgresql-8.1, which can get GUIed at least
>> somewhat if I run aptitude, synaptic, or such...
>
> Yes Christopher, you can... I can, and Devrim can.  As more and more
> people come on board, people are going to want to download a .exe (a
> metaphor), double click and have it open an installer; they will then
> want to click next, next, continue, finish.

There is such a thing as best practices.  If you install postgresql in
this glorious graphical manner, what will prevent you from
accidentally upgrading a shared library which postgresql depends upon?
Nothing, really, unless this installer is going to be able to
customize, build, and install a native package on all the target
operating systems.  How will you do an orderly upgrade from one
revision to the next, including all the dependencies?  How will you
distribute security updates?

I predict this form of installation will cause a great many support
headaches as users report problems which are caused by oddball
compilers, strange CFLAGS, unreleased or strangely patched versions of
shared libraries and headers, and so forth.

> You don't get that with apt-get install.

Right, with apt-get install you get a package built with a known-good
compiler, known-sane configure flags, and a method of pinning the
dependencies, which passes at the very least a smoketest on Alpha,
AMD64, ARM, HPPA, x86, IA64, 680x0, MIPS, PowerPC, S/390, and SPARC.

> There is a reason that even Oracle has a graphical installer on
> Linux, because most people installing the software:
>
> A. Don't know how to use it
> B. Probably don't know how to use Linux
> C. Don't want to.

Oracle's graphical installer is a material impediment to Oracle
adoption.  The installer only works on systems where particular
versions of Java and Motif libraries are available.  On 64-bit Opteron
systems it only works with the peculiar 32-bit thunking tree favored
by Red Hat and hardly anybody else.  If I could install Oracle on
Debian/AMD64 with a shell script, I'd drop Postgresql in a heartbeat.

Obviously anybody is welcome and able to just write whatever software
they feel is needed, but go ahead and count me among the skeptics.

-jwb
Re: [HACKERS] New project launched : PostgreSQL GUI Installer for
On Mon, 2006-01-30 at 20:53 -0800, Joshua D. Drake wrote:
>> Oracle's graphical installer is a material impediment to Oracle
>> adoption.  The installer only works on systems where particular
>> versions of Java and Motif libraries are available.  On 64-bit
>> Opteron systems it only works with the peculiar 32-bit thunking tree
>> favored by Red Hat and hardly anybody else.  If I could install
>> Oracle on Debian/AMD64 with a shell script, I'd drop Postgresql in a
>> heartbeat.  Obviously anybody is welcome and able to just write
>> whatever software they feel is needed, but go ahead and count me
>> among the skeptics.
>
> The installer is for the 98% not the 2%.  You are in the 2%.

Right, and it would make FAR more sense if Oracle just shipped the
whole thing, operating system and the works, on a single installer
image.  So why don't you just do that with Postgres?  You could call
it Bootable PostgreSQL.  It would be a big hit.  When a new version
comes out, you can just mail out a new DVD.  That would be a lot
better than pretending to know how to fit in, best practices and the
works, with all the various Unix systems out there.

-jwb
Re: [HACKERS] ice-broker scan thread
On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
> Anyway, what I did was the following.  When doing a sequential scan,
> we were starting at the beginning of the table and scanning forward.
> If I threw up some threads to read ahead, then my user thread and my
> read ahead threads would thrash on trying to lock the buffer slots.
> So, I had the read ahead threads start at some distance into the
> table, and work toward the beginning.

I believe this is commonly called a synchronous scan.

-jwb
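On Linux a similar effect can be had without extra threads by hinting
the kernel about upcoming blocks.  A minimal sketch: posix_fadvise()
is a real POSIX call, but the 8 kB block size and the fixed prefetch
window are illustrative assumptions, not anything from the thread:

    /* Sketch: prefetch a window of blocks ahead of a sequential scan
     * so the kernel overlaps I/O with processing of the current page. */
    #define _XOPEN_SOURCE 600
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ       8192
    #define PREFETCH_WIN 64        /* stay 64 blocks ahead of the scan */

    int main(int argc, char **argv)
    {
        char page[BLCKSZ];
        long blkno = 0;

        int fd = open(argc > 1 ? argv[1] : "table.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        while (pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ) {
            /* hint the window ahead; hints past EOF are ignored */
            posix_fadvise(fd, (off_t) (blkno + 1) * BLCKSZ,
                          (off_t) PREFETCH_WIN * BLCKSZ,
                          POSIX_FADV_WILLNEED);
            blkno++;               /* ... process the page here ... */
        }
        close(fd);
        return 0;
    }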
Re: [HACKERS] [PERFORM] A Better External Sort?
On Wed, 2005-10-05 at 12:14 -0400, Ron Peacetree wrote:
> I've now gotten verification from multiple working DBA's that DB2,
> Oracle, and SQL Server can achieve ~250MBps ASTR (with as much as
> ~500MBps ASTR in setups akin to Oracle RAC) when attached to a decent
> (not outrageous, but decent) HD subsystem...  I've not yet had any RW
> DBA verify Jeff Baker's supposition that ~1GBps ASTR is attainable.
> Cache based bursts that high, yes.  ASTR, no.

I find your tone annoying.  That you do not have access to this level
of hardware proves nothing, other than pointing out that your repeated
emails on this list are based on supposition.

If you want 1GB/sec STR you need:

1) 1 or more Itanium CPUs
2) 24 or more disks
3) 2 or more SATA controllers
4) Linux

Have fun.

-jwb
Re: [HACKERS] [PERFORM] A Better External Sort?
On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote:
> Michael,
>
>>> Realistically, you can't do better than about 25MB/s on a
>>> single-threaded I/O on current Linux machines,
>>
>> What on earth gives you that idea?  Did you drop a zero?
>
> Nope, LOTS of testing, at OSDL, GreenPlum and Sun.  For comparison, a
> Big-Name Proprietary Database doesn't get much more than that either.

I find this claim very suspicious.  I get single-threaded reads in
excess of 1GB/sec with XFS and 250MB/sec with ext3.

-jwb
Re: [HACKERS] [PERFORM] A Better External Sort?
On Mon, 2005-10-03 at 14:16 -0700, Josh Berkus wrote:
> Jeff,
>
>>> Nope, LOTS of testing, at OSDL, GreenPlum and Sun.  For comparison,
>>> a Big-Name Proprietary Database doesn't get much more than that
>>> either.
>>
>> I find this claim very suspicious.  I get single-threaded reads in
>> excess of 1GB/sec with XFS and 250MB/sec with ext3.
>
> Database reads?  Or raw FS reads?  It's not the same thing.

Just reading files off the filesystem.  These are input rates I get
with a specialized sort implementation.  1GB/sec is not even
especially wonderful, I can get that on two controllers with a 24-disk
stripe set.

I guess database reads are different, but I remain unconvinced that
they are *fundamentally* different.  After all, a tab-delimited file
(my sort workload) is a kind of database.

> Also, we're talking *write speed* here, not read speed.

Ok, I did not realize.  Still you should see 250-300MB/sec
single-threaded sequential output on ext3, assuming the storage can
provide that rate.

> I also find *your* claim suspicious, since there's no way XFS is 300%
> faster than ext3 for the *general* case.

On a single disk you wouldn't notice, but XFS scales much better when
you throw disks at it.  I get a 50MB/sec boost from the 24th disk,
whereas ext3 stops scaling after 16 disks.  For writes both XFS and
ext3 top out around 8 disks, but in this case XFS tops out at
500MB/sec while ext3 can't break 350MB/sec.

I'm hopeful that in the future the work being done at ClusterFS will
make ext3 on par with XFS.

-jwb
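Throughput claims like these are easy to sanity-check.  A minimal
single-threaded sequential-read benchmark; the file name and the 1 MB
read size are assumptions, and the page cache should be dropped first
for honest numbers:

    /* Minimal sequential-read benchmark, similar in spirit to the
     * numbers quoted above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        const size_t bufsz = 1 << 20;      /* 1 MB reads */
        char *buf = malloc(bufsz);
        struct timeval t0, t1;
        long long total = 0;
        ssize_t n;

        int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
        if (fd < 0 || !buf) { perror("setup"); return 1; }

        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bufsz)) > 0)
            total += n;
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/s\n", total / 1048576.0 / secs);
        close(fd);
        free(buf);
        return 0;
    }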
Re: [HACKERS] [PERFORM] A Better External Sort?
On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
> Josh,
>
> On 9/29/05 9:54 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> Following an index creation, we see that 95% of the time required is
>> the external sort, which averages 2mb/s.  This is with separate
>> drives for the WAL, the pg_tmp, the table and the index.  I've
>> confirmed that increasing work_mem beyond a small minimum (around
>> 128mb) had no benefit on the overall index creation speed.
>
> Yp!  That about sums it up - regardless of taking 1 or 2 passes
> through the heap being sorted, 1.5 - 2 MB/s is the wrong number.

Yeah, this is really bad ... approximately the speed of GNU sort.

Josh, do you happen to know how many passes are needed in the
multiphase merge on your 60GB table?

Looking through tuplesort.c, I have a couple of initial ideas.  Are we
allowed to fork here?  That would open up the possibility of using the
CPU and the I/O in parallel.  I see that tuplesort.c also suffers from
the kind of postgresql-wide disease of calling all the way up and down
a big stack of software for each tuple individually.  Perhaps it could
be changed to work on vectors.

I think the largest speedup will be to dump the multiphase merge and
merge all tapes in one pass, no matter how large M.  Currently M is
capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes
over the tape.  It could be done in a single pass heap merge with
N*log(M) comparisons, and, more importantly, far less input and
output.

I would also recommend using an external process to asynchronously
feed the tuples into the heap during the merge.

What's the timeframe for 8.2?

-jwb
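For illustration, a sketch of the single-pass N-way merge being
proposed: a binary min-heap over the head of each run gives
N*log(M) comparisons with exactly one read pass.  Sorted integer
arrays stand in for the on-disk tapes, which is an assumption for
brevity; real code would stream from temp files:

    #include <stdio.h>

    typedef struct { const int *p, *end; } Run;

    static void sift_down(Run *h, int n, int i)
    {
        for (;;) {
            int small = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < n && *h[l].p < *h[small].p) small = l;
            if (r < n && *h[r].p < *h[small].p) small = r;
            if (small == i) return;
            Run tmp = h[i]; h[i] = h[small]; h[small] = tmp;
            i = small;
        }
    }

    static void merge_runs(Run *h, int n)
    {
        for (int i = n / 2 - 1; i >= 0; i--)   /* heapify by head value */
            sift_down(h, n, i);

        while (n > 0) {
            printf("%d\n", *h[0].p++);         /* emit smallest head */
            if (h[0].p == h[0].end)
                h[0] = h[--n];                 /* run exhausted */
            sift_down(h, n, 0);
        }
    }

    int main(void)
    {
        int a[] = {1, 4, 9}, b[] = {2, 3, 8}, c[] = {5, 6, 7};
        Run runs[] = {{a, a + 3}, {b, b + 3}, {c, c + 3}};
        merge_runs(runs, 3);                   /* prints 1..9 in order */
        return 0;
    }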
Re: [HACKERS] [PERFORM] A Better External Sort?
On Thu, 2005-09-29 at 11:03 -0700, Josh Berkus wrote:
> Jeff,
>
>> Josh, do you happen to know how many passes are needed in the
>> multiphase merge on your 60GB table?
>
> No, any idea how to test that?

I would just run it under the profiler and see how many times
beginmerge() is called.

-jwb
Re: [HACKERS] [PERFORM] A Better External Sort?
On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> From: Jeffrey W. Baker [EMAIL PROTECTED]
> Sent: Sep 27, 2005 1:26 PM
> To: Ron Peacetree [EMAIL PROTECTED]
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>> On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
>>> That Btree can be used to generate a physical reordering of the
>>> data in one pass, but that's the weakest use for it.  The more
>>> powerful uses involve allowing the Btree to persist and using it
>>> for more efficient re-searches or combining it with other such
>>> Btrees (either as a step in task distribution across multiple CPUs
>>> or as a more efficient way to do things like joins by manipulating
>>> these Btrees rather than the actual records.)
>>
>> Maybe you could describe some concrete use cases.  I can see what
>> you are getting at, and I can imagine some advantageous uses, but
>> I'd like to know what you are thinking.  Specifically I'd like to
>> see some cases where this would beat sequential scan.  I'm thinking
>> that in your example of a terabyte table with a column having only
>> two values, all the queries I can think of would be better served
>> with a sequential scan.
>
> In my original example, a sequential scan of the 1TB of 2KB or 4KB
> records, = 250M or 500M records of data, being sorted on a binary
> value key will take ~1000x more time than reading in the ~1GB Btree I
> described that used a Key+RID (plus node pointers) representation of
> the data.

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy.  A common, general-purpose
case would be the best.

We can all see that the method you describe might be a good way to
sort a very large dataset with some known properties, which would be
fine if you are trying to break the terasort benchmark.  But that's
not what we're doing here.  We are designing and operating relational
databases.  So please explain the application.

Your main example seems to focus on a large table where a key column
has constrained values.  This case is interesting in proportion to the
number of possible values.  If I have billions of rows, each having
one of only two values, I can think of a trivial and very fast method
of returning the table sorted by that key: make two sequential passes,
returning the first value on the first pass and the second value on
the second pass.  This will be faster than the method you propose.

I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete.  If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.

-jwb

PS: Whatever mailer you use doesn't understand or respect threading
nor attribution.  Out of respect for the list's readers, please try a
mailer that supports these 30-year-old fundamentals of electronic
mail.
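For concreteness, the two-pass method described above, with an
in-memory array standing in for two sequential heap scans (the row
layout is an assumption for illustration):

    /* Emit all rows with key A, then all rows with key B: output is
     * sorted by key using two sequential passes and zero random I/O. */
    #include <stdio.h>

    typedef struct { int key; int payload; } Row;

    static void emit_sorted_by_key(const Row *rows, int n,
                                   int val_a, int val_b)
    {
        for (int i = 0; i < n; i++)        /* pass 1: all A rows */
            if (rows[i].key == val_a)
                printf("%d %d\n", rows[i].key, rows[i].payload);
        for (int i = 0; i < n; i++)        /* pass 2: all B rows */
            if (rows[i].key == val_b)
                printf("%d %d\n", rows[i].key, rows[i].payload);
    }

    int main(void)
    {
        Row t[] = {{1, 10}, {0, 11}, {1, 12}, {0, 13}};
        emit_sorted_by_key(t, 4, 0, 1);
        return 0;
    }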
Re: [HACKERS] Database file compatability
On Mon, 2005-09-26 at 17:27 -0500, Jim C. Nasby wrote:
> If a database is created with a 64 bit version of initdb, would a
> 32bit backend be able to talk to it?  Likewise, would a backend
> compiled by a different compiler be able to?  If there was some kind
> of incompatibility, would the backend just refuse to start, or would
> it start and start silently trashing data?

I plugged a storage array that was initialized with ia32 and attached
it to an amd64 machine.  The postmaster complained at startup that
such-and-such magic value was incorrect and refused to start.  However
it was implied on the mailing list that for certain unlucky magic
values the postmaster may have started anyway and eaten the database.

-jwb
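A sketch of the kind of startup check being described: read a
magic/version word from a control file and refuse to start on
mismatch.  The file name and constant here are illustrative, not
PostgreSQL's actual pg_control layout:

    #include <stdio.h>
    #include <stdint.h>

    #define EXPECTED_MAGIC UINT32_C(0xDEADBEEF)   /* hypothetical */

    static int check_control_file(const char *path)
    {
        uint32_t magic;
        FILE *f = fopen(path, "rb");

        if (!f || fread(&magic, sizeof magic, 1, f) != 1) {
            if (f) fclose(f);
            return -1;                /* unreadable: refuse to start */
        }
        fclose(f);

        /* A cluster initialized with a different word size or struct
         * layout will typically fail here; the danger is the rare
         * value that passes by accident despite incompatibility. */
        return magic == EXPECTED_MAGIC ? 0 : -1;
    }

    int main(void)
    {
        return check_control_file("pg_control.example") == 0 ? 0 : 1;
    }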
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
> That Btree can be used to generate a physical reordering of the data
> in one pass, but that's the weakest use for it.  The more powerful
> uses involve allowing the Btree to persist and using it for more
> efficient re-searches or combining it with other such Btrees (either
> as a step in task distribution across multiple CPUs or as a more
> efficient way to do things like joins by manipulating these Btrees
> rather than the actual records.)

Maybe you could describe some concrete use cases.  I can see what you
are getting at, and I can imagine some advantageous uses, but I'd like
to know what you are thinking.  Specifically I'd like to see some
cases where this would beat sequential scan.  I'm thinking that in
your example of a terabyte table with a column having only two values,
all the queries I can think of would be better served with a
sequential scan.

Perhaps I believe this because you can now buy as much sequential I/O
as you want.  Random I/O is the only real savings.

-jwb
Re: [HACKERS] data on devel code perf dip
On Sun, 2005-08-21 at 20:37 -0400, Tom Lane wrote:
> The whole thing's pretty bizarre.  I hate to sound obvious, but does
> the missing performance return if you back the patch out?

It seemed to have been decided on Tue, 16 Aug 2005 15:45:30 -0700 that
the performance was the same before and after.  However, there also
seems to be version confusion, as [EMAIL PROTECTED] also claimed to be
testing a tree from the future.

Basically I think there's more confusion here than evidence.  On an
admittedly smaller x86 configuration, source trees with this patch are
faster than without, not slower.

And finally, not to start a flamewar, but the initial report was on a
machine running Gentoo, the distribution that should be renamed
Irreproducible Benchmarks Linux.

-jwb
Re: [HACKERS] Simplifying wal_sync_method
On Mon, 2005-08-08 at 17:03 -0400, Tom Lane wrote:
> That's a decision that hasn't got a shred of evidence to justify
> imposing it on every platform.

This option has its uses on Linux, however.  In my testing it's good
for a large speedup (20%) on a 10-client pgbench, and a minor
improvement with 100 clients.  See my mail of July 14th, "O_DIRECT for
WAL writes".

-jwb
Re: [HACKERS] ORDER BY field not in return list
On Mon, 2005-07-25 at 19:08 -0300, Marc G. Fournier wrote:
> On Mon, 25 Jul 2005, Jeffrey W. Baker wrote:
>> On Mon, 2005-07-25 at 18:11 -0300, Marc G. Fournier wrote:
>>> Just curious as to whether or not a warning or something should be
>>> issued in a case like:
>>>
>>>     SELECT c.*
>>>       FROM company c, company_summary cs
>>>      WHERE c.id = cs.id
>>>        AND cs.detail = 'test'
>>>      ORDER BY cs.fullname;
>>
>> Seems like it should work.  Is it not returning in fullname order in
>> your tests?
>
> Full name isn't a field in the results, so how would it be ORDERing
> based on it?  fullname is a field in the table being joined in order
> to restrict the results to just those with cs.detail = 'test' ... but
> company itself doesn't have a field fullname ...

I'm still not seeing the problem.  cs.fullname is in the product of
the join, and you can order the result thereby, and not return the
column.

-jwb
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
On Fri, 2005-06-24 at 09:37 -0400, Tom Lane wrote:
> ITAGAKI Takahiro [EMAIL PROTECTED] writes:
>> ... So I'll post the new results:
>>
>> checkpoint_ | writeback |           |                | O_DIRECT    |              |
>>  segments   |   cache   | open_sync |  fsync=false   |    only     | fsync_direct | open_direct
>> ------------+-----------+-----------+----------------+-------------+--------------+-------------
>> [3]      3  |    off    | 38.2 tps  | 138.8(+263.5%) | 38.6(+1.2%) | 38.5(+0.9%)  | 38.5(+0.9%)
>
> Yeah, this is about what I was afraid of: if you're actually fsyncing
> then you get at best one commit per disk revolution, and the
> negotiation with the OS is down in the noise.  At this point I'm
> inclined to reject the patch on the grounds that it adds complexity
> and portability issues, without actually buying any useful
> performance improvement.  The write-cache-on numbers are not going to
> be interesting to any serious user :-(

You mean not interesting to people without a UPS.  Personally, I'd
like to realize a 50% boost in tps, which is what O_DIRECT buys
according to ITAGAKI Takahiro's posted results.  The batteries on a
caching RAID controller can run for days at a stretch.  It's not as
dangerous as people make it sound.

And anyone running PG on software RAID is crazy.

-jwb
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
On Fri, 2005-06-24 at 10:19 -0500, Jim C. Nasby wrote:
> On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
>> ITAGAKI Takahiro [EMAIL PROTECTED] writes:
>>> ... So I'll post the new results:
>>>
>>> checkpoint_ | writeback |           |                | O_DIRECT    |              |
>>>  segments   |   cache   | open_sync |  fsync=false   |    only     | fsync_direct | open_direct
>>> ------------+-----------+-----------+----------------+-------------+--------------+-------------
>>> [3]      3  |    off    | 38.2 tps  | 138.8(+263.5%) | 38.6(+1.2%) | 38.5(+0.9%)  | 38.5(+0.9%)
>>
>> Yeah, this is about what I was afraid of: if you're actually
>> fsyncing then you get at best one commit per disk revolution, and
>> the negotiation with the OS is down in the noise.  At this point I'm
>> inclined to reject the patch on the grounds that it adds complexity
>> and portability issues, without actually buying any useful
>> performance improvement.  The write-cache-on numbers are not going
>> to be interesting to any serious user :-(
>
> Is there anyone with a battery-backed RAID controller that could run
> these tests?  I suspect that in that case the differences might be
> closer to 1 or 2 rather than 3, which would make the patch much more
> valuable.

I applied the O_DIRECT patch to 8.0.3 and I tested this on a
battery-backed RAID controller with 128MB of cache and 5 7200RPM SATA
disks.  All caches are write-back.  The xlog and data are on the same
JFS volume.  pgbench was run with a scale factor of 1000 and 100,000
total transactions.  Clients varied from 10 to 100.

    Clients | fsync | open_direct
    --------+-------+-------------
         10 |    81 |  98 (+21%)
        100 |   100 | 105 ( +5%)

No problems were experienced.  The patch seems to give a useful boost!

-jwb
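For readers unfamiliar with the patch's mechanics, a sketch of an
O_DIRECT write on Linux: the buffer, offset, and length must all be
suitably aligned or the write fails with EINVAL.  The file name and
the 8 kB page size are assumptions for illustration:

    #define _GNU_SOURCE            /* O_DIRECT on Linux */
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define XLOG_BLCKSZ 8192

    int main(void)
    {
        void *page;
        int fd = open("wal_segment", O_WRONLY | O_CREAT | O_DIRECT, 0600);

        if (fd < 0)
            return 1;
        /* O_DIRECT requires alignment; plain malloc() is not enough. */
        if (posix_memalign(&page, 4096, XLOG_BLCKSZ) != 0)
            return 1;
        memset(page, 0, XLOG_BLCKSZ);

        /* Bypasses the OS cache, but data may still sit in a disk or
         * controller cache -- hence the battery-backed cache caveat. */
        if (write(fd, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
            return 1;
        close(fd);
        free(page);
        return 0;
    }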
Re: [HACKERS] Consumer-grade vs enterprise-grade disk drives
On Mon, 2005-05-30 at 21:38 -0700, Luke Lonergan wrote:
> Tom,
>
> This is a story that is evolving.  Anyone else use StorageReview?
> Great comprehensive drive benchmarks: http://www.storagereview.com/
> Check the comparisons between 15K RPM SCSI drives and the 2004
> Western Digital 10K RPM SATA (Raptor) drives.  The Raptors are an
> interesting hybrid of SCSI-related tech and desktop tech, and were
> some of the first drives with SCSI-like command queuing TCQ/NCQ.

If we're looking at the same benchmark (File Server DriveMark), the
fastest SCSI disk is 65% faster than the fastest SATA disk.  The
fastest SCSI 10K disk is 25% faster than the SATA.

> I think the last remaining issue in moving to SATA for all enterprise
> use is the lack of decent SATA controllers, though 3Ware
> (http://www.3ware.com) is getting there:
>
> http://www.3ware.com/link/pdf/Serial-ATA.pdf
> http://www.3ware.com/products/benchmarks_sata.asp

The 3Ware controllers are probably the worst ones you can get for
database use.  Their caches are slow (I didn't even know that was
possible until I bought one), as are the XOR engines.  After reading
this very comprehensive benchmark:

http://print.tweakers.net/?reviews/557

I purchased one of the Areca controllers with a large battery-backed
cache.  It is unholy fast.  I recommend it.

-jwb
Re: [HACKERS] Bitmap scan cost model (was Re: bitmap scans, btree
On Wed, 2005-05-18 at 11:27 -0400, Tom Lane wrote:
> Jeffrey W. Baker [EMAIL PROTECTED] writes:
>> Obviously in this case sequential scan was (would have been) a huge
>> win.  Incrementing random_page_cost from 4 (the default) to 5 causes
>> the planner to make a better decision.
>
> But to get the estimated cost ratio to match up with the actual cost
> ratio, we'd have to raise random_page_cost to nearly 70, which is a
> bit hard to credit.  What was the platform being tested here?

i686 Linux 2.6.8 with a single 7200RPM SATA disk.

-jwb
[HACKERS] Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)
On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote:
> Jeffrey W. Baker [EMAIL PROTECTED] writes:
>> On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
>>> This is a fallacy, and I think your concern is largely mistaken.
>>> Have you experimented with the cases you are worried about?
>>
>> Perhaps I have not stated the problem clearly.  Believe me, I have
>> experimented extensively with this problem.
>
> Sorry, perhaps I wasn't clear: have you experimented *with CVS tip*?
> The current code is certainly capable of choosing either seqscan,
> bitmap scan, or traditional index scan depending on the given query.
> For example, [...]
>
> I think the work that's most needed at the moment is to test the
> bitmap-scan cost model to see if it has much to do with reality...

The cost model seems to work pretty darn well, as a matter of fact.
This table is in-order by id, in-order by date, and randomly ordered
by random.

    scratch=# \d test
         Table "public.test"
     Column |  Type   | Modifiers
    --------+---------+-----------
     id     | integer |
     date   | date    |
     random | integer |
    Indexes:
        "test_pk" UNIQUE, btree (id)
        "test_date_idx" btree (date)
        "test_rand_idx" btree (random)

    scratch=# select count(1) from test;
      count
    ----------
     10000000

The size of the tables and indexes is about 1GB, or double physical
memory.

I tested by selecting on the random column.  For 100 random values, I
selected the row that matches exactly, then in ranges of 1000, 10000,
100000, and 1000000.  These touch roughly 5, 50, 500, and 5000 tuples,
respectively.  The test is repeated for index scan, bitmap scan, and
sequential scan.  Example query:

    select count(1) from test
     where random >= 1429076987 and random < 1429076987 + 10000;

                                QUERY PLAN
    ------------------------------------------------------------------
     Aggregate  (cost=171.31..171.31 rows=1 width=0)
       (actual time=0.996..1.000 rows=1 loops=1)
       ->  Bitmap Heap Scan on test  (cost=2.26..171.20 rows=43 width=0)
             (actual time=0.145..0.829 rows=42 loops=1)
             Recheck Cond: ((random >= 1429076987) AND
                            (random < 1429086987))
             ->  Bitmap Index Scan on test_rand_idx
                   (cost=0.00..2.26 rows=43 width=0)
                   (actual time=0.063..0.063 rows=42 loops=1)
                   Index Cond: ((random >= 1429076987) AND
                                (random < 1429086987))
     Total runtime: 1.114 ms

    100 queries returning |  1 |  5 |  50 |  500  |  5000  |
    ----------------------+----+----+-----+-------+--------+
    bitmap scan           | 1s | 2s | 13s | 1m41s | 11m16s |
    index scan            | 1s | 1s | 12s |  2m6s | 24m19s |
    ----------------------+----+----+-----+-------+--------+
    sequential scan       |              28m20s             |

The planner uses index scan for the first two columns, and chooses
bitmap scan for the rest.  This would seem to be a reasonable
decision, given the results.  The only real problem with the cost
model in further testing is that it doesn't switch to seq scan quickly
enough.

If bitmap scan is disabled, the planner will pick index scan even in
cases when sequential scan is 10x faster:

    scratch=# set enable_bitmapscan to off;
    SET
    scratch=# explain analyze select count(1) from test
              where random >= 1429076987 and random < 1429076987 + 10000000;

                                QUERY PLAN
    ------------------------------------------------------------------
     Aggregate  (cost=170142.03..170142.03 rows=1 width=0)
       (actual time=177419.182..177419.185 rows=1 loops=1)
       ->  Index Scan using test_rand_idx on test
             (cost=0.00..170034.11 rows=43167 width=0)
             (actual time=0.035..177255.696 rows=46764 loops=1)
             Index Cond: ((random >= 1429076987) AND
                          (random < 1439076987))
     Total runtime: 177419.302 ms
    (4 rows)

    scratch=# set enable_indexscan to off;
    SET
    scratch=# explain analyze select count(1) from test
              where random >= 1429076987 and random < 1429076987 + 10000000;

                                QUERY PLAN
    ------------------------------------------------------------------
     Aggregate  (cost=204165.55..204165.55 rows=1 width=0)
       (actual time=12334.042..12334.045 rows=1 loops=1)
       ->  Seq Scan on test  (cost=0.00..204057.62 rows=43167 width=0)
             (actual time=17.436..12174.150 rows=46764 loops=1)
             Filter: ((random >= 1429076987) AND (random < 1439076987))
     Total runtime: 12334.156 ms
    (4 rows)

Obviously in this case sequential scan was (would have been) a huge
win.  Incrementing random_page_cost from 4 (the default) to 5 causes
the planner to make a better decision.

-jwb
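A back-of-envelope version of the trade-off the planner is weighing
here, with seq_page_cost fixed at 1.0 and a worst case of one random
page fetch per matching tuple.  These are simplifying assumptions, not
the real cost equations:

    #include <stdio.h>

    int main(void)
    {
        double rel_pages = 131072;          /* ~1 GB of 8 kB pages */
        double random_page_cost = 4.0;
        double seq_cost = rel_pages * 1.0;  /* read every page once */

        for (double tuples = 1000; tuples <= 64000; tuples *= 2) {
            /* index scan worst case: one random page per tuple */
            double idx_cost = tuples * random_page_cost;
            printf("%7.0f tuples: index %8.0f vs seqscan %8.0f -> %s\n",
                   tuples, idx_cost, seq_cost,
                   idx_cost < seq_cost ? "index" : "seqscan");
        }
        return 0;
    }

Under these assumptions the crossover lands around 33,000 tuples;
raising random_page_cost pushes it lower, which matches the observed
effect of going from 4 to 5.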
Re: [HACKERS] bitmap scans, btree scans, and tid order
On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
> Jeffrey Baker [EMAIL PROTECTED] writes:
>> Change the planner/executor to use the bitmap scan in all cases
>> where index order is unimportant.  From my reading of the current
>> code, the bitmap scan is only used in case of a join.
>
> This is a fallacy, and I think your concern is largely mistaken.
> Have you experimented with the cases you are worried about?

Perhaps I have not stated the problem clearly.  Believe me, I have
experimented extensively with this problem.  So here's the statement
of the problem: during an index scan, the heap is visited in index
order, which can be and frequently is random order on the disk.  An
index scan that currently takes 15 minutes on my database takes only 5
when the tuples are fetched in strict disk order, and takes 8 minutes
if the tuples are fetched in disk order in groups of 1000.  The
problem exists when the scan is expected to return a lot of tuples,
but the planner chooses an index scan anyway.  In many cases,
sequential scan would be faster than index scan + random heap visits.

> It's entirely possible that the current cost model needs adjustment
> to make the planner pick the bitmap scan in more cases.  However, it
> is easy to demonstrate (either by thought-experiment or actual trial)
> that a bitmap scan isn't necessarily a win.

I agree.  There's certainly a threshold below which the bitmap would
not make sense.  It's also possible that changing the btree scan to
work in groups of tuples instead of single tuples would make more
sense, which is why I ventured two different solutions to the problem.

> The overhead of maintaining the bitmap is far from zero, and you
> don't get anything unless the resulting pattern of heap page fetches
> is significantly cleaner than before.

Bitmap scans naturally come back in TID order.  I realize this isn't a
1:1 correspondence with disk order, but it's a good start.  I also
like the way the index scan and heap scan are decoupled in the bitmap
code.  It leaves room for more serious optimization of disk access,
which is relatively difficult in the synchronous, one-at-a-time btree
code.

> Therefore, a patch that eliminates cost-estimation in favor of trying
> to force one choice or the other is definitely not likely to be
> received favorably ...

That's not the idea.

-jwb
Re: [HACKERS] bitmap scans, btree scans, and tid order
On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote:
> Jeffrey W. Baker [EMAIL PROTECTED] writes:
>> It's also possible that changing the btree scan to work in groups of
>> tuples instead of single tuples would make more sense, which is why
>> I ventured two different solutions to the problem.
>
> My feeling is that that would add a lot of complexity for dubious
> win.  The reason it's dubious is that it would penalize some cases,
> for instance LIMIT-type queries where you aren't going to fetch many
> tuples anyway.  I think that at least for the time being (8.1 time
> frame) we should leave traditional indexscans alone and concentrate
> on being sure we are getting the most we can out of the new bitmap
> scan code.  Only after that dust has settled will we have hard facts
> with which to argue about whether compromise behaviors might be wins.

I agree.  I'll look at how my workload behaves with CVS code.  I
wasn't proposing this for 8.1 inclusion, and the TODO isn't marked for
8.1.

> I think the work that's most needed at the moment is to test the
> bitmap-scan cost model to see if it has much to do with reality...

Alright.

-jwb
[HACKERS] bitmap scans, btree scans, and tid order
About this time last year I was holding forth on the value of visiting
the heap in TID order, even when index scan tuples are randomly
ordered.  Today I decided to start working on the problem stated in
this TODO item:

    Fetch heap pages matching index entries in sequential order

    Rather than randomly accessing heap pages based on index entries,
    mark heap pages needing access in a bitmap and do the lookups in
    sequential order.  Another method would be to sort heap ctids
    matching the index before accessing the heap rows.

I see that Tom has already done the infrastructure work by adding
getmulti, but getmulti isn't used by nodeIndexscan.c, only
nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
in-memory bitmaps in 8.1, or will some scans still be executed the
usual way?  If the former, I'd be wasting time, but in the latter case
it would be worth it.

-jwb
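A sketch of the "sort heap ctids" half of that TODO: collect candidate
TIDs from the index, qsort them into physical (block, offset) order,
then visit the heap moving only forward.  The TID struct below is a
simplification of PostgreSQL's ItemPointerData, and the hard-coded
TIDs stand in for an index scan's output:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef struct { uint32_t block; uint16_t offset; } Tid;

    static int tid_cmp(const void *a, const void *b)
    {
        const Tid *x = a, *y = b;
        if (x->block != y->block)
            return x->block < y->block ? -1 : 1;
        return (int) x->offset - (int) y->offset;
    }

    int main(void)
    {
        Tid tids[] = {{917, 3}, {12, 1}, {530, 8}, {12, 9}};
        size_t n = sizeof tids / sizeof tids[0];

        qsort(tids, n, sizeof tids[0], tid_cmp);
        for (size_t i = 0; i < n; i++)   /* heap visits now move forward */
            printf("fetch block %u offset %u\n",
                   tids[i].block, tids[i].offset);
        return 0;
    }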
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
On Wed, 2004-05-19 at 12:55, Tom Lane wrote:
> Glen Parker [EMAIL PROTECTED] writes:
>> What am I missing?  Why is a performance bottleneck of this
>> magnitude not on the same list of priorities as PITR, replication,
>> and Win32?
>
> It's higher on my personal to-do list than most of those ;-).  But
> those things are getting done because there are other developers with
> other priorities.  I suspect also that the set of people competent to
> make this change is much smaller than the set of people able to work
> on the other points.  In my mind, most of the issue is in the planner
> (figuring out what to do with an unsorted-indexscan option) and
> relatively few people have wanted to touch the planner.
>
>> Here's one answer: If you had to sort every result set, even when an
>> index could have been used, overall performance would still improve
>> by a very large margin.  I'd bet money on it.
>
> For a counterexample I refer you to our standard solution for
> MAX-using-an-index:
>
>     SELECT ... FROM table ORDER BY foo DESC LIMIT 1;

Yes, that would stink with unordered index result.

> which would become truly spectacularly bad without an ordered index
> scan.  A more general point is that for any indexscan that returns
> only a small number of index entries (eg, any unique-key search)
> worrying about physical-order access will be wasted effort.  The best
> you could hope for is not to be significantly worse than the existing
> code in such cases, and I'm unconvinced you could achieve even that.

I don't think a TID-sorted index scan would be worse than the current
unsorted scan.  You will produce a list of 1, sort that, and fetch the
tuple.  In the current implementation, you only omit the sort, which
is basically free.

> I can assure you that any patch that completely removes the existing
> behavior will be rejected, because there are plenty of cases where
> it's the right thing.

Okay.

> The main thing that unordered indexscan could do for us is extend the
> usefulness of indexscan plans into relatively-poor-selectivity cases
> where we presently tend to drop back to seqscans.  There would still
> be a selectivity threshold above which you might as well use seqscan,
> but it ought to be higher than the fraction-of-a-percent that we
> currently see for indexscans.  What is unknown, and will be unknown
> until someone tries it, is just what range of selectivity this
> technique might win for.  I think such a range exists; I am not real
> certain that it is wide enough to justify a lot of effort in making
> the idea a reality.

I don't think selectivity is the main issue.  I think the absolute
size of the result is the issue.  The current implementation fairly
well guarantees one seek for each tuple of the result, which means
that the cost to access the result is something on the order of
10ms * rows.  So for any significant result that needs to be fetched
from the heap, the seek cost is going to overwhelm all other costs
quickly.  Even though my index is selective, retrieving only 200k rows
of a 240M-row table, the cost of the sort is justified.

-jwb
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
On Wed, 2004-05-19 at 07:58, Tom Lane wrote:
> Jeffrey W. Baker [EMAIL PROTECTED] writes:
>> We have noticed a way to make a major improvement in Pg's
>> performance on our workload, and I would like to get your thoughts
>> before I go off to work up a patch.
>
> For starters, read the previous discussions of this in the archives.

If you are referring to archives.postgresql.org, all I have to say is:

    An error occured! Can not connect to search daemon

And as far as I have seen, it's been like that for years.

> Two questions you should have answers to before starting to
> implement, rather than after, are how to deal with locking
> considerations and what will be the implications of giving up the
> property that indexscans deliver sorted output.

Are you saying that index scan results are sorted by something other
than the index key?  Because in my scheme they would still be sorted
by index key.

-jwb
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
On Wed, 2004-05-19 at 11:56, Tom Lane wrote:
> Jeffrey W. Baker [EMAIL PROTECTED] writes:
>> Are you saying that index scan results are sorted by something other
>> than the index key?  Because in my scheme they would still be sorted
>> by index key.
>
> Not unless you add yet another sort step after the fetch step, that
> is the idea becomes
>
>     1. read index to get candidate TIDs
>     2. sort TIDs into physical order
>     3. read heap in physical order, check row visibility
>     4. sort selected rows into index order
>
> That would start to look like an unpleasant amount of overhead, since
> the sort would have to be over the entire output; you couldn't divide
> the scan into reasonable-size batches, whereas steps 1-3 can easily
> be run in batched style.  Moreover, this'd not be any win compared to
> just dropping the assumption of sorted output; the planner could
> stick in the same sort for itself if it decided the cost was worth
> it.
>
> What you probably want to do is support both our traditional
> indexscan style and an "unsorted indexscan" plan type that delivers
> unsorted output (implementing steps 1-3 above).  With appropriate
> cost estimation the planner could make an intelligent choice between
> the two.
>
> I'm too lazy to look in the archives right now, but my private notes
> about this say:

Thanks for the notes.  Introducing a new query execution step sounds
like a better/easier idea than what I was going to do, which was to
rework some of the access methods to operate on vectors instead of
scalars.  That idea looks increasingly difficult to implement.

> We could improve indexscan performance if we were to read a bunch of
> TIDs from the index, sort in page number order, and access the heap
> tuples in page number order.  But probably need a batch size in the
> thousands of TIDs to get any really useful effect.

Depends on the query, but I got an improvement by half from batches of
400.

> This would not work very well given the present assumption that we
> hold a pin on the index page until we have fetched the tuple (cf lazy
> VACUUM).  Pinning lots of index pages is no good for concurrency or
> buffer usage.  QUESTION: is that assumption really necessary?  Can't
> we just make the code ignore TIDs that lead to deleted tuples?  Worst
> case is that we fetch a newly-inserted tuple that does not actually
> have the expected index value, but won't we ignore it because of MVCC
> tqual rules?  (This does NOT work for dirty read or SnapshotNow
> cases, but we could probably set things up to only try this
> optimization for standard snapshot reads.)
>
> An even nicer property is that we could loop inside the index AM to
> scoop up tuples, saving locking overhead.  This might be worth doing
> even when we want to keep returning the tuples in index order.

I think this is *definitely* worth doing.  The current implementation
in the vicinity of index_getnext wastes valuable opportunities to
optimize the I/O and/or to let the kernel do a better job.  Descending
into index_getnext -> heap_getnext for every tuple looks expensive in
my application.

Thanks again for your notes.  I'll revisit the source code and see if
I can come up with a plan.

-jwb

PS: Regarding the archive, I've never seen it work.  I use MARC
instead.
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
On Wed, 2004-05-19 at 13:10, Tom Lane wrote:
> Glen Parker [EMAIL PROTECTED] writes:
>>> Not unless you add yet another sort step after the fetch step, that
>>> is the idea becomes
>>>     1. read index to get candidate TIDs
>>>     2. sort TIDs into physical order
>>>     3. read heap in physical order, check row visibility
>>>     4. sort selected rows into index order
>>
>> Or:
>>     2. Sort AND COPY TID's into physical order.
>>     3. Read heap in phy. order, match results to un-sorted TID list.
>>
>> That sounds quite cheap.
>
> No, it sounds like handwaving.  What's your "match results" step, and
> does it take less than O(N^2) time?  How do you get to *return* the
> tuples in index order, without having read them all before you return
> the first one?  (Which is really what the problem is with the sort
> alternative, at least for fast-start cases such as the LIMIT 1
> example.)  What happens when you run out of memory for the list of
> TIDs ... er, make that two lists of TIDs?

No doubt, you can't just naively create giant vectors of TIDs and
expect to sort them.

Is there any concept in Pg of an unrealized result?  If you scanned an
index, building up a result set that was totally unrealized except for
the TID and the index columns, you could cheaply join two such results
without ever touching the heap.  You could also use the existing Sort
execution step to sort such a result.  Then don't touch the heap until
something accesses a non-index column, or because you are returning
the result somewhere and need to satisfy MVCC visibility limits.

This would lay down the foundation needed by your TIDExpand, and could
also be a useful concept in other areas.

I acknowledge my own significant handwaving on this subject :)  From
your point of view everyone is going to be handwaving, because we
don't have your depth of experience with this code.

-jwb
[HACKERS] proposal: be smarter about i/o patterns in index scan
Greetings all,

We have noticed a way to make a major improvement in Pg's performance
on our workload, and I would like to get your thoughts before I go off
to work up a patch.

The basic problem is that Pg seeks far too much when performing an
index scan.  I have saved an strace of a backend which is selecting
170,000 rows from a 240,000,000-row table via index scan.  The strace
shows 134,000 seeks and reads, or almost one seek for every tuple in
the result set.  This would be fine normally except the seeks are in a
*very* suboptimal pattern, swinging back and forth across the device.
The query requires 16 minutes, 30 seconds on our test hardware.

The proposal is to sort the block requests before they are issued.
Because Pg only issues single seek/read pairs synchronously, the
kernel has no chance to apply its elevator, and hence every seek/read
Pg issues results in actual movement of the disk head.  Pg's random
pattern of seeking also makes the kernel's readahead fairly useless.

To prove the proposal I did a simulation, using the recorded seek
positions in the strace.  We noted that Pg issued 403 seek/read pairs
for every 8192 bytes read from the index.  So we performed four
simulations: a straight playback of Pg's seek pattern, a playback
where each list of 403 seeks is sorted by byte offset, a playback of
all the seeks sorted by offset, and a sequential scan of the 13GB
data file.

    PostgreSQL 7.4.2:  16m30s
    Sorted in chunks:  10m6s
    Sorted altogether: 4m19s
    Sequential scan:   6m7s

As you can see, there's a lot to be gained here.  If Pg was able to
issue its seeks in the optimal order, the query would return in 1/4th
the time.  Even the chunk-sorted scheme is a big win.

So the proposal is this: the offsets stored in the index ought to be
sorted.  Either A) they can be stored sorted in the first place
(sorted in VACUUM ANALYZE, or always sorted), or B) the backend can
sort each list of tuples it reads from the index, or C) the backend
can read the entire index result and sort that (this would be the
best).

From just paging through the code, it doesn't seem terribly hard.  B
seems the easiest to implement but is also the least improvement.
Even seqscan is better than B.  One improvement to B would be to read
much more than 8K from the index.  Reading e.g. 256KB would improve
things dramatically.  A might be easy but would either degrade over
time or cause slower inserts.  C is the best and hardest to implement
(I am guessing, because the size of the sorted index subset is
unbounded).

Your thoughts are appreciated.

-jwb
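A rough reconstruction of the replay experiment, for anyone who wants
to repeat it: feed it the strace byte offsets on stdin and compare
timings with and without sorting under time(1).  The command-line
interface and the 8 kB page size are assumptions:

    #define _XOPEN_SOURCE 600
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int cmp_off(const void *a, const void *b)
    {
        off_t x = *(const off_t *) a, y = *(const off_t *) b;
        return (x > y) - (x < y);
    }

    int main(int argc, char **argv)
    {
        off_t *offs = NULL;
        size_t n = 0, cap = 0;
        long long v;
        char page[8192];
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s datafile [--sorted] < offsets\n",
                    argv[0]);
            return 1;
        }
        if ((fd = open(argv[1], O_RDONLY)) < 0) {
            perror("open");
            return 1;
        }

        while (scanf("%lld", &v) == 1) {        /* load the seek list */
            if (n == cap) {
                cap = cap ? cap * 2 : 4096;
                offs = realloc(offs, cap * sizeof *offs);
                if (!offs) return 1;
            }
            offs[n++] = (off_t) v;
        }
        if (argc > 2)                           /* the proposed ordering */
            qsort(offs, n, sizeof *offs, cmp_off);

        for (size_t i = 0; i < n; i++)          /* unsorted input replays
                                                   the back-and-forth head
                                                   movement */
            if (pread(fd, page, sizeof page, offs[i]) < 0) {
                perror("pread");
                return 1;
            }
        close(fd);
        free(offs);
        return 0;
    }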
Re: [HACKERS] Schemas: status report, call for developers
On Tue, Apr 30, 2002 at 09:41:47PM +0300, Oleg Bartunov wrote:
> I think DBD::Pg driver very much depends on system tables.  Hope,
> Jeffrey (current maintainer) is online.  These changes may break
> DBD::Pg.

What is the expected time of this release?  I will review my code for
impact.

Thanks for the warning,
Jeffrey
Re: [HACKERS] Schemas: status report, call for developers
On Thu, May 02, 2002 at 05:28:36PM +0300, Oleg Bartunov wrote:
> On Wed, 1 May 2002, Jeffrey W. Baker wrote:
>> On Tue, Apr 30, 2002 at 09:41:47PM +0300, Oleg Bartunov wrote:
>>> I think DBD::Pg driver very much depends on system tables.  Hope,
>>> Jeffrey (current maintainer) is online.  These changes may break
>>> DBD::Pg.
>>
>> What is the expected time of this release?  I will review my code
>> for impact.
>
> Jeffrey, btw, DBD-Pg 1.13 doesn't pass all tests (Linux 2.4.17,
> pgsql 7.2.1, DBI-1.21):
>
>     t/02prepare......ok
>     t/03bind.........ok
>     t/04execute......FAILED tests 5-7
>         Failed 3/10 tests, 70.00% okay
>     t/05fetch........ok
>     t/06disconnect...ok

These tests were failing when I inherited the code.  I'll fix them
when I rewrite the parser.

-jwb