Re: [HACKERS] [PERFORM] A Better External Sort?
On Fri, Oct 07, 2005 at 09:20:59PM -0700, Luke Lonergan wrote:
On 10/7/05 5:17 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
On Fri, Oct 07, 2005 at 04:55:28PM -0700, Luke Lonergan wrote:
On 10/5/05 5:12 PM, Steinar H. Gunderson [EMAIL PROTECTED] wrote:

What? strlen is definitely not in the kernel, and thus won't count as system time.

System time on Linux includes time spent in glibc routines.

Do you have a reference for this? I believe this statement to be 100% false.

How about 99%? OK, you're right, I had this confused with the profiling problem where glibc routines aren't included in dynamic linked profiles.

Sorry to emphasize the 100%. It wasn't meant to judge you. It was meant to indicate that I believe 100% of system time is accounted for while a system call is actually active, which is not possible while glibc is active. I believe the way it works is that a periodic timer interrupt increments a specific integer every time it wakes up. If it finds itself within the kernel, it increments the system time for the active process; if it finds itself outside the kernel, it increments the user time for the active process.

Back to the statements earlier - the output of time had much of time for a dd spent in system, which means kernel, so where in the kernel would that be exactly?

Not really an expert here. I only play around. At a minimum, there is a cost to switching from user context to system context and back, and then filling in the zero bits. There may be other inefficiencies, however. Perhaps /dev/zero always fills in a whole block (8192 usually) before allowing the standard file system code to read only one byte. I dunno.
But, I see this oddity too:

  $ time dd if=/dev/zero of=/dev/zero bs=1 count=1000
  1000+0 records in
  1000+0 records out
  dd if=/dev/zero of=/dev/zero bs=1 count=1000 4.05s user 11.13s system 94% cpu 16.061 total

  $ time dd if=/dev/zero of=/dev/zero bs=10 count=100
  100+0 records in
  100+0 records out
  dd if=/dev/zero of=/dev/zero bs=10 count=100 0.37s user 1.37s system 100% cpu 1.738 total

From my numbers, it looks like 1 byte reads are hard in both the user context and the system context. It looks almost linear, even:

  $ time dd if=/dev/zero of=/dev/zero bs=100 count=10
  10+0 records in
  10+0 records out
  dd if=/dev/zero of=/dev/zero bs=100 count=10 0.04s user 0.15s system 95% cpu 0.199 total

  $ time dd if=/dev/zero of=/dev/zero bs=1000 count=1
  1+0 records in
  1+0 records out
  dd if=/dev/zero of=/dev/zero bs=1000 count=1 0.01s user 0.02s system 140% cpu 0.021 total

At least some of this gets into the very in-depth discussions as to whether kernel threads, or user threads, are more efficient. Depending on the application, user threads can switch many times faster than kernel threads. Other parts of this may just mean that /dev/zero isn't implemented optimally.

Cheers, mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED]
Neighbourhood Coder
Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them...
http://mark.mielke.cc/

---(end of broadcast)---
TIP 4: Have you searched our list archives? http://archives.postgresql.org
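For anyone who wants to see the user/system split from inside a process rather than through time(1), here is a minimal sketch. It is not from the thread; it assumes a POSIX system with getrusage(), and the names cpu_split, cpu_split_now and user_space_work are made up for illustration. The point it tries to show is the one argued above: a loop over strlen() never enters the kernel, so whatever it costs should land in user time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/time.h>

/* CPU time used so far by this process, split the way time(1) reports it.
 * (Hypothetical helper type, for illustration only.) */
typedef struct {
    double user_sec;   /* time spent executing in user space */
    double system_sec; /* time spent in the kernel on our behalf */
} cpu_split;

static double tv_to_sec(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Fills *out from getrusage(); returns 0 on success, -1 on failure. */
int cpu_split_now(cpu_split *out)
{
    struct rusage ru;

    assert(out != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->user_sec = tv_to_sec(ru.ru_utime);
    out->system_sec = tv_to_sec(ru.ru_stime);
    return 0;
}

/* Pure user-space work: strlen() is a libc routine and makes no system call.
 * Returns the sum of the lengths so the loop has an observable result. */
size_t user_space_work(size_t rounds)
{
    static char buf[4096];
    volatile size_t total = 0;
    size_t i;

    memset(buf, 'x', sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (i = 0; i < rounds; i++)
        total += strlen(buf);
    return total;
}
```

Calling cpu_split_now() before and after user_space_work() with a large round count, and subtracting, gives the same two numbers that time prints as "user" and "system". How the kernel samples them (timer tick or finer-grained accounting) is not something this sketch can tell you.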
Re: [HACKERS] [PERFORM] A Better External Sort?
Steinar, On 10/5/05 5:12 PM, Steinar H. Gunderson [EMAIL PROTECTED] wrote: What? strlen is definitely not in the kernel, and thus won't count as system time. System time on Linux includes time spent in glibc routines. - Luke
Re: [HACKERS] [PERFORM] A Better External Sort?
On Fri, Oct 07, 2005 at 04:55:28PM -0700, Luke Lonergan wrote: On 10/5/05 5:12 PM, Steinar H. Gunderson [EMAIL PROTECTED] wrote: What? strlen is definitely not in the kernel, and thus won't count as system time. System time on Linux includes time spent in glibc routines. Do you have a reference for this? I believe this statement to be 100% false. Cheers, mark
Re: [HACKERS] [PERFORM] A Better External Sort?
Mark, On 10/7/05 5:17 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: On Fri, Oct 07, 2005 at 04:55:28PM -0700, Luke Lonergan wrote: On 10/5/05 5:12 PM, Steinar H. Gunderson [EMAIL PROTECTED] wrote: What? strlen is definitely not in the kernel, and thus won't count as system time. System time on Linux includes time spent in glibc routines. Do you have a reference for this? I believe this statement to be 100% false. How about 99%? OK, you're right, I had this confused with the profiling problem where glibc routines aren't included in dynamic linked profiles. Back to the statements earlier - the output of time had much of time for a dd spent in system, which means kernel, so where in the kernel would that be exactly? - Luke
Re: [HACKERS] [PERFORM] A Better External Sort?
Martijn van Oosterhout kleptog@svana.org writes:

Indeed, one of the things on my list is to remove all the lseeks in favour of pread. Halving the number of kernel calls has got to be worth something right? Portability is an issue of course...

Being sure that it's not a pessimization is another issue. I note that glibc will emulate these functions if the kernel doesn't have them; which means you could be replacing one kernel call with three. And I don't think autoconf has any way to determine whether a libc function represents a native kernel call or not ...

regards, tom lane
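To make the lseek-versus-pread trade-off concrete, here is a small sketch of the two call patterns side by side. It assumes a POSIX system; the function names read_at_seek, read_at_pread and read_at_selfcheck are invented for this example and are not PostgreSQL routines. Whether pread() is a single native kernel call or a libc emulation depends on the platform, which is exactly the concern raised above, and this code cannot detect the difference.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two calls at the API level: move the file offset, then read from it. */
ssize_t read_at_seek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, len);
}

/* One call at the API level; the file offset is left untouched. */
ssize_t read_at_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}

/* Self-check: both paths must return the same bytes from the same offset.
 * Returns 0 on success, -1 on any failure. */
int read_at_selfcheck(void)
{
    static const char data[] = "0123456789abcdef";
    char a[4], b[4];
    int fd, ok;
    FILE *f = tmpfile();

    if (f == NULL)
        return -1;
    if (fwrite(data, 1, sizeof data - 1, f) != sizeof data - 1 || fflush(f) != 0) {
        fclose(f);
        return -1;
    }
    fd = fileno(f);
    ok = read_at_seek(fd, a, sizeof a, 10) == 4
        && read_at_pread(fd, b, sizeof b, 10) == 4
        && memcmp(a, b, sizeof a) == 0
        && memcmp(a, "abcd", sizeof a) == 0;
    fclose(f);
    return ok ? 0 : -1;
}
```

Running both variants under strace on the target platform is one way to count the kernel calls each one really makes.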
Re: [HACKERS] [PERFORM] A Better External Sort?
Martijn van Oosterhout kleptog@svana.org writes: Are we awfully worried about people still using 2.0 kernels? And it would replace two calls with three in the worst case, we currently lseek before every read. That's utterly false. regards, tom lane
Re: [HACKERS] [PERFORM] A Better External Sort?
On Thu, Oct 06, 2005 at 03:57:38PM -0400, Tom Lane wrote:
Martijn van Oosterhout kleptog@svana.org writes:

Indeed, one of the things on my list is to remove all the lseeks in favour of pread. Halving the number of kernel calls has got to be worth something right? Portability is an issue of course...

Being sure that it's not a pessimization is another issue. I note that glibc will emulate these functions if the kernel doesn't have them; which means you could be replacing one kernel call with three. And I don't think autoconf has any way to determine whether a libc function represents a native kernel call or not ...

The problem kernels would be Linux 2.0, which I very much doubt is going to be present in to-be-deployed database servers. Unless someone runs glibc on top of some other kernel, I guess. Is this a common scenario? I've never seen it.

-- 
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34
Oh, oh, the Galacian girls will do it for pearls, and those of Arrakis for water! But if you seek ladies who burn like flames, try a daughter of Caladan! (Gurney Halleck)
Re: [HACKERS] [PERFORM] A Better External Sort?
On Mon, 2005-10-03 at 14:16 -0700, Josh Berkus wrote:

Jeff,

Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A Big-Name Proprietary Database doesn't get much more than that either.

I find this claim very suspicious. I get single-threaded reads in excess of 1GB/sec with XFS and 250MB/sec with ext3.

Database reads? Or raw FS reads? It's not the same thing.

Just FYI, I ran a count(*) on a 15.6GB table on a lightly loaded db and it ran in 163 sec. (Dual opteron 2.6GHz, 6GB RAM, 6 x 74GB 15k disks in RAID10, reiserfs). A little less than 100MB/sec.

After this I ran count(*) over a 2.4GB file from another tablespace on another device (4x142GB 10k disks in RAID10) and it ran in 22.5 sec on the first run and 12.5 on the second.

  db=# show shared_buffers ;
   shared_buffers
   196608
  (1 row)

  db=# select version();
   version
   PostgreSQL 8.0.3 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 3.3.6 (Debian 1:3.3.6-7)
  (1 row)

-- 
Hannu Krosing [EMAIL PROTECTED]
Re: [HACKERS] [PERFORM] A Better External Sort?
On 10/3/05, Ron Peacetree [EMAIL PROTECTED] wrote: [snip] Just how bad is this CPU bound condition? How powerful a CPU is needed to attain a DB IO rate of 25MBps? If we replace said CPU with one 2x, 10x, etc faster than that, do we see any performance increase? If a modest CPU can drive a DB IO rate of 25MBps, but that rate does not go up regardless of how much extra CPU we throw at it... Single threaded was mentioned. Plus even if it's purely cpu bound, it's seldom as trivial as throwing CPU at it, consider the locking in both the application, in the filesystem, and elsewhere in the kernel. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PERFORM] A Better External Sort?
-Original Message-
From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of PFC
Sent: Thursday, September 29, 2005 9:10 AM
To: [EMAIL PROTECTED]
Cc: Pg Hackers; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Just to add a little anarchy in your nice debate... Who really needs all the results of a sort on your terabyte table ?

Reports with ORDER BY/GROUP BY, and many other possibilities. 40% of mainframe CPU cycles are spent sorting. That is because huge volumes of data require lots of energy to be meaningfully categorized. Let's suppose that instead of a terabyte of data (or a petabyte or whatever) we have 10% of it. That's still a lot of data.

I guess not many people do a SELECT from such a table and want all the results.

What happens when they do? The cases where it is already fast are not very important. The cases where things go into the crapper are the ones that need attention.

So, this leaves :
- Really wanting all the results, to fetch using a cursor,
- CLUSTER type things, where you really want everything in order,
- Aggregates (Sort-GroupAggregate), which might really need to sort the whole table.
- Complex queries where the whole dataset needs to be examined, in order to return a few values
- Joins (again, the whole table is probably not going to be selected)
- And the ones I forgot.

However, Most likely you only want to SELECT N rows, in some ordering :
- the first N (ORDER BY x LIMIT N)
- last N (ORDER BY x DESC LIMIT N)

For these, the QuickSelect algorithm is what is wanted. For example:

    #include <stdlib.h>

    typedef double Etype;

    extern Etype RandomSelect(Etype * A, size_t p, size_t r, size_t i);
    extern size_t RandRange(size_t a, size_t b);
    extern size_t RandomPartition(Etype * A, size_t p, size_t r);
    extern size_t Partition(Etype * A, size_t p, size_t r);

    /*
    **
    ** In the following code, every reference to CLR means:
    **
    **    Introduction to Algorithms
    **    By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
    **    ISBN 0-07-013143-0
    */

    /*
    ** CLR, page 187
    ** Returns the i-th smallest element (i counts from 1) of A[p..r].
    */
    Etype RandomSelect(Etype A[], size_t p, size_t r, size_t i)
    {
        size_t q, k;
        if (p == r)
            return A[p];
        q = RandomPartition(A, p, r);
        k = q - p + 1;
        if (i <= k)
            return RandomSelect(A, p, q, i);
        else
            return RandomSelect(A, q + 1, r, i - k);
    }

    /* Returns a pseudo-random index in the half-open range [a, b). */
    size_t RandRange(size_t a, size_t b)
    {
        size_t c = (size_t) ((double) rand() / ((double) RAND_MAX + 1) * (b - a));
        return c + a;
    }

    /*
    ** CLR, page 162
    */
    size_t RandomPartition(Etype A[], size_t p, size_t r)
    {
        size_t i = RandRange(p, r);
        Etype Temp;
        Temp = A[p];
        A[p] = A[i];
        A[i] = Temp;
        return Partition(A, p, r);
    }

    /*
    ** CLR, page 154
    ** i starts at p - 1, which wraps around when p == 0; the first i++
    ** brings it back to p, so the unsigned arithmetic is well defined.
    */
    size_t Partition(Etype A[], size_t p, size_t r)
    {
        Etype x, temp;
        size_t i, j;
        x = A[p];
        i = p - 1;
        j = r + 1;
        for (;;) {
            do {
                j--;
            } while (!(A[j] <= x));
            do {
                i++;
            } while (!(A[i] >= x));
            if (i < j) {
                temp = A[i];
                A[i] = A[j];
                A[j] = temp;
            } else
                return j;
        }
    }

- WHERE x>value ORDER BY x LIMIT N
- WHERE x<value ORDER BY x DESC LIMIT N
- and other variants

Or, you are doing a Merge JOIN against some other table ; in that case, yes, you might need the whole sorted terabyte table, but most likely there are WHERE clauses in the query that restrict the set, and thus, maybe we can get some conditions or limit values on the column to sort.

Where clause filters are to be applied AFTER the join operations, according to the SQL standard.

Also the new, optimized hash join, which is more memory efficient, might cover this case.

For == joins. Not every order by is applied to joins. And not every join is an equal join.

Point is, sometimes, you only need part of the results of your sort. And the bigger the sort, the more likely it becomes that you only want part of the results.

That is an assumption that will sometimes be true, and sometimes not. It is not possible to predict usage patterns for a general purpose database system.

So, while we're in the fun hand-waving, new algorithm trying mode, why not consider this right from the start ?
(I know I'm totally in hand-waving mode right now, so slap me if needed). I'd say your new, fancy sort algorithm needs a few more input values : - Range of values that must appear in the final result of the sort : none, minimum, maximum, both, or even a set of values from the other side of the join, hashed, or sorted. That will already happen (or it certainly
Re: [HACKERS] [PERFORM] A Better External Sort?
-Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Friday, September 30, 2005 11:02 PM To: Jeffrey W. Baker Cc: Luke Lonergan; Josh Berkus; Ron Peacetree; pgsql- [EMAIL PROTECTED]; pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Jeffrey W. Baker [EMAIL PROTECTED] writes: I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. I had more or less despaired of this thread yielding any usable ideas :-( but I think you have one here. I believe I made the exact same suggestion several days ago. The reason the current code uses a six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first edition) shows that there's not much incremental gain from using more tapes ... if you are in the regime where number of runs is much greater than number of tape drives. But if you can stay in the regime where only one merge pass is needed, that is obviously a win. I don't believe we can simply legislate that there be only one merge pass. That would mean that, if we end up with N runs after the initial run-forming phase, we need to fit N tuples in memory --- no matter how large N is, or how small work_mem is. But it seems like a good idea to try to use an N-way merge where N is as large as work_mem will allow. We'd not have to decide on the value of N until after we've completed the run-forming phase, at which time we've already seen every tuple once, and so we can compute a safe value for N as work_mem divided by largest_tuple_size. (Tape I/O buffers would have to be counted too of course.) You only need to hold the sort column(s) in memory, except for the queue you are exhausting at the time. 
[And of those columns, only the values for the smallest one in a sub-list.] Of course, the more data from each list that you can hold at once, the fewer the disk reads and seeks.

Another idea (not sure if it is pertinent): Instead of having a fixed size for the sort buffers, size it to the query. Given a total pool of size M, give a percentage according to the difficulty of the work to perform. So a query with 3 small columns and a cardinality of 1000 gets a small percentage and a query with 10 GB of data gets a big percentage of available sort mem.

It's been a good while since I looked at the sort code, and so I don't recall if there are any fundamental reasons for having a compile-time-constant value of the merge order rather than choosing it at runtime. My guess is that any inefficiencies added by making it variable would be well repaid by the potential savings in I/O.

regards, tom lane
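The single-pass heap merge discussed above can be sketched as follows. This is an illustration only, not the tuplesort.c code: the runs are plain in-memory int arrays standing in for tapes, and run_t, merge_runs and max_merge_order are names invented here. The max_merge_order() formula, in particular, is an assumption about how tape I/O buffers might be charged against work_mem (one buffer plus one largest tuple per run); the message only says the buffers "would have to be counted too".

```c
#include <stddef.h>

/* One sorted run, standing in for a tape (hypothetical type). */
typedef struct {
    const int *data; /* sorted keys of this run */
    size_t len;      /* number of keys in the run */
    size_t pos;      /* index of the next unread key */
} run_t;

/* Current head key of run number idx. */
static int head(const run_t *runs, size_t idx)
{
    return runs[idx].data[runs[idx].pos];
}

/* Restore the min-heap property below slot i. heap[] holds run numbers,
 * ordered by each run's current head key. */
static void sift_down(size_t *heap, size_t n, size_t i, const run_t *runs)
{
    for (;;) {
        size_t left = 2 * i + 1, right = left + 1, min = i, tmp;

        if (left < n && head(runs, heap[left]) < head(runs, heap[min]))
            min = left;
        if (right < n && head(runs, heap[right]) < head(runs, heap[min]))
            min = right;
        if (min == i)
            return;
        tmp = heap[i];
        heap[i] = heap[min];
        heap[min] = tmp;
        i = min;
    }
}

/* Merge all runs into out[] in one pass. heap[] needs room for nruns
 * entries and out[] for the total number of keys. Each output key costs
 * O(log M) comparisons for M runs. Returns the number of keys written. */
size_t merge_runs(run_t *runs, size_t nruns, size_t *heap, int *out)
{
    size_t n = 0, written = 0, i;

    for (i = 0; i < nruns; i++) {
        runs[i].pos = 0;
        if (runs[i].len > 0)
            heap[n++] = i;
    }
    for (i = n / 2; i-- > 0;)
        sift_down(heap, n, i, runs);

    while (n > 0) {
        run_t *r = &runs[heap[0]];

        out[written++] = r->data[r->pos++];
        if (r->pos == r->len)
            heap[0] = heap[--n]; /* run exhausted: drop it from the heap */
        sift_down(heap, n, 0, runs);
    }
    return written;
}

/* ASSUMPTION: each run in the merge needs one tape buffer plus room for
 * one largest tuple. Returns how many runs fit in work_mem. */
size_t max_merge_order(size_t work_mem, size_t largest_tuple, size_t tape_buffer)
{
    size_t per_run = largest_tuple + tape_buffer;

    return per_run == 0 ? 0 : work_mem / per_run;
}
```

If the number of runs left after run formation exceeds what max_merge_order() allows, more than one merge pass is still needed, which is the caveat about not being able to legislate a single pass.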
Re: [HACKERS] [PERFORM] A Better External Sort?
    ; do {
        {
            (parent) = (i);
            (temp) = (array)[(parent)];
            (child) = (parent) * 2;
            while ((nmemb) > (child)) {
                if ((*((array) + (child) + 1) > *((array) + (child)))) {
                    ++(child);
                }
                if ((*((array) + (child)) > *(&(temp)))) {
                    (array)[(parent)] = (array)[(child)];
                    (parent) = (child);
                    (child) *= 2;
                } else {
                    --(child);
                    break;
                }
            }
            if ((nmemb) == (child) && (*((array) + (child)) > *(&(temp)))) {
                (array)[(parent)] = (array)[(child)];
                (parent) = (child);
            }
            (array)[(parent)] = (temp);
        }
    } while (i--);
    ((void) ((temp) = *(array), *(array) = *(array + nmemb), *(array + nmemb) = (temp)));
    for (--nmemb; nmemb; --nmemb) {
        {
            (parent) = (0);
            (temp) = (array)[(parent)];
            (child) = (parent) * 2;
            while ((nmemb) > (child)) {
                if ((*((array) + (child) + 1) > *((array) + (child)))) {
                    ++(child);
                }
                if ((*((array) + (child)) > *(&(temp)))) {
                    (array)[(parent)] = (array)[(child)];
                    (parent) = (child);
                    (child) *= 2;
                } else {
                    --(child);
                    break;
                }
            }
            if ((nmemb) == (child) && (*((array) + (child)) > *(&(temp)))) {
                (array)[(parent)] = (array)[(child)];
                (parent) = (child);
            }
            (array)[(parent)] = (temp);
        }
        ((void) ((temp) = *(array), *(array) = *(array + nmemb), *(array + nmemb) = (temp)));
    }
    }
    }

    //
    // We use this to check to see if a partition is already sorted.
    //
    template <class e_type>
    int sorted(e_type * array, size_t nmemb)
    {
        for (--nmemb; nmemb; --nmemb) {
            if ((*(array) > *(array + 1))) {
                return 0;
            }
            ++array;
        }
        return 1;
    }

    //
    // We use this to check to see if a partition is already reverse-sorted.
    //
    template <class e_type>
    int rev_sorted(e_type * array, size_t nmemb)
    {
        for (--nmemb; nmemb; --nmemb) {
            if ((*(array + 1) > *(array))) {
                return 0;
            }
            ++array;
        }
        return 1;
    }

    //
    // We use this to reverse a reverse-sorted partition.
    //
    template <class e_type>
    void rev_array(e_type * array, size_t nmemb)
    {
        e_type temp, *end;
        for (end = array + nmemb - 1; end > array; ++array) {
            ((void) ((temp) = *(array), *(array) = *(end), *(end) = (temp)));
            --end;
        }
    }

    //
    // Introspective quick sort algorithm user entry point.
    // You do not need to directly call any other sorting template.
    // This sort will perform very well under all circumstances.
    //
    template <class e_type>
    void iqsort(e_type * array, size_t nmemb)
    {
        size_t d, n;
        if (nmemb > 1 && !sorted(array, nmemb)) {
            if (!rev_sorted(array, nmemb)) {
                n = nmemb / 4;
                d = 2;
                while (n) {
                    ++d;
                    n /= 2;
                }
                qloop(array, nmemb, 2 * d);
            } else {
                rev_array(array, nmemb);
            }
        }
    }

-Original Message-
From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Jignesh K. Shah
Sent: Friday, September 30, 2005 1:38 PM
To: Ron Peacetree
Cc: Josh Berkus; pgsql-hackers@postgresql.org; pgsql- [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I have seen similar performance as Josh and my reasoning is as follows:

* WAL is the biggest bottleneck with its default size of 16MB. Many people hate to recompile the code to change its default, and increasing checkpoint segments helps, but there is still a lot of overhead in the rotation of WAL files (even putting WAL on tmpfs shows that it is still slow). Having an option for a bigger size is helpful to a small extent percentagewise (and frees up CPU a bit in doing file rotation).

* Growing files: Even though this is OS dependent, it does spend a lot of time doing small 8K block increases to grow files. If we can signal bigger chunks to grow, or pre-grow to the expected size of data files, that will help a lot in such cases.

* COPY command had restrictions but that has been fixed to a large extent. (Great job)

But of course I have lost touch with programming and can't begin to understand PostgreSQL code to change it myself.

Regards, Jignesh

Ron Peacetree wrote: That 11MBps was your =bulk load= speed
Re: [HACKERS] [PERFORM] A Better External Sort?
Judy definitely rates a WOW!! -Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Gregory Maxwell Sent: Friday, September 30, 2005 7:07 PM To: Ron Peacetree Cc: Jeffrey W. Baker; pgsql-hackers@postgresql.org; pgsql- [EMAIL PROTECTED] Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On 9/28/05, Ron Peacetree [EMAIL PROTECTED] wrote: 2= We use my method to sort two different tables. We now have these very efficient representations of a specific ordering on these tables. A join operation can now be done using these Btrees rather than the original data tables that involves less overhead than many current methods. If we want to make joins very fast we should implement them using RD trees. For the example cases where a join against a very large table will produce a much smaller output, a RD tree will provide pretty much the optimal behavior at a very low memory cost. On the subject of high speed tree code for in-core applications, you should check out http://judy.sourceforge.net/ . The performance (insert, remove, lookup, AND storage) is really quite impressive. Producing cache friendly code is harder than one might expect, and it appears the judy library has already done a lot of the hard work. Though it is *L*GPLed, so perhaps that might scare some here away from it. :) and good luck directly doing joins with a LC-TRIE. ;) ---(end of broadcast)--- TIP 6: explain analyze is your friend ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PERFORM] A Better External Sort?
I see the following routines that seem to be related to sorting. If I were to examine these routines to consider ways to improve it, what routines should I key in on? I am guessing that tuplesort.c is the hub of activity for database sorting.

Directory of U:\postgresql-snapshot\src\backend\access\nbtree
08/11/2005  06:22 AM    24,968 nbtsort.c
1 File(s) 24,968 bytes

Directory of U:\postgresql-snapshot\src\backend\executor
03/16/2005  01:38 PM     7,418 nodeSort.c
1 File(s) 7,418 bytes

Directory of U:\postgresql-snapshot\src\backend\utils\sort
09/23/2005  08:36 AM    67,585 tuplesort.c
1 File(s) 67,585 bytes

Directory of U:\postgresql-snapshot\src\bin\pg_dump
06/29/2005  08:03 PM    31,620 pg_dump_sort.c
1 File(s) 31,620 bytes

Directory of U:\postgresql-snapshot\src\port
07/27/2005  09:03 PM     5,077 qsort.c
1 File(s) 5,077 bytes
Re: [HACKERS] [PERFORM] A Better External Sort?
On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote:

Assuming we get the abysmal physical IO performance fixed... (because until we do, _nothing_ is going to help us as much)

I'm still not convinced this is the major problem. For example, in my totally unscientific tests on an oldish machine I have here:

Direct filesystem copy to /dev/null
21MB/s  10% user 50% system (dual cpu, so the system is using a whole CPU)

COPY TO /dev/null WITH binary
13MB/s  55% user 45% system (ergo, CPU bound)

COPY TO /dev/null
4.4MB/s  60% user 40% system

\copy to /dev/null in psql
6.5MB/s  60% user 40% system

This machine is a bit of a strange setup, not sure why fs copy is so slow. As to why \copy is faster than COPY, I have no idea, but it is repeatable. And actually turning the tuples into a printable format is the most expensive. But it does point out that the whole process is probably CPU bound more than anything else.

So, I don't think physical I/O is the problem. It's something further up the call tree. I wouldn't be surprised at all if it had to do with the creation and destruction of tuples. The cost of comparing tuples should not be underestimated.

-- 
Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
Re: [HACKERS] [PERFORM] A Better External Sort?
On 9/28/05, Ron Peacetree [EMAIL PROTECTED] wrote: 2= We use my method to sort two different tables. We now have these very efficient representations of a specific ordering on these tables. A join operation can now be done using these Btrees rather than the original data tables that involves less overhead than many current methods. If we want to make joins very fast we should implement them using RD trees. For the example cases where a join against a very large table will produce a much smaller output, a RD tree will provide pretty much the optimal behavior at a very low memory cost. On the subject of high speed tree code for in-core applications, you should check out http://judy.sourceforge.net/ . The performance (insert, remove, lookup, AND storage) is really quite impressive. Producing cache friendly code is harder than one might expect, and it appears the judy library has already done a lot of the hard work. Though it is *L*GPLed, so perhaps that might scare some here away from it. :) and good luck directly doing joins with a LC-TRIE. ;) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron Peacetree wrote: The good news is all this means it's easy to demonstrate that we can improve the performance of our sorting functionality. Assuming we get the abysmal physical IO performance fixed... (because until we do, _nothing_ is going to help us as much) I for one would be paying more attention if such a demonstration were forthcoming, in the form of a viable patch and some benchmark results. cheers andrew
Re: [HACKERS] [PERFORM] A Better External Sort?
On 9/30/05, Ron Peacetree [EMAIL PROTECTED] wrote:

4= I'm sure we are paying all sorts of nasty overhead for essentially emulating the pg filesystem inside another filesystem. That means ~2x as much overhead to access a particular piece of data. The simplest solution is for us to implement a new VFS compatible filesystem tuned to exactly our needs: pgfs. We may be able to avoid that by some amount of hacking or modifying of the current FSs we use, but I suspect it would be more work for less ROI.

On this point, Reiser4 fs already implements a number of things which would be desirable for PostgreSQL. For example: write()s to reiser4 filesystems are atomic, so there is no risk of torn pages (this is enabled because reiser4 uses WAFL-like logging where data is not overwritten but rather relocated). The filesystem is modular and extensible so it should be easy to add whatever additional semantics are needed. I would imagine that all that would be needed is some more atomicity operations (single writes are already atomic, but I'm sure it would be useful to batch many writes into a transaction), some layout and packing controls, and some flush controls. A step further would perhaps integrate multiversioning directly into the FS (the wandering logging system provides the write side of multiversioning, a little read side work would be required.). More importantly: the file system was intended to be extensible for this sort of application.

It might make a good 'summer of code' project for someone next year, ... presumably reiser4 will have made it into the mainline kernel by then. :)
Re: [HACKERS] [PERFORM] A Better External Sort?
On Thu, Sep 29, 2005 at 10:06:52AM -0700, Luke Lonergan wrote:

Josh,

On 9/29/05 9:54 AM, Josh Berkus josh@agliodbs.com wrote:

Following an index creation, we see that 95% of the time required is the external sort, which averages 2mb/s. This is with separate drives for the WAL, the pg_tmp, the table and the index. I've confirmed that increasing work_mem beyond a small minimum (around 128mb) had no benefit on the overall index creation speed.

Yp! That about sums it up - regardless of taking 1 or 2 passes through the heap being sorted, 1.5 - 2 MB/s is the wrong number. This is not necessarily an algorithmic problem, but is an optimization problem with Postgres that must be fixed before it can be competitive. We read/write to/from disk at 240MB/s and so 2 passes would run at a net rate of 120MB/s through the sort set if it were that efficient. Anyone interested in tackling the real performance issue? (flame bait, but for a worthy cause :-)

I'm not sure that it's flamebait, but what do I know? Apart from the nasty number (1.5-2 MB/s), what other observations do you have to hand? Any ideas about what things are not performing here? Parts of the code that could bear extra scrutiny? Ideas on how to fix same in a cross-platform way?

Cheers, D

-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778
Remember to vote!
Re: [HACKERS] [PERFORM] A Better External Sort?
In my original example, a sequential scan of the 1TB of 2KB or 4KB records, = 250M or 500M records of data, being sorted on a binary value key will take ~1000x more time than reading in the ~1GB Btree I described that used a Key+RID (plus node pointers) representation of the data. Imho you seem to ignore the final step your algorithm needs of collecting the data rows. After you sorted the keys the collect step will effectively access the tuples in random order (given a sufficiently large key range). This random access is bad. It effectively allows a competing algorithm to read the whole data at least 40 times sequentially, or write the set 20 times sequentially. (Those are the random/sequential ratios of modern discs) Andreas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
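The arithmetic behind the "40 times sequentially" remark can be written down as a small sketch. The function names are invented for this example, and the model is deliberately crude: it assumes every randomly fetched page costs a fixed multiple of a sequential page read (the message's figure of 40 is passed in as a parameter, not measured), and it ignores caching and any clustering of the keys.

```c
/* Cost of fetching `pages` pages in random order, expressed in units of
 * one sequential page read. `random_to_seq_ratio` is an ASSUMED constant
 * such as the 40 quoted in the message above. */
double random_fetch_cost(double pages, double random_to_seq_ratio)
{
    return pages * random_to_seq_ratio;
}

/* Number of full sequential passes over a table of `total_pages` pages
 * that cost the same as fetching `fetched_pages` of them randomly.
 * total_pages must be greater than zero. */
double equivalent_sequential_passes(double fetched_pages, double total_pages,
                                    double random_to_seq_ratio)
{
    return random_fetch_cost(fetched_pages, random_to_seq_ratio) / total_pages;
}
```

Under this model, collecting every page of the table in key order costs as much as 40 sequential passes, while the random collect step only breaks even with a single sequential pass when it touches 1/40 of the pages.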
Re: [HACKERS] [PERFORM] A Better External Sort?
On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:

COPY TO /dev/null WITH binary
13MB/s  55% user 45% system (ergo, CPU bound)
[snip]
the most expensive. But it does point out that the whole process is probably CPU bound more than anything else.

Note that 45% of that cpu usage is system--which is where IO overhead would end up being counted. Until you profile where your system time is going it's premature to say it isn't an IO problem.

Mike Stone
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, Oct 04, 2005 at 12:43:10AM +0300, Hannu Krosing wrote: Just FYI, I run a count(*) on a 15.6GB table on a lightly loaded db and it run in 163 sec. (Dual opteron 2.6GHz, 6GB RAM, 6 x 74GB 15k disks in RAID10, reiserfs). A little less than 100MB sec. And none of that 15G table is in the 6G RAM? Mike Stone ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PERFORM] A Better External Sort?
Nope - it would be disk wait. COPY is CPU bound on I/O subsystems faster than 50 MB/s on COPY (in) and about 15 MB/s (out).

- Luke

-Original Message-
From: Michael Stone [mailto:[EMAIL PROTECTED]
Sent: Wed Oct 05 09:58:41 2005
To: Martijn van Oosterhout
Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:

COPY TO /dev/null WITH binary
13MB/s  55% user 45% system (ergo, CPU bound)
[snip]
the most expensive. But it does point out that the whole process is probably CPU bound more than anything else.

Note that 45% of that cpu usage is system--which is where IO overhead would end up being counted. Until you profile where your system time is going it's premature to say it isn't an IO problem.

Mike Stone
Re: [HACKERS] [PERFORM] A Better External Sort?
On Wed, Oct 05, 2005 at 11:24:07AM -0400, Luke Lonergan wrote: Nope - it would be disk wait. I said I/O overhead; i.e., it could be the overhead of calling the kernel for I/O's. E.g., the following process is having I/O problems:
time dd if=/dev/sdc of=/dev/null bs=1 count=1000
1000+0 records in
1000+0 records out
1000 bytes transferred in 8.887845 seconds (1125132 bytes/sec)
real 0m8.889s
user 0m0.877s
sys 0m8.010s
it's not in disk wait state (in fact the whole read was cached) but it's only getting 1MB/s. Mike Stone
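The effect Mike describes (time going to per-call kernel overhead rather than to disk wait) can be reproduced without dd. The following is only a rough sketch in Python, not anything taken from PostgreSQL: it reads a cached temporary file instead of /dev/sdc, and the names read_in_blocks and demo are made up for illustration. The number of read() calls is the quantity to watch; the timings will vary by machine.

```python
import os
import tempfile
import time


def read_in_blocks(path, block_size):
    """Read a file with unbuffered os.read() calls of block_size bytes.

    Returns (bytes_read, read_calls, elapsed_seconds). Each os.read() is one
    system call, so a smaller block_size means more user/kernel transitions
    for the same amount of data.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        calls = 0
        start = time.perf_counter()
        while True:
            chunk = os.read(fd, block_size)
            calls += 1
            if not chunk:  # an empty result means end of file
                break
            total += len(chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total, calls, elapsed


def demo(size=200_000, block_sizes=(1, 100, 8192)):
    """Write `size` zero bytes to a temp file and read it back per block size."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * size)
        path = f.name
    try:
        return {bs: read_in_blocks(path, bs) for bs in block_sizes}
    finally:
        os.unlink(path)


if __name__ == "__main__":
    for bs, (total, calls, elapsed) in demo().items():
        print(f"bs={bs:5d}  bytes={total}  read calls={calls}  {elapsed:.4f}s")
```

The file is small enough to stay in the page cache, so any slowdown at bs=1 comes from the call count and not from the disk.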
Re: [HACKERS] [PERFORM] A Better External Sort?
I've now gotten verification from multiple working DBAs that DB2, Oracle, and SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in setups akin to Oracle RAC) when attached to a decent (not outrageous, but decent) HD subsystem... I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR is attainable. Cache based bursts that high, yes. ASTR, no. The DBAs in question run RW installations that include Solaris, M$, and Linux OS's for companies that just about everyone on these lists is likely to recognize. Also, the implication of these pg IO limits is that money spent on even moderately priced 300MBps SATA II based RAID HW is wasted $'s. In total, this situation is a recipe for driving potential pg users to other DBMS. 25MBps in and 15MBps out is =BAD=. Have we instrumented the code in enough detail that we can tell _exactly_ where the performance drainage is? We have to fix this. Ron -Original Message- From: Luke Lonergan [EMAIL PROTECTED] Sent: Oct 5, 2005 11:24 AM To: Michael Stone [EMAIL PROTECTED], Martijn van Oosterhout kleptog@svana.org Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Nope - it would be disk wait. COPY is CPU bound on I/O subsystems faster than 50 MB/s on COPY (in) and about 15 MB/s (out). - Luke -Original Message- From: Michael Stone [mailto:[EMAIL PROTECTED] Sent: Wed Oct 05 09:58:41 2005 To: Martijn van Oosterhout Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote: COPY TO /dev/null WITH binary 13MB/s 55% user 45% system (ergo, CPU bound) [snip] the most expensive. But it does point out that the whole process is probably CPU bound more than anything else. Note that 45% of that cpu usage is system--which is where IO overhead would end up being counted. Until you profile where your system time is going it's premature to say it isn't an IO problem. Mike Stone
Re: [HACKERS] [PERFORM] A Better External Sort?
We have to fix this. Ron The source is freely available for your perusal. Please feel free to point us in specific directions in the code where you may see some benefit. I am positive all of us that can, would put resources into fixing the issue had we a specific direction to attack. Sincerely, Joshua D. Drake -- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240 PostgreSQL Replication, Consulting, Custom Programming, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PERFORM] A Better External Sort?
First I wanted to verify that pg's IO rates were inferior to The Competition. Now there's at least an indication that someone else has solved similar problems. Existence proofs make some things easier ;-) Is there any detailed programmer level architectural doc set for pg? I know the best doc is the code, but the code in isolation is often the Slow Path to understanding with systems as complex as a DBMS IO layer. Ron -Original Message- From: Joshua D. Drake [EMAIL PROTECTED] Sent: Oct 5, 2005 1:18 PM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? The source is freely available for your perusal. Please feel free to point us in specific directions in the code where you may see some benefit. I am positive all of us that can, would put resources into fixing the issue had we a specific direction to attack. Sincerely, Joshua D. Drake
Re: [HACKERS] [PERFORM] A Better External Sort?
On Wed, 2005-10-05 at 12:14 -0400, Ron Peacetree wrote: I've now gotten verification from multiple working DBA's that DB2, Oracle, and SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in setups akin to Oracle RAC) when attached to a decent (not outrageous, but decent) HD subsystem... I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR is attainable. Cache based bursts that high, yes. ASTR, no. I find your tone annoying. That you do not have access to this level of hardware proves nothing, other than pointing out that your repeated emails on this list are based on supposition. If you want 1GB/sec STR you need: 1) 1 or more Itanium CPUs 2) 24 or more disks 3) 2 or more SATA controllers 4) Linux Have fun. -jwb ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [PERFORM] A Better External Sort?
I'm putting in as much time as I can afford thinking about pg related performance issues. I'm doing it because of a sincere desire to help understand and solve them, not to annoy people. If I didn't believe in pg, I wouldn't be posting thoughts about how to make it better. It's probably worth some review (suggestions marked with a +):
+I came to the table with a possibly better way to deal with external sorts (that now has branched into 2 efforts: short term improvements to the existing code, and the original from-the-ground-up idea). That suggestion was based on a great deal of prior thought and research, despite what some others might think.
Then we were told that our IO limit was lower than I thought.
+I suggested that as a Quick Fix we try making sure we do IO transfers in large enough chunks, based on the average access time of the physical device in question, so as to achieve the device's ASTR (ie at least 600KB per access for a 50MBps ASTR device with a 12ms average access time) whenever circumstances allowed us. As far as I know, this experiment hasn't been tried yet.
I asked some questions about physical layout and format translation overhead being possibly suboptimal that seemed to be agreed to, but specifics as to where we are taking the hit don't seem to have been made explicit yet.
+I made the from-left-field suggestion that perhaps a pg native fs format would be worth consideration. This is a major project, so the suggestion was to at least some extent tongue-in-cheek.
+I then made some suggestions about better code instrumentation so that we can more accurately characterize where the bottlenecks are.
We were also told that evidently we are CPU bound far before one would naively expect to be based on the performance specifications of the components involved.
Double checking among the pg developer community led to some differing opinions as to what the actual figures were and under what circumstances they were achieved. Further discussion seems to have converged on both accurate values and a better understanding as to the HW and SW needed; _and_ we've gotten some RW confirmation as to what current reasonable expectations are within this problem domain from outside the pg community.
+Others have made some good suggestions in this thread as well. Since I seem to need to defend my tone here, I'm not detailing them here. That should not be construed as a lack of appreciation of them.
Now I've asked for the quickest path to detailed understanding of the pg IO subsystem. The goal being to get more up to speed on its coding details. Certainly not to annoy you or anyone else.
At least from my perspective, this for the most part seems to have been a useful and reasonable engineering discussion that has exposed a number of important things.
Regards, Ron
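The 600KB figure in Ron's Quick Fix suggestion is just ASTR multiplied by average access time. A small Python sketch of that arithmetic follows; the function names are invented here, and the model is deliberately pessimistic in assuming that every transfer pays one full average access time before streaming at the ASTR.

```python
def min_chunk_bytes(astr_bytes_per_s, access_time_ms):
    """Bytes the device could have streamed during one average access.

    Chunks smaller than this spend more time positioning than transferring.
    """
    return astr_bytes_per_s * access_time_ms // 1000


def effective_rate(chunk_bytes, astr_bytes_per_s, access_time_ms):
    """Throughput if every chunk costs one access plus its transfer time."""
    access_s = access_time_ms / 1000
    transfer_s = chunk_bytes / astr_bytes_per_s
    return chunk_bytes / (access_s + transfer_s)


if __name__ == "__main__":
    astr = 50_000_000  # 50MBps ASTR, as in the message
    access_ms = 12     # 12ms average access time, as in the message
    print("minimum chunk:", min_chunk_bytes(astr, access_ms), "bytes")
    for chunk in (8192, 600_000, 6_000_000):
        rate = effective_rate(chunk, astr, access_ms)
        print(f"chunk={chunk:>9} bytes -> {rate / 1e6:6.2f} MB/s")
```

Under this worst-case model a 600KB chunk only breaks even (half the ASTR), which is consistent with the "at least" in the suggestion; when accesses are mostly sequential the access cost is paid far less often.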
Re: [HACKERS] [PERFORM] A Better External Sort?
On Wed, Oct 05, 2005 at 04:55:51PM -0700, Luke Lonergan wrote: In COPY, we found lots of libc functions like strlen() being called ridiculous numbers of times, in one case it was called on every timestamp/date attribute to get the length of TZ, which is constant. That one function call was in the system category, and was responsible for several percent of the time. What? strlen is definitely not in the kernel, and thus won't count as system time. /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PERFORM] A Better External Sort?
Jeffrey W. Baker [EMAIL PROTECTED] writes: I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. I had more or less despaired of this thread yielding any usable ideas :-( but I think you have one here. The reason the current code uses a six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first edition) shows that there's not much incremental gain from using more tapes ... if you are in the regime where number of runs is much greater than number of tape drives. But if you can stay in the regime where only one merge pass is needed, that is obviously a win. I don't believe we can simply legislate that there be only one merge pass. That would mean that, if we end up with N runs after the initial run-forming phase, we need to fit N tuples in memory --- no matter how large N is, or how small work_mem is. But it seems like a good idea to try to use an N-way merge where N is as large as work_mem will allow. We'd not have to decide on the value of N until after we've completed the run-forming phase, at which time we've already seen every tuple once, and so we can compute a safe value for N as work_mem divided by largest_tuple_size. (Tape I/O buffers would have to be counted too of course.) It's been a good while since I looked at the sort code, and so I don't recall if there are any fundamental reasons for having a compile-time- constant value of the merge order rather than choosing it at runtime. My guess is that any inefficiencies added by making it variable would be well repaid by the potential savings in I/O. 
regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
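To make the proposal concrete, here is a minimal Python sketch of a single-pass heap merge plus a runtime choice of merge order. It is an illustration only, not the tuplesort code: merge_order and heap_merge are hypothetical names, and charging one tuple slot plus one I/O buffer per run is an assumption about how the "tape I/O buffers would have to be counted too" remark might be applied.

```python
import heapq

_DONE = object()  # sentinel marking an exhausted run


def merge_order(work_mem_bytes, largest_tuple_bytes, tape_buffer_bytes=0):
    """Largest N such that N (tuple + buffer) slots fit in work_mem.

    Assumption: each input run needs room for one tuple and one I/O buffer.
    The result is never below 2, since a merge needs at least two inputs.
    """
    per_run = largest_tuple_bytes + tape_buffer_bytes
    return max(2, work_mem_bytes // per_run)


def heap_merge(runs):
    """Merge already-sorted runs in one pass.

    The heap holds at most one item per run, so each output item costs
    O(log M) comparisons for M runs.
    """
    iterators = [iter(run) for run in runs]
    heap = []
    for idx, it in enumerate(iterators):
        first = next(it, _DONE)
        if first is not _DONE:
            heap.append((first, idx))
    heapq.heapify(heap)

    merged = []
    while heap:
        value, idx = heap[0]
        merged.append(value)
        following = next(iterators[idx], _DONE)
        if following is _DONE:
            heapq.heappop(heap)  # this run is finished
        else:
            heapq.heapreplace(heap, (following, idx))  # refill from same run
    return merged


if __name__ == "__main__":
    print(merge_order(1024 * 1024, 1024, tape_buffer_bytes=8192))
    print(heap_merge([[1, 4, 9], [2, 3, 10], [], [5]]))
```

The index stored alongside each value both identifies the source run and breaks ties, so equal keys never force a comparison of anything beyond the key itself.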
Re: [HACKERS] [PERFORM] A Better External Sort?
On Fri, 2005-09-30 at 13:41 -0700, Josh Berkus wrote: Yeah, that's what I thought too. But try sorting an 10GB table, and you'll see: disk I/O is practically idle, while CPU averages 90%+. We're CPU-bound, because sort is being really inefficient about something. I just don't know what yet. If we move that CPU-binding to a higher level of performance, then we can start looking at things like async I/O, O_Direct, pre-allocation etc. that will give us incremental improvements. But what we need now is a 5-10x improvement and that's somewhere in the algorithms or the code. I'm trying to keep an open mind about what the causes are, and I think we need to get a much better characterisation of what happens during a sort before we start trying to write code. It is always too easy to jump in and tune the wrong thing, which is not a good use of time. The actual sort algorithms looks damn fine to me and the code as it stands is well optimised. That indicates to me that we've come to the end of the current line of thinking and we need a new approach, possibly in a number of areas. For myself, I don't wish to be drawn further on solutions at this stage but I am collecting performance data, so any test results are most welcome. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PERFORM] A Better External Sort?
On Sat, 2005-10-01 at 02:01 -0400, Tom Lane wrote: Jeffrey W. Baker [EMAIL PROTECTED] writes: I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. I had more or less despaired of this thread yielding any usable ideas :-( but I think you have one here. The reason the current code uses a six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first edition) shows that there's not much incremental gain from using more tapes ... if you are in the regime where number of runs is much greater than number of tape drives. But if you can stay in the regime where only one merge pass is needed, that is obviously a win. I don't believe we can simply legislate that there be only one merge pass. That would mean that, if we end up with N runs after the initial run-forming phase, we need to fit N tuples in memory --- no matter how large N is, or how small work_mem is. But it seems like a good idea to try to use an N-way merge where N is as large as work_mem will allow. We'd not have to decide on the value of N until after we've completed the run-forming phase, at which time we've already seen every tuple once, and so we can compute a safe value for N as work_mem divided by largest_tuple_size. (Tape I/O buffers would have to be counted too of course.) It's been a good while since I looked at the sort code, and so I don't recall if there are any fundamental reasons for having a compile-time- constant value of the merge order rather than choosing it at runtime. My guess is that any inefficiencies added by making it variable would be well repaid by the potential savings in I/O. Well, perhaps Knuth is not untouchable! So we merge R runs with N variable rather than N=6. 
Pick N so that N >= 6 and N <= R, with N limited by memory, sufficient to allow long sequential reads from the temp file. Looking at the code, in selectnewtape() we decide on the connection between run number and tape number. This gets executed during the writing of initial runs, which was OK when the run-tape mapping was known ahead of time because of fixed N. To do this it sounds like we'd be better to write each run out to its own personal runtape, taking the assumption that N is very large. Then when all runs are built, re-assign the run numbers to tapes for the merge. That is likely to be a trivial mapping unless N isn't large enough to fit in memory. That idea should be easily possible because the tape numbers were just abstract anyway. Right now, I can't see any inefficiencies from doing this. It uses memory better and Knuth shows that using more tapes is better anyhow. Keeping track of more tapes isn't too bad, even for hundreds or even thousands of runs/tapes. Tom, it's your idea, so you have first dibs. I'm happy to code this up if you choose not to, once I've done my other immediate chores. That just leaves these issues for a later time: - CPU and I/O interleaving - CPU cost of abstract data type comparison operator invocation Best Regards, Simon Riggs
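What happens when R runs have to be merged with a merge order N smaller than R can be sketched as follows. This Python fragment uses a plain balanced merge over in-memory lists (standard-library heapq.merge) as a stand-in; it is not the polyphase scheme in the existing code, and merge_in_passes is a made-up name.

```python
import heapq


def merge_in_passes(runs, n):
    """Merge sorted runs using at most n inputs per merge step.

    Each pass merges groups of up to n runs into longer runs, and the loop
    repeats until one run is left. Returns (sorted_list, number_of_passes).
    When n >= len(runs) a single pass is enough.
    """
    if n < 2:
        raise ValueError("merge order must be at least 2")
    runs = [list(run) for run in runs]
    passes = 0
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[i:i + n]))
            for i in range(0, len(runs), n)
        ]
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    # 40 tiny runs, each already sorted
    runs = [[i, i + 40, i + 80] for i in range(40)]
    for order in (6, 40):
        result, passes = merge_in_passes(runs, order)
        print(f"N={order}: {passes} pass(es), sorted={result == sorted(result)}")
```

Every pass rereads and rewrites all of the data, which is why the pass count is the quantity worth driving down.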
Re: [HACKERS] [PERFORM] A Better External Sort?
*blink* Tapes?! I thought that was a typo... If our sort code is based on sorting tapes, we've made a mistake. HDs are not tapes, and Polyphase Merge Sort and its brethren are not the best choices for HD based sorts. Useful references to this point:
Knuth, Vol 3 section 5.4.9, (starts p356 of 2ed)
Tharp, ISBN 0-471-60521-2, starting p352
Folk, Zoellick, and Riccardi, ISBN 0-201-87401-6, chapter 8 (starts p289)
The winners of the Daytona version of Jim Gray's sorting contest, for general purpose external sorting algorithms that are of high enough quality to be offered commercially, also demonstrate a number of better ways to attack external sorting using HDs. The big takeaways from all this are:
1= As in Polyphase Merge Sort, optimum External HD Merge Sort performance is obtained by using Replacement Selection and creating buffers of different lengths for later merging. The values are different.
2= Using multiple HDs split into different functions, IOW _not_ simply as RAIDs, is a big win. A big enough win that we should probably consider having a config option to pg that allows the use of HD(s) or RAID set(s) dedicated as temporary work area(s).
3= If the Key is small compared to record size, Radix or Distribution Counting based algorithms are worth considering.
The good news is all this means it's easy to demonstrate that we can improve the performance of our sorting functionality. Assuming we get the abysmal physical IO performance fixed... (because until we do, _nothing_ is going to help us as much) Ron -Original Message- From: Tom Lane [EMAIL PROTECTED] Sent: Oct 1, 2005 2:01 AM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Jeffrey W. Baker [EMAIL PROTECTED] writes: I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape.
It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. I had more or less despaired of this thread yielding any usable ideas :-( but I think you have one here. The reason the current code uses a six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first edition) shows that there's not much incremental gain from using more tapes ... if you are in the regime where number of runs is much greater than number of tape drives. But if you can stay in the regime where only one merge pass is needed, that is obviously a win. I don't believe we can simply legislate that there be only one merge pass. That would mean that, if we end up with N runs after the initial run-forming phase, we need to fit N tuples in memory --- no matter how large N is, or how small work_mem is. But it seems like a good idea to try to use an N-way merge where N is as large as work_mem will allow. We'd not have to decide on the value of N until after we've completed the run-forming phase, at which time we've already seen every tuple once, and so we can compute a safe value for N as work_mem divided by largest_tuple_size. (Tape I/O buffers would have to be counted too of course.) It's been a good while since I looked at the sort code, and so I don't recall if there are any fundamental reasons for having a compile-time- constant value of the merge order rather than choosing it at runtime. My guess is that any inefficiencies added by making it variable would be well repaid by the potential savings in I/O. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PERFORM] A Better External Sort?
Josh Berkus josh@agliodbs.com writes: The biggest single area where I see PostgreSQL external sort sucking is on index creation on large tables. For example, for the free version of TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's hardware, but over 3 hours to create each index on that table. This means that overall our load into TPCH takes 4 times as long to create the indexes as it did to bulk load the data. ... Following an index creation, we see that 95% of the time required is the external sort, which averages 2MB/s. This is with separate drives for the WAL, the pg_tmp, the table and the index. I've confirmed that increasing work_mem beyond a small minimum (around 128MB) had no benefit on the overall index creation speed. These numbers don't seem to add up. You have not provided any details about the index key datatypes or sizes, but I'll take a guess that the raw data for each index is somewhere around 10GB. The theory says that the runs created during the first pass should on average be about twice work_mem, so at 128MB work_mem there should be around 40 runs to be merged, which would take probably three passes with six-way merging. Raising work_mem to a gig should result in about five runs, needing only one pass, which is really going to be as good as it gets. If you could not see any difference then I see little hope for the idea that reducing the number of merge passes will help. Umm ... you were raising maintenance_work_mem, I trust, not work_mem? We really need to get some hard data about what's going on here. The sort code doesn't report any internal statistics at the moment, but it would not be hard to whack together a patch that reports useful info in the form of NOTICE messages or some such. regards, tom lane
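The run and pass counts in Tom's estimate can be checked with a few lines of Python. This is back-of-envelope arithmetic only: the 10GB figure is his guess, the "runs average twice work_mem" rule is the one he cites, the pass count assumes a simple balanced k-way merge instead of the polyphase merge actually used, and the function names are invented for this sketch.

```python
def estimate_runs(data_bytes, work_mem_bytes):
    """Initial runs, assuming each run averages twice work_mem (rounded up)."""
    run_bytes = 2 * work_mem_bytes
    return -(-data_bytes // run_bytes)  # ceiling division


def estimate_passes(runs, merge_order):
    """Passes a balanced merge_order-way merge needs to reach a single run."""
    passes = 0
    while runs > 1:
        runs = -(-runs // merge_order)
        passes += 1
    return passes


if __name__ == "__main__":
    MB = 1024 * 1024
    GB = 1024 * MB
    for work_mem in (128 * MB, 1 * GB):
        runs = estimate_runs(10 * GB, work_mem)
        passes = estimate_passes(runs, 6)
        print(f"work_mem={work_mem // MB}MB: {runs} runs, {passes} pass(es)")
```

With these assumptions the sketch gives 40 runs and three passes at 128MB, and five runs and one pass at 1GB, matching the figures in the message.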
Re: [HACKERS] [PERFORM] A Better External Sort?
Tom Lane [EMAIL PROTECTED] writes: Jeffrey W. Baker [EMAIL PROTECTED] writes: I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. I had more or less despaired of this thread yielding any usable ideas :-( but I think you have one here. The reason the current code uses a six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first edition) shows that there's not much incremental gain from using more tapes ... if you are in the regime where number of runs is much greater than number of tape drives. But if you can stay in the regime where only one merge pass is needed, that is obviously a win. Is that still true when the multiple tapes are being multiplexed onto a single actual file on disk? That brings up one of my pet features though. The ability to declare multiple temporary areas on different spindles and then have them be used on a rotating basis. So a sort could store each tape on a separate spindle and merge them together at full sequential i/o speed. This would make the tradeoff between multiway merges and many passes even harder to find though. The broader the multiway merges the more sort areas would be used which would increase the likelihood of another sort using the same sort area and hurting i/o performance. -- greg ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PERFORM] A Better External Sort?
You have not said anything about what HW, OS version, and pg version used here, but even at that can't you see that something Smells Wrong? The most common CPUs currently shipping have clock rates of ~2-3GHz and have 8B-16B internal pathways. SPARCs and other like CPUs are clocked slower but have 16B-32B internal pathways. In short, these CPUs have an internal bandwidth of 16+ GBps. The most common currently shipping mainboards have 6.4GBps RAM subsystems. ITRW, their peak is ~80% of that, or ~5.1GBps. In contrast, the absolute peak bandwidth of a 133MHz 8B PCI-X bus is 1GBps, and ITRW it peaks at ~800-850MBps. Should anyone ever build a RAID system that can saturate a PCI-Ex16 bus, that system will be maxing ITRW at ~3.2GBps. CPUs should NEVER be 100% utilized during copy IO. They should be idling impatiently waiting for the next piece of data to finish being processed even when the RAM IO subsystem is pegged; and they definitely should be IO starved rather than CPU bound when doing HD IO. Those IO rates are also alarming in all but possibly the first case. A single ~50MBps HD doing 21MBps isn't bad, but for even a single ~80MBps HD it starts to be of concern. If any of these IO rates came from any reasonable 300+MBps RAID array, then they are BAD. What your simple experiment really does is prove We Have A Problem (tm) with our IO code at either or both of the OS or the pg level(s). Ron -Original Message- From: Martijn van Oosterhout kleptog@svana.org Sent: Oct 1, 2005 12:19 PM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote: Assuming we get the abyssmal physical IO performance fixed... (because until we do, _nothing_ is going to help us as much) I'm still not convinced this is the major problem.
For example, in my totally unscientific tests on an oldish machine I have here:
Direct filesystem copy to /dev/null    21MB/s    10% user 50% system (dual cpu, so the system is using a whole CPU)
COPY TO /dev/null WITH binary          13MB/s    55% user 45% system (ergo, CPU bound)
COPY TO /dev/null                      4.4MB/s   60% user 40% system
\copy to /dev/null in psql             6.5MB/s   60% user 40% system
This machine has a slightly strange setup; not sure why fs copy is so slow. As to why \copy is faster than COPY, I have no idea, but it is repeatable. And actually turning the tuples into a printable format is the most expensive. But it does point out that the whole process is probably CPU bound more than anything else. So, I don't think physical I/O is the problem. It's something further up the call tree. I wouldn't be surprised at all if it had to do with the creation and destruction of tuples. The cost of comparing tuples should not be underestimated.
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron, Hmmm. 60GB/5400secs= 11MBps. That's ssllooww. So the first problem is evidently our physical layout and/or HD IO layer sucks. Actually, it's much worse than that, because the sort is only dealing with one column. As I said, monitoring the iostat our top speed was 2.2mb/s. --Josh ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PERFORM] A Better External Sort?
That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Until the basic IO issues are addressed, we could replace the present sorting code with infinitely fast sorting code and we'd still be scrod performance wise. So why does basic IO suck so badly? Ron -Original Message- From: Josh Berkus josh@agliodbs.com Sent: Sep 30, 2005 1:23 PM To: Ron Peacetree [EMAIL PROTECTED] Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Ron, Hmmm. 60GB/5400secs= 11MBps. That's ssllooww. So the first problem is evidently our physical layout and/or HD IO layer sucks. Actually, it's much worse than that, because the sort is only dealing with one column. As I said, monitoring the iostat our top speed was 2.2mb/s. --Josh ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron, That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. Oh, yeah. Well, that's separate from sort. See multiple posts on this list from the GreenPlum team, the COPY patch for 8.1, etc. We've been concerned about I/O for a while. Realistically, you can't do better than about 25MB/s on a single-threaded I/O on current Linux machines, because your bottleneck isn't the actual disk I/O. It's CPU. Databases which go faster than this are all, to my knowledge, using multi-threaded disk I/O. (and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's another thread ... ) As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Yeah, that's what I thought too. But try sorting an 10GB table, and you'll see: disk I/O is practically idle, while CPU averages 90%+. We're CPU-bound, because sort is being really inefficient about something. I just don't know what yet. If we move that CPU-binding to a higher level of performance, then we can start looking at things like async I/O, O_Direct, pre-allocation etc. that will give us incremental improvements. But what we need now is a 5-10x improvement and that's somewhere in the algorithms or the code. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [PERFORM] A Better External Sort?
I have seen performance similar to Josh's, and my reasoning is as follows:
* WAL is the biggest bottleneck with its default size of 16MB. Many people hate to recompile the code to change its default, and increasing checkpoint segments helps, but there is still a lot of overhead in the rotation of WAL files (even putting WAL on tmpfs shows that it is still slow). Having an option for a bigger size is helpful to a small extent percentagewise (and frees up CPU a bit in doing file rotation).
* Growing files: even though this is OS dependent, it does spend a lot of time doing small 8K block increases to grow files. If we can signal bigger chunks to grow, or pre-grow to the expected size of data files, that will help a lot in such cases.
* The COPY command had restrictions, but that has been fixed to a large extent. (Great job)
But of course I have lost touch with programming and can't begin to understand PostgreSQL code to change it myself. Regards, Jignesh
Ron Peacetree wrote: That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Until the basic IO issues are addressed, we could replace the present sorting code with infinitely fast sorting code and we'd still be scrod performance wise. So why does basic IO suck so badly? Ron -Original Message- From: Josh Berkus josh@agliodbs.com Sent: Sep 30, 2005 1:23 PM To: Ron Peacetree [EMAIL PROTECTED] Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Ron, Hmmm. 60GB/5400secs= 11MBps. That's ssllooww. So the first problem is evidently our physical layout and/or HD IO layer sucks. Actually, it's much worse than that, because the sort is only dealing with one column. As I said, monitoring the iostat our top speed was 2.2mb/s. --Josh
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron, On 9/30/05 1:20 PM, Ron Peacetree [EMAIL PROTECTED] wrote: That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. Bulk loading speed is irrelevant here - that is dominated by parsing, which we have covered copiously (har har) previously and have sped up by 500%, which still makes Postgres 1/2 the loading speed of MySQL. As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Yes. Until the basic IO issues are addressed, we could replace the present sorting code with infinitely fast sorting code and we'd still be scrod performance wise. Postgres' I/O path has many problems that must be micro-optimized away. Too small of an operand size compared to disk caches, memory, etc etc are the common problem. Another is lack of micro-parallelism (loops) with long enough runs to let modern processors pipeline and superscale. The net problem here is that a simple select blah from blee order by(blah.a); runs at 1/100 of the sequential scan rate. - Luke ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PERFORM] A Better External Sort?
Bulk loading speed is irrelevant here - that is dominated by parsing, which we have covered copiously (har har) previously and have sped up by 500%, which still makes Postgres 1/2 the loading speed of MySQL.

Let's ask MySQL 4.0
LOAD DATA INFILE blah
0 errors, 666 warnings
SHOW WARNINGS;
not implemented. upgrade to 4.1
duh
Re: [HACKERS] [PERFORM] A Better External Sort?
25MBps should not be a CPU bound limit for IO, nor should it be an OS limit. It should be something ~100x (Single channel RAM) to ~200x (dual channel RAM) that. For an IO rate of 25MBps to be pegging the CPU at 100%, the CPU is suffering some combination of A= lots of cache misses (cache thrash), B= lots of random rather than sequential IO (like pointer chasing) C= lots of wasteful copying D= lots of wasteful calculations In fact, this is crappy enough performance that the whole IO layer should be rethought and perhaps reimplemented from scratch. Optimization of the present code is unlikely to yield a 100-200x improvement. On the HD side, the first thing that comes to mind is that DBs are -NOT- like ordinary filesystems in a few ways: 1= the minimum HD IO is a record that is likely to be larger than a HD sector. Therefore, the FS we use should be laid out with physical segments of max(HD sector size, record size) 2= DB files (tables) are usually considerably larger than any other kind of files stored. Therefore the FS we use should be laid out using LARGE physical pages. 64KB-256KB at a _minimum_. 3= The whole 2GB striping of files idea needs to be rethought. Our tables are significantly different in internal structure from the usual FS entity. 4= I'm sure we are paying all sorts of nasty overhead for essentially emulating the pg filesystem inside another filesystem. That means ~2x as much overhead to access a particular piece of data. The simplest solution is for us to implement a new VFS compatible filesystem tuned to exactly our needs: pgfs. We may be able to avoid that by some amount of hacking or modifying of the current FSs we use, but I suspect it would be more work for less ROI. Ron -Original Message- From: Josh Berkus josh@agliodbs.com Sent: Sep 30, 2005 4:41 PM To: Ron Peacetree [EMAIL PROTECTED] Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
Ron, That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. Oh, yeah. Well, that's separate from sort. See multiple posts on this list from the GreenPlum team, the COPY patch for 8.1, etc. We've been concerned about I/O for a while. Realistically, you can't do better than about 25MB/s on a single-threaded I/O on current Linux machines, because your bottleneck isn't the actual disk I/O. It's CPU. Databases which go faster than this are all, to my knowledge, using multi-threaded disk I/O. (and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's another thread ... ) As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Yeah, that's what I thought too. But try sorting a 10GB table, and you'll see: disk I/O is practically idle, while CPU averages 90%+. We're CPU-bound, because sort is being really inefficient about something. I just don't know what yet. If we move that CPU-binding to a higher level of performance, then we can start looking at things like async I/O, O_Direct, pre-allocation etc. that will give us incremental improvements. But what we need now is a 5-10x improvement and that's somewhere in the algorithms or the code. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] [PERFORM] A Better External Sort?
Your main example seems to focus on a large table where a key column has constrained values. This case is interesting in proportion to the number of possible values. If I have billions of rows, each having one of only two values, I can think of a trivial and very fast method of returning the table sorted by that key: make two sequential passes, returning the first value on the first pass and the second value on the second pass. This will be faster than the method you propose. 1= No that was not my main example. It was the simplest example used to frame the later more complicated examples. Please don't get hung up on it. 2= You are incorrect. Since IO is the most expensive operation we can do, any method that makes two passes through the data at top scanning speed will take at least 2x as long as any method that only takes one such pass. You do not get the point. By the time you get the sorted references to the tuples, you need to fetch the tuples themselves, check their visibility, etc. and return them to the client. So, if there are only 2 values in the column of a big table that is larger than available RAM, two seq scans of the table without any sorting is the fastest solution. Regards, Jean-Gérard Pailloncy
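For readers following along, the two-pass idea for a two-valued key can be sketched roughly as below. This is only an illustration under stated assumptions: the (key, payload) row shape, the scan callable and the name two_pass_sorted are all hypothetical and have nothing to do with the PostgreSQL executor.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (key, payload)


def two_pass_sorted(scan: Callable[[], Iterable[Row]],
                    values: Sequence[int]) -> Iterator[Row]:
    """Yield rows in key order using one sequential scan per distinct value.

    No sort is performed and no extra memory is needed; the cost is one
    full pass over the table for every distinct key value.
    """
    for wanted in sorted(values):
        for row in scan():           # one complete sequential pass
            if row[0] == wanted:
                yield row


# Tiny in-memory stand-in for a table that would really live on disk.
table = [(1, "a"), (0, "b"), (1, "c"), (0, "d")]
result = list(two_pass_sorted(lambda: iter(table), [0, 1]))
print(result)
```

The cost grows linearly with the number of distinct values, which is why this only looks attractive for very small key domains.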
Re: [HACKERS] [PERFORM] A Better External Sort?
Just to add a little anarchy in your nice debate... Who really needs all the results of a sort on your terabyte table ? I guess not many people do a SELECT from such a table and want all the results. So, this leaves : - Really wanting all the results, to fetch using a cursor, - CLUSTER type things, where you really want everything in order, - Aggregates (Sort-GroupAggregate), which might really need to sort the whole table. - Complex queries where the whole dataset needs to be examined, in order to return a few values - Joins (again, the whole table is probably not going to be selected) - And the ones I forgot. However, Most likely you only want to SELECT N rows, in some ordering : - the first N (ORDER BY x LIMIT N) - last N (ORDER BY x DESC LIMIT N) - WHERE xvalue ORDER BY x LIMIT N - WHERE xvalue ORDER BY x DESC LIMIT N - and other variants Or, you are doing a Merge JOIN against some other table ; in that case, yes, you might need the whole sorted terabyte table, but most likely there are WHERE clauses in the query that restrict the set, and thus, maybe we can get some conditions or limit values on the column to sort. Also the new, optimized hash join, which is more memory efficient, might cover this case. Point is, sometimes, you only need part of the results of your sort. And the bigger the sort, the most likely it becomes that you only want part of the results. So, while we're in the fun hand-waving, new algorithm trying mode, why not consider this right from the start ? (I know I'm totally in hand-waving mode right now, so slap me if needed). I'd say your new, fancy sort algorithm needs a few more input values : - Range of values that must appear in the final result of the sort : none, minimum, maximum, both, or even a set of values from the other side of the join, hashed, or sorted. 
- LIMIT information (first N, last N, none) - Enhanced Limit information (first/last N values of the second column to sort, for each value of the first column) (the infamous top10 by category query) - etc. With this, the amount of data that needs to be kept in memory is dramatically reduced, from the whole table (even using your compressed keys, that's big) to something more manageable which will be closer to the size of the final result set which will be returned to the client, and avoid a lot of effort. So, this would not be useful in all cases, but when it applies, it would be really useful. Regards !
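As a rough illustration of the LIMIT point above, ORDER BY x LIMIT N can be answered with a heap that never holds more than N values, so memory tracks the result size rather than the table size. This is a sketch only; the function name top_n and the use of plain integers as sort keys are assumptions made for the example.

```python
import heapq
from typing import Iterable, List


def top_n(values: Iterable[int], n: int) -> List[int]:
    """Return the n smallest values in ascending order (ORDER BY x LIMIT n).

    Keeps at most n values in memory by maintaining a max-heap of the
    current candidates (stored negated, since heapq is a min-heap).
    """
    if n <= 0:
        return []
    heap: List[int] = []
    for x in values:
        if len(heap) < n:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:              # x beats the largest kept value
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)


print(top_n([5, 1, 9, 3, 7, 2], 3))
```

The standard library's heapq.nsmallest does the same job; the explicit loop is shown only to make the bounded memory use visible.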
Re: [HACKERS] [PERFORM] A Better External Sort?
Jeff, Ron, First off, Jeff, please take it easy. We're discussing 8.2 features at this point and there's no reason to get stressed out at Ron. You can get plenty stressed out when 8.2 is near feature freeze. ;-) Regarding use cases for better sorts: The biggest single area where I see PostgreSQL external sort sucking is on index creation on large tables. For example, for the free version of TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's hardware, but over 3 hours to create each index on that table. This means that overall our load into TPCH takes 4 times as long to create the indexes as it did to bulk load the data. Anyone restoring a large database from pg_dump is in the same situation. It's even worse if you have to create a new index on a large table on a production database in use, because the I/O from the index creation swamps everything. Following an index creation, we see that 95% of the time required is the external sort, which averages 2mb/s. This is with separate drives for the WAL, the pg_tmp, the table and the index. I've confirmed that increasing work_mem beyond a small minimum (around 128mb) had no benefit on the overall index creation speed. --Josh Berkus
Re: [HACKERS] [PERFORM] A Better External Sort?
Josh, On 9/29/05 9:54 AM, Josh Berkus josh@agliodbs.com wrote: Following an index creation, we see that 95% of the time required is the external sort, which averages 2mb/s. This is with separate drives for the WAL, the pg_tmp, the table and the index. I've confirmed that increasing work_mem beyond a small minimum (around 128mb) had no benefit on the overall index creation speed. Yup! That about sums it up - regardless of taking 1 or 2 passes through the heap being sorted, 1.5 - 2 MB/s is the wrong number. This is not necessarily an algorithmic problem, but is an optimization problem with Postgres that must be fixed before it can be competitive. We read/write to/from disk at 240MB/s and so 2 passes would run at a net rate of 120MB/s through the sort set if it were that efficient. Anyone interested in tackling the real performance issue? (flame bait, but for a worthy cause :-) - Luke
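The net-rate arithmetic in the message above can be spelled out as follows. The 240MB/s and 2MB/s figures come from the thread; the function name and the simple "every pass touches all the data once" model are assumptions made for illustration.

```python
def net_sort_rate(disk_mb_per_s: float, passes: int) -> float:
    """Net rate through the sort set if every pass streams at disk speed."""
    return disk_mb_per_s / passes


ideal = net_sort_rate(240.0, 2)   # two passes at 240MB/s
observed = 2.0                    # external sort rate reported in the thread
shortfall = ideal / observed      # how far the observed rate is from ideal

print(f"ideal {ideal:.0f} MB/s, observed {observed:.0f} MB/s, "
      f"gap {shortfall:.0f}x")
```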
Re: [HACKERS] [PERFORM] A Better External Sort?
On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote: Josh, On 9/29/05 9:54 AM, Josh Berkus josh@agliodbs.com wrote: Following an index creation, we see that 95% of the time required is the external sort, which averages 2mb/s. This is with separate drives for the WAL, the pg_tmp, the table and the index. I've confirmed that increasing work_mem beyond a small minimum (around 128mb) had no benefit on the overall index creation speed. Yup! That about sums it up - regardless of taking 1 or 2 passes through the heap being sorted, 1.5 - 2 MB/s is the wrong number. Yeah this is really bad ... approximately the speed of GNU sort. Josh, do you happen to know how many passes are needed in the multiphase merge on your 60GB table? Looking through tuplesort.c, I have a couple of initial ideas. Are we allowed to fork here? That would open up the possibility of using the CPU and the I/O in parallel. I see that tuplesort.c also suffers from the kind of postgresql-wide disease of calling all the way up and down a big stack of software for each tuple individually. Perhaps it could be changed to work on vectors. I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. I would also recommend using an external process to asynchronously feed the tuples into the heap during the merge. What's the timeframe for 8.2? -jwb
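The single pass heap merge mentioned above can be sketched in a few lines. This is an illustration only, not what tuplesort.c does: the runs are plain in-memory iterables of integers standing in for sorted tapes, and the name merge_runs is hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def merge_runs(runs: List[Iterable[int]]) -> Iterator[int]:
    """Merge M sorted runs in one pass using a heap of at most M entries.

    Each output value costs O(log M) comparisons, so N values in total
    cost about N*log(M), and every input value is read exactly once.
    """
    iters = [iter(r) for r in runs]
    heap: List[Tuple[int, int]] = []       # (value, index of source run)
    for i, it in enumerate(iters):
        for first in it:                   # take at most one value per run
            heap.append((first, i))
            break
    heapq.heapify(heap)
    while heap:
        value, i = heap[0]
        yield value
        for nxt in iters[i]:               # refill from the same run
            heapq.heapreplace(heap, (nxt, i))
            break
        else:                              # that run is exhausted
            heapq.heappop(heap)


merged = list(merge_runs([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
print(merged)
```

Python's heapq.merge provides the same behaviour; the explicit version is shown so the heap size and per-value work are easy to see.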
Re: [HACKERS] [PERFORM] A Better External Sort?
Jeff, Josh, do you happen to know how many passes are needed in the multiphase merge on your 60GB table? No, any idea how to test that? I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. Yes, but the evidence suggests that we're actually not using the whole 1GB of RAM ... maybe using only 32MB of it which would mean over 200 passes (I'm not sure of the exact math). Just fixing our algorithm so that it used all of the work_mem permitted might improve things tremendously. I would also recommend using an external process to asynchronously feed the tuples into the heap during the merge. What's the timeframe for 8.2? Too far out to tell yet. Probably 9mo to 1 year, that's been our history. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] [PERFORM] A Better External Sort?
From: Pailloncy Jean-Gerard [EMAIL PROTECTED] Sent: Sep 29, 2005 7:11 AM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Jeff Baker: Your main example seems to focus on a large table where a key column has constrained values. This case is interesting in proportion to the number of possible values. If I have billions of rows, each having one of only two values, I can think of a trivial and very fast method of returning the table sorted by that key: make two sequential passes, returning the first value on the first pass and the second value on the second pass. This will be faster than the method you propose. Ron Peacetree: 1= No that was not my main example. It was the simplest example used to frame the later more complicated examples. Please don't get hung up on it. 2= You are incorrect. Since IO is the most expensive operation we can do, any method that makes two passes through the data at top scanning speed will take at least 2x as long as any method that only takes one such pass. You do not get the point. As the time you get the sorted references to the tuples, you need to fetch the tuples themself, check their visbility, etc. and returns them to the client. As PFC correctly points out elsewhere in this thread, =maybe= you have to do all that. The vast majority of the time people are not going to want to look at a detailed record by record output of that much data. The most common usage is to calculate or summarize some quality or quantity of the data and display that instead or to use the tuples or some quality of the tuples found as an intermediate step in a longer query process such as a join. Sometimes there's a need to see _some_ of the detailed records; a random sample or a region in a random part of the table or etc. It's rare that there is a RW need to actually list every record in a table of significant size. 
On the rare occasions where one does have to return or display all records in such large table, network IO and/or display IO speeds are the primary performance bottleneck. Not HD IO. Nonetheless, if there _is_ such a need, there's nothing stopping us from rearranging the records in RAM into sorted order in one pass through RAM (using at most space for one extra record) after constructing the cache conscious Btree index. Then the sorted records can be written to HD in RAM buffer sized chunks very efficiently. Repeating this process until we have stepped through the entire data set will take no more HD IO than one HD scan of the data and leave us with a permanent result that can be reused for multiple purposes. If the sorted records are written in large enough chunks, rereading them at any later time can be done at maximum HD throughput In a total of two HD scans (one to read the original data, one to write out the sorted data) we can make a permanent rearrangement of the data. We've essentially created a cluster index version of the data. So, if there is only 2 values in the column of big table that is larger than available RAM, two seq scans of the table without any sorting is the fastest solution. If you only need to do this once, yes this wins. OTOH, if you have to do this sort even twice, my method is better. regards, Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PERFORM] A Better External Sort?
From: Zeugswetter Andreas DAZ SD [EMAIL PROTECTED] Sent: Sep 29, 2005 9:28 AM Subject: RE: [HACKERS] [PERFORM] A Better External Sort? In my original example, a sequential scan of the 1TB of 2KB or 4KB records, = 250M or 500M records of data, being sorted on a binary value key will take ~1000x more time than reading in the ~1GB Btree I described that used a Key+RID (plus node pointers) representation of the data. Imho you seem to ignore the final step your algorithm needs of collecting the data rows. After you sorted the keys the collect step will effectively access the tuples in random order (given a sufficiently large key range). Collecting the data rows can be done for each RAM buffer full of data in one pass through RAM after we've built the Btree. Then if desired those data rows can be read out to HD in sorted order in essentially one streaming burst. This combination of index build + RAM buffer rearrangement + write results to HD can be repeated as often as needed until we end up with an overall Btree index and a set of sorted sublists on HD. Overall HD IO for the process is only two effectively sequential passes through the data. Subsequent retrieval of the sorted information from HD can be done at full HD streaming speed and whatever we've decided to save to HD can be reused later if we desire. Hope this helps, Ron
Re: [HACKERS] [PERFORM] A Better External Sort?
Jeff, On 9/29/05 10:44 AM, Jeffrey W. Baker [EMAIL PROTECTED] wrote: On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote: Looking through tuplesort.c, I have a couple of initial ideas. Are we allowed to fork here? That would open up the possibility of using the CPU and the I/O in parallel. I see that tuplesort.c also suffers from the kind of postgresql-wide disease of calling all the way up and down a big stack of software for each tuple individually. Perhaps it could be changed to work on vectors. Yes! I think the largest speedup will be to dump the multiphase merge and merge all tapes in one pass, no matter how large M. Currently M is capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over the tape. It could be done in a single pass heap merge with N*log(M) comparisons, and, more importantly, far less input and output. Yes again, see above. I would also recommend using an external process to asynchronously feed the tuples into the heap during the merge. Simon Riggs is working this idea a bit - it's slightly less interesting to us because we already have a multiprocessing executor. Our problem is that 4 x slow is still far too slow. What's the timeframe for 8.2? Let's test it out in Bizgres! - Luke
Re: [HACKERS] [PERFORM] A Better External Sort?
From: Josh Berkus josh@agliodbs.com Sent: Sep 29, 2005 12:54 PM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? The biggest single area where I see PostgreSQL external sort sucking is on index creation on large tables. For example, for the free version of TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's hardware, but over 3 hours to create each index on that table. This means that overall our load into TPCH takes 4 times as long to create the indexes as it did to bulk load the data. Hmmm. 60GB/5400secs= 11MBps. That's ssllooww. So the first problem is evidently our physical layout and/or HD IO layer sucks. Creating the table and then creating the indexes on the table is going to require more physical IO than if we created the table and the indexes concurrently in chunks and then combined the indexes on the chunks into the overall indexes for the whole table, so there's a potential speed-up. The method I've been talking about is basically a recipe for creating indexes as fast as possible with as few IO operations, HD or RAM, as possible and nearly no random ones, so it could help as well. OTOH, HD IO rate is the fundamental performance metric. As long as our HD IO rate is pessimal, so will the performance of everything else be. Why can't we load a table at closer to the peak IO rate of the HDs? Anyone restoring a large database from pg_dump is in the same situation. It's even worse if you have to create a new index on a large table on a production database in use, because the I/O from the index creation swamps everything. Fix for this in the works ;-) Following an index creation, we see that 95% of the time required is the external sort, which averages 2mb/s. Assuming decent HD HW, this is HORRIBLE. What kind of instrumenting and profiling has been done of the code involved? This is with separate drives for the WAL, the pg_tmp, the table and the index.
I've confirmed that increasing work_mem beyond a small minimum (around 128mb) had no benefit on the overall index creation speed. No surprise. The process is severely limited by the abysmally slow HD IO. Ron
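The 11MBps bulk-load figure quoted in this message follows directly from the numbers Josh gave (60GB loaded in 1.5 hours). A quick check, assuming decimal units (1GB = 1000MB); the function name is made up for the example.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s, assuming 1GB = 1000MB."""
    return size_gb * 1000 / (hours * 3600)


load_rate = throughput_mb_per_s(60, 1.5)   # 60GB Lineitem load in 1.5 hours
print(f"bulk load: {load_rate:.1f} MB/s")
```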
Re: [HACKERS] [PERFORM] A Better External Sort?
From: Jeffrey W. Baker [EMAIL PROTECTED] Sent: Sep 27, 2005 1:26 PM To: Ron Peacetree [EMAIL PROTECTED] Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote: That Btree can be used to generate a physical reordering of the data in one pass, but that's the weakest use for it. The more powerful uses involve allowing the Btree to persist and using it for more efficient re-searches or combining it with other such Btrees (either as a step in task distribution across multiple CPUs or as a more efficient way to do things like joins by manipulating these Btrees rather than the actual records.) Maybe you could describe some concrete use cases. I can see what you are getting at, and I can imagine some advantageous uses, but I'd like to know what you are thinking. 1= In a 4P box, we split the data in RAM into 4 regions and create a CPU cache friendly Btree using the method I described for each CPU. The 4 Btrees can be merged in a more time and space efficient manner than the original records to form a Btree that represents the sorted order of the entire data set. Any of these Btrees can be allowed to persist to lower the cost of doing similar operations in the future (Updating the Btrees during inserts and deletes is cheaper than updating the original data files and then redoing the same sort from scratch in the future.) Both the original sort and future such sorts are made more efficient than current methods. 2= We use my method to sort two different tables. We now have these very efficient representations of a specific ordering on these tables. A join operation can now be done using these Btrees rather than the original data tables that involves less overhead than many current methods. 3= We have multiple such Btrees for the same data set representing sorts done using different fields (and therefore different Keys). 
Calculating a sorted order for the data based on a composition of those Keys is now cheaper than doing the sort based on the composite Key from scratch. When some of the Btrees exist and some of them do not, there is a tradeoff calculation to be made. Sometimes it will be cheaper to do the sort from scratch using the composite Key. Specifically I'd like to see some cases where this would beat sequential scan. I'm thinking that in your example of a terabyte table with a column having only two values, all the queries I can think of would be better served with a sequential scan. In my original example, a sequential scan of the 1TB of 2KB or 4KB records, = 250M or 500M records of data, being sorted on a binary value key will take ~1000x more time than reading in the ~1GB Btree I described that used a Key+RID (plus node pointers) representation of the data. Just to clarify the point further, 1TB of 1-byte records = 2^40 records of at most 256 distinct values. 1TB of 2-byte records = 2^39 records of at most 2^16 distinct values. 1TB of 4-byte records = 2^38 records of at most 2^32 distinct values. 1TB of 5-byte records = ~200 billion records of at most ~200 billion distinct values. From here on, the number of possible distinct values is limited by the number of records. 100-byte records are used in the Indy version of Jim Gray's sorting contests, so 1TB = ~10 billion records. 2KB-4KB is the most common record size I've seen in enterprise class DBMS (so I used this value to make my initial example more realistic). Therefore the vast majority of the time representing a data set by Key will use less space than the original record. Less space used means less IO to scan the data set, which means faster scan times. This is why index files work in the first place, right? Perhaps I believe this because you can now buy as much sequential I/O as you want. Random I/O is the only real savings. 1= No, you can not buy as much sequential IO as you want. Even with an infinite budget, there are physical and engineering limits.
Long before you reach those limits, you will pay exponentially increasing costs for linearly increasing performance gains. So even if you _can_ buy a certain level of sequential IO, it may not be the most efficient way to spend money. 2= Most RW IT professionals have far from an infinite budget. Just traffic on these lists shows how severe the typical cost constraints usually are. OTOH, if you have an infinite IT budget, care to help a few less fortunate than yourself? After all, even a large constant subtracted from infinity is still infinity... ;-) 3= No matter how fast you can do IO, IO remains the most expensive part of the performance equation. The fastest and cheapest IO you can do is _no_ IO. As long as we trade cheaper RAM and even cheaper CPU operations for IO correctly, more space efficient data representations will always be a Win because of this.
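The record-count figures given earlier in this message can be reproduced with a few lines of arithmetic. The sketch assumes a binary terabyte (2^40 bytes), which matches the 2^40, 2^39 and 2^38 figures, and treats the whole record as the key when counting possible distinct values; the function names are made up for the example.

```python
TB = 2 ** 40  # bytes; binary terabyte, matching the 2^40 figure above


def record_count(record_bytes: int) -> int:
    """Number of records of the given size that fit in 1TB."""
    return TB // record_bytes


def max_distinct(record_bytes: int) -> int:
    """Upper bound on distinct values: limited by the bits available in a
    record, and by the number of records actually present."""
    return min(256 ** record_bytes, record_count(record_bytes))


for size in (1, 2, 4, 5, 100, 2048, 4096):
    print(f"{size:>5} byte records: {record_count(size):>16,} records, "
          f"at most {max_distinct(size):,} distinct values")
```

From 5-byte records upward the record count, not the record width, is the binding limit, which is the crossover the message describes.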
Re: [HACKERS] [PERFORM] A Better External Sort?
In the interest of efficiency and not reinventing the wheel, does anyone know where I can find C or C++ source code for a Btree variant with the following properties:

A= Data elements (RIDs) are only stored in the leaves, Keys (actually KeyPrefixes; see D below) and Node pointers are only stored in the internal nodes of the Btree.
B= Element redistribution is done as an alternative to node splitting in overflow conditions during Inserts whenever possible.
C= Variable length Keys are supported.
D= Node buffering with a reasonable replacement policy is supported.
E= Since we will know beforehand exactly how many RIDs will be stored, we will know a priori how much space will be needed for leaves, and will know the worst case for how much space will be required for the Btree internal nodes as well. This implies that we may be able to use an array, rather than linked list, implementation of the Btree. Less pointer chasing at the expense of more CPU calculations, but that's a trade-off in the correct direction.

Such source would be a big help in getting a prototype together. Thanks in advance for any pointers or source, Ron
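To make point E a little more concrete, here is one common way an array layout can replace node pointers: when the tree is complete and every node has the same fanout, child and parent positions can be computed instead of stored. This is only a sketch of that general technique, not a design anyone in the thread proposed; the fixed fanout and both function names are assumptions.

```python
def child_index(node: int, slot: int, fanout: int) -> int:
    """Array position of the slot-th child (0-based) of a node, with the
    root at position 0 and nodes laid out level by level."""
    if not 0 <= slot < fanout:
        raise ValueError("slot out of range")
    return node * fanout + 1 + slot


def parent_index(node: int, fanout: int) -> int:
    """Array position of a node's parent; the root has no parent."""
    if node == 0:
        raise ValueError("root has no parent")
    return (node - 1) // fanout


# With fanout 4, the root's children sit at positions 1..4 and the
# children of node 1 at positions 5..8.
print([child_index(0, s, 4) for s in range(4)])
print([child_index(1, s, 4) for s in range(4)])
```

The trade is exactly the one described: a multiply and an add per step instead of following a stored pointer.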
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron, I've somehow missed part of this thread, which is a shame since this is an area of primary concern for me. Your suggested algorithm seems to be designed to relieve I/O load by making more use of the CPU. (if I followed it correctly). However, that's not PostgreSQL's problem; currently for us external sort is a *CPU-bound* operation, half of which is value comparisons. (oprofiles available if anyone cares) So we need to look, instead, at algorithms which make better use of work_mem to lower CPU activity, possibly even at the expense of I/O. --Josh Berkus
Re: [HACKERS] [PERFORM] A Better External Sort?
From: Josh Berkus josh@agliodbs.com Sent: Sep 27, 2005 12:15 PM To: Ron Peacetree [EMAIL PROTECTED] Subject: Re: [HACKERS] [PERFORM] A Better External Sort? I've somehow missed part of this thread, which is a shame since this is an area of primary concern for me. Your suggested algorithm seems to be designed to relieve I/O load by making more use of the CPU. (if I followed it correctly). The goal is to minimize all IO load. Not just HD IO load, but also RAM IO load. Particularly random access IO load of any type (for instance: the pointer chasing problem). In addition, the design replaces explicit data or explicit key manipulation with the creation of a smaller, far more CPU and IO efficient data structure (essentially a CPU cache friendly Btree index) of the sorted order of the data. That Btree can be used to generate a physical reordering of the data in one pass, but that's the weakest use for it. The more powerful uses involve allowing the Btree to persist and using it for more efficient re-searches or combining it with other such Btrees (either as a step in task distribution across multiple CPUs or as a more efficient way to do things like joins by manipulating these Btrees rather than the actual records.) However, that's not PostgreSQL's problem; currently for us external sort is a *CPU-bound* operation, half of which is value comparisons. (oprofiles available if anyone cares) So we need to look, instead, at algorithms which make better use of work_mem to lower CPU activity, possibly even at the expense of I/O. I suspect that even the highly efficient sorting code we have is suffering more pessimal CPU IO behavior than what I'm presenting. Jim Gray's external sorting contest web site points out that memory IO has become a serious problem for most of the contest entries. Also, I'll bet the current code manipulates more data. Finally, there's the possibility of reusing the product of this work to a degree and in ways that we can't with our current sorting code.
Now all we need is resources and time to create a prototype. Since I'm not likely to have either any time soon, I'm hoping that I'll be able to explain this well enough that others can test it. *sigh* I _never_ have enough time or resources any more... Ron
Re: [HACKERS] [PERFORM] A Better External Sort?
On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote: That Btree can be used to generate a physical reordering of the data in one pass, but that's the weakest use for it. The more powerful uses involve allowing the Btree to persist and using it for more efficient re-searches or combining it with other such Btrees (either as a step in task distribution across multiple CPUs or as a more efficient way to do things like joins by manipulating these Btrees rather than the actual records.) Maybe you could describe some concrete use cases. I can see what you are getting at, and I can imagine some advantageous uses, but I'd like to know what you are thinking. Specifically I'd like to see some cases where this would beat sequential scan. I'm thinking that in your example of a terabyte table with a column having only two values, all the queries I can think of would be better served with a sequential scan. Perhaps I believe this because you can now buy as much sequential I/O as you want. Random I/O is the only real savings. -jwb
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron Peacetree [EMAIL PROTECTED] writes:

Let's start by assuming that an element is <= in size to a cache line and a node fits into L1 DCache. [ much else snipped ]

So far, you've blithely assumed that you know the size of a cache line, the sizes of L1 and L2 cache, and that you are working with sort keys that you can efficiently pack into cache lines. And that you know the relative access speeds of the caches and memory so that you can schedule transfers, and that the hardware lets you get at that transfer timing. And that the number of distinct key values isn't very large. I don't see much prospect that anything we can actually use in a portable fashion is going to emerge from this line of thought.

regards, tom lane
Re: [HACKERS] [PERFORM] A Better External Sort?
From: Dann Corbit [EMAIL PROTECTED]
Sent: Sep 26, 2005 5:13 PM
To: Ron Peacetree [EMAIL PROTECTED], pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

I think that the btrees are going to be O(n*log(n)) in construction of the indexes in disk access unless you memory map them [which means you would need stupendous memory volume] and so I cannot say that I really understand your idea yet.

Traditional algorithms for the construction of Btree variants (B, B+, B*, ...) don't require O(nlgn) HD accesses. These shouldn't either.

Let's start by assuming that an element is <= in size to a cache line and a node fits into L1 DCache. To make the discussion more concrete, I'll use a 64KB L1 cache + a 1MB L2 cache only as an example.

Simplest case: the Key has few enough distinct values that all Keys or KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's ~= 1000 different values. More if we can fit more than 1 element into each cache line.).

As we scan the data set coming in from HD, we compare the Key or KeyPrefix to the sorted list of Key values in the node. This can be done in O(lgn) using Binary Search or O(lglgn) using a variation of Interpolation Search.

If the Key value exists, we append this RID to the list of RIDs having the same Key: If the RAM buffer of this list of RIDs is full we append it and the current RID to the HD list of these RIDs. Else we insert this new key value into its proper place in the sorted list of Key values in the node and start a new list of RIDs for this Key value.

We allocate room for a CPU write buffer so we can schedule RAM writes to the RAM lists of RIDs so as to minimize the randomness of them.

When we are finished scanning the data set from HD, the sorted node with RID lists for each Key value contains the sort order for the whole data set.
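[Editorial sketch, not part of the original post.] The "simplest case" above can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions: the names `KeyNode`, `RID_BUFFER_CAPACITY`, and `sorted_rids` are hypothetical, an ordinary Python list stands in for the HD list of RIDs, and nothing here models cache lines or write scheduling.

```python
from bisect import bisect_left

RID_BUFFER_CAPACITY = 4  # hypothetical size of each per-key RAM buffer


class KeyNode:
    """One sorted node of distinct keys, each with its own list of RIDs."""

    def __init__(self):
        self.keys = []     # distinct key values, kept sorted
        self.buffers = []  # per-key RAM buffers of RIDs (parallel to keys)
        self.spilled = []  # per-key stand-ins for the HD lists of RIDs

    def add(self, key, rid):
        # Binary search for the key's position in the sorted key list.
        i = bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            # New key: insert it in sorted position and start its RID list.
            self.keys.insert(i, key)
            self.buffers.insert(i, [])
            self.spilled.insert(i, [])
        buf = self.buffers[i]
        if len(buf) >= RID_BUFFER_CAPACITY:
            # Buffer full: move it and the current RID to the "HD" list.
            self.spilled[i].extend(buf)
            self.spilled[i].append(rid)
            buf.clear()
        else:
            buf.append(rid)

    def sorted_rids(self):
        # Walking the keys in order yields the sort order of the data set.
        for key, spilled, buf in zip(self.keys, self.spilled, self.buffers):
            for rid in spilled + buf:
                yield key, rid


node = KeyNode()
for rid, key in enumerate(["b", "a", "b", "a", "c"]):
    node.add(key, rid)
order = [rid for _, rid in node.sorted_rids()]
```

In this sketch the records themselves are never moved; only keys and RIDs are touched, which is the point being argued in the post.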
Notice that almost all of the random data access is occurring within the CPU rather than in RAM or HD, and that we are accessing RAM or HD only when absolutely needed.

Next simplest case: Multiple nodes, but they all fit in the CPU cache(s). In the given example CPU, we will be able to fit at least 1000 elements per node and 2^20/2^16 = up to 16 such nodes in this CPU. We use a node's worth of space as a RAM write buffer, so we end up with room for 15 such nodes in this CPU. This is enough for a 2 level index to at least 15,000 distinct Key value lists. All of the traditional tricks for splitting a Btree node and redistributing elements within them during insertion or splitting for maximum node utilization can be used here.

The most general case: There are too many nodes to fit within the CPU cache(s). The root node now points to a maximum of at least 1000 nodes since each element in the root node points to another node. A full 2 level index is now enough to point to at least 10^6 distinct Key value lists, and 3 levels will index more distinct Key values than is possible in our 1TB, 500M record example. We can use some sort of node use prediction algorithm like LFU to decide which node should be moved out of CPU when we have to replace one of the nodes in the CPU. The nodes in RAM or on HD can be arranged to maximize streaming IO behavior and minimize random access IO behavior.

As you can see, both the RAM and HD IO are as minimized as possible, and what such IO there is has been optimized for streaming behavior.

Can you draw a picture of it for me? (I am dyslexic and understand things far better when I can visualize it).

Not much for pictures. Hopefully the explanation helps?

Ron
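[Editorial sketch, not part of the original post.] The capacity figures in the message above can be checked with a few lines of arithmetic. The cache sizes are the example values from the post (64KB L1, 1MB L2, 64B lines, 500M records); the variable names and the `levels_needed` helper are hypothetical, and one element per cache line is assumed.

```python
L1_BYTES = 64 * 1024      # example L1 DCache size from the post
L2_BYTES = 1024 * 1024    # example L2 cache size from the post
LINE_BYTES = 64           # example cache line size from the post
RECORDS = 500_000_000     # the 1TB, 500M record example

# One element per cache line, one node per L1-sized chunk.
elements_per_node = L1_BYTES // LINE_BYTES   # the post's "at least 1000"
nodes_in_l2 = L2_BYTES // L1_BYTES           # 2^20 / 2^16
resident_nodes = nodes_in_l2 - 1             # one node's worth is the write buffer
in_cache_key_lists = resident_nodes * elements_per_node


def levels_needed(distinct_keys, fanout):
    """Smallest number of index levels whose fanout**levels covers the keys."""
    levels, capacity = 1, fanout
    while capacity < distinct_keys:
        capacity *= fanout
        levels += 1
    return levels


two_level_capacity = elements_per_node ** 2
levels_for_example = levels_needed(RECORDS, elements_per_node)
```

Under these assumptions a two-level index covers a little over 10^6 key lists and three levels cover more than 10^9, which is consistent with the claim that three levels are enough even if all 500M keys are distinct.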
Re: [HACKERS] [PERFORM] A Better External Sort?
SECOND ATTEMPT AT POST. Web mailer appears to have eaten first one. I apologize in advance if anyone gets two versions of this post. =r

From: Tom Lane [EMAIL PROTECTED]
Sent: Sep 26, 2005 9:42 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

So far, you've blithely assumed that you know the size of a cache line, the sizes of L1 and L2 cache,

NO. I used exact values only as examples. Realistic examples drawn from an extensive survey of past, present, and what I could find out about future systems; but only examples nonetheless. For instance, Hennessy and Patterson 3ed points out that 64B cache lines are optimally performing for caches between 16KB and 256KB. The same source as well as sources specifically on CPU memory hierarchy design points out that we are not likely to see L1 caches larger than 256KB in the foreseeable future.

The important point was the idea of an efficient Key, rather than Record, sort using a CPU cache friendly data structure with provably good space and IO characteristics based on a reasonable model of current and likely future single box computer architecture (although it would be fairly easy to extend it to include the effects of networking.) No a priori exact or known values are required for the method to work.

and that you are working with sort keys that you can efficiently pack into cache lines.

Not pack. map. n items cannot take on more than n values. n values can be represented in lgn bits. Less efficient mappings can also work. Either way I demonstrated that we have plenty of space in a likely and common cache line size. Creating a mapping function to represent m values in lgm bits is a well known hack, and if we keep track of minimum and maximum values for fields during insert and delete operations, we can even create mapping functions fairly easily. (IIRC, Oracle does keep track of minimum and maximum field values.)
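[Editorial sketch, not part of the original post.] The "map, not pack" idea above can be illustrated as follows. This is a hedged illustration only: the function names are hypothetical, the dense-code mapping is just one simple order-preserving choice, and the 64-byte line is the example size used elsewhere in the thread.

```python
CACHE_LINE_BITS = 64 * 8  # the thread's example 64B cache line


def bits_needed(m):
    """Bits required to give each of m distinct values its own code."""
    return max(1, (m - 1).bit_length())


def build_code_map(values):
    """Map each distinct value to a small integer code.

    Codes are assigned in sorted order, so comparing two codes gives the
    same result as comparing the original values.
    """
    return {v: code for code, v in enumerate(sorted(set(values)))}


def range_code(value, lo):
    """Min/max style mapping for integer fields: offset from the minimum."""
    return value - lo


def codes_per_cache_line(m):
    """How many codes for m distinct values fit in one cache line."""
    return CACHE_LINE_BITS // bits_needed(m)


codes = build_code_map(["pear", "apple", "fig", "apple"])
```

For example, with about 1000 distinct values each code needs 10 bits, so several dozen codes fit in a single 64-byte line in this sketch.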
And that you know the relative access speeds of the caches and memory so that you can schedule transfers,

Again, no. I created a reasonable model of a computer system that holds remarkably well over a _very_ wide range of examples. I don't need the numbers to be exactly right to justify my approach to this problem or understand why other approaches may have downsides. I just have to get the relative performance of the system components and the relative performance gap between them reasonably correct. The stated model does that very well. Please don't take my word for it. Go grab some random box: laptop, desktop, unix server, etc and try it for yourself. Part of the reason I published the model was so that others could examine it.

and that the hardware lets you get at that transfer timing.

Never said anything about this, and in fact I do not need any such.

And that the number of distinct key values isn't very large.

Quite the opposite in fact. I went out of my way to show that the method still works well even if every Key is distinct. It is _more efficient_ when the number of distinct keys is small compared to the number of data items, but it works as well as any other Btree would when all n of the Keys are distinct. This is just a CPU cache and more IO friendly Btree, not some magical and unheard of technique. It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer systems architecture and coming up with a slight twist on how to use already well known and characterized techniques, not trying to start a revolution. I'm trying very hard NOT to waste anyone's time around here. Including my own.

Ron