> -----Original Message-----
> From: Tom Lane [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 08, 2006 3:17 PM
> To: Dann Corbit
> Cc: Jim C. Nasby; Luke Lonergan; Simon Riggs; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> "Dann Corbit" <[EMAIL PROTECTED]> writes:
> > Here are some suggestions of things that I know work really, really
> > well:
> > #1. Two pass merge (none of that silly poly-tape merge goo)
>
> This amounts to an assumption that you have infinite work_mem, in which
> case you hardly need an external sort at all. If your work_mem is in
> fact finite, then at some point you need more than two passes. I'm not
> really interested in ripping out support for sort operations that are
> much larger than work_mem.
No, it does not. I have explained this before. You can have one million
files and merge them all into a final output with a single pass. It does
not matter how big they are or how much memory you have.

The idea is very simple. Each subfile has its top record inserted into a
priority queue of file handles (or whatever else you want to use -- temp
tables, you name it). When you extract the smallest record from the
queue, the priority changes and that file handle gets moved to a new
place in the queue. You keep pulling records from the queue until the
entire queue is empty.

The outline is like this:

1. Sort chunks
2. Write chunks
3. Insert top record of each chunk into priority queue
4. Extract records from queue, writing them to final output
5. Repeat step 4 until queue is empty

> > #2. Load ONLY the keys that are to be sorted into memory. Use a
> > pointer exchange sort, and do not move the physical rows of data at
> > all.
>
> This suggestion isn't a whole lot better; in general the rows to be
> sorted don't exist until we compute them, and so proposing that we
> "don't load them until later" is pretty much irrelevant. Also, in
> a lot of common cases the keys to be sorted are the bulk of the data
> anyway.

This suggestion is in addition to suggestion 1. They are not even
related, except that both suggestions make the sort run a lot faster.

I think I did not explain it clearly enough. Suppose that you have a set
of rows you need to sort. Instead of loading the whole rows into memory,
just load the columns (or parts of columns) that are being sorted. I
hope that is clearer now.
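The single-pass merge outlined in steps 1-5 can be sketched in a few lines. This is a minimal illustration, not PostgreSQL's actual tuplesort code: each "run" stands in for a sorted chunk file, and only the top record of each run is held in the queue at once.

```python
import heapq

def merge_runs(runs):
    """K-way merge of pre-sorted runs via a priority queue (steps 3-5).

    Each run is any iterable yielding records in sorted order; on disk
    these would be the sorted chunk files from steps 1-2. Assumes
    records are comparable and never None (a simplification for the
    sketch -- real code would use per-run exhaustion flags instead).
    """
    iters = [iter(r) for r in runs]
    heap = []
    # Step 3: insert the top record of each run into the queue,
    # tagged with the run index so we know where to refill from.
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    out = []
    # Steps 4-5: pop the smallest record, write it out, and push the
    # next record from the same run, until the queue is empty.
    while heap:
        rec, i = heapq.heappop(heap)
        out.append(rec)
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out
```

Memory use is one record per run plus the queue itself, regardless of how large the runs are, which is the point being made above.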
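Suggestion #2 can likewise be sketched as sorting an array of row indices (stand-ins for pointers) by the key column alone, so the physical rows never move. The function name and column-index parameter are illustrative, not from any real API.

```python
def sort_by_key_indirect(rows, key_col):
    """Pointer-exchange sort sketch: load only the sort keys, sort an
    index array, and leave the physical rows untouched."""
    keys = [row[key_col] for row in rows]   # only the keys in memory
    order = sorted(range(len(rows)), key=keys.__getitem__)
    return order   # rows[i] for i in order yields rows in sorted order
```

Only the key column is compared and only the small index array is permuted, which is why this wins when rows are wide relative to their keys.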