On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote:
> Yeah, that was me, and it came out of actual user complaints ten or more
> years back. (It's actually not 2X growth but more like 4X growth
> according to the comments in logtape.c, though I no longer remember the
> exact reasons why.) We knew when we put in the logtape logic that we
> were trading off speed for space, and we accepted that.
I skimmed through TAOCP, and I didn't find the 4X number you are referring to, nor can I think of what would cause it. The exact wording of the comment in logtape.c is "4X the actual data volume", so maybe that's just referring to per-tuple overhead?

However, I also noticed that section 5.4.4 (Vol 3 p299) starts discussing the idea of running the tapes backwards and forwards. That doesn't directly apply, because a disk seek is cheaper than rewinding a tape, but perhaps it could be adapted to get the block-freeing behavior we want. The comments in logtape.c say:

"Few OSes allow arbitrary parts of a file to be released back to the OS, so we have to implement this space-recycling ourselves within a single logical file."

But if we alternate between reading in forward and reverse order, we can make all of the deallocations happen at the end of the file, and then just truncate to free the space. I would think that the OS could be more intelligent than we are about block allocations and deallocations, so this should avoid too much fragmentation, and it would probably be a net reduction in code complexity. Again, the comments in logtape.c have something to say about it:

"...but the seeking involved should be comparable to what would happen if we kept each logical tape in a separate file"

But I don't understand why that is the case.

On another topic: quite a while ago, Len Shapiro (a PSU professor) suggested to me that we implement Algorithm F (Vol 3 p321). Right now, as tuplesort.c says:

"In the current code we determine the number of tapes M on the basis of workMem: we want workMem/M to be large enough that we read a fair amount of data each time we preread from a tape"

But with forecasting, we can be smarter about which tapes we preread from when the data in the runs is not random. That means we could potentially merge more runs at once with the same work_mem without sacrificing adequate buffers for prefetching.
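To make the truncation idea concrete, here is a toy sketch (Python rather than C, purely for brevity; the block size constant and function name are hypothetical, not anything in logtape.c). If a run's blocks are consumed back-to-front, every block we free is at the current end of the file, so a plain truncate hands the space straight back to the OS with no internal free-block recycling:

```python
import os
import tempfile

BLOCK = 8192  # hypothetical block size; logtape.c works in BLCKSZ blocks

def read_run_backward_and_truncate(path, nblocks):
    """Consume a run's blocks in reverse order, truncating as we go.

    Because the block just consumed is always the file's last block,
    truncate() releases its space to the OS immediately.
    """
    blocks = []
    with open(path, "r+b") as f:
        for i in reversed(range(nblocks)):
            f.seek(i * BLOCK)
            blocks.append(f.read(BLOCK))
            f.truncate(i * BLOCK)  # release the block just consumed
    return blocks

# Demo: write three blocks, consume them in reverse, file ends empty.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    for tag in (b"a", b"b", b"c"):
        f.write(tag * BLOCK)
blocks = read_run_backward_and_truncate(path, 3)
remaining = os.path.getsize(path)
os.remove(path)
```

Of course the real code would have to interleave this with the merge itself, and runs written forward would need to be read backward on the next pass, which is where the alternation comes in.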
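For what it's worth, the core of the forecasting rule is easy to state: among the per-run input buffers of a merge, the first one to be exhausted is the one whose last (largest) buffered key is smallest, because every key in that buffer must be emitted before any other run can supply a key beyond that value. So that is the tape to preread from next. A toy sketch (Python, names hypothetical):

```python
def next_run_to_preread(buffers):
    """Forecasting rule from Knuth's Algorithm F, sketched in Python.

    Each entry in `buffers` is the sorted block currently buffered for
    one run.  The buffer whose last (largest) key is smallest will be
    drained first, so its run is the one to preread from next.
    """
    return min(range(len(buffers)), key=lambda i: buffers[i][-1])

# Three runs, one buffered block each: run 1's block tops out at 7,
# the smallest of the three maxima, so run 1 will run dry first.
bufs = [[2, 5, 9], [1, 4, 7], [3, 6, 12]]
picked = next_run_to_preread(bufs)  # 1
```

With that prediction in hand, a single preread buffer can be steered to whichever tape needs it next, instead of reserving a large buffer per tape out of workMem/M.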
I'm not sure whether this is a practical problem today, and I'm also not sure what to do if we start merging a lot more runs and then determine that forecasting doesn't work as well as we'd hoped (e.g. that the data in the runs really is random). But I thought it was worth mentioning.

Regards,
	Jeff Davis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers