The enclosed patch substantially improves large sort performance in the
general case:

cvstip: elapsed 5693 sec, CPU 196 sec
patch:  elapsed 4132 sec, CPU 90 sec

The patch implements a dynamically increasing number of logical tapes when
sufficient memory is available to make that efficient. cvstip runs with
a static number of tapes (7), whereas the patch was able to allocate 104
tapes to the task. This has the effect of almost completely removing
intermediate merge runs, hence the increased performance.
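
In outline, the tape-count calculation works as follows (a minimal
standalone sketch of what the patch does in tuplesort.c; the helper name
choose_max_tapes and the main() driver are purely illustrative):

    #include <stdio.h>

    #define BLCKSZ                      8192            /* PostgreSQL's default block size */
    #define MINTAPES                    7               /* pre-8.2 static tape count */
    #define OPTIMAL_MERGE_BUFFER_SIZE   (BLCKSZ * 32)   /* 256kB preread buffer per tape */

    /* Derive the number of logical tapes from the sort memory budget (bytes). */
    static int
    choose_max_tapes(long allowedMem)
    {
        int     ntapes = (int) (allowedMem / OPTIMAL_MERGE_BUFFER_SIZE);

        if (ntapes < MINTAPES)      /* never drop below Knuth's seven-tape sweet spot */
            ntapes = MINTAPES;
        return ntapes;
    }

    int
    main(void)
    {
        /* work_mem = 65536kB as in the test below: 64MB / 256kB = 256 tapes */
        printf("%d tapes\n", choose_max_tapes(65536L * 1024));
        return 0;
    }

With the test setting used below (work_mem = 65536) this yields the
ntapes=256 reported in the patched trace_sort log.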

From Jeffrey W. Baker's original idea
http://archives.postgresql.org/pgsql-performance/2005-09/msg00430.php
and followup comments
http://archives.postgresql.org/pgsql-hackers/2005-10/msg00015.php

It is expected this will substantially improve performance for large
ORDER BY, GROUP BY and CREATE INDEX statements.

The guesstimated default OPTIMAL_MERGE_BUFFER_SIZE of 262144 bytes (256kB)
means that the default work_mem setting will still use only 7 tapes, though
setting work_mem > 2MB will yield improvements.
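
For example (simple arithmetic from the formula above, MaxTapes =
work_mem / OPTIMAL_MERGE_BUFFER_SIZE, clamped to a minimum of 7):

    work_mem = 1024kB (default):   1MB / 256kB =   4  -> clamped to 7 tapes
    work_mem = 2048kB:             2MB / 256kB =   8 tapes
    work_mem = 65536kB (test):    64MB / 256kB = 256 tapes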

Further testing and/or patch comments are requested.

All changes are isolated to src/backend/utils/sort/tuplesort.c
Patch applies cleanly and passes make check on cvstip (though this code
path is not tested in the regression tests anyway).

Test details:
Ran the following sort on my laptop (512MB RAM):

postgres=# set work_mem=65536;
SET
Time: 0.801 ms
postgres=# select * from d order by 1,2 limit 1;
 col1 |          col2
------+-------------------------
    1 | eeeeeeseeeeeeeeeeeeeeee
(1 row)

Time: 4133122.769 ms

postgres=# \d d
       Table "public.d"
 Column |  Type   | Modifiers
--------+---------+-----------
 col1   | integer |
 col2   | text    |

postgres=# select count(*) from d;
   count
-----------
 100000000
(1 row)

Time: 248283.128 ms
postgres=# select pg_relation_size('d');
 pg_relation_size
------------------
       6450397184
(1 row)

Time: 98.629 ms
postgres=#

trace_sort was enabled for both runs and the logs are included below. The
test data was anti-sorted, but the ordering of the data is not relevant to
the algorithm, except when the data is already almost perfectly sorted, in
which case there is typically only one run.

Best Regards, Simon Riggs

LOG:  begin tuple sort: nkeys = 2, workMem = 65536, randomAccess = t
LOG:  switching to external sort: CPU 0.25s/0.39u sec elapsed 1.18 sec
LOG:  finished writing run 1: CPU 0.79s/15.29u sec elapsed 19.75 sec
LOG:  finished writing run 2: CPU 1.29s/25.13u sec elapsed 33.81 sec
LOG:  finished writing run 3: CPU 1.75s/33.26u sec elapsed 45.11 sec
LOG:  finished writing run 4: CPU 2.42s/41.00u sec elapsed 58.49 sec
LOG:  finished writing run 5: CPU 3.04s/48.57u sec elapsed 70.06 sec
LOG:  finished writing run 6: CPU 3.67s/55.95u sec elapsed 81.97 sec
LOG:  finished writing run 7: CPU 4.31s/63.22u sec elapsed 94.80 sec
LOG:  finished writing run 8: CPU 4.89s/70.54u sec elapsed 106.45 sec
LOG:  finished writing run 9: CPU 5.55s/77.76u sec elapsed 119.85 sec
LOG:  finished writing run 10: CPU 6.19s/85.27u sec elapsed 132.30 sec
LOG:  finished writing run 11: CPU 6.81s/92.68u sec elapsed 145.78 sec
LOG:  finished writing run 12: CPU 7.47s/99.63u sec elapsed 159.01 sec
LOG:  finished writing run 13: CPU 8.11s/106.92u sec elapsed 170.88 sec
LOG:  finished writing run 14: CPU 8.76s/113.67u sec elapsed 184.52 sec
LOG:  finished writing run 15: CPU 9.39s/120.32u sec elapsed 196.05 sec
LOG:  finished writing run 16: CPU 10.04s/127.61u sec elapsed 211.77 sec
LOG:  finished writing run 17: CPU 10.67s/134.80u sec elapsed 224.49 sec
LOG:  finished writing run 18: CPU 11.27s/141.30u sec elapsed 234.93 sec
LOG:  finished writing run 19: CPU 11.82s/147.79u sec elapsed 245.44 sec
LOG:  finished writing run 20: CPU 12.40s/154.45u sec elapsed 256.55 sec
LOG:  finished writing run 21: CPU 13.05s/160.99u sec elapsed 268.55 sec
LOG:  finished writing run 22: CPU 13.69s/168.26u sec elapsed 280.16 sec
LOG:  finished writing run 23: CPU 14.29s/175.05u sec elapsed 291.57 sec
LOG:  finished writing run 24: CPU 14.95s/182.21u sec elapsed 304.59 sec
LOG:  finished writing run 25: CPU 15.52s/188.82u sec elapsed 315.75 sec
LOG:  finished writing run 26: CPU 16.23s/196.91u sec elapsed 329.30 sec
LOG:  finished writing run 27: CPU 16.82s/203.89u sec elapsed 340.90 sec
LOG:  finished writing run 28: CPU 17.48s/210.93u sec elapsed 354.44 sec
LOG:  finished writing run 29: CPU 18.07s/217.16u sec elapsed 365.60 sec
LOG:  finished writing run 30: CPU 18.69s/223.92u sec elapsed 377.15 sec
LOG:  finished writing run 31: CPU 19.39s/231.10u sec elapsed 391.09 sec
LOG:  finished writing run 32: CPU 20.03s/237.83u sec elapsed 404.30 sec
LOG:  finished writing run 33: CPU 20.67s/245.04u sec elapsed 417.78 sec
LOG:  finished writing run 34: CPU 21.35s/253.22u sec elapsed 431.70 sec
LOG:  finished writing run 35: CPU 22.02s/259.88u sec elapsed 444.64 sec
LOG:  finished writing run 36: CPU 22.63s/266.52u sec elapsed 455.38 sec
LOG:  finished writing run 37: CPU 23.21s/273.07u sec elapsed 466.02 sec
LOG:  finished writing run 38: CPU 23.85s/280.20u sec elapsed 477.81 sec
LOG:  finished writing run 39: CPU 24.47s/286.92u sec elapsed 490.00 sec
LOG:  finished writing run 40: CPU 25.13s/294.33u sec elapsed 502.36 sec
LOG:  finished writing run 41: CPU 25.79s/301.49u sec elapsed 515.60 sec
LOG:  finished writing run 42: CPU 26.42s/308.47u sec elapsed 527.15 sec
LOG:  finished writing run 43: CPU 27.07s/315.10u sec elapsed 540.04 sec
LOG:  finished writing run 44: CPU 27.65s/321.53u sec elapsed 551.21 sec
LOG:  finished writing run 45: CPU 28.24s/328.55u sec elapsed 562.53 sec
LOG:  finished writing run 46: CPU 28.97s/336.26u sec elapsed 576.52 sec
LOG:  finished writing run 47: CPU 29.55s/342.87u sec elapsed 587.49 sec
LOG:  finished writing run 48: CPU 30.19s/349.29u sec elapsed 600.15 sec
LOG:  finished writing run 49: CPU 30.77s/356.16u sec elapsed 611.30 sec
LOG:  finished writing run 50: CPU 31.37s/362.60u sec elapsed 623.15 sec
LOG:  finished writing run 51: CPU 31.99s/369.28u sec elapsed 635.14 sec
LOG:  finished writing run 52: CPU 32.59s/376.67u sec elapsed 647.61 sec
LOG:  finished writing run 53: CPU 33.21s/383.23u sec elapsed 659.79 sec
LOG:  finished writing run 54: CPU 33.81s/389.43u sec elapsed 669.98 sec
LOG:  finished writing run 55: CPU 34.40s/396.33u sec elapsed 681.33 sec
LOG:  finished writing run 56: CPU 34.97s/403.27u sec elapsed 692.09 sec
LOG:  finished writing run 57: CPU 35.55s/409.43u sec elapsed 702.82 sec
LOG:  finished writing run 58: CPU 36.12s/415.79u sec elapsed 714.68 sec
LOG:  finished writing run 59: CPU 36.72s/422.38u sec elapsed 725.80 sec
LOG:  finished writing run 60: CPU 37.30s/429.51u sec elapsed 737.33 sec
LOG:  finished writing run 61: CPU 37.93s/435.85u sec elapsed 749.49 sec
LOG:  finished writing run 62: CPU 38.71s/444.21u sec elapsed 765.26 sec
LOG:  finished writing run 63: CPU 39.26s/450.65u sec elapsed 776.38 sec
LOG:  finished writing run 64: CPU 39.84s/457.17u sec elapsed 787.53 sec
LOG:  finished writing run 65: CPU 40.47s/463.53u sec elapsed 800.61 sec
LOG:  finished writing run 66: CPU 41.01s/470.02u sec elapsed 812.05 sec
LOG:  finished writing run 67: CPU 41.72s/476.66u sec elapsed 825.01 sec
LOG:  finished writing run 68: CPU 42.38s/484.39u sec elapsed 837.59 sec
LOG:  finished writing run 69: CPU 43.04s/490.90u sec elapsed 850.66 sec
LOG:  finished writing run 70: CPU 43.65s/497.70u sec elapsed 862.45 sec
LOG:  finished writing run 71: CPU 44.31s/504.37u sec elapsed 874.99 sec
LOG:  finished writing run 72: CPU 44.89s/510.81u sec elapsed 886.81 sec
LOG:  finished writing run 73: CPU 45.44s/517.28u sec elapsed 898.07 sec
LOG:  finished writing run 74: CPU 46.10s/524.10u sec elapsed 911.08 sec
LOG:  finished writing run 75: CPU 46.65s/530.52u sec elapsed 922.09 sec
LOG:  finished writing run 76: CPU 47.26s/537.12u sec elapsed 935.34 sec
LOG:  finished writing run 77: CPU 47.83s/543.77u sec elapsed 946.49 sec
LOG:  finished writing run 78: CPU 48.48s/550.73u sec elapsed 958.10 sec
LOG:  finished writing run 79: CPU 49.14s/557.54u sec elapsed 970.90 sec
LOG:  finished writing run 80: CPU 49.77s/564.60u sec elapsed 982.79 sec
LOG:  finished writing run 81: CPU 50.42s/571.48u sec elapsed 995.26 sec
LOG:  finished writing run 82: CPU 51.04s/578.56u sec elapsed 1006.57 sec
LOG:  finished writing run 83: CPU 51.64s/585.34u sec elapsed 1017.57 sec
LOG:  finished writing run 84: CPU 52.23s/591.96u sec elapsed 1029.00 sec
LOG:  finished writing run 85: CPU 52.84s/598.50u sec elapsed 1041.13 sec
LOG:  finished writing run 86: CPU 53.45s/605.38u sec elapsed 1052.22 sec
LOG:  finished writing run 87: CPU 54.11s/612.66u sec elapsed 1064.52 sec
LOG:  finished writing run 88: CPU 54.73s/620.00u sec elapsed 1076.61 sec
LOG:  finished writing run 89: CPU 55.33s/626.51u sec elapsed 1087.48 sec
LOG:  finished writing run 90: CPU 55.92s/632.85u sec elapsed 1098.38 sec
LOG:  finished writing run 91: CPU 56.48s/638.97u sec elapsed 1109.10 sec
LOG:  finished writing run 92: CPU 57.10s/645.54u sec elapsed 1121.25 sec
LOG:  finished writing run 93: CPU 57.69s/651.96u sec elapsed 1132.15 sec
LOG:  finished writing run 94: CPU 58.30s/658.61u sec elapsed 1143.81 sec
LOG:  finished writing run 95: CPU 58.91s/665.23u sec elapsed 1156.37 sec
LOG:  finished writing run 96: CPU 59.47s/671.91u sec elapsed 1167.75 sec
LOG:  finished writing run 97: CPU 60.18s/679.10u sec elapsed 1181.43 sec
LOG:  finished writing run 98: CPU 60.72s/685.90u sec elapsed 1193.10 sec
LOG:  finished writing run 99: CPU 61.32s/692.32u sec elapsed 1205.26 sec
LOG:  finished writing run 100: CPU 61.99s/700.06u sec elapsed 1218.48 sec
LOG:  finished writing run 101: CPU 62.55s/706.45u sec elapsed 1231.28 sec
LOG:  finished writing run 102: CPU 63.20s/714.00u sec elapsed 1243.58 sec
LOG:  finished writing run 103: CPU 63.83s/720.38u sec elapsed 1256.51 sec
LOG:  finished writing run 104: CPU 64.39s/727.04u sec elapsed 1267.64 sec
LOG:  performsort starting: CPU 64.83s/732.23u sec elapsed 1277.40 sec
LOG:  finished writing run 105: CPU 64.90s/733.34u sec elapsed 1278.85 sec
LOG:  finished writing final run 106: CPU 64.98s/734.78u sec elapsed 1280.42 sec
LOG:  finished merge step: CPU 65.20s/736.00u sec elapsed 1287.24 sec
LOG:  finished merge step: CPU 66.72s/763.05u sec elapsed 1338.02 sec
LOG:  finished merge step: CPU 68.56s/798.53u sec elapsed 1407.28 sec
LOG:  finished merge step: CPU 70.68s/830.08u sec elapsed 1471.06 sec
LOG:  finished merge step: CPU 72.78s/858.63u sec elapsed 1530.78 sec
LOG:  finished merge step: CPU 74.91s/886.03u sec elapsed 1595.45 sec
LOG:  finished merge step: CPU 77.16s/909.13u sec elapsed 1647.66 sec
LOG:  finished merge step: CPU 79.32s/933.85u sec elapsed 1704.74 sec
LOG:  finished merge step: CPU 81.15s/952.54u sec elapsed 1755.45 sec
LOG:  finished merge step: CPU 83.17s/972.56u sec elapsed 1803.80 sec
LOG:  finished merge step: CPU 85.23s/992.00u sec elapsed 1856.65 sec
LOG:  finished merge step: CPU 87.23s/1010.60u sec elapsed 1899.01 sec
LOG:  finished merge step: CPU 89.29s/1028.82u sec elapsed 1942.89 sec
LOG:  finished merge step: CPU 91.30s/1047.05u sec elapsed 1988.01 sec
LOG:  finished merge step: CPU 93.30s/1066.22u sec elapsed 2032.50 sec
LOG:  finished merge step: CPU 95.28s/1082.87u sec elapsed 2073.21 sec
LOG:  finished merge step: CPU 99.51s/1130.74u sec elapsed 2180.97 sec
LOG:  finished merge step: CPU 105.29s/1195.15u sec elapsed 2315.62 sec
LOG:  finished merge step: CPU 111.77s/1265.04u sec elapsed 2469.50 sec
LOG:  finished merge step: CPU 118.17s/1338.26u sec elapsed 2623.61 sec
LOG:  finished merge step: CPU 127.93s/1494.36u sec elapsed 3020.38 sec
LOG:  finished merge step: CPU 139.25s/1661.30u sec elapsed 3365.08 sec
LOG:  finished merge step: CPU 159.41s/2078.13u sec elapsed 4091.09 sec
LOG:  finished merge step: CPU 195.87s/3022.95u sec elapsed 5675.71 sec
LOG:  performsort done: CPU 195.87s/3022.95u sec elapsed 5675.96 sec
LOG:  external sort ended, 818272 disk blocks used: CPU 196.86s/3022.95u sec elapsed 5693.14 sec
LOG:  begin tuple sort: nkeys = 2, workMem = 65536, randomAccess = t
LOG:  switching to external sort: (ntapes=256) CPU 0.27s/0.36u sec elapsed 2.51 sec
LOG:  finished writing run 1 to tape 0: CPU 0.87s/15.48u sec elapsed 22.84 sec
LOG:  finished writing run 2 to tape 1: CPU 1.40s/25.37u sec elapsed 37.97 sec
LOG:  finished writing run 3 to tape 2: CPU 1.85s/33.86u sec elapsed 50.69 sec
LOG:  finished writing run 4 to tape 3: CPU 2.34s/41.56u sec elapsed 63.54 sec
LOG:  finished writing run 5 to tape 4: CPU 2.83s/49.14u sec elapsed 75.35 sec
LOG:  finished writing run 6 to tape 5: CPU 3.35s/56.28u sec elapsed 87.95 sec
LOG:  finished writing run 7 to tape 6: CPU 3.81s/62.99u sec elapsed 98.96 sec
LOG:  finished writing run 8 to tape 7: CPU 4.27s/70.02u sec elapsed 110.18 sec
LOG:  finished writing run 9 to tape 8: CPU 4.79s/77.75u sec elapsed 123.09 sec
LOG:  finished writing run 10 to tape 9: CPU 5.25s/84.90u sec elapsed 134.11 sec
LOG:  finished writing run 11 to tape 10: CPU 5.69s/92.15u sec elapsed 145.43 sec
LOG:  finished writing run 12 to tape 11: CPU 6.16s/98.78u sec elapsed 156.89 sec
LOG:  finished writing run 13 to tape 12: CPU 6.66s/105.80u sec elapsed 168.22 sec
LOG:  finished writing run 14 to tape 13: CPU 7.09s/112.09u sec elapsed 178.70 sec
LOG:  finished writing run 15 to tape 14: CPU 7.52s/118.75u sec elapsed 189.58 sec
LOG:  finished writing run 16 to tape 15: CPU 7.99s/125.66u sec elapsed 201.04 sec
LOG:  finished writing run 17 to tape 16: CPU 8.42s/132.20u sec elapsed 211.36 sec
LOG:  finished writing run 18 to tape 17: CPU 8.93s/138.88u sec elapsed 222.83 sec
LOG:  finished writing run 19 to tape 18: CPU 9.42s/146.30u sec elapsed 234.36 sec
LOG:  finished writing run 20 to tape 19: CPU 9.96s/154.64u sec elapsed 246.21 sec
LOG:  finished writing run 21 to tape 20: CPU 10.48s/162.00u sec elapsed 258.40 sec
LOG:  finished writing run 22 to tape 21: CPU 10.95s/168.79u sec elapsed 269.41 sec
LOG:  finished writing run 23 to tape 22: CPU 11.43s/175.66u sec elapsed 280.36 sec
LOG:  finished writing run 24 to tape 23: CPU 11.96s/183.36u sec elapsed 293.09 sec
LOG:  finished writing run 25 to tape 24: CPU 12.49s/191.35u sec elapsed 304.85 sec
LOG:  finished writing run 26 to tape 25: CPU 12.95s/197.84u sec elapsed 315.76 sec
LOG:  finished writing run 27 to tape 26: CPU 13.43s/204.41u sec elapsed 326.74 sec
LOG:  finished writing run 28 to tape 27: CPU 13.96s/210.98u sec elapsed 338.47 sec
LOG:  finished writing run 29 to tape 28: CPU 14.42s/218.45u sec elapsed 350.03 sec
LOG:  finished writing run 30 to tape 29: CPU 14.90s/225.54u sec elapsed 360.91 sec
LOG:  finished writing run 31 to tape 30: CPU 15.41s/232.16u sec elapsed 372.97 sec
LOG:  finished writing run 32 to tape 31: CPU 15.88s/239.26u sec elapsed 384.22 sec
LOG:  finished writing run 33 to tape 32: CPU 16.42s/247.06u sec elapsed 396.32 sec
LOG:  finished writing run 34 to tape 33: CPU 16.85s/253.60u sec elapsed 406.45 sec
LOG:  finished writing run 35 to tape 34: CPU 17.34s/259.90u sec elapsed 417.89 sec
LOG:  finished writing run 36 to tape 35: CPU 17.78s/266.25u sec elapsed 428.83 sec
LOG:  finished writing run 37 to tape 36: CPU 18.28s/273.26u sec elapsed 440.42 sec
LOG:  finished writing run 38 to tape 37: CPU 18.79s/280.80u sec elapsed 453.10 sec
LOG:  finished writing run 39 to tape 38: CPU 19.23s/287.43u sec elapsed 464.18 sec
LOG:  finished writing run 40 to tape 39: CPU 19.75s/294.92u sec elapsed 475.78 sec
LOG:  finished writing run 41 to tape 40: CPU 20.30s/302.87u sec elapsed 488.83 sec
LOG:  finished writing run 42 to tape 41: CPU 20.73s/309.53u sec elapsed 499.68 sec
LOG:  finished writing run 43 to tape 42: CPU 21.16s/316.33u sec elapsed 510.49 sec
LOG:  finished writing run 44 to tape 43: CPU 21.60s/322.77u sec elapsed 521.30 sec
LOG:  finished writing run 45 to tape 44: CPU 22.07s/329.97u sec elapsed 533.82 sec
LOG:  finished writing run 46 to tape 45: CPU 22.55s/336.69u sec elapsed 544.18 sec
LOG:  finished writing run 47 to tape 46: CPU 23.01s/343.78u sec elapsed 558.01 sec
LOG:  finished writing run 48 to tape 47: CPU 23.42s/350.44u sec elapsed 567.56 sec
LOG:  finished writing run 49 to tape 48: CPU 23.92s/357.25u sec elapsed 579.08 sec
LOG:  finished writing run 50 to tape 49: CPU 24.41s/364.36u sec elapsed 590.24 sec
LOG:  finished writing run 51 to tape 50: CPU 24.85s/370.99u sec elapsed 600.21 sec
LOG:  finished writing run 52 to tape 51: CPU 25.33s/377.37u sec elapsed 611.31 sec
LOG:  finished writing run 53 to tape 52: CPU 25.77s/383.96u sec elapsed 621.48 sec
LOG:  finished writing run 54 to tape 53: CPU 26.23s/390.35u sec elapsed 631.59 sec
LOG:  finished writing run 55 to tape 54: CPU 26.76s/397.31u sec elapsed 642.27 sec
LOG:  finished writing run 56 to tape 55: CPU 27.29s/404.83u sec elapsed 654.39 sec
LOG:  finished writing run 57 to tape 56: CPU 27.79s/411.35u sec elapsed 664.62 sec
LOG:  finished writing run 58 to tape 57: CPU 28.27s/418.23u sec elapsed 674.97 sec
LOG:  finished writing run 59 to tape 58: CPU 28.67s/424.23u sec elapsed 684.80 sec
LOG:  finished writing run 60 to tape 59: CPU 29.12s/430.80u sec elapsed 695.20 sec
LOG:  finished writing run 61 to tape 60: CPU 29.56s/437.47u sec elapsed 705.60 sec
LOG:  finished writing run 62 to tape 61: CPU 29.96s/443.64u sec elapsed 715.52 sec
LOG:  finished writing run 63 to tape 62: CPU 30.38s/449.62u sec elapsed 725.42 sec
LOG:  finished writing run 64 to tape 63: CPU 30.83s/455.93u sec elapsed 735.48 sec
LOG:  finished writing run 65 to tape 64: CPU 31.26s/462.09u sec elapsed 745.27 sec
LOG:  finished writing run 66 to tape 65: CPU 31.68s/468.43u sec elapsed 755.42 sec
LOG:  finished writing run 67 to tape 66: CPU 32.20s/475.39u sec elapsed 766.57 sec
LOG:  finished writing run 68 to tape 67: CPU 32.71s/482.86u sec elapsed 777.49 sec
LOG:  finished writing run 69 to tape 68: CPU 33.15s/489.36u sec elapsed 788.99 sec
LOG:  finished writing run 70 to tape 69: CPU 33.61s/495.79u sec elapsed 799.57 sec
LOG:  finished writing run 71 to tape 70: CPU 34.09s/502.30u sec elapsed 810.14 sec
LOG:  finished writing run 72 to tape 71: CPU 34.58s/509.59u sec elapsed 821.10 sec
LOG:  finished writing run 73 to tape 72: CPU 35.02s/516.10u sec elapsed 831.36 sec
LOG:  finished writing run 74 to tape 73: CPU 35.49s/522.80u sec elapsed 842.09 sec
LOG:  finished writing run 75 to tape 74: CPU 35.93s/529.39u sec elapsed 852.23 sec
LOG:  finished writing run 76 to tape 75: CPU 36.41s/535.46u sec elapsed 862.33 sec
LOG:  finished writing run 77 to tape 76: CPU 36.85s/541.99u sec elapsed 872.75 sec
LOG:  finished writing run 78 to tape 77: CPU 37.27s/548.01u sec elapsed 882.64 sec
LOG:  finished writing run 79 to tape 78: CPU 37.74s/554.47u sec elapsed 893.38 sec
LOG:  finished writing run 80 to tape 79: CPU 38.23s/561.55u sec elapsed 904.87 sec
LOG:  finished writing run 81 to tape 80: CPU 38.68s/568.80u sec elapsed 915.56 sec
LOG:  finished writing run 82 to tape 81: CPU 39.15s/575.33u sec elapsed 925.97 sec
LOG:  finished writing run 83 to tape 82: CPU 39.61s/582.02u sec elapsed 936.73 sec
LOG:  finished writing run 84 to tape 83: CPU 40.16s/589.43u sec elapsed 949.33 sec
LOG:  finished writing run 85 to tape 84: CPU 40.57s/595.42u sec elapsed 958.63 sec
LOG:  finished writing run 86 to tape 85: CPU 41.01s/601.66u sec elapsed 969.48 sec
LOG:  finished writing run 87 to tape 86: CPU 41.51s/608.30u sec elapsed 980.47 sec
LOG:  finished writing run 88 to tape 87: CPU 41.96s/614.45u sec elapsed 990.19 sec
LOG:  finished writing run 89 to tape 88: CPU 42.38s/620.42u sec elapsed 1000.37 sec
LOG:  finished writing run 90 to tape 89: CPU 42.79s/626.83u sec elapsed 1011.85 sec
LOG:  finished writing run 91 to tape 90: CPU 43.23s/633.36u sec elapsed 1022.45 sec
LOG:  finished writing run 92 to tape 91: CPU 43.65s/639.65u sec elapsed 1032.89 sec
LOG:  finished writing run 93 to tape 92: CPU 44.05s/645.64u sec elapsed 1042.95 sec
LOG:  finished writing run 94 to tape 93: CPU 44.50s/651.86u sec elapsed 1054.17 sec
LOG:  finished writing run 95 to tape 94: CPU 44.93s/658.10u sec elapsed 1064.57 sec
LOG:  finished writing run 96 to tape 95: CPU 45.35s/664.39u sec elapsed 1075.14 sec
LOG:  finished writing run 97 to tape 96: CPU 45.82s/671.33u sec elapsed 1086.25 sec
LOG:  finished writing run 98 to tape 97: CPU 46.31s/678.27u sec elapsed 1096.75 sec
LOG:  finished writing run 99 to tape 98: CPU 46.77s/684.64u sec elapsed 1107.05 sec
LOG:  finished writing run 100 to tape 99: CPU 47.20s/691.16u sec elapsed 1117.65 sec
LOG:  finished writing run 101 to tape 100: CPU 47.64s/697.60u sec elapsed 1127.96 sec
LOG:  finished writing run 102 to tape 101: CPU 48.13s/704.17u sec elapsed 1138.55 sec
LOG:  finished writing run 103 to tape 102: CPU 48.58s/710.67u sec elapsed 1150.49 sec
LOG:  finished writing run 104 to tape 103: CPU 49.07s/717.84u sec elapsed 1162.72 sec
LOG:  performsort starting: CPU 49.44s/722.60u sec elapsed 1171.61 sec
LOG:  finished writing run 105 to tape 104: CPU 49.48s/723.73u sec elapsed 1172.82 sec
LOG:  finished writing final run 106 to tape 105: CPU 49.55s/725.12u sec elapsed 1174.37 sec
LOG:  finished merge step: CPU 89.50s/2159.13u sec elapsed 4116.80 sec
LOG:  performsort done: CPU 89.50s/2159.13u sec elapsed 4116.88 sec
LOG:  external sort ended, 818272 disk blocks used: CPU 90.47s/2159.13u sec elapsed 4132.91 sec
Index: src/backend/utils/sort/tuplesort.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.57
diff -c -r1.57 tuplesort.c
*** src/backend/utils/sort/tuplesort.c	5 Jan 2006 01:56:29 -0000	1.57
--- src/backend/utils/sort/tuplesort.c	25 Jan 2006 23:53:31 -0000
***************
*** 120,130 ****
  } TupSortStatus;
  
  /*
!  * We use a seven-tape polyphase merge, which is the "sweet spot" on the
!  * tapes-to-passes curve according to Knuth's figure 70 (section 5.4.2).
   */
! #define MAXTAPES		7		/* Knuth's T */
! #define TAPERANGE		(MAXTAPES-1)	/* Knuth's P */
  
  /*
   * Private state of a Tuplesort operation.
--- 120,186 ----
  } TupSortStatus;
  
  /*
!  * Knuth's results suggest that a seven-tape polyphase merge is the 
!  * "sweet spot" on the tapes-to-passes curve according to Knuth's 
!  * figure 70 (section 5.4.2). Prior to 8.2 we always used 7 logical tapes.
!  * From 8.2, if we have sufficient memory, we use more logical tapes
!  * because this reduces the cost of additional merge passes. But we must
!  * keep a reasonably sized preread buffer for each tape, or we lose the
!  * performance gains from sequential disk I/O (readahead). So there is an
!  * optimal merge buffer size reflecting the trade-off between these two 
!  * issues. 
!  *
!  * If we have fewer runs than we have tapes we can avoid merging
!  * runs altogether and move directly to the final merge. In this case, we
!  * can take advantage of a larger preread buffer. We also note that the
!  * wasted memory from unused Tapestates is unlikely to have negative 
!  * performance effects, so we do not need fully dynamic tape allocation.
!  * So we calculate NumTapes = MaxTapes at start, with MaxTapes >= MINTAPES
!  *
!  * The size of the Tapestate array is assumed to be small in comparison
!  * with the tuplearrays, so this is not tracked, just as in pre-8.2
!  * releases.
!  */
! int MaxTapes;		/* Knuth's T */
! int TapeRange;		/* Knuth's P */
! #define MINTAPES    7
! #define OPTIMAL_MERGE_BUFFER_SIZE   (BLCKSZ * 32)
! /*
!  * Private state of one logical tape in an external Tuplesort operation
   */
! typedef struct Tapestate
! {
! 	/*
! 	 * These variables are only used during merge passes.  mergeactive[i] is
! 	 * true if we are reading an input run from (actual) tape number i and
! 	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
! 	 * of the next pre-read tuple (next to be loaded into the heap) for tape
! 	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
! 	 * points to the last pre-read tuple from each tape. mergeavailmem[i] is
! 	 * the amount of unused space allocated for tape i. mergefreelist and
! 	 * mergefirstfree keep track of unused locations in the memtuples[] array.
! 	 * memtupindex[] links together pre-read tuples for each tape as well as
! 	 * recycled locations in mergefreelist. It is OK to use 0 as a null link
! 	 * in these lists, because memtuples[0] is part of the merge heap and is
! 	 * never a pre-read tuple.
! 	 */
! 	bool		mergeactive;	      /* Active input run source? */
! 	int			mergenext;	          /* first preread tuple for each source */
! 	int			mergelast;	          /* last preread tuple for each source */
! 	long		mergeavailmem;	      /* availMem for prereading tapes */
! 
! 	/*
! 	 * Variables for Algorithm D.  The index into these arrays is a logical
!      * tape number. Be careful to keep "logical" and "actual" tape numbers 
!      * straight!
! 	 */
! 	int			tp_fib;      	      /* Target Fibonacci run counts (A[]) */
! 	int			tp_runs;         	  /* # of real runs on each tape */
! 	int			tp_dummy;        	  /* # of dummy runs for each tape (D[]) */
! 	int			tp_tapenum;           /* Actual tape numbers (TAPE[]) */
! } TapeStateData;
! 
! typedef TapeStateData *Tapestate;
  
  /*
   * Private state of a Tuplesort operation.
***************
*** 179,185 ****
  	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
  	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
  	 * H.  (Note that memtupcount only counts the tuples that are part of the
! 	 * heap --- during merge passes, memtuples[] entries beyond TAPERANGE are
  	 * never in the heap and are used to hold pre-read tuples.)  In state
  	 * SORTEDONTAPE, the array is not used.
  	 */
--- 235,241 ----
  	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
  	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
  	 * H.  (Note that memtupcount only counts the tuples that are part of the
! 	 * heap --- during merge passes, memtuples[] entries beyond TapeRange are
  	 * never in the heap and are used to hold pre-read tuples.)  In state
  	 * SORTEDONTAPE, the array is not used.
  	 */
***************
*** 205,243 ****
  	int			currentRun;
  
  	/*
! 	 * These variables are only used during merge passes.  mergeactive[i] is
! 	 * true if we are reading an input run from (actual) tape number i and
! 	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
! 	 * of the next pre-read tuple (next to be loaded into the heap) for tape
! 	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
! 	 * points to the last pre-read tuple from each tape. mergeavailmem[i] is
! 	 * the amount of unused space allocated for tape i. mergefreelist and
! 	 * mergefirstfree keep track of unused locations in the memtuples[] array.
! 	 * memtupindex[] links together pre-read tuples for each tape as well as
! 	 * recycled locations in mergefreelist. It is OK to use 0 as a null link
! 	 * in these lists, because memtuples[0] is part of the merge heap and is
! 	 * never a pre-read tuple.
! 	 */
! 	bool		mergeactive[MAXTAPES];	/* Active input run source? */
! 	int			mergenext[MAXTAPES];	/* first preread tuple for each source */
! 	int			mergelast[MAXTAPES];	/* last preread tuple for each source */
! 	long		mergeavailmem[MAXTAPES];		/* availMem for prereading
! 												 * tapes */
  	long		spacePerTape;	/* actual per-tape target usage */
  	int			mergefreelist;	/* head of freelist of recycled slots */
  	int			mergefirstfree; /* first slot never used in this merge */
  
  	/*
  	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
! 	 * number, ie, an index into the tp_xxx[] arrays.  Be careful to keep
  	 * "logical" and "actual" tape numbers straight!
  	 */
  	int			Level;			/* Knuth's l */
  	int			destTape;		/* current output tape (Knuth's j, less 1) */
- 	int			tp_fib[MAXTAPES];		/* Target Fibonacci run counts (A[]) */
- 	int			tp_runs[MAXTAPES];		/* # of real runs on each tape */
- 	int			tp_dummy[MAXTAPES];		/* # of dummy runs for each tape (D[]) */
- 	int			tp_tapenum[MAXTAPES];	/* Actual tape numbers (TAPE[]) */
  
  	/*
  	 * These variables are used after completion of sorting to keep track of
--- 261,281 ----
  	int			currentRun;
  
  	/*
! 	 * These variables are only used during merge passes
!      */
!     Tapestate  *logtapes;       /* array of pointers to palloc'd tapes */
  	long		spacePerTape;	/* actual per-tape target usage */
  	int			mergefreelist;	/* head of freelist of recycled slots */
  	int			mergefirstfree; /* first slot never used in this merge */
  
+ 
  	/*
  	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
! 	 * number, ie, an index into the Tapesortstate arrays.  Be careful to keep
  	 * "logical" and "actual" tape numbers straight!
  	 */
  	int			Level;			/* Knuth's l */
  	int			destTape;		/* current output tape (Knuth's j, less 1) */
  
  	/*
  	 * These variables are used after completion of sorting to keep track of
***************
*** 972,980 ****
  				/* returned tuple is no longer counted in our memory space */
  				tuplen = GetMemoryChunkSpace(tup);
  				state->availMem += tuplen;
! 				state->mergeavailmem[srcTape] += tuplen;
  				tuplesort_heap_siftup(state, false);
! 				if ((tupIndex = state->mergenext[srcTape]) == 0)
  				{
  					/*
  					 * out of preloaded data on this tape, try to read more
--- 1010,1018 ----
  				/* returned tuple is no longer counted in our memory space */
  				tuplen = GetMemoryChunkSpace(tup);
  				state->availMem += tuplen;
! 				state->logtapes[srcTape]->mergeavailmem += tuplen;
  				tuplesort_heap_siftup(state, false);
! 				if ((tupIndex = state->logtapes[srcTape]->mergenext) == 0)
  				{
  					/*
  					 * out of preloaded data on this tape, try to read more
***************
*** 984,997 ****
  					/*
  					 * if still no data, we've reached end of run on this tape
  					 */
! 					if ((tupIndex = state->mergenext[srcTape]) == 0)
  						return tup;
  				}
  				/* pull next preread tuple from list, insert in heap */
  				newtup = state->memtuples[tupIndex];
! 				state->mergenext[srcTape] = state->memtupindex[tupIndex];
! 				if (state->mergenext[srcTape] == 0)
! 					state->mergelast[srcTape] = 0;
  				state->memtupindex[tupIndex] = state->mergefreelist;
  				state->mergefreelist = tupIndex;
  				tuplesort_heap_insert(state, newtup, srcTape, false);
--- 1022,1035 ----
  					/*
  					 * if still no data, we've reached end of run on this tape
  					 */
! 					if ((tupIndex = state->logtapes[srcTape]->mergenext) == 0)
  						return tup;
  				}
  				/* pull next preread tuple from list, insert in heap */
  				newtup = state->memtuples[tupIndex];
! 				state->logtapes[srcTape]->mergenext = state->memtupindex[tupIndex];
! 				if (state->logtapes[srcTape]->mergenext == 0)
! 					state->logtapes[srcTape]->mergelast = 0;
  				state->memtupindex[tupIndex] = state->mergefreelist;
  				state->mergefreelist = tupIndex;
  				tuplesort_heap_insert(state, newtup, srcTape, false);
***************
*** 1053,1065 ****
  	int			ntuples,
  				j;
  
  #ifdef TRACE_SORT
  	if (trace_sort)
! 		elog(LOG, "switching to external sort: %s",
  			 pg_rusage_show(&state->ru_start));
  #endif
  
! 	state->tapeset = LogicalTapeSetCreate(MAXTAPES);
  
  	/*
  	 * Allocate the memtupindex array, same size as memtuples.
--- 1091,1129 ----
  	int			ntuples,
  				j;
  
+     /*
+      * Calculate the number of tapes based on memory limits. MaxTapes must
+      * be at least MINTAPES. It can be set higher than this to reduce the
+      * number of merge passes, but we must allow a large enough memory
+      * buffer for each tape to maximise the benefit of sequential I/O
+      * during the merge phase.
+      */
+     MaxTapes = (int) (state->allowedMem / OPTIMAL_MERGE_BUFFER_SIZE);
+     if (MaxTapes < MINTAPES)
+     {
+         MaxTapes = MINTAPES;
+     }
+     TapeRange = MaxTapes - 1;
+ 
  #ifdef TRACE_SORT
  	if (trace_sort)
! 		elog(LOG, "switching to external sort: (ntapes=%u) %s", MaxTapes,
  			 pg_rusage_show(&state->ru_start));
  #endif
  
!     /*
!      * Allocate the Tapestate array
!      */
! 	state->logtapes = (Tapestate *) palloc0(MaxTapes * sizeof(Tapestate));
!     /*
!      * Allocate a fixed number of tapes, rather than dynamic allocation
!      */
! 	for (j = 0; j < MaxTapes; j++)
! 	{
!         state->logtapes[j] = (Tapestate) palloc0(sizeof(TapeStateData));
!     }
! 
! 	state->tapeset = LogicalTapeSetCreate(MaxTapes);
  
  	/*
  	 * Allocate the memtupindex array, same size as memtuples.
***************
*** 1087,1101 ****
  	/*
  	 * Initialize variables of Algorithm D (step D1).
  	 */
! 	for (j = 0; j < MAXTAPES; j++)
  	{
! 		state->tp_fib[j] = 1;
! 		state->tp_runs[j] = 0;
! 		state->tp_dummy[j] = 1;
! 		state->tp_tapenum[j] = j;
  	}
! 	state->tp_fib[TAPERANGE] = 0;
! 	state->tp_dummy[TAPERANGE] = 0;
  
  	state->Level = 1;
  	state->destTape = 0;
--- 1151,1165 ----
  	/*
  	 * Initialize variables of Algorithm D (step D1).
  	 */
! 	for (j = 0; j < MaxTapes; j++)
  	{
! 		state->logtapes[j]->tp_fib = 1;
! 		state->logtapes[j]->tp_runs = 0;
! 		state->logtapes[j]->tp_dummy = 1;
! 		state->logtapes[j]->tp_tapenum = j;
  	}
! 	state->logtapes[TapeRange]->tp_fib = 0;
! 	state->logtapes[TapeRange]->tp_dummy = 0;
  
  	state->Level = 1;
  	state->destTape = 0;
***************
*** 1115,1127 ****
  	int			j;
  	int			a;
  
  	/* Step D3: advance j (destTape) */
! 	if (state->tp_dummy[state->destTape] < state->tp_dummy[state->destTape + 1])
  	{
  		state->destTape++;
  		return;
  	}
! 	if (state->tp_dummy[state->destTape] != 0)
  	{
  		state->destTape = 0;
  		return;
--- 1179,1195 ----
  	int			j;
  	int			a;
  
+     /*
+      * If we ever use dynamically allocated Tapestates, allocate them here.
+      */
+ 
  	/* Step D3: advance j (destTape) */
!     if (state->logtapes[state->destTape]->tp_dummy < state->logtapes[state->destTape + 1]->tp_dummy)
  	{
  		state->destTape++;
  		return;
  	}
! 	if (state->logtapes[state->destTape]->tp_dummy != 0)
  	{
  		state->destTape = 0;
  		return;
***************
*** 1129,1139 ****
  
  	/* Step D4: increase level */
  	state->Level++;
! 	a = state->tp_fib[0];
! 	for (j = 0; j < TAPERANGE; j++)
  	{
! 		state->tp_dummy[j] = a + state->tp_fib[j + 1] - state->tp_fib[j];
! 		state->tp_fib[j] = a + state->tp_fib[j + 1];
  	}
  	state->destTape = 0;
  }
--- 1197,1207 ----
  
  	/* Step D4: increase level */
  	state->Level++;
! 	a = state->logtapes[0]->tp_fib;
! 	for (j = 0; j < TapeRange; j++)
  	{
! 		state->logtapes[j]->tp_dummy = a + state->logtapes[j + 1]->tp_fib - state->logtapes[j]->tp_fib;
! 		state->logtapes[j]->tp_fib = a + state->logtapes[j + 1]->tp_fib;
  	}
  	state->destTape = 0;
  }
***************
*** 1162,1168 ****
  	 */
  	if (state->currentRun == 1)
  	{
! 		state->result_tape = state->tp_tapenum[state->destTape];
  		/* must freeze and rewind the finished output tape */
  		LogicalTapeFreeze(state->tapeset, state->result_tape);
  		state->status = TSS_SORTEDONTAPE;
--- 1230,1236 ----
  	 */
  	if (state->currentRun == 1)
  	{
! 		state->result_tape = state->logtapes[state->destTape]->tp_tapenum;
  		/* must freeze and rewind the finished output tape */
  		LogicalTapeFreeze(state->tapeset, state->result_tape);
  		state->status = TSS_SORTEDONTAPE;
***************
*** 1170,1191 ****
  	}
  
  	/* End of step D2: rewind all output tapes to prepare for merging */
! 	for (tapenum = 0; tapenum < TAPERANGE; tapenum++)
  		LogicalTapeRewind(state->tapeset, tapenum, false);
  
  	for (;;)
  	{
  		/* Step D5: merge runs onto tape[T] until tape[P] is empty */
! 		while (state->tp_runs[TAPERANGE - 1] || state->tp_dummy[TAPERANGE - 1])
  		{
  			bool		allDummy = true;
  			bool		allOneRun = true;
  
! 			for (tapenum = 0; tapenum < TAPERANGE; tapenum++)
  			{
! 				if (state->tp_dummy[tapenum] == 0)
  					allDummy = false;
! 				if (state->tp_runs[tapenum] + state->tp_dummy[tapenum] != 1)
  					allOneRun = false;
  			}
  
--- 1238,1259 ----
  	}
  
  	/* End of step D2: rewind all output tapes to prepare for merging */
! 	for (tapenum = 0; tapenum < TapeRange; tapenum++)
  		LogicalTapeRewind(state->tapeset, tapenum, false);
  
  	for (;;)
  	{
  		/* Step D5: merge runs onto tape[T] until tape[P] is empty */
! 		while (state->logtapes[TapeRange - 1]->tp_runs || state->logtapes[TapeRange - 1]->tp_dummy)
  		{
  			bool		allDummy = true;
  			bool		allOneRun = true;
  
! 			for (tapenum = 0; tapenum < TapeRange; tapenum++)
  			{
! 				if (state->logtapes[tapenum]->tp_dummy == 0)
  					allDummy = false;
! 				if (state->logtapes[tapenum]->tp_runs + state->logtapes[tapenum]->tp_dummy != 1)
  					allOneRun = false;
  			}
  
***************
*** 1198,1211 ****
  				Assert(!allDummy);
  				/* Initialize for the final merge pass */
  				beginmerge(state);
  				state->status = TSS_FINALMERGE;
  				return;
  			}
  			if (allDummy)
  			{
! 				state->tp_dummy[TAPERANGE]++;
! 				for (tapenum = 0; tapenum < TAPERANGE; tapenum++)
! 					state->tp_dummy[tapenum]--;
  			}
  			else
  				mergeonerun(state);
--- 1266,1284 ----
  				Assert(!allDummy);
  				/* Initialize for the final merge pass */
  				beginmerge(state);
+ #ifdef TRACE_SORT
+ 				if (trace_sort)
+ 					elog(LOG, "TSS_FINALMERGE %s",
+ 						 pg_rusage_show(&state->ru_start));
+ #endif
  				state->status = TSS_FINALMERGE;
  				return;
  			}
  			if (allDummy)
  			{
! 				state->logtapes[TapeRange]->tp_dummy++;
! 				for (tapenum = 0; tapenum < TapeRange; tapenum++)
! 					state->logtapes[tapenum]->tp_dummy--;
  			}
  			else
  				mergeonerun(state);
***************
*** 1214,1241 ****
  		if (--state->Level == 0)
  			break;
  		/* rewind output tape T to use as new input */
! 		LogicalTapeRewind(state->tapeset, state->tp_tapenum[TAPERANGE],
  						  false);
  		/* rewind used-up input tape P, and prepare it for write pass */
! 		LogicalTapeRewind(state->tapeset, state->tp_tapenum[TAPERANGE - 1],
  						  true);
! 		state->tp_runs[TAPERANGE - 1] = 0;
  
  		/*
  		 * reassign tape units per step D6; note we no longer care about A[]
  		 */
! 		svTape = state->tp_tapenum[TAPERANGE];
! 		svDummy = state->tp_dummy[TAPERANGE];
! 		svRuns = state->tp_runs[TAPERANGE];
! 		for (tapenum = TAPERANGE; tapenum > 0; tapenum--)
  		{
! 			state->tp_tapenum[tapenum] = state->tp_tapenum[tapenum - 1];
! 			state->tp_dummy[tapenum] = state->tp_dummy[tapenum - 1];
! 			state->tp_runs[tapenum] = state->tp_runs[tapenum - 1];
  		}
! 		state->tp_tapenum[0] = svTape;
! 		state->tp_dummy[0] = svDummy;
! 		state->tp_runs[0] = svRuns;
  	}
  
  	/*
--- 1287,1314 ----
  		if (--state->Level == 0)
  			break;
  		/* rewind output tape T to use as new input */
! 		LogicalTapeRewind(state->tapeset, state->logtapes[TapeRange]->tp_tapenum,
  						  false);
  		/* rewind used-up input tape P, and prepare it for write pass */
! 		LogicalTapeRewind(state->tapeset, state->logtapes[TapeRange - 1]->tp_tapenum,
  						  true);
! 		state->logtapes[TapeRange - 1]->tp_runs = 0;
  
  		/*
  		 * reassign tape units per step D6; note we no longer care about A[]
  		 */
! 		svTape = state->logtapes[TapeRange]->tp_tapenum;
! 		svDummy = state->logtapes[TapeRange]->tp_dummy;
! 		svRuns = state->logtapes[TapeRange]->tp_runs;
! 		for (tapenum = TapeRange; tapenum > 0; tapenum--)
  		{
! 			state->logtapes[tapenum]->tp_tapenum = state->logtapes[tapenum - 1]->tp_tapenum;
! 			state->logtapes[tapenum]->tp_dummy = state->logtapes[tapenum - 1]->tp_dummy;
! 			state->logtapes[tapenum]->tp_runs = state->logtapes[tapenum - 1]->tp_runs;
  		}
! 		state->logtapes[0]->tp_tapenum = svTape;
! 		state->logtapes[0]->tp_dummy = svDummy;
! 		state->logtapes[0]->tp_runs = svRuns;
  	}
  
  	/*
***************
*** 1246,1252 ****
  	 * output tape while rewinding it.	The last iteration of step D6 would be
  	 * a waste of cycles anyway...
  	 */
! 	state->result_tape = state->tp_tapenum[TAPERANGE];
  	LogicalTapeFreeze(state->tapeset, state->result_tape);
  	state->status = TSS_SORTEDONTAPE;
  }
--- 1319,1325 ----
  	 * output tape while rewinding it.	The last iteration of step D6 would be
  	 * a waste of cycles anyway...
  	 */
! 	state->result_tape = state->logtapes[TapeRange]->tp_tapenum;
  	LogicalTapeFreeze(state->tapeset, state->result_tape);
  	state->status = TSS_SORTEDONTAPE;
  }
***************
*** 1260,1266 ****
  static void
  mergeonerun(Tuplesortstate *state)
  {
! 	int			destTape = state->tp_tapenum[TAPERANGE];
  	int			srcTape;
  	int			tupIndex;
  	void	   *tup;
--- 1333,1339 ----
  static void
  mergeonerun(Tuplesortstate *state)
  {
! 	int			destTape = state->logtapes[TapeRange]->tp_tapenum;
  	int			srcTape;
  	int			tupIndex;
  	void	   *tup;
***************
*** 1287,1308 ****
  		WRITETUP(state, destTape, state->memtuples[0]);
  		/* writetup adjusted total free space, now fix per-tape space */
  		spaceFreed = state->availMem - priorAvail;
! 		state->mergeavailmem[srcTape] += spaceFreed;
  		/* compact the heap */
  		tuplesort_heap_siftup(state, false);
! 		if ((tupIndex = state->mergenext[srcTape]) == 0)
  		{
  			/* out of preloaded data on this tape, try to read more */
  			mergepreread(state);
  			/* if still no data, we've reached end of run on this tape */
! 			if ((tupIndex = state->mergenext[srcTape]) == 0)
  				continue;
  		}
  		/* pull next preread tuple from list, insert in heap */
  		tup = state->memtuples[tupIndex];
! 		state->mergenext[srcTape] = state->memtupindex[tupIndex];
! 		if (state->mergenext[srcTape] == 0)
! 			state->mergelast[srcTape] = 0;
  		state->memtupindex[tupIndex] = state->mergefreelist;
  		state->mergefreelist = tupIndex;
  		tuplesort_heap_insert(state, tup, srcTape, false);
--- 1360,1381 ----
  		WRITETUP(state, destTape, state->memtuples[0]);
  		/* writetup adjusted total free space, now fix per-tape space */
  		spaceFreed = state->availMem - priorAvail;
! 		state->logtapes[srcTape]->mergeavailmem += spaceFreed;
  		/* compact the heap */
  		tuplesort_heap_siftup(state, false);
! 		if ((tupIndex = state->logtapes[srcTape]->mergenext) == 0)
  		{
  			/* out of preloaded data on this tape, try to read more */
  			mergepreread(state);
  			/* if still no data, we've reached end of run on this tape */
! 			if ((tupIndex = state->logtapes[srcTape]->mergenext) == 0)
  				continue;
  		}
  		/* pull next preread tuple from list, insert in heap */
  		tup = state->memtuples[tupIndex];
! 		state->logtapes[srcTape]->mergenext = state->memtupindex[tupIndex];
! 		if (state->logtapes[srcTape]->mergenext == 0)
! 			state->logtapes[srcTape]->mergelast = 0;
  		state->memtupindex[tupIndex] = state->mergefreelist;
  		state->mergefreelist = tupIndex;
  		tuplesort_heap_insert(state, tup, srcTape, false);
***************
*** 1313,1319 ****
  	 * output tape, and increment its count of real runs.
  	 */
  	markrunend(state, destTape);
! 	state->tp_runs[TAPERANGE]++;
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1386,1392 ----
  	 * output tape, and increment its count of real runs.
  	 */
  	markrunend(state, destTape);
! 	state->logtapes[TapeRange]->tp_runs++;
  
  #ifdef TRACE_SORT
  	if (trace_sort)
***************
*** 1341,1365 ****
  	Assert(state->memtupcount == 0);
  
  	/* Clear merge-pass state variables */
! 	memset(state->mergeactive, 0, sizeof(state->mergeactive));
! 	memset(state->mergenext, 0, sizeof(state->mergenext));
! 	memset(state->mergelast, 0, sizeof(state->mergelast));
! 	memset(state->mergeavailmem, 0, sizeof(state->mergeavailmem));
  	state->mergefreelist = 0;	/* nothing in the freelist */
! 	state->mergefirstfree = MAXTAPES;	/* first slot available for preread */
  
  	/* Adjust run counts and mark the active tapes */
  	activeTapes = 0;
! 	for (tapenum = 0; tapenum < TAPERANGE; tapenum++)
  	{
! 		if (state->tp_dummy[tapenum] > 0)
! 			state->tp_dummy[tapenum]--;
  		else
  		{
! 			Assert(state->tp_runs[tapenum] > 0);
! 			state->tp_runs[tapenum]--;
! 			srcTape = state->tp_tapenum[tapenum];
! 			state->mergeactive[srcTape] = true;
  			activeTapes++;
  		}
  	}
--- 1414,1441 ----
  	Assert(state->memtupcount == 0);
  
  	/* Clear merge-pass state variables */
! 	for (tapenum = 0; tapenum < MaxTapes; tapenum++)
!     {
!         state->logtapes[tapenum]->mergeactive = false;
!         state->logtapes[tapenum]->mergenext = 0;
!         state->logtapes[tapenum]->mergelast = 0;
!         state->logtapes[tapenum]->mergeavailmem = 0;
!     }
  	state->mergefreelist = 0;	/* nothing in the freelist */
! 	state->mergefirstfree = MaxTapes;	/* first slot available for preread */
  
  	/* Adjust run counts and mark the active tapes */
  	activeTapes = 0;
! 	for (tapenum = 0; tapenum < TapeRange; tapenum++)
  	{
! 		if (state->logtapes[tapenum]->tp_dummy > 0)
! 			state->logtapes[tapenum]->tp_dummy--;
  		else
  		{
! 			Assert(state->logtapes[tapenum]->tp_runs > 0);
! 			state->logtapes[tapenum]->tp_runs--;
! 			srcTape = state->logtapes[tapenum]->tp_tapenum;
! 			state->logtapes[srcTape]->mergeactive = true;
  			activeTapes++;
  		}
  	}
***************
*** 1370,1379 ****
  	 */
  	Assert(activeTapes > 0);
  	state->spacePerTape = state->availMem / activeTapes;
! 	for (srcTape = 0; srcTape < MAXTAPES; srcTape++)
  	{
! 		if (state->mergeactive[srcTape])
! 			state->mergeavailmem[srcTape] = state->spacePerTape;
  	}
  
  	/*
--- 1446,1455 ----
  	 */
  	Assert(activeTapes > 0);
  	state->spacePerTape = state->availMem / activeTapes;
! 	for (srcTape = 0; srcTape < MaxTapes; srcTape++)
  	{
! 		if (state->logtapes[srcTape]->mergeactive)
! 			state->logtapes[srcTape]->mergeavailmem = state->spacePerTape;
  	}
  
  	/*
***************
*** 1383,1399 ****
  	mergepreread(state);
  
  	/* Load the merge heap with the first tuple from each input tape */
! 	for (srcTape = 0; srcTape < MAXTAPES; srcTape++)
  	{
! 		int			tupIndex = state->mergenext[srcTape];
  		void	   *tup;
  
  		if (tupIndex)
  		{
  			tup = state->memtuples[tupIndex];
! 			state->mergenext[srcTape] = state->memtupindex[tupIndex];
! 			if (state->mergenext[srcTape] == 0)
! 				state->mergelast[srcTape] = 0;
  			state->memtupindex[tupIndex] = state->mergefreelist;
  			state->mergefreelist = tupIndex;
  			tuplesort_heap_insert(state, tup, srcTape, false);
--- 1459,1475 ----
  	mergepreread(state);
  
  	/* Load the merge heap with the first tuple from each input tape */
! 	for (srcTape = 0; srcTape < MaxTapes; srcTape++)
  	{
! 		int			tupIndex = state->logtapes[srcTape]->mergenext;
  		void	   *tup;
  
  		if (tupIndex)
  		{
  			tup = state->memtuples[tupIndex];
! 			state->logtapes[srcTape]->mergenext = state->memtupindex[tupIndex];
! 			if (state->logtapes[srcTape]->mergenext == 0)
! 				state->logtapes[srcTape]->mergelast = 0;
  			state->memtupindex[tupIndex] = state->mergefreelist;
  			state->mergefreelist = tupIndex;
  			tuplesort_heap_insert(state, tup, srcTape, false);
***************
*** 1420,1428 ****
  	long		priorAvail,
  				spaceUsed;
  
! 	for (srcTape = 0; srcTape < MAXTAPES; srcTape++)
  	{
! 		if (!state->mergeactive[srcTape])
  			continue;
  
  		/*
--- 1496,1504 ----
  	long		priorAvail,
  				spaceUsed;
  
! 	for (srcTape = 0; srcTape < MaxTapes; srcTape++)
  	{
! 		if (!state->logtapes[srcTape]->mergeactive)
  			continue;
  
  		/*
***************
*** 1431,1438 ****
  		 * adjustment?).  This avoids reading just a few tuples when the
  		 * incoming runs are not being consumed evenly.
  		 */
! 		if (state->mergenext[srcTape] != 0 &&
! 			state->mergeavailmem[srcTape] <= state->spacePerTape / 2)
  			continue;
  
  		/*
--- 1507,1514 ----
  		 * adjustment?).  This avoids reading just a few tuples when the
  		 * incoming runs are not being consumed evenly.
  		 */
! 		if (state->logtapes[srcTape]->mergenext != 0 &&
! 			state->logtapes[srcTape]->mergeavailmem <= state->spacePerTape / 2)
  			continue;
  
  		/*
***************
*** 1440,1452 ****
  		 * but ensure that we have at least one.
  		 */
  		priorAvail = state->availMem;
! 		state->availMem = state->mergeavailmem[srcTape];
! 		while (!LACKMEM(state) || state->mergenext[srcTape] == 0)
  		{
  			/* read next tuple, if any */
  			if ((tuplen = getlen(state, srcTape, true)) == 0)
  			{
! 				state->mergeactive[srcTape] = false;
  				break;
  			}
  			tup = READTUP(state, srcTape, tuplen);
--- 1516,1528 ----
  		 * but ensure that we have at least one.
  		 */
  		priorAvail = state->availMem;
! 		state->availMem = state->logtapes[srcTape]->mergeavailmem;
! 		while (!LACKMEM(state) || state->logtapes[srcTape]->mergenext == 0)
  		{
  			/* read next tuple, if any */
  			if ((tuplen = getlen(state, srcTape, true)) == 0)
  			{
! 				state->logtapes[srcTape]->mergeactive = false;
  				break;
  			}
  			tup = READTUP(state, srcTape, tuplen);
***************
*** 1476,1490 ****
  			/* store tuple, append to list for its tape */
  			state->memtuples[tupIndex] = tup;
  			state->memtupindex[tupIndex] = 0;
! 			if (state->mergelast[srcTape])
! 				state->memtupindex[state->mergelast[srcTape]] = tupIndex;
  			else
! 				state->mergenext[srcTape] = tupIndex;
! 			state->mergelast[srcTape] = tupIndex;
  		}
  		/* update per-tape and global availmem counts */
! 		spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
! 		state->mergeavailmem[srcTape] = state->availMem;
  		state->availMem = priorAvail - spaceUsed;
  	}
  }
--- 1552,1566 ----
  			/* store tuple, append to list for its tape */
  			state->memtuples[tupIndex] = tup;
  			state->memtupindex[tupIndex] = 0;
! 			if (state->logtapes[srcTape]->mergelast)
! 				state->memtupindex[state->logtapes[srcTape]->mergelast] = tupIndex;
  			else
! 				state->logtapes[srcTape]->mergenext = tupIndex;
! 			state->logtapes[srcTape]->mergelast = tupIndex;
  		}
  		/* update per-tape and global availmem counts */
! 		spaceUsed = state->logtapes[srcTape]->mergeavailmem - state->availMem;
! 		state->logtapes[srcTape]->mergeavailmem = state->availMem;
  		state->availMem = priorAvail - spaceUsed;
  	}
  }
***************
*** 1516,1522 ****
  		 * heap.
  		 */
  		Assert(state->memtupcount > 0);
! 		WRITETUP(state, state->tp_tapenum[state->destTape],
  				 state->memtuples[0]);
  		tuplesort_heap_siftup(state, true);
  
--- 1592,1598 ----
  		 * heap.
  		 */
  		Assert(state->memtupcount > 0);
! 		WRITETUP(state, state->logtapes[state->destTape]->tp_tapenum,
  				 state->memtuples[0]);
  		tuplesort_heap_siftup(state, true);
  
***************
*** 1527,1542 ****
  		if (state->memtupcount == 0 ||
  			state->currentRun != state->memtupindex[0])
  		{
! 			markrunend(state, state->tp_tapenum[state->destTape]);
  			state->currentRun++;
! 			state->tp_runs[state->destTape]++;
! 			state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
  
  #ifdef TRACE_SORT
  			if (trace_sort)
! 				elog(LOG, "finished writing%s run %d: %s",
  					 (state->memtupcount == 0) ? " final" : "",
! 					 state->currentRun,
  					 pg_rusage_show(&state->ru_start));
  #endif
  
--- 1603,1618 ----
  		if (state->memtupcount == 0 ||
  			state->currentRun != state->memtupindex[0])
  		{
! 			markrunend(state, state->logtapes[state->destTape]->tp_tapenum);
  			state->currentRun++;
! 			state->logtapes[state->destTape]->tp_runs++;
! 			state->logtapes[state->destTape]->tp_dummy--; /* per Alg D step D2 */
  
  #ifdef TRACE_SORT
  			if (trace_sort)
! 				elog(LOG, "finished writing%s run %d to tape %d: %s",
  					 (state->memtupcount == 0) ? " final" : "",
! 					 state->currentRun, state->destTape,
  					 pg_rusage_show(&state->ru_start));
  #endif
  