Claudio Grondi <[EMAIL PROTECTED]> writes:
> Does it mean, that in case of very large files:
> the size of available memory for the sorting operation (making it
> possible to work on larger chunks of data in memory) has less impact
> on the actual sorting speed than
> the speed of the data transfer from/to storage device(s)
Transfer speed and parallelism matter most, provided the CPU can keep up. Parallelism means multiple drives you can do I/O to simultaneously without having to seek; random access simulates this, but only somewhat. Large database installations want lots of parallel disks for similar reasons.

The traditional method of sorting large files:

1) Read the file in pieces p(1), p(2), ..., p(n), where each piece is as big as will fit in memory. Sort each piece with your favorite in-memory sort and write out sorted pieces that we'll designate r(1,1), ..., r(1,n). The r(a,b)'s are called "runs". Use multiple output devices (tape or disk drives) to write out the runs, i.e. each output tape will contain its own bunch of runs.

2) Read back a bunch of the runs in parallel. Say you have written r(1,1), r(1,2), ..., r(1,5) to five separate tape drives: read them back simultaneously and merge them (this requires very little extra memory) into a new "second level" run called r(2,1). Similarly merge r(1,6), ..., r(1,10) into the second-level run r(2,2), and so forth.

3) Merge the second-level runs into third-level runs, and so on recursively until there is only one giant run. That is the sorted output.

The amount of memory determines the size of the initial first-level runs, which determines how many recursive merges are needed, so more memory helps. The number of levels of recursion also depends on the "merge order", i.e. the number of runs merged at each merge step. You really want each merge to read from M physical input devices that are separate from the output device(s).

The description above omits a lot of optimizations, and there's a lot of strategy in how to organize the merges, e.g. if you have 8 drives, do you use 4 for input and 4 for output, or 5 in and 3 out, or what?

Is this some kind of production deployment problem you're working on, or do you just have a big pile of data you need to sort once?
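The three steps above can be sketched in a few lines of Python. This is a minimal single-disk illustration, not the parallel tape/drive setup described above: the function name, chunk size, and merge fan-in are all illustrative choices, and heapq.merge stands in for the streaming k-way merge.

```python
import heapq
import os
import tempfile

def sort_large_file(in_path, out_path, max_lines=100_000, merge_order=5):
    """External merge sort of a text file, one line per record."""
    # Step 1: read the file in pieces that fit in memory, sort each
    # piece in memory, and write every sorted piece out as a "run".
    runs = []
    with open(in_path) as f:
        while True:
            chunk = [line for line in
                     (f.readline() for _ in range(max_lines)) if line]
            if not chunk:
                break
            chunk.sort()
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            runs.append(run_path)

    # Steps 2 and 3: repeatedly merge `merge_order` runs at a time into
    # higher-level runs until a single giant run remains.
    while len(runs) > 1:
        next_level = []
        for i in range(0, len(runs), merge_order):
            group = runs[i:i + merge_order]
            files = [open(p) for p in group]
            fd, merged_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as out:
                # heapq.merge streams the merge lazily, so this phase
                # needs very little memory regardless of run size.
                out.writelines(heapq.merge(*files))
            for f, p in zip(files, group):
                f.close()
                os.remove(p)
            next_level.append(merged_path)
        runs = next_level

    os.replace(runs[0], out_path)
```

With real parallel devices you'd spread the runs across drives and overlap the reads; here the point is just the run/merge structure, where max_lines plays the role of available memory and merge_order is the merge fan-in discussed above.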
If you need to deploy across multiple machines (i.e. you can't just buy a giant machine if you need to do this sorting frequently), then I'd suggest reading up on the subject, starting with Knuth vol 3. -- http://mail.python.org/mailman/listinfo/python-list