I'm the one who wrote that code, so I'm the best one to explain it.
What exactly did you want to know about it?
Basically the idea is that each file is sorted (and, in the case of
distinct, de-duplicated) as it is spilled. Then at read time, the
files are read back and merged via a priority queue. In the case of
distinct, the distinct operator also has to be applied again during the
merge, since the same tuple can show up in more than one spill file.
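
As a rough illustration of the sort-on-spill, merge-on-read idea, here is a small Python sketch. It is not the actual Pig code: the names `spill` and `read_merged` are made up for this example, and plain lists stand in for the spill files on disk.

```python
import heapq
from itertools import groupby


def spill(in_memory, distinct=False):
    """Sort one batch as it is spilled; for a distinct bag, drop duplicates too.

    The returned list is a stand-in for a sorted spill file (an assumption
    made to keep the example short).
    """
    return sorted(set(in_memory)) if distinct else sorted(in_memory)


def read_merged(spilled_runs, distinct=False):
    """Merge the sorted runs; heapq.merge keeps a heap as its priority queue."""
    merged = heapq.merge(*spilled_runs)
    if distinct:
        # The same value can appear in several runs, so distinct is applied
        # again here. Equal values are adjacent in the merged stream.
        return (value for value, _ in groupby(merged))
    return merged


sorted_runs = [spill([3, 1, 3]), spill([2, 3, 1])]
distinct_runs = [spill([3, 1, 3], distinct=True), spill([2, 3, 1], distinct=True)]
sorted_result = list(read_merged(sorted_runs))
distinct_result = list(read_merged(distinct_runs, distinct=True))
```

Because every run is already sorted, the merge only ever needs to look at the current head of each run, which is what makes the priority queue sufficient.
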
This code is complicated by the fact that while reading in spilled
files, there may still be entries in memory. It is also possible to
have what was in memory (and already partially read) spilled in between
reads. So the iterator code has to handle merging in results from
memory, and if we were reading from memory and got spilled, making sure
we start reading again from the correct point in the newly spilled
file. This is made a little easier by the fact that data bags are
written entirely before they are read, so there will be at most one
spill during a read.
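
To make the "spilled in between reads" case concrete, here is a toy Python sketch. Again, this is not the Pig implementation: the class `SortedBag` and its methods are hypothetical, and lists stand in for spill files. It assumes that the in-memory entries are written out in the same sorted order they were being read in, so the number of entries already returned from memory is also the correct offset in the new spill file.

```python
import heapq


class SortedBag:
    """Toy bag: earlier sorted spill runs plus a sorted in-memory part."""

    def __init__(self, spilled_runs, in_memory):
        self.runs = [sorted(r) for r in spilled_runs]  # stand-ins for spill files
        self.memory = sorted(in_memory)                # entries not yet spilled

    def spill(self):
        """Move the in-memory entries into a new run, keeping their order."""
        if self.memory is not None:
            self.runs.append(self.memory)
            self.memory = None

    def __iter__(self):
        n_runs = len(self.runs)          # runs that exist when reading starts
        sources = list(range(n_runs))
        if self.memory is not None:
            sources.append(n_runs)       # index memory would get if spilled

        def data(src):
            # Re-resolved on every read, so a spill between reads is noticed.
            if src == n_runs and self.memory is not None:
                return self.memory
            return self.runs[src]

        # Heap entries are (value, source, position within that source).
        heap = [(data(s)[0], s, 0) for s in sources if data(s)]
        heapq.heapify(heap)
        while heap:
            value, src, pos = heapq.heappop(heap)
            yield value
            current = data(src)
            if pos + 1 < len(current):
                # Same offset whether the data is still in memory or now spilled.
                heapq.heappush(heap, (current[pos + 1], src, pos + 1))


bag = SortedBag([[1, 4, 7], [2, 5, 8]], [9, 3, 6])
it = iter(bag)
first = [next(it) for _ in range(4)]
bag.spill()          # memory is spilled while a read is in progress
rest = list(it)
```

In a file-backed version, resuming would presumably mean opening the newly spilled file and skipping past the entries already returned, rather than indexing into a list. Since there is at most one spill during a read, only one such switch has to be handled.
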
Hopefully that helps as an introduction. If you have specific
questions I'm glad to answer them.
Alan.
On Mar 26, 2008, at 7:12 AM, pi song wrote:
Dear Ben or anyone who knows,
Can you please explain to me how spilling to multiple files works in
the sorted bag/distinct bag?
Cheers,
Pi