I'm the one who wrote that code, so I'm the best one to explain it. What exactly were you wanting to know about it?

Basically the idea is that each spill file is sorted (and, in the case of distinct, de-duplicated) as it is written. Then at read time, the files are read back and merged via a priority queue. In the case of distinct, the distinct operator also has to be applied during the merge, since the same tuple can show up in more than one spill file.
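
To make that concrete, here is a minimal sketch of the sort-on-spill, merge-on-read idea. It is in Python rather than Pig's Java, plain lists stand in for spill files, and the names (spill, merge_spills) are made up for illustration; this is not the actual Pig code.

```python
import heapq

_DONE = object()  # sentinel meaning "this spill file is exhausted"


def spill(chunk, distinct=False):
    """Stand-in for writing one spill file: sort the chunk, and for a
    distinct bag drop duplicates before writing."""
    data = set(chunk) if distinct else chunk
    return sorted(data)


def merge_spills(spills, distinct=False):
    """Merge already-sorted spills using a priority queue.

    Each heap entry is (value, spill_index); the index only serves to
    remember which spill to read the next value from.
    """
    iters = [iter(s) for s in spills]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, _DONE)
        if first is not _DONE:
            heapq.heappush(heap, (first, idx))

    have_last = False
    last = None
    while heap:
        value, idx = heapq.heappop(heap)
        # For distinct, a value may repeat across files even though each
        # file is internally duplicate-free, so filter again here.
        if not (distinct and have_last and value == last):
            yield value
            have_last, last = True, value
        nxt = next(iters[idx], _DONE)
        if nxt is not _DONE:
            heapq.heappush(heap, (nxt, idx))
```

Calling list(merge_spills(files)) on a few spilled chunks gives one sorted stream; passing distinct=True to both functions gives the same stream with duplicates removed.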

This code is complicated by the fact that while reading in spilled files, there may still be entries in memory. It is also possible for what was in memory (and already partially read) to be spilled in between reads. So the iterator code has to handle merging in results from memory and, if it was reading from memory when the spill happened, make sure it starts reading again from the correct point in the newly spilled file. This is made a little easier by the fact that data bags are written entirely before they are read, so there will be at most one spill during a read.
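
A rough sketch of that resume logic, again in Python with lists standing in for files. The class name SketchSortedBag and its methods are hypothetical, and the sketch assumes the in-memory entries are sorted when the read starts, so that the spilled file has the same order and a saved position carries over unchanged. The real code has more to deal with.

```python
import heapq


class SketchSortedBag:
    """Toy sorted bag: spilled 'files' plus a tail of in-memory entries."""

    def __init__(self):
        self.memory = []   # entries not yet spilled
        self.spills = []   # each spill is a sorted list standing in for a file

    def add(self, value):
        self.memory.append(value)

    def spill(self):
        """Move the in-memory entries into a new sorted 'file'."""
        if self.memory:
            self.spills.append(sorted(self.memory))
            self.memory = []

    def __iter__(self):
        self.memory.sort()             # memory is sorted once, at read time
        n_files = len(self.spills)     # files that existed when the read began
        mem_id = n_files               # source id used for the in-memory data
        positions = [0] * (n_files + 1)

        def source(i):
            if i != mem_id:
                return self.spills[i]
            # If a spill happened since the read began, the memory contents
            # now live in the newest file, in the same sorted order.
            if len(self.spills) > n_files:
                return self.spills[n_files]
            return self.memory

        heap = []
        for i in range(n_files + 1):
            src = source(i)
            if src:
                heap.append((src[0], i))
                positions[i] = 1
        heapq.heapify(heap)

        while heap:
            value, i = heapq.heappop(heap)
            yield value
            # Re-resolve the source after every yield: the caller may have
            # triggered a spill in between, and positions[i] tells us where
            # to pick up in the newly spilled file.
            src = source(i)
            if positions[i] < len(src):
                heapq.heappush(heap, (src[positions[i]], i))
                positions[i] += 1
```

Because the bag is fully written before it is read, the sketch only has to cope with a single spill during iteration, which is why checking for one extra file is enough.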

Hopefully that helps as an introduction. If you have specific questions I'm glad to answer them.

Alan.

On Mar 26, 2008, at 7:12 AM, pi song wrote:

Dear Ben or anyone who knows,

Can you please explain to me how spilling to multiple files works in the sorted
bag/distinct bag?

Cheers,
Pi
