Re: Multiple file bag spilling

Alan Gates Fri, 28 Mar 2008 09:22:01 -0700

I'm the one who wrote that code, so I'm the best one to explain it. What exactly were you wanting to know about it?

Basically the idea is that the contents of each file are sorted (and in the case of distinct, distinct applied) as the file is spilled. Then at read time, the files are read back and merged via a priority queue. In the case of distinct the distinct operator also has to be applied again during the merge, since the same entry can show up in more than one spill file.
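
To make that concrete, here is a rough sketch in Python. It is not the actual Pig code; the names spill, read_run and merged_read are made up for illustration, and entries are assumed to be plain strings, one per line. Each batch is sorted (and de-duplicated for distinct) before it is written, and the read side merges the sorted files through a heap, dropping adjacent duplicates when distinct is requested.

```python
import heapq
import os
import tempfile


def spill(items, distinct=False):
    """Sort one in-memory batch (de-duplicating if distinct) and write it to a temp file."""
    batch = sorted(set(items)) if distinct else sorted(items)
    fd, path = tempfile.mkstemp(suffix=".spill")
    with os.fdopen(fd, "w") as f:
        for item in batch:
            f.write(f"{item}\n")
    return path


def read_run(path):
    """Yield the entries of one spill file in the order they were written."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def merged_read(paths, distinct=False):
    """Merge already-sorted spill files; heapq.merge keeps one head entry per file in a heap."""
    previous = object()  # sentinel that never equals a real entry
    for item in heapq.merge(*(read_run(p) for p in paths)):
        if distinct and item == previous:
            continue  # same entry arrived from another file
        previous = item
        yield item
```

Because every file is already sorted, duplicates from different files come out of the merge next to each other, so comparing against the previous entry is enough for the distinct case.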

This code is complicated by the fact that while reading in spilled files, there may still be entries in memory. It is also possible to have what was in memory (and already partially read) spilled in between reads. So the iterator code has to handle merging in results from memory, and if we were reading from memory and got spilled, making sure we start reading again from the correct point in the newly spilled file. This is made a little easier by the fact that data bags are written entirely before they are read, so there will be at most one spill during a read.
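
The resume-after-spill part can be sketched like this, again in Python and again not the actual Pig code. SpillableList is a hypothetical class, and the sketch leaves out the merge with earlier spill files. It only shows the bookkeeping: the iterator counts how many in-memory entries it has handed out, and if the memory contents get spilled between reads it skips that many entries in the new file before continuing. It relies on the point above that at most one spill happens during a read.

```python
import os
import tempfile


class SpillableList:
    """Hypothetical sorted in-memory buffer that may be spilled while being read."""

    def __init__(self, items):
        self.memory = sorted(items)
        self.spill_path = None

    def spill(self):
        """Write the in-memory entries, in order, to a temp file and free the memory."""
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            for item in self.memory:
                f.write(f"{item}\n")
        self.memory = []
        self.spill_path = path

    def __iter__(self):
        consumed = 0  # entries already handed to the reader
        # Read from memory for as long as no spill has happened.
        while self.spill_path is None:
            if consumed >= len(self.memory):
                return
            item = self.memory[consumed]
            consumed += 1
            yield item
        # A spill happened between reads: skip what was already returned.
        with open(self.spill_path) as f:
            for index, line in enumerate(f):
                if index >= consumed:
                    yield line.rstrip("\n")
```

For example, after reading two entries from a four-entry list and then calling spill(), the same iterator carries on with the third entry from the file.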

Hopefully that helps as an introduction. If you have specific questions I'm glad to answer them.


Alan.

On Mar 26, 2008, at 7:12 AM, pi song wrote:

Dear Ben or anyone who knows,

Can you please explain to me how spilling to multiple files works in the
sorted bag/distinct bag?

Cheers,
Pi
