Dear Alan,

At first I thought the implementation didn't look right, but after I found
this comment, everything made sense:

" * The DataBag interface assumes that all data is written before any is
  * read.  That is, a DataBag cannot be used as a queue.  If data is written
  * after data is read, the results are undefined. "

Anyway thanks for your help,
Pi

On Sat, Mar 29, 2008 at 3:21 AM, Alan Gates <[EMAIL PROTECTED]> wrote:

> I'm the one who wrote that code, so I'm the best one to explain it.
> What exactly were you wanting to know about it?
>
> Basically the idea is that files are sorted (and in the case of
> distinct, distinct applied) as each file is spilled.  Then at read
> time, the files are read back and merged via a priority queue.  In the
> case of distinct the distinct operator also has to be applied.
>
> This code is complicated by the fact that while reading in spilled
> files, there may still be entries in memory.  It is also possible to
> have what was in memory (and already partially read) spilled in between
> reads.  So the iterator code has to handle merging in results from
> memory, and if we were reading from memory and got spilled, making sure
> we start reading again from the correct point in the newly spilled
> file.  This is made a little easier by the fact that data bags are
> written entirely before they are read, so there will be at most one
> spill during a read.
>
> Hopefully that helps as an introduction.  If you have specific
> questions I'm glad to answer them.
>
> Alan.
>
> On Mar 26, 2008, at 7:12 AM, pi song wrote:
>
> > Dear Ben or anyone who knows,
> >
> > Can you please explain me how multiple files spilling works in sorted
> > bag/distinct bag?
> >
> > Cheers,
> > Pi
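
For anyone reading this thread later, the read-time merge Alan describes can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual Pig code: the class SpillMergeSketch, its Head helper, and the merge method are hypothetical names, integers stand in for tuples, and each sorted list stands in for either a spilled file or the sorted in-memory remainder of the bag. It does not model a spill happening in the middle of a read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; none of these names come from the Pig code base.
class SpillMergeSketch {

    // One priority-queue entry: the current head of a run, plus the
    // iterator that supplies the rest of that run.
    private static final class Head implements Comparable<Head> {
        final int value;
        final Iterator<Integer> rest;

        Head(int value, Iterator<Integer> rest) {
            this.value = value;
            this.rest = rest;
        }

        @Override
        public int compareTo(Head other) {
            return Integer.compare(value, other.value);
        }
    }

    // Merges already-sorted runs into one sorted list. Each run stands in
    // for a spilled file or for the sorted in-memory remainder of the bag.
    // When distinct is true, a value equal to the last one emitted is dropped.
    static List<Integer> merge(List<List<Integer>> sortedRuns, boolean distinct) {
        PriorityQueue<Head> queue = new PriorityQueue<>();
        for (List<Integer> run : sortedRuns) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) {
                queue.add(new Head(it.next(), it));
            }
        }

        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Head head = queue.poll(); // smallest head across all runs
            boolean duplicate = distinct && !out.isEmpty()
                    && out.get(out.size() - 1) == head.value;
            if (!duplicate) {
                out.add(head.value);
            }
            // Refill the queue from the run we just consumed from.
            if (head.rest.hasNext()) {
                queue.add(new Head(head.rest.next(), head.rest));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 3, 5),     // first spilled file
                Arrays.asList(1, 2, 5, 6),  // second spilled file
                Arrays.asList(2, 4));       // sorted in-memory remainder
        System.out.println(merge(runs, false)); // [1, 1, 2, 2, 3, 4, 5, 5, 6]
        System.out.println(merge(runs, true));  // [1, 2, 3, 4, 5, 6]
    }
}
```

Because every run is sorted before the merge, equal values come out of the queue next to each other, so the distinct case only needs to compare each value with the last one emitted.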
