bump

On Fri, Dec 5, 2014 at 5:33 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
> Hello Drillers,
>
> I am currently working on documentation to describe the current
> interface and implementation patterns used in RecordBatch and its
> subclasses. These classes currently contain the implementations of all of
> our physical operators; subclasses include FilterRecordBatch, HashAggBatch,
> etc.
>
> This naming convention has been a point of confusion for many developers
> as they get up to speed on Drill and begin to piece together the control
> flow of a query. The name "RecordBatch" implies that the class is logically
> a data structure that holds a batch of records.
>
> During execution, each downstream operator (which implements the
> RecordBatch interface) is able to access all of the data in the
> current batches (the actual data structure) from the operator(s)
> immediately preceding it. In this sense, calling this class a RecordBatch
> is not entirely inaccurate, as it provides a reference into the current
> data.
>
> Where it gets confusing is that it does not just hold data.
> Each RecordBatch has a next() method, which is used to retrieve the next
> batch of records (the data structure). This is possible because the
> data is shared with consumers of the interface in the form of a vector
> container object, which wraps value vectors. A call to next() swaps out
> the data in the vector containers with new data.
>
> I was talking with a few members of the dev team about this problem and we
> all agreed that the interface and its implementations should be
> renamed. We tried to talk further about the overall model and decided that
> some refactoring/encapsulation may come along with this renaming as we
> clarify these concepts.
>
> I would like to begin this discussion by proposing our
> candidates for new names of the interface. The three that stood out for us
> were BatchIterator, BatchStream, and BatchCursor. These all represent a
> logical wrapper around data that will be accessed by a consumer over time,
> and will be accessed in discrete chunks at some level. Each has existing
> conventions that define it, and some might be more appropriate than
> others for the current implementation used by Drill.
>
> Please share your thoughts on the best possible new name for RecordBatch.
>
> Thanks,
> Jason
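
For anyone skimming the thread, the next()/vector container behaviour described in the quoted message can be pictured with a rough sketch. This is not Drill code: VectorContainerSketch, ScanSketch, FilterSketch and BatchIteratorDemo are hypothetical names, BatchIterator is only one of the proposed names, and the boolean return from next() and the int[] payload are simplifying assumptions. It only illustrates the point that a consumer holds one container reference and each call to next() swaps new data into it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a vector container: the consumer keeps one
// reference, and the contents behind it are replaced on every next().
final class VectorContainerSketch {
    private int[] values = new int[0];

    void load(int[] newValues) { values = newValues; }
    int recordCount() { return values.length; }
    int get(int index) { return values[index]; }
}

// One of the proposed names, used here purely for illustration.
interface BatchIterator {
    VectorContainerSketch container();

    // Assumption of this sketch: returns false once no more batches exist.
    boolean next();
}

// Hypothetical leaf operator that hands out pre-built batches.
final class ScanSketch implements BatchIterator {
    private final VectorContainerSketch out = new VectorContainerSketch();
    private final List<int[]> batches;
    private int position = 0;

    ScanSketch(List<int[]> batches) { this.batches = batches; }

    public VectorContainerSketch container() { return out; }

    public boolean next() {
        if (position >= batches.size()) {
            return false;
        }
        out.load(batches.get(position++)); // swap in the next batch
        return true;
    }
}

// Hypothetical downstream operator: reads the upstream container after each
// upstream next() and publishes only the even values in its own container.
final class FilterSketch implements BatchIterator {
    private final BatchIterator upstream;
    private final VectorContainerSketch out = new VectorContainerSketch();

    FilterSketch(BatchIterator upstream) { this.upstream = upstream; }

    public VectorContainerSketch container() { return out; }

    public boolean next() {
        if (!upstream.next()) {
            return false;
        }
        VectorContainerSketch in = upstream.container();
        int[] kept = new int[in.recordCount()];
        int count = 0;
        for (int i = 0; i < in.recordCount(); i++) {
            if (in.get(i) % 2 == 0) {
                kept[count++] = in.get(i);
            }
        }
        out.load(Arrays.copyOf(kept, count));
        return true;
    }
}

final class BatchIteratorDemo {
    // The container reference is obtained once; only its contents change.
    static List<Integer> drain(BatchIterator op) {
        VectorContainerSketch shared = op.container();
        List<Integer> seen = new ArrayList<>();
        while (op.next()) {
            for (int i = 0; i < shared.recordCount(); i++) {
                seen.add(shared.get(i));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        BatchIterator scan = new ScanSketch(
                Arrays.asList(new int[] {1, 2, 3}, new int[] {4, 5, 6}));
        System.out.println(drain(new FilterSketch(scan)));
    }
}
```

Seen this way, each operator behaves like an iterator or cursor over batches, not like a batch itself, which is the mismatch the renaming is meant to address.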