bump

On Fri, Dec 5, 2014 at 5:33 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> Hello Drillers,
>
> I am currently working on trying to write documentation to describe our
> current interface and implementation patterns used in RecordBatch and its
> subclasses. These classes currently contain the implementations of all of
> our physical operators, subclasses include FilterRecordBatch, HashAggBatch,
> etc.
>
> This naming convention has been a point of confusion for many developers
> as they get up to speed on Drill and begin to piece together the control
> flow of a query. The name "RecordBatch" implies that the class is logically
> a data structure that holds a batch of records.
>
> During execution, each downstream operator (which implements the
> RecordBatch interface) will be able to access all of the data in the
> current batches (the actual data structure) from the operator(s)
> immediately preceding it. In this sense, calling this class a RecordBatch
> is not entirely inaccurate, as it is providing a reference into the current
> data.
>
> Where it gets confusing is that it does not just hold data.
> Each RecordBatch has a next() method, which is used to retrieve the next
> batch of records (the data structure). The way this is possible is that the
> data is shared with consumers of the interface in the form of a vector
> container object, which wraps value vectors. A call to next() will swap out
> the data in the vector containers with new data.
>
> I was talking with a few members of the dev team about this problem and we
> were all in agreement that the interface and its implementations should be
> renamed. We tried to talk further about the overall model and decided that
> some refactoring/encapsulation may come along with this renaming as we
> clarify these concepts.
>
> I would like to propose the beginning of this discussion with our
> candidates for new names of the interface. The three that stood out for us
> were BatchIterator, BatchStream, and BatchCursor. These all represent a
> logical wrapper around data that will be accessed by a consumer over time,
> and will be accessed in discrete chunks at some level. Each has existing
> conventions that define them, and some might be more appropriate than
> others for the current implementation used by Drill.
>
> Please share your thoughts on the best possible new name for RecordBatch.
>
> Thanks,
> Jason
>
>
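For anyone picking this thread back up, the next()/vector-container behaviour described in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions, not Drill's actual code: the names Container, Scan, Filter and drain are hypothetical, plain int arrays stand in for value vectors, next() is simplified to return a boolean, and BatchIterator is just one of the three proposed names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

public class BatchIteratorSketch {

    /** Hypothetical stand-in for a vector container; an int[] plays the role of the value vectors. */
    static final class Container {
        private int[] values = new int[0];

        int[] values() { return values; }

        void replace(int[] newValues) { values = newValues; }
    }

    /** Hypothetical operator contract, using one of the proposed names. */
    interface BatchIterator {
        /** Always returns the same object; consumers hold on to this reference. */
        Container container();

        /** Swaps the next batch into the container; false once the data is exhausted (a simplification). */
        boolean next();
    }

    /** Leaf operator that replays a fixed list of batches. */
    static final class Scan implements BatchIterator {
        private final Container container = new Container();
        private final int[][] batches;
        private int position = 0;

        Scan(int[][] batches) { this.batches = batches; }

        public Container container() { return container; }

        public boolean next() {
            if (position >= batches.length) return false;
            container.replace(batches[position++]);
            return true;
        }
    }

    /** Downstream operator: pulls from the operator preceding it and publishes the filtered batch. */
    static final class Filter implements BatchIterator {
        private final Container container = new Container();
        private final BatchIterator upstream;
        private final IntPredicate keep;

        Filter(BatchIterator upstream, IntPredicate keep) {
            this.upstream = upstream;
            this.keep = keep;
        }

        public Container container() { return container; }

        public boolean next() {
            if (!upstream.next()) return false;
            // Read the upstream operator's current batch through its container.
            int[] incoming = upstream.container().values();
            container.replace(Arrays.stream(incoming).filter(keep).toArray());
            return true;
        }
    }

    /** Consumer loop: the container reference is fetched once, yet its contents change on every next(). */
    static List<String> drain(BatchIterator root) {
        Container shared = root.container();
        List<String> seen = new ArrayList<>();
        while (root.next()) {
            seen.add(Arrays.toString(shared.values()));
        }
        return seen;
    }

    public static void main(String[] args) {
        BatchIterator scan = new Scan(new int[][] {{1, 5, 8}, {2, 9}, {7}});
        BatchIterator filter = new Filter(scan, v -> v > 4);
        for (String batch : drain(filter)) {
            System.out.println(batch);
        }
    }
}
```

The point of the sketch is the consumer loop in drain(): the object behaves like an iterator or cursor over batches rather than like a batch itself, which is the mismatch the rename is meant to address.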
