Agreed that allocating vectors up front is another good improvement.
The average batch size approach gets us 80% of the way to the goal: it limits
batch size and allows vector preallocation.
What it cannot do is limit individual vector sizes. Nor can it ensure that the
resulting batch is optimally loaded with data. Getting the remaining 20%
requires the level of detail provided by the result set loader.
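To make the mechanics concrete, here is a rough sketch of that calculation,
assuming a 16 MB batch budget and a 64K row cap; the class and method names
are made up for illustration and are not Drill's actual batch-sizing code:

    import java.util.List;

    // Sketch of the average-row-size method: observe column widths on the
    // incoming batch, derive a row limit for the outgoing batch, and size
    // each outgoing vector up front. Budget, cap, and names are assumptions.
    public class AverageBatchSizer {

      private static final long TARGET_BATCH_BYTES = 16L * 1024 * 1024;
      private static final int MAX_ROWS_PER_BATCH = 64 * 1024;

      /** Average value width observed for one column of the incoming batch. */
      public static class ColumnStats {
        final String name;
        final int avgBytesPerValue;

        ColumnStats(String name, int avgBytesPerValue) {
          this.name = name;
          this.avgBytesPerValue = avgBytesPerValue;
        }
      }

      /** Row limit for the outgoing batch, from the observed row width. */
      public static int outgoingRowLimit(List<ColumnStats> columns) {
        long avgRowBytes = 0;
        for (ColumnStats column : columns) {
          avgRowBytes += column.avgBytesPerValue;
        }
        avgRowBytes = Math.max(1, avgRowBytes);
        return (int) Math.min(MAX_ROWS_PER_BATCH,
                              TARGET_BATCH_BYTES / avgRowBytes);
      }

      /** Bytes to preallocate for one column's vector at that row limit. */
      public static long preallocationBytes(ColumnStats column, int rowLimit) {
        return (long) column.avgBytesPerValue * rowLimit;
      }
    }

With something like this, the outgoing batch size is bounded and each vector
can be sized once, but nothing checks an individual vector against its own
limit, which is the gap the result set loader closes.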
We are pushing to use the result set loader first in readers, since readers
can't use the average batch size approach (they have no input batch from which
to obtain size estimates).
To use the result set loader in non-leaf operators, we'd need to modify code
generation. AFAIK, that is not something anyone is working on, so another
advantage of the average batch size method is that it works with the code
generation we already have.
Thanks,
- Paul
On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy
<[email protected]> wrote:
With the average row size method, since I know the number of rows and the
average size of each column, I am planning to use that information to allocate
the required memory for each vector upfront.
This should help avoid copying every time we double, and it should also
improve memory utilization.
Thanks
Padma
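
For illustration, here is a minimal sketch of the difference between doubling
and upfront allocation, using a plain ByteBuffer as a stand-in for a value
vector's data buffer; the class and method names are hypothetical:

    import java.nio.ByteBuffer;

    // Illustrative only: a ByteBuffer stands in for a vector's data buffer.
    public class UpfrontAllocationSketch {

      // Doubling growth: each expansion allocates a new buffer and copies the
      // old contents, so building a large vector pays for a series of copies.
      static ByteBuffer growByDoubling(ByteBuffer current, int neededBytes) {
        int capacity = Math.max(current.capacity(), 1);
        while (capacity < neededBytes) {
          capacity *= 2;
        }
        ByteBuffer bigger = ByteBuffer.allocate(capacity);
        current.flip();
        bigger.put(current);   // copy on every doubling
        return bigger;
      }

      // Upfront allocation: with the row count and average value width known
      // from the incoming batch, the buffer is sized once and the copies go
      // away.
      static ByteBuffer allocateUpfront(int rowCount, int avgBytesPerValue) {
        return ByteBuffer.allocate(rowCount * avgBytesPerValue);
      }
    }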
> On Feb 11, 2018, at 3:44 PM, Paul Rogers <[email protected]> wrote:
>
> One more thought:
>>> 3) Assuming that you go with the average batch size calculation approach,
>
> The average batch size approach is a quick-and-dirty technique for non-leaf
> operators that can observe an incoming batch to estimate row width. Because
> Drill batches are large, the law of large numbers means that the average of a
> large input batch is likely to be a good estimator for the average size of a
> large output batch.
> Note that this works only because non-leaf operators have an input batch to
> sample. Leaf operators (readers) do not have this luxury. Hence the result
> set loader uses the actual accumulated size for the current batch.
> Also note that the average row method, while handy, is not optimal. It will,
> in general, result in greater internal fragmentation than the result set
> loader. Why? The result set loader packs vectors right up to the point where
> the largest would overflow. The average row method works at the aggregate
> level and will likely result in wasted space (internal fragmentation) in the
> largest vector. Said another way, with the average row size method we usually
> could have packed a few more rows into the batch before it actually filled, so
> we end up with batches of lower "density" than the optimal. This is important
> when the consuming operator is a buffering one such as sort.
> The key reason Padma is using the quick & dirty average row size method is
> not that it is ideal (it is not), but rather that it is, in fact, quick.
> We do want to move to the result set loader over time so we get improved
> memory utilization. And, it is the only way to control row size in readers
> such as CSV or JSON in which we have no size information until we read the
> data.
> - Paul
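
As a footnote to the fragmentation point in the quoted note above, the
difference between the two stopping rules can be sketched roughly as follows;
the capacity arguments and all names are assumptions for illustration, not the
result set loader's actual interface:

    // Rough sketch of the two batch-filling rules discussed above.
    public class BatchDensitySketch {

      // Average-row-size method: the row limit is fixed before loading from
      // the aggregate estimate, so no individual vector is checked and the
      // largest one typically keeps unused space (internal fragmentation).
      static int rowLimitFromEstimate(int batchBudgetBytes,
                                      int estimatedAvgRowBytes) {
        return batchBudgetBytes / estimatedAvgRowBytes;
      }

      // Result-set-loader style: track how full each vector actually is and
      // stop only when writing the next row would overflow the fullest one,
      // packing the batch nearly to capacity.
      static int rowsUntilOverflow(int[][] bytesPerColumnPerRow,
                                   int vectorCapacityBytes) {
        int columnCount = bytesPerColumnPerRow[0].length;
        int[] used = new int[columnCount];
        int rows = 0;
        for (int[] row : bytesPerColumnPerRow) {
          for (int col = 0; col < columnCount; col++) {
            if (used[col] + row[col] > vectorCapacityBytes) {
              return rows;   // next row would overflow a vector; end the batch
            }
          }
          for (int col = 0; col < columnCount; col++) {
            used[col] += row[col];
          }
          rows++;
        }
        return rows;
      }
    }

The second rule needs per-value bookkeeping while the batch is being loaded,
which is the level of detail the result set loader tracks and the aggregate
estimate does not.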