Hi Paul,
thanks for your comments.  I have added my thoughts in the DRILL-6147 JIRA
as well.  Regarding the hangout, let me find out about availability of
other folks too and will circle back with you.

thanks,
Aman

On Sun, Mar 4, 2018 at 1:23 PM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Aman,
> To follow up, we should look at all sides of the issue. One factor
> overlooked in my previous note is that code now is better than code later.
> DRILL-6147 is available today and will immediately give users a
> performance boost. The result set loader is large and will take some months
> to commit, and so can't offer a benefit until then.
> It is hard to argue that we wait. Let's get DRILL-6147 in now, then
> revisit the issue later (doing the proposed test) once the result set
> loader is available.
> And, as discussed, DRILL-6147 works only for the flat Parquet reader.
> We'll need the result set loader for the Parquet reader that reads nested
> types.
>
> Thanks,
> - Paul
>
>
>
>     On Sunday, March 4, 2018, 1:07:38 PM PST, Paul Rogers
> <par0...@yahoo.com.INVALID> wrote:
>
>  Hi Aman,
> Please see my comment in DRILL-6147.
> For the hangout to be productive, perhaps we should create test cases that
> will show the benefit of DRILL-6147 relative to the result set loader.
> The test case of interest has three parts:
> * Multiple variable-width fields (say five) with a large variance in field
> widths in each field
> * Large data set that will be split across multiple batches (say 10 or 50
> batches)
> * Constraints on total batch size and size of the largest vector
> Clearly, we can't try this out with Parquet: that's the topic we are
> discussing.
> But, we can generate a data set in code, then do a unit test of the two
> methods (just the vector loading bits) and time the result. Similar code
> already exists in the result set loader branch that can be repurposed for
> this use. We'd want to create a similar test for the DRILL-6147 mechanisms.
> We can work out the details in a separate discussion.
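The micro-benchmark Paul sketches above could look something like this (a hedged illustration in Python rather than Drill's Java, with plain byte buffers standing in for value vectors; `make_rows`, `load_with_doubling`, and `load_preallocated` are made-up names for the two loading strategies, not Drill APIs):

```python
import random
import string
import timeit

def make_rows(num_rows, num_fields=5, min_w=1, max_w=200, seed=42):
    """Generate rows with several variable-width fields whose widths
    vary widely, per the test case described above."""
    rng = random.Random(seed)
    return [
        tuple(
            "".join(rng.choices(string.ascii_letters, k=rng.randint(min_w, max_w)))
            for _ in range(num_fields)
        )
        for _ in range(num_rows)
    ]

def load_with_doubling(rows):
    """Append bytes into a buffer that doubles on overflow: the
    reallocate-and-copy behavior preallocation is meant to avoid."""
    buf = bytearray(32)
    used = 0
    for row in rows:
        for field in row:
            data = field.encode()
            while used + len(data) > len(buf):
                buf = bytearray(buf) + bytearray(len(buf))  # double capacity
            buf[used:used + len(data)] = data
            used += len(data)
    return used

def load_preallocated(rows, avg_row_width):
    """Size the buffer up front from the average row width, with a
    fallback doubling only on a rare spill."""
    buf = bytearray(avg_row_width * len(rows) + 64)
    used = 0
    for row in rows:
        for field in row:
            data = field.encode()
            while used + len(data) > len(buf):
                buf = bytearray(buf) + bytearray(len(buf))
            buf[used:used + len(data)] = data
            used += len(data)
    return used

rows = make_rows(10_000)
avg = sum(len(f) for r in rows for f in r) // len(rows)
t1 = timeit.timeit(lambda: load_with_doubling(rows), number=3)
t2 = timeit.timeit(lambda: load_preallocated(rows, avg), number=3)
print(f"doubling: {t1:.3f}s  preallocated: {t2:.3f}s")
```

The real test would, as Paul says, reuse the vector-loading bits from the result set loader branch and the DRILL-6147 code; this only shows the shape of the comparison.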
> IMHO, if the results are the same, we should go with one solution. If
> DRILL-6147 is significantly faster, the decision is clear: we have two
> solutions.
> We also should consider things such as selection, null columns, implicit
> columns, and the other higher-level functionality provided by the result
> set loader. Since Parquet already has ad-hoc solutions for these, with
> DRILL-6147 we'd simply keep those solutions for Parquet, while the other
> readers use the new, unified mechanisms.
> In terms of time, this week is busy:
> * Wed. 3PM or later
> * Fri. 3PM or later
> The week of the 12th is much more open.
> Thanks,
> - Paul
>
>
>
>     On Sunday, March 4, 2018, 11:48:33 AM PST, Aman Sinha <
> amansi...@apache.org> wrote:
>
>  Hi all,  with reference to DRILL-6147
> <https://issues.apache.org/jira/browse/DRILL-6147> given the overlapping
> approaches,  I feel like we should have a separate hangout session with
> interested parties and discuss the details.
> Let me know and I can set one up.
>
> Aman
>
> On Mon, Feb 12, 2018 at 8:50 AM, Padma Penumarthy <ppenumar...@mapr.com>
> wrote:
>
> > If our goal is not to allocate more than 16MB for individual vectors,
> > to avoid external fragmentation, I guess we can take that also into
> > consideration in our calculations to figure out the outgoing number
> > of rows. The math might become more complex. But, the main point,
> > like you said, is operators know what they are getting and can figure
> > out how to deal with that to honor the constraints imposed.
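The row-count math Padma describes can be sketched as follows (an illustration only: `outgoing_row_count` is a hypothetical name, and the 32 MB total-batch budget is an assumed figure for the example, not a Drill constant; only the 16 MB per-vector ceiling comes from the discussion above):

```python
VECTOR_LIMIT = 16 * 1024 * 1024   # 16 MB ceiling on any single vector
BATCH_LIMIT  = 32 * 1024 * 1024   # assumed total-batch budget for the sketch

def outgoing_row_count(avg_col_widths):
    """Pick a row count that honors both the total batch budget and
    the per-vector ceiling, from average per-column widths in bytes."""
    row_width = sum(avg_col_widths)
    by_batch  = BATCH_LIMIT // row_width
    by_vector = min(VECTOR_LIMIT // w for w in avg_col_widths if w > 0)
    return min(by_batch, by_vector)

# Example: one wide column dominates, so the per-vector cap binds first.
print(outgoing_row_count([8, 8, 1024]))  # → 16384
```

This is the "more complex math" in miniature: the binding constraint is whichever of the two limits is hit first.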
> >
> > Thanks
> > Padma
> >
> >
> > On Feb 12, 2018, at 8:25 AM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> >
> > Agreed that allocating vectors up front is another good improvement.
> > The average batch size approach gets us 80% of the way to the goal: it
> > limits batch size and allows vector preallocation.
> > What it cannot do is limit individual vector sizes. Nor can it ensure
> > that the resulting batch is optimally loaded with data. Getting the
> > remaining 20% requires the level of detail provided by the result set
> > loader.
> > We are driving to use the result set loader first in readers, since
> > readers can't use the average batch size (they don't have an input
> > batch to use to obtain sizes.)
> > To use the result set loader in non-leaf operators, we'd need to modify
> > code generation. AFAIK, that is not something anyone is working on, so
> > another advantage of the average batch size method is that it works with
> > the code generation we already have.
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy
> > <ppenumar...@mapr.com> wrote:
> >
> > With the average row size method, since I know the number of rows and
> > the average size of each column, I am planning to use that
> > information to allocate the required memory for each vector upfront.
> > This should help avoid copying every time we double and also improve
> > memory utilization.
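The upfront allocation Padma describes amounts to sizing each vector from its average width times the planned row count; the sketch below assumes the allocator rounds sizes to powers of two (as Drill's Netty-based allocator does), and `preallocation_plan` is a made-up name for the example:

```python
def preallocation_plan(avg_col_widths, row_count):
    """Bytes to allocate per vector up front so that, barring
    underestimates, no doubling copy is needed while loading."""
    def round_pow2(n):
        p = 1
        while p < n:
            p *= 2
        return p
    return [round_pow2(w * row_count) for w in avg_col_widths]

# Two columns averaging 4 and 50 bytes, 1000 rows per batch.
print(preallocation_plan([4, 50], 1000))  # → [4096, 65536]
```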
> >
> > Thanks
> > Padma
> >
> >
> > On Feb 11, 2018, at 3:44 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> >
> > One more thought:
> > 3) Assuming that you go with the average batch size calculation approach,
> >
> > The average batch size approach is a quick and dirty approach for
> > non-leaf operators that can observe an incoming batch to estimate row
> > width. Because Drill batches are large, the law of large numbers
> > means that the average of a large input batch is likely to be a good
> > estimator for the average size of a large output batch.
> > Note that this works only because non-leaf operators have an input
> > batch to sample. Leaf operators (readers) do not have this luxury.
> > Hence the result set loader uses the actual accumulated size for the
> > current batch.
> > Also note that the average row method, while handy, is not optimal.
> > It will, in general, result in greater internal fragmentation than
> > the result set loader. Why? The result set loader packs vectors right
> > up to the point where the largest would overflow. The average row
> > method works at the aggregate level and will likely result in wasted
> > space (internal fragmentation) in the largest vector. Said another
> > way, with the average row size method, we can usually pack in a few
> > more rows before the batch actually fills, and so we end up with
> > batches with lower "density" than the optimal. This is important when
> > the consuming operator is a buffering one such as sort.
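The density argument above can be illustrated with a toy sketch (Python; single widths stand in for whole rows, and the function names are invented for the example, not taken from either implementation):

```python
import random

def rows_by_average(widths, limit):
    """Average-row-size method: fix the row count up front from the
    overall average width, without looking at the actual rows."""
    avg = sum(widths) / len(widths)
    return min(len(widths), int(limit // avg))

def rows_by_overflow(widths, limit):
    """Result-set-loader style: keep adding rows until the next one
    would overflow, packing the batch right up to the limit."""
    used = count = 0
    for w in widths:
        if used + w > limit:
            break
        used += w
        count += 1
    return count

rng = random.Random(7)
widths = [rng.randint(1, 200) for _ in range(100_000)]
limit = 1 << 20  # 1 MB batch budget for the sketch
n_avg, n_ovf = rows_by_average(widths, limit), rows_by_overflow(widths, limit)
used_ovf = sum(widths[:n_ovf])
# The overflow method leaves less than one row of slack by construction;
# the average method's slack depends on how the actual rows run.
print(n_avg, n_ovf, limit - used_ovf)
```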
> > The key reason Padma is using the quick & dirty average row size
> > method is not that it is ideal (it is not), but rather that it is, in
> > fact, quick.
> > We do want to move to the result set loader over time so we get
> > improved memory utilization. And, it is the only way to control row
> > size in readers such as CSV or JSON in which we have no size
> > information until we read the data.
> > - Paul
> >
> >
>
>
