Thanks Parth for your feedback! I plan to enhance the document based on the feedback received and on the prototype I am currently working on!
Regards,
Salim

On Sun, Feb 11, 2018 at 2:36 PM, salim achouche <sachouc...@gmail.com> wrote:

> Thanks Paul for your feedback! Let me try to answer some of your questions
> / comments:
>
> *Duplicate Implementation*
> - I am *not* contemplating two different implementations, one for Parquet
>   and another for the rest of the code
> - Instead, I am reacting to the fact that we have two different processing
>   patterns: row-oriented and columnar
> - The goal is to offer both strategies, depending on the operator
>
> *Complex vs. Flat Parquet Readers*
> - The Complex and Flat Parquet readers are quite different
> - I presume that, for the sake of performance, we can enhance our SQL
>   capabilities so that the Flat Parquet reader is invoked more frequently
>
> *Predicate Pushdown*
> - The reason I invoked predicate pushdown within the document is to help
>   the analysis:
>   o Notice how record batch materialization could involve many more pages
>   o A solution that relies mainly on the current set of pages (one per
>     column) might pay a heavy IO price without much to show for it:
>     + It waits for all columns to have at least one page loaded so that
>       upfront stats can be gathered
>     + Batch memory is then divided optimally across columns, and the
>       current batch size is computed
>     + Unfortunately, such logic will fail if more pages are involved than
>       the ones taken into consideration
>   o Example:
>     + Two variable-length columns, c1 and c2
>     + The reader waits for two pages, P1-1 and P2-1, so that we can a)
>       allocate memory optimally across c1 and c2 and b) compute a batch
>       size that will minimize overflow logic
>     + Assume that, because of data-length skew or predicate pushdown, more
>       pages are involved in loading the batch:
>       for c1: {P1-1, P1-2, P1-3, P1-4}; for c2: {P2-1, P2-2}
>     + It is now quite possible that the overflow logic is suboptimal,
>       since only two pages' statistics were considered instead of six
>
> - I have added new logic to the ScanBatch to log (on demand) extra batch
>   statistics, which will help us assess the efficiency of the batch sizing
>   strategy; I will add this information to the document when this sub-task
>   is done
>
> *Implementation Strategy*
> - DRILL-6147's mission is to implement batch sizing for Flat Parquet with
>   *minimal overhead*
> - This will also help us test this functionality for end-to-end cases
>   (whole query)
> - My next task (after DRILL-6147) is to incorporate your framework with
>   Parquet
> - I will a) enhance the framework to support columnar processing and b)
>   refactor the Parquet code to use the framework
> - I agree there might be some duplicated effort, but I really believe it
>   will be minimal
> - DRILL-6147 is no more than one week of research & analysis and one week
>   of implementation
>
> Regards,
> Salim
>
> On Feb 11, 2018, at 1:35 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid
> working at cross purposes. I've outlined the issues, and a possible path
> forward, in a comment on DRILL-6147.
> Quick summary: creating a second batch-sizing implementation just for
> Parquet will be very difficult once we handle all the required use cases
> spelled out in the comment. We'd want to be very sure that we do, indeed,
> want to duplicate this effort before we head down that route. Duplicating
> the effort means repeating all the work done over the last six months to
> make the original result set loader work, plus the future work needed to
> maintain two parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
>
> On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra
> <par...@apache.org> wrote:
>
> Thanks Salim.
> Can you add this to the JIRA/design doc? Also, I would venture to suggest
> that the section on predicate pushdown could be made clearer.
> Also, since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc (perhaps pseudo-code and the affected classes).
> Essentially, some guarantees provided by the framework will change, and
> this may affect (or not) the existing usage. These should be enumerated in
> the design doc.
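
The page-stats batch-sizing idea discussed in the thread, and its failure mode, can be sketched as follows. This is a minimal, hypothetical illustration, not Drill's actual code: the class, method names, and numbers are all invented for the example. It divides a fixed batch memory budget across variable-length columns using the average value width observed in each column's first loaded page (P1-1 and P2-1 in the example above), then shows how width skew in later pages (P1-2..P1-4) invalidates the estimate.

```java
import java.util.Arrays;
import java.util.List;

public class BatchSizer {

    /** Average value width (bytes) observed in a column's first loaded page. */
    static final class PageStats {
        final String column;
        final int avgValueWidth;

        PageStats(String column, int avgValueWidth) {
            this.column = column;
            this.avgValueWidth = avgValueWidth;
        }
    }

    /**
     * Plan the row count for the next batch: the memory budget divided by
     * the estimated row width (the sum of the per-column average widths
     * taken from each column's first page only).
     */
    static int computeBatchRowCount(List<PageStats> firstPages, long memoryBudget) {
        long rowWidth = 0;
        for (PageStats s : firstPages) {
            rowWidth += s.avgValueWidth;
        }
        return (int) (memoryBudget / Math.max(1, rowWidth));
    }

    public static void main(String[] args) {
        // Stats from the first page of each column (P1-1 and P2-1);
        // the widths are made up for illustration.
        List<PageStats> firstPages = Arrays.asList(
            new PageStats("c1", 40),
            new PageStats("c2", 24));
        long budget = 16L * 1024 * 1024;  // 16 MiB batch budget
        int plannedRows = computeBatchRowCount(firstPages, budget);
        System.out.println("planned rows = " + plannedRows);  // 262144

        // The failure mode from the thread: the later pages of c1 carry
        // much wider values, so the actual row width exceeds the estimate
        // and the planned batch overflows the budget.
        long actualRowWidth = 120 + 24;   // skewed c1 width in P1-2..P1-4
        long actualBytes = (long) plannedRows * actualRowWidth;
        System.out.println("overflow = " + (actualBytes > budget));  // true
    }
}
```

With the skewed widths, the batch planned from two pages' statistics lands at more than double the budget, which is the situation the overflow logic (and the extra ScanBatch statistics) is meant to surface.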