Thanks, Paul, for your feedback! Let me try to answer some of your questions /
comments:
Duplicate Implementation
- I am not contemplating two different implementations, one for Parquet and
another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing
patterns: row-oriented and columnar
- The goal is to offer both strategies, depending on the operator (see the
sketch below)
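For illustration, here is a minimal sketch of what offering both strategies
behind a single interface could look like. The names are hypothetical, not
actual Drill classes; it is only meant to show the shape of the idea, with
each operator picking the strategy that matches how it consumes data:

    // Hypothetical sketch: illustrative names only, not actual Drill classes.
    public interface BatchSizingStrategy {
      /** Target row count for the next batch under the given memory budget. */
      int targetRowCount(long memoryBudgetBytes);
    }

    /** Row-oriented operators: estimate from a running average row width. */
    class RowOrientedSizing implements BatchSizingStrategy {
      private final int avgRowWidthBytes;

      RowOrientedSizing(int avgRowWidthBytes) {
        this.avgRowWidthBytes = avgRowWidthBytes;
      }

      @Override
      public int targetRowCount(long memoryBudgetBytes) {
        return (int) (memoryBudgetBytes / Math.max(1, avgRowWidthBytes));
      }
    }

    /** Columnar operators (e.g. the Flat Parquet reader): estimate from
     *  per-column widths so memory can be divided across column vectors. */
    class ColumnarSizing implements BatchSizingStrategy {
      private final int[] avgColumnWidthsBytes;

      ColumnarSizing(int[] avgColumnWidthsBytes) {
        this.avgColumnWidthsBytes = avgColumnWidthsBytes.clone();
      }

      @Override
      public int targetRowCount(long memoryBudgetBytes) {
        long rowWidth = 0;
        for (int w : avgColumnWidthsBytes) {
          rowWidth += w;
        }
        return (int) (memoryBudgetBytes / Math.max(1, rowWidth));
      }
    }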
Complex vs. Flat Parquet Readers
- The Complex and Flat Parquet readers are quite different
- I presume, for the sake of performance, we can enhance our SQL capabilities
so that the Flat Parquet reader can be invoked more frequently
Predicate Pushdown
- The reason I brought up Predicate Pushdown within the document is to help
the analysis:
o Notice how Record Batch materialization could involve many more pages
o A solution that relies mainly on the current set of pages (one per column)
might pay a heavy IO price without much to show for it:
+ The reader waits for all columns to have at least one page loaded so that
upfront stats can be gathered
+ Batch memory is then divided optimally across columns and the current
batch size is computed
+ Unfortunately, such logic will fail if more pages are involved than the
ones taken into consideration
o Example:
+ Two variable length columns c1 and c2
+ The reader waits for two pages, P1-1 and P2-1, so that we can a) allocate
memory optimally across c1 and c2 and b) compute a batch size that will
minimize overflow handling
+ Assume that, because of data-length skew or predicate pushdown, more
pages are involved in loading the batch
+ For c1: {P1-1, P1-2, P1-3, P1-4}; for c2: {P2-1, P2-2}
+ It is now quite possible that the overflow logic will not be optimal,
since the statistics of only two pages were considered instead of six (see
the sketch after this list)
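To make the failure mode concrete, here is a minimal, simplified sketch
(hypothetical names, not actual Drill code) of the sizing computation; it
shows how a row-count target derived from the first page of each column can
be far off when later pages carry wider values:

    // Hypothetical, simplified sketch of the sizing logic described above.
    class PageStatsSizing {
      // pageAvgValueWidths[i] = average value width (bytes) observed in the
      // first loaded page of column i (e.g. P1-1 for c1, P2-1 for c2).
      static int computeBatchRowCount(long batchMemoryBytes,
                                      int[] pageAvgValueWidths) {
        long estimatedRowWidth = 0;
        for (int width : pageAvgValueWidths) {
          estimatedRowWidth += width;
        }
        // Divide the batch memory budget by the estimated row width.
        return (int) (batchMemoryBytes / Math.max(1, estimatedRowWidth));
      }

      public static void main(String[] args) {
        // First-page stats for c1 and c2 suggest 40-byte values each, so a
        // 16 MB budget targets ~209,715 rows.
        int target = computeBatchRowCount(16L * 1024 * 1024, new int[] {40, 40});
        System.out.println("target rows = " + target);
        // If the later pages of c1 (P1-2..P1-4) actually average 200-byte
        // values, the true row width is ~240 bytes and the batch only fits
        // ~69,905 rows; everything beyond that lands in the overflow logic,
        // which was tuned from the first-page stats alone.
      }
    }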
- I have added new logic to the ScanBatch to log (on demand) extra batch
statistics, which will help us assess the efficiency of the batch sizing
strategy; I will add this information to the document when this sub-task is
done (a sketch follows)
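As a rough illustration of what I mean (hypothetical names; the actual
ScanBatch change may look different), the kind of on-demand statistics I
have in mind is:

    // Hypothetical sketch of per-batch statistics worth logging on demand;
    // class and field names are illustrative, not the actual ScanBatch code.
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class BatchSizingStats {
      private static final Logger logger =
          LoggerFactory.getLogger(BatchSizingStats.class);

      long allocatedBytes;   // memory reserved for the batch
      long usedBytes;        // memory actually consumed by the value vectors
      int rowCount;          // rows materialized in the batch
      int overflowRowCount;  // rows that had to be moved to an overflow batch

      void logOnDemand() {
        if (logger.isDebugEnabled()) {
          logger.debug(
              "batch stats: rows={}, avgRowWidth={}B, used/allocated={}/{}, "
                  + "overflow rows={}",
              rowCount,
              rowCount == 0 ? 0 : usedBytes / rowCount,
              usedBytes, allocatedBytes, overflowRowCount);
        }
      }
    }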
Implementation Strategy
- The mission of DRILL-6147 is to implement batch sizing for Flat Parquet
with minimal overhead
- This will also help us test this functionality for end-to-end cases (whole
query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet
- I will a) enhance the framework to support columnar processing and b)
refactor the Parquet code to use the framework
- I agree there might be some duplicated effort, but I really believe it will
be minimal
- DRILL-6147 amounts to no more than one week of research & analysis and one
week of implementation
Regards,
Salim
> On Feb 11, 2018, at 1:35 PM, Paul Rogers <[email protected]> wrote:
>
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid
> working at cross purposes. I've outlined the issues, and a possible path
> forward, in a comment to DRILL-6147.
> Quick summary: creating a second batch size implementation just for Parquet
> will be very difficult once we handle all the required use cases as spelled
> out in the comment. We'd want to be very sure that we do, indeed, want to
> duplicate this effort before we head down that route. Duplicating the effort
> means repeating all the work done over the last six months to make the
> original result set loader work, and the future work needed to maintain two
> parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
>
> On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra
> <[email protected]> wrote:
>
> Thanks Salim.
> Can you add this to the JIRA/design doc. Also, I would venture to suggest
> that the section on predicate pushdown can be made clearer.
> Also, Since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc. (Perhaps pseudo code and affected classes.)
> Essentially some guarantees provided by the framework will change and this
> may affect (or not) the existing usage. These should be enumerated in the
> design doc.
>
>