Thanks Paul for your feedback! Let me try to answer some of your questions/comments:
Duplicate Implementation
- I am not contemplating two different implementations, one for Parquet and another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing patterns, row oriented and columnar
- The goal is to offer both strategies depending on the operator

Complex vs Flat Parquet Readers
- The Complex and Flat Parquet readers are quite different
- I presume, for the sake of performance, we can enhance our SQL capabilities so that the Flat Parquet reader can be invoked more frequently

Predicate Pushdown
- The reason I invoked predicate pushdown within the document is to help the analysis:
  o Notice how record batch materialization could involve many more pages
  o A solution that relies mainly on the current set of pages (one per column) might pay a heavy IO price without much to show for:
    + It waits for all columns to have at least one page loaded so that upfront statistics can be gathered
    + Batch memory is then divided optimally across columns and the current batch size is computed
    + Unfortunately, such logic will fail if more pages are involved than the ones taken into consideration
  o Example:
    + Two variable-length columns c1 and c2
    + The reader waits for two pages, P1-1 and P2-1, so that we a) allocate memory optimally across c1 and c2 and b) compute a batch size that will minimize overflow logic
    + Assume, because of data-length skew or predicate pushdown, that more pages are involved in loading the batch, e.g. for c1: {P1-1, P1-2, P1-3, P1-4}, c2: {P2-1, P2-2}
    + It is now quite possible that the overflow logic will not be optimal, since only two pages' statistics were considered instead of six
- I have added new logic to the ScanBatch so as to log (on demand) extra batch statistics, which will help us assess the efficiency of the batch sizing strategy; I will add this information to the document when this sub-task is done

Implementation Strategy
- The mission of DRILL-6147 is to implement batch sizing for Flat Parquet with minimal overhead
- This
will also help us test this functionality for end-to-end cases (whole query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet
- I will a) enhance the framework to support columnar processing and b) refactor the Parquet code to use the framework
- I agree there might be some duplicate effort, but I really believe this will be minimal; DRILL-6147 is no more than one week of research & analysis and one week of implementation

Regards,
Salim

> On Feb 11, 2018, at 1:35 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid
> working at cross purposes. I've outlined the issues, and a possible path
> forward, in a comment to DRILL-6147.
> Quick summary: creating a second batch size implementation just for Parquet
> will be very difficult once we handle all the required use cases as spelled
> out in the comment. We'd want to be very sure that we do, indeed, want to
> duplicate this effort before we head down that route. Duplicating the effort
> means repeating all the work done over the last six months to make the
> original result set loader work, and the future work needed to maintain two
> parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
>
> On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra
> <par...@apache.org> wrote:
>
> Thanks Salim.
> Can you add this to the JIRA/design doc. Also, I would venture to suggest
> that the section on predicate pushdown can be made clearer.
> Also, since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc. (Perhaps pseudo code and affected classes.)
> Essentially some guarantees provided by the framework will change and this
> may affect (or not) the existing usage. These should be enumerated in the
> design doc.
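
P.S. To make the c1/c2 example concrete, below is a small, hypothetical sketch (not actual Drill code; all class and field names are made up) of the failure mode described above: a batch row count is computed from the first page of each column, and later, skewed pages for c1 make the planned batch exceed its memory budget, forcing overflow handling.

```java
// Hypothetical illustration of the batch-sizing issue; not Drill code.
public class BatchSizeSketch {

    /** Per-page statistics: number of values and total bytes of value data. */
    static final class PageStats {
        final int numValues;
        final long totalBytes;

        PageStats(int numValues, long totalBytes) {
            this.numValues = numValues;
            this.totalBytes = totalBytes;
        }

        double avgValueWidth() {
            return (double) totalBytes / numValues;
        }
    }

    /**
     * Compute a batch row count from the first loaded page of each column:
     * divide the memory budget by the estimated bytes per row.
     */
    static int batchSizeFromFirstPages(long memoryBudget, PageStats[] firstPages) {
        double bytesPerRow = 0;
        for (PageStats p : firstPages) {
            bytesPerRow += p.avgValueWidth();
        }
        return (int) (memoryBudget / bytesPerRow);
    }

    public static void main(String[] args) {
        long budget = 1024 * 1024; // 1 MB batch memory budget

        // Upfront stats from the first page of each column (P1-1 and P2-1),
        // both averaging 8 bytes per value.
        PageStats p11 = new PageStats(10_000, 80_000);
        PageStats p21 = new PageStats(10_000, 80_000);
        int plannedRows = batchSizeFromFirstPages(budget, new PageStats[] {p11, p21});
        System.out.println("planned rows = " + plannedRows);

        // Later pages of c1 (P1-2..P1-4) are skewed: 32 bytes per value on average.
        // Memory actually needed for the planned row count exceeds the budget,
        // so the reader must fall back on overflow handling.
        double actualC1Width = 32.0;
        double actualC2Width = 8.0;
        long actualBytes = (long) (plannedRows * (actualC1Width + actualC2Width));
        System.out.println("actual bytes = " + actualBytes
            + " (budget = " + budget + ", overflow = " + (actualBytes > budget) + ")");
    }
}
```

The point of the sketch is only that the estimate is as good as the pages it saw: with six pages involved instead of two, the planned row count can overshoot the budget by the skew factor.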