Thanks Parth for your feedback! I plan to enhance the document based on the feedback received and on the prototype I am currently working on!
Regards,
Salim

On Sun, Feb 11, 2018 at 2:36 PM, salim achouche <sachouc...@gmail.com> wrote:

> Thanks Paul for your feedback! Let me try to answer some of your questions
> / comments:
>
> *Duplicate Implementation*
> - I am *not* contemplating two different implementations, one for Parquet
>   and another for the rest of the code
> - Instead, I am reacting to the fact that we have two different processing
>   patterns: row-oriented and columnar
> - The goal is to offer both strategies, depending on the operator
>
> *Complex vs. Flat Parquet Readers*
> - The Complex and Flat Parquet readers are quite different
> - I presume that, for the sake of performance, we can enhance our SQL
>   capabilities so that the Flat Parquet reader is invoked more frequently
>
> *Predicate Pushdown*
> - The reason I invoked predicate pushdown within the document is to help
>   the analysis:
>   o Notice how record batch materialization could involve many more pages
>   o A solution that relies mainly on the current set of pages (one per
>     column) might pay a heavy IO price without much to show for it:
>     + It waits for all columns to have at least one page loaded so that
>       upfront stats can be gathered
>     + Batch memory is then divided optimally across columns, and the
>       current batch size is computed
>     + Unfortunately, such logic will fail if more pages are involved than
>       the ones taken into consideration
>   o Example:
>     + Two variable-length columns, c1 and c2
>     + The reader waits for two pages, P1-1 and P2-1, so that we can a)
>       allocate memory optimally across c1 and c2 and b) compute a batch
>       size that will minimize overflow logic
>     + Assume that, because of data-length skew or predicate pushdown, more
>       pages are involved in loading the batch:
>       for c1: {P1-1, P1-2, P1-3, P1-4}; for c2: {P2-1, P2-2}
>     + It is now quite possible that the overflow logic is suboptimal,
>       since only two pages' statistics were considered instead of six
>
> - I have added new logic to the ScanBatch to log (on demand) extra batch
>   statistics, which will help us assess the efficiency of the batch sizing
>   strategy; I will add this information to the document when this sub-task
>   is done
>
> *Implementation Strategy*
> - DRILL-6147's mission is to implement batch sizing for Flat Parquet with
>   *minimal overhead*
> - This will also help us test this functionality for end-to-end cases
>   (whole query)
> - My next task (after DRILL-6147) is to incorporate your framework with
>   Parquet
> - I will a) enhance the framework to support columnar processing and b)
>   refactor the Parquet code to use the framework
> - I agree there might be some duplicated effort, but I really believe it
>   will be minimal
> - DRILL-6147 is no more than one week of research & analysis and one week
>   of implementation
>
> Regards,
> Salim
>
> On Feb 11, 2018, at 1:35 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid
> working at cross purposes. I've outlined the issues, and a possible path
> forward, in a comment on DRILL-6147.
> Quick summary: creating a second batch-sizing implementation just for
> Parquet will be very difficult once we handle all the required use cases
> spelled out in the comment. We'd want to be very sure that we do, indeed,
> want to duplicate this effort before we head down that route. Duplicating
> the effort means repeating all the work done over the last six months to
> make the original result set loader work, plus the future work needed to
> maintain two parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
>
> On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra
> <par...@apache.org> wrote:
>
> Thanks Salim.
> Can you add this to the JIRA/design doc? Also, I would venture to suggest
> that the section on predicate pushdown could be made clearer.
> Also, since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc (perhaps pseudo-code and the affected classes).
> Essentially, some guarantees provided by the framework will change, and
> this may affect (or not) the existing usage. These should be enumerated in
> the design doc.
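
The page-stats batch-sizing idea discussed in the thread, and its failure mode, can be sketched as follows. This is a minimal, hypothetical illustration, not Drill's actual code: the class, method names, and numbers are all invented for the example. It divides a fixed batch memory budget across variable-length columns using the average value width observed in each column's first loaded page (P1-1 and P2-1 in the example above), then shows how width skew in later pages (P1-2..P1-4) invalidates the estimate.

```java
import java.util.Arrays;
import java.util.List;

public class BatchSizer {

    /** Average value width (bytes) observed in a column's first loaded page. */
    static final class PageStats {
        final String column;
        final int avgValueWidth;

        PageStats(String column, int avgValueWidth) {
            this.column = column;
            this.avgValueWidth = avgValueWidth;
        }
    }

    /**
     * Plan the row count for the next batch: the memory budget divided by
     * the estimated row width (the sum of the per-column average widths
     * taken from each column's first page only).
     */
    static int computeBatchRowCount(List<PageStats> firstPages, long memoryBudget) {
        long rowWidth = 0;
        for (PageStats s : firstPages) {
            rowWidth += s.avgValueWidth;
        }
        return (int) (memoryBudget / Math.max(1, rowWidth));
    }

    public static void main(String[] args) {
        // Stats from the first page of each column (P1-1 and P2-1);
        // the widths are made up for illustration.
        List<PageStats> firstPages = Arrays.asList(
            new PageStats("c1", 40),
            new PageStats("c2", 24));
        long budget = 16L * 1024 * 1024;  // 16 MiB batch budget
        int plannedRows = computeBatchRowCount(firstPages, budget);
        System.out.println("planned rows = " + plannedRows);  // 262144

        // The failure mode from the thread: the later pages of c1 carry
        // much wider values, so the actual row width exceeds the estimate
        // and the planned batch overflows the budget.
        long actualRowWidth = 120 + 24;   // skewed c1 width in P1-2..P1-4
        long actualBytes = (long) plannedRows * actualRowWidth;
        System.out.println("overflow = " + (actualBytes > budget));  // true
    }
}
```

With the skewed widths, the batch planned from two pages' statistics lands at more than double the budget, which is the situation the overflow logic (and the extra ScanBatch statistics) is meant to surface.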