Thanks, Paul, for your feedback! Let me try to answer some of your questions /
comments:
Duplicate Implementation
- I am not contemplating two different implementations, one for Parquet and
another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing
patterns: row-oriented and columnar
- The goal is to offer both strategies, depending on the operator (see the
sketch below)
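For illustration, here is a minimal sketch of what offering both strategies
behind a single interface could look like. The names are hypothetical, not
actual Drill classes; it is only meant to show the shape of the idea, with
each operator picking the strategy that matches how it consumes data:

    // Hypothetical sketch: illustrative names only, not actual Drill classes.
    public interface BatchSizingStrategy {
      /** Target row count for the next batch under the given memory budget. */
      int targetRowCount(long memoryBudgetBytes);
    }

    /** Row-oriented operators: estimate from a running average row width. */
    class RowOrientedSizing implements BatchSizingStrategy {
      private final int avgRowWidthBytes;

      RowOrientedSizing(int avgRowWidthBytes) {
        this.avgRowWidthBytes = avgRowWidthBytes;
      }

      @Override
      public int targetRowCount(long memoryBudgetBytes) {
        return (int) (memoryBudgetBytes / Math.max(1, avgRowWidthBytes));
      }
    }

    /** Columnar operators (e.g. the Flat Parquet reader): estimate from
     *  per-column widths so memory can be divided across column vectors. */
    class ColumnarSizing implements BatchSizingStrategy {
      private final int[] avgColumnWidthsBytes;

      ColumnarSizing(int[] avgColumnWidthsBytes) {
        this.avgColumnWidthsBytes = avgColumnWidthsBytes.clone();
      }

      @Override
      public int targetRowCount(long memoryBudgetBytes) {
        long rowWidth = 0;
        for (int w : avgColumnWidthsBytes) {
          rowWidth += w;
        }
        return (int) (memoryBudgetBytes / Math.max(1, rowWidth));
      }
    }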
Complex vs. Flat Parquet Readers
- The Complex and Flat Parquet readers are quite different
- I presume, for the sake of performance, we can enhance our SQL capabilities
so that the Flat Parquet reader can be invoked more frequently
Predicate Pushdown
- The reason I brought up Predicate Pushdown within the document is to help
the analysis:
o Notice how Record Batch materialization could involve many more pages
o A solution that relies mainly on the current set of pages (one per column)
might pay a heavy IO price without much to show for it:
+ The reader waits for all columns to have at least one page loaded so that
upfront stats can be gathered
+ Batch memory is then divided optimally across columns and the current
batch size is computed
+ Unfortunately, such logic will fail if more pages are involved than the
ones taken into consideration
o Example:
+ Two variable length columns c1 and c2
+ The reader waits for two pages, P1-1 and P2-1, so that we can a) allocate
memory optimally across c1 and c2 and b) compute a batch size that will
minimize overflow handling
+ Assume that, because of data-length skew or predicate pushdown, more
pages are involved in loading the batch
+ For c1: {P1-1, P1-2, P1-3, P1-4}; for c2: {P2-1, P2-2}
+ It is now quite possible that the overflow logic will not be optimal,
since the statistics of only two pages were considered instead of six (see
the sketch after this list)
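To make the failure mode concrete, here is a minimal, simplified sketch
(hypothetical names, not actual Drill code) of the sizing computation; it
shows how a row-count target derived from the first page of each column can
be far off when later pages carry wider values:

    // Hypothetical, simplified sketch of the sizing logic described above.
    class PageStatsSizing {
      // pageAvgValueWidths[i] = average value width (bytes) observed in the
      // first loaded page of column i (e.g. P1-1 for c1, P2-1 for c2).
      static int computeBatchRowCount(long batchMemoryBytes,
                                      int[] pageAvgValueWidths) {
        long estimatedRowWidth = 0;
        for (int width : pageAvgValueWidths) {
          estimatedRowWidth += width;
        }
        // Divide the batch memory budget by the estimated row width.
        return (int) (batchMemoryBytes / Math.max(1, estimatedRowWidth));
      }

      public static void main(String[] args) {
        // First-page stats for c1 and c2 suggest 40-byte values each, so a
        // 16 MB budget targets ~209,715 rows.
        int target = computeBatchRowCount(16L * 1024 * 1024, new int[] {40, 40});
        System.out.println("target rows = " + target);
        // If the later pages of c1 (P1-2..P1-4) actually average 200-byte
        // values, the true row width is ~240 bytes and the batch only fits
        // ~69,905 rows; everything beyond that lands in the overflow logic,
        // which was tuned from the first-page stats alone.
      }
    }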
- I have added new logic to the ScanBatch to log (on demand) extra batch
statistics, which will help us assess the efficiency of the batch sizing
strategy; I will add this information to the document when this sub-task is
done (a sketch follows)
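As a rough illustration of what I mean (hypothetical names; the actual
ScanBatch change may look different), the kind of on-demand statistics I
have in mind is:

    // Hypothetical sketch of per-batch statistics worth logging on demand;
    // class and field names are illustrative, not the actual ScanBatch code.
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class BatchSizingStats {
      private static final Logger logger =
          LoggerFactory.getLogger(BatchSizingStats.class);

      long allocatedBytes;   // memory reserved for the batch
      long usedBytes;        // memory actually consumed by the value vectors
      int rowCount;          // rows materialized in the batch
      int overflowRowCount;  // rows that had to be moved to an overflow batch

      void logOnDemand() {
        if (logger.isDebugEnabled()) {
          logger.debug(
              "batch stats: rows={}, avgRowWidth={}B, used/allocated={}/{}, "
                  + "overflow rows={}",
              rowCount,
              rowCount == 0 ? 0 : usedBytes / rowCount,
              usedBytes, allocatedBytes, overflowRowCount);
        }
      }
    }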
Implementation Strategy
- The mission of DRILL-6147 is to implement batch sizing for Flat Parquet
with minimal overhead
- This will also help us test this functionality for end-to-end cases (whole
query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet
- I will a) enhance the framework to support columnar processing and b)
refactor the Parquet code to use the framework
- I agree there might be some duplicated effort, but I really believe it will
be minimal
- DRILL-6147 amounts to no more than one week of research & analysis and one
week of implementation
Regards,
Salim
> On Feb 11, 2018, at 1:35 PM, Paul Rogers <[email protected]> wrote:
>
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid
> working at cross purposes. I've outlined the issues, and a possible path
> forward, in a comment to DRILL-6147.
> Quick summary: creating a second batch size implementation just for Parquet
> will be very difficult once we handle all the required use cases as spelled
> out in the comment. We'd want to be very sure that we do, indeed, want to
> duplicate this effort before we head down that route. Duplicating the effort
> means repeating all the work done over the last six months to make the
> original result set loader work, and the future work needed to maintain two
> parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
>
> On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra
> <[email protected]> wrote:
>
> Thanks Salim.
> Can you add this to the JIRA/design doc. Also, I would venture to suggest
> that the section on predicate pushdown can be made clearer.
> Also, Since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc. (Perhaps pseudo code and affected classes.)
> Essentially some guarantees provided by the framework will change and this
> may affect (or not) the existing usage. These should be enumerated in the
> design doc.
>
>