Is there a JIRA for this? Would be useful to capture the comments in the JIRA. Note that the document itself is not comment-able as it is shared with view-only permissions.
Some thoughts, in no particular order:

1) The page-based statistical approach is likely to run into trouble with the encodings used for Parquet fields, especially RLE, which drastically changes the size of the field. So pageSize/numValues is going to be wildly inaccurate with RLE.

2) I'm not sure where you were going with the predicate pushdown section and how it pertains to your proposed batch sizing.

3) Assuming that you go with the average batch size calculation approach, are you proposing a Parquet-scan-specific overflow implementation, or are you planning to leverage the ResultSet loader mechanism? If you plan to use the latter, it will need to be enhanced to handle a bulk chunk as opposed to a single value at a time. If you are not using the ResultSet loader mechanism, why not? (You would be reinventing the wheel.)

4) Parquet page-level stats are probably not reliable. You can assume page size (compressed/uncompressed) and value count are accurate, but nothing else.

Also note that memory allocations by Netty greater than the 16MB chunk size are returned to the OS when the memory is freed. Both this document and the original document on memory fragmentation state incorrectly that such memory is not released back to the OS. A quick thought experiment: where does this memory go if it is not released back to the OS?

On Fri, Feb 9, 2018 at 7:12 AM, salim achouche <[email protected]> wrote:

> The following document
> <https://docs.google.com/document/d/1A6zFkjxnC_-9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
> describes a proposal for enforcing batch sizing constraints (count and
> memory) within the Parquet Reader (Flat Schema). Please feel free to take
> a look and provide feedback.
>
> Thanks!
>
> Regards,
> Salim
