[ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360195#comment-16360195 ]
Paul Rogers commented on DRILL-6147:
------------------------------------

Salim says:

Implementation Strategy
- DRILL-6147's mission is to implement batch sizing for the flat Parquet reader with minimal overhead
- This will also help us test this functionality for end-to-end cases (whole query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet
- I will a) enhance the framework to support columnar processing and b) refactor the Parquet code to use the framework
- I agree there might be some duplicate effort, but I really believe this will be minimal
- DRILL-6147 is no more than one week of research & analysis and one week of implementation

Paul's comment:

If you plan to do step 3 (use the result set loader), then efforts to build an alternative solution (steps 1, 2) are needed because...? Could we not put that effort into reviewing the result set loader code so it can be applied to Parquet sooner? Then, if we want to do predicate push-down, apply the extra effort to doing that in Parquet as explained above. Also, invest in unifying the readers so that we have a single reader which is both fast and handles nested structures as needed by Aman's project.

I understand the argument that the work will be minimal. But that is partly because we are only updating the special-case flat reader, and partly (I suspect) because the work seems much easier than it will actually turn out to be. (I'm speaking from experience here, given what I learned when building the result set loader.)

Help me understand why investing in a throw-away solution helps us short term, especially since that solution (average row size) is not quick & easy for readers. That is, what value are we creating for users by building the same feature twice rather than building multiple features once? Why not, in fact, focus efforts on item 4, enhancing the result set loader for columnar processing: adding new value on top of the work already done?

Now, there may well be a good answer. As Parth suggested, let's just spell it out in the design doc so that it is clear that the proposed approach is the lowest-cost way to get to the final solution.

Finally, if we're doing this work primarily to solve a commercial problem, then that is a fine reason to make these changes in a private branch. My focus here is on moving Apache Drill forward for the benefit of the overall user base.

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) when creating scan batches; there is no parameter, nor any logic, for controlling the amount of memory used. This enhancement will allow Drill to take an extra input parameter to control direct memory usage.
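For readers following this thread, here is a minimal sketch of the average-row-size approach under discussion: derive a per-batch row cap from a configurable direct-memory budget and a running estimate of row width, falling back to the hard-coded 32k limit mentioned in the issue description. This is an illustration only; the class and field names (BatchSizer, memoryBudgetBytes) are hypothetical and not Drill's actual code.

{code:java}
// Sketch (not Drill's actual code): cap each scan batch so that
// (row count) x (average observed row width) stays within a
// configurable direct-memory budget.
public class BatchSizer {

  // Legacy hard-coded limit mentioned in the issue description.
  private static final int DEFAULT_MAX_ROWS = 32 * 1024;

  private final long memoryBudgetBytes; // e.g. a new session/system option
  private long totalBytesSeen;
  private long totalRowsSeen;

  public BatchSizer(long memoryBudgetBytes) {
    this.memoryBudgetBytes = memoryBudgetBytes;
  }

  // Called after each batch with observed sizes to refine the estimate.
  public void update(long batchBytes, long batchRows) {
    totalBytesSeen += batchBytes;
    totalRowsSeen += batchRows;
  }

  // Row limit for the next batch, from the running average row width.
  public int nextBatchRowLimit() {
    if (totalRowsSeen == 0) {
      return DEFAULT_MAX_ROWS; // no data yet; fall back to legacy limit
    }
    long avgRowWidth = Math.max(1, totalBytesSeen / totalRowsSeen);
    long rowLimit = memoryBudgetBytes / avgRowWidth;
    return (int) Math.min(DEFAULT_MAX_ROWS, Math.max(1, rowLimit));
  }
}
{code}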
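By contrast, under the result-set-loader approach Paul advocates, the framework meters the actual bytes written and signals the reader when the batch budget is reached, so the reader needs no size estimation of its own. The interface below is a hypothetical illustration of that division of labor, not the framework's real API.

{code:java}
import java.util.Iterator;

// Hypothetical illustration of a loader-style contract: the framework,
// not the reader, owns the batch memory budget.
interface RowBatchLoader {
  void startBatch();
  boolean isFull();            // true once the batch hits its budget
  void writeRow(Object[] row); // framework tracks bytes as values land
  Object harvest();            // hand the completed batch downstream
}

// The reader loop then becomes budget-agnostic:
class ReaderLoop {
  static Object readOneBatch(Iterator<Object[]> rows, RowBatchLoader loader) {
    loader.startBatch();
    while (rows.hasNext() && !loader.isFull()) {
      loader.writeRow(rows.next());
    }
    return loader.harvest();
  }
}
{code}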