[ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360195#comment-16360195 ]

Paul Rogers commented on DRILL-6147:
------------------------------------

Salim says:

Implementation Strategy
- DRILL-6147's mission is to implement batch sizing for Flat Parquet with 
minimal overhead
- This will also help us test this functionality for end-to-end cases (whole 
query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet
- I will a) enhance the framework to support columnar processing and b) 
refactor the Parquet code to use the framework
- I agree there might be some duplicate effort, but I really believe this will 
be minimal
- DRILL-6147 is no more than one week of research & analysis and one week of 
implementation

Paul's comment:

If you plan to do step 3 (use the result set loader), then why are efforts to 
build an alternative solution (steps 1, 2) needed? Could we not put that 
effort into reviewing the result set loader code so it can be applied to 
Parquet sooner? Then, if we want to do predicate push-down, put the extra 
effort into doing that in Parquet as explained above. Also, invest in unifying 
the readers so that we have a single reader that is both fast and able to 
handle nested structures, as needed by Aman's project.
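
For context on the contrast drawn below: a result-set-loader-style writer 
bounds batches by charging memory as each value is written, rather than from 
an up-front estimate. A minimal sketch of that idea (illustrative names only, 
not Drill's actual ResultSetLoader API):

    // Memory-aware batch sizing in the style of the result set loader:
    // memory is charged as each value is written, so no per-reader
    // row-size estimate is required. Illustrative names only.
    public class IncrementalBatchWriter {
      private final long memoryBudgetBytes; // per-batch direct memory budget
      private long bytesUsed;               // bytes written so far
      private int rowCount;                 // completed rows in this batch

      public IncrementalBatchWriter(long memoryBudgetBytes) {
        this.memoryBudgetBytes = memoryBudgetBytes;
      }

      // Charge one value's width against the budget; callers check
      // isFull() at row boundaries to decide when to ship the batch.
      public void writeValue(int valueWidthBytes) {
        bytesUsed += valueWidthBytes;
      }

      public void endRow() { rowCount++; }

      public boolean isFull() { return bytesUsed >= memoryBudgetBytes; }

      public int rowCount() { return rowCount; }
    }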

I understand the argument that the work will be minimal. But that is partly 
because we are only updating the special-case flat reader, and partly (I 
suspect) because the work seems much easier than it will actually turn out to 
be. (I'm speaking from experience here, given what I learned when building the 
result set loader.)

Help me understand how investing in a throw-away solution helps us in the 
short term, especially since that solution (average row size) is not quick & 
easy for readers. That is, what value are we creating for users by building 
the same feature twice rather than building multiple features once? Why not, 
in fact, focus efforts on item 4, enhancing the result set loader for columnar 
processing, and add new value on top of the work already done?
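
For comparison, the average-row-size approach amounts to deriving a fixed row 
limit per batch up front, roughly as in this sketch (hypothetical names; not 
the actual DRILL-6147 patch):

    // "Average row size" batch sizing: derive a per-batch row limit from
    // a direct-memory budget and an estimated average row width (e.g.
    // from Parquet column metadata or sampling). Hypothetical names; not
    // the actual DRILL-6147 implementation.
    public final class FlatParquetBatchSizer {
      private static final int MAX_ROWS = 32 * 1024; // historical hard-coded cap

      public static int rowLimit(long memoryBudgetBytes, long avgRowWidthBytes) {
        if (avgRowWidthBytes <= 0) {
          return MAX_ROWS; // no estimate available; keep the old behavior
        }
        long rows = memoryBudgetBytes / avgRowWidthBytes;
        return (int) Math.max(1L, Math.min(rows, MAX_ROWS));
      }
    }

For variable-width columns the average width must itself be estimated per 
reader, which is where the "not quick & easy" cost mentioned above shows up.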

Now, there may well be a good answer. As Parth suggested, let's just spell it 
out in the design doc so that it is clear that the proposed approach is the 
lowest-cost way to get to the final solution.

Finally, if we're doing this work primarily to solve a commercial problem, then 
that is a fine reason to make these changes in a private branch. My focus here 
is on moving Apache Drill forward for the benefit of the overall user base.

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) 
> when creating scan batches; there is no parameter nor any logic for 
> controlling the amount of memory used. This enhancement will allow Drill to 
> take an extra input parameter to control direct memory usage.


