[ 
https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360193#comment-16360193
 ] 

Paul Rogers commented on DRILL-6147:
------------------------------------

Salim says:

 Complex Vs Flat Parquet Readers
- The Complex and Flat Parquet readers are quite different
- I presume, for the sake of performance, we can enhance our SQL capabilities 
so that the Flat Parquet reader can be invoked more frequently

Paul's comment:

This seems to rest on an assumption. Drill's benchmarks use "classic" flat
rows, and the proposal is to invest effort in a parallel structure just to
make benchmarks on artificial data go fast. Meanwhile, Aman is building a
solution on the premise that actual users need complex structures. (Big data
is often stored denormalized.) So, we would be optimizing a reader used
primarily for benchmarks.

The question is more general: rather than having two readers, why not have one 
which handles both simple and nested types? Why not make that go fast? Why not, 
in doing so, reuse the effort already invested in the result set loader to do 
vector writing, batch size control, projection and the rest? Then, why not
invest our optimization efforts into improving the resulting solution so that
all operators benefit from the improvements?

Doing one-off solutions for each reader and operator will be prohibitively 
expensive and a nightmare to maintain, IMHO.

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows)
> when creating scan batches; there is neither a parameter nor any logic for
> controlling the amount of memory used. This enhancement will allow Drill to
> take an extra input parameter to control direct memory usage.
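
To make the issue description above concrete, here is a minimal sketch of how a per-batch row limit could be derived from a memory budget rather than from a fixed row count alone. This is not Drill's implementation: the class BatchSizer, the method rowsPerBatch and both of its parameters are hypothetical names made up for illustration, and "32k rows" is assumed to mean 32 * 1024.

```java
// Hypothetical helper, not part of Drill: derives a per-batch row limit
// from a memory budget instead of relying only on a fixed row count.
final class BatchSizer {
    // Assumption: "32k rows" in the issue text is read as 32 * 1024.
    static final int MAX_ROWS_PER_BATCH = 32 * 1024;

    // Returns how many rows fit in the budget, never more than the
    // row-count cap and never fewer than one row.
    static int rowsPerBatch(long memoryBudgetBytes, long estimatedRowWidthBytes) {
        if (memoryBudgetBytes <= 0 || estimatedRowWidthBytes <= 0) {
            throw new IllegalArgumentException("budget and row width must be positive");
        }
        long rowsThatFit = memoryBudgetBytes / estimatedRowWidthBytes;
        long clamped = Math.max(1L, Math.min(rowsThatFit, MAX_ROWS_PER_BATCH));
        return (int) clamped;
    }
}
```

With a 16 MiB budget, narrow rows still hit the row-count cap, while wide rows (say 4 KiB each) are limited by memory instead. The floor of one row is a design choice in this sketch so that a single oversized row can still make progress.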



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
