[ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360192#comment-16360192 ]

Paul Rogers commented on DRILL-6147:
------------------------------------

Salim says:

Duplicate Implementation
- I am not contemplating two different implementations: one for Parquet and 
another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing 
patterns: row-oriented and columnar
- The goal is to offer both strategies, depending on the operator

Paul's response:

Drill is columnar. But batches must be collections of rows (all vectors must 
have the same row count). How we fill the batch may sometimes be row-wise 
(CSV), sometimes columnar (Parquet). Even operators such as the SVR (selection 
vector remover) could be columnar: in the SVR, we could compress out unwanted 
rows column-by-column, which is likely much more CPU-cache friendly than what 
we do now. The point is: Drill is like photons: it has a row/column duality and 
morphs between the two depending on the context.
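
To make the SVR point concrete, here is a rough sketch of a column-wise copy 
that compresses out unselected rows. This is illustrative only, not Drill's 
actual SelectionVectorRemover code; copyEntry stands in for whatever 
per-vector copy mechanism is actually used.

    import java.util.List;
    import org.apache.drill.exec.record.selection.SelectionVector2;
    import org.apache.drill.exec.vector.ValueVector;

    // Illustrative only: compress out unselected rows one column at a
    // time, so each vector's buffer stays hot in the CPU cache.
    void copyColumnWise(SelectionVector2 sv2, int rowCount,
                        List<ValueVector> in, List<ValueVector> out) {
      for (int col = 0; col < in.size(); col++) {
        ValueVector from = in.get(col);
        ValueVector to = out.get(col);
        for (int row = 0; row < rowCount; row++) {
          // copyEntry is a stand-in for the real per-vector copy
          to.copyEntry(row, from, sv2.getIndex(row));
        }
      }
    }

The row-wise alternative swaps the two loops, touching every vector on every 
row; the column-wise order trades that for one sequential pass per buffer.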

If we create a separate solution for the columnar read pattern, we must handle 
the entire stack: writing to vectors, controlling vector sizes, handling 
overflow, and the rest. Doing so is, by definition, a separate implementation. 
It may seem like the new version is simple, but that is only because you have 
not yet had the pleasure of working with the complicated use cases, such as 
deeply nested structures. Trust me: the simple flat case is simple. Beyond 
that, things get very complex indeed.
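
As one example of what hides inside "handling overflow": a writer filling a 
variable-width vector can discover mid-row that a value will not fit, and 
must then move the partial row into fresh vectors before retrying. A 
simplified sketch; every name here (buffer, bytesUsed, rollOverRow, 
vectorSizeLimit) is invented for illustration:

    // Simplified sketch of mid-row overflow handling in a columnar writer.
    class VarWidthWriterSketch {
      byte[] buffer = new byte[256 * 1024];   // stand-in for a direct buffer
      int bytesUsed;
      final int vectorSizeLimit = buffer.length;

      // Returns false when the value would overflow the current vector;
      // the caller must then restart the row in the new batch.
      boolean writeBytes(int rowIndex, byte[] value) {
        if (bytesUsed + value.length > vectorSizeLimit) {
          rollOverRow(rowIndex);
          return false;
        }
        System.arraycopy(value, 0, buffer, bytesUsed, value.length);
        bytesUsed += value.length;
        return true;
      }

      void rollOverRow(int rowIndex) {
        // Copy the partial row into new vectors, then reset bytesUsed.
      }
    }

Multiply that by nullable, repeated, and nested types, and the "simple" 
columnar path stops being simple.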

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) 
> when creating scan batches; there is no parameter, nor any logic, for 
> controlling the amount of memory used. This enhancement will allow Drill to 
> take an extra input parameter to control direct memory usage.
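
For reference, the arithmetic such a parameter implies is straightforward. A 
hypothetical sketch: the 32k constant is the current hard cap from the issue 
description, while the method and parameter names are invented:

    // Hypothetical: cap the batch row count by a direct-memory budget.
    // With a 16 MB budget and 512-byte rows this yields 32768 rows;
    // at 2048 bytes per row the cap drops to 8192 rows.
    int rowLimit(long memoryBudgetBytes, int estimatedRowWidthBytes) {
      long byBudget = memoryBudgetBytes / Math.max(1, estimatedRowWidthBytes);
      return (int) Math.min(32 * 1024, byBudget);
    }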



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
