[ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360192#comment-16360192 ]
Paul Rogers commented on DRILL-6147:
------------------------------------

Salim says:

Duplicate Implementation
- I am not contemplating two different implementations; one for Parquet and another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing patterns: Row Oriented and Columnar
- The goal is to offer both strategies depending on the operator

Paul's response:

Drill is columnar. But batches must be collections of rows (all vectors must have the same row count). How we fill the batch may sometimes be row-wise (CSV), sometimes columnar (Parquet). Even operators such as SVR could be columnar. That is, in the SVR, we could compress out unwanted rows column-by-column, which is likely much more CPU-cache friendly than what we are doing now.

The point is: Drill is like photons. It has a row/column duality and morphs between the two depending on the context.

If we create a separate solution for the columnar read pattern, we must handle the entire stack: writing to vectors, controlling vector sizes, handling overflow, and the rest. Doing so is, by definition, a separate implementation. It may seem like the new version is simple, but that is only because you've not yet had the pleasure of working with the complicated use cases such as deeply nested structures. Trust me: the simple flat case is simple. Beyond that, things get very complex indeed.

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows)
> when creating scan batches; there is no parameter nor any logic for
> controlling the amount of memory used. This enhancement will allow Drill to
> take an extra input parameter to control direct memory usage.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
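
To make the column-by-column idea from the comment above concrete, here is a minimal sketch of compacting a batch through a selection vector one column at a time. It is illustrative only: plain int arrays stand in for value vectors, and the names ColumnarCompaction, compactColumn and compactBatch are hypothetical, not Drill classes or APIs. Drill's actual SVR operates on its own vector types and must also handle nulls, variable-width data and nested structures.

```java
import java.util.Arrays;

// Illustrative sketch only: plain arrays stand in for value vectors.
// None of these names are Drill classes or methods.
public class ColumnarCompaction {

    // Copy the selected entries of ONE column into a dense array.
    // The inner loop touches a single column, so reads and writes
    // stay within one contiguous region of memory at a time.
    public static int[] compactColumn(int[] column, int[] selection) {
        int[] out = new int[selection.length];
        for (int i = 0; i < selection.length; i++) {
            out[i] = column[selection[i]];
        }
        return out;
    }

    // Compact a whole batch by processing the columns one after another.
    // Every output column ends up with the same row count
    // (selection.length), so the result is still a collection of rows.
    public static int[][] compactBatch(int[][] columns, int[] selection) {
        int[][] out = new int[columns.length][];
        for (int c = 0; c < columns.length; c++) {
            out[c] = compactColumn(columns[c], selection);
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] batch = { {10, 11, 12, 13}, {20, 21, 22, 23} }; // 2 columns, 4 rows
        int[] selection = {0, 2, 3};                            // rows that survive the filter
        int[][] dense = compactBatch(batch, selection);
        System.out.println(Arrays.deepToString(dense)); // prints [[10, 12, 13], [20, 22, 23]]
    }
}
```

A row-wise version would instead loop over the selected rows on the outside and visit every column for each row. The result is identical; only the memory access order differs, which is the cache-friendliness point made in the comment.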