John Omernik created DRILL-4758:
-----------------------------------

             Summary: Option for Lazy/Late Materialization of columns during query with Parquet
                 Key: DRILL-4758
                 URL: https://issues.apache.org/jira/browse/DRILL-4758
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.6.0
            Reporter: John Omernik


On tables stored as Parquet with lots of columns, it appears that all columns 
requested in the select statement are materialized for every row, regardless of 
the where clause filter. 

For example, on a table with 100 columns, the query

select field1 from table where id = 123 and client BETWEEN 10 and 100 

scans a large amount of data (2 TB), finishes in 30 seconds, and returns no rows. 

However, 

select * from table where id = 123 and client BETWEEN 10 and 100 

will take 15 minutes to run on the same amount of data, while still returning 
no rows.  

If an option (perhaps it should be the default) were present to materialize the 
remaining columns only for rows that match the filter, it would provide a huge 
boon to performance. 
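
To make the idea concrete, here is a minimal sketch in Python of eager versus 
late materialization over an in-memory columnar table. It is only an 
illustration and not how Drill's Parquet reader is implemented: the names 
CountingColumn, eager_scan, late_scan, make_table and total_reads are all 
hypothetical, and the count of values read is a stand-in for decode cost. 

```python
from typing import Callable, Dict, List


class CountingColumn:
    """A column that counts value reads (a stand-in for decode cost)."""

    def __init__(self, values: List[int]) -> None:
        self.values = values
        self.reads = 0

    def get(self, i: int) -> int:
        self.reads += 1
        return self.values[i]


Table = Dict[str, CountingColumn]
Row = Dict[str, int]


def eager_scan(table: Table, num_rows: int, projected: List[str],
               filter_cols: List[str],
               predicate: Callable[[Row], bool]) -> List[Row]:
    """Read every needed column for every row, then apply the filter."""
    needed = list(dict.fromkeys(filter_cols + projected))  # de-duplicated
    out = []
    for i in range(num_rows):
        row = {name: table[name].get(i) for name in needed}
        if predicate(row):
            out.append({name: row[name] for name in projected})
    return out


def late_scan(table: Table, num_rows: int, projected: List[str],
              filter_cols: List[str],
              predicate: Callable[[Row], bool]) -> List[Row]:
    """Read only the filter columns first, then the rest for matching rows."""
    # Phase 1: evaluate the filter using the filter columns alone.
    matches = []
    for i in range(num_rows):
        key = {name: table[name].get(i) for name in filter_cols}
        if predicate(key):
            matches.append((i, key))
    # Phase 2: read the remaining projected columns for matching rows only.
    out = []
    for i, key in matches:
        out.append({name: key[name] if name in key else table[name].get(i)
                    for name in projected})
    return out


def make_table(num_rows: int, num_cols: int) -> Table:
    """Build a toy table: id, client, and (num_cols - 2) filler columns."""
    table = {
        "id": CountingColumn(list(range(num_rows))),
        "client": CountingColumn([i % 200 for i in range(num_rows)]),
    }
    for c in range(num_cols - 2):
        table[f"field{c + 1}"] = CountingColumn(
            [i * (c + 1) for i in range(num_rows)])
    return table


def total_reads(table: Table) -> int:
    """Total number of values read across all columns."""
    return sum(col.reads for col in table.values())
```

With a 100-column table and a filter that matches nothing, eager_scan reads 
every column of every row, while late_scan reads only the two filter columns. 
That is the kind of gap between the two queries above, although real costs 
depend on Parquet page layout, encoding and I/O rather than a simple read count. 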

If the extra step turns out to hurt tables with a small number of columns, one 
option would be to use table options (select with options) so that queries to 
certain tables enable this behavior and queries to other tables do not.  This is 
up for discussion, but I think the first step is to discuss how something like 
this could be achieved.  The Impala project is also looking at this for Parquet 
files (IMPALA-2017). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
