[ 
https://issues.apache.org/jira/browse/DRILL-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026333#comment-16026333
 ] 

Jinfeng Ni commented on DRILL-5542:
-----------------------------------

It sounds like the planner should differentiate "*" from implicit columns if 
implicit columns are not part of the list expanded from "*".  In other words, 
if a query uses both "*" and implicit columns, the planner should put both "*" 
and the implicit columns into Scan's column list, since they are different.  

Query semantics say "*" should include only regular columns. But when we put 
"*" into Scan's column list, we essentially change the meaning of "*".  To make 
this work in a more desirable way: 1) the format plugin / storage plugin has to 
inform the planner of the list of implicit columns; 2) the planner rule should 
keep "*" separate from implicit columns (Calcite has a concept of system 
fields/columns, which we could probably use); 3) Scan should return only 
regular columns for "*", and return implicit columns only when explicitly 
requested. 
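
The three steps above can be sketched roughly as follows. This is only an 
illustration in Python, not Drill's planner or Scan code: the function names 
`plan_scan_columns` and `expand_scan_columns` are hypothetical, and the implicit 
column names are the ones listed in the issue description below.

```python
# Hypothetical sketch; none of these names exist in Drill or Calcite.

# Step 1: the format/storage plugin tells the planner which columns are implicit.
IMPLICIT_COLUMNS = ("fqn", "filepath", "filename", "suffix")


def plan_scan_columns(select_items, implicit_columns=IMPLICIT_COLUMNS):
    """Step 2: build Scan's column list, keeping "*" separate from implicit columns."""
    has_star = "*" in select_items
    columns = ["*"] if has_star else []
    for item in select_items:
        if item == "*":
            continue
        # A regular column is already covered by "*"; an implicit column is not.
        if has_star and item not in implicit_columns:
            continue
        if item not in columns:
            columns.append(item)
    return columns


def expand_scan_columns(scan_columns, regular_columns):
    """Step 3: Scan expands "*" to regular columns only; other names are kept as requested."""
    output = []
    for column in scan_columns:
        names = regular_columns if column == "*" else [column]
        for name in names:
            if name not in output:
                output.append(name)
    return output
```

Under these assumptions, `select *` over a file with one regular column `a` 
yields a scan schema of just `a`, while `select *, filename` yields `a` and 
`filename`.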


> Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-5542
>                 URL: https://issues.apache.org/jira/browse/DRILL-5542
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Jinfeng Ni
>
> It seems that Drill adds several implicit columns (`fqn`, `filepath`, 
> `filename`, `suffix`) to ScanBatch even when downstream operators do not 
> require them. Although those implicit columns are dropped later on, they 
> increase both memory and CPU overhead.
> 1. JSON
> ```
> {a: 100}
> ```
> {code}
> select * from dfs.tmp.`1.json`;
> +------+
> |  a   |
> +------+
> | 100  |
> +------+
> {code}
> The schema of ScanRecordBatch:
> {code}
> [ schema:
>     BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), 
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], 
> selectionVector=NONE], 
>  {code}
> 2. Parquet
> {code}
> select * from cp.`tpch/nation.parquet`;
> +--------------+-----------------+--------------+------------------------------------------------------+
> | n_nationkey  |     n_name      | n_regionkey  |                      n_comment                       |
> +--------------+-----------------+--------------+------------------------------------------------------+
> | 0            | ALGERIA         | 0            |  haggle. carefully final deposits detect slyly agai  |
> ...
> {code}
> The schema of ScanRecordBatch:
> {code}
>   schema:
>     BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), 
> n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), 
> fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), 
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], 
> {code}
> 3. Text
> {code}
> cat 1.csv
> a, b, c
> select * from dfs.tmp.`1.csv`;
> +----------------+
> |    columns     |
> +----------------+
> | ["a","b","c"]  |
> +----------------+
> {code}
> Schema of ScanRecordBatch 
> {code}
>   schema:
>     BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], 
> fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), 
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], 
> {code}
> If implicit columns are not part of the result of a `select *` query, then 
> the Scan operator should not populate those implicit columns.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
