[ https://issues.apache.org/jira/browse/DRILL-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026333#comment-16026333 ]
Jinfeng Ni commented on DRILL-5542:
-----------------------------------

Sounds like the planner should differentiate "*" from implicit columns, if implicit columns are not part of the list expanded from "*". In other words, if a query uses both "*" and implicit columns, the planner should put both "*" and the implicit columns into Scan's column list, since they are different.

Query semantics say "*" should only include regular columns. But when we put "*" into Scan's column list, we essentially change the meaning of "*".

To make it work in a more desirable way:
1) the format plugin / storage plugin has to inform the planner of the list of implicit columns,
2) the planner rule should keep "*" separate from implicit columns (Calcite has a concept of system fields/columns; we could probably use that),
3) Scan should return only regular columns for "*", and return implicit columns only when explicitly requested.

> Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-5542
>                 URL: https://issues.apache.org/jira/browse/DRILL-5542
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Jinfeng Ni
>
> It seems that Drill adds several implicit columns (`fqn`, `filepath`,
> `filename`, `suffix`) to ScanBatch even when they are not required by
> downstream operators. Although those implicit columns are dropped
> later on, populating them increases both memory and CPU overhead.
>
> 1. JSON
> ```
> {a: 100}
> ```
> {code}
> select * from dfs.tmp.`1.json`;
> +------+
> | a    |
> +------+
> | 100  |
> +------+
> {code}
> The schema from ScanRecordBatch is:
> {code}
> [ schema:
> BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL),
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)],
> selectionVector=NONE],
> {code}
>
> 2. Parquet
> {code}
> select * from cp.`tpch/nation.parquet`;
> +--------------+-----------------+--------------+----------------------------------------------------+
> | n_nationkey  | n_name          | n_regionkey  | n_comment                                          |
> +--------------+-----------------+--------------+----------------------------------------------------+
> | 0            | ALGERIA         | 0            | haggle. carefully final deposits detect slyly agai |
> ...
> {code}
> The schema of ScanRecordBatch:
> {code}
> schema:
> BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED),
> n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED),
> fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL),
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
> {code}
>
> 3. Text
> {code}
> cat 1.csv
> a, b, c
> select * from dfs.tmp.`1.csv`;
> +----------------+
> | columns        |
> +----------------+
> | ["a","b","c"]  |
> +----------------+
> {code}
> Schema of ScanRecordBatch:
> {code}
> schema:
> BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)],
> fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL),
> filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
> {code}
>
> If implicit columns are not part of the query result of `select *`, then
> the Scan operator should not populate them.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
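
As a rough illustration of the three-step proposal in the comment above, the sketch below resolves a projection list into the columns a scan would emit. It is not Drill or Calcite code: the class `ScanProjectionSketch`, the method `scanColumns`, and the hard-coded `IMPLICIT` set are hypothetical names made up for this example. The implicit column names come from the issue text; the set stands in for whatever list a format or storage plugin would report to the planner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: none of these names are Drill or Calcite APIs.
final class ScanProjectionSketch {

    // Step 1 (assumed): the format/storage plugin reports its implicit columns.
    // The names are the ones listed in DRILL-5542.
    static final Set<String> IMPLICIT =
        new LinkedHashSet<>(Arrays.asList("fqn", "filepath", "filename", "suffix"));

    // Steps 2 and 3: keep "*" apart from implicit columns, expand "*" to the
    // regular file columns only, and emit an implicit column only when the
    // query names it explicitly.
    static List<String> scanColumns(List<String> projected, List<String> fileColumns) {
        Set<String> regular = new LinkedHashSet<>();
        Set<String> implicit = new LinkedHashSet<>();
        for (String col : projected) {
            if (col.equals("*")) {
                regular.addAll(fileColumns);   // "*" means regular columns only
            } else if (IMPLICIT.contains(col)) {
                implicit.add(col);             // explicitly requested implicit column
            } else {
                regular.add(col);              // explicitly requested regular column
            }
        }
        // Ordering (regular first, then implicit) is an arbitrary choice here.
        List<String> out = new ArrayList<>(regular);
        out.addAll(implicit);
        return out;
    }

    public static void main(String[] args) {
        List<String> fileColumns = Arrays.asList("a");
        // select * ...            -> [a]
        System.out.println(scanColumns(Arrays.asList("*"), fileColumns));
        // select *, filename ...  -> [a, filename]
        System.out.println(scanColumns(Arrays.asList("*", "filename"), fileColumns));
    }
}
```

Under these assumptions, the JSON case from the issue (`select *` over a file with the single column `a`) would yield a scan schema with only `a`, rather than `a` plus the four implicit columns. The sketch ignores the case where a file itself contains a column named like an implicit one.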