Jinfeng Ni created DRILL-5542:
---------------------------------

             Summary: Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
                 Key: DRILL-5542
                 URL: https://issues.apache.org/jira/browse/DRILL-5542
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
            Reporter: Jinfeng Ni

It seems that Drill adds several implicit columns (`fqn`, `filepath`, `filename`, `suffix`) to ScanBatch even when no downstream operator requires them. Although these implicit columns are dropped later on, populating them increases both memory and CPU overhead.

1. JSON

```
{a: 100}
```

{code}
select * from dfs.tmp.`1.json`;
+------+
| a    |
+------+
| 100  |
+------+
{code}

The schema from ScanRecordBatch is:
{code}
[ schema: BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE],
{code}

2. Parquet

{code}
select * from cp.`tpch/nation.parquet`;
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| n_nationkey  | n_name          | n_regionkey  | n_comment                                                                                                           |
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| 0            | ALGERIA         | 0            | haggle. carefully final deposits detect slyly agai                                                                  |
...
{code}

The schema of ScanRecordBatch:
{code}
schema: BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
{code}

3. Text

{code}
cat 1.csv
a, b, c

select * from dfs.tmp.`1.csv`;
+----------------+
| columns        |
+----------------+
| ["a","b","c"]  |
+----------------+
{code}

Schema of ScanRecordBatch:
{code}
schema: BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
{code}

In all three cases the implicit columns appear in the scan's batch schema but not in the query result. If implicit columns are not part of the result of a `select *` query, then the Scan operator should not populate them.
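
As a rough illustration of the expected behavior (not Drill's actual implementation), the following self-contained Java sketch works out from the projection list which implicit columns a scan would need to populate. The class and method names are hypothetical; only the four implicit column names come from this report. It assumes that `*` does not expand to implicit columns, which matches the query results shown above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Drill's code base.
class ImplicitColumnPruning {

    // Implicit column names as listed in this report.
    static final List<String> IMPLICIT_COLUMNS =
            Arrays.asList("fqn", "filepath", "filename", "suffix");

    // Returns only the implicit columns that the query names explicitly.
    // A bare "*" names none of them, so the result is empty for "select *".
    static Set<String> implicitColumnsToPopulate(List<String> projectedColumns) {
        Set<String> needed = new LinkedHashSet<>();
        for (String column : projectedColumns) {
            String name = column.toLowerCase();
            if (IMPLICIT_COLUMNS.contains(name)) {
                needed.add(name);
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        // select * ...            -> no implicit columns required
        System.out.println(implicitColumnsToPopulate(Arrays.asList("*")));
        // select *, filename ...  -> only filename required
        System.out.println(implicitColumnsToPopulate(Arrays.asList("*", "filename")));
    }
}
```

With a check of this kind, a scan for a plain `select *` would allocate no vectors for `fqn`, `filepath`, `filename`, or `suffix`, and a query that names one of them explicitly would populate only that column.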