[
https://issues.apache.org/jira/browse/DRILL-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561404#comment-17561404
]
ASF GitHub Bot commented on DRILL-8182:
---------------------------------------
jnturton opened a new pull request, #2583:
URL: https://github.com/apache/drill/pull/2583
# [DRILL-8182](https://issues.apache.org/jira/browse/DRILL-8182): File scan nodes not differentiated by format config
## Description
Two file scans that differ only by format config overridden with table
functions may be genuinely different in terms of the data they return. The
format config options may affect the behaviour of the format parser (date
strings, delimiters, etc.), possibly directing the format plugin to entirely
different data within the file. Such scans should not be considered the same by
the query planner. This is illustrated by the following example based on the
Excel format plugin.
When a query includes multiple SELECTs against a workbook by using TABLE
functions to access different sheets, and those sheets contain a column with
the same name, then the values for that column come from a single sheet for
both SELECTs. To reproduce, run the following query against the attachment and
note that the `Name` values returned from the Products sheet are `Name` values
from the Customers sheet.
```
with
  prod as (
    select Id, Name from TABLE(dfs.tmp.`/Products_Customers_Orders.xlsx`
      (type => 'excel', sheetName => 'Products'))
  )
  , cust as (
    select Id, Name from TABLE(dfs.tmp.`/Products_Customers_Orders.xlsx`
      (type => 'excel', sheetName => 'Customers'))
  )
select * from cust join prod on cust.Id = prod.Id;
```
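The failure mode described above can be modelled in a few lines. This is only a language-neutral sketch in Python, not Drill code: `FileScan`, `buggy_key`, `fixed_key` and `count_distinct_scans` are hypothetical names, and it assumes the planner treats two scans with equal identity keys as interchangeable. How Drill actually computes scan identity is not shown in this description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileScan:
    """Hypothetical stand-in for a file scan node; not a Drill class."""
    path: str
    columns: Tuple[str, ...]
    # Format options supplied through the table function, e.g. sheetName.
    format_config: Tuple[Tuple[str, str], ...]

    def buggy_key(self):
        # Identity that ignores the format config (the reported behaviour).
        return (self.path, self.columns)

    def fixed_key(self):
        # Identity that also covers the format config.
        return (self.path, self.columns, self.format_config)


def count_distinct_scans(scans, key):
    """Count the scans left if scans with equal keys are merged into one."""
    return len({key(s) for s in scans})


path = "/Products_Customers_Orders.xlsx"
prod = FileScan(path, ("Id", "Name"), (("sheetName", "Products"),))
cust = FileScan(path, ("Id", "Name"), (("sheetName", "Customers"),))

# Without the format config both scans collapse into one.
print(count_distinct_scans([prod, cust], FileScan.buggy_key))  # 1
# With the format config they stay distinct.
print(count_distinct_scans([prod, cust], FileScan.fixed_key))  # 2
```

Under that assumption, both sides of the join above would be fed by one scan of a single sheet, which matches the symptom of `Name` values from Customers appearing for Products.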
## Documentation
N/A
## Testing
New unit test: TestExcelFormat#testTableFuncsThatDifferOnlyByFormatConfig
> File scan nodes not differentiated by format config
> ---------------------------------------------------
>
> Key: DRILL-8182
> URL: https://issues.apache.org/jira/browse/DRILL-8182
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Other
> Affects Versions: 1.20.0
> Reporter: James Turton
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.2
>
> Attachments: Products_Customers_Orders.xlsx
>
>
> Two file scans that differ only by format config overridden with table
> functions may be genuinely different in terms of the data they return. The
> format config options may affect the behaviour of the format parser (date
> strings, delimiters, etc.), possibly directing the format plugin to entirely
> different data within the file. Such scans should not be considered the same
> by the query planner. This is illustrated by the following example based on
> the Excel format plugin.
> When a query includes multiple SELECTs against a workbook by using TABLE
> functions to access different sheets, and those sheets contain a column with
> the same name, then the values for that column come from a single sheet for
> both SELECTs. To reproduce, run the following query against the attachment
> and note that the `Name` values returned from the Products sheet are `Name`
> values from the Customers sheet.
>
> {code:java}
> with
> prod as (
> select Id, Name from TABLE(dfs.tmp.`/Products_Customers_Orders.xlsx`
> (type => 'excel', sheetName => 'Products'))
> )
> , cust as (
> select Id, Name from TABLE(dfs.tmp.`/Products_Customers_Orders.xlsx`
> (type => 'excel', sheetName => 'Customers'))
> )
> select * from cust join prod on cust.Id = prod.Id; {code}