[
https://issues.apache.org/jira/browse/DRILL-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731080#comment-17731080
]
Philippe Audet commented on DRILL-8425:
---------------------------------------
Hi, I investigated the issue a bit and found that the function
FileSelection.minusDirectories() is the bottleneck. I'm not sure whether that
is because it instantiates too many threads at the same time, but it starts
almost one per subdirectory. From what I understand, narrowing the search by
updating the root dir does not look trivial.
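
To illustrate the thread-count concern only, here is a minimal sketch of filtering directories out of a list of paths with a capped worker pool. It is not Drill's implementation: the class BoundedDirectoryFilter, the method filesOnly and the maxThreads parameter are hypothetical names, and it uses java.nio on a local file system rather than Drill's own file system layer.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration, not Drill code.
class BoundedDirectoryFilter {

    // Returns only the regular files among `candidates`, in their original order.
    // The checks run on a fixed pool of `maxThreads` workers, so the number of
    // threads does not grow with the number of subdirectories.
    static List<Path> filesOnly(List<Path> candidates, int maxThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<Boolean>> checks = new ArrayList<>();
            for (Path p : candidates) {
                Callable<Boolean> check = () -> Files.isRegularFile(p);
                checks.add(pool.submit(check));
            }
            List<Path> files = new ArrayList<>();
            for (int i = 0; i < candidates.size(); i++) {
                // get() blocks until the check for candidate i has finished.
                if (checks.get(i).get()) {
                    files.add(candidates.get(i));
                }
            }
            return files;
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether capping the pool would actually help here is untested; it only shows how the per-subdirectory thread count could be bounded if that turns out to be the cause.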
> Directory pruning issue with queries including joins.
> ------------------------------------------------------
>
> Key: DRILL-8425
> URL: https://issues.apache.org/jira/browse/DRILL-8425
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.21.0, 1.19.0
> Reporter: Loy2
> Priority: Major
>
> Performance degrades with the number of files present in the directory
> structure, even when the same query covers only one day of data.
> I'm using partitioned directories:
> ./product/year/month/day
> ./command/year/month/day
> Each contains a particular parquet file (tested with csv as well).
> If I query a table for one day, say select * from dfs.root.product where dir0
> = 2023 and dir1 = 04 and dir2 = 12; then only the file located in
> ./product/year/month/day/product.parquet is accessed (as expected).
> Now if I run a join query between product and command for a particular day:
> {quote}
> SELECT p.field1 , p.field2, c.field2 FROM dfs.root.command as c
> LEFT JOIN dfs.root.product as p
> on p.field1 = c.field1
> where p.dir0 = 2023
> and p.dir1 = 04
> and p.dir2 = 12
> and c.dir0 = 2023
> and c.dir1 = 04
> and c.dir2 = 12;
> {quote}
> I can see in the log (debug mode) that the whole directory structure is
> scanned, not just the 2 files concerned,
> so the more files (years, months) you have in the DFS, the more heap memory
> you use and the more time it takes to get the results.
> (Posted in the Slack channel:
> https://apache-drill.slack.com/archives/CG380K519/p1681335761429099)
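
As a rough sketch of what narrowing the search root would mean for the layout described above: when dir0, dir1 and dir2 are all fixed by equality filters, the scan could start at the day directory instead of the table root. The class PartitionRootNarrowing and the method narrowRoot are hypothetical names for illustration and do not correspond to Drill's planner code.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration, not Drill code.
class PartitionRootNarrowing {

    // `dirValues` holds the equality-filter values for dir0, dir1, ... in
    // partition order; a null entry marks a level without an equality filter.
    // Levels are appended to `tableRoot` until the first null, because a gap
    // means every subdirectory at that level still has to be listed.
    static Path narrowRoot(Path tableRoot, List<String> dirValues) {
        Path narrowed = tableRoot;
        for (String value : dirValues) {
            if (value == null) {
                break;
            }
            narrowed = narrowed.resolve(value);
        }
        return narrowed;
    }
}
```

With the values from the query above, narrowRoot(Paths.get("product"), Arrays.asList("2023", "04", "12")) resolves to product/2023/04/12, so only that one day directory would need to be listed for each side of the join.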