Hi guys, I'm trying to understand something about directory partitions and how they're implemented.
For sake of basic argument, I have ~3 mil rows in 3 separate Parquet files. Each one has a "groupId" of 1, 2 and 3 respectively. I then place them in separate directories named 1, 2 and 3. The following query takes ~250 ms: SELECT SUM(price) FROM dfs.tmp.test WHERE groupId IN (1, 2, 3) The following query takes ~500 ms: SELECT SUM(price) FROM dfs.tmp.test WHERE dir0 IN (1, 2, 3) Now I imagine the former is grabbing two fields (price and groupId), then running the groupId through a filter and then adding price, whereas I would imagine the latter would just read price from the 3 Parquet files in those matching directories. I can't understand why this would be twice as slow. The only thing I can imagine is that the dir0..x logic is completely separate and it's doing 3 very disparate Parquet reads. To make it more obvious, I moved all Parquet files into the same directory and the query takes ~100-150ms. Any ideas?