Hi guys,

I'm trying to understand something about directory partitions and how
they're implemented.

For sake of basic argument, I have ~3 mil rows in 3 separate Parquet
files.  Each one has a "groupId" of 1, 2 and 3 respectively.

I then place them in separate directories named 1, 2 and 3.

The following query takes ~250 ms:

SELECT SUM(price) FROM dfs.tmp.test WHERE groupId IN (1, 2, 3)

The following query takes ~500 ms:

SELECT SUM(price) FROM dfs.tmp.test WHERE dir0 IN (1, 2, 3)

Now I imagine the former is grabbing two fields (price and groupId), then
running the groupId through a filter and then adding price, whereas I would
imagine the latter would just read price from the 3 Parquet files in those
matching directories.

I can't understand why this would be twice as slow.  The only thing I can
imagine is that the dir0..x logic is completely separate and it's doing 3
very disparate Parquet reads.

To make it more obvious, I moved all Parquet files into the same directory
and the query takes ~100-150ms.

Any ideas?

Reply via email to