Hello,

It is, indeed, a bit surprising that you're still accessing files that are
not 'city=Karlsruhe'. What happens if you run the query with just WHERE
city='Karlsruhe'? Do you still see openat calls for parquet files other
than the 'city=Karlsruhe' ones? And if you don't see the extra openat
calls, I would be curious whether the order of the conditions changes the
behavior: WHERE city='Karlsruhe' AND col_55 > col_44.
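
Concretely, something like this, reusing the names from your snippet:

    # 1) partition filter only -- does this alone still trigger openat on
    #    parquet files from other cities?
    quack.query("tbl", "SELECT count(*) AS num_records FROM tbl WHERE city='Karlsruhe'")

    # 2) same predicates as your original query, but partition filter listed first
    quack.query("tbl", "SELECT avg(col_55) AS avg, count(*) AS num_records, city "
                       "FROM tbl WHERE city='Karlsruhe' AND col_55 > col_44 GROUP BY city")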

I have not yet set up a dataset to simulate this myself, but if you could
share or grant access to your dataset, that might be helpful as well.

-Jon


On Mon, Jan 17, 2022 at 7:01 AM Philipp Bender <[email protected]> wrote:

> Dear List,
>
> I am trying to find out why querying Parquet files on S3 takes longer
> than I think it should.
>
> I have a dummy data set with a couple of random columns, but also
> 'year', 'month' and 'city'.
>
> This data set is stored on my disk, partitioned by the three columns
> mentioned above. A file path looks, for example, like this:
>
>
> synth/year=2020/month=4/city=Karlsruhe/6a260e3e455f4b05b0cf7b4c361de6be.parquet
>
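> A minimal sketch of how such a hive-partitioned dataset can be written
> with pyarrow (the tiny table below is just a stand-in for my dummy data):
>
> ###
>
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> table = pa.table({
>     "year": [2020, 2020],
>     "month": [4, 4],
>     "city": ["Karlsruhe", "Berlin"],
>     "col_44": [1.0, 2.0],
>     "col_55": [3.0, 4.0],
> })
>
> # one directory level per partition column, hive-style (year=.../month=.../city=...)
> ds.write_dataset(
>     table,
>     "synth",
>     format="parquet",
>     partitioning=ds.partitioning(
>         pa.schema([("year", pa.int64()), ("month", pa.int64()), ("city", pa.string())]),
>         flavor="hive",
>     ),
> )
>
> ###
>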
> I query this data set like this:
>
> ###
>
> import pyarrow.dataset as ds
> import duckdb
>
> # 'path' is the root of the partitioned dataset (the 'synth' directory)
> data_set = ds.dataset(path, partitioning="hive")
> quack = duckdb.arrow(data_set)
> res = quack.query("tbl", """
>     SELECT avg(col_55) AS avg, count(*) AS num_records, city
>     FROM tbl
>     WHERE col_55 > col_44 AND city='Karlsruhe'
>     GROUP BY city
> """)
>
> ###
>
> This query works and I think the result is correct.
>
> However, I was wondering why it takes so long, especially with remote
> files (using s3fs), even though the query should only touch 10% or so of
> the files, since 'city' appears both in the partitioning and in the
> WHERE clause of the query.
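>
> (For reference, the remote variant is opened roughly like this; the
> bucket name is a placeholder:)
>
> ###
>
> import s3fs
> import pyarrow.dataset as ds
>
> # fsspec S3 filesystem; pyarrow accepts it via the filesystem argument
> fs = s3fs.S3FileSystem()
> data_set = ds.dataset("my-bucket/synth", filesystem=fs, partitioning="hive")
>
> ###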
>
> So I ran my query with 'strace':
>
> $ strace -f -t -e trace=file my-command
>
> The result was interesting:
>
> * I had 'stat' calls on all of the files (which I expect)
> * But apart from that, I also had 'openat' calls on more files than just
> the 'city=Karlsruhe' ones, though not on all of them. Of the 10 cities I
> have, only 'Berlin' and 'Braunschweig' were read besides 'Karlsruhe', and
> for those two only about half of the parquet files.
>
> I would be interested to know what the expected behavior is and whether
> there is a way to really load (remote) data on demand, where the demand
> is determined by DuckDB based on the query.
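>
> (What I mean by "on demand" is something like the following sketch, where
> a pyarrow-level filter on the partition column lets pyarrow skip the
> non-matching files before anything is handed to DuckDB:)
>
> ###
>
> import pyarrow.dataset as ds
>
> # 'data_set' is the hive-partitioned dataset from the snippet above
> karlsruhe_only = data_set.to_table(filter=ds.field("city") == "Karlsruhe")
>
> ###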
>
> Thanks,
> Philipp
>
