Hello,

It is indeed a bit surprising that you're still seeing accesses to files outside 'city=Karlsruhe'. What happens if you run the query with just: WHERE city='Karlsruhe' ? Do you still see openat calls for Parquet files other than the 'city=Karlsruhe' ones? And if you don't see the extra openat calls, I would be curious whether the order of the conditions changes the behavior: WHERE city='Karlsruhe' AND col_55 > col_44.
I have not yet set up a dataset to simulate this myself, but if you could share or grant access to this dataset, that might be helpful as well.

-Jon

On Mon, Jan 17, 2022 at 7:01 AM Philipp Bender <[email protected]> wrote:
> Dear List,
>
> I am trying to find out why querying Parquet files on S3 takes longer
> than I think it should.
>
> I have a dummy data set with a couple of random columns, but also
> 'year', 'month' and 'city'.
>
> This data set is stored on my disk, partitioned along the three columns
> mentioned above. The layout looks, for example, like this:
>
> synth/year=2020/month=4/city=Karlsruhe/6a260e3e455f4b05b0cf7b4c361de6be.parquet
>
> I query this data set like this:
>
> ###
>
> import pyarrow.dataset as ds
> import duckdb
>
> data_set = ds.dataset(path, partitioning="hive")
> quack = duckdb.arrow(data_set)
> res = quack.query("tbl", "SELECT avg(col_55) as avg, count(*) as
> num_records, city FROM tbl WHERE col_55 > col_44 AND city='Karlsruhe'
> GROUP BY city")
>
> ###
>
> This query works and I think the result is correct.
>
> However, I was wondering why it takes so long, especially with remote
> files (using s3fs), given that the query should only touch 10% or so
> of the files, since 'city' appears both in the partitioning and in the
> WHERE clause.
>
> So I ran my query under strace:
>
> $ strace -f -t -e trace=file my-command
>
> The result was interesting:
>
> * I saw 'stat' calls on all of the files (which I expected).
> * But apart from that, I also saw 'openat' calls on more files than just
> the 'city=Karlsruhe' ones, though not on all of them. Of the 10 cities
> I have, only 'Berlin' and 'Braunschweig' were read (and there, only
> about half of the Parquet files), plus 'Karlsruhe', of course.
>
> I would be interested to know what the expected behavior is, and whether
> there is a way to really load (remote) data on demand, with the demand
> determined by DuckDB based on the query.
>
> Thanks,
> Philipp
>
