Hi,
we load ParquetDatasets from S3 and are seeing many more requests to the S3 API
than expected. Our dataset is partitioned by a single column of type date32 and
looks as follows:
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-01/parquet_file.parquet
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-02/parquet_file.parquet
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-03/parquet_file.parquet
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-04/parquet_file.parquet
...
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-12-31/parquet_file.parquet
When we load one day of this dataset (e.g., column_of_type_date32=2023-03-17)
using a filter (see the simplified code below), about 364 list-requests are made
to the S3 bucket. I would expect a single list-request for the prefix
S3://<BUCKET_NAME>/<UUID>/<DATASET>/ followed by one get-request for the
matching parquet file (e.g.,
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-03-17/parquet_file.parquet)
to suffice. What am I missing?
```
import pandas as pd
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Hive-style partitioning on the date32 column.
partition = ds.partitioning(
    pa.schema([("column_of_type_date32", pa.date32())]), flavor="hive"
)

dataset_path = "s3://<BUCKET_NAME>/<UUID>/<DATASET>"
filesystem, path = pa.fs.FileSystem.from_uri(dataset_path)

pq_ds = pq.ParquetDataset(
    path,
    filters=[("column_of_type_date32", "=", pd.Timestamp("2023-03-17"))],
    schema=<SCHEMA>,
    filesystem=filesystem,
    partitioning=partition,
    pre_buffer=False,
)
table = pq_ds.read(use_threads=False)
df = table.combine_chunks().to_pandas()
```
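For comparison, here is a minimal sketch of the access pattern I would expect to be possible. This is an assumption on my side, not our production code: it points the dataset at the single partition directory, so no discovery over the other ~364 partition directories should be needed (bucket, UUID, and dataset names are placeholders).
```
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

# Placeholder path, pointing directly at the one partition directory
# instead of the dataset root.
partition_path = "s3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-03-17"
filesystem, path = pa.fs.FileSystem.from_uri(partition_path)

# No partitioning or filters needed here; note that the partition column
# is no longer derived from the path in this variant.
pq_ds = pq.ParquetDataset(path, filesystem=filesystem, pre_buffer=False)
table = pq_ds.read(use_threads=False)
```
This would of course only be a workaround; the question remains why the filtered discovery in the snippet above issues roughly one list-request per partition directory.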
Best regards,
Michael