Hi,
we load ParquetDatasets from S3 and are seeing many more requests to the S3 API
than expected. Our dataset is partitioned by a single column of type date32 and
looks as follows:
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-01/parquet_file.parquet
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-02/parquet_file.parquet
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-03/parquet_file.parquet
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-01-04/parquet_file.parquet
...
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-12-31/parquet_file.parquet
When we load one day of this dataset (e.g., column_of_type_date32=2023-03-17)
using a filter (see the simplified code below), about 364 list-requests are made
to the S3 bucket. I would expect a single list-request for the prefix
S3://<BUCKET_NAME>/<UUID>/<DATASET>/ followed by one get-request for the
matching parquet file (e.g.,
S3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-03-17/parquet_file.parquet)
to suffice. What am I missing?
```
import pandas as pd
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Hive-style partitioning on the date32 column.
partition = ds.partitioning(
    pa.schema([("column_of_type_date32", pa.date32())]), flavor="hive"
)

dataset_path = "s3://<BUCKET_NAME>/<UUID>/<DATASET>"
filesystem, path = pa.fs.FileSystem.from_uri(dataset_path)

pq_ds = pq.ParquetDataset(
    path,
    filters=[("column_of_type_date32", "=", pd.Timestamp("2023-03-17"))],
    schema=<SCHEMA>,
    filesystem=filesystem,
    partitioning=partition,
    pre_buffer=False,
)
table = pq_ds.read(use_threads=False)
df = table.combine_chunks().to_pandas()
```
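For comparison, here is a minimal sketch of the access pattern I would expect to be possible. This is an assumption on my side, not our production code: it points the dataset at the single partition directory, so no discovery over the other ~364 partition directories should be needed (bucket, UUID, and dataset names are placeholders).
```
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

# Placeholder path, pointing directly at the one partition directory
# instead of the dataset root.
partition_path = "s3://<BUCKET_NAME>/<UUID>/<DATASET>/column_of_type_date32=2023-03-17"
filesystem, path = pa.fs.FileSystem.from_uri(partition_path)

# No partitioning or filters needed here; note that the partition column
# is no longer derived from the path in this variant.
pq_ds = pq.ParquetDataset(path, filesystem=filesystem, pre_buffer=False)
table = pq_ds.read(use_threads=False)
```
This would of course only be a workaround; the question remains why the filtered discovery in the snippet above issues roughly one list-request per partition directory.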
Best regards,
Michael