For reference, here is an equivalent example using the pyarrow.dataset API, which shows the same behavior:
data = ds.dataset("global-radiosondes/hires_sonde", filesystem=fs,
                  format="parquet",
                  partitioning=["year", "month", "day", "hour", "site"])
subset = (ds.field("year") == "2022") & (ds.field("month") == "01") \
         & (ds.field("day") == "09") & (ds.field("hour") == "12")
batches = list(data.to_batches(filter=subset))
print(batches)
Output:
[]
> On Jan 9, 2022, at 3:46 PM, Kelton Halbert <[email protected]> wrote:
>
> Hello - I’m not sure if this is a bug, or if I’m not using the API correctly,
> but I have a partitioned parquet dataset stored on a Google Cloud Bucket that
> I am attempting to load for analysis. However, when applying filters to the
> dataset (using both the pyarrow.dataset and pyarrow.parquet.ParquetDataset
> APIs), I receive empty data frames and tables.
>
> Here is my sample code:
>
> import matplotlib.pyplot as plt
> import pyarrow.dataset as ds
> import numpy as np
> import gcsfs
> import pyarrow.parquet as pq
>
> fs = gcsfs.GCSFileSystem()
> data = pq.ParquetDataset("global-radiosondes/hires_sonde", filesystem=fs,
>                          partitioning=["year", "month", "day", "hour", "site"],
>                          use_legacy_dataset=False,
>                          filters=[('year', '=', '2022'),
>                                   ('month', '=', '01'),
>                                   ('day', '=', '09'),
>                                   ('hour', '=', '12')])
> table = data.read(columns=["pres", "hght"])
> df = table.to_pandas()
> print(df)
>
> With the following output:
> Empty DataFrame
> Columns: [pres, hght]
> Index: []
>
>
> Am I applying this incorrectly somehow? Any help would be appreciated. Again,
> the same issue happens when using the pyarrow.dataset API to load as well.
> The data bucket is public, so feel free to experiment. If I load the whole
> dataset into a pandas data frame, it works fine, so the issue seems to be
> the filtering.
>
> Thanks,
> Kelton.