For reference, here is an equivalent example using the pyarrow.dataset API, which shows the same behavior:
data = ds.dataset("global-radiosondes/hires_sonde", filesystem=fs,
                  format="parquet",
                  partitioning=["year", "month", "day", "hour", "site"])
subset = (ds.field("year") == "2022") & (ds.field("month") == "01") \
         & (ds.field("day") == "09") & (ds.field("hour") == "12")
batches = list(data.to_batches(filter=subset))
print(batches)
Output:
[]
> On Jan 9, 2022, at 3:46 PM, Kelton Halbert <[email protected]> wrote:
>
> Hello - I’m not sure if this is a bug, or if I’m not using the API correctly,
> but I have a partitioned parquet dataset stored on a Google Cloud Bucket that
> I am attempting to load for analysis. However, when applying filters to the
> dataset (using both the pyarrow.dataset and pyarrow.parquet.ParquetDataset
> APIs), I receive empty data frames and tables.
>
> Here is my sample code:
>
> import matplotlib.pyplot as plt
> import pyarrow.dataset as ds
> import numpy as np
> import gcsfs
> import pyarrow.parquet as pq
>
> fs = gcsfs.GCSFileSystem()
> data = pq.ParquetDataset("global-radiosondes/hires_sonde", filesystem=fs,
>                          partitioning=["year", "month", "day", "hour", "site"],
>                          use_legacy_dataset=False,
>                          filters=[('year', '=', '2022'),
>                                   ('month', '=', '01'),
>                                   ('day', '=', '09'),
>                                   ('hour', '=', '12')])
> table = data.read(columns=["pres", "hght"])
> df = table.to_pandas()
> print(df)
>
> With the following output:
> Empty DataFrame
> Columns: [pres, hght]
> Index: []
>
>
> Am I applying this incorrectly somehow? Any help would be appreciated. Again,
> the same issue happens when using the pyarrow.dataset API to load as well.
> The data bucket is public, so feel free to experiment. If I load the whole
> dataset into a pandas data frame, it works fine, so the issue seems to be
> the filtering.
>
> Thanks,
> Kelton.