Hi Alenka,

Here is the code snippet that loads a single Parquet file. I can also confirm 
that the error does appear to come from the fs.isfile call on the root 
directory: calling that function myself returns False, as I would expect for a 
directory path: 
fs.isfile("global-radiosondes/hires-sonde")

import gcsfs
import pyarrow
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(token="anon")

partitioning = ds.HivePartitioning(
        pyarrow.schema([
            pyarrow.field('year', pyarrow.int32()),
            pyarrow.field('month', pyarrow.int32()),
            pyarrow.field('day', pyarrow.int32()),
            pyarrow.field('hour', pyarrow.int32()),
            pyarrow.field('WMO', pyarrow.string())
        ])
)

schema = pyarrow.schema([
    pyarrow.field('lon', pyarrow.float32()),
    pyarrow.field('lat', pyarrow.float32()),
    pyarrow.field('pres', pyarrow.float32()),
    pyarrow.field('hght', pyarrow.float32()),
    pyarrow.field('gpht', pyarrow.float32()),
    pyarrow.field('tmpc', pyarrow.float32()),
    pyarrow.field('dwpc', pyarrow.float32()),
    pyarrow.field('relh', pyarrow.float32()),
    pyarrow.field('uwin', pyarrow.float32()),
    pyarrow.field('vwin', pyarrow.float32()),
    pyarrow.field('wspd', pyarrow.float32()),
    pyarrow.field('wdir', pyarrow.float32()),
    pyarrow.field('year', pyarrow.int32()),
    pyarrow.field('month', pyarrow.int32()),
    pyarrow.field('day', pyarrow.int32()),
    pyarrow.field('hour', pyarrow.int32()),
    pyarrow.field('WMO', pyarrow.string())
])

data = ds.dataset("global-radiosondes/hires-sonde/year=2016/month=5/day=24/hour=19/WMO=72451",
                  filesystem=fs, format="parquet",
                  schema=schema, partitioning=partitioning)

batches = data.to_batches(columns=["pres", "gpht", "hght", "tmpc", "wspd", "wdir"],
                          use_threads=True)

batches = list(batches)
print(batches[0].to_pandas().head())
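
In case it helps to compare against a literal single file (rather than the leaf partition directory above), this is roughly how I would read one object directly (just a sketch, not output from an actual run; it assumes fs.ls returns at least one Parquet object under that prefix, and pq is pyarrow.parquet):

import pyarrow.parquet as pq

# Pick the first object listed under the leaf partition and read it directly;
# read_table accepts the fsspec filesystem via its filesystem argument.
leaf = "global-radiosondes/hires-sonde/year=2016/month=5/day=24/hour=19/WMO=72451"
path = fs.ls(leaf)[0]
table = pq.read_table(path, filesystem=fs)
print(table.num_rows)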

Kelton.


> On Feb 21, 2022, at 3:07 AM, Alenka Frim <ale...@voltrondata.com> wrote:
> 
> Hi Kelton,
> 
> I can reproduce the same error if I try to load all the data with data = 
> ds.dataset("global-radiosondes/hires-sonde", filesystem=fs) or data = 
> pq.ParquetDataset("global-radiosondes/hires-sonde", filesystem=fs, 
> use_legacy_dataset=False).
> 
> Could you share your code, where you read a specific parquet file?
> 
> Best,
> Alenka 
> 
> On Mon, Feb 21, 2022 at 12:04 AM Kelton Halbert <kthalb...@wxbyte.com> wrote:
> Hello,
> 
> I’ve been learning and working with PyArrow recently for a project to store 
> some atmospheric science data as part of a partitioned dataset, and recently 
> the dataset class with the fsspec/gcsfs filesystem has started producing a 
> new error. Unfortunately I cannot seem to track down what changed, or whether 
> the error is on my end or not. I’m using PyArrow 7.0.0 and Python 3.8.
> 
> If I specify a specific parquet file, everything is fine - but if I give it 
> any of the directory partitions, the same issue occurs. Any guidance here 
> would be appreciated!
> 
> The code: 
> fs = gcsfs.GCSFileSystem(token="anon")
> 
> partitioning = ds.HivePartitioning(
>         pyarrow.schema([
>             pyarrow.field('year', pyarrow.int32()),
>             pyarrow.field('month', pyarrow.int32()),
>             pyarrow.field('day', pyarrow.int32()),
>             pyarrow.field('hour', pyarrow.int32()),
>             pyarrow.field('WMO', pyarrow.string())
>         ])
> )
> 
> schema = pyarrow.schema([
>     pyarrow.field('lon', pyarrow.float32()),
>     pyarrow.field('lat', pyarrow.float32()),
>     pyarrow.field('pres', pyarrow.float32()),
>     pyarrow.field('hght', pyarrow.float32()),
>     pyarrow.field('gpht', pyarrow.float32()),
>     pyarrow.field('tmpc', pyarrow.float32()),
>     pyarrow.field('dwpc', pyarrow.float32()),
>     pyarrow.field('relh', pyarrow.float32()),
>     pyarrow.field('uwin', pyarrow.float32()),
>     pyarrow.field('vwin', pyarrow.float32()),
>     pyarrow.field('wspd', pyarrow.float32()),
>     pyarrow.field('wdir', pyarrow.float32()),
>     pyarrow.field('year', pyarrow.int32()),
>     pyarrow.field('month', pyarrow.int32()),
>     pyarrow.field('day', pyarrow.int32()),
>     pyarrow.field('hour', pyarrow.int32()),
>     pyarrow.field('WMO', pyarrow.string())
> ])
> 
> data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs, 
> format="parquet", \
>                         partitioning=partitioning, schema=schema)
> 
> subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")
> 
> batches = data.to_batches(columns=["pres", "gpht", "tmpc", "wspd", "wdir", 
> "year", "month", "day", "hour"], \
>                 use_threads=True)
> 
> batches = list(batches)
> 
> The error:
>     391 from pyarrow import PythonFile
>     393 if not self.fs.isfile(path):
> --> 394     raise FileNotFoundError(path)
>     396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
> 
> FileNotFoundError: global-radiosondes/hires-sonde/
> 
