Hi Alenka,

Here is the code snippet that loads a single Parquet file. I can also confirm that the failure appears to come from the "fs.isfile" call on the root directory: calling that function myself returns False, as I would expect for a directory:

fs.isfile("global-radiosondes/hires-sonde")
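For context, fsspec filesystems report False from isfile() on a directory-style prefix even when isdir() is True. The same semantics can be reproduced locally with fsspec's in-memory filesystem (a sketch using a made-up "bucket/hires-sonde" path, not the real GCS bucket):

```python
import fsspec

# In-memory fsspec filesystem; mirrors the isfile/isdir semantics that
# GCSFileSystem exposes, without touching the real bucket.
fs = fsspec.filesystem("memory")
fs.pipe("bucket/hires-sonde/year=2016/part-0.parquet", b"dummy bytes")

print(fs.isfile("bucket/hires-sonde/year=2016/part-0.parquet"))  # True: a real file
print(fs.isfile("bucket/hires-sonde"))                           # False: a directory prefix
print(fs.isdir("bucket/hires-sonde"))                            # True
```

So a dataset root returning False from isfile() is expected behavior, which is why the FileNotFoundError raised from that check is surprising.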
import gcsfs
import pyarrow
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(token="anon")

partitioning = ds.HivePartitioning(
    pyarrow.schema([
        pyarrow.field('year', pyarrow.int32()),
        pyarrow.field('month', pyarrow.int32()),
        pyarrow.field('day', pyarrow.int32()),
        pyarrow.field('hour', pyarrow.int32()),
        pyarrow.field('WMO', pyarrow.string())
    ])
)

schema = pyarrow.schema([
    pyarrow.field('lon', pyarrow.float32()),
    pyarrow.field('lat', pyarrow.float32()),
    pyarrow.field('pres', pyarrow.float32()),
    pyarrow.field('hght', pyarrow.float32()),
    pyarrow.field('gpht', pyarrow.float32()),
    pyarrow.field('tmpc', pyarrow.float32()),
    pyarrow.field('dwpc', pyarrow.float32()),
    pyarrow.field('relh', pyarrow.float32()),
    pyarrow.field('uwin', pyarrow.float32()),
    pyarrow.field('vwin', pyarrow.float32()),
    pyarrow.field('wspd', pyarrow.float32()),
    pyarrow.field('wdir', pyarrow.float32()),
    pyarrow.field('year', pyarrow.int32()),
    pyarrow.field('month', pyarrow.int32()),
    pyarrow.field('day', pyarrow.int32()),
    pyarrow.field('hour', pyarrow.int32()),
    pyarrow.field('WMO', pyarrow.string())
])

data = ds.dataset(
    "global-radiosondes/hires-sonde/year=2016/month=5/day=24/hour=19/WMO=72451",
    filesystem=fs, format="parquet",
    schema=schema, partitioning=partitioning)

batches = data.to_batches(columns=["pres", "gpht", "hght", "tmpc", "wspd", "wdir"],
                          use_threads=True)
batches = list(batches)
print(batches[0].to_pandas().head())

Kelton

> On Feb 21, 2022, at 3:07 AM, Alenka Frim <ale...@voltrondata.com> wrote:
>
> Hi Kelton,
>
> I can reproduce the same error if I try to load all the data with
>
>     data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs)
>
> or
>
>     data = pq.ParquetDataset("global-radiosondes/hires-sonde", filesystem=fs,
>                              use_legacy_dataset=False)
>
> Could you share your code, where you read a specific parquet file?
>
> Best,
> Alenka
>
> On Mon, Feb 21, 2022 at 12:04 AM Kelton Halbert <kthalb...@wxbyte.com> wrote:
>
> Hello,
>
> I’ve been learning and working with PyArrow recently for a project to store
> some atmospheric science data as part of a partitioned dataset, and recently
> the dataset class with the fsspec/gcsfs filesystem has started producing a
> new error. Unfortunately I cannot seem to track down what changed, or whether
> it’s an error on my end or not. I’m using PyArrow 7.0.0 and Python 3.8.
>
> If I specify a specific parquet file, everything is fine - but if I give it
> any of the directory partitions, the same issue occurs. Any guidance here
> would be appreciated!
>
> The code:
>
>     fs = gcsfs.GCSFileSystem(token="anon")
>
>     partitioning = ds.HivePartitioning(
>         pyarrow.schema([
>             pyarrow.field('year', pyarrow.int32()),
>             pyarrow.field('month', pyarrow.int32()),
>             pyarrow.field('day', pyarrow.int32()),
>             pyarrow.field('hour', pyarrow.int32()),
>             pyarrow.field('WMO', pyarrow.string())
>         ])
>     )
>
>     schema = pyarrow.schema([
>         pyarrow.field('lon', pyarrow.float32()),
>         pyarrow.field('lat', pyarrow.float32()),
>         pyarrow.field('pres', pyarrow.float32()),
>         pyarrow.field('hght', pyarrow.float32()),
>         pyarrow.field('gpht', pyarrow.float32()),
>         pyarrow.field('tmpc', pyarrow.float32()),
>         pyarrow.field('dwpc', pyarrow.float32()),
>         pyarrow.field('relh', pyarrow.float32()),
>         pyarrow.field('uwin', pyarrow.float32()),
>         pyarrow.field('vwin', pyarrow.float32()),
>         pyarrow.field('wspd', pyarrow.float32()),
>         pyarrow.field('wdir', pyarrow.float32()),
>         pyarrow.field('year', pyarrow.int32()),
>         pyarrow.field('month', pyarrow.int32()),
>         pyarrow.field('day', pyarrow.int32()),
>         pyarrow.field('hour', pyarrow.int32()),
>         pyarrow.field('WMO', pyarrow.string())
>     ])
>
>     data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs,
>                       format="parquet",
>                       partitioning=partitioning, schema=schema)
>
>     subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")
>
>     batches = data.to_batches(columns=["pres", "gpht", "tmpc", "wspd", "wdir",
>                                        "year", "month", "day", "hour"],
>                               use_threads=True)
>
>     batches = list(batches)
>
> The error:
>
>     391 from pyarrow import PythonFile
>     393 if not self.fs.isfile(path):
> --> 394     raise FileNotFoundError(path)
>     396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
>
>     FileNotFoundError: global-radiosondes/hires-sonde/
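The quoted traceback shows the loader calling isfile() on the directory path itself. One way to sidestep that check (a workaround sketch, untested against this bucket; "list_parquet_files" is a hypothetical helper, not a pyarrow API) is to enumerate the Parquet files with the filesystem's find() and pass the explicit list of paths to ds.dataset():

```python
def list_parquet_files(fs, root):
    """Hypothetical helper: all .parquet paths under `root`,
    collected via fsspec's recursive find()."""
    return sorted(p for p in fs.find(root) if p.endswith(".parquet"))

# Usage sketch (assumes fs, schema, and partitioning from the snippet above;
# note that partition-column discovery may need extra handling when file
# paths are passed explicitly instead of a base directory):
#
#     files = list_parquet_files(fs, "global-radiosondes/hires-sonde")
#     data = ds.dataset(files, filesystem=fs, format="parquet",
#                       schema=schema, partitioning=partitioning)
```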