Thanks, Arthur, this helps. The complete code example is:

    import gcsfs
    import pyarrow.parquet as pq

    filename = 'gs://' + files[0]
    gs = gcsfs.GCSFileSystem()
    f = gs.open(filename)
    pqf = pq.ParquetFile(f)
    pqf.metadata
On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]> wrote:

> Hi Cindy,
>
> In your case you'd have to pass a GCS file instance to the ParquetFile
> constructor. Something like this:
>
>     source = fs.open_input_file(filename)
>     parquet_file = pq.ParquetFile(source)
>
> You can see how read_table does this in the source code:
> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
>
> I hope this helps.
>
> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]> wrote:
>
>> Hi -
>>
>> I need to drop down to the ParquetFile API so I can have better control
>> over batch size for reading huge Parquet files. The filename is:
>>
>> gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>
>> This invocation fails:
>>
>>     pqf = pq.ParquetFile(filename)
>>
>> "FileNotFoundError: [Errno 2] Failed to open local file
>> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>> Detail: [errno 2] No such file or directory"
>>
>> While this call, using the same filename, succeeds because I can specify the
>> 'gs' filesystem:
>>
>>     table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>
>> I don't see a way to specify 'filesystem' on the ParquetFile API
>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>> Is there any way to read a GCS file using ParquetFile?
>>
>> If not, can you show me the code for reading batches using pq.read_table
>> or one of the other Arrow Parquet APIs
>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>>
>> Thanks -
>>
>> -- Cindy
