Hi Cindy,

In your case you'd have to pass a GCS file instance to the ParquetFile
constructor. Something like this:

# 'fs' here is your GCS filesystem instance (the same 'gs' object you
# already pass to read_table)
source = fs.open_input_file(filename)
parquet_file = pq.ParquetFile(source)
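
Since your goal is controlling the batch size, note that once you have the
ParquetFile you can stream record batches with iter_batches instead of
reading the whole table at once. A minimal sketch, assuming your 'gs'
object comes from gcsfs (the bucket path below is just a placeholder):

import gcsfs  # assumption: gcsfs provides the GCS filesystem
import pyarrow.parquet as pq

gs = gcsfs.GCSFileSystem()
path = "gs://my-bucket/some/file.parquet.snappy"  # placeholder path

with gs.open(path, "rb") as f:
    parquet_file = pq.ParquetFile(f)
    # Yields pyarrow.RecordBatch objects of at most batch_size rows,
    # so the whole file is never materialized in memory at once.
    for batch in parquet_file.iter_batches(batch_size=64_000):
        print(batch.num_rows)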

You can see how read_table does this in the source code:
https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
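
Roughly, read_table resolves a filesystem and a path from the URI and then
opens the file on that filesystem. A simplified sketch of the same idea
(FileSystem.from_uri only understands gs:// URIs if your pyarrow build
includes GCS support, so treat that part as an assumption about your
environment):

from pyarrow import fs
import pyarrow.parquet as pq

# Split "gs://bucket/key" into a filesystem object and a
# bucket-relative path
filesystem, path = fs.FileSystem.from_uri(filename)
source = filesystem.open_input_file(path)
parquet_file = pq.ParquetFile(source)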

I hope this helps.



On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]> wrote:

> Hi -
>
> I need to drop down to the ParquetFile API so I can have better control
> over batch size for reading huge Parquet files.  The filename is:
>
>
> *gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy*
>
> This invocation fails:
> *pqf = pq.ParquetFile(filename)*
> "FileNotFoundError: [Errno 2] Failed to open local file
> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
> Detail: [errno 2] No such file or directory"
>
> While this call, using the same filename, succeeds because I can specify
> the 'gs' filesystem:
> *table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)*
>
> I don't see a way to specify 'filesystem' on the ParquetFile API
> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
> Is there any way to read a GCS file using ParquetFile?
>
> If not, can you show me the code for reading batches using pq.read_table
> or one of the other Arrow Parquet APIs
> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>
> Thanks -
>
> -- Cindy
>
