I have a similar issue trying to read huge (1 GB) Parquet files from Azure Data Lake Storage. I'm reading the file in small chunks using the ParquetFile.iter_batches method, but it seems like the entire file is read into memory before the first batch is returned. I am using the Azure SDK for Python together with the pyarrowfs-adlgen2 package. Has anyone faced a similar problem, or is there a workaround?
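Roughly, my reader looks like the sketch below ('myaccount' and the file path are placeholders, and I'm going by pyarrowfs-adlgen2's documented AccountHandler entry point with DefaultAzureCredential, so treat it as approximate rather than my exact code):

    import azure.identity
    import pyarrow.fs
    import pyarrow.parquet as pq
    import pyarrowfs_adlgen2

    # Wrap the ADLS Gen2 account as a pyarrow filesystem
    # ('myaccount' and the file path below are placeholders).
    handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
        'myaccount', azure.identity.DefaultAzureCredential())
    fs = pyarrow.fs.PyFileSystem(handler)

    with fs.open_input_file('container/path/to/big.parquet') as f:
        pqf = pq.ParquetFile(f)
        # I expected memory use to stay bounded by batch_size, but the
        # whole file appears to be read before the first batch arrives.
        for batch in pqf.iter_batches(batch_size=64_000):
            print(batch.num_rows)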
On Fri, Dec 24, 2021 at 2:11 PM Cindy McMullen <[email protected]> wrote:

> Thanks, Arthur, this helps. The complete code example is:
>
>     filename = 'gs://' + files[0]
>     gs = gcsfs.GCSFileSystem()
>     f = gs.open(filename)
>     pqf = pq.ParquetFile(f)
>     pqf.metadata
>
> On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]> wrote:
>
>> Hi Cindy,
>>
>> In your case you'd have to pass a GCS file instance to the ParquetFile
>> constructor. Something like this:
>>
>>     source = fs.open_input_file(filename)
>>     parquet_file = pq.ParquetFile(source)
>>
>> You can see how read_table does this in the source code:
>> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
>>
>> I hope this helps.
>>
>> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]> wrote:
>>
>>> Hi -
>>>
>>> I need to drop down to the ParquetFile API so I can have better control
>>> over batch size for reading huge Parquet files. The filename is:
>>>
>>> gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>>
>>> This invocation fails:
>>>
>>>     pqf = pq.ParquetFile(filename)
>>>
>>> with:
>>>
>>>     FileNotFoundError: [Errno 2] Failed to open local file
>>>     'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>>>     Detail: [errno 2] No such file or directory
>>>
>>> while this call on the same file succeeds, because I can specify the 'gs'
>>> filesystem:
>>>
>>>     table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>>
>>> I don't see a way to specify 'filesystem' on the ParquetFile API
>>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>>> Is there any way to read a GCS file using ParquetFile?
>>>
>>> If not, can you show me the code for reading batches using pq.read_table
>>> or one of the other Arrow Parquet APIs
>>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>>>
>>> Thanks -
>>>
>>> -- Cindy

--
Partha Dutta
[email protected]
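[For later readers of this thread: a minimal sketch combining Arthur's suggestion with the batch reading Cindy asked about. It assumes gcsfs credentials are already configured; the path is the one from the thread, and batch_size is an illustrative value.]

    import gcsfs
    import pyarrow.parquet as pq

    gs = gcsfs.GCSFileSystem()
    filename = ('gs://graph_infra_steel_thread/output_pq/parquet/usersims/'
                'output-20211202-220329-20211202-220329-00-0012.parquet.snappy')

    # Open the GCS object as a file-like source, hand it to ParquetFile,
    # then stream record batches instead of materializing the whole table.
    with gs.open(filename) as f:
        pqf = pq.ParquetFile(f)
        for batch in pqf.iter_batches(batch_size=64_000):
            print(batch.num_rows)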
