Just a shot in the dark, but how many row groups are there in that 1 GB file? IIRC, the reader loads an entire row group's worth of rows at once.
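Something along these lines (an untested sketch; the bucket path is a placeholder, and the gcsfs handle mirrors Cindy's snippet quoted below) would show the row-group layout and then stream batches, keeping in mind that iter_batches still decodes a whole row group before yielding its first batch:

import gcsfs
import pyarrow.parquet as pq

gs = gcsfs.GCSFileSystem()
with gs.open('gs://my-bucket/my-file.parquet.snappy') as f:  # placeholder path
    pqf = pq.ParquetFile(f)

    # Inspect the row-group layout: count, rows per group, and on-disk size.
    meta = pqf.metadata
    print(meta.num_row_groups)
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        print(i, rg.num_rows, rg.total_byte_size)

    # Stream record batches; a single huge row group still has to be
    # decompressed in full before the first batch comes back.
    for batch in pqf.iter_batches(batch_size=64_000):
        print(batch.num_rows)

If the file was written as one giant row group, that would explain seeing the whole thing pulled into memory before the first batch.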
-David

On Fri, Dec 24, 2021, at 17:45, Partha Dutta wrote:
> I have a similar issue with trying to read huge 1GB Parquet files from Azure
> DataLake Storage. I'm trying to read the file in small chunks using the
> ParquetFile.iter_batches method, but it seems like the entire file is read
> into memory before the first batch is returned. I am using the Azure SDK for
> Python and another Python package (pyarrowfs-adlgen2). Has anyone faced a
> problem similar to what I am seeing, or is there a workaround?
>
> On Fri, Dec 24, 2021 at 2:11 PM Cindy McMullen <[email protected]> wrote:
>> Thanks, Arthur, this helps. The complete code example is:
>>
>> filename = 'gs://' + files[0]
>> gs = gcsfs.GCSFileSystem()
>> f = gs.open(filename)
>> pqf = pq.ParquetFile(f)
>> pqf.metadata
>>
>> On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]> wrote:
>>> Hi Cindy,
>>>
>>> In your case you'd have to pass a GCS file instance to the ParquetFile
>>> constructor. Something like this:
>>>
>>> source = fs.open_input_file(filename)
>>> parquet_file = pq.ParquetFile(source)
>>>
>>> You can see how read_table does this in the source code:
>>> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
>>>
>>> I hope this helps.
>>>
>>> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]> wrote:
>>>> Hi -
>>>>
>>>> I need to drop down to the ParquetFile API so I can have better control
>>>> over batch size for reading huge Parquet files. The filename is:
>>>>
>>>> gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>>>
>>>> This invocation fails:
>>>> pqf = pq.ParquetFile(filename)
>>>> "FileNotFoundError: [Errno 2] Failed to open local file
>>>> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>>>> Detail: [errno 2] No such file or directory"
>>>>
>>>> While this call, using the same filename, succeeds because I can specify
>>>> the 'gs' filesystem:
>>>> table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>>>
>>>> I don't see a way to specify 'filesystem' in the ParquetFile API
>>>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>>>> Is there any way to read a GCS file using ParquetFile?
>>>>
>>>> If not, can you show me the code for reading batches using pq.read_table
>>>> or one of the other Arrow Parquet APIs
>>>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>>>>
>>>> Thanks -
>>>>
>>>> -- Cindy
>
> --
> Partha Dutta
> [email protected]
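For Cindy's follow-up question about the other Arrow Parquet APIs, the dataset API is another way to stream record batches straight from GCS rather than materializing the whole table. A minimal sketch (untested; the path and batch_size are placeholders, and passing the gcsfs filesystem with a gs:// path assumes the same fsspec support that pq.read_table used above):

import gcsfs
import pyarrow.dataset as ds

gs = gcsfs.GCSFileSystem()

# Build a dataset over the object and scan it as record batches.
dataset = ds.dataset(
    'gs://my-bucket/my-file.parquet.snappy',  # placeholder path
    filesystem=gs,
    format='parquet',
)
for batch in dataset.to_batches(batch_size=64_000):
    print(batch.num_rows)

Note this still reads Parquet a row group at a time under the hood, so the row-group layout of the file matters here as well.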
