Found a good example on StackOverflow:
<https://stackoverflow.com/questions/68048816/how-can-i-process-a-large-parquet-file-from-spark-in-numpy-pandas>

batches = pq_file.iter_batches(batch_size, use_pandas_metadata=True)  # batches will be a generator
for batch in batches:
    df = batch.to_pandas()  # convert each RecordBatch to a pandas DataFrame
    process(df)
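
For the GCS case discussed below, a minimal end-to-end sketch (the bucket
path and process() are placeholders; this assumes gcsfs and pyarrow are
installed):

import gcsfs
import pyarrow.parquet as pq

gs = gcsfs.GCSFileSystem()
with gs.open('gs://some-bucket/some-file.parquet') as f:  # hypothetical path
    pq_file = pq.ParquetFile(f)
    for batch in pq_file.iter_batches(batch_size=10_000, use_pandas_metadata=True):
        process(batch.to_pandas())  # process() stands in for your own logic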


On Mon, Dec 27, 2021 at 3:39 PM Cindy McMullen <[email protected]>
wrote:

> Can you give an example of using the ParquetFile.iter_batches() API?  I
> can see that it returns a generator, but I'm not sure how to iterate over
> the results to get at the underlying row data.
>
> On Mon, Dec 27, 2021 at 11:33 AM David Li <[email protected]> wrote:
>
>> Ah, I'm sorry. I misremembered; I was recalling the implementation of
>> ReadOneRowGroup/ReadRowGroups, but iter_batches() boils down to
>> GetRecordBatchReader, which does read at a finer granularity.
>>
>> -David
>>
>> On Mon, Dec 27, 2021, at 11:50, Micah Kornfield wrote:
>>
>> Just a shot in the dark, but how many row groups are there in that 1 GB
>> file? IIRC, the reader loads an entire row group's worth of rows at once.
>>
>>
>> Can you clarify what you mean by "loads"? I thought it only loaded the
>> compressed data at once, and then read per page (I could be misremembering,
>> or thinking this was an aspirational goal).
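>>
>> (For checking this on a concrete file, a minimal sketch; pq_file is
>> assumed to be a pyarrow.parquet.ParquetFile instance:)
>>
>> md = pq_file.metadata
>> print(md.num_row_groups)                # number of row groups in the file
>> print(md.row_group(0).num_rows)         # rows in the first row group
>> print(md.row_group(0).total_byte_size)  # uncompressed size of that group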
>>
>> On Fri, Dec 24, 2021 at 4:01 PM Partha Dutta <[email protected]>
>> wrote:
>>
>> I see 5 row groups. This parquet file contains 1.8 million records, so
>> roughly 360,000 rows per row group.
>>
>> On Fri, Dec 24, 2021 at 4:51 PM David Li <[email protected]> wrote:
>>
>>
>> Just a shot in the dark, but how many row groups are there in that 1 GB
>> file? IIRC, the reader loads an entire row group's worth of rows at once.
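>>
>> (If the row groups are large, one way to bound memory is to materialize a
>> single row group at a time; a sketch, assuming pq_file is a ParquetFile:)
>>
>> for i in range(pq_file.metadata.num_row_groups):
>>     table = pq_file.read_row_group(i)  # reads just this row group
>>     process(table)                     # process() is a placeholder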
>>
>>
>> -David
>>
>> On Fri, Dec 24, 2021, at 17:45, Partha Dutta wrote:
>>
>> I have a similar issue trying to read huge (1 GB) Parquet files from
>> Azure Data Lake Storage. I'm trying to read the file in small chunks using
>> the ParquetFile.iter_batches method, but it seems like the entire file is
>> read into memory before the first batch is returned. I am using the Azure
>> SDK for Python and another Python package (pyarrowfs-adlgen2). Has anyone
>> faced a similar problem, or is there a workaround?
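>>
>> (One workaround worth trying, sketched assuming the installed pyarrow
>> supports the row_groups argument of iter_batches: limit each call to one
>> row group so only that group's data needs to be fetched:)
>>
>> for rg in range(pq_file.metadata.num_row_groups):
>>     for batch in pq_file.iter_batches(batch_size=10_000, row_groups=[rg]):
>>         process(batch.to_pandas())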
>>
>> On Fri, Dec 24, 2021 at 2:11 PM Cindy McMullen <[email protected]>
>> wrote:
>>
>> Thanks, Arthur, this helps.  The complete code example is:
>>
>> import gcsfs
>> import pyarrow.parquet as pq
>>
>> filename = 'gs://' + files[0]  # files[0] holds the object path
>> gs = gcsfs.GCSFileSystem()
>> f = gs.open(filename)
>> pqf = pq.ParquetFile(f)
>> pqf.metadata  # reads only the file footer, not the data
>>
>>
>> On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]>
>> wrote:
>>
>> Hi Cindy,
>>
>> In your case you'd have to pass a GCS file instance to the ParquetFile
>> constructor. Something like this:
>>
>> # fs here is a pyarrow FileSystem instance for GCS
>> source = fs.open_input_file(filename)
>> parquet_file = pq.ParquetFile(source)
>>
>> You can see how read_table does this in the source code:
>> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
>>
>> I hope this helps.
>>
>>
>>
>> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]>
>> wrote:
>>
>> Hi -
>>
>> I need to drop down to the ParquetFile API so I can have better control
>> over batch size for reading huge Parquet files.  The filename is:
>>
>>
gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>
>> This invocation fails:
>> pqf = pq.ParquetFile(filename)
>> "FileNotFoundError: [Errno 2] Failed to open local file
>> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>> Detail: [errno 2] No such file or directory"
>>
>> While this call, using the same filename, succeeds because I can specify
>> the 'gs' filesystem:
>> table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>
>> I don't see a way to specify 'filesystem' on the ParquetFile API
>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>> Is there any way to read a GCS file using ParquetFile?
>>
>> If not, can you show me the code for reading batches using pq.read_table
>> or one of the other Arrow Parquet APIs
>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
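>>
>> (For reference, one batched alternative is the dataset API; a sketch,
>> assuming the same gcsfs filesystem object gs:)
>>
>> import pyarrow.dataset as ds
>>
>> dataset = ds.dataset(filename, filesystem=gs, format='parquet')
>> for batch in dataset.to_batches():  # yields RecordBatch objects lazily
>>     df = batch.to_pandas()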
>>
>> Thanks -
>>
>> -- Cindy
>>
>>
>>
>> --
>> Partha Dutta
>> [email protected]
>>
>>
>> --
>> Partha Dutta
>> [email protected]
>>
>>
>>
