>
> Just a shot in the dark, but how many row groups are there in that 1 GB
> file? IIRC, the reader loads an entire row group's worth of rows at once.


Can you clarify what you mean by "loads"? I thought it only loaded the
compressed data at once and then read per page (I could be misremembering,
or thinking this was an aspirational goal).
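
For reference, a quick way to check how the file is laid out; a minimal sketch, assuming a local copy of the file at "data.parquet" (hypothetical path):

import pyarrow.parquet as pq

pqf = pq.ParquetFile("data.parquet")    # hypothetical local copy of the 1 GB file
md = pqf.metadata
print(md.num_row_groups)                # number of row groups the writer produced
print(md.row_group(0).num_rows)         # rows in the first row group
print(md.row_group(0).total_byte_size)  # uncompressed byte size of that row group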

On Fri, Dec 24, 2021 at 4:01 PM Partha Dutta <[email protected]> wrote:

> I see 5 row groups. This Parquet file contains 1.8 million records.
>
> On Fri, Dec 24, 2021 at 4:51 PM David Li <[email protected]> wrote:
>
>> Just a shot in the dark, but how many row groups are there in that 1 GB
>> file? IIRC, the reader loads an entire row group's worth of rows at once.
>>
>>
>> -David
>>
>> On Fri, Dec 24, 2021, at 17:45, Partha Dutta wrote:
>>
>> I have a similar issue trying to read huge 1 GB Parquet files from
>> Azure Data Lake Storage. I'm trying to read the file in small chunks using
>> the ParquetFile.iter_batches method, but it seems like the entire file is
>> read into memory before the first batch is returned. I am using the Azure
>> SDK for Python and another Python package (pyarrowfs-adlgen2). Has anyone
>> faced a similar problem, or is there a workaround?
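>>
>> (A minimal sketch of the pattern described above; the names are illustrative, and `f` stands for the file handle opened through the Azure filesystem layer:)
>>
>> import pyarrow.parquet as pq
>>
>> # f: file-like object opened via the Azure SDK / pyarrowfs-adlgen2 (illustrative)
>> pqf = pq.ParquetFile(f)
>> for batch in pqf.iter_batches(batch_size=10_000):
>>     ...  # memory has already spiked before this first batch is yielded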
>>
>> On Fri, Dec 24, 2021 at 2:11 PM Cindy McMullen <[email protected]>
>> wrote:
>>
>> Thanks, Arthur, this helps.  The complete code example is:
>>
>> import gcsfs
>> import pyarrow.parquet as pq
>>
>> filename = 'gs://' + files[0]   # build the gs:// URI from a previously listed object path
>> gs = gcsfs.GCSFileSystem()
>> f = gs.open(filename)           # open the GCS object as a file-like handle
>> pqf = pq.ParquetFile(f)
>> pqf.metadata
>>
>>
>> On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]>
>> wrote:
>>
>> Hi Cindy,
>>
>> In your case you'd have to pass a GCS file instance to the ParquetFile
>> constructor. Something like this:
>>
>> source = fs.open_input_file(filename)
>> parquet_file = pq.ParquetFile(source)
>>
>> You can see how read_table does this in the source code:
>> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
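>>
>> For GCS specifically, a minimal sketch of how such a `source` could be obtained (assuming gcsfs is installed; pyarrow's FSSpecHandler can wrap any fsspec filesystem, and the bucket/key path is hypothetical):
>>
>> import gcsfs
>> import pyarrow.parquet as pq
>> from pyarrow.fs import PyFileSystem, FSSpecHandler
>>
>> # wrap the fsspec GCS filesystem so pyarrow's open_input_file() is available
>> fs = PyFileSystem(FSSpecHandler(gcsfs.GCSFileSystem()))
>> source = fs.open_input_file("my-bucket/path/to/file.parquet")  # hypothetical path
>> parquet_file = pq.ParquetFile(source)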
>>
>> I hope this helps.
>>
>>
>>
>> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]>
>> wrote:
>>
>> Hi -
>>
>> I need to drop down to the ParquetFile API so I can have better control
>> over batch size for reading huge Parquet files.  The filename is:
>>
>>
>> gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>
>> This invocation fails:
>> pqf = pq.ParquetFile(filename)
>> "FileNotFoundError: [Errno 2] Failed to open local file
>> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>> Detail: [errno 2] No such file or directory"
>>
>> While this call, using the same filename, succeeds because I can specify
>> the 'gs' filesystem:
>> table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>
>> I don't see a way to specify 'filesystem' on the ParquetFile API
>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>> Is there any way to read a GCS file using ParquetFile?
>>
>> If not, can you show me the code for reading batches using pq.read_table
>> or one of the other Arrow Parquet APIs
>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>>
>> Thanks -
>>
>> -- Cindy
>>
>>
>>
>> --
>> Partha Dutta
>> [email protected]
>>
>>
>> --
> Partha Dutta
> [email protected]
>
