Ah, I'm sorry. I misremembered: I was recalling the implementation of ReadOneRowGroup/ReadRowGroups, but iter_batches() boils down to GetRecordBatchReader, which does read at a finer granularity.
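To make the distinction concrete, here's a rough sketch (the file path and batch size are placeholders, not from this thread): read_row_group() pulls a whole row group into memory as a Table, while iter_batches() streams smaller record batches.

    import pyarrow.parquet as pq

    pqf = pq.ParquetFile("example.parquet")  # placeholder path

    # Materializes an entire row group at once.
    rg0 = pqf.read_row_group(0)

    # Streams smaller RecordBatches instead of whole row groups.
    for batch in pqf.iter_batches(batch_size=64_000):
        print(batch.num_rows)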
-David

On Mon, Dec 27, 2021, at 11:50, Micah Kornfield wrote:
>> Just a shot in the dark, but how many row groups are there in that 1 GB
>> file? IIRC, the reader loads an entire row group's worth of rows at once.
>
> Can you clarify what you mean by "loads"? I thought it only loaded the
> compressed data at once, and then read per page (I could be misremembering or
> thinking this was an aspirational goal).
>
> On Fri, Dec 24, 2021 at 4:01 PM Partha Dutta <[email protected]> wrote:
>> I see 5 row groups. This parquet file contains 1.8 million records.
>>
>> On Fri, Dec 24, 2021 at 4:51 PM David Li <[email protected]> wrote:
>>> Just a shot in the dark, but how many row groups are there in that 1 GB
>>> file? IIRC, the reader loads an entire row group's worth of rows at once.
>>>
>>> -David
>>>
>>> On Fri, Dec 24, 2021, at 17:45, Partha Dutta wrote:
>>>> I have a similar issue with trying to read huge 1 GB Parquet files from
>>>> Azure Data Lake Storage. I'm trying to read the file in small chunks using
>>>> the ParquetFile.iter_batches method, but it seems like the entire file is
>>>> read into memory before the first batch is returned. I am using the Azure
>>>> SDK for Python and another Python package (pyarrowfs-adlgen2). Has anyone
>>>> faced a problem similar to what I am seeing, or is there a workaround?
>>>>
>>>> On Fri, Dec 24, 2021 at 2:11 PM Cindy McMullen <[email protected]> wrote:
>>>>> Thanks, Arthur, this helps. The complete code example is:
>>>>>
>>>>>     filename = 'gs://' + files[0]
>>>>>     gs = gcsfs.GCSFileSystem()
>>>>>     f = gs.open(filename)
>>>>>     pqf = pq.ParquetFile(f)
>>>>>     pqf.metadata
>>>>>
>>>>> On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]> wrote:
>>>>>> Hi Cindy,
>>>>>>
>>>>>> In your case you'd have to pass a GCS file instance to the ParquetFile
>>>>>> constructor. Something like this:
>>>>>>
>>>>>>     source = fs.open_input_file(filename)
>>>>>>     parquet_file = pq.ParquetFile(source)
>>>>>>
>>>>>> You can see how read_table does this in the source code:
>>>>>> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
>>>>>>
>>>>>> I hope this helps.
>>>>>>
>>>>>> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]> wrote:
>>>>>>> Hi -
>>>>>>>
>>>>>>> I need to drop down to the ParquetFile API so I can have better control
>>>>>>> over batch size for reading huge Parquet files. The filename is:
>>>>>>>
>>>>>>> gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>>>>>>
>>>>>>> This invocation fails:
>>>>>>>
>>>>>>>     pqf = pq.ParquetFile(filename)
>>>>>>>
>>>>>>> "FileNotFoundError: [Errno 2] Failed to open local file
>>>>>>> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>>>>>>> Detail: [errno 2] No such file or directory"
>>>>>>>
>>>>>>> While this API, using the same filename, succeeds because I can specify
>>>>>>> the 'gs' filesystem:
>>>>>>>
>>>>>>>     table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>>>>>>
>>>>>>> I don't see a way to specify 'filesystem' on the ParquetFile API
>>>>>>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>>>>>>> Is there any way to read a GCS file using ParquetFile?
>>>>>>>
>>>>>>> If not, can you show me the code for reading batches using
>>>>>>> pq.read_table or one of the other Arrow Parquet APIs
>>>>>>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>>>>>>>
>>>>>>> Thanks -
>>>>>>>
>>>>>>> -- Cindy
>>>>
>>>> --
>>>> Partha Dutta
>>>> [email protected]
>>>
>>
>> --
>> Partha Dutta
>> [email protected]
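P.S. Putting the two suggestions above together (open the file through a GCS filesystem object, then stream it with iter_batches()) looks roughly like this sketch; the bucket path, batch size, and process() handler are placeholders, not from the thread:

    import gcsfs
    import pyarrow.parquet as pq

    filename = "gs://some-bucket/some-file.parquet"  # placeholder path

    gs = gcsfs.GCSFileSystem()
    with gs.open(filename, "rb") as f:
        pqf = pq.ParquetFile(f)
        # Iterate over record batches instead of materializing the whole table.
        for batch in pqf.iter_batches(batch_size=64_000):
            process(batch)  # placeholder handler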
