Ah, I'm sorry. I misremembered: I was recalling the implementation of ReadOneRowGroup/ReadRowGroups, but iter_batches() boils down to GetRecordBatchReader, which does read at a finer granularity.
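To make the distinction concrete, here's a rough sketch (the file path and batch size are placeholders, not from this thread): read_row_group() pulls a whole row group into memory as a Table, while iter_batches() streams smaller record batches.

    import pyarrow.parquet as pq

    pqf = pq.ParquetFile("example.parquet")  # placeholder path

    # Materializes an entire row group at once.
    rg0 = pqf.read_row_group(0)

    # Streams smaller RecordBatches instead of whole row groups.
    for batch in pqf.iter_batches(batch_size=64_000):
        print(batch.num_rows)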
-David

On Mon, Dec 27, 2021, at 11:50, Micah Kornfield wrote:
>> Just a shot in the dark, but how many row groups are there in that 1 GB
>> file? IIRC, the reader loads an entire row group's worth of rows at once.
>
> Can you clarify what you mean by "loads"? I thought it only loaded the
> compressed data at once, and then read per page (I could be misremembering or
> thinking this was an aspirational goal).
>
> On Fri, Dec 24, 2021 at 4:01 PM Partha Dutta <[email protected]> wrote:
>> I see 5 row groups. This parquet file contains 1.8 million records.
>>
>> On Fri, Dec 24, 2021 at 4:51 PM David Li <[email protected]> wrote:
>>> Just a shot in the dark, but how many row groups are there in that 1 GB
>>> file? IIRC, the reader loads an entire row group's worth of rows at once.
>>>
>>> -David
>>>
>>> On Fri, Dec 24, 2021, at 17:45, Partha Dutta wrote:
>>>> I have a similar issue with trying to read huge 1 GB Parquet files from
>>>> Azure Data Lake Storage. I'm trying to read the file in small chunks using
>>>> the ParquetFile.iter_batches method, but it seems like the entire file is
>>>> read into memory before the first batch is returned. I am using the Azure
>>>> SDK for Python and another Python package (pyarrowfs-adlgen2). Has anyone
>>>> faced a problem similar to what I am seeing, or is there a workaround?
>>>>
>>>> On Fri, Dec 24, 2021 at 2:11 PM Cindy McMullen <[email protected]> wrote:
>>>>> Thanks, Arthur, this helps. The complete code example is:
>>>>>
>>>>>     filename = 'gs://' + files[0]
>>>>>     gs = gcsfs.GCSFileSystem()
>>>>>     f = gs.open(filename)
>>>>>     pqf = pq.ParquetFile(f)
>>>>>     pqf.metadata
>>>>>
>>>>> On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]> wrote:
>>>>>> Hi Cindy,
>>>>>>
>>>>>> In your case you'd have to pass a GCS file instance to the ParquetFile
>>>>>> constructor. Something like this:
>>>>>>
>>>>>>     source = fs.open_input_file(filename)
>>>>>>     parquet_file = pq.ParquetFile(source)
>>>>>>
>>>>>> You can see how read_table does this in the source code:
>>>>>> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
>>>>>>
>>>>>> I hope this helps.
>>>>>>
>>>>>> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]> wrote:
>>>>>>> Hi -
>>>>>>>
>>>>>>> I need to drop down to the ParquetFile API so I can have better control
>>>>>>> over batch size for reading huge Parquet files. The filename is:
>>>>>>>
>>>>>>> gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>>>>>>
>>>>>>> This invocation fails:
>>>>>>>
>>>>>>>     pqf = pq.ParquetFile(filename)
>>>>>>>
>>>>>>> "FileNotFoundError: [Errno 2] Failed to open local file
>>>>>>> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>>>>>>> Detail: [errno 2] No such file or directory"
>>>>>>>
>>>>>>> While this API, using the same filename, succeeds because I can specify
>>>>>>> the 'gs' filesystem:
>>>>>>>
>>>>>>>     table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>>>>>>
>>>>>>> I don't see a way to specify 'filesystem' on the ParquetFile API
>>>>>>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>>>>>>> Is there any way to read a GCS file using ParquetFile?
>>>>>>>
>>>>>>> If not, can you show me the code for reading batches using
>>>>>>> pq.read_table or one of the other Arrow Parquet APIs
>>>>>>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>>>>>>>
>>>>>>> Thanks -
>>>>>>>
>>>>>>> -- Cindy
>>>>
>>>> --
>>>> Partha Dutta
>>>> [email protected]
>>>
>>
>> --
>> Partha Dutta
>> [email protected]
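P.S. Putting the two suggestions above together (open the file through a GCS filesystem object, then stream it with iter_batches()) looks roughly like this sketch; the bucket path, batch size, and process() handler are placeholders, not from the thread:

    import gcsfs
    import pyarrow.parquet as pq

    filename = "gs://some-bucket/some-file.parquet"  # placeholder path

    gs = gcsfs.GCSFileSystem()
    with gs.open(filename, "rb") as f:
        pqf = pq.ParquetFile(f)
        # Iterate over record batches instead of materializing the whole table.
        for batch in pqf.iter_batches(batch_size=64_000):
            process(batch)  # placeholder handler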
