Just a shot in the dark, but how many row groups are there in that 1 GB file? IIRC, the reader loads an entire row group's worth of rows at once.
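Something along these lines (an untested sketch; the bucket path is a placeholder, and the gcsfs handle mirrors Cindy's snippet quoted below) would show the row-group layout and then stream batches, keeping in mind that iter_batches still decodes a whole row group before yielding its first batch:

import gcsfs
import pyarrow.parquet as pq

gs = gcsfs.GCSFileSystem()
with gs.open('gs://my-bucket/my-file.parquet.snappy') as f:  # placeholder path
    pqf = pq.ParquetFile(f)

    # Inspect the row-group layout: count, rows per group, and on-disk size.
    meta = pqf.metadata
    print(meta.num_row_groups)
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        print(i, rg.num_rows, rg.total_byte_size)

    # Stream record batches; a single huge row group still has to be
    # decompressed in full before the first batch comes back.
    for batch in pqf.iter_batches(batch_size=64_000):
        print(batch.num_rows)

If the file was written as one giant row group, that would explain seeing the whole thing pulled into memory before the first batch.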
-David

On Fri, Dec 24, 2021, at 17:45, Partha Dutta wrote:
> I have a similar issue with trying to read huge 1GB Parquet files from Azure
> DataLake Storage. I'm trying to read the file in small chunks using the
> ParquetFile.iter_batches method, but it seems like the entire file is read
> into memory before the first batch is returned. I am using the Azure SDK for
> Python and another Python package (pyarrowfs-adlgen2). Has anyone faced a
> problem similar to what I am seeing, or is there a workaround?
>
> On Fri, Dec 24, 2021 at 2:11 PM Cindy McMullen <[email protected]> wrote:
>> Thanks, Arthur, this helps. The complete code example is:
>>
>> filename = 'gs://' + files[0]
>> gs = gcsfs.GCSFileSystem()
>> f = gs.open(filename)
>> pqf = pq.ParquetFile(f)
>> pqf.metadata
>>
>> On Thu, Dec 23, 2021 at 1:48 AM Arthur Andres <[email protected]> wrote:
>>> Hi Cindy,
>>>
>>> In your case you'd have to pass a GCS file instance to the ParquetFile
>>> constructor. Something like this:
>>>
>>> source = fs.open_input_file(filename)
>>> parquet_file = pq.ParquetFile(source)
>>>
>>> You can see how read_table does this in the source code:
>>> https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977
>>>
>>> I hope this helps.
>>>
>>> On Thu, 23 Dec 2021 at 05:17, Cindy McMullen <[email protected]> wrote:
>>>> Hi -
>>>>
>>>> I need to drop down to the ParquetFile API so I can have better control
>>>> over batch size for reading huge Parquet files. The filename is:
>>>>
>>>> gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
>>>>
>>>> This invocation fails:
>>>> pqf = pq.ParquetFile(filename)
>>>> "FileNotFoundError: [Errno 2] Failed to open local file
>>>> 'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
>>>> Detail: [errno 2] No such file or directory"
>>>>
>>>> While this call, using the same filename, succeeds because I can specify
>>>> the 'gs' filesystem:
>>>> table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
>>>>
>>>> I don't see a way to specify 'filesystem' in the ParquetFile API
>>>> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
>>>> Is there any way to read a GCS file using ParquetFile?
>>>>
>>>> If not, can you show me the code for reading batches using pq.read_table
>>>> or one of the other Arrow Parquet APIs
>>>> <https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
>>>>
>>>> Thanks -
>>>>
>>>> -- Cindy
>
> --
> Partha Dutta
> [email protected]
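For Cindy's follow-up question about the other Arrow Parquet APIs, the dataset API is another way to stream record batches straight from GCS rather than materializing the whole table. A minimal sketch (untested; the path and batch_size are placeholders, and passing the gcsfs filesystem with a gs:// path assumes the same fsspec support that pq.read_table used above):

import gcsfs
import pyarrow.dataset as ds

gs = gcsfs.GCSFileSystem()

# Build a dataset over the object and scan it as record batches.
dataset = ds.dataset(
    'gs://my-bucket/my-file.parquet.snappy',  # placeholder path
    filesystem=gs,
    format='parquet',
)
for batch in dataset.to_batches(batch_size=64_000):
    print(batch.num_rows)

Note this still reads Parquet a row group at a time under the hood, so the row-group layout of the file matters here as well.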
