It's not in the footer metadata, but each record batch should have its own metadata, and that batch metadata should contain the number of rows. So you should be able to get the count without reading any data. In pyarrow, this *should* be what count_rows is doing, but it has been a while since I've really dug into that code and I may be remembering incorrectly.
Can you use a MessageReader [1]? I have not used it myself. I don't actually know if it will read the buffer data as well or just the metadata.

[1] https://arrow.apache.org/docs/python/generated/pyarrow.ipc.MessageReader.html#pyarrow.ipc.MessageReader

On Thu, Oct 20, 2022 at 9:14 AM Quentin Lhoest <[email protected]> wrote:
>
> Hi everyone ! I was wondering:
> What is the most efficient way to know the number of rows in dataset of Arrow
> IPC files ?
>
> I expected each file to have the number of rows as metadata in the footer,
> but it doesn’t seem to be the case. Therefore I need to call count_rows()
> which is less efficient than reading metadata.
>
> Maybe the number of row can be written as custom_metadata in the footer, but
> the writing/reading custom_metadata functions don’t seem to be exposed in
> python - if I’m not mistaken.
>
> Thanks in advance :)
>
> --
> Quentin
