I think there is an open Jira from several years ago about adding an
optional number-of-rows field to the file footer so that it can be
precomputed and stored rather than requiring the application to look
at all the batch metadata to compute it when needed. This seems like a
harmless addition that should not cause any backward/forward
compatibility issues.

On Thu, Oct 20, 2022 at 11:30 PM Weston Pace <[email protected]> wrote:
>
> It's not in the footer metadata, but each record batch should have its
> own metadata, and the batch's metadata should contain the # of rows.  So
> you should be able to do it without reading any data.  In pyarrow,
> this *should* be what count_rows is doing, but it has been a while
> since I've really dug into that code and I may be remembering
> incorrectly.
>
> Can you use a MessageReader[1]?  I have not used it myself.  I don't
> actually know if it will read the buffer data as well or just the
> metadata.
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.ipc.MessageReader.html#pyarrow.ipc.MessageReader
>
> On Thu, Oct 20, 2022 at 9:14 AM Quentin Lhoest <[email protected]> wrote:
> >
> > Hi everyone! I was wondering:
> > What is the most efficient way to know the number of rows in a dataset of
> > Arrow IPC files?
> >
> > I expected each file to have the number of rows as metadata in the footer, 
> > but it doesn’t seem to be the case. Therefore I need to call count_rows() 
> > which is less efficient than reading metadata.
> >
> > Maybe the number of rows can be written as custom_metadata in the footer,
> > but the functions for writing/reading custom_metadata don’t seem to be
> > exposed in Python, if I’m not mistaken.
> >
> > Thanks in advance :)
> >
> > --
> > Quentin
