It's not in the footer metadata, but each record batch should have its
own metadata, and the batch's metadata should contain the number of rows.
So you should be able to do it without reading any data.  In pyarrow,
this *should* be what count_rows is doing, but it has been a while
since I've really dug into that code and I may be remembering
incorrectly.

Can you use a MessageReader[1]?  I have not used it myself, and I don't
actually know whether it reads the buffer data as well or just the
metadata.

[1] https://arrow.apache.org/docs/python/generated/pyarrow.ipc.MessageReader.html#pyarrow.ipc.MessageReader

On Thu, Oct 20, 2022 at 9:14 AM Quentin Lhoest <[email protected]> wrote:
>
> Hi everyone ! I was wondering:
> What is the most efficient way to know the number of rows in a dataset of
> Arrow IPC files?
>
> I expected each file to have the number of rows as metadata in the footer, 
> but it doesn’t seem to be the case. Therefore I need to call count_rows() 
> which is less efficient than reading metadata.
>
> Maybe the number of rows can be written as custom_metadata in the footer,
> but the writing/reading custom_metadata functions don’t seem to be exposed
> in Python - if I’m not mistaken.
>
> Thanks in advance :)
>
> --
> Quentin
