I am wondering if the validation step (where Arrow checks for corrupted data) is causing the full disk IO here. I am almost positive it's turned on by default (to prevent a crash when consuming untrusted input), but I *think* there is an option to turn it off if you are processing trusted input. I don't know if that's the root cause of your problem here, but it's worth a try!
Cheers,
-dewey

On Sun, Jan 26, 2025 at 12:49 PM Sharvil Nanavati <[email protected]> wrote:

> In a different context, fetching batches one-by-one would be a good way to
> control when the disk read takes place.
>
> In my context, I'm looking for a way to construct a Table without
> performing the bulk of the IO operations until the memory is accessed. I
> need random access to the table and my accesses are often sparse relative
> to the size of the entire table. Obviously there has to be *some* IO to
> read the schema and offsets, but that's tiny relative to the data itself.
>
> Is there any way to get a Table instance without triggering large data
> reads of the Arrow file?
>
> -s
> *Builder @ LMNT*
> Web <https://www.lmnt.com> | LinkedIn <https://www.linkedin.com/in/sharvil-nanavati/>
>
> On Wed, Jan 22, 2025 at 5:56 AM Felipe Oliveira Carvalho <[email protected]> wrote:
>
>> I don't have very specific advice, but mmap() and programmer control
>> don't come together. The point of mmap is deferring all the logic to the
>> OS and trusting that it knows better.
>>
>> If you're calling read_all(), it will do what the name says: read all
>> the batches. Have you tried looping and getting batches one by one as
>> you process them?
>>
>> --
>> Felipe
>>
>> On Tue, Jan 21, 2025 at 1:45 PM Sharvil Nanavati <[email protected]> wrote:
>>
>>> I'm loading a large number of large Arrow IPC streams/files from disk
>>> with mmap. I'd like to demand-load the contents instead of prefetching
>>> them, or at least have better control over disk IO.
>>>
>>> Calling `read_all` on a stream triggers a complete read of the file
>>> (`MADV_WILLNEED` over the entire byte range of the file), whereas
>>> `read_all` on a file seems to trigger a complete read through page
>>> faults. I'm not fully confident in the latter behavior.
>>>
>>> Is there a way I can disable prefetching in the stream case or
>>> configure Arrow to demand-load Tables? I'd like to get a reference to a
>>> Table without triggering disk reads except for the schema + magic bytes
>>> + metadata.
>>>
>>> -s
>>> *Builder @ LMNT*
>>> Web <https://www.lmnt.com> | LinkedIn <https://www.linkedin.com/in/sharvil-nanavati/>
