I am wondering if the validation step (where Arrow checks for corrupted
data) is causing the full disk IO here. I am almost positive it's turned on
by default (to prevent a crash when consuming untrusted input), but I
*think* there is an option to turn it off if you are processing trusted
input. I don't know if that's the root cause of your problem here, but it's
worth a try!

Cheers,

-dewey

On Sun, Jan 26, 2025 at 12:49 PM Sharvil Nanavati <[email protected]> wrote:

> In a different context, fetching batches one by one would be a good way to
> control when the disk read takes place.
>
> In my context, I'm looking for a way to construct a Table without
> performing the bulk of the IO operations until the memory is accessed. I
> need random access to the table and my accesses are often sparse relative
> to the size of the entire table. Obviously there has to be *some* IO to
> read the schema and offsets, but that's tiny relative to the data itself.
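>
> Batch-level access gives roughly the kind of control I mean, something like
> the sketch below (the file name and batch index are placeholders), but I'd
> like that same laziness behind a Table interface:
>
>     import pyarrow as pa
>     import pyarrow.ipc as ipc
>
>     source = pa.memory_map("data.arrow", "r")  # placeholder path
>     reader = ipc.open_file(source)             # ideally only footer + schema read here
>     batch = reader.get_batch(7)                # IO only for the batch a lookup lands on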
>
> Is there any way to get a Table instance without triggering large data
> reads of the Arrow file?
>
> -s
> *Builder @ LMNT*
> Web <https://www.lmnt.com> | LinkedIn
> <https://www.linkedin.com/in/sharvil-nanavati/>
>
>
>
> On Wed, Jan 22, 2025 at 5:56 AM Felipe Oliveira Carvalho <
> [email protected]> wrote:
>
>> I don't have very specific advice, but mmap() and fine-grained programmer
>> control don't really go together. The point of mmap is to defer all the
>> paging decisions to the OS and trust that it knows better.
>>
>> If you're calling read_all(), it will do what the name says: read all the
>> batches. Have you tried looping and getting batches one by one as you
>> process them?
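>>
>> Something along these lines (an untested sketch; the path is a placeholder
>> and the row count is just a stand-in for your per-batch work):
>>
>>     import pyarrow as pa
>>     import pyarrow.ipc as ipc
>>
>>     total_rows = 0
>>     with pa.memory_map("data.arrows", "r") as source:
>>         reader = ipc.open_stream(source)
>>         for batch in reader:              # stream readers yield one batch at a time
>>             total_rows += batch.num_rows  # process the batch here, then let it go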
>>
>> --
>> Felipe
>>
>>
>> On Tue, Jan 21, 2025 at 1:45 PM Sharvil Nanavati <[email protected]>
>> wrote:
>>
>>> I'm loading a large number of large Arrow IPC streams/files from disk
>>> with mmap. I'd like to demand-load the contents instead of prefetching them
>>> – or at least have better control over disk IO.
>>>
>>> Calling `read_all` on a stream reader triggers a complete read of the file
>>> (it issues `MADV_WILLNEED` over the entire byte range), whereas `read_all`
>>> on a file reader seems to trigger a complete read through page faults. I'm
>>> not fully confident in the latter behavior.
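>>>
>>> For reference, the two patterns I'm comparing look roughly like this (file
>>> names are placeholders):
>>>
>>>     import pyarrow as pa
>>>     import pyarrow.ipc as ipc
>>>
>>>     # stream format: read_all appears to prefetch the entire byte range
>>>     stream_table = ipc.open_stream(pa.memory_map("data.arrows", "r")).read_all()
>>>
>>>     # file format: read_all also seems to touch everything, via page faults
>>>     file_table = ipc.open_file(pa.memory_map("data.arrow", "r")).read_all()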
>>>
>>> Is there a way I can disable prefetching in the stream case or configure
>>> Arrow to demand-load Tables? I'd like to get a reference to a Table without
>>> triggering disk reads except for the schema + magic bytes + metadata.
>>>
>>> -s
>>> *Builder @ LMNT*
>>> Web <https://www.lmnt.com> | LinkedIn
>>> <https://www.linkedin.com/in/sharvil-nanavati/>
>>>
>>>
