Re: Demand-loading Arrow files

Sharvil Nanavati Tue, 28 Jan 2025 18:39:22 -0800

Thanks for the discussion, folks. I think the key takeaway for me is that
my access pattern / use case isn't directly supported by Arrow today, but
there's no technical reason it can't be.


Would there be any opposition to me expanding the API surface to support a
zero-data-read-by-default implementation?

On Tue, Jan 28, 2025, 18:04 Aldrin <[email protected]> wrote:

> I see, I was incorrectly conflating the pointer math and when a page fault
> is actually generated. Thanks for clarifying!
>
> Without knowing Sharvil's actual interactions with the Table, I'm still
> not convinced a table method wouldn't trigger the scan anyways, but I
> suppose that's more of a pessimistic perspective and not necessarily the
> case.
>
>
I need the number of rows, schema, and sparse random access to rows.
Concatenating two or more datasets with the same schema, all backed by mmap
regions is also useful. None of these operations should require a table
scan.
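
To make that access pattern concrete, here is a rough sketch using only
Python's standard library, not Arrow. Everything in it is hypothetical: the
`MappedChunks` class, the one-int64-per-row layout, and deriving the row
count from the mapping length (Arrow would take it from file metadata, and a
schema is not modeled at all). It only illustrates that row count,
concatenation, and sparse row lookup need no scan of the data.

```python
import bisect
import mmap
import struct

ROW = struct.Struct("<q")  # hypothetical fixed-width row: one little-endian int64


class MappedChunks:
    """Hypothetical stand-in for a table built from several mmap-backed chunks."""

    def __init__(self, paths):
        self._maps = []
        self._starts = []  # global index of the first row in each chunk
        self._num_rows = 0
        for path in paths:
            with open(path, "rb") as f:
                mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            self._maps.append(mm)
            self._starts.append(self._num_rows)
            # The row count comes from the mapping's length, not from its contents.
            self._num_rows += len(mm) // ROW.size

    @property
    def num_rows(self):
        return self._num_rows

    def row(self, i):
        if not 0 <= i < self._num_rows:
            raise IndexError(i)
        # Locate the chunk by binary search over the chunk start offsets.
        chunk = bisect.bisect_right(self._starts, i) - 1
        offset = (i - self._starts[chunk]) * ROW.size
        # Only the bytes of this one row are read from the mapping.
        return ROW.unpack_from(self._maps[chunk], offset)[0]
```

Passing several paths to the constructor plays the role of concatenating
datasets that share a schema; each lookup touches one row of one mapping.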


> On Tue, Jan 28, 2025 at 17:42, Weston Pace <[email protected]> wrote:
>
> > Sharvil wants random access to only a few RecordBatches via Table
> > methods, but I don't think that's possible with the Arrow library
>
> The idea (and I believe things worked this way at one point) was that you
> could memory map a file, read in a bunch of record batches (even an entire
> table if you want), and you would just have a collection of pointers into
> the memory mapped file without ever actually loading any of the data into
> memory.
>
>
That's exactly what I'm looking for and assumed was the default behavior in
an mmap world. I was surprised to find it wasn't, and the assumption of
dense sequential access is baked in.
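
As a minimal illustration of the behavior I expected, here is a sketch with
Python's standard `mmap` module rather than Arrow itself. The helper names
`map_file` and `slice_region` and the `will_need` flag are made up for this
example; they are not Arrow APIs. Prefetching is opt-in, and slicing is only
offset arithmetic on the mapping.

```python
import mmap


def map_file(path, will_need=False):
    """Map `path` read-only; optionally hint the OS to prefetch the mapping."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The prefetch hint is opt-in here and only exists on some platforms.
    if will_need and hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)
    return mm


def slice_region(mm, position, nbytes):
    """Return a zero-copy view of mm[position:position + nbytes]."""
    # Slicing a memoryview records an offset and a length; no bytes are copied.
    return memoryview(mm)[position:position + nbytes]
```

With `will_need=False`, nothing in these helpers asks the OS to read the
file's contents ahead of time; the bytes are read only when a view is
actually consumed, for example via `bytes(view)`.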


> Then, when the data is needed (e.g. when a user calls
> `table.column(0).chunk(0).value(0)`), the pointers would be dereferenced
> and, through the magic of memory mapping, the data would be loaded on
> demand.  This loading on demand tends to be inefficient and _not_ what most
> IPC users are looking for (they just want to read an IPC file and expect
> they will be accessing the entire file) so I understand why the
> MADV_WILLNEED is there.  However, for users that do want this, I'm not sure
> if there is any way to achieve the load on demand semantics.
>
> > The code you linked specifies a memory region and the following nbytes:
>
> Yes, the actual implementation does call ReadAt with a memory region and
> nbytes.  These are then used to create a slice into the underlying memory
> mapped area.  If MADV_WILLNEED was _not_ called then this would be a
> zero-copy / zero-load operation that doesn't actually load anything from
> the disk (it's just doing pointer math).
>
>
> On Tue, Jan 28, 2025 at 2:56 PM Aldrin < [email protected]> wrote:
>
>> > ...and that function triggers the MADV_WILLNEED
>>
>> The code you linked specifies a memory region and the following nbytes:
>> ```
>> RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
>>     {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
>> return memory_map_->Slice(position, nbytes);
>> ```
>>
>> The original question said "Calling `read_all` on a stream triggers a
>> complete read of the file". So, my impression is that either read_all
>> (I'm assuming via python) is purposely specifying the whole file, or
>> eventually (through multiple calls) specifying the whole file. I am curious
>> how large the file itself is, though I assume it's larger than whatever
>> size nbytes is defaulted to.
>>
>> But, I also can't find which implementation of MemoryMap::Slice [1] is
>> resolved by memory_map_->Slice(position, nbytes), which I don't think
>> is likely to be problematic but I can't totally rule out either.
>>
>> Either way, if I understand correctly, Sharvil wants random access to
>> only a few RecordBatches via Table methods, but I don't think that's
>> possible with the Arrow library; the only ways are to manage accesses at
>> the RecordBatch level, or maybe using the Dataset or Acero APIs. Or am I
>> forgetting something... or maybe I'm misunderstanding why Sharvil wants to
>> specifically construct a Table rather than RecordBatches?
>>
>>
>> [1]:
>> https://github.com/apache/arrow/blob/apache-arrow-19.0.0/cpp/src/arrow/io/file.h#L216-L217
>>
>>
>>
>> # ------------------------------
>> # Aldrin
>>
>> https://github.com/drin/
>> https://gitlab.com/octalene
>> https://keybase.io/octalene
>>
>> On Tuesday, January 28th, 2025 at 13:45, Weston Pace <
>> [email protected]> wrote:
>>
>> I believe the concern is that reading a record batch from a
>> RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the
>> OS before any data is accessed (and regardless of whether or not that data
>> is ever accessed).
>>
>> I'm pretty sure the `RecordBatchStreamReader` uses
>> `MemoryMappedFile::ReadAt` and that function triggers the MADV_WILLNEED[1].
>> This is contrary to the user expectation that only the data actually
>> accessed would be loaded into memory.
>>
>> [1]
>> https://github.com/apache/arrow/blob/ca2f4d68e834e600852d5af36dc2190741e33118/cpp/src/arrow/io/file.cc#L677
>>
>> On Tue, Jan 28, 2025 at 7:15 AM Aldrin < [email protected]> wrote:
>>
>>> > Then you should just use a memory-mapped file.
>>>
>>> Unless I'm misunderstanding their original message, I believe they are
>>> using a memory-mapped file. I'm not sure if other suggestions helped
>>> address the issue, but my understanding was that they were somehow
>>> triggering reads against the whole file anyways.
>>>
>>>
>>> I'm not sure why a Table is necessary (presumably some useful method in
>>> the API?) if accesses are sparse relative to the entire table; that sounds
>>> more aligned to RecordBatch access. I would think that any use of a Table
>>> method is going to trigger reads to every batch. I would also think that
>>> this scenario has 2 opportunities to do processing without triggering a
>>> scan of the whole file:
>>> 1. when a RecordBatch is read into memory
>>> 2. on the RecordBatches accumulated so far (a Table instance can be
>>> constructed from them without copies, I am pretty sure)
>>>
>>> I have little experience with mmap, so I don't have any particular
>>> thoughts there. Some extra information about how random access into the
>>> table occurs would be helpful, though.
>>>
>>>
>>>
>>> On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou < [email protected]> wrote:
>>>
>>> On Sun, 26 Jan 2025 10:48:48 -0800
>>> Sharvil Nanavati < [email protected]> wrote:
>>> > In a different context, fetching batches one-by-one would be a good
>>> > way to control when the disk read takes place.
>>> >
>>> > In my context, I'm looking for a way to construct a Table without
>>> > performing the bulk of the IO operations until the memory is accessed.
>>> > I need random access to the table and my accesses are often sparse
>>> > relative to the size of the entire table. Obviously there has to be
>>> > *some* IO to read the schema and offsets, but that's tiny relative to
>>> > the data itself.
>>>
>>> Then you should just use a memory-mapped file.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>
>>
