Re: Demand-loading Arrow files

Aldrin Tue, 28 Jan 2025 18:02:44 -0800

I see, I was incorrectly conflating the pointer math with the moment a page 
fault is actually generated. Thanks for clarifying!
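
For anyone skimming the thread, here is a minimal sketch of that distinction. It uses Python's standard-library `mmap` rather than Arrow, so the scratch file, the function name `slice_then_touch`, and the offsets are all made up for illustration; it is not how Arrow implements `ReadAt`. Slicing the mapping only records an offset and a length, and a page fault can only happen once a byte is actually read. The optional `madvise` call stands in for the MADV_WILLNEED hint, which asks the OS to start loading the range ahead of any access.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def slice_then_touch(advise=False, num_pages=64):
    """Map a scratch file, slice it, then read one byte from the slice.

    Hypothetical helper for illustration only; the scratch file stands in
    for an Arrow IPC file.
    """
    # Fill the file with a repeating 0..255 pattern so byte values are predictable.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(range(256)) * (num_pages * PAGE // 256))
        path = f.name
    try:
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = memoryview(mm)

            # Pointer math only: the slice records an offset and a length.
            # No byte of the mapped range is read here.
            region = view[10 * PAGE:12 * PAGE]

            if advise and hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                # Hint to the OS that this range will be needed soon; it may
                # start reading the pages before anything dereferences them.
                mm.madvise(mmap.MADV_WILLNEED, 10 * PAGE, 2 * PAGE)

            # Dereference: the first point at which a page fault can occur
            # (if the page is not already resident).
            first = region[5]
            length = len(region)

            # Release the views so the mapping can be closed cleanly.
            region.release()
            view.release()
            return first, length
    finally:
        os.remove(path)
```

Calling `slice_then_touch()` and `slice_then_touch(advise=True)` returns the same values; the only thing the hint can change is when the OS chooses to read the pages from disk.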


Without knowing Sharvil's actual interactions with the Table, I'm still not 
convinced a table method wouldn't trigger the scan anyway, but I suppose 
that's a pessimistic view and not necessarily the case.


On Tue, Jan 28, 2025 at 17:42, Weston Pace <[email protected]> wrote:

> > Sharvil wants random access to only a few RecordBatches via Table methods, 
> > but I don't think that's possible with the Arrow library
> 

> The idea (and I believe things worked this way at one point) was that you 
> could memory map a file, read in a bunch of record batches (even an entire 
> table if you want), and you would just have a collection of pointers into the 
> memory mapped file without ever actually loading any of the data into memory.
> 

> Then, when the data is needed (e.g. when a user calls 
> `table.column(0).chunk(0).value(0)`), the pointers would be dereferenced 
> and, through the magic of memory mapping, the data would be loaded on demand. 
> This loading on demand tends to be inefficient and _not_ what most IPC users 
> are looking for (they just want to read an IPC file and expect they will be 
> accessing the entire file), so I understand why the MADV_WILLNEED is there. 
> However, for users that do want this, I'm not sure if there is any way to 
> achieve the load-on-demand semantics.
> 

> > The code you linked specifies a memory region and the following `nbytes`:
> 

> 

> Yes, the actual implementation does call ReadAt with a memory region and 
> nbytes.  These are then used to create a slice into the underlying memory 
> mapped area.  If MADV_WILLNEED was _not_ called then this would be a 
> zero-copy / zero-load operation that doesn't actually load anything from the 
> disk (it's just doing pointer math).
> 

> 

> On Tue, Jan 28, 2025 at 2:56 PM Aldrin < [email protected]> wrote:
> 

> > > ...and that function triggers the MADV_WILLNEED
> > 

> > 

> > The code you linked specifies a memory region and the following `nbytes`:
> > ```
> >   RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
> >       {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
> >   return memory_map_->Slice(position, nbytes);
> > ```
> > 

> > 

> > The original question said "Calling `read_all` on a stream triggers a 
> > complete read of the file". So, my impression is that either `read_all` 
> > (I'm assuming via python) is purposely specifying the whole file, or 
> > eventually (through multiple calls) specifying the whole file. I am curious 
> > how large the file itself is, though I assume it's larger than whatever 
> > size `nbytes` is defaulted to.
> > 

> > But, I also can't find which implementation of `MemoryMap::Slice` [1] is 
> > resolved by `memory_map_->Slice(position, nbytes)`. I don't think it is 
> > likely to be the problem, but I can't totally rule it out either.
> > 

> > Either way, if I understand correctly, Sharvil wants random access to only 
> > a few RecordBatches via Table methods, but I don't think that's possible 
> > with the Arrow library; the only ways are to manage accesses at the 
> > RecordBatch level, or maybe using the Dataset or Acero APIs. Or am I 
> > forgetting something... or maybe I'm misunderstanding why Sharvil wants to 
> > specifically construct a Table rather than RecordBatches?
> > 

> > 

> > [1]: 
> > https://github.com/apache/arrow/blob/apache-arrow-19.0.0/cpp/src/arrow/io/file.h#L216-L217
> > 

> > # ------------------------------
> > # Aldrin
> > 
> > https://github.com/drin/
> > https://gitlab.com/octalene
> > https://keybase.io/octalene
> > 

> > On Tuesday, January 28th, 2025 at 13:45, Weston Pace < 
> > [email protected]> wrote:
> > 

> > > I believe the concern is that reading a record batch from a 
> > > RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to 
> > > the OS before any data is accessed (and regardless of whether or not that 
> > > data is ever accessed).
> > > 

> > > I'm pretty sure the `RecordBatchStreamReader` uses 
> > > `MemoryMappedFile::ReadAt` and that function triggers the 
> > > MADV_WILLNEED[1]. This is contrary to the user expectation that only the 
> > > data actually accessed would be loaded into memory.
> > > 

> > > [1] 
> > > https://github.com/apache/arrow/blob/ca2f4d68e834e600852d5af36dc2190741e33118/cpp/src/arrow/io/file.cc#L677
> > > 

> > > On Tue, Jan 28, 2025 at 7:15 AM Aldrin < [email protected]> wrote:
> > > 

> > > > > Then you should just use a memory-mapped file.
> > > > 

> > > > Unless I'm misunderstanding their original message, I believe they are 
> > > > using a memory-mapped file. I'm not sure if other suggestions helped 
> > > > address the issue, but my understanding was that they were somehow 
> > > > triggering reads against the whole file anyways.
> > > > 

> > > > 

> > > > I'm not sure why a Table is necessary (presumably some useful method in 
> > > > the API?) if accesses are sparse relative to the entire table; that 
> > > > sounds more aligned to RecordBatch access. I would think that any use 
> > > > of a Table method is going to trigger reads to every batch. I would 
> > > > also think that this scenario has 2 opportunities to do processing 
> > > > without triggering a scan of the whole file:
> > > > 1. when a RecordBatch is read into memory
> > > > 2. on the RecordBatches accumulated so far (a Table instance can be 
> > > > constructed from them without copies, I am pretty sure)
> > > > 

> > > > I have little experience with mmap, so I don't have any particular 
> > > > thoughts there. Some extra information about how random access into the 
> > > > table occurs would be helpful, though.
> > > > 


> > > > On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou < [email protected]> 
> > > > wrote:
> > > > 

> > > > > On Sun, 26 Jan 2025 10:48:48 -0800
> > > > > Sharvil Nanavati < [email protected]> wrote:
> > > > > > In a different context, fetching batches one-by-one would be a good
> > > > > > way to control when the disk read takes place.
> > > > > >
> > > > > > In my context, I'm looking for a way to construct a Table without
> > > > > > performing the bulk of the IO operations until the memory is
> > > > > > accessed. I need random access to the table and my accesses are often
> > > > > > sparse relative to the size of the entire table. Obviously there has
> > > > > > to be *some* IO to read the schema and offsets, but that's tiny
> > > > > > relative to the data itself.
> > > > > 

> > > > > Then you should just use a memory-mapped file.
> > > > > 

> > > > > Regards
> > > > > 

> > > > > Antoine.
> > > > >
