> That's exactly what I'm looking for and assumed was the default behavior in 
>an mmap world. I was surprised to find it wasn't, and the assumption of dense 
>sequential access is baked in.

Just to clarify, Weston pointed out where my understanding (regarding mmap and 
page faults) was incorrect. But the problem I see isn't that dense sequential 
access is baked in; it's that Table methods are something of a convenience over 
many RecordBatches [1]. More on my opinions below.

> I need the number of rows, schema, and sparse random access to rows. 
>Concatenating two or more datasets with the same schema, all backed by mmap 
>regions is also useful. None of these operations should require a table scan.

For these operations, I think the cause Weston pointed out is the likely 
culprit. In that case, you would probably just need to expand the API with an 
option that controls whether posix_madvise is issued (though it may be more 
complicated than that in practice).
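
As a rough sketch of what that could look like from the pyarrow side (purely 
hypothetical: the `memory_advice` option below is an invented name, not 
something pyarrow provides today, and "data.arrow" is a stand-in file):

```
import pyarrow as pa
import pyarrow.ipc as ipc

# Map the IPC file; today, reads through this path issue MADV_WILLNEED
# internally before any values are actually touched.
source = pa.memory_map("data.arrow", "r")
reader = ipc.open_stream(source)
table = reader.read_all()  # each batch's region gets the willneed advice

# Hypothetical expanded API: an option to suppress the advice so the mapping
# stays lazy. Not real pyarrow; just illustrating the shape of the knob.
# options = ipc.IpcReadOptions(memory_advice="none")
# reader = ipc.open_stream(source, options=options)
```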


As for dense, sequential access, this could be my misunderstanding (again), 
but Table methods are column-oriented and do not necessarily try to minimize 
access to the contained RecordBatches (or column chunks). Using a Table as a 
container and passing it around is totally fine and not something I'm 
discouraging.
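
As a small illustration of what I mean (a pyarrow sketch):

```
import pyarrow as pa
import pyarrow.compute as pc

# Two batches with the same schema, wrapped in a Table without copying.
batch_a = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})
batch_b = pa.RecordBatch.from_pydict({"x": [4, 5, 6]})
table = pa.Table.from_batches([batch_a, batch_b])

# Table methods are column-oriented: column("x") is a ChunkedArray spanning
# both batches, so a reduction over it walks every chunk.
col = table.column("x")
print(col.num_chunks)  # 2
print(pc.sum(col))     # touches both chunks
```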

Generally, I think you either want to use a higher-level API (datasets [2] or 
compute [3]) that is designed to be smart about which RecordBatches (or column 
chunks) it accesses, or to use RecordBatches directly yourself so you have the 
control you're looking for. In my opinion, using RecordBatches instead of 
Tables for any of these cases is not much more difficult and provides more 
clarity.
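
For example, a sketch of the RecordBatch-level route (and the dataset route 
for comparison), assuming a hypothetical IPC file "data.arrow" with an `id` 
column:

```
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.dataset as ds

# RecordBatch route: open the memory-mapped IPC file and pull only the
# batches you actually need, by index.
source = pa.memory_map("data.arrow", "r")
reader = ipc.open_file(source)
print(reader.schema)  # metadata only
wanted = [0, reader.num_record_batches - 1]
batches = [reader.get_batch(i) for i in wanted]

# A Table still works as a zero-copy container over just those batches.
table = pa.Table.from_batches(batches)

# Dataset route: let the higher-level API decide which batches to touch.
dataset = ds.dataset("data.arrow", format="ipc")
subset = dataset.to_table(columns=["id"], filter=ds.field("id") > 100)
```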

[1]: https://arrow.apache.org/docs/cpp/tables.html#tables
[2]: https://arrow.apache.org/docs/python/dataset.html#filtering-data
[3]: https://arrow.apache.org/docs/python/compute.html#filtering-by-expressions



# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Tuesday, January 28th, 2025 at 18:18, Sharvil Nanavati <[email protected]> 
wrote:

> Thanks for the discussion, folks. I think the key takeaway for me is that my 
> access pattern / use case isn't directly supported by Arrow today, but 
> there's no technical reason it can't be.
> 

> Would there be any opposition to me expanding the API surface to support a 
> zero-data-read-by-default implementation?
> 

> On Tue, Jan 28, 2025, 18:04 Aldrin <[email protected]> wrote:
> 

> > I see, I was incorrectly conflating the pointer math and when a page fault 
> > is actually generated. Thanks for clarifying!
> > 

> > Without knowing Sharvil's actual interactions with the Table, I'm still not 
> > convinced a table method wouldn't trigger the scan anyways, but I suppose 
> > that's more of a pessimistic perspective and not necessarily the case.
> > 

> 

> 

> I need the number of rows, schema, and sparse random access to rows. 
> Concatenating two or more datasets with the same schema, all backed by mmap 
> regions is also useful. None of these operations should require a table scan.
> 

> 

> > 

> > On Tue, Jan 28, 2025 at 17:42, Weston Pace <[email protected]> wrote:
> > 

> > > > Sharvil wants random access to only a few RecordBatches via Table 
> > > > methods, but I don't think that's possible with the Arrow library
> > > 

> > > The idea (and I believe things worked this way at one point) was that you 
> > > could memory map a file, read in a bunch of record batches (even an 
> > > entire table if you want), and you would just have a collection of 
> > > pointers into the memory mapped file without ever actually loading any of 
> > > the data into memory.
> 

> 

> That's exactly what I'm looking for and assumed was the default behavior in 
> an mmap world. I was surprised to find it wasn't, and the assumption of dense 
> sequential access is baked in.
> 

> 

> > > 

> > > Then, when the data is needed (e.g. when a user calls 
> > > `table.column(0).chunk(0).value(0)` then the pointers would be 
> > > dereferenced and, through the magic of memory mapping, the data would be 
> > > loaded on demand. This loading on demand tends to be inefficient and 
> > > _not_ what most IPC users are looking for (they just want to read an IPC 
> > > file and expect they will be accessing the entire file) so I understand 
> > > why the MADV_WILLNEED is there. However, for users that do want this, I'm 
> > > not sure if there is any way to achieve the load on demand semantics.
> > > 

> > > > The code you linked specifies a memory region and the following 
> > > > `nbytes`:
> > > 

> > > 

> > > Yes, the actual implementation does call ReadAt with a memory region and 
> > > nbytes. These are then used to create a slice into the underlying memory 
> > > mapped area. If MADV_WILLNEED was _not_ called then this would be a 
> > > zero-copy / zero-load operation that doesn't actually load anything from 
> > > the disk (it's just doing pointer math).
> > > 

> > > 

> > > On Tue, Jan 28, 2025 at 2:56 PM Aldrin < [email protected]> wrote:
> > > 

> > > > > ...and that function triggers the MADV_WILLNEED
> > > > 

> > > > 

> > > > The code you linked specifies a memory region and the following 
> > > > `nbytes`:
> > > > ```
> > > > RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
> > > >       {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
> > > >   return memory_map_->Slice(position, nbytes);
> > > > ```
> > > > 

> > > > 

> > > > The original question said "Calling `read_all` on a stream triggers a 
> > > > complete read of the file". So, my impression is that either `read_all` 
> > > > (I'm assuming via python) is purposely specifying the whole file, or 
> > > > eventually (through multiple calls) specifying the whole file. I am 
> > > > curious how large the file itself is, though I assume it's larger than 
> > > > whatever size `nbytes` is defaulted to.
> > > > 

> > > > But, I also can't find which implementation of `MemoryMap::Slice` [1] 
> > > > is resolved by `memory_map_->Slice(position, nbytes)`, which I don't 
> > > > think is likely to be problematic but I can't totally rule out either.
> > > > 

> > > > Either way, if I understand correctly, Sharvil wants random access to 
> > > > only a few RecordBatches via Table methods, but I don't think that's 
> > > > possible with the Arrow library; the only ways are to manage accesses 
> > > > at the RecordBatch level, or maybe using the Dataset or Acero APIs. Or 
> > > > am I forgetting something... or maybe I'm misunderstanding why Sharvil 
> > > > wants to specifically construct a Table rather than RecordBatches?
> > > > 

> > > > 

> > > > [1]: 
> > > > https://github.com/apache/arrow/blob/apache-arrow-19.0.0/cpp/src/arrow/io/file.h#L216-L217
> > > > 

> > > > 

> > > > 

> > > > 

> > > > 

> > > > # ------------------------------
> > > > 

> > > > # Aldrin
> > > > 

> > > > 

> > > > https://github.com/drin/
> > > > 

> > > > https://gitlab.com/octalene
> > > > 

> > > > https://keybase.io/octalene
> > > > 

> > > > 

> > > > On Tuesday, January 28th, 2025 at 13:45, Weston Pace < 
> > > > [email protected]> wrote:
> > > > 

> > > > > I believe the concern is that reading a record batch from a 
> > > > > RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent 
> > > > > to the OS before any data is accessed (and regardless of whether or 
> > > > > not that data is ever accessed).
> > > > > 

> > > > > I'm pretty sure the `RecordBatchStreamReader` uses 
> > > > > `MemoryMappedFile::ReadAt` and that function triggers the 
> > > > > MADV_WILLNEED[1]. This is contrary to the user expectation that only 
> > > > > the data actually accessed would be loaded into memory.
> > > > > 

> > > > > [1] 
> > > > > https://github.com/apache/arrow/blob/ca2f4d68e834e600852d5af36dc2190741e33118/cpp/src/arrow/io/file.cc#L677
> > > > > 

> > > > > On Tue, Jan 28, 2025 at 7:15 AM Aldrin < [email protected]> wrote:
> > > > > 

> > > > > > > Then you should just use a memory-mapped file.
> > > > > > 

> > > > > > Unless I'm misunderstanding their original message, I believe they 
> > > > > > are using a memory-mapped file. I'm not sure if other suggestions 
> > > > > > helped address the issue, but my understanding was that they were 
> > > > > > somehow triggering reads against the whole file anyways.
> > > > > > 

> > > > > > 

> > > > > > I'm not sure why a Table is necessary (presumably some useful 
> > > > > > method in the API?) if accesses are sparse relative to the entire 
> > > > > > table; that sounds more aligned to RecordBatch access. I would 
> > > > > > think that any use of a Table method is going to trigger reads to 
> > > > > > every batch. I would also think that this scenario has 2 
> > > > > > opportunities to do processing without triggering a scan of the 
> > > > > > whole file:
> > > > > > 1. when a RecordBatch is read into memory
> > > > > > 2. on the RecordBatches accumulated so far (a Table instance can be 
> > > > > > constructed from them without copies, I am pretty sure)
> > > > > > 

> > > > > > I have little experience with mmap, so I don't have any particular 
> > > > > > thoughts there. Some extra information about how random access into 
> > > > > > the table occurs would be helpful, though.
> > > > > > 

> > > > > > 

> > > > > > 

> > > > > > Sent from Proton Mail for iOS
> > > > > > 

> > > > > > 

> > > > > > On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou < [email protected]> 
> > > > > > wrote:
> > > > > > 

> > > > > > > On Sun, 26 Jan 2025 10:48:48 -0800
> > > > > > > Sharvil Nanavati < [email protected]> wrote:
> > > > > > > > In a different context, fetching batches one-by-one would be a 
> > > > > > > > good way to
> > > > > > > > control when the disk read takes place.
> > > > > > > >
> > > > > > > > In my context, I'm looking for a way to construct a Table 
> > > > > > > > without
> > > > > > > > performing the bulk of the IO operations until the memory is 
> > > > > > > > accessed. I
> > > > > > > > need random access to the table and my accesses are often 
> > > > > > > > sparse relative
> > > > > > > > to the size of the entire table. Obviously there has to be 
> > > > > > > > *some* IO to
> > > > > > > > read the schema and offsets, but that's tiny relative to the 
> > > > > > > > data itself.
> > > > > > > 

> > > > > > > Then you should just use a memory-mapped file.
> > > > > > > 

> > > > > > > Regards
> > > > > > > 

> > > > > > > Antoine.
> > > > > > > 
