> Then you should just use a memory-mapped file.

Unless I'm misunderstanding their original message, I believe they are using a memory-mapped file. I'm not sure if other suggestions helped address the issue, but my understanding was that they were somehow triggering reads against the whole file anyways.

I'm not sure why a Table is necessary (presumably some useful method in the
API?) if accesses are sparse relative to the entire table; that sounds more
aligned to RecordBatch access. I would think that any use of a Table method is
going to trigger reads to every batch. I would also think that this scenario
has 2 opportunities to do processing without triggering a scan of the whole
file:
1. when a RecordBatch is read into memory
2. on the RecordBatches accumulated so far (a Table instance can be
constructed from them without copies, I am pretty sure)
I have little experience with mmap, so I don't have any particular thoughts
there. Some extra information about how random access into the table occurs
would be helpful, though.

On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou <[email protected]> wrote:

> On Sun, 26 Jan 2025 10:48:48 -0800 Sharvil Nanavati <[email protected]> wrote:
> > In a different context, fetching batches one-by-one would be a good way to
> > control when the disk read takes place.
> >
> > In my context, I'm looking for a way to construct a Table without
> > performing the bulk of the IO operations until the memory is accessed. I
> > need random access to the table and my accesses are often sparse relative
> > to the size of the entire table. Obviously there has to be *some* IO to
> > read the schema and offsets, but that's tiny relative to the data itself.
>
> Then you should just use a memory-mapped file.
>
> Regards
>
> Antoine.