> Then you should just use a memory-mapped file.      
  Unless I'm misunderstanding their original message, I believe they are
already using a memory-mapped file. I'm not sure whether the other suggestions
helped address the issue, but my understanding was that they were somehow
triggering reads against the whole file anyway.

 
     
  I'm not sure why a Table is necessary (presumably for some useful method in
the API?) if accesses are sparse relative to the entire table; that sounds
better suited to RecordBatch-level access. I would expect any use of a Table
method to trigger reads of every batch. I would also think that this scenario
has two opportunities to do processing without triggering a scan of the whole
file:


 1. when a RecordBatch is read into memory
 2. on the RecordBatches accumulated so far (a Table instance can be
constructed from them without copies, I am pretty sure)


 


 I have little experience with mmap, so I don't have any particular thoughts 
there. Some extra information about how random access into the table occurs 
would be helpful, though.


 
On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou <[email protected]> wrote:
> On Sun, 26 Jan 2025 10:48:48 -0800
> Sharvil Nanavati <[email protected]> wrote:
> > In a different context, fetching batches one-by-one would be a good way to
> > control when the disk read takes place.
> >
> > In my context, I'm looking for a way to construct a Table without
> > performing the bulk of the IO operations until the memory is accessed. I
> > need random access to the table and my accesses are often sparse relative
> > to the size of the entire table. Obviously there has to be *some* IO to
> > read the schema and offsets, but that's tiny relative to the data itself.
>
> Then you should just use a memory-mapped file.
>
> Regards
>
> Antoine.
