u3Izx9ql7vW4 commented on issue #43929:
URL: https://github.com/apache/arrow/issues/43929#issuecomment-2325536409

   That's correct, I'm writing IPC with memory-mapped files. I went over the 
page you linked a few times but couldn't figure out the difference between 
`new_file` and `new_stream`, since they both accept a `NativeFile`. I needed 
multiple IPC streams running, so I opted for memory-mapped files, which let me 
designate a file path for each producer. Could this be done with Arrow's 
memory buffer somehow?
   
   > If you just want to write to disk and keep the file memory mapped it's 
likely easier (and faster) to just write an arrow file to disk and mmap it 
after.
   
   I don't quite follow this part. Isn't that what I'm already doing? Perhaps 
you're suggesting that I check whether the memory-mapped file has already been 
created before creating a new one, like below?
   
   ```python
   def save_data():
       # pre-size the map from the table's raw buffer size; note this does
       # not account for IPC metadata, which may be the offset I'm seeing
       size = table.get_total_buffer_size()

       file_path = os.path.join(prefix_stream, sink)

       # only create the memory map the first time around
       if not os.path.exists(file_path):
           pa.create_memory_map(file_path, size)

       # renamed the context variable so it no longer shadows `sink` above
       with pa.memory_map(file_path, 'wb') as mm:
           with pa.ipc.new_file(mm, table.schema) as writer:
               writer.write_table(table, max_chunksize=1000)
   ```
   
   > edit: actually the offset is likely metadata, maybe?
   
   That's what I thought as well, but I would have expected 
`total_buffer_size` to include the metadata. I may have dictionary-encoded 
columns, though, so maybe that's what's inflating the metadata. Do you know if 
there's a way to find out?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
