jonded94 commented on issue #44599:
URL: https://github.com/apache/arrow/issues/44599#issuecomment-2479532080

   As all of this is unfortunately an internal project, I'm unable to share any 
specifics, especially files. I'm very sorry!
   Maybe I can find some time to create synthetic files which show a similar 
behaviour.
   
   > it would be nice if you could compare the memory consumption with PyArrow 
and with parquet-rs. 
   
   I can do that with a similar workload that reads actual rows rather than *metadata*. That is unfortunately not strictly connected to the original issue reported here, I'm sorry! But the results could nevertheless be interesting, so I'll share them. Sorry that this issue is so meandering.
   
   This workload simply consumed `RecordBatch` instances with a batch size of 2250 via `pyarrow.dataset.Fragment.to_batches` (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Fragment.html#pyarrow.dataset.Fragment.to_batches). Nothing was done with them; they were immediately discarded, very similar to the original script I shared.
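   For reference, a minimal sketch of such a read-and-discard loop might look like the following. This is not the internal script (which I can't share); the file path, column, and row count are purely illustrative, and the file is generated on the fly so the snippet is self-contained:
   
   ```python
   # Hypothetical reproduction sketch: write a small Parquet file, then
   # iterate its batches via Fragment.to_batches and discard them.
   import os
   import tempfile
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   tmp = tempfile.mkdtemp()
   path = os.path.join(tmp, "data.parquet")
   pq.write_table(pa.table({"x": list(range(10_000))}), path)
   
   dataset = ds.dataset(path, format="parquet")
   n = 0
   for fragment in dataset.get_fragments():
       for batch in fragment.to_batches(batch_size=2250):
           n += batch.num_rows  # batch goes out of scope immediately
   print(n)  # → 10000
   ```
   
   In the real workload this loop runs over many large fragments, which is where the steadily growing memory shows up.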
   
   
![image](https://github.com/user-attachments/assets/b33accbf-d122-4428-aaeb-72a08dbc0368)
   
   As one can see, memory usage just increases steadily. Running this script in parallel very easily leads to OOM errors (which we could also reproduce).
   
   Then we tried creating an entirely new memory pool for every data iteration 
as a workaround:
   
![image](https://github.com/user-attachments/assets/254aa104-8b00-40d8-a439-1014b2ffccc6)
   This helped, but the memory load is still quite high.
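   For illustration, one way to approximate that workaround from Python is to hand each iteration its own pool through the `memory_pool` argument of `Fragment.to_batches`. Whether this matches our exact internal setup is an assumption; here a `proxy_memory_pool` wrapping the default pool stands in for "an entirely new pool", and the data file is synthetic:
   
   ```python
   # Hedged sketch of a per-iteration memory-pool workaround. A proxy
   # pool still allocates from the underlying default pool, but gives a
   # separate handle (and per-iteration bytes_allocated() accounting).
   import os
   import tempfile
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   tmp = tempfile.mkdtemp()
   path = os.path.join(tmp, "data.parquet")
   pq.write_table(pa.table({"x": list(range(10_000))}), path)
   
   dataset = ds.dataset(path, format="parquet")
   total = 0
   for fragment in dataset.get_fragments():
       # Fresh pool handle for this iteration only.
       pool = pa.proxy_memory_pool(pa.default_memory_pool())
       for batch in fragment.to_batches(batch_size=2250, memory_pool=pool):
           total += batch.num_rows
   print(total)  # → 10000
   ```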
   
   When using the Rust reader for the same workload, memory usage behaved much more predictably and was quite a bit lower:
   
![image](https://github.com/user-attachments/assets/bbce1260-ade7-45dd-af3c-016b54b5e028)
   
   
   

