westonpace commented on issue #36303:
URL: https://github.com/apache/arrow/issues/36303#issuecomment-1631631311

   > Is this in reference specifically to each RecordBatch that arrives to the 
write_dataset function?
   
   Hmm... this is probably close enough to say yes :)  Technically, the 
`write_dataset` function receives a generator.  That generator is passed to the 
C++ layer (Acero), which pulls one item at a time from it and (I think, but 
would have to verify) slices the incoming batches into smaller chunks (32Ki 
rows), partitions those smaller chunks, and then sends them to "the dataset 
writer".
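   
   To make that concrete, here is a minimal sketch of that shape of call.  The schema, paths, and the `make_batches` generator are made up for illustration; they are not from this issue:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   schema = pa.schema([("part", pa.string()), ("value", pa.int64())])

   def make_batches():
       # hypothetical generator: yields RecordBatches one at a time instead of
       # materializing the whole table in memory
       for i in range(100):
           yield pa.record_batch(
               [pa.array(["a" if i % 2 else "b"] * 10_000),
                pa.array(range(10_000), type=pa.int64())],
               schema=schema,
           )

   ds.write_dataset(
       make_batches(),
       "/tmp/out",
       schema=schema,  # required when the data source is a generator
       format="parquet",
       partitioning=ds.partitioning(
           pa.schema([("part", pa.string())]), flavor="hive"
       ),
   )
   ```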
   
   > My assumption was that pyarrow is keeping the files open until it's 
arrived at 500k rows per file, but this never happens, because we never get 
through enough rows to write a full file before we are OOM'd.
   
   PyArrow will keep the files "open" until they reach 500k rows, but keeping a 
file open shouldn't prevent data from being flushed to disk and evicted from 
RAM.
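   
   As a hedged illustration of the knobs involved (the numbers below are only examples, not a recommendation): a file can stay open until it reaches `max_rows_per_file` while completed row groups are still being written out along the way:
   
   ```python
   import pyarrow.dataset as ds

   ds.write_dataset(
       make_batches(),              # same hypothetical generator as above
       "/tmp/out",
       schema=schema,
       format="parquet",
       max_rows_per_file=500_000,   # file stays "open" until it holds this many rows
       max_rows_per_group=100_000,  # but finished row groups can be flushed sooner
       min_rows_per_group=10_000,
   )
   ```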
   
   > Right, so that makes sense, is it possible that the flush is not happening 
fast enough? I.e. the part that is reading rows into memory is happening "too 
fast" for us to flush to disk and alleviate some of said memory pressure?
   
   Yes, sort of.  There are a few threads involved here, and writes happen 
asynchronously.  One thread (possibly more than one depending on a few factors, 
but let's assume one and call it the scanning thread) will ask Python for a 
batch of data, partition the batch, and then create a new thread task to write 
each partitioned chunk.  These thread tasks are quite short (they should mostly 
just be a memcpy from user space into kernel space) and they are submitted to 
the I/O thread pool (which has 8 threads we can call I/O threads or write 
threads).  So it's possible that these tasks are piling up.
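   
   If you want to experiment, both pool sizes are visible (and adjustable) from Python.  Shrinking the I/O pool is just something to try, to limit how many write tasks can be in flight at once; I'm not claiming it will fix this:
   
   ```python
   import pyarrow as pa

   print("CPU threads:", pa.cpu_count())        # scanning / compute thread pool
   print("I/O threads:", pa.io_thread_count())  # write tasks run here (default 8)
   pa.set_io_thread_count(2)                    # experiment: fewer in-flight writes
   ```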
   
   > Max row counts above ~75k will end up with a container of 2GB memory being 
OOM killed and my write process doesn't complete :(
   
   This part I don't really understand.  It's possible that Docker is 
complicating things here.  I don't know much about the Docker filesystem (and 
it appears there are several).  What is doing the OOM killing?  Is the host OS 
killing the Docker process because it is using too much host memory?  Or is 
some internal Kubernetes monitor killing things?
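   
   One thing that might help narrow it down is comparing what Arrow's own allocator reports against whatever limit the container is enforcing, e.g. something like:
   
   ```python
   import pyarrow as pa

   pool = pa.default_memory_pool()
   print(pool.backend_name)       # e.g. "jemalloc", "mimalloc", or "system"
   print(pool.bytes_allocated())  # bytes Arrow currently holds
   print(pool.max_memory())       # high-water mark for this process
   ```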

