[I] Arrow CSV reader peak memory is very large [incubator-gluten]

via GitHub Wed, 15 May 2024 23:05:21 -0700


liujiayi771 opened a new issue, #5766:
URL: https://github.com/apache/incubator-gluten/issues/5766


   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   When reading large CSV files, the peak memory usage of the Arrow memory
   pool is very high. For example, when a single CSV file in a table is
   300 MB, the peak during single-threaded reading can reach 500 MB; for a
   2 GB CSV file, the peak rises to 1.7 GB. There does not appear to be a
   memory leak, but the peak memory usage is very high.
   
   From the Arrow Dataset code, it seems that we are using the streaming
   reader, so in theory the memory consumption should not increase in
   proportion to the size of the CSV file.
   
   I added some code to the `release` method of `ArrowNativeMemoryPool` to
   check the peak memory.
   ```java
   @Override
   public void release() throws Exception {
     System.out.println("peak=" + listener.peak() + ", current=" + listener.current());
     if (arrowPool.getBytesAllocated() != 0) {
       LOGGER.warn(
           String.format(
               "Arrow pool still reserved non-zero bytes, "
                   + "which may cause memory leak, size: %s. ",
               Utils.bytesToString(arrowPool.getBytesAllocated())));
     }
     arrowPool.close();
   }
   ```
   I also added some logging to the Arrow code to check the peak memory.
   ```c++
   Result<RecordBatchGenerator> CsvFileFormat::ScanBatchesAsync(
       const std::shared_ptr<ScanOptions>& scan_options,
       const std::shared_ptr<FileFragment>& file) const {
     auto this_ = checked_pointer_cast<const CsvFileFormat>(shared_from_this());
     auto source = file->source();
     auto reader_fut =
         OpenReaderAsync(source, *this, scan_options, ::arrow::internal::GetCpuThreadPool());
     auto generator = GeneratorFromReader(std::move(reader_fut), scan_options->batch_size);
     WRAP_ASYNC_GENERATOR_WITH_CHILD_SPAN(
         generator, "arrow::dataset::CsvFileFormat::ScanBatchesAsync::Next");
     std::cout << "memory=" << default_memory_pool()->bytes_allocated()
               << ", max=" << default_memory_pool()->max_memory() << std::endl;
     return generator;
   }
   ```
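
For readers comparing the two counters logged above (`bytes_allocated()` and `max_memory()`), the following is a minimal sketch of how a current/peak pair usually behaves. `PeakTracker` is a hypothetical class written for this example and is not Arrow's `MemoryPool` implementation. It shows how the current value can return to zero, meaning no leak, while the peak remains high.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tracker for illustration only; not Arrow's MemoryPool.
class PeakTracker {
 public:
  // Record an allocation and raise the high-water mark if needed.
  void Allocate(std::int64_t bytes) {
    current_ += bytes;
    peak_ = std::max(peak_, current_);
  }

  // Record a deallocation; the peak is intentionally left unchanged.
  void Free(std::int64_t bytes) { current_ -= bytes; }

  std::int64_t current() const { return current_; }  // live bytes right now
  std::int64_t peak() const { return peak_; }        // highest value of current()

 private:
  std::int64_t current_ = 0;
  std::int64_t peak_ = 0;
};
```

With such a tracker, allocating 300 and then 200 units and freeing all 500 leaves `current()` at 0 and `peak()` at 500, which is the same shape as the observation in this report: nothing remains reserved at release time, but the high-water mark is large.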
   
   <img width="542" alt="image" src="https://github.com/apache/incubator-gluten/assets/13622031/02cd9643-12b0-4d1c-a426-1cfdeac77d76">
   <img width="963" alt="image" src="https://github.com/apache/incubator-gluten/assets/13622031/89fd4020-28d7-49fd-9063-442f1f21d359">
   
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org
For additional commands, e-mail: commits-h...@gluten.apache.org
