liujiayi771 opened a new issue, #5766: URL: https://github.com/apache/incubator-gluten/issues/5766
### Backend

VL (Velox)

### Bug description

When reading large CSV files, the peak memory usage of the Arrow memory pool is very high. For example, when a single CSV file in a table is 300M, the peak during single-threaded reading can reach 500M. With a 2G CSV file, the peak rises to 1.7G. There does not appear to be a memory leak, since usage drops back afterwards, but the peak is very high.

From the Arrow Dataset code, it seems that we are using the streaming reader, so in theory memory consumption should not grow in proportion to the size of the CSV file.

I added some code to the `release` method of `ArrowNativeMemoryPool` to check the peak memory.

```java
@Override
public void release() throws Exception {
  // Debug output added for this investigation: peak and current bytes seen by the listener.
  System.out.println("peak=" + listener.peak() + ", current=" + listener.current());
  if (arrowPool.getBytesAllocated() != 0) {
    LOGGER.warn(
        String.format(
            "Arrow pool still reserved non-zero bytes, "
                + "which may cause memory leak, size: %s. ",
            Utils.bytesToString(arrowPool.getBytesAllocated())));
  }
  arrowPool.close();
}
```

I also added some logging to the Arrow code to check the peak memory.
```c++
Result<RecordBatchGenerator> CsvFileFormat::ScanBatchesAsync(
    const std::shared_ptr<ScanOptions>& scan_options,
    const std::shared_ptr<FileFragment>& file) const {
  auto this_ = checked_pointer_cast<const CsvFileFormat>(shared_from_this());
  auto source = file->source();
  auto reader_fut =
      OpenReaderAsync(source, *this, scan_options, ::arrow::internal::GetCpuThreadPool());
  auto generator = GeneratorFromReader(std::move(reader_fut), scan_options->batch_size);
  WRAP_ASYNC_GENERATOR_WITH_CHILD_SPAN(
      generator, "arrow::dataset::CsvFileFormat::ScanBatchesAsync::Next");
  std::cout << "memory=" << default_memory_pool()->bytes_allocated()
            << ", max=" << default_memory_pool()->max_memory() << std::endl;
  return generator;
}
```

<img width="542" alt="image" src="https://github.com/apache/incubator-gluten/assets/13622031/02cd9643-12b0-4d1c-a426-1cfdeac77d76">
<img width="963" alt="image" src="https://github.com/apache/incubator-gluten/assets/13622031/89fd4020-28d7-49fd-9063-442f1f21d359">

### Spark version

None

### Spark configurations

_No response_

### System information

_No response_

### Relevant logs

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
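
The bug description above expects that a streaming reader's peak memory stays flat as the file grows. The following is a minimal, self-contained sketch of that expectation, not Gluten or Arrow code. `StreamingPeakSketch`, `PeakTrackingListener`, `simulate`, and the `maxInFlight` parameter are all hypothetical names introduced for illustration. The listener only mirrors the `peak()` / `current()` idea used in the debug print, and the block sizes are assumed values.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustration only: models how peak memory relates to the number of buffered blocks. */
final class StreamingPeakSketch {

  /** Hypothetical stand-in for an allocation listener that tracks current and peak bytes. */
  static final class PeakTrackingListener {
    private long current;
    private long peak;

    void reserve(long bytes) {
      current += bytes;
      peak = Math.max(peak, current);
    }

    void release(long bytes) {
      current -= bytes;
    }

    long current() {
      return current;
    }

    long peak() {
      return peak;
    }
  }

  /**
   * Simulates reading totalBytes in blocks of blockBytes while keeping at most
   * maxInFlight blocks buffered, and returns the peak number of bytes reserved.
   */
  static long simulate(long totalBytes, long blockBytes, int maxInFlight) {
    if (blockBytes <= 0 || maxInFlight < 1) {
      throw new IllegalArgumentException("blockBytes must be > 0 and maxInFlight >= 1");
    }
    PeakTrackingListener listener = new PeakTrackingListener();
    Deque<Long> inFlight = new ArrayDeque<>();
    long remaining = totalBytes;
    while (remaining > 0) {
      long size = Math.min(blockBytes, remaining);
      // When the buffer is full, the consumer drains the oldest block first.
      if (inFlight.size() == maxInFlight) {
        listener.release(inFlight.removeFirst());
      }
      listener.reserve(size);
      inFlight.addLast(size);
      remaining -= size;
    }
    // End of input: everything still buffered is consumed and released.
    while (!inFlight.isEmpty()) {
      listener.release(inFlight.removeFirst());
    }
    if (listener.current() != 0) {
      throw new IllegalStateException("simulated leak: " + listener.current());
    }
    return listener.peak();
  }

  public static void main(String[] args) {
    long mb = 1L << 20;
    // Bounded buffering: the peak depends on blockBytes * maxInFlight, not on file size.
    System.out.println("300M, 4 blocks in flight: peak=" + simulate(300 * mb, mb, 4));
    System.out.println("2G,   4 blocks in flight: peak=" + simulate(2048 * mb, mb, 4));
    // Effectively unbounded buffering: the peak grows with the file size.
    System.out.println("300M, unbounded:          peak=" + simulate(300 * mb, mb, Integer.MAX_VALUE));
  }
}
```

In this model, current usage returns to zero in both cases, so neither looks like a leak, yet only the bounded case keeps the peak independent of file size. The reported numbers resemble the unbounded case, but whether blocks actually accumulate this way inside Arrow's CSV reader is not established here.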