Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-06-01 Thread via GitHub
jinchengchenghh commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2143424599 I marked this format as splittable = false, so it should not be split.

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-31 Thread via GitHub
FelixYBW commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2141314435 > It would be easy for Arrow to support a file offset and length; we just need to use `RandomAccessFile` to generate the `InputStream`. The `FileSource` class constructor is ...

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-31 Thread via GitHub
jinchengchenghh commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2141312309 It would be easy for Arrow to support a file offset and length; we just need to use `RandomAccessFile` to generate the `InputStream`. The `FileSource` class constructor is ...
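A minimal sketch of the idea under discussion — feeding Arrow's C++ CSV reader an `InputStream` that covers only a byte range of the file, produced from a `RandomAccessFile`. This is not Gluten's code and not the truncated `FileSource` constructor quoted above; it assumes `arrow::io::ReadableFile`, `arrow::io::RandomAccessFile::GetStream`, and `arrow::csv::StreamingReader`, and a real split would still need to align the range to row boundaries:

```
// Sketch only: stream-read the byte range [offset, offset + length) of a CSV
// file, so the whole file never has to be buffered at once.
#include <cstdint>
#include <memory>
#include <string>

#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status ReadCsvRange(const std::string& path, int64_t offset, int64_t length) {
  // Open the file as a RandomAccessFile.
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
  // Expose only the requested byte range as an InputStream.
  ARROW_ASSIGN_OR_RAISE(
      auto stream, arrow::io::RandomAccessFile::GetStream(file, offset, length));
  // Read the range incrementally, one record batch at a time.
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::StreamingReader::Make(arrow::io::default_io_context(), stream,
                                        arrow::csv::ReadOptions::Defaults(),
                                        arrow::csv::ParseOptions::Defaults(),
                                        arrow::csv::ConvertOptions::Defaults()));
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of the range
    // ... process batch ...
  }
  return arrow::Status::OK();
}
```

With a streaming reader over a bounded range, peak usage should track the CSV block size rather than the size of the whole file.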

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-29 Thread via GitHub
jinchengchenghh commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2138456354 Yes.

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-29 Thread via GitHub
FelixYBW commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2138161430 Do you mean the Arrow CSV reader doesn't support splits? Each partition must then have one or more whole CSV files, instead of part of a large CSV file.

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-28 Thread via GitHub
jinchengchenghh commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2136264870 I think it is because Arrow does not support passing a file start offset and length to split a file, so its peak memory is high for a very big CSV file.

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-28 Thread via GitHub
liujiayi771 commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2134574129 @jinchengchenghh I tested the latest code, and the peak memory usage is still relatively high. I did not add logs in `ArrowReservationListener.reserve`. Printing logs

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-27 Thread via GitHub
jinchengchenghh commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2134145056 I assume you used an intermediate commit of the CSV reader; there is a redundant `colVector.retain()` in `ArrowUtil.loadBatch()` in an intermediate version, not in the merged version.

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-24 Thread via GitHub
liujiayi771 commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2128793052 @jinchengchenghh Have you checked the size of a single CSV file?

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-23 Thread via GitHub
FelixYBW commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2127614458 @jinchengchenghh can you add prints in the record batch construction and destruction functions to confirm? There should be only 1 record batch alive, no more than 3.
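A throwaway sketch (not Gluten code) of one way to do that check: wrap each batch the reader returns so that creating and dropping the wrapper prints how many tracked batches are alive; `TrackBatch` is a hypothetical helper name.

```
#include <atomic>
#include <cstdint>
#include <iostream>
#include <memory>

#include <arrow/record_batch.h>

static std::atomic<int64_t> live_batches{0};

// Wrap a batch so the live count is printed when the wrapper is created and
// when the last reference to the wrapped pointer is released.
std::shared_ptr<arrow::RecordBatch> TrackBatch(std::shared_ptr<arrow::RecordBatch> batch) {
  std::cout << "batch created, alive = " << ++live_batches << std::endl;
  // The custom deleter does not free the batch itself; the captured shared_ptr
  // keeps it alive until the deleter (and its capture) is destroyed.
  return std::shared_ptr<arrow::RecordBatch>(
      batch.get(), [batch](arrow::RecordBatch*) {
        std::cout << "batch destroyed, alive = " << --live_batches << std::endl;
      });
}
```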

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-23 Thread via GitHub
liujiayi771 commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2126952907 @jinchengchenghh I will test the latest code.

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-23 Thread via GitHub
jinchengchenghh commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2126691925 I could not reproduce this issue. I tested TPC-H Q6 with 600 GB of data and printed the peak every time Arrow reserves memory, in `public void reserve(long size) { ... }`.
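For comparison, the native-side peak can also be read straight off an Arrow C++ memory pool; a minimal sketch, assuming the reader allocates from `arrow::default_memory_pool()` (Gluten may wire the reader to a different, listener-backed pool):

```
#include <iostream>

#include <arrow/memory_pool.h>

// Print the current and peak allocation of the default Arrow C++ pool.
void PrintArrowPoolStats() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::cout << "bytes_allocated = " << pool->bytes_allocated()
            << ", max_memory (peak) = " << pool->max_memory() << std::endl;
}
```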

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-16 Thread via GitHub
zhztheplayer commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2116709386 > @zhztheplayer do you remember? I can't recall that. But it doesn't make sense to buffer all the data for a reader. I suppose @jinchengchenghh is looking into it.

Re: [I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-16 Thread via GitHub
liujiayi771 commented on issue #5766: URL: https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2114102095 cc @jinchengchenghh @zhztheplayer, thanks.

[I] Arrow CSV reader peak memory is very large [incubator-gluten]

2024-05-16 Thread via GitHub
liujiayi771 opened a new issue, #5766: URL: https://github.com/apache/incubator-gluten/issues/5766
### Backend
VL (Velox)
### Bug description
When reading large CSV files, for example when a single CSV file in a table is 300 MB, the peak memory usage of the Arrow memory pool is very large.