jinchengchenghh commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2143424599
I marked this format as splittable = false, so it should not be split.
FelixYBW commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2141314435
> It is easy to support file offset and length in Arrow; we just need to use
> `RandomAccessFile` to generate an `InputStream`. FileSource class constructor is
>
> ```
>
jinchengchenghh commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2141312309
It is easy to support file offset and length in Arrow; we just need to use
`RandomAccessFile` to generate an `InputStream`.
FileSource class constructor is
```
jinchengchenghh commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2138456354
Yes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific
FelixYBW commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2138161430
Do you mean Arrow CSV doesn't support splitting? Each partition must contain one or
more whole CSV files, instead of part of a large CSV file.
jinchengchenghh commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2136264870
I think it is because Arrow does not support specifying a file start offset and
length to split a file, so its peak memory is high for a very large CSV file.
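To illustrate the "file start and length" idea, here is a minimal sketch of exposing one byte range of a file as an `InputStream` on top of `RandomAccessFile`. The class name `RangeInputStream` and its constructor parameters are hypothetical and are not part of Arrow or Gluten; how such a stream would be wired into the Arrow CSV reader is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

/** Hypothetical helper: serves bytes [start, start + length) of a file. */
class RangeInputStream extends InputStream {
    private final RandomAccessFile file;
    private long remaining; // bytes of the range not yet returned

    RangeInputStream(String path, long start, long length) throws IOException {
        file = new RandomAccessFile(path, "r");
        file.seek(start);
        // Clamp the range so it never extends past the end of the file.
        long available = Math.max(0L, file.length() - start);
        remaining = Math.min(Math.max(0L, length), available);
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // end of the range, even if the file continues
        }
        int b = file.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        int n = file.read(buf, off, (int) Math.min((long) len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Note that a raw byte range can begin or end in the middle of a CSV row, so a real split reader would also need to align its range to record boundaries; this sketch does not attempt that.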
liujiayi771 commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2134574129
@jinchengchenghh I tested the latest code, and the peak memory usage is
still relatively high. I did not add logs in
`ArrowReservationListener.reserve`. Printing logs
jinchengchenghh commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2134145056
I assume you used an intermediate commit of the CSV reader; there is a redundant
`colVector.retain()` in function `ArrowUtil.loadBatch()` in an intermediate version, not
the merged
liujiayi771 commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2128793052
@jinchengchenghh Have you checked the size of a single CSV file?
FelixYBW commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2127614458
@jinchengchenghh can you add prints in the record batch construction and
destruction functions to confirm? There should be only 1 record batch alive, and no
more than 3.
liujiayi771 commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2126952907
@jinchengchenghh I will test the latest code.
jinchengchenghh commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2126691925
I could not reproduce this issue. I tested TPC-H Q6 with 600 GB of data and printed
the peak every time Arrow reserves memory.
```
public void reserve(long size) {
zhztheplayer commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2116709386
> @zhztheplayer do you remember?
I can't recall that. But it doesn't make sense to buffer all data for a
reader.
I suppose @jinchengchenghh is looking
liujiayi771 commented on issue #5766:
URL:
https://github.com/apache/incubator-gluten/issues/5766#issuecomment-2114102095
cc @jinchengchenghh @zhztheplayer, thanks.
liujiayi771 opened a new issue, #5766:
URL: https://github.com/apache/incubator-gluten/issues/5766
### Backend
VL (Velox)
### Bug description
When reading large CSV files, for example, when a single CSV file in a table
is 300M, the peak memory usage of arrow memory pool