Dandandan commented on issue #1363:
URL: https://github.com/apache/arrow-datafusion/issues/1363#issuecomment-1000115758
> I'm not sure whether this is relevant or not. In the current implementation, the sync_chunk_reader() method is invoked for every parquet page, which causes a lot of unnecessary file open and seek calls.
>
> FilePageIterator.next() -> FileReader.get_row_group().get_column_page_reader() -> SerializedRowGroupReader.get_column_page_reader() -> ChunkObjectReader.get_read() -> LocalFileReader.sync_chunk_reader()
>
> ```rust
> fn sync_chunk_reader(
>     &self,
>     start: u64,
>     length: usize,
> ) -> Result<Box<dyn Read + Send + Sync>> {
>     // A new file descriptor is opened for each chunk reader.
>     // This is okay because chunks are usually fairly large.
>     let mut file = File::open(&self.file.path)?;
>     file.seek(SeekFrom::Start(start))?;
>
>     let file = BufReader::new(file.take(length as u64));
>
>     Ok(Box::new(file))
> }
> ```
>
> TPCH Q1:
>
> Read parquet file lineitem.parquet time spent: 590639777 ns, row group count 60, skipped row group 0
> total open/seek count 421, bytes read from FS: 97028517
> memory alloc size: 1649375985, memory alloc count: 499533 during parquet read.
>
> Query 1 iteration 0 took 679.9 ms
I think that might be relevant. How about opening a new issue to track improving on that (reusing the file descriptor)?
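For reference, a minimal sketch of what file-descriptor reuse could look like, assuming a Unix target; `SharedFileReader` and its fields are hypothetical names for illustration, not the actual DataFusion types. The idea is to hold one open `File` and serve chunks with positional `read_exact_at` calls, so there is no per-chunk `open` or `seek`:

```rust
use std::fs::File;
use std::io::{Cursor, Read, Result};
use std::os::unix::fs::FileExt; // read_at / read_exact_at live here on Unix
use std::sync::Arc;

/// Hypothetical reader holding a single shared file descriptor.
/// (Names are illustrative, not the actual DataFusion types.)
struct SharedFileReader {
    file: Arc<File>,
}

impl SharedFileReader {
    fn new(path: &str) -> Result<Self> {
        Ok(Self {
            file: Arc::new(File::open(path)?),
        })
    }

    /// Serve a chunk with a positional read: no re-open, no seek.
    /// read_exact_at does not move any shared cursor, so concurrent
    /// chunk readers cannot race on the file offset.
    fn sync_chunk_reader(
        &self,
        start: u64,
        length: usize,
    ) -> Result<Box<dyn Read + Send + Sync>> {
        let mut buf = vec![0u8; length];
        self.file.read_exact_at(&mut buf, start)?;
        Ok(Box::new(Cursor::new(buf)))
    }
}
```

One tradeoff in this sketch: it buffers the whole chunk up front instead of streaming it lazily like the `BufReader::take` version quoted above, which may matter for very large chunks.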