mingmwang commented on issue #1363:
URL: 
https://github.com/apache/arrow-datafusion/issues/1363#issuecomment-1000112313


   I'm not sure whether this is relevant or not. In the current implementation, the sync_chunk_reader() method is invoked for every Parquet page, which causes many unnecessary file open and seek calls.
   
   FilePageIterator.next()
   -> FileReader.get_row_group().get_column_page_reader()
   -> SerializedRowGroupReader.get_column_page_reader()
   -> ChunkObjectReader.get_read()
   -> LocalFileReader.sync_chunk_reader()
   
   ```rust
   fn sync_chunk_reader(
       &self,
       start: u64,
       length: usize,
   ) -> Result<Box<dyn Read + Send + Sync>> {
       // A new file descriptor is opened for each chunk reader.
       // This is okay because chunks are usually fairly large.
       let mut file = File::open(&self.file.path)?;
       file.seek(SeekFrom::Start(start))?;

       let file = BufReader::new(file.take(length as u64));

       Ok(Box::new(file))
   }
   ```
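   One possible direction (a minimal sketch, not the DataFusion API — `CachedChunkReader` is a hypothetical name) is to open the file once and hand out per-chunk readers by duplicating the descriptor with `File::try_clone`, so each chunk costs only a dup and a seek instead of a full `open`:

   ```rust
   use std::fs::File;
   use std::io::{BufReader, Read, Seek, SeekFrom};

   // Hypothetical sketch: keep one open File and clone the descriptor per
   // chunk instead of calling File::open for every page.
   struct CachedChunkReader {
       file: File,
   }

   impl CachedChunkReader {
       fn open(path: &std::path::Path) -> std::io::Result<Self> {
           Ok(Self { file: File::open(path)? })
       }

       // Caveat: try_clone() duplicates the descriptor but SHARES the seek
       // cursor with the original, so this sketch is only safe for
       // sequential use; a production fix would use positioned reads
       // (e.g. FileExt::read_at on Unix) instead.
       fn sync_chunk_reader(
           &self,
           start: u64,
           length: usize,
       ) -> std::io::Result<Box<dyn Read + Send + Sync>> {
           let mut file = self.file.try_clone()?;
           file.seek(SeekFrom::Start(start))?;
           Ok(Box::new(BufReader::new(file.take(length as u64))))
       }
   }

   fn main() -> std::io::Result<()> {
       // Demo: write a small file, then read two chunks without re-opening it.
       let path = std::env::temp_dir().join("chunk_demo.bin");
       std::fs::write(&path, b"0123456789")?;

       let reader = CachedChunkReader::open(&path)?;
       let mut a = String::new();
       reader.sync_chunk_reader(2, 3)?.read_to_string(&mut a)?;
       let mut b = String::new();
       reader.sync_chunk_reader(5, 4)?.read_to_string(&mut b)?;
       println!("{} {}", a, b);
       Ok(())
   }
   ```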
   
   TPC-H Q1:
   
   Read parquet file lineitem.parquet, time spent: 590639777 ns, row group count: 60, skipped row groups: 0
   total open/seek count: 421, bytes read from FS: 97028517
   memory alloc size: 1649375985, memory alloc count: 499533 during parquet read
   
   Query 1 iteration 0 took 679.9 ms

