jizezhang commented on issue #18195:
URL: https://github.com/apache/datafusion/issues/18195#issuecomment-3543227177

   Thank you. I did take a look at how the file scan operator works. 
`DataSourceExec` opens a `FileStream` which, when polled, internally calls a 
file opener to open a file, e.g. `ParquetOpener::open`. It seems to me that the 
majority of the file-reading logic lives inside the future returned by `open`. 
Metadata appears to be loaded during physical planning for `TableScan`, which 
involves collecting statistics from the metadata and then caching them. 
`ParquetOpener::open` returns a `ParquetRecordBatchStream`, and decoding of the 
payload happens when that stream is polled (also inside 
`FileStream::poll_inner`)? 
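   To make the lazy-open / lazy-decode shape described above concrete, here is a toy sketch in plain Rust. None of these types are DataFusion's: `LazyScan` is a made-up stand-in for `FileStream`, using a synchronous `Iterator` in place of an async `Stream`, and an in-memory `Cursor` in place of a real file.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Toy analogue of the file-scan pattern: nothing is opened at
/// construction time; the "file" is opened on the first poll, and each
/// record is decoded only when the consumer asks for the next item.
struct LazyScan {
    source: Option<Cursor<Vec<u8>>>,              // stands in for the unopened file
    reader: Option<BufReader<Cursor<Vec<u8>>>>,   // stands in for the opened stream
}

impl LazyScan {
    fn new(bytes: Vec<u8>) -> Self {
        LazyScan { source: Some(Cursor::new(bytes)), reader: None }
    }
}

impl Iterator for LazyScan {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // "Opening" happens lazily on the first poll, analogous to the
        // work done inside the future returned by an opener's open().
        if self.reader.is_none() {
            self.reader = Some(BufReader::new(self.source.take()?));
        }
        // "Decoding" one record happens per poll of the stream.
        let mut line = String::new();
        match self.reader.as_mut()?.read_line(&mut line) {
            Ok(0) => None,                               // end of stream
            Ok(_) => Some(line.trim_end().to_string()),  // one decoded record
            Err(_) => None,
        }
    }
}
```

   The point of the sketch is only that both the open and the decode costs are paid inside `next()` (the poll), not at construction, which is why any compute-time metric would have to be recorded around the polling path.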
   
   In terms of tracking elapsed compute time, do we want to create a 
`BaselineMetrics` instance and track it inside `ParquetOpener::open`? But for 
decoding, how/where would we track that? It looks like we currently copy 
metrics from `ArrowReaderMetrics` 
   
https://github.com/apache/datafusion/blob/af2233675dbe8821cf388a5366e25268295ce034/datafusion/datasource-parquet/src/opener.rs#L485
   which does not seem to track elapsed compute time. But I could have 
misunderstood, so please let me know your thoughts.
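   If such tracking were added, one common shape is a timer that accumulates only the time spent inside each unit of work, similar in spirit to how an elapsed-compute counter can wrap each poll of an inner stream. This is a hedged, self-contained sketch; `ComputeTimer` is a hypothetical stand-in and not DataFusion's actual metrics API.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an elapsed-compute counter: it accumulates
/// only the wall time spent inside the closures it is asked to time.
struct ComputeTimer {
    elapsed: Duration,
}

impl ComputeTimer {
    fn new() -> Self {
        ComputeTimer { elapsed: Duration::ZERO }
    }

    /// Run one unit of work (e.g. decoding a batch) under the timer.
    /// Time spent outside `time()` calls (e.g. waiting on I/O between
    /// polls) is deliberately not counted.
    fn time<T>(&mut self, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        self.elapsed += start.elapsed();
        out
    }
}
```

   The design choice worth noting is that the timer brackets only the synchronous compute inside a poll, so time spent blocked on I/O does not inflate the metric.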
   
   Another question I have: when I tried a file scan with CSV, I also got 
an extremely small elapsed compute time, in the `ns` range. Is it expected to 
be that small for file formats other than Parquet, or is the metric probably 
not tracked for file scans in general?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

