2010YOUY01 commented on issue #18195:
URL: https://github.com/apache/datafusion/issues/18195#issuecomment-3539495693

   > Hi [@2010YOUY01](https://github.com/2010YOUY01) , I put out a short draft 
PR [#18752](https://github.com/apache/datafusion/pull/18752) for "fixing 
`elapsed_compute` baseline metrics not counting issue". The change I made was 
to track `elapsed_compute` in `FileStream`, which is opened by 
`DataSourceExec`, though looking over the code base I saw `elapsed_compute` 
timers being added in different ways across different types of streams, thus 
not sure what is the best place to track for file scan in this case.
   > 
   > Also in terms of testing, I see we have tests for ensuring a metric 
appears in `explain analyze`, but in this case the metric does exist just the 
value does not seem correct. How do we usually test for this type of changes?
   > 
   > Thank you.
   
   Testing whether the measured time is correct is tricky, and I don't think 
there is a good way to do it. But for a large parquet file scan, a value of 
several nanoseconds is definitely not reasonable.
   
   So the approach is to first understand how the operator works, then manually 
insert timers at all the places that do computation. For the parquet source, I 
believe the major parts are metadata handling and payload decoding.
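
   A rough, std-only sketch of that idea follows. It is an illustration, not 
DataFusion code: `ElapsedCompute`, `TimerGuard`, `parse_metadata`, and 
`decode_payload` are hypothetical names, and the sleep only simulates work. It 
shows a scoped timer wrapped around each compute step, plus a coarse 
lower-bound check that would catch a value of only a few nanoseconds.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an `elapsed_compute` metric:
/// accumulated nanoseconds across all timed sections.
#[derive(Default)]
struct ElapsedCompute {
    nanos: AtomicU64,
}

impl ElapsedCompute {
    /// Start a scoped timer; elapsed time is added when the guard is dropped.
    fn timer(&self) -> TimerGuard<'_> {
        TimerGuard { metric: self, start: Instant::now() }
    }

    /// Total time recorded so far.
    fn value(&self) -> Duration {
        Duration::from_nanos(self.nanos.load(Ordering::Relaxed))
    }
}

struct TimerGuard<'a> {
    metric: &'a ElapsedCompute,
    start: Instant,
}

impl Drop for TimerGuard<'_> {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed().as_nanos() as u64;
        self.metric.nanos.fetch_add(elapsed, Ordering::Relaxed);
    }
}

/// Placeholder for metadata handling (assumed to be one compute step).
fn parse_metadata(metric: &ElapsedCompute) -> usize {
    let _timer = metric.timer();
    (0..1000u64).filter(|v| v % 7 == 0).count()
}

/// Placeholder for payload decoding (assumed to be the other compute step).
fn decode_payload(metric: &ElapsedCompute, payload: &[u8]) -> u64 {
    let _timer = metric.timer();
    // Simulate expensive decoding so the lower bound below is deterministic.
    std::thread::sleep(Duration::from_millis(2));
    payload.iter().map(|b| *b as u64).sum()
}

fn main() {
    let metric = ElapsedCompute::default();
    let row_groups = parse_metadata(&metric);
    let sum = decode_payload(&metric, &[1, 2, 3]);

    // Coarse sanity check: a scan that did real work cannot report a
    // near-zero compute time.
    assert!(metric.value() >= Duration::from_millis(2));
    println!("row_groups={row_groups} sum={sum} elapsed={:?}", metric.value());
}
```

   A check like this cannot prove the timing is exact, but it can flag a 
metric that is clearly missing most of the work.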
   
   This task can get a little tricky due to the complexity of the parquet 
implementation. I don't know the exact places to count yet, but I will get back 
to you once I get a chance to dive deeper into it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

