2010YOUY01 commented on issue #18195: URL: https://github.com/apache/datafusion/issues/18195#issuecomment-3539495693
> Hi [@2010YOUY01](https://github.com/2010YOUY01), I put out a short draft PR [#18752](https://github.com/apache/datafusion/pull/18752) for "fixing `elapsed_compute` baseline metrics not counting issue". The change I made was to track `elapsed_compute` in `FileStream`, which is opened by `DataSourceExec`, though looking over the code base I saw `elapsed_compute` timers being added in different ways across different types of streams, thus not sure what is the best place to track for file scan in this case.
>
> Also in terms of testing, I see we have tests for ensuring a metric appears in `explain analyze`, but in this case the metric does exist just the value does not seem correct. How do we usually test for this type of changes?
>
> Thank you.

Testing whether the measured time is correct is tricky, and I don't think there is a good way to do it. But for a large Parquet file scan, several nanoseconds is definitely not reasonable.

So the approach is to first understand how the operator works, and then manually insert timers at all the places that do computation. For the Parquet source, I believe the major parts are metadata handling and payload decoding.

This task can get a little tricky due to the complexity of the Parquet implementation. I don't know the exact places to count right now, but I will get back once I get a chance to dive deeper into the Parquet implementation.
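To make "manually insert the timer" concrete, here is a minimal std-only sketch of the pattern, not the actual DataFusion code. `ElapsedCompute`, `TimerGuard`, and `decode_batch` are hypothetical names made up for illustration; the real change would use the existing `elapsed_compute` metric and wrap whichever Parquet code paths turn out to do the work.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Instant;

/// Hypothetical stand-in for an `elapsed_compute` metric:
/// a shared counter of accumulated nanoseconds.
#[derive(Clone, Default)]
pub struct ElapsedCompute {
    nanos: Arc<AtomicU64>,
}

impl ElapsedCompute {
    /// Start timing; the elapsed time is added when the guard is dropped.
    pub fn timer(&self) -> TimerGuard<'_> {
        TimerGuard {
            metric: self,
            start: Instant::now(),
        }
    }

    /// Total time recorded so far, in nanoseconds.
    pub fn value_nanos(&self) -> u64 {
        self.nanos.load(Ordering::Relaxed)
    }
}

/// RAII guard that records the elapsed time on drop.
pub struct TimerGuard<'a> {
    metric: &'a ElapsedCompute,
    start: Instant,
}

impl Drop for TimerGuard<'_> {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed().as_nanos() as u64;
        // Record at least 1 ns so a timed section never shows up as zero.
        self.metric.nanos.fetch_add(elapsed.max(1), Ordering::Relaxed);
    }
}

/// Hypothetical compute step standing in for payload decoding.
/// The guard covers the whole function body, so all of it is counted.
pub fn decode_batch(raw: &[u8], metric: &ElapsedCompute) -> Vec<u32> {
    let _timer = metric.timer();
    raw.iter().map(|b| u32::from(*b) * 2).collect()
}
```

Each place that does computation (metadata handling, decoding, and so on) would create its own guard against the same metric. A test for this can realistically only assert a loose sanity bound, such as the value being non-zero after a scan, rather than an exact duration.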
