Dandandan commented on pull request #9588: URL: https://github.com/apache/arrow/pull/9588#issuecomment-791332249
@yordan-pavlov thanks a lot for the detailed descriptions. I think that's a great overview of the current situation. I can help with reviewing your next PR (maybe the `10-15%` improvement would be useful to have as a PR already?).

I will also have a look at using a sampling profiler (`perf`?). So far I have been using callgrind / cachegrind for collecting profiles, which doesn't always give perfect results (although run time and instruction counts are quite correlated) and is quite slow. Using MS Visual Studio sounds like a great idea too.

I think it may be worth documenting the steps to profile Arrow/Parquet/DataFusion with different profilers. These are my current steps with callgrind: https://docs.google.com/document/d/1OqM1SSFmopcbz4JtOXJ8pXE7c1b4A2zDm4w207KBjq0/edit?usp=sharing

I think the source of the "hot path" might be very different from query to query. For example, so far I didn't see that much in the `ComplexObjectArrayReader` in my tests, so it would also be good to profile / optimize a variety of parquet files / queries.
