alamb opened a new issue, #16987: URL: https://github.com/apache/datafusion/issues/16987
### Is your feature request related to a problem or challenge?

This ticket is based on an issue reported by @tobixdev in [Discord](https://discord.com/channels/885562378132000778/1398849058898186260/1399038190395986111):

I do see some regressions related to planning time in my project (a few ms for some queries) after updating: https://codspeed.io/tobixdev/rdf-fusion/branches/feature%2Fdf_upgrade

Here are some first thoughts. I think there are some hints in the "query - Single Pattern / No Quads" benchmark. It's basically the SPARQL equivalent of `SELECT * FROM empty_table`, so planning and executing it should be fast and stable (as additional logical optimizers won't change the plan). The update to DF 49 shows a 17% regression there (2ms -> 2.4ms on the "Codspeed isolation"). The regression seems to stem from the physical optimizer. According to the flamegraph, we can attribute it to:

- 3.3% to an "ensure cooperative" pass, which is fine as this is an entirely new pass.
- The other physical optimizer passes got slower. Each only a bit, but apparently they add up (again, just a fraction of a ms).

At the end there is a screenshot of the flamegraph for `pushdown_limits`. It seems like we spend a lot of time hashing values (`TypeSignature` to be specific). I think this might be why it shows up in my project: we try to encode RDF terms in Arrow arrays, and depending on this encoding the types can become relatively large (basically a large `DataType::Union`). So I think most people working with "normal" data won't observe significant slowdowns, which is very good 🎉.

The question now is whether we can improve the hashing performance. This is the current `hash_value` implementation in `ScalarUDFImpl`:

```rust
fn hash_value(&self) -> u64 {
    let hasher = &mut DefaultHasher::new();
    self.as_any().type_id().hash(hasher);
    self.name().hash(hasher);
    self.aliases().hash(hasher);
    self.signature().hash(hasher);
    hasher.finish()
}
```

The question is whether we really need to hash all of these values. AFAIK we must only ensure that A = B -> Hash(A) = Hash(B). We could try how much the performance changes if we hash just the name (which usually implies equality, I assume). Any thoughts on that?

<img width="1615" height="381" alt="Image" src="https://github.com/user-attachments/assets/151f70d8-5fa2-44d3-9981-5f37ba7d6885" />

I also think this gives us an avenue to eliminate the regression by overriding `hash_value` in our `ScalarUDFImpl` (I'll report back soon; a sketch of this idea is included at the end of this issue). Maybe I am also overlooking something essential and this isn't as easy as I've described it above.

### Describe the solution you'd like

Changing `hash_value` to

```rust
fn hash_value(&self) -> u64 {
    let hasher = &mut DefaultHasher::new();
    self.as_any().type_id().hash(hasher);
    self.name().hash(hasher);
    hasher.finish()
}
```

basically removed the regressions, and some queries are now even faster to plan. I have a feeling that this could also help DF, but we would need to benchmark that.

EDIT: I should also mention that in our project basically every expression is a UDF, as most SPARQL functions slightly deviate from their SQL "counterparts".

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
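For reference, here is a minimal, self-contained sketch of the override idea mentioned above. The `MyRdfUdf` struct and its fields are hypothetical stand-ins, not code from rdf-fusion or DataFusion; in a real project the same hashing body would go into the `hash_value` method of the type implementing `ScalarUDFImpl`. It only illustrates the point that hashing the concrete type id and the name still satisfies A = B -> Hash(A) = Hash(B), as long as equal UDFs always report the same name, while skipping the expensive signature entirely.

```rust
use std::any::TypeId;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a UDF type; a real type would implement
/// DataFusion's `ScalarUDFImpl` and carry a full `Signature` (in the
/// reporter's case one containing a large `DataType::Union`).
#[derive(Debug, PartialEq)]
struct MyRdfUdf {
    name: String,
    /// Stand-in for the large, expensive-to-hash signature.
    signature_types: Vec<String>,
}

impl MyRdfUdf {
    /// Mirrors the proposed narrow `hash_value`: hash only the concrete type
    /// and the name, skipping the signature. Equal instances still hash
    /// equally because equality implies identical names.
    fn hash_value(&self) -> u64 {
        let hasher = &mut DefaultHasher::new();
        TypeId::of::<Self>().hash(hasher);
        self.name.hash(hasher);
        hasher.finish()
    }
}

fn main() {
    let a = MyRdfUdf {
        name: "rdf_term_equal".to_string(),
        signature_types: vec!["large union type".to_string(); 32],
    };
    let b = MyRdfUdf {
        name: "rdf_term_equal".to_string(),
        signature_types: vec!["large union type".to_string(); 32],
    };

    // The required contract: a == b implies hash(a) == hash(b), even though
    // the (large) signature never enters the hash.
    assert_eq!(a, b);
    assert_eq!(a.hash_value(), b.hash_value());
    println!("hash_value = {}", a.hash_value());
}
```

Note that the reverse direction is not required: two distinct UDFs that happen to share a name may collide on the hash, which only affects collision rates, not correctness, since callers must still compare with `equals`.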