alamb opened a new issue, #16987:
URL: https://github.com/apache/datafusion/issues/16987

   ### Is your feature request related to a problem or challenge?
   
   This is a ticket based on an issue reported by @tobixdev  in 
[Discord](https://discord.com/channels/885562378132000778/1398849058898186260/1399038190395986111)
   
   I do see some regressions related to planning time in my project (a few 
ms for some queries) after updating. 
   
   https://codspeed.io/tobixdev/rdf-fusion/branches/feature%2Fdf_upgrade
   
   Here are some first thoughts:
   
   I think there are some hints in the "query - Single Pattern / No Quads" 
benchmark. It's basically the SPARQL equivalent of "SELECT * FROM empty_table", 
so planning and executing it should be fast and stable (as additional logical 
optimizers won't change the plan). The update to DF 49 shows a 17% regression 
there (2ms -> 2.4ms on the "Codspeed isolation"), and the extra time seems to 
stem from the physical optimizer.
   
   According to the flamegraph, we can attribute that to:
   - 3.3% to the ensure-cooperative pass, which is fine as this is an entirely 
new pass.
   - The other physical optimizer passes got slower. Each just a bit, but 
apparently they "add up" (again, just a fraction of a ms).
   
   At the end there is a screenshot of the flamegraph for "pushdown_limits".
   
   It seems like we spend a lot of time hashing values (TypeSignature, to be 
specific). I think this might be why my project is hit: we encode RDF terms in 
Arrow arrays, and depending on this encoding the types can become relatively 
large (basically a large DataType::Union).
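   To illustrate why large types make this expensive, here is a toy sketch 
(the enum below is a hypothetical stand-in, not the real Arrow DataType): 
hashing a union-like type walks every child field, so the cost grows with the 
size of the type description.

   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   // Toy stand-in for a nested type description such as a union.
   // The derived Hash walks the whole structure recursively.
   #[allow(dead_code)]
   #[derive(Hash)]
   enum ToyDataType {
       Int64,
       Utf8,
       Union(Vec<(String, ToyDataType)>),
   }

   fn hash_of<T: Hash>(value: &T) -> u64 {
       let hasher = &mut DefaultHasher::new();
       value.hash(hasher);
       hasher.finish()
   }

   fn main() {
       // A "large" union with many variants, as an RDF term encoding might use.
       let big = ToyDataType::Union(
           (0..64)
               .map(|i| (format!("variant_{i}"), ToyDataType::Utf8))
               .collect(),
       );
       // Every variant name and child type is fed to the hasher.
       println!("hash = {:#x}", hash_of(&big));
   }
   ```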
   
   So I think most people working with "normal" data won't observe significant 
slowdowns, which is very good 🎉 .
   
   The question now is whether we can improve hashing performance. This is 
the current hash_value implementation in ScalarUDFImpl:
   
   ```rust
   fn hash_value(&self) -> u64 {
       let hasher = &mut DefaultHasher::new();
       self.as_any().type_id().hash(hasher);
       self.name().hash(hasher);
       self.aliases().hash(hasher);
       self.signature().hash(hasher);
       hasher.finish()
   }
   ```
   
   
   
   The question is whether we really need to hash all of these values. AFAIK we 
must only ensure that A = B -> Hash(A) = Hash(B). We could measure how much 
performance changes if we hash just the name (which usually implies equality, 
I assume). Any thoughts on that? 
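   As a std-only sketch of that invariant (the Udf struct here is hypothetical, 
not DataFusion's): hashing only the name keeps A == B -> Hash(A) == Hash(B), 
because equal UDFs necessarily have equal names; unequal UDFs are merely allowed 
to collide, which the Hash contract permits.

   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   // Hypothetical stand-in for a UDF; `signature` models a large,
   // expensive-to-hash type description.
   #[derive(PartialEq)]
   struct Udf {
       name: String,
       signature: Vec<String>, // imagine a large DataType::Union here
   }

   // Hash only the cheap field. Since Udf equality compares the name
   // (among other fields), equal values still hash equally.
   fn hash_value(udf: &Udf) -> u64 {
       let hasher = &mut DefaultHasher::new();
       udf.name.hash(hasher);
       hasher.finish()
   }

   fn main() {
       let a = Udf { name: "my_udf".into(), signature: vec!["large".into(); 1000] };
       let b = Udf { name: "my_udf".into(), signature: vec!["large".into(); 1000] };
       assert!(a == b);
       assert_eq!(hash_value(&a), hash_value(&b)); // invariant holds
       println!("hashes agree: {}", hash_value(&a) == hash_value(&b));
   }
   ```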
   
   <img width="1615" height="381" alt="Image" 
src="https://github.com/user-attachments/assets/151f70d8-5fa2-44d3-9981-5f37ba7d6885" />
   
   And I think this also gives us an avenue to eliminate the regression by 
overriding hash_value in our ScalarUDFImpl (I'll report back soon). 
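   A sketch of what such an override could look like, using a hypothetical 
trait modelled loosely on ScalarUDFImpl's shape (not the real DataFusion API): 
the default method hashes the expensive signature, and an implementor replaces 
it with a name-only hash.

   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   // Hypothetical trait: hash_value has a default implementation that
   // hashes the (potentially large) signature.
   trait UdfImpl {
       fn name(&self) -> &str;
       fn signature(&self) -> &[String];

       fn hash_value(&self) -> u64 {
           let hasher = &mut DefaultHasher::new();
           self.name().hash(hasher);
           self.signature().hash(hasher); // expensive for large types
           hasher.finish()
       }
   }

   struct MyUdf {
       sig: Vec<String>,
   }

   impl UdfImpl for MyUdf {
       fn name(&self) -> &str { "my_udf" }
       fn signature(&self) -> &[String] { &self.sig }

       // Override: hash only the name, skipping the large signature.
       fn hash_value(&self) -> u64 {
           let hasher = &mut DefaultHasher::new();
           self.name().hash(hasher);
           hasher.finish()
       }
   }

   fn main() {
       let udf = MyUdf { sig: vec!["variant".into(); 10_000] };
       println!("hash = {:#x}", udf.hash_value());
   }
   ```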
   
   Maybe I am also overlooking something essential and this isn't as easy as 
I've described it above.
   
   ### Describe the solution you'd like
   
   So changing hash_value to 
   
   ```rust
   fn hash_value(&self) -> u64 {
       let hasher = &mut DefaultHasher::new();
       self.as_any().type_id().hash(hasher);
       self.name().hash(hasher);
       hasher.finish()
   }
   ```
   
   
   basically removed the regressions, and some queries now even plan faster.
   
   I have a feeling that this could also help DataFusion itself, but we would 
need to benchmark that. 
   
   EDIT: I should also mention that in our project basically every expression 
is a UDF, as most SPARQL functions deviate slightly from their SQL 
"counterparts". 
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

