mccullocht commented on issue #15697:
URL: https://github.com/apache/lucene/issues/15697#issuecomment-3931727091

   Draft PR: https://github.com/apache/lucene/pull/15736
   
   I'd observed the short->int conversion to be quite expensive, so I removed
   half of the conversions by summing the short accumulators before widening.
   I think the conversions are expensive because they produce a vector larger
   than the preferred bit size on the host. I did not look at the code the JVM
   generated to figure out whether it was doing a good job here; IMO this
   would be easier to do cheaply on aarch64 than on x86. Summing the
   accumulators may cause some lanes to overflow the signed short range, but
   we can paper over this by zero-extending in the widening operation, since
   this dot product is implicitly unsigned.
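
To illustrate the overflow argument, here is a minimal scalar sketch. It is not the Vector API code from the PR; the class and method names are hypothetical, and it assumes each accumulator holds a non-negative partial sum whose combined value still fits in 16 unsigned bits.

```java
// Scalar sketch only: models one 16-bit lane, not the actual vectorized code.
final class UnsignedShortSumSketch {

    // Hypothetical helper: add two 16-bit accumulators with wraparound,
    // then widen to int with zero extension (treat the bits as unsigned).
    static int sumThenZeroExtend(short acc0, short acc1) {
        short wrapped = (short) (acc0 + acc1); // may wrap to a negative short
        return Short.toUnsignedInt(wrapped);   // zero-extend: keeps the low 16 bits
    }

    // For contrast: the same sum widened with sign extension.
    static int sumThenSignExtend(short acc0, short acc1) {
        short wrapped = (short) (acc0 + acc1);
        return wrapped;                        // implicit short->int sign extension
    }

    public static void main(String[] args) {
        short acc0 = 30000;
        short acc1 = 20000;
        // The true sum 50000 exceeds Short.MAX_VALUE but fits in 16 unsigned bits.
        System.out.println(sumThenZeroExtend(acc0, acc1)); // 50000
        System.out.println(sumThenSignExtend(acc0, acc1)); // -15536
    }
}
```

The zero-extended result is only correct while the true sum stays below 65536, so the number of products accumulated per short lane still has to be bounded.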
   
   @kaivalnp, if you can confirm Graviton3 performance in the microbenchmark,
   that would be appreciated, since I don't have a Graviton3 host at hand for
   testing.
   
   I'll also run luceneutil benchmarks for this change to figure out if
   there's a real win here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

