Pulkitg64 commented on PR #15549:
URL: https://github.com/apache/lucene/pull/15549#issuecomment-3758270550
Hi @rmuir ,
> I looked at your commented-out code here and it doesn't seem to use Float16Vector class but is instead doing a bunch of conversions and scalar operations
Actually, for the default VectorUtilSupport (not using Panama), I tried three
different approaches (that's why the other two are commented out), and Approach 1
gave the best performance in my benchmarks. (A rough sketch of the JMH setup behind
these numbers is at the end of this comment.)
* Approach 1 (best performance): Here I convert each short value to float32 and
pass the values to Math.fma.
```
JMH:
Benchmark                                  (size)   Mode  Cnt  Score   Error  Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.753 ± 0.001  ops/us

Code:
@Override
public short dotProduct(short[] a, short[] b) {
  assert a.length == b.length : "Vector lengths must match";
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum = Math.fma(
        Float.float16ToFloat(a[i]),
        Float.float16ToFloat(b[i]),
        sum);
  }
  return Float.floatToFloat16(sum);
}
```
* Approach 2: Here I use Float16 objects to hold the values and Float16.fma for the
computation. Internally, the Float16 implementation converts to float32 anyway,
which I think is why this is so much slower.
```
JMH:
Benchmark                                  (size)   Mode  Cnt  Score   Error  Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.077 ± 0.001  ops/us

Code:
@Override
public short dotProduct(short[] a, short[] b) {
  assert a.length == b.length : "Vector lengths must match";
  Float16 sum = Float16.valueOf(0);
  for (int i = 0; i < a.length; i++) {
    sum = Float16.fma(Float16.shortBitsToFloat16(a[i]),
        Float16.shortBitsToFloat16(b[i]), sum);
  }
  return sum.shortValue();
}
```
* Approach 3: This is an extension of Approach 1 where I try 4-way loop unrolling,
but I am not seeing any real difference in performance (a quick sanity check that
the two scalar variants agree is sketched right after the code below).
```
JMH:
Benchmark                                  (size)   Mode  Cnt  Score   Error  Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.748 ± 0.002  ops/us

Code:
@Override
public short dotProduct(short[] a, short[] b) {
  float res = 0f;
  int i = 0;
  // if the array is big, unroll it into 4 independent accumulators
  if (a.length > 32) {
    float acc1 = 0f;
    float acc2 = 0f;
    float acc3 = 0f;
    float acc4 = 0f;
    int upperBound = a.length & ~(4 - 1);
    for (; i < upperBound; i += 4) {
      acc1 = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), acc1);
      acc2 = Math.fma(Float.float16ToFloat(a[i + 1]), Float.float16ToFloat(b[i + 1]), acc2);
      acc3 = Math.fma(Float.float16ToFloat(a[i + 2]), Float.float16ToFloat(b[i + 2]), acc3);
      acc4 = Math.fma(Float.float16ToFloat(a[i + 3]), Float.float16ToFloat(b[i + 3]), acc4);
    }
    res += acc1 + acc2 + acc3 + acc4;
  }
  // tail loop for the remaining elements
  for (; i < a.length; i++) {
    res = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), res);
  }
  return Float.floatToFloat16(res);
}
```
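
Since Approach 1 and Approach 3 compute the same dot product and only differ in
accumulation order, here is a quick standalone sanity check (an illustrative sketch
added for this comment, not code from the PR; the class name, seed, and sizes are
arbitrary). It runs both scalar variants on random float16 vectors and reports the
largest difference; because the unrolled version sums in a different order, the
results can differ slightly in the last bits, so I compare values rather than
asserting bit equality:

```
import java.util.Random;

// Standalone check comparing the plain and unrolled scalar float16 dot products.
// Requires Java 20+ for Float.float16ToFloat / Float.floatToFloat16.
public class Float16DotProductCheck {

  // Approach 1: widen to float32, single accumulator
  static short dotSimple(short[] a, short[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), sum);
    }
    return Float.floatToFloat16(sum);
  }

  // Approach 3: same math, 4 independent accumulators in the main loop
  static short dotUnrolled(short[] a, short[] b) {
    float res = 0f;
    int i = 0;
    if (a.length > 32) {
      float acc1 = 0f, acc2 = 0f, acc3 = 0f, acc4 = 0f;
      int upperBound = a.length & ~(4 - 1);
      for (; i < upperBound; i += 4) {
        acc1 = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), acc1);
        acc2 = Math.fma(Float.float16ToFloat(a[i + 1]), Float.float16ToFloat(b[i + 1]), acc2);
        acc3 = Math.fma(Float.float16ToFloat(a[i + 2]), Float.float16ToFloat(b[i + 2]), acc3);
        acc4 = Math.fma(Float.float16ToFloat(a[i + 3]), Float.float16ToFloat(b[i + 3]), acc4);
      }
      res += acc1 + acc2 + acc3 + acc4;
    }
    for (; i < a.length; i++) {
      res = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), res);
    }
    return Float.floatToFloat16(res);
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    float maxDiff = 0f;
    for (int iter = 0; iter < 1000; iter++) {
      short[] a = new short[1024];
      short[] b = new short[1024];
      for (int i = 0; i < a.length; i++) {
        // random values in [-1, 1), stored as float16 bit patterns
        a[i] = Float.floatToFloat16(r.nextFloat() * 2 - 1);
        b[i] = Float.floatToFloat16(r.nextFloat() * 2 - 1);
      }
      float d = Math.abs(Float.float16ToFloat(dotSimple(a, b))
          - Float.float16ToFloat(dotUnrolled(a, b)));
      maxDiff = Math.max(maxDiff, d);
    }
    System.out.println("max |simple - unrolled| over 1000 runs: " + maxDiff);
  }
}
```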
* Note: Float16Vector is used in the PanamaVectorUtilSupport class, which is where
we are seeing the very bad performance explained in my comment above. (Sorry for
the confusion, the PR size makes it difficult to navigate.) Please let me know if
you meant something else in your comment.
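
For reference, the JMH numbers above come from a throughput benchmark over
1024-dimensional float16 vectors. Below is a minimal sketch of that kind of harness
(only an approximation of Lucene's VectorUtilBenchmark; the class name, warmup/fork
settings, and setup are my assumptions, not the exact benchmark code in the repo):

```
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Illustrative JMH harness for the scalar float16 dot product.
// Requires Java 20+ for Float.float16ToFloat / Float.floatToFloat16.
// Fork/iteration counts are assumptions (3 forks x 5 iterations = Cnt 15).
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(3)
public class Float16DotProductBench {

  @Param({"1024"})
  public int size;

  public short[] a;
  public short[] b;

  @Setup(Level.Trial)
  public void setup() {
    Random r = new Random(42);
    a = new short[size];
    b = new short[size];
    for (int i = 0; i < size; i++) {
      // random values in [0, 1), stored as float16 bit patterns
      a[i] = Float.floatToFloat16(r.nextFloat());
      b[i] = Float.floatToFloat16(r.nextFloat());
    }
  }

  @Benchmark
  public short shortDotProductScalar() {
    // Approach 1: widen each half-float to float32 and accumulate with FMA;
    // returning the result keeps JMH from dead-code-eliminating the loop
    float sum = 0f;
    for (int i = 0; i < size; i++) {
      sum = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), sum);
    }
    return Float.floatToFloat16(sum);
  }
}
```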
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]