Pulkitg64 commented on PR #15549: URL: https://github.com/apache/lucene/pull/15549#issuecomment-3770820114
I think there is a misunderstanding. Since you mentioned commented-out code, I thought you were referring to the different DefaultVectorUtil implementations, which don't use the Float16Vector class. This is the Panama implementation in the `PanamaVectorUtilSupport` class, which uses Float16Vector and relies on the JDK PR [change](https://bugs.openjdk.org/browse/JDK-8370691).

```
@Override
public short dotProduct(short[] a, short[] b) {
  int i = 0;
  short res = 0;

  // if the array size is large (> 2x platform vector size), it's worth the overhead to vectorize
  if (a.length > 2 * FLOAT16_SPECIES.length()) {
    i += FLOAT16_SPECIES.loopBound(a.length);
    res += dotProductBody(a, b, i);
  }

  // scalar tail
  for (; i < a.length; i++) {
    res = fma(a[i], b[i], res);
  }
  return res;
}

/** vectorized float16 dot product body */
private short dotProductBody(short[] a, short[] b, int limit) {
  int i = 0;
  // vector loop is unrolled 4x (4 accumulators in parallel)
  // we don't know how many the cpu can do at once, some can do 2, some 4
  Float16Vector acc1 = Float16Vector.zero(FLOAT16_SPECIES);
  Float16Vector acc2 = Float16Vector.zero(FLOAT16_SPECIES);
  Float16Vector acc3 = Float16Vector.zero(FLOAT16_SPECIES);
  Float16Vector acc4 = Float16Vector.zero(FLOAT16_SPECIES);
  int unrolledLimit = limit - 3 * FLOAT16_SPECIES.length();
  for (; i < unrolledLimit; i += 4 * FLOAT16_SPECIES.length()) {
    // one
    Float16Vector va = Float16Vector.fromArray(FLOAT16_SPECIES, a, i);
    Float16Vector vb = Float16Vector.fromArray(FLOAT16_SPECIES, b, i);
    acc1 = fma(va, vb, acc1);

    // two
    Float16Vector vc = Float16Vector.fromArray(FLOAT16_SPECIES, a, i + FLOAT16_SPECIES.length());
    Float16Vector vd = Float16Vector.fromArray(FLOAT16_SPECIES, b, i + FLOAT16_SPECIES.length());
    acc2 = fma(vc, vd, acc2);

    // three
    Float16Vector ve = Float16Vector.fromArray(FLOAT16_SPECIES, a, i + 2 * FLOAT16_SPECIES.length());
    Float16Vector vf = Float16Vector.fromArray(FLOAT16_SPECIES, b, i + 2 * FLOAT16_SPECIES.length());
    acc3 = fma(ve, vf, acc3);

    // four
    Float16Vector vg = Float16Vector.fromArray(FLOAT16_SPECIES, a, i + 3 * FLOAT16_SPECIES.length());
    Float16Vector vh = Float16Vector.fromArray(FLOAT16_SPECIES, b, i + 3 * FLOAT16_SPECIES.length());
    acc4 = fma(vg, vh, acc4);
  }

  // vector tail: less scalar computations for unaligned sizes, esp with big vector sizes
  for (; i < limit; i += FLOAT16_SPECIES.length()) {
    Float16Vector va = Float16Vector.fromArray(FLOAT16_SPECIES, a, i);
    Float16Vector vb = Float16Vector.fromArray(FLOAT16_SPECIES, b, i);
    acc1 = fma(va, vb, acc1);
  }

  // reduce
  Float16Vector res1 = acc1.add(acc2);
  Float16Vector res2 = acc3.add(acc4);
  return res1.add(res2).reduceLanes(ADD);
}
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
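For sanity-checking a vectorized float16 kernel like the one in this comment, a plain scalar reference can be useful. The following is only a minimal sketch and is not part of the PR: the class name `Fp16DotSketch` and the helper `halfToFloat` are hypothetical, the inputs are assumed to be IEEE 754 binary16 bit patterns stored in `short`, and the sum is accumulated in `float` rather than in float16, so the result can differ from a float16-accumulating path by rounding.

```java
/** Hypothetical scalar reference for a float16 dot product; not taken from the PR. */
final class Fp16DotSketch {

  /** Decodes an IEEE 754 binary16 bit pattern (stored in a short) to a float. */
  static float halfToFloat(short h) {
    int bits = h & 0xFFFF;
    int sign = (bits & 0x8000) << 16; // move the sign bit to bit 31
    int exp = (bits >>> 10) & 0x1F;   // 5-bit exponent
    int mant = bits & 0x3FF;          // 10-bit mantissa
    if (exp == 0) {
      // zero or subnormal: value is mant * 2^-24
      float v = mant * (1.0f / (1 << 24));
      return sign != 0 ? -v : v;
    }
    if (exp == 0x1F) {
      // infinity or NaN: all-ones float exponent, mantissa shifted into place
      return Float.intBitsToFloat(sign | 0x7F800000 | (mant << 13));
    }
    // normal number: rebias the exponent from 15 to 127
    return Float.intBitsToFloat(sign | ((exp + 112) << 23) | (mant << 13));
  }

  /** Scalar dot product over float16 bit patterns, accumulated in float. */
  static float dotProduct(short[] a, short[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    float res = 0f;
    for (int i = 0; i < a.length; i++) {
      res = Math.fma(halfToFloat(a[i]), halfToFloat(b[i]), res);
    }
    return res;
  }
}
```

For example, with `a = {0x3C00, 0x4000, 0x3800}` (1.0, 2.0, 0.5) and `b = {0x4000, 0x4000, (short) 0xC000}` (2.0, 2.0, -2.0), `dotProduct(a, b)` evaluates 2 + 4 - 1.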
