[
https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13706750#comment-13706750
]
Dawid Weiss commented on LUCENE-5098:
-------------------------------------
bq. I played a bit with it and rank9 was always between 15% and 20% slower than
bitCount no matter what the input was (which is still impressing since bitCount
is supposed to be an intrinsic)
I also toyed with this a bit. Interesting because rank9 is essentially Hacker's
delight implementation, but with a multiplication at the end rather than
shifts/ additions (which I think originated from the fact that multiplication
used to be much slower than additions/shifts back in the day).
I ran a few caliper bechmarks on a million longs, different distributions
(Intel I7, JDK17) just to see what the output will be.
{code}
benchmark distribution us linear runtime
HdPopCnd ZEROS 2333 =
HdPopCnd FULL 2334 =
HdPopCnd RANDOM 2329 =
HdPopCnd ONEBIT 2334 =
Rank9 ZEROS 1651 =
Rank9 FULL 1652 =
Rank9 RANDOM 1651 =
Rank9 ONEBIT 1651 =
LongBitCount ZEROS 411 =
LongBitCount FULL 394 =
LongBitCount RANDOM 391 =
LongBitCount ONEBIT 404 =
NaivePopCnt ZEROS 585 =
NaivePopCnt FULL 39019 ======
NaivePopCnt RANDOM 171365 ==============================
NaivePopCnt ONEBIT 28155 ====
{code}
The naive loop was:
{code}
int cnt = 0;
while (x != 0) {
if (((x >>>= 1) & 1) != 0L) {
cnt++;
}
}
return cnt;
{code}
and you can see that even for all zeros (when in fact there is no counting at
all) it's still slower than the intrinsified popcnt. Note full-ones is not the
worst case (I believe due to constant branch misprediction in a random input).
A zoomed-in benchmark without the naive impl.:
{code}
benchmark distribution us linear runtime
HdPopCnd ZEROS 2331 =============================
HdPopCnd FULL 2329 =============================
HdPopCnd RANDOM 2333 =============================
HdPopCnd ONEBIT 2333 =============================
Rank9 ZEROS 1650 =====================
Rank9 FULL 1650 =====================
Rank9 RANDOM 1652 =====================
Rank9 ONEBIT 1650 =====================
LongBitCount ZEROS 400 =====
LongBitCount FULL 402 =====
LongBitCount RANDOM 401 =====
LongBitCount ONEBIT 391 =====
{code}
You can see when popcnt isn't intrinsified by running with IBM's J9, for
example:
{code}
benchmark distribution ms linear runtime
LongBitCount ZEROS 2.25 =============================
LongBitCount FULL 2.22 =============================
LongBitCount RANDOM 2.25 =============================
LongBitCount ONEBIT 2.25 =============================
HdPopCnd ZEROS 2.25 =============================
HdPopCnd FULL 2.25 =============================
HdPopCnd RANDOM 2.22 =============================
HdPopCnd ONEBIT 2.22 =============================
Rank9 ZEROS 1.62 =====================
Rank9 FULL 1.62 =====================
Rank9 RANDOM 1.62 =====================
Rank9 ONEBIT 1.62 =====================
{code}
But I think they'll eventually catch up with modern cpus too so I'd stick with
Long.bitCount.
> Broadword bit selection
> -----------------------
>
> Key: LUCENE-5098
> URL: https://issues.apache.org/jira/browse/LUCENE-5098
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/other
> Reporter: Paul Elschot
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-5098.patch
>
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]