George Karpodinis created SOLR-18200:
----------------------------------------
Summary: TextProfileSignature produces different signatures for
identical input when the internal HashMap rehashes
Key: SOLR-18200
URL: https://issues.apache.org/jira/browse/SOLR-18200
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: clients - java
Affects Versions: 9.7
Environment: Solr 9.7.0 (probably affects ealier and later versions
too)
running on OpenJDK 21.0.10 LTS (Amazon Corretto 21.0.10.7.1, 64-bit Server VM)
Reporter: George Karpodinis
{{TextProfileSignature}} is meant to be a deterministic near-duplicate
detector: byte-identical input should yield the same MD5 signature. It does
not, because the signature depends on the internal layout of a {{HashMap}} that
grows and rehashes while tokens are being counted.
The algorithm:
# Tokenize and populate a {{HashMap}} with per-token counts.
# Quantize counts; drop tokens below threshold.
# Sort surviving tokens by decreasing frequency.
# MD5-hash the resulting {{{{token\tcount\n...}}}} string.
The comparator used in step 3 sorts on frequency alone.{{ }}
Any two tokens with the same quantized frequency — which is a very common case
— are left in whatever order the {{HashMap}} hands them back via
{{{}values().iterator(){}}}.
*The rehashing problem*
Token counting inserts entries one at a time. Each time the map's load factor
crosses the resize threshold, Java's {{HashMap}} allocates a larger bucket
array and redistributes every existing entry into new buckets (a rehash). After
a rehash, {{values()}} iterates in a different order than it did before the
rehash, because the underlying bucket array is different.
Concretely, this means:
* Two documents whose token sets are identical but whose insertion paths cross
the resize threshold at different points can produce different iteration orders
for tokens with equal frequency, hence different signatures.
* Increasing the document length past a resize threshold can change the
signature of the _unchanged_ high-frequency portion of the profile, because the
tail tokens forced a rehash that reordered the head.
The result is that signatures are not a function of the input alone; they are a
function of the input _and_ the map's resize history.
*Reproduction*
The following demonstration starts with a base document of six high-frequency
tokens ({{{}alpha{}}}...{{{}foxtrot{}}}, 200 occurrences each) and
incrementally appends unique filler words (frequency 1, filtered out by
quantization). The set of tokens that survive quantization is the _same_ on
every iteration — only {{HashMap}} capacity changes as fillers are inserted.
The signature nevertheless flips every time the map resizes:
=== TextProfileSignature: HashMap resize demonstration ===
Base text: [alpha, bravo, charlie, delta, echo, foxtrot] x 200 each
Adding unique filler words (freq 1, filtered by quantization).
HashMap resize thresholds: 16x0.75=12, 32x0.75=24 entries.
Added word Unique Signature Changed?
------------------------------------------------------------------------------------------
(none) 6 45a42603e173ae2a0055316fd290943e
xpad0000 7 45a42603e173ae2a0055316fd290943e
xpad0001 8 45a42603e173ae2a0055316fd290943e
xpad0002 9 45a42603e173ae2a0055316fd290943e
xpad0003 10 45a42603e173ae2a0055316fd290943e
xpad0004 11 45a42603e173ae2a0055316fd290943e
xpad0005 12 45a42603e173ae2a0055316fd290943e
xpad0006 13 9a9fea292e31eb5c2a0a404ee94fe7f9 <-- CHANGED
(resize 16->32)
Surviving tokens are identical — only HashMap iteration order changed:
Before (capacity 16) After (capacity 32)
------------------------------- -------------------------------
bravo 200 foxtrot 200
foxtrot 200 delta 200
alpha 200 echo 200
delta 200 bravo 200
echo 200 alpha 200
charlie 200 charlie 200
xpad0007 14 9a9fea292e31eb5c2a0a404ee94fe7f9
...
xpad0041 48 9a9fea292e31eb5c2a0a404ee94fe7f9
xpad0042 49 10bc3a47c80346cfc4031123527d4194 <-- CHANGED
(resize 64->128)
Surviving tokens are identical — only HashMap iteration order changed:
Before (capacity 64) After (capacity 128)
------------------------------- -------------------------------
foxtrot 200 delta 200
delta 200 echo 200
echo 200 bravo 200
bravo 200 foxtrot 200
alpha 200 alpha 200
charlie 200 charlie 200
Notable: the 32→64 resize did happen (at 25 unique tokens), but those six
survivors happened to land in the same relative bucket order at capacity 64, so
no signature change there. It took until 64→128 for the buckets to reshuffle
again. +The signature is therefore not a pure function of the surviving
profile+ — it is sensitive to the internal bucket layout at the moment
{{values().iterator()}} is called.
*Impact*
* Deduplication silently fails: similar documents that should produce
identical signatures may result in different ones
*Suggested fix*
Remove the dependence on {{HashMap}} iteration order by giving the comparator a
total order. Once ties are broken lexicographically, the final profile no
longer depends on when (or whether) the {{HashMap}} rehashed during counting.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]