[
https://issues.apache.org/jira/browse/SOLR-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
George Karpodinis updated SOLR-18200:
-------------------------------------
Attachment: REPORT.md
> TextProfileSignature produces different signatures for identical input when
> the internal HashMap rehashes
> ---------------------------------------------------------------------------------------------------------
>
> Key: SOLR-18200
> URL: https://issues.apache.org/jira/browse/SOLR-18200
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: clients - java
> Affects Versions: 9.7
> Environment: Solr 9.7.0 (probably affects ealier and later versions
> too)
> running on OpenJDK 21.0.10 LTS (Amazon Corretto 21.0.10.7.1, 64-bit Server VM)
> Reporter: George Karpodinis
> Priority: Minor
> Attachments: REPORT.md
>
>
> {{TextProfileSignature}} is meant to be a deterministic near-duplicate
> detector: byte-identical input should yield the same MD5 signature. It does
> not, because the signature depends on the internal layout of a {{HashMap}}
> that grows and rehashes while tokens are being counted.
> The algorithm:
> # Tokenize and populate a {{HashMap}} with per-token counts.
> # Quantize counts; drop tokens below threshold.
> # Sort surviving tokens by decreasing frequency.
> # MD5-hash the resulting {{{{token\tcount\n...}}}} string.
> The comparator used in step 3 sorts on frequency alone.{{ }}
> Any two tokens with the same quantized frequency — which is a very common
> case — are left in whatever order the {{HashMap}} hands them back via
> {{{}values().iterator(){}}}.
> *The rehashing problem*
> Token counting inserts entries one at a time. Each time the map's load factor
> crosses the resize threshold, Java's {{HashMap}} allocates a larger bucket
> array and redistributes every existing entry into new buckets (a rehash).
> After a rehash, {{values()}} iterates in a different order than it did before
> the rehash, because the underlying bucket array is different.
> Concretely, this means:
> * Two documents whose token sets are identical but whose insertion paths
> cross the resize threshold at different points can produce different
> iteration orders for tokens with equal frequency, hence different signatures.
> * Increasing the document length past a resize threshold can change the
> signature of the _unchanged_ high-frequency portion of the profile, because
> the tail tokens forced a rehash that reordered the head.
> The result is that signatures are not a function of the input alone; they are
> a function of the input _and_ the map's resize history.
> *Reproduction*
> The following demonstration starts with a base document of six high-frequency
> tokens ({{{}alpha{}}}...{{{}foxtrot{}}}, 200 occurrences each) and
> incrementally appends unique filler words (frequency 1, filtered out by
> quantization). The set of tokens that survive quantization is the _same_ on
> every iteration — only {{HashMap}} capacity changes as fillers are inserted.
> The signature nevertheless flips every time the map resizes:
> === TextProfileSignature: HashMap resize demonstration ===
> Base text: [alpha, bravo, charlie, delta, echo, foxtrot] x 200 each
> Adding unique filler words (freq 1, filtered by quantization).
> HashMap resize thresholds: 16x0.75=12, 32x0.75=24 entries.
> Added word Unique Signature Changed?
> ------------------------------------------------------------------------------------------
> (none) 6 45a42603e173ae2a0055316fd290943e
> xpad0000 7 45a42603e173ae2a0055316fd290943e
> xpad0001 8 45a42603e173ae2a0055316fd290943e
> xpad0002 9 45a42603e173ae2a0055316fd290943e
> xpad0003 10 45a42603e173ae2a0055316fd290943e
> xpad0004 11 45a42603e173ae2a0055316fd290943e
> xpad0005 12 45a42603e173ae2a0055316fd290943e
> xpad0006 13 9a9fea292e31eb5c2a0a404ee94fe7f9 <-- CHANGED
> (resize 16->32)
> Surviving tokens are identical — only HashMap iteration order changed:
> Before (capacity 16) After (capacity 32)
> ------------------------------- -------------------------------
> bravo 200 foxtrot 200
> foxtrot 200 delta 200
> alpha 200 echo 200
> delta 200 bravo 200
> echo 200 alpha 200
> charlie 200 charlie 200
> xpad0007 14 9a9fea292e31eb5c2a0a404ee94fe7f9
> ...
> xpad0041 48 9a9fea292e31eb5c2a0a404ee94fe7f9
> xpad0042 49 10bc3a47c80346cfc4031123527d4194 <-- CHANGED
> (resize 64->128)
> Surviving tokens are identical — only HashMap iteration order changed:
> Before (capacity 64) After (capacity 128)
> ------------------------------- -------------------------------
> foxtrot 200 delta 200
> delta 200 echo 200
> echo 200 bravo 200
> bravo 200 foxtrot 200
> alpha 200 alpha 200
> charlie 200 charlie 200
> Notable: the 32→64 resize did happen (at 25 unique tokens), but those six
> survivors happened to land in the same relative bucket order at capacity 64,
> so no signature change there. It took until 64→128 for the buckets to
> reshuffle again. +The signature is therefore not a pure function of the
> surviving profile+ — it is sensitive to the internal bucket layout at the
> moment {{values().iterator()}} is called.
> *Impact*
> * Deduplication silently fails: similar documents that should produce
> identical signatures may result in different ones
> *Suggested fix*
> Remove the dependence on {{HashMap}} iteration order by giving the comparator
> a total order. Once ties are broken lexicographically, the final profile no
> longer depends on when (or whether) the {{HashMap}} rehashed during counting.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]