[ 
https://issues.apache.org/jira/browse/SOLR-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Karpodinis updated SOLR-18200:
-------------------------------------
    Attachment:     (was: REPORT.md)

> TextProfileSignature produces different signatures for identical input when 
> the internal HashMap rehashes
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-18200
>                 URL: https://issues.apache.org/jira/browse/SOLR-18200
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: clients - java
>    Affects Versions: 9.7
>         Environment: Solr 9.7.0 (probably affects ealier and later versions 
> too)
> running on OpenJDK 21.0.10 LTS (Amazon Corretto 21.0.10.7.1, 64-bit Server VM)
>            Reporter: George Karpodinis
>            Priority: Minor
>
> {{TextProfileSignature}} is meant to be a deterministic near-duplicate 
> detector: byte-identical input should yield the same MD5 signature. It does 
> not, because the signature depends on the internal layout of a {{HashMap}} 
> that grows and rehashes while tokens are being counted.
> The algorithm:
>  # Tokenize and populate a {{HashMap}} with per-token counts.
>  # Quantize counts; drop tokens below threshold.
>  # Sort surviving tokens by decreasing frequency.
>  # MD5-hash the resulting {{{{token\tcount\n...}}}} string.
> The comparator used in step 3 sorts on frequency alone.{{ }}
> Any two tokens with the same quantized frequency — which is a very common 
> case — are left in whatever order the {{HashMap}} hands them back via 
> {{{}values().iterator(){}}}.
> *The rehashing problem*
> Token counting inserts entries one at a time. Each time the map's load factor 
> crosses the resize threshold, Java's {{HashMap}} allocates a larger bucket 
> array and redistributes every existing entry into new buckets (a rehash). 
> After a rehash, {{values()}} iterates in a different order than it did before 
> the rehash, because the underlying bucket array is different.
> Concretely, this means:
>  * Two documents whose token sets are identical but whose insertion paths 
> cross the resize threshold at different points can produce different 
> iteration orders for tokens with equal frequency, hence different signatures.
>  * Increasing the document length past a resize threshold can change the 
> signature of the _unchanged_ high-frequency portion of the profile, because 
> the tail tokens forced a rehash that reordered the head.
> The result is that signatures are not a function of the input alone; they are 
> a function of the input _and_ the map's resize history.
> *Reproduction*
> The following demonstration starts with a base document of six high-frequency 
> tokens ({{{}alpha{}}}...{{{}foxtrot{}}}, 200 occurrences each) and 
> incrementally appends unique filler words (frequency 1, filtered out by 
> quantization). The set of tokens that survive quantization is the _same_ on 
> every iteration — only {{HashMap}} capacity changes as fillers are inserted. 
> The signature nevertheless flips every time the map resizes:
> === TextProfileSignature: HashMap resize demonstration ===
> Base text: [alpha, bravo, charlie, delta, echo, foxtrot] x 200 each
> Adding unique filler words (freq 1, filtered by quantization).
> HashMap resize thresholds: 16x0.75=12, 32x0.75=24 entries.
> Added word      Unique    Signature                           Changed?
> ------------------------------------------------------------------------------------------
> (none)          6         45a42603e173ae2a0055316fd290943e
> xpad0000        7         45a42603e173ae2a0055316fd290943e
> xpad0001        8         45a42603e173ae2a0055316fd290943e
> xpad0002        9         45a42603e173ae2a0055316fd290943e
> xpad0003        10        45a42603e173ae2a0055316fd290943e
> xpad0004        11        45a42603e173ae2a0055316fd290943e
> xpad0005        12        45a42603e173ae2a0055316fd290943e
> xpad0006        13        9a9fea292e31eb5c2a0a404ee94fe7f9    <-- CHANGED 
> (resize 16->32)
>   Surviving tokens are identical — only HashMap iteration order changed:
>   Before (capacity 16)                 After (capacity 32)
>   -------------------------------      -------------------------------
>   bravo 200                            foxtrot 200
>   foxtrot 200                          delta 200
>   alpha 200                            echo 200
>   delta 200                            bravo 200
>   echo 200                             alpha 200
>   charlie 200                          charlie 200
> xpad0007        14        9a9fea292e31eb5c2a0a404ee94fe7f9
> ...
> xpad0041        48        9a9fea292e31eb5c2a0a404ee94fe7f9
> xpad0042        49        10bc3a47c80346cfc4031123527d4194    <-- CHANGED 
> (resize 64->128)
>   Surviving tokens are identical — only HashMap iteration order changed:
>   Before (capacity 64)                 After (capacity 128)
>   -------------------------------      -------------------------------
>   foxtrot 200                          delta 200
>   delta 200                            echo 200
>   echo 200                             bravo 200
>   bravo 200                            foxtrot 200
>   alpha 200                            alpha 200
>   charlie 200                          charlie 200 
> Notable: the 32→64 resize did happen (at 25 unique tokens), but those six 
> survivors happened to land in the same relative bucket order at capacity 64, 
> so no signature change there. It took until 64→128 for the buckets to 
> reshuffle again. +The signature is therefore not a pure function of the 
> surviving profile+ — it is sensitive to the internal bucket layout at the 
> moment {{values().iterator()}} is called.
> *Impact*
>  * Deduplication silently fails: similar documents that should produce 
> identical signatures may result in different ones
> *Suggested fix*
> Remove the dependence on {{HashMap}} iteration order by giving the comparator 
> a total order. Once ties are broken lexicographically, the final profile no 
> longer depends on when (or whether) the {{HashMap}} rehashed during counting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to