George Karpodinis created SOLR-18200:
----------------------------------------

             Summary: TextProfileSignature produces different signatures for 
identical input when the internal HashMap rehashes
                 Key: SOLR-18200
                 URL: https://issues.apache.org/jira/browse/SOLR-18200
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: clients - java
    Affects Versions: 9.7
         Environment: Solr 9.7.0 (probably affects ealier and later versions 
too)

running on OpenJDK 21.0.10 LTS (Amazon Corretto 21.0.10.7.1, 64-bit Server VM)
            Reporter: George Karpodinis


{{TextProfileSignature}} is meant to be a deterministic near-duplicate 
detector: byte-identical input should yield the same MD5 signature. It does 
not, because the signature depends on the internal layout of a {{HashMap}} that 
grows and rehashes while tokens are being counted.

The algorithm:
 # Tokenize and populate a {{HashMap}} with per-token counts.
 # Quantize counts; drop tokens below threshold.
 # Sort surviving tokens by decreasing frequency.
 # MD5-hash the resulting {{{{token\tcount\n...}}}} string.

The comparator used in step 3 sorts on frequency alone.{{ }}

Any two tokens with the same quantized frequency — which is a very common case 
— are left in whatever order the {{HashMap}} hands them back via 
{{{}values().iterator(){}}}.

*The rehashing problem*

Token counting inserts entries one at a time. Each time the map's load factor 
crosses the resize threshold, Java's {{HashMap}} allocates a larger bucket 
array and redistributes every existing entry into new buckets (a rehash). After 
a rehash, {{values()}} iterates in a different order than it did before the 
rehash, because the underlying bucket array is different.

Concretely, this means:
 * Two documents whose token sets are identical but whose insertion paths cross 
the resize threshold at different points can produce different iteration orders 
for tokens with equal frequency, hence different signatures.
 * Increasing the document length past a resize threshold can change the 
signature of the _unchanged_ high-frequency portion of the profile, because the 
tail tokens forced a rehash that reordered the head.

The result is that signatures are not a function of the input alone; they are a 
function of the input _and_ the map's resize history.

*Reproduction*

The following demonstration starts with a base document of six high-frequency 
tokens ({{{}alpha{}}}...{{{}foxtrot{}}}, 200 occurrences each) and 
incrementally appends unique filler words (frequency 1, filtered out by 
quantization). The set of tokens that survive quantization is the _same_ on 
every iteration — only {{HashMap}} capacity changes as fillers are inserted. 
The signature nevertheless flips every time the map resizes:
=== TextProfileSignature: HashMap resize demonstration ===

Base text: [alpha, bravo, charlie, delta, echo, foxtrot] x 200 each
Adding unique filler words (freq 1, filtered by quantization).
HashMap resize thresholds: 16x0.75=12, 32x0.75=24 entries.

Added word      Unique    Signature                           Changed?
------------------------------------------------------------------------------------------
(none)          6         45a42603e173ae2a0055316fd290943e
xpad0000        7         45a42603e173ae2a0055316fd290943e
xpad0001        8         45a42603e173ae2a0055316fd290943e
xpad0002        9         45a42603e173ae2a0055316fd290943e
xpad0003        10        45a42603e173ae2a0055316fd290943e
xpad0004        11        45a42603e173ae2a0055316fd290943e
xpad0005        12        45a42603e173ae2a0055316fd290943e
xpad0006        13        9a9fea292e31eb5c2a0a404ee94fe7f9    <-- CHANGED 
(resize 16->32)

  Surviving tokens are identical — only HashMap iteration order changed:

  Before (capacity 16)                 After (capacity 32)
  -------------------------------      -------------------------------
  bravo 200                            foxtrot 200
  foxtrot 200                          delta 200
  alpha 200                            echo 200
  delta 200                            bravo 200
  echo 200                             alpha 200
  charlie 200                          charlie 200

xpad0007        14        9a9fea292e31eb5c2a0a404ee94fe7f9
...
xpad0041        48        9a9fea292e31eb5c2a0a404ee94fe7f9
xpad0042        49        10bc3a47c80346cfc4031123527d4194    <-- CHANGED 
(resize 64->128)

  Surviving tokens are identical — only HashMap iteration order changed:

  Before (capacity 64)                 After (capacity 128)
  -------------------------------      -------------------------------
  foxtrot 200                          delta 200
  delta 200                            echo 200
  echo 200                             bravo 200
  bravo 200                            foxtrot 200
  alpha 200                            alpha 200
  charlie 200                          charlie 200 
Notable: the 32→64 resize did happen (at 25 unique tokens), but those six 
survivors happened to land in the same relative bucket order at capacity 64, so 
no signature change there. It took until 64→128 for the buckets to reshuffle 
again. +The signature is therefore not a pure function of the surviving 
profile+ — it is sensitive to the internal bucket layout at the moment 
{{values().iterator()}} is called.

*Impact*
 * Deduplication silently fails: similar documents that should produce 
identical signatures may result in different ones

*Suggested fix*

Remove the dependence on {{HashMap}} iteration order by giving the comparator a 
total order. Once ties are broken lexicographically, the final profile no 
longer depends on when (or whether) the {{HashMap}} rehashed during counting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to