[ 
https://issues.apache.org/jira/browse/SOLR-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Karpodinis updated SOLR-18200:
-------------------------------------
    Description: 
{{TextProfileSignature}} is meant to be a deterministic near-duplicate 
detector: byte-identical input should yield the same MD5 signature. It does 
not, because the signature depends on the internal layout of a {{HashMap}} that 
grows and rehashes while tokens are being counted.

The algorithm:
 # Tokenize and populate a {{HashMap}} with per-token counts.
 # Quantize counts; drop tokens below threshold.
 # Sort surviving tokens by decreasing frequency.
 # MD5-hash the resulting {{{{{}token\tcount\n...{}}}}} string.

The comparator used in step 3 sorts on frequency alone.\{{ }}

Any two tokens with the same quantized frequency — which is a very common case 
— are left in whatever order the {{HashMap}} hands them back via 
{{{}values().iterator(){}}}.

*The rehashing problem*

Token counting inserts entries one at a time. Each time the map's load factor 
crosses the resize threshold, Java's {{HashMap}} allocates a larger bucket 
array and redistributes every existing entry into new buckets (a rehash). After 
a rehash, {{values()}} iterates in a different order than it did before the 
rehash, because the underlying bucket array is different.

Concretely, this means:
 * Two documents whose token sets are identical but whose insertion paths cross 
the resize threshold at different points can produce different iteration orders 
for tokens with equal frequency, hence different signatures.
 * Increasing the document length past a resize threshold can change the 
signature of the _unchanged_ high-frequency portion of the profile, because the 
tail tokens forced a rehash that reordered the head.

The result is that signatures are not a function of the input alone; they are a 
function of the input _and_ the map's resize history.

*Reproduction*

The following demonstration starts with a base document of six high-frequency 
tokens ({{{}alpha{}}}...{{{}foxtrot{}}}, 200 occurrences each) and 
incrementally appends unique filler words (frequency 1, filtered out by 
quantization). The set of tokens that survive quantization is the _same_ on 
every iteration — only {{HashMap}} capacity changes as fillers are inserted. 
The signature nevertheless flips every time the map resizes:
{code:java}
=== TextProfileSignature: HashMap resize demonstration ===
Base text: [alpha, bravo, charlie, delta, echo, foxtrot] x 200 each
Adding unique filler words (freq 1, filtered by quantization).
HashMap resize thresholds: 16x0.75=12, 32x0.75=24 entries.
Added word Unique Signature Changed?
------------------------------------------------------------------------------------------
(none)    6 45a42603e173ae2a0055316fd290943e
xpad0000  7 45a42603e173ae2a0055316fd290943e
xpad0001  8 45a42603e173ae2a0055316fd290943e
xpad0002  9 45a42603e173ae2a0055316fd290943e
xpad0003 10 45a42603e173ae2a0055316fd290943e
xpad0004 11 45a42603e173ae2a0055316fd290943e
xpad0005 12 45a42603e173ae2a0055316fd290943e
xpad0006 13 9a9fea292e31eb5c2a0a404ee94fe7f9    <-- CHANGED (resize 16->32)
Surviving tokens are identical — only HashMap iteration order changed:
Before (capacity 16) After (capacity 32)
------------------------------- -------------------------------
bravo 200 foxtrot 200
foxtrot 200 delta 200
alpha 200 echo 200
delta 200 bravo 200
echo 200 alpha 200
charlie 200 charlie 200
xpad0007 14 9a9fea292e31eb5c2a0a404ee94fe7f9
...
xpad0041 48 9a9fea292e31eb5c2a0a404ee94fe7f9
xpad0042 49 10bc3a47c80346cfc4031123527d4194    <-- CHANGED (resize 64->128)

Surviving tokens are identical — only HashMap iteration order changed:
Before (capacity 64) After (capacity 128)
------------------------------- -------------------------------
foxtrot 200 delta 200
delta 200 echo 200
echo 200 bravo 200
bravo 200 foxtrot 200
alpha 200 alpha 200
charlie 200 charlie 200 {code}
Notable: the 32→64 resize did happen (at 25 unique tokens), but those six 
survivors happened to land in the same relative bucket order at capacity 64, so 
no signature change there. It took until 64→128 for the buckets to reshuffle 
again. +The signature is therefore not a pure function of the surviving 
profile+ — it is sensitive to the internal bucket layout at the moment 
{{values().iterator()}} is called.

*Impact*
 * Deduplication silently fails: similar documents that should produce 
identical signatures may result in different ones

*Suggested fix*

Remove the dependence on {{HashMap}} iteration order by giving the comparator a 
total order. Once ties are broken lexicographically, the final profile no 
longer depends on when (or whether) the {{HashMap}} rehashed during counting.

  was:
{{TextProfileSignature}} is meant to be a deterministic near-duplicate 
detector: byte-identical input should yield the same MD5 signature. It does 
not, because the signature depends on the internal layout of a {{HashMap}} that 
grows and rehashes while tokens are being counted.

The algorithm:
 # Tokenize and populate a {{HashMap}} with per-token counts.
 # Quantize counts; drop tokens below threshold.
 # Sort surviving tokens by decreasing frequency.
 # MD5-hash the resulting {{{{{}token\tcount\n...{}}}}} string.

The comparator used in step 3 sorts on frequency alone.\{{ }}

Any two tokens with the same quantized frequency — which is a very common case 
— are left in whatever order the {{HashMap}} hands them back via 
{{{}values().iterator(){}}}.

*The rehashing problem*

Token counting inserts entries one at a time. Each time the map's load factor 
crosses the resize threshold, Java's {{HashMap}} allocates a larger bucket 
array and redistributes every existing entry into new buckets (a rehash). After 
a rehash, {{values()}} iterates in a different order than it did before the 
rehash, because the underlying bucket array is different.

Concretely, this means:
 * Two documents whose token sets are identical but whose insertion paths cross 
the resize threshold at different points can produce different iteration orders 
for tokens with equal frequency, hence different signatures.
 * Increasing the document length past a resize threshold can change the 
signature of the _unchanged_ high-frequency portion of the profile, because the 
tail tokens forced a rehash that reordered the head.

The result is that signatures are not a function of the input alone; they are a 
function of the input _and_ the map's resize history.

*Reproduction*

The following demonstration starts with a base document of six high-frequency 
tokens ({{{}alpha{}}}...{{{}foxtrot{}}}, 200 occurrences each) and 
incrementally appends unique filler words (frequency 1, filtered out by 
quantization). The set of tokens that survive quantization is the _same_ on 
every iteration — only {{HashMap}} capacity changes as fillers are inserted. 
The signature nevertheless flips every time the map resizes:
{code:java}
=== TextProfileSignature: HashMap resize demonstration ===
Base text: [alpha, bravo, charlie, delta, echo, foxtrot] x 200 each
Adding unique filler words (freq 1, filtered by quantization).
HashMap resize thresholds: 16x0.75=12, 32x0.75=24 entries.
Added word Unique Signature Changed?
------------------------------------------------------------------------------------------
(none) 6 45a42603e173ae2a0055316fd290943e
xpad0000 7 45a42603e173ae2a0055316fd290943e
xpad0001 8 45a42603e173ae2a0055316fd290943e
xpad0002 9 45a42603e173ae2a0055316fd290943e
xpad0003 10 45a42603e173ae2a0055316fd290943e
xpad0004 11 45a42603e173ae2a0055316fd290943e
xpad0005 12 45a42603e173ae2a0055316fd290943e
xpad0006 13 9a9fea292e31eb5c2a0a404ee94fe7f9 <-- CHANGED (resize 16->32)
Surviving tokens are identical — only HashMap iteration order changed:
Before (capacity 16) After (capacity 32)
------------------------------- -------------------------------
bravo 200 foxtrot 200
foxtrot 200 delta 200
alpha 200 echo 200
delta 200 bravo 200
echo 200 alpha 200
charlie 200 charlie 200
xpad0007 14 9a9fea292e31eb5c2a0a404ee94fe7f9
...
xpad0041 48 9a9fea292e31eb5c2a0a404ee94fe7f9
xpad0042 49 10bc3a47c80346cfc4031123527d4194 <-- CHANGED (resize 64->128)
Surviving tokens are identical — only HashMap iteration order changed:
Before (capacity 64) After (capacity 128)
------------------------------- -------------------------------
foxtrot 200 delta 200
delta 200 echo 200
echo 200 bravo 200
bravo 200 foxtrot 200
alpha 200 alpha 200
charlie 200 charlie 200 {code}

Notable: the 32→64 resize did happen (at 25 unique tokens), but those six 
survivors happened to land in the same relative bucket order at capacity 64, so 
no signature change there. It took until 64→128 for the buckets to reshuffle 
again. +The signature is therefore not a pure function of the surviving 
profile+ — it is sensitive to the internal bucket layout at the moment 
{{values().iterator()}} is called.

*Impact*
 * Deduplication silently fails: similar documents that should produce 
identical signatures may result in different ones

*Suggested fix*

Remove the dependence on {{HashMap}} iteration order by giving the comparator a 
total order. Once ties are broken lexicographically, the final profile no 
longer depends on when (or whether) the {{HashMap}} rehashed during counting.


> TextProfileSignature produces different signatures for identical input when 
> the internal HashMap rehashes
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-18200
>                 URL: https://issues.apache.org/jira/browse/SOLR-18200
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: clients - java
>    Affects Versions: 9.7
>         Environment: Solr 9.7.0 (probably affects ealier and later versions 
> too)
> running on OpenJDK 21.0.10 LTS (Amazon Corretto 21.0.10.7.1, 64-bit Server VM)
>            Reporter: George Karpodinis
>            Priority: Minor
>         Attachments: REPORT.md
>
>
> {{TextProfileSignature}} is meant to be a deterministic near-duplicate 
> detector: byte-identical input should yield the same MD5 signature. It does 
> not, because the signature depends on the internal layout of a {{HashMap}} 
> that grows and rehashes while tokens are being counted.
> The algorithm:
>  # Tokenize and populate a {{HashMap}} with per-token counts.
>  # Quantize counts; drop tokens below threshold.
>  # Sort surviving tokens by decreasing frequency.
>  # MD5-hash the resulting {{{{{}token\tcount\n...{}}}}} string.
> The comparator used in step 3 sorts on frequency alone.\{{ }}
> Any two tokens with the same quantized frequency — which is a very common 
> case — are left in whatever order the {{HashMap}} hands them back via 
> {{{}values().iterator(){}}}.
> *The rehashing problem*
> Token counting inserts entries one at a time. Each time the map's load factor 
> crosses the resize threshold, Java's {{HashMap}} allocates a larger bucket 
> array and redistributes every existing entry into new buckets (a rehash). 
> After a rehash, {{values()}} iterates in a different order than it did before 
> the rehash, because the underlying bucket array is different.
> Concretely, this means:
>  * Two documents whose token sets are identical but whose insertion paths 
> cross the resize threshold at different points can produce different 
> iteration orders for tokens with equal frequency, hence different signatures.
>  * Increasing the document length past a resize threshold can change the 
> signature of the _unchanged_ high-frequency portion of the profile, because 
> the tail tokens forced a rehash that reordered the head.
> The result is that signatures are not a function of the input alone; they are 
> a function of the input _and_ the map's resize history.
> *Reproduction*
> The following demonstration starts with a base document of six high-frequency 
> tokens ({{{}alpha{}}}...{{{}foxtrot{}}}, 200 occurrences each) and 
> incrementally appends unique filler words (frequency 1, filtered out by 
> quantization). The set of tokens that survive quantization is the _same_ on 
> every iteration — only {{HashMap}} capacity changes as fillers are inserted. 
> The signature nevertheless flips every time the map resizes:
> {code:java}
> === TextProfileSignature: HashMap resize demonstration ===
> Base text: [alpha, bravo, charlie, delta, echo, foxtrot] x 200 each
> Adding unique filler words (freq 1, filtered by quantization).
> HashMap resize thresholds: 16x0.75=12, 32x0.75=24 entries.
> Added word Unique Signature Changed?
> ------------------------------------------------------------------------------------------
> (none)    6 45a42603e173ae2a0055316fd290943e
> xpad0000  7 45a42603e173ae2a0055316fd290943e
> xpad0001  8 45a42603e173ae2a0055316fd290943e
> xpad0002  9 45a42603e173ae2a0055316fd290943e
> xpad0003 10 45a42603e173ae2a0055316fd290943e
> xpad0004 11 45a42603e173ae2a0055316fd290943e
> xpad0005 12 45a42603e173ae2a0055316fd290943e
> xpad0006 13 9a9fea292e31eb5c2a0a404ee94fe7f9    <-- CHANGED (resize 16->32)
> Surviving tokens are identical — only HashMap iteration order changed:
> Before (capacity 16) After (capacity 32)
> ------------------------------- -------------------------------
> bravo 200 foxtrot 200
> foxtrot 200 delta 200
> alpha 200 echo 200
> delta 200 bravo 200
> echo 200 alpha 200
> charlie 200 charlie 200
> xpad0007 14 9a9fea292e31eb5c2a0a404ee94fe7f9
> ...
> xpad0041 48 9a9fea292e31eb5c2a0a404ee94fe7f9
> xpad0042 49 10bc3a47c80346cfc4031123527d4194    <-- CHANGED (resize 64->128)
> Surviving tokens are identical — only HashMap iteration order changed:
> Before (capacity 64) After (capacity 128)
> ------------------------------- -------------------------------
> foxtrot 200 delta 200
> delta 200 echo 200
> echo 200 bravo 200
> bravo 200 foxtrot 200
> alpha 200 alpha 200
> charlie 200 charlie 200 {code}
> Notable: the 32→64 resize did happen (at 25 unique tokens), but those six 
> survivors happened to land in the same relative bucket order at capacity 64, 
> so no signature change there. It took until 64→128 for the buckets to 
> reshuffle again. +The signature is therefore not a pure function of the 
> surviving profile+ — it is sensitive to the internal bucket layout at the 
> moment {{values().iterator()}} is called.
> *Impact*
>  * Deduplication silently fails: similar documents that should produce 
> identical signatures may result in different ones
> *Suggested fix*
> Remove the dependence on {{HashMap}} iteration order by giving the comparator 
> a total order. Once ties are broken lexicographically, the final profile no 
> longer depends on when (or whether) the {{HashMap}} rehashed during counting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to