On Fri, 2026-06-05 at 15:57 +0000, Tristan Partin wrote:
> > In any case, the input comes from a trusted
> > source (dictionary configuration), so it's not very serious.
> 
> The fix looks and sounds good. Do we have any way to test this, so it
> doesn't regress in the future?

  -- 'Ⱥ' is 2 bytes in UTF-8, 'ⱥ' is 3 bytes
  $ echo "foo barȺ" > /path/to/postgres/share/tsearch_data/mbtest.syn

  CREATE TEXT SEARCH DICTIONARY mb_syn (
    TEMPLATE = synonym,
    SYNONYMS = mbtest);

  SELECT ts_lexize('mb_syn', 'foo');

  =# SELECT ts_lexize('mb_syn', 'foo'); -- before patch
   ts_lexize 
  -----------
   {bar}
  (1 row)

  =# SELECT ts_lexize('mb_syn', 'foo'); -- after patch
   ts_lexize 
  -----------
   {barⱥ}
  (1 row)

It requires a specially-crafted synonym file, and I'm not sure it's
worth much effort to add a test for this specific path. If we see
similar bugs, it's more likely to be somewhere else that makes the same
faulty assumption.
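
For illustration, and assuming the faulty assumption is that lowercasing
never changes a string's byte length (which is what the Ⱥ/ⱥ comment in
the example suggests), a minimal sketch outside the server shows the
length change. Python's str.lower() is only a stand-in here for the
server's own case folding, which may behave differently depending on
the locale provider:

```python
# Sketch only: str.lower() stands in for the server's case folding.
upper = "\u023A"        # Ⱥ, LATIN CAPITAL LETTER A WITH STROKE
lower = upper.lower()   # ⱥ, U+2C65

# Byte lengths in UTF-8 before and after lowercasing.
upper_len = len(upper.encode("utf-8"))
lower_len = len(lower.encode("utf-8"))

print(f"{upper!r}: {upper_len} bytes, {lower!r}: {lower_len} bytes")
```

Any code that sizes an output buffer from the input's byte length would
come up one byte short for this character.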

If you do think we should add tests, we should probably add a set of
dictionary-related files (.syn, .dict, .ths, etc.) that contain a
variety of adversarial Unicode cases.
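
As a rough starting point for such files, something like the following
could list characters whose lowercase form has a different UTF-8 byte
length. This is a sketch under assumptions: the function name and the
0x3000 cutoff are arbitrary, and Python's case mapping may not match
what a given PostgreSQL locale provider does.

```python
def length_changing_lowercase(limit):
    """Return (char, lowered) pairs whose UTF-8 byte lengths differ."""
    pairs = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        low = ch.lower()
        if low != ch and len(low.encode("utf-8")) != len(ch.encode("utf-8")):
            pairs.append((ch, low))
    return pairs

# Hypothetical cutoff; raise it to scan more of the code space.
candidates = length_changing_lowercase(0x3000)
for ch, low in candidates:
    print(f"U+{ord(ch):04X} {ch} -> {low}")
```

Each pair could then be written into a .syn (or similar) file the same
way as the "foo barȺ" line in the example above.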

I'd be inclined to just commit this fix for now. It needs backpatching,
and I don't think we want to backpatch a large set of tests with it.

Regards,
        Jeff Davis


