bug#79702: request: flag for visually identical but different unicode characters

Collin Funk Sun, 26 Oct 2025 11:42:35 -0700

Hi Dave,

Dave via Bug reports for GNU grep <[email protected]> writes:


> Today, I realized that there are characters which are visually
> identical, yet have different unicodes, thus they can't be matched in
> grep.

A bit different from your example, but in some cases you can encode the
same character in multiple ways.

The character á (LATIN SMALL LETTER A WITH ACUTE) can be written as:

    * Precomposed (NFC): U+00E1
    * Decomposed (NFD):  U+0061 U+0301
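
As a rough illustration (not something grep does for you), here is a small
Python sketch using the standard unicodedata module. It shows that the two
encodings above compare unequal as raw strings but equal once both sides are
normalized to the same form. The helper name same_after_nfc is made up for
this example.

```python
import unicodedata

# The two encodings of "a with acute" listed above.
composed = "\u00e1"      # single code point U+00E1
decomposed = "a\u0301"   # U+0061 followed by combining acute U+0301


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper for this example only.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A raw comparison sees different code point sequences.
print(composed == decomposed)
# Normalizing both sides first makes them compare equal.
print(same_after_nfc(composed, decomposed))
```

So for this class of problem, normalizing both the pattern and the input
before matching is usually enough.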


> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى in both examples are exactly the same, yet the first one is
> U+06CC, and second one U+0649.
>
> From the user's perspective, it's impossible to realize which unicode
> the word is using. In fact, these two words, even though they are from
> different languages/keyboards, match perfectly on the other letters,
> and it's only ی/ى that escapes the match.
>
> While not as important, this letter has other variants like ي (notice
> two dots below it, think an umlaut) corresponding to U+064A. If you
> press Ctrl + F on your browser, you'd notice that you can match U+064A
> with the U+0649 one, but this is not the default behavior in grep either.

What browser does that? Firefox and Chrome on my machine don't match the
other character.
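
Note that the yeh variants are a different situation from the á case:
U+06CC, U+0649, and U+064A are separate letters with no decomposition
mappings, so normalization leaves them distinct. Matching them would need an
explicit folding step. Below is a minimal Python sketch of that idea. The
table YEH_FOLD and the function fold are hypothetical names made up for this
example, and the choice to fold everything to U+064A is arbitrary; nothing
like this exists in grep today.

```python
import unicodedata

# Hypothetical folding table for this example: map the two look-alike
# yeh variants onto U+064A so that all three compare equal.
YEH_FOLD = str.maketrans({"\u06cc": "\u064a", "\u0649": "\u064a"})


def fold(text: str) -> str:
    """Normalize to NFC, then apply the hand-written yeh folding."""
    return unicodedata.normalize("NFC", text).translate(YEH_FOLD)


# The two example words from the report, differing only in the last letter.
word_farsi = "\u0627\u062d\u0645\u062f\u06cc"   # ends in U+06CC
word_arabic = "\u0627\u062d\u0645\u062f\u0649"  # ends in U+0649

# Even compatibility normalization keeps the two words distinct.
print(unicodedata.normalize("NFKC", word_farsi)
      == unicodedata.normalize("NFKC", word_arabic))
# After the explicit folding step they compare equal.
print(fold(word_farsi) == fold(word_arabic))
```

Whether such folding is appropriate depends on the language of the text, which
is part of why it is hard to pick a sensible default.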

Collin
