Hi Dave,

Dave via Bug reports for GNU grep <[email protected]> writes:

> Today, I realized that there are characters which are visually
> identical, yet have different unicodes, thus they can't be matched in
> grep.

This is a bit different from your example, but in some cases the same
character can be encoded in more than one way.

The character á (LATIN SMALL LETTER A WITH ACUTE) can be written as:

    * Precomposed (NFC): U+00E1
    * Decomposed (NFD):  U+0061 U+0301
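
As a rough sketch of what that means in practice (this uses Python's
standard unicodedata module and says nothing about what grep does
internally), the two encodings compare unequal until both are
normalized to the same form:

```python
import unicodedata

precomposed = "\u00e1"   # á as one code point
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# Same rendering, different code point sequences.
same_before = precomposed == decomposed

# Normalizing both sides to one form (NFC here) makes them comparable.
same_after = (unicodedata.normalize("NFC", precomposed)
              == unicodedata.normalize("NFC", decomposed))

print(same_before, same_after)
```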


> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى in both examples are exactly the same, yet the first one is
> U+06CC, and second one U+0649.
>
> From the user's perspective, it's impossible to realize which unicode
> the word is using. In fact, these two words, even though they are from
> different languages/keyboards, match perfectly on the other letters,
> and only it's ی/ى that espaces the match.
>
> While not as important, this letter has other variants like ي (notice
> two dots below it, think an umlaut) corresponding to U+064A. If you
> press Ctrl + F on your browser, you'd notice that you can match U+064A
> with U+0649 one. but this is not the default behavior in grep either.

Which browser does that? On my machine, neither Firefox nor Chrome
matches one of those characters when searching for the other.
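
For what it's worth, Unicode normalization alone would not seem to help
with the letters from your example. A small check, again only a sketch
with Python's unicodedata module (the variable names are mine), suggests
that U+06CC, U+0649 and U+064A stay distinct under all four
normalization forms, so any matching between them would need some extra
folding step:

```python
import unicodedata

# The three letters from the report.
letters = ["\u06cc", "\u0649", "\u064a"]

for ch in letters:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Count how many distinct strings remain after each normalization form.
distinct_after = {
    form: len({unicodedata.normalize(form, ch) for ch in letters})
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(distinct_after)
```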

Collin


