Hi Dave,
Dave via Bug reports for GNU grep <[email protected]> writes:
> Today, I realized that there are characters which are visually
> identical, yet have different unicodes, thus they can't be matched in
> grep.
A bit different from your example, but in some cases you can encode the
same character in multiple ways.
The character á (LATIN SMALL LETTER A WITH ACUTE) can be written as:
* Precomposed (NFC): U+00E1
* Decomposed (NFD): U+0061 U+0301
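As a rough illustration (a minimal Python sketch, not anything grep does itself; the helper name nfc_equal is made up for this example), the two encodings compare unequal as raw strings but equal once both are normalized to NFC. As far as I can tell, the same trick does not help with your case, since U+06CC and U+0649 have no decomposition mapping to each other:

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ, so a plain code-point comparison fails.
print(precomposed == decomposed)

# After normalization both are the single code point U+00E1.
print(nfc_equal(precomposed, decomposed))

# ARABIC LETTER FARSI YEH vs. ARABIC LETTER ALEF MAKSURA: these are
# distinct characters, so normalization leaves them different.
print(nfc_equal("\u06cc", "\u0649"))
```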
> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى in both examples are exactly the same, yet the first one is
> U+06CC, and second one U+0649.
>
> From the user's perspective, it's impossible to realize which unicode
> the word is using. In fact, these two words, even though they are from
> different languages/keyboards, match perfectly on the other letters,
> and only it's ی/ى that escapes the match.
>
> While not as important, this letter has other variants like ي (notice
> two dots below it, think an umlaut) corresponding to U+064A. If you
> press Ctrl + F on your browser, you'd notice that you can match U+064A
> with the U+0649 one, but this is not the default behavior in grep either.
What browser does that? Firefox and Chrome on my machine don't match the
other character.
Collin