Isn't this what equivalence classes (like [[=e=]]) are supposed to solve? Can grep even use them?

Arnold Dave via Bug reports for GNU grep <[email protected]> wrote:

> Today, I realized that there are characters which are visually
> identical, yet have different Unicode code points, thus they can't
> be matched in grep.
>
> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى in both examples looks exactly the same, yet the first one is
> U+06CC, and the second one U+0649.
>
> From the user's perspective, it's impossible to realize which code
> point the word is using. In fact, these two words, even though they
> are from different languages/keyboards, match perfectly on the other
> letters, and it's only ی/ى that escapes the match.
>
> While not as important, this letter has other variants like ي (notice
> two dots below it, think an umlaut) corresponding to U+064A. If you
> press Ctrl + F on your browser, you'd notice that you can match U+064A
> with the U+0649 one, but this is not the default behavior in grep
> either.
>
> I understand there's no straightforward solution for this, so I'm
> thinking of having an extra flag which converts all visually similar
> characters to the same code point and then looks for matches. Thoughts?
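
For what it's worth, the folding idea in the last quoted paragraph can be sketched outside grep. The Python below is only an illustration of that proposal, not anything grep does today: the table `YEH_FOLD` and the helpers `fold` and `folded_search` are hypothetical names, and the table covers just the three code points mentioned in the report (U+06CC, U+0649, U+064A), not a general list of look-alike characters.

```python
import re

# Hypothetical folding table (an assumption for illustration): map the yeh
# variants named in the report onto a single representative code point.
YEH_FOLD = str.maketrans({
    "\u06CC": "\u0649",  # ARABIC LETTER FARSI YEH  -> ARABIC LETTER ALEF MAKSURA
    "\u064A": "\u0649",  # ARABIC LETTER YEH        -> ARABIC LETTER ALEF MAKSURA
})


def fold(text):
    """Replace each listed variant with the representative code point."""
    return text.translate(YEH_FOLD)


def folded_search(pattern, line):
    """Return True if the folded pattern matches anywhere in the folded line."""
    return re.search(fold(pattern), fold(line)) is not None


if __name__ == "__main__":
    word_farsi = "\u0627\u062D\u0645\u062F\u06CC"   # Example #1, ends in U+06CC
    word_arabic = "\u0627\u062D\u0645\u062F\u0649"  # Example #2, ends in U+0649
    print(re.search(word_farsi, word_arabic) is not None)  # plain match
    print(folded_search(word_farsi, word_arabic))          # match after folding
```

Folding both the pattern and the input means the original text is never rewritten on disk; only the comparison changes. Whether such a table belongs in grep itself or in the locale's equivalence-class data is the open question.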
