Hypothetical: What about mixed-language texts such as a Greek/French lexicon?
DM

> On Feb 21, 2017, at 4:56 PM, Troy A. Griffitts <scr...@crosswire.org> wrote:
>
> Simply don't use the UTF-8 Greek Accent filter on non-Greek texts. As you
> have discovered there are accents used in Greek which are also used in other
> languages and adverse effects will be seen for these languages. The bottom
> line is simple. Only use the UTF-8 Greek Accents filter on UTF-8 Greek texts.
>
> Hope this helps.
>
> On February 21, 2017 2:45:24 PM MST, David Haslam <dfh...@googlemail.com>
> wrote:
>
> These are the principal diacritics found in Biblical Greek that have to be
> removed with a UTF8GreekAccents filter.
>
> The first five are general accents, not particular to Greek.
> It's on account of these that the filter should not be applied to non-Greek
> text.
>
> U+0300 ̀ COMBINING GRAVE ACCENT
> U+0301 ́ COMBINING ACUTE ACCENT
> U+0308 ̈ COMBINING DIAERESIS
> U+0313 ̓ COMBINING COMMA ABOVE
> U+0314 ̔ COMBINING REVERSED COMMA ABOVE
> U+0342 ͂ COMBINING GREEK PERISPOMENI
> U+0343 ̓ COMBINING GREEK KORONIS
> U+0344 ̈́ COMBINING GREEK DIALYTIKA TONOS
> U+0345 ͅ COMBINING GREEK YPOGEGRAMMENI
>
> No other diacritics or characters should be removed.
> Though there are a few more combining accents in this block, they aren't
> really used in Biblical Greek.
> I am open to correction on this point.
>
> e.g. The right single quotation mark (U+2019) is NOT a diacritic. It should
> not be removed.
>
> Before any of these accents can be removed, they must first be separated
> from the Greek letters they are combined with.
>
> Although normalization to the decomposed form can produce this effect, as we
> have seen already, this can have undesirable side effects on any non-Greek
> text in the module that may happen to include combined or unusual
> characters.
>
> It would therefore be more sensible to simply use a comprehensive mapping
> table that replaces each possible accented character by the corresponding
> letter in the Greek alphabet. In this way the filter can completely avoid
> the need to apply any Unicode normalization.
>
> The complete mapping table would have at least 130 rows. It will need to
> take into account that there are at least 75 possible combinations of a
> letter with two accents. There are none with three.
>
> Any residual combining characters should also be removed, to cover the
> possibility that a module may have been intentionally made without
> normalizing the Greek source text by default to NFC.
>
> That's my proposal. I can easily create such a mapping table that
> programmers can use.
> I can also readily test it with a bespoke TextPipe filter.
>
> Best regards,
>
> David
>
> --
> View this message in context:
> http://sword-dev.350566.n4.nabble.com/GlobalOptionFilter-UTF8GreekAccents-and-non-Greek-modules-tp4656719p4656765.html
> Sent from the SWORD Dev mailing list archive at Nabble.com
> <http://nabble.com/>.
>
> sword-devel mailing list: sword-devel@crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
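
For anyone who wants to experiment with the mapping-table idea quoted above, here is a minimal sketch in Python 3 using only the standard unicodedata module. It is not the SWORD filter and not David's table: the names GREEK_MARKS, build_table and strip_greek_accents are hypothetical, and the table is derived from the Unicode character database one code point at a time rather than written out by hand. Removing stray combining marks only when they follow a Greek letter is an assumption of this sketch, made with the mixed Greek/French case in mind; the thread does not specify that behaviour.

```python
import unicodedata

# The nine combining marks listed in the quoted message.
GREEK_MARKS = frozenset(
    chr(cp) for cp in (0x0300, 0x0301, 0x0308, 0x0313, 0x0314,
                       0x0342, 0x0343, 0x0344, 0x0345)
)


def is_greek_letter(ch):
    """True for letters whose Unicode name starts with GREEK."""
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.name(ch, "").startswith("GREEK"))


def build_table():
    """Map each precomposed accented Greek letter to its bare letter."""
    table = {}
    # Greek and Coptic block, then Greek Extended block.
    for cp in list(range(0x0370, 0x0400)) + list(range(0x1F00, 0x2000)):
        ch = chr(cp)
        # Decompose this single code point only; the text itself is
        # never normalized.
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if not marks or not is_greek_letter(base):
            continue
        # Skip letters carrying any mark outside the list (e.g. macron).
        if all(m in GREEK_MARKS for m in marks):
            table[ch] = base
    return table


TABLE = build_table()


def strip_greek_accents(text):
    """Remove the listed accents from Greek letters, leaving other text alone."""
    out = []
    after_greek = False  # is the current base character a Greek letter?
    for ch in text:
        if ch in TABLE:
            out.append(TABLE[ch])
            after_greek = True
        elif unicodedata.combining(ch):
            # Residual combining mark: drop it only after a Greek letter
            # (assumption of this sketch, see the note above).
            if after_greek and ch in GREEK_MARKS:
                continue
            out.append(ch)
        else:
            after_greek = is_greek_letter(ch)
            out.append(ch)
    return "".join(out)
```

Because the lookup is per character, French words in the same entry are returned unchanged whether their accents are precomposed or decomposed, and U+2019 passes through untouched. Dropping the after_greek check would give the blanket removal of residual marks described in the proposal, at the cost of also stripping decomposed non-Greek text.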