Doug Ewell wrote,
> Philipp Reichmuth <uzsv2k at uni-bonn dot de> wrote:
>
> > Is there a standard way to handle ZWJ/ZWNJ in sorting & searching?
> > I think in quite a lot of situations and/or scripts it would be
> > feasible just to ignore ZWJ (or give the user the choice to ignore
> > it). Especially in a Latin context.
>
> I would ignore ZWJ, ZWNJ, and any other formatting marks in searching
> and sorting.
>

Quoting from TUS 3.0 page 317:

"ZERO WIDTH NON-JOINER or ZERO WIDTH JOINER are format control
characters. As with other such characters, they should be ignored by
processes that analyze text content. For example, a spelling-checker
or find/replace operation should filter them out. (See Section 2.7,
Special Character and Noncharacter Values, for a general discussion of
format control characters.)"

Philipp Reichmuth mentions offering the user a choice. It might not be
a bad idea for some apps to offer advanced features which would allow
the user to seek/display/process the format characters.

Note that the quote above is plain text, which effectively conveys the
information in the book. With mark-up, the book's text could be
reproduced (more-or-less) as:

"<font face="Minion"><small caps>ZERO WIDTH NON-JOINER</small caps> or <small caps>ZERO WIDTH JOINER</small caps> are format control characters. As with <br>
other such characters, they should be ignored by processes that analyze text content. For <br>
example, a spelling-checker or find/replace operation should filter them out. (See <br>
<i>Section 2.7, Special Character and Noncharacter Values,</i> for a general discussion of format <br>
control characters.)</font>"

This paragraph uses a ligature twice. In the mark-up version above,
ZWJ was inserted using SCUnipad. This doesn't make the ligature
display here. The small caps tag was made up for this example; I don't
know whether HTML has such a tag. In HTML, the font face tag used
above is deprecated.

Of the following...

fidelity (with ZWJ)
fidelity (without ZWJ)
fidelity (using the presentation form as UTF-8)

...Outlook Express' Edit/Find feature finds only the second example.
My expectation would be that a default search would find all three
instances in a Unicode-savvy application.

Best regards,

James Kass.
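
A minimal sketch of the default search behaviour described above, in Python. It is not how Outlook Express or any particular application works. The names search_key and contains are made up for this illustration. Filtering only ZWJ and ZWNJ is a simplification, and folding the presentation form with NFKC normalization is an assumption about one reasonable way to make the third example match.

```python
import unicodedata

# Assumption: only these two format controls are filtered here; a real
# application would likely cover more of the format control characters.
IGNORED = {"\u200c", "\u200d"}  # ZWNJ, ZWJ


def search_key(text):
    """Return a comparison key with ZWJ/ZWNJ removed and ligatures folded."""
    stripped = "".join(ch for ch in text if ch not in IGNORED)
    # NFKC maps U+FB01 LATIN SMALL LIGATURE FI to the two letters "fi".
    return unicodedata.normalize("NFKC", stripped)


def contains(haystack, needle):
    """Substring test carried out on the filtered, normalized keys."""
    return search_key(needle) in search_key(haystack)


samples = [
    "f\u200didelity",  # with ZWJ
    "fidelity",        # without ZWJ
    "\ufb01delity",    # presentation form U+FB01
]
matches = [contains(s, "fidelity") for s in samples]
print(matches)
```

With this key, all three spellings compare equal to the plain word. An application offering the "advanced" choice mentioned above could simply skip the filtering step when the user asks to search for the format characters themselves.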