At http://www.unicode.org/review/pri186/ is a suggestion that U+2011 NON-BREAKING HYPHEN should be given the word-break property MidLetter. One reason given is that some languages use a hyphen character between syllables within a word, and that word-boundary operations, such as word-selection or move-to-next-word commands, should ignore these hyphens.
# The advantage of making this change is that U+2011 NON-BREAKING HYPHEN
# could be used in orthographies that contain interior hyphens. This
# would avoid a requirement to encode yet another confusable
# hyphen/dash/minus character to the over-a-dozen already in Unicode.

The implication is that the alternative to the suggestion is to add a new character. I don’t see such a requirement!

Yes, it’s sometimes hard to know where word boundaries are, and Unicode certainly helps, but that doesn’t mean the characters on their own have to solve that problem completely. Knowledge about the language being used can also help, for example.

Compare this with RIGHT SINGLE QUOTATION MARK, which is used both as a closing quotation mark and as an apostrophe, so that extra knowledge can be needed to know where the word divisions are:

’Tis just a highfalutin’ idea, reminding me of that ‘sublime masterwork’ L’Étranger that I don’t approve of.

For instance, a mark-word operation on "highfalutin’" should ideally include the apostrophe, whereas one on "masterwork" should not include the closing quotation mark that follows it. It would help if the quotation mark and the apostrophe were seen as different characters here, even though they look the same, but for good reasons they are seen as the same character in Unicode. And certainly no one is suggesting different "characters" for joining and splitting apostrophes (using terminology from http://unicode.org/mail-arch/unicode-ml/y2002-m08/att-0428/01-cimaUTR29.html ).

I don’t know about the Iu Mien language mentioned in the PRI, but would it even be correct to disallow *line* breaks with NON-BREAKING HYPHEN in many of these cases? Wouldn’t it be acceptable to hyphenate some of these words?

So I would say, don’t ‘fix’ this:

* hyphens are hyphens, even when they are used for slightly different reasons in different orthographies.
* word breaking is hard, and only partially solvable by Unicode.
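
To make the point concrete, here is a rough sketch of what a MidLetter-style rule does and where it stops helping. It is a toy, not the UAX #29 algorithm: the function `words` and its `mid_letter` parameter are hypothetical names of mine, the only rule modelled is "do not break at a MidLetter character that sits between two letters", and "ab‑cd" is a made-up placeholder, not a real Iu Mien word.

```python
NON_BREAKING_HYPHEN = "\u2011"
RIGHT_SINGLE_QUOTE = "\u2019"


def words(text, mid_letter=frozenset()):
    """Toy word splitter (an assumption-laden sketch, not UAX #29).

    Letters form words. A character listed in `mid_letter` is kept inside
    a word only when it has a letter on both sides; anything else ends
    the current word.
    """
    result = []
    current = ""
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current += ch
        elif ch in mid_letter and current and next_is_letter:
            # Letter on both sides: stay inside the word.
            current += ch
        else:
            if current:
                result.append(current)
            current = ""
    if current:
        result.append(current)
    return result


# Placeholder word with an interior U+2011.
sample = "ab\u2011cd ef"
without_rule = words(sample)                      # hyphen splits the word
with_rule = words(sample, {NON_BREAKING_HYPHEN})  # hyphen stays inside

# The same kind of rule applied to the apostrophe/quotation-mark case.
quoted = "highfalutin\u2019 idea, \u2018sublime masterwork\u2019 L\u2019\u00c9tranger"
apostrophe_case = words(quoted, {RIGHT_SINGLE_QUOTE})
```

In this sketch the interior hyphen is kept once U+2011 is in the set, and "L’Étranger" stays whole, but the trailing apostrophe of "highfalutin’" is dropped exactly like the closing quotation mark after "masterwork". A per-character property cannot tell those two apart; that takes knowledge of the language.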