: > http://unicode.org/reports/tr29/#Word_Boundaries
: >
: > ...I think it would be a good idea to add some new customization options
: > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
: > behavior based on the various "tailored improvement" notes...
: Use a CharFilter.
can you elaborate on how you would suggest implementing these "tailored
improvements" using a CharFilter?
I imagine #5 ('"' used when U+05F4 should be) could be solved with a
CharFilter, since it sounds like the fundamental issue is that '"' is being
used as a substitute character in those situations and could be "fixed",
but I don't understand how any of the other examples could be dealt with
in this way.
none of them are about adding/removing/replacing any characters in the
stream; they are all about giving the ability to tailor the logic used
to decide when/where word boundaries should be found w/o changing the
content...
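(for the record, here is roughly what I mean about #5 being the one
CharFilter-shaped case -- a toy sketch in plain Java, not an actual
CharFilter subclass; the class name and method are made up for
illustration, though in Lucene something like PatternReplaceCharFilter
could host the same regex. It maps '"' to U+05F4 only when sandwiched
between Hebrew letters, so the tokenizer sees the intended character:)

```java
import java.util.regex.Pattern;

public class GershayimFix {
    // Replace ASCII '"' between two Hebrew letters with U+05F4
    // (HEBREW PUNCTUATION GERSHAYIM); '"' elsewhere is left alone.
    static final Pattern QUOTE_IN_HEBREW =
        Pattern.compile("(?<=\\p{IsHebrew})\"(?=\\p{IsHebrew})");

    static String fix(String s) {
        return QUOTE_IN_HEBREW.matcher(s).replaceAll("\u05F4");
    }

    public static void main(String[] args) {
        System.out.println(fix("\u05E6\u05D4\"\u05DC")); // gershayim restored
        System.out.println(fix("say \"hi\""));           // untouched
    }
}
```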
: 1) An option to include the various "hyphen" characters in the "MidLetter"
: class, per this note...
:
: "Some or all of the following characters may be tailored to be in
: MidLetter, depending on the environment: ..."
: [\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]
:
: It might make sense to expand this option to also include the other
: characters listed in the note below, and to name the option something along
: the lines of "splitIdentifiers"...
:
: "Characters such as hyphens, apostrophes, quotation marks, and
: colon should be taken into account when using identifiers that
: are intended to represent words of one or more natural languages.
: See Section 2.4, Specific Character Adjustments, of [UAX31].
: Treatment of hyphens, in particular, may be different in the case
: of processing identifiers than when using word break analysis for
: a Whole Word Search or query, because when handling identifiers the
: goal will be to parse maximal units corresponding to natural language
: “words,” rather than to find smaller word units within longer lexical
: units connected by hyphens."
:
: (this point about "parse maximal units" seems particularly on point for
: the use case where a user's search input consists of a single hyphenated
: word)
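(to make the MidLetter tailoring concrete: here's a toy segmenter -- not
the full UAX#29 algorithm, just the shape of rules WB6/WB7, with made-up
names -- showing how putting U+002D in MidLetter changes what counts as
one word:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MidLetterSketch {
    // Toy word segmenter: a character in midLetter joins two word pieces
    // only when it has a letter on BOTH sides (the shape of WB6/WB7).
    static List<String> words(String s, Set<Character> midLetter) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean joiner = midLetter.contains(c)
                && i > 0 && Character.isLetter(s.charAt(i - 1))
                && i + 1 < s.length() && Character.isLetter(s.charAt(i + 1));
            if (Character.isLetter(c) || joiner) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        // default: hyphen breaks; tailored: U+002D in MidLetter keeps it whole
        System.out.println(words("a state-of-the-art index", Set.of()));
        System.out.println(words("a state-of-the-art index", Set.of('-')));
    }
}
```

(note how the tailored call yields the "maximal unit" state-of-the-art
as a single token)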
:
: 2) an option to control if/when the following characters are included in
: the "MidNum" class, per the corresponding note...
:
: "Some or all of the following characters may be tailored to be in
: MidNum, depending on the environment, to allow for languages that
: use spaces as thousands separators, such as €1 234,56. ..."
: [\u0020\u00A0\u2007\u2008\u2009\u202F]
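(same toy-sketch idea for the MidNum case -- again just the shape of the
relevant rules, WB11/WB12, with made-up names; the point is that a
NO-BREAK SPACE used as a thousands separator only joins when it sits
between two digits:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MidNumSketch {
    // Toy number segmenter: a character in midNum stays inside a numeric
    // token only when flanked by digits on both sides (shape of WB11/WB12).
    static List<String> numbers(String s, Set<Character> midNum) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean joiner = midNum.contains(c)
                && i > 0 && Character.isDigit(s.charAt(i - 1))
                && i + 1 < s.length() && Character.isDigit(s.charAt(i + 1));
            if (Character.isDigit(c) || joiner) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        String price = "1\u00A0234,56"; // NO-BREAK SPACE as thousands separator
        System.out.println(numbers(price, Set.of()));
        System.out.println(numbers(price, Set.of('\u00A0', ',')));
    }
}
```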
:
: 3) an option to control whether word breaking should happen between
: scripts, per this note...
:
: "Normally word breaking does not require breaking between different
: scripts. However, adding that capability may be useful in combination
: with other extensions of word segmentation. ..."
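(a sketch of what "breaking between scripts" could mean in practice,
using the JDK's real Character.UnicodeScript lookup -- class/method
names are made up, and COMMON/INHERITED characters are deliberately
ignored for brevity, which a real implementation would have to handle:)

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptBreakSketch {
    // Split a run of letters wherever the Unicode script changes.
    static List<String> splitByScript(String word) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            Character.UnicodeScript prev =
                Character.UnicodeScript.of(word.codePointAt(i - 1));
            Character.UnicodeScript cur =
                Character.UnicodeScript.of(word.codePointAt(i));
            if (prev != cur) {          // script boundary => word boundary
                out.add(word.substring(start, i));
                start = i;
            }
        }
        out.add(word.substring(start));
        return out;
    }

    public static void main(String[] args) {
        // Latin "Wiki" + Katakana "\u30DA\u30C7\u30A3\u30A2"
        System.out.println(splitByScript("Wiki\u30DA\u30C7\u30A3\u30A2"));
    }
}
```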
:
: 4) an option to control whether U+002E should be included in ExtendNumLet,
: per this note...
:
: "To allow acronyms like “U.S.A.”, a tailoring may include U+002E FULL
: STOP in ExtendNumLet"
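(sketching that one too -- a toy tokenizer, made-up names, where '.'
behaves like ExtendNumLet in the sense of WB13a/WB13b: it glues to an
adjacent letter on either side. Note the known tradeoff this tailoring
implies: a sentence-final period after a word would also attach:)

```java
import java.util.ArrayList;
import java.util.List;

public class AcronymSketch {
    // Toy tokenizer: when dotIsExtendNumLet is true, '.' joins a token
    // if it touches a letter on either side, so "U.S.A." survives whole,
    // trailing dot included.
    static List<String> tokens(String s, boolean dotIsExtendNumLet) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean glue = dotIsExtendNumLet && c == '.'
                && ((i > 0 && Character.isLetter(s.charAt(i - 1)))
                    || (i + 1 < s.length() && Character.isLetter(s.charAt(i + 1))));
            if (Character.isLetter(c) || glue) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("made in U.S.A. today", false));
        System.out.println(tokens("made in U.S.A. today", true));
    }
}
```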
[...]
: 6) an option to break after an apostrophe when it is followed by a vowel,
: per this note...
:
: "The use of the apostrophe is ambiguous. ... In some languages,
: such as French and Italian, tailoring to break words when the
: character after the apostrophe is a vowel may yield better results
: in more cases. This can be done by adding a rule WB5a ..."
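(and a last toy sketch for the WB5a-style tailoring -- made-up names,
and a deliberately incomplete vowel list, since a real implementation
would need a proper per-language definition of "vowel". An apostrophe
between letters normally joins, as in English "don't", but with the
tailoring on we break after it when the next letter is a vowel, so
French "l'objectif" splits as "l'" + "objectif":)

```java
import java.util.ArrayList;
import java.util.List;

public class ApostropheSketch {
    static final String VOWELS = "aeiouy\u00E0\u00E2\u00E9\u00E8\u00EA\u00EE\u00F4\u00F9\u00FB";

    // Toy segmenter: apostrophe-between-letters joins (like MidLetter);
    // with wb5a on, break AFTER the apostrophe when a vowel follows,
    // keeping the apostrophe with the preceding article.
    static List<String> words(String s, boolean wb5a) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean betweenLetters = c == '\''
                && i > 0 && Character.isLetter(s.charAt(i - 1))
                && i + 1 < s.length() && Character.isLetter(s.charAt(i + 1));
            if (Character.isLetter(c)) {
                cur.append(c);
            } else if (betweenLetters) {
                cur.append(c);
                if (wb5a && VOWELS.indexOf(Character.toLowerCase(s.charAt(i + 1))) >= 0) {
                    out.add(cur.toString());   // e.g. "l'" stays with the article
                    cur.setLength(0);
                }
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("l'objectif est d'aujourd'hui", false));
        System.out.println(words("l'objectif est d'aujourd'hui", true));
    }
}
```

(note that "aujourd'hui" stays whole even with the tailoring on, because
'h' is not a vowel -- which is exactly the behavior the note is after)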
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]