: >         http://unicode.org/reports/tr29/#Word_Boundaries
: >
: > ...I think it would be a good idea to add some new customization options
: > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
: > behavior based on the various "tailored improvement" notes...


: Use a CharFilter.

can you elaborate on how you would suggest implementing these "tailored 
improvements" using a CharFilter?

I imagine #5 ('"' used when U+05F4 should be) could be solved with a 
CharFilter, since it sounds like the fundamental issue is that '"' is being 
used as a substitute character in situations that could be "fixed", but 
i don't understand how any of the other examples could be dealt with 
in this way.
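to make #5 concrete: something shaped like this could run ahead of the 
tokenizer. (toy standalone code, not Lucene's CharFilter API -- the class 
and method names here are made up for illustration.)

```java
// Sketch of the kind of context-sensitive substitution a CharFilter could
// do before tokenization: replace '"' with U+05F4 (HEBREW PUNCTUATION
// GERSHAYIM) when it sits between two Hebrew letters, leaving ordinary
// quotation marks alone.
public class GershayimFixer {
    public static String fix(String input) {
        StringBuilder sb = new StringBuilder(input);
        for (int i = 1; i < sb.length() - 1; i++) {
            if (sb.charAt(i) == '"'
                    && isHebrewLetter(sb.charAt(i - 1))
                    && isHebrewLetter(sb.charAt(i + 1))) {
                sb.setCharAt(i, '\u05F4');
            }
        }
        return sb.toString();
    }

    private static boolean isHebrewLetter(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HEBREW
                && Character.isLetter(c);
    }
}
```

since it only rewrites characters (content), not boundary decisions, this 
one fits the CharFilter model -- which is exactly why the other items don't.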

none of them are about adding/removing/replacing any characters in the 
stream, they are all about giving the ability to tailor the logic used 
to decide when/where word boundaries should be found w/o changing the 
content...


: 1) An option to include the various "hyphen" characters in the "MidLetter"
: class per this note...
: 
:   "Some or all of the following characters may be tailored to be in
:    MidLetter, depending on the environment: ..."
:    [\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]
:    
: It might make sense to expand this option to also include the other
: characters listed in the note below, and name the option something along
: the lines of "splitIdentifiers"...
: 
:   "Characters such as hyphens, apostrophes, quotation marks, and
:    colon should be taken into account when using identifiers that
:    are intended to represent words of one or more natural languages.
:    See Section 2.4, Specific Character Adjustments, of [UAX31].
:    Treatment of hyphens, in particular, may be different in the case
:    of processing identifiers than when using word break analysis for
:    a Whole Word Search or query, because when handling identifiers the
:    goal will be to parse maximal units corresponding to natural language
:    “words,” rather than to find smaller word units within longer lexical
:    units connected by hyphens."
:    
: (this point about "parse maximal units" seems particularly on point for
: the use case where a user's search input consists of a single hyphenated
: word)
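for reference, here's roughly what tailoring (1) amounts to. (the character 
set is copied from the TR29 note; the method is a toy stand-in for the real 
WB6/WB7 rules, not the JFlex grammar.)

```java
import java.util.Set;

// Sketch: "MidLetter includes hyphens" tailoring. A MidLetter character
// does not break a word when it is flanked by letters on both sides, so
// with this tailoring "half-baked" stays a single word.
public class TailoredMidLetter {
    // Hyphen-like characters the TR29 note says may be tailored into MidLetter
    static final Set<Character> HYPHENS = Set.of(
        '\u002D', '\uFF0D', '\uFE63', '\u058A', '\u1806', '\u2010',
        '\u2011', '\u30A0', '\u30FB', '\u201B', '\u055A', '\u0F0B');

    // True if a word break should occur at position i (toy logic).
    static boolean breaksAt(String s, int i) {
        if (!HYPHENS.contains(s.charAt(i))) return true; // not handled here
        return !(i > 0 && i + 1 < s.length()
                && Character.isLetter(s.charAt(i - 1))
                && Character.isLetter(s.charAt(i + 1)));
    }
}
```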
: 
: 2) an option to control if/when the following characters are included in
: the "MidNum" class, per the corresponding note...
: 
:   "Some or all of the following characters may be tailored to be in
:    MidNum, depending on the environment, to allow for languages that
:    use spaces as thousands separators, such as €1 234,56.  ..."
:    [\u0020\u00A0\u2007\u2008\u2009\u202F]
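to illustrate (2): with those space characters tailored into MidNum, the 
whole "1 234,56" scans as one numeric token. (toy segmentation logic, a 
stand-in for the real WB11/WB12 rules; method name is made up.)

```java
// Sketch: MidNum tailoring for space-as-thousands-separator locales. The
// space characters come from the TR29 note; ',' and '.' are already MidNum
// in the default rules. A MidNum character only joins when it sits between
// two digits.
public class TailoredMidNum {
    static final String MID_NUM = "\u0020\u00A0\u2007\u2008\u2009\u202F,.";

    // End index of the numeric token starting at 'start' (toy logic).
    static int numberTokenEnd(String s, int start) {
        int i = start;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) { i++; continue; }
            if (MID_NUM.indexOf(c) >= 0 && i > start && i + 1 < s.length()
                    && Character.isDigit(s.charAt(i + 1))) { i++; continue; }
            break;
        }
        return i;
    }
}
```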
:    
: 3) an option to control whether word breaking should happen between
: scripts, per this note...
: 
:   "Normally word breaking does not require breaking between different
:    scripts. However, adding that capability may be useful in combination
:    with other extensions of word segmentation.  ..."
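for (3), the check such an option would key on is something like this, 
using the JDK's own script data. (sketch only; a real tailoring would also 
have to handle the rule interactions, and COMMON/INHERITED characters like 
digits and punctuation shouldn't force a break.)

```java
// Sketch: detect a script change between two adjacent code points via
// java.lang.Character.UnicodeScript. COMMON and INHERITED never trigger
// a break on their own.
public class ScriptBreak {
    static boolean scriptChanges(int cp1, int cp2) {
        Character.UnicodeScript s1 = Character.UnicodeScript.of(cp1);
        Character.UnicodeScript s2 = Character.UnicodeScript.of(cp2);
        if (s1 == Character.UnicodeScript.COMMON
                || s1 == Character.UnicodeScript.INHERITED
                || s2 == Character.UnicodeScript.COMMON
                || s2 == Character.UnicodeScript.INHERITED) {
            return false;
        }
        return s1 != s2;
    }
}
```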
:    
: 4) an option to control whether U+002E should be included in ExtendedNumLet
: per this note ...
: 
:   "To allow acronyms like “U.S.A.”, a tailoring may include U+002E FULL
:    STOP in ExtendNumLet"
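the effect of (4) in miniature: ExtendNumLet never breaks from an adjacent 
letter or digit (WB13a/WB13b), so putting '.' in that class keeps "U.S.A." 
whole. (toy predicate, names made up, not the JFlex grammar.)

```java
// Sketch: with dotIsExtendNumLet=true there is no break between a letter
// and '.', so "U.S.A." stays one token; with false, it splits at each dot.
public class AcronymTailoring {
    static boolean breaksBetween(char a, char b, boolean dotIsExtendNumLet) {
        boolean aJoins = Character.isLetterOrDigit(a)
                || (dotIsExtendNumLet && a == '.');
        boolean bJoins = Character.isLetterOrDigit(b)
                || (dotIsExtendNumLet && b == '.');
        return !(aJoins && bJoins);
    }
}
```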

     [...]

: 6) an option to break at apostrophes followed by vowels, per this note...
: 
:   "The use of the apostrophe is ambiguous. ... In some languages,
:    such as French and Italian, tailoring to break words when the 
:    character after the apostrophe is a vowel may yield better results
:    in more cases. This can be done by adding a rule WB5a ..."
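the WB5a idea in (6) boils down to a check like this, so French "l'avion" 
yields "l" + "avion" while English "don't" stays whole. (sketch only; the 
vowel set here is a bare-bones assumption and would need per-language 
tailoring.)

```java
// Sketch of the WB5a tailoring: break after an apostrophe (ASCII or
// U+2019) when the next character is a vowel.
public class Wb5aSketch {
    static final String VOWELS = "aeiouyàâéèêëîïôùûAEIOUY";

    static boolean breakAfterApostrophe(char apostrophe, char next) {
        return (apostrophe == '\'' || apostrophe == '\u2019')
                && VOWELS.indexOf(next) >= 0;
    }
}
```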




-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]