: What many of us who are not familiar with the tokenizing rules of the
: standard tokenizer just realized is that it's not a good default for
: English and probably most other European languages.

Jira is down for reindexing at the moment, so I can't file this suggestion 
as a new feature proposal (or comment on its relevance in SOLR-3723), and 
I probably won't be online for another few days, so I wanted to get this 
idea out there now for discussion instead of waiting.

        ---

Based on the link Steven mentioned, clarifying exactly why 
StandardTokenizer works the way it does...

        http://unicode.org/reports/tr29/#Word_Boundaries

...I think it would be a good idea to add some new customization options 
to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the 
behavior based on the various "tailored improvement" notes...

  "It is not possible to provide a uniform set of rules that resolves 
   all issues across languages or that handles all ambiguous situations 
   within a given language. The goal for the specification presented in 
   this annex is to provide a workable default; tailored implementations 
   can be more sophisticated."

1) An option to include the various "hyphen" characters in the "MidLetter" 
class per this note...

  "Some or all of the following characters may be tailored to be in 
   MidLetter, depending on the environment: ..."
   [\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]

It might make sense to expand this option to also include the other 
characters listed in the note below, and name the option something along 
the lines of "splitIdentifiers"...

  "Characters such as hyphens, apostrophes, quotation marks, and 
   colon should be taken into account when using identifiers that 
   are intended to represent words of one or more natural languages. 
   See Section 2.4, Specific Character Adjustments, of [UAX31]. 
   Treatment of hyphens, in particular, may be different in the case 
   of processing identifiers than when using word break analysis for 
   a Whole Word Search or query, because when handling identifiers the 
   goal will be to parse maximal units corresponding to natural language 
   “words,” rather than to find smaller word units within longer lexical 
   units connected by hyphens."

(this point about "parse maximal units" seems particularly on point for 
the use case where a user's search input consists of a single hyphenated 
word)
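
To make the current default concrete, here's a quick self-contained 
sketch of the split this option would avoid (written against a recent 
Lucene where StandardTokenizer has a no-arg constructor; older releases 
take a Version and a Reader instead, so adjust accordingly)...

  import java.io.StringReader;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class HyphenDemo {
    public static void main(String[] args) throws Exception {
      StandardTokenizer tok = new StandardTokenizer();
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.setReader(new StringReader("mother-in-law"));
      tok.reset();
      while (tok.incrementToken()) {
        // the default UAX#29 rules emit: mother, in, law;
        // with U+002D tailored into MidLetter this would be one token
        System.out.println(term.toString());
      }
      tok.end();
      tok.close();
    }
  }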

2) An option to control if/when the following characters are included in 
the "MidNum" class, per the corresponding note...

  "Some or all of the following characters may be tailored to be in
   MidNum, depending on the environment, to allow for languages that 
   use spaces as thousands separators, such as €1 234,56.  ..."
   [\u0020\u00A0\u2007\u2008\u2009\u202F]
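
The same idea sketched with just the JDK's UAX#29 word iterator, which 
makes it easy to see every boundary, including the space that currently 
splits the number...

  import java.text.BreakIterator;
  import java.util.Locale;

  public class MidNumDemo {
    public static void main(String[] args) {
      String text = "\u20AC1 234,56";  // "€1 234,56"
      BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
      wb.setText(text);
      int start = wb.first();
      for (int end = wb.next(); end != BreakIterator.DONE;
           start = end, end = wb.next()) {
        // expected default segments: [€] [1] [ ] [234,56] -- the comma
        // is already MidNum, but the space breaks; tailoring U+0020
        // into MidNum would keep "1 234,56" together as one number
        System.out.println("[" + text.substring(start, end) + "]");
      }
    }
  }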

3) An option to control whether word breaking should happen between 
scripts, per this note...

  "Normally word breaking does not require breaking between different 
   scripts. However, adding that capability may be useful in combination 
   with other extensions of word segmentation.  ..."
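
A quick sketch of the behavior this option would change; to the default 
rules, Latin and Cyrillic letters are both just ALetter, so no break 
happens at the script change...

  import java.text.BreakIterator;
  import java.util.Locale;

  public class ScriptBreakDemo {
    public static void main(String[] args) {
      String text = "demo\u0434\u0435\u043C\u043E";  // "demo" + Cyrillic "демо"
      BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
      wb.setText(text);
      int start = wb.first();
      for (int end = wb.next(); end != BreakIterator.DONE;
           start = end, end = wb.next()) {
        // expected under the default rules: one segment, [demoдемо];
        // an option to break between scripts would yield [demo] [демо]
        System.out.println("[" + text.substring(start, end) + "]");
      }
    }
  }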

4) An option to control whether U+002E should be included in ExtendNumLet, 
per this note...

  "To allow acronyms like “U.S.A.”, a tailoring may include U+002E FULL
   STOP in ExtendNumLet"
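
Sketching the current default for this one: the dots between the letters 
are already MidNumLet, so "U.S.A" holds together, but the trailing dot 
falls off...

  import java.text.BreakIterator;
  import java.util.Locale;

  public class AcronymDemo {
    public static void main(String[] args) {
      String text = "U.S.A. budget";
      BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
      wb.setText(text);
      int start = wb.first();
      for (int end = wb.next(); end != BreakIterator.DONE;
           start = end, end = wb.next()) {
        // expected default segments: [U.S.A] [.] [ ] [budget];
        // with U+002E in ExtendNumLet the trailing dot would stick,
        // giving [U.S.A.] as a single unit
        System.out.println("[" + text.substring(start, end) + "]");
      }
    }
  }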

5) An option to control whether '"' and U+05F3 are treated as MidLetter, 
based on this note...

  "For Hebrew, a tailoring may include a double quotation mark between 
   letters, because legacy data may contain that in place of U+05F4 ..."
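
A sketch for the Hebrew case; note that what this prints depends on which 
Unicode version your JDK's break data implements, since later UAX#29 
revisions folded exactly this case into the default rules...

  import java.text.BreakIterator;
  import java.util.Locale;

  public class HebrewQuoteDemo {
    public static void main(String[] args) {
      // a Hebrew acronym written with an ASCII double quote in place
      // of U+05F4 HEBREW PUNCTUATION GERSHAYIM
      String text = "\u05DE\u05E0\u05DB\"\u05DC";
      BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
      wb.setText(text);
      int start = wb.first();
      for (int end = wb.next(); end != BreakIterator.DONE;
           start = end, end = wb.next()) {
        // older break data splits at the quote into three segments;
        // data based on newer UAX#29 revisions keeps it as one word
        System.out.println("[" + text.substring(start, end) + "]");
      }
    }
  }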

6) An option to control breaking at apostrophes followed by vowels, per 
this note...

  "The use of the apostrophe is ambiguous. ... In some languages, 
   such as French and Italian, tailoring to break words when the 
   character after the apostrophe is a vowel may yield better results 
   in more cases. This can be done by adding a rule WB5a ..."
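
And a sketch for the apostrophe case; under the default rules an 
apostrophe between letters does not break, so the French elision stays 
glued to the following word...

  import java.text.BreakIterator;
  import java.util.Locale;

  public class ApostropheDemo {
    public static void main(String[] args) {
      String text = "l'objectif";
      BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
      wb.setText(text);
      int start = wb.first();
      for (int end = wb.next(); end != BreakIterator.DONE;
           start = end, end = wb.next()) {
        // expected under the default rules: one segment, [l'objectif];
        // a WB5a-style tailoring would break after the apostrophe when
        // a vowel follows, yielding [l'] [objectif]
        System.out.println("[" + text.substring(start, end) + "]");
      }
    }
  }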




-Hoss