: What many of us who are not familiar with the tokenizing rules of the
: standard tokenizer just realized is that it's not a good default for
: English and probably most other European languages.
Jira is down for reindexing at the moment, so I can't file this suggestion
as a new feature proposal (or comment on its relevance to SOLR-3723), and
I probably won't be online for another few days, so I wanted to get this
idea out there now for discussion instead of waiting.
---
Based on the link Steven mentioned clarifying why exactly
StandardTokenizer works the way it does...
http://unicode.org/reports/tr29/#Word_Boundaries
...I think it would be a good idea to add some new customization options
to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
behavior based on the various "tailored improvement" notes...
"It is not possible to provide a uniform set of rules that resolves
all issues across languages or that handles all ambiguous situations
within a given language. The goal for the specification presented in
this annex is to provide a workable default; tailored implementations
can be more sophisticated."
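To make the current defaults concrete, here's a quick sketch (against a
recent Lucene; the no-arg StandardTokenizer constructor assumes 5.x or
later, older versions need a Version/Reader) with a little tokenize()
helper I'll reuse in the examples below...

  import java.io.StringReader;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class Uax29Demo {
    // print the tokens StandardTokenizer produces for the given text
    static void tokenize(String text) throws Exception {
      StandardTokenizer tok = new StandardTokenizer();
      tok.setReader(new StringReader(text));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        System.out.println(term.toString());
      }
      tok.end();
      tok.close();
    }

    public static void main(String[] args) throws Exception {
      // U+002D is not in MidLetter by default, so the hyphenated word splits
      tokenize("half-baked"); // prints: half / baked
    }
  }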
1) An option to include the various "hyphen" characters in the "MidLetter"
class per this note...
"Some or all of the following characters may be tailored to be in
MidLetter, depending on the environment: ..."
[\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]
It might make sense to expand this option to also include the other
characters listed in the note below, and name the option something along
the lines of "splitIdentifiers"...
"Characters such as hyphens, apostrophes, quotation marks, and
colon should be taken into account when using identifiers that
are intended to represent words of one or more natural languages.
See Section 2.4, Specific Character Adjustments, of [UAX31].
Treatment of hyphens, in particular, may be different in the case
of processing identifiers than when using word break analysis for
a Whole Word Search or query, because when handling identifiers the
goal will be to parse maximal units corresponding to natural language
“words,” rather than to find smaller word units within longer lexical
units connected by hyphens."
(this point about "parse maximal units" seems particularly on point for
the use case where a user's search input consists of a single hyphenated
word)
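Until an option like that exists, the closest workaround I can think of is
to fold the hyphens away with a char filter before tokenizing, so the
maximal unit survives. This is only a sketch using the existing
MappingCharFilter API, and deleting the hyphen (rather than remapping it)
is a lossy choice made purely for illustration...

  import java.io.Reader;
  import java.io.StringReader;
  import org.apache.lucene.analysis.charfilter.MappingCharFilter;
  import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class HyphenWorkaround {
    public static void main(String[] args) throws Exception {
      // delete a few of the hyphen-like characters before tokenization so
      // StandardTokenizer never sees the break opportunity
      NormalizeCharMap.Builder b = new NormalizeCharMap.Builder();
      for (String h : new String[] {
          "\u002D", "\uFF0D", "\uFE63", "\u2010", "\u2011"}) {
        b.add(h, "");
      }
      Reader filtered =
          new MappingCharFilter(b.build(), new StringReader("half-baked"));
      StandardTokenizer tok = new StandardTokenizer();
      tok.setReader(filtered);
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        System.out.println(term.toString()); // prints: halfbaked
      }
      tok.end();
      tok.close();
    }
  }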
2) An option to control if/when the following characters are included in
the "MidNum" class, per the corresponding note...
"Some or all of the following characters may be tailored to be in
MidNum, depending on the environment, to allow for languages that
use spaces as thousands separators, such as €1 234,56. ..."
[\u0020\u00A0\u2007\u2008\u2009\u202F]
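For reference, those space characters always break today, while the comma
already sits in MidNum, so a French-style price splits only at the space.
Reusing the tokenize() helper from the first sketch:

  // U+0020 is not in MidNum by default, so the spaced number breaks there,
  // while the comma (already MidNum) keeps "234,56" together
  Uax29Demo.tokenize("\u20AC1 234,56"); // prints: 1 / 234,56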
3) An option to control whether word breaking should happen between
scripts, per this note...
"Normally word breaking does not require breaking between different
scripts. However, adding that capability may be useful in combination
with other extensions of word segmentation. ..."
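If I'm reading the default property classes right, Latin and Cyrillic
letters are all just "ALetter", so a mixed-script run currently comes out
as a single token (again via the tokenize() helper above):

  // no break at the script boundary under the default rules, since both
  // Latin and Cyrillic letters are classified as ALetter
  Uax29Demo.tokenize("abc\u0430\u0431\u0432"); // prints one token: abcабв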
4) An option to control whether U+002E should be included in ExtendNumLet,
per this note...
"To allow acronyms like “U.S.A.”, a tailoring may include U+002E FULL
STOP in ExtendNumLet"
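Right now the interior dots survive (U+002E is already in MidNumLet) but
the trailing one is dropped, which is presumably what the ExtendNumLet
tailoring is meant to fix (using the tokenize() helper above):

  // interior full stops are kept via MidNumLet, but the trailing one is
  // dropped; moving U+002E into ExtendNumLet would keep it at the edge
  Uax29Demo.tokenize("U.S.A."); // prints: U.S.A   (no trailing dot)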
5) An option to control whether '"' and U+05F3 are treated as MidLetter,
based on this note...
"For Hebrew, a tailoring may include a double quotation mark between
letters, because legacy data may contain that in place of U+05F4 ..."
6) An option to break at apostrophes when the following character is a
vowel, per this note...
"The use of the apostrophe is ambiguous. ... In some languages,
such as French and Italian, tailoring to break words when the
character after the apostrophe is a vowel may yield better results
in more cases. This can be done by adding a rule WB5a ..."
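Today the ASCII apostrophe is in MidNumLet, so the elided article stays
glued to the word; the WB5a tailoring would instead split off "l'" (again
via the tokenize() helper above):

  // the apostrophe currently keeps the token whole; a WB5a tailoring
  // would break it into "l'" and "avion"
  Uax29Demo.tokenize("l'avion"); // prints the single token: l'avion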
-Hoss