I'd like to contribute, for potential inclusion or simply to help out others in
the community, a small set of enhancements I've made to the porter tokenizer. This
implementation shares most of its code with the current porter tokenizer, as
the changes are really just in the tokenizer prior to the stemming operation.
This small patch implements an additional tokenizer, which I am calling
"porterPlus", for lack of further inspiration.
The code is based on several observations made while attempting to use the
current porter tokenizer on a common English/utf-8 dataset:
- Only a limited number of accented characters are common in English text.
- If the accents simply weren't there, the words would be stemmed
appropriately, but the porter stemmer gives up on a word when it sees any
non-ascii (multi-byte utf-8) character, leading to perceived failures in the
search queries.
- The porter stemmer, by its very nature, is not intended to work for
non-English text, so we can write off the major part of the utf-8 character
set and concentrate on the characters used in common European languages,
particularly those that have been adopted into English usage.
- Additionally, there are a number of punctuation characters commonly rendered
in utf-8 that are missed by the regular porter tokenizer (hyphen and
typographic quotes are good examples).
This small patch does the following:
- Defines a new tokenizer "porterPlus" which shares most of its code
with the regular porter tokenizer
- Identifies a small subset of utf-8 characters for special handling.
In the case of common accented varieties of regular ascii characters, the
accents are dropped, leaving the unaccented character only. For instance, sauté
is converted to saute. The resultant word is passed as usual into the porter
stemmer.
- Also identifies a small subset of utf-8 characters to treat as
delimiters, as they would otherwise be treated as part of another token,
leading to search failures. (hyphen, typographic quotes, etc).
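
For readers without the attachment, the two steps above can be sketched
roughly as follows. This is not the code from the patch: the function names
(pp_decode, pp_fold, pp_is_delim, pp_normalize), the choice of the Latin-1
Supplement range (U+00C0 to U+00FF) for folding, and the particular set of
delimiter code points are all assumptions made for illustration. The sketch
folds to lowercase, on the assumption that the text is lowercased before
stemming anyway, and leaves characters with no obvious single-letter
equivalent (such as ß and æ) untouched.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decode one utf-8 sequence of up to 3 bytes.
** Malformed or longer sequences are returned one byte at a time. */
static unsigned int pp_decode(const unsigned char *z, size_t n, size_t *pLen){
  unsigned int c = z[0];
  if( c<0x80 ){ *pLen = 1; return c; }
  if( (c&0xE0)==0xC0 && n>=2 && (z[1]&0xC0)==0x80 ){
    *pLen = 2;
    return ((c&0x1F)<<6) | (z[1]&0x3F);
  }
  if( (c&0xF0)==0xE0 && n>=3 && (z[1]&0xC0)==0x80 && (z[2]&0xC0)==0x80 ){
    *pLen = 3;
    return ((c&0x0F)<<12) | ((unsigned int)(z[1]&0x3F)<<6) | (z[2]&0x3F);
  }
  *pLen = 1;
  return c;
}

/* Map U+00C0..U+00FF to an unaccented lowercase ascii letter.
** Returns 0 when there is no mapping ('.' entries in the table). */
static char pp_fold(unsigned int cp){
  static const char map[] =
    "aaaaaa.ceeeeiiii.nooooo.ouuuuy.."   /* U+00C0..U+00DF */
    "aaaaaa.ceeeeiiii.nooooo.ouuuuy.y";  /* U+00E0..U+00FF */
  char c;
  if( cp<0xC0 || cp>0xFF ) return 0;
  c = map[cp-0xC0];
  return c=='.' ? 0 : c;
}

/* Typographic punctuation treated as token delimiters (assumed set). */
static int pp_is_delim(unsigned int cp){
  switch( cp ){
    case 0x2010: case 0x2011:             /* hyphen, non-breaking hyphen */
    case 0x2012: case 0x2013: case 0x2014:/* figure, en and em dashes */
    case 0x2018: case 0x2019:             /* single typographic quotes */
    case 0x201C: case 0x201D:             /* double typographic quotes */
      return 1;
  }
  return 0;
}

/* Write a normalized copy of zIn into zOut (at most nOut bytes, including
** the terminator): accents folded, delimiters replaced by a space, all
** other bytes copied unchanged. Returns the length written. */
static size_t pp_normalize(const char *zIn, char *zOut, size_t nOut){
  const unsigned char *z = (const unsigned char*)zIn;
  size_t n = strlen(zIn), i = 0, o = 0;
  if( nOut==0 ) return 0;
  while( i<n && o+1<nOut ){
    size_t len;
    unsigned int cp = pp_decode(&z[i], n-i, &len);
    char f = pp_fold(cp);
    if( f ){
      zOut[o++] = f;
    }else if( pp_is_delim(cp) ){
      zOut[o++] = ' ';
    }else{
      if( o+len>=nOut ) break;            /* keep room for the terminator */
      memcpy(&zOut[o], &z[i], len);
      o += len;
    }
    i += len;
  }
  zOut[o] = 0;
  return o;
}
```

Calling pp_normalize("saut\xC3\xA9", buf, sizeof buf) leaves "saute" in buf,
which can then be handed to the stemmer as plain ascii. A real tokenizer would
more likely do this inline while scanning for token boundaries rather than in
a separate pass, so treat the separate function as a simplification.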
In our use so far, these small changes have meant that we now normalize away
all of the important utf-8 characters in our input text, which gives us 100%
searchability of significant input tokens.
The patch (to the 3.6.22 amalgamation) is attached.
James