On Sat, Jun 06, 2009 at 06:05:57PM -0700, [email protected] wrote:

> (As an aside, your default tokeniser  doesn’t work with Greek, which can
> have mid-commas, but the only two  words with mid commas [ὅ,τι and
> ὅ,τιδηποτε] are stop-words, so I don’t worry about it.)

Two points about this.

The Tokenizer's default pattern contains a character class matching a straight
apostrophe and a curly apostrophe.  You can add a comma to that character
class and those words will be indexed as atomic terms:

  "\\w+(?:[\\x{2019}']\\w+)*"   # default
  "\\w+(?:[\\x{2019}',]\\w+)*"  # modified

That way, TermQueries for those terms will return expected results.
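To see the difference, here's a rough sketch in Python (not KinoSearch's actual Perl code) of how the two patterns split "ὅ,τι" -- note that Perl's \x{2019} escape becomes \u2019 in Python:

```python
import re

# Default pattern: word chars, optionally joined by apostrophes.
default_pat = re.compile(r"\w+(?:[\u2019']\w+)*")
# Modified pattern: comma added to the joining character class.
modified_pat = re.compile(r"\w+(?:[\u2019',]\w+)*")

text = "ὅ,τι"
print(default_pat.findall(text))   # ['ὅ', 'τι']  -- two tokens
print(modified_pat.findall(text))  # ['ὅ,τι']     -- one atomic term
```

With the modified pattern the mid-comma word survives tokenization intact, so a TermQuery for "ὅ,τι" can match it directly.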

However, in practice it may not matter that much because of the default
behavior of QueryParser.  As a first stage, QueryParser splits on whitespace.
As a second stage, QueryParser tokenizes using an Analyzer.  If a single
token is returned, a TermQuery is created.  If multiple tokens are returned, a
PhraseQuery is created.  So, "ὅ,τι" gets turned into a phrase which will only
match things like "ὅ,τι", "ὅ τι", "ὅ-τι", etc -- but won't match "ὅ" or "τι"
in isolation.
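A toy model of that two-stage logic, in Python rather than KinoSearch's actual implementation:

```python
import re

# The default Tokenizer pattern (Perl's \x{2019} written as \u2019).
token_pat = re.compile(r"\w+(?:[\u2019']\w+)*")

def parse(query):
    """Sketch: split on whitespace, tokenize each chunk, then pick
    TermQuery for one token and PhraseQuery for several."""
    queries = []
    for chunk in query.split():
        tokens = token_pat.findall(chunk)
        if len(tokens) == 1:
            queries.append(("TermQuery", tokens[0]))
        elif len(tokens) > 1:
            queries.append(("PhraseQuery", tokens))
    return queries

print(parse("ὅ,τι"))     # [('PhraseQuery', ['ὅ', 'τι'])]
print(parse("foo.com"))  # [('PhraseQuery', ['foo', 'com'])]
print(parse("walrus"))   # [('TermQuery', 'walrus')]
```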

Dunno if "ὅ" ever appears on its own, but this QueryParser behavior is pretty
handy: if a user searches for 'foo.com', that's the same as '"foo com"' and
will match documents that contain "www.foo.com", "[email protected]",
"http://www.foo.com", in addition to plain old "foo.com".  The goal was to
squeeze maximum recall from the simplest possible implementation; even though
the mid-comma words are stopwords, they may come in handy in phrase searches
-- just like the stopwords in the search query "I am the Walrus".

The weakness of this approach is that it does not conflate e.g. 'U.S.A.' and
'USA', but methinks it's still a pretty decent default.

Marvin Humphrey
