On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter
<[email protected]> wrote:
>
> : >         http://unicode.org/reports/tr29/#Word_Boundaries
> : >
> : > ...I think it would be a good idea to add some new customization options
> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
> : > behavior based on the various "tailored improvement" notes...
>
>
> : Use a CharFilter.
>
> can you elaborate on how you would suggest implementing these "tailored
> improvements" using a CharFilter?

Generally the easiest way is to replace your ambiguous character (such
as your hyphen-minus) with what your domain-specific knowledge tells
you it should be.
If you are indexing a dictionary where this ambiguous hyphen-minus is
being used to separate syllables, then replace it with \u2027
(hyphenation point), and it won't trigger word boundaries.
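
As a rough illustration of that substitution, here is a minimal sketch using only java.util.regex. The class and method names are made up for this example, and the "letter on both sides" rule is an assumption about what counts as a syllable separator in your data. In a real analysis chain the same pattern and replacement would go into a regex char filter such as PatternReplaceCharFilter rather than a standalone method.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: shows only the substitution step.
class HyphenationPointMapper {
    // Assumption: a hyphen-minus with a letter on both sides is a syllable separator.
    private static final Pattern INNER_HYPHEN =
        Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    // Replace each such hyphen-minus with U+2027 (hyphenation point).
    static String mapHyphens(String text) {
        return INNER_HYPHEN.matcher(text).replaceAll("\u2027");
    }

    public static void main(String[] args) {
        System.out.println(mapHyphens("dic-tion-ar-y"));
    }
}
```

Hyphens that are not between two letters (for example in "a - b") are left alone, so only the domain-specific case is rewritten.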

But it really depends on how you want your whole analysis process to
work. For example, if you want to treat "foo-bar" as equivalent to
foobar, or U.S.A as equivalent to USA, because that's how you want your
search to work, then I would just replace the punctuation with U+2060
(word joiner). Follow that with the NFKC_CF Unicode normalization
filter in the ICU package, which will remove the joiner, since it's a
Format character.

So I think you can handle all of your cases there with a simple regex
charfilter, substituting the correct 'semantics' depending on
ultimately how you want it to work, and then just apply nfkc_cf at the
end.
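
To make the two steps concrete, here is a small stdlib-only sketch. The names are hypothetical, and stripFormat is only a stand-in: the real nfkc_cf normalization in the ICU package also case-folds and applies compatibility mappings, while this sketch mimics just the removal of Format (Cf) characters. The choice of hyphen and period as join candidates is an assumption for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: step 1 is the char-filter substitution,
// step 2 imitates one effect of nfkc_cf (dropping Format characters).
class WordJoinerSketch {
    // Assumption: '-' or '.' between two letters should join, not split.
    private static final Pattern JOIN_CANDIDATE =
        Pattern.compile("(?<=\\p{L})[-.](?=\\p{L})");

    // General category Cf (Format), which includes U+2060 word joiner.
    private static final Pattern FORMAT_CHARS = Pattern.compile("\\p{Cf}");

    // Step 1: substitute U+2060 so the punctuation no longer separates words.
    static String joinWords(String text) {
        return JOIN_CANDIDATE.matcher(text).replaceAll("\u2060");
    }

    // Step 2 (stand-in for nfkc_cf): remove Format characters from the token.
    static String stripFormat(String token) {
        return FORMAT_CHARS.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripFormat(joinWords("foo-bar")));
        System.out.println(stripFormat(joinWords("U.S.A")));
    }
}
```

Note that a trailing period, as in "U.S.A.", is not touched by this pattern because no letter follows it.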

As for the last example, there is no need for the tokenizer to be
involved. We already have ElisionFilter for this, and the Italian and
French analyzers use it to remove a default (but configurable) set of
contractions. The Solr example configuration for these languages is set
up with these, too.
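
For readers unfamiliar with elision removal, the idea can be sketched in plain Java as below. This is not the ElisionFilter source: the class name is invented, and the article set is a small hypothetical subset rather than the default list shipped with the French or Italian analyzers.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of elision removal on a single token.
class ElisionSketch {
    // Assumption: a tiny sample of elided articles, not the real default set.
    static final Set<String> ARTICLES = Set.of("l", "d", "j", "qu");

    // Drop a leading article plus apostrophe, e.g. "l'avion" -> "avion".
    static String stripElision(String token) {
        int idx = -1;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            // Accept both the ASCII apostrophe and U+2019.
            if (c == '\'' || c == '\u2019') {
                idx = i;
                break;
            }
        }
        if (idx > 0) {
            String prefix = token.substring(0, idx).toLowerCase(Locale.ROOT);
            if (ARTICLES.contains(prefix)) {
                return token.substring(idx + 1);
            }
        }
        // Prefix is not a configured article (or no apostrophe): keep the token.
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l'avion"));
        System.out.println(stripElision("aujourd'hui"));
    }
}
```

Because the article set is configurable, a token like "aujourd'hui" passes through unchanged here: "aujourd" is not in the set.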

If you really don't like these dead-simple approaches, then just use
the tokenizer in the ICU package, which is more flexible than the
JFlex implementation: it lets you supply custom grammars at runtime,
and it can split by script, among other things.


-- 
lucidworks.com
