On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter <[email protected]> wrote:
>
> : > http://unicode.org/reports/tr29/#Word_Boundaries
> : >
> : > ...I think it would be a good idea to add some new customization options
> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
> : > behavior based on the various "tailored improvement" notes...
>
>
> : Use a CharFilter.
>
> can you elaborate on how you would suggest implementing these "tailored
> improvements" using a CharFilter?

Generally the easiest way is to replace your ambiguous character (such
as your hyphen-minus) with what your domain-specific knowledge tells you
it should be.

If you are indexing a dictionary where this ambiguous hyphen-minus is
being used to separate syllables, then replace it with \u2027
(hyphenation point), and it won't trigger word boundaries.

But it really depends on how you want your whole analysis process to
work. E.g., in the above example, if you want to treat "foo-bar" as
equivalent to foobar, or you want to treat U.S.A as equivalent to USA,
because that's how you want your search to work, then I would just
replace the character with U+2060 (word joiner). Follow that with the
NFKC_CF Unicode normalization filter in the ICU package, which will
remove the word joiner, since it is a Format character.

So I think you can handle all of your cases there with a simple regex
CharFilter, substituting the correct 'semantics' depending on how you
ultimately want it to work, and then just apply NFKC_CF at the end.

As for the last example, there is no need for the tokenizer to be
involved. We already have ElisionFilter for this, and the Italian and
French analyzers use it to remove a default (but configurable) set of
contractions. The Solr example for these languages is set up with
these, too.

If you really don't like these dead-simple approaches, then just use the
tokenizer in the ICU package, which is more flexible than the JFlex
implementation: it lets you supply custom grammars at runtime, can split
by script, and so on.

--
lucidworks.com
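
To make the substitute-then-normalize idea concrete, here is a minimal sketch in plain Python. It is not Lucene or Solr code: `substitute` and `approx_nfkc_cf` are hypothetical helper names, the regex is only one guess at what "ambiguous" means for your data, and the second function merely approximates NFKC_CF (NFKC, case folding, then dropping Format characters) with the standard library.

```python
import re
import unicodedata

WORD_JOINER = "\u2060"        # General category Cf (Format)
HYPHENATION_POINT = "\u2027"  # Word_Break=MidLetter in UAX #29

# Assumption for this sketch: the "ambiguous" characters are a hyphen-minus
# or full stop sitting directly between two letters.
_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_])[-.](?=[^\W\d_])")


def substitute(text, replacement=WORD_JOINER):
    """Stand-in for the regex char filter: rewrite text before tokenizing."""
    return _BETWEEN_LETTERS.sub(replacement, text)


def approx_nfkc_cf(token):
    """Rough stand-in for NFKC_CF: NFKC, case fold, drop Format characters."""
    folded = unicodedata.normalize("NFKC", token).casefold()
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
```

With these helpers, `approx_nfkc_cf(substitute("foo-bar"))` gives `foobar` and `approx_nfkc_cf(substitute("U.S.A"))` gives `usa`, while `substitute("syl-la-ble", HYPHENATION_POINT)` keeps the syllable marks as U+2027. In a real analysis chain the tokenizer runs between the two steps, so the substitution has to happen at the character level and the normalization at the token level.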
