On Thu, Aug 9, 2012 at 11:43 PM, Chris Hostetter <[email protected]> wrote:
>
> : What many of us not familiar with the tokenizing rules of the standard
> : tokenizer just realized is that it's not a good default for english
> : and probably most other european languages.
>
> Jira is down for reindexing at the moment, so i can't file this suggestion
> as a new Feature proposal (or comment on its relevance in SOLR-3723) and
> i probably won't be online for another few days, so i wanted to get this
> idea out there now for discussion instead of waiting.
>
> ---
>
> Based on the link steven mentioned clarifying why exactly
> StandardTokenizer works the way it does...
>
> http://unicode.org/reports/tr29/#Word_Boundaries
>
> ...I think it would be a good idea to add some new customization options
> to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
> behavior based on the various "tailored improvement" notes...
>
Use a CharFilter.

-- 
lucidimagination.com
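
To make that concrete, here is a rough sketch of a schema.xml field type that rewrites characters before StandardTokenizer sees them. The field type name and the regex are made up for illustration and are not taken from this thread or from SOLR-3723; the pattern simply turns a '.' that sits between two letters into a space, so the tokenizer is handed two words instead of one. Swap in whatever mapping matches the tailoring you actually want.

```xml
<!-- Hypothetical field type: the name and the pattern are illustrative assumptions. -->
<fieldType name="text_tailored" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- CharFilters run on the raw character stream, before the tokenizer.
         Here: letter, '.', letter  becomes  letter, ' ', letter.
         The lookahead leaves the second letter unconsumed so that
         "a.b.c" is rewritten to "a b c". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\p{L})\.(?=\p{L})"
                replacement="$1 "/>
    <!-- The tokenizer itself is unchanged; it only sees the rewritten text. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

For plain one-to-one character mappings, solr.MappingCharFilterFactory with a mapping file is probably the simpler choice. Either way the char filter corrects offsets, so highlighting should still point at the original text.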
