[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

Refactor a lot of this analyzer:
* Move hhmm-specific stuff (like WordType, CharType, Utility) into the hhmm package.
* Move/remove TokenFilter-specific stuff (like lowercasing and full-width conversion) out of the hhmm package (uses LowerCaseFilter, adds FullWidthFilter).
* Remove the stopwords list. It was full of various punctuation, all of which got converted by "SegTokenFilter" into a comma anyway. Instead, just don't emit punctuation.

To me, this refactoring makes the analyzer easier to debug. It also happens to improve performance (up to 2500k/s now).

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch,
> LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to, say, CJKAnalyzer
> on Chinese text.
> This patch improves the internal hhmm implementation.
> Time to index my Chinese corpus is 75% of the previous time.
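
For readers unfamiliar with the terms in the comment above, here is a rough, standalone sketch of what full-width conversion, lowercasing, and "don't emit punctuation" could look like. It is not code from the patch and does not use the Lucene TokenFilter API; the class FullWidthSketch and its methods are hypothetical names made up for illustration, and the punctuation test (a token with no letters or digits) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the FullWidthFilter from the patch.
final class FullWidthSketch {

    // Map full-width ASCII variants (U+FF01..U+FF5E) to plain ASCII
    // (U+0021..U+007E) and the ideographic space (U+3000) to a normal space.
    static char toHalfWidth(char c) {
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - 0xFEE0);
        }
        if (c == '\u3000') {
            return ' ';
        }
        return c;
    }

    // Convert every character to half-width, then lowercase it.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            sb.append(Character.toLowerCase(toHalfWidth(token.charAt(i))));
        }
        return sb.toString();
    }

    // Assumption: a token counts as punctuation when it contains no letter
    // or digit. An empty token is treated the same way.
    static boolean isPunctuation(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetterOrDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Drop punctuation-only tokens instead of mapping them to a stopword,
    // and normalize everything else.
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            if (!isPunctuation(token)) {
                out.add(normalize(token));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Full-width "Lucene", a full-width comma, and two Chinese characters.
        List<String> tokens = Arrays.asList(
                "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45", "\uFF0C", "\u4E2D\u6587");
        System.out.println(filter(tokens));
    }
}
```

Dropping punctuation at emit time, as in filter above, avoids the extra step of rewriting it to a comma and then matching that comma against a stopword list.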