[ https://issues.apache.org/jira/browse/LUCENE-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2522:
--------------------------------

    Attachment: LUCENE-2522.patch

Here is a quickly done patch, just to get started (not really for committing):
* converted their tests to BaseTokenStreamTestCase tests
* changed it to use CharTermAttribute instead of TermAttribute
* added clearAttributes()
* made the class final
* added a Solr factory

The code is nice: it is set up to work on Unicode codepoints etc., but I think we can improve it by using CharArrayMaps for speed and by using Lucene's codepoint I/O stuff in CharUtils.

> add simple japanese tokenizer, based on tinysegmenter
> -----------------------------------------------------
>
>                 Key: LUCENE-2522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2522
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2522.patch
>
>
> TinySegmenter (http://www.chasen.org/~taku/software/TinySegmenter/) is a tiny
> japanese segmenter.
> It was ported to java/lucene by Kohei TAKETA <k-...@void.in>,
> and is under friendly license terms (BSD, some files explicitly disclaim
> copyright to the source code, giving a blessing instead)
> Koji knows the author, and already contacted about incorporating into lucene:
> {noformat}
> I've contacted Takeda-san who is the creater of Java version of
> TinySegmenter. He said he is happy if his program is part of Lucene.
> He is a co-author of my book about Solr published in Japan, BTW. ;-)
> {noformat}
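
For readers unfamiliar with what "working on Unicode codepoints" means in the comment above, here is a minimal sketch using only the JDK. It is not taken from the attached patch or from Lucene; the class name `CodePointWalk` and the method `codePoints` are hypothetical names chosen for illustration. It shows a loop that advances by whole codepoints, so a surrogate pair is treated as a single character rather than two `char` values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of LUCENE-2522.patch or Lucene.
public final class CodePointWalk {
    private CodePointWalk() {}

    /** Returns the Unicode codepoints of text, reading each surrogate pair as one unit. */
    public static List<Integer> codePoints(CharSequence text) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i); // decodes a surrogate pair if present
            out.add(cp);
            i += Character.charCount(cp);            // advance by 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        // U+20BB7 (outside the BMP, so two chars in UTF-16) followed by U+91CE and U+5BB6.
        String s = "\uD842\uDFB7\u91CE\u5BB6";
        System.out.println(s.length() + " chars, " + codePoints(s).size() + " codepoints");
        // prints "4 chars, 3 codepoints"
    }
}
```

A tokenizer that indexed by `char` instead would split the first character of this string in half, which is the kind of problem codepoint-aware code avoids.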