Hi;

The Kuromoji project page has an online tokenizer and more information: http://www.atilika.org/
It may help you.
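
One thing that may be worth trying, as a rough sketch rather than a tested solution: leave the tokenizer as it is and re-join adjacent tokens afterwards, checking each joined run against your own list of phrases. The class and method names below (`PhraseMatcher`, `findPhrases`) are hypothetical and not part of Lucene or Kuromoji; the sketch assumes you have already collected the token strings into a list. Kuromoji also accepts a user dictionary, which may be a cleaner way to keep such words unbroken.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Kuromoji.
final class PhraseMatcher {
    private PhraseMatcher() {}

    /**
     * Joins every run of 1..maxTokens adjacent tokens and returns the runs
     * whose concatenation appears in watchList, ordered by start position.
     */
    static List<String> findPhrases(List<String> tokens, Set<String> watchList, int maxTokens) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder joined = new StringBuilder();
            int end = Math.min(tokens.size(), start + maxTokens);
            for (int i = start; i < end; i++) {
                // Grow the candidate phrase by one token at a time.
                joined.append(tokens.get(i));
                if (watchList.contains(joined.toString())) {
                    hits.add(joined.toString());
                }
            }
        }
        return hits;
    }
}
```

With the tokens from your message (無臭, 正, ハメ, 撮り, 中, 出し) and a watch list holding the three combined phrases, this returns all three phrases. It only catches phrases you list in advance, so the list would have to come from your language expert.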

Thanks;
Furkan KAMACI

2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratna...@gmail.com>:

> I am trying to analyze some Japanese web pages for the presence of
> slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
> tokenizer breaks the text up into proper words, I am more interested in
> catching the slang terms, which seem to result from combining various
> "safe" words.
>
> A few examples of words that, as per our in-house Japanese language expert
> (I have no knowledge of Japanese whatsoever), are slang and should be
> caught "unbroken" are:
>
> 無臭正 - is a bad word and we want to catch it as is, but the tokenizer
> breaks it up into 無臭 and 正, which are both apparently safe.
>
> ハメ撮り - it was broken into ハメ and 撮り, again both safe on their own
> but bad when combined.
>
> 中出し - broken into 中 and 出し, but should have been left as is, as it
> represents a bad phrase.
>
> Any help on how I can use the Kuromoji tokenizer, or any alternatives,
> would be greatly appreciated.
>
> Thanks.