Hi,

Here is the Kuromoji project page, which has an online tokenizer demo and
more information: http://www.atilika.org/ It may help you.
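
If the tokenizer keeps splitting these phrases, one possible workaround is to
re-join adjacent tokens after tokenizing and look the result up in your own
list. Below is a minimal sketch in plain Java, assuming the tokens have
already been collected into a list of strings. PhraseMatcher and findPhrases
are names made up for illustration, not part of Lucene or Kuromoji, and the
blocklist is assumed to come from your language expert.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Kuromoji.
final class PhraseMatcher {

    // Re-joins runs of up to maxTokens adjacent tokens and returns every
    // run whose concatenation appears in the blocklist.
    static List<String> findPhrases(List<String> tokens,
                                    Set<String> blocklist,
                                    int maxTokens) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder joined = new StringBuilder();
            int end = Math.min(tokens.size(), start + maxTokens);
            for (int i = start; i < end; i++) {
                joined.append(tokens.get(i));
                if (blocklist.contains(joined.toString())) {
                    hits.add(joined.toString());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Token split taken from the example in the quoted message below.
        List<String> tokens = Arrays.asList("無臭", "正");
        Set<String> blocklist = new HashSet<>(Arrays.asList("無臭正"));
        System.out.println(findPhrases(tokens, blocklist, 3));
    }
}
```

Lucene's JapaneseTokenizer also accepts a user dictionary, which may be the
cleaner fix if you can maintain the entries yourself.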

Thanks,
Furkan KAMACI


2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratna...@gmail.com>:

> I am trying to analyze some Japanese web pages for the presence of slang/adult
> phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
> tokenizer breaks the text up into proper words, I am more interested in
> catching slang terms, which seem to result from combining various "safe"
> words.
>
> A few examples of words that, according to our in-house Japanese language
> expert (I have no knowledge of Japanese whatsoever), are slang and should be
> caught "unbroken":
>
> 無臭正 - is a bad word and we want to catch it as is, but the tokenizer breaks
> it up into 無臭 and 正 which are both apparently safe.
>
> ハメ撮り - it was broken into ハメ and 撮り, again both safe on their own but bad
> when combined.
>
> 中出し - broken into 中 and 出し, but should have been left as is, as it
> represents a bad phrase.
>
> Any help on how I can use the Kuromoji tokenizer, or any alternatives, would
> be greatly appreciated.
>
> Thanks.
>
