You can pass a UserDictionary with your own entries to do this. It is the
second constructor argument of JapaneseAnalyzer, which is null in your code.

On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar <rahul.ratna...@gmail.com> wrote:

> Thanks Furkan,
>
> This is the exact tool that I am using, albeit in my code. I have tried
> all the tokenizer modes, e.g.
>
> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
> JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>
> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
> JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>
> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
> JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>
> and none of them seem to tokenize the words as I want, so I was wondering
> if there is some way for me to actually "update" the dictionary/corpus so
> that these slang terms are caught by the tokenizer as single words.
>
> My example text has been scraped from an "adult" website, so it might be
> offensive, and I apologize for that. A small excerpt from that website:
>
> "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
> 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
>
> On tokenizing I get the list of tokens below. My problem is that, as per
> my in-house Japanese language expert, this list breaks up the word "無臭正"
> into 無臭 and 正, whereas it should be caught as a single word:
>
> 裏
> びでお
> 無料
> 無臭
> 正
> 動画
> 無料
> 無料
> a
> 動画
> 裏
> びでお
> 無料
> 無臭
> 正
> 動画
> 無料
> 無料
> a
> 動画
> se
> く
> くすい
> 動画
> 無料
> 裏
> ビデオ
> ヘンリ
> 塚本
> ウラビデライフ
> 無料
> 動画
> セッ
> く
> 動画
> 無料
>
> Thanks,
> Rahul
>
> On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>
>> Hi;
>>
>> Here is the page that has an online Kuromoji tokenizer and more
>> information: http://www.atilika.org/ It may help you.
>>
>> Thanks;
>> Furkan KAMACI
>>
>>
>> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratna...@gmail.com>:
>>
>> > I am trying to analyze some Japanese web pages for the presence of
>> > slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
>> > tokenizer breaks the text up into proper words, I am more interested in
>> > catching slang terms, which seem to result from combining various "safe"
>> > words.
>> >
>> > A few examples of words that, as per our in-house Japanese language
>> > expert (I have no knowledge of Japanese whatsoever), are slang and
>> > should be caught "unbroken":
>> >
>> > 無臭正 - is a bad word and we want to catch it as is, but the tokenizer
>> > breaks it up into 無臭 and 正, which are both apparently safe.
>> >
>> > ハメ撮り - it was broken into ハメ and 撮り, again both safe on their own but
>> > bad when combined.
>> >
>> > 中出し - broken into 中 and 出し, but should have been left as is, as it
>> > represents a bad phrase.
>> >
>> > Any help on how I can use the Kuromoji tokenizer, or any alternatives,
>> > would be greatly appreciated.
>> >
>> > Thanks.
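
For reference, here is a rough sketch of the UserDictionary suggestion at the
top of this message, written against lucene-analyzers-kuromoji-4.6.0.jar. It
is an illustration, not tested against your data. Assumptions: the class name
UserDictDemo and the field name "body" are made up; the katakana readings and
the part-of-speech label カスタム名詞 ("custom noun") are my guesses and should
be checked by your language expert. Each dictionary line holds the surface
form, its segmentation, the reading, and a part-of-speech label.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical demo class; requires the Lucene 4.6 core and kuromoji jars.
public class UserDictDemo {

    // User dictionary in Kuromoji's CSV format. Readings and the
    // part-of-speech label are assumptions made for illustration only.
    static final String USER_DICT =
          "# surface,segmentation,reading,part-of-speech\n"
        + "無臭正,無臭正,ムシュウセイ,カスタム名詞\n"
        + "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n"
        + "中出し,中出し,ナカダシ,カスタム名詞\n";

    static List<String> tokenize(String text) throws IOException {
        // In Lucene 4.6 the dictionary is built directly from a Reader.
        UserDictionary userDict = new UserDictionary(new StringReader(USER_DICT));

        // Same constructor as in the quoted code, with the dictionary
        // passed in place of null.
        Analyzer analyzer = new JapaneseAnalyzer(
                Version.LUCENE_46,
                userDict,
                JapaneseTokenizer.Mode.NORMAL,
                JapaneseAnalyzer.getDefaultStopSet(),
                JapaneseAnalyzer.getDefaultStopTags());

        List<String> tokens = new ArrayList<String>();
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        try {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                       // required before incrementToken()
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        } finally {
            ts.close();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Prints the token list; 無臭正 should now come out as one token.
        System.out.println(tokenize("裏びでお無料・無臭正 動画無料"));
    }
}
```

In practice you would probably keep the entries in a UTF-8 text file and open
it with an InputStreamReader instead of embedding them in a string. Note that
newer Lucene versions build the dictionary through a different entry point,
so check the Javadoc for the version you are on.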