Hi everybody, UserDictionary is right. I am using the Yahoo! Japanese morphological analysis API (日本語形態素解析, "Japanese morphological analysis") to build entries for my own user dictionary: http://developer.yahoo.co.jp/webapi/jlp/
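For anyone searching the archives later: below is a rough, untested sketch of what passing a UserDictionary can look like with the Lucene 4.6 Kuromoji module. Each CSV line is surface form, space-separated segmentation, space-separated readings, and a part-of-speech tag; the readings and the カスタム名詞 tag here are just placeholders, not vetted dictionary entries.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.ja.dict.UserDictionary;
    import org.apache.lucene.util.Version;

    public class UserDictExample {
        public static JapaneseAnalyzer buildAnalyzer() throws IOException {
            // One CSV line per entry: surface,segmentation,readings,part-of-speech.
            // Readings and POS tags below are illustrative placeholders.
            String entries =
                  "無臭正,無臭正,ムシュウセイ,カスタム名詞\n"
                + "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n"
                + "中出し,中出し,ナカダシ,カスタム名詞\n";

            UserDictionary userDict = new UserDictionary(new StringReader(entries));

            // Pass the dictionary where the earlier examples passed null,
            // so these entries come out of the tokenizer as single tokens.
            return new JapaneseAnalyzer(
                    Version.LUCENE_46,
                    userDict,
                    JapaneseTokenizer.Mode.SEARCH,
                    JapaneseAnalyzer.getDefaultStopSet(),
                    JapaneseAnalyzer.getDefaultStopTags());
        }
    }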
On 2014/03/11, at 8:10, Rahul Ratnakar wrote:

> Worked perfectly for Japanese.
>
> I have the same issue with the Chinese analyzer. I am using SmartChinese
> (lucene-analyzers-smartcn-4.6.0.jar) but I don't see an interface similar to
> the Japanese analyzer's. Is there an easy way to implement the same for
> Chinese?
>
> On Mon, Mar 10, 2014 at 3:26 PM, Rahul Ratnakar <rahul.ratna...@gmail.com> wrote:
>
>> Thanks Robert. This was exactly what I was looking for; I will try this.
>>
>> On Mon, Mar 10, 2014 at 3:13 PM, Robert Muir <rcm...@gmail.com> wrote:
>>
>>> You can pass a UserDictionary with your own entries to do this.
>>>
>>> On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar <rahul.ratna...@gmail.com> wrote:
>>>
>>>> Thanks Furkan. That is the exact tool that I am using. In my code, I have
>>>> tried all the search modes, e.g.
>>>>
>>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL, JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED, JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH, JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>>
>>>> and none of them seem to tokenize the words as I want, so I was wondering
>>>> if there is some way for me to actually "update" the dictionary/corpus so
>>>> that these slang terms are caught by the tokenizer as single words.
>>>>
>>>> My example text has been scraped from an "adult" website, so it might be
>>>> offensive, and I apologize for that. A small excerpt from that website:
>>>>
>>>> "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
>>>> 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
>>>>
>>>> On tokenizing I get the list of tokens below. My problem is that, as per my
>>>> in-house Japanese language expert, this list breaks up the word "無臭正"
>>>> into 無臭 and 正, whereas it should be caught as a single word:
>>>>
>>>> 裏
>>>> びでお
>>>> 無料
>>>> 無臭
>>>> 正
>>>> 動画
>>>> 無料
>>>> 無料
>>>> a
>>>> 動画
>>>> 裏
>>>> びでお
>>>> 無料
>>>> 無臭
>>>> 正
>>>> 動画
>>>> 無料
>>>> 無料
>>>> a
>>>> 動画
>>>> se
>>>> く
>>>> くすい
>>>> 動画
>>>> 無料
>>>> 裏
>>>> ビデオ
>>>> ヘンリ
>>>> 塚本
>>>> ウラビデライフ
>>>> 無料
>>>> 動画
>>>> セッ
>>>> く
>>>> 動画
>>>> 無料
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>> On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>>
>>>>> Hi;
>>>>>
>>>>> Here is the page that has an online Kuromoji tokenizer and more
>>>>> information: http://www.atilika.org/ It may help you.
>>>>>
>>>>> Thanks;
>>>>> Furkan KAMACI
>>>>>
>>>>> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratna...@gmail.com>:
>>>>>
>>>>>> I am trying to analyze some Japanese web pages for the presence of
>>>>>> slang/adult phrases in them using lucene-analyzers-kuromoji-4.6.0.jar.
>>>>>> While the tokenizer breaks the text up into proper words, I am more
>>>>>> interested in catching the slang terms, which seem to result from
>>>>>> combining various "safe" words.
>>>>>> A few examples of words that, as per our in-house Japanese language
>>>>>> expert (I have no knowledge of Japanese whatsoever), are slang and should
>>>>>> be caught "unbroken":
>>>>>>
>>>>>> 無臭正 - a bad word that we want to catch as is, but the tokenizer breaks
>>>>>> it up into 無臭 and 正, which are both apparently safe.
>>>>>>
>>>>>> ハメ撮り - broken into ハメ and 撮り, again both safe on their own but bad
>>>>>> when combined.
>>>>>>
>>>>>> 中出し - broken into 中 and 出し, but should have been left as is, as it
>>>>>> represents a bad phrase.
>>>>>>
>>>>>> Any help on how I can use the Kuromoji tokenizer, or any alternatives,
>>>>>> would be greatly appreciated.
>>>>>>
>>>>>> Thanks.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
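And, in case it helps, one way to produce the kind of per-line token dump pasted above, assuming the standard Lucene 4.x TokenStream API; the field name "body" is arbitrary:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenDump {
        // Prints one token per line for the given text.
        public static void printTokens(Analyzer analyzer, String text) throws IOException {
            TokenStream ts = analyzer.tokenStream("body", text);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }

Running this with an analyzer built around a UserDictionary (as in the sketch earlier in the thread) should show 無臭正, ハメ撮り, and 中出し surviving as single tokens if the entries are accepted.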