Hi,

On 11/11/05, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Was wondering if someone could help me out with a few things in Korean
> as related to Lucene:
> 1. Which Analyzer do you recommend? From the list, I see that some
> have had success with the StandardAnalyzer. Are there any caveats I
> should be aware of if I choose to use it?

The StandardAnalyzer currently in svn splits Korean words into individual
characters. As you know, a single Korean character carries almost no
meaning on its own, so I've submitted a patch on JIRA to address this
issue. You can find it at http://issues.apache.org/jira/browse/LUCENE-461.

As for stemming, StandardTokenizer (and StandardAnalyzer) cannot do it,
so you need something else, like CJKAnalyzer, which does bi-gram
tokenization. There is currently no freely available Lucene analyzer
that does Korean stemming the way Porter, Lovins, etc. do.

> 2. Could anyone point me to a fairly decent size (doesn't need to be
> huge), freely available collection?

Please check out the Sejong project (http://www.sejong.or.kr/; Sejong is
the name of the king who created Hangul in the 15th century). It's a
kind of national linguistics project and has a lot of Korean corpora
that are freely available for research purposes only. However, the texts
are provided in the xxx.HWP file format, so it's hard to download and
use them in one shot. It's very, very time consuming :-( You need the
"Hangul 2005" word processor to read the xxx.HWP files. (I know, the
Sejong project shouldn't have used a company's proprietary format like
HWP instead of XML or even just TXT.)

>
> Thanks,
> Grant
>
> --
> -------------------------------------------------------------------
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University
> School of Information Studies
> 337 Hinds Hall
> Syracuse, NY 13244
>
> http://www.cnlp.org
> Voice: 315-443-5484
> Fax: 315-443-6886
>

--
Cheolgoo

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
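
P.S. In case it helps to see the difference between per-character and
bi-gram tokenization concretely, here is a rough sketch in plain Python.
It is not Lucene code and does not reproduce what StandardAnalyzer or
CJKAnalyzer actually do internally; the function names `unigrams` and
`bigrams` and the sample string are made up for illustration, and words
are assumed to be separated by whitespace.

```python
import unicodedata


def _words(text):
    # Normalize to NFC so each Hangul syllable is a single code point,
    # then split on whitespace (an assumption made for this sketch).
    return unicodedata.normalize("NFC", text).split()


def unigrams(text):
    """Per-character tokenization: every character becomes its own token."""
    return [ch for word in _words(text) for ch in word]


def bigrams(text):
    """Overlapping two-character tokens within each word.

    A one-character word is kept as a single token.
    """
    tokens = []
    for word in _words(text):
        if len(word) < 2:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens


sample = "한국어 검색"  # "Korean-language search"
print(unigrams(sample))  # ['한', '국', '어', '검', '색']
print(bigrams(sample))   # ['한국', '국어', '검색']
```

With single characters, a query for one syllable matches every word that
contains it; with bi-grams, tokens such as 한국 and 검색 are much closer
to meaningful units, which is why bi-gram tokenization is a common
fallback when no real stemmer is available.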