I need a Thai analyzer for Nutch. I want the crawler to be smart enough to split Thai words correctly, since Thai is written without spaces between words. :-(
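To make the word-splitting problem concrete, here is a minimal sketch that uses only the JDK's `java.text.BreakIterator` with a Thai locale. It is not Nutch or Lucene code. The class name `ThaiSegmentSketch`, the helper `segment`, and the sample sentence are made up for illustration. How well it splits depends on the Thai locale data shipped with the JDK in use, so treat the output as indicative only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentSketch {

    // Hypothetical helper (not part of Nutch or Lucene): returns the word-like
    // pieces of the text as reported by the JDK's locale-aware word BreakIterator.
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            // The iterator also reports whitespace runs as segments; skip those.
            String token = text.substring(start, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample sentence: "Thai has no spaces between words."
        String sample = "ภาษาไทยไม่มีช่องว่างระหว่างคำ";
        for (String token : segment(sample)) {
            System.out.println(token);
        }
    }
}
```

Something along these lines would have to run at the tokenizer or token-filter stage of the analyzer, so that each piece is indexed as its own term instead of the whole unspaced run becoming one token.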
ogjunk-nutch wrote:
>
> Regarding Thai, there is a Thai Analyzer in Lucene already:
>
> $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
> total 24
> drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
> -rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
> -rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.java
>
> Otis
>
> ----- Original Message ----
> From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
> To: sanjeev <[EMAIL PROTECTED]>; [email protected]
> Sent: Wednesday, November 8, 2006 2:16:38 PM
> Subject: RE: implement thai lanaguage analyzer in nutch
>
> Sanjay,
> I don't think you should follow the Chinese example and extend the CJK
> range.
> This was needed because Chinese and Japanese don't use space to separate
> words. I believe Thai uses spaces, right? If so, you should extend
> LETTER
> range to include Thai character rather than CJK.
>
> Another place you would need to change is the LanguageIdentifier.
> You would either train it, or implement some hack, in order for it to
> be able to
> detect Thai language documents that are not of HTML with lang="th"
> attribute.
>
> -kuro

--
View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
Sent from the Nutch - Dev mailing list archive at Nabble.com.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
