Re: implement thai language analyzer in nutch

ogjunk-nutch Wed, 08 Nov 2006 14:43:34 -0800

Regarding Thai, there is a Thai Analyzer in Lucene already:

$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
-rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java

Otis

----- Original Message ----
From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
To: sanjeev <[EMAIL PROTECTED]>; [email protected]
Sent: Wednesday, November 8, 2006 2:16:38 PM
Subject: RE: implement thai language analyzer in nutch

Sanjay,
I don't think you should follow the Chinese example and extend the CJK
range. That was needed because Chinese and Japanese don't use spaces to
separate words. I believe Thai uses spaces, right? If so, you should
extend the LETTER range to include Thai characters rather than the CJK
range.
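
To make the range concrete: the Unicode Thai block runs from U+0E00 through U+0E7F, so those would be the bounds to add to a letter range. Below is a minimal sketch of that check in plain Java. The class and method names (ThaiRange, isThai) are hypothetical and are not part of Nutch or Lucene; the actual change would go into the tokenizer's LETTER definition.

```java
// Hypothetical helper, not Nutch code: shows the code point bounds of the
// Unicode Thai block that a LETTER range would need to cover.
final class ThaiRange {
    // The Unicode Thai block spans U+0E00 through U+0E7F.
    static final char THAI_START = '\u0E00';
    static final char THAI_END = '\u0E7F';

    // Returns true when the character falls inside the Thai block.
    static boolean isThai(char c) {
        return c >= THAI_START && c <= THAI_END;
    }

    public static void main(String[] args) {
        System.out.println(isThai('\u0E01')); // THAI CHARACTER KO KAI
        System.out.println(isThai('a'));      // Latin letter
    }
}
```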

Another place you would need to change is the LanguageIdentifier.
You would either need to train it or implement some hack so that it can
detect Thai-language documents that are not HTML with a lang="th"
attribute.
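
One possible form of such a hack, sketched under assumptions: count how many letters in the text belong to the Thai Unicode block and call the document Thai when they are the majority. The class ThaiGuess, its methods, and the 0.5 threshold are all made up for illustration; this does not use or reflect the real LanguageIdentifier API, and a trained profile would likely be more robust.

```java
// Hypothetical heuristic, not the Nutch LanguageIdentifier: guesses "Thai"
// from the share of letters that fall in the Unicode Thai block.
final class ThaiGuess {
    // Fraction of letters in the text that are Thai (0.0 if there are no letters).
    static double thaiLetterRatio(String text) {
        int letters = 0;
        int thai = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // skip digits, spaces, punctuation, combining marks
            }
            letters++;
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.THAI) {
                thai++;
            }
        }
        return letters == 0 ? 0.0 : (double) thai / letters;
    }

    // Assumed threshold: more than half of the letters are Thai.
    static boolean looksThai(String text) {
        return thaiLetterRatio(text) > 0.5;
    }
}
```

A caller could fall back to this check only when the page carries no lang attribute and the regular identifier returns no confident answer.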

-kuro
