OK Kuro, you are wrong about the Thai language having spaces between words.
Thai doesn't put spaces between words, and segmenting Thai is a bit tricky,
methinks.
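
To show what I mean, here is a minimal sketch of word segmentation using the
JDK's java.text.BreakIterator with a Thai locale. As far as I can tell this is
the same JDK facility that the ThaiWordFilter quoted below relies on, but treat
that as my assumption. The class name ThaiSegmentSketch and the segment() helper
are made up for illustration and are not part of Lucene or Nutch. How well the
Thai text gets split depends on whether the JRE ships its Thai break data, so
I'm not claiming any particular output.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentSketch {

    // Hypothetical helper: splits text at the word boundaries reported by the
    // JDK's locale-aware BreakIterator and drops whitespace-only pieces.
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String token = text.substring(start, end);
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A Thai sentence written, as usual, without spaces between words.
        // A plain whitespace split would return it as one single token.
        String thai = "ภาษาไทยไม่มีช่องว่างระหว่างคำ";
        for (String token : segment(thai)) {
            System.out.println(token);
        }
    }
}
```

A whitespace tokenizer, which is what extending the LETTER range amounts to,
would treat that whole sentence as one word, and that is why I think Thai needs
a boundary-aware analyzer.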

I'd appreciate any and all help you can give me.

cheers,
sanjeev

sanjeev wrote:
> 
> OK. I downloaded the LuceneInAction code examples from the book and found
> there were some analyzers and tests/demos which included Chinese.
> 
> But these analyzers were standalone java programs with a main method.
> 
> My question is how to integrate these into Nutch so that the index created
> by the crawl process is searchable in Thai?
> 
> Someone please help as I'm hopelessly confused by the whole thing. :-(
> 
> cheers,
> sanjeev.
> 
> 
> 
> 
> 
> ogjunk-nutch wrote:
>> 
>> Regarding Thai, there is a Thai Analyzer in Lucene already:
>> 
>> $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
>> total 24
>> drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
>> -rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
>> -rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
>> 
>> Otis
>> 
>> ----- Original Message ----
>> From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
>> To: sanjeev <[EMAIL PROTECTED]>; [email protected]
>> Sent: Wednesday, November 8, 2006 2:16:38 PM
>> Subject: RE: implement thai lanaguage analyzer in nutch
>> 
>> Sanjay,
>> I don't think you should follow the Chinese example and extend the CJK range.
>> This was needed because Chinese and Japanese don't use spaces to separate
>> words.  I believe Thai uses spaces, right? If so, you should extend the
>> LETTER range to include Thai characters rather than CJK.
>> 
>> Another place you would need to change is the LanguageIdentifier.
>> You would either train it, or implement some hack, in order for it to be
>> able to detect Thai-language documents that are not HTML with a lang="th"
>> attribute.
>> 
>> -kuro
>> 
>> 
>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7252863
Sent from the Nutch - Dev mailing list archive at Nabble.com.