[ 
http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12416108 ] 

Sean Dean commented on NUTCH-224:
---------------------------------

I'm still using 0.7.1 and also see this problem.

The Nutch 0.7.2 release upgraded to Lucene 1.9.1, which includes the
fixes quoted below (LUCENE-444 and LUCENE-461) for Korean language support.

Have you tried 0.7.2 or 0.8-dev with any luck?

> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
>          Key: NUTCH-224
>          URL: http://issues.apache.org/jira/browse/NUTCH-224
>      Project: Nutch
>         Type: Bug

>   Components: indexer
>     Versions: 0.7.1
>     Reporter: KuroSaka TeruHiko

>
> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of the LETTER or CJK class.  It seems to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message to the nutch-user ML and Cheolgoo Kang
> replied:
> ------------------------------------------------------------------------------------
> There was similar issue with Lucene's StandardTokenizer.jj.
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I have almost no experience with Nutch, but you can handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NutchAnalysis.jj.
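
For anyone who wants to check whether their own documents are affected, here is a rough sketch in plain Java that tests whether a string contains characters in the range quoted in the report (U+AC00 ... U+D7AF). The class and method names (HangulCheck, inReportedHangulRange, containsHangul) are made up for illustration and are not part of Nutch or Lucene; this only inspects code points and does not reproduce what NutchAnalysis.jj actually does.

```java
public class HangulCheck {
    // Range quoted in the issue report for Hangul Syllables: U+AC00 .. U+D7AF.
    static final int HANGUL_START = 0xAC00;
    static final int HANGUL_END = 0xD7AF;

    // Hypothetical helper: true if the code point lies in the reported range.
    static boolean inReportedHangulRange(int codePoint) {
        return codePoint >= HANGUL_START && codePoint <= HANGUL_END;
    }

    // Hypothetical helper: true if any code point of the text is in that range.
    static boolean containsHangul(String text) {
        return text.codePoints().anyMatch(HangulCheck::inReportedHangulRange);
    }

    public static void main(String[] args) {
        // "\uD55C\uAD6D\uC5B4" is the Korean word for "Korean language".
        System.out.println(containsHangul("\uD55C\uAD6D\uC5B4")); // prints true
        System.out.println(containsHangul("Nutch"));              // prints false
    }
}
```

If this reports true for text that your index cannot find, the tokenizer grammar described above is a likely place to look.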

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
