[
https://issues.apache.org/jira/browse/NUTCH-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477190
]
Steve Severance commented on NUTCH-224:
---------------------------------------
The PDF Parser for 0.8.1 also fails on Korean text.
Steve
> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
> Key: NUTCH-224
> URL: https://issues.apache.org/jira/browse/NUTCH-224
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 0.7.1
> Reporter: KuroSaka TeruHiko
>
> I was browing NutchAnalysis.jj and found that
> Hungul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of LETTER or CJK class. This seems to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message at nutch-user ML and Cheolgoo Kang [EMAIL
> PROTECTED]
> replied as:
> ------------------------------------------------------------------------------------
> There was similar issue with Lucene's StandardTokenizer.jj.
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I'm have almost no experience with Nutch, but you can handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NuatchAnalysis.jj.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers