On Aug 1, 2005, at 12:25 PM, Andy Liu wrote:
The Nutch language identifier plugin currently doesn't handle CJKV pages. Does anybody here have any experience with automatically detecting the language of such pages? I know there are specific encodings that give away what language the page is in, but for Asian-language pages that use Unicode or its variants, I'm out of luck.
For Unicode it's pretty easy... just look for characters that give away the language... for example, Hiragana for Japanese, Hangul for Korean, etc.
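As a rough sketch of that idea (not anything that exists in the Nutch plugin today), something like the following counts characters by Unicode script using the JDK's `Character.UnicodeScript`. The class name `CjkScriptGuess`, the method `guess`, and the returned labels are made up for illustration. Note that text containing only Han characters stays ambiguous, since Han characters alone do not distinguish Chinese from Japanese.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Nutch: guesses a language from the
// Unicode scripts that appear in already-decoded text.
public class CjkScriptGuess {

    // Returns "ja", "ko", "han" (Han only, so ambiguous), or "unknown".
    // These labels are illustrative assumptions, not an existing convention.
    public static String guess(String text) {
        int kana = 0, hangul = 0, han = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                kana++;      // kana gives away Japanese
            } else if (script == UnicodeScript.HANGUL) {
                hangul++;    // Hangul gives away Korean
            } else if (script == UnicodeScript.HAN) {
                han++;       // shared by Chinese and Japanese
            }
        }
        if (kana == 0 && hangul == 0) {
            return han > 0 ? "han" : "unknown";
        }
        return kana >= hangul ? "ja" : "ko";
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(guess("\u65E5\u672C\u8A9E\u3067\u3059")); // Japanese: Han plus Hiragana
        System.out.println(guess("\uD55C\uAD6D\uC5B4"));             // Korean: Hangul
        System.out.println(guess("\u4E2D\u6587"));                   // Han only
        System.out.println(guess("plain ASCII"));                    // no CJK scripts
    }
}
```

This only works once the bytes have been decoded correctly, so it does nothing for the encoding problem.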
It's hard to detect all the various encodings (EUC-JP, Shift_JIS, ISO-2022-KR/JP, Big5, etc.), and many servers do not correctly identify the encodings.
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
