Re: Arabic language in Nutch

Ken Krugler Mon, 01 Jun 2009 10:02:15 -0700

Hello,

I want to crawl an Arabic URL
(http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar). It
specifies the charset windows-1256.

Are you sure it's really 1256? The charset returned by the server (in the headers) for this page is UTF-8:


< Content-Type: text/html; charset=utf-8
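To double-check what the server declares, one could pull the charset parameter out of the Content-Type header value. This is only a minimal Python sketch using the standard library; the helper name charset_from_content_type is made up for illustration and is not part of Nutch.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the label and returns None if absent.
    return msg.get_content_charset()


# Header value as shown in the server response above.
print(charset_from_content_type("text/html; charset=utf-8"))
```

For a live check, urllib.request.urlopen(url).headers.get_content_charset() reads the same parameter from a real response.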

And when I look at some of the content with an implicit UTF-8 charset, it seems to make sense...or at least the encoded byte sequences are valid. If there are any native Arabic speakers, then the question is whether this makes sense as the page title:

<snip - had to cut out image, as otherwise Apache's mail server rejects this as spam...sorry>


If so, then the page encoding really is UTF-8, not 1256.
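The "are the byte sequences valid" check can be sketched in Python as a strict UTF-8 decode. The sample text below is made up (it is not taken from the page), and is_valid_utf8 is a hypothetical helper name. The sketch relies on the fact that Arabic text stored as windows-1256 is very unlikely to also be valid UTF-8.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Made-up sample: the Arabic word "marhaba" (hello), written with escapes.
sample = "\u0645\u0631\u062d\u0628\u0627"

utf8_bytes = sample.encode("utf-8")
cp1256_bytes = sample.encode("cp1256")  # Python's codec name for windows-1256

print(is_valid_utf8(utf8_bytes))    # bytes produced by a UTF-8 encoder
print(is_valid_utf8(cp1256_bytes))  # bytes produced by a windows-1256 encoder

# Reading UTF-8 bytes as windows-1256 does not give back the sample text.
print(utf8_bytes.decode("cp1256", errors="replace") == sample)
```

A successful strict decode does not prove the page is UTF-8, but a failure is strong evidence that it is not.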

I do see the HTML meta tag that specifies the windows-1256 charset:

<META http-equiv="Content-Type" content="text/html; charset=windows-1256">

But I think that might be wrong.
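A mismatch like this one can be spotted by comparing the meta declaration with the header charset. The following Python sketch is only an illustration: the regular expression is a rough approximation rather than a real HTML parser, the helper names are hypothetical, and none of this is meant to describe how Nutch itself picks an encoding.

```python
import codecs
import re

# Rough pattern for a charset label inside a <meta> tag (not a full parser).
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def meta_charset(html):
    """Return the charset named in the first <meta> declaration, or None."""
    match = META_CHARSET.search(html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def same_encoding(a, b):
    """Compare two charset labels after resolving codec aliases."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False


html = b'<META http-equiv="Content-Type" content="text/html; charset=windows-1256">'
header_charset = "utf-8"  # value from the HTTP response header

declared = meta_charset(html)
print(declared, same_encoding(header_charset, declared))
```

When the two labels disagree, checking which one actually decodes the body cleanly is one way to decide which declaration to trust.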

-- Ken


I have another URL (http://www.afp.com/afpcom/ar/home) and it specifies
charset UTF-8. This link works fine (crawling, indexing, and searching all work
properly).

When I search in the above URL, it returns unreadable characters (ÿÿÿÝ×§
ÿÿ×¥×©ÿ×§ÿÝ ×©ÿÿ ×§ÿÿÿÿ×© ×¥×). I want to search properly.

Is there an issue with the charset? Please help me.

Thanks in advance.

Regards,
Chetan Patel


--
Ken Krugler
+1 530-210-6378
