Hello,

I want to crawl an Arabic URL
(http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar). It
declares the charset windows-1256.

Are you sure it's really 1256? The charset returned by the server (in the headers) for this page is UTF-8:

< Content-Type: text/html; charset=utf-8

And when I look at some of the content with an implicit UTF-8 charset, it seems to make sense...or at least the encoded byte sequences are valid. If there are any native Arabic speakers, then the question is whether this makes sense as the page title:

<snip - had to cut out image, as otherwise Apache's mail server rejects this as spam...sorry>

If so, then the page encoding really is UTF-8, not 1256.
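
As a rough way to run the same check without reading Arabic, here is a minimal Python sketch. It assumes you already have the raw response bytes; the sample word and the helper name looks_like_utf8 are illustrative and not taken from the page. A strict UTF-8 decode fails on most text that was really encoded as windows-1256, and decoding UTF-8 bytes as windows-1256 produces the kind of unreadable output described in the question.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Illustrative Arabic sample (an assumption, not content from the page).
sample = "\u0623\u062e\u0628\u0627\u0631"

utf8_bytes = sample.encode("utf-8")
cp1256_bytes = sample.encode("cp1256")

# Real UTF-8 passes the strict check; windows-1256 bytes for this word do not.
utf8_ok = looks_like_utf8(utf8_bytes)
cp1256_ok = looks_like_utf8(cp1256_bytes)

# Decoding UTF-8 bytes with the wrong charset gives garbled text (mojibake).
mojibake = utf8_bytes.decode("cp1256", errors="replace")
```

Passing the strict check is strong evidence for UTF-8, but it is not proof, since pure-ASCII content is valid in both encodings.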

I do see the HTML meta tag that specifies the windows-1256 charset:

<META http-equiv="Content-Type" content="text/html; charset=windows-1256">

But I think that might be wrong.
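
One way to handle this kind of conflict in a crawler is sketched below in Python. It tries the charset from the HTTP header first, then the one from the meta tag, and keeps the first candidate that actually decodes the body. The function names, the regular expression, and the 4096-byte scan limit are all assumptions made for illustration; this is not how any particular crawler does it.

```python
import re

# Matches charset=... inside a Content-Type value or a <meta> tag.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value."""
    match = _CHARSET_RE.search(value or "")
    return match.group(1).lower() if match else None


def charset_from_meta(body, scan_limit=4096):
    """Extract the charset declared in the first matching <meta> tag."""
    head = body[:scan_limit].decode("ascii", errors="ignore")
    for tag in re.findall(r"<meta[^>]*>", head, re.IGNORECASE):
        match = _CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    return None


def pick_charset(header_value, body):
    """Prefer the header charset, then the meta charset, if the body decodes."""
    candidates = (charset_from_content_type(header_value), charset_from_meta(body))
    for candidate in candidates:
        if candidate is None:
            continue
        try:
            body.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable body or unknown charset name
        return candidate
    return None
```

For this page the header value is "text/html; charset=utf-8" and the meta tag says windows-1256, so the sketch would settle on utf-8 as long as the body is valid UTF-8. Trying the header first also matches how the HTML standard ranks the two declarations.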

-- Ken


I have another URL (http://www.afp.com/afpcom/ar/home) that uses
charset UTF-8. This link works fine (crawling, indexing, and searching all
work properly).

When I search content from the first URL, it returns unreadable characters (ÿŸÿ–ÿÝ×§
ÿ“ÿ“ץש–ÿ“×§ÿÝ ×©–ÿ“ÿ” ×§ÿ“ÿŠÿ–ÿ’ש– ×¥×). I want search to work properly.

Is there an issue with the charset? Please help me.

Thanks in advance.

Regards,
Chetan Patel

--
Ken Krugler
+1 530-210-6378
