Hello,
I want to crawl an Arabic URL
(http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar). It
specifies the windows-1256 charset.
Are you sure it's really 1256? The charset
returned by the server (in the headers) for this
page is UTF-8:
< Content-Type: text/html; charset=utf-8
And when I look at some of the content with an
implicit UTF-8 charset, it seems to make
sense...or at least the encoded byte sequences
are valid. If there are any native Arabic
speakers, then the question is whether this makes
sense as the page title:
<snip - had to cut out image, as otherwise
Apache's mail server rejects this as spam...sorry>
If so, then the page encoding really is UTF-8, not 1256.
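One rough way to run the same check without reading Arabic is to test whether the raw bytes are valid UTF-8. Below is a minimal Python sketch of that idea. The sample word and the helper name decodes_as_utf8 are assumptions made for illustration and are not taken from the page itself. It also shows that decoding UTF-8 bytes as windows-1256 yields garbage rather than an error, which would be consistent with unreadable search results.

```python
# Sample Arabic text ("news"); an illustrative assumption, not content from the page.
sample = "أخبار"

def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte sequence is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Bytes as a server would send them if the page really is UTF-8.
utf8_bytes = sample.encode("utf-8")
# Bytes as they would look if the page really were windows-1256.
cp1256_bytes = sample.encode("cp1256")

# Decoding UTF-8 bytes with the wrong charset does not fail; it silently
# produces one wrong character per byte (mojibake).
misdecoded = utf8_bytes.decode("cp1256", errors="replace")

print(decodes_as_utf8(utf8_bytes))    # valid UTF-8
print(decodes_as_utf8(cp1256_bytes))  # not valid UTF-8
print(misdecoded == sample)           # the text no longer matches
```

Valid UTF-8 is not absolute proof, but Arabic text encoded as windows-1256 rarely happens to form valid UTF-8 sequences, so a clean decode is strong evidence.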
I do see the HTML meta tag that specifies the windows-1256 charset:
<META http-equiv="Content-Type" content="text/html; charset=windows-1256">
But I think that might be wrong.
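To make the mismatch concrete, here is a small Python sketch that pulls the charset out of both places and compares them. The function names and the regular expression are hypothetical helpers written for this example; they are not how any particular crawler does it, and a real HTML parser would be more robust than a regex.

```python
import re
from email.message import Message

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

# Simplistic pattern for a charset inside a <meta> tag (illustrative only).
META_CHARSET_RE = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE
)

def charset_from_meta(html):
    """Extract the charset declared in an HTML meta tag, if any."""
    match = META_CHARSET_RE.search(html)
    return match.group(1).lower() if match else None

# The two values quoted in this thread.
header_value = "text/html; charset=utf-8"
meta_tag = '<META http-equiv="Content-Type" content="text/html; charset=windows-1256">'

header_charset = charset_from_header(header_value)
meta_charset = charset_from_meta(meta_tag)

if header_charset != meta_charset:
    print("Mismatch: header says %s, meta tag says %s"
          % (header_charset, meta_charset))
```

When the two disagree, browsers following the HTML specification give the HTTP header priority over the meta tag, so a page like this still renders correctly in a browser even though the meta tag is wrong.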
-- Ken
I have another URL (http://www.afp.com/afpcom/ar/home) that uses
charset UTF-8. This link works fine (crawling, indexing and searching all work
properly).
When I search the above URL, it returns unreadable characters (ÿÿÿÝ×§
ÿÿץשÿ×§ÿÝ ×©ÿÿ ×§ÿÿÿÿש ×¥×). I want to search properly.
Is there an issue with the charset? Please help me.
Thanks in advance.
Regards,
Chetan Patel
--
Ken Krugler
+1 530-210-6378