Hi Ken,

Thanks for your reply.

Yes, I have seen the response header, and it says charset UTF-8, whereas the
page itself shows charset windows-1256.
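
For reference, a mismatch like this can be checked directly: if the raw page
bytes decode cleanly under a strict UTF-8 decode, the header is most likely the
declaration that is correct. Below is a minimal Python sketch of that check,
not anything Nutch itself does. The helper names (charset_from_content_type,
is_valid_utf8) and the sample Arabic word are assumptions made up for
illustration; a real check would run on the bytes fetched from the page.

```python
import re

def charset_from_content_type(value):
    """Return the lowercase charset named in a Content-Type value, or None."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None

def is_valid_utf8(data):
    """Return True if the bytes decode under a strict UTF-8 decode."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

# The two declarations quoted in this thread (HTTP header and META tag).
header_charset = charset_from_content_type("text/html; charset=utf-8")
meta_charset = charset_from_content_type("text/html; charset=windows-1256")

# Hypothetical sample standing in for the fetched page bytes.
sample = "مرحبا"
as_utf8 = sample.encode("utf-8")
as_1256 = sample.encode("cp1256")  # Python's codec name for windows-1256

print(header_charset, meta_charset)
print(is_valid_utf8(as_utf8))
print(is_valid_utf8(as_1256))
```

Arabic text stored as windows-1256 typically fails a strict UTF-8 decode, so
bytes that pass are most likely UTF-8 regardless of what the META tag says.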

Could you please let me know how to crawl the following two URLs so that we
get proper results?

http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar
http://www.alqabas.com.kw/

Thanks again for your help.

Regards,
Chetan


Ken Krugler wrote:
> 
>>Hello,
>>
>>I want to crawl Arabic URL
>>(http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar)
>>It contains charset windows-1256.
> 
> Are you sure it's really 1256? The charset 
> returned by the server (in the headers) for this 
> page is UTF-8:
> 
> < Content-Type: text/html; charset=utf-8
> 
> And when I look at some of the content with an 
> implicit UTF-8 charset, it seems to make 
> sense...or at least the encoded byte sequences 
> are valid. If there are any native Arabic 
> speakers, then the question is whether this makes 
> sense as the page title:
> 
> <snip - had to cut out image, as otherwise 
> Apache's mail server rejects this as spam...sorry>
> 
> If so, then the page encoding really is UTF-8, not 1256.
> 
> I do see the HTML meta tag that specifies the windows-1256 charset:
> 
> <META http-equiv="Content-Type" content="text/html; charset=windows-1256">
> 
> But I think that might be wrong.
> 
> -- Ken
> 
>>
>>I have another URL (http://www.afp.com/afpcom/ar/home) and it contains
>>charset UTF-8. This link works fine (crawling, indexing and searching
>>work properly).
>>
>>When I search in above url it return unreadable characters (ÿŸÿ–ÿÝ×§
>>ÿ“ÿ“ץש–ÿ“×§ÿÝ ×©–ÿ“ÿ” ×§ÿ“ÿŠÿ–ÿ’ש– ×¥×). I want to search properly.
>>
>>Is there an issue with the charset? Please help me.
>>
>>Thanks in advance.
>>
>>Regards,
>>Chetan Patel
> 
> --
> Ken Krugler
> +1 530-210-6378
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Arabic-language-in-Nutch-tp9269533p23830012.html
Sent from the Nutch - User mailing list archive at Nabble.com.
