Hi Ken, Thanks for your reply.
Yes, I have seen the response header, and its charset is UTF-8, whereas the page itself declares charset windows-1256. Could you please let me know how to crawl the following two URLs so that we get proper results?

http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar
http://www.alqabas.com.kw/

Thanks again for your help.

Regards,
Chetan


Ken Krugler wrote:
>
>>Hello,
>>
>>I want to crawl Arabic URL
>>(http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar) It
>>contains charset windows-1256.
>
> Are you sure it's really 1256? The charset
> returned by the server (in the headers) for this
> page is UTF-8:
>
> < Content-Type: text/html; charset=utf-8
>
> And when I look at some of the content with an
> implicit UTF-8 charset, it seems to make
> sense...or at least the encoded byte sequences
> are valid. If there are any native Arabic
> speakers, then the question is whether this makes
> sense as the page title:
>
> <snip - had to cut out image, as otherwise
> Apache's mail server rejects this as spam...sorry>
>
> If so, then the page encoding really is UTF-8, not 1256.
>
> I do see the HTML meta tag that specifies the windows-1256 charset:
>
> <META http-equiv="Content-Type" content="text/html; charset=windows-1256">
>
> But I think that might be wrong.
>
> -- Ken
>
>>
>>I have another URL (http://www.afp.com/afpcom/ar/home) and it contains
>>charset UTF-8. This links work fine(crawling, indexing and searching working
>>properly).
>>
>>When I search in above url it return unreadable characters (ÿÿÿÝ×§
>>ÿÿץשÿ×§ÿÝ ×©ÿÿ ×§ÿÿÿÿש ×¥×). I want to search properly.
>>
>>Is there any issue with charset? plz help me.
>>
>>Thanks in advance.
>>
>>Regards,
>>Chetan Patel
>
> --
> Ken Krugler
> +1 530-210-6378
>
>

--
View this message in context: http://www.nabble.com/Arabic-language-in-Nutch-tp9269533p23830012.html
Sent from the Nutch - User mailing list archive at Nabble.com.
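
The check Ken describes (do the page bytes form valid sequences under an implicit UTF-8 charset?) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Nutch itself resolves the charset: the helper names decodes_cleanly and choose_charset are hypothetical, and the sample string is an illustrative Arabic word, not content taken from either site.

```python
def decodes_cleanly(data: bytes, charset: str) -> bool:
    """Return True if `data` is a valid byte sequence in `charset`."""
    try:
        data.decode(charset, errors="strict")
        return True
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes for this charset, or a charset name Python does not know.
        return False


def choose_charset(data: bytes, header_charset: str, meta_charset: str) -> str:
    """Hypothetical helper: try the HTTP header charset first, then the meta tag.

    Returns the first declared charset under which the bytes decode without
    errors. If neither decodes cleanly, it falls back to the header value.
    """
    for candidate in (header_charset, meta_charset):
        if candidate and decodes_cleanly(data, candidate):
            return candidate
    return header_charset or meta_charset


# Illustrative Arabic word (assumption, not taken from the crawled pages).
sample = "\u0645\u0631\u062d\u0628\u0627"
utf8_bytes = sample.encode("utf-8")
cp1256_bytes = sample.encode("cp1256")

# Header says utf-8, meta tag says windows-1256, as on the page in question.
print(choose_charset(utf8_bytes, "utf-8", "windows-1256"))
print(choose_charset(cp1256_bytes, "utf-8", "windows-1256"))
```

A strict UTF-8 decode is a useful first test because Arabic text encoded as windows-1256 almost never forms valid UTF-8 sequences. The reverse does not hold: UTF-8 bytes can usually be decoded as windows-1256 without an error and simply come out as unreadable characters, which is why the sketch tries the header charset before the meta tag.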
