Hi, I'm trying to index some Unicode pages in UTF-8. Pages that are already encoded in UTF-8 work fine, but for some pages what I get when crawling is decimal Unicode NCRs (numeric character references), and these are being indexed as-is.
What I mean is this: say I'm viewing a page, abc.com/hello, which has non-English content. When I open the source of that page, I find that the source itself contains those characters as decimal NCRs, i.e. the codes for హెల్లొ, but when the page is displayed in the browser it is shown properly (in this case it is Telugu). So what I download as raw text is just the above decimal NCR codes, and that is what gets posted to Lucene.

For all other languages I get the content as UTF-8 text, but that path does not handle this particular language. Are these what is called an HTML entity? It seems that before passing this content to Lucene I have to convert the NCRs to the actual characters, so that they end up as UTF-8. Is this the right way to fix it, or are there other, better ways of doing the same?

I need guidance from someone who has faced a similar problem before. All views and ideas are welcome.

Thanks, KK
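
For reference, here is a minimal sketch of the conversion step asked about above, in Java since Lucene is a Java library. It assumes the crawled text is already available as a Java String and only rewrites numeric character references (decimal such as &#3129; and hex such as &#x0C39;); named entities like &amp; are left untouched. The class name NcrDecoder and its decode method are hypothetical helpers for illustration, not part of Lucene or any crawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene API): replaces numeric character
// references with the Unicode characters they stand for.
final class NcrDecoder {
    // Group 1 captures decimal digits; group 2 captures hex digits after x/X.
    private static final Pattern NCR =
        Pattern.compile("&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            // Copy the untouched text between the previous match and this one.
            out.append(text, last, m.start());
            int codePoint = (m.group(1) != null)
                ? Integer.parseInt(m.group(1), 10)
                : Integer.parseInt(m.group(2), 16);
            if (Character.isValidCodePoint(codePoint)) {
                // appendCodePoint also handles characters outside the BMP.
                out.appendCodePoint(codePoint);
            } else {
                // Out-of-range number: keep the original reference as-is.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

Calling NcrDecoder.decode(rawText) before building the Lucene document would hand the analyzer real characters instead of the reference codes. Whether this is better than running the page through a full HTML parser that resolves all entities is a design choice that depends on the rest of the crawl pipeline.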