Re: HTML parsing and charset for Polish

MilleBii Wed, 23 Sep 2009 06:09:37 -0700

At last someone answers.
Correct, it is CP1250.
My pages look fine in browsers of course, but that does not mean Nutch
handles them properly.

What I'm wondering is whether the Nutch HTML parser reads them properly,
because a search on such characters fails on pages encoded in ISO-8859-2 or
CP1250, but not on UTF-8 encoded pages, from what I could see.
Nutch uses Java Strings (i.e. Unicode) internally, but I wonder if there is
a problem in the conversion from the page encoding into the Unicode
encoding.
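
To test the conversion step in isolation, a minimal sketch along these lines
might help. It is plain Java and not Nutch code: the class CharsetDemo and
its two helper methods are hypothetical names made up for the illustration.
It encodes the Polish diacritics as CP1250 bytes and decodes them again with
three candidate charsets, to show that only the matching charset gives back
the original String.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration; not part of Nutch.
class CharsetDemo {
    // Decode raw page bytes into a Java String using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Encode a Java String into bytes using the named charset.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The diacritics from the thread, written as Unicode escapes so the
        // result does not depend on the encoding of this source file.
        String polish = "\u0119\u00f3\u0142\u0105\u015b\u0107\u017c\u0144";

        // Simulate the bytes of a page served as windows-1250 (CP1250).
        byte[] raw = encode(polish, "windows-1250");

        // Decode the same bytes with each candidate charset and compare.
        for (String cs : new String[] {"windows-1250", "ISO-8859-2", "windows-1252"}) {
            String decoded = decode(raw, cs);
            System.out.println(cs + " round-trips: " + decoded.equals(polish));
        }
    }
}
```

Only the windows-1250 decode should round-trip. For example, the byte 0xA3 is
"Ł" in both windows-1250 and ISO-8859-2 but the pound sign in windows-1252,
so a page labelled windows-1252 cannot yield that letter when its bytes are
decoded as labelled.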

I have not had time to dig into the details of the matter. Has anyone
come across the issue and/or solved it?

2009/9/23 Dawid Weiss <dawid.we...@gmail.com>

> Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of
> course). Check if diacritics like these:
>
> ęółąśćżń
>
> look all right in the above encodings and use appropriately.
>
> Dawid
>
> On Wed, Sep 16, 2009 at 4:47 PM, MilleBii <mille...@gmail.com> wrote:
> > same thing when there is
> > charset=ISO-8859-2
> >
> > 2009/9/16 MilleBii <mille...@gmail.com>
> >
> >> Not sure where to look for explanations:
> >>
> >> I have a problem with some Polish pages which I cannot index properly on
> >> the specific Polish characters such as:
> >> Ł
> >>
> >> They have the following charset=windows-1252
> >>
> >> Does the HTML parser convert them into their Unicode equivalents?
> >>
> >> --
> >> -MilleBii-
> >>
> >
> >
> >
> > --
> > -MilleBii-
> >
>

-- 
-MilleBii-
