Hi Alex,

I cannot locate the Java file you mention,
org.apache.nutch.parse.html.HtmlParser, in either 1.2 or branch 1.3...

Having had a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in
both versions above), it appears that you are right: "double quotes" for
<meta http-equiv....> are accepted, whereas 'single quotes' are not. I would
be interested to see what kind of output you get when nutch-1.2 encounters
the single-quote meta syntax you highlight. Can you elaborate, please?

If your regex suggestion is working then I would stick with it. However,
this may be something you wish to raise in JIRA. Any comments?
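
For anyone wanting to check the suggested pattern in isolation, here is a
minimal sketch using plain java.util.regex. It is not Nutch code: the class
name MetaPatternDemo and the matches() helper are made up for illustration,
and the CASE_INSENSITIVE flag is an assumption on my part (without it the
lowercase "content-type" in the pattern would not match "Content-Type").

```java
import java.util.regex.Pattern;

// Hypothetical standalone check, not part of Nutch.
public class MetaPatternDemo {

    // Pattern string taken from the suggestion quoted below.
    // CASE_INSENSITIVE is an assumption made for this sketch.
    static final Pattern META_PATTERN = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Returns true if the snippet contains a content-type meta tag
    // that the pattern recognises.
    static boolean matches(String html) {
        return META_PATTERN.matcher(html).find();
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";
        String other = "<meta name='description' content='some page'>";

        System.out.println(matches(single)); // prints true
        System.out.println(matches(dbl));    // prints true
        System.out.println(matches(other));  // prints false
    }
}
```

This only shows that the pattern itself accepts both quote styles; it says
nothing about how the rest of the parser handles the matched tag.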
Lewis

On Tue, Jun 7, 2011 at 4:05 PM, Alex F <
alexander.fahlke.mailingli...@googlemail.com> wrote:

> Hi,
>
> the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not
> suitable for sites using single quotes for <meta http-equiv....>
>
>  Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
>  We experienced a couple of pages with that kind of quotes and Nutch-1.2
> was not able to handle it.
>
> Is there any fallback or would it be good to use the following regex
> (single or regular quotes are accepted)?
>
>  "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
>
> BR
>
> Alexander Fahlke
> Software Development
> www.informera.de
>



-- 
*Lewis*
