Hi Alex,

I cannot locate the Java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
Having a quick look at org.apache.nutch.parse.HTMLMetaTags (it is identical in both versions above), it appears that you are right: double quotes for <meta http-equiv....> are accepted, whereas single quotes are not. I would be interested to see what kind of output you get when nutch-1.2 encounters the single-quote meta syntax you highlight. Can you elaborate, please?

If your regex suggestion is working then I would stick with it; however, this may be something you wish to raise in JIRA. Any comments?

Lewis

On Tue, Jun 7, 2011 at 4:05 PM, Alex F <alexander.fahlke.mailingli...@googlemail.com> wrote:

> Hi,
>
> the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not
> suitable for sites using single quotes for <meta http-equiv....>
>
> Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
>
> We experienced a couple of pages with that kind of quotes and Nutch-1.2
> was not able to handle it.
>
> Is there any fallback or would it be good to use the following
> regex: "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
> (single or regular quotes are accepted)?
>
> BR
>
> Alexander Fahlke
> Software Development
> www.informera.de

--
*Lewis*
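
For anyone wanting to check the proposed pattern before raising this in JIRA, here is a minimal standalone sketch. It is not taken from the Nutch source: the class name MetaQuoteCheck and the helper matchesMeta are hypothetical, and compiling with Pattern.CASE_INSENSITIVE is an assumption (it is needed here because the example tag writes 'Content-Type' while the regex uses lowercase content-type).

```java
import java.util.regex.Pattern;

// Hypothetical test harness, not part of Nutch.
class MetaQuoteCheck {

    // Regex proposed in the thread: the quote around content-type may be
    // a double quote, a single quote, or absent.
    // Assumption: the pattern is compiled case-insensitively.
    static final Pattern META_PATTERN = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the snippet contains a matching meta tag.
    static boolean matchesMeta(String html) {
        return META_PATTERN.matcher(html).find();
    }

    public static void main(String[] args) {
        String singleQuoted =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String doubleQuoted =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";

        System.out.println("single quotes: " + matchesMeta(singleQuoted));
        System.out.println("double quotes: " + matchesMeta(doubleQuoted));
    }
}
```

Both inputs should be reported as matching. Note that because each quote is optional and the two are matched independently, the pattern also accepts unquoted values and mismatched quote pairs, which may or may not be acceptable for a fallback.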