Hi Dave, could you share an example document? Which Nutch version is used?
I tried to reproduce the problem without success using Nutch v1.16:

- example document:

  <html>
  <head>
  <title>Test metatags</title>
  <meta name='language' content='en'>
  <meta name='subject' content='test'>
  <meta name='Category' content='meta data'>
  </head>
  <body>
  test for metatag extraction
  </body>
  </html>

- using parse-html (works)

  > bin/nutch indexchecker -Dmetatags.names='*' \
      -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
      -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
      http://localhost/nutch/test_metatags.html
  fetching: http://localhost/nutch/test_metatags.html
  robots.txt whitelist not configured.
  parsing: http://localhost/nutch/test_metatags.html
  contentType: text/html
  tstamp : Mon Oct 14 13:24:14 CEST 2019
  metatag.language : en
  metatag.language : en
  metatag.category : meta data
  metatag.category : meta data
  digest : 50d08494ba791bb52fcdeebfc08ba640
  host : localhost
  metatag.subject : test
  metatag.subject : test
  id : http://localhost/nutch/test_metatags.html
  title : Test metatags
  url : http://localhost/nutch/test_metatags.html
  content : Test metatags test for metatag extraction

- using parse-tika (works)

  > bin/nutch indexchecker -Dmetatags.names='*' \
      -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
      -Dplugin.includes='protocol-http|parse-(tika|metatags)|index-(basic|metadata)' \
      http://localhost/nutch/test_metatags.html
  fetching: http://localhost/nutch/test_metatags.html
  robots.txt whitelist not configured.
  parsing: http://localhost/nutch/test_metatags.html
  contentType: text/html
  tstamp : Mon Oct 14 13:25:34 CEST 2019
  metatag.language : en
  metatag.language : en
  metatag.category : meta data
  metatag.category : meta data
  digest : 50d08494ba791bb52fcdeebfc08ba640
  host : localhost
  metatag.subject : test
  metatag.subject : test
  id : http://localhost/nutch/test_metatags.html
  title : Test metatags
  url : http://localhost/nutch/test_metatags.html
  content : Test metatags test for metatag extraction

There are currently two issues open around metatags:
  https://issues.apache.org/jira/browse/NUTCH-1559
  https://issues.apache.org/jira/browse/NUTCH-2525
Maybe it's related to one of those?

Best,
Sebastian

On 11.10.19 22:38, Dave Beckstrom wrote:
> Hi Everyone,
>
> It seems like I take 1 step forward and 2 steps backwards.
>
> I was using parse-tika and I needed to change to parse-html in order to use
> a plug-in for excluding content such as headers and footers.
>
> I have the excludes working with the plug-in. But now I see that all of
> the metatags are missing from solr. The metatag fields are defined in SOLR
> but not populated.
>
> Metatags were working prior to the change to parse-html. What would
> explain the metatags not being indexed when the configuration
> parameters didn't change? Is there some other setting for parse-html that
> I need to look into?
>
> Thanks!
>
>
> <property>
> <name>plugin.includes</name>
>
> <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
> <description> </description>
> </property>
> <!-- index all metatags -->
> <property>
> <name>metatags.names</name>
> <value>*</value>
> <description> </description>
> </property>
> <property>
> <name>index.parse.md</name>
> <value>metatag.language,metatag.subject,metatag.category</value>
> <description> </description>
> </property>
>
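
P.S. A rough way to cross-check which metatag.* fields a page should yield, independent of Nutch, is to list its <meta> tags directly. The sketch below uses only Python's standard html.parser on the test document from this thread. The names MetaTagLister and list_metatags are made up for illustration, and lowercasing the tag names is an assumption inferred from the indexchecker output (name='Category' shows up as metatag.category). It does not reproduce what parse-metatags actually does internally.

```python
from html.parser import HTMLParser


class MetaTagLister(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.metatags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Assumption: lowercase the name so that name='Category'
            # maps to metatag.category, as seen in the indexchecker output.
            self.metatags["metatag." + name.lower()] = content


def list_metatags(html):
    """Return a dict of hypothetical metatag.* field names to their content."""
    parser = MetaTagLister()
    parser.feed(html)
    return parser.metatags


# The test document used for the indexchecker runs in this thread.
EXAMPLE = """<html>
<head>
<title>Test metatags</title>
<meta name='language' content='en'>
<meta name='subject' content='test'>
<meta name='Category' content='meta data'>
</head>
<body>
test for metatag extraction
</body>
</html>"""

if __name__ == "__main__":
    for field, value in list_metatags(EXAMPLE).items():
        print(field, ":", value)
```

If a field listed here is named in index.parse.md but missing from the indexchecker output, the problem is more likely in the plugin configuration than in the page itself.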