Re: metatags missing with parse-html

Sebastian Nagel Mon, 14 Oct 2019 05:03:28 -0700

Hi Dave,

Could you share an example document? Which Nutch version are you using?


I tried to reproduce the problem without success using Nutch v1.16:

- example document:

<html>
<head>
<title>Test metatags</title>
<meta name='language' content='en'>
<meta name='subject'  content='test'>
<meta name='Category' content='meta data'>
</head>
<body>
test for metatag extraction
</body>
</html>

- using parse-html (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :        Mon Oct 14 13:24:14 CEST 2019
metatag.language :      en
metatag.language :      en
metatag.category :      meta data
metatag.category :      meta data
digest :        50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :       test
metatag.subject :       test
id :    http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :       Test metatags
test for metatag extraction

- using parse-tika (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(tika|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :        Mon Oct 14 13:25:34 CEST 2019
metatag.language :      en
metatag.language :      en
metatag.category :      meta data
metatag.category :      meta data
digest :        50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :       test
metatag.subject :       test
id :    http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :       Test metatags
test for metatag extraction
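
If it helps to cross-check a real page independently of Nutch, here is a minimal
Python sketch that lists the fields I would expect for a given document. It is
not Nutch code: the class and function names are made up for illustration, and
it assumes the keys are the lowercased meta names prefixed with "metatag.",
which is what the output above suggests (note 'Category' in the document vs.
metatag.category in the index fields).

```python
from html.parser import HTMLParser

# The example document from this mail.
DOC = """<html>
<head>
<title>Test metatags</title>
<meta name='language' content='en'>
<meta name='subject'  content='test'>
<meta name='Category' content='meta data'>
</head>
<body>
test for metatag extraction
</body>
</html>"""


class MetaTagCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Assumption: keys are lowercased and prefixed with "metatag.",
            # mirroring the field names seen in the indexchecker output.
            self.fields["metatag." + name.lower()] = content


def collect_metatags(html):
    """Return a dict of expected metatag fields for an HTML string."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.fields


if __name__ == "__main__":
    for key, value in sorted(collect_metatags(DOC).items()):
        print(f"{key} :\t{value}")
```

Any field this lists for your page but which is missing from the indexchecker
output would be a good starting point for narrowing the problem down.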


There are currently two open issues around metatags:
 https://issues.apache.org/jira/browse/NUTCH-1559
 https://issues.apache.org/jira/browse/NUTCH-2525

Maybe it's related to one of those?


Best,
Sebastian


On 11.10.19 22:38, Dave Beckstrom wrote:
> Hi Everyone,
> 
> It seems like I take 1 step forward and 2 steps backwards.
> 
> I was using parse-tika and I needed to change to parse-html in order to use
> a plug-in for excluding content such as headers and footers.
> 
> I have the excludes working with the plug-in.  But now I see that all of
> the metatags are missing from solr.  The metatag fields are defined in SOLR
> but not populated.
> 
> Metatags were working prior to the change to parse-html.  What would
> explain the metatags not being indexed when the configuration
> parameters didn't change?  Is there some other setting for parse-html that
> I need to look into?
> 
> Thanks!
> 
> 
>  <property>
>   <name>plugin.includes</name>
>   <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
>   <description> </description>
>  </property>
>  <!--  index all metatags -->
>  <property>
>   <name>metatags.names</name>
>   <value>*</value>
>   <description> </description>
>  </property>
>  <property>
>   <name>index.parse.md</name>
>    <value>metatag.language,metatag.subject,metatag.category</value>
>   <description> </description>
> </property>
>
