[
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated NUTCH-809:
--------------------------------
Attachment: (was: NUTCH-809.patch)
> Parse-metatags plugin
> ---------------------
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
>
> h2. Parse-metatags plugin
> *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see
> [TIKA-379]).*
> To use the legacy HTML parser, specify the following in parse-plugins.xml:
> {code:xml}
> <mimeType name="text/html">
> <plugin id="parse-html" />
> </mimeType>
> {code}
> The parse-metatags plugin consists of an HTMLParserFilter that takes as a
> parameter a list of metatag names, with '*' as the default value. The names
> are separated by ';'.
> To extract the values of the 'description' and 'keywords' metatags, specify
> the following in nutch-site.xml:
> {code:xml}
> <property>
> <name>metatags.names</name>
> <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two
> fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
> The MetaTagsQueryFilter allows the fields above to be included in Nutch
> queries.
> This code has been developed by DigitalPebble Ltd and offered to the
> community by ANT.com
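
The handling of metatags.names described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in the attached patch: the class MetatagNamesSketch and its methods parseNames, accepts and select are hypothetical names, and the trimming, lower-casing and fallback to '*' for a missing or blank value are assumptions about how such a filter might behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch or of the NUTCH-809 patch.
public class MetatagNamesSketch {

    // Split a metatags.names value such as "description;keywords" on ';'.
    // Assumption: names are trimmed and lower-cased, and a missing or blank
    // value falls back to the wildcard '*'.
    public static Set<String> parseNames(String value) {
        Set<String> names = new LinkedHashSet<>();
        String raw = (value == null) ? "*" : value;
        for (String part : raw.split(";")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        if (names.isEmpty()) {
            names.add("*");
        }
        return names;
    }

    // A metatag is kept if the wildcard is configured or its name is listed.
    public static boolean accepts(Set<String> names, String metaName) {
        return names.contains("*")
            || names.contains(metaName.toLowerCase(Locale.ROOT));
    }

    // Collect the values of the accepted metatags. Each input entry is a
    // {name, content} pair; a name may occur several times, so every name
    // maps to a list of values (this is why 'keywords' can be multivalued).
    public static Map<String, List<String>> select(Set<String> names,
                                                   List<String[]> metaTags) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (String[] tag : metaTags) {
            if (accepts(names, tag[0])) {
                kept.computeIfAbsent(tag[0].toLowerCase(Locale.ROOT),
                                     k -> new ArrayList<>()).add(tag[1]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> tags = Arrays.asList(
            new String[] {"description", "A sample page"},
            new String[] {"keywords", "nutch"},
            new String[] {"keywords", "crawler"},
            new String[] {"author", "someone"});
        // With "description;keywords" configured, the author tag is skipped.
        System.out.println(select(parseNames("description;keywords"), tags));
    }
}
```

In this sketch, select returns a map from metatag name to a list of values. An indexing step could then emit one field value per list entry, which matches the note above that 'keywords' is multivalued.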
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.