Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "IndexMetatags" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/IndexMetatags?action=diff&rev1=6&rev2=7

Comment:
NUTCH-1827: update description on configuration with 2.x; update status information (availability for Nutch versions)

  = Nutch - Parse Metatags =
- '''Summary:''' When crawling HTML pages, it might be necessary to retrieve information which is stored in HTML Meta tags. This tutorial shows how to install the plugin and configure Nutch to parse meta tags into separate fields in the Solr index. Note that Nutch pushes the information to Solr, so this tutorial also includes the changes required to Solr. This article relates to the parse`-metatags` plugin, provided in jira: https://issues.apache.org/jira/browse/NUTCH-809
+ '''Summary:''' When crawling HTML pages, it might be necessary to retrieve information which is stored in HTML Meta tags. This tutorial shows how to install the plugin and configure Nutch to parse meta tags into separate fields in the Solr index. Note that Nutch pushes the information to Solr, so this tutorial also includes the changes required to Solr. This article relates to the parse`-metatags` plugin, provided in jira: [[https://issues.apache.org/jira/browse/NUTCH-809|NUTCH-809]]
+ {{{#!wiki solid
- 
- {{{
- The current version of plugin in 1.x series cannot parse multiValued metatags. Please check https://issues.apache.org/jira/browse/NUTCH-1467 for patch. See also NUTCH-1467 and NUTCH-1561 for improvements.
- 
- This plugin is not included in 2.x series (it will be included in 2.3). Please check https://issues.apache.org/jira/browse/NUTCH-1478 for patch, and also NUTCH-1827.
+ This plugin is not included in 2.x series (it will be included in 2.3). Please check [[https://issues.apache.org/jira/browse/NUTCH-1478|NUTCH-1478]] for patch, and also [[https://issues.apache.org/jira/browse/NUTCH-1827|NUTCH-1827]].
  }}}
  == Plugin Information ==
- This plugin has been committed to the trunk in revision 1303371 and will be available in Nutch 1.5. It parses specified meta tags and relies on the index`-metadata `plugin.
+ This plugin parses specified meta tags and relies on the `index-metadata` plugin. It has been included in Nutch 1.5.
+ With Nutch 1.7 all values of multi-valued metatags are added (see [[https://issues.apache.org/jira/browse/NUTCH-1467|NUTCH-1467]]),
+ with Nutch 1.9 the configuration is simplified ([[https://issues.apache.org/jira/browse/NUTCH-1561|NUTCH-1561]]).
  == Plugin Configuration ==
  1. In the file `conf/nutch-site.xml`, edit the property `plugin.includes` to contain the following plugins: `parse-metatags` and index`-metadata` so it looks like for example:
@@ -34, +33 @@
  </description>
  </property>
  }}}
- 1. In the same file you need to configure the index`-metadata `plugin. The values are stored in the parse metadata so we need to specify :
+ 1. In the same file you need to configure the `index-metadata` plugin. The values are stored in the parse metadata so we need to specify the property `index.parse.md`:
  {{{
  <property>
  <name>index.parse.md</name>
@@ -46, +45 @@
  </description>
  </property>
  }}}
- '''CAUTION : '''the names of the fields must be prefixed with 'metatag.' (1.x only).
+ '''CAUTION''' (1.x only): the names of the fields must be prefixed with 'metatag.'!
- For 2.x: enter comma-separated metatags (without any prefix) which should be indexed to the property `index.metadata`.
+ For '''2.x''' enter comma-separated metatags (without any prefix) which should be indexed to the property `index.metadata`:
+ {{{
+ <property>
+ <name>index.metadata</name>
+ <value>description,keywords</value>
+ <description>
+ Comma-separated list of keys to be taken from the metadata to generate fields.
+ Can be used e.g. for 'description' or 'keywords' provided that these values are generated
+ by a parser (see parse-metatags plugin), and property 'metatags.names'.
+ </description>
+ </property>
+ }}}
  1. You can test that the fields are generated correctly by using the [[bin/nutch indexchecker]] command
  1. In order to have the specified metatags indexed by Solr, edit your Solr `schema.xml` (located in `$SOLR_HOME$/conf`) and include new fields for each metatag you want to index. For example for the field 'role', add the following lines:
  {{{