[ https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche resolved NUTCH-747.
---------------------------------

    Resolution: Implemented

This has since been made possible thanks to:
- Metadata injection (https://issues.apache.org/jira/browse/NUTCH-655)
- urlmeta plugin
- index-metadata plugin

> inject&Index metadatas and inherit these metadatas to all matching suburls
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-747
>                 URL: https://issues.apache.org/jira/browse/NUTCH-747
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, injector
>            Reporter: Marko Bauhardt
>         Attachments: index-metadata.patch, metadata.patch
>
>
> Hi.
> The following two patches support:
> + injecting metadata for URLs into a metadatadb
>   url.com <TAB> <METAKEY> : <TAB> <METAVALUE> <TAB> <METAVALUE> <METAKEY> : <METAVALUE> ...
>   ...
> + updating the parse_data metadata of a shard and writing the metadata to all
>   fetched URLs that start with a URL from the metadatadb
> + inheritance of metadata by all matching sub-URLs
> The second patch implements an index-metadata plugin.
> + This plugin extracts all metadata from the parse_data of a shard and indexes
>   it. Which metadata is indexed can be configured in plugin.properties.
> + To index, for example, the lang, configure plugin.properties as follows:
>   lang=STORE,UNTOKENIZED
> + This means that the index plugin extracts the metadata values with key "lang".
>   If any exist, all values are indexed as stored and untokenized.
> Example
> Create start URLs in "/tmp/urls/start/urls.txt"
>   http://lucene.apache.org/nutch/apidocs-1.0/index.html
>   http://lucene.apache.org/nutch/apidocs-0.9/index.html
> Create metadata URLs in "/tmp/urls/metadata/urls.txt"
>   http://lucene.apache.org/nutch/apidocs-1.0/ version: 1.0
>   http://lucene.apache.org/nutch/apidocs-0.9/ version: 0.9
> Inject URLs
>   bin/nutch inject crawldb /tmp/urls/start/
>   bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb /tmp/urls/metadata/
> Fetch & Parse & Update
>   bin/nutch generate crawldb segments
>   bin/nutch fetch segments/20090806105717/
>   bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb segments/20090806105717
>   bin/nutch updatedb crawldb/ segments/20090806105717/
> Fetch & Parse & Update Again
>   ...
> Index
>   bin/nutch invertlinks linkdb -dir segments/
>   bin/nutch index index crawldb/ linkdb/ segments/20090806105717 segments/20090806110127
> Check your Index
> All URLs starting with "http://lucene.apache.org/nutch/apidocs-1.0/" are
> indexed with "version:1.0".
> All URLs starting with "http://lucene.apache.org/nutch/apidocs-0.9/" are
> indexed with "version:0.9".
> This issue is somewhat related to NUTCH-655

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
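
The prefix-based inheritance described in the issue (metadata attached to a URL in the metadatadb is written to every fetched URL that starts with it) can be illustrated with a minimal sketch. This is not the code from the attached patches or from the urlmeta plugin. The function name inherit_metadata, the dictionary layout, and the about.html URL are assumptions made for illustration only, and matching is assumed to be a plain string-prefix test.

```python
from collections import defaultdict


def inherit_metadata(metadata_db, fetched_urls):
    """Return {url: {key: [values]}} for every fetched URL.

    Assumption (not taken from the patches): a fetched URL inherits the
    metadata of every metadatadb entry whose URL is a string prefix of it.
    """
    result = {}
    for url in fetched_urls:
        merged = defaultdict(list)
        for prefix, meta in metadata_db.items():
            if url.startswith(prefix):
                # Collect the values of every matching prefix under each key.
                for key, values in meta.items():
                    merged[key].extend(values)
        result[url] = dict(merged)
    return result


# Metadata URLs from the example in the issue.
metadata_db = {
    "http://lucene.apache.org/nutch/apidocs-1.0/": {"version": ["1.0"]},
    "http://lucene.apache.org/nutch/apidocs-0.9/": {"version": ["0.9"]},
}

# The two start URLs from the issue, plus one hypothetical URL that
# matches no metadatadb prefix.
fetched = [
    "http://lucene.apache.org/nutch/apidocs-1.0/index.html",
    "http://lucene.apache.org/nutch/apidocs-0.9/index.html",
    "http://lucene.apache.org/nutch/about.html",
]

tagged = inherit_metadata(metadata_db, fetched)
for url, meta in tagged.items():
    print(url, meta)
```

Under these assumptions, a URL that matches no prefix simply ends up with no inherited metadata, and a URL that matches several prefixes collects the values of all of them.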