Jonathan et al.,

URLMeta Plugin Test to Force Updated Metadata from seed.txt

Background: The URLMeta plugin lets you assign a metadata value to a URL in the seed.txt file. This metadata value is inherited by all pages crawled within that domain name. The question is what happens when the metadata value is changed in the seed.txt file. Additionally, we tested adding a new metatag name to see whether it would propagate.

Hypothesis: If the seed.txt file is updated with a new metadata value and db.injector.overwrite is set to true, all the URLs in the domain will be updated to the new metadata value when they are refetched/parsed/indexed.

Test Scenario: db.fetch.interval.default=60 (seconds) and db.fetch.interval.max=180 (seconds) were set to very small intervals so that the URLs would be refetched quickly. (Do not do this in production unless you have unlimited bandwidth and hardware.) Fetch times were checked by dumping the crawldb to confirm the URLs were being refetched. Additionally, the segments were reviewed to confirm the URLs were being sent to Solr after the metadata changes.

Expected Results:
- The URLs listed in the seed.txt file were updated with the new metadata values and the new metatag name/value.

Unexpected Results:
- The other URLs within the domain were NOT updated with the new metadata values.
- All new URLs identified, fetched, and parsed did NOT use the new metadata values.
- All new URLs identified, fetched, and parsed did NOT pick up the new metatag name/value.

Additional questions:
1. It appears the metatag values are managed in plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java. According to the comments there, the metatags/metadata appear to be set at outlink creation. Is that correct?
2. Has anyone ever fixed or tackled this issue?
3. Will NUTCH-1872 (enables control over how injected metadata is propagated) fix this issue?

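To make the suspicion in question 1 concrete, here is a toy Python model of what I think may be happening. This is not Nutch code: the functions inject and add_outlinks, the example.com URLs, and the "section"/"owner" tags are all made up for illustration. It assumes metadata is copied to a URL only once, when that URL is first created as an outlink, which is my reading of the comments and may be wrong.

```python
# Toy model only, NOT Nutch code. Assumption: a URL receives its parent's
# metadata once, at the moment it is first added to the crawldb as an outlink.

def inject(crawldb, seeds, overwrite=True):
    """Inject seed URLs; with overwrite=True, seed metadata replaces stored metadata."""
    for url, meta in seeds.items():
        if overwrite or url not in crawldb:
            crawldb[url] = dict(meta)

def add_outlinks(crawldb, parent, outlinks):
    """Copy the parent's metadata to each outlink that is not yet in the crawldb."""
    for url in outlinks:
        if url not in crawldb:
            crawldb[url] = dict(crawldb[parent])

crawldb = {}
seed = "http://example.com/"
page = "http://example.com/a"
new_page = "http://example.com/b"

# First crawl: the seed carries the old value and its outlink inherits it.
inject(crawldb, {seed: {"section": "old"}})
add_outlinks(crawldb, seed, [page])

# seed.txt is edited: changed value plus a new tag, re-injected with overwrite.
inject(crawldb, {seed: {"section": "new", "owner": "team-a"}})
add_outlinks(crawldb, seed, [page])      # page already exists, so nothing is copied
add_outlinks(crawldb, page, [new_page])  # new URL inherits from its stale parent

print(crawldb[seed])      # {'section': 'new', 'owner': 'team-a'}
print(crawldb[page])      # {'section': 'old'}
print(crawldb[new_page])  # {'section': 'old'}
```

Under that assumption only the seed entry changes, the existing page keeps the old value, and a new URL discovered through a non-seed page inherits the stale value from its parent, which matches the unexpected results above.
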
Thank you,
Jeff

On Fri, Apr 3, 2015 at 1:07 PM, Jonathan Cooper-Ellis <[email protected]> wrote:

> I think if you have "db.injector.overwrite" property configured to true,
> the new values will be injected and then when the outlinks are recrawled
> they'll be updated accordingly, but honestly I'm not totally sure.
>
> --
> Jonathan Cooper-Ellis
> Field Enablement Engineer
> <http://www.cloudera.com>

