Jonathan et al.,

URLMeta Plugin Test to Force Updated Metadata from seed.txt

Background: The URLMeta plugin lets you attach a metadata value to a URL
in the seed.txt file. This metadata value is inherited by all pages crawled
within that domain. The question is what happens when the metadata value
is changed in the seed.txt file. We also tested adding a new metatag name
to see whether it would propagate.

Hypothesis: If the seed.txt file is updated with a new metadata value and
db.injector.overwrite is set to true, all the URLs in the domain will be
updated to the new metadata value when they are refetched, parsed, and
indexed.

Test Scenario:

db.fetch.interval.default was set to 60 (seconds) and
db.fetch.interval.max to 180 (seconds). These very small intervals force
the URLs to be refetched quickly. (Do not do this in production unless you
have unlimited bandwidth and hardware.)
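
For reference, the test settings described above would look roughly like
this in conf/nutch-site.xml. This is a sketch, not a copy of the actual
file: the property names and values are the ones mentioned in this message,
and the comments are only explanatory.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Test-only values; far too aggressive for a production crawl. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>60</value> <!-- seconds between refetches of a page -->
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <value>180</value> <!-- upper bound on the refetch interval, in seconds -->
  </property>
  <property>
    <name>db.injector.overwrite</name>
    <value>true</value> <!-- let re-injected seeds replace existing crawldb entries -->
  </property>
</configuration>
```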

Fetch times were checked by dumping the crawldb to confirm the URLs were
being refetched. The segments were also reviewed to confirm the URLs were
being sent to Solr after the metadata changes.

Expected Results:

- The URLs listed in the seed.txt file were updated with the new metadata
values and the new metatag name/value.

Unexpected Results:
- The other URLs within the domain were NOT updated with the new metadata
values.
- All new URLs identified, fetched, and parsed did NOT use the new metadata
values.
- All new URLs identified, fetched, and parsed did NOT pick up the new
metatag name/value.


Additional questions:
1. It appears the metatag values are managed in
plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java.
According to the comments there, the metatags/metadata are set at outlink
creation. Is that correct?
2. Has anyone already fixed or tackled this issue?
3. Will NUTCH-1872 (enables control over how injected metadata is
propagated) fix this issue?

thank you

jeff


On Fri, Apr 3, 2015 at 1:07 PM, Jonathan Cooper-Ellis <
[email protected]> wrote:

> I think if you have "db.injector.overwrite" property configured to true,
> the new values will be injected and then when the outlinks are recrawled
> they'll be updated accordingly, but honestly I'm not totally sure.
>
> --
> Jonathan Cooper-Ellis
> Field Enablement Engineer
> <http://www.cloudera.com>
>
