Sebastian Nagel created NUTCH-3010:
--------------------------------------

             Summary: Injector: count unique number of injected URLs
                 Key: NUTCH-3010
                 URL: https://issues.apache.org/jira/browse/NUTCH-3010
             Project: Nutch
          Issue Type: Improvement
          Components: injector
    Affects Versions: 1.19
            Reporter: Sebastian Nagel
            Assignee: Sebastian Nagel
             Fix For: 1.20


Injector uses two counters: one for the total number of injected URLs, the 
other for the number of URLs "merged", that is, already in CrawlDb. There is no 
counter for the number of unique URLs injected, which may lead to wrong counts 
if the seed files contain duplicates.

Consider the following seed file, which contains a duplicated URL:

{noformat}
$> cat seeds_with_duplicates.txt 
https://www.example.org/page1.html
https://www.example.org/page2.html
https://www.example.org/page2.html

$> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
...
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
...
{noformat}

However, because of the duplicated URL, only 2 URLs were injected into the 
CrawlDb:

{noformat}
$> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
...
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
...
{noformat}

If the Injector job is run again with the same input, the output erroneously 
reports that one "new URL" was still injected:

{noformat}
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
{noformat}

This is because the urls_merged counter counts unique items, while 
urls_injected does not.

Adding a counter for the number of unique injected URLs will make it possible 
to report the correct count of newly injected URLs.
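
The arithmetic behind the two log excerpts can be reproduced with a small 
simulation. This is only a sketch, not the Injector code: it assumes the 
total is counted once per seed line and the merged count once per distinct 
URL, as described above. The counter name urls_injected_unique is 
hypothetical, as are the helper functions inject() and report().

```python
from collections import Counter


def inject(crawldb, seeds):
    """Simulate one inject run against an in-memory CrawlDb (a set of URLs)."""
    counters = Counter()
    # Counted once per seed line, so duplicates in the seed file are included.
    counters["urls_injected"] = len(seeds)
    # Counted once per distinct URL.
    for url in set(seeds):
        counters["urls_injected_unique"] += 1  # hypothetical new counter
        if url in crawldb:
            counters["urls_merged"] += 1
        else:
            crawldb.add(url)
    return counters


def report(counters):
    """Return (new URLs as currently reported, new URLs actually added)."""
    reported_new = counters["urls_injected"] - counters["urls_merged"]
    actual_new = counters["urls_injected_unique"] - counters["urls_merged"]
    return reported_new, actual_new


seeds = [
    "https://www.example.org/page1.html",
    "https://www.example.org/page2.html",
    "https://www.example.org/page2.html",
]
crawldb = set()
first = report(inject(crawldb, seeds))   # run on an empty CrawlDb
second = report(inject(crawldb, seeds))  # same seeds injected again
print(first)   # (3, 2)
print(second)  # (1, 0)
```

In both runs the first value matches the "Total new urls injected" line 
logged above (3, then 1), while the second value is the number of URLs 
actually added to the CrawlDb (2, then 0).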



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
