[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771020#comment-17771020 ]
ASF GitHub Bot commented on NUTCH-3010:
---------------------------------------

sebastian-nagel merged PR #783:
URL: https://github.com/apache/nutch/pull/783

> Injector: count unique number of injected URLs
> ----------------------------------------------
>
>                 Key: NUTCH-3010
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3010
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>   Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> Injector uses two counters: one for the total number of injected URLs, the
> other for the number of URLs "merged", that is, already in CrawlDb. There is
> no counter for the number of unique URLs injected, which may lead to wrong
> counts if the seed files contain duplicates.
> Suppose the following seed file, which contains a duplicated URL:
> {noformat}
> $> cat seeds_with_duplicates.txt
> https://www.example.org/page1.html
> https://www.example.org/page2.html
> https://www.example.org/page2.html
> $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
> ...
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
> ...
> {noformat}
> However, because of the duplicated URL, only 2 URLs were injected into the
> CrawlDb:
> {noformat}
> $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
> ...
> 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2
> ...
> {noformat}
> If the Injector job is run again with the same input, we get the erroneous
> output that one "new URL" was still injected:
> {noformat}
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
> {noformat}
> This is because the urls_merged counter counts unique items, while
> url_injected does not, and the number shown is the difference between the
> two counters.
> Adding a counter for the number of unique injected URLs makes it possible
> to report the correct count of newly injected URLs.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
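
The counter arithmetic described in the issue can be reproduced with a small simulation. This is only an illustrative sketch, not the Injector's code or the change made in the PR: the function `inject_counts`, the result keys, and the name `urls_injected_unique` are hypothetical, and the CrawlDb is modelled as a plain set of URLs.

```python
def inject_counts(seeds, crawldb):
    """Simulate the counters of one Injector run (illustrative model only).

    seeds   -- list of seed URLs, possibly containing duplicates
    crawldb -- set of URLs already stored in the CrawlDb
    """
    urls_injected = len(seeds)            # counts every seed line, duplicates included
    unique_seeds = set(seeds)
    urls_injected_unique = len(unique_seeds)    # hypothetical name for the proposed counter
    urls_merged = len(unique_seeds & crawldb)   # unique seeds already in the CrawlDb
    return {
        # difference of a non-unique and a unique counter, as in the report
        "new_reported_without_unique_counter": urls_injected - urls_merged,
        # difference of two unique counters
        "new_reported_with_unique_counter": urls_injected_unique - urls_merged,
        # ground truth: URLs really added to the CrawlDb
        "actually_new": len(unique_seeds - crawldb),
    }


seeds = [
    "https://www.example.org/page1.html",
    "https://www.example.org/page2.html",
    "https://www.example.org/page2.html",  # duplicate, as in seeds_with_duplicates.txt
]

first_run = inject_counts(seeds, crawldb=set())
second_run = inject_counts(seeds, crawldb=set(seeds))
print(first_run)
print(second_run)
```

In this model the first run reports 3 new URLs without the unique counter and the second run reports 1, matching the log output quoted above. With a unique count the figures become 2 and 0, which agrees with the URLs actually added.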