[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771038#comment-17771038 ]
Hudson commented on NUTCH-3010:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #129 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/129/])
NUTCH-3010 Injector: count unique number of injected URLs (snagel: [https://github.com/apache/nutch/commit/810b1d6ad50fa9021469b4ca5e1db9050a3263c5])
* (edit) src/java/org/apache/nutch/crawl/Injector.java

> Injector: count unique number of injected URLs
> ----------------------------------------------
>
>                 Key: NUTCH-3010
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3010
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The Injector uses two counters: one for the total number of injected URLs, the
> other for the number of URLs "merged", that is, URLs already present in the
> CrawlDb. There is no counter for the number of unique URLs injected, which may
> lead to wrong counts if the seed files contain duplicates.
> Consider the following seed file, which contains a duplicated URL:
> {noformat}
> $> cat seeds_with_duplicates.txt
> https://www.example.org/page1.html
> https://www.example.org/page2.html
> https://www.example.org/page2.html
> $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
> ...
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
> ...
> {noformat}
> However, because of the duplicated URL, only 2 URLs were injected into the
> CrawlDb:
> {noformat}
> $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
> ...
> 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2
> ...
> {noformat}
> If the Injector job is run again with the same input, we get the erroneous
> output that one "new URL" was still injected:
> {noformat}
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
> {noformat}
> This happens because the urls_merged counter counts unique items, while
> url_injected does not, and the reported number of new URLs is the difference
> between the two counters.
> Adding a counter for the number of unique injected URLs makes it possible to
> report the correct count of newly injected URLs.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
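The counter arithmetic described in the issue can be sketched in plain Java, without Hadoop. This is only an illustration of the reported behavior, not the actual Injector code: the class, method, and counter-array layout below are made up for the example, and the counter names in the comments are illustrative rather than the exact Nutch counter identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InjectCounters {
  /**
   * Simulates the Injector's counters over a seed list that may contain
   * duplicates. Returns {injected, injectedUnique, merged}:
   * - injected: incremented once per seed record, duplicates included
   *   (the pre-NUTCH-3010 behavior of the "urls injected" counter)
   * - injectedUnique: number of distinct seed URLs (the counter added
   *   by NUTCH-3010)
   * - merged: distinct seed URLs already present in the CrawlDb
   */
  static long[] count(List<String> seeds, Set<String> crawlDb) {
    long injected = 0;
    long merged = 0;
    Set<String> unique = new HashSet<>();
    for (String url : seeds) {
      injected++;           // counts every record, including duplicates
      unique.add(url);
    }
    for (String url : unique) {
      if (crawlDb.contains(url)) {
        merged++;           // counts unique items only
      }
    }
    return new long[] { injected, unique.size(), merged };
  }

  public static void main(String[] args) {
    List<String> seeds = Arrays.asList(
        "https://www.example.org/page1.html",
        "https://www.example.org/page2.html",
        "https://www.example.org/page2.html");

    // First run: empty CrawlDb.
    long[] c = count(seeds, new HashSet<>());
    // Old formula (injected - merged): 3 - 0 = 3, but only 2 URLs end up
    // in the CrawlDb. Fixed formula (unique - merged): 2 - 0 = 2.
    System.out.println("new (old formula) = " + (c[0] - c[2]));
    System.out.println("new (fixed)       = " + (c[1] - c[2]));

    // Second run: both unique URLs are now in the CrawlDb.
    Set<String> crawlDb = new HashSet<>(Arrays.asList(
        "https://www.example.org/page1.html",
        "https://www.example.org/page2.html"));
    c = count(seeds, crawlDb);
    // Old formula: 3 - 2 = 1 (the erroneous "1 new URL" from the issue).
    // Fixed formula: 2 - 2 = 0.
    System.out.println("new (old formula) = " + (c[0] - c[2]));
    System.out.println("new (fixed)       = " + (c[1] - c[2]));
  }
}
```

Running this reproduces the mismatch from the log output above: mixing a duplicate-inclusive counter with a unique counter makes their difference meaningless, which is exactly why NUTCH-3010 adds a unique-injected counter.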