[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771038#comment-17771038 ]
Hudson commented on NUTCH-3010:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #129 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/129/])
NUTCH-3010 Injector: count unique number of injected URLs (snagel: [https://github.com/apache/nutch/commit/810b1d6ad50fa9021469b4ca5e1db9050a3263c5])
* (edit) src/java/org/apache/nutch/crawl/Injector.java

> Injector: count unique number of injected URLs
> ----------------------------------------------
>
>                 Key: NUTCH-3010
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3010
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The Injector uses two counters: one for the total number of injected URLs, the
> other for the number of URLs "merged", that is, URLs already present in the
> CrawlDb. There is no counter for the number of unique URLs injected, which may
> lead to wrong counts if the seed files contain duplicates.
> Consider the following seed file, which contains a duplicated URL:
> {noformat}
> $> cat seeds_with_duplicates.txt
> https://www.example.org/page1.html
> https://www.example.org/page2.html
> https://www.example.org/page2.html
> $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
> ...
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
> ...
> {noformat}
> However, because of the duplicated URL, only 2 URLs were injected into the
> CrawlDb:
> {noformat}
> $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
> ...
> 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls: 2
> ...
> {noformat}
> If the Injector job is run again with the same input, we get the erroneous
> output that one "new URL" was still injected:
> {noformat}
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
> {noformat}
> This happens because the urls_merged counter counts unique items, while
> url_injected does not, and the reported number of new URLs is the difference
> between the two counters.
> Adding a counter for the number of unique injected URLs makes it possible to
> report the correct count of newly injected URLs.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
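The counter arithmetic described in the issue can be sketched in plain Java, without Hadoop. This is only an illustration of the reported behavior, not the actual Injector code: the class, method, and counter-array layout below are made up for the example, and the counter names in the comments are illustrative rather than the exact Nutch counter identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InjectCounters {
  /**
   * Simulates the Injector's counters over a seed list that may contain
   * duplicates. Returns {injected, injectedUnique, merged}:
   * - injected: incremented once per seed record, duplicates included
   *   (the pre-NUTCH-3010 behavior of the "urls injected" counter)
   * - injectedUnique: number of distinct seed URLs (the counter added
   *   by NUTCH-3010)
   * - merged: distinct seed URLs already present in the CrawlDb
   */
  static long[] count(List<String> seeds, Set<String> crawlDb) {
    long injected = 0;
    long merged = 0;
    Set<String> unique = new HashSet<>();
    for (String url : seeds) {
      injected++;           // counts every record, including duplicates
      unique.add(url);
    }
    for (String url : unique) {
      if (crawlDb.contains(url)) {
        merged++;           // counts unique items only
      }
    }
    return new long[] { injected, unique.size(), merged };
  }

  public static void main(String[] args) {
    List<String> seeds = Arrays.asList(
        "https://www.example.org/page1.html",
        "https://www.example.org/page2.html",
        "https://www.example.org/page2.html");

    // First run: empty CrawlDb.
    long[] c = count(seeds, new HashSet<>());
    // Old formula (injected - merged): 3 - 0 = 3, but only 2 URLs end up
    // in the CrawlDb. Fixed formula (unique - merged): 2 - 0 = 2.
    System.out.println("new (old formula) = " + (c[0] - c[2]));
    System.out.println("new (fixed)       = " + (c[1] - c[2]));

    // Second run: both unique URLs are now in the CrawlDb.
    Set<String> crawlDb = new HashSet<>(Arrays.asList(
        "https://www.example.org/page1.html",
        "https://www.example.org/page2.html"));
    c = count(seeds, crawlDb);
    // Old formula: 3 - 2 = 1 (the erroneous "1 new URL" from the issue).
    // Fixed formula: 2 - 2 = 0.
    System.out.println("new (old formula) = " + (c[0] - c[2]));
    System.out.println("new (fixed)       = " + (c[1] - c[2]));
  }
}
```

Running this reproduces the mismatch from the log output above: mixing a duplicate-inclusive counter with a unique counter makes their difference meaningless, which is exactly why NUTCH-3010 adds a unique-injected counter.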