[ 
https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3010:
-----------------------------------
    Description: 
Injector uses two counters: one for the total number of injected URLs, the 
other for the number of URLs "merged", that is, already in CrawlDb. There is no 
counter for the number of unique URLs injected, which may lead to wrong counts 
if the seed files contain duplicates.

Consider the following seed file, which contains a duplicated URL:
{noformat}
$> cat seeds_with_duplicates.txt 
https://www.example.org/page1.html
https://www.example.org/page2.html
https://www.example.org/page2.html

$> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
...
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
...
{noformat}
However, because of the duplicated URL, only 2 URLs were injected into the 
CrawlDb:
{noformat}
$> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
...
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
...
{noformat}
If the Injector job is run again with the same input, the output erroneously 
reports that one "new URL" was injected:
{noformat}
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
{noformat}
This is because the urls_merged counter counts unique items, while 
urls_injected does not, and the number shown as "new urls injected" is the 
difference between the two counters.

Adding a counter for the number of unique injected URLs makes it possible to 
report the correct count of newly injected URLs.
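
The arithmetic behind the two log excerpts can be reproduced with a small 
simulation. This is only a sketch of the counting logic described above, not 
the Injector code: the function names, the in-memory set standing in for 
CrawlDb, and the counter name urls_injected_unique are assumptions made for 
illustration.

```python
from collections import Counter


def simulate_inject(seed_urls, crawldb):
    """Simulate one inject run against an in-memory 'CrawlDb' (a set).

    Simplified model: every seed line is assumed to pass normalization
    and filtering, so nothing is rejected.
    """
    counters = Counter()
    # Counted once per seed line, so duplicates are counted repeatedly.
    counters["urls_injected"] = len(seed_urls)

    unique_seeds = set(seed_urls)
    # Hypothetical additional counter: one count per distinct seed URL.
    counters["urls_injected_unique"] = len(unique_seeds)
    # Counted once per distinct URL that is already in the CrawlDb.
    counters["urls_merged"] = len(unique_seeds & crawldb)

    crawldb |= unique_seeds  # merge the seeds into the CrawlDb
    return counters


def reported_new(counters):
    """Number shown as 'Total new urls injected' (the erroneous one)."""
    return counters["urls_injected"] - counters["urls_merged"]


def corrected_new(counters):
    """Same difference, but based on the unique-URL counter."""
    return counters["urls_injected_unique"] - counters["urls_merged"]


if __name__ == "__main__":
    seeds = [
        "https://www.example.org/page1.html",
        "https://www.example.org/page2.html",
        "https://www.example.org/page2.html",
    ]
    crawldb = set()
    for run in (1, 2):
        c = simulate_inject(seeds, crawldb)
        print(f"run {run}: reported={reported_new(c)} "
              f"corrected={corrected_new(c)} total={len(crawldb)}")
```

Under these assumptions the first run reports 3 new URLs and the second run 
reports 1, matching the log output above, while the difference based on the 
unique counter gives 2 and 0.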

  was:
Injector uses two counters: one for the total number of injected URLs, the 
other for the number of URLs "merged", that is already in CrawlDb. There is now 
counter for the number of unique URLs injected which may lead to wrong counts 
if the seed files contain duplicates:

Suppose the following seed file which contains a duplicated URL:

{noformat}
$> cat seeds_with_duplicates.txt 
https://www.example.org/page1.html
https://www.example.org/page2.html
https://www.example.org/page2.html

$> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
...
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
...
{noformat}

However, because of the duplicated URL, only 2 URLs were injected into the 
CrawlDb:

{noformat}
$> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
...
2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
...
{noformat}

If the Injector job is run again with the same input, we get the erroneous 
output, that still one "new URL" was injected:

{noformat}
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
{noformat}

This is because the urls_merged counter counts unique items, while url_injected 
does not.

Adding a counter to count the number of unique injected URLs will allow to get 
the correct count of newly injected URLs.


> Injector: count unique number of injected URLs
> ----------------------------------------------
>
>                 Key: NUTCH-3010
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3010
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> Injector uses two counters: one for the total number of injected URLs, the 
> other for the number of URLs "merged", that is, already in CrawlDb. There is 
> no counter for the number of unique URLs injected, which may lead to wrong 
> counts if the seed files contain duplicates.
> Consider the following seed file, which contains a duplicated URL:
> {noformat}
> $> cat seeds_with_duplicates.txt 
> https://www.example.org/page1.html
> https://www.example.org/page2.html
> https://www.example.org/page2.html
> $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
> ...
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
> 2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
> ...
> {noformat}
> However, because of the duplicated URL, only 2 URLs were injected into the 
> CrawlDb:
> {noformat}
> $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
> ...
> 2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
> ...
> {noformat}
> If the Injector job is run again with the same input, the output erroneously 
> reports that one "new URL" was injected:
> {noformat}
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
> 2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
> 2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
> {noformat}
> This is because the urls_merged counter counts unique items, while 
> urls_injected does not, and the number shown as "new urls injected" is the 
> difference between the two counters.
> Adding a counter for the number of unique injected URLs makes it possible to 
> report the correct count of newly injected URLs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
