Hi,

I'm looking for the best way to restrict the number of pages crawled per host.
I have a list of hosts to crawl, let's say M hosts, and I would like to limit the crawl
on each host to MaxPages pages.
External links are turned off for the crawling process.

My own proposal is described in 3).
 
1) Using https://www.mail-archive.com/user@nutch.apache.org/msg10245.html
We know the size of the cluster (number of nodes) and the size of the list (M).
If we divide M by (number of nodes in the cluster * number of fetches per node), we get
the total number of rounds needed for the first-level crawl (K).
Then we multiply this by the number of levels needed per website (N = 2, 3, 4...),
depending on how deep we want to go into each specific website.
Let's say that to crawl the whole list we need K = 500 rounds, and we want to crawl
each website up to the 4th level (N = 4), so the total number of rounds is KN = 2000.
Combining this with generate.max.count = MaxPages, we get at most MaxPages * N pages
per host.
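For concreteness, a rough sketch of that round estimate (the variable names and the
example numbers are my own, picked so that K comes out to 500 as above):

    import math

    M = 100000             # hosts in the list (example value)
    nodes = 10             # nodes in the Hadoop cluster (example value)
    fetches_per_node = 20  # hosts one node can fetch in a single round (example value)
    N = 4                  # how deep we want to crawl each website

    K = math.ceil(M / (nodes * fetches_per_node))  # rounds to cover the whole list once
    total_rounds = K * N                           # rounds to reach level N everywhere
    print(K, total_rounds)                         # 500 2000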
Problem: the process has to run smoothly enough to guarantee that the whole list is
crawled within K rounds; any issues with the crawling process and/or the Hadoop cluster
break that guarantee.
 
2) The second approach is to use the hostdb:
https://www.mail-archive.com/user@nutch.apache.org/msg14330.html
Problem: this requires additional computation for the hostdb, plus a workaround with a
black list.
 
3) My own solution; it is a bit tricky.
It uses the scoring-depth plugin extension and the generate.min.score config.
 
That plugin sets the weight of each linked page to ParentWeight / (number of linked
pages). The initial weight is 1 by default.
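In other words (a minimal sketch of the propagation rule as I understand it, not the
plugin code itself):

    def child_weight(parent_weight, num_outlinks):
        # each outlink receives an equal share of the parent's weight
        return parent_weight / num_outlinks

    # a root page starts at 1.0; behind 2 outlinks a child gets 0.5,
    # and a grandchild behind 2 more outlinks gets 0.25
    print(child_weight(1.0, 2))                   # 0.5
    print(child_weight(child_weight(1.0, 2), 2))  # 0.25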
 
My idea is that we can estimate the maximum number of pages for a host.
To illustrate, there are several ways for a host to end up with pages of weight 1/4
(the trees below give 5, 5 and 7 pages respectively).
 
        1
   /   / \     \
  /   /   \     \ 
 /   /     \     \
1/4   1/4     1/4  1/4
        1
       / \
      /   \
     /     \
    1/2     1/2
            / \
          1/4 1/4
    
        1
       / \
      /   \
     /     \
    1/2     1/2
   / \     / \
  1/4 1/4 1/4 1/4

The last tree gives the maximum number of pages with weight 1/4 (3 levels, each summing
to 1). The total is 7 pages.
The idea behind it is that the maximum number of links is obtained with the deepest
tree, and the deepest tree can be built from the prime factors of the final weight.
 
For example, for a weight of 1/4 we take the prime factors of 4 = 2 * 2, and the total
number of pages is 1 + 1*2 + 1*2*2 = 7.
For a weight of 1/9: 1 + 1*3 + 1*3*3 = 13.
For a weight of 1/48: 1 + 1*2 + 1*2*2 + 1*2*2*2 + 1*2*2*2*2 + 1*2*2*2*2*3 = 79.
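A small helper that reproduces the numbers above (my own sketch: it factors the
denominator of the target weight by trial division and sums the level sizes, applying
the prime factors in ascending order as in the examples):

    def max_pages(denominator):
        # prime factorization by trial division
        factors = []
        d, n = 2, denominator
        while n > 1:
            while n % d == 0:
                factors.append(d)
                n //= d
            d += 1
        # level sizes of the deepest tree: 1, p1, p1*p2, ...
        total, level = 1, 1
        for p in factors:
            level *= p
            total += level
        return total

    print(max_pages(4))   # 1 + 2 + 4 = 7
    print(max_pages(9))   # 1 + 3 + 9 = 13
    print(max_pages(48))  # 1 + 2 + 4 + 8 + 16 + 48 = 79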

The calculator:
http://www.calculator.net/factoring-calculator.html?cvar=18&x=77&y=22
 
Problem: the score can be affected by other plugins.
 
Thanks.

Semyon.
