[Nutch Wiki] Update of "FAQ" by TejasPatil

Apache Wiki Sat, 22 Jun 2013 08:47:09 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "FAQ" page has been changed by TejasPatil:
https://wiki.apache.org/nutch/FAQ?action=diff&rev1=137&rev2=138

Comment:
Added FAQ "What do the numbers in the fetcher log indicate"

   </description>
  </property>
  }}}
+ 
+ ==== What do the numbers in the fetcher log indicate ? ====
+ While fetching is in progress, the fetcher job logs statements like the 
following to indicate its progress:
+ {{{
+ 0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, 2632 7346 
kb/s, 989 URLs in 5 queue
+ }}}
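
A line in this format can be split into its numeric fields with a regular expression. The following is only a minimal sketch: the class name `FetcherLogLineSketch` and the method `parse` are hypothetical and not part of Nutch, and the pattern is derived solely from the sample line above, so other Nutch versions may log a different layout.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetcherLogLineSketch {

    // Pattern built from the sample line only; "queues?" accepts both
    // "queue" and "queues" at the end of the line.
    private static final Pattern STATUS = Pattern.compile(
        "(\\d+)/(\\d+) spinwaiting/active, (\\d+) pages, (\\d+) errors, "
        + "(\\d+(?:\\.\\d+)?) (\\d+(?:\\.\\d+)?) pages/s, (\\d+) (\\d+) kb/s, "
        + "(\\d+) URLs in (\\d+) queues?");

    /**
     * Returns the ten numeric fields in log order:
     * spinwaiting, active, pages, errors, overall pages/s, recent pages/s,
     * overall kb/s, recent kb/s, pending URLs, queues.
     * Returns null if the line does not match the pattern.
     */
    public static double[] parse(String line) {
        Matcher m = STATUS.matcher(line);
        if (!m.find()) {
            return null;
        }
        double[] fields = new double[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = Double.parseDouble(m.group(i + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "0/20 spinwaiting/active, 53852 pages, 7612 errors, "
            + "4.1 12 pages/s, 2632 7346 kb/s, 989 URLs in 5 queue";
        System.out.println(Arrays.toString(parse(sample)));
    }
}
```

Returning null for a non-matching line keeps the sketch short; a real log scraper would probably want to report which lines were skipped.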
+ 
+ Here is an explanation of each field:
+  * Fetcher threads try to get a fetch item (url) from a queue of all the 
fetch items (this queue is actually a queue of queues; for details see [0]). If 
a thread doesn't get a fetch item, it spinwaits for 500ms before polling the 
queue again. The 'spinwaiting' count tells us how many threads are in their 
"spinwaiting" state at a given instant.
+  * The 'active' count tells us how many threads are currently performing the 
activities related to the fetch of a fetch item. This involves sending requests 
to the server, getting the bytes from the server, parsing, storing, etc.
+  * 'pages' is the total number of pages fetched so far.
+  * 'errors' is the total number of errors seen so far.
+  * Next come the two pages/s values. The first is computed as {{{ 
((((float)pages)*10)/elapsed)/10.0 }}} and the second as {{{ 
(actualPages*10)/10.0 }}}. "actualPages" holds the count of pages processed in 
the last 5 secs (when the calculation is done). The first number can be seen as 
the overall speed for that execution. The second number can be regarded as the 
instantaneous speed, as it uses only the number of pages in the last 5 secs 
when this calculation is done.
+  * Next come the two kb/s values, which are computed as {{{ 
((((float)bytes)*8)/1024)/elapsed }}} and {{{ 
(((float)actualBytes)*8)/1024 }}}. As with pages/s, the first is the overall 
rate and the second covers only the most recent interval.
+  * 'URLs' indicates how many urls are pending and 'queues' indicates the 
number of queues present. Queues are formed on the basis of hostname or IP, 
depending on the configuration.
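
The rate arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the fetcher's actual code: the class and method names are hypothetical, and the `Math.round` calls are an assumption added here, since the expressions as quoted would otherwise not produce a one-decimal value such as 4.1. The factor of 8 followed by a division by 1024 suggests that kb/s means kilobits per second.

```java
public class FetcherRateSketch {

    /** Overall pages/s since the job started, kept to one decimal place. */
    public static double overallPagesPerSec(long pages, long elapsedSecs) {
        // Multiply by 10, round (assumed), then divide by 10.0.
        return Math.round(((float) pages * 10) / elapsedSecs) / 10.0;
    }

    /** Overall kb/s: bytes to bits (*8), to kilobits (/1024), per second. */
    public static long overallKbPerSec(long bytes, long elapsedSecs) {
        return Math.round((((float) bytes * 8) / 1024) / elapsedSecs);
    }

    /** The same conversion applied only to bytes seen in the latest interval. */
    public static long recentKb(long bytesInInterval) {
        return Math.round(((float) bytesInInterval * 8) / 1024);
    }

    public static void main(String[] args) {
        // Hypothetical elapsed time in seconds, chosen for illustration only.
        long elapsedSecs = 13135;
        System.out.println(overallPagesPerSec(53852, elapsedSecs) + " pages/s overall");
        System.out.println(overallKbPerSec(1280, 10) + " kb/s overall");
        System.out.println(recentKb(2048) + " kb in the latest interval");
    }
}
```

The sketch does not guard against an elapsed time of zero, which a real implementation would need to handle at job start.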
+ 
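
To make the 'URLs in N queues' figures concrete, here is a small sketch of grouping pending urls into per-host queues and counting both totals. It is an assumption-laden illustration only: `HostQueuesSketch` and its methods are hypothetical names, it always groups by hostname (the fetcher can also group by IP, depending on configuration), and it ignores politeness delays and everything else the real queues handle.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostQueuesSketch {

    /** Groups urls into one queue per hostname, preserving arrival order. */
    public static Map<String, Deque<String>> groupByHost(List<String> urls) {
        Map<String, Deque<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            // getHost() may return null for urls without a host part;
            // such urls simply share one queue under the null key here.
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        return queues;
    }

    /** Total number of pending urls across all queues. */
    public static int totalSize(Map<String, Deque<String>> queues) {
        int total = 0;
        for (Deque<String> queue : queues.values()) {
            total += queue.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Deque<String>> queues = groupByHost(Arrays.asList(
            "http://a.example/1", "http://a.example/2", "http://b.example/x"));
        System.out.println(totalSize(queues) + " URLs in " + queues.size() + " queues");
    }
}
```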
  === Updating ===
  ==== Isn't there redundant/wasteful duplication between nutch crawldb and solr 
index? ====
  Nutch maintains a crawldb (and linkdb, for that matter) of the urls it 
crawled, the fetch status, and the date. This data is maintained beyond fetch 
so that pages may be re-crawled after a re-crawling period. At the same 
time Solr maintains an inverted index of all the fetched pages. It'd seem more 
efficient if Nutch relied on the index instead of maintaining its own crawldb, 
so as not to store the same url twice. The problem we face here is what Nutch 
would do if we wished to change the Solr core to index to.
