Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "FAQ" page has been changed by TejasPatil:
https://wiki.apache.org/nutch/FAQ?action=diff&rev1=137&rev2=138

Comment:
Added FAQ "What do the numbers in the fetcher log indicate"

  </description>
  </property>
  }}}
+ 
+ ==== What do the numbers in the fetcher log indicate? ====
+ While fetching is in progress, the fetcher job logs a status line like the following to indicate its progress:
+ {{{
+ 0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, 2632 7346 kb/s, 989 URLs in 5 queue
+ }}}
+ 
+ Here is an explanation of each field:
+  * Fetcher threads try to get a fetch item (URL) from a queue of all the fetch items (this queue is actually a queue of queues; for details see [0]). If a thread doesn't get a fetch item, it spinwaits for 500ms before polling the queue again. The 'spinwaiting' count tells us how many threads are in the "spinwaiting" state at a given instant.
+  * The 'active' count tells us how many threads are currently performing the activities related to the fetch of a fetch item. This involves sending requests to the server, getting the bytes from the server, parsing, storing, etc.
+  * 'pages' is the total number of pages fetched up to that point.
+  * 'errors' is the total number of errors seen.
+  * Next comes pages/s. The first number comes from {{{ ((((float)pages)*10)/elapsed)/10.0 }}} and the second from {{{ (actualPages*10)/10.0 }}}. "actualPages" holds the count of pages processed in the last 5 secs (when the calculation is done). The first number can be seen as the overall speed for that execution. The second can be regarded as the instantaneous speed, as it uses only the number of pages from the last 5 secs.
+  * Next come the kb/s values, which are computed from {{{ ((((float)bytes)*8)/1024)/elapsed }}} and {{{ (((float)actualBytes)*8)/1024 }}}. This is analogous to pages/s: the first is the overall rate and the second the instantaneous rate.
+  * 'URLs' indicates how many URLs are pending and 'queues' indicates the number of queues present.
Queues are formed on the basis of hostname or IP, depending on the configuration.
+ 
  === Updating ===
  ==== Isn't there redundant/wasteful duplication between nutch crawldb and solr index? ====
  Nutch maintains a crawldb (and linkdb, for that matter) of the URLs it crawled, the fetch status, and the date. This data is maintained beyond fetch so that pages may be re-crawled after a re-crawling period. At the same time, Solr maintains an inverted index of all the fetched pages. It would seem more efficient if Nutch relied on the index instead of maintaining its own crawldb, so as not to store the same URL twice. The problem we face here is what Nutch would do if we wished to change the Solr core which we index to.
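
Returning to the fetcher status line from the new FAQ entry: the pages/s and kb/s pairs can be reproduced with a small Java sketch built from the formulas quoted there. This is only an illustration, not the Fetcher source. The class name `FetcherRates`, its method names, and the use of `Math.round` (assumed here so that the `*10 ... /10.0` pattern yields one decimal place) are hypothetical; `pages`, `bytes`, `elapsed`, `actualPages` and `actualBytes` are the names used in the FAQ text.

```java
// Hypothetical helper, not part of Nutch: mirrors the formulas quoted in the FAQ entry.
final class FetcherRates {

    // Overall pages/s since the job started, kept to one decimal place.
    // Math.round is an assumption; the FAQ quotes only the arithmetic.
    static double overallPagesPerSec(long pages, long elapsed) {
        return Math.round((((float) pages) * 10) / elapsed) / 10.0;
    }

    // "Instantaneous" pages/s, derived from the pages seen in the last reporting window.
    static long instantPagesPerSec(float actualPages) {
        return Math.round((actualPages * 10) / 10.0);
    }

    // Overall kilobits per second: bytes -> bits (*8) -> kilobits (/1024) -> per second.
    static int overallKbPerSec(long bytes, long elapsed) {
        return Math.round(((((float) bytes) * 8) / 1024) / elapsed);
    }

    // "Instantaneous" kilobits per second from the bytes seen in the last window.
    static int instantKbPerSec(float actualBytes) {
        return Math.round((actualBytes * 8) / 1024);
    }

    // Formats the four rates the way they appear in the sample status line.
    static String format(long pages, long bytes, long elapsed,
                         float actualPages, float actualBytes) {
        return overallPagesPerSec(pages, elapsed) + " "
            + instantPagesPerSec(actualPages) + " pages/s, "
            + overallKbPerSec(bytes, elapsed) + " "
            + instantKbPerSec(actualBytes) + " kb/s";
    }
}
```

For example, with the made-up inputs `FetcherRates.format(410, 1280000, 100, 12f, 940288f)` the sketch produces `4.1 12 pages/s, 100 7346 kb/s`: the first number of each pair is averaged over the whole run, while the second reflects only the most recent window.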