[EMAIL PROTECTED] wrote:

+              // time the request
+              long fetchStart = System.currentTimeMillis();
               ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
+              long fetchTime = (System.currentTimeMillis() - fetchStart)/1000;
               ProtocolStatus status = output.getStatus();
               Content content = output.getContent();
               ParseStatus pstatus = null;
+
+              // compute page download speed
+              int kbps = Math.round(((((float)content.getContent().length)*8)/1024)/fetchTime);
+              LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
+//              fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));


Yes, that's more or less correct.


I *think* the updatedb step will keep any new keys/values in that MetaData
 MapWritable in the CrawlDatum while merging, right?

Right.
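
Just to make this concrete, reading such a value back out of a merged
CrawlDatum could look roughly like this - a minimal sketch, assuming the
"kbps" key and the Text / IntWritable types from the commented-out line
above (the helper class name is made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical helper: pull the "kbps" value recorded by the fetcher back
// out of a CrawlDatum after updatedb has merged the metadata.
public class KbpsReader {

  public static int getKbps(CrawlDatum datum) {
    IntWritable kbps = (IntWritable) datum.getMetaData().get(new Text("kbps"));
    // -1 means the fetcher never recorded a value for this page
    return kbps == null ? -1 : kbps.get();
  }
}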


Then, HostDb would run through CrawlDb and aggregate (easy).
But:
What other data should be stored in CrawlDatum?

I can think of a few other useful metrics (a rough sketch of how to record them follows the list):

* the fetch status (failure / success) - this will be aggregated into a failure rate per host.

* number of outlinks - this comes in useful when determining areas within a site with a high density of links

* content type

* size
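
A rough sketch of recording these in the fetcher, next to the "kbps" value
from your patch, could look like this - the key names and the helper class
are made up, and note that older Nutch versions use their own MapWritable
class instead of the Hadoop one:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.ProtocolStatus;

// Hypothetical helper: record the per-page metrics listed above in the
// CrawlDatum metadata. Key names are made up for illustration.
public class HostStatsRecorder {

  public static void record(CrawlDatum datum, ProtocolStatus status,
                            Content content, Parse parse) {
    MapWritable meta = datum.getMetaData();
    meta.put(new Text("_fetchStatus_"), new IntWritable(status.getCode()));
    meta.put(new Text("_contentType_"), new Text(content.getContentType()));
    meta.put(new Text("_size_"), new IntWritable(content.getContent().length));
    if (parse != null) {
      // the number of outlinks is only known once the page has been parsed
      meta.put(new Text("_outlinks_"),
               new IntWritable(parse.getData().getOutlinks().length));
    }
  }
}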

Other aggregated metrics, derived from the URLs alone, could include the following (a small sketch of the query-string and depth checks follows the list):

* number of urls with query strings (may indicate spider traps or a database)

* total number of urls from a host (obvious :) ) - useful to limit the max. number of urls per host.

* max depth per host - again, to enforce limits on the max. depth of the crawl, where depth is defined as the maximum number of elements in the URL path.

* spam-related metrics (# of pages with known spam hrefs, keyword stuffing in meta-tags, # of pages with spam keywords, etc, etc).
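
Two of these - the query-string check and the depth - are trivial to compute
from the URL itself; a minimal sketch (class name made up):

import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helpers for two of the URL-only metrics above: whether a URL
// carries a query string, and its depth (number of elements in the path).
public class UrlMetrics {

  public static boolean hasQueryString(String url) throws MalformedURLException {
    return new URL(url).getQuery() != null;
  }

  public static int depth(String url) throws MalformedURLException {
    String path = new URL(url).getPath();    // e.g. "/a/b/c.html"
    int depth = 0;
    for (String element : path.split("/")) {
      if (element.length() > 0) depth++;     // ignore empty segments
    }
    return depth;                            // "/a/b/c.html" -> 3
  }
}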

Plus a range of arbitrary operator-specific tags / metrics, usually manually added:

* special fetching parameters (maybe authentication, or the overrides for crawl-delay or the number of threads per host)

* other parameters affecting the crawling priority or page ranking for all pages from that host

As you can see, the possibilities are nearly endless, and they revolve around the issues of crawl quality and performance.


How exactly should that data be aggregated? (mostly added up?)

Good question. Some of these are simple aggregates, others are running averages, and others still could form a historical log of the most recent values. We could specify a couple of standard operations, such as:

* SUM - initialize to zero, and add all values

* AVG - calculate arithmetic average from all values

* MAX / MIN - retain only the largest / smallest value

* LOG - keep a log of the last N values - a somewhat orthogonal concept to the above, i.e. it could be a valid option for any of the above operations.

This complicates the simplistic model of HostDb that we had :) and indicates that we may need a sort of schema descriptor for it.
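
To give an idea of what I mean, such a descriptor could be as simple as a map
from metric key to aggregation operation - a minimal sketch, with all names
made up (nothing like this exists yet):

import java.util.HashMap;
import java.util.Map;

// Hypothetical HostDb schema descriptor: each metric key is paired with the
// operation used to aggregate its per-page values into the per-host value.
public class HostDbSchema {

  public enum Op { SUM, AVG, MAX, MIN, LOG }

  private final Map<String, Op> ops = new HashMap<String, Op>();

  public void define(String key, Op op) {
    ops.put(key, op);
  }

  public Op getOp(String key) {
    return ops.get(key);
  }

  public static HostDbSchema defaultSchema() {
    HostDbSchema schema = new HostDbSchema();
    schema.define("kbps", Op.AVG);            // running average download speed
    schema.define("_size_", Op.SUM);          // total bytes fetched from the host
    schema.define("_depth_", Op.MAX);         // deepest URL path seen so far
    schema.define("_fetchStatus_", Op.LOG);   // last N fetch status codes
    return schema;
  }
}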

How exactly will this data then be accessed? (I need to be able to do 
host-based lookup)

Ah, this is an interesting issue :) All tools in Nutch work using URL-based keys, which means they operate on a per-page level. Now we need to join this with HostDb, which uses host names as keys. If you want to use HostDb as one of the inputs to a map-reduce job, then I described this problem here, and Owen O'Malley provided a solution:

https://issues.apache.org/jira/browse/HADOOP-2853

This would require significant changes in several Nutch tools, i.e. several m-r jobs would have to be restructured.

There is, however, a different approach that may be efficient enough - put the HostDb in the DistributedCache and read it directly as a MapFile (or a BloomMapFile - see HADOOP-3063). I tried this, and for medium-size datasets the performance was acceptable.
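
As a rough illustration of that second approach (the on-disk key/value types
and the class name are assumptions; a BloomMapFile.Reader would be used the
same way):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

// Hypothetical wrapper around HostDb read directly as a MapFile, e.g. from a
// local path obtained via the DistributedCache. Assumes the data is stored as
// Text host name -> MapWritable of aggregated metrics.
public class HostDbLookup {

  private final MapFile.Reader reader;

  public HostDbLookup(String hostDbDir, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.getLocal(conf);
    reader = new MapFile.Reader(fs, hostDbDir, conf);
  }

  public MapWritable get(String host) throws Exception {
    MapWritable value = new MapWritable();
    return (MapWritable) reader.get(new Text(host), value);
  }

  public void close() throws Exception {
    reader.close();
  }
}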


My immediate interest is computing per-host download speed, so that Generator can take that into consideration when creating fetchlists.
I'm not 100% sure that it will even have a positive effect on the
overall fetch speed, but I imagine I will have to load this HostDb data
in a Map, so I can change Generator and add something like this
in one of its map or reduce methods:

int kbps = hostdb.get(host);
if (kbps < N) { don't emit this host }

Right, that's roughly what the API could look like using the second approach that I outlined above. You could even wrap it in a URLFilter plugin, so that you don't have to modify the Generator.
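
For example, such a filter could look more or less like this - just a sketch,
reusing the hypothetical HostDbLookup and "kbps" key from above (the URLFilter
extension point itself is real, the config property name is made up):

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.net.URLFilter;

// Hypothetical URLFilter that drops URLs from hosts whose measured download
// speed is below a threshold, so the Generator never sees them.
public class SlowHostURLFilter implements URLFilter {

  private Configuration conf;
  private HostDbLookup hostDb;   // hypothetical helper sketched earlier
  private int minKbps;

  public String filter(String urlString) {
    try {
      if (hostDb == null) return urlString;           // not configured - pass through
      String host = new URL(urlString).getHost();
      MapWritable metrics = hostDb.get(host);
      if (metrics == null) return urlString;          // unknown host - keep it
      IntWritable kbps = (IntWritable) metrics.get(new Text("kbps"));
      if (kbps != null && kbps.get() < minKbps) {
        return null;                                  // too slow - filter out
      }
      return urlString;
    } catch (Exception e) {
      return urlString;                               // on any error, keep the URL
    }
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    this.minKbps = conf.getInt("generate.min.kbps", 0);  // made-up property name
    // hostDb would be opened here from a configured HostDb path (omitted)
  }

  public Configuration getConf() {
    return conf;
  }
}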

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
