[EMAIL PROTECTED] wrote:

+              // time the request
+              long fetchStart = System.currentTimeMillis();
               ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
+              long fetchTime = (System.currentTimeMillis() - fetchStart)/1000;
               ProtocolStatus status = output.getStatus();
               Content content = output.getContent();
               ParseStatus pstatus = null;
+
+              // compute page download speed
+              int kbps = Math.round(((((float)content.getContent().length)*8)/1024)/fetchTime);
+              LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
+//              fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));


Yes, that's more or less correct.


I *think* the updatedb step will keep any new keys/values in that MetaData
 MapWritable in the CrawlDatum while merging, right?

Right.
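
Just to make this concrete, reading such a value back out of a merged
CrawlDatum could look roughly like this - a minimal sketch, assuming the
"kbps" key and the Text / IntWritable types from the commented-out line
above (the helper class name is made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical helper: pull the "kbps" value recorded by the fetcher back
// out of a CrawlDatum after updatedb has merged the metadata.
public class KbpsReader {

  public static int getKbps(CrawlDatum datum) {
    IntWritable kbps = (IntWritable) datum.getMetaData().get(new Text("kbps"));
    // -1 means the fetcher never recorded a value for this page
    return kbps == null ? -1 : kbps.get();
  }
}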


Then, HostDb would run through CrawlDb and aggregate (easy).
But:
What other data should be stored in CrawlDatum?

I can think of a few other useful metrics (a rough sketch of how to record them follows the list):

* the fetch status (failure / success) - this will be aggregated into a failure rate per host.

* number of outlinks - this comes in useful when determining areas within a site with a high density of links

* content type

* size
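
A rough sketch of recording these in the fetcher, next to the "kbps" value
from your patch, could look like this - the key names and the helper class
are made up, and note that older Nutch versions use their own MapWritable
class instead of the Hadoop one:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.ProtocolStatus;

// Hypothetical helper: record the per-page metrics listed above in the
// CrawlDatum metadata. Key names are made up for illustration.
public class HostStatsRecorder {

  public static void record(CrawlDatum datum, ProtocolStatus status,
                            Content content, Parse parse) {
    MapWritable meta = datum.getMetaData();
    meta.put(new Text("_fetchStatus_"), new IntWritable(status.getCode()));
    meta.put(new Text("_contentType_"), new Text(content.getContentType()));
    meta.put(new Text("_size_"), new IntWritable(content.getContent().length));
    if (parse != null) {
      // the number of outlinks is only known once the page has been parsed
      meta.put(new Text("_outlinks_"),
               new IntWritable(parse.getData().getOutlinks().length));
    }
  }
}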

Other aggregated metrics, derived from the URLs alone, could include the following (a small sketch of the query-string and depth checks follows the list):

* number of urls with query strings (may indicate spider traps or a database)

* total number of urls from a host (obvious :) ) - useful to limit the max. number of urls per host.

* max depth per host - again, to enforce limits on the max. depth of the crawl, where depth is defined as the maximum number of elements in the URL path.

* spam-related metrics (# of pages with known spam hrefs, keyword stuffing in meta-tags, # of pages with spam keywords, etc, etc).
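
Two of these - the query-string check and the depth - are trivial to compute
from the URL itself; a minimal sketch (class name made up):

import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helpers for two of the URL-only metrics above: whether a URL
// carries a query string, and its depth (number of elements in the path).
public class UrlMetrics {

  public static boolean hasQueryString(String url) throws MalformedURLException {
    return new URL(url).getQuery() != null;
  }

  public static int depth(String url) throws MalformedURLException {
    String path = new URL(url).getPath();    // e.g. "/a/b/c.html"
    int depth = 0;
    for (String element : path.split("/")) {
      if (element.length() > 0) depth++;     // ignore empty segments
    }
    return depth;                            // "/a/b/c.html" -> 3
  }
}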

Plus a range of arbitrary operator-specific tags / metrics, usually manually added:

* special fetching parameters (maybe authentication, or the overrides for crawl-delay or the number of threads per host)

* other parameters affecting the crawling priority or page ranking for all pages from that host

As you can see, the possibilities are nearly endless, and they revolve around the issues of crawl quality and performance.


How exactly should that data be aggregated? (mostly added up?)

Good question. Some of these are simple aggregates, others are running averages, and others still could form a historical log of the most recent values. We could specify a couple of standard operations, such as:

* SUM - initialize to zero, and add all values

* AVG - calculate arithmetic average from all values

* MAX / MIN - retain only the largest / smallest value

* LOG - keep a log of the last N values - a somewhat orthogonal concept to the above, i.e. it could be a valid option for any of the above operations.

This complicates the simplistic model of HostDb that we had :) and indicates that we may need a sort of schema descriptor for it.
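
To give an idea of what I mean, such a descriptor could be as simple as a map
from metric key to aggregation operation - a minimal sketch, with all names
made up (nothing like this exists yet):

import java.util.HashMap;
import java.util.Map;

// Hypothetical HostDb schema descriptor: each metric key is paired with the
// operation used to aggregate its per-page values into the per-host value.
public class HostDbSchema {

  public enum Op { SUM, AVG, MAX, MIN, LOG }

  private final Map<String, Op> ops = new HashMap<String, Op>();

  public void define(String key, Op op) {
    ops.put(key, op);
  }

  public Op getOp(String key) {
    return ops.get(key);
  }

  public static HostDbSchema defaultSchema() {
    HostDbSchema schema = new HostDbSchema();
    schema.define("kbps", Op.AVG);            // running average download speed
    schema.define("_size_", Op.SUM);          // total bytes fetched from the host
    schema.define("_depth_", Op.MAX);         // deepest URL path seen so far
    schema.define("_fetchStatus_", Op.LOG);   // last N fetch status codes
    return schema;
  }
}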

How exactly will this data then be accessed? (I need to be able to do 
host-based lookup)

Ah, this is an interesting issue :) All tools in Nutch work using URL-based keys, which means they operate on a per-page level. Now we need to join this with HostDb, which uses host names as keys. If you want to use HostDb as one of the inputs to a map-reduce job, then I described this problem here, and Owen O'Malley provided a solution:

https://issues.apache.org/jira/browse/HADOOP-2853

This would require significant changes in several Nutch tools, i.e. several m-r jobs would have to be restructured.

There is, however, a different approach that may be efficient enough - put the HostDb in the DistributedCache and read it directly as a MapFile (or a BloomMapFile - see HADOOP-3063). I tried this, and for medium-size datasets the performance was acceptable.
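
As a rough illustration of that second approach (the on-disk key/value types
and the class name are assumptions; a BloomMapFile.Reader would be used the
same way):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

// Hypothetical wrapper around HostDb read directly as a MapFile, e.g. from a
// local path obtained via the DistributedCache. Assumes the data is stored as
// Text host name -> MapWritable of aggregated metrics.
public class HostDbLookup {

  private final MapFile.Reader reader;

  public HostDbLookup(String hostDbDir, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.getLocal(conf);
    reader = new MapFile.Reader(fs, hostDbDir, conf);
  }

  public MapWritable get(String host) throws Exception {
    MapWritable value = new MapWritable();
    return (MapWritable) reader.get(new Text(host), value);
  }

  public void close() throws Exception {
    reader.close();
  }
}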


My immediate interest is computing per-host download speed, so that Generator can take that into consideration when creating fetchlists.
I'm not 100% sure that it will even have a positive effect on the
overall fetch speed, but I imagine I will have to load this HostDb data
in a Map, so I can change Generator and add something like this
in one of its map or reduce methods:

int kbps = hostdb.get(host);
if (kbps < N) { don't emit this host }

Right, that's roughly what the API could look like using the second approach that I outlined above. You could even wrap it in a URLFilter plugin, so that you don't have to modify the Generator.
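
For example, such a filter could look more or less like this - just a sketch,
reusing the hypothetical HostDbLookup and "kbps" key from above (the URLFilter
extension point itself is real, the config property name is made up):

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.net.URLFilter;

// Hypothetical URLFilter that drops URLs from hosts whose measured download
// speed is below a threshold, so the Generator never sees them.
public class SlowHostURLFilter implements URLFilter {

  private Configuration conf;
  private HostDbLookup hostDb;   // hypothetical helper sketched earlier
  private int minKbps;

  public String filter(String urlString) {
    try {
      if (hostDb == null) return urlString;           // not configured - pass through
      String host = new URL(urlString).getHost();
      MapWritable metrics = hostDb.get(host);
      if (metrics == null) return urlString;          // unknown host - keep it
      IntWritable kbps = (IntWritable) metrics.get(new Text("kbps"));
      if (kbps != null && kbps.get() < minKbps) {
        return null;                                  // too slow - filter out
      }
      return urlString;
    } catch (Exception e) {
      return urlString;                               // on any error, keep the URL
    }
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    this.minKbps = conf.getInt("generate.min.kbps", 0);  // made-up property name
    // hostDb would be opened here from a configured HostDb path (omitted)
  }

  public Configuration getConf() {
    return conf;
  }
}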

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
