[EMAIL PROTECTED] wrote:
> + // time the request
> + long fetchStart = System.currentTimeMillis();
>   ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
> + long fetchTime = (System.currentTimeMillis() - fetchStart) / 1000;
>   ProtocolStatus status = output.getStatus();
>   Content content = output.getContent();
>   ParseStatus pstatus = null;
> +
> + // compute page download speed
> + int kbps = Math.round(((((float)content.getContent().length)*8)/1024)/fetchTime);
> + LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
> +// fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));
Yes, that's more or less correct.
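One caveat with the snippet: fetchTime is truncated to whole seconds, so any sub-second fetch yields zero and the division blows up. A guarded sketch, keeping millisecond precision (plain self-contained Java; the method name and signature are illustrative, not part of the patch):

```java
class FetchSpeed {
    // Compute download speed in kbit/s from content length (bytes)
    // and elapsed time (milliseconds). Clamping to 1 ms avoids the
    // divide-by-zero that a whole-second fetchTime of 0 would cause.
    static int kbps(long contentLength, long elapsedMillis) {
        long ms = Math.max(elapsedMillis, 1);
        return Math.round((contentLength * 8f / 1024f) / (ms / 1000f));
    }

    public static void main(String[] args) {
        System.out.println(kbps(131072, 1000)); // 128 KB in 1 s -> 1024
    }
}
```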
> I *think* the updatedb step will keep any new keys/values in that MetaData
> MapWritable in the CrawlDatum while merging, right?
Right.
> Then, HostDb would run through CrawlDb and aggregate (easy).
> But: What other data should be stored in CrawlDatum?
I can think of a few other useful metrics:
* the fetch status (failure / success) - this will be aggregated into a
failure rate per host.
* number of outlinks - this comes useful when determining areas within a
site with high density of links
* content type
* size
Other aggregated metrics, which are derived from urls alone, could
include the following:
* number of urls with query strings (may indicate spider traps or a
database)
* total number of urls from a host (obvious :) ) - useful to limit the
max. number of urls per host.
* max depth per host - again, to enforce limits on the max. depth of the
crawl, where depth is defined as the maximum number of elements in the URL path.
* spam-related metrics (# of pages with known spam hrefs, keyword
stuffing in meta-tags, # of pages with spam keywords, etc, etc).
Plus a range of arbitrary operator-specific tags / metrics, usually
manually added:
* special fetching parameters (maybe authentication, or the overrides
for crawl-delay or the number of threads per host)
* other parameters affecting the crawling priority or page ranking for
all pages from that host
As you can see, possibilities are nearly endless, and revolve around the
issues of crawl quality and performance.
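Taken together, the metrics above suggest a per-host record roughly like the following. This is only a sketch -- the field names are illustrative, not an actual HostDb schema:

```java
// Hypothetical per-host record aggregating the metrics listed above.
class HostStats {
    long fetchSuccesses;
    long fetchFailures;
    long totalOutlinks;
    long totalBytes;
    long urlCount;
    long urlsWithQueryString;
    int maxDepth; // max number of elements in any URL path seen

    // Derived metric: failure rate per host, as discussed above.
    double failureRate() {
        long total = fetchSuccesses + fetchFailures;
        return total == 0 ? 0.0 : (double) fetchFailures / total;
    }
}
```

Derived values like the failure rate would be computed at aggregation time rather than stored, since they fall out of the raw counters.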
> How exactly should that data be aggregated? (mostly added up?)
Good question. Some of these are aggregates, some others are running
averages, some others still perhaps form a historical log of a number of
most recent values. We could specify a couple of standard operations,
such as:
* SUM - initialize to zero, and add all values
* AVG - calculate arithmetic average from all values
* MAX / MIN - retain only the largest / smallest value
* LOG - keep a log of the last N values - somewhat orthogonal concept to
the above, i.e. it could be a valid option for any of the above operations.
This complicates the simplistic model of HostDB that we had :) and
indicates that we may need a sort of a schema descriptor for it.
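The standard operations could be sketched like this (plain Java; the enum and method names are hypothetical, not Nutch API). LOG is omitted since, as noted, it keeps a history of values rather than folding them to a single number:

```java
// Sketch of the standard per-host aggregation operations discussed
// above: SUM, AVG, MAX, MIN.
class HostAggregator {
    enum Op { SUM, AVG, MAX, MIN }

    static long aggregate(Op op, long[] values) {
        if (values.length == 0) throw new IllegalArgumentException("no values");
        long acc = values[0];
        for (int i = 1; i < values.length; i++) {
            switch (op) {
                case SUM:
                case AVG: acc += values[i]; break; // AVG sums first, divides last
                case MAX: acc = Math.max(acc, values[i]); break;
                case MIN: acc = Math.min(acc, values[i]); break;
            }
        }
        return op == Op.AVG ? acc / values.length : acc;
    }
}
```

A schema descriptor for HostDb would then amount to a mapping from metric name to one of these operations (plus an optional LOG depth).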
> How exactly will this data then be accessed? (I need to be able to do
> host-based lookup)
Ah, this is an interesting issue :) All tools in Nutch work using
URL-based keys, which means they operate on a per-page level. Now we
need to join this with HostDb, which uses host names as keys. If you
want to use HostDb as one of the inputs to a map-reduce job, then I
described this problem here, and Owen O'Malley provided a solution:
https://issues.apache.org/jira/browse/HADOOP-2853
This would require significant changes in several Nutch tools, i.e.
several m-r jobs would have to be restructured.
There is however a different approach too, which may be efficient enough
- put the HostDb in a DistributedCache, and read it directly as a
MapFile (or BloomMapFile - see HADOOP-3063) - I tried this and for
medium-size datasets the performance was acceptable.
> My immediate interest is computing per-host download speed, so
> that Generator can take that into consideration when creating fetchlists.
> I'm not even 100% sure that this will have a positive effect on the
> overall fetch speed, but I imagine I will have to load this HostDb data
> in a Map, so I can change Generator and add something like this
> in one of its map or reduce methods:
>
> int kbps = hostdb.get(host);
> if (kbps < N) { don't emit this host }
Right, that's what the API could look like using the second approach that
I outlined above. You could even wrap it in a URLFilter plugin, so that
you don't have to modify the Generator.
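A URLFilter-style wrapper could look roughly like this. The in-memory Map stands in for the HostDb lookup (a MapFile or BloomMapFile via the DistributedCache, per the second approach above), the class and field names are made up for illustration, and it follows the URLFilter convention of returning null to reject a URL:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Sketch of a host-speed filter in the spirit of a Nutch URLFilter
// plugin: drop URLs whose host averaged below a kbps threshold.
class SlowHostFilter {
    private final Map<String, Integer> hostKbps; // stand-in for HostDb
    private final int minKbps;

    SlowHostFilter(Map<String, Integer> hostKbps, int minKbps) {
        this.hostKbps = hostKbps;
        this.minKbps = minKbps;
    }

    // Returns the url if acceptable, null to drop it.
    String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            Integer kbps = hostKbps.get(host);
            if (kbps != null && kbps < minKbps) return null; // too slow
            return url; // hosts with no data pass through
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }
}
```

Keeping the logic in a filter keeps the Generator untouched; the trade-off is that the filter needs the HostDb loaded on every task that applies it.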
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com