Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-22 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:


+  // time the request
+  long fetchStart = System.currentTimeMillis();
   ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
+  long fetchTime = (System.currentTimeMillis() - fetchStart) / 1000;
   ProtocolStatus status = output.getStatus();
   Content content = output.getContent();
   ParseStatus pstatus = null;
+
+  // compute page download speed in kilobits per second
+  int kbps = Math.round(((float) content.getContent().length * 8 / 1024) / fetchTime);
+  LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
+//  fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));



Yes, that's more or less correct.



I *think* the updatedb step will keep any new keys/values in that MetaData
 MapWritable in the CrawlDatum while merging, right?


Right.



Then, HostDb would run through CrawlDb and aggregate (easy).
But:
What other data should be stored in CrawlDatum?


I can think of a few other useful metrics:

* the fetch status (failure / success) - this will be aggregated into a 
failure rate per host.


* number of outlinks - useful for identifying areas within a site with a 
high density of links


* content type

* size

Other aggregated metrics, which are derived from urls alone, could 
include the following:


* number of urls with query strings (may indicate spider traps or a 
database)


* total number of urls from a host (obvious :) ) - useful to limit the 
max. number of urls per host.


* max depth per host - again, to enforce limits on the max. depth of the 
crawl, where depth is defined as the maximum number of elements in the URL path.


* spam-related metrics (# of pages with known spam hrefs, keyword 
stuffing in meta-tags, # of pages with spam keywords, etc, etc).


Plus a range of arbitrary operator-specific tags / metrics, usually 
manually added:


* special fetching parameters (maybe authentication, or the overrides 
for crawl-delay or the number of threads per host)


* other parameters affecting the crawling priority or page ranking for 
all pages from that host


As you can see, the possibilities are nearly endless, and they revolve 
around the issues of crawl quality and performance.
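
On the per-page side, the commented-out line in the patch above already 
shows the mechanism for most of these: store named values in the CrawlDatum 
metadata so that updatedb carries them into the CrawlDb, where a host-level 
job can aggregate them. A minimal sketch - the metric key names ("kbps", 
"outlinks", "contentType") are just illustrative, not an agreed-upon 
convention:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  public class PageMetrics {
    /** Record a few per-page metrics in the CrawlDatum metadata (a MapWritable),
     *  so that they survive the updatedb merge and can later be aggregated per host. */
    public static void record(CrawlDatum datum, int kbps, int outlinks, String contentType) {
      datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));
      datum.getMetaData().put(new Text("outlinks"), new IntWritable(outlinks));
      datum.getMetaData().put(new Text("contentType"), new Text(contentType));
    }
  }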




How exactly should that data be aggregated? (mostly added up?)


Good question. Some of these are aggregates, others are running averages, 
and others still may form a historical log of the most recent values. We 
could specify a couple of standard operations, such as:


* SUM - initialize to zero, and add all values

* AVG - calculate arithmetic average from all values

* MAX / MIN - retain only the largest / smallest value

* LOG - keep a log of the last N values - a somewhat orthogonal concept to 
the above, i.e. it could be a valid option for any of the above operations.


This complicates the simplistic model of HostDb that we had :) and 
indicates that we may need a sort of schema descriptor for it.
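
To make that concrete, here is a rough sketch of what one entry in such a 
schema descriptor could drive. None of these class or method names exist in 
Nutch - it only illustrates the SUM / AVG / MAX / MIN operations plus an 
optional log of the last N values:

  import java.util.LinkedList;

  public class HostDbAggregator {

    public enum Op { SUM, AVG, MAX, MIN }

    private final Op op;
    private final int logSize;                      // > 0 keeps a log of the last N values
    private final LinkedList<Long> log = new LinkedList<Long>();
    private long sum = 0;
    private long count = 0;
    private long max = Long.MIN_VALUE;
    private long min = Long.MAX_VALUE;

    public HostDbAggregator(Op op, int logSize) {
      this.op = op;
      this.logSize = logSize;
    }

    /** Feed one per-page value for this host (e.g. a kbps measurement). */
    public void add(long value) {
      sum += value;
      count++;
      max = Math.max(max, value);
      min = Math.min(min, value);
      if (logSize > 0) {                            // LOG is orthogonal to the operation
        if (log.size() == logSize) log.removeFirst();
        log.addLast(value);
      }
    }

    /** The aggregated per-host result according to the configured operation. */
    public long result() {
      switch (op) {
        case SUM: return sum;
        case AVG: return count == 0 ? 0 : sum / count;
        case MAX: return max;
        default:  return min;
      }
    }
  }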



How exactly will this data then be accessed? (I need to be able to do 
host-based lookup)


Ah, this is an interesting issue :) All tools in Nutch work using 
URL-based keys, which means they operate on a per-page level. Now we 
need to join this with HostDb, which uses host names as keys. If you 
want to use HostDb as one of the inputs to a map-reduce job, then I 
described this problem here, and Owen O'Malley provided a solution:


https://issues.apache.org/jira/browse/HADOOP-2853

This would require significant changes in several Nutch tools, i.e. 
several m-r jobs would have to be restructured.


There is, however, a different approach that may be efficient enough: 
put the HostDb in the DistributedCache and read it directly as a 
MapFile (or BloomMapFile - see HADOOP-3063). I tried this, and for 
medium-size datasets the performance was acceptable.
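
A minimal sketch of that second approach, assuming the job submitter has 
added the HostDb directory (a MapFile keyed by host name) to the 
DistributedCache, and that values are stored as IntWritable, e.g. a 
per-host kbps figure. The class and the value type are assumptions, not an 
existing Nutch API:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class HostDbLookup {

    private final MapFile.Reader reader;

    public HostDbLookup(Configuration conf) throws IOException {
      // assumes the job submitter called DistributedCache.addCacheFile(hostDbUri, conf)
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      FileSystem localFs = FileSystem.getLocal(conf);
      reader = new MapFile.Reader(localFs, cached[0].toString(), conf);
    }

    /** Returns the stored metric for a host, or -1 if the host is not in HostDb. */
    public int getKbps(String host) throws IOException {
      IntWritable value = new IntWritable();
      return reader.get(new Text(host), value) != null ? value.get() : -1;
    }
  }

With a BloomMapFile the reader class changes, but the lookup pattern stays 
the same.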



My immediate interest is computing per-host download speed, so 
that Generator can take that into consideration when creating fetchlists.

I'm not even 100% sure that will have a positive effect on the
overall fetch speed, but I imagine I will have to load this HostDb data
in a Map, so I can change Generator and add something like this
in one of its map or reduce methods:

int kbps = hostdb.get(host);
if (kbps < N) { /* don't emit this host */ }


Right, that's what the API could look like using the second approach that 
I outlined above. You could even wrap it in a URLFilter plugin, so that 
you don't have to modify the Generator.
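
For completeness, a sketch of such a URLFilter plugin. HostDbLookup is the 
hypothetical helper sketched above and "hostdb.min.kbps" is a made-up 
configuration property; only the URLFilter contract itself (return the URL 
to keep it, null to drop it) is the real Nutch extension point:

  import java.net.URL;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class SlowHostURLFilter implements URLFilter {

    private Configuration conf;
    private HostDbLookup hostDb;    // hypothetical helper from the earlier sketch
    private int minKbps;

    public String filter(String urlString) {
      if (hostDb == null) return urlString;
      try {
        String host = new URL(urlString).getHost();
        int kbps = hostDb.getKbps(host);
        // hosts we know nothing about pass through; known slow hosts are dropped
        if (kbps >= 0 && kbps < minKbps) return null;
      } catch (Exception e) {
        // don't reject URLs just because the lookup failed
      }
      return urlString;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
      this.minKbps = conf.getInt("hostdb.min.kbps", 0);
      try {
        this.hostDb = new HostDbLookup(conf);
      } catch (Exception e) {
        this.hostDb = null;         // a real plugin should log and handle this
      }
    }

    public Configuration getConf() {
      return conf;
    }
  }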


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[Nutch Wiki] Update of GettingNutchRunningWithDebian by StevenHayles

2008-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by StevenHayles:
http://wiki.apache.org/nutch/GettingNutchRunningWithDebian

The comment on the change is:
Added installation of tomcat5.5-webapps; without it the home page is blank

--
   ''export JAVA_HOME''[[BR]]
  
  ==  Install Tomcat5.5 and Verify that it is functioning ==
-  ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin ''[[BR]]
+  ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin tomcat5.5-webapps''[[BR]]
  Verify Tomcat is running:[[BR]]
   ''# /etc/init.d/tomcat5.5 status''[[BR]]
   ''#Tomcat servlet engine is running with Java pid /var/lib/tomcat5.5/temp/tomcat5.5.pid''[[BR]]


[Nutch Wiki] Update of FetchCycleOverlap by OtisGospodnetic

2008-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

New page:
Without overlapping jobs, people running Nutch are likely not utilizing their 
clusters fully.  Thus, here is a recipe for overlapping jobs:

0. imagine a cluster with M max maps and R max reduces (say M=R=8)

1. run generate job with -numFetchers equal to M-2

2. run a fetcher job (uses M-2 maps and later all R reduces)

3. at this point there are 2 open map slots for something else to run, say the 
updatedb job for the previously fetched/parsed segment

4. when the updatedb job is done, the cluster can take on more jobs.  Any completed 
tasks (C) from the running fetcher job represent open work slots

5. start another fetch job.  This will be able to use only C tasks, but C will 
grow as the first job opens up more slots, eventually hitting M-2 open slots.

6. at some point, the fetch job from 2) above will complete, opening up 2 map 
slots, so updatedb can be run, even in the background, allowing the execution 
to go back to 1)

Because a URL is locked out for 7 days after the generate step includes it 
in a fetchlist, the above cycle needs to complete within 7 days.  In more 
detail:

Generate updates the CrawlDb so that urls selected
for the latest fetchlist become locked out for the next 7 days. This
means that you can happily generate multiple fetchlists, fetch them
out of order, and then do the DB updates out of order, as you see fit,
so long as you make it within the 7-day lock-out period.

This means that it's practical to limit numFetchers to a number
below your cluster capacity, because then you can run other maintenance
jobs in parallel with the currently running fetch job (such as updatedb
and generating the next fetchlists).