[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-28 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668135#action_12668135
 ] 

Otis Gospodnetic commented on NUTCH-628:


Thanks for the update.  Sorry, I don't recall the details around 
crawldb/current... is referring to current bad?


 Host database to keep track of host-level information
 -

 Key: NUTCH-628
 URL: https://issues.apache.org/jira/browse/NUTCH-628
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator
Reporter: Otis Gospodnetic
 Fix For: 1.1

 Attachments: domain_statistics_v2.patch, 
 NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch


 Nutch would benefit from having a DB with per-host/domain/TLD information.  
 For instance, Nutch could detect hosts that are timing out, store information 
 about that in this DB.  Segment/fetchlist Generator could then skip such 
 hosts, so they don't slow down the fetch job.  Another good use for such a DB 
 is keeping track of various host scores, e.g. spam score.
 From the recent thread on nutch-u...@lucene:
 Otis asked:
  While we are at it, how would one go about implementing this DB, as far as 
  its structures go?
 Andrzej said:
 The easiest I can imagine is to use something like Text, MapWritable.
 This way you could store arbitrary information under arbitrary keys.
 I.e. a single database then could keep track of aggregate statistics at
 different levels, e.g. TLD, domain, host, ip range, etc. The basic set
 of statistics could consist of a few predefined gauges, totals and averages.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668141#action_12668141
 ] 

Doğacan Güney commented on NUTCH-628:
-

When someone thinks of the crawldb, they will probably think of the crawldb 
directory and not crawldb/current, since current is pretty much an 
implementation detail (it exists so that jobs that change the crawldb can 
first write their results to a temp directory under crawldb, which is then 
moved to crawldb/current). 

So it is not exactly bad to refer to current; it is just that it may be 
counter-intuitive for people, who may try to pass the crawldb directory to 
DomainStatistics. Maybe we can add some documentation to the command line?

What do you think?




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668164#action_12668164
 ] 

Andrzej Bialecki  commented on NUTCH-628:
-

I agree that the crawldb/current/ subdir is an implementation detail that 
should be hidden from users. All other tools take the name of the parent 
directory (crawldb/), so I see no reason why this tool should do it differently.




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668170#action_12668170
 ] 

Doğacan Güney commented on NUTCH-628:
-

This tool can also read crawl_fetch and other directories, and that is the 
problem. If you are reading crawl_fetch, the MapFile parts are right under 
it, but for crawldb the MapFile parts are under crawldb/current. I guess we 
can add a special case for any path that ends in crawldb, but this is not a 
complete fix either, as someone may have named their crawl database 
something else.
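
One way around the naming problem, sketched here only as an illustration, is to look for a current/ subdirectory instead of matching the directory name. The class and method names below are hypothetical, and plain java.io.File stands in for the Hadoop FileSystem API that the real tool would use.

```java
import java.io.File;

public class InputDirResolver {
    /** Subdirectory that holds the live CrawlDb data, per the discussion above. */
    static final String CURRENT = "current";

    /** Returns dir/current if that subdirectory exists, otherwise dir itself. */
    static File resolve(File dir) {
        File current = new File(dir, CURRENT);
        return current.isDirectory() ? current : dir;
    }

    public static void main(String[] args) {
        File root = new File(System.getProperty("java.io.tmpdir"),
                "resolver-demo-" + System.nanoTime());

        // A crawl database that the user has renamed: it still has current/.
        File renamedDb = new File(root, "my_renamed_db");
        new File(renamedDb, CURRENT).mkdirs();

        // A segment directory: its parts sit directly underneath it.
        File crawlFetch = new File(root, "crawl_fetch");
        crawlFetch.mkdirs();

        System.out.println(resolve(renamedDb));   // resolves to the current/ subdirectory
        System.out.println(resolve(crawlFetch));  // resolves to crawl_fetch itself
    }
}
```

Because the check is on the directory layout rather than on the name, it keeps working when the crawl database is called something other than crawldb.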




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667740#action_12667740
 ] 

Doğacan Güney commented on NUTCH-628:
-

DomainStatistics is committed as of rev. 738175.

I am leaving this issue open. We can deal with it after 1.0.




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667929#action_12667929
 ] 

Hudson commented on NUTCH-628:
--

Integrated in Nutch-trunk #707 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/707/])
 - DomainStatistics tool





[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666477#action_12666477
 ] 

Doğacan Güney commented on NUTCH-628:
-

I don't know much about the patch here. Otis, do you have time to update and 
commit Domain Stats? If not, I will take a look.




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764
 ] 

Otis Gospodnetic commented on NUTCH-628:


Could you take it if you have time, please?




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-22 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666290#action_12666290
 ] 

Otis Gospodnetic commented on NUTCH-628:


I'm +1 on getting Domain Stats into 1.0.  The patch will need a small update, I 
think.




Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-22 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:


+  // time the request
+  long fetchStart = System.currentTimeMillis();
   ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
+  long fetchTime = (System.currentTimeMillis() - fetchStart)/1000;
   ProtocolStatus status = output.getStatus();
   Content content = output.getContent();
   ParseStatus pstatus = null;
+
+  // compute page download speed
+  int kbps = Math.round(((((float)content.getContent().length)*8)/1024)/fetchTime);
+  LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
+//  fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));



Yes, that's more or less correct.
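
One detail worth a sketch: the quoted patch divides the elapsed milliseconds by 1000 in integer arithmetic, so any fetch that completes in under a second yields a fetch time of 0 and a division by zero in the speed formula. The helper below is a hypothetical standalone version (the class name, method name and the 1 ms clamp are assumptions, not part of the patch) that keeps millisecond precision.

```java
public class DownloadSpeed {
    /**
     * Kilobits per second for a download of the given size and duration.
     * The elapsed time is clamped to at least 1 ms so that very fast
     * fetches cannot cause a division by zero (an assumption of this sketch).
     */
    static int kbps(long bytes, long elapsedMillis) {
        double kilobits = bytes * 8 / 1024.0;
        double seconds = Math.max(1, elapsedMillis) / 1000.0;
        return (int) Math.round(kilobits / seconds);
    }

    public static void main(String[] args) {
        System.out.println(kbps(128000, 1000)); // 1000
        System.out.println(kbps(64000, 500));   // 1000
        System.out.println(kbps(1024, 0));      // 8000 (elapsed clamped to 1 ms)
    }
}
```

In the Fetcher, the two arguments would come from content.getContent().length and the difference of the two System.currentTimeMillis() readings.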



I *think* the updatedb step will keep any new keys/values in that MetaData
 MapWritable in the CrawlDatum while merging, right?


Right.



Then, HostDb would run through CrawlDb and aggregate (easy).
But:
What other data should be stored in CrawlDatum?


I can think of a few other useful metrics:

* the fetch status (failure / success) - this will be aggregated into a 
failure rate per host.


* number of outlinks - this comes in useful when determining areas within a 
site with high density of links


* content type

* size

Other aggregated metrics, which are derived from urls alone, could 
include the following:


* number of urls with query strings (may indicate spider traps or a 
database)


* total number of urls from a host (obvious :) ) - useful to limit the 
max. number of urls per host.


* max depth per host - again, to enforce limits on the max. depth of the 
crawl, where depth is defined as the maximum number of elements in URL path.


* spam-related metrics (# of pages with known spam hrefs, keyword 
stuffing in meta-tags, # of pages with spam keywords, etc, etc).


Plus a range of arbitrary operator-specific tags / metrics, usually 
manually added:


* special fetching parameters (maybe authentication, or the overrides 
for crawl-delay or the number of threads per host)


* other parameters affecting the crawling priority or page ranking for 
all pages from that host


As you can see, possibilities are nearly endless, and revolve around the 
issues of crawl quality and performance.




How exactly should that data be aggregated? (mostly added up?)


Good question. Some of these are aggregates, some others are running 
averages, some others still perhaps form a historical log of a number of 
most recent values. We could specify a couple of standard operations, 
such as:


* SUM - initialize to zero, and add all values

* AVG - calculate arithmetic average from all values

* MAX / MIN - retain only the largest / smallest value

* LOG - keep a log of the last N values - somewhat orthogonal concept to 
the above, i.e. it could be a valid option for any of the above operations.


This complicates the simplistic model of HostDB that we had :) and 
indicates that we may need a sort of a schema descriptor for it.
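
To make the list of operations concrete, here is a minimal sketch of a per-metric accumulator. Everything in it is hypothetical (class name, enum, constructor arguments); it only illustrates how SUM, AVG, MAX and MIN could share one update path while LOG is an independent option that keeps the last N values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HostMetric {
    enum Op { SUM, AVG, MAX, MIN }

    private final Op op;
    private final int logSize;                 // 0 disables the LOG option
    private final Deque<Double> recent = new ArrayDeque<>();
    private double sum = 0;
    private double max = Double.NEGATIVE_INFINITY;
    private double min = Double.POSITIVE_INFINITY;
    private long count = 0;

    HostMetric(Op op, int logSize) {
        this.op = op;
        this.logSize = logSize;
    }

    /** Folds one observed value into the running state. */
    void add(double v) {
        sum += v;
        count++;
        max = Math.max(max, v);
        min = Math.min(min, v);
        if (logSize > 0) {
            if (recent.size() == logSize) {
                recent.removeFirst();          // drop the oldest value
            }
            recent.addLast(v);
        }
    }

    /** Current aggregate; MAX and MIN return +/- infinity before any add(). */
    double value() {
        switch (op) {
            case SUM: return sum;
            case AVG: return count == 0 ? 0 : sum / count;
            case MAX: return max;
            default:  return min;
        }
    }

    /** The last N values, oldest first (empty when LOG is disabled). */
    List<Double> log() {
        return new ArrayList<>(recent);
    }

    public static void main(String[] args) {
        HostMetric speed = new HostMetric(Op.AVG, 2);
        speed.add(50);
        speed.add(45);
        speed.add(0);
        System.out.println(speed.value()); // average of the three values
        System.out.println(speed.log());   // [45.0, 0.0]
    }
}
```

A schema descriptor, as suggested above, could then be little more than a mapping from metric name to an (operation, log size) pair.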



How exactly will this data then be accessed? (I need to be able to do 
host-based lookup)


Ah, this is an interesting issue :) All tools in Nutch work using 
URL-based keys, which means they operate on a per-page level. Now we 
need to join this with HostDb, which uses host names as keys. If you 
want to use HostDb as one of the inputs to a map-reduce job, then I 
described this problem here, and Owen O'Malley provided a solution:


https://issues.apache.org/jira/browse/HADOOP-2853

This would require significant changes in several Nutch tools, i.e. 
several m-r jobs would have to be restructured.


There is however a different approach too, which may be efficient enough 
- put the HostDb in a DistributedCache, and read it directly as a 
MapFile (or BloomMapFile - see HADOOP-3063) - I tried this and for 
medium-size datasets the performance was acceptable.



My immediate interest is computing per-host download speed, so 
that Generator can take that into consideration when creating fetchlists.

I'm not even 100% sure if that will even have a positive effect on the
overall fetch speed, but I imagine I will have to load this HostDb data
in a Map, so I can change Generator and add something like this
in one of its map or reduce methods:

int kbps = hostdb.get(host);
if (kbps < N) { don't emit this host }


Right, that's what the API could look like using the second approach that 
I outlined above. You could even wrap it in a URLFilter plugin, so that 
you don't have to modify the Generator.
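
A rough sketch of that filter idea follows. It is a plain Java class, not an actual Nutch plugin: the class name is hypothetical, an in-memory Map stands in for the HostDb lookup, and the "return null to reject" convention is borrowed from how URL filters are understood to behave in Nutch. Hosts with no recorded speed are kept, which is an assumption of this sketch.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class SlowHostFilter {
    private final Map<String, Integer> hostKbps; // stand-in for a HostDb lookup
    private final int minKbps;

    SlowHostFilter(Map<String, Integer> hostKbps, int minKbps) {
        this.hostKbps = hostKbps;
        this.minKbps = minKbps;
    }

    /** Returns the URL unchanged to keep it, or null to drop it. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return url; // malformed URLs are left for other filters to judge
        }
        Integer kbps = (host == null) ? null : hostKbps.get(host);
        if (kbps != null && kbps < minKbps) {
            return null; // known slow host: do not emit
        }
        return url;      // fast host, or no data recorded yet
    }

    public static void main(String[] args) {
        Map<String, Integer> db = new HashMap<>();
        db.put("www.foo.com", 50);
        db.put("www.bar.com", 20);
        SlowHostFilter f = new SlowHostFilter(db, 30);
        System.out.println(f.filter("http://www.foo.com/a")); // http://www.foo.com/a
        System.out.println(f.filter("http://www.bar.com/a")); // null
    }
}
```

Using Integer rather than int for the lookup matters here: unboxing the result of a Map lookup directly into an int throws a NullPointerException for hosts that are not in the database yet.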


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-20 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:



Host extraction from URL makes sense, but there would be no host-level
data in CrawlDatum.  For example, one of the things I'd like to track is
download speed.  I don't want to track that on the per-URL level, but on
a per-host level.  I'd keep track of the d/l speed for each host in Fetcher2
and its FetcherInputQueue (that part is in JIRA already). 


So I'm not sure how I'd put the d/l speed for a host in the CrawlDatum


You really don't have to - see below. The queue monitoring stuff in 
Fetcher gives you only the current fetchlist metrics anyway, so they are 
incomplete - you need to calculate the actual averages from all urls 
from that host, and not just the current fetchlist. That's why it's 
better to do this using the information from CrawlDb and not just from 
the current segment.


So, let's assume for a moment that you don't track the d/l speed per 
host in Fetchers, or you discard this information, and assume that you 
only add the actually measured per-url download speed to crawl_fetch, as 
part of CrawlDatum.metaData. This metadata will be merged to the CrawlDb 
during the updatedb operation (replacing any older values if they exist).






Reduce: input:  host, (hostStats1, hostStats2, ...)
        output: host, hostStats // aggregated



Let's try with a concrete example.
Imagine I just ran a fetch job and that fetched some number of URLs
from www.foo.com and www.bar.com. foo.com aggregate d/l speed for
that fetch run was 50 kbps.  bar.com speed was 20 kbps.

At the end of the run, I'd somehow store, say:
www.foo.com dl_speed:50 requests:100 timeouts:0
www.bar.com dl_speed:20 requests:90 timeouts:20


No, what you want to store is this:

www.example.com/page1.html dl_speed:50 status:ok
www.example.com/page2.html dl_speed:45 status:ok
www.example.com/page3.html dl_speed:0 status:gone
...



Then, I was thinking, something else (some HostDb MapReduce job)
would go through this data stored under segment/2008./something/
and merge it into crawl/hostdb file.

It sounds like you are saying, this should stick the data in CrawlDatum
and let that be merged into crawl/crawldb but I don't see how I'd put
the numbers from the above example into CrawlDatum without
repeating them, so that each URL from www.foo.com has those 3
numbers above for www.foo.com stored in their crawldb entries.


See above - we store only per-url metrics in CrawlDb. Then the HostDb 
job aggregates the info from CrawlDb using host name (or domain name, or 
TLD) as the key.
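
The following sketch works through the example records above: per-URL metrics go in, and per-key aggregates come out, with the key function (host or TLD) passed in as a parameter. All names are hypothetical, the URLs are given an http:// scheme so that java.net.URI can parse them, and averaging the speed over successful fetches only is an assumption of the sketch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class PerUrlToHostStats {
    /** One CrawlDb-style record: a URL plus the metrics measured for it. */
    static final class UrlMetric {
        final String url;
        final int dlSpeed;
        final boolean ok;
        UrlMetric(String url, int dlSpeed, boolean ok) {
            this.url = url;
            this.dlSpeed = dlSpeed;
            this.ok = ok;
        }
    }

    /** Aggregated statistics for one key (host, domain or TLD). */
    static final class Stats {
        int requests;
        int failures;
        long speedSum; // sum of dl_speed over successful fetches only
        double avgSpeed() {
            int okCount = requests - failures;
            return okCount == 0 ? 0 : (double) speedSum / okCount;
        }
    }

    static String host(String url) {
        return URI.create(url).getHost();
    }

    static String tld(String url) {
        String h = host(url);
        return h.substring(h.lastIndexOf('.') + 1);
    }

    /** Groups the per-URL records by keyOf(url) and folds them into Stats. */
    static Map<String, Stats> aggregate(List<UrlMetric> metrics,
                                        Function<String, String> keyOf) {
        Map<String, Stats> out = new TreeMap<>();
        for (UrlMetric m : metrics) {
            Stats s = out.computeIfAbsent(keyOf.apply(m.url), k -> new Stats());
            s.requests++;
            if (m.ok) {
                s.speedSum += m.dlSpeed;
            } else {
                s.failures++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<UrlMetric> crawlDb = Arrays.asList(
            new UrlMetric("http://www.example.com/page1.html", 50, true),
            new UrlMetric("http://www.example.com/page2.html", 45, true),
            new UrlMetric("http://www.example.com/page3.html", 0, false));
        Stats s = aggregate(crawlDb, PerUrlToHostStats::host).get("www.example.com");
        // prints "3 1 47.5"
        System.out.println(s.requests + " " + s.failures + " " + s.avgSpeed());
    }
}
```

Nothing host-level is ever stored per URL; the host figures exist only in the output of the aggregation.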



PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very
very long lines ..



Sorry about that.  I wrapped them manually here.


Thanks. Mail apps are no longer what they used to be ...

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590724#action_12590724
 ] 

Andrzej Bialecki  commented on NUTCH-628:
-

Not everything looks like a String ;) MapWritable is useful in situations where 
you need to (de)serialize non-String types. And most of the information in 
HostDb is numeric, so if we decided to use simple Metadata it would cause 
constant pointless conversion from/to Strings.

Having said that, I'm for a specialized class (which can contain MapWritable as 
a placeholder for anything other than the specific built-in types of info).
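
As a sketch of what such a specialized class might look like: typed numeric fields for the built-in statistics, plus a generic map for everything else. The class name, field names and methods are all hypothetical, and a java.util.Map stands in for MapWritable so the example runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

public class HostDatumSketch {
    // Built-in statistics are kept as numbers, so no String conversion is needed.
    private long urlCount;
    private long failureCount;
    private long kbpsSum; // sum over successful fetches

    // Placeholder for anything that is not one of the built-in types of info.
    private final Map<String, Object> extras = new HashMap<>();

    /** Folds the outcome of one fetch into the built-in statistics. */
    void recordFetch(boolean ok, int kbps) {
        urlCount++;
        if (ok) {
            kbpsSum += kbps;
        } else {
            failureCount++;
        }
    }

    double failureRate() {
        return urlCount == 0 ? 0 : (double) failureCount / urlCount;
    }

    double avgKbps() {
        long okCount = urlCount - failureCount;
        return okCount == 0 ? 0 : (double) kbpsSum / okCount;
    }

    void putExtra(String key, Object value) {
        extras.put(key, value);
    }

    Object getExtra(String key) {
        return extras.get(key);
    }

    public static void main(String[] args) {
        HostDatumSketch d = new HostDatumSketch();
        d.recordFetch(true, 50);
        d.recordFetch(true, 45);
        d.recordFetch(false, 0);
        d.putExtra("crawlDelay", 5); // operator-specific override
        System.out.println(d.avgKbps() + " " + d.getExtra("crawlDelay")); // 47.5 5
    }
}
```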




Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-19 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:


I do understand that CrawlDb is the source to get all known URLs
from, and from those URLs we can extract host names, domains, etc.
(what DomainStatistics tool does), but I don't understand how you'd
use CrawlDb as the source of per-host data, since CrawlDb does not
have aggregate per-host data.  Shouldn't that live in a separate
file, a file that can be updated after every fetch run?


Well, as it happens, map-reduce is exceptionally good at collecting 
aggregate data :) This is a simple map-reduce job, where we do the 
following:


Map:    input:  url, CrawlDatum from CrawlDb
        output: host, hostStats

Host is extracted from the current url, and hostStats is extracted from 
the data in this CrawlDatum.


Reduce: input:  host, (hostStats1, hostStats2, ...)
        output: host, hostStats // aggregated
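
The shape of that job can be imitated in plain Java to show the data flow. This is only a sketch: there is no Hadoop here, the run() method stands in for the framework's grouping of map output by key, and hostStats is reduced to a two-element array {urls, fetched} purely for brevity. All names are hypothetical.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HostStatsJob {
    /** Map step: one (url, fetched?) record in, one (host, stats) pair out. */
    static Map.Entry<String, int[]> map(String url, boolean fetched) {
        String host = URI.create(url).getHost();
        return new SimpleEntry<>(host, new int[] {1, fetched ? 1 : 0});
    }

    /** Reduce step: sums all partial stats that share the same host. */
    static int[] reduce(List<int[]> values) {
        int[] total = new int[2];
        for (int[] v : values) {
            total[0] += v[0];
            total[1] += v[1];
        }
        return total;
    }

    /** Stand-in for the framework: runs map, groups by key, then runs reduce. */
    static Map<String, int[]> run(Map<String, Boolean> crawlDb) {
        Map<String, List<int[]>> grouped = new TreeMap<>();
        for (Map.Entry<String, Boolean> e : crawlDb.entrySet()) {
            Map.Entry<String, int[]> out = map(e.getKey(), e.getValue());
            if (out.getKey() == null) {
                continue; // URL without a host: nothing to aggregate under
            }
            grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                   .add(out.getValue());
        }
        Map<String, int[]> result = new TreeMap<>();
        for (Map.Entry<String, List<int[]>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://www.foo.com/a", true);
        crawlDb.put("http://www.foo.com/b", false);
        crawlDb.put("http://www.bar.com/", true);
        for (Map.Entry<String, int[]> e : run(crawlDb).entrySet()) {
            System.out.println(e.getKey() + " urls:" + e.getValue()[0]
                    + " fetched:" + e.getValue()[1]);
        }
    }
}
```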



PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very
very long lines ..

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590559#action_12590559
 ] 

Doğacan Güney commented on NUTCH-628:
-

+1 for extracting hostdb from crawldb...

(also, do we really want to make hostdb just a map file of Text,MapWritable? 
IMHO, it would be better to design a proper HostDatum class with some 
statistics built-in, and then maybe a Metadata element [I guess it's just me 
but I hate MapWritable, I prefer Metadata:D])




Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-18 Thread ogjunk-nutch
You are both in agreement, but I don't fully follow as I'm not intimately 
familiar with all the files and structures yet.

- Fetcher-s putting info about hosts into crawl_fetch for each fetched segment 
makes sense.  I see Fetcher(2) uses FetcherOutputFormat, which has its own 
RecordWriter, which then writes CrawlDatum to HDFS.  I do not see where exactly 
to plug per-host info writing in the current Fetcher2 flow.  I think the thing 
to do would be to simply collect the data in memory and at the end of the fetch 
run, at the end of the fetch() method write it out.  I just don't know how 
to write it out to HDFS without relying on Reduce to do the writing for me.

Should it be something as simple as the following?

// write a plain-text file with space-delimited values
FileSystem hdfs = FileSystem.get(getConf());
FSDataOutputStream dos = hdfs.create(path);
dos.writeUTF(host + " " + requests + " " + timeouts);
dos.close();
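
A side note on the snippet above, with a small sketch: DataOutputStream.writeUTF() does not write plain text. It writes a 2-byte length followed by the string in modified UTF-8, so the resulting file would not be a clean space-delimited text file. Wrapping the stream in a Writer avoids that. The class and method names below are hypothetical, and a ByteArrayOutputStream stands in for the HDFS output stream (which, as far as I know, is itself a DataOutputStream subclass and can be wrapped the same way).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class HostLineWriter {
    /** Writes one space-delimited, newline-terminated line of plain UTF-8 text. */
    static void writeLine(OutputStream out, String host, int requests, int timeouts)
            throws IOException {
        Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        w.write(host + " " + requests + " " + timeouts + "\n");
        w.flush(); // push the encoded bytes out; closing is left to the caller
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        writeLine(plain, "www.foo.com", 100, 0);

        ByteArrayOutputStream prefixed = new ByteArrayOutputStream();
        new DataOutputStream(prefixed).writeUTF("www.foo.com 100 0\n");

        System.out.println(plain.size());    // 18: just the text
        System.out.println(prefixed.size()); // 20: the text plus a 2-byte length
    }
}
```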

- I don't understand how per-host info can go in the CrawlDb.  Isn't CrawlDb a 
database of all known *URLs*?  Doesn't CrawlDb contain only CrawlDatum records, 
and doesn't each CrawlDatum hold data about a single URL?  So if I wanted to 
record, say, the number of timeouts for a given host, how would I add that to a 
CrawlDatum, when a CrawlDatum is for a specific URL, and not host?

I do understand that CrawlDb is the source to get all known URLs from,
and from those URLs we can extract host names, domains, etc. (what
DomainStatistics tool does), but I don't understand how you'd use CrawlDb as 
the source of per-host data, since CrawlDb does not have aggregate per-host 
data.  Shouldn't that live in a separate file, a file that can be updated after 
every fetch run?


Thanks,
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

