[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668135#action_12668135 ]

Otis Gospodnetic commented on NUTCH-628:

Thanks for the update. Sorry, I don't recall the details around crawldb/current... is referring to current bad?

Host database to keep track of host-level information

Key: NUTCH-628
URL: https://issues.apache.org/jira/browse/NUTCH-628
Project: Nutch
Issue Type: New Feature
Components: fetcher, generator
Reporter: Otis Gospodnetic
Fix For: 1.1
Attachments: domain_statistics_v2.patch, NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch

Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information about that in this DB. Segment/fetchlist Generator could then skip such hosts, so they don't slow down the fetch job. Another good use for such a DB is keeping track of various host scores, e.g. spam score.

From the recent thread on nutch-u...@lucene:

Otis asked:
While we are at it, how would one go about implementing this DB, as far as its structures go?

Andrzej said:
The easiest I can imagine is to use something like Text, MapWritable. This way you could store arbitrary information under arbitrary keys. I.e. a single database then could keep track of aggregate statistics at different levels, e.g. TLD, domain, host, ip range, etc. The basic set of statistics could consist of a few predefined gauges, totals and averages.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668141#action_12668141 ]

Doğacan Güney commented on NUTCH-628:

When someone thinks of crawldb, he would probably think of crawldb directory and not crawldb/current since current is pretty much an implementation detail (so that jobs that change crawldb can write their results to a temp directory under crawldb first then this dir can move to crawldb/current). So, it is not exactly bad to refer to current, it is just that it may be counter-intuitive for people, who may try to pass crawldb directory to DomainStatistics. Maybe we can add some documentation to command line? What do you think?
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668164#action_12668164 ]

Andrzej Bialecki commented on NUTCH-628:

I agree that the crawldb/current/ subdir is an implementation detail that should be hidden from users. All other tools take the name of the parent directory (crawldb/), so I see no reason why this tool should do it differently.
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668170#action_12668170 ]

Doğacan Güney commented on NUTCH-628:

This tool can also read crawl_fetch and other directories as well. And that is the problem. If you are reading crawl_fetch MapFile parts are right under there but for crawldb, MapFile parts are under crawldb/current. I guess we can add a special case for any path that ends in crawldb but this is not a complete fix either as someone else may rename his crawl database something else.
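One way to avoid a special case keyed on the directory name would be to probe for a current/ subdirectory and descend into it only when it exists. The following is a minimal sketch of that idea, not what DomainStatistics does: the class name InputDirResolver is hypothetical, and it uses java.io.File on the local filesystem as a stand-in for the Hadoop FileSystem API the real tool would need.

```java
import java.io.File;

/** Hypothetical helper; java.io.File stands in for Hadoop's FileSystem here. */
class InputDirResolver {
    /**
     * Returns dir/current when that subdirectory exists (the crawldb layout),
     * otherwise dir itself (e.g. a crawl_fetch directory holding the parts directly).
     * The decision does not depend on what the directory is called.
     */
    static File resolve(File dir) {
        File current = new File(dir, "current");
        return current.isDirectory() ? current : dir;
    }
}
```

With this approach a renamed crawl database still resolves correctly, at the cost of one extra existence check per input path.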
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667740#action_12667740 ]

Doğacan Güney commented on NUTCH-628:

DomainStatistics is committed as of rev. 738175. I am leaving this issue open. We can deal with it after 1.0.
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667929#action_12667929 ]

Hudson commented on NUTCH-628:

Integrated in Nutch-trunk #707 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/707/]) - DomainStatistics tool
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666477#action_12666477 ]

Doğacan Güney commented on NUTCH-628:

I don't know much about the patch here. Otis, do you have time to update and commit Domain Stats? If not, I will take a look.
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764 ]

Otis Gospodnetic commented on NUTCH-628:

Could you take it if you have time, please?
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666290#action_12666290 ]

Otis Gospodnetic commented on NUTCH-628:

I'm +1 on getting Domain Stats into 1.0. The patch will need a small update, I think.
Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[EMAIL PROTECTED] wrote:

> +        // time the request
> +        long fetchStart = System.currentTimeMillis();
>          ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
> +        long fetchTime = (System.currentTimeMillis() - fetchStart)/1000;
>          ProtocolStatus status = output.getStatus();
>          Content content = output.getContent();
>          ParseStatus pstatus = null;
> +
> +        // compute page download speed
> +        int kbps = Math.round(((((float)content.getContent().length)*8)/1024)/fetchTime);
> +        LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
> +//        fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));

Yes, that's more or less correct.

> I *think* the updatedb step will keep any new keys/values in that MetaData MapWritable in the CrawlDatum while merging, right?

Right.

> Then, HostDb would run through CrawlDb and aggregate (easy). But: What other data should be stored in CrawlDatum?

I can think of a few other useful metrics:

* the fetch status (failure / success) - this will be aggregated into a failure rate per host.
* number of outlinks - this comes useful when determining areas within a site with high density of links
* content type
* size

Other aggregated metrics, which are derived from urls alone, could include the following:

* number of urls with query strings (may indicate spider traps or a database)
* total number of urls from a host (obvious :) ) - useful to limit the max. number of urls per host.
* max depth per host - again, to enforce limits on the max. depth of the crawl, where depth is defined as the maximum number of elements in URL path.
* spam-related metrics (# of pages with known spam hrefs, keyword stuffing in meta-tags, # of pages with spam keywords, etc, etc).
Plus a range of arbitrary operator-specific tags / metrics, usually manually added:

* special fetching parameters (maybe authentication, or the overrides for crawl-delay or the number of threads per host)
* other parameters affecting the crawling priority or page ranking for all pages from that host

As you can see, possibilities are nearly endless, and revolve around the issues of crawl quality and performance.

> How exactly should that data be aggregated? (mostly added up?)

Good question. Some of these are aggregates, some others are running averages, some others still perhaps form a historical log of a number of most recent values. We could specify a couple of standard operations, such as:

* SUM - initialize to zero, and add all values
* AVG - calculate arithmetic average from all values
* MAX / MIN - retain only the largest / smallest value
* LOG - keep a log of the last N values - somewhat orthogonal concept to the above, i.e. it could be a valid option for any of the above operations.

This complicates the simplistic model of HostDB that we had :) and indicates that we may need a sort of a schema descriptor for it.

> How exactly will this data then be accessed? (I need to be able to do host-based lookup)

Ah, this is an interesting issue :) All tools in Nutch work using URL-based keys, which means they operate on a per-page level. Now we need to join this with HostDb, which uses host names as keys. If you want to use HostDb as one of the inputs to a map-reduce job, then I described this problem here, and Owen O'Malley provided a solution:

https://issues.apache.org/jira/browse/HADOOP-2853

This would require significant changes in several Nutch tools, i.e. several m-r jobs would have to be restructured.

There is however a different approach too, which may be efficient enough - put the HostDb in a DistributedCache, and read it directly as a MapFile (or BloomMapFile - see HADOOP-3063) - I tried this and for medium-size datasets the performance was acceptable.
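The standard operations listed in this message (SUM, AVG, MAX / MIN, and a LOG of the last N values) could be sketched roughly as follows. This is only an illustration under assumptions: AggOp and ValueLog are hypothetical names, values are plain doubles rather than Writables, and an empty input falls back to zero for every operation, which the thread does not specify.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical set of aggregation operations for a HostDb schema descriptor. */
enum AggOp {
    SUM, AVG, MAX, MIN;

    /** Applies this operation to all values; returns 0.0 for an empty list (an assumption). */
    double apply(List<Double> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            sum += v;
            max = Math.max(max, v);
            min = Math.min(min, v);
        }
        switch (this) {
            case SUM: return sum;
            case AVG: return sum / values.size();
            case MAX: return max;
            default:  return min;
        }
    }
}

/** LOG: keeps only the last N values, so any AggOp can be applied to a recent window. */
class ValueLog {
    private final int capacity;
    private final Deque<Double> values = new ArrayDeque<>();

    ValueLog(int capacity) {
        this.capacity = capacity;
    }

    void add(double v) {
        if (values.size() == capacity) {
            values.removeFirst(); // drop the oldest entry
        }
        values.addLast(v);
    }

    /** Returns the retained values, oldest first. */
    List<Double> snapshot() {
        return new ArrayList<>(values);
    }
}
```

Keeping LOG separate from the other four reflects the remark that it is orthogonal: for example, AggOp.AVG.apply(log.snapshot()) gives an average over the most recent N values only.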
> My immediate interest is computing per-host download speed, so that Generator can take that into consideration when creating fetchlists. I'm not even 100% sure if that will even have a positive effect on the overall fetch speed, but I imagine I will have to load this HostDb data in a Map, so I can change Generator and add something like this in one of its map or reduce methods:
>
>   int kbps = hostdb.get(host);
>   if (kbps < N) { don't emit this host }

Right, that's what the API could look like using the second approach that I outlined above. You could even wrap it in a URLFilter plugin, so that you don't have to modify the Generator.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
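The host-speed check discussed in this message could look roughly like the sketch below. It is an assumption-laden illustration: SlowHostFilter is a hypothetical class, the HostDb is represented by an in-memory Map from host name to kbps, and the filter(String) method returning null for a rejected URL only mirrors the convention of Nutch URL filters rather than implementing the actual plugin interface.

```java
import java.net.URI;
import java.util.Map;

/** Hypothetical filter that drops URLs whose host is known to be slower than a threshold. */
class SlowHostFilter {
    private final Map<String, Integer> kbpsByHost; // stand-in for a HostDb lookup
    private final int minKbps;

    SlowHostFilter(Map<String, Integer> kbpsByHost, int minKbps) {
        this.kbpsByHost = kbpsByHost;
        this.minKbps = minKbps;
    }

    /** Returns the URL unchanged if it may be emitted, or null to skip it. */
    String filter(String url) {
        String host = URI.create(url).getHost(); // expects an absolute URL with a scheme
        Integer kbps = (host == null) ? null : kbpsByHost.get(host);
        // Hosts with no recorded speed are let through, since nothing is known about them yet.
        return (kbps != null && kbps < minKbps) ? null : url;
    }
}
```

Letting unknown hosts pass is a design choice made for this sketch; otherwise newly discovered hosts would never be fetched and so could never acquire a measured speed.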
Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[EMAIL PROTECTED] wrote:

> Host extraction from URL makes sense, but there would be no host-level data in CrawlDatum. For example, one of the things I'd like to track is download speed. I don't want to track that on the per-URL level, but on a per-host level. I'd keep track of the d/l speed for each host in Fetcher2 and its FetcherInputQueue (that part is in JIRA already). So I'm not sure how I'd put the d/l speed for a host in the CrawlDatum

You really don't have to - see below. The queue monitoring stuff in Fetcher gives you only the current fetchlist metrics anyway, so they are incomplete - you need to calculate the actual averages from all urls from that host, and not just the current fetchlist. That's why it's better to do this using the information from CrawlDb and not just from the current segment.

So, let's assume for a moment that you don't track the d/l speed per host in Fetchers, or you discard this information, and assume that you only add the actually measured per-url download speed to crawl_fetch, as part of CrawlDatum.metaData. This metadata will be merged to the CrawlDb during the updatedb operation (replacing any older values if they exist).

>> Reduce: input: host, (hostStats1, hostStats2, ...)
>>         output: host, hostStats // aggregated
>
> Let's try with a concrete example. Imagine I just ran a fetch job and that fetched some number of URLs from www.foo.com and www.bar.com. foo.com aggregate d/l speed for that fetch run was 50 kbps. bar.com speed was 20 kbps. At the end of the run, I'd somehow store, say:
>
> www.foo.com  dl_speed:50  requests:100  timeouts:0
> www.bar.com  dl_speed:20  requests:90  timeouts:20

No, what you want to store is this:

www.example.com/page1.html  dl_speed:50  status:ok
www.example.com/page2.html  dl_speed:45  status:ok
www.example.com/page3.html  dl_speed:0  status:gone
...

> Then, I was thinking, something else (some HostDb MapReduce job) would go through this data stored under segment/2008./something/ and merge it into crawl/hostdb file.
> It sounds like you are saying, this should stick the data in CrawlDatum and let that be merged into crawl/crawldb but I don't see how I'd put the numbers from the above example into CrawlDatum without repeating them, so that each URL from www.foo.com has those 3 numbers above for www.foo.com stored in their crawldb entries.

See above - we store only per-url metrics in CrawlDb. Then the HostDb job aggregates the info from CrawlDb using host name (or domain name, or TLD) as the key.

>> PS. Could you please wrap your lines to 80 chars? I always have to re-wrap your emails when responding, otherwise they consist of very very long lines ..
>
> Sorry about that. I wrapped them manually here.

Thanks. Mail apps are no longer what they used to be ...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
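Since only per-URL metrics are stored, the per-URL download speed has to be measured robustly. The fetcher patch quoted earlier in this thread divides by the elapsed time in whole seconds, which is zero for any fetch that finishes in under a second. A minimal sketch of a millisecond-based calculation follows; the class name DownloadSpeed and the choice to clamp the elapsed time to at least 1 ms are assumptions, not part of the patch.

```java
/** Hypothetical helper for the per-URL dl_speed value stored with each CrawlDatum. */
class DownloadSpeed {
    /**
     * Returns the download speed in kilobits per second (1 kilobit = 1024 bits here,
     * matching the *8/1024 arithmetic in the quoted patch).
     * Elapsed time is kept in milliseconds and clamped to 1 ms to avoid a zero divisor.
     */
    static int kbps(long contentBytes, long elapsedMillis) {
        long millis = Math.max(1L, elapsedMillis);
        double kilobits = contentBytes * 8 / 1024.0;
        double seconds = millis / 1000.0;
        return (int) Math.round(kilobits / seconds);
    }
}
```

For example, 6400 bytes fetched in 1000 ms gives 50 kbps, the figure used for page1.html in the example above.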
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590724#action_12590724 ]

Andrzej Bialecki commented on NUTCH-628:

Not everything looks like a String ;) MapWritable is useful in situations where you need to (de)serialize non-String types. And most of the information in HostDb is numeric, so if we decided to use simple Metadata it would cause constant pointless conversion from/to Strings. Having said that, I'm for a specialized class (which can contain MapWritable as a placeholder for anything else than the specific built-in types of info).
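A specialized class with a few built-in numeric fields plus a catch-all map might be shaped like the sketch below. Everything here is hypothetical: the class name, the chosen fields, and the use of a Map of String to Double in place of MapWritable. The write/readFields pair only imitates the shape of Hadoop's Writable contract using java.io.DataOutput and DataInput, so that the sketch runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical host record: typed built-in statistics plus a placeholder map. */
class HostDatumSketch {
    long urls;        // built-in: number of known URLs for the host
    long failures;    // built-in: number of failed fetches
    float avgKbps;    // built-in: average download speed
    final Map<String, Double> extra = new TreeMap<>(); // anything without a built-in field

    /** Serializes the numeric fields in binary form, with no String conversion. */
    void write(DataOutput out) throws IOException {
        out.writeLong(urls);
        out.writeLong(failures);
        out.writeFloat(avgKbps);
        out.writeInt(extra.size());
        for (Map.Entry<String, Double> e : extra.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeDouble(e.getValue());
        }
    }

    /** Reads the fields back in the same order they were written. */
    void readFields(DataInput in) throws IOException {
        urls = in.readLong();
        failures = in.readLong();
        avgKbps = in.readFloat();
        extra.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            extra.put(key, in.readDouble());
        }
    }

    /** Convenience wrapper for in-memory round trips. */
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static HostDatumSketch fromBytes(byte[] bytes) {
        try {
            HostDatumSketch d = new HostDatumSketch();
            d.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return d;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```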
Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[EMAIL PROTECTED] wrote:

> I do understand that CrawlDb is the source to get all known URLs from, and from those URLs we can extract host names, domains, etc. (what DomainStatistics tool does), but I don't understand how you'd use CrawlDb as the source of per-host data, since CrawlDb does not have aggregate per-host data. Shouldn't that live in a separate file, a file that can be updated after every fetch run?

Well, as it happens, map-reduce is exceptionally good at collecting aggregate data :) This is a simple map-reduce job, where we do the following:

Map:    input: url, CrawlDatum from CrawlDb
        output: host, hostStats

Host is extracted from the current url, and hostStats is extracted from the data in this CrawlDatum

Reduce: input: host, (hostStats1, hostStats2, ...)
        output: host, hostStats // aggregated

PS. Could you please wrap your lines to 80 chars? I always have to re-wrap your emails when responding, otherwise they consist of very very long lines ..

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
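The map and reduce steps outlined in this message can be imitated in plain Java to show the data flow. This is a sketch under stated assumptions rather than a Hadoop job: HostStats and HostAggregationSketch are hypothetical names, each input record is a {url, dl_speed, status} triple standing in for a CrawlDatum, and the average speed is taken over successful fetches only.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-host aggregate built from per-URL records. */
class HostStats {
    long urls;      // number of URLs seen for the host
    long failures;  // URLs whose status was not "ok"
    long kbpsSum;   // sum of dl_speed over successful fetches

    static HostStats of(int kbps, boolean ok) {
        HostStats s = new HostStats();
        s.urls = 1;
        s.failures = ok ? 0 : 1;
        s.kbpsSum = ok ? kbps : 0;
        return s;
    }

    HostStats merge(HostStats other) {
        urls += other.urls;
        failures += other.failures;
        kbpsSum += other.kbpsSum;
        return this;
    }

    double avgKbps() {
        long okCount = urls - failures;
        return okCount == 0 ? 0.0 : (double) kbpsSum / okCount;
    }
}

class HostAggregationSketch {
    /** Map step: the output key is the host extracted from an absolute URL. */
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Reduce step: merge all partial stats that share one host key. */
    static HostStats reduce(List<HostStats> values) {
        HostStats total = new HostStats();
        for (HostStats v : values) {
            total.merge(v);
        }
        return total;
    }

    /** Runs map, group-by-key, and reduce in memory over {url, kbps, status} records. */
    static Map<String, HostStats> run(String[][] records) {
        Map<String, List<HostStats>> grouped = new TreeMap<>();
        for (String[] r : records) {
            HostStats s = HostStats.of(Integer.parseInt(r[1]), "ok".equals(r[2]));
            grouped.computeIfAbsent(hostOf(r[0]), h -> new ArrayList<>()).add(s);
        }
        Map<String, HostStats> result = new TreeMap<>();
        for (Map.Entry<String, List<HostStats>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

Swapping hostOf for a function that returns the domain or TLD would produce aggregates at those levels with the same reduce step.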
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590559#action_12590559 ]

Doğacan Güney commented on NUTCH-628:

+1 for extracting hostdb from crawldb... (also, do we really want to make hostdb just a map file of Text,MapWritable? IMHO, it would be better to design a proper HostDatum class with some statistics built-in, and then maybe a Metadata element [I guess it's just me but I hate MapWritable, I prefer Metadata :D])
Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
You are both in agreement, but I don't fully follow as I'm not intimately familiar with all the files and structures yet.

- Fetchers putting info about hosts into crawl_fetch for each fetched segment makes sense. I see Fetcher(2) uses FetcherOutputFormat, which has its own RecordWriter, which then writes CrawlDatum to HDFS. I do not see where exactly to plug per-host info writing in the current Fetcher2 flow. I think the thing to do would be to simply collect the data in memory and at the end of the fetch run, at the end of the fetch() method write it out. I just don't know how to write it out to HDFS without relying on Reduce to do the writing for me. Should it be something as simple as the following?

  // write a plain-text file with space-delimited values
  FileSystem hdfs = FileSystem.get(getConf());
  FSDataOutputStream dos = hdfs.create(path);
  dos.writeUTF(host + " " + requests + " " + timeouts);
  dos.close();

- I don't understand how per-host info can go in the CrawlDb. Isn't CrawlDb a database of all known *URLs*? Doesn't CrawlDb contain only CrawlDatum records, and doesn't each CrawlDatum hold data about a single URL? So if I wanted to record, say, the number of timeouts for a given host, how would I add that to a CrawlDatum, when a CrawlDatum is for a specific URL, and not host?

I do understand that CrawlDb is the source to get all known URLs from, and from those URLs we can extract host names, domains, etc. (what DomainStatistics tool does), but I don't understand how you'd use CrawlDb as the source of per-host data, since CrawlDb does not have aggregate per-host data. Shouldn't that live in a separate file, a file that can be updated after every fetch run?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Doğacan Güney (JIRA) [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Friday, April 18, 2008 2:40:21 PM
Subject: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590559#action_12590559 ]

Doğacan Güney commented on NUTCH-628:

+1 for extracting hostdb from crawldb... (also, do we really want to make hostdb just a map file of Text,MapWritable? IMHO, it would be better to design a proper HostDatum class with some statistics built-in, and then maybe a Metadata element [I guess it's just me but I hate MapWritable, I prefer Metadata :D])
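One detail about the plain-text host file sketched in Otis's message: DataOutputStream.writeUTF (which FSDataOutputStream inherits) writes a two-byte length prefix followed by modified UTF-8, so its output is not a plain text line. A minimal sketch of writing space-delimited lines through a character writer follows. HostStatsTextWriter is a hypothetical name, and the code targets any java.io.OutputStream; passing it the stream returned by a Hadoop FileSystem.create call is the assumed usage, not something shown here.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

/** Hypothetical writer for a plain-text, space-delimited per-host statistics file. */
class HostStatsTextWriter {
    /** Formats one record as "host requests timeouts". */
    static String formatLine(String host, long requests, long timeouts) {
        return host + " " + requests + " " + timeouts;
    }

    /** Writes each line as UTF-8 text terminated by a newline, with no length prefix. */
    static void writeLines(OutputStream out, String[] lines) {
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        for (String line : lines) {
            writer.print(line);
            writer.print('\n');
        }
        writer.flush(); // the caller remains responsible for closing the underlying stream
    }
}
```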