Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieving an item from the cache when the cache holds 100k domains. Thanks, Otis

--- Gal Nitzan [EMAIL PROTECTED] wrote: Hi Michael, at the moment I have about 3000 domains in my db. I didn't time the performance, but even 100k domains shouldn't have an impact, since the data is fetched only once from the database into the cache. There may be a small performance hit above 100k domains (depending on the number of elements defined in the xml file). After a few teething problems the plugin works nicely and I don't notice any impact. Regards, Gal

Michael Ji wrote: hi, what is the performance concern if the size of the domain list reaches 10,000? Michael Ji

--- Gal Nitzan (JIRA) [EMAIL PROTECTED] wrote:

[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------
    type: Improvement (was: New Feature)
    Environment: All Nutch versions (was: MapRed)
    Description: Hi, I have written a new plugin based on the URLFilter interface: urlfilter-db. The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but fetch only certain domains. The plugin uses a caching system (SwarmCache, easier to deploy than JCS) with a database on the back-end. In pseudocode:

        for each url:
            call filter(url)

        filter(url):
            get the domain name from the url
            call cache.get(domain)
            if not in cache, try the database
            if found in the database, cache it and return the url
            otherwise return null

    The plugin reads the cache size, jdbc driver, connection string, table to use, and domain field from nutch-site.xml.

Fixed some issues, cleaned up, and added a patch for Subversion.

New plugin urlfilter-db
-----------------------

    Key: NUTCH-100
    URL: http://issues.apache.org/jira/browse/NUTCH-100
    Project: Nutch
    Type: Improvement
    Components: fetcher
    Versions: 0.8-dev
    Environment: All Nutch versions
    Reporter: Gal Nitzan
    Priority: Trivial
    Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
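The cache-then-database lookup described in the plugin's pseudocode can be sketched in Java roughly as follows. This is a minimal sketch, not the plugin's actual code: the class and interface names are hypothetical, a plain HashMap stands in for SwarmCache, and the database is stubbed with an in-memory set instead of a JDBC connection.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the cache-then-database domain filter.
public class DbUrlFilter {
    // Stand-in for the JDBC-backed domain table.
    interface DomainStore { boolean contains(String domain); }

    private final Map<String, String> cache = new HashMap<>(); // stand-in for SwarmCache
    private final DomainStore db;

    public DbUrlFilter(DomainStore db) { this.db = db; }

    /** Returns the url if its domain is allowed, null otherwise (the URLFilter contract). */
    public String filter(String url) {
        String domain;
        try {
            domain = new URL(url).getHost();   // get the domain name from the url
        } catch (MalformedURLException e) {
            return null;
        }
        if (cache.containsKey(domain)) return url; // cache hit
        if (db.contains(domain)) {                 // not in cache: try the database
            cache.put(domain, domain);             // cache it for next time
            return url;
        }
        return null;                               // unknown domain: filter out
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>();
        allowed.add("example.com");
        DbUrlFilter f = new DbUrlFilter(allowed::contains);
        System.out.println(f.filter("http://example.com/page")); // prints the url back
        System.out.println(f.filter("http://other.org/page"));   // prints null
    }
}
```

Because every URL from an allowed domain warms the cache on first sight, each domain costs at most one database round-trip per run, which is why the domain count matters much less than it first appears.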
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
[EMAIL PROTECTED] wrote: Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache, when the cache has 100k domains in it.

Slightly off-topic, but I hope this is relevant to the original reason for creating this plugin... There is a BSD-licensed library, based on finite automata, that implements a large subset of regexps. It is reported to be scalable and very fast (the benchmarks are certainly impressive): http://www.brics.dk/~amoeller/automaton/ I suggest running some tests with 100k regexps and seeing if it survives. -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
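A stress test along the suggested lines might look like the sketch below. To stay self-contained it uses only java.util.regex (a backtracking engine); with the brics library one would instead build the machine with new RegExp(...).toAutomaton() and match with RunAutomaton.run(), which is where the claimed speedup comes from. The domain names are invented for illustration, and the domain count is kept at 10,000 to keep compile time modest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Baseline scalability sketch: combine many domain patterns into one
// alternation and time a match against it. With dk.brics.automaton the
// same alternation compiles to a DFA, so matching cost no longer grows
// with the number of alternatives.
public class RegexScaleTest {
    public static void main(String[] args) {
        List<String> domains = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            domains.add("site" + i + "\\.example\\.com"); // made-up domains
        }
        // One big alternation: (site0\.example\.com|site1\.example\.com|...)
        Pattern p = Pattern.compile("(" + String.join("|", domains) + ")");
        long t0 = System.nanoTime();
        boolean hit = p.matcher("site9999.example.com").matches(); // worst case: last alternative
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("matched=" + hit + " in " + ms + " ms");
    }
}
```

Running this at 100k alternatives shows the backtracking engine's per-match cost growing linearly with the list size, which is exactly the behavior the automaton-based library is claimed to avoid.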
reprocessing hanging tasks
Hi, I tried to understand the jobtracker code. Hmm, more than 1000 lines of code in just one class. :-( This makes understanding the code very difficult. Anyway, I'm missing a mechanism to reprocess hanging tasks. Maybe I just didn't find the code, but I did invest some time looking for it. As the Google paper describes, the original MapReduce reprocesses tasks that may still be running but are much slower than the other tasks because of some hardware failure. Since I notice that the task-tracker isn't that stable yet, I would really love to have such a reprocessing mechanism. Actually, I have seen that tasks are reprocessed in case the task-tracker crashes and does not return any reports anymore, or the task-tracker reports a task failure. But, for example, in case the network speed of a fetching map task is very, very slow, the job itself takes forever. I would suggest adding start time and finishing time to the task object and setting these values when the status changes. We can calculate the average time a task needs for processing based on these values. Then we have a configurable minimum of finished tasks before we start reprocessing tasks, for example 80% of tasks need to be finished. Furthermore we have a configurable threshold value: in case the processing time of a task exceeds threshold * average processing time, we just reprocess the task on another tasktracker. What do people think? Did I miss the section in the jobtracker where this is done, or are people interested in me submitting a patch implementing this mechanism? Stefan
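The heuristic proposed above can be sketched as follows. All names are hypothetical and this is not JobTracker code: once a minimum fraction of tasks has finished, any still-running task whose elapsed time exceeds threshold times the average completion time gets flagged for re-execution on another tasktracker.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed straggler heuristic (illustrative, not JobTracker API).
public class StragglerDetector {
    static final double MIN_FINISHED_FRACTION = 0.8; // e.g. 80% of tasks must be done
    static final double THRESHOLD = 3.0;             // slowdown factor before re-execution

    /**
     * @param finished durations of completed tasks, in ms
     * @param running  elapsed times of still-running tasks, in ms
     * @return indices into 'running' of tasks to re-execute elsewhere
     */
    public static List<Integer> tasksToReexecute(long[] finished, long[] running) {
        List<Integer> flagged = new ArrayList<>();
        int total = finished.length + running.length;
        // Don't re-execute anything until enough tasks have finished to trust the average.
        if (finished.length < MIN_FINISHED_FRACTION * total) return flagged;
        long sum = 0;
        for (long d : finished) sum += d;
        double avg = (double) sum / finished.length;
        for (int i = 0; i < running.length; i++) {
            if (running[i] > THRESHOLD * avg) flagged.add(i); // straggler
        }
        return flagged;
    }

    public static void main(String[] args) {
        long[] finished = {100, 110, 90, 105, 95, 100, 110, 90}; // avg = 100 ms
        long[] running = {120, 2000};                            // second task is 20x slower
        System.out.println(tasksToReexecute(finished, running)); // prints [1]
    }
}
```

The minimum-finished guard matters: early in a job the average is based on too few samples, so re-executing against it would mostly produce false positives.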
Re: reprocessing hanging tasks
Doug, I have definitely run into problems several times where task-trackers were sending heartbeat messages but weren't processing the job anymore. For example, no new pages were fetched and the pages/sec statistic got slower and slower. I personally think it makes more sense if the jobtracker decides whether a task is over the average processing time and needs to be re-executed. The last section of the Google paper covers this issue, and they noted performance improvements from re-executing tasks that run over a specific time. Maybe we misunderstand each other: I do not mean tasks that crash, I mean tasks that are 20 times slower on one machine than the other tasks on the other machines. Stefan

On 10.10.2005 at 20:16, Doug Cutting wrote: Stefan Groschupf wrote: Did I miss the section in the jobtracker where this is done, or are people interested in me submitting a patch implementing this mechanism?

This is mostly already implemented. The tasktracker fails tasks that do not update their status within a configurable timeout. Task status is updated each time a task reads an input, writes an output, or calls the Reporter.setStatus() method. The jobtracker will retry failed tasks up to four times. The mapred-based fetcher also should not hang. It will exit even when it has hung threads. So the task timeout should be set to the maximum amount of time that any single page should require to fetch and parse. By default it is set to 10 minutes. Doug --- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
Re: reprocessing hanging tasks
Stefan Groschupf wrote: Maybe we misunderstand each other: I do not mean tasks that crash, I mean tasks that are 20 times slower on one machine than the other tasks on the other machines.

Ah, I call that speculative re-execution. Nutch does not yet implement that. I don't think speculative re-execution of tasks would help much with fetching, since a fetch task that is slow on one machine will probably be slow on another. What would probably make the fetcher faster is to use Thread.kill() on fetcher threads which have exceeded a timeout, and then replace them with a new fetcher thread. Speculative re-execution is among the list of features we'd like to add, but it is not a high priority for me. Doug
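As a side note, Java has no Thread.kill(), and Thread.stop() is deprecated as unsafe; the closest portable mechanism is interruption. A minimal sketch of the timeout-and-replace idea, using an ExecutorService and Future.cancel(true) — the names and the fetch bodies are illustrative, not the actual Fetcher code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: run each fetch under a deadline; interrupt and abandon it on timeout
// so the pool thread becomes free for a replacement task.
public class FetcherWatchdog {
    public static String fetchWithTimeout(ExecutorService pool, Callable<String> fetch,
                                          long timeoutMs) throws Exception {
        Future<String> f = pool.submit(fetch);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);  // interrupts the hung fetcher thread
            return null;     // caller would re-queue the URL elsewhere
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // A fast fetch succeeds; a hung fetch is cancelled after its timeout.
        String ok = fetchWithTimeout(pool, () -> "page-content", 1000);
        String hung = fetchWithTimeout(pool, () -> { Thread.sleep(60_000); return "late"; }, 100);
        System.out.println(ok + " / " + hung); // prints: page-content / null
        pool.shutdownNow();
    }
}
```

The caveat is that interruption only works if the blocking call is interruptible; a socket read stuck in native code may need a socket timeout as well, which is the complementary fix discussed later in this thread.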
fetch speed issue
Another observation: when the same-size fetch list and the same number of threads were used, the fetcher started at different speeds in different runs, ranging from 200kb/s to 1200kb/s. I'm using DSL at home, so this variation in download speed could be due to variation in the DSL connection. If using a stable connection like T1 or fiber, I expect the fetcher should start at the same speed. Could someone using a T1 line or fiber connection verify that the fetcher always starts at a similar speed? Given a large enough number of threads, did your fetcher always reliably achieve the maximum speed, i.e. use the full bandwidth of the connection? Thanks, AJ
Re: Re[2]: what contributes to fetch slowing down
On 03:36:45 03/Oct, Michael wrote: 3mbit, 100 threads = 15 pages/sec; cpu is low during fetch, so it's a bandwidth limit.

Yes, cpu is low, and even memory is quite free. But with 10MB in/out I cannot obtain good results (and I do not parse the results, I simply fetch them). If I use 100 threads, I can download pages at 500KB/s for about 5 seconds, but after that the download rate falls to 0. If I set 20 threads, I can download at 200KB/s for 4-5 minutes, and the rate initially seems very stable. But after these few minutes the rate starts to get lower and lower, and tends to reach zero pages/s. I cannot understand what the problem could be. Whatever thread count I choose, the rate _always_ decreases until it has reached 1-2 pages/s. I've tried 2 different machines, but the problem is always the same. Can you please give me some advice? Thank you, Daniele -- Free Software Enthusiast Debian Powered Linux User #332564 http://menoz.homelinux.org
Re: what contributes to fetch slowing down
On 09:59:45 03/Oct, Doug Cutting wrote: I suspect threads are hanging, probably in the parser,

I tried not parsing, but without good results. If I use 100 threads, I can download pages at 500KB/s for about 5 seconds, but after that the download rate falls to 0. If I set 20 threads, I can download at 200KB/s for 4-5 minutes, and the rate initially seems very stable. But after these few minutes the rate starts to get lower and lower, and tends to reach zero pages/s.

but sometimes TCP connections get stuck too.

If this is the problem, how can I fix it? And how can I recognize it? Thanks! -z -- Free Software Enthusiast Debian Powered Linux User #332564 http://menoz.homelinux.org
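One common guard against stuck TCP connections (offered as a general technique, not as what the Nutch http plugin of that era actually did) is to set explicit connect and read timeouts, so a dead peer raises SocketTimeoutException instead of blocking a fetcher thread forever. A minimal sketch:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: open a connection with explicit timeouts so a stalled peer
// fails fast instead of hanging a thread indefinitely.
public class TimeoutFetch {
    public static HttpURLConnection open(String url, int timeoutMs) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutMs); // fail fast if the TCP handshake hangs
        conn.setReadTimeout(timeoutMs);    // fail fast if the server stops sending
        return conn;
    }

    public static void main(String[] args) throws Exception {
        // openConnection() does not touch the network yet; this only configures it.
        HttpURLConnection conn = open("http://example.com/", 5000);
        System.out.println(conn.getConnectTimeout() + " " + conn.getReadTimeout()); // prints: 5000 5000
    }
}
```

Recognizing the problem is then easy: stuck connections show up as a steady stream of SocketTimeoutExceptions in the fetcher log, instead of threads silently going idle.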
[jira] Created: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing Tuning
Nutch - Fetcher - HTTP - Performance Testing Tuning
---------------------------------------------------

    Key: NUTCH-109
    URL: http://issues.apache.org/jira/browse/NUTCH-109
    Project: Nutch
    Type: Improvement
    Components: fetcher
    Versions: 0.7, 0.6, 0.7.1, 0.8-dev
    Environment: Nutch: Windows XP, J2SE 1.4.2_09; Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
    Reporter: Fuad Efendi

1. A TCP connection costs a lot, not only for Nutch and the end-point web servers, but also for intermediary network equipment.
2. The web server creates a client thread and hopes that Nutch really uses HTTP/1.1, or at least that Nutch sends Connection: close before closing the JVM socket with Socket.close() ...

I need to perform very objective tests, probably 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it seems the existing http plugin needs a few days... I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages). Please find attached a new plugin based on http://www.innovation.ch/java/HTTPClient/ Please note: class HttpFactory contains a cache of HTTPConnection objects; each object is thread-safe, so we can send multiple GET requests using a single instance: private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3); I'll add more comments after finishing the tests...
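The idea behind the HttpFactory cache can be sketched as follows. The types here are illustrative stand-ins for the HTTPClient library's HTTPConnection; only the per-host bounding logic is the point: keep a small pool of reusable connection slots per host so TCP setup is paid once, not per request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch: bound the number of concurrent clients per host, in the spirit of
// the http.clients.per.host setting mentioned above.
public class PerHostPool {
    static final int CLIENTS_PER_HOST = 3; // cf. http.clients.per.host default

    // One semaphore per host; each permit represents a reusable connection slot.
    private final Map<String, Semaphore> perHost = new ConcurrentHashMap<>();

    /** Try to claim a connection slot for this host; false if all are busy. */
    public boolean tryAcquire(String host) {
        return perHost.computeIfAbsent(host, h -> new Semaphore(CLIENTS_PER_HOST))
                      .tryAcquire();
    }

    /** Return a slot so another thread can reuse the connection. */
    public void release(String host) {
        perHost.get(host).release();
    }

    public static void main(String[] args) {
        PerHostPool pool = new PerHostPool();
        int granted = 0;
        for (int i = 0; i < 5; i++) {
            if (pool.tryAcquire("example.com")) granted++;
        }
        System.out.println(granted); // prints: 3 (only CLIENTS_PER_HOST slots exist)
    }
}
```

Bounding per-host concurrency has a second benefit beyond connection reuse: it keeps the crawler polite to the end-point server, which is the same concern raised in point 1 above.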
[jira] Updated: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing Tuning
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
------------------------------
    Attachment: protocol-httpclient-innovation-0.1.0.zip

New plugin; you may play with commenting out this code in HttpFactory:

    static {
        CookieModule.setCookiePolicyHandler(null);
    }

Nutch - Fetcher - HTTP - Performance Testing Tuning
---------------------------------------------------

    Key: NUTCH-109
    URL: http://issues.apache.org/jira/browse/NUTCH-109
    Project: Nutch
    Type: Improvement
    Components: fetcher
    Versions: 0.7, 0.6, 0.7.1, 0.8-dev
    Environment: Nutch: Windows XP, J2SE 1.4.2_09; Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
    Reporter: Fuad Efendi
    Attachments: protocol-httpclient-innovation-0.1.0.zip
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
On 10/6/05, Dawid Weiss [EMAIL PROTECTED] wrote:

That would be great. I already looked at the code base in the plug-in directory, and it seems you use this call to get the clustering results: controller.query("lingo-nmf-km-3", "pseudo-query", requestParams); am I right? Anyway, I want the type of algorithm used for clustering to be picked up from the xml file; it should be easy to do so.

Yes, it is quite easy -- the controller above can be instantiated from an XML file or from a Beanshell script using a local controller component (not in the Nutch codebase yet). There are unit tests of that controller in Carrot2 CVS, but it has been added recently so I didn't have the time to integrate it in a solid working example.

Hi Dawid, I was able to hack the Clusterer class and made it work for STC, here is my hack ;-)

    // Clustering component here.
    LocalComponentFactory stcFactory = new LocalComponentFactoryBase() {
        public LocalComponent getInstance() {
            HashMap defaults = new HashMap();
            // These are adjustment settings for the clustering algorithm...
            // You can play with them, but the values below are our 'best guess'
            // settings that we acquired experimentally.
            defaults.put("lsi.threshold.clusterAssignment", "0.150");
            defaults.put("lsi.threshold.candidateCluster", "0.775");
            // TODO: this should eventually be replaced with documents from Nutch
            // tagged with a language tag. There is no need to determine the
            // language of a document again.
            return new STCLocalFilterComponent();
        }
    };
    controller.addLocalComponentFactory("filter.lingo-old", stcFactory);

But I have two questions: 1. AHC doesn't have any local filter that implements LocalFilterComponent, RawClusterProducer and so on; how can I achieve that? From a very superficial point of view it seems that nobody uses the AHC class. 2. How do the stopwords and stemmers work for STC?

There is one potential problem that I see -- Nutch plugins require explicit JAR references.
If you want to switch between algorithms you'll need to either put all Carrot2 JARs in the descriptor, put them on the CLASSPATH before Nutch starts, or do some other trickery with class loading.

I just put the stc.jar in the lib directory; I will optimize it later ;-). Cheers, R.

I won't be able to help you until next week, but then I'll try to find some time to prepare an example of how the scriptable controller is used (or look at the unit tests; the component is called carrot2-local-controller). Dawid