org.apache.commons.io.FileUtils
I need a recursive file delete for cleaning up after a JUnit test. There is one in Commons IO (org.apache.commons.io): FileUtils.deleteDirectory(File directory). I wonder whether I should add org.apache.commons.io as a new jar in lib, arrange a libtest directory for jars used only by JUnit tests, or write yet another recursive delete and put it in src/test/org/apache/nutch/util. If org.apache.commons.io were included, then ndfs/DF.java could almost be replaced by FileSystemUtils.freeSpace(path), but DF also reports disk space capacity. Paul
Re: org.apache.commons.io.FileUtils
Paul Baclace wrote: I need a recursive file delete for cleaning up after a JUnit test. I just now spotted org.apache.nutch.fs.LocalFileSystem.delete(File f), which does what I want (recursive, local delete). So no need for Commons IO. Paul
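For reference, a recursive local delete of this kind is only a few lines of plain java.io code. A minimal sketch (a hypothetical standalone helper for illustration, not the actual LocalFileSystem implementation):

```java
import java.io.File;

public class RecursiveDelete {
    /** Deletes f and, if it is a directory, everything beneath it. */
    public static boolean delete(File f) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();   // may be null on I/O error
            if (children != null) {
                for (File child : children) {
                    delete(child);             // depth-first: empty the directory first
                }
            }
        }
        return f.delete();                     // now f is a plain file or an empty dir
    }
}
```

Suitable for a JUnit tearDown(): delete the test's scratch directory and assert the call returned true.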
Re: nutch downloads
Joshua, We have received your message. I'm only remotely involved with Nutch, so I'm prodding other committers to Nutch to please update the links to take advantage of the mirroring system in place. Please - someone reply back volunteering to correct this ASAP. Erik On Oct 11, 2005, at 6:51 PM, Joshua Slive wrote: Downloads from nutch are now by far the biggest consumers of bandwidth on apache.org. This is because nutch is not following apache download policies and is not using our mirroring system at all. Please correct this right away. See http://www.apache.org/dev/mirrors.html for more details. Please confirm that you have received this message and are taking action. Thanks. Joshua (Apache mirror guy)
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331877 ] Fuad Efendi commented on NUTCH-109: --- Ok, I'll do it tonight; I believe fetcher.server.delay means "wait for a response from the server, then throw a timeout exception". I can also execute 1000 threads, so we will have a fair comparison even with fetcher.server.delay=50 seconds (fair because, with so many threads, we will probably get 20 requests per second: 20 * 50 = 1000).

Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation --- Key: NUTCH-109 URL: http://issues.apache.org/jira/browse/NUTCH-109 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7, 0.8-dev, 0.6, 0.7.1 Environment: Nutch: Windows XP, J2SE 1.4.2_09 Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53 Reporter: Fuad Efendi Attachments: protocol-httpclient-innovation-0.1.0.zip

1. A TCP connection costs a lot, not only for Nutch and the end-point web servers, but also for intermediary network equipment.
2. The web server creates a client thread and hopes that Nutch really uses HTTP/1.1, or at least that Nutch sends Connection: close before closing the JVM socket with Socket.close() ...

I need to perform very objective tests, probably 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it seems the existing http plugin needs a few days... I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages). Please find attached a new plugin based on http://www.innovation.ch/java/HTTPClient/ Please note: class HttpFactory contains a cache of HTTPConnection objects; each object runs in its own thread; each object is absolutely thread-safe, so we can send multiple GET requests using a single instance:

private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);

I'll add more comments after finishing tests... -- This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Gal Nitzan wrote: Hi Andrzej, Yes, it seems like a good option. However, it is GPL, and I noticed in one of the posts that this license is no good for apache.org :). If you refer to the brics automaton library, it's BSD-licensed. What I mentioned in one of the posts is that the Innovation HTTPClient is LGPL, and hence not acceptable for apache.org. -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331897 ] Fuad Efendi commented on NUTCH-109: --- Oops... need to learn more! [protocol-httpclient] Http.java is a singleton; it uses MultiThreadedHttpConnectionManager and a single instance of HttpClient for all hosts and all threads.
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs constructed from large sets of string expressions) and it should be very fast to perform lookups, even in large, complex FSAs. Construction of the FSA can be time consuming and should probably be done offline, not at fetcher startup time, so that it is only performed once for a number of fetcher runs. Doug
RE: Q about exact match counts?
Anyone answer this question? I see in the Hits class that there's a boolean totalIsExact attribute, but this becomes false only when deduplication (per site) occurs during the search. And I see that underneath Nutch, Lucene will obtain the documents for only the top hits. But does Nutch/Lucene return exact match counts for every query? How does this scale to very large indexes? Thanks! DaveG -Original Message- From: Gaulin, Mark [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 05, 2005 2:04 PM To: nutch-dev@lucene.apache.org Subject: Q about exact match counts? Hi I was wondering if Nutch returns exact match counts or estimated match counts, or if this behavior is configurable. The reason I ask is that on a very large index I'd rather get the top 10 results right, and get them quickly, than spend time getting the total number of matching documents exactly right. (I assume that's why Google reports totals as "1 to 10 of about 1,000,000".) Is this something that Nutch/Lucene can do? Thanks Mark
Re: nutch downloads
Erik Hatcher wrote: Please - someone reply back volunteering to correct this ASAP. My bad. I'm fixing this right now. In 24 hours all Nutch downloads should be through the mirrors. Sorry! Doug
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Doug Cutting wrote: Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs constructed from large sets of string expressions) and it should be very fast to perform lookups, even in large, complex FSAs. Construction of the FSA can be time consuming and should probably be done offline, not at fetcher startup time, so that it is only performed once for a number of fetcher runs. Guess what... this library supports (de)serialization of automata, so they can be compiled once, and then just stored/loaded. -- Best regards, Andrzej Bialecki http://www.sigram.com Contact: info at sigram dot com
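To make the compile-once/store/load idea concrete, here is a toy version of the same pattern: a hand-rolled trie over literal strings (not the actual automaton library's API) that is built once "offline", serialized, and then loaded by the fetcher for O(length) lookups.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Toy deterministic automaton (a trie) over a fixed set of strings.
// Real FSA libraries also handle regexps and minimization; this only
// illustrates build-once, store, load, then fast lookups.
public class StringSetAutomaton implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<Character, StringSetAutomaton> next = new HashMap<>();
    private boolean accepting;

    /** Adds one string to the set (construction phase, done offline). */
    public void add(String s) {
        StringSetAutomaton node = this;
        for (char c : s.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new StringSetAutomaton());
        }
        node.accepting = true;
    }

    /** Lookup: one map hop per character, independent of set size. */
    public boolean accepts(String s) {
        StringSetAutomaton node = this;
        for (char c : s.toCharArray()) {
            node = node.next.get(c);
            if (node == null) return false;
        }
        return node.accepting;
    }

    /** Serialize the compiled automaton so fetcher runs can reuse it. */
    public void store(OutputStream out) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(this);
        }
    }

    public static StringSetAutomaton load(InputStream in)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(in)) {
            return (StringSetAutomaton) ois.readObject();
        }
    }
}
```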
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331913 ] Fuad Efendi commented on NUTCH-109: --- I was totally wrong and unfair: Have you seen Kelvin Tan's patch? You should take a look, it's in JIRA, and addresses some of the HTTP/1.1 issues that you are concerned about. http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html I need to perform tests with Kelvin Tan's patch too.
suspicious outlink count
202443 Pages consumed: 13 (at index 13). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.

If there is a maxoutlinks limit already specified in the XML config, why does Nutch bother counting anything over it?
keep count of selected url
Hi all, I was interested in keeping count of the number of times each URL is selected by a user. The problem is not how to know which page was clicked, but how to store each page/number-of-clicks tuple. What is the best way to store this information, and to let Nutch use it? Can you please give me some advice on how to implement this? Which parts of Nutch, and which classes, do I have to modify? What is the best way to do this? Thank you very much!! Menoz -- Free Software Enthusiast Debian Powered Linux User #332564 http://menoz.homelinux.org
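One generic starting point, sketched with nothing but the JDK (this is not an existing Nutch class; the names are made up for illustration): keep a thread-safe URL-to-count map in the search front end, increment it from the click handler, and flush it to a simple tab-separated file that a ranking step can read back later.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical click counter: thread-safe in-memory counts with a
// plain-text dump that another process can load and use for ranking.
public class ClickCounter {
    private final ConcurrentHashMap<String, AtomicLong> counts =
            new ConcurrentHashMap<>();

    /** Records one click on url; returns the new count. */
    public long click(String url) {
        return counts.computeIfAbsent(url, u -> new AtomicLong())
                     .incrementAndGet();
    }

    /** Current count for url, 0 if never clicked. */
    public long get(String url) {
        AtomicLong c = counts.get(url);
        return c == null ? 0 : c.get();
    }

    /** Dumps "url TAB count" lines, one per URL. */
    public void save(Writer w) throws IOException {
        for (Map.Entry<String, AtomicLong> e : counts.entrySet()) {
            w.write(e.getKey() + "\t" + e.getValue().get() + "\n");
        }
        w.flush();
    }
}
```

In a real deployment the click handler would be a redirect servlet in front of the search results, and the saved counts could be folded into a scoring field at index time.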
clustering strategies
I think it would be nice to have a few cluster strategies on the wiki. It seems there are at least three separate needs: CPU, storage and bandwidth, and I think the more those could be cleanly spread to different boxes, the better. I am imagining a breakdown that lists, by priority, how things should be broken out, so someone could look at the list and say: OK, I have three good boxes, I should make the best box do x, the second best do y, etc. There could also be case studies of how different folks did their own implementations and what their crawl/query times were like.

I have a small cluster (up to 15 boxes) and would like to start to play around and see how things go under different strategies. I also have about a million pages of local content, so I can hammer things pretty hard without even leaving my network. I know that may not match normal conditions, but it could hopefully remove a variable or two (network latency, slow sites) to keep things simple, at least to start. I think it is also a decent goal to be able to crawl/index my pages in a night (say eight hours), which would be around 35 pages/second. If that isn't a reasonable goal, I would like to hear why not.

For each strategy, we could have a set of confs describing how to set things up. I can picture a GUI that lists box roles (crawler, mapper, whatever) and the boxes available. Users could drag and drop their boxes onto roles, and confs could then be generated. I think it could make for rather easy design/implementation of clusters that could otherwise get rather complicated. I can do drag/drop and interpolate into templates in JavaScript, so I could envision a rather simple page. Maybe we could even store the cluster setup in XML, and have a script that takes the XML and draws the cluster. Then when people report slowness or the like, they could also post their cluster setup.

I think when users come to Nutch, they come with a set of boxes. I think it would be nice for them to see what has worked for such a set of boxes in the past and be able to easily implement such a strategy. Kind of the "one hour from download to spidering" vision. Just a few thoughts. Earl
[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
OpenSearchServlet outputs illegal xml characters Key: NUTCH-110 URL: http://issues.apache.org/jira/browse/NUTCH-110 Project: Nutch Type: Bug Components: searcher Versions: 0.7 Environment: linux, jdk 1.5 Reporter: [EMAIL PROTECTED] OpenSearchServlet does not check the text it outputs for illegal XML characters; depending on the search result, it's possible for OSS to output XML that is not well-formed. For example, if the text has the form-feed (FF) character in it -- i.e. the ASCII character at position 12 (decimal) -- the produced XML will show the FF character as '&#12;'. The character reference '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars.patch Attached patch runs all XML text through a check for bad XML characters. This patch is brutal, silently dropping illegal characters. The patch was made after hunting through xalan, the JDK, and Nutch itself for a method that would do the above filtering; I was unable to find any such method -- perhaps an oversight on my part?
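For reference, a filter of the kind the patch describes can be sketched against the XML 1.0 Char production (Tab, LF, CR, and most of the Basic Multilingual Plane). This is an illustrative standalone class, not the attached patch; note it also drops surrogate code units, so supplementary-plane characters (legal in XML) are lost too.

```java
// Sketch of a "drop illegal XML characters" filter per the XML 1.0
// Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD].
public class XmlCharFilter {
    /** True if c may appear in a well-formed XML 1.0 document.
        Simplification: surrogates (0xD800-0xDFFF) are rejected, so
        supplementary-plane characters encoded as pairs get dropped. */
    public static boolean isLegal(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    /** Returns s with every illegal character silently dropped. */
    public static String strip(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isLegal(c)) sb.append(c);
        }
        return sb.toString();
    }
}
```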
RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible in a <![CDATA[ ]]> section? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way to solve the problem without losing text, then your patch has my +1. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member, Modeling and Data Management Systems Section (387), Data Management Systems and Technologies Group, Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 12, 2005 5:19 PM To: nutch-dev@incubator.apache.org Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
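One caveat worth checking here: the XML 1.0 Char production applies inside CDATA sections too, so wrapping the text in CDATA should not make a form feed legal. A quick check with the JDK's built-in parser (assuming it enforces the Char production, as Xerces does):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class CdataCheck {
    /** Returns true if the given XML string parses as well-formed. */
    public static boolean wellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Form feed (U+000C) is outside the XML 1.0 Char production,
        // so it is rejected even inside a CDATA section.
        System.out.println(wellFormed("<a><![CDATA[x\u000Cy]]></a>"));
        System.out.println(wellFormed("<a><![CDATA[x y]]></a>"));
    }
}
```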
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331950 ] Fuad Efendi commented on NUTCH-109: --- Please see the attachment for more details. In order to be fair (protocol-http uses a single shared Socket per host), I tried to modify this line in the new plugin, HttpFactory.java:

private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 1);

It was 3 before. However, with http.clients.per.host=1 the new plugin stops in a deadlock. I tried a few times; it always stops after 3-4 minutes. So the results below are with http.clients.per.host=3 for the new plugin (as it was before), but the new plugin didn't pass the test, just a baseline.

New Test Results:
===
1. PROTOCOL-HTTP: 910,549,682 bytes (size on disk, WebDB+Segments), 1,201,908 milliseconds
2. PROTOCOL-HTTPCLIENT: 935,856,675 bytes, 1,261,064 milliseconds
999. PROTOCOL-HTTPCLIENT-INNOVATION: 936,152,532 bytes, 1,305,377 milliseconds

nutch-site.xml:
<property><name>fetcher.server.delay</name><value>0</value></property>
<property><name>fetcher.threads.per.host</name><value>20</value></property>
<property><name>http.timeout</name><value>3</value></property>
<property><name>http.content.limit</name><value>-1</value></property>

Client: IBM ThinkPad T42p, 2GHz, 2Gb, Windows XP, J2SE 1.4.2_09
Server: Suse Linux 9.3, Apache HTTPD 2.0.53-9.5, Worker
Command: bin\nutch7 crawl url3.txt -dir crawl005 -threads 20 -depth 6 (modified crawl without indexing)
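As a rough cross-check of the reported numbers, the byte totals and wall-clock times translate into comparable throughput figures (plain arithmetic only; note the byte totals are on-disk sizes of WebDB+Segments, not pure network traffic, so these are approximations):

```java
public class Throughput {
    /** KB per second, given total bytes and elapsed milliseconds. */
    static double kbPerSec(long bytes, long millis) {
        return bytes / 1024.0 / (millis / 1000.0);
    }

    public static void main(String[] args) {
        // Figures taken from the test results in the comment above.
        System.out.printf("protocol-http:                  %.0f KB/s%n",
                kbPerSec(910_549_682L, 1_201_908L));
        System.out.printf("protocol-httpclient:            %.0f KB/s%n",
                kbPerSec(935_856_675L, 1_261_064L));
        System.out.printf("protocol-httpclient-innovation: %.0f KB/s%n",
                kbPerSec(936_152_532L, 1_305_377L));
    }
}
```

All three land within about 5% of each other, which matches the comment's conclusion that the run was a baseline rather than a clear win.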
[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Fuad Efendi updated NUTCH-109: -- Attachment: test_results.txt