java.net.URL synchronization
Hello, Has anyone seen this: http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck ? Is this something that needs to be addressed in Nutch (and thus in Bixo, and thus in the common crawler project)? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
RE: java.net.URL synchronization
I checked java.net.URL. Yes, Nutch and BIXO implicitly use a synchronized Hashtable:

    public URL(String protocol, String host, int port, String file, URLStreamHandler handler) throws MalformedURLException {
        ...
        if (handler == null && (handler = getURLStreamHandler(protocol)) == null) {
            throw new MalformedURLException("unknown protocol: " + protocol);
        }
        ...

However, I don't think it hurts, because both architectures (at least BIXO) run a single thread in a single JVM to process, for instance, Outlinks. Only the Fetch part is multithreaded, but it doesn't use the URL class. I'm not sure about Nutch and how the fetch list is generated... If it is multithreaded, then a RegexUrlNormalizer shared between threads is an even bigger problem...

Fuad Efendi +1 416-993-2060 http://www.tokenizer.ca/ Data Mining, Vertical Search
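To make the constructor logic quoted above concrete: the handler lookup (getURLStreamHandler) only runs when no handler is passed in, so supplying one explicitly should skip that lookup. The sketch below is only an illustration under that assumption. The class name ExplicitHandlerSketch, the PARSE_ONLY handler, and the "crawl" scheme are all made up for the example, and whether this actually removes the contention described in the blog post has not been measured here. Note that recent JDKs mark the URL constructors as deprecated, so expect compiler warnings.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class ExplicitHandlerSketch {

    // Hypothetical handler used only for parsing; it refuses to open connections.
    static final URLStreamHandler PARSE_ONLY = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("parse-only handler: connections are not supported");
        }
    };

    // Because a handler is supplied, the constructor has no reason to call
    // getURLStreamHandler(protocol); parsing is delegated to PARSE_ONLY.
    static URL parse(String spec) throws MalformedURLException {
        return new URL(null, spec, PARSE_ONLY);
    }

    // Without an explicit handler the constructor must look one up and
    // throws MalformedURLException when none is registered for the scheme.
    static boolean defaultLookupFails(String spec) {
        try {
            new URL(spec);
            return false;
        } catch (MalformedURLException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String spec = "crawl://example.com/a/b?x=1"; // made-up scheme
        URL u = parse(spec);
        System.out.println(u.getProtocol() + " " + u.getHost() + " " + u.getPath());
        System.out.println("default lookup fails: " + defaultLookupFails(spec));
    }
}
```

A URL built this way can still be picked apart with getHost(), getPath() and so on, but it cannot be opened, which may or may not be acceptable depending on where in the crawler it is used.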
RE: java.net.URL synchronization
Tomcat uses its own, slightly different version of the URL class: http://tomcat.apache.org/tomcat-5.5-doc/catalina/docs/api/index.html

Its Javadoc describes it as follows: "URL is designed to provide public APIs for parsing and synthesizing Uniform Resource Locators as similar as possible to the APIs of java.net.URL, but without the ability to open a stream or connection. One of the consequences of this is that you can construct URLs for protocols for which a URLStreamHandler is not available (such as an https URL when JSSE is not installed)."

The synchronized stuff in java.net.URL is URLStreamHandler-related.

Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay
Build failed in Hudson: Nutch-trunk #1007
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1007/changes

Changes:

[kubes] Remove old jetty jars that should have been removed with NUTCH-768, upgrade to Hadoop 0.20.1

--
[...truncated 4727 lines...]

jar:

init:

init-plugin:

deps-jar:

compile:
    [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
    [echo] Compiling plugin: urlfilter-regex
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
    [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
    [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

jar:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
    [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
    [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init:
    [mkdir] Created dir: