Hello, I tried to modify Nutch to pass through a web proxy, as advised
below, but it still doesn't work.
I get the following error:
2007-02-15 17:04:58,285 INFO fetcher.Fetcher - fetching http://lucene.apache.org/nutch/
2007-02-15 17:04:58,300 INFO http.Http - http.proxy.host = ncproxy1
2007-02-15 17:04:58,300 INFO http.Http - http.proxy.port = 8080
2007-02-15 17:04:58,300 INFO http.Http - http.timeout = 10000
2007-02-15 17:04:58,300 INFO http.Http - http.content.limit = 65536
2007-02-15 17:04:58,300 INFO http.Http - http.agent = NutchCVS/Nutch-0.9-dev (C:\pbapps\nutch-nightly\conf\nutch-default.xml)
2007-02-15 17:04:58,300 INFO http.Http - protocol.plugin.check.blocking = true
2007-02-15 17:04:58,300 INFO http.Http - protocol.plugin.check.robots = true
2007-02-15 17:04:58,300 INFO http.Http - fetcher.server.delay = 1000
2007-02-15 17:04:58,300 INFO http.Http - http.max.delays = 1000
2007-02-15 17:04:58,316 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: lucene.apache.org: lucene.apache.org
2007-02-15 17:04:58,316 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
2007-02-15 17:04:58,316 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
2007-02-15 17:04:58,316 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
2007-02-15 17:04:58,316 ERROR http.Http - Caused by: java.net.UnknownHostException: lucene.apache.org: lucene.apache.org
2007-02-15 17:04:58,316 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1128)
2007-02-15 17:04:58,316 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1098)
2007-02-15 17:04:58,316 ERROR http.Http - at java.net.InetAddress.getAllByName(InetAddress.java:1061)
2007-02-15 17:04:58,316 ERROR http.Http - at java.net.InetAddress.getByName(InetAddress.java:958)
2007-02-15 17:04:58,316 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
2007-02-15 17:04:58,316 ERROR http.Http - ... 2 more
2007-02-15 17:04:58,316 INFO fetcher.Fetcher - fetch of http://lucene.apache.org/nutch/ failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: lucene.apache.org: lucene.apache.org
2007-02-15 17:04:59,597 INFO plugin.PluginRepository - Plugins: looking in: C:\pbapps\nutch-nightly\plugins
2007-02-15 17:04:59,722 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-02-15 17:04:59,722 INFO plugin.PluginRepository - Registered Plugins:
2007-02-15 17:04:59,722 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-02-15 17:04:59,722 INFO plugin.PluginRepository - Site Query Filter
Could you please help me get through this proxy with authentication?
Thanks,
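
A note on the trace above: the UnknownHostException is raised inside
HttpBase.blockAddr, via InetAddress.getByName, so the failure looks like a
local DNS lookup on the crawling machine rather than a rejected proxy login.
On networks where only the proxy can resolve external names, that lookup
fails before any request reaches the proxy. A minimal JDK-only sketch for
checking this outside Nutch (the class name ResolveCheck and the method
canResolve are made up for illustration and are not part of Nutch):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class ResolveCheck {
    // Returns true if this machine can resolve the host name itself,
    // i.e. without any help from an HTTP proxy.
    static boolean canResolve(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default host taken from the log above; pass another one as an argument.
        String host = args.length > 0 ? args[0] : "lucene.apache.org";
        System.out.println(host + " resolvable locally: " + canResolve(host));
    }
}
```

If this reports false for lucene.apache.org, the per-host blocking code
cannot look up the address locally. Nutch releases of that period have, as
far as I recall, a fetcher.threads.per.host.by.ip setting that decides
whether blockAddr resolves the host at all; treat that as an assumption and
check it against your nutch-default.xml.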
[-] Hi
[-]
[-] I had the same problem running Nutch behind a web proxy, but with a
[-] few small changes to the protocol-httpclient plugin it works for me.
[-]
[-] See the source below for my changes.
[-]
[-]
[-] public class Http extends HttpBase {
[-]
[-]   public static final Log LOG = LogFactory.getLog(Http.class);
[-]
[-]   private static MultiThreadedHttpConnectionManager connectionManager =
[-]       new MultiThreadedHttpConnectionManager();
[-]
[-]   // Since the Configuration has not yet been set,
[-]   // an unconfigured client is returned.
[-]   private static HttpClient client = new HttpClient(connectionManager);
[-]
[-]   static synchronized HttpClient getClient() {
[-]     return client;
[-]   }
[-]
[-]   boolean verbose = false;
[-]   int maxThreadsTotal = 10;
[-]   String ntlmUsername = "";
[-]   String ntlmPassword = "";
[-]   String ntlmDomain = "";
[-]   String ntlmHost = "";
[-]
[-]   // credentials for an authenticating proxy
[-]   String proxyuser = "";
[-]   String proxypass = "";
[-]
[-]   public Http() {
[-]     super(LOG);
[-]   }
[-]
[-]   public void setConf(Configuration conf) {
[-]     super.setConf(conf);
[-]     this.maxThreadsTotal = conf.getInt("fetcher.threads.fetch", 10);
[-]     this.ntlmUsername = conf.get("http.auth.ntlm.username", "");
[-]     this.ntlmPassword = conf.get("http.auth.ntlm.password", "");
[-]     this.ntlmDomain = conf.get("http.auth.ntlm.domain", "");
[-]     this.ntlmHost = conf.get("http.auth.ntlm.host", "");
[-]
[-]     // add config for auth proxy: both values are read from the Nutch
[-]     // configuration (e.g. nutch-site.xml) and default to empty
[-]     this.proxyuser = conf.get("http.auth.proxy.username", "");
[-]     this.proxypass = conf.get("http.auth.proxy.password", "");
[-]
[-]     //Level logLevel = Level.WARNING;
[-]     //if (conf.getBoolean("http.verbose", false)) {
[-]     //  logLevel = Level.FINE;
[-]     //}
[-]     //LOG.setLevel(logLevel);
[-]     //Logger.getLogger("org.apache.commons.httpclient.HttpMethodDirector")
[-]     //  .setLevel(logLevel);
[-]     configureClient();
[-]   }
[-]
[-]   public static void main(String[] args) throws Exception {
[-]     Http http = new Http();
[-]     http.setConf(NutchConfiguration.create());
[-]     main(http, args);
[-]   }
[-]
[-]   protected Response getResponse(URL url, CrawlDatum datum, boolean redirect)
[-]       throws ProtocolException, IOException {
[-]     return new HttpResponse(this, url, datum, redirect);
[-]   }
[-]
[-]   private void configureClient() {
[-]
[-]     // Set up an HTTPS socket factory that accepts self-signed certs.
[-]     //Protocol dummyhttps = new Protocol("https",
[-]     //    new DummySSLProtocolSocketFactory(), 443);
[-]     //Protocol.registerProtocol("https", dummyhttps);
[-]
[-]     HttpConnectionManagerParams params = connectionManager.getParams();
[-]     params.setConnectionTimeout(timeout);
[-]     params.setSoTimeout(timeout);
[-]     params.setSendBufferSize(BUFFER_SIZE);
[-]     params.setReceiveBufferSize(BUFFER_SIZE);
[-]     params.setMaxTotalConnections(maxThreadsTotal);
[-]     // never allow more connections per host than connections in total
[-]     if (maxThreadsTotal > maxThreadsPerHost) {
[-]       params.setDefaultMaxConnectionsPerHost(maxThreadsPerHost);
[-]     } else {
[-]       params.setDefaultMaxConnectionsPerHost(maxThreadsTotal);
[-]     }
[-]
[-]     HostConfiguration hostConf = client.getHostConfiguration();
[-]     ArrayList headers = new ArrayList();
[-]     // prefer English
[-]     headers.add(new Header("Accept-Language",
[-]         "en-us,en-gb,en;q=0.7,*;q=0.3"));
[-]     // prefer UTF-8
[-]     headers.add(new Header("Accept-Charset",
[-]         "utf-8,ISO-8859-1;q=0.7,*;q=0.7"));
[-]     // prefer understandable formats
[-]     headers.add(new Header("Accept",
[-]         "text/html,application/xml;q=0.9,application/xhtml+xml,"
[-]         + "text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"));
[-]     // accept gzipped content
[-]     headers.add(new Header("Accept-Encoding", "x-gzip, gzip"));
[-]     hostConf.getParams().setParameter("http.default-headers", headers);
[-]     if (useProxy) {
[-]       hostConf.setProxy(proxyHost, proxyPort);
[-]       // add support for proxy authentication
[-]       // (needs imports for org.apache.commons.httpclient.Credentials and
[-]       // org.apache.commons.httpclient.UsernamePasswordCredentials)
[-]       if (proxyuser.length() > 0) {
[-]         Credentials proxyCreds =
[-]             new UsernamePasswordCredentials(proxyuser, proxypass);
[-]         client.getState().setProxyCredentials(
[-]             new AuthScope(proxyHost, AuthScope.ANY_PORT), proxyCreds);
[-]       }
[-]     }
[-]     if (ntlmUsername.length() > 0) {
[-]       Credentials ntCreds = new NTCredentials(ntlmUsername, ntlmPassword,
[-]           ntlmHost, ntlmDomain);
[-]       client.getState().setCredentials(
[-]           new AuthScope(ntlmHost, AuthScope.ANY_PORT), ntCreds);
[-]
[-]       if (LOG.isInfoEnabled()) {
[-]         LOG.info("Added NTLM credentials for " + ntlmUsername);
[-]       }
[-]     }
[-]     if (LOG.isInfoEnabled()) { LOG.info("Configured Client"); }
[-]   }
[-] }
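
For reference, if the proxy uses the Basic scheme (an assumption; it may
equally use Digest or NTLM), the UsernamePasswordCredentials registered in
the code above end up on the wire as a Proxy-Authorization header carrying
the Base64 form of user:password. A small JDK-only sketch of that encoding,
independent of Nutch and Commons HttpClient (ProxyBasicAuth and headerValue
are illustrative names, the credentials are placeholders, and
java.util.Base64 needs Java 8 or later):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ProxyBasicAuth {
    // Builds the value of a Proxy-Authorization header for the Basic scheme:
    // "Basic " followed by base64(user + ":" + password).
    static String headerValue(String user, String password) {
        String pair = user + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // Placeholder credentials, not real ones.
        System.out.println("Proxy-Authorization: "
                + headerValue("Aladdin", "open sesame"));
        // prints "Proxy-Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    }
}
```

Base64 is an encoding, not encryption, so with Basic authentication the
proxy password is effectively sent in clear text, just as it is stored in
clear text in the Nutch configuration file.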
[-]
[-]
[-] -----Original Message-----
[-] From: ekoje ekoje [mailto:[EMAIL PROTECTED]
[-] Sent: Thursday, 8 February 2007 15:36
[-] To: [email protected]
[-] Subject: Web Proxy
[-]
[-] Hi Guys,
[-]
[-] I would like to run Nutch, but I'm behind a web proxy that requires
[-] authentication.
[-]
[-] I use nutch-0.8.1 under Windows XP. I've configured nutch-site.xml to
[-] specify my proxy host and port, but how do I specify the username and
[-] password?
[-]
[-] Could you please help me?
[-]
[-] Thanks
[-]