2006/10/12, Guruprasad Iyer <[EMAIL PROTECTED]>:
> Hi,
>
> I need to know how to crawl (intranet) sites which require authentication.
> One suggestion was that I replace protocol-http with protocol-httpclient in
> the value field of plugin.includes tag in the nutch-default.xml file.
> However, this did not solve the problem.
> Can you help me out on this? Thanks.

I don't know what kind of authentication scheme you're up against, but
recently I had to deal with NTLM authentication on an intranet and worked
around it using an ntlmaps proxy. You tell nutch to use the proxy, and you
give the proxy adequate access privileges. It's as simple as that, and it
works like a charm.

I imagine the nutch proxy support could be extended so that, e.g., it
selects a proxy based on regexp matching of URLs. That way it would be
possible to provide all the login/password pairs needed to crawl all of the
sites you're interested in.

t.n.a.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
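
To make the regexp idea a bit more concrete, here is a minimal sketch in Python of what "select a proxy based on regexp matching of URLs" might look like. It is not nutch code and nutch does not provide this; the host names, the port numbers, and the names `PROXY_RULES` and `select_proxy` are all made up for illustration. The assumption is that one authenticating proxy (for example one ntlmaps instance) runs per set of credentials, and the crawler picks the first one whose pattern matches the host of the URL.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical rules: (regexp matched against the URL's host, "host:port" of
# the authenticating proxy that holds the credentials for those sites).
# The hosts and ports below are placeholders, not real configuration.
PROXY_RULES = [
    (re.compile(r"(^|\.)hr\.example\.com$"), "localhost:5865"),
    (re.compile(r"(^|\.)wiki\.example\.com$"), "localhost:5866"),
]


def select_proxy(url: str, rules=PROXY_RULES) -> Optional[str]:
    """Return the proxy for the first rule matching the URL's host, else None."""
    host = urlsplit(url).hostname or ""
    for pattern, proxy in rules:
        if pattern.search(host):
            return proxy
    return None  # no rule matched: fetch directly, without a proxy


if __name__ == "__main__":
    for u in ("http://hr.example.com/payroll", "http://public.example.org/"):
        print(u, "->", select_proxy(u))
```

Rules are tried in order, so a more specific pattern should be listed before a more general one. Matching on the host only, rather than on the full URL, keeps the patterns short, but that is a design choice for this sketch and not a requirement of the idea.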
