The "HttpAuthenticationSchemes" page has been changed by susam.
The comment on this change is: Reverting to last good revision, i.e. 22.
http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=23&rev2=24

--------------------------------------------------

  
  == Introduction to Authentication Scope ==
  Different credentials for different authentication scopes can be configured 
in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a 
particular authentication scope (i.e. particular host, port number, realm 
and/or scheme), then that set of credentials would be sent only to pages 
falling under the specified authentication scope.
- 
+   
  When authentication is required to fetch a resource from a web-server, the 
authentication-scope is determined from the host and port obtained from the URL 
of the page. If it matches any 'authscope' in this configuration file, then the 
'credentials' for that 'authscope' are used for authentication.
  
  == Configuration ==
@@ -21, +21 @@

  
  === Prerequisites ===
  In order to use HTTP Authentication, the Nutch crawler must be configured to 
use 'protocol-httpclient' instead of the default 'protocol-http'. To do this, 
copy the 'plugin.includes' property from 'conf/nutch-default.xml' into 
'conf/nutch-site.xml'. Replace 'protocol-http' with 'protocol-httpclient' in 
the value of the property. If you have made no other changes, it should look as 
follows:
- 
  {{{
  <property>
    <name>plugin.includes</name>
@@ -30, +29 @@

    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
-   and basic indexing and search plugins. In order to use HTTPS please enable
+   and basic indexing and search plugins. In order to use HTTPS please enable 
-   protocol-httpclient, but be aware of possible intermittent problems with the
+   protocol-httpclient, but be aware of possible intermittent problems with 
the 
    underlying commons-httpclient library.
    </description>
  </property>
  }}}
+ 
  === Optional ===
  By default Nutch uses credentials from 'conf/httpclient-auth.xml'. To use a 
different file, place it in the 'conf' directory, copy the 'http.auth.file' 
property from 'conf/nutch-default.xml' into 'conf/nutch-site.xml', and edit the 
file name in the '<value>' element accordingly. The default property appears as 
follows:
+ {{{
+ <property>
+   <name>http.auth.file</name>
+   <value>httpclient-auth.xml</value>
+   <description>Authentication configuration file for 'protocol-httpclient' 
plugin.</description>
+ </property>
+ }}}
+ 
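  As an illustration of the override described above, 'conf/nutch-site.xml' 
might contain the following. The file name 'intranet-auth.xml' is a 
hypothetical example, not a name defined by Nutch; a file with that name would 
have to exist in the 'conf' directory.

  {{{
  <!-- conf/nutch-site.xml: sketch only; 'intranet-auth.xml' is an assumed name -->
  <property>
    <name>http.auth.file</name>
    <value>intranet-auth.xml</value>
    <description>Authentication configuration file for 'protocol-httpclient' 
plugin.</description>
  </property>
  }}}
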
+ === Crawling an Intranet with Default Authentication Scope ===
+ If all pages of an intranet are protected by Basic, Digest or NTLM 
authentication and a single set of credentials is to be used for all of them, 
then the configuration below is enough. This is also the simplest possible 
configuration for authentication schemes.
  
  {{{
- <credentialsweb-servers which have authentication scopes defined for a few 
selected realms/schemes. This is discussed in next section.
+ <credentials username="susam" password="masus">
+  <default/>
+ </credentials>
  }}}
+ 
+ The credentials specified above would be sent to any page requesting 
authentication. Though it is extremely simple, the default authentication scope 
should be used with caution. Because this set of credentials is sent to any web 
page that requests authentication, a malicious user can steal it by setting up 
a web page requiring Basic authentication. For this reason, it is common to use 
credentials set apart for crawling only, so that even if someone steals them, 
they cannot do anything harmful. If you are sure that all pages in the intranet 
use a particular authentication scheme, say NTLM, the situation can be improved 
a little as follows.
+ 
+ {{{
+ <credentials username="susam" password="masus">
+  <default scheme="ntlm"/>
+ </credentials>
+ }}}
+ 
+ Thus, this set of credentials would be sent only to pages requesting NTLM 
authentication. Now, one cannot set up a page requiring Basic authentication 
and steal the credentials. NTLM is safer, because the password is not sent in 
clear text or in a form from which the original password can be recovered 
directly.
+ 
+ === Credentials for Specific Authentication Scopes ===
+ The following example shows how two sets of credentials are defined for 
different authentication scopes.
+ For all pages of example:8080 requiring authentication in the 'blogs' or 
'wiki' realm, the first set of credentials would be used.
+ 
+ {{{
+ <credentials username="susam" password="masus">
+   <authscope host="example" port="8080" realm="blogs"/>
+   <authscope host="example" port="8080" realm="wiki"/>
+ </credentials>
+ <credentials username="admin" password="nimda">
+   <default/>
+ </credentials>
+ }}}
+ 
+ Note, however, that if some page of example:8080 requires authentication in 
another realm, say 'mail', authentication would not be performed even though 
the second set of credentials is defined as default. This does not affect other 
web servers, for which the default authscope would still be used. The problem 
occurs only for web servers that have authentication scopes defined for a few 
selected realms/schemes. This is discussed in the next section.
+ 
  === Catch-all Authentication Scope for a Web Server ===
  When one or more authentication scopes are defined for a particular web 
server (host:port), the default credentials are ignored for that host:port 
combination. Therefore, a catch-all authentication scope to handle all other 
realms and schemes must be specified explicitly as shown below.
  
@@ -55, +94 @@

    <authscope host="example" port="8080"/>
  </credentials>
  }}}
+ 
  The last authscope tag for example:8080 acts as the catch-all authentication 
scope. In this section, realms were used to demonstrate the behavior. The same 
holds true for schemes. For instance, in the following example, the last 
authscope tag is necessary if the second set of credentials must be used for 
all pages of example:8080 not belonging to the authentication scope defined in 
the first tag.
  
  {{{
@@ -66, +106 @@

    <authscope host="example" port="8080"/>
  </credentials>
  }}}
+ 
  === Important Points ===
   1. For the <authscope> tag, the 'host' and 'port' attributes should always 
be specified. The 'realm' and 'scheme' attributes may or may not be specified, 
depending on your needs. If you are tempted to omit the 'host' and 'port' 
attributes because you want the credentials to be used for any host and any 
port for that realm/scheme, please use the 'default' tag instead. That is what 
the 'default' tag is meant for.
   1. One authentication scope should not be defined twice as different 
<authscope> tags under different <credentials> tags. However, if this is done 
by mistake, the credentials for the last defined <authscope> tag would be used. 
This is because the XML parsing code reads the file from top to bottom and sets 
the credentials for authentication scopes as it goes. If the same 
authentication scope is encountered again, it is overwritten with the new 
credentials. However, one should not rely on this behavior, as it might change 
with further development.
@@ -80, +121 @@

  
  == Need Help? ==
  If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually 
responds to mails related to authentication problems. The DEBUG logs may be 
required to troubleshoot the problem. You must enable debug logging for 
'protocol-httpclient' and Jakarta Commons !HttpClient before running the 
crawler. To do so, open 'conf/log4j.properties' and add the following lines:
- 
  {{{
  log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
  log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
  }}}
+ 
  It would be good to check the following things before asking for help.
  
   1. Have you overridden the 'plugin.includes' property of 
'conf/nutch-default.xml' with 'conf/nutch-site.xml' and replaced 
'protocol-http' with 'protocol-httpclient'?
