Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
Added TableOfContents and minor edits in Prerequisites and Optional sect

------------------------------------------------------------------------------
+ [[TableOfContents]]
+ 
  == Introduction ==
  This is a feature in Nutch that allows the crawler to authenticate itself to 
websites requiring NTLM, Basic or Digest authentication. This feature cannot 
perform POST-based authentication that depends on cookies. More information on 
that can be found at: HttpPostAuthentication
  
@@ -18, +20 @@

  Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' are very brief, this section explains them in a 
little more detail. In all the examples below, the root element 
<auth-configuration> has been omitted for the sake of clarity.
  
  === Prerequisites ===
- In order use HTTP Authentication your Nutch install must be configured to use 
'protocol-httpclient' instead of the default 'protocol-http'. To make this 
change copy the 'plugin.includes' property from 'conf/nutch-default.xml' and 
paste it into 'conf/nutch-site.xml'. Within that property replace 
'protocol-http' with 'protocol-httpclient'. If you have made no other changes 
it will look as follows:
+ In order to use HTTP Authentication, the Nutch crawler must be configured to 
use 'protocol-httpclient' instead of the default 'protocol-http'. To do this, 
copy the 'plugin.includes' property from 'conf/nutch-default.xml' into 
'conf/nutch-site.xml'. Replace 'protocol-http' with 'protocol-httpclient' in 
the value of the property. If you have made no other changes, it should look as 
follows:
  {{{
  <property>
    <name>plugin.includes</name>
@@ -35, +37 @@

  }}}
  
  === Optional ===
- By default Nutch use credential from 'httpclient-auth.xml'. If you wish to 
use a different file you will need to copy the 'http.auth.file' property from 
'conf/nutch-default.xml' and paste it into 'conf/nutch-site.xml' and then 
modify the '<value>' element. The default property appears as follows:
+ By default Nutch uses credentials from 'conf/httpclient-auth.xml'. If you 
wish to use a different file, place the file in the 'conf' directory, copy the 
'http.auth.file' property from 'conf/nutch-default.xml' into 
'conf/nutch-site.xml', and then edit the file name in the '<value>' element 
accordingly. The default property appears as follows:
  {{{
  <property>
    <name>http.auth.file</name>
@@ -43, +45 @@

    <description>Authentication configuration file for 'protocol-httpclient' 
plugin.</description>
  </property>
  }}}
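
For illustration only, an override of this property in 'conf/nutch-site.xml' might look like the sketch below. The file name 'my-auth.xml' is a hypothetical placeholder, not a file shipped with Nutch; it is assumed to have been placed in the 'conf' directory as described above.

```xml
<!-- Sketch of an override in conf/nutch-site.xml.
     'my-auth.xml' is a hypothetical file name used only for illustration. -->
<property>
  <name>http.auth.file</name>
  <value>my-auth.xml</value>
  <description>Authentication configuration file for 'protocol-httpclient'
  plugin.</description>
</property>
```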
- 
  
  === Crawling an Intranet with Default Authentication Scope ===
  Suppose all pages of an intranet are protected by Basic, Digest or NTLM 
authentication and a single set of credentials is to be used for all web 
pages in the intranet. In that case, a configuration as described below is 
enough. This is also the simplest possible configuration for authentication 
schemes.
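
A minimal sketch of such a configuration in 'conf/httpclient-auth.xml' is given here, assuming the 'credentials' element with 'username' and 'password' attributes and an empty 'default' child, as in the sample comments of that file. The user name and password are placeholders, and the root element <auth-configuration> is omitted, as in the other examples.

```xml
<!-- One set of credentials applied to the default authentication scope.
     'jsmith' and 'secret' are placeholder values, not real credentials. -->
<credentials username="jsmith" password="secret">
  <default/>
</credentials>
```

With no more specific scope defined, these credentials would be offered to any server in the crawl that asks for Basic, Digest or NTLM authentication, so this setup is best kept to a trusted intranet.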
