Hello Tizy,

As far as I know, the current development version of Nutch supports Basic, Digest, and NTLM authentication [1]. It cannot do POST-based (form) authentication that depends on cookies. There is a wiki page that describes this feature [2], but as far as I can see no code has been written for it yet.
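To illustrate what form-based login involves compared with the header-based schemes, here is a minimal sketch in plain Java 11+ (java.net.http). It is not Nutch code and does not use any Nutch API. The class name, the login URL, and the field names "username" and "password" are assumptions for the example; a real site defines its own field names in the `name` attributes of the form's input elements, and the target URL in the form's `action` attribute.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only (not part of Nutch).
public class FormLoginSketch {

    // Encode key-value pairs as an application/x-www-form-urlencoded body.
    public static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build the POST request that submits the login form.
    public static HttpRequest buildLoginRequest(String loginUrl, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    // A client with a CookieManager keeps the session cookie set by the
    // login response and sends it back on later requests.
    public static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "tizy");        // assumed field name
        fields.put("password", "s3cret pass"); // assumed field name

        HttpRequest login = buildLoginRequest("https://example.com/login", fields);
        System.out.println(login.method() + " " + login.uri());
        System.out.println(encodeForm(fields));
        // Sending is left out so the sketch runs offline; with a real site:
        // newSessionClient().send(login, HttpResponse.BodyHandlers.ofString());
    }
}
```

The point of the sketch is that the credentials travel in the request body and the resulting session lives in a cookie, so the same client (with its cookie store) has to be reused for every page fetched after login.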
[1] https://wiki.apache.org/nutch/HttpAuthenticationSchemes
[2] https://wiki.apache.org/nutch/HttpPostAuthentication

Halil

2014-12-16 7:16 GMT+02:00 Tizy Ninan <[email protected]>:

> Hi,
>
> I am trying to develop a custom crawler to crawl websites that require form
> based authentication using Nutch v1.9 in Java. The HttpPostAuthentication
> feature of Nutch is followed to implement it.
>
> The login parameters required for authentication such as html form-id,
> login post data(username, password) are specified as key-value pairs in a
> configuration file. What is required to identify the html login form(id or
> name of the html form)? How to identify the html form parameters if id or
> name of the form is not specified?
>
> I have also posted the question to the developer mailing list, but did not
> receive any reply. I am stuck with this for a while. Could somebody provide
> with a solution on how to specify the html form parameters of websites to
> be crawled to perform form based authentication?
>
> Thanks and Regards,
> Tizy

