I have been doing a lot of POST authentication while crawling corporate sites. Since login flows vary drastically between sites (e.g. from a typical JIRA login to POST+JS redirection, NTLMv2...), it's hard to avoid extending the crawler with some additional Java.
So what I've ended up doing is building a "handler" class for each specific site; that handler knows how to send the requests and fetch the content. Every handler is expected to return a common response type, so it looks like an extension/plugin design for the protocol-httpclient plugin.

On Tue, Dec 16, 2014 at 5:46 PM, Tizy Ninan <[email protected]> wrote:
>
> Hi Talat,
>
> Thanks a lot for the reply. I will go through it and try it out.
>
> Thanks,
> Tizy
>
> On Tue, Dec 16, 2014 at 2:25 PM, Talat Uyarer <[email protected]> wrote:
> >
> > Hi Tizy,
> >
> > There is some discussion; you can find it at NUTCH-827 [1]. IMHO we need
> > some help. If we create this feature it will be useful.
> >
> > Talat
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH-827
> >
> > 2014-12-16 10:44 GMT+02:00 Tizy Ninan <[email protected]>:
> > > Hi,
> > >
> > > Thanks for the reply.
> > > Is there any alternative way to do this authentication? Does the fetcher
> > > job of Nutch accept cookies for fetching web sites from the same
> > > domain? Could you suggest any workaround to do form based authentication
> > > using Nutch?
> > >
> > > Thanks,
> > > Tizy
> > >
> > > On Tue, Dec 16, 2014 at 1:08 PM, Halil Ibrahim Simsek <[email protected]>
> > > wrote:
> > >>
> > >> Hello Tizy,
> > >>
> > >> As far as I know, the current development version of Nutch can do Basic,
> > >> Digest and NTLM based authentication [1]. Nutch cannot do POST based
> > >> authentication that depends on cookies. BTW, there is a document that is
> > >> supposed to describe this feature, but as far as I can see no code has
> > >> been developed yet [2].
> > >>
> > >> [1] https://wiki.apache.org/nutch/HttpAuthenticationSchemes
> > >> [2] https://wiki.apache.org/nutch/HttpPostAuthentication
> > >>
> > >> Halil
> > >>
> > >> 2014-12-16 7:16 GMT+02:00 Tizy Ninan <[email protected]>:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I am trying to develop a custom crawler in Java, using Nutch v1.9, to
> > >> > crawl websites that require form based authentication. I am following
> > >> > the HttpPostAuthentication feature of Nutch to implement it.
> > >> >
> > >> > The login parameters required for authentication, such as the HTML
> > >> > form id and the login POST data (username, password), are specified as
> > >> > key-value pairs in a configuration file. What is required to identify
> > >> > the HTML login form (the id or name of the form)? How can the form
> > >> > parameters be identified if the id or name of the form is not specified?
> > >> >
> > >> > I have also posted the question to the developer mailing list, but did
> > >> > not receive any reply. I have been stuck on this for a while. Could
> > >> > somebody provide a solution for how to specify the HTML form parameters
> > >> > of the websites to be crawled, in order to perform form based
> > >> > authentication?
> > >> >
> > >> > Thanks and Regards,
> > >> > Tizy
> > >> >
> > >>
> > >
> > > --
> > > Thanks and Regards,
> > > Tizy
> >
> > --
> > Talat UYARER
> > Website: http://talat.uyarer.com
> > Twitter: http://twitter.com/talatuyarer
> > Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>
> --
> Thanks and Regards,
> Tizy
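
P.S. To make the per-site handler idea at the top of this message a bit more concrete, here is a minimal sketch in plain Java 11 (java.net.http only). Everything in it is made up for illustration: SiteHandler, FetchResult, FormLoginHandler, HandlerRegistry and the field names are hypothetical and are not Nutch or protocol-httpclient API. It only covers the simple "POST a form once, then reuse the session cookies" case and is not thread-safe; JS redirection or NTLM would each need a handler of their own.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

/** Common response type every handler returns (hypothetical, not a Nutch class). */
final class FetchResult {
    final int status;
    final String body;
    FetchResult(int status, String body) { this.status = status; this.body = body; }
}

/** One implementation per site (hypothetical interface, not part of Nutch). */
interface SiteHandler {
    boolean handles(URI url);
    FetchResult fetch(URI url) throws IOException, InterruptedException;
}

/** Plain form login: POST the credentials once, then fetch pages with the same cookies. */
final class FormLoginHandler implements SiteHandler {
    private final String host;
    private final URI loginUrl;
    private final Map<String, String> formFields;
    // The CookieManager keeps the session cookie set by the login response.
    private final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
    private boolean loggedIn;

    FormLoginHandler(String host, URI loginUrl, Map<String, String> formFields) {
        this.host = host;
        this.loginUrl = loginUrl;
        this.formFields = new LinkedHashMap<>(formFields); // keep field order
    }

    @Override
    public boolean handles(URI url) {
        return host.equalsIgnoreCase(url.getHost());
    }

    /** Builds the application/x-www-form-urlencoded body from the configured key-value pairs. */
    String encodedForm() {
        return formFields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    @Override
    public FetchResult fetch(URI url) throws IOException, InterruptedException {
        if (!loggedIn) {
            HttpRequest login = HttpRequest.newBuilder(loginUrl)
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(encodedForm()))
                    .build();
            client.send(login, HttpResponse.BodyHandlers.discarding());
            loggedIn = true; // assumption: no check here that the login actually succeeded
        }
        HttpResponse<String> page = client.send(
                HttpRequest.newBuilder(url).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        return new FetchResult(page.statusCode(), page.body());
    }
}

/** Picks the first registered handler that claims a URL. */
final class HandlerRegistry {
    private final List<SiteHandler> handlers = new ArrayList<>();

    void register(SiteHandler handler) { handlers.add(handler); }

    Optional<SiteHandler> forUrl(URI url) {
        return handlers.stream().filter(h -> h.handles(url)).findFirst();
    }
}
```

The fetch side would look a handler up with forUrl(...) and fall back to the normal protocol plugin when none matches, so adding a new site means adding one class rather than touching the fetcher.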

