Hi,

Can somebody help with the HttpPostAuthentication issue in Nutch v1.9 described in my previous message (quoted below)? I have been stuck on this problem for a while, and any insight would be really helpful.

Thanks,
Tizy

On Thu, Dec 18, 2014 at 11:38 AM, Tizy Ninan <[email protected]> wrote:

> Hi,
>
> Thanks for the reply.
>
> I tried applying the patch (http-client-form-authtication.patch)
> in NUTCH-827 [1] and compiled the code using ant.
>
> When I ran the crawler, it gave the following warning log message:
> "httpclient.Http: Bad auth conf file: Element <removedFormFields> not
> recognized in httpclient-auth.xml - expected <authscope>".
>
> How do I make sure that the changes in the code are reflected? It seems
> like the changes are not taking effect while crawling. What is the
> correct procedure to compile the code in the plugins?
>
> Thanks,
> Tizy
>
> On Tue, Dec 16, 2014 at 6:34 PM, remi tassing <[email protected]>
> wrote:
>>
>> I have been doing a lot of POST authentication while crawling corporate
>> stuff. Since POST methods may vary drastically between sites (e.g. typical
>> JIRA to POST+JS redirection, NTLMv2...) it's hard not to extend the
>> crawler with some additional Java.
>>
>> So what I've ended up doing is to build a "handler" class for each
>> specific site, and that handler knows how to send requests and fetch the
>> content. Some common response type is expected, so it looks like an
>> extension/plugin design for the protocol-httpclient plugin.
>>
>> On Tue, Dec 16, 2014 at 5:46 PM, Tizy Ninan <[email protected]> wrote:
>> >
>> > Hi Talat,
>> >
>> > Thanks a lot for the reply. I will go through it and try it out.
>> >
>> > Thanks,
>> > Tizy
>> >
>> > On Tue, Dec 16, 2014 at 2:25 PM, Talat Uyarer <[email protected]> wrote:
>> > >
>> > > Hi Tizy,
>> > >
>> > > There is some discussion; you can find it at NUTCH-827 [1]. IMHO we
>> > > need some help. If we create this feature it will be useful.
>> > >
>> > > Talat
>> > >
>> > > [1] https://issues.apache.org/jira/browse/NUTCH-827
>> > >
>> > > 2014-12-16 10:44 GMT+02:00 Tizy Ninan <[email protected]>:
>> > > > Hi,
>> > > >
>> > > > Thanks for the reply.
>> > > > Is there any alternative way to do this authentication? Does the
>> > > > fetcher job of Nutch accept cookies for fetching the web sites from
>> > > > the same domain? Could you suggest any workaround to do form based
>> > > > authentication using Nutch?
>> > > >
>> > > > Thanks,
>> > > > Tizy
>> > > >
>> > > > On Tue, Dec 16, 2014 at 1:08 PM, Halil Ibrahim Simsek
>> > > > <[email protected]> wrote:
>> > > >>
>> > > >> Hello Tizy,
>> > > >>
>> > > >> As far as I know, currently the development version of Nutch can do
>> > > >> Basic, Digest and NTLM based authentication. [1] Nutch cannot do
>> > > >> POST based authentication that depends on cookies. BTW there is a
>> > > >> document which is supposed to provide this feature, but as far as I
>> > > >> can see no code has been developed yet. [2]
>> > > >>
>> > > >> [1] https://wiki.apache.org/nutch/HttpAuthenticationSchemes
>> > > >> [2] https://wiki.apache.org/nutch/HttpPostAuthentication
>> > > >>
>> > > >> Halil
>> > > >>
>> > > >> 2014-12-16 7:16 GMT+02:00 Tizy Ninan <[email protected]>:
>> > > >> >
>> > > >> > Hi,
>> > > >> >
>> > > >> > I am trying to develop a custom crawler to crawl websites that
>> > > >> > require form based authentication using Nutch v1.9 in Java. The
>> > > >> > HttpPostAuthentication feature of Nutch is followed to
>> > > >> > implement it.
>> > > >> >
>> > > >> > The login parameters required for authentication, such as the
>> > > >> > html form-id and the login post data (username, password), are
>> > > >> > specified as key-value pairs in a configuration file. What is
>> > > >> > required to identify the html login form (id or name of the html
>> > > >> > form)? How to identify the html form parameters if the id or
>> > > >> > name of the form is not specified?
>> > > >> >
>> > > >> > I have also posted the question to the developer mailing list,
>> > > >> > but did not receive any reply. I have been stuck with this for a
>> > > >> > while. Could somebody provide a solution on how to specify the
>> > > >> > html form parameters of websites to be crawled to perform form
>> > > >> > based authentication?
>> > > >> >
>> > > >> > Thanks and Regards,
>> > > >> > Tizy
>> > > >> >
>> > > >>
>> > > >
>> > > > --
>> > > > Thanks and Regards,
>> > > > Tizy
>> > >
>> > > --
>> > > Talat UYARER
>> > > Website: http://talat.uyarer.com
>> > > Twitter: http://twitter.com/talatuyarer
>> > > Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>> >
>> > --
>> > Thanks and Regards,
>> > Tizy
>
> --
> Thanks and Regards,
> Tizy

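For reference, the per-site "handler" design that remi describes in the thread could be sketched roughly as below. This is only an illustration under stated assumptions, not code from Nutch or from the NUTCH-827 patch: the names SiteAuthHandler, HandlerRegistry, FetchResult, StubJiraHandler and the host jira.example.com are all hypothetical, and the stub handler does no network I/O (a real one would POST the login form, keep the session cookies, and then fetch the page).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch only: none of these types exist in Nutch; all names are hypothetical.
class SiteHandlerSketch {

    /** Common response type that every site handler returns. */
    static final class FetchResult {
        final int status;
        final String body;

        FetchResult(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** One implementation per site; each one knows that site's login flow. */
    interface SiteAuthHandler {
        boolean handles(String host);

        FetchResult fetch(URI url) throws Exception;
    }

    /** Returns the first registered handler that claims the URL's host. */
    static final class HandlerRegistry {
        private final List<SiteAuthHandler> handlers = new ArrayList<>();

        void register(SiteAuthHandler handler) {
            handlers.add(handler);
        }

        Optional<SiteAuthHandler> find(URI url) {
            String host = url.getHost();
            if (host == null) {
                return Optional.empty();
            }
            return handlers.stream().filter(h -> h.handles(host)).findFirst();
        }
    }

    /** Stub for a hypothetical host; a real handler would log in via POST first. */
    static final class StubJiraHandler implements SiteAuthHandler {
        @Override
        public boolean handles(String host) {
            return host.equalsIgnoreCase("jira.example.com");
        }

        @Override
        public FetchResult fetch(URI url) {
            // Placeholder: no request is sent in this sketch.
            return new FetchResult(200, "stub content for " + url);
        }
    }

    public static void main(String[] args) throws Exception {
        HandlerRegistry registry = new HandlerRegistry();
        registry.register(new StubJiraHandler());

        URI url = URI.create("http://jira.example.com/browse/ABC-1");
        Optional<SiteAuthHandler> handler = registry.find(url);
        if (handler.isPresent()) {
            System.out.println(handler.get().fetch(url).status);
        } else {
            System.out.println("no handler; fall back to the default fetch");
        }
    }
}
```

Keeping the registry separate from the handlers means a URL with no matching handler can fall through to the crawler's normal fetch path, while site-specific login logic stays isolated in one class per site.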
