I have been doing a lot of POST authentication while crawling corporate
stuff. Since POST methods may vary drastically between sites (e.g. typical
JIRA to POST+JS redirection, NTLMv2...) it's hard not to extend the crawler
with some additional Java.

So what I've ended up doing is to build a "handler" class for each
specific site; that handler knows how to send the requests and fetch the
content. A common response type is expected, so it looks like an
extension/plugin design for the protocol-httpclient plugin.
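
To make the idea concrete, here is a minimal sketch of that handler design.
All names in it (FetchResult, SiteHandler, SessionHandler, HandlerRegistry,
jira.example.com) are made up for illustration; none of them are Nutch or
protocol-httpclient classes, and the login/fetch bodies are stubs where the
real HTTP calls would go.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Common response type every handler returns (hypothetical, not a Nutch class). */
final class FetchResult {
    final int status;
    final String body;

    FetchResult(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

/** One implementation per site; it hides how that site authenticates. */
interface SiteHandler {
    FetchResult fetch(String url);
}

/** Base class for sites that need a login step before the first fetch. */
abstract class SessionHandler implements SiteHandler {
    private boolean loggedIn;

    @Override
    public final FetchResult fetch(String url) {
        if (!loggedIn) {
            login();          // e.g. POST the form, follow the JS redirect, keep cookies
            loggedIn = true;  // only reached if login() did not throw
        }
        return doFetch(url);
    }

    protected abstract void login();

    protected abstract FetchResult doFetch(String url);
}

/** Picks the handler registered for a URL's host, or a fallback handler. */
final class HandlerRegistry {
    private final Map<String, SiteHandler> byHost = new HashMap<>();
    private final SiteHandler fallback;

    HandlerRegistry(SiteHandler fallback) {
        this.fallback = fallback;
    }

    void register(String host, SiteHandler handler) {
        byHost.put(host.toLowerCase(Locale.ROOT), handler);
    }

    SiteHandler forUrl(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return fallback;
        }
        return byHost.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
    }
}

final class HandlerDemo {
    public static void main(String[] args) {
        // Fallback: sites that need no special authentication.
        HandlerRegistry registry =
            new HandlerRegistry(url -> new FetchResult(200, "plain fetch of " + url));

        // Site-specific handler with stubbed login and fetch.
        registry.register("jira.example.com", new SessionHandler() {
            @Override protected void login() {
                System.out.println("logging in (stub)");
            }
            @Override protected FetchResult doFetch(String url) {
                return new FetchResult(200, "authenticated fetch of " + url);
            }
        });

        String url = "https://jira.example.com/browse/X-1";
        System.out.println(registry.forUrl(url).fetch(url).body);
    }
}
```

The registry is keyed by host so that adding a new site only means writing
one more handler class; whether the real plugin would dispatch by host, by
URL pattern or by configuration is an open choice in this sketch.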

On Tue, Dec 16, 2014 at 5:46 PM, Tizy Ninan <[email protected]> wrote:
>
> Hi Talat,
>
> Thanks a lot for the reply. I will go through it and try it out.
>
> Thanks,
> Tizy
>
> On Tue, Dec 16, 2014 at 2:25 PM, Talat Uyarer <[email protected]> wrote:
> >
> > Hi Tizy,
> >
> > There is some discussion about this, which you can find at NUTCH-827 [1].
> > IMHO we need some help. If we create this feature it will be useful.
> >
> > Talat
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH-827
> >
> > 2014-12-16 10:44 GMT+02:00 Tizy Ninan <[email protected]>:
> > > Hi,
> > >
> > > Thanks for the reply.
> > > Is there any alternative way to do this authentication? Does the fetcher
> > > job of Nutch accept cookies for fetching the web sites from the same
> > > domain? Could you suggest any workaround to do form based authentication
> > > using Nutch?
> > >
> > > Thanks,
> > > Tizy
> > >
> > > On Tue, Dec 16, 2014 at 1:08 PM, Halil Ibrahim Simsek
> > > <[email protected]> wrote:
> > >>
> > >> Hello Tizy,
> > >>
> > >> As far as I know, the current development version of Nutch can do Basic,
> > >> Digest and NTLM based authentication [1]. Nutch cannot do POST based
> > >> authentication that depends on cookies. BTW there is a document that is
> > >> supposed to cover this feature, but as far as I can see no code has been
> > >> developed yet [2].
> > >>
> > >> [1] https://wiki.apache.org/nutch/HttpAuthenticationSchemes
> > >> [2] https://wiki.apache.org/nutch/HttpPostAuthentication
> > >>
> > >> Halil
> > >>
> > >> 2014-12-16 7:16 GMT+02:00 Tizy Ninan <[email protected]>:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I am trying to develop a custom crawler to crawl websites that require
> > >> > form based authentication using Nutch v1.9 in Java. I followed the
> > >> > HttpPostAuthentication feature of Nutch to implement it.
> > >> >
> > >> > The login parameters required for authentication, such as the html
> > >> > form-id and the login post data (username, password), are specified as
> > >> > key-value pairs in a configuration file. What is required to identify
> > >> > the html login form (id or name of the html form)? How to identify the
> > >> > html form parameters if the id or name of the form is not specified?
> > >> >
> > >> > I have also posted the question to the developer mailing list, but did
> > >> > not receive any reply. I have been stuck with this for a while. Could
> > >> > somebody provide a solution on how to specify the html form parameters
> > >> > of websites to be crawled to perform form based authentication?
> > >> >
> > >> > Thanks and Regards,
> > >> > Tizy
> > >> >
> > >>
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Tizy
> >
> >
> >
> > --
> > Talat UYARER
> > Websitesi: http://talat.uyarer.com
> > Twitter: http://twitter.com/talatuyarer
> > Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
> >
>
>
> --
> Thanks and Regards,
> Tizy
>
