To whom it may concern: I solved this by overriding the protocol-http plugin so that it sends an additional header such as "Cookie: mycookie=1".
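As a rough illustration of the idea only (the actual change to the plugin is not shown in this message), the sketch below builds a raw GET request and appends one extra Cookie header before the blank line that ends the header block. The class name CookieRequestSketch and the method buildRequest are hypothetical and are not part of Nutch; the host, path, and cookie value are taken from this thread.

```java
// Illustrative sketch only: this is NOT the Nutch protocol-http source.
// CookieRequestSketch and buildRequest are hypothetical names.
public class CookieRequestSketch {

    /** Builds a minimal HTTP/1.0 GET request with an optional Cookie header. */
    static String buildRequest(String host, String path, String cookie) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        // The extra header: added only when a cookie value is supplied.
        if (cookie != null && !cookie.isEmpty()) {
            req.append("Cookie: ").append(cookie).append("\r\n");
        }
        // An empty line terminates the header block.
        req.append("\r\n");
        return req.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("ntvmsnbc.com", "/id/25237248", "mycookie=1"));
    }
}
```

A fixed cookie like this only helps when the site accepts a static value. A site that issues a fresh cookie per session would need the cookie captured from the Set-Cookie response header instead.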
Regards.

2011/8/2 Dinçer Kavraal <[email protected]>

> Hi, (USING: nutch 1.3 on Ubuntu 11.04 - 2.6.38-10-generic-pae)
>
> I would like to crawl a news site and am having trouble with cookie
> support (which I haven't understood well enough yet).
>
> Here is the status of the project. When I crawl the URLs I get this
> message in the Nutch log (and also on the console):
>
> Skipping http://ntvmsnbc.com/id/25237248 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237249 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237253 as content is not fetched successfully
>
> I looked at the segment dump and saw that the site redirects the client
> to an address that sets a cookie, then redirects back to the original
> address and shows the content. To clarify:
> 1. The client tries the address: http://ntvmsnbc.com/id/25237248
> 2. The server sends a Location header redirecting to:
>    http://www.ntvmsnbc.com/redirect.aspx?to=http%3a%2f%2fwww.ntvmsnbc.com%2fid%2f25237248%2f&from=http%3a%2f%2fntvmsnbc.com%2fid%2f25237248%2f&mskey=323dcf07efd1a45a3851a0199679e65e
> 3. That address sets a cookie and redirects the user back to the
>    originally requested address: http://ntvmsnbc.com/id/25237248
> 4. The content is shown this time.
>
> With the standard configuration, I cannot get any content at all. I
> thought I should try to accept cookies. However, *I don't know how to
> accept cookies yet*. My settings work for other sites that do not
> require cookies.
>
> I tried changing these settings in nutch-default.xml:
> http.redirect.max = 5
> http.useHttp11 = true
> plugin.includes = protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> Do you think something is wrong with my config?
>
> I would appreciate any ideas,
> Dincer

