[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-827:
----------------------------------

    Attachment: NUTCH-827-trunk-v3.patch

Hi [~lewismc],

the attached patch fixes two points:
* the CSS selector used to pick "form" elements by their "name" attribute didn't work properly
* (should be documented) the configuration allows setting <additionalPostHeaders>, e.g.
{noformat}
<additionalPostHeaders>
  <field name="User-Agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0" />
</additionalPostHeaders>
{noformat}

One point is still open (but we could defer it, as it may take some work):
* form authentication is global and ignores {{<authScope>}}, so you have to restrict your crawl to the form authentication pages only. Ideally, form authentication should also be bound to a scope (one host, one URL prefix, etc.), the same as HTTP authentication.

> HTTP POST Authentication
> ------------------------
>
>                 Key: NUTCH-827
>                 URL: https://issues.apache.org/jira/browse/NUTCH-827
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.1, nutchgora
>            Reporter: Jasper van Veghel
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>              Labels: authentication
>             Fix For: 1.10
>
>         Attachments: NUTCH-827-trunk-v3.patch, NUTCH-827-trunk.patch,
> NUTCH-827-trunkv2.patch, http-client-form-authtication.patch,
> nutch-http-cookies.patch
>
>
> I've created a patch against the trunk which adds very rudimentary support
> for POST-based authentication. It takes a link from nutch-site.xml with a
> site to POST to and its respective parameters (username, password, etc.).
> It then checks upon every request whether any cookies have been
> initialized, and if none have, it fetches them from the given link.
> This isn't perfect but Works For Me (TM), as I generally only need to retrieve
> results from a single domain and so have no cookie overlap (i.e. if the
> domain cookies expire, all cookies disappear from the HttpClient and I can
> simply re-fetch them).
> A natural improvement would be the ability to specify one particular
> cookie to check the expiration date against. If anyone besides me is
> interested in this, I'd be glad to put some more effort into making
> this more universally applicable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
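
For illustration only, the cookie check described in the issue, the suggested refinement of checking one particular cookie, and the idea of binding form authentication to a scope could be sketched roughly as follows. This is not code from any of the attached patches. The class name FormAuthCheck, the method names, the cookie name JSESSIONID, and the host and path-prefix parameters are all assumptions made for the example, and it uses java.net.HttpCookie from the JDK rather than the HttpClient cookie classes the patch works with.

```java
import java.net.HttpCookie;
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch or of the attached patches.
public class FormAuthCheck {

    // Behaviour described in the issue: log in only when no cookies exist at all.
    public static boolean needsLogin(List<HttpCookie> cookies) {
        return cookies.isEmpty();
    }

    // Suggested refinement: log in when one named cookie is missing or expired.
    public static boolean needsLogin(List<HttpCookie> cookies, String cookieName) {
        for (HttpCookie c : cookies) {
            if (c.getName().equals(cookieName) && !c.hasExpired()) {
                return false; // a live session cookie is present
            }
        }
        return true;
    }

    // Open point from the comment: apply form authentication only inside a scope.
    // Here the scope is assumed to be one host plus one URL path prefix.
    public static boolean inScope(URI url, String host, String pathPrefix) {
        String h = url.getHost();
        String p = url.getPath();
        return h != null && h.equalsIgnoreCase(host)
            && p != null && p.startsWith(pathPrefix);
    }

    public static void main(String[] args) {
        HttpCookie session = new HttpCookie("JSESSIONID", "abc");
        session.setMaxAge(0); // a max-age of 0 marks the cookie as expired
        HttpCookie other = new HttpCookie("tracking", "xyz");
        List<HttpCookie> jar = Arrays.asList(session, other);

        // The global check is satisfied by any cookie, even an unrelated one.
        System.out.println(needsLogin(jar));               // prints "false"
        // The per-cookie check notices that the session cookie has expired.
        System.out.println(needsLogin(jar, "JSESSIONID")); // prints "true"
        // Only URLs under the configured host and prefix would trigger a login.
        System.out.println(inScope(URI.create("http://example.com/private/a.html"),
                                   "example.com", "/private/")); // prints "true"
    }
}
```

The example jar shows why the "any cookies at all" check only works when there is no cookie overlap: an unrelated cookie hides the fact that the session cookie has expired.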