[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-827:
---
    Labels: authentication memex (was: authentication)

> HTTP POST Authentication
>
> Key: NUTCH-827
> URL: https://issues.apache.org/jira/browse/NUTCH-827
> Project: Nutch
> Issue Type: New Feature
> Components: protocol
> Affects Versions: 1.1, nutchgora
> Reporter: Jasper van Veghel
> Assignee: Lewis John McGibbney
> Priority: Minor
> Labels: authentication, memex
> Fix For: 1.10
>
> Attachments: NUTCH-827-trunk-v3.patch, NUTCH-827-trunk.patch, NUTCH-827-trunkv2.patch, http-client-form-authtication.patch, nutch-http-cookies.patch
>
> I've created a patch against the trunk which adds very rudimentary support for POST-based authentication. It takes a link from nutch-site.xml with a site to POST to and its respective parameters (username, password, etc.). It then checks upon every request whether any cookies have been initialized, and if none have, it fetches them from the given link.
>
> This isn't perfect but Works For Me (TM), as I generally only need to retrieve results from a single domain and so have no cookie overlap (i.e. if the domain cookies expire, all cookies disappear from the HttpClient and I can simply re-fetch them). A natural improvement would be to be able to specify one particular cookie to check the expiration date against. If anyone besides me is interested in this, I'd be glad to put some more effort into making this more universally applicable.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
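The check-then-login behaviour described in the issue (look at the cookie store before every request, and log in only when it is empty) can be sketched roughly as follows. This is a minimal illustration built on the JDK's `java.net.CookieManager`, not code from any of the attached patches: the class `CookieLoginGate`, its method names, and the `Supplier` that stands in for the actual POST to the login URL are all hypothetical.

```java
import java.net.CookieManager;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper, not part of any NUTCH-827 patch. */
final class CookieLoginGate {
    private final CookieManager cookies;
    private final URI site;
    // Stands in for the POST to the configured login URL (assumption):
    // it returns the cookies the server would set on a successful login.
    private final Supplier<List<HttpCookie>> login;
    private int loginCount = 0;

    CookieLoginGate(CookieManager cookies, URI site, Supplier<List<HttpCookie>> login) {
        this.cookies = cookies;
        this.site = site;
        this.login = login;
    }

    /** Call before every request: log in only when the cookie store is empty. */
    void ensureLoggedIn() {
        // getCookies() returns only non-expired cookies, so an expired
        // session also leaves the store empty and triggers a fresh login.
        if (cookies.getCookieStore().getCookies().isEmpty()) {
            for (HttpCookie c : login.get()) {
                cookies.getCookieStore().add(site, c);
            }
            loginCount++;
        }
    }

    /** Number of logins performed so far (for illustration and testing). */
    int loginCount() {
        return loginCount;
    }
}
```

As in the description, this only works cleanly for a single domain: any cookie from any site counts as "logged in", which is why checking one named cookie would be the natural refinement.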
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-827:
--
    Attachment: NUTCH-827-trunk-v3.patch

Hi [~lewismc],

the attached patch fixes two points:
* the CSS selector for "form" elements by "name" attribute didn't work properly
* (should be documented) the configuration allows to set , e.g.
{noformat}
{noformat}

One point is still open (but we could defer it, as it may take some work):
* the form authentication is global and ignores {{}}. So you have to restrict your crawl to the form authentication pages only. Ideally, form authentication should also be bound to a scope (one host, one URL prefix, etc.), the same as HTTP authentication.
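Binding form authentication to a scope, as suggested in the open point above, could look roughly like the sketch below. It is only an illustration of the idea of "one host, one URL prefix": the class `FormAuthScope` and its fields are hypothetical and do not correspond to any existing Nutch configuration element or patch code.

```java
import java.net.URI;

/** Hypothetical scope rule; the names are illustrative, not Nutch configuration. */
final class FormAuthScope {
    private final String host;
    private final String pathPrefix;

    FormAuthScope(String host, String pathPrefix) {
        this.host = host;
        this.pathPrefix = pathPrefix;
    }

    /** True if form authentication should be applied to the given URL. */
    boolean covers(URI url) {
        String urlHost = url.getHost();
        if (urlHost == null) {
            return false; // opaque or relative URI: nothing to match against
        }
        String path = url.getPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // "http://host" has an empty path; treat it as the root
        }
        // Host names are case-insensitive, paths are not.
        return urlHost.equalsIgnoreCase(host) && path.startsWith(pathPrefix);
    }
}
```

With such a check in front of the login step, URLs outside the scope would be fetched without form authentication, so the crawl would no longer have to be restricted to the authenticated pages.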
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-827:
---
    Fix Version/s: (was: 2.4)
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-827:
---
    Attachment: NUTCH-827-trunkv2.patch

Updated patch for trunk which takes on [~wastl-nagel]'s comments.
* I've moved the additions to httpclient-auth.xml to httpclient-auth.xml.template
* I've also added some primitive checking for the form 'name' if we cannot locate an 'id'
{code}
Element loginform = doc.getElementById(authConfigurer.getLoginFormId());
if (loginform == null) {
  LOGGER.debug("'id' attribute for form element is null, trying 'name'.");
  loginform = doc.select("form.answer[name="+ authConfigurer.getLoginFormId() + "]").first();
  if (loginform == null) {
    LOGGER.debug("'name' attribute for form element is also null.");
    throw new IllegalArgumentException("No form exists: " + authConfigurer.getLoginFormId());
  }
}
{code}
The rest seems to be OK to me and I am able to use this patch to fetch content from secure databases.
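The lookup order used above (try the form's 'id' first, then fall back to its 'name') can be illustrated with a self-contained sketch. To stay runnable without extra dependencies it uses the JDK's DOM parser on well-formed XHTML rather than the HTML parser used in the patch, and it matches the 'name' attribute on any form element with no class constraint. The class `LoginFormFinder` and its methods are hypothetical names for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper showing the id-then-name lookup; not code from the patch. */
final class LoginFormFinder {

    /** Returns the form whose 'id' equals idOrName, else the one whose 'name' does. */
    static Element findForm(Document doc, String idOrName) {
        NodeList forms = doc.getElementsByTagName("form");
        // First pass: match on the 'id' attribute.
        for (int i = 0; i < forms.getLength(); i++) {
            Element form = (Element) forms.item(i);
            if (idOrName.equals(form.getAttribute("id"))) {
                return form;
            }
        }
        // Second pass: fall back to the 'name' attribute.
        for (int i = 0; i < forms.getLength(); i++) {
            Element form = (Element) forms.item(i);
            if (idOrName.equals(form.getAttribute("name"))) {
                return form;
            }
        }
        throw new IllegalArgumentException("No form exists: " + idOrName);
    }

    /** Parses a well-formed XHTML string (assumption: real pages need an HTML parser). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }
}
```

Running the 'id' pass to completion before the 'name' pass keeps the precedence of the original snippet: a form with a matching 'id' always wins, wherever it appears in the page.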
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-827:
---
    Fix Version/s: (was: 1.11)
                   1.10
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-827:
---
    Attachment: NUTCH-827-trunk.patch

Patch for trunk. This has been tested and verified to enable access to various large databases requiring HTTP POST authentication. I would also like to mention that setting the redirect boolean flag to true is almost always required. I would really appreciate it if folks could try this out and comment.
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-827:
---
    Fix Version/s: 2.4
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-827:
    Component/s: (was: fetcher)
                 protocol
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yuanyun.cn updated NUTCH-827:
-
    Attachment: http-client-form-authtication.patch
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-827:
--
    Fix Version/s: 1.8
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-827:
---
    Fix Version/s: 2.2
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-827:
    Fix Version/s: (was: 1.5)
                   1.6

20120304-push-1.6
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-827:
    Fix Version/s: 1.5
[jira] Updated: (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jasper van Veghel updated NUTCH-827:
    Attachment: nutch-http-cookies.patch