[jira] [Commented] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471577#comment-13471577 ]

Max Dzyuba commented on NUTCH-827:
----------------------------------

Hi Jasper,

Thanks, removing that line fixed the exception problem. At the moment, the log file doesn't have any errors related to the HttpClient plugin or the authentication process. However, my tests show that the cookie can't be read by the test auth page I've set up. Is there an easy way to verify whether the cookie was created by Nutch and stored as intended?

Thanks,
Max

HTTP POST Authentication
------------------------

Key: NUTCH-827
URL: https://issues.apache.org/jira/browse/NUTCH-827
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 1.1, nutchgora
Reporter: Jasper van Veghel
Priority: Minor
Labels: authentication
Fix For: 1.6
Attachments: nutch-http-cookies.patch

I've created a patch against the trunk which adds very rudimentary support for POST-based authentication. It takes a link from nutch-site.xml with a site to POST to and its respective parameters (username, password, etc.). It then checks upon every request whether any cookies have been initialized, and if none have, it fetches them from the given link.

This isn't perfect but Works For Me (TM) as I generally only need to retrieve results from a single domain and so have no cookie overlap (i.e. if the domain cookies expire, all cookies disappear from the HttpClient and I can simply re-fetch them). A natural improvement would be to be able to specify one particular cookie to check the expiration date against.

If anyone besides me is interested in this, I'd be glad to put some more effort into making it more universally applicable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
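The check described in the issue (before each request, see whether any cookies are held, and if none are, POST the configured parameters to the login link) can be sketched roughly as follows. This is not the code from nutch-http-cookies.patch. It is a minimal illustration that uses the JDK's own java.net.http client (Java 11+) instead of the HttpClient the patch works with, and the class name, method names, and form parameters are all hypothetical.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookieStore;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these names come from the Nutch patch.
class CookieAuthSketch {

    /** True when the store holds no cookies, i.e. a login POST is needed. */
    static boolean needsLogin(CookieStore store) {
        return store.getCookies().isEmpty();
    }

    /** Encodes form parameters as application/x-www-form-urlencoded. */
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /**
     * POSTs the parameters to loginUri only if no cookies are held.
     * The client must have been built with cookieHandler(cookies) so that
     * Set-Cookie response headers end up in the same store.
     * Returns true if cookies are present afterwards.
     */
    static boolean ensureAuthenticated(HttpClient client, CookieManager cookies,
                                       URI loginUri, Map<String, String> params)
            throws IOException, InterruptedException {
        if (!needsLogin(cookies.getCookieStore())) {
            return true; // cookies already initialized, nothing to do
        }
        HttpRequest post = HttpRequest.newBuilder(loginUri)
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
            .build();
        client.send(post, HttpResponse.BodyHandlers.discarding());
        return !needsLogin(cookies.getCookieStore());
    }

    public static void main(String[] args) {
        // No network access here: only shows the check and the encoded body.
        CookieManager cookies = new CookieManager();
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "user");   // hypothetical form field names
        params.put("password", "secret");
        System.out.println("login needed: " + needsLogin(cookies.getCookieStore()));
        System.out.println("form body: " + formBody(params));
    }
}
```

To use it against a real login page, the client would be created with HttpClient.newBuilder().cookieHandler(cookies).build() and ensureAuthenticated called before each fetch. Listing cookies.getCookieStore().getCookies() afterwards is one way to see which cookies were actually stored.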
[jira] [Commented] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466686#comment-13466686 ]

Max Dzyuba commented on NUTCH-827:
----------------------------------

Hi Jasper,

Thanks a lot for the explanation! I applied the patch and Nutch compiled just fine, but I can't confirm that it is working. Can you point me to a website where this patch successfully passed form authentication? I need to verify that it works for me, but can't at the moment.

Thanks in advance,
Max
[jira] [Commented] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466706#comment-13466706 ]

Max Dzyuba commented on NUTCH-827:
----------------------------------

Hi Jasper,

Thank you for the tip! I'll have to go that way then.

Best regards,
Max
[jira] [Commented] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466738#comment-13466738 ]

Max Dzyuba commented on NUTCH-827:
----------------------------------

Hi Jasper,

I've set up a script that does just that (receives POSTed data, sets a cookie, returns some data if the cookie is set), but now I have this error in my log:

2012-10-01 13:11:24,557 ERROR httpclient.Http - Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.
2012-10-01 13:11:24,682 ERROR httpclient.Http - Unable to retrieve login page; code = 200

The second line with response code 200 is what I don't understand. I'd appreciate any tips you could give in this regard.

Thanks,
Max
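For anyone who wants to reproduce this kind of test setup, a stand-in for such a script could look roughly like the sketch below. It is not Max's script, and it does not explain the "code = 200" error above. It is a hypothetical local endpoint built on the JDK's bundled com.sun.net.httpserver package, with made-up paths (/login, /protected) and a made-up cookie name. It prints the Cookie header it receives, which is one way to check whether the crawler sent the cookie back.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical test endpoint; paths and cookie name are assumptions.
class TestAuthServer {
    static final String COOKIE_NAME = "testauth";

    /** True if any received Cookie header carries a cookie named COOKIE_NAME. */
    static boolean hasAuthCookie(List<String> cookieHeaders) {
        if (cookieHeaders == null) {
            return false; // request had no Cookie header at all
        }
        for (String header : cookieHeaders) {
            for (String pair : header.split(";")) {
                if (pair.trim().startsWith(COOKIE_NAME + "=")) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Writes a plain-text response with the given status code. */
    static void respond(HttpExchange ex, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = ex.getResponseBody()) {
            out.write(bytes);
        }
    }

    /** Starts the server on the given port (0 picks a free port). */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        // POST /login: accept any form data and set the cookie.
        server.createContext("/login", ex -> {
            if ("POST".equals(ex.getRequestMethod())) {
                ex.getRequestBody().readAllBytes(); // drain the posted form data
                ex.getResponseHeaders().add("Set-Cookie", COOKIE_NAME + "=1; Path=/");
                respond(ex, 200, "logged in");
            } else {
                respond(ex, 405, "POST only");
            }
        });
        // /protected: return data only when the cookie comes back.
        server.createContext("/protected", ex -> {
            List<String> cookies = ex.getRequestHeaders().get("Cookie");
            System.out.println("Cookie header received: " + cookies);
            boolean ok = hasAuthCookie(cookies);
            respond(ex, ok ? 200 : 403, ok ? "secret data" : "no cookie");
        });
        server.start();
        return server;
    }
}
```

Calling TestAuthServer.start(8080) from a main method and pointing the crawler's login link at http://localhost:8080/login would then show, on the server's console, whether a later request to /protected carried the cookie.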
[jira] [Commented] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461759#comment-13461759 ]

Max Dzyuba commented on NUTCH-827:
----------------------------------

Jasper, if you're still around, can you please answer Ian's question? I have a similar problem figuring out how exactly I can provide the username and password. Thank you!
[jira] [Created] (NUTCH-1458) Support for raw HTML field added to Solr
Max Dzyuba created NUTCH-1458:
------------------------------

Summary: Support for raw HTML field added to Solr
Key: NUTCH-1458
URL: https://issues.apache.org/jira/browse/NUTCH-1458
Project: Nutch
Issue Type: New Feature
Components: indexer, parser
Affects Versions: 1.5.1
Reporter: Max Dzyuba

At the moment, the “content” field holds only the parsed text from the page. It would be nice to have a separate field in the Solr document that holds the raw HTML of the crawled page.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira