[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2012-10-08 Thread Max Dzyuba (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471577#comment-13471577
 ] 

Max Dzyuba commented on NUTCH-827:
--

Hi Jasper,

Thanks, removing that line fixed the exception problem.
At the moment, the log file doesn't show any errors related to the HttpClient 
plugin or the authentication process. However, my tests show that the test auth 
page I've set up can't read the cookie.

Is there an easy way to verify whether Nutch created the cookie and stored it 
as intended?


Thanks,
Max

 HTTP POST Authentication
 

 Key: NUTCH-827
 URL: https://issues.apache.org/jira/browse/NUTCH-827
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.1, nutchgora
Reporter: Jasper van Veghel
Priority: Minor
  Labels: authentication
 Fix For: 1.6

 Attachments: nutch-http-cookies.patch


 I've created a patch against the trunk which adds very rudimentary support 
 for POST-based authentication. It reads from nutch-site.xml a URL to POST to 
 and the respective parameters (username, password, etc.). It then checks 
 upon every request whether any cookies have been initialized, and if none 
 have, it fetches them from the given URL.
 This isn't perfect but Works For Me (TM), as I generally only need to retrieve 
 results from a single domain and so have no cookie overlap (i.e. if the 
 domain cookies expire, all cookies disappear from the HttpClient and I can 
 simply re-fetch them). A natural improvement would be the ability to specify 
 one particular cookie to check the expiration date against. If anyone 
 besides me is interested in this, I'd be glad to put some more effort into 
 making it more universally applicable.
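
 The check-then-login flow described above can be illustrated with a minimal 
 sketch. This is not the code from nutch-http-cookies.patch (which targets the 
 Java HttpClient inside Nutch); it is a Python illustration of the same idea, 
 and the class name CookieAuthFetcher, its parameters, and the login_count 
 counter are hypothetical names chosen for this example.

```python
import http.cookiejar
import urllib.parse
import urllib.request


class CookieAuthFetcher:
    """Fetch pages, POSTing to a login URL first whenever no cookies are held.

    Hypothetical illustration only; names and structure are assumptions,
    not taken from the patch.
    """

    def __init__(self, login_url, login_params):
        self.login_url = login_url          # URL that accepts the login POST
        self.login_params = login_params    # e.g. {"username": ..., "password": ...}
        self.jar = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar))
        self.login_count = 0                # how many times we had to log in

    def _ensure_cookies(self):
        # Drop expired cookies first, so an expired session leaves the jar empty.
        self.jar.clear_expired_cookies()
        if len(self.jar) == 0:
            data = urllib.parse.urlencode(self.login_params).encode("utf-8")
            # Passing data makes this a POST; Set-Cookie headers land in the jar.
            with self.opener.open(self.login_url, data) as response:
                response.read()
            self.login_count += 1

    def fetch(self, url):
        # Checked on every request, as in the description above.
        self._ensure_cookies()
        with self.opener.open(url) as response:
            return response.read()
```

 As in the description, the sketch only asks "are there any cookies at all?", 
 so it shares the same limitation: with cookies from several domains in one 
 jar, an expired login cookie would go unnoticed while other cookies remain.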

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2012-10-01 Thread Max Dzyuba (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466686#comment-13466686
 ] 

Max Dzyuba commented on NUTCH-827:
--

Hi Jasper,

Thanks a lot for the explanation! I applied the patch and Nutch compiled just 
fine, but I can't confirm that it is working. Can you point me to a website 
where this patch successfully passed form authentication? I need to verify that 
it works for me, but can't at the moment.


Thanks in advance,
Max



[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2012-10-01 Thread Max Dzyuba (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466706#comment-13466706
 ] 

Max Dzyuba commented on NUTCH-827:
--

Hi Jasper,

Thank you for the tip! I'll have to go that way then.


Best regards,
Max



[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2012-10-01 Thread Max Dzyuba (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466738#comment-13466738
 ] 

Max Dzyuba commented on NUTCH-827:
--

Hi Jasper,

I've set up a script that does just that (receives POSTed data, sets a cookie, 
returns some data if the cookie is set), but now I have this error in my log:

2012-10-01 13:11:24,557 ERROR httpclient.Http - Cookie-based authentication 
failed; cookies will not be present for this request but an attempt to retrieve 
them will be made for the next one.
2012-10-01 13:11:24,682 ERROR httpclient.Http - Unable to retrieve login page; 
code = 200

The second line with response code 200 is what I don't understand. I'd 
appreciate any tips you could give in this regard.
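
For reference, a test page of the kind described above (receives POSTed data, 
sets a cookie, returns some data only if the cookie is set) could look roughly 
like the following. This is a minimal sketch and not the actual script used 
here; the credentials, the cookie name and value, and the start_server helper 
are all made up for illustration.

```python
import http.server
import threading
import urllib.parse
from http.cookies import SimpleCookie

# Hypothetical credentials and cookie, chosen only for this sketch.
EXPECTED = {"username": "nutch", "password": "secret"}
COOKIE_NAME = "session"
COOKIE_VALUE = "ok"


class AuthHandler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the form-encoded body and compare it with the expected login.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        submitted = {key: values[0] for key, values in form.items()}
        if all(submitted.get(k) == v for k, v in EXPECTED.items()):
            self.send_response(200)
            self.send_header("Set-Cookie", f"{COOKIE_NAME}={COOKIE_VALUE}; Path=/")
            body = b"logged in"
        else:
            self.send_response(403)
            body = b"bad credentials"
        self._finish(body)

    def do_GET(self):
        # Return the protected data only when the login cookie is sent back.
        cookie = SimpleCookie(self.headers.get("Cookie", ""))
        if COOKIE_NAME in cookie and cookie[COOKIE_NAME].value == COOKIE_VALUE:
            self.send_response(200)
            body = b"protected content"
        else:
            self.send_response(401)
            body = b"no cookie"
        self._finish(body)

    def _finish(self, body):
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def start_server(port=0):
    """Start the test page on 127.0.0.1 in a background thread (port 0 = any free port)."""
    server = http.server.HTTPServer(("127.0.0.1", port), AuthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_server() returns the server object; the chosen port is available 
as server.server_address[1], and server.shutdown() stops it. A GET that 
returns 401 here means the client did not send the cookie back.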


Thanks,
Max



[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2012-09-24 Thread Max Dzyuba (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461759#comment-13461759
 ] 

Max Dzyuba commented on NUTCH-827:
--

Jasper, if you're still around, can you please answer Ian's question? I have a 
similar problem figuring out how exactly I can provide the username and 
password.

Thank you!



[jira] [Created] (NUTCH-1458) Support for raw HTML field added to Solr

2012-08-16 Thread Max Dzyuba (JIRA)
Max Dzyuba created NUTCH-1458:
-

 Summary: Support for raw HTML field added to Solr
 Key: NUTCH-1458
 URL: https://issues.apache.org/jira/browse/NUTCH-1458
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Affects Versions: 1.5.1
Reporter: Max Dzyuba


At the moment, the “content” field holds only the parsed text from the page. It 
would be nice to have a separate field in the Solr document that holds the raw 
HTML of the crawled page.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira