[jira] [Commented] (CONNECTORS-243) Web crawler must get the Last-Modified HTTP header and pass it as metadata to output

2011-08-22 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088754#comment-13088754
 ] 

Karl Wright commented on CONNECTORS-243:


I'll try to have a look at this this evening.


 Web crawler must get the Last-Modified HTTP header and pass it as metadata 
 to output
 --

 Key: CONNECTORS-243
 URL: https://issues.apache.org/jira/browse/CONNECTORS-243
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Web connector
Affects Versions: ManifoldCF 0.2
Reporter: Jan Høydahl
  Labels: last-modified

 Last-Modified is important in web search, as it may be used for (de)boosting 
 based on date.
 In fact, ManifoldCF should have the ability to parse any (or all) HTTP 
 headers from the source document and pass them on.
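
 As an illustration of the date-based use, here is a minimal Java sketch that
 turns a Last-Modified value into a timestamp an output connector could index.
 It assumes the server sends the preferred HTTP date format (RFC 1123 style);
 the class and method names are hypothetical and not part of ManifoldCF.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper for illustration only; not a ManifoldCF class.
final class LastModifiedParser {
    private LastModifiedParser() {}

    /**
     * Parses a Last-Modified header value such as
     * "Mon, 22 Aug 2011 10:15:00 GMT" into an Instant.
     * Returns an empty Optional when the header is missing or malformed,
     * so a bad header never aborts processing of the document.
     */
    static Optional<Instant> parse(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        try {
            ZonedDateTime parsed = ZonedDateTime.parse(
                headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return Optional.of(parsed.toInstant());
        } catch (DateTimeParseException e) {
            // Assumption: unparseable dates are dropped rather than passed on.
            return Optional.empty();
        }
    }
}
```

 A search engine could then boost or deboost on the resulting timestamp. The
 obsolete HTTP date formats are not handled in this sketch.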

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CONNECTORS-243) Web crawler must get the Last-Modified HTTP header and pass it as metadata to output

2011-08-22 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089128#comment-13089128
 ] 

Karl Wright commented on CONNECTORS-243:


Looking at this further, there are a number of headers that should not be 
included in metadata.  For example, you would not want to include anything 
authentication-related or session-related.  Any transient information should 
also be excluded, since it would leave ManifoldCF unable to avoid 
refetching the document on each job run.  Here's the list of exclusions I've 
come up with so far:

Age
WWW-Authenticate
Proxy-Authenticate
Date
Set-Cookie
Via

Any I've missed?
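
For illustration, the exclusion list above could be applied roughly as follows. 
This is only a sketch: the class name, the method name, and the map-of-lists 
representation of the headers are assumptions for the example, not the web 
connector's actual code.  Names are compared case-insensitively because HTTP 
header names are case-insensitive.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not a ManifoldCF class.
final class HeaderMetadataFilter {
    // The exclusions proposed above, lower-cased for case-insensitive matching.
    private static final Set<String> EXCLUDED = Set.of(
        "age", "www-authenticate", "proxy-authenticate",
        "date", "set-cookie", "via");

    private HeaderMetadataFilter() {}

    /** Returns a copy of the headers without the excluded names, keeping order. */
    static Map<String, List<String>> filter(Map<String, List<String>> headers) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            // Some HTTP clients report the status line under a null key; skip it.
            if (name != null && !EXCLUDED.contains(name.toLowerCase(Locale.ROOT))) {
                kept.put(name, List.copyOf(entry.getValue()));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put("Last-Modified", List.of("Mon, 22 Aug 2011 10:15:00 GMT"));
        headers.put("Set-Cookie", List.of("JSESSIONID=abc123"));
        headers.put("Date", List.of("Mon, 22 Aug 2011 18:00:00 GMT"));
        headers.put("Content-Type", List.of("text/html"));
        // Prints [Last-Modified, Content-Type]
        System.out.println(filter(headers).keySet());
    }
}
```

Using a deny-list means any header not named here is passed through, so new 
transient headers would have to be added to the set as they are discovered.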



 Web crawler must get the Last-Modified HTTP header and pass it as metadata 
 to output
 --

 Key: CONNECTORS-243
 URL: https://issues.apache.org/jira/browse/CONNECTORS-243
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Web connector
Affects Versions: ManifoldCF 0.2
Reporter: Jan Høydahl
Assignee: Karl Wright
  Labels: last-modified

 Last-Modified is important in web search, as it may be used for (de)boosting 
 based on date.
 In fact, ManifoldCF should have the ability to parse any (or all) HTTP 
 headers from the source document and pass them on.
