[
https://issues.apache.org/jira/browse/CONNECTORS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089128#comment-13089128
]
Karl Wright commented on CONNECTORS-243:
Looking at this further, there are a number of headers that would be bad to
include in metadata. For example, you would not want to include anything
authentication-related or session-related. Any transient information should
also be excluded, since it changes on every fetch and would leave ManifoldCF
unable to avoid refetching the document on each job run. Here's the list of
exclusions I've come up with so far:
Age
WWW-Authenticate
Proxy-Authenticate
Date
Set-Cookie
Via
Any I've missed?
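For illustration, here is a minimal sketch of how an exclusion list like the
one above might be applied before headers are passed on as metadata. The class
and method names (HeaderMetadataFilter, filter) are hypothetical and are not
part of ManifoldCF; the sketch assumes the response headers are available as a
map from header name to values, and it compares names case-insensitively
because HTTP header names are case-insensitive.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not ManifoldCF code.
final class HeaderMetadataFilter {
    // The exclusions listed above, lower-cased for case-insensitive matching.
    private static final Set<String> EXCLUDED = Set.of(
        "age", "www-authenticate", "proxy-authenticate",
        "date", "set-cookie", "via");

    // Returns a copy of the headers with every excluded header removed.
    static Map<String, List<String>> filter(Map<String, List<String>> headers) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey();
            // Some header maps (e.g. HttpURLConnection.getHeaderFields())
            // store the status line under a null key; skip it.
            if (name == null) {
                continue;
            }
            if (!EXCLUDED.contains(name.toLowerCase(Locale.ROOT))) {
                kept.put(name, entry.getValue());
            }
        }
        return kept;
    }
}
```

With this approach, a response carrying Last-Modified, Date and Set-Cookie
would yield only Last-Modified as metadata. Extending the exclusion list is
then a matter of adding another lower-cased name to the set.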
Web crawler must get the Last-Modified HTTP header and pass it as metadata
to output
--
Key: CONNECTORS-243
URL: https://issues.apache.org/jira/browse/CONNECTORS-243
Project: ManifoldCF
Issue Type: New Feature
Components: Web connector
Affects Versions: ManifoldCF 0.2
Reporter: Jan Høydahl
Assignee: Karl Wright
Labels: last-modified
Last-Modified is important in web search, as it may be used for (de)boosting
based on date.
In fact, ManifoldCF should have the ability to parse any (or all) HTTP
headers from the source document and pass them on.