[
https://issues.apache.org/jira/browse/CONNECTORS-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822395#comment-13822395
]
Karl Wright commented on CONNECTORS-805:
----------------------------------------
It turns out that hierarchical application of these fields is problematic for
the RSS connector. The reason is that there's nothing in the spec that says
(for instance) that a "dc:creator" tag at the channel level must appear before
any "entry" tags. That means that we cannot be guaranteed to get the right
value unless we build in-memory structures from the whole feed.
Unfortunately, that's not possible, because many of our clients construct RSS
feeds that are incredibly large and expect the RSS connector to handle them
without blowing out of memory.
So I'm going to presume (for the moment) that author tags at the higher levels
are not significant. Making them significant only when they precede entry
tags is one possible resolution, but I will open a new ticket should that
come to be required.
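
To make the ordering problem concrete, here is a minimal Python sketch of the
streaming approach. It is not the connector's actual code (ManifoldCF is written
in Java); the function name, the flag, and the sample feed are all hypothetical.
It reads items one at a time, and a channel-level "dc:creator" is applied only
to items that follow it, which is the "only when they precede entry tags"
resolution mentioned above.

```python
import io
import xml.etree.ElementTree as ET

DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"

# Hypothetical feed: the channel-level dc:creator appears after the first item.
SAMPLE_FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example</title>
    <item><title>A</title></item>
    <dc:creator>Channel Author</dc:creator>
    <item><title>B</title><author>b@example.com</author></item>
    <item><title>C</title></item>
  </channel>
</rss>"""


def stream_item_authors(source, use_preceding_channel_creator=False):
    """Yield (title, author) for each <item> as soon as the item has been read.

    Assumption for illustration: a channel-level dc:creator counts only for
    items that come after it, because earlier items have already been emitted.
    """
    channel_creator = None
    open_elements = []  # elements whose end tag has not been seen yet
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(elem)
            continue
        open_elements.pop()
        parent = open_elements[-1] if open_elements else None
        if elem.tag == DC_CREATOR and parent is not None and parent.tag == "channel":
            channel_creator = elem.text
        elif elem.tag == "item":
            # Item-level tags win: RSS <author> first, then dc:creator.
            author = elem.findtext("author") or elem.findtext(DC_CREATOR)
            if not author:
                author = channel_creator if use_preceding_channel_creator else None
            yield elem.findtext("title"), author
            elem.clear()  # drop the item's children so memory stays bounded
```

With the flag off, items A and C get no author. With it on, C picks up
"Channel Author" but A still does not, because A was already emitted before the
channel-level tag was seen.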
> Crawling author metadata from feeds
> -----------------------------------
>
> Key: CONNECTORS-805
> URL: https://issues.apache.org/jira/browse/CONNECTORS-805
> Project: ManifoldCF
> Issue Type: Improvement
> Components: RSS connector
> Affects Versions: ManifoldCF 1.4
> Reporter: Benjamin Brandmeier
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.5
>
>
> Functionality for retrieving the author of an RSS entry.
> The feed specifications treat this differently:
> RSS 2.0 (Source ->
> http://www.rssboard.org/rss-specification#ltauthorgtSubelementOfLtitemgt):
> <author> sub-element of <item>
> It's the email address of the author of the item. For newspapers and
> magazines syndicating via RSS, the author is the person who wrote the article
> that the <item> describes.
> Atom (Source -> http://www.ietf.org/rfc/rfc4287.txt):
> The "atom:author" element is a Person construct that indicates the author of
> the entry or feed.
> atomAuthor = element atom:author { atomPersonConstruct }
> If an atom:entry element does not contain atom:author elements, then
> the atom:author elements of the contained atom:source element are
> considered to apply. In an Atom Feed Document, the atom:author
> elements of the containing atom:feed element are considered to apply
> to the entry if there are no atom:author elements in the locations
> described above.
> The atomPersonConstruct looks like this:
>    atomPersonConstruct =
>       atomCommonAttributes,
>       (element atom:name { text }
>        & element atom:uri { atomUri }?
>        & element atom:email { atomEmailAddress }?
>        & extensionElement*)
> where atomCommonAttributes is defined like this:
>    atomCommonAttributes =
>       attribute xml:base { atomUri }?,
>       attribute xml:lang { atomLanguageTag }?,
>       undefinedAttribute*
> Furthermore, there is an atom:contributor tag:
> The "atom:contributor" element is a Person construct that indicates a person
> or other entity who contributed to the entry or feed.
> atomContributor = element atom:contributor { atomPersonConstruct }
> For further information, please check the specification.
> Dublin Core (Source ->
> http://dublincore.org/documents/dcmi-type-vocabulary/index.shtml#elements-creator)
> <dc:creator>
> The primary individual responsible for the content of the resource.
> The element can be at the <item>, <image> or <channel> level.
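
For reference, the Atom fallback rule quoted above (entry, then atom:source,
then the containing feed) can be sketched as follows. This is only an
illustration of the spec wording on a small, fully parsed document; the helper
names and the sample feed are hypothetical, and an in-memory lookup like this
does not address the large-feed concern raised in the comment.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed; the feed-level author deliberately comes last.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>one</title><author><name>Entry Author</name></author></entry>
  <entry><title>two</title>
    <source><author><name>Source Author</name></author></source>
  </entry>
  <entry><title>three</title></entry>
  <author><name>Feed Author</name></author>
</feed>"""


def author_names(parent):
    """Return the atom:name of every atom:author directly under parent."""
    return [a.findtext(ATOM + "name") for a in parent.findall(ATOM + "author")]


def entry_authors(feed, entry):
    """Apply the RFC 4287 fallback: entry authors, else source, else feed."""
    names = author_names(entry)
    if names:
        return names
    source = entry.find(ATOM + "source")
    if source is not None:
        names = author_names(source)
        if names:
            return names
    return author_names(feed)


feed = ET.fromstring(SAMPLE_ATOM)
resolved = [entry_authors(feed, e) for e in feed.findall(ATOM + "entry")]
```

Here the third entry resolves to the feed-level author only because the whole
document is already in memory when the lookup runs.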