Antone Roundy wrote:
> A number of people in the working group have expressed a desire to
> be able to archive the revision history of an entry--not just the
> latest version--which is a fairly natural thing to want to do with
> an archive format. That's not something any RSS version has
> attempted as far as I'm aware.

This is something that we actually do at PubSub for internal purposes. Basically, we produce very large "log files" which are, effectively, Atom files. These files contain a concatenation of all the updated entries that we've discovered, in the sequence we discovered them. Each entry has a <ps:source-feed> element (our temporary equivalent of HeadInEntry or Feeder) so that we can map an entry back to the feed it came from and keep a permanent record of the feed metadata at the time we found the entry.

The atom:ids of entries in these files are the ones with which they were published, or atom:ids that we created to cover for entries that were published without them. A single file can contain multiple instances of entries that share the same atom:id. What we end up with is very large Atom feed documents that contain all discovered states of entries and, where an entry appears in multiple source feeds, all copies of all entries from all feeds.

This is an effective and simple archival format. Undoubtedly, we could do all sorts of things to make a more robust format, but the benefit gained for the additional cost is not obviously compelling. As long as multiple instances/versions of an entry are permitted to exist in a single Atom document while sharing the same atom:id, the current Atom document format provides a usable "archive format."
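To make the idea concrete, here is a minimal sketch (not PubSub's actual code) of such a "log file": an Atom-style document in which two <entry> elements share one atom:id, representing two discovered states of the same logical entry. The ps: namespace URI and the content of <ps:source-feed> are assumptions for illustration only.

```python
# Sketch of an Atom "log file" archive: multiple entries sharing one
# atom:id record the revision history of a single logical entry.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
PS = "http://example.org/pubsub"  # hypothetical namespace for ps:source-feed

archive = f"""<feed xmlns="{ATOM}" xmlns:ps="{PS}">
  <entry>
    <id>tag:example.com,2005:entry-1</id>
    <updated>2005-06-01T10:00:00Z</updated>
    <ps:source-feed>http://example.com/feed.xml</ps:source-feed>
  </entry>
  <entry>
    <id>tag:example.com,2005:entry-1</id>
    <updated>2005-06-02T12:00:00Z</updated>
    <ps:source-feed>http://example.com/feed.xml</ps:source-feed>
  </entry>
</feed>"""

def revisions_by_id(xml_text):
    """Group all archived entry states by atom:id, in discovery order."""
    root = ET.fromstring(xml_text)
    history = {}
    for entry in root.findall(f"{{{ATOM}}}entry"):
        eid = entry.findtext(f"{{{ATOM}}}id")
        history.setdefault(eid, []).append(entry.findtext(f"{{{ATOM}}}updated"))
    return history

history = revisions_by_id(archive)
```

Reading the archive back, both states of entry-1 are recovered under its single atom:id, which is exactly the property that makes the plain concatenated-entries file work as an archive format.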
The only real issue we have with this system is that although the Atom specification requires that atom:ids be globally unique, there is no defined mechanism or convention to ensure such uniqueness. Thus, we are forced to assume that atom:ids are, at best, unique within the context of a single source feed. The *real* unique identifier for an entry thus becomes <source-feed>+<atom:id>. This means that we can't do cross-feed duplicate detection based solely on atom:id. To do cross-feed duplicate detection, we have to rely on textual analysis, a variety of heuristics, and other bits of guesswork.

However, this is not a problem specific to "archiving." It is a problem with the definition of atom:id, and it will impact any Atom processor that attempts to meld content from multiple feeds into a single view. The problem is especially unfortunate when you realize that Entry Documents aren't always associated with Feeds... Thus, one must fabricate a "feed" context for the atom:ids found in Entry Documents if you want to mix feed-independent Entry Documents with Entries that come from feeds.

(Note: Most existing "news aggregators" don't concern themselves with these issues since they typically present feed-oriented rather than entry-oriented views of the data, and they have yet to deal with Entry Documents that exist outside feeds. Because of their feed orientation, existing news aggregators violate the principle of "It's about the Entries, Stupid!" Hopefully, this will change.)

bob wyman
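The identity problem above can be sketched in a few lines (hypothetical field names, not PubSub's actual schema): if two feeds each publish a distinct entry under the same atom:id, keying on atom:id alone conflates them, while the composite <source-feed>+<atom:id> key keeps them apart.

```python
# Two distinct entries from different feeds that (incorrectly) reuse the
# same atom:id -- the situation the composite key is meant to survive.
entries = [
    {"source_feed": "http://a.example/feed", "atom_id": "urn:x:1", "title": "A's post"},
    {"source_feed": "http://b.example/feed", "atom_id": "urn:x:1", "title": "B's post"},
]

by_id = {}            # naive key: atom:id alone
by_feed_and_id = {}   # real key: (source-feed, atom:id)
for e in entries:
    by_id.setdefault(e["atom_id"], []).append(e)
    by_feed_and_id.setdefault((e["source_feed"], e["atom_id"]), []).append(e)

# by_id has one bucket holding two unrelated entries; by_feed_and_id
# correctly has two buckets, one per (feed, id) pair.
```

Note that this composite key only detects duplicates within a feed; genuine cross-feed duplicates (the same entry syndicated in two feeds) still require the heuristics and textual analysis described above.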