Antone Roundy wrote:
> A number of people in the working group have expressed a desire to
> be able to archive the revision history of an entry--not just the
> latest version--which is a fairly natural thing to want to do with
> an archive format.  That's not something any RSS version has
> attempted as far as I'm aware.
        This is something that we actually do at PubSub for internal
purposes. Basically, we produce very large "log files" which are,
effectively, Atom files. These files contain a concatenation of all the
updated entries that we've discovered -- in the sequence we discovered them.
Each of the entries has a <ps:source-feed> element (our temporary equivalent
of HeadInEntry or Feeder) so that we can map an entry back to the feed that
it came from and have a permanent record of the feed metadata at the time we
found the entry. The atom:id's of entries in these files are the ones with
which they were published or atom:ids that we created to cover for entries
that were published without atom:ids. A single file can contain multiple
instances of entries that share the same atom:id. 
        What we have is very large atom feed documents that contain all
discovered states of entries and, in the case where an entry appears in
multiple source feeds, all copies of all entries from all feeds. This is an
effective and simple archival format. Undoubtedly, we could do all sorts of
things to make the format more robust -- but the degree of benefit received
for the additional cost is not obviously compelling.
        As long as multiple instances/versions of an entry are permitted to
exist in a single atom document while sharing the same atom:id, the current
Atom document format provides a usable "archive format." 
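
        For illustration, an archive document along these lines might look
like the sketch below. The ps:source-feed element is the one described
above; the namespace URIs, ids, and all other values are invented for the
example:

```xml
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:ps="http://pubsub.com/ns/">
  <title>Discovery log</title>
  <id>urn:example:archive-log</id>
  <updated>2005-01-15T12:00:00Z</updated>

  <!-- First discovered state of an entry -->
  <entry>
    <id>urn:example:entry-1</id>
    <title>Original title</title>
    <updated>2005-01-15T10:00:00Z</updated>
    <ps:source-feed>http://example.org/feed.atom</ps:source-feed>
  </entry>

  <!-- A later state of the same entry: the same atom:id appears again,
       in the sequence in which the states were discovered -->
  <entry>
    <id>urn:example:entry-1</id>
    <title>Revised title</title>
    <updated>2005-01-15T11:30:00Z</updated>
    <ps:source-feed>http://example.org/feed.atom</ps:source-feed>
  </entry>
</feed>
```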

        The only real issue we have with this system is the fact that
although the Atom specification requires that atom:ids be globally unique
there is no defined mechanism or convention to ensure such uniqueness. Thus,
we are forced to assume that atom:ids are at best unique within the context
of a single source-feed. The *real* unique identifier for an entry thus
becomes <source-feed>+<atom:id>. This means that we can't do cross-feed
duplicate detection based solely on atom:id. To do cross-feed duplicate
detection we have to rely on textual analysis, a variety of heuristics, and
other bits of guesswork. However, this is not a problem specific to
"archiving." It is a problem with the definition of atom:id and will impact
any atom processor that attempts to meld content from multiple feeds into a
single view. This problem is very unfortunate when you realize that Entry
Documents aren't always associated with Feeds... Thus, one must fabricate a
"feed" context for the atom:ids found in Entry documents if you want to mix
feed-independent Entry Documents with Entries that come from feeds. (Note:
Most existing "news aggregators" don't concern themselves with these issues
since they typically present feed-oriented rather than entry-oriented views
of data and existing news aggregators have yet to deal with the issue of
Entry Documents that exist outside feeds. Because of their feed orientation,
existing news aggregators violate the principle of "It's about the Entries,
Stupid!" Hopefully, this will change.)
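
        The compound-key approach described above can be sketched roughly as
follows. This is an illustrative sketch, not PubSub's actual code; all names
are invented:

```python
# Sketch: treating <source-feed> + <atom:id> as the real unique identifier,
# on the assumption that atom:ids are unique only within one source feed.

def entry_key(source_feed: str, atom_id: str) -> tuple:
    """The effective unique identifier for an entry."""
    return (source_feed, atom_id)

class ArchiveIndex:
    def __init__(self):
        # key -> list of discovered entry states, in discovery order
        self.seen = {}

    def record(self, source_feed: str, atom_id: str, entry_state: str):
        key = entry_key(source_feed, atom_id)
        # Multiple states of the same entry are all kept, not replaced.
        self.seen.setdefault(key, []).append(entry_state)

    def seen_before(self, source_feed: str, atom_id: str) -> bool:
        # Duplicate detection is reliable only within a single feed;
        # the same atom:id under a different source feed is NOT assumed
        # to be the same entry.
        return entry_key(source_feed, atom_id) in self.seen
```

Note that an Entry Document found outside any feed would need a fabricated
source_feed value before it could be indexed this way.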

                bob wyman
