Re: Don't Aggregrate Me

2005-08-29 Thread Mark Pilgrim

On 8/26/05, Graham [EMAIL PROTECTED] wrote:
  (And before you say "but my aggregator is nothing but a podcast
  client, and the feeds are nothing but links to enclosures, so it's
  obvious that the publisher wanted me to download them" -- WRONG!  The
  publisher might want that, or they might not ...
 
 So you're saying browsers should check robots.txt before downloading
 images?

It's sad that such an inane dodge would even garner any attention at
all, much less require a response.

http://www.robotstxt.org/wc/faq.html


What is a WWW robot?
A robot is a program that automatically traverses the Web's hypertext
structure by retrieving a document, and recursively retrieving all
documents that are referenced.

Note that "recursive" here doesn't limit the definition to any
specific traversal algorithm; even if a robot applies some heuristic
to the selection and order of documents to visit and spaces out
requests over a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a
human, and don't automatically retrieve referenced documents (other
than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers,
or Spiders. These names are a bit misleading as they give the
impression the software itself moves between sites like a virus; this
is not the case, a robot simply visits sites by requesting documents from
them.


On a more personal note, I would like to thank you for reminding me
why there will never be an Atom Implementor's Guide. 
http://diveintomark.org/archives/2004/08/16/specs

-- 
Cheers,
-Mark



Re: Don't Aggregrate Me

2005-08-26 Thread Mark Pilgrim

On 8/25/05, Roger B. [EMAIL PROTECTED] wrote:
  Mhh. I have not looked into this. But is not every desktop aggregator
  a robot?
 
 Henry: Depends on who you ask. (See the Newsmonster debates from a
 couple years ago.)

As I am the one who kicked off the Newsmonster debates a couple years
ago, I would like to throw in my opinion here.  My opinion has not
changed, and it is this:

1. If a user gives a feed URL to a program (aggregator, aggregator
service, ping service, whatever), the program may request it and
re-request it as often as it likes.  This is not robotic behavior in
the robots.txt sense.  The program has been given instructions to
request a URL, and it does so, perhaps repeatedly.  This covers the
most common case of a desktop or web-based feed reader or aggregator
that reads feeds and nothing else.

2. If a user gives a feed URL to a program *and then the program finds
all the URLs in that feed and requests them too*, the program needs to
support robots.txt exclusions for all the URLs other than the original
URL it was given.  This is robotic behavior; it's exactly the same as
requesting an HTML page, scraping it for links, and then requesting
each of those scraped URLs.  The fact that the original URL pointed to
an HTML document or an XML document is immaterial; they are clearly
the same use case.

Programs such as wget may fall into either category, depending on
command line options.  The user can request a single resource
(category 1), or can instruct wget to recurse through links and
effectively mirror a remote site (category 2).  Section 9.1 of the
wget manual describes its behavior in the case of category 2:

http://www.delorie.com/gnu/docs/wget/wget_41.html

For instance, when you issue:

wget -r http://www.server.com/

First the index of `www.server.com' will be downloaded. If Wget finds
that it wants to download more documents from that server, it will
request `http://www.server.com/robots.txt' and, if found, use it for
further downloads. `robots.txt' is loaded only once per each server.


So wget downloads the URL it was explicitly given, but then if it's
going to download any other autodiscovered URLs, it checks robots.txt
to make sure that's OK.
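
To make the two categories concrete, here is a rough Python sketch of
the same behavior (the user-agent string, the example.org URLs, and the
hard-coded "discovered" list are all invented for illustration): the URL
the user supplied is fetched directly, while anything the program finds
on its own is run past robots.txt, cached once per server just as the
wget manual describes.

import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

USER_AGENT = "ExampleAggregator/0.1"      # hypothetical client name
_robots = {}                              # one parser per server, fetched once, like wget

def allowed(url):
    """True if the server's robots.txt permits this client to fetch url."""
    parts = urlparse(url)
    origin = parts.scheme + "://" + parts.netloc
    if origin not in _robots:
        rp = urllib.robotparser.RobotFileParser(origin + "/robots.txt")
        try:
            rp.read()                     # robots.txt is loaded only once per server
        except OSError:
            rp = None                     # robots.txt unreachable; treat as allowed
        _robots[origin] = rp
    rp = _robots[origin]
    return rp is None or rp.can_fetch(USER_AGENT, url)

# Category 1: the user gave us this URL, so we may request it directly.
feed_url = "http://example.org/feed.xml"              # illustrative URL
feed_bytes = urlopen(feed_url).read()

# Category 2: URLs we discovered on our own get a robots.txt check first.
discovered = ["http://example.org/entry1.html"]       # normally scraped from feed_bytes
pages = [urlopen(u).read() for u in discovered if allowed(u)]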

Bringing this back to feeds, aggregators can fall into either category
(1 or 2, above).  At the moment, the vast majority of aggregators fall
into category 1.  *However*, what Newsmonster did 2 years ago pushed
it into category 2 in some cases.  It had a per-feed option to
prefetch and cache the actual HTML pages linked by excerpt-only feeds.
 When it fetched the feed, Newsmonster would go out and also fetch the
page pointed to by the item's link element.  This is actually a very
useful feature; my only problem with it was that it did not respect
robots.txt *when it went outside the original feed URL and fetched
other resources*.

Nor is this limited to prefetching HTML pages.  The same problem
arises with aggregators that automatically download *any* linked
content, such as enclosures.  The end user gave their aggregator the
URL of a feed, so the aggregator may poll that feed from now until the
end of time (or 410 Gone, whichever comes first :).  But if the
aggregator reads that feed and subsequently decides to request
resources other than the original feed URL (like .mp3 files), the
aggregator should support robots.txt for those other URLs.
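
As a hypothetical illustration of that last point, a podcast client
might look something like the sketch below: it polls the feed it was
given without any robots.txt check, but checks robots.txt before
downloading each enclosure it discovers inside that feed.  The feed URL
and user-agent string are invented, and the feed is assumed to be RSS
2.0 with enclosure elements; an Atom feed would carry link
rel="enclosure" instead.

import urllib.robotparser
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.request import urlopen

UA = "ExamplePodcastClient/0.1"                    # hypothetical user-agent
feed_url = "http://example.org/podcast.rss"        # illustrative feed URL

# Polling the feed itself needs no robots.txt check: the user gave us this URL.
feed = ET.fromstring(urlopen(feed_url).read())

# RSS 2.0 enclosure elements; an Atom feed would use link rel="enclosure".
enclosure_urls = [e.get("url") for e in feed.iter("enclosure") if e.get("url")]

for url in enclosure_urls:
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser(
        parts.scheme + "://" + parts.netloc + "/robots.txt")
    rp.read()                                      # the enclosure may live on another server
    if rp.can_fetch(UA, url):
        audio = urlopen(url).read()                # fetch the .mp3; otherwise leave it alone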

(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG!  The
publisher might want that, or they might not.  They might publish a
few selected files on a high-bandwidth server where anything goes, and
other files on a low-bandwidth server where they would prefer that
users explicitly click the link to download the file if they really
want it.  Or they might want some types of clients (like personal
desktop aggregators) to download those files and other types of
clients (like centralized aggregation services) not to download them. 
Or someone might set up a malicious feed that intentionally pointed to
large files on someone else's server... a kind of platypus DoS attack.
 Or any number of other scenarios.  So how do you, as a client, know
what to do?  robots.txt.)

-- 
Cheers,
-Mark



Re: Review of Atom 0.8 Spec against W3C QA Specification Guidelines

2005-05-25 Thread Mark Pilgrim

On 5/24/05, Karl Dubost [EMAIL PROTECTED] wrote:
 Validation is something very precise. It can be validated against a
 DTD, or against a Schema or another grammar language, etc. At least
 the Feed validator could become a Feed checker which develops a
 heuristic to check if the requirements of the specification are
 verified. :))) up to the validator authors :)

This from the organization that added a "fussy parsing" option
*enabled by default* in their (X)HTML Validator for almost 8 months.

-- 
Cheers,
-Mark



Re: PaceTextShouldBeProvided and accessibility - was Re: Consensus call on last raft of issues

2005-05-19 Thread Mark Pilgrim

On 5/19/05, Isofarro [EMAIL PROTECTED] wrote:
 I'd urge that the wording here should also include accessibility
 concerns, especially to encourage accessible alternatives to be
 adopted when the content is known to be inaccessible - e.g. images,
 sound files, movies, flash.
 
 HTML for instance has a number of accessible alternatives to
 inaccessible constructs - images have a src attribute, flash and
 embedded movies allow the child of the object element to contain
 accessible alternatives to the content.

Presumably you mean images have an alt attribute, but otherwise +1.

Note that HTML 4 and beyond *require* an alt attribute for images, but
do not similarly require non-script alternatives to script elements.
Nor do they explicitly require accessible alternatives to embedded
media such as Flash or video; the mechanism is present and encouraged,
but ultimately optional.

Since the last time this came up on the list, I have relaxed my
position, and I am now fine with defining a feed format that
encourages, but ultimately does not require, accessible content.  Note
that inaccessible content is A Bad Thing(tm) and in some contexts
content producers will be legally or contractually obligated to
provide it, but I will not go so far as to say that the format itself
should force you to provide it.

The format spec is the proper place to STRONGLY RECOMMEND that you
provide accessible alternatives to inaccessible content (and we
already have sufficient mechanisms to provide such alternatives), but
I will no longer go so far as to call it a MUST.

-- 
Cheers,
-Mark



Re: Autodiscovery paces

2005-05-10 Thread Mark Pilgrim

On 5/9/05, Nikolas Coukouma [EMAIL PROTECTED] wrote:
 http://www.intertwingly.net/wiki/pie/PaceAnchorSupport

"Autodiscovery elements MAY appear in either the head or the body
of the document."

I believe this is incorrect.  IIRC, link elements may only appear in
the head, and a elements may only appear in the body.

Other than that, +1 on PaceAnchorSupport.

 http://www.intertwingly.net/wiki/pie/PaceDifferentRelValue

+0.  Part of my newfound personal definition of a life well-lived is
to never again argue about semantics, markup, or the correct way to
use them.  This Pace will break every aggregator on the planet, but
then again, so will Atom 1.0 feeds, so... +0.

-- 
Cheers,
-Mark



Re: the atom:copyright element

2005-05-08 Thread Mark Pilgrim

On 5/8/05, Bob Wyman [EMAIL PROTECTED] wrote:
 First, let me say that I am a *very* strong supporter of
 intellectual property rights... I have always made my income by selling my
 intellectual property and I consider the anti-IPR proponents and Free
 Software evangelists to be no better than thieves or communists... 

I knew from the moment I met you that you were destined to become a
corollary to Godwin's Law...

 Also, I
 am listed as sole inventor on four patents dealing with DRM and I have a
 number of pending applications in the works today...

I don't suppose any of those would have anything to do with
syndication, would they?

Sing it with me now:

We all live in a Wyman submarine,
Wyman submarine,
Wyman submarine (patent)...

-- 
Cheers,
-Mark



Re: Autodiscovery discussion editorship

2005-05-06 Thread Mark Pilgrim

On 5/5/05, Tim Bray [EMAIL PROTECTED] wrote:
 The discussion in recent days has been lively but unstructured.  If I
 were forced to make a consensus call right now, I'm pretty sure I
 wouldn't be able to pick out any one spec change that I could say
 clearly has consensus.

The one suggestion I did see, which should be acted on immediately, is
to update the references section to point to the newest versions of
the XML and URI specs (and associated link changes throughout the
text).

-- 
Cheers,
-Mark



Re: Atom feed refresh rates

2005-05-05 Thread Mark Pilgrim

On 5/5/05, Andy Henderson [EMAIL PROTECTED] wrote:
 convincing the WG, I would simply point out that a mechanism widely
 available to, and understood by, feed providers and aggregators cannot do
 harm and has the potential to do a great deal of good.

Not to be flippant, but we have one that's widely available.  It's
called the Expires header.  I spoke with Roy Fielding at Apachecon
2003 and asked him this exact question: "If I set an Expires header on
a feed of now + 3 hours, does that mean that I don't want the client
to fetch the feed again for at least 3 hours?"  And he said yes,
that's exactly what it means.
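
A minimal client-side sketch of that interpretation, in Python (the
feed URL is illustrative and the cache is just an in-memory dict):
remember the Expires value from the last fetch and don't poll again
until it has passed.  On the publisher side this is nothing more than
sending an Expires header dated roughly three hours in the future.

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen

_next_poll = {}                                # per-feed: earliest time we may check again

def fetch_if_allowed(url):
    """Fetch the feed unless a previous Expires header said it is too soon."""
    now = datetime.now(timezone.utc)
    if url in _next_poll and now < _next_poll[url]:
        return None                            # publisher asked us not to check yet
    response = urlopen(url)
    expires = response.headers.get("Expires")
    if expires:
        try:
            when = parsedate_to_datetime(expires)
            if when.tzinfo is not None:        # ignore malformed, zone-less dates
                _next_poll[url] = when
        except (TypeError, ValueError):
            pass                               # unparseable Expires header: ignore it
    return response.read()

feed = fetch_if_allowed("http://example.org/feed.atom")   # illustrative URL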

I sympathize with your dilemma that you have no control over your HTTP
headers, but... wait, no I don't sympathize.  At all.

-- 
Cheers,
-Mark



Re: Atom feed refresh rates

2005-05-05 Thread Mark Pilgrim

On 5/5/05, Walter Underwood [EMAIL PROTECTED] wrote:
 You need the information outside of HTTP. To quote from the RSS spec
 for ttl:
 
   This makes it possible for RSS sources to be managed by a file-sharing
   network such as Gnutella.

Ignoring, for the moment, that this is a horrible idea and no one
supports it, Gnutella has its own caching and time-to-live mechanisms
that the RSS spec is ignoring.

-- 
Cheers,
-Mark



Re: PaceCaching

2005-05-05 Thread Mark Pilgrim

On 5/5/05, Graham [EMAIL PROTECTED] wrote:
 seriously expect it to be interpreted as a promise that the feed
 won't change for the next x minutes?

No, but I do seriously expect it to be interpreted that the feed
publisher does not wish clients to check it for the next x minutes.

-- 
Cheers,
-Mark



Re: Atom feed refresh rates

2005-05-05 Thread Mark Pilgrim

On 5/5/05, John Panzer [EMAIL PROTECTED] wrote:
 I assume an HTTP Expires header for Atom content will work and play well
 with caches such as the Google Accelerator
 (http://webaccelerator.google.com/).  I'd also guess that a syntax-level
 tag won't.  Is this important?

Yes, and yes.  This is exactly the sort of software that we're talking
about when we say that HTTP's native caching mechanism is widely
supported.  All the proxies in the world (which is what Google's Web
Accelerator is, except it runs on your own machine and listens on port
9100) are able to reduce network traffic and therefore make the end
user's experience faster because they understand and respect the HTTP
caching mechanism.  (Google Web Accelerator does other things too,
like proxying requests through Google's servers.  And what are those
servers running?  Another caching HTTP proxy.)  Many ISPs do this at
the ISP level, both to reduce their own upstream bandwidth costs and
to make their end users happier.  Many corporations do this as well (I
would bet good money that IBM does it).  At one time, I even had Squid
installed on my home network to do this. http://www.squid-cache.org/

HTTP caching works.
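
For what it's worth, here is a sketch of the publisher side using
nothing but the Python standard library (the path, placeholder feed
body, and three-hour lifetime are all invented to match the earlier
example): any cache sitting between this server and the reader can
honor these two headers without knowing anything about feeds.

import http.server
import time
from email.utils import formatdate

FEED_PATH = "/feed.atom"                                    # illustrative path
FEED_BODY = b"<feed xmlns='http://www.w3.org/2005/Atom'/>"  # placeholder feed document

class FeedHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != FEED_PATH:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/atom+xml")
        # Expires three hours from now: clients and intermediary proxies
        # (Squid, an ISP cache, Google Web Accelerator) may serve the feed
        # from cache until then instead of hitting this server again.
        self.send_header("Expires", formatdate(time.time() + 3 * 3600, usegmt=True))
        self.send_header("Cache-Control", "max-age=10800")
        self.end_headers()
        self.wfile.write(FEED_BODY)

# http.server.HTTPServer(("", 8080), FeedHandler).serve_forever()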

 The HTML solution for people who could not implement Expires: seems to
 be META tags with in theory equivalent information.  Though in practice
 the whole thing is a mess, this seems like a conceptually simple
 workaround.  Is there something obviously wrong with it?

Other than being a God-awful mess?  No, there's nothing wrong with it. ;)

-- 
Cheers,
-Mark