This HAS NOT yet been submitted. I'm offering it up for discussion first.
http://www.snellspace.com/public/draft-snell-atompub-feed-nofollow-00.txt
defines x:follow=yes|no, x:index=yes|no, and x:archive=yes|no
attributes.
- James
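For concreteness, a minimal Python sketch of a consumer honoring these
attributes. The namespace URI below is a placeholder, not the one the
draft defines, and treating a missing attribute as "yes" is an
assumption:

import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
X = "{http://example.org/nofollow-placeholder}"  # placeholder URI, not the draft's

feed = ET.parse("feed.xml").getroot()
for link in feed.iter(ATOM + "link"):
    # Missing attribute treated as "yes" (an assumption; the draft
    # governs the actual default).
    if link.get(X + "follow", "yes") == "no":
        continue  # publisher asked clients not to auto-fetch this link
    print("may fetch:", link.get("href"))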
On 8/26/05, Graham [EMAIL PROTECTED] wrote:
(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG! The
publisher might want that, or they might not
On Monday, August 29, 2005, at 10:12 AM, Mark Pilgrim wrote:
On 8/26/05, Graham [EMAIL PROTECTED] wrote:
(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" --
On Monday, August 29, 2005, at 10:39 AM, Antone Roundy wrote:
<ext:auto-download target="enclosures" default="false" />
More robust would be:
<ext:auto-download target="link[@rel='enclosure']" default="false" />
...enabling extension elements to be named in @target without
* Antone Roundy [EMAIL PROTECTED] [2005-08-29 19:00]:
More robust would be:
<ext:auto-download target="link[@rel='enclosure']" default="false" />
...enabling extension elements to be named in @target without
requiring a list of @target values to be maintained anywhere.
Is it wise to
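For a sense of what evaluating an XPath-valued @target involves, a
rough sketch using lxml. The ext: namespace URI is made up here, and
binding the unprefixed names in @target to the Atom namespace is one
of the details such a proposal would have to pin down:

from lxml import etree

ATOM = "http://www.w3.org/2005/Atom"
EXT = "http://example.org/ext-placeholder"  # made-up URI for the ext: prefix

doc = etree.parse("feed.xml")
rule = doc.find(".//{%s}auto-download" % EXT)
if rule is not None and rule.get("default") == "false":
    # For target="link[@rel='enclosure']": the names are Atom elements,
    # so they need an explicit prefix when handed to lxml's xpath().
    no_auto = doc.xpath("//a:entry/a:link[@rel='enclosure']",
                        namespaces={"a": ATOM})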
--On Monday, August 29, 2005 10:39:33 AM -0600 Antone Roundy [EMAIL
PROTECTED] wrote:
As has been suggested, to inline images, we need to add frame documents,
stylesheets, Java applets, external JavaScript code, objects such as Flash
files, etc., etc., etc. The question is, with respect to
* Mark Pilgrim [EMAIL PROTECTED] [2005-08-29 18:20]:
On 8/26/05, Graham [EMAIL PROTECTED] wrote:
So you're saying browsers should check robots.txt before
downloading images?
It's sad that such an inane dodge would even garner any
attention at all, much less require a response.
I’m with
On 05-08-26 at 18:59, Bob Wyman wrote:
Karl, Please, accept my apologies for this. I could have sworn we
had the policy prominently displayed on the site. I know we used to
have it
there. This must have been lost when we did a site redesign last
November!
I'm really surprised that it
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="yes" />
<content src="http://www.example.com/enclosure.mp3" x:follow="no" />
<content src="http://www.example.com/enclosure.mp3" x:follow="yes" />
???
-
On 30/8/05 11:19 AM, James M Snell [EMAIL PROTECTED] wrote:
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="yes" />
<content src="http://www.example.com/enclosure.mp3" x:follow="no" />
<content
Eric Scheid wrote:
On 30/8/05 11:19 AM, James M Snell [EMAIL PROTECTED] wrote:
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="yes" />
<content src="http://www.example.com/enclosure.mp3"
On 30/8/05 12:05 PM, James M Snell [EMAIL PROTECTED] wrote:
That's kinda where I was going with x:follow=no|yes. An
x:archive=no|yes would also make some sense but could also be handled
with HTTP caching (e.g. set the referenced content to expire
immediately). x:index=no|yes doesn't seem
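The HTTP-caching alternative James mentions needs no Atom extension at
all; a sketch of the standard response headers a server would send
alongside the referenced content:

# Make any cached copy expire immediately, so an archiver has to
# revalidate before reusing it.
headers = {
    "Cache-Control": "no-store, max-age=0",
    "Expires": "0",
}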
--On August 30, 2005 11:39:04 AM +1000 Eric Scheid [EMAIL PROTECTED] wrote:
Someone wrote up "A Robots Processing Instruction for XML Documents"
http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported they are. I'd
prefer a
--On August 29, 2005 7:05:09 PM -0700 James M Snell [EMAIL PROTECTED] wrote:
x:index=no|yes doesn't seem to make a lot of sense in this case.
It makes just as much sense as it does for HTML files. Maybe it is a
whole group of Atom test cases. Maybe it is a feed of reboot times
for the server.
On 8/29/05, Walter Underwood [EMAIL PROTECTED] wrote:
That was me. I think it makes perfect sense as a PI. But I think reuse
via namespaces is oversold. For example, we didn't even try to use
Dublin Core tags in Atom.
Speak for yourself :)
http://bitworking.org/news/Not_Invented_Here
Walter Underwood wrote:
--On August 30, 2005 11:39:04 AM +1000 Eric Scheid [EMAIL PROTECTED] wrote:
Someone wrote up "A Robots Processing Instruction for XML Documents"
http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported
Roger Benningfield wrote:
However, if I put something like:
User-agent: PubSub
Disallow: /
...in my robots.txt and you ignore it, then you very much
belong on the Bad List.
I don't think so. The reason is that I believe that robots.txt has
nothing to do with any service I provide or
On 26/8/05 3:55 PM, Bob Wyman [EMAIL PROTECTED] wrote:
Remember, PubSub never does
anything that a desktop client doesn't do.
Periodic re-fetching is a robotic behaviour, common to both desktop
aggregators and server based aggregators. Robots.txt was established to
minimise harm caused by
On Friday, August 26, 2005, at 04:39 AM, Eric Scheid wrote:
On 26/8/05 3:55 PM, Bob Wyman [EMAIL PROTECTED] wrote:
Remember, PubSub never does
anything that a desktop client doesn't do.
Periodic re-fetching is a robotic behaviour, common to both desktop
aggregators and server based
* Bob Wyman [EMAIL PROTECTED] [2005-08-26 01:00]:
My impression has always been that robots.txt was intended to
stop robots that crawl a site (i.e. they read one page, extract
the URLs from it and then read those pages). I don't believe
robots.txt is intended to stop processes that simply
There are no wildcards in /robots.txt, only path prefixes and user-agent
names. There is one special user-agent, *, which means all.
I can't think of any good reason to always ignore the disallows for *.
I guess it is OK to implement the parts of a spec that you want.
Just don't answer yes when
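For reference, a minimal robots.txt check with Python's standard
library, showing the fallback to the special "*" record; the
user-agent string is a placeholder:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()
# Uses a "User-agent: ExampleAggregator" record if one exists,
# otherwise falls back to the "User-agent: *" record.
print(rp.can_fetch("ExampleAggregator", "http://www.example.com/feed.atom"))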
Antone Roundy wrote:
I'm with Bob on this. If a person publishes a feed without limiting
access to it, they either don't know what they're doing, or they're
EXPECTING it to be polled on a regular basis. As long as PubSub
doesn't poll too fast, the publisher is getting exactly what they
Ok, so this discussion has definitely been interesting... let's see if
we can turn it into something actionable.
1. Desktop aggregators and services like PubSub really do not fall into
the same category as robots/crawlers and therefore should not
necessarily be paying attention to
On 8/25/05, Roger B. [EMAIL PROTECTED] wrote:
Mhh. I have not looked into this. But is not every desktop aggregator
a robot?
Henry: Depends on who you ask. (See the Newsmonster debates from a
couple years ago.)
As I am the one who kicked off the Newsmonster debates a couple years
ago, I
--On August 26, 2005 9:51:10 AM -0700 James M Snell [EMAIL PROTECTED] wrote:
Add a new <link rel="readers"/> whose href points to a robots.txt-like file
that either allows or disallows the aggregator for specific URIs and
establishes polling rate preferences
User-agent: {aggregator-ua}
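A sketch of how a client might consume such a file. Everything here is
hypothetical: the rel value comes from the proposal above, and reusing
a robots.txt parser only works because the proposed format is
robots.txt-like; the polling-rate preferences would need parsing of
their own.

import urllib.robotparser
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
feed_url = "http://www.example.com/feed.atom"

feed = ET.parse("feed.xml").getroot()
readers = [l.get("href") for l in feed.findall(ATOM + "link")
           if l.get("rel") == "readers"]  # the proposed rel value
if readers:
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(readers[0])
    rp.read()
    allowed = rp.can_fetch("ExampleAggregator", feed_url)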
Graham wrote:
(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG! The
publisher might want that, or they might not ...
So you're saying browsers
On 05-08-25 at 18:51, Bob Wyman wrote:
At PubSub we *never* crawl to discover feed URLs. The only feeds
we know about are:
1. Feeds that have announced their presence with a ping
2. Feeds that have been announced to us via a FeedMesh message.
3. Feeds that have been manually
On 26 Aug 2005, at 7:46 pm, Mark Pilgrim wrote:
2. If a user gives a feed URL to a program *and then the program finds
all the URLs in that feed and requests them too*, the program needs to
support robots.txt exclusions for all the URLs other than the original
URL it was given.
...
(And
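Mark's two-part rule reduces to a small predicate; a sketch, with
illustrative function and user-agent names:

import urllib.robotparser
from urllib.parse import urljoin

def may_fetch(url, user_gave_it, ua="ExampleAggregator"):
    # The URL the user handed us directly is fair game; anything we
    # discovered inside a feed must pass a robots.txt check first.
    if user_gave_it:
        return True
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(ua, url)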
Mark Pilgrim wrote (among other things):
(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so
it's obvious that the publisher wanted me to download them" -- WRONG!
I agree with just about everything that Mark wrote
Remember, PubSub never does
anything that a desktop client doesn't do.
Bob: What about FeedMesh? If I ping blo.gs, they pass that ping along
to you, and PubSub fetches my feed, then PubSub is doing something a
desktop client doesn't do. It's following a link found in one place
and
* Bob Wyman [EMAIL PROTECTED] [2005-08-26 22:50]:
It strikes me that not all URIs are created equally and not
everything that looks like crawling is really crawling.
@xlink:type?
Regards,
--
Aristotle Pagaltzis // http://plasmasturm.org/
Roger Benningfield wrote:
We've got a mechanism that allows any user with his own domain
and a text editor to tell us whether or not he wants us messing with
his stuff. I think it's foolish to ignore that.
The problem is that we have *many* such mechanisms. Robots.txt is
only one.
Karl Dubost wrote:
- How does one who has previously submitted a feed URL remove it from
the index? (Change of opinion)
If you are the publisher of a feed and you don't want us to monitor
your content, complain to us and we'll filter you out. Folk do this every
once in a while. Send us an
On 05-08-26 at 17:53, Bob Wyman wrote:
Karl Dubost wrote:
- How does one who has previously submitted a feed URL remove it from
the index? (Change of opinion)
If you are the publisher of a feed and you don't want us to
monitor
your content, complain to us and we'll filter you out. Folk
Karl Dubost points out that it is hard to figure out what email address to
send messages to if you want to de-list from PubSub...:
Karl, Please, accept my apologies for this. I could have sworn we
had the policy prominently displayed on the site. I know we used to have it
there. This must
On 27/8/05 6:40 AM, Bob Wyman [EMAIL PROTECTED] wrote:
I think crawling URIs found in <link/> tags,
<img/> tags and enclosures isn't crawling... Or... Is there something I'm
missing here?
crawling <img> tags isn't a huge problem because it doesn't lead to a
recursive situation. Same with
I'm adding robots@mccmedia.com to this discussion. That is the classic
list for robots.txt discussion.
Robots list: this is a discussion about the interactions of /robots.txt
and clients or robots that fetch RSS feeds. Atom is a new format in
the RSS family.
--On August 26, 2005 8:39:59 PM +1000
On Wed, Aug 24, 2005 at 11:25:12PM -0700, James M Snell wrote:
For example, suppose I build an application that depends on an Atom feed
containing binary content (e.g. a software update feed). I don't really
want aggregators pulling and indexing that feed and attempting to
display it
* James M Snell [EMAIL PROTECTED] [2005-08-25 08:35]:
I don't really want aggregators pulling and indexing that feed
and attempting to display it within a traditional feed reader.
Why, though?
There’s no reason aggregators couldn’t at some point become more
capable of doing something useful
On 8/25/05, James M Snell [EMAIL PROTECTED] wrote:
Up to this point, the vast majority of use cases for Atom feeds is the
traditional syndicated content case. A bunch of content updates that
are designed to be distributed and aggregated within Feed readers or
online aggregators, etc. But
A. Pagaltzis wrote:
* James M Snell [EMAIL PROTECTED] [2005-08-25 08:35]:
I don't really want aggregators pulling and indexing that feed
and attempting to display it within a traditional feed reader.
Why, though?
There’s no reason aggregators couldn’t at some point become more
James M Snell wrote:
Does the following work?
<feed>
...
<x:aggregate>no</x:aggregate>
</feed>
I think it is important to recognize that there are at least two
kinds of aggregator. The most common is the desktop end-point aggregator
that consumes feeds from various sources and then
On 25 Aug 2005, at 15:45, Joe Gregorio wrote:
On 8/25/05, James M Snell [EMAIL PROTECTED] wrote:
Up to this point, the vast majority of use cases for Atom feeds is
the
traditional syndicated content case. A bunch of content updates that
are designed to be distributed and aggregated
* Henry Story [EMAIL PROTECTED] [2005-08-25 16:55]:
Do we put base64 encoded stuff in html? No: that is why there
are things like
<img src="..." />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///yH5BAEKAAEALAABAAEAAAICTAEAOw==" />
:-)
Regards,
--
Aristotle Pagaltzis //
A. Pagaltzis wrote:
* James M Snell [EMAIL PROTECTED] [2005-08-25 16:20]:
I dunno, I'm just kinda scratching my head on this wondering if
there is any actual need here. My instincts are telling me no,
but...
Seems to me that your instincts are right. :-)
I’m not sure why, in the
On Thursday, August 25, 2005, at 12:25 AM, James M Snell wrote:
Up to this point, the vast majority of use cases for Atom feeds is the
traditional syndicated content case. A bunch of content updates that
are designed to be distributed and aggregated within Feed readers or
online
On Thursday, August 25, 2005, at 08:16 AM, James M Snell wrote:
Good points but it's more than just the handling of human-readable
content. That's one use case but there are others. Consider, for
example, if I was producing a feed that contained javascript and CSS
styles that would
On 25 Aug 2005, at 17:06, A. Pagaltzis wrote:
* Henry Story [EMAIL PROTECTED] [2005-08-25 16:55]:
Do we put base64 encoded stuff in html? No: that is why there
are things like
<img src="..." />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///yH5BAEKAAEALAABAAEAAAICTAEAOw==" />
At 10:22 AM -0400 8/25/05, Bob Wyman wrote:
James M Snell wrote:
Does the following work?
<feed>
...
<x:aggregate>no</x:aggregate>
</feed>
I think it is important to recognize that there are at least two
kinds of aggregator. The most common is the desktop end-point aggregator
that
* Henry Story [EMAIL PROTECTED] [2005-08-25 18:40]:
And it does not give me anything very interesting when I look at
it in either Safari or Firefox.
Of course not – it’s the infamous transparent single-pixel GIF.
:-)
Regards,
--
Aristotle Pagaltzis // http://plasmasturm.org/
It works in both Safari and Firefox; it's just that that particular
data: URI is a 1x1 blank gif ;)
On 25/08/2005, at 9:37 AM, Henry Story wrote:
On 25 Aug 2005, at 17:06, A. Pagaltzis wrote:
* Henry Story [EMAIL PROTECTED] [2005-08-25 16:55]:
Do we put base64 encoded stuff in
I can see reasonable uses for this, like marking a feed of local disk
errors
as not of general interest.
This is not published data - http://www.spacekdet.com/pipe/
Security by obscurity^H^H^H^H^H^H^H^H^H saying please -
http://www-cs-faculty.stanford.edu/~knuth/ (see the second link from
On 05-08-25 at 06:44, James Aylett wrote:
I like the use case, but I don't see why you would want to disallow
aggregators to pull the feed.
You might want it for many reasons. One of my reasons, which worries
me more and more, is that some aggregators and bots do not respect the
Creative
On 05-08-25 at 12:51, Walter Underwood wrote:
/robots.txt is one approach. Wouldn't hurt to have a recommendation
for whether Atom clients honor that.
Not many honor it.
A while ago I had this list from http://varchars.com/blog/node/view/59
The Good
BlogPulse
NITLE Blog Spider
Karl Dubost wrote:
One of my reasons, which worries me more and more, is that some
aggregators and bots do not respect the Creative Commons license (or
at least the way I understand it).
Your understanding of Creative Commons is apparently a bit
non-optimal -- even though many people seem
Bob,
Thanks for the explanation. Much appreciated.
On 05-08-25 at 15:59, Bob Wyman wrote:
Karl Dubost wrote:
One of my reasons, which worries me more and more, is that some
aggregators and bots do not respect the Creative Commons license (or
at least the way I understand it).
It is
Bob Wyman wrote:
Karl Dubost wrote:
One of my reasons, which worries me more and more, is that some
aggregators and bots do not respect the Creative Commons license (or
at least the way I understand it).
Your understanding of Creative Commons is apparently a bit
non-optimal --
Mhh. I have not looked into this. But is not every desktop aggregator
a robot?
Henry
On 25 Aug 2005, at 22:18, James M Snell wrote:
At the very least, aggregators should respect robots.txt. Doing so
would allow publishers to restrict who is allowed to pull their feed.
- James
--On August 25, 2005 3:43:03 PM -0400 Karl Dubost [EMAIL PROTECTED] wrote:
On 05-08-25 at 12:51, Walter Underwood wrote:
/robots.txt is one approach. Wouldn't hurt to have a recommendation
for whether Atom clients honor that.
Not many honor it.
I'm not surprised. There seems to be a new
I would call desktop clients "clients", not "robots". The distinction is
how they add feeds to the polling list. Clients add them because of
human decisions. Robots discover them mechanically and add them.
So, clients should act like browsers, and ignore robots.txt.
Robots.txt is not very widely
Mhh. I have not looked into this. But is not every desktop aggregator
a robot?
Henry: Depends on who you ask. (See the Newsmonster debates from a
couple years ago.)
Right now, I obey all wildcard and/or my-user-agent-specific
directives I find in robots.txt. If I were writing a desktop app, I
Walter Underwood wrote:
--On August 25, 2005 3:43:03 PM -0400 Karl Dubost [EMAIL PROTECTED] wrote:
On 05-08-25 at 12:51, Walter Underwood wrote:
/robots.txt is one approach. Wouldn't hurt to have a recommendation
for whether Atom clients honor that.
Not many honor it.
Yes, I see how one is meant to look at it. But I can imagine desktop
aggregators
becoming more independent when searching for information... Perhaps
at that point
they should start reading robots.txt...
Henry
On 25 Aug 2005, at 23:12, Walter Underwood wrote:
I would call desktop
On Thursday, August 25, 2005, at 03:12 PM, Walter Underwood wrote:
I would call desktop clients "clients", not "robots". The distinction is
how they add feeds to the polling list. Clients add them because of
human decisions. Robots discover them mechanically and add them.
So, clients should act
Antone Roundy wrote:
How could this all be related to aggregators that accept feed URL
submissions?
My impression has always been that robots.txt was intended to stop
robots that crawl a site (i.e. they read one page, extract the URLs from it
and then read those pages). I don't believe
Bob: It's one thing to ignore a wildcard rule in robots.txt. I don't
think it's a good idea, but I can at least see a valid argument for it.
However, if I put something like:
User-agent: PubSub
Disallow: /
...in my robots.txt and you ignore it, then you very much belong on
the Bad List.