On 07/14/2014 09:56 PM, Chris Morgan wrote:
> On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson <bander...@mozilla.com> wrote:
>> Can somebody file an issue described exactly what we should do and cc me?
>
> Nothing. Absolutely nothing.
>
> robots.txt rules do not apply to historical data; if archive.org has
> archived something, the introduction of a new Disallow rule will not
> remove the contents of a previous scan.
Although that is the robots.txt standard, archive.org does retroactively apply robots.txt Disallow rules to already-archived content; see https://archive.org/about/exclude.php

> It therefore has three months in which to make a scan of a release
> before that release is marked obsolete with the introduction of a
> Disallow directive.
>
> This is right and proper. Special casing a specific user agent is not
> the right thing to do. The contents won’t be changing after the
> release, anyway, so allowing archive.org to continue scanning it is a
> complete waste of effort.

It's my understanding that archive.org doesn't have the funding to reliably crawl everything on the Web promptly. I agree with the principle that "special casing a specific user agent is not the right thing to do," but I also support the Internet Archive's mission.

Another option is an `X-Robots-Tag: noindex` HTTP header, which is more robust at preventing indexing[1] and still allows archiving (whereas `X-Robots-Tag: noindex, noarchive` would disallow it). It's likely less robust in the sense that it depends on our website serving that header consistently over the long term, though. For HTML files, the same directive can also go in a robots <meta> tag in the head.

-Isaac

[1] Google can still list a robots.txt-disallowed page as a search result if many sites it trusts link to that page.
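For concreteness, a rough sketch of the robots.txt approach under discussion, i.e. adding a Disallow rule once a release's docs become obsolete. The path is purely illustrative; I'm assuming a hypothetical layout where each release's docs live under /doc/<version>/:

    # robots.txt -- hypothetical path for an obsolete release's docs
    User-agent: *
    Disallow: /doc/0.10/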
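And a sketch of the header alternative. I'm assuming an nginx front end here purely for illustration (the same header can be set from Apache or any other server), with the same hypothetical /doc/<version>/ path:

    # nginx: send the header only for the old release's docs;
    # "noindex" discourages search indexing but still permits archiving
    location /doc/0.10/ {
        add_header X-Robots-Tag "noindex";
    }

    <!-- per-page equivalent, placed in each HTML file's <head> -->
    <meta name="robots" content="noindex">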