Caching: Access a local file, but ensure it is up-to-date from a remote URL

2014-10-13 Thread Ben Finney
Howdy all,

I'm hoping that the problem I currently have is one already solved,
either in the Python standard library, or with some well-tested obvious
code.

A program I'm working on needs to access a set of files locally; they're
just normal files.

But those files are local cached copies of documents available at remote
URLs — each file has a canonical URL for that file's content.

I'd like an API for ‘get_file_from_cache’ that looks something like::

    file_urls = {
        "foo.txt": "http://example.org/spam/",
        "bar.data": "https://example.net/beans/flonk.xml",
        }

    for (filename, url) in file_urls.items():
        infile = get_file_from_cache(filename, canonical=url)
        do_stuff_with(infile.read())

* If the local file's modification timestamp is not significantly
  earlier than the Last-Modified timestamp for the document at the
  corresponding URL, ‘get_file_from_cache’ just returns the file object
  without changing the file.

* The local file might be out of date (its modification timestamp may be
  significantly older than the Last-Modified timestamp from the
  corresponding URL). In that case, ‘get_file_from_cache’ should first
  read the document's contents into the file, then return the file
  object.

* The local file may not yet exist. In that case, ‘get_file_from_cache’
  should first read the document content from the corresponding URL,
  create the local file, and then return the file object.

* The remote URL may not be available for some reason. In that case,
  ‘get_file_from_cache’ should simply return the file object, or if that
  can't be done, raise an error.
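
The four cases above can be sketched with only the standard library. This
is a minimal illustration, not an established recipe: the helper names
‘remote_last_modified’ and ‘download’, the 60-second reading of
"significantly earlier", and the choice to return a binary file object are
all assumptions made for the example.

```python
import os
import shutil
import urllib.request
from email.utils import parsedate_to_datetime

TOLERANCE_SECONDS = 60  # assumed meaning of "significantly earlier"


def remote_last_modified(url, timeout=10):
    """Return the document's Last-Modified time as a POSIX timestamp, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Last-Modified")
    if header is None:
        return None
    try:
        return parsedate_to_datetime(header).timestamp()
    except (TypeError, ValueError):
        return None  # unparseable date: treat as "unknown"


def download(url, filename, timeout=10):
    """Replace the local file with the document's current content."""
    temp_name = filename + ".part"  # write aside, then rename into place
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(temp_name, "wb") as outfile:
        shutil.copyfileobj(response, outfile)
    os.replace(temp_name, filename)


def get_file_from_cache(filename, canonical, tolerance=TOLERANCE_SECONDS):
    """Return a binary file object for `filename`, refreshed if it is stale."""
    try:
        local_mtime = os.path.getmtime(filename)
    except FileNotFoundError:
        local_mtime = None  # no cached copy yet
    try:
        remote_mtime = remote_last_modified(canonical)
        stale = (remote_mtime is not None
                 and local_mtime is not None
                 and local_mtime + tolerance < remote_mtime)
        if local_mtime is None or stale:
            download(canonical, filename)
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses:
        # the remote is unavailable, so fall back to whatever is cached.
        # Caveat: this also hides local write failures during download.
        pass
    return open(filename, "rb")  # FileNotFoundError if nothing is cached
```

Writing to a temporary name and renaming means a failed download does not
leave a truncated cache file behind. The sketch trusts file-system
timestamps and ignores every HTTP caching rule other than Last-Modified.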

So this is something similar to an HTTP object cache. Except where those
are usually URL-focussed with the local files a hidden implementation
detail, I want an API that focusses on the local files, with the remote
requests a hidden implementation detail.

Does anything like this exist in the Python library, or as simple code
using it? With or without the specifics of HTTP and URLs, is there some
generic caching recipe already implemented with the standard library?

This local file cache (ignoring the specifics of URLs and network access)
seems like exactly the kind of thing that is easy to get wrong in
countless ways, and so should have a single obvious implementation
available.

Am I in luck? What do you advise?

-- 
 \   “If consumers even know there's a DRM, what it is, and how it |
  `\ works, we've already failed.” —Peter Lee, Disney corporation, |
_o__) 2005 |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Caching: Access a local file, but ensure it is up-to-date from a remote URL

2014-10-13 Thread Chris Angelico
On Mon, Oct 13, 2014 at 5:36 PM, Ben Finney ben+pyt...@benfinney.id.au wrote:
> So this is something similar to an HTTP object cache. Except where those
> are usually URL-focussed with the local files a hidden implementation
> detail, I want an API that focusses on the local files, with the remote
> requests a hidden implementation detail.

Potential issue: You may need some metadata storage as well as the
actual files. Or can you just ignore the Vary header etc etc etc,
and pretend that this URL represents a single blob of data no matter
what? I'm also dubious about relying on FS timestamps for critical
data, as it's very easy to bump the timestamp to current, which would
make your program think that the contents are fresh; but if that's
truly the only metadata needed, that might be safe to accept.
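
One way to avoid leaning on FS timestamps, sketched here under
assumptions: keep the server's own validators (ETag and Last-Modified) in
a sidecar file next to each cached file, and send them back as a
conditional request. The ".meta.json" naming and the function names are
made up for the example, and it still pretends the URL is a single blob
(no Vary handling).

```python
import json
import os
import urllib.error
import urllib.request


def metadata_path(filename):
    return filename + ".meta.json"  # hypothetical sidecar naming


def load_metadata(filename):
    """Return validators saved for `filename`, or {} if none are usable."""
    if not os.path.exists(filename):
        return {}  # validators without the file they describe are useless
    try:
        with open(metadata_path(filename), encoding="utf-8") as infile:
            return json.load(infile)
    except (FileNotFoundError, ValueError):
        return {}


def conditional_headers(metadata):
    """Build request headers that ask 'only send it if it changed'."""
    headers = {}
    if metadata.get("etag"):
        headers["If-None-Match"] = metadata["etag"]
    if metadata.get("last_modified"):
        headers["If-Modified-Since"] = metadata["last_modified"]
    return headers


def refresh(filename, url, timeout=10):
    """Revalidate `filename` against `url`; return True if it was rewritten."""
    request = urllib.request.Request(
        url, headers=conditional_headers(load_metadata(filename)))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            metadata = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: the cached copy is still good
            return False
        raise
    with open(filename, "wb") as outfile:
        outfile.write(body)
    with open(metadata_path(filename), "w", encoding="utf-8") as outfile:
        json.dump(metadata, outfile)
    return True
```

With this, bumping the local file's timestamp changes nothing, because
freshness is decided by the server's answer to the stored validators.
Note that urllib reports a 304 response as an HTTPError, hence the
except clause.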

One way you could possibly do this is to pick up a URL-based cache
(even something stand-alone like Squid), and then create symlinks from
your canonically-named local files to the implementation-detail
storage space for the cache. Then you probe the URL and return its
contents. That guarantees that you're playing nicely with the rules of
HTTP (particularly if you have chained proxies, proxy authentication,
etc, etc - if you're deploying this to arbitrary locations, that might
be an issue), but at the expense of complexity.

ChrisA