Miles wrote:
> On Jul 16, 1:00 am, John Nagle <[EMAIL PROTECTED]> wrote:
>
>> I'm reading the PhishTank XML file of active phishing sites,
>> at "http://data.phishtank.com/data/online-valid/". This changes
>> frequently, and it's big (about 10MB right now) and on a busy server.
>> So once in a while I get a bogus copy of the file because the file
>> was rewritten while being sent by the server.
>>
>> Any good way to deal with this, short of reading it twice
>> and comparing?
>>
>> John Nagle
>
> Sounds like that's the host's problem--they should be using atomic
> writes, which is usually done by renaming the new file on top of the
> old one. How "bogus" are the bad files? If it's just incomplete,
> then since it's XML, it'll be missing the "</output>" and you should
> get a parse error if you're using a suitably strict parser. If it's
> mixed old data and new data, but still manages to be well-formed XML,
> then yes, you'll probably have to read it twice.
>
> -Miles
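The truncation check Miles describes could look roughly like the sketch below. It assumes Python's standard xml.etree.ElementTree parser is strict enough for the purpose; the function name is_well_formed is made up for illustration. It only catches incomplete or malformed copies, not a copy that mixes old and new data while staying well-formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data):
    """Return True if data (bytes or str) parses as well-formed XML.

    A copy cut off mid-transfer is missing its closing tags, so the
    parser raises ParseError and this returns False.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True
```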
Yes, they're updating it non-atomically.

I'm now reading it twice and comparing, which works. Actually, it's
read up to 5 times, until the same contents appear twice in a row.
Two tries usually work, but if the server is updating, it may require
more.

Ugly, and it at least doubles the load on the server, but it's
necessary to get a consistent copy of the data.

				John Nagle
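A minimal sketch of the retry scheme described above, not the actual code in use. The names fetch_url and fetch_consistent are hypothetical, and it assumes the whole file fits in memory (it's about 10MB), so the raw bytes can be compared directly.

```python
import urllib.request


def fetch_url(url, timeout=60):
    """Download url and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_consistent(fetch, max_tries=5):
    """Call fetch() until two consecutive results are identical.

    fetch is any zero-argument callable returning bytes. Gives up
    with RuntimeError after max_tries downloads without a match.
    """
    previous = None
    for _ in range(max_tries):
        current = fetch()
        if previous is not None and current == previous:
            return current  # same contents twice in a row
        previous = current
    raise RuntimeError("no two consecutive reads matched in %d tries" % max_tries)
```

Usage would be something like
fetch_consistent(lambda: fetch_url("http://data.phishtank.com/data/online-valid/")).
Taking the fetch step as a callable keeps the comparison loop separate from the network code, so it can be exercised without hitting the server.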