Re: [Python-Dev] Ext4 data loss

2009-03-16 Thread Nick Coghlan
Greg Ewing wrote:
 What might make more sense is a context manager,
 e.g.
 
   with renaming_file('blarg.txt', 'w') as f:
 ...

As you were describing the problems with rename on close, I actually
immediately thought of the oft-repeated db transaction commit/rollback
example from PEP 343 :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Ext4 data loss

2009-03-16 Thread Valentino Volonghi


On Mar 15, 2009, at 3:25 PM, Greg Ewing wrote:


 with renaming_file('blarg.txt', 'w') as f:
   ...



By chance during the weekend I actually wrote something like that:

from __future__ import with_statement

import os
import codecs
import shutil
import tempfile

from contextlib import contextmanager

TDIR = tempfile.mktemp(dir='/tmp/')

@contextmanager
def topen(filepath, mode='wb', bufsize=-1, encoding=None,
          inplace=False, tmpd=TDIR, sync=False):
    """
    C{topen} is a transactional version of the Python built-in C{open}
    function for file I/O. It provides transactionality by using a
    temporary file and moving it to its final position once its
    content has been written to disk.
    If the mode used to open the file doesn't modify the file, this
    function is equivalent to the built-in C{open} with automatic
    file-closing behavior.

    @param filepath: The path of the file that you want to open.
    @type filepath: C{str}

    @param mode: POSIX mode in which you want to open the file.
    @type mode: C{str}, see the C{open} documentation for the format.

    @param bufsize: Buffer size for file I/O.
    @type bufsize: C{int}, see the C{open} documentation for the meaning.

    @param encoding: Encoding that should be used to read the file.
    @type encoding: C{str}

    @param inplace: Indicates whether the temporary file should reside
                    in the same directory as the final file.
    @type inplace: C{bool}

    @param tmpd: The temporary directory in which file I/O is
                 performed. Files are then moved from here to
                 their final destination.
    @type tmpd: C{str}

    @param sync: Force topen to fsync the file before closing it.
    @type sync: C{bool}
    """
    if 'r' in mode or 'a' in mode:
        fp = filepath
    else:
        if inplace:
            source_dir, _ = os.path.split(filepath)
            tmpd = source_dir

        if not os.path.exists(tmpd):
            os.makedirs(tmpd)
        _f, fp = tempfile.mkstemp(dir=tmpd)
        os.close(_f)  # the descriptor isn't reused; the path is reopened below

    if encoding is not None:
        f = codecs.open(fp, mode, encoding=encoding)
    else:
        f = open(fp, mode, bufsize)

    try:
        yield f
    finally:
        if 'r' in mode:
            if '+' in mode:
                f.flush()
                if sync:
                    os.fsync(f.fileno())
            f.close()
            return

        f.flush()
        if sync:
            os.fsync(f.fileno())
        f.close()
        if 'w' in mode:
            shutil.move(fp, filepath)

if __name__ == '__main__':
    with topen('a_test') as f:
        f.write('hello')
    assert file('a_test', 'rb').read() == 'hello'
    assert os.path.exists(TDIR)
    os.rmdir(TDIR)

    with topen('a_test', mode='rb') as f:
        assert f.read() == 'hello'
    assert not os.path.exists(TDIR)
    os.remove('a_test')


--
Valentino Volonghi aka Dialtone
Now running MacOS X 10.5
Home Page: http://www.twisted.it
http://www.adroll.com





Re: [Python-Dev] Ext4 data loss

2009-03-15 Thread Greg Ewing

Nick Coghlan wrote:


It actually wouldn't be a bad place to put a "create a temporary file
and rename it to name when closing it" helper class.


I'm not sure it would be a good idea to make that
behaviour automatic on closing. If anything goes
wrong while writing the file, you *don't* want the
rename to happen, otherwise it defeats the purpose.

It would be okay to have an explicit close_and_rename()
method, although there wouldn't be much gained over
just calling os.rename() afterwards.

What might make more sense is a context manager,
e.g.

  with renaming_file('blarg.txt', 'w') as f:
...
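A minimal sketch of what such a context manager could look like (hypothetical name; assumes POSIX semantics, where rename() within one filesystem is atomic):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def renaming_file(path, mode='w'):
    # Create the temp file in the target's directory so the final
    # rename stays on one filesystem and is therefore atomic.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmppath = tempfile.mkstemp(dir=dirname)
    f = os.fdopen(fd, mode)
    try:
        yield f
        f.flush()
        os.fsync(f.fileno())      # data reaches the disk before the rename
        f.close()
        os.rename(tmppath, path)  # only now does the new content appear
    except Exception:
        f.close()
        os.remove(tmppath)        # failed write: leave the old file alone
        raise
```

If the body raises, the temporary file is discarded and the original is untouched, which is exactly the property an unconditional rename-on-close would lose.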

--
Greg


Re: [Python-Dev] Ext4 data loss

2009-03-15 Thread Mikko Ohtamaa


  Ok. In that use case, however, it is completely irrelevant whether the
  tempfile module calls fsync. After it has generated the non-conflicting
  filename, it's done.

 I agree, but my comment was that it would be nice if better fsync
 support (if added) could be done in such a way that it helped not only
 file objects, but also *file-like* objects (such as the wrappers in the
 tempfile module) without making the file-like API any fatter.


fsync() might not be the answer.

I found this blog post very entertaining to read:
http://www.advogato.org/person/mjg59/diary.html?start=195

So, on the one hand, we're trying to use things like relatime to batch data
to reduce the amount of time a disk has to be spun up. And on the other
hand, we're moving to filesystems that require us to generate *more* io in
order to guarantee that our data hits disk, which is a guarantee we often
don't want anyway! Users will be fine with losing their most recent changes
to preferences if a machine crashes. They will not be fine with losing the
entirety of their preferences. Arguing that applications need to use *fsync*()
and are otherwise broken is ignoring the important difference between these
use cases. It's no longer going to be possible to spin down a disk when any
software is running at all, since otherwise it's probably going to write
something and then have to *fsync* it out of sheer paranoia that something
bad will happen. And then probably *fsync* the directory as well, because
what if someone writes an even more pathological filesystem. And the disks
sit there spinning gently and chitter away as they write tiny files[4] and
never spin down and the polar bears all drown in the bitter tears of
application developers who are forced to drink so much to forget that they
all die of acute liver failure by the age of 35 and where are we then oh yes
we're fucked.

-M


Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Oleg Broytmann
On Thu, Mar 12, 2009 at 10:14:41PM -0600, Adam Olsen wrote:
 Yet the ext4
 developers didn't see it that way, so it was sacrificed to new
 performance improvements (delayed allocation).

   Ext4 is not the only FS with delayed allocation. New XFS has it, btrfs
will have it. Don't know about other OS/FS (ZFS? NTFS?)

Oleg.
-- 
 Oleg Broytmann   http://phd.pp.ru/   p...@phd.pp.ru
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Nick Coghlan
Martin v. Löwis wrote:
 auto-delete is one of the nice features of tempfile.  Another feature
 which is entirely appropriate to this usage, though, is creation
 of a non-conflicting filename.
 
 Ok. In that use case, however, it is completely irrelevant whether the
 tempfile module calls fsync. After it has generated the non-conflicting
 filename, it's done.

I agree, but my comment was that it would be nice if better fsync
support (if added) could be done in such a way that it helped not only
file objects, but also *file-like* objects (such as the wrappers in the
tempfile module) without making the file-like API any fatter.

If that's not possible or practical so be it, but it is still something
to keep in mind when considering options.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Zvezdan Petkovic


On Mar 12, 2009, at 3:15 PM, Martin v. Löwis wrote:

You still wouldn't use the tempfile module in that case. Instead, you
would create a regular file, with the name based on the name of the
important file.


If the file is *really* important, you actually want to use a  
temporary, randomly chosen, *unpredictable* name.


Think about the security implications of a file name that is in  
advance known to an attacker as well as the fact that the said file  
will replace an *important* system file.


See the details in any man page on mkstemp() and why it was introduced  
to replace a predictable mktemp().  Also notice that even mktemp() is  
better than what you proposed above.


Of course, the above are C functions.  I don't think that Python  
programming is immune from such security considerations either.


Zvezdan



Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Oleg Broytmann
On Fri, Mar 13, 2009 at 12:28:07PM +0300, Oleg Broytmann wrote:
 On Thu, Mar 12, 2009 at 10:14:41PM -0600, Adam Olsen wrote:
  Yet the ext4
  developers didn't see it that way, so it was sacrificed to new
  performance improvements (delayed allocation).
 
Ext4 is not the only FS with delayed allocation. New XFS has it, btrfs
 will have it. Don't know about other OS/FS (ZFS? NTFS?)

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

   Ted Ts'o said HFS+ and ZFS have the property as well. So no, it is not
a deficiency in the Linux kernel or in Ext4 FS - it is a mainstream path in
modern filesystem design.

Oleg.
-- 
 Oleg Broytmann   http://phd.pp.ru/   p...@phd.pp.ru
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Martin v. Löwis
 Think about the security implications of a file name that is in advance
 known to an attacker as well as the fact that the said file will replace
 an *important* system file.

You should always use O_EXCL in that case. Relying on random name will
be a severe security threat to the application.

Regards,
Martin


Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Zvezdan Petkovic

On Mar 13, 2009, at 2:31 PM, Martin v. Löwis wrote:

Think about the security implications of a file name that is in  
advance known to an attacker as well as the fact that the said file  
will replace an *important* system file.


You should always use O_EXCL in that case. Relying on random name will
be a severe security threat to the application.


If you read an implementation of mkstemp() function, you'll see that  
it does exactly that:


if ((*doopen = open(path, O_CREAT|O_EXCL|O_RDWR, 0600)) >= 0)
        return (1);
if (errno != EEXIST)
        return (0);

That's why I mentioned mkstemp() in the OP.

Zvezdan



Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Andrew McNabb
On Fri, Mar 13, 2009 at 07:31:21PM +0100, Martin v. Löwis wrote:
  Think about the security implications of a file name that is in advance
  known to an attacker as well as the fact that the said file will replace
  an *important* system file.
 
 You should always use O_EXCL in that case. Relying on random name will
 be a severe security threat to the application.

But mkstemp does open files with O_EXCL, so the two approaches really
aren't that different.  Using tempfile can be a little simpler because
it will eventually succeed.
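The O_EXCL behaviour is easy to observe from Python (a small illustration, not part of the thread's proposals):

```python
import errno
import os
import tempfile

# mkstemp creates its file with O_CREAT | O_EXCL, so the returned
# name cannot clash with a file that already exists.
fd, path = tempfile.mkstemp()
os.close(fd)

# A second exclusive create of the same name fails with EEXIST;
# this is the condition tempfile retries around internally.
try:
    os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
except OSError as e:
    assert e.errno == errno.EEXIST
finally:
    os.remove(path)
```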

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Nick Coghlan
Zvezdan Petkovic wrote:
 Of course, the above are C functions.  I don't think that Python
 programming is immune from such security considerations either.

The tempfile module exposes the same functionality (and uses mkstemp()
to create its filenames). It has also had features added over the years
to prevent automatic deletion of the temporary files, precisely so you
*can* grab them and rename them afterwards.

It actually wouldn't be a bad place to put a "create a temporary file
and rename it to name when closing it" helper class. Such a utility
could also include a way to request "fsync() before rename" behaviour
(off by default of course).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Ext4 data loss

2009-03-13 Thread Jan Claeys
On Friday 2009-03-13 at 12:28 [timezone +0300], Oleg Broytmann wrote:
Ext4 is not the only FS with delayed allocation.

Of course not, even ext3 has delayed allocation (even if 5 sec vs. 2 min
makes the disaster window a bit smaller).

The real problem seems to be that ext4 re-orders the rename (which it
does almost instantly) before the write (which waits for 2-15 minutes or
so).

There are other modern filesystems that take care such reordering
doesn't happen...


-- 
Jan Claeys



Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Gisle Aas

On Mar 11, 2009, at 22:43 , Cameron Simpson wrote:


On 11Mar2009 10:09, Joachim König h...@online.de wrote:

Guido van Rossum wrote:
On Tue, Mar 10, 2009 at 1:11 PM, Christian Heimes  
li...@cheimes.de wrote:

[...]
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 
.

[...]
If I understand the post properly, it's up to the app to call  
fsync(),
and it's only necessary when you're doing one of the rename  
dances, or
updating a file in place. Basically, as he explains, fsync() is a  
very

heavyweight operation; I'm against calling it by default anywhere.


To me, the flaw seems to be in the close() call (of the operating
system). I'd expect the data to be in a persistent state once the
close() returns. So there would be no need to fsync if the file gets
closed anyway.


Not really. On the whole, flush() means the object has handed all data
to the OS.  close() means the object has handed all data to the OS
and released the control data structures (OS file descriptor release;
like the OS, the python interpreter may release python stuff later too).

By contrast, fsync() means the OS has handed filesystem changes to the
disc itself. Really really slow, by comparison with memory. It is Very
Expensive, and a very different operation to close().


...and at least on OS X there is one level more where you actually
tell the disc to flush its buffers to permanent storage with:

    fcntl(fd, F_FULLFSYNC)

The fsync manpage says:

  Note that while fsync() will flush all data from the host to the drive
  (i.e. the permanent storage device), the drive itself may not
  physically write the data to the platters for quite some time and it
  may be written in an out-of-order sequence.

  Specifically, if the drive loses power or the OS crashes, the
  application may find that only some or none of their data was written.
  The disk drive may also re-order the data so that later writes may be
  present, while earlier writes are not.

  This is not a theoretical edge case.  This scenario is easily
  reproduced with real world workloads and drive power failures.

  For applications that require tighter guarantees about the integrity
  of their data, Mac OS X provides the F_FULLFSYNC fcntl.  The
  F_FULLFSYNC fcntl asks the drive to flush all buffered data to
  permanent storage.  Applications, such as databases, that require a
  strict ordering of writes should use F_FULLFSYNC to ensure that their
  data is written in the order they expect.  Please see fcntl(2) for
  more detail.

It's not obvious what level of syncing is appropriate to automatically
happen from Python so I think it's better to let the application deal
with it.
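For completeness, an application that wants the stronger flush can reach it from Python; a hedged sketch (the helper name is made up, and fcntl.F_FULLFSYNC is only exposed on Mac OS X builds):

```python
import fcntl
import os

def full_sync(fd):
    # F_FULLFSYNC asks the drive itself to flush its cache; it only
    # exists in the fcntl module on Mac OS X, so fall back to a plain
    # host-to-drive fsync() everywhere else.
    if hasattr(fcntl, 'F_FULLFSYNC'):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)
```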

--Gisle



Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Antoine Pitrou
Nick Coghlan ncoghlan at gmail.com writes:
 
 On the performance side... the overhead from fsync() itself is going to
 dwarf the CPU overhead of going through a wrapper class.

The significant overhead is not in calling sync() or flush() or close(), but in
calling methods which are supposed to be fast (read() from internal buffer or
write() to internal buffer, for example).




Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Steven D'Aprano
On Thu, 12 Mar 2009 01:03:13 pm Antoine Pitrou wrote:
 Nick Coghlan ncoghlan at gmail.com writes:
  The tempfile module would be another example.

 Do you really need your temporary files to survive system crashes? ;)

It depends on what you mean by temporary.

Applications like OpenOffice can sometimes recover from an application 
crash or even a systems crash and give you the opportunity to restore 
the temporary files that were left lying around. Firefox does the same 
thing -- after a crash, it offers you the opportunity to open the 
websites you had open before. Konqueror does much the same, except it 
can only recover from application crashes, not system crashes. I can't 
tell you how many times such features have saved my hide!




-- 
Steven D'Aprano


Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Antoine Pitrou
Steven D'Aprano steve at pearwood.info writes:
 
 It depends on what you mean by temporary.
 
 Applications like OpenOffice can sometimes recover from an application 
 crash or even a systems crash and give you the opportunity to restore 
 the temporary files that were left lying around.

For such files, you want deterministic naming in order to find them again, so
you won't use the tempfile module...





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Antoine Pitrou wrote:
 Steven D'Aprano steve at pearwood.info writes:
 It depends on what you mean by temporary.

 Applications like OpenOffice can sometimes recover from an application 
 crash or even a systems crash and give you the opportunity to restore 
 the temporary files that were left lying around.
 
 For such files, you want deterministic naming in order to find them again, so
 you won't use the tempfile module...
 
Something that doesn't require deterministically named tempfiles was Ted
Ts'o's explanation linked to earlier.

read data from important file
modify data
create tempfile
write data to tempfile
*sync tempfile to disk*
mv tempfile to filename of important file

The sync is necessary to ensure that the data is written to the disk
before the old file overwrites the new filename.
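Those steps can be sketched in Python (an illustrative helper, not an existing API; it assumes the temp file is created in the target's directory so the final rename is atomic):

```python
import os
import tempfile

def atomic_update(path, transform):
    # read data from important file
    with open(path, 'rb') as f:
        data = f.read()
    # modify data
    data = transform(data)
    # create tempfile on the same filesystem as the target
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmppath = tempfile.mkstemp(dir=dirname)
    try:
        # write data to tempfile and sync it to disk...
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # ...*before* the rename replaces the important file
        os.rename(tmppath, path)
    except Exception:
        os.remove(tmppath)
        raise
```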

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 Something that doesn't require deterministically named tempfiles was Ted
 Ts'o's explanation linked to earlier.

 read data from important file
 modify data
 create tempfile
 write data to tempfile
 *sync tempfile to disk*
 mv tempfile to filename of important file

 The sync is necessary to ensure that the data is written to the disk
 before the old file overwrites the new filename.
 
 You still wouldn't use the tempfile module in that case. Instead, you
 would create a regular file, with the name based on the name of the
 important file.
 
Uhm... why?  The requirements are:

1) lifetime of the temporary file is in control of the app
2) filename is available to the app so it can move it after data is written
3) temporary file can be created on the same filesystem as the important
file.

All of those are doable using the tempfile module.

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Martin v. Löwis
 The sync is necessary to ensure that the data is written to the disk
 before the old file overwrites the new filename.
 You still wouldn't use the tempfile module in that case. Instead, you
 would create a regular file, with the name based on the name of the
 important file.

 Uhm... why?

Because it's much easier not to use the tempfile module, than to use it,
and because the main purpose of the tempfile module is irrelevant to
the specific application; the main purpose being the ability to
auto-delete the file when it gets closed.

Regards,
Martin


Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 The sync is necessary to ensure that the data is written to the disk
 before the old file overwrites the new filename.
 You still wouldn't use the tempfile module in that case. Instead, you
 would create a regular file, with the name based on the name of the
 important file.

 Uhm... why?
 
 Because it's much easier not to use the tempfile module, than to use it,
 and because the main purpose of the tempfile module is irrelevant to
 the specific application; the main purpose being the ability to
 auto-delete the file when it gets closed.
 
auto-delete is one of the nice features of tempfile.  Another feature
which is entirely appropriate to this usage, though, is creation
of a non-conflicting filename.

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Martin v. Löwis
 auto-delete is one of the nice features of tempfile.  Another feature
 which is entirely appropriate to this usage, though, is creation
 of a non-conflicting filename.

Ok. In that use case, however, it is completely irrelevant whether the
tempfile module calls fsync. After it has generated the non-conflicting
filename, it's done.

Regards,
Martin



Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 auto-delete is one of the nice features of tempfile.  Another feature
 which is entirely appropriate to this usage, though, is creation
 of a non-conflicting filename.
 
 Ok. In that use case, however, it is completely irrelevant whether the
 tempfile module calls fsync. After it has generated the non-conflicting
 filename, it's done.

If you're saying that it shouldn't call fsync automatically I'll agree
to that.  The message thread I was replying to seemed to say that
tempfiles didn't need to support fsync because they will be useless
after a system crash.  I'm just refuting that by showing that it is
useful to call fsync on tempfiles as one of the steps in preserving the
data in another file.

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Adam Olsen
On Tue, Mar 10, 2009 at 2:11 PM, Christian Heimes li...@cheimes.de wrote:
 Multiple blogs and news sites are swamped with a discussion about ext4
 and KDE 4.0. Theodore Ts'o - the developer of ext4 - explains the issue
 at
 https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54.


 Python's file type doesn't use fsync() and may be the victim of the very
 same issue, too. Should we do anything about it?

It's a kernel defect and we shouldn't touch it.

Traditionally you were hooped regardless of what you did, just with
smaller windows.  Did you want to lose your file 50% of the time or
only 10% of the time?  Heck, 1% of the time you lose the *entire*
filesystem.

Along came journaling file systems.  They guarantee the filesystem
itself stays intact, but not your file.  Still, if you hedge your bets
it's a fairly small window.  In fact if you kill performance you can
eliminate the window: write to a new file, flush all the buffers, then
use the journaling filesystem to rename; few people do that though,
due to the insane performance loss.

What we really want is a simple memory barrier.  We don't need the
file to be saved *now*, just so long as it gets saved before the
rename does.  Unfortunately the filesystem APIs don't touch on this,
as they were designed when losing the entire filesystem was
acceptable.  What we need is a heuristic to make them work in this
scenario.  Lo and behold ext3's data=ordered did just that!

Personally, I consider journaling to be a joke without that.  It has
different justifications, but not this critical one.  Yet the ext4
developers didn't see it that way, so it was sacrificed to new
performance improvements (delayed allocation).

2.6.30 has patches lined up that will fix this use case, making sure
the file is written before the rename.  We don't have to touch it.

Of course if you're planning to use the file without renaming then you
probably do need an explicit fsync and an API for that might help
after all.  That's a different problem though, and has always existed.


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Scott David Daniels

A.M. Kuchling wrote:

  With zipfile, you could at least access the .fp attribute
to sync it (though is the .fp documented as part of the interface?).


For this one, I'd like to add the sync as a method (so that Zip-inside-
Zip is eventually possible).  In fact, a sync on an exposed writable
for a single file should probably push back out to a full sync.  This
would be trickier to accomplish if the using code had to suss out how
to get to the fp.  Clearly I have plans for a ZipFile expansion, but
this could only conceivably hit 2.7, and 2.8 / 3.2 is a lot more likely.

--Scott David Daniels
scott.dani...@acm.org



Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Martin v. Löwis
 We already have os.fsync() and os.fdatasync(). Should the sync() (and
 datasync()?) method be added as an object-oriented convenience?
 
 It's more than an object oriented convenience. fsync() takes a file
 descriptor as argument. Therefore I assume fsync() only syncs the data
 to disk that was written to the file descriptor. [*] 
[...]
 [*] Is my assumption correct, anybody?

Not necessarily. In Linux, for many releases, fsync() was really
equivalent to sync() (i.e. flushing all data for all files on all
file systems to disk). It may be that some systems still implement
it that way today.

However, even it it was true, I don't see why a .sync method would
be more than a convenience. An application wishing to sync a file
before close can do

f.flush()
os.fsync(f.fileno())
f.close()

With a sync method, it would become

f.flush()
f.sync()
f.close()

which is *really* nothing more than convenience.

I'd also like to point to the O_SYNC/O_DSYNC/O_RSYNC open(2)
flags. Applications that require durable writes can also chose
to set those on open, and be done.
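A brief sketch of that route (the getattr guard is there because not every platform exposes the flag):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'durable.log')

# With O_SYNC, each write() returns only once the data has been
# handed to the device, so no separate fsync() call is needed.
flags = os.O_WRONLY | os.O_CREAT | getattr(os, 'O_SYNC', 0)
fd = os.open(path, flags, 0o644)
os.write(fd, b'committed\n')
os.close(fd)
```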

Regards,
Martin


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Joachim König

Guido van Rossum wrote:

On Tue, Mar 10, 2009 at 1:11 PM, Christian Heimes li...@cheimes.de wrote:
  

[...]
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54.
[...]


If I understand the post properly, it's up to the app to call fsync(),
and it's only necessary when you're doing one of the rename dances, or
updating a file in place. Basically, as he explains, fsync() is a very
heavyweight operation; I'm against calling it by default anywhere.

  
To me, the flaw seems to be in the close() call (of the operating
system). I'd expect the data to be in a persistent state once the
close() returns. So there would be no need to fsync if the file gets
closed anyway.

Of course the close() call could take a while (up to 30 seconds in
laptop mode), but if one does not want to wait that long, then one can
continue without calling close() and take the risk.

Of course, if the data should be on persistent storage without closing
the file (e.g. for database applications), then one has to carefully
call the different sync methods, but that's another story.


Why has this ext4 problem not come up for other filesystems?





Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Neil Hodgson
Antoine Pitrou:

 How about shutil.copystat()?

   shutil.copystat does not copy over the owner, group or ACLs.

   Modeling a copymetadata method on copystat would provide an easy to
understand API and should be implementable on Windows and POSIX.
Reading the OS X documentation shows a set of low-level POSIX
functions for ACLs. Since there are multiple pieces of metadata and an
application may not want to copy all pieces there could be multiple
methods (copygroup ...) or one method with options
shutil.copymetadata(src, dst, group=True, resource_fork=False)

   Neil
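A minimal sketch of how such a copymetadata() might look on POSIX, built
only on existing os/shutil calls (the function name and keyword arguments
are hypothetical, following the shape suggested above; changing the owner
still requires sufficient privileges, and ACLs/resource forks are out of
scope here):

```python
import os
import shutil

def copymetadata(src, dst, owner=False, group=False):
    """Copy permission bits, timestamps and (optionally) ownership
    from src to dst. A sketch of the proposed API, not a real patch."""
    shutil.copystat(src, dst)  # mode bits, atime/mtime, flags
    if owner or group:
        st = os.stat(src)
        uid = st.st_uid if owner else -1  # -1 leaves the field unchanged
        gid = st.st_gid if group else -1
        os.chown(dst, uid, gid)  # may raise PermissionError without privileges
```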


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Hrvoje Niksic

Joachim König wrote:
To me, the flaw seems to be in the close() call (of the operating
system). I'd expect the data to be in a persistent state once close()
returns.


I wouldn't, because that would mean that every cp -r would effectively 
do an fsync() for each individual file it copies, which would bog down 
in the case of copying many small files.  Operating systems aggressively 
buffer file systems for good reason: performance of the common case.



Why has this ext4 problem not come up for other filesystems?


It has come up for XFS many many times, for example 
https://launchpad.net/ubuntu/+bug/37435


ext3 was resilient to the problem because of its default allocation
policy; now that ext4 has implemented the same optimization XFS had
before, it shares the same problems.



Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Neil Hodgson nyamatongwe at gmail.com writes:
 
shutil.copystat does not copy over the owner, group or ACLs.

It depends on what you call ACLs. It does copy the chmod permission bits.
As for owner and group, I think there is a very good reason that it doesn't copy
them: under Linux, only root can change these properties.





Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Oleg Broytmann
On Wed, Mar 11, 2009 at 11:43:33AM +, Antoine Pitrou wrote:
 As for owner and group, I think there is a very good reason that it
 doesn't copy them: under Linux, only root can change these properties.

   Only root can change file ownership - and yes, there are scripts that
run with root privileges, so why not copy? As for group ownership - any
user can change group if [s]he belongs to the group.

Oleg.
-- 
 Oleg Broytmann   http://phd.pp.ru/   p...@phd.pp.ru
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Christian Heimes lists at cheimes.de writes:
 
 It's more than an object oriented convenience. fsync() takes a file
 descriptor as argument. Therefore I assume fsync() only syncs the data
 to disk that was written to the file descriptor.

Ok, I agree that a .sync() method makes sense.




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Oleg Broytmann phd at phd.pp.ru writes:
 
Only root can change file ownership - and yes, there are scripts that
 run with root privileges, so why not copy?

Because the new function would then be useless for non-root scripts, and
encouraging people to run their scripts as root would be rather bad.




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Oleg Broytmann phd at phd.pp.ru writes:
 
That's easy to fix - only copy ownership if the effective user id == 0.

But errors should not pass silently. If the user intended the function to copy
ownership information and the function fails to do so, it should raise an 
exception.
Having implicit special cases in an API is usually bad, especially when it has
an impact on security.




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Christian Heimes
Guido van Rossum wrote:
 Let's not think too Unix-specific. If we add such an API it should do
 something on Windows too -- the app shouldn't have to test for the
 presence of the API. (And thus the API probably shouldn't be called
 fsync.)

In my initial proposal one and a half hour earlier I suggested 'sync()'
as the name of the method and 'synced' as the name of the flag that
forces a fsync() call during the close operation.

Christian


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Hrvoje Niksic

Christian Heimes wrote:

Guido van Rossum wrote:

Let's not think too Unix-specific. If we add such an API it should do
something on Windows too -- the app shouldn't have to test for the
presence of the API. (And thus the API probably shouldn't be called
fsync.)


In my initial proposal one and a half hour earlier I suggested 'sync()'
as the name of the method and 'synced' as the name of the flag that
forces a fsync() call during the close operation.


Maybe it would make more sense for synced to force fsync() on each 
flush, not only on close.  I'm not sure how useful it is, but that's 
what synced would imply to me.  Maybe it would be best to avoid having 
such a variable, and expose a close_sync() method instead?



Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Christian Heimes lists at cheimes.de writes:
 
 In my initial proposal one and a half hour earlier I suggested 'sync()'
 as the name of the method and 'synced' as the name of the flag that
 forces a fsync() call during the close operation.

I think your synced flag is too vague. Some applications may need the file to
be synced on close(), but some others may need it to be synced at regular
intervals, or after each write(), etc.

Calling the flag sync_on_close would be much more explicit. Also, given the
current API I think it should be an argument to open() rather than a writable
attribute.




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
After Hrvoje's message, let me rephrase my suggestion. Let's instead allow:
   open(..., sync_on='close')
   open(..., sync_on='flush')

with a default of None meaning no implicit syncs.

Regards

Antoine.




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Christian Heimes lists at cheimes.de writes:
 
 And sync_on='flush' implies sync_on='close'?

close() implies flush(), so by construction yes.

 Your suggestion sounds like
 the right way to me!

I'm glad I brought something constructive to the discussion :-))




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Scott Dial
Aahz wrote:
 On Wed, Mar 11, 2009, Antoine Pitrou wrote:
 After Hrvoje's message, let me rephrase my suggestion. Let's instead allow:
open(..., sync_on='close')
open(..., sync_on='flush')

 with a default of None meaning no implicit syncs.
 
 That looks good, though I'd prefer using named constants rather than
 strings.

I would agree, but where do you put them? Since open is a built-in,
where would you suggest placing such constants (assuming we don't want
to pollute the built-in namespace)?

-- 
Scott Dial
sc...@scottdial.com
scod...@cs.indiana.edu


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Aahz
On Wed, Mar 11, 2009, Scott Dial wrote:
 Aahz wrote:
 On Wed, Mar 11, 2009, Antoine Pitrou wrote:
 After Hrvoje's message, let me rephrase my suggestion. Let's instead allow:
open(..., sync_on='close')
open(..., sync_on='flush')

 with a default of None meaning no implicit syncs.
 
 That looks good, though I'd prefer using named constants rather than
 strings.
 
 I would agree, but where do you put them? Since open is a built-in,
 where would you suggest placing such constants (assuming we don't want
 to pollute the built-in namespace)?

The os module, of course, like the existing O_* constants.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

All problems in computer science can be solved by another level of 
indirection.  --Butler Lampson


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Eric Smith

Antoine Pitrou wrote:

I think your synced flag is too vague. Some applications may need the file to
be synced on close(), but some others may need it to be synced at regular
intervals, or after each write(), etc.


Why wouldn't sync just be an optional argument to close(), at least for 
the sync_on_close case?


Eric.


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Eric Smith

Antoine Pitrou wrote:

Eric Smith eric at trueblade.com writes:
Why wouldn't sync just be an optional argument to close(), at least for 
the sync_on_close case?


It wouldn't work with the with statement.



Well, that is a good reason, then!


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Martin v. Löwis
 Maybe it would make more sense for synced to force fsync() on each
 flush, not only on close.  I'm not sure how useful it is, but that's
 what synced would imply to me.

That should be implemented by passing O_SYNC to open, rather than
explicitly calling fsync.

Regards,
Martin
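For illustration, opening a file with O_SYNC from Python might look like
the following sketch. It assumes a POSIX platform; the getattr() guard
reflects that os.O_SYNC is not defined everywhere, and the filename is
made up:

```python
import os

# Open with O_SYNC so each write(2) returns only after the data has
# been handed to the device (on platforms that support the flag;
# where it is missing, this silently degrades to a normal open).
flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
fd = os.open("settings.cfg", flags, 0o644)
with os.fdopen(fd, "wb") as f:
    f.write(b"key = value\n")
```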


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Neil Hodgson
Antoine Pitrou:

 It depends on what you call ACLs. It does copy the chmod permission bits.

Access Control Lists are fine grained permissions. Perhaps you
want to allow Sam to read a file and for Ted to both read and write
it. These permissions should not need to be reset every time you
modify the file.

 As for owner and group, I think there is a very good reason that it
 doesn't copy them: under Linux, only root can change these properties.

   Since I am a member of both staff and everyone, I can set group
on one of my files from staff to everyone or back again:

$ chown :everyone x.pl
$ ls -la x.pl
-rwxrwxrwx  1 nyamatongwe  everyone  269 Mar 11  2008 x.pl
$ chown :staff x.pl
$ ls -la x.pl
-rwxrwxrwx  1 nyamatongwe  staff  269 Mar 11  2008 x.pl

   Neil


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Greg Ewing

Barry Warsaw wrote:

Of course, a careful *nix application can ensure that the file owners  
and mod bits are set the way it needs them to be set.  A convenience  
function might be useful though.


A specialised function would also provide a place for
dealing with platform-specific extensions, such as
MacOSX Finder attributes.

--
Greg


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Cameron Simpson
On 11Mar2009 10:09, Joachim König h...@online.de wrote:
 Guido van Rossum wrote:
 On Tue, Mar 10, 2009 at 1:11 PM, Christian Heimes li...@cheimes.de wrote:
 [...]
 https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54.
 [...]
 If I understand the post properly, it's up to the app to call fsync(),
 and it's only necessary when you're doing one of the rename dances, or
 updating a file in place. Basically, as he explains, fsync() is a very
 heavyweight operation; I'm against calling it by default anywhere.
   
 To me, the flaw seems to be in the close() call (of the operating
 system). I'd expect the data to be in a persistent state once close()
 returns. So there would be no need to fsync if the file gets closed
 anyway.

Not really. On the whole, flush() means the object has handed all data
to the OS.  close() means the object has handed all data to the OS
and released the control data structures (OS file descriptor release;
like the OS, the python interpreter may release python stuff later too).

By contrast, fsync() means the OS has handed filesystem changes to the
disc itself. Really really slow, by comparison with memory. It is Very
Expensive, and a very different operation to close().

[...snip...]
 Why has this ext4 problem not come up for other filesystems?

The same problems exist for all disc-based filesystems to a greater or
lesser degree; the OS always does some buffering, and therefore there
is a gap between what the OS has accepted from you (and thus made
visible to other apps using the OS) and the physical data structures
on disc. Ext2/3/4 tend to do a whole-disc sync when merely asked to
fsync one file, probably because it is only really feasible to get to
a particular checkpoint in the journal. Many other filesystems have
similar degrees of granularity, though perhaps not all.

Anyway, fsync is a much bigger ask than close, and should be used very
sparingly.

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

If I repent anything, it is very likely to be my good behavior.
What demon possessed me that I behaved so well? - Henry David Thoreau


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Steven D'Aprano
On Thu, 12 Mar 2009 01:21:25 am Antoine Pitrou wrote:
 Christian Heimes lists at cheimes.de writes:
  In my initial proposal one and a half hour earlier I suggested
  'sync()' as the name of the method and 'synced' as the name of the
  flag that forces a fsync() call during the close operation.

 I think your synced flag is too vague. Some applications may need
 the file to be synced on close(), but some others may need it to be
 synced at regular intervals, or after each write(), etc.

 Calling the flag sync_on_close would be much more explicit. Also,
 given the current API I think it should be an argument to open()
 rather than a writable attribute.

Perhaps we should have a module containing rich file tools, e.g. classes 
FileSyncOnWrite, FileSyncOnClose, functions for common file-related 
operations, etc. This will make it easy for conscientious programmers 
to do the right thing for their app without needing to re-invent the 
wheel all the time, but without handcuffing them into a single one 
size fits all solution.

File operations are *hard*, because many error conditions are uncommon, 
and consequently many (possibly even the majority) of programmers never 
learn that something like this:

f = open('myfile', 'w')
f.write(data)
f.close()

(or the equivalent in whatever language they use) may cause data loss. 
Worse, we train users to accept that data loss as normal instead of 
reporting it as a bug -- possibly because it is unclear whether it is a 
bug in the application, the OS, the file system, or all three. (It's 
impossible to avoid *all* risk of data loss, of course -- what if the 
computer loses power in the middle of a write? But we can minimize that 
risk significantly.)

Even when programmers try to do the right thing, it is hard to know what 
the right thing is: there are trade-offs to be made, and having made a 
trade-off, the programmer then has to re-invent what usually turns out 
to be a quite complicated wheel. To do the right thing in Python often 
means delving into the world of os.O_* constants and file descriptors, 
which is intimidating and unpythonic. They're great for those who 
want/need them, but perhaps we should expose a Python interface to the 
more common operations? To my mind, that means classes instead of magic 
constants.

Would there be interest in a filetools module? Replies and discussion to 
python-ideas please.


-- 
Steven D'Aprano
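One possible shape for such a filetools helper, sketched as a context
manager that wraps an *existing* file-like object and fsyncs before
closing. The name synced() is made up for illustration; a real module
would likely offer several such wrappers:

```python
import os
from contextlib import contextmanager

@contextmanager
def synced(fileobj):
    """Flush and fsync a file-like object before closing it, so the
    data has been handed to the disk by the time the block exits."""
    try:
        yield fileobj
    finally:
        fileobj.flush()
        os.fsync(fileobj.fileno())
        fileobj.close()
```

Usage would then mirror the risky three-liner above:

```python
with synced(open('myfile', 'w')) as f:
    f.write(data)
```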


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Greg Ewing

Lie Ryan wrote:


I actually prefer strings. Just like 'w' or 'r' in open().

Or why not add 'f' and 'c' as modes?

open('file.txt', 'wf')


I like this, because it doesn't expand the signature that
file-like objects need to support. If you're wrapping
another file object you just need to pass on the mode
string and it will all work.

--
Greg


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Greg Ewing

Martin v. Löwis wrote:


That should be implemented by passing O_SYNC to open, rather than
explicitly calling fsync.


On platforms which have it (MacOSX doesn't seem to,
according to the man page).

This is another good reason to put these things in the
mode string.

--
Greg


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Paul Moore
2009/3/11 Greg Ewing greg.ew...@canterbury.ac.nz:
 Lie Ryan wrote:

 I actually prefer strings. Just like 'w' or 'r' in open().

 Or why not add 'f' and 'c' as modes?

 open('file.txt', 'wf')

 I like this, because it doesn't expand the signature that
 file-like objects need to support. If you're wrapping
 another file object you just need to pass on the mode
 string and it will all work.

Of course, a file opened for write, in text mode, with auto-sync on
flush, has mode 'wtf'. I'm in favour just for the chance to use that
mode :-)

Paul.


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Greg Ewing greg.ewing at canterbury.ac.nz writes:
 
 I like this, because it doesn't expand the signature that
 file-like objects need to support. If you're wrapping
 another file object you just need to pass on the mode
 string and it will all work.

What do you mean? open() doesn't allow you to wrap other file objects.

As for adding options to the mode string, I think it will only make things
unreadable. Better make the option explicit, like others already are (buffering,
newline, encoding).

Besides, file objects still have to support a sync() method, since sync-on-close
doesn't cater for all uses.




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Greg Ewing

Antoine Pitrou wrote:


What do you mean? open() doesn't allow you to wrap other file objects.


I'm talking about things like GzipFile that take a
filename and mode, open the file and then wrap the
file object.

--
Greg


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Nick Coghlan
Greg Ewing wrote:
 Antoine Pitrou wrote:
 
 What do you mean? open() doesn't allow you to wrap other file objects.
 
 I'm talking about things like GzipFile that take a
 filename and mode, open the file and then wrap the
 file object.

The tempfile module would be another example.

For that reason, I think Steven's idea of a filetools module which
provided context managers and the like that wrapped *existing* file-like
objects might be preferable.

Otherwise it may be a while before sync-aware code is able to deal with
anything other than basic files.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Antoine Pitrou
Nick Coghlan ncoghlan at gmail.com writes:
 
 The tempfile module would be another example.

Do you really need your temporary files to survive system crashes? ;)

 For that reason, I think Steven's idea of a filetools module which
 provided context managers and the like that wrapped *existing* file-like
 objects might be preferable.

Well, well, let's clarify things a bit.
If we want to help users with this problem, we can provide two things:
1. a new sync() method on the standard objects provided by the IO lib
2. a facility to automatically call sync() on flush() and/or close() calls

Step 1 may be done with a generic implementation in the IO ABCs calling
self.flush() and then os.fsync(self.fileno()). IMO it is important that it is a
method of IO objects because implementations may want to override it. An
external facility would be too inflexible.

Step 2 may be done with a generic wrapper. However, we could also provide an
open() flag which transparently invokes the wrapper. After all, open() is
already a convenience function creating a raw file object and wrapping it in two
optional layers.

(as a side note, wrappers have a non-zero performance impact, especially on
small ops - e.g. reading or writing a few bytes)




Re: [Python-Dev] Ext4 data loss

2009-03-11 Thread Nick Coghlan
Antoine Pitrou wrote:
 Nick Coghlan ncoghlan at gmail.com writes:
 The tempfile module would be another example.
 
 Do you really need your temporary files to survive system crashes? ;)

No, but they need to provide the full file API. If we add a sync()
method to file objects, that becomes part of the file-like API.

On the performance side... the overhead from fsync() itself is going to
dwarf the CPU overhead of going through a wrapper class.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


[Python-Dev] Ext4 data loss

2009-03-10 Thread Christian Heimes
Multiple blogs and news sites are swamped with a discussion about ext4
and KDE 4.0. Theodore Ts'o - the developer of ext4 - explains the issue
at
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54.


Python's file type doesn't use fsync() and may be the victim of the very
same issue, too. Should we do anything about it?

Christian



Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Guido van Rossum
On Tue, Mar 10, 2009 at 1:11 PM, Christian Heimes li...@cheimes.de wrote:
 Multiple blogs and news sites are swamped with a discussion about ext4
 and KDE 4.0. Theodore Ts'o - the developer of ext4 - explains the issue
 at
 https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54.


 Python's file type doesn't use fsync() and may be the victim of the very
 same issue, too. Should we do anything about it?

If I understand the post properly, it's up to the app to call fsync(),
and it's only necessary when you're doing one of the rename dances, or
updating a file in place. Basically, as he explains, fsync() is a very
heavyweight operation; I'm against calling it by default anywhere.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Neil Hodgson
   The technique advocated by Theodore Ts'o (save to temporary then
rename) discards metadata. What would be useful is a simple, generic
way in Python to copy all the appropriate metadata (ownership, ACLs,
...) to another file so the temporary-and-rename technique could be
used.

   On Windows, there is a hack in the file system that tries to track
the use of temporary-and-rename and reapply ACLs and on OS X there is
a function FSPathReplaceObject but I don't know how to do this
correctly on Linux.

   Neil


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Barry Warsaw


On Mar 10, 2009, at 4:23 PM, Guido van Rossum wrote:

On Tue, Mar 10, 2009 at 1:11 PM, Christian Heimes li...@cheimes.de  
wrote:
Multiple blogs and news sites are swamped with a discussion about  
ext4
and KDE 4.0. Theodore Ts'o - the developer of ext4 - explains the  
issue

at
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 
.



Python's file type doesn't use fsync() and may be the victim of the very
same issue, too. Should we do anything about it?


If I understand the post properly, it's up to the app to call fsync(),
and it's only necessary when you're doing one of the rename dances, or
updating a file in place. Basically, as he explains, fsync() is a very
heavyweight operation; I'm against calling it by default anywhere.


Right.  Python /applications/ should call fsync() and do the rename  
dance if appropriate, and fortunately it's easy enough to implement in  
Python.  Mailman's queue runner has done exactly this for ages.


Barry
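The write-to-temporary, fsync, rename dance referred to here can be
sketched in a few lines. The function name and error handling are
illustrative, and the sketch assumes POSIX semantics where rename()
atomically replaces an existing target:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path atomically: write to a temp file in the same
    directory, fsync it, then rename it over the destination."""
    dirname = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for rename to work.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data reaches the disk before the rename
        os.rename(tmp, path)      # atomically replace the old file
    except Exception:
        os.unlink(tmp)
        raise
```

A crash before the rename leaves the old file intact; a crash after it
leaves the fully written new file, which is the transactional behaviour
discussed elsewhere in this thread.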



Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Guido van Rossum
On Tue, Mar 10, 2009 at 1:46 PM, Neil Hodgson nyamaton...@gmail.com wrote:
   The technique advocated by Theodore Ts'o (save to temporary then
 rename) discards metadata. What would be useful is a simple, generic
 way in Python to copy all the appropriate metadata (ownership, ACLs,
 ...) to another file so the temporary-and-rename technique could be
 used.

   On Windows, there is a hack in the file system that tries to track
 the use of temporary-and-rename and reapply ACLs and on OS X there is
 a function FSPathReplaceObject but I don't know how to do this
 correctly on Linux.

I don't know how to implement this for metadata beyond the traditional
stat metadata, but the API could be an extension of shutil.copystat().

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Barry Warsaw


On Mar 10, 2009, at 4:46 PM, Neil Hodgson wrote:


  The technique advocated by Theodore Ts'o (save to temporary then
rename) discards metadata. What would be useful is a simple, generic
way in Python to copy all the appropriate metadata (ownership, ACLs,
...) to another file so the temporary-and-rename technique could be
used.

  On Windows, there is a hack in the file system that tries to track
the use of temporary-and-rename and reapply ACLs and on OS X there is
a function FSPathReplaceObject but I don't know how to do this
correctly on Linux.


Of course, a careful *nix application can ensure that the file owners  
and mod bits are set the way it needs them to be set.  A convenience  
function might be useful though.


Barry



Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Martin v. Löwis
 If I understand the post properly, it's up to the app to call fsync(),

Correct.

 and it's only necessary when you're doing one of the rename dances, or
 updating a file in place. 

No. It's in general necessary when you want to be sure that the data is
on disk, even if the power is lost. So even if you write a file (say, a
.pyc) only once - if the lights go out, and on again, your .pyc might be
corrupted, as the file system may have chosen to flush the metadata onto
disk, but not the actual data (or only parts of it). This may
happen even if the close(2) operation was successful.

In the specific case of config files, that's unfortunate because you
then can't revert to the old state, either - because that may be gone.
Ideally, you want transactional updates - you get either the old config
or the new config after a crash. You can get that with explicit
fdatasync, or with a transactional database (which may chose to sync
only infrequently, but then will be able to rollback the old state if
the new one wasn't written completely).
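The transactional update Martin describes can be sketched as follows (an illustrative helper, not anything in the stdlib): write the new contents to a temporary file in the same directory, force it to disk, then rename it over the old file, so that after a crash the file holds either the complete old contents or the complete new contents, never a mixture.

```python
# Sketch of a transactional file update: temp file + fsync + rename.
import os
import tempfile

def transactional_write(path, data):
    """Atomically replace path with data (bytes), old-or-new guarantee."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)    # same dir: rename stays atomic
    try:
        os.write(fd, data)
        os.fsync(fd)              # force file contents to disk before rename
        os.close(fd)
        os.rename(tmp, path)      # atomic replacement on POSIX
    except BaseException:
        try:
            os.close(fd)
        except OSError:
            pass
        os.unlink(tmp)
        raise
```

The temporary file must live on the same filesystem as the target, since rename is only atomic within one filesystem.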

But yes, I agree, it's the applications' responsibility to properly
sync. If I had to place sync calls into the standard library, they would
go into dumbdbm.

I somewhat disagree that it is the application's fault entirely, and not
the operating system's/file system's fault. Ideally, there would be an
option of specifying transaction brackets for file operations, so that
the system knows it cannot flush the unlink operation of the old file
before it has flushed the data of the new file. This would still allow
the system to schedule IO fairly freely, but also guarantee that not all
gets lost in a crash. I thought that the data=ordered ext3 mount option
was going in that direction - not sure what happened to it in ext4.

Regards,
Martin


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread A.M. Kuchling
On Tue, Mar 10, 2009 at 09:11:38PM +0100, Christian Heimes wrote:
 Python's file type doesn't use fsync() and could be the victim of the very
 same issue, too. Should we do anything about it?

The mailbox module tries to be careful and always fsync() before
closing files, because mail messages are pretty important.  The
various *dbm modules mostly have a .sync() method.  

dumbdbm.py doesn't call fsync(), AFAICT; _commit() writes stuff and
closes the file, but doesn't call fsync().

sqlite3 doesn't have a sync() or flush() call.  Does SQLite handle
this itself?

The tarfile, zipfile, and gzip/bzip2 classes don't seem to use fsync()
at all, either implicitly or by having methods for calling them.
Should they?  What about cookielib.CookieJar?

--amk


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Antoine Pitrou
Neil Hodgson nyamatongwe at gmail.com writes:
 
 What would be useful is a simple, generic
 way in Python to copy all the appropriate metadata (ownership, ACLs,
 ...) to another file so the temporary-and-rename technique could be
 used.

How about shutil.copystat()?




Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Cameron Simpson
On 10Mar2009 18:09, A.M. Kuchling a...@amk.ca wrote:
| On Tue, Mar 10, 2009 at 09:11:38PM +0100, Christian Heimes wrote:
|  Python's file type doesn't use fsync() and could be the victim of the very
|  same issue, too. Should we do anything about it?

IMHO, beyond _offering_ an fsync method, no.

| The mailbox module tries to be careful and always fsync() before
| closing files, because mail messages are pretty important.

Can it be turned off? I hadn't realised this.

| The
| various *dbm modules mostly have .sync() method.  
| 
| dumbdbm.py doesn't call fsync(), AFAICT; _commit() writes stuff and
| closes the file, but doesn't call fsync().
| 
| sqlite3 doesn't have a sync() or flush() call.  Does SQLite handle
| this itself?

Yeah, most obnoxiously. There's a longstanding firefox bug about the
horrendous performance side effects of sqlite's zeal in this regard:

  https://bugzilla.mozilla.org/show_bug.cgi?id=421482

At least there's now an (almost undocumented) preference to disable it,
which I do on a personal basis.

| The tarfile, zipfile, and gzip/bzip2 classes don't seem to use fsync()
| at all, either implicitly or by having methods for calling them.
| Should they?  What about cookielib.CookieJar?

I think they should not do this implicitly. By all means let a user
issue policy.

In case you hadn't guessed, I fall into the never fsync group,
something of a simplification of my real position. In my opinion,
deciding to fsync is almost always a user policy decision, not an app
decision. An app talks to the OS; if the OS' filesystem has accepted
responsibility for the data (as it has after a successful fflush, for
example) then normally the app should have no further responsibility;
that is _exactly_ what the OS is responsible for.

Recovery is what backups are for, generally speaking.
All this IMHO, of course.

Of course there are some circumstances where one might fsync, as part
of one's risk mitigation policies (eg database checkpointing etc). But
whenever you do this you're basically saying you don't trust the OS
abstraction of the hardware and also imposing an inherent performance
bottleneck.

With things like ext3 (and ext4 may well be the same - I have not
checked) an fsync doesn't just sync that file's data and metadata; it does
a whole-filesystem sync. Really expensive. If underlying libraries do that
quietly and without user oversight/control, then this failure to trust the
OS puts an unresolvable bottleneck on various things, and as an app scales
up in I/O or operational throughput, or as a library or facility becomes
higher level (i.e. _involving_ more and more underlying complexity or
number of basic operations), such a low-level auto-fsync becomes ever
more intrusive and unfixable.

Also, how far do you want to go to assure integrity for particular
filesystems' integrity issues/behaviours? Most filesystems sync to disc
regularly (or frequently, at any rate) anyway. What's too big a window
of potential loss?

For myself, I'm against libraries that implicitly do fsyncs, especially
if the user can't issue policy about it.

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

If it can't be turned off, it's not a feature. - Karl Heuer


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread A.M. Kuchling
On Wed, Mar 11, 2009 at 11:31:52AM +1100, Cameron Simpson wrote:
 On 10Mar2009 18:09, A.M. Kuchling a...@amk.ca wrote:
 | The mailbox module tries to be careful and always fsync() before
 | closing files, because mail messages are pretty important.
 
 Can it be turned off? I hadn't realised this.

No, there's no way to turn it off (well, you could delete 'fsync' from
the os module).

 | The tarfile, zipfile, and gzip/bzip2 classes don't seem to use fsync()
 | at all, either implicitly or by having methods for calling them.
 | Should they?  What about cookielib.CookieJar?
 
 I think they should not do this implicitly. By all means let a user
 issue policy.

The problem is that in some cases the user can't issue policy.  For
example, look at dumbdbm._commit().  It renames a file to a backup,
opens a new file object, writes to it, and closes it.  A caller can't
fsync() because the file object is created, used, and closed
internally.  With zipfile, you could at least access the .fp attribute
to sync it (though is the .fp documented as part of the interface?).

In other words, do we need to ensure that all the relevant library
modules expose an interface to allow requesting a sync, or getting the
file descriptor in order to sync it?

--amk


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Antoine Pitrou
Christian Heimes lists at cheimes.de writes:
 
 I agree with you, fsync() shouldn't be called by default. I didn't plan
 on adding fsync() calls all over our code. However I like to suggest a
 file.sync() method and a synced flag for files to make the job of
 application developers easier.

We already have os.fsync() and os.fdatasync(). Should the sync() (and
datasync()?) method be added as an object-oriented convenience?





Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Christian Heimes
Antoine Pitrou wrote:
 Christian Heimes lists at cheimes.de writes:
 I agree with you, fsync() shouldn't be called by default. I didn't plan
 on adding fsync() calls all over our code. However I like to suggest a
 file.sync() method and a synced flag for files to make the job of
 application developers easier.
 
 We already have os.fsync() and os.fdatasync(). Should the sync() (and
 datasync()?) method be added as an object-oriented convenience?

It's more than an object oriented convenience. fsync() takes a file
descriptor as argument. Therefore I assume fsync() only syncs the data
to disk that was written to the file descriptor. [*] In Python 2.x we
are using a FILE* based stream. In Python 3.x we have our own buffered
writer class.

In order to write all data to disk the FILE* stream must be flushed
first before fsync() is called:

PyFileObject *f;
if (fflush(f->f_fp) != 0) {
    /* report error */
}
if (fsync(fileno(f->f_fp)) != 0) {
    /* report error */
}


Christian

[*] Is my assumption correct, anybody?


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Guido van Rossum
On Tue, Mar 10, 2009 at 7:45 PM, Christian Heimes li...@cheimes.de wrote:
 Antoine Pitrou wrote:
 Christian Heimes lists at cheimes.de writes:
 I agree with you, fsync() shouldn't be called by default. I didn't plan
 on adding fsync() calls all over our code. However I like to suggest a
 file.sync() method and a synced flag for files to make the job of
 application developers easier.

 We already have os.fsync() and os.fdatasync(). Should the sync() (and
 datasync()?) method be added as an object-oriented convenience?

 It's more than an object oriented convenience. fsync() takes a file
 descriptor as argument. Therefore I assume fsync() only syncs the data
 to disk that was written to the file descriptor. [*] In Python 2.x we
 are using a FILE* based stream. In Python 3.x we have our own buffered
 writer class.

 In order to write all data to disk the FILE* stream must be flushed
 first before fsync() is called:

    PyFileObject *f;
    if (fflush(f->f_fp) != 0) {
        /* report error */
    }
    if (fsync(fileno(f->f_fp)) != 0) {
        /* report error */
    }

Let's not think too Unix-specific. If we add such an API it should do
something on Windows too -- the app shouldn't have to test for the
presence of the API. (And thus the API probably shouldn't be called
fsync.)
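In the spirit of Guido's remark, the platform check can be hidden inside one helper so applications never probe for the primitive themselves (a sketch with an illustrative name; in modern CPython os.fsync does in fact exist on Windows as well, where it wraps _commit(), so the fallback branch is mostly defensive):

```python
# Platform-neutral sync helper: callers never test for os.fsync directly.
import os

def sync_fd(fd):
    if hasattr(os, 'fsync'):
        os.fsync(fd)
    # else: no sync primitive on this platform; nothing we can do
```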

 Christian

 [*] Is my assumption correct, anybody?

It seems to be; at least, it's ambiguous.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Cameron Simpson
On 10Mar2009 22:14, A.M. Kuchling a...@amk.ca wrote:
| On Wed, Mar 11, 2009 at 11:31:52AM +1100, Cameron Simpson wrote:
|  On 10Mar2009 18:09, A.M. Kuchling a...@amk.ca wrote:
|  | The mailbox module tries to be careful and always fsync() before
|  | closing files, because mail messages are pretty important.
|  
|  Can it be turned off? I hadn't realised this.
| 
| No, there's no way to turn it off (well, you could delete 'fsync' from
| the os module).

Ah. For myself, were I writing a high load mailbox tool (eg a mail filer
or more to the point, a mail refiler - which I do actually intend to) I
would want to be able to do a huge mass of mailbox stuff and then
possibly issue a sync at the end. For unix mbox that might be ok but
for maildirs I'd imagine it leads to an fsync per message.

|  | The tarfile, zipfile, and gzip/bzip2 classes don't seem to use fsync()
|  | at all, either implicitly or by having methods for calling them.
|  | Should they?  What about cookielib.CookieJar?
|  
|  I think they should not do this implicitly. By all means let a user
|  issue policy.
| 
| The problem is that in some cases the user can't issue policy.  For
| example, look at dumbdbm._commit().  It renames a file to a backup,
| opens a new file object, writes to it, and closes it.  A caller can't
| fsync() because the file object is created, used, and closed
| internally.  With zipfile, you could at least access the .fp attribute
| to sync it (though is the .fp documented as part of the interface?).

I didn't so much mean giving the user an fsync hook so much as publishing a
flag such as .do_critical_fsyncs inside the dbm or zipfile object. If true,
issue fsyncs at appropriate times.

| In other words, do we need to ensure that all the relevant library
| modules expose an interface to allow requesting a sync, or getting the
| file descriptor in order to sync it?

With a policy flag you could solve the control issue even for things
which don't expose the fd such as your dumbdbm._commit() example.
If you supply both a flag and an fsync() method it becomes easy for
a user of a module to go:

  obj = get_dbm_handle()
  obj.do_critical_fsyncs = False
  ... do lots and lots of stuff ...
  obj.fsync()
  obj.close()
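One hypothetical shape for the policy flag and explicit fsync() method sketched above (the class and attribute names are illustrative, not an existing API): the object syncs at its own "appropriate times" only while the flag is set, and always leaves an explicit hook for the caller.

```python
# Hypothetical wrapper exposing a do_critical_fsyncs policy flag.
import os

class SyncPolicyFile:
    def __init__(self, path, mode='w'):
        self.fp = open(path, mode)
        self.do_critical_fsyncs = True   # policy flag, on by default

    def write(self, data):
        self.fp.write(data)

    def fsync(self):
        # explicit hook: always available regardless of policy
        self.fp.flush()
        os.fsync(self.fp.fileno())

    def commit(self):
        # an "appropriate time": sync only if policy says so
        if self.do_critical_fsyncs:
            self.fsync()

    def close(self):
        self.commit()
        self.fp.close()
```

A caller doing a large batch of work would clear the flag up front, then call fsync() once at the end, as in the usage sketch above.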

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

In the end, winning is the only safety. - Kerr Avon


Re: [Python-Dev] Ext4 data loss

2009-03-10 Thread Cameron Simpson
On 11Mar2009 02:20, Antoine Pitrou solip...@pitrou.net wrote:
| Christian Heimes lists at cheimes.de writes:
|  I agree with you, fsync() shouldn't be called by default. I didn't plan
|  on adding fsync() calls all over our code. However I like to suggest a
|  file.sync() method and a synced flag for files to make the job of
|  application developers easier.
| 
| We already have os.fsync() and os.fdatasync(). Should the sync() (and
| datasync()?) method be added as an object-oriented convenience?

I can imagine plenty of occasions when there may not be an available
file descriptor to hand to os.fsync() et al. Having sync() and
datasync() methods in the object would obviate the need for the caller
to know the object internals.
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

I must construct my own System, or be enslaved to another Man's.
- William Blake