Re: [RFC] Parallelize IO for e2fsck
On Mon, 28 Jan 2008, Theodore Tso wrote:

> On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> > As user pages are always in highmem, this should be easy to decide:
> > only send SIGDANGER when highmem is full. (Yes, there are
> > inodes/dentries/file descriptors in lowmem, but I doubt apps will
> > respond to SIGDANGER by closing files).
>
> Good point; for a system with at least (say) 2GB of memory, that
> definitely makes sense. For a system with less than 768 megs of
> memory (how quaint, but it wasn't that long ago this was a lot of
> memory :-), there wouldn't *be* any memory in highmem at all

not to mention machines with 1G of ram (900M lowmem, 128M highmem)

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 05/26] mount options: fix afs
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> Add a .show_options super operation to afs.
>
> Use generic_show_options() and save the complete option string in
> afs_get_sb().

Sounds reasonable, but I can't test it till I get back from LCA.

David
Re: Kernel Event Notifications (was: [RFC] Parallelize IO for e2fsck)
On Sat, 2008-01-26 at 16:55 +0300, Al Boldi wrote:
> KOSAKI Motohiro wrote:
> > > > And from a performance point of view letting applications voluntarily
> > > > free some memory is better even than starting to swap.
> > >
> > > Absolutely.
> >
> > the mem_notify patch can realize "just before starting swapping"
> > notification :)

I looked at this a year or two back, then ran out of time. But the thing I
wanted to do was have libc's memory allocation routines extended to handle
these through reservations - the kernel should send a userspace notification
and then there should be some kind of concept of returning memory that's been
used for "opportunistic" userspace caching, e.g. in firefox to cache the last
10 web pages.

Let us know how you get on :)

Jon.
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
Jan Kara wrote:
> On Sat 26-01-08 08:27:59, Al Boldi wrote:
> > Do you mean there is a locking problem?
>
> No, but if you write to an mmaped file, then we can find out only later
> we have dirty data in pages and we call writepage() on behalf of e.g.
> pdflush().

Ok, that's a special case, which we could code for, but it doesn't seem
worthwhile. In any case, child forks should inherit their parent's mode.

> > > And in case of DB, they use direct-io
> > > anyway most of the time so they don't care about journaling mode
> > > anyway.
> >
> > Testing with sqlite3 and mysql4 shows that performance drastically
> > improves with writeback writeout.
>
> And do you have the databases configured to use direct IO or not?

I don't think so, but these tests are only meant to expose the underlying
problem which needs to be fixed, while this RFC proposes a useful workaround.

In another post Jan Kara wrote:
> Hmm, if you're willing to test patches, then you could try a debug
> patch: http://bugzilla.kernel.org/attachment.cgi?id=14574
> and send me the output. What kind of load do you observe problems with
> and which problems exactly?

8M-record insert into indexed db-table:

            ordered    writeback
sqlite3:    75m22s     8m45s
mysql4 :    23m35s     5m29s

Also, see the 'konqueror deadlocks in 2.6.22' thread.

Thanks!

--
Al
Re: [RFC] Parallelize IO for e2fsck
On Mon 2008-01-28 14:56:33, Theodore Tso wrote:
> On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> >
> > As user pages are always in highmem, this should be easy to decide:
> > only send SIGDANGER when highmem is full. (Yes, there are
> > inodes/dentries/file descriptors in lowmem, but I doubt apps will
> > respond to SIGDANGER by closing files).
>
> Good point; for a system with at least (say) 2GB of memory, that
> definitely makes sense. For a system with less than 768 megs of
> memory (how quaint, but it wasn't that long ago this was a lot of
> memory :-), there wouldn't *be* any memory in highmem at all

Ok, so it is 'send SIGDANGER when all zones are low', because user
allocations can go from all zones (unless you have something really
exotic, I'm not sure if that is true on huge NUMA machines & similar).

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
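[Archive note] The rule proposed above, 'send SIGDANGER when all zones are low', can be sketched as a small predicate. This is only an illustration under stated assumptions: `struct zone_state`, its fields and `all_zones_low()` are hypothetical names, not the kernel's `struct zone`, and Linux has no SIGDANGER signal. Empty zones (such as HighMem on a small-memory x86 machine) are skipped so that they cannot block or trigger the notification on their own.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone snapshot; NOT the kernel's struct zone. */
struct zone_state {
    const char *name;
    unsigned long present_pages;  /* 0 means the zone is empty */
    unsigned long free_pages;
    unsigned long low_watermark;
};

/* True only if every populated zone is at or below its low watermark. */
bool all_zones_low(const struct zone_state *zones, size_t n)
{
    bool saw_populated = false;

    for (size_t i = 0; i < n; i++) {
        if (zones[i].present_pages == 0)
            continue;               /* e.g. HighMem with no pages at all */
        saw_populated = true;
        if (zones[i].free_pages > zones[i].low_watermark)
            return false;           /* this zone can still serve allocations */
    }
    return saw_populated;
}
```

With zones DMA, Normal and an empty HighMem, the predicate stays false while Normal still has free pages above its watermark, and becomes true only once every populated zone has dropped to or below it.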
Re: [RFC] Parallelize IO for e2fsck
On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
>
> As user pages are always in highmem, this should be easy to decide:
> only send SIGDANGER when highmem is full. (Yes, there are
> inodes/dentries/file descriptors in lowmem, but I doubt apps will
> respond to SIGDANGER by closing files).

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense. For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all

- Ted
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Mon, 2008-01-28 at 12:33 -0600, Steve French wrote:
> On Jan 28, 2008 2:17 AM, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > > I completely agree. If one thread writes A and another writes B then the
> > > kernel should record either A or B, not ((A & 0x) | (B &
> > > 0x))
> >
> > The problem is pretty nasty unfortunately. To solve it properly I think
> > the file_operations->read/write prototypes would need to be changed
> > because otherwise it is not possible to do atomic relative updates
> > of f_pos. Right now the actual update is burrowed deeply in the low level
> > read/write implementation. But that would be a huge impact all over
> > the tree :/
>
> If there were a wrapper around reads and writes of f_pos as there is
> for i_size e.g. it would hit a lot of code, but not as many as I had
> originally thought. The most important ones are in the vfs itself, where
> there are only 59 uses of the field (not all need to be changed). ext3
> has fewer (25), and cifs only 12 uses.

Most of the uses in ext3 and cifs deal with a directory's f_pos in
readdir, which is protected by i_mutex, so I don't think we need to
worry about them at all.

--
David Kleikamp
IBM Linux Technology Center
Re: [RFC] Parallelize IO for e2fsck
Hi!

> It's been discussed before, but I suspect the main reason why it was
> never done is no one submitted a patch. Also, the problem is actually
> a pretty complex one. There are a couple of different stages where
> you might want to send an alert to processes:
>
>	* Data is starting to get ejected from page/buffer cache
>	* System is starting to swap
>	* System is starting to really struggle to find memory
>	* System is starting an out-of-memory killer
>
> AIX's SIGDANGER really did the last two, where the OOM killer would
> tend to avoid processes that had a SIGDANGER handler in favor of
> processes that were SIGDANGER unaware.
>
> Then there is the additional complexity in Linux that you have
> multiple zones of memory, which at least on the historically more
> popular x86 was highly, highly important. You could say that whenever
> there is sufficient memory pressure in any zone that you start
> ejecting data from caches or start to swap that you start sending the
> signals --- but on x86 systems with lowmem, that could happen quite
> frequently, and since a user process has no idea whether its resources
> are in lowmem or highmem, there's not much you can do about this.

As user pages are always in highmem, this should be easy to decide:
only send SIGDANGER when highmem is full. (Yes, there are
inodes/dentries/file descriptors in lowmem, but I doubt apps will
respond to SIGDANGER by closing files).

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Jan 28, 2008 2:17 AM, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > I completely agree. If one thread writes A and another writes B then the
> > kernel should record either A or B, not ((A & 0x) | (B &
> > 0x))
>
> The problem is pretty nasty unfortunately. To solve it properly I think
> the file_operations->read/write prototypes would need to be changed
> because otherwise it is not possible to do atomic relative updates
> of f_pos. Right now the actual update is burrowed deeply in the low level
> read/write implementation. But that would be a huge impact all over
> the tree :/

If there were a wrapper around reads and writes of f_pos as there is
for i_size e.g. it would hit a lot of code, but not as many as I had
originally thought. The most important ones are in the vfs itself, where
there are only 59 uses of the field (not all need to be changed). ext3
has fewer (25), and cifs only 12 uses.

--
Thanks,

Steve
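[Archive note] The kernel's i_size_read()/i_size_write() helpers protect the 64-bit i_size on 32-bit SMP builds with a sequence counter. A wrapper of the kind suggested above could follow the same idea. The following is a minimal userspace sketch using C11 atomics, not kernel code: `struct pos64`, `pos_write()` and `pos_read()` are hypothetical names, and the 64-bit value is deliberately stored as two 32-bit halves to mimic a 32-bit machine.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit file position on a 32-bit machine. */
struct pos64 {
    atomic_uint seq;              /* even: stable, odd: write in progress */
    atomic_uint_least32_t lo;
    atomic_uint_least32_t hi;
};

/* Writers must already be serialized against each other by the caller. */
void pos_write(struct pos64 *p, uint64_t v)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&p->hi, (uint32_t)(v >> 32), memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Readers retry until they see the same even sequence before and after. */
uint64_t pos_read(struct pos64 *p)
{
    unsigned s1, s2;
    uint32_t lo, hi;

    do {
        s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
        lo = atomic_load_explicit(&p->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&p->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s1 & 1u) || s1 != s2);

    return ((uint64_t)hi << 32) | lo;
}
```

Such a wrapper only guarantees that a reader never observes a half-updated value. It does not make a relative update (read, add, write back) atomic, which is the harder problem Andi describes; that would still need a lock or a change to where f_pos is updated.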
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Sat 26-01-08 08:27:43, Al Boldi wrote:
> Diego Calleja wrote:
> > On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> wrote:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes. But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> >
> > There's a related bug in bugzilla:
> > http://bugzilla.kernel.org/show_bug.cgi?id=9546
> >
> > The diagnostic from Jan Kara is different though, but I think it may be
> > the same problem...
> >
> > "One process does data-intensive load. Thus in the ordered mode the
> > transaction is tiny but has tons of data buffers attached. If commit
> > happens, it takes a long time to sync all the data before the commit
> > can proceed... In the writeback mode, we don't wait for data buffers, in
> > the journal mode amount of data to be written is really limited by the
> > maximum size of a transaction and so we write by much smaller chunks
> > and better latency is thus ensured."
> >
> >
> > I'm hitting this bug too...it's surprising that there's not many people
> > reporting more bugs about this, because it's really annoying.
> >
> >
> > There's a patch by Jan Kara (that I'm including here because bugzilla
> > didn't include it and took me a while to find it) which I don't know if
> > it's supposed to fix the problem , but it'd be interesting to try:
>
> Thanks a lot, but it doesn't fix it.

Hmm, if you're willing to test patches, then you could try a debug
patch: http://bugzilla.kernel.org/attachment.cgi?id=14574
and send me the output. What kind of load do you observe problems with
and which problems exactly?

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Sat 26-01-08 08:27:59, Al Boldi wrote:
> Jan Kara wrote:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes. But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> > >
> > > data=writeback mode alleviates data=order mode slowdowns, but only works
> > > per-mount and is too dangerous to run as a default mode.
> > >
> > > This RFC proposes to introduce a tunable which allows to disable fsync
> > > and changes ordered into writeback writeout on a per-process basis like
> > > this:
> > >
> > > echo 1 > /proc/`pidof process`/softsync
> >
> > I guess disabling fsync() was already commented on enough. Regarding
> > switching to writeback mode on per-process basis - not easily possible
> > because sometimes data is not written out by the process which stored
> > them (think of mmaped file).
>
> Do you mean there is a locking problem?

No, but if you write to an mmaped file, then we can find out only later
we have dirty data in pages and we call writepage() on behalf of e.g.
pdflush().

> > And in case of DB, they use direct-io
> > anyway most of the time so they don't care about journaling mode anyway.
>
> Testing with sqlite3 and mysql4 shows that performance drastically improves
> with writeback writeout.

And do you have the databases configured to use direct IO or not?

Honza
--
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS
and so then dmesg .. -- Thanks, Oliver Initializing cgroup subsys cpuset Linux version 2.6.24-szami2 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #2 SMP Sun Jan 27 01:47:58 CET 2008 BIOS-provided physical RAM map: BIOS-e820: - 0009fc00 (usable) BIOS-e820: 0009fc00 - 000a (reserved) BIOS-e820: 000e8000 - 0010 (reserved) BIOS-e820: 0010 - 1ff3 (usable) BIOS-e820: 1ff3 - 1ff4 (ACPI data) BIOS-e820: 1ff4 - 1fff (ACPI NVS) BIOS-e820: 1fff - 2000 (reserved) BIOS-e820: ffb8 - 0001 (reserved) 0MB HIGHMEM available. 511MB LOWMEM available. found SMP MP-table at 000ff780 Entering add_active_range(0, 0, 130864) 0 entries of 256 used Zone PFN ranges: DMA 0 -> 4096 Normal 4096 -> 130864 HighMem130864 -> 130864 Movable zone start PFN for each node early_node_map[1] active PFN ranges 0:0 -> 130864 On node 0 totalpages: 130864 DMA zone: 56 pages used for memmap DMA zone: 0 pages reserved DMA zone: 4040 pages, LIFO batch:0 Normal zone: 1733 pages used for memmap Normal zone: 125035 pages, LIFO batch:31 HighMem zone: 0 pages used for memmap Movable zone: 0 pages used for memmap DMI 2.3 present. ACPI: RSDP 000F9E30, 0021 (r2 ACPIAM) ACPI: XSDT 1FF30100, 003C (r1 A M I OEMXSDT 1414 MSFT 97) ACPI: FACP 1FF30290, 00F4 (r3 A M I OEMFACP 1414 MSFT 97) ACPI: DSDT 1FF303F0, 3779 (r1 P4C8B P4C8B106 106 INTL 2002026) ACPI: FACS 1FF4, 0040 ACPI: APIC 1FF30390, 005C (r1 A M I OEMAPIC 1414 MSFT 97) ACPI: OEMB 1FF40040, 003F (r1 A M I OEMBIOS 1414 MSFT 97) ACPI: PM-Timer IO Port: 0x808 ACPI: Local APIC address 0xfee0 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled) Processor #0 15:2 APIC version 20 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) Processor #1 15:2 APIC version 20 ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0]) IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 used by override. 
ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. Enabling APIC mode: Flat. Using 1 I/O APICs Using ACPI (MADT) for SMP configuration information Allocating PCI resources starting at 3000 (gap: 2000:dfb8) Built 1 zonelists in Zone order, mobility grouping on. Total pages: 129075 Kernel command line: BOOT_IMAGE=deb_s2.6.24 ro root=803 1 mapped APIC to b000 (fee0) mapped IOAPIC to a000 (fec0) Enabling fast FPU save and restore... done. Enabling unmasked SIMD FPU exception support... done. Initializing CPU#0 PID hash table entries: 2048 (order: 11, 8192 bytes) Detected 3150.239 MHz processor. Console: colour VGA+ 132x44 console [tty0] enabled Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar ... MAX_LOCKDEP_SUBCLASSES:8 ... MAX_LOCK_DEPTH: 30 ... MAX_LOCKDEP_KEYS:2048 ... CLASSHASH_SIZE: 1024 ... MAX_LOCKDEP_ENTRIES: 8192 ... MAX_LOCKDEP_CHAINS: 16384 ... CHAINHASH_SIZE: 8192 memory used by lock dependency info: 1024 kB per task-struct memory footprint: 1680 bytes | Locking API testsuite: | spin |wlock |rlock |mutex | wsem | rsem | -- A-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | double unlock: ok | ok | ok | ok | ok | ok | initialize held: ok | ok | ok | ok | ok | ok | bad unlock order: ok | ok | ok | ok | ok | ok | -- recursive read-lock: | ok | | ok | recursive read-lock #2: | ok | | ok | mixed read-write-lock: | ok | | ok | mixed write-read-lock: | ok | | ok | -- hard-irqs-on + irq-safe-A/12: ok | ok | ok | soft-irqs-on + irq-safe-A/12: o
Re: [patch 21/26] mount options: partially fix nfs
On Jan 28, 2008, at 6:34 AM, Miklos Szeredi wrote:
> > > All mount options should be shown, which are needed to reconstruct a
> > > previous mount.
> >
> > Ah, OK.
> >
> > I'm happy to implement logic to display the all missing options. I
> > should have updated nfs_show_mount_options() when I wrote the NFS
> > mount option parser.
> >
> > Let me know your preference.
>
> You are more familiar with NFS, so I think it would be better if you
> updated nfs_show_mount_options(). Could you also queue my patch
> (updated) or incorporate it into a combined fix?

Yes. I'll have time in a day or two to get this finished.

> Thanks,
> Miklos
>
> Subject: mount options: partially fix nfs
>
> From: Miklos Szeredi <[EMAIL PROTECTED]>
>
> Add posix, bsize=, namelen= options to /proc/mounts for nfs
> filesystems. Document several other options that are still missing.
>
> Changes:
>  - display namelen= unconditionally
>  - addr= isn't missing after all
>
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> CC: Trond Myklebust <[EMAIL PROTECTED]>
> ---
>
> Index: linux/fs/nfs/super.c
> ===
> --- linux.orig/fs/nfs/super.c	2008-01-25 15:44:56.0 +0100
> +++ linux/fs/nfs/super.c	2008-01-25 15:57:32.0 +0100
> @@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
>  	} nfs_info[] = {
>  		{ NFS_MOUNT_SOFT, ",soft", ",hard" },
>  		{ NFS_MOUNT_INTR, ",intr", ",nointr" },
> +		{ NFS_MOUNT_POSIX, ",posix", "" },
>  		{ NFS_MOUNT_NOCTO, ",nocto", "" },
>  		{ NFS_MOUNT_NOAC, ",noac", "" },
>  		{ NFS_MOUNT_NONLM, ",nolock", "" },
> @@ -463,6 +464,9 @@ static void nfs_show_mount_options(struc
>  	seq_printf(m, ",vers=%d", clp->rpc_ops->version);
>  	seq_printf(m, ",rsize=%d", nfss->rsize);
>  	seq_printf(m, ",wsize=%d", nfss->wsize);
> +	seq_printf(m, ",namelen=%d", nfss->namelen);
> +	if (nfss->bsize != 0)
> +		seq_printf(m, ",bsize=%d", nfss->bsize);
>  	if (nfss->acregmin != 3*HZ || showdefaults)
>  		seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
>  	if (nfss->acregmax != 60*HZ || showdefaults)
> @@ -482,6 +486,17 @@ static void nfs_show_mount_options(struc
>  	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
>  	seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
>  	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
> +
> +	/*
> +	 * Missing options:
> +	 *	port=
> +	 *	mountport=
> +	 *	mountvers=
> +	 *	mountproto=
> +	 *	clientaddr=
> +	 *	mounthost=
> +	 *	mountaddr=
> +	 */
>  }
>
>  /*

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Mon, 28 Jan 2008 15:10:34 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:

> So you get overlapping reads. Probably not good.

This was discussed in the past, I think:

http://lkml.org/lkml/2006/4/13/124
http://lkml.org/lkml/2006/4/13/130
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Mon, 28 Jan 2008 15:10:34 +0100 Andi Kleen <[EMAIL PROTECTED]> wrote:

> On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > > Also worse really fixing it would be a major change to the VFS
> > > because of the way ->read/write are defined :/
> >
> > I don't see a problem there. ->read and ->write update the passed pointer
> > which is not the real f_pos anyway. Just the copies need fixing.
>
> They are effectually doing a decoupled read/modify/write cycle. e.g.:
>
> A               B
>
> read fpos
>                 read fpos
>
> fpos += A       fpos += B
> write fpos
>
>                 write fpos
>
> So you get overlapping reads. Probably not good.

No unix system I'm aware of cares about the read/write positioning during
parallel simultaneous reads or writes, with the exception of O_APPEND
which is strictly defined.

The problem case is getting fpos != either valid value.
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > Also worse really fixing it would be a major change to the VFS
> > because of the way ->read/write are defined :/
>
> I don't see a problem there. ->read and ->write update the passed pointer
> which is not the real f_pos anyway. Just the copies need fixing.

They are effectually doing a decoupled read/modify/write cycle. e.g.:

A               B

read fpos
                read fpos

fpos += A       fpos += B
write fpos

                write fpos

So you get overlapping reads. Probably not good.

-Andi
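[Archive note] The interleaving described above can be replayed deterministically in a single thread to see what ends up in the position. This is only an illustrative sketch: `racy_advance()` is a hypothetical helper, not kernel code, and it models the file position as a plain variable.

```c
#include <stdint.h>

/*
 * Replays the interleaving from the message in one thread:
 * both A and B read the position, then both write back their own sum.
 */
uint64_t racy_advance(uint64_t fpos, uint64_t a, uint64_t b)
{
    uint64_t seen_by_a = fpos;      /* A: read fpos  */
    uint64_t seen_by_b = fpos;      /* B: read fpos  */

    seen_by_a += a;                 /* A: fpos += A  */
    seen_by_b += b;                 /* B: fpos += B  */

    fpos = seen_by_a;               /* A: write fpos */
    fpos = seen_by_b;               /* B: write fpos, overwriting A's update */
    return fpos;
}
```

Starting at offset 100 with A transferring 10 bytes and B transferring 20, both start at offset 100 (the overlapping reads) and the final position is 120 rather than 130. The result is still one of the two values a thread computed, which is the distinction drawn elsewhere in the thread between this case and a torn 64-bit value.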
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
> Also worse really fixing it would be a major change to the VFS
> because of the way ->read/write are defined :/

I don't see a problem there. ->read and ->write update the passed pointer
which is not the real f_pos anyway. Just the copies need fixing.

Alan
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Monday 28 January 2008 13:56:05 Alan Cox wrote:
> > > No specific spec, just general quality of implementation.
> >
> > I completely agree. If one thread writes A and another writes B then the
> > kernel should record either A or B, not ((A & 0x) | (B &
> > 0x))
>
> Agree entirely: the spec doesn't allow for random scribbling in the wrong
> place. It doesn't cover which goes first or who "wins" the race but
> provides pwrite/pread for that situation. Writing somewhere unrelated is
> definitely not to spec

Actually it would probably -- i guess it's undefined and in undefined
country such things can happen.

Also to be fair I think it's only a problem for the 4GB wrapping case
which is presumably rare (otherwise we would have heard about it)

Also worse really fixing it would be a major change to the VFS
because of the way ->read/write are defined :/

-Andi
Re: [RFC] ext3 freeze feature
Hi,

> What you *could* do is to start putting processes to sleep if they
> attempt to write to the frozen filesystem, and then detect the
> deadlock case where the process holding the file descriptor used to
> freeze the filesystem gets frozen because it attempted to write to the
> filesystem --- at which point it gets some kind of signal (which
> defaults to killing the process), and the filesystem is unfrozen and
> as part of the unfreeze you wake up all of the processes that were put
> to sleep for touching the frozen filesystem.

I don't think close() usually writes to the journal, so I don't think it
causes the deadlock. Is there a special case in which close() writes to
the journal when the process gets a signal?

Cheers, Takashi
Re: [RFC] ext3 freeze feature
Hi,

Thank you for your comments.

> That's inherently unsafe - you can have multiple unfreezes
> running in parallel which seriously screws with the bdev semaphore
> count that is used to lock the device due to doing multiple up()s
> for every down.
>
> Your timeout thingy guarantees that at some point you will get
> multiple up()s occurring due to the timer firing racing with
> a thaw ioctl.
>
> If this interface is to be more widely exported, then it needs
> a complete revamp of how the bdev is locked while it is frozen so
> that there is no chance of a double up() ever occurring on the
> bd_mount_sem due to racing thaws.

My patch has the race condition, as you said.
I will fix it.

Cheers, Takashi
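[Archive note] One common way to rule out the double up() described above is to let the timeout and the thaw ioctl compete for a single state transition, so that only the winner releases the semaphore. The following is a userspace sketch of that idea with C11 atomics, offered under stated assumptions and not as the fix that was eventually applied: `struct fake_bdev`, `fake_freeze()` and `fake_thaw()` are hypothetical names, and `sem_count` merely models the count of bd_mount_sem.

```c
#include <stdatomic.h>

enum { THAWED, FROZEN, BUSY };

/* Hypothetical stand-in for a block device; sem_count models bd_mount_sem. */
struct fake_bdev {
    atomic_int state;       /* THAWED, FROZEN, or BUSY during a transition */
    atomic_int sem_count;   /* 1 = available, 0 = held by the freezer */
};

/* Returns 0 on success, -1 if the device is not currently thawed. */
int fake_freeze(struct fake_bdev *b)
{
    int expected = THAWED;

    if (!atomic_compare_exchange_strong(&b->state, &expected, BUSY))
        return -1;
    atomic_fetch_sub(&b->sem_count, 1);     /* models down() */
    atomic_store(&b->state, FROZEN);
    return 0;
}

/* Called by both the thaw ioctl and the timeout; only one caller wins. */
int fake_thaw(struct fake_bdev *b)
{
    int expected = FROZEN;

    if (!atomic_compare_exchange_strong(&b->state, &expected, BUSY))
        return -1;                          /* already thawed or in transition */
    atomic_fetch_add(&b->sem_count, 1);     /* the single matching up() */
    atomic_store(&b->state, THAWED);
    return 0;
}
```

After one freeze, the first thaw succeeds and a second, racing thaw returns -1 without touching the count, so every down() is paired with exactly one up().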
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
> > No specific spec, just general quality of implementation.
>
> I completely agree. If one thread writes A and another writes B then the
> kernel should record either A or B, not ((A & 0x) | (B &
> 0x))

Agree entirely: the spec doesn't allow for random scribbling in the wrong
place. It doesn't cover which goes first or who "wins" the race but
provides pwrite/pread for that situation. Writing somewhere unrelated is
definitely not to spec and not good.
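[Archive note] The hex masks in the quoted expression were lost in this archive. Assuming they select the upper and lower 32-bit halves of a 64-bit offset, which is an assumption and not something the archive confirms, the "neither A nor B" outcome can be reproduced with a small helper. `torn_mix()` is a hypothetical name used only for this sketch.

```c
#include <stdint.h>

/*
 * Value a reader could observe if it picks up the high word of A and the
 * low word of B. The two masks are assumed, not taken from the archive.
 */
uint64_t torn_mix(uint64_t a, uint64_t b)
{
    return (a & 0xffffffff00000000ULL) | (b & 0x00000000ffffffffULL);
}
```

For A = 0xfffffff0 (just below 4GB) and B = 0x100000010 (just above it), the mix is 0x10, an offset neither thread wrote. When both values lie below 4GB their high words agree and the mix is simply B, so the torn result only shows up when the two values straddle a 4GB boundary.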
Re: [patch 24/26] mount options: fix tmpfs
>
> Thanks Miklos, that's a welcome enhancement, nicely done. I've only
> noticed one thing wrong (MPOL_PREFERRED shown as "default"); but thought
> shmem_config didn't add much value - I'd rather avoid those syntactic
> changes to unchanged code; and several tmpfs defaults being relative
> (e.g. to totalram_pages, or to mounter's fsuid), I ended up preferring
> to do real tests in shmem_show_options.

I completely agree, this is much better than my version.

> Thus, for example, if memory is hotplugged in or out later, what started
> out as an unspecified size option will then get shown as explicit size.
> (I did think for a while that I wanted to show explicit size in all
> cases; but it looked pretty silly on udev.) I think that's the correct
> behaviour, that otherwise would be misleading; but I may be looking at
> this the wrong way round, what's your view?

I agree, this is the correct way. I'll add functions for calculating
the default max values, so the calculations won't accidentally become
different for the initialization and the option showing.

> If you agree with the version below, please take it into your collection
> and insert your Signed-off-by. I should admit, I've not yet tested how
> the NUMA policies look: you'll hear from me again tomorrow morning if
> those turn out to be wrong.

OK, I'll send this to Andrew. Maybe I'll wait until tomorrow to hear
if it's working on NUMA.

Thanks,
Miklos
Re: [patch 00/26] mount options: fix filesystem's ->show_options
>
> On Thu, 24 Jan 2008 20:33:41 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> > Andrew,
> >
> > Would you please consider these patches for -mm?
>
> Sure, but I'm too lazy to pick through them and work out which ones need
> updating, which ones got acked and which ones someone else merged, all on a
> very bumpy plane flight ;)
>
> Please resend when the dust has settled?

Yes, I should have thought, it won't quite work in a single iteration :)

I'll resend them in a moment.

Thanks,
Miklos
Re: [patch 21/26] mount options: partially fix nfs
> > All mount options should be shown, which are needed to reconstruct a
> > previous mount.
>
> Ah, OK.
>
> I'm happy to implement logic to display the all missing options. I
> should have updated nfs_show_mount_options() when I wrote the NFS
> mount option parser.
>
> Let me know your preference.

You are more familiar with NFS, so I think it would be better if you
updated nfs_show_mount_options(). Could you also queue my patch
(updated) or incorporate it into a combined fix?

Thanks,
Miklos

Subject: mount options: partially fix nfs

From: Miklos Szeredi <[EMAIL PROTECTED]>

Add posix, bsize=, namelen= options to /proc/mounts for nfs
filesystems. Document several other options that are still missing.

Changes:
 - display namelen= unconditionally
 - addr= isn't missing after all

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
CC: Trond Myklebust <[EMAIL PROTECTED]>
---

Index: linux/fs/nfs/super.c
===
--- linux.orig/fs/nfs/super.c	2008-01-25 15:44:56.0 +0100
+++ linux/fs/nfs/super.c	2008-01-25 15:57:32.0 +0100
@@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
 	} nfs_info[] = {
 		{ NFS_MOUNT_SOFT, ",soft", ",hard" },
 		{ NFS_MOUNT_INTR, ",intr", ",nointr" },
+		{ NFS_MOUNT_POSIX, ",posix", "" },
 		{ NFS_MOUNT_NOCTO, ",nocto", "" },
 		{ NFS_MOUNT_NOAC, ",noac", "" },
 		{ NFS_MOUNT_NONLM, ",nolock", "" },
@@ -463,6 +464,9 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",vers=%d", clp->rpc_ops->version);
 	seq_printf(m, ",rsize=%d", nfss->rsize);
 	seq_printf(m, ",wsize=%d", nfss->wsize);
+	seq_printf(m, ",namelen=%d", nfss->namelen);
+	if (nfss->bsize != 0)
+		seq_printf(m, ",bsize=%d", nfss->bsize);
 	if (nfss->acregmin != 3*HZ || showdefaults)
 		seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
 	if (nfss->acregmax != 60*HZ || showdefaults)
@@ -482,6 +486,17 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
 	seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
 	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+
+	/*
+	 * Missing options:
+	 *	port=
+	 *	mountport=
+	 *	mountvers=
+	 *	mountproto=
+	 *	clientaddr=
+	 *	mounthost=
+	 *	mountaddr=
+	 */
 }
 
 /*
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
Trond Myklebust <[EMAIL PROTECTED]> wrote:
> On Mon, 2008-01-28 at 05:38 +0100, Andi Kleen wrote:
>> On Monday 28 January 2008 05:13:09 Trond Myklebust wrote:
>> > On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote:

>> > > The problem is that it's not a race in who gets to do its thing first,
>> > > but a parallel reader can actually see a corrupted value from the two
>> > > independent words on 32bit (e.g. during a 4GB). And this could actually
>> > > completely corrupt f_pos when it happens with two racing relative seeks
>> > > or read/write()s
>> > >
>> > > I would consider that a bug.
>> >
>> > I disagree. The corruption occurs because this isn't a situation that is
>> > allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring
>> > to here?
>>
>> No specific spec, just general quality of implementation. We normally don't
>> have non thread safe system calls even if it was in theory allowed by some
>> specification.
>
> We've had the existing implementation for quite some time. The arguments
> against changing it have been the same all along: if your application
> wants to share files between threads, the portability argument implies
> that you should either use pread/pwrite or use a mutex or some other
> form of synchronisation primitive in order to ensure that
> lseek()/read()/write() do not overlap.

Does anything in the kernel depend on f_pos being valid?
E.g. is it possible to read beyond the EOF using this race, or to have files
larger than the ulimit?

If not, update the manpage and be done.
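[Archive note] The pread/pwrite advice quoted above works because each call carries its own offset, so threads sharing one descriptor never touch the shared file position. A minimal sketch, assuming a POSIX system: `read_full_at()` is a hypothetical helper name, while pread() itself is the standard POSIX call.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes starting at absolute offset off.
 * The shared file position (f_pos) is neither used nor modified.
 * Returns the number of bytes read (short at end of file), or -1 on error.
 */
ssize_t read_full_at(int fd, void *buf, size_t len, off_t off)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the same range */
            return -1;
        }
        if (n == 0)
            break;                  /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Each thread keeps its own offset variable and passes it in, which sidesteps both the lost-update and the torn-value problems discussed in this thread, at the cost of the application tracking positions itself.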
[2.6.24 REGRESSION] BUG: Soft lockup - with VFS
Hi all!

With 2.6.24 I get some soft lockups with a USB phone: when I plug in the
mobile, the VFS layer crashes. I can send the .config this afternoon, and
I will bisect the kernel when I have time.

Pictures from the crash:
http://students.zipernowsky.hu/~oliverp/kernel/regression_2624/

--
Thanks,
Oliver
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
> I completely agree. If one thread writes A and another writes B then the
> kernel should record either A or B, not ((A & 0x) | (B &
> 0x))

The problem is pretty nasty unfortunately. To solve it properly I think
the file_operations->read/write prototypes would need to be changed
because otherwise it is not possible to do atomic relative updates
of f_pos. Right now the actual update is burrowed deeply in the low level
read/write implementation. But that would be a huge impact all over
the tree :/

Or maybe define a new read/write64 and keep the default as 32bit only -- I
suppose most users don't really need 64bit. Still would be a nasty API change.

-Andi