date:20080128

Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread david


On Mon, 28 Jan 2008, Theodore Tso wrote:
> On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> >
> > As user pages are always in highmem, this should be easy to decide:
> > only send SIGDANGER when highmem is full. (Yes, there are
> > inodes/dentries/file descriptors in lowmem, but I doubt apps will
> > respond to SIGDANGER by closing files).
>
> Good point; for a system with at least (say) 2GB of memory, that
> definitely makes sense.  For a system with less than 768 megs of
> memory (how quaint, but it wasn't that long ago this was a lot of
> memory :-), there wouldn't *be* any memory in highmem at all

not to mention machines with 1G of ram (900M lowmem, 128M highmem)
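
The arithmetic behind these figures can be sketched as below. It assumes the common 3G/1G split on 32-bit x86, where roughly 896 MB of RAM is directly mapped as lowmem and the rest becomes highmem. The 896 MB constant and the helper names are illustrative assumptions, not something stated in the thread.

```c
/* Assumed lowmem ceiling (MB) for the default 3G/1G split on 32-bit x86. */
#define LOWMEM_LIMIT_MB 896u

/* RAM (MB) that ends up in highmem on a machine with total_mb of RAM. */
static unsigned highmem_mb(unsigned total_mb)
{
    return total_mb > LOWMEM_LIMIT_MB ? total_mb - LOWMEM_LIMIT_MB : 0;
}

/* RAM (MB) that stays in lowmem. */
static unsigned lowmem_mb(unsigned total_mb)
{
    return total_mb - highmem_mb(total_mb);
}
```

Under that assumption a 1G machine gets 128 MB of highmem, and a machine with 768 MB or less gets none, so a "highmem is full" test would be either very sensitive or meaningless there.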

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 05/26] mount options: fix afs

2008-01-28 Thread David Howells

Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> Add a .show_options super operation to afs.
> 
> Use generic_show_options() and save the complete option string in
> afs_get_sb().

Sounds reasonable, but I can't test it till I get back from LCA.

David

Re: Kernel Event Notifications (was: [RFC] Parallelize IO for e2fsck)

2008-01-28 Thread Jon Masters

On Sat, 2008-01-26 at 16:55 +0300, Al Boldi wrote:
> KOSAKI Motohiro wrote:
> > > > And from a performance point of view letting applications voluntarily
> > > > free some memory is better even than starting to swap.
> > >
> > > Absolutely.
> >
> > the mem_notify patch can realize "just before starting swapping"
> > notification :)

I looked at this a year or two back, then ran out of time. But the thing
I wanted to do was have libc's memory allocation routines extended to
handle these through reservations - the kernel should send a userspace
notification and then there should be some kind of concept of returning
memory that's been used for "opportunistic" userspace caching, e.g. in
firefox to cache the last 10 web pages. Let us know how you get on :)
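
A minimal userspace sketch of that idea follows. Linux has no SIGDANGER, so SIGUSR1 is used purely as a stand-in for whatever notification the kernel would send; the cache layout and all function names are hypothetical. The handler only sets a flag, and the actual freeing happens later from normal program context.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>

#define CACHE_SLOTS 10

static void *page_cache[CACHE_SLOTS];        /* opportunistic cache      */
static volatile sig_atomic_t pressure_seen;  /* set by the signal handler */

/* Async-signal-safe handler: only record that pressure was reported. */
static void on_pressure(int signo)
{
    (void)signo;
    pressure_seen = 1;
}

/* Install the handler; SIGUSR1 stands in for a SIGDANGER-like signal. */
static int install_pressure_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = on_pressure;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Fill one cache slot (hypothetical helper). Returns 0 on success. */
static int cache_put(int slot, size_t size)
{
    free(page_cache[slot]);
    page_cache[slot] = malloc(size);
    return page_cache[slot] ? 0 : -1;
}

/* Called from the main loop: drop the cache if pressure was signalled.
 * Returns the number of slots that were released. */
static int release_cache_if_pressured(void)
{
    int freed = 0;

    if (!pressure_seen)
        return 0;
    pressure_seen = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (page_cache[i]) {
            free(page_cache[i]);
            page_cache[i] = NULL;
            freed++;
        }
    }
    return freed;
}
```

Whether freed heap memory actually goes back to the kernel depends on the allocator, which is presumably why the reservation logic was meant to live in libc itself.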

Jon.


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Al Boldi

Jan Kara wrote:
> On Sat 26-01-08 08:27:59, Al Boldi wrote:
> > Do you mean there is a locking problem?
>
>   No, but if you write to an mmaped file, then we can find out only later
> we have dirty data in pages and we call writepage() on behalf of e.g.
> pdflush().

Ok, that's a special case, which we could code for, but it doesn't seem
worthwhile.  In any case, forked children should inherit their parent's mode.

> > > And in case of DB, they use direct-io
> > > anyway most of the time so they don't care about journaling mode
> > > anyway.
> >
> > Testing with sqlite3 and mysql4 shows that performance drastically
> > improves with writeback writeout.
>
>   And do you have the databases configured to use direct IO or not?

I don't think so, but these tests are only meant to expose the underlying 
problem which needs to be fixed, while this RFC proposes a useful 
workaround.

In another post Jan Kara wrote:
>   Hmm, if you're willing to test patches, then you could try a debug
> patch: http://bugzilla.kernel.org/attachment.cgi?id=14574
>   and send me the output. What kind of load do you observe problems with
> and which problems exactly?

8M-record insert into indexed db-table:
          ordered   writeback
sqlite3:  75m22s    8m45s
mysql4 :  23m35s    5m29s
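
To put those timings in perspective, the ratio of ordered to writeback elapsed time can be worked out as in this small sketch (helper names are illustrative only):

```c
/* Convert an "XmYs" wall-clock time to seconds. */
static unsigned to_seconds(unsigned minutes, unsigned seconds)
{
    return minutes * 60 + seconds;
}

/* Slowdown of ordered relative to writeback, as a ratio of elapsed times. */
static double slowdown(unsigned ordered_s, unsigned writeback_s)
{
    return (double)ordered_s / (double)writeback_s;
}
```

For sqlite3 this is 4522 s / 525 s, roughly 8.6x; for mysql4 it is 1415 s / 329 s, roughly 4.3x.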

Also, see the 'konqueror deadlocks in 2.6.22' thread.


Thanks!

--
Al


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Pavel Machek

On Mon 2008-01-28 14:56:33, Theodore Tso wrote:
> On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> > 
> > As user pages are always in highmem, this should be easy to decide:
> > only send SIGDANGER when highmem is full. (Yes, there are
> > inodes/dentries/file descriptors in lowmem, but I doubt apps will
> > respond to SIGDANGER by closing files).
> 
> Good point; for a system with at least (say) 2GB of memory, that
> definitely makes sense.  For a system with less than 768 megs of
> memory (how quaint, but it wasn't that long ago this was a lot of
> memory :-), there wouldn't *be* any memory in highmem at all

Ok, so it is 'send SIGDANGER when all zones are low', because user
allocations can go from all zones (unless you have something really
exotic, I'm not sure if that is true on huge NUMA  machines & similar).
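
The "all zones are low" rule could look roughly like the sketch below. The struct is a simplified, hypothetical stand-in for per-zone state, not the kernel's struct zone, and the watermark comparison is an assumption about what "low" would mean.

```c
#include <stddef.h>

/* Hypothetical, simplified view of one memory zone. */
struct zone_state {
    const char *name;
    unsigned long free_pages;
    unsigned long low_watermark;
};

/* Return 1 only if every zone is at or below its low watermark,
 * i.e. no zone is left that a user allocation could fall back to. */
static int all_zones_low(const struct zone_state *zones, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (zones[i].free_pages > zones[i].low_watermark)
            return 0;
    return nr > 0;
}
```

With this rule a machine that has no highmem at all still works: the empty zone is simply not listed, and the decision rests on DMA and Normal.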

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Theodore Tso

On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> 
> As user pages are always in highmem, this should be easy to decide:
> only send SIGDANGER when highmem is full. (Yes, there are
> inodes/dentries/file descriptors in lowmem, but I doubt apps will
> respond to SIGDANGER by closing files).

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all

- Ted

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Dave Kleikamp


On Mon, 2008-01-28 at 12:33 -0600, Steve French wrote:
> On Jan 28, 2008 2:17 AM, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > > I completely agree.  If one thread writes A and another writes B then the
> > > kernel should record either A or B, not ((A & 0x) | (B &
> > > 0x))
> >
> > The problem is pretty nasty unfortunately. To solve it properly I think
> > the file_operations->read/write prototypes would need to be changed
> > because otherwise it is not possible to do atomic relative updates
> > of f_pos. Right now the actual update is burrowed deeply in the low level
> > read/write implementation. But that would be a huge impact all over
> > the tree :/
> 
> If there were a wrapper around reads and writes of f_pos as there is
> for i_size e.g. it would hit a lot of code, but not as many as I had
> originally thought.  the most important ones are in the vfs itself, where
> there are only 59 uses of the field (not all need to be changed).   ext3
> has fewer (25), and cifs only 12 uses.

Most of the uses in ext3 and cifs deal with a directory's f_pos in
readdir, which is protected by i_mutex, so I don't think we need to
worry about them at all.
-- 
David Kleikamp
IBM Linux Technology Center


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Pavel Machek

Hi!

> It's been discussed before, but I suspect the main reason why it was
> never done is no one submitted a patch.  Also, the problem is actually
> a pretty complex one.  There are a couple of different stages where
> you might want to send an alert to processes:
> 
> * Data is starting to get ejected from page/buffer cache
> * System is starting to swap
> * System is starting to really struggle to find memory
> * System is starting an out-of-memory killer
> 
> AIX's SIGDANGER really did the last two, where the OOM killer would
> tend to avoid processes that had a SIGDANGER handler in favor of
> processes that were SIGDANGER unaware.
> 
> Then there is the additional complexity in Linux that you have
> multiple zones of memory, which at least on the historically more
> popular x86 was highly, highly important.  You could say that whenever
> there is sufficient memory pressure in any zone that you start
> ejecting data from caches or start to swap that you start sending the
> signals --- but on x86 systems with lowmem, that could happen quite
> frequently, and since a user process has no idea whether its resources
> are in lowmem or highmem, there's not much you can do about this.

As user pages are always in highmem, this should be easy to decide:
only send SIGDANGER when highmem is full. (Yes, there are
inodes/dentries/file descriptors in lowmem, but I doubt apps will
respond to SIGDANGER by closing files).
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Steve French

On Jan 28, 2008 2:17 AM, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > I completely agree.  If one thread writes A and another writes B then the
> > kernel should record either A or B, not ((A & 0x) | (B &
> > 0x))
>
> The problem is pretty nasty unfortunately. To solve it properly I think
> the file_operations->read/write prototypes would need to be changed
> because otherwise it is not possible to do atomic relative updates
> of f_pos. Right now the actual update is burrowed deeply in the low level
> read/write implementation. But that would be a huge impact all over
> the tree :/

If there were a wrapper around reads and writes of f_pos as there is
for i_size e.g. it would hit a lot of code, but not as many as I had
originally thought.  the most important ones are in the vfs itself, where
there are only 59 uses of the field (not all need to be changed).   ext3
has fewer (25), and cifs only 12 uses.
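
For readers unfamiliar with the i_size precedent: the kernel's i_size_read()/i_size_write() helpers guard the 64-bit size with a sequence counter on 32-bit SMP builds. A userspace analogy of such a wrapper, written with C11 atomics, might look like the sketch below. The struct and function names are hypothetical, and like the i_size helpers it assumes writers are already serialised against each other by the caller.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit position stored as two 32-bit halves plus a sequence counter. */
struct pos64 {
    atomic_uint seq;        /* odd while a write is in progress */
    _Atomic uint32_t lo;
    _Atomic uint32_t hi;
};

static void pos_init(struct pos64 *p, uint64_t v)
{
    atomic_init(&p->seq, 0);
    atomic_init(&p->lo, (uint32_t)v);
    atomic_init(&p->hi, (uint32_t)(v >> 32));
}

/* Writers must be serialised externally (one writer at a time). */
static void pos_write(struct pos64 *p, uint64_t v)
{
    atomic_fetch_add(&p->seq, 1);               /* counter becomes odd  */
    atomic_store(&p->lo, (uint32_t)v);
    atomic_store(&p->hi, (uint32_t)(v >> 32));
    atomic_fetch_add(&p->seq, 1);               /* counter becomes even */
}

/* Retry until both halves were read under one stable, even sequence. */
static uint64_t pos_read(struct pos64 *p)
{
    unsigned s1, s2;
    uint32_t lo, hi;

    do {
        s1 = atomic_load(&p->seq);
        lo = atomic_load(&p->lo);
        hi = atomic_load(&p->hi);
        s2 = atomic_load(&p->seq);
    } while ((s1 & 1u) || s1 != s2);

    return ((uint64_t)hi << 32) | lo;
}
```

A wrapper of this kind only prevents a reader from seeing a half-updated value. It does not make a relative update (read, add, write back) atomic, which is the part Andi describes as needing a prototype change.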


-- 
Thanks,

Steve

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Jan Kara

On Sat 26-01-08 08:27:43, Al Boldi wrote:
> Diego Calleja wrote:
> > On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> wrote:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> >
> > There's a related bug in bugzilla:
> > http://bugzilla.kernel.org/show_bug.cgi?id=9546
> >
> > The diagnostic from Jan Kara is different though, but I think it may be
> > the same problem...
> >
> > "One process does data-intensive load. Thus in the ordered mode the
> > transaction is tiny but has tons of data buffers attached. If commit
> > happens, it takes a long time to sync all the data before the commit
> > can proceed... In the writeback mode, we don't wait for data buffers, in
> > the journal mode amount of data to be written is really limited by the
> > maximum size of a transaction and so we write by much smaller chunks
> > and better latency is thus ensured."
> >
> >
> > I'm hitting this bug too...it's surprising that there's not many people
> > reporting more bugs about this, because it's really annoying.
> >
> >
> > There's a patch by Jan Kara (that I'm including here because bugzilla
> > didn't include it and took me a while to find it) which I don't know if
> > it's supposed to fix the problem , but it'd be interesting to try:
> 
> Thanks a lot, but it doesn't fix it.
  Hmm, if you're willing to test patches, then you could try a debug patch:
http://bugzilla.kernel.org/attachment.cgi?id=14574
  and send me the output. What kind of load do you observe problems with
and which problems exactly?

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Jan Kara

On Sat 26-01-08 08:27:59, Al Boldi wrote:
> Jan Kara wrote:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> > >
> > > data=writeback mode alleviates data=ordered mode slowdowns, but only works
> > > per-mount and is too dangerous to run as a default mode.
> > >
> > > This RFC proposes to introduce a tunable which allows to disable fsync
> > > and changes ordered into writeback writeout on a per-process basis like
> > > this:
> > >
> > >   echo 1 > /proc/`pidof process`/softsync
> >
> >   I guess disabling fsync() was already commented on enough. Regarding
> > switching to writeback mode on per-process basis - not easily possible
> > because sometimes data is not written out by the process which stored
> > them (think of mmaped file).
> 
> Do you mean there is a locking problem?
  No, but if you write to an mmaped file, then we can find out only later
we have dirty data in pages and we call writepage() on behalf of e.g.
pdflush().
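
A small userspace sketch of why this is hard to attribute: data stored through a shared mapping reaches the page cache without any write() call by the process, so writeback happens later in some other context unless the process asks for it. The function name and temporary path are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Dirty one page of a temporary file through a shared mapping.
 * Returns 0 on success, -1 on failure. */
static int dirty_via_mmap(void)
{
    char path[] = "/tmp/mmap-demo-XXXXXX";
    const size_t len = 4096;
    int rc = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                    /* file disappears once fd is closed */

    if (ftruncate(fd, (off_t)len) == 0) {
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            /* A plain memory store: no write() or fsync() is issued. */
            memcpy(map, "dirty", 5);
            /* msync() explicitly requests writeback of the dirty page. */
            rc = msync(map, len, MS_SYNC);
            munmap(map, len);
        }
    }
    close(fd);
    return rc;
}
```

Without the msync() call the page would simply stay dirty until background writeback picked it up, which is the case Jan describes.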

> > And in case of DB, they use direct-io
> > anyway most of the time so they don't care about journaling mode anyway.
> 
> Testing with sqlite3 and mysql4 shows that performance drastically improves 
> with writeback writeout.
  And do you have the databases configured to use direct IO or not?

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR

Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS

2008-01-28 Thread Oliver Pinter (Pintér Olivér)

And here is the dmesg output:

-- 
Thanks,
Oliver
Initializing cgroup subsys cpuset
Linux version 2.6.24-szami2 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 
(prerelease) (Debian 4.1.1-21)) #2 SMP Sun Jan 27 01:47:58 CET 2008
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000e8000 - 0010 (reserved)
 BIOS-e820: 0010 - 1ff3 (usable)
 BIOS-e820: 1ff3 - 1ff4 (ACPI data)
 BIOS-e820: 1ff4 - 1fff (ACPI NVS)
 BIOS-e820: 1fff - 2000 (reserved)
 BIOS-e820: ffb8 - 0001 (reserved)
0MB HIGHMEM available.
511MB LOWMEM available.
found SMP MP-table at 000ff780
Entering add_active_range(0, 0, 130864) 0 entries of 256 used
Zone PFN ranges:
  DMA 0 -> 4096
  Normal   4096 ->   130864
  HighMem130864 ->   130864
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0:0 ->   130864
On node 0 totalpages: 130864
  DMA zone: 56 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 4040 pages, LIFO batch:0
  Normal zone: 1733 pages used for memmap
  Normal zone: 125035 pages, LIFO batch:31
  HighMem zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
DMI 2.3 present.
ACPI: RSDP 000F9E30, 0021 (r2 ACPIAM)
ACPI: XSDT 1FF30100, 003C (r1 A M I  OEMXSDT  1414 MSFT   97)
ACPI: FACP 1FF30290, 00F4 (r3 A M I  OEMFACP  1414 MSFT   97)
ACPI: DSDT 1FF303F0, 3779 (r1  P4C8B P4C8B106  106 INTL  2002026)
ACPI: FACS 1FF4, 0040
ACPI: APIC 1FF30390, 005C (r1 A M I  OEMAPIC  1414 MSFT   97)
ACPI: OEMB 1FF40040, 003F (r1 A M I  OEMBIOS  1414 MSFT   97)
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:2 APIC version 20
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 3000 (gap: 2000:dfb8)
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 129075
Kernel command line: BOOT_IMAGE=deb_s2.6.24 ro root=803 1
mapped APIC to b000 (fee0)
mapped IOAPIC to a000 (fec0)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 2048 (order: 11, 8192 bytes)
Detected 3150.239 MHz processor.
Console: colour VGA+ 132x44
console [tty0] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:8
... MAX_LOCK_DEPTH:  30
... MAX_LOCKDEP_KEYS:2048
... CLASSHASH_SIZE:   1024
... MAX_LOCKDEP_ENTRIES: 8192
... MAX_LOCKDEP_CHAINS:  16384
... CHAINHASH_SIZE:  8192
 memory used by lock dependency info: 1024 kB
 per task-struct memory footprint: 1680 bytes

| Locking API testsuite:

 | spin |wlock |rlock |mutex | wsem | rsem |
  --
 A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
  initialize held:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
  --
  recursive read-lock: |  ok  | |  ok  |
   recursive read-lock #2: |  ok  | |  ok  |
mixed read-write-lock: |  ok  | |  ok  |
mixed write-read-lock: |  ok  | |  ok  |
  --
 hard-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
 soft-irqs-on + irq-safe-A/12:  o

Re: [patch 21/26] mount options: partially fix nfs

2008-01-28 Thread Chuck Lever


On Jan 28, 2008, at 6:34 AM, Miklos Szeredi wrote:
> > > All mount options should be shown, which are needed to reconstruct a
> > > previous mount.
> >
> > Ah, OK.
> >
> > I'm happy to implement logic to display the all missing options.  I
> > should have updated nfs_show_mount_options() when I wrote the NFS
> > mount option parser.
> >
> > Let me know your preference.
>
> You are more familiar with NFS, so I think it would be better if you
> updated nfs_show_mount_options().
>
> Could you also queue my patch (updated) or incorporate it into a
> combined fix?

Yes.  I'll have time in a day or two to get this finished.

> Thanks,
> Miklos




--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Diego Calleja

On Mon, 28 Jan 2008 15:10:34 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:

> So you get overlapping reads. Probably not good.

I think this was discussed in the past:

http://lkml.org/lkml/2006/4/13/124
http://lkml.org/lkml/2006/4/13/130

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox

On Mon, 28 Jan 2008 15:10:34 +0100
Andi Kleen <[EMAIL PROTECTED]> wrote:

> On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > > Also worse really fixing it would be a major change to the VFS 
> > > because of the way ->read/write are defined :/
> > 
> > I don't see a problem there. ->read and ->write update the passed pointer
> > which is not the real f_pos anyway. Just the copies need fixing. 
> 
> They are effectively doing a decoupled read/modify/write cycle. e.g.:
> 
> A                       B
> 
> read fpos
> 
>                         read fpos
> 
> fpos += A               fpos += B
> write fpos
> 
> 
>                         write fpos
> 
> So you get overlapping reads. Probably not good.

No unix system I'm aware of cares about the read/write positioning during
parallel simultaneous reads or writes, with the exception of O_APPEND
which is strictly defined. The problem case is getting fpos != either
valid value.


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen

On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > Also worse really fixing it would be a major change to the VFS 
> > because of the way ->read/write are defined :/
> 
> I don't see a problem there. ->read and ->write update the passed pointer
> which is not the real f_pos anyway. Just the copies need fixing. 

They are effectively doing a decoupled read/modify/write cycle. e.g.:

A                       B

read fpos

                        read fpos

fpos += A               fpos += B
write fpos


                        write fpos

So you get overlapping reads. Probably not good.
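
The interleaving can be replayed deterministically in a few lines. This is only a single-threaded simulation of the schedule shown above (the function name is made up for illustration), but it makes the outcome concrete: A's advance is lost.

```c
#include <stdint.h>

/* Replay the schedule: both threads read fpos, then both write it back. */
static uint64_t racy_advance(uint64_t fpos, uint64_t a_len, uint64_t b_len)
{
    uint64_t a_copy = fpos;      /* A: read fpos  */
    uint64_t b_copy = fpos;      /* B: read fpos  */

    a_copy += a_len;             /* A: fpos += A  */
    b_copy += b_len;             /* B: fpos += B  */

    fpos = a_copy;               /* A: write fpos */
    fpos = b_copy;               /* B: write fpos, overwriting A's update */
    return fpos;
}
```

Starting from 0 with A = 100 and B = 50 the final position is 50 rather than 150, so the next read would overlap data already returned to one of the threads.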

-Andi

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox

> Also worse really fixing it would be a major change to the VFS 
> because of the way ->read/write are defined :/

I don't see a problem there. ->read and ->write update the passed pointer
which is not the real f_pos anyway. Just the copies need fixing.

Alan

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen

On Monday 28 January 2008 13:56:05 Alan Cox wrote:
> > > No specific spec, just general quality of implementation.
> > 
> > I completely agree.  If one thread writes A and another writes B then the
> > kernel should record either A or B, not ((A & 0x) | (B &
> > 0x))
> 
> Agree entirely: the spec doesn't allow for random scribbling in the wrong
> place. It doesn't cover which goes first or who "wins" the race but
> provides pwrite/pread for that situation. Writing somewhere unrelated is
> definitely not to spec 

Actually it probably would be -- I guess it's undefined, and in undefined
country such things can happen.

Also, to be fair, I think it's only a problem for the 4GB wrapping case,
which is presumably rare (otherwise we would have heard about it).
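
To illustrate the 4GB wrapping case: if a 64-bit offset is updated as two independent 32-bit words, a reader can combine the high word of one value with the low word of another. The sketch below uses masks chosen for illustration (upper and lower 32 bits) and a hypothetical helper name.

```c
#include <stdint.h>

/* Value a reader sees if it takes the high 32 bits from hi_src and the
 * low 32 bits from lo_src (a torn read of a two-word 64-bit value). */
static uint64_t torn_mix(uint64_t hi_src, uint64_t lo_src)
{
    return (hi_src & 0xffffffff00000000ULL) |
           (lo_src & 0x00000000ffffffffULL);
}
```

With an old offset of 0x00000000ffffffff and a new offset of 0x0000000100000000, the two possible torn results are 0 and 0x00000001ffffffff. Neither was ever written, and both are about 4GB away from the real position. When the high word does not change, the torn result equals one of the two valid values, which is why only the wrap is a problem.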

Worse, really fixing it would be a major change to the VFS
because of the way ->read/write are defined :/

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] ext3 freeze feature

2008-01-28 Thread Takashi Sato


Hi,


> What you *could* do is to start putting processes to sleep if they
> attempt to write to the frozen filesystem, and then detect the
> deadlock case where the process holding the file descriptor used to
> freeze the filesystem gets frozen because it attempted to write to the
> filesystem --- at which point it gets some kind of signal (which
> defaults to killing the process), and the filesystem is unfrozen and
> as part of the unfreeze you wake up all of the processes that were put
> to sleep for touching the frozen filesystem.


I don't think close() usually writes to the journal, so I don't see how
the deadlock would occur.  Is there a special case in which close() writes
to the journal when the process gets a signal?

Cheers, Takashi 



Re: [RFC] ext3 freeze feature

2008-01-28 Thread Takashi Sato


Hi,

Thank you for your comments.


> That's inherently unsafe - you can have multiple unfreezes
> running in parallel which seriously screws with the bdev semaphore
> count that is used to lock the device due to doing multiple up()s
> for every down.
>
> Your timeout thingy guarantees that at some point you will get
> multiple up()s occurring due to the timer firing racing with
> a thaw ioctl.
>
> If this interface is to be more widely exported, then it needs
> a complete revamp of how the bdev is locked while it is frozen so
> that there is no chance of a double up() ever occurring on the
> bd_mount_sem due to racing thaws.


As you said, my patch has this race condition.
I will fix it.
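
One generic way to make a thaw idempotent is to let only the caller that flips a "frozen" flag perform the release. The sketch below is a userspace analogy using C11 atomics, with hypothetical names and a counter standing in for up() on bd_mount_sem. It is not the fix that was eventually applied, just an illustration of the single-winner pattern.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a frozen block device. */
struct frozen_dev {
    atomic_int frozen;     /* 1 while frozen, 0 otherwise             */
    int up_calls;          /* counts how often the lock was released  */
};

static void dev_init(struct frozen_dev *d)
{
    atomic_init(&d->frozen, 0);
    d->up_calls = 0;
}

/* Returns 1 if this caller froze the device, 0 if already frozen. */
static int dev_freeze(struct frozen_dev *d)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&d->frozen, &expected, 1);
}

/* May be called from both the timeout path and the thaw ioctl path.
 * Returns 1 if this caller performed the thaw, 0 otherwise. */
static int dev_thaw(struct frozen_dev *d)
{
    int expected = 1;

    if (!atomic_compare_exchange_strong(&d->frozen, &expected, 0))
        return 0;          /* someone else already thawed */
    d->up_calls++;         /* stands in for the single up() */
    return 1;
}
```

If the timer and the ioctl race, exactly one of them sees the flag go from 1 to 0, so the release happens once per freeze.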

Cheers, Takashi 



Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox

> > No specific spec, just general quality of implementation.
> 
> I completely agree.  If one thread writes A and another writes B then the
> kernel should record either A or B, not ((A & 0x) | (B &
> 0x))

Agree entirely: the spec doesn't allow for random scribbling in the wrong
place. It doesn't cover which goes first or who "wins" the race but
provides pwrite/pread for that situation. Writing somewhere unrelated is
definitely not to spec and not good.
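
For reference, this is what the pread()/pwrite() route looks like from userspace: each call names its own offset, so the shared file position is neither read nor updated. The function name and temporary path are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write two records at explicit offsets and read them back.
 * Returns 0 on success, -1 on failure. */
static int positional_io_demo(void)
{
    char path[] = "/tmp/pio-demo-XXXXXX";
    char buf[4];
    int rc = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                    /* file disappears once fd is closed */

    /* Order of the writes does not matter: offsets are explicit. */
    if (pwrite(fd, "BBBB", 4, 4) == 4 &&
        pwrite(fd, "AAAA", 4, 0) == 4 &&
        pread(fd, buf, 4, 4) == 4 && memcmp(buf, "BBBB", 4) == 0 &&
        pread(fd, buf, 4, 0) == 4 && memcmp(buf, "AAAA", 4) == 0 &&
        lseek(fd, 0, SEEK_CUR) == 0)     /* file position never moved */
        rc = 0;

    close(fd);
    return rc;
}
```

Threads that share a descriptor and use only these calls never touch f_pos, so the race under discussion cannot affect them.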

Re: [patch 24/26] mount options: fix tmpfs

2008-01-28 Thread Miklos Szeredi

> 
> Thanks Miklos, that's a welcome enhancement, nicely done.  I've only
> noticed one thing wrong (MPOL_PREFERRED shown as "default"); but thought
> shmem_config didn't add much value - I'd rather avoid those syntactic
> changes to unchanged code; and several tmpfs defaults being relative
> (e.g. to totalram_pages, or to mounter's fsuid), I ended up preferring
> to do real tests in shmem_show_options.

I completely agree, this is much better than my version.

> Thus, for example, if memory is hotplugged in or out later, what started
> out as an unspecified size option will then get shown as explicit size.
> (I did think for a while that I wanted to show explicit size in all
> cases; but it looked pretty silly on udev.)  I think that's the correct
> behaviour, that otherwise would be misleading; but I may be looking at
> this the wrong way round, what's your view?

I agree, this is the correct way.

I'll add functions for calculating the default max values, so the
calculations won't accidentally become different for the
initialization and the option showing.
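
The shape of that refactoring might be as follows. This is a sketch only: the helper names are hypothetical, and the "half of RAM" default is an assumption about tmpfs's usual behaviour rather than something quoted from the patch.

```c
#include <stdio.h>

/* Assumed default limit: half of RAM, in pages. */
static unsigned long default_max_blocks(unsigned long totalram_pages)
{
    return totalram_pages / 2;
}

/* Mount-time initialisation uses the helper... */
static unsigned long init_max_blocks(unsigned long totalram_pages)
{
    return default_max_blocks(totalram_pages);
}

/* ...and so does option display, which emits size= only when the
 * current limit differs from what the default would be right now.
 * Returns the number of characters written (0 if nothing is shown). */
static int show_size(char *buf, size_t len, unsigned long max_blocks,
                     unsigned long totalram_pages, unsigned long page_kb)
{
    if (max_blocks == default_max_blocks(totalram_pages)) {
        if (len)
            buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, len, ",size=%luk", max_blocks * page_kb);
}
```

Because both paths call the same helper, a limit that started out as the default shows nothing until RAM is hotplugged, after which it appears as an explicit size, which is the behaviour agreed on above.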

> If you agree with the version below, please take it into your collection
> and insert your Signed-off-by.  I should admit, I've not yet tested how
> the NUMA policies look: you'll hear from me again tomorrow morning if
> those turn out to be wrong.

OK, I'll send this to Andrew.  Maybe I'll wait until tomorrow to hear
if it's working on NUMA.

Thanks,
Miklos

Re: [patch 00/26] mount options: fix filesystem's ->show_options

2008-01-28 Thread Miklos Szeredi

> > On Thu, 24 Jan 2008 20:33:41 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> > Andrew,
> > 
> > Would you please consider these patches for -mm?
> 
> Sure, but I'm too lazy to pick through them and work out which ones need
> updating, which ones got acked and which ones someone else merged, all on a
> very bumpy plane flight ;)
> 
> Please resend when the dust has settled?

Yes, I should have thought, it won't quite work in a single iteration :)

I'll resend them in a moment.

Thanks,
Miklos

Re: [patch 21/26] mount options: partially fix nfs

2008-01-28 Thread Miklos Szeredi

> > All mount options should be shown, which are needed to reconstruct a
> > previous mount.
> 
> Ah, OK.
> 
> I'm happy to implement logic to display the all missing options.  I  
> should have updated nfs_show_mount_options() when I wrote the NFS  
> mount option parser.
> 
> Let me know your preference.

You are more familiar with NFS, so I think it would be better if you
updated nfs_show_mount_options().

Could you also queue my patch (updated) or incorporate it into a
combined fix?

Thanks,
Miklos


Subject: mount options: partially fix nfs

From: Miklos Szeredi <[EMAIL PROTECTED]>

Add posix, bsize=, namelen= options to /proc/mounts for nfs
filesystems.

Document several other options that are still missing.

Changes:

 - display namelen= unconditionally
 - addr= isn't missing after all

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
CC: Trond Myklebust <[EMAIL PROTECTED]>
---

Index: linux/fs/nfs/super.c
===================================================================
--- linux.orig/fs/nfs/super.c   2008-01-25 15:44:56.0 +0100
+++ linux/fs/nfs/super.c   2008-01-25 15:57:32.0 +0100
@@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
 	} nfs_info[] = {
 		{ NFS_MOUNT_SOFT, ",soft", ",hard" },
 		{ NFS_MOUNT_INTR, ",intr", ",nointr" },
+		{ NFS_MOUNT_POSIX, ",posix", "" },
 		{ NFS_MOUNT_NOCTO, ",nocto", "" },
 		{ NFS_MOUNT_NOAC, ",noac", "" },
 		{ NFS_MOUNT_NONLM, ",nolock", "" },
@@ -463,6 +464,9 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",vers=%d", clp->rpc_ops->version);
 	seq_printf(m, ",rsize=%d", nfss->rsize);
 	seq_printf(m, ",wsize=%d", nfss->wsize);
+	seq_printf(m, ",namelen=%d", nfss->namelen);
+	if (nfss->bsize != 0)
+		seq_printf(m, ",bsize=%d", nfss->bsize);
 	if (nfss->acregmin != 3*HZ || showdefaults)
 		seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
 	if (nfss->acregmax != 60*HZ || showdefaults)
@@ -482,6 +486,17 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
 	seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
 	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+
+	/*
+	 * Missing options:
+	 * port=
+	 * mountport=
+	 * mountvers=
+	 * mountproto=
+	 * clientaddr=
+	 * mounthost=
+	 * mountaddr=
+	 */
 }
 
 /*
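
For readers unfamiliar with the table-driven pattern the patch extends, here
is a minimal userspace sketch of the same idea. It is not the kernel code:
the DEMO_* flag values, struct demo_opt and demo_show_options() are made up
for illustration, and snprintf() into a caller-supplied buffer stands in for
seq_printf(). Each table entry names a flag, the string to emit when the flag
is set, and the string to emit when it is clear (empty for options such as
posix that have no negative form).

```c
#include <stdio.h>

/* Made-up flag values for illustration; the real NFS_MOUNT_* constants differ. */
#define DEMO_SOFT  0x1
#define DEMO_INTR  0x2
#define DEMO_POSIX 0x4

struct demo_opt {
	int flag;
	const char *str;	/* emitted when the flag is set */
	const char *nostr;	/* emitted when the flag is clear ("" = nothing) */
};

static const struct demo_opt demo_opts[] = {
	{ DEMO_SOFT,  ",soft",  ",hard" },
	{ DEMO_INTR,  ",intr",  ",nointr" },
	{ DEMO_POSIX, ",posix", "" },
};

/* Write the option strings selected by `flags` into `buf`; returns buf. */
char *demo_show_options(int flags, char *buf, size_t size)
{
	size_t i, used = 0;

	if (size == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i < sizeof(demo_opts) / sizeof(demo_opts[0]); i++) {
		const char *s = (flags & demo_opts[i].flag) ?
				demo_opts[i].str : demo_opts[i].nostr;
		int n = snprintf(buf + used, size - used, "%s", s);

		if (n < 0 || (size_t)n >= size - used)
			break;	/* error or output truncated: stop here */
		used += (size_t)n;
	}
	return buf;
}
```

With a 64-byte buffer, demo_show_options(DEMO_SOFT | DEMO_POSIX, buf, 64)
yields ",soft,nointr,posix". Adding an option to the output is then a
one-line change to the table, which is what the first hunk above does.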

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Bodo Eggert

Trond Myklebust <[EMAIL PROTECTED]> wrote:
> On Mon, 2008-01-28 at 05:38 +0100, Andi Kleen wrote:
>> On Monday 28 January 2008 05:13:09 Trond Myklebust wrote:
>> > On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote:

>> > > The problem is that it's not a race in who gets to do its thing first,
>> > > but a parallel reader can actually see a corrupted value from the two
>> > > independent words on 32bit (e.g. when crossing the 4GB boundary). And this could actually
>> > > completely corrupt f_pos when it happens with two racing relative seeks
>> > > or read/write()s
>> > > 
>> > > I would consider that a bug.
>> > 
>> > I disagree. The corruption occurs because this isn't a situation that is
>> > allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring
>> > to here?
>> 
>> No specific spec, just general quality of implementation. We normally don't
>> have non thread safe system calls even if it was in theory allowed by some
>> specification.
> 
> We've had the existing implementation for quite some time. The arguments
> against changing it have been the same all along: if your application
> wants to share files between threads, the portability argument implies
> that you should either use pread/pwrite or use a mutex or some other
> form of synchronisation primitive in order to ensure that
> lseek()/read()/write() do not overlap.

Does anything in the kernel depend on f_pos being valid?
E.g. is it possible to read beyond the EOF using this race, or to have files
larger than the ulimit?

If not, update the manpage and be done.
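
For reference, the pread() route mentioned in the quoted text can be sketched
as follows. This is only an illustration under stated assumptions: read_at()
is a hypothetical helper name, not something from the thread or from any
library. The point is that pread() takes the offset as an argument and leaves
the descriptor's file offset alone, so threads sharing one descriptor never
touch f_pos.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `len` bytes starting at absolute
 * offset `off`. Returns the number of bytes read (short at end of
 * file) or -1 with errno set. The shared file offset is never used.
 */
ssize_t read_at(int fd, void *buf, size_t len, off_t off)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = pread(fd, (char *)buf + done, len - done,
				  off + (off_t)done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;
		}
		if (n == 0)
			break;			/* end of file */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Each thread keeps its own offset variable and passes it in, so no lock is
needed around the read itself; pwrite() works the same way for writes.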


[2.6.24 REGRESSION] BUG: Soft lockup - with VFS

2008-01-28 Thread Oliver Pinter (Pintér Olivér)

Hi all!

With 2.6.24 I get soft lockups with a USB phone: when I plug in the
mobile, the VFS layer crashes. I can send the .config this afternoon,
and I will bisect the kernel when I have time.

Pictures of the crash:
http://students.zipernowsky.hu/~oliverp/kernel/regression_2624/
-- 
Thanks,
Oliver

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen


> I completely agree.  If one thread writes A and another writes B then the
> kernel should record either A or B, not ((A & 0xffffffff00000000) | (B &
> 0xffffffff))

The problem is pretty nasty unfortunately. To solve it properly I think
the file_operations->read/write prototypes would need to be changed
because otherwise it is not possible to do atomic relative updates
of f_pos. Right now the actual update is buried deep in the low level
read/write implementation. But that would have a huge impact all over
the tree :/

Or maybe define a new read/write64 and keep the default as 32bit only; I
suppose most users don't really need 64bit. It would still be a nasty API
change.

-Andi
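
To make the mixed value concrete, here is a small userspace sketch. It is not
kernel code and torn_value() is a hypothetical name; it only computes what a
reader would see if it picked up the high 32 bits of one 64-bit store and the
low 32 bits of another, which is the situation described for 32bit above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine the high word of `a` with the low word
 * of `b`, i.e. the value observed when two 64-bit stores are each
 * split into two independent 32-bit writes and a reader sees half of
 * each.
 */
uint64_t torn_value(uint64_t a, uint64_t b)
{
	uint64_t high_of_a = a & 0xffffffff00000000ULL;
	uint64_t low_of_b  = b & 0x00000000ffffffffULL;

	return high_of_a | low_of_b;
}
```

Take A = 0x100000000 (exactly 4GB) and B = 0xffffffff (one byte below it).
torn_value(A, B) is 0x1ffffffff, roughly 8GB, and torn_value(B, A) is 0.
Neither result is a position that either writer ever stored.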

