Re: [BUG] Linux 2.6.25-rc2 - Kernel Ooops while running dbench

2008-02-18 Thread Andrew Morton
On Sat, 16 Feb 2008 11:14:46 +0530 Kamalesh Babulal [EMAIL PROTECTED] wrote:

 The 2.6.25-rc2 kernel oopses while running dbench on ext3 filesystem
 mounted with mount -o data=writeback,nobh option on the x86_64 box
 
 BUG: unable to handle kernel NULL pointer dereference at 
 IP: [80274972] kmem_cache_alloc+0x3a/0x6c
 PGD 1f6860067 PUD 1f5d64067 PMD 0 
 Oops:  [1] SMP 
 CPU 3 
 Modules linked in:
 Pid: 4271, comm: dbench Not tainted 2.6.25-rc2-autotest #1
 RIP: 0010:[80274972]  [80274972] 
 kmem_cache_alloc+0x3a/0x6c
 RSP: :8101fb041dc8  EFLAGS: 00010246
 RAX:  RBX: 810180033c00 RCX: 8027b269
 RDX:  RSI: 80d0 RDI: 80632d70
 RBP: 80d0 R08: 0001 R09: 
 R10: 8101feb36e50 R11: 0190 R12: 0001
 R13:  R14: 8101f8f38000 R15: ff9c
 FS:  () GS:8101fff0f000(0063) knlGS:f7e41460
 CS:  0010 DS: 002b ES: 002b CR0: 80050033
 CR2:  CR3: 0001f562 CR4: 06e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process dbench (pid: 4271, threadinfo 8101fb04, task 8101fb18)
 Stack:  0001 8101fb041ea8 0001 8027b269
  8101fb041ea8 80281fe8 0001 
  8101fb041ea8 ff9c 000b 0001
 Call Trace:
  [8027b269] get_empty_filp+0x55/0xf9
  [80281fe8] __path_lookup_intent_open+0x22/0x8f
  [80282853] open_namei+0x86/0x5a7
  [8027d019] vfs_stat_fd+0x3c/0x4a
  [80279ab1] do_filp_open+0x1c/0x3d
  [80279c2c] get_unused_fd_flags+0x79/0x111
  [80279dce] do_sys_open+0x46/0xca
  [80221c82] ia32_sysret+0x0/0xa
 

Looks to me like we broke slab.  Christoph is offline until the 27th..
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bugme-new] [Bug 9911] New: fsync blocks concurrent writes to file

2008-02-07 Thread Andrew Morton
On Thu,  7 Feb 2008 17:40:57 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9911
 
Summary: fsync blocks concurrent writes to file
Product: File System
Version: 2.5
  KernelVersion: 2.6.23.8
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: ext3
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: none
 Earliest failing kernel version: 2.6.18
 Distribution: RHEL 4  Fedora 8
 Hardware Environment: Dual Xeon  dual-core Athlon
 Software Environment: Java 1.5.0_14
 Problem Description: Multiple threads append transactions to a single file.
 When fsyncs are issued, no writes complete until the fsync completes. This has
 the effect of yielding the same write rate regardless of how many threads are
 writing to the file. Solaris 10 does not exhibit this problem and scales well
 as additional threads are added.
 
 I have confirmed this by running strace and witnessing that writes block until
 immediately after the fsync completes.
 
 When using truss on Solaris I have witnessed multiple writes completing while
 an fsync was in progress.
 
 Steps to reproduce:
 
 Open a file for writing.
 Launch multiple threads with each one writing data and followed by an fsync,
 but only if data has been written since the last fsync.
 
 
 I have googled this issue substantially and have found no answers.
 

I supposing teaching java about sync_file_range() would be all too hard.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: merge plans, was Re: - disable-ext4.patch removed from -mm tree

2008-02-05 Thread Andrew Morton
On Tue, 5 Feb 2008 10:44:54 +0100 Christoph Hellwig [EMAIL PROTECTED] wrote:

 On Mon, Feb 04, 2008 at 02:35:29PM -0800, Andrew Morton wrote:
  On Mon, 4 Feb 2008 15:24:18 -0500
  Christoph Hellwig [EMAIL PROTECTED] wrote:
  
   On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote:
On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] 
wrote:

 On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote:
  When I merge David's iget coversion patches this will instead wreck 
  the
  ext4 patchset.
 
 That's ok, it shouldn't be hard for me to fix this up.  How quickly
 will you be able to merge David's iget converstion patches?

They're about 1,000 patches back
   
   Care to post a merge plan so we have a slight chance to make sure not
   too much crap is hiding in these 1000 patches?
  
  Pretty much everything up to
  
  #
  # end
  #
  reiser4-sb_sync_inodes.patch
 
 That includes the git trees?

No, I don't merge git trees.

  Defintive NACK to unionfs.

Agree, but it'd be nice to get some movement and resolution here.  The
thing's actively and enthusiastically maintained and is apparently useful. 
There's not much point in allowing developers to expend cycles on something
which doesn't have a future.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: - disable-ext4.patch removed from -mm tree

2008-02-05 Thread Andrew Morton
On Tue, 05 Feb 2008 17:02:20 -0800 Mingming Cao [EMAIL PROTECTED] wrote:

 On Tue, 2008-02-05 at 13:26 -0800, Mingming Cao wrote:
  On Mon, 2008-02-04 at 10:00 -0500, Theodore Tso wrote:
   On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote:
On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] 
wrote:

 On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote:
  When I merge David's iget coversion patches this will instead wreck 
  the
  ext4 patchset.
 
 That's ok, it shouldn't be hard for me to fix this up.  How quickly
 will you be able to merge David's iget converstion patches?

They're about 1,000 patches back.
   
   OK, if you're not planning on pushing David's changes to Linus right
   away, what if I pull in David's
   
 iget-stop-ext4-from-using-iget-and-read_inode-try.patch 
   
   and push it plus some other ext4 bug fixes directly to Linus, and let
   you know when that has happened so you can drop David's patch from
   your queue?
   
   David's changes to ext4 can be applied standalone without the rest of
   his series, so it would be safe to push that to Linus independently
   and in advance of the rest of his series.  
  
  I get compile error when builing ext4 patch queue with
  iget-stop-ext4-from-using-iget-and-read_inode-try.patch applied, against
  2.6.24-git14.
  
  It seems iget-stop-ext4-from-using-iget-and-read_inode-try.patch depends
  on patches:  
  [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co.
  [PATCH 03/32] IGET: Introduce a function to register iget failure
 
 It seems to me the easiest way to bring ext4 patches back to mm tree, is
 to carry above two patches in ext4 patch queue, like we did for other
 ext4 patches that depend on generic code in the past.

It doesn't matter a lot because I won't be doing another -mm until all this
is merged up anyway.

 So I added above two patches to ext4 patch queue, now that ext4 patches
 could apply cleanly to linus git tree, and Andrew should able to easily
 pull ext4 patches after removing the duplicated patches. 
 
 Ted, I have the ext4 patch queue updated for this. 

This means that I merge part of the iget patch series, then twiddle thumbs
until the ext4 tree merges, then merge the remainder of the iget series.

So unless Ted intends to merge RSN I think it'd be preferable if I were to
just merge the lot, sorry.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: - disable-ext4.patch removed from -mm tree

2008-02-04 Thread Andrew Morton
On Mon, 4 Feb 2008 10:00:44 -0500 Theodore Tso [EMAIL PROTECTED] wrote:

 On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote:
  On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote:
  
   On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote:
When I merge David's iget coversion patches this will instead wreck the
ext4 patchset.
   
   That's ok, it shouldn't be hard for me to fix this up.  How quickly
   will you be able to merge David's iget converstion patches?
  
  They're about 1,000 patches back.
 
 OK, if you're not planning on pushing David's changes to Linus right
 away, what if I pull in David's
 
   iget-stop-ext4-from-using-iget-and-read_inode-try.patch 
 
 and push it plus some other ext4 bug fixes directly to Linus, and let
 you know when that has happened so you can drop David's patch from
 your queue?

Sure, go for it.

 David's changes to ext4 can be applied standalone without the rest of
 his series, so it would be safe to push that to Linus independently
 and in advance of the rest of his series.  That should also help
 reduce the number of inter-patch queue dependencies.

That patch series is kind of logjammed anyway because it breaks isofs. 
Last time I discussed this with David he seemed to find this amusing rather
than an urgent problem.  I'd drop the whole lot if there weren't lots of
other patches dependent upon them.  Mayeb I can do a selective droppage,
but I hate going out-of-order, and merging untested patch combinations.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: merge plans, was Re: - disable-ext4.patch removed from -mm tree

2008-02-04 Thread Andrew Morton
On Mon, 4 Feb 2008 15:24:18 -0500
Christoph Hellwig [EMAIL PROTECTED] wrote:

 On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote:
  On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote:
  
   On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote:
When I merge David's iget coversion patches this will instead wreck the
ext4 patchset.
   
   That's ok, it shouldn't be hard for me to fix this up.  How quickly
   will you be able to merge David's iget converstion patches?
  
  They're about 1,000 patches back
 
 Care to post a merge plan so we have a slight chance to make sure not
 too much crap is hiding in these 1000 patches?

Pretty much everything up to

#
# end
#
reiser4-sb_sync_inodes.patch


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: - disable-ext4.patch removed from -mm tree

2008-02-03 Thread Andrew Morton
On Sun, 03 Feb 2008 12:18:35 -0800 [EMAIL PROTECTED] wrote:

 
 The patch titled
  disable-ext4
 has been removed from the -mm tree.  Its filename was
  disable-ext4.patch

I dropped the entire ext4 patch series, because the newly-added
convert-to-iget_locked patch has wrecked the iget-coversion patch which I
have queued.  Please do not merge that patch under my feet.

When I merge David's iget coversion patches this will instead wreck the
ext4 patchset.

Dunno what to do about this sort of thing, apart from encouraging
developers to take a look at what's happening in other trees before merging
stuff.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: - disable-ext4.patch removed from -mm tree

2008-02-03 Thread Andrew Morton
On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote:

 On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote:
  When I merge David's iget coversion patches this will instead wreck the
  ext4 patchset.
 
 That's ok, it shouldn't be hard for me to fix this up.  How quickly
 will you be able to merge David's iget converstion patches?

They're about 1,000 patches back.

  Could
 this get pushed to Linus, say, tomorrow?

No.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record

2008-01-30 Thread Andrew Morton


Begin forwarded message:

Date: Wed, 30 Jan 2008 03:24:08 -0800 (PST)
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bugme-new] [Bug 9849] New: NULL pointer deref in 
journal_wait_on_commit_record


http://bugzilla.kernel.org/show_bug.cgi?id=9849

   Summary: NULL pointer deref in journal_wait_on_commit_record
   Product: File System
   Version: 2.5
 KernelVersion: 2.6.24-03997-g85004cc
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: ext4
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]


Latest working kernel version: -
Earliest failing kernel version: 2.6.24-03863-g0ba6c33
Distribution: Ubuntu
Problem Description:

using a corrupted image causes an oops in unmount, seems as if
journal_wait_on_commit_record() gets passed a NULL pointer

Steps to reproduce:

using fsfuzz with ext4, I'll attach the image which causes this for me

one oops can be found here
http://kerneloops.org/raw.php?rawid=3160msgid=

here is another one with full jbd2 debugging enabled (there are a lot of
log_do_checkpoint messages above this)

[  242.863778] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start
checkpoint
[  242.863790] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint:
cleanup_journal_tail returned 1
[  242.863810] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start
checkpoint
[  242.863822] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint:
cleanup_journal_tail returned 1
[  242.863842] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start
checkpoint
[  242.863854] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint:
cleanup_journal_tail returned 1
[  242.863874] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start
checkpoint
[  242.863886] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint:
cleanup_journal_tail returned 1
[  242.864017] (fs/jbd2/journal.c, 193): kjournald2: kjournald2 wakes
[  242.864027] (fs/jbd2/journal.c, 201): kjournald2: woke because of timeout
[  242.864035] (fs/jbd2/journal.c, 145): kjournald2: commit_sequence=1,
commit_request=2
[  242.864044] (fs/jbd2/journal.c, 148): kjournald2: OK, requests differ
[  242.864055] (fs/jbd2/commit.c, 415): jbd2_journal_commit_transaction: super
block updated
[  242.864066] (fs/jbd2/journal.c, 1264): jbd2_journal_update_superblock: JBD:
updating superblock (start 15335425, seq 2, errno 0)
[  242.864385] (fs/jbd2/commit.c, 428): jbd2_journal_commit_transaction: JBD:
starting commit of transaction 2
[  242.864409] (fs/jbd2/commit.c, 501): jbd2_journal_commit_transaction: JBD:
commit phase 1
[  242.864428] (fs/jbd2/commit.c, 519): jbd2_journal_commit_transaction: JBD:
commit phase 2
[  242.864459] (fs/jbd2/revoke.c, 537): jbd2_journal_write_revoke_records:
Wrote 0 revoke records
[  242.864469] (fs/jbd2/commit.c, 561): jbd2_journal_commit_transaction: JBD:
commit phase 2
[  242.864478] (fs/jbd2/commit.c, 571): jbd2_journal_commit_transaction: JBD:
commit phase 3
[  242.864487] (fs/jbd2/commit.c, 780): jbd2_journal_commit_transaction: JBD:
commit phase 4
[  242.864496] (fs/jbd2/commit.c, 839): jbd2_journal_commit_transaction: JBD:
commit phase 5
[  242.864505] (fs/jbd2/commit.c, 866): jbd2_journal_commit_transaction: JBD:
commit phase 6
[  242.864599] attempt to access beyond end of device
[  242.864609] loop0: rw=0, want=200708, limit=16384
[  242.864633] jbd2_journal_bmap: journal block not found at offset 15335425 on
loop0
[  242.864680] Aborting journal on device loop0.
[  242.864733] (fs/jbd2/journal.c, 1264): jbd2_journal_update_superblock: JBD:
updating superblock (start 15335425, seq 2, errno -5)
[  242.864868] BUG: unable to handle kernel NULL pointer dereference at virtual
address 
[  242.864962] printing eip: c023c2a7 *pde =  
[  242.865048] Oops: 0002 [#1] PREEMPT 
[  242.865108] Modules linked in:
[  242.865218] 
[  242.865243] Pid: 3698, comm: kjournald2 Not tainted (2.6.24-03997-g85004cc
#16)
[  242.865268] EIP: 0060:[c023c2a7] EFLAGS: 00010202 CPU: 0
[  242.865382] EIP is at journal_wait_on_commit_record+0x7/0x50
[  242.865407] EAX:  EBX:  ECX: 0001 EDX: 0001
[  242.865431] ESI:  EDI: c07835d2 EBP: cb229ee4 ESP: cb229edc
[  242.865455]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[  242.865539] Process kjournald2 (pid: 3698, ti=cb229000 task=cb208000
task.ti=cb229000)
[  242.865564] Stack:   cb229f88 c023cb07  c07835d2
0362 c069c620 
[  242.865864]cb2316e0 cb231504 cb2314f0 cb134960  cb231920
  
[  242.865918]cb208000   0008  
  
[  242.865918] Call Trace:
[  242.865918]  [c0104c0a] show_trace_log_lvl+0x1a/0x30
[  242.865918]  [c0104cc9] show_stack_log_lvl+0xa9/0xd0
[  242.865918]  [c0104dba] show_registers+0xca/0x250
[  242.865918]  [c01051e1] die+0x101/0x220
[  

Re: [Bugme-new] [Bug 9855] New: ext3 ACL corruption

2008-01-30 Thread Andrew Morton
On Wed, 30 Jan 2008 14:29:27 -0800 (PST)
[EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9855
 
Summary: ext3 ACL corruption
Product: File System
Version: 2.5
  KernelVersion: 2.6.23
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: ext3
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Latest working kernel version: Unknown
 Earliest failing kernel version: Definitely 2.6.23 and 2.6.23.8 but earlier is
 possible
 Distribution: Debian Etch
 Hardware Environment: Multiple x86 machines
 
 Software Environment:
 Filesystem is Ext3 on LVM on RAID-1 (on SATA).
 # e2fsck -V
 e2fsck 1.40-WIP (14-Nov-2006)
 Using EXT2FS Library version 1.40-WIP, 14-Nov-2006
 
 Problem Description:
 On several occasions now I have had e2fsck prune away ACLs on my file systems
 during a file system check after rebooting a number of (reasonably) long
 running Samba servers. This morning I decided to manually run fsck before
 rebooting one of these:
 
 # e2fsck -pfv /dev/mapper/vg_main-lv_samba
 (entry-e_value_offs + entry-e_value_size: 116, offs: 120)
 /dev/mapper/vg_main-lv_samba: Extended attribute in inode 163841 has a value
 offset (56) which is invalid
 CLEARED.
 (entry-e_value_offs + entry-e_value_size: 116, offs: 120)
 /dev/mapper/vg_main-lv_samba: Extended attribute in inode 262146 has a value
 offset (56) which is invalid
 CLEARED.
 [ snip lots of (near) identical errors]
 
 8301 inodes used (0.08%)
 1621 non-contiguous inodes (19.5%)
  # of inodes with ind/dind/tind blocks: 3837/24/0
  1108478 blocks used (5.29%)
0 bad blocks
1 large file
 
 7590 regular files
  662 directories
0 character device files
0 block device files
0 fifos
0 links
   40 symbolic links (38 fast symbolic links)
0 sockets
 
 8292 files
 
 (Note: after remounting)
 # tune2fs -l /dev/mapper/vg_main-lv_samba 
 tune2fs 1.40-WIP (14-Nov-2006)
 Filesystem volume name:   none
 Last mounted on:  not available
 Filesystem UUID:  88677414-c1f8-41ba-b737-d9f6170d771b
 Filesystem magic number:  0xEF53
 Filesystem revision #:1 (dynamic)
 Filesystem features:  has_journal ext_attr resize_inode dir_index filetype
 needs_recovery sparse_super large_file
 Filesystem flags: signed directory hash 
 Default mount options:(none)
 Filesystem state: clean
 Errors behavior:  Continue
 Filesystem OS type:   Linux
 Inode count:  10485760
 Block count:  20971520
 Reserved block count: 1048576
 Free blocks:  19863038
 Free inodes:  10477459
 First block:  0
 Block size:   4096
 Fragment size:4096
 Reserved GDT blocks:  1019
 Blocks per group: 32768
 Fragments per group:  32768
 Inodes per group: 16384
 Inode blocks per group:   1024
 Filesystem created:   Wed Feb 21 21:38:33 2007
 Last mount time:  Thu Jan 31 03:18:54 2008
 Last write time:  Thu Jan 31 03:18:54 2008
 Mount count:  1
 Maximum mount count:  30
 Last checked: Thu Jan 31 03:16:51 2008
 Check interval:   15552000 (6 months)
 Next check after: Tue Jul 29 02:16:51 2008
 Reserved blocks uid:  0 (user root)
 Reserved blocks gid:  0 (group root)
 First inode:  11
 Inode size:   256
 Journal inode:8
 Default directory hash:   tea
 Directory Hash Seed:  be8c201b-3563-4fa5-a2a6-e2864e4b73e2
 Journal backup:   inode blocks
 
 
 Steps to reproduce:
 Unfortunately, precise steps are not known. Restoring all the filesystem's 
 ACLs
 from a recent dump made using getfacl -RP fixes the ACLs without causing the
 corruption to return.
 
 These are production Samba servers making fairly extensive use of file and
 directory ACLs. Thus far, I've only noticed the corruptions when it came time
 to upgrade to a new kernel and reboot (and the boot scripts then run fsck).
 Note that I've never noticed any issues at runtime because of this - only when
 I later realised that ACLs had been removed from random files and/or
 directories.
 
 I think I will implement some scripts to unmount and run fsck nightly from
 cron, so I can at least detect the corruption a little earlier. If there is
 some more helpful debugging output I can provide, please let me know.
 

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record

2008-01-30 Thread Andrew Morton
On Wed, 30 Jan 2008 15:17:57 -0800
Mingming Cao [EMAIL PROTECTED] wrote:

 The buufer head pointer passed to journal_wait_on_commit_record()
 could be NULL if the previous journal_submit_commit_record() failed
 or journal has already aborted.
 
 We need to check the error returns from journal_submit_commit_record()
 and avoid calling journal_wait_on_commit_record() in the failure case.
 
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 ---
  fs/jbd2/commit.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 Index: linux-2.6.24-rc8/fs/jbd2/commit.c
 ===
 --- linux-2.6.24-rc8.orig/fs/jbd2/commit.c2008-01-30 14:12:10.0 
 -0800
 +++ linux-2.6.24-rc8/fs/jbd2/commit.c 2008-01-30 15:09:50.0 -0800
 @@ -872,7 +872,8 @@ wait_for_iobuf:
   if (err)
   __jbd2_journal_abort_hard(journal);
   }
 - err = journal_wait_on_commit_record(cbh);
 + if (!err  !is_journal_aborted(journal))
 + err = journal_wait_on_commit_record(cbh);
  
   if (err)
   jbd2_journal_abort(journal, err);

Thanks.  Please note that I Cc'ed [EMAIL PROTECTED] on this, for a 2.6.24.x
backport.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 33/49] ext4: Add the journal checksum feature

2008-01-23 Thread Andrew Morton
 On Mon, 21 Jan 2008 22:02:12 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote:
 From: Girish Shilamkar [EMAIL PROTECTED]
 
 The journal checksum feature adds two new flags i.e
 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
 
 JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
 checksum for the blocks described by the descriptor blocks.
 Due to checksums, writing of the commit record no longer needs to be
 synchronous. Now commit record can be sent to disk without waiting for
 descriptor blocks to be written to disk. This behavior is controlled
 using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
 able to recover the journal with _ASYNC_COMMIT hence it is made
 incompat.
 The commit header has been extended to hold the checksum along with the
 type of the checksum.
 
 For recovery in pass scan checksums are verified to ensure the sanity
 and completeness(in case of _ASYNC_COMMIT) of every transaction.
 
  ...

 +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head 
 *bh)

unneeded inlining.

 +{
 + struct page *page = bh-b_page;
 + char *addr;
 + __u32 checksum;
 +
 + addr = kmap_atomic(page, KM_USER0);
 + checksum = crc32_be(crc32_sum,
 + (void *)(addr + offset_in_page(bh-b_data)), bh-b_size);
 + kunmap_atomic(addr, KM_USER0);
 +
 + return checksum;
 +}

Can this buffer actually be in highmem?

  static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
  unsigned long long block)

More unnecessary inlining.

 +/*
 + * jbd2_journal_clear_features () - Clear a given journal feature in the
 + *   superblock
 + * @journal: Journal to act on.
 + * @compat: bitmask of compatible features
 + * @ro: bitmask of features that force read-only mount
 + * @incompat: bitmask of incompatible features
 + *
 + * Clear a given journal feature as present on the
 + * superblock.  Returns true if the requested features could be reset.
 + */
 +int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
 + unsigned long ro, unsigned long incompat)
 +{
 + journal_superblock_t *sb;
 +
 + jbd_debug(1, Clear features 0x%lx/0x%lx/0x%lx\n,
 +   compat, ro, incompat);
 +
 + sb = journal-j_superblock;
 +
 + sb-s_feature_compat= ~cpu_to_be32(compat);
 + sb-s_feature_ro_compat = ~cpu_to_be32(ro);
 + sb-s_feature_incompat  = ~cpu_to_be32(incompat);
 +
 + return 1;
 +}
 +EXPORT_SYMBOL(jbd2_journal_clear_features);

Kernel usually returns 0 on success.  So we can return a useful errno on
failure.

 +/*
 + * calc_chksums calculates the checksums for the blocks described in the
 + * descriptor block.
 + */
 +static int calc_chksums(journal_t *journal, struct buffer_head *bh,
 + unsigned long *next_log_block, __u32 *crc32_sum)
 +{
 + int i, num_blks, err;
 + unsigned io_block;
 + struct buffer_head *obh;
 +
 + num_blks = count_tags(journal, bh);
 + /* Calculate checksum of the descriptor block. */
 + *crc32_sum = crc32_be(*crc32_sum, (void *)bh-b_data, bh-b_size);
 +
 + for (i = 0; i  num_blks; i++) {
 + io_block = (*next_log_block)++;

 unsigned - unsigned long.

Are all the types appropriate in here?

 + wrap(journal, *next_log_block);
 + err = jread(obh, journal, io_block);
 + if (err) {
 + printk(KERN_ERR JBD: IO error %d recovering block 
 + %u in log\n, err, io_block);
 + return 1;
 + } else {
 + *crc32_sum = crc32_be(*crc32_sum, (void *)obh-b_data,
 +  obh-b_size);
 + }
 + }
 + return 0;
 +}
 +
  static int do_one_pass(journal_t *journal,
   struct recovery_info *info, enum passtype pass)
  {
 @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
   unsigned intsequence;
   int blocktype;
   int tag_bytes = journal_tag_bytes(journal);
 + __u32   crc32_sum = ~0; /* Transactional Checksums */
  
   /* Precompute the maximum metadata descriptors in a descriptor block */
   int MAX_BLOCKS_PER_DESC;
 @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
   switch(blocktype) {
   case JBD2_DESCRIPTOR_BLOCK:
   /* If it is a valid descriptor block, replay it
 -  * in pass REPLAY; otherwise, just skip over the
 -  * blocks it describes. */
 +  * in pass REPLAY; if journal_checksums enabled, then
 +  * calculate checksums in PASS_SCAN, otherwise,
 +  * just skip over the blocks it describes. */
   if (pass != 

Re: [PATCH 23/49] Add buffer head related helper functions

2008-01-23 Thread Andrew Morton
 On Mon, 21 Jan 2008 22:02:02 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote:
 +}
 +EXPORT_SYMBOL(bh_uptodate_or_lock);
 +/**

Missing newline.

 + * bh_submit_read: Submit a locked buffer for reading
 + * @bh: struct buffer_head
 + *
 + * Returns a negative error
 + */
 +int bh_submit_read(struct buffer_head *bh)
 +{
 + if (!buffer_locked(bh))
 + lock_buffer(bh);
 +
 + if (buffer_uptodate(bh))
 + return 0;

Here it can lock the buffer then return zero

 + get_bh(bh);
 + bh-b_end_io = end_buffer_read_sync;
 + submit_bh(READ, bh);
 + wait_on_buffer(bh);
 + if (buffer_uptodate(bh))
 + return 0;

Here it will unlock the buffer and return zero.

This function is unusable when passed an unlocked buffer.

The return value should (always) be documented.

 + return -EIO;
 +}
 +EXPORT_SYMBOL(bh_submit_read);
  void __init buffer_init(void)

Missing newline.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 24/49] ext4: add block bitmap validation

2008-01-23 Thread Andrew Morton
 On Mon, 21 Jan 2008 22:02:03 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote:
 + if (bh_submit_read(bh)  0) {
 + brelse(bh);
 + ext4_error(sb, __FUNCTION__,
   Cannot read block bitmap - 
 - block_group = %lu, block_bitmap = %llu,
 - block_group, bitmap_blk);
 + block_group = %d, block_bitmap = %llu,
 + (int)block_group, (unsigned long long)bitmap_blk);
 + return NULL;
 + }
 + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) {
 + brelse(bh);
 + return NULL;
 + }

brelse() should only be used when the bh might be NULL - put_bh()
can be used here.

Please review all ext4/jbd2 code for this trivial speedup.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

2008-01-23 Thread Andrew Morton
 On Mon, 21 Jan 2008 22:02:20 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote:
 From: Alex Tomas [EMAIL PROTECTED]
 
 Signed-off-by: Alex Tomas [EMAIL PROTECTED]
 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
 Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
 Signed-off-by: Theodore Ts'o [EMAIL PROTECTED]

 ...

 +#if BITS_PER_LONG == 64
 +#define mb_correct_addr_and_bit(bit, addr)   \
 +{\
 + bit += ((unsigned long) addr  7UL)  3;   \
 + addr = (void *) ((unsigned long) addr  ~7UL);  \
 +}
 +#elif BITS_PER_LONG == 32
 +#define mb_correct_addr_and_bit(bit, addr)   \
 +{\
 + bit += ((unsigned long) addr  3UL)  3;   \
 + addr = (void *) ((unsigned long) addr  ~3UL);  \
 +}
 +#else
 +#error how many bits you are?!
 +#endif

Why do these exist?

 +static inline int mb_test_bit(int bit, void *addr)
 +{
 + mb_correct_addr_and_bit(bit, addr);
 + return ext4_test_bit(bit, addr);
 +}

ext2_test_bit() already handles bitnum  wordsize.

If mb_correct_addr_and_bit() is actually needed then some suitable comment
would help.

 +static inline void mb_set_bit(int bit, void *addr)
 +{
 + mb_correct_addr_and_bit(bit, addr);
 + ext4_set_bit(bit, addr);
 +}
 +
 +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
 +{
 + mb_correct_addr_and_bit(bit, addr);
 + ext4_set_bit_atomic(lock, bit, addr);
 +}
 +
 +static inline void mb_clear_bit(int bit, void *addr)
 +{
 + mb_correct_addr_and_bit(bit, addr);
 + ext4_clear_bit(bit, addr);
 +}
 +
 +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
 +{
 + mb_correct_addr_and_bit(bit, addr);
 + ext4_clear_bit_atomic(lock, bit, addr);
 +}
 +
 +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int 
 *max)

uninlining this will save about eighty squigabytes of text.

Please review all of ext4/jbd2 with a view to removig unnecessary and wrong
inlings.

 +{
 + char *bb;
 +
 + /* FIXME!! is this needed */
 + BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
 + BUG_ON(max == NULL);
 +
 + if (order  e4b-bd_blkbits + 1) {
 + *max = 0;
 + return NULL;
 + }
 +
 + /* at order 0 we see each particular block */
 + *max = 1  (e4b-bd_blkbits + 3);
 + if (order == 0)
 + return EXT4_MB_BITMAP(e4b);
 +
 + bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b-bd_sb)-s_mb_offsets[order];
 + *max = EXT4_SB(e4b-bd_sb)-s_mb_maxs[order];
 +
 + return bb;
 +}
 +

 ...

 +#else
 +#define mb_free_blocks_double(a, b, c, d)
 +#define mb_mark_used_double(a, b, c)
 +#define mb_cmp_bitmaps(a, b)
 +#endif

Please use the do{}while(0) thing.  Or, better, proper C functions which
have typechecking (unless this will cause undefined-var compile errors,
which happens sometimes)

 +/* find most significant bit */
 +static int fmsb(unsigned short word)
 +{
 + int order;
 +
 + if (word  255) {
 + order = 7;
 + word = 8;
 + } else {
 + order = -1;
 + }
 +
 + do {
 + order++;
 + word = 1;
 + } while (word != 0);
 +
 + return order;
 +}

Did we just reinvent fls()?

 +/* FIXME!! need more doc */
 +static void ext4_mb_mark_free_simple(struct super_block *sb,
 + void *buddy, unsigned first, int len,
 + struct ext4_group_info *grp)
 +{
 + struct ext4_sb_info *sbi = EXT4_SB(sb);
 + unsigned short min;
 + unsigned short max;
 + unsigned short chunk;
 + unsigned short border;
 +
 + BUG_ON(len = EXT4_BLOCKS_PER_GROUP(sb));
 +
 + border = 2  sb-s_blocksize_bits;

Won't this explode with = 32k blocksize?

 + while (len  0) {
 + /* find how many blocks can be covered since this position */
 + max = ffs(first | border) - 1;
 +
 + /* find how many blocks of power 2 we need to mark */
 + min = fmsb(len);
 +
 + if (max  min)
 + min = max;
 + chunk = 1  min;
 +
 + /* mark multiblock chunks only */
 + grp-bb_counters[min]++;
 + if (min  0)
 + mb_clear_bit(first  min,
 +  buddy + sbi-s_mb_offsets[min]);
 +
 + len -= chunk;
 + first += chunk;
 + }
 +}
 +
 
 ...

 +static int ext4_mb_init_cache(struct page *page, char *incore)
 +{
 + int blocksize;
 + int blocks_per_page;
 + int groups_per_page;
 + int err = 0;
 + int i;
 + ext4_group_t first_group;
 + int first_block;
 + struct super_block *sb;
 + struct buffer_head *bhs;
 + struct buffer_head **bh;
 + struct inode *inode;
 + char *data;
 + char *bitmap;
 +
 + mb_debug(init page %lu\n, page-index);

Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.

2008-01-23 Thread Andrew Morton
 On Mon, 21 Jan 2008 22:02:09 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote:
 +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t 
 block,
 + unsigned long max_blocks, struct buffer_head *bh,
 + int create, int extend_disksize)
 +{
 + int retval;
 + if (create) {
 + down_write((EXT4_I(inode)-i_data_sem));
 + } else {
 + down_read((EXT4_I(inode)-i_data_sem));
 + }
 + if (EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL) {
 + retval =  ext4_ext_get_blocks(handle, inode, block, max_blocks,
 + bh, create, extend_disksize);
 + } else {
 + retval = ext4_get_blocks_handle(handle, inode, block,
 + max_blocks, bh, create, extend_disksize);
 + }
 + if (create) {
 + up_write((EXT4_I(inode)-i_data_sem));
 + } else {
 + up_read((EXT4_I(inode)-i_data_sem));
 + }

This function has many unneeded braces.  checkpatch used to detect this
but it seems to have broken.

 + return retval;
 +}
  static int ext4_get_block(struct inode *inode, sector_t iblock,
   struct buffer_head *bh_result, int create)

Mising newline.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

2008-01-23 Thread Andrew Morton
 On Mon, 21 Jan 2008 22:02:15 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote:
 The below patch add ioctl for migrating ext3 indirect block mapped inode
 to ext4 extent mapped inode.

This patch adds lots of weird and inexplicable single- and double-newlines
in inappropriate places.  However it frequently forgets to add newlines
between end-of-locals and start-of-code, which is usual practice.


+struct list_blocks_struct {
+   ext4_lblk_t first_block, last_block;
+   ext4_fsblk_t first_pblock, last_pblock;
+};

This structure would benefit from some code comments.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-23 Thread Andrew Morton
 On Wed, 23 Jan 2008 04:12:16 -0500 Abhishek Rai [EMAIL PROTECTED] wrote:
  I'm wondering about the interaction between this code and the
  buffer_boundary() logic.  I guess we should disable the buffer_boundary()
  handling when this code is in effect.  Have you reviewed and tested that
  aspect?
 
 Thanks for pointing this out, I had totally missed this issue in my change. 
 I've now made the call to set_buffer_boundary() in ext3_get_blocks_handle() 
 subject to metacluster option being set.
 

Did it make any performance difference?  iirc the buffer_boundary stuff was
worth around 10% on a single linear read of a large, well-laid-out file.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-15 Thread Andrew Morton

I'm wondering about the real value of this change, really.

In any decent environment, people will fsck their ext3 filesystems during
planned downtime, and the benefit of reducing that downtime from 6
hours/machine to 2 hours/machine is probably fairly small, given that there
is no service interruption.  (The same applies to desktops and laptops).

Sure, the benefit is not *zero*, but it's small.  Much less than it would
be with ext2.  I mean, the avoid unplanned fscks feature is the whole
reason why ext3 has journalling (and boy is that feature expensive during
normal operation).

So...  it's unobvious that the benefit of this feature is worth its risks
and costs?

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 9692] New: journal_data mount option causes filesystem corruption with blocksize != 4096

2008-01-05 Thread Andrew Morton
On Sat,  5 Jan 2008 09:52:15 -0800 (PST) [EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9692

Summary: journal_data mount option causes filesystem corruption
 with blocksize != 4096
Product: File System
Version: 2.5
  KernelVersion: 2.6.23.9
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: high
   Priority: P1
  Component: ext3
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Most recent kernel where this bug did not occur: -
 Older kernels have this problem too (I think I noticed this booting = 2.6.21,
 definitely 2.6.22).

I'm getting the feeling that we should just disable data=journal.  Make it
use data=ordered mode instead.  It isn't getting a lot of attention..

 Distribution: Gentoo Linux x86
 This bug seems to be hardware-independent (tested on three different machines
 which all use quite different drivers). If you need hardware information or 
 any
 other log or configuration files, let me know please.
 
 Problem Description:
 When creating an ext3 filesystem with journal_data option and block sizes
 different than 4096 (tested: 1024, 2048) filesystem corruption will occur if
 certain operations are performed (see below).
 Corruption will not occur if 4096 block size is used, or if any other block
 size is used together with journal_data_ordered or journal_data_writeback.
 No errors in dmesg.
 
 Steps to reproduce:
 I found this bug using an audio file tagger, so you need exfalso which is part
 of quodlibet (http://www.sacredchao.net/quodlibet/). No other file tagger I
 used produced this kind of problem. Still, this has to be a kernel problem,
 right??
 
 1. Create ext3 file system:
 mkfs.ext3 -O has_journal,dir_index -b 1024 /dev/sdd1
 tune2fs -c 0 -i 0 -m 0 -o journal_data /dev/sdd1
 
 tune2fs 1.40.3 (05-Dec-2007)  (filtered)
 Filesystem volume name:   none
 Last mounted on:  not available
 Filesystem magic number:  0xEF53
 Filesystem revision #:1 (dynamic)
 Filesystem features:  has_journal resize_inode dir_index filetype
 sparse_super
 Filesystem flags: signed directory hash
 Default mount options:journal_data
 Filesystem state: clean
 Errors behavior:  Continue
 Filesystem OS type:   Linux
 Inode count:  126976
 Block count:  1012060
 Reserved block count: 0
 Free blocks:  976865
 Free inodes:  126965
 First block:  1
 Block size:   1024
 Fragment size:1024
 Reserved GDT blocks:  256
 Blocks per group: 8192
 Fragments per group:  8192
 Inodes per group: 1024
 Inode blocks per group:   128
 Last mount time:  n/a
 Mount count:  0
 Maximum mount count:  -1
 Check interval:   0 (none)
 Reserved blocks uid:  0 (user root)
 Reserved blocks gid:  0 (group root)
 First inode:  11
 Inode size:   128
 Journal inode:8
 Default directory hash:   tea
 Journal backup:   inode blocks
 
 2. Mount it and copy mp3,ogg,... files to it. This does not cause any file
 system corruption (which you can confirm by running fsck).
 
 pmount /dev/sdd1:
 /dev/sdd1 on /media/sdd1 type ext3 (rw,noexec,nosuid,nodev,errors=continue)
 
 3. Use quodlibet/exfalso to change the id3 tags. Add tags to it if not 
 present,
 or delete them if already present. This will lead to file system corruption.
 
 brw-r- 1 root disk 8, 49 /dev/sdd1
 
 4. Unmount the volume.
 pumount /dev/sdd1
 
 5. Run fsck -fvD /dev/sdd1. It will complain about wrong i_size.
 
 e2fsck 1.40.3 (05-Dec-2007)
 Pass 1: Checking inodes, blocks, and sizes
 Inode 47106, i_size is 5015509, should be 5017600.  Fixy? yes
 Inode 47107, i_size is 4657736, should be 4661248.  Fixy? yes
 Inode 47109, i_size is 11928555, should be 11931648.  Fixy? yes
 Inode 47111, i_size is 5698454, should be 5701632.  Fixy? yes
 Inode 47112, i_size is 9384018, should be 9388032.  Fixy? yes
 Inode 47114, i_size is 5679228, should be 5681152.  Fixy? yes
 Inode 47115, i_size is 6107218, should be 6111232.  Fixy? yes
 Inode 47117, i_size is 4354297, should be 4358144.  Fixy? yes
 Inode 47118, i_size is 4512286, should be 4513792.  Fixy? yes
 Inode 47120, i_size is 7010846, should be 7012352.  Fixy? yes
 
 Pass 2: Checking directory structure
 Pass 3: Checking directory connectivity
 Pass 3A: Optimizing directories
 Pass 4: Checking reference counts
 Pass 5: Checking group summary information
 
 /dev/sdd1: * FILE SYSTEM WAS MODIFIED *
 
   28 inodes used (0.02%)
   14 non-contiguous inodes (50.0%)
  # of inodes with ind/dind/tind blocks: 15/15/0
   123417 blocks used (12.19%)
0 bad blocks
0 large files
 
   16 regular files
3 directories
0 character device files
0 block device 

ext4 still broken on multiple architectures

2007-12-12 Thread Andrew Morton
fs/ext4/mballoc.c: In function 'ext4_mb_generate_buddy':
fs/ext4/mballoc.c:836: error: implicit declaration of function 
'ext2_find_next_bit'

Can someone please get this fixed?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered

2007-12-11 Thread Andrew Morton

(switching to email - please respond via emailed reply-to-all, not via the
bugzilla web interface)

On Tue, 11 Dec 2007 11:36:39 -0800 (PST)
[EMAIL PROTECTED] wrote:

 http://bugzilla.kernel.org/show_bug.cgi?id=9546
 
Summary: Huge latency in concurrent I/O when using data=ordered
Product: File System
Version: 2.5
  KernelVersion: 2.6.24-rc4
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: ext3
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 Most recent kernel where this bug did not occur:
 Unknown, certainly not a regression, but something specific to ext3 algorithm
 
 Distribution:
 Bluewhite64 12 (Slackware 12 64 bits port) and Slackware 12
 
 Hardware Environment:
 Athlon 64 3000+laptop IDE 5400 80GB+1.2GB RAM
 Athlon 64X2 4200+SATA 7200 200GB drive+1GB
 Athlon 2800+IDE 7200 40GB drive+512MB
 
 Software Environment:
 dd, cp, konqueror/KDE, mount/tune2fs
 
 Problem Description:
 When the system does heavy input/output operations on big files, small files
 access from other applications are always not served for very long time. This
 can cause huge latencies. The system is really not usable at all, even with 
 all
 the recent improvements done to increase interactivity on desktop.
 
 This behaviour is very visible with the simple following test case:
 1. Build a DVD structure from big MPEG+PS files with dvdauthor (it copies the
 files in the DVD stucture, then pass on them to fix VOBUs, but this part is 
 not
 very long so this is not the main problem).
 2. While the computer is doing this, try to open a web browser such as
 konqueror. Then open a page from bookmark. Then open a new tab, then open
 another page from bookmark. Switch bak to first page.
 
 What I get is:
 35 seconds to open Konqueror.
 8 seconds to open the bookmark menu. Incredible.
 30 seconds to open the web page (DSL/10GBits).
 5 seconds to open the second tab.
 6 seconds to reopen the menu.
 36 seconds to open the second page.
 14 seconds to come back to first tab.
 This is unbelievable! The system is completely trashed, with more than 1GB 
 RAM,
 whatever the hardware configuration is used.
 
 Of course, I investigated the problem... First, DMA is OK. Second, I thought
 cache would make memory swapped. So I used echo 0  swapiness. Then (of 
 course,
 the system was not swapping at all), I thought TEXT sections from software
 discarded (that would be simply stupid, but who knows?). I then tried to make
 the writing process throttled with dirty_background_ratio (say 10%) while
 reserving a greater RAM portion for the rest of the system with dirty_ratio
 (say 70%). No way. Then I launched top, and looked at the WCHAN to see what 
 was
 the problem for the frozen process (ie: konqueror). The I saw the faulty guy:
 log_wait_commit!
 
 So I concluded there is unfair access to the filesystem journal. So I tried
 other journaling options than the default ordered data mode. The results 
 were
 really different: 5s, 2s, 4s, etc., both with journal and write back mode!
 
 I therefore think there is a great lock and even maybe a priority inversion in
 log_wait_commit of the ext3 filesystem. I think that, even if it is throttled,
 the writing process always get access to the journal in ordered mode, simply
 because it writes many pages at a time and because the ordered mode indeed
 implies... ordering of requests (as I understand it).
 
 It's sad this is the default option that gives the worst interactivity
 problems. Indeed, this messes all previous work done to enhance desktop
 experience I think, too bad!
 
 Btw, I've also seen on Internet that some people reported that journal data
 mode gives better performance. I think the problem was indeed related to
 latency rather than performance (timing the writing process effectively shows 
 a
 output rate halved with journal data mode, and twice the time to process).
 
 Steps to reproduce:
 I did a simple script:
 #!/bin/bash
 
 SRC1=src1.bin
 SRC2=src2.bin
 DEST_DIR=tmpdir
 DST1=dst.bin
 
 # First, create the source files:
 if [ ! -e $SRC1 ] ; then
 dd if=/dev/zero of=$SRC1 bs=10k count=15
 fi
 if [ ! -e $SRC2 ] ; then
 dd if=/dev/zero of=$SRC2 bs=10k count=15
 fi
 mkdir $DEST_DIR  /dev/null 21
 sync
 
 # Do the test:
 echo Trashing the system...
 rm $DEST_DIR/$DST1  /dev/null 21
 cp $SRC1 $DEST_DIR/$DST1
 cat $SRC2  $DEST_DIR/$DST1
 echo Done!
 
 #rm -rf $DEST_DIR $SRC1 $SRC2
 
 While running it, try to use normally the interactive programs, such as
 konqueror (the program should have to access files, such as cookies, cache and
 so for konqueror). Then remount/tune the filesystem to use another data mode
 for ext3.
 
 I didn't try with other journaling filesystems. I guess ext2 also doesn't
 exhibit the problem.
 

Interesting that data=writeback helped this.  You don't 

Re: 2.6.24-rc4-mm1: some issues on sparc64

2007-12-09 Thread Andrew Morton
On Sun, 09 Dec 2007 00:45:17 -0800 (PST) David Miller [EMAIL PROTECTED] wrote:

 From: Andrew Morton [EMAIL PROTECTED]
 Date: Sat, 8 Dec 2007 10:22:39 -0800
 
  That's
  
  J_ASSERT_BH(bh, !buffer_jbddirty(bh));
  
  at the end of journal_unmap_buffer().
  
  I don't recall seeing that before and I can't think of anything we've
  done recently which could cause it, sorry.
 
 If the per-cpu data patches are in the -mm tree that is the first
 place I would start looking at for possible cause.

They aren't.  The dust hadn't settled enough on those when Christoph shot
through on vacation.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc4-mm1: some issues on sparc64

2007-12-08 Thread Andrew Morton
On Sat, 8 Dec 2007 19:20:28 +0100 Mariusz Kozlowski [EMAIL PROTECTED] wrote:

   The box is sun ultra 60 (dual sparc64). This was caught when
 system (gentoo) was emerging some package. 
 
 [27006.402237] kernel BUG at fs/jbd/transaction.c:1894!

That's

J_ASSERT_BH(bh, !buffer_jbddirty(bh));

at the end of journal_unmap_buffer().

I don't recall seeing that before and I can't think of anything we've
done recently which could cause it, sorry.

 [27006.402268]   \|/  \|/
 [27006.402274]   @'/ .. \`@
 [27006.402279]   /_| \__/ |_\
 [27006.402285]  \__U_/

x86 needs that.

 [27006.402298] rm(4713): Kernel bad sw trap 5 [#1]
 [27006.402538] TSTATE: 009911009605 TPC: 0053b1cc TNPC: 
 0053b1d0 Y: Not tainted
 [27006.402579] TPC: journal_invalidatepage+0x3d4/0x460
 [27006.402593] g0: 0002 g1:  g2: 0001 
 g3: f800a7d9
 [27006.402610] g4: f800b54ea460 g5: f8007f832000 g6: f800a7d9 
 g7: 0076d868
 [27006.402627] o0: 0072b660 o1: 0766 o2: 0002 
 o3: 0001
 [27006.402644] o4: 008a2940 o5:  sp: f800a7d92c91 
 ret_pc: 0053b1c4
 [27006.402665] RPC: journal_invalidatepage+0x3cc/0x460
 [27006.402679] l0: f800afbf4070 l1: 0069511c l2: 2000 
 l3: 
 [27006.402696] l4: 0001 l5: f800ba4cb730 l6: f800bf1cd338 
 l7: 0001
 [27006.402713] i0: f800bf1cd000 i1: 000201db2708 i2:  
 i3: 00727000
 [27006.402730] i4: 0020 i5: f800bf1cd028 i6: f800a7d92d51 
 i7: 00529254
 [27006.402763] I7: ext3_invalidatepage+0x3c/0x60
 [27006.402776] Caller[00529254]: ext3_invalidatepage+0x3c/0x60
 [27006.402800] Caller[004b22fc]: do_invalidatepage+0x24/0x60
 [27006.402826] Caller[004b29c4]: truncate_complete_page+0x6c/0x80
 [27006.402849] Caller[004b2a6c]: truncate_inode_pages_range+0x94/0x440
 [27006.402872] Caller[004b2e2c]: truncate_inode_pages+0x14/0x20
 [27006.402894] Caller[00529888]: ext3_delete_inode+0x10/0x160
 [27006.402918] Caller[004e7ca0]: generic_delete_inode+0x88/0x120
 [27006.402949] Caller[004e7e60]: generic_drop_inode+0x128/0x1c0
 [27006.402971] Caller[004e75d4]: iput+0x7c/0xa0
 [27006.402992] Caller[004dd680]: do_unlinkat+0x108/0x1a0
 [27006.403024] Caller[004dd884]: sys_unlinkat+0x2c/0x60
 [27006.403047] Caller[004062d4]: linux_sparc_syscall32+0x3c/0x40
 [27006.403081] Caller[f7e7d0ec]: 0xf7e7d0f4
 [27006.403102] Instruction DUMP: 92102766  7ffbbeaf  90122260 91d02005 
 92102780  7ffbbeab  90122260  91d02005  7ffbbea8
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 9483] circular locking dependency detected

2007-12-04 Thread Andrew Morton
On Tue, 4 Dec 2007 22:25:18 +0100
Ingo Molnar [EMAIL PROTECTED] wrote:

 
 * Aneesh Kumar K.V [EMAIL PROTECTED] wrote:
 
  ===
  [ INFO: possible circular locking dependency detected ]
  2.6.24-rc3 #6
  ---
  bash/2294 is trying to acquire lock:
  (journal-j_list_lock){--..}, at: [c01eee2f] 
  journal_try_to_free_buffers+0x76/0x10c
  
  but task is already holding lock:
  (inode_lock){--..}, at: [c01864b6] drop_pagecache+0x48/0xd8
  
  which lock already depends on the new lock.
 
 Andrew, drop_pagecache() is root-only and it has some known deadlock, 
 right?
 

yup.   It takes inode_lock at too high a level so it can walk the per-sb inode
lists.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] Add buffer head related helper functions

2007-11-16 Thread Andrew Morton
On Thu, 15 Nov 2007 17:40:43 +0530
Aneesh Kumar K.V [EMAIL PROTECTED] wrote:

 Add buffer head related helper function
 bh_uptodate_or_lock and bh_submit_read
 which can be used by file system
 

The patches look sane.

 ---
  include/linux/buffer_head.h |   29 +
  1 files changed, 29 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
 index da0d83f..82cc9ef 100644
 --- a/include/linux/buffer_head.h
 +++ b/include/linux/buffer_head.h
 @@ -327,6 +327,35 @@ static inline void lock_buffer(struct buffer_head *bh)
  
  extern int __set_page_dirty_buffers(struct page *page);
  
 +/* Return true if the buffer is up-to-date.
 + * Return false, with the buffer locked, if not.
 + */
 +static inline int bh_uptodate_or_lock(struct buffer_head *bh)
 +{
 + if (!buffer_uptodate(bh)) {
 + lock_buffer(bh);
 + if (!buffer_uptodate(bh))
 + return 0;
 + unlock_buffer(bh);
 + }
 + return 1;
 +}
 +/*
 + * Submit a locked buffer for reading,
 + * return a negative error and release
 + * the buffer if failed.
 + */
 +static inline int bh_submit_read(struct buffer_head *bh)
 +{
 + get_bh(bh);
 + bh-b_end_io = end_buffer_read_sync;
 + submit_bh(READ, bh);
 + wait_on_buffer(bh);
 + if (buffer_uptodate(bh))
 + return 0;
 + brelse(bh);
 + return -EIO;
 +}
  #else /* CONFIG_BLOCK */

But these are waay too big to be inlined.  Could I ask that they be
turned into regular EXPORT_SYMBOL()ed functions in buffer.c?

Might as well turn the comments into regular kerneldoc format, too.

The function names are a bit awkward, but I can't think of anything better.

I'm surprised that we don't already have a function which does what
bh_submit_read() does.

bh_submit_read() might become a bit simpler if it called ll_rw_block().

I'd have thought that to be more general, bh_submit_read() should return
immediately if the buffer is uptodate.  ll_rw_block() sort of helps there.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext3,4:fdatasync should skip metadata writeout

2007-11-15 Thread Andrew Morton
On Fri, 16 Nov 2007 11:47:27 +0900 Hisashi Hifumi [EMAIL PROTECTED] wrote:

 Currently fdatasync is identical to fsync in ext3,4.
 I think fdatasync should skip journal flush in data=ordered and 
 data=writeback mode
 because this syscall is not required to synchronize the metadata.

I suppose so.  Although one wonders what earthly point there is in syncing
a file's data if we haven't yet written out the metadata which is required
for locating that data.

IOW, fdatasync() is only useful if the application knows that it is overwriting
already-instantiated blocks.

In which case it might as well have used fsync().  For ext2-style filesystems,
anyway.

hm.  It needs some thought.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext3,4:fdatasync should skip metadata writeout

2007-11-15 Thread Andrew Morton
On Thu, 15 Nov 2007 22:47:40 -0500 Wendy Cheng [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
 
 On Fri, 16 Nov 2007 11:47:27 +0900 Hisashi Hifumi [EMAIL PROTECTED] wrote:
 
   
 
 Currently fdatasync is identical to fsync in ext3,4.
 I think fdatasync should skip journal flush in data=ordered and 
 data=writeback mode
 because this syscall is not required to synchronize the metadata.
 
 
 
 I suppose so.  Although one wonders what earthly point there is in syncing
 a file's data if we haven't yet written out the metadata which is required
 for locating that data.
 
 IOW, fdatasync() is only useful if the application knows that it is 
 overwriting
 already-instantiated blocks.
 
 In which case it might as well have used fsync().  For ext2-style 
 filesystems,
 anyway.
 
 hm.  It needs some thought.
 
   
 
 
 There are non-trivial amount of performance critical programs, 
 particularly in financial application segment ported from legacy UNIX 
 platforms, know the difference between fsync() and fdatasync(). Those 
 can certainly take advantages of this separation. Don't underestimate 
 the talents of these application programmers.
 

If they're that good, they'll be using sync_file_range() ;)
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Introduce ext4_find_next_bit

2007-11-13 Thread Andrew Morton
On Wed, 14 Nov 2007 00:41:03 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote:

 
 
 Andrew Morton wrote:
  On Fri, 21 Sep 2007 10:55:05 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] 
  wrote:
  
  Also add generic_find_next_le_bit
 
  This gets used by the ext4 multi block allocator patches.
 
  
  arm allmodconfig:
  
  fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy':
  fs/ext4/mballoc.c:836: error: implicit declaration of function 
  `ext2_find_next_bit'
  
  This patch makes my head spin.
  
  Why did we declare generic_find_next_le_bit() in
  include/asm-powerpc/bitops.h (wrong) as well as in
  include/asm-generic/bitops/le.h (presumably correct)?
  
 
 I was following the coding style used for rest of the APIs
 like ext4_set_bit.

Well.  There's quite a bit of cruft in there.  If you do come across
something which isn't right, please do try to find the time to fix it up
first.

That might be non-trivial - powerpc does seem to have gone off on a strange
tangent there.

 
  Why is it touching a powerpc file and no any other architectures? 
  Something screwed up in powerpc land?
  
  And why did arm break?
 
 arm and below list of arch doesn't include the 
 asm-generic/bitops/ext2-non-atomic.h 
 
 I did a grep and that list the below architectures as also affected.
 arm, m68k, m68knommu, s390
 
  
  Shudder.  Anyway, please fix, and if that fix requires that various
  braindamaged be repaired, please repair the braindamage rather than going
  along with it.
  
  
 
 That should be a separate patch altogether. I wanted to do the cleanup
 along with the usages such as but never got time to do the same.
 
 #define ocfs2_set_bit ext2_set_bit
 #define udf_set_bit(nr,addr) ext2_set_bit(nr,addr)
 direct usage in mb
 md/bitmap.c +799
 md/dm-log.c +177
 
 I will send a patch tomorrow that fix  arm and other architectures. I guess 
 the cleanup
 can be a separate patch ? 
 

Yes, that's a separate work, thanks.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Introduce ext4_find_next_bit

2007-11-12 Thread Andrew Morton
On Fri, 21 Sep 2007 10:55:05 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote:

 Also add generic_find_next_le_bit
 
 This gets used by the ext4 multi block allocator patches.
 

arm allmodconfig:

fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy':
fs/ext4/mballoc.c:836: error: implicit declaration of function 
`ext2_find_next_bit'

This patch makes my head spin.

Why did we declare generic_find_next_le_bit() in
include/asm-powerpc/bitops.h (wrong) as well as in
include/asm-generic/bitops/le.h (presumably correct)?

Why is it touching a powerpc file and no any other architectures? 
Something screwed up in powerpc land?

And why did arm break?

Shudder.  Anyway, please fix, and if that fix requires that various
braindamaged be repaired, please repair the braindamage rather than going
along with it.

Thanks.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bugme-new] [Bug 9329] New: ext4: delalloc space accounting problem drops data

2007-11-08 Thread Andrew Morton
 On Thu,  8 Nov 2007 09:42:10 -0800 (PST) [EMAIL PROTECTED] wrote:
 http://bugzilla.kernel.org/show_bug.cgi?id=9329
 
Summary: ext4: delalloc space accounting problem drops data
Product: File System
Version: 2.5
  KernelVersion: 2.6.24-rc1
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: ext4
 AssignedTo: [EMAIL PROTECTED]
 ReportedBy: [EMAIL PROTECTED]
 
 
 2.6.24-rc1 + ext4 git patch queue from last week or so.
 
 It appears that delalloc does not track used space properly, and fails to
 return ENOSPC as appropriate:
 
 [EMAIL PROTECTED] ~]# mkfs.ext3 -I 256 /dev/sdb7 32768
 [EMAIL PROTECTED] ~]# mount -t ext4dev -o 
 data=writeback,delalloc,extents,mballoc
 /dev/sdb7 /mnt/test
 [EMAIL PROTECTED] ~]# df -h /mnt/test
 FilesystemSize  Used Avail Use% Mounted on
 /dev/sdb7  30M  4.5M   24M  16% /mnt/test
 [EMAIL PROTECTED] ~]# du -h /tmp/1Mfile 
 1.1M/tmp/1Mfile
 [EMAIL PROTECTED] ~]# for I in `seq 1 50`; do cp /tmp/1Mfile 
 /mnt/test/1Mfile-$I;
 done
 [EMAIL PROTECTED] ~]# df -h /mnt/test
 FilesystemSize  Used Avail Use% Mounted on
 /dev/sdb7  30M   30M 0 100% /mnt/test
 
 all resulting files are 1M in length:
 [EMAIL PROTECTED] ~]# ls -l /mnt/test/1M* | grep -v 1048576
 [EMAIL PROTECTED] ~]# ls -l /mnt/test/1M* | grep 1048576 | wc -l
 50
 but many of them have silently dropped data on the floor:
 [EMAIL PROTECTED] ~]# du -hc /mnt/test/1Mfile-* | grep -v 1.0M
 596K/mnt/test/1Mfile-26
 0   /mnt/test/1Mfile-27
 0   /mnt/test/1Mfile-28
 0   /mnt/test/1Mfile-29
 0   /mnt/test/1Mfile-30
 snip
 
 When mounted with nodelalloc, I get proper behavior:
 
 [EMAIL PROTECTED] ~]# for I in `seq 1 50`; do cp /tmp/1Mfile 
 /mnt/test/1Mfile-$I;
 done
 cp: writing `/mnt/test/1Mfile-26': No space left on device
 cp: writing `/mnt/test/1Mfile-27': No space left on device
 cp: writing `/mnt/test/1Mfile-28': No space left on device
 cp: writing `/mnt/test/1Mfile-29': No space left on device
 snip
 
 
 -- 
 Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
 --- You are receiving this mail because: ---
 You are on the CC list for the bug, or are watching someone who is.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] ext4 update

2007-10-25 Thread Andrew Morton
On Wed, 17 Oct 2007 08:59:53 -0700 (PDT)
Linus Torvalds [EMAIL PROTECTED] wrote:

 On Wed, 17 Oct 2007, Theodore Ts'o wrote:
  
  Please pull from:
  
  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
  for_linus
 
 This conflicts in nontrivial ways with
 
   commit 7c9e69faa28027913ee059c285a5ea8382e24b5d
   Author: Aneesh Kumar K.V [EMAIL PROTECTED]
   Date:   Tue Oct 16 23:27:02 2007 -0700
 
   ext2/ext3/ext4: add block bitmap validation
 
 which I just merged from -mm.

catching up

There shouldn't have been conflicts here - if there were I wouldn't have
sent those patches.  Unless there were things in the ext4 pull which
weren't present in the ext4 quilt tree which I included in 2.6.23-mm1?

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] ext4 update

2007-10-25 Thread Andrew Morton
On Thu, 25 Oct 2007 16:44:21 -0700 (PDT)
Linus Torvalds [EMAIL PROTECTED] wrote:

 
 
 On Thu, 25 Oct 2007, Andrew Morton wrote:
  
  There shouldn't have been conflicts here - if there were I wouldn't have
  sent those patches.  Unless there were things in the ext4 pull which
  weren't present in the ext4 quilt tree which I included in 2.6.23-mm1?
 
 Well, you merge your patch-series by patching.
 
 You should have noticed by now that GNU patch in particular will happily 
 apply a patch whether it conflicts or not. So it's entirely possible that 
 it didn't conflict for you, but applied cleanly and sanely.
 

hrm, could be.  It would be strange for that to happen quietly with fuzz=1
and to still produce a compileable result.

But there weren't any patches in this git-merge which weren't in 2.6.23-mm1
so maybe something like that happened.  Or maybe that fact that this pull
only contained _some_ of the ext4 patches which were in -mm somehow affected
things.

Oh well, I should have sent the ext4 changes via Ted anyway.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -mm] Split fs/Kconfig: ext[234]

2007-10-22 Thread Andrew Morton
On Sat, 6 Oct 2007 12:15:08 +0400
Alexey Dobriyan [EMAIL PROTECTED] wrote:

 Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
 ---
 
  fs/Kconfig  |  191 --
  fs/ext2/Kconfig |   55 +
  fs/ext3/Kconfig |   67 
  fs/ext4/Kconfig |   65 +++
  4 files changed, 190 insertions(+), 188 deletions(-)

A reasonable thing to do, but poorly timed.  I'd prefer not to carry a
patch like this through two months of development, please.

Late in the -rc timeframe would be a better time.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-18 Thread Andrew Morton
On Thu, 18 Oct 2007 02:03:39 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] 
wrote:

 On Wed, 17 Oct 2007, Andrew Morton wrote:
 
  b) what happens when an old ext2 driver tries to read and/or write this
 directory entry?  Do we need a compat flag for it?
 
 Old ext2 only supports up to 4k
 
 include/linux/ext2_fs.h:
 
 #define EXT2_MIN_BLOCK_SIZE 1024
 #define EXT2_MAX_BLOCK_SIZE 4096
 #define EXT2_MIN_BLOCK_LOG_SIZE   10
 
 Should fail to mount the volume since the block size is too large.

should, but does it?

box:/usr/src/25 grep MAX_BLOCK_SIZE fs/ext2/*.[ch] include/linux/ext2*
include/linux/ext2_fs.h:#define EXT2_MAX_BLOCK_SIZE 4096
box:/usr/src/25 

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-17 Thread Andrew Morton
On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara [EMAIL PROTECTED] wrote:

 +static inline __le16 ext2_rec_len_to_disk(unsigned len)
 +{
 + if (len == (1  16))
 + return cpu_to_le16(EXT2_MAX_REC_LEN);
 + else if (len  (1  16))
 + BUG();
 + return cpu_to_le16(len);
 +}

Of course, ext2 shouldn't be trying to write a bad record length into a
directory entry.  But are we sure that there is no way in which this
situation could occur is the on-disk data was _already_ bad?

Because it is very bad for a fileysstem to go BUG in response to unexpected
data on the disk.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-17 Thread Andrew Morton
On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara [EMAIL PROTECTED] wrote:

 With 64KB blocksize, a directory entry can have size 64KB which does not fit
 into 16 bits we have for entry lenght. So we store 0x instead and convert
 value when read from / written to disk.

btw, this changes ext2's on-disk format.

a) is the ext2 format documented anywhere?  If so, that document will
   need updating.

b) what happens when an old ext2 driver tries to read and/or write this
   directory entry?  Do we need a compat flag for it?

c) what happens when old and new ext3 or ext4 try to read/write this
   directory entry?


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext2 statfs improvement for block and inode free count

2007-10-16 Thread Andrew Morton
On Fri, 13 Jul 2007 18:36:54 -0700
Badari Pulavarty [EMAIL PROTECTED] wrote:

 
 More statfs() improvements for ext2. ext2 already maintains
 percpu counters for free blocks and inodes. Derive free
 block count and inode count by summing up percpu counters,
 instead of counting up all the groups in the filesystem
 each time.
 
 
 Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]
 Acked-by: Andreas Dilger [EMAIL PROTECTED]
 
  fs/ext2/super.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 Index: linux-2.6.22/fs/ext2/super.c
 ===
 --- linux-2.6.22.orig/fs/ext2/super.c 2007-07-13 20:06:38.0 -0700
 +++ linux-2.6.22/fs/ext2/super.c  2007-07-13 20:06:51.0 -0700
 @@ -1136,12 +1136,12 @@ static int ext2_statfs (struct dentry * 
   buf-f_type = EXT2_SUPER_MAGIC;
   buf-f_bsize = sb-s_blocksize;
   buf-f_blocks = le32_to_cpu(es-s_blocks_count) - overhead;
 - buf-f_bfree = ext2_count_free_blocks(sb);
 + buf-f_bfree = percpu_counter_sum(sbi-s_freeblocks_counter);
   buf-f_bavail = buf-f_bfree - le32_to_cpu(es-s_r_blocks_count);
   if (buf-f_bfree  le32_to_cpu(es-s_r_blocks_count))
   buf-f_bavail = 0;
   buf-f_files = le32_to_cpu(es-s_inodes_count);
 - buf-f_ffree = ext2_count_free_inodes(sb);
 + buf-f_ffree = percpu_counter_sum(sbi-s_freeinodes_counter);
   buf-f_namelen = EXT2_NAME_LEN;
   fsid = le64_to_cpup((void *)es-s_uuid) ^
  le64_to_cpup((void *)es-s_uuid + sizeof(u64));
 

Guys, I have this patch in a stalled state pending some convincing
demonstration/argument that it's actually a worthwhile change.

How much hurt will it cause on large-numa, and how much benefit do we get
for that hurt? 
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-04 Thread Andrew Morton
On Mon, 01 Oct 2007 17:35:46 -0700
Mingming Cao [EMAIL PROTECTED] wrote:

 ext2: Avoid rec_len overflow with 64KB block size
 
 From: Jan Kara [EMAIL PROTECTED]
 
 With 64KB blocksize, a directory entry can have size 64KB which does not fit
 into 16 bits we have for entry lenght. So we store 0x instead and convert
 value when read from / written to disk.

This patch clashes in non-trivial ways with
ext2-convert-to-new-aops-fix.patch and perhaps other things which are
already queued for 2.6.24 inclusion, so I'll need to ask for an updated
patch, please.

Also, I'm planing on merging the ext2 reservations code into 2.6.24, so if
we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
additional testing and checking will be needed.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-04 Thread Andrew Morton
On Thu, 4 Oct 2007 16:40:44 -0600
Andreas Dilger [EMAIL PROTECTED] wrote:

 On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
  On Mon, 01 Oct 2007 17:35:46 -0700
   ext2: Avoid rec_len overflow with 64KB block size
   
   into 16 bits we have for entry lenght. So we store 0x instead and
   convert value when read from / written to disk.
  
  This patch clashes in non-trivial ways with
  ext2-convert-to-new-aops-fix.patch and perhaps other things which are
  already queued for 2.6.24 inclusion, so I'll need to ask for an updated
  patch, please.
 
 If the rel_len overflow patch isn't going to make it, then we also need
 to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
 to allow this to be up to 32768 w/o the rec_len overflow fix however.
 

Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
ext2-support-large-blocksize-up-to-pagesize.patch.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc

2007-09-26 Thread Andrew Morton
On Fri, 21 Sep 2007 16:13:56 -0700
Mingming Cao [EMAIL PROTECTED] wrote:

 Convert kmalloc to kzalloc() and get rid of the memset().

I split this into separate ext3/jbd and ext4/jbd2 patches.  It's generally
better to raise separate patches, please - the ext3 patches I'll merge
directly but the ext4 patches should go through (and be against) the ext4
devel tree.

I fixed lots of rejects against the already-pending changes to these
filesystems.

You forgot to remove the memsets in both start_this_handle()s.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jbd : config_jbd_debug cannot create /proc entry

2007-09-26 Thread Andrew Morton
On Tue, 25 Sep 2007 16:36:08 +0200
Jan Kara [EMAIL PROTECTED] wrote:

  On Tue, 25 Sep 2007 07:49:38 -0500
  Jose R. Santos [EMAIL PROTECTED] wrote:
  
   On Tue, 25 Sep 2007 13:50:46 +0200
   Jan Kara [EMAIL PROTECTED] wrote:
 Jan Kara wrote:
 
 -#define create_jbd_proc_entry() do {} while (0)
 -#define remove_jbd_proc_entry() do {} while (0)
 +static ctl_table fs_table[] = {
 +{
 +.ctl_name   = -1,   /* Don't want it */
 
 
 
 shouldn't this be CTL_UNNUMBERED ?
  Oh, it should be. I didn't notice we have this :) Thanks for notifying
me. Attached is a fixed version.
   
   This was fixed in JBD2 by moving the jbd-debug file to debugfs:
   http://lkml.org/lkml/2007/7/11/334
   
   Since this code is already in the kernel, we should keep it consistent. 
   
  
  OK.  Here's a quick patch to fix this.  Adapted from the JBD2 patch.
  Let me know what you think.
   Looks fine - exactly what I've just done here :).

hm.  I found rather a lot of issues.  If this patch is derived from the
JBD2 patch then perhaps the JBD2 patch needs some looking at.

  Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
   You can add Signed-off-by: Jan Kara [EMAIL PROTECTED]

I suspect you might be getting your signed-off-bys and acked-bys mixed up. 
(If not this patch, then the previous one).  Please see
Documentation/SubmittingPatches section 13 for the difference.

Jose, please review and if possible runtime test these proposed changes?



From: Andrew Morton [EMAIL PROTECTED]

- use `#ifdef foo' instead of `#if defined(foo)'

- CONFIG_JBD_DEBUG depends on CONFIG_DEBUG_FS so we don't need to duplicate
  that logic in the .c file ifdefs

- Make journal_enable_debug __read_mostly just for the heck of it

- Make jbd_debugfs_dir and jbd_debug static

- debugfs_remove(NULL) is legal: remove unneeded tests

- jbd_create_debugfs_entry is a better name than create_jbd_debugfs_entry

- ditto remove_jbd_debugfs_entry

- C functions are preferred over macros

Cc: Jose R. Santos [EMAIL PROTECTED]
Cc: linux-ext4@vger.kernel.org
Cc: Jan Kara [EMAIL PROTECTED]
Cc: Jose R. Santos [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---


diff -puN fs/jbd/journal.c~jbd-config_jbd_debug-cannot-create-proc-entry-fix 
fs/jbd/journal.c
--- a/fs/jbd/journal.c~jbd-config_jbd_debug-cannot-create-proc-entry-fix
+++ a/fs/jbd/journal.c
@@ -1853,16 +1853,15 @@ void journal_put_journal_head(struct jou
 /*
  * debugfs tunables
  */
-#if defined(CONFIG_JBD_DEBUG)
-u8 journal_enable_debug;
-EXPORT_SYMBOL(journal_enable_debug);
-#endif
+#ifdef CONFIG_JBD_DEBUG
 
-#if defined(CONFIG_JBD_DEBUG)  defined(CONFIG_DEBUG_FS)
+u8 journal_enable_debug __read_mostly;
+EXPORT_SYMBOL(journal_enable_debug);
 
-struct dentry  *jbd_debugfs_dir, *jbd_debug;
+static struct dentry *jbd_debugfs_dir;
+static struct dentry *jbd_debug;
 
-static void __init create_jbd_debugfs_entry(void)
+static void __init jbd_create_debugfs_entry(void)
 {
jbd_debugfs_dir = debugfs_create_dir(jbd, NULL);
if (jbd_debugfs_dir)
@@ -1871,18 +1870,21 @@ static void __init create_jbd_debugfs_en
   journal_enable_debug);
 }
 
-static void __exit remove_jbd_debugfs_entry(void)
+static void __exit jbd_remove_debugfs_entry(void)
 {
-   if (jbd_debug)
-   debugfs_remove(jbd_debug);
-   if (jbd_debugfs_dir)
-   debugfs_remove(jbd_debugfs_dir);
+   debugfs_remove(jbd_debug);
+   debugfs_remove(jbd_debugfs_dir);
 }
 
 #else
 
-#define create_jbd_debugfs_entry() do {} while (0)
-#define remove_jbd_debugfs_entry() do {} while (0)
+static inline void jbd_create_debugfs_entry(void)
+{
+}
+
+static inline void jbd_remove_debugfs_entry(void)
+{
+}
 
 #endif
 
@@ -1940,7 +1942,7 @@ static int __init journal_init(void)
ret = journal_init_caches();
if (ret != 0)
journal_destroy_caches();
-   create_jbd_debugfs_entry();
+   jbd_create_debugfs_entry();
return ret;
 }
 
@@ -1951,7 +1953,7 @@ static void __exit journal_exit(void)
if (n)
printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
 #endif
-   remove_jbd_debugfs_entry();
+   jbd_remove_debugfs_entry();
journal_destroy_caches();
 }
 
_

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Ext4: Uninitialized Block Groups

2007-09-20 Thread Andrew Morton
On Tue, 18 Sep 2007 17:25:31 -0700
Avantika Mathur [EMAIL PROTECTED] wrote:

 In pass1 of e2fsck, every inode table in the fileystem is scanned and 
 checked, 
 regardless of whether it is in use.  This is this the most time consuming 
 part 
 of the filesystem check.  The unintialized block group feature can greatly 
 reduce e2fsck time by eliminating checking of uninitialized inodes.  
 
 With this feature, there is a a high water mark of used inodes for each block 
 group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
 group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
 of each group descriptor is used to ensure that corruption in the group
 descriptor's bit flags does not cause incorrect operation.

This needed a few fixups due to conflicts with
ext2-ext3-ext4-add-block-bitmap-validation.patch but they were pretty
straightforward.  Please check that the result is OK.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Andrew Morton
On Wed, 19 Sep 2007 12:22:09 -0700
Mingming Cao [EMAIL PROTECTED] wrote:

 Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
 with the rest of kmalloc flag used in the JBD/JBD2 layer.
 
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 
 ---
  fs/jbd/journal.c  |6 +++---
  fs/jbd/revoke.c   |8 
  fs/jbd2/journal.c |6 +++---
  fs/jbd2/revoke.c  |8 
  4 files changed, 14 insertions(+), 14 deletions(-)
 
 Index: linux-2.6.23-rc6/fs/jbd/journal.c
 ===
 --- linux-2.6.23-rc6.orig/fs/jbd/journal.c2007-09-19 11:51:10.0 
 -0700
 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700
 @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
   journal_t *journal;
   int err;
  
 - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 + journal = kmalloc(sizeof(*journal), GFP_NOFS);
   if (!journal)
   goto fail;
   memset(journal, 0, sizeof(*journal));
 @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
   journal-j_blocksize = blocksize;
   n = journal-j_blocksize / sizeof(journal_block_tag_t);
   journal-j_wbufsize = n;
 - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
 + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
   if (!journal-j_wbuf) {
   printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
   __FUNCTION__);
 @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
   /* journal descriptor can store up to n blocks -bzzz */
   n = journal-j_blocksize / sizeof(journal_block_tag_t);
   journal-j_wbufsize = n;
 - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
 + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
   if (!journal-j_wbuf) {
   printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
   __FUNCTION__);
 Index: linux-2.6.23-rc6/fs/jbd/revoke.c
 ===
 --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 
 -0700
 +++ linux-2.6.23-rc6/fs/jbd/revoke.c  2007-09-19 11:52:34.0 -0700
 @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
   while((tmp = 1UL) != 0UL)
   shift++;
  
 - journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
 GFP_KERNEL);
 + journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
 GFP_NOFS);
   if (!journal-j_revoke_table[0])
   return -ENOMEM;
   journal-j_revoke = journal-j_revoke_table[0];
 @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
   journal-j_revoke-hash_shift = shift;
  
   journal-j_revoke-hash_table =
 - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
   if (!journal-j_revoke-hash_table) {
   kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
   journal-j_revoke = NULL;
 @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
   for (tmp = 0; tmp  hash_size; tmp++)
   INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]);
  
 - journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
 GFP_KERNEL);
 + journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
 GFP_NOFS);
   if (!journal-j_revoke_table[1]) {
   kfree(journal-j_revoke_table[0]-hash_table);
   kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
 @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
   journal-j_revoke-hash_shift = shift;
  
   journal-j_revoke-hash_table =
 - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
   if (!journal-j_revoke-hash_table) {
   kfree(journal-j_revoke_table[0]-hash_table);
   kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);

These were all OK using GFP_KERNEL.

GFP_NOFS should only be used when the caller is holding some fs locks which
might cause a deadlock if that caller reentered the fs in -writepage (and
maybe put_inode and such).  That isn't the case in any of the above code,
which is all mount time stuff (I think).

ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
a page locked, is holding i_mutex, etc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Andrew Morton
On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao [EMAIL PROTECTED] wrote:

 JBD: Replace slab allocations with page cache allocations
 
 JBD allocate memory for committed_data and frozen_data from slab. However
 JBD should not pass slab pages down to the block layer. Use page allocator 
 pages instead. This will also prepare JBD for the large blocksize patchset.
 
 
 Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly

__GFP_NOFAIL should only be used when we have no way of recovering
from failure.  The allocation in journal_init_common() (at least)
_can_ recover and hence really shouldn't be using __GFP_NOFAIL.

(Actually, nothing in the kernel should be using __GFP_NOFAIL.  It is 
there as a marker which says we really shouldn't be doing this but
we don't know how to fix it).

So sometime it'd be good if you could review all the __GFP_NOFAILs in
there and see if we can remove some, thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Ext4: Uninitialized Block Groups

2007-09-18 Thread Andrew Morton
On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur [EMAIL PROTECTED] wrote:

 +
 +__u16 crc16(__u16 crc, __u8 const *buffer, size_t len)

And is we really really have to do this, then the ext4-private crc16() 
should have static scope.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-08-17 Thread Andrew Morton
On Fri, 17 Aug 2007 12:36:32 +0400 Alex Tomas [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  Sort-of.  But the per-superpblock, per-inode writeback code is pretty
  careful to avoid livelocks.  The per-inode writeback is a strict single
  linear sweep across the file.  It'll basically write out anything which was
  dirty when it was called.  The per-superblock inode walk isn't as accurate
  as that, becuase of the difficulties of juggling list_heads.  But we're
  slowly working on that, and I suspect it'll be ggod enough for ext3
  purposes already.
 
 I'd say that these are two different mechanism solving different problems:
 1) VFS/MM does periodic updates and uses regular writeback
 2) data=ordered is to avoid metadata pointing to not-written-yet data

VFS/MM can do _much_ more than that!  Look at struct writeback_control.

That code path has many different modes of operation: it is used for
regular pdflush writeback, sync, fsync, throttling, etc.  Probably one of
its modes will be sufficient.  If we want to change ext3's existing
semantics and add an only writeback uninitialised blocks mode then
that'll be pretty straightforward: add more control information to
writeback_control and go for it.

 we can't use regular writeback in commit thread as long as it can fall into
 allocation. so, we'd have to add one more WB mode (btw, i have a patch which
 skips non-allocated blocks in writeback if special WB mode is requested).

yup

 OTOH, the faster we go through data sync part of commit, the better. given
 that lots of inodes can be dirty with no data to sync, it's going to take
 long in some cases. it's especially bad because commit doesn't scale to many
 CPUs.

eh?

 also, why would we need to flush *everything* every 5s? just because ext3 does
 this? sounds strange. if somebody really need this we could add this 
 possibility
 to regular writeback path (making it tunable). but I'd rather prefer to have
 a separate (fast, lightweight, scalable) mechanism to support data=ordered.
 

Yeah, that would make sense, perhaps.

Or just speed the existing stuff up.  iirc the main problem in there is 
unrelated
to data writeback.  There are situations where the running transaction has to 
block
behind metadata writeout which the committing transaction is performing.  I
reluctantly put that in years ago to get us out of a tight spot and it
never got optimised.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-08 Thread Andrew Morton
On Wed, 8 Aug 2007 08:54:35 -0400 Jeff Layton [EMAIL PROTECTED] wrote:

 On Tue, 7 Aug 2007 17:15:01 -0700
 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Mon, 6 Aug 2007 09:54:03 -0400
  Jeff Layton [EMAIL PROTECTED] wrote:
  
  Is there any way in which we can prevent these problems?  Say
  
  - rename something so that unconverted filesystems will reliably fail to
compile?
  
 
 I suppose we could rename the .setattr inode operation to something
 else, but then we'll be stuck with it for at least a while. That seems
 sort of kludgey too...

Sure.  We're changing the required behaviour of .setattr.  Changing its
name is a fine and reasonably reliable way to communicate that fact.

  - leave existing filesystems alone, but add a new
inode_operations.setattr_jeff, which the networked filesytems can
implement, and teach core vfs to call setattr_jeff in preference to
setattr?
  
  Something else?
 
 There's also the approach suggested by Miklos: Add a new inode flag that
 tells notify_change not to convert ATTR_KILL_S* flags into a mode
 change. Basically, allow filesystems to opt out of that behavior. 
 
 I'd definitly pick that over a new inode op. That would also allow the
 default case be for the VFS to continue handling these flags.
 Everything would continue to work but filesystems that need to handle
 these flags differently would be able to do so.
 

We should opt for whatever produces the best end state in the kernel tree. 
ie: if it takes more work and a larger patch to create a better result,
let's go for the better result.  We merge large patches all the time.  We
prefer to smash through, get it right whatever the transient cost.  But
quietly making out-of-tree filesystems less secure is a pretty high cost.

I'm suspecting that adding more flags and some code to test them purely to
minimise the size of the patch and to retain compatibility with the old
.setattr is not a good tradeoff, given that we'd carry the flags and tests
for evermore.

So I'd suggest s/setattr/something_else/g.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-07 Thread Andrew Morton
On Mon, 6 Aug 2007 09:54:03 -0400
Jeff Layton [EMAIL PROTECTED] wrote:

 Apologies for the resend, but the original sending had the date in the
 email header and it caused some of these to bounce...
 
 ( Please consider trimming the Cc list if discussing some aspect of this
 that doesn't concern everyone.)
 
 When an unprivileged process attempts to modify a file that has the
 setuid or setgid bits set, the VFS will attempt to clear these bits. The
 VFS will set the ATTR_KILL_SUID or ATTR_KILL_SGID bits in the ia_valid
 mask, and then call notify_change to clear these bits and set the mode
 accordingly.
 
 With a networked filesystem (NFS in particular but most likely others),
 the client machine may not have credentials that allow for the clearing
 of these bits. In some situations, this can lead to file corruption, or
 to an operation failing outright because the setattr fails.
 
 In this situation, we'd like to just leave the handling of this to
 the server and ignore these bits. The problem is that by the time
 nfs_setattr is called, the VFS has already reinterpreted the ATTR_KILL_*
 bits into a mode change. We can't fix this in the filesystems where
 this is a problem, as doing so would leave us having to second-guess
 what the VFS wants us to do. So we need to change it so that filesystems
 have more flexibility in how to interpret the ATTR_KILL_* bits.
 
 The first patch in the following patchset moves this logic into a helper
 function, and then only calls this helper function for inodes that do
 not have a setattr operation defined. The subsequent patches fix up
 individual filesystem setattr functions to call this helper function.
 
 The upshot of this is that with this change, filesystems that define
 a setattr inode operation are now responsible for handling the ATTR_KILL
 bits as well. They can trivially do so by calling the helper, but they
 must do so.
 
 Some of the follow-on patches may not be strictly necessary, but I
 decided that it was better to take the conservative approach and call
 the helper when I wasn't sure. I've tried to CC the maintainers
 for the individual filesystems as well where I could find them,
 please let me know if there are others who should be informed.
 
 Comments and suggestions appreciated...
 

From a purely practical standpoint: it's a concern that all filesytems need
patching to continue to correctly function after this change.  There might
be filesystems which you missed, and there are out-of-tree filesystems
which won't be updated.

And I think the impact upon the out-of-tree filesystems would be fairly
severe: they quietly and subtly get their secutiry guarantees broken (I
think?)

Is there any way in which we can prevent these problems?  Say

- rename something so that unconverted filesystems will reliably fail to
  compile?

- leave existing filesystems alone, but add a new
  inode_operations.setattr_jeff, which the networked filesytems can
  implement, and teach core vfs to call setattr_jeff in preference to
  setattr?

Something else?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-07 Thread Andrew Morton
On Tue, 07 Aug 2007 20:45:34 -0400
Trond Myklebust [EMAIL PROTECTED] wrote:

  - rename something so that unconverted filesystems will reliably fail to
compile?
  
  - leave existing filesystems alone, but add a new
inode_operations.setattr_jeff, which the networked filesytems can
implement, and teach core vfs to call setattr_jeff in preference to
setattr?
 
 If you really need to know that the filesystem is handling the flags,
 then how about instead having -setattr() return something which
 indicates which flags it actually handled? That is likely to be a far
 more intrusive change, but it is one which is future-proof.

If we change -setattr so that it will return a positive, non-zero value
which the caller can then check and reliably do printk(that filesystem
needs updating) then that addresses my concern, sure.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] ext2: show all mount options

2007-07-23 Thread Andrew Morton
On Mon, 23 Jul 2007 22:12:54 +0200
Miklos Szeredi [EMAIL PROTECTED] wrote:

 From: Miklos Szeredi [EMAIL PROTECTED]
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Could we have changelogs for these patches, please?

ie: what's wrong with the existing code, what change does this patch make?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-23 Thread Andrew Morton
On Tue, 24 Jul 2007 08:47:28 +0800
Fengguang Wu [EMAIL PROTECTED] wrote:

 Subject: convert ill defined log2() to ilog2()
 
 It's *wrong* to have
   #define log2(n) ffz(~(n))
 It should be *reversed*:
   #define log2(n) flz(~(n))
 or
   #define log2(n) fls(n)
 or just use
   ilog2(n) defined in linux/log2.h.
 
 This patch follows the last solution, recommended by Andrew Morton.
 
 //Or are they simply the wrong naming, and is in fact correct in behavior?
 
 Cc: linux-ext4@vger.kernel.org
 Cc: Mingming Cao [EMAIL PROTECTED]
 Cc: Bjorn Helgaas [EMAIL PROTECTED]
 Cc: Chris Ahna [EMAIL PROTECTED]
 Cc: David Mosberger-Tang [EMAIL PROTECTED]
 Cc: Kyle McMartin [EMAIL PROTECTED]
 Signed-off-by: Fengguang Wu [EMAIL PROTECTED]
 ---
  drivers/char/agp/hp-agp.c |9 +++--
  drivers/char/agp/i460-agp.c   |5 ++---
  drivers/char/agp/parisc-agp.c |7 ++-
  fs/ext4/super.c   |6 ++
  4 files changed, 9 insertions(+), 18 deletions(-)

hm, yes, there is a risk that the code was accidentally correct.  Or the
code has only ever dealt with power-of-2 inputs, in which case it happens
to work either way.

David(s) and ext4-people: could we please have a close review of these
changes?

Thanks.

 --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
 +++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
 @@ -14,15 +14,12 @@
  #include linux/pci.h
  #include linux/init.h
  #include linux/agp_backend.h
 +#include linux/log2.h
  
  #include asm/acpi-ext.h
  
  #include agp.h
  
 -#ifndef log2
 -#define log2(x)  ffz(~(x))
 -#endif
 -
  #define HP_ZX1_IOC_OFFSET0x1000  /* ACPI reports SBA, we want IOC */
  
  /* HP ZX1 IOC registers */
 @@ -256,7 +253,7 @@ hp_zx1_configure (void)
   readl(hp-ioc_regs+HP_ZX1_IMASK);
   writel(hp-iova_base|1, hp-ioc_regs+HP_ZX1_IBASE);
   readl(hp-ioc_regs+HP_ZX1_IBASE);
 - writel(hp-iova_base|log2(HP_ZX1_IOVA_SIZE), 
 hp-ioc_regs+HP_ZX1_PCOM);
 + writel(hp-iova_base|ilog2(HP_ZX1_IOVA_SIZE), 
 hp-ioc_regs+HP_ZX1_PCOM);
   readl(hp-ioc_regs+HP_ZX1_PCOM);
   }
  
 @@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem)
  {
   struct _hp_private *hp = hp_private;
  
 - writeq(hp-gart_base | log2(hp-gart_size), hp-ioc_regs+HP_ZX1_PCOM);
 + writeq(hp-gart_base | ilog2(hp-gart_size), hp-ioc_regs+HP_ZX1_PCOM);
   readq(hp-ioc_regs+HP_ZX1_PCOM);
  }
  
 --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c
 +++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c
 @@ -13,6 +13,7 @@
  #include linux/string.h
  #include linux/slab.h
  #include linux/agp_backend.h
 +#include linux/log2.h
  
  #include agp.h
  
 @@ -59,8 +60,6 @@
   */
  #define WR_FLUSH_GATT(index) RD_GATT(index)
  
 -#define log2(x)  ffz(~(x))
 -
  static struct {
   void *gatt; /* ioremap'd GATT area */
  
 @@ -148,7 +147,7 @@ static int i460_fetch_size (void)
* values[i].size.
*/
   values[i].num_entries = (values[i].size  8)  
 (I460_IO_PAGE_SHIFT - 12);
 - values[i].page_order = log2((sizeof(u32)*values[i].num_entries) 
  PAGE_SHIFT);
 + values[i].page_order = 
 ilog2((sizeof(u32)*values[i].num_entries)  PAGE_SHIFT);
   }
  
   for (i = 0; i  agp_bridge-driver-num_aperture_sizes; i++) {
 --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c
 +++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c
 @@ -18,6 +18,7 @@
  #include linux/init.h
  #include linux/klist.h
  #include linux/agp_backend.h
 +#include linux/log2.h
  
  #include asm-parisc/parisc-device.h
  #include asm-parisc/ropes.h
 @@ -27,10 +28,6 @@
  #define DRVNAME  quicksilver
  #define DRVPFX   DRVNAME : 
  
 -#ifndef log2
 -#define log2(x)  ffz(~(x))
 -#endif
 -
  #define AGP8X_MODE_BIT   3
  #define AGP8X_MODE   (1  AGP8X_MODE_BIT)
  
 @@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m
  {
   struct _parisc_agp_info *info = parisc_agp_info;
  
 - writeq(info-gart_base | log2(info-gart_size), 
 info-ioc_regs+IOC_PCOM);
 + writeq(info-gart_base | ilog2(info-gart_size), 
 info-ioc_regs+IOC_PCOM);
   readq(info-ioc_regs+IOC_PCOM); /* flush */
  }
  
 --- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c
 +++ linux-2.6.22-rc6-mm1/fs/ext4/super.c
 @@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct 
   sb-s_flags = s_flags; /* Restore MS_RDONLY status */
  }
  
 -#define log2(n) ffz(~(n))
 -
  /*
   * Maximal file size.  There is a direct, and {,double-,triple-}indirect
   * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
 @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
   sbi-s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
   sbi-s_sbh = bh;
   sbi-s_mount_state = le16_to_cpu(es-s_state);
 - sbi

Re: [PATCH] Faster ext2_clear_inode()

2007-07-19 Thread Andrew Morton
On Mon, 9 Jul 2007 22:00:03 +0200
Jörn Engel [EMAIL PROTECTED] wrote:

 On Mon, 9 July 2007 22:01:48 +0400, Alexey Dobriyan wrote:
  
  Yes. Note that ext2_clear_inode() is referenced from ext2_sops, so even
  empty, it leaves traces in resulting kernel.
 
 Is that your opinion or have you actually measured a difference?
 I strongly suspect that compilers are smart enough to optimize away a
 call to an empty static function.
 

It saves a big 16 bytes of text here.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-15 Thread Andrew Morton
On Sun, 15 Jul 2007 15:02:23 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 On Fri, 2007-07-13 at 14:47 -0700, Zach Brown wrote:
 
  Peter, do you have any interest in seeing how far we can get
  at tracking lock_page()?  I'm not holding my breath, but any little bit
  would probably help.
 
 Would this be a valid report? 
 
 ( /me goes hunt a x86_64 unwinder patch that will apply to this tree.
   These stacktraces are pain )

They are.  lockdep reports are a pain too.  It's still a struggle to
understand wtf they're trying to tell you.  Mabe it's just me.

 ===
 [ INFO: possible circular locking dependency detected ]
 [ 2.6.22-rt3-dirty #34
 ---
 mount/1296 is trying to acquire lock:
  (ei-truncate_mutex){--..}, at: [802f75e5] 
 ext3_get_blocks_handle+0x1a4/0x8f7
 
 but task is already holding lock:
  (lock_page_0){--..}, at: [80267107] 
 generic_file_buffered_write+0x1ee/0x646
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 - #1 (lock_page_0){--..}:
[80251b26] __lock_acquire+0xa72/0xc35
[802520c9] lock_acquire+0x48/0x61
[80265e22] add_to_page_cache_lru+0xe/0x23
[80265d31] add_to_page_cache+0x1de/0x2c1
[80265e22] add_to_page_cache_lru+0xe/0x23
[80266985] find_or_create_page+0x4c/0x73
[802ae716] __getblk+0x118/0x23c
[802afa91] __bread+0x6/0x9c
[802f382d] read_block_bitmap+0x34/0x65
[802f3e1b] ext3_free_blocks_sb+0xec/0x3d4
[802f4131] ext3_free_blocks+0x2e/0x61
[802f82bc] ext3_free_data+0xaa/0xda
[802f8976] ext3_truncate+0x4d2/0x84e
[8026df5a] pagevec_lookup+0x17/0x1e
[8026e7b1] truncate_inode_pages_range+0x1f4/0x323
[802614b4] add_preempt_count+0x14/0xe4
[80304d13] journal_stop+0x1fe/0x21d
[8027661a] vmtruncate+0xa2/0xc0
[802a292b] inode_setattr+0x22/0x10a
[802f9b51] ext3_setattr+0x136/0x18f
[802a2b1d] notify_change+0x10a/0x241
[802a2b3b] notify_change+0x128/0x241
[8028e35e] do_truncate+0x56/0x7f
[8028e369] do_truncate+0x61/0x7f
[80296278] get_write_access+0x3f/0x45
[802973c7] may_open+0x193/0x1af
[80299869] open_namei+0x2cb/0x63e
[8025718b] rt_up_read+0x53/0x5c
[8056da59] do_page_fault+0x479/0x7cc
[8028dce1] do_filp_open+0x1c/0x38
[8056a4f9] rt_spin_unlock+0x17/0x47
[8028da05] get_unused_fd+0xf9/0x107
[8028dd45] do_sys_open+0x48/0xd5
[8020950e] system_call+0x7e/0x83
[] 0x

I guess we're doing lock_page() against a blockdev pagecache page here
while holding truncate_mutex against some S_ISREG file.

 - #0 (ei-truncate_mutex){--..}:
[802503b9] print_circular_bug_header+0xcc/0xd3
[80251a22] __lock_acquire+0x96e/0xc35
[802520c9] lock_acquire+0x48/0x61
[802f75e5] ext3_get_blocks_handle+0x1a4/0x8f7
[8056a6d4] _mutex_lock+0x26/0x52
[802f75e5] ext3_get_blocks_handle+0x1a4/0x8f7
[802504b2] find_usage_backwards+0xb0/0xd9
[802504b2] find_usage_backwards+0xb0/0xd9
[80250d7c] debug_check_no_locks_freed+0x11d/0x129
[80250c33] trace_hardirqs_on_caller+0x115/0x138
[8024efdc] lockdep_init_map+0xac/0x41f
[802614b4] add_preempt_count+0x14/0xe4
[802f8035] ext3_get_block+0xc2/0xe4
[802aeed3] __block_prepare_write+0x195/0x442
[802f7f73] ext3_get_block+0x0/0xe4
[802af19a] block_prepare_write+0x1a/0x25
[802f93e9] ext3_prepare_write+0xb2/0x17b
[802671b1] generic_file_buffered_write+0x298/0x646
[8023944e] current_fs_time+0x3b/0x40
[802614b4] add_preempt_count+0x14/0xe4
[802678ae] __generic_file_aio_write_nolock+0x34f/0x3b9
[8024ed3d] put_lock_stats+0xe/0x2a
[80267964] generic_file_aio_write+0x4c/0xc4
[80267979] generic_file_aio_write+0x61/0xc4
[802fcf18] ext3_orphan_del+0x53/0x19f
[802f5768] ext3_file_write+0x1c/0x9d
[8028ef31] do_sync_write+0xcc/0x10f
[80246f9c] autoremove_wake_function+0x0/0x2e
[8024ecfe] get_lock_stats+0xe/0x3f
[8024ed9a] lock_release_holdtime+0x41/0x4f
[8024ed3d] put_lock_stats+0xe/0x2a
[8028dfb1] sys_fchmod+0xa3/0xbd
[8056a717] 

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-15 Thread Andrew Morton
On Sun, 15 Jul 2007 21:21:03 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Shows the current stacktrace where we violate the previously established
 locking order.

yup, but the lock_page() which we did inside truncate_mutex was a 
lock_page() against a different address_space: the blockdev mapping.

So this is OK - we'll never take truncate_mutex against the blockdev
mapping (it doesn't have one, for a start ;))

This is similar to the quite common case where we take inode A's
i_mutex inside inode B's i_mutex, which needs special lockdep annotations.

I think.  I haven't looked into this in detail.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-13 Thread Andrew Morton
On Fri, 13 Jul 2007 16:00:48 +0530 Kalpak Shah [EMAIL PROTECTED] wrote:


   - if (inode-i_nlink = EXT4_LINK_MAX)
   + if (EXT4_DIR_LINK_MAX(inode))
 return -EMLINK;
  
  argh.  WHY_IS_EXT4_FULL_OF_UPPER_CASE_MACROS_WHICH_COULD_BE_IMPLEMENTED
  as_lower_case_inlines?  Sigh.  It's all the old-timers, I guess.
  
  EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice.
 
 #define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir)  (dir)-i_nlink = 
 EXT4_LINK_MAX)
 
 This just checks if directory has hash indexing in which case we need not 
 worry about EXT4_LINK_MAX subdir limit. If directory is not hash indexed then 
 we will need to enforce a max subdir limit. 
 
 Sorry, I didn't understand what is the problem with this macro?

Macros should never evaluate their argument more than once, because if they
do they will misbehave when someone passes them an
expression-with-side-effects:

struct inode *p = q;

EXT4_DIR_LINK_MAX(p++);

one expects `p' to have the value q+1 here.  But it might be q+2.

and

EXT4_DIR_LINK_MAX(some_function());

might cause some_function() to be called twice.


This is one of the many problems which gets fixed when we write code in C
rather than in cpp.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks

2007-07-12 Thread Andrew Morton
On Thu, 12 Jul 2007 08:57:51 -0500 Dave Kleikamp [EMAIL PROTECTED] wrote:

 On Thu, 2007-07-12 at 12:38 +0100, Andy Whitcroft wrote:
  Andrew Morton wrote:
   +if (ext4_ext_check_header(inode, 
   ext_block_hdr(bh),
   +depth - i - 1)) 
   {
   +err = -EIO;
   +break;
   +}
   +path[i+1].p_bh = bh;
   
   Really that should have been i + 1.  checkpatch misses this.  It seems 
   to
   be missing more stuff that it used to lately.
  
  This one is difficult.  The rules up to now have been consistent spacing
  is required on both sides of mathematics operators.  I personally like
  spaces always, but we do tend to use them without spaces too where the
  binding is effectivly part of the value -- the classic case is something
  like:
  
  pfn  MAX_ORDER-1
  
  In allowing that sort of thing, we implictly allow the one you note
  above.  We have tried to be overly annoying on these things, and so the
  check is consistancy, spaces both or neither.  We could be stricter.
 
 I personally think stricter is better.  An occasionally false-positive
 isn't going to hurt anyone.  (Well, maybe the checkpatch.pl maintainers
 will get nagged.)  It at least will cause the developer to look at the
 line of code in question and make a conscious decision to leave it as it
 is.  I'm assuming that upstream maintainers use checkpatch.pl with some
 constraint, and don't throw every patch that produces a warning back at
 the submitter.
 

I'm in two minds.  Missing-the-spaces is pretty damn common and is sometimes
a reasonable way of saving quite a lot of horizontal space.  I spose we could
take it out again if it's causing problems.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes

2007-07-11 Thread Andrew Morton
On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
 Date: Sun, 01 Jul 2007 03:38:51 -0400
 Sender: [EMAIL PROTECTED]
 Organization: IBM Linux Technology Center
 X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) 
 
 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent compilation fixes

I hope this patch series will hit git with nice titles like ext4: extent
compilation fixes and not funny ones like
Morecleanups:ext4_extent_compilation_fixes.

 Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions.

conversions.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-11 Thread Andrew Morton
On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent macros cleanup
 
 - Replace math equation to it's macro equivalent

s/it's/its/;)

 - make ext4_ext_grow_indepth() indexes/leaf correct

hm, what was wrong with it?

 Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
 Acked-by: Alex Tomas [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 ---
  fs/ext4/extents.c |   11 +++
  1 files changed, 7 insertions(+), 4 deletions(-)
 
 diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
 index 12fe3d7..1fd00ac 100644
 --- a/fs/ext4/extents.c
 +++ b/fs/ext4/extents.c
 @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct 
 ext4_ext_path *path, int bloc
   ext_debug(binsearch for %d(idx):  , block);
 
   l = EXT_FIRST_INDEX(eh) + 1;
 - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_INDEX(eh);
   while (l = r) {
   m = l + (r - l) / 2;
   if (block  le32_to_cpu(m-ei_block))
 @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct 
 ext4_ext_path *path, int block)
   ext_debug(binsearch for %d:  , block);
 
   l = EXT_FIRST_EXTENT(eh) + 1;
 - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_EXTENT(eh);
 
   while (l = r) {
   m = l + (r - l) / 2;
 @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
 struct inode *inode,
   curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
   curp-p_hdr-eh_entries = cpu_to_le16(1);
   curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
 - /* FIXME: it works, but actually path[0] can be index */
 - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
 + 
 + if (path[0].p_hdr-eh_depth)
 +   curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
 + else
 +   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;

whitespace bustage there.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andrew Morton
On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
  On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
  
   David Chinneer pointed that we need to journal the version number
   updates together with the operations that causes the change of the inode
   version number, in order to survive server crashes so clients won't see
   the counter go backwards.
   
   So increment i_version in fs code is probably the place to ensure the
   inode version changes are stored to disk. It's seems update the ext4
   inode version in every ext4_mark_inode_dirty() is the easiest way.
  
  That still makes us dependent upon _something_ changing the inode.  For
  overwrites the only something is mtime.
  
  If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
  I don't think we do) then I guess we'll need new code in or around
  file_update_time() to do this.
 
 do you mean mark inode dirty all the times in file_update_time()? Not
 sure about the overhead for ext3/4.
 

hm, I hadn't thought about it in any detail.

Maybe something like

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -1238,6 +1238,11 @@ void file_update_time(struct file *file)
sync_it = 1;
}
 
+   if (IS_I_VERSION_64(inode)) {
+   inode_inc_iversion(inode);  /* Takes i_lock on 32-bit */
+   sync_it = 1;
+   }
+
if (sync_it)
mark_inode_dirty_sync(inode);
 }
_

?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andrew Morton
On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

-   /* AKPM: buglet - add `i' to tmp! */
   
   Damn.  After, what, seven years, someone actually fixed it?
   
for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
+   struct commit_header *tmp =
+   (struct commit_header *)(bh-b_data + i);
tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp-h_sequence = 
cpu_to_be32(commit_transaction-t_tid);
+
+   if (JBD2_HAS_COMPAT_FEATURE(journal,
+   
JBD2_FEATURE_COMPAT_CHECKSUM)) {
+   tmp-h_chksum_type  = JBD2_CRC32_CHKSUM;
+   tmp-h_chksum_size  = 
JBD2_CRC32_CHKSUM_SIZE;
+   tmp-h_chksum[0]= 
cpu_to_be32(crc32_sum);
+   }
}
   
   And in doing so, changed the on-disk format of the journal commit blocks.
   
   Surely this was worth a mention in the changelog, if not a standalone 
   patch?
   
   I don't think this is worth doing, really.  Why not just leave the format
   as it was, take the loop out and run this code once rather than eight
   times?
 
 Well, we aren't using the rest of the commit block in any case.  I think
 the original intention was that we'd get 8 copies of the commit block so
 we would be sure to get a good one.
 
 I don't know whether we'd rather have 8 copies of the commit block, or
 more potential to expand the commit block?  I don't personally have any
 preference, since the checksum should be a more robust way of checking
 validity than having multiple copies, so we may as well remove the loop
 and stick with a single copy for now.

We've never altered any commit block sectors apart from the zeroeth one
(eight times) due to the above bug.  So I'd suggest that we should formalise
the old bug and leave the format as-is.  That'll leave lots of space spare in
the commit block.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andrew Morton
On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

 On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
 err = ext4_reserve_inode_write(handle, inode, iloc);
   + if (EXT4_I(inode)-i_extra_isize 
   + EXT4_SB(inode-i_sb)-s_want_extra_isize 
   + !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
   + /* We need extra buffer credits since we may write into EA block
   +  * with this same handle */
   + if ((jbd2_journal_extend(handle,
   +  EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
   + ret = ext4_expand_extra_isize(inode,
   + EXT4_SB(inode-i_sb)-s_want_extra_isize,
   + iloc, handle);
   + if (ret) {
   + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
   + if (!expand_message) {
   + ext4_warning(inode-i_sb, __FUNCTION__,
   + Unable to expand inode %lu. Delete
   +  some EAs or run e2fsck.,
   + inode-i_ino);
   + expand_message = 1;
   + }
   + }
   + }
   + }
  
  Maybe that message should come out once per mount rather than once per boot
  (or once per modprobe)?
 
 Probably true.
 
  What are the consequences of a jbd2_journal_extend() failure in here?
 
 Not fatal, just like every user of i_extra_isize.  If the inode isn't a
 large inode, or it can't be expanded then there will be a minor loss of
 functionality on that inode.  If the i_extra_isize is critical, then
 the sysadmin will have run e2fsck to force s_min_extra_isize large enough.
 
 Note that this is only applicable for filesystems which are upgraded.  For
 new inodes (i.e. all inodes that exist in the filesystem if it was always
 run on a kernel with the currently understood extra fields) then this will
 never be invoked (until such a time new extra fields are added).

I'd suggest that we get a comment in the code explaining this: this
unchecked error does rather stand out.

   + if (EXT4_I(inode)-i_file_acl) {
   + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + if (!bh)
   + goto cleanup;
   + if (ext4_xattr_check_block(bh)) {
   + ext4_error(inode-i_sb, __FUNCTION__,
   + inode %lu: bad block %llu, inode-i_ino,
   + EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + goto cleanup;
   + }
   + base = BHDR(bh);
   + first = BFIRST(bh);
   + end = bh-b_data + bh-b_size;
   + min_offs = end - base;
   + free = ext4_xattr_free_space(first, min_offs, base,
   +  total_blk);
   + if (free  new_extra_isize) {
   + if (!tried_min_extra_isize  s_min_extra_isize) {
   + tried_min_extra_isize++;
   + new_extra_isize = s_min_extra_isize;
  
  Aren't we missing a brelse(bh) here?
 
 Seems likely, yes.

OK - could we get a positive ack from someone indicating that this will get
looked at?  Because I am about to forget about it.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:35:48 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 Add a mount option to turn off extents.

Please update the changelog to describe the reason for making this change.


 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 ---
 Index: linux-2.6.22-rc4/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:18.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:02:22.0 -0700
 @@ -725,7 +725,7 @@
   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
   Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
 - Opt_grpquota, Opt_extents,
 + Opt_grpquota, Opt_extents, Opt_noextents,
  };
  
  static match_table_t tokens = {
 @@ -776,6 +776,7 @@
   {Opt_usrquota, usrquota},
   {Opt_barrier, barrier=%u},
   {Opt_extents, extents},
 + {Opt_noextents, noextents},
   {Opt_err, NULL},
   {Opt_resize, resize},
  };
 @@ -,6 +1112,9 @@
   case Opt_extents:
   set_opt (sbi-s_mount_opt, EXTENTS);
   break;
 + case Opt_noextents:
 + clear_opt (sbi-s_mount_opt, EXTENTS);
 + break;
   default:
   printk (KERN_ERR
   EXT4-fs: Unrecognized mount option \%s\ 
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:01 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 Turn on extents feature by default in ext4 filesystem. User could use
 -o noextents to turn it off.
 

Oh, there you go.

 
 Index: linux-2.6.22-rc4/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:22.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:03:09.0 -0700
 @@ -1546,6 +1546,12 @@
  
   set_opt(sbi-s_mount_opt, RESERVATION);
  
 + /*
 +  * turn on extents feature by default in ext4 filesystem
 +  * User -o noextents to turn it off
 +  */
 + set_opt (sbi-s_mount_opt, EXTENTS);
 +

Broken coding style.

Please feed all the ext4 patches through scripts/checkpatch.pl (preferably
version 0.07 - see Andy's patch on lkml) and then consider addressing the
(quite large) number of mistakes which are detected.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 1/5] cleanups: Propagate some i_flags to disk

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:12 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
 ext4-specific i_flags. Hence, when someone sets these flags via a different
 interface than ioctl, they are stored correctly.
 

This changelog is inadequate.  I had to hunt down the equivalent ext3
patch's changelog to understand the reasons for this change.  Please update
this patch's changelog using the below: 

ext3: copy i_flags to inode flags on write

A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc.  from
i_flags to EXT3_I(inode)-i_flags when inode is written to disk.  The same
thing is done on GETFLAGS ioctl.

Quota code changes these flags on quota files (to make it harder for
sysadmin to screw himself) and these changes were not correctly propagated
into the filesystem (especially, lsattr did not show them and users were
wondering...).

Propagate flags such as S_APPEND, S_IMMUTABLE, etc.  from i_flags into
ext3-specific i_flags.  Hence, when someone sets these flags via a
different interface than ioctl, they are stored correctly.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:48 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

  On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
   The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
   create_proc_entry() does not do lookups on file names with more that one
   directory deep.  This causes the entry creation to fail and hence, no proc
   file is created.  This patch moves the file to /proc/jbd2-degug.
   
   The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require
   some minor alterations to the jbd-stats patch.
  
  I don't think we really want to be adding top-level files in /proc.
  What about using the debugfs filesystem (not to be confused with
  the e2fsprogs 'debugfs' command)?
 
 How about this then?  Moved the file to use debugfs as well as having
 the nice effect of removing more lines than what it adds.
 
 Signed-off-by: Jose R. Santos [EMAIL PROTECTED]

Please clean up the changelog.

The changelog should include information about the location and the content
of these debugfs files.  it should provide any instructions which users
will need to be able to create and use those files.

Alternatively (and preferably) do this via an update to
Documentation/filesystems/ext4.txt.

  fs/jbd2/journal.c|   6220 +42 -0 !
  include/linux/jbd2.h |21 + 1 - 0 !
  2 files changed, 21 insertions(+), 43 deletions(-)

Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm.

Apart from the lack of testing and review which this causes, it means I
can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
I squint at the diff, but that's harder when the diff wasn't prepared with
`diff -p'.  Oh well.


 Index: linux-2.6.22-rc4/fs/jbd2/journal.c
 ===
 --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 16:16:18.0 
 -0700
 +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
 -0700
 @@ -35,6 +35,7 @@
  #include linux/kthread.h
  #include linux/poison.h
  #include linux/proc_fs.h
 +#include linux/debugfs.h
  
  #include asm/uaccess.h
  #include asm/page.h
 @@ -1954,60 +1955,37 @@
   * /proc tunables
   */
  #if defined(CONFIG_JBD2_DEBUG)
 -int jbd2_journal_enable_debug;
 +u16 jbd2_journal_enable_debug;
  EXPORT_SYMBOL(jbd2_journal_enable_debug);
  #endif
  
 -#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
 +#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)

Has this been compile-tested with CONFIG_DEBUGFS=n?

  
 -#define create_jbd_proc_entry() do {} while (0)
 -#define jbd2_remove_jbd_proc_entry() do {} while (0)
 +#define jbd2_create_debugfs_entry() do {} while (0)
 +#define jbd2_remove_debugfs_entry() do {} while (0)

I suggest that these be converted to (preferable) inline functions while
you're there.

  #endif
  
 @@ -2067,7 +2045,7 @@
   ret = journal_init_caches();
   if (ret != 0)
   jbd2_journal_destroy_caches();
 - create_jbd_proc_entry();
 + jbd2_create_debugfs_entry();
   return ret;
  }
  
 @@ -2078,7 +2056,7 @@
   if (n)
   printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
  #endif
 - jbd2_remove_jbd_proc_entry();
 + jbd2_remove_debugfs_entry();
   jbd2_journal_destroy_caches();
  }
  
 Index: linux-2.6.22-rc4/include/linux/jbd2.h
 ===
 --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
 16:16:18.0 -0700
 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
 -0700
 @@ -57,7 +57,7 @@
   * CONFIG_JBD2_DEBUG is on.
   */
  #define JBD_EXPENSIVE_CHECKING

JBD2?

 -extern int jbd2_journal_enable_debug;
 +extern u16 jbd2_journal_enable_debug;

Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
going to do that.


Shoudln't all this debug info be a per-superblock thing rather than
kernel-wide?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:56 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch is a spinoff of the old nanosecond patches.

I don't know what the old nanosecond patches are.  A link to a suitable
changlog for those patches would do in a pinch.  Preferable would be to
write a proper changelog for this patch.

 It includes some cleanups and addition of a creation timestamp. The
 EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
 s_{min, want}_extra_isize fields in struct ext3_super_block.
 
 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
 Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 
 Index: linux-2.6.22-rc4/fs/ext4/ialloc.c

Please include diffstat output when preparing patches.

 +
 +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)\
 + ((offsetof(typeof(*ext4_inode), field) +\
 +   sizeof((ext4_inode)-field))  \
 + = (EXT4_GOOD_OLD_INODE_SIZE +  \
 + (einode)-i_extra_isize))   \

Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers
under what circumstances something will not fit in an inode and what the
consequences of this are.

 +static inline __le32 ext4_encode_extra_time(struct timespec *time)
 +{
 +   return cpu_to_le32((sizeof(time-tv_sec)  4 ?
 +time-tv_sec  32 : 0) |
 +((time-tv_nsec  2)  EXT4_NSEC_MASK));
 +}
 +
 +static inline void ext4_decode_extra_time(struct timespec *time, __le32 
 extra)
 +{
 +   if (sizeof(time-tv_sec)  4)
 +time-tv_sec |= (__u64)(le32_to_cpu(extra)  EXT4_EPOCH_MASK)
 + 32;
 +   time-tv_nsec = (le32_to_cpu(extra)  EXT4_NSEC_MASK)  2;
 +}

Consider uninlining these functions.

 +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode)
\
 +do {\
 + (raw_inode)-xtime = cpu_to_le32((inode)-xtime.tv_sec);   \
 + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
 + (raw_inode)-xtime ## _extra = \
 + ext4_encode_extra_time((inode)-xtime);   \
 +} while (0)
 +
 +#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode)  
\
 +do {\
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
 + (raw_inode)-xtime = cpu_to_le32((einode)-xtime.tv_sec);  \
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
 + (raw_inode)-xtime ## _extra = \
 + ext4_encode_extra_time((einode)-xtime);  \
 +} while (0)
 +
 +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)
\
 +do {\
 + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);   \
 + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
 + ext4_decode_extra_time((inode)-xtime,\
 +raw_inode-xtime ## _extra);\
 +} while (0)
 +
 +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)  
\
 +do {\
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
 + (einode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);  \
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
 + ext4_decode_extra_time((einode)-xtime,   \
 +raw_inode-xtime ## _extra);\
 +} while (0)

Ugly.  I expect these could be implemented as plain old C functions. 
Caller could pass in the address of the ext4_inode field which the function
is to operate upon.

  #if defined(__KERNEL__) || defined(__linux__)

(What's the __linux__ for?)

  #define i_reserved1  osd1.linux1.l_i_reserved1
  #define i_frag   osd2.linux2.l_i_frag
 @@ -539,6 +603,13 @@
   return container_of(inode, struct ext4_inode_info, vfs_inode);
  }
  
 +static inline struct timespec ext4_current_time(struct inode *inode)
 +{
 + return (inode-i_sb-s_time_gran  NSEC_PER_SEC) ?
 + current_fs_time(inode-i_sb) : CURRENT_TIME_SEC;
 +}

Now, I've forgotten how this works.  Remind me, please.  Can ext4
filesystems ever have one-second timestamp granularity?  If so, how does
one cause that to come about?

 --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_i.h   2007-06-11 
 17:22:15.0 -0700
 +++ 

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:37:04 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch converts the 32-bit i_version in the generic inode to a 64-bit
 i_version field.
 

That's obvious from the patch.  But what was the reason for making this
(unrelated to ext4) change?

Please update the changelog for this.

 Index: linux-2.6.21/include/linux/fs.h
 ===
 --- linux-2.6.21.orig/include/linux/fs.h
 +++ linux-2.6.21/include/linux/fs.h
 @@ -549,7 +549,7 @@ struct inode {
   uid_t   i_uid;
   gid_t   i_gid;
   dev_t   i_rdev;
 - unsigned long   i_version;
 + u64 i_version;
   loff_t  i_size;
  #ifdef __NEED_I_SIZE_ORDERED
   seqcount_t  i_size_seqcount;

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:37:16 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used 
 for 64-bit inode versions. This field will store the higher 32 bits of the 
 version, while Jean Noel's patch has added support to store the lower 32-bits 
 in osd1.linux1.l_i_version.

Please wordwrap this changelog entry to less than 80 columns.  Well, less
than 258, anyway ;)

 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
 ---
 Index: linux-2.6.21/include/linux/ext4_fs.h
 ===
 --- linux-2.6.21.orig/include/linux/ext4_fs.h
 +++ linux-2.6.21/include/linux/ext4_fs.h
 @@ -342,6 +342,7 @@ struct ext4_inode {
   __le32  i_atime_extra;  /* extra Access time  (nsec  2 | epoch) */
   __le32  i_crtime;   /* File Creation time */
   __le32  i_crtime_extra; /* extra FileCreationtime (nsec  2 | epoch) */
 + __le32  i_version_hi;   /* high 32 bits for 64-bit version */
  };

Aren't there forward- backward-compatibility issues here?  How does the
filesystem driver work out whether this field is present and valid?

The changelog should describe this design issue.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:22 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 with the patch all headers are checked. the code should become
 more resistant to on-disk corruptions. needless BUG_ON() have
 been removed. please, review for inclusion.
 
 ...

 @@ -269,6 +239,70 @@
   return size;
  }
  
 +static inline int
 +ext4_ext_max_entries(struct inode *inode, int depth)

Please remove the `inline'.

`inline' is usually wrong.  It is wrong here.  If this function has a
single callsite (it has) then the compiler will inline it.  If the function
later gains more callsites then the compiler will know (correctly) not to
inline it.

We hope.  If we find the compiler still inlines a function as large as this
one then we'd need to use noinline and complain at the gcc developers.

 +{
 + int max;
 +
 + if (depth == ext_depth(inode)) {
 + if (depth == 0)
 + max = ext4_ext_space_root(inode);
 + else
 + max = ext4_ext_space_root_idx(inode);
 + } else {
 + if (depth == 0)
 + max = ext4_ext_space_block(inode);
 + else
 + max = ext4_ext_space_block_idx(inode);
 + }
 +
 + return max;
 +}
 +
 +static int __ext4_ext_check_header(const char *function, struct inode *inode,
 + struct ext4_extent_header *eh,
 + int depth)
 +{
 + const char *error_msg = NULL;

Unneeded initialisation.

 + int max = 0;
 +
 + if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) {
 + error_msg = invalid magic;
 + goto corrupted;
 + }
 + if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) {
 + error_msg = unexpected eh_depth;
 + goto corrupted;
 + }
 + if (unlikely(eh-eh_max == 0)) {
 + error_msg = invalid eh_max;
 + goto corrupted;
 + }
 + max = ext4_ext_max_entries(inode, depth);
 + if (unlikely(le16_to_cpu(eh-eh_max)  max)) {
 + error_msg = too large eh_max;
 + goto corrupted;
 + }
 + if (unlikely(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max))) {
 + error_msg = invalid eh_entries;
 + goto corrupted;
 + }
 + return 0;
 +
 +corrupted:
 + ext4_error(inode-i_sb, function,
 + bad header in inode #%lu: %s - magic %x, 
 + entries %u, max %u(%u), depth %u(%u),
 + inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic),
 + le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max),
 + max, le16_to_cpu(eh-eh_depth), depth);
 +
 + return -EIO;
 +}
 +

 ...

 + i = depth = ext_depth(inode);


Mulitple assignments like this are somewhat unpopular from the coding-style
POV.  

We have a local variable called `i' which is not used as a simple counter
and which has no explanatory comment.  This is plain bad.  Please find a
better name for this variable.  Review the code for other such lack of
clarity - I'm sure there's more.


 - BUG_ON(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max));
 - BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC);

Yeah, this patch improves things a lot.

How well-tested is it?  Have any of these errors actually been triggered?

   path[i].p_hdr = ext_block_hdr(path[i].p_bh);
 - if (ext4_ext_check_header(__FUNCTION__, inode,
 - path[i].p_hdr)) {
 - err = -EIO;
 - goto out;
 - }
   }
  
 - BUG_ON(le16_to_cpu(path[i].p_hdr-eh_entries)
 - le16_to_cpu(path[i].p_hdr-eh_max));
 - BUG_ON(path[i].p_hdr-eh_magic != EXT4_EXT_MAGIC);
 -
   if (!path[i].p_idx) {
   /* this level hasn't been touched yet */
   path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr);
 @@ -1873,17 +1890,24 @@
   i, EXT_FIRST_INDEX(path[i].p_hdr),
   path[i].p_idx);
   if (ext4_ext_more_to_rm(path + i)) {
 + struct buffer_head *bh;
   /* go to the next level */
   ext_debug(move to level %d (block %llu)\n,
 i + 1, idx_pblock(path[i].p_idx));
   memset(path + i + 1, 0, sizeof(*path));
 - path[i+1].p_bh =
 - sb_bread(sb, idx_pblock(path[i].p_idx));
 - if (!path[i+1].p_bh) {
 + bh = sb_bread(sb, idx_pblock(path[i].p_idx));
 + if (!bh) {
   /* should we reset i_size? */
   err = -EIO;
   break;
   }
 + BUG_ON(i + 1  depth);

Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:37:36 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch adds 64-bit inode version support to ext4. The lower 32 bits
 are stored in the osd1.linux1.l_i_version field while the high 32 bits
 are stored in the i_version_hi field newly created in the ext4_inode.

So reading the code here does serve to answer the question I raised against
the earlier patch.  A bit.

I'd have thought that this patch and the one which adds i_version_hi should
be folded into a single diff?

 
 Index: linux-2.6.21/fs/ext4/inode.c
 ===
 --- linux-2.6.21.orig/fs/ext4/inode.c
 +++ linux-2.6.21/fs/ext4/inode.c
 @@ -2709,6 +2709,13 @@ void ext4_read_inode(struct inode * inod
   EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
   EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
 + inode-i_version = le32_to_cpu(raw_inode-i_disk_version);
 + if (EXT4_INODE_SIZE(inode-i_sb)  EXT4_GOOD_OLD_INODE_SIZE) {
 + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
 + inode-i_version |=
 + (__u64)(le32_to_cpu(raw_inode-i_version_hi))  32;

checks the precedence of (type) versus 

OK

 + }

I don't quite see how the above two tests are sufficient to unambiguously
determine that the i_version_hi field is present on-disk.

I guess we're implicitly assuming that if the on-disk inode is big enough
then it _must_ have i_version_hi in there?  If so, why is the comparison
with EXT4_GOOD_OLD_INODE_SIZE needed?

Some description of the overall approach to inode version identification
would be helpful here.

   if (S_ISREG(inode-i_mode)) {
   inode-i_op = ext4_file_inode_operations;
   inode-i_fop = ext4_file_operations;
 @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t
   } else for (block = 0; block  EXT4_N_BLOCKS; block++)
   raw_inode-i_block[block] = ei-i_data[block];
 
 - if (ei-i_extra_isize)
 + raw_inode-i_disk_version = cpu_to_le32(inode-i_version);
 + if (ei-i_extra_isize) {
 + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) {

There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here...

 + raw_inode-i_version_hi =
 + cpu_to_le32(inode-i_version  32);
 + }
   raw_inode-i_extra_isize = cpu_to_le16(ei-i_extra_isize);
 + }
 

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:37:45 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch is on top of i_version_update_vfs.
 The i_version field of the inode is set on inode creation and incremented when
 the inode is being modified.
 

Again, I don't think I've ever seen this patch before.  It is at least a
month old.

 
 Index: linux-2.6.22-rc4/fs/ext4/ialloc.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/ialloc.c2007-06-13 17:16:28.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/ialloc.c 2007-06-13 17:24:45.0 -0700
 @@ -565,6 +565,7 @@ got:
   inode-i_blocks = 0;
   inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime =
  ext4_current_time(inode);
 + inode-i_version = 1;
  
   memset(ei-i_data, 0, sizeof(ei-i_data));
   ei-i_dir_start_lookup = 0;
 Index: linux-2.6.22-rc4/fs/ext4/inode.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-13 17:21:29.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/inode.c  2007-06-13 17:24:45.0 -0700
 @@ -3082,6 +3082,7 @@ int ext4_mark_iloc_dirty(handle_t *handl
  {
   int err = 0;
  
 + inode-i_version++;
   /* the do_update_inode consumes one bh-b_count */
   get_bh(iloc-bh);
  
 Index: linux-2.6.22-rc4/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-13 17:19:11.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-13 17:24:45.0 -0700
 @@ -2846,8 +2846,8 @@ out:
   i_size_write(inode, off+len-towrite);
   EXT4_I(inode)-i_disksize = inode-i_size;
   }
 - inode-i_version++;
   inode-i_mtime = inode-i_ctime = CURRENT_TIME;
 + inode-i_version = 1;
   ext4_mark_inode_dirty(handle, inode);
   mutex_unlock(inode-i_mutex);
   return len - towrite;

ext4 already has code to update i_version on directories.  Here we appear
to be udpating it on regular files?

But for what reason?  The changelog doesn't say?

AFAICT the code forgets to update i_version during file overwrites (unless
the overwrite was over a hole).  But without a decent description of this
change I cannot tell whether this was a bug.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:38:01 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch is on top of the nanosecond timestamp and i_version_hi
 patches. 

This sort of information isn't needed (or desired) when this patch hits the
git tree.  Please ensure that things like this are cleaned up before the
patches go to Linus.

 We need to make sure that existing filesystems can also avail the new
 fields that have been added to the inode. We use s_want_extra_isize and
 s_min_extra_isize to decide by how much we should expand the inode. If
 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand by
 max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) -
 EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question
 about whether users should be able to set s_*_extra_isize smaller than
 the known fields or not. 
 
 This patch also adds the functionality to expand inodes to include the
 newly added fields. We start by trying to expand by s_want_extra_isize
 bytes and if its fails we try to expand by s_min_extra_isize bytes. This
 is done by changing the i_extra_isize if enough space is available in
 the inode and no EAs are present. If EAs are present and there is enough
 space in the inode then the EAs in the inode are shifted to make space.
 If enough space is not available in the inode due to the EAs then 1 or
 more EAs are shifted to the external EA block. In the worst case when
 even the external EA block does not have enough space we inform the user
 that some EA would need to be deleted or s_min_extra_isize would have to
 be reduced.
 
 This would be online expansion of inodes. I am also working on adding an
 expand_inodes option to e2fsck which will expand all the used inodes.
 
 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 
 Index: linux-2.6.22-rc4/fs/ext4/inode.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-14 17:32:27.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/inode.c  2007-06-14 17:32:41.0 -0700
 @@ -3120,6 +3120,40 @@ ext4_reserve_inode_write(handle_t *handl
  }
  
  /*
 + * Expand an inode by new_extra_isize bytes.
 + * Returns 0 on success or negative error number on failure.
 + */
 +int ext4_expand_extra_isize(struct inode *inode, unsigned int 
 new_extra_isize,
 + struct ext4_iloc iloc, handle_t *handle)
 +{
 + struct ext4_inode *raw_inode;
 + struct ext4_xattr_ibody_header *header;
 + struct ext4_xattr_entry *entry;
 +
 + if (EXT4_I(inode)-i_extra_isize = new_extra_isize) {
 + return 0;
 + }

This patch has lots of unneeded braces in it.  checkpatch appears to catch
them.


 + raw_inode = ext4_raw_inode(iloc);
 +
 + header = IHDR(inode, raw_inode);
 + entry = IFIRST(header);
 +
 + /* No extended attributes present */
 + if (!(EXT4_I(inode)-i_state  EXT4_STATE_XATTR) ||
 + header-h_magic != cpu_to_le32(EXT4_XATTR_MAGIC)) {
 + memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0,
 + new_extra_isize);
 + EXT4_I(inode)-i_extra_isize = new_extra_isize;
 + return 0;
 + }
 +
 + /* try to expand with EA present */
 + return ext4_expand_extra_isize_ea(inode, new_extra_isize,
 + raw_inode, handle);
 +}
 +
 +/*
   * What we do here is to mark the in-core inode as clean with respect to 
 inode
   * dirtiness (it may still be data-dirty).
   * This means that the in-core inode may be reaped by prune_icache
 @@ -3143,10 +3177,33 @@ ext4_reserve_inode_write(handle_t *handl
  int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
  {
   struct ext4_iloc iloc;
 - int err;
 + int err, ret;
 + static int expand_message;
  
   might_sleep();
   err = ext4_reserve_inode_write(handle, inode, iloc);
 + if (EXT4_I(inode)-i_extra_isize 
 + EXT4_SB(inode-i_sb)-s_want_extra_isize 
 + !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
 + /* We need extra buffer credits since we may write into EA block
 +  * with this same handle */
 + if ((jbd2_journal_extend(handle,
 +  EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
 + ret = ext4_expand_extra_isize(inode,
 + EXT4_SB(inode-i_sb)-s_want_extra_isize,
 + iloc, handle);
 + if (ret) {
 + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
 + if (!expand_message) {
 + ext4_warning(inode-i_sb, __FUNCTION__,
 + Unable to expand inode %lu. Delete
 +  some EAs or run e2fsck.,
 + 

Re: [PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Andrew Morton
On Tue, 10 Jul 2007 15:15:57 -0700
Badari Pulavarty [EMAIL PROTECTED] wrote:

 Well, I looked at the problem now and here is the fix :)

whee, thanks.

 Greg, Please consider this for stable release also.

No, it is only relevant to -mm's ext2-reservations.patch.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:38:10 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 [PATCH] jbd2 stats through procfs
 
 The patch below updates the jbd stats patch to 2.6.20/jbd2.
 The initial patch was posted by Alex Tomas in December 2005
 (http://marc.info/?l=linux-ext4m=113538565128617w=2).
 It provides statistics via procfs such as transaction lifetime and size.
 
 [ This probably should be rewritten to use debugfs?   -- Ted]
 

I suppose that given that we're creating a spot in debugfs for the jbd2
debug code, yes, this also should be moved over.

But the jbd2 debug debugfs entries were kernel-wide whereas this is
per-superblock.  I think.

That email from Alex contains pretty important information.  I suggest that
it be added to the changelog after accuracy-checking.  Addition to
Documentation/filesystems/ext4.txt would be good.

 --
 
 Index: linux-2.6.22-rc4/include/linux/jbd2.h
 ===
 --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
 17:28:17.0 -0700
 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-13 10:45:21.0 
 -0700
 @@ -408,6 +408,16 @@
  };
  
  
 +/*
 + * Some stats for checkpoint phase
 + */
 +struct transaction_chp_stats_s {
 + unsigned long   cs_chp_time;
 + unsigned long   cs_forced_to_close;
 + unsigned long   cs_written;
 + unsigned long   cs_dropped;
 +};

It would be nice to document what units all these fields are in.  Jiffies,
I assume.

  /* The transaction_t type is the guts of the journaling mechanism.  It
   * tracks a compound transaction through its various states:
   *
 @@ -543,6 +553,21 @@
   spinlock_t  t_handle_lock;
  
   /*
 +  * Longest time some handle had to wait for running transaction
 +  */
 + unsigned long   t_max_wait;
 +
 + /*
 +  * When transaction started
 +  */
 + unsigned long   t_start;
 +
 + /*
 +  * Checkpointing stats [j_checkpoint_sem]
 +  */
 + struct transaction_chp_stats_s t_chp_stats;
 +
 + /*
* Number of outstanding updates running on this transaction
* [t_handle_lock]
*/
 @@ -573,6 +598,57 @@
  
  };
  
 +struct transaction_run_stats_s {
 + unsigned long   rs_wait;
 + unsigned long   rs_running;
 + unsigned long   rs_locked;
 + unsigned long   rs_flushing;
 + unsigned long   rs_logging;
 +
 + unsigned long   rs_handle_count;
 + unsigned long   rs_blocks;
 + unsigned long   rs_blocks_logged;
 +};
 +
 +struct transaction_stats_s
 +{
 + int ts_type;
 + unsigned long   ts_tid;
 + union {
 + struct transaction_run_stats_s run;
 + struct transaction_chp_stats_s chp;
 + } u;
 +};
 +
 +#define JBD2_STATS_RUN   1
 +#define JBD2_STATS_CHECKPOINT2
 +
 +#define ts_wait  u.run.rs_wait
 +#define ts_running   u.run.rs_running
 +#define ts_lockedu.run.rs_locked
 +#define ts_flushing  u.run.rs_flushing
 +#define ts_logging   u.run.rs_logging
 +#define ts_handle_count  u.run.rs_handle_count
 +#define ts_blocksu.run.rs_blocks
 +#define ts_blocks_logged u.run.rs_blocks_logged
 +
 +#define ts_chp_time  u.chp.cs_chp_time
 +#define ts_forced_to_close   u.chp.cs_forced_to_close
 +#define ts_written   u.chp.cs_written
 +#define ts_dropped   u.chp.cs_dropped

That's a bit sleazy.  We can drop the u from 'struct transaction_stats_s'
and make it an anonymous union, then open-code foo.run.rs_wait everywhere.

But that's a bit sleazy too, because the reader of the code would not know
that a write to foo.run.rs_wait will stomp on the value of
foo.chp.cs_chp_time.

So to minimize reader confusion it would be best I think to just open-code
the full u.run.rs_wait at all code-sites.

The macros above are the worst possible choice: they hide information from
the code-reader just to save the code-writer a bit of typing.  But we very
much want to optimise for code-readers, not for code-writers.

 +#define CURRENT_MSECS(jiffies_to_msecs(jiffies))

hm, that isn't something which should be in an ext4 header file.  And it
shouldn't be in UPPER CASE and it shouldn't pretend to be a constant (or a
global scalar).

IOW: yuk.

How's about raising a separate, standalone patch which creates a new
kernel-wide coded-in-C function such as

unsigned long jiffies_in_msecs(void);

?  (That's assuming we don't already have one.  Most likely we have seven
of them hiding in various dark corners).

 +static inline unsigned int
 +jbd2_time_diff(unsigned int start, unsigned int end)
 +{
 + if (unlikely(start  end))
 + end = end + (~0UL - start);
 + else
 + end -= start;
 + return end;
 +}

I don't know what this does 

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton
On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 David Chinneer pointed that we need to journal the version number
 updates together with the operations that causes the change of the inode
 version number, in order to survive server crashes so clients won't see
 the counter go backwards.
 
 So increment i_version in fs code is probably the place to ensure the
 inode version changes are stored to disk. It's seems update the ext4
 inode version in every ext4_mark_inode_dirty() is the easiest way.

That still makes us dependent upon _something_ changing the inode.  For
overwrites the only something is mtime.

If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
I don't think we do) then I guess we'll need new code in or around
file_update_time() to do this.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton
On Wed, 11 Jul 2007 15:05:27 +1000 Neil Brown [EMAIL PROTECTED] wrote:

 
 It just occurred to me:
 
  If i_version is 64bit, then knfsd would need to be careful when
  reading it on a 32bit host.  What are the locking rules?
 
  Presumably it is only updated under i_mutex protection, but having to
  get i_mutex to read it would seem a little heavy handed.
 
  Should it use a seqlock like i_size?
  Could we use the same seqlock that i_size uses, or would we need a
  separate one?
 

seqlocks are a bit of a pain to use (we've had plenty of deadlocks on the
i_size one).  We could reuse inode.i_lock for this modification.  Its
mandate is general purpose innermost lock to protect stuff in this inode.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 From [EMAIL PROTECTED] Thu May 17 17:21:08 2007
 Hi,
 
 I have rebased this patch to 2.6.22-rc1 so that it can be added to the
 ext4 patch queue. It has been tested by creating more than 65000 subdirs
 and then deleting them and checking the nlinks. The e2fsprogs part of
 this patch was sent earlier by me to linux-ext4 and doesn't need any
 changes, so not submitting it again.
 
 --
 This patch adds support to ext4 for allowing more than 65000
 subdirectories. Currently the maximum number of subdirectories is capped
 at 32000.
 
 If we exceed 65000 subdirectories in an htree directory it sets the
 inode link count to 1 and no longer counts subdirectories.  The
 directory link count is not actually used when determining if a
 directory is empty, as that only counts subdirectories and not regular
 files that might be in there. 
 
 A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
 the subdir count for any directory crosses 65000.
 

Would I be correct in assuming that a later fsck will clear
EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir
directories?

If so, that is worth a mention in the changelog, perhaps?

Please remind us what is the behaviour of an RO_COMPAT flag?  It means that
old ext4, ext3 and ext2 can only mount this fs read-only, yes?

 
 Index: linux-2.6.22-rc4/fs/ext4/namei.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/namei.c 2007-06-14 17:30:47.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/namei.c  2007-06-14 17:32:55.0 -0700
 @@ -1619,6 +1619,27 @@ static int ext4_delete_entry (handle_t *
   return -ENOENT;
  }
  
 +static inline void ext4_inc_count(handle_t *handle, struct inode *inode)
 +{
 + inc_nlink(inode);
 + if (is_dx(inode)  inode-i_nlink  1) {
 + /* limit is 16-bit i_links_count */
 + if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) {
 + inode-i_nlink = 1;
 + EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb,
 +   EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
 + }
 + }
 +}

Looks too big to be inlined.

Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2?

 +static inline void ext4_dec_count(handle_t *handle, struct inode *inode)
 +{
 + drop_nlink(inode);
 + if (S_ISDIR(inode-i_mode)  inode-i_nlink == 0)
 + inc_nlink(inode);
 +}

Probably too big to inline.

 +
  static int ext4_add_nondir(handle_t *handle,
   struct dentry *dentry, struct inode *inode)
  {
 @@ -1715,7 +1736,7 @@ static int ext4_mkdir(struct inode * dir
   struct ext4_dir_entry_2 * de;
   int err, retries = 0;
  
 - if (dir-i_nlink = EXT4_LINK_MAX)
 + if (EXT4_DIR_LINK_MAX(dir))
   return -EMLINK;
  
  retry:
 @@ -1738,7 +1759,7 @@ retry:
   inode-i_size = EXT4_I(inode)-i_disksize = inode-i_sb-s_blocksize;
   dir_block = ext4_bread (handle, inode, 0, 1, err);
   if (!dir_block) {
 - drop_nlink(inode); /* is this nlink == 0? */
 + ext4_dec_count(handle, inode); /* is this nlink == 0? */
   ext4_mark_inode_dirty(handle, inode);
   iput (inode);
   goto out_stop;
 @@ -1770,7 +1791,7 @@ retry:
   iput (inode);
   goto out_stop;
   }
 - inc_nlink(dir);
 + ext4_inc_count(handle, dir);
   ext4_update_dx_flag(dir);
   ext4_mark_inode_dirty(handle, dir);
   d_instantiate(dentry, inode);
 @@ -2035,10 +2056,10 @@ static int ext4_rmdir (struct inode * di
   retval = ext4_delete_entry(handle, dir, de, bh);
   if (retval)
   goto end_rmdir;
 - if (inode-i_nlink != 2)
 - ext4_warning (inode-i_sb, ext4_rmdir,
 -   empty directory has nlink!=2 (%d),
 -   inode-i_nlink);
 + if (!EXT4_DIR_LINK_EMPTY(inode))
 + ext4_warning(inode-i_sb, ext4_rmdir,
 +  empty directory has too many links (%d),
 +  inode-i_nlink);
   inode-i_version++;
   clear_nlink(inode);
   /* There's no need to set i_disksize: the fact that i_nlink is
 @@ -2048,7 +2069,7 @@ static int ext4_rmdir (struct inode * di
   ext4_orphan_add(handle, inode);
   inode-i_ctime = dir-i_ctime = dir-i_mtime = ext4_current_time(inode);
   ext4_mark_inode_dirty(handle, inode);
 - drop_nlink(dir);
 + ext4_dec_count(handle, dir);
   ext4_update_dx_flag(dir);
   ext4_mark_inode_dirty(handle, dir);
  
 @@ -2099,7 +2120,7 @@ static int ext4_unlink(struct inode * di
   dir-i_ctime = dir-i_mtime = ext4_current_time(dir);
   ext4_update_dx_flag(dir);
   ext4_mark_inode_dirty(handle, dir);
 - drop_nlink(inode);
 + 

Re: [PATCH] fix error handling in ext3_create_journal

2007-07-03 Thread Andrew Morton
On Mon, 2 Jul 2007 00:11:11 +0200
Borislav Petkov [EMAIL PROTECTED] wrote:

 
 ---
 From: Borislav Petkov [EMAIL PROTECTED]
 
 Fix error handling in ext3_create_journal according to kernel conventions.
 
 Signed-off-by: Borislav Petkov [EMAIL PROTECTED]
 --
 
 Index: linux-2.6.22-rc6/fs/ext3/super.c
 ===
 --- linux-2.6.22-rc6/fs/ext3/super.c.orig 2007-07-01 21:12:51.0 
 +0200
 +++ linux-2.6.22-rc6/fs/ext3/super.c  2007-07-01 21:14:32.0 +0200
 @@ -2075,6 +2075,7 @@
  unsigned int journal_inum)
  {
   journal_t *journal;
 + int err;
  
   if (sb-s_flags  MS_RDONLY) {
   printk(KERN_ERR EXT3-fs: readonly filesystem when trying to 
 @@ -2082,13 +2083,15 @@
   return -EROFS;
   }
  
 - if (!(journal = ext3_get_journal(sb, journal_inum)))
 + journal = ext3_get_journal(sb, journal_inum);
 + if (!journal)
   return -EINVAL;
  
   printk(KERN_INFO EXT3-fs: creating new journal on inode %u\n,
  journal_inum);
  
 - if (journal_create(journal)) {
 + err = journal_create(journal);
 + if (err) {
   printk(KERN_ERR EXT3-fs: error creating journal.\n);
   journal_destroy(journal);
   return -EIO;

Please prepare the equivalent patch for ext4.  Without that, it'd probably
be better to avoid applying the ext3 patch: there are advantages to keeping
the two in sync where possible.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


fallocate support for bitmap-based files

2007-06-29 Thread Andrew Morton

Guys, Mike and Sreenivasa at google are looking into implementing
fallocate() on ext2.  Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.

I believe that Sreenivasa will mainly be doing the implementation work.


The basic plan is as follows:

- Create (with tune2fs and mke2fs) a hidden file using one of the
  reserved inode numbers.  That file will be sized to have one bit for each
  block in the partition.  Let's call this the unwritten block file.

  The unwritten block file will be initialised with all-zeroes

- at fallocate()-time, allocate the blocks to the user's file (in some
  yet-to-be-determined fashion) and, for each one which is uninitialised,
  set its bit in the unwritten block file.  The set bit means this block
  is uninitialised and needs to be zeroed out on read.

- truncate() would need to clear out set-bits in the unwritten blocks file.

- When the fs comes to read a block from disk, it will need to consult
  the unwritten blocks file to see if that block should be zeroed by the
  CPU.

- When the unwritten-block is written to, its bit in the unwritten blocks
  file gets zeroed.

- An obvious efficiency concern: if a user file has no unwritten blocks
  in it, we don't need to consult the unwritten blocks file.

  Need to work out how to do this.  An obvious solution would be to have
  a number-of-unwritten-blocks counter in the inode.  But do we have space
  for that?

  (I expect google and others would prefer that the on-disk format be
  compatible with legacy ext2!)

- One concern is the following scenario:

  - Mount fs with new kernel, fallocate() some blocks to a file.

  - Now, mount the fs under old kernel (which doesn't understand the
unwritten blocks file).

- This kernel will be able to read uninitialised data from that
  fallocated-to file, which is a security concern.

  - Now, the old kernel writes some data to a fallocated block.  But
this kernel doesn't know that it needs to clear that block's flag in
the unwritten blocks file!

  - Now mount that fs under the new kernel and try to read that file.
 The flag for the block is set, so this kernel will still zero out the
data on a read, thus corrupting the user's data

  So how to fix this?  Perhaps with a per-inode flag indicating this
  inode has unwritten blocks.  But to fix this problem, we'd require that
  the old kernel clear out that flag.

  Can anyone propose a solution to this?

  Ah, I can!  Use the compatibility flags in such a way as to prevent the
  old kernel from mounting this filesystem at all.  To mount this fs
  under an old kernel the user will need to run some tool which will

  - read the unwritten blocks file

  - for each set-bit in the unwritten blocks file, zero out the
corresponding block

  - zero out the unwritten blocks file

  - rewrite the superblock to indicate that this fs may now be mounted
by an old kernel.

  Sound sane?

- I'm assuming that there are more reserved inodes available, and that
  the changes to tune2fs and mke2fs will be basically a copy-n-paste job
  from the `tune2fs -j' code.  Correct?

- I haven't thought about what fsck changes would be needed.

  Presumably quite a few.  For example, fsck should check that set-bits
  in the unwriten blobks file do not correspond to freed blocks.  If they
  do, that should be fixed up.

  And fsck can check each inodes number-of-unwritten-blocks counters
  against the unwritten blocks file (if we implement the per-inode
  number-of-unwritten-blocks counter)

  What else should fsck do?

- I haven't thought about the implications of porting this into ext3/4. 
  Probably the commit to the unwritten blocks file will need to be atomic
  with the commit to the user's file's metadata, so the unwritten-blocks
  file will effectively need to be in journalled-data mode.

  Or, more likely, we access the unwritten blocks file via the blockdev
  pagecache (ie: use bmap, like the journal file) and then we're just
  talking direct to the disk's blocks and it becomes just more fs metadata.

- I guess resize2fs will need to be taught about the unwritten blocks
  file: to shrink and grow it appropriately.


That's all I can think of for now - I probably missed something. 

Suggestions and thought are sought, please.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fallocate support for bitmap-based files

2007-06-29 Thread Andrew Morton
On Fri, 29 Jun 2007 16:55:25 -0400
Theodore Tso [EMAIL PROTECTED] wrote:

 On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
  
  Guys, Mike and Sreenivasa at google are looking into implementing
  fallocate() on ext2.  Of course, any such implementation could and should
  also be portable to ext3 and ext4 bitmapped files.
 
 What's the eventual goal of this work?  Would it be for mainline use,
 or just something that would be used internally at Google?

Mainline, preferably.

  I'm not
 particularly ennthused about supporting two ways of doing fallocate();
 one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
 benefit reallyworth it?

umm, it's worth it if you don't want to wear the overhead of journalling,
and/or if you don't want to wait on the, err, rather slow progress of ext4.

 What I would suggest, which would make much easier, is to make this be
 an incompatible extensions (which you as you point out is needed for
 security reasons anyway) and then steal the high bit from the block
 number field to indicate whether or not the block has been initialized
 or not.  That way you don't end up having to seek to a potentially
 distant part of the disk to check out the bitmap.  Also, you don't
 have to worry about how to recover if the block initialized bitmap
 inode gets smashed.  
 
 The downside is that it reduces the maximum size of the filesystem
 supported by ext2 by a factor of two.  But, there are at least two
 patch series floating about that promise to allow filesystem block
 sizes  than PAGE_SIZE which would allow you to recover the maximum
 size supported by the filesytem.
 
 Furthermore, I suspect (especially after listening to a very fasting
 Usenix Invited Talk by Jeffery Dean, a fellow from Google two weeks
 ago) that for many of Google's workloads, using a filesystem blocksize
 of 16K or 32K might not be a bad thing in any case.
 
 It would be a lot simpler
 

Hadn't thought of that.

Also, it's unclear to me why google is going this way rather than using
(perhaps suitably-tweaked) ext2 reservations code.

Because the stock ext2 block allcoator sucks big-time.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Andrew Morton
On Mon, 25 Jun 2007 18:58:10 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:

 N O T E: 
 ---
 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
of ext4 patch queue git tree hosted by Ted.

Why the heck are replacements for these things being sent out again when
they're already in -mm and they're already in Ted's queue (from which I
need to diligently drop them each time I remerge)?

Are we all supposed to re-review the entire patchset (or at least #4 and
#7) again?

The core kernel changes are not appropriate to the ext4 tree.

For a start, the syscall numbers in Ted's queue are wrong (other new
syscalls are pending).

Patches which add syscalls are an utter PITA to carry due to all the patch
conflicts and to the relatively frequent syscall renumbering (they don't
get numbered in time-of-arrival order due to differing rates at which patches
mature).

Please drop the non-ext4 patches from the ext4 tree and send incremental
patches against the (non-ext4) fallocate patches in -mm.

And try to get the code finished?  Time is pressing.

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Andrew Morton
On Thu, 28 Jun 2007 23:27:57 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:

  Please drop the non-ext4 patches from the ext4 tree and send incremental
  patches against the (non-ext4) fallocate patches in -mm.
 
 Please let us know what you think of Mingming's suggestion of posting
 all the fallocate patches including the ext4 ones as incremental ones
 against the -mm.

I think Mingming was asking that Ted move the current quilt tree into git,
presumably because she's working off git.

I'm not sure what to do, really.  The core kernel patches need to be in
Ted's tree for testing but that'll create a mess for me.

ug.

Options might be

a) I drop the fallocate patches from -mm and from the ext4 tree, hack up
   any needed build fixes, then just wait for it all to mature and then
   think about it again

b) We do what we normally don't do and reserve the syscall slots in mainline.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] zero_user_page conversion

2007-06-23 Thread Andrew Morton
 On Wed, 20 Jun 2007 17:08:24 -0500 Eric Sandeen [EMAIL PROTECTED] wrote:
 Use zero_user_page() in cifs, ocfs2, ext4, and gfs2 where possible.

One patch, splattered across four maintainers, each of whom maintain separate
trees.

Sigh.  Please, don't.  _someone_ has to split this up, and it might as well
not be me.

splits it up

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit machines

2007-06-17 Thread Andrew Morton

someone ort to fix this


Begin forwarded message:

Date: Sat, 16 Jun 2007 17:50:35 -0700 (PDT)
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit 
machines


http://bugzilla.kernel.org/show_bug.cgi?id=8643

   Summary: Bug with negative timestamps on 64bit machines
   Product: File System
   Version: 2.5
 KernelVersion: 2.6.21-1.3228.fc7
  Platform: All
OS/Version: Linux
  Tree: Fedora
Status: NEW
  Severity: normal
  Priority: P1
 Component: ext3
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]


Most recent kernel where this bug did not occur:
Distribution: Fedora Core 7
Hardware Environment: AMD Athlon 64
Software Environment:
Problem Description:

Negative timestamps (i.e. before 1970) are incorrectly handled due to some
obvious 64bit issues.  They show up normally as long as the cache has not been
purged.

Steps to reproduce:

Touch a file in an ext3-file system with a negative timestamp, e.g.:

  # touch -t 19690101 /opt/foo

Unless you have been writing a lot of data to this partition after the last
command, the next one should display the following:

  #  ls -l /opt/foo
  -rw-r--r-- 1 root root 0 1969-01-01 00:00 /opt/foo

This is still correct.  But now we purge the file system cache by unmounting
and mounting the partition again:

  # umount /opt/foo
  # mount /opt/foo

Now the timestamp will be corrupted:

  # ls -l /opt/foo 
  -rw-r--r-- 1 root root 0 2105-02-07 06:28 /opt/foo

It seems obvious that the upper 32 bits of the 64 bits representing the time
stamp get lost, which explains the above observation.

On 32bit architectures there is no such problem.  I haven't managed to
reproduce this problem with other file systems (e.g. XFS).


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug, or are watching someone who is.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] fs/buffer.c:1821 in 2.6.22-rc4-mm2

2007-06-11 Thread Andrew Morton
On Sun, 10 Jun 2007 17:57:14 +0200 Eric Sesterhenn / Snakebyte [EMAIL 
PROTECTED] wrote:

 hi,
 
 i got the following BUG while running the syscalls.sh
 from ltp-full-20070531 on an ext3 partition, it is easily reproducible
 for me
 
 [  476.338068] [ cut here ]
 [  476.338223] kernel BUG at fs/buffer.c:1821!
 [  476.338324] invalid opcode:  [#1]
 [  476.338423] PREEMPT 
 [  476.338665] Modules linked in:
 [  476.338833] CPU:0
 [  476.338836] EIP:0060:[c01a1914]Not tainted VLI
 [  476.338840] EFLAGS: 00010202   (2.6.22-rc4-mm2 #1)
 [  476.339206] EIP is at __block_prepare_write+0x64/0x410
 [  476.339311] eax: 0001   ebx: c136fbb8   ecx: c07faf28   edx:
 0001
 [  476.339417] esi: c1dc9040   edi: c32d2dfc   ebp: c3733db8   esp:
 c3733d50
 [  476.339584] ds: 007b   es: 007b   fs:   gs: 0033  ss: 0068
 [  476.339690] Process vmsplice01 (pid: 7680, ti=c3733000 task=c351ed60
 task.ti=c3733000)
 [  476.339796] Stack: c3733d70 c0143e76 c1a0eab0 0046 
 c2509d64 0cd8 c136fbb8 
 [  476.340675]c32d2dfc 0296 c02313b6 c1086088 0050
 c02313b6 c1dc9040 c2509d50 
 [  476.341491]c1dc9054 c3733dc4 c02313e9 c3733dbc c015728d
 c32d2f0c  c136fbb8 
 [  476.342371] Call Trace:
 [  476.342565]  [c01a1d83] block_write_begin+0x83/0xf0
 [  476.342804]  [c0207778] ext3_write_begin+0xc8/0x1c0
 [  476.342987]  [c01595bf] pagecache_write_begin+0x4f/0x150
 [  476.343243]  [c019db3b] pipe_to_file+0x9b/0x170
 [  476.343418]  [c019d4b0] __splice_from_pipe+0x70/0x260
 [  476.343654]  [c019d6e8] splice_from_pipe+0x48/0x70
 [  476.343828]  [c019d9f8] generic_file_splice_write+0x88/0x130
 [  476.344066]  [c019d267] do_splice_from+0xb7/0xc0
 [  476.344240]  [c019ea51] sys_splice+0x1a1/0x230
 [  476.344474]  [c01043be] sysenter_past_esp+0x5f/0x99
 [  476.344656]  [e410] 0xe410
 [  476.344882]  ===
 [  476.344984] INFO: lockdep is turned off.
 [  476.345084] Code: 00 0f 97 c2 e8 ee 2f 22 00 85 c0 74 04 0f 0b eb fe
 31 d2 b8 28 af 7f c0 81 7d 08 00 10 00 00 0f 97 c2 e8 d0 2f 22 00 85 c0
 74 04 0f 0b eb fe 8b 55 08 39 55 b0 0f 97 c0 0f b6 d0 b8 0c af 7f c0 
 [  476.350365] EIP: [c01a1914] __block_prepare_write+0x64/0x410 SS:ESP
 0068:c3733d50

Yep, vmsplice01 is not supported on -mm kernels ;)

Nick has a protofix but I don't think it's been tested yet.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ext3/ext4] orphan list check on destroy_inode

2007-06-04 Thread Andrew Morton
On Mon, 04 Jun 2007 09:18:55 +0400 Vasily Averin [EMAIL PROTECTED] wrote:

 Customers claims to ext3-related errors, investigation showed that ext3 
 orphan list has been corrupted and have the reference to non-ext3 inode. The 
 following debug helps to understand the reasons of this issue.
 
 Signed-off-by:Vasily Averin [EMAIL PROTECTED]
 
 diff --git a/fs/ext3/super.c b/fs/ext3/super.c
 index 6e30629..46e2fa6 100644
 --- a/fs/ext3/super.c
 +++ b/fs/ext3/super.c
 @@ -459,6 +459,21 @@ static struct inode *ext3_alloc_inode(struct super_block 
 *sb)
  
  static void ext3_destroy_inode(struct inode *inode)
  {
 + if (!list_empty((EXT3_I(inode)-i_orphan))) {
 + int i, imax;
 + unsigned int *p;
 +
 + p = (unsigned int *)EXT3_I(inode);
 + imax = sizeof(struct ext3_inode_info) / sizeof(unsigned int);
 + printk(Inode %p: orphan list check failed!\n, EXT3_I(inode));
 + for (i = 0; i  imax; i++) {
 + if (i  ((i % 8) == 0))
 + printk(\n);
 + printk(%08x , *p++);
 + }
 + printk(\n);
 + dump_stack();

umm, OK, but please use the new lib/hexdump.c for this.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: ext3 dir_index causes an error

2007-05-31 Thread Andrew Morton

Ted is dir_index maintainer ;)

That's a nice-looking bug report, btw.  Thanks.


Begin forwarded message:

Date: Fri, 01 Jun 2007 13:01:07 +0900
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: ext3 dir_index causes an error



Hello,

First of all, I really appricate your great works.
Now I've found a problem around dir_index feature.
Here is a report following linux/REPORTING-BUGS.


[1.] One line summary of the problem:
ext3 dir_index causes an error

[2.] Full description of the problem/report:
This is my local test program to reproduce this problem. The
readdir1.c calls creat(2), opendir(3) and readdir(3). And the shell
script execute it repeatedly with a brand-new ext3fs image on a
loopback device.
When the script adds '-O dir_index' to mkfs, some errors appear.

On a system with linux-2.6.21.3, ext3fs produces these error message,
and the filesystem seems to be corrupted.
--
kjournald starting.  Commit interval 5 seconds
EXT3 FS on loop0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
:::
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory 
#2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, 
name_len=249
EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory 
#2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, 
name_len=249
EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory 
#2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, 
name_len=249
kjournald starting.  Commit interval 5 seconds
:::
--

On the other system with linux-2.6.18 (debian etch kernel), the same
error appears.
When the script adds '-O ^dir_index' to mkfs, the problem never appears.

It is not everytime that these errors appear. So the shell script
executes the readdir1 test program repeatedly.
Recently I upgraded my debian system from version 3.1 'sarge' to 4.0
'etch'. The debian etch sets the dir_index feature by default. So I
found this problem.

[3.] Keywords (i.e., modules, networking, kernel):
ext3 dir_index

[4.] Kernel information
[4.1.] Kernel version (from /proc/version):
[4.2.] Kernel .config file:
[5.] Most recent kernel version which did not have the bug:
[6.] Output of Oops.. message (if applicable) with symbolic information
 resolved (see Documentation/oops-tracing.txt)
[7.] A small shell script or example program which triggers the
 problem (if possible)

(readdir1.c)

#include sys/types.h
#include sys/stat.h
#include fcntl.h
#include dirent.h
#include stdio.h
#include stdlib.h
#include unistd.h
#include assert.h
#include string.h
#include errno.h

void fin(char *s)
{
perror(s);
exit(1);
}

void msg(int found, char *fname)
{
printf(%s%s found\n, fname, found?: not);
}

int
main(int argc, char *argv[])
{
DIR *dp;
struct dirent *de;
int err, found, i;
char a[250];

err = chdir(argv[1]);
if (err)
fin(chdir);

memset(a, 'a', sizeof(a)-1);
a[sizeof(a)-1] = 0;
for (i = 0; i  16+1; i++) {
a[0]++;
err = creat(a, 0644);
if (err  0)
fin(creat);

err = creat(argv[2], 0644);
if (err  0)
fin(creat);
}

#if 0
err = unlink(argv[2]);
if (err  errno != ENOENT)
fin(unlink);
#endif

dp = opendir(.);
if (!dp)
fin(opendir);

de = readdir(dp);
if (!de)
fin(1st readdir);
assert(strcmp(argv[2], de-d_name));

#if 0
argv[2][0]++;
err = creat(argv[2], 0644);
if (err  0)
fin(creat);

argv[2][0]--;
#endif
err = creat(argv[2], 0644);
if (err  0)
fin(creat);

#if 0
err = unlink(argv[2]);
if (err  errno != ENOENT)
fin(unlink);
#endif

found = 0;
while ((de = readdir(dp))  !found)
found = !strcmp(argv[2], de-d_name);
msg(found, argv[2]);

found = 0;
rewinddir(dp);
while ((de = readdir(dp))  !found)
found = !strcmp(argv[2], de-d_name);
msg(found, argv[2]);

closedir(dp);
dp = opendir(.);
if (!dp)
fin(opendir);

found = 0;
while ((de = readdir(dp))  !found)
found = !strcmp(argv[2], de-d_name);
msg(found, argv[2]);

return 0;
}
--
#!/bin/sh

img=rw.img
dir=rw
set -e
make /tmp/readdir1

cd /dev/shm
dd if=/dev/zero of=$img bs=1k count=4k 2 

Re: [PATCH 0/6][TAKE4] fallocate system call

2007-05-19 Thread Andrew Morton
On Thu, 17 May 2007 19:41:15 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:

 fallocate() is a new system call being proposed here which will allow
 applications to preallocate space to any file(s) in a file system.

I merged the first three patches into -mm, thanks.

All the system call numbers got changed due to recent additions.  They
may change in the future, too - nothing is stable until the code lands
in mainline.

I didn't merge any of the ext4 changes as they appear to be in Ted's
devel tree.  Although I didn't check that they are 100% the same in 
that tree.

What's the plan to get some ext4 updates into mainline, btw?  Things
seem to be rather gradual.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] readahead: introduce PG_readahead

2007-05-19 Thread Andrew Morton
On Sat, 19 May 2007 20:30:31 +0800 Fengguang Wu [EMAIL PROTECTED] wrote:

 On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
  On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu [EMAIL PROTECTED] wrote:
  
   Introduce a new page flag: PG_readahead.
  
  Is there any way in which we can avoid adding a new page flag?
  
  We have the advantage that if the kernel very occasionally gets the wrong
  result for PageReadahead(page), nothing particularly bad will happen, so we
  can do racy things.
  
  From a quick peek, it appears that PG_readahead is only ever set against
  non-uptodate pages.  If true we could perhaps exploit that: say,
  PageReadahead(page) == PG_referenced  !PG_uptodate?
 
 PG_uptodate will flip to 1 before the reader touches the page :(

OK.

 However, it may be possible to share the same bit with PG_reclaim or 
 PG_booked.
 Which one would be preferred?

I'd like PG_booked to go away too - I don't think we've put that under the
microscope yet.  If it remains then its scope will be defined by the
filesystem, so readahead shouldn't use it.  PG_reclaim belongs to core VFS
so that's much better.

Let's not do anything ugly, slow or silly in there, but please do take a
look, see if there is an opportunity here.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger [EMAIL PROTECTED] wrote:

   + block = offset  blkbits;
   + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
   +  - block;
   + mutex_lock(EXT4_I(inode)-truncate_mutex);
   + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
   + mutex_unlock(EXT4_I(inode)-truncate_mutex);
  
  Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
  space, and that this disk space will require an arbitrary amount of
  metadata, how can we work out how much journal space we'll be needing
  without at least looking at `len'?
 
 Good question.
 
 The uninitialized extent can cover up to 128MB with a single entry.
 If @path isn't specified, then ext4_ext_calc_credits_for_insert()
 function returns the maximum number of extents needed to insert a leaf,
 including splitting all of the index blocks.  That would allow up to 43GB
 (340 extents/block * 128MB) to be preallocated, but it still needs to take
 the size of the preallocation into account (adding 3 blocks per 43GB - a
 leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space.  If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons.  If you used fallocate to reserve 1GB of disk
and fallocate() succeeded and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop: 

while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}


Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space?  We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.



Does the proposed implementation handle quotas correctly, btw?  Has that
been tested?


Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger [EMAIL PROTECTED] wrote:

 On May 07, 2007  13:58 -0700, Andrew Morton wrote:
  Final point: it's fairly disappointing that the present implementation is
  ext4-only, and extent-only.  I do think we should be aiming at an ext4
  bitmap-based implementation and an ext3 implementation.
 
 Actually, this is a non-issue.  The reason that it is handled for extent-only
 is that this is the only way to allocate space in the filesystem without
 doing the explicit zeroing.  For other filesystems (including ext3 and
 ext4 with block-mapped files) the filesystem should return an error (e.g.
 -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV.  The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem.  Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao [EMAIL PROTECTED] wrote:

  +   while (ret = 0  ret  max_blocks) {
  +   block = block + ret;
  +   max_blocks = max_blocks - ret;
  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
  +   if (ret  0  test_bit(BH_New, map_bh.b_state)
  +((block + ret)  (i_size_read(inode)  
  blkbits)))
  +   nblocks = nblocks + ret;
  +   }
  +
  +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, 
  retries))
  +   goto retry;
  +
  Now the interesting question is: what do we do if we get halfway through
  this loop and then run out of space?  We could leave the disk all filled up
  and then return failure to the caller, but that's pretty poor behaviour,
  IMO.
  
 The current code handles earlier ENOSPC by three times retries. After
 that if we still run out of space, then it's propably right to notify
 the caller there isn't much space left.
 
 We could extend the block reservation window size before the while loop
 so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again.  The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 256)

2007-05-06 Thread Andrew Morton
On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] 
wrote:

 2.6.20.6, FC4:
 
 I created a 91248k ext3 fs with 4k blocksize:
 
 | mke2fs -j -b 4096 /dev/vol1/project 
 | mke2fs 1.38 (30-Jun-2005)
 | Filesystem label=
 | OS type: Linux
 | Block size=4096 (log=2)
 | Fragment size=4096 (log=2)
 | 23552 inodes, 23552 blocks
 | 1177 blocks (5.00%) reserved for the super user
 | First data block=0
 | Maximum filesystem blocks=25165824
 | 1 block group
 | 32768 blocks per group, 32768 fragments per group
 | 23552 inodes per group
 
 Writing inode tables: done
 Creating journal (1024 blocks): done
 Writing superblocks and filesystem accounting information: done
 
 Next, I tried to resize it to about 3G using ext2online while mounted:
 
 | # ext2online /dev/vol1/project 
 | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
 | ext2online: ext2_ioctl: No space left on device
 |
 | ext2online: unable to resize /dev/mapper/vol1-project
 
 At that time the kernel said:
 
 |JBD: ext2online wants too many credits (744  256)
 
 What is the limitation I should be aware of? Has it something to do with
 the journal log size?
 
 The size actually did increase a bit, to 128112k.
 
 
 Steps to reproduce:
 Create a 3G partition, say /dev/vol1/project
 mke2fs -j -b 4096 /dev/vol1/project 22812
 mount it
 ext2online /dev/vol1/project said:
 
 | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
 | ext2online: ext2_ioctl: No space left on device
 | 
 | ext2online: unable to resize /dev/mapper/vol1-project
 
 kernel said:
 
 | JBD: ext2online wants too many credits (721  256)
 

(added linux-ext4 to cc)
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] ext3: remove inode constructor

2007-05-05 Thread Andrew Morton
On Sat, 5 May 2007 11:58:45 +0300 (EEST) Pekka J Enberg [EMAIL PROTECTED] 
wrote:

 On Fri, 4 May 2007, Andrew Morton wrote:
  I got 100% rejects against this because Christoph has already had
  his paws all over the slab constructor code everywhere.
  
  Was going to fix it up but then decided that we ought to make changes
  like this to ext4 as well.  Ideally beforehand, but simultaneously is
  OK as long as it's simple enough.
 
 I'll send you proper patches for them (and will convert other filesystems 
 too).

May as well.

 On Fri, 4 May 2007, Andrew Morton wrote:
  btw, for a benchmark I'd suggest just a silly create-1-files
  tight loop rather than something more complex like postmark.
 
 Do you want me to redo the benchmarks or are you happy enough with the 
 postmark numbers?

I doubt if this is measurable, really.  It'll be something like the
difference between an L1 hit and an L2 hit in amongst all the other stuff
we do on a per-inode basis.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  Yes, there can be issues with needing to allocate journal space within the
  context of a commit.  But
 
 no-no, this isn't required. we only need to mark pages/blocks within
 transaction, otherwise race is possible when we allocate blocks in 
 transaction,
 then transacton starts to commit, then we mark pages/blocks to be flushed
 before commit.

I don't understand.  Can you please describe the race in more detail?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote:
  
  Andrew Morton wrote:
  Yes, there can be issues with needing to allocate journal space within the
  context of a commit.  But
  no-no, this isn't required. we only need to mark pages/blocks within
  transaction, otherwise race is possible when we allocate blocks in 
  transaction,
  then transacton starts to commit, then we mark pages/blocks to be flushed
  before commit.
  
  I don't understand.  Can you please describe the race in more detail?
 
 if I understood your idea right, then in data=ordered mode, commit thread 
 writes
 all dirty mapped blocks before real commit.
 
 say, we have two threads: t1 is a thread doing flushing and t2 is a commit 
 thread
 
 t1t2
 find dirty inode I
 find some dirty unallocated blocks
 journal_start()
 allocate blocks
 attach them to I
 journal_stop()

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does find some dirty unallocated blocks mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


   going to commit
   find inode I dirty
   do NOT find these blocks because they're
 allocated only, but pages/bhs aren't 
 mapped
 to them
   start commit

I think you're assuming here that commit would be using -t_sync_datalist
to locate dirty buffer_heads.

But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ need to do the sb-inode-page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.



It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.

I assume (hope) that your delayed allocation code implements
-writepages()?  Doing the allocation one-page-at-a-time sounds painful...

 
 map pages/bhs to just allocate blocks
 
 
 so, either we mark pages/bhs someway within journal_start()--journal_stop() or
 commit thread should do lookup for all dirty pages. the latter doesn't sound 
 nice, IMHO.
 

I don't think I'm understanding you fully yet.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  I'm still not understanding.  The terms you're using are a bit ambiguous.
  
  What does find some dirty unallocated blocks mean?  Find a page which is
  dirty and which does not have a disk mapping?
  
  Normally the above operation would be implemented via
  ext4_writeback_writepage(), and it runs under lock_page().
 
 I'm mostly worried about delayed allocation case. My impression was that
 holding number of pages locked isn't a good idea, even if they're locked
 in index order. so, I was going to turn number of pages writeback, then
 allocate blocks for all of them at once, then put proper blocknr's into
 bh's (or PG_mappedtodisk?).

ooh, that sounds hacky and quite worrisome.  If someone comes in and does
an fsync() we've lost our synchronisation point.  Yes, all callers happen
to do

lock_page();
wait_on_page_writeback();

(I think) but we've never considered a bare PageWriteback() as something
which protects page internals.  We're OK wrt page reclaim and we're OK wrt
truncate and invalidate.  As long as the page is uptodate we _should_ be OK
wrt readpage().  But still, it'd be better to use the standard locking
rather than inventing new rules, if poss.


I'd be 100% OK with locking multiple pages in ascending pgoff_t order. 
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow.  But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.


  
  
 going to commit
 find inode I dirty
 do NOT find these blocks because they're
   allocated only, but pages/bhs aren't 
  mapped
   to them
 start commit
  
  I think you're assuming here that commit would be using -t_sync_datalist
  to locate dirty buffer_heads.
 
 nope, I mean sb-inode-page walk.
 
  But under this proposal, t_sync_datalist just gets removed: the new
  ordered-data mode _only_ need to do the sb-inode-page walk.  So if I'm
  understanding you, the way in which we'd handle any such race is to make
  kjournald's writeback of the dirty pages block in lock_page().  Once it
  gets the page lock it can look to see if some other thread has mapped the
  page to disk.
 
 if I'm right holding number of pages locked, then they won't be locked, but
 writeback. of course kjournald can block on writeback as well, but how does
 it find pages with *newly allocated* blocks only?

I don't think we'd want kjournald to do that.  Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view.  If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.

  It may turn out that kjournald needs a private way of getting at the
  I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
  had the radix-tree-of-dirty-inodes thing then that's easy enough to do
  anyway, with a tagged search.  But I expect that a single pass through the
  superblock's dirty inodes would suffice for ordered-data.  Files which
  have chattr +j would screw things up, as usual.
 
 not dirty inodes only, but rather some fast way to find pages with newly
 allocated pages.

Newly allocated blocks, you mean?

Just write out the overwritten blocks as well as the new ones, I reckon. 
It's what we do now.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.21-git4 Scheduler, NOHZ, VFS bugs

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 18:20:51 +0200 Michal Piotrowski [EMAIL PROTECTED] wrote:

 I ran this script tree times,
 
 #! /bin/sh
 
 for i in `find /sys/ -type f`
 do
 echo wyƛwietlam $i
 sudo cat $i  /dev/null
 done
 
 First run - scheduler bug
 Second run - NOHZ bug
 Third - VFS bug
 
 H...
 
 [93298.252601] BUG: at /mnt/md0/devel/linux-git/kernel/sched.c:3241 
 add_preempt_count()
 [93298.260334]  [c0105039] show_trace_log_lvl+0x1a/0x2f
 [93298.265507]  [c0105720] show_trace+0x12/0x14
 [93298.269974]  [c01057d2] dump_stack+0x16/0x18
 [93298.274434]  [c011d18a] add_preempt_count+0x89/0x8b
 [93298.279501]  [c0126492] irq_enter+0xd/0x2e
 [93298.283788]  [c0114c1e] smp_apic_timer_interrupt+0x2a/0x84
 [93298.289458]  [c0104b2b] apic_timer_interrupt+0x33/0x38
 [93298.294783]  [c0257b79] show_uevent+0x58/0xcd
 [93298.299329]  ===
 [93390.468056] NOHZ: local_softirq_pending 22
 [93447.105850] NOHZ: local_softirq_pending 22
 [93450.332884] BUG: unable to handle kernel paging request at virtual address 
 3e343c0c
 [93450.340626]  printing eip:
 [93450.34] c018e2bc
 [93450.345520] *pde = 
 [93450.348314] Oops:  [#1]

Nice.  What was the last file which it read before crashing?

Are you able to consistently crash it by reading just that file?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] ext3: remove inode constructor

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 13:14:35 +0300 (EEST)
Pekka J Enberg [EMAIL PROTECTED] wrote:

 As explained by Christoph Lameter, ext3_alloc_inode() touches the same
 cache line as init_once() so we gain nothing from using slab
 constructors.  The SLUB allocator will be more effective without it
 (free pointer can be placed inside the free'd object), so move inode
 initialization to ext3_alloc_inode completely.

I got 100% rejects against this because Christoph has already had
his paws all over the slab constructor code everywhere.

Was going to fix it up but then decided that we ought to make changes
like this to ext4 as well.  Ideally beforehand, but simultaneously is
OK as long as it's simple enough.

btw, for a benchmark I'd suggest just a silly create-1-files
tight loop rather than something more complex like postmark.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >