Re: [BUG] Linux 2.6.25-rc2 - Kernel Ooops while running dbench
On Sat, 16 Feb 2008 11:14:46 +0530 Kamalesh Babulal [EMAIL PROTECTED] wrote: The 2.6.25-rc2 kernel oopses while running dbench on ext3 filesystem mounted with mount -o data=writeback,nobh option on the x86_64 box BUG: unable to handle kernel NULL pointer dereference at IP: [80274972] kmem_cache_alloc+0x3a/0x6c PGD 1f6860067 PUD 1f5d64067 PMD 0 Oops: [1] SMP CPU 3 Modules linked in: Pid: 4271, comm: dbench Not tainted 2.6.25-rc2-autotest #1 RIP: 0010:[80274972] [80274972] kmem_cache_alloc+0x3a/0x6c RSP: :8101fb041dc8 EFLAGS: 00010246 RAX: RBX: 810180033c00 RCX: 8027b269 RDX: RSI: 80d0 RDI: 80632d70 RBP: 80d0 R08: 0001 R09: R10: 8101feb36e50 R11: 0190 R12: 0001 R13: R14: 8101f8f38000 R15: ff9c FS: () GS:8101fff0f000(0063) knlGS:f7e41460 CS: 0010 DS: 002b ES: 002b CR0: 80050033 CR2: CR3: 0001f562 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process dbench (pid: 4271, threadinfo 8101fb04, task 8101fb18) Stack: 0001 8101fb041ea8 0001 8027b269 8101fb041ea8 80281fe8 0001 8101fb041ea8 ff9c 000b 0001 Call Trace: [8027b269] get_empty_filp+0x55/0xf9 [80281fe8] __path_lookup_intent_open+0x22/0x8f [80282853] open_namei+0x86/0x5a7 [8027d019] vfs_stat_fd+0x3c/0x4a [80279ab1] do_filp_open+0x1c/0x3d [80279c2c] get_unused_fd_flags+0x79/0x111 [80279dce] do_sys_open+0x46/0xca [80221c82] ia32_sysret+0x0/0xa Looks to me like we broke slab. Christoph is offline until the 27th.. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 9911] New: fsync blocks concurrent writes to file
On Thu, 7 Feb 2008 17:40:57 -0800 (PST) [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9911 Summary: fsync blocks concurrent writes to file Product: File System Version: 2.5 KernelVersion: 2.6.23.8 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: ext3 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Latest working kernel version: none Earliest failing kernel version: 2.6.18 Distribution: RHEL 4 Fedora 8 Hardware Environment: Dual Xeon dual-core Athlon Software Environment: Java 1.5.0_14 Problem Description: Multiple threads append transactions to a single file. When fsyncs are issued, no writes complete until the fsync completes. This has the effect of yielding the same write rate regardless of how many threads are writing to the file. Solaris 10 does not exhibit this problem and scales well as additional threads are added. I have confirmed this by running strace and witnessing that writes block until immediately after the fsync completes. When using truss on Solaris I have witnessed multiple writes completing while an fsync was in progress. Steps to reproduce: Open a file for writing. Launch multiple threads with each one writing data and followed by an fsync, but only if data has been written since the last fsync. I have googled this issue substantially and have found no answers. I supposing teaching java about sync_file_range() would be all too hard. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge plans, was Re: - disable-ext4.patch removed from -mm tree
On Tue, 5 Feb 2008 10:44:54 +0100 Christoph Hellwig [EMAIL PROTECTED] wrote: On Mon, Feb 04, 2008 at 02:35:29PM -0800, Andrew Morton wrote: On Mon, 4 Feb 2008 15:24:18 -0500 Christoph Hellwig [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote: On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote: When I merge David's iget coversion patches this will instead wreck the ext4 patchset. That's ok, it shouldn't be hard for me to fix this up. How quickly will you be able to merge David's iget converstion patches? They're about 1,000 patches back Care to post a merge plan so we have a slight chance to make sure not too much crap is hiding in these 1000 patches? Pretty much everything up to # # end # reiser4-sb_sync_inodes.patch That includes the git trees? No, I don't merge git trees. Defintive NACK to unionfs. Agree, but it'd be nice to get some movement and resolution here. The thing's actively and enthusiastically maintained and is apparently useful. There's not much point in allowing developers to expend cycles on something which doesn't have a future. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: - disable-ext4.patch removed from -mm tree
On Tue, 05 Feb 2008 17:02:20 -0800 Mingming Cao [EMAIL PROTECTED] wrote: On Tue, 2008-02-05 at 13:26 -0800, Mingming Cao wrote: On Mon, 2008-02-04 at 10:00 -0500, Theodore Tso wrote: On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote: On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote: When I merge David's iget coversion patches this will instead wreck the ext4 patchset. That's ok, it shouldn't be hard for me to fix this up. How quickly will you be able to merge David's iget converstion patches? They're about 1,000 patches back. OK, if you're not planning on pushing David's changes to Linus right away, what if I pull in David's iget-stop-ext4-from-using-iget-and-read_inode-try.patch and push it plus some other ext4 bug fixes directly to Linus, and let you know when that has happened so you can drop David's patch from your queue? David's changes to ext4 can be applied standalone without the rest of his series, so it would be safe to push that to Linus independently and in advance of the rest of his series. I get compile error when builing ext4 patch queue with iget-stop-ext4-from-using-iget-and-read_inode-try.patch applied, against 2.6.24-git14. It seems iget-stop-ext4-from-using-iget-and-read_inode-try.patch depends on patches: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [PATCH 03/32] IGET: Introduce a function to register iget failure It seems to me the easiest way to bring ext4 patches back to mm tree, is to carry above two patches in ext4 patch queue, like we did for other ext4 patches that depend on generic code in the past. It doesn't matter a lot because I won't be doing another -mm until all this is merged up anyway. So I added above two patches to ext4 patch queue, now that ext4 patches could apply cleanly to linus git tree, and Andrew should able to easily pull ext4 patches after removing the duplicated patches. Ted, I have the ext4 patch queue updated for this. This means that I merge part of the iget patch series, then twiddle thumbs until the ext4 tree merges, then merge the remainder of the iget series. So unless Ted intends to merge RSN I think it'd be preferable if I were to just merge the lot, sorry. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: - disable-ext4.patch removed from -mm tree
On Mon, 4 Feb 2008 10:00:44 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote: On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote: When I merge David's iget coversion patches this will instead wreck the ext4 patchset. That's ok, it shouldn't be hard for me to fix this up. How quickly will you be able to merge David's iget converstion patches? They're about 1,000 patches back. OK, if you're not planning on pushing David's changes to Linus right away, what if I pull in David's iget-stop-ext4-from-using-iget-and-read_inode-try.patch and push it plus some other ext4 bug fixes directly to Linus, and let you know when that has happened so you can drop David's patch from your queue? Sure, go for it. David's changes to ext4 can be applied standalone without the rest of his series, so it would be safe to push that to Linus independently and in advance of the rest of his series. That should also help reduce the number of inter-patch queue dependencies. That patch series is kind of logjammed anyway because it breaks isofs. Last time I discussed this with David he seemed to find this amusing rather than an urgent problem. I'd drop the whole lot if there weren't lots of other patches dependent upon them. Mayeb I can do a selective droppage, but I hate going out-of-order, and merging untested patch combinations. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge plans, was Re: - disable-ext4.patch removed from -mm tree
On Mon, 4 Feb 2008 15:24:18 -0500 Christoph Hellwig [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote: On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote: When I merge David's iget coversion patches this will instead wreck the ext4 patchset. That's ok, it shouldn't be hard for me to fix this up. How quickly will you be able to merge David's iget converstion patches? They're about 1,000 patches back Care to post a merge plan so we have a slight chance to make sure not too much crap is hiding in these 1000 patches? Pretty much everything up to # # end # reiser4-sb_sync_inodes.patch - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: - disable-ext4.patch removed from -mm tree
On Sun, 03 Feb 2008 12:18:35 -0800 [EMAIL PROTECTED] wrote: The patch titled disable-ext4 has been removed from the -mm tree. Its filename was disable-ext4.patch I dropped the entire ext4 patch series, because the newly-added convert-to-iget_locked patch has wrecked the iget-coversion patch which I have queued. Please do not merge that patch under my feet. When I merge David's iget coversion patches this will instead wreck the ext4 patchset. Dunno what to do about this sort of thing, apart from encouraging developers to take a look at what's happening in other trees before merging stuff. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: - disable-ext4.patch removed from -mm tree
On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote: When I merge David's iget coversion patches this will instead wreck the ext4 patchset. That's ok, it shouldn't be hard for me to fix this up. How quickly will you be able to merge David's iget converstion patches? They're about 1,000 patches back. Could this get pushed to Linus, say, tomorrow? No. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Fw: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record
Begin forwarded message: Date: Wed, 30 Jan 2008 03:24:08 -0800 (PST) From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record http://bugzilla.kernel.org/show_bug.cgi?id=9849 Summary: NULL pointer deref in journal_wait_on_commit_record Product: File System Version: 2.5 KernelVersion: 2.6.24-03997-g85004cc Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: ext4 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Latest working kernel version: - Earliest failing kernel version: 2.6.24-03863-g0ba6c33 Distribution: Ubuntu Problem Description: using a corrupted image causes an oops in unmount, seems as if journal_wait_on_commit_record() gets passed a NULL pointer Steps to reproduce: using fsfuzz with ext4, I'll attach the image which causes this for me one oops can be found here http://kerneloops.org/raw.php?rawid=3160msgid= here is another one with full jbd2 debugging enabled (there are a lot of log_do_checkpoint messages above this) [ 242.863778] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start checkpoint [ 242.863790] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint: cleanup_journal_tail returned 1 [ 242.863810] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start checkpoint [ 242.863822] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint: cleanup_journal_tail returned 1 [ 242.863842] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start checkpoint [ 242.863854] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint: cleanup_journal_tail returned 1 [ 242.863874] (fs/jbd2/checkpoint.c, 308): jbd2_log_do_checkpoint: Start checkpoint [ 242.863886] (fs/jbd2/checkpoint.c, 316): jbd2_log_do_checkpoint: cleanup_journal_tail returned 1 [ 242.864017] (fs/jbd2/journal.c, 193): kjournald2: kjournald2 wakes [ 242.864027] (fs/jbd2/journal.c, 201): kjournald2: woke because of timeout [ 242.864035] (fs/jbd2/journal.c, 145): kjournald2: commit_sequence=1, commit_request=2 [ 242.864044] (fs/jbd2/journal.c, 148): kjournald2: OK, requests differ [ 242.864055] (fs/jbd2/commit.c, 415): jbd2_journal_commit_transaction: super block updated [ 242.864066] (fs/jbd2/journal.c, 1264): jbd2_journal_update_superblock: JBD: updating superblock (start 15335425, seq 2, errno 0) [ 242.864385] (fs/jbd2/commit.c, 428): jbd2_journal_commit_transaction: JBD: starting commit of transaction 2 [ 242.864409] (fs/jbd2/commit.c, 501): jbd2_journal_commit_transaction: JBD: commit phase 1 [ 242.864428] (fs/jbd2/commit.c, 519): jbd2_journal_commit_transaction: JBD: commit phase 2 [ 242.864459] (fs/jbd2/revoke.c, 537): jbd2_journal_write_revoke_records: Wrote 0 revoke records [ 242.864469] (fs/jbd2/commit.c, 561): jbd2_journal_commit_transaction: JBD: commit phase 2 [ 242.864478] (fs/jbd2/commit.c, 571): jbd2_journal_commit_transaction: JBD: commit phase 3 [ 242.864487] (fs/jbd2/commit.c, 780): jbd2_journal_commit_transaction: JBD: commit phase 4 [ 242.864496] (fs/jbd2/commit.c, 839): jbd2_journal_commit_transaction: JBD: commit phase 5 [ 242.864505] (fs/jbd2/commit.c, 866): jbd2_journal_commit_transaction: JBD: commit phase 6 [ 242.864599] attempt to access beyond end of device [ 242.864609] loop0: rw=0, want=200708, limit=16384 [ 242.864633] jbd2_journal_bmap: journal block not found at offset 15335425 on loop0 [ 242.864680] Aborting journal on device loop0. [ 242.864733] (fs/jbd2/journal.c, 1264): jbd2_journal_update_superblock: JBD: updating superblock (start 15335425, seq 2, errno -5) [ 242.864868] BUG: unable to handle kernel NULL pointer dereference at virtual address [ 242.864962] printing eip: c023c2a7 *pde = [ 242.865048] Oops: 0002 [#1] PREEMPT [ 242.865108] Modules linked in: [ 242.865218] [ 242.865243] Pid: 3698, comm: kjournald2 Not tainted (2.6.24-03997-g85004cc #16) [ 242.865268] EIP: 0060:[c023c2a7] EFLAGS: 00010202 CPU: 0 [ 242.865382] EIP is at journal_wait_on_commit_record+0x7/0x50 [ 242.865407] EAX: EBX: ECX: 0001 EDX: 0001 [ 242.865431] ESI: EDI: c07835d2 EBP: cb229ee4 ESP: cb229edc [ 242.865455] DS: 007b ES: 007b FS: GS: SS: 0068 [ 242.865539] Process kjournald2 (pid: 3698, ti=cb229000 task=cb208000 task.ti=cb229000) [ 242.865564] Stack: cb229f88 c023cb07 c07835d2 0362 c069c620 [ 242.865864]cb2316e0 cb231504 cb2314f0 cb134960 cb231920 [ 242.865918]cb208000 0008 [ 242.865918] Call Trace: [ 242.865918] [c0104c0a] show_trace_log_lvl+0x1a/0x30 [ 242.865918] [c0104cc9] show_stack_log_lvl+0xa9/0xd0 [ 242.865918] [c0104dba] show_registers+0xca/0x250 [ 242.865918] [c01051e1] die+0x101/0x220 [
Re: [Bugme-new] [Bug 9855] New: ext3 ACL corruption
On Wed, 30 Jan 2008 14:29:27 -0800 (PST) [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9855 Summary: ext3 ACL corruption Product: File System Version: 2.5 KernelVersion: 2.6.23 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: ext3 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Latest working kernel version: Unknown Earliest failing kernel version: Definitely 2.6.23 and 2.6.23.8 but earlier is possible Distribution: Debian Etch Hardware Environment: Multiple x86 machines Software Environment: Filesystem is Ext3 on LVM on RAID-1 (on SATA). # e2fsck -V e2fsck 1.40-WIP (14-Nov-2006) Using EXT2FS Library version 1.40-WIP, 14-Nov-2006 Problem Description: On several occasions now I have had e2fsck prune away ACLs on my file systems during a file system check after rebooting a number of (reasonably) long running Samba servers. This morning I decided to manually run fsck before rebooting one of these: # e2fsck -pfv /dev/mapper/vg_main-lv_samba (entry-e_value_offs + entry-e_value_size: 116, offs: 120) /dev/mapper/vg_main-lv_samba: Extended attribute in inode 163841 has a value offset (56) which is invalid CLEARED. (entry-e_value_offs + entry-e_value_size: 116, offs: 120) /dev/mapper/vg_main-lv_samba: Extended attribute in inode 262146 has a value offset (56) which is invalid CLEARED. [ snip lots of (near) identical errors] 8301 inodes used (0.08%) 1621 non-contiguous inodes (19.5%) # of inodes with ind/dind/tind blocks: 3837/24/0 1108478 blocks used (5.29%) 0 bad blocks 1 large file 7590 regular files 662 directories 0 character device files 0 block device files 0 fifos 0 links 40 symbolic links (38 fast symbolic links) 0 sockets 8292 files (Note: after remounting) # tune2fs -l /dev/mapper/vg_main-lv_samba tune2fs 1.40-WIP (14-Nov-2006) Filesystem volume name: none Last mounted on: not available Filesystem UUID: 88677414-c1f8-41ba-b737-d9f6170d771b Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file Filesystem flags: signed directory hash Default mount options:(none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 10485760 Block count: 20971520 Reserved block count: 1048576 Free blocks: 19863038 Free inodes: 10477459 First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 1019 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 16384 Inode blocks per group: 1024 Filesystem created: Wed Feb 21 21:38:33 2007 Last mount time: Thu Jan 31 03:18:54 2008 Last write time: Thu Jan 31 03:18:54 2008 Mount count: 1 Maximum mount count: 30 Last checked: Thu Jan 31 03:16:51 2008 Check interval: 15552000 (6 months) Next check after: Tue Jul 29 02:16:51 2008 Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Journal inode:8 Default directory hash: tea Directory Hash Seed: be8c201b-3563-4fa5-a2a6-e2864e4b73e2 Journal backup: inode blocks Steps to reproduce: Unfortunately, precise steps are not known. Restoring all the filesystem's ACLs from a recent dump made using getfacl -RP fixes the ACLs without causing the corruption to return. These are production Samba servers making fairly extensive use of file and directory ACLs. Thus far, I've only noticed the corruptions when it came time to upgrade to a new kernel and reboot (and the boot scripts then run fsck). Note that I've never noticed any issues at runtime because of this - only when I later realised that ACLs had been removed from random files and/or directories. I think I will implement some scripts to unmount and run fsck nightly from cron, so I can at least detect the corruption a little earlier. If there is some more helpful debugging output I can provide, please let me know. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record
On Wed, 30 Jan 2008 15:17:57 -0800 Mingming Cao [EMAIL PROTECTED] wrote: The buufer head pointer passed to journal_wait_on_commit_record() could be NULL if the previous journal_submit_commit_record() failed or journal has already aborted. We need to check the error returns from journal_submit_commit_record() and avoid calling journal_wait_on_commit_record() in the failure case. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd2/commit.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24-rc8/fs/jbd2/commit.c === --- linux-2.6.24-rc8.orig/fs/jbd2/commit.c2008-01-30 14:12:10.0 -0800 +++ linux-2.6.24-rc8/fs/jbd2/commit.c 2008-01-30 15:09:50.0 -0800 @@ -872,7 +872,8 @@ wait_for_iobuf: if (err) __jbd2_journal_abort_hard(journal); } - err = journal_wait_on_commit_record(cbh); + if (!err !is_journal_aborted(journal)) + err = journal_wait_on_commit_record(cbh); if (err) jbd2_journal_abort(journal, err); Thanks. Please note that I Cc'ed [EMAIL PROTECTED] on this, for a 2.6.24.x backport. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 33/49] ext4: Add the journal checksum feature
On Mon, 21 Jan 2008 22:02:12 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: From: Girish Shilamkar [EMAIL PROTECTED] The journal checksum feature adds two new flags i.e JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM. JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the checksum for the blocks described by the descriptor blocks. Due to checksums, writing of the commit record no longer needs to be synchronous. Now commit record can be sent to disk without waiting for descriptor blocks to be written to disk. This behavior is controlled using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be able to recover the journal with _ASYNC_COMMIT hence it is made incompat. The commit header has been extended to hold the checksum along with the type of the checksum. For recovery in pass scan checksums are verified to ensure the sanity and completeness(in case of _ASYNC_COMMIT) of every transaction. ... +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh) unneeded inlining. +{ + struct page *page = bh-b_page; + char *addr; + __u32 checksum; + + addr = kmap_atomic(page, KM_USER0); + checksum = crc32_be(crc32_sum, + (void *)(addr + offset_in_page(bh-b_data)), bh-b_size); + kunmap_atomic(addr, KM_USER0); + + return checksum; +} Can this buffer actually be in highmem? static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag, unsigned long long block) More unnecessary inlining. +/* + * jbd2_journal_clear_features () - Clear a given journal feature in the + * superblock + * @journal: Journal to act on. + * @compat: bitmask of compatible features + * @ro: bitmask of features that force read-only mount + * @incompat: bitmask of incompatible features + * + * Clear a given journal feature as present on the + * superblock. Returns true if the requested features could be reset. + */ +int jbd2_journal_clear_features(journal_t *journal, unsigned long compat, + unsigned long ro, unsigned long incompat) +{ + journal_superblock_t *sb; + + jbd_debug(1, Clear features 0x%lx/0x%lx/0x%lx\n, + compat, ro, incompat); + + sb = journal-j_superblock; + + sb-s_feature_compat= ~cpu_to_be32(compat); + sb-s_feature_ro_compat = ~cpu_to_be32(ro); + sb-s_feature_incompat = ~cpu_to_be32(incompat); + + return 1; +} +EXPORT_SYMBOL(jbd2_journal_clear_features); Kernel usually returns 0 on success. So we can return a useful errno on failure. +/* + * calc_chksums calculates the checksums for the blocks described in the + * descriptor block. + */ +static int calc_chksums(journal_t *journal, struct buffer_head *bh, + unsigned long *next_log_block, __u32 *crc32_sum) +{ + int i, num_blks, err; + unsigned io_block; + struct buffer_head *obh; + + num_blks = count_tags(journal, bh); + /* Calculate checksum of the descriptor block. */ + *crc32_sum = crc32_be(*crc32_sum, (void *)bh-b_data, bh-b_size); + + for (i = 0; i num_blks; i++) { + io_block = (*next_log_block)++; unsigned - unsigned long. Are all the types appropriate in here? + wrap(journal, *next_log_block); + err = jread(obh, journal, io_block); + if (err) { + printk(KERN_ERR JBD: IO error %d recovering block + %u in log\n, err, io_block); + return 1; + } else { + *crc32_sum = crc32_be(*crc32_sum, (void *)obh-b_data, + obh-b_size); + } + } + return 0; +} + static int do_one_pass(journal_t *journal, struct recovery_info *info, enum passtype pass) { @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal, unsigned intsequence; int blocktype; int tag_bytes = journal_tag_bytes(journal); + __u32 crc32_sum = ~0; /* Transactional Checksums */ /* Precompute the maximum metadata descriptors in a descriptor block */ int MAX_BLOCKS_PER_DESC; @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal, switch(blocktype) { case JBD2_DESCRIPTOR_BLOCK: /* If it is a valid descriptor block, replay it - * in pass REPLAY; otherwise, just skip over the - * blocks it describes. */ + * in pass REPLAY; if journal_checksums enabled, then + * calculate checksums in PASS_SCAN, otherwise, + * just skip over the blocks it describes. */ if (pass !=
Re: [PATCH 23/49] Add buffer head related helper functions
On Mon, 21 Jan 2008 22:02:02 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: +} +EXPORT_SYMBOL(bh_uptodate_or_lock); +/** Missing newline. + * bh_submit_read: Submit a locked buffer for reading + * @bh: struct buffer_head + * + * Returns a negative error + */ +int bh_submit_read(struct buffer_head *bh) +{ + if (!buffer_locked(bh)) + lock_buffer(bh); + + if (buffer_uptodate(bh)) + return 0; Here it can lock the buffer then return zero + get_bh(bh); + bh-b_end_io = end_buffer_read_sync; + submit_bh(READ, bh); + wait_on_buffer(bh); + if (buffer_uptodate(bh)) + return 0; Here it will unlock the buffer and return zero. This function is unusable when passed an unlocked buffer. The return value should (always) be documented. + return -EIO; +} +EXPORT_SYMBOL(bh_submit_read); void __init buffer_init(void) Missing newline. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 24/49] ext4: add block bitmap validation
On Mon, 21 Jan 2008 22:02:03 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: + if (bh_submit_read(bh) 0) { + brelse(bh); + ext4_error(sb, __FUNCTION__, Cannot read block bitmap - - block_group = %lu, block_bitmap = %llu, - block_group, bitmap_blk); + block_group = %d, block_bitmap = %llu, + (int)block_group, (unsigned long long)bitmap_blk); + return NULL; + } + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) { + brelse(bh); + return NULL; + } brelse() should only be used when the bh might be NULL - put_bh() can be used here. Please review all ext4/jbd2 code for this trivial speedup. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
On Mon, 21 Jan 2008 22:02:20 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: From: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Signed-off-by: Theodore Ts'o [EMAIL PROTECTED] ... +#if BITS_PER_LONG == 64 +#define mb_correct_addr_and_bit(bit, addr) \ +{\ + bit += ((unsigned long) addr 7UL) 3; \ + addr = (void *) ((unsigned long) addr ~7UL); \ +} +#elif BITS_PER_LONG == 32 +#define mb_correct_addr_and_bit(bit, addr) \ +{\ + bit += ((unsigned long) addr 3UL) 3; \ + addr = (void *) ((unsigned long) addr ~3UL); \ +} +#else +#error how many bits you are?! +#endif Why do these exist? +static inline int mb_test_bit(int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + return ext4_test_bit(bit, addr); +} ext2_test_bit() already handles bitnum wordsize. If mb_correct_addr_and_bit() is actually needed then some suitable comment would help. +static inline void mb_set_bit(int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_set_bit(bit, addr); +} + +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_set_bit_atomic(lock, bit, addr); +} + +static inline void mb_clear_bit(int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_clear_bit(bit, addr); +} + +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_clear_bit_atomic(lock, bit, addr); +} + +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) uninlining this will save about eighty squigabytes of text. Please review all of ext4/jbd2 with a view to removig unnecessary and wrong inlings. +{ + char *bb; + + /* FIXME!! is this needed */ + BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b)); + BUG_ON(max == NULL); + + if (order e4b-bd_blkbits + 1) { + *max = 0; + return NULL; + } + + /* at order 0 we see each particular block */ + *max = 1 (e4b-bd_blkbits + 3); + if (order == 0) + return EXT4_MB_BITMAP(e4b); + + bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b-bd_sb)-s_mb_offsets[order]; + *max = EXT4_SB(e4b-bd_sb)-s_mb_maxs[order]; + + return bb; +} + ... +#else +#define mb_free_blocks_double(a, b, c, d) +#define mb_mark_used_double(a, b, c) +#define mb_cmp_bitmaps(a, b) +#endif Please use the do{}while(0) thing. Or, better, proper C functions which have typechecking (unless this will cause undefined-var compile errors, which happens sometimes) +/* find most significant bit */ +static int fmsb(unsigned short word) +{ + int order; + + if (word 255) { + order = 7; + word = 8; + } else { + order = -1; + } + + do { + order++; + word = 1; + } while (word != 0); + + return order; +} Did we just reinvent fls()? +/* FIXME!! need more doc */ +static void ext4_mb_mark_free_simple(struct super_block *sb, + void *buddy, unsigned first, int len, + struct ext4_group_info *grp) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + unsigned short min; + unsigned short max; + unsigned short chunk; + unsigned short border; + + BUG_ON(len = EXT4_BLOCKS_PER_GROUP(sb)); + + border = 2 sb-s_blocksize_bits; Won't this explode with = 32k blocksize? + while (len 0) { + /* find how many blocks can be covered since this position */ + max = ffs(first | border) - 1; + + /* find how many blocks of power 2 we need to mark */ + min = fmsb(len); + + if (max min) + min = max; + chunk = 1 min; + + /* mark multiblock chunks only */ + grp-bb_counters[min]++; + if (min 0) + mb_clear_bit(first min, + buddy + sbi-s_mb_offsets[min]); + + len -= chunk; + first += chunk; + } +} + ... +static int ext4_mb_init_cache(struct page *page, char *incore) +{ + int blocksize; + int blocks_per_page; + int groups_per_page; + int err = 0; + int i; + ext4_group_t first_group; + int first_block; + struct super_block *sb; + struct buffer_head *bhs; + struct buffer_head **bh; + struct inode *inode; + char *data; + char *bitmap; + + mb_debug(init page %lu\n, page-index);
Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.
On Mon, 21 Jan 2008 22:02:09 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, + unsigned long max_blocks, struct buffer_head *bh, + int create, int extend_disksize) +{ + int retval; + if (create) { + down_write((EXT4_I(inode)-i_data_sem)); + } else { + down_read((EXT4_I(inode)-i_data_sem)); + } + if (EXT4_I(inode)-i_flags EXT4_EXTENTS_FL) { + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks, + bh, create, extend_disksize); + } else { + retval = ext4_get_blocks_handle(handle, inode, block, + max_blocks, bh, create, extend_disksize); + } + if (create) { + up_write((EXT4_I(inode)-i_data_sem)); + } else { + up_read((EXT4_I(inode)-i_data_sem)); + } This function has many unneeded braces. checkpatch used to detect this but it seems to have broken. + return retval; +} static int ext4_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) Mising newline. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl
On Mon, 21 Jan 2008 22:02:15 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: The below patch add ioctl for migrating ext3 indirect block mapped inode to ext4 extent mapped inode. This patch adds lots of weird and inexplicable single- and double-newlines in inappropriate places. However it frequently forgets to add newlines between end-of-locals and start-of-code, which is usual practice. +struct list_blocks_struct { + ext4_lblk_t first_block, last_block; + ext4_fsblk_t first_pblock, last_pblock; +}; This structure would benefit from some code comments. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]
On Wed, 23 Jan 2008 04:12:16 -0500 Abhishek Rai [EMAIL PROTECTED] wrote: I'm wondering about the interaction between this code and the buffer_boundary() logic. I guess we should disable the buffer_boundary() handling when this code is in effect. Have you reviewed and tested that aspect? Thanks for pointing this out, I had totally missed this issue in my change. I've now made the call to set_buffer_boundary() in ext3_get_blocks_handle() subject to metacluster option being set. Did it make any performance difference? iirc the buffer_boundary stuff was worth around 10% on a single linear read of a large, well-laid-out file. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]
I'm wondering about the real value of this change, really. In any decent environment, people will fsck their ext3 filesystems during planned downtime, and the benefit of reducing that downtime from 6 hours/machine to 2 hours/machine is probably fairly small, given that there is no service interruption. (The same applies to desktops and laptops). Sure, the benefit is not *zero*, but it's small. Much less than it would be with ext2. I mean, the avoid unplanned fscks feature is the whole reason why ext3 has journalling (and boy is that feature expensive during normal operation). So... it's unobvious that the benefit of this feature is worth its risks and costs? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 9692] New: journal_data mount option causes filesystem corruption with blocksize != 4096
On Sat, 5 Jan 2008 09:52:15 -0800 (PST) [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9692 Summary: journal_data mount option causes filesystem corruption with blocksize != 4096 Product: File System Version: 2.5 KernelVersion: 2.6.23.9 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: high Priority: P1 Component: ext3 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Most recent kernel where this bug did not occur: - Older kernels have this problem too (I think I noticed this booting = 2.6.21, definitely 2.6.22). I'm getting the feeling that we should just disable data=journal. Make it use data=ordered mode instead. It isn't getting a lot of attention.. Distribution: Gentoo Linux x86 This bug seems to be hardware-independent (tested on three different machines which all use quite different drivers). If you need hardware information or any other log or configuration files, let me know please. Problem Description: When creating an ext3 filesystem with journal_data option and block sizes different than 4096 (tested: 1024, 2048) filesystem corruption will occur if certain operations are performed (see below). Corruption will not occur if 4096 block size is used, or if any other block size is used together with journal_data_ordered or journal_data_writeback. No errors in dmesg. Steps to reproduce: I found this bug using an audio file tagger, so you need exfalso which is part of quodlibet (http://www.sacredchao.net/quodlibet/). No other file tagger I used produced this kind of problem. Still, this has to be a kernel problem, right?? 1. Create ext3 file system: mkfs.ext3 -O has_journal,dir_index -b 1024 /dev/sdd1 tune2fs -c 0 -i 0 -m 0 -o journal_data /dev/sdd1 tune2fs 1.40.3 (05-Dec-2007) (filtered) Filesystem volume name: none Last mounted on: not available Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal resize_inode dir_index filetype sparse_super Filesystem flags: signed directory hash Default mount options:journal_data Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 126976 Block count: 1012060 Reserved block count: 0 Free blocks: 976865 Free inodes: 126965 First block: 1 Block size: 1024 Fragment size:1024 Reserved GDT blocks: 256 Blocks per group: 8192 Fragments per group: 8192 Inodes per group: 1024 Inode blocks per group: 128 Last mount time: n/a Mount count: 0 Maximum mount count: -1 Check interval: 0 (none) Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode:8 Default directory hash: tea Journal backup: inode blocks 2. Mount it and copy mp3,ogg,... files to it. This does not cause any file system corruption (which you can confirm by running fsck). pmount /dev/sdd1: /dev/sdd1 on /media/sdd1 type ext3 (rw,noexec,nosuid,nodev,errors=continue) 3. Use quodlibet/exfalso to change the id3 tags. Add tags to it if not present, or delete them if already present. This will lead to file system corruption. brw-r- 1 root disk 8, 49 /dev/sdd1 4. Unmount the volume. pumount /dev/sdd1 5. Run fsck -fvD /dev/sdd1. It will complain about wrong i_size. e2fsck 1.40.3 (05-Dec-2007) Pass 1: Checking inodes, blocks, and sizes Inode 47106, i_size is 5015509, should be 5017600. Fixy? yes Inode 47107, i_size is 4657736, should be 4661248. Fixy? yes Inode 47109, i_size is 11928555, should be 11931648. Fixy? yes Inode 47111, i_size is 5698454, should be 5701632. Fixy? yes Inode 47112, i_size is 9384018, should be 9388032. Fixy? yes Inode 47114, i_size is 5679228, should be 5681152. Fixy? yes Inode 47115, i_size is 6107218, should be 6111232. Fixy? yes Inode 47117, i_size is 4354297, should be 4358144. Fixy? yes Inode 47118, i_size is 4512286, should be 4513792. Fixy? yes Inode 47120, i_size is 7010846, should be 7012352. Fixy? yes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 3A: Optimizing directories Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/sdd1: * FILE SYSTEM WAS MODIFIED * 28 inodes used (0.02%) 14 non-contiguous inodes (50.0%) # of inodes with ind/dind/tind blocks: 15/15/0 123417 blocks used (12.19%) 0 bad blocks 0 large files 16 regular files 3 directories 0 character device files 0 block device
ext4 still broken on multiple architectures
fs/ext4/mballoc.c: In function 'ext4_mb_generate_buddy': fs/ext4/mballoc.c:836: error: implicit declaration of function 'ext2_find_next_bit' Can someone please get this fixed? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered
(switching to email - please respond via emailed reply-to-all, not via the bugzilla web interface) On Tue, 11 Dec 2007 11:36:39 -0800 (PST) [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9546 Summary: Huge latency in concurrent I/O when using data=ordered Product: File System Version: 2.5 KernelVersion: 2.6.24-rc4 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: ext3 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Most recent kernel where this bug did not occur: Unknown, certainly not a regression, but something specific to ext3 algorithm Distribution: Bluewhite64 12 (Slackware 12 64 bits port) and Slackware 12 Hardware Environment: Athlon 64 3000+laptop IDE 5400 80GB+1.2GB RAM Athlon 64X2 4200+SATA 7200 200GB drive+1GB Athlon 2800+IDE 7200 40GB drive+512MB Software Environment: dd, cp, konqueror/KDE, mount/tune2fs Problem Description: When the system does heavy input/output operations on big files, small files access from other applications are always not served for very long time. This can cause huge latencies. The system is really not usable at all, even with all the recent improvements done to increase interactivity on desktop. This behaviour is very visible with the simple following test case: 1. Build a DVD structure from big MPEG+PS files with dvdauthor (it copies the files in the DVD stucture, then pass on them to fix VOBUs, but this part is not very long so this is not the main problem). 2. While the computer is doing this, try to open a web browser such as konqueror. Then open a page from bookmark. Then open a new tab, then open another page from bookmark. Switch bak to first page. What I get is: 35 seconds to open Konqueror. 8 seconds to open the bookmark menu. Incredible. 30 seconds to open the web page (DSL/10GBits). 5 seconds to open the second tab. 6 seconds to reopen the menu. 36 seconds to open the second page. 14 seconds to come back to first tab. This is unbelievable! The system is completely trashed, with more than 1GB RAM, whatever the hardware configuration is used. Of course, I investigated the problem... First, DMA is OK. Second, I thought cache would make memory swapped. So I used echo 0 swapiness. Then (of course, the system was not swapping at all), I thought TEXT sections from software discarded (that would be simply stupid, but who knows?). I then tried to make the writing process throttled with dirty_background_ratio (say 10%) while reserving a greater RAM portion for the rest of the system with dirty_ratio (say 70%). No way. Then I launched top, and looked at the WCHAN to see what was the problem for the frozen process (ie: konqueror). The I saw the faulty guy: log_wait_commit! So I concluded there is unfair access to the filesystem journal. So I tried other journaling options than the default ordered data mode. The results were really different: 5s, 2s, 4s, etc., both with journal and write back mode! I therefore think there is a great lock and even maybe a priority inversion in log_wait_commit of the ext3 filesystem. I think that, even if it is throttled, the writing process always get access to the journal in ordered mode, simply because it writes many pages at a time and because the ordered mode indeed implies... ordering of requests (as I understand it). It's sad this is the default option that gives the worst interactivity problems. Indeed, this messes all previous work done to enhance desktop experience I think, too bad! Btw, I've also seen on Internet that some people reported that journal data mode gives better performance. I think the problem was indeed related to latency rather than performance (timing the writing process effectively shows a output rate halved with journal data mode, and twice the time to process). Steps to reproduce: I did a simple script: #!/bin/bash SRC1=src1.bin SRC2=src2.bin DEST_DIR=tmpdir DST1=dst.bin # First, create the source files: if [ ! -e $SRC1 ] ; then dd if=/dev/zero of=$SRC1 bs=10k count=15 fi if [ ! -e $SRC2 ] ; then dd if=/dev/zero of=$SRC2 bs=10k count=15 fi mkdir $DEST_DIR /dev/null 21 sync # Do the test: echo Trashing the system... rm $DEST_DIR/$DST1 /dev/null 21 cp $SRC1 $DEST_DIR/$DST1 cat $SRC2 $DEST_DIR/$DST1 echo Done! #rm -rf $DEST_DIR $SRC1 $SRC2 While running it, try to use normally the interactive programs, such as konqueror (the program should have to access files, such as cookies, cache and so for konqueror). Then remount/tune the filesystem to use another data mode for ext3. I didn't try with other journaling filesystems. I guess ext2 also doesn't exhibit the problem. Interesting that data=writeback helped this. You don't
Re: 2.6.24-rc4-mm1: some issues on sparc64
On Sun, 09 Dec 2007 00:45:17 -0800 (PST) David Miller [EMAIL PROTECTED] wrote: From: Andrew Morton [EMAIL PROTECTED] Date: Sat, 8 Dec 2007 10:22:39 -0800 That's J_ASSERT_BH(bh, !buffer_jbddirty(bh)); at the end of journal_unmap_buffer(). I don't recall seeing that before and I can't think of anything we've done recently which could cause it, sorry. If the per-cpu data patches are in the -mm tree that is the first place I would start looking at for possible cause. They aren't. The dust hadn't settled enough on those when Christoph shot through on vacation. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc4-mm1: some issues on sparc64
On Sat, 8 Dec 2007 19:20:28 +0100 Mariusz Kozlowski [EMAIL PROTECTED] wrote: The box is sun ultra 60 (dual sparc64). This was caught when system (gentoo) was emerging some package. [27006.402237] kernel BUG at fs/jbd/transaction.c:1894! That's J_ASSERT_BH(bh, !buffer_jbddirty(bh)); at the end of journal_unmap_buffer(). I don't recall seeing that before and I can't think of anything we've done recently which could cause it, sorry. [27006.402268] \|/ \|/ [27006.402274] @'/ .. \`@ [27006.402279] /_| \__/ |_\ [27006.402285] \__U_/ x86 needs that. [27006.402298] rm(4713): Kernel bad sw trap 5 [#1] [27006.402538] TSTATE: 009911009605 TPC: 0053b1cc TNPC: 0053b1d0 Y: Not tainted [27006.402579] TPC: journal_invalidatepage+0x3d4/0x460 [27006.402593] g0: 0002 g1: g2: 0001 g3: f800a7d9 [27006.402610] g4: f800b54ea460 g5: f8007f832000 g6: f800a7d9 g7: 0076d868 [27006.402627] o0: 0072b660 o1: 0766 o2: 0002 o3: 0001 [27006.402644] o4: 008a2940 o5: sp: f800a7d92c91 ret_pc: 0053b1c4 [27006.402665] RPC: journal_invalidatepage+0x3cc/0x460 [27006.402679] l0: f800afbf4070 l1: 0069511c l2: 2000 l3: [27006.402696] l4: 0001 l5: f800ba4cb730 l6: f800bf1cd338 l7: 0001 [27006.402713] i0: f800bf1cd000 i1: 000201db2708 i2: i3: 00727000 [27006.402730] i4: 0020 i5: f800bf1cd028 i6: f800a7d92d51 i7: 00529254 [27006.402763] I7: ext3_invalidatepage+0x3c/0x60 [27006.402776] Caller[00529254]: ext3_invalidatepage+0x3c/0x60 [27006.402800] Caller[004b22fc]: do_invalidatepage+0x24/0x60 [27006.402826] Caller[004b29c4]: truncate_complete_page+0x6c/0x80 [27006.402849] Caller[004b2a6c]: truncate_inode_pages_range+0x94/0x440 [27006.402872] Caller[004b2e2c]: truncate_inode_pages+0x14/0x20 [27006.402894] Caller[00529888]: ext3_delete_inode+0x10/0x160 [27006.402918] Caller[004e7ca0]: generic_delete_inode+0x88/0x120 [27006.402949] Caller[004e7e60]: generic_drop_inode+0x128/0x1c0 [27006.402971] Caller[004e75d4]: iput+0x7c/0xa0 [27006.402992] Caller[004dd680]: do_unlinkat+0x108/0x1a0 [27006.403024] Caller[004dd884]: sys_unlinkat+0x2c/0x60 [27006.403047] Caller[004062d4]: linux_sparc_syscall32+0x3c/0x40 [27006.403081] Caller[f7e7d0ec]: 0xf7e7d0f4 [27006.403102] Instruction DUMP: 92102766 7ffbbeaf 90122260 91d02005 92102780 7ffbbeab 90122260 91d02005 7ffbbea8 - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 9483] circular locking dependency detected
On Tue, 4 Dec 2007 22:25:18 +0100 Ingo Molnar [EMAIL PROTECTED] wrote: * Aneesh Kumar K.V [EMAIL PROTECTED] wrote: === [ INFO: possible circular locking dependency detected ] 2.6.24-rc3 #6 --- bash/2294 is trying to acquire lock: (journal-j_list_lock){--..}, at: [c01eee2f] journal_try_to_free_buffers+0x76/0x10c but task is already holding lock: (inode_lock){--..}, at: [c01864b6] drop_pagecache+0x48/0xd8 which lock already depends on the new lock. Andrew, drop_pagecache() is root-only and it has some known deadlock, right? yup. It takes inode_lock at too high a level so it can walk the per-sb inode lists. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] Add buffer head related helper functions
On Thu, 15 Nov 2007 17:40:43 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote: Add buffer head related helper function bh_uptodate_or_lock and bh_submit_read which can be used by file system The patches look sane. --- include/linux/buffer_head.h | 29 + 1 files changed, 29 insertions(+), 0 deletions(-) diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index da0d83f..82cc9ef 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -327,6 +327,35 @@ static inline void lock_buffer(struct buffer_head *bh) extern int __set_page_dirty_buffers(struct page *page); +/* Return true if the buffer is up-to-date. + * Return false, with the buffer locked, if not. + */ +static inline int bh_uptodate_or_lock(struct buffer_head *bh) +{ + if (!buffer_uptodate(bh)) { + lock_buffer(bh); + if (!buffer_uptodate(bh)) + return 0; + unlock_buffer(bh); + } + return 1; +} +/* + * Submit a locked buffer for reading, + * return a negative error and release + * the buffer if failed. + */ +static inline int bh_submit_read(struct buffer_head *bh) +{ + get_bh(bh); + bh-b_end_io = end_buffer_read_sync; + submit_bh(READ, bh); + wait_on_buffer(bh); + if (buffer_uptodate(bh)) + return 0; + brelse(bh); + return -EIO; +} #else /* CONFIG_BLOCK */ But these are waay too big to be inlined. Could I ask that they be turned into regular EXPORT_SYMBOL()ed functions in buffer.c? Might as well turn the comments into regular kerneldoc format, too. The function names are a bit awkward, but I can't think of anything better. I'm surprised that we don't already have a function which does what bh_submit_read() does. bh_submit_read() might become a bit simpler if it called ll_rw_block(). I'd have thought that to be more general, bh_submit_read() should return immediately if the buffer is uptodate. ll_rw_block() sort of helps there. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext3,4:fdatasync should skip metadata writeout
On Fri, 16 Nov 2007 11:47:27 +0900 Hisashi Hifumi [EMAIL PROTECTED] wrote: Currently fdatasync is identical to fsync in ext3,4. I think fdatasync should skip journal flush in data=ordered and data=writeback mode because this syscall is not required to synchronize the metadata. I suppose so. Although one wonders what earthly point there is in syncing a file's data if we haven't yet written out the metadata which is required for locating that data. IOW, fdatasync() is only useful if the application knows that it is overwriting already-instantiated blocks. In which case it might as well have used fsync(). For ext2-style filesystems, anyway. hm. It needs some thought. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext3,4:fdatasync should skip metadata writeout
On Thu, 15 Nov 2007 22:47:40 -0500 Wendy Cheng [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Fri, 16 Nov 2007 11:47:27 +0900 Hisashi Hifumi [EMAIL PROTECTED] wrote: Currently fdatasync is identical to fsync in ext3,4. I think fdatasync should skip journal flush in data=ordered and data=writeback mode because this syscall is not required to synchronize the metadata. I suppose so. Although one wonders what earthly point there is in syncing a file's data if we haven't yet written out the metadata which is required for locating that data. IOW, fdatasync() is only useful if the application knows that it is overwriting already-instantiated blocks. In which case it might as well have used fsync(). For ext2-style filesystems, anyway. hm. It needs some thought. There are non-trivial amount of performance critical programs, particularly in financial application segment ported from legacy UNIX platforms, know the difference between fsync() and fdatasync(). Those can certainly take advantages of this separation. Don't underestimate the talents of these application programmers. If they're that good, they'll be using sync_file_range() ;) - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Introduce ext4_find_next_bit
On Wed, 14 Nov 2007 00:41:03 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Fri, 21 Sep 2007 10:55:05 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote: Also add generic_find_next_le_bit This gets used by the ext4 multi block allocator patches. arm allmodconfig: fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy': fs/ext4/mballoc.c:836: error: implicit declaration of function `ext2_find_next_bit' This patch makes my head spin. Why did we declare generic_find_next_le_bit() in include/asm-powerpc/bitops.h (wrong) as well as in include/asm-generic/bitops/le.h (presumably correct)? I was following the coding style used for rest of the APIs like ext4_set_bit. Well. There's quite a bit of cruft in there. If you do come across something which isn't right, please do try to find the time to fix it up first. That might be non-trivial - powerpc does seem to have gone off on a strange tangent there. Why is it touching a powerpc file and no any other architectures? Something screwed up in powerpc land? And why did arm break? arm and below list of arch doesn't include the asm-generic/bitops/ext2-non-atomic.h I did a grep and that list the below architectures as also affected. arm, m68k, m68knommu, s390 Shudder. Anyway, please fix, and if that fix requires that various braindamaged be repaired, please repair the braindamage rather than going along with it. That should be a separate patch altogether. I wanted to do the cleanup along with the usages such as but never got time to do the same. #define ocfs2_set_bit ext2_set_bit #define udf_set_bit(nr,addr) ext2_set_bit(nr,addr) direct usage in mb md/bitmap.c +799 md/dm-log.c +177 I will send a patch tomorrow that fix arm and other architectures. I guess the cleanup can be a separate patch ? Yes, that's a separate work, thanks. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Introduce ext4_find_next_bit
On Fri, 21 Sep 2007 10:55:05 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote: Also add generic_find_next_le_bit This gets used by the ext4 multi block allocator patches. arm allmodconfig: fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy': fs/ext4/mballoc.c:836: error: implicit declaration of function `ext2_find_next_bit' This patch makes my head spin. Why did we declare generic_find_next_le_bit() in include/asm-powerpc/bitops.h (wrong) as well as in include/asm-generic/bitops/le.h (presumably correct)? Why is it touching a powerpc file and no any other architectures? Something screwed up in powerpc land? And why did arm break? Shudder. Anyway, please fix, and if that fix requires that various braindamaged be repaired, please repair the braindamage rather than going along with it. Thanks. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 9329] New: ext4: delalloc space accounting problem drops data
On Thu, 8 Nov 2007 09:42:10 -0800 (PST) [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9329 Summary: ext4: delalloc space accounting problem drops data Product: File System Version: 2.5 KernelVersion: 2.6.24-rc1 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: ext4 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] 2.6.24-rc1 + ext4 git patch queue from last week or so. It appears that delalloc does not track used space properly, and fails to return ENOSPC as appropriate: [EMAIL PROTECTED] ~]# mkfs.ext3 -I 256 /dev/sdb7 32768 [EMAIL PROTECTED] ~]# mount -t ext4dev -o data=writeback,delalloc,extents,mballoc /dev/sdb7 /mnt/test [EMAIL PROTECTED] ~]# df -h /mnt/test FilesystemSize Used Avail Use% Mounted on /dev/sdb7 30M 4.5M 24M 16% /mnt/test [EMAIL PROTECTED] ~]# du -h /tmp/1Mfile 1.1M/tmp/1Mfile [EMAIL PROTECTED] ~]# for I in `seq 1 50`; do cp /tmp/1Mfile /mnt/test/1Mfile-$I; done [EMAIL PROTECTED] ~]# df -h /mnt/test FilesystemSize Used Avail Use% Mounted on /dev/sdb7 30M 30M 0 100% /mnt/test all resulting files are 1M in length: [EMAIL PROTECTED] ~]# ls -l /mnt/test/1M* | grep -v 1048576 [EMAIL PROTECTED] ~]# ls -l /mnt/test/1M* | grep 1048576 | wc -l 50 but many of them have silently dropped data on the floor: [EMAIL PROTECTED] ~]# du -hc /mnt/test/1Mfile-* | grep -v 1.0M 596K/mnt/test/1Mfile-26 0 /mnt/test/1Mfile-27 0 /mnt/test/1Mfile-28 0 /mnt/test/1Mfile-29 0 /mnt/test/1Mfile-30 snip When mounted with nodelalloc, I get proper behavior: [EMAIL PROTECTED] ~]# for I in `seq 1 50`; do cp /tmp/1Mfile /mnt/test/1Mfile-$I; done cp: writing `/mnt/test/1Mfile-26': No space left on device cp: writing `/mnt/test/1Mfile-27': No space left on device cp: writing `/mnt/test/1Mfile-28': No space left on device cp: writing `/mnt/test/1Mfile-29': No space left on device snip -- Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are on the CC list for the bug, or are watching someone who is. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] ext4 update
On Wed, 17 Oct 2007 08:59:53 -0700 (PDT) Linus Torvalds [EMAIL PROTECTED] wrote: On Wed, 17 Oct 2007, Theodore Ts'o wrote: Please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for_linus This conflicts in nontrivial ways with commit 7c9e69faa28027913ee059c285a5ea8382e24b5d Author: Aneesh Kumar K.V [EMAIL PROTECTED] Date: Tue Oct 16 23:27:02 2007 -0700 ext2/ext3/ext4: add block bitmap validation which I just merged from -mm. catching up There shouldn't have been conflicts here - if there were I wouldn't have sent those patches. Unless there were things in the ext4 pull which weren't present in the ext4 quilt tree which I included in 2.6.23-mm1? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] ext4 update
On Thu, 25 Oct 2007 16:44:21 -0700 (PDT) Linus Torvalds [EMAIL PROTECTED] wrote: On Thu, 25 Oct 2007, Andrew Morton wrote: There shouldn't have been conflicts here - if there were I wouldn't have sent those patches. Unless there were things in the ext4 pull which weren't present in the ext4 quilt tree which I included in 2.6.23-mm1? Well, you merge your patch-series by patching. You should have noticed by now that GNU patch in particular will happily apply a patch whether it conflicts or not. So it's entirely possible that it didn't conflict for you, but applied cleanly and sanely. hrm, could be. It would be strange for that to happen quietly with fuzz=1 and to still produce a compileable result. But there weren't any patches in this git-merge which weren't in 2.6.23-mm1 so maybe something like that happened. Or maybe that fact that this pull only contained _some_ of the ext4 patches which were in -mm somehow affected things. Oh well, I should have sent the ext4 changes via Ted anyway. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -mm] Split fs/Kconfig: ext[234]
On Sat, 6 Oct 2007 12:15:08 +0400 Alexey Dobriyan [EMAIL PROTECTED] wrote: Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED] --- fs/Kconfig | 191 -- fs/ext2/Kconfig | 55 + fs/ext3/Kconfig | 67 fs/ext4/Kconfig | 65 +++ 4 files changed, 190 insertions(+), 188 deletions(-) A reasonable thing to do, but poorly timed. I'd prefer not to carry a patch like this through two months of development, please. Late in the -rc timeframe would be a better time. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu, 18 Oct 2007 02:03:39 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote: On Wed, 17 Oct 2007, Andrew Morton wrote: b) what happens when an old ext2 driver tries to read and/or write this directory entry? Do we need a compat flag for it? Old ext2 only supports up to 4k include/linux/ext2_fs.h: #define EXT2_MIN_BLOCK_SIZE 1024 #define EXT2_MAX_BLOCK_SIZE 4096 #define EXT2_MIN_BLOCK_LOG_SIZE 10 Should fail to mount the volume since the block size is too large. should, but does it? box:/usr/src/25 grep MAX_BLOCK_SIZE fs/ext2/*.[ch] include/linux/ext2* include/linux/ext2_fs.h:#define EXT2_MAX_BLOCK_SIZE 4096 box:/usr/src/25 - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara [EMAIL PROTECTED] wrote: +static inline __le16 ext2_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) + return cpu_to_le16(EXT2_MAX_REC_LEN); + else if (len (1 16)) + BUG(); + return cpu_to_le16(len); +} Of course, ext2 shouldn't be trying to write a bad record length into a directory entry. But are we sure that there is no way in which this situation could occur is the on-disk data was _already_ bad? Because it is very bad for a fileysstem to go BUG in response to unexpected data on the disk. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara [EMAIL PROTECTED] wrote: With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. btw, this changes ext2's on-disk format. a) is the ext2 format documented anywhere? If so, that document will need updating. b) what happens when an old ext2 driver tries to read and/or write this directory entry? Do we need a compat flag for it? c) what happens when old and new ext3 or ext4 try to read/write this directory entry? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext2 statfs improvement for block and inode free count
On Fri, 13 Jul 2007 18:36:54 -0700 Badari Pulavarty [EMAIL PROTECTED] wrote: More statfs() improvements for ext2. ext2 already maintains percpu counters for free blocks and inodes. Derive free block count and inode count by summing up percpu counters, instead of counting up all the groups in the filesystem each time. Signed-off-by: Badari Pulavarty [EMAIL PROTECTED] Acked-by: Andreas Dilger [EMAIL PROTECTED] fs/ext2/super.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.22/fs/ext2/super.c === --- linux-2.6.22.orig/fs/ext2/super.c 2007-07-13 20:06:38.0 -0700 +++ linux-2.6.22/fs/ext2/super.c 2007-07-13 20:06:51.0 -0700 @@ -1136,12 +1136,12 @@ static int ext2_statfs (struct dentry * buf-f_type = EXT2_SUPER_MAGIC; buf-f_bsize = sb-s_blocksize; buf-f_blocks = le32_to_cpu(es-s_blocks_count) - overhead; - buf-f_bfree = ext2_count_free_blocks(sb); + buf-f_bfree = percpu_counter_sum(sbi-s_freeblocks_counter); buf-f_bavail = buf-f_bfree - le32_to_cpu(es-s_r_blocks_count); if (buf-f_bfree le32_to_cpu(es-s_r_blocks_count)) buf-f_bavail = 0; buf-f_files = le32_to_cpu(es-s_inodes_count); - buf-f_ffree = ext2_count_free_inodes(sb); + buf-f_ffree = percpu_counter_sum(sbi-s_freeinodes_counter); buf-f_namelen = EXT2_NAME_LEN; fsid = le64_to_cpup((void *)es-s_uuid) ^ le64_to_cpup((void *)es-s_uuid + sizeof(u64)); Guys, I have this patch in a stalled state pending some convincing demonstration/argument that it's actually a worthwhile change. How much hurt will it cause on large-numa, and how much benefit do we get for that hurt? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Mon, 01 Oct 2007 17:35:46 -0700 Mingming Cao [EMAIL PROTECTED] wrote: ext2: Avoid rec_len overflow with 64KB block size From: Jan Kara [EMAIL PROTECTED] With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. This patch clashes in non-trivial ways with ext2-convert-to-new-aops-fix.patch and perhaps other things which are already queued for 2.6.24 inclusion, so I'll need to ask for an updated patch, please. Also, I'm planing on merging the ext2 reservations code into 2.6.24, so if we're aiming for complete support of 64k blocksize in 2.6.24's ext2, additional testing and checking will be needed. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu, 4 Oct 2007 16:40:44 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: On Oct 04, 2007 13:12 -0700, Andrew Morton wrote: On Mon, 01 Oct 2007 17:35:46 -0700 ext2: Avoid rec_len overflow with 64KB block size into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. This patch clashes in non-trivial ways with ext2-convert-to-new-aops-fix.patch and perhaps other things which are already queued for 2.6.24 inclusion, so I'll need to ask for an updated patch, please. If the rel_len overflow patch isn't going to make it, then we also need to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible to allow this to be up to 32768 w/o the rec_len overflow fix however. Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and ext2-support-large-blocksize-up-to-pagesize.patch. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc
On Fri, 21 Sep 2007 16:13:56 -0700 Mingming Cao [EMAIL PROTECTED] wrote: Convert kmalloc to kzalloc() and get rid of the memset(). I split this into separate ext3/jbd and ext4/jbd2 patches. It's generally better to raise separate patches, please - the ext3 patches I'll merge directly but the ext4 patches should go through (and be against) the ext4 devel tree. I fixed lots of rejects against the already-pending changes to these filesystems. You forgot to remove the memsets in both start_this_handle()s. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jbd : config_jbd_debug cannot create /proc entry
On Tue, 25 Sep 2007 16:36:08 +0200 Jan Kara [EMAIL PROTECTED] wrote: On Tue, 25 Sep 2007 07:49:38 -0500 Jose R. Santos [EMAIL PROTECTED] wrote: On Tue, 25 Sep 2007 13:50:46 +0200 Jan Kara [EMAIL PROTECTED] wrote: Jan Kara wrote: -#define create_jbd_proc_entry() do {} while (0) -#define remove_jbd_proc_entry() do {} while (0) +static ctl_table fs_table[] = { +{ +.ctl_name = -1, /* Don't want it */ shouldn't this be CTL_UNNUMBERED ? Oh, it should be. I didn't notice we have this :) Thanks for notifying me. Attached is a fixed version. This was fixed in JBD2 by moving the jbd-debug file to debugfs: http://lkml.org/lkml/2007/7/11/334 Since this code is already in the kernel, we should keep it consistent. OK. Here's a quick patch to fix this. Adapted from the JBD2 patch. Let me know what you think. Looks fine - exactly what I've just done here :). hm. I found rather a lot of issues. If this patch is derived from the JBD2 patch then perhaps the JBD2 patch needs some looking at. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] You can add Signed-off-by: Jan Kara [EMAIL PROTECTED] I suspect you might be getting your signed-off-bys and acked-bys mixed up. (If not this patch, then the previous one). Please see Documentation/SubmittingPatches section 13 for the difference. Jose, please review and if possible runtime test these proposed changes? From: Andrew Morton [EMAIL PROTECTED] - use `#ifdef foo' instead of `#if defined(foo)' - CONFIG_JBD_DEBUG depends on CONFIG_DEBUG_FS so we don't need to duplicate that logic in the .c file ifdefs - Make journal_enable_debug __read_mostly just for the heck of it - Make jbd_debugfs_dir and jbd_debug static - debugfs_remove(NULL) is legal: remove unneeded tests - jbd_create_debugfs_entry is a better name than create_jbd_debugfs_entry - ditto remove_jbd_debugfs_entry - C functions are preferred over macros Cc: Jose R. Santos [EMAIL PROTECTED] Cc: linux-ext4@vger.kernel.org Cc: Jan Kara [EMAIL PROTECTED] Cc: Jose R. Santos [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- diff -puN fs/jbd/journal.c~jbd-config_jbd_debug-cannot-create-proc-entry-fix fs/jbd/journal.c --- a/fs/jbd/journal.c~jbd-config_jbd_debug-cannot-create-proc-entry-fix +++ a/fs/jbd/journal.c @@ -1853,16 +1853,15 @@ void journal_put_journal_head(struct jou /* * debugfs tunables */ -#if defined(CONFIG_JBD_DEBUG) -u8 journal_enable_debug; -EXPORT_SYMBOL(journal_enable_debug); -#endif +#ifdef CONFIG_JBD_DEBUG -#if defined(CONFIG_JBD_DEBUG) defined(CONFIG_DEBUG_FS) +u8 journal_enable_debug __read_mostly; +EXPORT_SYMBOL(journal_enable_debug); -struct dentry *jbd_debugfs_dir, *jbd_debug; +static struct dentry *jbd_debugfs_dir; +static struct dentry *jbd_debug; -static void __init create_jbd_debugfs_entry(void) +static void __init jbd_create_debugfs_entry(void) { jbd_debugfs_dir = debugfs_create_dir(jbd, NULL); if (jbd_debugfs_dir) @@ -1871,18 +1870,21 @@ static void __init create_jbd_debugfs_en journal_enable_debug); } -static void __exit remove_jbd_debugfs_entry(void) +static void __exit jbd_remove_debugfs_entry(void) { - if (jbd_debug) - debugfs_remove(jbd_debug); - if (jbd_debugfs_dir) - debugfs_remove(jbd_debugfs_dir); + debugfs_remove(jbd_debug); + debugfs_remove(jbd_debugfs_dir); } #else -#define create_jbd_debugfs_entry() do {} while (0) -#define remove_jbd_debugfs_entry() do {} while (0) +static inline void jbd_create_debugfs_entry(void) +{ +} + +static inline void jbd_remove_debugfs_entry(void) +{ +} #endif @@ -1940,7 +1942,7 @@ static int __init journal_init(void) ret = journal_init_caches(); if (ret != 0) journal_destroy_caches(); - create_jbd_debugfs_entry(); + jbd_create_debugfs_entry(); return ret; } @@ -1951,7 +1953,7 @@ static void __exit journal_exit(void) if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - remove_jbd_debugfs_entry(); + jbd_remove_debugfs_entry(); journal_destroy_caches(); } _ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Ext4: Uninitialized Block Groups
On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur [EMAIL PROTECTED] wrote: In pass1 of e2fsck, every inode table in the fileystem is scanned and checked, regardless of whether it is in use. This is this the most time consuming part of the filesystem check. The unintialized block group feature can greatly reduce e2fsck time by eliminating checking of uninitialized inodes. With this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation. This needed a few fixups due to conflicts with ext2-ext3-ext4-add-block-bitmap-validation.patch but they were pretty straightforward. Please check that the result is OK. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD: use GFP_NOFS in kmalloc
On Wed, 19 Sep 2007 12:22:09 -0700 Mingming Cao [EMAIL PROTECTED] wrote: Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent with the rest of kmalloc flag used in the JBD/JBD2 layer. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/journal.c |6 +++--- fs/jbd/revoke.c |8 fs/jbd2/journal.c |6 +++--- fs/jbd2/revoke.c |8 4 files changed, 14 insertions(+), 14 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c2007-09-19 11:51:10.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700 @@ -653,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_NOFS); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc journal-j_blocksize = blocksize; n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i /* journal descriptor can store up to n blocks -bzzz */ n = journal-j_blocksize / sizeof(journal_block_tag_t); journal-j_wbufsize = n; - journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS); if (!journal-j_wbuf) { printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, __FUNCTION__); Index: linux-2.6.23-rc6/fs/jbd/revoke.c === --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.0 -0700 @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ while((tmp = 1UL) != 0UL) shift++; - journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[0]) return -ENOMEM; journal-j_revoke = journal-j_revoke_table[0]; @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); journal-j_revoke = NULL; @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ for (tmp = 0; tmp hash_size; tmp++) INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]); - journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS); if (!journal-j_revoke_table[1]) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ journal-j_revoke-hash_shift = shift; journal-j_revoke-hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS); if (!journal-j_revoke-hash_table) { kfree(journal-j_revoke_table[0]-hash_table); kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]); These were all OK using GFP_KERNEL. GFP_NOFS should only be used when the caller is holding some fs locks which might cause a deadlock if that caller reentered the fs in -writepage (and maybe put_inode and such). That isn't the case in any of the above code, which is all mount time stuff (I think). ext3/4 should be using GFP_NOFS when the caller has a transaction open, has a page locked, is holding i_mutex, etc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao [EMAIL PROTECTED] wrote: JBD: Replace slab allocations with page cache allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly __GFP_NOFAIL should only be used when we have no way of recovering from failure. The allocation in journal_init_common() (at least) _can_ recover and hence really shouldn't be using __GFP_NOFAIL. (Actually, nothing in the kernel should be using __GFP_NOFAIL. It is there as a marker which says we really shouldn't be doing this but we don't know how to fix it). So sometime it'd be good if you could review all the __GFP_NOFAILs in there and see if we can remove some, thanks. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Ext4: Uninitialized Block Groups
On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur [EMAIL PROTECTED] wrote: + +__u16 crc16(__u16 crc, __u8 const *buffer, size_t len) And is we really really have to do this, then the ext4-private crc16() should have static scope. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 17 Aug 2007 12:36:32 +0400 Alex Tomas [EMAIL PROTECTED] wrote: Andrew Morton wrote: Sort-of. But the per-superpblock, per-inode writeback code is pretty careful to avoid livelocks. The per-inode writeback is a strict single linear sweep across the file. It'll basically write out anything which was dirty when it was called. The per-superblock inode walk isn't as accurate as that, becuase of the difficulties of juggling list_heads. But we're slowly working on that, and I suspect it'll be ggod enough for ext3 purposes already. I'd say that these are two different mechanism solving different problems: 1) VFS/MM does periodic updates and uses regular writeback 2) data=ordered is to avoid metadata pointing to not-written-yet data VFS/MM can do _much_ more than that! Look at struct writeback_control. That code path has many different modes of operation: it is used for regular pdflush writeback, sync, fsync, throttling, etc. Probably one of its modes will be sufficient. If we want to change ext3's existing semantics and add an only writeback uninitialised blocks mode then that'll be pretty straightforward: add more control information to writeback_control and go for it. we can't use regular writeback in commit thread as long as it can fall into allocation. so, we'd have to add one more WB mode (btw, i have a patch which skips non-allocated blocks in writeback if special WB mode is requested). yup OTOH, the faster we go through data sync part of commit, the better. given that lots of inodes can be dirty with no data to sync, it's going to take long in some cases. it's especially bad because commit doesn't scale to many CPUs. eh? also, why would we need to flush *everything* every 5s? just because ext3 does this? sounds strange. if somebody really need this we could add this possibility to regular writeback path (making it tunable). but I'd rather prefer to have a separate (fast, lightweight, scalable) mechanism to support data=ordered. Yeah, that would make sense, perhaps. Or just speed the existing stuff up. iirc the main problem in there is unrelated to data writeback. There are situations where the running transaction has to block behind metadata writeout which the committing transaction is performing. I reluctantly put that in years ago to get us out of a tight spot and it never got optimised. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)
On Wed, 8 Aug 2007 08:54:35 -0400 Jeff Layton [EMAIL PROTECTED] wrote: On Tue, 7 Aug 2007 17:15:01 -0700 Andrew Morton [EMAIL PROTECTED] wrote: On Mon, 6 Aug 2007 09:54:03 -0400 Jeff Layton [EMAIL PROTECTED] wrote: Is there any way in which we can prevent these problems? Say - rename something so that unconverted filesystems will reliably fail to compile? I suppose we could rename the .setattr inode operation to something else, but then we'll be stuck with it for at least a while. That seems sort of kludgey too... Sure. We're changing the required behaviour of .setattr. Changing its name is a fine and reasonably reliable way to communicate that fact. - leave existing filesystems alone, but add a new inode_operations.setattr_jeff, which the networked filesytems can implement, and teach core vfs to call setattr_jeff in preference to setattr? Something else? There's also the approach suggested by Miklos: Add a new inode flag that tells notify_change not to convert ATTR_KILL_S* flags into a mode change. Basically, allow filesystems to opt out of that behavior. I'd definitly pick that over a new inode op. That would also allow the default case be for the VFS to continue handling these flags. Everything would continue to work but filesystems that need to handle these flags differently would be able to do so. We should opt for whatever produces the best end state in the kernel tree. ie: if it takes more work and a larger patch to create a better result, let's go for the better result. We merge large patches all the time. We prefer to smash through, get it right whatever the transient cost. But quietly making out-of-tree filesystems less secure is a pretty high cost. I'm suspecting that adding more flags and some code to test them purely to minimise the size of the patch and to retain compatibility with the old .setattr is not a good tradeoff, given that we'd carry the flags and tests for evermore. So I'd suggest s/setattr/something_else/g. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)
On Mon, 6 Aug 2007 09:54:03 -0400 Jeff Layton [EMAIL PROTECTED] wrote: Apologies for the resend, but the original sending had the date in the email header and it caused some of these to bounce... ( Please consider trimming the Cc list if discussing some aspect of this that doesn't concern everyone.) When an unprivileged process attempts to modify a file that has the setuid or setgid bits set, the VFS will attempt to clear these bits. The VFS will set the ATTR_KILL_SUID or ATTR_KILL_SGID bits in the ia_valid mask, and then call notify_change to clear these bits and set the mode accordingly. With a networked filesystem (NFS in particular but most likely others), the client machine may not have credentials that allow for the clearing of these bits. In some situations, this can lead to file corruption, or to an operation failing outright because the setattr fails. In this situation, we'd like to just leave the handling of this to the server and ignore these bits. The problem is that by the time nfs_setattr is called, the VFS has already reinterpreted the ATTR_KILL_* bits into a mode change. We can't fix this in the filesystems where this is a problem, as doing so would leave us having to second-guess what the VFS wants us to do. So we need to change it so that filesystems have more flexibility in how to interpret the ATTR_KILL_* bits. The first patch in the following patchset moves this logic into a helper function, and then only calls this helper function for inodes that do not have a setattr operation defined. The subsequent patches fix up individual filesystem setattr functions to call this helper function. The upshot of this is that with this change, filesystems that define a setattr inode operation are now responsible for handling the ATTR_KILL bits as well. They can trivially do so by calling the helper, but they must do so. Some of the follow-on patches may not be strictly necessary, but I decided that it was better to take the conservative approach and call the helper when I wasn't sure. I've tried to CC the maintainers for the individual filesystems as well where I could find them, please let me know if there are others who should be informed. Comments and suggestions appreciated... From a purely practical standpoint: it's a concern that all filesytems need patching to continue to correctly function after this change. There might be filesystems which you missed, and there are out-of-tree filesystems which won't be updated. And I think the impact upon the out-of-tree filesystems would be fairly severe: they quietly and subtly get their secutiry guarantees broken (I think?) Is there any way in which we can prevent these problems? Say - rename something so that unconverted filesystems will reliably fail to compile? - leave existing filesystems alone, but add a new inode_operations.setattr_jeff, which the networked filesytems can implement, and teach core vfs to call setattr_jeff in preference to setattr? Something else? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)
On Tue, 07 Aug 2007 20:45:34 -0400 Trond Myklebust [EMAIL PROTECTED] wrote: - rename something so that unconverted filesystems will reliably fail to compile? - leave existing filesystems alone, but add a new inode_operations.setattr_jeff, which the networked filesytems can implement, and teach core vfs to call setattr_jeff in preference to setattr? If you really need to know that the filesystem is handling the flags, then how about instead having -setattr() return something which indicates which flags it actually handled? That is likely to be a far more intrusive change, but it is one which is future-proof. If we change -setattr so that it will return a positive, non-zero value which the caller can then check and reliably do printk(that filesystem needs updating) then that addresses my concern, sure. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] ext2: show all mount options
On Mon, 23 Jul 2007 22:12:54 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote: From: Miklos Szeredi [EMAIL PROTECTED] Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] Could we have changelogs for these patches, please? ie: what's wrong with the existing code, what change does this patch make? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] readahead drop behind and size adjustment
On Tue, 24 Jul 2007 08:47:28 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: Subject: convert ill defined log2() to ilog2() It's *wrong* to have #define log2(n) ffz(~(n)) It should be *reversed*: #define log2(n) flz(~(n)) or #define log2(n) fls(n) or just use ilog2(n) defined in linux/log2.h. This patch follows the last solution, recommended by Andrew Morton. //Or are they simply the wrong naming, and is in fact correct in behavior? Cc: linux-ext4@vger.kernel.org Cc: Mingming Cao [EMAIL PROTECTED] Cc: Bjorn Helgaas [EMAIL PROTECTED] Cc: Chris Ahna [EMAIL PROTECTED] Cc: David Mosberger-Tang [EMAIL PROTECTED] Cc: Kyle McMartin [EMAIL PROTECTED] Signed-off-by: Fengguang Wu [EMAIL PROTECTED] --- drivers/char/agp/hp-agp.c |9 +++-- drivers/char/agp/i460-agp.c |5 ++--- drivers/char/agp/parisc-agp.c |7 ++- fs/ext4/super.c |6 ++ 4 files changed, 9 insertions(+), 18 deletions(-) hm, yes, there is a risk that the code was accidentally correct. Or the code has only ever dealt with power-of-2 inputs, in which case it happens to work either way. David(s) and ext4-people: could we please have a close review of these changes? Thanks. --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c +++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c @@ -14,15 +14,12 @@ #include linux/pci.h #include linux/init.h #include linux/agp_backend.h +#include linux/log2.h #include asm/acpi-ext.h #include agp.h -#ifndef log2 -#define log2(x) ffz(~(x)) -#endif - #define HP_ZX1_IOC_OFFSET0x1000 /* ACPI reports SBA, we want IOC */ /* HP ZX1 IOC registers */ @@ -256,7 +253,7 @@ hp_zx1_configure (void) readl(hp-ioc_regs+HP_ZX1_IMASK); writel(hp-iova_base|1, hp-ioc_regs+HP_ZX1_IBASE); readl(hp-ioc_regs+HP_ZX1_IBASE); - writel(hp-iova_base|log2(HP_ZX1_IOVA_SIZE), hp-ioc_regs+HP_ZX1_PCOM); + writel(hp-iova_base|ilog2(HP_ZX1_IOVA_SIZE), hp-ioc_regs+HP_ZX1_PCOM); readl(hp-ioc_regs+HP_ZX1_PCOM); } @@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem) { struct _hp_private *hp = hp_private; - writeq(hp-gart_base | log2(hp-gart_size), hp-ioc_regs+HP_ZX1_PCOM); + writeq(hp-gart_base | ilog2(hp-gart_size), hp-ioc_regs+HP_ZX1_PCOM); readq(hp-ioc_regs+HP_ZX1_PCOM); } --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c +++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c @@ -13,6 +13,7 @@ #include linux/string.h #include linux/slab.h #include linux/agp_backend.h +#include linux/log2.h #include agp.h @@ -59,8 +60,6 @@ */ #define WR_FLUSH_GATT(index) RD_GATT(index) -#define log2(x) ffz(~(x)) - static struct { void *gatt; /* ioremap'd GATT area */ @@ -148,7 +147,7 @@ static int i460_fetch_size (void) * values[i].size. */ values[i].num_entries = (values[i].size 8) (I460_IO_PAGE_SHIFT - 12); - values[i].page_order = log2((sizeof(u32)*values[i].num_entries) PAGE_SHIFT); + values[i].page_order = ilog2((sizeof(u32)*values[i].num_entries) PAGE_SHIFT); } for (i = 0; i agp_bridge-driver-num_aperture_sizes; i++) { --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c +++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c @@ -18,6 +18,7 @@ #include linux/init.h #include linux/klist.h #include linux/agp_backend.h +#include linux/log2.h #include asm-parisc/parisc-device.h #include asm-parisc/ropes.h @@ -27,10 +28,6 @@ #define DRVNAME quicksilver #define DRVPFX DRVNAME : -#ifndef log2 -#define log2(x) ffz(~(x)) -#endif - #define AGP8X_MODE_BIT 3 #define AGP8X_MODE (1 AGP8X_MODE_BIT) @@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m { struct _parisc_agp_info *info = parisc_agp_info; - writeq(info-gart_base | log2(info-gart_size), info-ioc_regs+IOC_PCOM); + writeq(info-gart_base | ilog2(info-gart_size), info-ioc_regs+IOC_PCOM); readq(info-ioc_regs+IOC_PCOM); /* flush */ } --- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c +++ linux-2.6.22-rc6-mm1/fs/ext4/super.c @@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct sb-s_flags = s_flags; /* Restore MS_RDONLY status */ } -#define log2(n) ffz(~(n)) - /* * Maximal file size. There is a direct, and {,double-,triple-}indirect * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks. @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super sbi-s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb); sbi-s_sbh = bh; sbi-s_mount_state = le16_to_cpu(es-s_state); - sbi
Re: [PATCH] Faster ext2_clear_inode()
On Mon, 9 Jul 2007 22:00:03 +0200 Jörn Engel [EMAIL PROTECTED] wrote: On Mon, 9 July 2007 22:01:48 +0400, Alexey Dobriyan wrote: Yes. Note that ext2_clear_inode() is referenced from ext2_sops, so even empty, it leaves traces in resulting kernel. Is that your opinion or have you actually measured a difference? I strongly suspect that compilers are smart enough to optimize away a call to an empty static function. It saves a big 16 bytes of text here. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Sun, 15 Jul 2007 15:02:23 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: On Fri, 2007-07-13 at 14:47 -0700, Zach Brown wrote: Peter, do you have any interest in seeing how far we can get at tracking lock_page()? I'm not holding my breath, but any little bit would probably help. Would this be a valid report? ( /me goes hunt a x86_64 unwinder patch that will apply to this tree. These stacktraces are pain ) They are. lockdep reports are a pain too. It's still a struggle to understand wtf they're trying to tell you. Mabe it's just me. === [ INFO: possible circular locking dependency detected ] [ 2.6.22-rt3-dirty #34 --- mount/1296 is trying to acquire lock: (ei-truncate_mutex){--..}, at: [802f75e5] ext3_get_blocks_handle+0x1a4/0x8f7 but task is already holding lock: (lock_page_0){--..}, at: [80267107] generic_file_buffered_write+0x1ee/0x646 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (lock_page_0){--..}: [80251b26] __lock_acquire+0xa72/0xc35 [802520c9] lock_acquire+0x48/0x61 [80265e22] add_to_page_cache_lru+0xe/0x23 [80265d31] add_to_page_cache+0x1de/0x2c1 [80265e22] add_to_page_cache_lru+0xe/0x23 [80266985] find_or_create_page+0x4c/0x73 [802ae716] __getblk+0x118/0x23c [802afa91] __bread+0x6/0x9c [802f382d] read_block_bitmap+0x34/0x65 [802f3e1b] ext3_free_blocks_sb+0xec/0x3d4 [802f4131] ext3_free_blocks+0x2e/0x61 [802f82bc] ext3_free_data+0xaa/0xda [802f8976] ext3_truncate+0x4d2/0x84e [8026df5a] pagevec_lookup+0x17/0x1e [8026e7b1] truncate_inode_pages_range+0x1f4/0x323 [802614b4] add_preempt_count+0x14/0xe4 [80304d13] journal_stop+0x1fe/0x21d [8027661a] vmtruncate+0xa2/0xc0 [802a292b] inode_setattr+0x22/0x10a [802f9b51] ext3_setattr+0x136/0x18f [802a2b1d] notify_change+0x10a/0x241 [802a2b3b] notify_change+0x128/0x241 [8028e35e] do_truncate+0x56/0x7f [8028e369] do_truncate+0x61/0x7f [80296278] get_write_access+0x3f/0x45 [802973c7] may_open+0x193/0x1af [80299869] open_namei+0x2cb/0x63e [8025718b] rt_up_read+0x53/0x5c [8056da59] do_page_fault+0x479/0x7cc [8028dce1] do_filp_open+0x1c/0x38 [8056a4f9] rt_spin_unlock+0x17/0x47 [8028da05] get_unused_fd+0xf9/0x107 [8028dd45] do_sys_open+0x48/0xd5 [8020950e] system_call+0x7e/0x83 [] 0x I guess we're doing lock_page() against a blockdev pagecache page here while holding truncate_mutex against some S_ISREG file. - #0 (ei-truncate_mutex){--..}: [802503b9] print_circular_bug_header+0xcc/0xd3 [80251a22] __lock_acquire+0x96e/0xc35 [802520c9] lock_acquire+0x48/0x61 [802f75e5] ext3_get_blocks_handle+0x1a4/0x8f7 [8056a6d4] _mutex_lock+0x26/0x52 [802f75e5] ext3_get_blocks_handle+0x1a4/0x8f7 [802504b2] find_usage_backwards+0xb0/0xd9 [802504b2] find_usage_backwards+0xb0/0xd9 [80250d7c] debug_check_no_locks_freed+0x11d/0x129 [80250c33] trace_hardirqs_on_caller+0x115/0x138 [8024efdc] lockdep_init_map+0xac/0x41f [802614b4] add_preempt_count+0x14/0xe4 [802f8035] ext3_get_block+0xc2/0xe4 [802aeed3] __block_prepare_write+0x195/0x442 [802f7f73] ext3_get_block+0x0/0xe4 [802af19a] block_prepare_write+0x1a/0x25 [802f93e9] ext3_prepare_write+0xb2/0x17b [802671b1] generic_file_buffered_write+0x298/0x646 [8023944e] current_fs_time+0x3b/0x40 [802614b4] add_preempt_count+0x14/0xe4 [802678ae] __generic_file_aio_write_nolock+0x34f/0x3b9 [8024ed3d] put_lock_stats+0xe/0x2a [80267964] generic_file_aio_write+0x4c/0xc4 [80267979] generic_file_aio_write+0x61/0xc4 [802fcf18] ext3_orphan_del+0x53/0x19f [802f5768] ext3_file_write+0x1c/0x9d [8028ef31] do_sync_write+0xcc/0x10f [80246f9c] autoremove_wake_function+0x0/0x2e [8024ecfe] get_lock_stats+0xe/0x3f [8024ed9a] lock_release_holdtime+0x41/0x4f [8024ed3d] put_lock_stats+0xe/0x2a [8028dfb1] sys_fchmod+0xa3/0xbd [8056a717]
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Sun, 15 Jul 2007 21:21:03 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: Shows the current stacktrace where we violate the previously established locking order. yup, but the lock_page() which we did inside truncate_mutex was a lock_page() against a different address_space: the blockdev mapping. So this is OK - we'll never take truncate_mutex against the blockdev mapping (it doesn't have one, for a start ;)) This is similar to the quite common case where we take inode A's i_mutex inside inode B's i_mutex, which needs special lockdep annotations. I think. I haven't looked into this in detail. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
On Fri, 13 Jul 2007 16:00:48 +0530 Kalpak Shah [EMAIL PROTECTED] wrote: - if (inode-i_nlink = EXT4_LINK_MAX) + if (EXT4_DIR_LINK_MAX(inode)) return -EMLINK; argh. WHY_IS_EXT4_FULL_OF_UPPER_CASE_MACROS_WHICH_COULD_BE_IMPLEMENTED as_lower_case_inlines? Sigh. It's all the old-timers, I guess. EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice. #define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) (dir)-i_nlink = EXT4_LINK_MAX) This just checks if directory has hash indexing in which case we need not worry about EXT4_LINK_MAX subdir limit. If directory is not hash indexed then we will need to enforce a max subdir limit. Sorry, I didn't understand what is the problem with this macro? Macros should never evaluate their argument more than once, because if they do they will misbehave when someone passes them an expression-with-side-effects: struct inode *p = q; EXT4_DIR_LINK_MAX(p++); one expects `p' to have the value q+1 here. But it might be q+2. and EXT4_DIR_LINK_MAX(some_function()); might cause some_function() to be called twice. This is one of the many problems which gets fixed when we write code in C rather than in cpp. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks
On Thu, 12 Jul 2007 08:57:51 -0500 Dave Kleikamp [EMAIL PROTECTED] wrote: On Thu, 2007-07-12 at 12:38 +0100, Andy Whitcroft wrote: Andrew Morton wrote: +if (ext4_ext_check_header(inode, ext_block_hdr(bh), +depth - i - 1)) { +err = -EIO; +break; +} +path[i+1].p_bh = bh; Really that should have been i + 1. checkpatch misses this. It seems to be missing more stuff that it used to lately. This one is difficult. The rules up to now have been consistent spacing is required on both sides of mathematics operators. I personally like spaces always, but we do tend to use them without spaces too where the binding is effectivly part of the value -- the classic case is something like: pfn MAX_ORDER-1 In allowing that sort of thing, we implictly allow the one you note above. We have tried to be overly annoying on these things, and so the check is consistancy, spaces both or neither. We could be stricter. I personally think stricter is better. An occasionally false-positive isn't going to hurt anyone. (Well, maybe the checkpatch.pl maintainers will get nagged.) It at least will cause the developer to look at the line of code in question and make a conscious decision to leave it as it is. I'm assuming that upstream maintainers use checkpatch.pl with some constraint, and don't throw every patch that produces a warning back at the submitter. I'm in two minds. Missing-the-spaces is pretty damn common and is sometimes a reasonable way of saving quite a lot of horizontal space. I spose we could take it out again if it's causing problems. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes Date: Sun, 01 Jul 2007 03:38:51 -0400 Sender: [EMAIL PROTECTED] Organization: IBM Linux Technology Center X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent compilation fixes I hope this patch series will hit git with nice titles like ext4: extent compilation fixes and not funny ones like Morecleanups:ext4_extent_compilation_fixes. Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions. conversions. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups
On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote: From: Dmitry Monakhov [EMAIL PROTECTED] Subject: ext4: extent macros cleanup - Replace math equation to it's macro equivalent s/it's/its/;) - make ext4_ext_grow_indepth() indexes/leaf correct hm, what was wrong with it? Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED] Acked-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 12fe3d7..1fd00ac 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int bloc ext_debug(binsearch for %d(idx): , block); l = EXT_FIRST_INDEX(eh) + 1; - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_INDEX(eh); while (l = r) { m = l + (r - l) / 2; if (block le32_to_cpu(m-ei_block)) @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) ext_debug(binsearch for %d: , block); l = EXT_FIRST_EXTENT(eh) + 1; - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1; + r = EXT_LAST_EXTENT(eh); while (l = r) { m = l + (r - l) / 2; @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode, curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); curp-p_hdr-eh_entries = cpu_to_le16(1); curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr); - /* FIXME: it works, but actually path[0] can be index */ - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; + + if (path[0].p_hdr-eh_depth) + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block; + else + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block; whitespace bustage there. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote: On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote: David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. do you mean mark inode dirty all the times in file_update_time()? Not sure about the overhead for ext3/4. hm, I hadn't thought about it in any detail. Maybe something like --- a/fs/inode.c~a +++ a/fs/inode.c @@ -1238,6 +1238,11 @@ void file_update_time(struct file *file) sync_it = 1; } + if (IS_I_VERSION_64(inode)) { + inode_inc_iversion(inode); /* Takes i_lock on 32-bit */ + sync_it = 1; + } + if (sync_it) mark_inode_dirty_sync(inode); } _ ? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: - /* AKPM: buglet - add `i' to tmp! */ Damn. After, what, seven years, someone actually fixed it? for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + struct commit_header *tmp = + (struct commit_header *)(bh-b_data + i); tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); + + if (JBD2_HAS_COMPAT_FEATURE(journal, + JBD2_FEATURE_COMPAT_CHECKSUM)) { + tmp-h_chksum_type = JBD2_CRC32_CHKSUM; + tmp-h_chksum_size = JBD2_CRC32_CHKSUM_SIZE; + tmp-h_chksum[0]= cpu_to_be32(crc32_sum); + } } And in doing so, changed the on-disk format of the journal commit blocks. Surely this was worth a mention in the changelog, if not a standalone patch? I don't think this is worth doing, really. Why not just leave the format as it was, take the loop out and run this code once rather than eight times? Well, we aren't using the rest of the commit block in any case. I think the original intention was that we'd get 8 copies of the commit block so we would be sure to get a good one. I don't know whether we'd rather have 8 copies of the commit block, or more potential to expand the commit block? I don't personally have any preference, since the checksum should be a more robust way of checking validity than having multiple copies, so we may as well remove the loop and stick with a single copy for now. We've never altered any commit block sectors apart from the zeroeth one (eight times) due to the above bug. So I'd suggest that we should formalise the old bug and leave the format as-is. That'll leave lots of space spare in the commit block. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block + * with this same handle */ + if ((jbd2_journal_extend(handle, + EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete + some EAs or run e2fsck., + inode-i_ino); + expand_message = 1; + } + } + } + } Maybe that message should come out once per mount rather than once per boot (or once per modprobe)? Probably true. What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). I'd suggest that we get a comment in the code explaining this: this unchecked error does rather stand out. + if (EXT4_I(inode)-i_file_acl) { + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl); + error = -EIO; + if (!bh) + goto cleanup; + if (ext4_xattr_check_block(bh)) { + ext4_error(inode-i_sb, __FUNCTION__, + inode %lu: bad block %llu, inode-i_ino, + EXT4_I(inode)-i_file_acl); + error = -EIO; + goto cleanup; + } + base = BHDR(bh); + first = BFIRST(bh); + end = bh-b_data + bh-b_size; + min_offs = end - base; + free = ext4_xattr_free_space(first, min_offs, base, + total_blk); + if (free new_extra_isize) { + if (!tried_min_extra_isize s_min_extra_isize) { + tried_min_extra_isize++; + new_extra_isize = s_min_extra_isize; Aren't we missing a brelse(bh) here? Seems likely, yes. OK - could we get a positive ack from someone indicating that this will get looked at? Because I am about to forget about it. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option
On Sun, 01 Jul 2007 03:35:48 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Add a mount option to turn off extents. Please update the changelog to describe the reason for making this change. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:18.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c 2007-06-11 17:02:22.0 -0700 @@ -725,7 +725,7 @@ Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, - Opt_grpquota, Opt_extents, + Opt_grpquota, Opt_extents, Opt_noextents, }; static match_table_t tokens = { @@ -776,6 +776,7 @@ {Opt_usrquota, usrquota}, {Opt_barrier, barrier=%u}, {Opt_extents, extents}, + {Opt_noextents, noextents}, {Opt_err, NULL}, {Opt_resize, resize}, }; @@ -,6 +1112,9 @@ case Opt_extents: set_opt (sbi-s_mount_opt, EXTENTS); break; + case Opt_noextents: + clear_opt (sbi-s_mount_opt, EXTENTS); + break; default: printk (KERN_ERR EXT4-fs: Unrecognized mount option \%s\ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev
On Sun, 01 Jul 2007 03:36:01 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Turn on extents feature by default in ext4 filesystem. User could use -o noextents to turn it off. Oh, there you go. Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:22.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c 2007-06-11 17:03:09.0 -0700 @@ -1546,6 +1546,12 @@ set_opt(sbi-s_mount_opt, RESERVATION); + /* + * turn on extents feature by default in ext4 filesystem + * User -o noextents to turn it off + */ + set_opt (sbi-s_mount_opt, EXTENTS); + Broken coding style. Please feed all the ext4 patches through scripts/checkpatch.pl (preferably version 0.07 - see Andy's patch on lkml) and then consider addressing the (quite large) number of mistakes which are detected. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 1/5] cleanups: Propagate some i_flags to disk
On Sun, 01 Jul 2007 03:36:12 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext4-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. This changelog is inadequate. I had to hunt down the equivalent ext3 patch's changelog to understand the reasons for this change. Please update this patch's changelog using the below: ext3: copy i_flags to inode flags on write A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is written to disk. The same thing is done on GETFLAGS ioctl. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem (especially, lsattr did not show them and users were wondering...). Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext3-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
On Sun, 01 Jul 2007 03:36:48 -0400 Mingming Cao [EMAIL PROTECTED] wrote: On Jun 07, 2007 23:45 -0500, Jose R. Santos wrote: The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names with more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. This patch moves the file to /proc/jbd2-degug. The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require some minor alterations to the jbd-stats patch. I don't think we really want to be adding top-level files in /proc. What about using the debugfs filesystem (not to be confused with the e2fsprogs 'debugfs' command)? How about this then? Moved the file to use debugfs as well as having the nice effect of removing more lines than what it adds. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] Please clean up the changelog. The changelog should include information about the location and the content of these debugfs files. it should provide any instructions which users will need to be able to create and use those files. Alternatively (and preferably) do this via an update to Documentation/filesystems/ext4.txt. fs/jbd2/journal.c| 6220 +42 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 2 files changed, 21 insertions(+), 43 deletions(-) Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm. Apart from the lack of testing and review which this causes, it means I can't just do `pushpatch name-of-this-patch' and look at it in tkdiff. So I squint at the diff, but that's harder when the diff wasn't prepared with `diff -p'. Oh well. Index: linux-2.6.22-rc4/fs/jbd2/journal.c === --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 -0700 @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1954,60 +1955,37 @@ * /proc tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u16 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) defined(CONFIG_DEBUG_FS) Has this been compile-tested with CONFIG_DEBUGFS=n? -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) +#define jbd2_create_debugfs_entry() do {} while (0) +#define jbd2_remove_debugfs_entry() do {} while (0) I suggest that these be converted to (preferable) inline functions while you're there. #endif @@ -2067,7 +2045,7 @@ ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2056,7 @@ if (n) printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 16:16:18.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 -0700 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING JBD2? -extern int jbd2_journal_enable_debug; +extern u16 jbd2_journal_enable_debug; Why was this made 16-bit? To save 2 bytes? Could have saved 3 if we're going to do that. Shoudln't all this debug info be a per-superblock thing rather than kernel-wide? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Sun, 01 Jul 2007 03:36:56 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch is a spinoff of the old nanosecond patches. I don't know what the old nanosecond patches are. A link to a suitable changlog for those patches would do in a pinch. Preferable would be to write a proper changelog for this patch. It includes some cleanups and addition of a creation timestamp. The EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with s_{min, want}_extra_isize fields in struct ext3_super_block. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22-rc4/fs/ext4/ialloc.c Please include diffstat output when preparing patches. + +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)\ + ((offsetof(typeof(*ext4_inode), field) +\ + sizeof((ext4_inode)-field)) \ + = (EXT4_GOOD_OLD_INODE_SIZE + \ + (einode)-i_extra_isize)) \ Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers under what circumstances something will not fit in an inode and what the consequences of this are. +static inline __le32 ext4_encode_extra_time(struct timespec *time) +{ + return cpu_to_le32((sizeof(time-tv_sec) 4 ? +time-tv_sec 32 : 0) | +((time-tv_nsec 2) EXT4_NSEC_MASK)); +} + +static inline void ext4_decode_extra_time(struct timespec *time, __le32 extra) +{ + if (sizeof(time-tv_sec) 4) +time-tv_sec |= (__u64)(le32_to_cpu(extra) EXT4_EPOCH_MASK) + 32; + time-tv_nsec = (le32_to_cpu(extra) EXT4_NSEC_MASK) 2; +} Consider uninlining these functions. +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode) \ +do {\ + (raw_inode)-xtime = cpu_to_le32((inode)-xtime.tv_sec); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ + (raw_inode)-xtime ## _extra = \ + ext4_encode_extra_time((inode)-xtime); \ +} while (0) + +#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode) \ +do {\ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime)) \ + (raw_inode)-xtime = cpu_to_le32((einode)-xtime.tv_sec); \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\ + (raw_inode)-xtime ## _extra = \ + ext4_encode_extra_time((einode)-xtime); \ +} while (0) + +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ +do {\ + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ + ext4_decode_extra_time((inode)-xtime,\ +raw_inode-xtime ## _extra);\ +} while (0) + +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode) \ +do {\ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime)) \ + (einode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\ + ext4_decode_extra_time((einode)-xtime, \ +raw_inode-xtime ## _extra);\ +} while (0) Ugly. I expect these could be implemented as plain old C functions. Caller could pass in the address of the ext4_inode field which the function is to operate upon. #if defined(__KERNEL__) || defined(__linux__) (What's the __linux__ for?) #define i_reserved1 osd1.linux1.l_i_reserved1 #define i_frag osd2.linux2.l_i_frag @@ -539,6 +603,13 @@ return container_of(inode, struct ext4_inode_info, vfs_inode); } +static inline struct timespec ext4_current_time(struct inode *inode) +{ + return (inode-i_sb-s_time_gran NSEC_PER_SEC) ? + current_fs_time(inode-i_sb) : CURRENT_TIME_SEC; +} Now, I've forgotten how this works. Remind me, please. Can ext4 filesystems ever have one-second timestamp granularity? If so, how does one cause that to come about? --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_i.h 2007-06-11 17:22:15.0 -0700 +++
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Sun, 01 Jul 2007 03:37:04 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch converts the 32-bit i_version in the generic inode to a 64-bit i_version field. That's obvious from the patch. But what was the reason for making this (unrelated to ext4) change? Please update the changelog for this. Index: linux-2.6.21/include/linux/fs.h === --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -549,7 +549,7 @@ struct inode { uid_t i_uid; gid_t i_gid; dev_t i_rdev; - unsigned long i_version; + u64 i_version; loff_t i_size; #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode
On Sun, 01 Jul 2007 03:37:16 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used for 64-bit inode versions. This field will store the higher 32 bits of the version, while Jean Noel's patch has added support to store the lower 32-bits in osd1.linux1.l_i_version. Please wordwrap this changelog entry to less than 80 columns. Well, less than 258, anyway ;) Signed-off-by: Mingming Cao [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] --- Index: linux-2.6.21/include/linux/ext4_fs.h === --- linux-2.6.21.orig/include/linux/ext4_fs.h +++ linux-2.6.21/include/linux/ext4_fs.h @@ -342,6 +342,7 @@ struct ext4_inode { __le32 i_atime_extra; /* extra Access time (nsec 2 | epoch) */ __le32 i_crtime; /* File Creation time */ __le32 i_crtime_extra; /* extra FileCreationtime (nsec 2 | epoch) */ + __le32 i_version_hi; /* high 32 bits for 64-bit version */ }; Aren't there forward- backward-compatibility issues here? How does the filesystem driver work out whether this field is present and valid? The changelog should describe this design issue. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks
On Sun, 01 Jul 2007 03:36:22 -0400 Mingming Cao [EMAIL PROTECTED] wrote: with the patch all headers are checked. the code should become more resistant to on-disk corruptions. needless BUG_ON() have been removed. please, review for inclusion. ... @@ -269,6 +239,70 @@ return size; } +static inline int +ext4_ext_max_entries(struct inode *inode, int depth) Please remove the `inline'. `inline' is usually wrong. It is wrong here. If this function has a single callsite (it has) then the compiler will inline it. If the function later gains more callsites then the compiler will know (correctly) not to inline it. We hope. If we find the compiler still inlines a function as large as this one then we'd need to use noinline and complain at the gcc developers. +{ + int max; + + if (depth == ext_depth(inode)) { + if (depth == 0) + max = ext4_ext_space_root(inode); + else + max = ext4_ext_space_root_idx(inode); + } else { + if (depth == 0) + max = ext4_ext_space_block(inode); + else + max = ext4_ext_space_block_idx(inode); + } + + return max; +} + +static int __ext4_ext_check_header(const char *function, struct inode *inode, + struct ext4_extent_header *eh, + int depth) +{ + const char *error_msg = NULL; Unneeded initialisation. + int max = 0; + + if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) { + error_msg = invalid magic; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) { + error_msg = unexpected eh_depth; + goto corrupted; + } + if (unlikely(eh-eh_max == 0)) { + error_msg = invalid eh_max; + goto corrupted; + } + max = ext4_ext_max_entries(inode, depth); + if (unlikely(le16_to_cpu(eh-eh_max) max)) { + error_msg = too large eh_max; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max))) { + error_msg = invalid eh_entries; + goto corrupted; + } + return 0; + +corrupted: + ext4_error(inode-i_sb, function, + bad header in inode #%lu: %s - magic %x, + entries %u, max %u(%u), depth %u(%u), + inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic), + le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max), + max, le16_to_cpu(eh-eh_depth), depth); + + return -EIO; +} + ... + i = depth = ext_depth(inode); Mulitple assignments like this are somewhat unpopular from the coding-style POV. We have a local variable called `i' which is not used as a simple counter and which has no explanatory comment. This is plain bad. Please find a better name for this variable. Review the code for other such lack of clarity - I'm sure there's more. - BUG_ON(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max)); - BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC); Yeah, this patch improves things a lot. How well-tested is it? Have any of these errors actually been triggered? path[i].p_hdr = ext_block_hdr(path[i].p_bh); - if (ext4_ext_check_header(__FUNCTION__, inode, - path[i].p_hdr)) { - err = -EIO; - goto out; - } } - BUG_ON(le16_to_cpu(path[i].p_hdr-eh_entries) - le16_to_cpu(path[i].p_hdr-eh_max)); - BUG_ON(path[i].p_hdr-eh_magic != EXT4_EXT_MAGIC); - if (!path[i].p_idx) { /* this level hasn't been touched yet */ path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr); @@ -1873,17 +1890,24 @@ i, EXT_FIRST_INDEX(path[i].p_hdr), path[i].p_idx); if (ext4_ext_more_to_rm(path + i)) { + struct buffer_head *bh; /* go to the next level */ ext_debug(move to level %d (block %llu)\n, i + 1, idx_pblock(path[i].p_idx)); memset(path + i + 1, 0, sizeof(*path)); - path[i+1].p_bh = - sb_bread(sb, idx_pblock(path[i].p_idx)); - if (!path[i+1].p_bh) { + bh = sb_bread(sb, idx_pblock(path[i].p_idx)); + if (!bh) { /* should we reset i_size? */ err = -EIO; break; } + BUG_ON(i + 1 depth);
Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store
On Sun, 01 Jul 2007 03:37:36 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch adds 64-bit inode version support to ext4. The lower 32 bits are stored in the osd1.linux1.l_i_version field while the high 32 bits are stored in the i_version_hi field newly created in the ext4_inode. So reading the code here does serve to answer the question I raised against the earlier patch. A bit. I'd have thought that this patch and the one which adds i_version_hi should be folded into a single diff? Index: linux-2.6.21/fs/ext4/inode.c === --- linux-2.6.21.orig/fs/ext4/inode.c +++ linux-2.6.21/fs/ext4/inode.c @@ -2709,6 +2709,13 @@ void ext4_read_inode(struct inode * inod EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode); EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode); + inode-i_version = le32_to_cpu(raw_inode-i_disk_version); + if (EXT4_INODE_SIZE(inode-i_sb) EXT4_GOOD_OLD_INODE_SIZE) { + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) + inode-i_version |= + (__u64)(le32_to_cpu(raw_inode-i_version_hi)) 32; checks the precedence of (type) versus OK + } I don't quite see how the above two tests are sufficient to unambiguously determine that the i_version_hi field is present on-disk. I guess we're implicitly assuming that if the on-disk inode is big enough then it _must_ have i_version_hi in there? If so, why is the comparison with EXT4_GOOD_OLD_INODE_SIZE needed? Some description of the overall approach to inode version identification would be helpful here. if (S_ISREG(inode-i_mode)) { inode-i_op = ext4_file_inode_operations; inode-i_fop = ext4_file_operations; @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t } else for (block = 0; block EXT4_N_BLOCKS; block++) raw_inode-i_block[block] = ei-i_data[block]; - if (ei-i_extra_isize) + raw_inode-i_disk_version = cpu_to_le32(inode-i_version); + if (ei-i_extra_isize) { + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) { There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here... + raw_inode-i_version_hi = + cpu_to_le32(inode-i_version 32); + } raw_inode-i_extra_isize = cpu_to_le16(ei-i_extra_isize); + } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Sun, 01 Jul 2007 03:37:45 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch is on top of i_version_update_vfs. The i_version field of the inode is set on inode creation and incremented when the inode is being modified. Again, I don't think I've ever seen this patch before. It is at least a month old. Index: linux-2.6.22-rc4/fs/ext4/ialloc.c === --- linux-2.6.22-rc4.orig/fs/ext4/ialloc.c2007-06-13 17:16:28.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/ialloc.c 2007-06-13 17:24:45.0 -0700 @@ -565,6 +565,7 @@ got: inode-i_blocks = 0; inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime = ext4_current_time(inode); + inode-i_version = 1; memset(ei-i_data, 0, sizeof(ei-i_data)); ei-i_dir_start_lookup = 0; Index: linux-2.6.22-rc4/fs/ext4/inode.c === --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-13 17:21:29.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/inode.c 2007-06-13 17:24:45.0 -0700 @@ -3082,6 +3082,7 @@ int ext4_mark_iloc_dirty(handle_t *handl { int err = 0; + inode-i_version++; /* the do_update_inode consumes one bh-b_count */ get_bh(iloc-bh); Index: linux-2.6.22-rc4/fs/ext4/super.c === --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-13 17:19:11.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/super.c 2007-06-13 17:24:45.0 -0700 @@ -2846,8 +2846,8 @@ out: i_size_write(inode, off+len-towrite); EXT4_I(inode)-i_disksize = inode-i_size; } - inode-i_version++; inode-i_mtime = inode-i_ctime = CURRENT_TIME; + inode-i_version = 1; ext4_mark_inode_dirty(handle, inode); mutex_unlock(inode-i_mutex); return len - towrite; ext4 already has code to update i_version on directories. Here we appear to be udpating it on regular files? But for what reason? The changelog doesn't say? AFAICT the code forgets to update i_version during file overwrites (unless the overwrite was over a hole). But without a decent description of this change I cannot tell whether this was a bug. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Sun, 01 Jul 2007 03:38:01 -0400 Mingming Cao [EMAIL PROTECTED] wrote: This patch is on top of the nanosecond timestamp and i_version_hi patches. This sort of information isn't needed (or desired) when this patch hits the git tree. Please ensure that things like this are cleaned up before the patches go to Linus. We need to make sure that existing filesystems can also avail the new fields that have been added to the inode. We use s_want_extra_isize and s_min_extra_isize to decide by how much we should expand the inode. If EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand by max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question about whether users should be able to set s_*_extra_isize smaller than the known fields or not. This patch also adds the functionality to expand inodes to include the newly added fields. We start by trying to expand by s_want_extra_isize bytes and if its fails we try to expand by s_min_extra_isize bytes. This is done by changing the i_extra_isize if enough space is available in the inode and no EAs are present. If EAs are present and there is enough space in the inode then the EAs in the inode are shifted to make space. If enough space is not available in the inode due to the EAs then 1 or more EAs are shifted to the external EA block. In the worst case when even the external EA block does not have enough space we inform the user that some EA would need to be deleted or s_min_extra_isize would have to be reduced. This would be online expansion of inodes. I am also working on adding an expand_inodes option to e2fsck which will expand all the used inodes. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] Index: linux-2.6.22-rc4/fs/ext4/inode.c === --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-14 17:32:27.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/inode.c 2007-06-14 17:32:41.0 -0700 @@ -3120,6 +3120,40 @@ ext4_reserve_inode_write(handle_t *handl } /* + * Expand an inode by new_extra_isize bytes. + * Returns 0 on success or negative error number on failure. + */ +int ext4_expand_extra_isize(struct inode *inode, unsigned int new_extra_isize, + struct ext4_iloc iloc, handle_t *handle) +{ + struct ext4_inode *raw_inode; + struct ext4_xattr_ibody_header *header; + struct ext4_xattr_entry *entry; + + if (EXT4_I(inode)-i_extra_isize = new_extra_isize) { + return 0; + } This patch has lots of unneeded braces in it. checkpatch appears to catch them. + raw_inode = ext4_raw_inode(iloc); + + header = IHDR(inode, raw_inode); + entry = IFIRST(header); + + /* No extended attributes present */ + if (!(EXT4_I(inode)-i_state EXT4_STATE_XATTR) || + header-h_magic != cpu_to_le32(EXT4_XATTR_MAGIC)) { + memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0, + new_extra_isize); + EXT4_I(inode)-i_extra_isize = new_extra_isize; + return 0; + } + + /* try to expand with EA present */ + return ext4_expand_extra_isize_ea(inode, new_extra_isize, + raw_inode, handle); +} + +/* * What we do here is to mark the in-core inode as clean with respect to inode * dirtiness (it may still be data-dirty). * This means that the in-core inode may be reaped by prune_icache @@ -3143,10 +3177,33 @@ ext4_reserve_inode_write(handle_t *handl int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode) { struct ext4_iloc iloc; - int err; + int err, ret; + static int expand_message; might_sleep(); err = ext4_reserve_inode_write(handle, inode, iloc); + if (EXT4_I(inode)-i_extra_isize + EXT4_SB(inode-i_sb)-s_want_extra_isize + !(EXT4_I(inode)-i_state EXT4_STATE_NO_EXPAND)) { + /* We need extra buffer credits since we may write into EA block + * with this same handle */ + if ((jbd2_journal_extend(handle, + EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) { + ret = ext4_expand_extra_isize(inode, + EXT4_SB(inode-i_sb)-s_want_extra_isize, + iloc, handle); + if (ret) { + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND; + if (!expand_message) { + ext4_warning(inode-i_sb, __FUNCTION__, + Unable to expand inode %lu. Delete + some EAs or run e2fsck., +
Re: [PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)
On Tue, 10 Jul 2007 15:15:57 -0700 Badari Pulavarty [EMAIL PROTECTED] wrote: Well, I looked at the problem now and here is the fix :) whee, thanks. Greg, Please consider this for stable release also. No, it is only relevant to -mm's ext2-reservations.patch. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs
On Sun, 01 Jul 2007 03:38:10 -0400 Mingming Cao [EMAIL PROTECTED] wrote: [PATCH] jbd2 stats through procfs The patch below updates the jbd stats patch to 2.6.20/jbd2. The initial patch was posted by Alex Tomas in December 2005 (http://marc.info/?l=linux-ext4m=113538565128617w=2). It provides statistics via procfs such as transaction lifetime and size. [ This probably should be rewritten to use debugfs? -- Ted] I suppose that given that we're creating a spot in debugfs for the jbd2 debug code, yes, this also should be moved over. But the jbd2 debug debugfs entries were kernel-wide whereas this is per-superblock. I think. That email from Alex contains pretty important information. I suggest that it be added to the changelog after accuracy-checking. Addition to Documentation/filesystems/ext4.txt would be good. -- Index: linux-2.6.22-rc4/include/linux/jbd2.h === --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 17:28:17.0 -0700 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-13 10:45:21.0 -0700 @@ -408,6 +408,16 @@ }; +/* + * Some stats for checkpoint phase + */ +struct transaction_chp_stats_s { + unsigned long cs_chp_time; + unsigned long cs_forced_to_close; + unsigned long cs_written; + unsigned long cs_dropped; +}; It would be nice to document what units all these fields are in. Jiffies, I assume. /* The transaction_t type is the guts of the journaling mechanism. It * tracks a compound transaction through its various states: * @@ -543,6 +553,21 @@ spinlock_t t_handle_lock; /* + * Longest time some handle had to wait for running transaction + */ + unsigned long t_max_wait; + + /* + * When transaction started + */ + unsigned long t_start; + + /* + * Checkpointing stats [j_checkpoint_sem] + */ + struct transaction_chp_stats_s t_chp_stats; + + /* * Number of outstanding updates running on this transaction * [t_handle_lock] */ @@ -573,6 +598,57 @@ }; +struct transaction_run_stats_s { + unsigned long rs_wait; + unsigned long rs_running; + unsigned long rs_locked; + unsigned long rs_flushing; + unsigned long rs_logging; + + unsigned long rs_handle_count; + unsigned long rs_blocks; + unsigned long rs_blocks_logged; +}; + +struct transaction_stats_s +{ + int ts_type; + unsigned long ts_tid; + union { + struct transaction_run_stats_s run; + struct transaction_chp_stats_s chp; + } u; +}; + +#define JBD2_STATS_RUN 1 +#define JBD2_STATS_CHECKPOINT2 + +#define ts_wait u.run.rs_wait +#define ts_running u.run.rs_running +#define ts_lockedu.run.rs_locked +#define ts_flushing u.run.rs_flushing +#define ts_logging u.run.rs_logging +#define ts_handle_count u.run.rs_handle_count +#define ts_blocksu.run.rs_blocks +#define ts_blocks_logged u.run.rs_blocks_logged + +#define ts_chp_time u.chp.cs_chp_time +#define ts_forced_to_close u.chp.cs_forced_to_close +#define ts_written u.chp.cs_written +#define ts_dropped u.chp.cs_dropped That's a bit sleazy. We can drop the u from 'struct transaction_stats_s' and make it an anonymous union, then open-code foo.run.rs_wait everywhere. But that's a bit sleazy too, because the reader of the code would not know that a write to foo.run.rs_wait will stomp on the value of foo.chp.cs_chp_time. So to minimize reader confusion it would be best I think to just open-code the full u.run.rs_wait at all code-sites. The macros above are the worst possible choice: they hide information from the code-reader just to save the code-writer a bit of typing. But we very much want to optimise for code-readers, not for code-writers. +#define CURRENT_MSECS(jiffies_to_msecs(jiffies)) hm, that isn't something which should be in an ext4 header file. And it shouldn't be in UPPER CASE and it shouldn't pretend to be a constant (or a global scalar). IOW: yuk. How's about raising a separate, standalone patch which creates a new kernel-wide coded-in-C function such as unsigned long jiffies_in_msecs(void); ? (That's assuming we don't already have one. Most likely we have seven of them hiding in various dark corners). +static inline unsigned int +jbd2_time_diff(unsigned int start, unsigned int end) +{ + if (unlikely(start end)) + end = end + (~0UL - start); + else + end -= start; + return end; +} I don't know what this does
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote: David Chinneer pointed that we need to journal the version number updates together with the operations that causes the change of the inode version number, in order to survive server crashes so clients won't see the counter go backwards. So increment i_version in fs code is probably the place to ensure the inode version changes are stored to disk. It's seems update the ext4 inode version in every ext4_mark_inode_dirty() is the easiest way. That still makes us dependent upon _something_ changing the inode. For overwrites the only something is mtime. If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and I don't think we do) then I guess we'll need new code in or around file_update_time() to do this. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 11 Jul 2007 15:05:27 +1000 Neil Brown [EMAIL PROTECTED] wrote: It just occurred to me: If i_version is 64bit, then knfsd would need to be careful when reading it on a 32bit host. What are the locking rules? Presumably it is only updated under i_mutex protection, but having to get i_mutex to read it would seem a little heavy handed. Should it use a seqlock like i_size? Could we use the same seqlock that i_size uses, or would we need a separate one? seqlocks are a bit of a pain to use (we've had plenty of deadlocks on the i_size one). We could reuse inode.i_lock for this modification. Its mandate is general purpose innermost lock to protect stuff in this inode. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao [EMAIL PROTECTED] wrote: From [EMAIL PROTECTED] Thu May 17 17:21:08 2007 Hi, I have rebased this patch to 2.6.22-rc1 so that it can be added to the ext4 patch queue. It has been tested by creating more than 65000 subdirs and then deleting them and checking the nlinks. The e2fsprogs part of this patch was sent earlier by me to linux-ext4 and doesn't need any changes, so not submitting it again. -- This patch adds support to ext4 for allowing more than 65000 subdirectories. Currently the maximum number of subdirectories is capped at 32000. If we exceed 65000 subdirectories in an htree directory it sets the inode link count to 1 and no longer counts subdirectories. The directory link count is not actually used when determining if a directory is empty, as that only counts subdirectories and not regular files that might be in there. A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. Would I be correct in assuming that a later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir directories? If so, that is worth a mention in the changelog, perhaps? Please remind us what is the behaviour of an RO_COMPAT flag? It means that old ext4, ext3 and ext2 can only mount this fs read-only, yes? Index: linux-2.6.22-rc4/fs/ext4/namei.c === --- linux-2.6.22-rc4.orig/fs/ext4/namei.c 2007-06-14 17:30:47.0 -0700 +++ linux-2.6.22-rc4/fs/ext4/namei.c 2007-06-14 17:32:55.0 -0700 @@ -1619,6 +1619,27 @@ static int ext4_delete_entry (handle_t * return -ENOENT; } +static inline void ext4_inc_count(handle_t *handle, struct inode *inode) +{ + inc_nlink(inode); + if (is_dx(inode) inode-i_nlink 1) { + /* limit is 16-bit i_links_count */ + if (inode-i_nlink = EXT4_LINK_MAX || inode-i_nlink == 2) { + inode-i_nlink = 1; + EXT4_SET_RO_COMPAT_FEATURE(inode-i_sb, + EXT4_FEATURE_RO_COMPAT_DIR_NLINK); + } + } +} Looks too big to be inlined. Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2? +static inline void ext4_dec_count(handle_t *handle, struct inode *inode) +{ + drop_nlink(inode); + if (S_ISDIR(inode-i_mode) inode-i_nlink == 0) + inc_nlink(inode); +} Probably too big to inline. + static int ext4_add_nondir(handle_t *handle, struct dentry *dentry, struct inode *inode) { @@ -1715,7 +1736,7 @@ static int ext4_mkdir(struct inode * dir struct ext4_dir_entry_2 * de; int err, retries = 0; - if (dir-i_nlink = EXT4_LINK_MAX) + if (EXT4_DIR_LINK_MAX(dir)) return -EMLINK; retry: @@ -1738,7 +1759,7 @@ retry: inode-i_size = EXT4_I(inode)-i_disksize = inode-i_sb-s_blocksize; dir_block = ext4_bread (handle, inode, 0, 1, err); if (!dir_block) { - drop_nlink(inode); /* is this nlink == 0? */ + ext4_dec_count(handle, inode); /* is this nlink == 0? */ ext4_mark_inode_dirty(handle, inode); iput (inode); goto out_stop; @@ -1770,7 +1791,7 @@ retry: iput (inode); goto out_stop; } - inc_nlink(dir); + ext4_inc_count(handle, dir); ext4_update_dx_flag(dir); ext4_mark_inode_dirty(handle, dir); d_instantiate(dentry, inode); @@ -2035,10 +2056,10 @@ static int ext4_rmdir (struct inode * di retval = ext4_delete_entry(handle, dir, de, bh); if (retval) goto end_rmdir; - if (inode-i_nlink != 2) - ext4_warning (inode-i_sb, ext4_rmdir, - empty directory has nlink!=2 (%d), - inode-i_nlink); + if (!EXT4_DIR_LINK_EMPTY(inode)) + ext4_warning(inode-i_sb, ext4_rmdir, + empty directory has too many links (%d), + inode-i_nlink); inode-i_version++; clear_nlink(inode); /* There's no need to set i_disksize: the fact that i_nlink is @@ -2048,7 +2069,7 @@ static int ext4_rmdir (struct inode * di ext4_orphan_add(handle, inode); inode-i_ctime = dir-i_ctime = dir-i_mtime = ext4_current_time(inode); ext4_mark_inode_dirty(handle, inode); - drop_nlink(dir); + ext4_dec_count(handle, dir); ext4_update_dx_flag(dir); ext4_mark_inode_dirty(handle, dir); @@ -2099,7 +2120,7 @@ static int ext4_unlink(struct inode * di dir-i_ctime = dir-i_mtime = ext4_current_time(dir); ext4_update_dx_flag(dir); ext4_mark_inode_dirty(handle, dir); - drop_nlink(inode); +
Re: [PATCH] fix error handling in ext3_create_journal
On Mon, 2 Jul 2007 00:11:11 +0200 Borislav Petkov [EMAIL PROTECTED] wrote: --- From: Borislav Petkov [EMAIL PROTECTED] Fix error handling in ext3_create_journal according to kernel conventions. Signed-off-by: Borislav Petkov [EMAIL PROTECTED] -- Index: linux-2.6.22-rc6/fs/ext3/super.c === --- linux-2.6.22-rc6/fs/ext3/super.c.orig 2007-07-01 21:12:51.0 +0200 +++ linux-2.6.22-rc6/fs/ext3/super.c 2007-07-01 21:14:32.0 +0200 @@ -2075,6 +2075,7 @@ unsigned int journal_inum) { journal_t *journal; + int err; if (sb-s_flags MS_RDONLY) { printk(KERN_ERR EXT3-fs: readonly filesystem when trying to @@ -2082,13 +2083,15 @@ return -EROFS; } - if (!(journal = ext3_get_journal(sb, journal_inum))) + journal = ext3_get_journal(sb, journal_inum); + if (!journal) return -EINVAL; printk(KERN_INFO EXT3-fs: creating new journal on inode %u\n, journal_inum); - if (journal_create(journal)) { + err = journal_create(journal); + if (err) { printk(KERN_ERR EXT3-fs: error creating journal.\n); journal_destroy(journal); return -EIO; Please prepare the equivalent patch for ext4. Without that, it'd probably be better to avoid applying the ext3 patch: there are advantages to keeping the two in sync where possible. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
fallocate support for bitmap-based files
Guys, Mike and Sreenivasa at google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. I believe that Sreenivasa will mainly be doing the implementation work. The basic plan is as follows: - Create (with tune2fs and mke2fs) a hidden file using one of the reserved inode numbers. That file will be sized to have one bit for each block in the partition. Let's call this the unwritten block file. The unwritten block file will be initialised with all-zeroes - at fallocate()-time, allocate the blocks to the user's file (in some yet-to-be-determined fashion) and, for each one which is uninitialised, set its bit in the unwritten block file. The set bit means this block is uninitialised and needs to be zeroed out on read. - truncate() would need to clear out set-bits in the unwritten blocks file. - When the fs comes to read a block from disk, it will need to consult the unwritten blocks file to see if that block should be zeroed by the CPU. - When the unwritten-block is written to, its bit in the unwritten blocks file gets zeroed. - An obvious efficiency concern: if a user file has no unwritten blocks in it, we don't need to consult the unwritten blocks file. Need to work out how to do this. An obvious solution would be to have a number-of-unwritten-blocks counter in the inode. But do we have space for that? (I expect google and others would prefer that the on-disk format be compatible with legacy ext2!) - One concern is the following scenario: - Mount fs with new kernel, fallocate() some blocks to a file. - Now, mount the fs under old kernel (which doesn't understand the unwritten blocks file). - This kernel will be able to read uninitialised data from that fallocated-to file, which is a security concern. - Now, the old kernel writes some data to a fallocated block. But this kernel doesn't know that it needs to clear that block's flag in the unwritten blocks file! - Now mount that fs under the new kernel and try to read that file. The flag for the block is set, so this kernel will still zero out the data on a read, thus corrupting the user's data So how to fix this? Perhaps with a per-inode flag indicating this inode has unwritten blocks. But to fix this problem, we'd require that the old kernel clear out that flag. Can anyone propose a solution to this? Ah, I can! Use the compatibility flags in such a way as to prevent the old kernel from mounting this filesystem at all. To mount this fs under an old kernel the user will need to run some tool which will - read the unwritten blocks file - for each set-bit in the unwritten blocks file, zero out the corresponding block - zero out the unwritten blocks file - rewrite the superblock to indicate that this fs may now be mounted by an old kernel. Sound sane? - I'm assuming that there are more reserved inodes available, and that the changes to tune2fs and mke2fs will be basically a copy-n-paste job from the `tune2fs -j' code. Correct? - I haven't thought about what fsck changes would be needed. Presumably quite a few. For example, fsck should check that set-bits in the unwriten blobks file do not correspond to freed blocks. If they do, that should be fixed up. And fsck can check each inodes number-of-unwritten-blocks counters against the unwritten blocks file (if we implement the per-inode number-of-unwritten-blocks counter) What else should fsck do? - I haven't thought about the implications of porting this into ext3/4. Probably the commit to the unwritten blocks file will need to be atomic with the commit to the user's file's metadata, so the unwritten-blocks file will effectively need to be in journalled-data mode. Or, more likely, we access the unwritten blocks file via the blockdev pagecache (ie: use bmap, like the journal file) and then we're just talking direct to the disk's blocks and it becomes just more fs metadata. - I guess resize2fs will need to be taught about the unwritten blocks file: to shrink and grow it appropriately. That's all I can think of for now - I probably missed something. Suggestions and thought are sought, please. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: fallocate support for bitmap-based files
On Fri, 29 Jun 2007 16:55:25 -0400 Theodore Tso [EMAIL PROTECTED] wrote: On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote: Guys, Mike and Sreenivasa at google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. What's the eventual goal of this work? Would it be for mainline use, or just something that would be used internally at Google? Mainline, preferably. I'm not particularly ennthused about supporting two ways of doing fallocate(); one for ext4 and one for bitmap-based files in ext2/3/4. Is the benefit reallyworth it? umm, it's worth it if you don't want to wear the overhead of journalling, and/or if you don't want to wait on the, err, rather slow progress of ext4. What I would suggest, which would make much easier, is to make this be an incompatible extensions (which you as you point out is needed for security reasons anyway) and then steal the high bit from the block number field to indicate whether or not the block has been initialized or not. That way you don't end up having to seek to a potentially distant part of the disk to check out the bitmap. Also, you don't have to worry about how to recover if the block initialized bitmap inode gets smashed. The downside is that it reduces the maximum size of the filesystem supported by ext2 by a factor of two. But, there are at least two patch series floating about that promise to allow filesystem block sizes than PAGE_SIZE which would allow you to recover the maximum size supported by the filesytem. Furthermore, I suspect (especially after listening to a very fasting Usenix Invited Talk by Jeffery Dean, a fellow from Google two weeks ago) that for many of Google's workloads, using a filesystem blocksize of 16K or 32K might not be a bad thing in any case. It would be a lot simpler Hadn't thought of that. Also, it's unclear to me why google is going this way rather than using (perhaps suitably-tweaked) ext2 reservations code. Because the stock ext2 block allcoator sucks big-time. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6][TAKE5] fallocate system call
On Mon, 25 Jun 2007 18:58:10 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: N O T E: --- 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part of ext4 patch queue git tree hosted by Ted. Why the heck are replacements for these things being sent out again when they're already in -mm and they're already in Ted's queue (from which I need to diligently drop them each time I remerge)? Are we all supposed to re-review the entire patchset (or at least #4 and #7) again? The core kernel changes are not appropriate to the ext4 tree. For a start, the syscall numbers in Ted's queue are wrong (other new syscalls are pending). Patches which add syscalls are an utter PITA to carry due to all the patch conflicts and to the relatively frequent syscall renumbering (they don't get numbered in time-of-arrival order due to differing rates at which patches mature). Please drop the non-ext4 patches from the ext4 tree and send incremental patches against the (non-ext4) fallocate patches in -mm. And try to get the code finished? Time is pressing. Thanks. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6][TAKE5] fallocate system call
On Thu, 28 Jun 2007 23:27:57 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: Please drop the non-ext4 patches from the ext4 tree and send incremental patches against the (non-ext4) fallocate patches in -mm. Please let us know what you think of Mingming's suggestion of posting all the fallocate patches including the ext4 ones as incremental ones against the -mm. I think Mingming was asking that Ted move the current quilt tree into git, presumably because she's working off git. I'm not sure what to do, really. The core kernel patches need to be in Ted's tree for testing but that'll create a mess for me. ug. Options might be a) I drop the fallocate patches from -mm and from the ext4 tree, hack up any needed build fixes, then just wait for it all to mature and then think about it again b) We do what we normally don't do and reserve the syscall slots in mainline. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] zero_user_page conversion
On Wed, 20 Jun 2007 17:08:24 -0500 Eric Sandeen [EMAIL PROTECTED] wrote: Use zero_user_page() in cifs, ocfs2, ext4, and gfs2 where possible. One patch, splattered across four maintainers, each of whom maintain separate trees. Sigh. Please, don't. _someone_ has to split this up, and it might as well not be me. splits it up - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Fw: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit machines
someone ort to fix this Begin forwarded message: Date: Sat, 16 Jun 2007 17:50:35 -0700 (PDT) From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit machines http://bugzilla.kernel.org/show_bug.cgi?id=8643 Summary: Bug with negative timestamps on 64bit machines Product: File System Version: 2.5 KernelVersion: 2.6.21-1.3228.fc7 Platform: All OS/Version: Linux Tree: Fedora Status: NEW Severity: normal Priority: P1 Component: ext3 AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Most recent kernel where this bug did not occur: Distribution: Fedora Core 7 Hardware Environment: AMD Athlon 64 Software Environment: Problem Description: Negative timestamps (i.e. before 1970) are incorrectly handled due to some obvious 64bit issues. They show up normally as long as the cache has not been purged. Steps to reproduce: Touch a file in an ext3-file system with a negative timestamp, e.g.: # touch -t 19690101 /opt/foo Unless you have been writing a lot of data to this partition after the last command, the next one should display the following: # ls -l /opt/foo -rw-r--r-- 1 root root 0 1969-01-01 00:00 /opt/foo This is still correct. But now we purge the file system cache by unmounting and mounting the partition again: # umount /opt/foo # mount /opt/foo Now the timestamp will be corrupted: # ls -l /opt/foo -rw-r--r-- 1 root root 0 2105-02-07 06:28 /opt/foo It seems obvious that the upper 32 bits of the 64 bits representing the time stamp get lost, which explains the above observation. On 32bit architectures there is no such problem. I haven't managed to reproduce this problem with other file systems (e.g. XFS). -- Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are on the CC list for the bug, or are watching someone who is. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] fs/buffer.c:1821 in 2.6.22-rc4-mm2
On Sun, 10 Jun 2007 17:57:14 +0200 Eric Sesterhenn / Snakebyte [EMAIL PROTECTED] wrote: hi, i got the following BUG while running the syscalls.sh from ltp-full-20070531 on an ext3 partition, it is easily reproducible for me [ 476.338068] [ cut here ] [ 476.338223] kernel BUG at fs/buffer.c:1821! [ 476.338324] invalid opcode: [#1] [ 476.338423] PREEMPT [ 476.338665] Modules linked in: [ 476.338833] CPU:0 [ 476.338836] EIP:0060:[c01a1914]Not tainted VLI [ 476.338840] EFLAGS: 00010202 (2.6.22-rc4-mm2 #1) [ 476.339206] EIP is at __block_prepare_write+0x64/0x410 [ 476.339311] eax: 0001 ebx: c136fbb8 ecx: c07faf28 edx: 0001 [ 476.339417] esi: c1dc9040 edi: c32d2dfc ebp: c3733db8 esp: c3733d50 [ 476.339584] ds: 007b es: 007b fs: gs: 0033 ss: 0068 [ 476.339690] Process vmsplice01 (pid: 7680, ti=c3733000 task=c351ed60 task.ti=c3733000) [ 476.339796] Stack: c3733d70 c0143e76 c1a0eab0 0046 c2509d64 0cd8 c136fbb8 [ 476.340675]c32d2dfc 0296 c02313b6 c1086088 0050 c02313b6 c1dc9040 c2509d50 [ 476.341491]c1dc9054 c3733dc4 c02313e9 c3733dbc c015728d c32d2f0c c136fbb8 [ 476.342371] Call Trace: [ 476.342565] [c01a1d83] block_write_begin+0x83/0xf0 [ 476.342804] [c0207778] ext3_write_begin+0xc8/0x1c0 [ 476.342987] [c01595bf] pagecache_write_begin+0x4f/0x150 [ 476.343243] [c019db3b] pipe_to_file+0x9b/0x170 [ 476.343418] [c019d4b0] __splice_from_pipe+0x70/0x260 [ 476.343654] [c019d6e8] splice_from_pipe+0x48/0x70 [ 476.343828] [c019d9f8] generic_file_splice_write+0x88/0x130 [ 476.344066] [c019d267] do_splice_from+0xb7/0xc0 [ 476.344240] [c019ea51] sys_splice+0x1a1/0x230 [ 476.344474] [c01043be] sysenter_past_esp+0x5f/0x99 [ 476.344656] [e410] 0xe410 [ 476.344882] === [ 476.344984] INFO: lockdep is turned off. [ 476.345084] Code: 00 0f 97 c2 e8 ee 2f 22 00 85 c0 74 04 0f 0b eb fe 31 d2 b8 28 af 7f c0 81 7d 08 00 10 00 00 0f 97 c2 e8 d0 2f 22 00 85 c0 74 04 0f 0b eb fe 8b 55 08 39 55 b0 0f 97 c0 0f b6 d0 b8 0c af 7f c0 [ 476.350365] EIP: [c01a1914] __block_prepare_write+0x64/0x410 SS:ESP 0068:c3733d50 Yep, vmsplice01 is not supported on -mm kernels ;) Nick has a protofix but I don't think it's been tested yet. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH ext3/ext4] orphan list check on destroy_inode
On Mon, 04 Jun 2007 09:18:55 +0400 Vasily Averin [EMAIL PROTECTED] wrote: Customers claims to ext3-related errors, investigation showed that ext3 orphan list has been corrupted and have the reference to non-ext3 inode. The following debug helps to understand the reasons of this issue. Signed-off-by:Vasily Averin [EMAIL PROTECTED] diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 6e30629..46e2fa6 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -459,6 +459,21 @@ static struct inode *ext3_alloc_inode(struct super_block *sb) static void ext3_destroy_inode(struct inode *inode) { + if (!list_empty((EXT3_I(inode)-i_orphan))) { + int i, imax; + unsigned int *p; + + p = (unsigned int *)EXT3_I(inode); + imax = sizeof(struct ext3_inode_info) / sizeof(unsigned int); + printk(Inode %p: orphan list check failed!\n, EXT3_I(inode)); + for (i = 0; i imax; i++) { + if (i ((i % 8) == 0)) + printk(\n); + printk(%08x , *p++); + } + printk(\n); + dump_stack(); umm, OK, but please use the new lib/hexdump.c for this. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Fw: ext3 dir_index causes an error
Ted is dir_index maintainer ;) That's a nice-looking bug report, btw. Thanks. Begin forwarded message: Date: Fri, 01 Jun 2007 13:01:07 +0900 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: ext3 dir_index causes an error Hello, First of all, I really appricate your great works. Now I've found a problem around dir_index feature. Here is a report following linux/REPORTING-BUGS. [1.] One line summary of the problem: ext3 dir_index causes an error [2.] Full description of the problem/report: This is my local test program to reproduce this problem. The readdir1.c calls creat(2), opendir(3) and readdir(3). And the shell script execute it repeatedly with a brand-new ext3fs image on a loopback device. When the script adds '-O dir_index' to mkfs, some errors appear. On a system with linux-2.6.21.3, ext3fs produces these error message, and the filesystem seems to be corrupted. -- kjournald starting. Commit interval 5 seconds EXT3 FS on loop0, internal journal EXT3-fs: mounted filesystem with ordered data mode. ::: EXT3-fs: mounted filesystem with ordered data mode. EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory #2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, name_len=249 EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory #2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, name_len=249 EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory #2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, name_len=249 kjournald starting. Commit interval 5 seconds ::: -- On the other system with linux-2.6.18 (debian etch kernel), the same error appears. When the script adds '-O ^dir_index' to mkfs, the problem never appears. It is not everytime that these errors appear. So the shell script executes the readdir1 test program repeatedly. Recently I upgraded my debian system from version 3.1 'sarge' to 4.0 'etch'. The debian etch sets the dir_index feature by default. So I found this problem. [3.] Keywords (i.e., modules, networking, kernel): ext3 dir_index [4.] Kernel information [4.1.] Kernel version (from /proc/version): [4.2.] Kernel .config file: [5.] Most recent kernel version which did not have the bug: [6.] Output of Oops.. message (if applicable) with symbolic information resolved (see Documentation/oops-tracing.txt) [7.] A small shell script or example program which triggers the problem (if possible) (readdir1.c) #include sys/types.h #include sys/stat.h #include fcntl.h #include dirent.h #include stdio.h #include stdlib.h #include unistd.h #include assert.h #include string.h #include errno.h void fin(char *s) { perror(s); exit(1); } void msg(int found, char *fname) { printf(%s%s found\n, fname, found?: not); } int main(int argc, char *argv[]) { DIR *dp; struct dirent *de; int err, found, i; char a[250]; err = chdir(argv[1]); if (err) fin(chdir); memset(a, 'a', sizeof(a)-1); a[sizeof(a)-1] = 0; for (i = 0; i 16+1; i++) { a[0]++; err = creat(a, 0644); if (err 0) fin(creat); err = creat(argv[2], 0644); if (err 0) fin(creat); } #if 0 err = unlink(argv[2]); if (err errno != ENOENT) fin(unlink); #endif dp = opendir(.); if (!dp) fin(opendir); de = readdir(dp); if (!de) fin(1st readdir); assert(strcmp(argv[2], de-d_name)); #if 0 argv[2][0]++; err = creat(argv[2], 0644); if (err 0) fin(creat); argv[2][0]--; #endif err = creat(argv[2], 0644); if (err 0) fin(creat); #if 0 err = unlink(argv[2]); if (err errno != ENOENT) fin(unlink); #endif found = 0; while ((de = readdir(dp)) !found) found = !strcmp(argv[2], de-d_name); msg(found, argv[2]); found = 0; rewinddir(dp); while ((de = readdir(dp)) !found) found = !strcmp(argv[2], de-d_name); msg(found, argv[2]); closedir(dp); dp = opendir(.); if (!dp) fin(opendir); found = 0; while ((de = readdir(dp)) !found) found = !strcmp(argv[2], de-d_name); msg(found, argv[2]); return 0; } -- #!/bin/sh img=rw.img dir=rw set -e make /tmp/readdir1 cd /dev/shm dd if=/dev/zero of=$img bs=1k count=4k 2
Re: [PATCH 0/6][TAKE4] fallocate system call
On Thu, 17 May 2007 19:41:15 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. I merged the first three patches into -mm, thanks. All the system call numbers got changed due to recent additions. They may change in the future, too - nothing is stable until the code lands in mainline. I didn't merge any of the ext4 changes as they appear to be in Ted's devel tree. Although I didn't check that they are 100% the same in that tree. What's the plan to get some ext4 updates into mainline, btw? Things seem to be rather gradual. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/9] readahead: introduce PG_readahead
On Sat, 19 May 2007 20:30:31 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote: On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: Introduce a new page flag: PG_readahead. Is there any way in which we can avoid adding a new page flag? We have the advantage that if the kernel very occasionally gets the wrong result for PageReadahead(page), nothing particularly bad will happen, so we can do racy things. From a quick peek, it appears that PG_readahead is only ever set against non-uptodate pages. If true we could perhaps exploit that: say, PageReadahead(page) == PG_referenced !PG_uptodate? PG_uptodate will flip to 1 before the reader touches the page :( OK. However, it may be possible to share the same bit with PG_reclaim or PG_booked. Which one would be preferred? I'd like PG_booked to go away too - I don't think we've put that under the microscope yet. If it remains then its scope will be defined by the filesystem, so readahead shouldn't use it. PG_reclaim belongs to core VFS so that's much better. Let's not do anything ugly, slow or silly in there, but please do take a look, see if there is an opportunity here. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: + block = offset blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) blkbits) + - block; + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). I think the use of ext4_journal_extend() (as Amit has proposed) will help here, but it is not sufficient. Because under some circumstances, a journal_extend() failure could mean that we fail to allocate all the required disk space. If it is infrequent enough, that is acceptable when the caller is using fallocate() for performance reasons. But it is very much not acceptable if the caller is using fallocate() for space-reservation reasons. If you used fallocate to reserve 1GB of disk and fallocate() succeeded and you later get ENOSPC then you'd have a right to get a bit upset. So I think the ext3/4 fallocate() implementation will need to be implemented as a loop: while (len) { journal_start(); len -= do_fallocate(len, ...); journal_stop(); } Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. Does the proposed implementation handle quotas correctly, btw? Has that been tested? Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 7 May 2007 15:21:04 -0700 Andreas Dilger [EMAIL PROTECTED] wrote: On May 07, 2007 13:58 -0700, Andrew Morton wrote: Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. hrm, spose so. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Totally blowing pagecache could be a problem. Fixable in userspace by using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao [EMAIL PROTECTED] wrote: + while (ret = 0 ret max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); + if (ret 0 test_bit(BH_New, map_bh.b_state) +((block + ret) (i_size_read(inode) blkbits))) + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) + goto retry; + Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend the block reservation window size before the while loop so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] wrote: 2.6.20.6, FC4: I created a 91248k ext3 fs with 4k blocksize: | mke2fs -j -b 4096 /dev/vol1/project | mke2fs 1.38 (30-Jun-2005) | Filesystem label= | OS type: Linux | Block size=4096 (log=2) | Fragment size=4096 (log=2) | 23552 inodes, 23552 blocks | 1177 blocks (5.00%) reserved for the super user | First data block=0 | Maximum filesystem blocks=25165824 | 1 block group | 32768 blocks per group, 32768 fragments per group | 23552 inodes per group Writing inode tables: done Creating journal (1024 blocks): done Writing superblocks and filesystem accounting information: done Next, I tried to resize it to about 3G using ext2online while mounted: | # ext2online /dev/vol1/project | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b | ext2online: ext2_ioctl: No space left on device | | ext2online: unable to resize /dev/mapper/vol1-project At that time the kernel said: |JBD: ext2online wants too many credits (744 256) What is the limitation I should be aware of? Has it something to do with the journal log size? The size actually did increase a bit, to 128112k. Steps to reproduce: Create a 3G partition, say /dev/vol1/project mke2fs -j -b 4096 /dev/vol1/project 22812 mount it ext2online /dev/vol1/project said: | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b | ext2online: ext2_ioctl: No space left on device | | ext2online: unable to resize /dev/mapper/vol1-project kernel said: | JBD: ext2online wants too many credits (721 256) (added linux-ext4 to cc) - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] ext3: remove inode constructor
On Sat, 5 May 2007 11:58:45 +0300 (EEST) Pekka J Enberg [EMAIL PROTECTED] wrote: On Fri, 4 May 2007, Andrew Morton wrote: I got 100% rejects against this because Christoph has already had his paws all over the slab constructor code everywhere. Was going to fix it up but then decided that we ought to make changes like this to ext4 as well. Ideally beforehand, but simultaneously is OK as long as it's simple enough. I'll send you proper patches for them (and will convert other filesystems too). May as well. On Fri, 4 May 2007, Andrew Morton wrote: btw, for a benchmark I'd suggest just a silly create-1-files tight loop rather than something more complex like postmark. Do you want me to redo the benchmarks or are you happy enough with the postmark numbers? I doubt if this is measurable, really. It'll be something like the difference between an L1 hit and an L2 hit in amongst all the other stuff we do on a per-inode basis. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote: Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/blocks within transaction, otherwise race is possible when we allocate blocks in transaction, then transacton starts to commit, then we mark pages/blocks to be flushed before commit. I don't understand. Can you please describe the race in more detail? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote: Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/blocks within transaction, otherwise race is possible when we allocate blocks in transaction, then transacton starts to commit, then we mark pages/blocks to be flushed before commit. I don't understand. Can you please describe the race in more detail? if I understood your idea right, then in data=ordered mode, commit thread writes all dirty mapped blocks before real commit. say, we have two threads: t1 is a thread doing flushing and t2 is a commit thread t1t2 find dirty inode I find some dirty unallocated blocks journal_start() allocate blocks attach them to I journal_stop() I'm still not understanding. The terms you're using are a bit ambiguous. What does find some dirty unallocated blocks mean? Find a page which is dirty and which does not have a disk mapping? Normally the above operation would be implemented via ext4_writeback_writepage(), and it runs under lock_page(). going to commit find inode I dirty do NOT find these blocks because they're allocated only, but pages/bhs aren't mapped to them start commit I think you're assuming here that commit would be using -t_sync_datalist to locate dirty buffer_heads. But under this proposal, t_sync_datalist just gets removed: the new ordered-data mode _only_ need to do the sb-inode-page walk. So if I'm understanding you, the way in which we'd handle any such race is to make kjournald's writeback of the dirty pages block in lock_page(). Once it gets the page lock it can look to see if some other thread has mapped the page to disk. It may turn out that kjournald needs a private way of getting at the I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we had the radix-tree-of-dirty-inodes thing then that's easy enough to do anyway, with a tagged search. But I expect that a single pass through the superblock's dirty inodes would suffice for ordered-data. Files which have chattr +j would screw things up, as usual. I assume (hope) that your delayed allocation code implements -writepages()? Doing the allocation one-page-at-a-time sounds painful... map pages/bhs to just allocate blocks so, either we mark pages/bhs someway within journal_start()--journal_stop() or commit thread should do lookup for all dirty pages. the latter doesn't sound nice, IMHO. I don't think I'm understanding you fully yet. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas [EMAIL PROTECTED] wrote: Andrew Morton wrote: I'm still not understanding. The terms you're using are a bit ambiguous. What does find some dirty unallocated blocks mean? Find a page which is dirty and which does not have a disk mapping? Normally the above operation would be implemented via ext4_writeback_writepage(), and it runs under lock_page(). I'm mostly worried about delayed allocation case. My impression was that holding number of pages locked isn't a good idea, even if they're locked in index order. so, I was going to turn number of pages writeback, then allocate blocks for all of them at once, then put proper blocknr's into bh's (or PG_mappedtodisk?). ooh, that sounds hacky and quite worrisome. If someone comes in and does an fsync() we've lost our synchronisation point. Yes, all callers happen to do lock_page(); wait_on_page_writeback(); (I think) but we've never considered a bare PageWriteback() as something which protects page internals. We're OK wrt page reclaim and we're OK wrt truncate and invalidate. As long as the page is uptodate we _should_ be OK wrt readpage(). But still, it'd be better to use the standard locking rather than inventing new rules, if poss. I'd be 100% OK with locking multiple pages in ascending pgoff_t order. Locking the page is the standard way of doing this synchronisation and the only problem I can think of is that having a tremendous number of pages locked could cause the wake_up_page() waitqueue hashes to get overloaded and go slow. But it's also possible to lock many, many pages with readahead and nobody has reported problems in there. going to commit find inode I dirty do NOT find these blocks because they're allocated only, but pages/bhs aren't mapped to them start commit I think you're assuming here that commit would be using -t_sync_datalist to locate dirty buffer_heads. nope, I mean sb-inode-page walk. But under this proposal, t_sync_datalist just gets removed: the new ordered-data mode _only_ need to do the sb-inode-page walk. So if I'm understanding you, the way in which we'd handle any such race is to make kjournald's writeback of the dirty pages block in lock_page(). Once it gets the page lock it can look to see if some other thread has mapped the page to disk. if I'm right holding number of pages locked, then they won't be locked, but writeback. of course kjournald can block on writeback as well, but how does it find pages with *newly allocated* blocks only? I don't think we'd want kjournald to do that. Even if a page was dirtied by an overwrite, we'd want to write it back during commit, just from a quality-of-implementation point of view. If we were to leave these pages unwritten during commit then a post-recovery file could have a mix of up-to-five-second-old data and up-to-30-seconds-old data. It may turn out that kjournald needs a private way of getting at the I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we had the radix-tree-of-dirty-inodes thing then that's easy enough to do anyway, with a tagged search. But I expect that a single pass through the superblock's dirty inodes would suffice for ordered-data. Files which have chattr +j would screw things up, as usual. not dirty inodes only, but rather some fast way to find pages with newly allocated pages. Newly allocated blocks, you mean? Just write out the overwritten blocks as well as the new ones, I reckon. It's what we do now. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.21-git4 Scheduler, NOHZ, VFS bugs
On Fri, 04 May 2007 18:20:51 +0200 Michal Piotrowski [EMAIL PROTECTED] wrote: I ran this script tree times, #! /bin/sh for i in `find /sys/ -type f` do echo wyĆwietlam $i sudo cat $i /dev/null done First run - scheduler bug Second run - NOHZ bug Third - VFS bug H... [93298.252601] BUG: at /mnt/md0/devel/linux-git/kernel/sched.c:3241 add_preempt_count() [93298.260334] [c0105039] show_trace_log_lvl+0x1a/0x2f [93298.265507] [c0105720] show_trace+0x12/0x14 [93298.269974] [c01057d2] dump_stack+0x16/0x18 [93298.274434] [c011d18a] add_preempt_count+0x89/0x8b [93298.279501] [c0126492] irq_enter+0xd/0x2e [93298.283788] [c0114c1e] smp_apic_timer_interrupt+0x2a/0x84 [93298.289458] [c0104b2b] apic_timer_interrupt+0x33/0x38 [93298.294783] [c0257b79] show_uevent+0x58/0xcd [93298.299329] === [93390.468056] NOHZ: local_softirq_pending 22 [93447.105850] NOHZ: local_softirq_pending 22 [93450.332884] BUG: unable to handle kernel paging request at virtual address 3e343c0c [93450.340626] printing eip: [93450.34] c018e2bc [93450.345520] *pde = [93450.348314] Oops: [#1] Nice. What was the last file which it read before crashing? Are you able to consistently crash it by reading just that file? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] ext3: remove inode constructor
On Fri, 4 May 2007 13:14:35 +0300 (EEST) Pekka J Enberg [EMAIL PROTECTED] wrote: As explained by Christoph Lameter, ext3_alloc_inode() touches the same cache line as init_once() so we gain nothing from using slab constructors. The SLUB allocator will be more effective without it (free pointer can be placed inside the free'd object), so move inode initialization to ext3_alloc_inode completely. I got 100% rejects against this because Christoph has already had his paws all over the slab constructor code everywhere. Was going to fix it up but then decided that we ought to make changes like this to ext4 as well. Ideally beforehand, but simultaneously is OK as long as it's simple enough. btw, for a benchmark I'd suggest just a silly create-1-files tight loop rather than something more complex like postmark. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html