[PATCH] ext4: dir inode reservation V3
The basic idea of my dir inode reservation patch can be found here:
http://lists.openwall.net/linux-ext4/2007/11/05/3

1, What does dir inode reservation do

Dir inode reservation tries to reserve several inodes in the inode table for a directory when the directory is created. When a new file is created under this directory, its inode is allocated from the reserved area if possible. This is called the dir_ireserve inode allocator.

2, What does dir inode reservation help

If we create a huge number of directories and create files in each directory alternately (as some web proxies or mail servers do), the current linear inode allocator in ext[234] causes unnecessary disk accesses and seeks during creation/unlink, which results in poor performance.

Here is an example. The uppercase letters represent directory inodes, and the corresponding lowercase letters represent the inodes of files under that directory. When the number of directories is much larger than the number of block groups, some directory inodes will be allocated in the inode table like this:

    -- blk 0 --
    A B C D E F H I

When files are added to each directory alternately, some time later (if each directory has 2 files) the layout of this inode table will be:

    -- blk 0 --        -- blk 1 --        -- blk 2 --
    A B C D E F H I    a b c d e f h i    a b c d e f h i

Now suppose these directories are unlinked recursively in hashed order. Every time a file is unlinked from a directory, the inode of the directory must also be written back to disk. If each I/O session can write only 1 block to disk (this is the simplest condition, which will not happen in practice), the access sequence is:

    1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0

From the above sequence, we can see that removing 8 directories and their files requires writing 32 blocks to disk. The dir inode reservation patch tries to improve unlink performance in this situation.
In the dir_ireserve inode allocator, directory inodes are allocated like this:

    -- blk 0 --   -- blk 1 --   ...   -- blk 6 --   -- blk 7 --
    A             B                   H             I

If we create new files under each directory, the layout of the inode table will be:

    -- blk 0 --   -- blk 1 --   ...   -- blk 6 --   -- blk 7 --
    A a a         B b b               H h h         I i i

Because file inodes are very near to the parent directory inode, when these files are removed recursively, the directory inode and its file inodes can be written to disk in one I/O session. If each I/O session can write only 1 block to disk (as we assumed before), the access sequence is:

    0 1 2 3 4 5 6 7

With the dir_ireserve inode allocator, 24 extra block I/Os (32 - 8) can be avoided. At the same time, because file inodes are near to the parent directory inode, we also avoid unnecessary disk seeks between file inodes and the directory inode.

From benchmarks, I also found a helpful side effect of dir inode reservation: it is 10%~30% faster to create these directories and files, and the speedup grows as more directories are created. The reason is the same: when new files are created, the parent directory inode is written to disk too. Placing file inodes near the directory inode lets many redundant block I/Os be merged.

3, Full compatible to current ext[234]

This is the 3rd release (and 5th housekeeping version) of my dir inode reservation patch. V1 was only for feasibility research and V2 was incompatible with ext[23]; both required patching e2fsprogs. V3 patches only the kernel, with no e2fsprogs modification. V3 makes no on-disk modification or file system data structure change, therefore it is fully compatible with current ext[234].

4, Dir inode reservation is optional

Dir inode reservation is optional. You can use -o followed by one of these options to enable dir inode reservation when mounting an ext4 file system:

    dir_ireserve=low
    dir_ireserve=normal
    dir_ireserve=high

Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high' reserves 127 inodes.
Reserving more than 127 inodes does not noticeably help performance.

5, Performance number

On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create 5 directories, and create 15 (1KB) files in each directory alternately. After a remount, I tried to remove all the directories and files recursively with 'rm -rf'. Below is the benchmark result:

                   normal ext4          ext4 with dir inode reservation
    mount options: -o data=writeback    -o data=writeback,dir_ireserve=low
    Create dirs:   real 0m49.101s       real 2m59.703s
    Create files:  real 24m17.962s      real 21m8.161s
    Unlink all:    real 24m43.788s      real 17m29.862s

Creating dirs with dir inode
Re: [PATCH] ext4: dir inode reservation V3
hmm. so you trade 265% degradation of creation for 40% improvement of unlink?

thanks, Alex

Coly Li wrote:
>                normal ext4          ext4 with dir inode reservation
> mount options: -o data=writeback    -o data=writeback,dir_ireserve=low
> Create dirs:   real 0m49.101s       real 2m59.703s
> Create files:  real 24m17.962s      real 21m8.161s
> Unlink all:    real 24m43.788s      real 17m29.862s
>
> Creating dirs with dir inode reservation is slower than normal ext4 as
> predicted, because allocating directory inodes in non-linear order will
> cause extra hard disk seeking and block I/O.
>
> Creating files with dir inode reservation is 13% faster than normal ext4.
>
> Unlink all the directories and files is 29.2% faster as expected.
>
> When number of directories is increased, the performance improvement will
> be more considerable. More benchmark result will be posted here if
> necessary, because I need more time to run more test cases.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: dir inode reservation V3
Thanks for the feedback :-)

Alex Tomas wrote:
> hmm. so you trade 265% degradation of creation for 40% improvement of unlink?

The 265% degradation is only for creating 5 empty directories. This is not a common case. There is a 13% improvement on creating 15 files in each directory. The total time for creating these directories and files is 25m6s vs. 24m86s; indeed, dir inode reservation is a little faster.

Most people will probably not create dozens of empty directories in their applications, therefore IMHO the 265% degradation is acceptable. If users really need to create so many empty directories, they can also mount the file system without dir inode reservation to get better performance.

-- 
Coly Li
SuSE PRC Labs
Re: [PATCH] ext4: dir inode reservation V3
Coly Li wrote:
> Thanks for the feedback :-)
>
> Alex Tomas wrote:
>> hmm. so you trade 265% degradation of creation for 40% improvement of unlink?
>
> The 265% degradation is only for creating 5 empty directories. This is not
> a common case. There is a 13% improvement on creating 15 files in each
> directory. The total time for creating these directories and files is
> 25m6s vs. 24m86s; indeed, dir inode reservation is a little faster.

Sorry, a typo here: it's 25m6s vs. 24m7.86s.

> Most people will probably not create dozens of empty directories in their
> applications, therefore IMHO the 265% degradation is acceptable. If users
> really need to create so many empty directories, they can also mount the
> file system without dir inode reservation to get better performance.

-- 
Coly Li
SuSE PRC Labs
Re: Oops 2.6.23.1 in ext3+jbd at journal_put_journal_head
Hello,

A one-time event thus far, happened under very heavy I/O, Dell i9400 Core2Duo notebook w/3GB ram, single SATA drive with ext3.

Had to cycle power to get it back and see this Oops in the syslog:

: BUG: unable to handle kernel paging request at virtual address 430a7261
: printing eip:
: c01a6605
: *pde =
: Oops: 0002 [#1]
: PREEMPT SMP
: Modules linked in: nls_iso8859_1 vfat fat usb_storage ide_core libusual
:   hci_usb ext2 loop nls_cp437 isofs zlib_inflate udf vmnet(P) vmblock(P)
:   vmmon(P) binfmt_misc rfcomm l2cap bluetooth nfs nfsd exportfs lockd
:   nfs_acl auth_rpcgss sunrpc acpi_cpufreq cpufreq_stats cpufreq_userspace
:   cpufreq_ondemand cpufreq_conservative freq_table cpufreq_powersave
:   container fan firmware_class af_packet pciehp usbhid hid pci_hotplug
:   visor usbserial fuse mousedev snd_hda_intel snd_pcm_oss snd_pcm
:   snd_mixer_oss snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi
:   snd_seq_midi_event serio_raw snd_seq snd_timer snd_seq_device sg thermal
:   firewire_ohci snd pcspkr sr_mod cdrom psmouse firewire_core sdhci
:   mmc_core b44 mii crc_itu_t ac uhci_hcd ehci_hcd intel_agp agpgart
:   processor button soundcore snd_page_alloc usbcore battery unix
: CPU: 1
: EIP: 0060:[journal_put_journal_head+64/209] Tainted: P VLI
: EFLAGS: 00010202 (2.6.23.1-slab #15)
: EIP is at journal_put_journal_head+0x40/0xd1
: eax: c2bf7000 ebx: 430a7261 ecx: edx: c24f4780
: esi: f000fea5 edi: c24f4780 ebp: f000fea5 esp: c2bf7e38
: ds: 007b es: 007b fs: 00d8 gs: ss: 0068

Hmm, your pointer to buffer_head in journal_head has been overwritten by some garbage - it actually looks like ASCII (C\n ra). I think your journal_head pointer is stored in EAX (at least if I compile an SMP kernel for i386 it is) and that is 0xc2bf7000 - start of the page. So probably some driver went wild and overwrote a piece of memory which did not belong to it... I suggest turning on a few debugging options (like DEBUG_SLAB) to catch the offender.

: Process kswapd0 (pid: 243, ti=c2bf7000 task=c29ec030 task.ti=c2bf7000)
: Stack: d6216868 002a d4f52670 0034 f000fea5 c01a34b6
: 0246 c003fe08 ef082d98 ef082d4c ef082d4c c29f00c0 c003fdb8 c0198a8f
: 0002ad02 ef082d4c c0145b21 c24f4780 000b c014a5e0 000e
: Call Trace:
: [journal_try_to_free_buffers+299/383] journal_try_to_free_buffers+0x12b/0x17f
: [ext3_releasepage+0/114] ext3_releasepage+0x0/0x72
: [try_to_release_page+48/66] try_to_release_page+0x30/0x42
: [__invalidate_mapping_pages+116/231] __invalidate_mapping_pages+0x74/0xe7
: [invalidate_mapping_pages+15/17] invalidate_mapping_pages+0xf/0x11
: [shrink_icache_memory+219/445] shrink_icache_memory+0xdb/0x1bd
: [shrink_slab+217/338] shrink_slab+0xd9/0x152
: [kswapd+729/1069] kswapd+0x2d9/0x42d
: [autoremove_wake_function+0/53] autoremove_wake_function+0x0/0x35
: [kswapd+0/1069] kswapd+0x0/0x42d
: [kthread+56/95] kthread+0x38/0x5f
: [kthread+0/95] kthread+0x0/0x5f
: [kernel_thread_helper+7/16] kernel_thread_helper+0x7/0x10
: ===
: Code: 89 e0 25 00 f0 ff ff ff 48 14 8b 40 08 a8 04 74 05 e8 a6 d6 0e 00
:   f3 90 89 e0 25 00 f0 ff ff ff 40 14 8b 03 a9 00 00 40 00 75 d5 f0 0f
:   ba 2b 16 19 c0 85 c0 75 ec 8b 46 04 85 c0 7f 30 c7 44 24
: EIP: [journal_put_journal_head+64/209] journal_put_journal_head+0x40/0xd1 SS:ESP 0068:c2bf7e38
: note: kswapd0[243] exited with preempt_count 2

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: Oops 2.6.23.1 in ext3+jbd at journal_put_journal_head
Jan Kara wrote:
> Hello,
>
> A one-time event thus far, happened under very heavy I/O,
> Dell i9400 Core2Duo notebook w/3GB ram, single SATA drive with ext3.
>
> Had to cycle power to get it back and see this Oops in the syslog:
..
> Hmm, your pointer to buffer_head in journal_head has been overwritten by
> some garbage - it actually looks like ASCII (C\n ra). I think your
> journal_head pointer is stored in EAX (at least if I compile an SMP kernel
> for i386 it is) and that is 0xc2bf7000 - start of the page. So probably
..

As for me, I'm guessing a use/free race somewhere, but with only the information from the Oops that's hard to know.

> some driver went wild and overwrote a piece of memory which did not
> belong to it... I suggest turning on a few debugging options (like
> DEBUG_SLAB) to catch the offender.
..

You mean, like, this:

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.23.1-slab
# Wed Nov 7 08:00:18 2007
#
CONFIG_X86_32=y
...
CONFIG_DEBUG_SLAB=y
...

Though since it was..

> A one-time event thus far
...

.. I'm unlikely to see it again.

Cheers
Re: [PATCH] Introduce ext4_find_next_bit
On Wed, 14 Nov 2007 00:41:03 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote:

> Andrew Morton wrote:
>> On Fri, 21 Sep 2007 10:55:05 +0530 Aneesh Kumar K.V [EMAIL PROTECTED] wrote:
>>
>>> Also add generic_find_next_le_bit
>>>
>>> This gets used by the ext4 multi block allocator patches.
>>
>> arm allmodconfig:
>>
>> fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy':
>> fs/ext4/mballoc.c:836: error: implicit declaration of function `ext2_find_next_bit'
>>
>> This patch makes my head spin.
>>
>> Why did we declare generic_find_next_le_bit() in
>> include/asm-powerpc/bitops.h (wrong) as well as in
>> include/asm-generic/bitops/le.h (presumably correct)?
>
> I was following the coding style used for rest of the APIs
> like ext4_set_bit.

Well. There's quite a bit of cruft in there. If you do come across something which isn't right, please do try to find the time to fix it up first. That might be non-trivial - powerpc does seem to have gone off on a strange tangent there.

>> Why is it touching a powerpc file and not any other architectures?
>> Something screwed up in powerpc land?
>>
>> And why did arm break?
>
> arm and the below list of archs don't include
> asm-generic/bitops/ext2-non-atomic.h
>
> I did a grep and that lists the below architectures as also affected:
> arm, m68k, m68knommu, s390

Shudder.

>> Anyway, please fix, and if that fix requires that various braindamage be
>> repaired, please repair the braindamage rather than going along with it.
>
> That should be a separate patch altogether. I wanted to do the cleanup
> along with the usages such as
>
> #define ocfs2_set_bit ext2_set_bit
> #define udf_set_bit(nr,addr) ext2_set_bit(nr,addr)
>
> direct usage in mb
> md/bitmap.c +799
> md/dm-log.c +177
>
> but never got time to do the same.
>
> I will send a patch tomorrow that fixes arm and other architectures. I
> guess the cleanup can be a separate patch ?

Yes, that's a separate work, thanks.
Re: [PATCH 01/13][e2fsprogs] Add initial checksum support.
On Sun, Oct 14, 2007 at 10:46:05PM -0400, Theodore Tso wrote:
> In crc16.h, this patch assumes that linux/types.h defines uint16_t.
> There are a couple of problems with this. #1) linux/types.h is
> non-portable, not only does it not exist on non-Linux systems,
> apparently on Ubuntu it's not always defining uint16_t. On my Ubuntu
> gutsy system, it doesn't always get defined.

I hope and believe uint16_t is everywhere -- ISO C99: 7.18 Integer types <stdint.h>.

    Karel

-- 
Karel Zak [EMAIL PROTECTED]
[PATCH] - [4/15] - remove defconfig ptr comparisons to 0 - fs/jbd
Remove defconfig ptr comparison to 0

Remove sparse warning: Using plain integer as NULL pointer

Signed-off-by: Joe Perches [EMAIL PROTECTED]
---

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 5d14243..0459657 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1619,14 +1619,14 @@ static int journal_init_journal_head_cache(void)
 {
 	int retval;
 
-	J_ASSERT(journal_head_cache == 0);
+	J_ASSERT(!journal_head_cache);
 	journal_head_cache = kmem_cache_create("journal_head",
 				sizeof(struct journal_head),
 				0,		/* offset */
 				SLAB_TEMPORARY,	/* flags */
 				NULL);		/* ctor */
 	retval = 0;
-	if (journal_head_cache == 0) {
+	if (!journal_head_cache) {
 		retval = -ENOMEM;
 		printk(KERN_EMERG "JBD: no memory for journal_head cache\n");
 	}
diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index ad2eacf..d5f8eee 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -173,13 +173,13 @@ int __init journal_init_revoke_caches(void)
 					   0,
 					   SLAB_HWCACHE_ALIGN|SLAB_TEMPORARY,
 					   NULL);
-	if (revoke_record_cache == 0)
+	if (!revoke_record_cache)
 		return -ENOMEM;
 
 	revoke_table_cache = kmem_cache_create("revoke_table",
 					   sizeof(struct jbd_revoke_table_s),
 					   0, SLAB_TEMPORARY, NULL);
-	if (revoke_table_cache == 0) {
+	if (!revoke_table_cache) {
 		kmem_cache_destroy(revoke_record_cache);
 		revoke_record_cache = NULL;
 		return -ENOMEM;
[PATCH] - [5/15] - remove defconfig ptr comparisons to 0 - fs/jbd2
Remove defconfig ptr comparison to 0

Remove sparse warning: Using plain integer as NULL pointer

Signed-off-by: Joe Perches [EMAIL PROTECTED]
---

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 6ddc553..ca74850 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -218,7 +218,7 @@ static int jbd2_journal_start_thread(journal_t *journal)
 	if (IS_ERR(t))
 		return PTR_ERR(t);
 
-	wait_event(journal->j_wait_done_commit, journal->j_task != 0);
+	wait_event(journal->j_wait_done_commit, journal->j_task);
 	return 0;
 }
 
@@ -230,7 +230,7 @@ static void journal_kill_thread(journal_t *journal)
 	while (journal->j_task) {
 		wake_up(&journal->j_wait_commit);
 		spin_unlock(&journal->j_state_lock);
-		wait_event(journal->j_wait_done_commit, journal->j_task == 0);
+		wait_event(journal->j_wait_done_commit, !journal->j_task);
 		spin_lock(&journal->j_state_lock);
 	}
 	spin_unlock(&journal->j_state_lock);
@@ -1629,14 +1629,14 @@ static int journal_init_jbd2_journal_head_cache(void)
 {
 	int retval;
 
-	J_ASSERT(jbd2_journal_head_cache == 0);
+	J_ASSERT(!jbd2_journal_head_cache);
 	jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
 				sizeof(struct journal_head),
 				0,		/* offset */
 				0,		/* flags */
 				NULL);		/* ctor */
 	retval = 0;
-	if (jbd2_journal_head_cache == 0) {
+	if (!jbd2_journal_head_cache) {
 		retval = -ENOMEM;
 		printk(KERN_EMERG "JBD: no memory for journal_head cache\n");
 	}
@@ -1662,14 +1662,14 @@ static struct journal_head *journal_alloc_journal_head(void)
 	atomic_inc(&nr_journal_heads);
 #endif
 	ret = kmem_cache_alloc(jbd2_journal_head_cache, GFP_NOFS);
-	if (ret == 0) {
+	if (!ret) {
 		jbd_debug(1, "out of memory for journal_head\n");
 		if (time_after(jiffies, last_warning + 5*HZ)) {
 			printk(KERN_NOTICE "ENOMEM in %s, retrying.\n",
 			       __FUNCTION__);
 			last_warning = jiffies;
 		}
-		while (ret == 0) {
+		while (!ret) {
 			yield();
 			ret = kmem_cache_alloc(jbd2_journal_head_cache, GFP_NOFS);
 		}
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 3595fd4..ec81511 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -172,13 +172,13 @@ int __init jbd2_journal_init_revoke_caches(void)
 	jbd2_revoke_record_cache = kmem_cache_create("jbd2_revoke_record",
 					   sizeof(struct jbd2_revoke_record_s),
 					   0, SLAB_HWCACHE_ALIGN, NULL);
-	if (jbd2_revoke_record_cache == 0)
+	if (!jbd2_revoke_record_cache)
 		return -ENOMEM;
 
 	jbd2_revoke_table_cache = kmem_cache_create("jbd2_revoke_table",
 					   sizeof(struct jbd2_revoke_table_s),
 					   0, 0, NULL);
-	if (jbd2_revoke_table_cache == 0) {
+	if (!jbd2_revoke_table_cache) {
 		kmem_cache_destroy(jbd2_revoke_record_cache);
 		jbd2_revoke_record_cache = NULL;
 		return -ENOMEM;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index b1fcf2b..036e7ef 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1164,7 +1164,7 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	}
 
 	/* That test should have eliminated the following case: */
-	J_ASSERT_JH(jh, jh->b_frozen_data == 0);
+	J_ASSERT_JH(jh, !jh->b_frozen_data);
 
 	JBUFFER_TRACE(jh, "file as BJ_Metadata");
 	spin_lock(&journal->j_list_lock);
@@ -1512,7 +1512,7 @@ void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
 
 	J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
 	if (jh->b_jlist != BJ_None)
-		J_ASSERT_JH(jh, transaction != 0);
+		J_ASSERT_JH(jh, transaction);
 
 	switch (jh->b_jlist) {
 	case BJ_None:
@@ -1581,11 +1581,11 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
 	if (buffer_locked(bh) || buffer_dirty(bh))
 		goto out;
 
-	if (jh->b_next_transaction != 0)
+	if (jh->b_next_transaction)
 		goto out;
 
 	spin_lock(&journal->j_list_lock);
-	if (jh->b_transaction != 0 && jh->b_cp_transaction == 0) {
+	if (jh->b_transaction && !jh->b_cp_transaction) {
 		if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) {
 			/* A written-back ordered data buffer */
 			JBUFFER_TRACE(jh, "release data");
@@ -1593,7 +1593,7 @@