Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On 03/30/2017 09:35 PM, Jeff Layton wrote:
<>
> Yeah, I imagine we'd need an on-disk change for this unless there's
> something already present that we could use in place of a crash counter.
> Perhaps we can use s_mtime and/or s_wtime in some way,

I'm not sure what the parallel for that is in xfs.

s_mtime - time of last mount
s_wtime - time of last mount, umount, freeze, unfreeze, remount, ...

Of course you'll need a per-FS vector to access that. But this will need
some math foo to get the bits compacted correctly.

Just a thought.

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mkfs.btrfs cannot find rotational file for SSD detection for a pmem device
On 09/09/2015 02:28 PM, Austin S Hemmelgarn wrote:
> On 2015-09-08 16:00, Elliott, Robert (Persistent Memory) wrote:
<>
> this may actually make things slower (the particular effect of SSD mode
> is that it tries to spread allocations out as much as possible, as this
> helps with wear-leveling on many SSD's).

For a DRAM-based NVDIMM it matters not at all. For flash-based devices or
the new 3D XPoint it is a plus, so there is no harm in leaving it in.

Just my 1.7 cents
Boaz
Re: mkfs.btrfs cannot find rotational file for SSD detection for a pmem device
On 09/09/2015 03:40 PM, Austin S Hemmelgarn wrote:
> On 2015-09-09 08:12, Boaz Harrosh wrote:
>> On 09/09/2015 02:28 PM, Austin S Hemmelgarn wrote:
>>> On 2015-09-08 16:00, Elliott, Robert (Persistent Memory) wrote:
>> <>
>>
>>> this may actually make things slower (the particular effect of SSD mode
>>> is that it tries to spread allocations out as much as possible, as this
>>> helps with wear-leveling on many SSD's).
>>
>> For a DRAM-based NVDIMM it matters not at all. For flash-based devices or
>> the new 3D XPoint it is a plus, so there is no harm in leaving it in.
>>
> Looking at it from another perspective however, a lot of modern RAM
> modules will stripe the bits across multiple chips to improve
> performance. In such a situation, BTRFS making the effort to spread out
> the allocation as much as possible may have an impact because that
> allocation path is slower than the regular one (not by much, but even a
> few microseconds can make a difference when it is getting called a lot).

It is pointless to argue about this, but the allocations are 4k aligned
any which way, which means you are always right at the beginning of the
stripe; the RAM striping is at cacheline (64-byte) granularity. I think
you will never find a single micro-benchmark that will ever show a
difference.

Cheers
Boaz
Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option
On 12/02/2014 02:58 PM, Jan Kara wrote:
> On Fri 28-11-14 13:14:21, Ted Tso wrote:
>> On Fri, Nov 28, 2014 at 06:23:23PM +0100, Jan Kara wrote:
>>> Hum, when someone calls fsync() for an inode, you likely want to sync
>>> timestamps to disk even if everything else is clean. I think that
>>> doing what you did in the last version:
>>>
>>>	dirty = inode->i_state & I_DIRTY_INODE;
>>>	inode->i_state &= ~I_DIRTY_INODE;
>>>	spin_unlock(&inode->i_lock);
>>>	if (dirty & I_DIRTY_TIME)
>>>		mark_inode_dirty_sync(inode);
>>>
>>> looks better to me. IMO when someone calls __writeback_single_inode()
>>> we should write whatever we have...
>>
>> Yes, but we also have to distinguish between what happens on an fsync()
>> versus what happens on a periodic writeback if I_DIRTY_PAGES (but not
>> I_DIRTY_SYNC or I_DIRTY_DATASYNC) is set. So there is a check in the
>> fsync() code path to handle the concern you raised above.
>
> Ah, this is the thing you have been likely talking about but which I was
> constantly missing in my thoughts. You don't want to write times when
> the inode has only dirty pages and timestamps.

This I do not understand. I thought that I_DIRTY_TIME, and the whole
lazytime mount option, is only for atime. So if there are dirty pages
then there are also m/ctime changes, and surely we want to write these
times to disk ASAP.

If we are lazy also with m/ctime then I think I would like an option for
only-atime-lazy, because m/ctime is cardinal to some operations, even
though I might want atime lazy.

Sorry for the slowness, I'm probably missing something.

Thanks
Boaz
Re: [PATCH 2/4] vfs: add support for a lazytime mount option
On 11/25/2014 06:33 AM, Theodore Ts'o wrote:
> I was concerned about putting them on the dirty inode list because it
> would be extra inodes the writeback threads would have to skip over and
> ignore (since they would not be dirty in the inode or data pages sense).
>
> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode structure,
> which some might view as bloat.

You could use the same list head for both lists. If the inode is on the
dirty-inode list then there is no need to add it to the dirtytime list;
it will be written soon anyway. Otherwise you add it to the dirtytime
list. If you then (real-)dirty an inode, you first remove it from the
dirtytime list and then add it to the dirty-inode list.

So at any given time it is only on one list.

Cheers
Boaz
Re: atime and filesystems with snapshots (especially Btrfs)
On 05/25/2012 06:35 PM, Alexander Block wrote:
> Hello,
>
> (this is a resend with proper CC for linux-fsdevel and linux-kernel)
>
> I would like to start a discussion on atime in Btrfs (and other
> filesystems with snapshot support). As atime is updated on every access
> of a file or directory, we get many changes to the trees in btrfs that
> as always trigger cow operations. This is no problem as long as the
> changed tree blocks are not shared by other subvolumes. Performance is
> also not a problem, no matter if shared or not (thanks to relatime which
> is the default). The problems start when someone starts to use
> snapshots. If you for example snapshot your root and continue working on
> your root, after some time big parts of the tree will be cowed and
> unshared. In the worst case, the whole tree gets unshared and thus takes
> up the double space. Normally, a user would expect to only use extra
> space for a tree if he changes something.
>
> A worst case scenario would be if someone took regular snapshots for
> backup purposes and later greps the contents of all snapshots to find a
> specific file. This would touch all inodes in all trees and thus make
> big parts of the trees unshared.
>
> relatime (which is the default) reduces this problem a little bit, as it
> by default only updates atime once a day. This means, if anyone wants to
> test this problem, mount with relatime disabled or change the system
> date before you try to update atime (that's the way I tested it).
>
> As a solution, I would suggest to make noatime the default for btrfs.
> I'm however not sure if it is allowed in linux to have different default
> mount options for different filesystem types. I know this discussion
> pops up every few years (last time it resulted in making relatime the
> default). But this is a special case for btrfs. atime is already bad on
> other filesystems, but it's much much worse in btrfs.

Sounds like a real problem. I would suggest a few remedies:

1. Make a persistent filesystem parameter that says noatime/relatime/atime,
   so the default, if not specified on mount, is taken as a property of
   the FS (mkfs can set it).

2. The snapshot program should check and complain if atime is on, and
   recommend turning it off, since the problem only starts with a snapshot.

3. If space availability drops under some threshold, disable atime. As
   you said, this is catastrophic in that case, so the user can always
   search and delete files. In fact, if the IO was only because of atime,
   it should be a soft error: warned about and ignored.

But perhaps the true solution is to put atime in a side table, so only
the atime info gets COWed and not all the metadata.

Just my $0.017
Boaz

> Alex.
Re: [patch] fs: fix deadlocks in writeback_if_idle
On 11/23/2010 12:02 PM, Nick Piggin wrote:
> Taking the s_umount lock inside i_mutex can result in an ABBA deadlock:
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.37-rc3+ #26
> -------------------------------------------------------
> append_writer/12828 is trying to acquire lock:
>  (&type->s_umount_key#24){+.}, at: [8113d6d2] writeback_inodes_sb_if_idle+0x32/0x60
>
> but task is already holding lock:
>  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [810cc863] generic_file_aio_write+0x53/0xd0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&sb->s_type->i_mutex_key#14){+.+.+.}:
>        [810852c5] lock_acquire+0x95/0x1b0
>        [81601329] __mutex_lock_common+0x59/0x480
>        [8160182e] mutex_lock_nested+0x3e/0x50
>        [a003a147] ext4_end_io_work+0x37/0xb0 [ext4]
>        [81068378] process_one_work+0x1b8/0x5a0
>        [81069675] worker_thread+0x175/0x3a0
>        [8106e246] kthread+0x96/0xa0
>        [81003ed4] kernel_thread_helper+0x4/0x10
>
> -> #2 (&(io->work)){+.+...}:
>        [810852c5] lock_acquire+0x95/0x1b0
>        [81068364] process_one_work+0x1a4/0x5a0
>        [81069675] worker_thread+0x175/0x3a0
>        [8106e246] kthread+0x96/0xa0
>        [81003ed4] kernel_thread_helper+0x4/0x10
>
> -> #1 (ext4-dio-unwritten){+.+...}:
>        [810852c5] lock_acquire+0x95/0x1b0
>        [81067bc8] flush_workqueue+0x148/0x540
>        [a004761b] ext4_sync_fs+0x3b/0x100 [ext4]
>        [8114304e] __sync_filesystem+0x5e/0x90
>        [81143132] sync_filesystem+0x32/0x60
>        [8111a97f] generic_shutdown_super+0x2f/0x100
>        [8111aa7c] kill_block_super+0x2c/0x50
>        [8111b1e5] deactivate_locked_super+0x45/0x60
>        [8111b415] deactivate_super+0x45/0x60
>        [81136430] mntput_no_expire+0xf0/0x190
>        [811376a9] sys_umount+0x79/0x3a0
>        [8100312b] system_call_fastpath+0x16/0x1b
>
> -> #0 (&type->s_umount_key#24){+.}:
>        [81085122] __lock_acquire+0x1382/0x1490
>        [810852c5] lock_acquire+0x95/0x1b0
>        [81601ba2] down_read+0x42/0x60
>        [8113d6d2] writeback_inodes_sb_if_idle+0x32/0x60
>        [a0037efd] ext4_da_write_begin+0x20d/0x310 [ext4]
>        [810cbbf4] generic_file_buffered_write+0x114/0x2a0
>        [810cc5e0] __generic_file_aio_write+0x240/0x470
>        [810cc876] generic_file_aio_write+0x66/0xd0
>        [a002cfad] ext4_file_write+0x3d/0xd0 [ext4]
>        [81117702] do_sync_write+0xd2/0x110
>        [811179c8] vfs_write+0xc8/0x190
>        [811183ec] sys_write+0x4c/0x80
>        [8100312b] system_call_fastpath+0x16/0x1b
>
> other info that might help us debug this:
>
> 1 lock held by append_writer/12828:
>  #0:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [810cc863] generic_file_aio_write+0x53/0xd0
>
> stack backtrace:
> Pid: 12828, comm: append_writer Not tainted 2.6.37-rc3+ #26
> Call Trace:
>  [81082c39] print_circular_bug+0xe9/0xf0
>  [81085122] __lock_acquire+0x1382/0x1490
>  [810852c5] lock_acquire+0x95/0x1b0
>  [8113d6d2] ? writeback_inodes_sb_if_idle+0x32/0x60
>  [81606c3d] ? sub_preempt_count+0x9d/0xd0
>  [81601ba2] down_read+0x42/0x60
>  [8113d6d2] ? writeback_inodes_sb_if_idle+0x32/0x60
>  [8113d6d2] writeback_inodes_sb_if_idle+0x32/0x60
>  [a0037efd] ext4_da_write_begin+0x20d/0x310 [ext4]
>  [81073dde] ? up_read+0x1e/0x40
>  [810cbbf4] generic_file_buffered_write+0x114/0x2a0
>  [8104fa22] ? current_fs_time+0x22/0x30
>  [810cc5e0] __generic_file_aio_write+0x240/0x470
>  [810cc863] ? generic_file_aio_write+0x53/0xd0
>  [810cc876] generic_file_aio_write+0x66/0xd0
>  [a002cfad] ext4_file_write+0x3d/0xd0 [ext4]
>  [81150db8] ? fsnotify+0x88/0x5e0
>  [81117702] do_sync_write+0xd2/0x110
>  [81606c3d] ? sub_preempt_count+0x9d/0xd0
>  [816027b9] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [81022e9a] ? smp_apic_timer_interrupt+0x6a/0xa0
>  [811179c8] vfs_write+0xc8/0x190
>  [811183ec] sys_write+0x4c/0x80
>  [8100312b] system_call_fastpath+0x16/0x1b
>
> Also, there can be an AA deadlock if the filesystem takes i_mutex in a
> workqueue, if it also waits for work completion while holding i_mutex.
>
> SysRq : Show Blocked State
>   task                        PC stack   pid father
> kworker/9:1   D 6296   118      2
> Call Trace:
>  [81039431] ? get_parent_ip+0x11/0x50
>  [8160145c] __mutex_lock_common+0x18c/0x480
>  [a0042147] ?
Re: [patch] fs: fix deadlocks in writeback_if_idle
On 11/23/2010 12:54 PM, Nick Piggin wrote:
> On Tue, Nov 23, 2010 at 12:26:31PM +0200, Boaz Harrosh wrote:
>>>   *
>>>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>>>   * Returns 1 if writeback was started, 0 if not.
>>> + *
>>> + * Even if 1 is returned, writeback may not be started if memory allocation
>>> + * fails. This function makes no guarantees about anything.
>>>   */
>>>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>>>  {
>>>  	if (!writeback_in_progress(sb->s_bdi)) {
>>> -		down_read(&sb->s_umount);
>>> -		writeback_inodes_sb(sb);
>>> -		up_read(&sb->s_umount);
>>> +		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
>>>  		return 1;
>>> -	} else
>>> -		return 0;
>>> +	}
>>> +	return 0;
>>>  }
>>>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>>>
>>> @@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>>>   *
>>>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>>>   * Returns 1 if writeback was started, 0 if not.
>>> + *
>>> + * Even if 1 is returned, writeback may not be started if memory allocation
>>> + * fails. This function makes no guarantees about anything.
>>>   */
>>>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb, unsigned long nr)
>>>  {
>>>  	if (!writeback_in_progress(sb->s_bdi)) {
>>> -		down_read(&sb->s_umount);
>>> -		writeback_inodes_sb_nr(sb, nr);
>>> -		up_read(&sb->s_umount);
>>> +		bdi_start_writeback(sb->s_bdi, nr);
>>>  		return 1;
>>> -	} else
>>> -		return 0;
>>> +	}
>>> +	return 0;
>>>  }
>>>  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
>>
>> 	static inline int writeback_inodes_sb_if_idle(struct super_block *sb)
>> 	{
>> 		return writeback_inodes_sb_nr_if_idle(sb, get_nr_dirty_pages());
>> 	}
>>
>> In writeback.h, no?
>
> I didn't care enough to move it :P I don't know if it matters.

Then please just open-code it in ext4 and completely remove it. One
stupid function is enough, don't you think?

>> But it has a single user so please just kill it. Also
>> writeback_inodes_sb_nr_if_idle() has a single user.
>
> Combined with above, two users.

Why not open-code it in the two sites? It should be much clearer to
understand what the magic is all about.

> The filesystem shouldn't be aware of the details (the magic) of how to
> kick writeback, so I think the abstraction is right as is.

bdi_start_writeback() looks like a very clear abstraction to me, but if
you have plans for it then sure, I'm convinced.

> Thanks,
> Nick

Thanks
Boaz
Re: [PATCH V3 0/8] Cleancache: overview
On 07/24/2010 12:17 AM, Dan Magenheimer wrote:
>> On Fri, Jul 23, 2010 at 06:58:03AM -0700, Dan Magenheimer wrote:
>>> CHRISTOPH AND ANDREW, if you disagree and your concerns have not been
>>> resolved, please speak up.
>>
>> Hi Christoph --
>>
>> Thanks very much for the quick (instantaneous?) reply!
>>
>>> Anything that needs modification of a normal non-shared fs is utterly
>>> broken and you'll get a clear NAK, so the proposal before is a good one.
>>
>> No, the per-fs opt-in is very sensible; and its design is very minimal.
>
> Not to belabor the point, but maybe the right way to think about this is:
>
> Cleancache is a new optional feature provided by the VFS layer that
> potentially dramatically increases page cache effectiveness for many
> workloads in many environments at a negligible cost.
>
> Filesystems that are well-behaved and conform to certain restrictions
> can utilize cleancache simply by making a call to cleancache_init_fs at
> mount time. Unusual, misbehaving, or poorly layered filesystems must
> either add additional hooks and/or undergo extensive additional
> testing... or should just not enable the optional cleancache.

OK. So I maintain a filesystem in the kernel. How do I know if my FS is
not "unusual, misbehaving, or poorly layered"?

Thanks
Boaz
Re: [PATCH -v8][RFC] mutex: implement adaptive spinning
Peter Zijlstra wrote:
> On Mon, 2009-01-12 at 08:20 -0800, Linus Torvalds wrote:
>> On Mon, 12 Jan 2009, Linus Torvalds wrote:
>>> You made it back into the locked version.
>>
>> Btw, even if you probably had some reason for this, one thing to note
>> is that I think Chris' performance testing showed that the version
>> using a lock was inferior to his local btrfs hack, while the unlocked
>> version actually beat his hack.
>>
>> Maybe I misunderstood his numbers, though. But if I followed that
>> sub-part of the test right, it really means that the locked version is
>> pointless - it will never be able to replace people's local hacks for
>> this same thing, because it just doesn't give the performance people
>> are looking for. Since the whole (and _only_) point of this thing is
>> to perform well, that's a big deal.
>
> Like said in reply to Chris' email, I just wanted to see if fairness
> was worth the effort, because the pure unlocked spin showed significant
> unfairness (and I know some people really care about some level of
> fairness).

Which brings me back to my initial reaction to this work. Do we need two
flavors of mutex? Some program sections need fairness, some need
performance. Some need low latency, some need absolute raw CPU power.

Because at the end of the day, spinning in a saturated-CPU workload that
does not care about latency eats away cycles that could be spent on
computation. Think multi-threaded video processing, for example.

The things I would like to measure are:
1 - how many times we spin and at the end get the lock
2 - how many times we spin and at the end sleep
3 - how many times we sleep like before
vs. the CPU spent on scheduling in the old case.

Just to see if we are actually losing cycles at the end.

> Initial testing with the simple test-mutex thing didn't show too bad
> numbers.

Boaz