Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Boaz Harrosh
On 03/30/2017 09:35 PM, Jeff Layton wrote:
<>
> Yeah, I imagine we'd need an on-disk change for this unless there's
> something already present that we could use in place of a crash counter.
> 

Perhaps we can use s_mtime and/or s_wtime in some way; I'm not sure
what the parallel for these is in xfs.
s_mtime - time of last mount
s_wtime - time of last mount, umount, freeze, unfreeze, remount, ...
Of course you'll need a per-FS vector to access that.

But this will need some math foo to get the bits compacted correctly.
Just a thought.
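
A rough sketch of the bit compaction I have in mind; the 16/48 split
and both parameter names are made up for illustration:

	/*
	 * Sketch: fold the low bits of the superblock write time into
	 * the high bits of the i_version value reported to clients, so
	 * any mount/umount/crash cycle changes the observed value even
	 * when in-memory i_version bumps were lost.
	 */
	#include <stdint.h>

	static inline uint64_t observed_i_version(uint64_t disk_i_version,
						  uint64_t s_wtime)
	{
		return ((s_wtime & 0xffffULL) << 48) |
		       (disk_i_version & ((1ULL << 48) - 1));
	}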

Thanks
Boaz



Re: mkfs.btrfs cannot find rotational file for SSD detection for a pmem device

2015-09-09 Thread Boaz Harrosh
On 09/09/2015 02:28 PM, Austin S Hemmelgarn wrote:
> On 2015-09-08 16:00, Elliott, Robert (Persistent Memory) wrote:
<>

> this may actually make things slower (the particular effect of SSD mode 
> is that it tries to spread allocations out as much as possible, as this 
> helps with wear-leveling on many SSD's).
> 

For DRAM-based NVDIMMs it matters not at all. For Flash-based ones or
the new 3D XPoint it is a plus, so there is no harm in leaving it in.

Just my 1.7 cents
Boaz



Re: mkfs.btrfs cannot find rotational file for SSD detection for a pmem device

2015-09-09 Thread Boaz Harrosh
On 09/09/2015 03:40 PM, Austin S Hemmelgarn wrote:
> On 2015-09-09 08:12, Boaz Harrosh wrote:
>> On 09/09/2015 02:28 PM, Austin S Hemmelgarn wrote:
>>> On 2015-09-08 16:00, Elliott, Robert (Persistent Memory) wrote:
>> <>
>>
>>> this may actually make things slower (the particular effect of SSD mode
>>> is that it tries to spread allocations out as much as possible, as this
>>> helps with wear-leveling on many SSD's).
>>>
>>
>> For DRAM-based NVDIMMs it matters not at all. For Flash-based ones or
>> the new 3D XPoint it is a plus, so there is no harm in leaving it in.
>>
> Looking at it from another perspective however, a lot of modern RAM 
> modules will stripe the bits across multiple chips to improve 
> performance.  In such a situation, BTRFS making the effort to spread out 
> the allocation as much as possible may have an impact because that 
> allocation path is slower than the regular one (not by much, but even a 
> few microseconds can make a difference when it is getting called a lot).
> 
> 

It is pointless to argue about this, but the allocations are 4k-aligned
either way, which means you always start right at the beginning of the
striping; the RAM striping is at cacheline (64-byte) granularity.

I think you will never find a single micro-benchmark that produces a
measurable difference.
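
To put numbers on that (a trivial user-space check, nothing
btrfs-specific):

	#include <assert.h>

	int main(void)
	{
		/* btrfs allocations are 4k aligned; DRAM interleave is per
		 * cacheline, so every allocation starts at the same point
		 * in the stripe pattern wherever the allocator put it. */
		for (unsigned int off = 0; off < 1u << 20; off += 4096)
			assert(off % 64 == 0);
		return 0;
	}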

Cheers
Boaz



Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Boaz Harrosh
On 12/02/2014 02:58 PM, Jan Kara wrote:
> On Fri 28-11-14 13:14:21, Ted Tso wrote:
> > On Fri, Nov 28, 2014 at 06:23:23PM +0100, Jan Kara wrote:
> > > Hum, when someone calls fsync() for an inode, you likely want to sync
> > > timestamps to disk even if everything else is clean. I think that doing
> > > what you did in last version:
> > > 	dirty = inode->i_state & I_DIRTY_INODE;
> > > 	inode->i_state &= ~I_DIRTY_INODE;
> > > 	spin_unlock(&inode->i_lock);
> > > 	if (dirty & I_DIRTY_TIME)
> > > 		mark_inode_dirty_sync(inode);
> > > looks better to me. IMO when someone calls __writeback_single_inode() we
> > > should write whatever we have...
> >
> > Yes, but we also have to distinguish between what happens on an
> > fsync() versus what happens on a periodic writeback if I_DIRTY_PAGES
> > (but not I_DIRTY_SYNC or I_DIRTY_DATASYNC) is set.  So there is a
> > check in the fsync() code path to handle the concern you raised above.
> Ah, this is the thing you have been likely talking about but which I was
> constantly missing in my thoughts. You don't want to write times when inode
> has only dirty pages and timestamps -

This I do not understand. I thought that I_DIRTY_TIME, and the whole
lazytime mount option, is only for atime. So if there are dirty
pages then there are also m/ctime changes, and surely we want to
write those times to disk ASAP.

If we are lazy also with m/ctime, then I think I would like an option
for lazy atime only, because m/ctime is cardinal to some operations
even though I might want atime lazy.
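
To spell out how I read the intended semantics (a toy model only, not
the actual patch, and the flag values here are made up):

	#include <stdbool.h>

	#define I_DIRTY_PAGES 0x1	/* data pages need writeback */
	#define I_DIRTY_SYNC  0x2	/* real metadata change: m/ctime, size, ... */
	#define I_DIRTY_TIME  0x4	/* timestamps only (lazytime) */

	static bool write_inode_now_p(unsigned int state, bool for_fsync)
	{
		if (state & I_DIRTY_SYNC)
			return true;		/* m/ctime change: write ASAP */
		if (state & I_DIRTY_TIME)
			return for_fsync;	/* lazy times: flush only on fsync */
		return false;			/* only dirty pages: data, not times */
	}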

Sorry for the slowness; I'm probably missing something.
Thanks
Boaz



Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Boaz Harrosh
On 11/25/2014 06:33 AM, Theodore Ts'o wrote:
> I was concerned about putting them on the dirty inode list because it
> would be extra inodes for the writeback threads to have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).
>
> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.

You could use the same list-head for both lists.

If the inode is already on the dirty-inode-list then there is no need
to add it to the list-for-dirtytime; it will be written soon anyway.
Otherwise you add it to the list-for-dirtytime.

If you (real-)dirty an inode, you first remove it from the
list-for-dirtytime and then add it to the dirty-inode-list.

So at any given time it is on only one list.
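
Something like this, in kernel-list style (a sketch only; locking is
omitted and the list/field names are made up):

	#include <linux/list.h>

	struct sketch_inode {
		struct list_head i_wb_list;	/* on at most one of the two lists */
	};

	static LIST_HEAD(dirty_inode_list);	/* really dirty: written soon */
	static LIST_HEAD(dirty_time_list);	/* timestamps only: written lazily */

	static void sketch_mark_dirty_time(struct sketch_inode *inode)
	{
		/* Already on a list (either one)?  Nothing to do: if it
		 * is the dirty-inode-list it will be written soon anyway. */
		if (list_empty(&inode->i_wb_list))
			list_add_tail(&inode->i_wb_list, &dirty_time_list);
	}

	static void sketch_mark_really_dirty(struct sketch_inode *inode)
	{
		/* Promote: leave the dirtytime list (if on it), then join
		 * the dirty-inode-list, so it is only ever on one list. */
		list_del_init(&inode->i_wb_list);
		list_add_tail(&inode->i_wb_list, &dirty_inode_list);
	}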



Cheers
Boaz


Re: atime and filesystems with snapshots (especially Btrfs)

2012-05-29 Thread Boaz Harrosh
On 05/25/2012 06:35 PM, Alexander Block wrote:
> Hello,
>
> (this is a resend with proper CC for linux-fsdevel and linux-kernel)
>
> I would like to start a discussion on atime in Btrfs (and other
> filesystems with snapshot support).
>
> As atime is updated on every access of a file or directory, we get
> many changes to the trees in btrfs that as always trigger cow
> operations. This is no problem as long as the changed tree blocks are
> not shared by other subvolumes. Performance is also not a problem, no
> matter if shared or not (thanks to relatime, which is the default).
> The problems start when someone starts to use snapshots. If you for
> example snapshot your root and continue working on your root, after
> some time big parts of the tree will be cowed and unshared. In the
> worst case, the whole tree gets unshared and thus takes up double the
> space. Normally, a user would expect to only use extra space for a
> tree if he changes something.
> A worst case scenario would be if someone took regular snapshots for
> backup purposes and later greps the contents of all snapshots to find
> a specific file. This would touch all inodes in all trees and thus
> make big parts of the trees unshared.
>
> relatime (which is the default) reduces this problem a little bit, as
> it by default only updates atime once a day. This means, if anyone
> wants to test this problem, mount with relatime disabled or change the
> system date before you try to update atime (that's the way I tested
> it).
>
> As a solution, I would suggest making noatime the default for btrfs.
> I'm however not sure if it is allowed in linux to have different
> default mount options for different filesystem types. I know this
> discussion pops up every few years (last time it resulted in making
> relatime the default). But this is a special case for btrfs. atime is
> already bad on other filesystems, but it's much much worse in btrfs.


Sounds like a real problem. I would suggest a few remedies:
1. Make a persistent filesystem parameter that says noatime/relatime/atime,
   so the default, if not specified at mount, is taken as a property of
   the FS (mkfs can set it).
2. The snapshot program should check and complain if atime is on, and
   recommend turning it off, since the problem only starts with a snapshot.
3. If space availability drops under some threshold, disable atime. As you
   said, this is catastrophic in that case, so the user can always search
   and delete files. In fact, if the IO was only because of atime, it
   should be a soft error: warned about and ignored.

But perhaps the true solution is to put atime in a side table, so only the
atime info gets COWed and not all the metadata.
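
The side-table idea in sketch form: a flat mapping from inode number to
atime kept in its own unshared structure, so an access COWs at most one
block of the table instead of the whole metadata path (names are
hypothetical, nothing like a real btrfs item):

	#include <stdint.h>

	struct atime_table {
		uint64_t *stamp;	/* stamp[ino] = atime, e.g. in ns */
		uint64_t nr;		/* number of slots */
	};

	static inline void atime_update(struct atime_table *t, uint64_t ino,
					uint64_t now_ns)
	{
		if (ino < t->nr)
			t->stamp[ino] = now_ns;	/* dirties one table block only */
	}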

Just my $0.017
Boaz

> Alex.




Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-23 Thread Boaz Harrosh
On 11/23/2010 12:02 PM, Nick Piggin wrote:
> Taking s_umount lock inside i_mutex can result in an ABBA deadlock:
>
>  ===
>  [ INFO: possible circular locking dependency detected ]
>  2.6.37-rc3+ #26
>  ---
>  append_writer/12828 is trying to acquire lock:
>   (&type->s_umount_key#24){+.}, at:
>   [8113d6d2] writeback_inodes_sb_if_idle+0x32/0x60
>
>  but task is already holding lock:
>   (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
>   [810cc863] generic_file_aio_write+0x53/0xd0
>
>  which lock already depends on the new lock.
>
>
>  the existing dependency chain (in reverse order) is:
>
>  -> #3 (&sb->s_type->i_mutex_key#14){+.+.+.}:
>     [810852c5] lock_acquire+0x95/0x1b0
>     [81601329] __mutex_lock_common+0x59/0x480
>     [8160182e] mutex_lock_nested+0x3e/0x50
>     [a003a147] ext4_end_io_work+0x37/0xb0 [ext4]
>     [81068378] process_one_work+0x1b8/0x5a0
>     [81069675] worker_thread+0x175/0x3a0
>     [8106e246] kthread+0x96/0xa0
>     [81003ed4] kernel_thread_helper+0x4/0x10
>
>  -> #2 ((io->work)){+.+...}:
>     [810852c5] lock_acquire+0x95/0x1b0
>     [81068364] process_one_work+0x1a4/0x5a0
>     [81069675] worker_thread+0x175/0x3a0
>     [8106e246] kthread+0x96/0xa0
>     [81003ed4] kernel_thread_helper+0x4/0x10
>
>  -> #1 (ext4-dio-unwritten){+.+...}:
>     [810852c5] lock_acquire+0x95/0x1b0
>     [81067bc8] flush_workqueue+0x148/0x540
>     [a004761b] ext4_sync_fs+0x3b/0x100 [ext4]
>     [8114304e] __sync_filesystem+0x5e/0x90
>     [81143132] sync_filesystem+0x32/0x60
>     [8111a97f] generic_shutdown_super+0x2f/0x100
>     [8111aa7c] kill_block_super+0x2c/0x50
>     [8111b1e5] deactivate_locked_super+0x45/0x60
>     [8111b415] deactivate_super+0x45/0x60
>     [81136430] mntput_no_expire+0xf0/0x190
>     [811376a9] sys_umount+0x79/0x3a0
>     [8100312b] system_call_fastpath+0x16/0x1b
>
>  -> #0 (&type->s_umount_key#24){+.}:
>     [81085122] __lock_acquire+0x1382/0x1490
>     [810852c5] lock_acquire+0x95/0x1b0
>     [81601ba2] down_read+0x42/0x60
>     [8113d6d2] writeback_inodes_sb_if_idle+0x32/0x60
>     [a0037efd] ext4_da_write_begin+0x20d/0x310 [ext4]
>     [810cbbf4] generic_file_buffered_write+0x114/0x2a0
>     [810cc5e0] __generic_file_aio_write+0x240/0x470
>     [810cc876] generic_file_aio_write+0x66/0xd0
>     [a002cfad] ext4_file_write+0x3d/0xd0 [ext4]
>     [81117702] do_sync_write+0xd2/0x110
>     [811179c8] vfs_write+0xc8/0x190
>     [811183ec] sys_write+0x4c/0x80
>     [8100312b] system_call_fastpath+0x16/0x1b
>
>  other info that might help us debug this:
>
>  1 lock held by append_writer/12828:
>   #0:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
>   [810cc863] generic_file_aio_write+0x53/0xd0
>
>  stack backtrace:
>  Pid: 12828, comm: append_writer Not tainted 2.6.37-rc3+ #26
>  Call Trace:
>   [81082c39] print_circular_bug+0xe9/0xf0
>   [81085122] __lock_acquire+0x1382/0x1490
>   [810852c5] lock_acquire+0x95/0x1b0
>   [8113d6d2] ? writeback_inodes_sb_if_idle+0x32/0x60
>   [81606c3d] ? sub_preempt_count+0x9d/0xd0
>   [81601ba2] down_read+0x42/0x60
>   [8113d6d2] ? writeback_inodes_sb_if_idle+0x32/0x60
>   [8113d6d2] writeback_inodes_sb_if_idle+0x32/0x60
>   [a0037efd] ext4_da_write_begin+0x20d/0x310 [ext4]
>   [81073dde] ? up_read+0x1e/0x40
>   [810cbbf4] generic_file_buffered_write+0x114/0x2a0
>   [8104fa22] ? current_fs_time+0x22/0x30
>   [810cc5e0] __generic_file_aio_write+0x240/0x470
>   [810cc863] ? generic_file_aio_write+0x53/0xd0
>   [810cc876] generic_file_aio_write+0x66/0xd0
>   [a002cfad] ext4_file_write+0x3d/0xd0 [ext4]
>   [81150db8] ? fsnotify+0x88/0x5e0
>   [81117702] do_sync_write+0xd2/0x110
>   [81606c3d] ? sub_preempt_count+0x9d/0xd0
>   [816027b9] ? trace_hardirqs_on_thunk+0x3a/0x3f
>   [81022e9a] ? smp_apic_timer_interrupt+0x6a/0xa0
>   [811179c8] vfs_write+0xc8/0x190
>   [811183ec] sys_write+0x4c/0x80
>   [8100312b] system_call_fastpath+0x16/0x1b
>
> Also, there can be an AA deadlock if the filesystem takes i_mutex in a
> workqueue, if it also waits for work completion while holding i_mutex.
>
> SysRq : Show Blocked State
>   task                     PC stack   pid father
> kworker/9:1   D            6296   118      2
> Call Trace:
>  [81039431] ? get_parent_ip+0x11/0x50
>  [8160145c] __mutex_lock_common+0x18c/0x480
>  [a0042147] ?

Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-23 Thread Boaz Harrosh
On 11/23/2010 12:54 PM, Nick Piggin wrote:
> On Tue, Nov 23, 2010 at 12:26:31PM +0200, Boaz Harrosh wrote:
> > >   *
> > >   * Invoke writeback_inodes_sb if no writeback is currently underway.
> > >   * Returns 1 if writeback was started, 0 if not.
> > > + *
> > > + * Even if 1 is returned, writeback may not be started if memory allocation
> > > + * fails. This function makes no guarantees about anything.
> > >   */
> > >  int writeback_inodes_sb_if_idle(struct super_block *sb)
> > >  {
> > >  	if (!writeback_in_progress(sb->s_bdi)) {
> > > -		down_read(&sb->s_umount);
> > > -		writeback_inodes_sb(sb);
> > > -		up_read(&sb->s_umount);
> > > +		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
> > >  		return 1;
> > > -	} else
> > > -		return 0;
> > > +	}
> > > +	return 0;
> > >  }
> > >  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
> > >
> > > @@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
> > >   *
> > >   * Invoke writeback_inodes_sb if no writeback is currently underway.
> > >   * Returns 1 if writeback was started, 0 if not.
> > > + *
> > > + * Even if 1 is returned, writeback may not be started if memory allocation
> > > + * fails. This function makes no guarantees about anything.
> > >   */
> > >  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
> > >  				   unsigned long nr)
> > >  {
> > >  	if (!writeback_in_progress(sb->s_bdi)) {
> > > -		down_read(&sb->s_umount);
> > > -		writeback_inodes_sb_nr(sb, nr);
> > > -		up_read(&sb->s_umount);
> > > +		bdi_start_writeback(sb->s_bdi, nr);
> > >  		return 1;
> > > -	} else
> > > -		return 0;
> > > +	}
> > > +	return 0;
> > >  }
> > >  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
> >
> > static inline int writeback_inodes_sb_if_idle(struct super_block *sb)
> > {
> > 	return writeback_inodes_sb_nr_if_idle(sb, get_nr_dirty_pages());
> > }
> >
> > In writeback.h, No?
 
> I didn't care enough to move it :P I don't know if it matters.

Then please just open-code it in ext4 and completely remove it.
One stupid function is enough, don't you think?
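
I.e., at the one ext4 call site, something like this (a sketch of the
suggestion, untested):

	/* in ext4_da_write_begin(), instead of the helper: */
	if (!writeback_in_progress(sb->s_bdi))
		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());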

 
> > But it has a single user so please just kill it.
> >
> > Also writeback_inodes_sb_nr_if_idle() has a single user. Combined with the
> > above, two users. Why not open code it in the two sites? It would be much
> > clearer to understand what the magic is all about.
>
> The filesystem shouldn't be aware of the details (the magic) of how to
> kick writeback, so I think the abstraction is right as is.
 

bdi_start_writeback() looks like a very clear abstraction to me, but
if you have plans for it then sure, I'm convinced.

> Thanks,
> Nick
 

Thanks
Boaz


Re: [PATCH V3 0/8] Cleancache: overview

2010-08-03 Thread Boaz Harrosh
On 07/24/2010 12:17 AM, Dan Magenheimer wrote:
> > > On Fri, Jul 23, 2010 at 06:58:03AM -0700, Dan Magenheimer wrote:
> > > > CHRISTOPH AND ANDREW, if you disagree and your concerns have
> > > > not been resolved, please speak up.
> >
> > Hi Christoph --
> >
> > Thanks very much for the quick (instantaneous?) reply!
> >
> > > Anything that needs modification of a normal non-shared fs is utterly
> > > broken and you'll get a clear NAK, so the proposal before is a good
> > > one.
> >
> > No, the per-fs opt-in is very sensible; and its design is
> > very minimal.
>
> Not to belabor the point, but maybe the right way to think about
> this is:
>
> Cleancache is a new optional feature provided by the VFS layer
> that potentially dramatically increases page cache effectiveness
> for many workloads in many environments at a negligible cost.
>
> Filesystems that are well-behaved and conform to certain restrictions
> can utilize cleancache simply by making a call to cleancache_init_fs
> at mount time.  Unusual, misbehaving, or poorly layered filesystems
> must either add additional hooks and/or undergo extensive additional
> testing... or should just not enable the optional cleancache.
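
(For context, the opt-in Dan describes would look roughly like this in
a filesystem's mount path; this is a sketch from his description and I
have not checked the exact signature in the V3 patches:)

	static int myfs_fill_super(struct super_block *sb, void *data,
				   int silent)
	{
		/* ... usual superblock setup ... */
		cleancache_init_fs(sb);	/* opt in; omit the call to stay out */
		return 0;
	}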

OK, so I maintain a filesystem in the kernel. How do I know if my FS
is not "unusual, misbehaving, or poorly layered"?

Thanks
Boaz


Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Boaz Harrosh
Peter Zijlstra wrote:
> On Mon, 2009-01-12 at 08:20 -0800, Linus Torvalds wrote:
> > On Mon, 12 Jan 2009, Linus Torvalds wrote:
> > > You made it back into the locked version.
> >
> > Btw, even if you probably had some reason for this, one thing to note is
> > that I think Chris' performance testing showed that the version using a
> > lock was inferior to his local btrfs hack, while the unlocked version
> > actually beat his hack.
> >
> > Maybe I misunderstood his numbers, though. But if I followed that sub-part
> > of the test right, it really means that the locked version is pointless -
> > it will never be able to replace people's local hacks for this same thing,
> > because it just doesn't give the performance people are looking for.
> >
> > Since the whole (and _only_) point of this thing is to perform well,
> > that's a big deal.
>
> Like said in reply to Chris' email, I just wanted to see if fairness was
> worth the effort, because the pure unlocked spin showed significant
> unfairness (and I know some people really care about some level of
> fairness).
 

Which brings me back to my initial reaction to this work. Do we need
two flavors of mutex? Some program sections need fairness, some need
performance. Some need low latency, some need absolute raw CPU power.

Because at the end of the day, spinning in a saturated-CPU workload
that does not care about latency eats away cycles that could be spent
on computation. Think multi-threaded video processing, for example.
The things I would like to measure are:
1 - how many times we spin and at the end get the lock
2 - how many times we spin and at the end sleep
3 - how many times we sleep like before
vs. the CPU time spent on scheduling in the old case, just to see if
we are actually losing cycles at the end.
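
Roughly the counters I have in mind (a sketch, not against any real
tree; a real patch would use per-cpu counters):

	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_ulong spin_then_acquired;	/* 1: spun, then got the lock */
	static atomic_ulong spin_then_slept;	/* 2: spun, gave up, then slept */
	static atomic_ulong slept_directly;	/* 3: slept like before */

	static void account_mutex_wait(bool spun, bool got_lock_while_spinning)
	{
		if (!spun)
			atomic_fetch_add(&slept_directly, 1);
		else if (got_lock_while_spinning)
			atomic_fetch_add(&spin_then_acquired, 1);
		else
			atomic_fetch_add(&spin_then_slept, 1);
	}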

> Initial testing with the simple test-mutex thing didn't show too bad
> numbers.
 

Boaz