Re: [RFC v2] Improve subvolume usability for a normal user

2017-12-10 Thread ein
And also, how to prevent creation of the snapshots by the user.

On 12/11/2017 08:30 AM, ein wrote:
> On 12/11/2017 07:38 AM, Misono, Tomohiro wrote:
>> - Change the default behavior to allow a user to delete subvolume which is 
>> empty
> From sysadmin point of view I think it's worth considering the following
> scenario(s):
> what if admin wants one persistent snapshot undeletable by the user?
> - snapshots created by the root in user work tree should not be deleted
> by the user (snapshot owner should be root?), but we may want also
> permissions, filesystem ACLs and extend ACLs consistency
> - snapshots with chattr +i should be not deleted by the user, even if he
> created it.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2] Improve subvolume usability for a normal user

2017-12-10 Thread ein
On 12/11/2017 07:38 AM, Misono, Tomohiro wrote:
> - Change the default behavior to allow a user to delete subvolume which is 
> empty

>From sysadmin point of view I think it's worth considering the following
scenario(s):
what if admin wants one persistent snapshot undeletable by the user?
- snapshots created by the root in user work tree should not be deleted
by the user (snapshot owner should be root?), but we may want also
permissions, filesystem ACLs and extend ACLs consistency
- snapshots with chattr +i should be not deleted by the user, even if he
created it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC v2] Improve subvolume usability for a normal user

2017-12-10 Thread Misono, Tomohiro
Hello all,

I reflected the comments of the first version of the RFC[1]. 
Thanks for all those who commented.

The summary of updated proposal is:
 - Change the default behavior to allow a user to delete subvolume which is 
empty
 - Add 2 new non-root ioctls to get subvolume/quota info under the specified 
path

Please see the 'Proposal' section below for the detail.

More comments are welcome.
Regards,
Tomohiro Misono

[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg70666.html

==
- Goal and current problem
The goal of this RFC is to give a normal user more control to their own 
subvolumes.
Currently the control to subvolumes for a normal user is restricted as below: 

+-+--+--+
|   command   | root | user |
+-+--+--+
| sub create  | Y| Y|
| sub snap| Y| Y|
| sub del | Y| N|
| sub list| Y| N|
| sub show| Y| N|
| qgroup show | Y| N|
+-+--+--+

In short, I want to change this as below in order to improve user's usability:

+-+--++
|   command   | root | user   |
+-+--++
| sub create  | Y| Y  |
| sub snap| Y| Y  |
| sub del | Y| N -> Y |
| sub list| Y| N -> Y |
| sub show| Y| N -> Y |
| qgroup show | Y| N -> Y |
+-+--++

In words,
(1) allow deletion of subvolume (if it is empty) and
(2) allow getting subvolume/quota info (under the specified path)

I think other commands not listed above (qgroup limit, send/receive etc.) 
should be done
by root and not be allowed for a normal user.


- Proposal
 (1) deletion of subvolume

  Change the default behavior for a user to allow to delete a subvolume (by 
"subvol del") if 
1. the user has write+exec right to it, and
2. it is empty
  So, it is the same as user_subvol_rm_allowed option with emptiness check.

  Emptiness check is needed because Snapshot creation wont' check the 
permission and 
  can copy a dir which cannot be deleted by the user, and therefore just 
allowing deletion 
  may cause data loss.

  Summary of behavior by different condition is as follows:

  
+===++==+
  |   |Current |   Proposal 
  |
  
+===++==+
  | root  | Can delete all | Same as the 
current  |
  
+---++--+
  | user (user_subvol_rm_allowed) | Can delete if he   | Same as the 
current  |
  |   | has write+exec right   |
  |
  
+---++--+
  | user (default)| Cannot delete anything | Can delete if he   
  |
  |   || has write+exec 
right |
  |   || and is empty   
  |
  
+---++--+
 
 (2) getting subvolume/quota info

  Introduce 2 new ioctls to get subol/quota info under the specified path 
(which needs
  to be able to be opened by the user) and modify INO_LOOKUP to check permission
  during path construction for a normal user. 

  Current approach cannot be used directly for a normal user as explained below:
TREE_SEARCH ioctl is used to retrieve the subvolume/quota info by 
btrfs-progs
(sub show/list, qgroup show etc.). This requires the root permission. Also, 
in order to construct the path, INO_LOOKUP will be called afterwards, 
which also
requires root permission and omits the permission check during path 
construction.
  
The easiest way to allow a user to get subvolume/quota info is just relaxing
the permission of TREE_SEARCH. However, since all the tree information (inc.
file name) will be exposed, this poses a sequrity risk and is not 
acceptable.
 
  The detail of new ioctls and approach is here:
   [subvolume info]
Searching ROOT tree for ROOT_ITEM/ROOT_BACKREF under the specified path, 
and 
checking its read right by searching FS/FILE tree and comparing the mode 
with caller's uid.

After this ioctl is called, btrfs-progs calls modified INO_LOOKUP to 
construct the path
with permission check. In case path construction fails due to permission, 
btrfs-progs
skips to output the infomation of the subvolume.
  
   [quota info]
Same as above, but mainly searching QUOTA tree.


- Summary of Proposal
 - Change the default behavior to allow a user to delete subvolume which is 
empty
 - Add 2 new non-root ioctls to get subvolume/quota info under the specified 
path
==

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" 

Re: defragmenting best practice?

2017-12-10 Thread Timofey Titovets
2017-12-11 8:18 GMT+03:00 Dave :
> On Tue, Oct 31, 2017 someone wrote:
>>
>>
>> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
>> > nocow -- it will NOT be snapshotted
>
> I did exactly this. It servers the purpose of avoiding snapshots.
> However, today I saw the following at
> https://wiki.archlinux.org/index.php/Btrfs
>
> Note: From Btrfs Wiki Mount options: within a single file system, it
> is not possible to mount some subvolumes with nodatacow and others
> with datacow. The mount option of the first mounted subvolume applies
> to any other subvolumes.
>
> That makes me think my nodatacow mount option on $HOME/.cache is not
> effective. True?
>
> (My subjective performance results have not been as good as hoped for
> with the tweaks I have tried so far.)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

True, for magic dirs, that you may want mark as no cow, you need to
use chattr, like:
rm -rf ~/.cache
mkdir ~/.cache
chattr +C ~/.cache

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-12-10 Thread Dave
On Tue, Oct 31, 2017 someone wrote:
>
>
> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
> > nocow -- it will NOT be snapshotted

I did exactly this. It servers the purpose of avoiding snapshots.
However, today I saw the following at
https://wiki.archlinux.org/index.php/Btrfs

Note: From Btrfs Wiki Mount options: within a single file system, it
is not possible to mount some subvolumes with nodatacow and others
with datacow. The mount option of the first mounted subvolume applies
to any other subvolumes.

That makes me think my nodatacow mount option on $HOME/.cache is not
effective. True?

(My subjective performance results have not been as good as hoped for
with the tweaks I have tried so far.)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-10 Thread Matthew Wilcox
On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote:
> i.e.  the fact the cmpxchg failed may not have anything to do with a
> race condtion - it failed because the slot wasn't empty like we
> expected it to be. There can be any number of reasons the slot isn't
> empty - the API should not "document" that the reason the insert
> failed was a race condition. It should document the case that we
> "couldn't insert because there was an existing entry in the slot".
> Let the surrounding code document the reason why that might have
> happened - it's not for the API to assume reasons for failure.
> 
> i.e. this API and potential internal implementation makes much
> more sense:
> 
> int
> xa_store_iff_empty(...)
> {
>   curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS);
>   if (!curr)
>   return 0;   /* success! */
>   if (!IS_ERR(curr))
>   return -EEXIST; /* failed - slot not empty */
>   return PTR_ERR(curr);   /* failed - XA internal issue */
> }
> 
> as it replaces the existing preload and insert code in the XFS code
> paths whilst letting us handle and document the "insert failed
> because slot not empty" case however we want. It implies nothing
> about *why* the slot wasn't empty, just that we couldn't do the
> insert because it wasn't empty.

Yeah, OK.  So, over the top of the recent changes I'm looking at this:

I'm not in love with xa_store_empty() as a name.  I almost want
xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot
down, I'm leery of it.  "empty" is at least a concept we already have
in the API with the comment for xa_init() talking about an empty array
and xa_empty().  I also considered xa_store_null and xa_overwrite_null
and xa_replace_null().  Naming remains hard.

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 941f38bb94a4..586b43836905 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -451,7 +451,7 @@ xfs_iget_cache_miss(
int flags,
int lock_flags)
 {
-   struct xfs_inode*ip, *curr;
+   struct xfs_inode*ip;
int error;
xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino);
int iflags;
@@ -498,8 +498,7 @@ xfs_iget_cache_miss(
xfs_iflags_set(ip, iflags);
 
/* insert the new inode */
-   curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS);
-   error = __xa_race(curr, -EAGAIN);
+   error = xa_store_empty(>pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN);
if (error)
goto out_unlock;
 
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 5792b6dbb040..cc7cc5253a67 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -271,43 +271,30 @@ static inline int xa_err(void *entry)
 }
 
 /**
- * __xa_race() - Turn a cmpxchg result into an errno.
- * @entry: Result from calling an XArray function.
- * @errno: Error number to return if we lost the race.
+ * xa_store_empty() - Store this entry in the XArray unless another entry is
+ * already present.
+ * @xa: XArray.
+ * @index: Index into array.
+ * @entry: New entry.
+ * @gfp: Memory allocation flags.
+ * @rc: Number to return if another entry was present.
  *
- * Like xa_race(), but returns the error number of your choice.  Calling
- * __xa_race(entry, 0) has the same result (but is less efficient) as
- * calling xa_err().
+ * Like xa_store(), but will fail and return the supplied error number if
+ * the existing entry at @index is not %NULL.
  *
  * Return: A negative errno or 0.
  */
-static inline int __xa_race(void *entry, int errno)
+static inline int xa_store_empty(struct xarray *xa, unsigned long index,
+   void *entry, gfp_t gfp, int errno)
 {
-   if (!entry)
+   void *curr = xa_cmpxchg(xa, index, NULL, entry, gfp);
+   if (!curr)
return 0;
-   if (xa_is_err(entry))
-   return (long)entry >> 2;
+   if (xa_is_err(curr))
+   return xa_err(curr);
return errno;
 }
 
-/**
- * xa_race() - Turn a cmpxchg result into an errno.
- * @entry: Result from calling an XArray function.
- *
- * It is common to use xa_cmpxchg() to ensure that only one thread assigns
- * a value to an index.  Pass the result from xa_cmpxchg() to xa_race() to
- * get an errno back.  This function also handles any other error which
- * may have been returned by xa_cmpxchg() such as ENOMEM.
- *
- * If you don't care that you lost the race, you can use xa_err() instead.
- *
- * Return: A negative errno or 0.
- */
-static inline int xa_race(void *entry)
-{
-   return __xa_race(entry, -EEXIST);
-}
-
 /* Everything below here is the Advanced API.  Proceed with caution. */
 
 #define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 85d1bc963ab6..87ed55af823e 100644
--- a/mm/backing-dev.c
+++ 

Ref doesn't match the record start and is compressed

2017-12-10 Thread Kai Krakow
Hello!

Since my system was temporarily overloaded (load 500+ for probably
several hours but it recovered, no reboot needed), I'm seeing
btrfs-related backtraces in dmesg with the fs going RO.

I tried to fix it but it fails:

> Fixed 0 roots.
> checking extents
> ref mismatch on [3043919785984 57344] extent item 1, found 3
> Data backref 3043919785984 root 265 owner 83535 offset 131072 num_refs 0 not 
> found in extent tree
> Incorrect local backref count on 3043919785984 root 265 owner 83535 offset 
> 131072 found 1 wanted 0 back 0x61854760
> Backref disk bytenr does not match extent record, bytenr=3043919785984, ref 
> bytenr=3043919790080
> Backref bytes do not match extent backref, bytenr=3043919785984, ref 
> bytes=57344, backref bytes=40960
> Data backref 3043919785984 root 259 owner 11522804 offset 0 num_refs 0 not 
> found in extent tree
> Incorrect local backref count on 3043919785984 root 259 owner 11522804 offset 
> 0 found 1 wanted 0 back 0x57892430
> Backref disk bytenr does not match extent record, bytenr=3043919785984, ref 
> bytenr=3043919831040
> Backref bytes do not match extent backref, bytenr=3043919785984, ref 
> bytes=57344, backref bytes=8192
> backpointer mismatch on [3043919785984 57344]
> attempting to repair backref discrepency for bytenr 3043919785984
> Ref doesn't match the record start and is compressed, please take a 
> btrfs-image of this file system and send it to a btrfs developer so they can 
> complete this functionality for bytenr 3043919790080
> failed to repair damaged filesystem, aborting
> enabling repair mode
> Checking filesystem on /dev/bcacache48 8 UUID:
> bc201ce5-8f2b-4263-995a-6641e89d4c88  

When I identify the files with "btrfs inspect-internal" and try to
delete them, btrfs fails and goes into RO mode:

[  735.877040] [ cut here ]
[  735.877048] WARNING: CPU: 2 PID: 809 at fs/btrfs/extent-tree.c:7053 
__btrfs_free_extent.isra.71+0x740/0xbc0 

   
[  735.877049] Modules linked in: uas usb_storage r8168(O) nvidia_drm(PO) 
vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_uvm(PO) 
nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs 
   
[  735.877065] CPU: 2 PID: 809 Comm: umount Tainted: P   O
4.14.0-pf4 #1   


[  735.877066] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 
Pro3, BIOS L2.16A 02/22/2013

  
[  735.877067] task: 88040b6cf080 task.stack: c900035b  


  
[  735.877070] RIP: 0010:__btrfs_free_extent.isra.71+0x740/0xbc0


  
[  735.877071] RSP: 0018:c900035b3c00 EFLAGS: 00010246  


  
[  735.877073] RAX: fffe RBX: 02c4b7c2 RCX: 
c900035b3bb8

  
[  735.877074] RDX: 02c4b7c2d000 RSI: 88021c22f607 RDI: 
c900035b3bb8

  
[  735.877075] RBP: fffe R08:  R09: 
0001464f

  
[  735.877076] R10: 0002 R11: 076d R12: 
880414618000

  
[  735.877077] R13: 880406b3d850 R14:  R15: 
0109   

Re: ERROR: failed to repair root items: Input/output error

2017-12-10 Thread Duncan
Tomasz Pala posted on Sun, 10 Dec 2017 16:38:05 +0100 as excerpted:

> On Sun, Dec 10, 2017 at 15:18:32 +, constantine wrote:
> 
>> I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is
>> within warranty.
>> I can't mount it read-write ("no rw mounting  after error").
> 
> There is a data-corruption issue with this controller!
> The same as 840 EVO - just google this.

That's a bit alarmist...

It's not a problem with recent kernels (I too have a Samsung 850
EVO 1 TB), as you mention below.  Given the focus of this list, btrfs,
and the fact that btrfs is still stabilizing, not yet fully stable and
mature, so reasonably new kernels are strongly recommended (with the
second newest mainline-LTS kernel series, now 4.9 since 4.14 is LTS,
being the oldest recommended)...

Anyone already following btrfs-list kernel recommendations is already
/well/ beyond the kernel versions you mention below as adding the
blacklisting, so it shouldn't be a problem.

And he mentions Ubuntu with kernel 4.13, so he's at least well beyond
the problem kernels now, tho of course it's possible he was running
an older one previously, and that's what did the damage.

> In short: either use recent kernel (AFAIR 4.0.5+ for 840 EVO and some
> newer for entire 8* Samsung SSD family blacklisting) or disable NCQ.
> 
> Using queued TRIM on this drive leads to data loss! Firmware zeroes
> fist 512 bytes of a block, sorry.
> 
> If you only had smaller drive, as 850s up to 512 GB have different
> controller...
> 
>> checksum verify failed on 103009173504 found 25334496 wanted 3500
>> bytenr mismatch, want=103009173504, have=889192478
>> ERROR: failed to repair root items: Input/output error
>> 
>> What do these errors mean?
>> What should I do to fix the filesystem and be able to mount it
>> read-write?
> 
> You probably can't fix this - there is data missing on bare metal, so
> you should recover using backups. If you don't have one, you need to
> perform manual data recovery procedures (like photorec) with little
> chances to restore complete files due to the nature of data loss
> (beginning of blocks).

I'll agree here.  First sysadmin rule of backups.  The value of your
data is defined not by empty claims, but by the number of backups you
consider it worth having of that data.  No backups simply defines the
data as of only trivial value, worth less than the time, trouble and
resources necessary to make that backup.

So in the event of loss of the working copy, you can always rest easy.
Either there was a backup and recovery can proceed from it, or there
wasn't, in which case you can still be happy, since after all you still
saved what was self-evidently more important than that data, the time,
trouble and resources you'd have put into backing it up if it /were/
worth it.


Meanwhile, apparently the filesystem can still be mounted read-only,
just not read-write, so the first thing I'd do would be to mount it
read-only and see how much of the data I can successfully copy off
to some other filesystem.  With a bit of luck, only a few files are
damaged and will need restored from backup (assuming they were
valuable enough to have a backup, of course), while the rest are
fine.

If that's /not/ the case, and the filesystem won't mount read-only
either, or there's too much damaged, then it's time to try
btrfs restore, to try to restore some of the files to some other
location.  Tho note that btrfs restore is an effort at recovery,
and as such, it doesn't verify checksums as normal btrfs file
operations will, so some of the otherwise successfully restored
files may still be bitrotted.  (Tho most filesystems don't checksum
in any case, so the danger of bitrot is no worse than using a normal
filesystem that doesn't do checksumming anyway.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: exclusive subvolume space missing

2017-12-10 Thread Qu Wenruo


On 2017年12月11日 07:44, Qu Wenruo wrote:
> 
> 
> On 2017年12月10日 19:27, Tomasz Pala wrote:
>> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
>>
 1. is there any switch resulting in 'defrag only exclusive data'?
>>>
>>> IIRC, no.
>>
>> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
>> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
>> defrag. After defragging files were not snapshotted again and I've lost
>> 3.6 GB again, so I got this fully reproducible.
>> There are 7 files, one of which is 99% of the space (10 MB). None of
>> them has nocow set, so they're riding all-btrfs.
>>
>> I could debug something before I'll clean this up, is there anything you
>> want to me to check/know about the files?
> 
> fiemap result along with btrfs dump-tree -t2 result.
> 
> Both output has nothing related to file name/dir name, but only some
> "meaningless" bytenr, so it should be completely OK to share them.
> 
>>
>> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
>> condition which could be triggered by malicious user during a few hours
>> or faster
> 
> You won't want to hear this:
> The biggest ratio in theory is, 128M / 4K = 32768.
> 
>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
> 
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.
> 
> But yes, it's possible.
> 
> While on the other hand, XFS, which also supports reflink, handles it
> quite well, so I'm wondering if it's possible for btrfs to follow its
> behavior.
> 
>>
 3. I guess there aren't, so how could I accomplish my target, i.e.
reclaiming space that was lost due to fragmentation, without breaking
spanshoted CoW where it would be not only pointless, but actually 
 harmful?
>>>
>>> What about using old kernel, like v4.13?
>>
>> Unfortunately (I guess you had 3.13 on mind), I need the new ones and
>> will be pushing towards 4.14.
> 
> No, I really mean v4.13.

My fault, it is v3.13.

What a stupid error...

> 
> From btrfs(5):
> ---
>Warning
>Defragmenting with Linux kernel versions < 3.9 or ≥
> 3.14-rc2 as
>well as with Linux stable kernel versions ≥ 3.10.31, ≥
> 3.12.12
>or ≥ 3.13.4 will break up the ref-links of CoW data (for
>example files copied with cp --reflink, snapshots or
>de-duplicated data). This may cause considerable increase of
>space usage depending on the broken up ref-links.
> ---
> 
>>
 4. How can I prevent this from happening again? All the files, that are
written constantly (stats collector here, PostgreSQL database and
logs on other machines), are marked with nocow (+C); maybe some new
attribute to mark file as autodefrag? +t?
>>>
>>> Unfortunately, nocow only works if there is no other subvolume/inode
>>> referring to it.
>>
>> This shouldn't be my case anymore after defrag (==breaking links).
>> I guess no easy way to check refcounts of the blocks?
> 
> No easy way unfortunately.
> It's either time consuming (used by qgroup) or complex (manually tree
> search and do the backref walk by yourself)
> 
>>
>>> But in my understanding, btrfs is not suitable for such conflicting
>>> situation, where you want to have snapshots of frequent partial updates.
>>>
>>> IIRC, btrfs is better for use case where either update is less frequent,
>>> or update is replacing the whole file, not just part of it.
>>>
>>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>>> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.
>>
>> That is something coherent with my conclusions after 2 years on btrfs,
>> however I didn't expect a single file to eat 1000 times more space than it
>> should...
>>
>>
>> I wonder how many other filesystems were trashed like this - I'm short
>> of ~10 GB on other system, many other users might be affected by that
>> (telling the Internet stories about btrfs running out of space).
> 
> Firstly, no other filesystem supports snapshot.
> So it's pretty hard to get a baseline.
> 
> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
> 
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.
> 
> But it's quite a hard work to achieve in btrfs, needs years development
> at least.
> 
>>
>> It is not a problem that I need to defrag a file, the problem is I don't 
>> know:
>> 1. whether I need to defrag,
>> 2. *what* should I defrag
>> nor have a tool that would defrag smart - only the exclusive data or, in
>> general, the block that are worth defragging if space released from
>> extents is greater than space lost on 

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-10 Thread Dave Chinner
On Fri, Dec 08, 2017 at 03:01:31PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote:
> > > > cmpxchg is for replacing a known object in a store - it's not really
> > > > intended for doing initial inserts after a lookup tells us there is
> > > > nothing in the store.  The radix tree "insert only if empty" makes
> > > > sense here, because it naturally takes care of lookup/insert races
> > > > via the -EEXIST mechanism.
> > > > 
> > > > I think that providing xa_store_excl() (which would return -EEXIST
> > > > if the entry is not empty) would be a better interface here, because
> > > > it matches the semantics of lookup cache population used all over
> > > > the kernel
> > > 
> > > I'm not thrilled with xa_store_excl(), but I need to think about that
> > > a bit more.
> > 
> > Not fussed about the name - I just think we need a function that
> > matches the insert semantics of the code
> 
> I think I have something that works better for you than returning -EEXIST
> (because you don't actually want -EEXIST, you want -EAGAIN):
> 
> /* insert the new inode */
> -   spin_lock(>pag_ici_lock);
> -   error = radix_tree_insert(>pag_ici_root, agino, ip);
> -   if (unlikely(error)) {
> -   WARN_ON(error != -EEXIST);
> -   XFS_STATS_INC(mp, xs_ig_dup);
> -   error = -EAGAIN;
> -   goto out_preload_end;
> -   }
> -   spin_unlock(>pag_ici_lock);
> -   radix_tree_preload_end();
> +   curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> +   error = __xa_race(curr, -EAGAIN);
> +   if (error)
> +   goto out_unlock;
> 
> ...
> 
> -out_preload_end:
> -   spin_unlock(>pag_ici_lock);
> -   radix_tree_preload_end();
> +out_unlock:
> +   if (error == -EAGAIN)
> +   XFS_STATS_INC(mp, xs_ig_dup);
> 
> I've changed the behaviour slightly in that returning an -ENOMEM used to
> hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS
> returning -ENOMEM probably gets you a nice warning already from the
> mm code.

It's been a couple of days since I first looked at this, and my
initial reaction of "that's horrible!" hasn't changed.

What you are proposing here might be a perfectly reasonable
*internal implemention* of xa_store_excl(), but it makes for a
terrible external API because the sematics and behaviour are so
vague. e.g. what does "race" mean here with respect to an insert
failure?

i.e.  the fact the cmpxchg failed may not have anything to do with a
race condtion - it failed because the slot wasn't empty like we
expected it to be. There can be any number of reasons the slot isn't
empty - the API should not "document" that the reason the insert
failed was a race condition. It should document the case that we
"couldn't insert because there was an existing entry in the slot".
Let the surrounding code document the reason why that might have
happened - it's not for the API to assume reasons for failure.

i.e. this API and potential internal implementation makes much
more sense:

int
xa_store_iff_empty(...)
{
curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS);
if (!curr)
return 0;   /* success! */
if (!IS_ERR(curr))
return -EEXIST; /* failed - slot not empty */
return PTR_ERR(curr);   /* failed - XA internal issue */
}

as it replaces the existing preload and insert code in the XFS code
paths whilst letting us handle and document the "insert failed
because slot not empty" case however we want. It implies nothing
about *why* the slot wasn't empty, just that we couldn't do the
insert because it wasn't empty.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Chunk-Recovery fails with alignment error

2017-12-10 Thread Qu Wenruo


On 2017年12月11日 05:16, Benjamin Beichler wrote:
> The patch let the chunk-recover be successful. But I'm no lucky man,
> the recovered chunk tree does not work or other metadata is also
> broken.
> 
> Mounting the system is not successful (dmesg):
> BTRFS critical (device dm-0): corrupt node, invalid item slot:
> block=16384, root=1, slot=0
> BTRFS error (device dm-0): failed to read chunk root
> BTRFS error (device dm-0): open_ctree failed

For this error, you could apply this diff to by-pass it:

--
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index ce4ed6ec8f39..355220e21c2e 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -413,12 +413,6 @@ int btrfs_check_node(struct btrfs_root *root,
struct extent_buffer *node)
btrfs_node_key_to_cpu(node, , slot);
btrfs_node_key_to_cpu(node, _key, slot + 1);

-   if (!bytenr) {
-   generic_err(root, node, slot,
-   "invalid NULL node pointer");
-   ret = -EUCLEAN;
-   goto out;
-   }
if (!IS_ALIGNED(bytenr, root->fs_info->sectorsize)) {
generic_err(root, node, slot,
"unaligned pointer, have %llu should be aligned
to %u",
--

Thanks,
Qu

> 
> Therefore I tried a btrfs check --repair, this time without error:
> https://gist.github.com/anonymous/5cf7ad9e187032d2c94db4f91bb62c24
> 
> Then I tried btrfs check --init-extent-tree and this produces much
> output. I put the beginning into here:
> https://gist.github.com/anonymous/70e2482646a8235ee2327105d920dadd
> From a fast view, the messages keep to be similar to the last of the
> gist, but the messages in the beginning are not repeating. If it helps
> I have complete compressed log.
> 
> Then I tried a btrfs recover to get files, but for many files (also
> improtant data, but I filtered the output) I get outputs like in:
> https://gist.github.com/anonymous/1cc7f7ab5af33e76d0bf80960bb300eb
> 
> Any new suggestions?
> 



signature.asc
Description: OpenPGP digital signature


btrfs-progs: replace: error message can be improved when other operation is running

2017-12-10 Thread Lukas Pirl
Dear all,

when trying to replace a device of a file system for which a balance is
running, btrfs-progs fails with the error message:

ERROR: ioctl(DEV_REPLACE_START) on '/mnt/xyz' returns error: 

This might also be true for alike operations, such as "add", "delete"
and "resize", since those cases do not seem to be considered in
cmds-replace.c [0].

Apparently, this is not very helpful to the user (if not scary).
In contrast, other commands give very helpful output in similar
situations (e.g., "add/delete/… operation in progress" [1]).

Other users' confusions might also be related to this potential issue [2].

This is probably very easy to fix for someone into ioctl return values
and all this.

Thanks and cheers,

Lukas

GNU/Linux
4.13.0-0.bpo.1-amd64 #1 SMP Debian 4.13.13-1~bpo9+1 (2017-11-22) x86_64
btrfs-progs v4.13.3

[0]
https://github.com/kdave/btrfs-progs/blob/11c83cefb8b4a03b1835efaf603ddc95430a0c9e/cmds-replace.c#L48
[1]
https://github.com/kdave/btrfs-progs/blob/9fe889ac02b9c49b885c8999f5dd4e192697fa83/ioctl.h#L709
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866734
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: exclusive subvolume space missing

2017-12-10 Thread Qu Wenruo


On 2017年12月10日 19:27, Tomasz Pala wrote:
> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
> 
>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>
>> IIRC, no.
> 
> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
> defrag. After defragging files were not snapshotted again and I've lost
> 3.6 GB again, so I got this fully reproducible.
> There are 7 files, one of which is 99% of the space (10 MB). None of
> them has nocow set, so they're riding all-btrfs.
> 
> I could debug something before I'll clean this up, is there anything you
> want to me to check/know about the files?

fiemap result along with btrfs dump-tree -t2 result.

Both output has nothing related to file name/dir name, but only some
"meaningless" bytenr, so it should be completely OK to share them.

> 
> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
> condition which could be triggered by malicious user during a few hours
> or faster

You won't want to hear this:
The biggest ratio in theory is, 128M / 4K = 32768.

> - I've lost 3.6 GB during the night with reasonably small
> amount of writes, I guess it might be possible to trash entire
> filesystem within 10 minutes if doing this on purpose.

That's a little complex.
To get into such situation, snapshot must be used and one must know
which file extent is shared and how it's shared.

But yes, it's possible.

While on the other hand, XFS, which also supports reflink, handles it
quite well, so I'm wondering if it's possible for btrfs to follow its
behavior.

> 
>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>reclaiming space that was lost due to fragmentation, without breaking
>>>spanshoted CoW where it would be not only pointless, but actually 
>>> harmful?
>>
>> What about using old kernel, like v4.13?
> 
> Unfortunately (I guess you had 3.13 on mind), I need the new ones and
> will be pushing towards 4.14.

No, I really mean v4.13.

From btrfs(5):
---
   Warning
   Defragmenting with Linux kernel versions < 3.9 or ≥
3.14-rc2 as
   well as with Linux stable kernel versions ≥ 3.10.31, ≥
3.12.12
   or ≥ 3.13.4 will break up the ref-links of CoW data (for
   example files copied with cp --reflink, snapshots or
   de-duplicated data). This may cause considerable increase of
   space usage depending on the broken up ref-links.
---

> 
>>> 4. How can I prevent this from happening again? All the files, that are
>>>written constantly (stats collector here, PostgreSQL database and
>>>logs on other machines), are marked with nocow (+C); maybe some new
>>>attribute to mark file as autodefrag? +t?
>>
>> Unfortunately, nocow only works if there is no other subvolume/inode
>> referring to it.
> 
> This shouldn't be my case anymore after defrag (==breaking links).
> I guess no easy way to check refcounts of the blocks?

No easy way unfortunately.
It's either time consuming (used by qgroup) or complex (manually tree
search and do the backref walk by yourself)

> 
>> But in my understanding, btrfs is not suitable for such conflicting
>> situation, where you want to have snapshots of frequent partial updates.
>>
>> IIRC, btrfs is better for use case where either update is less frequent,
>> or update is replacing the whole file, not just part of it.
>>
>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.
> 
> That is something coherent with my conclusions after 2 years on btrfs,
> however I didn't expect a single file to eat 1000 times more space than it
> should...
> 
> 
> I wonder how many other filesystems were trashed like this - I'm short
> of ~10 GB on other system, many other users might be affected by that
> (telling the Internet stories about btrfs running out of space).

Firstly, no other filesystem supports snapshot.
So it's pretty hard to get a baseline.

But as I mentioned, XFS supports reflink, which means file extent can be
shared between several inodes.

From the message I got from XFS guys, they free any unused space of a
file extent, so it should handle it quite well.

But it's quite a hard work to achieve in btrfs, needs years development
at least.

> 
> It is not a problem that I need to defrag a file, the problem is I don't know:
> 1. whether I need to defrag,
> 2. *what* should I defrag
> nor have a tool that would defrag smart - only the exclusive data or, in
> general, the block that are worth defragging if space released from
> extents is greater than space lost on inter-snapshot duplication.
> 
> I can't just defrag entire filesystem since it breaks links with snapshots.
> This change was a real deal-breaker here...

IIRC it's better to add a option to make defrag snapshot-aware.
(Don't break snapshot sharing but only to 

Re: Chunk-Recovery fails with alignment error

2017-12-10 Thread Qu Wenruo


On 2017年12月11日 05:16, Benjamin Beichler wrote:
> The patch let the chunk-recover be successful. But I'm no lucky man,
> the recovered chunk tree does not work or other metadata is also
> broken.
> 
> Mounting the system is not successful (dmesg):
> BTRFS critical (device dm-0): corrupt node, invalid item slot:
> block=16384, root=1, slot=0

What does btrfs dump-tree -b 16384 say?

IIRC the same 0 bytenr problem in kernel.

> BTRFS error (device dm-0): failed to read chunk root
> BTRFS error (device dm-0): open_ctree failed
> 
> Therefore I tried a btrfs check --repair, this time without error:
> https://gist.github.com/anonymous/5cf7ad9e187032d2c94db4f91bb62c24
> 
> Then I tried btrfs check --init-extent-tree and this produces much
> output. I put the beginning into here:
> https://gist.github.com/anonymous/70e2482646a8235ee2327105d920dadd

That's common for --init-extent-tree, but I don't think extent tree is
related in this case.

Thanks,
Qu
> From a fast view, the messages keep to be similar to the last of the
> gist, but the messages in the beginning are not repeating. If it helps
> I have complete compressed log.
> 
> Then I tried a btrfs recover to get files, but for many files (also
> improtant data, but I filtered the output) I get outputs like in:
> https://gist.github.com/anonymous/1cc7f7ab5af33e76d0bf80960bb300eb>
> Any new suggestions?
> 



signature.asc
Description: OpenPGP digital signature


Just curious, whats happened with btrfs & rcu skiplist in 2013?

2017-12-10 Thread Timofey Titovets
Subj,
I found that https://lwn.net/Articles/554885/,
https://lwn.net/Articles/553047/
and some others messages like judy rcu & etc.

But nothing after june 2013, just curious, may be some one know why
that has been stalled?
or just what happened in the end?  (i.e. ex, RT guys, just said that
too bad for us..)

Thanks!
-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Chunk-Recovery fails with alignment error

2017-12-10 Thread Benjamin Beichler
The patch let the chunk-recover be successful. But I'm no lucky man,
the recovered chunk tree does not work or other metadata is also
broken.

Mounting the system is not successful (dmesg):
BTRFS critical (device dm-0): corrupt node, invalid item slot:
block=16384, root=1, slot=0
BTRFS error (device dm-0): failed to read chunk root
BTRFS error (device dm-0): open_ctree failed

Therefore I tried a btrfs check --repair, this time without error:
https://gist.github.com/anonymous/5cf7ad9e187032d2c94db4f91bb62c24

Then I tried btrfs check --init-extent-tree and this produces much
output. I put the beginning into here:
https://gist.github.com/anonymous/70e2482646a8235ee2327105d920dadd
>From a fast view, the messages keep to be similar to the last of the
gist, but the messages in the beginning are not repeating. If it helps
I have complete compressed log.

Then I tried a btrfs recover to get files, but for many files (also
improtant data, but I filtered the output) I get outputs like in:
https://gist.github.com/anonymous/1cc7f7ab5af33e76d0bf80960bb300eb

Any new suggestions?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Sun, Dec 10, 2017 at 12:27:38 +0100, Tomasz Pala wrote:

> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after

#  df
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda264G   61G  2.8G  96% /

#  btrfs fi du .
 Total   Exclusive  Set shared  Filename
 0.00B   0.00B   -  ./1/__db.register
  10.00MiB10.00MiB   -  ./1/log.01
  16.00KiB   0.00B   -  ./1/hosts.db
  16.00KiB   0.00B   -  ./1/users.db
 168.00KiB   0.00B   -  ./1/__db.001
  40.00KiB   0.00B   -  ./1/__db.002
  44.00KiB   0.00B   -  ./1/__db.003
  10.28MiB10.00MiB   -  ./1
 0.00B   0.00B   -  ./__db.register
  16.00KiB16.00KiB   -  ./hosts.db
  16.00KiB16.00KiB   -  ./users.db
  10.00MiB10.00MiB   -  ./log.13
 0.00B   0.00B   -  ./__db.001
 0.00B   0.00B   -  ./__db.002
 0.00B   0.00B   -  ./__db.003
  20.31MiB20.03MiB   284.00KiB  .

#  btrfs fi defragment log.13 
#  df
/dev/sda264G   54G  9.4G  86% /


6.6 GB / 10 MB = 660:1 overhead within 1 day of uptime.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ERROR: failed to repair root items: Input/output error

2017-12-10 Thread Tomasz Pala
On Sun, Dec 10, 2017 at 15:18:32 +, constantine wrote:

> I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is
> within warranty.
> I can't mount it read-write ("no rw mounting  after error").

There is a data-corruption issue with this controller!
The same as 840 EVO - just google this.

In short: either use recent kernel (AFAIR 4.0.5+ for 840 EVO and some
newer for entire 8* Samsung SSD family blacklisting) or disable NCQ.

Using queued TRIM on this drive leads to data loss! Firmware zeroes fist
512 bytes of a block, sorry.

If you only had smaller drive, as 850s up to 512 GB have different
controller...

> checksum verify failed on 103009173504 found 25334496 wanted 3500
> bytenr mismatch, want=103009173504, have=889192478
> ERROR: failed to repair root items: Input/output error
> 
> What do these errors mean?
> What should I do to fix the filesystem and be able to mount it read-write?

You probably can't fix this - there is data missing on bare metal, so you should
recover using backups. If you don't have one, you need to perform manual
data recovery procedures (like photorec) with little chances to restore
complete files due to the nature of data loss (beginning of blocks).

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ERROR: failed to repair root items: Input/output error

2017-12-10 Thread constantine
I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is
within warranty.
I can't mount it read-write ("no rw mounting  after error").
The data are not really critical (I will overcome the shock of losing
them within a couple of days).


Btrfs check --repair throws an error:

sudo btrfs check --repair /dev/sdb1
enabling repair mode
Checking filesystem on /dev/sdb1
UUID: a955bc5f-e5f0-42ce-bd5a-de5eb8d5d3aa
checking extents
ERROR: add_tree_backref failed (non-leaf block): File exists
parent transid verify failed on 103009185792 wanted 5026 found 345954
parent transid verify failed on 103009185792 wanted 5026 found 345954
Ignoring transid failure
leaf parent key incorrect 103009185792
bad block 103009185792
ERROR: errors found in extent allocation tree or chunk allocation
checksum verify failed on 103009173504 found 25334496 wanted 3500
checksum verify failed on 103009173504 found 25334496 wanted 3500
bytenr mismatch, want=103009173504, have=889192478
ERROR: failed to repair root items: Input/output error

What do these errors mean?
What should I do to fix the filesystem and be able to mount it read-write?


Thank you,

Konstantinos Tsardounis


PS:
I login with an Ubuntu LiveCD now which returns:

uname -a
Linux ubuntu 4.13.0-16-generic #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux
btrfs --version
btrfs-progs v4.12
btrfs fi show
Label: 'arch'  uuid: a955bc5f-e5f0-42ce-bd5a-de5eb8d5d3aa
Total devices 1 FS bytes used 876.26GiB
devid1 size 931.01GiB used 931.01GiB path /dev/sdb1
btrfs fi df sdb1
Data, single: total=926.25GiB, used=872.65GiB
System, single: total=4.00MiB, used=128.00KiB
Metadata, single: total=4.76GiB, used=3.60GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

and the dmesg.log is on:
https://gist.github.com/anonymous/16344244259bb2989701f3ec43e26f39
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:

>> 1. is there any switch resulting in 'defrag only exclusive data'?
> 
> IIRC, no.

I have found a directory - pam_abl databases, which occupy 10 MB (yes,
TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
defrag. After defragging files were not snapshotted again and I've lost
3.6 GB again, so I got this fully reproducible.
There are 7 files, one of which is 99% of the space (10 MB). None of
them has nocow set, so they're riding all-btrfs.

I could debug something before I'll clean this up, is there anything you
want to me to check/know about the files?

The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
condition which could be triggered by malicious user during a few hours
or faster - I've lost 3.6 GB during the night with reasonably small
amount of writes, I guess it might be possible to trash entire
filesystem within 10 minutes if doing this on purpose.

>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>reclaiming space that was lost due to fragmentation, without breaking
>>spanshoted CoW where it would be not only pointless, but actually harmful?
> 
> What about using old kernel, like v4.13?

Unfortunately (I guess you had 3.13 on mind), I need the new ones and
will be pushing towards 4.14.

>> 4. How can I prevent this from happening again? All the files, that are
>>written constantly (stats collector here, PostgreSQL database and
>>logs on other machines), are marked with nocow (+C); maybe some new
>>attribute to mark file as autodefrag? +t?
> 
> Unfortunately, nocow only works if there is no other subvolume/inode
> referring to it.

This shouldn't be my case anymore after defrag (==breaking links).
I guess no easy way to check refcounts of the blocks?

> But in my understanding, btrfs is not suitable for such conflicting
> situation, where you want to have snapshots of frequent partial updates.
> 
> IIRC, btrfs is better for use case where either update is less frequent,
> or update is replacing the whole file, not just part of it.
> 
> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.

That is something coherent with my conclusions after 2 years on btrfs,
however I didn't expect a single file to eat 1000 times more space than it
should...


I wonder how many other filesystems were trashed like this - I'm short
of ~10 GB on other system, many other users might be affected by that
(telling the Internet stories about btrfs running out of space).

It is not a problem that I need to defrag a file, the problem is I don't know:
1. whether I need to defrag,
2. *what* should I defrag
nor have a tool that would defrag smart - only the exclusive data or, in
general, the block that are worth defragging if space released from
extents is greater than space lost on inter-snapshot duplication.

I can't just defrag entire filesystem since it breaks links with snapshots.
This change was a real deal-breaker here...

Any way to fed the deduplication code with snapshots maybe? There are
directories and files in the same layout, this could be fast-tracked to
check and deduplicate.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Sun, Dec 03, 2017 at 01:45:45 +, Duncan wrote:

> OTOH, it's also quite possible that people chose btrfs at least partly
> for other reasons, say the "storage pool" qualities, and would rather

Well, to name some:

1. filesystem-level backups via snapshot/send/receive - much cleaner and
faster than rsyncs or other old-fashioned methods. This obviously requires the 
CoW-once feature;

- caveat: for btrfs-killing usage patterns all the snapshots but the
  last one need to be removed;


2. block-level checksums with RAID1-awareness - in contrary to mdadm
RAIDx, which chooses random data copy from underlying devices, this is
much less susceptible to bit rot;

- caveats: requires CoW enabled, RAID1 reading is dumb (even/odd PID
  instead of real balancing), no N-way mirroring nor write-mostly flag.


3. compression - there is no real alternative, however:

- caveat: requires CoW enabled, which makes it not suitable for
  ...systemd journals, which compress with great ratio (c.a. 1:10),
  nor for various databases, as they will be nocowed sooner or later;


4. storage pools you've mentioned - they are actually not much superior to
LVM-based approach; until one could create subvolume with different
profile (e.g. 'disable RAID1 for /var/log/journal') it is still better
to create separate filesystems, meaning one have to use LVM or (the hard
way) paritioning.


Some of the drawbacks above are immanent to CoW and so shouldn't be
expected to be fixed internally, as the needs are conflicting, but their
impact might be nullified by some housekeeping.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 4/5] btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT

2017-12-10 Thread Anand Jain
Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_MISSING and use the bit operations.

Signed-off-by: Anand Jain 
---
v3: Define BTRFS_DEV_STATE_REPLACE_TGT as bit nr

 fs/btrfs/dev-replace.c |  5 +++--
 fs/btrfs/extent-tree.c |  3 ++-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/super.c   |  5 +++--
 fs/btrfs/volumes.c | 39 ++-
 fs/btrfs/volumes.h |  2 +-
 7 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 559db7667f38..12fd8a203735 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -172,7 +172,8 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info)
dev_replace->tgtdev->commit_bytes_used =
dev_replace->srcdev->commit_bytes_used;
}
-   dev_replace->tgtdev->is_tgtdev_for_dev_replace = 1;
+   set_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+   _replace->tgtdev->dev_state);
btrfs_init_dev_replace_tgtdev_for_resume(fs_info,
dev_replace->tgtdev);
}
@@ -564,7 +565,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info 
*fs_info,
  dev_missing_or_rcu_str(src_device),
  src_device->devid,
  rcu_str_deref(tgt_device->name));
-   tgt_device->is_tgtdev_for_dev_replace = 0;
+   clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, _device->dev_state);
tgt_device->devid = src_device->devid;
src_device->devid = BTRFS_DEV_REPLACE_DEVID;
memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2cd323d184a0..1e65d5d54a8a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9692,7 +9692,8 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 
bytenr)
 * space to fit our block group in.
 */
if (device->total_bytes > device->bytes_used + min_free &&
-   !device->is_tgtdev_for_dev_replace) {
+   !test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+   >dev_state)) {
ret = find_free_dev_extent(trans, device, min_free,
   _offset, NULL);
if (!ret)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e59004a17166..953563138020 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1528,7 +1528,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
}
}
 
-   if (device->is_tgtdev_for_dev_replace) {
+   if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) {
ret = -EPERM;
goto out_free;
}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b6de017066b3..b5a33db38874 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4131,7 +4131,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
 
mutex_lock(_info->scrub_lock);
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state) ||
-   dev->is_tgtdev_for_dev_replace) {
+   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) {
mutex_unlock(_info->scrub_lock);
mutex_unlock(_info->fs_devices->device_list_mutex);
return -EIO;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6bae2e046257..b16e3fbd5895 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1973,8 +1973,9 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
rcu_read_lock();
list_for_each_entry_rcu(device, _devices->devices, dev_list) {
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-   >dev_state) ||
-   !device->bdev || device->is_tgtdev_for_dev_replace)
+   >dev_state) || !device->bdev ||
+   test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+   >dev_state))
continue;
 
if (i >= nr_devices)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c6f7f4935dc4..37b1aed14353 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -845,8 +845,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices 
*fs_devices, int step)
list_for_each_entry_safe(device, next, _devices->devices, dev_list) {
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>dev_state)) {

Re: [bug report] btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA

2017-12-10 Thread Anand Jain



On 12/08/2017 07:08 PM, Dan Carpenter wrote:

Hello Anand Jain,

The patch 2dcfcdc43524: "btrfs: cleanup device states define
BTRFS_DEV_STATE_IN_FS_METADATA" from Dec 4, 2017, leads to the
following static checker warning:

fs/btrfs/super.c:2007 btrfs_calc_avail_data_space()
warn: test_bit() takes a bit number

fs/btrfs/super.c
   2004
   2005  rcu_read_lock();
   2006  list_for_each_entry_rcu(device, _devices->devices, 
dev_list) {
   2007  if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
   ^^
This BTRFS_DEV_STATE_IN_FS_METADATA define is BIT(1) but for test_bit()
we should just take 1.  In other words we are doing double BIT(BIT(1)).

   2008  >dev_state) ||
   2009  !device->bdev ||
   2010  test_bit(BTRFS_DEV_STATE_REPLACE_TGT, 
>dev_state))
   2011  continue;
   2012
   2013  if (i >= nr_devices)
   2014  break;
   2015


 Thanks Dan. I fixed all these in v3.

- Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 3/5] btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING

2017-12-10 Thread Anand Jain
Currently device state is being managed by each individual int
variable such as struct btrfs_device::missing. Instead of that
declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use
the bit operations.

Signed-off-by: Anand Jain 
Reviewed-by : Nikolay Borisov 
---
v3: Define BTRFS_DEV_STATE_MISSING as bit nr

 fs/btrfs/dev-replace.c |  3 ++-
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/scrub.c   |  7 ---
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 34 --
 fs/btrfs/volumes.h |  2 +-
 6 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 4b6ceb38cb5f..559db7667f38 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -306,7 +306,8 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info 
*fs_info)
 
 static inline char *dev_missing_or_rcu_str(struct btrfs_device *device)
 {
-   return device->missing ? "" : rcu_str_deref(device->name);
+   return test_bit(BTRFS_DEV_STATE_MISSING, >dev_state) ?
+   "" : rcu_str_deref(device->name);
 }
 
 int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 634e8eb51cc8..890e3a6a2f3e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3399,7 +3399,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
/* send down all the barriers */
head = >fs_devices->devices;
list_for_each_entry_rcu(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state))
continue;
if (!dev->bdev)
continue;
@@ -3415,7 +3415,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
/* wait for all the barriers */
list_for_each_entry_rcu(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state))
continue;
if (!dev->bdev) {
errors_wait++;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index c4705de2ec26..b6de017066b3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2535,7 +2535,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 
logical, u64 len,
}
 
WARN_ON(sblock->page_count == 0);
-   if (dev->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
/*
 * This case should only be hit for RAID 5/6 device replace. See
 * the comment in scrub_missing_raid56_pages() for details.
@@ -2870,7 +2870,7 @@ static int scrub_extent_for_parity(struct scrub_parity 
*sparity,
u8 csum[BTRFS_CSUM_SIZE];
u32 blocksize;
 
-   if (dev->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
scrub_parity_mark_sectors_error(sparity, logical, len);
return 0;
}
@@ -4112,7 +4112,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
 
mutex_lock(_info->fs_devices->device_list_mutex);
dev = btrfs_find_device(fs_info, devid, NULL, NULL);
-   if (!dev || (dev->missing && !is_dev_replace)) {
+   if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state) &&
+   !is_dev_replace)) {
mutex_unlock(_info->fs_devices->device_list_mutex);
return -ENODEV;
}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f0906fbfa731..6bae2e046257 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2270,7 +2270,7 @@ static int btrfs_show_devname(struct seq_file *m, struct 
dentry *root)
while (cur_devices) {
head = _devices->devices;
list_for_each_entry(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state))
continue;
if (!dev->name)
continue;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7100c877748d..c6f7f4935dc4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -758,9 +758,9 @@ static noinline int device_list_add(const char *path,
return -ENOMEM;
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
-   if (device->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
fs_devices->missing_devices--;
-   device->missing = 0;
+   clear_bit(BTRFS_DEV_STATE_MISSING, >dev_state);
}
}
 
@@ -944,7 +944,7 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
fs_devices->rw_devices--;
 

[PATCH v3 0/5] define BTRFS_DEV_STATE

2017-12-10 Thread Anand Jain
As of now device properties and states are being represented as int
variable, patches here makes them bit flags instead. Further, wip
patches such as device failed state needs this cleanup.

v2:
 Adds BTRFS_DEV_STATE_REPLACE_TGT
 Adds BTRFS_DEV_STATE_FLUSH_SENT
 Drops BTRFS_DEV_STATE_CAN_DISCARD
 Starts bit flag from the bit 0
 Drops unrelated change - declare btrfs_device

v3:
 Fix static checker warning, define respective dev state as bit number

Anand Jain (5):
  btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE
  btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
  btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING
  btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT
  btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT

 fs/btrfs/dev-replace.c |   8 ++-
 fs/btrfs/disk-io.c |  29 ++---
 fs/btrfs/extent-tree.c |   5 +-
 fs/btrfs/extent_io.c   |   3 +-
 fs/btrfs/ioctl.c   |   4 +-
 fs/btrfs/scrub.c   |  13 +++--
 fs/btrfs/super.c   |   8 ++-
 fs/btrfs/volumes.c | 156 -
 fs/btrfs/volumes.h |  11 ++--
 9 files changed, 143 insertions(+), 94 deletions(-)

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 5/5] btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT

2017-12-10 Thread Anand Jain
Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations.

Signed-off-by: Anand Jain 
---
v3: Define BTRFS_DEV_STATE_FLUSH_SENT as bit nr

 fs/btrfs/disk-io.c | 6 +++---
 fs/btrfs/volumes.h | 1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 890e3a6a2f3e..9b20c1f3563b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3359,7 +3359,7 @@ static void write_dev_flush(struct btrfs_device *device)
bio->bi_private = >flush_wait;
 
btrfsic_submit_bio(bio);
-   device->flush_bio_sent = 1;
+   set_bit(BTRFS_DEV_STATE_FLUSH_SENT, >dev_state);
 }
 
 /*
@@ -3369,10 +3369,10 @@ static blk_status_t wait_dev_flush(struct btrfs_device 
*device)
 {
struct bio *bio = device->flush_bio;
 
-   if (!device->flush_bio_sent)
+   if (!test_bit(BTRFS_DEV_STATE_FLUSH_SENT, >dev_state))
return BLK_STS_OK;
 
-   device->flush_bio_sent = 0;
+   clear_bit(BTRFS_DEV_STATE_FLUSH_SENT, >dev_state);
wait_for_completion_io(>flush_wait);
 
return bio->bi_status;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index a15f8b103072..826fce3d6825 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,6 +51,7 @@ struct btrfs_pending_bios {
 #define BTRFS_DEV_STATE_IN_FS_METADATA 1
 #define BTRFS_DEV_STATE_MISSING2
 #define BTRFS_DEV_STATE_REPLACE_TGT3
+#define BTRFS_DEV_STATE_FLUSH_SENT 4
 
 struct btrfs_device {
struct list_head dev_list;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/5] btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA

2017-12-10 Thread Anand Jain
Currently device state is being managed by each individual int
variable such as struct btrfs_device::in_fs_metadata. Instead of
that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use
the bit operations.

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v3: Define BTRFS_DEV_STATE_IN_FS_METADATA as bit nr

 fs/btrfs/disk-io.c | 21 ++---
 fs/btrfs/scrub.c   |  3 ++-
 fs/btrfs/super.c   |  5 +++--
 fs/btrfs/volumes.c | 29 +
 fs/btrfs/volumes.h |  2 +-
 5 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 56198cb02b35..634e8eb51cc8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3403,8 +3403,10 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
continue;
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   >dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   >dev_state))
continue;
 
write_dev_flush(dev);
@@ -3419,8 +3421,10 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
errors_wait++;
continue;
}
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   >dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   >dev_state))
continue;
 
ret = wait_dev_flush(dev);
@@ -3517,7 +3521,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
total_errors++;
continue;
}
-   if (!dev->in_fs_metadata ||
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   >dev_state) ||
!test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
continue;
 
@@ -3557,8 +3562,10 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
list_for_each_entry_rcu(dev, head, dev_list) {
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   >dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   >dev_state))
continue;
 
ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index fa70ff9b7762..c4705de2ec26 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4129,7 +4129,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
}
 
mutex_lock(_info->scrub_lock);
-   if (!dev->in_fs_metadata || dev->is_tgtdev_for_dev_replace) {
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state) ||
+   dev->is_tgtdev_for_dev_replace) {
mutex_unlock(_info->scrub_lock);
mutex_unlock(_info->fs_devices->device_list_mutex);
return -EIO;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3a4dce153645..f0906fbfa731 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1972,8 +1972,9 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
 
rcu_read_lock();
list_for_each_entry_rcu(device, _devices->devices, dev_list) {
-   if (!device->in_fs_metadata || !device->bdev ||
-   device->is_tgtdev_for_dev_replace)
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   >dev_state) ||
+   !device->bdev || device->is_tgtdev_for_dev_replace)
continue;
 
if (i >= nr_devices)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9d14d83ab8dc..7100c877748d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -636,7 +636,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
fs_devices->rotating = 1;
 
device->bdev = bdev;
-   device->in_fs_metadata = 0;
+   clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state);
device->mode = flags;
 
fs_devices->open_devices++;
@@ -843,7 +843,8 @@ void 

[PATCH v3 1/5] btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE

2017-12-10 Thread Anand Jain
Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.

Signed-off-by: Anand Jain 
---
v2:
  Drop unrelated change to declare btrfs_device
  Start bit flag defines from bit 0
v3:
  Define BTRFS_DEV_STATE_WRITEABLE as bit nr

 fs/btrfs/disk-io.c | 12 ++
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/extent_io.c   |  3 ++-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/scrub.c   |  3 ++-
 fs/btrfs/volumes.c | 60 +-
 fs/btrfs/volumes.h |  4 +++-
 7 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 10a2a579cc7f..56198cb02b35 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3403,7 +3403,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
continue;
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
continue;
 
write_dev_flush(dev);
@@ -3418,7 +3419,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
errors_wait++;
continue;
}
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
continue;
 
ret = wait_dev_flush(dev);
@@ -3515,7 +3517,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
total_errors++;
continue;
}
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
continue;
 
btrfs_set_stack_device_generation(dev_item, 0);
@@ -3554,7 +3557,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
list_for_each_entry_rcu(dev, head, dev_list) {
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
continue;
 
ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 15c01014e5e1..2cd323d184a0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10877,7 +10877,7 @@ static int btrfs_trim_free_extents(struct btrfs_device 
*device,
*trimmed = 0;
 
/* Not writeable = nothing to do. */
-   if (!device->writeable)
+   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
return 0;
 
/* No free space = nothing to do. */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 012d63870b99..25682c5a0dd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2027,7 +2027,8 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 
ino, u64 start,
bio->bi_iter.bi_sector = sector;
dev = bbio->stripes[bbio->mirror_num - 1].dev;
btrfs_put_bbio(bbio);
-   if (!dev || !dev->bdev || !dev->writeable) {
+   if (!dev || !dev->bdev ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) {
btrfs_bio_counter_dec(fs_info);
bio_put(bio);
return -EIO;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d748ad1c3620..e59004a17166 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1503,7 +1503,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
goto out_free;
}
 
-   if (!device->writeable) {
+   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) {
btrfs_info(fs_info,
   "resizer unable to apply on readonly device %llu",
   devid);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b2f871d80982..fa70ff9b7762 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4117,7 +4117,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
devid, u64 start,
return -ENODEV;
}
 
-   if (!is_dev_replace && !readonly && !dev->writeable) {
+   if (!is_dev_replace && !readonly &&
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) {
mutex_unlock(_info->fs_devices->device_list_mutex);
rcu_read_lock();
name = rcu_dereference(dev->name);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c19a49167966..9d14d83ab8dc