Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Duncan
Brad Templeton posted on Sat, 26 May 2018 19:21:57 -0700 as excerpted:

> Certainly.  My apologies for not including them before.

Aieee!  Reply before quote, making the reply out of context, and making
my attempt to reply in context... difficult and troublesome.

Please use the standard list style of context-quote and reply in context
next time, making further replies in context easier as well.

> As
> described, the disks are reasonably balanced -- not as full as the
> last time.  As such, it might be enough that balance would (slowly)
> free up enough chunks to get things going.  And if I have to, I will
> partially convert to single again.   Certainly btrfs replace seems
> like the most planned and simple path but it will result in a strange
> distribution of the chunks.

[btrfs filesystem usage output below]

> Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
>    Total devices 3 FS bytes used 6.11TiB
>    devid 1 size 3.62TiB used 3.47TiB path /dev/sdj2
>
> Overall:
>    Device size:          12.70TiB
>    Device allocated:     12.25TiB
>    Device unallocated:  459.95GiB
>    Device missing:           0.00B
>    Used:                 12.21TiB
>    Free (estimated):    246.35GiB  (min: 246.35GiB)
>    Data ratio:               2.00
>    Metadata ratio:           2.00
>    Global reserve:      512.00MiB  (used: 1.32MiB)
> 
> Data,RAID1: Size:6.11TiB, Used:6.09TiB
>   /dev/sda    3.48TiB
>   /dev/sdi2   5.28TiB
>   /dev/sdj2   3.46TiB
> 
> Metadata,RAID1: Size:14.00GiB, Used:12.38GiB
>   /dev/sda    8.00GiB
>   /dev/sdi2   7.00GiB
>   /dev/sdj2  13.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:888.00KiB
>   /dev/sdi2  32.00MiB
>   /dev/sdj2  32.00MiB
> 
> Unallocated:
>   /dev/sda  153.02GiB
>   /dev/sdi2 154.56GiB
>   /dev/sdj2 152.36GiB

[Presumably this is a bit of btrfs filesystem show output, but the
rest of it is missing...]

>   devid 2 size 3.64TiB used 3.49TiB path /dev/sda
>   devid 3 size 5.43TiB used 5.28TiB path /dev/sdi2


Based on the 100+ GiB still free on each of the three devices above,
you should have no issues balancing after replacing one of them.

Presumably the first time you tried it, there was far less, likely under
a GiB free on the two not replaced.  Since data chunks are nominally
1 GiB each and raid1 requires two copies, each on a different device,
that didn't leave enough space on either of the older devices to do
a balance, even tho there was plenty of space left on the just-replaced
new one.

(Multiple-GiB chunks are possible on TB+ devices, tho even 10 GiB free
on each device should be plenty, so with 100+ GiB free on each there
should be no issues unless you run into some strange bug.)


Meanwhile, even in the case of not enough space free on all three
existing devices, given that they're currently two 4 TB devices and
a 6 TB device and that you're replacing one of the 4 TB devices with
an 8 TB device...

Do a two-step replace instead: first replace the 6 TB device with the
new 8 TB device, then resize to the new 8 TB size, giving you ~2 TB of
free space on it; then replace one of the 4 TB devices with the now-free
6 TB device, and again resize to the new 6 TB size, giving you ~2 TB
free on it too.  That leaves ~2 TB free on each of two devices instead
of all 4 TB of new space on a single device, which should do the trick
very well, and should still be faster, probably MUCH faster, than the
temporary convert to single and back to raid1 kludge you used last
time. =:^)
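A rough sketch of that two-step shuffle as commands, using the devids
from the posted output (the new device name /dev/sdk2 and the mount
point /mnt/butter are hypothetical; see btrfs-replace(8) for details):

```shell
# Step 1: replace the 6 TB device (devid 3, /dev/sdi2 in the output
# above) with the new 8 TB one, then grow that slot to its full size.
btrfs replace start /dev/sdi2 /dev/sdk2 /mnt/butter
btrfs replace status /mnt/butter           # poll until it reports finished
btrfs filesystem resize 3:max /mnt/butter  # devid 3 now sits on the 8 TB device

# Step 2: replace one 4 TB device (devid 2, /dev/sda) with the freed
# 6 TB device, and resize that slot as well.
btrfs replace start /dev/sda /dev/sdi2 /mnt/butter
btrfs filesystem resize 2:max /mnt/butter
```

The resize after each replace matters: replace alone leaves the
filesystem at the old device size, so the extra capacity stays invisible
until the slot is grown.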


Meanwhile, while the kernel version of course remains up to you, given
that you mentioned 4.4 with a potential upgrade to 4.15, I will at least
cover the following, so you'll have it to hand as you decide on kernel
versions.

4.15?  Why?  4.14 is the current mainline LTS kernel series, with 4.15
only being a normal short-term stable series that has already been
EOLed.  So 4.15 now makes little sense at all.  Either go current-stable
series and do 4.16 and continue to upgrade as the new kernels come (4.17
should be out shortly as it's past rc6, with rc7 likely out by the time
you read this and release likely in a week), or stick with 4.14 LTS for
the longer-term support.

Of course you can go with your distro kernel if you like, and I presume
that's why you mentioned 4.15, but as I said it's already EOLed upstream,
and of course this list being a kernel development list, our focus tends
to be on upstream/mainstream, not distro level kernels.  If you choose
a distro level kernel series that's EOLed at kernel.org, then you really
should be getting support from them for it, as they know what they've
backported and/or patched and are thus best positioned to support it.

As for what this list does try to support, it's the last two kernel
release series in each of the current and LTS tracks.  So as the first
release back from current 4.16, 4.15, tho EOLed upstream, is still

Re: csum failed root raveled during balance

2018-05-26 Thread Andrei Borzenkov
23.05.2018 09:32, Nikolay Borisov пишет:
> 
> 
> On 22.05.2018 23:05, ein wrote:
>> Hello devs,
>>
>> I tested BTRFS in production for about a month:
>>
>> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
>>
>> Without power blackout, hardware failure, SSD's SMART is flawless etc.
>> The tests ended with:
>>
>> root@node0:~# dmesg | grep BTRFS | grep warn
>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>> mirror 1
>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>> mirror 1
>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>>
>> The above was revealed during the below command, under quite high IO
>> usage by a few VMs (Linux on top of Ext4 with a Firebird database and
>> lots of random reads/writes, plus two others with Windows 2016 and
>> Windows Update running in the background):
> 
> I believe you are hitting the issue described here:
> 
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
> 
> Essentially the way qemu operates on vm images atop btrfs is prone to
> producing such errors. As a matter of fact, other filesystems also
> suffer from this (i.e. pages modified while being written; however, due
> to the lack of CRCs on the data they don't detect it). Can you confirm
> that those inodes (312/314/319/219962/219915) belong to VM image files?
> 
> IMHO the best course of action would be to disable checksumming for
> your VM files.
> 
> 
> For some background I suggest you read the following LWN articles:
> 
> https://lwn.net/Articles/486311/
> https://lwn.net/Articles/442355/
> 

Hmm ... according to these articles, "pages under writeback are marked
as not being writable; any process attempting to write to such a page
will block until the writeback completes". And it says this feature is
available since 3.0 and btrfs has it. So how come it still happens?
Were the stable patches removed since then?
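For reference, the usual way to act on Nikolay's suggestion above is the
No_COW file attribute, since btrfs ties data checksums to copy-on-write
(the directory path here is hypothetical, and the flag only takes effect
on files created after it is set):

```shell
# Setting the No_COW attribute on the images directory makes new files
# inside it nodatacow, which also disables data checksums for them.
mkdir -p /var/lib/libvirt/images
chattr +C /var/lib/libvirt/images
lsattr -d /var/lib/libvirt/images   # the 'C' flag should now be listed

# Existing images must be re-created inside the flagged directory to
# pick the attribute up, e.g.:
# cp --reflink=never old.img /var/lib/libvirt/images/new.img
```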
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Brad Templeton
Certainly.  My apologies for not including them before.   As
described, the disks are reasonably balanced -- not as full as the
last time.  As such, it might be enough that balance would (slowly)
free up enough chunks to get things going.  And if I have to, I will
partially convert to single again.   Certainly btrfs replace seems
like the most planned and simple path but it will result in a strange
distribution of the chunks.

Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
   Total devices 3 FS bytes used 6.11TiB
   devid 1 size 3.62TiB used 3.47TiB path /dev/sdj2

Overall:
   Device size:          12.70TiB
   Device allocated:     12.25TiB
   Device unallocated:  459.95GiB
   Device missing:           0.00B
   Used:                 12.21TiB
   Free (estimated):    246.35GiB  (min: 246.35GiB)
   Data ratio:               2.00
   Metadata ratio:           2.00
   Global reserve:      512.00MiB  (used: 1.32MiB)

Data,RAID1: Size:6.11TiB, Used:6.09TiB
  /dev/sda    3.48TiB
  /dev/sdi2   5.28TiB
  /dev/sdj2   3.46TiB

Metadata,RAID1: Size:14.00GiB, Used:12.38GiB
  /dev/sda    8.00GiB
  /dev/sdi2   7.00GiB
  /dev/sdj2  13.00GiB

System,RAID1: Size:32.00MiB, Used:888.00KiB
  /dev/sdi2  32.00MiB
  /dev/sdj2  32.00MiB

Unallocated:
  /dev/sda  153.02GiB
  /dev/sdi2 154.56GiB
  /dev/sdj2 152.36GiB

  devid 2 size 3.64TiB used 3.49TiB path /dev/sda
  devid 3 size 5.43TiB used 5.28TiB path /dev/sdi2


On Sat, May 26, 2018 at 7:16 PM, Qu Wenruo  wrote:
>
>
> On 2018年05月27日 10:06, Brad Templeton wrote:
>> Thanks.  These are all things which take substantial fractions of a
>> day to try, unfortunately.
>
> Normally I would suggest just using VM and several small disks (~10G),
> along with fallocate (the fastest way to use space) to get a basic view
> of the procedure.
>
>> Last time I ended up fixing it in a
>> fairly kluged way, which was to convert from raid-1 to single long
>> enough to get enough single blocks that when I converted back to
>> raid-1 they got distributed to the right drives.
>
> Yep, that's the ultimate one-size-fits-all solution.
> Also, this reminds me that we could do the
> RAID1->Single/DUP->Single downgrade in a much, much faster way.
> I think it's worth considering as a later enhancement.
>
>>  But this is, aside
>> from being a kludge, a procedure with some minor risk.  Of course I am
>> taking a backup first, but still...
>>
>> This strikes me as something that should be a fairly common event --
>> your raid is filling up, and so you expand it by replacing the oldest
>> and smallest drive with a new much bigger one.   In the old days of
>> RAID, you could not do that, you had to grow all drives at the same
>> time, and this is one of the ways that BTRFS is quite superior.
>> When I had MD raid, I went through a strange process of always having
>> a raid 5 that consisted of different sized drives.  The raid-5 was
>> based on the smallest of the 3 drives, and then the larger ones had
>> extra space which could either be in raid-1, or more simply was in solo
>> disk mode and used for less critical data (such as backups and old
>> archives.)   Slowly, and in a messy way, each time I replaced the
>> smallest drive, I could then grow the raid 5.  Yuck. BTRFS is so
>> much better, except for this issue.
>>
>> So if somebody has a thought of a procedure that is fairly sure to
>> work and doesn't involve too many copying passes -- copying 4tb is not
>> a quick operation -- it is much appreciated and might be a good thing
>> to add to a wiki page, which I would be happy to do.
>
> Anyway, "btrfs fi show" and "btrfs fi usage" would help before any
> further advice from community.
>
> Thanks,
> Qu
>
>>
>> On Sat, May 26, 2018 at 6:56 PM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年05月27日 09:49, Brad Templeton wrote:
 That is what did not work last time.

 I say I think there can be a "fix" because I hope the goal of BTRFS
 raid is to be superior to traditional RAID.   That if one replaces a
 drive, and asks to balance, it figures out what needs to be done to
 make that work.  I understand that the current balance algorithm may
 have trouble with that.   In this situation, the ideal result would be
 the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
 free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
 extents which are currently on both the 4TB and 6TB -- by moving only
 one copy.
>>>
>>> Btrfs can only do balance in a chunk unit.
>>> Thus btrfs can only do:
>>> 1) Create new chunk
>>> 2) Copy data
>>> 3) Remove old chunk.
>>>
>>> So it can't do it the way you mentioned.
>>> But your purpose sounds pretty valid and maybe we could enhance btrfs
>>> to do such a thing.
>>> (Currently only replace can behave like that)
>>>
 

Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Qu Wenruo


On 2018年05月27日 10:06, Brad Templeton wrote:
> Thanks.  These are all things which take substantial fractions of a
> day to try, unfortunately.

Normally I would suggest just using a VM and several small disks (~10G),
along with fallocate (the fastest way to use space), to get a basic view
of the procedure.
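A rehearsal environment along those lines can also be built with loop
devices instead of a full VM (all sizes and paths here are illustrative,
and this needs root):

```shell
# Three sparse ~10G backing files, attached as loop devices.
truncate -s 10G /tmp/d1.img /tmp/d2.img /tmp/d3.img
l1=$(losetup -f --show /tmp/d1.img)
l2=$(losetup -f --show /tmp/d2.img)
l3=$(losetup -f --show /tmp/d3.img)

# A raid1 filesystem across them, filled quickly via fallocate.
mkfs.btrfs -f -d raid1 -m raid1 "$l1" "$l2" "$l3"
mkdir -p /mnt/rehearse
mount "$l1" /mnt/rehearse
fallocate -l 8G /mnt/rehearse/filler

# ...try the replace/balance steps here, then tear down:
umount /mnt/rehearse
losetup -d "$l1" "$l2" "$l3"
```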

> Last time I ended up fixing it in a
> fairly kluged way, which was to convert from raid-1 to single long
> enough to get enough single blocks that when I converted back to
> raid-1 they got distributed to the right drives.

Yep, that's the ultimate one-size-fits-all solution.
Also, this reminds me that we could do the
RAID1->Single/DUP->Single downgrade in a much, much faster way.
I think it's worth considering as a later enhancement.

>  But this is, aside
> from being a kludge, a procedure with some minor risk.  Of course I am
> taking a backup first, but still...
> 
> This strikes me as something that should be a fairly common event --
> your raid is filling up, and so you expand it by replacing the oldest
> and smallest drive with a new much bigger one.   In the old days of
> RAID, you could not do that, you had to grow all drives at the same
> time, and this is one of the ways that BTRFS is quite superior.
> When I had MD raid, I went through a strange process of always having
> a raid 5 that consisted of different sized drives.  The raid-5 was
> based on the smallest of the 3 drives, and then the larger ones had
> extra space which could either be in raid-1, or more simply was in solo
> disk mode and used for less critical data (such as backups and old
> archives.)   Slowly, and in a messy way, each time I replaced the
> smallest drive, I could then grow the raid 5.  Yuck. BTRFS is so
> much better, except for this issue.
> 
> So if somebody has a thought of a procedure that is fairly sure to
> work and doesn't involve too many copying passes -- copying 4tb is not
> a quick operation -- it is much appreciated and might be a good thing
> to add to a wiki page, which I would be happy to do.

Anyway, "btrfs fi show" and "btrfs fi usage" would help before any
further advice from community.

Thanks,
Qu

> 
> On Sat, May 26, 2018 at 6:56 PM, Qu Wenruo  wrote:
>>
>>
>> On 2018年05月27日 09:49, Brad Templeton wrote:
>>> That is what did not work last time.
>>>
>>> I say I think there can be a "fix" because I hope the goal of BTRFS
>>> raid is to be superior to traditional RAID.   That if one replaces a
>>> drive, and asks to balance, it figures out what needs to be done to
>>> make that work.  I understand that the current balance algorithm may
>>> have trouble with that.   In this situation, the ideal result would be
>>> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
>>> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
>>> extents which are currently on both the 4TB and 6TB -- by moving only
>>> one copy.
>>
>> Btrfs can only do balance in a chunk unit.
>> Thus btrfs can only do:
>> 1) Create new chunk
>> 2) Copy data
>> 3) Remove old chunk.
>>
>> So it can't do it the way you mentioned.
>> But your purpose sounds pretty valid and maybe we could enhance btrfs
>> to do such a thing.
>> (Currently only replace can behave like that)
>>
>>> It is not strictly a "bug" in that the code is operating
>>> as designed, but it is an undesired function.
>>>
>>> The problem is the approach you describe did not work in the prior upgrade.
>>
>> Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
>> And before/after balance, "btrfs fi usage" and "btrfs fi show" output
>> could also help.
>>
>> Thanks,
>> Qu
>>
>>>
>>> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo  wrote:


 On 2018年05月27日 09:27, Brad Templeton wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
> fairly full.   The problem was that after replacing (by add/delete) a
> small drive with a larger one, there were now 2 full drives and one
> new half-full one, and balance was not able to correct this situation
> to produce the desired result, which is 3 drives, each with a roughly
> even amount of free space.  It can't do it because the 2 smaller
> drives are full, and it doesn't realize it could just move one of the
> copies of a block off the smaller drive onto the larger drive to free
> space on the smaller drive, it wants to move them both, and there is
> nowhere to put them both.

 It's not that easy.
 For balance, btrfs must first find a large enough space to locate both
 copies, then copy the data; otherwise, if a power loss happened
 mid-relocation, it would cause data corruption.

 So in your case, btrfs can only find enough space for one copy, thus
 unable to relocate any chunk.

>
> I'm about to do it again, taking my nearly full array which is 4TB,
> 4TB, 6TB and replacing one of the 4TB with 

Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Brad Templeton
Thanks.  These are all things which take substantial fractions of a
day to try, unfortunately.  Last time I ended up fixing it in a
fairly kluged way, which was to convert from raid-1 to single long
enough to get enough single blocks that when I converted back to
raid-1 they got distributed to the right drives.  But this is, aside
from being a kludge, a procedure with some minor risk.  Of course I am
taking a backup first, but still...

This strikes me as something that should be a fairly common event --
your raid is filling up, and so you expand it by replacing the oldest
and smallest drive with a new much bigger one.   In the old days of
RAID, you could not do that, you had to grow all drives at the same
time, and this is one of the ways that BTRFS is quite superior.
When I had MD raid, I went through a strange process of always having
a raid 5 that consisted of different sized drives.  The raid-5 was
based on the smallest of the 3 drives, and then the larger ones had
extra space which could either be in raid-1, or more simply was in solo
disk mode and used for less critical data (such as backups and old
archives.)   Slowly, and in a messy way, each time I replaced the
smallest drive, I could then grow the raid 5.  Yuck. BTRFS is so
much better, except for this issue.

So if somebody has a thought of a procedure that is fairly sure to
work and doesn't involve too many copying passes -- copying 4tb is not
a quick operation -- it is much appreciated and might be a good thing
to add to a wiki page, which I would be happy to do.

On Sat, May 26, 2018 at 6:56 PM, Qu Wenruo  wrote:
>
>
> On 2018年05月27日 09:49, Brad Templeton wrote:
>> That is what did not work last time.
>>
>> I say I think there can be a "fix" because I hope the goal of BTRFS
>> raid is to be superior to traditional RAID.   That if one replaces a
>> drive, and asks to balance, it figures out what needs to be done to
>> make that work.  I understand that the current balance algorithm may
>> have trouble with that.   In this situation, the ideal result would be
>> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
>> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
>> extents which are currently on both the 4TB and 6TB -- by moving only
>> one copy.
>
> Btrfs can only do balance in a chunk unit.
> Thus btrfs can only do:
> 1) Create new chunk
> 2) Copy data
> 3) Remove old chunk.
>
> So it can't do it the way you mentioned.
> But your purpose sounds pretty valid and maybe we could enhance btrfs
> to do such a thing.
> (Currently only replace can behave like that)
>
>> It is not strictly a "bug" in that the code is operating
>> as designed, but it is an undesired function.
>>
>> The problem is the approach you describe did not work in the prior upgrade.
>
> Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
> And before/after balance, "btrfs fi usage" and "btrfs fi show" output
> could also help.
>
> Thanks,
> Qu
>
>>
>> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年05月27日 09:27, Brad Templeton wrote:
 A few years ago, I encountered an issue (halfway between a bug and a
 problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
 fairly full.   The problem was that after replacing (by add/delete) a
 small drive with a larger one, there were now 2 full drives and one
 new half-full one, and balance was not able to correct this situation
 to produce the desired result, which is 3 drives, each with a roughly
 even amount of free space.  It can't do it because the 2 smaller
 drives are full, and it doesn't realize it could just move one of the
 copies of a block off the smaller drive onto the larger drive to free
 space on the smaller drive, it wants to move them both, and there is
 nowhere to put them both.
>>>
>>> It's not that easy.
>>> For balance, btrfs must first find a large enough space to locate both
>>> copies, then copy the data; otherwise, if a power loss happened
>>> mid-relocation, it would cause data corruption.
>>>
>>> So in your case, btrfs can only find enough space for one copy, thus
>>> unable to relocate any chunk.
>>>

 I'm about to do it again, taking my nearly full array which is 4TB,
 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
 repeat the very time consuming situation, so I wanted to find out if
 things were fixed now.   I am running Xenial (kernel 4.4.0) and could
 consider the upgrade to  bionic (4.15) though that adds a lot more to
 my plate before a long trip and I would prefer to avoid if I can.
>>>
>>> Since there is nothing to fix, the behavior will not change at all.
>>>

 So what is the best strategy:

 a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" 
 strategy)
 b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
 from 4TB but possibly not enough)
 c) 

Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Qu Wenruo


On 2018年05月27日 09:49, Brad Templeton wrote:
> That is what did not work last time.
> 
> I say I think there can be a "fix" because I hope the goal of BTRFS
> raid is to be superior to traditional RAID.   That if one replaces a
> drive, and asks to balance, it figures out what needs to be done to
> make that work.  I understand that the current balance algorithm may
> have trouble with that.   In this situation, the ideal result would be
> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
> extents which are currently on both the 4TB and 6TB -- by moving only
> one copy.

Btrfs can only do balance in a chunk unit.
Thus btrfs can only do:
1) Create new chunk
2) Copy data
3) Remove old chunk.

So it can't do it the way you mentioned.
But your purpose sounds pretty valid and maybe we could enhance btrfs
to do such a thing.
(Currently only replace can behave like that)

> It is not strictly a "bug" in that the code is operating
> as designed, but it is an undesired function.
> 
> The problem is the approach you describe did not work in the prior upgrade.

Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
And before/after balance, "btrfs fi usage" and "btrfs fi show" output
could also help.

Thanks,
Qu

> 
> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo  wrote:
>>
>>
>> On 2018年05月27日 09:27, Brad Templeton wrote:
>>> A few years ago, I encountered an issue (halfway between a bug and a
>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>>> fairly full.   The problem was that after replacing (by add/delete) a
>>> small drive with a larger one, there were now 2 full drives and one
>>> new half-full one, and balance was not able to correct this situation
>>> to produce the desired result, which is 3 drives, each with a roughly
>>> even amount of free space.  It can't do it because the 2 smaller
>>> drives are full, and it doesn't realize it could just move one of the
>>> copies of a block off the smaller drive onto the larger drive to free
>>> space on the smaller drive, it wants to move them both, and there is
>>> nowhere to put them both.
>>
>> It's not that easy.
>> For balance, btrfs must first find a large enough space to locate both
>> copies, then copy the data; otherwise, if a power loss happened
>> mid-relocation, it would cause data corruption.
>>
>> So in your case, btrfs can only find enough space for one copy, thus
>> unable to relocate any chunk.
>>
>>>
>>> I'm about to do it again, taking my nearly full array which is 4TB,
>>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>>> repeat the very time consuming situation, so I wanted to find out if
>>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>>> my plate before a long trip and I would prefer to avoid if I can.
>>
>> Since there is nothing to fix, the behavior will not change at all.
>>
>>>
>>> So what is the best strategy:
>>>
>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" 
>>> strategy)
>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>>> from 4TB but possibly not enough)
>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>>> recently vacated 6TB -- much longer procedure but possibly better
>>>
>>> Or has this all been fixed and method A will work fine and get to the
>>> ideal goal -- 3 drives, with available space suitably distributed to
>>> allow full utilization over time?
>>
>> The btrfs chunk allocator has already been trying to utilize all drives
>> for a long, long time.
>> When allocating chunks, btrfs will choose the device with the most free
>> space. However the nature of RAID1 needs btrfs to allocate extents from
>> 2 different devices, which makes your replaced 4/4/6 a little complex.
>> (If your 4/4/6 array had been set up and then filled to the current
>> stage, btrfs should be able to utilize all the space)
>>
>>
>> Personally speaking, if you're confident enough, just add a new device
>> and then do a balance.
>> If enough chunks get balanced, there should be enough space freed on
>> the existing disks.
>> Then remove the newly added device, and btrfs should handle the
>> remaining space well.
>>
>> Thanks,
>> Qu
>>
>>>
>>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton  wrote:
 A few years ago, I encountered an issue (halfway between a bug and a
 problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
 full.   The problem was that after replacing (by add/delete) a small drive
 with a larger one, there were now 2 full drives and one new half-full one,
 and balance was not able to correct this situation to produce the desired
 result, which is 3 drives, each with a roughly even amount of free space.
 It can't do it because the 2 smaller drives are full, and it doesn't 
 realize
 it could just move 

Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Brad Templeton
That is what did not work last time.

I say I think there can be a "fix" because I hope the goal of BTRFS
raid is to be superior to traditional RAID.   That if one replaces a
drive, and asks to balance, it figures out what needs to be done to
make that work.  I understand that the current balance algorithm may
have trouble with that.   In this situation, the ideal result would be
the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
extents which are currently on both the 4TB and 6TB -- by moving only
one copy.   It is not strictly a "bug" in that the code is operating
as designed, but it is an undesired function.

The problem is the approach you describe did not work in the prior upgrade.

On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo  wrote:
>
>
> On 2018年05月27日 09:27, Brad Templeton wrote:
>> A few years ago, I encountered an issue (halfway between a bug and a
>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>> fairly full.   The problem was that after replacing (by add/delete) a
>> small drive with a larger one, there were now 2 full drives and one
>> new half-full one, and balance was not able to correct this situation
>> to produce the desired result, which is 3 drives, each with a roughly
>> even amount of free space.  It can't do it because the 2 smaller
>> drives are full, and it doesn't realize it could just move one of the
>> copies of a block off the smaller drive onto the larger drive to free
>> space on the smaller drive, it wants to move them both, and there is
>> nowhere to put them both.
>
> It's not that easy.
> For balance, btrfs must first find a large enough space to locate both
> copies, then copy the data; otherwise, if a power loss happened
> mid-relocation, it would cause data corruption.
>
> So in your case, btrfs can only find enough space for one copy, thus
> unable to relocate any chunk.
>
>>
>> I'm about to do it again, taking my nearly full array which is 4TB,
>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>> repeat the very time consuming situation, so I wanted to find out if
>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>> my plate before a long trip and I would prefer to avoid if I can.
>
> Since there is nothing to fix, the behavior will not change at all.
>
>>
>> So what is the best strategy:
>>
>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" 
>> strategy)
>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>> from 4TB but possibly not enough)
>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>> recently vacated 6TB -- much longer procedure but possibly better
>>
>> Or has this all been fixed and method A will work fine and get to the
>> ideal goal -- 3 drives, with available space suitably distributed to
>> allow full utilization over time?
>
> The btrfs chunk allocator has already been trying to utilize all drives
> for a long, long time.
> When allocating chunks, btrfs will choose the device with the most free
> space. However the nature of RAID1 needs btrfs to allocate extents from
> 2 different devices, which makes your replaced 4/4/6 a little complex.
> (If your 4/4/6 array had been set up and then filled to the current
> stage, btrfs should be able to utilize all the space)
>
>
> Personally speaking, if you're confident enough, just add a new device
> and then do a balance.
> If enough chunks get balanced, there should be enough space freed on
> the existing disks.
> Then remove the newly added device, and btrfs should handle the
> remaining space well.
>
> Thanks,
> Qu
>
>>
>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton  wrote:
>>> A few years ago, I encountered an issue (halfway between a bug and a
>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>>> full.   The problem was that after replacing (by add/delete) a small drive
>>> with a larger one, there were now 2 full drives and one new half-full one,
>>> and balance was not able to correct this situation to produce the desired
>>> result, which is 3 drives, each with a roughly even amount of free space.
>>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>>> it could just move one of the copies of a block off the smaller drive onto
>>> the larger drive to free space on the smaller drive, it wants to move them
>>> both, and there is nowhere to put them both.
>>>
>>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>>> time consuming situation, so I wanted to find out if things were fixed now.
>>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>>> (4.15) though that adds a lot more to my plate before a long trip and I
>>> would prefer to avoid if I can.

Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Qu Wenruo


On 2018年05月27日 09:27, Brad Templeton wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
> fairly full.   The problem was that after replacing (by add/delete) a
> small drive with a larger one, there were now 2 full drives and one
> new half-full one, and balance was not able to correct this situation
> to produce the desired result, which is 3 drives, each with a roughly
> even amount of free space.  It can't do it because the 2 smaller
> drives are full, and it doesn't realize it could just move one of the
> copies of a block off the smaller drive onto the larger drive to free
> space on the smaller drive, it wants to move them both, and there is
> nowhere to put them both.

It's not that easy.
For balance, btrfs must first find a large enough space to locate both
copies, and only then copy the data.
Otherwise, a power loss mid-relocation would cause data corruption.

So in your case, btrfs can only find enough space for one copy, and is
thus unable to relocate any chunk.

> 
> I'm about to do it again, taking my nearly full array which is 4TB,
> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
> repeat the very time consuming situation, so I wanted to find out if
> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
> consider the upgrade to  bionic (4.15) though that adds a lot more to
> my plate before a long trip and I would prefer to avoid if I can.

Since there is nothing to fix, the behavior will not change at all.

> 
> So what is the best strategy:
> 
> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" 
> strategy)
> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
> from 4TB but possibly not enough)
> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
> recently vacated 6TB -- much longer procedure but possibly better
> 
> Or has this all been fixed and method A will work fine and get to the
> ideal goal -- 3 drives, with available space suitably distributed to
> allow full utilization over time?

The btrfs chunk allocator has been trying to utilize all drives for a
long, long time.
When allocating chunks, btrfs will choose the device with the most free
space. However, the nature of RAID1 requires btrfs to allocate extents
from 2 different devices, which makes your replaced 4/4/6 setup a little
complex.
(If your 4/4/6 array had been set up and then filled to its current
stage, btrfs should be able to utilize all the space)
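The allocation behavior described above can be sketched with a toy model (illustrative only, not btrfs code): each RAID1 chunk is placed on the two devices with the most unallocated space.

```python
def fill_raid1(sizes, used=None, chunk=1):
    """Greedily allocate RAID1 chunks (two copies each) until no two
    devices have room left; returns per-device allocated space."""
    used = list(used) if used is not None else [0] * len(sizes)
    while True:
        # pick the two devices with the most unallocated space
        a, b = sorted(range(len(sizes)),
                      key=lambda i: sizes[i] - used[i], reverse=True)[:2]
        if sizes[a] - used[a] < chunk or sizes[b] - used[b] < chunk:
            return used
        used[a] += chunk
        used[b] += chunk

# A 4/4/6 TB array filled from empty (sizes in GiB) uses all the space:
print(fill_raid1([4000, 4000, 6000]))   # -> [4000, 4000, 6000]

# State right after replacing a full 4 TB drive with an 8 TB one:
# only one device has free space, so nothing more can be allocated.
print(fill_raid1([8000, 4000, 6000], used=[4000, 4000, 6000]))
```

The second call models the situation in this thread: with free space on only one device, neither new chunk allocations nor balance relocations (which likewise need room for two copies) can proceed.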


Personally speaking, if you're confident enough, just add a new device
and then do a balance.
If enough chunks get balanced, there should be enough space freed on the
existing disks.
Then remove the newly added device, and btrfs should handle the
remaining space well.

Thanks,
Qu

> 
> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton  wrote:
>> A few years ago, I encountered an issue (halfway between a bug and a
>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>> full.   The problem was that after replacing (by add/delete) a small drive
>> with a larger one, there were now 2 full drives and one new half-full one,
>> and balance was not able to correct this situation to produce the desired
>> result, which is 3 drives, each with a roughly even amount of free space.
>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>> it could just move one of the copies of a block off the smaller drive onto
>> the larger drive to free space on the smaller drive, it wants to move them
>> both, and there is nowhere to put them both.
>>
>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>> time consuming situation, so I wanted to find out if things were fixed now.
>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>> (4.15) though that adds a lot more to my plate before a long trip and I
>> would prefer to avoid if I can.
>>
>> So what is the best strategy:
>>
>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>> strategy)
>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>> 4TB but possibly not enough)
>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>> vacated 6TB -- much longer procedure but possibly better
>>
>> Or has this all been fixed and method A will work fine and get to the ideal
>> goal -- 3 drives, with available space suitably distributed to allow full
>> utilization over time?
>>
>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager  wrote:
>>>
>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>>  wrote:
 On 23 March 2016 at 20:33, Chris Murphy  wrote:
>
> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton 
> wrote:
>>
>> I am surprised to hear it said that having the 

Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Brad Templeton
A few years ago, I encountered an issue (halfway between a bug and a
problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
fairly full.   The problem was that after replacing (by add/delete) a
small drive with a larger one, there were now 2 full drives and one
new half-full one, and balance was not able to correct this situation
to produce the desired result, which is 3 drives, each with a roughly
even amount of free space.  It can't do it because the 2 smaller
drives are full, and it doesn't realize it could just move one of the
copies of a block off the smaller drive onto the larger drive to free
space on the smaller drive, it wants to move them both, and there is
nowhere to put them both.

I'm about to do it again, taking my nearly full array which is 4TB,
4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
repeat the very time consuming situation, so I wanted to find out if
things were fixed now.   I am running Xenial (kernel 4.4.0) and could
consider the upgrade to  bionic (4.15) though that adds a lot more to
my plate before a long trip and I would prefer to avoid if I can.

So what is the best strategy:

a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
from 4TB but possibly not enough)
c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
recently vacated 6TB -- much longer procedure but possibly better

Or has this all been fixed and method A will work fine and get to the
ideal goal -- 3 drives, with available space suitably distributed to
allow full utilization over time?
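As a rough capacity check (my own back-of-envelope sketch, not btrfs output), the usable data capacity of a RAID1 array with mixed drive sizes is limited both by half the total and by how much the largest drive can find mirror partners for:

```python
def raid1_usable(sizes):
    """Approximate usable RAID1 data capacity for mixed drive sizes:
    every chunk needs copies on two devices, so the largest drive can
    mirror at most the sum of all the others."""
    total = sum(sizes)
    return min(total // 2, total - max(sizes))

print(raid1_usable([4, 4, 6]))   # -> 7  (TB, the current 4/4/6 array)
print(raid1_usable([4, 6, 8]))   # -> 9  (TB, after replacing a 4 TB with 8 TB)
```

So a 4/6/8 array can in principle hold 9 TB of mirrored data; the question in this thread is only whether balance can get there from a full starting state, not whether the capacity exists.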

On Sat, May 26, 2018 at 6:24 PM, Brad Templeton  wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
> full.   The problem was that after replacing (by add/delete) a small drive
> with a larger one, there were now 2 full drives and one new half-full one,
> and balance was not able to correct this situation to produce the desired
> result, which is 3 drives, each with a roughly even amount of free space.
> It can't do it because the 2 smaller drives are full, and it doesn't realize
> it could just move one of the copies of a block off the smaller drive onto
> the larger drive to free space on the smaller drive, it wants to move them
> both, and there is nowhere to put them both.
>
> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
> time consuming situation, so I wanted to find out if things were fixed now.
> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
> (4.15) though that adds a lot more to my plate before a long trip and I
> would prefer to avoid if I can.
>
> So what is the best strategy:
>
> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
> strategy)
> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
> 4TB but possibly not enough)
> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
> vacated 6TB -- much longer procedure but possibly better
>
> Or has this all been fixed and method A will work fine and get to the ideal
> goal -- 3 drives, with available space suitably distributed to allow full
> utilization over time?
>
> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager  wrote:
>>
>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>  wrote:
>> > On 23 March 2016 at 20:33, Chris Murphy  wrote:
>> >>
>> >> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton 
>> >> wrote:
>> >> >
>> >> > I am surprised to hear it said that having the mixed sizes is an odd
>> >> > case.
>> >>
>> >> Not odd as in wrong, just uncommon compared to other arrangements being
>> >> tested.
>> >
>> > I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>> > where you replace an old smaller drive with the latest and largest
>> > when you need more storage.
>> >
>> > My raid1 currently consists of 6TB+3TB+3*2TB.
>>
>> For the original OP situation, with chunks all filled op with extents
>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>> way in order to avoid immediate balancing needs:
>> - 'plug-in' the 6TB
>> - btrfs-replace  4TB by 6TB
>> - btrfs fi resize max 6TB_devID
>> - btrfs-replace  2TB by 4TB
>> - btrfs fi resize max 4TB_devID
>> - 'unplug' the 2TB
>>
>> So then there would be 2 devices with roughly 2TB space available, so
>> good for continued btrfs raid1 writes.
>>
>> An offline variant with dd instead of btrfs-replace could also be done
>> (I used to do that sometimes when btrfs-replace was not implemented).
>> My experience is that btrfs-replace speed is roughly at max speed (so
>> 

Re: [PATCH v2] btrfs: Use btrfs_mark_bg_unused() to replace open code

2018-05-26 Thread Qu Wenruo


On 2018年05月22日 20:14, David Sterba wrote:
> On Tue, May 22, 2018 at 04:43:47PM +0800, Qu Wenruo wrote:
>> Introduce a small helper, btrfs_mark_bg_unused(), to acquire the needed
>> locks and add a block group to the unused_bgs list.
> 
> The helper is nice but hides that there's a reference taken on the 'bg'.
> This would be good to add at least to the function comment or somehow
> squeeze it into the function name itself, like
> btrfs_get_and_mark_bg_unused.

That btrfs_get_block_group() call is only for later removal.
The reference will be removed in btrfs_delete_unused_bgs().

So I don't think the function name needs the "_get" part.

Although I could add a more detailed comment about the btrfs_get_block_group() call.

What do you think about this?

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 





Re: off-by-one uncompressed invalid ram_bytes corruptions

2018-05-26 Thread Qu Wenruo


On 2018年05月26日 22:06, Steve Leung wrote:
> On 05/20/2018 07:07 PM, Qu Wenruo wrote:
>>
>>
>> On 2018年05月21日 04:43, Steve Leung wrote:
>>> On 05/19/2018 07:02 PM, Qu Wenruo wrote:


 On 2018年05月20日 07:40, Steve Leung wrote:
> On 05/17/2018 11:49 PM, Qu Wenruo wrote:
>> On 2018年05月18日 13:23, Steve Leung wrote:
>>> Hi list,
>>>
>>> I've got 3-device raid1 btrfs filesystem that's throwing up some
>>> "corrupt leaf" errors in dmesg.  This is a uniquified list I've
>>> observed lately:
>>>
>>>  BTRFS critical (device sda1): corrupt leaf: root=1
>>> block=4970196795392
>>> slot=307 ino=206231 file_offset=0, invalid ram_bytes for
>>> uncompressed
>>> inline extent, have 3468 expect 3469
>>
>> Would you please use "btrfs-debug-tree -b 4970196795392 /dev/sda1" to
>> dump the leaf?
>
> Attached btrfs-debug-tree dumps for all of the blocks that I saw
> messages for.
>
>> It's caught by tree-checker code which is ensuring all tree blocks
>> are
>> correct before btrfs can take use of them.
>>
>> That inline extent size check is tested, so I'm wondering if this
>> indicates any real corruption.
>> That btrfs-debug-tree output will definitely help.
>>
>> BTW, if I didn't miss anything, there should not be any inlined
>> extent
>> in root tree.
>>
>>>  BTRFS critical (device sda1): corrupt leaf: root=1
>>> block=4970552426496
>>> slot=91 ino=209736 file_offset=0, invalid ram_bytes for uncompressed
>>> inline extent, have 3496 expect 3497
>>
>> Same dump will definitely help.
>>
>>>  BTRFS critical (device sda1): corrupt leaf: root=1
>>> block=4970712399872
>>> slot=221 ino=205230 file_offset=0, invalid ram_bytes for
>>> uncompressed
>>> inline extent, have 1790 expect 1791
>>>  BTRFS critical (device sda1): corrupt leaf: root=1
>>> block=4970803920896
>>> slot=368 ino=205732 file_offset=0, invalid ram_bytes for
>>> uncompressed
>>> inline extent, have 2475 expect 2476
>>>  BTRFS critical (device sda1): corrupt leaf: root=1
>>> block=4970987945984
>>> slot=236 ino=208896 file_offset=0, invalid ram_bytes for
>>> uncompressed
>>> inline extent, have 490 expect 491
>>>
>>> All of them seem to be 1 short of the expected value.
>>>
>>> Some files do seem to be inaccessible on the filesystem, and btrfs
>>> inspect-internal on any of those inode numbers fails with:
>>>
>>> ERROR: ino paths ioctl: Input/output error
>>>
>>> and another message for that inode appears.
>>>
>>> 'btrfs check' (output attached) seems to notice these corruptions
>>> (among
>>> a few others, some of which seem to be related to a problematic
>>> attempt
>>> to build Android I posted about some months ago).
>>>
>>> Other information:
>>>
>>> Arch Linux x86-64, kernel 4.16.6, btrfs-progs 4.16.  The filesystem
>>> has
>>> about 25 snapshots at the moment, only a handful of compressed
>>> files,
>>> and nothing fancy like qgroups enabled.
>>>
>>> btrfs fi show:
>>>
>>> Label: none  uuid: 9d4db9e3-b9c3-4f6d-8cb4-60ff55e96d82
>>>     Total devices 4 FS bytes used 2.48TiB
>>>     devid    1 size 1.36TiB used 1.13TiB path /dev/sdd1
>>>     devid    2 size 464.73GiB used 230.00GiB path /dev/sdc1
>>>     devid    3 size 1.36TiB used 1.13TiB path /dev/sdb1
>>>     devid    4 size 3.49TiB used 2.49TiB path /dev/sda1
>>>
>>> btrfs fi df:
>>>
>>> Data, RAID1: total=2.49TiB, used=2.48TiB
>>> System, RAID1: total=32.00MiB, used=416.00KiB
>>> Metadata, RAID1: total=7.00GiB, used=5.29GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> dmesg output attached as well.
>>>
>>> Thanks in advance for any assistance!  I have backups of all the
>>> important stuff here but it would be nice to fix the corruptions in
>>> place.
>>
>> And btrfs check doesn't report the same problem, as the default
>> original mode doesn't have such a check.
>>
>> Please also post the result of "btrfs check --mode=lowmem /dev/sda1"
>
> Also, attached.  It seems to notice the same off-by-one problems,
> though
> there also seem to be a couple of examples of being off by more than
> one.

 Unfortunately, it doesn't detect them, as there is no off-by-one error
 at all.

 The problem is, the kernel is reporting an error on a completely fine leaf.

 Furthermore, even in the same leaf, there are more inlined extents,
 and they are all valid.

 So the kernel reports the error out of nowhere.

 More problems happen for extent_size, where a lot of them are offset
 by one.

 Moreover, the root owner is not printed 

[josef-btrfs:blk-iolatency 7/13] mm/memcontrol.c:5479:3: error: implicit declaration of function 'blkcg_schedule_throttle'; did you mean 'poll_schedule_timeout'?

2018-05-26 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
blk-iolatency
head:   b62bb0da2afe1437b9d2d687ea1f509466fd3843
commit: 2caae39bf0a83094c506f98be6e339355544d103 [7/13] memcontrol: schedule 
throttling if we are congested
config: x86_64-randconfig-s4-05270049 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
git checkout 2caae39bf0a83094c506f98be6e339355544d103
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   mm/memcontrol.c: In function 'mem_cgroup_try_charge_delay':
>> mm/memcontrol.c:5479:3: error: implicit declaration of function 
>> 'blkcg_schedule_throttle'; did you mean 'poll_schedule_timeout'? 
>> [-Werror=implicit-function-declaration]
  blkcg_schedule_throttle(bdev_get_queue(bdev), true);
  ^~~
  poll_schedule_timeout
   cc1: some warnings being treated as errors

vim +5479 mm/memcontrol.c

  5460  
  5461  int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
  5462gfp_t gfp_mask, struct mem_cgroup **memcgp,
  5463bool compound)
  5464  {
  5465  struct mem_cgroup *memcg;
  5466  struct block_device *bdev;
  5467  int ret;
  5468  
  5469  ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp, compound);
  5470  memcg = *memcgp;
  5471  
  5472  if (!(gfp_mask & __GFP_IO) || !memcg)
  5473  return ret;
  5474  #ifdef CONFIG_BLOCK
  5475  if (atomic_read(&memcg->css.cgroup->congestion_count) &&
  5476  has_usable_swap()) {
  5477  map_swap_page(page, &bdev);
  5478  
> 5479  blkcg_schedule_throttle(bdev_get_queue(bdev), true);
  5480  }
  5481  #endif
  5482  return ret;
  5483  }
  5484  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[josef-btrfs:blk-iolatency 7/13] mm/memcontrol.c:5476:6: error: implicit declaration of function 'has_usable_swap'

2018-05-26 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
blk-iolatency
head:   b62bb0da2afe1437b9d2d687ea1f509466fd3843
commit: 2caae39bf0a83094c506f98be6e339355544d103 [7/13] memcontrol: schedule 
throttling if we are congested
config: x86_64-randconfig-s3-05270017 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
git checkout 2caae39bf0a83094c506f98be6e339355544d103
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   mm/memcontrol.c: In function 'mem_cgroup_try_charge_delay':
>> mm/memcontrol.c:5476:6: error: implicit declaration of function 
>> 'has_usable_swap' [-Werror=implicit-function-declaration]
 has_usable_swap()) {
 ^~~
>> mm/memcontrol.c:5477:3: error: implicit declaration of function 
>> 'map_swap_page'; did you mean 'do_swap_page'? 
>> [-Werror=implicit-function-declaration]
  map_swap_page(page, &bdev);
  ^
  do_swap_page
   cc1: some warnings being treated as errors

vim +/has_usable_swap +5476 mm/memcontrol.c

  5460  
  5461  int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
  5462gfp_t gfp_mask, struct mem_cgroup **memcgp,
  5463bool compound)
  5464  {
  5465  struct mem_cgroup *memcg;
  5466  struct block_device *bdev;
  5467  int ret;
  5468  
  5469  ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp, compound);
  5470  memcg = *memcgp;
  5471  
  5472  if (!(gfp_mask & __GFP_IO) || !memcg)
  5473  return ret;
  5474  #ifdef CONFIG_BLOCK
  5475  if (atomic_read(&memcg->css.cgroup->congestion_count) &&
> 5476  has_usable_swap()) {
> 5477  map_swap_page(page, &bdev);
  5478  
  5479  blkcg_schedule_throttle(bdev_get_queue(bdev), true);
  5480  }
  5481  #endif
  5482  return ret;
  5483  }
  5484  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: off-by-one uncompressed invalid ram_bytes corruptions

2018-05-26 Thread Steve Leung

On 05/20/2018 07:07 PM, Qu Wenruo wrote:



On 2018年05月21日 04:43, Steve Leung wrote:

On 05/19/2018 07:02 PM, Qu Wenruo wrote:



On 2018年05月20日 07:40, Steve Leung wrote:

On 05/17/2018 11:49 PM, Qu Wenruo wrote:

On 2018年05月18日 13:23, Steve Leung wrote:

Hi list,

I've got 3-device raid1 btrfs filesystem that's throwing up some
"corrupt leaf" errors in dmesg.  This is a uniquified list I've
observed lately:



     BTRFS critical (device sda1): corrupt leaf: root=1
block=4970196795392
slot=307 ino=206231 file_offset=0, invalid ram_bytes for uncompressed
inline extent, have 3468 expect 3469


Would you please use "btrfs-debug-tree -b 4970196795392 /dev/sda1" to
dump the leaf?


Attached btrfs-debug-tree dumps for all of the blocks that I saw
messages for.


It's caught by tree-checker code which is ensuring all tree blocks are
correct before btrfs can take use of them.

That inline extent size check is tested, so I'm wondering if this
indicates any real corruption.
That btrfs-debug-tree output will definitely help.

BTW, if I didn't miss anything, there should not be any inlined extent
in root tree.


     BTRFS critical (device sda1): corrupt leaf: root=1
block=4970552426496
slot=91 ino=209736 file_offset=0, invalid ram_bytes for uncompressed
inline extent, have 3496 expect 3497


Same dump will definitely help.


     BTRFS critical (device sda1): corrupt leaf: root=1
block=4970712399872
slot=221 ino=205230 file_offset=0, invalid ram_bytes for uncompressed
inline extent, have 1790 expect 1791
     BTRFS critical (device sda1): corrupt leaf: root=1
block=4970803920896
slot=368 ino=205732 file_offset=0, invalid ram_bytes for uncompressed
inline extent, have 2475 expect 2476
     BTRFS critical (device sda1): corrupt leaf: root=1
block=4970987945984
slot=236 ino=208896 file_offset=0, invalid ram_bytes for uncompressed
inline extent, have 490 expect 491

All of them seem to be 1 short of the expected value.

Some files do seem to be inaccessible on the filesystem, and btrfs
inspect-internal on any of those inode numbers fails with:

    ERROR: ino paths ioctl: Input/output error

and another message for that inode appears.

'btrfs check' (output attached) seems to notice these corruptions
(among
a few others, some of which seem to be related to a problematic
attempt
to build Android I posted about some months ago).

Other information:

Arch Linux x86-64, kernel 4.16.6, btrfs-progs 4.16.  The filesystem
has
about 25 snapshots at the moment, only a handful of compressed files,
and nothing fancy like qgroups enabled.

btrfs fi show:

    Label: none  uuid: 9d4db9e3-b9c3-4f6d-8cb4-60ff55e96d82
    Total devices 4 FS bytes used 2.48TiB
    devid    1 size 1.36TiB used 1.13TiB path /dev/sdd1
    devid    2 size 464.73GiB used 230.00GiB path /dev/sdc1
    devid    3 size 1.36TiB used 1.13TiB path /dev/sdb1
    devid    4 size 3.49TiB used 2.49TiB path /dev/sda1

btrfs fi df:

    Data, RAID1: total=2.49TiB, used=2.48TiB
    System, RAID1: total=32.00MiB, used=416.00KiB
    Metadata, RAID1: total=7.00GiB, used=5.29GiB
    GlobalReserve, single: total=512.00MiB, used=0.00B

dmesg output attached as well.

Thanks in advance for any assistance!  I have backups of all the
important stuff here but it would be nice to fix the corruptions in
place.


And btrfs check doesn't report the same problem, as the default original
mode doesn't have such a check.

Please also post the result of "btrfs check --mode=lowmem /dev/sda1"


Also, attached.  It seems to notice the same off-by-one problems, though
there also seem to be a couple of examples of being off by more than
one.


Unfortunately, it doesn't detect them, as there is no off-by-one error at all.

The problem is, the kernel is reporting an error on a completely fine leaf.

Furthermore, even in the same leaf, there are more inlined extents, and
they are all valid.

So the kernel reports the error out of nowhere.

More problems happen for extent_size, where a lot of them are offset by
one.

Moreover, the root owner is not printed correctly, so I'm wondering
whether the memory is corrupted.

Please try memtest+ to verify all your memory is correct, and if so,
please try the attached patch and to see if it provides extra info.


Memtest ran for about 12 hours last night, and didn't find any errors.

New messages from patched kernel:

  BTRFS critical (device sdd1): corrupt leaf: root=1 block=4970196795392
slot=307 ino=206231 file_offset=0, invalid ram_bytes for uncompressed
inline extent, have 3468 expect 3469 (21 + 3448)


This output doesn't match the debug-tree dump.

item 307 key (206231 EXTENT_DATA 0) itemoff 15118 itemsize 3468
generation 692987 type 0 (inline)
inline extent data size 3447 ram_bytes 3447 compression 0 (none)

Where its ram_bytes is 3447, not 3448.

Furthermore, there are 2 more inlined extents; if something really went
wrong reading ram_bytes, it should also trigger the same warning.

item 26 key (206227 

Re: [PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path

2018-05-26 Thread Nikolay Borisov


On 23.05.2018 18:58, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Since we are waiting on all ordered extents at the start of the fsync()
> path we don't need to wait on any logged ordered extents, and we don't
> need to look up the checksums on the ordered extents as they will
> already be on disk prior to getting here.  Rework this so we're only
> looking up and copying the on-disk checksums for the extent range we
> care about.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/tree-log.c | 118 +---
>  1 file changed, 10 insertions(+), 108 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 43758e30aa7a..791b5a731456 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4082,127 +4082,30 @@ static int extent_cmp(void *priv, struct list_head *a, struct list_head *b)
>   return 0;
>  }
>  
> -static int wait_ordered_extents(struct btrfs_trans_handle *trans,
> - struct inode *inode,
> - struct btrfs_root *root,
> - const struct extent_map *em,
> - const struct list_head *logged_list,
> - bool *ordered_io_error)
> +static int log_extent_csums(struct btrfs_trans_handle *trans,
> + struct btrfs_inode *inode,
> + struct btrfs_root *root,
> + const struct extent_map *em)
>  {
>   struct btrfs_fs_info *fs_info = root->fs_info;
> - struct btrfs_ordered_extent *ordered;
>   struct btrfs_root *log = root->log_root;
> - u64 mod_start = em->mod_start;
> - u64 mod_len = em->mod_len;
> - const bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>   u64 csum_offset;
>   u64 csum_len;
>   LIST_HEAD(ordered_sums);
>   int ret = 0;
>  
> - *ordered_io_error = false;
> -
> - if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
> + if (inode->flags & BTRFS_INODE_NODATASUM ||
> + test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
>   em->block_start == EXTENT_MAP_HOLE)
>   return 0;
>  
> - /*
> -  * Wait far any ordered extent that covers our extent map. If it
> -  * finishes without an error, first check and see if our csums are on
> -  * our outstanding ordered extents.
> -  */
> - list_for_each_entry(ordered, logged_list, log_list) {
> - struct btrfs_ordered_sum *sum;
> -
> - if (!mod_len)
> - break;
> -
> - if (ordered->file_offset + ordered->len <= mod_start ||
> - mod_start + mod_len <= ordered->file_offset)
> - continue;
> -
> - if (!test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) &&
> - !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
> - !test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
> - const u64 start = ordered->file_offset;
> - const u64 end = ordered->file_offset + ordered->len - 1;
> -
> - WARN_ON(ordered->inode != inode);
> - filemap_fdatawrite_range(inode->i_mapping, start, end);
> - }
> -
> - wait_event(ordered->wait,
> -(test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) ||
> - test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)));
> -
> - if (test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) {
> - /*
> -  * Clear the AS_EIO/AS_ENOSPC flags from the inode's
> -  * i_mapping flags, so that the next fsync won't get
> -  * an outdated io error too.
> -  */
> - filemap_check_errors(inode->i_mapping);
> - *ordered_io_error = true;
> - break;
> - }
> - /*
> -  * We are going to copy all the csums on this ordered extent, so
> -  * go ahead and adjust mod_start and mod_len in case this
> -  * ordered extent has already been logged.
> -  */
> - if (ordered->file_offset > mod_start) {
> - if (ordered->file_offset + ordered->len >=
> - mod_start + mod_len)
> - mod_len = ordered->file_offset - mod_start;
> - /*
> -  * If we have this case
> -  *
> -  * |- logged extent -|
> -  *   |- ordered extent |
> -  *
> -  * Just don't mess with mod_start and mod_len, we'll
> -  * just end up logging more csums than we need and it
> -  * will be ok.
> -  */
> - } else {
> - if