date:20170519

Re: RAID 6 corrupted

2017-05-19 Thread Duncan

Pasi Kärkkäinen posted on Fri, 19 May 2017 11:55:27 +0300 as excerpted:

> On Thu, May 18, 2017 at 06:12:22AM +, Duncan wrote:
>> Roman Mamedov posted on Thu, 18 May 2017 10:17:19 +0500 as excerpted:
>> 
>> > On Thu, 18 May 2017 04:09:38 +0200 ??ukasz Wróblewski 
>> > wrote:
>> > 
>> >> I will try when stable 4.12 comes out.
>> >> Unfortunately I do not have a backup.
>> >> Fortunately, these data are not so critical.
>> >> Some private photos and videos of youth.
>> >> However, I would be very happy if I could get it back.
>> > 
>> > Try saving your data with "btrfs restore" first
>> 
>> First post, he tried that.  No luck.  Tho that was with 4.4 userspace.
>> It might be worth trying with the 4.11-rc or soon to be released 4.11
>> userspace, tho...
>> 
>> 
> Try with 4.12-rc, I assume :)

No.  btrfs restore is 100% userspace, so we're talking btrfs-progs (thus 
the "userspace" specifier) version here, not kernel version.

And at the time 4.11-rc was the latest progs version, with 4.11 due 
within a day or two, so "4.11-rc or soon to be released 4.11 userspace" 
was correct.

Of course since then 4.11 userspace /has/ been released, so it'd be 4.11 
without the -rc, now.

(When I got the CCed email I couldn't make heads or tails since I was 
missing the thread context, tho the fact that I was dead tired probably 
didn't help.  I just checked the list, which I interface with as a 
newsgroup on gmane.org's list2news service, after getting home from work 
today, however, and there the entire thread is visible, as one would 
expect for a newsgroup, so I could make sense of the context, and realize 
it was a case of kernel version vs. userspace version confusion. Easy 
enough to clarify... =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Hugo Mills

On Fri, May 19, 2017 at 06:25:22PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:57:09AM +, Hugo Mills wrote:
> >I think from the POV of removing these BUG_ONs, it doesn't matter
> > which FS causes them. "All" you need to know is where the error
> > happened. From there, you can (in theory) work out what was wrong and
> > handle it more elagantly than simply stopping.
>  
> Sorry, "you" being the code author, or the user?

   Author.

> If code author, I'd rather this be worked out without the extra steps of
> having to guess or spend more time to see which FS.

   As I understand it, it doesn't really matter which FS it comes
from. The question is: The kernel has hit this BUG_ON. What do you
actually want to do when this happens? You can't bring the kernel to a
grinding halt (BUG_ON), so how do you handle this more elegantly?

   It actually doesn't matter about the state of any specific FS that
caused this particular problem. The fact is, someone decided to check
on the FS's state, and punted the problem of handling the check's
failure to someone later (the BUG_ON). You(*)'ve got to pick up that punt
and deal with it more cleanly.

(*) You == some kernel developer.

> My FS takes up to a day to scrub and btrfs check, clearly making me do this
> over 3 of them is not a good use of time and a loss of up to 3 days of wall
> clock time.
> Not counting that during that time, I have loss of service on all my
> filesystems because I don't want to mount them read-write.
> 
> >Obviously it would be nice, from the POV of the sysadmin, to know
> > which FS was complaining, but as an FS developer it's secondary to
> > identifying a BUG_ON which happens in real life, which offers an
> > opportunity to make the error path more elegant.
> 
> If the FS is remounted R/O, further damage is averted and it's obvious to
> the admin which FS has a problem.
> 
> Is there a reason why all errors that are serious enough, do not cause the
> FS to remount R/O instead of having any BUG/BUG_ON at all?

   Simply that it's easier to write a BUG_ON than to write the code to
bubble up a failure to the point that the FS can be made RO. This is a
clean-up kind of process: BUG_ONs should mostly be changed into a
proper error-handling path leading to remount-RO (in the worst
cases). As I understand it, it's not massively difficult, but it's
probably non-trivial effort to get right in each case.

> WARN_ON is also fine obviously if the error is not serious enough, or doing
> a WARN_ON + a remount R/O

   Sure, but everything shouild really be turned into either a proper
error-handling path (most likely remount RO), or explicitly defined as
BUG_ON (i.e. "this must never happen -- if it does, then the hardware
is fucked up, and we're not responsible for the consequences") It's
that latter definition that's part of the hard decision-making process
for the kernel dev.

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 7:
hugo@... carfax.org.uk | The Simple Truth
http://carfax.org.uk/  |
PGP: E2AB1DE4  |

signature.asc
Description: Digital signature

Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Marc MERLIN

On Sat, May 20, 2017 at 12:57:09AM +, Hugo Mills wrote:
>I think from the POV of removing these BUG_ONs, it doesn't matter
> which FS causes them. "All" you need to know is where the error
> happened. From there, you can (in theory) work out what was wrong and
> handle it more elagantly than simply stopping.

Sorry, "you" being the code author, or the user?
If code author, I'd rather this be worked out without the extra steps of
having to guess or spend more time to see which FS.
My FS takes up to a day to scrub and btrfs check, clearly making me do this
over 3 of them is not a good use of time and a loss of up to 3 days of wall
clock time.
Not counting that during that time, I have loss of service on all my
filesystems because I don't want to mount them read-write.

>Obviously it would be nice, from the POV of the sysadmin, to know
> which FS was complaining, but as an FS developer it's secondary to
> identifying a BUG_ON which happens in real life, which offers an
> opportunity to make the error path more elegant.

If the FS is remounted R/O, further damage is averted and it's obvious to
the admin which FS has a problem.
Is there a reason why all errors that are serious enough, do not cause the
FS to remount R/O instead of having any BUG/BUG_ON at all?
WARN_ON is also fine obviously if the error is not serious enough, or doing
a WARN_ON + a remount R/O

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

signature.asc
Description: Digital signature

Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Hugo Mills

On Fri, May 19, 2017 at 05:47:48PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:37:47AM +, Hugo Mills wrote:
> > > Can I make another plea for just removing all those BUG/BUG_ON?
> > > They really have no place in production code, there is no excuse for a
> > > filesystem to bring down the entire and in the process not even tell you
> > > which of your filesystems had the issue to start with.
> > > 
> > > Could this be made part of a cleanup for this build to remove them all?
> > 
> >The removal of these has been an ongoing process for at least the
> > last 5 years.
>  
> That's great news, thanks. I guess I'm a bit edgy because I've hit too many
> of them already :) but glad to hear that there are a lot fewer now.
> 
> >I don't understand the specifics of the kernel code in question(*),
> > but compared to 5 years ago, btrfs has got rid of most of the
> > BUG_ONs(**) a few years ago. The remaining ones are probably
> > complicated to deal with in any way more elegant than just stopping.
> 
> The biggest problem is that those BUG* do not even tell you where the
> problem.
> The assumption that you'd only ever have a single btrfs filesystem mounted,
> is flawed to say the least :)
> (I have 5 different ones on my server)

   I think from the POV of removing these BUG_ONs, it doesn't matter
which FS causes them. "All" you need to know is where the error
happened. From there, you can (in theory) work out what was wrong and
handle it more elagantly than simply stopping.

   Obviously it would be nice, from the POV of the sysadmin, to know
which FS was complaining, but as an FS developer it's secondary to
identifying a BUG_ON which happens in real life, which offers an
opportunity to make the error path more elegant.

> >I recall seeing someone's stats on BUG_ON locations a couple of
> > years ago, and btrfs had managed to get the number of locations down
> > below XFS (but no other FS). It's a kind of success, at least...
> 
> Good to know, thanks, and thanks to anyone who has worked on removing those.

   I don't know what the current state is. Maybe someone on IRC will
be able to do the greps/stats to give proper numbers to it.

   Hugo.

-- 
Hugo Mills | IMPROVE YOUR ORGANISMS!!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Subject line of spam email


signature.asc
Description: Digital signature

Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Marc MERLIN

On Sat, May 20, 2017 at 12:37:47AM +, Hugo Mills wrote:
> > Can I make another plea for just removing all those BUG/BUG_ON?
> > They really have no place in production code, there is no excuse for a
> > filesystem to bring down the entire and in the process not even tell you
> > which of your filesystems had the issue to start with.
> > 
> > Could this be made part of a cleanup for this build to remove them all?
> 
>The removal of these has been an ongoing process for at least the
> last 5 years.
 
That's great news, thanks. I guess I'm a bit edgy because I've hit too many
of them already :) but glad to hear that there are a lot fewer now.

>I don't understand the specifics of the kernel code in question(*),
> but compared to 5 years ago, btrfs has got rid of most of the
> BUG_ONs(**) a few years ago. The remaining ones are probably
> complicated to deal with in any way more elegant than just stopping.

The biggest problem is that those BUG* do not even tell you where the
problem.
The assumption that you'd only ever have a single btrfs filesystem mounted,
is flawed to say the least :)
(I have 5 different ones on my server)

>I recall seeing someone's stats on BUG_ON locations a couple of
> years ago, and btrfs had managed to get the number of locations down
> below XFS (but no other FS). It's a kind of success, at least...

Good to know, thanks, and thanks to anyone who has worked on removing those.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Hugo Mills

On Fri, May 19, 2017 at 05:11:34PM -0700, Marc MERLIN wrote:
> On Fri, May 19, 2017 at 12:03:58PM -0700, Liu Bo wrote:
> > Hi Marc,
> > 
> > On Thu, May 18, 2017 at 09:16:38PM -0700, Marc MERLIN wrote:
> > > Looks like all the unhelpful BUG() aren't gone yet :-/
> > > This one is really not helpful, I don't even know which one of my 
> > > filesystems caused the crash :(
> > > 
> > > Why is this not remounting the filesystem read only?
> > > Really, from a user and admin perspective, this is really not helpful.
> > > 
> > > Could someone who know more than me do a pass and eradicate those? 
> > > Btrfs cannot be a production filesystem as long as those are still around 
> > > IMO.
> > 
> > Looks like there's a security hole hidden in code, I don't think it's
> > a bug in code, it's more like caused by a corrupted metadata reading
> > from disk rather than a memory corruption.
> > 
> > A quick glance at the stack shows in 
> > extent-tree.c:lookup_inline_extent_backref()
> > 
> > type = btrfs_extent_inline_ref_type(leaf, iref);
> > then...
> > ptr += btrfs_extent_inline_ref_size(type);
> > 
> > I agree that a corrupted image should not corrupt the kernel, so we
> > can fix it by forcing it to readonly.
> 
> Thanks.
> Can I make another plea for just removing all those BUG/BUG_ON?
> They really have no place in production code, there is no excuse for a
> filesystem to bring down the entire and in the process not even tell you
> which of your filesystems had the issue to start with.
> 
> Could this be made part of a cleanup for this build to remove them all?

   The removal of these has been an ongoing process for at least the
last 5 years.

   I don't understand the specifics of the kernel code in question(*),
but compared to 5 years ago, btrfs has got rid of most of the
BUG_ONs(**) a few years ago. The remaining ones are probably
complicated to deal with in any way more elegant than just stopping.

   I recall seeing someone's stats on BUG_ON locations a couple of
years ago, and btrfs had managed to get the number of locations down
below XFS (but no other FS). It's a kind of success, at least...

   Hugo.

(*) I don't have the spare time to fully comprehend the process. Sorry.

(**) A good fraction went away in the first year after the decision to
get rid of them.

> Pretty please with cherry on top? :)
> 
> Marc

-- 
Hugo Mills | IMPROVE YOUR ORGANISMS!!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Subject line of spam email


signature.asc
Description: Digital signature

Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Marc MERLIN

On Fri, May 19, 2017 at 12:03:58PM -0700, Liu Bo wrote:
> Hi Marc,
> 
> On Thu, May 18, 2017 at 09:16:38PM -0700, Marc MERLIN wrote:
> > Looks like all the unhelpful BUG() aren't gone yet :-/
> > This one is really not helpful, I don't even know which one of my 
> > filesystems caused the crash :(
> > 
> > Why is this not remounting the filesystem read only?
> > Really, from a user and admin perspective, this is really not helpful.
> > 
> > Could someone who know more than me do a pass and eradicate those? 
> > Btrfs cannot be a production filesystem as long as those are still around 
> > IMO.
> 
> Looks like there's a security hole hidden in code, I don't think it's
> a bug in code, it's more like caused by a corrupted metadata reading
> from disk rather than a memory corruption.
> 
> A quick glance at the stack shows in 
> extent-tree.c:lookup_inline_extent_backref()
> 
> type = btrfs_extent_inline_ref_type(leaf, iref);
> then...
> ptr += btrfs_extent_inline_ref_size(type);
> 
> I agree that a corrupted image should not corrupt the kernel, so we
> can fix it by forcing it to readonly.

Thanks.
Can I make another plea for just removing all those BUG/BUG_ON?
They really have no place in production code, there is no excuse for a
filesystem to bring down the entire and in the process not even tell you
which of your filesystems had the issue to start with.

Could this be made part of a cleanup for this build to remove them all?

Pretty please with cherry on top? :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] Btrfs: lzo compression must free at least PAGE_SIZE

2017-05-19 Thread Timofey Titovets

2017-05-19 23:19 GMT+03:00 Lionel Bouton :
> I was too focused on other problems and having a fresh look at what I
> wrote I'm embarrassed by what I read.
> Used pages for a given amount of data should be (amount / PAGE_SIZE) +
> ((amount % PAGE_SIZE) == 0 ? 0 : 1) this seems enough of a common thing
> to compute that the kernel might have a macro defined for this.

If i understand the code correctly, the logic of comparing the size of
input/output by bytes is enough (IMHO) for skipping the compression of
non-compressible data, and as btrfs uses PAGE_SIZE as a data cluster
size (and if i understand correctly it's minimum IO size), the logic
of that can be improved in case when compressed data use 126978 <
compressed_size < 131072.
The easiest way to improve that case, i think, is making the
compressed size bigger by PAGE_SIZE.

JFYI:
Once I've read on the list that btrfs don't compress data, if data are
less then PAGE_SIZE because it's useless (i didn't find that in the
code, so i think that i don't fully understand code of compress
routine)

After some time i got a idea that if btrfs determines store data
compressed or not by comparing input/output size of data (i.e. if
compressed size is bigger in compare to input data, don't store
compressed version), this logic can be improved by also checking if
compression profit is more then PAGE_SIZE, because if it's not,
compressed data will pass check and comsume the same amount of space
and as a result will not show any gain.

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] Btrfs: lzo compression must free at least PAGE_SIZE

2017-05-19 Thread Lionel Bouton

Le 19/05/2017 à 16:17, Lionel Bouton a écrit :
> Hi,
>
> Le 19/05/2017 à 15:38, Timofey Titovets a écrit :
>> If data compression didn't free at least one PAGE_SIZE, it useless to store 
>> that compressed extent
>>
>> Signed-off-by: Timofey Titovets 
>> ---
>>  fs/btrfs/lzo.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
>> index bd0b0938..637ef1b0 100644
>> --- a/fs/btrfs/lzo.c
>> +++ b/fs/btrfs/lzo.c
>> @@ -207,7 +207,7 @@ static int lzo_compress_pages(struct list_head *ws,
>>  }
>>  
>>  /* we're making it bigger, give up */
>> -if (tot_in > 8192 && tot_in < tot_out) {
>> +if (tot_in > 8192 && tot_in < tot_out + PAGE_SIZE) {
>>  ret = -E2BIG;
>>  goto out;
>>  }
> I'm not familiar with this code but I was surprised by the test : you
> would expect compression having a benefit when you are freeing an actual
> page not reducing data by a page size. So unless I don't understand the
> context shouldn't it be something like :
>
> if (tot_in > 8192 && ((tot_in % PAGE_SIZE) <= (tot_out % PAGE_SIZE))
>
> but looking at the code I see that this is in a while loop and there's
> another test just after the loop in the existing code :
>
> if (tot_out > tot_in)
> goto out;
>
> There's a couple of things I don't understand but isn't this designed to
> stream data in small chunks through compression before writing it in the
> end ? So isn't this later test the proper location to detect if
> compression was beneficial ?
>
> You might not save a page early on in the while loop working on a subset
> of the data to compress but after enough data being processed you could
> save a page. It seems odd that your modification could abort compression
> early on although the same condition would become true after enough loops.
>
> Isn't what you want something like :
>
> if (tot_out % PAGE_SIZE >= tot_in % PAGE_SIZE)
> goto out;
>
> after the loop ?
> The >= instead of > would avoid decompression in the case where the
> compressed data is smaller but uses the same space on disk.

I was too focused on other problems and having a fresh look at what I
wrote I'm embarrassed by what I read.
Used pages for a given amount of data should be (amount / PAGE_SIZE) +
((amount % PAGE_SIZE) == 0 ? 0 : 1) this seems enough of a common thing
to compute that the kernel might have a macro defined for this.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

2017-05-19 Thread Liu Bo

On Fri, May 19, 2017 at 09:06:42PM +0200, David Sterba wrote:
> On Tue, May 09, 2017 at 05:02:15PM -0600, Liu Bo wrote:
> > Before this, we use 'filled' mode here, ie. if all range has been filled
> > with EXTENT_DEFRAG bits, get to clear it, but if the defrag range joins
> > the adjacent delalloc range, then we'll leave EXTENT_DEFRAG bits until
> > evicting inode.
> >
> > This clears the bit if any was found within the ordered extent.
> 
> What effects, good or bad, can this have?
> 
> Is it worth backporting to stable trees?

The good effect of this patch is to free extent_state quickly if we
don't need it, without this, it can't be freed since the extent_state
has at least EXTENT_DEFRAG bit in ->state.

Just notice that I made a mistake in the changelog, the bit will be
cleared until releasing pages, which may be called by
invalidate_mapping_ranges(), not evicting inode.

No, I don't think it's a candidate for stable tree.

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs hang with 4.11.1 kernel

2017-05-19 Thread Liu Bo

On Wed, May 17, 2017 at 05:44:02PM +0900, Tomasz Chmielewski wrote:
> After upgrading to 4.11.1 and running for ~12 hours, btrfs filesystem hanged
> - most processes on a server accessing the filesystem went into "D" state
> and blocked.
> 
> The server would not reboot and had to be power cycled.
> 
> 
> Previously, it was running 4.10.7 for around 1.5 month without any issues.
> 
> The server has ~10 subvolumes and 5-10 snapshots, >220 GB free (about 50%
> free), runs in RAID-1 mode on SSD disks.
> 
> 
> Is it anything known with 4.11.x?

Stacks trimmed like this makes them harder to understand :(

These hung stacks shows several points,

- one file writer is waiting on lock_extent_bits() while holding inode's lock,
  so other writers on this file is also blocked.

- commit transaction is waiting on fs/file tree's root lock (someone else is
  holding the write lock), and it blocks other transactions to either commit or
  start.

- fsync on one of files is trying to start a transaction but is blocked while
  holding inode's lock, so other writers on this file is blocked.

So we need to find out who takes a fs/file tree's root write lock, sysrq+w may
have more info.

Thanks,
-liubo

> 
> 
> May 17 07:47:53 lxd02 kernel: [43865.593536] INFO: task btrfs-transacti:610
> blocked for more than 120 seconds.
> May 17 07:47:53 lxd02 kernel: [43865.593557]   Not tainted
> 4.11.1-041101-generic #201705140931
> May 17 07:47:53 lxd02 kernel: [43865.593572] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 17 07:47:53 lxd02 kernel: [43865.593593] btrfs-transacti D0   610
> 2 0x
> May 17 07:47:53 lxd02 kernel: [43865.593596] Call Trace:
> May 17 07:47:53 lxd02 kernel: [43865.593605]  __schedule+0x3c6/0x8c0
> May 17 07:47:53 lxd02 kernel: [43865.593609]  schedule+0x36/0x80
> May 17 07:47:53 lxd02 kernel: [43865.593645]  wait_current_trans+0xc2/0x100
> [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593649]  ?
> wake_atomic_t_function+0x60/0x60
> May 17 07:47:53 lxd02 kernel: [43865.593675]  start_transaction+0x24e/0x460
> [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593698]
> btrfs_attach_transaction+0x1d/0x20 [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593720]  transaction_kthread+0x8c/0x1c0
> [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593724]  kthread+0x109/0x140
> May 17 07:47:53 lxd02 kernel: [43865.593744]  ?
> btrfs_cleanup_transaction+0x530/0x530 [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593748]  ?
> kthread_create_on_node+0x70/0x70
> May 17 07:47:53 lxd02 kernel: [43865.593751]  ret_from_fork+0x2c/0x40
> May 17 07:47:53 lxd02 kernel: [43865.593778] INFO: task mongod:6703 blocked
> for more than 120 seconds.
> May 17 07:47:53 lxd02 kernel: [43865.593793]   Not tainted
> 4.11.1-041101-generic #201705140931
> May 17 07:47:53 lxd02 kernel: [43865.593809] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 17 07:47:53 lxd02 kernel: [43865.593829] mongod  D0  6703
> 3531 0x0100
> May 17 07:47:53 lxd02 kernel: [43865.593832] Call Trace:
> May 17 07:47:53 lxd02 kernel: [43865.593836]  __schedule+0x3c6/0x8c0
> May 17 07:47:53 lxd02 kernel: [43865.593839]  schedule+0x36/0x80
> May 17 07:47:53 lxd02 kernel: [43865.593860]  wait_current_trans+0xc2/0x100
> [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593863]  ?
> wake_atomic_t_function+0x60/0x60
> May 17 07:47:53 lxd02 kernel: [43865.593883]  start_transaction+0x2d4/0x460
> [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593903]
> btrfs_start_transaction+0x1e/0x20 [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593926]  btrfs_create+0x5d/0x210
> [btrfs]
> May 17 07:47:53 lxd02 kernel: [43865.593930]  path_openat+0x1433/0x1500
> May 17 07:47:53 lxd02 kernel: [43865.593935]  do_filp_open+0x99/0x110
> May 17 07:47:53 lxd02 kernel: [43865.593938]  ?
> __check_object_size+0x100/0x19d
> May 17 07:47:53 lxd02 kernel: [43865.593942]  do_sys_open+0x130/0x220
> May 17 07:47:53 lxd02 kernel: [43865.593944]  ? do_sys_open+0x130/0x220
> May 17 07:47:53 lxd02 kernel: [43865.593946]  SyS_open+0x1e/0x20
> May 17 07:47:53 lxd02 kernel: [43865.593950]  do_syscall_64+0x5b/0xc0
> May 17 07:47:53 lxd02 kernel: [43865.593952]
> entry_SYSCALL64_slow_path+0x25/0x25
> May 17 07:47:53 lxd02 kernel: [43865.593954] RIP: 0033:0x7f5e3b44947d
> May 17 07:47:53 lxd02 kernel: [43865.593956] RSP: 002b:7f5e2ee4f310
> EFLAGS: 0293 ORIG_RAX: 0002
> May 17 07:47:53 lxd02 kernel: [43865.593959] RAX: ffda RBX:
> 31815840 RCX: 7f5e3b44947d
> May 17 07:47:53 lxd02 kernel: [43865.593960] RDX: 01b6 RSI:
> 0241 RDI: 06d83d80
> May 17 07:47:53 lxd02 kernel: [43865.593962] RBP: 06d83d80 R08:
> 0004 R09: 0001
> May 17 07:47:53 lxd02 kernel: [43865.593963] R10: 0240 R11:
> 0293 R12: 01da7d7d
> May 17 07:47:53 lxd02 kernel: [43865.593964] R13: 0001 R14:
>

Re: [PATCH] Btrfs: work around maybe-uninitialized warning

2017-05-19 Thread Arnd Bergmann

On Fri, May 19, 2017 at 8:10 PM, Liu Bo  wrote:
> On Thu, May 18, 2017 at 03:33:29PM +0200, Arnd Bergmann wrote:
>> A rewrite of btrfs_submit_direct_hook appears to have introduced a warning:
>>
>> fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook':
>> fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this 
>> function [-Werror=maybe-uninitialized]
>>
>> Where the 'bio' variable was previously initialized unconditionally, it
>> is now set in the "while (submit_len > 0)" loop that would never execute
>> if submit_len is zero.
>>
>> Assuming this cannot happen in practice, we can avoid the warning
>> by simply replacing the while{} loop with a do{}while() loop so
>> the compiler knows that it will always be entered at least once.
>>
>
> Thanks for the fix.  I think it's a false positve one and I've updated it in 
> v2
> with a 'struct bio *bio = NULL' to make compiler happy, could you please help
> reveiw it?

Right, it is a false positive and adding the =NULL initialization shuts up the
warning. The reason my patch used a different approach is to make the
code more robust, see https://rusty.ozlabs.org/?p=232

Generally speaking initializing a local variable to an illegal value, and later
using the variable without a check for that original value is error-prone.
Even though the code is correct at the moment, someone else might
modify it later. My first (broken) solution avoided this by checking for
the condition that led to the warning, my newer solution is nicer as it
makes it much clearer to the reader what is going on, compared to
the NULL initialization that does not help readability but makes
it slightly harder to understand why you wrote the code specifically that
way.

   Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

2017-05-19 Thread David Sterba

On Tue, May 09, 2017 at 05:02:15PM -0600, Liu Bo wrote:
> Before this, we use 'filled' mode here, ie. if all range has been filled
> with EXTENT_DEFRAG bits, get to clear it, but if the defrag range joins
> the adjacent delalloc range, then we'll leave EXTENT_DEFRAG bits until
> evicting inode.
>
> This clears the bit if any was found within the ordered extent.

What effects, good or bad, can this have?

Is it worth backporting to stable trees?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Liu Bo

Hi Marc,

On Thu, May 18, 2017 at 09:16:38PM -0700, Marc MERLIN wrote:
> Looks like all the unhelpful BUG() aren't gone yet :-/
> This one is really not helpful, I don't even know which one of my filesystems 
> caused the crash :(
> 
> Why is this not remounting the filesystem read only?
> Really, from a user and admin perspective, this is really not helpful.
> 
> Could someone who know more than me do a pass and eradicate those? 
> Btrfs cannot be a production filesystem as long as those are still around IMO.

Looks like there's a security hole hidden in code, I don't think it's
a bug in code, it's more like caused by a corrupted metadata reading
from disk rather than a memory corruption.

A quick glance at the stack shows in 
extent-tree.c:lookup_inline_extent_backref()

type = btrfs_extent_inline_ref_type(leaf, iref);
then...
ptr += btrfs_extent_inline_ref_size(type);

I agree that a corrupted image should not corrupt the kernel, so we
can fix it by forcing it to readonly.

-liubo

> 
> Thanks,
> Marc
> 
> [ cut here ]
> kernel BUG at fs/btrfs/ctree.h:1779!
> invalid opcode:  [#1] PREEMPT SMP
> Modules linked in: veth ip6table_filter ip6_tables ebtable_nat ebtables ppdev 
> lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog binfmt_misc 
> ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipt_REJECT 
> nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 
> nf_log_common xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 
> dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 
> nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat 
> nf_conntrack x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm 
> irqbypass snd_hda_codec_realtek snd_hda_codec_generic snd_cmipci 
> snd_hda_intel snd_mpu401_uart snd_hda_codec snd_opl3_lib snd_hda_core 
> snd_rawmidi eeepc_wmi snd_hwdep snd_seq_device asus_wmi snd_pcm sparse_keymap
>  rfkill snd_timer hwmon snd i915 lpc_ich tpm_infineon rc_ati_x10 asix mei_me 
> usbnet ati_remote pcspkr libphy tpm_tis rc_core usbserial tpm_tis_core wmi 
> tpm parport_pc parport input_leds battery i2c_i801 soundcore evdev e1000e ptp 
> pps_core fuse raid456 multipath mmc_block mmc_core lrw ablk_helper dm_crypt 
> dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx 
> crc32c_intel blowfish_x86_64 blowfish_common pcbc aesni_intel aes_x86_64 
> crypto_simd glue_helper cryptd xhci_pci ehci_pci xhci_hcd ehci_hcd mvsas 
> r8169 sata_sil24 mii libsas usbcore scsi_transport_sas thermal fan [last 
> unloaded: ftdi_sio]
> CPU: 2 PID: 22204 Comm: kworker/u16:20 Tainted: G U  
> 4.11.0-amd64-preempt-sysrq-20170406 #2
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 
> 04/27/2013
> Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
> task: 9417d6de2240 task.stack: a1314e7e
> RIP: 0010:btrfs_extent_inline_ref_size+0x29/0x39
> RSP: 0018:a1314e7e3b10 EFLAGS: 00010297
> RAX: 001d RBX: 941849fd3700 RCX: 941aaa669000
> RDX: 2000 RSI: 245a RDI: 
> RBP: a1314e7e3b10 R08: 4000 R09: a1314e7e3ad8
> R10:  R11: 2000 R12: 245a
> R13:  R14: 94183c20b5b8 R15: 
> FS:  () GS:941d9e28() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: f7557d76 CR3: 0003f8c09000 CR4: 001406e0
> Call Trace:
>  lookup_inline_extent_backref+0x302/0x436
>  ? ___cache_free+0x200/0x25c
>  __btrfs_free_extent+0xf1/0xb18
>  __btrfs_run_delayed_refs+0xb2f/0xd15
>  ? __wake_up_common+0x4d/0x81
>  btrfs_run_delayed_refs+0x7a/0x1cc
>  delayed_ref_async_start+0x5e/0x9b
>  btrfs_scrubparity_helper+0x111/0x271
>  ? pwq_activate_delayed_work+0x4d/0x5b
>  btrfs_extent_refs_helper+0xe/0x10
>  process_one_work+0x193/0x2b0
>  ? rescuer_thread+0x2b1/0x2b1
>  worker_thread+0x1e9/0x2c1
>  ? rescuer_thread+0x2b1/0x2b1
>  kthread+0xfb/0x100
>  ? init_completion+0x24/0x24
>  ? do_fast_syscall_32+0xb7/0xfe
>  ret_from_fork+0x2c/0x40
> Code: 5d c3 55 81 ff b0 00 00 00 48 89 e5 74 1f 81 ff b6 00 00 00 74 17 81 ff 
> b8 00 00 00 74 16 81 ff b2 00 00 00 b8 1d 00 00 00 74 0e <0f> 0b b8 09 00 00 
> 00 eb 05 b8 0d 00 00 00 5d c3 55 48 89 f0 48
> RIP: btrfs_extent_inline_ref_size+0x29/0x39 RSP: a1314e7e3b10
> ---[ end trace 8bd2bf161055b042 ]---
> 
> static inline u32 btrfs_extent_inline_ref_size(int type)
> {
>   if (type == BTRFS_TREE_BLOCK_REF_KEY ||
>   type == BTRFS_SHARED_BLOCK_REF_KEY)
>   return sizeof(struct btrfs_extent_inline_ref);
>   if (type == BTRFS_SHARED_DATA_REF_KEY)
>   return sizeof(struct btrfs_shared_data_ref) +
>  sizeof(struct btrfs_extent_inline_ref);
>   if (type == BTRFS_EXTENT_DATA_REF_KEY)
>   return sizeof(struct

[PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes

2017-05-19 Thread Liu Bo

We commit transaction in order to reclaim space from pinned bytes because
it could process delayed refs, and in may_commit_transaction(), we check
first if pinned bytes are enough for the required space, we then check if
that plus bytes reserved for delayed insert are enough for the required
space.

This changes the code to the above logic.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e390451c72e6..bded1ddd1bb6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
spin_lock(_rsv->lock);
if (percpu_counter_compare(_info->total_bytes_pinned,
-  bytes - delayed_rsv->size) >= 0) {
+  bytes - delayed_rsv->size) < 0) {
spin_unlock(_rsv->lock);
return -ENOSPC;
}
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs metadata reclaim behavior/performance characteristics

2017-05-19 Thread Liu Bo

On Fri, May 19, 2017 at 12:54:59PM +0300, Nikolay Borisov wrote:
> > From: Liu Bo 
> > 
> > Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough 
> > pinned bytes
> > 
> > We commit transaction in order to reclaim space from pinned bytes because 
> > it could process delayed refs, and in may_commit_transaction(), we check 
> > first if pinned bytes are enough for the required space, we then check if 
> > that plus bytes reserved for delayed insert are enough for the required 
> > space.
> > 
> > This changes the code to the above logic.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/extent-tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index e390451c72e6..bded1ddd1bb6 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct 
> > btrfs_fs_info *fs_info,
> >  
> > spin_lock(_rsv->lock);
> > if (percpu_counter_compare(_info->total_bytes_pinned,
> > -  bytes - delayed_rsv->size) >= 0) {
> > +bytes - 
> > delayed_rsv->size) < 0) {
> >
> > spin_unlock(_rsv->lock);
> > return -ENOSPC;
> > }
> > 
> 
> Your patch does make a very big difference. Here are a couple of runs of
> slow-rm:
> 
> 
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 837 files before returning error, time taken 3
> Created 920 files before returning error, time taken 3
> Created 949 files before returning error, time taken 3
> Created 930 files before returning error, time taken 3
> Created 1101 files before returning error, time taken 4
> Created 1082 files before returning error, time taken 4
> Created 1608 files before returning error, time taken 5
> Created 1735 files before returning error, time taken 5
> rming took 1 seconds
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 801 files before returning error, time taken 3
> Created 829 files before returning error, time taken 3
> Created 983 files before returning error, time taken 3
> Created 978 files before returning error, time taken 3
> Created 1023 files before returning error, time taken 3
> Created 1126 files before returning error, time taken 3
> Created 1538 files before returning error, time taken 4
> Created 1737 files before returning error, time taken 5
> rming took 2 seconds
> 
> root@ubuntu-virtual:~# ./slow-rm.sh
> Created 875 files before returning error, time taken 3
> Created 891 files before returning error, time taken 3
> Created 969 files before returning error, time taken 4
> Created 1002 files before returning error, time taken 4
> Created 1039 files before returning error, time taken 4
> Created 1051 files before returning error, time taken 4
> Created 1191 files before returning error, time taken 4
> Created 2137 files before returning error, time taken 8
> rming took 2 seconds
> 
> So rming is a lot faster, but we create less files all in all and get
> ENOSPC earlier. This means that most of the time bytes_pinned is not
> enough to satisfy the allocation hence we are hitting the second
> percpu_counter comparison.
>

Right, it's sort of my expected bahavior because all 1K buffered IO ends up
being inline extent, it's likely to run out of metadata space very soon.

> Also, the reason why the previous links showed 0 for bytes_pinned was
> due to me having completely forgotten that bytes_pinned is a percpu
> counter, hence my stap script wasn't actually reading it correctly.

I see, bytes_pinned in space_info is different from the percpu one, they're
updated at different time, but overall the percpu one is the the preciser
counter.

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] btrfs: fix race with relocation recovery and fs_root setup

2017-05-19 Thread David Sterba

On Thu, May 18, 2017 at 04:58:19PM -0700, Liu Bo wrote:
> On Wed, May 17, 2017 at 11:38:34AM -0400, je...@suse.com wrote:
> > From: Jeff Mahoney 
> > 
> > If we have to recover relocation during mount, we'll ultimately have to
> > evict the orphan inode.  That goes through the reservation dance, where
> > priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
> > to be valid.  That's the next thing to be set up during mount, so we
> > crash, almost always in flush_space trying to join the transaction
> > but priority_reclaim_metadata_space is possible as well.  This call
> > path has been problematic in the past WRT whether ->fs_root is valid
> > yet.  Commit 957780eb278 (Btrfs: introduce ticketed enospc
> > infrastructure) added new users that are called in the direct path
> > instead of the async path that had already been worked around.
> > 
> > The thing is that we don't actually need the fs_root, specifically, for
> > anything.  We either use it to determine whether the root is the
> > chunk_root for use in choosing an allocation profile or as a root to pass
> > btrfs_join_transaction before immediately committing it.  Anything that
> > isn't the chunk root works in the former case and any root works in
> > the latter.
> > 
> > A simple fix is to use a root we know will always be there: the
> > extent_root.
> > 
> Reviewed-by: Liu Bo 

2-3 added to 4.13 queue, 1 will go to 4.12. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: add statx support

2017-05-19 Thread David Sterba

On Fri, May 12, 2017 at 03:07:43PM -0700, Omar Sandoval wrote:
> From: Yonghong Song 
> 
> Return enhanced file attributes from the btrfs, including:
>   (1). inode creation time as stx_btime, and
>   (2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags.
> 
> Example output:
>   [root@localhost ~]# cat t.sh
>   touch t
>   chattr +aic t
>   ~/linux/samples/statx/test-statx t
>   chattr -aic t
>   touch t
>   echo ""
>   ~/linux/samples/statx/test-statx t
>   /bin/rm t
>   [root@localhost ~]# ./t.sh
>   statx(t) = 0
>   results=fff
> Size: 0   Blocks: 0  IO Block: 4096regular 
> file
>   Device: 00:1c   Inode: 63962   Links: 1
>   Access: (0644/-rw-r--r--)  Uid: 0   Gid: 0
>   Access: 2017-05-11 16:03:13.999856591-0700
>   Modify: 2017-05-11 16:03:13.999856591-0700
>   Change: 2017-05-11 16:03:14.000856663-0700
>Birth: 2017-05-11 16:03:13.999856591-0700
>   Attributes: 0034 (    
>    .-ai.c..)
>   
>   statx(t) = 0
>   results=fff
> Size: 0   Blocks: 0  IO Block: 4096regular 
> file
>   Device: 00:1c   Inode: 63962   Links: 1
>   Access: (0644/-rw-r--r--)  Uid: 0   Gid: 0
>   Access: 2017-05-11 16:03:14.006857097-0700
>   Modify: 2017-05-11 16:03:14.006857097-0700
>   Change: 2017-05-11 16:03:14.006857097-0700
>   Birth: 2017-05-11 16:03:13.999856591-0700
>   Attributes:  (    
>    .---.-..)
>   [root@localhost ~]#
> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Yonghong Song 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: work around maybe-uninitialized warning

2017-05-19 Thread Liu Bo

On Thu, May 18, 2017 at 03:33:29PM +0200, Arnd Bergmann wrote:
> A rewrite of btrfs_submit_direct_hook appears to have introduced a warning:
> 
> fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook':
> fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
> 
> Where the 'bio' variable was previously initialized unconditionally, it
> is now set in the "while (submit_len > 0)" loop that would never execute
> if submit_len is zero.
> 
> Assuming this cannot happen in practice, we can avoid the warning
> by simply replacing the while{} loop with a do{}while() loop so
> the compiler knows that it will always be entered at least once.
>

Thanks for the fix.  I think it's a false positve one and I've updated it in v2
with a 'struct bio *bio = NULL' to make compiler happy, could you please help
reveiw it?

Thanks,
-liubo

> Fixes: 0fd27e06c61b ("Btrfs: use bio_clone_bioset_partial to simplify DIO 
> submit")
> Signed-off-by: Arnd Bergmann 
> ---
> ---
>  fs/btrfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8c37b4fa4cbb..c62cf9593cb3 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8497,7 +8497,7 @@ static int btrfs_submit_direct_hook(struct 
> btrfs_dio_private *dip,
>   /* bio split */
>   ASSERT(map_length <= INT_MAX);
>   atomic_inc(>pending_bios);
> - while (submit_len > 0) {
> + do {
>   clone_len = min_t(int, submit_len, map_length);
>  
>   /*
> @@ -8540,7 +8540,7 @@ static int btrfs_submit_direct_hook(struct 
> btrfs_dio_private *dip,
> start_sector << 9, _length, NULL, 0);
>   if (ret)
>   goto out_err;
> - }
> + } while (submit_len > 0);
>  
>  submit:
>   ret = __btrfs_submit_dio_bio(bio, inode, file_offset, skip_sum,
> -- 
> 2.9.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] btrfs: Clean up btrfs_leaf_data

2017-05-19 Thread David Sterba

On Fri, May 19, 2017 at 10:49:10AM +0300, Nikolay Borisov wrote:
> Commit 5f39d397dfbe ("Btrfs: Create extent_buffer interface for large 
> blocksizes")
> refactored btrfs_leaf_data function to take extent_buffer rather than
> struct btrfs_leaf. However, as it turns out the parameter being passed is 
> never
> used. Help the brave souls which are going to read this code in the future
> by eliminating the parameter.

The git changelog lines should be at most 74 chars.

> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> Hello, 
> 
> Here is a minor cleanup, which hopefully reduces cognitive load while reading
> the extent_buffer code. After this commit btrfs_leaf_data basically becomes
> a synonym to offsetof and I can't help it but begin to wonder wouldn't it be 
> better if I nuke btrfs_leaf_data altogether or rename it to
> btrfs_leaf_data_offset?

Yes please, make it BTRFS_LEAF_DATA_OFFSET so it's consistent with other
BTRFS_LEAF_* helpers.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 1/2] btrfs: Separate space_info create/update

2017-05-19 Thread David Sterba

Only minor nits

On Fri, May 19, 2017 at 10:21:06AM +0300, Nikolay Borisov wrote:
> Currently the struct space_info creation code is intermixed in the
> udpate_space_info function. There are well-defined points at which the we
  ^

> actually want to create brand-new space_info structs (e.g. during mount of
> the filesystem as well as sometimes when adding/initialising new chunks). In
> such cases udpate_space_info is called with 0 as the bytes parameter. All of
 ^

> this makes for spaghetti code.
> 
> Fix it by factoring out the creation code in a separate create_space_info
> structure. This also allows to simplify the internals. Also remove BUG_ON from
> do_alloc_chunk since the callers handle errors. Furthermore it will
> make the update_space_info function not fail, allowing us to remove error
> handling in callers. This will come in a follow up patch.
> 
> Reviewed-by: Jeff Mahoney 
> Reviewed-by: Liu Bo 

Missing signed-off-by

> ---
>  fs/btrfs/extent-tree.c | 129 
> -
>  1 file changed, 64 insertions(+), 65 deletions(-)
> 
> 
>  Worked on feedback from Liu Bo and discovered I didn't have vim's linuxtsy 
>  plugin installed, hence the b0rked formatting in some of my patches. 
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index be5477676cc8..55b6836f5a20 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3914,15 +3914,58 @@ static const char *alloc_name(u64 flags)
>   };
>  }
>  
> +static int create_space_info(struct btrfs_fs_info *info, u64 flags,
> +  struct btrfs_space_info **new) {

function openning { on a new line

> +
> + struct btrfs_space_info *space_info;
> + int i;
> + int ret;
> +
> + space_info = kzalloc(sizeof(*space_info), GFP_NOFS);
> + if (!space_info)
> + return -ENOMEM;
> +
> + ret = percpu_counter_init(_info->total_bytes_pinned, 0,
> +  GFP_KERNEL);
> + if (ret) {
> + kfree(space_info);
> + return ret;
> + }
> +
> + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
> + INIT_LIST_HEAD(_info->block_groups[i]);
> + init_rwsem(_info->groups_sem);
> + spin_lock_init(_info->lock);
> + space_info->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
> + space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
> + init_waitqueue_head(_info->wait);
> + INIT_LIST_HEAD(_info->ro_bgs);
> + INIT_LIST_HEAD(_info->tickets);
> + INIT_LIST_HEAD(_info->priority_tickets);
> +
> + ret = kobject_init_and_add(_info->kobj, _info_ktype,
> + info->space_info_kobj, "%s",
> + alloc_name(space_info->flags));
> + if (ret) {
> + kfree(space_info);
> + return ret;
> + }
> +
> + *new = space_info;
> + list_add_rcu(_info->list, >space_info);
> + if (flags & BTRFS_BLOCK_GROUP_DATA)
> + info->data_sinfo = space_info;
> +
> + return ret;
> +}
> +
>  static int update_space_info(struct btrfs_fs_info *info, u64 flags,
>u64 total_bytes, u64 bytes_used,
>u64 bytes_readonly,
>struct btrfs_space_info **space_info)
>  {
>   struct btrfs_space_info *found;
> - int i;
>   int factor;
> - int ret;
>  
>   if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
>BTRFS_BLOCK_GROUP_RAID10))
> @@ -3946,53 +3989,6 @@ static int update_space_info(struct btrfs_fs_info 
> *info, u64 flags,
>   *space_info = found;
>   return 0;
>   }
> - found = kzalloc(sizeof(*found), GFP_NOFS);
> - if (!found)
> - return -ENOMEM;
> -
> - ret = percpu_counter_init(>total_bytes_pinned, 0, GFP_KERNEL);
> - if (ret) {
> - kfree(found);
> - return ret;
> - }
> -
> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
> - INIT_LIST_HEAD(>block_groups[i]);
> - init_rwsem(>groups_sem);
> - spin_lock_init(>lock);
> - found->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
> - found->total_bytes = total_bytes;
> - found->disk_total = total_bytes * factor;
> - found->bytes_used = bytes_used;
> - found->disk_used = bytes_used * factor;
> - found->bytes_pinned = 0;
> - found->bytes_reserved = 0;
> - found->bytes_readonly = bytes_readonly;
> - found->bytes_may_use = 0;
> - found->full = 0;
> - found->max_extent_size = 0;
> - found->force_alloc = CHUNK_ALLOC_NO_FORCE;
> - found->chunk_alloc = 0;
> - found->flush = 0;
> - init_waitqueue_head(>wait);
> - INIT_LIST_HEAD(>ro_bgs);
> - INIT_LIST_HEAD(>tickets);
> - INIT_LIST_HEAD(>priority_tickets);
> -
> - ret = kobject_init_and_add(>kobj, _info_ktype,
> -

Re: [PATCH 0/8 v1] utilize bio_clone_fast to clean up

2017-05-19 Thread David Sterba

On Wed, May 17, 2017 at 04:22:44PM -0600, Liu Bo wrote:
> v1: - Drop the RFC tag.
> - Update to use bio_segments accordingly as __bio_segments is removed.
> - Remove if (!bio) since bio_clone_fast with bioset and GFP_NOFS will
>   never fail.
> 
> This attempts to use bio_clone_fast() in the places where we clone bio,
> such as when bio got cloned for multiple disks and when bio got split
> during dio submit.
> 
> One benefit is to simplify dio submit to avoid calling bio_add_page one by
> one.
> 
> Another benefit is that comparing to bio_clone_bioset, bio_clone_fast is
> faster because of copying the vector pointer directly, and bio_clone_fast
> doesn't modify bi_vcnt, so the extra work is to fix up bi_vcnt usage we
> currently have to use bi_iter to iterate bvec.
> 
> Here are some numbers collected with the script [1], note that most of
> performance tests usually issue bs=4k dio write/read so our directIO split 
> code
> is not tested as it requires bs > stripe_len(64K), thus I made this simple
> script which writes 2G with bs=128K.
> 
> - vanilla:
> real0m10.265s
> user0m0.005s
> sys 0m9.164s
> 
> - patched:
> real0m8.973s
> user0m0.006s
> sys 0m7.804s
> 
> [1]:
> #!/bin/bash
> 
> M=/mnt/btrfs
> D1=/dev/pmem0p1
> D2=/dev/pmem0p2
> 
> umount $M
> mkfs.btrfs -f $D1 $D2 >/dev/null || exit
> 
> mount $D1 $M -onodatasum || exit
> 
> xfs_io -f -c "falloc 0 2G" $M/foo
> 
> time xfs_io -d -c "pwrite -b 128K 0 2G" $M/foo
> 
> Liu Bo (8):
>   Btrfs: use bio_clone_fast to clone our bio
>   Btrfs: new helper btrfs_bio_clone_partial
>   Btrfs: use bio_clone_bioset_partial to simplify DIO submit
>   Btrfs: change how we iterate bios in endio
>   Btrfs: record error if one block has failed to retry
>   Btrfs: make check-integrity use bvec_iter
>   Btrfs: unify naming of btrfs_io_bio
>   Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial

Added to 4.13 queue, thankgs.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Recurring free space warnings for same blocks

2017-05-19 Thread Ochi


Hello,

looking at the journals of three different machines running btrfs on 
crypt-devices on SSDs for root (dm-0 in the logs below) and other 
volumes, I'm seeing space cache warnings recurring for the same blocks 
over and over again. The machines are usually (re)booted once a day. See 
below for excerpts from the journals of the three machines. It is 
striking that these warnings seem to start somewhere around April/May. I 
thought maybe it is worth pointing out before I try something else like 
resetting the space cache manually. I've been using the 4.10.* series 
kernel on all machines for quite a while now (it's a routinely updated 
Arch Linux). Currently installed btrfs-progs are 4.10.2.


Mai 19 08:57:45 machine0 kernel: BTRFS warning (device dm-0): block 
group 6471811072 has wrong amount of free space
Mai 19 08:57:45 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 6471811072, rebuilding it now
Mai 18 10:00:26 machine0 kernel: BTRFS warning (device dm-0): block 
group 6471811072 has wrong amount of free space
Mai 18 10:00:26 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 6471811072, rebuilding it now
Mai 18 10:00:26 machine0 kernel: BTRFS warning (device dm-0): block 
group 105256058880 has wrong amount of free space
Mai 18 10:00:26 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 105256058880, rebuilding it now
Mai 17 14:57:00 machine0 kernel: BTRFS warning (device dm-0): block 
group 6471811072 has wrong amount of free space
Mai 17 14:57:00 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 6471811072, rebuilding it now
Mai 17 09:08:09 machine0 kernel: BTRFS warning (device dm-0): block 
group 6471811072 has wrong amount of free space
Mai 17 09:08:09 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 6471811072, rebuilding it now
Mai 16 11:41:42 machine0 kernel: BTRFS warning (device dm-0): block 
group 6471811072 has wrong amount of free space
Mai 16 11:41:42 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 6471811072, rebuilding it now
Mai 16 11:41:43 machine0 kernel: BTRFS warning (device dm-0): block 
group 105256058880 has wrong amount of free space
Mai 16 11:41:43 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 105256058880, rebuilding it now
Mai 03 16:46:48 machine0 kernel: BTRFS warning (device dm-0): block 
group 53179580416 has wrong amount of free space
Mai 03 16:46:48 machine0 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 53179580416, rebuilding it now


Apr 26 20:06:27 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Apr 26 20:06:27 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Apr 27 18:46:19 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Apr 27 18:46:19 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Apr 29 00:00:10 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Apr 29 00:00:10 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Apr 30 00:01:07 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Apr 30 00:01:07 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Mai 01 18:05:25 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Mai 01 18:05:25 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Mai 02 23:17:05 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Mai 02 23:17:05 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Mai 03 22:47:08 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Mai 03 22:47:08 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Mai 04 22:05:59 machine1 kernel: BTRFS warning (device dm-0): block 
group 31159484416 has wrong amount of free space
Mai 04 22:05:59 machine1 kernel: BTRFS warning (device dm-0): failed to 
load free space cache for block group 31159484416, rebuilding it now
Mai 17 17:53:39 machine1 kernel: BTRFS warning (device dm-0): block 
group 8610906112 has wrong amount of free space
Mai 17 17:53:39

Re: [PATCH RFC] vfs: add mount umount logs

2017-05-19 Thread Colin Walters

On Thu, May 18, 2017, at 06:08 AM, Anand Jain wrote:
> By looking at the logs we should be able to know when was the FS
> mounted and unmounted and the options used, so to help forensic
> investigations.

Worth noting here that systemd already tracks mounts (via monitoring 
/proc/self/mountinfo) and logs
them to the journal, and for mounts it initiates, logs both start and 
completion.
It doesn't log the options right now, but that wouldn't be hard to add 
(particularly
since systemd has structured logging).

On the flip side of course that's only for mount namespaces where systemd is
used, and given user namespaces, a lot of use cases don't involve a 
systemd-per-container.

But that said, I find the log spam today from e.g. docker + devicemapper + xfs
annoying, and switching to overlay2 fixed that as a side effect which is nice.
Having overlay2 log would reintroduce that problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC] vfs: add mount umount logs

2017-05-19 Thread Theodore Ts'o

On Fri, May 19, 2017 at 08:17:55AM +0800, Anand Jain wrote:
> > XFS already logs its own unmounts.
> 
>  Nice. as far as I know its only in XFS.

Ext4 logs mounts, but not unmounts.

> >  I prefer to let each filesystem log
> > its own unmount, because then the mount/unmount messages also have the
> > same prefix as all other messages coming from that filesystem driver.
> 
>  Ok. One nitpick though there are other mount types (remount, bind..),
>  and subsequent mounts of the same FS and FS driver can't log them
>  effectively.

The other issue is there is a difference between when an unmount is
started, or when a file system is unmounted in one namespace, and when
the file system as a whole is unmounted by the low-level because there
are no longer any references to the file system in any namespace.

More than once I've gotten a bug report for ext4 that was really
caused by the fact that some program had forked a daemon process in
its own namespace, so when the USB thumbdrive was unmounted, it wasn't
*really* unmounted.  So being able to clearly express that in the logs
is also a good thing.

I think the way to do that is to have one set of logs for when the
file system is unmounted from one mount namespace (or from a bind
mount), with perhaps an indication of the number of remaining
refcounts, and to make this be distinct from the unmount of the
underlying low file system code, which should be logged by the file
system code itself.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] Btrfs: lzo compression must free at least PAGE_SIZE

2017-05-19 Thread Lionel Bouton

Hi,

Le 19/05/2017 à 15:38, Timofey Titovets a écrit :
> If data compression didn't free at least one PAGE_SIZE, it useless to store 
> that compressed extent
>
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/lzo.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index bd0b0938..637ef1b0 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -207,7 +207,7 @@ static int lzo_compress_pages(struct list_head *ws,
>   }
>  
>   /* we're making it bigger, give up */
> - if (tot_in > 8192 && tot_in < tot_out) {
> + if (tot_in > 8192 && tot_in < tot_out + PAGE_SIZE) {
>   ret = -E2BIG;
>   goto out;
>   }
I'm not familiar with this code but I was surprised by the test : you
would expect compression having a benefit when you are freeing an actual
page not reducing data by a page size. So unless I don't understand the
context shouldn't it be something like :

if (tot_in > 8192 && ((tot_in % PAGE_SIZE) <= (tot_out % PAGE_SIZE))

but looking at the code I see that this is in a while loop and there's
another test just after the loop in the existing code :

if (tot_out > tot_in)
goto out;

There's a couple of things I don't understand but isn't this designed to
stream data in small chunks through compression before writing it in the
end ? So isn't this later test the proper location to detect if
compression was beneficial ?

You might not save a page early on in the while loop working on a subset
of the data to compress but after enough data being processed you could
save a page. It seems odd that your modification could abort compression
early on although the same condition would become true after enough loops.

Isn't what you want something like :

if (tot_out % PAGE_SIZE >= tot_in % PAGE_SIZE)
goto out;

after the loop ?
The >= instead of > would avoid decompression in the case where the
compressed data is smaller but uses the same space on disk.

Best regards,

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] Btrfs: lzo.c pr_debug() deflate->lzo

2017-05-19 Thread Timofey Titovets

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/lzo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index f48c8c14..bd0b0938 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -141,7 +141,7 @@ static int lzo_compress_pages(struct list_head *ws,
ret = lzo1x_1_compress(data_in, in_len, workspace->cbuf,
   _len, workspace->mem);
if (ret != LZO_E_OK) {
-   pr_debug("BTRFS: deflate in loop returned %d\n",
+   pr_debug("BTRFS: lzo in loop returned %d\n",
   ret);
ret = -EIO;
goto out;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/3] Btrfs: compression fixes

2017-05-19 Thread Timofey Titovets

First patch fix copy paste typo in debug message in lzo.c, lzo is not deflate

Second and third patches force btrfs not compress data if compression will not 
free at least one PAGE_SIZE
Because it's useless in term of storage and read data from disk, as a result 
productivity suffers

Timofey Titovets (3):
  Btrfs: lzo.c pr_debug() deflate->lzo
  Btrfs: lzo compression must free at least PAGE_SIZE
  Btrfs: zlib compression must free at least PAGE_SIZE

 fs/btrfs/lzo.c  | 4 ++--
 fs/btrfs/zlib.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

--
2.13.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] Btrfs: lzo compression must free at least PAGE_SIZE

2017-05-19 Thread Timofey Titovets

If data compression didn't free at least one PAGE_SIZE, it useless to store 
that compressed extent

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/lzo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index bd0b0938..637ef1b0 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -207,7 +207,7 @@ static int lzo_compress_pages(struct list_head *ws,
}
 
/* we're making it bigger, give up */
-   if (tot_in > 8192 && tot_in < tot_out) {
+   if (tot_in > 8192 && tot_in < tot_out + PAGE_SIZE) {
ret = -E2BIG;
goto out;
}
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] Btrfs: zlib compression must free at least PAGE_SIZE

2017-05-19 Thread Timofey Titovets

If data compression didn't free at least one PAGE_SIZE, it useless to store 
that compressed extent

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/zlib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index 135b1082..7c3c27e6 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -134,7 +134,7 @@ static int zlib_compress_pages(struct list_head *ws,
/* we're making it bigger, give up */
if (workspace->strm.total_in > 8192 &&
workspace->strm.total_in <
-   workspace->strm.total_out) {
+   workspace->strm.total_out + PAGE_SIZE) {
ret = -E2BIG;
goto out;
}
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 0/2] btrfs: allow mechanism to override quota

2017-05-19 Thread David Sterba

On Fri, May 19, 2017 at 01:39:49AM -0700, Sargun Dhillon wrote:
> On Fri, May 12, 2017 at 7:48 AM, David Sterba  wrote:
> > On Thu, May 11, 2017 at 09:17:01PM +, Sargun Dhillon wrote:
> >> This patchset makes it so that on a per-filesystem basis one can disable
> >> quota enforcement for users with cap_sys_resource. This patchset can
> >> likely later be extended to per-qgroup, or a per-volume basis. I'm
> >> thinking of extending the sysfs interface to list the qgroups and
> >> this same interface for the qgroups themselves.
> >>
> >> Changes since v1:
> >>   -Rather than a separate member of btrfs_fs_info, use the existing
> >>flags field
> >
> > Looks good to me, thanks.
> I'm curious as to whether this approach is fine to get an Acked-by,

This was meant as an acked-by, the patch is now queued for 4.13, but as
it add some user-visible interface, this may need more time to see if we
haven't missed something.

> or
> if I need to figure out how to make it more leak-tolerant. I don't
> think modifying the overridden extents inflight is a problem. I'm not
> sure of a way a user would be able to create *new* chunks of data.
> Alternatively, I'd be quite happy making this applicable to metadata
> only, for file xattrs, creation, deletion, etc..

>From the interface POV, this can be set as a bitmask.  I haven't looked
if we'd be able to propagate all the changes everywhere in the code,
but sounds doable, bug I'm not sure if this level of fine-grained control
is desired.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 06/10] btrfs: scrub: use bool for flush_all_writes

2017-05-19 Thread David Sterba

On Thu, May 18, 2017 at 05:42:24PM -0700, Liu Bo wrote:
> On Tue, May 16, 2017 at 07:10:38PM +0200, David Sterba wrote:
> > flush_all_writes is an atomic but does not use the semantics at all,
> > it's just on/off indicator, we can use bool.
> > 
> 
> It might use atomic to avoid reordering, but I'm not sure if
> reordering could really happen here.

Ok, I'll have a look. It does not seem to rely on atomics/ordering though.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

enospc_debug seems to need more info, or btrfs file usage bug?

2017-05-19 Thread E V

grep btrfs /proc/mounts
/dev/sdb /mirror btrfs
rw,noatime,compress=zlib,space_cache=v2,enospc_debug,subvolid=5,subvol=/
0 0

Filesystem on 4.9.18 went read-only w/ ENOSPC. Now it is close to
full, but can't really tell why it actually filled up at that time.
Here's the dmesg:

[1491582.099306] use_block_rsv: 8 callbacks suppressed
[1491582.101405] [ cut here ]
[1491582.103404] WARNING: CPU: 3 PID: 6933 at
/home/zumbi/linux-4.9.18/fs/btrfs/extent-tree.c:8331
btrfs_alloc_tree_block+0x3be/0x4e0 [btrfs]
[1491582.105432] BTRFS: block rsv returned -28
[1491582.105433] Modules linked in: ipmi_si nls_utf8 fuse ufs qnx4
hfsplus hfs minix ntfs vfat msdos fat jfs xfs libcrc32c crc32c_generic
mpt3sas raid_class scsi_transport_sas mptctl mptbase dell_rbu
binfmt_misc rpcsec_gss_krb5 nfsv3 nfsv4 dns_resolver nfsd auth_rpcgss
nfs_acl
sunrpc intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul
crc32_pclmul mgag200 joydev ttm ghash_clmulni_intel intel_cstate
drm_kms_helper drm dcdbas ipmi_devintf iTCO_wdt evdev
iTCO_vendor_support i2c_algo_bit intel_uncore serio_raw button
ipmi_msghandler pcspkr acpi_power_meter i7co
lpc_ich tpm_tis_core mfd_core tpm loop autofs4 ext4 crc16 jbd2
fscrypto mbcache btrfs xor raid6_pq dm_mod hid_generic usbhid hid sg
sd_mod crc32c_intel aesni_intel ehci_pci aes_x86_64
[1491582.116330]  uhci_hcd glue_helper psmouse lrw ehci_hcd gf128mul
ixgbe megaraid_sas ablk_helper cryptd usbcore scsi_mod dca ptp
pps_core usb_common bnx2 mdio [last unloaded: ipmi_si]
[1491582.120924] CPU: 3 PID: 6933 Comm: btrfs-transacti Tainted: G
   W I 4.9.0-0.bpo.2-amd64 #1 Debian 4.9.18-1~bpo8+1
[1491582.132792] Call Trace:
[1491582.135149]  [] ? dump_stack+0x5c/0x77
[1491582.137528]  [] ? __warn+0xc4/0xe0
[1491582.139896]  [] ? warn_slowpath_fmt+0x5f/0x80
[1491582.142298]  [] ?
btrfs_alloc_tree_block+0x3be/0x4e0 [btrfs]
[1491582.144713]  [] ? __btrfs_cow_block+0x149/0x5a0 [btrfs]
[1491582.147141]  [] ? btrfs_commit_super+0x11/0x90 [btrfs]
[1491582.149568]  [] ? btrfs_cow_block+0x104/0x1e0 [btrfs]
[1491582.151990]  [] ? btrfs_search_slot+0x1f9/0x9e0 [btrfs]
[1491582.154406]  [] ?
lookup_inline_extent_backref+0xee/0x640 [btrfs]
[1491582.156841]  [] ? set_extent_bit+0x29/0x30 [btrfs]
[1491582.159229]  [] ?
update_block_group.isra.68+0x144/0x410 [btrfs]
[1491582.161582]  [] ?
__btrfs_free_extent.isra.69+0x117/0xd60 [btrfs]
[1491582.163896]  [] ?
btrfs_merge_delayed_refs+0x69/0x600 [btrfs]
[1491582.166159]  [] ?
__btrfs_run_delayed_refs+0xa15/0x1360 [btrfs]
[1491582.168371]  [] ?
btrfs_run_delayed_refs+0x8f/0x2b0 [btrfs]
[1491582.170538]  [] ?
btrfs_commit_transaction+0x43a/0xa40 [btrfs]
[1491582.172656]  [] ? start_transaction+0x9d/0x4a0 [btrfs]
[1491582.174723]  [] ? transaction_kthread+0x1b2/0x1f0 [btrfs]
[1491582.176766]  [] ?
btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[1491582.178761]  [] ? kthread+0xe0/0x100
[1491582.180729]  [] ? __switch_to+0x2bb/0x700
[1491582.182687]  [] ? kthread_park+0x60/0x60
[1491582.184631]  [] ? ret_from_fork+0x25/0x30
[1491582.186585] ---[ end trace 98e911bbd6a0b34f ]---
[1491582.188576] [ cut here ]
[1491582.190523] WARNING: CPU: 3 PID: 6933 at
/home/zumbi/linux-4.9.18/fs/btrfs/extent-tree.c:6961
__btrfs_free_extent.isra.69+0x152/0xd60 [btrfs]
[1491582.192547] BTRFS: Transaction aborted (error -28)
[1491582.208188] CPU: 3 PID: 6933 Comm: btrfs-transacti Tainted: G
   W I 4.9.0-0.bpo.2-amd64 #1 Debian 4.9.18-1~bpo8+1
[1491582.220164] Call Trace:
[1491582.222551]  [] ? dump_stack+0x5c/0x77
[1491582.224943]  [] ? __warn+0xc4/0xe0
[1491582.227338]  [] ? warn_slowpath_fmt+0x5f/0x80
[1491582.229754]  [] ?
__btrfs_free_extent.isra.69+0x152/0xd60 [btrfs]
[1491582.232192]  [] ?
btrfs_merge_delayed_refs+0x69/0x600 [btrfs]
[1491582.234633]  [] ?
__btrfs_run_delayed_refs+0xa15/0x1360 [btrfs]
[1491582.237081]  [] ?
btrfs_run_delayed_refs+0x8f/0x2b0 [btrfs]
[1491582.239540]  [] ?
btrfs_commit_transaction+0x43a/0xa40 [btrfs]
[1491582.241998]  [] ? start_transaction+0x9d/0x4a0 [btrfs]
[1491582.244453]  [] ? transaction_kthread+0x1b2/0x1f0 [btrfs]
[1491582.246910]  [] ?
btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[1491582.249319]  [] ? kthread+0xe0/0x100
[1491582.251681]  [] ? __switch_to+0x2bb/0x700
[1491582.253984]  [] ? kthread_park+0x60/0x60
[1491582.256237]  [] ? ret_from_fork+0x25/0x30
[1491582.258439] ---[ end trace 98e911bbd6a0b350 ]---
[1491582.260635] BTRFS: error (device sdb) in __btrfs_free
_extent:6961: errno=-28 No space left
[1491582.262758] BTRFS info (device sdb): forced readonly
[1491582.264843] BTRFS: error (device sdb) in
btrfs_run_delayed_refs:2967: errno=-28 No space left
[1491582.266941] BTRFS warning (device sdb): Skipping commit of
aborted transaction.
[1491582.269009] BTRFS: error (device sdb) in
cleanup_transaction:1850: errno=-28 No space left

and yes, it's quite full, but still seems to have both meta-data &
data & unallocated available. Hmm, I say that and see size ==
allocated below, but Unallocated shows 39GB

Re: btrfs list corruption and soft lockups while testing writeback error handling

2017-05-19 Thread Jeff Layton

On Thu, 2017-05-18 at 21:57 -0700, Liu Bo wrote:
> On Fri, May 12, 2017 at 10:22:47AM -0400, Jeff Layton wrote:
> > On Fri, 2017-05-12 at 08:12 -0400, Jeff Layton wrote:
> > > On Thu, 2017-05-11 at 15:56 -0400, Chris Mason wrote:
> > > > On 05/11/2017 03:52 PM, Jeff Layton wrote:
> > > > > On Thu, 2017-05-11 at 07:13 -0400, Jeff Layton wrote:
> > > > > > I finally got my writeback error handling test to work on btrfs 
> > > > > > (thanks,
> > > > > > Chris!), by making the filesystem stripe the data and mirror the
> > > > > > metadata across two devices. The test passes now, but on one run, I 
> > > > > > got
> > > > > > the following list corruption warning and then a soft lockup (which 
> > > > > > is
> > > > > > probably fallout from the list corruption).
> > > > > > 
> > > > > > I ran the test several times before and since then without this 
> > > > > > failure,
> > > > > > so I don't have a clear reproducer. The kernel in this instance is
> > > > > > basically a v4.11 kernel with my pile of writeback error handling
> > > > > > patches on top:
> > > > > > 
> > > > > > 
> > > > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__git.samba.org_-3Fp-3Djlayton_linux.git-3Ba-3Dshortlog-3Bh-3Drefs_heads_wberr=DwICaQ=5VD0RTtNlTh3ycd41b3MUw=9QPtTAxcitoznaWRKKHoEQ=BXXwaUFQNFNaGGFYHEVlvNBwkrXiIoH7K5iOdR_PvxM=xE6pIXeQ1rlaxAV8aTYBSiI06pb3WZoiRJW8Vo1L3NQ=
> > > > > > 
> > > > > > It may be that they are a contributing factor, but this smells more 
> > > > > > like
> > > > > > a bug down in btrfs. Let me know if you need other info:
> > > > 
> > > > [ btrfs inode logging ]
> > > > 
> > > > > (cc'ing Liu Bo since we were discussing this earlier this week)
> > > > > 
> > > > > I can't reproduce this on stock v4.11, so I think this is a bug in my
> > > > > series.
> > > > > 
> > > > > I think this is due to the differences in how errors are being 
> > > > > reported
> > > > > from filemap_fdatawait_range now causing some transactions to end up
> > > > > being freed while they're still on the log_ctxs list. I'm working on
> > > > > hunting down the problem now.
> > > > > 
> > > > > Sorry for the noise!
> > > > > 
> > > > 
> > > > There's a list in the inode logging code that we consistently seem to 
> > > > find list debugging assertions with.  We've fixed up all the known 
> > > > issues, but I wouldn't be surprised if we've got a goto fail in there.
> > > > 
> > > > I'll take a look ;)
> > > > 
> > > 
> > > Thanks. I'm running test 999 here in a loop to reproduce it on a kernel
> > > with my patch series applied:
> > > 
> > > https://git.samba.org/?p=jlayton/xfstests.git;a=shortlog;h=refs/heads/wberr
> > > 
> > > The patch below seems to prevent it from crashing, but I'm not at all
> > > sure that this is a correct fix. Still, I think that the way errors are
> > > tracked within btrfs might need some rework around errseq_t's. In
> > > principle, it could make things even simpler now that we don't need to
> > > worry about resetting errors that have been cleared, etc...
> > > 
> > 
> > This patch instead rolls up all of the btrfs changes in the earlier
> > patches so it may make a bit more sense. I also tried to clean up the
> > changelog. I think this is probably along the lines of what we want, but
> > I'd want someone with more btrfs chops to scrutinize it closely:
> > 
> > ---8<--
> > 
> > [PATCH] SQUASH: btrfs: convert over to errseq_t based writeback error 
> > tracking
> > 
> > Writeback in btrfs is somewhat complicated and it tries to use
> > filemap_* functions to drive it, while tracking errors with flags,
> > alternate error fields, etc.
> > 
> > With the change to errseq_t based error reporting in the kernel, we
> > can simplify this somewhat and ensure that errors are caught properly
> > even when there are parallel operations on the same inode.
> > 
> > The btrfs_log_ctx has an io_err field in it that gets an error stored in
> > it when there is an I/O error. Instead, sample the mapping's errseq_t
> > when the context is initialized and use that to tell whether there has
> > been a writeback error since that point.
> > 
> > Note that btrfs_sync_log passes in NULL for the inode when initializing
> > the context, but that codepath doesn't seem to use the io_err field
> > anyway.
> > 
> 
> Have you figured out why we hit that list_add_tail assertion?
> 

No, I didn't track it down in detail.

Once I determined that v4.11 didn't have the same error, it was pretty
clear that the error reporting differences were the real culprit. Once I
established a sane "since" value for checking errors, the problem went
away so I think that's probably it.

At a high level, my guess is that btrfs has some places where it's
relying on an error return from filemap_fdatawait or the like to take
things off of this list, and the since value that it was using wasn't
early enough to catch the error. Eventually the objects get freed while
they're still on the list and you end up with

Re: btrfs metadata reclaim behavior/performance characteristics

2017-05-19 Thread Nikolay Borisov

> From: Liu Bo 
> 
> Subject: [PATCH] Btrfs: skip commit transaction if we don't have enough 
> pinned bytes
> 
> We commit transaction in order to reclaim space from pinned bytes because it 
> could process delayed refs, and in may_commit_transaction(), we check first 
> if pinned bytes are enough for the required space, we then check if that plus 
> bytes reserved for delayed insert are enough for the required space.
> 
> This changes the code to the above logic.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e390451c72e6..bded1ddd1bb6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>  
>   spin_lock(_rsv->lock);
>   if (percpu_counter_compare(_info->total_bytes_pinned,
> -bytes - delayed_rsv->size) >= 0) {
> +  bytes - 
> delayed_rsv->size) < 0) {
>  
> spin_unlock(_rsv->lock);
>   return -ENOSPC;
>   }
> 

Your patch does make a very big difference. Here are a couple of runs of
slow-rm:



root@ubuntu-virtual:~# ./slow-rm.sh
Created 837 files before returning error, time taken 3
Created 920 files before returning error, time taken 3
Created 949 files before returning error, time taken 3
Created 930 files before returning error, time taken 3
Created 1101 files before returning error, time taken 4
Created 1082 files before returning error, time taken 4
Created 1608 files before returning error, time taken 5
Created 1735 files before returning error, time taken 5
rming took 1 seconds

root@ubuntu-virtual:~# ./slow-rm.sh
Created 801 files before returning error, time taken 3
Created 829 files before returning error, time taken 3
Created 983 files before returning error, time taken 3
Created 978 files before returning error, time taken 3
Created 1023 files before returning error, time taken 3
Created 1126 files before returning error, time taken 3
Created 1538 files before returning error, time taken 4
Created 1737 files before returning error, time taken 5
rming took 2 seconds

root@ubuntu-virtual:~# ./slow-rm.sh
Created 875 files before returning error, time taken 3
Created 891 files before returning error, time taken 3
Created 969 files before returning error, time taken 4
Created 1002 files before returning error, time taken 4
Created 1039 files before returning error, time taken 4
Created 1051 files before returning error, time taken 4
Created 1191 files before returning error, time taken 4
Created 2137 files before returning error, time taken 8
rming took 2 seconds

So rming is a lot faster, but we create less files all in all and get
ENOSPC earlier. This means that most of the time bytes_pinned is not
enough to satisfy the allocation hence we are hitting the second
percpu_counter comparison.

Also, the reason why the previous links showed 0 for bytes_pinned was
due to me having completely forgotten that bytes_pinned is a percpu
counter, hence my stap script wasn't actually reading it correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RAID 6 corrupted

2017-05-19 Thread Roman Mamedov

On Fri, 19 May 2017 11:55:27 +0300
Pasi Kärkkäinen  wrote:

> > > Try saving your data with "btrfs restore" first 
> > 
> > First post, he tried that.  No luck.  Tho that was with 4.4 userspace.  
> > It might be worth trying with the 4.11-rc or soon to be released 4.11 
> > userspace, tho...
> > 
> 
> Try with 4.12-rc, I assume :)

No, actually I missed that this was already tried, and a newer kernel will not
help at "btrfs restore", AFAIU it works entirely in userspace, not kernel.

Newer btrfs-progs could be something to try though, as the version used seems
pretty old -- btrfs-progs v4.4.1, while the current one is v4.11.

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

QGroups Semantics

2017-05-19 Thread Sargun Dhillon

I have some questions about the *intended* qgroups semantics, and why
we allow certain operations:

1) Why can you create a level 0 qgroup for a non-existent subvolume?
It looks like it just gets reset when you create the subvol anyway?
Should we prevent this?

2) Why do we allow deleting a level 0 qgroup for a currently existing
subvolume? It seems to me that just setting the limit to 0, and/or
removing relations would work equally well? Perhaps a new ioctl to
clear the qgroup would make sense to do this.

3) Why don't we remove the associated level 0 qgroup when you remove
the subvol with that id?

4) I noticed in qgroup.c, one of the outstanding items is to determine
how to autocleanup. Are there any stated semantics around that, or
opinions?

PS:
If the answer to any of these is "it just needs to be worked on" --
I'm currently poking around this code path for some enhancements we're
doing for our usecase. I'm just trying to figure out how much of what
I'm doing is generalizable.

-Thanks,
Sargun
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RAID 6 corrupted

2017-05-19 Thread Pasi Kärkkäinen

On Thu, May 18, 2017 at 06:12:22AM +, Duncan wrote:
> Roman Mamedov posted on Thu, 18 May 2017 10:17:19 +0500 as excerpted:
> 
> > On Thu, 18 May 2017 04:09:38 +0200 ??ukasz Wróblewski  wrote:
> > 
> >> I will try when stable 4.12 comes out.
> >> Unfortunately I do not have a backup.
> >> Fortunately, these data are not so critical.
> >> Some private photos and videos of youth.
> >> However, I would be very happy if I could get it back.
> > 
> > Try saving your data with "btrfs restore" first 
> 
> First post, he tried that.  No luck.  Tho that was with 4.4 userspace.  
> It might be worth trying with the 4.11-rc or soon to be released 4.11 
> userspace, tho...
> 

Try with 4.12-rc, I assume :)


-- Pasi


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 0/2] btrfs: allow mechanism to override quota

2017-05-19 Thread Sargun Dhillon

On Fri, May 12, 2017 at 7:48 AM, David Sterba  wrote:
> On Thu, May 11, 2017 at 09:17:01PM +, Sargun Dhillon wrote:
>> This patchset makes it so that on a per-filesystem basis one can disable
>> quota enforcement for users with cap_sys_resource. This patchset can
>> likely later be extended to per-qgroup, or a per-volume basis. I'm
>> thinking of extending the sysfs interface to list the qgroups and
>> this same interface for the qgroups themselves.
>>
>> Changes since v1:
>>   -Rather than a separate member of btrfs_fs_info, use the existing
>>flags field
>
> Looks good to me, thanks.
I'm curious as to whether this approach is fine to get an Acked-by, or
if I need to figure out how to make it more leak-tolerant. I don't
think modifying the overridden extents inflight is a problem. I'm not
sure of a way a user would be able to create *new* chunks of data.
Alternatively, I'd be quite happy making this applicable to metadata
only, for file xattrs, creation, deletion, etc..

Opinions?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 12/15] block: merge blk_types.h into bio.h

2017-05-19 Thread h...@lst.de

On Thu, May 18, 2017 at 02:25:52PM +, Bart Van Assche wrote:
> On Thu, 2017-05-18 at 15:18 +0200, Christoph Hellwig wrote:
> > We've cleaned up our headers sufficiently that we don't need this split
> > anymore.
> 
> Hello Christoph,
> 
> Request-based drivers need the structure definitions from  and
> the type definitions from  but do not need the definition 
> of
> struct bio. Have you considered to remove #include  from file
> include/linux/blkdev.h? Do you think that would help to reduce the kernel 
> build
> time?

Most block drivers end up needing bio.h anyway.  But I can skip this
part of the series for now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: dedicated error codes for the block layer

2017-05-19 Thread Christoph Hellwig

On Thu, May 18, 2017 at 04:55:08PM +0200, David Sterba wrote:
> JFYI, the patches 13 and 15 are missing, not in linux-btrfs mailbox and
> patchwork web. Does not look like a delay, maybe vger refused them
> completely.

They still haven't made it to linux-block and linux-btrfs, but they did
make it to dm-devel, which is not host on vger.

I also have a git tree here:

   git://git.infradead.org/users/hch/block.git block-errors

Gitweb:

   
http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-errors   
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH] btrfs: Clean up btrfs_leaf_data

2017-05-19 Thread Nikolay Borisov

Commit 5f39d397dfbe ("Btrfs: Create extent_buffer interface for large 
blocksizes")
refactored btrfs_leaf_data function to take extent_buffer rather than
struct btrfs_leaf. However, as it turns out the parameter being passed is never
used. Help the brave souls which are going to read this code in the future
by eliminating the parameter.

Signed-off-by: Nikolay Borisov 
---

Hello, 

Here is a minor cleanup, which hopefully reduces cognitive load while reading
the extent_buffer code. After this commit btrfs_leaf_data basically becomes
a synonym to offsetof and I can't help it but begin to wonder wouldn't it be 
better if I nuke btrfs_leaf_data altogether or rename it to
btrfs_leaf_data_offset? This one is based on David's for-next branch 


 fs/btrfs/ctree.c | 40 
 fs/btrfs/ctree.h |  6 +++---
 fs/btrfs/extent_io.c |  2 +-
 3 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a3a75f1de002..7bb0cb165e1e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -3667,14 +3667,14 @@ static noinline int __push_leaf_right(struct 
btrfs_fs_info *fs_info,
/* make room in the right data area */
data_end = leaf_data_end(fs_info, right);
memmove_extent_buffer(right,
- btrfs_leaf_data(right) + data_end - push_space,
- btrfs_leaf_data(right) + data_end,
+ btrfs_leaf_data() + data_end - push_space,
+ btrfs_leaf_data() + data_end,
  BTRFS_LEAF_DATA_SIZE(fs_info) - data_end);
 
/* copy from the left data area */
-   copy_extent_buffer(right, left, btrfs_leaf_data(right) +
+   copy_extent_buffer(right, left, btrfs_leaf_data() +
 BTRFS_LEAF_DATA_SIZE(fs_info) - push_space,
-btrfs_leaf_data(left) + leaf_data_end(fs_info, left),
+btrfs_leaf_data() + leaf_data_end(fs_info, left),
 push_space);
 
memmove_extent_buffer(right, btrfs_item_nr_offset(push_items),
@@ -3888,9 +3888,9 @@ static noinline int __push_leaf_left(struct btrfs_fs_info 
*fs_info,
push_space = BTRFS_LEAF_DATA_SIZE(fs_info) -
 btrfs_item_offset_nr(right, push_items - 1);
 
-   copy_extent_buffer(left, right, btrfs_leaf_data(left) +
+   copy_extent_buffer(left, right, btrfs_leaf_data() +
 leaf_data_end(fs_info, left) - push_space,
-btrfs_leaf_data(right) +
+btrfs_leaf_data() +
 btrfs_item_offset_nr(right, push_items - 1),
 push_space);
old_left_nritems = btrfs_header_nritems(left);
@@ -3917,9 +3917,9 @@ static noinline int __push_leaf_left(struct btrfs_fs_info 
*fs_info,
if (push_items < right_nritems) {
push_space = btrfs_item_offset_nr(right, push_items - 1) -
  leaf_data_end(fs_info, right);
-   memmove_extent_buffer(right, btrfs_leaf_data(right) +
+   memmove_extent_buffer(right, btrfs_leaf_data() +
  BTRFS_LEAF_DATA_SIZE(fs_info) - 
push_space,
- btrfs_leaf_data(right) +
+ btrfs_leaf_data() +
  leaf_data_end(fs_info, right), 
push_space);
 
memmove_extent_buffer(right, btrfs_item_nr_offset(0),
@@ -4069,8 +4069,8 @@ static noinline void copy_for_split(struct 
btrfs_trans_handle *trans,
   nritems * sizeof(struct btrfs_item));
 
copy_extent_buffer(right, l,
-btrfs_leaf_data(right) + BTRFS_LEAF_DATA_SIZE(fs_info) -
-data_copy_size, btrfs_leaf_data(l) +
+btrfs_leaf_data() + BTRFS_LEAF_DATA_SIZE(fs_info) -
+data_copy_size, btrfs_leaf_data() +
 leaf_data_end(fs_info, l), data_copy_size);
 
rt_data_off = BTRFS_LEAF_DATA_SIZE(fs_info) - btrfs_item_end_nr(l, mid);
@@ -4607,8 +4607,8 @@ void btrfs_truncate_item(struct btrfs_fs_info *fs_info,
 
/* shift the data */
if (from_end) {
-   memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
- data_end + size_diff, btrfs_leaf_data(leaf) +
+   memmove_extent_buffer(leaf, btrfs_leaf_data() +
+ data_end + size_diff, btrfs_leaf_data() +
  data_end, old_data_start + new_size - data_end);
} else {
struct btrfs_disk_key disk_key;
@@ -4634,8 +4634,8 @@ void btrfs_truncate_item(struct btrfs_fs_info *fs_info,
}
}
 
-   memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
- data_end + size_diff,

[PATCH v3 1/2] btrfs: Separate space_info create/update

2017-05-19 Thread Nikolay Borisov

Currently the struct space_info creation code is intermixed in the
udpate_space_info function. There are well-defined points at which the we
actually want to create brand-new space_info structs (e.g. during mount of
the filesystem as well as sometimes when adding/initialising new chunks). In
such cases udpate_space_info is called with 0 as the bytes parameter. All of
this makes for spaghetti code.

Fix it by factoring out the creation code in a separate create_space_info
structure. This also allows to simplify the internals. Also remove BUG_ON from
do_alloc_chunk since the callers handle errors. Furthermore it will
make the update_space_info function not fail, allowing us to remove error
handling in callers. This will come in a follow up patch.

Reviewed-by: Jeff Mahoney 
Reviewed-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 129 -
 1 file changed, 64 insertions(+), 65 deletions(-)


 Worked on feedback from Liu Bo and discovered I didn't have vim's linuxtsy 
 plugin installed, hence the b0rked formatting in some of my patches. 

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..55b6836f5a20 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3914,15 +3914,58 @@ static const char *alloc_name(u64 flags)
};
 }
 
+static int create_space_info(struct btrfs_fs_info *info, u64 flags,
+struct btrfs_space_info **new) {
+
+   struct btrfs_space_info *space_info;
+   int i;
+   int ret;
+
+   space_info = kzalloc(sizeof(*space_info), GFP_NOFS);
+   if (!space_info)
+   return -ENOMEM;
+
+   ret = percpu_counter_init(_info->total_bytes_pinned, 0,
+GFP_KERNEL);
+   if (ret) {
+   kfree(space_info);
+   return ret;
+   }
+
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+   INIT_LIST_HEAD(_info->block_groups[i]);
+   init_rwsem(_info->groups_sem);
+   spin_lock_init(_info->lock);
+   space_info->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
+   space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
+   init_waitqueue_head(_info->wait);
+   INIT_LIST_HEAD(_info->ro_bgs);
+   INIT_LIST_HEAD(_info->tickets);
+   INIT_LIST_HEAD(_info->priority_tickets);
+
+   ret = kobject_init_and_add(_info->kobj, _info_ktype,
+   info->space_info_kobj, "%s",
+   alloc_name(space_info->flags));
+   if (ret) {
+   kfree(space_info);
+   return ret;
+   }
+
+   *new = space_info;
+   list_add_rcu(_info->list, >space_info);
+   if (flags & BTRFS_BLOCK_GROUP_DATA)
+   info->data_sinfo = space_info;
+
+   return ret;
+}
+
 static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 u64 total_bytes, u64 bytes_used,
 u64 bytes_readonly,
 struct btrfs_space_info **space_info)
 {
struct btrfs_space_info *found;
-   int i;
int factor;
-   int ret;
 
if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
 BTRFS_BLOCK_GROUP_RAID10))
@@ -3946,53 +3989,6 @@ static int update_space_info(struct btrfs_fs_info *info, 
u64 flags,
*space_info = found;
return 0;
}
-   found = kzalloc(sizeof(*found), GFP_NOFS);
-   if (!found)
-   return -ENOMEM;
-
-   ret = percpu_counter_init(>total_bytes_pinned, 0, GFP_KERNEL);
-   if (ret) {
-   kfree(found);
-   return ret;
-   }
-
-   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
-   INIT_LIST_HEAD(>block_groups[i]);
-   init_rwsem(>groups_sem);
-   spin_lock_init(>lock);
-   found->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
-   found->total_bytes = total_bytes;
-   found->disk_total = total_bytes * factor;
-   found->bytes_used = bytes_used;
-   found->disk_used = bytes_used * factor;
-   found->bytes_pinned = 0;
-   found->bytes_reserved = 0;
-   found->bytes_readonly = bytes_readonly;
-   found->bytes_may_use = 0;
-   found->full = 0;
-   found->max_extent_size = 0;
-   found->force_alloc = CHUNK_ALLOC_NO_FORCE;
-   found->chunk_alloc = 0;
-   found->flush = 0;
-   init_waitqueue_head(>wait);
-   INIT_LIST_HEAD(>ro_bgs);
-   INIT_LIST_HEAD(>tickets);
-   INIT_LIST_HEAD(>priority_tickets);
-
-   ret = kobject_init_and_add(>kobj, _info_ktype,
-   info->space_info_kobj, "%s",
-   alloc_name(found->flags));
-   if (ret) {
-   kfree(found);
-   return ret;
-   }
-
-   *space_info = found;
-   list_add_rcu(>list, >space_info);
-   if (flags &

[PATCH v3 2/2] btrfs: Refactor update_space_info

2017-05-19 Thread Nikolay Borisov

Following the factoring out of the creation code udpate_space_info can only
be called for already-existing space_info structs. As such it cannot fail.
Remove superfulous error handling and make the function return void.


Reviewed-by: Jeff Mahoney 
Reviewed-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 58 --
 1 file changed, 18 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 55b6836f5a20..e25be9f0089f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3959,7 +3959,7 @@ static int create_space_info(struct btrfs_fs_info *info, 
u64 flags,
return ret;
 }
 
-static int update_space_info(struct btrfs_fs_info *info, u64 flags,
+static void update_space_info(struct btrfs_fs_info *info, u64 flags,
 u64 total_bytes, u64 bytes_used,
 u64 bytes_readonly,
 struct btrfs_space_info **space_info)
@@ -3974,21 +3974,19 @@ static int update_space_info(struct btrfs_fs_info 
*info, u64 flags,
factor = 1;
 
found = __find_space_info(info, flags);
-   if (found) {
-   spin_lock(>lock);
-   found->total_bytes += total_bytes;
-   found->disk_total += total_bytes * factor;
-   found->bytes_used += bytes_used;
-   found->disk_used += bytes_used * factor;
-   found->bytes_readonly += bytes_readonly;
-   if (total_bytes > 0)
-   found->full = 0;
-   space_info_add_new_bytes(info, found, total_bytes -
-bytes_used - bytes_readonly);
-   spin_unlock(>lock);
-   *space_info = found;
-   return 0;
-   }
+   ASSERT(found);
+   spin_lock(>lock);
+   found->total_bytes += total_bytes;
+   found->disk_total += total_bytes * factor;
+   found->bytes_used += bytes_used;
+   found->disk_used += bytes_used * factor;
+   found->bytes_readonly += bytes_readonly;
+   if (total_bytes > 0)
+   found->full = 0;
+   space_info_add_new_bytes(info, found, total_bytes -
+bytes_used - bytes_readonly);
+   spin_unlock(>lock);
+   *space_info = found;
 }
 
 static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
@@ -10043,19 +10041,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
}
 
trace_btrfs_add_block_group(info, cache, 0);
-   ret = update_space_info(info, cache->flags, found_key.offset,
-   btrfs_block_group_used(>item),
-   cache->bytes_super, _info);
-   if (ret) {
-   btrfs_remove_free_space_cache(cache);
-   spin_lock(>block_group_cache_lock);
-   rb_erase(>cache_node,
->block_group_cache_tree);
-   RB_CLEAR_NODE(>cache_node);
-   spin_unlock(>block_group_cache_lock);
-   btrfs_put_block_group(cache);
-   goto error;
-   }
+   update_space_info(info, cache->flags, found_key.offset,
+ btrfs_block_group_used(>item),
+ cache->bytes_super, _info);
 
cache->space_info = space_info;
 
@@ -10214,18 +10202,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
 * the rbtree, update the space info's counters.
 */
trace_btrfs_add_block_group(fs_info, cache, 1);
-   ret = update_space_info(fs_info, cache->flags, size, bytes_used,
+   update_space_info(fs_info, cache->flags, size, bytes_used,
cache->bytes_super, >space_info);
-   if (ret) {
-   btrfs_remove_free_space_cache(cache);
-   spin_lock(_info->block_group_cache_lock);
-   rb_erase(>cache_node,
-_info->block_group_cache_tree);
-   RB_CLEAR_NODE(>cache_node);
-   spin_unlock(_info->block_group_cache_lock);
-   btrfs_put_block_group(cache);
-   return ret;
-   }
update_global_block_rsv(fs_info);
 
__link_block_group(cache->space_info, cache);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs-progs: test for restoring multiple devices fs into a single device

2017-05-19 Thread Lakshmipathi.G

Looks fine, just couple of minor feedback.

> +run_check $SUDO_HELPER $TOP/mkfs.btrfs -f $loop1 $loop2

Please add quotation around variable in the script, as suggested in
tests/README.md 'Coding style, best practices' section. This section
is added recently, we are updating older scripts.

> +
> +# Cleanup loop devices.
> +run_check $SUDO_HELPER losetup -d $loop1
> +run_check $SUDO_HELPER losetup -d $loop2
> +rm -f dev1 dev2
> +
> +# Compare the file digests.
> +[ $orig_md5 == $new_md5 ] || _fail "File digests do not match"

"Cleanup loop devices" can be done after comparing file integrity?
This way, if integrity fails, the setup will be there to debug
further. thanks.

-- 

Cheers,
Lakshmipathi.G
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [LKP] [btrfs] "fio: pid=2214, got signal=7" error showed in fio test for btrfs

2017-05-19 Thread Ye Xiaolong

On 03/29, Ye Xiaolong wrote:
>Attach kmsg and reproduce script.
>
>Note: kernel cmdline contained "memmap=104G!4G memmap=104G!132G"

Hi,

Any feedback about this issue?

Thanks,
Xiaolong

>
>Thanks,
>Xiaolong
>
>On 03/29, kernel test robot wrote:
>>Hi,
>>
>>We detected below error messages in fio pmem test for btrfs.
>>
>>machine: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory
>>kernel: v4.10 
>>test parameters:
>>
>>[global]
>>bs=2M
>>ioengine=mmap
>>iodepth=32
>>size=7669584457
>>direct=0
>>runtime=200
>>invalidate=1
>>fallocate=posix
>>group_reporting
>>
>>time_based
>>
>>[task_0]
>>rw=write
>>directory=/fs/pmem0
>>numjobs=14
>>
>>[task_1]
>>rw=write
>>directory=/fs/pmem1
>>numjobs=14
>>
>>
>>==> /tmp/stderr <==
>>fio: pid=2214, got signal=7
>>fio: pid=2215, got signal=7
>>fio: pid=2217, got signal=7
>>fio: pid=2220, got signal=7
>>fio: pid=, got signal=7
>>fio: pid=2224, got signal=7
>>fio: pid=2225, got signal=7
>>fio: pid=2227, got signal=7
>>fio: pid=2216, got signal=7
>>fio: pid=2218, got signal=7
>>fio: pid=2219, got signal=7
>>fio: pid=2221, got signal=7
>>fio: pid=2223, got signal=7
>>fio: pid=2226, got signal=7
>>fio: pid=2200, got signal=7
>>fio: pid=2201, got signal=7
>>fio: pid=2203, got signal=7
>>fio: pid=2204, got signal=7
>>fio: pid=2205, got signal=7
>>fio: pid=2208, got signal=7
>>fio: pid=2209, got signal=7
>>fio: pid=2210, got signal=7
>>fio: pid=2211, got signal=7
>>fio: pid=2212, got signal=7
>>fio: pid=2213, got signal=7
>>fio: pid=2202, got signal=7
>>fio: pid=2206, got signal=7
>>fio: pid=2207, got signal=7
>>fio: file hash not empty on exit
>>
>>while fio test is ok for ext4 with the same test parameters.
>>Attached reproduce.sh may help reproduce the problem.
>>
>>Thanks,
>>Xiaolong
>
>


>modprobe nd_e820
>dmsetup remove_all
>wipefs -a --force /dev/pmem0
>wipefs -a --force /dev/pmem1
>mkfs -t btrfs /dev/pmem1
>mkfs -t btrfs /dev/pmem0
>mkdir -p /fs/pmem0
>mount -t btrfs /dev/pmem0 /fs/pmem0
>mkdir -p /fs/pmem1
>mount -t btrfs /dev/pmem1 /fs/pmem1
>
>for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
>do
>   echo performance > $file
>done
>
>echo '
>[global]
>bs=2M
>ioengine=mmap
>iodepth=32
>size=7669584457
>direct=0
>runtime=200
>invalidate=1
>fallocate=posix
>group_reporting
>
>time_based
>
>[task_0]
>rw=write
>directory=/fs/pmem0
>numjobs=14
>
>[task_1]
>rw=write
>directory=/fs/pmem1
>numjobs=14' | fio --output-format=json -

>___
>LKP mailing list
>l...@lists.01.org
>https://lists.01.org/mailman/listinfo/lkp

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

46 matches

Mail list logo