date:20140804

Thanks for responses.

All of this is *very* surprising. I'm not new to BTRFS, I've been
using it on my own machines for multiple years. I didn't realise there
was an un-holstered footgun on my lap at this point. How can it be
made clear to me and other sysadmins how to avoid the ENOSPC problem?
Or, preferably, how can it stop being a problem at all?

One thing which continues to puzzle me is "How do I make an alarm to
warn of an impending ENOSPC condition on BTRFS?". ENOSPC is a bad
place to be.

All of the standard monitoring tools warn on the output of `df`.

My first thought was to make a graph and put a threshold in `metadata
total - used`. However, I was fortunate enough in this case to know
about `btrfs fi df`. When I looked at "metadata free" I concluded that
there is plenty free, not knowing that it was allocated in blocks
larger than the amount presented as free (total - used = 0.5GiB). So
these numbers were quite misleading in this case. If I had seen
total=used, or available=0, the problem would have been much clearer.

Why present space as available when it can't be used?

In the end, it seems that metadata should be able to steal space from
"data" on demand. That would make the output of "df" more informative,
since you wouldn't see "60 GB free" and get ENOSPC, which is an
utterly confusing situation and harmful to production.

Is there something fundamental preventing that from happening, or is it
just that no one has gotten around to it yet?

Thanks,

- Peter


On 4 August 2014 02:38, Qu Wenruo  wrote:
> Hi, Peter
>
> Some explanations inline below.
>
> -------- Original Message --------
> Subject: ENOSPC with mkdir and rename
> From: Peter Waller 
> To: 
> Date: 2014-08-03 07:35
>>
>> Hi All,
>>
>> My TL;DR questions are at the bottom, before the stack trace.
>>
>> I'm running Ubuntu 14.04. I wonder if this problem is related to the
>> thread titled "Machine lockup due to btrfs-transaction on AWS EC2
>> Ubuntu 14.04" which I started on the 29th of July:
>>
>>> http://thread.gmane.org/gmane.comp.file-systems.btrfs/37224
>>
>> Kernel: 3.15.7-031507-generic
>>
>> I'm on a single block device system, i.e, no RAID.
>>
>> I was observing ENOSPC from `mkdir` and `rename` on this system, with
>> a good amount of free disk space (df -h reports 62 GB remain). I added
>> enospc_debug (full umount/mount, not just mount -o remount), but this
>> had no apparent effect when receiving ENOSPC from userland.
>>
>> $ sudo btrfs fi df /path/to/volume
>> Data, single: total=489.97GiB, used=427.75GiB
>> System, DUP: total=8.00MiB, used=60.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=5.00GiB, used=4.50GiB
>
> In fact, all of your metadata is used.
> That seems strange, since there should be about 500 MB (to be precise,
> 512 MiB) free, but I'll explain it below.
>
>> Metadata, single: total=8.00MiB, used=0.00
>> unknown, single: total=512.00MiB, used=820.00KiB
>
> Here the "unknown" is in fact the "global data reserve", reserved for COW
> tree writes (except for the FS tree and subvolume trees, if I'm right).
> If you use the latest btrfs-progs, it will show "GlobalReserve" instead of
> "unknown". It should not be used in most cases, but here it is being used,
> which really shows the shortage of space.
>
> So sadly, there is really no space left for the metadata that mkdir and
> rename need(*).
>
> *: rename modifies metadata, btrfs does COW for the metadata trees, and
> rename/mkdir will not use space from the global reserve, so ENOSPC is
> expected here.
>
> The good thing is that rm will steal space from the global reserve, so you
> should be able to remove files and, hopefully, free enough metadata space.
> Or you can try adding another device to this btrfs.
>
> Thanks,
> Qu
>>
>>
>> After a thorough search of the internet for ENOSPC BTRFS I found
>> various resources and came to understand a little bit more. One thing
>> which broke my intuition severely is that I expected if there is a
>> large number of free GiB, I should expect things to continue to work.
>>
>> In this case, for example, metadata has 0.5GiB free ("sounds like
>> plenty for metadata for one mkdir to me"). Data has 62GiB free. Why
>> would I get ENOSPC for a file rename?
>>
>> I expected that if metadata needed more space, it would just eat it
>> from the 'data'. Now I believe this not to be the case and that it
>> wanted to allocate > 0.5GiB, and this is why I was getting ENOSPC.
>>
>> I tried a rebalance with btrfs balance start -dusage=10 and tried
>> increasing the value until I saw reallocations in dmesg.
>>
>> This spat out a large number of messages in dmesg, of this form:
>>
>>> [376096.546353] BTRFS info (device dm-0): relocating block group
>>> 530457821184 flags 1
>>> [376010.736879] BTRFS info (device dm-0): 40 enospc errors during balance
>>
>> (and a full stack trace at the end of this message).
>>
>> The rebalance printed:
>>
>>> ERROR: error during balancing '/path/to/volume' - No space left on device
>>> There may be more info in syslog - try dmesg | tail
>>
>> Eventually, n

Re: ENOSPC with mkdir and rename

2014-08-04 Thread Clemens Eisserer

Hi Peter,

> All of this is *very* surprising. I'm not new to BTRFS, I've been
> using it on my own machines for multiple years. I didn't realise there
> was an un-holstered footgun on my lap at this point. How can it be
> made clear to me and other sysadmins how to avoid the ENOSPC problem?
> Or, preferably, how can it stop being a problem at all?

I've also found the fixed metadata/data split to be an uncomfortable
implementation detail, and some more flexible approach would be very
welcome from my side.

So far I've used BTRFS' mixed mode mentioned in the mkfs.btrfs man page:

> -M|--mixed
>     Mix data and metadata chunks together for more efficient space
>     utilization. This feature incurs a performance penalty in larger
>     filesystems. It is recommended for use with filesystems of 1 GiB or
>     smaller.

However I didn't find any information on how large the mentioned
overhead is, or where it originates from.

Best regards, Clemens
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Scan not being performed properly on boot

On Mon, 4 Aug 2014 01:31:42 PM Russell Coker wrote:

> Is BTRFS supported in that version of Ubuntu?

Out of the box a fresh 14.04 install onto btrfs worked fine for me on two 
different sets of hardware.  13.10 the same on a third piece of hardware.

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: ENOSPC with mkdir and rename

On Mon, 4 Aug 2014 09:14:19 AM Peter Waller wrote:

> All of this is *very* surprising.

Hmm, it shouldn't be, the ENOSPC issues are well known and have been discussed 
here for years.

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: ENOSPC with mkdir and rename

2014-08-04 Thread Clemens Eisserer

Hi Chris,

> Hmm, it shouldn't be, the ENOSPC issues are well known and have been discussed
> here for years.

Which doesn't protect the *average* user from running into issues like this.
Just because it has been discussed, doesn't mean nothing can/should be
done about it ;)

However, as I am only a user too and can't contribute in terms of
code, I'll stay patient and observe how btrfs is evolving.
One day or another, the ENOSPC issues will get fixed or worked around.

Regards, Clemens

Re: ENOSPC with mkdir and rename

On 4 August 2014 10:39, Chris Samuel wrote:
> On Mon, 4 Aug 2014 09:14:19 AM Peter Waller wrote:
>> All of this is *very* surprising.
>
> Hmm, it shouldn't be, the ENOSPC issues are well known and have been discussed
> here for years.

I accept that. It's all very well if you read the BTRFS list and/or
are a BTRFS developer. But if you're trying to work it out in the heat
of battle, as we have sysadmins who would have to, there is a
combination of things here that makes it unreasonable and harmful for
production.

I was in a situation where I was getting sporadic ENOSPC and none of
the instructions I could find helped. I did a thorough search of the
wiki and mailing list - I found a plethora of similar sounding
problems and none of the advice given helped.

Our usage is a simple case: no RAID, no subvolumes, no snapshots. We
had >60GiB free and apparently some metadata free.

I still can't find a clear answer to the question "How do I make an
alarm to warn of an impending ENOSPC condition on BTRFS?"

Is that because there is no clear answer?

The nature of "running out of disk space" as a problem means you won't
hit it until you've been using it for a long while, which makes this
problem of the form "a ticking time bomb". Is there no way to make
this operationally easier? or should only BTRFS developers use BTRFS?

I'm breaking the rest out below if you are interested to try and
understand more the problems I was having.

Thanks,

- Peter

More thoughts to illustrate the problems with the existing documentation:

Getting started contains no warning of what's different about free
space compared with other filesystems one might be familiar with:

https://btrfs.wiki.kernel.org/index.php/Getting_started

The sysadmin guide doesn't appear to mention free space at all:

https://btrfs.wiki.kernel.org/index.php/SysadminGuide

The FAQ has a question:

https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21

Which starts out "Free space is a tricky concept in Btrfs" but then
doesn't explain it very well. None of the advice given there helped in
my case. There is talk about a mixed mode, but not how to move an
existing filesystem to it. I'm yet to find an explanation of
rebalancing which isn't focussed on what it means for RAID, and it
still isn't crystal clear to me what rebalancing means for
metadata/data on one disk. Rebalancing didn't work in my case. Must I
construct an image of the underlying BTRFS datastructures in my head?
I'm fine if I have to do that, but nowhere makes it clear what mental
tools I need to tackle this.

This link is mentioned by the above but not directly linked to by it
(and has "are" and "is" changed compared with the above text):

https://btrfs.wiki.kernel.org/index.php/FAQ#Why_are_there_so_many_ways_to_check_the_amount_of_free_space.3F

This link would have helped a bit but wasn't cross referenced by any
of the other materials which I did find, so I couldn't find it in the
heat of battle:

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_get_.22No_space_left_on_device.22_errors.2C_but_df_says_I.27ve_got_lots_of_space

One problem is that it isn't clear what "chunks" are. Does an operator
of a BTRFS filesystem need to understand this in the simple case of no
snapshots, no RAID?

How did the whole disk come to be allocated to data given that we
hadn't used all of it? Is it because the data is using chunks
inefficiently? How does this come to be in the simple case?

The documentation could use some illustrations to make this clear.

Re: ENOSPC with mkdir and rename

On Mon, Aug 04, 2014 at 11:09:23AM +0100, Peter Waller wrote:
> On 4 August 2014 10:39, Chris Samuel  wrote:
> > On Mon, 4 Aug 2014 09:14:19 AM Peter Waller wrote:
> >> All of this is *very* surprising.
> >
> > Hmm, it shouldn't be, the ENOSPC issues are well known and have been 
> > discussed
> > here for years.
> 
> I accept that. It's all very well if you read the BTRFS list and/or
> are a BTRFS developer. But if you're trying to work it out in the heat
> of battle, as we have sysadmins who would have to, there is a
> combination of things here that makes it unreasonable and harmful for
> production.
> 
> I was in a situation where I was getting sporadic ENOSPC and none of
> the instructions I could find helped. I did a thorough search of the
> wiki and mailing list - I found a plethora of similar sounding
> problems and none of the advice given helped.
> 
> Our usage is a simple case: no RAID, no subvolumes, no snapshots. We
> had >60GiB free and apparently some metadata free.
> 
> I still can't find a clear answer to the question "How do I make an
> alarm to warn of an impending ENOSPC condition on BTRFS?"

   On the 3.15+ kernels, the block reserve is split out of metadata
and reported separately. This helps with the following process:

 * btrfs fi show
    - Look at the total and used values. If used < total, you're OK.
      If used == total, then you could potentially hit ENOSPC.

 * btrfs fi df
    - Look at metadata used vs total. If the difference between them
      is close to zero (on 3.15+) or close to 512 MiB (on <3.15), then
      you are in danger of ENOSPC.

    - Look at data used vs total. If used is much smaller than total,
      you can reclaim some of the allocation with a filtered balance
      (btrfs balance start -dusage=5), which will then give you
      unallocated space again (see the btrfs fi show test).
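
   The metadata check above can be turned into a rough monitoring
probe. What follows is a minimal Python sketch, not an official tool.
It parses text in the format that `btrfs fi df` printed earlier in this
thread and reports the remaining metadata headroom. The function names,
the regular expression and the decision to always subtract the reserve
are assumptions of this sketch. Subtracting the reserve follows Qu's
reading of the original numbers; if a given kernel already excludes the
reserve from the metadata figures, the check simply alarms early.

```python
import re

# Multipliers for the unit suffixes seen in the output quoted in this thread.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Matches lines such as "Metadata, DUP: total=5.00GiB, used=4.50GiB".
LINE = re.compile(
    r"^(?P<kind>[^,]+),\s*(?P<profile>[^:]+):\s*"
    r"total=(?P<total>[\d.]+)(?P<tunit>[A-Za-z]*),\s*"
    r"used=(?P<used>[\d.]+)(?P<uunit>[A-Za-z]*)$"
)

# Labels used for the reserve in this thread (depends on userspace version).
RESERVE_KINDS = ("unknown", "GlobalReserve", "BlockRsv")


def to_bytes(number, unit):
    """Convert ("4.50", "GiB") to bytes; a bare "0.00" carries no unit."""
    return int(float(number) * UNITS[unit or "B"])


def parse_fi_df(text):
    """Return {(kind, profile): (total_bytes, used_bytes)}."""
    parsed = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            parsed[(m["kind"], m["profile"])] = (
                to_bytes(m["total"], m["tunit"]),
                to_bytes(m["used"], m["uunit"]),
            )
    return parsed


def metadata_headroom(parsed, default_reserve=512 * 1024**2):
    """Metadata total - used, minus the reserve (assumed 512 MiB if not listed)."""
    free = sum(t - u for (kind, _), (t, u) in parsed.items() if kind == "Metadata")
    reserves = [t for (kind, _), (t, _) in parsed.items() if kind in RESERVE_KINDS]
    return free - (sum(reserves) if reserves else default_reserve)


SAMPLE = """\
Data, single: total=489.97GiB, used=427.75GiB
System, DUP: total=8.00MiB, used=60.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=5.00GiB, used=4.50GiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=512.00MiB, used=820.00KiB
"""

if __name__ == "__main__":
    print(metadata_headroom(parse_fi_df(SAMPLE)) // 1024**2, "MiB of metadata headroom")
```

   Run against the output from the first message, this reports 8 MiB
of headroom rather than the 0.5 GiB that a plain total - used suggests.
Any alarm threshold on top of this number is left to the operator.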

> Is that because there is no clear answer?
> 
> The nature of "running out of disk space" as a problem means you won't
> hit it until you've been using it for a long while, which makes this
> problem of the form "a ticking time bomb". Is there no way to make
> this operationally easier? or should only BTRFS developers use BTRFS?
>
> I'm breaking the rest out below if you are interested to try and
> understand more the problems I was having.
> 
> Thanks,
> 
> - Peter
> 
> More thoughts to illustrate the problems with the existing documentation:
> 
> Getting started contains no warning of what's different about free
> space compared with other filesystems one might be familiar with:
> 
>   https://btrfs.wiki.kernel.org/index.php/Getting_started
> 
> The sysadmin guide doesn't appear to mention free space at all:
> 
>   https://btrfs.wiki.kernel.org/index.php/SysadminGuide
> 
> The FAQ has a question:
> 
>   
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
> 
> Which starts out "Free space is a tricky concept in Btrfs" but then
> doesn't explain it very well. None of the advice given there helped in
> my case. There is talk about a mixed mode, but not how to move an
> existing filesystem to it. I'm yet to find an explanation of
> rebalancing which isn't focussed on what it means for RAID, and it
> still isn't crystal clear to me what rebalancing means for
> metadata/data on one disk. Rebalancing didn't work in my case. Must I
> construct an image of the underlying BTRFS datastructures in my head?
> I'm fine if I have to do that, but nowhere makes it clear what mental
> tools I need to tackle this.

   This FAQ entry is pretty horrible, I'm afraid. I actually started
rewriting it here to try to make it clearer what's going on. I'll try
to work on it a bit more this week and put out a better version for
the wiki.

> This link is mentioned by the above but not directly linked to by it
> (and has "are" and "is" changed compared with the above text):
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Why_are_there_so_many_ways_to_check_the_amount_of_free_space.3F
> 
> This link would have helped a bit but wasn't cross referenced by any
> of the other materials which I did find, so I couldn't find it in the
> heat of battle:
> 
> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_get_.22No_space_left_on_device.22_errors.2C_but_df_says_I.27ve_got_lots_of_space
> 
> One problem is that it isn't clear what "chunks" are. Does an operator
> of a BTRFS filesystem need to understand this in the simple case of no
> snapshots, no RAID?
> 
> How did the whole disk come to be allocated to data given that we
> hadn't used all of it? Is it because the data is using chunks
> inefficiently? How does this come to be in the simple case?

   Two ways: Write lots of data, delete it again. (This could also
happen with snapshots). Alternatively, kernels earlier than about 3.10
had a bug that massively overallocated data chunks when they didn't
need to.

   Please do feel free to add more crosslinks or text to the wiki to
make it clearer

Re: ENOSPC with mkdir and rename

On Mon, 4 Aug 2014 11:56:46 AM Clemens Eisserer wrote:

> Which doesn't protect the *average* user from running into issues like this.

No, but they need to be aware of it.

> Just because it has been discussed, doesn't mean nothing can/should be done
> about it

Indeed, and a lot of work has been done over the years on it and it's a lot 
better than it used to be. :-)

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


Re: ENOSPC with mkdir and rename

On Mon, Aug 04, 2014 at 11:31:57AM +0100, Peter Waller wrote:
> Thanks Hugo, this is the most informative e-mail yet! (more inline)
> 
> On 4 August 2014 11:22, Hugo Mills  wrote:
> >
> >  * btrfs fi show
> > - look at the total and used values. If used < total, you're OK.
> >   If used == total, then you could potentially hit ENOSPC.
> 
> Another thing which is unclear and undocumented anywhere I can find is
> what the meaning of `btrfs fi show` is.
> 
> I'm sure it is totally obvious if you are a developer or if you have
> used it for long enough. But it isn't covered in the manpage, nor in
> the oracle documentation, nor anywhere on the wiki that I could find.
> 
> When I looked at it in my problematic situation, it said "500 GiB /
> 500 GiB". That sounded fine to me because I interpreted the output as
> what fraction of which RAID devices BTRFS was using. In other words, I
> thought "Oh, BTRFS will just make use of the whole device that's
> available to it.". I thought that `btrfs fi df` was the source of
> information for how much space was free inside of that.

   That's actually pretty much accurate. The problem is that btrfs
distinguishes between "space available for data" and "space available
for metadata", and doesn't trade off one for the other once they've
been allocated. The balance operation frees up some of the allocation,
allowing the newly-freed space to be allocated again for something
else.

   All of the information about the data/metadata split, and what's
used out of that, is revealed by btrfs fi df.
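
   The two-level allocation described above can be illustrated with a
toy model. This is a minimal Python sketch under simplifying
assumptions, not btrfs's real allocator: the class `ToyFs`, its method
names and the single fixed chunk size are all hypothetical. It only
shows why a metadata write can fail with ENOSPC while data chunks still
have free space inside them, and why releasing data chunks helps.

```python
import errno


class ToyFs:
    """Toy model: device space is handed out in chunks, per kind."""

    def __init__(self, device_size, chunk_size):
        self.unallocated = device_size
        self.chunk_size = chunk_size
        self.total = {"data": 0, "metadata": 0}  # space allocated to chunks
        self.used = {"data": 0, "metadata": 0}   # space used inside chunks

    def write(self, kind, amount):
        # Allocate new chunks of this kind until the write fits.
        # Free space inside chunks of the *other* kind is never considered.
        while self.total[kind] - self.used[kind] < amount:
            if self.unallocated < self.chunk_size:
                raise OSError(errno.ENOSPC, f"cannot allocate a {kind} chunk")
            self.unallocated -= self.chunk_size
            self.total[kind] += self.chunk_size
        self.used[kind] += amount

    def delete(self, kind, amount):
        # Deleting frees space inside chunks; the chunks stay allocated.
        self.used[kind] -= amount

    def balance(self, kind):
        # Repack contents into as few chunks as possible, release the rest.
        needed = -(-self.used[kind] // self.chunk_size) * self.chunk_size
        self.unallocated += self.total[kind] - needed
        self.total[kind] = needed


if __name__ == "__main__":
    fs = ToyFs(device_size=10, chunk_size=1)
    fs.write("data", 9)
    fs.write("metadata", 1)
    fs.delete("data", 5)           # 5 units free inside data chunks
    try:
        fs.write("metadata", 1)    # but nothing unallocated is left
    except OSError as exc:
        print("metadata write failed:", exc)
    fs.balance("data")             # hand the empty data chunks back
    fs.write("metadata", 1)
    print("unallocated after balance and write:", fs.unallocated)
```

   The toy ignores that a real balance itself needs working space,
which is consistent with the "enospc errors during balance" reported
at the start of the thread.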

> >  * btrfs fi df
> > - look at metadata used vs total. If these are close to zero (on
> >   3.15+) or close to 512 MiB (on <3.15), then you are in danger of
> >   ENOSPC.
> 
> Hmm. It's unfortunate that this could indicate an amount of space
> which is free when it actually isn't.

   That's why the 512 MiB block reserve was split out of metadata --
so that you don't look at metadata and say "oh, I've got half a gig
free, that's OK".

> > - look at data used vs total. If the used is much smaller than
> >   total, you can reclaim some of the allocation with a filtered
> >   balance (btrfs balance start -dusage=5), which will then give
> >   you unallocated space again (see the btrfs fi show test).
> 
> So the filtered balance didn't help in my situation. I understand it's
> something to do with the "5" parameter. But I do not understand what
> the impact of changing this parameter is. It is something to do with a
> fraction of something, but those things are still not present in my
> mental model despite a large amount of reading. Is there an
> illustration which could clear this up?

   The 5 is 5%. So, it'll only look at chunks which are less than 5%
full. David Sterba published a patch that would balance the
(approximately N) least-used chunks, which is a considerably more
usable approach, but I don't know what happened to that one.
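
   To make the meaning of the number concrete, here is a minimal
Python sketch of the selection rule as described above: a chunk is
considered only if it is less than N% full. The function name and the
example chunk sizes are made up for illustration; this is not the
kernel's code.

```python
MIB = 1024**2


def chunks_selected(chunks, usage_percent):
    """Return the (size, used) chunks that are less than usage_percent full."""
    return [(size, used) for size, used in chunks
            if used * 100 < usage_percent * size]


if __name__ == "__main__":
    # Four hypothetical 1 GiB data chunks with different fill levels.
    chunks = [(1024 * MIB, 0), (1024 * MIB, 30 * MIB),
              (1024 * MIB, 100 * MIB), (1024 * MIB, 900 * MIB)]
    for percent in (5, 10, 90):
        picked = chunks_selected(chunks, percent)
        print(f"-dusage={percent}: {len(picked)} of {len(chunks)} chunks considered")
```

   With these made-up chunks, -dusage=5 considers two of them,
-dusage=10 three, and -dusage=90 all four. For comparison, the data
figures in the first message (427.75 GiB used of 489.97 GiB allocated)
are about 87% full on average, so a low threshold may well match few
chunks there.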

> Among other things I also got the kernel stack trace I pasted at the
> bottom of the first e-mail to this thread when I did the rebalance.

   OK, I'll go back and read that. You probably shouldn't have had it,
though. :)

> >    This FAQ entry is pretty horrible, I'm afraid. I actually started
> > rewriting it here to try to make it clearer what's going on. I'll try
> > to work on it a bit more this week and put out a better version for
> > the wiki.
> 
> This is great to hear! :)
> 
> Thanks for your response Hugo, that really cleared up a lot of mental
> model problems. I hope the documentation can be improved so that
> others can learn from my mistakes.

   I do try to work on it every so often. Note to self: win lottery,
or get cloned.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- You stay in the theatre because you're afraid of having no ---
 money? There's irony... 



Re: ENOSPC with mkdir and rename

On 4 August 2014 11:39, Hugo Mills  wrote:
>> >  * btrfs fi df
>> > - look at metadata used vs total. If these are close to zero (on
>> >   3.15+) or close to 512 MiB (on <3.15), then you are in danger of
>> >   ENOSPC.
>>
>> Hmm. It's unfortunate that this could indicate an amount of space
>> which is free when it actually isn't.
>
>    That's why the 512 MiB block reserve was split out of metadata --
> so that you don't look at metadata and say "oh, I've got half a gig
> free, that's OK".

I don't quite follow this. Is it a recent development I missed? When
was it "split out"? More recently than the software I'm using?
Otherwise I'm having difficulty parsing this.

Re: ENOSPC with mkdir and rename

Thanks Hugo, this is the most informative e-mail yet! (more inline)

On 4 August 2014 11:22, Hugo Mills  wrote:
>
>  * btrfs fi show
> - look at the total and used values. If used < total, you're OK.
>   If used == total, then you could potentially hit ENOSPC.

Another thing which is unclear and undocumented anywhere I can find is
what the meaning of `btrfs fi show` is.

I'm sure it is totally obvious if you are a developer or if you have
used it for long enough. But it isn't covered in the manpage, nor in
the oracle documentation, nor anywhere on the wiki that I could find.

When I looked at it in my problematic situation, it said "500 GiB /
500 GiB". That sounded fine to me because I interpreted the output as
what fraction of which RAID devices BTRFS was using. In other words, I
thought "Oh, BTRFS will just make use of the whole device that's
available to it.". I thought that `btrfs fi df` was the source of
information for how much space was free inside of that.

>  * btrfs fi df
> - look at metadata used vs total. If these are close to zero (on
>   3.15+) or close to 512 MiB (on <3.15), then you are in danger of
>   ENOSPC.

Hmm. It's unfortunate that this could indicate an amount of space
which is free when it actually isn't.

> - look at data used vs total. If the used is much smaller than
>   total, you can reclaim some of the allocation with a filtered
>   balance (btrfs balance start -dusage=5), which will then give
>   you unallocated space again (see the btrfs fi show test).

So the filtered balance didn't help in my situation. I understand it's
something to do with the "5" parameter. But I do not understand what
the impact of changing this parameter is. It is something to do with a
fraction of something, but those things are still not present in my
mental model despite a large amount of reading. Is there an
illustration which could clear this up?

Among other things I also got the kernel stack trace I pasted at the
bottom of the first e-mail to this thread when I did the rebalance.

>    This FAQ entry is pretty horrible, I'm afraid. I actually started
> rewriting it here to try to make it clearer what's going on. I'll try
> to work on it a bit more this week and put out a better version for
> the wiki.

This is great to hear! :)

Thanks for your response Hugo, that really cleared up a lot of mental
model problems. I hope the documentation can be improved so that
others can learn from my mistakes.

Re: ENOSPC with mkdir and rename

On Mon, 4 Aug 2014 11:09:23 AM Peter Waller wrote:

> I accept that. It's all very well if you read the BTRFS list and/or
> are a BTRFS developer. But if you're trying to work it out in the heat
> of battle, as we have sysadmins who would have to, there is a
> combination of things here that makes it unreasonable and harmful for
> production.

To be honest I'm not sure I'd suggest btrfs for production use at all at 
present, it's only recently been unmarked as experimental and to be honest I 
feel that was premature. :-(

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: ENOSPC with mkdir and rename

On 4 August 2014 11:50, Chris Samuel  wrote:
> To be honest I'm not sure I'd suggest btrfs for production use at all at
> present, it's only recently been unmarked as experimental and to be honest I
> feel that was premature. :-(

Thanks for the honest answer.

There are very positive signals out there which I had perhaps taken
too literally. I'd love to see it become ready, there are a lot of things
about BTRFS which appeal greatly. So I hope I'm helping by trying
to make it clear the problems that I encountered.

Re: ENOSPC with mkdir and rename

2014-08-04 Thread Clemens Eisserer

Hi Hugo,

>    On the 3.15+ kernels, the block reserve is split out of metadata
> and reported separately. This helps with the following process:

Thanks a lot for pointing this out, I hadn't noticed this change until now.

One thing I didn't find any information about is the overhead
introduced by mixed mode.
It would be great if you could explain it in a few sentences.

Thank you in advance, Clemens

Re: ENOSPC with mkdir and rename

On Mon, Aug 04, 2014 at 11:48:17AM +0100, Peter Waller wrote:
> On 4 August 2014 11:39, Hugo Mills  wrote:
> >> >  * btrfs fi df
> >> > - look at metadata used vs total. If these are close to zero (on
> >> >   3.15+) or close to 512 MiB (on <3.15), then you are in danger of
> >> >   ENOSPC.
> >>
> >> Hmm. It's unfortunate that this could indicate an amount of space
> >> which is free when it actually isn't.
> >
> > >    That's why the 512 MiB block reserve was split out of metadata --
> > so that you don't look at metadata and say "oh, I've got half a gig
> > free, that's OK".
> 
> I don't quite follow this. Is it a recent development I missed? When
> was it "split out"? More recently than the software I'm using?
> Otherwise I'm having difficulty parsing this.

   It's purely a change in the way that the kernel reports this info.
Before 3.15, the block reserve was included in the "Metadata" report
in btrfs fi df. From 3.15 onwards, the kernel reports the block reserve as
its own separate item in btrfs fi df (either as "BlockRsv", or
"unknown", depending on how old your userspace is). The theory is, the
change is made to make it clearer how much is used/reserved/free and
thus to make this kind of calculation simpler in the long run.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Reading Mein Kampf won't make you a Nazi. Reading Das Kapital ---  
 won't make you a communist. But most trolls started out 
with a copy of Lord of the Rings.



Re: ENOSPC with mkdir and rename

On Mon, Aug 04, 2014 at 01:04:25PM +0200, Clemens Eisserer wrote:
> Hi Hugo,
> 
> > >    On the 3.15+ kernels, the block reserve is split out of metadata
> > and reported separately. This helps with the following process:
> 
> Thanks a lot for pointing this out, I hadn't noticed this change until now.
> 
> One thing I didn't find any information about is the overhead
> introduced by mixed mode.
> It would be great if you could explain it in a few sentences.

   I don't know, I'm afraid. I don't think we've got any benchmarks on
the scale of the slowdown.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Reading Mein Kampf won't make you a Nazi. Reading Das Kapital ---  
 won't make you a communist. But most trolls started out 
with a copy of Lord of the Rings.



Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-04 Thread Martin Steigerwald

On Friday, 25 July 2014, 11:54:37, Martin Steigerwald wrote:
> On Thursday, 24 July 2014, 22:48:05, you wrote:
> > When failing to allocate space for the whole compressed extent, we'll
> > fallback to uncompressed IO, but we've forgotten to redirty the pages
> > which belong to this compressed extent, and these 'clean' pages will
> > simply skip 'submit' part and go to endio directly, at last we got data
> > corruption as we write nothing.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/inode.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3668048..8ea7610 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -709,6 +709,18 @@ retry:
> >                  unlock_extent(io_tree, async_extent->start,
> >                                async_extent->start +
> >                                async_extent->ram_size - 1);
> > +
> > +                /*
> > +                 * we need to redirty the pages if we decide to
> > +                 * fallback to uncompressed IO, otherwise we
> > +                 * will not submit these pages down to lower
> > +                 * layers.
> > +                 */
> > +                extent_range_redirty_for_io(inode,
> > +                                async_extent->start,
> > +                                async_extent->start +
> > +                                async_extent->ram_size - 1);
> > +
> >                  goto retry;
> >              }
> >              goto out_free;
> 
> I am testing this currently. So far no lockup. Let's see. It still has not
> filled the block device with trees completely after I balanced them:
> 
> Label: 'home'  uuid: […]
>         Total devices 2 FS bytes used 125.57GiB
>         devid    1 size 160.00GiB used 153.00GiB path /dev/dm-0
>         devid    2 size 160.00GiB used 153.00GiB path /dev/mapper/sata-home
> 
> I believe the lockups happen more easily if the trees occupy all of the disk
> space. Well, I will do some compiling of some KDE components, which may let
> BTRFS fill all the space again.
> 
> Does this patch mean that when it can't make enough free space in the
> (fragmented) tree, it will write uncompressed?
> 
> This would mean that one would have to defragment trees regularly to allow
> for writes to happen compressed at all times.
> 
> Well… of course still better than lockup or corruption.

So lookups so far anymore.

Tested with 3.16-rc5, 3.16-rc7, now running 3.16 final.

The /home BTRFS only got filled completely tree-wise today:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 127.35GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/dm-3

But I had KDE and kernel compile running full throttle at that time and still 
good.

Tested-By: Martin Steigerwald 


I think this should go to stable. Thanks, Liu.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix compressed write corruption on enospc

2014-08-04 Thread Martin Steigerwald

On Monday, 4 August 2014, 14:50:29, Martin Steigerwald wrote:
> So lookups so far anymore.

No lockups, of course.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

Re: btrfs on bcache

2014-08-04 Thread Fábio Pfeifer

After completely losing my filesystem twice because of this bug, I gave
up using btrfs on top of bcache (also writeback). In my case, I used to
have some subvolumes and some snapshots of these subvolumes, but not many
of them. The btrfs mantra "backup, backup and backup" saved me.

Best regards,

Fábio Pfeifer

2014-07-30 20:01 GMT-03:00 Larkin Lowrey :
> I've been running two backup servers, with 25T and 20T of data, using
> btrfs on bcache (writeback) for about 7 months. I periodically run btrfs
> scrubs and backup verifies (SHA1 hashes) and have never had a corruption
> issue.
>
> My use of btrfs is simple, though, with no subvolumes and no btrfs level
> raid. My bcache backing devices are LVM volumes that span multiple md
> raid6 arrays. So, either the bug has been fixed or my configuration is
> not susceptible.
>
> I'm running kernel 3.15.5-200.fc20.x86_64.
>
> --Larkin
>
> On 7/30/2014 5:04 PM, dptr...@arcor.de wrote:
>> Concerning http://thread.gmane.org/gmane.comp.file-systems.btrfs/31018, does
>> this "bug" still exist?
>>
>> Kernel 3.14
>> B: 2x HDD 1 TB
>> C: 1x SSD 256 GB
>>
>> # make-bcache -B /dev/sda /dev/sdb -C /dev/sdc --cache_replacement_policy=lru
>> # mkfs.btrfs -d raid1 -m raid1 -L "BTRFS_RAID" /dev/bcache0 /dev/bcache1
>>
>> I still have no "incomplete page write" messages in "dmesg | grep btrfs" and 
>> the checksums of some manually reviewed files are okay.
>>
>> Does anyone have more experience with this?
>>
>> Thanks,
>>
>> - dp
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ENOSPC with mkdir and rename

For anyone else having this problem, this article is fairly useful for
understanding disk full problems and rebalance:

http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

It actually covers the problem that I had, which is that a rebalance
can't take place because it is full.

I still am unsure what is really wrong with this whole situation. Is
it that I wasn't careful to do a rebalance when I should have been
doing? Is it that BTRFS doesn't do a rebalance automatically when it
could in principle?

It's pretty bad to end up in a situation (with spare space) where the
only way out is to add more storage, which may be impractical,
difficult or expensive.

The other thing that I still don't understand I've seen repeated in a
few places, from the above article:

"because the filesystem is only 55% full, I can ask balance to rewrite
all chunks that are more than 55% full"

Then he uses `btrfs balance start -dusage=55 /mnt/btrfs_pool1`. I
don't understand the relationship between "the FS is 55% full" and
"chunks more than 55% full". What's going on here?

I conclude that now since I have added more storage, the rebalance
won't fail and if I keep rebalancing from a cron job I won't hit this
problem again (unless the filesystem fills up very fast! what then?).
I don't know however what value to assign to `-dusage` in general for
the cron rebalance. Any hints?

Re: Work Queue for btrfs compression writes

2014-08-04 Thread Chris Mason

On 07/29/2014 11:54 PM, Nick Krause wrote:
> Hey guys,
> I am new to reading and writing kernel code. I got interested in
> writing code for btrfs as it seems to
> need more work than other file systems, and this seems, other than
> drivers, a good use of time on my part.
> I am interested in helping improve the compression of btrfs by using a
> set of threads using work queues like XFS
> or reads and keeping the page cache after reading compressed blocks, as
> these seem to be a great way to improve
> on compression performance, mostly with large partitions of compressed
> data. I am not asking you to write the code
> for me, but as I am new a little guidance and help would be greatly
> appreciated, as this seems like too much work for just a newbie.

[ Back from vacation ]

Reading through the thread, I don't see anyone mentioning that btrfs
already funnels most compression through helper threads in the kernel
workqueues.

There is also an ordering component that submits the compressed bios to
disk (for writes) in the same order they were created.  This lets us
scatter compression across N cpus, but not introduce seeks if they make
progress at different rates.

-chris


Re: ENOSPC with mkdir and rename

On Mon, Aug 04, 2014 at 02:17:02PM +0100, Peter Waller wrote:
> For anyone else having this problem, this article is fairly useful for
> understanding disk full problems and rebalance:
> 
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
> 
> It actually covers the problem that I had, which is that a rebalance
> can't take place because it is full.
> 
> I still am unsure what is really wrong with this whole situation. Is
> it that I wasn't careful to do a rebalance when I should have been
> doing? Is it that BTRFS doesn't do a rebalance automatically when it
> could in principle?

   This latter one.

   Well, actually two things: the FS should be capable of autonomously
rebalancing at low bandwidth to prevent this problem, but nobody's got
round to implementing it yet. Secondly, it should not be possible to
get into a state where you can't run the balance -- Josef spent about
three kernel revisions fixing the block reserve code to that end.
However, since about 3.14, more cases like yours have been showing up,
so I think there's been a regression. It's not very common, though. I
think we've had maybe a dozen reported instances in the last 6 months.
Someone on IRC had it just now, though, and captured a metadata image,
so at least we've got some (meta)data to work with now.

> It's pretty bad to end up in a situation (with spare space) where the
> only way out is to add more storage, which may be impractical,
> difficult or expensive.
> 
> The other thing that I still don't understand I've seen repeated in a
> few places, from the above article:
> 
> "because the filesystem is only 55% full, I can ask balance to rewrite
> all chunks that are more than 55% full"
> 
> Then he uses `btrfs balance start -dusage=55 /mnt/btrfs_pool1`. I
> don't understand the relationship between "the FS is 55% full" and
> "chunks more than 55% full". What's going on here?

   Pigeonhole principle -- if the FS is 55% full, there must be at
least one chunk <= 55% full.
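
The pigeonhole argument can be checked numerically. The following is only a minimal sketch, assuming the device is fully allocated to equal-sized chunks; the fill fractions are made up for illustration and do not come from a real filesystem. Under that assumption the overall fill level is the mean of the per-chunk fill levels, so the emptiest chunk can never be fuller than the filesystem as a whole.

```python
# Hypothetical per-chunk fill fractions for equal-sized chunks
# (illustrative figures only, not taken from a real filesystem).
chunk_usage = [0.95, 0.80, 0.30, 0.15]

# With equal-sized chunks covering the whole device, the overall fill
# level is the mean of the per-chunk fill levels.
fs_usage = sum(chunk_usage) / len(chunk_usage)

# The minimum of a set of numbers is never above their mean, so at
# least one chunk is no fuller than the filesystem as a whole.
emptiest = min(chunk_usage)
assert emptiest <= fs_usage

# Chunks that a usage filter set to the overall fill level would pick up.
candidates = [u for u in chunk_usage if u <= fs_usage]
print(f"fs {fs_usage:.0%} full, {len(candidates)} chunk(s) at or below that")
```

If part of the device is still unallocated, the argument no longer has to hold for the allocated chunks alone, which is why the assumption matters.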

> I conclude that now since I have added more storage, the rebalance
> won't fail and if I keep rebalancing from a cron job I won't hit this
> problem again (unless the filesystem fills up very fast! what then?).
> I don't know however what value to assign to `-dusage` in general for
> the cron rebalance. Any hints?

   Try with increasing values until you've moved as many chunks as you
want to. This is what David's "balance at least N chunks" patch did.
I'd suggest starting with 5, and going up in increments of 5, if you're
making it an automatic process. Stop when you reach some threshold
(like, say, 80), or when it reports that it's actually moved some
chunks.

   Doing it manually, I usually recommend 5, 10, 20, 50, 80.
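
The escalation described above could be sketched roughly as follows. This is an illustration, not a tested cron job: `escalate_balance`, `run_balance` and `fake_balance` are hypothetical names, and `run_balance` stands for whatever wrapper actually runs a filtered balance at a given usage threshold and reports how many chunks it relocated.

```python
from typing import Callable, Iterable, Optional


def escalate_balance(run_balance: Callable[[int], int],
                     thresholds: Iterable[int] = range(5, 85, 5)) -> Optional[int]:
    """Try increasing usage thresholds; stop at the first that moves chunks.

    run_balance(n) is a caller-supplied (hypothetical) hook that runs a
    filtered balance with usage threshold n and returns the number of
    chunks it relocated. Returns the threshold that moved chunks, or
    None if no threshold up to the limit moved anything.
    """
    for usage in thresholds:
        if run_balance(usage) > 0:
            return usage
    return None


# Stand-in for a real balance: pretend chunks only move once the
# threshold reaches 20. Replace with a wrapper around the real command.
def fake_balance(usage: int) -> int:
    return 3 if usage >= 20 else 0


print(escalate_balance(fake_balance))
```

In a real job, `run_balance` would wrap `btrfs balance start -dusage=N` on the mount point and extract the relocated-chunk count from its output; that parsing is left out here because the output format is not shown in this thread.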

   Hugo.


-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Well, you don't get to be a kernel hacker simply by looking ---   
good in Speedos. -- Rusty Russell


Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn

On 2014-08-04 09:17, Peter Waller wrote:
> For anyone else having this problem, this article is fairly useful for
> understanding disk full problems and rebalance:
> 
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
> 
> It actually covers the problem that I had, which is that a rebalance
> can't take place because it is full.
> 
> I still am unsure what is really wrong with this whole situation. Is
> it that I wasn't careful to do a rebalance when I should have been
> doing? Is it that BTRFS doesn't do a rebalance automatically when it
> could in principle?
> 
> It's pretty bad to end up in a situation (with spare space) where the
> only way out is to add more storage, which may be impractical,
> difficult or expensive.
I really disagree with the statement that adding more storage is
difficult or expensive, all you need to do is plug in a 2G USB flash
drive, or allocate a ramdisk, and add the device to the filesystem only
long enough to do a full balance.
> 
> The other thing that I still don't understand I've seen repeated in a
> few places, from the above article:
> 
> "because the filesystem is only 55% full, I can ask balance to rewrite
> all chunks that are more than 55% full"
> 
> Then he uses `btrfs balance start -dusage=55 /mnt/btrfs_pool1`. I
> don't understand the relationship between "the FS is 55% full" and
> "chunks more than 55% full". What's going on here?
To understand this, you have to understand that BTRFS uses a two-level
allocation scheme.  At the top level, you have chunks, which are
contiguous regions of the disk that get used for storing a specific
block type.  Data chunks default to 1G in size; metadata chunks
default to 256M.  When a filesystem is created, you get the
minimum number of chunks of each type based on the replication profiles
chosen for each chunk type; with no extra options, this means 1 data
chunk and 2 metadata chunks for a single disk filesystem.  Within each
chunk, BTRFS then allocates and frees individual blocks on demand; these
blocks are the analogue of blocks in most other filesystems.  When there
are no free blocks in any chunks of a given type, BTRFS allocates
new chunks of that type based on the replication profile.  Unlike blocks,
however, chunks aren't freed automatically (there are good reasons for
this behavior, but they are kind of long to explain here).  This is where
balance comes in: it takes all of the blocks in the filesystem and
sends them back through the block allocator.  This usually causes all of
the free blocks to end up in a single chunk, and frees the unneeded chunks.

When someone talks about a chunk being x% full, they mean that x% of the
space in that chunk is used by allocated blocks.  Talking about how full
the filesystem is can get tricky because of the replication profiles,
but the usual consensus is to treat that as the percentage of the
filesystem that contains blocks that are being used.

It should say LESS than 55% full in the various articles, as the
-dusage=x option tells balance to only consider chunks that are less
than x% full for balancing.  In general, if your filesystem is totally
full, you should use numbers starting with 0 and work your way up
from there.  You may even get lucky, and using -dusage=0 -musage=0 may
free up enough chunks that you don't need to add more storage.
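
To make the effect of the usage filter concrete, here is a toy model of which chunks a given threshold would pick up. It is a sketch under stated assumptions: the fill fractions are invented, `selected` is a hypothetical helper, and the rule that completely empty chunks are always eligible is an assumption of this model so that a filter of 0 selects them; it does not claim to reproduce the kernel's exact boundary handling.

```python
# Used fraction of each allocated data chunk (invented figures).
data_chunks = [0.0, 0.0, 0.02, 0.40, 0.90, 1.0]


def selected(chunks, usage_percent):
    """Return the chunks a usage filter would consider.

    Follows the description above: chunks less than usage_percent full.
    Completely empty chunks are treated as always eligible, so a filter
    of 0 picks them up (an assumption of this toy model).
    """
    limit = usage_percent / 100
    return [c for c in chunks if c == 0.0 or c < limit]


# Low thresholds touch few chunks; higher ones rewrite more data.
for pct in (0, 5, 55):
    print(pct, selected(data_chunks, pct))
```

Starting at 0 only touches chunks with nothing in them, so no data needs to be copied; each higher threshold rewrites more data, which is why working upwards is the gentler approach on a full filesystem.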
> 
> I conclude that now since I have added more storage, the rebalance
> won't fail and if I keep rebalancing from a cron job I won't hit this
> problem again (unless the filesystem fills up very fast! what then?).
> I don't know however what value to assign to `-dusage` in general for
> the cron rebalance. Any hints?
I've found that something between 25 and 50 tends to do well, much
outside of that range and you start to get diminishing returns.  The
exact value tends to be more personal preference, I use 25 on most of my
systems, because I don't like saturating the disks with I/O for very
long.  Do make sure however to add -musage=x as well, metadata also
should be balanced (especially if you have very large numbers of small
files).





Re: ENOSPC with mkdir and rename

On 4 August 2014 15:02, Austin S Hemmelgarn  wrote:
> I really disagree with the statement that adding more storage is
> difficult or expensive, all you need to do is plug in a 2G USB flash
> drive, or allocate a ramdisk, and add the device to the filesystem only
> long enough to do a full balance.

What if the machine is a server in a datacenter you don't have
physical access to and the problem is an emergency preventing your
users from being able to get work done?

What happens if you use a RAM disk and there is a power failure?

Thanks for the other explanations and advice also,

- Peter

Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn

On 2014-08-04 10:11, Peter Waller wrote:
> On 4 August 2014 15:02, Austin S Hemmelgarn  wrote:
>> I really disagree with the statement that adding more storage is
>> difficult or expensive, all you need to do is plug in a 2G USB flash
>> drive, or allocate a ramdisk, and add the device to the filesystem only
>> long enough to do a full balance.
> 
> What if the machine is a server in a datacenter you don't have
> physical access to and the problem is an emergency preventing your
> users from being able to get work done?
> 
> What happens if you use a RAM disk and there is a power failure?
> 
I'm not saying that either option is a perfect solution.  In fact, the
only reason that I even mentioned the ramdisk is because I have had good
success with that on my laptop, but then laptops essentially have a
built-in UPS.  I personally wouldn't use a ramdisk except as a last
resort if you don't have some sort of UPS or redundancy in the PSU.





Re: ENOSPC with mkdir and rename

2014-08-04 Thread Russell Coker

On Mon, 4 Aug 2014 14:17:02 Peter Waller wrote:
> For anyone else having this problem, this article is fairly useful for
> understanding disk full problems and rebalance:
> 
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
> 
> It actually covers the problem that I had, which is that a rebalance
> can't take place because it is full.
> 
> I still am unsure what is really wrong with this whole situation. Is
> it that I wasn't careful to do a rebalance when I should have been
> doing? Is it that BTRFS doesn't do a rebalance automatically when it
> could in principle?

Yes and yes.  The fact that BTRFS can't avoid getting into such situations and 
can't recover when it does are both bugs in BTRFS.  The fact that you didn't 
run a balance to prevent this is due to not being careful enough with a 
filesystem that's still in a development stage.

> It's pretty bad to end up in a situation (with spare space) where the
> only way out is to add more storage, which may be impractical,
> difficult or expensive.

Absolutely.

> I conclude that now since I have added more storage, the rebalance
> won't fail and if I keep rebalancing from a cron job I won't hit this
> problem again

Yes.

> (unless the filesystem fills up very fast! what then?).
> I don't know however what value to assign to `-dusage` in general for
> the cron rebalance. Any hints?

If you regularly run a scrub with options such as "-dusage=50 -musage=10" then 
the amount of free space in metadata chunks will tend to be a lot greater than 
that in data chunks.

Another option I've considered is to write a program that creates millions of 
files with 1000 byte random file names.  After creating a filesystem I could 
run that program to cause a sufficient number of metadata chunks to be 
allocated and then remove the subvol containing all those files (which 
incidentally is a lot faster than "rm -rf").

Another thing I've considered is making a filesystem for a file server with a 
RAID-1 array of SSDs and running the above program to allocate all chunks for 
metadata.  Then when the SSDs are totally assigned to metadata I would add a 
pair of SATA disks for data.  A filesystem with all metadata on SSD and all 
data on SATA disks should give great performance as well as having lots of 
space.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Scan not being performed properly on boot

2014-08-04 Thread Peter Roberts


On 04/08/2014 04:31, Russell Coker wrote:
> What is GRUB (or your boot loader) giving as parameters to the kernel?
> What error messages appear on screen? Sometimes it's helpful to
> photograph the screen and put the picture on a web server to help
> people diagnose the problem.

Here's a screenshot: http://imgur.com/8pQgG8g

Making changes to /etc/fstab didn't seem to work, and nor did manually
adding device=/dev/sda1,device... in the GRUB config itself. What has
fixed the problem, though it seems pretty horrible, is to specify
root=/dev/sda1 instead of UUID=


> That sounds like a problem with the Ubuntu initrd, probably filing an
> Ubuntu bug report would be the best thing to do. Is BTRFS supported in
> that version of Ubuntu? But just changing your boot configuration to
> use /dev/sdx is probably the best option.

I'll file a bug report but I'm not sure whether this is a GRUB or an
initrd / initramfs problem. Getting a persistent fix required a bodge in
/etc/grub.d/10-linux


As for support, I don't think it is considered stable in 14.04, but I
haven't had to do anything special to use it and used the official
installer to get where I am (converting the single disk FS to a RAID1
after).


In any case, thanks for the help :)

Re: ENOSPC with mkdir and rename

2014-08-04 Thread Mitch Harder

On Mon, Aug 4, 2014 at 9:47 AM, Russell Coker  wrote:
> If you regularly run a scrub with options such as "-dusage=50 -musage=10" then
> the amount of free space in metadata chunks will tend to be a lot greater than
> that in data chunks.
>

Just to clarify for posterity, I'm pretty sure you meant 'balance'
with "-dusage=50 -musage=10" instead of 'scrub'.

Re: Unrecoverable errors when the btrfs file system was modified outside the running OS

2014-08-04 Thread Chris Murphy

On Aug 4, 2014, at 12:29 AM, rocwhite168  wrote:

> Hello,
> 
> I just had a very frustrating experience with btrfs, which I was only
> able to resolve by rolling back to ext4 using the subvol btrfs-convert
> created. The same type of situation occurred before when I was using
> the ext file system and the result was far less disastrous.

The recoverability of a filesystem after it's been used simultaneously by two 
computers can't be a useful metric. This is so highly non-deterministic I just 
don't buy the one off comparison of two filesystems' survival rates.

This is sort of in the realm of if you're going to slam doors, make really 
certain first there aren't any fingers near the door frame.

Chris Murphy


Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn

On 2014-08-04 06:31, Peter Waller wrote:
> Thanks Hugo, this is the most informative e-mail yet! (more inline)
> 
> On 4 August 2014 11:22, Hugo Mills  wrote:
>>
>>  * btrfs fi show
>> - look at the total and used values. If used < total, you're OK.
>>   If used == total, then you could potentially hit ENOSPC.
> 
> Another thing which is unclear and undocumented anywhere I can find is
> what the meaning of `btrfs fi show` is.
> 
> I'm sure it is totally obvious if you are a developer or if you have
> used it for long enough. But it isn't covered in the manpage, nor in
> the oracle documentation, nor anywhere on the wiki that I could find.
> 
You didn't look very hard then, because there is information in the
manpage (oh wait, you mentioned Oracle, you're probably using RHEL or
CentOS, which are the last thing you should be using if you want to use
stuff like BTRFS that is under heavy development), and it is documented
on the wiki.
> When I looked at it in my problematic situation, it said "500 GiB /
> 500 GiB". That sounded fine to me because I interpreted the output as
> what fraction of which RAID devices BTRFS was using. In other words, I
> thought "Oh, BTRFS will just make use of the whole device that's
> available to it.". I thought that `btrfs fi df` was the source of
> information for how much space was free inside of that.
> 
>>  * btrfs fi df
>> - look at metadata used vs total. If these are close to zero (on
>>   3.15+) or close to 512 MiB (on <3.15), then you are in danger of
>>   ENOSPC.
> 
> Hmm. It's unfortunate that this could indicate an amount of space
> which is free when it actually isn't.
That depends on what you mean by 'free'.
> 
>> - look at data used vs total. If the used is much smaller than
>>   total, you can reclaim some of the allocation with a filtered
>>   balance (btrfs balance start -dusage=5), which will then give
>>   you unallocated space again (see the btrfs fi show test).
> 
> So the filtered balance didn't help in my situation. I understand it's
> something to do with the "5" parameter. But I do not understand what
> the impact of changing this parameter is. It is something to do with a
> fraction of something, but those things are still not present in my
> mental model despite a large amount of reading. Is there an
> illustration which could clear this up?
> 
Think of each chunk like a box, and each block as a block, and that you
have two different types of block (data and metadata) and two different
types of box (also data and metadata). The data boxes are four times the
size of the metadata boxes, and they all have to fit in one really big
container (the device itself).  You can only put data blocks in the data
boxes, and you can only put metadata blocks in metadata boxes.  Say that
in total, you can fit 128 data boxes in the large container, or you can
replace one data box with up to four metadata boxes.  Even though you
may only have a few blocks in a given box, the box still takes up the
same amount of space in the larger container.  Thus, it's possible to
have only a few blocks stored, but not be able to add any more boxes to
the larger container.  A balance operation is essentially the equivalent
of taking all of the blocks of a given type, and fitting them into the
smallest number of boxes possible.
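
The box analogy can be turned into a small toy model. This is only a sketch with arbitrary numbers: `BOX_CAPACITY`, `CONTAINER_SLOTS`, `can_add_box` and `balance` are hypothetical names for this illustration, it handles a single box type only, and it says nothing about how the real allocator works internally.

```python
BOX_CAPACITY = 100     # blocks that fit in one box (arbitrary toy number)
CONTAINER_SLOTS = 4    # boxes the container can hold (arbitrary toy number)

# Four boxes are allocated, each holding only a few blocks.
boxes = [10, 5, 20, 15]


def can_add_box(boxes):
    """A new box needs a free slot, however empty the existing boxes are."""
    return len(boxes) < CONTAINER_SLOTS


def balance(boxes):
    """Repack all blocks into the smallest number of boxes possible."""
    full, rest = divmod(sum(boxes), BOX_CAPACITY)
    return [BOX_CAPACITY] * full + ([rest] if rest else [])


print(can_add_box(boxes))            # container is full of mostly empty boxes
packed = balance(boxes)
print(packed, can_add_box(packed))   # after repacking, slots are free again
```

The first check fails even though the boxes are half empty on aggregate, which is the "space free but ENOSPC" situation; after repacking, the same blocks occupy one box and the remaining slots can be handed to either box type.
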
> Among other things I also got the kernel stack trace I pasted at the
> bottom of the first e-mail to this thread when I did the rebalance.
> 
>>This FAQ entry is pretty horrible, I'm afraid. I actually started
>> rewriting it here to try to make it clearer what's going on. I'll try
>> to work on it a bit more this week and put out a better version for
>> the wiki.
> 
> This is great to hear! :)
> 
> Thanks for your response Hugo, that really cleared up a lot of mental
> model problems. I hope the documentation can be improved so that
> others can learn from my mistakes.





[PATCH] Btrfs: make btrfs_search_forward return with nodes unlocked

2014-08-04 Thread Filipe Manana

None of the uses of btrfs_search_forward() need to have the path
nodes (level >= 1) read locked, only the leaf needs to be locked
while the caller processes it. Therefore make it return a path
with all nodes unlocked, except for the leaf.

This change is motivated by the observation that during a file
fsync we repeatedly call btrfs_search_forward() and process the
returned leaf while upper nodes of the returned path (level >= 1)
are read locked, which unnecessarily blocks other tasks that want
to write to the same fs/subvol btree.
Therefore instead of modifying the fsync code to unlock all nodes
with level >= 1 immediately after calling btrfs_search_forward(),
change btrfs_search_forward() to do it, so that it benefits all
callers.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/ctree.c     | 11 +++++++----
 fs/btrfs/ioctl.c     |  5 -----
 fs/btrfs/tree-log.c  |  3 ---
 fs/btrfs/uuid-tree.c |  1 -
 fs/btrfs/volumes.c   |  2 --
 5 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 8ca6761..993d81b 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5144,8 +5144,9 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
u32 nritems;
int level;
int ret = 1;
+   int keep_locks = path->keep_locks;
 
-   WARN_ON(!path->keep_locks);
+   path->keep_locks = 1;
 again:
cur = btrfs_read_lock_root_node(root);
level = btrfs_header_level(cur);
@@ -5209,7 +5210,6 @@ find_next_key:
path->slots[level] = slot;
if (level == path->lowest_level) {
ret = 0;
-   unlock_up(path, level, 1, 0, NULL);
goto out;
}
btrfs_set_path_blocking(path);
@@ -5224,9 +5224,12 @@ find_next_key:
btrfs_clear_path_blocking(path, NULL, 0);
}
 out:
-   if (ret == 0)
+   path->keep_locks = keep_locks;
+   if (ret == 0) {
+   btrfs_unlock_up_safe(path, path->lowest_level + 1);
+   btrfs_set_path_blocking(path);
memcpy(min_key, &found_key, sizeof(found_key));
-   btrfs_set_path_blocking(path);
+   }
return ret;
 }
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ef2e073..d490abd 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -936,12 +936,9 @@ static int find_new_extents(struct btrfs_root *root,
min_key.offset = *off;
 
while (1) {
-   path->keep_locks = 1;
ret = btrfs_search_forward(root, &min_key, path, newer_than);
if (ret != 0)
goto none;
-   path->keep_locks = 0;
-   btrfs_unlock_up_safe(path, 1);
 process_slot:
if (min_key.objectid != ino)
goto none;
@@ -2083,8 +2080,6 @@ static noinline int search_ioctl(struct inode *inode,
key.type = sk->min_type;
key.offset = sk->min_offset;
 
-   path->keep_locks = 1;
-
while (1) {
ret = btrfs_search_forward(root, &key, path, sk->min_transid);
if (ret != 0) {
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6e0fa17..df332dd 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2981,8 +2981,6 @@ static noinline int log_dir_items(struct btrfs_trans_handle *trans,
min_key.type = key_type;
min_key.offset = min_offset;
 
-   path->keep_locks = 1;
-
ret = btrfs_search_forward(root, &min_key, path, trans->transid);
 
/*
@@ -3950,7 +3948,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
err = ret;
goto out_unlock;
}
-   path->keep_locks = 1;
 
while (1) {
ins_nr = 0;
diff --git a/fs/btrfs/uuid-tree.c b/fs/btrfs/uuid-tree.c
index f6a4c03..7782829 100644
--- a/fs/btrfs/uuid-tree.c
+++ b/fs/btrfs/uuid-tree.c
@@ -279,7 +279,6 @@ int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info,
key.offset = 0;
 
 again_search_slot:
-   path->keep_locks = 1;
ret = btrfs_search_forward(root, &key, path, 0);
if (ret) {
if (ret > 0)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0daf748..73e4d30 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3629,8 +3629,6 @@ static int btrfs_uuid_scan_kthread(void *data)
max_key.type = BTRFS_ROOT_ITEM_KEY;
max_key.offset = (u64)-1;
 
-   path->keep_locks = 1;
-
while (1) {
ret = btrfs_search_forward(root, &key, path, 0);
if (ret) {
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/1] Btrfs: fix sparse warning

2014-08-04 Thread Zach Brown

On Sat, Aug 02, 2014 at 02:24:49PM +0200, Fabian Frederick wrote:
> On Thu, 17 Jul 2014 12:01:52 -0700
> Zach Brown  wrote:
> 
> > > > > @@ -515,7 +515,8 @@ static int write_buf(struct file *filp, const void *buf,
> > > > > u32 len, loff_t *off)
> > > >
> > > > Though this probably wants to be rewritten in terms of kernel_write().
> > > > That'd give an opportunity to get rid of the sctx->send_off and have it
> > > > use f_pos in the filp.
> > > 
> > > Do you mean directly call kernel_write from send_cmd/send_header ?
> > > I guess that loop around vfs_write in write_buf is there for something ...
> > 
> > write_buf() could still exist to iterate over the buffer in the case of
> > partial writes but it doesn't need to muck around with set_fs() and
> > forcing casts.
> > 
> > - z
> 
> Hello Zach,
> 
>   Here's an untested patch which

Try testing it.  It's easy with virtualization and xfstests.

You'll find that sending to a file fails because each individual file
write call that makes up a send starts at offset 0 -- at the start of
the file.

Getting this right means getting the semantics around updating the send
descriptor's f_pos right.  It requires having a bit of a think about send
semantics and f_pos update locking.
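
As a userspace analogy only (not the kernel code under discussion), the
failure mode and the shape of a fix can be sketched like this.  The name
write_all_at() is hypothetical; the point is that the caller owns one
offset that advances across calls, and that partial writes are retried
from where they stopped rather than from offset 0.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all of buf at *off and advance *off.
 * Returns 0 on success or a negative errno value on failure.
 * Because *off persists across calls, consecutive calls append
 * instead of each one starting again at offset 0.
 */
static int write_all_at(int fd, const void *buf, size_t len, off_t *off)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, *off);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -errno;
		}
		if (n == 0)
			return -EIO;	/* no progress, give up */

		/* Partial write: advance past what was written. */
		p += n;
		len -= (size_t)n;
		*off += n;
	}
	return 0;
}
```

In the kernel the same bookkeeping would live in the file's f_pos, which
is where the locking question above comes in; this sketch sidesteps it by
keeping the offset in a caller-owned variable.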

- z

Re: [PATCH 03/12] btrfs: handle errors from reading the quota tree root

2014-08-04 Thread Zach Brown

On Fri, Aug 01, 2014 at 06:12:37PM -0500, Eric Sandeen wrote:
> Reading the quota tree root may fail with ENOENT
> if there is no quota, which is fine, but the code was
> ignoring every other error as well, which is not fine.

Kinda makes you want to write a test that would have caught this.

Kinda.

Also, if you're still keen to iterate on this series, it looks like this
pattern is copied and pasted a few times in open_ctree().  With
temporary root pointers for each block, for some reason.  A little
helper function could take a bite out of open_ctree().

- z

Re: [PATCH 03/12] btrfs: handle errors from reading the quota tree root

2014-08-04 Thread Eric Sandeen

On 8/4/14, 1:35 PM, Zach Brown wrote:
> On Fri, Aug 01, 2014 at 06:12:37PM -0500, Eric Sandeen wrote:
>> Reading the quota tree root may fail with ENOENT
>> if there is no quota, which is fine, but the code was
>> ignoring every other error as well, which is not fine.
> 
> Kinda makes you want to write a test that would have caught this.
> 
> Kinda.

/me looks at ground, shuffles feet ...
 
> Also, if you're still keen to iterate on this series, it looks like this
> pattern is copied and pasted a few times in open_ctree().  With
> temporary root pointers for each block, for some reason.  A little
> helper function could take a bite out of open_ctree().

Hm, the uuid tree is roughly similar, but not exactly.  I think those
are the only 2 "optional" roots (uuid because it'll get regenerated).

I'm guessing the temporary root pointer is so we don't ever assign a
PTR_ERR to the root in fs_info?  

-Eric

Re: [PATCH 03/12] btrfs: handle errors from reading the quota tree root

2014-08-04 Thread Zach Brown

On Mon, Aug 04, 2014 at 01:42:23PM -0500, Eric Sandeen wrote:
> On 8/4/14, 1:35 PM, Zach Brown wrote:
> > On Fri, Aug 01, 2014 at 06:12:37PM -0500, Eric Sandeen wrote:
> >> Reading the quota tree root may fail with ENOENT
> >> if there is no quota, which is fine, but the code was
> >> ignoring every other error as well, which is not fine.
> > 
> > Kinda makes you want to write a test that would have caught this.
> > 
> > Kinda.
> 
> /me looks at ground, shuffles feet ...
>  
> > Also, if you're still keen to iterate on this series, it looks like this
> > pattern is copied and pasted a few times in open_ctree().  With
> > temporary root pointers for each block, for some reason.  A little
> > helper function could take a bite out of open_ctree().
> 
> Hm, the uuid tree is roughly similar, but not exactly.  I think those
> are the only 2 "optional" roots (uuid because it'll get regenerated).
> 
> I'm guessing the temporary root pointer is so we don't ever assign a
> PTR_ERR to the root in fs_info?  

It took me a while to see what you meant.

Yeah, using a temporary root makes sense.  Using a different one for
each block makes less sense.

	a = f(A);
	if (a)
		goto out;
	info->a = a;

	b = f(B);
	if (b)
		goto out;
	info->b = b;

vs.

	r = f(A);
	if (r)
		goto out;
	info->a = r;

	r = f(B);
	if (r)
		goto out;
	info->b = r;
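
The "little helper" idea from earlier in the thread could look roughly
like the following.  This is a self-contained userspace sketch, not the
actual open_ctree() code: struct root, struct info, read_optional_root()
and fake_read_root() are all hypothetical, and the ERR_PTR helpers are
simplified stand-ins for the kernel's.  It assumes every optional root
treats -ENOENT the same way, which may not hold for the uuid tree.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct root { int id; };			/* hypothetical */
struct info {					/* hypothetical */
	struct root *quota_root;
	struct root *uuid_root;
};

typedef struct root *(*read_root_fn)(int id);

/*
 * Read an optional root into *slot.  -ENOENT means "not present" and
 * is not an error; any other failure is passed back to the caller.
 * *slot is written only on success, so it never holds an ERR_PTR.
 */
static int read_optional_root(read_root_fn read_root, int id,
			      struct root **slot)
{
	struct root *r = read_root(id);

	if (IS_ERR(r)) {
		long err = PTR_ERR(r);

		return err == -ENOENT ? 0 : (int)err;
	}
	*slot = r;
	return 0;
}

/* Hypothetical reader, present only to exercise the helper. */
static struct root demo_root = { 7 };
static struct root *fake_read_root(int id)
{
	if (id == 7)
		return &demo_root;
	if (id == 8)
		return ERR_PTR(-ENOENT);
	return ERR_PTR(-EIO);
}
```

With something like this, each optional-root block would shrink to one
call plus an error check, and the single temporary lives inside the
helper.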

- z

Re: [PATCH 03/12] btrfs: handle errors from reading the quota tree root

2014-08-04 Thread Eric Sandeen

On 8/4/14, 1:51 PM, Zach Brown wrote:
> On Mon, Aug 04, 2014 at 01:42:23PM -0500, Eric Sandeen wrote:
>> On 8/4/14, 1:35 PM, Zach Brown wrote:
>>> On Fri, Aug 01, 2014 at 06:12:37PM -0500, Eric Sandeen wrote:
>>>> Reading the quota tree root may fail with ENOENT
>>>> if there is no quota, which is fine, but the code was
>>>> ignoring every other error as well, which is not fine.
>>>
>>> Kinda makes you want to write a test that would have caught this.
>>>
>>> Kinda.
>>
>> /me looks at ground, shuffles feet ...
>>  
>>> Also, if you're still keen to iterate on this series, it looks like this
>>> pattern is copied and pasted a few times in open_ctree().  With
>>> temporary root pointers for each block, for some reason.  A little
>>> helper function could take a bite out of open_ctree().
>>
>> Hm, the uuid tree is roughly similar, but not exactly.  I think those
>> are the only 2 "optional" roots (uuid because it'll get regenerated).
>>
>> I'm guessing the temporary root pointer is so we don't ever assign a
>> PTR_ERR to the root in fs_info?  
> 
> It took me a while to see what you meant.
> 
> Yeah, using a temporary root makes sense.  Using a different one for
> each block makes less sense.
> 



> - z
> 

Yeah, fair enough, I thought about that after I hit send ;)
I could send a V2 of patch 11/12 to do that w/o needing to redo
the series too much.  :)  I'll see if there are any other comments.

Thanks!
-Eric

btrfs sub list output

   The output options of btrfs sub list seem a bit... arbitrary?
awkward? unhelpful?

   Here's my problem: Given a path at some arbitrary point into a
mounted btrfs (sub)volume, find all subvolumes visible under that
point, and identify their absolute path names.

   My test btrfs filesystem looks like this:


   root
   home
   test
      subdir (a subdir, not a subvol)
         foo
      bar

# mount -osubvol=test /dev/sda2 /mnt

so I want to be able to go from that configuration (knowing nothing
about the mountpoint), and map (both ways) between UUID and the
(e.g.) /mnt/foo path. But:

# btrfs sub list -oau /mnt  # and
# btrfs sub list -au /mnt
ID 259 gen 549115 top level 5 uuid 6a50af8d-83dd-9943-b5b7-4f8b0a7f3fa7 path <FS_TREE>/root
ID 260 gen 548768 top level 5 uuid c73d4296-7c30-074e-b647-e6e83025a125 path <FS_TREE>/home
ID 11826 gen 549045 top level 272 uuid f78aed0d-db5a-a342-b422-87abfa18efe0 path test/subdir/foo
ID 11827 gen 549046 top level 272 uuid a5cea7ae-3fdd-c247-8905-40cbb7f39017 path test/bar

   Here, I can easily filter out the subvols I want (they're the ones
without <FS_TREE>), but I have to know the mountpoint (which I can
find) and the subvol= parameter (which I think I can't).
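
   The filtering step alone could be sketched as below. This assumes
the -a output marks paths outside the given subvolume with a literal
"<FS_TREE>/" prefix and that each entry arrives as a single unwrapped
line; struct subvol_entry and parse_sub_list_line() are hypothetical
names. It does nothing about the mountpoint or subvol= problem.

```c
#include <stdbool.h>
#include <string.h>

#define FS_TREE_PREFIX "<FS_TREE>/"

struct subvol_entry {			/* hypothetical */
	char uuid[37];			/* 36 characters plus NUL */
	char path[4096];
	bool outside;			/* path carried the <FS_TREE>/ prefix */
};

/*
 * Parse one line of `btrfs sub list -au` output.
 * Returns 0 on success, -1 if the line does not look like an entry.
 */
static int parse_sub_list_line(const char *line, struct subvol_entry *e)
{
	const char *u = strstr(line, " uuid ");
	const char *p;
	size_t n;

	if (!u)
		return -1;
	u += strlen(" uuid ");
	n = strcspn(u, " \t\n");
	if (n != 36)
		return -1;
	memcpy(e->uuid, u, 36);
	e->uuid[36] = '\0';

	/* The path field follows the uuid and runs to the end of the line. */
	p = strstr(u + n, " path ");
	if (!p)
		return -1;
	p += strlen(" path ");

	e->outside = strncmp(p, FS_TREE_PREFIX, strlen(FS_TREE_PREFIX)) == 0;
	if (e->outside)
		p += strlen(FS_TREE_PREFIX);

	n = strcspn(p, "\n");
	if (n >= sizeof(e->path))
		return -1;
	memcpy(e->path, p, n);
	e->path[n] = '\0';
	return 0;
}
```

   Entries with outside == false are the ones under the given path;
their path is still relative to the top of the filesystem, not to the
mountpoint.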

# btrfs sub list -ou /mnt/subdir/
ID 11826 gen 549045 top level 272 uuid f78aed0d-db5a-a342-b422-87abfa18efe0 path test/subdir/foo
ID 11827 gen 549046 top level 272 uuid a5cea7ae-3fdd-c247-8905-40cbb7f39017 path test/bar

   This filters the subvols correctly, but otherwise has the same
drawbacks as above.

# btrfs sub list -u /mnt
ID 259 gen 549114 top level 5 uuid 6a50af8d-83dd-9943-b5b7-4f8b0a7f3fa7 path root
ID 260 gen 548768 top level 5 uuid c73d4296-7c30-074e-b647-e6e83025a125 path home
ID 11826 gen 549045 top level 272 uuid f78aed0d-db5a-a342-b422-87abfa18efe0 path subdir/foo
ID 11827 gen 549046 top level 272 uuid a5cea7ae-3fdd-c247-8905-40cbb7f39017 path bar

   Here, I get the paths relative to the mountpoint, which is what I
want, but mixed up with paths outside the mountpoint as well, which I
don't, and have no way of distinguishing the two classes without
making a separate call to btrfs sub list -a and filtering out the
UUIDs with <FS_TREE> in the name.

   Incidentally, if the parameter to btrfs sub list is inside another
subvolume within the mount, then the "relative" effects are all
relative to that subvol, not to the mountpoint.

   I'm finding it hard to work out how the variants with -o or -a (or
both) are actually helpful at all, now that I come to use them in more
than a vague human-readable form. Have I missed something, or is this
actually an awkward furball of confusing and mostly unhelpful options?
Are these options actually doing what the original author intended? If
so, what was that intent?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Summoning his Cosmic Powers, and glowing slightly ---
from his toes... 



Re: ENOSPC with mkdir and rename