Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2015-01-09 Thread Duncan
Martin Steigerwald posted on Thu, 08 Jan 2015 11:18:40 +0100 as excerpted:

 Duncan, I *did* file a bug.

I think you misunderstood me... I understood that and actually said as 
much:

 But the recommendation is to file the bugzilla report precisely so it
 does /not/ get lost, and you've done that, so... you've done your part
 there and now comes the enforced patience bit of waiting [...]

My point was simply that, based on the wiki recommendation and the earlier 
thread mentioned on the wiki, the reason /why/ a bugzilla report is 
preferred over simply reporting it here is that the devs tend to pick a 
bug and spend some time digging into it, during which they don't look too 
closely at other reports here, so those can get lost, while the bugzilla 
report won't.

The implication is that a failure to respond, whether to a thread here or 
to a bug report there, is because they're busy working on other bugs; a 
failure to respond immediately isn't to be seen as ignoring the problem, 
and is in fact to be expected.

IOW, I was saying that now that the bug is filed, you can sit back and 
wait with reasonable assurance that it'll be processed, as you've done 
your bit and it's now up to them to prioritize and process it in due time.  
That's a good thing, and I was commending you for taking the time to file 
the bug as well. =:^)

... While at the same time commiserating a bit, since I know from 
experience how hard that wait for a dev reply can be, and that the wait 
is a sort of enforced patience, since at least for a non-coder like me 
there's not much else one can do. =:^(

That said, now that I reread, I can see how what I wrote could appear to 
be contingent on an assumed /future/ filing of a bug, and that it wasn't 
as clear as I intended that I was commending you for filing it already, 
and basically saying, Be patient, I know how hard it can be to wait.

Words!  They be tricky! =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: Regular rebalancing should be unnecessary? (Was: Re: BTRFS free space handling still needs more work: Hangs again)

2015-01-09 Thread Martin Steigerwald
Am Freitag, 9. Januar 2015, 11:04:32 schrieb Peter Waller:
 Apologies to those receiving this twice.
 
 On 27 December 2014 at 09:30, Hugo Mills h...@carfax.org.uk wrote:
  Now, since you're seeing lockups when the space on your disks is
  
  all allocated I'd say that's a bug. However, you're the *only* person
  
  who's reported this as a regular occurrence. Does this happen with all
  filesystems you have, or just this one?
 
 I have experienced machine lockups on four separate cloud machines,
 and reported it in a few venues. I think I even reported it on this
 list in the past but I can't find that right now. Here's a bug report
 to Ubuntu-Kernel:
 
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711
 
 Regularly rebalancing the machines and ensuring they have 10% free
 disk (filesystem) space, I don't experience this. Yet I read in this
 thread that regular rebalancing shouldn't be necessary?
 
 FWIW, I'm trying to sell BTRFS to my colleagues, and they view it as a
 stupid filesystem, like the bad old Windows days when you had to
 defragment regularly. They then go on to say they have never
 experienced machine lockups on EXT* (over a fairly significant length
 of time).
 
 So what can I tell them? Are we just hitting a bug which is likely to
 get fixed, or must we regularly rebalance?
 
 .. or is regular rebalancing incorrect, and regular machine
 lockups are actually the expected behaviour? :-)

I think it should *not* be required.

But my practical experience differs from what I think, as I described in great 
detail here and in this bug report:

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for 
minutes on random write into big file

https://bugzilla.kernel.org/show_bug.cgi?id=90401


So far I have had these hangs *only* when BTRFS was unable to reserve 
previously unused and unreserved space on the devices for a new chunk; as long 
as BTRFS can still allocate a new chunk, it stays fast. That said, it does not 
go slow in every situation where BTRFS can't allocate a new chunk. So it seems 
to me that having no unreserved device space to allocate chunks from is a 
*necessary* but not *sufficient* criterion for the kworker uses up 100% of one 
core issue I reported.

I suggest that you add your findings to the bug report and also share details 
there, as it may help to have more data available on when it happens.

That said, no BTRFS developer has yet looked into the kern.log with SysRq-T 
output I uploaded there.

Robert made a test case which easily triggers the behavior for him; I didn't 
yet take the time to try it out. Maybe you have a chance to? It's somewhere in 
this thread as a little shell script.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=90401#c0

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Regular rebalancing should be unnecessary? (Was: Re: BTRFS free space handling still needs more work: Hangs again)

2015-01-09 Thread Peter Waller
Apologies to those receiving this twice.

On 27 December 2014 at 09:30, Hugo Mills h...@carfax.org.uk wrote:

 Now, since you're seeing lockups when the space on your disks is

 all allocated I'd say that's a bug. However, you're the *only* person

 who's reported this as a regular occurrence. Does this happen with all
 filesystems you have, or just this one?


I have experienced machine lockups on four separate cloud machines,
and reported it in a few venues. I think I even reported it on this
list in the past but I can't find that right now. Here's a bug report
to Ubuntu-Kernel:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711

Regularly rebalancing the machines and ensuring they have 10% free
disk (filesystem) space, I don't experience this. Yet I read in this
thread that regular rebalancing shouldn't be necessary?

FWIW, I'm trying to sell BTRFS to my colleagues, and they view it as a
stupid filesystem, like the bad old Windows days when you had to
defragment regularly. They then go on to say they have never
experienced machine lockups on EXT* (over a fairly significant length
of time).

So what can I tell them? Are we just hitting a bug which is likely to
get fixed, or must we regularly rebalance?

.. or is regular rebalancing incorrect, and regular machine
lockups are actually the expected behaviour? :-)


Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2015-01-08 Thread Martin Steigerwald
Am Donnerstag, 8. Januar 2015, 05:45:56 schrieben Sie:
 Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:
  No BTRFS developers commented yet on this, neither in this thread nor in
  the bug report at kernel.org I made.
 
 Just a quick general note on this point...
 
 There has in the past been dev comment (I believe referenced on the wiki) 
 to the effect that on the list they tend to pick particular 
 reports/threads and work on them until they either fix the issue 
 or (when not urgent) decide it must wait for something else first.  
 During the time they're busy pursuing such a report, they don't read 
 others on the list very closely, and such list-only bug reports may thus 
 get dropped on the floor and never worked on.
 
 The recommendation, then, is to report it to the list, and if not picked 
 up right away and you plan on being around in a few weeks/months when 
 they potentially get to it, file a bug on it, so it doesn't get dropped 
 on the floor.

Duncan, I *did* file a bug.

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for 
minutes on random write into big file

https://bugzilla.kernel.org/show_bug.cgi?id=90401

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2015-01-07 Thread Duncan
Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:

 No BTRFS developers commented yet on this, neither in this thread nor in
 the bug report at kernel.org I made.

Just a quick general note on this point...

There has in the past been dev comment (I believe referenced on the wiki) 
to the effect that on the list they tend to pick particular 
reports/threads and work on them until they either fix the issue 
or (when not urgent) decide it must wait for something else first.  
During the time they're busy pursuing such a report, they don't read 
others on the list very closely, and such list-only bug reports may thus 
get dropped on the floor and never worked on.

The recommendation, then, is to report it to the list, and if not picked 
up right away and you plan on being around in a few weeks/months when 
they potentially get to it, file a bug on it, so it doesn't get dropped 
on the floor.

With the bugzilla.kernel.org report you've followed the recommendation, 
but the implication is that you won't necessarily get any comment right 
away, only later, when they're not immediately busy looking at some other 
bug.  So lack of b.k.o comment in the immediate term doesn't mean they're 
ignoring the bug or don't value it; it just means they're hot on the 
trail of something else ATM and it might take some time to get that 
first comment engagement.

But the recommendation is to file the bugzilla report precisely so it 
does /not/ get lost, and you've done that, so... you've done your part 
there and now comes the enforced patience bit of waiting for that 
engagement.

But if it takes a bit, I would keep the bug updated every kernel release 
or so, with a comment updating status.

(Meanwhile, I've seen no indication of such issues here.  Most of my 
btrfs are 8-24 GiB each, all SSD, mostly dual-device btrfs raid1 for both 
data and metadata.  Maybe I don't run those full enough.  I do have three 
mixed-bg mode sub-GiB btrfs, however, and one of them, a 256 MiB 
single-device dup-mode btrfs used as /boot, tends to run reasonably 
full, but I've not seen a problem like that there, either.  My use-
case probably simply doesn't hit the problem.)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2015-01-07 Thread Zygo Blaxell
On Wed, Jan 07, 2015 at 08:08:50PM +0100, Martin Steigerwald wrote:
 Am Dienstag, 6. Januar 2015, 15:03:23 schrieb Zygo Blaxell:
  ext3 has a related problem when it's nearly full:  it will try to search
  gigabytes of block allocation bitmaps searching for a free block, which
  can result in a single 'mkdir' call spending 45 minutes reading a large
  slow 99.5% full filesystem.
 
 Ok, that's for bitmap access. Ext4 uses extents. 

...and the problem doesn't happen to the same degree on ext4 as it did
on ext3.

  So far I've found that problems start when space drops below 1GB free
  (although it can go as low as 400MB) and problems stop when space gets
  above 1GB free, even without resizing or balancing the filesystem.
  I've adjusted free space monitoring thresholds accordingly for now,
  and it seems to be keeping things working so far.
 
 Just to make sure we are talking about the same thing: you mean space that BTRFS 
 has not yet reserved for chunks, i.e. the difference between size and used in 
 btrfs fi sh, right?

The number I look at for this issue is statvfs() f_bavail (i.e. the
Available column of /bin/df).

Before the empty-chunk-deallocation code, most of my filesystems would
quickly reach a steady state where all space is allocated to chunks,
and they stay that way unless I have to downsize them.

Now there is free (non-chunk) space on most of my filesystems.  I'll try
monitoring btrfs fi df and btrfs fi show under the failing conditions
and see if there are interesting correlations.
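
Not from the original mail: a minimal sketch of what such monitoring could
look like, assuming a placeholder mount point /mnt and a 60-second interval:

#!/bin/bash
# Periodically log df's Available column (statvfs f_bavail) together with
# the chunk-level view from btrfs, to correlate stalls with allocation state.
MNT=/mnt   # placeholder mount point
while true; do
    date
    df --output=avail "$MNT" | tail -1   # f_bavail, in 1K blocks
    btrfs filesystem df "$MNT"           # per-chunk-type total vs. used
    btrfs filesystem show "$MNT"         # device size vs. allocated
    sleep 60
done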





Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2015-01-07 Thread Martin Steigerwald
Am Dienstag, 6. Januar 2015, 15:03:23 schrieb Zygo Blaxell:
 On Mon, Dec 29, 2014 at 10:32:00AM +0100, Martin Steigerwald wrote:
  Am Sonntag, 28. Dezember 2014, 21:07:05 schrieb Zygo Blaxell:
   On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
[…]
Zygo, what are the characteristics of your filesystem? Do you use
compress=lzo and skinny metadata as well? How are the chunks
allocated?
What kind of data do you have on it?
   
   compress-force (default zlib), no skinny-metadata.  Chunks are d=single,
   m=dup.  Data is a mix of various desktop applications, most active
   file sizes from a few hundred K to a few MB, maybe 300k-400k files.
   No database or VM workloads.  Filesystem is 100GB and is usually between
   98 and 99% full (about 1-2GB free).
   
   I have another filesystem which has similar problems when it's 99.99%
   full (it's 13TB, so 0.01% is 1.3GB).  That filesystem is RAID1 with
   skinny-metadata and no-holes.
   
   On various filesystems I have the above CPU-burning problem, a bunch of
   irreproducible random crashes, and a hang with a kernel stack that goes
   through SyS_unlinkat and btrfs_evict_inode.
  
  Zygo, thanks. That desktop filesystem sounds a bit similar to my use case,
  with the interesting difference that you have no databases or VMs on it.
  
  That said, I use the Windows XP VM rarely, but using it was what made the
  issue so visible for me. Is your desktop filesystem on SSD?
 
 No, but I recently stumbled across the same symptoms on an 8GB SD card
 on kernel 3.12.24 (raspberry pi).  When the filesystem hit over ~97%
 full, all accesses were blocked for several minutes.  I was able to
 work around it by adjusting the threshold on a garbage collector daemon
 (i.e. deleting a lot of expendable files) to keep usage below 90%.
 I didn't try to balance the filesystem, and didn't seem to need to.

Interesting.

 ext3 has a related problem when it's nearly full:  it will try to search
 gigabytes of block allocation bitmaps searching for a free block, which
 can result in a single 'mkdir' call spending 45 minutes reading a large
 slow 99.5% full filesystem.

Ok, that's for bitmap access. Ext4 uses extents. BTRFS can use bitmaps as well, 
but it also supports extents and I think uses them for most use cases.

 I'd expect a btrfs filesystem that was nearly full to have a small tree
 of cached free space extents and be able to search it quickly even if
 the result is negative (i.e. there's no free space).  It seems to be
 doing something else... :-P

Yeah :)


  Do you have the chance to extend one of the affected filesystems to check
  my theory that this does not happen as long as BTRFS can still allocate new
  data chunks? If it's right, your FS should be fast again as long as you
  see more than 1 GiB free
  
  Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
  
  Total devices 2 FS bytes used 512.00KiB
  devid1 size 10.00GiB used 6.53GiB path
  /dev/mapper/sata-btrfsraid1
  devid2 size 10.00GiB used 6.53GiB path
  /dev/mapper/msata-btrfsraid1
  
  between size and used in btrfs fi sh. I suggest going with at least
  2-3
  GiB, as BTRFS may allocate just one chunk so quickly that you do not have
  the chance to recognize the difference.
 
 So far I've found that problems start when space drops below 1GB free
 (although it can go as low as 400MB) and problems stop when space gets
 above 1GB free, even without resizing or balancing the filesystem.
 I've adjusted free space monitoring thresholds accordingly for now,
 and it seems to be keeping things working so far.

Just to make sure we are talking about the same thing: you mean space that BTRFS 
has not yet reserved for chunks, i.e. the difference between size and used in 
btrfs fi sh, right?

No BTRFS developers commented yet on this, neither in this thread nor in the 
bug report at kernel.org I made.

  Well, and if that works for you, we are back to my recommendation:
  
  More so than with other filesystems, give BTRFS plenty of free space to
  operate with; at best enough that you always have a minimum of 2-3 GiB of
  unused device space left for chunk reservation. One could even write a
  Nagios/Icinga monitoring plugin for that :)

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



Re: BTRFS free space handling still needs more work: Hangs again

2014-12-29 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 16:27:41 schrieb Robert White:
 On 12/28/2014 07:42 AM, Martin Steigerwald wrote:
  Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
  On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
  Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
  Now:
 
  The complaining party has verified the minimum, repeatable case of
  simple file allocation on a very fragmented system and the responding
  party and several others have understood and supported the bug.
 
  I didn´t yet provide such a test case.
 
  My bad.
 
 
  At the moment I can only reproduce this kworker thread using a CPU for
  minutes case with my /home filesystem.
 
  A minimal test case for me would be one that reproduces it with a
  fresh BTRFS filesystem. But so far, with my test case on the fresh BTRFS, I
  get 4800 instead of 270 IOPS.
 
 
  A version of the test case to demonstrate absolutely system-clogging
  loads is pretty easy to construct.
 
  Make a raid1 filesystem.
  Balance it once to make sure the seed filesystem is fully integrated.
 
  Create a bunch of small files that are at least 4K in size, but are
  randomly sized. Fill the entire filesystem with them.
 
  BASH Script:
  typeset -i counter=0
  while
    dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) count=1 2>/dev/null
  do
    echo $counter >/dev/null # basically a noop
  done
 
  The while will exit when the dd encounters a full filesystem.
 
  Then delete ~10% of the files with
  rm *0
 
  Run the while loop again, then delete a different 10% with rm *1.
 
  Then again with rm *2, etc...
 
  Do this a few times and with each iteration the CPU usage gets worse and
  worse. You'll easily get system-wide stalls on all IO tasks lasting ten
  or more seconds.
 
  Thanks Robert. Thats wonderful.
 
  I wondered about such a test case already and thought about reproducing
  it just with fallocate calls instead to reduce the amount of actual
  writes done. I.e. just do some silly fallocate, truncating, write just
  some parts with dd seek and remove things again kind of workload.
 
  Feel free to add your testcase to the bug report:
 
  [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core 
  for minutes on random write into big file
  https://bugzilla.kernel.org/show_bug.cgi?id=90401
 
  Cause anything that helps a BTRFS developer to reproduce will make it easier
  to find and fix the root cause of it.
 
  I think I will try with this little critter:
 
  merkaba:/mnt/btrfsraid1 cat freespracefragment.sh
  #!/bin/bash
 
  TESTDIR=./test
  mkdir -p $TESTDIR
 
  typeset -i counter=0
  while true; do
   fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((++counter))
   echo $counter >/dev/null #basically a noop
  done
 
 If you don't do the remove/delete passes you won't get as much 
 fragmentation...
 
 I also noticed that fallocate would not actually create the files in my 
 toolset, so I had to touch them first. So the theoretical script became
 
 e.g.
 
 typeset -i counter=0
 for AA in {0..9}
 do
while
  touch ${TESTDIR}/$((++counter)) &&
  fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((counter))
do
  if ((counter%100 == 0))
  then
echo $counter
  fi
done
echo removing ${AA}
rm ${TESTDIR}/*${AA}
 done

Hmmm, strange. It did here. I had a ton of files in the test directory.

 Meanwhile, on my test rig using fallocate did _not_ result in final 
 exhaustion of resources. That is btrfs fi df /mnt/Work didn't show 
 significant changes on a near full expanse.

Hmmm, I had it running until it had allocated about 5 GiB in the data chunks.

But I stopped it yesterday. It took a long time to get there; it seems to be
quite slow at filling a 10 GiB RAID-1 BTRFS. I bet that may be due to the many
forks for the fallocate command.

But it seems my fallocate works differently than yours. I have fallocate
from:

merkaba:~ fallocate --version
fallocate from util-linux 2.25.2

 I also never got a failed response back from fallocate, that is the 
 inner loop never terminated. This could be a problem with the system 
 call itself or it could be a problem with the application wrapper.

Hmmm, it should return a failure like this:

merkaba:/mnt/btrfsraid1 LANG=C fallocate -l 20G 20g
fallocate: fallocate failed: No space left on device
merkaba:/mnt/btrfsraid1#1 echo $?
1
 
 Nor did I reach the CPU saturation I expected.

No, I didn't reach it either. Just 5% or so for the script itself, and I
didn't see any notable kworker activity. But then, I stopped it before the
filesystem was full.

 e.g.
 Gust vm # btrfs fi df /mnt/Work/
 Data, RAID1: total=1.72GiB, used=1.66GiB
 System, RAID1: total=32.00MiB, used=16.00KiB
 Metadata, RAID1: total=256.00MiB, used=57.84MiB
 GlobalReserve, single: total=32.00MiB, used=0.00B
 
 time passes while script running...
 
 Gust vm # btrfs fi df /mnt/Work/
 Data, RAID1: total=1.72GiB, used=1.66GiB
 System, RAID1: total=32.00MiB, 

Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)

2014-12-29 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
 Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
  Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
   Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
Summarized at

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

see below. This is reproducible with fio; no need for Windows XP in
Virtualbox to reproduce the issue. Next I will try to reproduce it with
a freshly created filesystem.


Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
 On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
  Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
   On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
Hello!

First: Have a merry christmas and enjoy a quiet time in these 
days.

Second: At a time you feel like it, here is a little rant, but 
also a
bug
report:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD 
RAID with
space_cache, skinny meta data extents – are these a problem? – 
and
   
compress=lzo:
   (there is no known problem with skinny metadata, it's actually 
   more
    efficient than the older format. There have been some anecdotes 
   about
   mixing the skinny and fat metadata but nothing has ever been
   demonstrated problematic.)
   
merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a

 Total devices 2 FS bytes used 144.41GiB
 devid1 size 160.00GiB used 160.00GiB path
 /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path
 /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
   
   This filesystem, at the allocation level, is very full (see 
   below).
   
And I had hangs with BTRFS again. This time as I wanted to 
install tax
return software in Virtualbox´d Windows XP VM (which I use once 
a year
cause I know no tax return software for Linux which would be 
suitable
for
Germany and I frankly don´t care about the end of security 
cause all
surfing and other network access I will do from the Linux box 
and I
only
run the VM behind a firewall).
   
And thus I try the balance dance again:
   ITEM: Balance... it doesn't do what you think it does... 
   
   Balancing is something you should almost never need to do. It 
   is only
   for cases of changing geometry (adding disks, switching RAID 
   levels,
    etc.) or for cases when you've radically changed allocation 
   behaviors
   (like you decided to remove all your VM's or you've decided to 
   remove a
   mail spool directory full of thousands of tiny files).
   
   People run balance all the time because they think they should. 
   They are
   _usually_ incorrect in that belief.
  
   I only see the lockups of BTRFS if the trees *occupy* all space on 
  the
  device.
No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
 space. What's more, balance does *not* balance the metadata trees. The
 remaining space -- 154.97 GiB -- is unstructured storage for file
 data, and you have some 13 GiB of that available for use.
 
Now, since you're seeing lockups when the space on your disks is
 all allocated I'd say that's a bug. However, you're the *only* person
 who's reported this as a regular occurrence. Does this happen with all
 filesystems you have, or just this one?
 
  I *never* so far saw it lockup if there is still space BTRFS can 
  allocate
  from to *extend* a tree.
 
It's not a tree. It's simply space allocation. It's not even space
 *usage* you're talking about here -- it's just allocation (i.e. the FS
 saying I'm going to use this piece of disk for this purpose).
 
  This may be a bug, but this is what I see.
  
  And no amount of you should not balance a BTRFS will make that
  perception go away.
  
  See, I see the sun coming out on a morning and you tell me no, it
  doesn´t. Simply that is not going to match my perception.
 
Duncan's assertion is correct in its detail. Looking at your space

Robert's 


Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2014-12-29 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 21:07:05 schrieb Zygo Blaxell:
 On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
  My simple test case didn't trigger it, and I do not have another 2x160
  GiB available on these SSDs to try with a copy of my home filesystem; with
  that, I could safely test without bringing the desktop session to a halt.
  Maybe someone has an idea on how to enhance my test case in order to
  reliably trigger the issue.
  
  It may be challenging though. My /home is quite a filesystem. It has a maildir
  with at least a million files (yeah, I am performance testing KMail and
  Akonadi to the limit as well!), and it has git repos and this one VM image,
  and the desktop search and the Akonadi database. In other words: it has
  been hit nicely with various, mostly random I think, workloads over the
  last six months or so. I bet it's not that easy to simulate that. Maybe some
  runs of compilebench to age the filesystem before the fio test?
  
  That said, BTRFS performs a lot better. The complete lockups without any
  CPU usage from 3.15 and 3.16 are gone for sure. That's wonderful. But there
  is this kworker issue now. I noticed it this gravely only while trying to
  complete the tax return stuff with the Windows XP VM. It may have happened
  otherwise, as I have seen some backtraces in kern.log, but it didn't last
  for minutes. So this is indeed less severe than the full lockups with
  3.15 and 3.16.
  
  Zygo, what are the characteristics of your filesystem? Do you use
  compress=lzo and skinny metadata as well? How are the chunks allocated?
  What kind of data do you have on it?
 
 compress-force (default zlib), no skinny-metadata.  Chunks are d=single,
 m=dup.  Data is a mix of various desktop applications, most active
 file sizes from a few hundred K to a few MB, maybe 300k-400k files.
 No database or VM workloads.  Filesystem is 100GB and is usually between
 98 and 99% full (about 1-2GB free).
 
 I have another filesystem which has similar problems when it's 99.99%
 full (it's 13TB, so 0.01% is 1.3GB).  That filesystem is RAID1 with
 skinny-metadata and no-holes.
 
 On various filesystems I have the above CPU-burning problem, a bunch of
 irreproducible random crashes, and a hang with a kernel stack that goes
 through SyS_unlinkat and btrfs_evict_inode.

Zygo, thanks. That desktop filesystem sounds a bit similar to my use case,
with the interesting difference that you have no databases or VMs on it.

That said, I use the Windows XP VM rarely, but using it was what made the issue
so visible for me. Is your desktop filesystem on SSD?

Do you have the chance to extend one of the affected filesystems to check
my theory that this does not happen as long as BTRFS can still allocate new
data chunks? If it's right, your FS should be fast again as long as you see
more than 1 GiB free

Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
Total devices 2 FS bytes used 512.00KiB
devid1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
devid2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1

between size and used in btrfs fi sh. I suggest going with at least 2-3
GiB, as BTRFS may allocate just one chunk so quickly that you do not have
the chance to recognize the difference.

Well, and if that works for you, we are back to my recommendation:

More so than with other filesystems, give BTRFS plenty of free space to
operate with; at best enough that you always have a minimum of 2-3 GiB of
unused device space left for chunk reservation. One could even write a
Nagios/Icinga monitoring plugin for that :)
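
As a rough sketch of what such a plugin could look like (not part of the
original mail; the thresholds, the awk field positions, and the --raw option
of btrfs fi show, which needs a sufficiently recent btrfs-progs, are all
assumptions):

#!/bin/bash
# check_btrfs_unallocated.sh MOUNTPOINT [WARN_GIB] [CRIT_GIB]
# Nagios/Icinga-style check: alert when the unallocated device space
# (device size minus allocated, summed over all devices) drops too low.
MNT=${1:?mountpoint}; WARN=${2:-3}; CRIT=${3:-2}
unalloc=$(btrfs filesystem show --raw "$MNT" | awk '
    /devid/ { free += $4 - $6 }   # field 4 = size, field 6 = used (bytes)
    END { printf "%.2f", free / 1024 / 1024 / 1024 }')
if awk "BEGIN{exit !($unalloc < $CRIT)}"; then
    echo "CRITICAL: only $unalloc GiB unallocated on $MNT"; exit 2
elif awk "BEGIN{exit !($unalloc < $WARN)}"; then
    echo "WARNING: only $unalloc GiB unallocated on $MNT"; exit 1
else
    echo "OK: $unalloc GiB unallocated on $MNT"; exit 0
fi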

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



Re: BTRFS free space handling still needs more work: Hangs again

2014-12-29 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 18:04:31 schrieb Patrik Lundquist:
 On 28 December 2014 at 13:03, Martin Steigerwald mar...@lichtvoll.de wrote:
 
  BTW, I found that the Oracle blog didn´t work at all for me. I completed
  a cycle of defrag, sdelete -c and VBoxManage compact, [...] and it
  apparently did *nothing* to reduce the size of the file.
 
 They've changed the argument to -z; sdelete -z.

Now how cute is that. Thank you. This did the trick:

martin@merkaba:~/.VirtualBox/HardDisks VBoxManage modifyhd Winlala.vdi 
--compact
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
martin@merkaba:~/.VirtualBox/HardDisks ls -lh
total 12G
-rw--- 1 martin martin 12G Dez 29 11:00 Winlala.vdi
martin@merkaba:~/.VirtualBox/HardDisks

It has been 20 GiB before.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 16:06:13 schrieb Robert White:
 
  I also don't know what kind of tool you are using, but it might be
  repeatedly trying and failing to fallocate the file as a single
  extent or something equally dumb.
 
  Userspace doesn't as far as I know, get to make that decision. I've
  just read the fallocate(2) man page, and it says nothing at all about
  the contiguity of the extent(s) storage allocated by the call.
 
 Yep, my bad. But as soon as I saw that fio was starting two threads, 
 one doing random read/write and another doing sequential read/write, 
 both on the same file, it set off my not just creating a file mindset. 
 Given the delayed write into/through the cache normally done by casual 
 file io, It seemed likely that fio would be doing something more 
 aggressive (like using O_DIRECT or repeated fdatasync() which could get 
 very tit-for-tat).

Robert, please get to know about fio or *ask* before jumping to conclusions.

I used this:

[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall


In the first test I still tested seq-write, but do you note the stonewall
parameter? It *separates* the two jobs from one another. I.e., fio may be
starting two threads, as I think it prepares all threads in advance, yet it
executed only *one* at a time.

From the manpage of fio:

   stonewall , wait_for_previous
  Wait  for  preceding  jobs  in the job file to exit before
  starting this one.  stonewall implies new_group.

(That said, the first stonewall isn't even needed; I removed the read
jobs from the ssd-test.fio example file I based this job on and didn't
remember to remove the statement.)
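
For reference (not part of the original mail; the file name is arbitrary),
such a job file is simply run as:

# save the job file above as e.g. rand-write.fio, then:
fio rand-write.fio
# fio then reports per-job IOPS and bandwidth; the 270 vs. 4800 IOPS
# figures quoted elsewhere in this thread appear to come from this kind
# of randwrite job.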


Thank you a lot for your input. I learned some things from it, for example
that the trees for data handling are in the metadata section. And it is now
very clear to me that btrfs fi df does not display any trees, but rather the
chunk reservation and usage. I think I knew this before, but I somehow thought
that was combined with the trees; it isn't, at least not in place, as
the trees are stored in the metadata chunks. I'd still not call these
extents though, since that's a file-based thing as far as I know.

I'll skip theorizing about algorithms here. I prefer to let measurements
speak and to try to understand them. The best approach to understanding the
ones I made, I think, is what Hugo suggested: a developer looking at the
SysRq-T outputs. So I personally won't speculate any further about
algorithmic limitations BTRFS may or may not have.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
 On 12/27/2014 05:01 PM, Bardur Arantsson wrote:
  On 2014-12-28 01:25, Robert White wrote:
  On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
   From how you write I get the impression that you think everyone else
  beside you is just silly and dumb. Please stop this assumption. I may not
  always get terms right, and I may make a mistake as with the wrong df
  figure. But I also highly dislike to feel treated like someone who
  doesn´t
  know a thing.
 
  Nope. I'm a systems theorist and I demand/require variable isolation.
 
  Not a question of silly or dumb but a question of speaking with
  sufficient precision and clarity.
 
  For instance you speak of having an impression and then decide I've
  made an assumption.
 
  I define my position. Explain my terms. Give my examples.
 
  I also risk being utterly wrong because sometimes being completely wrong
  gets others to cut away misconceptions and assumptions.
 
  It annoys some people, but it gets results.
 
  Can you please stop this bullshit posturing nonsense? It accomplishes
  nothing -- if you're right your other posts will stand for themselves
  and show that you are indeed the shit when it comes to these matters,
  but this post (so far, didn't read further) accomplishes nothing other
  than (possibly) convincing everyone that you're a pompous/self-important
  ass.
 
 Really? accomplishes nothing?
 
 24 hours ago:
 
 the complaining party was talking about
 
 - Windows XP
 - Tax software
 - Virtual box
 - vdi files
 - defragging
 - balancing
 - data trees
 - system hanging
 
 And the responding party was saying
 
 you are the only person reporting this as a regular occurrence with 
 the implication that the report was a duplicate or at least might not 
 get much immediate attention.
 
 Now:
 
 The complaining party has verified the minimum, repeatable case of 
 simple file allocation on a very fragmented system and the responding 
 party and several others have understood and supported the bug.

It was repeatable before. That I go from an application case to a simulated
workload case is only natural. Or do you run fio or other load-testing apps
as part of your daily work on your computer (unless you are actually
diagnosing performance issues)? I still *use* the computer with
applications, and if that's where I see a performance issue, I report it as
such. Then I think about the kind of workload it creates and go from there
to simplify it into a reproducible case.

At least I read mail, browse the web, run a VM, and do those kinds of
things as daily computer usage, and thus it's likely that performance issues
show up like this. Heck, even my server does mail and Owncloud and such.

I only use workload generation tools during my trainings or when analysing
things, not as part of my daily computer usage.

And that doesn't make using a VM any less valid. And if it basically brings
BTRFS to a crawl, I report this. It's actually that easy.

 That's not accomplishing nothing, thats called engaging in diagnostics 
 instead of dismissing a complaint, and sticking out the diagnostic 
 process until everyone is on the same page.
 
 I never dismissed Martin. I never disbelieved him. I went through his 
 elements one at a time with examples of what I was taking away from him 
 and why they didn't match expectations and experimental evidence. We 
 adjusted our positions and communications.

Robert, I received this differently: I perceived your input partly as wronging
me. Granted, that motivated me even more to prove things, but I highly
dislike this kind of motivation, as I think I am motivated enough myself. I
like finding the causes of performance bottlenecks, and I prefer positive
motivation to negative motivation.

 So you can call it bullshit posturing nonsense but I see taking less 
 than a day to get to the bottom of a bug report that might not have 
 gotten significant attention.

And you attribute all of this to your argumentation?

That's bold.

See, Robert, your arguments helped clear up my understanding in some
parts, especially on the terms I have not been very familiar with.

I am grateful for that.

It even helped motivate me to do the further tests, as I got the
impression that you had been arguing that what I am seeing is
just the way BTRFS necessarily is *algorithmically* and I was just using
it wrongly. That said, I have an interest myself in resolving this;
I was prepared to give additional input at some point, but on
that particular day I was just fed up with things.

It motivated me to demonstrate the abysmal performance behaviour under a
certain workload.

Robert, your arguments contributed, that's true. But still, I did the work of
the actual measurements. I spent the hours doing the measurements,
with a slight risk of having to restore from backup in case BTRFS
messed things up. I was the one bringing BTRFS to the limits where it
actually shows an issue, instead of theorizing about 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
 Now:
 
 The complaining party has verified the minimum, repeatable case of 
 simple file allocation on a very fragmented system and the responding 
 party and several others have understood and supported the bug.

I didn't yet provide such a test case.

At the moment I can only reproduce this kworker thread using a CPU for
minutes case with my /home filesystem.

A minimal test case for me would be one that reproduces it with a
fresh BTRFS filesystem. But so far, with my test case on the fresh BTRFS, I
get 4800 instead of 270 IOPS.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again (further tests)

2014-12-28 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
 Summarized at
 
 Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
 minutes on random write into big file
 https://bugzilla.kernel.org/show_bug.cgi?id=90401
 
 see below. This is reproducible with fio; no need for Windows XP in
 Virtualbox to reproduce the issue. Next I will try to reproduce it with
 a freshly created filesystem.
 
 
 Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
  On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
   Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
 Hello!
 
 First: Have a merry christmas and enjoy a quiet time in these days.
 
 Second: At a time you feel like it, here is a little rant, but also a
 bug
 report:
 
 I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
 space_cache, skinny meta data extents – are these a problem? – and

 compress=lzo:
(there is no known problem with skinny metadata, it's actually more
efficient than the older format. There have been some anecdotes about
mixing the skinny and fat metadata but nothing has ever been
demonstrated problematic.)

 merkaba:~ btrfs fi sh /home
 Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
 
  Total devices 2 FS bytes used 144.41GiB
  devid1 size 160.00GiB used 160.00GiB path
  /dev/mapper/msata-home
  devid2 size 160.00GiB used 160.00GiB path
  /dev/mapper/sata-home
 
 Btrfs v3.17
 merkaba:~ btrfs fi df /home
 Data, RAID1: total=154.97GiB, used=141.12GiB
 System, RAID1: total=32.00MiB, used=48.00KiB
 Metadata, RAID1: total=5.00GiB, used=3.29GiB
 GlobalReserve, single: total=512.00MiB, used=0.00B

This filesystem, at the allocation level, is very full (see below).

 And I had hangs with BTRFS again. This time as I wanted to install tax
 return software in Virtualbox´d Windows XP VM (which I use once a year
 cause I know no tax return software for Linux which would be suitable
 for
 Germany and I frankly don´t care about the end of security cause all
 surfing and other network access I will do from the Linux box and I
 only
 run the VM behind a firewall).

 And thus I try the balance dance again:
ITEM: Balance... it doesn't do what you think it does... 8-)

Balancing is something you should almost never need to do. It is only
for cases of changing geometry (adding disks, switching RAID levels,
etc.) or for cases when you've radically changed allocation behaviors
(like you decided to remove all your VM's or you've decided to remove a
mail spool directory full of thousands of tiny files).

People run balance all the time because they think they should. They are
_usually_ incorrect in that belief.
   
   I only see the lockups of BTRFS if the trees *occupy* all space on the
   device.
 No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
  space. What's more, balance does *not* balance the metadata trees. The
  remaining space -- 154.97 GiB -- is unstructured storage for file
  data, and you have some 13 GiB of that available for use.
  
 Now, since you're seeing lockups when the space on your disks is
  all allocated I'd say that's a bug. However, you're the *only* person
  who's reported this as a regular occurrence. Does this happen with all
  filesystems you have, or just this one?
  
   I *never* so far saw it lockup if there is still space BTRFS can allocate
   from to *extend* a tree.
  
 It's not a tree. It's simply space allocation. It's not even space
  *usage* you're talking about here -- it's just allocation (i.e. the FS
  saying I'm going to use this piece of disk for this purpose).
  
   This may be a bug, but this is what I see.
   
   And no amount of you should not balance a BTRFS will make that
   perception go away.
   
   See, I see the sun coming out on a morning and you tell me no, it
   doesn´t. Simply that is not going to match my perception.
  
 Duncan's assertion is correct in its detail. Looking at your space
 
 Robert's :)
 
  usage, I would not suggest that running a balance is something you
  need to do. Now, since you have these lockups that seem quite
  repeatable, there's probably a lurking bug in there, but hacking
  around with balance every time you hit it isn't going to get the
  problem solved properly.
  
 I think I would suggest the following:
  
   - make sure you have some way of logging your dmesg permanently (use
 a different filesystem for /var/log, or a serial console, or a
 netconsole)
  
   - when the lockup happens, hit Alt-SysRq-t a few times
  
   - send the dmesg output here, or post to bugzilla.kernel.org
  
 That's probably going to give enough 
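
Hugo's capture procedure above could be scripted roughly like this (a
sketch; the commands and paths are assumptions, not from the original mail):

# make sure the SysRq interface is enabled
sysctl kernel.sysrq=1
# when the lockup happens, either hit Alt-SysRq-t on the console or:
echo t > /proc/sysrq-trigger     # dump the state of all tasks to dmesg
dmesg > /var/log/sysrq-t.txt     # save it; ideally /var/log is on another fs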

Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare)

2014-12-28 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
 Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
  Summarized at
  
  Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
  minutes on random write into big file
  https://bugzilla.kernel.org/show_bug.cgi?id=90401
  
  see below. This is reproducible with fio; no need for Windows XP in
  Virtualbox to reproduce the issue. Next I will try to reproduce it with
  a freshly created filesystem.
  
  
  Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
   On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
 On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
  Hello!
  
  First: Have a merry christmas and enjoy a quiet time in these days.
  
  Second: At a time you feel like it, here is a little rant, but also 
  a
  bug
  report:
  
  I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID 
  with
  space_cache, skinny meta data extents – are these a problem? – and
 
  compress=lzo:
 (there is no known problem with skinny metadata, it's actually more
 efficient than the older format. There have been some anecdotes about
 mixing the skinny and fat metadata but nothing has ever been
 demonstrated problematic.)
 
  merkaba:~ btrfs fi sh /home
  Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
  
   Total devices 2 FS bytes used 144.41GiB
   devid1 size 160.00GiB used 160.00GiB path
   /dev/mapper/msata-home
   devid2 size 160.00GiB used 160.00GiB path
   /dev/mapper/sata-home
  
  Btrfs v3.17
  merkaba:~ btrfs fi df /home
  Data, RAID1: total=154.97GiB, used=141.12GiB
  System, RAID1: total=32.00MiB, used=48.00KiB
  Metadata, RAID1: total=5.00GiB, used=3.29GiB
  GlobalReserve, single: total=512.00MiB, used=0.00B
 
 This filesystem, at the allocation level, is very full (see below).
 
  And I had hangs with BTRFS again. This time as I wanted to install 
  tax
  return software in Virtualbox´d Windows XP VM (which I use once a 
  year
  cause I know no tax return software for Linux which would be 
  suitable
  for
  Germany and I frankly don´t care about the end of security cause all
  surfing and other network access I will do from the Linux box and I
  only
  run the VM behind a firewall).
 
  And thus I try the balance dance again:
 ITEM: Balance... it doesn't do what you think it does... 8-)
 
 Balancing is something you should almost never need to do. It is 
 only
 for cases of changing geometry (adding disks, switching RAID levels,
etc.) or for cases when you've radically changed allocation behaviors
 (like you decided to remove all your VM's or you've decided to remove 
 a
 mail spool directory full of thousands of tiny files).
 
 People run balance all the time because they think they should. They 
 are
 _usually_ incorrect in that belief.

I only see the lockups of BTRFS if the trees *occupy* all space on the
device.
  No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
   space. What's more, balance does *not* balance the metadata trees. The
   remaining space -- 154.97 GiB -- is unstructured storage for file
   data, and you have some 13 GiB of that available for use.
   
  Now, since you're seeing lockups when the space on your disks is
   all allocated I'd say that's a bug. However, you're the *only* person
   who's reported this as a regular occurrence. Does this happen with all
   filesystems you have, or just this one?
   
I *never* so far saw it lockup if there is still space BTRFS can 
allocate
from to *extend* a tree.
   
  It's not a tree. It's simply space allocation. It's not even space
   *usage* you're talking about here -- it's just allocation (i.e. the FS
   saying I'm going to use this piece of disk for this purpose).
   
This may be a bug, but this is what I see.

And no amount of you should not balance a BTRFS will make that
perception go away.

See, I see the sun coming out on a morning and you tell me no, it
doesn´t. Simply that is not going to match my perception.
   
  Duncan's assertion is correct in its detail. Looking at your space
  
  Robert's :)
  
   usage, I would not suggest that running a balance is something you
   need to do. Now, since you have these lockups that seem quite
   repeatable, there's probably a lurking bug in there, but hacking
   around with balance every time you hit it isn't going to get the
   problem solved properly.
   
  I think I would suggest the following:
   
- make sure you have some way of logging your dmesg permanently (use
  a different 

Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)

2014-12-28 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
 Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
  Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
   Summarized at
   
   Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
   minutes on random write into big file
   https://bugzilla.kernel.org/show_bug.cgi?id=90401
   
   see below. This is reproducible with fio; no need for Windows XP in
   Virtualbox to reproduce the issue. Next I will try to reproduce it with
   a freshly created filesystem.
   
   
   Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
 Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
  On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
   Hello!
   
   First: Have a merry christmas and enjoy a quiet time in these 
   days.
   
   Second: At a time you feel like it, here is a little rant, but 
   also a
   bug
   report:
   
   I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID 
   with
   space_cache, skinny meta data extents – are these a problem? – and
  
   compress=lzo:
  (there is no known problem with skinny metadata, it's actually more
  efficient than the older format. There have been some anecdotes about
  mixing the skinny and fat metadata but nothing has ever been
  demonstrated problematic.)
  
   merkaba:~ btrfs fi sh /home
   Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
   
Total devices 2 FS bytes used 144.41GiB
devid1 size 160.00GiB used 160.00GiB path
/dev/mapper/msata-home
devid2 size 160.00GiB used 160.00GiB path
/dev/mapper/sata-home
   
   Btrfs v3.17
   merkaba:~ btrfs fi df /home
   Data, RAID1: total=154.97GiB, used=141.12GiB
   System, RAID1: total=32.00MiB, used=48.00KiB
   Metadata, RAID1: total=5.00GiB, used=3.29GiB
   GlobalReserve, single: total=512.00MiB, used=0.00B
  
  This filesystem, at the allocation level, is very full (see 
  below).
  
   And I had hangs with BTRFS again. This time as I wanted to 
   install tax
   return software in Virtualbox´d Windows XP VM (which I use once a 
   year
   cause I know no tax return software for Linux which would be 
   suitable
   for
   Germany and I frankly don´t care about the end of security cause 
   all
   surfing and other network access I will do from the Linux box and 
   I
   only
   run the VM behind a firewall).
  
   And thus I try the balance dance again:
  ITEM: Balance... it doesn't do what you think it does... 
  
  Balancing is something you should almost never need to do. It is 
  only
  for cases of changing geometry (adding disks, switching RAID levels,
etc.) or for cases when you've radically changed allocation 
  behaviors
  (like you decided to remove all your VM's or you've decided to 
  remove a
  mail spool directory full of thousands of tiny files).
  
  People run balance all the time because they think they should. 
  They are
  _usually_ incorrect in that belief.
 
 I only see the lockups of BTRFS if the trees *occupy* all space on the
 device.
   No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
space. What's more, balance does *not* balance the metadata trees. The
remaining space -- 154.97 GiB -- is unstructured storage for file
data, and you have some 13 GiB of that available for use.

   Now, since you're seeing lockups when the space on your disks is
all allocated I'd say that's a bug. However, you're the *only* person
who's reported this as a regular occurrence. Does this happen with all
filesystems you have, or just this one?

 I *never* so far saw it lockup if there is still space BTRFS can 
 allocate
 from to *extend* a tree.

   It's not a tree. It's simply space allocation. It's not even space
*usage* you're talking about here -- it's just allocation (i.e. the FS
saying I'm going to use this piece of disk for this purpose).

 This may be a bug, but this is what I see.
 
 And no amount of you should not balance a BTRFS will make that
 perception go away.
 
 See, I see the sun coming out on a morning and you tell me no, it
 doesn´t. Simply that is not going to match my perception.

   Duncan's assertion is correct in its detail. Looking at your space
   
   Robert's 
   
usage, I would not suggest that running a balance is something you
need to do. Now, since you have these lockups that seem quite
repeatable, there's probably a lurking bug in there, but hacking
around with balance 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Robert White

On 12/28/2014 04:07 AM, Martin Steigerwald wrote:

Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:

Now:

The complaining party has verified the minimum, repeatable case of
simple file allocation on a very fragmented system and the responding
party and several others have understood and supported the bug.


I didn´t yet provide such a test case.


My bad.



At the moment I can only reproduce this kworker thread using a CPU for
minutes case with my /home filesystem.

A minimal test case for me would be one that reproduces it with a
fresh BTRFS filesystem. But so far, with my test case on the fresh BTRFS, I
get 4800 instead of 270 IOPS.



A version of the test case to demonstrate absolutely system-clogging 
loads is pretty easy to construct.


Make a raid1 filesystem.
Balance it once to make sure the seed filesystem is fully integrated.

Create a bunch of small files that are at least 4K in size, but are 
randomly sized. Fill the entire filesystem with them.


BASH Script:
typeset -i counter=0
while
  dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) \
     count=1 2>/dev/null
do
  echo $counter >/dev/null  # basically a noop
done

The while will exit when the dd encounters a full filesystem.

Then delete ~10% of the files with
rm *0

Run the while loop again, then delete a different 10% with rm *1.

Then again with rm *2, etc...

Do this a few times and with each iteration the CPU usage gets worse and 
worse. You'll easily get system-wide stalls on all IO tasks lasting ten 
or more seconds.


I don't have enough spare storage to do this directly, so I used 
loopback devices. First I did it with the loopback files in COW mode. 
Then I did it again with the files in NOCOW mode. (the COW files got 
thick with overwrite real fast. 8-)
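
For anyone who wants to replicate the rig, it was built roughly like
this (paths, sizes and loop device numbers here are illustrative, not
my exact setup):

mkdir -p /var/tmp/btrfs-test /mnt/Work
for i in 0 1 2 3; do
    touch /var/tmp/btrfs-test/img$i
    chattr +C /var/tmp/btrfs-test/img$i   # NOCOW; must be set while empty
    fallocate -l 1G /var/tmp/btrfs-test/img$i
    losetup /dev/loop$i /var/tmp/btrfs-test/img$i
done
mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
mount /dev/loop0 /mnt/Work
btrfs balance start /mnt/Work   # integrate the seed filesystem once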


So anyway...

After I got through all ten digits on the rm (that is removing *0, then 
refilling, then *1 etc...) I figured the FS image was nicely fragmented.


At that point it was very easy to spike the kworker to 100% CPU with

dd if=/dev/urandom of=/mnt/Work/scratch bs=40k

The dd would read 40k (a CPU spike for /dev/urandom processing), then it 
would write the 40k and the kworker would peg 100% on one CPU and stay 
there for a while. Then it would be back to the /dev/urandom spike.


So this laptop has been carefully detuned to prevent certain kinds of 
stalls (particularly the movablecore= reservation, as previously 
mentioned, to prevent non-responsiveness of the UI) and I had to go 
through /dev/loop, so that had a smoothing effect... but yep, there were 
clear kworker spikes that _did_ stop the IO path (the system monitor app, 
for instance, could not get I/O statistics for ten and fifteen second 
intervals and would stop logging/scrolling).


Progressively larger block sizes on the write path made things 
progressively worse...


dd if=/dev/urandom of=/mnt/Work/scratch bs=160k


And overwriting the file by just invoking DD again, was worse still 
(presumably from the juggling act) before resulting in a net 
out-of-space condition.


Switching from /dev/urandom to /dev/zero for writing the large file made 
things worse still -- probably since there were no respites for the 
kworker to catch up etc.


ASIDE: Playing with /proc/sys/vm/dirty_{background_,}ratio had lots of 
interesting and difficult-to-quantify effects on user-space 
applications. Cutting both in half (5 and 10 instead of 10 and 20 
respectively) seemed to give some relief, but going further got harmful 
quickly. Diverging the two numbers had odd effects too. Overall it 
seemed a little brittle to play with these settings.
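
(For the record, the kind of tweak meant above -- values illustrative,
not a recommendation:

echo 5  > /proc/sys/vm/dirty_background_ratio   # default is usually 10
echo 10 > /proc/sys/vm/dirty_ratio              # default is usually 20

or the same via "sysctl vm.dirty_background_ratio=5" etc.)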


SUPER FREAKY THING...

Every time I removed and recreated scratch I would get _radically_ 
different results for how much I could write into that remaining space 
and how long it took to do so. In theory I am reusing the exact same 
storage again and again. I'm not doing compression (the underlying 
filesystem behind the loop devices has compression, but that would be 
disabled by the +C attribute). It's not enough space coming-and-going to 
cause data extents to be reclaimed or displaced by metadata. And the 
filesystem is otherwise completely unused.


But check it out...

Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 1.4952 s, 186 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 292.135 s, 953 kB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
93+0 records in
92+0 records out
15073280 bytes (15 MB) copied, 0.0453977 s, 332 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing ‘/mnt/Work/scratch’: No space left on device
1090+0 records in
1089+0 records out
178421760 bytes (178 MB) 

Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)

2014-12-28 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
 Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
  Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
   Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
Summarized at

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

see below. This is reproducible with fio, no need for Windows XP in
Virtualbox for reproducing the issue. Next I will try to reproduce with
a freshly created filesystem.


Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
 On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
  Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
   On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
Hello!

First: Have a merry christmas and enjoy a quiet time in these 
days.

Second: At a time you feel like it, here is a little rant, but 
also a
bug
report:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD 
RAID with
space_cache, skinny meta data extents – are these a problem? – 
and
   
compress=lzo:
    (there is no known problem with skinny metadata, it's actually more
    efficient than the older format. There have been some anecdotes
    about mixing the skinny and fat metadata but nothing has ever been
    demonstrated problematic.)
   
merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a

 Total devices 2 FS bytes used 144.41GiB
 devid1 size 160.00GiB used 160.00GiB path
 /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path
 /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
   
   This filesystem, at the allocation level, is very full (see 
   below).
   
    And I had hangs with BTRFS again. This time as I wanted to install
    tax return software in a Virtualbox´d Windows XP VM (which I use
    once a year cause I know no tax return software for Linux which
    would be suitable for Germany, and I frankly don´t care about the
    end of security updates cause all surfing and other network access
    I will do from the Linux box and I only run the VM behind a
    firewall).
   
And thus I try the balance dance again:
   ITEM: Balance... it doesn't do what you think it does... 
   
   Balancing is something you should almost never need to do. It 
   is only
   for cases of changing geometry (adding disks, switching RAID 
   levels,
   etc.) or for cases when you've radically changed allocation 
   behaviors
   (like you decided to remove all your VM's or you've decided to 
   remove a
   mail spool directory full of thousands of tiny files).
   
   People run balance all the time because they think they should. 
   They are
   _usually_ incorrect in that belief.
  
  I only see the lockups of BTRFS if the trees *occupy* all space on
  the device.
No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
 space. What's more, balance does *not* balance the metadata trees. The
 remaining space -- 154.97 GiB -- is unstructured storage for file
 data, and you have some 13 GiB of that available for use.
 
Now, since you're seeing lockups when the space on your disks is
 all allocated I'd say that's a bug. However, you're the *only* person
 who's reported this as a regular occurrence. Does this happen with all
 filesystems you have, or just this one?
 
  I have *never* so far seen it lock up if there is still space BTRFS
  can allocate from to *extend* a tree.
 
It's not a tree. It's simply space allocation. It's not even space
 *usage* you're talking about here -- it's just allocation (i.e. the FS
 saying I'm going to use this piece of disk for this purpose).
 
  This may be a bug, but this is what I see.
  
  And no amount of "you should not balance a BTRFS" will make that
  perception go away.
  
  See, I see the sun coming up in the morning and you tell me "no, it
  doesn´t". Simply that is not going to match my perception.
 
Duncan's assertion is correct in its detail. Looking at your space 


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
 On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
  Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
  Now:
 
  The complaining party has verified the minimum, repeatable case of
  simple file allocation on a very fragmented system and the responding
  party and several others have understood and supported the bug.
 
  I didn´t yet provide such a test case.
 
 My bad.
 
 
  At the moment I can only reproduce this kworker thread using a CPU for
  minutes case with my /home filesystem.
 
  A minimal test case for me would be to be able to reproduce it with a
  fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
  get 4800 instead of 270 IOPS.
 
 
 A version of the test case to demonstrate absolutely system-clogging 
 loads is pretty easy to construct.
 
 Make a raid1 filesystem.
 Balance it once to make sure the seed filesystem is fully integrated.
 
 Create a bunch of small files that are at least 4K in size, but are 
 randomly sized. Fill the entire filesystem with them.
 
 BASH Script:
 typeset -i counter=0
 while
   dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) \
      count=1 2>/dev/null
 do
 echo $counter >/dev/null  # basically a noop
 done

 The while will exit when the dd encounters a full filesystem.
 
 Then delete ~10% of the files with
 rm *0
 
 Run the while loop again, then delete a different 10% with rm *1.
 
 Then again with rm *2, etc...
 
 Do this a few times and with each iteration the CPU usage gets worse and 
 worse. You'll easily get system-wide stalls on all IO tasks lasting ten 
 or more seconds.

Thanks Robert. That's wonderful.

I wondered about such a test case already and thought about reproducing
it just with fallocate calls instead, to reduce the amount of actual
writes done. I.e. just do some silly "fallocate, truncate, write just
some parts with dd seek and remove things again" kind of workload.

Feel free to add your testcase to the bug report:

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for 
minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

Cause anything that helps a BTRFS developer to reproduce will make it easier
to find and fix the root cause of it.

I think I will try with this little critter:

merkaba:/mnt/btrfsraid1 cat freespracefragment.sh 
#!/bin/bash

TESTDIR=./test
mkdir -p $TESTDIR

typeset -i counter=0
while true; do
fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((++counter))
echo $counter >/dev/null  # basically a noop
done

It takes a while; the script itself is using only a few percent of one core
there, while busying out the SSDs more heavily than I thought it would.
But well, I see up to 12000 writes per 10 seconds – that's not that much,
still it keeps one SSD 80% busy:

ATOP - merkaba 2014/12/28 16:40:57 --- 10s elapsed
PRC | sys 1.50s | user 3.47s | #proc 367 | #trun 1 | #tslpi 649 | #tslpu 0 | #zombie 0 | clones 839 | no procacct |
CPU | sys 30% | user 38% | irq 1% | idle 293% | wait 37% | steal 0% | guest 0% | curf 1.63GHz | curscal 50% |
cpu | sys 7% | user 11% | irq 1% | idle 75% | cpu000 w 6% | steal 0% | guest 0% | curf 1.25GHz | curscal 39% |
cpu | sys 8% | user 11% | irq 0% | idle 76% | cpu002 w 4% | steal 0% | guest 0% | curf 1.55GHz | curscal 48% |
cpu | sys 7% | user 9% | irq 0% | idle 71% | cpu001 w 13% | steal 0% | guest 0% | curf 1.75GHz | curscal 54% |
cpu | sys 8% | user 7% | irq 0% | idle 71% | cpu003 w 14% | steal 0% | guest 0% | curf 1.96GHz | curscal 61% |
CPL | avg1 1.69 | avg5 1.30 | avg15 0.94 | csw 68387 | intr 36928 | numcpu 4 |
MEM | tot 15.5G | free 3.1G | cache 8.8G | buff 4.2M | slab 1.0G | shmem 210.3M | shrss 79.1M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 12.0G | free 11.5G | vmcom 4.9G | vmlim 19.7G |
LVM | a-btrfsraid1 | busy 80% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 1.11 | avio 0.67 ms |
LVM | a-btrfsraid1 | busy 5% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 2.45 | avio 0.04 ms |
LVM | msata-home | busy 3% | read 0 | write 175 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.06 | avq 1.71 | avio 1.43 ms |
LVM | msata-debian | busy 0% | read 0 | write 10 | KiB/r 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Martin Steigerwald
Am Sonntag, 28. Dezember 2014, 16:42:20 schrieb Martin Steigerwald:
 Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
  On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
   Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:
   Now:
  
   The complaining party has verified the minimum, repeatable case of
   simple file allocation on a very fragmented system and the responding
   party and several others have understood and supported the bug.
  
   I didn´t yet provide such a test case.
  
  My bad.
  
  
   At the moment I can only reproduce this kworker thread using a CPU for
   minutes case with my /home filesystem.
  
   A minimal test case for me would be to be able to reproduce it with a
   fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
   get 4800 instead of 270 IOPS.
  
  
  A version of the test case to demonstrate absolutely system-clogging 
  loads is pretty easy to construct.
  
  Make a raid1 filesystem.
  Balance it once to make sure the seed filesystem is fully integrated.
  
  Create a bunch of small files that are at least 4K in size, but are 
  randomly sized. Fill the entire filesystem with them.
  
  BASH Script:
  typeset -i counter=0
  while
    dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) \
       count=1 2>/dev/null
  do
  echo $counter >/dev/null  # basically a noop
  done
 
  The while will exit when the dd encounters a full filesystem.
  
  Then delete ~10% of the files with
  rm *0
  
  Run the while loop again, then delete a different 10% with rm *1.
  
  Then again with rm *2, etc...
  
  Do this a few times and with each iteration the CPU usage gets worse and 
  worse. You'll easily get system-wide stalls on all IO tasks lasting ten 
  or more seconds.
 
 Thanks Robert. That's wonderful.
 
 I wondered about such a test case already and thought about reproducing
 it just with fallocate calls instead, to reduce the amount of actual
 writes done. I.e. just do some silly "fallocate, truncate, write just
 some parts with dd seek and remove things again" kind of workload.
 
 Feel free to add your testcase to the bug report:
 
 [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for 
 minutes on random write into big file
 https://bugzilla.kernel.org/show_bug.cgi?id=90401
 
 Cause anything that helps a BTRFS developer to reproduce will make it easier
 to find and fix the root cause of it.
 
 I think I will try with this little critter:
 
 merkaba:/mnt/btrfsraid1 cat freespracefragment.sh 
 #!/bin/bash
 
 TESTDIR=./test
 mkdir -p $TESTDIR
 
 typeset -i counter=0
 while true; do
 fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((++counter))
 echo $counter >/dev/null  # basically a noop
 done
 
 It takes a while; the script itself is using only a few percent of one core
 there, while busying out the SSDs more heavily than I thought it would.
 But well, I see up to 12000 writes per 10 seconds – that's not that much,
 still it keeps one SSD 80% busy:
 
 ATOP - merkaba 2014/12/28 16:40:57 --- 10s elapsed
 PRC | sys 1.50s | user 3.47s | #proc 367 | #trun 1 | #tslpi 649 | #tslpu 0 | #zombie 0 | clones 839 | no procacct |
 CPU | sys 30% | user 38% | irq 1% | idle 293% | wait 37% | steal 0% | guest 0% | curf 1.63GHz | curscal 50% |
 cpu | sys 7% | user 11% | irq 1% | idle 75% | cpu000 w 6% | steal 0% | guest 0% | curf 1.25GHz | curscal 39% |
 cpu | sys 8% | user 11% | irq 0% | idle 76% | cpu002 w 4% | steal 0% | guest 0% | curf 1.55GHz | curscal 48% |
 cpu | sys 7% | user 9% | irq 0% | idle 71% | cpu001 w 13% | steal 0% | guest 0% | curf 1.75GHz | curscal 54% |
 cpu | sys 8% | user 7% | irq 0% | idle 71% | cpu003 w 14% | steal 0% | guest 0% | curf 1.96GHz | curscal 61% |
 CPL | avg1 1.69 | avg5 1.30 | avg15 0.94 | csw 68387 | intr 36928 | numcpu 4 |
 MEM | tot 15.5G | free 3.1G | cache 8.8G | buff 4.2M | slab 1.0G | shmem 210.3M | shrss 79.1M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
 SWP | tot 12.0G | free 11.5G | vmcom 4.9G | vmlim 19.7G |
 LVM | a-btrfsraid1 | busy 80% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 1.11 | avio 0.67 ms |
 LVM | a-btrfsraid1 | busy 5% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 2.45 | avio 0.04 ms |
 LVM | msata-home | busy 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Patrik Lundquist
On 28 December 2014 at 13:03, Martin Steigerwald mar...@lichtvoll.de wrote:

 BTW, I found that the Oracle blog didn´t work at all for me. I completed
 a cycle of defrag, sdelete -c and VBoxManage compact, [...] and it
 apparently did *nothing* to reduce the size of the file.

They've changed the argument to -z; sdelete -z.


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-28 Thread Robert White

On 12/28/2014 07:42 AM, Martin Steigerwald wrote:

Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:

On 12/28/2014 04:07 AM, Martin Steigerwald wrote:

Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:

Now:

The complaining party has verified the minimum, repeatable case of
simple file allocation on a very fragmented system and the responding
party and several others have understood and supported the bug.


I didn´t yet provide such a test case.


My bad.



At the moment I can only reproduce this kworker thread using a CPU for
minutes case with my /home filesystem.

A minimal test case for me would be to be able to reproduce it with a
fresh BTRFS filesystem. But yet with my testcase with the fresh BTRFS I
get 4800 instead of 270 IOPS.



A version of the test case to demonstrate absolutely system-clogging
loads is pretty easy to construct.

Make a raid1 filesystem.
Balance it once to make sure the seed filesystem is fully integrated.

Create a bunch of small files that are at least 4K in size, but are
randomly sized. Fill the entire filesystem with them.

BASH Script:
typeset -i counter=0
while
   dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) \
      count=1 2>/dev/null
do
echo $counter >/dev/null  # basically a noop
done

The while will exit when the dd encounters a full filesystem.

Then delete ~10% of the files with
rm *0

Run the while loop again, then delete a different 10% with rm *1.

Then again with rm *2, etc...

Do this a few times and with each iteration the CPU usage gets worse and
worse. You'll easily get system-wide stalls on all IO tasks lasting ten
or more seconds.


Thanks Robert. That's wonderful.

I wondered about such a test case already and thought about reproducing
it just with fallocate calls instead, to reduce the amount of actual
writes done. I.e. just do some silly "fallocate, truncate, write just
some parts with dd seek and remove things again" kind of workload.

Feel free to add your testcase to the bug report:

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for 
minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

Cause anything that helps a BTRFS developer to reproduce will make it easier
to find and fix the root cause of it.

I think I will try with this little critter:

merkaba:/mnt/btrfsraid1 cat freespracefragment.sh
#!/bin/bash

TESTDIR=./test
mkdir -p $TESTDIR

typeset -i counter=0
while true; do
 fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((++counter))
 echo $counter >/dev/null  # basically a noop
done


If you don't do the remove/delete passes you won't get as much 
fragmentation...


I also noticed that fallocate would not actually create the files in my 
toolset, so I had to touch them first. So the theoretical script became


e.g.

typeset -i counter=0
for AA in {0..9}
do
  while
    touch ${TESTDIR}/$((++counter)) &&
    fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((counter))
  do
if ((counter%100 == 0))
then
  echo $counter
fi
  done
  echo removing ${AA}
  rm ${TESTDIR}/*${AA}
done

Meanwhile, on my test rig using fallocate did _not_ result in final 
exhaustion of resources. That is btrfs fi df /mnt/Work didn't show 
significant changes on a near full expanse.


I also never got a failed response back from fallocate, that is the 
inner loop never terminated. This could be a problem with the system 
call itself or it could be a problem with the application wrapper.


Nor did I reach the CPU saturation I expected.

e.g.
Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

time passes while script running...

Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

So there may be some limiting factor or something.

Without the actual writes to the actual file expanse I don't get the stalls.

(I added a _touch_ of instrumentation; it makes the various catastrophe 
events a little more obvious in context. 8-)


mount /dev/whatever /mnt/Work
typeset -i counter=0
for AA in {0..9}
do
  while
    dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) \
       count=1 2>/dev/null

  do
if ((counter%100 == 0))
then
  echo $counter
  if ((counter%1000 == 0))
  then
btrfs fi df /mnt/Work
  fi
fi
  done
  btrfs fi df /mnt/Work
  echo removing ${AA}
  rm /mnt/Work/*${AA}
  btrfs fi df /mnt/Work
done

So you definitely need the writes to really see the stalls.


I may try with my test BTRFS. I could even make it 2x20 GiB RAID 1
as well.


I guess I never mentioned it... I am using 4x1GiB NOCOW files through 
losetup as the basis of a RAID1. No compression (by virtue of the NOCOW 
files 

Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2014-12-28 Thread Zygo Blaxell
On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
 My simple test case didn´t trigger it, and I do not have another twice 160
 GiB available on these SSDs to try with a copy of my home
 filesystem. Then I could safely test without bringing the desktop session to
 a halt. Maybe someone has an idea on how to enhance my test case in
 order to reliably trigger the issue.
 
 It may be challenging tough. My /home is quite a filesystem. It has a maildir
 with at least one million of files (yeah, I am performance testing KMail and
 Akonadi as well to the limit!), and it has git repos and this one VM image,
 and the desktop search and the Akonadi database. In other words: It has
 been hit nicely with various mostly random I think workloads over the last
 about six months. I bet its not that easy to simulate that. Maybe some runs
 of compilebench to age the filesystem before the fio test?
 
 That said, BTRFS performs a lot better. The complete lockups without any
 CPU usage of 3.15 and 3.16 have gone for sure. Thats wonderful. But there
 is this kworker issue now. I noticed it that gravely just while trying to
 complete this tax returns stuff with the Windows XP VM. Otherwise it may
 have happened, I have seen some backtraces in kern.log, but it didn´t last
 for minutes. So this indeed is of less severity than the full lockups with
 3.15 and 3.16.
 
 Zygo, what are the characteristics of your filesystem? Do you use
 compress=lzo and skinny metadata as well? How are the chunks allocated?
 What kind of data do you have on it?

compress-force (default zlib), no skinny-metadata.  Chunks are d=single,
m=dup.  Data is a mix of various desktop applications, most active
file sizes from a few hundred K to a few MB, maybe 300k-400k files.
No database or VM workloads.  Filesystem is 100GB and is usually between
98 and 99% full (about 1-2GB free).

I have another filesystem which has similar problems when it's 99.99%
full (it's 13TB, so 0.01% is 1.3GB).  That filesystem is RAID1 with
skinny-metadata and no-holes.

On various filesystems I have the above CPU-burning problem, a bunch of
irreproducible random crashes, and a hang with a kernel stack that goes
through SyS_unlinkat and btrfs_evict_inode.





Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
 On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
  Hello!
  
  First: Have a merry christmas and enjoy a quiet time in these days.
  
  Second: At a time you feel like it, here is a little rant, but also a bug
  report:
  
  I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
  space_cache, skinny meta data extents – are these a problem? – and
 
  compress=lzo:
 (there is no known problem with skinny metadata, it's actually more
 efficient than the older format. There have been some anecdotes about
 mixing the skinny and fat metadata but nothing has ever been
 demonstrated problematic.)
 
  merkaba:~ btrfs fi sh /home
  Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
  
   Total devices 2 FS bytes used 144.41GiB
   devid1 size 160.00GiB used 160.00GiB path
   /dev/mapper/msata-home
   devid2 size 160.00GiB used 160.00GiB path
   /dev/mapper/sata-home
  
  Btrfs v3.17
  merkaba:~ btrfs fi df /home
  Data, RAID1: total=154.97GiB, used=141.12GiB
  System, RAID1: total=32.00MiB, used=48.00KiB
  Metadata, RAID1: total=5.00GiB, used=3.29GiB
  GlobalReserve, single: total=512.00MiB, used=0.00B
 
 This filesystem, at the allocation level, is very full (see below).
 
  And I had hangs with BTRFS again. This time as I wanted to install tax
  return software in Virtualbox´d Windows XP VM (which I use once a year
  cause I know no tax return software for Linux which would be suitable for
  Germany and I frankly don´t care about the end of security cause all
  surfing and other network access I will do from the Linux box and I only
  run the VM behind a firewall).
 
  And thus I try the balance dance again:
 ITEM: Balance... it doesn't do what you think it does... 8-)
 
 Balancing is something you should almost never need to do. It is only
 for cases of changing geometry (adding disks, switching RAID levels,
 etc.) or for cases when you've radically changed allocation behaviors
 (like you decided to remove all your VM's or you've decided to remove a
 mail spool directory full of thousands of tiny files).
 
 People run balance all the time because they think they should. They are
 _usually_ incorrect in that belief.

I only see the lockups of BTRFS if the trees *occupy* all space on the device.

I have *never* so far seen it lock up if there is still space BTRFS can
allocate from to *extend* a tree.

This may be a bug, but this is what I see.

And no amount of "you should not balance a BTRFS" will make that perception
go away.

See, I see the sun coming up in the morning and you tell me "no, it
doesn´t". Simply that is not going to match my perception.

  merkaba:~ btrfs balance start -dusage=5 -musage=5 /home
  ERROR: error during balancing '/home' - No space left on device
 
 ITEM: Running out of space during a balance is not running out of space
 for files. BTRFS has two layers of allocation. That is, there are two
 levels of abstraction where no space can occur.

I understand that *very* well. I know about the allocation of *device* space
for a tree, and I know about the allocation *inside* a tree.

 The first level of allocation is the making more BTRFS structures out

Skipped the rest of the explanation that I already know. 

I also don´t buy the "the SSD makes the kworker thread use 100% for minutes"
explanation - *while* these SSDs are basically idling. A Sandybridge core is
not exactly slow, and these are still consumer SSDs; we are not talking about
a million IOPS here.

And again:

This does not ever happen when the trees do *not* fully allocate all device
space. Even the defragmentation of the Windows XP VM ran fine until after the
trees allocated all space on the device again.

Try to reread the last two sentences in case it doesn´t sink in.


That's why I consider it a bug. I totally agree with you that a balance should
not be necessary, but in my observation it is. That is the actual bug.




And no, no one needs to tell me to nocow the file. Even the extents are no
issue: not with SSDs, which provide good enough random access.

My interpretation from what I see is this: BTRFS free space *in tree* handling
is still not up to production quality.


Now you either try out what I describe and see whether you perceive the same, 
or if you don´t, please don´t argue with my perception. You can argue with my 
conclusion, but I know what I see here. Thanks.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Hugo Mills
On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
 Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
  On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
   Hello!
   
   First: Have a merry christmas and enjoy a quiet time in these days.
   
   Second: At a time you feel like it, here is a little rant, but also a bug
   report:
   
   I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
   space_cache, skinny meta data extents – are these a problem? – and
  
   compress=lzo:
  (there is no known problem with skinny metadata, it's actually more
  efficient than the older format. There have been some anecdotes about
  mixing the skinny and fat metadata but nothing has ever been
  demonstrated problematic.)
  
   merkaba:~ btrfs fi sh /home
   Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
   
Total devices 2 FS bytes used 144.41GiB
devid1 size 160.00GiB used 160.00GiB path
/dev/mapper/msata-home
devid2 size 160.00GiB used 160.00GiB path
/dev/mapper/sata-home
   
   Btrfs v3.17
   merkaba:~ btrfs fi df /home
   Data, RAID1: total=154.97GiB, used=141.12GiB
   System, RAID1: total=32.00MiB, used=48.00KiB
   Metadata, RAID1: total=5.00GiB, used=3.29GiB
   GlobalReserve, single: total=512.00MiB, used=0.00B
  
  This filesystem, at the allocation level, is very full (see below).
  
   And I had hangs with BTRFS again. This time as I wanted to install tax
   return software in Virtualbox´d Windows XP VM (which I use once a year
   cause I know no tax return software for Linux which would be suitable for
   Germany and I frankly don´t care about the end of security cause all
   surfing and other network access I will do from the Linux box and I only
   run the VM behind a firewall).
  
   And thus I try the balance dance again:
  ITEM: Balance... it doesn't do what you think it does... 8-)
  
  Balancing is something you should almost never need to do. It is only
  for cases of changing geometry (adding disks, switching RAID levels,
  etc.) or for cases when you've radically changed allocation behaviors
  (like you decided to remove all your VM's or you've decided to remove a
  mail spool directory full of thousands of tiny files).
  
  People run balance all the time because they think they should. They are
  _usually_ incorrect in that belief.
 
 I only see the lockups of BTRFS if the trees *occupy* all space on the device.

   No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
space. What's more, balance does *not* balance the metadata trees. The
remaining space -- 154.97 GiB -- is unstructured storage for file
data, and you have some 13 GiB of that available for use.

   Now, since you're seeing lockups when the space on your disks is
all allocated I'd say that's a bug. However, you're the *only* person
who's reported this as a regular occurrence. Does this happen with all
filesystems you have, or just this one?

 I have *never* so far seen it lock up if there is still space BTRFS can
 allocate from to *extend* a tree.

   It's not a tree. It's simply space allocation. It's not even space
*usage* you're talking about here -- it's just allocation (i.e. the FS
saying I'm going to use this piece of disk for this purpose).

 This may be a bug, but this is what I see.
 
 And no amount of "you should not balance a BTRFS" will make that perception
 go away.
 
 See, I see the sun coming up in the morning and you tell me "no, it
 doesn´t". Simply that is not going to match my perception.

   Duncan's assertion is correct in its detail. Looking at your space
usage, I would not suggest that running a balance is something you
need to do. Now, since you have these lockups that seem quite
repeatable, there's probably a lurking bug in there, but hacking
around with balance every time you hit it isn't going to get the
problem solved properly.

   I think I would suggest the following:

 - make sure you have some way of logging your dmesg permanently (use
   a different filesystem for /var/log, or a serial console, or a
   netconsole)

 - when the lockup happens, hit Alt-SysRq-t a few times

 - send the dmesg output here, or post to bugzilla.kernel.org

   That's probably going to give enough information to the developers
to work out where the lockup is happening, and is clearly the way
forward here.
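
   (A minimal sketch of that netconsole + SysRq capture -- all
addresses, ports and interface names below are placeholders:

echo 1 > /proc/sys/kernel/sysrq     # make sure SysRq is enabled
modprobe netconsole \
    netconsole=6665@192.168.0.10/eth0,6666@192.168.0.20/aa:bb:cc:dd:ee:ff
# if a shell still responds during the hang, this is the non-keyboard
# equivalent of Alt-SysRq-t:
echo t > /proc/sysrq-trigger

The task dump then lands in the kernel log stream on the receiving
machine.)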

   Hugo.

-- 
Hugo Mills | w.w.w. -- England's batting scorecard
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0  |




Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
 On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
  Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
   On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
Hello!

First: Have a merry christmas and enjoy a quiet time in these days.

Second: At a time you feel like it, here is a little rant, but also a
bug
report:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
space_cache, skinny meta data extents – are these a problem? – and
   
compress=lzo:
   (there is no known problem with skinny metadata, it's actually more
    efficient than the older format. There have been some anecdotes about
   mixing the skinny and fat metadata but nothing has ever been
   demonstrated problematic.)
   
merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a

 Total devices 2 FS bytes used 144.41GiB
 devid1 size 160.00GiB used 160.00GiB path
 /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path
 /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
   
   This filesystem, at the allocation level, is very full (see below).
   
And I had hangs with BTRFS again. This time as I wanted to install tax
return software in Virtualbox´d Windows XP VM (which I use once a year
cause I know no tax return software for Linux which would be suitable
for
Germany and I frankly don´t care about the end of security cause all
surfing and other network access I will do from the Linux box and I
only
run the VM behind a firewall).
   
And thus I try the balance dance again:
   ITEM: Balance... it doesn't do what you think it does... 8-)
   
   Balancing is something you should almost never need to do. It is only
   for cases of changing geometry (adding disks, switching RAID levels,
    etc.) or for cases when you've radically changed allocation behaviors
   (like you decided to remove all your VM's or you've decided to remove a
   mail spool directory full of thousands of tiny files).
   
   People run balance all the time because they think they should. They are
   _usually_ incorrect in that belief.
  
  I only see the lockups of BTRFS if the trees *occupy* all space on the
  device.
No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
 space. What's more, balance does *not* balance the metadata trees. The
 remaining space -- 154.97 GiB -- is unstructured storage for file
 data, and you have some 13 GiB of that available for use.

Ok, let me rephrase that: Then the space *reserved* for the trees occupies all
space on the device. Or okay: when the sum of what I see in btrfs fi df as
total occupies what I see as size in btrfs fi sh, i.e. when used equals
size in btrfs fi sh.

What happened here is this:

I tried

 https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual

in order to regain some space from the Windows XP VDI file. I just wanted to 
get around upsizing the BTRFS again.

And on the defragmentation step in Windows it first ran fast. For about 46-47% 
there, during that fast phase btrfs fi df showed that BTRFS was quickly 
reserving the remaining free device space for data trees (not metadata).

Only after a while after it did so, it got slow again, basically the Windows 
defragmentation process stopped at 46-47% altogether and then after a while 
even the desktop locked due to processes being blocked in I/O.

I decided to forget about this downsizing of the Virtualbox VDI file, it will 
extend again on next Windows work and it is already 18 GB of its maximum 20GB, 
so… I dislike the approach anyway, and don´t even understand why the 
defragmentation step would be necessary as I think Virtualbox can poke holes 
into the file for any space not allocated inside the VM, whether it is 
defragmented or not.

Now, since you're seeing lockups when the space on your disks is
 all allocated I'd say that's a bug. However, you're the *only* person
 who's reported this as a regular occurrence. Does this happen with all
 filesystems you have, or just this one?

The *only* person? The compression lockups with 3.15 and 3.16, quite some 
people saw them, I thought. For me also these lockups only happened with all 
space on device allocated.

And these seem to be gone. In regular use it doesn´t lock up totally hard. But
in the "a process writes a lot into one big no-cowed file" case, it seems it
can still get into a lockup, but this time one where a kworker thread consumes
100% of CPU for minutes.

  I *never* so far saw it lockup if there is still space BTRFS can allocate
  from to 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
  
 
  I only see the lockups of BTRFS if the trees *occupy* all space on the
  device.
No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
 space. What's more, balance does *not* balance the metadata trees. The
 remaining space -- 154.97 GiB -- is unstructured storage for file
 data, and you have some 13 GiB of that available for use.
 
Now, since you're seeing lockups when the space on your disks is
 all allocated I'd say that's a bug. However, you're the *only* person
 who's reported this as a regular occurrence. Does this happen with all
 filesystems you have, or just this one?

Okay, just about terms.

What I call trees is this:

merkaba:~ btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B

For me each one of Data, System, Metadata and GlobalReserve is what I 
call a tree.

How would you call it?

I always thought that BTRFS uses a tree structure not only for metadata, but 
also for data. But I bet, strictly speaking, that's only to *manage* the chunks
it allocates, and what I see above is the actual chunk usage.

I.e. to get terms straight, how would you call it? I think my understanding of 
how BTRFS handles space allocation is quite correct, but I may use a term 
incorrectly.

I read

 Data, RAID1: total=27.99GiB, used=17.21GiB

as:

I reserved 27.99 GiB for data chunks and used 17.21 GiB in these data chunks
so far. So I have about 10.5 GiB free in these data chunks at the moment and
all is good.

What it doesn´t tell me at all is how the allocated space is distributed onto
these chunks. It may be that some chunks are completely empty, or not. It may
be that each chunk has some space allocated to it but in total there is that
amount of free space yet. I.e. it doesn´t tell me anything about the free
space fragmentation inside the chunks.
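
As an aside, none of the commands above show the per-chunk fill either.
Newer btrfs-progs (v3.18 or later, if I recall correctly) at least make
the allocated-vs-used distinction explicit in one command:

merkaba:~ btrfs filesystem usage /home

But even that says nothing about fragmentation inside individual chunks.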

Yet I still hold my theory that in the case of heavily writing to a COW´d file 
BTRFS seems to prefer to reserve new empty chunks on this /home filesystem of 
my laptop instead of trying to find free space in existing only partially empty 
chunks. And the lockup only happens when it tries to do the latter. And no, I 
think it shouldn´t lockup then. I also think its a bug. I never said 
differently.

And yes, I only ever had this on my /home so far. Not on /, which is also RAID
1 and has had all device space reserved for quite some time; not on /daten,
which only holds large files and is single instead of RAID. Also not on the
server, but the server FS still has lots of unallocated device space; nor on
the 2 TiB eSATA backup HD, although I do get the impression that BTRFS has
started to get slower there as well: at least the rsync-based backup script
takes quite long meanwhile, and I see rsync reading from the backup BTRFS and
in this case almost fully utilizing the disk for longer times. But unlike my
/home, the backup disk has some snapshots widely distributed in time (about
2-week to 1-month intervals, covering about the last half year).

Neither /home nor / on the SSD have snapshots at the moment. So this is 
happening without snapshots.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 02:54 AM, Martin Steigerwald wrote:

Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:

On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:

Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:

On 12/26/2014 05:37 AM, Martin Steigerwald wrote:

Hello!

First: Have a merry christmas and enjoy a quiet time in these days.

Second: At a time you feel like it, here is a little rant, but also a
bug
report:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
space_cache, skinny meta data extents – are these a problem? – and



compress=lzo:

(there is no known problem with skinny metadata, it's actually more
efficient than the older format. There have been some anecdotes about
mixing the skinny and fat metadata but nothing has ever been
demonstrated problematic.)


merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a

  Total devices 2 FS bytes used 144.41GiB
  devid1 size 160.00GiB used 160.00GiB path
  /dev/mapper/msata-home
  devid2 size 160.00GiB used 160.00GiB path
  /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


This filesystem, at the allocation level, is very full (see below).


And I had hangs with BTRFS again. This time as I wanted to install tax
return software in Virtualbox´d Windows XP VM (which I use once a year
cause I know no tax return software for Linux which would be suitable
for
Germany and I frankly don´t care about the end of security cause all
surfing and other network access I will do from the Linux box and I
only
run the VM behind a firewall).



And thus I try the balance dance again:

ITEM: Balance... it doesn't do what you think it does... 8-)

Balancing is something you should almost never need to do. It is only
for cases of changing geometry (adding disks, switching RAID levels,
etc.) or for cases when you've radically changed allocation behaviors
(like you decided to remove all your VM's or you've decided to remove a
mail spool directory full of thousands of tiny files).

People run balance all the time because they think they should. They are
_usually_ incorrect in that belief.


I only see the lockups of BTRFS if the trees *occupy* all space on the
device.

No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
space. What's more, balance does *not* balance the metadata trees. The
remaining space -- 154.97 GiB -- is unstructured storage for file
data, and you have some 13 GiB of that available for use.


Ok, let me rephrase that: Then the space *reserved* for the trees occupies all
space on the device. Or okay: when the sum of what I see in btrfs fi df as
total occupies what I see as size in btrfs fi sh, i.e. when used equals
size in btrfs fi sh.

What happened here is this:

I tried

  https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual

in order to regain some space from the Windows XP VDI file. I just wanted to
get around upsizing the BTRFS again.

And on the defragmentation step in Windows it first ran fast. For about 46-47%
there, during that fast phase btrfs fi df showed that BTRFS was quickly
reserving the remaining free device space for data trees (not metadata).


The above statement is word-salad. The storage for data is not a data 
tree, the tree that maps data into a file is metadata. The data is 
data. There is no data tree.



Only after a while after it did so, it got slow again, basically the Windows
defragmentation process stopped at 46-47% altogether and then after a while
even the desktop locked due to processes being blocked in I/O.


If you've over-organized your very-large data files you can waste 
some terrific amounts of space.


[---]
  [---] [uuu]  [] [-]
  [--] [-][]   [---]
   []

As you write new segments you don't actually free the lower extents 
unless they are _completely_ obscured end-to-end by a later extent. So 
if you've _ever_ defragged the BTRFS extent to be fully contiguous and 
you've not overwritten each and every byte later, the original expanse 
is still going to be there.


In the above example only the "uuu" block is ever freed, and only when 
the fourth generation finally covers the little gap.


In the worst case you can end up with (N*(N+1))/2 total blocks used up 
on disk when only N blocks are visible. (See the Gauss equation for the 
sum of consecutive integers for why this is the correct approximation 
for the worst case.)


[]
[---]
[--]
...
[-]

Each generation, being one block shorter than the previous one, exposes 
N blocks, one from each generation. So 1+2+3+4+5...+N blocks allocated 
if each overwrite is one block shorter 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 03:11 AM, Martin Steigerwald wrote:

Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:



I only see the lockups of BTRFS if the trees *occupy* all space on the
device.

No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
space. What's more, balance does *not* balance the metadata trees. The
remaining space -- 154.97 GiB -- is unstructured storage for file
data, and you have some 13 GiB of that available for use.

Now, since you're seeing lockups when the space on your disks is
all allocated I'd say that's a bug. However, you're the *only* person
who's reported this as a regular occurrence. Does this happen with all
filesystems you have, or just this one?


Okay, just about terms.


Terms are _really_ important if you want to file and discuss bugs.


What I call trees is this:

merkaba:~ btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B

For me each one of Data, System, Metadata and GlobalReserve is what I
call a tree.

How would you call it?


Those are extents, I think. All of the trees are in the metadata. One 
of the trees is the extent tree. The extent tree is what contains the 
list of which regions of the disk are data, or metadata, or 
system metadata (like the superblocks), or the global reserve.


Those extents are then filled with the type of information described.

But all the trees are in the metadata extents.



I always thought that BTRFS uses a tree structure not only for metadata, but
also for data. But I bet strictly spoken thats only to *manage* the chunks it
allocates and what I see above is the actual chunk usage.

I.e. to get terms straight, how would you call it? I think my understanding of
how BTRFS handles space allocation is quite correct, but I may use a term
incorrectly.

I read


Data, RAID1: total=27.99GiB, used=17.21GiB


as:

I reserved 27.99 GiB for data chunks and used 17.21 GiB in these data chunks
so far. So I have about 10.5 GiB free in these data chunks at the moment and
all is good.

What it doesn´t tell me at all is how the allocated space is distributed onto
these chunks. It may be that some chunks are completely empty, or not. It may be
that each chunk has some space allocated to it but in total there is that
amount of free space yet. I.e. it doesn´t tell me anything about the free
space fragmentation inside the chunks.

Yet I still hold my theory that in the case of heavily writing to a COW´d file
BTRFS seems to prefer to reserve new empty chunks on this /home filesystem of
my laptop instead of trying to find free space in existing only partially empty
chunks. And the lockup only happens when it tries to do the latter. And no, I
think it shouldn´t lockup then. I also think its a bug. I never said
differently.


Partly correct. The system (as I understand it) will try to fill old 
chunks before allocating new ones. It also prefers the most empty 
chunk first. But if you fallocate large extents they can have trouble 
finding a home. So let's say you have a systemic process that keeps 
making .51GiB files; then it will tend to allocate a new 1GiB data extent 
each time (presuming you used default values) because each successive 
.51GiB region cannot fit in any existing data extent.
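
A quick way to watch that effect (hypothetical numbers, assuming the
default 1GiB data chunks):

for i in 1 2 3 4 5; do
    fallocate -l 550M /mnt/Work/big$i
    btrfs fi df /mnt/Work    # watch the Data "total=" figure step up
done

Each ~0.51GiB file is too large for the free tail of any partially
filled chunk, so each one tends to force a fresh chunk allocation.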


Excessive snapshotting can also contribute to this effect, but only 
because it freezes the history.


There are some other odd corner cases.


And yes, I only ever had this on my /home so far. Not on /, which is also RAID
1 and has had all device space reserved for quite some time; not on /daten,
which only holds large files and is single instead of RAID. Also not on the
server, but the server FS still has lots of unallocated device space; nor on
the 2 TiB eSATA backup HD, although I do get the impression that BTRFS has
started to get slower there as well: at least the rsync-based backup script
takes quite long meanwhile, and I see rsync reading from the backup BTRFS and
in this case almost fully utilizing the disk for longer times. But unlike my
/home, the backup disk has some snapshots widely distributed in time (about
2-week to 1-month intervals, covering about the last half year).

Neither /home nor / on the SSD have snapshots at the moment. So this is
happening without snapshots.

Ciao,





Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:
  My theory from watching the Windows XP defragmentation case is this:
  
  - For writing into the file BTRFS needs to actually allocate and use free
  space in the current tree allocation, or, as we seem to have misunderstood
  each other about the words we use, it needs to fit the data in
  
  Data, RAID1: total=144.98GiB, used=140.94GiB
  
  between 144.98 GiB and 140.94 GiB, given that the total space of this
  tree - or, if it's not a tree, of the chunks that the tree manages -
  can *not* be extended anymore.
 
 If your file was actually COW (and you have _not_ been taking snapshots) 
 then there is no extending to be had. But if you are using snapper 
 (which I believe you mentioned previously) then the snapshots cause a 
 write boundary and a layer of copying. Frequently taking snapshots of a 
 COW file is self defeating. If you are going to take snapshots then you 
 might as well turn copy on write back on and, for the love of pete, stop 
 defragging things.

I don´t use any snapshots on the filesystems. None, zero, zilch, nada.

And as I understand it, copy on write means: it has to write the new write 
requests somewhere else. For this it needs to allocate space, either 
within existing chunks or in a newly allocated one.

So for COW, when writing to a file it will always need to allocate new space 
(although it can forget about the old space afterwards, unless a 
snapshot is holding it).

Anyway, I got it reproduced. And I am about to write a lengthy mail about it.

It can easily be reproduced without even using Virtualbox, just by a nice 
simple fio job.
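
Something along these lines (a hypothetical job sketch, not my exact
one; the parameters here are just guesses at the shape of the "random
write into big file" workload):

fio --name=randwrite --directory=/home/martin/fio-test \
    --size=4g --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=16 \
    --runtime=300 --time_based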

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:

Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:

My theory from watching the Windows XP defragmentation case is this:

- For writing into the file BTRFS needs to actually allocate and use free
space in the current tree allocation, or, as we seem to have misunderstood
each other about the words we use, it needs to fit the data in

Data, RAID1: total=144.98GiB, used=140.94GiB

between 144.98 GiB and 140.94 GiB, given that the total space of this
tree - or, if it's not a tree, of the chunks that the tree manages -
can *not* be extended anymore.


If your file was actually COW (and you have _not_ been taking snapshots)
then there is no extending to be had. But if you are using snapper
(which I believe you mentioned previously) then the snapshots cause a
write boundary and a layer of copying. Frequently taking snapshots of a
COW file is self defeating. If you are going to take snapshots then you
might as well turn copy on write back on and, for the love of pete, stop
defragging things.


I don´t use any snapshots on the filesystems. None, zero, zilch, nada.

And as I understand it, copy on write means: it has to write the new write
requests somewhere else. For this it needs to allocate space, either
within existing chunks or in a newly allocated one.

So for COW, when writing to a file it will always need to allocate new space
(although it can forget about the old space afterwards, unless a
snapshot is holding it).


It can _only_ forget about the space if absolutely _all_ of the old 
extent is overwritten. So if you write 1MiB, then you go back and 
overwrite 1MiB-4Kib, then you go back and write 1MiB-8KiB, you've now 
got 3MiB-12KiB to represent 1MiB of data. No snapshots involved. The 
worst case is quite well understood.


[...--] 1MiB
[...-]  1MiB-4KiB
[...]   1MiB-8KiB


BTRFS will _NOT_ reclaim part of any extent. So if this kept going 
it would take 250 diminishing overwrites, each 4k less than the prior:


1MiB == 250 4k blocks.
(250*(250+1))/2 = 31375 4K blocks, or roughly 122.6 MiB of storage allocated 
and dedicated to representing 1MiB of accessible data.


This is a worst case, of course, but it exists and it's _horrible_.
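
The arithmetic generalises: n diminishing 4 KiB overwrites pin n(n+1)/2
blocks to represent n blocks of visible data. A throwaway sketch of the
calculation:

#!/bin/bash
# Sketch: worst-case pinned space for n diminishing 4 KiB overwrites.
n=${1:-250}
blocks=$(( n * (n + 1) / 2 ))
echo "$n overwrites pin $blocks 4KiB blocks (~$(( blocks * 4 / 1024 )) MiB)"
echo "...to represent $(( n * 4 )) KiB of visible data"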

And such a file can be burped by doing a copy-and-rename, returning it 
to a single 1MiB extent. (I don't know if a btrfs defrag 
would have identical results, but I think it would.)
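
In shell terms the copy-and-rename burp is just the following sketch
(file names hypothetical; cp must actually copy the bytes, so a reflink
clone is explicitly disabled):

#!/bin/bash
# Sketch: rewrite a fragmented file into (ideally) one fresh extent.
cp --reflink=never some_file some_file.tmp
mv some_file.tmp some_file
sync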


The problem is that there isn't (yet) a COW-safe way to discard partial 
extents. That is, there is no universally safe way (yet implemented) to 
split that first 1MiB in place into one extent of 1MiB-4K plus one 4K 
extent, so there is no way (yet) to prevent this worst case.


Doing things like excessive defragging at the BTRFS level, and 
defragging inside of a VM, and using certain file types can lead to 
pretty awful data wastage. YMMV.


i.e. too much tidying up and you make a mess.

I offered a pseudocode example a few days back on how this problem might 
be dealt with in future, but I've not seen any feedback on it.




Anyway, I got it reproduced. And am about to write a lengthy mail about.


Have fun with that lengthy email, but the devs already know about the 
data waste profile of the system. They just don't have a good solution yet.


Practical use cases involving _not_ defragging and _not_ packing files, 
or disabling COW and using raw image formats for VM disk storage are, 
meanwhile, also well understood.




It can easily be reproduced without even using Virtualbox, just by a nice
simple fio job.



Yep. As I've explained twice now.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Summarized at

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes 
on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

see below. This is reproducible with fio, no need for Windows XP in
Virtualbox for reproducing the issue. Next I will try to reproduce it with
a freshly created filesystem.


Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
 On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
  Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
   On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
Hello!

First: Have a merry christmas and enjoy a quiet time in these days.

Second: At a time you feel like it, here is a little rant, but also a
bug
report:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:
   (there is no known problem with skinny metadata, it's actually more
   efficient than the older format. There have been some anecdotes about
   mixing the skinny and fat metadata but nothing has ever been
   demonstrated to be problematic.)
   
merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a

 Total devices 2 FS bytes used 144.41GiB
 devid1 size 160.00GiB used 160.00GiB path
 /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path
 /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
   
   This filesystem, at the allocation level, is very full (see below).
   
And I had hangs with BTRFS again. This time as I wanted to install tax
return software in a Virtualbox´d Windows XP VM (which I use once a year
cause I know no tax return software for Linux which would be suitable for
Germany, and I frankly don´t care about the end of security support cause
all surfing and other network access I will do from the Linux box and I
only run the VM behind a firewall).
   
And thus I try the balance dance again:
   ITEM: Balance... it doesn't do what you think it does... 8-)
   
   Balancing is something you should almost never need to do. It is only
   for cases of changing geometry (adding disks, switching RAID levels,
   etc.) or for cases when you've radically changed allocation behaviors
   (like you decided to remove all your VMs or you've decided to remove a
   mail spool directory full of thousands of tiny files).
   
   People run balance all the time because they think they should. They are
   _usually_ incorrect in that belief.
  
  I only see the lockups of BTRFS if the trees *occupy* all space on the
  device.
No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
 space. What's more, balance does *not* balance the metadata trees. The
 remaining space -- 154.97 GiB -- is unstructured storage for file
 data, and you have some 13 GiB of that available for use.
 
Now, since you're seeing lockups when the space on your disks is
 all allocated I'd say that's a bug. However, you're the *only* person
 who's reported this as a regular occurrence. Does this happen with all
 filesystems you have, or just this one?
 
  I *never* so far saw it lockup if there is still space BTRFS can allocate
  from to *extend* a tree.
 
It's not a tree. It's simply space allocation. It's not even space
 *usage* you're talking about here -- it's just allocation (i.e. the FS
 saying I'm going to use this piece of disk for this purpose).
 
  This may be a bug, but this is what I see.
  
  And no amount of you should not balance a BTRFS will make that
  perception go away.
  
  See, I see the sun coming out on a morning and you tell me no, it
  doesn´t. Simply that is not going to match my perception.
 
Duncan's assertion is correct in its detail. Looking at your space

Robert's :)

 usage, I would not suggest that running a balance is something you
 need to do. Now, since you have these lockups that seem quite
 repeatable, there's probably a lurking bug in there, but hacking
 around with balance every time you hit it isn't going to get the
 problem solved properly.
 
I think I would suggest the following:
 
  - make sure you have some way of logging your dmesg permanently (use
a different filesystem for /var/log, or a serial console, or a
netconsole)
 
  - when the lockup happens, hit Alt-SysRq-t a few times
 
  - send the dmesg output here, or post to bugzilla.kernel.org
 
That's probably going to give enough information to the developers
 to work out where the lockup is happening, and is clearly the way
 forward here.

And I got it reproduced. *Perfectly* reproduced, I´d say.

But let me run 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:

It can easily be reproduced without even using Virtualbox, just by a nice
simple fio job.



TL;DR: If you want a worst-case example of consuming a BTRFS filesystem 
with one single file...


#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
 dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit


Each pass over /some/file is 4k shorter than the previous one, but none 
of the extents can be deallocated. The file will be 1MiB in size and usage 
will be something like 122.6MiB (if I've done the math correctly). 
Larger values of counter will result in quadratically larger amounts of 
waste.


Doing the bad things is very bad...
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 05:49:48 schrieb Robert White:
  Anyway, I got it reproduced. And am about to write a lengthy mail about it.
 
 Have fun with that lengthy email, but the devs already know about the 
 data waste profile of the system. They just don't have a good solution yet.
 
 Practical use cases involving _not_ defragging and _not_ packing files, 
 or disabling COW and using raw image formats for VM disk storage are, 
 meanwhile, also well understood.

Okay, then how about a database?

BTRFS is not usable for this kind of workload then.

And thats about it.

Not even on SSD.

Yet, what I have shown in my lengthy mail is pathological.

It's even abysmal.

And yet it only happens when BTRFS is forced to pack things into *existing* 
chunks. It does not happen when BTRFS can still reserve new chunks and write 
to them.
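
Whether a filesystem is in that state shows up in the two views discussed
all through this thread, e.g. (path hypothetical):

#!/bin/bash
# Sketch: spot the "all raw space already allocated to chunks" condition.
btrfs filesystem show /home   # per-device "used" == "size" => fully allocated
btrfs filesystem df /home     # total vs. used => slack left inside the chunks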

And this makes all the talk that you should not need to rebalance obsolete 
when in practice you need to in order to get decent performance. To get out 
of your SSDs what your SSDs can provide, instead of waiting for BTRFS to 
finish being busy with itself.

Still, I have only yet reproduced it on this /home filesystem. If that is also 
reproducible on a freshly created filesystem after some runs of the fio job I 
provided, I´d say that there is a performance bug in BTRFS. And that's it.

No amount of talking about technicalities may turn this performance bug 
observation away. Heck, 254 IOPS from a dual SSD RAID 1? Are you even kidding me?

I refuse to believe that this is built into the design, no matter how much you 
outline its limitations.

And if it is?

Well… then maybe BTRFS won´t save us. Unless you give it a ton of extra free 
space. Unless you do as I recommend: if you use 25 GB, you make it 100 GB 
big so it will always find enough space to waste.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
 On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
  It can easily be reproduced without even using Virtualbox, just by a nice
  simple fio job.
 
 TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
 with one single file...
 
 #!/bin/bash
 # not tested, so correct any syntax errors
 typeset -i counter
 for ((counter=250; counter>0; counter--)); do
   dd if=/dev/urandom of=/some/file bs=4k count=$counter
 done
 exit
 
 
 Each pass over /some/file is 4k shorter than the previous one, but none
 of the extents can be deallocated. The file will be 1MiB in size and usage
 will be something like 122.6MiB (if I've done the math correctly).
 Larger values of counter will result in quadratically larger amounts of
 waste.

Robert, I experienced these hang issues even before the defragmenting case. It 
happened while I just installed a 400 MiB tax return application to it (that is 
no joke, it is that big).

It happens while just using the VM.

Yes, I recommend not to use BTRFS for any VM image or any larger database on 
rotating storage, for exactly those COW semantics.

But on SSD?

It's busy-looping a CPU core while the flash is basically idling.

I refuse to believe that this is by design.

I do think there is a *bug*.

Either acknowledge it and try to fix it, or say it's by design *without even 
looking at it closely enough to be sure that it is not a bug* and limit your 
own possibilities by it.

I´d rather see it treated as a bug for now.

Come on, 254 IOPS on a filesystem with still 17 GiB of free space while 
randomly writing to a 4 GiB file.

People do these kinds of things. Ditch that defrag Windows XP VM case, I had 
performance issues even before, just by installing things to it. Databases, VMs, 
emulators. And heck, even while just *creating* the file with fio, as I showed.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 06:00 AM, Robert White wrote:

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:

It can easily be reproduced without even using Virtualbox, just by a nice
simple fio job.



TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
  dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit 0


Slight correction: you need to prevent the truncate dd performs by 
default, and flush the data and metadata to disk after each 
invocation. So you need the conv= flags.


for ((counter=250; counter>0; counter--)); do
dd if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter
done






Each pass over /some/file is 4k shorter than the previous one, but none
of the extents can be deallocated. The file will be 1MiB in size and usage
will be something like 122.6MiB (if I've done the math correctly).
Larger values of counter will result in quadratically larger amounts of
waste.

Doing the bad things is very bad...
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald:
 Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
  On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
   It can easily be reproduced without even using Virtualbox, just by a
   nice
   simple fio job.
  
  TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
  with one single file...
  
  #!/bin/bash
  # not tested, so correct any syntax errors
  typeset -i counter
  for ((counter=250; counter>0; counter--)); do
  
dd if=/dev/urandom of=/some/file bs=4k count=$counter
  
  done
  exit
  
  
  Each pass over /some/file is 4k shorter than the previous one, but none
  of the extents can be deallocated. The file will be 1MiB in size and usage
  will be something like 122.6MiB (if I've done the math correctly).
  Larger values of counter will result in quadratically larger amounts of
  waste.
 
 Robert, I experienced these hang issues even before the defragmenting case.
 It happened while I just installed a 400 MiB tax return application to it
 (that is no joke, it is that big).
 
 It happens while just using the VM.
 
 Yes, I recommend not to use BTRFS for any VM image or any larger database on
 rotating storage for exactly that COW semantics.
 
 But on SSD?
 
 It's busy-looping a CPU core while the flash is basically idling.
 
 I refuse to believe that this is by design.
 
 I do think there is a *bug*.
 
 Either acknowledge it and try to fix it, or say it's by design *without even
 looking at it closely enough to be sure that it is not a bug* and limit your
 own possibilities by it.
 
 I´d rather see it treated as a bug for now.
 
 Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
 randomly writing to a 4 GiB file.
 
 People do these kinds of things. Ditch that defrag Windows XP VM case, I had
 performance issues even before, just by installing things to it. Databases,
 VMs, emulators. And heck, even while just *creating* the file with fio, as I
 showed.

Add to these use cases things like this:

martin@merkaba:~/.local/share/akonadi/db_data/akonadi ls -lSh | head -5
insgesamt 2,2G
-rw-rw 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
-rw-rw 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
-rw-rw 1 martin martin  23M Dez 27 15:17 pimitemflagrelation.ibd
-rw-rw 1 martin martin 240K Dez 27 15:17 collectiontable.ibd


Or this:

martin@merkaba:~/.local/share/baloo du -sch * | sort -rh
9,2G    insgesamt
8,0G    email
1,2G    file
51M     emailContacts
408K    contacts
76K     notes
16K     calendars

martin@merkaba:~/.local/share/baloo ls -lSh email | head -5
insgesamt 8,0G
-rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
-rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
-rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
-rw-r--r-- 1 martin martin  63K Dez 27 15:16 postlist.baseA



These will not be as bad as the fio test case, but still these files are
written into. They are updated in place.

And thats running on every Plasma desktop by default. And on GNOME desktops
there is similar stuff.

I haven´t seen this spike out a kworker yet though, so maybe the workload is 
light enough not to trigger it that easily.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 05:55 AM, Martin Steigerwald wrote:

Summarized at

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes 
on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

see below. This is reproducible with fio, no need for Windows XP in
Virtualbox for reproducing the issue. Next I will try to reproduce it with
a freshly created filesystem.


Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:

On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:

Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:

On 12/26/2014 05:37 AM, Martin Steigerwald wrote:

Hello!

First: Have a merry christmas and enjoy a quiet time in these days.

Second: At a time you feel like it, here is a little rant, but also a
bug
report:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:

(there is no known problem with skinny metadata, it's actually more
efficient than the older format. There have been some anecdotes about
mixing the skinny and fat metadata but nothing has ever been
demonstrated to be problematic.)


merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a

  Total devices 2 FS bytes used 144.41GiB
  devid1 size 160.00GiB used 160.00GiB path
  /dev/mapper/msata-home
  devid2 size 160.00GiB used 160.00GiB path
  /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


This filesystem, at the allocation level, is very full (see below).


And I had hangs with BTRFS again. This time as I wanted to install tax
return software in a Virtualbox´d Windows XP VM (which I use once a year
cause I know no tax return software for Linux which would be suitable for
Germany, and I frankly don´t care about the end of security support cause
all surfing and other network access I will do from the Linux box and I
only run the VM behind a firewall).



And thus I try the balance dance again:

ITEM: Balance... it doesn't do what you think it does... 8-)

Balancing is something you should almost never need to do. It is only
for cases of changing geometry (adding disks, switching RAID levels,
etc.) or for cases when you've radically changed allocation behaviors
(like you decided to remove all your VMs or you've decided to remove a
mail spool directory full of thousands of tiny files).

People run balance all the time because they think they should. They are
_usually_ incorrect in that belief.


I only see the lockups of BTRFS if the trees *occupy* all space on the
device.

No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
space. What's more, balance does *not* balance the metadata trees. The
remaining space -- 154.97 GiB -- is unstructured storage for file
data, and you have some 13 GiB of that available for use.

Now, since you're seeing lockups when the space on your disks is
all allocated I'd say that's a bug. However, you're the *only* person
who's reported this as a regular occurrence. Does this happen with all
filesystems you have, or just this one?


I *never* so far saw it lockup if there is still space BTRFS can allocate
from to *extend* a tree.


It's not a tree. It's simply space allocation. It's not even space
*usage* you're talking about here -- it's just allocation (i.e. the FS
saying I'm going to use this piece of disk for this purpose).


This may be a bug, but this is what I see.

And no amount of you should not balance a BTRFS will make that
perception go away.

See, I see the sun coming out on a morning and you tell me no, it
doesn´t. Simply that is not going to match my perception.


Duncan's assertion is correct in its detail. Looking at your space


Robert's :)


usage, I would not suggest that running a balance is something you
need to do. Now, since you have these lockups that seem quite
repeatable, there's probably a lurking bug in there, but hacking
around with balance every time you hit it isn't going to get the
problem solved properly.

I think I would suggest the following:

  - make sure you have some way of logging your dmesg permanently (use
a different filesystem for /var/log, or a serial console, or a
netconsole)

  - when the lockup happens, hit Alt-SysRq-t a few times

  - send the dmesg output here, or post to bugzilla.kernel.org

That's probably going to give enough information to the developers
to work out where the lockup is happening, and is clearly the way
forward here.


And I got it reproduced. *Perfectly* reproduced, I´d say.

But let me run the whole story:

1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.

Which gave me:

merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 06:21 AM, Martin Steigerwald wrote:

Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald:

Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:

It can easily be reproduced without even using Virtualbox, just by a
nice
simple fio job.


TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do

   dd if=/dev/urandom of=/some/file bs=4k count=$counter

done
exit


Each pass over /some/file is 4k shorter than the previous one, but none
of the extents can be deallocated. The file will be 1MiB in size and usage
will be something like 122.6MiB (if I've done the math correctly).
Larger values of counter will result in quadratically larger amounts of
waste.


Robert, I experienced these hang issues even before the defragmenting case.
It happened while I just installed a 400 MiB tax return application to it
(that is no joke, it is that big).

It happens while just using the VM.

Yes, I recommend not to use BTRFS for any VM image or any larger database on
rotating storage for exactly that COW semantics.

But on SSD?

It's busy-looping a CPU core while the flash is basically idling.

I refuse to believe that this is by design.

I do think there is a *bug*.

Either acknowledge it and try to fix it, or say it's by design *without even
looking at it closely enough to be sure that it is not a bug* and limit your
own possibilities by it.

I´d rather see it treated as a bug for now.

Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
randomly writing to a 4 GiB file.

People do these kinds of things. Ditch that defrag Windows XP VM case, I had
performance issues even before, just by installing things to it. Databases,
VMs, emulators. And heck, even while just *creating* the file with fio, as I
showed.


Add to these use cases things like this:

martin@merkaba:~/.local/share/akonadi/db_data/akonadi ls -lSh | head -5
insgesamt 2,2G
-rw-rw 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
-rw-rw 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
-rw-rw 1 martin martin  23M Dez 27 15:17 pimitemflagrelation.ibd
-rw-rw 1 martin martin 240K Dez 27 15:17 collectiontable.ibd


Or this:

martin@merkaba:~/.local/share/baloo du -sch * | sort -rh
9,2G    insgesamt
8,0G    email
1,2G    file
51M     emailContacts
408K    contacts
76K     notes
16K     calendars

martin@merkaba:~/.local/share/baloo ls -lSh email | head -5
insgesamt 8,0G
-rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
-rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
-rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
-rw-r--r-- 1 martin martin  63K Dez 27 15:16 postlist.baseA


/usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing 
the amount of filespace used by a file in BTRFS.


Look at a nice paste of the previously described worst case allocation.

Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Gust rwhite # for ((counter=250; counter>0; counter--)); do dd 
if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter 
> /dev/null 2>&1; done


Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.48GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Gust rwhite # du some_file
1000    some_file

Gust rwhite # ls -lh some_file
-rw-rw-r--+ 1 root root 1000K Dec 27 07:00 some_file

Gust rwhite # rm some_file
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Notice that some_file shows 1000 blocks in du, and 1000k bytes in ls.

But notice that data used jumps from 340.41GiB to 340.48GiB when the 
file is created, then drops back down to 340.41GiB when it's deleted.


Now I have compression turned on so the amount of growth/shrinkage 
changes between each run, but it's _way_ more than 1Meg; that's like 
70MiB (give or take significant rounding in the third place after the 
decimal). So I wrote this file in a way that leads to it taking up 
_seventy_ _times_ its base size in actual allocated storage. Real files 
do not perform this terribly, but they can get pretty ugly in some cases.


You _really_ need to learn how the system works and what its best and 
worst cases look like before you start shouting bug!


You are using the wrong numbers (e.g. df) for available space and you 
don't know how to estimate what your tools _should_ do for the 
conditions observed.


But yes, if you open a file and scribble all over it when your disk is 
full to 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 07:14:32 schrieb Robert White:
 But yes, if you open a file and scribble all over it when your disk is 
 full to within the same order of magnitude as the size of the file you 
 are scribbling on, you will get into a condition where the _application_ 
 will aggressively retry the IO. Particularly if that application is a 
 test program or a virtual machine doing asynchronous IO.
 
 That's what those sorts of systems do when they crash against a limit in 
 the underlying system.
 
 So yea... out of space plus aggressive writer equals spinning CPU.
 
 Before you can assign blame you need to strace your application to see 
 what call it's making over and over again to see if it's just being stupid.

Robert, I am pretty sure that fio does not retry the I/O. If the I/O returns
an error it exits immediately.

I don´t think BTRFS fails an I/O – there is nothing about that in kern.log or
dmesg. But it just needs a very long time for it.

And yet, with the BTRFS-*is*-*full* test case I still can´t reproduce the 300
IOPS case. I consistently get about 4800 IOPS, which is just about okay IMHO.

fio just does random I/O. Aggressively, yes. But it would stop on the *first*
*failed* I/O request. I am pretty sure of that.

fio is the flexible I/O tester. It has been written mostly by Jens Axboe. Jens
is the block maintainer of the Linux kernel. So I kindly ask that,
before you assume I use crap tools, you have a look at it.

From how you write I get the impression that you think everyone else
besides you is just silly and dumb. Please stop this assumption. I may not
always get terms right, and I may make a mistake, as with the wrong df
figure. But I also highly dislike feeling treated like someone who doesn´t
know a thing.

I made my case.

I tried to reproduce it in a test case.

Now I suggest we wait till someone has had an actual look at the sysrq-t
traces in the 25 MiB kern.log I provided in the bug report.

I will now wait for BTRFS developers to comment on this.

I think Chris and Josef and other BTRFS developers actually know what fio
is, so… either they are interested in that 300 IOPS case I cannot yet
reproduce with a fresh filesystem or not.


Even when it is almost as full as it can get and the fio job *barely* completes
without a no space left on device error, I still get those 4800 IOPS.
I tested it and took the first run where it actually completed again after
deleting a partially copied /usr/bin directory from the test filesystem.

As I have shown in my test case (see my other mail with the altered subject
line).

So for at least a *small* full filesystem, the "filesystem full" or "BTRFS has
to search for free space aggressively" case *does not* explain what I see
with my /home. So either I need a fuller filesystem for the test case,
maybe one which carries a million files or more, or one that at least
has more chunks to allocate from, or there is more to it and there is
something about my /home that makes it even worse.

So it isn´t just the filesystem-full case, and the all-free-space-allocated-
to-chunks condition also does not suffice, as my test case shows (where
BTRFS just won´t allocate another data chunk, it seems).

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 07:14:32 schrieb Robert White:
 On 12/27/2014 06:21 AM, Martin Steigerwald wrote:
  Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald:
  Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White:
  On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
  It can easily be reproduced without even using Virtualbox, just by a
  nice
  simple fio job.
 
  TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
  with one single file...
 
  #!/bin/bash
  # not tested, so correct any syntax errors
  typeset -i counter
  for ((counter=250; counter>0; counter--)); do
 
 dd if=/dev/urandom of=/some/file bs=4k count=$counter
 
  done
  exit
 
 
  Each pass over /some/file is 4k shorter than the previous one, but none
  of the extents can be deallocated. The file will be 1MiB in size and usage
  will be something like 122.6MiB (if I've done the math correctly).
  Larger values of counter will result in quadratically larger amounts of
  waste.
 
  Robert, I experienced these hang issues even before the defragmenting case.
  It happened while I just installed a 400 MiB tax return application to it
  (that is no joke, it is that big).
 
  It happens while just using the VM.
 
  Yes, I recommend not to use BTRFS for any VM image or any larger database 
  on
  rotating storage for exactly that COW semantics.
 
  But on SSD?
 
  It's busy-looping a CPU core while the flash is basically idling.
 
  I refuse to believe that this is by design.
 
  I do think there is a *bug*.
 
  Either acknowledge it and try to fix it, or say it's by design *without even
  looking at it closely enough to be sure that it is not a bug* and limit your
  own possibilities by it.
 
  I´d rather see it treated as a bug for now.
 
  Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
  randomly writing to a 4 GiB file.
 
  People do these kinds of things. Ditch that defrag Windows XP VM case, I had
  performance issues even before, just by installing things to it. Databases,
  VMs, emulators. And heck, even while just *creating* the file with fio, as I
  showed.
 
  Add to these use cases things like this:
 
  martin@merkaba:~/.local/share/akonadi/db_data/akonadi ls -lSh | head -5
  insgesamt 2,2G
  -rw-rw 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
  -rw-rw 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
  -rw-rw 1 martin martin  23M Dez 27 15:17 pimitemflagrelation.ibd
  -rw-rw 1 martin martin 240K Dez 27 15:17 collectiontable.ibd
 
 
  Or this:
 
  martin@merkaba:~/.local/share/baloo du -sch * | sort -rh
  9,2G    insgesamt
  8,0G    email
  1,2G    file
  51M     emailContacts
  408K    contacts
  76K     notes
  16K     calendars
 
  martin@merkaba:~/.local/share/baloo ls -lSh email | head -5
  insgesamt 8,0G
  -rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
  -rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
  -rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
  -rw-r--r-- 1 martin martin  63K Dez 27 15:16 postlist.baseA
 
 /usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing 
 the amount of filespace used by a file in BTRFS.

Yes.

But they are *useful* to demonstrate that there are regular desktop
applications which randomly write into huge files. And that was *exactly*
the point I was trying to make.

Yes, I didn´t prove the random aspect. But heck, one is a MySQL and
one is a Xapian. I am fairly sure that for a desktop search and for maildir
folder indexing there is some random aspect in the workload. Do you
agree with that?

So what you call bad – that was the exact point I was going to make
– is going to happen on real systems. Maybe not as fiercely as with a fio
job, granted. And for these, said /home BTRFS worked fine, but just
installing a 400 MiB application onto the Windows XP gave me the hang
already. With more than 8 GiB of free space within the chunks at that
time.

If BTRFS falls to something like 300 IOPS on a dual SSD under disk-full
conditions on workloads like this, it will fail in real-world scenarios. And
again, my recommendation to leave way more free space than with other
filesystems still holds.

Yes, I saw XFS developer Dave Chinner recommending about 50% free
space on XFS for a crazy workload in case you want the filesystem in a young
state even after 10 years. So I am fully aware that filesystems will age.

But to *this* extent? After only about six months of actually running this
BTRFS RAID 1, which started as a fresh single-device BTRFS that I then
balanced as RAID 1 onto the second SSD?

I still think it is a bug. Especially as it just does not happen with a
simple disk full condition as I spent several hours in trying to reproduce
this worst case.

If it only happens with my /home, I am willing to accept that something may
be borked with it. And I haven´t been able to reproduce it with a clean
filesystem yet. So maybe it doesn´t happen for others. Then all is fine, I
recreate the FS and forget about it.

But before I do any 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Hugo Mills
On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
 On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
[snip]
 while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
 for 10 seconds while allocating a 4 GiB file on a filesystem like:
 
 martin@merkaba:~ LANG=C df -hT /home
 Filesystem Type   Size  Used Avail Use% Mounted on
 /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
 
 where a 4 GiB file should easily fit, no? (And this output is with the 4
 GiB file. So it was even 4 GiB more free before.)
 
 No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
 of the statfs() function call. The statfs function call was defined
 in 1990 and can't understand the dynamic allocation model used in
 BTRFS as it assumes fixed geometry for filesystems. You do _not_
 have 17G actually available. You need to rely on btrfs fi df and
 btrfs fi show to figure out how much space you _really_ have.
 
 According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
 
  merkaba:~ date; btrfs fi sh /home ; btrfs fi df /home
  Sa 27. Dez 13:26:39 CET 2014
  Label: 'home'  uuid: [some UUID]
   Total devices 2 FS bytes used 152.83GiB
   devid1 size 160.00GiB used 160.00GiB path
 /dev/mapper/msata-home
   devid2 size 160.00GiB used 160.00GiB path
 /dev/mapper/sata-home
 
 And according to this block you have about 5.39GiB of data space:
 
  Btrfs v3.17
  Data, RAID1: total=154.97GiB, used=149.58GiB
  System, RAID1: total=32.00MiB, used=48.00KiB
  Metadata, RAID1: total=5.00GiB, used=3.26GiB
  GlobalReserve, single: total=512.00MiB, used=0.00B
 
 154.97
   5.00
   0.032
 + 0.512
 
 Pretty much as close to 160GiB as you are going to get (those
 numbers being rounded up in places for human readability). BTRFS
 has allocated 100% of the raw storage into typed extents.
 
 A large datafile can only fit in the 154.97-149.58 = 5.39

   I appreciate that this is something of a minor point in the grand
scheme of things, but I'm afraid I've lost the enthusiasm to engage
with the broader (somewhat rambling, possibly-at-cross-purposes)
conversation in this thread. However...

 Trying to allocate that 4GiB file into that 5.39GiB of space becomes
 an NP-complete (e.g. very hard) problem if it is very fragmented.

   This is... badly mistaken, at best. The problem of where to write a
file into a set of free extents is definitely *not* an NP-hard
problem. It's a P problem, with an O(n log n) solution, where n is the
number of free extents in the free space cache. The simple approach:
fill the first hole with as many bytes as you can, then move on to the
next hole. More complex: order the free extents by size first. Both of
these are O(n log n) algorithms, given an efficient general-purpose
index of free space.

   The problem of placing file data isn't a bin-packing problem; it's
not like allocating RAM (where each allocation must be contiguous).
The items being placed may be split as much as you like, although
minimising the amount of splitting is a goal.
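
   As a toy illustration of why this is a greedy linear pass rather than
bin packing (a sketch with made-up hole sizes, not btrfs's actual
allocator; the O(n log n) comes from maintaining the free-space index,
not from this loop):

#!/bin/bash
# Sketch: first-fit placement when a request may be split across holes.
holes=(512 128 2048 64 4096)   # free-extent sizes in KiB (hypothetical)
request=4500                   # KiB to place
remaining=$request
for h in "${holes[@]}"; do
    (( remaining <= 0 )) && break
    take=$(( h < remaining ? h : remaining ))
    echo "take ${take} KiB from a ${h} KiB hole"
    (( remaining -= take ))
done
(( remaining > 0 )) && echo "ENOSPC: ${remaining} KiB left unplaced"
exit 0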

   I suspect that the performance problems that Martin is seeing may
indeed be related to free space fragmentation, in that finding and
creating all of those tiny extents for a huge file is causing
problems. I believe that btrfs isn't alone in this, but it may well be
showing the problem to a far greater degree than other FSes. I don't
have figures to compare, I'm afraid.

 I also don't know what kind of tool you are using, but it might be
 repeatedly trying and failing to fallocate the file as a single
 extent or something equally dumb.

   Userspace doesn't, as far as I know, get to make that decision. I've
just read the fallocate(2) man page, and it says nothing at all about
the contiguity of the extent(s) storage allocated by the call.

   Hugo.

[snip]

-- 
Hugo Mills | O tempura! O moresushi!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0  |




Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 16:26:42 schrieb Hugo Mills:
 On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
  On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
 [snip]
  while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
  for 10 seconds while allocating a 4 GiB file on a filesystem like:
  
  martin@merkaba:~ LANG=C df -hT /home
  Filesystem Type   Size  Used Avail Use% Mounted on
  /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
  
  where a 4 GiB file should easily fit, no? (And this output is with the 4
  GiB file. So it was even 4 GiB more free before.)
  
  No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
  of the statfs() function call. The statfs function call was defined
  in 1990 and can't understand the dynamic allocation model used in
  BTRFS as it assumes fixed geometry for filesystems. You do _not_
  have 17G actually available. You need to rely on btrfs fi df and
  btrfs fi show to figure out how much space you _really_ have.
  
  According to this block you have a RAID1 of ~ 160GB expanse (two 160G disks)
  
   merkaba:~ date; btrfs fi sh /home ; btrfs fi df /home
   Sa 27. Dez 13:26:39 CET 2014
   Label: 'home'  uuid: [some UUID]
Total devices 2 FS bytes used 152.83GiB
devid1 size 160.00GiB used 160.00GiB path
  /dev/mapper/msata-home
devid2 size 160.00GiB used 160.00GiB path
  /dev/mapper/sata-home
  
  And according to this block you have about 5.39GiB of data space:
  
   Btrfs v3.17
   Data, RAID1: total=154.97GiB, used=149.58GiB
   System, RAID1: total=32.00MiB, used=48.00KiB
   Metadata, RAID1: total=5.00GiB, used=3.26GiB
   GlobalReserve, single: total=512.00MiB, used=0.00B
  
  154.97
5.00
0.032
  + 0.512
  
  Pretty much as close to 160GiB as you are going to get (those
  numbers being rounded up in places for human readability). BTRFS
  has allocated 100% of the raw storage into typed extents.
  
  A large datafile can only fit in the 154.97-149.58 = 5.39
 
I appreciate that this is something of a minor point in the grand
 scheme of things, but I'm afraid I've lost the enthusiasm to engage
 with the broader (somewhat rambling, possibly-at-cross-purposes)
 conversation in this thread. However...
 
  Trying to allocate that 4GiB file into that 5.39GiB of space becomes
  an NP-complete (e.g. very hard) problem if it is very fragmented.
 
This is... badly mistaken, at best. The problem of where to write a
 file into a set of free extents is definitely *not* an NP-hard
 problem. It's a P problem, with an O(n log n) solution, where n is the
 number of free extents in the free space cache. The simple approach:
 fill the first hole with as many bytes as you can, then move on to the
 next hole. More complex: order the free extents by size first. Both of
 these are O(n log n) algorithms, given an efficient general-purpose
 index of free space.
 
The problem of placing file data isn't a bin-packing problem; it's
 not like allocating RAM (where each allocation must be contiguous).
 The items being placed may be split as much as you like, although
 minimising the amount of splitting is a goal.
 
I suspect that the performance problems that Martin is seeing may
 indeed be related to free space fragmentation, in that finding and
 creating all of those tiny extents for a huge file is causing
 problems. I believe that btrfs isn't alone in this, but it may well be
 showing the problem to a far greater degree than other FSes. I don't
 have figures to compare, I'm afraid.

That's what I wanted to hint at.

I suspect an issue with free space fragmentation, and from what I think I see:

btrfs balance minimizes free-space fragmentation within chunks.

And that is my whole case for why I think it does help with my /home
filesystem.

So while btrfs filesystem defragment may help with defragmenting individual
files, possibly at the cost of fragmenting free space, at least under
filesystem-almost-full conditions, I think to help with free space
fragmentation there are only three options at the moment:

1) reformat and restore via rsync or btrfs send from backup (i.e. file based)

2) make the BTRFS itself bigger

3) btrfs balance at least some chunks, at least those that are not more than
70% or 80% full (see the sketch below)

Do you know of any other ways to deal with it?

So yes, in case it really is free space fragmentation, I do think a balance
may be helpful. Even if usually one should not need to use balance.
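
For what it's worth, option 3 maps onto balance's usage filter; a sketch
(the 75% threshold is an arbitrary assumption, and -dusage needs a
btrfs-progs new enough to support balance filters):

#!/bin/bash
# Sketch: compact only data chunks that are at most 75% used,
# returning their space to the unallocated pool.
btrfs balance start -dusage=75 /home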
 
  I also don't know what kind of tool you are using, but it might be
  repeatedly trying and failing to fallocate the file as a single
  extent or something equally dumb.
 
Userspace doesn't as far as I know, get to make that decision. I've
 just read the fallocate(2) man page, and it says nothing at all about
 the contiguity of the extent(s) storage allocated by the call.

fio fallocates just once. And then writes, even if the fallocate call fails.

Was nice to see at some point as BTRFS 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 18:11:21 schrieb Martin Steigerwald:
 Am Samstag, 27. Dezember 2014, 16:26:42 schrieb Hugo Mills:
  On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
   On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
  [snip]
   while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
   for 10 seconds while allocating a 4 GiB file on a filesystem like:
   
   martin@merkaba:~ LANG=C df -hT /home
   Filesystem Type   Size  Used Avail Use% Mounted on
   /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
   
   where a 4 GiB file should easily fit, no? (And this output is with the 4
   GiB file. So it was even 4 GiB more free before.)
   
   No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
   of the statfs() function call. The statfs function call was defined
   in 1990 and can't understand the dynamic allocation model used in
   BTRFS as it assumes fixed geometry for filesystems. You do _not_
   have 17G actually available. You need to rely on btrfs fi df and
   btrfs fi show to figure out how much space you _really_ have.
   
   According to this block you have a RAID1 of ~ 160GB expanse (two 160G 
   disks)
   
merkaba:~ date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
 Total devices 2 FS bytes used 152.83GiB
 devid1 size 160.00GiB used 160.00GiB path
   /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path
   /dev/mapper/sata-home
   
   And according to this block you have about 5.39GiB of data space:
   
Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
   
   154.97
 5.00
 0.032
   + 0.512
   
   Pretty much as close to 160GiB as you are going to get (those
   numbers being rounded up in places for human readability). BTRFS
   has allocated 100% of the raw storage into typed extents.
   
   A large datafile can only fit in the 154.97-149.58 = 5.39
  
 I appreciate that this is something of a minor point in the grand
  scheme of things, but I'm afraid I've lost the enthusiasm to engage
  with the broader (somewhat rambling, possibly-at-cross-purposes)
  conversation in this thread. However...
  
   Trying to allocate that 4GiB file into that 5.39GiB of space becomes
   an NP-complete (e.g. very hard) problem if it is very fragmented.
  
 This is... badly mistaken, at best. The problem of where to write a
  file into a set of free extents is definitely *not* an NP-hard
  problem. It's a P problem, with an O(n log n) solution, where n is the
  number of free extents in the free space cache. The simple approach:
  fill the first hole with as many bytes as you can, then move on to the
  next hole. More complex: order the free extents by size first. Both of
  these are O(n log n) algorithms, given an efficient general-purpose
  index of free space.
  
 The problem of placing file data isn't a bin-packing problem; it's
  not like allocating RAM (where each allocation must be contiguous).
  The items being placed may be split as much as you like, although
  minimising the amount of splitting is a goal.
  
 I suspect that the performance problems that Martin is seeing may
  indeed be related to free space fragmentation, in that finding and
  creating all of those tiny extents for a huge file is causing
  problems. I believe that btrfs isn't alone in this, but it may well be
  showing the problem to a far greater degree than other FSes. I don't
  have figures to compare, I'm afraid.
 
 That's what I wanted to hint at.
 
 I suspect an issue with free space fragmentation, and from what I think I see:
 
 btrfs balance minimizes free-space fragmentation within chunks.
 
 And that is my whole case for why I think it does help with my /home
 filesystem.
 
 So while btrfs filesystem defragment may help with defragmenting individual
 files, possibly at the cost of fragmenting free space, at least under
 filesystem-almost-full conditions, I think to help with free space
 fragmentation there are only three options at the moment:
 
 1) reformat and restore via rsync or btrfs send from backup (i.e. file based)
 
 2) make the BTRFS itself bigger
 
 3) btrfs balance at least some chunks, at least those that are not more than
 70% or 80% full.
 
 Do you know of any other ways to deal with it?

Yes.

4) Delete some stuff from it or move it over to a different filesystem.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Zygo Blaxell
On Sat, Dec 27, 2014 at 09:30:43AM +, Hugo Mills wrote:
 On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
  Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
   On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
Now, since you're seeing lockups when the space on your disks is
 all allocated I'd say that's a bug. However, you're the *only* person
 who's reported this as a regular occurrence. Does this happen with all
 filesystems you have, or just this one?

I do see something similar, but there are so many problems going on I
have no idea which ones to report, and which ones are my own doing.  :-P

I see lots of CPU being burned when all the disk space is allocated
to chunks, but there is still lots of space free (multiple GB) inside
the chunks.

iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
There are maybe a few kB/sec of writes through the filesystem at the time.

The filesystem where I see this most is on a laptop, so the disk writes
also hit the CPU again for encryption.  There's so much CPU usage it's
worth mentioning twice.  :-(

'watch cat /proc/12345/stack' on the active processes shows the kernel
fairly often in that new chunk deallocator function whose name escapes
me at the moment.

Deleting a bunch of data then running balance helps return to sane CPU
usage...for a while (maybe a week?).

It's not technically locked up per se, but when a 5KB download takes
a minute or more, most users won't wait around to see the difference.

Kernel versions I'm using are 3.17.7 and 3.18.1.




Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Hugo Mills
On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote:
 On Sat, Dec 27, 2014 at 09:30:43AM +, Hugo Mills wrote:
  On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
   Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
 Now, since you're seeing lockups when the space on your disks is
  all allocated I'd say that's a bug. However, you're the *only* person
  who's reported this as a regular occurrence. Does this happen with all
  filesystems you have, or just this one?
 
 I do see something similar, but there are so many problems going on I
 have no idea which ones to report, and which ones are my own doing.  :-P
 
 I see lots of CPU being burned when all the disk space is allocated
 to chunks, but there is still lots of space free (multiple GB) inside
 the chunks.
 
 iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
 There are maybe a few kB/sec of writes through the filesystem at the time.
 
 The filesystem where I see this most is on a laptop, so the disk writes
 also hit the CPU again for encryption.  There's so much CPU usage it's
 worth mentioning twice.  :-(
 
 'watch cat /proc/12345/stack' on the active processes shows the kernel
 fairly often in that new chunk deallocator function whose name escapes
 me at the moment.
 
 Deleting a bunch of data then running balance helps return to sane CPU
 usage...for a while (maybe a week?).
 
 It's not technically locked up per se, but when a 5KB download takes
 a minute or more, most users won't wait around to see the difference.
 
 Kernel versions I'm using are 3.17.7 and 3.18.1.

   OK, so I'd like to change my statement above.

   When I first read Martin's problem, I thought that he was referring
to a complete, hit-the-power-button kind of lock-up. Given that
(erroneous) assumption, I stand by my (now pointless) statement. :)

   I realised during a brief conversation on IRC that Martin was
actually referring to long but temporary periods where the machine is
unusable by any process requiring disk activity. There's clearly a
number of people seeing that.

   It doesn't stop it being a major problem, but it does change the
interpretation considerably.

   Hugo.

-- 
Hugo Mills | Mixing mathematics and alcohol is dangerous. Don't
hugo@... carfax.org.uk | drink and derive.
http://carfax.org.uk/  |
PGP: 65E74AC0  |




Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)

2014-12-27 Thread Martin Steigerwald
Am Samstag, 27. Dezember 2014, 18:40:17 schrieb Hugo Mills:
 On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote:
  On Sat, Dec 27, 2014 at 09:30:43AM +, Hugo Mills wrote:
   On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
 On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
  Now, since you're seeing lockups when the space on your disks is
   all allocated I'd say that's a bug. However, you're the *only* person
   who's reported this as a regular occurrence. Does this happen with all
   filesystems you have, or just this one?
  
  I do see something similar, but there are so many problems going on I
  have no idea which ones to report, and which ones are my own doing.  :-P
  
  I see lots of CPU being burned when all the disk space is allocated
  to chunks, but there is still lots of space free (multiple GB) inside
  the chunks.
  
  iotop shows a crapton of disk writes (1-5MB/sec) from one kworker.
  There are maybe a few kB/sec of writes through the filesystem at the time.
  
  The filesystem where I see this most is on a laptop, so the disk writes
  also hit the CPU again for encryption.  There's so much CPU usage it's
  worth mentioning twice.  :-(
  
  'watch cat /proc/12345/stack' on the active processes shows the kernel
  fairly often in that new chunk deallocator function whose name escapes
  me at the moment.
  
  Deleting a bunch of data then running balance helps return to sane CPU
  usage...for a while (maybe a week?).
  
  It's not technically locked up per se, but when a 5KB download takes
  a minute or more, most users won't wait around to see the difference.
  
  Kernel versions I'm using are 3.17.7 and 3.18.1.
 
OK, so I'd like to change my statement above.
 
When I first read Martin's problem, I thought that he was referring
 to a complete, hit-the-power-button kind of lock-up. Given that
 (erroneous) assumption, I stand by my (now pointless) statement. :)
 
I realised during a brief conversation on IRC that Martin was
 actually referring to long but temporary periods where the machine is
 unusable by any process requiring disk activity. There's clearly a
 number of people seeing that.
 
It doesn't stop it being a major problem, but it does change the
 interpretation considerably.

Ah, then my bet about whom I talked with there was right. :)

Yeah, it does not seem to be a complete hang. I thought so initially
because, honestly, after waiting several minutes for my Plasma desktop to
come back I just gave up. Maybe it would have returned at some point. I
just didn't have the patience to wait.

It did return at my last test, where I continued on tty1 (I had all the
testing in a screen session) as the desktop session locked up. Some time
after the test completed I was able to use that desktop again, and I am
still using it.

So the issue I see is: one kworker uses 100% of one core for minutes, and
while it does, processes that do I/O to the BTRFS that I test (/home in
my case) seem to be stuck in uninterruptible sleep (D process state).
While I see this there is no huge load on the SSDs, so… it seems to be
something CPU-bound. I didn't yet use strace on the kworker process – or,
at allocation time, on the fio process – Robert, that's a good
suggestion. From a gut feeling I wouldn't be surprised to see *nothing*
in strace, as my bet is that the kworker thread deals with finding free
space inside the chunks and walks some data structures while doing so.
But that is really just a gut feeling, and so an strace would be nice.
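
Thinking about it more: strace only sees syscalls, and a kworker is a
kernel thread that makes none, so it will quite literally show nothing
there. Repeatedly sampling /proc/<pid>/stack – the 'watch cat' Zygo used
above, just aggregated – is probably the better poor-man's profiler. A
quick Python sketch of that idea (the PID is hypothetical, and reading
that file needs root):

import time
from collections import Counter

PID = 12345                     # hypothetical: the busy kworker's PID
samples = Counter()

for _ in range(100):
    try:
        with open("/proc/%d/stack" % PID) as f:    # root only
            samples[f.read()] += 1
    except FileNotFoundError:
        break                                      # thread went away
    time.sleep(0.1)

for stack, count in samples.most_common(3):
    print("--- seen %d times ---" % count)
    print(stack)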

I made a backup yesterday, so I think I can try the strace. But I also
spent a considerable amount of time reproducing this and digging deeper
into it, so likely not this weekend anymore, although it is even fun in a
way. I also see myself neglecting other stuff that's important to me, so…

My simple test case didn't trigger it, and I do not have another two
times 160 GiB available on these SSDs to try with a copy of my home
filesystem. With that I could test safely without bringing the desktop
session to a halt. Maybe someone has an idea on how to enhance my test
case in order to reliably trigger the issue.

It may be challenging, though. My /home is quite a filesystem. It has a
maildir with at least a million files (yeah, I am performance-testing
KMail and Akonadi to the limit as well!), and it has git repos and this
one VM image, and the desktop search index and the Akonadi database. In
other words: it has been hit nicely with various – mostly random, I think
– workloads over the last roughly six months. I bet it's not that easy to
simulate. Maybe some runs of compilebench to age the filesystem before
the fio test?

That said, BTRFS performs a lot better. The complete lockups without any
CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful. But there
is this kworker issue now. I noticed it this gravely only while trying to
complete this tax return

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

Semi off-topic questions...

On 12/27/2014 08:26 AM, Hugo Mills wrote:

This is... badly mistaken, at best. The problem of where to write a
file into a set of free extents is definitely *not* an NP-hard
problem. It's a P problem, with an O(n log n) solution, where n is the
number of free extents in the free space cache. The simple approach:
fill the first hole with as many bytes as you can, then move on to the
next hole. More complex: order the free extents by size first. Both of
these are O(n log n) algorithms, given an efficient general-purpose
index of free space.
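
To make sure we're talking about the same scheme, here it is as a rough
Python sketch (illustrative only -- not the actual btrfs allocator, whose
details are exactly what I'm asking about below):

def first_fit(free_extents, size):
    """free_extents: (offset, length) holes in address order.
    Returns the (offset, length) pieces used -- the request may be
    split across holes, which is why this isn't bin packing."""
    placed, remaining = [], size
    for off, length in free_extents:
        if remaining == 0:
            break
        take = min(length, remaining)    # fill this hole as far as we can
        placed.append((off, take))
        remaining -= take
    if remaining:
        raise OSError("ENOSPC: not enough free space in total")
    return placed

def largest_first(free_extents, size):
    # Ordering the holes by size first minimises splitting.
    return first_fit(sorted(free_extents, key=lambda e: e[1],
                            reverse=True), size)

print(first_fit([(0, 4096), (8192, 65536)], 10000))
# -> [(0, 4096), (8192, 5904)]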


Which algorithm is actually in use?

Is any attempt made to keep subsequent allocations in the same data extent?

All of best fit, first fit, and first encountered allocation have 
terrible distribution graphs over time.


Without a nod to locality, discontiguous allocation will have 
staggeringly bad after-effects in terms of read-ahead.


The problem of placing file data isn't a bin-packing problem; it's
not like allocating RAM (where each allocation must be contiguous).
The items being placed may be split as much as you like, although
minimising the amount of splitting is a goal.


How are compression and re-compression handled? If a linear extent is 
compressed to find its on-disk size in bytes, and there isn't a free 
extent large enough to fit it, it has to be cut, then recompressed, then 
searched for again, right?


How does the system look for the right cut? How iterative can this get? 
Does it always try cutting in half? Does it shave single bytes off the 
end? Does it add one byte at a time till it reaches the size of the 
extent it's looking at?


Can you get down to a point where you are placing data in five- or 
ten-byte chunks somehow? (E.g. what's the smallest chunk you can place? 
Clearly, if I open a multi-megabyte file and update a single word or 
byte, it's not going to land in metadata, from my reading of the code.) 
One could easily end up with a couple million free extents of just a few 
bytes each, particularly if largest-first allocation is used.


The degenerate cases here do come straight from the various packing 
problems. You may not be executing any of those packing algorithms but 
once you ignore enough of those issues in the easy cases your free space 
will be a fine pink mist suspended in space. (both an explosion analogy 
and a reference to pink noise 8-) ).



I suspect that the performance problems that Martin is seeing may
indeed be related to free space fragmentation, in that finding and
creating all of those tiny extents for a huge file is causing
problems. I believe that btrfs isn't alone in this, but it may well be
showing the problem to a far greater degree than other FSes. I don't
have figures to compare, I'm afraid.


I also don't know what kind of tool you are using, but it might be
repeatedly trying and failing to fallocate the file as a single
extent or something equally dumb.


Userspace doesn't, as far as I know, get to make that decision. I've
just read the fallocate(2) man page, and it says nothing at all about
the contiguity of the extent(s) storage allocated by the call.


Yep, my bad. But as soon as I saw that fio was starting two threads, 
one doing random read/write and another doing sequential read/write, 
both on the same file, it set off my "not just creating a file" mindset. 
Given the delayed write into/through the cache normally done by casual 
file I/O, it seemed likely that fio would be doing something more 
aggressive (like using O_DIRECT or repeated fdatasync(), which could get 
very tit-for-tat).


Compare that to a VM in which the guest operating system knows it has, 
and has used, its disk space internally, where the subsequent async 
activity of the monitor pushing that activity out to real storage is 
usually quite pathological... well, you can get into some supremely 
pernicious behavior over write ordering and infinite retries.


So I was wrong about fallocate per se, but applications can be incredibly 
dumb. For instance a VM might think it's _inconceivable_ to get an ENOSPC 
while rewriting data it just read from a file it knows has no holes, etc.


Given how lots of code doesn't even check the results of many function 
calls... how many times have you seen code that doesn't look at the 
return value of fwrite() or printf()? Or code that, at best, does 
something like "if (bytes_written < size) retry_remainder();"? So sure, 
I was imagining an fallocate() in a loop or something equally dumb. 8-)
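
Something like this sketch (illustrative Python, a made-up helper, not
any real app's code) is the pattern I mean -- retry forever without ever
looking at /why/ the write came up short:

import errno, os

def naive_write_all(fd, data):
    # The anti-pattern: "if (bytes_written < size) retry_remainder();"
    written = 0
    while written < len(data):
        try:
            written += os.write(fd, data[written:])
        except OSError as e:
            if e.errno == errno.ENOSPC:
                continue   # retry forever instead of handling it -- the bug
            raise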


Hugo.

[snip]





Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 08:01 AM, Martin Steigerwald wrote:

From how you write I get the impression that you think everyone else

besides you is just silly and dumb. Please stop this assumption. I may not
always get terms right, and I may make a mistake as with the wrong df
figure. But I also highly dislike feeling treated like someone who doesn't
know a thing.


Nope. I'm a systems theorist and I demand/require variable isolation.

Not a question of silly or dumb but a question of speaking with 
sufficient precision and clarity.


For instance you speak of having an impression and then decide I've 
made an assumption.


I define my position. Explain my terms. Give my examples.

I also risk being utterly wrong because sometimes being completely wrong 
gets others to cut away misconceptions and assumptions.


It annoys some people, but it gets results. You've been going around on 
this topic for how long? And just today Hugo got that your problem is 
being CPU-bound (a long-running process) instead of a hard lockup. We've 
stopped talking about trees and started talking about free space 
management. We've stopped talking about 17G of free space and gotten 
down to the 5 or so. Plus, you've gotten angry at me, tried to prove me 
an idiot, and so produced test cases and data that are absolutely clear, 
including steps to reproduce.


In real life I work on mission critical systems that can get people 
killed when they fail. So I have developed the reflex of tenacity in 
getting everyone using the same words, talking about the same concepts, 
giving concrete examples, and generally bringing the discussion to a 
very precise head.


Example: I had two parties in conflict about a system. One party said 
that every time they did an "orderly shutdown" the device would hang in 
a way that took days to recover from. The other party would examine the 
device and say "could not reproduce". It turned out that the two parties 
were doing entirely different (but both correct) sequences for an orderly 
shutdown. They'd been having that conflict for more than a year. But 
since they both _knew_ what an "orderly shutdown" was, they _never_ 
analyzed what they were saying. (It turned out one procedure left a chip 
in a state where it wouldn't restart until a capacitor discharged, and 
the other procedure did not.)


So yeah, when people make statements that everybody understands but 
those statements don't agree, I start slicing concepts off one at a time...


It's not about "dumb" or "silly"; it's about exact and accurate 
descriptions that have been stripped of assumptions and tribal knowledge.


And I don't care if I come off looking like "the bad guy", because I 
don't believe in "the bad guy" at all when it comes to making things 
clearer and getting out of a communications deadlock. My only goal is 
"less broken".


So occasionally annoying... but look... progress!


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Bardur Arantsson
On 2014-12-28 01:25, Robert White wrote:
 On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
 From how you write I get the impression that you think everyone else
besides you is just silly and dumb. Please stop this assumption. I may not
always get terms right, and I may make a mistake as with the wrong df
figure. But I also highly dislike feeling treated like someone who doesn't
know a thing.
 
 Nope. I'm a systems theorist and I demand/require variable isolation.
 
 Not a question of silly or dumb but a question of speaking with
 sufficient precision and clarity.
 
 For instance you speak of having an impression and then decide I've
 made an assumption.
 
 I define my position. Explain my terms. Give my examples.
 
 I also risk being utterly wrong because sometimes being completely wrong
 gets others to cut away misconceptions and assumptions.
 
 It annoys some people, but it gets results.

Can you please stop this bullshit posturing nonsense? It accomplishes
nothing -- if you're right, your other posts will stand for themselves
and show that you are indeed the shit when it comes to these matters,
but this post (so far, didn't read further) accomplishes nothing other
than (possibly) convincing everyone that you're a pompous/self-important
ass.

Regards,



Re: BTRFS free space handling still needs more work: Hangs again

2014-12-27 Thread Robert White

On 12/27/2014 05:01 PM, Bardur Arantsson wrote:

On 2014-12-28 01:25, Robert White wrote:

On 12/27/2014 08:01 AM, Martin Steigerwald wrote:

 From how you write I get the impression that you think everyone else

besides you is just silly and dumb. Please stop this assumption. I may not
always get terms right, and I may make a mistake as with the wrong df
figure. But I also highly dislike feeling treated like someone who doesn't
know a thing.


Nope. I'm a systems theorist and I demand/require variable isolation.

Not a question of silly or dumb but a question of speaking with
sufficient precision and clarity.

For instance you speak of having an impression and then decide I've
made an assumption.

I define my position. Explain my terms. Give my examples.

I also risk being utterly wrong because sometimes being completely wrong
gets others to cut away misconceptions and assumptions.

It annoys some people, but it gets results.


Can you please stop this bullshit posturing nonsense? It accomplishes
nothing -- if you're right your other posts will stand for themselves
and show that you are indeed the shit when it comes to these matters,
but this post (so far, didn't read further) accomplishes nothing other
than (possibly) convincing everyone that you're a pompous/self-important
ass.


Really? "Accomplishes nothing"?

24 hours ago:

the complaining party was talking about

- Windows XP
- Tax software
- Virtual box
- vdi files
- defragging
- balancing
- data trees
- system hanging

And the responding party was saying

"you are the only person reporting this as a regular occurrence", with 
the implication that the report was a duplicate or at least might not 
get much immediate attention.


Now:

The complaining party has verified the minimum, repeatable case of 
simple file allocation on a very fragmented system and the responding 
party and several others have understood and supported the bug.


That's not accomplishing nothing; that's called engaging in diagnostics 
instead of dismissing a complaint, and sticking with the diagnostic 
process until everyone is on the same page.


I never dismissed Martin. I never disbelieved him. I went through his 
elements one at a time with examples of what I was taking away from him 
and why they didn't match expectations and experimental evidence. We 
adjusted our positions and communications.


So you can call it "bullshit posturing nonsense", but what I see is 
taking less than a day to get to the bottom of a bug report that might 
otherwise not have gotten significant attention.



BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Martin Steigerwald
Hello!

First: have a merry Christmas and enjoy a quiet time these days.

Second: At a time you feel like it, here is a little rant, but also a bug
report:


I have this on a 3.18 kernel on Debian Sid with a BTRFS dual-SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:

merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
Total devices 2 FS bytes used 144.41GiB
devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


And I had hangs with BTRFS again. This time as I wanted to install tax
return software in a Virtualbox'd Windows XP VM (which I use once a year,
because I know of no tax return software for Linux that would be suitable
for Germany, and I frankly don't care about the end of security support,
because all surfing and other network access I will do from the Linux box
and I only run the VM behind a firewall).


And thus I try the balance dance again:

merkaba:~ btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -dusage=5 /home  
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=10 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=20 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=30 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=40 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=50 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=60 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -dusage=70 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -dusage=65 /home
Done, had to relocate 0 out of 164 chunks
merkaba:~ btrfs balance start -dusage=67 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -musage=10 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -musage=05 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail


Okay, not really, ey?



But

merkaba:~ btrfs balance start /home

works.

So basically I am rebalancing everything, without need I bet, causing
more churn on the SSDs than is needed.


Otherwise the alternative would be to make the BTRFS filesystem larger, I bet.


Well, this is still not what I would consider stable. So I will still
recommend: if you want to use BTRFS on a server and estimate 25 GiB of
usage, make the drive at least 50 GiB, or even 100 GiB to be on the safe
side. Like I recommended for SLES 11 SP2/SP3 BTRFS deployments – but
hey, meanwhile they say don't, as in: just don't use it at all, use SLES
12 instead, because BTRFS on a 3.0 kernel with a ton of snapper snapshots
is really not anywhere near production or enterprise reliability (if you
need proof, I think I still have a snapshot of a SLES 11 SP3 VM that broke
overnight due to me having installed an LDAP server while preparing some
training slides). Even a 3.12 kernel seems daring regarding BTRFS, unless
SUSE actively backports fixes.


In the kernel log the failed attempts look like this:

[  209.783437] BTRFS info (device dm-3): relocating block group 501238202368 
flags 17
[  210.116416] BTRFS info (device dm-3): relocating block group 501238202368 
flags 17
[  210.455479] BTRFS info (device dm-3): 1 enospc errors during balance
[  212.915690] BTRFS info (device dm-3): relocating block group 501238202368 
flags 17
[  213.291634] BTRFS info (device dm-3): relocating block group 501238202368 
flags 17
[  213.654145] BTRFS info (device dm-3): 1 enospc errors during balance
[  219.219584] BTRFS info (device dm-3): relocating block group 501238202368 
flags 17
[  

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Martin Steigerwald
On Friday, 26 December 2014, 14:37:36, you wrote:
 It currently is here:
 
 merkaba:~ btrfs balance status /home
 Balance on '/home' is running
 32 out of about 164 chunks balanced (53 considered),  80% left
 
 merkaba:~ btrfs fi df /home
 Data, RAID1: total=154.97GiB, used=142.10GiB
 System, RAID1: total=32.00MiB, used=48.00KiB
 Metadata, RAID1: total=5.00GiB, used=3.33GiB
 GlobalReserve, single: total=512.00MiB, used=254.31MiB

Now I got this:

merkaba:~ btrfs balance start /home   

ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 dmesg | tail
[ 4260.276416] BTRFS info (device dm-3): relocating block group 151418568704 
flags 17
[ 4274.683349] BTRFS info (device dm-3): found 25089 extents
[ 4295.836590] BTRFS info (device dm-3): found 25089 extents
[ 4296.026778] BTRFS info (device dm-3): relocating block group 150344826880 
flags 17
[ 4312.732021] BTRFS info (device dm-3): found 59388 extents
[ 4326.398261] BTRFS info (device dm-3): found 59388 extents
[ 4326.813205] BTRFS info (device dm-3): relocating block group 149271085056 
flags 17
[ 4347.346540] BTRFS info (device dm-3): found 104739 extents
[ 4357.160098] BTRFS info (device dm-3): found 104739 extents
[ 4359.304646] BTRFS info (device dm-3): 20 enospc errors during balance

And I wonder about:

 Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
 GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599
 
84C7N�r��yb�X��ǧv�^�)޺{.n�+{�n�߲)w*jg����ݢj/���z�ޖ��2
 �ޙ�)ߡ�a�����G���h��j:+v���w��٥


These random chars are not supposed to be there: I better run scrub straight 
after this balance.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Martin Steigerwald
On Friday, 26 December 2014, 15:20:42, you wrote:
 And I wonder about:
  Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
  GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599
 
  
 
 
84C7N�r��yb�X��ǧv�^�)޺{.n�+{�n�߲)w*jg����ݢj/���z�ޖ��2
 
  �ޙ�)ߡ�a�����G���h��j:+v���w��٥
 
 These random chars are not supposed to be there: I better run scrub
 straight  after this balance.

Okay, that's not me, I think. scrub didn't report any errors, and when I
look in KMail's sent folder I don't see these random chars either, so it
seems some server on the wire added the garbage.

Lets defragment the file:

merkaba:/home/martin/.VirtualBox/HardDisks filefrag Winlala.vdi
Winlala.vdi: 41462 extents found
merkaba:/home/martin/.VirtualBox/HardDisks btrfs filesystem defragment Winlala.vdi
merkaba:/home/martin/.VirtualBox/HardDisks filefrag Winlala.vdi
Winlala.vdi: 11735 extents found
merkaba:/home/martin/.VirtualBox/HardDisks sync
merkaba:/home/martin/.VirtualBox/HardDisks filefrag Winlala.vdi
Winlala.vdi: 11735 extents found


Okay, that together with:

merkaba:~ btrfs fi df /home   
Data, RAID1: total=151.95GiB, used=144.68GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.25GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: […]
Total devices 2 FS bytes used 147.94GiB
devid1 size 160.00GiB used 156.98GiB path /dev/mapper/msata-home
devid2 size 160.00GiB used 156.98GiB path /dev/mapper/sata-home

Btrfs v3.17

May do for a while.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Martin Steigerwald
On Friday, 26 December 2014, 14:37:36, you wrote:
 I have this on a 3.18 kernel on Debian Sid with a BTRFS dual-SSD RAID with
 space_cache, skinny metadata extents – are these a problem? – and
 compress=lzo:
 
 merkaba:~ btrfs fi sh /home
 Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
 Total devices 2 FS bytes used 144.41GiB
 devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
 
 Btrfs v3.17
 merkaba:~ btrfs fi df /home
 Data, RAID1: total=154.97GiB, used=141.12GiB
 System, RAID1: total=32.00MiB, used=48.00KiB
 Metadata, RAID1: total=5.00GiB, used=3.29GiB
 GlobalReserve, single: total=512.00MiB, used=0.00B
 
 
 And I had hangs with BTRFS again. This time as I wanted to install tax
 return software in a Virtualbox'd Windows XP VM (which I use once a year,
 because I know of no tax return software for Linux that would be suitable
 for Germany, and I frankly don't care about the end of security support,
 because all surfing and other network access I will do from the Linux box
 and I only run the VM behind a firewall).

These are 100% reproducible for me:

1) Have the compress=lzo, space_cache dual-SSD BTRFS RAID 1 with both
devices fully allocated into chunks.

2) Have a Windows XP VM in Virtualbox on that BTRFS RAID 1.

3) Press Defragment (in the hope of being able to use sdelete -c and then
VBoxManage modifyhd Winlala.vdi --compact to reduce the image size).


Gives:

One kworker thread using up 100% of one core for minutes, with bursts of
the btrfs-transaction process in between, and:

Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce: [Hardware Error]: Machine 
check events logged
Dec 26 16:18:15 merkaba kernel: [ 8119.879230] CPU2: Core temperature above 
threshold, cpu clock throttled (total events = 54053)
Dec 26 16:18:15 merkaba kernel: [ 8119.879232] CPU0: Package temperature above 
threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879234] CPU3: Core temperature above 
threshold, cpu clock throttled (total events = 54053)
Dec 26 16:18:15 merkaba kernel: [ 8119.879235] CPU1: Package temperature above 
threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879237] CPU3: Package temperature above 
threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.879245] CPU2: Package temperature above 
threshold, cpu clock throttled (total events = 89435)
Dec 26 16:18:15 merkaba kernel: [ 8119.880218] CPU2: Core temperature/speed 
normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880219] CPU1: Package temperature/speed 
normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880220] CPU3: Core temperature/speed 
normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880221] CPU0: Package temperature/speed 
normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880223] CPU3: Package temperature/speed 
normal
Dec 26 16:18:15 merkaba kernel: [ 8119.880228] CPU2: Package temperature/speed 
normal
Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce: [Hardware Error]: Machine 
check events logged
Dec 26 16:20:57 merkaba kernel: [ 8281.461874] INFO: task kded4:1959 blocked 
for more than 120 seconds.
Dec 26 16:20:57 merkaba kernel: [ 8281.464106]   Tainted: G   O   
3.18.0-tp520 #14
Dec 26 16:20:57 merkaba kernel: [ 8281.466361] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.468760] kded4   D 
88040764ce98 0  1959  1 0x
Dec 26 16:20:57 merkaba kernel: [ 8281.471112]  8803efa57bb8 
0002 8803efa57c00 880407f261c0
Dec 26 16:20:57 merkaba kernel: [ 8281.473462]  8803efa57fd8 
88040764c950 00012300 88040764c950
Dec 26 16:20:57 merkaba kernel: [ 8281.475780]  8803efa57ba8 
8803eea9a900 8803eea9a904 88040764c950
Dec 26 16:20:57 merkaba kernel: [ 8281.478142] Call Trace:
Dec 26 16:20:57 merkaba kernel: [ 8281.480414]  [814a6f9a] 
schedule+0x64/0x66
Dec 26 16:20:57 merkaba kernel: [ 8281.482694]  [814a72d3] 
schedule_preempt_disabled+0x13/0x1f
Dec 26 16:20:57 merkaba kernel: [ 8281.484979]  [814a8440] 
__mutex_lock_slowpath+0xab/0x126
Dec 26 16:20:57 merkaba kernel: [ 8281.487271]  [81143735] ? 
lookup_fast+0x173/0x238
Dec 26 16:20:57 merkaba kernel: [ 8281.489534]  [814a84ce] 
mutex_lock+0x13/0x24
Dec 26 16:20:57 merkaba kernel: [ 8281.491811]  [81143c45] 
walk_component+0x69/0x17e
Dec 26 16:20:57 merkaba kernel: [ 8281.494092]  [81143d88] 
lookup_last+0x2e/0x30
Dec 26 16:20:57 merkaba kernel: [ 8281.496416]  [81145a32] 
path_lookupat+0x83/0x2d9
Dec 26 16:20:57 merkaba kernel: [ 8281.498733]  [8121f38c] ? 
debug_smp_processor_id+0x17/0x19
Dec 26 16:20:57 merkaba kernel: [ 8281.501074]  [8114683c] ? 
getname_flags+0x31/0x134
Dec 26 16:20:57 merkaba kernel: [ 8281.503338]  [81145cad] 
filename_lookup+0x25/0x7a
Dec 26 16:20:57 merkaba 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Robert White

On 12/26/2014 05:37 AM, Martin Steigerwald wrote:

Hello!

First: have a merry Christmas and enjoy a quiet time these days.

Second: At a time you feel like it, here is a little rant, but also a bug
report:

I have this on a 3.18 kernel on Debian Sid with a BTRFS dual-SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:


(There is no known problem with skinny metadata; it's actually more 
efficient than the older format. There have been some anecdotes about 
mixing skinny and fat metadata, but nothing has ever been demonstrated 
to be problematic.)


merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
 Total devices 2 FS bytes used 144.41GiB
 devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
 devid2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


This filesystem, at the allocation level, is very full (see below).


And I had hangs with BTRFS again. This time as I wanted to install tax
return software in Virtualbox´d Windows XP VM (which I use once a year
cause I know no tax return software for Linux which would be suitable for
Germany and I frankly don´t care about the end of security cause all
surfing and other network access I will do from the Linux box and I only
run the VM behind a firewall).


And thus I try the balance dance again:


ITEM: Balance... it doesn't do what you think it does... 8-)

Balancing is something you should almost never need to do. It is only 
for cases of changing geometry (adding disks, switching RAID levels, 
etc.) or for cases where you've radically changed allocation behaviors 
(like deciding to remove all your VMs, or removing a mail spool 
directory full of thousands of tiny files).


People run balance all the time because they think they should. They are 
_usually_ incorrect in that belief.


merkaba:~ btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device


ITEM: Running out of space during a balance is not running out of space 
for files. BTRFS has two layers of allocation. That is, there are two 
levels of abstraction at which "no space" can occur.


The first level of allocation is making more BTRFS structures out 
of raw device space.


The second level is allocating space for files inside of existing BTRFS 
structures.


Balance is the operation of relocating the BTRFS structures and, 
coincidentally, attempting to increase their order while doing that. 
So, for instance, relocating block group "some_number_here" requires 
finding an unallocated expanse of disk and creating a new/empty block 
group there of the current relevant block group size (typically data=1G 
or metadata=256M, if you didn't override these settings while making the 
filesystem). You can _easily_ end up lacking a 1G contiguous expanse of 
raw allocation space on a nearly-full filesystem.
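
A toy model of those two levels -- illustrative Python only, not btrfs
internals, with the 1G data chunk size assumed from above -- makes the
failure mode concrete:

GiB = 1 << 30

class Device:
    """Raw device space, carved into chunks (block groups)."""
    def __init__(self, size):
        self.size = size
        self.chunks = []                      # (offset, length) pairs

    def alloc_chunk(self, length=1 * GiB):
        cursor = 0
        for off, ln in sorted(self.chunks):   # scan for a raw hole
            if off - cursor >= length:
                self.chunks.append((cursor, length))
                return cursor
            cursor = max(cursor, off + ln)
        if self.size - cursor >= length:      # hole at the end?
            self.chunks.append((cursor, length))
            return cursor
        raise OSError("ENOSPC at the chunk level")

dev = Device(160 * GiB)
for _ in range(160):
    dev.alloc_chunk()      # device fully allocated, as in 'fi show' above
# Files inside those chunks may still have gigabytes free, but balance
# must allocate a fresh chunk before it can relocate a block group:
try:
    dev.alloc_chunk()
except OSError as e:
    print(e)               # -> ENOSPC at the chunk level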


NOTE :: This does _not_ happen with other filesystems like EXT4, because 
building those filesystems creates a static filesystem-level allocation. 
That is, 100% of the disk that can be controlled by EXT4 (etc.) is 
allocated and initialized at creation time (or on first mount, in 
the case of EXT4).


BTRFS is intentionally different because it wants to be able to adapt as 
your usage changes. If you first make millions of tiny files then you 
will have a lot of metadata extents and virtually no data extents. If 
you erase a lot of those and then start making large files the metadata 
will tend to go away and then data extents will be created.


Being a chaotic system, you can get into some corner cases that suck, 
but in terms of natural evolution it has more benefits than drawbacks.


There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1 btrfs balance start -dusage=5 /home

<lots deleted for brevity>

So I am rebalancing everything basically, without need I bet, so causing
more churn to SSDs than is needed.


Correct, though churn isn't really the issue.


Otherwise alternative would be to make BTRFS larger I bet.


Correct.


Well this is still not what I would consider stable. So I will still


Not a question of stability.

See, doing a balance is like doing a sliding-block puzzle. If there isn't 
enough room to slide the blocks around, then the blocks will not slide 
around. You are just out of space, and that results in "out of space" 
returns. This is not even an error, just a fact.


http://en.wikipedia.org/wiki/15_puzzle

Meditate on the above link. Then ask yourself what happens if you put 

Re: BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Duncan
Martin Steigerwald posted on Fri, 26 Dec 2014 15:41:23 +0100 as excerpted:

 Am Freitag, 26. Dezember 2014, 15:20:42 schrieben Sie:
 And I wonder about:
  Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C
  0040 0710 4AFA  B82F 991B EAAC A599
 
 
  
 
 84C7N�r��yb�X��ǧv�^�)޺{.n�+{�n�߲)w*jg����ݢj/
���z�ޖ��2
 
  �ޙ�)ߡ�a�����G���h��j:+v���w��٥
 
 These random chars are not supposed to be there: I better run scrub
 straight  after this balance.
 
 Okay, that's not me, I think. scrub didn't report any errors, and when I
 look in KMail's sent folder I don't see these random chars either, so it
 seems some server on the wire added the garbage.

FWIW...

They didn't show up here on gmane's list2nntp service (message viewed 
with pan), either.  There were a few strange characters -- your dashes(?) 
on either side of the "are these a problem?" showed up as the squares 
containing four digits (0080, 0093) that appear when a font doesn't 
contain the appropriate character it's being asked to display, and there 
were a few others, but that's a common charset/font l10n issue, not the 
apparent line-noise binary corruption shown above.

So I'd guess it was either the transmission to your mail service, at the 
mail service, or the transmission between them and your mail client, that 
corrupted it.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Duncan
Martin Steigerwald posted on Fri, 26 Dec 2014 16:59:09 +0100 as excerpted:

 Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce:
 [Hardware Error]: Machine check events logged
 Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce:
 [Hardware Error]: Machine check events logged

Have you checked these MCEs?  What are they?

MCEs are hardware errors.  These are *NOT* kernel errors, tho of course 
they may /trigger/ kernel errors.  The reported event codes can be looked 
up and translated into English. 

From shortly after the first one until a bit before the second one here, 
you had hardware thermal throttling; the CPUs, on-chip cache, and 
possibly the memory were working pretty hard.

FWIW, I had an AMD machine that would MCE with memory-related errors some 
time (about a decade) ago.  I had ECC RAM, but it was cheap and 
apparently not quite up to the speeds it was actually rated for.  MemTest 
checked out the memory fine, but especially under high stress it would 
sometimes have bus/transit-related corruption, which would sometimes (not 
always) trigger those MCEs.

Eventually a BIOS update gave me the ability to turn down the memory 
timings, and turning them down just one notch made everything rock-stable 
-- I was even able to decrease some of the wait-states to get a bit of 
the memory speed back.  It just so happened that it was borderline stable 
at the rated clock, and turning the memory clock down just one notch was 
all it took.  Later, I upgraded the RAM (the bad RAM was two half-gig 
sticks, back when they were $100+ a piece, I upgraded to four 2-gig 
sticks), and the new RAM didn't have the problem at all -- the bad RAM 
sticks simply weren't /quite/ stable at the rated speed, that was it.

I run Gentoo so of course do a lot of building from sources, and 
interestingly enough, the thing that turned out to detect the corruption 
most often was bzip2 compression checksums -- I'd get errors on source 
decompression prior to the build, rather more often than actual build 
failures, altho those would happen occasionally as well, while redoing it 
would work fine -- checksums passed, and I never had a build that 
actually finished fail to run due to a bad build.

Now here's the thing.  Of course a decade ago was well before I was 
running btrfs (FWIW I was running reiserfs at the time, and it seemed 
pretty resilient given the bad RAM I had), so it was the bzip2 checksums 
it failed on.

But guess what btrfs uses for file integrity: checksums.  If your MCEs 
are either like my memory-related MCEs were, or are similar CPU-cache or 
CPU related but still something that would affect checksumming, btrfs may 
well be fighting bad checksums due to the same issues, and that would of 
course throw all sorts of wrenches into things.  Another thing I've seen 
reported as triggering MCEs is bad power (in that case it was a UPS that 
was either underpowered or going bad; once it was out of the picture, the 
MCEs and problems stopped).

Now I think you're having other btrfs issues as well, some of which are 
likely legit bugs.  However, your MCEs certainly aren't helping things, 
and I'd definitely recommend checking up on them to see what's actually 
happening to your hardware.  It may well be that without whatever 
hardware issues are triggering those MCEs, you may end up with fewer 
btrfs problems as well.

Or maybe not, but it's something to look into, because right now, 
regardless of whether they're making things worse physically, they're at 
minimum obscuring a troubleshooting picture that would be clearer without 
them.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: BTRFS free space handling still needs more work: Hangs again

2014-12-26 Thread Duncan
Robert White posted on Fri, 26 Dec 2014 14:48:38 -0800 as excerpted:

 ITEM: An SSD plus a good fast controller and default system virtual
 memory and disk scheduler activities can completely bog a system down.
 You can get into a mode where the system begins doing synchronous writes
 of vast expanses of dirty cache. The SSD is so fast that there is
 effectively zero wait for IO time and the IO subsystem is effectively
 locked or just plain busy.
 
 Look at /proc/sys/vm/dirty_background_ratio which is probably set to 10%
 of system ram.
 
 You may need/want to change this number to something closer to 4. That's
 not a hard suggestion. Some reading and analysis will be needed to find
 the best possible tuning for an advanced system.

FWIW, I can second at least this part, myself.  Half of the base problem 
is that memory speeds have increased far faster than storage speeds.  
SSDs do help with that, but the problem remains.  The other half of the 
problem is the comparatively huge memory capacity systems have today, 
with the result being that the default percentages of system RAM that 
were allowed to be dirty before kicking in background and then foreground 
flushing, reasonable back when they were introduced, simply aren't 
reasonable any longer, PARTICULARLY on spinning rust, but even on SSD.

vm.dirty_ratio is the percentage of RAM allowed to be dirty before the 
system kicks into high-priority write-flush mode.  
vm.dirty_background_ratio is likewise, but is where the system starts 
worrying about it at all, doing work in the background.

Now take my 16 GiB RAM system as an example.

The default background setting is 5%, foreground/high-priority, 10%.  
With 16 gigs RAM, that 10% is 1.6 GiB of dirty pages to flush.  A 
spinning rust drive might do 100 MiB/sec throughput contiguous, but a 
real-world number is more like 30-50 MiB/sec.

At 100 MiB/sec, that 1.6 GiB will take 16+ seconds, during which nothing 
else can be doing I/O.  So let's just divide the speed by 3 and call it 
33.3 MiB/sec.  Now we're looking at being blocked for nearly 50 seconds 
to flush all those dirty blocks.  And the system doesn't even START 
worrying about it, at even LOW priority, until it has about 25 seconds 
worth of full-usage flushing built-up!

Not only that, but that's *ALSO* 1.6 GiB worth of dirty data that isn't 
yet written to storage, that would be lost in the event of a crash!
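
FWIW, here's that arithmetic as a tiny Python helper, using the same 
assumed numbers as above (16 GiB RAM, ~33.3 MiB/sec real-world spinning 
rust):

def flush_profile(ram_gib, dirty_ratio_pct, throughput_mib_s):
    """Dirty-data cap in MiB and worst-case flush time in seconds."""
    dirty_mib = ram_gib * 1024 * dirty_ratio_pct / 100.0
    return dirty_mib, dirty_mib / throughput_mib_s

for ratio in (10, 5, 3, 1):
    mib, secs = flush_profile(16, ratio, 33.3)
    print("ratio %2d%%: %5.0f MiB dirty, ~%3.0f s to flush"
          % (ratio, mib, secs))
# ratio 10%:  1638 MiB, ~ 49 s   <- longer than the 30-second expiry
# ratio  3%:   492 MiB, ~ 15 s
# ratio  1%:   164 MiB, ~  5 s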

Of course there's a timer expiry as well.  vm.dirty_writeback_centisecs 
(that's background) defaults to 499 (5 seconds), 
vm.dirty_expire_centisecs defaults to 2999 (30 seconds).

So the first thing to notice is that it's going to take more time to 
write the dirty data we're allowing to stack up than the expiry time!  
At least to me, that makes absolutely NO sense!  At minimum, we need to 
reduce the cached writes allowed to stack up to something that can 
actually be written before they expire.  Either that, or depending on 
that 30-second expiry to make sure our dirty data is flushed in anything 
even /close/ to that time isn't going to work so well!

So assuming we think the 30 seconds is logical, the /minimum/ we need to 
do is reduce the size cap by half, to 5% high-priority/foreground (which, 
as we saw, is about 25 seconds' worth), say 2% lower-priority/background.

But that's STILL about 800 MiB at risk in case of a crash before it kicks 
into high-priority mode, and I still considered that a bit more than I 
wanted.

So what I ended up with here (set for spinning rust before I had SSD), 
was:

vm.dirty_background_ratio = 1

(low-priority flush; that's still ~160 MiB, or about 5 seconds' worth of 
activity at low-30s MiB/sec)

vm.dirty_ratio = 3

(high priority flush, roughly half a GiB, about 15 seconds of activity)

vm.dirty_writeback_centisecs=1000

(10 seconds, the background flush timeout; note that the corresponding 
size cap is ~5 seconds' worth, so about a 50% duty cycle, a bit high for 
background priority, but...)

(I left vm.dirty_expire_centisecs at the default, 2999 or 30 seconds, 
since I found that an acceptable amount of work to lose in the case of a 
crash.  Again, the corresponding size cap is ~15 seconds' worth, so a 
~50% duty cycle.  This is very reasonable for high priority: if data is 
coming in faster than that, it'll trigger high-priority flushing billed 
to the processes actually dirtying the memory in the first place, thus 
forcing them to slow down and wait for their I/O, in turn allowing other 
(CPU-bound) processes to run.)

And while 15-second interactivity latency during disk thrashing isn't 
cake, it's at least tolerable, while 50-second latency is HORRIBLE.

Meanwhile, with vm.dirty_background_ratio already set to 1, and without 
knowing whether it can take a decimal such as 0.5 (I could look, I 
suppose, but I don't really have to), that's the lowest I can go there 
unless I set it to zero.  HOWEVER, if I wanted to go lower, I could set 
the actual size version, vm.dirty_background_bytes,