Re: Provide a better free space estimate on RAID1
Roman Mamedov posted on Sun, 09 Feb 2014 04:10:50 +0600 as excerpted:

> If you need to perform a btrfs-specific operation, you can easily use
> the btrfs-specific tools to prepare for it, specifically use "btrfs fi
> df", which could provide every imaginable interpretation of free space
> estimate and then some.
>
> UNIX 'df' and the 'statfs' call on the other hand should keep the
> behavior people are accustomed to rely on since the 1970s.

Which it does... on filesystems that only have 1970s filesystem features. =:^)

RAID and multi-device filesystems aren't 1970s features; they break 1970s behavior and the assumptions associated with it. If you're not prepared to deal with those broken assumptions, don't. Use mdraid or dmraid or lvm or whatever to combine your multiple devices into one logical device, and put your filesystem (either a traditional filesystem, or even btrfs using traditional single-device functionality) on top of the single device the layer beneath the filesystem presents. Problem solved! =:^)

Note that df only lists a single device as well, not the multiple component devices of the filesystem. That's broken functionality by your definition too, and again, using some other layer like lvm or mdraid to present multiple devices as a single virtual device, with a traditional single-device filesystem layout on top of that single device... solves the problem! =:^)

Meanwhile, what I've done here is use one of df's command-line options to set its block size to 2 MiB, and further used bash's alias functionality to set up an alias accordingly:

alias df='df -B2M'

$ df /h
Filesystem     2M-blocks  Used Available Use% Mounted on
/dev/sda6          20480 12186      7909  61% /h

$ sudo btrfs fi show /h
Label: hm0238gcnx+35l0  uuid: ce23242a-b0a9-423f-a9c3-7db2729f48d6
        Total devices 2 FS bytes used 11.90GiB
        devid 1 size 20.00GiB used 14.78GiB path /dev/sda6
        devid 2 size 20.00GiB used 14.78GiB path /dev/sdb6

$ sudo btrfs fi df /h
Data, RAID1: total=14.00GiB, used=11.49GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=768.00MiB, used=414.94MiB

On btrfs such as the above I can read the 2M blocks as 1M and be happy. On btrfs such as my /boot, which aren't raid1 (I have two separate /boots, one on each device, with grub2 configured separately for each to provide a backup), or if I df my media partitions still on reiserfs on the old spinning rust, I can either double the figures df gives me, or add a second -B option at the CLI, overriding the aliased option.

If I wanted something fully automated, it'd be easy enough to set up a script that checked what filesystem I was df-ing, matched that against a table of filesystems to preferred df block sizes, and supplied the appropriate -BxX option accordingly (a sketch follows this message). As I guess most admins do after a few years, I've developed quite a library of scripts/aliases for the various things I do routinely enough to warrant it, and this would be just one more joining the list. =:^)

But of course it's your system in question, and you can patch btrfs to output anything you like, in any format you like. No need to bother with df's -B option if you'd prefer to patch the kernel instead. Me, I'll stick to the -B option. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
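The fully-automated wrapper described above could be as small as the following bash sketch. The filesystem-to-block-size table is illustrative only, and the 2M entry assumes the two-device raid1 layout shown in the message:

#!/bin/bash
# Hypothetical df wrapper: pick a -B block size per filesystem type so
# the printed numbers line up with what the profile actually lets you
# store. The type-to-size table is an example, not a recommendation.
target=${1:-.}
fstype=$(stat -f --format=%T "$target")
case "$fstype" in
    btrfs)    bs=2M ;;  # two-device raid1: read the 2M blocks as 1M
    reiserfs) bs=1K ;;  # plain single-device fs: ordinary 1 KiB blocks
    *)        bs=1K ;;
esac
exec df -B"$bs" "$target"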
Re: lost with degraded RAID1
Johan Kröckel posted on Sat, 08 Feb 2014 12:09:46 +0100 as excerpted:

> Ok, I did nuke it now and created the fs again using a 3.12 kernel. So
> far so good. Runs fine.
> Finally, I know it's kind of off-topic, but can someone help me
> interpret this (I think this is the error in the smart-log which
> started the whole mess)?
>
> Error 1 occurred at disk power-on lifetime: 2576 hours (107 days + 8 hours)
>   When the command that caused the error occurred, the device was
>   active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   04 71 00 ff ff ff 0f
>   Device Fault; Error: ABRT at LBA = 0x0fffffff = 268435455

I'm no SMART expert, but that LBA number is incredibly suspicious. With standard 512-byte sectors that's the 128 GiB boundary, the old 28-bit LBA limit (LBA28, introduced with ATA-1 in 1994; modern drives are LBA48, introduced in 2003 with ATA-6 and offering an addressing capacity of 128 PiB, according to Wikipedia's article on LBA).

It looks like something flipped back to LBA28, and when a continuing operation happened to write past that value... it triggered the abort you see in the SMART log.

Double-check your BIOS to be sure it didn't somehow revert to the old LBA28 compatibility mode or some such, and the drives, to make sure they aren't "clipped" to LBA28 compatibility mode as well.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
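For reference, the suspicious number falls straight out of the LBA28 arithmetic, and a drive clipped by a Host Protected Area can be checked from a shell. A sketch; /dev/sda is an example device, and hdparm must run as root:

# 2^28 sectors of 512 bytes = 128 GiB; the last addressable sector is
# 2^28 - 1 = 268435455 = 0x0fffffff, exactly the LBA in the SMART log.
echo $(( (1 << 28) * 512 / 1024 / 1024 / 1024 ))  # prints 128 (GiB)
printf '0x%08x\n' $(( (1 << 28) - 1 ))            # prints 0x0fffffff

# Visible vs native sector counts should match on an unclipped drive:
hdparm -N /dev/sda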
Bedup bug report
kernel 3.12.7, python 2.7.6-5, debian testing/unstable, bedup installed as per
pip install --user bedup

I tried installing the git version, but the error is the same. Anyway, with
the other bedup, I get:

gargamel:/mnt/dshelf2/backup# bedup show
Traceback (most recent call last):
  File "/usr/local/bin/bedup", line 9, in <module>
    load_entry_point('bedup==0.9.0', 'console_scripts', 'bedup')()
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 483, in script_main
    sys.exit(main(sys.argv))
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 472, in main
    return args.action(args)
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 70, in cmd_show_vols
    sess = get_session(args)
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 105, in get_session
    upgrade_schema(engine, database_exists)
  File "/root/.local/lib/python2.7/site-packages/bedup/migrations.py", line 38, in upgrade_schema
    context = MigrationContext.configure(engine.connect())
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1678, in connect
    return self._connection_cls(self, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 59, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1747, in raw_connection
    return self.pool.unique_connection()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 272, in unique_connection
    return _ConnectionFairy._checkout(self)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 608, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 425, in checkout
    rec = pool._do_get()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 838, in _do_get
    c = self._create_connection()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 277, in _create_connection
    return _ConnectionRecord(self)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 402, in __init__
    pool.dispatch.connect(self.connection, self)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/event/attr.py", line 247, in __call__
    fn(*args, **kw)
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 89, in sql_setup
    assert val == ('wal',), val
AssertionError: (u'delete',)

gargamel:/mnt/dshelf2/backup# bedup dedup --db-path dedup --defrag .
also fails the same way. The last thing that happens is:

open("/usr/lib/python2.7/lib-tk/_sqlite3module.so", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.7/lib-tk/_sqlite3.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.7/lib-tk/_sqlite3.pyc", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/python2.7/lib-dynload/_sqlite3", 0xff921890) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.7/lib-dynload/_sqlite3.i386-linux-gnu.so", O_RDONLY|O_LARGEFILE) = 5
open("/usr/lib/python2.7/lib-dynload/_sqlite3.i386-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 6
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 6
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/libsqlite3.so.0", O_RDONLY|O_CLOEXEC) = 6
getcwd("/mnt/dshelf2/backup", 1024) = 20
stat64("/mnt/dshelf2/backup/dedup", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
open("/mnt/dshelf2/backup/dedup", O_RDWR|O_CREAT|O_LARGEFILE, 0644) = 3
Traceback (most recent call last):
  File "/usr/local/bin/bedup", line 9, in <module>
open("/usr/local/bin/bedup", O_RDONLY|O_LARGEFILE) = 4
    load_entry_point('bedup==0.9.0', 'console_scripts', 'bedup')()
  File "/root/.local/lib/python2.7/site-packages/bedup-0.9.0-py2.7-linux-x86_64.egg/bedup/__main__.py", line 487, in script_main
(...)

Any suggestions?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
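The assertion at the top means bedup issued SQLite's "PRAGMA journal_mode=wal" and got back 'delete', i.e. WAL journaling could not be enabled on that database file. A hedged way to reproduce this outside bedup, using the same db file the strace shows being opened:

# If SQLite cannot switch the file to write-ahead logging, it answers
# with the journal mode actually in effect ("delete") -- the exact
# condition bedup's assertion trips on.
sqlite3 /mnt/dshelf2/backup/dedup 'PRAGMA journal_mode=wal;'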
Re: [RFC PATCH 2/2] Revert "Btrfs: remove transaction from btrfs send"
2014-02-08 23:46 GMT+08:00 Wang Shilong :
> From: Wang Shilong
>
> This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2.
> Previously I was thinking we could use a readonly root's commit root
> safely, but that is not true: a readonly root may be cowed in the
> following cases.
>
> 1. snapshot: the send root will cow the source root.
> 2. balance and device operations will also cow a readonly send root
>    to relocate.
>
> So I have two ideas to make it safe for us to use the commit root.
>
> --> approach 1:
> make it protected by a transaction, end the transaction properly, and
> re-search the next item from the root node (see
> btrfs_search_slot_for_read()).
>
> --> approach 2:
> add another counter to the local root structure to sync snapshot with
> send, and add a global counter to sync send with exclusive device
> operations.
>
> With approach 2, send can use the commit root safely, because we make
> sure the send root cannot be cowed during send. Unfortunately, it makes
> the code *ugly* and more complex to maintain.
>
> Making snapshot and send exclusive, and device operations and send
> exclusive with each other, is a little confusing for common users.
>
> So why not go back to the previous way.
>
> Cc: Josef Bacik
> Signed-off-by: Wang Shilong
> ---
> Josef, if we reach agreement to adopt this approach, please revert
> Filipe's patch (Btrfs: make some tree searches in send.c more
> efficient) from btrfs-next.

Oops -- since this patch guarantees that all searches of commit roots are
protected by a transaction, Filipe's patch is actually OK; it is Josef's
previous patch that we need to update.

Wang
Re: Provide a better free space estimate on RAID1
On Feb 8, 2014, at 7:21 PM, Chris Murphy wrote:

> we don't have a top level switch for variable raid on a volume yet

This isn't good wording. We don't have a controllable way to set variable
raid levels. The interrupted-convert model I'd consider not controllable.

Chris Murphy
Re: Provide a better free space estimate on RAID1
On Feb 8, 2014, at 6:55 PM, Roman Mamedov wrote:

> Not sure what exactly becomes problematic if a 2-device RAID1 tells the
> user they can store 1 TB of their data on it, and is no longer lying
> about the possibility to store 2 TB on it as currently.
>
> Two 1TB disks in RAID1.

OK, but while we don't have a top level switch for variable raid on a
volume yet, the on-disk format doesn't consider the device to be raid1 at
all. Neither the device, nor the volume, nor the subvolume has this
attribute. It's a function of the data, metadata or system chunk via their
profiles.

I can do a partial conversion on a volume, and could even do this multiple
times and end up with some chunks in every available option: some chunks
single, some raid1, some raid0, some raid5. All I have to do is cancel the
conversion before each conversion is complete, successively shortening the
time. And it's not fair to say this has no application because such
conversions take a long time. I might not want to do a full conversion all
at once. There's no requirement that I do so.

In any case I object to the language being used that implicitly indicates
the 'raidness' is a device or disk attribute.

Chris Murphy
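A sketch of the interrupted-convert scenario described above (the mount point /mnt is a placeholder); the result is chunks with mixed profiles, which is why 'raidness' is a per-chunk-type property rather than a device one:

# Start a conversion to raid1, cancel it midway, then inspect:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt &
sleep 60
btrfs balance cancel /mnt
btrfs fi df /mnt   # Data may now appear on both 'single' and 'RAID1' lines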
Re: Provide a better free space estimate on RAID1
On Sun, 09 Feb 2014 00:17:29 +0100
Kai Krakow wrote:

> "Dear employees,
>
> Please keep in mind that when you run out of space on the fileserver
> '\\DepartmentC', when you free up space in the directory
> '\PublicStorage7' the free space you gain on '\StorageArchive' is only
> one third of the amount you deleted, and in '\VideoFiles', you gain
> only one half.

But that's simply incorrect. Looking at my 2nd patch, which also changes
the total reported size and 'used' size, the 'total' space, 'used' space
and the space freed up as 'available' after file deletion will all match
up perfectly.

> The exercise of why is left to the reader...
>
> The proposed fix simply does not fix the problem. It simply shifts it,
> introducing the need for another fix somewhere else, which in turn
> probably also introduces another need for a fix, and so forth... This
> will become an endless effort of fixing and tuning.

Not sure what exactly becomes problematic if a 2-device RAID1 tells the
user they can store 1 TB of their data on it, and is no longer lying about
the possibility to store 2 TB on it as currently.

Two 1TB disks in RAID1.
Total space 1TB. Can store of my data: 1TB.
Wrote 100 GB of files? 100 GB used, 900 GB available, 1TB total.
Deleted 50 GB of those? 50 GB used, 950 GB available, 1TB total.

Can't see anything horribly broken about this behavior. For when you need
to "get to the bottom of things", as mentioned earlier there's always
'btrfs fi df'.

> Feel free to fix it but be prepared for the reincarnation of this
> problem when per-subvolume raid levels become introduced.

AFAIK no one has even begun to write any code to implement those yet.

-- 
With respect,
Roman
Re: Provide a better free space estimate on RAID1
On Sun, 09 Feb 2014 00:32:47 +0100
Kai Krakow wrote:

> When I started to use unix, df returned blocks, not bytes. Without your
> proposed patch, it does that right. With your patch, it does it wrong.

It returns total/used/available space that is usable/used/available
by/for user data. Whether that be in sectors, blocks, kilobytes,
megabytes or in some other unit is a secondary detail, which is also
unrelated to the change being currently discussed and not affected by it.

-- 
With respect,
Roman
Re: btrfsck does not fix
On Feb 8, 2014, at 3:01 PM, Hendrik Friedel wrote:

> Hello,
>
>> Ok.
>> I think I do/did have some symptoms, but I cannot exclude other
>> reasons...
>> - High load without high cpu usage (io was the bottleneck)
>> - Just now: transferring from one directory to the other on the same
>>   subvolume (from /mnt/subvol/A/B to /mnt/subvol/A) I get 1.2MB/s
>>   instead of > 60.
>> - For some of the files I even got a "no space left on device" error.
>>
>> This is without any messages in dmesg or syslog related to btrfs.
>
> As I don't see that I can fix this, I intend to re-create the
> filesystem. For that, I need to remove one of the two discs from the
> raid/filesystem, then create a new fs on this and move the data to it
> (I have no spare).
> Could you please advise me whether this will be successful?
>
> First some information on the filesystem:
>
> ./btrfs filesystem show /dev/sdb1
> Label: none  uuid: 989306aa-d291-4752-8477-0baf94f8c42f
>         Total devices 2 FS bytes used 3.47TiB
>         devid 1 size 2.73TiB used 1.74TiB path /dev/sdb1
>         devid 2 size 2.73TiB used 1.74TiB path /dev/sdc1

I don't understand the "no spare" part. You have 3.47T of data, and yet the
single device size is 2.73T. There is no way to migrate 1.74T from sdc1 to
sdb1 because there isn't enough space.

> /btrfs subvolume list /mnt/BTRFS/Video
> ID 256 gen 226429 top level 5 path Video
> ID 1495 gen 226141 top level 5 path rsnapshot
> ID gen 226429 top level 256 path Snapshot
> ID 5845 gen 226375 top level 5 path backups
>
> btrfs fi df /mnt/BTRFS/Video/
> Data, RAID0: total=3.48TB, used=3.47TB
> System, RAID1: total=32.00MB, used=260.00KB
> Metadata, RAID1: total=4.49GB, used=3.85GB
>
> What I did already yesterday was:
>
> btrfs device delete /dev/sdc1 /mnt/BTRFS/rsnapshot/
> btrfs device delete /dev/sdc1 /mnt/BTRFS/backups/
> btrfs device delete /dev/sdc1 /mnt/BTRFS/Video/
> btrfs filesystem balance start /mnt/BTRFS/Video/

I don't understand this sequence because I don't know what you've mounted
where, but in any case maybe it's a bug that you're not getting errors for
each of these commands, because you can't delete sdc1 from a raid0 volume.
You'd first have to convert the data, metadata, and system profiles to
single (metadata can be set to dup). And then you'd be able to delete a
device, so long as there's room on the remaining devices, which you don't
have. (A sketch of that conversion follows this message.)

> Next, I'm doing the balance for the subvolume /mnt/BTRFS/backups.

You told us above you deleted that subvolume. So how are you balancing it?
And also, balance applies to a mount point, and even if you mount a
subvolume at that mount point, the whole file system is balanced, not just
the mounted subvolume.

> In parallel, I try to delete /mnt/BTRFS/rsnapshot, but it fails:
> btrfs subvolume delete /mnt/BTRFS/rsnapshot/
> Delete subvolume '/mnt/BTRFS/rsnapshot'
> ERROR: cannot delete '/mnt/BTRFS/rsnapshot' - Inappropriate ioctl
> for device
>
> Why's that?
> But even more: how do I free sdc1 now?!

Well, I'm pretty confused, because again, I can't tell if your paths refer
to subvolumes or if they refer to mount points. The balance and device
delete commands all refer to a mount point, which is the path returned by
the df command. The subvolume delete command needs a path to a subvolume
that starts with the mount point.

Chris Murphy
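A sketch of the conversion described above, assuming /mnt/BTRFS/Video is the volume's mount point and -- contrary to this particular case -- that the surviving device has room for all the data. -f is needed because converting system chunks requires force:

btrfs balance start -dconvert=single -mconvert=dup -sconvert=dup -f /mnt/BTRFS/Video
# only once the conversion has completed:
btrfs device delete /dev/sdc1 /mnt/BTRFS/Video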
Re: Provide a better free space estimate on RAID1
Roman Mamedov schrieb:

> UNIX 'df' and the 'statfs' call on the other hand should keep the
> behavior people are accustomed to rely on since the 1970s.

When I started to use unix, df returned blocks, not bytes. Without your
proposed patch, it does that right. With your patch, it does it wrong.

-- 
Replies to list only preferred.
Re: Provide a better free space estimate on RAID1
cwillu schrieb:

> Everyone who has actually looked at what the statfs syscall returns
> and how df (and everyone else) uses it, keep talking. Everyone else,
> go read that source code first.
>
> There is _no_ combination of values you can return in statfs which
> will not be grossly misleading in some common scenario that someone
> cares about.

Thanks man! statfs returns free blocks. So let's stick with that. The df
command, as people try to understand it, is broken by design on btrfs. One
has to live with that. The df command as it has worked since 1970 returns
free blocks - and it does that perfectly fine on btrfs without that
proposed "fix".

User space should not try to be smart about how many blocks will be
written to the filesystem if it writes xyz bytes to it. It has been that
way since 1970 (or whatever), and it will be that way in the future. And a
good file-copying GUI should give you the choice of "I know better, copy
anyway" (like every other unix utility).

Your pointer says everything there is to say about it.

-- 
Replies to list only preferred.
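For anyone who wants to look at the raw statfs(2) numbers df consumes without reading source, GNU coreutils' stat can print them directly; /mnt is a placeholder mount point:

# %T fs type, %S fundamental block size, %b total blocks,
# %f free blocks, %a blocks available to unprivileged users
stat -f --format='type=%T bsize=%S blocks=%b bfree=%f bavail=%a' /mnt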
Re: Provide a better free space estimate on RAID1
Roman Mamedov schrieb:

>> It should show the raw space available. Btrfs also supports
>> compression and doesn't try to be smart about how much compressed data
>> would fit in the free space of the drive. If one is using RAID1, it's
>> supposed to fill up with a rate of 2:1. If one is using compression,
>> it's supposed to fill up with a rate of maybe 1:5 for mostly text
>> files.
>
> Imagine a small business with some 30-40 employees. There is a piece of
> paper near the door at the office so that everyone sees it when
> entering or leaving, which says:
>
> "Dear employees,
>
> Please keep in mind that on the fileserver '\\DepartmentC', in the
> directory '\PublicStorage7' the free space you see as being available
> needs to be divided by two; On the server '\\DepartmentD', in
> '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more
> details please contact the IT operations team. Further assistance will
> be provided at the monthly training seminar.

"Dear employees,

Please keep in mind that when you run out of space on the fileserver
'\\DepartmentC', and you free up space in the directory '\PublicStorage7',
the free space you gain on '\StorageArchive' is only one third of the
amount you deleted, and in '\VideoFiles', you gain only one half. For more
details please contact the IT operations team. Further assistance will be
provided at the monthly training seminar.

Regards,
John S, CTO."

The exercise of why is left to the reader...

The proposed fix simply does not fix the problem. It simply shifts it,
introducing the need for another fix somewhere else, which in turn
probably also introduces another need for a fix, and so forth... This will
become an endless effort of fixing and tuning. It simply does not work
because btrfs' design does not allow it. Feel free to fix it, but be
prepared for the reincarnation of this problem when per-subvolume raid
levels become introduced.

The problem has to be fixed in user space or with a new API call.

-- 
Replies to list only preferred.
Re: Provide a better free space estimate on RAID1
Everyone who has actually looked at what the statfs syscall returns and
how df (and everyone else) uses it, keep talking. Everyone else, go read
that source code first.

There is _no_ combination of values you can return in statfs which will
not be grossly misleading in some common scenario that someone cares
about.
[PATCH] btrfs: always choose work from prio_head first
In case we do not refill, we can overwrite the cur pointer taken from
prio_head with one from the non-prioritized head, which looks like
something that was not intended. This change makes us always take work
from prio_head first, until it is empty.

Signed-off-by: Stanislaw Gruszka
---
I found this by reading the code; I'm not sure if the change is correct.
The patch is only compile tested.

 fs/btrfs/async-thread.c | 9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index c1e0b0c..0b78bf2 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -262,18 +262,19 @@ static struct btrfs_work *get_next_work(struct btrfs_worker_thread *worker,
 	struct btrfs_work *work = NULL;
 	struct list_head *cur = NULL;
 
-	if (!list_empty(prio_head))
+	if (!list_empty(prio_head)) {
 		cur = prio_head->next;
+		goto out;
+	}
 
 	smp_mb();
 	if (!list_empty(&worker->prio_pending))
 		goto refill;
 
-	if (!list_empty(head))
+	if (!list_empty(head)) {
 		cur = head->next;
-
-	if (cur)
 		goto out;
+	}
 
 refill:
 	spin_lock_irq(&worker->lock);
-- 
1.7.4.4
Re: Provide a better free space estimate on RAID1
On Sat, 08 Feb 2014 22:35:40 +0100
Kai Krakow wrote:

> Imagine the future: Btrfs supports different RAID levels per subvolume.
> We need to figure out where to place a new subvolume. I need raw
> numbers for it. Df won't tell me that now. Things become very difficult
> now.

If you need to perform a btrfs-specific operation, you can easily use the
btrfs-specific tools to prepare for it, specifically "btrfs fi df", which
could provide every imaginable interpretation of free space estimate and
then some.

UNIX 'df' and the 'statfs' call on the other hand should keep the behavior
people are accustomed to rely on since the 1970s.

-- 
With respect,
Roman
Re: btrfsck does not fix
Hello,

> Ok.
> I think I do/did have some symptoms, but I cannot exclude other
> reasons...
> - High load without high cpu usage (io was the bottleneck)
> - Just now: transferring from one directory to the other on the same
>   subvolume (from /mnt/subvol/A/B to /mnt/subvol/A) I get 1.2MB/s
>   instead of > 60.
> - For some of the files I even got a "no space left on device" error.
>
> This is without any messages in dmesg or syslog related to btrfs.

As I don't see that I can fix this, I intend to re-create the filesystem.
For that, I need to remove one of the two discs from the raid/filesystem,
then create a new fs on this and move the data to it (I have no spare).
Could you please advise me whether this will be successful?

First some information on the filesystem:

./btrfs filesystem show /dev/sdb1
Label: none  uuid: 989306aa-d291-4752-8477-0baf94f8c42f
        Total devices 2 FS bytes used 3.47TiB
        devid 1 size 2.73TiB used 1.74TiB path /dev/sdb1
        devid 2 size 2.73TiB used 1.74TiB path /dev/sdc1

/btrfs subvolume list /mnt/BTRFS/Video
ID 256 gen 226429 top level 5 path Video
ID 1495 gen 226141 top level 5 path rsnapshot
ID gen 226429 top level 256 path Snapshot
ID 5845 gen 226375 top level 5 path backups

btrfs fi df /mnt/BTRFS/Video/
Data, RAID0: total=3.48TB, used=3.47TB
System, RAID1: total=32.00MB, used=260.00KB
Metadata, RAID1: total=4.49GB, used=3.85GB

What I did already yesterday was:

btrfs device delete /dev/sdc1 /mnt/BTRFS/rsnapshot/
btrfs device delete /dev/sdc1 /mnt/BTRFS/backups/
btrfs device delete /dev/sdc1 /mnt/BTRFS/Video/
btrfs filesystem balance start /mnt/BTRFS/Video/

Next, I'm doing the balance for the subvolume /mnt/BTRFS/backups.

In parallel, I try to delete /mnt/BTRFS/rsnapshot, but it fails:
btrfs subvolume delete /mnt/BTRFS/rsnapshot/
Delete subvolume '/mnt/BTRFS/rsnapshot'
ERROR: cannot delete '/mnt/BTRFS/rsnapshot' - Inappropriate ioctl
for device

Why's that?
But even more: how do I free sdc1 now?!

Greetings,
Hendrik
Re: Provide a better free space estimate on RAID1
Martin Steigerwald schrieb:

> While I understand that there is *never* a guarantee that a given
> amount of free space can really be allocated by a process, cause other
> processes can allocate space as well in the meantime, and while I
> understand that it's difficult to provide exact figures as soon as RAID
> settings can be set per subvolume, I still think it's important to
> improve on the figures.

The question here: does the free space indicator "fail" predictably or
unpredictably? It will do the latter with this change.

-- 
Replies to list only preferred.
Re: Provide a better free space estimate on RAID1
Chris Murphy schrieb:

> On Feb 6, 2014, at 11:08 PM, Roman Mamedov wrote:
>
>> And what if I am accessing that partition on a server via a network
>> CIFS/NFS share and don't even *have a way to find out* any of that.
>
> That's the strongest argument. And if the user is using
> Explorer/Finder/Nautilus to copy files to the share, I'm pretty sure
> all three determine if there's enough free space in advance of starting
> the copy. So if it thinks there's free space, it will start to copy and
> then later fail midstream when there's no more space. And then the
> user's copy task is in a questionable state as to what's been copied,
> depending on how the file copies are being threaded.

This problem was already solved for remote file systems maybe 20-30 years
ago: you cannot know how much space will be left at the end of the copy by
looking at the numbers before the copy - it may have been used up by
another user copying a file at the same time.

The problem has been solved by applying hard and soft quotas: the sysadmin
does an optimistic (or possibly even pessimistic) planning and applies
quotas. Soft quotas can be exceeded for (maybe) 7 days, after which you
need to free up space again before adding new data. Hard quotas are the
hard cutoff - you cannot pass that barrier. Df will show you what's free
within your soft quota. Problem solved. If you need better numbers, there
are quota commands instead of df. Why break with this design choice?

If you manage central shared storage for end users, you should really
start thinking about quotas. Without them, you cannot even exactly plan
your backups. If df shows transformed/guessed numbers to the sysadmins,
things start to become very complicated and unpredictable.

-- 
Replies to list only preferred.
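On btrfs, the closest analogue to those hard/soft quotas is the qgroup machinery. A sketch only -- the subvolume path is a placeholder, and qgroup support was still maturing in kernels of this era:

# Enable quota tracking, cap a subvolume at 100GiB, then review usage:
btrfs quota enable /srv/share
btrfs qgroup limit 100G /srv/share/users
btrfs qgroup show /srv/share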
Re: Provide a better free space estimate on RAID1
Hugo Mills schrieb:

> On Sat, Feb 08, 2014 at 05:33:10PM +0600, Roman Mamedov wrote:
>> On Fri, 07 Feb 2014 21:32:42 +0100
>> Kai Krakow wrote:
>>
>>> It should show the raw space available. Btrfs also supports
>>> compression and doesn't try to be smart about how much compressed
>>> data would fit in the free space of the drive. If one is using RAID1,
>>> it's supposed to fill up with a rate of 2:1. If one is using
>>> compression, it's supposed to fill up with a rate of maybe 1:5 for
>>> mostly text files.
>>
>> Imagine a small business with some 30-40 employees. There is a piece
>> of paper near the door at the office so that everyone sees it when
>> entering or leaving, which says:
>>
>> "Dear employees,
>>
>> Please keep in mind that on the fileserver '\\DepartmentC', in the
>> directory '\PublicStorage7' the free space you see as being available
>> needs to be divided by two; On the server '\\DepartmentD', in
>> '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For
>> more details please contact the IT operations team. Further assistance
>> will be provided at the monthly training seminar.
>>
>> Regards,
>> John S, CTO.'
>
> In my experience, nobody who uses a shared filesystem *ever* looks at
> the amount of free space on it, until it fills up, at which point they
> may look at the free space and see "0". Or most likely, they'll be
> alerted to the issue by an email from the systems people saying,
> "please will everyone delete unnecessary files from the shared drive,
> because it's full up."

Exactly that is the point from my practical experience. Only sysadmins
watch these numbers, and they'd know how to handle them. Imagine the
future: btrfs supports different RAID levels per subvolume. We need to
figure out where to place a new subvolume. I need raw numbers for that. Df
won't tell me them now. Things become very difficult then.

Free space is a number unimportant to end users. They won't look at it.
They start to cry and call helpdesk if an application says: disk is full.
You cannot even save your unsaved document, because: disk full. The only
way to solve this is to apply quotas to users and let the sysadmins do the
space usage planning. That will work.

I still think there should be an extra utility which guesses the predicted
usable free space - or an option added to df to show that. Roman's
argument is only one view of the problem. My argument (sysadmin space
planning) is exactly the opposite view. In the future, free space
prediction will only become more complicated, involve more code, introduce
bugs... It should be done in user space. Df should receive raw numbers.

Storage space is cheap these days. You should just throw another disk at
the array if free space falls below a certain threshold. End users do not
care about free space. They just cry when it's full - no matter how
accurate the numbers had been before. They will certainly not cry if they
copied 2 MB to the disk but 4 MB were taken. In a shared storage space
this is probably always the case anyway, because just at the very same
moment someone else also copied 2 MB to the volume. So what?

> Having a more accurate estimate of the free space is a laudable aim,
> and in principle I agree with attempts to do it, but I think the
> argument above isn't exactly a strong one in practice.

I do not disagree either. But I think it should go into a separate
utility, or there should be a new API call in the kernel to get the
predicted usable free space based on the current usage pattern.

Df is meant as a utility to get accurate numbers. It should not tell you
guessed numbers. However you design a df calculator in btrfs, it could
always be too pessimistic or too optimistic (and could even switch
unpredictably between both situations). So whatever you do: it is always
inaccurate. It will never be able to tell you exactly the numbers you
need.

If disk space is low: add disks. Clean up. Whatever. Just simply do not
try to fill up your FS to just 1kb left. Btrfs doesn't like that anyway.
So: use quotas.

Picking up the piece-of-paper example: you still have to tell your
employees that the free space numbers aren't exact anyway, so their best
chance is to simply not look at them; they are better off with just trying
to copy something.

Besides: if you want to fix this, what about the early-ENOSPC problem
which is there by design (allocation in chunks)? You'd need to fix that,
too.

-- 
Replies to list only preferred.
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
Hello, David, Fengguang, Chris.

On Fri, Feb 07, 2014 at 01:13:06PM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Fengguang Wu wrote:
>
>> On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote:
>>> On Fri, 7 Feb 2014, Fengguang Wu wrote:
>>>
>>>> [    1.625020] BTRFS: selftest: Running btrfs_split_item tests
>>>> [    1.627004] BTRFS: selftest: Running find delalloc tests
>>>> [    2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
>>>> [  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0,
>>>> order=1, oom_score_adj=0
>>>> [  292.086439] kthreadd cpuset=
>>>> [  292.087072] BUG: unable to handle kernel NULL pointer dereference
>>>> at 0000000000000038
>>>> [  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c
>>>
>>> This looks like a problem with the cpuset cgroup name, are you sure
>>> this isn't related to the removal of cgroup->name?
>>
>> It looks unrelated to the patch "cgroup: remove cgroup->name", because
>> that patch lies in the cgroup tree and is not contained in the output
>> of "git log BAD_COMMIT".
>
> It's dying on pr_cont_kernfs_name, which is from some tree that has
> "kernfs: implement kernfs_get_parent(), kernfs_name/path() and
> friends", which is not in linux-next, and is obviously printing the
> cpuset cgroup name.
>
> It doesn't look like it has anything at all to do with btrfs or why
> they would care about this failure.

Yeah, this is from a patch in the cgroup/review-post-kernfs-conversion
branch which updates cgroup to use pr_cont_kernfs_name(). I forgot that
cgrp->kn is NULL for the dummy_root's top cgroup, and thus it ends up
calling the kernfs functions with a NULL kn and oopses. I posted an
updated patch and the git branch has been updated.

  http://lkml.kernel.org/g/20140208200640.gb10...@htj.dyndns.org

So, nothing to do with btrfs, and it looks like somehow the test apparatus
is mixing up branches?

Thanks!

-- 
tejun
system stuck with flush-btrfs-4 at 100% after filesystem resize
Hello,

I have a large file system that has been growing. We've resized it a
couple of times with the following approach:

lvextend -L +800G /dev/raid/virtual_machines
btrfs filesystem resize +800G /vms

I think the FS started out at 200G, we increased it by 200GB a time or
two, then by 800GB, and everything worked fine. The filesystem hosts a
number of virtual machines, so the file system is in use, although the VMs
individually tend not to be overly active. VMs tend to be in subvolumes,
and some of those subvolumes have snapshots.

This time, I increased it by another 800GB, and it has hung for many hours
(over night) with flush-btrfs-4 near 100% cpu all that time. I'm not clear
at this point that it will finish or where to go from here. Any pointers
would be much appreciated.

Thanks,
-john (newbie to BTRFS)

procedure log
--
romulus:/home/users/johnn # lvextend -L +800G /dev/raid/virtual_machines
romulus:/home/users/johnn # btrfs filesystem resize +800G /vms
Resize '/vms' of '+800G'
[hangs]

top - 12:21:53 up 136 days, 2:45, 13 users, load average: 30.39, 30.37, 30.37
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 2.3 sy, 0.0 ni, 95.1 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem:  129147 total, 127427 used,   1720 free,    264 buffers
MiB Swap: 262143 total,    661 used, 261482 free,  93666 cached

  PID USER  PR NI VIRT RES SHR S %CPU %MEM   TIME+   COMMAND
48809 root  20  0     0   0   0 R 99.3  0.0 1449:14 flush-btrfs-4

--- misc info ---
romulus:/home/users/johnn # cat /etc/SuSE-release
openSUSE 12.3 (x86_64)
VERSION = 12.3
CODENAME = Dartmouth

romulus:/home/users/johnn # uname -a
Linux romulus.us.redacted.com 3.7.10-1.16-desktop #1 SMP PREEMPT Fri May 31 20:21:23 UTC 2013 (97c14ba) x86_64 x86_64 x86_64 GNU/Linux

romulus:/home/users/johnn # vgdisplay
  --- Volume group ---
  VG Name               raid
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  19
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                7
  Open LV               7
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               10.91 TiB
  PE Size               4.00 MiB
  Total PE              2859333
  Alloc PE / Size       1371136 / 5.23 TiB
  Free  PE / Size       1488197 / 5.68 TiB
  VG UUID               npyvGj-7vxF-IoI8-Z4tF-ygpP-Q2Ja-vV8sLA
[...]

romulus:/home/users/johnn # lvdisplay
[...]
  --- Logical volume ---
  LV Path                /dev/raid/virtual_machines
  LV Name                virtual_machines
  VG Name                raid
  LV UUID                qtzNBG-vuLV-EsgO-FDIf-sO7A-GKmd-EVjGjp
  LV Write Access        read/write
  LV Creation host, time romulus.redacted.com, 2013-09-25 11:05:54 -0500
  LV Status              available
  # open                 1
  LV Size                2.54 TiB
  Current LE             665600
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
[...]

johnn@romulus:~> df -h /vms
Filesystem  Size  Used Avail Use% Mounted on
/dev/dm-4   1.8T  1.8T  6.0G 100% /vms

romulus:/home/users/johnn # btrfs filesystem show
[...]
Label: none  uuid: f08c5602-f53a-43c9-b498-fa788b01e679
        Total devices 1 FS bytes used 1.74TB
        devid 1 size 1.76TB used 1.76TB path /dev/dm-4
[...]
Btrfs v0.19+

romulus:/home/users/johnn # btrfs subvolume list /vms
ID 324 top level 5 path johnn-centos64
ID 325 top level 5 path johnn-ubuntu1304
ID 326 top level 5 path johnn-opensuse1203
ID 327 top level 5 path johnn-sles11sp3
ID 328 top level 5 path johnn-sles11sp2
ID 329 top level 5 path johnn-fedora19
ID 330 top level 5 path johnn-sles11sp1
ID 394 top level 5 path redacted-glance
ID 396 top level 5 path redacted_test
ID 397 top level 5 path glance
ID 403 top level 5 path test_redacted
ID 414 top level 5 path johnn-disktest
ID 460 top level 5 path redacted-opensuse-01
ID 472 top level 5 path redacted
ID 473 top level 5 path redacted2
ID 496 top level 5 path redacted_test
ID 524 top level 5 path redacted-moab
ID 525 top level 5 path redacted_redacted-1
ID 531 top level 5 path .snapshots/johnn-sles11sp2/2013.10.11-14:25.18/johnn-sles11sp2
ID 533 top level 5 path .snapshots/johnn-centos64/2013.10.11-15:32.16/johnn-centos64
ID 534 top level 5 path .snapshots/johnn-ubuntu1304/2013.10.11-15:33.20/johnn-ubuntu1304
ID 535 top level 5 path .snapshots/johnn-opensuse1203/2013.10.11-15:36.19/johnn-opensuse1203
ID 536 top level 5 path .snapshots/johnn-sles11sp3/2013.10.11-15:39.51/johnn-sles11sp3
ID 537 top level 5 path .snapshots/johnn-fedora19/2013.10.11-15:41.08/johnn-fedora19
ID 538 top level 5 path .snapshots/johnn-sles11sp2/2013.10.11-16
Recovering from persistent kernel oops on 'btrfs balance'
Hi,

I added a 2nd device and 'btrfs balance' crashed (kernel oops) half way
through. Now I can only read the fs from a rawhide live DVD, but even that
can't fix the fs (finish the balance, or remove the 2nd device to try
again). I'd be grateful for any advice on getting back to a working btrfs
filesystem.

Details
===

Hardware: Asus P5G41T-M with Pentium dual core E2140, 4GB ram, OS on ext4
drive, two 4TB Seagate "NAS" SATA drives.

On Ubuntu 13.04 x86_64 (3.8 kernel, btrfs-tools 0.19+20130117):

1. Install new 4TB drive (/dev/sdb), use gparted to create a full-disk
   btrfs partition, mount on /ark, copy ~500GB data; everything working
   well for a couple weeks.
2. Install an additional identical 4TB drive, following
   https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Adding_new_devices
3. btrfs device add /dev/sdc /ark
4. btrfs balance start -dconvert=raid1 -mconvert=raid1 /ark
5. After ~1 hour, at about 50% (according to 'btrfs balance status'), the
   system locks up with this displayed (sorry, JPEG):
   http://i.imgur.com/Ds9pnZV.jpg
6. System repeats the same oops on startup.
7. After removing /dev/sdc, the system boots but can't see anything on
   /ark.

I guess using a 3.8 kernel wasn't the smartest idea. Let's update.

8. Update to Ubuntu 13.10 x86_64 (3.11 kernel, btrfs-tools
   0.19+20130705-1).
9. Now the system boots with /dev/sdc plugged in but still can't see data
   on /ark; IIRC the balance command gave a similar kernel oops.
10. Fine, I'll try Rawhide. From Jan 30, 2014, kernel
    3.14.0-0.rc0.git17.1.fc21.x86_64.
11. I can see data on /ark!
12. If I try to 'btrfs balance resume' or 'btrfs balance cancel' I get
    roughly the same kernel oops: http://pastebin.ca/2634583
13. 'btrfs device delete /dev/sdc /ark' says it cannot be done while a
    balance is underway.
14. Help!

Any suggestion on how to recover the btrfs fs? My last-resort idea is to
pull /dev/sdb (which seems to have actual data that rawhide can see),
format /dev/sdc as ext4, plug both drives in again and copy from btrfs
/dev/sdb to ext4 /dev/sdc, then wipe the btrfs fs on /dev/sdb and try
again with the 3.11 kernel (or just with rawhide?). But that is a whole
lot of copying it would be nice to avoid.

Thanks,
-Nathan
[PATCH][V2] Re: Provide a better free space estimate on RAID1
On 02/07/2014 05:40 AM, Roman Mamedov wrote:
> On Thu, 06 Feb 2014 20:54:19 +0100
> Goffredo Baroncelli wrote:
> [...]

Even though I am not entirely convinced, I updated Roman's PoC to take
into account all the RAID levels.

The test filesystem is composed of seven 51GB disks. Here are my "df"
results:

Profile: single
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  512K  348G   1% /mnt/btrfs1

Profile: raid1
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  1.3M  150G   1% /mnt/btrfs1

Profile: raid10
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  2.3M  153G   1% /mnt/btrfs1

Profile: raid5
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  2.0M  298G   1% /mnt/btrfs1

Profile: raid6
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  1.8M  248G   1% /mnt/btrfs1

Note that RAID1 and RAID10 can only use an even number of disks. The
mixed mode (data and metadata in the same chunk) returns strange results.

Below is my patch. (A rough cross-check of these numbers follows at the
end of this message.)

BR
G.Baroncelli

Changes history:
V1 First issue
V2 Correct an (old) bug for when, in RAID10, the disks aren't a multiple
   of 4

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d71a11d..aea9afa 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1481,10 +1481,16 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 				       u64 *free_bytes)
 		num_stripes = nr_devices;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1) {
 		min_stripes = 2;
-		num_stripes = 2;
+		num_stripes = nr_devices & ~1llu;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
-		num_stripes = 4;
+		num_stripes = nr_devices & ~1llu;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+		min_stripes = 3;
+		num_stripes = nr_devices;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+		min_stripes = 4;
+		num_stripes = nr_devices;
 	}
 
 	if (type & BTRFS_BLOCK_GROUP_DUP)
@@ -1561,8 +1567,30 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 		if (devices_info[i].max_avail >= min_stripe_size) {
 			int j;
 			u64 alloc_size;
+			int k;
 
-			avail_space += devices_info[i].max_avail * num_stripes;
+			/*
+			 * Depending on the RAID profile, we use some
+			 * disk space as redundancy:
+			 * RAID1, RAID10, DUP -> half of space used as redundancy
+			 * RAID5 -> 1 stripe used as redundancy
+			 * RAID6 -> 2 stripes used as redundancy
+			 * RAID0, LINEAR -> no redundancy
+			 */
+			if (type & BTRFS_BLOCK_GROUP_RAID1) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_DUP) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+				k = num_stripes - 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+				k = num_stripes - 2;
+			} else { /* RAID0/LINEAR */
+				k = num_stripes;
+			}
+			avail_space += devices_info[i].max_avail * k;
 			alloc_size = devices_info[i].max_avail;
 			for (j = i + 1 - num_stripes; j <= i; j++)
 				devices_info[j].max_avail -= alloc_size;

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
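A rough cross-check of the df figures above against the patch's k factors, assuming seven 51G devices and ignoring metadata/system reservations (which is why the kernel's reported values come out a few GB lower):

echo "single: $((51 * 7))G"  # k=7 -> 357G; df reports 348G
echo "raid1:  $((51 * 3))G"  # num_stripes=6, k=3 -> 153G; df reports 150G
echo "raid10: $((51 * 3))G"  # num_stripes=6, k=3 -> 153G; df reports 153G
echo "raid5:  $((51 * 6))G"  # k=6 -> 306G; df reports 298G
echo "raid6:  $((51 * 5))G"  # k=5 -> 255G; df reports 248G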
[PATCH] xfstests: add test for btrfs data corruption when using compression
Test for a btrfs data corruption when using compressed files/extents.
Under certain cases, it was possible for reads to return random data
(content from a previously used page) instead of zeroes. This also
caused partial updates to those regions that were supposed to be filled
with zeroes to save random (and invalid) data into the file extents.

This is fixed by the commit for the linux kernel titled:

  Btrfs: fix data corruption when reading/updating compressed extents
  (https://patchwork.kernel.org/patch/3610391/)

Signed-off-by: Filipe David Borba Manana
---
 tests/btrfs/036     | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/036.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 113 insertions(+)
 create mode 100755 tests/btrfs/036
 create mode 100644 tests/btrfs/036.out

diff --git a/tests/btrfs/036 b/tests/btrfs/036
new file mode 100755
index 000..533b6ee
--- /dev/null
+++ b/tests/btrfs/036
@@ -0,0 +1,111 @@
+#! /bin/bash
+# FS QA Test No. btrfs/036
+#
+# Test for a btrfs data corruption when using compressed files/extents.
+# Under certain cases, it was possible for reads to return random data
+# (content from a previously used page) instead of zeroes. This also
+# caused partial updates to those regions that were supposed to be filled
+# with zeroes to save random (and invalid) data into the file extents.
+#
+# This is fixed by the commit for the linux kernel titled:
+#
+#   Btrfs: fix data corruption when reading/updating compressed extents
+#
+#---
+# Copyright (c) 2014 Filipe Manana. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount "-o compress-force=lzo"
+
+run_check $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" \
+    $SCRATCH_MNT/foobar
+run_check $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
+run_check $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
+run_check $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
+
+# Expected file items in the fs tree are (from btrfs-debug-tree):
+#
+# item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
+#     inode generation 6 transid 6 size 542872 block group 0 mode 100600
+# item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
+#     inode ref index 2 namelen 6 name: foobar
+# item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
+#     extent data disk byte 0 nr 0 gen 6
+#     extent data offset 0 nr 24576 ram 266240
+#     extent compression 0
+# item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
+#     prealloc data disk byte 12849152 nr 241664 gen 6
+#     prealloc data offset 0 nr 241664
+# item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
+#     extent data disk byte 12845056 nr 4096 gen 6
+#     extent data offset 0 nr 20480 ram 20480
+#     extent compression 2
+# item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
+#     prealloc data disk byte 13090816 nr 405504 gen 6
+#     prealloc data offset 0 nr 258048
+#
+# The on disk extent at 266240, contains 5 compressed chunks of file data.
+# Each of the first 4 chunks compress 4096 bytes of file data, while the last
+# one compresses only 3024 bytes of file data. Because this extent item is not
+# the last one in the file, as it is followed by a prealloc extent, reads into
+# the region [285648 ; 286720[ (length = 4096 - 3024) should return zeroes.
+
+_scratch_unmount
+_check_btrfs_filesystem $SCRATCH_DEV
+
+EXPECTED_MD5="b8b0dbb8e02f94123c741c23659a1c0a"
+
+for i in `seq 1 27`
+do
+    _scratch_mount "-o ro"
+    MD5=`md5sum $SCRATCH_MNT/foobar | cut -f 1 -d ' '`
+    _scratch_unmount
+    if [ "${MD5}x" != "${EXPECTED_MD5}x" ]
+    then
+        echo "Unexpected file digest (wanted $EXPECTED_MD5, got $MD5)"
+    fi
+done
+
+status=0
+exit
[PATCH 1/2] Btrfs: skip readonly root for snapshot-aware defragment
From: Wang Shilong

Btrfs send assumes a readonly root won't change; let's skip
readonly roots.

Signed-off-by: Wang Shilong
---
 fs/btrfs/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1af34d0..e8dfd83 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2239,6 +2239,11 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
 		return PTR_ERR(root);
 	}
 
+	if (btrfs_root_readonly(root)) {
+		srcu_read_unlock(&fs_info->subvol_srcu, index);
+		return 0;
+	}
+
 	/* step 2: get inode */
 	key.objectid = backref->inum;
 	key.type = BTRFS_INODE_ITEM_KEY;
-- 
1.8.4
[RFC PATCH 2/2] Revert "Btrfs: remove transaction from btrfs send"
From: Wang Shilong

This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2.
Previously I was thinking we could use a readonly root's commit root
safely, but that is not true: a readonly root may be cowed in the
following cases.

1. snapshot: the send root will cow the source root.
2. balance and device operations will also cow a readonly send root
   to relocate.

So I have two ideas to make it safe for us to use the commit root.

--> approach 1:
make it protected by a transaction, end the transaction properly, and
re-search the next item from the root node (see
btrfs_search_slot_for_read()).

--> approach 2:
add another counter to the local root structure to sync snapshot with
send, and add a global counter to sync send with exclusive device
operations.

With approach 2, send can use the commit root safely, because we make
sure the send root cannot be cowed during send. Unfortunately, it makes
the code *ugly* and more complex to maintain.

Making snapshot and send exclusive, and device operations and send
exclusive with each other, is a little confusing for common users.

So why not go back to the previous way.

Cc: Josef Bacik
Signed-off-by: Wang Shilong
---
Josef, if we reach agreement to adopt this approach, please revert
Filipe's patch (Btrfs: make some tree searches in send.c more efficient)
from btrfs-next.
---
 fs/btrfs/send.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 168b9ec..e9d1265 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5098,6 +5098,7 @@ out:
 static int full_send_tree(struct send_ctx *sctx)
 {
 	int ret;
+	struct btrfs_trans_handle *trans = NULL;
 	struct btrfs_root *send_root = sctx->send_root;
 	struct btrfs_key key;
 	struct btrfs_key found_key;
@@ -5119,6 +5120,19 @@ static int full_send_tree(struct send_ctx *sctx)
 	key.type = BTRFS_INODE_ITEM_KEY;
 	key.offset = 0;
 
+join_trans:
+	/*
+	 * We need to make sure the transaction does not get committed
+	 * while we do anything on commit roots. Join a transaction to
+	 * prevent this.
+	 */
+	trans = btrfs_join_transaction(send_root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto out;
+	}
+
 	/*
 	 * Make sure the tree has not changed after re-joining. We detect this
 	 * by comparing start_ctransid and ctransid. They should always match.
@@ -5142,6 +5156,19 @@ static int full_send_tree(struct send_ctx *sctx)
 		goto out_finish;
 
 	while (1) {
+		/*
+		 * When someone wants to commit while we iterate, end the
+		 * joined transaction and rejoin.
+		 */
+		if (btrfs_should_end_transaction(trans, send_root)) {
+			ret = btrfs_end_transaction(trans, send_root);
+			trans = NULL;
+			if (ret < 0)
+				goto out;
+			btrfs_release_path(path);
+			goto join_trans;
+		}
+
 		eb = path->nodes[0];
 		slot = path->slots[0];
 		btrfs_item_key_to_cpu(eb, &found_key, slot);
@@ -5169,6 +5196,12 @@ out_finish:
 
 out:
 	btrfs_free_path(path);
+	if (trans) {
+		if (!ret)
+			ret = btrfs_end_transaction(trans, send_root);
+		else
+			btrfs_end_transaction(trans, send_root);
+	}
 	return ret;
 }
-- 
1.8.4
[PATCH] Btrfs: fix data corruption when reading/updating compressed extents
When using a mix of compressed file extents and prealloc extents, it is
possible to fill a page of a file with random, garbage data from some
unrelated previous use of the page, instead of a sequence of zeroes.

A simple sequence of steps to get into such a case, taken from the test
case I made for xfstests, is:

    _scratch_mkfs
    _scratch_mount "-o compress-force=lzo"
    $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
    $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
    $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

This results in the following file items in the fs tree:

    item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
        inode generation 6 transid 6 size 542872 block group 0 mode 100600
    item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
        inode ref index 2 namelen 6 name: foobar
    item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
        extent data disk byte 0 nr 0 gen 6
        extent data offset 0 nr 24576 ram 266240
        extent compression 0
    item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
        prealloc data disk byte 12849152 nr 241664 gen 6
        prealloc data offset 0 nr 241664
    item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
        extent data disk byte 12845056 nr 4096 gen 6
        extent data offset 0 nr 20480 ram 20480
        extent compression 2
    item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
        prealloc data disk byte 13090816 nr 405504 gen 6
        prealloc data offset 0 nr 258048

The on-disk extent at offset 266240 (which corresponds to a single disk
block) contains 5 compressed chunks of file data. Each of the first 4
compresses 4096 bytes of file data, while the last one compresses only
3024 bytes of file data. Therefore a read into the file region
[285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always
return zeroes (our next extent is a prealloc one).

The solution here is to have the compression code path zero the
remaining (untouched) bytes of the last page it uncompressed data into,
as the information about how much space the file data consumes in the
last page is not known in the upper layer
fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly
zeroing the remainder of the page, but only if it corresponds to the
last page of the inode and only if the inode's size is not a multiple of
the page size.

This would cause not only returning random data on reads, but also
permanently storing random data when updating parts of the region that
should be zeroed. For the example above, it means that updating a single
byte in the region [285648 ; 286720[ would store that byte correctly but
would also store random data on disk.

A test case for xfstests follows soon.

Signed-off-by: Filipe David Borba Manana
---
 fs/btrfs/compression.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index af815eb..ed1ff1cb 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1011,6 +1011,8 @@ int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
 		bytes = min(bytes, working_bytes);
 		kaddr = kmap_atomic(page_out);
 		memcpy(kaddr + *pg_offset, buf + buf_offset, bytes);
+		if (*pg_index == (vcnt - 1) && *pg_offset == 0)
+			memset(kaddr + bytes, 0, PAGE_CACHE_SIZE - bytes);
 		kunmap_atomic(kaddr);
 		flush_dcache_page(page_out);
 
--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
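A quick way to check the fix by hand (a sketch, not part of the patch; it
assumes the reproducer above was run with $SCRATCH_MNT mounted at
/mnt/scratch) is to read back the region that must be zeroes:

    # Dump the 1072-byte region [285648 ; 286720[ ; every byte should be 0x00.
    $ xfs_io -c "pread -v 285648 1072" /mnt/scratch/foobar | head -n 4

    # Or compare the region against a zero-filled reference:
    $ dd if=/mnt/scratch/foobar bs=1 skip=285648 count=1072 2>/dev/null \
          | cmp - <(head -c 1072 /dev/zero) && echo "region is all zeroes"

Before the fix, the first command shows whatever garbage was left in the
page; after it, only zeroes.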
Re: Provide a better free space estimate on RAID1
On 02/07/2014 05:40 AM, Roman Mamedov wrote:
> On Thu, 06 Feb 2014 20:54:19 +0100
> Goffredo Baroncelli wrote:
> [...]

Even if I am not entirely convinced, I updated Roman's PoC to take all
the RAID levels into account. I performed some tests with seven 48.8GB
disks. Here are my "df" results:

Profile: single
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  512K  340G   1% /mnt/btrfs1

Profile: raid1
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  1.3M  147G   1% /mnt/btrfs1

Profile: raid10
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  2.3M  102G   1% /mnt/btrfs1

Profile: raid5
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  2.0M  291G   1% /mnt/btrfs1

Profile: raid6
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  1.8M  243G   1% /mnt/btrfs1

Note that RAID1 can only use six disks, and RAID10 only four, but I think
that is due to a previous bug. Mixed mode (data and metadata in the same
chunk) is still unsupported by my patch.

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d71a11d..e5c58b3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1485,6 +1485,12 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+		min_stripes = 3;
+		num_stripes = nr_devices;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+		min_stripes = 4;
+		num_stripes = nr_devices;
 	}
 
 	if (type & BTRFS_BLOCK_GROUP_DUP)
@@ -1561,8 +1567,30 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 		if (devices_info[i].max_avail >= min_stripe_size) {
 			int j;
 			u64 alloc_size;
+			int k;
 
-			avail_space += devices_info[i].max_avail * num_stripes;
+			/*
+			 * Depending on the RAID profile, we use some
+			 * disk space as redundancy:
+			 * RAID1, RAID10, DUP -> half of space used as redundancy
+			 * RAID5 -> 1 stripe used as redundancy
+			 * RAID6 -> 2 stripes used as redundancy
+			 * RAID0, LINEAR -> no redundancy
+			 */
+			if (type & BTRFS_BLOCK_GROUP_RAID1) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_DUP) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+				k = num_stripes - 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+				k = num_stripes - 2;
+			} else { /* RAID0/LINEAR */
+				k = num_stripes;
+			}
+			avail_space += devices_info[i].max_avail * k;
 			alloc_size = devices_info[i].max_avail;
 			for (j = i + 1 - num_stripes; j <= i; j++)
 				devices_info[j].max_avail -= alloc_size;

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
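The redundancy factors in the patch above are easy to sanity-check from
userspace. Here is a tiny sketch (a hypothetical helper, not part of the
patch) that mirrors the same arithmetic for N equally sized devices:

    # Print the fraction of raw space usable for data, per profile.
    usable_ratio() {        # usage: usable_ratio <profile> <num_devices>
            case "$1" in
                    raid1|raid10|dup) echo "$(($2 / 2))/$2" ;;  # half is mirrored
                    raid5)            echo "$(($2 - 1))/$2" ;;  # 1 stripe of parity
                    raid6)            echo "$(($2 - 2))/$2" ;;  # 2 stripes of parity
                    *)                echo "$2/$2" ;;           # single/raid0/linear
            esac
    }

    $ usable_ratio raid5 7    # -> 6/7, i.e. about 293G of the 342G above,
                              #    close to the 291G Avail that df reports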
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> If you disable CONFIG_BTRFS_FS_RUN_SANITY_TESTS, does it still crash?

Good idea! I've queued test jobs for that config. However, I'm sorry that
I'll be offline for the next 2 days, so please expect some delays.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
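For anyone wanting to run the same experiment, one way to flip that
option in an existing kernel build tree is sketched below (it assumes a
standard kernel source checkout with a .config already in place):

    # Turn off the btrfs sanity tests, refresh the config, rebuild.
    scripts/config --disable BTRFS_FS_RUN_SANITY_TESTS
    make olddefconfig
    make -j"$(nproc)"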
Re: Provide a better free space estimate on RAID1
On Sat, Feb 08, 2014 at 05:33:10PM +0600, Roman Mamedov wrote:
> On Fri, 07 Feb 2014 21:32:42 +0100
> Kai Krakow wrote:
>
> > It should show the raw space available. Btrfs also supports
> > compression and doesn't try to be smart about how much compressed data
> > would fit in the free space of the drive. If one is using RAID1, it's
> > supposed to fill up with a rate of 2:1. If one is using compression,
> > it's supposed to fill up with a rate of maybe 1:5 for mostly text
> > files.
>
> Imagine a small business with some 30-40 employees. There is a piece of
> paper near the door at the office so that everyone sees it when entering
> or leaving, which says:
>
> "Dear employees,
>
> Please keep in mind that on the fileserver '\\DepartmentC', in the
> directory '\PublicStorage7' the free space you see as being available
> needs to be divided by two; On the server '\\DepartmentD', in
> '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more
> details please contact the IT operations team. Further assistance will
> be provided at the monthly training seminar.
>
> Regards,
> John S, CTO."

In my experience, nobody who uses a shared filesystem *ever* looks at the
amount of free space on it, until it fills up, at which point they may
look at the free space and see "0". Or most likely, they'll be alerted to
the issue by an email from the systems people saying, "please will
everyone delete unnecessary files from the shared drive, because it's
full up."

Having a more accurate estimate of the free space is a laudable aim, and
in principle I agree with attempts to do it, but I think the argument
above isn't exactly a strong one in practice. Even in the current code
with only one RAID setting available for data, if you have parity RAID,
you've got to look at the number of drives with available free space to
make an estimate of available space.

I think your best bet, ultimately, is to write code to give either a
pessimistic (lower bound) or optimistic (upper bound) estimate of
available space based on the profiles in use and the current distribution
of free/unallocated space, and stick with that. I think I'd prefer to see
a pessimistic bound, although that could break anything like an installer
that attempts to see how much free space there is before proceeding.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
         --- This year, I'm giving up Lent. ---

signature.asc
Description: Digital signature
Re: Provide a better free space estimate on RAID1
On Fri, 07 Feb 2014 21:32:42 +0100
Kai Krakow wrote:

> It should show the raw space available. Btrfs also supports compression
> and doesn't try to be smart about how much compressed data would fit in
> the free space of the drive. If one is using RAID1, it's supposed to
> fill up with a rate of 2:1. If one is using compression, it's supposed
> to fill up with a rate of maybe 1:5 for mostly text files.

Imagine a small business with some 30-40 employees. There is a piece of
paper near the door at the office so that everyone sees it when entering
or leaving, which says:

"Dear employees,

Please keep in mind that on the fileserver '\\DepartmentC', in the
directory '\PublicStorage7' the free space you see as being available
needs to be divided by two; On the server '\\DepartmentD', in
'\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more
details please contact the IT operations team. Further assistance will be
provided at the monthly training seminar.

Regards,
John S, CTO."

-- 
With respect,
Roman

signature.asc
Description: PGP signature
Re: Provide a better free space estimate on RAID1
On Fri, 7 Feb 2014 12:08:12 +0600
Roman Mamedov wrote:

> > Earlier conventions would have stated Size ~900GB, and Avail ~900GB.
> > But that's not exactly true either, is it?
>
> Much better, and matching the user expectations of how RAID1 should
> behave, without a major "gotcha" blowing up into their face the first
> minute they are trying it out. In fact the next step I planned was
> finding how to adjust Size and Used as well on all my machines to show
> what you just mentioned.

OK, done; again, this is just what I will personally use from now on (and
for anyone who finds this helpful).

--- fs/btrfs/super.c.orig	2014-02-06 01:28:36.636164982 +0600
+++ fs/btrfs/super.c	2014-02-08 17:16:50.361931959 +0600
@@ -1481,6 +1481,11 @@
 	}
 	kfree(devices_info);
 
+	if (type & BTRFS_BLOCK_GROUP_RAID1) {
+		do_div(avail_space, min_stripes);
+	}
+
 	*free_bytes = avail_space;
 	return 0;
 }
@@ -1491,8 +1496,10 @@
 	struct btrfs_super_block *disk_super = fs_info->super_copy;
 	struct list_head *head = &fs_info->space_info;
 	struct btrfs_space_info *found;
+	u64 total_size;
 	u64 total_used = 0;
 	u64 total_free_data = 0;
+	u64 type;
 	int bits = dentry->d_sb->s_blocksize_bits;
 	__be32 *fsid = (__be32 *)fs_info->fsid;
 	int ret;
@@ -1512,7 +1519,13 @@
 	rcu_read_unlock();
 
 	buf->f_namelen = BTRFS_NAME_LEN;
-	buf->f_blocks = btrfs_super_total_bytes(disk_super) >> bits;
+	total_size = btrfs_super_total_bytes(disk_super);
+	type = btrfs_get_alloc_profile(fs_info->tree_root, 1);
+	if (type & BTRFS_BLOCK_GROUP_RAID1) {
+		do_div(total_size, 2);
+		do_div(total_used, 2);
+	}
+	buf->f_blocks = total_size >> bits;
 	buf->f_bfree = buf->f_blocks - (total_used >> bits);
 	buf->f_bsize = dentry->d_sb->s_blocksize;
 	buf->f_type = BTRFS_SUPER_MAGIC;

2x1TB RAID1 with a 1GB file:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       912G  1.1G  911G   1% /mnt/p2

-- 
With respect,
Roman

signature.asc
Description: PGP signature
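To see the raw statfs() numbers behind df when trying out a change like
the one above, GNU stat can print the fields directly; a small sketch,
assuming Roman's mount point:

    $ stat -f --format='blocks=%b free=%f bsize=%S' /mnt/p2
    # With the patch, blocks * bsize on a 2x1TB RAID1 should come out to
    # roughly half of the raw capacity shown by 'btrfs fi show'.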
Re: lost with degraded RAID1
Ok, I did nuke it now and created the fs again using a 3.12 kernel. So
far so good, it runs fine.

Finally, I know it's kind of offtopic, but can someone help me interpret
this (I think this is the error in the SMART log which started the whole
mess)?

Error 1 occurred at disk power-on lifetime: 2576 hours (107 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 00 ff ff ff 0f  Device Fault; Error: ABRT at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00   5d+04:53:11.169  WRITE FPDMA QUEUED
  61 00 08 80 18 00 40 00   5d+04:52:45.129  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   5d+04:52:44.701  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   5d+04:52:44.700  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   5d+04:52:44.679  WRITE FPDMA QUEUED

2014-02-07 Chris Murphy :
>
> On Feb 7, 2014, at 4:34 AM, Johan Kröckel wrote:
>
>> Is there anything else I should do with this setup or may I nuke the
>> two partitions and reuse them?
>
> Well I'm pretty sure once you run 'btrfs check --repair' that you've hit
> the end of the road. Possibly btrfs restore can still extract some
> files; it might be worth testing whether that works.
>
> Otherwise blow it away. I'd say test with 3.14-rc2 with a new file
> system and see if you can reproduce the sequence that caused this
> problem in the first place. If it's reproducible, I think there's a bug
> here somewhere.
>
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
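Not a diagnosis, but the usual next steps with smartmontools would look
something like the following, with /dev/sdX standing in for the affected
drive:

    smartctl -l error /dev/sdX      # full ATA error log (the excerpt above)
    smartctl -l selftest /dev/sdX   # results of previous self-tests
    smartctl -t long /dev/sdX       # start a long (surface) self-test
    # Repeated ABRT/UNC entries or a growing Reallocated_Sector_Ct would
    # point at the drive itself rather than at btrfs.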
Re: Are nocow files snapshot-aware
Duncan <1i5t5.dun...@cox.net> wrote:

[...]

Difficult to twist your mind around that, but well explained. ;-)

> A snapshot thus looks much like a crash in terms of NOCOW file integrity,
> since the blocks of a NOCOW file are simply snapshotted in-place, and
> there's already no checksumming or file integrity verification on such
> files -- they're simply directly written in-place (with the exception of
> a single COW write when a writable snapshotted NOCOW file diverges from
> the shared snapshot version).
>
> But as I said, the applications themselves are normally designed to
> handle and recover from crashes, and in fact, having btrfs try to manage
> it too only complicates things and can actually make it impossible for
> the app to recover what it would have otherwise recovered just fine.
>
> So it should be with these NOCOW in-place snapshotted files, too. If a
> NOCOW file is put back into operation from a snapshot, and the file was
> being written to at snapshot time, it'll very likely trigger exactly the
> same response from the application as a crash while writing would have
> triggered, but, the point is, such applications are normally designed to
> deal with just that, and thus, they should recover just as they would
> from a crash. If they could recover from a crash, it shouldn't be an
> issue. If they couldn't, well...

So we agree that taking a snapshot looks like a crash from the
application's perspective. That means that if there are facilities to
instruct the application to suspend its operations first, you should use
them - like in the InnoDB case
(http://dev.mysql.com/doc/refman/5.1/en/lock-tables.html):

| FLUSH TABLES WITH READ LOCK;
| SHOW MASTER STATUS;
| SYSTEM xfs_freeze -f /var/lib/mysql;
| SYSTEM YOUR_SCRIPT_TO_CREATE_SNAPSHOT.sh;
| SYSTEM xfs_freeze -u /var/lib/mysql;
| UNLOCK TABLES;
| EXIT;

Only that way do you get consistent snapshots and avoid triggering
crash-recovery (which might otherwise throw away unrecoverable
transactions or otherwise harm your data for the sake of consistency).

InnoDB is more or less like a vm filesystem image on btrfs in this case,
so the same approach should be taken for vm images if possible. I think
VMware has facilities to prepare the guest for a snapshot being taken (it
is triggered when you take snapshots with VMware itself, and by the way
it usually takes much longer than btrfs snapshots do).

Take xfs as an example: although it is crash-safe, it prefers to zero out
your files for security reasons during log-replay, because it is
crash-safe only for meta-data: if meta-data has already allocated blocks
but file-data has not yet been written, a recovered file could otherwise
end up with wrong content, so it is cleared out. This _IS_NOT_ the
situation you want for vm images with xfs inside, hosted on btrfs, when
taking a snapshot. You should trigger xfs_freeze in the guest before
taking the btrfs snapshot in the host. I think the same holds true for
most other meta-data-only-journalling file systems, which probably do not
even zero out files during recovery and just silently corrupt them during
crash-recovery.

So in case of a crash or snapshot (which look the same from the
application perspective), btrfs' capabilities won't help you here - at
least in the nocow case, and probably in the cow case too, because the vm
guest may write blocks out-of-order without having the possibility to
pass write-barriers down to the btrfs cow mechanism.
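A minimal sketch of that host-side sequence (every name and path here is
an example only; it assumes the guest is reachable over ssh, keeps its
data on a filesystem mounted at /data, and has util-linux's fsfreeze):

    #!/bin/bash
    # Quiesce the guest fs, snapshot the image subvolume, thaw the guest.
    set -e
    GUEST=root@vmguest                 # hypothetical guest address
    SRC=/srv/vm-images                 # subvolume holding the image
    SNAP=/srv/snapshots/vm-images-$(date +%Y%m%d-%H%M%S)
    ssh "$GUEST" fsfreeze -f /data     # block writes inside the guest
    trap 'ssh "$GUEST" fsfreeze -u /data' EXIT   # always thaw, even on error
    btrfs subvolume snapshot -r "$SRC" "$SNAP"   # readonly host-side snapshot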
Taking snapshots of database files or vm images without proper
preparation only guarantees you crash-like rollback situations. Taking
snapshots at short intervals only makes this worse, with all the extra
downsides such snapshots have within btrfs. I think this is important to
understand for people planning automated snapshots of such file data.
Making a file nocow only helps during normal operation - after a
snapshot, a nocow file is essentially cow while blocks from the old
generation are carried over to the new subvolume generation during
writes.

-- 
Replies to list only preferred.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xfstests: Btrfs: add test for large metadata blocks
Thanks for the review, Dave! Comments inline.

On 02/07/2014 11:49 PM, Dave Chinner wrote:
> On Fri, Feb 07, 2014 at 06:14:45PM +0100, Koen De Wit wrote:
>> Tests Btrfs filesystems with all possible metadata block sizes, by
>> setting large extended attributes on files.
>>
>> Signed-off-by: Koen De Wit
>
> There's a few things here that need fixing.
>
>> +pagesize=`$here/src/feature -s`
>> +pagesize_kb=`expr $pagesize / 1024`
>> +
>> +# Test all valid leafsizes
>> +for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do
>> +    _scratch_unmount >/dev/null 2>&1
>
> Indentation is tabs, and tabs are 8 spaces in size, please.

OK, I fixed this in v2.

>> +    _scratch_mkfs -l ${leafsize}K >/dev/null
>> +    _scratch_mount
>
> No need to use _scratch_unmount here - you should be doing a
> _check_scratch_fs at the end of the loop.

Fixed in v2 too.

>> +    # Calculate the xattr size, but leave 512 bytes for other metadata.
>> +    xattr_size=`expr $leafsize \* 1024 - 512`
>> +
>> +    touch $SCRATCH_MNT/emptyfile
>> +    # smallfile will be inlined, bigfile not.
>> +    $XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile >/dev/null
>> +    $XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile >/dev/null
>> +    ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink
>> +
>> +    files=(emptyfile smallfile bigfile bigfile_softlink)
>> +    chars=(a b c d)
>> +    for i in `seq 0 1 3`; do
>> +        char=${chars[$i]}
>> +        file=$SCRATCH_MNT/${files[$i]}
>> +        lnkfile=${file}_hardlink
>> +        ln $file $lnkfile
>> +        xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' $char`
>> +
>> +        set_md5=`echo -n "$xattr_value" | md5sum`
>
> Just dump the md5sum to the output file.
>
>> +        ${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
>> +        get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum`
>> +        get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum`
>
> And dump these to the output file, too. Then the golden image matching
> when the test finishes will tell you if it passed or not. i.e:
>
>	echo -n "$xattr_value" | md5sum
>	${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
>	${ATTR_PROG} -Lq -g attr_$char $file | md5sum
>	${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum
>
> is all that needs to be done here.

The problem with this is that the length of the output will depend on the
page size. The code above runs for every valid leafsize, which can be any
multiple of the page size up to 64KB, as defined in the loop
initialization:

    for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do

>> +    # Test attributes with a size larger than the leafsize.
>> +    # Should result in an error.
>> +    if [ "$leafsize" -lt "64" ]; then
>> +        # Bash command lines cannot be larger than 64K characters, so we
>> +        # do not test attribute values with a size >64KB.
>> +        xattr_size=`expr $leafsize \* 1024 + 512`
>> +        xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x`
>> +        ${ATTR_PROG} -q -s attr_toobig -V $xattr_value \
>> +            $SCRATCH_MNT/emptyfile >> $seqres.full 2>&1
>> +        if [ "$?" -eq "0" ]; then
>> +            echo "Expected error, xattr_size is bigger than ${leafsize}K"
>> +        fi
>
> What you are doing is redirecting the error to $seqres.full so that it
> doesn't end up in the output file, then detecting the absence of an
> error and dumping a message to the output file to make the test fail.
> IOWs, the ATTR_PROG failure message should be in the golden output file
> and you don't have to do anything else to detect a pass/fail condition.

Same here: the bigger the page size, the less this code will be executed.
If the page size is 64KB, this code isn't executed at all. To make sure
the golden output does not depend on the page size, I chose to suppress
all output as long as the test is successful. Is there a better way to
accomplish this?
>> +_scratch_unmount
>> +
>> +# Some illegal leafsizes
>> +
>> +_scratch_mkfs -l 0 2>> $seqres.full
>> +echo $?
>
> Same again - you are dumping the error output into a different file,
> then detecting the error manually. Pass the output of _scratch_mkfs
> through a filter, and let errors cause golden output mismatches.

I did this to make the golden output not depend on the output of
mkfs.btrfs, inspired by
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=commit;h=fd7a8e885732475c17488e28b569ac1530c8eb59
and
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=commit;h=78d86b996c9c431542fdbac11fa08764b16ceb7d

However, in my opinion the test should simply be updated if the output of
mkfs.btrfs changes, so I agree with you and I fixed this in v2.

Thanks,
Koen.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
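For reference, the filter idiom Dave describes usually looks something
like the sketch below (the error string matched here is an assumption,
not the real mkfs.btrfs message):

    # Collapse whatever mkfs prints on failure into one stable token, so
    # the golden output does not depend on the mkfs.btrfs version.
    _filter_mkfs_error()
    {
            sed -e 's/^.*leafsize.*$/mkfs rejected illegal leafsize/'
    }
    _scratch_mkfs -l 0 2>&1 | _filter_mkfs_error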
[PATCH v2] xfstests: Btrfs: add test for large metadata blocks
Tests Btrfs filesystems with all possible metadata block sizes, by
setting large extended attributes on files.

Signed-off-by: Koen De Wit
---
v1->v2:
- Fix indentation: 8 spaces instead of 4
- Move _scratch_unmount to end of loop, add _check_scratch_fs
- Sending failure messages of mkfs.btrfs to output instead of $seqres.full

diff --git a/tests/btrfs/036 b/tests/btrfs/036
new file mode 100644
index 000..b14697d
--- /dev/null
+++ b/tests/btrfs/036
@@ -0,0 +1,137 @@
+#! /bin/bash
+# FS QA Test No. 036
+#
+# Tests large metadata blocks in btrfs, which allows large extended
+# attributes.
+#
+#---
+# Copyright (c) 2014, Oracle and/or its affiliates. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1	# failure is the default!
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+rm -f $seqres.full
+
+pagesize=`$here/src/feature -s`
+pagesize_kb=`expr $pagesize / 1024`
+
+# Test all valid leafsizes
+for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do
+        _scratch_mkfs -l ${leafsize}K >/dev/null
+        _scratch_mount
+        # Calculate the size of the extended attribute value, leaving
+        # 512 bytes for other metadata.
+        xattr_size=`expr $leafsize \* 1024 - 512`
+
+        touch $SCRATCH_MNT/emptyfile
+        # smallfile will be inlined, bigfile not.
+        $XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile \
+                >/dev/null
+        $XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile \
+                >/dev/null
+        ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink
+
+        files=(emptyfile smallfile bigfile bigfile_softlink)
+        chars=(a b c d)
+        for i in `seq 0 1 3`; do
+                char=${chars[$i]}
+                file=$SCRATCH_MNT/${files[$i]}
+                lnkfile=${file}_hardlink
+                ln $file $lnkfile
+                xattr_value=`head -c $xattr_size < /dev/zero \
+                        | tr '\0' $char`
+
+                set_md5=`echo -n "$xattr_value" | md5sum`
+                ${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
+                get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum`
+                get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile \
+                        | md5sum`
+
+                # Using md5sums for comparison instead of the values
+                # themselves because bash command lines cannot be larger
+                # than 64K chars.
+                if [ "$set_md5" != "$get_md5" ]; then
+                        echo -n "Got unexpected xattr value for "
+                        echo -n "attr_$char from file ${file}. "
+                        echo "(leafsize is ${leafsize}K)"
+                fi
+                if [ "$set_md5" != "$get_ln_md5" ]; then
+                        echo -n "Value for attr_$char differs for "
+                        echo -n "$file and ${lnkfile}. "
+                        echo "(leafsize is ${leafsize}K)"
+                fi
+        done
+
+        # Test attributes with a size larger than the leafsize.
+        # Should result in an error.
+        if [ "$leafsize" -lt "64" ]; then
+                # Bash command lines cannot be larger than 64K
+                # characters, so we do not test attribute values
+                # with a size >64KB.
+                xattr_size=`expr $leafsize \* 1024 + 512`
+                xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x`
+                ${ATTR_PROG} -q -s attr_toobig -V $xattr_value \
+                        $SCRATCH_MNT/emptyfile >> $seqres.full 2>&1
+                if [ "$?" -eq "0" ]; then
+                        echo -n "Expected error, xattr_size is bigger "
+                        echo "than ${leafsize}K"
+                fi
+        fi
+
+        _scratch_unmount >/dev/null 2>&1
+        _check_scratch_fs
+done
+
+_scratch_mount
+
+# Illegal attribute name (more than 256 characters)
+attr_name=`head -c 260 < /dev/zero | tr '\0' n`
+${ATTR_PRO