Re: Provide a better free space estimate on RAID1
Roman Mamedov posted on Sun, 09 Feb 2014 04:10:50 +0600 as excerpted:

> If you need to perform a btrfs-specific operation, you can easily use
> the btrfs-specific tools to prepare for it, specifically use "btrfs fi
> df", which could provide every imaginable interpretation of free space
> estimate and then some.
>
> UNIX 'df' and the 'statfs' call on the other hand should keep the
> behavior people are accustomed to rely on since the 1970s.

Which it does... on filesystems that only have 1970s filesystem features. =:^)

RAID and multi-device filesystems aren't 1970s features; they break 1970s behavior and the assumptions associated with it. If you're not prepared to deal with those broken assumptions, don't. Use mdraid or dmraid or lvm or whatever to combine your multiple devices into one logical device, and put your filesystem (either a traditional filesystem, or even btrfs using traditional single-device functionality) on top of the single device the layer beneath the filesystem presents. Problem solved! =:^)

Note that df only lists a single device as well, not the multiple component devices of the filesystem. That's broken functionality by your definition too, and again, using some other layer like lvm or mdraid to present multiple devices as a single virtual device, with a traditional single-device filesystem layout on top of that single device... solves the problem! =:^)

Meanwhile, what I've done here is use one of df's command-line options to set its block size to 2 MiB, and further used bash's alias functionality to set up an alias accordingly:

alias df='df -B2M'

$ df /h
Filesystem     2M-blocks  Used Available Use% Mounted on
/dev/sda6          20480 12186      7909  61% /h

$ sudo btrfs fi show /h
Label: hm0238gcnx+35l0  uuid: ce23242a-b0a9-423f-a9c3-7db2729f48d6
        Total devices 2 FS bytes used 11.90GiB
        devid 1 size 20.00GiB used 14.78GiB path /dev/sda6
        devid 2 size 20.00GiB used 14.78GiB path /dev/sdb6

$ sudo btrfs fi df /h
Data, RAID1: total=14.00GiB, used=11.49GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=768.00MiB, used=414.94MiB

On btrfs such as the above I can read the 2M blocks as 1M and be happy. On btrfs such as my /boot, which aren't raid1 (I have two separate /boots, one on each device, with grub2 configured separately for each to provide a backup), or if I df my media partitions still on reiserfs on the old spinning rust, I can either double the figures df gives me, or add a second -B option at the CLI, overriding the aliased option.

If I wanted something fully automated, it'd be easy enough to set up a script that checked what filesystem I was df-ing, matched that against a table of filesystems to preferred df block sizes, and supplied the appropriate -BxX option accordingly (a sketch follows this message). As I guess most admins do after a few years, I've developed quite a library of scripts/aliases for the various things I do routinely enough to warrant it, and this would be just one more joining the list. =:^)

But of course it's your system in question, and you can patch btrfs to output anything you like, in any format you like. No need to bother with df's -B option if you'd prefer to patch the kernel instead. Me, I'll stick to the -B option. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
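The fully-automated wrapper described above could be as small as the following bash sketch. The filesystem-to-block-size table is illustrative only, and the 2M entry assumes the two-device raid1 layout shown in the message:

#!/bin/bash
# Hypothetical df wrapper: pick a -B block size per filesystem type so
# the printed numbers line up with what the profile actually lets you
# store. The type-to-size table is an example, not a recommendation.
target=${1:-.}
fstype=$(stat -f --format=%T "$target")
case "$fstype" in
    btrfs)    bs=2M ;;  # two-device raid1: read the 2M blocks as 1M
    reiserfs) bs=1K ;;  # plain single-device fs: ordinary 1 KiB blocks
    *)        bs=1K ;;
esac
exec df -B"$bs" "$target"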
Re: lost with degraded RAID1
Johan Kröckel posted on Sat, 08 Feb 2014 12:09:46 +0100 as excerpted:

> Ok, I did nuke it now and created the fs again using a 3.12 kernel. So
> far so good. Runs fine.
> Finally, I know it's kind of off-topic, but can someone help me
> interpret this (I think this is the error in the smart-log which
> started the whole mess)?
>
> Error 1 occurred at disk power-on lifetime: 2576 hours (107 days + 8 hours)
>   When the command that caused the error occurred, the device was
>   active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   04 71 00 ff ff ff 0f
>   Device Fault; Error: ABRT at LBA = 0x0fffffff = 268435455

I'm no SMART expert, but that LBA number is incredibly suspicious. With standard 512-byte sectors that's the 128 GiB boundary, the old 28-bit LBA limit (LBA28, introduced with ATA-1 in 1994; modern drives are LBA48, introduced in 2003 with ATA-6 and offering an addressing capacity of 128 PiB, according to Wikipedia's article on LBA).

It looks like something flipped back to LBA28, and when a continuing operation happened to write past that value... it triggered the abort you see in the SMART log.

Double-check your BIOS to be sure it didn't somehow revert to the old LBA28 compatibility mode or some such, and the drives, to make sure they aren't "clipped" to LBA28 compatibility mode as well.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
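For reference, the suspicious number falls straight out of the LBA28 arithmetic, and a drive clipped by a Host Protected Area can be checked from a shell. A sketch; /dev/sda is an example device, and hdparm must run as root:

# 2^28 sectors of 512 bytes = 128 GiB; the last addressable sector is
# 2^28 - 1 = 268435455 = 0x0fffffff, exactly the LBA in the SMART log.
echo $(( (1 << 28) * 512 / 1024 / 1024 / 1024 ))  # prints 128 (GiB)
printf '0x%08x\n' $(( (1 << 28) - 1 ))            # prints 0x0fffffff

# Visible vs native sector counts should match on an unclipped drive:
hdparm -N /dev/sda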
Bedup bug report
kernel 3.12.7, python 2.7.6-5, debian testing/unstable, bedup installed as per
pip install --user bedup

I tried installing the git version, but the error is the same. Anyway, with
the other bedup, I get:

gargamel:/mnt/dshelf2/backup# bedup show
Traceback (most recent call last):
  File "/usr/local/bin/bedup", line 9, in <module>
    load_entry_point('bedup==0.9.0', 'console_scripts', 'bedup')()
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 483, in script_main
    sys.exit(main(sys.argv))
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 472, in main
    return args.action(args)
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 70, in cmd_show_vols
    sess = get_session(args)
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 105, in get_session
    upgrade_schema(engine, database_exists)
  File "/root/.local/lib/python2.7/site-packages/bedup/migrations.py", line 38, in upgrade_schema
    context = MigrationContext.configure(engine.connect())
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1678, in connect
    return self._connection_cls(self, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 59, in __init__
    self.__connection = connection or engine.raw_connection()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1747, in raw_connection
    return self.pool.unique_connection()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 272, in unique_connection
    return _ConnectionFairy._checkout(self)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 608, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 425, in checkout
    rec = pool._do_get()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 838, in _do_get
    c = self._create_connection()
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 277, in _create_connection
    return _ConnectionRecord(self)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 402, in __init__
    pool.dispatch.connect(self.connection, self)
  File "/root/.local/lib/python2.7/site-packages/sqlalchemy/event/attr.py", line 247, in __call__
    fn(*args, **kw)
  File "/root/.local/lib/python2.7/site-packages/bedup/__main__.py", line 89, in sql_setup
    assert val == ('wal',), val
AssertionError: (u'delete',)

gargamel:/mnt/dshelf2/backup# bedup dedup --db-path dedup --defrag .
also fails the same way. The last thing that happens is:

open("/usr/lib/python2.7/lib-tk/_sqlite3module.so", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.7/lib-tk/_sqlite3.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.7/lib-tk/_sqlite3.pyc", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/python2.7/lib-dynload/_sqlite3", 0xff921890) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.7/lib-dynload/_sqlite3.i386-linux-gnu.so", O_RDONLY|O_LARGEFILE) = 5
open("/usr/lib/python2.7/lib-dynload/_sqlite3.i386-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 6
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 6
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/libsqlite3.so.0", O_RDONLY|O_CLOEXEC) = 6
getcwd("/mnt/dshelf2/backup", 1024) = 20
stat64("/mnt/dshelf2/backup/dedup", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
open("/mnt/dshelf2/backup/dedup", O_RDWR|O_CREAT|O_LARGEFILE, 0644) = 3
Traceback (most recent call last):
  File "/usr/local/bin/bedup", line 9, in <module>
open("/usr/local/bin/bedup", O_RDONLY|O_LARGEFILE) = 4
    load_entry_point('bedup==0.9.0', 'console_scripts', 'bedup')()
  File "/root/.local/lib/python2.7/site-packages/bedup-0.9.0-py2.7-linux-x86_64.egg/bedup/__main__.py", line 487, in script_main
(...)

Any suggestions?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
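The assertion at the top means bedup issued SQLite's "PRAGMA journal_mode=wal" and got back 'delete', i.e. WAL journaling could not be enabled on that database file. A hedged way to reproduce this outside bedup, using the same db file the strace shows being opened:

# If SQLite cannot switch the file to write-ahead logging, it answers
# with the journal mode actually in effect ("delete") -- the exact
# condition bedup's assertion trips on.
sqlite3 /mnt/dshelf2/backup/dedup 'PRAGMA journal_mode=wal;'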
Re: [RFC PATCH 2/2] Revert "Btrfs: remove transaction from btrfs send"
2014-02-08 23:46 GMT+08:00 Wang Shilong :
> From: Wang Shilong
>
> This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2.
> Previously I was thinking we could use a readonly root's commit root
> safely, but that is not true: a readonly root may be cowed in the
> following cases.
>
> 1. snapshot: the send root will cow the source root.
> 2. balance and device operations will also cow a readonly send root
>    to relocate.
>
> So I have two ideas to make it safe for us to use the commit root.
>
> --> approach 1:
> make it protected by a transaction, end the transaction properly, and
> re-search the next item from the root node (see
> btrfs_search_slot_for_read()).
>
> --> approach 2:
> add another counter to the local root structure to sync snapshot with
> send, and add a global counter to sync send with exclusive device
> operations.
>
> With approach 2, send can use the commit root safely, because we make
> sure the send root cannot be cowed during send. Unfortunately, it makes
> the code *ugly* and more complex to maintain.
>
> Making snapshot and send exclusive, and device operations and send
> exclusive with each other, is a little confusing for common users.
>
> So why not go back to the previous way.
>
> Cc: Josef Bacik
> Signed-off-by: Wang Shilong
> ---
> Josef, if we reach agreement to adopt this approach, please revert
> Filipe's patch (Btrfs: make some tree searches in send.c more
> efficient) from btrfs-next.

Oops -- since this patch guarantees that all searches of commit roots are
protected by a transaction, Filipe's patch is actually OK; it is Josef's
previous patch that we need to update.

Wang
Re: Provide a better free space estimate on RAID1
On Feb 8, 2014, at 7:21 PM, Chris Murphy wrote:

> we don't have a top level switch for variable raid on a volume yet

This isn't good wording. We don't have a controllable way to set variable
raid levels. The interrupted-convert model I'd consider not controllable.

Chris Murphy
Re: Provide a better free space estimate on RAID1
On Feb 8, 2014, at 6:55 PM, Roman Mamedov wrote:

> Not sure what exactly becomes problematic if a 2-device RAID1 tells the
> user they can store 1 TB of their data on it, and is no longer lying
> about the possibility to store 2 TB on it as currently.
>
> Two 1TB disks in RAID1.

OK, but while we don't have a top level switch for variable raid on a
volume yet, the on-disk format doesn't consider the device to be raid1 at
all. Neither the device, nor the volume, nor the subvolume has this
attribute. It's a function of the data, metadata or system chunk via their
profiles.

I can do a partial conversion on a volume, and could even do this multiple
times and end up with some chunks in every available option: some chunks
single, some raid1, some raid0, some raid5. All I have to do is cancel the
conversion before each conversion is complete, successively shortening the
time. And it's not fair to say this has no application because such
conversions take a long time. I might not want to do a full conversion all
at once. There's no requirement that I do so.

In any case I object to the language being used that implicitly indicates
the 'raidness' is a device or disk attribute.

Chris Murphy
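A sketch of the interrupted-convert scenario described above (the mount point /mnt is a placeholder); the result is chunks with mixed profiles, which is why 'raidness' is a per-chunk-type property rather than a device one:

# Start a conversion to raid1, cancel it midway, then inspect:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt &
sleep 60
btrfs balance cancel /mnt
btrfs fi df /mnt   # Data may now appear on both 'single' and 'RAID1' lines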
Re: Provide a better free space estimate on RAID1
On Sun, 09 Feb 2014 00:17:29 +0100
Kai Krakow wrote:

> "Dear employees,
>
> Please keep in mind that when you run out of space on the fileserver
> '\\DepartmentC', when you free up space in the directory
> '\PublicStorage7' the free space you gain on '\StorageArchive' is only
> one third of the amount you deleted, and in '\VideoFiles', you gain
> only one half.

But that's simply incorrect. Looking at my 2nd patch, which also changes
the total reported size and 'used' size, the 'total' space, 'used' space
and the space freed up as 'available' after file deletion will all match
up perfectly.

> The exercise of why is left to the reader...
>
> The proposed fix simply does not fix the problem. It simply shifts it,
> introducing the need for another fix somewhere else, which in turn
> probably also introduces another need for a fix, and so forth... This
> will become an endless effort of fixing and tuning.

Not sure what exactly becomes problematic if a 2-device RAID1 tells the
user they can store 1 TB of their data on it, and is no longer lying about
the possibility to store 2 TB on it as currently.

Two 1TB disks in RAID1.
Total space 1TB. Can store of my data: 1TB.
Wrote 100 GB of files? 100 GB used, 900 GB available, 1TB total.
Deleted 50 GB of those? 50 GB used, 950 GB available, 1TB total.

Can't see anything horribly broken about this behavior. For when you need
to "get to the bottom of things", as mentioned earlier there's always
'btrfs fi df'.

> Feel free to fix it but be prepared for the reincarnation of this
> problem when per-subvolume raid levels become introduced.

AFAIK no one has even begun to write any code to implement those yet.

-- 
With respect,
Roman
Re: Provide a better free space estimate on RAID1
On Sun, 09 Feb 2014 00:32:47 +0100
Kai Krakow wrote:

> When I started to use unix, df returned blocks, not bytes. Without your
> proposed patch, it does that right. With your patch, it does it wrong.

It returns total/used/available space that is usable/used/available
by/for user data. Whether that be in sectors, blocks, kilobytes,
megabytes or in some other unit is a secondary detail, which is also
unrelated to the change being currently discussed and not affected by it.

-- 
With respect,
Roman
Re: btrfsck does not fix
On Feb 8, 2014, at 3:01 PM, Hendrik Friedel wrote:

> Hello,
>
>> Ok.
>> I think I do/did have some symptoms, but I cannot exclude other
>> reasons...
>> - High load without high cpu usage (io was the bottleneck)
>> - Just now: transferring from one directory to the other on the same
>>   subvolume (from /mnt/subvol/A/B to /mnt/subvol/A) I get 1.2MB/s
>>   instead of > 60.
>> - For some of the files I even got a "no space left on device" error.
>>
>> This is without any messages in dmesg or syslog related to btrfs.
>
> As I don't see that I can fix this, I intend to re-create the
> filesystem. For that, I need to remove one of the two discs from the
> raid/filesystem, then create a new fs on this and move the data to it
> (I have no spare).
> Could you please advise me whether this will be successful?
>
> First some information on the filesystem:
>
> ./btrfs filesystem show /dev/sdb1
> Label: none  uuid: 989306aa-d291-4752-8477-0baf94f8c42f
>         Total devices 2 FS bytes used 3.47TiB
>         devid 1 size 2.73TiB used 1.74TiB path /dev/sdb1
>         devid 2 size 2.73TiB used 1.74TiB path /dev/sdc1

I don't understand the "no spare" part. You have 3.47T of data, and yet the
single device size is 2.73T. There is no way to migrate 1.74T from sdc1 to
sdb1 because there isn't enough space.

> /btrfs subvolume list /mnt/BTRFS/Video
> ID 256 gen 226429 top level 5 path Video
> ID 1495 gen 226141 top level 5 path rsnapshot
> ID gen 226429 top level 256 path Snapshot
> ID 5845 gen 226375 top level 5 path backups
>
> btrfs fi df /mnt/BTRFS/Video/
> Data, RAID0: total=3.48TB, used=3.47TB
> System, RAID1: total=32.00MB, used=260.00KB
> Metadata, RAID1: total=4.49GB, used=3.85GB
>
> What I did already yesterday was:
>
> btrfs device delete /dev/sdc1 /mnt/BTRFS/rsnapshot/
> btrfs device delete /dev/sdc1 /mnt/BTRFS/backups/
> btrfs device delete /dev/sdc1 /mnt/BTRFS/Video/
> btrfs filesystem balance start /mnt/BTRFS/Video/

I don't understand this sequence because I don't know what you've mounted
where, but in any case maybe it's a bug that you're not getting errors for
each of these commands, because you can't delete sdc1 from a raid0 volume.
You'd first have to convert the data, metadata, and system profiles to
single (metadata can be set to dup). And then you'd be able to delete a
device, so long as there's room on the remaining devices, which you don't
have. (A sketch of that conversion follows this message.)

> Next, I'm doing the balance for the subvolume /mnt/BTRFS/backups.

You told us above you deleted that subvolume. So how are you balancing it?
And also, balance applies to a mount point, and even if you mount a
subvolume at that mount point, the whole file system is balanced, not just
the mounted subvolume.

> In parallel, I try to delete /mnt/BTRFS/rsnapshot, but it fails:
> btrfs subvolume delete /mnt/BTRFS/rsnapshot/
> Delete subvolume '/mnt/BTRFS/rsnapshot'
> ERROR: cannot delete '/mnt/BTRFS/rsnapshot' - Inappropriate ioctl
> for device
>
> Why's that?
> But even more: how do I free sdc1 now?!

Well, I'm pretty confused, because again, I can't tell if your paths refer
to subvolumes or if they refer to mount points. The balance and device
delete commands all refer to a mount point, which is the path returned by
the df command. The subvolume delete command needs a path to a subvolume
that starts with the mount point.

Chris Murphy
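A sketch of the conversion described above, assuming /mnt/BTRFS/Video is the volume's mount point and -- contrary to this particular case -- that the surviving device has room for all the data. -f is needed because converting system chunks requires force:

btrfs balance start -dconvert=single -mconvert=dup -sconvert=dup -f /mnt/BTRFS/Video
# only once the conversion has completed:
btrfs device delete /dev/sdc1 /mnt/BTRFS/Video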
Re: Provide a better free space estimate on RAID1
Roman Mamedov schrieb:

> UNIX 'df' and the 'statfs' call on the other hand should keep the
> behavior people are accustomed to rely on since the 1970s.

When I started to use unix, df returned blocks, not bytes. Without your
proposed patch, it does that right. With your patch, it does it wrong.

-- 
Replies to list only preferred.
Re: Provide a better free space estimate on RAID1
cwillu schrieb:

> Everyone who has actually looked at what the statfs syscall returns
> and how df (and everyone else) uses it, keep talking. Everyone else,
> go read that source code first.
>
> There is _no_ combination of values you can return in statfs which
> will not be grossly misleading in some common scenario that someone
> cares about.

Thanks man! statfs returns free blocks. So let's stick with that. The df
command, as people try to understand it, is broken by design on btrfs. One
has to live with that. The df command as it has worked since 1970 returns
free blocks - and it does that perfectly fine on btrfs without that
proposed "fix".

User space should not try to be smart about how many blocks will be
written to the filesystem if it writes xyz bytes to it. It has been that
way since 1970 (or whatever), and it will be that way in the future. And a
good file-copying GUI should give you the choice of "I know better, copy
anyway" (like every other unix utility).

Your pointer says everything there is to say about it.

-- 
Replies to list only preferred.
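For anyone who wants to look at the raw statfs(2) numbers df consumes without reading source, GNU coreutils' stat can print them directly; /mnt is a placeholder mount point:

# %T fs type, %S fundamental block size, %b total blocks,
# %f free blocks, %a blocks available to unprivileged users
stat -f --format='type=%T bsize=%S blocks=%b bfree=%f bavail=%a' /mnt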
Re: Provide a better free space estimate on RAID1
Roman Mamedov schrieb:

>> It should show the raw space available. Btrfs also supports
>> compression and doesn't try to be smart about how much compressed data
>> would fit in the free space of the drive. If one is using RAID1, it's
>> supposed to fill up with a rate of 2:1. If one is using compression,
>> it's supposed to fill up with a rate of maybe 1:5 for mostly text
>> files.
>
> Imagine a small business with some 30-40 employees. There is a piece of
> paper near the door at the office so that everyone sees it when
> entering or leaving, which says:
>
> "Dear employees,
>
> Please keep in mind that on the fileserver '\\DepartmentC', in the
> directory '\PublicStorage7' the free space you see as being available
> needs to be divided by two; On the server '\\DepartmentD', in
> '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more
> details please contact the IT operations team. Further assistance will
> be provided at the monthly training seminar.

"Dear employees,

Please keep in mind that when you run out of space on the fileserver
'\\DepartmentC', and you free up space in the directory '\PublicStorage7',
the free space you gain on '\StorageArchive' is only one third of the
amount you deleted, and in '\VideoFiles', you gain only one half. For more
details please contact the IT operations team. Further assistance will be
provided at the monthly training seminar.

Regards,
John S, CTO."

The exercise of why is left to the reader...

The proposed fix simply does not fix the problem. It simply shifts it,
introducing the need for another fix somewhere else, which in turn
probably also introduces another need for a fix, and so forth... This will
become an endless effort of fixing and tuning. It simply does not work
because btrfs' design does not allow it. Feel free to fix it, but be
prepared for the reincarnation of this problem when per-subvolume raid
levels become introduced.

The problem has to be fixed in user space or with a new API call.

-- 
Replies to list only preferred.
Re: Provide a better free space estimate on RAID1
Everyone who has actually looked at what the statfs syscall returns and
how df (and everyone else) uses it, keep talking. Everyone else, go read
that source code first.

There is _no_ combination of values you can return in statfs which will
not be grossly misleading in some common scenario that someone cares
about.
[PATCH] btrfs: always choose work from prio_head first
In case we do not refill, we can overwrite the cur pointer taken from
prio_head with one from the non-prioritized head, which looks like
something that was not intended. This change makes us always take work
from prio_head first, until it is empty.

Signed-off-by: Stanislaw Gruszka
---
I found this by reading the code; I'm not sure if the change is correct.
The patch is only compile tested.

 fs/btrfs/async-thread.c | 9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index c1e0b0c..0b78bf2 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -262,18 +262,19 @@ static struct btrfs_work *get_next_work(struct btrfs_worker_thread *worker,
 	struct btrfs_work *work = NULL;
 	struct list_head *cur = NULL;
 
-	if (!list_empty(prio_head))
+	if (!list_empty(prio_head)) {
 		cur = prio_head->next;
+		goto out;
+	}
 
 	smp_mb();
 	if (!list_empty(&worker->prio_pending))
 		goto refill;
 
-	if (!list_empty(head))
+	if (!list_empty(head)) {
 		cur = head->next;
-
-	if (cur)
 		goto out;
+	}
 
 refill:
 	spin_lock_irq(&worker->lock);
-- 
1.7.4.4
Re: Provide a better free space estimate on RAID1
On Sat, 08 Feb 2014 22:35:40 +0100
Kai Krakow wrote:

> Imagine the future: Btrfs supports different RAID levels per subvolume.
> We need to figure out where to place a new subvolume. I need raw
> numbers for it. Df won't tell me that now. Things become very difficult
> now.

If you need to perform a btrfs-specific operation, you can easily use the
btrfs-specific tools to prepare for it, specifically "btrfs fi df", which
could provide every imaginable interpretation of free space estimate and
then some.

UNIX 'df' and the 'statfs' call on the other hand should keep the behavior
people are accustomed to rely on since the 1970s.

-- 
With respect,
Roman
Re: btrfsck does not fix
Hello,

> Ok.
> I think I do/did have some symptoms, but I cannot exclude other
> reasons...
> - High load without high cpu usage (io was the bottleneck)
> - Just now: transferring from one directory to the other on the same
>   subvolume (from /mnt/subvol/A/B to /mnt/subvol/A) I get 1.2MB/s
>   instead of > 60.
> - For some of the files I even got a "no space left on device" error.
>
> This is without any messages in dmesg or syslog related to btrfs.

As I don't see that I can fix this, I intend to re-create the filesystem.
For that, I need to remove one of the two discs from the raid/filesystem,
then create a new fs on this and move the data to it (I have no spare).
Could you please advise me whether this will be successful?

First some information on the filesystem:

./btrfs filesystem show /dev/sdb1
Label: none  uuid: 989306aa-d291-4752-8477-0baf94f8c42f
        Total devices 2 FS bytes used 3.47TiB
        devid 1 size 2.73TiB used 1.74TiB path /dev/sdb1
        devid 2 size 2.73TiB used 1.74TiB path /dev/sdc1

/btrfs subvolume list /mnt/BTRFS/Video
ID 256 gen 226429 top level 5 path Video
ID 1495 gen 226141 top level 5 path rsnapshot
ID gen 226429 top level 256 path Snapshot
ID 5845 gen 226375 top level 5 path backups

btrfs fi df /mnt/BTRFS/Video/
Data, RAID0: total=3.48TB, used=3.47TB
System, RAID1: total=32.00MB, used=260.00KB
Metadata, RAID1: total=4.49GB, used=3.85GB

What I did already yesterday was:

btrfs device delete /dev/sdc1 /mnt/BTRFS/rsnapshot/
btrfs device delete /dev/sdc1 /mnt/BTRFS/backups/
btrfs device delete /dev/sdc1 /mnt/BTRFS/Video/
btrfs filesystem balance start /mnt/BTRFS/Video/

Next, I'm doing the balance for the subvolume /mnt/BTRFS/backups.

In parallel, I try to delete /mnt/BTRFS/rsnapshot, but it fails:
btrfs subvolume delete /mnt/BTRFS/rsnapshot/
Delete subvolume '/mnt/BTRFS/rsnapshot'
ERROR: cannot delete '/mnt/BTRFS/rsnapshot' - Inappropriate ioctl
for device

Why's that?
But even more: how do I free sdc1 now?!

Greetings,
Hendrik
Re: Provide a better free space estimate on RAID1
Martin Steigerwald schrieb:

> While I understand that there is *never* a guarantee that a given
> amount of free space can really be allocated by a process, cause other
> processes can allocate space as well in the meantime, and while I
> understand that it's difficult to provide exact figures as soon as RAID
> settings can be set per subvolume, I still think it's important to
> improve on the figures.

The question here: does the free space indicator "fail" predictably or
unpredictably? It will do the latter with this change.

-- 
Replies to list only preferred.
Re: Provide a better free space estimate on RAID1
Chris Murphy schrieb:

> On Feb 6, 2014, at 11:08 PM, Roman Mamedov wrote:
>
>> And what if I am accessing that partition on a server via a network
>> CIFS/NFS share and don't even *have a way to find out* any of that.
>
> That's the strongest argument. And if the user is using
> Explorer/Finder/Nautilus to copy files to the share, I'm pretty sure
> all three determine if there's enough free space in advance of starting
> the copy. So if it thinks there's free space, it will start to copy and
> then later fail midstream when there's no more space. And then the
> user's copy task is in a questionable state as to what's been copied,
> depending on how the file copies are being threaded.

This problem was already solved for remote file systems maybe 20-30 years
ago: you cannot know how much space will be left at the end of the copy by
looking at the numbers before the copy - it may have been used up by
another user copying a file at the same time.

The problem has been solved by applying hard and soft quotas: the sysadmin
does an optimistic (or possibly even pessimistic) planning and applies
quotas. Soft quotas can be exceeded for (maybe) 7 days, after which you
need to free up space again before adding new data. Hard quotas are the
hard cutoff - you cannot pass that barrier. Df will show you what's free
within your soft quota. Problem solved. If you need better numbers, there
are quota commands instead of df. Why break with this design choice?

If you manage central shared storage for end users, you should really
start thinking about quotas. Without them, you cannot even exactly plan
your backups. If df shows transformed/guessed numbers to the sysadmins,
things start to become very complicated and unpredictable.

-- 
Replies to list only preferred.
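On btrfs, the closest analogue to those hard/soft quotas is the qgroup machinery. A sketch only -- the subvolume path is a placeholder, and qgroup support was still maturing in kernels of this era:

# Enable quota tracking, cap a subvolume at 100GiB, then review usage:
btrfs quota enable /srv/share
btrfs qgroup limit 100G /srv/share/users
btrfs qgroup show /srv/share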
Re: Provide a better free space estimate on RAID1
Hugo Mills schrieb:

> On Sat, Feb 08, 2014 at 05:33:10PM +0600, Roman Mamedov wrote:
>> On Fri, 07 Feb 2014 21:32:42 +0100
>> Kai Krakow wrote:
>>
>>> It should show the raw space available. Btrfs also supports
>>> compression and doesn't try to be smart about how much compressed
>>> data would fit in the free space of the drive. If one is using RAID1,
>>> it's supposed to fill up with a rate of 2:1. If one is using
>>> compression, it's supposed to fill up with a rate of maybe 1:5 for
>>> mostly text files.
>>
>> Imagine a small business with some 30-40 employees. There is a piece
>> of paper near the door at the office so that everyone sees it when
>> entering or leaving, which says:
>>
>> "Dear employees,
>>
>> Please keep in mind that on the fileserver '\\DepartmentC', in the
>> directory '\PublicStorage7' the free space you see as being available
>> needs to be divided by two; On the server '\\DepartmentD', in
>> '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For
>> more details please contact the IT operations team. Further assistance
>> will be provided at the monthly training seminar.
>>
>> Regards,
>> John S, CTO.'
>
> In my experience, nobody who uses a shared filesystem *ever* looks at
> the amount of free space on it, until it fills up, at which point they
> may look at the free space and see "0". Or most likely, they'll be
> alerted to the issue by an email from the systems people saying,
> "please will everyone delete unnecessary files from the shared drive,
> because it's full up."

Exactly that is the point from my practical experience. Only sysadmins
watch these numbers, and they'd know how to handle them. Imagine the
future: btrfs supports different RAID levels per subvolume. We need to
figure out where to place a new subvolume. I need raw numbers for that. Df
won't tell me them now. Things become very difficult then.

Free space is a number unimportant to end users. They won't look at it.
They start to cry and call helpdesk if an application says: disk is full.
You cannot even save your unsaved document, because: disk full. The only
way to solve this is to apply quotas to users and let the sysadmins do the
space usage planning. That will work.

I still think there should be an extra utility which guesses the predicted
usable free space - or an option added to df to show that. Roman's
argument is only one view of the problem. My argument (sysadmin space
planning) is exactly the opposite view. In the future, free space
prediction will only become more complicated, involve more code, introduce
bugs... It should be done in user space. Df should receive raw numbers.

Storage space is cheap these days. You should just throw another disk at
the array if free space falls below a certain threshold. End users do not
care about free space. They just cry when it's full - no matter how
accurate the numbers had been before. They will certainly not cry if they
copied 2 MB to the disk but 4 MB were taken. In a shared storage space
this is probably always the case anyway, because just at the very same
moment someone else also copied 2 MB to the volume. So what?

> Having a more accurate estimate of the free space is a laudable aim,
> and in principle I agree with attempts to do it, but I think the
> argument above isn't exactly a strong one in practice.

I do not disagree either. But I think it should go into a separate
utility, or there should be a new API call in the kernel to get the
predicted usable free space based on the current usage pattern.

Df is meant as a utility to get accurate numbers. It should not tell you
guessed numbers. However you design a df calculator in btrfs, it could
always be too pessimistic or too optimistic (and could even switch
unpredictably between both situations). So whatever you do: it is always
inaccurate. It will never be able to tell you exactly the numbers you
need.

If disk space is low: add disks. Clean up. Whatever. Just simply do not
try to fill up your FS to just 1kb left. Btrfs doesn't like that anyway.
So: use quotas.

Picking up the piece-of-paper example: you still have to tell your
employees that the free space numbers aren't exact anyway, so their best
chance is to simply not look at them; they are better off with just trying
to copy something.

Besides: if you want to fix this, what about the early-ENOSPC problem
which is there by design (allocation in chunks)? You'd need to fix that,
too.

-- 
Replies to list only preferred.
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
Hello, David, Fengguang, Chris.

On Fri, Feb 07, 2014 at 01:13:06PM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Fengguang Wu wrote:
>
>> On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote:
>>> On Fri, 7 Feb 2014, Fengguang Wu wrote:
>>>
>>>> [    1.625020] BTRFS: selftest: Running btrfs_split_item tests
>>>> [    1.627004] BTRFS: selftest: Running find delalloc tests
>>>> [    2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
>>>> [  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0,
>>>> order=1, oom_score_adj=0
>>>> [  292.086439] kthreadd cpuset=
>>>> [  292.087072] BUG: unable to handle kernel NULL pointer dereference
>>>> at 0000000000000038
>>>> [  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c
>>>
>>> This looks like a problem with the cpuset cgroup name, are you sure
>>> this isn't related to the removal of cgroup->name?
>>
>> It looks unrelated to the patch "cgroup: remove cgroup->name", because
>> that patch lies in the cgroup tree and is not contained in the output
>> of "git log BAD_COMMIT".
>
> It's dying on pr_cont_kernfs_name, which is from some tree that has
> "kernfs: implement kernfs_get_parent(), kernfs_name/path() and
> friends", which is not in linux-next, and is obviously printing the
> cpuset cgroup name.
>
> It doesn't look like it has anything at all to do with btrfs or why
> they would care about this failure.

Yeah, this is from a patch in the cgroup/review-post-kernfs-conversion
branch which updates cgroup to use pr_cont_kernfs_name(). I forgot that
cgrp->kn is NULL for the dummy_root's top cgroup, and thus it ends up
calling the kernfs functions with a NULL kn and oopses. I posted an
updated patch and the git branch has been updated.

  http://lkml.kernel.org/g/20140208200640.gb10...@htj.dyndns.org

So, nothing to do with btrfs, and it looks like somehow the test apparatus
is mixing up branches?

Thanks!

-- 
tejun
system stuck with flush-btrfs-4 at 100% after filesystem resize
Hello,

I have a large file system that has been growing. We've resized it a
couple of times with the following approach:

lvextend -L +800G /dev/raid/virtual_machines
btrfs filesystem resize +800G /vms

I think the FS started out at 200G, we increased it by 200GB a time or
two, then by 800GB, and everything worked fine. The filesystem hosts a
number of virtual machines, so the file system is in use, although the VMs
individually tend not to be overly active. VMs tend to be in subvolumes,
and some of those subvolumes have snapshots.

This time, I increased it by another 800GB, and it has hung for many hours
(over night) with flush-btrfs-4 near 100% cpu all that time. I'm not clear
at this point that it will finish or where to go from here. Any pointers
would be much appreciated.

Thanks,
-john (newbie to BTRFS)

procedure log
--
romulus:/home/users/johnn # lvextend -L +800G /dev/raid/virtual_machines
romulus:/home/users/johnn # btrfs filesystem resize +800G /vms
Resize '/vms' of '+800G'
[hangs]

top - 12:21:53 up 136 days, 2:45, 13 users, load average: 30.39, 30.37, 30.37
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 2.3 sy, 0.0 ni, 95.1 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem:  129147 total, 127427 used,   1720 free,    264 buffers
MiB Swap: 262143 total,    661 used, 261482 free,  93666 cached

  PID USER  PR NI VIRT RES SHR S %CPU %MEM   TIME+   COMMAND
48809 root  20  0     0   0   0 R 99.3  0.0 1449:14 flush-btrfs-4

--- misc info ---
romulus:/home/users/johnn # cat /etc/SuSE-release
openSUSE 12.3 (x86_64)
VERSION = 12.3
CODENAME = Dartmouth

romulus:/home/users/johnn # uname -a
Linux romulus.us.redacted.com 3.7.10-1.16-desktop #1 SMP PREEMPT Fri May 31 20:21:23 UTC 2013 (97c14ba) x86_64 x86_64 x86_64 GNU/Linux

romulus:/home/users/johnn # vgdisplay
  --- Volume group ---
  VG Name               raid
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  19
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                7
  Open LV               7
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               10.91 TiB
  PE Size               4.00 MiB
  Total PE              2859333
  Alloc PE / Size       1371136 / 5.23 TiB
  Free  PE / Size       1488197 / 5.68 TiB
  VG UUID               npyvGj-7vxF-IoI8-Z4tF-ygpP-Q2Ja-vV8sLA
[...]

romulus:/home/users/johnn # lvdisplay
[...]
  --- Logical volume ---
  LV Path                /dev/raid/virtual_machines
  LV Name                virtual_machines
  VG Name                raid
  LV UUID                qtzNBG-vuLV-EsgO-FDIf-sO7A-GKmd-EVjGjp
  LV Write Access        read/write
  LV Creation host, time romulus.redacted.com, 2013-09-25 11:05:54 -0500
  LV Status              available
  # open                 1
  LV Size                2.54 TiB
  Current LE             665600
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
[...]

johnn@romulus:~> df -h /vms
Filesystem  Size  Used Avail Use% Mounted on
/dev/dm-4   1.8T  1.8T  6.0G 100% /vms

romulus:/home/users/johnn # btrfs filesystem show
[...]
Label: none  uuid: f08c5602-f53a-43c9-b498-fa788b01e679
        Total devices 1 FS bytes used 1.74TB
        devid 1 size 1.76TB used 1.76TB path /dev/dm-4
[...]
Btrfs v0.19+

romulus:/home/users/johnn # btrfs subvolume list /vms
ID 324 top level 5 path johnn-centos64
ID 325 top level 5 path johnn-ubuntu1304
ID 326 top level 5 path johnn-opensuse1203
ID 327 top level 5 path johnn-sles11sp3
ID 328 top level 5 path johnn-sles11sp2
ID 329 top level 5 path johnn-fedora19
ID 330 top level 5 path johnn-sles11sp1
ID 394 top level 5 path redacted-glance
ID 396 top level 5 path redacted_test
ID 397 top level 5 path glance
ID 403 top level 5 path test_redacted
ID 414 top level 5 path johnn-disktest
ID 460 top level 5 path redacted-opensuse-01
ID 472 top level 5 path redacted
ID 473 top level 5 path redacted2
ID 496 top level 5 path redacted_test
ID 524 top level 5 path redacted-moab
ID 525 top level 5 path redacted_redacted-1
ID 531 top level 5 path .snapshots/johnn-sles11sp2/2013.10.11-14:25.18/johnn-sles11sp2
ID 533 top level 5 path .snapshots/johnn-centos64/2013.10.11-15:32.16/johnn-centos64
ID 534 top level 5 path .snapshots/johnn-ubuntu1304/2013.10.11-15:33.20/johnn-ubuntu1304
ID 535 top level 5 path .snapshots/johnn-opensuse1203/2013.10.11-15:36.19/johnn-opensuse1203
ID 536 top level 5 path .snapshots/johnn-sles11sp3/2013.10.11-15:39.51/johnn-sles11sp3
ID 537 top level 5 path .snapshots/johnn-fedora19/2013.10.11-15:41.08/johnn-fedora19
ID 538 top level 5 path .snapshots/johnn-sles11sp2/2013.10.11-16
Recovering from persistent kernel oops on 'btrfs balance'
Hi,

I added a 2nd device and 'btrfs balance' crashed (kernel oops) half way
through. Now I can only read the fs from a rawhide live DVD, but even that
can't fix the fs (finish the balance, or remove the 2nd device to try
again). I'd be grateful for any advice on getting back to a working btrfs
filesystem.

Details
===

Hardware: Asus P5G41T-M with Pentium dual core E2140, 4GB ram, OS on ext4
drive, two 4TB Seagate "NAS" SATA drives.

On Ubuntu 13.04 x86_64 (3.8 kernel, btrfs-tools 0.19+20130117):

1. Install new 4TB drive (/dev/sdb), use gparted to create a full-disk
   btrfs partition, mount on /ark, copy ~500GB data; everything working
   well for a couple weeks.
2. Install an additional identical 4TB drive, following
   https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Adding_new_devices
3. btrfs device add /dev/sdc /ark
4. btrfs balance start -dconvert=raid1 -mconvert=raid1 /ark
5. After ~1 hour, at about 50% (according to 'btrfs balance status'), the
   system locks up with this displayed (sorry, JPEG):
   http://i.imgur.com/Ds9pnZV.jpg
6. System repeats the same oops on startup.
7. After removing /dev/sdc, the system boots but can't see anything on
   /ark.

I guess using a 3.8 kernel wasn't the smartest idea. Let's update.

8. Update to Ubuntu 13.10 x86_64 (3.11 kernel, btrfs-tools
   0.19+20130705-1).
9. Now the system boots with /dev/sdc plugged in but still can't see data
   on /ark; IIRC the balance command gave a similar kernel oops.
10. Fine, I'll try Rawhide. From Jan 30, 2014, kernel
    3.14.0-0.rc0.git17.1.fc21.x86_64.
11. I can see data on /ark!
12. If I try to 'btrfs balance resume' or 'btrfs balance cancel' I get
    roughly the same kernel oops: http://pastebin.ca/2634583
13. 'btrfs device delete /dev/sdc /ark' says it cannot be done while a
    balance is underway.
14. Help!

Any suggestion on how to recover the btrfs fs? My last-resort idea is to
pull /dev/sdb (which seems to have actual data that rawhide can see),
format /dev/sdc as ext4, plug both drives in again and copy from btrfs
/dev/sdb to ext4 /dev/sdc, then wipe the btrfs fs on /dev/sdb and try
again with the 3.11 kernel (or just with rawhide?). But that is a whole
lot of copying it would be nice to avoid.

Thanks,
-Nathan
[PATCH][V2] Re: Provide a better free space estimate on RAID1
On 02/07/2014 05:40 AM, Roman Mamedov wrote:
> On Thu, 06 Feb 2014 20:54:19 +0100
> Goffredo Baroncelli wrote:
> [...]

Even though I am not entirely convinced, I updated Roman's PoC to take
into account all the RAID levels.

The test filesystem is composed of seven 51GB disks. Here are my "df"
results:

Profile: single
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  512K  348G   1% /mnt/btrfs1

Profile: raid1
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  1.3M  150G   1% /mnt/btrfs1

Profile: raid10
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  2.3M  153G   1% /mnt/btrfs1

Profile: raid5
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  2.0M  298G   1% /mnt/btrfs1

Profile: raid6
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc3        51G  1.8M  248G   1% /mnt/btrfs1

Note that RAID1 and RAID10 can only use an even number of disks. The
mixed mode (data and metadata in the same chunk) returns strange results.

Below is my patch. (A rough cross-check of these numbers follows at the
end of this message.)

BR
G.Baroncelli

Changes history:
V1 First issue
V2 Correct an (old) bug for when, in RAID10, the disks aren't a multiple
   of 4

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d71a11d..aea9afa 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1481,10 +1481,16 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 				       u64 *free_bytes)
 		num_stripes = nr_devices;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1) {
 		min_stripes = 2;
-		num_stripes = 2;
+		num_stripes = nr_devices & ~1llu;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
-		num_stripes = 4;
+		num_stripes = nr_devices & ~1llu;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+		min_stripes = 3;
+		num_stripes = nr_devices;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+		min_stripes = 4;
+		num_stripes = nr_devices;
 	}
 
 	if (type & BTRFS_BLOCK_GROUP_DUP)
@@ -1561,8 +1567,30 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 		if (devices_info[i].max_avail >= min_stripe_size) {
 			int j;
 			u64 alloc_size;
+			int k;
 
-			avail_space += devices_info[i].max_avail * num_stripes;
+			/*
+			 * Depending on the RAID profile, we use some
+			 * disk space as redundancy:
+			 * RAID1, RAID10, DUP -> half of space used as redundancy
+			 * RAID5 -> 1 stripe used as redundancy
+			 * RAID6 -> 2 stripes used as redundancy
+			 * RAID0, LINEAR -> no redundancy
+			 */
+			if (type & BTRFS_BLOCK_GROUP_RAID1) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_DUP) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+				k = num_stripes - 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+				k = num_stripes - 2;
+			} else { /* RAID0/LINEAR */
+				k = num_stripes;
+			}
+			avail_space += devices_info[i].max_avail * k;
 			alloc_size = devices_info[i].max_avail;
 			for (j = i + 1 - num_stripes; j <= i; j++)
 				devices_info[j].max_avail -= alloc_size;

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
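A rough cross-check of the df figures above against the patch's k factors, assuming seven 51G devices and ignoring metadata/system reservations (which is why the kernel's reported values come out a few GB lower):

echo "single: $((51 * 7))G"  # k=7 -> 357G; df reports 348G
echo "raid1:  $((51 * 3))G"  # num_stripes=6, k=3 -> 153G; df reports 150G
echo "raid10: $((51 * 3))G"  # num_stripes=6, k=3 -> 153G; df reports 153G
echo "raid5:  $((51 * 6))G"  # k=6 -> 306G; df reports 298G
echo "raid6:  $((51 * 5))G"  # k=5 -> 255G; df reports 248G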
[PATCH] xfstests: add test for btrfs data corruption when using compression
Test for a btrfs data corruption when using compressed files/extents.
Under certain cases, it was possible for reads to return random data
(content from a previously used page) instead of zeroes. This also
caused partial updates to those regions that were supposed to be filled
with zeroes to save random (and invalid) data into the file extents.

This is fixed by the commit for the linux kernel titled:

  Btrfs: fix data corruption when reading/updating compressed extents
  (https://patchwork.kernel.org/patch/3610391/)

Signed-off-by: Filipe David Borba Manana
---
 tests/btrfs/036     | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/036.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 113 insertions(+)
 create mode 100755 tests/btrfs/036
 create mode 100644 tests/btrfs/036.out

diff --git a/tests/btrfs/036 b/tests/btrfs/036
new file mode 100755
index 000..533b6ee
--- /dev/null
+++ b/tests/btrfs/036
@@ -0,0 +1,111 @@
+#! /bin/bash
+# FS QA Test No. btrfs/036
+#
+# Test for a btrfs data corruption when using compressed files/extents.
+# Under certain cases, it was possible for reads to return random data
+# (content from a previously used page) instead of zeroes. This also
+# caused partial updates to those regions that were supposed to be filled
+# with zeroes to save random (and invalid) data into the file extents.
+#
+# This is fixed by the commit for the linux kernel titled:
+#
+#   Btrfs: fix data corruption when reading/updating compressed extents
+#
+#---
+# Copyright (c) 2014 Filipe Manana. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount "-o compress-force=lzo"
+
+run_check $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" \
+    $SCRATCH_MNT/foobar
+run_check $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
+run_check $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
+run_check $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
+
+# Expected file items in the fs tree are (from btrfs-debug-tree):
+#
+# item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
+#     inode generation 6 transid 6 size 542872 block group 0 mode 100600
+# item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
+#     inode ref index 2 namelen 6 name: foobar
+# item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
+#     extent data disk byte 0 nr 0 gen 6
+#     extent data offset 0 nr 24576 ram 266240
+#     extent compression 0
+# item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
+#     prealloc data disk byte 12849152 nr 241664 gen 6
+#     prealloc data offset 0 nr 241664
+# item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
+#     extent data disk byte 12845056 nr 4096 gen 6
+#     extent data offset 0 nr 20480 ram 20480
+#     extent compression 2
+# item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
+#     prealloc data disk byte 13090816 nr 405504 gen 6
+#     prealloc data offset 0 nr 258048
+#
+# The on disk extent at 266240, contains 5 compressed chunks of file data.
+# Each of the first 4 chunks compress 4096 bytes of file data, while the last
+# one compresses only 3024 bytes of file data. Because this extent item is not
+# the last one in the file, as it is followed by a prealloc extent, reads into
+# the region [285648 ; 286720[ (length = 4096 - 3024) should return zeroes.
+
+_scratch_unmount
+_check_btrfs_filesystem $SCRATCH_DEV
+
+EXPECTED_MD5="b8b0dbb8e02f94123c741c23659a1c0a"
+
+for i in `seq 1 27`
+do
+    _scratch_mount "-o ro"
+    MD5=`md5sum $SCRATCH_MNT/foobar | cut -f 1 -d ' '`
+    _scratch_unmount
+    if [ "${MD5}x" != "${EXPECTED_MD5}x" ]
+    then
+        echo "Unexpected file digest (wanted $EXPECTED_MD5, got $MD5)"
+    fi
+done
+
+status=0
+exit
[PATCH 1/2] Btrfs: skip readonly root for snapshot-aware defragment
From: Wang Shilong

Btrfs send assumes a readonly root won't change; let's skip
readonly roots.

Signed-off-by: Wang Shilong
---
 fs/btrfs/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1af34d0..e8dfd83 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2239,6 +2239,11 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
 		return PTR_ERR(root);
 	}
 
+	if (btrfs_root_readonly(root)) {
+		srcu_read_unlock(&fs_info->subvol_srcu, index);
+		return 0;
+	}
+
 	/* step 2: get inode */
 	key.objectid = backref->inum;
 	key.type = BTRFS_INODE_ITEM_KEY;
-- 
1.8.4
[RFC PATCH 2/2] Revert "Btrfs: remove transaction from btrfs send"
From: Wang Shilong

This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2.
Previously I was thinking we could use a readonly root's commit root
safely, but that is not true: a readonly root may be cowed in the
following cases.

1. snapshot: the send root will cow the source root.
2. balance and device operations will also cow a readonly send root
   to relocate.

So I have two ideas to make it safe for us to use the commit root.

--> approach 1:
make it protected by a transaction, end the transaction properly, and
re-search the next item from the root node (see
btrfs_search_slot_for_read()).

--> approach 2:
add another counter to the local root structure to sync snapshot with
send, and add a global counter to sync send with exclusive device
operations.

With approach 2, send can use the commit root safely, because we make
sure the send root cannot be cowed during send. Unfortunately, it makes
the code *ugly* and more complex to maintain.

Making snapshot and send exclusive, and device operations and send
exclusive with each other, is a little confusing for common users.

So why not go back to the previous way.

Cc: Josef Bacik
Signed-off-by: Wang Shilong
---
Josef, if we reach agreement to adopt this approach, please revert
Filipe's patch (Btrfs: make some tree searches in send.c more efficient)
from btrfs-next.
---
 fs/btrfs/send.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 168b9ec..e9d1265 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5098,6 +5098,7 @@ out:
 static int full_send_tree(struct send_ctx *sctx)
 {
 	int ret;
+	struct btrfs_trans_handle *trans = NULL;
 	struct btrfs_root *send_root = sctx->send_root;
 	struct btrfs_key key;
 	struct btrfs_key found_key;
@@ -5119,6 +5120,19 @@ static int full_send_tree(struct send_ctx *sctx)
 	key.type = BTRFS_INODE_ITEM_KEY;
 	key.offset = 0;
 
+join_trans:
+	/*
+	 * We need to make sure the transaction does not get committed
+	 * while we do anything on commit roots. Join a transaction to
+	 * prevent this.
+	 */
+	trans = btrfs_join_transaction(send_root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto out;
+	}
+
 	/*
 	 * Make sure the tree has not changed after re-joining. We detect this
 	 * by comparing start_ctransid and ctransid. They should always match.
@@ -5142,6 +5156,19 @@ static int full_send_tree(struct send_ctx *sctx)
 		goto out_finish;
 
 	while (1) {
+		/*
+		 * When someone wants to commit while we iterate, end the
+		 * joined transaction and rejoin.
+		 */
+		if (btrfs_should_end_transaction(trans, send_root)) {
+			ret = btrfs_end_transaction(trans, send_root);
+			trans = NULL;
+			if (ret < 0)
+				goto out;
+			btrfs_release_path(path);
+			goto join_trans;
+		}
+
 		eb = path->nodes[0];
 		slot = path->slots[0];
 		btrfs_item_key_to_cpu(eb, &found_key, slot);
@@ -5169,6 +5196,12 @@ out_finish:
 
 out:
 	btrfs_free_path(path);
+	if (trans) {
+		if (!ret)
+			ret = btrfs_end_transaction(trans, send_root);
+		else
+			btrfs_end_transaction(trans, send_root);
+	}
 	return ret;
 }
-- 
1.8.4
[PATCH] Btrfs: fix data corruption when reading/updating compressed extents
When using a mix of compressed file extents and prealloc extents, it is
possible to fill a page of a file with random, garbage data from some
unrelated previous use of the page, instead of a sequence of zeroes.

A simple sequence of steps to get into such a case, taken from the test
case I made for xfstests, is:

    _scratch_mkfs
    _scratch_mount "-o compress-force=lzo"
    $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
    $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
    $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

This results in the following file items in the fs tree:

    item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
        inode generation 6 transid 6 size 542872 block group 0 mode 100600
    item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
        inode ref index 2 namelen 6 name: foobar
    item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
        extent data disk byte 0 nr 0 gen 6
        extent data offset 0 nr 24576 ram 266240
        extent compression 0
    item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
        prealloc data disk byte 12849152 nr 241664 gen 6
        prealloc data offset 0 nr 241664
    item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
        extent data disk byte 12845056 nr 4096 gen 6
        extent data offset 0 nr 20480 ram 20480
        extent compression 2
    item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
        prealloc data disk byte 13090816 nr 405504 gen 6
        prealloc data offset 0 nr 258048

The on-disk extent at offset 266240 (which corresponds to a single disk
block) contains 5 compressed chunks of file data. Each of the first 4
compresses 4096 bytes of file data, while the last one compresses only
3024 bytes of file data. Therefore a read into the file region
[285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always
return zeroes (our next extent is a prealloc one).

The solution here is to have the compression code path zero the
remaining (untouched) bytes of the last page it uncompressed data into,
as the information about how much space the file data consumes in the
last page is not known in the upper layer
fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly
zeroing the remainder of the page, but only if it corresponds to the
last page of the inode and only if the inode's size is not a multiple of
the page size.

This would cause not only returning random data on reads, but also
permanently storing random data when updating parts of the region that
should be zeroed. For the example above, it means that updating a single
byte in the region [285648 ; 286720[ would store that byte correctly but
would also store random data on disk.

A test case for xfstests follows soon.

Signed-off-by: Filipe David Borba Manana
---
 fs/btrfs/compression.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index af815eb..ed1ff1cb 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1011,6 +1011,8 @@ int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
 		bytes = min(bytes, working_bytes);
 		kaddr = kmap_atomic(page_out);
 		memcpy(kaddr + *pg_offset, buf + buf_offset, bytes);
+		if (*pg_index == (vcnt - 1) && *pg_offset == 0)
+			memset(kaddr + bytes, 0, PAGE_CACHE_SIZE - bytes);
 		kunmap_atomic(kaddr);
 		flush_dcache_page(page_out);
 
--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
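A quick way to check the fix by hand (a sketch, not part of the patch; it
assumes the reproducer above was run with $SCRATCH_MNT mounted at
/mnt/scratch) is to read back the region that must be zeroes:

    # Dump the 1072-byte region [285648 ; 286720[ ; every byte should be 0x00.
    $ xfs_io -c "pread -v 285648 1072" /mnt/scratch/foobar | head -n 4

    # Or compare the region against a zero-filled reference:
    $ dd if=/mnt/scratch/foobar bs=1 skip=285648 count=1072 2>/dev/null \
          | cmp - <(head -c 1072 /dev/zero) && echo "region is all zeroes"

Before the fix, the first command shows whatever garbage was left in the
page; after it, only zeroes.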
Re: Provide a better free space estimate on RAID1
On 02/07/2014 05:40 AM, Roman Mamedov wrote:
> On Thu, 06 Feb 2014 20:54:19 +0100
> Goffredo Baroncelli wrote:
> [...]

Even if I am not entirely convinced, I updated Roman's PoC to take all
the RAID levels into account. I performed some tests with seven 48.8GB
disks. Here are my "df" results:

Profile: single
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  512K  340G   1% /mnt/btrfs1

Profile: raid1
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  1.3M  147G   1% /mnt/btrfs1

Profile: raid10
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  2.3M  102G   1% /mnt/btrfs1

Profile: raid5
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  2.0M  291G   1% /mnt/btrfs1

Profile: raid6
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        342G  1.8M  243G   1% /mnt/btrfs1

Note that RAID1 can only use six disks, and RAID10 only four, but I think
that is due to a previous bug. Mixed mode (data and metadata in the same
chunk) is still unsupported by my patch.

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d71a11d..e5c58b3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1485,6 +1485,12 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+		min_stripes = 3;
+		num_stripes = nr_devices;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+		min_stripes = 4;
+		num_stripes = nr_devices;
 	}
 
 	if (type & BTRFS_BLOCK_GROUP_DUP)
@@ -1561,8 +1567,30 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root,
 		if (devices_info[i].max_avail >= min_stripe_size) {
 			int j;
 			u64 alloc_size;
+			int k;
 
-			avail_space += devices_info[i].max_avail * num_stripes;
+			/*
+			 * Depending on the RAID profile, we use some
+			 * disk space as redundancy:
+			 * RAID1, RAID10, DUP -> half of space used as redundancy
+			 * RAID5 -> 1 stripe used as redundancy
+			 * RAID6 -> 2 stripes used as redundancy
+			 * RAID0, LINEAR -> no redundancy
+			 */
+			if (type & BTRFS_BLOCK_GROUP_RAID1) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_DUP) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+				k = num_stripes >> 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+				k = num_stripes - 1;
+			} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+				k = num_stripes - 2;
+			} else { /* RAID0/LINEAR */
+				k = num_stripes;
+			}
+			avail_space += devices_info[i].max_avail * k;
 			alloc_size = devices_info[i].max_avail;
 			for (j = i + 1 - num_stripes; j <= i; j++)
 				devices_info[j].max_avail -= alloc_size;

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
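The redundancy factors in the patch above are easy to sanity-check from
userspace. Here is a tiny sketch (a hypothetical helper, not part of the
patch) that mirrors the same arithmetic for N equally sized devices:

    # Print the fraction of raw space usable for data, per profile.
    usable_ratio() {        # usage: usable_ratio <profile> <num_devices>
            case "$1" in
                    raid1|raid10|dup) echo "$(($2 / 2))/$2" ;;  # half is mirrored
                    raid5)            echo "$(($2 - 1))/$2" ;;  # 1 stripe of parity
                    raid6)            echo "$(($2 - 2))/$2" ;;  # 2 stripes of parity
                    *)                echo "$2/$2" ;;           # single/raid0/linear
            esac
    }

    $ usable_ratio raid5 7    # -> 6/7, i.e. about 293G of the 342G above,
                              #    close to the 291G Avail that df reports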
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> If you disable CONFIG_BTRFS_FS_RUN_SANITY_TESTS, does it still crash?

Good idea! I've queued test jobs for that config. However, I'm sorry that
I'll be offline for the next 2 days, so please expect some delays.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
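For anyone wanting to run the same experiment, one way to flip that
option in an existing kernel build tree is sketched below (it assumes a
standard kernel source checkout with a .config already in place):

    # Turn off the btrfs sanity tests, refresh the config, rebuild.
    scripts/config --disable BTRFS_FS_RUN_SANITY_TESTS
    make olddefconfig
    make -j"$(nproc)"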
Re: Provide a better free space estimate on RAID1
On Sat, Feb 08, 2014 at 05:33:10PM +0600, Roman Mamedov wrote:
> On Fri, 07 Feb 2014 21:32:42 +0100
> Kai Krakow wrote:
>
> > It should show the raw space available. Btrfs also supports
> > compression and doesn't try to be smart about how much compressed data
> > would fit in the free space of the drive. If one is using RAID1, it's
> > supposed to fill up with a rate of 2:1. If one is using compression,
> > it's supposed to fill up with a rate of maybe 1:5 for mostly text
> > files.
>
> Imagine a small business with some 30-40 employees. There is a piece of
> paper near the door at the office so that everyone sees it when entering
> or leaving, which says:
>
> "Dear employees,
>
> Please keep in mind that on the fileserver '\\DepartmentC', in the
> directory '\PublicStorage7' the free space you see as being available
> needs to be divided by two; On the server '\\DepartmentD', in
> '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more
> details please contact the IT operations team. Further assistance will
> be provided at the monthly training seminar.
>
> Regards,
> John S, CTO."

In my experience, nobody who uses a shared filesystem *ever* looks at the
amount of free space on it, until it fills up, at which point they may
look at the free space and see "0". Or most likely, they'll be alerted to
the issue by an email from the systems people saying, "please will
everyone delete unnecessary files from the shared drive, because it's
full up."

Having a more accurate estimate of the free space is a laudable aim, and
in principle I agree with attempts to do it, but I think the argument
above isn't exactly a strong one in practice. Even in the current code
with only one RAID setting available for data, if you have parity RAID,
you've got to look at the number of drives with available free space to
make an estimate of available space.

I think your best bet, ultimately, is to write code to give either a
pessimistic (lower bound) or optimistic (upper bound) estimate of
available space based on the profiles in use and the current distribution
of free/unallocated space, and stick with that. I think I'd prefer to see
a pessimistic bound, although that could break anything like an installer
that attempts to see how much free space there is before proceeding.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
         --- This year, I'm giving up Lent. ---

signature.asc
Description: Digital signature
Re: Provide a better free space estimate on RAID1
On Fri, 07 Feb 2014 21:32:42 +0100
Kai Krakow wrote:

> It should show the raw space available. Btrfs also supports compression
> and doesn't try to be smart about how much compressed data would fit in
> the free space of the drive. If one is using RAID1, it's supposed to
> fill up with a rate of 2:1. If one is using compression, it's supposed
> to fill up with a rate of maybe 1:5 for mostly text files.

Imagine a small business with some 30-40 employees. There is a piece of
paper near the door at the office so that everyone sees it when entering
or leaving, which says:

"Dear employees,

Please keep in mind that on the fileserver '\\DepartmentC', in the
directory '\PublicStorage7' the free space you see as being available
needs to be divided by two; On the server '\\DepartmentD', in
'\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more
details please contact the IT operations team. Further assistance will be
provided at the monthly training seminar.

Regards,
John S, CTO."

-- 
With respect,
Roman

signature.asc
Description: PGP signature
Re: Provide a better free space estimate on RAID1
On Fri, 7 Feb 2014 12:08:12 +0600
Roman Mamedov wrote:

> > Earlier conventions would have stated Size ~900GB, and Avail ~900GB.
> > But that's not exactly true either, is it?
>
> Much better, and matching the user expectations of how RAID1 should
> behave, without a major "gotcha" blowing up into their face the first
> minute they are trying it out. In fact the next step I planned was
> finding how to adjust Size and Used as well on all my machines to show
> what you just mentioned.

OK, done; again, this is just what I will personally use from now on (and
for anyone who finds this helpful).

--- fs/btrfs/super.c.orig	2014-02-06 01:28:36.636164982 +0600
+++ fs/btrfs/super.c	2014-02-08 17:16:50.361931959 +0600
@@ -1481,6 +1481,11 @@
 	}
 	kfree(devices_info);
 
+	if (type & BTRFS_BLOCK_GROUP_RAID1) {
+		do_div(avail_space, min_stripes);
+	}
+
 	*free_bytes = avail_space;
 	return 0;
 }
@@ -1491,8 +1496,10 @@
 	struct btrfs_super_block *disk_super = fs_info->super_copy;
 	struct list_head *head = &fs_info->space_info;
 	struct btrfs_space_info *found;
+	u64 total_size;
 	u64 total_used = 0;
 	u64 total_free_data = 0;
+	u64 type;
 	int bits = dentry->d_sb->s_blocksize_bits;
 	__be32 *fsid = (__be32 *)fs_info->fsid;
 	int ret;
@@ -1512,7 +1519,13 @@
 	rcu_read_unlock();
 
 	buf->f_namelen = BTRFS_NAME_LEN;
-	buf->f_blocks = btrfs_super_total_bytes(disk_super) >> bits;
+	total_size = btrfs_super_total_bytes(disk_super);
+	type = btrfs_get_alloc_profile(fs_info->tree_root, 1);
+	if (type & BTRFS_BLOCK_GROUP_RAID1) {
+		do_div(total_size, 2);
+		do_div(total_used, 2);
+	}
+	buf->f_blocks = total_size >> bits;
 	buf->f_bfree = buf->f_blocks - (total_used >> bits);
 	buf->f_bsize = dentry->d_sb->s_blocksize;
 	buf->f_type = BTRFS_SUPER_MAGIC;

2x1TB RAID1 with a 1GB file:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       912G  1.1G  911G   1% /mnt/p2

-- 
With respect,
Roman

signature.asc
Description: PGP signature
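To see the raw statfs() numbers behind df when trying out a change like
the one above, GNU stat can print the fields directly; a small sketch,
assuming Roman's mount point:

    $ stat -f --format='blocks=%b free=%f bsize=%S' /mnt/p2
    # With the patch, blocks * bsize on a 2x1TB RAID1 should come out to
    # roughly half of the raw capacity shown by 'btrfs fi show'.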
Re: lost with degraded RAID1
Ok, I did nuke it now and created the fs again using a 3.12 kernel. So
far so good, it runs fine.

Finally, I know it's kind of offtopic, but can someone help me interpret
this (I think this is the error in the SMART log which started the whole
mess)?

Error 1 occurred at disk power-on lifetime: 2576 hours (107 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 00 ff ff ff 0f  Device Fault; Error: ABRT at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00   5d+04:53:11.169  WRITE FPDMA QUEUED
  61 00 08 80 18 00 40 00   5d+04:52:45.129  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   5d+04:52:44.701  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   5d+04:52:44.700  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   5d+04:52:44.679  WRITE FPDMA QUEUED

2014-02-07 Chris Murphy :
>
> On Feb 7, 2014, at 4:34 AM, Johan Kröckel wrote:
>
>> Is there anything else I should do with this setup or may I nuke the
>> two partitions and reuse them?
>
> Well I'm pretty sure once you run 'btrfs check --repair' that you've hit
> the end of the road. Possibly btrfs restore can still extract some
> files; it might be worth testing whether that works.
>
> Otherwise blow it away. I'd say test with 3.14-rc2 with a new file
> system and see if you can reproduce the sequence that caused this
> problem in the first place. If it's reproducible, I think there's a bug
> here somewhere.
>
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
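Not a diagnosis, but the usual next steps with smartmontools would look
something like the following, with /dev/sdX standing in for the affected
drive:

    smartctl -l error /dev/sdX      # full ATA error log (the excerpt above)
    smartctl -l selftest /dev/sdX   # results of previous self-tests
    smartctl -t long /dev/sdX       # start a long (surface) self-test
    # Repeated ABRT/UNC entries or a growing Reallocated_Sector_Ct would
    # point at the drive itself rather than at btrfs.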
Re: Are nocow files snapshot-aware
Duncan <1i5t5.dun...@cox.net> wrote:

[...]

Difficult to twist your mind around that, but well explained. ;-)

> A snapshot thus looks much like a crash in terms of NOCOW file integrity,
> since the blocks of a NOCOW file are simply snapshotted in-place, and
> there's already no checksumming or file integrity verification on such
> files -- they're simply directly written in-place (with the exception of
> a single COW write when a writable snapshotted NOCOW file diverges from
> the shared snapshot version).
>
> But as I said, the applications themselves are normally designed to
> handle and recover from crashes, and in fact, having btrfs try to manage
> it too only complicates things and can actually make it impossible for
> the app to recover what it would have otherwise recovered just fine.
>
> So it should be with these NOCOW in-place snapshotted files, too. If a
> NOCOW file is put back into operation from a snapshot, and the file was
> being written to at snapshot time, it'll very likely trigger exactly the
> same response from the application as a crash while writing would have
> triggered, but, the point is, such applications are normally designed to
> deal with just that, and thus, they should recover just as they would
> from a crash. If they could recover from a crash, it shouldn't be an
> issue. If they couldn't, well...

So we agree that taking a snapshot looks like a crash from the
application's perspective. That means that if there are facilities to
instruct the application to suspend its operations first, you should use
them - like in the InnoDB case
(http://dev.mysql.com/doc/refman/5.1/en/lock-tables.html):

| FLUSH TABLES WITH READ LOCK;
| SHOW MASTER STATUS;
| SYSTEM xfs_freeze -f /var/lib/mysql;
| SYSTEM YOUR_SCRIPT_TO_CREATE_SNAPSHOT.sh;
| SYSTEM xfs_freeze -u /var/lib/mysql;
| UNLOCK TABLES;
| EXIT;

Only that way do you get consistent snapshots and avoid triggering
crash-recovery (which might otherwise throw away unrecoverable
transactions or otherwise harm your data for the sake of consistency).

InnoDB is more or less like a vm filesystem image on btrfs in this case,
so the same approach should be taken for vm images if possible. I think
VMware has facilities to prepare the guest for a snapshot being taken (it
is triggered when you take snapshots with VMware itself, and by the way
it usually takes much longer than btrfs snapshots do).

Take xfs as an example: although it is crash-safe, it prefers to zero out
your files for security reasons during log-replay, because it is
crash-safe only for meta-data: if meta-data has already allocated blocks
but file-data has not yet been written, a recovered file could otherwise
end up with wrong content, so it is cleared out. This _IS_NOT_ the
situation you want for vm images with xfs inside, hosted on btrfs, when
taking a snapshot. You should trigger xfs_freeze in the guest before
taking the btrfs snapshot in the host. I think the same holds true for
most other meta-data-only-journalling file systems, which probably do not
even zero out files during recovery and just silently corrupt them during
crash-recovery.

So in case of a crash or snapshot (which look the same from the
application perspective), btrfs' capabilities won't help you here - at
least in the nocow case, and probably in the cow case too, because the vm
guest may write blocks out-of-order without having the possibility to
pass write-barriers down to the btrfs cow mechanism.
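A minimal sketch of that host-side sequence (every name and path here is
an example only; it assumes the guest is reachable over ssh, keeps its
data on a filesystem mounted at /data, and has util-linux's fsfreeze):

    #!/bin/bash
    # Quiesce the guest fs, snapshot the image subvolume, thaw the guest.
    set -e
    GUEST=root@vmguest                 # hypothetical guest address
    SRC=/srv/vm-images                 # subvolume holding the image
    SNAP=/srv/snapshots/vm-images-$(date +%Y%m%d-%H%M%S)
    ssh "$GUEST" fsfreeze -f /data     # block writes inside the guest
    trap 'ssh "$GUEST" fsfreeze -u /data' EXIT   # always thaw, even on error
    btrfs subvolume snapshot -r "$SRC" "$SNAP"   # readonly host-side snapshot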
Taking snapshots of database files or vm images without proper
preparation only guarantees you crash-like rollback situations. Taking
snapshots at short intervals only makes this worse, with all the extra
downsides such snapshots have within btrfs. I think this is important to
understand for people planning automated snapshots of such file data.
Making a file nocow only helps during normal operation - after a
snapshot, a nocow file is essentially cow while blocks from the old
generation are carried over to the new subvolume generation during
writes.

-- 
Replies to list only preferred.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xfstests: Btrfs: add test for large metadata blocks
Thanks for the review, Dave! Comments inline.

On 02/07/2014 11:49 PM, Dave Chinner wrote:
> On Fri, Feb 07, 2014 at 06:14:45PM +0100, Koen De Wit wrote:
>> Tests Btrfs filesystems with all possible metadata block sizes, by
>> setting large extended attributes on files.
>>
>> Signed-off-by: Koen De Wit
>
> There's a few things here that need fixing.
>
>> +pagesize=`$here/src/feature -s`
>> +pagesize_kb=`expr $pagesize / 1024`
>> +
>> +# Test all valid leafsizes
>> +for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do
>> +    _scratch_unmount >/dev/null 2>&1
>
> Indentation is tabs, and tabs are 8 spaces in size, please.

OK, I fixed this in v2.

>> +    _scratch_mkfs -l ${leafsize}K >/dev/null
>> +    _scratch_mount
>
> No need to use _scratch_unmount here - you should be doing a
> _check_scratch_fs at the end of the loop.

Fixed in v2 too.

>> +    # Calculate the xattr size, but leave 512 bytes for other metadata.
>> +    xattr_size=`expr $leafsize \* 1024 - 512`
>> +
>> +    touch $SCRATCH_MNT/emptyfile
>> +    # smallfile will be inlined, bigfile not.
>> +    $XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile >/dev/null
>> +    $XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile >/dev/null
>> +    ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink
>> +
>> +    files=(emptyfile smallfile bigfile bigfile_softlink)
>> +    chars=(a b c d)
>> +    for i in `seq 0 1 3`; do
>> +        char=${chars[$i]}
>> +        file=$SCRATCH_MNT/${files[$i]}
>> +        lnkfile=${file}_hardlink
>> +        ln $file $lnkfile
>> +        xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' $char`
>> +
>> +        set_md5=`echo -n "$xattr_value" | md5sum`
>
> Just dump the md5sum to the output file.
>
>> +        ${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
>> +        get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum`
>> +        get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum`
>
> And dump these to the output file, too. Then the golden image matching
> when the test finishes will tell you if it passed or not. i.e:
>
>	echo -n "$xattr_value" | md5sum
>	${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
>	${ATTR_PROG} -Lq -g attr_$char $file | md5sum
>	${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum
>
> is all that needs to be done here.

The problem with this is that the length of the output will depend on the
page size. The code above runs for every valid leafsize, which can be any
multiple of the page size up to 64KB, as defined in the loop
initialization:

    for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do

>> +    # Test attributes with a size larger than the leafsize.
>> +    # Should result in an error.
>> +    if [ "$leafsize" -lt "64" ]; then
>> +        # Bash command lines cannot be larger than 64K characters, so we
>> +        # do not test attribute values with a size >64KB.
>> +        xattr_size=`expr $leafsize \* 1024 + 512`
>> +        xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x`
>> +        ${ATTR_PROG} -q -s attr_toobig -V $xattr_value \
>> +            $SCRATCH_MNT/emptyfile >> $seqres.full 2>&1
>> +        if [ "$?" -eq "0" ]; then
>> +            echo "Expected error, xattr_size is bigger than ${leafsize}K"
>> +        fi
>
> What you are doing is redirecting the error to $seqres.full so that it
> doesn't end up in the output file, then detecting the absence of an
> error and dumping a message to the output file to make the test fail.
> IOWs, the ATTR_PROG failure message should be in the golden output file
> and you don't have to do anything else to detect a pass/fail condition.

Same here: the bigger the page size, the less this code will be executed.
If the page size is 64KB, this code isn't executed at all. To make sure
the golden output does not depend on the page size, I chose to suppress
all output as long as the test is successful. Is there a better way to
accomplish this?
>> +_scratch_unmount
>> +
>> +# Some illegal leafsizes
>> +
>> +_scratch_mkfs -l 0 2>> $seqres.full
>> +echo $?
>
> Same again - you are dumping the error output into a different file,
> then detecting the error manually. Pass the output of _scratch_mkfs
> through a filter, and let errors cause golden output mismatches.

I did this to make the golden output not depend on the output of
mkfs.btrfs, inspired by
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=commit;h=fd7a8e885732475c17488e28b569ac1530c8eb59
and
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=commit;h=78d86b996c9c431542fdbac11fa08764b16ceb7d

However, in my opinion the test should simply be updated if the output of
mkfs.btrfs changes, so I agree with you and I fixed this in v2.

Thanks,
Koen.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
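For reference, the filter idiom Dave describes usually looks something
like the sketch below (the error string matched here is an assumption,
not the real mkfs.btrfs message):

    # Collapse whatever mkfs prints on failure into one stable token, so
    # the golden output does not depend on the mkfs.btrfs version.
    _filter_mkfs_error()
    {
            sed -e 's/^.*leafsize.*$/mkfs rejected illegal leafsize/'
    }
    _scratch_mkfs -l 0 2>&1 | _filter_mkfs_error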
[PATCH v2] xfstests: Btrfs: add test for large metadata blocks
Tests Btrfs filesystems with all possible metadata block sizes, by
setting large extended attributes on files.

Signed-off-by: Koen De Wit
---
v1->v2:
- Fix indentation: 8 spaces instead of 4
- Move _scratch_unmount to end of loop, add _check_scratch_fs
- Sending failure messages of mkfs.btrfs to output instead of $seqres.full

diff --git a/tests/btrfs/036 b/tests/btrfs/036
new file mode 100644
index 000..b14697d
--- /dev/null
+++ b/tests/btrfs/036
@@ -0,0 +1,137 @@
+#! /bin/bash
+# FS QA Test No. 036
+#
+# Tests large metadata blocks in btrfs, which allows large extended
+# attributes.
+#
+#---
+# Copyright (c) 2014, Oracle and/or its affiliates. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1	# failure is the default!
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+rm -f $seqres.full
+
+pagesize=`$here/src/feature -s`
+pagesize_kb=`expr $pagesize / 1024`
+
+# Test all valid leafsizes
+for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do
+        _scratch_mkfs -l ${leafsize}K >/dev/null
+        _scratch_mount
+        # Calculate the size of the extended attribute value, leaving
+        # 512 bytes for other metadata.
+        xattr_size=`expr $leafsize \* 1024 - 512`
+
+        touch $SCRATCH_MNT/emptyfile
+        # smallfile will be inlined, bigfile not.
+        $XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile \
+                >/dev/null
+        $XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile \
+                >/dev/null
+        ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink
+
+        files=(emptyfile smallfile bigfile bigfile_softlink)
+        chars=(a b c d)
+        for i in `seq 0 1 3`; do
+                char=${chars[$i]}
+                file=$SCRATCH_MNT/${files[$i]}
+                lnkfile=${file}_hardlink
+                ln $file $lnkfile
+                xattr_value=`head -c $xattr_size < /dev/zero \
+                        | tr '\0' $char`
+
+                set_md5=`echo -n "$xattr_value" | md5sum`
+                ${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
+                get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum`
+                get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile \
+                        | md5sum`
+
+                # Using md5sums for comparison instead of the values
+                # themselves because bash command lines cannot be larger
+                # than 64K chars.
+                if [ "$set_md5" != "$get_md5" ]; then
+                        echo -n "Got unexpected xattr value for "
+                        echo -n "attr_$char from file ${file}. "
+                        echo "(leafsize is ${leafsize}K)"
+                fi
+                if [ "$set_md5" != "$get_ln_md5" ]; then
+                        echo -n "Value for attr_$char differs for "
+                        echo -n "$file and ${lnkfile}. "
+                        echo "(leafsize is ${leafsize}K)"
+                fi
+        done
+
+        # Test attributes with a size larger than the leafsize.
+        # Should result in an error.
+        if [ "$leafsize" -lt "64" ]; then
+                # Bash command lines cannot be larger than 64K
+                # characters, so we do not test attribute values
+                # with a size >64KB.
+                xattr_size=`expr $leafsize \* 1024 + 512`
+                xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x`
+                ${ATTR_PROG} -q -s attr_toobig -V $xattr_value \
+                        $SCRATCH_MNT/emptyfile >> $seqres.full 2>&1
+                if [ "$?" -eq "0" ]; then
+                        echo -n "Expected error, xattr_size is bigger "
+                        echo "than ${leafsize}K"
+                fi
+        fi
+
+        _scratch_unmount >/dev/null 2>&1
+        _check_scratch_fs
+done
+
+_scratch_mount
+
+# Illegal attribute name (more than 256 characters)
+attr_name=`head -c 260 < /dev/zero | tr '\0' n`
+${ATTR_PRO