[PATCH v2] Btrfs: fix a spinlock warning when cleaning up aborted transaction
Given now we have 2 spinlock for management of delayed refs, CONFIG_DEBUG_SPINLOCK=y helped me find this, [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258 [ 4723.414882] lock: 0x880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2 [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: GW O 3.12.0+ #4 [ 4723.421321] Call Trace: [ 4723.421872] [] dump_stack+0x54/0x74 [ 4723.422753] [] spin_dump+0x8c/0x91 [ 4723.424979] [] spin_bug+0x21/0x26 [ 4723.425846] [] do_raw_spin_unlock+0x66/0x90 [ 4723.434424] [] _raw_spin_unlock+0x27/0x40 [ 4723.438747] [] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs] [ 4723.443321] [] btrfs_cleanup_transaction+0x104/0x570 [btrfs] [ 4723.444692] [] ? trace_hardirqs_on_caller+0xfd/0x1c0 [ 4723.450336] [] ? trace_hardirqs_on+0xd/0x10 [ 4723.451332] [] transaction_kthread+0x22e/0x270 [btrfs] [ 4723.452543] [] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs] [ 4723.457833] [] kthread+0xea/0xf0 [ 4723.458990] [] ? kthread_create_on_node+0x140/0x140 [ 4723.460133] [] ret_from_fork+0x7c/0xb0 [ 4723.460865] [] ? kthread_create_on_node+0x140/0x140 [ 4723.496521] [ cut here ] -- The reason is that we get to call cond_resched_lock(&head_ref->lock) while still holding @delayed_refs->lock. So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire dance before and after actually processing delayed refs. Here we don't drop the lock, others are not able to add new delayed refs to head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here. Signed-off-by: Liu Bo --- v2: Fix typo in the title, s/lockdep/spinlock/g fs/btrfs/disk-io.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index de6a48f..735738b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3853,7 +3853,6 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, rb_erase(&ref->rb_node, &head->ref_root); atomic_dec(&delayed_refs->num_entries); btrfs_put_delayed_ref(ref); - cond_resched_lock(&head->lock); } if (head->must_insert_reserved) pin_bytes = true; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix a lockdep warning when cleaning up aborted transaction
Given now we have 2 spinlock for management of delayed refs, CONFIG_DEBUG_SPINLOCK=y helped me find this, [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258 [ 4723.414882] lock: 0x880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2 [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: GW O 3.12.0+ #4 [ 4723.421321] Call Trace: [ 4723.421872] [] dump_stack+0x54/0x74 [ 4723.422753] [] spin_dump+0x8c/0x91 [ 4723.424979] [] spin_bug+0x21/0x26 [ 4723.425846] [] do_raw_spin_unlock+0x66/0x90 [ 4723.434424] [] _raw_spin_unlock+0x27/0x40 [ 4723.438747] [] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs] [ 4723.443321] [] btrfs_cleanup_transaction+0x104/0x570 [btrfs] [ 4723.444692] [] ? trace_hardirqs_on_caller+0xfd/0x1c0 [ 4723.450336] [] ? trace_hardirqs_on+0xd/0x10 [ 4723.451332] [] transaction_kthread+0x22e/0x270 [btrfs] [ 4723.452543] [] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs] [ 4723.457833] [] kthread+0xea/0xf0 [ 4723.458990] [] ? kthread_create_on_node+0x140/0x140 [ 4723.460133] [] ret_from_fork+0x7c/0xb0 [ 4723.460865] [] ? kthread_create_on_node+0x140/0x140 [ 4723.496521] [ cut here ] -- The reason is that we get to call cond_resched_lock(&head_ref->lock) while still holding @delayed_refs->lock. So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire dance before and after actually processing delayed refs. Here we don't drop the lock, others are not able to add new delayed refs to head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here. Signed-off-by: Liu Bo --- fs/btrfs/disk-io.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index de6a48f..735738b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3853,7 +3853,6 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, rb_erase(&ref->rb_node, &head->ref_root); atomic_dec(&delayed_refs->num_entries); btrfs_put_delayed_ref(ref); - cond_resched_lock(&head->lock); } if (head->must_insert_reserved) pin_bytes = true; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are nocow files snapshot-aware
Kai Krakow posted on Fri, 07 Feb 2014 23:26:34 +0100 as excerpted: > So the question is: Do btrfs snapshots give the same guarantees on the > filesystem level that write-barriers give on the storage level which > exactly those processes rely upon? The cleanest solution would be if > processes could give btrfs hints about what belongs to their > transactions so in the moment of a snapshot the data file would be in > clean state. I guess snapshots are atomic in that way, that pending > writes will never reach the snapshots just taken, which is good. Keep in mind that btrfs' metadata is COW-based also. Like reiser4 in this way, in theory at least, commits are atomic -- they've ether made it to disk or they haven't, there's no half there. Commits at the leaf level propagate up the tree, and are not finalized until the top-level root node is written. AFAIK if there's dirty data to write, btrfs triggers a root node commit every 30 seconds. Until that root is rewritten, it points to the last consistent-state written root node. Once it's rewritten, it points to the new one and a new set of writes are started, only to be finalized at the next root node write. And I believe that final write simply updates a pointer to point at the latest root node. There's also a history of root nodes, which is what the btrfs-find-root tool uses in combination with btrfs restore, if necessary, to find a valid root from the root node pointer log if the system crashed in the middle of that final update so the pointer ends up pointing at garbage. Meanwhile, I'm a bit blurry on this but if I understand things correctly, between root node writes/full-filesystem-commits there's a log of transaction completions at the atomic individual transaction level, such that even transactions completed between root node writes can normally be replayed. Of course this is only ~30 seconds worth of activity max, since the root node writes should occur every 30 seconds, but this is what btrfs-zero-log zeroes out, if/when needed. You'll lose that few seconds of log replay since the last root node write, but if it was garbage data due to it being written when the system actually went down, dropping those few extra seconds of log can allow the filesystem to mount properly from the last full root node commit, where it couldn't, otherwise. It's actually those metadata trees and the atomic root-node commit feature that btrfs snapshots depend on, and why they're normally so fast to create. When a snapshot is taken, btrfs simply keeps a record of the current root node instead of letting it recede into history and fall off the end of the root node log, labeling that record with the name of the snapshot for humans as well as the object-ID that btrfs uses. That root node is by definition a record of the filesystem in a consistent state, so any snapshot that's a reference to it is similarly by definition in a consistent state. So normally, files in the process of being written out (created) simply wouldn't appear in the snapshot. Of course preexisting files will appear (and fallocated files are simply the blanked-out-special-case of preexisting), but again, with normal COW-based files at least, will exist in a state either before the latest transaction started, or after it finished, which of course is where fsync comes in, since that's how userspace apps communicate file transactions to the filesystem. And of course in addition to COW, btrfs normally does checksumming as well, and again, the filesystem including that checksumming will be self- consistent when a root-node is written, or it won't be written until the filesystem /is/ self-consistent. If for whatever reason there's garbage when btrfs attempts to read the data back, which is exactly what btrfs defines it as if it doesn't pass checksum, btrfs will refuse to use that data. If there's a second copy somewhere (as with raid1 mode), it'll try to restore from that second copy. If it can't, btrfs will return an error and simply won't let you access that file. So one way or another, a snapshot is deterministic and atomic. No partial transactions, at least on ordinary COW and checksummed files. Which brings us to NOCOW files, where for btrfs NOCOW also turns off checksumming. Btrfs will write these files in-place, and as a result there's not the transaction integrity guarantee on these files that there is on ordinary files. *HOWEVER*, the situation isn't as bad as it might seem, because most files where NOCOW is recommended, database files, VM images, pre- allocated torrent files, etc, are created and managed by applications that already have their own data integrity management/verification/repair methods, since they're designed to work on filesystems without the data integrity guarantees btrfs normally provides. In fact, it's possible, even likely in case of a crash, that the application's own data integrity mechanisms can fight with
Re: [PATCH] Btrfs: convert to add transaction protection for btrfs send
Hi Josef, [..SNIP..] >> Yeah, very reasonable, if you don't mind, i would give a patch for this >> issue. > Go for it, you'll be faster than I will be, all I do is run xfstests and > try to reproduce things that will never reproduce for me. So Finally, i've figured everything out. Previous send is almostly safe is because we deal every item by transaction protection, and we will search next item from root again next time (see btrfs_search_slot_for_read()) And we will check root's @ctransid to make sure root did not change during send. (For readonly root, usually it should not except snapshot-aware defrag, ok we should fix it!!) Since we kicked off transaction from send, life will become a little complex,we should take care of everything can change readonly snapshot, from i can say now: 1. snapshot @send_root will cow everything 2. snapshot-aware defrag also work for readonly root. 3. balance, device resize, add,remove, scrub(repair mode) So there are two ideas in my mind: #Approach 1 convert to previous ways that means add transaction protection for send. and Filipe's optimizations for send search should be reverted, this way is simple and make codes less complex, but we will involve transaction protection. #Approach 2. Add a local flag to root structure to sync snapshot with send. Add a global flag to sync balance's etc options with send don't defray readonly snapshot for snapshot-aware defrag. Approach 2 will make it safe to avoid transaction from send, but it add complexity to codes.Personally, i do not like approach 2 very much because it make btrfs codes more *ugly* and make it *harder* to maintain. Anyway, I am ok to implement both ways, Tell me if you have any better ideas or which approach is your appreciated way to fix the issue^_^ Thanks, Wang > > Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can you keep reflink relationship during a copy/backup to another filesystem?
On Wed, Jan 29, 2014 at 08:05:14AM +, Hugo Mills wrote: > On Tue, Jan 28, 2014 at 11:50:25PM -0800, Marc MERLIN wrote: > > So I used to use hardlinks to do historical backups of the same filesystem > > but I know it's preferable to use refllink with btrfs to avoid having too > > many hardlinks. > >If you use btrfstune to set the "extended inode refs" option on the > device, then there's no hardlink limit (and the limit was only on the I'm not too clear about that part: https://btrfs.wiki.kernel.org/index.php/Btrfstune doesn't seem to match what you're saying. > number of hardlinks to the same thing *in the same directory*). I didn't catch the 'same directory' part. I used hardlinks.py to compare all my files from many backups across filesystems and hardlink files together if they were the same. You're saying that I could have 500 hardlinks to the same file as long as they are in 500 different directories, and that this is not a performance problem or limitation for btrfs (maybe anymore? I could swear I had issues with this in the past). If the limit is now hardlinking many times to the same file in the same directory, I don't actually have a need for that use case. > > But if I need to backup this filesystem to another one some other way than > > btrfs send/receive (let's say cp -a, tar, or rsync), is it correct to say > > that reflink relationships will be lost and my data will take more space? > > > > That is unless the target filesytem can do deduplication like btrs now can? > > (I haven't tried it yet, but it's about time that I do) > >Correct. Thanks. I just noticed that on the fly deduplication wasn't in 3.12 yet, so I'm better off keeping my hardlinks for backups since it's easy to keep that relationship when copying my backup tree to a backup device (backups of backups, of course, y'all don't do that? :) ). Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ signature.asc Description: Digital signature
Re: [PATCH] xfstests: Btrfs: add test for large metadata blocks
On Fri, Feb 07, 2014 at 06:14:45PM +0100, Koen De Wit wrote: > Tests Btrfs filesystems with all possible metadata block sizes, by > setting large extended attributes on files. > > Signed-off-by: Koen De Wit There's a few things here that need fixing. > +pagesize=`$here/src/feature -s` > +pagesize_kb=`expr $pagesize / 1024` > + > +# Test all valid leafsizes > +for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do > +_scratch_unmount >/dev/null 2>&1 Indentation are tabs, and tabs are 8 spaces in size, please. > +_scratch_mkfs -l ${leafsize}K >/dev/null > +_scratch_mount No need to use _scratch_unmount here - you should be doing a _check_scratch_fs at the end of the loop. > +# Calculate the xattr size, but leave 512 bytes for other metadata. > +xattr_size=`expr $leafsize \* 1024 - 512` > + > +touch $SCRATCH_MNT/emptyfile > +# smallfile will be inlined, bigfile not. > +$XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile >/dev/null > +$XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile >/dev/null > +ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink > + > +files=(emptyfile smallfile bigfile bigfile_softlink) > +chars=(a b c d) > +for i in `seq 0 1 3`; do > +char=${chars[$i]} > +file=$SCRATCH_MNT/${files[$i]} > +lnkfile=${file}_hardlink > +ln $file $lnkfile > +xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' $char` > + > +set_md5=`echo -n "$xattr_value" | md5sum` Just dump the md5sum to the output file. > +${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file > +get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum` > +get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum` And dump these to the output file, too. Then the golden image matching when the test is finish will tell you if it passed or not. i.e: echo -n "$xattr_value" | md5sum ${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file ${ATTR_PROG} -Lq -g attr_$char $file | md5sum ${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum is all that neds to be done here. > +# Test attributes with a size larger than the leafsize. > +# Should result in an error. > +if [ "$leafsize" -lt "64" ]; then > +# Bash command lines cannot be larger than 64K characters, so we > +# do not test attribute values with a size >64KB. > +xattr_size=`expr $leafsize \* 1024 + 512` > +xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x` > +${ATTR_PROG} -q -s attr_toobig -V $xattr_value \ > +$SCRATCH_MNT/emptyfile >> $seqres.full 2>&1 > +if [ "$?" -eq "0" ]; then > +echo "Expected error, xattr_size is bigger than ${leafsize}K" > +fi What you are doing is redirecting the error to $seqres.full so that it doesn't end up in the output file, then detecting the absence of an error and dumping a message to the output file to make the test fail. IOWs, the ATTR_PROG failure message should be in the golden output file and you don't have to do anything else to detect a pass/fail condition. > +_scratch_unmount > + > +# Some illegal leafsizes > + > +_scratch_mkfs -l 0 2>> $seqres.full > +echo $? Same again - you are dumping the error output into a different file, then detecting the error manually. pass the output of _scratch_mkfs through a filter, and let errors cause golden output mismatches. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are nocow files snapshot-aware
Chris Murphy schrieb: > > On Feb 7, 2014, at 2:07 PM, Kai Krakow wrote: > >> Chris Murphy schrieb: >> If the database/virtual machine/whatever is crash safe, then the atomic state that a snapshot grabs will be useful. >>> >>> How fast is this state fixed on disk from the time of the snapshot >>> command? Loosely speaking. I'm curious if this is < 1 second; a few >>> seconds; or possibly up to the 30 second default commit interval? And >>> also if it's even related to the commit interval time at all? >> >> Such constructs can only be crash-safe if write-barriers are passed down >> through the cow logic of btrfs to the storage layer. That won't probably >> ever happen. Atomic and transactional updates cannot happen without >> write- barriers or synchronous writes. To make it work, you need to >> design the storage-layers from the ground up to work without >> write-barriers, like having battery-backed write-caches, synchronous >> logical file-system layers etc. Otherwise, database/vm/whatever >> transactional/atomic writes are just having undefined status down at the >> lowest storage layer. > > This explanation makes sense. But I failed to qualify the "state fixed on > disk". I'm not concerned about when bits actually arrive on disk. I'm > wondering what state they describe. So assume no crash or power failure, > and assume writes eventually make it onto the media without a problem. > What I'm wondering is, what state of the subvolume I'm snapshotting do I > end up with? Is there a delay and how long is it, or is it pretty much > instant? The command completes really quickly even when the file system is > actively being used, so the feedback is that the snapshot state is > established very fast but I'm not sure what bearing that has in reality. I think from that perspective it is more or less the same taking a snapshot or cycling the power. For the state of the file consistency it means the same, I suppose. I got your argument about "state fixed on disk", but I implied from perspective of the writing process it is just the same situation: in the moment of the snapshot the data file is in a crashed state. That is like cycling the power without having a mechanism to support transactional guarantees. So the question is: Do btrfs snapshots give the same guarantees on the filesystem level that write-barriers give on the storage level which exactly those processes rely upon? The cleanest solution would be if processes could give btrfs hints about what belongs to their transactions so in the moment of a snapshot the data file would be in clean state. I guess snapshots are atomic in that way, that pending writes will never reach the snapshots just taken, which is good. But what about the ordering of writes? Maybe some younger write requests already made it to the disk, while older ones didn't. The file system usually only has to care about its own transactional integrity, not those of its writing processes, and that is completely unrelated to what the writing process expects. Or in other words: A following crash only guarantees that the active subvolume being written to is clean from the transactional perspective of the process, but the snapshot may be broken. As far as I know, user processes cannot tell the filesystem when to issue write- barriers, it could only issue fsyncs (which hurts performance). Otherwise this discussion would be a whole different story. Did you test how btrfs snapshots perform while running fsync with a lot of data to be committed? Could give a clue... -- Replies to list only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
btrfs snapshot is killing IO and hanging my device with delayed writes for 10mn+
On my workstation, which unfortunately I can't easily upgrade the kernel on, so it's running 3.8.0 for now, I've had pretty perpelexing hangs from btrfs snapshot when I make a new snapshot every hour. I see the btrfs snapshot command in iotop for over a minute, and it completes eventually. Other times, it just runs with virtually no IO. When it takes a while, I see this next in iotop for several minutes: Total DISK READ: 0.00 B/s | Total DISK WRITE: 2.12 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO>COMMAND 1620 be/4 root0.00 B/s0.00 B/s 0.00 % 99.99 % [btrfs-transacti] 658 be/3 root0.00 B/s 28.46 K/s 0.00 % 62.35 % [jbd2/dm-0-8] The device is indeed on top of an LVM, which likely isn't helping performance (I see only 2.12 M/s of writes in the iotop above): sda 8:00 931.5G 0 disk ├─sda1 8:10 243M 0 part /boot ├─sda2 8:20 1K 0 part └─sda5 8:50 931.3G 0 part ├─moremagic-root (dm-0) 252:00 37.3G 0 lvm / ├─moremagic-usr+local (dm-1) 252:10 878G 0 lvm Today, it took about 10mn for IO to go back to normal (stop blocking) to that partition during my noon snapshot. The later ones didn't hang. Why would a snapshot cause hangs and so much IO? hdparm shows that basic IO seems ok: root@polgara:/tmp# hdparm -t /dev/mapper/moremagic-usr+local /dev/mapper/moremagic-usr+local: Timing buffered disk reads: 566 MB in 3.00 seconds = 188.49 MB/sec root@polgara:/tmp# btrfs fi show Label: 'btrfs_pool1' uuid: 64eafeba-603e-483f-bff9-395b9907b415 Total devices 1 FS bytes used 443.76GB devid1 size 877.98GB used 541.54GB path /dev/dm-1 root@polgara:/tmp# btrfs device stats /mnt/btrfs_pool1 [/dev/mapper/moremagic-usr+local].write_io_errs 0 [/dev/mapper/moremagic-usr+local].read_io_errs0 [/dev/mapper/moremagic-usr+local].flush_io_errs 0 [/dev/mapper/moremagic-usr+local].corruption_errs 0 [/dev/mapper/moremagic-usr+local].generation_errs 0 Is there other data I can provide? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are nocow files snapshot-aware
Duncan <1i5t5.dun...@cox.net> schrieb: >> The question here is: Does it really make sense to create such snapshots >> of disk images currently online and running a system. They will probably >> be broken anyway after rollback - or at least I'd not fully trust the >> contents. >> >> VM images should not be part of a subvolume of which snapshots are taken >> at a regular and short interval. The problem will go away if you follow >> this rule. >> >> The same applies to probably any kind of file which you make nocow - >> e.g. database files. The only use case is taking _controlled_ snapshots >> - and doing it all 30 seconds is by all means NOT controlled, it's >> completely undeterministic. > > I'd absolutely agree -- and that wasn't my report, I'm just recalling it, > as at the time I didn't understand the interaction between NOCOW and > snapshots and couldn't quite understand how a NOCOW file was still > triggering the snapshot-aware-defrag pathology, which in fact we were > just beginning to realize based on such reports. Sorry, didn't mean to push it to you. ;-) I just wanted to give some pointers to rethink such practices for people stumpling upon this. > But some of the snapshotting scripts out there, and the admins running > them, seem to have the idea that just because it's possible it must be > done, and they have snapshots taken every minute or more frequently, with > no automated snapshot thinning at all. IMO that's pathology run amok > even if btrfs /was/ stable and mature and /could/ handle it properly. Yeah, people should stop such "bullshit practice" (sorry), no matter if there's a technical problem with it. It does not give the protection they intended to give. It's just wrong sense for security/safety... There _may_ be actual use cases for doing it - but generally I'd suggest it's plain wrong. > That's regardless of the content so it's from a different angle than you > were attacking the problem from... But if admins aren't able to > recognize the problem with per-minute snapshots without any thinning at > all for days, weeks, months on end, I doubt they'll be any better at > recognizing that VMs, databases, etc, should have a dedicated subvolume. True. > But be that as it may, since such extreme snapshotting /is/ possible, and > with automation and downloadable snapper scripts somebody WILL be doing > it, btrfs should scale to it if it is to be considered mature and > stable. People don't want a filesystem that's going to fall over on them > and lose data or simply become unworkably live-locked just because they > didn't know what they were doing when they setup the snapper script and > set it to 1 minute snaps without any corresponding thinning after an hour > or a day or whatever. Such, uhm, sorry, "bullshit practice" should not be a high priority on the fix-list for btrfs. There are other areas. It's a technical problem, yes, but I think there are more important ones than brute-forcing problems out of btrfs that are never being hit by normal usage patterns. It is good that such "tests" are done, but I would not understand how people can expect they need such a "feature" - now and at once. Such tests are not ready to leave the development sandbox yet. >From a normal use perspective, doing such heavy snapshotting is probably almost always nonsense. I'd be more interested in how btrfs behaves in highly io loaded server patterns. One interesting use case for me would be to use btrfs as the building block of a system with container virtualization (docker, lxc), making a high vm density on the machine (with the io load and unpredictable io bahavior that internet-facing servers apply to their storage layer), using btrfs snapshots to instantly create new vms from vm templates living in subvolumes (thin provisioning), spreading btrfs across a higher number of disks as the average desktop user / standard server has. I think this is one of many very interesting use cases for btrfs and its capabilities. And this is how we get back to my initial question: In such a scenario I'd like to take ro snapshots of all machines (which probably host nocow files for databases), send these to a backup server at low io-priority, then remove the snapshots. Apparently, btrfs send/receive is still far from being stable and bullet-proof from what I read here, so the destination would probably be another btrfs or zfs, using inplace-rsync backups and snapshotting for backlog. -- Replies to list only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are nocow files snapshot-aware
On Feb 7, 2014, at 2:07 PM, Kai Krakow wrote: > Chris Murphy schrieb: > >>> If the database/virtual machine/whatever is crash safe, then the >>> atomic state that a snapshot grabs will be useful. >> >> How fast is this state fixed on disk from the time of the snapshot >> command? Loosely speaking. I'm curious if this is < 1 second; a few >> seconds; or possibly up to the 30 second default commit interval? And also >> if it's even related to the commit interval time at all? > > Such constructs can only be crash-safe if write-barriers are passed down > through the cow logic of btrfs to the storage layer. That won't probably > ever happen. Atomic and transactional updates cannot happen without write- > barriers or synchronous writes. To make it work, you need to design the > storage-layers from the ground up to work without write-barriers, like > having battery-backed write-caches, synchronous logical file-system layers > etc. Otherwise, database/vm/whatever transactional/atomic writes are just > having undefined status down at the lowest storage layer. This explanation makes sense. But I failed to qualify the "state fixed on disk". I'm not concerned about when bits actually arrive on disk. I'm wondering what state they describe. So assume no crash or power failure, and assume writes eventually make it onto the media without a problem. What I'm wondering is, what state of the subvolume I'm snapshotting do I end up with? Is there a delay and how long is it, or is it pretty much instant? The command completes really quickly even when the file system is actively being used, so the feedback is that the snapshot state is established very fast but I'm not sure what bearing that has in reality. Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
On Fri, 7 Feb 2014, Fengguang Wu wrote: > On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote: > > On Fri, 7 Feb 2014, Fengguang Wu wrote: > > > > > [1.625020] BTRFS: selftest: Running btrfs_split_item tests > > > [1.627004] BTRFS: selftest: Running find delalloc tests > > > [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz > > > [ 292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, > > > oom_score_adj=0 > > > [ 292.086439] kthreadd cpuset= > > > [ 292.087072] BUG: unable to handle kernel NULL pointer dereference at > > > 0038 > > > [ 292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c > > > > This looks like a problem with the cpuset cgroup name, are you sure this > > isn't related to the removal of cgroup->name? > > It looks not related to patch "cgroup: remove cgroup->name", because > that patch lies in the cgroup tree and not contained in output of "git log > BAD_COMMIT". > It's dying on pr_cont_kernfs_name which is some tree that has "kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends", which is not in linux-next, and is obviously printing the cpuset cgroup name. It doesn't look like it has anything at all to do with btrfs or why they would care about this failure. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are nocow files snapshot-aware
Chris Murphy schrieb: >> If the database/virtual machine/whatever is crash safe, then the >> atomic state that a snapshot grabs will be useful. > > How fast is this state fixed on disk from the time of the snapshot > command? Loosely speaking. I'm curious if this is < 1 second; a few > seconds; or possibly up to the 30 second default commit interval? And also > if it's even related to the commit interval time at all? Such constructs can only be crash-safe if write-barriers are passed down through the cow logic of btrfs to the storage layer. That won't probably ever happen. Atomic and transactional updates cannot happen without write- barriers or synchronous writes. To make it work, you need to design the storage-layers from the ground up to work without write-barriers, like having battery-backed write-caches, synchronous logical file-system layers etc. Otherwise, database/vm/whatever transactional/atomic writes are just having undefined status down at the lowest storage layer. > I'm also curious what happens to files that are presently writing. e.g. > I'm writing a 1GB file to subvol A and before it completes I snapshot > subvol A into A.1. If I go find the file I was writing to, in A.1, what's > its state? Truncated? Or or are in-progress writes permitted to complete > if it's a rw snapshot? Any difference in behavior if it's an ro snapshot? I wondered that many times, too. What happens to files being written to? I suppose, at the time of snapshotting it's taking the current state of the blocks as they are, ignoring pending writes. This means, the file being written to is probably in limbo state. For example, xfs has an option to freeze the file system to take atomic snapshots. You can use that feature to take consistent snapshots of MySQL InnoDB files to create a hot-copy backup of it. But: You need to instruct MySQL first to complete its transactions and pausing before running xfs_freeze, then after that's done, you can resume MySQL operations. That clearly tells me that it is probably not safe to take snapshots of online databases, even if they are crash-safe (and by what I know, InnoDB is designed to be crash-safe). A solution, probably far-future, could be that a btrfs snapshot would inform all current file-writers to complete transactions and atomic operations and wait until each one signals a ready state, then take the snapshot, then signal the processes to resume operations. For this, the btrfs driver could offer some sort of subscription, similar to what inotify offers. Processes subscribe to some sort of notification broadcasts, btrfs can wait for every process to report an integral file state. If I remember right, reiser4 offered some similar feature (approaching the problem from the opposite side): processes were offered an interface to start and commit transactions within reiser4. If btrfs had such information from file-writers, it could take consistent snapshots of online databases/vms/whatever (given, that in the vm case the guest could pass this information to the host). Whatever approach is taken, however, it will make the time needed to create snapshots undeterministic, processes may not finish their transactions within a reasonable time... -- Replies to list only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Provide a better free space estimate on RAID1
Josef Bacik schrieb: > > On 02/05/2014 03:15 PM, Roman Mamedov wrote: >> Hello, >> >> On a freshly-created RAID1 filesystem of two 1TB disks: >> >> # df -h /mnt/p2/ >> Filesystem Size Used Avail Use% Mounted on >> /dev/sda2 1.8T 1.1M 1.8T 1% /mnt/p2 >> >> I cannot write 2TB of user data to that RAID1, so this estimate is >> clearly misleading. I got tired of looking at the bogus disk free space >> on all my RAID1 btrfs systems, so today I decided to do something about >> this: >> >> --- fs/btrfs/super.c.orig2014-02-06 01:28:36.636164982 +0600 >> +++ fs/btrfs/super.c 2014-02-06 01:28:58.304164370 +0600 >> @@ -1481,6 +1481,11 @@ >> } >> >> kfree(devices_info); >> + >> +if (type & BTRFS_BLOCK_GROUP_RAID1) { >> +do_div(avail_space, min_stripes); >> +} >> + >> *free_bytes = avail_space; >> return 0; >> } > > This needs to be more flexible, and also this causes the problem where > now you show the actual usable amount of space _but_ you are also > showing twice the amount of used space. I'm ok with going in this > direction, but we need to convert everybody over so it works for raid10 > as well and the used values need to be adjusted. Thanks, It should show the raw space available. Btrfs also supports compression and doesn't try to be smart about how much compressed data would fit in the free space of the drive. If one is using RAID1, it's supposed to fill up with a rate of 2:1. If one is using compression, it's supposed to fill up with a rate of maybe 1:5 for mostly text files. However, btrfs should probably provide its own df utility (like "btrfs fi df") which is smart about disk usage and tries to predict the usable space. But df should stay with actually showing the _free_ space, not _usable_ space (the latter is unpredictable anyway based on the usage patterns applied to the drive, so it can be a rough guess as its best, like looking into the crystal ball). The point here is about "free" vs. "usable" space, the first being a known number, the latter only being a prediction based on current usage. I'd like to stay with "free space" actually showing raw free space. -- Replies to list only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: don't loop forever if we can't run because of the tree mod log
A user reported a 100% cpu hang with my new delayed ref code. Turns out I forgot to increase the count check when we can't run a delayed ref because of the tree mod log. If we can't run any delayed refs during this there is no point in continuing to look, and we need to break out. Thanks, Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 9c9ecc9..32312e0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2385,6 +2385,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, spin_unlock(&delayed_refs->lock); locked_ref = NULL; cond_resched(); + count++; continue; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: Preserve process_one_leaf return value.
The return value in process_one_leaf could be over-written while looping over the items in the leaf. This patch will preserve a non-zero return value to the calling function if a non-zero return value is encountered in the loop. The return value of one (1) is consistent with non-zero values that could be returned while processing the leaf. The only caller of this function (walk_down_tree) would ignore the return value anyway. But this patch will correct the behaviour in case future changes intend to utilize the return value. Signed-off-by: Mitch Harder --- cmds-check.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/cmds-check.c b/cmds-check.c index 2911af0..eef7c6c 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -1219,6 +1219,7 @@ static int process_one_leaf(struct btrfs_root *root, struct extent_buffer *eb, u32 nritems; int i; int ret = 0; + int error = 0; struct cache_tree *inode_cache; struct shared_node *active_node; @@ -1268,8 +1269,10 @@ static int process_one_leaf(struct btrfs_root *root, struct extent_buffer *eb, default: break; }; + if (ret != 0) + error = 1; } - return ret; + return error; } static void reada_walk_down(struct btrfs_root *root, -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Provide a better free space estimate on RAID1
On Feb 6, 2014, at 11:08 PM, Roman Mamedov wrote: > And what > if I am accessing that partition on a server via a network CIFS/NFS share and > don't even *have a way to find out* any of that. That's the strongest argument. And if the user is using Explorer/Finder/Nautilus to copy files to the share, I'm pretty sure all three determine if there's enough free space in advance of starting the copy. So if it thinks there's free space, it will start to copy and then later fail midstream when there's no more space. And then the user's copy task is in a questionable state as to what's been copied, depending on how the file copies are being threaded. And due to Btrfs metadata requirements even when deleting, we actually need an Avail estimate that accounts for "phantom future metadata" as if it's currently in use, otherwise we don't really have the right indication of whether or not files can be copied. Chris Murphy-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: integration-20140206 build failure
On 7 February 2014 17:54, David Sterba wrote: > On Fri, Feb 07, 2014 at 12:38:22AM +, WorMzy Tykashi wrote: >> Building the latest integration snapshot, I get the following failure: >> >> >> ... >> [CC] cmds-filesystem.o >> cmds-filesystem.c: In function 'cmd_show': >> cmds-filesystem.c:740:12: error: invalid storage class for function >> 'cmd_sync' >> static int cmd_sync(int argc, char **argv) >> ^ >> cmds-filesystem.c:770:12: error: invalid storage class for function > > Thanks. A mismerged patch, fixed in next integration. Cheers. I did briefly get the following failure from integration-20140207: [CC] cmds-filesystem.o cmds-filesystem.c: In function 'handle_print': cmds-filesystem.c:434:5: error: too few arguments to function 'print_one_fs' space_info_arg, label, mnt); ^ cmds-filesystem.c:361:12: note: declared here static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info, ^ Makefile:113: recipe for target 'cmds-filesystem.o' failed make: *** [cmds-filesystem.o] Error 1 Though I can't reproduce it now. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: integration-20140206 build failure
On Fri, Feb 07, 2014 at 12:38:22AM +, WorMzy Tykashi wrote: > Building the latest integration snapshot, I get the following failure: > > > ... > [CC] cmds-filesystem.o > cmds-filesystem.c: In function 'cmd_show': > cmds-filesystem.c:740:12: error: invalid storage class for function 'cmd_sync' > static int cmd_sync(int argc, char **argv) > ^ > cmds-filesystem.c:770:12: error: invalid storage class for function Thanks. A mismerged patch, fixed in next integration. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: lost with degraded RAID1
On Feb 7, 2014, at 4:34 AM, Johan Kröckel wrote: > Is there anything else I should do with this setup or may I nuke the > two partitions and reuse them? Well I'm pretty sure once you run 'btrfs check --repair' that you've hit the end of the road. Possibly btrfs restore can still extract some files, it might be worth testing whether that works. Otherwise blow it away. I'd say test with 3.14-rc2 with a new file system and see if you can reproduce the sequence that caused this problem in the first place. If it's reproducible, I think there's a bug here somewhere. Chris Murphy-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: unlock extent and pages on error in cow_file_range
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I made it so we just return an error instead of unlocking all of our various stuff. This is a mistake and causes us to hang when we run into this. This patch fixes this problem. Thanks, Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d44ff9f..dbd996a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -864,7 +864,8 @@ static noinline int cow_file_range(struct inode *inode, if (btrfs_is_free_space_inode(inode)) { WARN_ON_ONCE(1); - return -EINVAL; + ret = -EINVAL; + goto out_unlock; } num_bytes = ALIGN(end - start + 1, blocksize); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] xfstests: Btrfs: add test for large metadata blocks
Tests Btrfs filesystems with all possible metadata block sizes, by setting large extended attributes on files. Signed-off-by: Koen De Wit --- tests/btrfs/036 | 128 +++ tests/btrfs/036.out |7 +++ tests/btrfs/group |1 + 3 files changed, 136 insertions(+), 0 deletions(-) create mode 100644 tests/btrfs/036 create mode 100644 tests/btrfs/036.out diff --git a/tests/btrfs/036 b/tests/btrfs/036 new file mode 100644 index 000..0f8287a --- /dev/null +++ b/tests/btrfs/036 @@ -0,0 +1,128 @@ +#! /bin/bash +# FS QA Test No. 036 +# +# Tests large metadata blocks in btrfs, which allows large extended +# attributes. +# +#--- +# Copyright (c) 2014, Oracle and/or its affiliates. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +status=1 # failure is the default! + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here + +_supported_fs btrfs +_supported_os Linux +_require_scratch +_need_to_be_root + +rm -f $seqres.full + +pagesize=`$here/src/feature -s` +pagesize_kb=`expr $pagesize / 1024` + +# Test all valid leafsizes +for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do +_scratch_unmount >/dev/null 2>&1 +_scratch_mkfs -l ${leafsize}K >/dev/null +_scratch_mount +# Calculate the xattr size, but leave 512 bytes for other metadata. +xattr_size=`expr $leafsize \* 1024 - 512` + +touch $SCRATCH_MNT/emptyfile +# smallfile will be inlined, bigfile not. +$XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile >/dev/null +$XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile >/dev/null +ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink + +files=(emptyfile smallfile bigfile bigfile_softlink) +chars=(a b c d) +for i in `seq 0 1 3`; do +char=${chars[$i]} +file=$SCRATCH_MNT/${files[$i]} +lnkfile=${file}_hardlink +ln $file $lnkfile +xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' $char` + +set_md5=`echo -n "$xattr_value" | md5sum` +${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file +get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum` +get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum` + +# Using md5sums for comparison instead of the values themselves +# because bash command lines cannot be larger than 64K chars. +if [ "$set_md5" != "$get_md5" ]; then +echo "Got unexpected xattr value for attr_$char" +echo "from file $file . (leafsize is ${leafsize}K)" +fi +if [ "$set_md5" != "$get_ln_md5" ]; then +echo "Value for attr_$char differs for $file and" +echo "$lnkfile . (leafsize is ${leafsize}K)" +fi +done + +# Test attributes with a size larger than the leafsize. +# Should result in an error. +if [ "$leafsize" -lt "64" ]; then +# Bash command lines cannot be larger than 64K characters, so we +# do not test attribute values with a size >64KB. +xattr_size=`expr $leafsize \* 1024 + 512` +xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x` +${ATTR_PROG} -q -s attr_toobig -V $xattr_value \ +$SCRATCH_MNT/emptyfile >> $seqres.full 2>&1 +if [ "$?" -eq "0" ]; then +echo "Expected error, xattr_size is bigger than ${leafsize}K" +fi +fi + +done + +# Illegal attribute name (more than 256 characters) +attr_name=`head -c 260 < /dev/zero | tr '\0' n` +${ATTR_PROG} -s $attr_name -V attribute_name_too_big \ +$SCRATCH_MNT/emptyfile 2>&1 | head -n 1 + +_scratch_unmount + +# Some illegal leafsizes + +_scratch_mkfs -l 0 2>> $seqres.full +echo $? + +_scratch_mkfs -l 5678 2>> $seqres.full +echo $? + +_scratch_mkfs -l `expr $pagesize / 2 + $pagesize` 2>> $seqres.full +echo $? + +_scratch_mkfs -l 128K 2>> $seqres.full +echo $? + +_scratch_mkfs -l K + +# success, all done +status=0 +exit diff --git a/tests/btrfs/036.out b/tests/btrfs/036.out new file mode 100644 index 000..1d9bdfb --- /dev/null +++ b/tests/btrfs/036.out @@ -0,0 +1,7 @@ +
[PATCH] btrfs-progs: add dry-run option to restore command
Sometimes it is useful to see what btrfs restore is going to do before provisioning enough external storage to restore onto. Add a dry-run option so we can see what files and paths are found by restore, without actually restoring any data. --- cmds-restore.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/cmds-restore.c b/cmds-restore.c index 1748262..5116af0 100644 --- a/cmds-restore.c +++ b/cmds-restore.c @@ -53,6 +53,7 @@ static int verbose = 0; static int ignore_errors = 0; static int overwrite = 0; static int get_xattrs = 0; +static int dry_run = 0; #define LZO_LEN 4 #define PAGE_CACHE_SIZE 4096 @@ -801,6 +802,8 @@ static int search_dir(struct btrfs_root *root, struct btrfs_key *key, } if (verbose) printf("Restoring %s\n", path_name); + if (dry_run) + goto next; fd = open(path_name, O_CREAT|O_WRONLY, 0644); if (fd < 0) { fprintf(stderr, "Error creating %s: %d\n", @@ -873,7 +876,10 @@ static int search_dir(struct btrfs_root *root, struct btrfs_key *key, printf("Restoring %s\n", path_name); errno = 0; - ret = mkdir(path_name, 0755); + if (dry_run) + ret = 0; + else + ret = mkdir(path_name, 0755); if (ret && errno != EEXIST) { free(dir); fprintf(stderr, "Error mkdiring %s: %d\n", @@ -1108,6 +1114,7 @@ const char * const cmd_restore_usage[] = { "-r root objectid", "-d find dir", "-l list tree roots", + "-D dry run (only list files that would be recovered)", "--path-regex ", "restore only filenames matching regex,", "you have to use following syntax (possibly quoted):", @@ -1135,7 +1142,7 @@ int cmd_restore(int argc, char **argv) regex_t match_reg, *mreg = NULL; char reg_err[256]; - while ((opt = getopt_long(argc, argv, "sxviot:u:df:r:lc", long_options, + while ((opt = getopt_long(argc, argv, "sxviot:u:df:r:lDc", long_options, &option_index)) != -1) { switch (opt) { @@ -1191,6 +1198,9 @@ int cmd_restore(int argc, char **argv) case 'l': list_roots = 1; break; + case 'D': + dry_run = 1; + break; case 'c': match_cflags |= REG_ICASE; break; -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
On Fri, Feb 7, 2014 at 3:10 PM, Chris Mason wrote: > On Fri 07 Feb 2014 07:10:38 AM EST, Fengguang Wu wrote: >> >> On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote: >>> >>> On Fri, 7 Feb 2014, Fengguang Wu wrote: >>> [1.625020] BTRFS: selftest: Running btrfs_split_item tests [1.627004] BTRFS: selftest: Running find delalloc tests [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz [ 292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0 [ 292.086439] kthreadd cpuset= [ 292.087072] BUG: unable to handle kernel NULL pointer dereference at 0038 [ 292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c >>> >>> >>> This looks like a problem with the cpuset cgroup name, are you sure this >>> isn't related to the removal of cgroup->name? >> >> >> It looks not related to patch "cgroup: remove cgroup->name", because >> that patch lies in the cgroup tree and not contained in output of "git log >> BAD_COMMIT". > > > Still not sure exactly what is going on, but I can't trigger it here. My > first guess is that it is related to having btrfs static, some part of our > init is happening at the wrong time, and the self tests are swooping in and > causing trouble. I couldn't reproduce it either so far, neither on a physical machine nor in a vm (qemu+kvm) (with CONFIG_BTRFS_FS=y, CONFIG_CRYPTO_CRC32C=y and CONFIG_CRYPTO_CRC32C_INTEL=y). If you disable CONFIG_BTRFS_FS_RUN_SANITY_TESTS, does it still crash? thanks > > -- Filipe David Manana, "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: Change BUG() to use assert.
Change the definition of BUG() to use assert instead of abort to provide information about the location of the issue. Signed-off-by: Mitch Harder --- kerncompat.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kerncompat.h b/kerncompat.h index 1fc2b34..f370cd8 100644 --- a/kerncompat.h +++ b/kerncompat.h @@ -50,7 +50,7 @@ #define ULONG_MAX (~0UL) #endif -#define BUG() abort() +#define BUG() assert(0) #ifdef __CHECKER__ #define __force__attribute__((force)) #define __bitwise__ __attribute__((bitwise)) -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
On Fri 07 Feb 2014 07:10:38 AM EST, Fengguang Wu wrote: On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote: On Fri, 7 Feb 2014, Fengguang Wu wrote: [1.625020] BTRFS: selftest: Running btrfs_split_item tests [1.627004] BTRFS: selftest: Running find delalloc tests [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz [ 292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0 [ 292.086439] kthreadd cpuset= [ 292.087072] BUG: unable to handle kernel NULL pointer dereference at 0038 [ 292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c This looks like a problem with the cpuset cgroup name, are you sure this isn't related to the removal of cgroup->name? It looks not related to patch "cgroup: remove cgroup->name", because that patch lies in the cgroup tree and not contained in output of "git log BAD_COMMIT". Still not sure exactly what is going on, but I can't trigger it here. My first guess is that it is related to having btrfs static, some part of our init is happening at the wrong time, and the self tests are swooping in and causing trouble. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Provide a better free space estimate on RAID1
On 06/02/14 19:54, Goffredo Baroncelli wrote: Hi Roman On 02/06/2014 01:45 PM, Roman Mamedov wrote: There's not a lot of code to include (as my 3-line patch demonstrates), it could just as easily be removed when it's obsolete. But I did not have any high hopes of defeating the "broken by design" philosophy, that's why I didn't submit it as a real patch for inclusion but rather just as a helpful hint for people to add to their own kernels if they want this change to happen. I agree with you about the needing of a solution. However your patch to me seems even worse than the actual code. For example you cannot take in account the mix of data/linear and metadata/dup (with the pathological case of small files stored in the metadata chunks ), nor different profile level like raid5/6 (or the future raidNxM) And do not forget the compression... Just because the solution that Roman provided is not perfect does not mean that it is no good at all. For common use cases this will give a much better estimate of free space than the current code does, at a trivial cost in code size. It has the benefit of giving a simple estimate without doing any further work or disk activity (no need to walk all chunks). Adding a couple of more lines of code will make it work equally well with other RAID levels, maybe that would be more acceptable? Frank -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4] xfstests/btrfs: add a regression test for running snapshot and send concurrently
From: Wang Shilong Btrfs would fail to send if snapshot run concurrently, this test is to make sure we have fixed the bug. Signed-off-by: Wang Shilong --- v3->v4: avoid to use silent send command(thanks to Dave Chinner) rename patch's subject from Btrfs to xfstest/btrfs(thanks to Anand) make sure background snapshots exists before killing it. v2->v3: make sure we kill backgrund snapshots on test failure. v1-v2: Avoid race codes and a code style update(Thanks to Dave Chinner's comments) --- tests/btrfs/034 | 85 + tests/btrfs/034.out | 2 ++ tests/btrfs/group | 1 + 3 files changed, 88 insertions(+) create mode 100644 tests/btrfs/034 create mode 100644 tests/btrfs/034.out diff --git a/tests/btrfs/034 b/tests/btrfs/034 new file mode 100644 index 000..06c796e --- /dev/null +++ b/tests/btrfs/034 @@ -0,0 +1,85 @@ +#!/bin/bash +# FS QA Test No. btrfs/034 +# +# Regression test for running snapshots and send concurrently. +# +#--- +# Copyright (c) 2014 Fujitsu. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +snapshots_pid=0 + +_cleanup() +{ + # kill backgroud snapshots + if [ $snapshots_pid -ne 0 ] && ps -p $snapshots_pid | grep -q $snapshots_pid; then + kill -TERM $snapshots_pid 2> /dev/null + fi + rm -f $tmp.* +} + +do_snapshots() +{ + i=2 + while [ 1 ] + do + $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT/snap_1 \ + $SCRATCH_MNT/snap_$i >> $seqres.full 2>&1 + let i=$i+1 + sleep 1 + done +} + +trap "_cleanup ; exit \$status" 0 1 2 3 15 + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux +_require_scratch + +_scratch_mkfs > /dev/null 2>&1 +_scratch_mount + +touch $SCRATCH_MNT/foo + +# get file with fragments by using backwards writes. +for i in `seq 10240 -1 1`; do + $XFS_IO_PROG -f -d -c "pwrite $(($i * 4096)) 4096" \ + $SCRATCH_MNT/foo > /dev/null | _filter_xfs_io +done + +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \ + $SCRATCH_MNT/snap_1 >> $seqres.full 2>&1 + +do_snapshots & +snapshots_pid=$! + +$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 -f /dev/null 2>&1 | _filter_scratch + +status=0 ; exit diff --git a/tests/btrfs/034.out b/tests/btrfs/034.out new file mode 100644 index 000..9ed31e8 --- /dev/null +++ b/tests/btrfs/034.out @@ -0,0 +1,2 @@ +QA output created by 034 +At subvol SCRATCH_MNT/snap_1 diff --git a/tests/btrfs/group b/tests/btrfs/group index b29236c..f9f062f 100644 --- a/tests/btrfs/group +++ b/tests/btrfs/group @@ -36,3 +36,4 @@ 031 auto quick 032 auto quick 033 auto quick +034 auto quick -- 1.8.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] btrfs: reserve no transaction units in btrfs_ioctl_set_features
Added in patch "btrfs: add ioctls to query/change feature bits online" modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba --- fs/btrfs/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 8e48b81e1515..383ab455bfa7 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4668,7 +4668,7 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg) if (ret) return ret; - trans = btrfs_start_transaction(root, 1); + trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) return PTR_ERR(trans); -- 1.8.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4][RFC] btrfs: export global block reserve size as space_info
Introduce a block group type bit for a global reserve and fill the space info for SPACE_INFO ioctl. This should replace the newly added ioctl (01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part of the global reserve, while the actual usage can be now visible in the 'btrfs fi df' output during ENOSPC stress. The unpatched userspace tools will show the blockgroup as 'unknown'. CC: Jeff Mahoney CC: Josef Bacik Signed-off-by: David Sterba --- fs/btrfs/ctree.h | 9 - fs/btrfs/ioctl.c | 20 2 files changed, 28 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2c1a42ca519f..8bf1890ea21f 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -985,7 +985,8 @@ struct btrfs_dev_replace_item { #define BTRFS_BLOCK_GROUP_RAID10 (1ULL << 6) #define BTRFS_BLOCK_GROUP_RAID5 (1ULL << 7) #define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8) -#define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE +#define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \ +BTRFS_SPACE_INFO_GLOBAL_RSV) enum btrfs_raid_types { BTRFS_RAID_RAID10, @@ -1017,6 +1018,12 @@ enum btrfs_raid_types { */ #define BTRFS_AVAIL_ALLOC_BIT_SINGLE (1ULL << 48) +/* + * A fake block group type that is used to communicate global block reserve + * size to userspace via the SPACE_INFO ioctl. + */ +#define BTRFS_SPACE_INFO_GLOBAL_RSV(1ULL << 49) + #define BTRFS_EXTENDED_PROFILE_MASK(BTRFS_BLOCK_GROUP_PROFILE_MASK | \ BTRFS_AVAIL_ALLOC_BIT_SINGLE) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 383ab455bfa7..0d938f8f173e 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3465,6 +3465,11 @@ static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg) up_read(&info->groups_sem); } + /* +* Global block reserve, exported as a space_info +*/ + slot_count++; + /* space_slots == 0 means they are asking for a count */ if (space_args.space_slots == 0) { space_args.total_spaces = slot_count; @@ -3523,6 +3528,21 @@ static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg) up_read(&info->groups_sem); } + /* +* Add global block reserve +*/ + if (slot_count) { + struct btrfs_block_rsv *block_rsv = &root->fs_info->global_block_rsv; + + spin_lock(&block_rsv->lock); + space.total_bytes = block_rsv->size; + space.used_bytes = block_rsv->size - block_rsv->reserved; + spin_unlock(&block_rsv->lock); + space.flags = BTRFS_SPACE_INFO_GLOBAL_RSV; + memcpy(dest, &space, sizeof(space)); + space_args.total_spaces++; + } + user_dest = (struct btrfs_ioctl_space_info __user *) (arg + sizeof(struct btrfs_ioctl_space_args)); -- 1.8.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] btrfs: commit transaction after setting label and features
From: Jeff Mahoney The set_fslabel ioctl uses btrfs_end_transaction, which means it's possible that the change will be lost if the system crashes, same for the newly set features. Let's use btrfs_commit_transaction instead. Signed-off-by: Jeff Mahoney Signed-off-by: David Sterba --- fs/btrfs/ioctl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index b0134892dc70..8e48b81e1515 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4525,7 +4525,7 @@ static int btrfs_ioctl_set_fslabel(struct file *file, void __user *arg) spin_lock(&root->fs_info->super_lock); strcpy(super_block->label, label); spin_unlock(&root->fs_info->super_lock); - ret = btrfs_end_transaction(trans, root); + ret = btrfs_commit_transaction(trans, root); out_unlock: mnt_drop_write_file(file); @@ -4689,7 +4689,7 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg) btrfs_set_super_incompat_flags(super_block, newflags); spin_unlock(&root->fs_info->super_lock); - return btrfs_end_transaction(trans, root); + return btrfs_commit_transaction(trans, root); } long btrfs_ioctl(struct file *file, unsigned int -- 1.8.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] Revert "btrfs: add ioctl to export size of global metadata reservation"
This reverts commit 01e219e8069516cdb98594d417b8bb8d906ed30d. The information from the removed ioctl is newly provided by the existing SPACE_INFO ioctl where it logically belongs and needs only minor changes to the userspace code. The GLOBAL_RSV ioctl hasn't been used anywhere yet, introduced only in 3.14-rc1 so there is no userspace ABI breakage. CC: Jeff Mahoney Signed-off-by: David Sterba --- fs/btrfs/ioctl.c | 16 include/uapi/linux/btrfs.h | 1 - 2 files changed, 17 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 0d938f8f173e..dc64cc28af29 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3557,20 +3557,6 @@ out: return ret; } -static long btrfs_ioctl_global_rsv(struct btrfs_root *root, void __user *arg) -{ - struct btrfs_block_rsv *block_rsv = &root->fs_info->global_block_rsv; - u64 reserved; - - spin_lock(&block_rsv->lock); - reserved = block_rsv->reserved; - spin_unlock(&block_rsv->lock); - - if (arg && copy_to_user(arg, &reserved, sizeof(reserved))) - return -EFAULT; - return 0; -} - /* * there are many ways the trans_start and trans_end ioctls can lead * to deadlocks. They should only be used by applications that @@ -4777,8 +4763,6 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_logical_to_ino(root, argp); case BTRFS_IOC_SPACE_INFO: return btrfs_ioctl_space_info(root, argp); - case BTRFS_IOC_GLOBAL_RSV: - return btrfs_ioctl_global_rsv(root, argp); case BTRFS_IOC_SYNC: { int ret; diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 1b8a0f4c9590..b4d69092fbdb 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -558,7 +558,6 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code) #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, __u64) #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ struct btrfs_ioctl_space_args) -#define BTRFS_IOC_GLOBAL_RSV _IOR(BTRFS_IOCTL_MAGIC, 20, __u64) #define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 24, __u64) #define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64) #define BTRFS_IOC_SNAP_CREATE_V2 _IOW(BTRFS_IOCTL_MAGIC, 23, \ -- 1.8.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/4] Btrfs updates for 3.14-rc - sysfs, ioctl
Two minor fixes to the new sysfs code and a RFC change to replace the GLOBAL_RSV ioctl with the same information exported through SPACE_INFO ioctl. This requires to reserve a blockgroup type that will become part of ABI. David Sterba (3): btrfs: reserve no transaction units in btrfs_ioctl_set_features btrfs: export global block reserve size as space_info Revert "btrfs: add ioctl to export size of global metadata reservation" Jeff Mahoney (1): btrfs: commit transaction after setting label and features fs/btrfs/ctree.h | 9 - fs/btrfs/ioctl.c | 42 +++--- include/uapi/linux/btrfs.h | 1 - 3 files changed, 31 insertions(+), 21 deletions(-) -- 1.8.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Can i trust you?
I am Mrs. Aisha Al Mutasim from England, I have a business proposal worth of $10.6M US Dollars get back for more info. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote: > On Fri, 7 Feb 2014, Fengguang Wu wrote: > > > [1.625020] BTRFS: selftest: Running btrfs_split_item tests > > [1.627004] BTRFS: selftest: Running find delalloc tests > > [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz > > [ 292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, > > oom_score_adj=0 > > [ 292.086439] kthreadd cpuset= > > [ 292.087072] BUG: unable to handle kernel NULL pointer dereference at > > 0038 > > [ 292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c > > This looks like a problem with the cpuset cgroup name, are you sure this > isn't related to the removal of cgroup->name? It looks not related to patch "cgroup: remove cgroup->name", because that patch lies in the cgroup tree and not contained in output of "git log BAD_COMMIT". Thanks, Fengguang > > [ 292.087372] PGD 0 > > [ 292.087372] Oops: [#1] > > [ 292.087372] Modules linked in: > > [ 292.087372] CPU: 0 PID: 2 Comm: kthreadd Not tainted > > 3.14.0-rc1-wl-ath-00978-g4830363 #2 > > [ 292.087372] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > > [ 292.087372] task: 88148050 ti: 8814a000 task.ti: > > 8814a000 > > [ 292.087372] RIP: 0010:[] [] > > pr_cont_kernfs_name+0x1b/0x6c > > [ 292.087372] RSP: :8814bb20 EFLAGS: 00010046 > > [ 292.087372] RAX: 0282 RBX: RCX: > > 0002 > > [ 292.087372] RDX: 812119de RSI: 8247d4a8 RDI: > > 0046 > > [ 292.087372] RBP: 8814bb30 R08: 82f31218 R09: > > > > [ 292.087372] R10: R11: R12: > > 833c4ab8 > > [ 292.087372] R13: 003000d0 R14: 0001 R15: > > > > [ 292.087372] FS: () GS:82279000() > > knlGS: > > [ 292.087372] CS: 0010 DS: ES: CR0: 8005003b > > [ 292.087372] CR2: 0038 CR3: 02269000 CR4: > > 06b0 > > [ 292.087372] Stack: > > [ 292.087372] 88148050 833c4ab8 8814bb50 > > 8110836d > > [ 292.087372] 88148050 88148530 8814bbd0 > > 81ae2614 > > [ 292.087372] 88148640 8814bb78 810c7dd7 > > 8814bb98 > > [ 292.087372] Call Trace: > > [ 292.087372] [] > > cpuset_print_task_mems_allowed+0x65/0x8b > > [ 292.087372] [] dump_header.isra.11+0x68/0x29a > > [ 292.087372] [] ? local_clock+0x2b/0x34 > > [ 292.087372] [] ? lock_release_holdtime+0xcf/0xdb > > [ 292.087372] [] ? out_of_memory+0x318/0x40f > > [ 292.087372] [] out_of_memory+0x373/0x40f > > [ 292.087372] [] ? out_of_memory+0x193/0x40f > > [ 292.087372] [] __alloc_pages_nodemask+0xdd1/0x12c4 > > [ 292.087372] [] ? native_sched_clock+0xe4/0xfc > > [ 292.087372] [] copy_process+0x21d/0x2115 > > [ 292.087372] [] ? kthread_create_on_node+0x23e/0x23e > > [ 292.087372] [] ? sched_clock+0x9/0xd > > [ 292.087372] [] ? > > sched_clock_local.constprop.2+0x35/0xc8 > > [ 292.087372] [] ? pvclock_clocksource_read+0x9b/0x140 > > [ 292.087372] [] ? kthread_create_on_node+0x23e/0x23e > > [ 292.087372] [] do_fork+0x105/0x4db > > [ 292.087372] [] ? sched_clock_cpu+0xc9/0xdb > > [ 292.087372] [] ? local_clock+0x2b/0x34 > > [ 292.087372] [] ? lock_release_holdtime+0xcf/0xdb > > [ 292.087372] [] ? kthreadd+0x1b3/0x24a > > [ 292.087372] [] kernel_thread+0x26/0x28 > > [ 292.087372] [] kthreadd+0x1c7/0x24a > > [ 292.087372] [] ? ret_from_fork+0x7a/0xb0 > > [ 292.087372] [] ? kthread_create_on_cpu+0x7a/0x7a > > [ 292.087372] [] ret_from_fork+0x7a/0xb0 > > [ 292.087372] [] ? kthread_create_on_cpu+0x7a/0x7a > > [ 292.087372] Code: 1c 92 8e 00 48 8b 45 e0 59 5b 41 5c 41 5d 5d c3 0f 1f > > 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 c7 c7 90 d4 47 82 e8 b1 8f 8e 00 > > <48> 83 7b 38 00 49 89 c4 74 06 48 8b 73 40 eb 07 48 c7 c6 48 fe > > [ 292.087372] RIP [] pr_cont_kernfs_name+0x1b/0x6c > > [ 292.087372] RSP > > [ 292.087372] CR2: 0038 > > [ 292.087372] ---[ end trace df2598a82119 ]--- > > [ 292.087372] ---[ end trace df2598a82119 ]--- > > > > git bisect start 483036322b45d800fd68cb028874b6cd4fee3dba > > 38dbfb59d1175ef458d006556061adeaa8751b72 -- > > git bisect bad 50a557bc0d03997d9203d749b0304c54d2c6e5f5 # 02:20 0- > >5 Merge 'cgroup/review-simplify' into devel-hourly-2014020601 > > git bisect bad c04e036e53a39cf2986ddaaf5607ff011f9ea55b # 03:14 0- > >1 Merge 'pinctrl/for-next' into devel-hourly-2014020601 > > git bisect good 08ce20bd7ef9a08b74cdfe6fe454374be72b5825 # 03:58 25+ > >0 Merge 'pci/pci/msi' into devel-hourly-2014020601 > > git bisect good 4f86564c657669f69c9cc03151ef6f23ae9e2015 # 04:32 25+ > >0 Merge 'm68knommu/cf' into devel-hourly-2014020601 > > git
Re: lost with degraded RAID1
Is there anything else I should do with this setup or may I nuke the two partitions and reuse them? 2014-02-03 Johan Kröckel : > State is: I wont use this filesystem again. I have a backup. So I am > interested to give the necessary information for debuging it and > afterwards format it and create a new one. I already did fscks and > btrfschk --repair and pushed the output to txt-files but they are more > than 4 mb in size. > > So I will post excerpts: > > file: btrfsck.out=== > Checking filesystem on /dev/mapper/bunkerA > UUID: 7f954a85-7566-4251-832c-44f2d3de2211 > 42 > parent transid verify failed on 1887688011776 wanted 121037 found 88533 > parent transid verify failed on 1888518615040 wanted 121481 found 90267 > parent transid verify failed on 1681394102272 wanted 110919 found 91024 > parent transid verify failed on 1888522838016 wanted 121486 found 90270 > parent transid verify failed on 1888398331904 wanted 121062 found 89987 > leaf parent key incorrect 1887867330560 > bad block 1887867330560 > leaf parent key incorrect 188812032 > bad block 188812032 > leaf parent key incorrect 1888124637184 > bad block 1888124637184 > leaf parent key incorrect 1888131444736 > bad block 1888131444736 > > [...and so on for 4MB] > > bad block 1888513552384 > leaf parent key incorrect 1888513642496 > bad block 1888513642496 > leaf parent key incorrect 1888513654784 > bad block 1888513654784 > leaf parent key incorrect 1888514023424 > bad block 1888514023424 > btrfsck: cmds-check.c:2212: check_owner_ref: Assertion `!(rec->is_root)' > failed. > > file: smartctl-before-btrfschk-repair== > smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.12-0.bpo.1-amd64] (local build) > Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net > > === START OF READ SMART DATA SECTION === > SMART Attributes Data Structure revision number: 10 > Vendor Specific SMART Attributes with Thresholds: > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > UPDATED WHEN_FAILED RAW_VALUE > 1 Raw_Read_Error_Rate 0x000f 118 099 006Pre-fail > Always - 172055696 > 3 Spin_Up_Time0x0003 093 093 000Pre-fail > Always - 0 > 4 Start_Stop_Count0x0032 100 100 020Old_age > Always - 7 > 5 Reallocated_Sector_Ct 0x0033 100 100 010Pre-fail > Always - 0 > 7 Seek_Error_Rate 0x000f 069 060 030Pre-fail > Always - 9085642 > 9 Power_On_Hours 0x0032 097 097 000Old_age > Always - 2769 > 10 Spin_Retry_Count0x0013 100 100 097Pre-fail > Always - 0 > 12 Power_Cycle_Count 0x0032 100 100 020Old_age > Always - 7 > 184 End-to-End_Error0x0032 100 100 099Old_age > Always - 0 > 187 Reported_Uncorrect 0x0032 100 100 000Old_age > Always - 0 > 188 Command_Timeout 0x0032 100 100 000Old_age > Always - 0 > 189 High_Fly_Writes 0x003a 083 083 000Old_age > Always - 17 > 190 Airflow_Temperature_Cel 0x0022 077 071 045Old_age > Always - 23 (Min/Max 22/23) > 191 G-Sense_Error_Rate 0x0032 100 100 000Old_age > Always - 0 > 192 Power-Off_Retract_Count 0x0032 100 100 000Old_age > Always - 5 > 193 Load_Cycle_Count0x0032 100 100 000Old_age > Always - 7 > 194 Temperature_Celsius 0x0022 023 040 000Old_age > Always - 23 (0 20 0 0) > 197 Current_Pending_Sector 0x0012 100 100 000Old_age > Always - 0 > 198 Offline_Uncorrectable 0x0010 100 100 000Old_age > Offline - 0 > 199 UDMA_CRC_Error_Count0x003e 200 200 000Old_age > Always - 0 > > =file:btrfsck-repair.out > enabling repair mode > Checking filesystem on /dev/mapper/bunkerA > UUID: 7f954a85-7566-4251-832c-44f2d3de2211 > ify failed on 1887688011776 wanted 121037 found 88533 > parent transid verify failed on 1888518615040 wanted 121481 found 90267 > parent transid verify failed on 1681394102272 wanted 110919 found 91024 > parent transid verify failed on 1888522838016 wanted 121486 found 90270 > parent transid verify failed on 1888398331904 wanted 121062 found 89987 > leaf parent key incorrect 1887867330560 > bad block 1887867330560 > > [...and so on for 4MB] > > bad block 1888513642496 > leaf parent key incorrect 1888513654784 > bad block 1888513654784 > leaf parent key incorrect 1888514023424 > bad block 1888514023424 > btrfsck: cmds-check.c:2212: check_owner_ref: Assertion `!(rec->is_root)' > failed. > > ==file:smartctl-after-btrfschk-repair== > smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.12-0.bpo.1-amd64] (local build) > Copyright (C) 2
Re: [PATCH v3] Btrfs: add regression test for running snapshot and send concurrently
Hi Dave, > On Fri, Feb 07, 2014 at 12:18:31PM +0800, Wang Shilong wrote: >> >> Hi Dave, >> >>> On Fri, Feb 07, 2014 at 12:10:08AM +0800, Wang Shilong wrote: +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \ + $SCRATCH_MNT/snap_1 >> $seqres.full 2>&1 + +do_snapshots & +snapshots_pid=$! + +$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 > /dev/null 2>&1 || echo "btrfs send failed" >>> >>> Let's stop this anti-pattern before it takes hold. >>> >>> If there's output from the send command it should be >>> filtered and captured in the golden image. Hence any deviation >>> caused by errors is automatically flagged as an error. >>> >>> That's the whole point of using golden images for capturing errors - >>> you don't need to capture return values from binaries and it >>> guarantees that users are informed about failures through error >>> messages. IOWs: >>> >>> $BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 | _btrfs_send_filter >>> >>> is what you should be doing here. >> >> I knew what you mean here, in fact, i did this on purpose. > > Ok, then you need to explain why you did it on purpose with a comment. > It's just as important to explain the reason for doing something in > test code as it is in the kernel code. i.e. so when we are looking > at the test in 5 years time we know the reason for it being that > way. > >> for this test failure, btrfs-prog did not output failure >> information from the beginning. > > I have nothing good to say about that state of affairs, but... > >> So to make older progs can also >> detect the test failure, i dropped into this way. > > .. it's going to have to stay like it. Please insert an > appropriately sarcastic comment about the usefulness of a silent > send command here, because if I write it I'm going to offend lots of > people. :/ Sorry, my miss, when i was going to give a patch for btrfs-progs, i noticed the issue has been fixed in latest btrfs-progs(3.12 which has released for a long time). So let's drop the way you have said before. Thanks, Wang > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 3/3] btrfs/035: add new clone overwrite regression test
This test uses the newly added cloner binary to dispatch full file and range specific clone (reflink) requests. Signed-off-by: David Disseldorp --- tests/btrfs/035 | 77 + tests/btrfs/035.out | 3 +++ tests/btrfs/group | 1 + 3 files changed, 81 insertions(+) create mode 100755 tests/btrfs/035 create mode 100644 tests/btrfs/035.out diff --git a/tests/btrfs/035 b/tests/btrfs/035 new file mode 100755 index 000..6808179 --- /dev/null +++ b/tests/btrfs/035 @@ -0,0 +1,77 @@ +#!/bin/bash +# FS QA Test No. btrfs/035 +# +# Regression test for overwriting clones +# +#--- +# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! + +_cleanup() +{ +rm -f $tmp.* +} + +trap "_cleanup ; exit \$status" 0 1 2 3 15 + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux +_require_scratch + +_scratch_mkfs > /dev/null 2>&1 +_scratch_mount + +CLONER_PROG=$here/src/cloner +[ -x $CLONER_PROG ] || _notrun "cloner binary not present at $CLONER_PROG" + +src_str="aa" + +echo -n "$src_str" > $SCRATCH_MNT/src + +$CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone1 + +src_str="bbcc" + +echo -n "$src_str" > $SCRATCH_MNT/src + +$CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2 + +snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'` +echo "attempting ioctl (src.clone1 src)" +$CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \ + $SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src + +snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'` +echo "attempting ioctl (src.clone2 src)" +$CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \ + $SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src + +status=0 ; exit diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out new file mode 100644 index 000..f86cadf --- /dev/null +++ b/tests/btrfs/035.out @@ -0,0 +1,3 @@ +QA output created by 035 +attempting ioctl (src.clone1 src) +attempting ioctl (src.clone2 src) diff --git a/tests/btrfs/group b/tests/btrfs/group index f9f062f..bee57cb 100644 --- a/tests/btrfs/group +++ b/tests/btrfs/group @@ -37,3 +37,4 @@ 032 auto quick 033 auto quick 034 auto quick +035 auto quick -- 1.8.4.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/3] __btrfs_drop_extents() BUG_ON reproducer
This patch-set provides a reproducer for hitting the 3.14.0-rc1 BUG_ON() at: 692 int __btrfs_drop_extents(struct btrfs_trans_handle *trans, ... 839 /* 840 * | range to drop - | 841 * | extent | 842 */ 843 if (start <= key.offset && end < extent_end) { 844 BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE); 845 846 memcpy(&new_key, &key, sizeof(new_key)); The first patch adds a small cloner binary which is used by btrfs/035 to dispatch BTRFS_IOC_CLONE_RANGE requests. This workload resembles that of Samba's vfs_btrfs module, when a Windows client restores a file from a shadow-copy (snapshot) using server-side copy requests. Changes since V2: - Remove explicit write error checks Changes since V1: - Use strtoull instead of atoi - Print error conditions in cloner - Check for cloner binary before running test - Continue test on failure - Add cloner to .gitignore Feedback appreciated. Cheers, David .gitignore | 1 + configure.ac| 1 + src/Makefile| 2 +- src/cloner.c| 192 +++ tests/btrfs/035 | 77 + tests/btrfs/035.out | 3 +++ tests/btrfs/group | 1 + 7 files changed, 276 insertions(+), 1 deletion(-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/3] btrfs: add small program for clone testing
The cloner program is capable of cloning files using the BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls. Signed-off-by: David Disseldorp --- .gitignore | 1 + src/Makefile | 2 +- src/cloner.c | 188 +++ 3 files changed, 190 insertions(+), 1 deletion(-) create mode 100644 src/cloner.c diff --git a/.gitignore b/.gitignore index ee4cb41..b6f2463 100644 --- a/.gitignore +++ b/.gitignore @@ -104,6 +104,7 @@ /src/aio-dio-regress/aio-free-ring-with-bogus-nr-pages /src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer /src/aio-dio-regress/aiodio_sparse2 +/src/cloner # dmapi/ binaries /dmapi/src/common/cmd/read_invis diff --git a/src/Makefile b/src/Makefile index 84c8297..6509f2d 100644 --- a/src/Makefile +++ b/src/Makefile @@ -18,7 +18,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ locktest unwritten_mmap bulkstat_unlink_test t_stripealign \ bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \ stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \ - seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec + seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner SUBDIRS = diff --git a/src/cloner.c b/src/cloner.c new file mode 100644 index 000..dfce837 --- /dev/null +++ b/src/cloner.c @@ -0,0 +1,188 @@ +/* + * Tiny program to perform file (range) clones using raw Btrfs ioctls. + * It should only be needed until btrfs-progs has an xfs_io equivalent. + * + * Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct btrfs_ioctl_clone_range_args { + int64_t src_fd; + uint64_t src_offset; + uint64_t src_length; + uint64_t dest_offset; +}; + +#define BTRFS_IOCTL_MAGIC 0x94 +#define BTRFS_IOC_CLONE _IOW(BTRFS_IOCTL_MAGIC, 9, int) +#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \ + struct btrfs_ioctl_clone_range_args) + +static void +usage(char *name, const char *msg) +{ + printf("Fatal: %s\n" + "Usage:\n" + "%s [options] \n" + "\tA full file clone (reflink) is performed by default, " + "unless any of the following are specified:\n" + "\t-s : source file offset (default = 0)\n" + "\t-d : destination file offset (default = 0)\n" + "\t-l : length of clone (default = 0)\n", + msg, name); + _exit(1); +} + +static int +clone_file(int src_fd, int dst_fd) +{ + int ret = ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd); + if (ret != 0) + ret = errno; + return ret; +} + +static int +clone_file_range(int src_fd, int dst_fd, uint64_t src_off, uint64_t dst_off, +uint64_t len) +{ + struct btrfs_ioctl_clone_range_args cr_args; + int ret; + + memset(&cr_args, 0, sizeof(cr_args)); + cr_args.src_fd = src_fd; + cr_args.src_offset = src_off; + cr_args.src_length = len; + cr_args.dest_offset = dst_off; + ret = ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, &cr_args); + if (ret != 0) + ret = errno; + return ret; +} + +int +main(int argc, char **argv) +{ + bool full_file = true; + uint64_t src_off = 0; + uint64_t dst_off = 0; + uint64_t len = 0; + char *src_file; + int src_fd; + char *dst_file; + int dst_fd; + int ret; + int opt; + + while ((opt = getopt(argc, argv, "s:d:l:")) != -1) { + char *sval_end; + switch (opt) { + case 's': + errno = 0; + src_off = strtoull(optarg, &sval_end, 10); + if ((errno) || (*sval_end != '\0')) + usage(argv[0], "invalid source offset"); + full_file = false; + break; + case 'd': + errno = 0; + dst_off = strtoull(optarg, &sval_end, 10); +
[PATCH v3 2/3] src/cloner: use btrfs/ioctl.h header if present
Check for the btrfsprogs-devel ioctl.h header at configure time. Use it in src/cloner if present, otherwise fall back to using the copied clone ioctl definitions. Signed-off-by: David Disseldorp --- configure.ac | 1 + src/cloner.c | 4 2 files changed, 5 insertions(+) diff --git a/configure.ac b/configure.ac index bd48fd9..6fba3ad 100644 --- a/configure.ac +++ b/configure.ac @@ -30,6 +30,7 @@ AC_HEADER_STDC AC_CHECK_HEADERS([ sys/fs/xfs_fsops.h \ sys/fs/xfs_itable.h \ xfs/platform_defs.h \ + btrfs/ioctl.h \ ]) AC_PACKAGE_NEED_UUIDCOMPARE diff --git a/src/cloner.c b/src/cloner.c index dfce837..ccc2354 100644 --- a/src/cloner.c +++ b/src/cloner.c @@ -30,6 +30,9 @@ #include #include #include +#ifdef HAVE_BTRFS_IOCTL_H +#include +#else struct btrfs_ioctl_clone_range_args { int64_t src_fd; @@ -42,6 +45,7 @@ struct btrfs_ioctl_clone_range_args { #define BTRFS_IOC_CLONE _IOW(BTRFS_IOCTL_MAGIC, 9, int) #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \ struct btrfs_ioctl_clone_range_args) +#endif static void usage(char *name, const char *msg) -- 1.8.4.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/3] btrfs/035: add new clone overwrite regression test
On Fri, 7 Feb 2014 09:45:09 +1100, Dave Chinner wrote: > Not exactly what I intended. > > If echo fails, it will output some kind of error message, and that > will cause the golden image mismatch. Fair enough. I'll resend with the two ' || echo "failed to create src"' checks removed. Thanks again for the review Dave. Cheers, David -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: introduce BTRFS_IOC_GET_DEVS
On Fri, Feb 07, 2014 at 06:08:56PM +0800, Anand Jain wrote: > > > Thanks for the comments. > > mainly here sysfs way defeats the purpose - debug as > mentioned. Sysfs would/should show only mounted disks, So let's find a way of showing the "known-about" data structure separately from the "mounted" data structure. Just thinking aloud here, how about something like this: For unmounted, but known filesystems: /sys/fs/btrfs/devmap// (for each device known) /sys/fs/btrfs/devmap//label /sys/fs/btrfs/devmap//complete (0 if missing devices, 1 otherwise) /sys/fs/btrfs/devmap// And then for mounted filesystems, we can populate the with more details as appropriate (and remove them when unmounted again). We can also add a symlink to the directory for the filesystem to e.g.: /sys/fs/btrfs/mounted/ If truly the *only* reason to do this is debug, then simply dump the info in debugfs and have done with it. However, there are good reasons to want to have this information in an ordinary running system, too. So stick with sysfs, IMO. > the ioctl way doesn't have such a limitation (of course > as I commented memory dump way would have been best choice > but I got stuck with that approach if anybody wants to give > a try with that approach you are most welcome). > > IMO no harm to have both sysfs way and ioctl way let user > or developer use which were is suitable in their context. Extra maintenance overhead; divergence of the APIs. "You can find out the label of the filesystem using sysfs, but not the ioctl. You can find out the size of the filesystem using the ioctl, but not using sysfs." That way lies quite a bit of pain for the user of the interfaces. Hugo. > Thanks, Anand > > > On 02/07/2014 06:05 AM, David Sterba wrote: > >On Thu, Feb 06, 2014 at 09:09:57PM +0100, Goffredo Baroncelli wrote: > >>>With 3.14 the sysfs interface is available, but the devices under > >>> /sys/fs/btrfs//devices/... > >>>are symlinks to the sysfs devices, so this btrfs-specific device > >>>information has to be located in a separate directory. > >> > >>yes please; it would scale better than the ioctl()s; anyway I suggest 1 > >>file per property, and not "...one file that holds all the stats > >> about the device" > > > >Well, we can do both, I don't have a preference and both make sense. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Great oxymorons of the world, no. 4: Future Perfect --- signature.asc Description: Digital signature
Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
On Fri, 7 Feb 2014, Fengguang Wu wrote: > [1.625020] BTRFS: selftest: Running btrfs_split_item tests > [1.627004] BTRFS: selftest: Running find delalloc tests > [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz > [ 292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, > oom_score_adj=0 > [ 292.086439] kthreadd cpuset= > [ 292.087072] BUG: unable to handle kernel NULL pointer dereference at > 0038 > [ 292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c This looks like a problem with the cpuset cgroup name, are you sure this isn't related to the removal of cgroup->name? > [ 292.087372] PGD 0 > [ 292.087372] Oops: [#1] > [ 292.087372] Modules linked in: > [ 292.087372] CPU: 0 PID: 2 Comm: kthreadd Not tainted > 3.14.0-rc1-wl-ath-00978-g4830363 #2 > [ 292.087372] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 292.087372] task: 88148050 ti: 8814a000 task.ti: > 8814a000 > [ 292.087372] RIP: 0010:[] [] > pr_cont_kernfs_name+0x1b/0x6c > [ 292.087372] RSP: :8814bb20 EFLAGS: 00010046 > [ 292.087372] RAX: 0282 RBX: RCX: > 0002 > [ 292.087372] RDX: 812119de RSI: 8247d4a8 RDI: > 0046 > [ 292.087372] RBP: 8814bb30 R08: 82f31218 R09: > > [ 292.087372] R10: R11: R12: > 833c4ab8 > [ 292.087372] R13: 003000d0 R14: 0001 R15: > > [ 292.087372] FS: () GS:82279000() > knlGS: > [ 292.087372] CS: 0010 DS: ES: CR0: 8005003b > [ 292.087372] CR2: 0038 CR3: 02269000 CR4: > 06b0 > [ 292.087372] Stack: > [ 292.087372] 88148050 833c4ab8 8814bb50 > 8110836d > [ 292.087372] 88148050 88148530 8814bbd0 > 81ae2614 > [ 292.087372] 88148640 8814bb78 810c7dd7 > 8814bb98 > [ 292.087372] Call Trace: > [ 292.087372] [] cpuset_print_task_mems_allowed+0x65/0x8b > [ 292.087372] [] dump_header.isra.11+0x68/0x29a > [ 292.087372] [] ? local_clock+0x2b/0x34 > [ 292.087372] [] ? lock_release_holdtime+0xcf/0xdb > [ 292.087372] [] ? out_of_memory+0x318/0x40f > [ 292.087372] [] out_of_memory+0x373/0x40f > [ 292.087372] [] ? out_of_memory+0x193/0x40f > [ 292.087372] [] __alloc_pages_nodemask+0xdd1/0x12c4 > [ 292.087372] [] ? native_sched_clock+0xe4/0xfc > [ 292.087372] [] copy_process+0x21d/0x2115 > [ 292.087372] [] ? kthread_create_on_node+0x23e/0x23e > [ 292.087372] [] ? sched_clock+0x9/0xd > [ 292.087372] [] ? sched_clock_local.constprop.2+0x35/0xc8 > [ 292.087372] [] ? pvclock_clocksource_read+0x9b/0x140 > [ 292.087372] [] ? kthread_create_on_node+0x23e/0x23e > [ 292.087372] [] do_fork+0x105/0x4db > [ 292.087372] [] ? sched_clock_cpu+0xc9/0xdb > [ 292.087372] [] ? local_clock+0x2b/0x34 > [ 292.087372] [] ? lock_release_holdtime+0xcf/0xdb > [ 292.087372] [] ? kthreadd+0x1b3/0x24a > [ 292.087372] [] kernel_thread+0x26/0x28 > [ 292.087372] [] kthreadd+0x1c7/0x24a > [ 292.087372] [] ? ret_from_fork+0x7a/0xb0 > [ 292.087372] [] ? kthread_create_on_cpu+0x7a/0x7a > [ 292.087372] [] ret_from_fork+0x7a/0xb0 > [ 292.087372] [] ? kthread_create_on_cpu+0x7a/0x7a > [ 292.087372] Code: 1c 92 8e 00 48 8b 45 e0 59 5b 41 5c 41 5d 5d c3 0f 1f 44 > 00 00 55 48 89 e5 41 54 53 48 89 fb 48 c7 c7 90 d4 47 82 e8 b1 8f 8e 00 <48> > 83 7b 38 00 49 89 c4 74 06 48 8b 73 40 eb 07 48 c7 c6 48 fe > [ 292.087372] RIP [] pr_cont_kernfs_name+0x1b/0x6c > [ 292.087372] RSP > [ 292.087372] CR2: 0038 > [ 292.087372] ---[ end trace df2598a82119 ]--- > [ 292.087372] ---[ end trace df2598a82119 ]--- > > git bisect start 483036322b45d800fd68cb028874b6cd4fee3dba > 38dbfb59d1175ef458d006556061adeaa8751b72 -- > git bisect bad 50a557bc0d03997d9203d749b0304c54d2c6e5f5 # 02:20 0- > 5 Merge 'cgroup/review-simplify' into devel-hourly-2014020601 > git bisect bad c04e036e53a39cf2986ddaaf5607ff011f9ea55b # 03:14 0- > 1 Merge 'pinctrl/for-next' into devel-hourly-2014020601 > git bisect good 08ce20bd7ef9a08b74cdfe6fe454374be72b5825 # 03:58 25+ > 0 Merge 'pci/pci/msi' into devel-hourly-2014020601 > git bisect good 4f86564c657669f69c9cc03151ef6f23ae9e2015 # 04:32 25+ > 0 Merge 'm68knommu/cf' into devel-hourly-2014020601 > git bisect bad 29ae23c20b538664beaea72bb0721ce2538b4ca9 # 04:50 0- > 11 Merge 'm68knommu/cfmmu' into devel-hourly-2014020601 > git bisect bad bbce71dbd03db3cf6df7faba0132d87a1a055827 # 04:58 0- > 3 Merge 'nfs/devel' into devel-hourly-2014020601 > git bisect good 88a78a912ee059467ae6db7429a6efe4654620a5 # 06:16 25+ > 1 Merge branch 'acl_fixes' into linux-next > git bisect good 12b13835a0a8bfabea68741e1a
Re: Provide a better free space estimate on RAID1
Am Donnerstag, 6. Februar 2014, 22:30:46 schrieb Chris Murphy: > On Feb 6, 2014, at 9:40 PM, Roman Mamedov wrote: > > On Thu, 06 Feb 2014 20:54:19 +0100 > > > > Goffredo Baroncelli wrote: > > > > > >> I agree with you about the needing of a solution. However your patch to > >> me seems even worse than the actual code.>> > >> > >> > >> For example you cannot take in account the mix of data/linear and > >> metadata/dup (with the pathological case of small files stored in the > >> metadata chunks ), nor different profile level like raid5/6 (or the > >> future raidNxM) And do not forget the compression... > > > > > > > > Every estimate first and foremost should be measured by how precise it is, > > or in this case "wrong by how many gigabytes". The actual code returns a > > result that is pretty much always wrong by 2x, after the patch it will be > > close within gigabytes to the correct value in the most common use case > > (data raid1, metadata raid1 and that's it). Of course that PoC is nowhere > > near the final solution, what I can't agree with is "if another option is > > somewhat better, but not ideally perfect, then it's worse than the > > current one", even considering the current one is absolutely broken. > > Is the glass half empty or is it half full? > > From the original post, context is a 2x 1TB raid volume: > > Filesystem Size Used Avail Use% Mounted on > /dev/sda2 1.8T 1.1M 1.8T 1% /mnt/p2 > > Earlier conventions would have stated Size ~900GB, and Avail ~900GB. But > that's not exactly true either, is it? It's merely a convention to cut the > storage available in half, while keeping data file sizes the same as if > they were on a single device without raid. > > On Btrfs the file system Size is reported as the total storage stack size, > and that's not incorrect. And the amount Avail is likewise not wrong > because that space is "not otherwise occupied" which is the definition of > available. It's linguistically consistent, it's just not a familiar > convention. I see one issue with it: There are installers and applications that check available disk space prior to installing. This won´t work with current df figures that BTRFS delivers. While I understand that there is *never* a guarentee that a given free space can really be allocated by a process cause other processes can allocate space as well in the mean time, and while I understand that its difficult to provide an accurate to provide exact figures as soon as RAID settings can be set per subvolume, it still think its important to improve on the figures. In the longer term I´d like like a function / syscall to ask the filesystem the following question: I am about to write 200 MB in this directory, am I likely to succeed with that? This way an application can ask specific to a directory which allows BTRFS to provide a more accurate estimation. I understand that there is something like that for single files (fallocate), but there is nothing like this for writing a certain amount of data in several files / directories. Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: introduce BTRFS_IOC_GET_DEVS
Thanks for the comments. mainly here sysfs way defeats the purpose - debug as mentioned. Sysfs would/should show only mounted disks, the ioctl way doesn't have such a limitation (of course as I commented memory dump way would have been best choice but I got stuck with that approach if anybody wants to give a try with that approach you are most welcome). IMO no harm to have both sysfs way and ioctl way let user or developer use which were is suitable in their context. Thanks, Anand On 02/07/2014 06:05 AM, David Sterba wrote: On Thu, Feb 06, 2014 at 09:09:57PM +0100, Goffredo Baroncelli wrote: With 3.14 the sysfs interface is available, but the devices under /sys/fs/btrfs//devices/... are symlinks to the sysfs devices, so this btrfs-specific device information has to be located in a separate directory. yes please; it would scale better than the ioctl()s; anyway I suggest 1 file per property, and not "...one file that holds all the stats about the device" Well, we can do both, I don't have a preference and both make sense. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Btrfs: fix use-after-free in the finishing procedure of the device replace
On Fri, 07 Feb 2014 17:17:45 +0800, Miao Xie wrote: > On Fri, 07 Feb 2014 09:14:42 +0100, Stefan Behrens wrote: >> On Thu, 30 Jan 2014 16:46:55 +0800, Miao Xie wrote: >>> During device replace test, we hit a null pointer deference (It was very >>> easy >>> to reproduce it by running xfstests' btrfs/011 on the devices with the >>> virtio >>> scsi driver). There were two bugs that caused this problem: >>> - We might allocate new chunks on the replaced device after we updated >>> the mapping tree. And we forgot to replace the source device in those >>> mapping of the new chunks. >>> - We might get the mapping information which including the source device >>> before the mapping information update. And then submit the bio which was >>> based on that mapping information after we freed the source device. >>> >>> For the first bug, we can fix it by doing mapping tree update and source >>> device remove in the same context of the chunk mutex. The chunk mutex is >>> used to protect the allocable device list, the above method can avoid >>> the new chunk allocation, and after we remove the source device, all >>> the new chunks will be allocated on the new device. So it can fix >>> the first bug. >>> >>> For the second bug, we need make sure all flighting bios are finished and >>> no new bios are produced during we are removing the source device. To fix >>> this problem, we introduced a global @bio_counter, we not only inc/dec >>> @bio_counter outsize of map_blocks, but also inc it before submitting bio >>> and dec @bio_counter when ending bios. >>> >>> Since Raid56 is a little different and device replace dosen't support raid56 >>> yet, it is not addressed in the patch and I add comments to make sure we >>> will >>> fix it in the future. >>> >>> Reported-by: Qu Wenruo >>> Signed-off-by: Wang Shilong >>> Signed-off-by: Miao Xie >>> --- >>> fs/btrfs/ctree.h | 9 ++ >>> fs/btrfs/dev-replace.c | 74 >>> +- >>> fs/btrfs/disk-io.c | 12 +++- >>> fs/btrfs/volumes.c | 30 +++- >>> fs/btrfs/volumes.h | 1 + >>> 5 files changed, 111 insertions(+), 15 deletions(-) [...] >>> static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, >>>int scrub_ret) >>> { >>> @@ -458,12 +487,6 @@ static int btrfs_dev_replace_finishing(struct >>> btrfs_fs_info *fs_info, >>> src_device = dev_replace->srcdev; >>> btrfs_dev_replace_unlock(dev_replace); >>> >>> - /* replace old device with new one in mapping tree */ >>> - if (!scrub_ret) >>> - btrfs_dev_replace_update_device_in_mapping_tree(fs_info, >>> - src_device, >>> - tgt_device); >>> - >> >> Isn't this change the reason that you need to add code to count bios? > > No, the change is not the reason to introduce a bio counter. > > The above two bugs are independent, the second bug can happen even the first > one > doesn't exist, for example: > Task1 Task2 > btrfs_readpages() > btrfs_map_bio() > __btrfs_map_block() > You are right! Thanks. > device replace start > device replace finish > release source device > bio_size_ok() > bdev_get_queue() > oops > >> I mean, code is there to flush everything to disk. It's needed for sync, >> too. When btrfs_dev_replace_finishing() is called, the copy operation is >> finished and the replaced and the replacement drive are logically >> identical except that data is not yet forced to be flushed to disk. The >> only difference between both disks is the bdev block device pointer and >> the btrfs device pointer. >> >> I think this is the correct and lightweight procedure: >> >> 1. After the copy operation finished, from now on, always use the new >> btrfs device and the new block device. >> 2. Flush all outstanding bios and wait for them. >> 3. Close the block dev and free the btrfs device. >> >> You said that the problem was that we might allocate new chunks on the >> replaced device after we updated the mapping tree. Well, can't we just >> make sure the new chunks are allocated using the new device pointers >> after step 1 and all issues are resolved? I think that the first change >> is not a good idea and causes the other issues that you fix with a lot >> of code. > > Step1 can fix the first problem, but need more code to recover the mapping > tree when the flush fails. > > And as I said above, the second bug is not introduced by the fix of the first > bug, we have to add a bio counter. Any good idea? > > BTW, the reason why I mixed two fixes into one patch is that we could not fix >
Re: [PATCH 2/2] Btrfs: fix use-after-free in the finishing procedure of the device replace
On Fri, 07 Feb 2014 09:14:42 +0100, Stefan Behrens wrote: > On Thu, 30 Jan 2014 16:46:55 +0800, Miao Xie wrote: >> During device replace test, we hit a null pointer deference (It was very easy >> to reproduce it by running xfstests' btrfs/011 on the devices with the virtio >> scsi driver). There were two bugs that caused this problem: >> - We might allocate new chunks on the replaced device after we updated >> the mapping tree. And we forgot to replace the source device in those >> mapping of the new chunks. >> - We might get the mapping information which including the source device >> before the mapping information update. And then submit the bio which was >> based on that mapping information after we freed the source device. >> >> For the first bug, we can fix it by doing mapping tree update and source >> device remove in the same context of the chunk mutex. The chunk mutex is >> used to protect the allocable device list, the above method can avoid >> the new chunk allocation, and after we remove the source device, all >> the new chunks will be allocated on the new device. So it can fix >> the first bug. >> >> For the second bug, we need make sure all flighting bios are finished and >> no new bios are produced during we are removing the source device. To fix >> this problem, we introduced a global @bio_counter, we not only inc/dec >> @bio_counter outsize of map_blocks, but also inc it before submitting bio >> and dec @bio_counter when ending bios. >> >> Since Raid56 is a little different and device replace dosen't support raid56 >> yet, it is not addressed in the patch and I add comments to make sure we will >> fix it in the future. >> >> Reported-by: Qu Wenruo >> Signed-off-by: Wang Shilong >> Signed-off-by: Miao Xie >> --- >> fs/btrfs/ctree.h | 9 ++ >> fs/btrfs/dev-replace.c | 74 >> +- >> fs/btrfs/disk-io.c | 12 +++- >> fs/btrfs/volumes.c | 30 +++- >> fs/btrfs/volumes.h | 1 + >> 5 files changed, 111 insertions(+), 15 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 84d4c05..c39ad06 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -351,6 +351,7 @@ static inline unsigned long btrfs_chunk_item_size(int >> num_stripes) >> #define BTRFS_FS_STATE_ERROR0 >> #define BTRFS_FS_STATE_REMOUNTING 1 >> #define BTRFS_FS_STATE_TRANS_ABORTED2 >> +#define BTRFS_FS_STATE_DEV_REPLACING3 >> >> /* Super block flags */ >> /* Errors detected */ >> @@ -1674,6 +1675,9 @@ struct btrfs_fs_info { >> >> atomic_t mutually_exclusive_operation_running; >> >> +struct percpu_counter bio_counter; >> +wait_queue_head_t replace_wait; >> + >> struct semaphore uuid_tree_rescan_sem; >> unsigned int update_uuid_tree_gen:1; >> }; >> @@ -3991,6 +3995,11 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info, >> int btrfs_scrub_progress(struct btrfs_root *root, u64 devid, >> struct btrfs_scrub_progress *progress); >> >> +/* dev-replace.c */ >> +void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info); >> +void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info); >> +void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info); >> + >> /* reada.c */ >> struct reada_control { >> struct btrfs_root *root; /* tree to prefetch */ >> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c >> index b20d59e..ec1c3f3 100644 >> --- a/fs/btrfs/dev-replace.c >> +++ b/fs/btrfs/dev-replace.c >> @@ -431,6 +431,35 @@ leave_no_lock: >> return ret; >> } >> >> +/* >> + * blocked until all flighting bios are finished. >> + */ >> +static void btrfs_rm_dev_replace_blocked(struct btrfs_fs_info *fs_info) >> +{ >> +s64 writers; >> +DEFINE_WAIT(wait); >> + >> +set_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state); >> +do { >> +prepare_to_wait(&fs_info->replace_wait, &wait, >> +TASK_UNINTERRUPTIBLE); >> +writers = percpu_counter_sum(&fs_info->bio_counter); >> +if (writers) >> +schedule(); >> +finish_wait(&fs_info->replace_wait, &wait); >> +} while (writers); >> +} >> + >> +/* >> + * we have removed target device, it is safe to allow new bios request. >> + */ >> +static void btrfs_rm_dev_replace_unblocked(struct btrfs_fs_info *fs_info) >> +{ >> +clear_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state); >> +if (waitqueue_active(&fs_info->replace_wait)) >> +wake_up(&fs_info->replace_wait); >> +} >> + >> static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, >> int scrub_ret) >> { >> @@ -458,12 +487,6 @@ static int btrfs_dev_replace_finishing(struct >> btrfs_fs_info *fs_info, >> src_device = dev_replace->srcdev; >> btrfs_dev_replace_unlock(dev_replace); >> >> -/* replace
Re: [PATCH 1/2] btrfs-progs: Add missing devices check for mounted btrfs.
IMO btrfs-progs shouldn't add its intelligence to know if disk is missing. If btrfs-kernel doesn't know when disk is missing that's a bug to fix in btrfs-kernel. yes that indeed true as of now in btrfs-kernel. btrfs kernel has no idea when disk goes missing, just -EIO doesn't tell btrfs that. I am trying to fix this first. But the problem is there isn't good way with in btrfs/FS to know when disk goes missing. did I miss anything ? Thanks, Anand On 02/07/2014 02:45 PM, Qu Wenruo wrote: In btrfs/003 of xfstest, it will check whether btrfs fi show can find missing devices. But before the patch, btrfs-progs will not check whether device missing if given a mounted btrfs mountpoint/block device. This patch fixes the bug and will pass btrfs/003. Signed-off-by: Qu Wenruo Cc: Anand Jain --- cmds-filesystem.c | 12 1 file changed, 12 insertions(+) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index 384d1b9..4c9933d 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -363,6 +363,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info, char *label, char *path) { int i; + int fd; + int missing; char uuidbuf[BTRFS_UUID_UNPARSED_SIZE]; struct btrfs_ioctl_dev_info_args *tmp_dev_info; int ret; @@ -385,6 +387,14 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info, for (i = 0; i < fs_info->num_devices; i++) { tmp_dev_info = (struct btrfs_ioctl_dev_info_args *)&dev_info[i]; + + /* Add check for missing devices even mounted */ + fd = open((char *)tmp_dev_info->path, O_RDONLY); + if (fd < 0) { + missing = 1; + continue; + } + close(fd); printf("\tdevid %4llu size %s used %s path %s\n", tmp_dev_info->devid, pretty_size(tmp_dev_info->total_bytes), @@ -392,6 +402,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info, tmp_dev_info->path); } + if (missing) + printf("\t*** Some devices missing\n"); printf("\n"); return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs-progs: Add -p/--print-missing options for btrfs fi show
Whats needed is more comprehensive btrfs fi show which shows the flags (including missing) per disk. And also show the FS/Raid status. Which I am working on. sorry -p feature would be covered by default in the coming revamp of btrfs fi show. Thanks, Anand On 02/07/2014 02:46 PM, Qu Wenruo wrote: Since a mounted btrfs filesystem contains all the devices info even a device is removed after mount(like btrfs/003 in xfstests), we can use the info to print the known missing device if possible. So -p/--print-missing options are added to print possible missing devices. Signed-off-by: Qu Wenruo --- cmds-filesystem.c | 26 -- man/btrfs.8.in| 4 +++- 2 files changed, 23 insertions(+), 7 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index 4c9933d..77b142c 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -360,7 +360,7 @@ static u64 calc_used_bytes(struct btrfs_ioctl_space_args *si) static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info, struct btrfs_ioctl_dev_info_args *dev_info, struct btrfs_ioctl_space_args *space_info, - char *label, char *path) + char *label, char *path, int print_missing) { int i; int fd; @@ -392,7 +392,14 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info, fd = open((char *)tmp_dev_info->path, O_RDONLY); if (fd < 0) { missing = 1; - continue; + if (print_missing) + printf("\tdevid %4llu size %s used %s path %s (missing)\n", + tmp_dev_info->devid, + pretty_size(tmp_dev_info->total_bytes), + pretty_size(tmp_dev_info->bytes_used), + tmp_dev_info->path); + else + continue; } close(fd); printf("\tdevid %4llu size %s used %s path %s\n", @@ -440,7 +447,7 @@ static int check_arg_type(char *input) return BTRFS_ARG_UNKNOWN; } -static int btrfs_scan_kernel(void *search) +static int btrfs_scan_kernel(void *search, int print_missing) { int ret = 0, fd; FILE *f; @@ -477,7 +484,8 @@ static int btrfs_scan_kernel(void *search) fd = open(mnt->mnt_dir, O_RDONLY); if ((fd != -1) && !get_df(fd, &space_info_arg)) { print_one_fs(&fs_info_arg, dev_info_arg, - space_info_arg, label, mnt->mnt_dir); +space_info_arg, label, mnt->mnt_dir, +print_missing); kfree(space_info_arg); memset(label, 0, sizeof(label)); } @@ -500,6 +508,7 @@ static const char * const cmd_show_usage[] = { "Show the structure of a filesystem", "-d|--all-devices show only disks under /dev containing btrfs filesystem", "-m|--mounted show only mounted btrfs", + "-p|--print-missing show known missing device if possible", "If no argument is given, structure of all present filesystems is shown.", NULL }; @@ -513,6 +522,7 @@ static int cmd_show(int argc, char **argv) int ret; int where = BTRFS_SCAN_LBLKID; int type = 0; + int print_missing = 0; char mp[BTRFS_PATH_NAME_MAX + 1]; char path[PATH_MAX]; @@ -521,9 +531,10 @@ static int cmd_show(int argc, char **argv) static struct option long_options[] = { { "all-devices", no_argument, NULL, 'd'}, { "mounted", no_argument, NULL, 'm'}, + { "print-missing", no_argument, NULL, 'p'}, { NULL, no_argument, NULL, 0 }, }; - int c = getopt_long(argc, argv, "dm", long_options, + int c = getopt_long(argc, argv, "dmp", long_options, &long_index); if (c < 0) break; @@ -534,6 +545,9 @@ static int cmd_show(int argc, char **argv) case 'm': where = BTRFS_SCAN_MOUNTED; break; + case 'p': + print_missing = 1; + break; default: usage(cmd_show_usage); } @@ -571,7 +585,7 @@ static int cmd_show(int argc, char **argv) goto devs_only; /* show mounted btrfs */ - ret = btrfs_scan_kernel(search); + ret = btrfs_scan_kernel(search, print_missing); if (search && !ret) return 0; diff --git a/man/btrfs.8.in b/man/btrfs.8.in index 8fea115..db2e355 100644 --- a/man/btrfs.8.in
Re: [PATCH 2/2] Btrfs: fix use-after-free in the finishing procedure of the device replace
On Thu, 30 Jan 2014 16:46:55 +0800, Miao Xie wrote: > During device replace test, we hit a null pointer deference (It was very easy > to reproduce it by running xfstests' btrfs/011 on the devices with the virtio > scsi driver). There were two bugs that caused this problem: > - We might allocate new chunks on the replaced device after we updated > the mapping tree. And we forgot to replace the source device in those > mapping of the new chunks. > - We might get the mapping information which including the source device > before the mapping information update. And then submit the bio which was > based on that mapping information after we freed the source device. > > For the first bug, we can fix it by doing mapping tree update and source > device remove in the same context of the chunk mutex. The chunk mutex is > used to protect the allocable device list, the above method can avoid > the new chunk allocation, and after we remove the source device, all > the new chunks will be allocated on the new device. So it can fix > the first bug. > > For the second bug, we need make sure all flighting bios are finished and > no new bios are produced during we are removing the source device. To fix > this problem, we introduced a global @bio_counter, we not only inc/dec > @bio_counter outsize of map_blocks, but also inc it before submitting bio > and dec @bio_counter when ending bios. > > Since Raid56 is a little different and device replace dosen't support raid56 > yet, it is not addressed in the patch and I add comments to make sure we will > fix it in the future. > > Reported-by: Qu Wenruo > Signed-off-by: Wang Shilong > Signed-off-by: Miao Xie > --- > fs/btrfs/ctree.h | 9 ++ > fs/btrfs/dev-replace.c | 74 > +- > fs/btrfs/disk-io.c | 12 +++- > fs/btrfs/volumes.c | 30 +++- > fs/btrfs/volumes.h | 1 + > 5 files changed, 111 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 84d4c05..c39ad06 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -351,6 +351,7 @@ static inline unsigned long btrfs_chunk_item_size(int > num_stripes) > #define BTRFS_FS_STATE_ERROR 0 > #define BTRFS_FS_STATE_REMOUNTING1 > #define BTRFS_FS_STATE_TRANS_ABORTED 2 > +#define BTRFS_FS_STATE_DEV_REPLACING 3 > > /* Super block flags */ > /* Errors detected */ > @@ -1674,6 +1675,9 @@ struct btrfs_fs_info { > > atomic_t mutually_exclusive_operation_running; > > + struct percpu_counter bio_counter; > + wait_queue_head_t replace_wait; > + > struct semaphore uuid_tree_rescan_sem; > unsigned int update_uuid_tree_gen:1; > }; > @@ -3991,6 +3995,11 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info, > int btrfs_scrub_progress(struct btrfs_root *root, u64 devid, >struct btrfs_scrub_progress *progress); > > +/* dev-replace.c */ > +void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info); > +void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info); > +void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info); > + > /* reada.c */ > struct reada_control { > struct btrfs_root *root; /* tree to prefetch */ > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c > index b20d59e..ec1c3f3 100644 > --- a/fs/btrfs/dev-replace.c > +++ b/fs/btrfs/dev-replace.c > @@ -431,6 +431,35 @@ leave_no_lock: > return ret; > } > > +/* > + * blocked until all flighting bios are finished. > + */ > +static void btrfs_rm_dev_replace_blocked(struct btrfs_fs_info *fs_info) > +{ > + s64 writers; > + DEFINE_WAIT(wait); > + > + set_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state); > + do { > + prepare_to_wait(&fs_info->replace_wait, &wait, > + TASK_UNINTERRUPTIBLE); > + writers = percpu_counter_sum(&fs_info->bio_counter); > + if (writers) > + schedule(); > + finish_wait(&fs_info->replace_wait, &wait); > + } while (writers); > +} > + > +/* > + * we have removed target device, it is safe to allow new bios request. > + */ > +static void btrfs_rm_dev_replace_unblocked(struct btrfs_fs_info *fs_info) > +{ > + clear_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state); > + if (waitqueue_active(&fs_info->replace_wait)) > + wake_up(&fs_info->replace_wait); > +} > + > static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, > int scrub_ret) > { > @@ -458,12 +487,6 @@ static int btrfs_dev_replace_finishing(struct > btrfs_fs_info *fs_info, > src_device = dev_replace->srcdev; > btrfs_dev_replace_unlock(dev_replace); > > - /* replace old device with new one in mapping tree */ > - if (!scrub_ret) > - btrfs_dev_replace_update_device_in_mapping_tree(fs_info, > -