date:20140207

[PATCH v2] Btrfs: fix a spinlock warning when cleaning up aborted transaction

2014-02-07 Thread Liu Bo

Given now we have 2 spinlock for management of delayed refs,
CONFIG_DEBUG_SPINLOCK=y helped me find this,

[ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
[ 4723.414882]  lock: 0x880048377670, .magic: dead4ead, .owner: 
btrfs-transacti/2258, .owner_cpu: 2
[ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: GW  O 
3.12.0+ #4
[ 4723.421321] Call Trace:
[ 4723.421872]  [] dump_stack+0x54/0x74
[ 4723.422753]  [] spin_dump+0x8c/0x91
[ 4723.424979]  [] spin_bug+0x21/0x26
[ 4723.425846]  [] do_raw_spin_unlock+0x66/0x90
[ 4723.434424]  [] _raw_spin_unlock+0x27/0x40
[ 4723.438747]  [] btrfs_cleanup_one_transaction+0x35e/0x710 
[btrfs]
[ 4723.443321]  [] btrfs_cleanup_transaction+0x104/0x570 
[btrfs]
[ 4723.444692]  [] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 4723.450336]  [] ? trace_hardirqs_on+0xd/0x10
[ 4723.451332]  [] transaction_kthread+0x22e/0x270 [btrfs]
[ 4723.452543]  [] ? btrfs_cleanup_transaction+0x570/0x570 
[btrfs]
[ 4723.457833]  [] kthread+0xea/0xf0
[ 4723.458990]  [] ? kthread_create_on_node+0x140/0x140
[ 4723.460133]  [] ret_from_fork+0x7c/0xb0
[ 4723.460865]  [] ? kthread_create_on_node+0x140/0x140
[ 4723.496521] [ cut here ]

--

The reason is that we get to call cond_resched_lock(&head_ref->lock) while
still holding @delayed_refs->lock.

So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire
dance before and after actually processing delayed refs.

Here we don't drop the lock, others are not able to add new delayed refs to
head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here.

Signed-off-by: Liu Bo 
---
v2: Fix typo in the title, s/lockdep/spinlock/g

 fs/btrfs/disk-io.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index de6a48f..735738b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3853,7 +3853,6 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
rb_erase(&ref->rb_node, &head->ref_root);
atomic_dec(&delayed_refs->num_entries);
btrfs_put_delayed_ref(ref);
-   cond_resched_lock(&head->lock);
}
if (head->must_insert_reserved)
pin_bytes = true;
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Btrfs: fix a lockdep warning when cleaning up aborted transaction

2014-02-07 Thread Liu Bo

Given now we have 2 spinlock for management of delayed refs,
CONFIG_DEBUG_SPINLOCK=y helped me find this,

[ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
[ 4723.414882]  lock: 0x880048377670, .magic: dead4ead, .owner: 
btrfs-transacti/2258, .owner_cpu: 2
[ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: GW  O 
3.12.0+ #4
[ 4723.421321] Call Trace:
[ 4723.421872]  [] dump_stack+0x54/0x74
[ 4723.422753]  [] spin_dump+0x8c/0x91
[ 4723.424979]  [] spin_bug+0x21/0x26
[ 4723.425846]  [] do_raw_spin_unlock+0x66/0x90
[ 4723.434424]  [] _raw_spin_unlock+0x27/0x40
[ 4723.438747]  [] btrfs_cleanup_one_transaction+0x35e/0x710 
[btrfs]
[ 4723.443321]  [] btrfs_cleanup_transaction+0x104/0x570 
[btrfs]
[ 4723.444692]  [] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 4723.450336]  [] ? trace_hardirqs_on+0xd/0x10
[ 4723.451332]  [] transaction_kthread+0x22e/0x270 [btrfs]
[ 4723.452543]  [] ? btrfs_cleanup_transaction+0x570/0x570 
[btrfs]
[ 4723.457833]  [] kthread+0xea/0xf0
[ 4723.458990]  [] ? kthread_create_on_node+0x140/0x140
[ 4723.460133]  [] ret_from_fork+0x7c/0xb0
[ 4723.460865]  [] ? kthread_create_on_node+0x140/0x140
[ 4723.496521] [ cut here ]

--

The reason is that we get to call cond_resched_lock(&head_ref->lock) while
still holding @delayed_refs->lock.

So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire
dance before and after actually processing delayed refs.

Here we don't drop the lock, others are not able to add new delayed refs to
head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here.

Signed-off-by: Liu Bo 
---
 fs/btrfs/disk-io.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index de6a48f..735738b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3853,7 +3853,6 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
rb_erase(&ref->rb_node, &head->ref_root);
atomic_dec(&delayed_refs->num_entries);
btrfs_put_delayed_ref(ref);
-   cond_resched_lock(&head->lock);
}
if (head->must_insert_reserved)
pin_bytes = true;
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Are nocow files snapshot-aware

2014-02-07 Thread Duncan

Kai Krakow posted on Fri, 07 Feb 2014 23:26:34 +0100 as excerpted:

> So the question is: Do btrfs snapshots give the same guarantees on the
> filesystem level that write-barriers give on the storage level which
> exactly those processes rely upon? The cleanest solution would be if
> processes could give btrfs hints about what belongs to their
> transactions so in the moment of a snapshot the data file would be in
> clean state. I guess snapshots are atomic in that way, that pending
> writes will never reach the snapshots just taken, which is good.

Keep in mind that btrfs' metadata is COW-based also.  Like reiser4 in 
this way, in theory at least, commits are atomic -- they've ether made it 
to disk or they haven't, there's no half there.  Commits at the leaf 
level propagate up the tree, and are not finalized until the top-level 
root node is written.  AFAIK if there's dirty data to write, btrfs 
triggers a root node commit every 30 seconds.  Until that root is 
rewritten, it points to the last consistent-state written root node.  
Once it's rewritten, it points to the new one and a new set of writes are 
started, only to be finalized at the next root node write.

And I believe that final write simply updates a pointer to point at the 
latest root node.  There's also a history of root nodes, which is what 
the btrfs-find-root tool uses in combination with btrfs restore, if 
necessary, to find a valid root from the root node pointer log if the 
system crashed in the middle of that final update so the pointer ends up 
pointing at garbage.

Meanwhile, I'm a bit blurry on this but if I understand things correctly, 
between root node writes/full-filesystem-commits there's a log of 
transaction completions at the atomic individual transaction level, such 
that even transactions completed between root node writes can normally be 
replayed.  Of course this is only ~30 seconds worth of activity max, 
since the root node writes should occur every 30 seconds, but this is 
what btrfs-zero-log zeroes out, if/when needed.  You'll lose that few 
seconds of log replay since the last root node write, but if it was 
garbage data due to it being written when the system actually went down, 
dropping those few extra seconds of log can allow the filesystem to mount 
properly from the last full root node commit, where it couldn't, 
otherwise.

It's actually those metadata trees and the atomic root-node commit 
feature that btrfs snapshots depend on, and why they're normally so fast 
to create.  When a snapshot is taken, btrfs simply keeps a record of the 
current root node instead of letting it recede into history and fall off 
the end of the root node log, labeling that record with the name of the 
snapshot for humans as well as the object-ID that btrfs uses.  That root 
node is by definition a record of the filesystem in a consistent state, 
so any snapshot that's a reference to it is similarly by definition in a 
consistent state.

So normally, files in the process of being written out (created) simply 
wouldn't appear in the snapshot.  Of course preexisting files will appear 
(and fallocated files are simply the blanked-out-special-case of 
preexisting), but again, with normal COW-based files at least, will exist 
in a state either before the latest transaction started, or after it 
finished, which of course is where fsync comes in, since that's how 
userspace apps communicate file transactions to the filesystem.

And of course in addition to COW, btrfs normally does checksumming as 
well, and again, the filesystem including that checksumming will be self-
consistent when a root-node is written, or it won't be written until the 
filesystem /is/ self-consistent.  If for whatever reason there's garbage 
when btrfs attempts to read the data back, which is exactly what btrfs 
defines it as if it doesn't pass checksum, btrfs will refuse to use that 
data.  If there's a second copy somewhere (as with raid1 mode), it'll try 
to restore from that second copy.  If it can't, btrfs will return an 
error and simply won't let you access that file.

So one way or another, a snapshot is deterministic and atomic.  No 
partial transactions, at least on ordinary COW and checksummed files.

Which brings us to NOCOW files, where for btrfs NOCOW also turns off 
checksumming.  Btrfs will write these files in-place, and as a result 
there's not the transaction integrity guarantee on these files that there 
is on ordinary files.

*HOWEVER*, the situation isn't as bad as it might seem, because most 
files where NOCOW is recommended, database files, VM images, pre-
allocated torrent files, etc, are created and managed by applications 
that already have their own data integrity management/verification/repair 
methods, since they're designed to work on filesystems without the data 
integrity guarantees btrfs normally provides.

In fact, it's possible, even likely in case of a crash, that the 
application's own data integrity mechanisms can fight with

Re: [PATCH] Btrfs: convert to add transaction protection for btrfs send

2014-02-07 Thread Wang Shilong

Hi Josef,

[..SNIP..]

>> Yeah, very reasonable, if you don't mind, i would give a patch for this 
>> issue.
> Go for it, you'll be faster than I will be, all I do is run xfstests and
> try to reproduce things that will never reproduce for me.

So Finally, i've figured everything out.

Previous send is almostly safe is because we deal every item by transaction 
protection, 
and we will search next item from root again next time (see 
btrfs_search_slot_for_read())
And we will check root's @ctransid to make sure root did not change during 
send. (For readonly root,
usually it should not except snapshot-aware defrag, ok we should fix it!!)

Since we kicked off transaction from send, life will become a little complex,we 
should take care of
everything can change readonly snapshot, from i can say now:

1. snapshot @send_root will cow everything
2. snapshot-aware defrag also work for readonly root.
3. balance, device resize, add,remove, scrub(repair mode)

So there are two ideas in my mind:

#Approach 1
convert to previous ways that means add transaction protection for send. and 
Filipe's optimizations
for send search should be reverted, this way is simple and make codes less 
complex, but we will involve
transaction protection.

#Approach 2.
Add a local flag to root structure to sync snapshot with send.
Add a global flag to sync balance's etc options with send
don't defray readonly snapshot for snapshot-aware defrag.

Approach 2 will make it safe to avoid transaction from send, but it add 
complexity to codes.Personally,
i do not like approach 2 very much because it make btrfs codes more *ugly*  and 
make it *harder* to maintain.

Anyway, I am ok to implement  both ways, Tell me if you have any better ideas 
or which approach is your
appreciated way to fix the issue^_^

Thanks,
Wang
> 
> Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Can you keep reflink relationship during a copy/backup to another filesystem?

2014-02-07 Thread Marc MERLIN

On Wed, Jan 29, 2014 at 08:05:14AM +, Hugo Mills wrote:
> On Tue, Jan 28, 2014 at 11:50:25PM -0800, Marc MERLIN wrote:
> > So I used to use hardlinks to do historical backups of the same filesystem
> > but I know it's preferable to use refllink with btrfs to avoid having too
> > many hardlinks.
> 
>If you use btrfstune to set the "extended inode refs" option on the
> device, then there's no hardlink limit (and the limit was only on the

I'm not too clear about that part:
https://btrfs.wiki.kernel.org/index.php/Btrfstune
doesn't seem to match what you're saying.

> number of hardlinks to the same thing *in the same directory*).

I didn't catch the 'same directory' part. I used hardlinks.py to compare
all my files from many backups across filesystems and hardlink files
together if they were the same.
You're saying that I could have 500 hardlinks to the same file as long as
they are in 500 different directories, and that this is not a performance
problem or limitation for btrfs (maybe anymore? I could swear I had issues
with this in the past).

If the limit is now hardlinking many times to the same file in the same
directory, I don't actually have a need for that use case.

> > But if I need to backup this filesystem to another one some other way than
> > btrfs send/receive (let's say cp -a, tar, or rsync), is it correct to say
> > that reflink relationships will be lost and my data will take more space?
> > 
> > That is unless the target filesytem can do deduplication like btrs now can?
> > (I haven't tried it yet, but it's about time that I do)
> 
>Correct.

Thanks.
I just noticed that on the fly deduplication wasn't in 3.12 yet, so I'm
better off keeping my hardlinks for backups since it's easy to keep that
relationship when copying my backup tree to a backup device (backups of
backups, of course, y'all don't do that? :) ).

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

signature.asc
Description: Digital signature

Re: [PATCH] xfstests: Btrfs: add test for large metadata blocks

2014-02-07 Thread Dave Chinner

On Fri, Feb 07, 2014 at 06:14:45PM +0100, Koen De Wit wrote:
> Tests Btrfs filesystems with all possible metadata block sizes, by
> setting large extended attributes on files.
> 
> Signed-off-by: Koen De Wit 

There's a few things here that need fixing.

> +pagesize=`$here/src/feature -s`
> +pagesize_kb=`expr $pagesize / 1024`
> +
> +# Test all valid leafsizes
> +for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do
> +_scratch_unmount >/dev/null 2>&1

Indentation are tabs, and tabs are 8 spaces in size, please.

> +_scratch_mkfs -l ${leafsize}K >/dev/null
> +_scratch_mount

No need to use _scratch_unmount here - you should be doing a
_check_scratch_fs at the end of the loop.

> +# Calculate the xattr size, but leave 512 bytes for other metadata.
> +xattr_size=`expr $leafsize \* 1024 - 512`
> +
> +touch $SCRATCH_MNT/emptyfile
> +# smallfile will be inlined, bigfile not.
> +$XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile >/dev/null
> +$XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile >/dev/null
> +ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink
> +
> +files=(emptyfile smallfile bigfile bigfile_softlink)
> +chars=(a b c d)
> +for i in `seq 0 1 3`; do
> +char=${chars[$i]}
> +file=$SCRATCH_MNT/${files[$i]}
> +lnkfile=${file}_hardlink
> +ln $file $lnkfile
> +xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' $char`
> +
> +set_md5=`echo -n "$xattr_value" | md5sum`

Just dump the md5sum to the output file.

> +${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file

> +get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum`
> +get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum`

And dump these to the output file, too. Then the golden image
matching when the test is finish will tell you if it passed or not.
i.e:

echo -n "$xattr_value" | md5sum
${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
${ATTR_PROG} -Lq -g attr_$char $file | md5sum
${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum

is all that neds to be done here.

> +# Test attributes with a size larger than the leafsize.
> +# Should result in an error.
> +if [ "$leafsize" -lt "64" ]; then
> +# Bash command lines cannot be larger than 64K characters, so we
> +# do not test attribute values with a size >64KB.
> +xattr_size=`expr $leafsize \* 1024 + 512`
> +xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x`
> +${ATTR_PROG} -q -s attr_toobig -V $xattr_value \
> +$SCRATCH_MNT/emptyfile >> $seqres.full 2>&1
> +if [ "$?" -eq "0" ]; then
> +echo "Expected error, xattr_size is bigger than ${leafsize}K"
> +fi

What you are doing is redirecting the error to $seqres.full
so that it doesn't end up in the output file, then detecting the
absence of an error and dumping a message to the output file to make
the test fail.

IOWs, the ATTR_PROG failure message should be in the golden output
file and you don't have to do anything else to detect a pass/fail
condition.

> +_scratch_unmount
> +
> +# Some illegal leafsizes
> +
> +_scratch_mkfs -l 0 2>> $seqres.full
> +echo $?

Same again - you are dumping the error output into a different file,
then detecting the error manually. pass the output of _scratch_mkfs
through a filter, and let errors cause golden output mismatches.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Are nocow files snapshot-aware

2014-02-07 Thread Kai Krakow

Chris Murphy  schrieb:

> 
> On Feb 7, 2014, at 2:07 PM, Kai Krakow  wrote:
> 
>> Chris Murphy  schrieb:
>> 
 If the database/virtual machine/whatever is crash safe, then the
 atomic state that a snapshot grabs will be useful.
>>> 
>>> How fast is this state fixed on disk from the time of the snapshot
>>> command? Loosely speaking. I'm curious if this is < 1 second; a few
>>> seconds; or possibly up to the 30 second default commit interval? And
>>> also if it's even related to the commit interval time at all?
>> 
>> Such constructs can only be crash-safe if write-barriers are passed down
>> through the cow logic of btrfs to the storage layer. That won't probably
>> ever happen. Atomic and transactional updates cannot happen without
>> write- barriers or synchronous writes. To make it work, you need to
>> design the storage-layers from the ground up to work without
>> write-barriers, like having battery-backed write-caches, synchronous
>> logical file-system layers etc. Otherwise, database/vm/whatever
>> transactional/atomic writes are just having undefined status down at the
>> lowest storage layer.
> 
> This explanation makes sense. But I failed to qualify the "state fixed on
> disk". I'm not concerned about when bits actually arrive on disk. I'm
> wondering what state they describe. So assume no crash or power failure,
> and assume writes eventually make it onto the media without a problem.
> What I'm wondering is, what state of the subvolume I'm snapshotting do I
> end up with? Is there a delay and how long is it, or is it pretty much
> instant? The command completes really quickly even when the file system is
> actively being used, so the feedback is that the snapshot state is
> established very fast but I'm not sure what bearing that has in reality.

I think from that perspective it is more or less the same taking a snapshot 
or cycling the power. For the state of the file consistency it means the 
same, I suppose. I got your argument about "state fixed on disk", but I 
implied from perspective of the writing process it is just the same 
situation: in the moment of the snapshot the data file is in a crashed 
state. That is like cycling the power without having a mechanism to support 
transactional guarantees.

So the question is: Do btrfs snapshots give the same guarantees on the 
filesystem level that write-barriers give on the storage level which exactly 
those processes rely upon? The cleanest solution would be if processes could 
give btrfs hints about what belongs to their transactions so in the moment 
of a snapshot the data file would be in clean state. I guess snapshots are 
atomic in that way, that pending writes will never reach the snapshots just 
taken, which is good.

But what about the ordering of writes? Maybe some younger write requests 
already made it to the disk, while older ones didn't. The file system 
usually only has to care about its own transactional integrity, not those of 
its writing processes, and that is completely unrelated to what the writing 
process expects. Or in other words: A following crash only guarantees that 
the active subvolume being written to is clean from the transactional 
perspective of the process, but the snapshot may be broken. As far as I 
know, user processes cannot tell the filesystem when to issue write-
barriers, it could only issue fsyncs (which hurts performance). Otherwise 
this discussion would be a whole different story.

Did you test how btrfs snapshots perform while running fsync with a lot of 
data to be committed? Could give a clue...

-- 
Replies to list only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

btrfs snapshot is killing IO and hanging my device with delayed writes for 10mn+

2014-02-07 Thread Marc MERLIN

On my workstation, which unfortunately I can't easily upgrade the kernel on,
so it's running 3.8.0 for now, I've had pretty perpelexing hangs from btrfs 
snapshot
when I make a new snapshot every hour.

I see the btrfs snapshot command in iotop for over a minute, and it completes 
eventually.
Other times, it just runs with virtually no IO.

When it takes a while, I see this next in iotop for several minutes:
Total DISK READ:   0.00 B/s | Total DISK WRITE:   2.12 M/s
  TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN IO>COMMAND  
 
 1620 be/4 root0.00 B/s0.00 B/s  0.00 % 99.99 % [btrfs-transacti]
  658 be/3 root0.00 B/s   28.46 K/s  0.00 % 62.35 % [jbd2/dm-0-8]

The device is indeed on top of an LVM, which likely isn't helping performance
(I see only 2.12 M/s of writes in the iotop above):
sda 8:00 931.5G  0 disk  
├─sda1  8:10   243M  0 part  /boot
├─sda2  8:20 1K  0 part  
└─sda5  8:50 931.3G  0 part  
  ├─moremagic-root (dm-0) 252:00  37.3G  0 lvm   /
  ├─moremagic-usr+local (dm-1) 252:10   878G  0 lvm   

Today, it took about 10mn for IO to go back to normal (stop blocking) to that
partition during my noon snapshot. The later ones didn't hang.
Why would a snapshot cause hangs and so much IO?


hdparm shows that basic IO seems ok:
root@polgara:/tmp# hdparm -t /dev/mapper/moremagic-usr+local 
/dev/mapper/moremagic-usr+local:
 Timing buffered disk reads: 566 MB in  3.00 seconds = 188.49 MB/sec

root@polgara:/tmp# btrfs fi show
Label: 'btrfs_pool1'  uuid: 64eafeba-603e-483f-bff9-395b9907b415
Total devices 1 FS bytes used 443.76GB
devid1 size 877.98GB used 541.54GB path /dev/dm-1

root@polgara:/tmp# btrfs device stats /mnt/btrfs_pool1
[/dev/mapper/moremagic-usr+local].write_io_errs   0
[/dev/mapper/moremagic-usr+local].read_io_errs0
[/dev/mapper/moremagic-usr+local].flush_io_errs   0
[/dev/mapper/moremagic-usr+local].corruption_errs 0
[/dev/mapper/moremagic-usr+local].generation_errs 0

Is there other data I can provide?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Are nocow files snapshot-aware

2014-02-07 Thread Kai Krakow

Duncan <1i5t5.dun...@cox.net> schrieb:

>> The question here is: Does it really make sense to create such snapshots
>> of disk images currently online and running a system. They will probably
>> be broken anyway after rollback - or at least I'd not fully trust the
>> contents.
>> 
>> VM images should not be part of a subvolume of which snapshots are taken
>> at a regular and short interval. The problem will go away if you follow
>> this rule.
>> 
>> The same applies to probably any kind of file which you make nocow -
>> e.g. database files. The only use case is taking _controlled_ snapshots
>> - and doing it all 30 seconds is by all means NOT controlled, it's
>> completely undeterministic.
> 
> I'd absolutely agree -- and that wasn't my report, I'm just recalling it,
> as at the time I didn't understand the interaction between NOCOW and
> snapshots and couldn't quite understand how a NOCOW file was still
> triggering the snapshot-aware-defrag pathology, which in fact we were
> just beginning to realize based on such reports.

Sorry, didn't mean to push it to you. ;-) I just wanted to give some 
pointers to rethink such practices for people stumpling upon this.

> But some of the snapshotting scripts out there, and the admins running
> them, seem to have the idea that just because it's possible it must be
> done, and they have snapshots taken every minute or more frequently, with
> no automated snapshot thinning at all.  IMO that's pathology run amok
> even if btrfs /was/ stable and mature and /could/ handle it properly.

Yeah, people should stop such "bullshit practice" (sorry), no matter if 
there's a technical problem with it. It does not give the protection they 
intended to give. It's just wrong sense for security/safety... There _may_ 
be actual use cases for doing it - but generally I'd suggest it's plain 
wrong.

> That's regardless of the content so it's from a different angle than you
> were attacking the problem from...  But if admins aren't able to
> recognize the problem with per-minute snapshots without any thinning at
> all for days, weeks, months on end, I doubt they'll be any better at
> recognizing that VMs, databases, etc, should have a dedicated subvolume.

True.

> But be that as it may, since such extreme snapshotting /is/ possible, and
> with automation and downloadable snapper scripts somebody WILL be doing
> it, btrfs should scale to it if it is to be considered mature and
> stable.  People don't want a filesystem that's going to fall over on them
> and lose data or simply become unworkably live-locked just because they
> didn't know what they were doing when they setup the snapper script and
> set it to 1 minute snaps without any corresponding thinning after an hour
> or a day or whatever.

Such, uhm, sorry, "bullshit practice" should not be a high priority on the 
fix-list for btrfs. There are other areas. It's a technical problem, yes, 
but I think there are more important ones than brute-forcing problems out of 
btrfs that are never being hit by normal usage patterns.

It is good that such "tests" are done, but I would not understand how people 
can expect they need such a "feature" - now and at once. Such tests are not 
ready to leave the development sandbox yet.

>From a normal use perspective, doing such heavy snapshotting is probably 
almost always nonsense.

I'd be more interested in how btrfs behaves in highly io loaded server 
patterns. One interesting use case for me would be to use btrfs as the 
building block of a system with container virtualization (docker, lxc), 
making a high vm density on the machine (with the io load and unpredictable 
io bahavior that internet-facing servers apply to their storage layer), 
using btrfs snapshots to instantly create new vms from vm templates living 
in subvolumes (thin provisioning), spreading btrfs across a higher number of 
disks as the average desktop user / standard server has. I think this is one 
of many very interesting use cases for btrfs and its capabilities. And this 
is how we get back to my initial question: In such a scenario I'd like to 
take ro snapshots of all machines (which probably host nocow files for 
databases), send these to a backup server at low io-priority, then remove 
the snapshots. Apparently, btrfs send/receive is still far from being stable 
and bullet-proof from what I read here, so the destination would probably be 
another btrfs or zfs, using inplace-rsync backups and snapshotting for 
backlog.

-- 
Replies to list only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Are nocow files snapshot-aware

2014-02-07 Thread Chris Murphy

On Feb 7, 2014, at 2:07 PM, Kai Krakow  wrote:

> Chris Murphy  schrieb:
> 
>>> If the database/virtual machine/whatever is crash safe, then the
>>> atomic state that a snapshot grabs will be useful.
>> 
>> How fast is this state fixed on disk from the time of the snapshot
>> command? Loosely speaking. I'm curious if this is < 1 second; a few
>> seconds; or possibly up to the 30 second default commit interval? And also
>> if it's even related to the commit interval time at all?
> 
> Such constructs can only be crash-safe if write-barriers are passed down 
> through the cow logic of btrfs to the storage layer. That won't probably 
> ever happen. Atomic and transactional updates cannot happen without write-
> barriers or synchronous writes. To make it work, you need to design the 
> storage-layers from the ground up to work without write-barriers, like 
> having battery-backed write-caches, synchronous logical file-system layers 
> etc. Otherwise, database/vm/whatever transactional/atomic writes are just 
> having undefined status down at the lowest storage layer.

This explanation makes sense. But I failed to qualify the "state fixed on 
disk". I'm not concerned about when bits actually arrive on disk. I'm wondering 
what state they describe. So assume no crash or power failure, and assume 
writes eventually make it onto the media without a problem. What I'm wondering 
is, what state of the subvolume I'm snapshotting do I end up with? Is there a 
delay and how long is it, or is it pretty much instant? The command completes 
really quickly even when the file system is actively being used, so the 
feedback is that the snapshot state is established very fast but I'm not sure 
what bearing that has in reality.

Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038

2014-02-07 Thread David Rientjes

On Fri, 7 Feb 2014, Fengguang Wu wrote:

> On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote:
> > On Fri, 7 Feb 2014, Fengguang Wu wrote:
> > 
> > > [1.625020] BTRFS: selftest: Running btrfs_split_item tests
> > > [1.627004] BTRFS: selftest: Running find delalloc tests
> > > [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
> > > [  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, 
> > > oom_score_adj=0
> > > [  292.086439] kthreadd cpuset=
> > > [  292.087072] BUG: unable to handle kernel NULL pointer dereference at 
> > > 0038
> > > [  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c
> > 
> > This looks like a problem with the cpuset cgroup name, are you sure this 
> > isn't related to the removal of cgroup->name?
> 
> It looks not related to patch "cgroup: remove cgroup->name", because
> that patch lies in the cgroup tree and not contained in output of "git log 
> BAD_COMMIT".
> 

It's dying on pr_cont_kernfs_name which is some tree that has "kernfs: 
implement kernfs_get_parent(), kernfs_name/path() and friends", which is 
not in linux-next, and is obviously printing the cpuset cgroup name.

It doesn't look like it has anything at all to do with btrfs or why they 
would care about this failure.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Are nocow files snapshot-aware

2014-02-07 Thread Kai Krakow

Chris Murphy  schrieb:

>> If the database/virtual machine/whatever is crash safe, then the
>> atomic state that a snapshot grabs will be useful.
> 
> How fast is this state fixed on disk from the time of the snapshot
> command? Loosely speaking. I'm curious if this is < 1 second; a few
> seconds; or possibly up to the 30 second default commit interval? And also
> if it's even related to the commit interval time at all?

Such constructs can only be crash-safe if write-barriers are passed down 
through the cow logic of btrfs to the storage layer. That won't probably 
ever happen. Atomic and transactional updates cannot happen without write-
barriers or synchronous writes. To make it work, you need to design the 
storage-layers from the ground up to work without write-barriers, like 
having battery-backed write-caches, synchronous logical file-system layers 
etc. Otherwise, database/vm/whatever transactional/atomic writes are just 
having undefined status down at the lowest storage layer.

> I'm also curious what happens to files that are presently writing. e.g.
> I'm writing a 1GB file to subvol A and before it completes I snapshot
> subvol A into A.1. If I go find the file I was writing to, in A.1, what's
> its state? Truncated? Or or are in-progress writes permitted to complete
> if it's a rw snapshot? Any difference in behavior if it's an ro snapshot?

I wondered that many times, too. What happens to files being written to? I 
suppose, at the time of snapshotting it's taking the current state of the 
blocks as they are, ignoring pending writes. This means, the file being 
written to is probably in limbo state.

For example, xfs has an option to freeze the file system to take atomic 
snapshots. You can use that feature to take consistent snapshots of MySQL 
InnoDB files to create a hot-copy backup of it. But: You need to instruct 
MySQL first to complete its transactions and pausing before running 
xfs_freeze, then after that's done, you can resume MySQL operations. That 
clearly tells me that it is probably not safe to take snapshots of online 
databases, even if they are crash-safe (and by what I know, InnoDB is 
designed to be crash-safe).

A solution, probably far-future, could be that a btrfs snapshot would inform 
all current file-writers to complete transactions and atomic operations and 
wait until each one signals a ready state, then take the snapshot, then 
signal the processes to resume operations. For this, the btrfs driver could 
offer some sort of subscription, similar to what inotify offers. Processes 
subscribe to some sort of notification broadcasts, btrfs can wait for every 
process to report an integral file state. If I remember right, reiser4 
offered some similar feature (approaching the problem from the opposite 
side): processes were offered an interface to start and commit transactions 
within reiser4. If btrfs had such information from file-writers, it could 
take consistent snapshots of online databases/vms/whatever (given, that in 
the vm case the guest could pass this information to the host). Whatever 
approach is taken, however, it will make the time needed to create snapshots 
undeterministic, processes may not finish their transactions within a 
reasonable time...

-- 
Replies to list only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Provide a better free space estimate on RAID1

2014-02-07 Thread Kai Krakow

Josef Bacik  schrieb:

> 
> On 02/05/2014 03:15 PM, Roman Mamedov wrote:
>> Hello,
>>
>> On a freshly-created RAID1 filesystem of two 1TB disks:
>>
>> # df -h /mnt/p2/
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sda2   1.8T  1.1M  1.8T   1% /mnt/p2
>>
>> I cannot write 2TB of user data to that RAID1, so this estimate is
>> clearly misleading. I got tired of looking at the bogus disk free space
>> on all my RAID1 btrfs systems, so today I decided to do something about
>> this:
>>
>> --- fs/btrfs/super.c.orig2014-02-06 01:28:36.636164982 +0600
>> +++ fs/btrfs/super.c 2014-02-06 01:28:58.304164370 +0600
>> @@ -1481,6 +1481,11 @@
>>   }
>>   
>>   kfree(devices_info);
>> +
>> +if (type & BTRFS_BLOCK_GROUP_RAID1) {
>> +do_div(avail_space, min_stripes);
>> +}
>> +
>>   *free_bytes = avail_space;
>>   return 0;
>>   }
> 
> This needs to be more flexible, and also this causes the problem where
> now you show the actual usable amount of space _but_ you are also
> showing twice the amount of used space.  I'm ok with going in this
> direction, but we need to convert everybody over so it works for raid10
> as well and the used values need to be adjusted.  Thanks,

It should show the raw space available. Btrfs also supports compression and 
doesn't try to be smart about how much compressed data would fit in the free 
space of the drive. If one is using RAID1, it's supposed to fill up with a 
rate of 2:1. If one is using compression, it's supposed to fill up with a 
rate of maybe 1:5 for mostly text files.

However, btrfs should probably provide its own df utility (like "btrfs fi 
df") which is smart about disk usage and tries to predict the usable space. 
But df should stay with actually showing the _free_ space, not _usable_ 
space (the latter is unpredictable anyway based on the usage patterns 
applied to the drive, so it can be a rough guess as its best, like looking 
into the crystal ball).

The point here is about "free" vs. "usable" space, the first being a known 
number, the latter only being a prediction based on current usage. I'd like 
to stay with "free space" actually showing raw free space.

-- 
Replies to list only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Btrfs: don't loop forever if we can't run because of the tree mod log

2014-02-07 Thread Josef Bacik

A user reported a 100% cpu hang with my new delayed ref code.  Turns out I
forgot to increase the count check when we can't run a delayed ref because of
the tree mod log.  If we can't run any delayed refs during this there is no
point in continuing to look, and we need to break out.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c9ecc9..32312e0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2385,6 +2385,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
spin_unlock(&delayed_refs->lock);
locked_ref = NULL;
cond_resched();
+   count++;
continue;
}
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: Preserve process_one_leaf return value.

2014-02-07 Thread Mitch Harder

The return value in process_one_leaf could be over-written while
looping over the items in the leaf.

This patch will preserve a non-zero return value to the calling
function if a non-zero return value is encountered in the loop.

The return value of one (1) is consistent with non-zero values
that could be returned while processing the leaf.

The only caller of this function (walk_down_tree) would ignore
the return value anyway.  But this patch will correct the
behaviour in case future changes intend to utilize the return
value.

Signed-off-by: Mitch Harder 
---
 cmds-check.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 2911af0..eef7c6c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1219,6 +1219,7 @@ static int process_one_leaf(struct btrfs_root *root, 
struct extent_buffer *eb,
u32 nritems;
int i;
int ret = 0;
+   int error = 0;
struct cache_tree *inode_cache;
struct shared_node *active_node;
 
@@ -1268,8 +1269,10 @@ static int process_one_leaf(struct btrfs_root *root, 
struct extent_buffer *eb,
default:
break;
};
+   if (ret != 0)
+   error = 1;
}
-   return ret;
+   return error;
 }
 
 static void reada_walk_down(struct btrfs_root *root,
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Provide a better free space estimate on RAID1

2014-02-07 Thread Chris Murphy

On Feb 6, 2014, at 11:08 PM, Roman Mamedov  wrote:

>  And what
> if I am accessing that partition on a server via a network CIFS/NFS share and
> don't even *have a way to find out* any of that.

That's the strongest argument. And if the user is using 
Explorer/Finder/Nautilus to copy files to the share, I'm pretty sure all three 
determine if there's enough free space in advance of starting the copy. So if 
it thinks there's free space, it will start to copy and then later fail 
midstream when there's no more space. And then the user's copy task is in a 
questionable state as to what's been copied, depending on how the file copies 
are being threaded.

And due to Btrfs metadata requirements even when deleting, we actually need an 
Avail estimate that accounts for "phantom future metadata" as if it's currently 
in use, otherwise we don't really have the right indication of whether or not 
files can be copied. 

Chris Murphy--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: integration-20140206 build failure

2014-02-07 Thread WorMzy Tykashi

On 7 February 2014 17:54, David Sterba  wrote:
> On Fri, Feb 07, 2014 at 12:38:22AM +, WorMzy Tykashi wrote:
>> Building the latest integration snapshot, I get the following failure:
>>
>> 
>> ...
>> [CC] cmds-filesystem.o
>> cmds-filesystem.c: In function 'cmd_show':
>> cmds-filesystem.c:740:12: error: invalid storage class for function 
>> 'cmd_sync'
>>  static int cmd_sync(int argc, char **argv)
>> ^
>> cmds-filesystem.c:770:12: error: invalid storage class for function
>
> Thanks. A mismerged patch, fixed in next integration.

Cheers.

I did briefly get the following failure from integration-20140207:

[CC] cmds-filesystem.o
cmds-filesystem.c: In function 'handle_print':
cmds-filesystem.c:434:5: error: too few arguments to function 'print_one_fs'
 space_info_arg, label, mnt);
 ^
cmds-filesystem.c:361:12: note: declared here
 static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
^
Makefile:113: recipe for target 'cmds-filesystem.o' failed
make: *** [cmds-filesystem.o] Error 1

Though I can't reproduce it now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: integration-20140206 build failure

2014-02-07 Thread David Sterba

On Fri, Feb 07, 2014 at 12:38:22AM +, WorMzy Tykashi wrote:
> Building the latest integration snapshot, I get the following failure:
> 
> 
> ...
> [CC] cmds-filesystem.o
> cmds-filesystem.c: In function 'cmd_show':
> cmds-filesystem.c:740:12: error: invalid storage class for function 'cmd_sync'
>  static int cmd_sync(int argc, char **argv)
> ^
> cmds-filesystem.c:770:12: error: invalid storage class for function

Thanks. A mismerged patch, fixed in next integration.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: lost with degraded RAID1

2014-02-07 Thread Chris Murphy

On Feb 7, 2014, at 4:34 AM, Johan Kröckel  wrote:

> Is there anything else I should do with this setup or may I nuke the
> two partitions and reuse them?

Well I'm pretty sure once you run 'btrfs check --repair' that you've hit the 
end of the road. Possibly btrfs restore can still extract some files, it might 
be worth testing whether that works.

Otherwise blow it away. I'd say test with 3.14-rc2 with a new file system and 
see if you can reproduce the sequence that caused this problem in the first 
place. If it's reproducible, I think there's a bug here somewhere.

Chris Murphy--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Btrfs: unlock extent and pages on error in cow_file_range

2014-02-07 Thread Josef Bacik

When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I
made it so we just return an error instead of unlocking all of our various
stuff.  This is a mistake and causes us to hang when we run into this.  This
patch fixes this problem.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d44ff9f..dbd996a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -864,7 +864,8 @@ static noinline int cow_file_range(struct inode *inode,
 
if (btrfs_is_free_space_inode(inode)) {
WARN_ON_ONCE(1);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_unlock;
}
 
num_bytes = ALIGN(end - start + 1, blocksize);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] xfstests: Btrfs: add test for large metadata blocks

2014-02-07 Thread Koen De Wit

Tests Btrfs filesystems with all possible metadata block sizes, by
setting large extended attributes on files.

Signed-off-by: Koen De Wit 
---
 tests/btrfs/036 |  128 +++
 tests/btrfs/036.out |7 +++
 tests/btrfs/group   |1 +
 3 files changed, 136 insertions(+), 0 deletions(-)
 create mode 100644 tests/btrfs/036
 create mode 100644 tests/btrfs/036.out

diff --git a/tests/btrfs/036 b/tests/btrfs/036
new file mode 100644
index 000..0f8287a
--- /dev/null
+++ b/tests/btrfs/036
@@ -0,0 +1,128 @@
+#! /bin/bash
+# FS QA Test No. 036
+#
+# Tests large metadata blocks in btrfs, which allows large extended
+# attributes.
+#
+#---
+# Copyright (c) 2014, Oracle and/or its affiliates.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1   # failure is the default!
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+rm -f $seqres.full
+
+pagesize=`$here/src/feature -s`
+pagesize_kb=`expr $pagesize / 1024`
+
+# Test all valid leafsizes
+for leafsize in `seq $pagesize_kb $pagesize_kb 64`; do
+_scratch_unmount >/dev/null 2>&1
+_scratch_mkfs -l ${leafsize}K >/dev/null
+_scratch_mount
+# Calculate the xattr size, but leave 512 bytes for other metadata.
+xattr_size=`expr $leafsize \* 1024 - 512`
+
+touch $SCRATCH_MNT/emptyfile
+# smallfile will be inlined, bigfile not.
+$XFS_IO_PROG -f -c "pwrite 0 100" $SCRATCH_MNT/smallfile >/dev/null
+$XFS_IO_PROG -f -c "pwrite 0 9000" $SCRATCH_MNT/bigfile >/dev/null
+ln -s $SCRATCH_MNT/bigfile $SCRATCH_MNT/bigfile_softlink
+
+files=(emptyfile smallfile bigfile bigfile_softlink)
+chars=(a b c d)
+for i in `seq 0 1 3`; do
+char=${chars[$i]}
+file=$SCRATCH_MNT/${files[$i]}
+lnkfile=${file}_hardlink
+ln $file $lnkfile
+xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' $char`
+
+set_md5=`echo -n "$xattr_value" | md5sum`
+${ATTR_PROG} -Lq -s attr_$char -V $xattr_value $file
+get_md5=`${ATTR_PROG} -Lq -g attr_$char $file | md5sum`
+get_ln_md5=`${ATTR_PROG} -Lq -g attr_$char $lnkfile | md5sum`
+
+# Using md5sums for comparison instead of the values themselves
+# because bash command lines cannot be larger than 64K chars.
+if [ "$set_md5" != "$get_md5" ]; then
+echo "Got unexpected xattr value for attr_$char"
+echo "from file $file . (leafsize is ${leafsize}K)"
+fi
+if [ "$set_md5" != "$get_ln_md5" ]; then
+echo "Value for attr_$char differs for $file and"
+echo "$lnkfile . (leafsize is ${leafsize}K)"
+fi
+done
+
+# Test attributes with a size larger than the leafsize.
+# Should result in an error.
+if [ "$leafsize" -lt "64" ]; then
+# Bash command lines cannot be larger than 64K characters, so we
+# do not test attribute values with a size >64KB.
+xattr_size=`expr $leafsize \* 1024 + 512`
+xattr_value=`head -c $xattr_size < /dev/zero | tr '\0' x`
+${ATTR_PROG} -q -s attr_toobig -V $xattr_value \
+$SCRATCH_MNT/emptyfile >> $seqres.full 2>&1
+if [ "$?" -eq "0" ]; then
+echo "Expected error, xattr_size is bigger than ${leafsize}K"
+fi
+fi
+
+done
+
+# Illegal attribute name (more than 256 characters)
+attr_name=`head -c 260 < /dev/zero | tr '\0' n`
+${ATTR_PROG} -s $attr_name -V attribute_name_too_big \
+$SCRATCH_MNT/emptyfile 2>&1 | head -n 1
+
+_scratch_unmount
+
+# Some illegal leafsizes
+
+_scratch_mkfs -l 0 2>> $seqres.full
+echo $?
+
+_scratch_mkfs -l 5678 2>> $seqres.full
+echo $?
+
+_scratch_mkfs -l `expr $pagesize / 2 + $pagesize` 2>> $seqres.full
+echo $?
+
+_scratch_mkfs -l 128K 2>> $seqres.full
+echo $?
+
+_scratch_mkfs -l K
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/036.out b/tests/btrfs/036.out
new file mode 100644
index 000..1d9bdfb
--- /dev/null
+++ b/tests/btrfs/036.out
@@ -0,0 +1,7 @@
+

[PATCH] btrfs-progs: add dry-run option to restore command

2014-02-07 Thread Justin Maggard

Sometimes it is useful to see what btrfs restore is going to do
before provisioning enough external storage to restore onto.
Add a dry-run option so we can see what files and paths are found
by restore, without actually restoring any data.
---
 cmds-restore.c |   14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 1748262..5116af0 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -53,6 +53,7 @@ static int verbose = 0;
 static int ignore_errors = 0;
 static int overwrite = 0;
 static int get_xattrs = 0;
+static int dry_run = 0;
 
 #define LZO_LEN 4
 #define PAGE_CACHE_SIZE 4096
@@ -801,6 +802,8 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
}
if (verbose)
printf("Restoring %s\n", path_name);
+   if (dry_run)
+   goto next;
fd = open(path_name, O_CREAT|O_WRONLY, 0644);
if (fd < 0) {
fprintf(stderr, "Error creating %s: %d\n",
@@ -873,7 +876,10 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
printf("Restoring %s\n", path_name);
 
errno = 0;
-   ret = mkdir(path_name, 0755);
+   if (dry_run)
+   ret = 0;
+   else
+   ret = mkdir(path_name, 0755);
if (ret && errno != EEXIST) {
free(dir);
fprintf(stderr, "Error mkdiring %s: %d\n",
@@ -1108,6 +1114,7 @@ const char * const cmd_restore_usage[] = {
"-r  root objectid",
"-d  find dir",
"-l  list tree roots",
+   "-D  dry run (only list files that would be recovered)",
"--path-regex ",
"restore only filenames matching regex,",
"you have to use following syntax (possibly quoted):",
@@ -1135,7 +1142,7 @@ int cmd_restore(int argc, char **argv)
regex_t match_reg, *mreg = NULL;
char reg_err[256];
 
-   while ((opt = getopt_long(argc, argv, "sxviot:u:df:r:lc", long_options,
+   while ((opt = getopt_long(argc, argv, "sxviot:u:df:r:lDc", long_options,
&option_index)) != -1) {
 
switch (opt) {
@@ -1191,6 +1198,9 @@ int cmd_restore(int argc, char **argv)
case 'l':
list_roots = 1;
break;
+   case 'D':
+   dry_run = 1;
+   break;
case 'c':
match_cflags |= REG_ICASE;
break;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038

2014-02-07 Thread Filipe David Manana

On Fri, Feb 7, 2014 at 3:10 PM, Chris Mason  wrote:
> On Fri 07 Feb 2014 07:10:38 AM EST, Fengguang Wu wrote:
>>
>> On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote:
>>>
>>> On Fri, 7 Feb 2014, Fengguang Wu wrote:
>>>
 [1.625020] BTRFS: selftest: Running btrfs_split_item tests
 [1.627004] BTRFS: selftest: Running find delalloc tests
 [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
 [  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1,
 oom_score_adj=0
 [  292.086439] kthreadd cpuset=
 [  292.087072] BUG: unable to handle kernel NULL pointer dereference at
 0038
 [  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c
>>>
>>>
>>> This looks like a problem with the cpuset cgroup name, are you sure this
>>> isn't related to the removal of cgroup->name?
>>
>>
>> It looks not related to patch "cgroup: remove cgroup->name", because
>> that patch lies in the cgroup tree and not contained in output of "git log
>> BAD_COMMIT".
>
>
> Still not sure exactly what is going on, but I can't trigger it here.  My
> first guess is that it is related to having btrfs static, some part of our
> init is happening at the wrong time, and the self tests are swooping in and
> causing trouble.

I couldn't reproduce it either so far, neither on a physical machine
nor in a vm (qemu+kvm) (with CONFIG_BTRFS_FS=y, CONFIG_CRYPTO_CRC32C=y
and CONFIG_CRYPTO_CRC32C_INTEL=y).
If you disable CONFIG_BTRFS_FS_RUN_SANITY_TESTS, does it still crash?

thanks

>
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: Change BUG() to use assert.

2014-02-07 Thread Mitch Harder

Change the definition of BUG() to use assert instead of abort to
provide information about the location of the issue.

Signed-off-by: Mitch Harder 
---
 kerncompat.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kerncompat.h b/kerncompat.h
index 1fc2b34..f370cd8 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -50,7 +50,7 @@
 #define ULONG_MAX   (~0UL)
 #endif
 
-#define BUG() abort()
+#define BUG() assert(0)
 #ifdef __CHECKER__
 #define __force__attribute__((force))
 #define __bitwise__ __attribute__((bitwise))
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038

2014-02-07 Thread Chris Mason


On Fri 07 Feb 2014 07:10:38 AM EST, Fengguang Wu wrote:

On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote:

On Fri, 7 Feb 2014, Fengguang Wu wrote:


[1.625020] BTRFS: selftest: Running btrfs_split_item tests
[1.627004] BTRFS: selftest: Running find delalloc tests
[2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
[  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, 
oom_score_adj=0
[  292.086439] kthreadd cpuset=
[  292.087072] BUG: unable to handle kernel NULL pointer dereference at 
0038
[  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c


This looks like a problem with the cpuset cgroup name, are you sure this
isn't related to the removal of cgroup->name?


It looks not related to patch "cgroup: remove cgroup->name", because
that patch lies in the cgroup tree and not contained in output of "git log 
BAD_COMMIT".


Still not sure exactly what is going on, but I can't trigger it here.  
My first guess is that it is related to having btrfs static, some part 
of our init is happening at the wrong time, and the self tests are 
swooping in and causing trouble.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Provide a better free space estimate on RAID1

2014-02-07 Thread Frank Kingswood


On 06/02/14 19:54, Goffredo Baroncelli wrote:

Hi Roman

On 02/06/2014 01:45 PM, Roman Mamedov wrote:

There's not a lot of code to include (as my 3-line patch demonstrates), it
could just as easily be removed when it's obsolete. But I did not have any
high hopes of defeating the "broken by design" philosophy, that's why I didn't
submit it as a real patch for inclusion but rather just as a helpful hint for
people to add to their own kernels if they want this change to happen.


I agree with you about the needing of a solution. However your patch to me 
seems even worse than the actual code.

For example you cannot take in account the mix of data/linear and metadata/dup 
(with the pathological case of small files stored in the metadata chunks ), nor 
different profile level like raid5/6 (or the future raidNxM)
And do not forget the compression...


Just because the solution that Roman provided is not perfect does not 
mean that it is no good at all. For common use cases this will give a 
much better estimate of free space than the current code does, at a 
trivial cost in code size.


It has the benefit of giving a simple estimate without doing any further 
work or disk activity (no need to walk all chunks).


Adding a couple of more lines of code will make it work equally well 
with other RAID levels, maybe that would be more acceptable?


Frank

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4] xfstests/btrfs: add a regression test for running snapshot and send concurrently

2014-02-07 Thread Wang Shilong

From: Wang Shilong 

Btrfs would fail to send if snapshot run concurrently, this test is to make
sure we have fixed the bug.

Signed-off-by: Wang Shilong 
---
v3->v4:
  avoid to use silent send command(thanks to Dave Chinner)
  rename patch's subject from Btrfs to xfstest/btrfs(thanks to Anand)
  make sure background snapshots exists before killing it. 

v2->v3:
  make sure we kill backgrund snapshots on test failure.

v1-v2:
  Avoid race codes and a code style update(Thanks to Dave Chinner's comments)
---
 tests/btrfs/034 | 85 +
 tests/btrfs/034.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 88 insertions(+)
 create mode 100644 tests/btrfs/034
 create mode 100644 tests/btrfs/034.out

diff --git a/tests/btrfs/034 b/tests/btrfs/034
new file mode 100644
index 000..06c796e
--- /dev/null
+++ b/tests/btrfs/034
@@ -0,0 +1,85 @@
+#!/bin/bash
+# FS QA Test No. btrfs/034
+#
+# Regression test for running snapshots and send concurrently.
+#
+#---
+# Copyright (c) 2014 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+snapshots_pid=0
+
+_cleanup()
+{
+   # kill backgroud snapshots
+   if [ $snapshots_pid -ne 0 ] && ps -p $snapshots_pid | grep -q 
$snapshots_pid; then
+   kill -TERM $snapshots_pid 2> /dev/null
+   fi
+   rm -f $tmp.*
+}
+
+do_snapshots()
+{
+   i=2
+   while [ 1 ]
+   do
+   $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT/snap_1 \
+   $SCRATCH_MNT/snap_$i >> $seqres.full 2>&1
+   let i=$i+1
+   sleep 1
+   done
+}
+
+trap "_cleanup ; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs > /dev/null 2>&1
+_scratch_mount
+
+touch $SCRATCH_MNT/foo
+
+# get file with fragments by using backwards writes.
+for i in `seq 10240 -1 1`; do
+   $XFS_IO_PROG -f -d -c "pwrite $(($i * 4096)) 4096" \
+   $SCRATCH_MNT/foo > /dev/null | _filter_xfs_io
+done
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/snap_1 >> $seqres.full 2>&1
+
+do_snapshots &
+snapshots_pid=$!
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 -f /dev/null 2>&1 | _filter_scratch
+
+status=0 ; exit
diff --git a/tests/btrfs/034.out b/tests/btrfs/034.out
new file mode 100644
index 000..9ed31e8
--- /dev/null
+++ b/tests/btrfs/034.out
@@ -0,0 +1,2 @@
+QA output created by 034
+At subvol SCRATCH_MNT/snap_1
diff --git a/tests/btrfs/group b/tests/btrfs/group
index b29236c..f9f062f 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -36,3 +36,4 @@
 031 auto quick
 032 auto quick
 033 auto quick
+034 auto quick
-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/4] btrfs: reserve no transaction units in btrfs_ioctl_set_features

2014-02-07 Thread David Sterba

Added in patch "btrfs: add ioctls to query/change feature bits online"
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.

Signed-off-by: David Sterba 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8e48b81e1515..383ab455bfa7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4668,7 +4668,7 @@ static int btrfs_ioctl_set_features(struct file *file, 
void __user *arg)
if (ret)
return ret;
 
-   trans = btrfs_start_transaction(root, 1);
+   trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans))
return PTR_ERR(trans);
 
-- 
1.8.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4][RFC] btrfs: export global block reserve size as space_info

2014-02-07 Thread David Sterba

Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.

The unpatched userspace tools will show the blockgroup as 'unknown'.

CC: Jeff Mahoney 
CC: Josef Bacik 
Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h |  9 -
 fs/btrfs/ioctl.c | 20 
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c1a42ca519f..8bf1890ea21f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -985,7 +985,8 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10   (1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5 (1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
-#define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE
+#define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
+BTRFS_SPACE_INFO_GLOBAL_RSV)
 
 enum btrfs_raid_types {
BTRFS_RAID_RAID10,
@@ -1017,6 +1018,12 @@ enum btrfs_raid_types {
  */
 #define BTRFS_AVAIL_ALLOC_BIT_SINGLE   (1ULL << 48)
 
+/*
+ * A fake block group type that is used to communicate global block reserve
+ * size to userspace via the SPACE_INFO ioctl.
+ */
+#define BTRFS_SPACE_INFO_GLOBAL_RSV(1ULL << 49)
+
 #define BTRFS_EXTENDED_PROFILE_MASK(BTRFS_BLOCK_GROUP_PROFILE_MASK | \
 BTRFS_AVAIL_ALLOC_BIT_SINGLE)
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 383ab455bfa7..0d938f8f173e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3465,6 +3465,11 @@ static long btrfs_ioctl_space_info(struct btrfs_root 
*root, void __user *arg)
up_read(&info->groups_sem);
}
 
+   /*
+* Global block reserve, exported as a space_info
+*/
+   slot_count++;
+
/* space_slots == 0 means they are asking for a count */
if (space_args.space_slots == 0) {
space_args.total_spaces = slot_count;
@@ -3523,6 +3528,21 @@ static long btrfs_ioctl_space_info(struct btrfs_root 
*root, void __user *arg)
up_read(&info->groups_sem);
}
 
+   /*
+* Add global block reserve
+*/
+   if (slot_count) {
+   struct btrfs_block_rsv *block_rsv = 
&root->fs_info->global_block_rsv;
+
+   spin_lock(&block_rsv->lock);
+   space.total_bytes = block_rsv->size;
+   space.used_bytes = block_rsv->size - block_rsv->reserved;
+   spin_unlock(&block_rsv->lock);
+   space.flags = BTRFS_SPACE_INFO_GLOBAL_RSV;
+   memcpy(dest, &space, sizeof(space));
+   space_args.total_spaces++;
+   }
+
user_dest = (struct btrfs_ioctl_space_info __user *)
(arg + sizeof(struct btrfs_ioctl_space_args));
 
-- 
1.8.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/4] btrfs: commit transaction after setting label and features

2014-02-07 Thread David Sterba

From: Jeff Mahoney 

The set_fslabel ioctl uses btrfs_end_transaction, which means it's
possible that the change will be lost if the system crashes, same for
the newly set features. Let's use btrfs_commit_transaction instead.

Signed-off-by: Jeff Mahoney 
Signed-off-by: David Sterba 
---
 fs/btrfs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b0134892dc70..8e48b81e1515 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4525,7 +4525,7 @@ static int btrfs_ioctl_set_fslabel(struct file *file, 
void __user *arg)
spin_lock(&root->fs_info->super_lock);
strcpy(super_block->label, label);
spin_unlock(&root->fs_info->super_lock);
-   ret = btrfs_end_transaction(trans, root);
+   ret = btrfs_commit_transaction(trans, root);
 
 out_unlock:
mnt_drop_write_file(file);
@@ -4689,7 +4689,7 @@ static int btrfs_ioctl_set_features(struct file *file, 
void __user *arg)
btrfs_set_super_incompat_flags(super_block, newflags);
spin_unlock(&root->fs_info->super_lock);
 
-   return btrfs_end_transaction(trans, root);
+   return btrfs_commit_transaction(trans, root);
 }
 
 long btrfs_ioctl(struct file *file, unsigned int
-- 
1.8.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/4] Revert "btrfs: add ioctl to export size of global metadata reservation"

2014-02-07 Thread David Sterba

This reverts commit 01e219e8069516cdb98594d417b8bb8d906ed30d.

The information from the removed ioctl is newly provided by the existing
SPACE_INFO ioctl where it logically belongs and needs only minor changes
to the userspace code.

The GLOBAL_RSV ioctl hasn't been used anywhere yet, introduced only in
3.14-rc1 so there is no userspace ABI breakage.

CC: Jeff Mahoney 
Signed-off-by: David Sterba 
---
 fs/btrfs/ioctl.c   | 16 
 include/uapi/linux/btrfs.h |  1 -
 2 files changed, 17 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0d938f8f173e..dc64cc28af29 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3557,20 +3557,6 @@ out:
return ret;
 }
 
-static long btrfs_ioctl_global_rsv(struct btrfs_root *root, void __user *arg)
-{
-   struct btrfs_block_rsv *block_rsv = &root->fs_info->global_block_rsv;
-   u64 reserved;
-
-   spin_lock(&block_rsv->lock);
-   reserved = block_rsv->reserved;
-   spin_unlock(&block_rsv->lock);
-
-   if (arg && copy_to_user(arg, &reserved, sizeof(reserved)))
-   return -EFAULT;
-   return 0;
-}
-
 /*
  * there are many ways the trans_start and trans_end ioctls can lead
  * to deadlocks.  They should only be used by applications that
@@ -4777,8 +4763,6 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_logical_to_ino(root, argp);
case BTRFS_IOC_SPACE_INFO:
return btrfs_ioctl_space_info(root, argp);
-   case BTRFS_IOC_GLOBAL_RSV:
-   return btrfs_ioctl_global_rsv(root, argp);
case BTRFS_IOC_SYNC: {
int ret;
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 1b8a0f4c9590..b4d69092fbdb 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -558,7 +558,6 @@ static inline char *btrfs_err_str(enum btrfs_err_code 
err_code)
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, __u64)
 #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
struct btrfs_ioctl_space_args)
-#define BTRFS_IOC_GLOBAL_RSV _IOR(BTRFS_IOCTL_MAGIC, 20, __u64)
 #define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 24, __u64)
 #define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 #define BTRFS_IOC_SNAP_CREATE_V2 _IOW(BTRFS_IOCTL_MAGIC, 23, \
-- 
1.8.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/4] Btrfs updates for 3.14-rc - sysfs, ioctl

2014-02-07 Thread David Sterba

Two minor fixes to the new sysfs code and a RFC change to replace the
GLOBAL_RSV ioctl with the same information exported through SPACE_INFO ioctl.

This requires to reserve a blockgroup type that will become part of ABI.

David Sterba (3):
  btrfs: reserve no transaction units in btrfs_ioctl_set_features
  btrfs: export global block reserve size as space_info
  Revert "btrfs: add ioctl to export size of global metadata
reservation"

Jeff Mahoney (1):
  btrfs: commit transaction after setting label and features

 fs/btrfs/ctree.h   |  9 -
 fs/btrfs/ioctl.c   | 42 +++---
 include/uapi/linux/btrfs.h |  1 -
 3 files changed, 31 insertions(+), 21 deletions(-)

-- 
1.8.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Can i trust you?

2014-02-07 Thread ivan . zoric

 I am Mrs. Aisha Al Mutasim from England, I have a business proposal worth of 
$10.6M US Dollars  get
back for more info.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038

2014-02-07 Thread Fengguang Wu

On Fri, Feb 07, 2014 at 02:13:59AM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Fengguang Wu wrote:
> 
> > [1.625020] BTRFS: selftest: Running btrfs_split_item tests
> > [1.627004] BTRFS: selftest: Running find delalloc tests
> > [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
> > [  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, 
> > oom_score_adj=0
> > [  292.086439] kthreadd cpuset=
> > [  292.087072] BUG: unable to handle kernel NULL pointer dereference at 
> > 0038
> > [  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c
> 
> This looks like a problem with the cpuset cgroup name, are you sure this 
> isn't related to the removal of cgroup->name?

It looks not related to patch "cgroup: remove cgroup->name", because
that patch lies in the cgroup tree and not contained in output of "git log 
BAD_COMMIT".

Thanks,
Fengguang

> > [  292.087372] PGD 0 
> > [  292.087372] Oops:  [#1] 
> > [  292.087372] Modules linked in:
> > [  292.087372] CPU: 0 PID: 2 Comm: kthreadd Not tainted 
> > 3.14.0-rc1-wl-ath-00978-g4830363 #2
> > [  292.087372] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [  292.087372] task: 88148050 ti: 8814a000 task.ti: 
> > 8814a000
> > [  292.087372] RIP: 0010:[]  [] 
> > pr_cont_kernfs_name+0x1b/0x6c
> > [  292.087372] RSP: :8814bb20  EFLAGS: 00010046
> > [  292.087372] RAX: 0282 RBX:  RCX: 
> > 0002
> > [  292.087372] RDX: 812119de RSI: 8247d4a8 RDI: 
> > 0046
> > [  292.087372] RBP: 8814bb30 R08: 82f31218 R09: 
> > 
> > [  292.087372] R10:  R11:  R12: 
> > 833c4ab8
> > [  292.087372] R13: 003000d0 R14: 0001 R15: 
> > 
> > [  292.087372] FS:  () GS:82279000() 
> > knlGS:
> > [  292.087372] CS:  0010 DS:  ES:  CR0: 8005003b
> > [  292.087372] CR2: 0038 CR3: 02269000 CR4: 
> > 06b0
> > [  292.087372] Stack:
> > [  292.087372]  88148050 833c4ab8 8814bb50 
> > 8110836d
> > [  292.087372]  88148050 88148530 8814bbd0 
> > 81ae2614
> > [  292.087372]  88148640 8814bb78 810c7dd7 
> > 8814bb98
> > [  292.087372] Call Trace:
> > [  292.087372]  [] 
> > cpuset_print_task_mems_allowed+0x65/0x8b
> > [  292.087372]  [] dump_header.isra.11+0x68/0x29a
> > [  292.087372]  [] ? local_clock+0x2b/0x34
> > [  292.087372]  [] ? lock_release_holdtime+0xcf/0xdb
> > [  292.087372]  [] ? out_of_memory+0x318/0x40f
> > [  292.087372]  [] out_of_memory+0x373/0x40f
> > [  292.087372]  [] ? out_of_memory+0x193/0x40f
> > [  292.087372]  [] __alloc_pages_nodemask+0xdd1/0x12c4
> > [  292.087372]  [] ? native_sched_clock+0xe4/0xfc
> > [  292.087372]  [] copy_process+0x21d/0x2115
> > [  292.087372]  [] ? kthread_create_on_node+0x23e/0x23e
> > [  292.087372]  [] ? sched_clock+0x9/0xd
> > [  292.087372]  [] ? 
> > sched_clock_local.constprop.2+0x35/0xc8
> > [  292.087372]  [] ? pvclock_clocksource_read+0x9b/0x140
> > [  292.087372]  [] ? kthread_create_on_node+0x23e/0x23e
> > [  292.087372]  [] do_fork+0x105/0x4db
> > [  292.087372]  [] ? sched_clock_cpu+0xc9/0xdb
> > [  292.087372]  [] ? local_clock+0x2b/0x34
> > [  292.087372]  [] ? lock_release_holdtime+0xcf/0xdb
> > [  292.087372]  [] ? kthreadd+0x1b3/0x24a
> > [  292.087372]  [] kernel_thread+0x26/0x28
> > [  292.087372]  [] kthreadd+0x1c7/0x24a
> > [  292.087372]  [] ? ret_from_fork+0x7a/0xb0
> > [  292.087372]  [] ? kthread_create_on_cpu+0x7a/0x7a
> > [  292.087372]  [] ret_from_fork+0x7a/0xb0
> > [  292.087372]  [] ? kthread_create_on_cpu+0x7a/0x7a
> > [  292.087372] Code: 1c 92 8e 00 48 8b 45 e0 59 5b 41 5c 41 5d 5d c3 0f 1f 
> > 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 c7 c7 90 d4 47 82 e8 b1 8f 8e 00 
> > <48> 83 7b 38 00 49 89 c4 74 06 48 8b 73 40 eb 07 48 c7 c6 48 fe 
> > [  292.087372] RIP  [] pr_cont_kernfs_name+0x1b/0x6c
> > [  292.087372]  RSP 
> > [  292.087372] CR2: 0038
> > [  292.087372] ---[ end trace df2598a82119 ]---
> > [  292.087372] ---[ end trace df2598a82119 ]---
> > 
> > git bisect start 483036322b45d800fd68cb028874b6cd4fee3dba 
> > 38dbfb59d1175ef458d006556061adeaa8751b72 --
> > git bisect  bad 50a557bc0d03997d9203d749b0304c54d2c6e5f5  # 02:20  0-   
> >5  Merge 'cgroup/review-simplify' into devel-hourly-2014020601
> > git bisect  bad c04e036e53a39cf2986ddaaf5607ff011f9ea55b  # 03:14  0-   
> >1  Merge 'pinctrl/for-next' into devel-hourly-2014020601
> > git bisect good 08ce20bd7ef9a08b74cdfe6fe454374be72b5825  # 03:58 25+   
> >0  Merge 'pci/pci/msi' into devel-hourly-2014020601
> > git bisect good 4f86564c657669f69c9cc03151ef6f23ae9e2015  # 04:32 25+   
> >0  Merge 'm68knommu/cf' into devel-hourly-2014020601
> > git

Re: lost with degraded RAID1

2014-02-07 Thread Johan Kröckel

Is there anything else I should do with this setup or may I nuke the
two partitions and reuse them?

2014-02-03 Johan Kröckel :
> State is: I wont use this filesystem again. I have a backup. So I am
> interested to give the necessary information for debuging it and
> afterwards format it and create a new one. I already did fscks and
> btrfschk --repair and pushed the output to txt-files but they are more
> than 4 mb in size.
>
> So I will post excerpts:
>
> file: btrfsck.out===
> Checking filesystem on /dev/mapper/bunkerA
> UUID: 7f954a85-7566-4251-832c-44f2d3de2211
> 42
> parent transid verify failed on 1887688011776 wanted 121037 found 88533
> parent transid verify failed on 1888518615040 wanted 121481 found 90267
> parent transid verify failed on 1681394102272 wanted 110919 found 91024
> parent transid verify failed on 1888522838016 wanted 121486 found 90270
> parent transid verify failed on 1888398331904 wanted 121062 found 89987
> leaf parent key incorrect 1887867330560
> bad block 1887867330560
> leaf parent key incorrect 188812032
> bad block 188812032
> leaf parent key incorrect 1888124637184
> bad block 1888124637184
> leaf parent key incorrect 1888131444736
> bad block 1888131444736
>
> [...and so on for 4MB]
>
> bad block 1888513552384
> leaf parent key incorrect 1888513642496
> bad block 1888513642496
> leaf parent key incorrect 1888513654784
> bad block 1888513654784
> leaf parent key incorrect 1888514023424
> bad block 1888514023424
> btrfsck: cmds-check.c:2212: check_owner_ref: Assertion `!(rec->is_root)' 
> failed.
>
> file: smartctl-before-btrfschk-repair==
> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.12-0.bpo.1-amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE
> UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate 0x000f   118   099   006Pre-fail
> Always   -   172055696
>   3 Spin_Up_Time0x0003   093   093   000Pre-fail
> Always   -   0
>   4 Start_Stop_Count0x0032   100   100   020Old_age
> Always   -   7
>   5 Reallocated_Sector_Ct   0x0033   100   100   010Pre-fail
> Always   -   0
>   7 Seek_Error_Rate 0x000f   069   060   030Pre-fail
> Always   -   9085642
>   9 Power_On_Hours  0x0032   097   097   000Old_age
> Always   -   2769
>  10 Spin_Retry_Count0x0013   100   100   097Pre-fail
> Always   -   0
>  12 Power_Cycle_Count   0x0032   100   100   020Old_age
> Always   -   7
> 184 End-to-End_Error0x0032   100   100   099Old_age
> Always   -   0
> 187 Reported_Uncorrect  0x0032   100   100   000Old_age
> Always   -   0
> 188 Command_Timeout 0x0032   100   100   000Old_age
> Always   -   0
> 189 High_Fly_Writes 0x003a   083   083   000Old_age
> Always   -   17
> 190 Airflow_Temperature_Cel 0x0022   077   071   045Old_age
> Always   -   23 (Min/Max 22/23)
> 191 G-Sense_Error_Rate  0x0032   100   100   000Old_age
> Always   -   0
> 192 Power-Off_Retract_Count 0x0032   100   100   000Old_age
> Always   -   5
> 193 Load_Cycle_Count0x0032   100   100   000Old_age
> Always   -   7
> 194 Temperature_Celsius 0x0022   023   040   000Old_age
> Always   -   23 (0 20 0 0)
> 197 Current_Pending_Sector  0x0012   100   100   000Old_age
> Always   -   0
> 198 Offline_Uncorrectable   0x0010   100   100   000Old_age
> Offline  -   0
> 199 UDMA_CRC_Error_Count0x003e   200   200   000Old_age
> Always   -   0
>
> =file:btrfsck-repair.out
> enabling repair mode
> Checking filesystem on /dev/mapper/bunkerA
> UUID: 7f954a85-7566-4251-832c-44f2d3de2211
> ify failed on 1887688011776 wanted 121037 found 88533
> parent transid verify failed on 1888518615040 wanted 121481 found 90267
> parent transid verify failed on 1681394102272 wanted 110919 found 91024
> parent transid verify failed on 1888522838016 wanted 121486 found 90270
> parent transid verify failed on 1888398331904 wanted 121062 found 89987
> leaf parent key incorrect 1887867330560
> bad block 1887867330560
>
> [...and so on for 4MB]
>
> bad block 1888513642496
> leaf parent key incorrect 1888513654784
> bad block 1888513654784
> leaf parent key incorrect 1888514023424
> bad block 1888514023424
> btrfsck: cmds-check.c:2212: check_owner_ref: Assertion `!(rec->is_root)' 
> failed.
>
> ==file:smartctl-after-btrfschk-repair==
> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.12-0.bpo.1-amd64] (local build)
> Copyright (C) 2

Re: [PATCH v3] Btrfs: add regression test for running snapshot and send concurrently

2014-02-07 Thread Wang Shilong


Hi Dave,

> On Fri, Feb 07, 2014 at 12:18:31PM +0800, Wang Shilong wrote:
>> 
>> Hi Dave,
>> 
>>> On Fri, Feb 07, 2014 at 12:10:08AM +0800, Wang Shilong wrote:
 +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
 +  $SCRATCH_MNT/snap_1 >> $seqres.full 2>&1
 +
 +do_snapshots &
 +snapshots_pid=$!
 +
 +$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 > /dev/null 2>&1 || echo "btrfs 
 send failed"
>>> 
>>> Let's stop this anti-pattern before it takes hold.
>>> 
>>> If there's output from the send command it should be
>>> filtered and captured in the golden image. Hence any deviation
>>> caused by errors is automatically flagged as an error.
>>> 
>>> That's the whole point of using golden images for capturing errors -
>>> you don't need to capture return values from binaries and it
>>> guarantees that users are informed about failures through error
>>> messages. IOWs:
>>> 
>>> $BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 | _btrfs_send_filter
>>> 
>>> is what you should be doing here.
>> 
>> I knew what you mean here, in fact, i did this on purpose.
> 
> Ok, then you need to explain why you did it on purpose with a comment.
> It's just as important to explain the reason for doing something in
> test code as it is in the kernel code. i.e. so when we are looking
> at the test in 5 years time we know the reason for it being that
> way.
> 
>> for this test failure, btrfs-prog did not output failure
>> information from the beginning. 
> 
> I have nothing good to say about that state of affairs, but...
> 
>> So to make older progs can also
>> detect the test failure, i dropped into this way.
> 
> .. it's going to have to stay like it. Please insert an
> appropriately sarcastic comment about the usefulness of a silent
> send command here, because if I write it I'm going to offend lots of
> people. :/

Sorry, my miss, when i was going to give a patch for btrfs-progs, i noticed
the issue has been fixed in latest btrfs-progs(3.12 which has released for a 
long time).

So let's drop the way you have said before.

Thanks,
Wang
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 3/3] btrfs/035: add new clone overwrite regression test

2014-02-07 Thread David Disseldorp

This test uses the newly added cloner binary to dispatch full file and
range specific clone (reflink) requests.

Signed-off-by: David Disseldorp 
---
 tests/btrfs/035 | 77 +
 tests/btrfs/035.out |  3 +++
 tests/btrfs/group   |  1 +
 3 files changed, 81 insertions(+)
 create mode 100755 tests/btrfs/035
 create mode 100644 tests/btrfs/035.out

diff --git a/tests/btrfs/035 b/tests/btrfs/035
new file mode 100755
index 000..6808179
--- /dev/null
+++ b/tests/btrfs/035
@@ -0,0 +1,77 @@
+#!/bin/bash
+# FS QA Test No. btrfs/035
+#
+# Regression test for overwriting clones
+#
+#---
+# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+_cleanup()
+{
+rm -f $tmp.*
+}
+
+trap "_cleanup ; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs > /dev/null 2>&1
+_scratch_mount
+
+CLONER_PROG=$here/src/cloner
+[ -x $CLONER_PROG ] || _notrun "cloner binary not present at $CLONER_PROG"
+
+src_str="aa"
+
+echo -n "$src_str" > $SCRATCH_MNT/src
+
+$CLONER_PROG $SCRATCH_MNT/src  $SCRATCH_MNT/src.clone1
+
+src_str="bbcc"
+
+echo -n "$src_str" > $SCRATCH_MNT/src
+
+$CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2
+
+snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'`
+echo "attempting ioctl (src.clone1 src)"
+$CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
+   $SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
+
+snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
+echo "attempting ioctl (src.clone2 src)"
+$CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
+   $SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
+
+status=0 ; exit
diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
new file mode 100644
index 000..f86cadf
--- /dev/null
+++ b/tests/btrfs/035.out
@@ -0,0 +1,3 @@
+QA output created by 035
+attempting ioctl (src.clone1 src)
+attempting ioctl (src.clone2 src)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f9f062f..bee57cb 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -37,3 +37,4 @@
 032 auto quick
 033 auto quick
 034 auto quick
+035 auto quick
-- 
1.8.4.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 0/3] __btrfs_drop_extents() BUG_ON reproducer

2014-02-07 Thread David Disseldorp

This patch-set provides a reproducer for hitting the 3.14.0-rc1 BUG_ON()
at:
 692 int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
...
 839 /*
 840  *  |  range to drop - |
 841  *  |  extent  |
 842  */
 843 if (start <= key.offset && end < extent_end) {
 844 BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);
 845 
 846 memcpy(&new_key, &key, sizeof(new_key));

The first patch adds a small cloner binary which is used by btrfs/035 to
dispatch BTRFS_IOC_CLONE_RANGE requests.

This workload resembles that of Samba's vfs_btrfs module, when a Windows
client restores a file from a shadow-copy (snapshot) using server-side
copy requests.

Changes since V2:
- Remove explicit write error checks

Changes since V1:
- Use strtoull instead of atoi
- Print error conditions in cloner
- Check for cloner binary before running test
- Continue test on failure
- Add cloner to .gitignore

Feedback appreciated.

Cheers, David


 .gitignore  |   1 +
 configure.ac|   1 +
 src/Makefile|   2 +-
 src/cloner.c| 192 
+++
 tests/btrfs/035 |  77 
+
 tests/btrfs/035.out |   3 +++
 tests/btrfs/group   |   1 +
 7 files changed, 276 insertions(+), 1 deletion(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 1/3] btrfs: add small program for clone testing

2014-02-07 Thread David Disseldorp

The cloner program is capable of cloning files using the BTRFS_IOC_CLONE
and BTRFS_IOC_CLONE_RANGE ioctls.

Signed-off-by: David Disseldorp 
---
 .gitignore   |   1 +
 src/Makefile |   2 +-
 src/cloner.c | 188 +++
 3 files changed, 190 insertions(+), 1 deletion(-)
 create mode 100644 src/cloner.c

diff --git a/.gitignore b/.gitignore
index ee4cb41..b6f2463 100644
--- a/.gitignore
+++ b/.gitignore
@@ -104,6 +104,7 @@
 /src/aio-dio-regress/aio-free-ring-with-bogus-nr-pages
 /src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer
 /src/aio-dio-regress/aiodio_sparse2
+/src/cloner
 
 # dmapi/ binaries
 /dmapi/src/common/cmd/read_invis
diff --git a/src/Makefile b/src/Makefile
index 84c8297..6509f2d 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -18,7 +18,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
preallo_rw_pattern_reader \
locktest unwritten_mmap bulkstat_unlink_test t_stripealign \
bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
-   seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec
+   seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner
 
 SUBDIRS =
 
diff --git a/src/cloner.c b/src/cloner.c
new file mode 100644
index 000..dfce837
--- /dev/null
+++ b/src/cloner.c
@@ -0,0 +1,188 @@
+/*
+ *  Tiny program to perform file (range) clones using raw Btrfs ioctls.
+ *  It should only be needed until btrfs-progs has an xfs_io equivalent.
+ *
+ *  Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct btrfs_ioctl_clone_range_args {
+   int64_t src_fd;
+   uint64_t src_offset;
+   uint64_t src_length;
+   uint64_t dest_offset;
+};
+
+#define BTRFS_IOCTL_MAGIC 0x94
+#define BTRFS_IOC_CLONE   _IOW(BTRFS_IOCTL_MAGIC, 9, int)
+#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
+  struct btrfs_ioctl_clone_range_args)
+
+static void
+usage(char *name, const char *msg)
+{
+   printf("Fatal: %s\n"
+  "Usage:\n"
+  "%s [options]  \n"
+  "\tA full file clone (reflink) is performed by default, "
+  "unless any of the following are specified:\n"
+  "\t-s :  source file offset (default = 0)\n"
+  "\t-d :  destination file offset (default = 0)\n"
+  "\t-l :  length of clone (default = 0)\n",
+  msg, name);
+   _exit(1);
+}
+
+static int
+clone_file(int src_fd, int dst_fd)
+{
+   int ret = ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd);
+   if (ret != 0)
+   ret = errno;
+   return ret;
+}
+
+static int
+clone_file_range(int src_fd, int dst_fd, uint64_t src_off, uint64_t dst_off,
+uint64_t len)
+{
+   struct btrfs_ioctl_clone_range_args cr_args;
+   int ret;
+
+   memset(&cr_args, 0, sizeof(cr_args));
+   cr_args.src_fd = src_fd;
+   cr_args.src_offset = src_off;
+   cr_args.src_length = len;
+   cr_args.dest_offset = dst_off;
+   ret = ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, &cr_args);
+   if (ret != 0)
+   ret = errno;
+   return ret;
+}
+
+int
+main(int argc, char **argv)
+{
+   bool full_file = true;
+   uint64_t src_off = 0;
+   uint64_t dst_off = 0;
+   uint64_t len = 0;
+   char *src_file;
+   int src_fd;
+   char *dst_file;
+   int dst_fd;
+   int ret;
+   int opt;
+
+   while ((opt = getopt(argc, argv, "s:d:l:")) != -1) {
+   char *sval_end;
+   switch (opt) {
+   case 's':
+   errno = 0;
+   src_off = strtoull(optarg, &sval_end, 10);
+   if ((errno) || (*sval_end != '\0'))
+   usage(argv[0], "invalid source offset");
+   full_file = false;
+   break;
+   case 'd':
+   errno = 0;
+   dst_off = strtoull(optarg, &sval_end, 10);
+

[PATCH v3 2/3] src/cloner: use btrfs/ioctl.h header if present

2014-02-07 Thread David Disseldorp

Check for the btrfsprogs-devel ioctl.h header at configure time. Use it
in src/cloner if present, otherwise fall back to using the copied clone
ioctl definitions.

Signed-off-by: David Disseldorp 
---
 configure.ac | 1 +
 src/cloner.c | 4 
 2 files changed, 5 insertions(+)

diff --git a/configure.ac b/configure.ac
index bd48fd9..6fba3ad 100644
--- a/configure.ac
+++ b/configure.ac
@@ -30,6 +30,7 @@ AC_HEADER_STDC
 AC_CHECK_HEADERS([ sys/fs/xfs_fsops.h  \
sys/fs/xfs_itable.h \
xfs/platform_defs.h \
+   btrfs/ioctl.h   \
 ])
 
 AC_PACKAGE_NEED_UUIDCOMPARE
diff --git a/src/cloner.c b/src/cloner.c
index dfce837..ccc2354 100644
--- a/src/cloner.c
+++ b/src/cloner.c
@@ -30,6 +30,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_BTRFS_IOCTL_H
+#include 
+#else
 
 struct btrfs_ioctl_clone_range_args {
int64_t src_fd;
@@ -42,6 +45,7 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_IOC_CLONE   _IOW(BTRFS_IOCTL_MAGIC, 9, int)
 #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
   struct btrfs_ioctl_clone_range_args)
+#endif
 
 static void
 usage(char *name, const char *msg)
-- 
1.8.4.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 3/3] btrfs/035: add new clone overwrite regression test

2014-02-07 Thread David Disseldorp

On Fri, 7 Feb 2014 09:45:09 +1100, Dave Chinner wrote:

> Not exactly what I intended.
> 
> If echo fails, it will output some kind of error message, and that
> will cause the golden image mismatch.

Fair enough. I'll resend with the two '  || echo "failed to create src"'
checks removed. Thanks again for the review Dave.

Cheers, David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: introduce BTRFS_IOC_GET_DEVS

2014-02-07 Thread Hugo Mills

On Fri, Feb 07, 2014 at 06:08:56PM +0800, Anand Jain wrote:
> 
> 
>  Thanks for the comments.
> 
>  mainly here sysfs way defeats the purpose - debug as
>  mentioned. Sysfs would/should show only mounted disks,

   So let's find a way of showing the "known-about" data structure
separately from the "mounted" data structure. Just thinking aloud
here, how about something like this:

For unmounted, but known filesystems:
/sys/fs/btrfs/devmap//   (for each device known)
/sys/fs/btrfs/devmap//label
/sys/fs/btrfs/devmap//complete  (0 if missing devices, 1 otherwise)
/sys/fs/btrfs/devmap//

And then for mounted filesystems, we can populate the  with more details as appropriate (and remove them when
unmounted again). We can also add a symlink to the  directory
for the filesystem to e.g.:

/sys/fs/btrfs/mounted/

   If truly the *only* reason to do this is debug, then simply dump
the info in debugfs and have done with it. However, there are good
reasons to want to have this information in an ordinary running
system, too. So stick with sysfs, IMO.

>  the ioctl way doesn't have such a limitation (of course
>  as I commented memory dump way would have been best choice
>  but I got stuck with that approach if anybody wants to give
>  a try with that approach you are most welcome).
> 
>  IMO no harm to have both sysfs way and ioctl way let user
>  or developer use which were is suitable in their context.

   Extra maintenance overhead; divergence of the APIs. "You can find
out the label of the filesystem using sysfs, but not the ioctl. You
can find out the size of the filesystem using the ioctl, but not using
sysfs." That way lies quite a bit of pain for the user of the
interfaces.

   Hugo.

> Thanks, Anand
> 
> 
> On 02/07/2014 06:05 AM, David Sterba wrote:
> >On Thu, Feb 06, 2014 at 09:09:57PM +0100, Goffredo Baroncelli wrote:
> >>>With 3.14 the sysfs interface is available, but the devices under
> >>>  /sys/fs/btrfs//devices/...
> >>>are symlinks to the sysfs devices, so this btrfs-specific device
> >>>information has to be located in a separate directory.
> >>
> >>yes please; it would scale better than the ioctl()s; anyway I suggest 1
> >>file per property, and not "...one file that holds all the stats
> >>  about the device"
> >
> >Well, we can do both, I don't have a preference and both make sense.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Great oxymorons of the world, no. 4: Future Perfect ---   

signature.asc
Description: Digital signature

Re: [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038

2014-02-07 Thread David Rientjes

On Fri, 7 Feb 2014, Fengguang Wu wrote:

> [1.625020] BTRFS: selftest: Running btrfs_split_item tests
> [1.627004] BTRFS: selftest: Running find delalloc tests
> [2.289182] tsc: Refined TSC clocksource calibration: 2299.967 MHz
> [  292.084537] kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=1, 
> oom_score_adj=0
> [  292.086439] kthreadd cpuset=
> [  292.087072] BUG: unable to handle kernel NULL pointer dereference at 
> 0038
> [  292.087372] IP: [] pr_cont_kernfs_name+0x1b/0x6c

This looks like a problem with the cpuset cgroup name, are you sure this 
isn't related to the removal of cgroup->name?

> [  292.087372] PGD 0 
> [  292.087372] Oops:  [#1] 
> [  292.087372] Modules linked in:
> [  292.087372] CPU: 0 PID: 2 Comm: kthreadd Not tainted 
> 3.14.0-rc1-wl-ath-00978-g4830363 #2
> [  292.087372] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  292.087372] task: 88148050 ti: 8814a000 task.ti: 
> 8814a000
> [  292.087372] RIP: 0010:[]  [] 
> pr_cont_kernfs_name+0x1b/0x6c
> [  292.087372] RSP: :8814bb20  EFLAGS: 00010046
> [  292.087372] RAX: 0282 RBX:  RCX: 
> 0002
> [  292.087372] RDX: 812119de RSI: 8247d4a8 RDI: 
> 0046
> [  292.087372] RBP: 8814bb30 R08: 82f31218 R09: 
> 
> [  292.087372] R10:  R11:  R12: 
> 833c4ab8
> [  292.087372] R13: 003000d0 R14: 0001 R15: 
> 
> [  292.087372] FS:  () GS:82279000() 
> knlGS:
> [  292.087372] CS:  0010 DS:  ES:  CR0: 8005003b
> [  292.087372] CR2: 0038 CR3: 02269000 CR4: 
> 06b0
> [  292.087372] Stack:
> [  292.087372]  88148050 833c4ab8 8814bb50 
> 8110836d
> [  292.087372]  88148050 88148530 8814bbd0 
> 81ae2614
> [  292.087372]  88148640 8814bb78 810c7dd7 
> 8814bb98
> [  292.087372] Call Trace:
> [  292.087372]  [] cpuset_print_task_mems_allowed+0x65/0x8b
> [  292.087372]  [] dump_header.isra.11+0x68/0x29a
> [  292.087372]  [] ? local_clock+0x2b/0x34
> [  292.087372]  [] ? lock_release_holdtime+0xcf/0xdb
> [  292.087372]  [] ? out_of_memory+0x318/0x40f
> [  292.087372]  [] out_of_memory+0x373/0x40f
> [  292.087372]  [] ? out_of_memory+0x193/0x40f
> [  292.087372]  [] __alloc_pages_nodemask+0xdd1/0x12c4
> [  292.087372]  [] ? native_sched_clock+0xe4/0xfc
> [  292.087372]  [] copy_process+0x21d/0x2115
> [  292.087372]  [] ? kthread_create_on_node+0x23e/0x23e
> [  292.087372]  [] ? sched_clock+0x9/0xd
> [  292.087372]  [] ? sched_clock_local.constprop.2+0x35/0xc8
> [  292.087372]  [] ? pvclock_clocksource_read+0x9b/0x140
> [  292.087372]  [] ? kthread_create_on_node+0x23e/0x23e
> [  292.087372]  [] do_fork+0x105/0x4db
> [  292.087372]  [] ? sched_clock_cpu+0xc9/0xdb
> [  292.087372]  [] ? local_clock+0x2b/0x34
> [  292.087372]  [] ? lock_release_holdtime+0xcf/0xdb
> [  292.087372]  [] ? kthreadd+0x1b3/0x24a
> [  292.087372]  [] kernel_thread+0x26/0x28
> [  292.087372]  [] kthreadd+0x1c7/0x24a
> [  292.087372]  [] ? ret_from_fork+0x7a/0xb0
> [  292.087372]  [] ? kthread_create_on_cpu+0x7a/0x7a
> [  292.087372]  [] ret_from_fork+0x7a/0xb0
> [  292.087372]  [] ? kthread_create_on_cpu+0x7a/0x7a
> [  292.087372] Code: 1c 92 8e 00 48 8b 45 e0 59 5b 41 5c 41 5d 5d c3 0f 1f 44 
> 00 00 55 48 89 e5 41 54 53 48 89 fb 48 c7 c7 90 d4 47 82 e8 b1 8f 8e 00 <48> 
> 83 7b 38 00 49 89 c4 74 06 48 8b 73 40 eb 07 48 c7 c6 48 fe 
> [  292.087372] RIP  [] pr_cont_kernfs_name+0x1b/0x6c
> [  292.087372]  RSP 
> [  292.087372] CR2: 0038
> [  292.087372] ---[ end trace df2598a82119 ]---
> [  292.087372] ---[ end trace df2598a82119 ]---
> 
> git bisect start 483036322b45d800fd68cb028874b6cd4fee3dba 
> 38dbfb59d1175ef458d006556061adeaa8751b72 --
> git bisect  bad 50a557bc0d03997d9203d749b0304c54d2c6e5f5  # 02:20  0- 
>  5  Merge 'cgroup/review-simplify' into devel-hourly-2014020601
> git bisect  bad c04e036e53a39cf2986ddaaf5607ff011f9ea55b  # 03:14  0- 
>  1  Merge 'pinctrl/for-next' into devel-hourly-2014020601
> git bisect good 08ce20bd7ef9a08b74cdfe6fe454374be72b5825  # 03:58 25+ 
>  0  Merge 'pci/pci/msi' into devel-hourly-2014020601
> git bisect good 4f86564c657669f69c9cc03151ef6f23ae9e2015  # 04:32 25+ 
>  0  Merge 'm68knommu/cf' into devel-hourly-2014020601
> git bisect  bad 29ae23c20b538664beaea72bb0721ce2538b4ca9  # 04:50  0- 
> 11  Merge 'm68knommu/cfmmu' into devel-hourly-2014020601
> git bisect  bad bbce71dbd03db3cf6df7faba0132d87a1a055827  # 04:58  0- 
>  3  Merge 'nfs/devel' into devel-hourly-2014020601
> git bisect good 88a78a912ee059467ae6db7429a6efe4654620a5  # 06:16 25+ 
>  1  Merge branch 'acl_fixes' into linux-next
> git bisect good 12b13835a0a8bfabea68741e1a

Re: Provide a better free space estimate on RAID1

2014-02-07 Thread Martin Steigerwald

Am Donnerstag, 6. Februar 2014, 22:30:46 schrieb Chris Murphy:
> On Feb 6, 2014, at 9:40 PM, Roman Mamedov  wrote:
> > On Thu, 06 Feb 2014 20:54:19 +0100
> >
> > Goffredo Baroncelli  wrote:
> > 
> >
> >> I agree with you about the needing of a solution. However your patch to
> >> me seems even worse than the actual code.>>
> >> 
> >>
> >> For example you cannot take in account the mix of data/linear and
> >> metadata/dup (with the pathological case of small files stored in the
> >> metadata chunks ), nor different profile level like raid5/6 (or the
> >> future raidNxM) And do not forget the compression...
> >
> > 
> >
> > Every estimate first and foremost should be measured by how precise it is,
> > or in this case "wrong by how many gigabytes". The actual code returns a
> > result that is pretty much always wrong by 2x, after the patch it will be
> > close within gigabytes to the correct value in the most common use case
> > (data raid1, metadata raid1 and that's it). Of course that PoC is nowhere
> > near the final solution, what I can't agree with is "if another option is
> > somewhat better, but not ideally perfect, then it's worse than the
> > current one", even considering the current one is absolutely broken.
> 
> Is the glass half empty or is it half full?
> 
> From the original post, context is a 2x 1TB raid volume:
> 
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sda2   1.8T  1.1M  1.8T   1% /mnt/p2
> 
> Earlier conventions would have stated Size ~900GB, and Avail ~900GB. But
> that's not exactly true either, is it? It's merely a convention to cut the
> storage available in half, while keeping data file sizes the same as if
> they were on a single device without raid.
> 
> On Btrfs the file system Size is reported as the total storage stack size,
> and that's not incorrect. And the amount Avail is likewise not wrong
> because that space is "not otherwise occupied" which is the definition of
> available. It's linguistically consistent, it's just not a familiar
> convention.

I see one issue with it:

There are installers and applications that check available disk space prior to 
installing. This won´t work with current df figures that BTRFS delivers.

While I understand that there is *never* a guarentee that a given free space 
can really be allocated by a process cause other processes can allocate space 
as well in the mean time, and while I understand that its difficult to provide 
an accurate to provide exact figures as soon as RAID settings can be set per 
subvolume, it still think its important to improve on the figures.

In the longer term I´d like like a function / syscall to ask the filesystem the 
following question:

I am about to write 200 MB in this directory, am I likely to succeed with 
that?

This way an application can ask specific to a directory which allows BTRFS to 
provide a more accurate estimation.

I understand that there is something like that for single files (fallocate), 
but there is nothing like this for writing a certain amount of data in several 
files / directories.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: introduce BTRFS_IOC_GET_DEVS

2014-02-07 Thread Anand Jain




 Thanks for the comments.

 mainly here sysfs way defeats the purpose - debug as
 mentioned. Sysfs would/should show only mounted disks,
 the ioctl way doesn't have such a limitation (of course
 as I commented memory dump way would have been best choice
 but I got stuck with that approach if anybody wants to give
 a try with that approach you are most welcome).

 IMO no harm to have both sysfs way and ioctl way let user
 or developer use which were is suitable in their context.

Thanks, Anand


On 02/07/2014 06:05 AM, David Sterba wrote:

On Thu, Feb 06, 2014 at 09:09:57PM +0100, Goffredo Baroncelli wrote:

With 3.14 the sysfs interface is available, but the devices under
  /sys/fs/btrfs//devices/...
are symlinks to the sysfs devices, so this btrfs-specific device
information has to be located in a separate directory.


yes please; it would scale better than the ioctl()s; anyway I suggest 1
file per property, and not "...one file that holds all the stats
  about the device"


Well, we can do both, I don't have a preference and both make sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] Btrfs: fix use-after-free in the finishing procedure of the device replace

2014-02-07 Thread Stefan Behrens

On Fri, 07 Feb 2014 17:17:45 +0800, Miao Xie wrote:
> On Fri, 07 Feb 2014 09:14:42 +0100, Stefan Behrens wrote:
>> On Thu, 30 Jan 2014 16:46:55 +0800, Miao Xie wrote:
>>> During device replace test, we hit a null pointer deference (It was very 
>>> easy
>>> to reproduce it by running xfstests' btrfs/011 on the devices with the 
>>> virtio
>>> scsi driver). There were two bugs that caused this problem:
>>> - We might allocate new chunks on the replaced device after we updated
>>>   the mapping tree. And we forgot to replace the source device in those
>>>   mapping of the new chunks.
>>> - We might get the mapping information which including the source device
>>>   before the mapping information update. And then submit the bio which was
>>>   based on that mapping information after we freed the source device.
>>>
>>> For the first bug, we can fix it by doing mapping tree update and source
>>> device remove in the same context of the chunk mutex. The chunk mutex is
>>> used to protect the allocable device list, the above method can avoid
>>> the new chunk allocation, and after we remove the source device, all
>>> the new chunks will be allocated on the new device. So it can fix
>>> the first bug.
>>>
>>> For the second bug, we need make sure all flighting bios are finished and
>>> no new bios are produced during we are removing the source device. To fix
>>> this problem, we introduced a global @bio_counter, we not only inc/dec
>>> @bio_counter outsize of map_blocks, but also inc it before submitting bio
>>> and dec @bio_counter when ending bios.
>>>
>>> Since Raid56 is a little different and device replace dosen't support raid56
>>> yet, it is not addressed in the patch and I add comments to make sure we 
>>> will
>>> fix it in the future.
>>>
>>> Reported-by: Qu Wenruo 
>>> Signed-off-by: Wang Shilong 
>>> Signed-off-by: Miao Xie 
>>> ---
>>>  fs/btrfs/ctree.h   |  9 ++
>>>  fs/btrfs/dev-replace.c | 74 
>>> +-
>>>  fs/btrfs/disk-io.c | 12 +++-
>>>  fs/btrfs/volumes.c | 30 +++-
>>>  fs/btrfs/volumes.h |  1 +
>>>  5 files changed, 111 insertions(+), 15 deletions(-)
[...]
>>>  static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>>int scrub_ret)
>>>  {
>>> @@ -458,12 +487,6 @@ static int btrfs_dev_replace_finishing(struct 
>>> btrfs_fs_info *fs_info,
>>> src_device = dev_replace->srcdev;
>>> btrfs_dev_replace_unlock(dev_replace);
>>>  
>>> -   /* replace old device with new one in mapping tree */
>>> -   if (!scrub_ret)
>>> -   btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
>>> -   src_device,
>>> -   tgt_device);
>>> -
>>
>> Isn't this change the reason that you need to add code to count bios?
> 
> No, the change is not the reason to introduce a bio counter.
> 
> The above two bugs are independent, the second bug can happen even the first 
> one
> doesn't exist, for example:
>   Task1   Task2
>   btrfs_readpages()
> btrfs_map_bio()
>   __btrfs_map_block()
>   

You are right!
Thanks.


>   device replace start
>   device replace finish
> release source device
>   bio_size_ok()
> bdev_get_queue()
>   oops
> 
>> I mean, code is there to flush everything to disk. It's needed for sync,
>> too. When btrfs_dev_replace_finishing() is called, the copy operation is
>> finished and the replaced and the replacement drive are logically
>> identical except that data is not yet forced to be flushed to disk. The
>> only difference between both disks is the bdev block device pointer and
>> the btrfs device pointer.
>>
>> I think this is the correct and lightweight procedure:
>>
>> 1. After the copy operation finished, from now on, always use the new
>> btrfs device and the new block device.
>> 2. Flush all outstanding bios and wait for them.
>> 3. Close the block dev and free the btrfs device.
>>
>> You said that the problem was that we might allocate new chunks on the
>> replaced device after we updated the mapping tree. Well, can't we just
>> make sure the new chunks are allocated using the new device pointers
>> after step 1 and all issues are resolved? I think that the first change
>> is not a good idea and causes the other issues that you fix with a lot
>> of code.
> 
> Step1 can fix the first problem, but need more code to recover the mapping
> tree when the flush fails.
> 
> And as I said above, the second bug is not introduced by the fix of the first
> bug, we have to add a bio counter. Any good idea?
> 
> BTW, the reason why I mixed two fixes into one patch is that we could not fix
>

Re: [PATCH 2/2] Btrfs: fix use-after-free in the finishing procedure of the device replace

2014-02-07 Thread Miao Xie

On Fri, 07 Feb 2014 09:14:42 +0100, Stefan Behrens wrote:
> On Thu, 30 Jan 2014 16:46:55 +0800, Miao Xie wrote:
>> During device replace test, we hit a null pointer deference (It was very easy
>> to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
>> scsi driver). There were two bugs that caused this problem:
>> - We might allocate new chunks on the replaced device after we updated
>>   the mapping tree. And we forgot to replace the source device in those
>>   mapping of the new chunks.
>> - We might get the mapping information which including the source device
>>   before the mapping information update. And then submit the bio which was
>>   based on that mapping information after we freed the source device.
>>
>> For the first bug, we can fix it by doing mapping tree update and source
>> device remove in the same context of the chunk mutex. The chunk mutex is
>> used to protect the allocable device list, the above method can avoid
>> the new chunk allocation, and after we remove the source device, all
>> the new chunks will be allocated on the new device. So it can fix
>> the first bug.
>>
>> For the second bug, we need make sure all flighting bios are finished and
>> no new bios are produced during we are removing the source device. To fix
>> this problem, we introduced a global @bio_counter, we not only inc/dec
>> @bio_counter outsize of map_blocks, but also inc it before submitting bio
>> and dec @bio_counter when ending bios.
>>
>> Since Raid56 is a little different and device replace dosen't support raid56
>> yet, it is not addressed in the patch and I add comments to make sure we will
>> fix it in the future.
>>
>> Reported-by: Qu Wenruo 
>> Signed-off-by: Wang Shilong 
>> Signed-off-by: Miao Xie 
>> ---
>>  fs/btrfs/ctree.h   |  9 ++
>>  fs/btrfs/dev-replace.c | 74 
>> +-
>>  fs/btrfs/disk-io.c | 12 +++-
>>  fs/btrfs/volumes.c | 30 +++-
>>  fs/btrfs/volumes.h |  1 +
>>  5 files changed, 111 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 84d4c05..c39ad06 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -351,6 +351,7 @@ static inline unsigned long btrfs_chunk_item_size(int 
>> num_stripes)
>>  #define BTRFS_FS_STATE_ERROR0
>>  #define BTRFS_FS_STATE_REMOUNTING   1
>>  #define BTRFS_FS_STATE_TRANS_ABORTED2
>> +#define BTRFS_FS_STATE_DEV_REPLACING3
>>  
>>  /* Super block flags */
>>  /* Errors detected */
>> @@ -1674,6 +1675,9 @@ struct btrfs_fs_info {
>>  
>>  atomic_t mutually_exclusive_operation_running;
>>  
>> +struct percpu_counter bio_counter;
>> +wait_queue_head_t replace_wait;
>> +
>>  struct semaphore uuid_tree_rescan_sem;
>>  unsigned int update_uuid_tree_gen:1;
>>  };
>> @@ -3991,6 +3995,11 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
>>  int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
>>   struct btrfs_scrub_progress *progress);
>>  
>> +/* dev-replace.c */
>> +void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
>> +void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
>> +void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
>> +
>>  /* reada.c */
>>  struct reada_control {
>>  struct btrfs_root   *root;  /* tree to prefetch */
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index b20d59e..ec1c3f3 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -431,6 +431,35 @@ leave_no_lock:
>>  return ret;
>>  }
>>  
>> +/*
>> + * blocked until all flighting bios are finished.
>> + */
>> +static void btrfs_rm_dev_replace_blocked(struct btrfs_fs_info *fs_info)
>> +{
>> +s64 writers;
>> +DEFINE_WAIT(wait);
>> +
>> +set_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state);
>> +do {
>> +prepare_to_wait(&fs_info->replace_wait, &wait,
>> +TASK_UNINTERRUPTIBLE);
>> +writers = percpu_counter_sum(&fs_info->bio_counter);
>> +if (writers)
>> +schedule();
>> +finish_wait(&fs_info->replace_wait, &wait);
>> +} while (writers);
>> +}
>> +
>> +/*
>> + * we have removed target device, it is safe to allow new bios request.
>> + */
>> +static void btrfs_rm_dev_replace_unblocked(struct btrfs_fs_info *fs_info)
>> +{
>> +clear_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state);
>> +if (waitqueue_active(&fs_info->replace_wait))
>> +wake_up(&fs_info->replace_wait);
>> +}
>> +
>>  static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>> int scrub_ret)
>>  {
>> @@ -458,12 +487,6 @@ static int btrfs_dev_replace_finishing(struct 
>> btrfs_fs_info *fs_info,
>>  src_device = dev_replace->srcdev;
>>  btrfs_dev_replace_unlock(dev_replace);
>>  
>> -/* replace

Re: [PATCH 1/2] btrfs-progs: Add missing devices check for mounted btrfs.

2014-02-07 Thread Anand Jain




 IMO btrfs-progs shouldn't add its intelligence to know if disk
 is missing. If btrfs-kernel doesn't know when disk is missing
 that's a bug to fix in btrfs-kernel. yes that indeed true as
 of now in btrfs-kernel. btrfs kernel has no idea when disk
 goes missing, just -EIO doesn't tell btrfs that. I am trying
 to fix this first.

 But the problem is there isn't good way with in btrfs/FS
 to know when disk goes missing. did I miss anything ?


Thanks, Anand


On 02/07/2014 02:45 PM, Qu Wenruo wrote:

In btrfs/003 of xfstest, it will check whether btrfs fi show can find
missing devices.

But before the patch, btrfs-progs will not check whether device missing
if given a mounted btrfs mountpoint/block device.
This patch fixes the bug and will pass btrfs/003.

Signed-off-by: Qu Wenruo 
Cc: Anand Jain 
---
  cmds-filesystem.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 384d1b9..4c9933d 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -363,6 +363,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
char *label, char *path)
  {
int i;
+   int fd;
+   int missing;
char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
struct btrfs_ioctl_dev_info_args *tmp_dev_info;
int ret;
@@ -385,6 +387,14 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,

for (i = 0; i < fs_info->num_devices; i++) {
tmp_dev_info = (struct btrfs_ioctl_dev_info_args *)&dev_info[i];
+
+   /* Add check for missing devices even mounted */
+   fd = open((char *)tmp_dev_info->path, O_RDONLY);
+   if (fd < 0) {
+   missing = 1;
+   continue;
+   }
+   close(fd);
printf("\tdevid %4llu size %s used %s path %s\n",
tmp_dev_info->devid,
pretty_size(tmp_dev_info->total_bytes),
@@ -392,6 +402,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
tmp_dev_info->path);
}

+   if (missing)
+   printf("\t*** Some devices missing\n");
printf("\n");
return 0;
  }


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs-progs: Add -p/--print-missing options for btrfs fi show

2014-02-07 Thread Anand Jain



 Whats needed is more comprehensive btrfs fi show
 which shows the flags (including missing) per disk.
 And also show the FS/Raid status. Which I am working on.

 sorry -p feature would be covered by default in the
 coming revamp of btrfs fi show.

Thanks, Anand


On 02/07/2014 02:46 PM, Qu Wenruo wrote:

Since a mounted btrfs filesystem contains all the devices info even a
device is removed after mount(like btrfs/003 in xfstests),
we can use the info to print the known missing device if possible.

So -p/--print-missing options are added to print possible missing
devices.

Signed-off-by: Qu Wenruo 
---
  cmds-filesystem.c | 26 --
  man/btrfs.8.in|  4 +++-
  2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 4c9933d..77b142c 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -360,7 +360,7 @@ static u64 calc_used_bytes(struct btrfs_ioctl_space_args 
*si)
  static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
struct btrfs_ioctl_dev_info_args *dev_info,
struct btrfs_ioctl_space_args *space_info,
-   char *label, char *path)
+   char *label, char *path, int print_missing)
  {
int i;
int fd;
@@ -392,7 +392,14 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
fd = open((char *)tmp_dev_info->path, O_RDONLY);
if (fd < 0) {
missing = 1;
-   continue;
+   if (print_missing)
+   printf("\tdevid %4llu size %s used %s path %s 
(missing)\n",
+  tmp_dev_info->devid,
+  pretty_size(tmp_dev_info->total_bytes),
+  pretty_size(tmp_dev_info->bytes_used),
+  tmp_dev_info->path);
+   else
+   continue;
}
close(fd);
printf("\tdevid %4llu size %s used %s path %s\n",
@@ -440,7 +447,7 @@ static int check_arg_type(char *input)
return BTRFS_ARG_UNKNOWN;
  }

-static int btrfs_scan_kernel(void *search)
+static int btrfs_scan_kernel(void *search, int print_missing)
  {
int ret = 0, fd;
FILE *f;
@@ -477,7 +484,8 @@ static int btrfs_scan_kernel(void *search)
fd = open(mnt->mnt_dir, O_RDONLY);
if ((fd != -1) && !get_df(fd, &space_info_arg)) {
print_one_fs(&fs_info_arg, dev_info_arg,
-   space_info_arg, label, mnt->mnt_dir);
+space_info_arg, label, mnt->mnt_dir,
+print_missing);
kfree(space_info_arg);
memset(label, 0, sizeof(label));
}
@@ -500,6 +508,7 @@ static const char * const cmd_show_usage[] = {
"Show the structure of a filesystem",
"-d|--all-devices   show only disks under /dev containing btrfs 
filesystem",
"-m|--mounted   show only mounted btrfs",
+   "-p|--print-missing show known missing device if possible",
"If no argument is given, structure of all present filesystems is 
shown.",
NULL
  };
@@ -513,6 +522,7 @@ static int cmd_show(int argc, char **argv)
int ret;
int where = BTRFS_SCAN_LBLKID;
int type = 0;
+   int print_missing = 0;
char mp[BTRFS_PATH_NAME_MAX + 1];
char path[PATH_MAX];

@@ -521,9 +531,10 @@ static int cmd_show(int argc, char **argv)
static struct option long_options[] = {
{ "all-devices", no_argument, NULL, 'd'},
{ "mounted", no_argument, NULL, 'm'},
+   { "print-missing", no_argument, NULL, 'p'},
{ NULL, no_argument, NULL, 0 },
};
-   int c = getopt_long(argc, argv, "dm", long_options,
+   int c = getopt_long(argc, argv, "dmp", long_options,
&long_index);
if (c < 0)
break;
@@ -534,6 +545,9 @@ static int cmd_show(int argc, char **argv)
case 'm':
where = BTRFS_SCAN_MOUNTED;
break;
+   case 'p':
+   print_missing = 1;
+   break;
default:
usage(cmd_show_usage);
}
@@ -571,7 +585,7 @@ static int cmd_show(int argc, char **argv)
goto devs_only;

/* show mounted btrfs */
-   ret = btrfs_scan_kernel(search);
+   ret = btrfs_scan_kernel(search, print_missing);
if (search && !ret)
return 0;

diff --git a/man/btrfs.8.in b/man/btrfs.8.in
index 8fea115..db2e355 100644
--- a/man/btrfs.8.in

Re: [PATCH 2/2] Btrfs: fix use-after-free in the finishing procedure of the device replace

2014-02-07 Thread Stefan Behrens

On Thu, 30 Jan 2014 16:46:55 +0800, Miao Xie wrote:
> During device replace test, we hit a null pointer deference (It was very easy
> to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
> scsi driver). There were two bugs that caused this problem:
> - We might allocate new chunks on the replaced device after we updated
>   the mapping tree. And we forgot to replace the source device in those
>   mapping of the new chunks.
> - We might get the mapping information which including the source device
>   before the mapping information update. And then submit the bio which was
>   based on that mapping information after we freed the source device.
> 
> For the first bug, we can fix it by doing mapping tree update and source
> device remove in the same context of the chunk mutex. The chunk mutex is
> used to protect the allocable device list, the above method can avoid
> the new chunk allocation, and after we remove the source device, all
> the new chunks will be allocated on the new device. So it can fix
> the first bug.
> 
> For the second bug, we need make sure all flighting bios are finished and
> no new bios are produced during we are removing the source device. To fix
> this problem, we introduced a global @bio_counter, we not only inc/dec
> @bio_counter outsize of map_blocks, but also inc it before submitting bio
> and dec @bio_counter when ending bios.
> 
> Since Raid56 is a little different and device replace dosen't support raid56
> yet, it is not addressed in the patch and I add comments to make sure we will
> fix it in the future.
> 
> Reported-by: Qu Wenruo 
> Signed-off-by: Wang Shilong 
> Signed-off-by: Miao Xie 
> ---
>  fs/btrfs/ctree.h   |  9 ++
>  fs/btrfs/dev-replace.c | 74 
> +-
>  fs/btrfs/disk-io.c | 12 +++-
>  fs/btrfs/volumes.c | 30 +++-
>  fs/btrfs/volumes.h |  1 +
>  5 files changed, 111 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 84d4c05..c39ad06 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -351,6 +351,7 @@ static inline unsigned long btrfs_chunk_item_size(int 
> num_stripes)
>  #define BTRFS_FS_STATE_ERROR 0
>  #define BTRFS_FS_STATE_REMOUNTING1
>  #define BTRFS_FS_STATE_TRANS_ABORTED 2
> +#define BTRFS_FS_STATE_DEV_REPLACING 3
>  
>  /* Super block flags */
>  /* Errors detected */
> @@ -1674,6 +1675,9 @@ struct btrfs_fs_info {
>  
>   atomic_t mutually_exclusive_operation_running;
>  
> + struct percpu_counter bio_counter;
> + wait_queue_head_t replace_wait;
> +
>   struct semaphore uuid_tree_rescan_sem;
>   unsigned int update_uuid_tree_gen:1;
>  };
> @@ -3991,6 +3995,11 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
>  int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
>struct btrfs_scrub_progress *progress);
>  
> +/* dev-replace.c */
> +void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
> +void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
> +void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
> +
>  /* reada.c */
>  struct reada_control {
>   struct btrfs_root   *root;  /* tree to prefetch */
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index b20d59e..ec1c3f3 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -431,6 +431,35 @@ leave_no_lock:
>   return ret;
>  }
>  
> +/*
> + * blocked until all flighting bios are finished.
> + */
> +static void btrfs_rm_dev_replace_blocked(struct btrfs_fs_info *fs_info)
> +{
> + s64 writers;
> + DEFINE_WAIT(wait);
> +
> + set_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state);
> + do {
> + prepare_to_wait(&fs_info->replace_wait, &wait,
> + TASK_UNINTERRUPTIBLE);
> + writers = percpu_counter_sum(&fs_info->bio_counter);
> + if (writers)
> + schedule();
> + finish_wait(&fs_info->replace_wait, &wait);
> + } while (writers);
> +}
> +
> +/*
> + * we have removed target device, it is safe to allow new bios request.
> + */
> +static void btrfs_rm_dev_replace_unblocked(struct btrfs_fs_info *fs_info)
> +{
> + clear_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state);
> + if (waitqueue_active(&fs_info->replace_wait))
> + wake_up(&fs_info->replace_wait);
> +}
> +
>  static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  int scrub_ret)
>  {
> @@ -458,12 +487,6 @@ static int btrfs_dev_replace_finishing(struct 
> btrfs_fs_info *fs_info,
>   src_device = dev_replace->srcdev;
>   btrfs_dev_replace_unlock(dev_replace);
>  
> - /* replace old device with new one in mapping tree */
> - if (!scrub_ret)
> - btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
> -

50 matches

Mail list logo