Re: revert to static snapshot on reboot

2012-01-09 Thread Hugo Mills
On Sun, Jan 08, 2012 at 10:43:04PM -0800, bt...@spiritvideo.com wrote:
 Hi all --
 
 I just installed my first btrfs-based linux tonight, and I must say it
 gives me a very warm feeling!  Congratulations on all your hard work
 and your fine product.
 
 I administer laptops for a small school, and we want to implement what
 Deep Freeze (http://www.faronics.com/enterprise/deep-freeze) does for
 Windows -- no matter what a student does after they log in, when they
 reboot it is all forgotten and the computer has returned to a standard
 state.
 
 I would think this would be a FAQ, but I have searched the web and
 mailing list for the past couple of hours.
 
 Of course it's easy to mount a snapshot, but then if students make
 changes the snapshot changes.
 
 The plan that occurs to me is to make a snapshot of the system in the
 state that I want to always boot.  Then, I would rewrite the init
 script in the initrd to (a) delete any old tmp copy of the snapshot;
 (b) copy the static snapshot to a tmp copy; (c) mount the tmp copy.
 
 That's a little harder than I was hoping to work -- is there an easier
 way to get this functionality?

   I think you've got the right approach there. I can't immediately
see anything simpler.
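
   For reference, steps (a) to (c) come down to three commands. The
sketch below, in C, only builds the command strings and runs nothing.
All names are assumptions made for illustration and do not come from
the original post: `top` is where the top-level btrfs volume is
mounted inside the initrd, `golden` is the static snapshot, `scratch`
is the disposable copy (both directly under the top level), and
`newroot` is where the initrd expects the real root.

```c
#include <stdio.h>

#define CMD_LEN 256

/*
 * Build the shell commands for one "reset on boot" cycle.
 * All path and subvolume names are hypothetical and supplied by the caller.
 * Returns 0 on success, -1 if a command did not fit into CMD_LEN bytes.
 */
int build_reset_plan(char cmds[3][CMD_LEN], const char *dev, const char *top,
		     const char *golden, const char *scratch,
		     const char *newroot)
{
	int n[3];
	int i;

	/* (a) delete the scratch copy left over from the previous boot */
	n[0] = snprintf(cmds[0], CMD_LEN, "btrfs subvolume delete %s/%s",
			top, scratch);
	/* (b) take a fresh writable snapshot of the static snapshot */
	n[1] = snprintf(cmds[1], CMD_LEN,
			"btrfs subvolume snapshot %s/%s %s/%s",
			top, golden, top, scratch);
	/* (c) mount the scratch copy where the system expects its root */
	n[2] = snprintf(cmds[2], CMD_LEN, "mount -o subvol=%s %s %s",
			scratch, dev, newroot);

	for (i = 0; i < 3; i++)
		if (n[i] < 0 || n[i] >= CMD_LEN)
			return -1;
	return 0;
}
```

   On the very first boot there is no scratch copy yet, so the command
from step (a) would be expected to fail; an init script would probably
want to ignore that one error.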

   The other way of doing it I can think of, without using btrfs
snapshots, might be to mount / read-only, and then mount a disposable
writable layer on top of it with some union filesystem.

 I have a small ext4 boot partition containing grub, vmlinuz and
 initramfs.  Everything else is in a big btrfs root partition.  I am
 running Fedora 14, with Fedora-patched linux 2.6.35.  I could upgrade
 if necessary.

   Yes, do upgrade. Really, really do upgrade. 2.6.35 is nearly 18
months old, and there are many serious bugs that have been fixed in
btrfs since then. For btrfs, you should be running the latest release
kernel (3.2 currently) at the *minimum*. Preferably, you should be
running the (later-series) -rc kernels; I'd avoid -rc1 or -rc2, but by
-rc3 or so it's usually stabilised to the point that it's usable.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- He's playing Schubert.  I think Schubert is losing. ---   




Re: [PATCH 0/2] btrfs: allow cross-subvolume BTRFS_IOC_CLONE

2012-01-09 Thread Jérôme Poulin
On Mon, Jan 9, 2012 at 1:58 AM, Marios Titas redneb8...@gmail.com wrote:
 The simple case of 'cp --reflink' works fine [...]

 It doesn't work here:

 cp: failed to clone `/tmp/test': Invalid cross-device link

 That's with 3.1 + for-linus.

That is the problem: it doesn't work because you have to apply the
patch at http://permalink.gmane.org/gmane.comp.file-systems.btrfs/9865
which is not mainlined yet. This patch dates back to March 31st; I
have been using it since it was released.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/21] Btrfs: restriper

2012-01-09 Thread Ilya Dryomov
On Mon, Jan 09, 2012 at 01:50:34AM -0500, Marios Titas wrote:
 I tried this for many different scenarios and it seems to work pretty
 well. I only ran into one problematic case: If you remove a device
 from a multidevice filesystem it crashes. Here's how to reproduce it:
 
 truncate -s1g /tmp/test1
 truncate -s1g /tmp/test2
 losetup /dev/loop1 /tmp/test1
 losetup /dev/loop2 /tmp/test2
 mkdir /tmp/test
 ./mkfs.btrfs -L test -d single -m single /dev/loop1 /dev/loop2
 mount -o noatime /dev/loop1 /tmp/test
 ./btrfs dev del /dev/loop1 /tmp/test
 ./btrfs fi bal start /tmp/test
 
 There is no actual restriping involved but the above example does work
 correctly under 3.1+for-linus whereas it fails with your patches.

Thanks for your testing.  The good news is that I put that BUG() there
simply for debugging so it's nothing major:

	if (ret)
		BUG(); /* FIXME break ? */

It used to be just a break out of the loop there, so that's the reason
it doesn't panic with 3.1+for-linus.  I'll investigate further and fix
this.

Thanks,

Ilya



Re: How long does it take to balance a 2x1TB RAID1 ?

2012-01-09 Thread Phillip Susi

On 1/6/2012 6:23 AM, Dirk Lutzebäck wrote:

Hi,

I have set up a btrfs RAID1 using two 1TB drives. How long should a
'btrfs filesystem balance' take? It has now been running for more than
3 days at about 30% CPU and 40% wait state.

I am using stock btrfs from ubuntu 11.10 kernel 3.0.0


Not nearly that long.  Assuming it actually has to rewrite 1 TB of data 
and is only getting 50 MB/s, that should only take about 5.5 hours.  You 
might want to try a newer kernel ( like the one from 12.04 ).
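
The estimate above is plain arithmetic, sketched here in C. It assumes
decimal units (1 TB = 10^12 bytes, 50 MB/s = 50 * 10^6 bytes/s) and a
constant rate; the function name is made up for the example.

```c
/*
 * Hours needed to rewrite `bytes` at a sustained `bytes_per_sec`.
 * Pure arithmetic: seeks, metadata and competing I/O are ignored.
 */
double rewrite_hours(double bytes, double bytes_per_sec)
{
	return bytes / bytes_per_sec / 3600.0;
}
```

With those assumptions, rewrite_hours(1e12, 50e6) is 20000 seconds,
i.e. a little over five and a half hours, so a balance that has been
running for three days is well outside what raw throughput explains.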



Re: [PATCH] btrfs: change resize ioctl to take device path instead of id

2012-01-09 Thread Phillip Susi

Bump.

On 12/11/2011 10:12 PM, Phillip Susi wrote:

The resize ioctl took an optional argument that was a string
representation of the devid which you wish to resize.  For
the sake of consistency with the other ioctls that take a
device argument, I converted this to take a device path instead
of a devid number, and look up the number from the path.



[PATCH 2/2] btrfs-progs: document --rootdir mkfs switch

2012-01-09 Thread Phillip Susi

Signed-off-by: Phillip Susi ps...@cfl.rr.com
---
 man/mkfs.btrfs.8.in |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 542e6cf..25e817b 100644
--- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -12,6 +12,7 @@ mkfs.btrfs \- create an btrfs filesystem
 [ \fB\-M\fP\fI mixed data+metadata\fP ]
 [ \fB\-n\fP\fI nodesize\fP ]
 [ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-r\fP\fI rootdir\fP ]
 [ \fB\-h\fP ]
 [ \fB\-V\fP ]
 \fI device\fP [ \fIdevice ...\fP ]
@@ -59,6 +60,9 @@ Specify the nodesize. By default the value is set to the pagesize.
 \fB\-s\fR, \fB\-\-sectorsize \fIsize\fR
 Specify the sectorsize, the minimum block allocation.
 .TP
+\fB\-r\fR, \fB\-\-rootdir \fIrootdir\fR
+Specify a directory to copy into the newly created fs.
+.TP
 \fB\-V\fR, \fB\-\-version\fR
 Print the \fBmkfs.btrfs\fP version and exit.
 .SH AVAILABILITY
-- 
1.7.5.4



[PATCH 1/2] btrfs-progs: removed extraneous whitespace from mkfs man page

2012-01-09 Thread Phillip Susi
There were extra spaces around some of the arguments in the man
page for mkfs.

Signed-off-by: Phillip Susi ps...@cfl.rr.com
---
 man/mkfs.btrfs.8.in |   19 ++++++++++---------
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index 432db1b..542e6cf 100644
--- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -5,15 +5,16 @@ mkfs.btrfs \- create an btrfs filesystem
 .B mkfs.btrfs
 [ \fB\-A\fP\fI alloc-start\fP ]
 [ \fB\-b\fP\fI byte-count\fP ]
-[ \fB \-d\fP\fI data-profile\fP ]
-[ \fB \-l\fP\fI leafsize\fP ]
-[ \fB \-L\fP\fI label\fP ]
-[ \fB \-m\fP\fI metadata profile\fP ]
-[ \fB \-M\fP\fI mixed data+metadata\fP ]
-[ \fB \-n\fP\fI nodesize\fP ]
-[ \fB \-s\fP\fI sectorsize\fP ]
-[ \fB \-h\fP ]
-[ \fB \-V\fP ] \fI device\fP [ \fI device ...\fP ]
+[ \fB\-d\fP\fI data-profile\fP ]
+[ \fB\-l\fP\fI leafsize\fP ]
+[ \fB\-L\fP\fI label\fP ]
+[ \fB\-m\fP\fI metadata profile\fP ]
+[ \fB\-M\fP\fI mixed data+metadata\fP ]
+[ \fB\-n\fP\fI nodesize\fP ]
+[ \fB\-s\fP\fI sectorsize\fP ]
+[ \fB\-h\fP ]
+[ \fB\-V\fP ]
+\fI device\fP [ \fIdevice ...\fP ]
 .SH DESCRIPTION
 .B mkfs.btrfs
 is used to create an btrfs filesystem (usually in a disk partition, or an array
-- 
1.7.5.4



Odd behavior of subvolume find-new

2012-01-09 Thread David Brown

I've been creating some time-based snapshots, e.g.

 # btrfs subvolume snapshot @root 2012-01-09-@root

After some changes, I wanted to see what had changed, so I tried:

 # btrfs subvolume find-new @root 2012-01-09-@root
 transid marker was 37

which doesn't print anything out.  Curiously, if I make a snapshot of
the snapshot, then I get output from the delta:

 # btrfs subvolume snapshot 2012-01-09-@root tmp
 # btrfs subvolume find-new @root tmp
 . lots of output .

I haven't seen this behavior on other filesystems or subvolumes.

My intent was to filter through the small script below to compute the
size of the delta.

Thanks,
David

#! /usr/bin/perl

# Process the output of btrfs subvolume find-new and print out the
# size used by the new data.  Doesn't show delta in metadata, only the
# data itself.
use strict;

my $bytes = 0;
while (<>) {
    if (/ len (\d+) /) {
        $bytes += $1;
    }
}
printf "%d bytes\n", $bytes;
printf "%.1f MByte\n", $bytes / 1024.0 / 1024.0;
printf "%.1f GByte\n", $bytes / 1024.0 / 1024.0 / 1024.0;



Multiple btrfsck inode error 400 on Unclean Shutdowns

2012-01-09 Thread Mitch Harder
Lately, I've been running into a sharp increase in btrfsck inode 400
corruptions after an unclean shutdown.

The shutdowns have resulted from multiple sources (power outage, Xorg
keyboard misconfiguration, etc...).  I have not made any systematic
study of btrfs' robustness to corruption after an unclean shutdown,
but I've had at least 4 btrfs partitions report btrfsck inode 400
corruptions after unclean shutdowns.  That seems frequent enough for
me to develop concerns that a regression has slipped in somewhere.
Then again, maybe I'm just unlucky (I know you can't guarantee a
shutdown won't lead to a potential corruption).

So far, the impact of these corruptions has been minor.  I've been
able to pull the data from the partition without error for a reformat.
 However, I see numerous errors reported if I try to run a balance
once I get reports of btrfsck inode 400 corruptions.

Even though the impact of this corruption is minor, this frequency of
issues after an unclean shutdown seems much higher than I encounter
with competing file systems.

Has anybody else been encountering an increase in btrfsck inode 400
errors recently?


btrfs-related kernel oops due to media error

2012-01-09 Thread Vincent Vanackere

Hi,

One of my disks, partitioned into a single btrfs partition, is showing
media errors. The problem is that these errors lead to a kernel panic in
btrfs, which makes the filesystem unusable until reboot, and therefore
it is very hard for me to do a full backup of the data prior to changing
the disk.
My current kernel is 3.2.0-8-generic from Ubuntu/precise (based on linux
3.2-final) but I quickly tested and got the same error with an older 3.1
kernel (and I can probably reproduce it with a vanilla kernel if necessary).
I assume that the filesystem should not panic even in case of a media 
error... Is there any procedure I can follow / patch I could apply to 
salvage my data while ignoring media errors ?


logs/OOPS at the end of this mail, please let me know if more 
information is needed,


Best regards,

Vincent

---

   [  129.241636] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
   [  129.241640] ata6.00: BMDMA stat 0x24
   [  129.241643] ata6.00: failed command: READ DMA EXT
   [  129.241649] ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag
   0 dma 4096 in
   [  129.241651]  res 51/40:00:61:dc:2f/40:00:70:00:00/e0
   Emask 0x9 (media error)
   [  129.241654] ata6.00: status: { DRDY ERR }
   [  129.241656] ata6.00: error: { UNC }
   [  129.256243] ata6.00: configured for UDMA/133
   [  129.256261] ata6: EH complete
   [  131.640911] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
   [  131.640915] ata6.00: BMDMA stat 0x24
   [  131.640918] ata6.00: failed command: READ DMA EXT
   [  131.640922] ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag
   0 dma 4096 in
   [  131.640923]  res 51/40:00:61:dc:2f/40:00:70:00:00/e0
   Emask 0x9 (media error)
   [  131.640926] ata6.00: status: { DRDY ERR }
   [  131.640927] ata6.00: error: { UNC }
   [  131.656244] ata6.00: configured for UDMA/133
   [  131.656260] ata6: EH complete
   [  134.317351] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
   [  134.317355] ata6.00: BMDMA stat 0x24
   [  134.317359] ata6.00: failed command: READ DMA EXT
   [  134.317365] ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag
   0 dma 4096 in
   [  134.317366]  res 51/40:00:61:dc:2f/40:00:70:00:00/e0
   Emask 0x9 (media error)
   [  134.317369] ata6.00: status: { DRDY ERR }
   [  134.317371] ata6.00: error: { UNC }
   [  134.332234] ata6.00: configured for UDMA/133
   [  134.332248] ata6: EH complete
   [  136.894260] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
   [  136.894264] ata6.00: BMDMA stat 0x24
   [  136.894268] ata6.00: failed command: READ DMA EXT
   [  136.894274] ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag
   0 dma 4096 in
   [  136.894275]  res 51/40:00:61:dc:2f/40:00:70:00:00/e0
   Emask 0x9 (media error)
   [  136.894278] ata6.00: status: { DRDY ERR }
   [  136.894280] ata6.00: error: { UNC }
   [  136.924255] ata6.00: configured for UDMA/133
   [  136.924269] ata6: EH complete
   [  139.437990] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
   [  139.437994] ata6.00: BMDMA stat 0x24
   [  139.437998] ata6.00: failed command: READ DMA EXT
   [  139.438004] ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag
   0 dma 4096 in
   [  139.438005]  res 51/40:00:61:dc:2f/40:00:70:00:00/e0
   Emask 0x9 (media error)
   [  139.438008] ata6.00: status: { DRDY ERR }
   [  139.438010] ata6.00: error: { UNC }
   [  139.468239] ata6.00: configured for UDMA/133
   [  139.468253] ata6: EH complete
   [  141.937488] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
   [  141.937493] ata6.00: BMDMA stat 0x24
   [  141.937497] ata6.00: failed command: READ DMA EXT
   [  141.937503] ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag
   0 dma 4096 in
   [  141.937504]  res 51/40:00:61:dc:2f/40:00:70:00:00/e0
   Emask 0x9 (media error)
   [  141.937507] ata6.00: status: { DRDY ERR }
   [  141.937509] ata6.00: error: { UNC }
   [  141.952236] ata6.00: configured for UDMA/133
   [  141.952253] sd 5:0:0:0: [sdd] Unhandled sense code
   [  141.952256] sd 5:0:0:0: [sdd]  Result: hostbyte=DID_OK
   driverbyte=DRIVER_SENSE
   [  141.952260] sd 5:0:0:0: [sdd]  Sense Key : Medium Error [current]
   [descriptor]
   [  141.952264] Descriptor sense data with sense descriptors (in hex):
   [  141.952266] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
   [  141.952275] 70 2f dc 61
   [  141.952279] sd 5:0:0:0: [sdd]  Add. Sense: Unrecovered read error
   - auto reallocate failed
   [  141.952284] sd 5:0:0:0: [sdd] CDB: Read(10): 28 00 70 2f dc 5f 00
   00 08 00
   [  141.952293] end_request: I/O error, dev sdd, sector 1882184801
   [  141.952313] ata6: EH complete
   [  141.952335] BUG: unable to handle kernel NULL pointer dereference
   at   (null)
   [  141.952383] IP: [a018e439]
   extent_range_uptodate+0x59/0xe0 [btrfs]
   [  141.952440] PGD 21caae067 PUD 

[PATCH V2 1/3] Btrfs: fix btrfsck error 400 when truncating a compressed file extent

2012-01-09 Thread Miao Xie
Reproduce steps:
 # mkfs.btrfs /dev/sdb5
 # mount /dev/sdb5 -o compress=lzo /mnt
 # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
 # sync
 # truncate -s 64K /mnt/tmpfile
 # btrfsck /dev/sdb5
 root 5 inode 257 errors 400

This is because of a wrong if condition, which is used to check whether we
should subtract the bytes of the dropped range from i_blocks/i_bytes of the
i-node. When we truncate a compressed extent, btrfs subtracts the bytes of the
whole extent, which is wrong. We should subtract the real size that we
truncate, no matter whether it is a compressed extent or not. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changes v1 - v2:
- None.
---
 fs/btrfs/inode.c |    8 +-------
 1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 13b0542..85e2312 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3009,7 +3009,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
int pending_del_nr = 0;
int pending_del_slot = 0;
int extent_type = -1;
-   int encoding;
int ret;
int err = 0;
u64 ino = btrfs_ino(inode);
@@ -3059,7 +3058,6 @@ search_again:
leaf = path->nodes[0];
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
found_type = btrfs_key_type(&found_key);
-   encoding = 0;
 
if (found_key.objectid != ino)
break;
@@ -3072,10 +3070,6 @@ search_again:
fi = btrfs_item_ptr(leaf, path->slots[0],
struct btrfs_file_extent_item);
extent_type = btrfs_file_extent_type(leaf, fi);
-   encoding = btrfs_file_extent_compression(leaf, fi);
-   encoding |= btrfs_file_extent_encryption(leaf, fi);
-   encoding |= btrfs_file_extent_other_encoding(leaf, fi);
-
if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
item_end +=
btrfs_file_extent_num_bytes(leaf, fi);
@@ -3103,7 +3097,7 @@ search_again:
if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
u64 num_dec;
extent_start = btrfs_file_extent_disk_bytenr(leaf, fi);
-   if (!del_item && !encoding) {
+   if (!del_item) {
u64 orig_num_bytes =
btrfs_file_extent_num_bytes(leaf, fi);
extent_num_bytes = new_size -
-- 
1.7.6.5
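
The accounting rule from the commit message can be illustrated with a
small userspace model in C. This is a simplified sketch, not the kernel
code: the function names are hypothetical, the sector size is assumed
to be a power of two, the extent length is assumed to be a multiple of
it, and new_size is assumed to fall inside the extent.

```c
#include <stdint.h>

/* Round x up to a multiple of align; align must be a power of two. */
uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Bytes to subtract from the inode's byte count when the extent
 * [offset, offset + num_bytes) is cut at new_size.  Only the part that
 * is really dropped is counted, whether or not the extent is compressed.
 */
uint64_t truncated_bytes(uint64_t offset, uint64_t num_bytes,
			 uint64_t new_size, uint64_t sectorsize)
{
	uint64_t kept = round_up_pow2(new_size - offset, sectorsize);

	return num_bytes - kept;
}
```

For the reproducer above (one 128K extent at offset 0 cut to 64K with
4K sectors) this gives 64K, whereas subtracting the whole extent would
give 128K and leave the inode's byte count too low.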


[PATCH V2 2/3] Btrfs: make btrfs_truncate_inode_items() more readable

2012-01-09 Thread Miao Xie
As the title says, this patch just makes the truncation functions
more readable.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changes v1 - v2:
- move the return statement out of the if...else to make the logic of the
  code clearer.
---
 fs/btrfs/inode.c |  292 ++
 1 files changed, 162 insertions(+), 130 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85e2312..4d1d4c4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2977,10 +2977,145 @@ out:
return err;
 }
 
+static int btrfs_release_and_test_inline_data_extent(
+   struct btrfs_root *root,
+   struct inode *inode,
+   struct extent_buffer *leaf,
+   struct btrfs_file_extent_item *fi,
+   u64 offset,
+   u64 new_size)
+{
+   u64 item_end;
+
+   item_end = offset + btrfs_file_extent_inline_len(leaf, fi) - 1;
+
+   if (item_end < new_size)
+   return 0;
+
+   /*
+* Truncate inline items is special, we have done it by
+*   btrfs_truncate_page();
+*/
+   if (offset < new_size)
+   return 0;
+
+   if (root->ref_cows)
+   inode_sub_bytes(inode, item_end + 1 - offset);
+
+   return 1;
+}
+
 /*
- * this can truncate away extent items, csum items and directory items.
- * It starts at a high offset and removes keys until it can't find
- * any higher than new_size
+ * If this function return 1, it means this item can be dropped directly.
+ * If 0 is returned, the item can not be dropped.
+ */
+static int btrfs_release_and_test_data_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_path *path,
+ struct inode *inode,
+ u64 offset,
+ u64 new_size)
+{
+   struct extent_buffer *leaf;
+   struct btrfs_file_extent_item *fi;
+   u64 extent_start;
+   u64 extent_offset;
+   u64 item_end;
+   u64 ino = btrfs_ino(inode);
+   u64 orig_nbytes;
+   u64 new_nbytes;
+   int extent_type;
+   int ret;
+
+   leaf = path->nodes[0];
+   fi = btrfs_item_ptr(leaf, path->slots[0],
+   struct btrfs_file_extent_item);
+
+   extent_type = btrfs_file_extent_type(leaf, fi);
+   if (extent_type == BTRFS_FILE_EXTENT_INLINE)
+   return btrfs_release_and_test_inline_data_extent(root, inode,
+leaf, fi,
+offset,
+new_size);
+
+   item_end = offset + btrfs_file_extent_num_bytes(leaf, fi) - 1;
+
+   /*
+* If the new size is beyond the end of the extent:
+*   +--+
+*   |  |
+*   +--+
+*^ new size
+* so the extent should not be dropped or truncated.
+*/
+   if (item_end < new_size)
+   return 0;
+
+   extent_start = btrfs_file_extent_disk_bytenr(leaf, fi);
+   if (offset < new_size) {
+   /*
+* If the new size is in the extent:
+*   +--+
+*   |  |
+*   +--+
+*  ^ new size
+* so this extent should be truncated, not be dropped directly.
+*/
+   orig_nbytes = btrfs_file_extent_num_bytes(leaf, fi);
+   new_nbytes = round_up(new_size - offset, root->sectorsize);
+
+   btrfs_set_file_extent_num_bytes(leaf, fi, new_nbytes);
+
+   if (extent_start != 0 && root->ref_cows)
+   inode_sub_bytes(inode, orig_nbytes - new_nbytes);
+
+   btrfs_mark_buffer_dirty(leaf);
+
+   ret = 0;
+   } else {
+   /*
+* If the new size is in the font of the extent:
+*   +--+
+*   |  |
+*   +--+
+*  ^ new size
+* so this extent should be dropped.
+*/
+
+   /*
+* It is a dummy extent, or it is in log tree, we needn't do
+* anything, just drop it.
+*/
+   if (extent_start == 0 ||
+   !(root->ref_cows || root == root->fs_info->tree_root))
+   

[PATCH V2 3/3] Btrfs: improve truncation of btrfs

2012-01-09 Thread Miao Xie
The original truncation of btrfs has a bug: the orphan item is not dropped
when the truncation fails. This bug triggers BUG() when the truncated file is
unlinked. Besides that, if the user has pre-allocated space for a file whose
truncation failed, the pre-allocated extent is dropped after a re-mount
(umount then mount, not -o remount).

This patch modifies the functions related to truncation so that i_size and
disk_i_size of the i-node are updated every time a file extent is dropped
successfully, and are set to the real value. This way, we don't need to add
orphan items to guarantee the consistency of the metadata.

With this patch, the file may not be truncated to the size the user expects
(the result may be <= the original size and >= the expected one), so I think
we shouldn't lose the data that lies in the range between the expected size
and the real size, because the user may take it for granted that the data in
that extent is not lost. To implement this, we just write out all the dirty
pages that lie beyond the expected size of the file.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changes v1 - v2:
- None.
---
 fs/btrfs/inode.c |  159 +-
 1 files changed, 49 insertions(+), 110 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4d1d4c4..77a295d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -88,7 +88,7 @@ static unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = {
 };
 
 static int btrfs_setsize(struct inode *inode, loff_t newsize);
-static int btrfs_truncate(struct inode *inode);
+static int btrfs_truncate(struct inode *inode, loff_t newsize);
 static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end);
 static noinline int cow_file_range(struct inode *inode,
   struct page *locked_page,
@@ -2230,7 +2230,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 * btrfs_delalloc_reserve_space to catch offenders.
 */
mutex_lock(&inode->i_mutex);
-   ret = btrfs_truncate(inode);
+   ret = btrfs_truncate(inode, inode->i_size);
mutex_unlock(&inode->i_mutex);
} else {
nr_unlink++;
@@ -2993,7 +2993,7 @@ static int btrfs_release_and_test_inline_data_extent(
return 0;
 
/*
-* Truncate inline items is special, we have done it by
+* Truncate inline items is special, we will do it by
 *   btrfs_truncate_page();
 */
if (offset < new_size)
@@ -3124,9 +3124,9 @@ static int btrfs_release_and_test_data_extent(struct btrfs_trans_handle *trans,
  * will kill all the items on this inode, including the INODE_ITEM_KEY.
  */
 int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
-   struct inode *inode,
-   u64 new_size, u32 min_type)
+  struct btrfs_root *root,
+  struct inode *inode,
+  u64 new_size, u32 min_type)
 {
struct btrfs_path *path;
struct extent_buffer *leaf;
@@ -3134,6 +3134,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
struct btrfs_key found_key;
u64 mask = root->sectorsize - 1;
u64 ino = btrfs_ino(inode);
+   u64 old_size = i_size_read(inode);
u32 found_type;
int pending_del_nr = 0;
int pending_del_slot = 0;
@@ -3141,6 +3142,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
int err = 0;
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
+   BUG_ON(new_size & mask);
 
path = btrfs_alloc_path();
if (!path)
@@ -3193,6 +3195,13 @@ search_again:
ret = btrfs_release_and_test_data_extent(trans, root,
path, inode, found_key.offset,
new_size);
+   if (root->ref_cows ||
+   root == root->fs_info->tree_root) {
+   if (ret && found_key.offset < old_size)
+   i_size_write(inode, found_key.offset);
+   else if (!ret)
+   i_size_write(inode, new_size);
+   }
if (!ret)
break;
}
@@ -3250,12 +3259,10 @@ out:
 static int btrfs_truncate_page(struct address_space *mapping, loff_t from)
 {
struct inode *inode = mapping->host;
-   struct btrfs_root *root = BTRFS_I(inode)->root;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct 

[RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-09 Thread Liu Bo
Since we are inclined to apply a lockless scheme to some btrfs objects for
higher performance, we want to build an RCU version of the Probabilistic
Skiplist.

Our skiplist algorithm is based on the skiplist experiments of
Con Kolivas ker...@kolivas.org for the BFS cpu scheduler.
More details about the skiplist design are in patch 1.

Right now we plan to apply the skiplist to extent_map and extent_state.

We start with extent_map, since it is read-mostly and the change is quite
direct. All we need to do is
a) to replace rbtree with skiplist,
b) to add rcu support.
More details are in patch 2 and patch 3.

I've done some simple performance tests on my 2-core box and there is no
obvious difference, but for now I want to focus on the design side and make
sure there are no more bugs in it.

As a long-term goal, we want to move the skiplist to lib, like lib/rbtree.c.

MORE TESTS ARE WELCOME!
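
For readers who have not met the data structure, here is a minimal
userspace sketch of a probabilistic skiplist in C. It is not the code
from patch 1 (which is not shown in this thread): there is no RCU, no
locking and no deletion, and every name (sl_node, sl_insert, and so on)
is hypothetical. It only shows the idea that each node is linked into a
random number of sorted lists, so a lookup can skip ahead on the higher
levels.

```c
#include <stdint.h>
#include <stdlib.h>

#define SL_MAX_LEVEL 8

struct sl_node {
	uint64_t key;
	struct sl_node *next[SL_MAX_LEVEL];	/* next[i]: successor on level i */
};

/* The head is a sentinel node; its key is never compared. */
void sl_init(struct sl_node *head)
{
	int i;

	for (i = 0; i < SL_MAX_LEVEL; i++)
		head->next[i] = NULL;
}

/* Each additional level is taken with probability 1/2. */
int sl_random_level(void)
{
	int level = 1;

	while (level < SL_MAX_LEVEL && (rand() & 1))
		level++;
	return level;
}

/* Return the node holding key, or NULL if it is not in the list. */
struct sl_node *sl_lookup(struct sl_node *head, uint64_t key)
{
	struct sl_node *n = head;
	int i;

	/* Walk right on the highest level, drop one level when blocked. */
	for (i = SL_MAX_LEVEL - 1; i >= 0; i--)
		while (n->next[i] && n->next[i]->key < key)
			n = n->next[i];
	n = n->next[0];
	return (n && n->key == key) ? n : NULL;
}

/* Insert key; returns 0 on success, -1 on duplicate or failed allocation. */
int sl_insert(struct sl_node *head, uint64_t key)
{
	struct sl_node *update[SL_MAX_LEVEL];	/* last node before key, per level */
	struct sl_node *n = head, *node;
	int i, level;

	for (i = SL_MAX_LEVEL - 1; i >= 0; i--) {
		while (n->next[i] && n->next[i]->key < key)
			n = n->next[i];
		update[i] = n;
	}
	if (n->next[0] && n->next[0]->key == key)
		return -1;

	node = malloc(sizeof(*node));
	if (!node)
		return -1;
	node->key = key;
	level = sl_random_level();
	/* Link the node into the lowest `level` lists only. */
	for (i = 0; i < SL_MAX_LEVEL; i++)
		node->next[i] = i < level ? update[i]->next[i] : NULL;
	for (i = 0; i < level; i++)
		update[i]->next[i] = node;
	return 0;
}
```

Because an insert only rewrites a handful of next pointers, the
structure lends itself to the pointer-publication style that RCU
readers rely on, which is presumably why it was picked here.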

---
changes v2:
- fix a bug reported by David Sterba d...@jikos.cz, thanks a lot!
- use mutex lock to protect extent_map updater side, so that we can make
  the reclaim code much easier.
  I've also run xfstests: no panic occurred, but tests 273 and 274
  failed. I tested them without my patches and found that they also
  fail on upstream.
---

Liu Bo (3):
  Btrfs: add the Probabilistic Skiplist
  Btrfs: rebuild extent_map based on skiplist
  Btrfs: convert rwlock to RCU for extent_map

 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |8 +-
 fs/btrfs/disk-io.c |9 +-
 fs/btrfs/extent_io.c   |   13 +--
 fs/btrfs/extent_map.c  |  278 +++-
 fs/btrfs/extent_map.h  |   19 +++-
 fs/btrfs/file.c|   11 +-
 fs/btrfs/inode.c   |   28 +++---
 fs/btrfs/ioctl.c   |8 +-
 fs/btrfs/relocation.c  |4 +-
 fs/btrfs/scrub.c   |4 +-
 fs/btrfs/skiplist.c|  101 +
 fs/btrfs/skiplist.h|  217 +
 fs/btrfs/volumes.c |   58 +-
 14 files changed, 585 insertions(+), 175 deletions(-)
 create mode 100644 fs/btrfs/skiplist.c
 create mode 100644 fs/btrfs/skiplist.h



[RFC PATCH v2 3/3] Btrfs: convert rwlock to RCU for extent_map

2012-01-09 Thread Liu Bo
In this patch, we do two things:

a) skiplist -> rcu-skiplist
   This is quite direct: since each level of a skiplist is a list,
   any modification to the skiplist comes down to pointer changes,
   which fits RCU's semantics.

b) use the rcu lock on the reader side and a mutex on the updater side
   to protect extent_map, instead of the rwlock.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
changes v2:
- fix a bug reported by David Sterba d...@jikos.cz, thanks a lot!
- use mutex lock to protect extent_map updater side, so that we can make
  the reclaim code much easier.
---
 fs/btrfs/compression.c |8 
 fs/btrfs/disk-io.c |9 -
 fs/btrfs/extent_io.c   |   13 ++---
 fs/btrfs/extent_map.c  |   28 ++--
 fs/btrfs/extent_map.h  |5 ++---
 fs/btrfs/file.c|   11 ++-
 fs/btrfs/inode.c   |   28 ++--
 fs/btrfs/ioctl.c   |8 
 fs/btrfs/relocation.c  |4 ++--
 fs/btrfs/scrub.c   |4 ++--
 fs/btrfs/skiplist.c|   11 +++
 fs/btrfs/skiplist.h|   25 -
 fs/btrfs/volumes.c |   36 ++--
 13 files changed, 103 insertions(+), 87 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 14f1c5a..bb4ac31 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -498,10 +498,10 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 */
set_page_extent_mapped(page);
lock_extent(tree, last_offset, end, GFP_NOFS);
-   read_lock(&em_tree->lock);
+   rcu_read_lock();
em = lookup_extent_mapping(em_tree, last_offset,
   PAGE_CACHE_SIZE);
-   read_unlock(&em_tree->lock);
+   rcu_read_unlock();
 
if (!em || last_offset < em->start ||
(last_offset + PAGE_CACHE_SIZE > extent_map_end(em)) ||
@@ -583,11 +583,11 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
em_tree = &BTRFS_I(inode)->extent_tree;
 
/* we need the actual starting offset of this extent in the file */
-   read_lock(&em_tree->lock);
+   rcu_read_lock();
em = lookup_extent_mapping(em_tree,
   page_offset(bio->bi_io_vec->bv_page),
   PAGE_CACHE_SIZE);
-   read_unlock(&em_tree->lock);
+   rcu_read_unlock();
 
compressed_len = em->block_len;
cb = kmalloc(compressed_bio_size(root, compressed_len), GFP_NOFS);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3f9d555..8e09517 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -191,15 +191,14 @@ static struct extent_map *btree_get_extent(struct inode *inode,
struct extent_map *em;
int ret;
 
-   read_lock(&em_tree->lock);
+   rcu_read_lock();
em = lookup_extent_mapping(em_tree, start, len);
+   rcu_read_unlock();
if (em) {
em->bdev =
BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
-   read_unlock(&em_tree->lock);
goto out;
}
-   read_unlock(&em_tree->lock);
 
em = alloc_extent_map();
if (!em) {
@@ -212,7 +211,7 @@ static struct extent_map *btree_get_extent(struct inode *inode,
em->block_start = 0;
em->bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 
-   write_lock(&em_tree->lock);
+   mutex_lock(&em_tree->lock);
ret = add_extent_mapping(em_tree, em);
if (ret == -EEXIST) {
u64 failed_start = em->start;
@@ -231,7 +230,7 @@ static struct extent_map *btree_get_extent(struct inode *inode,
free_extent_map(em);
em = NULL;
}
-   write_unlock(&em_tree->lock);
+   mutex_unlock(&em_tree->lock);
 
if (ret)
em = ERR_PTR(ret);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 49f3c9d..7efa8dd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2013,10 +2013,10 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page,
failrec->bio_flags = 0;
failrec->in_validation = 0;
 
-   read_lock(&em_tree->lock);
+   rcu_read_lock();
em = lookup_extent_mapping(em_tree, start, failrec->len);
+   rcu_read_unlock();
if (!em) {
-   read_unlock(&em_tree->lock);
kfree(failrec);
return -EIO;
}
@@ -2025,7 +2025,6 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page,
free_extent_map(em);
em = NULL;
}
-   read_unlock(&em_tree->lock);
 
if (!em || IS_ERR(em)) {
kfree(failrec);
@@ -3286,15 +3285,15 @@ 

[RFC PATCH v2 2/3] Btrfs: rebuild extent_map based on skiplist

2012-01-09 Thread Liu Bo
extent_map fits a read-mostly scenario. Since we want to build an
RCU skiplist later, we first build a new version of extent_map based
on a skiplist.
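
For readers following along, here is a minimal userspace sketch of the
idea: extents kept in a skiplist ordered by start offset, plus a lookup
that returns the extent containing a given offset. It is not the code
from this series. struct extent, struct extent_list, list_insert() and
list_lookup() are hypothetical names, node levels are passed in by the
caller instead of being chosen at random, extents are assumed not to
overlap, and there is no locking or RCU.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_LEVEL 4

/* Hypothetical stand-in for an extent map entry. */
struct extent {
	uint64_t start;
	uint64_t len;
	struct extent *next[MAX_LEVEL];	/* one forward pointer per level */
};

/* The list is just a sentinel head node; head never matches a lookup. */
struct extent_list {
	struct extent head;
};

static void list_init(struct extent_list *l)
{
	l->head.start = 0;
	l->head.len = 0;
	for (int i = 0; i < MAX_LEVEL; i++)
		l->head.next[i] = NULL;
}

/* Same containment test as start <= offset && end - 1 >= offset. */
static int contains(const struct extent *e, uint64_t offset)
{
	return e->start <= offset && offset - e->start < e->len;
}

/* Insert [start, start + len) with the given level (1..MAX_LEVEL). */
static struct extent *list_insert(struct extent_list *l, uint64_t start,
				  uint64_t len, int level)
{
	struct extent *update[MAX_LEVEL];
	struct extent *p = &l->head;
	struct extent *e;

	if (level < 1)
		level = 1;
	if (level > MAX_LEVEL)
		level = MAX_LEVEL;

	/* Record, per level, the last node that sorts before 'start'. */
	for (int i = MAX_LEVEL - 1; i >= 0; i--) {
		while (p->next[i] && p->next[i]->start < start)
			p = p->next[i];
		update[i] = p;
	}

	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->start = start;
	e->len = len;

	/* Splice the new node in on each level it participates in. */
	for (int i = 0; i < level; i++) {
		e->next[i] = update[i]->next[i];
		update[i]->next[i] = e;
	}
	return e;
}

/* Return the extent containing 'offset', or NULL if there is none. */
static struct extent *list_lookup(struct extent_list *l, uint64_t offset)
{
	struct extent *p = &l->head;

	/* Walk down the levels to the last node with start <= offset. */
	for (int i = MAX_LEVEL - 1; i >= 0; i--)
		while (p->next[i] && p->next[i]->start <= offset)
			p = p->next[i];

	if (p != &l->head && contains(p, offset))
		return p;
	return NULL;
}
```

Lookups only follow forward pointers and never modify the list, which
is the property that makes this kind of structure a candidate for
RCU-protected readers.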

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/extent_map.c |  258 -
 fs/btrfs/extent_map.h |   14 +++-
 fs/btrfs/volumes.c|   22 ++--
 3 files changed, 190 insertions(+), 104 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 7c97b33..e0a7881 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -9,6 +9,13 @@
 
 static struct kmem_cache *extent_map_cache;
 
+static LIST_HEAD(maps);
+
+#define MAP_LEAK_DEBUG 1
+#if MAP_LEAK_DEBUG
+static DEFINE_SPINLOCK(map_leak_lock);
+#endif
+
 int __init extent_map_init(void)
 {
extent_map_cache = kmem_cache_create("extent_map",
@@ -21,6 +28,30 @@ int __init extent_map_init(void)
 
 void extent_map_exit(void)
 {
+   struct extent_map *em;
+
+#if MAP_LEAK_DEBUG
+   struct list_head *tmp;
+   int count = 0;
+
+   list_for_each(tmp, &maps)
+   count++;
+
+   printk(KERN_INFO "%d em is left to free\n", count);
+
+   while (!list_empty(&maps)) {
+   cond_resched();
+   em = list_entry(maps.next, struct extent_map, leak_list);
+   printk(KERN_ERR "btrfs extent map: start %llu, len %llu "
+   "refs %d block_start %llu, block_len %llu, in_tree %u\n",
+em->start, em->len, atomic_read(&em->refs),
+em->block_start, em->block_len, em->in_tree);
+   WARN_ON(1);
+   list_del(&em->leak_list);
+   kmem_cache_free(extent_map_cache, em);
+   }
+#endif
+
if (extent_map_cache)
kmem_cache_destroy(extent_map_cache);
 }
@@ -34,7 +65,8 @@ void extent_map_exit(void)
  */
 void extent_map_tree_init(struct extent_map_tree *tree)
 {
-   tree->map = RB_ROOT;
+   tree->head.start = (-1ULL);
+   sl_init_list(&tree->map, &tree->head.sl_node);
rwlock_init(&tree->lock);
 }
 
@@ -48,16 +80,41 @@ void extent_map_tree_init(struct extent_map_tree *tree)
 struct extent_map *alloc_extent_map(void)
 {
struct extent_map *em;
+#if MAP_LEAK_DEBUG
+   unsigned long flags;
+#endif
+
em = kmem_cache_alloc(extent_map_cache, GFP_NOFS);
if (!em)
return NULL;
em->in_tree = 0;
em->flags = 0;
em->compress_type = BTRFS_COMPRESS_NONE;
+   sl_init_node(&em->sl_node);
atomic_set(&em->refs, 1);
+#if MAP_LEAK_DEBUG
+   spin_lock_irqsave(&map_leak_lock, flags);
+   list_add(&em->leak_list, &maps);
+   spin_unlock_irqrestore(&map_leak_lock, flags);
+#endif
return em;
 }
 
+static inline void __free_extent_map(struct extent_map *em)
+{
+#if MAP_LEAK_DEBUG
+   unsigned long flags;
+
+   spin_lock_irqsave(&map_leak_lock, flags);
+   list_del(&em->leak_list);
+   spin_unlock_irqrestore(&map_leak_lock, flags);
+#endif
+
+   WARN_ON(em->in_tree);
+   sl_free_node(&em->sl_node);
+   kmem_cache_free(extent_map_cache, em);
+}
+
 /**
  * free_extent_map - drop reference count of an extent_map
  * @em:extent map beeing releasead
@@ -69,91 +126,113 @@ void free_extent_map(struct extent_map *em)
 {
if (!em)
return;
+
WARN_ON(atomic_read(&em->refs) == 0);
-   if (atomic_dec_and_test(&em->refs)) {
-   WARN_ON(em->in_tree);
-   kmem_cache_free(extent_map_cache, em);
-   }
+   if (atomic_dec_and_test(&em->refs))
+   __free_extent_map(em);
 }
 
-static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
-  struct rb_node *node)
+static inline int in_entry(struct sl_node *node, u64 offset)
 {
-   struct rb_node **p = &root->rb_node;
-   struct rb_node *parent = NULL;
struct extent_map *entry;
 
-   while (*p) {
-   parent = *p;
-   entry = rb_entry(parent, struct extent_map, rb_node);
+   entry = sl_entry(node, struct extent_map, sl_node);
+   if (!node->head &&
+   entry->start <= offset && extent_map_end(entry) - 1 >= offset)
+   return 1;
+   return 0;
+}
 
-   WARN_ON(!entry->in_tree);
+static inline struct extent_map *next_entry(struct sl_node *p, int l,
+   struct sl_node **q)
+{
+   struct extent_map *ret;
+   struct sl_node *next;
 
-   if (offset < entry->start)
-   p = &(*p)->rb_left;
-   else if (offset >= extent_map_end(entry))
-   p = &(*p)->rb_right;
-   else
-   return parent;
-   }
+   next = __sl_next_with_level(p, l);
+   ret = sl_entry(next, struct extent_map, sl_node);
+   BUG_ON(!ret);
+   *q = next;
 
-   entry = rb_entry(node, struct extent_map, rb_node);
-   entry->in_tree = 1;
-