Re: [PATCH 2/2] Btrfs: resize all devices when we dont assign a specific device id

2012-05-22 Thread Goffredo Baroncelli
Hi

On 05/17/2012 02:08 PM, Liu Bo wrote:
> This patch fixes two bugs:
> 
> When we do not assign a device id for the resizer,
> - it will only take one device to resize, which is supposed to apply on
>   all available devices.
> - it will take 'id 1' device as default, and this will cause a bug as we
>   may have removed the 'id 1' device from the filesystem.
> 
> After this patch, we can find all available devices by searching the
> chunk tree and resize them:


I am not sure that this is a sane default for all resizing.

If the user wants to resize to MAX, I agree that it is a sane default,
but when the user wants to shrink or enlarge by a fixed quantity, the
user should specify the dev id, because shrinking and enlarging
should be evaluated case by case.

My suggestion is to change the code at kernel level so that, in the case of
a multi-volume filesystem, the user *has* to specify the device to shrink
and/or enlarge.
The user-space btrfs tool should handle the check and the growing
(i.e., if the new size is max, automatically grow all the devices up to
max; otherwise the user should specify the device to shrink and/or
enlarge).
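
A minimal sketch of the policy suggested above, written as user-space logic. The function name `plan_resize` and its arguments are hypothetical; they are not part of the btrfs tool or of the ioctl interface. The sketch only illustrates the rule "max applies to every device, anything else needs an explicit device id on a multi-device filesystem".

```python
def plan_resize(devids, size, devid=None):
    """Return the list of device ids a resize request should touch.

    devids: ids of the devices currently in the filesystem (hypothetical input)
    size:   the size argument as typed by the user, e.g. "max", "-100m", "+1g"
    devid:  device id given by the user, or None if it was omitted
    """
    if devid is not None:
        if devid not in devids:
            raise ValueError("no such device id: %d" % devid)
        return [devid]
    if size == "max":
        # Growing to the maximum is unambiguous, so apply it to every device.
        return sorted(devids)
    if len(devids) == 1:
        # A single-device filesystem leaves no choice to make.
        return list(devids)
    # A relative or absolute size on several devices needs an explicit target.
    raise ValueError("multi-device filesystem: specify a device id")
```

Under these assumptions, `plan_resize([1, 2], "max")` selects both devices, while `plan_resize([1, 2], "-100m")` is rejected until the caller names a device.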

BR
Goffredo


> 
> $ mkfs.btrfs /dev/sdb7
> $ mount /dev/sdb7 /mnt/btrfs/
> $ btrfs dev add /dev/sdb8 /mnt/btrfs/
> 
> $ btrfs fi resize -100m /mnt/btrfs/
> then we can get from dmesg:
> btrfs: new size for /dev/sdb7 is 980844544
> btrfs: new size for /dev/sdb8 is 980844544
> 
> $ btrfs fi resize max /mnt/btrfs
> then we can get from dmesg:
> btrfs: new size for /dev/sdb7 is 1085702144
> btrfs: new size for /dev/sdb8 is 1085702144
> 
> $ btrfs fi resize 1:-100m /mnt/btrfs
> then we can get from dmesg:
> btrfs: resizing devid 1
> btrfs: new size for /dev/sdb7 is 980844544
> 
> $ btrfs fi resize 2:-100m /mnt/btrfs
> then we can get from dmesg:
> btrfs: resizing devid 2
> btrfs: new size for /dev/sdb8 is 980844544
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/ioctl.c |  101 
> --
>  1 files changed, 83 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ec2245d..d9a4fa8 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1250,12 +1250,51 @@ out_ra:
>   return ret;
>  }
>  
> +static struct btrfs_device *get_avail_device(struct btrfs_root *root, u64 devid)
> +{
> + struct btrfs_key key;
> + struct btrfs_path *path;
> + struct btrfs_dev_item *dev_item;
> + struct btrfs_device *device = NULL;
> + int ret;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return ERR_PTR(-ENOMEM);
> +
> + key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
> + key.offset = devid;
> + key.type = BTRFS_DEV_ITEM_KEY;
> +
> + ret = btrfs_search_slot(NULL, root->fs_info->chunk_root, &key,
> + path, 0, 0);
> + if (ret < 0) {
> + device = ERR_PTR(ret);
> + goto out;
> + }
> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> + if (key.objectid != BTRFS_DEV_ITEMS_OBJECTID ||
> + key.type != BTRFS_DEV_ITEM_KEY) {
> + device = NULL;
> + goto out;
> + }
> + dev_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +   struct btrfs_dev_item);
> + devid = btrfs_device_id(path->nodes[0], dev_item);
> +
> + device = btrfs_find_device(root, devid, NULL, NULL);
> +out:
> + btrfs_free_path(path);
> + return device;
> +}
> +
>  static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
>   void __user *arg)
>  {
> - u64 new_size;
> + u64 new_size = 0;
>   u64 old_size;
> - u64 devid = 1;
> + u64 orig_new_size = 0;
> + u64 devid = (-1ULL);
>   struct btrfs_ioctl_vol_args *vol_args;
>   struct btrfs_trans_handle *trans;
>   struct btrfs_device *device = NULL;
> @@ -1263,6 +1302,8 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
>   char *devstr = NULL;
>   int ret = 0;
>   int mod = 0;
> + int scan_all = 1;
> + int use_max = 0;
>  
>   if (root->fs_info->sb->s_flags & MS_RDONLY)
>   return -EROFS;
> @@ -1295,8 +1336,31 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
>   devid = simple_strtoull(devstr, &end, 10);
>   printk(KERN_INFO "btrfs: resizing devid %llu\n",
>  (unsigned long long)devid);
> + scan_all = 0;
>   }
> - device = btrfs_find_device(root, devid, NULL, NULL);
> +
> + if (!strcmp(sizestr, "max")) {
> + use_max = 1;
> + } else {
> + if (sizestr[0] == '-') {
> + mod = -1;
> + sizestr++;
> + } else if (sizestr[0] == '+') {
> + mod = 1;
> + sizestr++;
> + }
> + orig_new_size = memparse(sizestr, NULL);
> + if (orig_

Re: SSD erase state and reducing SSD wear

2012-05-22 Thread Calvin Walton
On Tue, 2012-05-22 at 22:47 +0100, Martin wrote:
> I've got two recent examples of SSDs. Their pristine state from the
> manufacturer shows:

> Device Model: OCZ-VERTEX3
>   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

> Device Model: OCZ VERTEX PLUS
>  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

> What's a good way to test what state they get erased to from a TRIM
> operation?

This pristine state probably matches up with the result of a trim
command on the drive. In particular, a freshly erased flash block is in
a state where the bits are all 1, so the Vertex Plus drive is showing
you the flash contents directly. The Vertex 3 has substantially more
processing, and the 0s are effectively generated on the fly for unmapped
flash blocks (similar to how the missing portions of a sparse file
contain 0s).
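
As a rough way to check what a drive reports for a region after it has been trimmed, one can read the region back and classify its fill pattern. The sketch below is only an illustration: `classify_fill` and `read_region` are hypothetical helper names, issuing the TRIM itself is left to whatever tool you use, and reading a block device normally requires root.

```python
def classify_fill(block: bytes) -> str:
    """Classify a block read back from a device by its fill pattern."""
    if not block:
        raise ValueError("empty block")
    if block == bytes(len(block)):
        return "zeros"   # every byte is 0x00, as on the Vertex 3 above
    if block == b"\xff" * len(block):
        return "ones"    # every byte is 0xff, as on the Vertex Plus above
    return "data"        # anything else: the region still holds content


def read_region(path, offset, length):
    """Read `length` bytes at `offset` from a device node or image file."""
    with open(path, "rb") as dev:
        dev.seek(offset)
        return dev.read(length)
```

For example, `classify_fill(read_region("/dev/sdX", 0, 4096))`, where `/dev/sdX` is a placeholder for the device under test, reports which of the two pristine patterns the first 4 KiB currently shows.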

> Can btrfs detect the erase state and pad unused space in filesystem
> writes with the same value so as to reduce SSD wear?

On the Vertex 3, this wouldn't actually do what you'd hope. The firmware
in that drive actually compresses, deduplicates, and encrypts all the
data prior to writing it to flash - and as a result the data that hits
the flash looks nothing like what the filesystem wrote.
(For best performance, it might make sense to disable btrfs's built-in
compression on the Vertex 3 drive to allow the drive's compression to
kick in. Let us know if you benchmark it either way.)

The benefit to doing this on the Vertex Plus is probably fairly small,
since to rewrite a block - even if the block is partially unwritten - is
still likely to require a read-modify-write cycle with an erase step.
The granularity of the erase blocks is just too big for the savings to
be very meaningful.

-- 
Calvin Walton 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/3] Btrfs: read device stats on mount, write modified ones during commit

2012-05-22 Thread Liu Bo
On 05/22/2012 06:53 PM, Stefan Behrens wrote:

> The device statistics are written into the device tree with each
> transaction commit. Only modified statistics are written.
> When a filesystem is mounted, the device statistics for each involved
> device are read from the device tree and used to initialize the
> counters.
> 
> Signed-off-by: Stefan Behrens 
> ---
>  fs/btrfs/ctree.h   |   51 
>  fs/btrfs/disk-io.c |7 ++
>  fs/btrfs/print-tree.c  |3 +
>  fs/btrfs/transaction.c |4 +
>  fs/btrfs/volumes.c |  205 
> 
>  fs/btrfs/volumes.h |9 +++
>  6 files changed, 279 insertions(+)
> 
[...]
> +static int update_device_stat_item(struct btrfs_trans_handle *trans,
> +struct btrfs_root *dev_root,
> +struct btrfs_device *device)
> +{
> + struct btrfs_path *path;
> + struct btrfs_key key;
> + struct extent_buffer *eb;
> + struct btrfs_device_stats_item *ptr;
> + int ret;
> +
> + key.objectid = 0;
> + key.type = BTRFS_DEVICE_STATS_KEY;
> + key.offset = device->devid;
> +
> + path = btrfs_alloc_path();
> + BUG_ON(!path);
> + ret = btrfs_search_slot(trans, dev_root, &key, path, 0, 1);


Since we may delete this item, I prefer ins_len = -1 (with cow = 1):

btrfs_search_slot(trans, dev_root, &key, path, -1, 1);

thanks,
liubo

> + if (ret < 0) {
> + printk(KERN_WARNING "btrfs: error %d while searching for device_stats item for device %s!\n",
> +ret, device->name);
> + goto out;
> + }
> +
> + if (ret == 0 &&
> + btrfs_item_size_nr(path->nodes[0], path->slots[0]) < sizeof(*ptr)) {
> + /* need to delete old one and insert a new one */
> + ret = btrfs_del_item(trans, dev_root, path);
> + if (ret != 0) {
> + printk(KERN_WARNING "btrfs: delete too small device_stats item for device %s failed %d!\n",
> +device->name, ret);
> + goto out;
> + }
> + ret = 1;
> + }
> +
> + if (ret == 1) {
> + /* need to insert a new item */
> + btrfs_release_path(path);
> + ret = btrfs_insert_empty_item(trans, dev_root, path,
> +   &key, sizeof(*ptr));
> + if (ret < 0) {
> + printk(KERN_WARNING "btrfs: insert device_stats item for device %s failed %d!\n",
> +device->name, ret);
> + goto out;
> + }
> + }
> +
> + eb = path->nodes[0];
> + ptr = btrfs_item_ptr(eb, path->slots[0],
> +  struct btrfs_device_stats_item);
> + btrfs_set_device_stats_cnt_write_io_errs(eb, ptr,
> + btrfs_device_stat_read(&device->cnt_write_io_errs));
> + btrfs_set_device_stats_cnt_read_io_errs(eb, ptr,
> + btrfs_device_stat_read(&device->cnt_read_io_errs));
> + btrfs_set_device_stats_cnt_flush_io_errs(eb, ptr,
> + btrfs_device_stat_read(&device->cnt_flush_io_errs));
> + btrfs_set_device_stats_cnt_corruption_errs(eb, ptr,
> + btrfs_device_stat_read(&device->cnt_corruption_errs));
> + btrfs_set_device_stats_cnt_generation_errs(eb, ptr,
> + btrfs_device_stat_read(&device->cnt_generation_errs));
> + btrfs_mark_buffer_dirty(eb);
> +
> +out:
> + btrfs_free_path(path);
> + return ret;
> +}
> +
> +/*
> + * called from commit_transaction. Writes all changed device stats to disk.
> + */
> +int btrfs_run_device_stats(struct btrfs_trans_handle *trans,
> +struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_root *dev_root = fs_info->dev_root;
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> + struct btrfs_device *device;
> + int ret = 0;
> +
> + mutex_lock(&fs_devices->device_list_mutex);
> + list_for_each_entry(device, &fs_devices->devices, dev_list) {
> + if (!device->device_stats_valid || !device->device_stats_dirty)
> + continue;
> +
> + ret = update_device_stat_item(trans, dev_root, device);
> + if (!ret)
> + device->device_stats_dirty = 0;
> + }
> + mutex_unlock(&fs_devices->device_list_mutex);
> +
> + return ret;
> +}
> +
>  void btrfs_device_stat_print_on_error(struct btrfs_device *device)
>  {
> + if (!device->device_stats_valid)
> + return;
>   printk_ratelimited(KERN_ERR
>  "btrfs: bdev %s errs: wr %u, rd %u, flush %u, 
> corrupt %u, gen %u\n",
>  device->name,
> @@ -4639,6 +4828,18 @@ void btrfs_device_stat_print_on_error(struct btrfs_device *device)
>   &device->cnt_generation_errs));
>  }
>  
> +static void btrfs_device_stat_print_on_load(struct btrfs_devic

Re: warnings met in introduce extent buffer cache for each i-node patch

2012-05-22 Thread Miao Xie
On Tue, 22 May 2012 09:54:54 -0700, Tim Chen wrote:
> Miao,
> 
> I was trying out your patch on scalability testing for BTRFS on v3.3
> kernel.
> http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14930.html
> 
> However, I ran into a lot of warnings (see the dmesg below).  Wonder if
> you have a more up to date version of this patch?
> 
> In addition, I have to do this modification to fix a warning in your
> original patch.

Thanks for testing. This patch still has some problems; I'm improving it
now and will send a new version soon.

Thanks again
Miao

> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 892b347..e0210c9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4608,7 +4613,7 @@ fail_dir_item:
> int err;
> 
> err = btrfs_del_inode_ref(trans, root, name, name_len,
> - ino, parent_ino, &local_index);
> + inode, parent_ino, &local_index);
> }
> return ret;
>  }
> 
> Thanks.
> Tim
> 
> 
> May 22 09:23:57 bigbox kernel: [56455.532138] [ cut here 
> ]
> May 22 09:23:57 bigbox kernel: [56455.532146] NG: at 
> fs/btrfs/extent_io.c:3795 free_extent_buffer+0x31/6455.532189] Hardware name: 
> PRIMEQUEST 1800E2
> May 22 09:23:57 bigbox kernel: [56455.53nked in: scsi_ram lockd 
> nf_conntrack_ipv4 nf_defrag_ipv4 xtfat ioatdma i2c_i801 i7core_edac e1000e 
> microcode edac_core i2c_core igb iTCO_wdt iTCO_vendor_support dca uinput 
> sunrpc usb_stortsas mptscsih mptbase scsi_transport_sas [last unloaded: 
> scsi_wait_sca55.532431] Pid: 4399, comm: btrfs-endio-wri Tainted: GW  
>   3.3.0c-scsiram-btrfs2+ #30
> May 22 09:23:57 bigbox kernel: [56455.532486] Call Trace:
> May 22 09:23:57 bigbox kernel: [56455. [] 
> warn_slowpath_common+0x7f/0xc0
> May 22 09:23:57 bigbox kernel: [5645]  [] 
> warn_slowpath_null+0x1a/0x20
> May 22 09:23:57 bigbox kernel: [56455.53220]  [] 
> free_extent_buffer+0x31/0x40
> May 22 09:23:57 bigbox kernel: ]  [] 
> read_block_for_search+0x117/0x3d0
> May 22 09:23:57 bigbox kernel: 32559]  [] ? 
> generic_bin_search.constprop.4+0[] ? unlock_up+0x15d/0x190
> May 22 09:23:57 bigbox kernel: [56455.5812c31c1>] 
> btrfs_search_slot+0x241/0x720
> May 22 09:23:57 bigbox kernel: [56455.5326fff812c3adc>] 
> btrfs_search_slot_for_inode+0x43c/0x910
> May 22 09:23:57 bigbox kernel: [56455.532fff812d5f04>] 
> btrfs_lookup_file_extent+0x54/0x70
> May 22 09:23:57 bigbox kernel: [56455.532646812f097c>] 
> btrfs_drop_extents+0xec/0x940
> May 22 09:23:57 bigbox kernel: [56455.532662] fff81084eec>] ? 
> try_to_wake_up+0x1bc/0x2b0
> May 22 09:23:57 bigbox kernel: [56455.53268 [] ? 
> set_state_bits+0x3f/0x80
> May 22 09:23:57 bigbox kernel: [56455fff8116228c>] ? 
> kmem_cache_alloc+0x10c/0x140
> May 22 09:23:57 bigbox kernel: [56455.532713]  [] ? 
> btrfs_alloc_path+0x1a/0x20
> May 22 09:23:57 bigbox kernel: [5645532]  [] 
> insert_reserved_file_extent.constpr13+0x73/0x270
> May 22 09:23:57 bigbox kernel: [56455.532746]  [] ? 
> join_transactio0x2b/0x2b0
> May 22 09:23:57 bigbox kernel: [56455.532759]  [] ? 
> start_transaction+0x94/0x320
> May 22 09:23:57 bigbox kernel: [56455.532774]  [] 
> btrfinish_ordered_io+0x2ca/0x320
> May 22 09:23:57 bigbox kernel: [56455.532793]  
> [age_end_io_hook+0x4d/0xc0
> May 22 09:23:57 bigbox kernel: [56455.532813]  [] 
> ] ? bio_free+0x5f/0x70
> May 22 09:23:57 bigbox kernel: [56455.532837] fff811a817d>] 
> bio_endio+0x1d/0x40
> May 22 09:23:57 bigbox kernel: [56455.532869]  [] 
> end_workqueue_fn+0x56/0x140
> May 22 09:23:57 bigbox kernel: [56455.532886]  [] 
> worker_loop+0x148/0x580
> May 22 09:23:57 bigbox kernel: [56455.532898]  [] ? 
> btrfs_queue_worker+0x2e0/0x2e0
> May 22 09:23:57 bigbox kernel: [56455.532915]  [] 
> kthread+0x93/0xa0
> May 22 09:23:57 bigbox kernel: [56455.532929]  [] 
> kernel_thread_helper+0x4/0x10
> May 22 09:23:57 bigbox kernel: [56455.532944]  [ May 22 09:23:57 bigbox kernel: [56455.532972] ---[ end trace a7919e7f17c42adb 
> ]---
> May 22 09:23:57 bigbox kernel: [56455.532985] [ cut here 
> ]
> 
> 
> 



Re: Newbie questions on some of btrfs code...

2012-05-22 Thread Jan Schmidt
On 22.05.2012 10:07, Alex Lyakas wrote:
>>> # If my understanding in the previous bullet is correct: Is that the
>>> reason that in btrfs_prev_leaf() it is assumed that if there is a
>>> lesser key, btrfs_search_slot() will never bring us to the slot==0 of
>>> the current leaf?
>>
>> It's quite straight: We look for a key smaller than the first (slot 0)
>> of the current leaf. If we find the current leaf again (because
>> btrfs_search_slot returns the place where such a key would have been
>> inserted), then there's no previous leaf. No preconditions or
>> assumptions on nodes in levels needed.
> 
> Let's say that slot[0] of the current leaf (A) has key=10. And let's
> say that its parent node (N) has key=5 (and not 10). Let's say we have
> a previous leaf (B), whose last slot has key=2.
> If such tree is valid, then: btrfs_prev_leaf() will search for key==9.
> Then btrfs_search_slot() would bring us node N and leaf A again,
> wouldn't it? Because key(N)<=9. So we will receive leaf A back, and
> will think that there is no previous leaf, while there is.
> What am I missing here?

It wouldn't. btrfs_search_slot always sets up the path such that it
points to the position where such a key would be inserted. And we never
insert at the beginning of a leaf. So in your example, this would be at
the end of leaf B: your path object will have nodes[1] = N, nodes[0] = B
and slots[0] = number_of_slots_used_in_B + 1.

Your example sounds like a good explanation of why the key in the parent
node should really be an exact match. It sounds reasonable that it's not
allowed to be merely <= the first key of its child. If it were, extra
lookups would be required to set up the path correctly for your example
above (and I haven't seen any such lookups so far).
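
The argument can be checked with a toy model. The sketch below is not the kernel's btrfs_search_slot; it is a simplified two-level lookup (hypothetical function `search_slot`, with plain Python lists standing in for node N and leaves B and A) that descends into the last child whose node key is <= the target and then reports the 0-based slot where the target would be inserted.

```python
from bisect import bisect_left, bisect_right


def search_slot(node_keys, leaves, target):
    """Return (leaf_index, slot) where `target` is found or would be inserted.

    node_keys[i] is the key the parent node stores for leaves[i].
    """
    # Descend into the last child whose node key is <= target
    # (or the first child if the target is smaller than every node key).
    child = max(bisect_right(node_keys, target) - 1, 0)
    # Inside the leaf, report the 0-based insertion position.
    slot = bisect_left(leaves[child], target)
    return child, slot


leaf_b = [1, 2]      # previous leaf, last key 2
leaf_a = [10, 11]    # current leaf, first key 10

# Parent keys equal the first key of each child: the search for 9 ends in B.
exact = search_slot([1, 10], [leaf_b, leaf_a], 9)

# Parent key 5 for leaf A (the hypothetical tree): the search ends in A, slot 0.
loose = search_slot([1, 5], [leaf_b, leaf_a], 9)
```

With exact parent keys the result is `(0, 2)`: leaf B, the slot just past its two used slots. With the looser parent key the result is `(1, 0)`: leaf A at slot 0, which is the case btrfs_prev_leaf would misread as "no previous leaf".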

-Jan


Could btrfs-restore be extended to also restore file dates?

2012-05-22 Thread Henry Bakker
Any possibility of getting btrfs-restore to also restore the files'
timestamps?

I'm doing a restore right now as I had one btrfs partition blow up and
I'm noting that the timestamps are marking all the restored files as
new. It would be nice to be able to do a quick compare of file dates to
determine any changed files that may be newer on the restore vs the
backup. (I can save full file compares for when the server is not being
actively used.)

I do realize it is possible that there could be other issues, but for
quickly determining potential issues this could be useful.

I do realize that there may be technical reasons for the current
behavior, so at the very least this is a suggestion for future
functionality even if it doesn't help me.


SSD erase state and reducing SSD wear

2012-05-22 Thread Martin
I've got two recent examples of SSDs. Their pristine state from the
manufacturer shows:


Device Model: OCZ-VERTEX3

# hexdump -C /dev/sdd
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
||
*
1bf2976000


Device Model: OCZ VERTEX PLUS
(OCZ VERTEX 2E)

# hexdump -C /dev/sdd
 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
||
*
df99e6000



What's a good way to test what state they get erased to from a TRIM
operation?

Can btrfs detect the erase state and pad unused space in filesystem
writes with the same value so as to reduce SSD wear?

Regards,
Martin



Re: SSD format/mount parameters questions

2012-05-22 Thread Martin
On 19/05/12 18:36, Martin Steigerwald wrote:
> Am Freitag, 18. Mai 2012 schrieb Sander:
>> Martin wrote (ao):
>>> Are there any format/mount parameters that should be set for using
>>> btrfs on SSDs (other than the "ssd" mount option)?
>>
>> If possible, format the whole device, do not partition the ssd. This
>> will guarantee proper alignment.
> 
> Current partitioning tools align at 1 MiB unless otherwise specified.
> 
> And then that's only the alignment of the start of the filesystem.
> 
> Not the granularity that the filesystem itself uses to align its writes.
> 
> And then it's not clear to me what effect proper alignment will actually 
> have given the intelligent nature of SSD firmwares.

That's what I'm trying to untangle rather than just trusting to "magic".
I'm also not so convinced about the "SSD firmwares" being quite so
"intelligent"...


So far, the only clear indications are that a number of SSDs have a
performance 'sweet spot' when you use 16kByte blocks for data transfer.

Practicalities for the SSD internal structure strongly suggest that they
work in chunks of data greater than 4kBytes.

4kByte operation is a strong driver for SSD manufacturers, but what
compromises do they make to accommodate that?


And for btrfs:

Extents are aligned to "sector size" boundaries (4kBytes default).

And there is a comment that setting larger sector sizes increases the
CPU overhead in btrfs due to the larger memory moves needed for making
inserts into the trees.

If the SSD is going to do a read-modify-write on anything smaller than
16kBytes in any case, might btrfs just as well use that chunk size to
good advantage in the first place?

So, what is most significant?


Also:

btrfs has a big advantage of using checksumming and COW. However, ext4
is more mature, similarly uses extents, and also allows specifying a
large "delayed allocation" time to merge multiple writes if you're happy
your system is safely on a UPS...


I'm not too worried about this for MLC SSDs, but it is something that is
of concern for the yet shorter modify-erase count lifespan of TLC SSDs.


Regards,
Martin



Re: Which is the maximum files size in BTRFS ? [was Re: btrfs: Probably the larger filesystem I will see for a long time]

2012-05-22 Thread Goffredo Baroncelli
On 05/22/2012 07:17 PM, Goffredo Baroncelli wrote:
> Hi all,
> 
> From the specification [1] the btrfs maximum file size limit should be
> 1<<64 bytes. However I was never able to create a file >= 1<<63 bytes.
> 
> 
> ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ ls -l giantfile2
> -rw-r--r-- 1 ghigo ghigo 9223372036854775807 May 22 18:55 giantfile2
> ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ ls -lh giantfile2
> -rw-r--r-- 1 ghigo ghigo 8.0E May 22 18:55 giantfile2
> ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ echo -n x >>giantfile2
> bash: echo: write error: File too large
> ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ python -c "print 1<<63"
> 9223372036854775808
> 
> Could be a kernel limit ?

Yes, it seems to be a kernel limit: the generic_file_llseek() function
checks the lseek "offset" argument against superblock->s_maxbytes, which
is set to MAX_LFS_FILESIZE in btrfs (see fs/read_write.c and
fs/btrfs/super.c).
MAX_LFS_FILESIZE is defined in include/linux/fs.h as

  /* Page cache limit. The filesystems should put that into their
     s_maxbytes limits, otherwise bad things can happen in VM. */
  #if BITS_PER_LONG==32
  #define MAX_LFS_FILESIZE \
        (((u64)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
  #elif BITS_PER_LONG==64
  #define MAX_LFS_FILESIZE 0x7fffffffffffffffUL
  #endif

Which means that in btrfs under Linux a file must stay below 8EB
(0x7fffffffffffffff + 1 bytes).
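
The arithmetic behind the two branches of the macro can be reproduced in a few lines. This is only a sketch: `max_lfs_filesize` is a hypothetical helper, and the 4096-byte page size used for the 32-bit case is an assumption (it is a common value, but not the only possible one).

```python
def max_lfs_filesize(bits_per_long, page_cache_size=4096):
    """Mirror the two branches of the MAX_LFS_FILESIZE definition."""
    if bits_per_long == 32:
        # ((u64)PAGE_CACHE_SIZE << (BITS_PER_LONG-1)) - 1
        return (page_cache_size << (bits_per_long - 1)) - 1
    if bits_per_long == 64:
        # Largest positive signed 64-bit value.
        return (1 << 63) - 1
    raise ValueError("unsupported BITS_PER_LONG: %r" % bits_per_long)


limit_64 = max_lfs_filesize(64)   # 9223372036854775807
limit_32 = max_lfs_filesize(32)   # 8796093022207 with 4 KiB pages
```

The 64-bit result matches the size `ls -l` reported for giantfile2 in the quoted message, and it is exactly one byte short of 1<<63.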

Goffredo


> 
> Goffredo
> 
> [1] https://btrfs.wiki.kernel.org/index.php/Main_Page
> 
> P.S.
> I am asking about this un-useful question because I want to create a
> loop based btrfs filesystem on a file greater than 8E. But I was unable
> to create a such big file. I got success up to 8E-1
> 
> 
> 
> On 05/19/2012 05:03 AM, Christian Robert wrote:
>> Probably the larger filesystem I will ever see. Tryed 8 Exabytes but it
>> failed.
>>
>> [root@CentOS6-A:/root] # df
>> Filesystem                    1K-blocks      Used         Available  Use%  Mounted
>> /dev/mapper/vg01-root          17915884  11533392           5513572   68%  /
>> /dev/sda1                        508745    140314            342831   30%  /boot
>> /dev/mapper/data_0             66993872   1644372          61994060    3%  /mnt/data_0
>> /dev/mapper/data_1     7881299347898368    508360  7881248224091896    1%  /mnt/data_1
>>
>> [root@CentOS6-A:/root] # df -h
>> Filesystem             Size  Used  Avail  Use%  Mounted
>> /dev/mapper/vg01-root   18G   11G   5.3G   68%  /
>> /dev/sda1              497M  138M   335M   30%  /boot
>> /dev/mapper/data_0      64G  1.6G    60G    3%  /mnt/data_0
>> /dev/mapper/data_1     7.0E  497M   7.0E    1%  /mnt/data_1
>>
>> [root@CentOS6-A:/root] # df -Th
>> Filesystem              Type  Size  Used  Avail  Use%
>> /dev/mapper/vg01-root   ext4   18G   11G   5.3G   68%
>> /dev/sda1               ext4  497M  138M   335M   30%
>> /dev/mapper/data_0      ext4   64G  1.6G    60G    3%
>> /dev/mapper/data_1     btrfs  7.0E  499M   7.0E    1%
>> [root@CentOS6-A:/root] #
>>
>>
>> [root@CentOS6-A:/root] # uname -rv
>> 3.4.0-rc7+ #23 SMP Wed May 16 20:20:47 EDT 2012
>>
>>
>> made with a dm-thin device sitting on a device pair composed of
>> (metadata 256Megs and data 23 Gigs)
>>
>> running on my laptop at home.
>>
>> yes, this is 7 Exabytes or 7,168 Petabytes or ( 7,340,032 Terabytes ) or
>> 7,516,192,768 Gigabytes.
>>
>>
>> please do not answer, it is just a statement of a fact at 3.4-rc7 (was
>> not working at 3.4-rc3 if I remember).
>>
>>
>> Xtian.
>>
> 
> 



Re: Ceph on btrfs 3.4rc

2012-05-22 Thread Josef Bacik
On Tue, May 22, 2012 at 12:29:59PM +0200, Christian Brunner wrote:
> 2012/5/21 Miao Xie :
> > Hi Josef,
> >
> > On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
> >> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> >> index 9b9b15f..492c74f 100644
> >> --- a/fs/btrfs/btrfs_inode.h
> >> +++ b/fs/btrfs/btrfs_inode.h
> >> @@ -57,9 +57,6 @@ struct btrfs_inode {
> >>       /* used to order data wrt metadata */
> >>       struct btrfs_ordered_inode_tree ordered_tree;
> >>
> >> -     /* for keeping track of orphaned inodes */
> >> -     struct list_head i_orphan;
> >> -
> >>       /* list of all the delalloc inodes in the FS.  There are times we 
> >> need
> >>        * to write all the delalloc pages to disk, and this list is used
> >>        * to walk them all.
> >> @@ -156,6 +153,8 @@ struct btrfs_inode {
> >>       unsigned dummy_inode:1;
> >>       unsigned in_defrag:1;
> >>       unsigned delalloc_meta_reserved:1;
> >> +     unsigned has_orphan_item:1;
> >> +     unsigned doing_truncate:1;
> >
> > I think the problem is we should not use the different lock to protect the 
> > bit fields which
> > are stored in the same machine word. Or some bit fields may be covered by 
> > the others when
> > someone change those fields. Could you try to declare 
> > ->delalloc_meta_reserved and ->has_orphan_item
> > as a integer?
> 
> I have tried changing it to:
> 
> struct btrfs_inode {
> unsigned orphan_meta_reserved:1;
> unsigned dummy_inode:1;
> unsigned in_defrag:1;
> -   unsigned delalloc_meta_reserved:1;
> +   int delalloc_meta_reserved;
> +   int has_orphan_item;
> +   int doing_truncate;
> 
> The strange thing is, that I'm no longer hitting the BUG_ON, but the
> old WARNING (no additional messages):
> 

Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
taking the BTRFS_I(inode)->lock when messing with these since we don't want to
take up all that space in the inode just for a marker.  I ran this patch for 3
hours with no issues, let me know if it works for you.  Thanks,

Josef
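
The hazard discussed above, two flags sharing one machine word while being guarded by different locks, comes down to a lost read-modify-write update. The following is a user-space simulation, not kernel code: the constant names are borrowed from the bit fields in the patch, while `unlocked_interleaving` and `locked_updates` are hypothetical helpers that replay one bad interleaving and then the same two updates under a single shared lock.

```python
import threading

DELALLOC_META_RESERVED = 1 << 0
HAS_ORPHAN_ITEM = 1 << 1


def unlocked_interleaving():
    """Replay one bad interleaving of two read-modify-write updates."""
    word = 0
    seen_by_a = word                             # writer A loads the shared word
    seen_by_b = word                             # writer B loads it before A stores
    word = seen_by_a | DELALLOC_META_RESERVED    # A stores its bit
    word = seen_by_b | HAS_ORPHAN_ITEM           # B stores and drops A's bit
    return word


def locked_updates():
    """Apply both updates from separate threads under one shared lock."""
    word = [0]
    lock = threading.Lock()

    def set_flag(mask):
        with lock:                               # load, modify, store as one unit
            word[0] |= mask

    threads = [threading.Thread(target=set_flag, args=(mask,))
               for mask in (DELALLOC_META_RESERVED, HAS_ORPHAN_ITEM)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return word[0]
```

In the unlocked replay only HAS_ORPHAN_ITEM survives; with the single lock both bits end up set, which is the effect of taking one lock around every update of the shared flags.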


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
+   atomic_set(&root->orphan_inodes, 0);
root->log_batch = 0;
root->log_transid = 0;
root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..54f1b30 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
struct btrfs_block_rsv *block_rsv;
int ret;
 
-   if (!list_empty(&root->orphan_list) ||
+   if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
return;
 
spin_lock(&root->orphan_lock);
-   if (!list_empty(&root->orphan_list)) {
+   if (atomic_read(&root->orphan_inodes)) {
spin_unlock(&root->orphan_lock);
return;
}
@@ -2166,8 +2166,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
block_rsv = NULL;
}
 
-   if

Which is the maximum files size in BTRFS ? [was Re: btrfs: Probably the larger filesystem I will see for a long time]

2012-05-22 Thread Goffredo Baroncelli
Hi all,

From the specification [1] the btrfs maximum file size limit should be
1<<64 bytes. However I was never able to create a file >= 1<<63 bytes.


ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ ls -l giantfile2
-rw-r--r-- 1 ghigo ghigo 9223372036854775807 May 22 18:55 giantfile2
ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ ls -lh giantfile2
-rw-r--r-- 1 ghigo ghigo 8.0E May 22 18:55 giantfile2
ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ echo -n x >>giantfile2
bash: echo: write error: File too large
ghigo@venice:/mnt/old-btrfs/home/ghigo/gianfile$ python -c "print 1<<63"
9223372036854775808

Could this be a kernel limit?

Goffredo

[1] https://btrfs.wiki.kernel.org/index.php/Main_Page

P.S.
I am asking this not-so-useful question because I want to create a
loop-based btrfs filesystem on a file greater than 8E, but I was unable
to create such a big file. I succeeded only up to 8E-1.



On 05/19/2012 05:03 AM, Christian Robert wrote:
> Probably the larger filesystem I will ever see. Tryed 8 Exabytes but it
> failed.
> 
> [root@CentOS6-A:/root] # df
> Filesystem                    1K-blocks      Used         Available  Use%  Mounted
> /dev/mapper/vg01-root          17915884  11533392           5513572   68%  /
> /dev/sda1                        508745    140314            342831   30%  /boot
> /dev/mapper/data_0             66993872   1644372          61994060    3%  /mnt/data_0
> /dev/mapper/data_1     7881299347898368    508360  7881248224091896    1%  /mnt/data_1
> 
> [root@CentOS6-A:/root] # df -h
> Filesystem             Size  Used  Avail  Use%  Mounted
> /dev/mapper/vg01-root   18G   11G   5.3G   68%  /
> /dev/sda1              497M  138M   335M   30%  /boot
> /dev/mapper/data_0      64G  1.6G    60G    3%  /mnt/data_0
> /dev/mapper/data_1     7.0E  497M   7.0E    1%  /mnt/data_1
> 
> [root@CentOS6-A:/root] # df -Th
> Filesystem              Type  Size  Used  Avail  Use%
> /dev/mapper/vg01-root   ext4   18G   11G   5.3G   68%
> /dev/sda1               ext4  497M  138M   335M   30%
> /dev/mapper/data_0      ext4   64G  1.6G    60G    3%
> /dev/mapper/data_1     btrfs  7.0E  499M   7.0E    1%
> [root@CentOS6-A:/root] #
> 
> 
> [root@CentOS6-A:/root] # uname -rv
> 3.4.0-rc7+ #23 SMP Wed May 16 20:20:47 EDT 2012
> 
> 
> made with a dm-thin device sitting on a device pair composed of
> (metadata 256Megs and data 23 Gigs)
> 
> running on my laptop at home.
> 
> yes, this is 7 Exabytes or 7,168 Petabytes or ( 7,340,032 Terabytes ) or
> 7,516,192,768 Gigabytes.
> 
> 
> please do not answer, it is just a statement of a fact at 3.4-rc7 (was
> not working at 3.4-rc3 if I remember).
> 
> 
> Xtian.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


warnings met in introduce extent buffer cache for each i-node patch

2012-05-22 Thread Tim Chen
Miao,

I was trying out your patch on scalability testing for BTRFS on v3.3
kernel.
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14930.html

However, I ran into a lot of warnings (see the dmesg below).  Wonder if
you have a more up to date version of this patch?

In addition, I have to do this modification to fix a warning in your
original patch.

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 892b347..e0210c9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4608,7 +4613,7 @@ fail_dir_item:
int err;

err = btrfs_del_inode_ref(trans, root, name, name_len,
- ino, parent_ino, &local_index);
+ inode, parent_ino, &local_index);
}
return ret;
 }

Thanks.
Tim


May 22 09:23:57 bigbox kernel: [56455.532138] ------------[ cut here ]------------
May 22 09:23:57 bigbox kernel: [56455.532146] NG: at fs/btrfs/extent_io.c:3795 
free_extent_buffer+0x31/6455.532189] Hardware name: PRIMEQUEST 1800E2
May 22 09:23:57 bigbox kernel: [56455.53nked in: scsi_ram lockd 
nf_conntrack_ipv4 nf_defrag_ipv4 xtfat ioatdma i2c_i801 i7core_edac e1000e 
microcode edac_core i2c_core igb iTCO_wdt iTCO_vendor_support dca uinput sunrpc 
usb_stortsas mptscsih mptbase scsi_transport_sas [last unloaded: 
scsi_wait_sca55.532431] Pid: 4399, comm: btrfs-endio-wri Tainted: GW
3.3.0c-scsiram-btrfs2+ #30
May 22 09:23:57 bigbox kernel: [56455.532486] Call Trace:
May 22 09:23:57 bigbox kernel: [56455. [] 
warn_slowpath_common+0x7f/0xc0
May 22 09:23:57 bigbox kernel: [5645]  [] 
warn_slowpath_null+0x1a/0x20
May 22 09:23:57 bigbox kernel: [56455.53220]  [] 
free_extent_buffer+0x31/0x40
May 22 09:23:57 bigbox kernel: ]  [] 
read_block_for_search+0x117/0x3d0
May 22 09:23:57 bigbox kernel: 32559]  [] ? 
generic_bin_search.constprop.4+0[] ? unlock_up+0x15d/0x190
May 22 09:23:57 bigbox kernel: [56455.5812c31c1>] 
btrfs_search_slot+0x241/0x720
May 22 09:23:57 bigbox kernel: [56455.5326fff812c3adc>] 
btrfs_search_slot_for_inode+0x43c/0x910
May 22 09:23:57 bigbox kernel: [56455.532fff812d5f04>] 
btrfs_lookup_file_extent+0x54/0x70
May 22 09:23:57 bigbox kernel: [56455.532646812f097c>] 
btrfs_drop_extents+0xec/0x940
May 22 09:23:57 bigbox kernel: [56455.532662] fff81084eec>] ? 
try_to_wake_up+0x1bc/0x2b0
May 22 09:23:57 bigbox kernel: [56455.53268 [] ? 
set_state_bits+0x3f/0x80
May 22 09:23:57 bigbox kernel: [56455fff8116228c>] ? 
kmem_cache_alloc+0x10c/0x140
May 22 09:23:57 bigbox kernel: [56455.532713]  [] ? 
btrfs_alloc_path+0x1a/0x20
May 22 09:23:57 bigbox kernel: [5645532]  [] 
insert_reserved_file_extent.constpr13+0x73/0x270
May 22 09:23:57 bigbox kernel: [56455.532746]  [] ? 
join_transactio0x2b/0x2b0
May 22 09:23:57 bigbox kernel: [56455.532759]  [] ? 
start_transaction+0x94/0x320
May 22 09:23:57 bigbox kernel: [56455.532774]  [] 
btrfinish_ordered_io+0x2ca/0x320
May 22 09:23:57 bigbox kernel: [56455.532793]  
[age_end_io_hook+0x4d/0xc0
May 22 09:23:57 bigbox kernel: [56455.532813]  [] 
] ? bio_free+0x5f/0x70
May 22 09:23:57 bigbox kernel: [56455.532837] fff811a817d>] bio_endio+0x1d/0x40
May 22 09:23:57 bigbox kernel: [56455.532869]  [] 
end_workqueue_fn+0x56/0x140
May 22 09:23:57 bigbox kernel: [56455.532886]  [] 
worker_loop+0x148/0x580
May 22 09:23:57 bigbox kernel: [56455.532898]  [] ? 
btrfs_queue_worker+0x2e0/0x2e0
May 22 09:23:57 bigbox kernel: [56455.532915]  [] 
kthread+0x93/0xa0
May 22 09:23:57 bigbox kernel: [56455.532929]  [] 
kernel_thread_helper+0x4/0x10
May 22 09:23:57 bigbox kernel: [56455.532944]  [


Re: 3.4.0-rc6: WARNING: at fs/btrfs/super.c:219 __btrfs_abort_transaction+0xae/0xc0 [btrfs]()

2012-05-22 Thread Arnd Hannemann
Hi,

I just got the same warning on a fresh 3.4.0 final while booting.
This time on /usr/share (different filesystem from last time):

arnd@kallisto:~$ ls -l /dev/mapper/vg0-usr_share
lrwxrwxrwx 1 root root 7 Mai 22 17:59 /dev/mapper/vg0-usr_share -> ../dm-4
arnd@kallisto:~$ grep usr_share /proc/mounts
/dev/mapper/vg0-usr_share /usr/share btrfs 
rw,relatime,compress=zlib,ssd,nospace_cache 0 0

[   12.326239] ------------[ cut here ]------------
[   12.326264] WARNING: at 
/home/arnd/Projekte/kernel/linux-2.6/fs/btrfs/super.c:219 
__btrfs_abort_transaction+0xae/0xc0 [btrfs]()
[   12.326266] Hardware name: 4384GEG
[   12.326267] btrfs: Transaction aborted
[   12.326268] Modules linked in: joydev bridge stp llc kvm_intel kvm dm_crypt 
bnep rfcomm bluetooth binfmt_misc arc4 coretemp snd_hda_codec_hdmi 
snd_hda_codec_conexant thinkpad_acpi microcode snd_seq_midi psmouse snd_rawmidi 
serio_raw iwlwifi intel_ips qcserial usb_wwan usbserial mac80211 snd_hda_intel 
snd_seq_midi_event snd_hda_codec snd_seq snd_hwdep cfg80211 snd_seq_device 
snd_pcm snd_timer snd_page_alloc snd soundcore tpm_tis nvram mei(C) btrfs 
zlib_deflate libcrc32c mxm_wmi ghash_clmulni_intel aesni_intel cryptd 
aes_x86_64 i915 ahci libahci drm_kms_helper drm e1000e sdhci_pci sdhci 
firewire_ohci firewire_core i2c_algo_bit crc_itu_t video wmi
[   12.326297] Pid: 1471, comm: hybrid-detect Tainted: G C   3.4.0aha+ 
#11
[   12.326298] Call Trace:
[   12.326305]  [] warn_slowpath_common+0x7f/0xc0
[   12.326307]  [] warn_slowpath_fmt+0x46/0x50
[   12.326315]  [] ? do_chunk_alloc.isra.71+0x31c/0x3f0 
[btrfs]
[   12.326322]  [] __btrfs_abort_transaction+0xae/0xc0 [btrfs]
[   12.326329]  [] find_free_extent+0xbe5/0xc70 [btrfs]
[   12.326334]  [] ? __switch_to+0x17a/0x410
[   12.326341]  [] btrfs_reserve_extent+0xed/0x250 [btrfs]
[   12.326350]  [] btrfs_alloc_free_block+0x177/0x370 [btrfs]
[   12.326357]  [] __btrfs_cow_block+0x135/0x4d0 [btrfs]
[   12.326363]  [] btrfs_cow_block+0xfc/0x220 [btrfs]
[   12.326370]  [] btrfs_search_slot+0x454/0x910 [btrfs]
[   12.326377]  [] ? 
reserve_metadata_bytes.isra.72+0x207/0x740 [btrfs]
[   12.326384]  [] btrfs_insert_empty_items+0x7c/0xe0 [btrfs]
[   12.326390]  [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[   12.326401]  [] btrfs_insert_orphan_item+0x5f/0x90 [btrfs]
[   12.326429]  [] btrfs_orphan_add+0xc5/0x1c0 [btrfs]
[   12.326443]  [] btrfs_truncate+0x146/0x650 [btrfs]
[   12.326449]  [] ? security_inode_alloc+0x1e/0x20
[   12.326461]  [] btrfs_setattr+0xc1/0x1b0 [btrfs]
[   12.326464]  [] notify_change+0x1aa/0x340
[   12.326467]  [] do_truncate+0x5e/0xa0
[   12.326470]  [] do_last+0x581/0x8f0
[   12.326472]  [] path_openat+0xd2/0x400
[   12.326474]  [] do_filp_open+0x42/0xa0
[   12.326476]  [] ? alloc_fd+0xd1/0x120
[   12.326478]  [] do_sys_open+0xf8/0x1d0
[   12.326480]  [] ? filp_close+0x66/0x90
[   12.326482]  [] sys_open+0x21/0x30
[   12.326485]  [] system_call_fastpath+0x16/0x1b
[   12.326487] ---[ end trace 4479826ac6de5588 ]---
[   12.326489] BTRFS warning (device dm-4): Aborting unused transaction.

Best regards
Arnd


Re: [PATCH 2/4] Btrfs: fix deadlock on sb->s_umount when doing umount

2012-05-22 Thread David Sterba
On Wed, May 09, 2012 at 11:24:28AM +0800, Miao Xie wrote:
> Did you apply the trylock patches I sent before?

20120429   [PATCH 1/2] vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and 
rename them
20120429   [PATCH 2/2] Btrfs: flush all the dirty pages if 
try_to_writeback_inodes_sb_nr() fails

on top of 3.4, no deadlocks occurred with looped 269, 264, 254, 276. Mounted with
space_cache,autodefrag.

david


Re: Ceph on btrfs 3.4rc

2012-05-22 Thread Josef Bacik
On Mon, May 21, 2012 at 11:59:54AM +0800, Miao Xie wrote:
> Hi Josef,
> 
> On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index 9b9b15f..492c74f 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -57,9 +57,6 @@ struct btrfs_inode {
> > /* used to order data wrt metadata */
> > struct btrfs_ordered_inode_tree ordered_tree;
> >  
> > -   /* for keeping track of orphaned inodes */
> > -   struct list_head i_orphan;
> > -
> > /* list of all the delalloc inodes in the FS.  There are times we need
> >  * to write all the delalloc pages to disk, and this list is used
> >  * to walk them all.
> > @@ -156,6 +153,8 @@ struct btrfs_inode {
> > unsigned dummy_inode:1;
> > unsigned in_defrag:1;
> > unsigned delalloc_meta_reserved:1;
> > +   unsigned has_orphan_item:1;
> > +   unsigned doing_truncate:1;
> 
> I think the problem is that we should not use different locks to protect
> bit fields which are stored in the same machine word. Otherwise some bit
> fields may be overwritten by the others when someone changes those fields.
> Could you try to declare ->delalloc_meta_reserved and ->has_orphan_item
> as integers?
> 

Oh freaking duh, thank you Miao, I'm an idiot.

Josef
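
A minimal sketch of the hazard Miao describes, assuming the compiler packs
adjacent one-bit fields into one storage unit so that setting either bit is a
read-modify-write of the shared word. The interleaving is replayed by hand in a
single thread; the function names and bit constants are hypothetical and this
is not code from the patch.

```c
#include <assert.h>

/* Two one-bit flags that typically share one machine word. */
struct packed_flags {
    unsigned delalloc_meta_reserved:1;
    unsigned has_orphan_item:1;
};

enum { BIT_DELALLOC = 1, BIT_ORPHAN = 2 };

/*
 * Replays an unlucky interleaving: writers A and B hold different locks,
 * both load the shared word, then each stores back its own modified copy.
 */
unsigned simulate_lost_update(void)
{
    unsigned word = 0;
    unsigned seen_by_a = word;       /* A loads the word under lock A */
    unsigned seen_by_b = word;       /* B loads the word under lock B */

    word = seen_by_a | BIT_DELALLOC; /* A stores its copy */
    word = seen_by_b | BIT_ORPHAN;   /* B stores its stale copy: A's bit is gone */
    return word;
}

/* With separate integers each store touches only its own object. */
struct split_flags {
    int delalloc_meta_reserved;
    int has_orphan_item;
};

unsigned simulate_split_update(void)
{
    struct split_flags f = {0, 0};

    f.delalloc_meta_reserved = 1;    /* writer A */
    f.has_orphan_item = 1;           /* writer B */
    return (unsigned)f.delalloc_meta_reserved |
           ((unsigned)f.has_orphan_item << 1);
}
```

In the packed case only the second writer's bit survives; with separate
integers both updates are kept, which is why declaring the two fields as
integers is suggested.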


[PATCH v4 1/3] Btrfs-progs: move open_file_or_dir() to utils.c

2012-05-22 Thread Stefan Behrens
This is a preparation step to add support for device stats. The definition
of the function open_file_or_dir() is moved from common.c to utils.c in
order to be able to share some common code between scrub and the device
stats in the following step. That common code uses open_file_or_dir().
Since open_file_or_dir() makes use of the function dirfd(3), the required
XOPEN version was raised from 6 to 7.

Signed-off-by: Stefan Behrens 
---
 Makefile |4 ++--
 btrfsctl.c   |   28 
 cmds-balance.c   |1 +
 cmds-inspect.c   |1 +
 cmds-subvolume.c |1 +
 commands.h   |3 ---
 common.c |   46 --
 utils.c  |   31 +--
 utils.h  |3 +++
 9 files changed, 37 insertions(+), 81 deletions(-)

diff --git a/Makefile b/Makefile
index 79818e6..fe2b432 100644
--- a/Makefile
+++ b/Makefile
@@ -39,8 +39,8 @@ all: version $(progs) manpages
 version:
bash version.sh
 
-btrfs: $(objects) btrfs.o help.o common.o $(cmds_objects)
-   $(CC) $(CFLAGS) -o btrfs btrfs.o help.o common.o $(cmds_objects) \
+btrfs: $(objects) btrfs.o help.o $(cmds_objects)
+   $(CC) $(CFLAGS) -o btrfs btrfs.o help.o $(cmds_objects) \
$(objects) $(LDFLAGS) $(LIBS) -lpthread
 
 calc-size: $(objects) calc-size.o
diff --git a/btrfsctl.c b/btrfsctl.c
index d45e2a7..f0584f3 100644
--- a/btrfsctl.c
+++ b/btrfsctl.c
@@ -63,34 +63,6 @@ static void print_usage(void)
exit(1);
 }
 
-static int open_file_or_dir(const char *fname)
-{
-   int ret;
-   struct stat st;
-   DIR *dirstream;
-   int fd;
-
-   ret = stat(fname, &st);
-   if (ret < 0) {
-   perror("stat:");
-   exit(1);
-   }
-   if (S_ISDIR(st.st_mode)) {
-   dirstream = opendir(fname);
-   if (!dirstream) {
-   perror("opendir");
-   exit(1);
-   }
-   fd = dirfd(dirstream);
-   } else {
-   fd = open(fname, O_RDWR);
-   }
-   if (fd < 0) {
-   perror("open");
-   exit(1);
-   }
-   return fd;
-}
 int main(int ac, char **av)
 {
char *fname = NULL;
diff --git a/cmds-balance.c b/cmds-balance.c
index 38a7426..5793b5c 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -26,6 +26,7 @@
 #include "ctree.h"
 #include "ioctl.h"
 #include "volumes.h"
+#include "utils.h"
 
 #include "commands.h"
 
diff --git a/cmds-inspect.c b/cmds-inspect.c
index 2f0228f..7a8785b 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -22,6 +22,7 @@
 
 #include "kerncompat.h"
 #include "ioctl.h"
+#include "utils.h"
 
 #include "commands.h"
 
diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 950fa8f..8ecd3f4 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -26,6 +26,7 @@
 
 #include "kerncompat.h"
 #include "ioctl.h"
+#include "utils.h"
 
 #include "commands.h"
 
diff --git a/commands.h b/commands.h
index a303a50..aea4cb1 100644
--- a/commands.h
+++ b/commands.h
@@ -79,9 +79,6 @@ void help_ambiguous_token(const char *arg, const struct 
cmd_group *grp);
 
 void help_command_group(const struct cmd_group *grp, int argc, char **argv);
 
-/* common.c */
-int open_file_or_dir(const char *fname);
-
 extern const struct cmd_group subvolume_cmd_group;
 extern const struct cmd_group filesystem_cmd_group;
 extern const struct cmd_group balance_cmd_group;
diff --git a/common.c b/common.c
deleted file mode 100644
index 03f6570..000
--- a/common.c
+++ /dev/null
@@ -1,46 +0,0 @@
-/*
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public
- * License along with this program; if not, write to the
- * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
- * Boston, MA 021110-1307, USA.
- */
-
-#include 
-#include 
-#include 
-#include 
-
-int open_file_or_dir(const char *fname)
-{
-   int ret;
-   struct stat st;
-   DIR *dirstream;
-   int fd;
-
-   ret = stat(fname, &st);
-   if (ret < 0) {
-   return -1;
-   }
-   if (S_ISDIR(st.st_mode)) {
-   dirstream = opendir(fname);
-   if (!dirstream) {
-   return -2;
-   }
-   fd = dirfd(dirstream);
-   } else {
-   fd = open(fname, O_RDWR);
-   }
-   if (fd < 0) {
-   return -3;
-   }
-   return fd;
-}
diff --git a/utils.c b/utils.c
index ee7fa1b..6157115 100644
--- a/utils.c
+++ b/utils.c
@@ -16,8 +16,9

[PATCH v4 3/3] Btrfs-progs: add command to get/reset device stats via ioctl

2012-05-22 Thread Stefan Behrens
"btrfs device stats" is used to retrieve and print the device stats.
"btrfs device stats -z" is used to atomically retrieve, reset and
print the stats.

Signed-off-by: Stefan Behrens 
---
 cmds-device.c  |  113 
 ctree.h|6 +++
 ioctl.h|   28 ++
 man/btrfs.8.in |   14 +++
 print-tree.c   |6 +++
 5 files changed, 167 insertions(+)

diff --git a/cmds-device.c b/cmds-device.c
index db625a6..3417f03 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -246,11 +246,124 @@ static int cmd_scan_dev(int argc, char **argv)
return 0;
 }
 
+static const char * const cmd_dev_stats_usage[] = {
+   "btrfs device stats [-z] <path>|<device>",
+   "Show current device IO stats. -z to reset stats afterwards.",
+   NULL
+};
+
+static int cmd_dev_stats(int argc, char **argv)
+{
+   char *path;
+   struct btrfs_ioctl_fs_info_args fi_args;
+   struct btrfs_ioctl_dev_info_args *di_args = NULL;
+   int ret;
+   int fdmnt;
+   int i;
+   char c;
+   int fdres = -1;
+   int err = 0;
+   int cmd = BTRFS_IOC_GET_DEVICE_STATS;
+
+   optind = 1;
+   while ((c = getopt(argc, argv, "z")) != -1) {
+   switch (c) {
+   case 'z':
+   cmd = BTRFS_IOC_GET_AND_RESET_DEVICE_STATS;
+   break;
+   case '?':
+   default:
+   fprintf(stderr, "ERROR: device stat args invalid.\n"
+   " device stat [-z] <path>|<device>\n"
+   " -z  to reset stats after reading.\n");
+   return 1;
+   }
+   }
+
+   if (optind + 1 != argc) {
+   fprintf(stderr, "ERROR: device stat needs path|device as single"
+   " argument\n");
+   return 1;
+   }
+
+   path = argv[optind];
+
+   fdmnt = open_file_or_dir(path);
+   if (fdmnt < 0) {
+   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   return 12;
+   }
+
+   ret = get_fs_info(fdmnt, path, &fi_args, &di_args);
+   if (ret) {
+   fprintf(stderr, "ERROR: getting dev info for devstats failed: "
+   "%s\n", strerror(-ret));
+   err = 1;
+   goto out;
+   }
+   if (!fi_args.num_devices) {
+   fprintf(stderr, "ERROR: no devices found\n");
+   err = 1;
+   goto out;
+   }
+
+   for (i = 0; i < fi_args.num_devices; i++) {
+   struct btrfs_ioctl_get_device_stats args = {0};
+   __u8 path[BTRFS_DEVICE_PATH_NAME_MAX + 1];
+
+   strncpy((char *)path, (char *)di_args[i].path,
+   BTRFS_DEVICE_PATH_NAME_MAX);
+   path[BTRFS_DEVICE_PATH_NAME_MAX] = '\0';
+
+   args.devid = di_args[i].devid;
+   args.nr_items = BTRFS_IOCTL_GET_DEVICE_STATS_MAX_NR_ITEMS;
+
+   if (ioctl(fdmnt, cmd, &args) < 0) {
+   fprintf(stderr, "ERROR: ioctl(%s) on %s failed: %s\n",
+   BTRFS_IOC_GET_AND_RESET_DEVICE_STATS == cmd ?
+"BTRFS_IOC_GET_AND_RESET_DEVICE_STATS" :
+"BTRFS_IOC_GET_DEVICE_STATS",
+   path, strerror(errno));
+   err = 1;
+   } else {
+   if (args.nr_items >= 1)
+   printf("[%s].cnt_write_io_errs   %llu\n",
+  path, (unsigned long long)
+args.cnt_write_io_errs);
+   if (args.nr_items >= 2)
+   printf("[%s].cnt_read_io_errs%llu\n",
+  path, (unsigned long long)
+args.cnt_read_io_errs);
+   if (args.nr_items >= 3)
+   printf("[%s].cnt_flush_io_errs   %llu\n",
+  path, (unsigned long long)
+args.cnt_flush_io_errs);
+   if (args.nr_items >= 4)
+   printf("[%s].cnt_corruption_errs %llu\n",
+  path, (unsigned long long)
+args.cnt_corruption_errs);
+   if (args.nr_items >= 5)
+   printf("[%s].cnt_generation_errs %llu\n",
+  path, (unsigned long long)
+args.cnt_generation_errs);
+   }
+   }
+
+out:
+   free(di_args);
+   close(fdmnt);
+   if (fdres > -1)
+   close(fdres);
+
+   return err;
+}
+
 const struct cmd_group device_cmd_group = {
de

[PATCH v4 0/3] Btrfs-progs: support get/reset device stats via ioctl

2012-05-22 Thread Stefan Behrens
"btrfs device stats" is used to retrieve and print the device stats.
"btrfs device stats -z" is used to atomically retrieve, reset and
print the stats.

In order to share two utility functions between scrub and the dev stats
code, these two functions are moved to utils.c and renamed.
Since these functions are using open_file_or_dir(), and since the linking
against utils.o and common.o was different, open_file_or_dir() was moved
from common.c to utils.c. And since that function makes use of the
function dirfd(3), the required XOPEN version was raised from 6 to 7.
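
A minimal sketch of why the feature-test macro matters, assuming a POSIX.1-2008
system: dirfd(3) is only guaranteed to be declared when _XOPEN_SOURCE is at
least 700. The function name open_file_or_dir_sketch is hypothetical; it
mirrors the return codes of the removed common.c version but is not the code
from this series.

```c
#define _XOPEN_SOURCE 700 /* POSIX.1-2008, which specifies dirfd() */
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns a file descriptor for a file or directory, or a negative code. */
int open_file_or_dir_sketch(const char *fname)
{
    struct stat st;
    int fd;

    assert(fname != NULL);
    if (stat(fname, &st) < 0)
        return -1;

    if (S_ISDIR(st.st_mode)) {
        DIR *dirstream = opendir(fname);

        if (!dirstream)
            return -2;
        /* The stream is left open on purpose: closedir() would close fd. */
        fd = dirfd(dirstream);
    } else {
        fd = open(fname, O_RDWR);
    }
    return fd < 0 ? -3 : fd;
}
```

The descriptor returned for a directory stays valid only as long as the
directory stream is not closed, so the caller just close()s the descriptor.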

Changes v1->v2:
- Remove a verbose printf()
- Cast u64 to unsigned long long for printf()
- Update the man page

Changes v2->v3:
- Rebase on Chris' current master branch
- Split the patch into three separate patches because after rebasing,
  open_file_or_dir() was moved and additional changes had been necessary

Changes v3->v4:
- Add padding at end of ioctl structure

Stefan Behrens (3):
  Btrfs-progs: move open_file_or_dir() to utils.c
  Btrfs-progs: make two utility functions globally available
  Btrfs-progs: add command to get/reset device stats via ioctl

 Makefile |4 +-
 btrfsctl.c   |   28 --
 cmds-balance.c   |1 +
 cmds-device.c|  113 ++
 cmds-inspect.c   |1 +
 cmds-scrub.c |   72 +-
 cmds-subvolume.c |1 +
 commands.h   |3 --
 common.c |   46 --
 ctree.h  |6 +++
 ioctl.h  |   28 ++
 man/btrfs.8.in   |   14 +++
 print-tree.c |6 +++
 utils.c  |   97 +-
 utils.h  |7 
 15 files changed, 276 insertions(+), 151 deletions(-)
 delete mode 100644 common.c

-- 
1.7.10.2



[PATCH v4 2/3] Btrfs-progs: make two utility functions globally available

2012-05-22 Thread Stefan Behrens
Two convenient utility functions that have so far been local to scrub are
moved to utils.c.
They will be used in the device stats code in a following commit.

Signed-off-by: Stefan Behrens 
---
 cmds-scrub.c |   72 ++
 utils.c  |   66 +
 utils.h  |4 
 3 files changed, 72 insertions(+), 70 deletions(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index c4503f4..37a9890 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -967,74 +967,6 @@ static struct scrub_file_record *last_dev_scrub(
return NULL;
 }
 
-static int scrub_device_info(int fd, u64 devid,
-struct btrfs_ioctl_dev_info_args *di_args)
-{
-   int ret;
-
-   di_args->devid = devid;
-   memset(&di_args->uuid, '\0', sizeof(di_args->uuid));
-
-   ret = ioctl(fd, BTRFS_IOC_DEV_INFO, di_args);
-   return ret ? -errno : 0;
-}
-
-static int scrub_fs_info(int fd, char *path,
-   struct btrfs_ioctl_fs_info_args *fi_args,
-   struct btrfs_ioctl_dev_info_args **di_ret)
-{
-   int ret = 0;
-   int ndevs = 0;
-   int i = 1;
-   struct btrfs_fs_devices *fs_devices_mnt = NULL;
-   struct btrfs_ioctl_dev_info_args *di_args;
-   char mp[BTRFS_PATH_NAME_MAX + 1];
-
-   memset(fi_args, 0, sizeof(*fi_args));
-
-   ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args);
-   if (ret && errno == EINVAL) {
-   /* path is no mounted btrfs. try if it's a device */
-   ret = check_mounted_where(fd, path, mp, sizeof(mp),
-   &fs_devices_mnt);
-   if (!ret)
-   return -EINVAL;
-   if (ret < 0)
-   return ret;
-   fi_args->num_devices = 1;
-   fi_args->max_id = fs_devices_mnt->latest_devid;
-   i = fs_devices_mnt->latest_devid;
-   memcpy(fi_args->fsid, fs_devices_mnt->fsid, BTRFS_FSID_SIZE);
-   close(fd);
-   fd = open_file_or_dir(mp);
-   if (fd < 0)
-   return -errno;
-   } else if (ret) {
-   return -errno;
-   }
-
-   if (!fi_args->num_devices)
-   return 0;
-
-   di_args = *di_ret = malloc(fi_args->num_devices * sizeof(*di_args));
-   if (!di_args)
-   return -errno;
-
-   for (; i <= fi_args->max_id; ++i) {
-   BUG_ON(ndevs >= fi_args->num_devices);
-   ret = scrub_device_info(fd, i, &di_args[ndevs]);
-   if (ret == -ENODEV)
-   continue;
-   if (ret)
-   return ret;
-   ++ndevs;
-   }
-
-   BUG_ON(ndevs == 0);
-
-   return 0;
-}
-
 int mkdir_p(char *path)
 {
int i;
@@ -1155,7 +1087,7 @@ static int scrub_start(int argc, char **argv, int resume)
return 12;
}
 
-   ret = scrub_fs_info(fdmnt, path, &fi_args, &di_args);
+   ret = get_fs_info(fdmnt, path, &fi_args, &di_args);
if (ret) {
ERR(!do_quiet, "ERROR: getting dev info for scrub failed: "
"%s\n", strerror(-ret));
@@ -1621,7 +1553,7 @@ static int cmd_scrub_status(int argc, char **argv)
return 12;
}
 
-   ret = scrub_fs_info(fdmnt, path, &fi_args, &di_args);
+   ret = get_fs_info(fdmnt, path, &fi_args, &di_args);
if (ret) {
fprintf(stderr, "ERROR: getting dev info for scrub failed: "
"%s\n", strerror(-ret));
diff --git a/utils.c b/utils.c
index 6157115..037f64b 100644
--- a/utils.c
+++ b/utils.c
@@ -1233,3 +1233,69 @@ int open_file_or_dir(const char *fname)
return fd;
 }
 
+int get_device_info(int fd, u64 devid,
+   struct btrfs_ioctl_dev_info_args *di_args)
+{
+   int ret;
+
+   di_args->devid = devid;
+   memset(&di_args->uuid, '\0', sizeof(di_args->uuid));
+
+   ret = ioctl(fd, BTRFS_IOC_DEV_INFO, di_args);
+   return ret ? -errno : 0;
+}
+
+int get_fs_info(int fd, char *path, struct btrfs_ioctl_fs_info_args *fi_args,
+   struct btrfs_ioctl_dev_info_args **di_ret)
+{
+   int ret = 0;
+   int ndevs = 0;
+   int i = 1;
+   struct btrfs_fs_devices *fs_devices_mnt = NULL;
+   struct btrfs_ioctl_dev_info_args *di_args;
+   char mp[BTRFS_PATH_NAME_MAX + 1];
+
+   memset(fi_args, 0, sizeof(*fi_args));
+
+   ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args);
+   if (ret && (errno == EINVAL || errno == ENOTTY)) {
+   /* path is not a mounted btrfs. Try if it's a device */
+   ret = check_mounted_where(fd, path, mp, sizeof(mp),
+ &fs_devices_mnt);
+   if (!ret)
+   return -EINVAL;
+   if (ret < 0)
+ 

[PATCH v4 1/3] Btrfs: add device counters for detected IO and checksum errors

2012-05-22 Thread Stefan Behrens
The goal is to detect when drives start to show an increased error rate
and should be replaced soon. To that end, statistics counters are added
that count I/O errors (read, write and flush). Additionally,
software-detected errors such as checksum errors and corrupted blocks are
counted.

Signed-off-by: Stefan Behrens 
---
 fs/btrfs/disk-io.c   |   18 ++---
 fs/btrfs/extent_io.c |   27 +--
 fs/btrfs/scrub.c |   72 +++---
 fs/btrfs/volumes.c   |   61 +++---
 fs/btrfs/volumes.h   |   21 +++
 5 files changed, 174 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..e123629 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2556,18 +2556,21 @@ recovery_tree_root:
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-   char b[BDEVNAME_SIZE];
-
if (uptodate) {
set_buffer_uptodate(bh);
} else {
+   struct btrfs_device *device = (struct btrfs_device *)
+   bh->b_private;
+
printk_ratelimited(KERN_WARNING "lost page write due to "
-   "I/O error on %s\n",
-  bdevname(bh->b_bdev, b));
+  "I/O error on %s\n", device->name);
/* note, we dont' set_buffer_write_io_error because we have
 * our own ways of dealing with the IO errors
 */
clear_buffer_uptodate(bh);
+   btrfs_device_stat_inc(&device->cnt_write_io_errs);
+   device->device_stats_dirty = 1;
+   btrfs_device_stat_print_on_error(device);
}
unlock_buffer(bh);
put_bh(bh);
@@ -2682,6 +2685,7 @@ static int write_dev_supers(struct btrfs_device *device,
set_buffer_uptodate(bh);
lock_buffer(bh);
bh->b_end_io = btrfs_end_buffer_write_sync;
+   bh->b_private = device;
}
 
/*
@@ -2740,6 +2744,12 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
}
if (!bio_flagged(bio, BIO_UPTODATE)) {
ret = -EIO;
+   if (!bio_flagged(bio, BIO_EOPNOTSUPP)) {
+   btrfs_device_stat_inc(
+   &device->cnt_flush_io_errs);
+   device->device_stats_dirty = 1;
+   btrfs_device_stat_print_on_error(device);
+   }
}
 
/* drop the reference from the wait == 0 run */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2fb52c2..6cd9a55 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1923,6 +1923,9 @@ int repair_io_failure(struct btrfs_mapping_tree 
*map_tree, u64 start,
if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
/* try to remap that extent elsewhere? */
bio_put(bio);
+   btrfs_device_stat_inc(&dev->cnt_write_io_errs);
+   dev->device_stats_dirty = 1;
+   btrfs_device_stat_print_on_error(dev);
return -EIO;
}
 
@@ -2347,10 +2350,30 @@ static void end_bio_extent_readpage(struct bio *bio, 
int err)
if (uptodate && tree->ops && tree->ops->readpage_end_io_hook) {
ret = tree->ops->readpage_end_io_hook(page, start, end,
  state, mirror);
-   if (ret)
+   if (ret) {
+   /* no IO indicated but software detected errors
+* in the block, either checksum errors or
+* issues with the contents */
+   int failed_mirror = (int)(uintptr_t)
+   bio->bi_bdev;
+   struct btrfs_root *root =
+   BTRFS_I(page->mapping->host)->root;
+   struct btrfs_device *device;
+
uptodate = 0;
-   else
+   device = btrfs_find_device_for_logical(
+   root, start,
+   (int)failed_mirror);
+   if (device) {
+   btrfs_device_stat_inc(
+   &device->cnt_corruption_errs);
+   device->device_stats_dirty = 1;
+   btrfs_device_stat_print_on_error(
+  

[PATCH v4 2/3] Btrfs: add ioctl to get and reset the device stats

2012-05-22 Thread Stefan Behrens
An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.

Signed-off-by: Stefan Behrens 
---
 fs/btrfs/ioctl.c   |   26 
 fs/btrfs/ioctl.h   |   28 +
 fs/btrfs/volumes.c |   69 
 fs/btrfs/volumes.h |   13 ++
 4 files changed, 136 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 14f8e1f..19d2244 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3042,6 +3042,28 @@ static long btrfs_ioctl_scrub_progress(struct btrfs_root 
*root,
return ret;
 }
 
+static long btrfs_ioctl_get_device_stats(struct btrfs_root *root,
+void __user *arg, int reset_after_read)
+{
+   struct btrfs_ioctl_get_device_stats *sa;
+   int ret;
+
+   if (reset_after_read && !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   sa = memdup_user(arg, sizeof(*sa));
+   if (IS_ERR(sa))
+   return PTR_ERR(sa);
+
+   ret = btrfs_get_device_stats(root, sa, reset_after_read);
+
+   if (copy_to_user(arg, sa, sizeof(*sa)))
+   ret = -EFAULT;
+
+   kfree(sa);
+   return ret;
+}
+
 static long btrfs_ioctl_ino_to_path(struct btrfs_root *root, void __user *arg)
 {
int ret = 0;
@@ -3424,6 +3446,10 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_balance_ctl(root, arg);
case BTRFS_IOC_BALANCE_PROGRESS:
return btrfs_ioctl_balance_progress(root, argp);
+   case BTRFS_IOC_GET_DEVICE_STATS:
+   return btrfs_ioctl_get_device_stats(root, argp, 0);
+   case BTRFS_IOC_GET_AND_RESET_DEVICE_STATS:
+   return btrfs_ioctl_get_device_stats(root, argp, 1);
}
 
return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 086e6bd..f1c1196 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -266,6 +266,30 @@ struct btrfs_ioctl_logical_ino_args {
__u64   inodes;
 };
 
+#define BTRFS_IOCTL_GET_DEVICE_STATS_MAX_NR_ITEMS  5
+struct btrfs_ioctl_get_device_stats {
+   __u64 devid;/* in */
+   __u64 nr_items; /* in/out */
+
+   /* out values: */
+
+   /* disk I/O failure stats */
+   __u64 cnt_write_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __u64 cnt_read_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __u64 cnt_flush_io_errs; /* EIO or EREMOTEIO from lower layers */
+
+   /* stats for indirect indications for I/O failures */
+   __u64 cnt_corruption_errs; /* checksum error, bytenr error or
+   * contents is illegal: this is an
+   * indication that the block was damaged
+   * during read or write, or written to
+   * wrong location or read from wrong
+   * location */
+   __u64 cnt_generation_errs; /* an indication that blocks have not
+   * been written */
+   __u64 unused[121]; /* pad to 1k */
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -330,5 +354,9 @@ struct btrfs_ioctl_logical_ino_args {
struct btrfs_ioctl_ino_path_args)
 #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
struct btrfs_ioctl_ino_path_args)
+#define BTRFS_IOC_GET_DEVICE_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
+struct btrfs_ioctl_get_device_stats)
+#define BTRFS_IOC_GET_AND_RESET_DEVICE_STATS _IOWR(BTRFS_IOCTL_MAGIC, 53, \
+struct btrfs_ioctl_get_device_stats)
 
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c458c74..5f5a6ce 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4638,3 +4638,72 @@ void btrfs_device_stat_print_on_error(struct 
btrfs_device *device)
   btrfs_device_stat_read(
&device->cnt_generation_errs));
 }
+
+int btrfs_get_device_stats(struct btrfs_root *root,
+  struct btrfs_ioctl_get_device_stats *stats,
+  int reset_after_read)
+{
+   struct btrfs_device *dev;
+   struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+
+   mutex_lock(&fs_devices->device_list_mutex);
+   dev = btrfs_find_device(root, stats->devid, NULL, NULL);
+   mutex_unlock(&fs_devices->device_list_mutex);
+
+   if (!dev) {
+   printk(KERN_WARNING
+  "btrfs: get device_stats failed, device not found\n");
+   return -ENODEV;
+   

[PATCH v4 3/3] Btrfs: read device stats on mount, write modified ones during commit

2012-05-22 Thread Stefan Behrens
The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.

Signed-off-by: Stefan Behrens 
---
 fs/btrfs/ctree.h   |   51 
 fs/btrfs/disk-io.c |7 ++
 fs/btrfs/print-tree.c  |3 +
 fs/btrfs/transaction.c |4 +
 fs/btrfs/volumes.c |  205 
 fs/btrfs/volumes.h |9 +++
 6 files changed, 279 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ec42a24..1dd7651 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -823,6 +823,26 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+struct btrfs_device_stats_item {
+   /*
+* grow this item struct at the end for future enhancements and keep
+* the existing values unchanged
+*/
+   __le64 cnt_write_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __le64 cnt_read_io_errs; /* EIO or EREMOTEIO from lower layers */
+   __le64 cnt_flush_io_errs; /* EIO or EREMOTEIO from lower layers */
+
+   /* stats for indirect indications for I/O failures */
+   __le64 cnt_corruption_errs; /* checksum error, bytenr error or
+* contents is illegal: this is an
+* indication that the block was damaged
+* during read or write, or written to
+* wrong location or read from wrong
+* location */
+   __le64 cnt_generation_errs; /* an indication that blocks have not
+* been written */
+} __attribute__ ((__packed__));
+
 /* different types of block groups (and chunks) */
 #define BTRFS_BLOCK_GROUP_DATA (1ULL << 0)
 #define BTRFS_BLOCK_GROUP_SYSTEM   (1ULL << 1)
@@ -1508,6 +1528,12 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_BALANCE_ITEM_KEY 248
 
 /*
+ * Persistently stores the io stats in the device tree.
+ * One key for all stats, (0, BTRFS_DEVICE_STATS_KEY, devid).
+ */
+#define BTRFS_DEVICE_STATS_KEY 249
+
+/*
  * string items are for debugging.  They just store a short string of
  * data in the FS
  */
@@ -2415,6 +2441,31 @@ static inline u32 btrfs_file_extent_inline_item_len(struct extent_buffer *eb,
return btrfs_item_size(eb, e) - offset;
 }
 
+/* btrfs_device_stats_item */
+BTRFS_SETGET_FUNCS(device_stats_cnt_write_io_errs,
+  struct btrfs_device_stats_item, cnt_write_io_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_read_io_errs,
+  struct btrfs_device_stats_item, cnt_read_io_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_flush_io_errs,
+  struct btrfs_device_stats_item, cnt_flush_io_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_corruption_errs,
+  struct btrfs_device_stats_item, cnt_corruption_errs, 64);
+BTRFS_SETGET_FUNCS(device_stats_cnt_generation_errs,
+  struct btrfs_device_stats_item, cnt_generation_errs, 64);
+
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_write_io_errs,
+struct btrfs_device_stats_item, cnt_write_io_errs, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_read_io_errs,
+struct btrfs_device_stats_item, cnt_read_io_errs, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_flush_io_errs,
+struct btrfs_device_stats_item, cnt_flush_io_errs, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_corruption_errs,
+struct btrfs_device_stats_item, cnt_corruption_errs,
+64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_stats_cnt_generation_errs,
+struct btrfs_device_stats_item, cnt_generation_errs,
+64);
+
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
 {
return sb->s_fs_info;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e123629..7ba08f7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2353,6 +2353,13 @@ retry_root_backup:
fs_info->generation = generation;
fs_info->last_trans_committed = generation;
 
+   ret = btrfs_init_device_stats(fs_info);
+   if (ret) {
+   printk(KERN_ERR "btrfs: failed to init device_stats: %d\n",
+  ret);
+   goto fail_block_groups;
+   }
+
ret = btrfs_init_space_info(fs_info);
if (ret) {
printk(KERN_ERR "Failed to initial space info: %d\n", ret);
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index f38e452..a9e45e4 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -294,6 +294,9 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
 

[PATCH v4 0/3] Btrfs: add IO error device stats

2012-05-22 Thread Stefan Behrens
Changes v1-v2:
- Remove restriction that BTRFS_IOC_GET_DEVICE_STATS is a privileged
  operation
- Cast u64 to unsigned long long for printf()

Changes v2-v3:
- Rebased on Chris' current master

Changes v3-v4:
- Add padding at end of ioctl structure

The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.

An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.

The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.
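
The "only modified statistics are written" behaviour can be pictured with a small user-space model. This is only a sketch under assumptions: the names dev_stats, stats_load, stats_inc and stats_commit are hypothetical and do not appear in the patches, and a plain array stands in for the item in the device tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not the kernel code: one slot per counter. */
enum {
	STAT_WRITE_ERRS,
	STAT_READ_ERRS,
	STAT_FLUSH_ERRS,
	STAT_CORRUPTION_ERRS,
	STAT_GENERATION_ERRS,
	STAT_MAX
};

struct dev_stats {
	uint64_t counters[STAT_MAX]; /* in-memory counters */
	bool dirty;                  /* a counter changed since the last commit */
};

/* Mount: initialize the counters from the values stored on disk. */
static void stats_load(struct dev_stats *s, const uint64_t on_disk[STAT_MAX])
{
	for (int i = 0; i < STAT_MAX; i++)
		s->counters[i] = on_disk[i];
	s->dirty = false;
}

/* Error path: bump one counter and remember that it must be persisted. */
static void stats_inc(struct dev_stats *s, int which)
{
	assert(which >= 0 && which < STAT_MAX);
	s->counters[which]++;
	s->dirty = true;
}

/* Commit: write back only if something changed; returns the number of writes. */
static int stats_commit(struct dev_stats *s, uint64_t on_disk[STAT_MAX])
{
	if (!s->dirty)
		return 0;
	for (int i = 0; i < STAT_MAX; i++)
		on_disk[i] = s->counters[i];
	s->dirty = false;
	return 1;
}
```

In this model a commit with no new errors returns 0 and leaves the stored values alone, so an idle device costs nothing per transaction.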

A patch for the btrfs-progs world will also be sent.

Stefan Behrens (3):
  Btrfs: add device counters for detected IO and checksum errors
  Btrfs: add ioctl to get and reset the device stats
  Btrfs: read device stats on mount, write modified ones during commit

 fs/btrfs/ctree.h   |   51 
 fs/btrfs/disk-io.c |   25 +++-
 fs/btrfs/extent_io.c   |   27 +++-
 fs/btrfs/ioctl.c   |   26 
 fs/btrfs/ioctl.h   |   28 
 fs/btrfs/print-tree.c  |3 +
 fs/btrfs/scrub.c   |   72 ---
 fs/btrfs/transaction.c |4 +
 fs/btrfs/volumes.c |  335 +++-
 fs/btrfs/volumes.h |   43 +++
 10 files changed, 589 insertions(+), 25 deletions(-)

-- 
1.7.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph on btrfs 3.4rc

2012-05-22 Thread Christian Brunner
2012/5/21 Miao Xie :
> Hi Josef,
>
> On Fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index 9b9b15f..492c74f 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -57,9 +57,6 @@ struct btrfs_inode {
>>       /* used to order data wrt metadata */
>>       struct btrfs_ordered_inode_tree ordered_tree;
>>
>> -     /* for keeping track of orphaned inodes */
>> -     struct list_head i_orphan;
>> -
>>       /* list of all the delalloc inodes in the FS.  There are times we need
>>        * to write all the delalloc pages to disk, and this list is used
>>        * to walk them all.
>> @@ -156,6 +153,8 @@ struct btrfs_inode {
>>       unsigned dummy_inode:1;
>>       unsigned in_defrag:1;
>>       unsigned delalloc_meta_reserved:1;
>> +     unsigned has_orphan_item:1;
>> +     unsigned doing_truncate:1;
>
> I think the problem is that we should not use different locks to protect
> bit fields which are stored in the same machine word. Otherwise some bit
> fields may be overwritten by the others when someone changes those fields.
> Could you try declaring ->delalloc_meta_reserved and ->has_orphan_item
> as integers?
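
For reference, the concern quoted above can be seen in a few lines of user-space C. This is a minimal sketch, not the btrfs code: the two structs and the helper raw_word are made up for illustration, and the layout it relies on is the usual GCC/Clang behaviour rather than something the C standard guarantees. Adjacent bit fields are packed into one storage unit, so updating one of them is a read-modify-write of the whole word, while separate ints each get their own memory location.

```c
#include <assert.h>
#include <string.h>

/* Three one-bit flags: compilers normally pack them into one storage unit. */
struct flags_bits {
	unsigned delalloc_meta_reserved:1;
	unsigned has_orphan_item:1;
	unsigned doing_truncate:1;
};

/* The same flags as separate ints: each one is its own memory location. */
struct flags_ints {
	int delalloc_meta_reserved;
	int has_orphan_item;
	int doing_truncate;
};

/* Return the raw storage word that backs the bit fields. */
static unsigned raw_word(const struct flags_bits *f)
{
	unsigned w = 0;

	/* Assumption: the bit fields fit into a single unsigned. */
	assert(sizeof(*f) <= sizeof(w));
	memcpy(&w, f, sizeof(*f));
	return w;
}
```

Setting delalloc_meta_reserved and then has_orphan_item changes the value returned by raw_word() both times, which shows that the two flags live in the same word. Two CPUs updating them under different locks can therefore lose one of the updates.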

I have tried changing it to:

struct btrfs_inode {
unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
unsigned in_defrag:1;
-   unsigned delalloc_meta_reserved:1;
+   int delalloc_meta_reserved;
+   int has_orphan_item;
+   int doing_truncate;

The strange thing is that I'm no longer hitting the BUG_ON, but the
old WARNING (no additional messages):

[351021.157124] ------------[ cut here ]------------
[351021.162400] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351021.171812] Hardware name: ProLiant DL180 G6
[351021.176867] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351021.200236] Pid: 9837, comm: btrfs-transacti Tainted: PW
O 3.3.5-1.fits.1.el6.x86_64 #1
[351021.210126] Call Trace:
[351021.212957]  [] warn_slowpath_common+0x7f/0xc0
[351021.219758]  [] warn_slowpath_null+0x1a/0x20
[351021.226385]  []
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351021.234461]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351021.241669]  [] ?
btrfs_run_delayed_items+0xf1/0x160 [btrfs]
[351021.249841]  []
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351021.258006]  [] ? start_transaction+0x92/0x310 [btrfs]
[351021.265580]  [] ? wake_up_bit+0x40/0x40
[351021.271719]  [] transaction_kthread+0x26b/0x2e0 [btrfs]
[351021.279405]  [] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.288934]  [] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.298449]  [] kthread+0x9e/0xb0
[351021.303989]  [] kernel_thread_helper+0x4/0x10
[351021.310691]  [] ? kthread_freezable_should_stop+0x70/0x70
[351021.318555]  [] ? gs_change+0x13/0x13
[351021.324479] ---[ end trace 9adc7b36a3e66833 ]---
[351710.339482] ------------[ cut here ]------------
[351710.344754] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351710.354165] Hardware name: ProLiant DL180 G6
[351710.359222] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351710.382569] Pid: 9797, comm: kworker/5:0 Tainted: PW  O
3.3.5-1.fits.1.el6.x86_64 #1
[351710.392075] Call Trace:
[351710.394901]  [] warn_slowpath_common+0x7f/0xc0
[351710.401750]  [] warn_slowpath_null+0x1a/0x20
[351710.408414]  []
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351710.416528]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351710.423775]  []
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351710.431983]  [] ? __switch_to+0x153/0x440
[351710.438352]  [] ? wake_up_bit+0x40/0x40
[351710.444529]  [] ?
btrfs_commit_transaction+0xa50/0xa50 [btrfs]
[351710.452894]  [] do_async_commit+0x1f/0x30 [btrfs]
[351710.459979]  [] process_one_work+0x129/0x450
[351710.466576]  [] worker_thread+0x17b/0x3c0
[351710.472884]  [] ? manage_workers+0x220/0x220
[351710.479472]  [] kthread+0x9e/0xb0
[351710.485029]  [] kernel_thread_helper+0x4/0x10
[351710.491731]  [] ? kthread_freezable_should_stop+0x70/0x70
[351710.499640]  [] ? gs_change+0x13/0x13
[351710.505590] ---[ end trace 9adc7b36a3e66834 ]---


Regards,
Christian


Re: Newbie questions on some of btrfs code...

2012-05-22 Thread Alex Lyakas
Hi Jan,

>> # I saw that slot==0 is special. My understanding is that btrfs
>> maintains the property that the parent of each node/leaf has a key
>> pointing to that node/leaf, which must be equal to the key in the
>> slot==0 of this node/leaf. That's what fixup_low_keys() tries to
>> maintain. Is this correct?
>
> Yes. I'm not 100% sure if the key in the parent node must match exactly
> the first key of the child node. It is probably allowed to be less than
> or equal to the first key. It is guaranteed to be larger than the
> largest key of the previous (left) node, though.
>
> And yes, that's what fixup_low_keys is correcting.
>
>> # If my understanding in the previous bullet is correct: Is that the
>> reason that in btrfs_prev_leaf() it is assumed that if there is a
>> lesser key, btrfs_search_slot() will never bring us to the slot==0 of
>> the current leaf?
>
> It's quite straightforward: we look for a key smaller than the first
> (slot 0) of the current leaf. If we find the current leaf again (because
> btrfs_search_slot returns the place where such a key would have been
> inserted), then there's no previous leaf. No preconditions or
> assumptions about nodes at higher levels are needed.

Let's say that slot[0] of the current leaf (A) has key=10. And let's
say that its parent node (N) has key=5 (and not 10). Let's say we have
a previous leaf (B), whose last slot has key=2.
If such a tree is valid, then btrfs_prev_leaf() will search for key==9.
Then btrfs_search_slot() would bring us to node N and leaf A again,
wouldn't it? Because key(N)<=9. So we will receive leaf A back, and
will think that there is no previous leaf, while there is.
What am I missing here?
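
To make the scenario concrete, here is a toy model of the descent step. It is only a sketch: pick_child is a hypothetical helper, not the kernel's search routine, and it assumes the parent holds one sorted key per child and that the search follows the last slot whose key is less than or equal to the target.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of one descent step: the parent node holds one key per child,
 * sorted ascending. Return the index of the last slot whose key is <= target,
 * or -1 if the target is smaller than every key in the node.
 */
static int pick_child(const unsigned long *keys, size_t n, unsigned long target)
{
	int lo = 0, hi = (int)n - 1, found = -1;

	assert(keys != NULL);
	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (keys[mid] <= target) {
			found = mid;      /* candidate, look further right */
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return found;
}
```

With parent keys {1, 5} (slot 0 for leaf B, slot 1 for leaf A), a search for 9 returns slot 1, i.e. leaf A again, which is the case I am worried about. With parent keys {1, 10}, where the parent key equals A's first key, the same search returns slot 0 and lands in leaf B. The key 1 for leaf B is an arbitrary value chosen for the example.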

Thanks for your help,
Alex.


Re: Newbie questions on some of btrfs code...

2012-05-22 Thread Alex Lyakas
Thanks, Liu, that clarifies.

Alex.

On Tue, May 22, 2012 at 4:42 AM, Liu Bo  wrote:
> On 05/21/2012 06:05 PM, Alex Lyakas wrote:
>
>> Hi Liu,
>> thanks for the clarifications.
>>
>> I did not understand the dd example of yours, though.
>>
>>> So for the following situation:
>>>        item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>                extent data disk byte 0 nr 0
>>>                extent data offset 0 nr 4096 ram 8192
>>>                extent compression 0
>>> In your case, after the first 'size 5' inline extent is written,
>>> "nr 4096 < ram 8192" could come from:
>>> 1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
>>> 2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync
>>>
>>> 1) makes
>>>        item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>                extent data disk byte 0 nr 0
>>>                extent data offset 0 nr 8192 ram 8192
>>>                extent compression 0
>>> 2) makes
>>>        item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>                extent data disk byte 0 nr 0
>>>                extent data offset 0 nr 4096 ram 8192
>>>                extent compression 0
>>
>> You talk about the "ram_bytes" field. But do I need to look at it, if
>> I don't use compression or another encoding? Shouldn't I always look
>> at btrfs_file_extent_item::offset/num_bytes for the real data, and at
>> btrfs_file_extent_item::disk_bytenr/disk_num_bytes for finding
>> CHUNK_ITEM? Any reason I should be aware of "ram_bytes" field?
>>
>
>> The first dd created a 4k extent at offset 12k. How did we end up with
>> "nr 8192 ram 8192" and offset 4k?
>> The second dd added a 4k extent at 8k offset. But still EXTENT_DATA
>> has 4k offset.
>> So now we should have two 4k extents or one 8k extent. What am I missing?
>>
>> Alex.
>>
>
>
> As I mentioned, disk_bytenr == 0 means a dummy extent, for which we have
> not yet allocated a range of space.
>
> After your first 'size=5' inline extent, we'll start allocating extents from
> _4096_, because it is _4k aligned_.
>
>>> 1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
> : we need a dummy extent for [4k, 12k], which starts from 4096, and nr is 8192
>
>>> 2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync
> : we break [4k, 12k] into a dummy one [4k, 8k] and a real one [8k, 12k].
>
> For more details, please refer to btrfs_drop_extents();
>
> thanks,
> liubo
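
The split described above can be written down as a small user-space sketch. This is an illustration only, not what btrfs_drop_extents() does internally: struct extent and split_hole are hypothetical names, and the model ignores ram_bytes and everything on-disk.

```c
#include <assert.h>
#include <stdint.h>

struct extent {
	uint64_t start;  /* file offset in bytes */
	uint64_t len;    /* number of bytes covered ("nr") */
	int is_hole;     /* 1 = dummy extent (disk byte 0), 0 = real data */
};

/*
 * Overwrite [wstart, wstart + wlen) inside the dummy extent h with real data.
 * Stores up to three resulting extents in out[] and returns how many.
 */
static int split_hole(struct extent h, uint64_t wstart, uint64_t wlen,
		      struct extent out[3])
{
	uint64_t hend = h.start + h.len;
	uint64_t wend = wstart + wlen;
	int n = 0;

	/* The write must be non-empty and lie entirely inside the hole. */
	assert(h.is_hole && wlen > 0 && wstart >= h.start && wend <= hend);

	if (wstart > h.start)   /* dummy part left of the write */
		out[n++] = (struct extent){ h.start, wstart - h.start, 1 };
	out[n++] = (struct extent){ wstart, wlen, 0 };
	if (wend < hend)        /* dummy part right of the write */
		out[n++] = (struct extent){ wend, hend - wend, 1 };
	return n;
}
```

For the second dd, the dummy extent { 4096, 8192 } and a 4096-byte write at offset 8192 give two results: a dummy extent { 4096, 4096 } and a real one { 8192, 4096 }, matching the "[4k, 8k] dummy, [8k, 12k] real" description.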