Re: How to recover a filesystem without formatting nor using the btrfs check command.
At 10/25/2016 01:54 AM, none wrote:
> So do you mean lowmem is also low cpu ?

Not sure, but lowmem is high IO. And by design, it won't cause a dead loop unless there is a looping tree block. But that would be detected by check_tree_block(). So it just avoids any possible dead loop, AFAIK.

> Indeed here's the output if the metadata image isn't enough (it terminates correctly with the --lowmem option). I must admit that even without the --repair option, btrfs check hangs.

I just forgot you had uploaded the image dump. I'll check it. But according to the lowmem output, it seems all your extent tree is screwed up; maybe that's the cause of the problem?

Thanks,
Qu

> regards,
>
> On 24/10/2016 at 03:15, Qu Wenruo wrote:
>> You could try to use --mode lowmem, which doesn't ever use any loop to get the next block, but iterates over trees. Currently in mainline btrfs-progs, the low memory mode code only checks extent/chunk trees; file/subvolume trees are still checked by the original mode.
>>
>> You could try the devel branch from David, which now contains the full low memory mode check code:
>> https://github.com/kdave/btrfs-progs/tree/devel
>>
>> Although low memory mode doesn't support repair yet, it would give us enough info on what's corrupted, so we can later fix it by hand or enhance the original mode.
>>
>> Thanks,
>> Qu
>>
>> At 10/24/2016 03:42 AM, none wrote:
>>> Hello, I have the following bug https://bugzilla.kernel.org/show_bug.cgi?id=178781 in btrfs check. Is there a way to recover my filesystem to a clean state without formatting or using btrfsck? Of course, the point is that I no longer need the files which are damaged. So is there a way to recover a btrfs filesystem by deleting the corrupted data instead of trying to restore it?
btrfs fi df /mnt/Opera_Mobile_Emulator_12.1_Linux
Data, single: total=66.01GiB, used=0.00B
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=5.00GiB, used=28.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=4.00MiB, used=0.00B

btrfs-progs version 4.7.3 from Devuan

Label: 'backup'  uuid: 56040bbb-ed5c-47f2-82e2-34457bd7b4f3
        Total devices 1  FS bytes used 44.00KiB
        devid 1 size 298.91GiB used 76.04GiB path /dev/mapper/isw_bdffeeeijj_Volume0p7

uname -a
Linux localhost 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1 (2016-04-20) x86_64 GNU/Linux

Result of btrfs-image on /dev/mapper/isw_bdffeeeijj_Volume0p7:
https://web.archive.org/web/20161020220914/https://filebin.net/7ni8kfpog1dxw4jc/btrfs-image_capture.xz

Thanks,
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v13.1 00/14] Btrfs In-band De-duplication
At 10/25/2016 01:46 AM, David Sterba wrote:
> On Thu, Oct 20, 2016 at 10:03:39AM +0800, Qu Wenruo wrote:
>> Qu Wenruo (4):
>>   btrfs: delayed-ref: Add support for increasing data ref under spinlock
>>   btrfs: dedupe: Inband in-memory only de-duplication implement
>>   btrfs: relocation: Enhance error handling to avoid BUG_ON
>>   btrfs: dedupe: Introduce new reconfigure ioctl
>>
>> Wang Xiaoguang (10):
>>   btrfs: dedupe: Introduce dedupe framework and its header
>>   btrfs: dedupe: Introduce function to initialize dedupe info
>>   btrfs: dedupe: Introduce function to add hash into in-memory tree
>>   btrfs: dedupe: Introduce function to remove hash from in-memory tree
>>   btrfs: dedupe: Introduce function to search for an existing hash
>>   btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>>   btrfs: ordered-extent: Add support for dedupe
>>   btrfs: dedupe: Add ioctl for inband dedupelication
>>   btrfs: improve inode's outstanding_extents computation
>
> I've noticed that this patch is different from what Wang sent independently. Is the patch necessary for the dedupe patchset? If not, please drop it.

Yes, Wang is still working on the ENOSPC fix, which doesn't only affect dedupe, but also compression. The merge conflict is in the checks for the free space inode:

    if (root == root->fs_info->tree_root)
vs
    if (!btrfs_is_free_space_inode(...))

> If you need the patch, then please update it to the latest version. I won't get the merge conflicts at least; this is acceptable for the testing 'for-next' branch. I'll re-order the patches, maybe put his ENOSPC fixes at the end or beginning.

Since Wang hopes to fix ENOSPC for both compression and dedupe, the patchset rebase will be blocked for several days.

Thanks,
Qu
[PATCH v2 3/4] btrfs-progs: raid56: Introduce new function to calculate raid5 parity or data stripe
Introduce new function raid5_gen_result() to calculate parity or data stripe.

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
---
Changelog:
v2:
  None
---
 disk-io.h |  1 +
 raid56.c  | 63 +++
 2 files changed, 64 insertions(+)

diff --git a/disk-io.h b/disk-io.h
index 8ab36aa..245626c 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -192,5 +192,6 @@ int write_and_map_eb(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 /* raid56.c */
 void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
+int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);
 
 #endif
diff --git a/raid56.c b/raid56.c
index 833df5f..8c79c45 100644
--- a/raid56.c
+++ b/raid56.c
@@ -26,6 +26,8 @@
 #include "kerncompat.h"
 #include "ctree.h"
 #include "disk-io.h"
+#include "volumes.h"
+#include "utils.h"
 
 /*
  * This is the C data type to use
@@ -107,3 +109,64 @@ void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs)
 	}
 }
 
+static void xor_range(char *dst, const char *src, size_t size)
+{
+	/* Move to DWORD aligned */
+	while (size && ((unsigned long)dst & sizeof(unsigned long))) {
+		*dst++ ^= *src++;
+		size--;
+	}
+
+	/* DWORD aligned part */
+	while (size >= sizeof(unsigned long)) {
+		*(unsigned long *)dst ^= *(unsigned long *)src;
+		src += sizeof(unsigned long);
+		dst += sizeof(unsigned long);
+		size -= sizeof(unsigned long);
+	}
+	/* Remaining */
+	while (size) {
+		*dst++ ^= *src++;
+		size--;
+	}
+}
+
+/*
+ * Generate desired data/parity stripe for RAID5
+ *
+ * @nr_devs:	Total number of devices, including parity
+ * @stripe_len:	Stripe length
+ * @data:	Data, with special layout:
+ *		data[0]:		Data stripe 0
+ *		data[nr_devs-2]:	Last data stripe
+ *		data[nr_devs-1]:	RAID5 parity
+ * @dest:	To generate which data. should follow above data layout
+ */
+int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data)
+{
+	int i;
+	char *buf = data[dest];
+
+	/* Validation check */
+	if (stripe_len <= 0 || stripe_len != BTRFS_STRIPE_LEN) {
+		error("invalid parameter for %s", __func__);
+		return -EINVAL;
+	}
+
+	if (dest >= nr_devs || nr_devs < 2) {
+		error("invalid parameter for %s", __func__);
+		return -EINVAL;
+	}
+	/* Shortcut for 2 devs RAID5, which is just RAID1 */
+	if (nr_devs == 2) {
+		memcpy(data[dest], data[1 - dest], stripe_len);
+		return 0;
+	}
+	memset(buf, 0, stripe_len);
+	for (i = 0; i < nr_devs; i++) {
+		if (i == dest)
+			continue;
+		xor_range(buf, data[i], stripe_len);
+	}
+	return 0;
+}
-- 
2.10.1
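Independent of the patch itself, the XOR identity it relies on — each RAID5 stripe, parity included, is the XOR of all the others — is easy to demonstrate in plain userspace C. The sketch below (`raid5_gen_sketch()` is an illustrative name, not btrfs-progs code) mirrors the control flow of raid5_gen_result(), but swaps the word-at-a-time xor_range() for a byte-wise XOR and drops the BTRFS_STRIPE_LEN check.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Simplified stand-in for the patch's raid5_gen_result(): regenerate
 * data[dest] as the XOR of every other stripe. The same call works for
 * generating parity (dest == nr_devs - 1) and for rebuilding a lost
 * data stripe from the survivors.
 */
static int raid5_gen_sketch(int nr_devs, size_t stripe_len, int dest,
                            unsigned char **data)
{
	if (nr_devs < 2 || dest < 0 || dest >= nr_devs)
		return -1;
	/* Shortcut for 2-device RAID5, which is just RAID1 */
	if (nr_devs == 2) {
		memcpy(data[dest], data[1 - dest], stripe_len);
		return 0;
	}
	memset(data[dest], 0, stripe_len);
	for (int i = 0; i < nr_devs; i++) {
		if (i == dest)
			continue;
		for (size_t b = 0; b < stripe_len; b++)
			data[dest][b] ^= data[i][b];
	}
	return 0;
}
```

Generating the parity with dest = nr_devs - 1 and then calling the same function with dest = 0 recovers data stripe 0 from the remaining stripes — exactly the symmetry the name ("gen result", not "gen parity") hints at.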
[PATCH v2 2/4] btrfs-progs: volumes: Remove BUG_ON in raid56 write routine
Remove various BUG_ON()s in the raid56 write routine, including:

1) Memory allocation error
   The old code allocated memory on demand inside a loop and caught
   failures with BUG_ON(). The new code allocates everything in an
   allocation loop first; if any failure is caught, it frees the
   already allocated memory and returns -ENOMEM.

2) Write error
   Change BUG_ON() to a correct return value.

3) Validation check
   Same as write error.

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
---
changelog:
v2:
  Fix error in calloc usage which leads to double free() on data stripes.
  Thanks David for pointing it out.
---
 volumes.c | 89 +++
 1 file changed, 61 insertions(+), 28 deletions(-)

diff --git a/volumes.c b/volumes.c
index b9ca550..70d8940 100644
--- a/volumes.c
+++ b/volumes.c
@@ -2063,25 +2063,39 @@ static int rmw_eb(struct btrfs_fs_info *info,
 	return 0;
 }
 
-static void split_eb_for_raid56(struct btrfs_fs_info *info,
-				struct extent_buffer *orig_eb,
+static int split_eb_for_raid56(struct btrfs_fs_info *info,
+			       struct extent_buffer *orig_eb,
 			       struct extent_buffer **ebs,
 			       u64 stripe_len, u64 *raid_map,
 			       int num_stripes)
 {
-	struct extent_buffer *eb;
+	struct extent_buffer **tmp_ebs;
 	u64 start = orig_eb->start;
 	u64 this_eb_start;
 	int i;
-	int ret;
+	int ret = 0;
+
+	tmp_ebs = calloc(num_stripes, sizeof(*tmp_ebs));
+	if (!tmp_ebs)
+		return -ENOMEM;
 
+	/* Alloc memory in a row for data stripes */
 	for (i = 0; i < num_stripes; i++) {
 		if (raid_map[i] >= BTRFS_RAID5_P_STRIPE)
 			break;
-		eb = calloc(1, sizeof(struct extent_buffer) + stripe_len);
-		if (!eb)
-			BUG();
+		tmp_ebs[i] = calloc(1, sizeof(**tmp_ebs) + stripe_len);
+		if (!tmp_ebs[i]) {
+			ret = -ENOMEM;
+			goto clean_up;
+		}
+	}
+
+	for (i = 0; i < num_stripes; i++) {
+		struct extent_buffer *eb = tmp_ebs[i];
+
+		if (raid_map[i] >= BTRFS_RAID5_P_STRIPE)
+			break;
 
 		eb->start = raid_map[i];
 		eb->len = stripe_len;
@@ -2095,12 +2109,21 @@ static void split_eb_for_raid56(struct btrfs_fs_info *info,
 		if (start > this_eb_start ||
 		    start + orig_eb->len < this_eb_start + stripe_len) {
 			ret = rmw_eb(info, eb, orig_eb);
-			BUG_ON(ret);
+			if (ret)
+				goto clean_up;
 		} else {
-			memcpy(eb->data, orig_eb->data + eb->start - start, stripe_len);
+			memcpy(eb->data, orig_eb->data + eb->start - start,
+			       stripe_len);
 		}
 		ebs[i] = eb;
 	}
+	free(tmp_ebs);
+	return ret;
+clean_up:
+	for (i = 0; i < num_stripes; i++)
+		free(tmp_ebs[i]);
+	free(tmp_ebs);
+	return ret;
 }
 
 int write_raid56_with_parity(struct btrfs_fs_info *info,
@@ -2113,15 +2136,20 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 	int j;
 	int ret;
 	int alloc_size = eb->len;
+	void **pointers;
 
-	ebs = kmalloc(sizeof(*ebs) * multi->num_stripes, GFP_NOFS);
-	BUG_ON(!ebs);
+	ebs = malloc(sizeof(*ebs) * multi->num_stripes);
+	pointers = malloc(sizeof(*pointers) * multi->num_stripes);
+	if (!ebs || !pointers)
+		return -ENOMEM;
 
 	if (stripe_len > alloc_size)
 		alloc_size = stripe_len;
 
-	split_eb_for_raid56(info, eb, ebs, stripe_len, raid_map,
-			    multi->num_stripes);
+	ret = split_eb_for_raid56(info, eb, ebs, stripe_len, raid_map,
+				  multi->num_stripes);
+	if (ret)
+		goto out;
 
 	for (i = 0; i < multi->num_stripes; i++) {
 		struct extent_buffer *new_eb;
@@ -2129,11 +2157,17 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 			ebs[i]->dev_bytenr = multi->stripes[i].physical;
 			ebs[i]->fd = multi->stripes[i].dev->fd;
 			multi->stripes[i].dev->total_ios++;
-			BUG_ON(ebs[i]->start != raid_map[i]);
+			if (ebs[i]->start != raid_map[i]) {
+				ret = -EINVAL;
+				goto out_free_split;
+			}
 			continue;
 		}
-		new_eb = kmalloc(sizeof(*eb) + alloc_size, GFP_NOFS);
-
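The allocate-everything-first, clean-up-on-failure shape this patch moves to generalizes beyond btrfs-progs. The standalone sketch below is illustrative only (`alloc_buffers()` is a hypothetical helper, not from the patch), but it follows the same calloc-loop / goto clean_up structure as the new split_eb_for_raid56(): on any partial failure, free what was already allocated and return -ENOMEM instead of calling BUG().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Allocate n zeroed buffers of the given size. On success, *out holds
 * the array of buffer pointers. On any allocation failure, free every
 * buffer allocated so far plus the pointer array, and return -ENOMEM —
 * the pattern the patch uses to replace BUG_ON().
 */
static int alloc_buffers(void ***out, int n, size_t size)
{
	void **bufs = calloc(n, sizeof(*bufs));

	if (!bufs)
		return -ENOMEM;
	for (int i = 0; i < n; i++) {
		bufs[i] = calloc(1, size);
		if (!bufs[i])
			goto clean_up;
	}
	*out = bufs;
	return 0;
clean_up:
	for (int i = 0; i < n; i++)
		free(bufs[i]);	/* free(NULL) is a no-op for the tail */
	free(bufs);
	return -ENOMEM;
}
```

Because calloc() zeroes the pointer array up front, the cleanup loop can safely walk all n slots: the unallocated tail is NULL, and free(NULL) does nothing. That is exactly the property the v2 changelog's double-free fix depends on.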
[PATCH v2 4/4] btrfs-progs: volumes: Use new raid5_gen_result to calculate raid5 parity
Use the new raid5_gen_result() function to calculate raid5 parity.

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
---
Changelog:
v2:
  None
---
 volumes.c | 19 ++-
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/volumes.c b/volumes.c
index 70d8940..6d7adef 100644
--- a/volumes.c
+++ b/volumes.c
@@ -2133,7 +2133,6 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 {
 	struct extent_buffer **ebs, *p_eb = NULL, *q_eb = NULL;
 	int i;
-	int j;
 	int ret;
 	int alloc_size = eb->len;
 	void **pointers;
@@ -2188,18 +2187,12 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 		raid6_gen_syndrome(multi->num_stripes, stripe_len, pointers);
 	} else {
 		ebs[multi->num_stripes - 1] = p_eb;
-		memcpy(p_eb->data, ebs[0]->data, stripe_len);
-		for (j = 1; j < multi->num_stripes - 1; j++) {
-			for (i = 0; i < stripe_len; i += sizeof(u64)) {
-				u64 p_eb_data;
-				u64 ebs_data;
-
-				p_eb_data = get_unaligned_64(p_eb->data + i);
-				ebs_data = get_unaligned_64(ebs[j]->data + i);
-				p_eb_data ^= ebs_data;
-				put_unaligned_64(p_eb_data, p_eb->data + i);
-			}
-		}
+		for (i = 0; i < multi->num_stripes; i++)
+			pointers[i] = ebs[i]->data;
+		ret = raid5_gen_result(multi->num_stripes, stripe_len,
+				       multi->num_stripes - 1, pointers);
+		if (ret < 0)
+			goto out_free_split;
 	}
 
 	for (i = 0; i < multi->num_stripes; i++) {
-- 
2.10.1
[PATCH v2 1/4] btrfs-progs: rename raid6.c to raid56.c
This allows us to put the raid5 code into that file rather than creating a new raid5.c.

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
---
Changelog v2:
  None
---
 Makefile.in         | 2 +-
 disk-io.h           | 2 +-
 raid6.c => raid56.c | 0
 3 files changed, 2 insertions(+), 2 deletions(-)
 rename raid6.c => raid56.c (100%)

diff --git a/Makefile.in b/Makefile.in
index 5187a93..b53cf2c 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -91,7 +91,7 @@ CHECKER_FLAGS := -include $(check_defs) -D__CHECKER__ \
 objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
 	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
-	  qgroup.o raid6.o free-space-cache.o kernel-lib/list_sort.o props.o \
+	  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
 	  inode.o file.o find-root.o free-space-tree.o help.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
diff --git a/disk-io.h b/disk-io.h
index 1fc0bb7..8ab36aa 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -190,7 +190,7 @@ int write_tree_block(struct btrfs_trans_handle *trans,
 int write_and_map_eb(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		     struct extent_buffer *eb);
 
-/* raid6.c */
+/* raid56.c */
 void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
 
 #endif
diff --git a/raid6.c b/raid56.c
similarity index 100%
rename from raid6.c
rename to raid56.c
-- 
2.10.1
Re: Scrubbing Errors after restoring backup
On 24 October 2016 at 17:53, Stefan Malte Schumacher wrote:
> Hello
>
> For reference please see this post.
> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg58461.html
> Please note that I downgraded to btrfs-progs 4.6.1 as advised.
>
> After exchanging the malfunctioning drive I re-created the filesystem
> and restored the backup from my NAS. (I didn't entirely trust the
> filesystem after so many errors) On completing the restoration I
> manually started scrubbing, which ended with hundreds of checksum and
> read errors on /dev/sda.
> The drive checks out fine in smart and passed through all scheduled
> SMART Self-Tests. The model is not identical to the two drives
> recently added to the system - the new drives are WD Blue, the four
> original ones are WD Greens.
>
> I have reset the output from btrfs dev stats and restarted the
> scrubbing process. I am unsure how to interpret or explain the errors
> of the last scrub run. I scrubbed regularly each month for nearly
> three years and never had any errors. I would be grateful for any
> advice on how to proceed.
>
> Yours sincerely
> Stefan

Hi Stefan,

What kernel version are you using? Was the backup a file-level archive or a btrfs send stream?

I'm confused about the evolution of your hardware. Originally you had a four disk raid1? Or a six disk raid1? The one that failed was /dev/sdf, which seems to suggest:

/dev/sdc - WD Green
/dev/sdd - WD Green
/dev/sde - WD Green
/dev/sdf - WD Green <- failed

I would expect that the new volume is something like:

/dev/sdc - New unnamed model or 3 year old WD Green?
/dev/sdd - New unnamed model or 3 year old WD Green?
/dev/sde - New WD Blue
/dev/sdf - New WD Blue

Did you move the sata cables to use:

/dev/sda - Unknown. New disk or 3 year old disk?
/dev/sdb - Unknown. New disk or 3 year old disk?
/dev/sdc - New WD Blue
/dev/sdd - New WD Blue

And this is a freshly-created btrfs volume?
When you restored from backup, your hard drive firmware should have detected any bad sectors and relocated the write to a reserve sector, and I'm assuming none of the logs have anything in them that would indicate a failed write. If sda is from the 3 year old batch of WD Greens I would distrust it.

Frequent culprits of similar problems are flaky sata cables or a flaky PSU. In the case of flaky sata cables, dmesg (usually?) shows PHY and "hard resetting link" errors. I also wonder if the sata0 port on your motherboard might be bad. The only reason I mention this is because I've seen two H67/P67 cougarpoint chipset motherboards lose their sata0 channel. It also happens with other brands' chipsets...

Whatever the case, when stuff like this happened to me I've always used something like a combination of a cpuburnP6 per logical CPU, memtester (in Linux; do this after a clean 24h memtest86+ run), and a huge and intense bonnie++ run, with as many things plugged into the USB ports as possible--including charging at least one high-power device--while burning a DVD and/or running something that stresses the GPU...to try to shake down potential PSU issues. Maybe passmark (under Linux) has similar functionality with an easier interface?

I've also used parallel diskscan (https://github.com/baruch/diskscan) runs to test old disks and to check for statistical anomalies. If you do:

1. Use tape to number your cables; record which drives are connected into which sata ports with which cables. Do simultaneous runs of diskscan on /dev/disk/by-id/$relevant_disks, check dmesg, and record the results.
2. Unplug sata cables from drives and shuffle; document specifics and test.
3. Unplug sata cables from motherboard and shuffle; document specifics and test.
For the cost of new sata cables, you might as well just buy new ones, because then these tests can be used to check for bad ones among the new cables; it's a better use of time, because it's possible that you'll detect a bad cable, replace it, test the new cable, and find out that the new cable is defective.

Fountain of Bad Luck™ <- If something can fail, it will fail when I use it ;-)

That said, I've never tested a WD Green drive... the Reds' performance smoothly decreases towards the end of the drive (outer tracks are quite a bit faster than inner tracks). For all I know the Greens have erratic performance baked into their power-saving design...

If there's consistently a latency spike at the same location for the test associated with a particular drive, that can indicate a relocated bad sector. Does anyone know if this method reliably indicates when a drive is lying about its SMART 5 Reallocated_Sector_Ct report?

Cheers,
Nicholas
Re: bio linked list corruption.
On Oct 24, 2016 5:00 PM, "Linus Torvalds" wrote:
>
> On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski wrote:
>
> > Now the fallocate thread catches up and *exits*. Dave's test makes a
> > new thread that reuses the stack (the vmap area or the backing store).
> >
> > Now the shmem_fault thread continues on its merry way and takes
> > q->lock. But oh crap, q->lock is pointing at some random spot on some
> > other thread's stack. Kaboom!
>
> Note that q->lock should be entirely immaterial, since inode->i_lock
> nests outside of it in all uses.
>
> Now, if there is some code that runs *without* the inode->i_lock, then
> that would be a big bug.
>
> But I'm not seeing it.
>
> I do agree that some race on some stack data structure could easily be
> the cause of these issues. And yes, the vmap code obviously starts
> reusing the stack much earlier, and would trigger problems that would
> essentially be hidden by the fact that the kernel stack used to stay
> around not just until exit(), but until the process was reaped.
>
> I just think that in this case i_lock really looks like it should
> serialize things correctly.
>
> Or are you seeing something I'm not?

No, I missed that.

--Andy
Re: bio linked list corruption.
On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski wrote:
>
> Here's my theory: I think you're looking at the right code but the
> wrong stack. shmem_fault_wait is fine, but shmem_fault_waitq looks
> really dicey.

Hmm.

> Consider:
>
> fallocate calls wake_up_all(), which calls autoremove_wait_function().
> That wakes up the shmem_fault thread. Suppose that the shmem_fault
> thread gets moving really quickly (presumably because it never went to
> sleep in the first place -- suppose it hadn't made it to schedule(),
> so schedule() turns into a no-op). It calls finish_wait() *before*
> autoremove_wake_function() does the list_del_init(). finish_wait()
> gets to the list_empty_careful() call and it returns true.

All of this happens under inode->i_lock, so the different parts are serialized. So if this happens before the wakeup, then finish_wait() will see that the wait-queue entry is not empty (it points to the wait queue head in shmem_falloc_waitq).

But then it will just remove itself carefully with list_del_init() under the waitqueue lock, and things are fine. Yes, it uses the waitqueue lock on the other stack, but that stack is still ok since

 (a) we're serialized by inode->i_lock

 (b) this code ran before the fallocate thread catches up and exits.

In other words, your scenario seems to be dependent on those two threads being able to race. But they both hold inode->i_lock in the critical region you are talking about.

> Now the fallocate thread catches up and *exits*. Dave's test makes a
> new thread that reuses the stack (the vmap area or the backing store).
>
> Now the shmem_fault thread continues on its merry way and takes
> q->lock. But oh crap, q->lock is pointing at some random spot on some
> other thread's stack. Kaboom!

Note that q->lock should be entirely immaterial, since inode->i_lock nests outside of it in all uses.

Now, if there is some code that runs *without* the inode->i_lock, then that would be a big bug.

But I'm not seeing it.
I do agree that some race on some stack data structure could easily be the cause of these issues. And yes, the vmap code obviously starts reusing the stack much earlier, and would trigger problems that would essentially be hidden by the fact that the kernel stack used to stay around not just until exit(), but until the process was reaped.

I just think that in this case i_lock really looks like it should serialize things correctly.

Or are you seeing something I'm not?

Linus
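For readers following the finish_wait() discussion, here is a minimal userspace re-creation of the list primitives involved — an illustrative sketch, not the kernel source, and single-threaded: list_del_init() as performed by the waker's autoremove_wake_function(), and the lockless list_empty_careful() test that finish_wait() uses to decide whether it can skip taking the waitqueue lock at all. The thread's debate is about when that skip is safe; the sketch only shows the two states the check distinguishes.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, in the style of the kernel's. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

/* Insert new right after head, as prepare_to_wait() queues an entry. */
static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* Unlink and reinitialize, as autoremove_wake_function() does. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/*
 * Lockless "has my entry already been removed?" test: true only when
 * both pointers self-reference, i.e. after a completed list_del_init().
 * Checking both next and prev is what makes the test tolerate observing
 * a concurrent deletion halfway through (hence "careful").
 */
static int list_empty_careful(const struct list_head *head)
{
	struct list_head *next = head->next;
	return (next == head) && (next == head->prev);
}
```

If the entry is still queued, finish_wait() sees a non-empty entry and must take the (possibly other-stack) waitqueue lock to remove itself — which is exactly the step that goes wrong if the stack holding the wait queue head has already been reused.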
Re: bio linked list corruption.
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds wrote:
> On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski wrote:
>>>
>>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>
>> This is an unhandled kernel page fault. The string "Oops" is so helpful :-/
>
> I think there was a line above it that DaveJ just didn't include.
>
>>
>>> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
>>> [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000
>>> [69943.468077] RIP: 0010:[]
>>> [69943.472704] [] __lock_acquire.isra.32+0x6b/0x8c0
>>> [69943.477489] RSP: 0018:c9000b10b9e8 EFLAGS: 00010086
>>> [69943.482368] RAX: 81789b90 RBX: 8804f8dd3740 RCX:
>>> [69943.487410] RDX: RSI: RDI:
>>> [69943.492515] RBP: c9000b10ba18 R08: 0001 R09:
>>> [69943.497666] R10: 0001 R11: 3f9cfa7f4e73 R12:
>>> [69943.502880] R13: R14: c9000af7bd48 R15: 8804f8dd3740
>>> [69943.508163] FS: 7f64904a2b40() GS:880507a0() knlGS:
>>> [69943.513591] CS: 0010 DS: ES: CR0: 80050033
>>> [69943.518917] CR2: 81789d28 CR3: 0004a8f16000 CR4: 001406e0
>>> [69943.524253] DR0: 7f5b97fd4000 DR1: DR2:
>>> [69943.529488] DR3: DR6: 0ff0 DR7: 0600
>>> [69943.534771] Stack:
>>> [69943.540023] 880507bd74c0
>>> [69943.545317] 8804f8dd3740 0046 0286
>>> [69943.545456] c9000af7bd08
>>> [69943.550930] 0100 c9000b10ba50 810c4b68
>>> [69943.551069] 810ba40c
>>> [69943.556657] 8804 c9000af7bd48
>>> [69943.556796] Call Trace:
>>> [69943.562465] [] lock_acquire+0x58/0x70
>>> [69943.568354] [] ? finish_wait+0x3c/0x70
>>> [69943.574306] [] _raw_spin_lock_irqsave+0x42/0x80
>>> [69943.580335] [] ? finish_wait+0x3c/0x70
>>> [69943.586237] [] finish_wait+0x3c/0x70
>>> [69943.591992] [] shmem_fault+0x167/0x1b0
>>> [69943.597807] [] ? prepare_to_wait_event+0x100/0x100
>>> [69943.603741] [] __do_fault+0x6d/0x1b0
>>> [69943.609743] [] handle_mm_fault+0xc58/0x1170
>>> [69943.615822] [] ? handle_mm_fault+0x43/0x1170
>>> [69943.621971] [] __do_page_fault+0x172/0x4e0
>>> [69943.628184] [] do_page_fault+0x20/0x70
>>> [69943.634449] [] ? debug_smp_processor_id+0x17/0x20
>>> [69943.640784] [] page_fault+0x1f/0x30
>>> [69943.647170] [] ? strncpy_from_user+0x5c/0x170
>>> [69943.653480] [] ? strncpy_from_user+0x46/0x170
>>> [69943.659632] [] setxattr+0x57/0x170
>>> [69943.665846] [] ? debug_smp_processor_id+0x17/0x20
>>> [69943.672172] [] ? get_lock_stats+0x19/0x50
>>> [69943.678558] [] ? sched_clock_cpu+0xb6/0xd0
>>> [69943.685007] [] ? __lock_acquire.isra.32+0x1cf/0x8c0
>>> [69943.691542] [] ? __this_cpu_preempt_check+0x13/0x20
>>> [69943.698130] [] ? preempt_count_add+0x7c/0xc0
>>> [69943.704791] [] ? __mnt_want_write+0x61/0x90
>>> [69943.711519] [] SyS_fsetxattr+0x78/0xa0
>>> [69943.718300] [] do_syscall_64+0x5c/0x170
>>> [69943.724949] [] entry_SYSCALL64_slow_path+0x25/0x25
>>> [69943.731521] Code:
>>> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85
>>
>> That's lock incl 0x198(%rax). I think this is:
>>
>> atomic_inc((atomic_t *)>ops);
>>
>> I suppose this could be stack corruption at work, but after a fair
>> amount of staring, I still haven't found anything in the vmap_stack
>> code that would cause stack corruption.
>
> Well, it is intriguing that what faults is this:
>
>     finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
>
> where 'shmem_fault_wait' is an on-stack wait queue. So it really looks
> very much like stack corruption.
>
> What strikes me is that "finish_wait()" does this optimistic "has my
> entry been removed" check without holding the waitqueue lock (and uses
> list_empty_careful() to make sure it does that "safely").
>
> It has that big comment too:
>
>     /*
>      * shmem_falloc_waitq points into the shmem_fallocate()
>      * stack of the hole-punching task: shmem_falloc_waitq
>      * is usually invalid by the time we reach here, but
>      * finish_wait() does not dereference it in that case;
>      * though i_lock needed lest racing with wake_up_all().
>      */
>
> the stack it comes from is the wait queue head from shmem_fallocate(),
> which will do "wake_up_all()" under the inode lock.

Here's my theory: I think you're
Re: bio linked list corruption.
On 10/24/2016 05:50 PM, Linus Torvalds wrote:
> On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds wrote:
>> The vmalloc/vfree code itself is a bit scary. In particular, we have
>> a rather insane model of TLB flushing. We leave the virtual area on a
>> lazy purge-list, and we delay flushing the TLB and actually freeing
>> the virtual memory for it so that we can batch things up.
>
> Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in
> vmap_debug_free_range() should have forced a synchronous TLB flush for
> vmalloc ranges too, so that doesn't explain it either.

My big fear here is that we're just triggering an old stack corruption more reliably. I wonder if we can make it happen much faster by restricting the stacks to a static list and cycling through them in LIFO fashion?

-chris
Scrubbing Errors after restoring backup
Hello

For reference please see this post.
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg58461.html
Please note that I downgraded to btrfs-progs 4.6.1 as advised.

After exchanging the malfunctioning drive I re-created the filesystem and restored the backup from my NAS. (I didn't entirely trust the filesystem after so many errors) On completing the restoration I manually started scrubbing, which ended with hundreds of checksum and read errors on /dev/sda. The drive checks out fine in smart and passed through all scheduled SMART Self-Tests. The model is not identical to the two drives recently added to the system - the new drives are WD Blue, the four original ones are WD Greens.

I have reset the output from btrfs dev stats and restarted the scrubbing process. I am unsure how to interpret or explain the errors of the last scrub run. I scrubbed regularly each month for nearly three years and never had any errors. I would be grateful for any advice on how to proceed.

Yours sincerely
Stefan
Re: bio linked list corruption.
On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds wrote:
>
> The vmalloc/vfree code itself is a bit scary. In particular, we have a
> rather insane model of TLB flushing. We leave the virtual area on a
> lazy purge-list, and we delay flushing the TLB and actually freeing
> the virtual memory for it so that we can batch things up.

Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in vmap_debug_free_range() should have forced a synchronous TLB flush for vmalloc ranges too, so that doesn't explain it either.

Linus
Re: bio linked list corruption.
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds wrote:
>
> So this is all some really subtle code, but I'm not seeing that it
> would be wrong.

Ahh... Except maybe..

The vmalloc/vfree code itself is a bit scary. In particular, we have a rather insane model of TLB flushing. We leave the virtual area on a lazy purge-list, and we delay flushing the TLB and actually freeing the virtual memory for it so that we can batch things up.

But we've free'd the physical pages that are *mapped* by that area when we do the vfree(). So there can be stale TLB entries that point to pages that have gotten re-used. They shouldn't matter, because nothing should be writing to those pages, but it strikes me that this may also be hurting the DEBUG_PAGEALLOC thing. Maybe we're not getting the page faults that we *should* be getting, and there are hidden reads and writes to those pages that already got free'd.

There was some nasty reason why we did that insane thing. I think it was just that there are a few high-frequency vmalloc/vfree users and the TLB flushing was killing some performance. But it does strike me that we are playing very fast and loose with the TLB on the vmalloc area.

So maybe all the new VMAP code is fine, and it's really vmalloc/vfree that has been subtly broken but nobody has ever cared before?

Linus
Re: bio linked list corruption.
On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski wrote: >> >> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > This is an unhandled kernel page fault. The string "Oops" is so helpful :-/ I think there was a line above it that DaveJ just didn't include. > >> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted >> 4.9.0-rc1-think+ #11 >> [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000 >> [69943.468077] RIP: 0010:[] >> [69943.472704] [] __lock_acquire.isra.32+0x6b/0x8c0 >> [69943.477489] RSP: 0018:c9000b10b9e8 EFLAGS: 00010086 >> [69943.482368] RAX: 81789b90 RBX: 8804f8dd3740 RCX: >> >> [69943.487410] RDX: RSI: RDI: >> >> [69943.492515] RBP: c9000b10ba18 R08: 0001 R09: >> >> [69943.497666] R10: 0001 R11: 3f9cfa7f4e73 R12: >> >> [69943.502880] R13: R14: c9000af7bd48 R15: >> 8804f8dd3740 >> [69943.508163] FS: 7f64904a2b40() GS:880507a0() >> knlGS: >> [69943.513591] CS: 0010 DS: ES: CR0: 80050033 >> [69943.518917] CR2: 81789d28 CR3: 0004a8f16000 CR4: >> 001406e0 >> [69943.524253] DR0: 7f5b97fd4000 DR1: DR2: >> >> [69943.529488] DR3: DR6: 0ff0 DR7: >> 0600 >> [69943.534771] Stack: >> [69943.540023] 880507bd74c0 >> [69943.545317] 8804f8dd3740 0046 >> 0286[69943.545456] c9000af7bd08 >> [69943.550930] 0100 c9000b10ba50 >> 810c4b68[69943.551069] 810ba40c >> [69943.556657] 8804 >> c9000af7bd48[69943.556796] Call Trace: >> [69943.562465] [] lock_acquire+0x58/0x70 >> [69943.568354] [] ? finish_wait+0x3c/0x70 >> [69943.574306] [] _raw_spin_lock_irqsave+0x42/0x80 >> [69943.580335] [] ? finish_wait+0x3c/0x70 >> [69943.586237] [] finish_wait+0x3c/0x70 >> [69943.591992] [] shmem_fault+0x167/0x1b0 >> [69943.597807] [] ? prepare_to_wait_event+0x100/0x100 >> [69943.603741] [] __do_fault+0x6d/0x1b0 >> [69943.609743] [] handle_mm_fault+0xc58/0x1170 >> [69943.615822] [] ? handle_mm_fault+0x43/0x1170 >> [69943.621971] [] __do_page_fault+0x172/0x4e0 >> [69943.628184] [] do_page_fault+0x20/0x70 >> [69943.634449] [] ?
debug_smp_processor_id+0x17/0x20 >> [69943.640784] [] page_fault+0x1f/0x30 >> [69943.647170] [] ? strncpy_from_user+0x5c/0x170 >> [69943.653480] [] ? strncpy_from_user+0x46/0x170 >> [69943.659632] [] setxattr+0x57/0x170 >> [69943.665846] [] ? debug_smp_processor_id+0x17/0x20 >> [69943.672172] [] ? get_lock_stats+0x19/0x50 >> [69943.678558] [] ? sched_clock_cpu+0xb6/0xd0 >> [69943.685007] [] ? __lock_acquire.isra.32+0x1cf/0x8c0 >> [69943.691542] [] ? __this_cpu_preempt_check+0x13/0x20 >> [69943.698130] [] ? preempt_count_add+0x7c/0xc0 >> [69943.704791] [] ? __mnt_want_write+0x61/0x90 >> [69943.711519] [] SyS_fsetxattr+0x78/0xa0 >> [69943.718300] [] do_syscall_64+0x5c/0x170 >> [69943.724949] [] entry_SYSCALL64_slow_path+0x25/0x25 >> [69943.731521] Code: >> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 >> 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 ff >> 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85 > > That's lock incl 0x198(%rax). I think this is: > > atomic_inc((atomic_t *)&class->ops); > > I suppose this could be stack corruption at work, but after a fair > amount of staring, I still haven't found anything in the vmap_stack > code that would cause stack corruption. Well, it is intriguing that what faults is this: finish_wait(shmem_falloc_waitq, &shmem_fault_wait); where 'shmem_fault_wait' is an on-stack wait queue. So it really looks very much like stack corruption. What strikes me is that "finish_wait()" does this optimistic "has my entry been removed" check without holding the waitqueue lock (and uses list_empty_careful() to make sure it does that "safely"). It has that big comment too: /* * shmem_falloc_waitq points into the shmem_fallocate() * stack of the hole-punching task: shmem_falloc_waitq * is usually invalid by the time we reach here, but * finish_wait() does not dereference it in that case; * though i_lock needed lest racing with wake_up_all().
*/ The stack it comes from is the wait queue head from shmem_fallocate(), which will do "wake_up_all()" under the inode lock. On the face of it, the inode lock should make that safe and serialize everything. And yes, finish_wait() does not touch the unsafe stuff if the wait-queue (on the local stack) is empty, which wake_up_all() *should* have guaranteed. It's just a regular
[PATCH 4/5] writeback: introduce super_operations->write_metadata
Now that we have metadata counters in the VM, we need to provide a way to kick writeback on dirty metadata. Introduce super_operations->write_metadata. This allows file systems to deal with writing back any dirty metadata we need based on the writeback needs of the system. Since there is no inode to key off of we need a list in the bdi for dirty super blocks to be added. From there we can find any dirty sb's on the bdi we are currently doing writeback on and call into their ->write_metadata callback. Signed-off-by: Josef Bacik Reviewed-by: Jan Kara --- fs/fs-writeback.c | 72 fs/super.c | 7 include/linux/backing-dev-defs.h | 2 ++ include/linux/fs.h | 4 +++ mm/backing-dev.c | 2 ++ 5 files changed, 81 insertions(+), 6 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index a5cb1dd..8e392b2 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1465,6 +1465,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb, return pages; } +static long writeback_sb_metadata(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + struct writeback_control wbc = { + .sync_mode = work->sync_mode, + .tagged_writepages = work->tagged_writepages, + .for_kupdate = work->for_kupdate, + .for_background = work->for_background, + .for_sync = work->for_sync, + .range_cyclic = work->range_cyclic, + .range_start = 0, + .range_end = LLONG_MAX, + }; + long write_chunk; + + write_chunk = writeback_chunk_size(wb, work); + wbc.nr_to_write = write_chunk; + sb->s_op->write_metadata(sb, &wbc); + work->nr_pages -= write_chunk - wbc.nr_to_write; + + return write_chunk - wbc.nr_to_write; +} + + /* * Write a portion of b_io inodes which belong to @sb.
* @@ -1491,6 +1516,7 @@ static long writeback_sb_inodes(struct super_block *sb, unsigned long start_time = jiffies; long write_chunk; long wrote = 0; /* count both pages and inodes */ + bool done = false; while (!list_empty(&wb->b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -1607,12 +1633,18 @@ static long writeback_sb_inodes(struct super_block *sb, * background threshold and other termination conditions. */ if (wrote) { - if (time_is_before_jiffies(start_time + HZ / 10UL)) - break; - if (work->nr_pages <= 0) + if (time_is_before_jiffies(start_time + HZ / 10UL) || + work->nr_pages <= 0) { + done = true; break; + } } } + if (!done && sb->s_op->write_metadata) { + spin_unlock(&wb->list_lock); + wrote += writeback_sb_metadata(sb, wb, work); + spin_lock(&wb->list_lock); + } return wrote; } @@ -1621,6 +1653,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, { unsigned long start_time = jiffies; long wrote = 0; + bool done = false; while (!list_empty(&wb->b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -1640,12 +1673,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, /* refer to the same tests at the end of writeback_sb_inodes */ if (wrote) { - if (time_is_before_jiffies(start_time + HZ / 10UL)) - break; - if (work->nr_pages <= 0) + if (time_is_before_jiffies(start_time + HZ / 10UL) || + work->nr_pages <= 0) { + done = true; break; + } } } + + if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) { + LIST_HEAD(list); + + spin_unlock(&wb->list_lock); + spin_lock(&wb->bdi->sb_list_lock); + list_splice_init(&wb->bdi->dirty_sb_list, &list); + while (!list_empty(&list)) { + struct super_block *sb; + + sb = list_first_entry(&list, struct super_block, + s_bdi_dirty_list); + list_move_tail(&sb->s_bdi_dirty_list, + &wb->bdi->dirty_sb_list); + if (!sb->s_op->write_metadata) + continue; +
[PATCH 1/5] remove mapping from balance_dirty_pages*()
The only reason we pass in the mapping is to get the inode in order to see if writeback cgroups is enabled, and even then it only checks the bdi and a super block flag. balance_dirty_pages() doesn't even use the mapping. Since balance_dirty_pages*() works on a bdi level, just pass in the bdi and super block directly so we can avoid using mapping. This will allow us to still use balance_dirty_pages for dirty metadata pages that are not backed by an address_mapping. Signed-off-by: Josef Bacik Reviewed-by: Jan Kara --- drivers/mtd/devices/block2mtd.c | 12 fs/btrfs/disk-io.c | 4 ++-- fs/btrfs/file.c | 3 ++- fs/btrfs/ioctl.c | 3 ++- fs/btrfs/relocation.c | 3 ++- fs/buffer.c | 3 ++- fs/iomap.c | 3 ++- fs/ntfs/attrib.c | 10 +++--- fs/ntfs/file.c | 4 ++-- include/linux/backing-dev.h | 29 +++-- include/linux/writeback.h | 3 ++- mm/filemap.c | 4 +++- mm/memory.c | 9 +++-- mm/page-writeback.c | 15 +++ 14 files changed, 71 insertions(+), 34 deletions(-) diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c index 7c887f1..7892d0b 100644 --- a/drivers/mtd/devices/block2mtd.c +++ b/drivers/mtd/devices/block2mtd.c @@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index) /* erase a specified part of the device */ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) { - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; struct page *page; int index = to >> PAGE_SHIFT; // page index int pages = len >> PAGE_SHIFT; @@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) memset(page_address(page), 0xff, PAGE_SIZE); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); break; } @@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const
u_char *buf, loff_t to, size_t len, size_t *retlen) { struct page *page; - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; int index = to >> PAGE_SHIFT; // page index int offset = to & ~PAGE_MASK; // page offset int cpylen; @@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf, memcpy(page_address(page) + offset, buf, cpylen); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); } put_page(page); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index e720d3e..fa07e81 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4085,8 +4085,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_root *root, ret = percpu_counter_compare(&root->fs_info->dirty_metadata_bytes, BTRFS_DIRTY_METADATA_THRESH); if (ret > 0) { - balance_dirty_pages_ratelimited( - root->fs_info->btree_inode->i_mapping); + balance_dirty_pages_ratelimited(&root->fs_info->bdi, + root->fs_info->sb); } } diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 72a180d..cbefdc8 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1711,7 +1711,8 @@ again: cond_resched(); - balance_dirty_pages_ratelimited(inode->i_mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); if (dirty_pages < (root->nodesize >> PAGE_SHIFT) + 1) btrfs_btree_balance_dirty(root); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index af69129..f68692a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1410,7 +1410,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, } defrag_count +=
[PATCH 0/5] Support for metadata specific accounting
(Dave again I apologize, for some reason our email server hates you and so I didn't get your previous responses again, and didn't notice until I was looking at the patchwork history for my previous submissions, so I'll try and answer all your questions here, and then I'll be much more judicious about making sure I'm getting your emails.)

This is my current batch of patches to add some better infrastructure for handling metadata related memory. There's been a few changes to these patches since the last go around but the approach is essentially the same. Changes since the last go around:

- WB_WRITTEN/WB_DIRTIED are now being counted in bytes. This is necessary to deal with metadata that is smaller than page size. Currently btrfs only does things in pagesize increments, but with these patches it will allow us to easily support sub-pagesize blocksizes, so we need these counters to be in bytes.
- Added a METADATA_BYTES counter to account for the total number of bytes used for metadata. We need this because the VM adjusts the dirty ratios based on how much memory it thinks we can dirty, which is just the amount of free memory plus the pagecache.
- Patch 5, which is a doozy, and I will explain below.

This is a more complete approach to keeping track of memory that is in use for metadata. Previously Btrfs had been using a special inode to allocate all of our metadata pages out of, so if we had a heavy metadata workload we still benefited from the balance_dirty_pages() throttling which kept everything sane. We want to kill this in order to make it easier to support pagesize > blocksize in btrfs, in much the same way XFS does this support.

One concern that was brought up was the global accounting vs one bad actor. We still need the global accounting so we can enforce the global dirty limits, but we have the per-bdi accounting as well to make sure we are punishing the bad actor and not all the other file systems that happened to be hooked into this infrastructure.
The *REFERENCED patch is to fix a behavior I noticed when testing these patches on a more memory-constrained system than I had previously used. Now that the eviction of metadata pages is controlled solely through the shrinker interface, I was seeing a big perf drop when we had to start doing memory reclaim. This was because most of the RAM on the box (10gig of 16gig) was used solely by icache and dcache. Because this workload was just creating a file and never touching it again, most of this cache was easily evictable. However, because we by default send every object in the icache/dcache through the LRU twice, it would take around 10,000 iterations through the shrinker before it would start to evict things (there were ~10 million objects between the two caches). The pagecache solves this by putting everything on the INACTIVE list first, and then there's a two-part activation phase before it's moved to the active list, so you have to really touch a page a lot before it is considered active. However we assume active first, which is very latency inducing when we want to start reclaiming objects. Making this change keeps us in line with what pagecache does and eliminates the performance drop I was seeing.

Let me know what you think. I've tried to keep this as minimal and sane as possible so we can build on it in the future as we want to handle more nuanced scenarios. Also if we think this is a bad approach I won't feel as bad throwing it out ;).

Thanks,

Josef
Re: bio linked list corruption.
On Sun, Oct 23, 2016 at 9:40 PM, Dave Jones wrote: > On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > > > It could be worth trying this, too: > > > > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack=174531fef4e8 > > > > > > > > > > It occurred to me that the current code is a little bit fragile. > > > > > > > > It's been nearly 24hrs with the above changes, and it's been pretty much > > > > silent the whole time. > > > > > > > > The only thing of note over that time period has been a btrfs lockdep > > > > warning that's been around for a while, and occasional btrfs checksum > > > > failures, which I've been seeing for a while, but seem to have gotten > > > > worse since 4.8. > > > > > > > > I'm pretty confident in the disk being ok in this machine, so I think > > > > the checksum warnings are bogus. Chris suggested they may be the result > > > > of memory corruption, but there's little else going on. > > > > > > The only interesting thing from last night's run was this.. > > > > > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70 > > > page:ea00138adc00 count:0 mapcount:0 mapping:88046e9fc2e0 index:0xdf0 > > > flags: 0x400c(referenced|uptodate) > > > page dumped because: non-NULL mapping > > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 > > > Workqueue: writeback wb_workfn (flush-btrfs-2) > > > > Well crud, we're back to wondering if this is Btrfs or the stack > > corruption. Since the pagevecs are on the stack and this is a new > > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we > > should make sure. > > Here's an interesting one from today, pointing the finger at xattrs again. > > > [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC This is an unhandled kernel page fault.
The string "Oops" is so helpful :-/ > [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted > 4.9.0-rc1-think+ #11 > [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000 > [69943.468077] RIP: 0010:[] > [69943.472704] [] __lock_acquire.isra.32+0x6b/0x8c0 > [69943.477489] RSP: 0018:c9000b10b9e8 EFLAGS: 00010086 > [69943.482368] RAX: 81789b90 RBX: 8804f8dd3740 RCX: > > [69943.487410] RDX: RSI: RDI: > > [69943.492515] RBP: c9000b10ba18 R08: 0001 R09: > > [69943.497666] R10: 0001 R11: 3f9cfa7f4e73 R12: > > [69943.502880] R13: R14: c9000af7bd48 R15: > 8804f8dd3740 > [69943.508163] FS: 7f64904a2b40() GS:880507a0() > knlGS: > [69943.513591] CS: 0010 DS: ES: CR0: 80050033 > [69943.518917] CR2: 81789d28 CR3: 0004a8f16000 CR4: > 001406e0 > [69943.524253] DR0: 7f5b97fd4000 DR1: DR2: > > [69943.529488] DR3: DR6: 0ff0 DR7: > 0600 > [69943.534771] Stack: > [69943.540023] 880507bd74c0 > [69943.545317] 8804f8dd3740 0046 > 0286[69943.545456] c9000af7bd08 > [69943.550930] 0100 c9000b10ba50 > 810c4b68[69943.551069] 810ba40c > [69943.556657] 8804 > c9000af7bd48[69943.556796] Call Trace: > [69943.562465] [] lock_acquire+0x58/0x70 > [69943.568354] [] ? finish_wait+0x3c/0x70 > [69943.574306] [] _raw_spin_lock_irqsave+0x42/0x80 > [69943.580335] [] ? finish_wait+0x3c/0x70 > [69943.586237] [] finish_wait+0x3c/0x70 > [69943.591992] [] shmem_fault+0x167/0x1b0 > [69943.597807] [] ? prepare_to_wait_event+0x100/0x100 > [69943.603741] [] __do_fault+0x6d/0x1b0 > [69943.609743] [] handle_mm_fault+0xc58/0x1170 > [69943.615822] [] ? handle_mm_fault+0x43/0x1170 > [69943.621971] [] __do_page_fault+0x172/0x4e0 > [69943.628184] [] do_page_fault+0x20/0x70 > [69943.634449] [] ? debug_smp_processor_id+0x17/0x20 > [69943.640784] [] page_fault+0x1f/0x30 > [69943.647170] [] ? strncpy_from_user+0x5c/0x170 > [69943.653480] [] ? strncpy_from_user+0x46/0x170 > [69943.659632] [] setxattr+0x57/0x170 > [69943.665846] [] ? debug_smp_processor_id+0x17/0x20 > [69943.672172] [] ? 
get_lock_stats+0x19/0x50 > [69943.678558] [] ? sched_clock_cpu+0xb6/0xd0 > [69943.685007] [] ? __lock_acquire.isra.32+0x1cf/0x8c0 > [69943.691542] [] ? __this_cpu_preempt_check+0x13/0x20 > [69943.698130] [] ? preempt_count_add+0x7c/0xc0 > [69943.704791] [] ? __mnt_want_write+0x61/0x90 > [69943.711519] [] SyS_fsetxattr+0x78/0xa0 > [69943.718300] [] do_syscall_64+0x5c/0x170 > [69943.724949] [] entry_SYSCALL64_slow_path+0x25/0x25 > [69943.731521] Code: >
Re: [PATCH] btrfs: improve delayed refs iterations
On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote: > This issue was found when I tried to delete a heavily reflinked file, > when deleting such files, other transaction operation will not have a > chance to make progress, for example, start_transaction() will blocked > in wait_current_trans(root) for long time, sometimes it even triggers > soft lockups, and the time taken to delete such heavily reflinked file > is also very large, often hundreds of seconds. Using perf top, it reports > that: > > PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], > (all, 4 CPUs) > --- > 84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80 > 11.02% [kernel] [k] delay_tsc > 0.79% [kernel] [k] _raw_spin_unlock_irq > 0.78% [kernel] [k] _raw_spin_unlock_irqrestore > 0.45% [kernel] [k] do_raw_spin_lock > 0.18% [kernel] [k] __slab_alloc > It seems __btrfs_run_delayed_refs() took most cpu time, after some debug > work, I found it's select_delayed_ref() causing this issue, for a delayed > head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but > select_delayed_ref() will firstly try to iterate node list to find > BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and > wastes much time. > > To fix this issue, we introduce a new ref_add_list in struct > btrfs_delayed_ref_head, > then in select_delayed_ref(), if this list is not empty, we can directly use > nodes in this list. With this patch, it just took about 10~15 seconds to > delete the same file. Now using perf top, it reports that: > > PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], > (all, 4 CPUs) > > > 20.74% [kernel] [k] _raw_spin_unlock_irqrestore > 16.33% [kernel] [k] __slab_alloc > 5.41% [kernel] [k] lock_acquired > 4.42% [kernel] [k] lock_acquire > 4.05% [kernel] [k] lock_release > 3.37% [kernel] [k] _raw_spin_unlock_irq > > For normal files, this patch also gives help, at least we do not need to > iterate the whole list to find BTRFS_ADD_DELAYED_REF nodes.
> > Signed-off-by: Wang Xiaoguang> --- > fs/btrfs/delayed-ref.c | 14 ++ > fs/btrfs/delayed-ref.h | 8 > fs/btrfs/disk-io.c | 2 ++ > fs/btrfs/extent-tree.c | 15 +-- > 4 files changed, 33 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index 8d93854..39c28e0 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -189,6 +189,8 @@ static inline void drop_delayed_ref(struct > btrfs_trans_handle *trans, > } else { > assert_spin_locked(&head->lock); > list_del(&ref->list); > + if (!list_empty(&ref->add_list)) > + list_del(&ref->add_list); > } > ref->in_tree = 0; > btrfs_put_delayed_ref(ref); > @@ -431,6 +433,11 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle > *trans, > exist->action = ref->action; > mod = -exist->ref_mod; > exist->ref_mod = ref->ref_mod; > + if (ref->action == BTRFS_ADD_DELAYED_REF) > + list_add_tail(&exist->add_list, > + &href->ref_add_list); > + else if (!list_empty(&exist->add_list)) > + list_del(&exist->add_list); ->action is either BTRFS_ADD_DELAYED_REF or BTRFS_DROP_DELAYED_REF, so in the 'else' section, (!list_empty(&exist->add_list)) is true indeed.
> } else > mod = -ref->ref_mod; > } > @@ -444,6 +451,8 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle > *trans, > > add_tail: > list_add_tail(&ref->list, &href->ref_list); > + if (ref->action == BTRFS_ADD_DELAYED_REF) > + list_add_tail(&ref->add_list, &href->ref_add_list); > atomic_inc(&root->num_entries); > trans->delayed_ref_updates++; > spin_unlock(&href->lock); > @@ -590,6 +599,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, > head_ref->must_insert_reserved = must_insert_reserved; > head_ref->is_data = is_data; > INIT_LIST_HEAD(&head_ref->ref_list); > + INIT_LIST_HEAD(&head_ref->ref_add_list); > head_ref->processing = 0; > head_ref->total_ref_mod = count_mod; > head_ref->qgroup_reserved = 0; > @@ -671,6 +681,8 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info, > ref->is_head = 0; > ref->in_tree = 1; > ref->seq = seq; > + INIT_LIST_HEAD(&ref->list); > + INIT_LIST_HEAD(&ref->add_list); > > full_ref = btrfs_delayed_node_to_tree_ref(ref); > full_ref->parent = parent; > @@ -726,6 +738,8 @@
Re: Btrfs: fix free space tree bitmaps on big-endian systems
Hi Omar, On Mon, Oct 24, 2016 at 5:16 PM, Omar Sandoval wrote: > On Mon, Oct 24, 2016 at 09:23:04AM +0200, Geert Uytterhoeven wrote: >> On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List >> wrote: >> > Web: >> > https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc >> > Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc >> >> 520f16abf003952d in v4.7.10 >> 1ff6341b5d92dd6b in v4.8.4 >> >> > Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed >> > Refname: refs/heads/master >> > Author: Omar Sandoval >> > AuthorDate: Thu Sep 22 17:24:20 2016 -0700 >> > Committer: David Sterba >> > CommitDate: Mon Oct 3 18:52:14 2016 +0200 >> > >> > Btrfs: fix free space tree bitmaps on big-endian systems >> > >> > In convert_free_space_to_{bitmaps,extents}(), we buffer the free space >> > bitmaps in memory and copy them directly to/from the extent buffers >> > with >> > {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte >> > granularity, which is equivalent to a little-endian bitmap. This means >> > that on big-endian systems, the in-memory bitmaps will be written to >> > disk byte-swapped. To fix this, use byte-granularity for the bitmaps in >> > memory. >> >> This change looks overly complex to me, and decreases performance. > > Do you have evidence of that? Sure, it can in theory, but the change Nope, just reading the code. > only affects the free space tree format conversion, which is a rare > operation. In fact, I actually benchmarked the change with [1] and saw > no noticeable difference on x86-64. In any case, now that the on-disk > format problem is fixed, we can improve the implementation. Good to hear that.
>> > Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree") >> > Cc: sta...@vger.kernel.org # 4.5+ >> > Tested-by: Holger Hoffstätte >> > Tested-by: Chandan Rajendra >> > Signed-off-by: Omar Sandoval >> > Signed-off-by: David Sterba >> > --- >> > fs/btrfs/extent_io.c | 64 >> > +- >> > fs/btrfs/extent_io.h | 22 >> > fs/btrfs/free-space-tree.c | 17 ++-- >> > 3 files changed, 76 insertions(+), 27 deletions(-) >> > >> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> > index 44fe66b..c3ec30d 100644 >> > --- a/fs/btrfs/extent_io.c >> > +++ b/fs/btrfs/extent_io.c >> > @@ -5524,17 +5524,45 @@ void copy_extent_buffer(struct extent_buffer *dst, >> > struct extent_buffer *src, >> > } >> > } >> > >> > -/* >> > - * The extent buffer bitmap operations are done with byte granularity >> > because >> > - * bitmap items are not guaranteed to be aligned to a word and therefore a >> > - * single word in a bitmap may straddle two pages in the extent buffer. >> > - */ >> > -#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE) >> > -#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1) >> > -#define BITMAP_FIRST_BYTE_MASK(start) \ >> > - ((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK) >> > -#define BITMAP_LAST_BYTE_MASK(nbits) \ >> > - (BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1))) >> > +void le_bitmap_set(u8 *map, unsigned int start, int len) >> > +{ >> > + u8 *p = map + BIT_BYTE(start); >> >> You cannot use cpu_to_le32/cpu_to_le64 on the masks and operate on >> unsigned longs in memory because there's no alignment guarantee, right? > > That's right. > >> > + const unsigned int size = start + len; >> > + int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); >> > + u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); >> > + >> > + while (len - bits_to_set >= 0) { >> > + *p |= mask_to_set; >> > + len -= bits_to_set; >> > + bits_to_set = BITS_PER_BYTE; >> > + mask_to_set = ~(u8)0; >> > + p++; >> > + } >> >> memset() for all but the first partial byte (if present)? 
> > Shrug, the original bitmap_set() helper doesn't do this. For our use > case, we're not expecting to do big spans with this since our free space > must be pretty fragmented for us to be using this in the first place. The original bitmap_set() helper doesn't do byte writes, but 32/64-bit writes, which is much closer to memset() from a performance point of view. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 1/2] btrfs-progs: fsck: Add support to clear v1 free space cache.
On Thu, Oct 20, 2016 at 09:04:23AM +0800, Qu Wenruo wrote: > >> + /* Clear all free space cache inodes and its extent data */ > >> + while (1) { > >> + bg_cache = btrfs_lookup_first_block_group(fs_info, current); > >> + if (!bg_cache) > >> + break; > >> + ret = btrfs_clear_free_space_cache(fs_info, bg_cache); > > > > The function can fail for a lot of reasons, what would be the filesystem > > state when we exit here? Some of the inodes could be cleared completely, > > the last one partially. The function copes with a missing inode item > > but I don't know how many other intermediate states could be left. > > If we exit here, no damage for the filesystem will be done, since we are > protected by transaction. > > As you can find, in btrfs_clear_free_space_cache(), > it will only commit transaction if we fully cleaned the whole inode and > its free space header. Ah I see, each blockgroup is killed inside a transaction. Then there's one more to invalidate the super block cache generation. > So we're quite safe here, free space header and inode are cleaned > atomically. > > PS: We really need btrfs_abort_transaction(), or when we exit abnormally > we will get a lot of backtrace/warning on uncommitted transaction. Yeah, the userspace is lagging behind the kernel code in the transaction error handling. Patches welcome, this might be a big change all over the place, so I suggest to do it in smaller batches.
Re: [PATCH v9.1 0/5] In-band de-duplication for btrfs-progs
On Thu, Oct 20, 2016 at 10:06:59AM +0800, Qu Wenruo wrote: > Patchset can be fetched from github: > https://github.com/adam900710/btrfs-progs.git dedupe_latest Branch added to progs integration, for preview and testing.
Re: [PATCH v13.1 00/14] Btrfs In-band De-duplication
On Thu, Oct 20, 2016 at 10:03:39AM +0800, Qu Wenruo wrote: > Qu Wenruo (4): > btrfs: delayed-ref: Add support for increasing data ref under spinlock > btrfs: dedupe: Inband in-memory only de-duplication implement > btrfs: relocation: Enhance error handling to avoid BUG_ON > btrfs: dedupe: Introduce new reconfigure ioctl > > Wang Xiaoguang (10): > btrfs: dedupe: Introduce dedupe framework and its header > btrfs: dedupe: Introduce function to initialize dedupe info > btrfs: dedupe: Introduce function to add hash into in-memory tree > btrfs: dedupe: Introduce function to remove hash from in-memory tree > btrfs: dedupe: Introduce function to search for an existing hash > btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface > btrfs: ordered-extent: Add support for dedupe > btrfs: dedupe: Add ioctl for inband deduplication > btrfs: improve inode's outstanding_extents computation I've noticed that this patch is different from what Wang sent independently. Is the patch necessary for the dedupe patchset? If not, please drop it. The merge conflict is in the checks for the free space inode, if (root == root->fs_info->tree_root) vs if (!btrfs_is_free_space_inode(...)) If you need the patch, then please update it to the latest version. I won't get the merge conflicts at least; this is acceptable for the testing 'for-next' branch.
Re: [PATCH] btrfs: improve delayed refs iterations
On 10/24/16 18:46, David Sterba wrote: > On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote: >> This issue was found when I tried to delete a heavily reflinked file, >> when deleting such files, other transaction operation will not have a >> chance to make progress, for example, start_transaction() will blocked >> in wait_current_trans(root) for long time, sometimes it even triggers >> soft lockups, and the time taken to delete such heavily reflinked file >> is also very large, often hundreds of seconds. Using perf top, it reports >> that: > >> [...] With this patch, it just took about 10~15 seconds to >> delete the same file. > > Great improvement! Patch looks good on a quick skim so I'll add it to > next, but proper review is still required.

If it helps, I've been running it for ~2 days now with no negative side effects, mostly rsync creating & deleting files with various levels of reflinking (via snapshots). No problems at all. Also tried to manually create & delete a large file with heavy CoW and hundreds of reflinked copies - no problem either and pretty fast. So..

Tested-by: Holger Hoffstätte

cheers, Holger
Re: [PATCH] btrfs-progs: check: release path after usage
On Mon, Oct 24, 2016 at 10:18:14AM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > While performing an fsck, an assertion failure occurs because of reusing path > in a loop. > ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` > failed, value 0 > > Signed-off-by: Goldwyn Rodrigues

Applied, thanks.
Re: [PATCH] btrfs-progs: Fix wrong tree block alignment for unaligned block group
On Mon, Oct 24, 2016 at 03:22:33PM +0800, Qu Wenruo wrote: > Commit 854437ca (btrfs-progs: extent-tree: avoid allocating tree block > that crosses stripe boundary) introduces a check for logical bytenr not > crossing the stripe boundary. > > However that check is not completely correct. > It only checks the logical bytenr and length against the absolute logical > offset. > That's to say, it only checks if a tree block lies in a 64K logical stripe. > > But in fact, it's possible a block group starts at a bytenr unaligned with > 64K, just like the following case. > > Then btrfsck will give a false alert. > > 0 32K 64K 96K 128K 160K ... > |--- Block group A - > |<-TB 32K-->| > |/Scrub stripe unit/| > |WRONG UNIT | > > In that case, the TB (tree block) at bytenr 32K in fact fits into the kernel > scrub stripe unit, > but doesn't fit into the pure logical 64K stripe. > > Fix check_crossing_stripes() to compare bytenr to the block group start, not > to the absolute logical bytenr. > > Reported-by: Jussi Kansanen > Signed-off-by: Qu Wenruo

Applied, thanks.
Re: [PATCH v2] btrfs: change btrfs_csum_final result param type to u8
On Fri, Oct 21, 2016 at 12:47:02PM +0100, Domagoj Tršan wrote: > csum member of struct btrfs_super_block has array type of u8. It makes sense > that function btrfs_csum_final should be also declared to accept u8 *. I > changed the declaration of method void btrfs_csum_final(u32 crc, char > *result); > to void btrfs_csum_final(u32 crc, u8 *result);

Ok that's better, just the signed-off-by line is missing. I would add it myself on behalf of contributors I know, but as a matter of practice, fix it please and resend the patch. Using "git commit -s" adds the line for you. It is a formality, established in linux kernel development. If you need more explanation of what and why, please refer to the following docs. http://elinux.org/Developer_Certificate_Of_Origin https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches
Re: [PATCH] btrfs: improve delayed refs iterations
On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote: > This issue was found when I tried to delete a heavily reflinked file, > when deleting such files, other transaction operation will not have a > chance to make progress, for example, start_transaction() will blocked > in wait_current_trans(root) for long time, sometimes it even triggers > soft lockups, and the time taken to delete such heavily reflinked file > is also very large, often hundreds of seconds. Using perf top, it reports > that: > [...] With this patch, it just took about 10~15 seconds to > delete the same file.

Great improvement! Patch looks good on a quick skim so I'll add it to next, but proper review is still required.
Re: [PATCH v2] btrfs: introduce priority based delalloc shrink mechanism
Hi Josef, are you fine with V2?

On Thu, Oct 13, 2016 at 05:31:25PM +0800, Wang Xiaoguang wrote: > Since commit b02441999efcc6152b87cd58e7970bb7843f76cf, we don't wait all > ordered extents, but I run into some enospc errors when doing large file > create and delete tests, it's because shrink_delalloc() does not write > enough delalloc bytes and wait them finished: > From: Miao Xie > Date: Mon, 4 Nov 2013 23:13:25 +0800 > Subject: [PATCH] Btrfs: don't wait for the completion of all the ordered > extents > > It is very likely that there are lots of ordered extents in the filesystem, > if we wait for the completion of all of them when we want to reclaim some > space for the metadata space reservation, we would be blocked for a long > time. The performance would drop down suddenly for a long time. > > Here we introduce a simple reclaim_priority variable, the lower the > value, the higher the priority, 0 is the maximum priority. The core > idea is: > delalloc_bytes = > percpu_counter_sum_positive(&root->fs_info->delalloc_bytes); > if (reclaim_priority) > to_reclaim = orig * (2 << (BTRFS_DEFAULT_FLUSH_PRIORITY - > reclaim_priority)); > else > to_reclaim = delalloc_bytes; > > Here 'orig' is the number of metadata bytes we want to reserve, and as the priority > increases, we will try to write more delalloc bytes, meanwhile if > "reclaim_priority == 0" returns true, we'll also wait for all current ordered > extents to finish. 
> > Signed-off-by: Wang Xiaoguang > --- > fs/btrfs/extent-tree.c | 63 > ++-- > include/trace/events/btrfs.h | 11 +++- > 2 files changed, 42 insertions(+), 32 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index e08791d..7477c25 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4686,16 +4686,18 @@ static inline int calc_reclaim_items_nr(struct > btrfs_root *root, u64 to_reclaim) > } > > #define EXTENT_SIZE_PER_ITEM SZ_256K > +#define BTRFS_DEFAULT_FLUSH_PRIORITY3 > > /* > * shrink metadata reservation for delalloc > */ > -static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 > orig, > - bool wait_ordered) > +static void shrink_delalloc(struct btrfs_root *root, u64 orig, > + bool wait_ordered, int reclaim_priority) > { > struct btrfs_block_rsv *block_rsv; > struct btrfs_space_info *space_info; > struct btrfs_trans_handle *trans; > + u64 to_reclaim; > u64 delalloc_bytes; > u64 max_reclaim; > long time_left; > @@ -4703,22 +4705,36 @@ static void shrink_delalloc(struct btrfs_root *root, > u64 to_reclaim, u64 orig, > int loops; > int items; > enum btrfs_reserve_flush_enum flush; > + int items_to_wait; > + > + delalloc_bytes = percpu_counter_sum_positive( > + >fs_info->delalloc_bytes); > + if (reclaim_priority < 0) > + reclaim_priority = 0; > + > + if (reclaim_priority) > + to_reclaim = orig * (2 << (BTRFS_DEFAULT_FLUSH_PRIORITY - > + reclaim_priority)); > + else > + to_reclaim = delalloc_bytes; > > /* Calc the number of the pages we need flush for space reservation */ > items = calc_reclaim_items_nr(root, to_reclaim); > to_reclaim = (u64)items * EXTENT_SIZE_PER_ITEM; > + if (reclaim_priority) > + items_to_wait = items; > + else > + items_to_wait = -1; > > trans = (struct btrfs_trans_handle *)current->journal_info; > block_rsv = >fs_info->delalloc_block_rsv; > space_info = block_rsv->space_info; > > - delalloc_bytes = percpu_counter_sum_positive( > - >fs_info->delalloc_bytes); > if 
(delalloc_bytes == 0) { > if (trans) > return; > if (wait_ordered) > - btrfs_wait_ordered_roots(root->fs_info, items, > + btrfs_wait_ordered_roots(root->fs_info, items_to_wait, >0, (u64)-1); > return; > } > @@ -4763,7 +4779,7 @@ static void shrink_delalloc(struct btrfs_root *root, > u64 to_reclaim, u64 orig, > > loops++; > if (wait_ordered && !trans) { > - btrfs_wait_ordered_roots(root->fs_info, items, > + btrfs_wait_ordered_roots(root->fs_info, items_to_wait, >0, (u64)-1); > } else { > time_left = schedule_timeout_killable(1); > @@ -4836,7 +4852,7 @@ struct reserve_ticket { > > static int flush_space(struct btrfs_root *root, > struct btrfs_space_info *space_info, u64 num_bytes, > -
Re: Btrfs: fix free space tree bitmaps on big-endian systems
On Mon, Oct 24, 2016 at 09:23:04AM +0200, Geert Uytterhoeven wrote: > On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List wrote: > > Web: > > https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc > > Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc > > 520f16abf003952d in v4.7.10 > 1ff6341b5d92dd6b in v4.8.4 > > > Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed > > Refname: refs/heads/master > > Author: Omar Sandoval > > AuthorDate: Thu Sep 22 17:24:20 2016 -0700 > > Committer: David Sterba > > CommitDate: Mon Oct 3 18:52:14 2016 +0200 > > > > Btrfs: fix free space tree bitmaps on big-endian systems > > > > In convert_free_space_to_{bitmaps,extents}(), we buffer the free space > > bitmaps in memory and copy them directly to/from the extent buffers with > > {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte > > granularity, which is equivalent to a little-endian bitmap. This means > > that on big-endian systems, the in-memory bitmaps will be written to > > disk byte-swapped. To fix this, use byte-granularity for the bitmaps in > > memory. > > This change looks overly complex to me, and decreases performance.

We had to make it correct, performance is not an issue at the moment.
Re: [PATCH] btrfs-progs: Initialize btrfs_path before use
On 10/24/2016 09:57 AM, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > While performing an fsck, an assertion failure occurs because of reusing path > in a loop. > ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` > failed, value 0 > > Signed-off-by: Goldwyn Rodrigues > > diff --git a/cmds-check.c b/cmds-check.c > index 670ccd1..a6f281c 100644 > --- a/cmds-check.c > +++ b/cmds-check.c > @@ -7541,6 +7541,7 @@ static int record_orphan_data_extents(struct > btrfs_fs_info *fs_info, > key.objectid = dback->owner; > key.type = BTRFS_EXTENT_DATA_KEY; > key.offset = dback->offset; > + btrfs_init_path(path); > > ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0); > /* >

This is an incorrect fix. Please ignore. I shall send the correct fix shortly. It should be release_path() instead.

-- Goldwyn
[PATCH] Release path after usage.
From: Goldwyn Rodrigues

While performing an fsck, an assertion failure occurs because of reusing path in a loop. ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` failed, value 0

Signed-off-by: Goldwyn Rodrigues

diff --git a/cmds-check.c b/cmds-check.c index 670ccd1..5d9c724 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -7543,6 +7543,7 @@ static int record_orphan_data_extents(struct btrfs_fs_info *fs_info, key.offset = dback->offset; ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0); + btrfs_release_path(path); /* * For ret < 0, it's OK since the fs-tree may be corrupted, * we need to record it for inode/file extent rebuild.
Re: Btrfs: fix free space tree bitmaps on big-endian systems
On Mon, Oct 24, 2016 at 09:23:04AM +0200, Geert Uytterhoeven wrote: > On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List wrote: > > Web: > > https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc > > Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc > > 520f16abf003952d in v4.7.10 > 1ff6341b5d92dd6b in v4.8.4 > > > Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed > > Refname: refs/heads/master > > Author: Omar Sandoval > > AuthorDate: Thu Sep 22 17:24:20 2016 -0700 > > Committer: David Sterba > > CommitDate: Mon Oct 3 18:52:14 2016 +0200 > > > > Btrfs: fix free space tree bitmaps on big-endian systems > > > > In convert_free_space_to_{bitmaps,extents}(), we buffer the free space > > bitmaps in memory and copy them directly to/from the extent buffers with > > {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte > > granularity, which is equivalent to a little-endian bitmap. This means > > that on big-endian systems, the in-memory bitmaps will be written to > > disk byte-swapped. To fix this, use byte-granularity for the bitmaps in > > memory. > > This change looks overly complex to me, and decreases performance.

Do you have evidence of that? Sure, it can in theory, but the change only affects the free space tree format conversion, which is a rare operation. In fact, I actually benchmarked the change with [1] and saw no noticeable difference on x86-64. In any case, now that the on-disk format problem is fixed, we can improve the implementation. 
> > > > Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree") > > Cc: sta...@vger.kernel.org # 4.5+ > > Tested-by: Holger Hoffstätte > > Tested-by: Chandan Rajendra > > Signed-off-by: Omar Sandoval > > Signed-off-by: David Sterba > > --- > > fs/btrfs/extent_io.c | 64 > > +- > > fs/btrfs/extent_io.h | 22 > > fs/btrfs/free-space-tree.c | 17 ++-- > > 3 files changed, 76 insertions(+), 27 deletions(-) > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index 44fe66b..c3ec30d 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -5524,17 +5524,45 @@ void copy_extent_buffer(struct extent_buffer *dst, > > struct extent_buffer *src, > > } > > } > > > > -/* > > - * The extent buffer bitmap operations are done with byte granularity > > because > > - * bitmap items are not guaranteed to be aligned to a word and therefore a > > - * single word in a bitmap may straddle two pages in the extent buffer. > > - */ > > -#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE) > > -#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1) > > -#define BITMAP_FIRST_BYTE_MASK(start) \ > > - ((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK) > > -#define BITMAP_LAST_BYTE_MASK(nbits) \ > > - (BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1))) > > +void le_bitmap_set(u8 *map, unsigned int start, int len) > > +{ > > + u8 *p = map + BIT_BYTE(start); > > You cannot use cpu_to_le32/cpu_to_le64 on the masks and operate on > unsigned longs in memory because there's no alignment guarantee, right? That's right. > > + const unsigned int size = start + len; > > + int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE); > > + u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start); > > + > > + while (len - bits_to_set >= 0) { > > + *p |= mask_to_set; > > + len -= bits_to_set; > > + bits_to_set = BITS_PER_BYTE; > > + mask_to_set = ~(u8)0; > > + p++; > > + } > > memset() for all but the first partial byte (if present)? Shrug, the original bitmap_set() helper doesn't do this. 
For our use case, we're not expecting to do big spans with this since our free space must be pretty fragmented for us to be using this in the first place. > > + if (len) { > > + mask_to_set &= BITMAP_LAST_BYTE_MASK(size); > > + *p |= mask_to_set; > > + } > > +} > > + > > +void le_bitmap_clear(u8 *map, unsigned int start, int len) > > +{ > > + u8 *p = map + BIT_BYTE(start); > > + const unsigned int size = start + len; > > + int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE); > > + u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start); > > + > > + while (len - bits_to_clear >= 0) { > > + *p &= ~mask_to_clear; > > + len -= bits_to_clear; > > + bits_to_clear = BITS_PER_BYTE; > > + mask_to_clear = ~(u8)0; > > + p++; > > + } > > memset() for all but the first partial byte (if present)? > > > + if (len) { > > + mask_to_clear &=
Re: [PATCH] btrfs-progs: Initialize btrfs_path before use
On Mon, Oct 24, 2016 at 3:57 PM, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > While performing an fsck, an assertion failure occurs because of reusing path > in a loop. > ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` > failed, value 0 > > Signed-off-by: Goldwyn Rodrigues > > diff --git a/cmds-check.c b/cmds-check.c > index 670ccd1..a6f281c 100644 > --- a/cmds-check.c > +++ b/cmds-check.c > @@ -7541,6 +7541,7 @@ static int record_orphan_data_extents(struct > btrfs_fs_info *fs_info, > key.objectid = dback->owner; > key.type = BTRFS_EXTENT_DATA_KEY; > key.offset = dback->offset; > + btrfs_init_path(path);

Hi Goldwyn,

You want to call btrfs_release_path(), otherwise you leak memory. And it would be better placed before the 'continue' statement.

thanks

> > ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0); > /*

-- Filipe David Manana, "People will forget what you said, people will forget what you did, but people will never forget how you made them feel."
[PATCH] btrfs-progs: Initialize btrfs_path before use
From: Goldwyn Rodrigues

While performing an fsck, an assertion failure occurs because of reusing path in a loop. ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` failed, value 0

Signed-off-by: Goldwyn Rodrigues

diff --git a/cmds-check.c b/cmds-check.c index 670ccd1..a6f281c 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -7541,6 +7541,7 @@ static int record_orphan_data_extents(struct btrfs_fs_info *fs_info, key.objectid = dback->owner; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = dback->offset; + btrfs_init_path(path); ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0); /*
Btrfs progs pre-release 4.8.2-rc1
Hi, a pre-release has been tagged. There are build and misc bugfixes. I'm doing a rc for a minor release as it proved to be useful in the past when dealing with the build errors. Please test. ETA for 4.8.2 is in +2 days. Changes since 4.8.1: * convert: also convert file attributes * check: quota verify fixes, handle reloc tree * build: add stub for FIEMAP_EXTENT_SHARED, compiles on ancient kernels * build: add stub for BUILD_ASSERT when ioctl.h is included * dump-tree: don't crash on unrecognized tree id for -t * tests: * add more ioctl tests * convert: more symlink tests, attribute tests * quota verify for reloc tree * other cleanups Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/ Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git Shortlog: David Sterba (10): btrfs-progs: ioctl: add 32bit compat for SET_RECEIVED_SUBVOL btrfs-progs: ioctl: add 64bit compat for SEND btrfs-progs: tests: make the ioctl-test actually useful btrfs-progs: test: add default ioctl-test build btrfs-progs: mkfs: detect version of running kernel btrfs-progs: build: detect fiemap shared flag but don't fail build btrfs-progs: dump-tree: fix crash on unrecognized tree id btrfs-progs: ioctl: fix build failure if BUILD_ASSERT is not defined btrfs-progs: update CHANGES for v4.8.2 Btrfs progs v4.8.2-rc1 Karel Zak (1): btrfs-progs: rename btrfs_scan_lblkid() to btrfs_scan_devices() Lakshmipathi.G (2): btrfs-progs: btrfs-debugfs: cleanup unused variables reported by pylint btrfs-progs: Add fast,slow symlinks, fifo types to convert test Lu Fengqi (1): btrfs-progs: move btrfs_extref_hash() to hash.h Qu Wenruo (9): btrfs-progs: Copy btrfs inode flags from kernel header btrfs-progs: Make btrfs-debug-tree print all readable strings for inode flags btrfs-progs: convert: Convert ext inode flags to btrfs inode flags btrfs-progs: convert-test: Add test case for common inode flags btrfs-progs: Fix stack overflow for checking qgroup on tree reloc tree 
btrfs-progs: test: Add test image for btrfsck qgroup rescan detection btrfs-progs: test: Add image for quota verify stack overflow btrfs-progs: rename raid6.c to raid56.c btrfs-progs: volumes: Remove BUG_ON in raid56 write routine Tsutomu Itoh (2): btrfs-progs: image: fix compiler warning btrfs-progs: send: remove unnecessary code
Re: bio linked list corruption.
On 10/24/2016 12:40 AM, Dave Jones wrote: On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > It could be worth trying this, too: > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack=174531fef4e8 > > > > > > > > It occurred to me that the current code is a little bit fragile. > > > > > > It's been nearly 24hrs with the above changes, and it's been pretty much > > > silent the whole time. > > > > > > The only thing of note over that time period has been a btrfs lockdep > > > warning that's been around for a while, and occasional btrfs checksum > > > failures, which I've been seeing for a while, but seem to have gotten > > > worse since 4.8. > > > > > > I'm pretty confident in the disk being ok in this machine, so I think > > > the checksum warnings are bogus. Chris suggested they may be the result > > > of memory corruption, but there's little else going on. > > > > The only interesting thing last nights run was this.. > > > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70 > > page:ea00138adc00 count:0 mapcount:0 mapping:88046e9fc2e0 index:0xdf0 > > flags: 0x400c(referenced|uptodate) > > page dumped because: non-NULL mapping > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11 > > Workqueue: writeback wb_workfn (flush-btrfs-2) > > Well crud, we're back to wondering if this is Btrfs or the stack > corruption. Since the pagevecs are on the stack and this is a new > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we > should make sure. Here's an interesting one from today, pointing the finger at xattrs again. 
[69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11 [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000 [69943.468077] RIP: 0010:[] Was this btrfs? -chris
Re: [PATCH] ioctl.h: add missing kernel compatibility header for BUILD_ASSERT
Hi,

On Mon, Oct 24, 2016 at 09:29:12AM +0100, sly...@gmail.com wrote: > From: Sergei Trofimovich > > Header breakage noticed by cynede. Reproducible as: > > $ gcc -c /usr/include/btrfs/ioctl.h -o /tmp/a.o > /usr/include/btrfs/ioctl.h:42:14: error: expected declaration > specifiers or '...' before 'sizeof' > BUILD_ASSERT(sizeof(struct btrfs_ioctl_vol_args) == 4096); > ^~ > > Basically gcc tries to say us BUILD_ASSERT is not visible. > > BUILD_ASSERT lives in kerncompat.h which this change adds.

I think including the kerncompat.h is too intrusive here; I've fixed it by providing an empty macro if it's not defined. I'll release 4.8.2 soon.
Re: Drive Replacement
On 2016-10-21 18:13, Peter Becker wrote: if you have >750 GB free you can simply remove one of the drives. btrfs device delete /dev/sd[x] /mnt #power off, replace device btrfs device add /dev/sd[y] /mnt

Make sure to balance afterwards if you do this, the new disk will be pretty much unused until you do.

if not you can use an USB-SATA adapter or an eSata-Port and make the following: btrfs device add /dev/sd[y] /mnt btrfs device delete /dev/sd[x] /mnt #power off, replace device

I will comment that eSATA is vastly preferred to USB in this case (even a USB 3.0 UAS device) as it is generally significantly more reliable (even if you are just using an eSATA-to-SATA cable and a power adapter for the drive).

I avoid "btrfs device replace" because it's slower than add+delete

In my experience, this is only true if you add then delete (delete then add then re-balance will move a lot of data twice), and even then only if the device is more than about half full. Device replace also has a couple of specific advantages: * It lets you get an exact percentage completion, unlike add+delete. This is only updated every few seconds, and doesn't show an estimate of time remaining, but is still better than nothing. * Device IDs remain the same. This can be an advantage or a disadvantage, but it's usually a good thing in a case like this, because the mapping between device number and device node (and therefore specific disks) will remain constant, which makes tracking which physical device is failing a bit easier if you have consistent device enumeration for storage devices (which you almost certainly do if you're using SATA). * It doesn't write any data on any other device except for superblocks and the basic metadata that describes the device layout. This is important because it means that it's safer for data that isn't on the device being replaced, and it has less impact on other operations when doing a live replacement.

and don't forget to update fstab ! 
Assuming that he doesn't change which SATA ports the devices are connected to, he shouldn't have to change anything in /etc/fstab. Mount by UUID or LABEL will just work regardless, and mount by device node will continue to work as long as the same device nodes are used (which will be the case if he doesn't change anything else in the storage stack).

2016-10-22 0:07 GMT+02:00 Hugo Mills: On Sat, Oct 22, 2016 at 09:03:16AM +1100, Gareth Pye wrote: I've got a BTRFS array that is of mixed size disks: 2x750G 3x1.5T 3x3T And it's getting fuller than I'd like. The problem is that adding disks is harder than one would like as the computer only has 8 sata ports. Is it viable to do the following to upgrade one of the disks? A) Take array offline B) DD the contents of one of the 750G drives to a new 3T drive C) Remove the 750G from the system D) btrfs scan E) Mount array F) Run a balance I know that not physically removing the old copy of the drive will cause massive issues, but if I do that everything should be fine right?

Yes. The one thing missing here is running # btrfs dev resize :max /mountpoint on the new device between steps E and F to allow the FS to use the full amount of the device. Otherwise, it'll still be the same size as the original.

Hugo.

-- Hugo Mills | Great films about cricket: Batsman Begins hugo@... carfax.org.uk | starring Christian Bail http://carfax.org.uk/ | PGP: E2AB1DE4 |
[PATCH] ioctl.h: add missing kernel compatibility header for BUILD_ASSERT
From: Sergei Trofimovich

Header breakage noticed by cynede. Reproducible as:

$ gcc -c /usr/include/btrfs/ioctl.h -o /tmp/a.o /usr/include/btrfs/ioctl.h:42:14: error: expected declaration specifiers or '...' before 'sizeof' BUILD_ASSERT(sizeof(struct btrfs_ioctl_vol_args) == 4096); ^~

Basically gcc tries to say us BUILD_ASSERT is not visible.

BUILD_ASSERT lives in kerncompat.h which this change adds.

Reported-by: Mikhail Pukhlikov Signed-off-by: Sergei Trofimovich --- ioctl.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/ioctl.h b/ioctl.h index 238e7ef..ee650a9 100644 --- a/ioctl.h +++ b/ioctl.h @@ -26,6 +26,12 @@ extern "C" { #include #include +#if BTRFS_FLAT_INCLUDES +#include "kerncompat.h" +#else +#include <btrfs/kerncompat.h> +#endif /* BTRFS_FLAT_INCLUDES */ + #ifndef __user #define __user #endif -- 2.10.1
Re: Btrfs: fix free space tree bitmaps on big-endian systems
On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List wrote:
> Web: https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc
> Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc

520f16abf003952d in v4.7.10
1ff6341b5d92dd6b in v4.8.4

> Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed
> Refname: refs/heads/master
> Author: Omar Sandoval
> AuthorDate: Thu Sep 22 17:24:20 2016 -0700
> Committer: David Sterba
> CommitDate: Mon Oct 3 18:52:14 2016 +0200
>
>     Btrfs: fix free space tree bitmaps on big-endian systems
>
>     In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
>     bitmaps in memory and copy them directly to/from the extent buffers with
>     {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
>     granularity, which is equivalent to a little-endian bitmap. This means
>     that on big-endian systems, the in-memory bitmaps will be written to
>     disk byte-swapped. To fix this, use byte granularity for the bitmaps in
>     memory.

This change looks overly complex to me, and decreases performance.

>     Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
>     Cc: sta...@vger.kernel.org # 4.5+
>     Tested-by: Holger Hoffstätte
>     Tested-by: Chandan Rajendra
>     Signed-off-by: Omar Sandoval
>     Signed-off-by: David Sterba
> ---
>  fs/btrfs/extent_io.c       | 64 +-
>  fs/btrfs/extent_io.h       | 22
>  fs/btrfs/free-space-tree.c | 17 ++--
>  3 files changed, 76 insertions(+), 27 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 44fe66b..c3ec30d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5524,17 +5524,45 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
>  	}
>  }
>
> -/*
> - * The extent buffer bitmap operations are done with byte granularity because
> - * bitmap items are not guaranteed to be aligned to a word and therefore a
> - * single word in a bitmap may straddle two pages in the extent buffer.
> - */
> -#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
> -#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
> -#define BITMAP_FIRST_BYTE_MASK(start) \
> -	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
> -#define BITMAP_LAST_BYTE_MASK(nbits) \
> -	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
> +void le_bitmap_set(u8 *map, unsigned int start, int len)
> +{
> +	u8 *p = map + BIT_BYTE(start);

You cannot use cpu_to_le32/cpu_to_le64 on the masks and operate on
unsigned longs in memory because there's no alignment guarantee, right?

> +	const unsigned int size = start + len;
> +	int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> +	u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
> +
> +	while (len - bits_to_set >= 0) {
> +		*p |= mask_to_set;
> +		len -= bits_to_set;
> +		bits_to_set = BITS_PER_BYTE;
> +		mask_to_set = ~(u8)0;
> +		p++;
> +	}

memset() for all but the first partial byte (if present)?

> +	if (len) {
> +		mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
> +		*p |= mask_to_set;
> +	}
> +}
> +
> +void le_bitmap_clear(u8 *map, unsigned int start, int len)
> +{
> +	u8 *p = map + BIT_BYTE(start);
> +	const unsigned int size = start + len;
> +	int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> +	u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
> +
> +	while (len - bits_to_clear >= 0) {
> +		*p &= ~mask_to_clear;
> +		len -= bits_to_clear;
> +		bits_to_clear = BITS_PER_BYTE;
> +		mask_to_clear = ~(u8)0;
> +		p++;
> +	}

memset() for all but the first partial byte (if present)?

> +	if (len) {
> +		mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
> +		*p &= ~mask_to_clear;
> +	}
> +}
>
>  /*
>   * eb_bitmap_offset() - calculate the page and offset of the byte containing the
> @@ -5578,7 +5606,7 @@ static inline void eb_bitmap_offset(struct extent_buffer *eb,
>  int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
>  			   unsigned long nr)
>  {
> -	char *kaddr;
> +	u8 *kaddr;
>  	struct page *page;
>  	unsigned long i;
>  	size_t offset;
> @@ -5600,13 +5628,13 @@ int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
>  void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
>  			      unsigned long pos, unsigned long len)
>  {
> -	char *kaddr;
> +	u8 *kaddr;
>  	struct page *page;
>
[PATCH] btrfs-progs: Fix wrong tree block alignment for unaligned block group
Commit 854437ca ("btrfs-progs: extent-tree: avoid allocating tree block that
crosses stripe boundary") introduced a check that a tree block's logical
bytenr does not cross a stripe boundary.

However, that check is not completely correct. It only compares the logical
bytenr and length against the absolute logical offset, i.e. it only checks
whether a tree block lies inside a 64K logical stripe. In fact, a block group
can start at a bytenr unaligned with 64K, as in the following case, and
btrfsck will then give a false alert:

0       32K     64K     96K     128K    160K ...
        |--------------- Block group A ---------------
        |<------ TB 32K ------>|
        |/// Scrub stripe unit ///|
|      WRONG UNIT      |

In this case the TB (tree block) at bytenr 32K in fact fits into the kernel
scrub stripe unit, but does not fit into the pure logical 64K stripe.

Fix check_crossing_stripes() to compare bytenr to the block group start, not
to the absolute logical bytenr.

Reported-by: Jussi Kansanen
Signed-off-by: Qu Wenruo
---
 cmds-check.c  | 10 ++
 extent-tree.c | 15 ---
 volumes.h     | 23 ---
 3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 670ccd1..907d60c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4662,8 +4662,8 @@ static int add_extent_rec_nolookup(struct cache_tree *extent_cache,
 	bytes_used += rec->nr;
 	if (tmpl->metadata)
-		rec->crossing_stripes = check_crossing_stripes(rec->start,
-				global_info->tree_root->nodesize);
+		rec->crossing_stripes = check_crossing_stripes(global_info,
+				rec->start, global_info->tree_root->nodesize);
 	check_extent_type(rec);
 	return ret;
 }
@@ -4764,7 +4764,8 @@ static int add_extent_rec(struct cache_tree *extent_cache,
 	 */
 	if (tmpl->metadata)
 		rec->crossing_stripes = check_crossing_stripes(
-				rec->start, global_info->tree_root->nodesize);
+				global_info, rec->start,
+				global_info->tree_root->nodesize);
 	check_extent_type(rec);
 	maybe_free_extent_rec(extent_cache, rec);
 	return ret;
@@ -9359,7 +9360,8 @@ static int check_extent_item(struct btrfs_fs_info *fs_info,
 	if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
 		metadata = 1;
-	if (metadata && check_crossing_stripes(key.objectid, eb->len)) {
+	if (metadata && check_crossing_stripes(global_info, key.objectid,
+					       eb->len)) {
 		error("bad metadata [%llu, %llu) crossing stripe boundary",
 		      key.objectid, key.objectid + nodesize);
 		err |= CROSSING_STRIPE_BOUNDARY;
diff --git a/extent-tree.c b/extent-tree.c
index f6d0a7c..3b1577e 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2606,11 +2606,20 @@ check_failed:
 	}
 	if (!(data & BTRFS_BLOCK_GROUP_DATA)) {
-		if (check_crossing_stripes(ins->objectid, num_bytes)) {
-			search_start = round_down(ins->objectid + num_bytes,
-						  BTRFS_STRIPE_LEN);
+		if (check_crossing_stripes(info, ins->objectid, num_bytes)) {
+			struct btrfs_block_group_cache *bg_cache;
+			u64 bg_offset;
+
+			bg_cache = btrfs_lookup_block_group(info, ins->objectid);
+			if (!bg_cache)
+				goto no_bg_cache;
+			bg_offset = ins->objectid - bg_cache->key.objectid;
+
+			search_start = round_up(bg_offset + num_bytes,
+						BTRFS_STRIPE_LEN) + bg_offset;
 			goto new_group;
 		}
+no_bg_cache:
 		block_group = btrfs_lookup_block_group(info, ins->objectid);
 		if (block_group)
 			trans->block_group = block_group;
diff --git a/volumes.h b/volumes.h
index d7b7d3c..7cb38b0 100644
--- a/volumes.h
+++ b/volumes.h
@@ -155,11 +155,28 @@ struct map_lookup {
  * Check if the given range cross stripes.
  * To ensure kernel scrub won't causing bug on with METADATA in mixed
  * block group
+ *
+ * Return 1 if the range crosses STRIPE boundary
+ * Return 0 if the range doesn't cross STRIPE boundary or it
+ * doesn't belong to any block group (no boundary to cross)
  */
-static inline int check_crossing_stripes(u64 start, u64 len)
+static inline int check_crossing_stripes(struct btrfs_fs_info *fs_info,
+					 u64 start, u64 len)
 {
-	return (start / BTRFS_STRIPE_LEN) !=
-		((start + len - 1) / BTRFS_STRIPE_LEN);
+	struct