Re: [PATCH 4/5] btrfs: Allow barrier_all_devices to do per-chunk device check
On 10/30/2015 07:41 PM, Qu Wenruo wrote:

On 2015-10-30 16:32, Anand Jain wrote:

Qu,

We shouldn't mark FS readonly when chunks are degradable. As below.

Thanks, Anand

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 39a2d57..dbb2483 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3530,7 +3530,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 	if (do_barriers) {
 		ret = barrier_all_devices(root->fs_info);
-		if (ret) {
+		if (ret < 0) {
 			mutex_unlock(
 				&root->fs_info->fs_devices->device_list_mutex);
 			btrfs_std_error(root->fs_info, ret,

Sorry, I didn't get the point here.

There should be no difference between ret and ret < 0, as barrier_all_devices() will only return -EIO or 0.

Oh sorry. You are right. I missed that point.

Thanks, Anand

Thanks, Qu

On 09/21/2015 10:10 AM, Qu Wenruo wrote:

The last user of num_tolerated_disk_barrier_failures is barrier_all_devices(). But it can be easily changed to the new per-chunk degradable check framework.

Now btrfs_device has two extra members, representing send/wait errors, set at write_dev_flush() time. They are then checked in a similar but more accurate way than the old code.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/disk-io.c | 13 +
 fs/btrfs/volumes.c |  6 +-
 fs/btrfs/volumes.h |  4
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d64299f..7cd94e7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3400,8 +3400,6 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
 	struct list_head *head;
 	struct btrfs_device *dev;
-	int errors_send = 0;
-	int errors_wait = 0;
 	int ret;

 	/* send down all the barriers */
@@ -3410,7 +3408,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_send++;
+			dev->err_send = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3418,7 +3416,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		ret = write_dev_flush(dev, 0);
 		if (ret)
-			errors_send++;
+			dev->err_send = 1;
 	}

 	/* wait for all the barriers */
@@ -3426,7 +3424,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_wait++;
+			dev->err_wait = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3434,10 +3432,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		ret = write_dev_flush(dev, 1);
 		if (ret)
-			errors_wait++;
+			dev->err_wait = 1;
 	}

-	if (errors_send > info->num_tolerated_disk_barrier_failures ||
-	    errors_wait > info->num_tolerated_disk_barrier_failures)
+	if (btrfs_check_degradable(info, info->sb->s_flags) < 0)
 		return -EIO;
 	return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f1ef215..88266fa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6945,8 +6945,12 @@ int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
 			btrfs_get_num_tolerated_disk_barrier_failures(
 					map->type);
 		for (i = 0; i < map->num_stripes; i++) {
-			if (map->stripes[i].dev->missing)
+			if (map->stripes[i].dev->missing ||
+			    map->stripes[i].dev->err_wait ||
+			    map->stripes[i].dev->err_send)
 				missing++;
+			map->stripes[i].dev->err_wait = 0;
+			map->stripes[i].dev->err_send = 0;
 		}
 		if (missing > max_tolerated) {
 			ret = -EIO;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fe758df..cd02556 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -76,6 +76,10 @@ struct btrfs_device {
 	int can_discard;
 	int is_tgtdev_for_dev_replace;

+	/* for barrier_all_devices() check */
+	int err_send;
+	int err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
 	seqcount_t data_seqcount;
 #endif
Re: Periodic kernel freezes
On 30/10/2015 16:25, Alex Adriaanse wrote:

I have an EC2 instance on AWS that tends to freeze several times per week. When it freezes it stops responding to network traffic, disk I/O stops, and CPU goes to 100%. The system comes back fine after a reboot. I was finally able to get a kernel backtrace from when this happened today, which I have attached to this email. The VM in question runs Debian Jessie, and has 3 BTRFS filesystems, including the root filesystem. Details are included below. Any ideas?

Hi Alex - I kept experiencing problems with the Jessie 3.16.x kernel on EC2 (and elsewhere) with BTRFS. Out of 8 nodes, one managed an uptime of 90 days, while the average was about 21 days. Crashes were seemingly random, and it was difficult to get stack traces. For the stack traces I did get, it wasn't always obvious that the problem lay with BTRFS. Reboots normally needed to be forceful.

I'd suggest upgrading to a backports kernel (I compiled various 4.1.x kernels, but there's now 4.2.x in jessie-backports). You might also want to turn off compression...

David.
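A hedged illustration of the compression suggestion: this assumes compression was enabled via an fstab mount option (the report doesn't actually say whether it was), using the device and mountpoint from Alex's setup. Existing data stays compressed until rewritten; only new writes are affected.

    # /etc/fstab before (assumed):
    #   /dev/xvdc  /srv/volumes  btrfs  compress=lzo  0  0
    # /etc/fstab after:
    #   /dev/xvdc  /srv/volumes  btrfs  defaults      0  0

    # Pick up the edited fstab options without a reboot:
    mount -o remount /srv/volumes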
Periodic kernel freezes
I have an EC2 instance on AWS that tends to freeze several times per week. When it freezes it stops responding to network traffic, disk I/O stops, and CPU goes to 100%. The system comes back fine after a reboot. I was finally able to get a kernel backtrace from when this happened today, which I have attached to this email. The VM in question runs Debian Jessie, and has 3 BTRFS filesystems, including the root filesystem. Details are included below. Any ideas?

Thanks, Alex

# uname -a
Linux prod-docker-1-a 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u5 (2015-10-09) x86_64 GNU/Linux

# btrfs --version
Btrfs v3.17

# df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/xvda       8.0G  1.3G  6.4G    17%  /
udev             10M     0   10M     0%  /dev
tmpfs           3.0G  8.6M  3.0G     1%  /run
tmpfs           7.5G   12K  7.5G     1%  /dev/shm
tmpfs           5.0M     0  5.0M     0%  /run/lock
tmpfs           7.5G     0  7.5G     0%  /sys/fs/cgroup
/dev/xvdb        50G  3.9G   45G     9%  /var/lib/docker
/dev/xvdc       200G   70G  130G    35%  /srv/volumes

# btrfs fi show
Label: none  uuid: 8a293966-5c19-485c-a819-a6b801a1085d
	Total devices 1 FS bytes used 1.21GiB
	devid 1 size 8.00GiB used 3.28GiB path /dev/xvda

Label: 'docker'  uuid: 5bf935e0-4519-43d9-b2e9-b3fb19374b72
	Total devices 1 FS bytes used 3.70GiB
	devid 1 size 50.00GiB used 6.04GiB path /dev/xvdb

Label: 'volumes'  uuid: 2d121370-7879-4485-8fd5-1fe0db5a0c12
	Total devices 1 FS bytes used 68.82GiB
	devid 1 size 200.00GiB used 124.04GiB path /dev/xvdc

Btrfs v3.17

# btrfs fi df /
Data, single: total=2.85GiB, used=1.17GiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=204.75MiB, used=38.03MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

# btrfs fi df /var/lib/docker
Data, single: total=4.01GiB, used=3.52GiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=179.58MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=64.00MiB, used=0.00B

# btrfs fi df /srv/volumes
Data, single: total=122.01GiB, used=68.55GiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=277.20MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=96.00MiB, used=0.00B

[344317.872151] ------------[ cut here ]------------
[344317.876091] kernel BUG at /build/linux-xkTWug/linux-3.16.7-ckt11/mm/page_alloc.c:1011!
[344317.876091] invalid opcode: [#1] SMP
[344317.876091] Modules linked in: xt_nat xt_tcpudp veth xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables x_tables nf_nat nf_conntrack bridge stp llc crc32_pclmul ppdev ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev psmouse serio_raw parport_pc parport ttm drm_kms_helper drm i2c_piix4 i2c_core processor thermal_sys button autofs4 btrfs xor raid6_pq ata_generic xen_blkfront crct10dif_pclmul crct10dif_common crc32c_intel ata_piix libata scsi_mod ixgbevf(O)
[344317.876091] CPU: 0 PID: 9842 Comm: kworker/u30:7 Tainted: G O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u5
[344317.876091] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015
[344317.876091] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[344317.876091] task: 8800eb30b630 ti: 880001a08000 task.ti: 880001a08000
[344317.876091] RIP: 0010:[] [] move_freepages+0x107/0x110
[344317.876091] RSP: 0018:880001a0b918 EFLAGS: 00010006
[344317.876091] RAX: 8803e08fb000 RBX: RCX: 0001
[344317.876091] RDX: ea000d922fc8 RSI: ea000d91c000 RDI: 8803e08fbe00
[344317.876091] RBP: 0001 R08: 8803e08fbe00 R09:
[344317.876091] R10: R11: 8803e08fbeb0 R12: ea000d91cbd0
[344317.876091] R13: R14: R15: 8803e08fbe00
[344317.876091] FS: () GS:8803e040() knlGS:
[344317.876091] CS: 0010 DS: ES: CR0: 80050033
[344317.876091] CR2: 7fd0a085fc00 CR3: 00035f1d5000 CR4: 001406f0
[344317.876091] Stack:
[344317.876091]  81143c1c 0002115a8000 ea000d91cbf0
[344317.876091]  8803e08fbe90 8803e0412f78 8800eb30b698 8803e08fbe00
[344317.876091]  001f 0001 001a
[344317.876091] Call Trace:
[344317.876091]  [] ? __rmqueue+0x37c/0x460
[344317.876091]  [] ? get_page_from_freelist+0x685/0x910
[344317.876091]  [] ? __alloc_pages_nodemask+0x16d/0xb30
[344317.876091]  [] ? __alloc_pages_nodemask+0x16d/0xb30
[344317.876091]  [] ? btrfs_find_space_for_alloc+0x22a/0x270 [btrfs]
[344317.8
Re: FW: btrfs-progs: android build
On Mon, Aug 31, 2015 at 06:33:01PM +0200, David Sterba wrote:
> So the preliminary support is merged. Outstanding issues are all related
> to blkid API:
>
> - is_ssd

Fixed by a trivial ifdef around the function.

> - btrfs_wipe_existing_sb
> - check_overwrite
>
> In the ssd check case it's safe to provide the 'return 1' replacement,
> but the other two are related to safety measures and I'm not comfortable
> to ifdef them out yet.

I'm not able to find the exact version of libblkid that android uses; the closest guess is 2.14, which is pretty old. The low-level probing was added in 2.15 and I think we can't avoid it. Reimplementing the missing blkid functionality is possible, but I'd rather not go that way.

Please let me know if there's a version of the android build system that provides a sufficiently new blkid, otherwise I'm afraid that we can't support it.
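As a hedged aside for anyone probing a build environment (the Android build system may not ship pkg-config metadata at all), the libblkid version a normal toolchain provides can be checked with:

    # The low-level probing API mentioned above needs libblkid >= 2.15:
    pkg-config --modversion blkid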
Re: corrupted RAID1: unsuccessful recovery / help needed
On 2015-10-30 06:58, Duncan wrote:

Lukas Pirl posted on Fri, 30 Oct 2015 10:43:41 +1300 as excerpted:

If there is one subvolume that contains all other (read only) snapshots and there is insufficient storage to copy them all separately: Is there an elegant way to preserve those when moving the data across disks?

AFAIK, no elegant way without a writable mount.

Tho I'm not sure, btrfs send, to a btrfs elsewhere using receive, may work, since you did specify read-only snapshots, which is what send normally works with in order to avoid changes to the snapshot while it's sending it. My own use-case doesn't involve either snapshots or send/receive, however, so I'm not sure if send can work with a read-only filesystem or not, but I think its normal method of operation is to create those read-only snapshots itself, which of course would require a writable filesystem, so I'm guessing it won't work unless you can convince it to use the read-only mounts as-is.

Unless something has significantly changed since I last looked, send only works on existing snapshots and doesn't create any directly itself, and as such should work fine to send snapshots from a read-only filesystem. In theory, you could use it to send all the snapshots at once, although that would probably take a long time, so you'll probably have to use a loop like the fragment of shell-script that Hugo suggested in his response. That should result in an (almost) identical level of sharing.

The less elegant way would involve manual deduplication. Copy one snapshot, then another, and dedup what hasn't changed between the two, then add a third and dedup again. ... Depending on the level of dedup (file vs block level) and the level of change in your filesystem, this should ultimately take about the same level of space as a full backup plus a series of incrementals.

If you're using duperemove (which is the only maintained dedupe tool I know of for BTRFS), then this will likely take a long time for any reasonable amount of data, and probably take up more space on the destination drive than it does on the source (while duperemove does block-based deduplication, it uses large chunks by default).

Meanwhile, this does reinforce the point that snapshots don't replace full backups, that being the reason I don't use them here, since if the filesystem goes bad, it'll very likely take all the snapshots with it.

FWIW, while I don't use them directly myself as a backup, they are useful when doing a backup to get a guaranteed stable version of the filesystem being backed-up (this is also one of the traditional use cases for LVM snapshots, although those have a lot of different issues to deal with). For local backups (I also do cloud-storage based remote backups, but local is what matters in this case because it's where I actually use send/receive and snapshots) I use two different methods depending on the amount of storage I have:

1. If I'm relatively limited on local storage (like in my laptop where the secondary internal disk is only 64G), I use a temporary snapshot to generate a SquashFS image of the system, which I then store on the secondary drive.

2. If I have a lot of spare space (like on my desktop where I have 4x 1TB HDD's and 2x 128G SSD's), I make a snapshot of the filesystem, then use send/receive to transfer that to a backup filesystem on a separate disk. I then keep the original snapshot around on the filesystem so I can do incremental send/receive to speed up future backups.
In both cases, I can directly boot my most recent backups if need be, and in the second case, I can actually use it to trivially regenerate the backed-up filesystems (by simply doing a send/receive in the opposite direction).

Beyond providing a stable system-image for backups, the only valid use case for snapshots in my opinion is to provide the equivalent to MS Windows' 'Restore Point' feature (which I'm pretty sure is done currently by RHEL and SLES if they are installed on BTRFS) and possibly 'File History' for people who for some reason can't use a real VCS or just need to store the last few revisions (which is itself done by stuff like 'snapper').
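A hedged sketch of the second method above; the snapshot locations and names are illustrative, not the actual layout from this message:

    # Initial full backup:
    btrfs subvolume snapshot -r / /snapshots/root-2015-10-29
    btrfs send /snapshots/root-2015-10-29 | btrfs receive /backup

    # Later, an incremental send that reuses the kept parent snapshot:
    btrfs subvolume snapshot -r / /snapshots/root-2015-10-30
    btrfs send -p /snapshots/root-2015-10-29 /snapshots/root-2015-10-30 | btrfs receive /backup

The -p parent on the second send is what makes the transfer incremental; the parent snapshot must still exist on both sides.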
Re: [PATCH V8 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
On 10/28/2015 04:10 AM, Chandan Rajendra wrote:

Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE units. Fix this by doing reservations/releases in block size units.

Signed-off-by: Chandan Rajendra

Reviewed-by: Josef Bacik

Thanks, Josef
Re: [PATCH V8 03/13] Btrfs: Direct I/O read: Work on sectorsized blocks
On 10/28/2015 04:10 AM, Chandan Rajendra wrote:

The direct I/O read's endio and corresponding repair functions work on page sized blocks. This commit adds the ability for direct I/O read to work on subpagesized blocks.

Reviewed-by: Josef Bacik

Thanks, Josef
Re: [PATCH V8 02/13] Btrfs: Compute and look up csums based on sectorsized blocks
On 10/28/2015 04:10 AM, Chandan Rajendra wrote:

Checksums are applicable to sectorsize units. The current code uses bio->bv_len units to compute and look up checksums. This works on machines where sectorsize == PAGE_SIZE. This patch makes the checksum computation and look-up code work with sectorsize units.

Reviewed-by: Liu Bo
Signed-off-by: Chandan Rajendra

Reviewed-by: Josef Bacik

Thanks, Josef
Re: [PATCH 5/6] btrfs-progs: free comparer_set in cmd_qgroup_show
On Thu, Oct 29, 2015 at 05:31:47PM +0800, Zhao Lei wrote:
> comparer_set, which was allocated by malloc(), should be freed before
> the function returns.
>
> Signed-off-by: Zhao Lei
> ---
>  cmds-qgroup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/cmds-qgroup.c b/cmds-qgroup.c
> index a64b716..f069d32 100644
> --- a/cmds-qgroup.c
> +++ b/cmds-qgroup.c
> @@ -290,7 +290,7 @@ static int cmd_qgroup_show(int argc, char **argv)
>  	int filter_flag = 0;
>  	unsigned unit_mode;
>
> -	struct btrfs_qgroup_comparer_set *comparer_set;
> +	struct btrfs_qgroup_comparer_set *comparer_set = NULL;
>  	struct btrfs_qgroup_filter_set *filter_set;
>  	filter_set = btrfs_qgroup_alloc_filter_set();
>  	comparer_set = btrfs_qgroup_alloc_comparer_set();
> @@ -372,6 +372,8 @@ static int cmd_qgroup_show(int argc, char **argv)
>  		fprintf(stderr, "ERROR: can't list qgroups: %s\n",
>  			strerror(e));
>
> +	free(comparer_set);

Doh, coverity correctly found that comparer_set is freed inside btrfs_show_qgroups() a few lines above, so freeing it again here would be a double free. Patch dropped.

> +
Re: [PATCH 4/5] btrfs: Allow barrier_all_devices to do per-chunk device check
On 2015-10-30 16:32, Anand Jain wrote:

Qu,

We shouldn't mark FS readonly when chunks are degradable. As below.

Thanks, Anand

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 39a2d57..dbb2483 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3530,7 +3530,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 	if (do_barriers) {
 		ret = barrier_all_devices(root->fs_info);
-		if (ret) {
+		if (ret < 0) {
 			mutex_unlock(
 				&root->fs_info->fs_devices->device_list_mutex);
 			btrfs_std_error(root->fs_info, ret,

Sorry, I didn't get the point here.

There should be no difference between ret and ret < 0, as barrier_all_devices() will only return -EIO or 0.

Or did I miss something?

Thanks, Qu

On 09/21/2015 10:10 AM, Qu Wenruo wrote:

The last user of num_tolerated_disk_barrier_failures is barrier_all_devices(). But it can be easily changed to the new per-chunk degradable check framework.

Now btrfs_device has two extra members, representing send/wait errors, set at write_dev_flush() time. They are then checked in a similar but more accurate way than the old code.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/disk-io.c | 13 +
 fs/btrfs/volumes.c |  6 +-
 fs/btrfs/volumes.h |  4
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d64299f..7cd94e7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3400,8 +3400,6 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
 	struct list_head *head;
 	struct btrfs_device *dev;
-	int errors_send = 0;
-	int errors_wait = 0;
 	int ret;

 	/* send down all the barriers */
@@ -3410,7 +3408,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_send++;
+			dev->err_send = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3418,7 +3416,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		ret = write_dev_flush(dev, 0);
 		if (ret)
-			errors_send++;
+			dev->err_send = 1;
 	}

 	/* wait for all the barriers */
@@ -3426,7 +3424,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_wait++;
+			dev->err_wait = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3434,10 +3432,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		ret = write_dev_flush(dev, 1);
 		if (ret)
-			errors_wait++;
+			dev->err_wait = 1;
 	}

-	if (errors_send > info->num_tolerated_disk_barrier_failures ||
-	    errors_wait > info->num_tolerated_disk_barrier_failures)
+	if (btrfs_check_degradable(info, info->sb->s_flags) < 0)
 		return -EIO;
 	return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f1ef215..88266fa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6945,8 +6945,12 @@ int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
 			btrfs_get_num_tolerated_disk_barrier_failures(
 					map->type);
 		for (i = 0; i < map->num_stripes; i++) {
-			if (map->stripes[i].dev->missing)
+			if (map->stripes[i].dev->missing ||
+			    map->stripes[i].dev->err_wait ||
+			    map->stripes[i].dev->err_send)
 				missing++;
+			map->stripes[i].dev->err_wait = 0;
+			map->stripes[i].dev->err_send = 0;
 		}
 		if (missing > max_tolerated) {
 			ret = -EIO;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fe758df..cd02556 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -76,6 +76,10 @@ struct btrfs_device {
 	int can_discard;
 	int is_tgtdev_for_dev_replace;

+	/* for barrier_all_devices() check */
+	int err_send;
+	int err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
 	seqcount_t data_seqcount;
 #endif
Re: corrupted RAID1: unsuccessful recovery / help needed
On Fri, Oct 30, 2015 at 10:58:47AM +0000, Duncan wrote:
> Lukas Pirl posted on Fri, 30 Oct 2015 10:43:41 +1300 as excerpted:
>
> > If there is one subvolume that contains all other (read only) snapshots
> > and there is insufficient storage to copy them all separately:
> > Is there an elegant way to preserve those when moving the data across
> > disks?

If they're read-only snapshots already, then yes:

    sent=
    for sub in *; do
        btrfs send $sent $sub | btrfs receive /where/ever
        sent="$sent -c$sub"
    done

That will preserve the shared extents between the subvols on the receiving FS. If they're not read-only, then snapshotting each one again as RO before sending would be the approach, but if your FS is itself RO, that's not going to be possible, and you need to look at Duncan's email.

Hugo.

> AFAIK, no elegant way without a writable mount.
>
> Tho I'm not sure, btrfs send, to a btrfs elsewhere using receive, may
> work, since you did specify read-only snapshots, which is what send
> normally works with in order to avoid changes to the snapshot while
> it's sending it. My own use-case doesn't involve either snapshots or
> send/receive, however, so I'm not sure if send can work with a read-only
> filesystem or not, but I think its normal method of operation is to
> create those read-only snapshots itself, which of course would require a
> writable filesystem, so I'm guessing it won't work unless you can
> convince it to use the read-only mounts as-is.
>
> The less elegant way would involve manual deduplication. Copy one
> snapshot, then another, and dedup what hasn't changed between the two,
> then add a third and dedup again. ... Depending on the level of dedup
> (file vs block level) and the level of change in your filesystem, this
> should ultimately take about the same level of space as a full backup
> plus a series of incrementals.
>
> Meanwhile, this does reinforce the point that snapshots don't replace
> full backups, that being the reason I don't use them here, since if the
> filesystem goes bad, it'll very likely take all the snapshots with it.
>
> Snapshots do tend to be pretty convenient, arguably /too/ convenient and
> near-zero-cost to make, as people then tend to just do scheduled
> snapshots, without thinking about their overhead and maintenance costs on
> the filesystem, until they already have problems. I'm not sure if you
> are a regular list reader and have thus seen my normal spiel on btrfs
> snapshot scaling and recommended limits to avoid problems or not, so if
> not, here's a slightly condensed version...
>
> Btrfs has scaling issues that appear when trying to manage too many
> snapshots. These tend to appear first in tools like balance and check,
> where time to process a filesystem goes up dramatically as the number of
> snapshots increases, to the point where it can become entirely
> impractical to manage at all somewhere near the 100k snapshots range, and
> is already dramatically affecting runtime at 10k snapshots.
>
> As a result, I recommend keeping per-subvol snapshots to 250-ish, which
> will allow snapshotting four subvolumes while still keeping total
> filesystem snapshots to 1000, or eight subvolumes at a filesystem total
> of 2000 snapshots, levels where the scaling issues should remain well
> within control. And 250-ish snapshots per subvolume is actually very
> reasonable even with half-hour scheduled snapshotting, provided a
> reasonable scheduled snapshot thinning program is also implemented,
> cutting say to hourly after six hours, six-hourly after a day, 12 hourly
> after 2 days, daily after a week, and weekly after four weeks to a
> quarter (13 weeks). Out beyond a quarter or two, certainly within a
> year, longer term backups to other media should be done, and snapshots
> beyond that can be removed entirely, freeing up the space the old
> snapshots kept locked down and helping to keep the btrfs healthy and
> functioning well within its practical scalability limits.
>
> Because a balance that takes a month to complete because it's dealing
> with a few hundred k snapshots is in practice (for most people) not
> worthwhile to do at all, and also in practice, a year or even six months
> out, are you really going to care about the precise half-hour snapshot,
> or is the next daily or weekly snapshot going to be just as good, and a
> whole lot easier to find among a couple hundred snapshots than hundreds
> of thousands?
>
> If you have far too many snapshots, perhaps this sort of thinning
> strategy will as well allow you to copy and dedup only key snapshots, say
> weekly plus daily for the last week, doing the backup thing manually, as
> well, modifying the thinning strategy accordingly if necessary to get it
> to fit. Tho using the copy and dedup strategy above will still require
> at least double the full space of a single copy, plus the space necessary
> for each deduped snapshot copy you keep, since the dedup occurs after the
> copy.
Re: RichACLs for BTRFS? (this time complete)
On 2015-10-30 05:45, Marcel Ritter wrote:

Hi btrfs-developers,

I just read about the possible/planned merge of richacl patches into linux kernel 4.4.

see http://lwn.net/Articles/661078/
see http://lwn.net/Articles/661357/

Will btrfs support richacls with kernel 4.4? According to the btrfs wiki, this topic has not been claimed: https://btrfs.wiki.kernel.org/index.php/Project_ideas#RichACLs_.2F_NFS4_ACLS

As we'd like to use btrfs with NFSv4, I'd really like to see richacls on btrfs. Hope someone can comment on this topic.

While I don't think we'll directly support richacls, it shouldn't be hard to integrate them, as they're just stored in a couple of xattrs in the 'system' prefix. AFAICT, all that would really be needed is to make sure that things are wired up correctly so that we can differentiate between using POSIX ACLs and richacls (while I don't agree with a number of choices the developers have made with richacls (it should be relatively easy to find the long discussions we've had in the LKML archives), I do agree that these two different models should not be mixed).
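To make the xattr point concrete (hedged: the attribute name comes from the richacl patch set, not from anything btrfs itself defines), the two ACL models are distinguishable by the system xattrs they use:

    # Dump the raw richacl (assumes a richacl-enabled kernel and filesystem):
    getfattr -n system.richacl -e hex /path/to/file

    # POSIX ACLs live in different system xattrs, which is how the two
    # models can be told apart (and kept unmixed) at the storage level:
    getfattr -n system.posix_acl_access -e hex /path/to/file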
Re: corrupted RAID1: unsuccessful recovery / help needed
Lukas Pirl posted on Fri, 30 Oct 2015 10:43:41 +1300 as excerpted:

> If there is one subvolume that contains all other (read only) snapshots
> and there is insufficient storage to copy them all separately:
> Is there an elegant way to preserve those when moving the data across
> disks?

AFAIK, no elegant way without a writable mount.

Tho I'm not sure, btrfs send, to a btrfs elsewhere using receive, may work, since you did specify read-only snapshots, which is what send normally works with in order to avoid changes to the snapshot while it's sending it. My own use-case doesn't involve either snapshots or send/receive, however, so I'm not sure if send can work with a read-only filesystem or not, but I think its normal method of operation is to create those read-only snapshots itself, which of course would require a writable filesystem, so I'm guessing it won't work unless you can convince it to use the read-only mounts as-is.

The less elegant way would involve manual deduplication. Copy one snapshot, then another, and dedup what hasn't changed between the two, then add a third and dedup again. ... Depending on the level of dedup (file vs block level) and the level of change in your filesystem, this should ultimately take about the same level of space as a full backup plus a series of incrementals.

Meanwhile, this does reinforce the point that snapshots don't replace full backups, that being the reason I don't use them here, since if the filesystem goes bad, it'll very likely take all the snapshots with it.

Snapshots do tend to be pretty convenient, arguably /too/ convenient and near-zero-cost to make, as people then tend to just do scheduled snapshots, without thinking about their overhead and maintenance costs on the filesystem, until they already have problems. I'm not sure if you are a regular list reader and have thus seen my normal spiel on btrfs snapshot scaling and recommended limits to avoid problems or not, so if not, here's a slightly condensed version...

Btrfs has scaling issues that appear when trying to manage too many snapshots. These tend to appear first in tools like balance and check, where time to process a filesystem goes up dramatically as the number of snapshots increases, to the point where it can become entirely impractical to manage at all somewhere near the 100k snapshots range, and is already dramatically affecting runtime at 10k snapshots.

As a result, I recommend keeping per-subvol snapshots to 250-ish, which will allow snapshotting four subvolumes while still keeping total filesystem snapshots to 1000, or eight subvolumes at a filesystem total of 2000 snapshots, levels where the scaling issues should remain well within control. And 250-ish snapshots per subvolume is actually very reasonable even with half-hour scheduled snapshotting, provided a reasonable scheduled snapshot thinning program is also implemented, cutting say to hourly after six hours, six-hourly after a day, 12-hourly after 2 days, daily after a week, and weekly after four weeks to a quarter (13 weeks). Out beyond a quarter or two, certainly within a year, longer term backups to other media should be done, and snapshots beyond that can be removed entirely, freeing up the space the old snapshots kept locked down and helping to keep the btrfs healthy and functioning well within its practical scalability limits.

Because a balance that takes a month to complete because it's dealing with a few hundred k snapshots is in practice (for most people) not worthwhile to do at all, and also in practice, a year or even six months out, are you really going to care about the precise half-hour snapshot, or is the next daily or weekly snapshot going to be just as good, and a whole lot easier to find among a couple hundred snapshots than hundreds of thousands?

If you have far too many snapshots, perhaps this sort of thinning strategy will as well allow you to copy and dedup only key snapshots, say weekly plus daily for the last week, doing the backup thing manually, as well, modifying the thinning strategy accordingly if necessary to get it to fit. Tho using the copy and dedup strategy above will still require at least double the full space of a single copy, plus the space necessary for each deduped snapshot copy you keep, since the dedup occurs after the copy.

-- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
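A minimal, hedged sketch of just the final step of the thinning Duncan describes (dropping everything older than a quarter), assuming snapshots live in /snaps with a YYYY-MM-DD name prefix and GNU date is available; full bucketed thinning is what tools like snapper implement:

    #!/bin/sh
    # Delete snapshots older than 13 weeks. Destructive: needs a
    # writable filesystem and the assumed naming scheme.
    cutoff=$(date -d '13 weeks ago' +%s)
    for snap in /snaps/*; do
        # Parse the leading YYYY-MM-DD from the snapshot name; skip on failure.
        ts=$(date -d "$(basename "$snap" | cut -c1-10)" +%s) || continue
        if [ "$ts" -lt "$cutoff" ]; then
            btrfs subvolume delete "$snap"
        fi
    done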
Re: [3/3] btrfs: qgroup: Fix a rebase bug which will cause qgroup double free
Woops, just noticed I copied and pasted a typo there. Sorry for the trouble. It should be:

Tested-by: Johannes Henninger
Re: bad handling of unpartitioned device in sysfs_devno_to_wholedisk() (which breaks mkfs.btrfs)
On Fri, Oct 30, 2015 at 09:43:28AM +0800, Tom Yan wrote:
> So I noticed that SSD detection doesn't work on unpartitioned devices in
> mkfs.btrfs somehow:
> https://bugzilla.kernel.org/show_bug.cgi?id=102921
>
> Later I found out that it breaks at blkid_devno_to_wholedisk() in is_ssd():
> http://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/tree/mkfs.c?h=v4.2.3#n1103
>
> for which Elliot had shown an example with strace:
> https://lists.01.org/pipermail/linux-nvdimm/2015-September/002109.html
>
> And I think the problem occurs in the sysfs_get_devname() here:
> https://git.kernel.org/cgit/utils/util-linux/util-linux.git/tree/lib/sysfs.c?h=v2.27#n785
>
> Since sysfs_get_devname() has to call sysfs_readlink() later, which
> outputs a long full device path in /sys, I don't think we should call
> it directly with the buffer "diskname"; people won't expect that it has
> to be large enough to carry the path in the middle of the process. For
> example in is_ssd(), a char array of size 32 is used ("wholedisk").

You're right. The function sysfs_get_devname() is not too elegant, as it uses the devname buffer for the readlink. Fixed, the bugfix will be in v2.27.1.

Thanks!

Karel

-- Karel Zak http://karelzak.blogspot.com
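A quick way to see why a 32-byte buffer can't double as readlink scratch space (hedged: the exact path varies by machine; /sys/dev/block has existed since kernel 2.6.27):

    # The whole-disk lookup resolves a sysfs symlink like this one; the
    # relative path it returns is far longer than the 32-byte "wholedisk"
    # array mentioned above:
    readlink /sys/dev/block/8:0
    readlink /sys/dev/block/8:0 | wc -c   # typically 80+ bytes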
RichACLs for BTRFS? (this time complete)
Hi btrfs-developers,

I just read about the possible/planned merge of richacl patches into linux kernel 4.4.

see http://lwn.net/Articles/661078/
see http://lwn.net/Articles/661357/

Will btrfs support richacls with kernel 4.4? According to the btrfs wiki, this topic has not been claimed: https://btrfs.wiki.kernel.org/index.php/Project_ideas#RichACLs_.2F_NFS4_ACLS

As we'd like to use btrfs with NFSv4, I'd really like to see richacls on btrfs. Hope someone can comment on this topic.

Bye, Marcel

PS: Please excuse my former incomplete posting.
Re: corrupted RAID1: unsuccessful recovery / help needed
Lukas Pirl posted on Fri, 30 Oct 2015 10:43:41 +1300 as excerpted:

> Is e.g. "balance" also influenced by the userspace tools, or does
> the kernel do the actual work?

btrfs balance is done "online", that is, on the (writable-)mounted filesystem, and the kernel does the real work.

It's the tools that work on the unmounted filesystem, btrfs check, btrfs restore, btrfs rescue, etc, where the userspace code does the real work, and thus where being current and having all the latest userspace fixes is vital.

If you can't mount writable, you can't balance.

-- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
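To illustrate the online/offline split (device and mountpoint names are placeholders):

    # Online: the kernel does the work; the FS must be mounted writable.
    btrfs balance start /mnt

    # Offline: userspace btrfs-progs code does the work; the FS is unmounted.
    umount /mnt
    btrfs check /dev/sdb
    btrfs restore /dev/sdb /some/recovery/dir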
Re: random i/o error without error in dmesg
Marc Joliet posted on Thu, 29 Oct 2015 22:10:24 +0100 as excerpted:

>> Meanwhile, as explained in the systemd docs (specifically the systemd
>> for administrators series, IIRC), systemd dropping back to the initr* is
>> actually its way of automatically doing effectively the same thing we
>> were using lib_users and all those restarts to do, getting rid of all
>> possible still running on root executables, including systemd itself, by
>> reexecing systemd itself back in the initr*, as a way to help eliminate
>> *everything* running on root, so it can not only be remounted read-only,
>> but actually unmounted the same as any other filesystem, as userspace is
>> now actually running from the initr* once again. That's a far *FAR*
>> safer reboot after upgrade than traditional sysvinit solutions were able
>> to do. =:^)
>
> Yeah, the ability to do that is a nice plus of using an initramfs.
> Although I've never been clear on why it's *safer*. Is it because the
> remount might fail? Or are there other reasons, too?

While I don't claim anything but informed admin level authority on the problem...

It's first worth noting that the problem a return to initramfs helps solve is in practice reasonably rare and obscure, since if it weren't, people would have been experiencing it in serious numbers on sysvinit-based systems all along, and something would have been done to solve it long before systemd came along. So it's a relatively narrow issue that in practice can only affect a few users, a relatively small portion of the time.

From my read of the systemd docs, it's more pointing out a theoretical issue than a practical one, pointing out that systemd is in fact a more theoretically correct solution to the (implicitly mostly theoretical) problem.

In that context, I believe the (mostly theoretical) point is as much that we were treating / (and perhaps another mount or two) special, remounting it read-only instead of unmounting it because in practice there wasn't any other choice, and that now that systemd offers the choice, it can in fact be treated just like any other filesystem, fully unmounting it before shutdown.

Since exceptions to rules are nice places for bugs to hide, in theory at least (the remount-ro root being such a universal exception that in practice it's a rule of its own, and bugs couldn't long hide in that exception /because/ of its universalness), being able to treat / like any other filesystem and unmount it is a "purer and more correct" solution. IOW, it's a nice counter to the "systemd isn't unixy enough" point, as here, it's more "unixy" than sysvinit ever was.

That said, I expect that over the years there have been plenty of otherwise nice implementations of various useful things that ran into a shutdown/reboot-time problem due to root's remount-ro exception, that either limited them to non-root-filesystem deployment or sent them back for a workaround, if not causing them to be rejected outright as unworkable, that in this new return-to-initr*-and-unmount-root environment will see faster deployment without the workarounds that heretofore were required.

Of course that'll end up being a limitation on deployment on non-initr* direct-to-root boot sequences, but in this primarily prebuilt binary distro with prebuilt, by-necessity-modular kernel-and-initr* environment, that's unlikely to slow down wide deployment by much, and anyone wanting to do direct-to-root boots and/or non-systemd-based deployments will just have to find their own workarounds, which may ultimately be incorporated into upstream, or not, depending on upstream's whims.

Which, bringing it all back to the btrfs list title topic, is already where multi-device btrfs as the / filesystem is in terms of initr*, since that's basically broken without an initr* to assemble it. And of course the same thing goes for / on LVM, since it too requires userspace to activate, which means initr* if / is on it.

-- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: BTRFS BUG at insert_inline_extent_backref+0xe3/0xf0 while rebalancing
Filipe Manana writes:
> Try this (just sent a few minutes ago):
> https://patchwork.kernel.org/patch/7463161/

I've been using this patch for a week now, doing two rebalances a day (one per file system) - no problem so far. Thanks!

Probably unrelated to this, I did experience one reboot without any trace, possibly because I had enabled panic = 10 and panic_on_oops = 1, but that event did not happen anywhere near the time a balance was running. I wonder if the hang detector could trigger that configuration to reboot?

Thanks again for the great work, your detective work is always impressive :).

-- http://www.modeemi.fi/~flux/
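For reference, and hedged since only panic and panic_on_oops are confirmed from the report above, these are the sysctls involved; the hung-task detector only feeds into them if it is itself set to panic:

    # Settings the reporter mentions:
    sysctl kernel.panic=10          # auto-reboot 10 s after a panic
    sysctl kernel.panic_on_oops=1   # escalate any oops to a panic

    # The hung-task ("hang") detector warns in dmesg by default; it only
    # panics (and thus reboots, given kernel.panic=10) when this is set:
    sysctl kernel.hung_task_panic=1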
Re: [PATCH 4/5] btrfs: Allow barrier_all_devices to do per-chunk device check
Qu,

We shouldn't mark FS readonly when chunks are degradable. As below.

Thanks, Anand

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 39a2d57..dbb2483 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3530,7 +3530,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 	if (do_barriers) {
 		ret = barrier_all_devices(root->fs_info);
-		if (ret) {
+		if (ret < 0) {
 			mutex_unlock(
 				&root->fs_info->fs_devices->device_list_mutex);
 			btrfs_std_error(root->fs_info, ret,

On 09/21/2015 10:10 AM, Qu Wenruo wrote:

The last user of num_tolerated_disk_barrier_failures is barrier_all_devices(). But it can be easily changed to the new per-chunk degradable check framework.

Now btrfs_device has two extra members, representing send/wait errors, set at write_dev_flush() time. They are then checked in a similar but more accurate way than the old code.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/disk-io.c | 13 +
 fs/btrfs/volumes.c |  6 +-
 fs/btrfs/volumes.h |  4
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d64299f..7cd94e7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3400,8 +3400,6 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
 	struct list_head *head;
 	struct btrfs_device *dev;
-	int errors_send = 0;
-	int errors_wait = 0;
 	int ret;

 	/* send down all the barriers */
@@ -3410,7 +3408,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_send++;
+			dev->err_send = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3418,7 +3416,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		ret = write_dev_flush(dev, 0);
 		if (ret)
-			errors_send++;
+			dev->err_send = 1;
 	}

 	/* wait for all the barriers */
@@ -3426,7 +3424,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_wait++;
+			dev->err_wait = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3434,10 +3432,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		ret = write_dev_flush(dev, 1);
 		if (ret)
-			errors_wait++;
+			dev->err_wait = 1;
 	}

-	if (errors_send > info->num_tolerated_disk_barrier_failures ||
-	    errors_wait > info->num_tolerated_disk_barrier_failures)
+	if (btrfs_check_degradable(info, info->sb->s_flags) < 0)
 		return -EIO;
 	return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f1ef215..88266fa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6945,8 +6945,12 @@ int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
 			btrfs_get_num_tolerated_disk_barrier_failures(
 					map->type);
 		for (i = 0; i < map->num_stripes; i++) {
-			if (map->stripes[i].dev->missing)
+			if (map->stripes[i].dev->missing ||
+			    map->stripes[i].dev->err_wait ||
+			    map->stripes[i].dev->err_send)
 				missing++;
+			map->stripes[i].dev->err_wait = 0;
+			map->stripes[i].dev->err_send = 0;
 		}
 		if (missing > max_tolerated) {
 			ret = -EIO;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fe758df..cd02556 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -76,6 +76,10 @@ struct btrfs_device {
 	int can_discard;
 	int is_tgtdev_for_dev_replace;

+	/* for barrier_all_devices() check */
+	int err_send;
+	int err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
 	seqcount_t data_seqcount;
 #endif