How many subvols/snapshots are possible? (limits?)
Dear Devs,

This is more a use-case question about what is a good idea to do...

Can btrfs support snapshots of the filesystem at very regular intervals, say minute by minute or even second by second? Or are there limits that will be hit with metadata overheads, link/reference limits, or CPU overheads if 'too many' snapshots/subvols are made?

If snapshots were to be taken once a minute and retained, what breaks first? What are 'reasonable' (maximum) numbers for frequency and number of held versions?

Thanks,

Martin
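For concreteness, the kind of loop being asked about might look like the untested sketch below (the paths and the keep-60 retention policy are invented for illustration; the btrfs subvolume snapshot/delete subcommands are real):

  #!/bin/bash
  # Take a read-only snapshot of /data once a minute, keep the last 60.
  while sleep 60; do
      ts=$(date +%Y%m%d-%H%M%S)
      btrfs subvolume snapshot -r /data "/data/.snaps/$ts"
      # Prune: timestamped names sort lexically, so drop all but the 60 newest.
      ls -1d /data/.snaps/* | head -n -60 | \
          while read old; do btrfs subvolume delete "$old"; done
  done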
Re: Virtual Device Support
On 10/05/13 15:03, George Mitchell wrote:
> One of the things that is frustrating me the most at this point from a user perspective ... The current method of simply using a random member device or a LABEL or a UUID is just not working well for me. Having a well thought out virtual device infrastructure would...

Sorry, I'm a bit lost by your comments... What is your use case, and what are you hoping/expecting to see?

I've been following the development of btrfs for a while and I'm looking forward to using it to efficiently replace some of the very useful features of LVM2, drbd, and md-raid that I'm using at present... OK, so the way of managing all that is going to be a little different. How would you want that?

Regards,

Martin
Re: Virtual Device Support
OK, so to summarise:

On 19/05/13 15:49, George Mitchell wrote:
> In reply to both of these comments in one message, let me give you an example. I use shell scripts to mount and unmount btrfs volumes for backup purposes. Most of these volumes are not listed in fstab, simply because I do not want to clutter my fstab with volumes that are used only for backup. So the only way I can mount them is either by LABEL or by UUID. But I can't unmount them by either LABEL or UUID because that is not supported by util-linux and they have no intention of supporting it in the future. So I have to resort to unmounting by directory ...

Which all comes down to a way of working...

Likewise, I have some old and long-used backup scripts that mount a one-of-many backups disk pack. My solution is to use filesystem labels and to use 'sed' to update just the one line in /etc/fstab for the backups mount point label, so that the backups are then mounted/unmounted by the mount point. I've never been able to use the /dev/sdXX numbering because the multiple physical drives can be detected in a different order.

Agreed that, for the sake of good consistency, being able to unmount by filesystem label is a nice/good idea. But is there any interest for that to be picked up? Put in a bug/feature request onto bugzilla?

I would guess that most developers focus on the mount point and let fstab/mtab sort out the detail...

Regards,

Martin
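As an illustration of that fstab trick, something along these lines (untested; the label naming scheme, mount point, and fstab line are invented for the example):

  # Point the /mnt/backup fstab entry at whichever labelled pack is plugged in,
  # then mount/umount purely by mount point.
  sed -i 's|^LABEL=bu-pack-[0-9]*|LABEL=bu-pack-2|' /etc/fstab
  mount /mnt/backup
  # ... run the backup ...
  umount /mnt/backup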
btrfs (general) raid for other filesystems?
Just a random Sunday afternoon thought:

We've got some rather nice variations on the block-level RAID schemes, but instead implemented at the filesystem level in btrfs...

Could the btrfs RAID be coded to be general, so that a filesystem stack could be set up whereby the filesystem-level raid could be used for ANY filesystem?

So for example, we could have the stack:

  filesystem level RAID
          |
          V
      filesystem
          |
          V
      block level

So, an interesting variation could be to have filesystem-level raid operating on ext4 or nilfs or whatever...

Would that be a sensible idea?

Regards,

Martin
btrfs pseudo-drbd
Dear Devs,

Would there be any problem using nbd (/dev/nbdX) devices to gain btrfs-raid across multiple physical hosts across a network? (For a sort of btrfs-drbd! :-) )

Regards,

Martin

http://en.wikipedia.org/wiki/Network_block_device
http://www.drbd.org/
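For the avoidance of doubt, the sort of arrangement being asked about is sketched below (untested; the host name, port, and device letters are invented, and nbd-client option spellings have varied between versions):

  # On the local host: attach a remote disk exported by nbd-server on host2...
  nbd-client host2 10809 /dev/nbd0
  # ...then mirror a local disk against it at the btrfs level.
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/nbd0
  mount /dev/sdb /mnt/net-mirror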
Re: btrfs (general) raid for other filesystems?
On 19/05/13 18:39, Clemens Eisserer wrote:
> Hi Martin,
>> So, an interesting variation could be to have filesystem level raid operating on ext4 or nilfs or whatever... Would that be a sensible idea?
> That's already supported by using LVM. What do you think you would gain from layering on top of btrfs?

md-raid and lvm-raid are raid at the block level. btrfs-raid offers a greater variety and far greater flexibility of raid options, individually for file data and metadata, at the filesystem level.

raid at the filesystem level should also gain higher performance over that of just blindly replicating blocks of binary data across devices at the block level.

My thoughts are to take advantage of the btrfs-raid work being done, but for all filesystems. Hence we can then have a very flexible raid available for whatever filesystem might be best for any underlying device.

OK... So we make all of lvm, md-raid, and drbd redundant!

Regards,

Martin
Re: btrfs (general) raid for other filesystems?
On 19/05/13 20:34, Chris Murphy wrote:
> On May 19, 2013, at 12:59 PM, Martin m_bt...@ml1.co.uk wrote:
>> btrfs-raid offers a greater variety and far greater flexibility of raid options individually for filedata and metadata at the filesystem level.
> Well it really doesn't. The btrfs raid advantages leverage prior work that makes btrfs what it is.

Indeed, the btrfs raid as it evolves looks to be tightly part of btrfs itself, shaped by what is being done in btrfs... And also there is the work going into how the 'raid' semantics operate for data and the filesystem metadata. Also tied into that is storage balancing and load (io bandwidth) balancing, with developers most recently looking at how to move 'hot' data onto preferred physical drives?

>> OK... So we make all of lvm, md-raid, and drbd all redundant!
> No they are different things for different use cases. What you seem to be asking for is a ZFS-like feature that allows other file systems to exist on ZFS, thereby gaining some of the advantages of the underlying file system.

That's going a little too deep... My thoughts are much more shallow. Can the raid and load balancing work being done for btrfs be bundled up so as to permit it to also be used as a filesystem layer that then utilises /any/ underlying filesystem?

So, instead of btrfs-style file-level raid and load balancing only on devices which have been formatted with btrfs, the raid and load balancing operates as a filesystem layer that coordinates storing files on any motley collection of multiple whatever filesystem-on-device.

It is obvious enough for raid1 to 'tee' a file out to multiple filesystems. For the other raids, filenames would need to be munged to denote their multiple parts (simply always append a 6-character index?). raid0 would need a file to be split into parts and then those file parts concatenated upon reading under the original filename. raid5/6 would similarly need file splitting, but with the data redundancy added.

For example, for paranoid redundancy and fast operation:

  raid1 + load balance
          |
          V
  btrfs on HDD1, ext4 on HDD2, NILFS on flash1, nfs to host2

Obviously, doing that loses any features (such as snapshots) not common across all the group.

As for a use case? Would that be a good idea or not? :-)

One thought is that users could set up funky redundant operation across networked devices using nfs.

Another thought is that we go to an awful lot of trouble to accommodate extremely different storage technologies that are only ever going to physically diverge further. For example, we have HDDs and SSDs. We also have much cheaper flash with very limited wear levelling, ideal for 'cold' data. Or even raw flash without all the proprietary firmware obscurity... Hence dedicate a particular filesystem to each, rather than one monster for all? The raid + load balance could be a well-defined layer with no or few special hooks into the lower layers.

All just a thought...

Regards,

Martin
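The raid1 case really is as simple as the word 'tee' suggests; crudely, in shell terms (an illustration of the idea only, not the proposed mechanism):

  # File-level 'raid1': one copy of the stream to each backing filesystem...
  cat somefile | tee /mnt/hdd1-btrfs/somefile > /mnt/hdd2-ext4/somefile
  # ...and read back from whichever copy/device is least busy.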
Re: Virtual Device Support
On 21/05/13 04:37, Chris Murphy wrote:
> On May 20, 2013, at 7:08 PM, Duncan 1i5t5.dun...@cox.net wrote:
>> Chris Murphy posted on Sun, 19 May 2013 12:18:19 -0600 as excerpted:
>>> It seems inconsistent that mount and unmount allows a /dev/ designation, but only mount honors label and UUID.
> Yes. I'm going to contradict myself and point out that mount with label or UUID is made unambiguous via either the default subvolume being mounted, or the -o subvol= option being specified. The volume label and UUID doesn't apply to umount because it's an ambiguous command. You'd have to umount a mountpoint, or possibly a subvolume-specific UUID.

I'll admit that I prefer working with filesystem labels.

This is getting rather semantic... From man umount, this is what umount intends:

  umount [-dflnrv] {dir|device}...

  The umount command detaches the file system(s) mentioned from the
  file hierarchy. A file system is specified by giving the directory
  where it has been mounted. Giving the special device on which the
  file system lives may also work, but is obsolete, mainly because it
  will fail in case this device was mounted on more than one directory.

I guess the ideas of labels and UUIDs and multiple devices came out a few years later?...

For btrfs, umount needs to operate on the default subvol, but with the means for also specifying a specific subvol if needed.

One hook for btrfs to extend what/how 'umount' operates might be to extend what can be done with a /sbin/(u?)mount.btrfs 'helper'?

Regards,

Martin
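Such a helper could be as small as the following sketch (hypothetical: the mount.<type>/umount.<type> helper convention is real in util-linux, but this script, and resolving a label via findmnt, is only an illustration):

  #!/bin/sh
  # /sbin/umount.btrfs (sketch): allow 'umount LABEL=foo' by resolving
  # the label to its mount point first.
  case "$1" in
      LABEL=*) target=$(findmnt -n -o TARGET --source "$1") ;;
      *)       target="$1" ;;
  esac
  exec umount -i "$target"   # -i: do not re-enter this helper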
Re: Virtual Device Support (N-way mirror code)
Duncan,

Thanks for quite a historical summary.

Yep, ReiserFS has stood the test of time very well, and I'm still using and abusing it on various servers all the way from something like a decade ago! More recently I've been putting newer systems on ext4, mainly to take advantage of extents for large files on all disk types, and also deferred allocation to hopefully reduce wear on SSDs.

Meanwhile, I've seen no need to change the ReiserFS on the existing systems, even for the multi-terabyte backups. The near-unlimited file linking is beautiful for creating, in effect, incremental backups spanning years! All on raid1 or raid5, and all remarkably robust.

Enough waffle! :-)

On 21/05/13 04:59, Duncan wrote:
> And hopefully, now that btrfs raid5/6 is in, in a few cycles the N-way mirrored code will make it as well

I too am waiting for the N-way mirrored code, for example to have 3 copies of data/metadata across 4 physical disks. When might that hit? Or is there a stable patch that can be added into kernel 3.8.13?

Regards,

Martin
Re: btrfs pseudo-drbd
On 19/05/13 18:32, Martin wrote:
> Dear Devs,
> Would there be any problem using nbd (/dev/nbdX) devices to gain btrfs-raid across multiple physical hosts across a network? (For a sort of btrfs-drbd! :-) )

As a follow-up, both nbd and AoE look to be active. nbd uses tcp/ip (layer 3) and is network routable; AoE operates on layer 2 (no IP addressing) and so looks to enjoy a lower overhead for better performance. Ideal for putting together your own low-cost SAN!

Network Block Device (TCP version)
http://nbd.sourceforge.net/

ATA Over Ethernet: As an Alternative
http://www.rfxn.com/ata-over-ethernet-as-an-alternative/

EtherDrive® storage and Linux 2.6
http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html

Hope of interest,

Regards,

Martin
btrfs raid1 on 16TB: INFO: task rsync:11022 blocked for more than 180 seconds
Dear Devs,

I have x4 4TB HDDs formatted with:

  mkfs.btrfs -L bu-16TB_0 -d raid1 -m raid1 /dev/sd[cdef]

/etc/fstab mounts with the options:

  noatime,noauto,space_cache,inode_cache

All on kernel 3.8.13.

Upon using rsync to copy some heavily hardlinked backups from ReiserFS, I've so far had various:

  INFO: task rsync:11022 blocked for more than 180 seconds

and one:

  INFO: task btrfs-endio-wri:10816 blocked for more than 180 seconds

Further detail is listed below. What's the fix, or is any debug worthwhile?

Regards,

Martin


x1 of these:

kernel: INFO: task rsync:11022 blocked for more than 180 seconds.
kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
kernel: rsync D 0 11022 11021 0x
kernel: 88012b0ae360 0082 815f1400 000120c0
kernel: 4000 880108a67fd8 810312ac
kernel: 8801115ae748 810e3bad 8801115ae748 0081
kernel: Call Trace:
kernel: [810312ac] ? ns_capable+0x33/0x46
kernel: [810e3bad] ? generic_permission+0x19e/0x1fe
kernel: [810e427d] ? __inode_permission+0x2f/0x6d
kernel: [810e3d63] ? lookup_fast+0x39/0x23c
kernel: [811f464c] ? wait_current_trans.isra.29+0xa9/0xd8
kernel: [810427f0] ? abort_exclusive_wait+0x79/0x79
kernel: [811f5b59] ? start_transaction+0x3de/0x408
kernel: [810f013c] ? setattr_copy+0x8c/0xcb
kernel: [811ff22b] ? btrfs_dirty_inode+0x24/0xa4
kernel: [810effe8] ? notify_change+0x1f0/0x2b8
kernel: [810ff680] ? utimes_common+0x10c/0x135
kernel: [810df445] ? cp_new_stat+0x10d/0x11f
kernel: [810ff79a] ? do_utimes+0xf1/0x129
kernel: [810df7d9] ? sys_newlstat+0x23/0x2b
kernel: [810ff89b] ? sys_utimensat+0x64/0x6b
kernel: [81431652] ? system_call_fastpath+0x16/0x1b

x2 of these:

kernel: INFO: task rsync:11022 blocked for more than 180 seconds.
kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
kernel: rsync D 0 11022 11021 0x
kernel: 88012b0ae360 0082 815f1400 000120c0
kernel: 4000 880108a67fd8 810e3d63
kernel: 88012b0ae360 810e3b65 88001e959ef8 0081
kernel: Call Trace:
kernel: [810e3d63] ? lookup_fast+0x39/0x23c
kernel: [810e3b65] ? generic_permission+0x156/0x1fe
kernel: [810e427d] ? __inode_permission+0x2f/0x6d
kernel: [810e3d63] ? lookup_fast+0x39/0x23c
kernel: [811f464c] ? wait_current_trans.isra.29+0xa9/0xd8
kernel: [810427f0] ? abort_exclusive_wait+0x79/0x79
kernel: [811f5b59] ? start_transaction+0x3de/0x408
kernel: [810f013c] ? setattr_copy+0x8c/0xcb
kernel: [811ff22b] ? btrfs_dirty_inode+0x24/0xa4
kernel: [810effe8] ? notify_change+0x1f0/0x2b8
kernel: [810ff680] ? utimes_common+0x10c/0x135
kernel: [810df445] ? cp_new_stat+0x10d/0x11f
kernel: [810ff79a] ? do_utimes+0xf1/0x129
kernel: [810df7d9] ? sys_newlstat+0x23/0x2b
kernel: [810ff89b] ? sys_utimensat+0x64/0x6b
kernel: [81431652] ? system_call_fastpath+0x16/0x1b

x7 of these:

kernel: INFO: task rsync:11022 blocked for more than 180 seconds.
kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
kernel: rsync D 0 11022 11021 0x
kernel: 88012b0ae360 0082 815f1400 000120c0
kernel: 4000 880108a67fd8 88010d9270c9 810e520b
kernel: 7fffd5adb458 811e2fba 880108a67d88 810e3411
kernel: Call Trace:
kernel: [810e520b] ? path_init+0x1da/0x32c
kernel: [811e2fba] ? reserve_metadata_bytes.isra.59+0x7b/0x741
kernel: [810e3411] ? complete_walk+0x85/0xd6
kernel: [810ecfbc] ? __d_lookup+0x60/0x122
kernel: [811f464c] ? wait_current_trans.isra.29+0xa9/0xd8
kernel: [810427f0] ? abort_exclusive_wait+0x79/0x79
kernel: [811f5b59] ? start_transaction+0x3de/0x408
kernel: [810e5b96] ? kern_path_create+0x78/0x110
kernel: [81200836] ? btrfs_link+0x75/0x185
kernel: [810e4c11] ? vfs_link+0x102/0x184
kernel: [810e7e90] ? sys_linkat+0x16d/0x1c7
kernel: [81431652] ? system_call_fastpath+0x16/0x1b

x1 of these:

kernel: INFO: task btrfs-endio-wri:10816 blocked for more than 180 seconds.
kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
kernel: btrfs-endio-wri D 0 10816 2 0x
kernel: 880129bf4f80 0046 815f1400 000120c0
kernel: 4000 88010c635fd8 8801294404ea 88012944
kernel: 0050 880129e85240
kernel: Call Trace:
kernel: [810d2dbf] ? kmem_cache_alloc+0x3e/0xde
btrfs raid1 on 16TB goes read-only after btrfs: block rsv returned -28
Dear Devs,

I have x4 4TB HDDs formatted with:

  mkfs.btrfs -L bu-16TB_0 -d raid1 -m raid1 /dev/sd[cdef]

/etc/fstab mounts with the options:

  noatime,noauto,space_cache,inode_cache

All on kernel 3.8.13.

Upon using rsync to copy some heavily hardlinked backups from ReiserFS, I've seen the following "block rsv returned -28" repeated 7 times, until there is a call trace for:

  WARNING: at fs/btrfs/super.c:256 __btrfs_abort_transaction+0x3d/0xad()

Then the mount is set read-only.

How to fix or debug?

Thanks,

Martin


kernel: ------------[ cut here ]------------
kernel: WARNING: at fs/btrfs/extent-tree.c:6372 btrfs_alloc_free_block+0xd3/0x29c()
kernel: Hardware name: GA-MA790FX-DS5
kernel: btrfs: block rsv returned -28
kernel: Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_CHECKSUM ipt_rpfilter xt_statistic xt_CT xt_LOG xt_time xt_connlimit xt_realm xt_addrtype xt_comment xt_recent xt_policy xt_nat ipt_ULOG ipt_REJECT ipt_MASQUERADE ipt_ECN ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_tcpmss xt_pkttype xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark xt_CLASSIFY xt_AUDIT xt_tcpudp xt_state iptable_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables bridge stp llc rtc snd_hda_codec_realtek fbcon bitblit softcursor font nouveau video mxm_wmi cfbfillrect cfbimgblt cfbcopyarea i2c_algo_bit evdev drm_kms_helper snd_hda_intel ttm snd_hda_codec drm i2c_piix4 pcspkr snd_pcm serio_raw snd_page_alloc snd_timer k8temp snd i2c_core processor button thermal_sys sky2 wmi backlight fb fbdev pata_acpi firewire_ohci firewire_core pata_atiixp usbhid pata_jmicron sata_sil24
kernel: Pid: 10980, comm: btrfs-transacti Not tainted 3.8.13-gentoo #1
kernel: Call Trace:
kernel: [811e6600] ? btrfs_init_new_buffer+0xef/0xf6
kernel: [810289c8] ? warn_slowpath_common+0x78/0x8c
kernel: [81028a74] ? warn_slowpath_fmt+0x45/0x4a
kernel: [81278f2c] ? ___ratelimit+0xc4/0xd0
kernel: [811e66da] ? btrfs_alloc_free_block+0xd3/0x29c
kernel: [811d68e5] ? __btrfs_cow_block+0x136/0x454
kernel: [811f0d47] ? btrfs_buffer_uptodate+0x40/0x56
kernel: [811d6d8c] ? btrfs_cow_block+0x132/0x19d
kernel: [811da606] ? btrfs_search_slot+0x2f5/0x624
kernel: [811dbc5a] ? btrfs_insert_empty_items+0x5c/0xaf
kernel: [811e5089] ? run_clustered_refs+0x852/0x8e6
kernel: [811e4d20] ? run_clustered_refs+0x4e9/0x8e6
kernel: [811e7f6b] ? btrfs_run_delayed_refs+0x10d/0x289
kernel: [811f4ec6] ? btrfs_commit_transaction+0x3a5/0x93c
kernel: [810427f0] ? abort_exclusive_wait+0x79/0x79
kernel: [811f5a8c] ? start_transaction+0x311/0x408
kernel: [811eed7e] ? transaction_kthread+0xd1/0x16d
kernel: [811eecad] ? btrfs_alloc_root+0x34/0x34
kernel: [810420b3] ? kthread+0xad/0xb5
kernel: [81042006] ? __kthread_parkme+0x5e/0x5e
kernel: [814315ac] ? ret_from_fork+0x7c/0xb0
kernel: [81042006] ? __kthread_parkme+0x5e/0x5e
kernel: ---[ end trace b584e8ceb642293f ]---
kernel: ------------[ cut here ]------------
kernel: WARNING: at fs/btrfs/super.c:256 __btrfs_abort_transaction+0x3d/0xad()
kernel: Hardware name: GA-MA790FX-DS5
kernel: btrfs: Transaction aborted
kernel: Modules linked in: [...]
Re: btrfs raid1 on 16TB goes read-only after btrfs: block rsv returned -28
On 05/06/13 16:05, Hugo Mills wrote:
> On Wed, Jun 05, 2013 at 03:57:42PM +0100, Martin wrote:
>> Dear Devs,
>> I have x4 4TB HDDs formatted with:
>>   mkfs.btrfs -L bu-16TB_0 -d raid1 -m raid1 /dev/sd[cdef]
>> /etc/fstab mounts with the options:
>>   noatime,noauto,space_cache,inode_cache
>> All on kernel 3.8.13.
>> Upon using rsync to copy some heavily hardlinked backups from ReiserFS, I've seen the following "block rsv returned -28" repeated 7 times until there is a call trace for:
> This is ENOSPC. Can you post the output of btrfs fi df /mountpoint and btrfs fi show, please?

btrfs fi df:

  Data, RAID1: total=2.85TB, used=2.84TB
  Data: total=8.00MB, used=0.00
  System, RAID1: total=8.00MB, used=412.00KB
  System: total=4.00MB, used=0.00
  Metadata, RAID1: total=27.00GB, used=25.82GB
  Metadata: total=8.00MB, used=0.00

btrfs fi show:

  Label: 'bu-16TB_0' uuid: 8fd9a0a8-9109-46db-8da0-396d9c6bc8e9
          Total devices 4 FS bytes used 2.87TB
          devid 4 size 3.64TB used 1.44TB path /dev/sdf
          devid 3 size 3.64TB used 1.44TB path /dev/sde
          devid 1 size 3.64TB used 1.44TB path /dev/sdc
          devid 2 size 3.64TB used 1.44TB path /dev/sdd

And df -h:

  Filesystem   Size  Used  Avail  Use%  Mounted on
  /dev/sde     15T   5.8T  8.9T   40%   /mnt/sata16

>> WARNING: at fs/btrfs/super.c:256 __btrfs_abort_transaction+0x3d/0xad().
>> Then, the mount is set read-only.
>> How to fix or debug?
>> Thanks, Martin
>> [...]
Re: btrfs raid1 on 16TB goes read-only after btrfs: block rsv returned -28
On 05/06/13 16:43, Hugo Mills wrote:
> On Wed, Jun 05, 2013 at 04:28:33PM +0100, Martin wrote:
>> [...]
> OK, so you've got plenty of space to allocate. There were some issues in this area (block reserves and ENOSPC, and I think specifically addressing the issue of ENOSPC when there's space available to allocate) that were fixed between 3.8 and 3.9 (and probably some between 3.9 and 3.10-rc as well), so upgrading your kernel _may_ help here.
>
> Something else that may possibly help as a sticking-plaster is to write metadata more slowly, so that you don't have quite so much of it waiting to be written out for the next transaction. Practically, this may involve things like running sync in a loop. But it's definitely a horrible hack that may help if you're desperate for a quick fix until you can stop creating metadata so quickly and upgrade your kernel...
>
> Hugo.

Thanks for that. I can give kernel 3.9.4 a try.

For a giggle, I'll try first with nice 19 and syncs in a loop...

One confusing bit: why the Data, RAID1: total=2.85TB from btrfs fi df?

Thanks,

Martin
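Hugo's sticking-plaster is literally just something like this (a throwaway sketch; the 5-second interval is an arbitrary choice):

  # Flush dirty data/metadata frequently so less piles up per transaction,
  # while the rsync runs at reduced priority in another shell.
  while sleep 5; do sync; done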
Re: btrfs raid1 on 16TB goes read-only after btrfs: block rsv returned -28
On 05/06/13 17:24, David Sterba wrote:
> On Wed, Jun 05, 2013 at 04:43:29PM +0100, Hugo Mills wrote:
>> OK, so you've got plenty of space to allocate. There were some issues in this area (block reserves and ENOSPC, and I think specifically addressing the issue of ENOSPC when there's space available to allocate) that were fixed between 3.8 and 3.9 (and probably some between 3.9 and 3.10-rc as well), so upgrading your kernel _may_ help here.
> This is supposed to be fixed by https://patchwork-mail2.kernel.org/patch/2558911/ that went to 3.10-rc with some follow-up patches, so it might not be enough as a standalone fix.
>
> Unless you really need 'inode_cache', remove it from the mount options.

Thanks for that. Remounting without the inode_cache option looks to be allowing rsync to continue. (No sync loop needed.)

For a 16TB raid1 on kernel 3.8.13, any good mount options to try?

For that size of storage and with many hard links, is there any advantage in formatting with a leaf/node size greater than the default 4 kbytes?

Thanks,

Martin
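(For reference, the larger-leaf experiment being asked about would be along these lines, untested; the mkfs.btrfs of that era took -l/-n for leaf and node size, and 16k is just an example value:

  mkfs.btrfs -L bu-16TB_0 -l 16384 -n 16384 -d raid1 -m raid1 /dev/sd[cdef]

A larger leaf packs more metadata items per tree block, which can matter on hardlink-heavy trees.)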
Re: btrfs raid1 on 16TB goes read-only after btrfs: block rsv returned -28
On 05/06/13 22:12, Martin wrote:
> On 05/06/13 17:24, David Sterba wrote:
>> Unless you really need 'inode_cache', remove it from the mount options.
> Thanks for that. Remounting without the inode_cache option looks to be allowing rsync to continue. (No sync loop needed.)

rsync is still running OK, but the data copying is awfully slow... The copy across is going to take many days at this rate :-(

> For a 16TB raid1 on kernel 3.8.13, any good mount options to try?
> For that size of storage and with many hard links, is there any advantage formatting with leaf/node size greater than the default 4kBytes?

Any hints/tips? ;-)

Regards,

Martin
raid1 inefficient unbalanced filesystem reads
On kernel 3.8.13, using two equal-performance SATA II HDDs formatted btrfs raid1 for both data and metadata:

The second disk appears to suffer about x8 the read activity of the first disk. This causes the second disk to quickly get maxed out whilst the first disk remains almost idle. Total writes to the two disks are equal.

This is noticeable, for example, when running emerge --sync or running compiles on Gentoo.

Is this a known feature/problem, or worth looking/checking further?

Regards,

Martin
Re: raid1 inefficient unbalanced filesystem reads
On 28/06/13 16:39, Hugo Mills wrote:
> On Fri, Jun 28, 2013 at 11:34:18AM -0400, Josef Bacik wrote:
>> On Fri, Jun 28, 2013 at 02:59:45PM +0100, Martin wrote:
>>> On kernel 3.8.13: Using two equal performance SATA II HDDs, formatted for btrfs raid1 for both data and metadata: The second disk appears to suffer about x8 the read activity of the first disk. [...] Is this a known feature/problem or worth looking/checking further?
>> So we balance based on pids, so if you have one process that's doing a lot of work it will tend to be stuck on one disk, which is why you are seeing that kind of imbalance.
> The other scenario is if the sequence of processes executed to do each compilation step happens to be an even number, then the heavy-duty file-reading parts will always hit the same parity of PID number. If each tool has, say, a small wrapper around it, then the wrappers will all run as (say) odd PIDs, and the tools themselves will run as even pids...

Ouch! Good find...

To just test with a:

  for a in {1..4} ; do ( dd if=/dev/zero of=$a bs=10M count=100 ) & done

ps shows:

  martin  9776  9.6  0.1  18740 10904 pts/2  D  17:15  0:00 dd
  martin  9778  8.5  0.1  18740 10904 pts/2  D  17:15  0:00 dd
  martin  9780  8.5  0.1  18740 10904 pts/2  D  17:15  0:00 dd
  martin  9782  9.5  0.1  18740 10904 pts/2  D  17:15  0:00 dd

More to the story, from atop: one disk maxed out with x3 dd on one CPU core, the second disk utilised by one dd on the second CPU core...

It looks like using a simple round-robin is pathological for an even number of disks, or indeed if you have a mix of disks with different capabilities. File access will pile up on the slowest of the disks, or on whatever HDD coincides with the process (pid) creation multiple...

So... an immediate work-around is to go all SSD, or work in odd multiples of HDDs?!

Rather than that: any easy tweaks available, please?

Thanks,

Martin
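The pid-parity effect is easy to see directly (an illustrative sketch; it assumes the reported pid-modulo-number-of-mirrors selection, and uses the subshell pid $! as a stand-in for the dd pid):

  for a in {1..4}; do
      ( dd if=/dev/zero of=$a bs=10M count=100 ) &
      echo "worker pid $!: would read from mirror $(( $! % 2 ))"
  done

With pids handed out in even steps, every worker lands on the same mirror.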
Re: raid1 inefficient unbalanced filesystem reads
On 28/06/13 18:04, Josef Bacik wrote:
> On Fri, Jun 28, 2013 at 09:55:31AM -0700, George Mitchell wrote:
>> On 06/28/2013 09:25 AM, Martin wrote:
>> ... flow of continual reads and writes very balanced across the first four drives in this set and then, like a big burp, a huge write on the fifth drive. But absolutely no reads from the fifth drive so far. Very
> Well that is interesting, writes should be relatively balanced across all drives. Granted we try and coalesce all writes to one drive, flush those out, and go on to the next drive, but you shouldn't be seeing the kind of activity you are currently seeing. I will take a look at it next week and see what's going on.
>
> As for reads we could definitely be much smarter. I would like to do something like this (I'm spelling it out in case somebody wants to do it before I get to it):
>
> 1) Keep a per-device counter of how many read requests have been done.
> 2) Make the PID-based decision, and then check and see if the device we've chosen has many more read requests than the other device. If so, choose the other device.
>    - EXCEPTION: if we are doing a big sequential read we want to stay on one disk, since the head will already be in place on the disk we've been pegging, so ignore the logic for this. This means saving the last sector we read from and comparing it to the next sector we are going to read from; MD does this.
>    - EXCEPTION to the EXCEPTION: if the devices are SSDs then don't bother doing this work; always maintain evenness amongst the devices.
>
> If somebody were going to do this, they'd just have to find the places where we call find_live_mirror in volumes.c and adjust the logic to hand find_live_mirror the entire map, and then go through the devices and make their decision. You'd still need to keep the device replace logic.

Mmmm... I'm not sure trying to balance historical read/write counts is the way to go...

What happens for the use case of an SSD paired up with a HDD? (For example, an SSD and a similarly sized Raptor or enterprise SCSI?...) Or even just JBODs of a mishmash of different speeds?

Rather than trying to balance io counts, can a realtime utilisation check be made to go for the least busy? That can be biased secondly to balance IO counts if some 'non-performance' flag/option is set/wanted by the user. Otherwise, go firstly for what is recognised to be the fastest or least busy?...

Good find and good note! And thanks greatly for so quickly picking this up.

Thanks,

Martin
btrfsck output: What does it all mean?
This is the btrfsck output for a real-world rsync backup onto a btrfs raid1 mirror across 4 drives (yes, I know that at the moment for btrfs raid1 there are only ever two copies of the data...):

  checking extents
  checking fs roots
  root 5 inode 18446744073709551604 errors 2000
  root 5 inode 18446744073709551605 errors 1
  root 256 inode 18446744073709551604 errors 2000
  root 256 inode 18446744073709551605 errors 1
  found 3183604633600 bytes used err is 1
  total csum bytes: 3080472924
  total tree bytes: 28427821056
  total fs tree bytes: 23409475584
  btree space waste bytes: 4698218231
  file data blocks allocated: 3155176812544
   referenced 3155176812544
  Btrfs Btrfs v0.19
  Command exited with non-zero status 1

So: What does that little lot mean?

The drives were mounted and active during an unexpected power-plug pull :-(

Safe to mount again, or are there other checks/fixes needed?

Thanks,

Martin
Re: raid1 inefficient unbalanced filesystem reads
On 29/06/13 10:41, Russell Coker wrote:
> On Sat, 29 Jun 2013, Martin wrote:
>> Mmmm... I'm not sure trying to balance historical read/write counts is the way to go... What happens for the use case of an SSD paired up with a HDD? [...] Rather than trying to balance io counts, can a realtime utilisation check be made to go for the least busy?
> It would also be nice to be able to tune this. For example I've got a RAID-1 array that's mounted noatime, hardly ever written, and accessed via NFS on 100baseT. It would be nice if one disk could be spun down for most of the time and save 7W of system power. Something like the --write-mostly option of mdadm would be good here.

For that case, a --read-mostly would be more apt ;-)

Hence, add a check to preferentially use the last disk used if all are idle?

> Also it should be possible for a RAID-1 array to allow faster reads for a single process reading a single file if the file in question is fragmented.

That sounds good but complicated: gathering and sorting the fragments into groups per disk... Or is something like that already done by the block device elevator for HDDs? Also, is head seek optimisation turned off for SSD accesses?

(This is sounding like a lot more than just swapping the current

  current->pid % map->num_stripes

to a

  pseudorandom_hash(current->pid) % map->num_stripes

... ;-) )

Is there any readily accessible present state, such as disk activity, queue length, or access latency, available for the btrfs process to read?

I suspect a good first guess to cover many conditions would be to 'simply' choose whichever device is powered up and has the lowest current latency, or, if idle, has the lowest historical latency...

Regards,

Martin
Which better: rsync or snapshot + rsync --delete
Which is 'best' or 'faster'?

Take a snapshot of an existing backup and then rsync --delete into that, to make a backup of some other filesystem?

Or use rsync --link-dest to link a new backup tree against a previous backup tree for that same filesystem?

Which case does btrfs handle better?

Would there be any problems doing this over an nfs mount of the btrfs?

Both cases can take advantage of the raid and dedup and compression features of btrfs. Would taking a btrfs snapshot be better than rsync creating the hard links to unchanged files?

Any other considerations? (There are perhaps about 5% new or changed files each time.)

Thanks,

Martin
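Spelled out, the two candidates look like this (a sketch only; the paths are invented):

  # (a) snapshot + rsync --delete: let btrfs CoW carry the history
  btrfs subvolume snapshot /bu/current "/bu/$(date +%F)"
  rsync -a --delete /source/ /bu/current/

  # (b) rsync --link-dest: let rsync hardlink unchanged files
  rsync -a --delete --link-dest=/bu/yesterday /source/ "/bu/$(date +%F)/"

In (a) unchanged files share extents via the snapshot; in (b) they share inodes via hard links.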
Corrupt btrfs filesystem recovery... (Due to *sata* errors)
This may be of interest for the fail cause as well as how to recover...

I have a known-good 2TB (4 kbyte physical sectors) HDD that supports sata3 (6Gbit/s).

Writing data via rsync at the 6Gbit/s sata rate caused IO errors for just THREE sectors... Yet btrfsck bombs out with LOTS of errors...

How best to recover from this?

(This is a 'backup' disk so not 'critical', but it would be nice to avoid rewriting about 1.5TB of data over the network...)

Is there an obvious sequence/recipe to follow for recovery?

Thanks,

Martin


Further details:

Linux 3.10.7-gentoo-r1 #2 SMP Fri Sep 27 23:38:06 BST 2013 x86_64 AMD E-450 APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux

# btrfs version
Btrfs v0.20-rc1-358-g194aa4a

Single 2TB HDD formatted with a default mkfs.btrfs. The entire disk (/dev/sdc) is btrfs (no partitions).

The IO errors were:

kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248

Lots of sata error noise omitted.

The sata problem was fixed by limiting libata to 3Gbit/s:

  libata.force=3.0G

added onto the Grub kernel line.

Running badblocks twice in succession (non-destructive data test!) shows no surface errors and no further errors on the sata interface.

Running btrfsck twice gives the same result, failing with:

  Ignoring transid failure
  btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino != key->objectid || rec->refs > 1)' failed.

An abridged summary is:

checking extents
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907185135616
bad block 907185135616
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
leaf parent key incorrect 915444883456
bad block 915444883456
leaf parent key incorrect 915445014528
bad block 915445014528
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907183771648
bad block 907183771648
leaf parent key incorrect 907183779840
bad block 907183779840
leaf parent key incorrect 907183783936
bad block 907183783936
[...]
leaf parent key incorrect 907185913856
bad block 907185913856
leaf parent key incorrect 907185917952
bad block 907185917952
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found
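(For anyone repeating the surface check: the non-destructive read-write mode referred to is badblocks -n, e.g.

  badblocks -nsv /dev/sdc

where -n preserves the existing data, -s shows progress, and -v is verbose. Don't run it against a mounted device.)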
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
Chris,

All agreed. Further comment inlined:

(I should have mentioned more prominently that the hardware problem has been worked around by limiting the sata to 3Gbit/s on bootup.)

On 28/09/13 21:51, Chris Murphy wrote:
> On Sep 28, 2013, at 1:26 PM, Martin m_bt...@ml1.co.uk wrote:
>> Writing data via rsync at the 6Gbit/s sata rate caused IO errors for just THREE sectors... Yet btrfsck bombs out with LOTs of errors...
> Any fs will bomb out on write errors.

Indeed. However, are not the sata errors reported back to btrfs so that it knows whatever parts haven't been updated? Is there not a mechanism to then go read-only? Also, should not the journal limit the damage?

>> How best to recover from this?
> Why you're getting I/O errors at SATA 6Gbps link speed needs to be understood. Is it a bad cable? Bad SATA port? Drive or controller firmware bug? Or libata driver bug?

I systematically eliminated such things as leads, PSU, and NCQ. Limiting libata to only use 3Gbit/s is the one change that gives a consistent fix. The HDD and motherboard both support 6Gbit/s, but hey-ho, that's an experiment I can try again some other time when I have another HDD/SSD to test in there.

In any case, for the existing HDD + motherboard combination, using sata2 rather than sata3 speeds shouldn't noticeably impact performance. (Other than that sata2 works reliably and so is infinitely better for this case!)

>> Lots of sata error noise omitted.
> An entire dmesg might still be useful. I don't know if the list will handle the whole dmesg in one email, but it's worth a shot (reply to an email in the thread, don't change the subject).

I can email directly if of use/interest. Let me know offlist.

> do a smartctl -x on the drive, chances are it's recording PHY Event

(smartctl -x errors shown further down...)

Nothing untoward noticed:

# smartctl -a /dev/sdc

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-...
LU WWN Device Id: ...
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Sep 28 23:35:57 2013 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           9
  3 Spin_Up_Time            0x0027 253   159   021    Pre-fail Always  -           1983
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           55
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 099   099   000    Old_age  Always  -           800
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           53
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           31
193 Load_Cycle_Count        0x0032 199   199   000    Old_age  Always  -           3115
194 Temperature_Celsius     0x0022 118   110   000    Old_age  Always  -           32
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

# smartctl -x /dev/sdc

... also shows the errors it saw (just the last 4 copied, which look timed for when the HDD was last exposed to 6Gbit/s sata):

Error 46 [21] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 00 08 00 00 6c 1a 4b b0 e0 00  Error: AMNF 8 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
On 28/09/13 20:26, Martin wrote:
> ... btrfsck bombs out with LOTs of errors...
> How best to recover from this?
> (This is a 'backup' disk so not 'critical', but it would be nice to avoid rewriting about 1.5TB of data over the network...)
> Is there an obvious sequence/recipe to follow for recovery?

I've got the drive reliably working with the sata limited to 3Gbit/s.

What is the best sequence to try to tidy up and carry on with the 1.5TB or so of data on there, rather than starting from scratch?

So far, I've only run btrfsck since the corruption errors for the three sectors...

Suggestions for recovery?

Thanks,

Martin
Re: Corrupt btrfs filesystem recovery... What best instructions?
On 28/09/13 23:54, Martin wrote:
> On 28/09/13 20:26, Martin wrote:
>> ... btrfsck bombs out with LOTs of errors...
>> How best to recover from this?
> I've got the drive reliably working with the sata limited to 3Gbit/s.
> What is the best sequence to try to tidy up and carry on with the 1.5TB or so of data on there, rather than starting from scratch?
> So far, I've only run btrfsck since the corruption...

So... Any options for btrfsck to fix things?

Or is anything/everything that is fixable automatically fixed on the next mount?

Or should a

  btrfs scrub start /dev/sdX

be run first? Or? What does (or can) btrfs do for recovery?

Advice welcomed,

Thanks,

Martin
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
Chris,

Thanks for good comment/discussion.

On 29/09/13 03:06, Chris Murphy wrote:
> On Sep 28, 2013, at 4:51 PM, Martin m_bt...@ml1.co.uk wrote:
> Stick with forced 3Gbps, but I think it's worthwhile to find out what the actual problem is. One day you forget about this 3Gbps SATA link, upgrade or regress to another kernel and you don't have the 3Gbps forced speed on the parameter line, and poof - you've got more problems again. The hardware shouldn't negotiate a 6Gbps link and then do a backwards swan dive at 30,000' with your data as if it's an afterthought.

I've got an engineer's curiosity, so that one is very definitely marked for revisiting at some time... If only to blog that the x-y-z combination is a tar pit for your data...

>> In any case, for the existing HDD + motherboard combination, using sata2 rather than sata3 speeds shouldn't noticeably impact performance. (Other than that sata2 works reliably and so is infinitely better for this case!)
> It's true.

Well, the IO data rate for badblocks is exactly the same as before, limited by the speed of the physical rust spinning and the data density...

> I would also separately unmount the file system, note the latest kernel message, then mount the file system and see if there are any kernel messages that might indicate recognition of problems with the fs. I would not use btrfsck --repair until someone says it's a good idea. That person would not be me.

It is sat unmounted until some informed opinion is gained...

Thanks again for your notes,

Regards,

Martin
Re: Corrupt btrfs filesystem recovery... What best instructions?
On 29/09/13 06:11, Duncan wrote: Martin posted on Sun, 29 Sep 2013 03:10:37 +0100 as excerpted: So... Any options for btrfsck to fix things? Or is anything/everything that is fixable automatically fixed on the next mount? Or should: btrfs scrub /dev/sdX be run first? Or? What does btrfs do (or can do) for recovery? Here's a general-case answer (courtesy gmane) to the order in which to try recovery question, that Hugo posted a few weeks ago: http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999 Thanks for that. Very well found! The instructions from Hugo are: Let's assume that you don't have a physical device failure (which is a different set of tools -- mount -odegraded, btrfs dev del missing). First thing to do is to take a btrfs-image -c9 -t4 of the filesystem, and keep a copy of the output to show josef. :) Then start with -orecovery and -oro,recovery for pretty much anything. If those fail, then look in dmesg for errors relating to the log tree -- if that's corrupt and can't be read (or causes a crash), use btrfs-zero-log. If there's problems with the chunk tree -- the only one I've seen recently was reporting something like can't map address -- then chunk-recover may be of use. After that, btrfsck is probably the next thing to try. If options -s1, -s2, -s3 have any success, then btrfs-select-super will help by replacing the superblock with one that works. If that's not going to be useful, fall back to btrfsck --repair. Finally, btrfsck --repair --init-extent-tree may be necessary if there's a damaged extent tree. Finally, if you've got corruption in the checksums, there's --init-csum-tree. Hugo. Those will be tried next... Note that in specific cases someone who knew what they were doing could omit some steps and focus on others, but I'm not at that level of know what I'm doing, so... Scrub... would go before this, if it's useful. But scrub depends on a second, valid copy being available in ordered to fix the bad-checksum one. On a single device btrfs, btrfs defaults to DUP metadata (unless it's SSD), so you may have a second copy for that, but you won't have a second copy of the data. This is a very strong reason to go btrfs raid1 mode (for both data and metadata) if you can, because that gives you a second copy of everything, thereby actually making use of btrfs' checksum and scrub ability. (Unfortunately, there is as yet no way to do N-way mirroring, there's only the second copy not a third, no matter how many devices you have in that raid1.) Finally, if you mentioned your kernel (and btrfs-tools) version(s) I missed it, but [boilerplate recommendation, stressed repeatedly both in the wiki and on-list] btrfs being still labeled experimental and under serious development, there's still lots of bugs fixed every kernel release. So as Chris Murphy said, if you're not on 3.11-stable or 3.12- rcX already, get there. Not only can the safety of your data depend on it, but by choosing to run experimental we're all testers, and our reports if something does go wrong will be far more usable if we're on a current kernel. Similarly, btrfs-tools 0.20-rc1 is already somewhat old; you really should be on a git-snapshot beyond that. (The master branch is kept stable, work is done in other branches and only merged to master when it's considered suitably stable, so a recently updated btrfs-tools master HEAD is at least in theory always the best possible version you can be running. If that's ever NOT the case, then testers need to be reporting that ASAP so it can be fixed, too.) 
Back to the kernel, it's worth noting that 3.12-rcX includes an option that turns off most btrfs BUG_ONs by default. Unless you're a btrfs developer (which it doesn't sound like you are), you'll want to activate that (turning off the BUG_ONs), as they're not helpful for ordinary users and just force unnecessary reboots when something minor and otherwise immediately recoverable goes wrong. That's just one of the latest fixes. Looking up what's available for Gentoo, the maintainers there look to be nicely sharp with multiple versions available all the way up to kernel 3.11.2... There's also the latest available from btrfs tools with sys-fs/btrfs-progs ... OK, so onto the cutting edge to compile them in... Thanks all, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
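(A follow-up note on that 3.12 option: I believe, though it is worth double-checking against the 3.12-rc btrfs Kconfig, that it is the new assertion switch, which defaults to off for exactly the reason Duncan gives:
CONFIG_BTRFS_ASSERT=n   # default: the extra developer BUG_ON/assertion checks are compiled out
)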
Re: Corrupt btrfs filesystem recovery... What best instructions?
On 29/09/13 22:29, Martin wrote: Looking up what's available for Gentoo, the maintainers there look to be nicely sharp with multiple versions available all the way up to kernel 3.11.2... That is being pulled in now as expected: sys-kernel/gentoo-sources-3.11.2 There's also the latest available from btrfs tools with sys-fs/btrfs-progs ... Oddly, that caused emerge to report: [ebuild UD ] sys-fs/btrfs-progs-0.19.11 [0.20_rc1_p358] 0 kB which is a *downgrade*. Hence, I'm keeping with the 0.20_rc1_p358. OK, so onto the cutting edge to compile them in... Interesting times as is said in a certain part of the world... Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
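(For any fellow Gentoo users hitting the same downgrade prompt: one way, sketched here with the version atom from above, is to mask the older versions so that emerge leaves the newer snapshot alone:
# /etc/portage/package.mask
<sys-fs/btrfs-progs-0.20_rc1_p358
)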
Re: Corrupt btrfs filesystem recovery... What best instructions?
So... The fix: ( Summary: Mounting -o recovery,noatime worked well and allowed a diff check to complete for all but one directory tree. So very nearly all the data is fine. Deleting the failed directory tree caused a call stack dump and eventually: kernel: parent transid verify failed on 915444822016 wanted 16974 found 13021 kernel: BTRFS info (device sdc): failed to delete reference to eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667 kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5 IO failure kernel: BTRFS info (device sdc): forced readonly Greater detail listed below. What next best to try? Safer to try again but this time with no_space_cache,no_inode_cache? Thanks, Martin ) On 29/09/13 22:29, Martin wrote: On 29/09/13 06:11, Duncan wrote: What does btrfs do (or can do) for recovery? Here's a general-case answer (courtesy gmane) to the order-in-which-to-try-recovery question, that Hugo posted a few weeks ago: http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999 Thanks for that. Very well found! The instructions from Hugo are: Let's assume that you don't have a physical device failure (which is a different set of tools -- mount -odegraded, btrfs dev del missing). First thing to do is to take a btrfs-image -c9 -t4 of the filesystem, and keep a copy of the output to show josef. :) Then start with -orecovery and -oro,recovery for pretty much anything. For anyone following this, first a health warning: If your data is in any way critical or important, then you should already have a backup copy elsewhere. If not, best make a binary image copy of your disk first! OK... So with the latest kernel (3.11.2) and btrfs tools (Btrfs v0.20-rc1-358-g194aa4a), the sequence went: mount -v -t btrfs -o recovery LABEL=bu_A /mnt/bu_A (From syslog:) kernel: device label bu_A devid 1 transid 17222 /dev/sdc kernel: btrfs: enabling auto recovery kernel: btrfs: disk space caching is enabled kernel: btrfs: bdev /dev/sdc errs: wr 0, rd 27, flush 0, corrupt 0, gen 0 Running through a diff check for part of the backups, syslog reported: kernel: btrfs read error corrected: ino 1 off 915433144320 (dev /dev/sdc sector 1813661856) Also, the HDD was showing quite a few write operations so... Is noatime set?... Ooops... Didn't include a ro... So, killed the diff check and remounted: mount -v -t btrfs -o remount,recovery,noatime /mnt/bu_A mount: /dev/sdc mounted on /mnt/bu_A kernel: btrfs: enabling inode map caching kernel: btrfs: enabling auto recovery kernel: btrfs: disk space caching is enabled And running the diff check again...
Now zero writes to the HDD :-) Various syslog messages were given: kernel: parent transid verify failed on 907185135616 wanted 15935 found 12264 kernel: btrfs read error corrected: ino 1 off 907185135616 (dev /dev/sdc sector 1781823824) kernel: parent transid verify failed on 907185143808 wanted 15935 found 12264 kernel: btrfs read error corrected: ino 1 off 907185143808 (dev /dev/sdc sector 1781823840) kernel: parent transid verify failed on 907185139712 wanted 15935 found 12264 kernel: btrfs read error corrected: ino 1 off 907185139712 (dev /dev/sdc sector 1781823832) kernel: parent transid verify failed on 907185152000 wanted 15935 found 10903 kernel: btrfs read error corrected: ino 1 off 907185152000 (dev /dev/sdc sector 1781823856) kernel: parent transid verify failed on 907183783936 wanted 15935 found 12263 kernel: btrfs read error corrected: ino 1 off 907183783936 (dev /dev/sdc sector 1781821184) kernel: parent transid verify failed on 907183792128 wanted 15935 found 10903 kernel: btrfs read error corrected: ino 1 off 907183792128 (dev /dev/sdc sector 1781821200) kernel: parent transid verify failed on 907183796224 wanted 15935 found 12263 kernel: btrfs read error corrected: ino 1 off 907183796224 (dev /dev/sdc sector 1781821208) kernel: parent transid verify failed on 907183841280 wanted 15935 found 10903 kernel: btrfs read error corrected: ino 1 off 907183841280 (dev /dev/sdc sector 1781821296) kernel: parent transid verify failed on 907183878144 wanted 15935 found 12263 kernel: btrfs read error corrected: ino 1 off 907183878144 (dev /dev/sdc sector 1781821368) kernel: parent transid verify failed on 907183874048 wanted 15935 found 12263 kernel: btrfs read error corrected: ino 1 off 907183874048 (dev /dev/sdc sector 1781821360) kernel: verify_parent_transid: 25 callbacks suppressed kernel: parent transid verify failed on 915431288832 wanted 16974 found 16972 kernel: repair_io_failure: 25 callbacks suppressed kernel: btrfs read error corrected: ino 1 off 915431288832 (dev /dev/sdc sector 1813658232) kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 [...] One directory tree failed the diff checks so I 'mv'-ed that one tree to rename it out of the way and then ran an rm -Rf to remove
Re: Corrupt btrfs filesystem recovery... What best instructions?
What best to try next? mount -o recovery,noatime btrfsck: --repair (try to repair the filesystem) --init-csum-tree (create a new CRC tree) --init-extent-tree (create a new extent tree) or is a scrub worthwhile? The fail and switch to read-only occurred whilst trying to delete a known bad directory tree. No worries for losing the data in that. But how best to clean up the filesystem errors? Thanks, Martin On 03/10/13 17:56, Martin wrote: On 03/10/13 01:49, Martin wrote: Summary: Mounting -o recovery,noatime worked well and allowed a diff check to complete for all but one directory tree. So very nearly all the data is fine. Deleting the failed directory tree caused a call stack dump and eventually: kernel: parent transid verify failed on 915444822016 wanted 16974 found 13021 kernel: BTRFS info (device sdc): failed to delete reference to eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667 kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5 IO failure kernel: BTRFS info (device sdc): forced readonly Greater detail listed below. What next best to try? Safer to try again but this time with no_space_cache,no_inode_cache? Thanks, Martin Next best step to try? Remount -o recovery,noatime again? In the meantime, trying: btrfsck /dev/sdc gave the following output + abort: parent transid verify failed on 915444523008 wanted 16974 found 13021 Ignoring transid failure btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino != key->objectid || rec->refs > 1)' failed. free space inode generation (0) did not match free space cache generation (1625) free space inode generation (0) did not match free space cache generation (1607) free space inode generation (0) did not match free space cache generation (1604) free space inode generation (0) did not match free space cache generation (1606) free space inode generation (0) did not match free space cache generation (1620) free space inode generation (0) did not match free space cache generation (1626) free space inode generation (0) did not match free space cache generation (1609) free space inode generation (0) did not match free space cache generation (1653) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1649) (There was no syslog output.) Full btrfsck listing attached. Suggestions please? Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
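(A follow-up on the is-a-scrub-worthwhile question above: a scrub is at least cheap to try, though note, as Duncan said earlier in the thread, that on a single device it can only repair metadata from the DUP copy, not data. For example:
btrfs scrub start -B /mnt/bu_A   # -B runs in the foreground and prints statistics at the end
btrfs scrub status /mnt/bu_A     # or, without -B, poll progress and error counts here
)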
btrfs recovery: What do the commands actually do?
There are ad-hoc comments for various commands to recover from filesystem errors. But what do they actually do and when should which command be used? (The wiki gives scant indication other than to 'blindly' try things...) There's: mount -o recovery,noatime btrfsck: --repair (try to repair the filesystem) --init-csum-tree (create a new CRC tree) --init-extent-tree (create a new extent tree) And there is scrub... What do they do exactly and what are the indicators to try using them? Or when should you 'give up' on a filesystem and just retrieve whatever data can be read and start again? All that lot sounds good for a wiki page ;-) Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
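(As a strawman for that wiki page, my current, unofficial understanding from this thread; corrections welcomed:
mount -o recovery    falls back to older tree roots when the latest root is damaged
btrfs-zero-log       discards a corrupt log tree (losing the last few seconds of fsync'd writes)
btrfs scrub          re-reads everything, verifies checksums, and rewrites bad copies where a good duplicate exists (DUP/raid1)
btrfsck --repair     attempts an in-place repair of tree and reference errors
--init-csum-tree     throws away and rebuilds the checksum (CRC) tree, so existing data is then taken as-is
--init-extent-tree   throws away and rebuilds the extent tree from the other trees
)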
Re: btrfs recovery: What do the commands actually do?
On 04/10/13 19:32, Duncan wrote: Martin posted on Fri, 04 Oct 2013 16:47:19 +0100 as condensed: There are ad-hoc comments for various commands to recover from filesystem errors. But what do they actually do and when should which command be used? What do they do exactly and what are the indicators to try using them? Or when should you 'give up' on a filesystem and just retrieve whatever data can be read and start again? All that lot sounds good for a wiki page ;-) I recognize your name so you're a regular poster and may well have seen this recover-steps/order post from Hugo Mills, but you didn't mention it, so... http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999 Hail fellow Gentoo-er ;-) This is a prod from the thread: http://article.gmane.org/gmane.comp.file-systems.btrfs/28775 As you suggest, that should really go in the wiki (maybe it's there already since that post, I haven't actually checked recently, but your post reads as if you looked and couldn't find a recovery list of this nature), but I've not gotten around to creating an account for myself there yet and committing it, and if no one else has either... But I do have it bookmarked for posting here, and for the day I do create myself that wiki account, if no one else has gotten to it by then... And while that answers what and in what order, it doesn't cover what the commands actually do or why you'd /use/ that order, and that'd be very good to add as well. Yep. I'm using this as a bit of a test case as to how best to recover from whatever inevitable hiccups. All the more important to gain a good understanding before doing similar things to 16TB arrays... Comment/advice welcomed (please). Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Corrupt btrfs filesystem recovery... What best instructions?
No comment so blindly trying: btrfsck --repair /dev/sdc gave the following abort: btrfsck: extent-tree.c:2736: alloc_reserved_tree_block: Assertion `!(ret)' failed. Full output attached. All on: 3.11.2-gentoo Btrfs v0.20-rc1-358-g194aa4a For a 2TB single HDD formatted with defaults. What next? Thanks, Martin In the meantime, trying: btrfsck /dev/sdc gave the following output + abort: parent transid verify failed on 915444523008 wanted 16974 found 13021 Ignoring transid failure btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec-ino != key-objectid || rec-refs 1)' failed. id not match free space cache generation (1625) free space inode generation (0) did not match free space cache generation (1607) free space inode generation (0) did not match free space cache generation (1604) free space inode generation (0) did not match free space cache generation (1606) free space inode generation (0) did not match free space cache generation (1620) free space inode generation (0) did not match free space cache generation (1626) free space inode generation (0) did not match free space cache generation (1609) free space inode generation (0) did not match free space cache generation (1653) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1649) (There was no syslog output.) Full btrfsck listing attached. Suggestions please? Thanks, Martin checking extents leaf parent key incorrect 907183771648 bad block 907183771648 leaf parent key incorrect 907183779840 bad block 907183779840 leaf parent key incorrect 907183882240 bad block 907183882240 leaf parent key incorrect 907185160192 bad block 907185160192 leaf parent key incorrect 907185201152 bad block 907185201152 leaf parent key incorrect 915432497152 bad block 915432497152 leaf parent key incorrect 915432509440 bad block 915432509440 leaf parent key incorrect 915432513536 bad block 915432513536 leaf parent key incorrect 915432529920 bad block 915432529920 leaf parent key incorrect 915432701952 bad block 915432701952 leaf parent key incorrect 915433058304 bad block 915433058304 leaf parent key incorrect 915437543424 bad block 915437543424 leaf parent key incorrect 915437563904 bad block 915437563904 leaf parent key incorrect 91569760 bad block 91569760 leaf parent key incorrect 91573856 bad block 91573856 leaf parent key incorrect 915444506624 bad block 915444506624 leaf parent key incorrect 915444518912 bad block 915444518912 leaf parent key incorrect 915444523008 bad block 915444523008 leaf parent key incorrect 915444527104 bad block 915444527104 leaf parent key incorrect 915444539392 bad block 915444539392 leaf parent key incorrect 915444543488 bad block 915444543488 leaf parent key incorrect 915444547584 bad block 915444547584 leaf parent key incorrect 915444551680 bad block 915444551680 leaf parent key incorrect 915444555776 bad block 915444555776 leaf parent key incorrect 915444559872 bad block 915444559872 leaf parent key incorrect 915444563968 bad block 915444563968 leaf parent key incorrect 915444572160 bad block 915444572160 leaf parent key incorrect 915444576256 bad block 915444576256 leaf parent key incorrect 915444580352 bad block 915444580352 leaf parent key incorrect 915444584448 bad block 915444584448 leaf parent key incorrect 915444588544 bad block 915444588544 leaf parent key incorrect 915444678656 bad block 915444678656 leaf parent key 
incorrect 915444682752 bad block 915444682752 leaf parent key incorrect 915444793344 bad block 915444793344 leaf parent key incorrect 915444797440 bad block 915444797440 leaf parent key incorrect 915444813824 bad block 915444813824 leaf parent key incorrect 915444817920 bad block 915444817920 leaf parent key incorrect 915444822016 bad block 915444822016 leaf parent key incorrect 915444826112 bad block 915444826112 leaf parent key incorrect 915444830208 bad block 915444830208 leaf parent key incorrect 915444834304 bad block 915444834304 leaf parent key incorrect 915444924416 bad block 915444924416 leaf parent key incorrect 915444973568 bad block 915444973568 leaf parent key incorrect 915444977664 bad block 915444977664 leaf parent key incorrect 915444981760 bad block 915444981760 parent transid verify failed on 915444973568 wanted 16974 found 13021 parent transid verify failed on 915444973568 wanted 16974 found 13021 parent transid verify failed on 915444977664 wanted 16974 found 13021 parent transid verify failed on 915444977664 wanted 16974 found 13021 parent transid verify failed on 915444981760 wanted 16974 found 13021 parent transid verify failed on 915444981760 wanted 16974 found 13021 parent transid verify failed on 915432701952 wanted 16974 found 16972 parent transid verify failed on 915432701952 wanted 16974 found 16972 parent transid verify
ASM1083 rev01 PCIe to PCI Bridge chip (Was: Corrupt btrfs filesystem recovery... (Due to *sata* errors))
On 28/09/13 20:26, Martin wrote: AMD E-450 APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux Just in case someone else stumbles across this thread due to a related problem for my particular motherboard... There appears to be a fatal hardware bug for the interrupt line deassert for a PCIe to PCI Bridge chip: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 01) See the thread on https://lkml.org/lkml/2012/1/30/216 For that chip, the interrupt line is not always deasserted for PCI interrupts. The hardware fault appears to be fixed in ASM1083 rev 03. Unfortunately, there is no useful OS workaround possible for rev 01. Hence, the PCI interrupts are unusable for ASM1083 rev01? :-( In brief, this means that the PCI card slots on the motherboard cannot be used for any hardware that might generate an interrupt. That means pretty much all normal PCI cards. (The PCIe card slots are fine.) For my own example, there do not appear to be any other devices using that bridge chip. The only concern is for the sound chip, but I happen to never use sound on that system and so that is disabled. The problem is listed in syslog/dmesg by lines such as: kernel: irq 16: nobody cared (try booting with the irqpoll option) kernel: Disabling IRQ #16 Unfortunately, the HDDs and network interfaces also use that irq or irq 17 (which can also be affected). Losing the irq will badly slow down your system and can cause data corruption for heavy use of the HDD. Use: lspci | grep -i ASM1083 to see if you have that chip and if so, what revision. To see if you have any irqpoll messages, use: grep -ia irqpoll /var/log/messages To list what devices use what interrupts, use either of: grep -ia ' irq ' /var/log/messages cat /proc/interrupts Note that there should no longer be any ASM1083 rev01 chips being supplied by now. (ASM1083 rev03 chips have been seen in products.) Hope that helps for that bit of obscurity! Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Corrupt btrfs filesystem recovery... What best instructions?
So... The hint there is btrfsck: extent-tree.c:2736, so trying: btrfsck --repair --init-extent-tree /dev/sdc That ran for a while until: kernel: btrfsck[16610]: segfault at cc ip 0041d2a7 sp 7fffd2c2d710 error 4 in btrfsck[40+4d000] There's no other messages in the syslog. The output attached. What next? Thanks, Martin On 05/10/13 12:32, Martin wrote: No comment so blindly trying: btrfsck --repair /dev/sdc gave the following abort: btrfsck: extent-tree.c:2736: alloc_reserved_tree_block: Assertion `!(ret)' failed. Full output attached. All on: 3.11.2-gentoo Btrfs v0.20-rc1-358-g194aa4a For a 2TB single HDD formatted with defaults. What next? Thanks, Martin In the meantime, trying: btrfsck /dev/sdc gave the following output + abort: parent transid verify failed on 915444523008 wanted 16974 found 13021 Ignoring transid failure btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec-ino != key-objectid || rec-refs 1)' failed. id not match free space cache generation (1625) free space inode generation (0) did not match free space cache generation (1607) free space inode generation (0) did not match free space cache generation (1604) free space inode generation (0) did not match free space cache generation (1606) free space inode generation (0) did not match free space cache generation (1620) free space inode generation (0) did not match free space cache generation (1626) free space inode generation (0) did not match free space cache generation (1609) free space inode generation (0) did not match free space cache generation (1653) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1649) (There was no syslog output.) Full btrfsck listing attached. Suggestions please? 
Thanks, Martin btrfs unable to find ref byte nr 912043257856 parent 0 root 1 owner 2 offset 0 btrfs unable to find ref byte nr 912043343872 parent 0 root 1 owner 1 offset 1 btrfs unable to find ref byte nr 912044331008 parent 0 root 1 owner 0 offset 1 btrfs unable to find ref byte nr 912043261952 parent 0 root 1 owner 1 offset 1 btrfs unable to find ref byte nr 912043266048 parent 0 root 1 owner 0 offset 1 checking extents leaf parent key incorrect 907183771648 bad block 907183771648 leaf parent key incorrect 907183779840 bad block 907183779840 leaf parent key incorrect 907183882240 bad block 907183882240 leaf parent key incorrect 907185160192 bad block 907185160192 leaf parent key incorrect 907185201152 bad block 907185201152 leaf parent key incorrect 915432497152 bad block 915432497152 leaf parent key incorrect 915432509440 bad block 915432509440 leaf parent key incorrect 915432513536 bad block 915432513536 leaf parent key incorrect 915432529920 bad block 915432529920 leaf parent key incorrect 915433058304 bad block 915433058304 leaf parent key incorrect 915437543424 bad block 915437543424 leaf parent key incorrect 915437563904 bad block 915437563904 leaf parent key incorrect 91569760 bad block 91569760 leaf parent key incorrect 91573856 bad block 91573856 leaf parent key incorrect 915444506624 bad block 915444506624 leaf parent key incorrect 915444518912 bad block 915444518912 leaf parent key incorrect 915444523008 bad block 915444523008 leaf parent key incorrect 915444527104 bad block 915444527104 leaf parent key incorrect 915444539392 bad block 915444539392 leaf parent key incorrect 915444543488 bad block 915444543488 leaf parent key incorrect 915444547584 bad block 915444547584 leaf parent key incorrect 915444551680 bad block 915444551680 leaf parent key incorrect 915444555776 bad block 915444555776 leaf parent key incorrect 915444559872 bad block 915444559872 leaf parent key incorrect 915444563968 bad block 915444563968 leaf parent key incorrect 915444572160 bad block 915444572160 leaf parent key incorrect 915444576256 bad block 915444576256 leaf parent key incorrect 915444580352 bad block 915444580352 leaf parent key incorrect 915444584448 bad block 915444584448 leaf parent key incorrect 915444588544 bad block 915444588544 leaf parent key incorrect 915444793344 bad block 915444793344 leaf parent key incorrect 915444797440 bad block 915444797440 leaf parent key incorrect 915444813824 bad block 915444813824 leaf parent key incorrect 915444817920 bad block 915444817920 leaf parent key incorrect 915444822016 bad block 915444822016 leaf parent key incorrect 915444826112 bad block 915444826112 leaf parent key incorrect 915444830208 bad block 915444830208 leaf parent key incorrect 915444834304 bad block 915444834304 leaf parent key incorrect 915444924416 bad block 915444924416 ref mismatch on [12582912 8065024] extent item 0, found 1 btrfs unable to find ref byte nr 912014393344 parent 0 root 2 owner 0 offset 0 adding
btrfsck --repair --init-extent-tree: segfault error 4
Any clues or educated comment please? Can the corrupt directory tree safely be ignored and left in place? Or might that cause everything to fall over in a big heap as soon as I try to write data again? Could these other tricks work-around or fix the corrupt tree: Run a scrub? Make a snapshot and work from the snapshot? Or try mount -o recovery,noatime again? Or is it dead? (The 1.5TB of backup data is replicated elsewhere but it would be good to rescue this version rather than completely redo from scratch. Especially so for the sake of just a few MBytes of one corrupt directory tree.) Thanks, Martin On 05/10/13 14:18, Martin wrote: So... The hint there is btrfsck: extent-tree.c:2736, so trying: btrfsck --repair --init-extent-tree /dev/sdc That ran for a while until: kernel: btrfsck[16610]: segfault at cc ip 0041d2a7 sp 7fffd2c2d710 error 4 in btrfsck[40+4d000] There's no other messages in the syslog. The output attached. What next? Thanks, Martin On 05/10/13 12:32, Martin wrote: No comment so blindly trying: btrfsck --repair /dev/sdc gave the following abort: btrfsck: extent-tree.c:2736: alloc_reserved_tree_block: Assertion `!(ret)' failed. Full output attached. All on: 3.11.2-gentoo Btrfs v0.20-rc1-358-g194aa4a For a 2TB single HDD formatted with defaults. What next? Thanks, Martin In the meantime, trying: btrfsck /dev/sdc gave the following output + abort: parent transid verify failed on 915444523008 wanted 16974 found 13021 Ignoring transid failure btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec-ino != key-objectid || rec-refs 1)' failed. id not match free space cache generation (1625) free space inode generation (0) did not match free space cache generation (1607) free space inode generation (0) did not match free space cache generation (1604) free space inode generation (0) did not match free space cache generation (1606) free space inode generation (0) did not match free space cache generation (1620) free space inode generation (0) did not match free space cache generation (1626) free space inode generation (0) did not match free space cache generation (1609) free space inode generation (0) did not match free space cache generation (1653) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1628) free space inode generation (0) did not match free space cache generation (1649) (There was no syslog output.) Full btrfsck listing attached. Suggestions please? Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfsck --repair --init-extent-tree: segfault error 4
In summary: Looks like minimal damage remains and yet I'm still suffering Input/output error from btrfs and btrfsck appears to have looped... A diff check suggests the damage to be in one (heavily linked to) tree of a few MBytes. Would a scrub clear out the damaged trees? Worth debugging? Thanks, Martin Further detail: On 07/10/13 20:03, Chris Murphy wrote: On Oct 7, 2013, at 8:56 AM, Martin m_bt...@ml1.co.uk wrote: Or try mount -o recovery,noatime again? Because of this: free space inode generation (0) did not match free space cache generation (1607) Try mount option clear_cache. You could then use iotop to make sure the btrfs-freespace process becomes inactive before unmounting the file system; I don't think you need to wait in order to use the file system, nor do you need to unmount then remount without the option. But if it works, it should only be needed once, not as a persistent mount option. Thanks for that. So, trying: mount -v -t btrfs -o recovery,noatime,clear_cache /dev/sdc gave: kernel: device label bu_A devid 1 transid 17448 /dev/sdc kernel: btrfs: enabling inode map caching kernel: btrfs: enabling auto recovery kernel: btrfs: force clearing of disk cache kernel: btrfs: disk space caching is enabled kernel: btrfs: bdev /dev/sdc errs: wr 0, rd 27, flush 0, corrupt 0, gen 0 btrfs-freespace appeared occasionally briefly in atop but there's no noticeable disk activity. All very rapidly done? Running a diff check to see if all ok and what might be missing gave the syslog output: kernel: verify_parent_transid: 165 callbacks suppressed kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021 kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021 kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021 kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021 kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021 kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021 The diff eventually failed with Input/output error. 'mv' to move this failed directory tree out of the way worked. Attempting to use 'ln -s' gave the attached syslog output and the filesystem was made Read-only. Remounting: mount -v -o remount,recovery,noatime,clear_cache,rw /dev/sdc and the mv looks fine. Trying the 'ln -s' again gives: ln: creating symbolic link `./portage': Read-only file system unmounting gave the syslog message: kernel: btrfs: commit super ret -30 Mounting again: mount -v -t btrfs -o recovery,noatime,clear_cache /dev/sdc showed that the symbolic link was put in place ok. Rerunning the diff check eventually found another Input/output error. So unmounted and tried again: btrfsck --repair --init-extent-tree /dev/sdc Failed with: btrfs unable to find ref byte nr 911367733248 parent 0 root 1 owner 2 offset 0 btrfs unable to find ref byte nr 911367737344 parent 0 root 1 owner 1 offset 1 btrfs unable to find ref byte nr 911367741440 parent 0 root 1 owner 0 offset 1 leaf free space ret -297791851, leaf data size 3995, used 297795846 nritems 2 checking extents btrfsck: extent_io.c:606: free_extent_buffer: Assertion `!(eb-refs 0)' failed. enabling repair mode Checking filesystem on /dev/sdc UUID: 38a60270-f9c6-4ed4-8421-4bf1253ae0b3 Creating a new extent tree Failed to find [911367733248, 168, 4096] Failed to find [911367737344, 168, 4096] Failed to find [911367741440, 168, 4096] Rerunning again and this time btrfsck is sat there at 100% CPU for the last 24 hours. 
Full output so far is: parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 Ignoring transid failure Nothing syslog and no disk activity. Looped?... Or is it dead? (The 1.5TB of backup data is replicated elsewhere but it would be good to rescue this version rather than completely redo from scratch. Especially so for the sake of just a few MBytes of one corrupt directory tree.) Right. If you snapshot the subvolume containing the corrupt portion of the file system, the snapshot probably inherits that corruption. But if you write to only one of them, if those writes make the problem worse, should be isolated only to the one you write to. I might avoid writing to it, honestly. To save time, get increasingly aggressive to get data out of this directory and once you succeed, blow away the file system and start from scratch. You could also then try kernel 3.12 rc4, as there are some btrfs bug fixes I'm seeing in there also, but I don't know if any of them will help your case. If you try it, mount normally, then try to get your data. If that doesn't work, try the recovery option. Maybe you'll get different results. As suspected
Re: Btrfs and raid5 status with kernel 3.14, documentation, and howto
On 23/03/14 22:56, Marc MERLIN wrote: Ok, thanks to the help I got from you, and my own experiments, I've written this: http://marc.merlins.org/perso/btrfs/post_2014-03-23_Btrfs-Raid5-Status.html If someone reminds me how to edit the btrfs wiki, I'm happy to copy that there, or give anyone permission to take part of all of what I wrote and use it for any purpose. The highlights are if you're coming from the mdadm raid5 world: [---] Hope this helps, Marc Thanks for the very good summary. So... In very brief summary, btrfs raid5 is very much a work in progress. Question: Is the raid5 going to be seamlessly part of the error-correcting raids whereby raid5, raid6, raid-with-n-redundant-drives are all coded as one configurable raid? Also (second question): What happened to the raid naming scheme that better described the btrfs-style of raid by explicitly numbering the number of devices used for mirroring, striping, and error-correction? Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Suggestion: Anti-fragmentation safety catch (RFC)
Just an idea: btrfs Problem: I've had two systems die with huge load factors >100(!) for the case where a user program has unexpectedly (to me) been doing 'database'-like operations and caused multiple files to become heavily fragmented. The system eventually dies when data cannot be appended to the fragmented files as fast as the real-time data collection produces it. My example case is for two systems with btrfs raid1 using two HDDs each. Normal write speed is about 100MByte/s. After heavy fragmentation, the cpus are at 100% wait and i/o is a few hundred kByte/s. Possible fix: btrfs checks the ratio of filesize versus number of fragments and for a bad ratio either: 1: Performs a non-cow copy to defragment the file; 2: Turns off cow for that file and gives a syslog warning for that; 3: Automatically defragments the file. Or? (Example commands for experimenting with option 2 by hand follow below.) For my case, I'm not sure 2 is a good idea in case the user is rattling through a gazillion files and the syslog gets swamped. Unfortunately, I don't know beforehand which files to mark no-cow unless I no-cow the entire user/applications. Thoughts? Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
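(The promised sketch for experimenting with option 2 by hand; the path is invented, and note that on btrfs chattr +C only takes proper effect on a still-empty file, or on a directory for files subsequently created in it:
filefrag /srv/data/logger.dat    # extent count, to compare against the file size
chattr +C /srv/data/logger.dat   # mark no-cow
lsattr /srv/data/logger.dat      # confirm the C attribute is set
)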
Re: Suggestion: Anti-fragmentation safety catch (RFC)
On 24/03/14 20:19, Duncan wrote: Martin posted on Mon, 24 Mar 2014 19:47:34 + as excerpted: Possible fix: btrfs checks the ratio of filesize versus number of fragments and for a bad ratio either: [...] 3: Automatically defragments the file. See the autodefrag mount option. =:^) Thanks for that! So... https://btrfs.wiki.kernel.org/index.php/Mount_options autodefrag (since [kernel] 3.0) Will detect random writes into existing files and kick off background defragging. It is well suited to bdb or sqlite databases, but not virtualization images or big databases (yet). Once the developers make sure it doesn't defrag files over and over again, they'll move this toward the default. Looks like I might be a good test case :-) What's the problem for big images or big databases? What is considered big? Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
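(For reference, turning it on is just a mount option, e.g. an fstab line of the form; label and mount point invented:
LABEL=bu_A  /mnt/bu_A  btrfs  noatime,autodefrag  0 0
)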
Re: Btrfs and raid5 status with kernel 3.14, documentation, and howto
On 24/03/14 21:52, Marc MERLIN wrote: On Mon, Mar 24, 2014 at 07:17:12PM +0000, Martin wrote: Thanks for the very good summary. So... In very brief summary, btrfs raid5 is very much a work in progress. If you know how to use it, which I didn't know until now, it's technically very usable as is. The corner cases are in having a failing drive which you can't hot remove because you can't write to it. It's unfortunate that you can't just kill a drive without umounting, making the drive disappear so that btrfs can't see it (dmsetup remove cryptname for me, so it's easy to do remotely), and remounting in degraded mode. Yes, looking good, but for my usage I need the option to run ok with a failed drive. So, that's one to keep a development eye on for continued progress... Question: Is the raid5 going to be seamlessly part of the error-correcting raids whereby raid5, raid6, raid-with-n-redundant-drives are all coded as one configurable raid? I'm not sure I parse your question. As far as btrfs is concerned you can switch from non-raid to raid5 to raid6 by adding a drive and rebalancing, which effectively reads and re-writes all the blocks in the new format. There was a big thread a short while ago about using parity across n devices where the parity is spread such that you can have 1, 2, and up to 6 redundant devices. Well beyond just raid5 and raid6: http://lwn.net/Articles/579034/ Also (second question): What happened to the raid naming scheme that better described the btrfs-style of raid by explicitly numbering the number of devices used for mirroring, striping, and error-correction? btrfs fi show kind of tells you that if you know how to read it (I didn't initially). What's missing for you? btrfs raid1 at present is always just the two copies of data spread across whatever number of disks you have. A more flexible arrangement would be to be able to set, say, 3 copies of data across, say, 4 disks. There's a new naming scheme proposed somewhere that enumerates all the permutations possible for numbers of devices, copies, and parity that btrfs can support. For me, that is a 'killer' feature beyond what can be done with md-raid for example. Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
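(To make the add-a-drive-and-rebalance conversion concrete, it is done with balance filters, along these lines; device and mount point invented:
btrfs device add /dev/sdd /mnt/pool
btrfs balance start -dconvert=raid5 -mconvert=raid5 /mnt/pool   # rewrites all data and metadata chunks as raid5
btrfs filesystem df /mnt/pool                                   # afterwards, shows the new block group profiles
)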
Re: Can anyone boot a system using btrfs root with linux 3.14 or newer? - RESOLVED
On 27/04/14 13:00, Пламен Петров wrote: The problem reported in this thread has been RESOLVED. It's not BTRFS's fault. Debugging on my part led to the actual problem in do_mounts.c - some filesystems' mount routines return error codes other than 0, EACCES and EINVAL, and such return codes result in the kernel panicking without trying to mount root with all of the available filesystems. A patch is available as an attachment to bug 74901 - https://bugzilla.kernel.org/show_bug.cgi?id=74901 . The bug entry documents how I managed to find the problem. Well deduced, and that looks to be a good, natural, clean fix. My only question is: what was the original intent in deliberately failing if something other than EACCES or EINVAL was reported? Also, the patch has been sent to the linux kernel mailing list - see http://news.gmane.org/find-root.php?group=gmane.linux.kernel&article=1691881 Hopefully, it will find its way into the kernel, and later on - into stable releases. That all looks very good and very thorough. Thanks to you all! -- Plamen Petrov Thanks to you for chasing it through! AND for posting the Resolved to let everyone know. :-) Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ditto blocks on ZFS
On 16/05/14 04:07, Russell Coker wrote: https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape Probably most of you already know about this, but for those of you who haven't the above describes ZFS ditto blocks which is a good feature we need on BTRFS. The briefest summary is that on top of the RAID redundancy there... [... are additional copies of metadata ...] Is that idea not already implemented in effect in btrfs with the way that the superblocks are replicated multiple times, ever more times, for ever more huge storage devices? The one exception is for SSDs whereby there is the excuse that you cannot know whether your data is usefully replicated across different erase blocks on a single device, and SSDs are not 'that big' anyhow. So... Your idea of replicating metadata multiple times in proportion to assumed 'importance' or 'extent of impact if lost' is an interesting approach. However, is that appropriate and useful considering the real world failure mechanisms that are to be guarded against? Do you see or measure any real advantage? Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ditto blocks on ZFS
On 18/05/14 17:09, Russell Coker wrote: On Sat, 17 May 2014 13:50:52 Martin wrote: [...] Do you see or measure any real advantage? Imagine that you have a RAID-1 array where both disks get ~14,000 read errors. This could happen due to a design defect common to drives of a particular model or some shared environmental problem. Most errors would be corrected by RAID-1 but there would be a risk of some data being lost due to both copies being corrupt. Another possibility is that one disk could entirely die (although total disk death seems rare nowadays) and the other could have corruption. If metadata was duplicated in addition to being on both disks then the probability of data loss would be reduced. Another issue is the case where all drive slots are filled with active drives (a very common configuration). To replace a disk you have to physically remove the old disk before adding the new one. If the array is a RAID-1 or RAID-5 then ANY error during reconstruction loses data. Using dup for metadata on top of the RAID protections (IE the ZFS ditto idea) means that case doesn't lose you data. Your example there is for the case where in effect there is no RAID. How is that case any better than what is already done for btrfs duplicating metadata? So... What real-world failure modes do the ditto blocks usefully protect against? And how does that compare for failure rates and against what is already done? For example, we have RAID1 and RAID5 to protect against any one RAID chunk being corrupted or for the total loss of any one device. There is a second part to that in that another failure cannot be tolerated until the RAID is remade. Hence, we have RAID6 that protects against any two failures for a chunk or device. Hence with just one failure, you can tolerate a second failure whilst rebuilding the RAID. And then we supposedly have safety-by-design where the filesystem itself is using a journal and barriers/sync to ensure that the filesystem is always kept in a consistent state, even after an interruption to any writes. *What other failure modes* should we guard against? There has been mention of fixing metadata keys from single bit flips... Should hamming codes be used instead of a crc so that we can have multiple bit error detect, single bit error correct functionality for all data both in RAM and on disk for those systems that do not use ECC RAM? Would that be useful?... Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
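(For reference, the existing duplication referred to above is the metadata profile: on a single rotating disk, DUP metadata is the mkfs default and can also be requested explicitly; device name a placeholder:
mkfs.btrfs -m dup -d single /dev/sdX   # two copies of all metadata, one copy of data, on the one device
)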
Re: ditto blocks on ZFS
Very good comment from Ashford. Sorry, but I see no advantages from Russell's replies other than for a feel-good factor or a dangerous false sense of security. At best, there is a weak justification that for metadata, again going from 2% to 4% isn't going to be a great problem (storage is cheap and fast). I thought an important idea behind btrfs was that we avoid by design in the first place the very long and vulnerable RAID rebuild scenarios suffered for block-level RAID... On 21/05/14 03:51, Russell Coker wrote: Absolutely. Hopefully this discussion will inspire the developers to consider this an interesting technical challenge and a feature that is needed to beat ZFS. Sorry, but I think that is completely the wrong reasoning. ...Unless that is you are some proprietary sales droid hyping features and big numbers! :-P Personally I'm not convinced we gain anything beyond what btrfs will eventually offer in any case for the n-way raid or the raid-n Cauchy stuff. Also note that usually, data is wanted to be 100% reliable and retrievable. Or if that fails, you go to your backups instead. Gambling proportions and importance rather than *ensuring* fault/error tolerance is a very human thing... ;-) Sorry: Interesting idea but not convinced there's any advantage for disk/SSD storage. Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs filesystem freezing during snapshots
On 26/05/14 13:28, David Bloquel wrote: Hi, I have a problem with my btrfs filesystem, which is freezing when I am doing snapshots. I have a cron job that is snapshotting around 70 subvolumes every ten minutes. The subvolumes that btrfs is snapshotting are container folders that are running through my virtual environment. The subdirectories that btrfs is snapshotting are not that big (from 500MB to 10GB max and usually around 3GB) but there is a lot of IO on the filesystem because of the intensive use of the CTs and VMs. At some point the snapshot process becomes really slow: at first it snapshots around one folder per second, but then after a while it can take 30 seconds or even a few minutes to snapshot one single subvolume. Subvolumes are really similar to each other in size and number of files so there is no reason that it takes 1 second for one subvolume and then 3 minutes for another one. Moreover, when my snapshot cron job is running all my VMs and containers are slowing down until the whole filesystem freezes, which leads to frozen CTs and VMs (which is a real problem for me). Moreover, I can see that my CPU load is really high during the process. When I am looking at dmesg there are a lot of messages of this kind: [96537.686467] BTRFS debug (device drbd0): unlinked 290 orphans [...] That looks to be running on top of drbd, which will add a network write overhead (unless you are dangerously running asynchronously!). Hence you will see IO-speed-related limits a little sooner... However, I will guess that your primary problem is likely due to fragmentation accumulating as ever more snapshots are added every 10 mins for the VMs/containers. There are other people far more practised here than I, but some guesses to try are: Use nocow for the VM images (and container images); Try using the btrfs auto defrag (beware your IO speed limit vs the file size to be defragged); Avoid accumulating too many versions of any one snapshot. (Example commands for these follow below.) Note also the experimental status for btrfs... I'm sure you will have noticed the previous race problems for deleting snapshots. Aside: I've held off from using kernels 3.12 and 3.13 due to curious happenings on my test system. Kernel 3.14.4 is behaving well so far. Hope that gives a few clues. Good luck, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
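(The promised example commands for those three guesses; paths invented, and remember chattr +C only takes effect for files created after it is set on a directory:
chattr +C /var/lib/vz/images               # no-cow for newly created VM/CT images under this directory
mount -o remount,autodefrag /var/lib/vz    # or add autodefrag to the fstab entry
btrfs subvolume delete /snapshots/ct101.2014-05-26_0300   # prune old snapshots rather than retaining all of them
)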
Re: What to do about snapshot-aware defrag
On 02/06/14 14:22, Josef Bacik wrote: On 05/30/2014 06:00 PM, Martin wrote: OK... I'll jump in... On 30/05/14 21:43, Josef Bacik wrote: Hello, TL;DR: I want to only do snapshot-aware defrag on inodes in snapshots that haven't changed since the snapshot was taken. Yay or nay (with a reason why for nay) [...] === Summary and what I need === Option 1: Only relink inodes that haven't changed since the snapshot was taken. [...] Obvious way to go for fast KISS. One question: Will option one mean that we always need to mount with noatime or read-only to allow snapshot defragging to do anything? Yeah, atime would screw this up, I hadn't thought of that. With that being the case I think the only option is to keep the old behavior; we don't want to screw up stuff like this just because users used a backup program on their snapshot and didn't use noatime. Thanks, Not so fast into non-KISS! The *ONLY* application that I know of that uses atime is Mutt, and then *only* for mbox files!... NOTHING else uses atime as far as I know. We already have most distros enabling relatime by default as a just-in-case... Can we not have noatime as the default for btrfs? Also widely note that default in the man page and wiki, along with the why?... *And go KISS and move on faster* better? Myself, I still use Mutt sometimes, but no mbox, and all my filesystems have been noatime for many years now with good positive results. (Both home and work servers.) Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about snapshot-aware defrag
On 04/06/14 10:19, Erkki Seppala wrote: Martin m_bt...@ml1.co.uk writes: The *ONLY* application that I know of that uses atime is Mutt, and then *only* for mbox files!... However, users, such as myself :), can be interested in when a certain file has last been accessed. With snapshots I can even get an idea of all the times the file has been accessed. *And go KISS and move on faster* better? Well, it is uncertain to me whether it truly is better that btrfs would after that point no longer truly even support atime, if using it results in blowing up snapshot sizes. They might at that point even consider just using LVM2 snapshots (shudder) ;). Not quite... My emphasis is: 1: Go KISS for the defrag and accept that any atime use will render the defrag ineffective. Give a note that the noatime mount option should be used. 2: Consider using noatime as a /default/, being as there are no known 'must-use' use cases. Those users still wanting atime can add that as a mount option with the note that atime use reduces the snapshot defrag effectiveness. (The for/against atime is a good subject for another thread!) Go fast KISS! Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 16/06/14 17:05, Josef Bacik wrote: On 06/16/2014 03:14 AM, Lennart Poettering wrote: On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote: I am not really following why this trips up btrfs though. I am not sure I understand why this breaks btrfs COW behaviour. I mean, I don't believe that fallocate() makes any difference to fragmentation on BTRFS. Blocks will be allocated when writes occur, so regardless of an fallocate() call the usage pattern in systemd-journald will cause fragmentation. journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from access patterns of other database or database-like engines... Even though this appears to be a problem case for btrfs/COW, is there a write/access sequence possible, easily implemented, that is favourable for both ext4-like fs /and/ COW fs? Database-like writing is known to be 'difficult' for filesystems: can a data log be a simpler case? Was waiting for you to show up before I said anything since most systemd related emails always devolve into how evil you are rather than what is actually happening. Ouch! Hope you two know each other!! :-P :-) [...] since we shouldn't be fragmenting this badly. Like I said, what you guys are doing is fine; if btrfs falls on its face then it's not your fault. I'd just like an exact idea of when you guys are fsync'ing so I can replicate in a smaller way. Thanks, Good if COW can be so resilient. I have about 2GBytes of data logging files and I must defrag those as part of my backups to stop the system fragmenting to a stop (I use cp -a to defrag the files to a new area and restart the data software logger on that). Random thoughts: Would using a second small file just for the mmap-ed pointers help avoid the repeated rewriting of random offsets in the log file causing excessive fragmentation? Align the data writes to 16kByte or 64kByte boundaries/chunks? Are mmap-ed files a similar problem to using a swap file, and so should the same btrfs swap file code be used for both? Not looked over the code so all random guesses... Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
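(For reference, the defrag-by-copy mentioned above is nothing more sophisticated than the following; paths invented, and the logger is stopped first so the files are quiescent:
cp -a /srv/log /srv/log.new      # the fresh copy gets newly allocated, largely contiguous extents
mv /srv/log /srv/log.frag
mv /srv/log.new /srv/log         # restart the logger here, then delete /srv/log.frag
An alternative I have not yet timed against that is btrfs filesystem defragment -r on the directory.)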
btrfs support for efficient SSD operation (data blocks alignment)
My understanding is that for x86 architecture systems, btrfs only allows a sector size of 4kB for a HDD/SSD. That is fine for the present HDDs assuming the partitions are aligned to a 4kB boundary for that device. However for SSDs... I'm using for example a 60GByte SSD that has: 8kB page size; 16kB logical to physical mapping chunk size; 2MB erase block size; 64MB cache. And the sector size reported to Linux 3.0 is the default 512 bytes! My first thought is to try formatting with a sector size of 16kB to align with the SSD logical mapping chunk size. This is to avoid SSD write amplification. Also, the data transfer performance for that device is near maximum for writes with a blocksize of 16kB and above. Yet, btrfs supports a 4kByte page/sector size only at present... Is there any control possible over the btrfs filesystem structure to map metadata and data structures to the underlying device boundaries? For example to maximise performance, can the data chunks and the data chunk size be aligned to be sympathetic to the SSD logical mapping chunk size and the erase block size? What features other than the trim function does btrfs employ to optimise for SSD operation? Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
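(For what it's worth, the sizes the kernel believes, and the partition alignment, can at least be inspected; sdX a placeholder, and the SSD's internal 16kB mapping chunk size is typically not reported through any of these:
cat /sys/block/sdX/queue/logical_block_size    # what the drive advertises (the 512 bytes above)
cat /sys/block/sdX/queue/physical_block_size
cat /sys/block/sdX/queue/optimal_io_size       # often 0 on SATA SSDs, i.e. no hint given
parted /dev/sdX align-check optimal 1          # check that partition 1 is aligned
)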
Re: btrfs support for efficient SSD operation (data blocks alignment)
On 09/02/12 01:42, Liu Bo wrote: On 02/09/2012 03:24 AM, Martin wrote: [ No problem for 4kByte sector HDDs. However, for SSDs... ] However for SSDs... I'm using for example a 60GByte SSD that has: 8kB page size; 16kB logical to physical mapping chunk size; 2MB erase block size; 64MB cache. And the sector size reported to Linux 3.0 is the default 512 bytes! [...] Is there any control possible over the btrfs filesystem structure to map metadata and data structures to the underlying device boundaries? For example to maximise performance, can the data chunks and the data chunk size be aligned to be sympathetic to the SSD logical mapping chunk size and the erase block size? The metadata buffer size will support size larger than 4K at least, it is on development. And also for the data? Also pack smaller data chunks in with the metadata as is done already but with all the present parameters proportioned according to the sector size? (For my example, the filesystem may as well use 16kByte sectors because the SSD firmware will do a read-modify-write for anything smaller.) What features other than the trim function does btrfs employ to optimise for SSD operation? e.g COW(avoid writing to one place multi-times), delayed allocation(intend to reduce the write frequency) I'm using ext4 on a SSD web server and have formatted with (for ext4): mke2fs -v -T ext4 -L fs_label_name -b 4096 -E stride=4,stripe-width=4,lazy_itable_init=0 -O none,dir_index,extent,filetype,flex_bg,has_journal,sparse_super,uninit_bg /dev/sdX and mounted with the mount options: journal_checksum,barrier,stripe=4,delalloc,commit=300,max_batch_time=15000,min_batch_time=200,discard,noatime,nouser_xattr,noacl,errors=remount-ro The main bits for the SSD are the: stripe=4,delalloc,commit=300,max_batch_time=15000,min_batch_time=200,discard,noatime The -b 4096 is the maximum value allowed. The stride and stripe-width then take that up to 16kBytes (hopefully...). (Make sure you're on a good UPS with a reliable shutdown mechanism for power fail!) A further thought is: For my one SSD example, the erase state appears to be all 0xFF... Can the fs easily check the erase state value and leave any blank space unchanged to minimise the bit flipping? Reasonable to be included? All unnecessary for HDDs but possibly of use for maintaining the lifespan of SSDs... Hope of interest, Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
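(If and when the larger metadata blocks mentioned above land, I would expect the knob to be an mkfs-time one, something like the following; flag names and supported sizes to be confirmed once the support is released:
mkfs.btrfs -l 16384 -n 16384 /dev/sdX   # 16kByte metadata leaf/node size; the data sector size stays at 4kByte on x86
)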
Re: btrfs support for efficient SSD operation (data blocks alignment)
Looking at this again from some time ago...

Brief summary: There is a LOT of nefarious cleverness being attempted by SSD manufacturers to accommodate a 4kByte block size. Get that wrong, or just be unsympathetic to that 'cleverness', and you suffer performance degradation and/or premature device wear.

Is that significant? Very likely it will be for the new three-bit FLASH devices that have a PE (program-erase) lifespan of only 1000 or so cycles per cell.

A better question is whether the filesystem can easily be made more sympathetic to all SSDs. From my investigating, there appears to be a sweet spot for performance when writing (aligned) 16kByte blocks. TRIM and keeping the device non-full also help greatly. I suspect that consecutive writes, as is the case for HDDs, also help performance to a lesser degree.

The erased state for SSDs appears to be either all 0xFF or all 0x00 (I've got examples of both). Can that be automatically detected and used by btrfs so as to minimise write-cycling the bits for (unused) padded areas?

Are 16kByte blocks/sectors useful to btrfs? Or rather, can btrfs usefully use 16kByte blocks? Can that be supported?

Further detail... Some good comments:

On 10/02/12 18:18, Martin Steigerwald wrote:

Hi Martin,

On Wednesday, 8 February 2012, Martin wrote: My understanding is that for x86 architecture systems, btrfs only allows a sector size of 4kB for a HDD/SSD. That is fine for the present HDDs assuming the partitions are aligned to a 4kB boundary for that device. However for SSDs... I'm using for example a 60GByte SSD that has: 8kB page size; 16kB logical to physical mapping chunk size; 2MB erase block size; 64MB cache. And the sector size reported to Linux 3.0 is the default 512 bytes! My first thought is to try formatting with a sector size of 16kB to align with the SSD logical mapping chunk size. This is to avoid SSD write amplification. Also, the data transfer performance for that device is near maximum for writes with a blocksize of 16kB and above. Yet, btrfs supports a 4kByte page/sector size only at present...

The thing is, as far as I know, the better SSDs and even the dumber ones have quite some intelligence in the firmware. And at least for me it's not clear what the firmware of my Intel SSD 320 does on its own, and whether any of my optimization attempts even matter. [...] The article on write amplification on Wikipedia gives me a glimpse of the complexity involved¹. Yes, I set stripe-width as well on my Ext4 filesystem, but frankly I am not even sure whether this has any positive effect, except maybe sparing the SSD controller firmware some reshuffling work. So from my current point of view, most of what you wrote IMHO is more important for really dumb flash. ... [...] grade SSDs just provide a SATA interface and hide the internals. So an optimization for one kind or one brand of SSD may not be suitable for another one. There are PCI express models, but these probably aren't dumb either. And then there is the idea of auto commit memory (ACM) by Fusion-IO, which just makes a part of the virtual address space persistent. So it's a question of where to put the intelligence. For current SSDs it seems the intelligence is really near the storage medium, and then IMHO it makes sense to even reduce the intelligence on the Linux side.

[1] http://en.wikipedia.org/wiki/Write_amplification

As an engineer, I have a deep mistrust of the phrase "Trust me", or of "Magic", or "Proprietary, secret" or "Proprietary, keep out!".
Anand at AnandTech has produced some good articles on some of what goes on inside SSDs and some of the consequences. If you want a good long read:

The SSD Relapse: Understanding and Choosing the Best SSD
http://www.anandtech.com/print/2829
Covers block allocation and write amplification, and the effect of free space on the write amplification factor. ...

The Fastest MLC SSD We've Ever Tested
http://www.anandtech.com/print/2899
Details the SandForce controller at that time and its use of data compression on the controller. The latest SandForce controllers also utilise data deduplication on the SSD!

OCZ Agility 3 (240GB) Review
http://www.anandtech.com/print/4346
Shows an example set of Performance vs Transfer Size graphs.

Flashy fists fly as OCZ and DDRdrive row over SSD performance
http://www.theregister.co.uk/2011/01/14/ocz_and_ddrdrive_performance_row/
Shows an old and unfair comparison highlighting SSD performance degradation due to write amplification for 4kByte random writes on a full device.

A bit of a joker in the pack are the SSDs that implement their own controller-level data compression and data deduplication (all proprietary and secret...). Of course, that is all useless for encrypted filesystems... Also, what does the controller-based data compression do for aligning to the underlying device blocks?

What is apparent from all that lot
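P.S. For anyone new to the term in those articles: the write amplification factor (WAF) is simply (bytes the controller writes to flash) / (bytes the host asked to write). A quick worst-case illustration with made-up numbers for my 16kB-chunk example device:

  # a 4kB host write forces a read-modify-write of a whole 16kB chunk
  echo "scale=1; 16/4" | bc    # WAF = 4.0 in the worst case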
btrfs across a mix of SSDs & HDDs
How well does btrfs perform across a mix of: 1 SSD and 1 HDD as a 'raid' 1 mirror for both data and metadata? Similarly so across 2 SSDs and 2 HDDs (4 devices)?

Can multiple (small) SSDs be 'clustered' as one device and then mirrored with one large HDD by btrfs directly? (Other than using lvm...)

The idea is to gain the random access speed of the SSDs, but have the HDDs as backup in case the SSDs fail due to wear...

The usage is to support a few hundred Maildirs + imap for users that often have many thousands of emails in the one folder for their inbox... (And no, the users cannot be trained to clean out their inboxes or to be more hierarchically tidy... :-( )

Or is btrfs as yet too immature to suffer such use?

Regards,

Martin
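P.S. For reference, the simplest form of what I'm asking about would be set up something like this (a sketch; device names assumed, /dev/sdb the SSD and /dev/sdc the HDD):

  # data and metadata both mirrored across the SSD and the HDD
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
  mount /dev/sdb /srv/mail

Though as far as I know, btrfs does not currently prefer the faster device for reads, so reads would be spread across both the SSD and the HDD rather than reliably seeing SSD speeds.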
Re: btrfs on low end and high end FLASH
On 02/05/12 00:18, Martin wrote: How well suited is btrfs to low-end and high-end FLASH devices?

Paraphrasing from a thread elsewhere: FLASH can be categorised into two classes, which have extremely different characteristics:

(a) the low-end (USB, SDHC, CF, cheap ATA SSD);

A good FYI detailing low-end FLASH devices is given at:

Flash memory card design
https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey

For those examples, it looks like write chunks of 32kBytes or more may well be a good idea...

and (b) the high-end (SAS, PCIe, NAS, expensive ATA SSD).

My own experience is that the low end (a) can have erase blocks as large as 4MBytes or more, and they are easily worn out to failure. I've no idea what their page sizes might be, nor what boundaries their wear levelling (if any) operates on. Their normal mode of operation is to use a FAT32 filesystem and to be filled up linearly with large files. I guess the more scattered layout of extN is none too sympathetic to their normal operation.

The high-end (b) may well have 4kByte pages or smaller, but they will typically operate with multiple-page chunks that are much larger, where 16kBytes appears to be the optimum performance size for the devices I've seen so far.

How well does btrfs fit in with the features of those two categories?

Regards,

Martin
Re: btrfs across a mix of SSDs & HDDs
Thanks for the good comments.

Is the OP using Oracle Linux?

He didn't say. But he didn't say he WON'T be using Oracle Linux (or another distro which supports btrfs) either. Plus the kernel can be installed on top of RHEL/CentOS 5 and 6, so he can easily choose either the supported version or the mainline version, each with its own consequences.

For further info: Nope, not using Oracle Linux. Then again, I'm reasonably distro-agnostic. I'm also happy to compile my own kernels.

And the system in question uses a HDD RAID and looks to be IOPS bound rather than actual IO data-rate bound. The large directories certainly don't help! It's running postfix + courier-imap at the moment, and I'm looking to revamp it for the gradually ever-increasing workload. CPU and RAM usage is low on average. It serves 2x Gbit networks + internet users (3 NIC ports).

Hence I'm considering the best way for a revamp/upgrade. SSDs would certainly help with the IOPS, but I'm cautious about SSD wear-out for a system that constantly thrashes through a lot of data. I could just throw more disks at it to divide up the IO load. Multiple pairs of HDD paired with SSD on md RAID 1 mirrors is a thought with ext4... bcache looks ideal to help, but also looks too 'experimental'.

And I was hoping that btrfs would help with handling the large directories and multi-user parallel accesses, especially so for being 'mirrored' by btrfs itself (at the filesystem level) across 4 disks for example.

Thoughts welcomed. Is btrfs development at the 'optimising' stage now, or is it all still very much a 'work in progress'?

Regards,

Martin
Re: btrfs and 1 billion small files
On 07/05/12 12:05, viv...@gmail.com wrote:

On 07/05/2012 11:28, Alessio Focardi wrote: Hi, I need some help in designing a storage structure for 1 billion small files (512 bytes each), and I was wondering how btrfs would fit in this scenario. Keep in mind that I have never worked with btrfs - I have just read some documentation and browsed this mailing list - so forgive me if my questions are silly! :X

Are you *really* sure a database is *not* what you are looking for?

My thought also. Or: 1 billion 512-byte files... Is that not a 512GByte HDD? With that, use a database to index your data by sector number and read/write your data directly to the disk? For that example, your database just holds filename, size, and sector.

If your 512-byte files are written and accessed sequentially, then just use a HDD and address them by sector number from a database index. That then becomes your 'filesystem'. If you need fast random access, then use SSDs.

Plausible?

Regards,

Martin
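P.S. A crude illustration of the "database as index, raw disk as data store" idea (sketch only; /dev/sdX is a dedicated scratch disk and the record number doubles as the sector number):

  REC=12345
  # write one 512-byte record at sector $REC
  dd if=record.bin of=/dev/sdX bs=512 seek=$REC count=1 conv=notrunc
  # read it back
  dd if=/dev/sdX of=record.out bs=512 skip=$REC count=1

The database then only needs to hold (filename, size, sector).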
Re: btrfs and 1 billion small files
On 08/05/12 13:31, Chris Mason wrote:

[...] A few people have already mentioned how btrfs will pack these small files into metadata blocks. If you're running btrfs on a single disk, [...] But the cost is increased CPU usage. Btrfs hits memmove and memcpy pretty hard when you're using larger blocks. I suggest using a 16K or 32K block size. You can go up to 64K; it may work well if you have beefy CPUs. Example for 16K:

mkfs.btrfs -l 16K -n 16K /dev/xxx

Is that still with -s 4K ?

Might that help SSDs that work in 16kByte chunks?

And why are memmove and memcpy more heavily used? Does that suggest better optimisation of the (meta)data, or just a greater housekeeping overhead to shuffle data to new offsets?

Regards,

Martin
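P.S. For the avoidance of doubt, the full invocation I have in mind keeps the data sector size at its default while enlarging only the metadata tree blocks (my reading of the options; please correct me if wrong):

  # 4kByte data sectors, 16kByte metadata leaves (-l) and nodes (-n)
  mkfs.btrfs -s 4096 -l 16384 -n 16384 /dev/sdX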
SSD format/mount parameters questions
For using SSDs: Are there any format/mount parameters that should be set for using btrfs on SSDs (other than the ssd mount option)?

General questions:

How long is the 'delay' for the delayed alloc?

Are file allocations aligned to 4kiB boundaries, or larger?

What byte value is used to pad unused space? (Aside: For some devices, the erased state reads all 0x00, and for others the erased state reads all 0xff.)

Background: I've got a mix of various 120/128GB SSDs to newly set up. I will be using ext4 on the critical ones, but also wish to compare with btrfs... The mix includes some SSDs with the SandForce controller that implements its own data compression and data deduplication. How well does btrfs fit with those, compared to other non-data-compression controllers?

Regards,

Martin
Re: SSD format/mount parameters questions
On 19/05/12 18:36, Martin Steigerwald wrote:

On Friday, 18 May 2012, Sander wrote: Martin wrote (ao): Are there any format/mount parameters that should be set for using btrfs on SSDs (other than the ssd mount option)? If possible, format the whole device; do not partition the ssd. This will guarantee proper alignment.

Current partitioning tools align at 1 MiB unless otherwise specified. And then that's only the alignment of the start of the filesystem, not the granularity that the filesystem itself uses to align its writes.

And then it's not clear to me what effect proper alignment will actually have, given the intelligent nature of SSD firmware.

That's what I'm trying to untangle, rather than just trusting to magic. I'm also not so convinced about the SSD firmware being quite so intelligent...

So far, the only clear indications are that a number of SSDs have a performance 'sweet spot' when you use 16kByte blocks for data transfer. Practicalities of the SSD internal structure strongly suggest that they work in chunks of data greater than 4kBytes. 4kByte operation is a strong driver for SSD manufacturers, but what compromises do they make to accommodate that?

And for btrfs: Extents are aligned to sector-size boundaries (4kBytes default). And there is a comment that setting larger sector sizes increases the CPU overhead in btrfs, due to the larger memory moves needed for making inserts into the trees. If the SSD is going to do a read-modify-write on anything smaller than 16kBytes in any case, might btrfs just as well use that chunk size to good advantage in the first place?

So, what is most significant?

Also: btrfs has the big advantages of checksumming and COW. However, ext4 is more mature, similarly uses extents, and also allows specifying a large delayed-allocation time to merge multiple writes, if you're happy your system is safely on a UPS...

I'm not too worried about this for MLC SSDs, but it is something that is of concern for the yet shorter program-erase lifespan of TLC SSDs.

Regards,

Martin
SSD erase state and reducing SSD wear
I've got two recent examples of SSDs. Their pristine state from the manufacturer shows:

Device Model: OCZ-VERTEX3

# hexdump -C /dev/sdd
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
1bf2976000

Device Model: OCZ VERTEX PLUS (OCZ VERTEX 2E)

# hexdump -C /dev/sdd
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
df99e6000

What's a good way to test what state they get erased to from a TRIM operation?

Can btrfs detect the erase state and pad unused space in filesystem writes with the same value, so as to reduce SSD wear?

Regards,

Martin
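P.S. My current thought for testing the TRIMmed state is along these lines (a destructive sketch for a scratch device, assuming a recent util-linux with blkdiscard and TRIM support through the whole stack):

  # write a known pattern over the first 1MiB
  dd if=/dev/urandom of=/dev/sdd bs=1M count=1 oflag=direct
  # TRIM that range
  blkdiscard -o 0 -l 1048576 /dev/sdd
  # see what the device now returns for the discarded range
  hexdump -C -n 1048576 /dev/sdd | head

  # whether the answer is even stable depends on the device:
  hdparm -I /dev/sdd | grep -i trim   # look for "Deterministic read ... after TRIM"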
Re: SSD erase state and reducing SSD wear
On 23/05/12 05:19, Calvin Walton wrote:

On Tue, 2012-05-22 at 22:47 +0100, Martin wrote: I've got two recent examples of SSDs. Their pristine state from the manufacturer shows: Device Model: OCZ-VERTEX3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Device Model: OCZ VERTEX PLUS ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff What's a good way to test what state they get erased to from a TRIM operation?

This pristine state probably matches up with the result of a trim command on the drive. In particular, a freshly erased flash block is in a state where the bits are all 1, so the Vertex Plus drive is showing you the flash contents directly. The Vertex 3 does substantially more processing, and the 0s are effectively generated on the fly for unmapped flash blocks (similar to how the missing portions of a sparse file contain 0s).

So for that example of reading an 'empty' drive, the OCZ-VERTEX3 might not even be reading the flash chips at all!...

Can btrfs detect the erase state and pad unused space in filesystem writes with the same value so as to reduce SSD wear?

On the Vertex 3, this wouldn't actually do what you'd hope. The firmware in that drive actually compresses, deduplicates, and encrypts all the data prior to writing it to flash - and as a result, the data that hits the flash looks nothing like what the filesystem wrote. (For best performance, it might make sense to disable btrfs's built-in compression on the Vertex 3 drive to allow the drive's compression to kick in. Let us know if you benchmark it either way.)

Very good comment, thanks.

That leaves a very good question of how the SandForce controller uses the flash. Does it implement its own 'virtual block level' interface, using structures on the underlying flash that are not visible externally? What does that do to concerns about alignment?... And for what granularity of write chunks?

The benefit to doing this on the Vertex Plus is probably fairly small, since rewriting a block - even if the block is partially unwritten - is still likely to require a read-modify-write cycle with an erase step. The granularity of the erase blocks is just too big for the savings to be very meaningful.

My understanding is that the 'wear' mechanism in flash is a problem of charge getting trapped in the insulation material itself that surrounds the floating gate of a cell. The permanently trapped charge accumulates further with each change of state, until a high enough offset voltage has accumulated to exceed what can be tolerated for correct operation of the cell. Hence, writing the *same value* as that already stored in a cell should not cause any wear, since you are not changing the state of the cell. (No change in charge levels.)

For non-SandForce controllers, that suggests doing a read-modify-write to pad out whatever is the minimum-sized write chunk. That would be rather poor for performance, and the manufacturers' secrecy means we cannot be sure of the underlying write block size for minimum-sized alignment. Alternatively, padding out writes with the erased-state value means that no further wear should be caused when that block is eventually TRIMed/erased for rewriting. That should also be a 'soft' option for the SandForce controllers, in that /hopefully/ their compression/deduplication will compress down the padding so as not to be a problem. (Damn the manufacturers' secrecy!)
Regards,

Martin
Re: Will big metadata blocks fix # of hardlinks?
Thanks for noting this one. That is one very surprising and unexpected limit!... And a killer for some not-so-rare applications...

On 26/05/12 19:22, Sami Liedes wrote:

Hi! I see that Linux 3.4 supports bigger metadata blocks for btrfs. Will using them allow a bigger number of hardlinks on a single file (i.e. the bug that has bitten at least git users on Debian[1,2], and BackupPC[3])? As far as I understand, the problem has been that the hard links are stored in the same metadata block as some other metadata, so the size of the block is an inherent limitation? If so, I think it would be worth it for me to try Btrfs again :)

Sami

[1] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/13603
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=642603
[3] https://bugzilla.kernel.org/show_bug.cgi?id=15762

One example fail case is just 13 hard links. Even at x4 that (16k blocks), that only gives 52 links for that example fail case.

The brief summary for those is:

* It's a rare corner case that needs a format change to fix, so won't-fix;

* There are real-world problem examples noted in those threads, such as: BackupPC (backups); the nnmaildir mail backend in Gnus (an Emacs package for reading news and email); and a web archiver.

* Also, Bacula (backups) and Mutt (email client) are quoted as problem examples in:

Btrfs File-System Plans For Ubuntu 12.10
http://www.phoronix.com/scan.php?page=news_item&px=MTEwMDE

For myself, I have a real-world example for deduplication of identical files from a proprietary data capture system where the filenames change (timestamp and index data are stored in the filename), yet there are periods where the file contents change only occasionally... The 'natural' thing to do is hardlink together all the identical files to then just have the unique filenames... And you might have many files in a particular directory...

Note that for long filenames (surprisingly commonly done!), one fail case noted above is just 13 hard links.

Looks like I'm stuck on ext4 with an impoverished cp -l for a fast 'snapshot' for the time being still... (Or differently, LVM snapshot and copy.)

For btrfs, rather than a break-everything format change, can a neat and robust 'workaround' be made, so that the problem-case hardlinks to a file within the same directory perhaps spawn their own transparent subdirectory for the hard links?... The worst case then is that upon a downgrade to an older kernel, the 'transparent' subdirectory of hard links becomes visible as a distinct subdirectory? (That is a 'break', but at least data isn't lost.)

Or am I chasing the wrong bits? ;-)

More seriously: The killer there for me is that running rsync or running a deduplication script might hit too many hard links that were perfectly fine on ext4.

Regards,

Martin
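P.S. The limit is easy to probe empirically (a sketch; run it in a scratch directory on the filesystem under test):

  touch target
  i=0
  # long link names eat metadata leaf space faster, hitting the limit sooner
  while ln target "a_deliberately_long_hard_link_file_name_number_$i" 2>/dev/null
  do
      i=$((i+1))
  done
  echo "created $i hard links before failure"

On btrfs with 4k metadata blocks and names of that length, the loop should stop after only a handful of links (13 for the fail case noted above); on ext4 it runs all the way to the per-inode limit of 65000 links.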
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 17/06/14 02:13, cwillu wrote:

It's not an mmap problem, it's a "small writes with an msync or fsync after each one" problem.

And for logging, that is exactly what is wanted, so that you can see why whatever it was crashed...

Except... Whilst logging, hold off on the msync/fsync unless the next log message to be written is 'critical'? With that, the mundane logging gets appended just as for any normal file write. Only the more critical log messages suffer the extra overhead and fragmentation of an immediate msync/fsync.

For the case of sequential writes (via write or mmap), padding writes to page boundaries would help, if the wasted space isn't an issue. Another approach, again assuming all other writes are appends, would be to periodically (but frequently enough that the pages are still in cache) read a chunk of the file and write it back in-place, with or without an fsync. On the other hand, if you can afford to lose some logs on a crash, not fsyncing/msyncing after each write will also eliminate the fragmentation.

(Worth pointing out that none of that is conjecture; I just spent 30 minutes testing those cases while composing this ;p)

Josef has mentioned on irc that a piece of Chris' raid5/6 work will also fix this when it lands.

Interesting...

The source of the problem is how COW fragments the file under expected normal use... Is all this unavoidable unless we rethink the semantics?

Regards,

Martin
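P.S. The fragmentation under discussion is easy to observe directly (a sketch; the journal path is the usual systemd default and may differ on your system):

  # count the extents of the active journal file
  filefrag /var/log/journal/*/system.journal

An append-plus-fsync workload on a COW file typically shows up here as hundreds or thousands of small extents, where the same workload on ext4 yields only a handful.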
Re: Putting very big and small files in one subvolume?
Good questions, and good comment already given. For another view...

On 17/08/14 13:31, Duncan wrote:

Shriramana Sharma posted on Sun, 17 Aug 2014 14:26:06 +0530 as excerpted: Hello. One more Q re generic BTRFS behaviour. https://btrfs.wiki.kernel.org/index.php/Main_Page specifically advertises BTRFS's "Space-efficient packing of small files". So far (on ext3/4) I have been using two partitions for small/regular files (like my source code repos, home directory with its hidden config subdirectories etc) and big files (like downloaded Linux ISOs, VMs etc), under some sort of understanding that this will help curb fragmentation...

The cases of pathological fragmentation by btrfs (for 'database-style' files and VM image files especially) have been mentioned, as have the use of nocow and/or separate subvolumes to reduce or slow down the buildup of the fragmentation. systemd logging even bulldozed blindly into that one spectacularly!...

There is now a defragment option. However, that does not scale well for large or frequently rewritten files, and you gamble on how much IO bandwidth you can afford to lose rewriting *entire* files. The COW fragmentation problem is not going to go away. Also, there is quite a high requirement for user awareness to specially mark directories/files as nocow. And even then, that still does not work well if multiple snapshots are being taken...!

Could a better and more complete fix be to automatically defragment, say, just 4x the size being written for a file segment? Also, for the file segment being defragged, abandon any links to other snapshots, to in effect deliberately replicate the data where appropriate so that the data segment is fully defragged.

In any case, since BTRFS effectively discourages usage of separate partitions to take advantage of subvolumes etc, and given the above claim to the FS automatically handling small files efficiently, I wonder if it makes sense any longer to create separate subvolumes for such big/small files as I describe in my use case?

It's worth noting that btrfs subvolumes are a reasonably lightweight construct, comparable enough to ordinary subdirectories that they're presented that way when browsing a parent subvolume, and there was actually discussion of making subvolumes and subdirs the exact same thing, effectively turning all subdirs into subvolumes. As it turns out, that wasn't feasible, due not to btrfs limitations but (as I understand it) to assumptions about subdirectories vs. mountable entities (subvolumes) built into the Linux POSIX and VFS levels...

Due to namespaces and inode number spaces?...

OTOH, I tend to be rather more of an independent partition booster than many. The biggest reason for that is the too-many-eggs-in-one-basket problem. Fully separate filesystems on separate partitions...

I do similarly myself. A good scheme that I have found to work well for my cases is to have separate partitions for:

/boot
/var
/var/log
/
/usr
/home
/mnt/data...

And all the better and easy to do using GPT partition tables.

The one aspect of all that is that you can protect your system from becoming jammed by a full disk, for whatever reason, and all without needing to resort to quotas. So for example, rogue logging can fill up /var/log and you can still use the system and easily tidy things up.

However, that scheme does also require that you have a good idea of what partition sizes you will need, right from when first set up.
You can 'cheat' and gain flexibility, at the expense of HDD head seek time, by cobbling together LVM volumes as and when needed to resize whichever filesystem.

Which is where btrfs comes into play: if you can trust not to lose all your eggs to btrfs corruption, you can utilise your partition scheme with subvolumes and quotas and allow the intelligence in btrfs to make everything work well, even if you change what size (quota) you want for a subvolume. The ENTIRE disk (no partition table) is all btrfs.

Special NOTE: Myself, I consider btrfs *quotas* to be still very experimental at the moment and not to be used with valued data!

Other big plusses of btrfs for me are the raid and snapshots. The killer question though is how robust the filesystem is against corruption and random data/hardware failure. btrfsck? Always keep multiple backups!

Regards,

Martin
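P.S. For completeness, the subvolume-plus-quota version of that partition scheme looks something like this (a sketch; mount point assumed, and note my caveat above about quotas still being experimental):

  mount /dev/sdX /mnt/pool
  btrfs subvolume create /mnt/pool/var_log
  btrfs quota enable /mnt/pool
  btrfs qgroup limit 10G /mnt/pool/var_log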
Re: Performance Issues
On 20/09/14 09:23, Marc Dietrich wrote:

On Friday, 19 September 2014, 13:51:22, Holger Hoffstätte wrote:

On Fri, 19 Sep 2014 13:18:34 +0100, Rob Spanton wrote: I have a particularly uncomplicated setup (a desktop PC with a hard disk) and I'm seeing particularly slow performance from btrfs. A `git status` in the linux source tree takes about 46 seconds after dropping caches, whereas on other machines using ext4 this takes about 13s. My mail client (evolution) also seems to perform particularly poorly on this setup, and my hunch is that it's spending a lot of time waiting on the filesystem.

This is - unfortunately - a particular btrfs oddity/characteristic/flaw, whatever you want to call it. git relies a lot on fast stat() calls, and those seem to be particularly slow with btrfs, esp. on rotational media. I have the same problem with rsync on a freshly mounted volume; it gets fast (quite so!) after the first run.

my favorite benchmark is ls -l /usr/bin:

ext4: 0.934s
btrfs: 21.814s

So... On my old low-power slow Atom SSD ext4 system:

time ls -l /usr/bin
real	0m0.369s
user	0m0.048s
sys	0m0.128s

Repeated:

real	0m0.107s
user	0m0.040s
sys	0m0.044s

and that is for:

# ls -l /usr/bin | wc
1384 13135 88972

On a comparatively super dual-core Athlon64 SSD three-disk btrfs raid1 system:

real	0m0.103s
user	0m0.004s
sys	0m0.040s

Repeated:

real	0m0.027s
user	0m0.008s
sys	0m0.012s

For:

# ls -l /usr/bin | wc
1449 13534 89024

And on an identical comparatively super dual-core Athlon64 HDD 'spinning rust' two-disk btrfs raid1 system:

real	0m0.101s
user	0m0.008s
sys	0m0.020s

Repeated:

real	0m0.020s
user	0m0.004s
sys	0m0.012s

For:

# ls -l /usr/bin | wc
1161 10994 79350

So, no untoward concerns there. Marc: Are you on something really ancient and hopelessly fragmented into oblivion?

also mounting large partitions (several 100Gs) takes a lot of time on btrfs.

I've noticed that also: for some 16TB btrfs raid1 mounts, btrfs is not as fast as mounting ext4, but then again it is all very much faster than mounting ext4 when an fsck count is tripped!... So, nothing untoward there.

For my usage, controlling fragmentation, and having some automatic mechanism to deal with pathological fragmentation with such as sqlite files, are greater concerns! (Yes, there is the manual fix of NOCOW... I also put such horrors into tmpfs and snapshot that... All well and good, but all unnecessary admin tasks!)

Regards,

Martin
RAID device nomination (Feature request)
Dear Devs,

I have a number of esata disk packs, holding 4 physical disks each, where I wish to use the disk packs aggregated for 16TB and up to 64TB of backups...

Can btrfs...?

1: Mirror data such that there is a copy of the data on each *disk pack*? Note that esata shows just the disks as individual physical disks, 4 per disk pack. Can physical disks be grouped together to force the RAID data to be mirrored across all the nominated groups?

2: Similarly for a mix of different storage technologies, such as by manufacturer or type (SSD/HDD), can the disks be grouped to ensure a copy of the data is replicated across all the groups? For example, I deliberately buy HDDs from different batches/manufacturers to try to avoid common-mode or similarly timed failures. Can btrfs be guided to safely spread the RAID data across the *different* hardware types/batches?

3: Also, for different speeds of disks, can btrfs tune itself to balance the read/writes accordingly?

4: Further thought: For SSDs, is the minimise-head-movement 'staircase' code bypassed, so as to speed up allocation for the don't-care addressing (near zero seek time) of SSDs?

And then again: Is 64TBytes of btrfs a good idea in the first place?! (There's more than one physical set of backups, but I'd rather not suffer weeks to recover from one hiccup in the filesystem... Should I partition btrfs down to smaller gulps, or does the structure of btrfs in effect already do that?)

Thanks,

Martin
Re: [RFC] Online dedup for Btrfs
Apart from the dates, this sounds highly plausible :-)

If the hashing is done before the compression, and the compression is done for isolated blocks, then this could even work! Any takers? ;-)

For a performance enhancement, keep a hash tree in memory for the n most recently used/seen blocks?...

A good writeup! Thanks for a good giggle. :-)

Regards,

Martin

On 01/04/13 15:44, Harald Glatt wrote:

On Mon, Apr 1, 2013 at 2:50 PM, Josef Bacik jba...@fusionio.com wrote:

Hello, I was bored this weekend so I hacked up online dedup for Btrfs. It's working quite well, so I think it can be more widely tested. There are two ways to use it:

1) Compatible mode - this is a bit slower but will handle being used by older kernels. We use the csum tree to find duplicate blocks. Since it is relatively easy to have crc32c collisions, this also involves reading the block from disk and doing a memcmp with the block we want to write, to verify it has the same data. This is way slow, but hey, no incompat flag!

2) Incompatible mode - this is the way you probably want to use it if you don't care about being able to go back to older kernels. You select your hashing function (at the moment I only support sha1, but there is room in the format to have different functions). This creates a btree indexed by the hash and the bytenr. Then we look up the hash and just link the extent in if it matches the hash. You can use -o paranoid-dedup if you are paranoid about hash collisions, and this will force it to do the memcmp() dance to make sure that the extent we are deduping really matches the extent.

So performance-wise, obviously the compat mode sucks. It's about 50% slower on disk and about 20% slower on my Fusion card. We get pretty good space savings, about 10% in my horrible test (just copy a git tree onto the fs), but IMHO not worth the performance hit. The incompat mode is a bit better, only a 15% drop on disk and about 10% on my Fusion card. Closer to the crc numbers if we have -o paranoid-dedup. The space savings are better since it uses the original extent sizes; we get about 15% space savings.

Please feel free to pull and try it, you can get it here:

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git dedup

Thanks!

Josef

Hey Josef, that's really cool! Can this be used together with lzo compression for example? How high (roughly) is the impact of something like compress-force=lzo compared to the 15% hit from this dedup? Thanks!

Harald
Re: RAID device nomination (Feature request)
On 18/04/13 15:06, Hugo Mills wrote:

On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote: Dear Devs, I have a number of esata disk packs holding 4 physical disks each where I wish to use the disk packs aggregated for 16TB and up to 64TB backups... Can btrfs...? 1: Mirror data such that there is a copy of data on each *disk pack*? Note that esata shows just the disks as individual physical disks, 4 per disk pack. Can physical disks be grouped together to force the RAID data to be mirrored across all the nominated groups?

Interesting you should ask this: I realised quite recently that this could probably be done fairly easily with a modification to the chunk allocator.

Hey, that sounds good. And easy? ;-) Possible?...

2: Similarly for a mix of different storage technologies such as manufacturer or type (SSD/HDD), can the disks be grouped to ensure a copy of the data is replicated across all the groups? For example, I deliberately buy HDDs from different batches/manufacturers to try to avoid common mode or similarly timed failures. Can btrfs be guided to safely spread the RAID data across the *different* hardware types/batches?

From the kernel point of view, this is the same question as the previous one.

Indeed so. The question is how the groups of disks are determined: Manually by the user for mkfs.btrfs and/or specified when disks are added/replaced? Or somehow automatically detected (but with a user override)? Have a disk group UUID for a group of disks, similar to that done for md-raid?

3: Also, for different speeds of disks, can btrfs tune itself to balance the read/writes accordingly?

Not that I'm aware of.

A 'nice to have' would be some sort of read-access load balancing, with options to balance latency or queue depth... Could btrfs do that independently of (complementary with) the block layer schedulers?

4: Further thought: For SSDs, is the minimise-head-movement 'staircase' code bypassed so as to speed up allocation for the don't-care addressing (near zero seek time) of SSDs?

I think this is more to do with the behaviour of the block layer than the FS. There are alternative elevators that can be used, but I don't know how to configure them (or whether they need configuring at all).

Regardless of the block-level IO schedulers, does not btrfs determine the LBA allocation?... For example, for an SSD, the next free-space allocation for whatever is to be newly written could become more like a log-based round-robin allocation across the entire SSD (NILFS-like?), rather than trying to localise data to minimise the physical head movement as for a HDD. Or is there no useful gain in that over simply using the same one lump of allocator code as for HDDs?

You have backups, which is good. Keep up with the latest kernels from kernel.org. The odds of you hitting something major are small, but non-zero. One thing that's probably fairly likely with your setup [...]

Healthy paranoia is good ;-)

[...] So with light home use on a largeish array, I've had a number of cockups recently that were recoverable, albeit with some swearing.

Thanks for the notes.

On the other hand, it's entirely possible that something else will go wrong and things will blow up. My guess is that unless you have [...]

My worry for moving up to spreading a filesystem across multiple disk packs is for when the disk pack hardware itself fails, taking out all four disks... (And there's always the worry of the esata lead getting yanked, taking out all four disks...)
Thanks,

Martin
Re: RAID device nomination (Feature request)
On 18/04/13 20:44, Hugo Mills wrote: On Thu, Apr 18, 2013 at 05:29:10PM +0100, Martin wrote: On 18/04/13 15:06, Hugo Mills wrote: On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote: Dear Devs, I have a number of esata disk packs holding 4 physical disks each where I wish to use the disk packs aggregated for 16TB and up to 64TB backups... Can btrfs...? 1: Mirror data such that there is a copy of data on each *disk pack* ? Note that esata shows just the disks as individual physical disks, 4 per disk pack. Can physical disks be grouped together to force the RAID data to be mirrored across all the nominated groups? Interesting you should ask this: I realised quite recently that this could probably be done fairly easily with a modification to the chunk allocator. Hey, that sounds good. And easy? ;-) Possible?... We'll see... I'm a bit busy for the next week or so, but I'll see what I can do. Thanks greatly. That should nicely let me stay with my plan A and just let btrfs conveniently expand over multiple disk packs :-) (I'm playing 'safe' for the moment while I can by putting in bigger disks into new packs as needed. I've some packs with smaller disks that are nearly full that I want to continue to use so I'm agonising over whether to replace all the disks and rewrite all the data or use multiple disk packs as one. Plan A is good for keeping the existing disks :-) ) [...] The question is how the groups of disks are determined: Manually by the user for mkfs.btrfs and/or specified when disks are added/replaced; Or somehow automatically detected (but with a user override). Have a disk group UUID for a group of disks similar to that done for md-raid? I was planning on simply having userspace assign a (small) integer to each device. Devices with the same integer are in the same group, and won't have more than one copy of any given piece of data assigned to them. Note that there's already an unused disk group item which is a 32-bit integer in the device structure, which looks like it can be repurposed for this; there's no spare space in the device structure, so anything more than that will involve some kind of disk format change. The repurpose for no format change sounds very good and 32-bits should be enough for anyone. (Notwithstanding the inevitable 640k comments!) A 32-bit unsigned-int number that the user specifies? Or include a semi-random automatic numbering to a group of devices listed by the user?... Then again, I can't imagine anyone wanting to go beyond 8-bits... Hence a 16-bit unsigned int is still suitably overkill. That then offers the other 16-bits for some other repurpose ;-) For myself, it would be nice to be able to specify a number that is the same unique number that's stamped on the disk packs so that I can be sure what has been plugged in! (Assuming there's some option to list what's been plugged in.) 3: Also, for different speeds of disks, can btrfs tune itself to balance the read/writes accordingly? Not that I'm aware of. A 'nice to have' would be some sort of read-access load balancing with options to balance latency or queue depth... Could btrfs do that independently (complimentary with) of the block layer schedulers? All things are possible... :) Whether it's something that someone will actually do or not, I don't know. There's an argument for getting some policy into that allocation decision for other purposes (e.g. trying to ensure that if a disk dies from a filesystem with single allocation, you lose the fewest number of files). 
On the other hand, this is probably going to be one of those things that could have really nasty performance effects. It's also somewhat beyond my knowledge right now, so someone else will have to look at it. :)

Sounds ideal for some university research ;-)

[...] For example, for an SSD, the next free-space allocation for whatever is to be newly written could become more like a log-based round-robin allocation across the entire SSD (NILFS-like?), rather than trying to localise data to minimise the physical head movement as for a HDD. Or is there no useful gain in that over simply using the same one lump of allocator code as for HDDs?

No idea. It's going to need someone to write the code and benchmark the options, I suspect.

A second university project? ;-)

[...] (And there's always the worry of the esata lead getting yanked to take out all four disks...)

As I said, I've done the latter myself. The array *should* go into [...]

Looks like I'll likely get to find out for myself sometime or other...

Thanks for your help, and keep me posted please. I'll be experimenting with the groupings as soon as they come along. Also with the dedup work that is being done.

Regards,

Martin
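P.S. To be clear about what I'd hope the userspace side might eventually look like, something along these lines (entirely hypothetical command syntax - nothing like this exists yet - just to illustrate the grouping idea):

  # put each esata pack's four disks into their own group,
  # numbered to match the label stamped on the pack
  btrfs device group 101 /dev/sd[abcd]
  btrfs device group 102 /dev/sd[efgh]
  # raid1 chunk allocation would then never place both copies
  # of a chunk on devices within the same group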
Re: RAID device nomination (Feature request)
On 18/04/13 20:48, Alex Elsayed wrote:

Hugo Mills wrote: On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote: Dear Devs, snip Note that esata shows just the disks as individual physical disks, 4 per disk pack. Can physical disks be grouped together to force the RAID data to be mirrored across all the nominated groups? Interesting you should ask this: I realised quite recently that this could probably be done fairly easily with a modification to the chunk allocator. snip

One thing that might be an interesting approach: Ceph is already in mainline, and uses CRUSH in a similar way to what's described (topology-aware placement+replication). Ceph does it by OSD nodes rather than by disk, and the units are objects rather than chunks, but it could potentially be a rather good fit. CRUSH does it by describing a topology hierarchy and allocating the OSD ids to that hierarchy. It then uses that to map from a key to one-or-more locations. If we use the chunk ID as the key, and use UUID_SUB in place of the OSD id, it could do the job.

OK... That was a bit of a crash course (ok, sorry for the pun on crush :-) ):

http://www.anchor.com.au/blog/2012/09/a-crash-course-in-ceph/

Interesting that the CRUSH map is written by hand, then compiled and passed to the cluster. Hence, it looks like the approach is simply to have the sysadmin specify what gets grouped into which group. (I certainly know what disk is where and where I want the data mirrored!)

For my example, the disk packs are plugged into two servers (up to four packs at a time at present), so that we have some fail-over if one server dies. Ceph looks to be a little overkill for just two big storage users. Or perhaps include the same Ceph code routines in btrfs?...

Regards,

Martin
grub/grub2 boot into btrfs raid root and with no initrd
I've made a few attempts to boot into a root filesystem created using:

mkfs.btrfs -d raid1 -m raid1 -L btrfs_root_3 /dev/sda3 /dev/sdb3

Both grub and grub2 pick up a kernel image fine from an ext4 /boot on /dev/sda1 for example, but then fail to find or assemble the btrfs root. Setting up an initrd and grub operates fine for the btrfs raid.

What is the special magic to do this without the need for an initrd?

Is the comment/patch below from last year languishing unknown? Or is there some problem with that kernel approach?

Thanks,

Martin

See: http://forums.gentoo.org/viewtopic-t-923554-start-0.html

Below is my patch, which is working fine for me with 3.8.2.

Code:

$ cat /etc/portage/patches/sys-kernel/gentoo-sources/earlydevtmpfs.patch
--- init/do_mounts.c.orig	2013-03-24 20:49:53.446971127 +0100
+++ init/do_mounts.c	2013-03-24 20:51:46.408237541 +0100
@@ -529,6 +529,7 @@
 	create_dev("/dev/root", ROOT_DEV);
 	if (saved_root_name[0]) {
 		create_dev(saved_root_name, ROOT_DEV);
+		devtmpfs_mount("dev");
 		mount_block_root(saved_root_name, root_mountflags);
 	} else {
 		create_dev("/dev/root", ROOT_DEV);
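P.S. For the record, the grub2 kernel line being tested is of this form (a sketch; the btrfs "device=" rootflags are the part that matters, but without something like the devtmpfs patch above there are no early device nodes for them to resolve against, hence the failure):

  linux /vmlinuz root=/dev/sda3 rootfstype=btrfs \
        rootflags=device=/dev/sda3,device=/dev/sdb3 ro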
Will btrfs scrub clear corrupt filesystem trees?
I have 1.5TB of data on a single disk formatted with defaults. There appear to be only two directory trees, of a few MBytes each, that have suffered corruption (due to too high a SATA speed in the past causing corruption). The filesystem mounts fine. But how to clear out the corrupt trees?

At the moment, I have running:

btrfsck --repair --init-extent-tree /dev/sdc

parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
Ignoring transid failure
...

And it is still running, after over two days now. Looped?

Would a:

btrfs scrub start

clear out the corrupt trees? Must I wait for the btrfsck to complete if it is recreating an extent tree?...

Suggestions welcomed...

Martin
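P.S. In case scrub is the answer, what I'd run is (a sketch, on the mounted filesystem):

  btrfs scrub start /mnt/data
  btrfs scrub status /mnt/data   # poll for progress and error counts

...though my understanding is that scrub only verifies and repairs from redundant checksummed copies, so on a single device it can fix a bad copy of duplicated ("dup") metadata, but cannot repair single-copy data.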
No apparent effect for btrfs device delete missing
Trying:

btrfs device delete missing /

appears not to do anything for a / mount where I have swapped out a HDD:

# btrfs filesystem show
Label: 'test_btrfs_misc_5'  uuid: 7d29d4e6-efdc-41dc-9aa8-e74dfbe13cc9
	Total devices 2 FS bytes used 28.00KB
	devid 1 size 59.74GB used 2.03GB path /dev/sdd5
	*** Some devices missing

Label: 'test_btrfs_root_4'  uuid: 269e142c-e561-4227-b2b0-fe2f9fb99391
	Total devices 3 FS bytes used 10.55GB
	devid 4 size 56.00GB used 12.03GB path /dev/sde4
	devid 1 size 56.00GB used 12.05GB path /dev/sdd4
	*** Some devices missing

Btrfs v0.20-rc1-358-g194aa4a

# btrfs device delete missing /

# btrfs filesystem show
Label: 'test_btrfs_misc_5'  uuid: 7d29d4e6-efdc-41dc-9aa8-e74dfbe13cc9
	Total devices 2 FS bytes used 28.00KB
	devid 1 size 59.74GB used 2.03GB path /dev/sdd5
	*** Some devices missing

Label: 'test_btrfs_root_4'  uuid: 269e142c-e561-4227-b2b0-fe2f9fb99391
	Total devices 3 FS bytes used 10.55GB
	devid 4 size 56.00GB used 12.03GB path /dev/sde4
	devid 1 size 56.00GB used 12.05GB path /dev/sdd4
	*** Some devices missing

Btrfs v0.20-rc1-358-g194aa4a

All on the latest Linux 3.11.5-gentoo.

# df -h | egrep '/$'
rootfs     112G   22G   89G  20% /
/dev/sdd4  112G   22G   89G  20% /

Aside: Adding the /dev/sde4 device caused no balance action until I deleted a device to reduce the raid1 mirror (data and metadata) down to the two devices.

The missing device was an old HDD that had physically failed. No data was lost for that example failure.

Hope of interest,

Martin
8 days looped? (btrfsck --repair --init-extent-tree)
Dear list,

I've been trying to recover a 2TB single-disk btrfs from a good few days ago, as already commented on the list. btrfsck complained of an error in the extents and so I tried:

btrfsck --repair --init-extent-tree /dev/sdX

That was 8 days ago. The btrfsck process is still running at 100% CPU, but with no disk activity and no visible change in memory usage. Looped?

Is there any way to check whether it is usefully doing anything, or whether this is a lost cause?

The only output it has given, within a few seconds of starting, is:

parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
Ignoring transid failure

Any comment/interest before abandoning? This all started from trying to delete/repair a directory tree of a few MBytes of files...

Regards,

Martin
Re: 8 days looped? (btrfsck --repair --init-extent-tree)
On 22/10/13 19:17, Josef Bacik wrote:

On Tue, Oct 22, 2013 at 06:58:48PM +0100, Martin wrote: Dear list, I've been trying to recover a 2TB single disk btrfs from a good few days ago as already commented on the list. btrfsck complained of an error in the extents and so I tried: btrfsck --repair --init-extent-tree /dev/sdX That was 8 days ago. The btrfs process is still running at 100% cpu but with no disk activity and no visible change in memory usage. Looped? Is there any way to check whether it is usefully doing anything or whether this is a lost cause? The only output it has given, within a few seconds of starting, is: parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 Ignoring transid failure Any comment/interest before abandoning? This all started from trying to delete/repair a directory tree of a few MBytes of files...

Sooo it probably is looped. You should be able to attach gdb to it and run bt to see where it is stuck, and send that back to the list so we can figure out what to do. Thanks,

OK... But I doubt this helps much:

(gdb) bt
#0  0x0042b93f in ?? ()
#1  0x0041cf10 in ?? ()
#2  0x0041e29d in ?? ()
#3  0x0041e8ae in ?? ()
#4  0x00425bf2 in ?? ()
#5  0x00425cae in ?? ()
#6  0x00421e87 in ?? ()
#7  0x00422022 in ?? ()
#8  0x0042210c in ?? ()
#9  0x00416b07 in ?? ()
#10 0x004043ad in ?? ()
#11 0x7f5ba972860d in __libc_start_main () from /lib64/libc.so.6
#12 0x004043dd in ?? ()
#13 0x7fff7ead12a8 in ?? ()
#14 0x in ?? ()
#15 0x0004 in ?? ()
#16 0x0064f4d0 in ?? ()
#17 0x7fff7ead2469 in ?? ()
#18 0x7fff7ead2472 in ?? ()
#19 0x7fff7ead2485 in ?? ()
#20 0x in ?? ()

At least it stays consistent when repeated!

Recompiling with -ggdb for the symbols and rerunning:

# gdb /sbin/btrfsck 17151
GNU gdb (Gentoo 7.5.1 p2) 7.5.1
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as x86_64-pc-linux-gnu.
For bug reporting instructions, please see: http://bugs.gentoo.org/...
Reading symbols from /sbin/btrfsck...Reading symbols from /usr/lib64/debug/sbin/btrfsck.debug...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Attaching to program: /sbin/btrfsck, process 17151
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
Reading symbols from /lib64/libuuid.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libuuid.so.1
Reading symbols from /lib64/libblkid.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libblkid.so.1
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /usr/lib64/liblzo2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/liblzo2.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0041e74f in btrfs_search_slot ()

(gdb) bt
#0  0x0041e74f in btrfs_search_slot ()
#1  0x004259fa in find_first_block_group ()
#2  0x00425ab4 in btrfs_read_block_groups ()
#3  0x00421c15 in btrfs_setup_all_roots ()
#4  0x00421dce in __open_ctree_fd ()
#5  0x00421ea8 in open_ctree_fs_info ()
#6  0x004169b4 in cmd_check ()
#7  0x0040443b in main ()

And over twelve hours later:

(gdb)
#0  0x0041e74f in btrfs_search_slot ()
#1  0x004259fa in find_first_block_group ()
#2  0x00425ab4 in btrfs_read_block_groups ()
#3  0x00421c15 in btrfs_setup_all_roots ()
#4  0x00421dce in __open_ctree_fd ()
#5  0x00421ea8 in open_ctree_fs_info ()
#6  0x004169b4 in cmd_check ()
#7  0x0040443b in main ()

Any further debug useful?

Regards,

Martin
Re: 8 days looped? (btrfsck --repair --init-extent-tree)
On 23/10/13 17:21, Josef Bacik wrote:

On Wed, Oct 23, 2013 at 04:32:51PM +0100, Martin wrote: Any further debug useful?

Nope, I know where it's breaking. I need to fix how we init the extent tree. Thanks,

Good stuff. If of help, I can test new code or a patch against that example. (I'll leave the disk in place for the time being.)

Thanks,

Martin
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
On 25/10/13 19:01, Josef Bacik wrote: Unfortunately you can't run --init-extent-tree if you can't actually read the extent root. Fix this by allowing partial starts with no extent root and then have fsck only check to see if the extent root is uptodate _after_ the check to see if we are init'ing the extent tree. Thanks, Signed-off-by: Josef Bacik jba...@fusionio.com --- cmds-check.c | 9 ++--- disk-io.c| 16 ++-- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 69b0327..8ed7baa 100644 --- a/cmds-check.c +++ b/cmds-check.c Hey! Quick work!... Is that worth patching locally and trying against my example? Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
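Applying a patch from the list to a local btrfs-progs checkout is the usual routine; a sketch, with the saved filenames being hypothetical:
  $ cd btrfs-progs
  $ git am ~/init-extent-tree-fix.mbox
or, for a mail saved without full headers:
  $ patch -p1 < ~/init-extent-tree-fix.patch
  $ make
git am keeps the commit message and authorship; plain patch(1) is fine for a quick test build.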
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
On 25/10/13 19:31, Josef Bacik wrote: On Fri, Oct 25, 2013 at 07:27:24PM +0100, Martin wrote: On 25/10/13 19:01, Josef Bacik wrote: Unfortunately you can't run --init-extent-tree if you can't actually read the extent root. Fix this by allowing partial starts with no extent root and then have fsck only check to see if the extent root is uptodate _after_ the check to see if we are init'ing the extent tree. Thanks, Signed-off-by: Josef Bacik jba...@fusionio.com --- cmds-check.c | 9 ++--- disk-io.c| 16 ++-- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 69b0327..8ed7baa 100644 --- a/cmds-check.c +++ b/cmds-check.c Hey! Quick work!... Is that worth patching locally and trying against my example? Yes, I'm a little worried about your particular case so I'd like to see if it works. If you don't see a lot of output after say 5 minutes let's assume I didn't fix your problem and let me know so I can make the other change I considered. Thanks, Nope... No-go. parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 Ignoring transid failure ...And nothing more. Looped. # gdb /sbin/btrfsck 31887 GNU gdb (Gentoo 7.5.1 p2) 7.5.1 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-pc-linux-gnu. For bug reporting instructions, please see: http://bugs.gentoo.org/... Reading symbols from /sbin/btrfsck...Reading symbols from /usr/lib64/debug/sbin/btrfsck.debug...(no debugging symbols found)...done. (no debugging symbols found)...done. Attaching to program: /sbin/btrfsck, process 31887 warning: Could not load shared library symbols for linux-vdso.so.1. Do you need set solib-search-path or set sysroot? Reading symbols from /lib64/libuuid.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libuuid.so.1 Reading symbols from /lib64/libblkid.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libblkid.so.1 Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libz.so.1 Reading symbols from /usr/lib64/liblzo2.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/liblzo2.so.2 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] Using host libthread_db library /lib64/libthread_db.so.1. Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. 
Loaded symbols for /lib64/ld-linux-x86-64.so.2 0x0042b7a9 in read_extent_buffer () (gdb) (gdb) bt #0 0x0042b7a9 in read_extent_buffer () #1 0x0041ccfd in btrfs_check_node () #2 0x0041e0a2 in check_block () #3 0x0041e69e in btrfs_search_slot () #4 0x00425a6e in find_first_block_group () #5 0x00425b28 in btrfs_read_block_groups () #6 0x00421c40 in btrfs_setup_all_roots () #7 0x00421e3f in __open_ctree_fd () #8 0x00421f19 in open_ctree_fs_info () #9 0x004169b4 in cmd_check () #10 0x0040443b in main () (gdb) # btrfs version Btrfs v0.20-rc1-358-g194aa4a-dirty Emerging (1 of 1) sys-fs/btrfs-progs- Unpacking source... GIT update -- repository: git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git at the commit:194aa4a1bd6447bb545286d0bcb0b0be8204d79f branch: master storage directory: /usr/portage/distfiles/egit-src/btrfs-progs.git checkout type:bare repository Cloning into '/var/tmp/portage/sys-fs/btrfs-progs-/work/btrfs-progs-'... done. Branch branch-master set up to track remote branch master from origin. Switched to a new branch 'branch-master' Unpacked to /var/tmp/portage/sys-fs/btrfs-progs-/work/btrfs-progs- Source unpacked in /var/tmp/portage/sys-fs/btrfs-progs-/work Preparing source in /var/tmp/portage/sys-fs/btrfs-progs-/work/btrfs-progs- ... Source prepared. * Applying user patches from /etc/portage/patches//sys-fs/btrfs-progs- ... * jbpatch2013-10-25-extents-fix.patch ... [ ok ] * Done with patching Configuring source in /var/tmp/portage/sys-fs/btrfs-progs-/work/btrfs-progs- ... Source configured. [...] Note the compile warnings: * QA Notice: Package triggers severe warnings which indicate that it *may exhibit random runtime failures. * disk-io.c:91:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1930:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1931:6: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
On 28/10/13 15:11, Josef Bacik wrote: On Sun, Oct 27, 2013 at 12:16:12AM +0100, Martin wrote: On 25/10/13 19:31, Josef Bacik wrote: On Fri, Oct 25, 2013 at 07:27:24PM +0100, Martin wrote: On 25/10/13 19:01, Josef Bacik wrote: Unfortunately you can't run --init-extent-tree if you can't actually read the extent root. Fix this by allowing partial starts with no extent root and then have fsck only check to see if the extent root is uptodate _after_ the check to see if we are init'ing the extent tree. Thanks, Signed-off-by: Josef Bacik jba...@fusionio.com --- cmds-check.c | 9 ++--- disk-io.c| 16 ++-- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 69b0327..8ed7baa 100644 --- a/cmds-check.c +++ b/cmds-check.c Hey! Quick work!... Is that worth patching locally and trying against my example? Yes, I'm a little worried about your particular case so I'd like to see if it works. If you don't see a lot of output after say 5 minutes let's assume I didn't fix your problem and let me know so I can make the other change I considered. Thanks, Nope... No-go. Ok I've sent [PATCH] Btrfs-progs: rework open_ctree to take flags, add a new one which should address your situation. Thanks, Josef, Tried your patch: Signed-off-by: Josef Bacik jba...@fusionio.com 13 files changed, 75 insertions(+), 113 deletions(-) diff --git a/btrfs-convert.c b/btrfs-convert.c index 26c7b5f..ae10eed 100644 And the patching fails due to mismatching code... I have the Gentoo source for: Btrfs v0.20-rc1-358-g194aa4a (On Gentoo 3.11.5, will be on 3.11.6 later today.) What are the magic incantations to download your version of source code to try please? (Patched or unpatched?) Many thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
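For the record, the incantation is nothing more exotic than a git clone of the repository already named in the Gentoo build output earlier in this thread:
  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
  $ cd btrfs-progs
  $ make
That gives the unpatched master branch; a posted patch can then be applied on top before building.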
Re: progs integration branch moved to master (new default leafsize)
On 08/11/13 22:01, Chris Mason wrote: Hi everyone, This patch is now the tip of the master branch for btrfs-progs, which has been updated to include most of the backlogged progs patches. Please take a look and give it a shake. This was based on Dave's integration tree (many thanks Dave!) minus the patches for online dedup. I've pulled in the coverity fixes and a few others from the list as well. The patch below switches our default mkfs leafsize up to 16K. This should be a better choice in almost every workload, but now is your chance to complain if it causes trouble. Thanks for that and nicely timely! Compiling on Gentoo (3.11.5-gentoo, sys-fs/btrfs-progs-) gives: * QA Notice: Package triggers severe warnings which indicate that it *may exhibit random runtime failures. * disk-io.c:91:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1930:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1931:6: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * Please do not file a Gentoo bug and instead report the above QA * issues directly to the upstream developers of this software. * Homepage: https://btrfs.wiki.kernel.org 16KB is faster and leads to less metadata fragmentation in almost all workloads. It does slightly increase lock contention on the root nodes in some workloads, but that is best dealt with by adding more subvolumes (for now). Interesting and I was wondering about that. Good update. Also, hopefully that is a little more friendly for SSDs where often you see improved performance for 8kByte or 16kByte (aligned) writes... Testing in progress, Regards, Martin This uses 16KB or the page size, whichever is bigger. If you're doing a mixed block group mkfs, it uses the sectorsize instead. Since the kernel refuses to mount a mixed block group FS where the metadata leaf size doesn't match the data sectorsize, this also adds a similar check during mkfs. 
Signed-off-by: Chris Mason chris.ma...@fusionio.com
---
 mkfs.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mkfs.c b/mkfs.c
index bf8a831..cd0af9e 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -46,6 +46,8 @@
 static u64 index_cnt = 2;
 
+#define DEFAULT_MKFS_LEAF_SIZE 16384
+
 struct directory_name_entry {
 	char *dir_name;
 	char *path;
@@ -1222,7 +1224,7 @@ int main(int ac, char **av)
 	u64 alloc_start = 0;
 	u64 metadata_profile = 0;
 	u64 data_profile = 0;
-	u32 leafsize = sysconf(_SC_PAGESIZE);
+	u32 leafsize = max_t(u32, sysconf(_SC_PAGESIZE), DEFAULT_MKFS_LEAF_SIZE);
 	u32 sectorsize = 4096;
 	u32 nodesize = leafsize;
 	u32 stripesize = 4096;
@@ -1232,6 +1234,7 @@ int main(int ac, char **av)
 	int ret;
 	int i;
 	int mixed = 0;
+	int leaf_forced = 0;
 	int data_profile_opt = 0;
 	int metadata_profile_opt = 0;
 	int discard = 1;
@@ -1269,6 +1272,7 @@
 		case 'n':
 			nodesize = parse_size(optarg);
 			leafsize = parse_size(optarg);
+			leaf_forced = 1;
 			break;
 		case 'L':
 			label = parse_label(optarg);
@@ -1386,8 +1390,21 @@
 				BTRFS_BLOCK_GROUP_RAID0 : 0; /* raid0 or single */
 		}
 	} else {
+		u32 best_leafsize = max_t(u32, sysconf(_SC_PAGESIZE), sectorsize);
 		metadata_profile = 0;
 		data_profile = 0;
+
+		if (!leaf_forced) {
+			leafsize = best_leafsize;
+			nodesize = best_leafsize;
+			if (check_leaf_or_node_size(leafsize, sectorsize))
+				exit(1);
+		}
+		if (leafsize != sectorsize) {
+			fprintf(stderr, "Error: mixed metadata/data block groups "
+				"require metadata blocksizes equal to the sectorsize\n");
+			exit(1);
+		}
 	}
 
 	ret = test_num_disk_vs_raid(metadata_profile, data_profile,
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
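So, per the 'n' case in the patch above, the new 16K default only applies when no node/leaf size is forced on the command line; the old behaviour stays available explicitly. A sketch (the device name is a placeholder):
  # mkfs.btrfs -n 4096 /dev/sdX    (the old x86 page-size default)
  # mkfs.btrfs -n 16384 /dev/sdX   (the new default, spelled out)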
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
On 07/11/13 01:25, Martin wrote: On 28/10/13 15:11, Josef Bacik wrote: Ok I've sent [PATCH] Btrfs-progs: rework open_ctree to take flags, add a new one which should address your situation. Thanks, Josef, Tried your patch: Signed-off-by: Josef Bacik jba...@fusionio.com 13 files changed, 75 insertions(+), 113 deletions(-) diff --git a/btrfs-convert.c b/btrfs-convert.c index 26c7b5f..ae10eed 100644 And the patching fails due to mismatching code... I have the Gentoo source for: Btrfs v0.20-rc1-358-g194aa4a (On Gentoo 3.11.5, will be on 3.11.6 later today.) What are the magic incantations to download your version of source code to try please? (Patched or unpatched?) OK so Chris Mason and the Gentoo sys-fs/btrfs-progs- came to the rescue to give: # btrfs version Btrfs v0.20-rc1-591-gc652e4e This time: # btrfsck --repair --init-extent-tree /dev/sdc quickly gave: parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 parent transid verify failed on 911904604160 wanted 17448 found 17449 Ignoring transid failure btrfs unable to find ref byte nr 910293991424 parent 0 root 1 owner 2 offset 0 btrfs unable to find ref byte nr 910293995520 parent 0 root 1 owner 1 offset 1 btrfs unable to find ref byte nr 910293999616 parent 0 root 1 owner 0 offset 1 leaf free space ret -297791851, leaf data size 3995, used 297795846 nritems 2 checking extents btrfsck: extent_io.c:609: free_extent_buffer: Assertion `!(eb->refs < 0)' failed. enabling repair mode Checking filesystem on /dev/sdc UUID: 38a60270-f9c6-4ed4-8421-4bf1253ae0b3 Creating a new extent tree Failed to find [910293991424, 168, 4096] Failed to find [910293995520, 168, 4096] Failed to find [910293999616, 168, 4096] From that, I've tried running again: # btrfsck --repair /dev/sdc giving thus far: parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 Ignoring transid failure ... And it is still running a couple of days later. GDB shows: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () So... Has it looped or is it busy? There is no activity on /dev/sdc. Which comes to a request: Can the options -v (for verbose) and -s (to continuously show status) be added to btrfsck to give some indication of progress and what is happening? The -s should report progress by whatever appropriate real-time counts as done by such as badblocks -s. I'll leave it running for a little while longer before trying a mount. Hope of interest. Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
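Until some such progress option exists, one rough way to tell a busy process from a wedged one is the kernel's per-process I/O accounting. A sketch, again assuming only one btrfsck is running:
  # cat /proc/$(pidof btrfsck)/io
  # sleep 60; cat /proc/$(pidof btrfsck)/io
If rchar/read_bytes climb between the two samples, it is still reading; if they sit still while the CPU stays at 100%, it is spinning over already-cached data, or looping.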
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
On 11/11/13 22:52, Martin wrote: On 07/11/13 01:25, Martin wrote: OK so Chris Mason and the Gentoo sys-fs/btrfs-progs- came to the rescue to give: # btrfs version Btrfs v0.20-rc1-591-gc652e4e From that, I've tried running again: # btrfsck --repair /dev/sdc giving thus far: parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 Ignoring transid failure ... And it is still running a couple of days later. GDB shows: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () Another two days and: (gdb) bt #0 0x0042373a in read_tree_block () #1 0x00421538 in btrfs_search_slot () #2 0x00427bb4 in btrfs_read_block_groups () #3 0x00423e40 in btrfs_setup_all_roots () #4 0x0042406d in __open_ctree_fd () #5 0x00424126 in open_ctree_fs_info () #6 0x0041812e in cmd_check () #7 0x00404904 in main () So... Has it looped or is it busy? There is no activity on /dev/sdc. Same btrfs_read_block_groups but different stack above that: So perhaps something useful is being done?... No disk activity noticed. Which comes to a request: Can the options -v (for verbose) and -s (to continuously show status) be added to btrfsck to give some indication of progress and what is happening? The -s should report progress by whatever appropriate real-time counts as done by such as badblocks -s. OK... So I'll leave running for a little while longer before trying a mount. Some sort of progress indicator would be rather useful... Is this going to run for a few hours more or might this need to run for weeks to complete? Any clues to look for? (All on a 2TByte single disk btrfs, 4k defaults) Hope of interest. Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
Another two days and a backtrace shows the hope of progress: #0 0x0041de2f in btrfs_node_key () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () No other output, 100% CPU, using only a single core, and no apparent disk activity. There looks to be a repeating pattern of calls. Is this working though the same test repeated per btrfs block? Are there any variables that can be checked with gdb to see how far it has gone so as to guess how long it might need to run? Phew? Hope of interest, Regards, Martin On 13/11/13 12:08, Martin wrote: On 11/11/13 22:52, Martin wrote: On 07/11/13 01:25, Martin wrote: OK so Chris Mason and the Gentoo sys-fs/btrfs-progs- came to the rescue to give: # btrfs version Btrfs v0.20-rc1-591-gc652e4e From that, I've tried running again: # btrfsck --repair /dev/sdc giving thus far: parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 Ignoring transid failure ... And it is still running a couple of days later. GDB shows: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () Another two days and: (gdb) bt #0 0x0042373a in read_tree_block () #1 0x00421538 in btrfs_search_slot () #2 0x00427bb4 in btrfs_read_block_groups () #3 0x00423e40 in btrfs_setup_all_roots () #4 0x0042406d in __open_ctree_fd () #5 0x00424126 in open_ctree_fs_info () #6 0x0041812e in cmd_check () #7 0x00404904 in main () So... Has it looped or is it busy? There is no activity on /dev/sdc. Same btrfs_read_block_groups but different stack above that: So perhaps something useful is being done?... No disk activity noticed. Which comes to a request: Can the options -v (for verbose) and -s (to continuously show status) be added to btrfsck to give some indication of progress and what is happening? The -s should report progress by whatever appropriate real-time counts as done by such as badblocks -s. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
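On the variables question: with a btrfsck build that really does carry debug symbols (the -ggdb rebuild earlier in the thread still reported none for the binary), the search key passed to btrfs_search_slot would be the thing to watch, since its objectid marks a position within the tree. A sketch of the idea, untested, with the parameter name taken from the btrfs-progs ctree.c of this era:
  (gdb) frame 3          (the btrfs_search_slot frame in the backtrace above)
  (gdb) print *key       (struct btrfs_key: objectid / type / offset)
If key->objectid advances between samples, the search is walking forward; if it repeats, it has looped.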
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
On 07/11/13 01:25, Martin wrote: [...] And the patching fails due to mismatching code... I have the Gentoo source for: Btrfs v0.20-rc1-358-g194aa4a (On Gentoo 3.11.5, will be on 3.11.6 later today.) What are the magic incantations to download your version of source code to try please? (Patched or unpatched?) As an FYI for anyone stumbling onto this thread: See: https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories to get to the code! Cheers, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked
Continuing: gdb bt now gives: #0 0x0042075a in btrfs_search_slot () #1 0x00427bb4 in btrfs_read_block_groups () #2 0x00423e40 in btrfs_setup_all_roots () #3 0x0042406d in __open_ctree_fd () #4 0x00424126 in open_ctree_fs_info () #5 0x0041812e in cmd_check () #6 0x00404904 in main () #0 0x004208bc in btrfs_search_slot () #1 0x00427bb4 in btrfs_read_block_groups () #2 0x00423e40 in btrfs_setup_all_roots () #3 0x0042406d in __open_ctree_fd () #4 0x00424126 in open_ctree_fs_info () #5 0x0041812e in cmd_check () #6 0x00404904 in main () #0 0x004208d0 in btrfs_search_slot () #1 0x00427bb4 in btrfs_read_block_groups () #2 0x00423e40 in btrfs_setup_all_roots () #3 0x0042406d in __open_ctree_fd () #4 0x00424126 in open_ctree_fs_info () #5 0x0041812e in cmd_check () #6 0x00404904 in main () Still no further output. btrfsck running at 100% on a single core and with no apparent disk activity. All for a 2TB hdd. Should it take this long?... Regards, Martin On 15/11/13 17:18, Martin wrote: Another two days and a backtrace shows the hope of progress: #0 0x0041de2f in btrfs_node_key () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () No other output, 100% CPU, using only a single core, and no apparent disk activity. There looks to be a repeating pattern of calls. Is this working though the same test repeated per btrfs block? Are there any variables that can be checked with gdb to see how far it has gone so as to guess how long it might need to run? Phew? Hope of interest, Regards, Martin On 13/11/13 12:08, Martin wrote: On 11/11/13 22:52, Martin wrote: On 07/11/13 01:25, Martin wrote: OK so Chris Mason and the Gentoo sys-fs/btrfs-progs- came to the rescue to give: # btrfs version Btrfs v0.20-rc1-591-gc652e4e From that, I've tried running again: # btrfsck --repair /dev/sdc giving thus far: parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 Ignoring transid failure ... And it is still running a couple of days later. GDB shows: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () Another two days and: (gdb) bt #0 0x0042373a in read_tree_block () #1 0x00421538 in btrfs_search_slot () #2 0x00427bb4 in btrfs_read_block_groups () #3 0x00423e40 in btrfs_setup_all_roots () #4 0x0042406d in __open_ctree_fd () #5 0x00424126 in open_ctree_fs_info () #6 0x0041812e in cmd_check () #7 0x00404904 in main () So... Has it looped or is it busy? There is no activity on /dev/sdc. Same btrfs_read_block_groups but different stack above that: So perhaps something useful is being done?... No disk activity noticed. Which comes to a request: Can the options -v (for verbose) and -s (to continuously show status) be added to btrfsck to give some indication of progress and what is happening? 
The -s should report progress by whatever appropriate real-time counts as done by such as badblocks -s. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Actual effect of mkfs.btrfs -m raid10 /dev/sdX ... -d raid10 /dev/sdX ...
On 19/11/13 23:16, Duncan wrote: So we have: 1) raid1 is exactly two copies of data, paired devices. 2) raid0 is a stripe exactly two devices wide (reinforced by the fact that to read a stripe takes only two devices), so again paired devices. Which is fine for some occasions and a very good starting point. However, I'm sure there is a strong wish to be able to specify n copies of data/metadata spread across m devices. Or even to specify 'hot spares'. This would be a great way to overcome the problem of a set of drives becoming read-only when one btrfs drive fails or is removed. (Or should we always mount with the degraded option?) Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
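On that last question: as I understand it, a multi-device btrfs with a member missing currently does need the degraded option before it will mount at all, though mounting that way routinely, rather than just for recovery, is not the intent. A sketch, with the label as a placeholder:
  # mount -o degraded LABEL=backups /mnt/backups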
Re: Actual effect of mkfs.btrfs -m raid10 /dev/sdX ... -d raid10 /dev/sdX ...
On 19/11/13 19:24, deadhorseconsulting wrote: Interesting, this confirms what I was observing. Given the wording in man pages for -m and -d which states Specify how the metadata or data must be spanned across the devices specified. I took devices specified to literally mean the devices specified after the corresponding switch. That sounds like a hang-over from too many years' use of the mdadm command and, more recently, the sgdisk command... ;-) Myself, I like the btrfs way to specify the list of parameters and then have them all applied as a whole. The one bugbear at the moment is that for using multiple disks: any actions seem to be applied to the list of devices in sequence, one-by-one. There's no apparent intelligence to consider the move from the present pool to the new pool of devices as a whole. More development! Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
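For comparison, the present one-device-at-a-time dance looks something like the sketch below (device names and mount point hypothetical); each step is a separate whole-filesystem operation rather than one planned reshape from old pool to new:
  # btrfs device add /dev/sdd /mnt/pool
  # btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/pool
  # btrfs device delete /dev/sda /mnt/pool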
btrfsck --repair /dev/sdc (Was: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked)
It's now gone back to a pattern from a full week ago: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () I don't know if that has gone through that pattern during the week but at a-week-a-time, this is not going to finish in reasonable time. How come so very slow? Any hints/tips/fixes or abandon the test? Regards, Martin On 19/11/13 06:34, Martin wrote: Continuing: gdb bt now gives: #0 0x0042075a in btrfs_search_slot () #1 0x00427bb4 in btrfs_read_block_groups () #2 0x00423e40 in btrfs_setup_all_roots () #3 0x0042406d in __open_ctree_fd () #4 0x00424126 in open_ctree_fs_info () #5 0x0041812e in cmd_check () #6 0x00404904 in main () #0 0x004208bc in btrfs_search_slot () #1 0x00427bb4 in btrfs_read_block_groups () #2 0x00423e40 in btrfs_setup_all_roots () #3 0x0042406d in __open_ctree_fd () #4 0x00424126 in open_ctree_fs_info () #5 0x0041812e in cmd_check () #6 0x00404904 in main () #0 0x004208d0 in btrfs_search_slot () #1 0x00427bb4 in btrfs_read_block_groups () #2 0x00423e40 in btrfs_setup_all_roots () #3 0x0042406d in __open_ctree_fd () #4 0x00424126 in open_ctree_fs_info () #5 0x0041812e in cmd_check () #6 0x00404904 in main () Still no further output. btrfsck running at 100% on a single core and with no apparent disk activity. All for a 2TB hdd. Should it take this long?... Regards, Martin On 15/11/13 17:18, Martin wrote: Another two days and a backtrace shows the hope of progress: #0 0x0041de2f in btrfs_node_key () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () No other output, 100% CPU, using only a single core, and no apparent disk activity. There looks to be a repeating pattern of calls. Is this working though the same test repeated per btrfs block? Are there any variables that can be checked with gdb to see how far it has gone so as to guess how long it might need to run? Phew? Hope of interest, Regards, Martin On 13/11/13 12:08, Martin wrote: On 11/11/13 22:52, Martin wrote: On 07/11/13 01:25, Martin wrote: OK so Chris Mason and the Gentoo sys-fs/btrfs-progs- came to the rescue to give: # btrfs version Btrfs v0.20-rc1-591-gc652e4e From that, I've tried running again: # btrfsck --repair /dev/sdc giving thus far: parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 Ignoring transid failure ... And it is still running a couple of days later. 
GDB shows: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () Another two days and: (gdb) bt #0 0x0042373a in read_tree_block () #1 0x00421538 in btrfs_search_slot () #2 0x00427bb4 in btrfs_read_block_groups () #3 0x00423e40 in btrfs_setup_all_roots () #4 0x0042406d in __open_ctree_fd () #5 0x00424126 in open_ctree_fs_info () #6 0x0041812e in cmd_check () #7 0x00404904 in main () So... Has it looped or is it busy? There is no activity on /dev/sdc. Same btrfs_read_block_groups but different stack above that: So perhaps something useful is being done?... No disk activity noticed. Which comes to a request: Can the options -v (for verbose) and -s (to continuously show status) be added to btrfsck to give some indication of progress and what is happening? The -s should report progress by whatever appropriate real-time counts as done by such as badblocks -s
Re: btrfsck --repair /dev/sdc (Was: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked)
On 20/11/13 17:08, Duncan wrote: Martin posted on Wed, 20 Nov 2013 06:51:20 + as excerpted: It's now gone back to a pattern from a full week ago: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () I don't know if that has gone through that pattern during the week but at a-week-a-time, this is not going to finish in reasonable time. How come so very slow? Any hints/tips/fixes or abandon the test? You're a patient man. =:^) Sort of... I can leave it running in the background until I come to need to do something else with that machine. So... A bit of an experiment. ( https://btrfs.wiki.kernel.org/index.php/FAQ , search on hours. ) OK, so we round that to a day a TB, double for your two TB, and double again in case your drive is much slower than the normal drive the comment might have been considering and because that's for a balance but you're doing a btrfsck --repair, which for all we know takes longer. That's still only four days, and you've been going well over a week. At this point I think it's reasonably safe to conclude it's in some sort of loop and likely will never finish. ... but at a week a shot, there comes a time when it's simply time to declare a loss and move on. Exactly so... No idea what btrfsck is so very slowly checking through or if it has indeed looped. Which is where progress output would be useful. However, btrfsck is rather too slow to be practical at the moment. Further development?... Any useful debug to be had from this case before I move on? Regards, Martin Still at: parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 Ignoring transid failure ...which is all the output thus far. And: (gdb) bt #0 0x0042d574 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfsck --repair /dev/sdc (Was: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked)
On 20/11/13 17:08, Duncan wrote: Which leads to the question of what to do next. Obviously, there have been a number of update patches since then, some of which might address your problem. You could update your kernel and userspace and try again... /if/ you have the patience... This is on kernel 3.11.5 and Btrfs v0.20-rc1-591-gc652e4e. Can easily upgrade to the latest kernel at the expense of killing the existing btrfsck run. Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: progs integration branch moved to master (new default leafsize)
On 21/11/13 23:37, Chris Mason wrote: Quoting Martin (2013-11-08 18:53:06) On 08/11/13 22:01, Chris Mason wrote: Hi everyone, This patch is now the tip of the master branch for btrfs-progs, which has been updated to include most of the backlogged progs patches. Please take a look and give it a shake. This was based on Dave's integration tree (many thanks Dave!) minus the patches for online dedup. I've pulled in the coverity fixes and a few others from the list as well. The patch below switches our default mkfs leafsize up to 16K. This should be a better choice in almost every workload, but now is your chance to complain if it causes trouble. Thanks for that and nicely timely! Compiling on Gentoo (3.11.5-gentoo, sys-fs/btrfs-progs-) gives: * QA Notice: Package triggers severe warnings which indicate that it *may exhibit random runtime failures. * disk-io.c:91:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1930:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1931:6: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] I'm not seeing these warnings with the current master branch, could you please rerun? From just now: * QA Notice: Package triggers severe warnings which indicate that it *may exhibit random runtime failures. * disk-io.c:91:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1930:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1931:6: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * Please do not file a Gentoo bug and instead report the above QA * issues directly to the upstream developers of this software. * Homepage: https://btrfs.wiki.kernel.org Installing (1 of 1) sys-fs/btrfs-progs- ... Which is exactly the same. This is for Gentoo for: gcc: x86_64-pc-linux-gnu-4.7.3 # gcc --version gcc (Gentoo 4.7.3-r1 p1.3, pie-0.5.5) 4.7.3 Kernel: 3.11.9-gentoo # btrfs version Btrfs v0.20-rc1-597-g5aff090 And the - pulls the code in from: From git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs 9f0c53f..5aff090 master - master GIT update -- repository: git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git updating from commit: 9f0c53f574b242b0d5988db2972c8aac77ef35a9 to commit:5aff090a3951e7d787b32bb5c49adfec65091385 cmds-filesystem.c | 79 +++ mkfs.c| 18 +- 2 files changed, 88 insertions(+), 9 deletions(-) branch: master storage directory: /usr/portage/distfiles/egit-src/btrfs-progs.git checkout type:bare repository Cloning into '/var/tmp/portage/sys-fs/btrfs-progs-/work/btrfs-progs-'... done. Checking connectivity... done Branch branch-master set up to track remote branch master from origin. Switched to a new branch 'branch-master' Hope that helps, Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: progs integration branch moved to master (new default leafsize)
On 22/11/13 13:40, Chris Mason wrote: Quoting Martin (2013-11-22 04:03:41) * QA Notice: Package triggers severe warnings which indicate that it *may exhibit random runtime failures. * disk-io.c:91:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1930:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1931:6: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] Does gentoo modify the optimizations from the Makefile? We actually have many strict-aliasing warnings, but I didn't think they came up until -O2. For that system, I have -Os set in the Gentoo make.conf. At any rate, I'm adding -fno-strict-aliasing just to be sure. Good to catch to avoid unexpectedness, Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
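For anyone wondering what the warning itself is about: reading an object of one type through a pointer to another breaks C's strict-aliasing rules, and an optimising compiler is then free to cache or reorder the accesses in surprising ways. A minimal made-up illustration (not the actual disk-io.c code), together with the memcpy idiom that stays legal:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Undefined behaviour under strict aliasing: a float read through a uint32_t lvalue */
  static uint32_t bits_punned(float f)
  {
          return *(uint32_t *)&f;
  }

  /* The aliasing-safe idiom: copy the object representation instead */
  static uint32_t bits_safe(float f)
  {
          uint32_t u;
          memcpy(&u, &f, sizeof(u));
          return u;
  }

  int main(void)
  {
          printf("%08x %08x\n", bits_punned(1.0f), bits_safe(1.0f));
          return 0;
  }

Compiling the first form with optimisation enabled produces exactly the dereferencing type-punned pointer warning quoted above; -fno-strict-aliasing simply tells gcc not to assume such punning cannot happen.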
Re: progs integration branch moved to master (new default leafsize)
On 22/11/13 19:57, Chris Mason wrote: Quoting Martin (2013-11-22 14:50:17) On 22/11/13 13:40, Chris Mason wrote: Quoting Martin (2013-11-22 04:03:41) * QA Notice: Package triggers severe warnings which indicate that it *may exhibit random runtime failures. * disk-io.c:91:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1930:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] * volumes.c:1931:6: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] Does gentoo modify the optimizations from the Makefile? We actually have many strict-aliasing warnings, but I didn't think they came up until -O2. For that system, I have -Os set in the Gentoo make.conf. At any rate, I'm adding -fno-strict-aliasing just to be sure. Good to catch to avoid unexpectedness, Ok, please try with the current master to make sure the options are being picked up properly. If you're overriding the -fno-strict-aliasing, please don't ;) No changes on my side for that system, and... btrfs-progs now compiles with no warnings given. That looks like it's fixed. # emerge -vD btrfs-progs These are the packages that would be merged, in order: Calculating dependencies... done! [ebuild R *] sys-fs/btrfs-progs- 0 kB Total: 1 package (1 reinstall), Size of downloads: 0 kB Verifying ebuild manifests Emerging (1 of 1) sys-fs/btrfs-progs- Unpacking source... GIT update -- repository: git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git at the commit:8116550e16628794b76051b6b8ea503055c08d6f branch: master storage directory: /usr/portage/distfiles/egit-src/btrfs-progs.git checkout type:bare repository Cloning into '/var/tmp/portage/sys-fs/btrfs-progs-/work/btrfs-progs-'... done. Checking connectivity... done Branch branch-master set up to track remote branch master from origin. Switched to a new branch 'branch-master' ... And then a clean compile. No warnings. Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Clean crash... (USB memory sticks mount)
On 24/11/13 20:50, Kai Krakow wrote: something about device mapper and write barriers not working correctly which are needed for btrfs being able to rely on transactions working correctly. Re USB memory sticks: I've found write barriers not to work for USB memory sticks (for at least the ones I have tried) for ext4 and btrfs. You must mount with the nobarrier option... Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
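For completeness, that means a manual mount or fstab entry along these lines (device and mount point are placeholders), with the caveat that nobarrier trades crash safety for having the stick work at all:
  # mount -o nobarrier,noatime /dev/sdX1 /mnt/usbstick
Anything still cached when the power goes, or when the stick is yanked, is then at the mercy of the device.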
Re: btrfs-progs tagged as v3.12
I'm humbly totally unqualified to comment but that sounds like an excellent idea. Thanks. I can't say for others but I was put off by the 0.19 forever eternal version which pushed me to investigate GIT... I'm sure that has been putting off many people including distro assemblers. Just for some positive comment: Good progress, thanks. Regards, Martin (OK, that's the last of the positives for the Christmas present. Back to bugging! ;-) ) On 25/11/13 21:45, Chris Mason wrote: Hi everyone, I've tagged the current btrfs-progs repo as v3.12. The new idea is that instead of making the poor distros pull from git, I'll be creating tagged releases at roughly the same pace as Linus cuts kernels. Given the volume of btrfs-progs patches, we should have enough new code and fixes to justify releases at least as often as the kernel. Of course, if there are issues that need immediate attention, I'll tag a .y release (v3.12.1 for example). If the progs changes slow down, we might skip a version. But tracking kernel version numbers makes it easier for me to line up bug reports, mostly because I already devote a fair number of brain cells to remembering how old each kernel is. Just let me know if there are any questions. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfsck --repair /dev/sdc (Was: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked)
On 20/11/13 20:00, Martin wrote: On 20/11/13 17:08, Duncan wrote: Martin posted on Wed, 20 Nov 2013 06:51:20 + as excerpted: It's now gone back to a pattern from a full week ago: (gdb) bt #0 0x0042d576 in read_extent_buffer () #1 0x0041ee79 in btrfs_check_node () #2 0x00420211 in check_block () #3 0x00420813 in btrfs_search_slot () #4 0x00427bb4 in btrfs_read_block_groups () #5 0x00423e40 in btrfs_setup_all_roots () #6 0x0042406d in __open_ctree_fd () #7 0x00424126 in open_ctree_fs_info () #8 0x0041812e in cmd_check () #9 0x00404904 in main () I don't know if that has gone through that pattern during the week but at a-week-a-time, this is not going to finish in reasonable time. How come so very slow? Any hints/tips/fixes or abandon the test? You're a patient man. =:^) Sort of... I can leave it running in the background until I come to need to do something else with that machine. So... A bit of an experiment. Until... No more... And just as the gdb bt shows something a little different! (gdb) bt #0 0x0041ddc4 in btrfs_comp_keys () #1 0x004208e9 in btrfs_search_slot () #2 0x00427bb4 in btrfs_read_block_groups () #3 0x00423e40 in btrfs_setup_all_roots () #4 0x0042406d in __open_ctree_fd () #5 0x00424126 in open_ctree_fs_info () #6 0x0041812e in cmd_check () #7 0x00404904 in main () Nearly done or weeks yet more to run? The poor thing gets killed in the morning for new work. Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfsck --repair /dev/sdc (Was: [PATCH] Btrfs-progs: allow --init-extent-tree to work when extent tree is borked)
I don't know if that has gone through that pattern during the week but at a-week-a-time, this is not going to finish in reasonable time. How come so very slow? Any hints/tips/fixes or abandon the test? You're a patient man. =:^) Sort of... I can leave it running in the background until I come to need to do something else with that machine. So... A bit of an experiment. Until... No more... And just as the gdb bt shows something a little different! (gdb) bt #0 0x0041ddc4 in btrfs_comp_keys () #1 0x004208e9 in btrfs_search_slot () #2 0x00427bb4 in btrfs_read_block_groups () #3 0x00423e40 in btrfs_setup_all_roots () #4 0x0042406d in __open_ctree_fd () #5 0x00424126 in open_ctree_fs_info () #6 0x0041812e in cmd_check () #7 0x00404904 in main () Nearly done or weeks yet more to run? The poor thing gets killed in the morning for new work. OK, so that all came to naught and it got killed for a kernel update and new work. Just for a giggle, I tried mounting that disk with the 'recovery' option and it failed with the usual complaint: btrfs: disabling disk space caching btrfs: enabling auto recovery parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 btrfs: open_ctree failed Trying a wild guess of btrfs-zero-log /dev/sdc gives: parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 parent transid verify failed on 911904604160 wanted 17448 found 17450 Ignoring transid failure ... and it is sat there at 100% CPU usage, no further output, and no apparent disk activity... Just like btrfsck was... So... Looks like time finally for a reformat. Any chance of outputting some indication of progress, and for a speedup, or options for partial recovery or?... Or for a fast 'slash-and-burn' recovery where damaged trees get cleanly amputated rather than too-painfully-slowly repaired?... Just a few wild ideas ;-) Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
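One more salvage idea before any slash-and-burn, for anyone landing here in the same state: btrfs-progs carries a read-only scavenger that walks whatever trees it can still reach and copies files out without ever mounting the filesystem. A sketch (destination directory hypothetical, and no promises on a filesystem this far gone):
  # btrfs restore /dev/sdc /mnt/salvage
Then mkfs afresh and copy back whatever it managed to pull out.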
Re: Feature Req: mkfs.btrfs -d dup option on single device
On 11/12/13 03:19, Imran Geriskovan wrote: SSDs: What's more (in relation to our long term data integrity aim) the order of magnitude for their unpowered data retention period is 1 YEAR. (Read it as 6 months to 2-3 years. While powered they refresh/shuffle the blocks) This makes SSDs unsuitable for mid-to-long term consumer storage. Hence they are out of this discussion. (By the way, the only way for reliable duplication on SSDs is using physically separate devices.) Interesting... Have you any links/quotes/studies/specs for that please? Does btrfs need to date-stamp each block/chunk to ensure that data is rewritten before suffering flash memory bitrot? Or is the firmware in SSDs aware enough to rewrite any data left unchanged for too long? Regards, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
BTRFS extended attributes mounted on a non-extended-attributes compiled kernel
What happens if... I have a btrfs that has utilised POSIX ACLs / extended attributes, and then I subsequently mount it onto a system that does not have the kernel modules compiled for those features? Crash and burn? Or are the extra filesystem features benignly ignored until remounted on the original system with all the kernel modules? Thanks, Martin -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
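A quick way to check what a given kernel was actually built with, where the running config is exposed (needs CONFIG_IKCONFIG_PROC) or a build tree is at hand; CONFIG_BTRFS_FS_POSIX_ACL is the btrfs ACL knob:
  $ zgrep -E 'CONFIG_BTRFS_FS_POSIX_ACL|CONFIG_FS_POSIX_ACL' /proc/config.gz
  $ grep POSIX_ACL /usr/src/linux/.config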