Re: Can btrfs silently repair read-error in raid1
On 08-05-2012 18:47, Hubert Kario wrote:
On Tuesday 08 of May 2012 04:45:51 cwillu wrote:
On Tue, May 8, 2012 at 1:36 AM, Fajar A. Nugrahal...@fajar.net wrote:
On Tue, May 8, 2012 at 2:13 PM, Clemens Eissererlinuxhi...@gmail.com wrote:

Hi, I have a quite unreliable SSD here which develops some bad blocks from time to time, which result in read errors. Once the block is written to again, it's remapped internally and everything is fine again for that block. Would it be possible to create 2 btrfs partitions on that drive and use it in RAID1 - with btrfs silently repairing read errors when they occur? Would it require special settings to not fall back to read-only mode when a read error occurs?

The problem would be how the SSD (and Linux) behaves when it encounters bad blocks (not bad disks, which is easier). If it goes "oh, I can't read this block, I'll just return an error immediately", then it's good. However, in most situations it would be more like "hmmm, I can't read this block, let me retry that again. What? Still an error? Then let's retry it again, and again...", which could take several minutes for a single bad block. And during that time Linux (the kernel) would do something like "hey, the disk is not responding. Why don't we try some stuff? Let's try resetting the link. If that doesn't work, try downgrading the link speed." In short, if you KNOW the SSD is already showing signs of bad blocks, better to just throw it away.

The excessive number of retries (basically, the kernel repeating the work the drive already attempted) is being addressed in the block layer: [PATCH] libata-eh don't waste time retrying media errors (v3). I believe this is queued for 3.5.

I just hope they don't remove retries completely; I've seen the second or third try return correct data on multiple disks from different vendors (which allowed me to use dd to write the data back to force relocation). But yes, Linux is a bit too overzealous with regard to retries... Regards,

I hope they do.
If you wish, you can force the retry by just trying your command again. This decision should happen at a higher level. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
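The recovery trick Hubert describes (retry the read a couple of times, then write the data back so the drive remaps the weak sector) can be sketched as follows. This is a hypothetical illustration with a mocked device, not real block-device code; actual recovery would use dd against the disk:

```python
# Sketch of the retry-then-rewrite recovery described above. The device
# is simulated; 'FlakySector' and 'recover_sector' are hypothetical names.

class FlakySector:
    """Simulates a sector that fails the first few reads, as a weak
    flash cell or marginal platter sector might."""
    def __init__(self, data, failures):
        self.data = data
        self.failures = failures  # reads that fail before one succeeds
    def read(self):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("media error")
        return self.data
    def write(self, data):
        # Writing forces the drive to remap; later reads succeed.
        self.data = data
        self.failures = 0

def recover_sector(sector, max_retries=3):
    """Retry the read; on success, rewrite the data (the 'dd the data
    back' step) so the drive relocates the sector."""
    for attempt in range(1, max_retries + 1):
        try:
            data = sector.read()
        except IOError:
            continue
        sector.write(data)
        return data, attempt
    raise IOError("unrecoverable after %d retries" % max_retries)

sector = FlakySector(b"payload", failures=2)
print(recover_sector(sector))
```

Whether the extra attempts happen in the drive, in libata, or in a userspace loop like this is exactly the policy question being argued above.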
failed disk (was: kernel 3.3.4 damages filesystem (?))
Hello, Hugo, you wrote on 07.05.12:

mkfs.btrfs -m raid1 -d single should give you that.

What's the difference to mkfs.btrfs -m raid1 -d raid0?

- RAID-0 stripes each piece of data across all the disks.
- single puts data on one disk at a time.

[...] In fact, this is probably a good argument for having the option to put back the old allocator algorithm, which would have ensured that the first disk would fill up completely before it touched the next one...

The current version seems to oscillate from disk to disk: copying about 160 GiByte shows

Label: none uuid: fd0596c6-d819-42cd-bb4a-420c38d2a60b
Total devices 2 FS bytes used 155.64GB
devid 2 size 136.73GB used 114.00GB path /dev/sdl1
devid 1 size 68.37GB used 45.04GB path /dev/sdk1
Btrfs v0.19

Watching the amounts showed that both disks are filled nearly simultaneously. That would be more difficult to restore...

Best regards! Helmut
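The oscillation Helmut observes is consistent with a chunk allocator that places each new chunk on the device with the most free space (an assumption about the allocator's policy, not something stated in this thread). A toy simulation, with hypothetical device names and sizes loosely mirroring his two-disk setup:

```python
# Toy model of a 'most free space first' chunk allocator. Once the
# larger disk's free space catches down to the smaller one's, chunks
# alternate between disks, so both fill nearly simultaneously.

def allocate_single(devices, chunk_size, nchunks):
    """devices: dict name -> free space; returns chunk placement order."""
    order = []
    for _ in range(nchunks):
        # pick the device with the most free space
        name = max(devices, key=devices.get)
        devices[name] -= chunk_size
        order.append(name)
    return order

# Hypothetical sizes: a larger sdl1 and a smaller sdk1, close together.
free = {"sdl1": 70, "sdk1": 68}
print(allocate_single(free, chunk_size=1, nchunks=6))
```

Under the old fill-the-first-disk-completely allocator Hugo mentions, losing the second disk would leave the first disk's files intact; under this policy, data ends up interleaved across both.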
failed disk (was: kernel 3.3.4 damages filesystem (?))
Hello, Hugo, you wrote on 07.05.12:

[...] With a file system like ext2/3/4 I can work with several directories which are mounted together, but (as said before) one broken disk doesn't disturb the others.

mkfs.btrfs -m raid1 -d single should give you that.

Just a small bug, perhaps: created a system with

mkfs.btrfs -m raid1 -d single /dev/sdl1
mount /dev/sdl1 /mnt/Scsi
btrfs device add /dev/sdk1 /mnt/Scsi
btrfs device add /dev/sdm1 /mnt/Scsi

(filling with data) and btrfs fi df /mnt/Scsi now tells:

Data, RAID0: total=183.18GB, used=76.60GB
Data: total=80.01GB, used=79.83GB
System, DUP: total=8.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=192.74MB
Metadata: total=8.00MB, used=0.00

Data, RAID0 confuses me (not very much ...), and the scheme for metadata (RAID1) is not shown.

Best regards! Helmut
Re: failed disk (was: kernel 3.3.4 damages filesystem (?))
On Wed, May 09, 2012 at 04:25:00PM +0200, Helmut Hullen wrote:

You wrote on 07.05.12: [...] With a file system like ext2/3/4 I can work with several directories which are mounted together, but (as said before) one broken disk doesn't disturb the others.

mkfs.btrfs -m raid1 -d single should give you that.

Just a small bug, perhaps: created a system with

mkfs.btrfs -m raid1 -d single /dev/sdl1
mount /dev/sdl1 /mnt/Scsi
btrfs device add /dev/sdk1 /mnt/Scsi
btrfs device add /dev/sdm1 /mnt/Scsi

(filling with data) and btrfs fi df /mnt/Scsi now tells:

Data, RAID0: total=183.18GB, used=76.60GB
Data: total=80.01GB, used=79.83GB
System, DUP: total=8.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=192.74MB
Metadata: total=8.00MB, used=0.00

Data, RAID0 confuses me (not very much ...), and the scheme for metadata (RAID1) is not shown.

DUP is two copies of each block, but it allows the two copies to live on the same device. It's done this way because you started with a single device, and you can't do RAID-1 on one device. The first bit of metadata you write to it should automatically upgrade the DUP chunk to RAID-1.

As to the spurious upgrade of single to RAID-0, I thought Ilya had stopped it doing that. What kernel version are you running?

Out of interest, why did you do the device adds separately, instead of just this?

# mkfs.btrfs -m raid1 -d single /dev/sdl1 /dev/sdk1 /dev/sdm1

Hugo.

-- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Comic Sans goes into a bar, and the barman says, "We don't serve your type here." ---
Re: [PATCH 1/5] btrfs: extend readahead interface
On Thu, Apr 12, 2012 at 05:54:38PM +0200, Arne Jansen wrote:

@@ -97,30 +119,87 @@ struct reada_machine_work {
+/*
+ * this is the default callback for readahead. It just descends into the
+ * tree within the range given at creation. if an error occurs, just cut
+ * this part of the tree
+ */
+static void readahead_descend(struct btrfs_root *root, struct reada_control *rc,
+			      u64 wanted_generation, struct extent_buffer *eb,
+			      u64 start, int err, struct btrfs_key *top,
+			      void *ctx)
+{
+	int nritems;
+	u64 generation;
+	int level;
+	int i;
+
+	BUG_ON(err == -EAGAIN); /* FIXME: not yet implemented, don't cancel
+				 * readahead with default callback */
+
+	if (err || eb == NULL) {
+		/*
+		 * this is the error case, the extent buffer has not been
+		 * read correctly. We won't access anything from it and
+		 * just cleanup our data structures. Effectively this will
+		 * cut the branch below this node from read ahead.
+		 */
+		return;
+	}
+
+	level = btrfs_header_level(eb);
+	if (level == 0) {
+		/*
+		 * if this is a leaf, ignore the content.
+		 */
+		return;
+	}
+
+	nritems = btrfs_header_nritems(eb);
+	generation = btrfs_header_generation(eb);
+
+	/*
+	 * if the generation doesn't match, just ignore this node.
+	 * This will cut off a branch from prefetch. Alternatively one could
+	 * start a new (sub-) prefetch for this branch, starting again from
+	 * root.
+	 */
+	if (wanted_generation != generation)
+		return;

I think I saw wanted_generation = 0 being passed somewhere, but cannot find it now. Is it an expected value for the default RA callback, meaning e.g. 'any generation I find'?
+
+	for (i = 0; i < nritems; i++) {
+		u64 n_gen;
+		struct btrfs_key key;
+		struct btrfs_key next_key;
+		u64 bytenr;
+
+		btrfs_node_key_to_cpu(eb, &key, i);
+		if (i + 1 < nritems)
+			btrfs_node_key_to_cpu(eb, &next_key, i + 1);
+		else
+			next_key = *top;
+		bytenr = btrfs_node_blockptr(eb, i);
+		n_gen = btrfs_node_ptr_generation(eb, i);
+
+		if (btrfs_comp_cpu_keys(&key, &rc->key_end) < 0 &&
+		    btrfs_comp_cpu_keys(&next_key, &rc->key_start) > 0)
+			reada_add_block(rc, bytenr, &next_key,
+					level - 1, n_gen, ctx);
+	}
+}

@@ -142,65 +221,21 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,

 	re->scheduled_for = NULL;
 	spin_unlock(&re->lock);

-	if (err == 0) {
-		nritems = level ? btrfs_header_nritems(eb) : 0;
-		generation = btrfs_header_generation(eb);
-		/*
-		 * FIXME: currently we just set nritems to 0 if this is a leaf,
-		 * effectively ignoring the content. In a next step we could
-		 * trigger more readahead depending from the content, e.g.
-		 * fetch the checksums for the extents in the leaf.
-		 */
-	} else {
+	/*
+	 * call hooks for all registered readaheads
+	 */
+	list_for_each_entry(rec, &list, list) {
+		btrfs_tree_read_lock(eb);
 		/*
-		 * this is the error case, the extent buffer has not been
-		 * read correctly. We won't access anything from it and
-		 * just cleanup our data structures. Effectively this will
-		 * cut the branch below this node from read ahead.
+		 * we set the lock to blocking, as the callback might want to
+		 * sleep on allocations.

What about finer control given to the callbacks? The blocking lock may be unnecessary if the callback does not sleep. My idea is to add a field to 'struct reada_uptodate_ctx', preset with BTRFS_READ_LOCK by default, but let the RA user set it to its needs.
 		 */
-		nritems = 0;
-		generation = 0;
+		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+		rec->rc->callback(root, rec->rc, rec->generation, eb, start,
+				  err, &re->top, rec->ctx);
+		btrfs_tree_read_unlock_blocking(eb);
 	}

@@ -521,12 +593,87 @@ static void reada_control_release(struct kref *kref)

+/*
+ * context to pass from reada_add_block to worker in case the extent is
+ * already uptodate in memory
+ */
+struct reada_uptodate_ctx {
+	struct btrfs_key	top;
+	struct extent_buffer	*eb;
+	struct reada_control	*rc;
+	u64			logical;
+	u64
Re: kernel 3.3.4 damages filesystem (?)
Hi,

On 05/08/2012 10:56 PM, Roman Mamedov wrote:

Regarding btrfs, AFAIK even btrfs -d single suggested above works not per file but per allocation extent, so in case of one disk failure you will lose random *parts* (extents) of random files, which in effect could mean no file in your whole file system will remain undamaged.

Maybe we should evaluate the possibility of such a "one file goes on one disk" feature. Helmut Hullen has the use case: many disks, totally non-critical but nice-to-have data. If one disk dies, some *files* should be lost, not some *random parts of all files*. This could be accomplished by some userspace tool that moves stuff around, combined with file-pinning support that lets the user make sure a specific file is on a specific disk.

Cheers Kaspar
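Kaspar's userspace-tool idea could look something like the sketch below: assign each whole file to a single disk, first-fit-decreasing on free space, so one dead disk costs whole files rather than random extents of every file. Everything here is hypothetical; btrfs has no such pinning interface today:

```python
# Hypothetical whole-file placement for the 'one file gets on one disk'
# idea above. First-fit decreasing: place the biggest files first so the
# small ones can fill the remaining gaps.

def assign_files(files, disks):
    """files: dict name -> size; disks: dict name -> capacity.
    Returns a dict mapping each file to exactly one disk."""
    placement = {}
    free = dict(disks)  # don't mutate the caller's capacities
    for fname, size in sorted(files.items(), key=lambda kv: -kv[1]):
        for disk, space in free.items():
            if space >= size:
                placement[fname] = disk
                free[disk] -= size
                break
        else:
            raise ValueError("no disk can hold %s" % fname)
    return placement

print(assign_files({"a": 5, "b": 3, "c": 3}, {"d1": 6, "d2": 6}))
```

A real tool would also need the pinning support Kaspar mentions, to keep the filesystem from re-striping the file behind its back.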
Re: failed disk
Hello, Hugo, you wrote on 09.05.12:

mkfs.btrfs -m raid1 -d single should give you that.

Just a small bug, perhaps: created a system with

mkfs.btrfs -m raid1 -d single /dev/sdl1
mount /dev/sdl1 /mnt/Scsi
btrfs device add /dev/sdk1 /mnt/Scsi
btrfs device add /dev/sdm1 /mnt/Scsi

(filling with data) and btrfs fi df /mnt/Scsi now tells:

Data, RAID0: total=183.18GB, used=76.60GB
Data: total=80.01GB, used=79.83GB
System, DUP: total=8.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=192.74MB
Metadata: total=8.00MB, used=0.00

Data, RAID0 confuses me (not very much ...), and the scheme for metadata (RAID1) is not shown.

DUP is two copies of each block, but it allows the two copies to live on the same device. It's done this because you started with a single device, and you can't do RAID-1 on one device. The first bit of metadata you write to it should automatically upgrade the DUP chunk to RAID-1.

Ok. Sounds familiar - have you explained that to me many months ago?

As to the spurious upgrade of single to RAID-0, I thought Ilya had stopped it doing that. What kernel version are you running?

3.2.9, self-made. I could test the message with 3.3.4, but not today (if it's only an interpretation of always the same data).

Out of interest, why did you do the device adds separately, instead of just this?

a) making the first 2 devices: I have tested both versions (one line with 2 devices or 2 lines with 1 device); no big difference. But I had tested the option -L (labelling) too, and that makes shit for the oneliner: both devices get the same label, and then findfs finds none of them. The really safe way would be: deleting this option for the mkfs.btrfs command and only using btrfs fi label device [newlabel]

b) third device: that's my usual test: make a cluster of 2 devices, fill them with data, add a third device, delete the smallest device.

Best regards!
Helmut
Re: failed disk
On Wed, May 09, 2012 at 05:14:00PM +0200, Helmut Hullen wrote:

Hello, Hugo, you wrote on 09.05.12:

DUP is two copies of each block, but it allows the two copies to live on the same device. It's done this because you started with a single device, and you can't do RAID-1 on one device. The first bit of metadata you write to it should automatically upgrade the DUP chunk to RAID-1.

Ok. Sounds familiar - have you explained that to me many months ago?

Probably. I tend to explain this kind of thing a lot to people.

As to the spurious upgrade of single to RAID-0, I thought Ilya had stopped it doing that. What kernel version are you running?

3.2.9, self-made.

OK, I'm pretty sure that's too old -- it will upgrade single to RAID-0. You can probably turn it back to single using balance filters:

# btrfs fi balance -dconvert=single /mountpoint

(You may want to write at least a little data to the FS first -- balance has some slightly odd behaviour on empty filesystems.)

I could test the message with 3.3.4, but not today (if it's only an interpretation of always the same data).

Out of interest, why did you do the device adds separately, instead of just this?

a) making the first 2 devices: I have tested both versions (one line with 2 devices or 2 lines with 1 device); no big difference. But I had tested the option -L (labelling) too, and that makes shit for the oneliner: both devices get the same label, and then findfs finds none of them.

Umm... Yes, of course both devices will get the same label -- you're labelling the filesystem, not the devices. (Didn't we have this argument some time ago?) I don't know what findfs is doing that it can't find the filesystem by label: you may need to run sync after mkfs, possibly.

The really safe way would be: deleting this option for the mkfs.btrfs command and only using btrfs fi label device [newlabel]

... except that it'd have to take a filesystem as parameter, not a device (see above).
b) third device: that's my usual test: make a cluster of 2 devices, fill them with data, add a third device, delete the smallest device.

What are you testing? And by "delete" do you mean btrfs dev delete, or pull the cable out?

Hugo.

-- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Quidquid latine dictum sit, altum videtur. ---
Re: failed disk (was: kernel 3.3.4 damages filesystem (?))
On Wed, May 09, 2012 at 03:37:35PM +0100, Hugo Mills wrote:

On Wed, May 09, 2012 at 04:25:00PM +0200, Helmut Hullen wrote:

You wrote on 07.05.12: [...] With a file system like ext2/3/4 I can work with several directories which are mounted together, but (as said before) one broken disk doesn't disturb the others.

mkfs.btrfs -m raid1 -d single should give you that.

Just a small bug, perhaps: created a system with

mkfs.btrfs -m raid1 -d single /dev/sdl1
mount /dev/sdl1 /mnt/Scsi
btrfs device add /dev/sdk1 /mnt/Scsi
btrfs device add /dev/sdm1 /mnt/Scsi

(filling with data) and btrfs fi df /mnt/Scsi now tells:

Data, RAID0: total=183.18GB, used=76.60GB
Data: total=80.01GB, used=79.83GB
System, DUP: total=8.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=192.74MB
Metadata: total=8.00MB, used=0.00

Data, RAID0 confuses me (not very much ...), and the scheme for metadata (RAID1) is not shown.

DUP is two copies of each block, but it allows the two copies to live on the same device. It's done this because you started with a single device, and you can't do RAID-1 on one device.

What Hugo said. Newer mkfs.btrfs will error out if you try to do this.

The first bit of metadata you write to it should automatically upgrade the DUP chunk to RAID-1.

We don't upgrade chunks in place, only during balance.

As to the spurious upgrade of single to RAID-0, I thought Ilya had stopped it doing that. What kernel version are you running?

I did, but again, we were doing it only as part of balance, not as part of normal operation.

Helmut, do you have any additional data points - the output of btrfs fi df right after you created the FS or somewhere in the middle of filling it? Also, could you please paste the output of btrfs fi show and tell us what kernel version you are running?
Thanks, Ilya
Re: [PATCH] Btrfs: use ALIGN macro instead of open-coded expression
On Tue, May 08, 2012 at 04:16:24PM +0800, Yuanhan Liu wrote: According to section 'Find open-coded helpers or macros' at https://btrfs.wiki.kernel.org/index.php/Cleanup_ideas, here in the patch we use ALIGN macro to do the alignment. Well, I wrote this section and some time later also the patches, http://www.spinics.net/lists/linux-btrfs/msg12747.html but did not update the section with the status reflecting this, sorry that you duplicated work. david
Re: Subdirectory creation on snapshot
On Mon, May 07, 2012 at 05:10:08PM -0700, Brendan Smithyman wrote:

I'm experiencing some odd-seeming behaviour with btrfs on Ubuntu 12.04, using the Ubuntu x86-64 generic 3.2.0-24 kernel and btrfs-tools 0.19+20120328-1~precise1 (backported from the current Debian version using Ubuntu's backportpackage). When I snapshot a subvolume on some of my drives, it creates an empty directory inside the snapshot with the same name as the original subvolume. Example case and details below:

This is known and it's not a problem, though I was surprised when I first saw it myself. Snapshotting is not recursive; the file->file and directory->directory cases are straightforward, and when a subvolume is encountered, a new file sub-type is created. It's identified by BTRFS_EMPTY_SUBVOL_DIR_OBJECTID internally, so it's a kind of stub subvolume. It is identified by inode number 2 in stat output. The object cannot be modified and just sits there.

david
Re: [PATCH 3/3] btrfs: extended inode refs
On Tue, May 08, 2012 at 03:57:39PM -0700, Mark Fasheh wrote: Hi Jan, comments inline as usual!

This function must not call free_extent_buffer(eb) in line 1306 after applying your patch set (immediately before the break). Second, I think we'd better add a blocking read lock on eb after incrementing its refcount, because we need the current content to stay as it is. Neither is part of your patches, but it might be easier if you make that bugfix change as a 3/4 patch within your set and turn this one into 4/4. If you don't like that, I'll send a separate patch for it. Don't miss the unlock if you do it ;-)

Ok, I think I was able to figure out and add the correct locking calls. Basically I believe I need to wrap access like this:

	btrfs_tree_read_lock(eb);
	btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
	/* read eb contents */
	btrfs_tree_read_unlock_blocking(eb);

You only need a blocking lock if you're scheduling. Otherwise the spinlock variant is fine.

+
+	while (1) {
+		ret = btrfs_find_one_extref(fs_root, inum, offset, path, &iref2,
+					    &offset);
+		if (ret < 0)
+			break;
+		if (ret) {
+			ret = found ? 0 : -ENOENT;
+			break;
+		}
+		++found;
+
+		slot = path->slots[0];
+		eb = path->nodes[0];
+		/* make sure we can use eb after releasing the path */
+		atomic_inc(&eb->refs);

You need a blocking read lock here, too. Grab it before releasing the path.

If you're calling btrfs_search_slot, it will give you a blocking lock on the leaf. If you set path->leave_spinning before the call, you'll have a spinning lock on the leaf. If you unlock a block that you got from a path (like eb = path->nodes[0]), the path structure has a flag for each level that indicates whether that block was locked or not. See btrfs_release_path(). So, don't fiddle the locks without fiddling the paths. You can switch to/from blocking without touching the path; it figures that out.
-chris
Re: [PATCH v2 1/5] btrfs: add command to zero out superblock
On Thu, May 03, 2012 at 03:11:45PM +0200, Hubert Kario wrote: nice, didn't know about this. Such functionality would be nice to have. But then I don't think that a recreate the array if the parameters are the same is actually a good idea, lots of space for error. A pair of functions: btrfs dev zero-superblock btrfs dev restore-superblock As a user, I'm not sure what can I expect from the restore command. From where does it restore? Eg. a file? As a tester I have use for a temporary clearing of a superblock on a device, then mount it with -o degraded, work work, and then undo clearing. So, my idea is like btrfs device zero-superblock --undo with the obvious sanity checks. A regular user would never need to call this. david
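David's zero-superblock / --undo workflow might be sketched like this against an ordinary image file. The 64 KiB offset matches where btrfs keeps its primary superblock, but the size here is a placeholder and the function names are hypothetical, not the real tool:

```python
# Sketch of zero-superblock with an undo file, as discussed above.
# Works on any file or image; a real tool would target the block device
# and sanity-check the backup before restoring.

SB_OFFSET = 64 * 1024   # btrfs primary superblock offset
SB_SIZE = 4096          # placeholder size for this sketch

def zero_superblock(path, backup):
    """Save the superblock bytes to 'backup', then overwrite with zeros."""
    with open(path, "r+b") as dev, open(backup, "wb") as bak:
        dev.seek(SB_OFFSET)
        bak.write(dev.read(SB_SIZE))   # saved copy enables --undo
        dev.seek(SB_OFFSET)
        dev.write(b"\0" * SB_SIZE)

def undo_zero_superblock(path, backup):
    """Write the saved bytes back, undoing the clear."""
    with open(backup, "rb") as bak, open(path, "r+b") as dev:
        dev.seek(SB_OFFSET)
        dev.write(bak.read(SB_SIZE))
```

The obvious sanity checks David mentions (right device, backup matches, filesystem not mounted) are deliberately omitted here.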
Re: [PATCH v2 1/5] btrfs: add command to zero out superblock
On Wednesday 09 of May 2012 19:18:07 David Sterba wrote: On Thu, May 03, 2012 at 03:11:45PM +0200, Hubert Kario wrote: nice, didn't know about this. Such functionality would be nice to have. But then I don't think that a recreate the array if the parameters are the same is actually a good idea, lots of space for error. A pair of functions: btrfs dev zero-superblock btrfs dev restore-superblock As a user, I'm not sure what can I expect from the restore command. From where does it restore? Eg. a file? As a tester I have use for a temporary clearing of a superblock on a device, then mount it with -o degraded, work work, and then undo clearing. So, my idea is like btrfs device zero-superblock --undo with the obvious sanity checks. A regular user would never need to call this. Yes, that's a better idea. -- Hubert Kario QBS - Quality Business Software 02-656 Warszawa, ul. Ksawerów 30/85 tel. +48 (22) 646-61-51, 646-74-24 www.qbs.com.pl
Re: [RFC PATCH v2] Btrfs: improve space count for files with fragments
On Fri, Apr 27, 2012 at 09:44:13AM +0800, Liu Bo wrote:

Let's take the above case:

  0k                                 20k
  |------------- extent -------------|
     |------------ A ------------|
    1k                          19k

And we assume that this extent starts at disk_bytenr on its FS logical offset. By splitting the [0k, 20k) extent, we'll get three delayed refs in the delayed-ref rbtree:

a) [0k, 20k), in which only [disk_bytenr+1k, disk_bytenr+19k) will be freed at the end.
b) [0k, 1k), which will _not_ allocate a new extent but use the remaining space of [0k, 20k).
c) [19k, 20k), ditto.

And another ref [1k, 19k) will get newly allocated space through our normal endio routine. What I want is:

free [0k, 20k): set this range DIRTY in the pinned_extents tree.
alloc [0k, 1k): clear this range DIRTY in the pinned_extents tree.
alloc [19k, 20k): ditto.

However, in my stress test, these three refs may not be ordered a)->b)->c), but b)->a)->c) instead. That would be a problem, because it will confuse our space_info's counters: bytes_reserved, bytes_pinned.

Do you have an idea why the ordering may become broken? If it's a race, it might be better to fix it instead of adding a new bit to extent flags.

david
Re: kernel 3.3.4 damages filesystem (?)
Helmut Hullen posted on Mon, 07 May 2012 12:46:00 +0200 as excerpted:

The 3 btrfs disks are connected via a SiI 3114 SATA-PCI controller. Only 1 of the 3 disks seems to be damaged.

I don't plan to rehash the raid0/single discussion here, but here's some perhaps useful additional information on that hardware:

For some years I've been running that same hardware, SiI 3114 SATA PCI, on an old dual-socket 3-digit Opteron system, running for some years now dual dual-core Opteron 290s (the highest they went, 2.8 GHz, 4 cores in two sockets). However, I *WAS* running them in RAID-1, 4-disk md RAID-1 to be exact (with reiserfs, FWIW).

What's VERY interesting is that I've just returned from being offline for several days due to severe disk-I/O hardware issues of my own -- again, on that SiI SATA 3114. Most of the time I was getting full system crashes, but perhaps 25-33% of the time it didn't fully crash the system, simply erroring out with an eventual ATA reset. When the system didn't crash immediately, most of the time (about 80%, I'd say) the reset would be good and I'd be back up, but sometimes it'd repeatedly reset, occasionally never becoming usable again.

As the drives are all the same quite old Seagate 300 gig drives, at about half their rated SMART operating hours but I think well beyond the 5-year warranty, I originally thought I'd just learned my lesson on the "don't use all the same model or you're risking them all going out at once" rule. But I bought a new drive (half-TB Seagate 2.5" drive; I've been thinking about going 2.5" for a while now and this was the chance; I'll RAID it later with at least one more, preferably a different run at least if not a different model) and have been SLOWLY, PAINFULLY, RESETTINGLY copying stuff over from one or another of the four RAID-1 drives. The reset problem, however, hasn't gone away, though it's rather reduced on the newer hardware.
I also happened to have a 4x3.5"-in-3x5.25"-slot drive enclosure that seemed to be making the problem worse: when I first tried the new 2.5" drive retrofitted into it, the reset problem was as bad as with the old drives, but when I ran it loose, just cabled into the mobo and power supply directly, resets went down significantly but did NOT go away. So... I've now concluded that I need a new controller and will probably buy one in a day or two.

Meanwhile, I THOUGHT it was just me with the SiI SATA controller, until I happened to see the same hardware mentioned on this thread. Now I'm beginning to suspect that there's some new kernel DMA or storage or perhaps xorg/mesa problem (AMD AGPGART, after all, handling the DMA using half the aperture; if either the graphics or storage try writing to the wrong half...) that stressed what was already aging hardware, triggering the problem.

It's worth noting that I tried running an older kernel and rebuilding (on Gentoo) most of X/mesa/anything-else-I-could-think-might-be-related between older versions that WERE working fine before and newer versions, and reverting to older didn't help, so it's apparently NOT a direct software-only bug. However, what I'm wondering now is whether, as I said, software upgrades added stress to already aging hardware, such that it tipped over the edge, and by the time I tried reverting, I'd already had enough crashes etc. that my entire system was unstable, and reverting to older software didn't help because now the hardware was unstable as well.

I'd still chalk it up to simply failing hardware, except that it's a rather interesting coincidence that both you and I had our SiI SATA 3114s go bad at very close to the same time. Meanwhile, I did recently see an interesting kernel commit, either late 3.4-rc5+ or early 3.4-rc6+.
I don't want to try to track it down and lose this post to a crash on a less-than-stable system, but it did mention that AMD AGPGARTs sometimes poked holes in memory allocations and the commit was to try to allow for that. I'm not sure how long the bad code had been in the kernel, but if it was introduced at, say, the 3.2 or 3.3 kernel, it could be what first started triggering the lockups that led to more and more system instability, until now I've bought a new drive and it looks like I'm going to need to replace the onboard SIL-SATA.

So, some questions:

* Do you run OpenGL/Mesa at all on that system, possibly with an OpenGL compositing window manager?

* If so, how new are your mesa and xorg-server, and what is your video card/driver?

* Do you run quite new kernels, say 3.3/3.4?

* What libffi and cairo? (I did notice reverting libffi seemed to lessen the crashing a bit, especially with firefox on my bank's SSL site, which was where the problem first became ugly for me as I kept crashing trying to get in to pay bills, etc., but I'm not positive that's related, or it might be that the likely otherwise-separate bug's crashes advanced the ATA-resets issue
Re: [ANN] btrfs.wiki.kernel.org with up-to-date content again
David Sterba posted on Mon, 07 May 2012 17:44:16 +0200 as excerpted: Hi, the time of temporary wiki hosted at btrfs.ipv5.de is over, the content has been migrated back to official site at http://btrfs.wiki.kernel.org (ipv5.de wiki is set to redirect there). Thanks. I was checking it a couple days ago and noticed the migrating back notice. Then last night I was looking up something else, and noticed the redirect back to kernel.org. =:^) -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: Subdirectory creation on snapshot
Thanks David,

If I understand you correctly, this would be the case with nested subvolumes; i.e., if subvolume A exists within the directory tree of subvolume B, and B is snapshotted. I expected this, and it sounds totally consistent with my understanding of how btrfs subvolumes work. However, the behaviour I'm seeing seems to be a different thing, so I just want to double-check:

In my case I am executing the btrfs subvolume snapshot @working newsnapshot command (or something like it). The @working subvolume exists in the filesystem root, and does not contain any other subvolumes within its own subdirectory tree. In the new subvolume, newsnapshot, there is an entry called @working that is identified as inode number 2, as you say. But this isn't due to a subvolume in the directory tree of the original @working, since it still happens, e.g., if it is the only subvolume on the system (apart from the root, of course). The naive assumption is that (excepting nested subvolumes) the snapshot should be indistinguishable from the original.

Additionally, I'm a bit perplexed that the behaviour occurs on some of my volumes and not others. It's not a big deal, and I'm happy to take your word for it (or look at the code, if you'd be willing to point me in the right direction; I'm not averse to learning). I just wanted to double-check that we're talking about the same thing. I appreciate the help!

Cheers, Brendan

On 2012-05-09, at 9:58 AM, David Sterba wrote: On Mon, May 07, 2012 at 05:10:08PM -0700, Brendan Smithyman wrote: I'm experiencing some odd-seeming behaviour with btrfs on Ubuntu 12.04, using the Ubuntu x86-64 generic 3.2.0-24 kernel and btrfs-tools 0.19+20120328-1~precise1 (backported from the current Debian version using Ubuntu's backportpackage). When I snapshot a subvolume on some of my drives, it creates an empty directory inside the snapshot with the same name as the original subvolume.
Example case and details below: This is known and it's not a problem, though I was surprised when I first saw it myself. Snapshotting is not recursive: the file-to-file and directory-to-directory cases are straightforward, but when a subvolume is encountered, a new file sub-type is created in its place. Internally it is identified by BTRFS_EMPTY_SUBVOL_DIR_OBJECTID, so it's a kind of stub subvolume. It shows up as inode number 2 in stat output. The object cannot be modified and just sits there. david
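Since the stub is distinguishable mainly by its inode number, a quick way to check for it from a shell is a plain stat; a sketch (the snapshot path is hypothetical, taken from this thread -- the runnable part only shows the command on an ordinary directory for contrast):

```shell
# On a real system one would check the entry inside the snapshot
# (hypothetical path from the thread); a stub prints inode 2:
#   stat --format=%i /mnt/newsnapshot/@working
# For comparison, a freshly created ordinary directory gets a normal
# inode number, not 2:
d=$(mktemp -d)
stat --format='inode=%i' "$d"
```

`stat --format` is GNU coreutils syntax; the same check works with `ls -id` on non-GNU systems.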
Re: kernel 3.3.4 damages filesystem (?)
I don't know if this is related or not, but I updated two different computers to Ubuntu 12, which uses kernel 3.2, and on both I had the same problem: using btrfs with compress-force=lzo, after some I/O stress the filesystem became unusable, stuck in some sort of busy state. I'm using kernel 3.0 right now, with no such problem. On 09-05-2012 14:32, Duncan wrote: Helmut Hullen posted on Mon, 07 May 2012 12:46:00 +0200 as excerpted: The 3 btrfs disks are connected via a SiI 3114 SATA-PCI controller. Only 1 of the 3 disks seems to be damaged. I don't plan to rehash the raid0/single discussion here, but here's some perhaps useful additional information on that hardware: For some years I've been running that same hardware, SiI 3114 SATA PCI, on an old dual-socket 3-digit Opteron system, running for some years now dual dual-core Opteron 290s (the highest they went, 2.8 GHz, 4 cores in two sockets). However, I *WAS* running them in RAID-1 -- 4-disk md RAID-1, to be exact (with reiserfs, FWIW). What's VERY interesting is that I've just returned from being offline for several days due to severe disk-I/O hardware issues of my own -- again, on that SiI 3114. Most of the time I was getting full system crashes, but perhaps 25-33% of the time it didn't fully crash the system, simply erroring out with an eventual ATA reset. When the system didn't crash immediately, most of the time (about 80% I'd say) the reset would be good and I'd be back up, but sometimes it'd repeatedly reset, occasionally never becoming usable again.
As the drives are all the same quite old Seagate 300 GB drives, at about half their rated SMART operating hours but I think well beyond the 5-year warranty, I originally thought I'd just learned my lesson on the don't-use-all-the-same-model-or-you're-risking-them-all-going-out-at-once rule, but I bought a new drive (a half-TB Seagate 2.5-inch drive; I've been thinking about going 2.5-inch for a while now and this was the chance; I'll RAID it later with at least one more, preferably from a different run at least, if not a different model) and have been SLOWLY, PAINFULLY, RESETTINGLY copying stuff over from one or another of the four RAID-1 drives. The reset problem, however, hasn't gone away, though it's rather reduced on the newer hardware. I also happened to have a 4-drives-in-3-5.25-inch-slots enclosure that seemed to be making the problem worse, as when I first tried the new 2.5-inch drive retrofitted into it, the reset problem was as bad as with the old drives, but when I ran it loose, just cabled into the mobo and power supply directly, resets went down significantly but did NOT go away. So... I've now concluded that I need a new controller and will probably buy one in a day or two. Meanwhile, I THOUGHT it was just me with the SiI SATA controller, until I happened to see the same hardware mentioned in this thread. Now I'm beginning to suspect that there's some new kernel DMA or storage or perhaps xorg/mesa problem (AMD AGPGART, after all, handling the DMA using half the aperture; if either the graphics or storage try writing to the wrong half...) that stressed what was already aging hardware, triggering the problem. It's worth noting that I tried running an older kernel and rebuilding (on Gentoo) most of X/mesa/anything-else-I-could-think-might-be-related between older versions that WERE working fine before and newer versions, and reverting to older didn't help, so it's apparently NOT a direct software-only bug.
However, what I'm wondering now is whether, as I said, software upgrades added stress to already-aging hardware, such that it tipped over the edge, and by the time I tried reverting, I'd already had enough crashes etc. that my entire system was unstable, and reverting to older software didn't help because by then the hardware was unstable as well. I'd still chalk it up to simply failing hardware, except that it's a rather interesting coincidence that both you and I had our SiI 3114s go bad at very close to the same time. Meanwhile, I did recently see an interesting kernel commit, either late 3.4-rc5+ or early 3.4-rc6+. I don't want to try to track it down and lose this post to a crash on a less-than-stable system, but it did mention that AMD AGPGARTs sometimes poked holes in memory allocations, and the commit was to try to allow for that. I'm not sure how long the bad code had been in the kernel, but if it was introduced in, say, the 3.2 or 3.3 kernel, it could be that that is what first started triggering the lockups that led to more and more system instability, until now I've bought a new drive and it looks like I'm going to need to replace the onboard SiI SATA as well. So, some questions: * Do you run OpenGL/Mesa at all on that system, possibly with an OpenGL compositing window manager? * If so, how new are your mesa and xorg-server, and what is your video card/driver? * Do you run quite new kernels, say 3.3/3.4? * What libffi and cairo versions? (I did notice reverting libffi seemed to lessen the
Re: failed disk
Hello Hugo, you wrote on 09.05.12: As to the spurious upgrade of single to RAID-0, I thought Ilya had stopped it doing that. What kernel version are you running? 3.2.9, self-built. OK, I'm pretty sure that's too old -- it will upgrade single to RAID-0. You can probably turn it back to single using balance filters: # btrfs fi balance -dconvert=single /mountpoint (You may want to write at least a little data to the FS first -- balance has some slightly odd behaviour on empty filesystems). Tomorrow ... the system is still running balance after the device delete, and that may still need 4 ... 5 hours. Out of interest, why did you do the device adds separately, instead of just this? a) making the first 2 devices: I have tested both versions (one line with 2 devices, or 2 lines with 1 device each); no big difference. But I had also tested the option -L (labelling), and that goes wrong for the one-liner: both devices get the same label, and then findfs finds neither of them. Umm... Yes, of course both devices will get the same label -- you're labelling the filesystem, not the devices. (Didn't we have this argument some time ago?) Not with that special case (and that led me to misinterpreting the error ...). I don't know what findfs is doing such that it can't find the filesystem by label: you may need to run sync after mkfs, possibly. No -- findfs works quite simply: if it finds exactly 1 matching label, it reports the partition. If it finds more or fewer, it reports nothing. b) third device: that's my usual test: make a cluster of 2 devices, fill them with data, add a third device, delete the smallest device. What are you testing? And by delete do you mean btrfs dev delete or pull the cable out? Purely software delete first. Tomorrow I'll reboot the system and look at the results with btrfs fi show It should show only 2 devices (that's the part which seems to work as described, at least since kernel 3.2). By the way: it seems to be necessary to run btrfs fi balance ... after btrfs device add ...
and after btrfs device delete Best regards! Helmut
Re: Ceph on btrfs 3.4rc
On Fri, May 04, 2012 at 10:24:16PM +0200, Christian Brunner wrote: 2012/5/3 Josef Bacik jo...@redhat.com: On Thu, May 03, 2012 at 09:38:27AM -0700, Josh Durgin wrote: On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik jo...@redhat.com wrote: On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote: Yeah, all that was in the right place; I rebooted and I magically stopped getting that error, but now I'm getting this http://fpaste.org/OE92/ with that ping thing repeating over and over. Thanks, That just looks like the osd isn't running. If you restart the osd with 'debug osd = 20' the osd log should tell us what's going on. OK, that part was my fault -- duh, I need to redo the tmpfs and mkcephfs stuff after a reboot. But now I'm back to my original problem http://fpaste.org/PfwO/ I have the osd class dir = /usr/lib64/rados-classes thing set and libcls_rbd is in there, so I'm not sure what is wrong. Thanks, That's really strange. Do you have the osd logs in /var/log/ceph? If so, can you look whether you find anything about rbd or class loading in there? Another thing you should try is whether you can access ceph with rados: # rados -p rbd ls # rados -p rbd -i /proc/cpuinfo put testobj # rados -p rbd -o - get testobj OK, weirdly, ceph is trying to dlopen /usr/lib64/rados-classes/libcls_rbd.so but all I had was libcls_rbd.so.1 and libcls_rbd.so.1.0.0. A symlink fixed that part; I'll see if I can reproduce now. Thanks, Josef
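For anyone hitting the same dlopen() failure: the fix Josef describes is just the conventional bare-`.so` symlink that the package omitted. A sketch, using a scratch directory as a stand-in for /usr/lib64/rados-classes (file names from the thread, the directory is illustrative):

```shell
# dlopen() asks for the bare "libcls_rbd.so" name, but only the
# versioned files were installed. Recreate that layout and add the
# missing name as a symlink to the real file:
TMPD=$(mktemp -d)                       # stand-in for /usr/lib64/rados-classes
touch "$TMPD/libcls_rbd.so.1.0.0"       # the file that actually existed
ln -s libcls_rbd.so.1.0.0 "$TMPD/libcls_rbd.so.1"
ln -s libcls_rbd.so.1.0.0 "$TMPD/libcls_rbd.so"   # the name dlopen() wants
ls "$TMPD"
```

This mirrors the usual shared-library convention: the bare linker name is normally a symlink to the versioned soname, and dlopen() of the bare name fails if only the versioned files are present.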
btrfs RAID with enterprise SATA or SAS drives
There is various information around about - enterprise-class drives (either SAS or just enterprise SATA) - the SCSI/SAS protocols themselves vs. SATA having more advanced features (e.g. for dealing with error conditions) than the average block device For example, Adaptec recommends that such drives will work better with their hardware RAID cards: http://ask.adaptec.com/cgi-bin/adaptec_tic.cfg/php/enduser/std_adp.php?p_faqid=14596 Desktop-class disk drives have an error recovery feature that will result in a continuous retry of the drive (read or write) when an error is encountered, such as a bad sector. In a RAID array this can cause the RAID controller to time out while waiting for the drive to respond. and this blog: http://www.adaptec.com/blog/?p=901 major advantages to enterprise drives (TLER for one) ... opt for the enterprise drives in a RAID environment no matter what the cost of the drive over the desktop drive My questions: - does btrfs RAID1 actively use the more advanced features of these drives, e.g. to work around errors without getting stuck on a bad block? - if a non-RAID SAS card is used, does it matter which card is chosen? Does btrfs work equally well with all of them? - ignoring the better MTBF and seek times of these drives, do any of the other features passively contribute to a better RAID experience when using btrfs? - for someone using SAS or enterprise SATA drives with Linux, I understand btrfs gives the extra benefit of checksums; are there any other specific benefits over using mdadm or dmraid?
Re: [RFC PATCH v2] Btrfs: improve space count for files with fragments
On 05/10/2012 01:29 AM, David Sterba wrote: On Fri, Apr 27, 2012 at 09:44:13AM +0800, Liu Bo wrote: Let's take the above case:
0k                    20k
|  ---  extent  ---   |
    | -    A    - |
    1k            19k
And we assume that this extent starts at disk_bytenr for its FS logical offset. By splitting the [0k, 20k) extent, we'll get three delayed refs in the delayed-ref rbtree: a) [0k, 20k), of which only [disk_bytenr+1k, disk_bytenr+19k) will be freed at the end. b) [0k, 1k), which will _not_ allocate a new extent but reuse the remaining space of [0k, 20k). c) [19k, 20k), ditto. And another ref, [1k, 19k), will get newly allocated space via our normal endio routine. What I want is: free [0k, 20k), setting this range DIRTY in the pinned_extents tree; alloc [0k, 1k), clearing this range DIRTY in the pinned_extents tree; alloc [19k, 20k), ditto. However, in my stress test, these three refs may not be ordered a)-b)-c), but b)-a)-c) instead. That is a problem, because it confuses our space_info counters: bytes_reserved, bytes_pinned. Do you have an idea why the ordering may become broken? If it's a race, it might be better to fix it instead of adding a new bit to the extent flags. These refs are well managed in the delayed_ref rbtree, but processing them can be multi-threaded, so the ordering is not guaranteed to be sequential, since the original design treats each ref as independent. Any thoughts? :) thanks, liubo david
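The accounting hazard can be sketched with a toy counter (purely illustrative, not btrfs code): "free" pins space, "alloc" is supposed to clear previously pinned space, and running an alloc ref before its matching free leaves the counter skewed:

```shell
# Toy model: "pinned" stands in for space marked DIRTY in the pinned
# tree; "skew" counts alloc'd space that was never pinned (the bug
# symptom when refs run out of order). Sizes are in kB, as "op:size".
run() {
    pinned=0; skew=0
    for ref in "$@"; do
        op=${ref%%:*}; n=${ref##*:}
        if [ "$op" = free ]; then
            pinned=$((pinned + n))          # free: pin the range
        elif [ "$pinned" -ge "$n" ]; then
            pinned=$((pinned - n))          # alloc: clear pinned range
        else
            skew=$((skew + n - pinned))     # alloc found nothing pinned
            pinned=0
        fi
    done
    echo "pinned=${pinned}k skew=${skew}k"
}
run free:20 alloc:1 alloc:1   # ordered a)-b)-c): prints pinned=18k skew=0k
run alloc:1 free:20 alloc:1   # reordered b)-a)-c): prints pinned=19k skew=1k
```

In the reordered run, the 1k alloc of b) arrives before a) has pinned anything, so the counter ends up both skewed and over-pinned -- the same symptom the patch discussion attributes to multi-threaded ref processing.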
Re: [PATCH] Btrfs: use ALIGN macro instead of open-coded expression
On Wed, May 09, 2012 at 06:45:49PM +0200, David Sterba wrote: On Tue, May 08, 2012 at 04:16:24PM +0800, Yuanhan Liu wrote: According to the section 'Find open-coded helpers or macros' at https://btrfs.wiki.kernel.org/index.php/Cleanup_ideas, this patch uses the ALIGN macro to do the alignment. Well, I wrote that section and some time later also the patches, http://www.spinics.net/lists/linux-btrfs/msg12747.html but did not update the section with the status reflecting this; sorry that you duplicated the work. It's OK. I just didn't find that those issues were fixed in mainline, so I thought they still existed. Thanks, Yuanhan Liu
Re: failed disk
Hello Hugo, you wrote on 09.05.12: btrfs fi df /mnt/Scsi now reports
Data, RAID0: total=183.18GB, used=76.60GB
Data: total=80.01GB, used=79.83GB
System, DUP: total=8.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=192.74MB
Metadata: total=8.00MB, used=0.00
-- Data, RAID0 confuses me (not very much ...), and RAID1 for the metadata is not shown. DUP is two copies of each block, but it allows the two copies to live on the same device. It's done this because you started with a single device, and you can't do RAID-1 on one device. The first bit of metadata you write to it should automatically upgrade the DUP chunk to RAID-1. It has done so -- OK. Adding and removing disks/partitions works as expected. Best regards! Helmut