[PATCH 1/1] btrfs: looping 'mkfs.btrfs -f dev' may fail with EBUSY
The thread holding the O_EXCL flag seems to be BTRFS_IOC_SCAN_DEV ioctl, which in turn calls btrfs_scan_one_device() to open dev with the O_EXCL flag. But btrfs_scan_one_device() does not write anything to the disk. and it is called by . An intermediary step (not the final open_ctree) in the mount thread to read the SB and . btrfs-control ioctls viz BTRFS_IOC_SCAN_DEV and BTRFS_IOC_DEVICES_READY they don't need the O_EXCL. test script: (run this in a loop) static int test_skip_this_disk(char *path) { int fd; char c; printf(%s , path); fd = open(path, O_RDWR|O_EXCL); if (fd 0) { printf(Open failed\n); return 1; } /*fflush(stdout); printf(Open Fine press enter\n); scanf(%c, c);*/ close(fd); return 0; } main(int arg, char **argv) { int i; if (arg == 1) { printf(usage: %s dev-with-btrfs-sb .. \n, argv[0]); exit(1); } for (i = 1; i arg; i++) test_skip_this_disk(argv[i]); } dump stack after the userland close(fd) dump_stack+0x9/0x60 btrfs_scan_one_device+0x18d/0x1f0 [btrfs] btrfs_control_ioctl+0xb9/0x210 [btrfs] do_vfs_ioctl+0x84/0x4c0 inode_has_perm+0x28/0x30 file_has_perm+0x8a/0xa0 SyS_ioctl+0x91/0xa0 system_call_fastpath+0x16/0x1b Signed-off-by: Anand Jain anand.j...@oracle.com --- fs/btrfs/volumes.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 636faa0..c186b5e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -914,7 +914,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, * later supers, using BTRFS_SUPER_MIRROR_MAX instead */ bytenr = btrfs_sb_offset(0); - flags |= FMODE_EXCL; mutex_lock(uuid_mutex); bdev = blkdev_get_by_path(path, flags, holder); -- 1.8.5.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs-progs: fix race condition between btrfs and udev
Wang, Futher, a easy way to trigger this problem is by running the following c codes repeatedly: int main(int argc, char **argv) { /* pass a btrfs block device */ int fd = open(argv[1], O_RDWR | O_EXCL); if (fd 0) { perror(fail to open: %s, strerror(errno)); exit(1); } close(fd); return 0; } So the problem is RW opening would trigger udev event which will call btrfs_scan_one_device(). In btrfs_scan_one_device(), it would open the block device with EXCL flag..meanwhile if another program try to open that device with O_EXCL, it would fail with EBUSY. Expected. But do we need O_EXCL in kernel:btrfs_scan_one_device() at all ? This happen seldomly in the real world, but if we use loop device for test, we may hit this annoying problem. A walkaround way to solve this problem is to wait kernel scanning finished and then try it again. I agree this happens seldom in production. But I don't agree on something to fix as workaround. Just sent out the patch: btrfs: looping mkfs.btrfs -f dev may fail with EBUSY which I believe can be a final fix. Your review / tests appreciated. Thanks, Anand -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] btrfs-progs: Add human readable flags output for chunk/block group type.
Current btrfs-debug-tree output chunk/block group type as numbers, which makes it hard to understand and need to check the source to understand the meaning. This patch will convert numberic type output to human readable strings. Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- changlog: v2: allow mixed chunk. --- print-tree.c | 59 ++- 1 file changed, 54 insertions(+), 5 deletions(-) diff --git a/print-tree.c b/print-tree.c index 7f831ad..6cf59ca 100644 --- a/print-tree.c +++ b/print-tree.c @@ -160,15 +160,60 @@ static int print_inode_ref_item(struct extent_buffer *eb, struct btrfs_item *ite return 0; } +/* Caller should ensure sizeof(*ret)=21 DATA|METADATA|RAID10 */ +static void bg_flags_to_str(u64 flags, char *ret) +{ + int empty = 1; + + if (flags BTRFS_BLOCK_GROUP_DATA) { + empty = 0; + strcpy(ret, DATA); + } + if (flags BTRFS_BLOCK_GROUP_METADATA) { + if (!empty) + strcat(ret, |); + strcat(ret, METADATA); + } + if (flags BTRFS_BLOCK_GROUP_SYSTEM) { + if (!empty) + strcat(ret, |); + strcat(ret, SYSTEM); + } + switch (flags BTRFS_BLOCK_GROUP_PROFILE_MASK) { + case BTRFS_BLOCK_GROUP_RAID0: + strcat(ret, |RAID0); + break; + case BTRFS_BLOCK_GROUP_RAID1: + strcat(ret, |RAID1); + break; + case BTRFS_BLOCK_GROUP_DUP: + strcat(ret, |DUP); + break; + case BTRFS_BLOCK_GROUP_RAID10: + strcat(ret, |RAID10); + break; + case BTRFS_BLOCK_GROUP_RAID5: + strcat(ret, |RAID5); + break; + case BTRFS_BLOCK_GROUP_RAID6: + strcat(ret, |RAID6); + break; + default: + break; + } +} + static void print_chunk(struct extent_buffer *eb, struct btrfs_chunk *chunk) { int num_stripes = btrfs_chunk_num_stripes(eb, chunk); int i; - printf(\t\tchunk length %llu owner %llu type %llu num_stripes %d\n, + char chunk_flags_str[32] = {0}; + + bg_flags_to_str(btrfs_chunk_type(eb, chunk), chunk_flags_str); + printf(\t\tchunk length %llu owner %llu type %s num_stripes %d\n, (unsigned long long)btrfs_chunk_length(eb, chunk), (unsigned long long)btrfs_chunk_owner(eb, chunk), - (unsigned long long)btrfs_chunk_type(eb, chunk), - num_stripes); + chunk_flags_str, num_stripes); for (i = 0 ; i num_stripes ; i++) { printf(\t\t\tstripe %d devid %llu offset %llu\n, i, (unsigned long long)btrfs_stripe_devid_nr(eb, chunk, i), @@ -744,6 +789,7 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l) u32 nr = btrfs_header_nritems(l); u64 objectid; u32 type; + char bg_flags_str[32]; printf(leaf %llu items %d free space %d generation %llu owner %llu\n, (unsigned long long)btrfs_header_bytenr(l), nr, @@ -858,10 +904,13 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l) struct btrfs_block_group_item); read_extent_buffer(l, bg_item, (unsigned long)bi, sizeof(bg_item)); - printf(\t\tblock group used %llu chunk_objectid %llu flags %llu\n, + memset(bg_flags_str, 0, sizeof(bg_flags_str)); + bg_flags_to_str(btrfs_block_group_flags(bg_item), + bg_flags_str); + printf(\t\tblock group used %llu chunk_objectid %llu flags %s\n, (unsigned long long)btrfs_block_group_used(bg_item), (unsigned long long)btrfs_block_group_chunk_objectid(bg_item), - (unsigned long long)btrfs_block_group_flags(bg_item)); + bg_flags_str); break; case BTRFS_CHUNK_ITEM_KEY: print_chunk(l, btrfs_item_ptr(l, i, struct btrfs_chunk)); -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] btrfs-progs: Add human readable flags output string for extent flags.
Current btrfs-debug-tree outputs extent flags as numbers, which makes it hard to understand and need to check the source to understand the meaning. This patch will convert numberic flags output to human readable strings. Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- changlog: v2: none --- print-tree.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/print-tree.c b/print-tree.c index 7263b09..7f831ad 100644 --- a/print-tree.c +++ b/print-tree.c @@ -242,6 +242,25 @@ static void print_file_extent_item(struct extent_buffer *eb, btrfs_file_extent_compression(eb, fi)); } +/* Caller should ensure sizeof(*ret) = 16(DATA|TREE_BLOCK) */ +static void extent_flags_to_str(u64 flags, char *ret) +{ + int empty = 1; + + if (flags BTRFS_EXTENT_FLAG_DATA) { + empty = 0; + strcpy(ret, DATA); + } + if (flags BTRFS_EXTENT_FLAG_TREE_BLOCK) { + if (!empty) { + empty = 0; + strcat(ret, |); + } + strcat(ret, TREE_BLOCK); + } + return; +} + static void print_extent_item(struct extent_buffer *eb, int slot, int metadata) { struct btrfs_extent_item *ei; @@ -255,6 +274,7 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int metadata) u32 item_size = btrfs_item_size_nr(eb, slot); u64 flags; u64 offset; + char flags_str[32] = {0}; if (item_size sizeof(*ei)) { #ifdef BTRFS_COMPAT_EXTENT_TREE_V0 @@ -271,11 +291,12 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int metadata) ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item); flags = btrfs_extent_flags(eb, ei); + extent_flags_to_str(flags, flags_str); - printf(\t\textent refs %llu gen %llu flags %llu\n, + printf(\t\textent refs %llu gen %llu flags %s\n, (unsigned long long)btrfs_extent_refs(eb, ei), (unsigned long long)btrfs_extent_generation(eb, ei), - (unsigned long long)flags); + flags_str); if (flags BTRFS_EXTENT_FLAG_TREE_BLOCK !metadata) { struct btrfs_tree_block_info *info; -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/3] btrfs-progs: fix missing parity stripe for raid6 in chunk-recover
When deal with the p q stripes for data profile raid6, chunk-recover forgets to insert them into the chunk record. Just insert them back freely. Also wrap the insert procedure into a new function, fill_chunk_up. Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- changelog: v1-v2: cleanup changelog, 'inert' changed to 'insert' --- chunk-recover.c | 30 -- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/chunk-recover.c b/chunk-recover.c index dfa7ff6..9b46b0b 100644 --- a/chunk-recover.c +++ b/chunk-recover.c @@ -1785,6 +1785,23 @@ static inline int count_devext_records(struct list_head *record_list) return num_of_records; } +static int fill_chunk_up(struct chunk_record *chunk, struct list_head *devexts, +struct recover_control *rc) +{ + int ret = 0; + int i; + + for (i = 0; i chunk-num_stripes; i++) { + if (!chunk-stripes[i].devid) { + ret = insert_stripe(devexts, rc, chunk, i); + if (ret) + break; + } + } + + return ret; +} + #define EQUAL_STRIPE (1 0) static int rebuild_raid_data_chunk_stripes(struct recover_control *rc, @@ -1919,9 +1936,9 @@ next_csum: num_unordered = count_devext_records(unordered); if (chunk-type_flags BTRFS_BLOCK_GROUP_RAID6 num_unordered == 2) { - list_splice_init(unordered, chunk-dextents); btrfs_release_path(path); - return 0; + ret = fill_chunk_up(chunk, unordered, rc); + return ret; } goto next_stripe; @@ -1966,14 +1983,7 @@ out: BTRFS_BLOCK_GROUP_RAID5) || (num_unordered == 3 chunk-type_flags BTRFS_BLOCK_GROUP_RAID6)) { - for (i = 0; i chunk-num_stripes; i++) { - if (!chunk-stripes[i].devid) { - ret = insert_stripe(unordered, rc, - chunk, i); - if (ret) - break; - } - } + ret = fill_chunk_up(chunk, unordered, rc); } } fail_out: -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] Add support for LZ4 compression
On Wed, Jun 04, 2014 at 10:00:06AM -0400, Chris Mason wrote: I have a slightly different reason for holding off on these. Disk format changes are forever, and we need a really strong use case for pulling them in. The format upgrade is inevitable for full bidirectional interoperability of filesystems with non-pagesized sectorsize and compression. At the moment this is not possible even without compression, but patches are on the way. With that said, thanks for spending all of the time on this. Pulling in Dave's idea to stream larger compression blocks through lzo (or any new alg) might be enough to push performance much higher, and better show case the differences between new algorithms. The space savings and speed gains can be measured outside of btrfs. From the past numbers I see that 4k-64k chunk brings another 5-10% of ratio and the de/compression speed is not worse. Bigger chunks do not improve that much, but the overhead for assembling the linear mappings would be decreased. The whole reason I chose zlib originally was because its streaming interface was a better fit for how FS IO worked. Right, zlib has the streaming interface and accepts randomly scattered blocks, but the others do not. LZ4 has a streaming extension proposed, but I haven't looked at it closely whether it satisfies our constraints. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add some simple end-to-end tests for btrfs-convert.
On Fri, May 30, 2014 at 08:26:02PM +0200, Julien Muchembled wrote: Le 05/21/14 19:20, Adam Buchbinder a écrit : + # 256MB is the smallest acceptable btrfs image. + dd if=/dev/zero of=$here/test.img bs=1024 count=$((256*1024)) \ +convert-tests-results.txt 21 || _fail dd failed What about using a sparse file to speed up the test and be nicer with the underlying storage ? For example: truncate -s 256M $here/test.img Because this does not reset test.img with zeros, one may want to the rm -f test.img line at the beginning of the test function. Yes, good idea. Patches welcome. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: make pretty Documentation/ build match the rest
On Fri, May 30, 2014 at 11:27:14AM -0500, Eric Sandeen wrote: This is the most important patch ever. ;) Oh it is :) I found this to be less aesthetically pleasing than it was before: [CC] btrfstune.o Making all in Documentation ASCIIDOC btrfs-convert.xml [LD] btrfstune XMLTO btrfs-convert.8 [CC] btrfs-show-super.o GZIP btrfs-convert.8.gz [LD] btrfs-show-super ASCIIDOC btrfs-debug-tree.xml XMLTO btrfs-debug-tree.8 If folks don't like the 2-letter abbreviations, obviously the patch just isn't that important. I like the prettyfied output, but XT or AD look unfamiliar in this context. I've tried a few tweaks, is this change ok for you? [LD] btrfs-show-super [ASCII] btrfs-property.xml [XMLTO] btrfs-debug-tree.8 [XMLTO] btrfstune.8 [LD] btrfs-image Full ASCIIDOC would need to update all the two-letter strings, does not seem justified. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs-progs: Improve the parse_size() error message.
On Thu, May 29, 2014 at 09:42:11AM +0800, Qu Wenruo wrote: When using parse_size(), even non-numeric value is passed, it will only give error message ERROR: size value is empty, which is quite confusing for end users. This patch will introduce more meaningful error message for the following new cases 1) Invalid size string (non-numeric string) 2) Minus size value (like -1K) Also this patch will take full use of endptr returned by strtoll() to reduce unneeded loop. Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- utils.c | 56 +++- 1 file changed, 39 insertions(+), 17 deletions(-) diff --git a/utils.c b/utils.c index 392c5cf..499f08f 100644 --- a/utils.c +++ b/utils.c @@ -1612,18 +1612,45 @@ scan_again: u64 parse_size(char *s) { - int i; char c; + char *endptr; u64 mult = 1; + long long int signed_ret; + u64 ret; - for (i = 0; s s[i] isdigit(s[i]); i++) ; - if (!i) { - fprintf(stderr, ERROR: size value is empty\n); - exit(50); + if (!s) { + fprintf(stderr, ERROR: Size value is empty\n); + exit(1); We never pass a NULL pointer to parse_size so this check will be always false. Previously it verified that there are at least some digits. } - - if (s[i]) { - c = tolower(s[i]); + signed_ret = strtoll(s, endptr, 10); + if (endptr == s) { + fprintf(stderr, ERROR: Size value '%s' is invalid\n, s); + exit(1); + } + if (endptr[0] endptr[1]) { + fprintf(stderr, ERROR: Illegal suffix contains character '%c' in wrong position\n, + endptr[1]); + exit(1); + } + if (signed_ret = 0) { + fprintf(stderr, + ERROR: Size value '%s' is less equal than 0\n, s); + exit(1); + } + /* strtoll returns LLONG_MAX when overflow, if this happens, + * need to call strtoull to get the real size */ + if (errno == ERANGE signed_ret == LLONG_MAX) { + ret = strtoull(s, NULL, 10); Why do you parse the number twice? Negative sizes are currently not used so you can reject them. + if (errno == ERANGE ret == ULLONG_MAX) { + fprintf(stderr, + ERROR: Size value '%s' is too large for u64\n, + s); + exit(1); + } + } else + ret = signed_ret; + if (endptr[0]) { + c = tolower(endptr[0]); switch (c) { case 'e': mult *= 1024; @@ -1646,18 +1673,13 @@ u64 parse_size(char *s) case 'b': break; default: - fprintf(stderr, ERROR: Unknown size descriptor - '%c'\n, c); + fprintf(stderr, ERROR: Unknown size descriptor '%c'\n, c); exit(1); } } - if (s[i] s[i+1]) { - fprintf(stderr, ERROR: Illegal suffix contains - character '%c' in wrong position\n, - s[i+1]); - exit(51); - } - return strtoull(s, NULL, 10) * mult; + + ret *= mult; Although there was no overflow check before, I think it should be here. Eg. 12345678P is a valid size string but the result does not fit u64. + return ret; } -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2] btrfs-progs: add mount options to btrfs-mount.5
On Wed, Jun 11, 2014 at 05:14:55PM -0500, Eric Sandeen wrote: This is a straight cut and paste from the util-linux mount manpage into btrfs-mount.5 It's pretty much impossible for util-linux to keep up with every filesystem out there, and Karel has more than once expressed a wish that mount options move into fs-specific manpages. So, here we go. The way btrfs asciidoc is generated, there's not a trivial way to have both btrfs(5) and btrfs(8) so I named it btrfs-mount(5) for now. A bit ick and I'm open to suggestions. So what if the mount options are generated from btrfs-mount.txt but installed under btrfs.5.gz name? If there are more section 5 manpages we can make it more generic but for now hardcoding btrfs-mount.* - btrfs.5. sounds ok to me. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
R: Re: Slow startup of systemd-journal on BTRFS
Messaggio originale Da: li...@colorremedies.com Data: 12/06/2014 2.40 A: kreij...@inwind.it, Goffredo Baroncellikreij...@libero.it Cc: systemd Mailing Listsystemd-de...@lists.freedesktop.org, linux-btrfs linux-btrfs@vger.kernel.org Ogg: Re: Slow startup of systemd-journal on BTRFS On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli kreij...@libero.it wrote: If someone is able to suggest me how FRAGMENT the log file, I can try to collect more scientific data. So long as you're not using compression, filefrag will show you fragments of systemd-journald journals. I can vouch for the behavior you experience without xattr +C or autodefrag, but further it also causes much slowness when reading journal contents. LIke if I want to search all boots for a particular error message to see how far back it started, this takes quite a bit longer than on other file systems. So far I'm not experiencing this problem with autodefrag or any other negative side effects, but my understanding is this code is still in flux. Since the journals have their own checksumming I'm not overly concerned about setting xattr +C. This is true; but it can be a general solution: the checksum of the data are needed during a scrub and/or a RAID rebuilding. I want to investigate doing an explicit defrag once a week. Chris Murphy G.Baroncelli -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
R: Re: Slow startup of systemd-journal on BTRFS
Messaggio originale Da: russ...@coker.com.au Data: 12/06/2014 3.18 A: kreij...@inwind.it Cc: systemd Mailing Listsystemd-de...@lists.freedesktop.org, linux-btrfs linux-btrfs@vger.kernel.org Ogg: Re: Slow startup of systemd-journal on BTRFS On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote: https://bugzilla.redhat.com/show_bug.cgi?id=1006386 suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent. It seems that systemd-journal create a very hight fragmented files when it stores its log. And BTRFS it is know to behave slowly when a file is highly fragmented. This had caused a slow startup of systemd-journal, which in turn had blocked the services which depend by the loggin system. On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put SystemMaxUse=50M. That doesn't solve the fragmentation problem but reduces it enough that it doesn't bother me. IIRC my log files are about 80/100MB. So I am not sure if this could help. I want to investigate also the option MaxFileSec=1d which rotates the log file once a day (or a week) -- My Main Blog http://etbe.coker.com.au/ My Documents Bloghttp://doc.coker.com.au/ -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
R: Re: Slow startup of systemd-journal on BTRFS
Da: da...@fromorbit.com Data: 12/06/2014 3.21 A: kreij...@inwind.it Cc: systemd Mailing Listsystemd-de...@lists.freedesktop.org, linux-btrfs linux-btrfs@vger.kernel.org Ogg: Re: Slow startup of systemd-journal on BTRFS On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote: Hi all, I would like to share a my experience about a slowness of systemd when used on BTRFS. My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from systemd-analyze plot). I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why. However this link https://bugzilla.redhat.com/show_bug.cgi?id=1006386 suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent. systemd has a very stupid journal write pattern. It checks if there is space in the file for the write, and if not it fallocates the small amount of space it needs (it does *4 byte* fallocate calls!) and then does the write to it. All this does is fragment the crap out of the log files because the filesystems cannot optimise the allocation patterns. I checked the code, and to me it seems that the fallocate() are done in FILE_SIZE_INCREASE unit (actually 8MB). Yup, it fragments journal files on XFS, too. http://oss.sgi.com/archives/xfs/2014-03/msg00322.html IIRC, the systemd developers consider this a filesystem problem and so refused to change the systemd code to be nice to the filesystem allocators, even though they don't actually need to use fallocate... If I am able to start a decent setup I would like to play to change some parameters like: - remove fallocate at all (at the beginning only ?) - increase the fallocate allocation unit - change the file log size and rotation time - periodically defragment [...[ Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: R: Re: Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it posted on Thu, 12 Jun 2014 13:13:26 +0200 as excerpted: systemd has a very stupid journal write pattern. It checks if there is space in the file for the write, and if not it fallocates the small amount of space it needs (it does *4 byte* fallocate calls!) and then does the write to it. All this does is fragment the crap out of the log files because the filesystems cannot optimise the allocation patterns. I checked the code, and to me it seems that the fallocate() are done in FILE_SIZE_INCREASE unit (actually 8MB). FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think actually pretty much equally bad without NOCOW set on the file. Why? Because btrfs data blocks are 4 KiB. With COW, the effect for either 4 byte or 8 MiB file allocations is going to end up being the same, forcing (repeated until full) rewrite of each 4 KiB block into its own extent. Turning off the fallocate should allow btrfs to at least consolidate a bit, tho to the extent that multiple 4 KiB blocks cannot be written, repeated fsync will still cause issues. 80-100 MiB logs (size mentioned in another reply) should be reasonably well handled by btrfs autodefrag, however, if it's turned on. I'd be worried if sizes were 256 MiB and certainly as sizes approached a GiB, but it should handle 80-100 MiB just fine. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2] btrfs-progs: add mount options to btrfs-mount.5
On 6/12/14, 5:51 AM, David Sterba wrote: On Wed, Jun 11, 2014 at 05:14:55PM -0500, Eric Sandeen wrote: This is a straight cut and paste from the util-linux mount manpage into btrfs-mount.5 It's pretty much impossible for util-linux to keep up with every filesystem out there, and Karel has more than once expressed a wish that mount options move into fs-specific manpages. So, here we go. The way btrfs asciidoc is generated, there's not a trivial way to have both btrfs(5) and btrfs(8) so I named it btrfs-mount(5) for now. A bit ick and I'm open to suggestions. So what if the mount options are generated from btrfs-mount.txt but installed under btrfs.5.gz name? If there are more section 5 manpages we can make it more generic but for now hardcoding btrfs-mount.* - btrfs.5. sounds ok to me. Yeah, that seemed like kind of nasty hard-coding, but I suppose it works for now. I wanted to make it more generic, I didn't have a better idea.. -Eric -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Deadlock/high load
Hi all, I have a problem that triggers quite often on our production machines. I don't really know what's triggering this or how to reproduce it, but the machine enters in some sort of deadlock state, where it consumes all the i/o and the load average goes very high in seconds (it even gets to over 200), sometimes in about a minute or even less, the machine is unresponsive and we have to reset it. Rarely, the load just stays high (~25) for hours, but it never gets down again, but this happens rarely, as I said. In general, the machine is either already unresponsive or is about to become unresponsive. The last machine that encountered this has 40 cores and the btrfs filesystem is running over SSDs. We encountered this on a plain 3.14 kernel, and also on the latest 3.14.6 kernel + all the patches whose summary is marked btrfs: that made it in 3.15, straight forward backported (cherry-picked) to 3.14. Also, no suspicious (malicious) activity from the running processes either. I noticed there was another report on 3.13 which was solved by a 3.15rc patch, it doesn't seem to be the same thing. Since the only chance to obtain something was via a SysRq dump, here's what I could get from the last w trigger (tasks that are in uninterruptable (blocked) state), showing only tasks that are related to btrfs: btrfs-transacti D 000e 0 2483 2 0x00080008 881fd05975d8 81880a27 881fd05974e8 881fd05974f8 881fd0596010 881fd05975d8 00011bc0 881fd13398f0 00011bc0 00011bc0 881fd28ecad0 881fd13398f0 Call Trace: [81880a27] ? __schedule+0x687/0x72c [8163aaf0] ? do_release_stripe+0xeb/0x182 [8114a076] ? zone_statistics+0x77/0x7e [8163fed0] ? raid5_unplug+0xaa/0xb3 [813cb87e] ? blk_flush_plug_list+0x99/0x1f0 [8163c24d] ? get_active_stripe+0x65/0x5ca [810f8704] ? prepare_to_wait+0x71/0x7c [816431f9] ? make_request+0x7b0/0x999 [816429d4] ? release_stripe_plug+0x20/0x95 [810f8497] ? bit_waitqueue+0xb0/0xb0 [810f8497] ? bit_waitqueue+0xb0/0xb0 [8166375a] ? md_make_request+0xfa/0x215 [81324f22] ? __btrfs_map_block+0xd6f/0xd89 [813ca63c] ? generic_make_request+0x99/0xda [813ca770] ? submit_bio+0xf3/0xfe [813251de] ? submit_stripe_bio+0x77/0x82 [813255b6] ? btrfs_map_bio+0x3cd/0x440 [812fdc1d] ? csum_tree_block+0x1c1/0x1ec [812fdfa6] ? btree_submit_bio_hook+0x97/0xf0 [811b561e] ? __bio_add_page+0x153/0x1de [8131ab64] ? submit_one_bio+0x63/0x90 [8113c61b] ? account_page_writeback+0x28/0x2d [8131b504] ? submit_extent_page+0xe7/0x17e [81320796] ? btree_write_cache_pages+0x44c/0x71a [8131b272] ? extent_range_clear_dirty_for_io+0x5a/0x5a [812fc41a] ? btree_writepages+0x4a/0x53 [8113cf5f] ? do_writepages+0x1b/0x24 [81134f76] ? __filemap_fdatawrite_range+0x4e/0x50 [81135b55] ? filemap_fdatawrite_range+0xe/0x10 [813020c1] ? btrfs_write_marked_extents+0x83/0xd1 [8130216b] ? btrfs_write_and_wait_transaction+0x5c/0x8a [81302ee2] ? btrfs_commit_transaction+0x68b/0x87c [810cf0b7] ? del_timer+0x87/0x87 [812fef3f] ? transaction_kthread+0x114/0x1e9 [812fee2b] ? close_ctree+0x280/0x280 [810df1ff] ? kthread+0xc9/0xd1 [810df136] ? kthread_freezable_should_stop+0x5b/0x5b [818842cc] ? ret_from_fork+0x7c/0xb0 [810df136] ? kthread_freezable_should_stop+0x5b/0x5b rs:main Q:Reg D 0002 0 6857 4976 0x 883fc9b0bb08 0002 883fc9b0b9e8 883fc9b0ba28 883fc9b0a010 883fc9b0bb08 00011bc0 883fc794db70 00011bc0 00011bc0 881fd28e8000 883fc794db70 Call Trace: [81040a93] ? native_sched_clock+0x17/0xd3 [810406e7] ? sched_clock+0x9/0xd [810eb7c2] ? arch_vtime_task_switch+0x81/0x86 [810ebc88] ? vtime_common_task_switch+0x29/0x2d [8104072d] ? read_tsc+0x9/0x1b [81880c2a] schedule+0x6e/0x70 [81880cbf] io_schedule+0x93/0xd7 [81134170] ? __lock_page+0x68/0x68 [81134179] sleep_on_page+0x9/0xd [8188118f] __wait_on_bit+0x45/0x7a [8113444e] wait_on_page_bit+0x71/0x78 [810f84f3] ? wake_atomic_t_function+0x28/0x28 [81311056] prepare_pages+0xd2/0x11b [813143a5] __btrfs_buffered_write+0x214/0x482 [8110eb76] ? futex_wait+0x176/0x239 [810c9650] ? current_fs_time+0x22/0x29 [81314a2a] btrfs_file_aio_write+0x417/0x507 [81198eb3] ? path_openat+0x593/0x5cc [8118c275] do_sync_write+0x59/0x79 [8118d53e] vfs_write+0xd3/0x172 [8118d69e] SyS_write+0x4b/0x8f [81884505] tracesys+0xd0/0xd5 freshclam D 0002 0 8305 4976 0x 883fbc1b1b08
Re: [PATCH] Btrfs: fix qgroups sanity test crash or hang
On Thu, Jun 12, 2014 at 1:35 AM, Chris Mason c...@fb.com wrote: On 06/11/2014 08:12 PM, Filipe David Borba Manana wrote: Often when running the qgroups sanity test, a crash or a hang happened. This is because the extent buffer the test uses for the root node doesn't have an header level explicitly set, making it have a random level value. This is a problem when it's not zero for the btrfs_search_slot() calls the test ends up doing, resulting in crashes or hangs such as the following: Therefore initialize the extent buffer as an empty leaf (level 0). Issue easy to reproduce when btrfs is built as a module via: $ for ((i = 1; i = 100; i++)); do rmmod btrfs; modprobe btrfs; done Nice, thanks Filipe, I hadn't been able to trigger this yet. It's a bit rare for me too, happens about 1/100 in my vm. Every time it fails root-node's header level is a non-zero value, usually way bigger than BTRFS_MAX_LEVEL (8), causing all sorts of badness in btrfs_search_slot() due to out of bounds accesses to the arrays path-{locks | nodes | slots} -chris -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Use sparse files for filesystem conversion tests.
On my system, this brings the FS conversion test suite's runtime from over ten seconds down to under two. Thanks to Julien Muchembled for the suggestion. Signed-off-by: Adam Buchbinder abuchbin...@google.com --- tests/convert-tests.sh | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/tests/convert-tests.sh b/tests/convert-tests.sh index 87369c5..9f7a5c8 100644 --- a/tests/convert-tests.sh +++ b/tests/convert-tests.sh @@ -13,15 +13,16 @@ _fail() } rm -f convert-tests-results.txt -rm -f test.img test(){ echo [TEST]$1 -shift -echo creating ext image with: $* convert-tests-results.txt + shift + echo creating ext image with: $* convert-tests-results.txt # 256MB is the smallest acceptable btrfs image. - dd if=/dev/zero of=$here/test.img bs=1024 count=$((256*1024)) \ -convert-tests-results.txt 21 || _fail dd failed + rm -f $here/test.img convert-tests-results.txt 21 \ + || _fail could not remove test image file + truncate -s 256M $here/test.img convert-tests-results.txt 21 \ + || _fail could not create test image file $* -F $here/test.img convert-tests-results.txt 21 \ || _fail filesystem create failed $here/btrfs-convert $here/test.img convert-tests-results.txt 21 \ @@ -30,6 +31,7 @@ test(){ || _fail btrfsck detected errors } -test ext2, 4k blocksize mke2fs -b 4096 -test ext3, 4k blocksize mke2fs -j -b 4096 -test ext4, 4k blocksize mke2fs -t ext4 -b 4096 +# btrfs-convert requires 4k blocksize. +test ext2 mke2fs -b 4096 +test ext3 mke2fs -j -b 4096 +test ext4 mke2fs -t ext4 -b 4096 -- 2.0.0.526.g5318336 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] lib: add size unit t/p/e to memparse
On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng guihc.f...@cn.fujitsu.com wrote: For modern filesystems such as btrfs, t/p/e size level operations are common. add size unit t/p/e parsing to memparse Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- changelog v1-v2: replace kilobyte with kibibyte, and others v2-v3: add missing unit bytes in comment --- lib/cmdline.c | 25 - 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index eb67911..511b9be 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -119,11 +119,17 @@ char *get_options(const char *str, int nints, int *ints) * @retptr: (output) Optional pointer to next char after parse completes * * Parses a string into a number. The number stored at @ptr is - * potentially suffixed with %K (for kilobytes, or 1024 bytes), - * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or - * 1073741824). If the number is suffixed with K, M, or G, then - * the return value is the number multiplied by one kilobyte, one - * megabyte, or one gigabyte, respectively. + * potentially suffixed with + * %K (for kibibytes, or 1024 bytes), + * %M (for mebibytes, or 1048576 bytes), + * %G (for gibibytes, or 1073741824 bytes), + * %T (for tebibytes, or 1099511627776 bytes), + * %P (for pebibytes, or 1125899906842624 bytes), + * %E (for exbibytes, or 1152921504606846976 bytes). I'm afraid I find these names quite idiotic - we all know what the traditional terms mean so why go and muck with it. Also, kibibytes sounds like cat food. @@ -133,6 +139,15 @@ unsigned long long memparse(const char *ptr, char **retptr) unsigned long long ret = simple_strtoull(ptr, endptr, 0); switch (*endptr) { + case 'E': + case 'e': + ret = 10; + case 'P': + case 'p': + ret = 10; + case 'T': + case 't': + ret = 10; case 'G': case 'g': ret = 10; That bit makes sense. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs-progs: Improve the parse_size() error message.
Original Message Subject: Re: [PATCH v2] btrfs-progs: Improve the parse_size() error message. From: David Sterba dste...@suse.cz To: Qu Wenruo quwen...@cn.fujitsu.com Date: 2014年06月12日 18:15 On Thu, May 29, 2014 at 09:42:11AM +0800, Qu Wenruo wrote: When using parse_size(), even non-numeric value is passed, it will only give error message ERROR: size value is empty, which is quite confusing for end users. This patch will introduce more meaningful error message for the following new cases 1) Invalid size string (non-numeric string) 2) Minus size value (like -1K) Also this patch will take full use of endptr returned by strtoll() to reduce unneeded loop. Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- utils.c | 56 +++- 1 file changed, 39 insertions(+), 17 deletions(-) diff --git a/utils.c b/utils.c index 392c5cf..499f08f 100644 --- a/utils.c +++ b/utils.c @@ -1612,18 +1612,45 @@ scan_again: u64 parse_size(char *s) { - int i; char c; + char *endptr; u64 mult = 1; + long long int signed_ret; + u64 ret; - for (i = 0; s s[i] isdigit(s[i]); i++) ; - if (!i) { - fprintf(stderr, ERROR: size value is empty\n); - exit(50); + if (!s) { + fprintf(stderr, ERROR: Size value is empty\n); + exit(1); We never pass a NULL pointer to parse_size so this check will be always false. Previously it verified that there are at least some digits. Command like 'mkfs.btrfs -b /dev/sda' *WILL* pass NULL pointer to parse_size(), so the check is needed. } - - if (s[i]) { - c = tolower(s[i]); + signed_ret = strtoll(s, endptr, 10); + if (endptr == s) { + fprintf(stderr, ERROR: Size value '%s' is invalid\n, s); + exit(1); + } + if (endptr[0] endptr[1]) { + fprintf(stderr, ERROR: Illegal suffix contains character '%c' in wrong position\n, + endptr[1]); + exit(1); + } + if (signed_ret = 0) { + fprintf(stderr, + ERROR: Size value '%s' is less equal than 0\n, s); + exit(1); + } + /* strtoll returns LLONG_MAX when overflow, if this happens, +* need to call strtoull to get the real size */ + if (errno == ERANGE signed_ret == LLONG_MAX) { + ret = strtoull(s, NULL, 10); Why do you parse the number twice? Negative sizes are currently not used so you can reject them. I will change the patch to judgement leading '-' and reject the value. + if (errno == ERANGE ret == ULLONG_MAX) { + fprintf(stderr, + ERROR: Size value '%s' is too large for u64\n, + s); + exit(1); + } + } else + ret = signed_ret; + if (endptr[0]) { + c = tolower(endptr[0]); switch (c) { case 'e': mult *= 1024; @@ -1646,18 +1673,13 @@ u64 parse_size(char *s) case 'b': break; default: - fprintf(stderr, ERROR: Unknown size descriptor - '%c'\n, c); + fprintf(stderr, ERROR: Unknown size descriptor '%c'\n, c); exit(1); } } - if (s[i] s[i+1]) { - fprintf(stderr, ERROR: Illegal suffix contains - character '%c' in wrong position\n, - s[i+1]); - exit(51); - } - return strtoull(s, NULL, 10) * mult; + + ret *= mult; Although there was no overflow check before, I think it should be here. Eg. 12345678P is a valid size string but the result does not fit u64. Right, I will check the overflow here. Thanks, Qu + return ret; } -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] lib: add size unit t/p/e to memparse
On Thu, 2014-06-12 at 14:15 -0700, Andrew Morton wrote: On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng guihc.f...@cn.fujitsu.com wrote: For modern filesystems such as btrfs, t/p/e size level operations are common. add size unit t/p/e parsing to memparse Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- changelog v1-v2: replace kilobyte with kibibyte, and others v2-v3: add missing unit bytes in comment --- lib/cmdline.c | 25 - 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index eb67911..511b9be 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -119,11 +119,17 @@ char *get_options(const char *str, int nints, int *ints) * @retptr: (output) Optional pointer to next char after parse completes * * Parses a string into a number. The number stored at @ptr is - * potentially suffixed with %K (for kilobytes, or 1024 bytes), - * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or - * 1073741824). If the number is suffixed with K, M, or G, then - * the return value is the number multiplied by one kilobyte, one - * megabyte, or one gigabyte, respectively. + * potentially suffixed with + * %K (for kibibytes, or 1024 bytes), + * %M (for mebibytes, or 1048576 bytes), + * %G (for gibibytes, or 1073741824 bytes), + * %T (for tebibytes, or 1099511627776 bytes), + * %P (for pebibytes, or 1125899906842624 bytes), + * %E (for exbibytes, or 1152921504606846976 bytes). I'm afraid I find these names quite idiotic - we all know what the traditional terms mean so why go and muck with it. Also, kibibytes sounds like cat food. Yes, I will cleanup this part, Thanks very much. -Gui @@ -133,6 +139,15 @@ unsigned long long memparse(const char *ptr, char **retptr) unsigned long long ret = simple_strtoull(ptr, endptr, 0); switch (*endptr) { + case 'E': + case 'e': + ret = 10; + case 'P': + case 'p': + ret = 10; + case 'T': + case 't': + ret = 10; case 'G': case 'g': ret = 10; That bit makes sense. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] btrfs-progs: Improve the parse_size() error message.
When using parse_size(), even non-numeric value is passed, it will only give error message ERROR: size value is empty, which is quite confusing for end users. This patch will introduce more meaningful error message for the following new cases 1) Invalid size string (non-numeric string) 2) Minus size value (like -1K) Also this patch will take full use of endptr returned by strtoll() to reduce unneeded loop. Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- changlog: v2: Remove uneeded return value Avoid abuse of goto v3: Don't reparse size twice Better u64 overflow check --- utils.c | 66 + 1 file changed, 50 insertions(+), 16 deletions(-) diff --git a/utils.c b/utils.c index 392c5cf..eab3a1b 100644 --- a/utils.c +++ b/utils.c @@ -1610,20 +1610,53 @@ scan_again: return 0; } -u64 parse_size(char *s) +/* A not-so-good version fls64. No fascinating optimization since + * no one except parse_size use it */ +static int fls64(u64 x) { int i; + + for (i = 0; i 64; i++) + if (x i (1UL 63)) + return 64 - i; + return 64 - i; +} + +u64 parse_size(char *s) +{ char c; + char *endptr; u64 mult = 1; + u64 ret; - for (i = 0; s s[i] isdigit(s[i]); i++) ; - if (!i) { - fprintf(stderr, ERROR: size value is empty\n); - exit(50); + if (!s) { + fprintf(stderr, ERROR: Size value is empty\n); + exit(1); } - - if (s[i]) { - c = tolower(s[i]); + if (s[0] == '-') { + fprintf(stderr, + ERROR: Size value '%s' is less equal than 0\n, s); + exit(1); + } + ret = strtoull(s, endptr, 10); + if (endptr == s) { + fprintf(stderr, ERROR: Size value '%s' is invalid\n, s); + exit(1); + } + if (endptr[0] endptr[1]) { + fprintf(stderr, ERROR: Illegal suffix contains character '%c' in wrong position\n, + endptr[1]); + exit(1); + } + /* strtoll returns LLONG_MAX when overflow, if this happens, +* need to call strtoull to get the real size */ + if (errno == ERANGE ret == ULLONG_MAX) { + fprintf(stderr, + ERROR: Size value '%s' is too large for u64\n, s); + exit(1); + } + if (endptr[0]) { + c = tolower(endptr[0]); switch (c) { case 'e': mult *= 1024; @@ -1646,18 +1679,19 @@ u64 parse_size(char *s) case 'b': break; default: - fprintf(stderr, ERROR: Unknown size descriptor - '%c'\n, c); + fprintf(stderr, ERROR: Unknown size descriptor '%c'\n, + c); exit(1); } } - if (s[i] s[i+1]) { - fprintf(stderr, ERROR: Illegal suffix contains - character '%c' in wrong position\n, - s[i+1]); - exit(51); + /* Check whether ret * mult overflow */ + if (fls64(ret) + fls64(mult) - 1 64) { + fprintf(stderr, + ERROR: Size value '%s' is too large for u64\n, s); + exit(1); } - return strtoull(s, NULL, 10) * mult; + ret *= mult; + return ret; } int open_file_or_dir3(const char *fname, DIR **dirstream, int open_flags) -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Deadlock/high load
On Thu, Jun 12, 2014 at 04:15:27PM +0100, Alin Dobre wrote: Hi all, I have a problem that triggers quite often on our production machines. I don't really know what's triggering this or how to reproduce it, but the machine enters in some sort of deadlock state, where it consumes all the i/o and the load average goes very high in seconds (it even gets to over 200), sometimes in about a minute or even less, the machine is unresponsive and we have to reset it. Rarely, the load just stays high (~25) for hours, but it never gets down again, but this happens rarely, as I said. In general, the machine is either already unresponsive or is about to become unresponsive. The last machine that encountered this has 40 cores and the btrfs filesystem is running over SSDs. We encountered this on a plain 3.14 kernel, and also on the latest 3.14.6 kernel + all the patches whose summary is marked btrfs: that made it in 3.15, straight forward backported (cherry-picked) to 3.14. Also, no suspicious (malicious) activity from the running processes either. I noticed there was another report on 3.13 which was solved by a 3.15rc patch, it doesn't seem to be the same thing. The output of 'btrfs filesystem df' is appreciate, it can help determine if the FS has entered into 'almost full' situation, and that may cause a bug that pages are not marked with writeback tag and lead to processes's endless waiting. -liubo Since the only chance to obtain something was via a SysRq dump, here's what I could get from the last w trigger (tasks that are in uninterruptable (blocked) state), showing only tasks that are related to btrfs: btrfs-transacti D 000e 0 2483 2 0x00080008 881fd05975d8 81880a27 881fd05974e8 881fd05974f8 881fd0596010 881fd05975d8 00011bc0 881fd13398f0 00011bc0 00011bc0 881fd28ecad0 881fd13398f0 Call Trace: [81880a27] ? __schedule+0x687/0x72c [8163aaf0] ? do_release_stripe+0xeb/0x182 [8114a076] ? zone_statistics+0x77/0x7e [8163fed0] ? raid5_unplug+0xaa/0xb3 [813cb87e] ? blk_flush_plug_list+0x99/0x1f0 [8163c24d] ? get_active_stripe+0x65/0x5ca [810f8704] ? prepare_to_wait+0x71/0x7c [816431f9] ? make_request+0x7b0/0x999 [816429d4] ? release_stripe_plug+0x20/0x95 [810f8497] ? bit_waitqueue+0xb0/0xb0 [810f8497] ? bit_waitqueue+0xb0/0xb0 [8166375a] ? md_make_request+0xfa/0x215 [81324f22] ? __btrfs_map_block+0xd6f/0xd89 [813ca63c] ? generic_make_request+0x99/0xda [813ca770] ? submit_bio+0xf3/0xfe [813251de] ? submit_stripe_bio+0x77/0x82 [813255b6] ? btrfs_map_bio+0x3cd/0x440 [812fdc1d] ? csum_tree_block+0x1c1/0x1ec [812fdfa6] ? btree_submit_bio_hook+0x97/0xf0 [811b561e] ? __bio_add_page+0x153/0x1de [8131ab64] ? submit_one_bio+0x63/0x90 [8113c61b] ? account_page_writeback+0x28/0x2d [8131b504] ? submit_extent_page+0xe7/0x17e [81320796] ? btree_write_cache_pages+0x44c/0x71a [8131b272] ? extent_range_clear_dirty_for_io+0x5a/0x5a [812fc41a] ? btree_writepages+0x4a/0x53 [8113cf5f] ? do_writepages+0x1b/0x24 [81134f76] ? __filemap_fdatawrite_range+0x4e/0x50 [81135b55] ? filemap_fdatawrite_range+0xe/0x10 [813020c1] ? btrfs_write_marked_extents+0x83/0xd1 [8130216b] ? btrfs_write_and_wait_transaction+0x5c/0x8a [81302ee2] ? btrfs_commit_transaction+0x68b/0x87c [810cf0b7] ? del_timer+0x87/0x87 [812fef3f] ? transaction_kthread+0x114/0x1e9 [812fee2b] ? close_ctree+0x280/0x280 [810df1ff] ? kthread+0xc9/0xd1 [810df136] ? kthread_freezable_should_stop+0x5b/0x5b [818842cc] ? ret_from_fork+0x7c/0xb0 [810df136] ? kthread_freezable_should_stop+0x5b/0x5b rs:main Q:Reg D 0002 0 6857 4976 0x 883fc9b0bb08 0002 883fc9b0b9e8 883fc9b0ba28 883fc9b0a010 883fc9b0bb08 00011bc0 883fc794db70 00011bc0 00011bc0 881fd28e8000 883fc794db70 Call Trace: [81040a93] ? native_sched_clock+0x17/0xd3 [810406e7] ? sched_clock+0x9/0xd [810eb7c2] ? arch_vtime_task_switch+0x81/0x86 [810ebc88] ? vtime_common_task_switch+0x29/0x2d [8104072d] ? read_tsc+0x9/0x1b [81880c2a] schedule+0x6e/0x70 [81880cbf] io_schedule+0x93/0xd7 [81134170] ? __lock_page+0x68/0x68 [81134179] sleep_on_page+0x9/0xd [8188118f] __wait_on_bit+0x45/0x7a [8113444e] wait_on_page_bit+0x71/0x78 [810f84f3] ? wake_atomic_t_function+0x28/0x28 [81311056] prepare_pages+0xd2/0x11b [813143a5] __btrfs_buffered_write+0x214/0x482 [8110eb76] ? futex_wait+0x176/0x239
[PATCH 1/2 v3 RESEND] Btrfs: device_list_add() should not update list when mounted
From: Anand Jain anand.j...@oracle.com device_list_add() is called when user runs btrfs dev scan, which would add any btrfs device into the btrfs_fs_devices list. Now think of a mounted btrfs. And a new device which contains the a SB from the mounted btrfs devices. In this situation when user runs btrfs dev scan, the current code would just replace existing device with the new device. Which is to note that old device is neither closed nor gracefully removed from the btrfs. The FS is still operational with the old bdev however the device name is the btrfs_device is new which is provided by the btrfs dev scan. reproducer: devmgt[1] detach /dev/sdc replace the missing disk /dev/sdc btrfs rep start -f 1 /dev/sde /btrfs Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120 Total devices 2 FS bytes used 32.00KiB devid1 size 958.94MiB used 115.88MiB path /dev/sde devid2 size 958.94MiB used 103.88MiB path /dev/sdd make /dev/sdc to reappear devmgt attach host2 btrfs dev scan btrfs fi show -m Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M Total devices 2 FS bytes used 32.00KiB^M devid1 size 958.94MiB used 115.88MiB path /dev/sdc - Wrong. devid2 size 958.94MiB used 103.88MiB path /dev/sdd since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be part of the btrfs-fsid when it reappears. If user want it to be part of it then sys admin should be using btrfs device add instead. [1] github.com/anajain/devmgt.git Signed-off-by: Anand Jain anand.j...@oracle.com --- v2-v3: simpler commit and comment message v1-v2: remove '---' in commit messages which conflict with git am fs/btrfs/volumes.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index bb2cd66..56822f0 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -472,6 +472,7 @@ static noinline int device_list_add(const char *path, device = __find_device(fs_devices-devices, devid, disk_super-dev_item.uuid); } + if (!device) { if (fs_devices-opened) return -EBUSY; @@ -497,6 +498,32 @@ static noinline int device_list_add(const char *path, device-fs_devices = fs_devices; } else if (!device-name || strcmp(device-name-str, path)) { + /* +* When FS is already mounted. +* 1. If you are here and if the device-name is NULL that means +*this device was missing at time of FS mount. +* 2. If you are here and if the device-name is different from 'path' +*that means either +* a. The same device disappeared and reappeared with different +* name. or +* b. The missing-disk-which-was-replaced, has reappeared now. +* +* We must allow 1 and 2a above. But 2b would be a spurious and +* unintentional. +* Further in case of 1 and 2a above, the disk at 'path' would have +* missed some transaction when it was away and in case of 2a +* the stale bdev has to be updated as well. +* 2b must not be allowed at all time. +*/ + + /* +* As of now don't allow update to btrfs_fs_device through the +* btrfs dev scan cli, after FS has been mounted. +*/ + + if (fs_devices-opened) + return -EBUSY; + name = rcu_string_strdup(path, GFP_NOFS); if (!name) return -ENOMEM; -- 1.8.5.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs: check generation as replace duplicates devid+uuid
From: Anand Jain anand.j...@oracle.com When FS in unmounted we need to check generation number as well since devid+uuid combination could match with the missing replaced disk when it reappears, and without this patch it might pair with the replaced disk again. device_list_add() function is called in the following threads, mount device option mount argument ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan) ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready dev) they have been unit tested to work fine with this patch. If the user knows what he is doing and really want to pair with replaced disk (which is not a standard operation), then he should first clear the kernel btrfs device list in the memory by doing the module unload/load and followed with the mount -o device option. Signed-off-by: Anand Jain anand.j...@oracle.com --- fs/btrfs/volumes.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 56822f0..bb1b4bd 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -523,6 +523,16 @@ static noinline int device_list_add(const char *path, if (fs_devices-opened) return -EBUSY; + else { + /* +* That is if the FS is _not_ mounted and if you are here, that +* means there is more than one disk with same uuid and devid. +* We keep the one with larger generation number or the last-in +* if generation are equal. +*/ + if (found_transid device-generation) + return -EINVAL; + } name = rcu_string_strdup(path, GFP_NOFS); if (!name) @@ -535,6 +545,15 @@ static noinline int device_list_add(const char *path, } } + /* +* Unmount does not free the btrfs_device struct but would zero +* generation along with most of the other members. So just update +* it back. We need it to pick the disk with largest generation +* (as above). +*/ + if (!fs_devices-opened) + device-generation = found_transid; + if (found_transid fs_devices-latest_trans) { fs_devices-latest_devid = devid; fs_devices-latest_trans = found_transid; -- 1.8.5.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: When devices with same dev uuid occurs, only add the one with largest generation.
Qu, Kindly note, device-generation is not updated when user scan the device for the first time. I have sent out patch: [PATCH 2/2] btrfs: check generation as replace duplicates devid+uuid Please feel free to add your Signed-off-by, if you think so. review comments are appreciated as usual. Thanks, Anand On 10/06/2014 16:53, Qu Wenruo wrote: Since btrfs currently use dulicated dev uuid when doing device replace, the following problem will happen: 1) mount with device A missing using degraded mode. 2) replace device A with device B. 3) device A reappears. 4) umount btrfs fs. 5) mount btrfs fs. After step 5), btrfs will still use device A even device B has a larger generation. The patch will judge generation when difference device with same dev uuid. And the patch should be applied after Anand's patch: https://patchwork.kernel.org/patch/4309651/ Cc: Anand Jain anand.j...@oracle.com Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- fs/btrfs/volumes.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 0a5017a..07f0cf7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -508,7 +508,8 @@ static noinline int device_list_add(const char *path, ret = 1; device-fs_devices = fs_devices; - } else if (!device-name || strcmp(device-name-str, path)) { + } else if (!device-name || (strcmp(device-name-str, path) + found_transid device-generation)) { /* * When FS is already mounted. * 1. If you are here and if the device-name is NULL that means -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html