[PATCH 1/1] btrfs: looping 'mkfs.btrfs -f dev' may fail with EBUSY

2014-06-12 Thread Anand Jain
The thread holding the O_EXCL flag seems to be BTRFS_IOC_SCAN_DEV ioctl,
which in turn calls btrfs_scan_one_device() to open dev with the O_EXCL flag.

But btrfs_scan_one_device() does not write anything to the disk.
and it is called by
  . An intermediary step (not the final open_ctree) in the
mount thread to read the SB and
  . btrfs-control ioctls viz BTRFS_IOC_SCAN_DEV and
BTRFS_IOC_DEVICES_READY
they don't need the O_EXCL.

test script: (run this in a loop)
static int test_skip_this_disk(char *path)
{
int fd;
char c;

printf(%s , path);
fd = open(path, O_RDWR|O_EXCL);
if (fd  0) {
printf(Open failed\n);
return 1;
}
/*fflush(stdout);
printf(Open Fine press enter\n);
scanf(%c, c);*/
close(fd);
return 0;
}

main(int arg, char **argv)
{
int i;

if (arg == 1) {
printf(usage: %s dev-with-btrfs-sb .. \n, argv[0]);
exit(1);
}

for (i = 1; i  arg; i++)
test_skip_this_disk(argv[i]);
}

dump stack after the userland close(fd)

dump_stack+0x9/0x60
btrfs_scan_one_device+0x18d/0x1f0 [btrfs]
btrfs_control_ioctl+0xb9/0x210 [btrfs]
do_vfs_ioctl+0x84/0x4c0
inode_has_perm+0x28/0x30
file_has_perm+0x8a/0xa0
SyS_ioctl+0x91/0xa0
system_call_fastpath+0x16/0x1b

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 fs/btrfs/volumes.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 636faa0..c186b5e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -914,7 +914,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, 
void *holder,
 * later supers, using BTRFS_SUPER_MIRROR_MAX instead
 */
bytenr = btrfs_sb_offset(0);
-   flags |= FMODE_EXCL;
mutex_lock(uuid_mutex);
 
bdev = blkdev_get_by_path(path, flags, holder);
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs-progs: fix race condition between btrfs and udev

2014-06-12 Thread Anand Jain


Wang,


Futher, a easy way to trigger this problem is by running the following
c codes repeatedly:

  int main(int argc, char **argv)
  {
/* pass a btrfs block device */
int fd = open(argv[1], O_RDWR | O_EXCL);
if (fd  0) {
perror(fail to open: %s, strerror(errno));
exit(1);
}
close(fd);
return 0;
  }

So the problem is RW opening would trigger udev event which will
call btrfs_scan_one_device().




In btrfs_scan_one_device(), it
would open the block device with EXCL flag..meanwhile if another
program try to open that device with O_EXCL, it would fail with
EBUSY.


 Expected. But do we need O_EXCL in kernel:btrfs_scan_one_device()
 at all ?


This happen seldomly in the real world, but if we use loop device
for test, we may hit this annoying problem.



A walkaround way to solve this problem is to wait kernel scanning
finished and then try it again.


 I agree this happens seldom in production. But I don't agree on
 something to fix as workaround.

 Just sent out the patch:
 btrfs: looping mkfs.btrfs -f dev may fail with EBUSY

 which I believe can be a final fix. Your review / tests appreciated.


Thanks, Anand

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] btrfs-progs: Add human readable flags output for chunk/block group type.

2014-06-12 Thread Qu Wenruo
Current btrfs-debug-tree output chunk/block group type as numbers,
which makes it hard to understand and need to check the source to
understand the meaning.

This patch will convert numberic type output to human readable strings.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
changlog:
v2: allow mixed chunk.
---
 print-tree.c | 59 ++-
 1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index 7f831ad..6cf59ca 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -160,15 +160,60 @@ static int print_inode_ref_item(struct extent_buffer *eb, 
struct btrfs_item *ite
return 0;
 }
 
+/* Caller should ensure sizeof(*ret)=21 DATA|METADATA|RAID10 */
+static void bg_flags_to_str(u64 flags, char *ret)
+{
+   int empty = 1;
+
+   if (flags  BTRFS_BLOCK_GROUP_DATA) {
+   empty = 0;
+   strcpy(ret, DATA);
+   }
+   if (flags  BTRFS_BLOCK_GROUP_METADATA) {
+   if (!empty)
+   strcat(ret, |);
+   strcat(ret, METADATA);
+   }
+   if (flags  BTRFS_BLOCK_GROUP_SYSTEM) {
+   if (!empty)
+   strcat(ret, |);
+   strcat(ret, SYSTEM);
+   }
+   switch (flags  BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+   case BTRFS_BLOCK_GROUP_RAID0:
+   strcat(ret, |RAID0);
+   break;
+   case BTRFS_BLOCK_GROUP_RAID1:
+   strcat(ret, |RAID1);
+   break;
+   case BTRFS_BLOCK_GROUP_DUP:
+   strcat(ret, |DUP);
+   break;
+   case BTRFS_BLOCK_GROUP_RAID10:
+   strcat(ret, |RAID10);
+   break;
+   case BTRFS_BLOCK_GROUP_RAID5:
+   strcat(ret, |RAID5);
+   break;
+   case BTRFS_BLOCK_GROUP_RAID6:
+   strcat(ret, |RAID6);
+   break;
+   default:
+   break;
+   }
+}
+
 static void print_chunk(struct extent_buffer *eb, struct btrfs_chunk *chunk)
 {
int num_stripes = btrfs_chunk_num_stripes(eb, chunk);
int i;
-   printf(\t\tchunk length %llu owner %llu type %llu num_stripes %d\n,
+   char chunk_flags_str[32] = {0};
+
+   bg_flags_to_str(btrfs_chunk_type(eb, chunk), chunk_flags_str);
+   printf(\t\tchunk length %llu owner %llu type %s num_stripes %d\n,
   (unsigned long long)btrfs_chunk_length(eb, chunk),
   (unsigned long long)btrfs_chunk_owner(eb, chunk),
-  (unsigned long long)btrfs_chunk_type(eb, chunk),
-  num_stripes);
+  chunk_flags_str, num_stripes);
for (i = 0 ; i  num_stripes ; i++) {
printf(\t\t\tstripe %d devid %llu offset %llu\n, i,
  (unsigned long long)btrfs_stripe_devid_nr(eb, chunk, i),
@@ -744,6 +789,7 @@ void btrfs_print_leaf(struct btrfs_root *root, struct 
extent_buffer *l)
u32 nr = btrfs_header_nritems(l);
u64 objectid;
u32 type;
+   char bg_flags_str[32];
 
printf(leaf %llu items %d free space %d generation %llu owner %llu\n,
(unsigned long long)btrfs_header_bytenr(l), nr,
@@ -858,10 +904,13 @@ void btrfs_print_leaf(struct btrfs_root *root, struct 
extent_buffer *l)
struct btrfs_block_group_item);
read_extent_buffer(l, bg_item, (unsigned long)bi,
   sizeof(bg_item));
-   printf(\t\tblock group used %llu chunk_objectid %llu 
flags %llu\n,
+   memset(bg_flags_str, 0, sizeof(bg_flags_str));
+   bg_flags_to_str(btrfs_block_group_flags(bg_item),
+   bg_flags_str);
+   printf(\t\tblock group used %llu chunk_objectid %llu 
flags %s\n,
   (unsigned long 
long)btrfs_block_group_used(bg_item),
   (unsigned long 
long)btrfs_block_group_chunk_objectid(bg_item),
-  (unsigned long 
long)btrfs_block_group_flags(bg_item));
+  bg_flags_str);
break;
case BTRFS_CHUNK_ITEM_KEY:
print_chunk(l, btrfs_item_ptr(l, i, struct 
btrfs_chunk));
-- 
2.0.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] btrfs-progs: Add human readable flags output string for extent flags.

2014-06-12 Thread Qu Wenruo
Current btrfs-debug-tree outputs extent flags as numbers,
which makes it hard to understand and need to check the source to
understand the meaning.

This patch will convert numberic flags output to human readable strings.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
changlog:
v2: none
---
 print-tree.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index 7263b09..7f831ad 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -242,6 +242,25 @@ static void print_file_extent_item(struct extent_buffer 
*eb,
   btrfs_file_extent_compression(eb, fi));
 }
 
+/* Caller should ensure sizeof(*ret) = 16(DATA|TREE_BLOCK) */
+static void extent_flags_to_str(u64 flags, char *ret)
+{
+   int empty = 1;
+
+   if (flags  BTRFS_EXTENT_FLAG_DATA) {
+   empty = 0;
+   strcpy(ret, DATA);
+   }
+   if (flags  BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+   if (!empty) {
+   empty = 0;
+   strcat(ret, |);
+   }
+   strcat(ret, TREE_BLOCK);
+   }
+   return;
+}
+
 static void print_extent_item(struct extent_buffer *eb, int slot, int metadata)
 {
struct btrfs_extent_item *ei;
@@ -255,6 +274,7 @@ static void print_extent_item(struct extent_buffer *eb, int 
slot, int metadata)
u32 item_size = btrfs_item_size_nr(eb, slot);
u64 flags;
u64 offset;
+   char flags_str[32] = {0};
 
if (item_size  sizeof(*ei)) {
 #ifdef BTRFS_COMPAT_EXTENT_TREE_V0
@@ -271,11 +291,12 @@ static void print_extent_item(struct extent_buffer *eb, 
int slot, int metadata)
 
ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
flags = btrfs_extent_flags(eb, ei);
+   extent_flags_to_str(flags, flags_str);
 
-   printf(\t\textent refs %llu gen %llu flags %llu\n,
+   printf(\t\textent refs %llu gen %llu flags %s\n,
   (unsigned long long)btrfs_extent_refs(eb, ei),
   (unsigned long long)btrfs_extent_generation(eb, ei),
-  (unsigned long long)flags);
+  flags_str);
 
if (flags  BTRFS_EXTENT_FLAG_TREE_BLOCK  !metadata) {
struct btrfs_tree_block_info *info;
-- 
2.0.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/3] btrfs-progs: fix missing parity stripe for raid6 in chunk-recover

2014-06-12 Thread Gui Hecheng
When deal with the p  q stripes for data profile raid6, chunk-recover
forgets to insert them into the chunk record. Just insert them back
freely.
Also wrap the insert procedure into a new function, fill_chunk_up.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
changelog:
v1-v2: cleanup changelog, 'inert' changed to 'insert'
---
 chunk-recover.c | 30 --
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/chunk-recover.c b/chunk-recover.c
index dfa7ff6..9b46b0b 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1785,6 +1785,23 @@ static inline int count_devext_records(struct list_head 
*record_list)
return num_of_records;
 }
 
+static int fill_chunk_up(struct chunk_record *chunk, struct list_head *devexts,
+struct recover_control *rc)
+{
+   int ret = 0;
+   int i;
+
+   for (i = 0; i  chunk-num_stripes; i++) {
+   if (!chunk-stripes[i].devid) {
+   ret = insert_stripe(devexts, rc, chunk, i);
+   if (ret)
+   break;
+   }
+   }
+
+   return ret;
+}
+
 #define EQUAL_STRIPE (1  0)
 
 static int rebuild_raid_data_chunk_stripes(struct recover_control *rc,
@@ -1919,9 +1936,9 @@ next_csum:
num_unordered = count_devext_records(unordered);
if (chunk-type_flags  BTRFS_BLOCK_GROUP_RAID6
 num_unordered == 2) {
-   list_splice_init(unordered, chunk-dextents);
btrfs_release_path(path);
-   return 0;
+   ret = fill_chunk_up(chunk, unordered, rc);
+   return ret;
}
 
goto next_stripe;
@@ -1966,14 +1983,7 @@ out:
 BTRFS_BLOCK_GROUP_RAID5)
 || (num_unordered == 3  chunk-type_flags
 BTRFS_BLOCK_GROUP_RAID6)) {
-   for (i = 0; i  chunk-num_stripes; i++) {
-   if (!chunk-stripes[i].devid) {
-   ret = insert_stripe(unordered, rc,
-   chunk, i);
-   if (ret)
-   break;
-   }
-   }
+   ret = fill_chunk_up(chunk, unordered, rc);
}
}
 fail_out:
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Add support for LZ4 compression

2014-06-12 Thread David Sterba
On Wed, Jun 04, 2014 at 10:00:06AM -0400, Chris Mason wrote:
 I have a slightly different reason for holding off on these.  Disk
 format changes are forever, and we need a really strong use case for
 pulling them in.

The format upgrade is inevitable for full bidirectional interoperability
of filesystems with non-pagesized sectorsize and compression. At the
moment this is not possible even without compression, but patches are on
the way.

 With that said, thanks for spending all of the time on this.  Pulling in
 Dave's idea to stream larger compression blocks through lzo (or any new
 alg) might be enough to push performance much higher, and better show
 case the differences between new algorithms.

The space savings and speed gains can be measured outside of btrfs.
From the past numbers I see that 4k-64k chunk brings another 5-10% of
ratio and the de/compression speed is not worse.

Bigger chunks do not improve that much, but the overhead for assembling
the linear mappings would be decreased.

 The whole reason I chose zlib originally was because its streaming
 interface was a better fit for how FS IO worked.

Right, zlib has the streaming interface and accepts randomly scattered
blocks, but the others do not. LZ4 has a streaming extension proposed,
but I haven't looked at it closely whether it satisfies our constraints.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add some simple end-to-end tests for btrfs-convert.

2014-06-12 Thread David Sterba
On Fri, May 30, 2014 at 08:26:02PM +0200, Julien Muchembled wrote:
 Le 05/21/14 19:20, Adam Buchbinder a écrit :
  +   # 256MB is the smallest acceptable btrfs image.
  +   dd if=/dev/zero of=$here/test.img bs=1024 count=$((256*1024)) \
  +convert-tests-results.txt 21 || _fail dd failed
 
 What about using a sparse file to speed up the test and be nicer with the 
 underlying storage ?
 For example:
   truncate -s 256M $here/test.img
 
 Because this does not reset test.img with zeros, one may want to the
   rm -f test.img
 line at the beginning of the test function.

Yes, good idea. Patches welcome. Thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: make pretty Documentation/ build match the rest

2014-06-12 Thread David Sterba
On Fri, May 30, 2014 at 11:27:14AM -0500, Eric Sandeen wrote:
 This is the most important patch ever.  ;)

Oh it is :)

 I found this to be less aesthetically pleasing than it was before:
 
  [CC] btrfstune.o
  Making all in Documentation
  ASCIIDOC btrfs-convert.xml
  [LD] btrfstune
  XMLTO btrfs-convert.8
  [CC] btrfs-show-super.o
  GZIP btrfs-convert.8.gz
  [LD] btrfs-show-super
  ASCIIDOC btrfs-debug-tree.xml
  XMLTO btrfs-debug-tree.8
 
 If folks don't like the 2-letter abbreviations, obviously the
 patch just isn't that important.

I like the prettyfied output, but XT or AD look unfamiliar in this context.
I've tried a few tweaks, is this change ok for you?

[LD] btrfs-show-super
[ASCII]  btrfs-property.xml
[XMLTO]  btrfs-debug-tree.8
[XMLTO]  btrfstune.8
[LD] btrfs-image

Full ASCIIDOC would need to update all the two-letter strings, does not
seem justified.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: Improve the parse_size() error message.

2014-06-12 Thread David Sterba
On Thu, May 29, 2014 at 09:42:11AM +0800, Qu Wenruo wrote:
 When using parse_size(), even non-numeric value is passed, it will only
 give error message ERROR: size value is empty, which is quite
 confusing for end users.
 
 This patch will introduce more meaningful error message for the
 following new cases
 1) Invalid size string (non-numeric string)
 2) Minus size value (like -1K)
 
 Also this patch will take full use of endptr returned by strtoll() to
 reduce unneeded loop.
 
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
  utils.c | 56 +++-
  1 file changed, 39 insertions(+), 17 deletions(-)
 
 diff --git a/utils.c b/utils.c
 index 392c5cf..499f08f 100644
 --- a/utils.c
 +++ b/utils.c
 @@ -1612,18 +1612,45 @@ scan_again:
  
  u64 parse_size(char *s)
  {
 - int i;
   char c;
 + char *endptr;
   u64 mult = 1;
 + long long int signed_ret;
 + u64 ret;
  
 - for (i = 0; s  s[i]  isdigit(s[i]); i++) ;
 - if (!i) {
 - fprintf(stderr, ERROR: size value is empty\n);
 - exit(50);
 + if (!s) {
 + fprintf(stderr, ERROR: Size value is empty\n);
 + exit(1);

We never pass a NULL pointer to parse_size so this check will be always
false. Previously it verified that there are at least some digits.

   }
 -
 - if (s[i]) {
 - c = tolower(s[i]);
 + signed_ret = strtoll(s, endptr, 10);
 + if (endptr == s) {
 + fprintf(stderr, ERROR: Size value '%s' is invalid\n, s);
 + exit(1);
 + }
 + if (endptr[0]  endptr[1]) {
 + fprintf(stderr, ERROR: Illegal suffix contains character '%c' 
 in wrong position\n,
 + endptr[1]);
 + exit(1);
 + }
 + if (signed_ret = 0) {
 + fprintf(stderr,
 + ERROR: Size value '%s' is less equal than 0\n, s);
 + exit(1);
 + }
 + /* strtoll returns LLONG_MAX when overflow, if this happens,
 +  * need to call strtoull to get the real size */
 + if (errno == ERANGE  signed_ret == LLONG_MAX) {
 + ret = strtoull(s, NULL, 10);

Why do you parse the number twice? Negative sizes are currently not used
so you can reject them.

 + if (errno == ERANGE  ret == ULLONG_MAX) {
 + fprintf(stderr,
 + ERROR: Size value '%s' is too large for u64\n,
 + s);
 + exit(1);
 + }
 + } else
 + ret = signed_ret;
 + if (endptr[0]) {
 + c = tolower(endptr[0]);
   switch (c) {
   case 'e':
   mult *= 1024;
 @@ -1646,18 +1673,13 @@ u64 parse_size(char *s)
   case 'b':
   break;
   default:
 - fprintf(stderr, ERROR: Unknown size descriptor 
 - '%c'\n, c);
 + fprintf(stderr, ERROR: Unknown size descriptor 
 '%c'\n, c);
   exit(1);
   }
   }
 - if (s[i]  s[i+1]) {
 - fprintf(stderr, ERROR: Illegal suffix contains 
 - character '%c' in wrong position\n,
 - s[i+1]);
 - exit(51);
 - }
 - return strtoull(s, NULL, 10) * mult;
 +
 + ret *= mult;

Although there was no overflow check before, I think it should be here.
Eg. 12345678P is a valid size string but the result does not fit u64.

 + return ret;
  }
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] btrfs-progs: add mount options to btrfs-mount.5

2014-06-12 Thread David Sterba
On Wed, Jun 11, 2014 at 05:14:55PM -0500, Eric Sandeen wrote:
 This is a straight cut and paste from the util-linux
 mount manpage into btrfs-mount.5
 
 It's pretty much impossible for util-linux to keep up
 with every filesystem out there, and Karel has more than
 once expressed a wish that mount options move into fs-specific
 manpages.
 
 So, here we go.
 
 The way btrfs asciidoc is generated, there's not a trivial
 way to have both btrfs(5) and btrfs(8) so I named it btrfs-mount(5)
 for now.  A bit ick and I'm open to suggestions.

So what if the mount options are generated from btrfs-mount.txt but
installed under btrfs.5.gz name? If there are more section 5 manpages we
can make it more generic but for now hardcoding btrfs-mount.* -
btrfs.5. sounds ok to me.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


R: Re: Slow startup of systemd-journal on BTRFS

2014-06-12 Thread Goffredo Baroncelli kreij...@libero.it


Messaggio originale
Da: li...@colorremedies.com
Data: 12/06/2014 2.40
A: kreij...@inwind.it, Goffredo Baroncellikreij...@libero.it
Cc: systemd Mailing Listsystemd-de...@lists.freedesktop.org, linux-btrfs
linux-btrfs@vger.kernel.org
Ogg: Re: Slow startup of systemd-journal on BTRFS


On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli kreij...@libero.it wrote:
 
 If someone is able to suggest me how FRAGMENT the log file, I can try to 
collect more scientific data.

So long as you're not using compression, filefrag will show you fragments of 
systemd-journald journals. I can vouch for the behavior 
 you experience without xattr +C or autodefrag, but further it also causes 
much slowness when reading journal contents. LIke if I want to 
 search all boots for a particular error message to see how far back it 
started, this takes quite a bit longer than on other file systems. 
 So far I'm not experiencing this problem with autodefrag or any other 
negative side effects, but my understanding is this code is still in flux.

Since the journals have their own checksumming I'm not overly concerned about 
setting xattr +C.

This is true; but it can be a general solution: the checksum of the data are 
needed during a 
scrub and/or a RAID rebuilding.

I want to investigate doing an explicit defrag once a week.



Chris Murphy
G.Baroncelli

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


R: Re: Slow startup of systemd-journal on BTRFS

2014-06-12 Thread Goffredo Baroncelli kreij...@libero.it


Messaggio originale
Da: russ...@coker.com.au
Data: 12/06/2014 3.18
A: kreij...@inwind.it
Cc: systemd Mailing Listsystemd-de...@lists.freedesktop.org, linux-btrfs
linux-btrfs@vger.kernel.org
Ogg: Re: Slow startup of systemd-journal on BTRFS

On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
 https://bugzilla.redhat.com/show_bug.cgi?id=1006386
 
 suggested me that the problem could be due to a bad interaction between
 systemd and btrfs. NetworkManager was innocent.  It seems that
 systemd-journal create a very hight fragmented files when it stores its
 log. And BTRFS it is know to behave slowly when a file is highly
 fragmented. This had caused a slow startup of systemd-journal, which in
 turn had blocked the services which depend by the loggin system.

On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put 
SystemMaxUse=50M.  That doesn't solve the fragmentation problem but 
reduces 
it enough that it doesn't bother me.

IIRC my log files are about 80/100MB. So I am not sure if this could help.
I want to investigate also the option

MaxFileSec=1d

which rotates the log file once a day (or a week)


-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


R: Re: Slow startup of systemd-journal on BTRFS

2014-06-12 Thread Goffredo Baroncelli kreij...@libero.it
Da: da...@fromorbit.com
Data: 12/06/2014 3.21
A: kreij...@inwind.it
Cc: systemd Mailing Listsystemd-de...@lists.freedesktop.org, linux-btrfs
linux-btrfs@vger.kernel.org
Ogg: Re: Slow startup of systemd-journal on BTRFS

On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
 Hi all,
 
 I would like to share a my experience about a slowness of systemd when used 
on BTRFS.
 
 My boot time was very high (about ~50 seconds); most of time it was due to 
NetworkManager which took about 30-40 seconds to start (this data came from 
systemd-analyze plot).
 
 I make several attempts to address this issue. Also I noticed that sometime 
this problem disappeared; but I was never able to understand why.
 
 However this link
 
  https://bugzilla.redhat.com/show_bug.cgi?id=1006386
 
 suggested me that the problem could be due to a bad interaction between 
systemd and btrfs. NetworkManager was innocent. 

systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it.  All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.

I checked the code, and to me it seems that the fallocate() are
done in FILE_SIZE_INCREASE unit (actually 8MB). 


Yup, it fragments journal files on XFS, too.

http://oss.sgi.com/archives/xfs/2014-03/msg00322.html

IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...

If I am able to start a decent setup I would like to play to change some
parameters like:
- remove fallocate at all (at the beginning only ?)
- increase the fallocate allocation unit
- change the file log size and rotation time
- periodically defragment
[...[

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: R: Re: Slow startup of systemd-journal on BTRFS

2014-06-12 Thread Duncan
Goffredo Baroncelli kreij...@libero.it posted on Thu, 12 Jun 2014
13:13:26 +0200 as excerpted:

systemd has a very stupid journal write pattern. It checks if there is
space in the file for the write, and if not it fallocates the small
amount of space it needs (it does *4 byte* fallocate calls!) and then
does the write to it.  All this does is fragment the crap out of the log
files because the filesystems cannot optimise the allocation patterns.
 
 I checked the code, and to me it seems that the fallocate() are done in
 FILE_SIZE_INCREASE unit (actually 8MB).

FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think 
actually pretty much equally bad without NOCOW set on the file.

Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for 
either 4 byte or 8 MiB file allocations is going to end up being the 
same, forcing (repeated until full) rewrite of each 4 KiB block into its 
own extent.

Turning off the fallocate should allow btrfs to at least consolidate a 
bit, tho to the extent that multiple 4 KiB blocks cannot be written, 
repeated fsync will still cause issues.

80-100 MiB logs (size mentioned in another reply) should be reasonably 
well handled by btrfs autodefrag, however, if it's turned on.  I'd be 
worried if sizes were  256 MiB and certainly as sizes approached a GiB, 
but it should handle 80-100 MiB just fine.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] btrfs-progs: add mount options to btrfs-mount.5

2014-06-12 Thread Eric Sandeen
On 6/12/14, 5:51 AM, David Sterba wrote:
 On Wed, Jun 11, 2014 at 05:14:55PM -0500, Eric Sandeen wrote:
 This is a straight cut and paste from the util-linux
 mount manpage into btrfs-mount.5

 It's pretty much impossible for util-linux to keep up
 with every filesystem out there, and Karel has more than
 once expressed a wish that mount options move into fs-specific
 manpages.

 So, here we go.

 The way btrfs asciidoc is generated, there's not a trivial
 way to have both btrfs(5) and btrfs(8) so I named it btrfs-mount(5)
 for now.  A bit ick and I'm open to suggestions.
 
 So what if the mount options are generated from btrfs-mount.txt but
 installed under btrfs.5.gz name? If there are more section 5 manpages we
 can make it more generic but for now hardcoding btrfs-mount.* -
 btrfs.5. sounds ok to me.

Yeah, that seemed like kind of nasty hard-coding, but I suppose it works
for now.  I wanted to make it more generic, I didn't have a better idea..

-Eric

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Deadlock/high load

2014-06-12 Thread Alin Dobre
Hi all,

I have a problem that triggers quite often on our production machines. I
don't really know what's triggering this or how to reproduce it, but the
machine enters in some sort of deadlock state, where it consumes all the
i/o and the load average goes very high in seconds (it even gets to over
200), sometimes in about a minute or even less, the machine is
unresponsive and we have to reset it. Rarely, the load just stays high
(~25) for hours, but it never gets down again, but this happens rarely,
as I said. In general, the machine is either already unresponsive or is
about to become unresponsive.

The last machine that encountered this has 40 cores and the btrfs
filesystem is running over SSDs. We encountered this on a plain 3.14
kernel, and also on the latest 3.14.6 kernel + all the patches whose
summary is marked btrfs: that made it in 3.15, straight forward
backported (cherry-picked) to 3.14.

Also, no suspicious (malicious) activity from the running processes either.

I noticed there was another report on 3.13 which was solved by a 3.15rc
patch, it doesn't seem to be the same thing.

Since the only chance to obtain something was via a SysRq dump, here's
what I could get from the last w trigger (tasks that are in
uninterruptable (blocked) state), showing only tasks that are related to
btrfs:

btrfs-transacti D 000e 0  2483  2 0x00080008
 881fd05975d8 81880a27 881fd05974e8 881fd05974f8
 881fd0596010 881fd05975d8 00011bc0 881fd13398f0
 00011bc0 00011bc0 881fd28ecad0 881fd13398f0
Call Trace:
 [81880a27] ? __schedule+0x687/0x72c
 [8163aaf0] ? do_release_stripe+0xeb/0x182
 [8114a076] ? zone_statistics+0x77/0x7e
 [8163fed0] ? raid5_unplug+0xaa/0xb3
 [813cb87e] ? blk_flush_plug_list+0x99/0x1f0
 [8163c24d] ? get_active_stripe+0x65/0x5ca
 [810f8704] ? prepare_to_wait+0x71/0x7c
 [816431f9] ? make_request+0x7b0/0x999
 [816429d4] ? release_stripe_plug+0x20/0x95
 [810f8497] ? bit_waitqueue+0xb0/0xb0
 [810f8497] ? bit_waitqueue+0xb0/0xb0
 [8166375a] ? md_make_request+0xfa/0x215
 [81324f22] ? __btrfs_map_block+0xd6f/0xd89
 [813ca63c] ? generic_make_request+0x99/0xda
 [813ca770] ? submit_bio+0xf3/0xfe
 [813251de] ? submit_stripe_bio+0x77/0x82
 [813255b6] ? btrfs_map_bio+0x3cd/0x440
 [812fdc1d] ? csum_tree_block+0x1c1/0x1ec
 [812fdfa6] ? btree_submit_bio_hook+0x97/0xf0
 [811b561e] ? __bio_add_page+0x153/0x1de
 [8131ab64] ? submit_one_bio+0x63/0x90
 [8113c61b] ? account_page_writeback+0x28/0x2d
 [8131b504] ? submit_extent_page+0xe7/0x17e
 [81320796] ? btree_write_cache_pages+0x44c/0x71a
 [8131b272] ? extent_range_clear_dirty_for_io+0x5a/0x5a
 [812fc41a] ? btree_writepages+0x4a/0x53
 [8113cf5f] ? do_writepages+0x1b/0x24
 [81134f76] ? __filemap_fdatawrite_range+0x4e/0x50
 [81135b55] ? filemap_fdatawrite_range+0xe/0x10
 [813020c1] ? btrfs_write_marked_extents+0x83/0xd1
 [8130216b] ? btrfs_write_and_wait_transaction+0x5c/0x8a
 [81302ee2] ? btrfs_commit_transaction+0x68b/0x87c
 [810cf0b7] ? del_timer+0x87/0x87
 [812fef3f] ? transaction_kthread+0x114/0x1e9
 [812fee2b] ? close_ctree+0x280/0x280
 [810df1ff] ? kthread+0xc9/0xd1
 [810df136] ? kthread_freezable_should_stop+0x5b/0x5b
 [818842cc] ? ret_from_fork+0x7c/0xb0
 [810df136] ? kthread_freezable_should_stop+0x5b/0x5b
rs:main Q:Reg   D 0002 0  6857   4976 0x
 883fc9b0bb08 0002 883fc9b0b9e8 883fc9b0ba28
 883fc9b0a010 883fc9b0bb08 00011bc0 883fc794db70
 00011bc0 00011bc0 881fd28e8000 883fc794db70
Call Trace:
 [81040a93] ? native_sched_clock+0x17/0xd3
 [810406e7] ? sched_clock+0x9/0xd
 [810eb7c2] ? arch_vtime_task_switch+0x81/0x86
 [810ebc88] ? vtime_common_task_switch+0x29/0x2d
 [8104072d] ? read_tsc+0x9/0x1b
 [81880c2a] schedule+0x6e/0x70
 [81880cbf] io_schedule+0x93/0xd7
 [81134170] ? __lock_page+0x68/0x68
 [81134179] sleep_on_page+0x9/0xd
 [8188118f] __wait_on_bit+0x45/0x7a
 [8113444e] wait_on_page_bit+0x71/0x78
 [810f84f3] ? wake_atomic_t_function+0x28/0x28
 [81311056] prepare_pages+0xd2/0x11b
 [813143a5] __btrfs_buffered_write+0x214/0x482
 [8110eb76] ? futex_wait+0x176/0x239
 [810c9650] ? current_fs_time+0x22/0x29
 [81314a2a] btrfs_file_aio_write+0x417/0x507
 [81198eb3] ? path_openat+0x593/0x5cc
 [8118c275] do_sync_write+0x59/0x79
 [8118d53e] vfs_write+0xd3/0x172
 [8118d69e] SyS_write+0x4b/0x8f
 [81884505] tracesys+0xd0/0xd5
freshclam   D 0002 0  8305   4976 0x
 883fbc1b1b08 

Re: [PATCH] Btrfs: fix qgroups sanity test crash or hang

2014-06-12 Thread Filipe David Manana
On Thu, Jun 12, 2014 at 1:35 AM, Chris Mason c...@fb.com wrote:
 On 06/11/2014 08:12 PM, Filipe David Borba Manana wrote:
 Often when running the qgroups sanity test, a crash or a hang happened.
 This is because the extent buffer the test uses for the root node doesn't
 have an header level explicitly set, making it have a random level value.
 This is a problem when it's not zero for the btrfs_search_slot() calls
 the test ends up doing, resulting in crashes or hangs such as the following:


 Therefore initialize the extent buffer as an empty leaf (level 0).

 Issue easy to reproduce when btrfs is built as a module via:

 $ for ((i = 1; i = 100; i++)); do rmmod btrfs; modprobe btrfs; done

 Nice, thanks Filipe, I hadn't been able to trigger this yet.

It's a bit rare for me too, happens about 1/100 in my vm. Every time
it fails root-node's header level is a non-zero value, usually way
bigger than BTRFS_MAX_LEVEL (8), causing all sorts of badness in
btrfs_search_slot() due to out of bounds accesses to the arrays
path-{locks | nodes | slots}


 -chris



-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Use sparse files for filesystem conversion tests.

2014-06-12 Thread Adam Buchbinder
On my system, this brings the FS conversion test suite's runtime from over
ten seconds down to under two.

Thanks to Julien Muchembled for the suggestion.

Signed-off-by: Adam Buchbinder abuchbin...@google.com
---
 tests/convert-tests.sh | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/tests/convert-tests.sh b/tests/convert-tests.sh
index 87369c5..9f7a5c8 100644
--- a/tests/convert-tests.sh
+++ b/tests/convert-tests.sh
@@ -13,15 +13,16 @@ _fail()
 }
 
 rm -f convert-tests-results.txt
-rm -f test.img
 
 test(){
echo  [TEST]$1
-shift
-echo creating ext image with: $*  convert-tests-results.txt
+   shift
+   echo creating ext image with: $*  convert-tests-results.txt
# 256MB is the smallest acceptable btrfs image.
-   dd if=/dev/zero of=$here/test.img bs=1024 count=$((256*1024)) \
-convert-tests-results.txt 21 || _fail dd failed
+   rm -f $here/test.img  convert-tests-results.txt 21 \
+   || _fail could not remove test image file
+   truncate -s 256M $here/test.img  convert-tests-results.txt 21 \
+   || _fail could not create test image file
$* -F $here/test.img  convert-tests-results.txt 21 \
|| _fail filesystem create failed
$here/btrfs-convert $here/test.img  convert-tests-results.txt 21 \
@@ -30,6 +31,7 @@ test(){
|| _fail btrfsck detected errors
 }
 
-test ext2, 4k blocksize mke2fs -b 4096
-test ext3, 4k blocksize mke2fs -j -b 4096
-test ext4, 4k blocksize mke2fs -t ext4 -b 4096
+# btrfs-convert requires 4k blocksize.
+test ext2 mke2fs -b 4096
+test ext3 mke2fs -j -b 4096
+test ext4 mke2fs -t ext4 -b 4096
-- 
2.0.0.526.g5318336

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] lib: add size unit t/p/e to memparse

2014-06-12 Thread Andrew Morton
On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng guihc.f...@cn.fujitsu.com wrote:

 For modern filesystems such as btrfs, t/p/e size level operations
 are common.
 add size unit t/p/e parsing to memparse
 
 Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
 ---
 changelog
   v1-v2: replace kilobyte with kibibyte, and others
   v2-v3: add missing unit bytes in comment
 ---
  lib/cmdline.c | 25 -
  1 file changed, 20 insertions(+), 5 deletions(-)
 
 diff --git a/lib/cmdline.c b/lib/cmdline.c
 index eb67911..511b9be 100644
 --- a/lib/cmdline.c
 +++ b/lib/cmdline.c
 @@ -119,11 +119,17 @@ char *get_options(const char *str, int nints, int *ints)
   *   @retptr: (output) Optional pointer to next char after parse completes
   *
   *   Parses a string into a number.  The number stored at @ptr is
 - *   potentially suffixed with %K (for kilobytes, or 1024 bytes),
 - *   %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
 - *   1073741824).  If the number is suffixed with K, M, or G, then
 - *   the return value is the number multiplied by one kilobyte, one
 - *   megabyte, or one gigabyte, respectively.
 + *   potentially suffixed with
 + *   %K (for kibibytes, or 1024 bytes),
 + *   %M (for mebibytes, or 1048576 bytes),
 + *   %G (for gibibytes, or 1073741824 bytes),
 + *   %T (for tebibytes, or 1099511627776 bytes),
 + *   %P (for pebibytes, or 1125899906842624 bytes),
 + *   %E (for exbibytes, or 1152921504606846976 bytes).

I'm afraid I find these names quite idiotic - we all know what the
traditional terms mean so why go and muck with it.

Also, kibibytes sounds like cat food.

 @@ -133,6 +139,15 @@ unsigned long long memparse(const char *ptr, char 
 **retptr)
   unsigned long long ret = simple_strtoull(ptr, endptr, 0);
  
   switch (*endptr) {
 + case 'E':
 + case 'e':
 + ret = 10;
 + case 'P':
 + case 'p':
 + ret = 10;
 + case 'T':
 + case 't':
 + ret = 10;
   case 'G':
   case 'g':
   ret = 10;

That bit makes sense.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: Improve the parse_size() error message.

2014-06-12 Thread Qu Wenruo


 Original Message 
Subject: Re: [PATCH v2] btrfs-progs: Improve the parse_size() error message.
From: David Sterba dste...@suse.cz
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014年06月12日 18:15

On Thu, May 29, 2014 at 09:42:11AM +0800, Qu Wenruo wrote:

When using parse_size(), even non-numeric value is passed, it will only
give error message ERROR: size value is empty, which is quite
confusing for end users.

This patch will introduce more meaningful error message for the
following new cases
1) Invalid size string (non-numeric string)
2) Minus size value (like -1K)

Also this patch will take full use of endptr returned by strtoll() to
reduce unneeded loop.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
  utils.c | 56 +++-
  1 file changed, 39 insertions(+), 17 deletions(-)

diff --git a/utils.c b/utils.c
index 392c5cf..499f08f 100644
--- a/utils.c
+++ b/utils.c
@@ -1612,18 +1612,45 @@ scan_again:
  
  u64 parse_size(char *s)

  {
-   int i;
char c;
+   char *endptr;
u64 mult = 1;
+   long long int signed_ret;
+   u64 ret;
  
-	for (i = 0; s  s[i]  isdigit(s[i]); i++) ;

-   if (!i) {
-   fprintf(stderr, ERROR: size value is empty\n);
-   exit(50);
+   if (!s) {
+   fprintf(stderr, ERROR: Size value is empty\n);
+   exit(1);

We never pass a NULL pointer to parse_size so this check will be always
false. Previously it verified that there are at least some digits.
Command like 'mkfs.btrfs -b /dev/sda' *WILL* pass NULL pointer to 
parse_size(), so the check is needed.





}
-
-   if (s[i]) {
-   c = tolower(s[i]);
+   signed_ret = strtoll(s, endptr, 10);
+   if (endptr == s) {
+   fprintf(stderr, ERROR: Size value '%s' is invalid\n, s);
+   exit(1);
+   }
+   if (endptr[0]  endptr[1]) {
+   fprintf(stderr, ERROR: Illegal suffix contains character '%c' in 
wrong position\n,
+   endptr[1]);
+   exit(1);
+   }
+   if (signed_ret = 0) {
+   fprintf(stderr,
+   ERROR: Size value '%s' is less equal than 0\n, s);
+   exit(1);
+   }
+   /* strtoll returns LLONG_MAX when overflow, if this happens,
+* need to call strtoull to get the real size */
+   if (errno == ERANGE  signed_ret == LLONG_MAX) {
+   ret = strtoull(s, NULL, 10);

Why do you parse the number twice? Negative sizes are currently not used
so you can reject them.

I will change the patch to judgement leading '-' and reject the value.



+   if (errno == ERANGE  ret == ULLONG_MAX) {
+   fprintf(stderr,
+   ERROR: Size value '%s' is too large for u64\n,
+   s);
+   exit(1);
+   }
+   } else
+   ret = signed_ret;
+   if (endptr[0]) {
+   c = tolower(endptr[0]);
switch (c) {
case 'e':
mult *= 1024;
@@ -1646,18 +1673,13 @@ u64 parse_size(char *s)
case 'b':
break;
default:
-   fprintf(stderr, ERROR: Unknown size descriptor 
-   '%c'\n, c);
+   fprintf(stderr, ERROR: Unknown size descriptor 
'%c'\n, c);
exit(1);
}
}
-   if (s[i]  s[i+1]) {
-   fprintf(stderr, ERROR: Illegal suffix contains 
-   character '%c' in wrong position\n,
-   s[i+1]);
-   exit(51);
-   }
-   return strtoull(s, NULL, 10) * mult;
+
+   ret *= mult;

Although there was no overflow check before, I think it should be here.
Eg. 12345678P is a valid size string but the result does not fit u64.

Right, I will check the overflow here.

Thanks,
Qu



+   return ret;
  }


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] lib: add size unit t/p/e to memparse

2014-06-12 Thread Gui Hecheng
On Thu, 2014-06-12 at 14:15 -0700, Andrew Morton wrote:
 On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng guihc.f...@cn.fujitsu.com 
 wrote:
 
  For modern filesystems such as btrfs, t/p/e size level operations
  are common.
  add size unit t/p/e parsing to memparse
  
  Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
  ---
  changelog
  v1-v2: replace kilobyte with kibibyte, and others
  v2-v3: add missing unit bytes in comment
  ---
   lib/cmdline.c | 25 -
   1 file changed, 20 insertions(+), 5 deletions(-)
  
  diff --git a/lib/cmdline.c b/lib/cmdline.c
  index eb67911..511b9be 100644
  --- a/lib/cmdline.c
  +++ b/lib/cmdline.c
  @@ -119,11 +119,17 @@ char *get_options(const char *str, int nints, int 
  *ints)
* @retptr: (output) Optional pointer to next char after parse completes
*
* Parses a string into a number.  The number stored at @ptr is
  - * potentially suffixed with %K (for kilobytes, or 1024 bytes),
  - * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
  - * 1073741824).  If the number is suffixed with K, M, or G, then
  - * the return value is the number multiplied by one kilobyte, one
  - * megabyte, or one gigabyte, respectively.
  + * potentially suffixed with
  + * %K (for kibibytes, or 1024 bytes),
  + * %M (for mebibytes, or 1048576 bytes),
  + * %G (for gibibytes, or 1073741824 bytes),
  + * %T (for tebibytes, or 1099511627776 bytes),
  + * %P (for pebibytes, or 1125899906842624 bytes),
  + * %E (for exbibytes, or 1152921504606846976 bytes).
 
 I'm afraid I find these names quite idiotic - we all know what the
 traditional terms mean so why go and muck with it.
 
 Also, kibibytes sounds like cat food.

Yes, I will cleanup this part, Thanks very much.

-Gui

  @@ -133,6 +139,15 @@ unsigned long long memparse(const char *ptr, char 
  **retptr)
  unsigned long long ret = simple_strtoull(ptr, endptr, 0);
   
  switch (*endptr) {
  +   case 'E':
  +   case 'e':
  +   ret = 10;
  +   case 'P':
  +   case 'p':
  +   ret = 10;
  +   case 'T':
  +   case 't':
  +   ret = 10;
  case 'G':
  case 'g':
  ret = 10;
 
 That bit makes sense.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] btrfs-progs: Improve the parse_size() error message.

2014-06-12 Thread Qu Wenruo
When using parse_size(), even non-numeric value is passed, it will only
give error message ERROR: size value is empty, which is quite
confusing for end users.

This patch will introduce more meaningful error message for the
following new cases
1) Invalid size string (non-numeric string)
2) Minus size value (like -1K)

Also this patch will take full use of endptr returned by strtoll() to
reduce unneeded loop.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
changlog:
v2:
   Remove uneeded return value
   Avoid abuse of goto
v3:
   Don't reparse size twice
   Better u64 overflow check
---
 utils.c | 66 +
 1 file changed, 50 insertions(+), 16 deletions(-)

diff --git a/utils.c b/utils.c
index 392c5cf..eab3a1b 100644
--- a/utils.c
+++ b/utils.c
@@ -1610,20 +1610,53 @@ scan_again:
return 0;
 }
 
-u64 parse_size(char *s)
+/* A not-so-good version fls64. No fascinating optimization since
+ * no one except parse_size use it */
+static int fls64(u64 x)
 {
int i;
+
+   for (i = 0; i 64; i++)
+   if (x  i  (1UL  63))
+   return 64 - i;
+   return 64 - i;
+}
+
+u64 parse_size(char *s)
+{
char c;
+   char *endptr;
u64 mult = 1;
+   u64 ret;
 
-   for (i = 0; s  s[i]  isdigit(s[i]); i++) ;
-   if (!i) {
-   fprintf(stderr, ERROR: size value is empty\n);
-   exit(50);
+   if (!s) {
+   fprintf(stderr, ERROR: Size value is empty\n);
+   exit(1);
}
-
-   if (s[i]) {
-   c = tolower(s[i]);
+   if (s[0] == '-') {
+   fprintf(stderr,
+   ERROR: Size value '%s' is less equal than 0\n, s);
+   exit(1);
+   }
+   ret = strtoull(s, endptr, 10);
+   if (endptr == s) {
+   fprintf(stderr, ERROR: Size value '%s' is invalid\n, s);
+   exit(1);
+   }
+   if (endptr[0]  endptr[1]) {
+   fprintf(stderr, ERROR: Illegal suffix contains character '%c' 
in wrong position\n,
+   endptr[1]);
+   exit(1);
+   }
+   /* strtoll returns LLONG_MAX when overflow, if this happens,
+* need to call strtoull to get the real size */
+   if (errno == ERANGE  ret == ULLONG_MAX) {
+   fprintf(stderr,
+   ERROR: Size value '%s' is too large for u64\n, s);
+   exit(1);
+   }
+   if (endptr[0]) {
+   c = tolower(endptr[0]);
switch (c) {
case 'e':
mult *= 1024;
@@ -1646,18 +1679,19 @@ u64 parse_size(char *s)
case 'b':
break;
default:
-   fprintf(stderr, ERROR: Unknown size descriptor 
-   '%c'\n, c);
+   fprintf(stderr, ERROR: Unknown size descriptor '%c'\n,
+   c);
exit(1);
}
}
-   if (s[i]  s[i+1]) {
-   fprintf(stderr, ERROR: Illegal suffix contains 
-   character '%c' in wrong position\n,
-   s[i+1]);
-   exit(51);
+   /* Check whether ret * mult overflow */
+   if (fls64(ret) + fls64(mult) - 1  64) {
+   fprintf(stderr,
+   ERROR: Size value '%s' is too large for u64\n, s);
+   exit(1);
}
-   return strtoull(s, NULL, 10) * mult;
+   ret *= mult;
+   return ret;
 }
 
 int open_file_or_dir3(const char *fname, DIR **dirstream, int open_flags)
-- 
2.0.0

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Deadlock/high load

2014-06-12 Thread Liu Bo
On Thu, Jun 12, 2014 at 04:15:27PM +0100, Alin Dobre wrote:
 Hi all,
 
 I have a problem that triggers quite often on our production machines. I
 don't really know what's triggering this or how to reproduce it, but the
 machine enters in some sort of deadlock state, where it consumes all the
 i/o and the load average goes very high in seconds (it even gets to over
 200), sometimes in about a minute or even less, the machine is
 unresponsive and we have to reset it. Rarely, the load just stays high
 (~25) for hours, but it never gets down again, but this happens rarely,
 as I said. In general, the machine is either already unresponsive or is
 about to become unresponsive.
 
 The last machine that encountered this has 40 cores and the btrfs
 filesystem is running over SSDs. We encountered this on a plain 3.14
 kernel, and also on the latest 3.14.6 kernel + all the patches whose
 summary is marked btrfs: that made it in 3.15, straight forward
 backported (cherry-picked) to 3.14.
 
 Also, no suspicious (malicious) activity from the running processes either.
 
 I noticed there was another report on 3.13 which was solved by a 3.15rc
 patch, it doesn't seem to be the same thing.

The output of 'btrfs filesystem df' is appreciate, it can help determine if the
FS has entered into 'almost full' situation, and that may cause a bug that pages
are not marked with writeback tag and lead to processes's endless waiting.

-liubo

 
 Since the only chance to obtain something was via a SysRq dump, here's
 what I could get from the last w trigger (tasks that are in
 uninterruptable (blocked) state), showing only tasks that are related to
 btrfs:
 
 btrfs-transacti D 000e 0  2483  2 0x00080008
  881fd05975d8 81880a27 881fd05974e8 881fd05974f8
  881fd0596010 881fd05975d8 00011bc0 881fd13398f0
  00011bc0 00011bc0 881fd28ecad0 881fd13398f0
 Call Trace:
  [81880a27] ? __schedule+0x687/0x72c
  [8163aaf0] ? do_release_stripe+0xeb/0x182
  [8114a076] ? zone_statistics+0x77/0x7e
  [8163fed0] ? raid5_unplug+0xaa/0xb3
  [813cb87e] ? blk_flush_plug_list+0x99/0x1f0
  [8163c24d] ? get_active_stripe+0x65/0x5ca
  [810f8704] ? prepare_to_wait+0x71/0x7c
  [816431f9] ? make_request+0x7b0/0x999
  [816429d4] ? release_stripe_plug+0x20/0x95
  [810f8497] ? bit_waitqueue+0xb0/0xb0
  [810f8497] ? bit_waitqueue+0xb0/0xb0
  [8166375a] ? md_make_request+0xfa/0x215
  [81324f22] ? __btrfs_map_block+0xd6f/0xd89
  [813ca63c] ? generic_make_request+0x99/0xda
  [813ca770] ? submit_bio+0xf3/0xfe
  [813251de] ? submit_stripe_bio+0x77/0x82
  [813255b6] ? btrfs_map_bio+0x3cd/0x440
  [812fdc1d] ? csum_tree_block+0x1c1/0x1ec
  [812fdfa6] ? btree_submit_bio_hook+0x97/0xf0
  [811b561e] ? __bio_add_page+0x153/0x1de
  [8131ab64] ? submit_one_bio+0x63/0x90
  [8113c61b] ? account_page_writeback+0x28/0x2d
  [8131b504] ? submit_extent_page+0xe7/0x17e
  [81320796] ? btree_write_cache_pages+0x44c/0x71a
  [8131b272] ? extent_range_clear_dirty_for_io+0x5a/0x5a
  [812fc41a] ? btree_writepages+0x4a/0x53
  [8113cf5f] ? do_writepages+0x1b/0x24
  [81134f76] ? __filemap_fdatawrite_range+0x4e/0x50
  [81135b55] ? filemap_fdatawrite_range+0xe/0x10
  [813020c1] ? btrfs_write_marked_extents+0x83/0xd1
  [8130216b] ? btrfs_write_and_wait_transaction+0x5c/0x8a
  [81302ee2] ? btrfs_commit_transaction+0x68b/0x87c
  [810cf0b7] ? del_timer+0x87/0x87
  [812fef3f] ? transaction_kthread+0x114/0x1e9
  [812fee2b] ? close_ctree+0x280/0x280
  [810df1ff] ? kthread+0xc9/0xd1
  [810df136] ? kthread_freezable_should_stop+0x5b/0x5b
  [818842cc] ? ret_from_fork+0x7c/0xb0
  [810df136] ? kthread_freezable_should_stop+0x5b/0x5b
 rs:main Q:Reg   D 0002 0  6857   4976 0x
  883fc9b0bb08 0002 883fc9b0b9e8 883fc9b0ba28
  883fc9b0a010 883fc9b0bb08 00011bc0 883fc794db70
  00011bc0 00011bc0 881fd28e8000 883fc794db70
 Call Trace:
  [81040a93] ? native_sched_clock+0x17/0xd3
  [810406e7] ? sched_clock+0x9/0xd
  [810eb7c2] ? arch_vtime_task_switch+0x81/0x86
  [810ebc88] ? vtime_common_task_switch+0x29/0x2d
  [8104072d] ? read_tsc+0x9/0x1b
  [81880c2a] schedule+0x6e/0x70
  [81880cbf] io_schedule+0x93/0xd7
  [81134170] ? __lock_page+0x68/0x68
  [81134179] sleep_on_page+0x9/0xd
  [8188118f] __wait_on_bit+0x45/0x7a
  [8113444e] wait_on_page_bit+0x71/0x78
  [810f84f3] ? wake_atomic_t_function+0x28/0x28
  [81311056] prepare_pages+0xd2/0x11b
  [813143a5] __btrfs_buffered_write+0x214/0x482
  [8110eb76] ? futex_wait+0x176/0x239
  

[PATCH 1/2 v3 RESEND] Btrfs: device_list_add() should not update list when mounted

2014-06-12 Thread Anand Jain
From: Anand Jain anand.j...@oracle.com

device_list_add() is called when user runs btrfs dev scan, which would add
any btrfs device into the btrfs_fs_devices list.

Now think of a mounted btrfs. And a new device which contains the a SB
from the mounted btrfs devices.

In this situation when user runs btrfs dev scan, the current code would
just replace existing device with the new device.

Which is to note that old device is neither closed nor gracefully
removed from the btrfs.

The FS is still operational with the old bdev however the device name
is the btrfs_device is new which is provided by the btrfs dev scan.

reproducer:

devmgt[1] detach /dev/sdc

replace the missing disk /dev/sdc

btrfs rep start -f 1 /dev/sde /btrfs
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
Total devices 2 FS bytes used 32.00KiB
devid1 size 958.94MiB used 115.88MiB path /dev/sde
devid2 size 958.94MiB used 103.88MiB path /dev/sdd

make /dev/sdc to reappear

devmgt attach host2

btrfs dev scan

btrfs fi show -m
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M
Total devices 2 FS bytes used 32.00KiB^M
devid1 size 958.94MiB used 115.88MiB path /dev/sdc - Wrong.
devid2 size 958.94MiB used 103.88MiB path /dev/sdd

since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be
part of the btrfs-fsid when it reappears. If user want it to be part of it
then sys admin should be using btrfs device add instead.

[1] github.com/anajain/devmgt.git

Signed-off-by: Anand Jain anand.j...@oracle.com
---
v2-v3: simpler commit and comment message
v1-v2: remove '---' in commit messages which conflict with git am

 fs/btrfs/volumes.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bb2cd66..56822f0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -472,6 +472,7 @@ static noinline int device_list_add(const char *path,
device = __find_device(fs_devices-devices, devid,
   disk_super-dev_item.uuid);
}
+
if (!device) {
if (fs_devices-opened)
return -EBUSY;
@@ -497,6 +498,32 @@ static noinline int device_list_add(const char *path,
 
device-fs_devices = fs_devices;
} else if (!device-name || strcmp(device-name-str, path)) {
+   /*
+* When FS is already mounted.
+* 1. If you are here and if the device-name is NULL that means
+*this device was missing at time of FS mount.
+* 2. If you are here and if the device-name is different from 
'path'
+*that means either
+*  a. The same device disappeared and reappeared with 
different
+* name. or
+*  b. The missing-disk-which-was-replaced, has reappeared 
now.
+*
+*  We must allow 1 and 2a above. But 2b would be a spurious and
+*  unintentional.
+*  Further in case of 1 and 2a above, the disk at 'path' would 
have
+*  missed some transaction when it was away and in case of 2a
+*  the stale bdev has to be updated as well.
+*  2b must not be allowed at all time.
+*/
+
+   /*
+* As of now don't allow update to btrfs_fs_device through the
+* btrfs dev scan cli, after FS has been mounted.
+*/
+
+   if (fs_devices-opened)
+   return -EBUSY;
+
name = rcu_string_strdup(path, GFP_NOFS);
if (!name)
return -ENOMEM;
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: check generation as replace duplicates devid+uuid

2014-06-12 Thread Anand Jain
From: Anand Jain anand.j...@oracle.com

When FS in unmounted we need to check generation number as well
since devid+uuid combination could match with the missing replaced
disk when it reappears, and without this patch it might pair with
the replaced disk again.

 device_list_add() function is called in the following threads,
mount device option
mount argument
ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready dev)
 they have been unit tested to work fine with this patch.

 If the user knows what he is doing and really want to pair with
 replaced disk (which is not a standard operation), then he should
 first clear the kernel btrfs device list in the memory by doing
 the module unload/load and followed with the mount -o device option.

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 fs/btrfs/volumes.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 56822f0..bb1b4bd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -523,6 +523,16 @@ static noinline int device_list_add(const char *path,
 
if (fs_devices-opened)
return -EBUSY;
+   else {
+   /*
+* That is if the FS is _not_ mounted and if you are 
here, that
+* means there is more than one disk with same uuid and 
devid.
+* We keep the one with larger generation number or the 
last-in
+* if generation are equal.
+*/
+   if (found_transid  device-generation)
+   return -EINVAL;
+   }
 
name = rcu_string_strdup(path, GFP_NOFS);
if (!name)
@@ -535,6 +545,15 @@ static noinline int device_list_add(const char *path,
}
}
 
+   /*
+* Unmount does not free the btrfs_device struct but would zero
+* generation along with most of the other members. So just update
+* it back. We need it to pick the disk with largest generation
+* (as above).
+*/
+   if (!fs_devices-opened)
+   device-generation = found_transid;
+
if (found_transid  fs_devices-latest_trans) {
fs_devices-latest_devid = devid;
fs_devices-latest_trans = found_transid;
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: When devices with same dev uuid occurs, only add the one with largest generation.

2014-06-12 Thread Anand Jain



Qu,

 Kindly note, device-generation is not updated when user scan the
 device for the first time.

 I have sent out patch:
  [PATCH 2/2] btrfs: check generation as replace duplicates devid+uuid

 Please feel free to add your Signed-off-by, if you think so.

 review comments are appreciated as usual.

Thanks, Anand


On 10/06/2014 16:53, Qu Wenruo wrote:

Since btrfs currently use dulicated dev uuid when doing device replace,
the following problem will happen:

1) mount with device A missing using degraded mode.
2) replace device A with device B.
3) device A reappears.
4) umount btrfs fs.
5) mount btrfs fs.

After step 5), btrfs will still use device A even device B has a larger
generation.

The patch will judge generation when difference device with same dev
uuid.
And the patch should be applied after Anand's patch:
https://patchwork.kernel.org/patch/4309651/

Cc: Anand Jain anand.j...@oracle.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
  fs/btrfs/volumes.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0a5017a..07f0cf7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -508,7 +508,8 @@ static noinline int device_list_add(const char *path,

ret = 1;
device-fs_devices = fs_devices;
-   } else if (!device-name || strcmp(device-name-str, path)) {
+   } else if (!device-name || (strcmp(device-name-str, path) 
+  found_transid  device-generation)) {
/*
 * When FS is already mounted.
 * 1. If you are here and if the device-name is NULL that means


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html