date:20160801

Re: [PATCH 12/17] reflink: test cross-mountpoint reflink and dedupe

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:47:32PM -0700, Darrick J. Wong wrote:
> Test sharing blocks via reflink and dedupe between two different
> mountpoints of the same filesystem.  This shouldn't work, since
> we don't allow cross-mountpoint functions.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 16/17] xfs/122: add the realtime rmapbt inode and btree fields

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:48:00PM -0700, Darrick J. Wong wrote:
> Add the on-disk structures added by the realtime rmapbt.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 05/17] common/reflink: actually test dedupe on scratch device

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:46:48PM -0700, Darrick J. Wong wrote:
> In _require_scratch_dedupe, test the scratch device, not the testdev.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 06/17] xfs: run xfs_repair at the end of each test

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:46:54PM -0700, Darrick J. Wong wrote:
> Run xfs_repair twice at the end of each test -- once to rebuild
> the btree indices, and again with -n to check the rebuild work.

This looks fine to me in general, but shouldn't we have specific
tests that test the rebuilding in a normal auto run?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 07/17] xfs/128: cycle_mount the scratch device, not the test device

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:47:01PM -0700, Darrick J. Wong wrote:
> This test uses the scratch device, so cycle that, not the test dev.
> This is also a xfs_fsr test, so put it in the fsr group.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/17] common/dmerror: fix mount option issues

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:47:13PM -0700, Darrick J. Wong wrote:
> Calling _mount doesn't work when we want to add mount options
> such as realtime devices.  Since it's just a normal scratch device
> mount except for the source device, just call _scratch_mount with
> SCRATCH_DEV set to the dmerror device.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 10/17] xfs/179: use scratch device helpers

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:47:19PM -0700, Darrick J. Wong wrote:
> Use the helper functions for scratch devices.  This fixes a problem
> where xfs/179 fails when there's a realtime device.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 08/17] xfs/129: fix post-metadump remounting idiocy

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:47:07PM -0700, Darrick J. Wong wrote:
> Use the standard _scratch_mount to mount the filesystem from the restored
> image, instead of trying to call mount directly.  This is needed in case
> we had custom mount options (like rtdev).
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/17] xfs/234: use scratch device helpers

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:47:26PM -0700, Darrick J. Wong wrote:
> Use the helper functions for scratch devices.  This fixes a problem
> where xfs/234 fails when there's a realtime device.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] btrfs-progs: Introduce new send-dump object

2016-08-01 Thread Qu Wenruo

Introduce send-dump.[ch] which implements a new btrfs_send_ops to
exam and output all operations inside a send stream.

It has a better output format than the old and no longer compilable
send-test tool, but still tries to be script friendly.

Provides the basis for later "inspect-internal dump-send" command.

Signed-off-by: Qu Wenruo 
---
 Makefile.in |   2 +-
 send-dump.c | 367 
 send-dump.h |  24 
 3 files changed, 392 insertions(+), 1 deletion(-)
 create mode 100644 send-dump.c
 create mode 100644 send-dump.h

diff --git a/Makefile.in b/Makefile.in
index ac6b353..97caf95 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -80,7 +80,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o 
print-tree.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
- inode.o file.o find-root.o free-space-tree.o help.o
+ inode.o file.o find-root.o free-space-tree.o help.o send-dump.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/send-dump.c b/send-dump.c
new file mode 100644
index 000..bf451c7
--- /dev/null
+++ b/send-dump.c
@@ -0,0 +1,367 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "utils.h"
+#include "commands.h"
+#include "send-utils.h"
+#include "send-stream.h"
+#include "send-dump.h"
+
+#define path_cat_out_with_error(function_name, out_path, path1, path2, ret) \
+ret = path_cat_out(out_path, path1, path2);\
+if (ret < 0) { \
+   error("%s: path invalid: %s\n", function_name, path2);  \
+   return ret; \
+}
+
+#define TITLE_WIDTH16
+#define PATH_WIDTH 32
+
+static void print_dump(const char *title, const char *path,
+  const char *fmt, ...)
+{
+   va_list args;
+   char real_title[TITLE_WIDTH + 1];
+
+   real_title[0]='\0';
+   /* Append ':' to title*/
+   strncpy(real_title, title, TITLE_WIDTH - 1);
+   strncat(real_title, ":", TITLE_WIDTH);
+
+   /* Unified output */
+   printf("%-*s%-*s", TITLE_WIDTH, real_title, PATH_WIDTH, path);
+   va_start(args, fmt);
+   /* Command specified output */
+   vprintf(fmt, args);
+   va_end(args);
+   printf("\n");
+}
+
+static int print_subvol(const char *path, const u8 *uuid, u64 ctransid,
+   void *user)
+{
+   struct btrfs_dump_send_args *r = user;
+   char uuid_str[BTRFS_UUID_UNPARSED_SIZE];
+   int ret;
+
+   path_cat_out_with_error("subvol", r->full_subvol_path, r->root_path,
+   path, ret);
+   uuid_unparse(uuid, uuid_str);
+
+   print_dump("subvol", r->full_subvol_path, "uuid: %s, transid: %llu",
+  uuid_str, ctransid);
+   return 0;
+}
+
+static int print_snapshot(const char *path, const u8 *uuid, u64 ctransid,
+ const u8 *parent_uuid, u64 parent_ctransid,
+ void *user)
+{
+   struct btrfs_dump_send_args *r = user;
+   char uuid_str[BTRFS_UUID_UNPARSED_SIZE];
+   char parent_uuid_str[BTRFS_UUID_UNPARSED_SIZE];
+   int ret;
+
+   path_cat_out_with_error("snapshot", r->full_subvol_path, r->root_path,
+   path, ret);
+   uuid_unparse(uuid, uuid_str);
+   uuid_unparse(parent_uuid, parent_uuid_str);
+
+   print_dump("snapshot", r->full_subvol_path,
+  "uuid: %s, transid: %llu, parent_uuid: %s, parent_transid: 
%llu",
+  uuid_str, ctransid, parent_uuid_str, parent_ctransid);
+   return 0;
+}
+
+static int print_mkfile(const char *path, void *user)
+{
+   struct btrfs_dump_send_args *r = user;
+   char full_path[PATH_MAX];
+   int ret;
+
+   path_cat_out_with_error("mkfile", full_path, r->full_subvol_path, path,
+   ret);

[PATCH 2/3] btrfs-progs: inspect: Introduce dump-send-stream subcommand

2016-08-01 Thread Qu Wenruo

Introduce a new subcommand "dump-send-stream" for "inspect-internal"
command group.

It will exam and output all operations inside a send stream.
It's quite a useful tool to learn and dig kernel send stream.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-inspect-internal.asciidoc |  8 +++
 Makefile.in   |  3 +-
 cmds-inspect-dump-send.c  | 84 +++
 cmds-inspect-dump-send.h  | 24 
 cmds-inspect.c|  3 +
 5 files changed, 121 insertions(+), 1 deletion(-)
 create mode 100644 cmds-inspect-dump-send.c
 create mode 100644 cmds-inspect-dump-send.h

diff --git a/Documentation/btrfs-inspect-internal.asciidoc 
b/Documentation/btrfs-inspect-internal.asciidoc
index 74f6dea..8ba768d 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -146,6 +146,14 @@ Print sizes and statistics of trees.
 -b
 Print raw numbers in bytes.
 
+*dump-send-stream [options]*::
+exam and output all operations inside a send stream. Read from 'stdin' by 
default.
++
+`Options`
++
+-f|--file 
+Read from file instead of 'stdin'.
+
 EXIT STATUS
 ---
 *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is
diff --git a/Makefile.in b/Makefile.in
index 97caf95..00bf639 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -86,7 +86,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o 
cmds-device.o cmds-scrub.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
   cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \
-  cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o
+  cmds-inspect-dump-super.o cmds-inspect-tree-stats.o \
+  cmds-fi-du.o cmds-inspect-dump-send.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/cmds-inspect-dump-send.c b/cmds-inspect-dump-send.c
new file mode 100644
index 000..46a3cd6
--- /dev/null
+++ b/cmds-inspect-dump-send.c
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "kerncompat.h"
+#include "ctree.h"
+#include "send-dump.h"
+#include "send-stream.h"
+#include "utils.h"
+#include "commands.h"
+
+const char * const cmd_inspect_dump_send_usage[] = {
+   "btrfs inspect-internal dump-send-stream [options]",
+   "Dump the operations of a send stream from stdin",
+   "-f|--file read send stream from file",
+   "-h|--help   print this message",
+   NULL
+};
+
+int cmd_inspect_dump_send(int argc, char **argv)
+{
+   struct btrfs_dump_send_args dump_args;
+   int fd = 0;
+   int ret;
+
+   while (1) {
+   int c;
+   static const struct option long_options[] = {
+   { "file", required_argument, NULL, 'f' },
+   { "help", no_argument, NULL, 'h' },
+   { NULL, 0, NULL, 0 }
+   };
+
+   c = getopt_long(argc, argv, "f:h", long_options, NULL);
+   if (c < 0)
+   break;
+   switch(c) {
+   case 'f':
+   fd = open(optarg, O_RDONLY, 0666);
+   if (fd < 0) {
+   error("cannot open %s: %s", optarg,
+ strerror(errno));
+   exit(1);
+   }
+   break;
+   case 'h':
+   default:
+   usage(cmd_inspect_dump_send_usage);
+   break;
+   }
+   }
+   dump_args.root_path = malloc(PATH_MAX);
+   dump_args.root_path[0] = '.';
+   dump_args.root_path[1] = '\0';
+   dump_args.full_subvol_path = malloc(PATH_MAX);
+   dump_args.full_subvol_path[0] = '.';
+   dump_args.full_subvol_path[1] = '\0';
+
+   ret = btrfs_read_and_process_send_stream(fd, _print_send_ops,
+   _args, 0, 0);
+   if (ret < 0)
+

[PATCH 3/3] btrfs-progs: Remove send-test tool

2016-08-01 Thread Qu Wenruo

Since new "inspect dump-send" has better output and structure, it's time
to remove old and function-weak send-test tool.

New "inspect dump-send" can handle them all better.

Signed-off-by: Qu Wenruo 
---
 Makefile.in |   6 +-
 send-test.c | 447 
 2 files changed, 1 insertion(+), 452 deletions(-)
 delete mode 100644 send-test.c

diff --git a/Makefile.in b/Makefile.in
index 00bf639..5e4dd57 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -347,10 +347,6 @@ ioctl-test: $(objects) $(libs) ioctl-test.o
@echo "[LD] $@"
$(Q)$(CC) $(CFLAGS) -o ioctl-test $(objects) $(libs) ioctl-test.o 
$(LDFLAGS) $(LIBS)
 
-send-test: $(objects) $(libs) send-test.o
-   @echo "[LD] $@"
-   $(Q)$(CC) $(CFLAGS) -o send-test $(objects) $(libs) send-test.o 
$(LDFLAGS) $(LIBS)
-
 library-test: $(libs_shared) library-test.o
@echo "[LD] $@"
$(Q)$(CC) $(CFLAGS) -o library-test library-test.o $(LDFLAGS) -lbtrfs
@@ -382,7 +378,7 @@ clean-all: clean clean-doc clean-gen
 clean: $(CLEANDIRS)
@echo "Cleaning"
$(Q)$(RM) -f $(progs) cscope.out *.o *.o.d \
- dir-test ioctl-test quick-test send-test library-test 
library-test-static \
+ dir-test ioctl-test quick-test library-test library-test-static \
  btrfs.static mkfs.btrfs.static \
  $(check_defs) \
  $(libs) $(lib_links) \
diff --git a/send-test.c b/send-test.c
deleted file mode 100644
index 4645b89..000
--- a/send-test.c
+++ /dev/null
@@ -1,447 +0,0 @@
-/*
- * Copyright (C) 2013 SUSE.  All rights reserved.
- *
- * This code is adapted from cmds-send.c and cmds-receive.c,
- * Both of which are:
- *
- * Copyright (C) 2012 Alexander Block.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public
- * License along with this program; if not, write to the
- * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
- * Boston, MA 021110-1307, USA.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/*
- * This should be compilable without the rest of the btrfs-progs
- * source distribution.
- */
-#if BTRFS_FLAT_INCLUDES
-#include "send-utils.h"
-#include "send-stream.h"
-#else
-#include 
-#include 
-#endif /* BTRFS_FLAT_INCLUDES */
-
-static int pipefd[2];
-struct btrfs_ioctl_send_args io_send = {0, };
-static char *subvol_path;
-static char *root_path;
-
-struct recv_args {
-   char *full_subvol_path;
-   char *root_path;
-};
-
-void usage(int error)
-{
-   printf("send-test  \n");
-   if (error)
-   exit(error);
-}
-
-static int print_subvol(const char *path, const u8 *uuid, u64 ctransid,
-   void *user)
-{
-   struct recv_args *r = user;
-   char uuid_str[BTRFS_UUID_UNPARSED_SIZE];
-
-   r->full_subvol_path = path_cat(r->root_path, path);
-   uuid_unparse(uuid, uuid_str);
-
-   printf("subvol\t%s\t%llu\t%s\n", uuid_str,
-  (unsigned long long)ctransid, r->full_subvol_path);
-
-   return 0;
-}
-
-static int print_snapshot(const char *path, const u8 *uuid, u64 ctransid,
- const u8 *parent_uuid, u64 parent_ctransid,
- void *user)
-{
-   struct recv_args *r = user;
-   char uuid_str[BTRFS_UUID_UNPARSED_SIZE];
-   char parent_uuid_str[BTRFS_UUID_UNPARSED_SIZE];
-
-   r->full_subvol_path = path_cat(r->root_path, path);
-   uuid_unparse(uuid, uuid_str);
-   uuid_unparse(parent_uuid, parent_uuid_str);
-
-   printf("snapshot\t%s\t%llu\t%s\t%llu\t%s\n", uuid_str,
-  (unsigned long long)ctransid, parent_uuid_str,
-  (unsigned long long)parent_ctransid, r->full_subvol_path);
-
-   return 0;
-}
-
-static int print_mkfile(const char *path, void *user)
-{
-   struct recv_args *r = user;
-   char *full_path = path_cat(r->full_subvol_path, path);
-
-   printf("mkfile\t%s\n", full_path);
-
-   free(full_path);
-   return 0;
-}
-
-static int print_mkdir(const char *path, void *user)
-{
-   struct recv_args *r = user;
-   char *full_path = path_cat(r->full_subvol_path, path);
-
-   printf("mkdir\t%s\n", full_path);
-
-   free(full_path);
-   return 0;
-}
-
-static int print_mknod(const char *path, u64 mode, u64 dev, void *user)
-{
-   struct recv_args *r = user;
-   char

Re: Fixup direct bi_rw modifiers

2016-08-01 Thread Christoph Hellwig

On Sat, Jul 30, 2016 at 04:45:48PM -0500, Shaun Tancheff wrote:
> bi_rw should be using bio_set_op_attrs to set bi_rw.

Looks fine,

Reviewed-by: Christoph Hellwig 

Jens,

what do you think about renaming bi_rw?  There aren't too many users
left, and any old code that would keep using it is alsmost guranteed
to be broken, so sending a post-rc1 patch to rename it might make
everyone else life easier.  Especially as it's also grossly misnamed
now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Austin S. Hemmelgarn


On 2016-07-30 20:29, Chris Murphy wrote:

On Sat, Jul 30, 2016 at 2:02 PM, Chris Murphy  wrote:

Short version: When systemd-logind login.conf KillUserProcesses=yes,
and the user does "sudo btrfs scrub start" in e.g. GNOME Terminal, and


Same thing with Xfce, so it's not DE specific. (Unsuprising.)

I inflated the size of the test volume, and it seems pretty clear that
the scrub is not completing, as the kernel threads stop sooner when
logging out vs not logging out. So the status reporting an
interruption appears to be valid for the net operation, not merely the
user space tool being interrupted.
You have your terminals set to start the shell as a login shell I'm 
guessing.  That's probably why closing the terminal window is triggering 
systemd's process killing.  It will of course still trigger when you 
close the graphical session though.  Personally, this is yet another 
reason for me to not like systemd.  This setting breaks traditional UNIX 
userspace semantics.


Personally, I'm with Duncan on this one though, if resume works 
correctly, then it's not a bug, just a bad interaction between an 
administrative tool designed for a server and an init system designed 
for a desktop.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 04/17] xfs/122: list the new log redo items

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:46:42PM -0700, Darrick J. Wong wrote:
> List the new log redo items.  These should have stable sizes.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 01/17] xfs/26[34]: remove duplicate tests

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:46:21PM -0700, Darrick J. Wong wrote:
> These two tests were accidentally double-added as xfs/30[78], but the
> newer versions have fixed up helper usage and fewer whitespace
> problems, so nuke the old tests.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/17] xfs: use rmapbt-checking helper

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:46:29PM -0700, Darrick J. Wong wrote:
> Don't open-code _notrun checks for the rmapbt, just use the helper.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 03/17] xfs/310: fix the size calculation for the huge device

2016-08-01 Thread Christoph Hellwig

On Thu, Jul 21, 2016 at 04:46:35PM -0700, Darrick J. Wong wrote:
> Fix the calculation of the dmhuge size.  The previous calculation
> tried to calculate the size correctly, but got it wrong for 1k
> block sizes.  Therefore, clean the whole mess up.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: A lot warnings in dmesg while running thunderbird

2016-08-01 Thread Chris Mason


On 07/24/2016 08:36 PM, Dave Chinner wrote:

On Fri, Jul 08, 2016 at 12:02:35PM -0400, Chris Mason wrote:

Can you please run the attached test program:

gcc -o short-write short-write.c -lpthread
./short-write some-new-file-on-btrfs


Hi Chris, this seems like a useful thing to be testing on a regular
basis - can you turn this into an xfstests regression test and
submit it?



[ vacation backlog delay apologies ]

Hi Dave,

The test is just a thread constantly madvising away the memory being 
used for IO.  It would be interesting to add a --evil option to either 
xfs_io or fsx so the evil is compounded across all of the existing xfstests.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: qcow2 becomes 37P in size while qemu crashes

2016-08-01 Thread Chris Mason




On 07/23/2016 04:05 PM, Chris Murphy wrote:

Using btrfs-debug-tree, I'm finding something a bit odd about some of
the items in this 39P file.

Seems normal

item 71 key (994163 EXTENT_DATA 43689967616) itemoff 12467 itemsize 53
extent data disk byte 617349906432 nr 80805888
extent data offset 0 nr 80805888 ram 80805888
extent compression(none)

Seems not normal

item 58 key (994163 EXTENT_DATA 38345535488) itemoff 13156 itemsize 53
extent data disk byte 0 nr 0
extent data offset 394752000 nr 61440 ram 34626881742770176
extent compression(none)

There are quite a large number of items that take the 2nd form, and in
each case the ram value is the same as above.




Can't really blame a bit flip for that one.  Looks like our hole 
truncation math has gone crazy there.  I'll see what I can find, but 
please yell if you can reproduce.


Thanks!

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

On Mon, Aug 1, 2016 at 9:46 AM, Chris Murphy  wrote:

> - The kernel workers are killed off within ~5 seconds of an
> uninterrupted scrub.

i.e. the kernel workers are doing the same work. They aren't being
killed sooner as a result of logging out from the DE. The only
apparent change from logging out from the DE from which the scrub was
issued, is the btrfs process becomes status Z. It is in fact not being
killed, which itself is kinda interesting/unexpected.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

OK I've created a new volume that's sufficiently large I can tell if
the kernel workers doing the scrub are also being killed off. First, I
do a scrub without logging out to get a time for an uninterrupted
scrub. And then initiate a scrub which I start timing, but then logout
of the DE and watch for the kernel workers to stop.

- The kernel workers are killed off within ~5 seconds of an
uninterrupted scrub. Conclusion is the scrub is still happening by the
kernel.
- The btrfs process for the scrub isn't killed either, it's just
status Z for the entire length of the scrub.
- While this scrubbing is happening, issuing a 'btrfs scrub status'
gets me consistently stale information. It's the same information from
the moment the DE was logged out.

[root@localhost ~]# btrfs scrub status /mnt/x
scrub status for 9f9e5e1f-8d5a-44a0-8f69-8a393fb7ff3c
scrub started at Mon Aug  1 09:29:59 2016, running for 00:00:15
total bytes scrubbed: 3.06GiB with 0 errors

Even a minute later this information is the same.

Once the zombie btrfs process dies off, and the kernel workers stop
working, I get this bogus status information:

[root@localhost ~]# btrfs scrub status /mnt/x
scrub status for 9f9e5e1f-8d5a-44a0-8f69-8a393fb7ff3c
scrub started at Mon Aug  1 09:29:59 2016, interrupted after
00:00:15, not running
total bytes scrubbed: 3.06GiB with 0 errors


Only the user process was interrupted. Not the scrub. Looks like only
the user process is writing out the statistics and status, so once it
goes zombie, there's no accounting, rather than accounting being done
independently via sysfs.

Can I resume this scrub? Yes. But that's also bogus because there
really isn't anything to resume. All that work was done already, it
just hasn't been accounted for.

So whether you want to call this a bug, or deeply suboptimal behavior,
I think that's splitting hairs. Neither mdadm nor LVM scrubs are
affected by this logout behavior and systemd killing off user
processes. I always get reliable scrub status information from either
'echo check md/sync_action' or 'lvchange --syncaction check' before
and after logging out of the DE from which the command was issued.

And it's even inconsistent with btrfs replace where it continues to
give me correct status information from a tty shell even though the
replace command was issued in a DE, subsequently logged out of. So
'btrfs scrub' is inconsistent no matter how you look at it. It's a
bug.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Austin S. Hemmelgarn


On 2016-08-01 11:46, Chris Murphy wrote:

OK I've created a new volume that's sufficiently large I can tell if
the kernel workers doing the scrub are also being killed off. First, I
do a scrub without logging out to get a time for an uninterrupted
scrub. And then initiate a scrub which I start timing, but then logout
of the DE and watch for the kernel workers to stop.

- The kernel workers are killed off within ~5 seconds of an
uninterrupted scrub. Conclusion is the scrub is still happening by the
kernel.
This makes sense, systemd is killing based on session ID, and the kernel 
workers have an sid of 0 (I think, it should be whatever the sid of 
kthreadd (PID 2) has).

- The btrfs process for the scrub isn't killed either, it's just
status Z for the entire length of the scrub.
Z means the process is dead, but nothing has called wait() or similar to 
get status info from it, so it was killed, it's just that nothing took 
the body to the morgue yet.

- While this scrubbing is happening, issuing a 'btrfs scrub status'
gets me consistently stale information. It's the same information from
the moment the DE was logged out.
This makes sense, because the userspace component updates this info (and 
that's _all_ it does).


[root@localhost ~]# btrfs scrub status /mnt/x
scrub status for 9f9e5e1f-8d5a-44a0-8f69-8a393fb7ff3c
scrub started at Mon Aug  1 09:29:59 2016, running for 00:00:15
total bytes scrubbed: 3.06GiB with 0 errors

Even a minute later this information is the same.

Once the zombie btrfs process dies off, and the kernel workers stop
working, I get this bogus status information:

[root@localhost ~]# btrfs scrub status /mnt/x
scrub status for 9f9e5e1f-8d5a-44a0-8f69-8a393fb7ff3c
scrub started at Mon Aug  1 09:29:59 2016, interrupted after
00:00:15, not running
total bytes scrubbed: 3.06GiB with 0 errors


Only the user process was interrupted. Not the scrub. Looks like only
the user process is writing out the statistics and status, so once it
goes zombie, there's no accounting, rather than accounting being done
independently via sysfs.

Can I resume this scrub? Yes. But that's also bogus because there
really isn't anything to resume. All that work was done already, it
just hasn't been accounted for.

So whether you want to call this a bug, or deeply suboptimal behavior,
I think that's splitting hairs. Neither mdadm nor LVM scrubs are
affected by this logout behavior and systemd killing off user
processes. I always get reliable scrub status information from either
'echo check md/sync_action' or 'lvchange --syncaction check' before
and after logging out of the DE from which the command was issued.
MD and DM RAID handle this by starting kernel threads to do the scrub. 
They then store the info about the scrub in the array itself, so you can 
query it externally.  If you watch, neither of those commands runs 
longer than it takes to start the operation, so there's nothing for 
systemd to kill.


And it's even inconsistent with btrfs replace where it continues to
give me correct status information from a tty shell even though the
replace command was issued in a DE, subsequently logged out of. So
'btrfs scrub' is inconsistent no matter how you look at it. It's a
bug.

Replace was implemented the way scrub should have been.  It's done 
entirely in the kernel, and the userspace tools just start, stop and 
check status.  We should just get rid of the whole scrub state file crap 
and have a way to query the last scrub status directly from the FS. 
That would fix this particular issue, and make scrub more consistent 
with everything else (and solve the stale scrub status bug too).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

On Mon, Aug 1, 2016 at 10:08 AM, Austin S. Hemmelgarn
 wrote:

>
> MD and DM RAID handle this by starting kernel threads to do the scrub. They
> then store the info about the scrub in the array itself, so you can query it
> externally.  If you watch, neither of those commands runs longer than it
> takes to start the operation, so there's nothing for systemd to kill.

pvmove continues to run and report progress so it can be killed off,
but it only polls for statistics, it's not actually recording them. So
even though it gets killed, subsequent pvmove command shows correct
statistics.

So that makes me wonder how btrfs device add and remove will behave,
if issued in a DE which is then logged out of. Those commands do not
return to prompt until they complete.


> Replace was implemented the way scrub should have been.  It's done entirely
> in the kernel, and the userspace tools just start, stop and check status.
> We should just get rid of the whole scrub state file crap and have a way to
> query the last scrub status directly from the FS. That would fix this
> particular issue, and make scrub more consistent with everything else (and
> solve the stale scrub status bug too).

OK, I'll update the bug report.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

On Mon, Aug 1, 2016 at 10:19 AM, Chris Murphy  wrote:

> So that makes me wonder how btrfs device add and remove will behave,
> if issued in a DE which is then logged out of. Those commands do not
> return to prompt until they complete.

Strike add. That's fast. I'm concerned about dev delete/remove and also resize.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Btrfs: send, don't bug on inconsistent snapshots

2016-08-01 Thread fdmanana

From: Filipe Manana 

When doing an incremental send, if we find a new/modified/deleted extent,
reference or xattr without having previously processed the corresponding
inode item we end up exexuting a BUG_ON(). This is because whenever an
extent, xattr or reference is added, modified or deleted, we always expect
to have the corresponding inode item updated. However there are situations
where this will not happen due to transient -ENOMEM or -ENOSPC errors when
doing delayed inode updates.

For example, when punching holes we can succeed in deleting and modifying
(shrinking) extents but later fail to do the delayed inode update. So after
such failure we close our transaction handle and right after a snapshot of
the fs/subvol tree can be made and used later for a send operation. The
same thing can happen during truncate, link, unlink, and xattr related
operations.

So instead of executing a BUG_ON, make send return an -EIO error and print
an informative error message do dmesg/syslog.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/send.c | 48 +---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2db8dc8..efe129f 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -273,6 +273,39 @@ struct name_cache_entry {
char name[];
 };
 
+static void inconsistent_snapshot_error(struct send_ctx *sctx,
+   enum btrfs_compare_tree_result result,
+   const char *what)
+{
+   const char *result_string;
+
+   switch (result) {
+   case BTRFS_COMPARE_TREE_NEW:
+   result_string = "new";
+   break;
+   case BTRFS_COMPARE_TREE_DELETED:
+   result_string = "deleted";
+   break;
+   case BTRFS_COMPARE_TREE_CHANGED:
+   result_string = "updated";
+   break;
+   case BTRFS_COMPARE_TREE_SAME:
+   ASSERT(0);
+   result_string = "unchanged";
+   break;
+   default:
+   ASSERT(0);
+   result_string = "unexpected";
+   }
+
+   btrfs_err(sctx->send_root->fs_info,
+ "Send: inconsistent snapshot, found %s %s for inode %llu 
without updated inode item, send root is %llu, parent root is %llu",
+ result_string, what, sctx->cmp_key->objectid,
+ sctx->send_root->root_key.objectid,
+ (sctx->parent_root ?
+  sctx->parent_root->root_key.objectid : 0));
+}
+
 static int is_waiting_for_move(struct send_ctx *sctx, u64 ino);
 
 static struct waiting_dir_move *
@@ -5711,7 +5744,10 @@ static int changed_ref(struct send_ctx *sctx,
 {
int ret = 0;
 
-   BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
+   if (sctx->cur_ino != sctx->cmp_key->objectid) {
+   inconsistent_snapshot_error(sctx, result, "reference");
+   return -EIO;
+   }
 
if (!sctx->cur_inode_new_gen &&
sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) {
@@ -5736,7 +5772,10 @@ static int changed_xattr(struct send_ctx *sctx,
 {
int ret = 0;
 
-   BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
+   if (sctx->cur_ino != sctx->cmp_key->objectid) {
+   inconsistent_snapshot_error(sctx, result, "xattr");
+   return -EIO;
+   }
 
if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
if (result == BTRFS_COMPARE_TREE_NEW)
@@ -5760,7 +5799,10 @@ static int changed_extent(struct send_ctx *sctx,
 {
int ret = 0;
 
-   BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
+   if (sctx->cur_ino != sctx->cmp_key->objectid) {
+   inconsistent_snapshot_error(sctx, result, "extent");
+   return -EIO;
+   }
 
if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
if (result != BTRFS_COMPARE_TREE_DELETED)
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: qcow2 becomes 37P in size while qemu crashes

2016-08-01 Thread Chris Murphy

On Mon, Aug 1, 2016 at 9:26 AM, Chris Mason  wrote:
>
>
> On 07/23/2016 04:05 PM, Chris Murphy wrote:
>>
>> Using btrfs-debug-tree, I'm finding something a bit odd about some of
>> the items in this 39P file.
>>
>> Seems normal
>>
>> item 71 key (994163 EXTENT_DATA 43689967616) itemoff 12467 itemsize 53
>> extent data disk byte 617349906432 nr 80805888
>> extent data offset 0 nr 80805888 ram 80805888
>> extent compression(none)
>>
>> Seems not normal
>>
>> item 58 key (994163 EXTENT_DATA 38345535488) itemoff 13156 itemsize 53
>> extent data disk byte 0 nr 0
>> extent data offset 394752000 nr 61440 ram 34626881742770176
>> extent compression(none)
>>
>> There are quite a large number of items that take the 2nd form, and in
>> each case the ram value is the same as above.
>>
>>
>
> Can't really blame a bit flip for that one.  Looks like our hole truncation
> math has gone crazy there.  I'll see what I can find, but please yell if you
> can reproduce.

I've been trying, but no joy. Several bugs are required before hitting this.

User bug: I inadvertently edited the machine using virsh to use the
same backing qcow2 file for vda and vdb.
virsh bug: No warning for this double usage.
virt-manager bug: Starts the VM without warning when two virtio
devices are pointed to the same backing file.

Corruption of the file is inevitable. But exactly when it grew to 37P
I'm not sure, so far reproducing this results in a file limited to the
create time specified file size.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Corruption through subsequent compression?

2016-08-01 Thread Stephan Hopfmüller


Dear list,

I've got into some trouble with my 40GB BTRFS formatted root partition 
in a 2015 Macbook 15": I was running low on free space so I remembered, 
that I had forgotten to enable compression while installing Antergos (a 
flavour of Arch Linux). So I looked into the Arch Wiki at 
https://wiki.archlinux.org/index.php/Btrfs and executed

"btrfs filesystem defragment -r -v -clzo /"
as normal user. I noticed, that it would only compress files inside my 
home folder (which is on the same partition), so I run it again with 
"sudo". After some time I was surprised by windows closing and thought, 
that this might release file handles or something like that (I'm not 
experienced with the inner structure of BTRFS and optimization like 
defragment). I feared what might happen, when the console session, in 
which I was running the btrfs command, would get closed, but Ctrl-C-ing 
it would probably do damage too - so I waited till the window vanished. 
(KDE did already crash previously, so there already weren't any window 
decorations etc.).
After a reboot I tried running the same command again in a login shell 
(by pressing CTRL+ALT+F2) (so no desktop environment which could 
interfere somehow) and at some time it said "no space left on device". I 
started searching and read about metadata being full resulting in this 
error. A quick check resulted in about 75% usage, so I tried a rebalance 
with increasing "dlimit" till it tried moving blocks - this was the case 
at dlimit=12. Again I received the annoying "no space left on device" 
message and deleted some files (pacman cache etc.) and tried it again 
some more times. After some tries the btrfs executable crashed with a 
weird error saying:
"BTRFS: error (device dm-0) in merge_reloc_roots:2421: errno=-28 No 
space left

--[ cut here ]--
kernel BUG at fs/btrfs/volumes.c:3421!
invalid opcode:  [#1] PREEMPT SMP
Modules linked in: [...]"
I took a picture of the screen in case the complete error message might 
be needed.
Now I knew I was in trouble but didn't want to give up. I thought, that 
the compression might have somehow overwritten part of the btrfs 
executable with zeros and tried reinstalling it from the pacman cache, 
but it wasn't possible because of the filesystem being switched to read 
only. After a reboot things were looking even worser: Something like 
"...fixing, but needing a reboot" scrolled to the top and afterwards it 
only kept saying something like "CPU #... is hanging for [something 
about 20s]" and named a process (mount, kworker and some btrfs subprocess).
At this time it was past 2 am and all I wanted was getting some sleep, 
so I just powered it of by holding the power key.
Now, some hours later, I still don't know, whether my filesystem still 
can be fixed or whether I'm better left of nuking it and starting over...
My most important files are placed on a EXFAT formated partition and 
partially synced to some cloud-service, so I would probably only loose 
some time to redownload all applications (and my browser history etc.). 
I was really constrained on space so I didn't had much options for 
backing up the system. Next time I probably should get an external hard 
drive...
Now I'm constrained to use OS X but at least my Macbook is still usable 
(so the SSD still seams to be working).
As I'm not experienced with this kind of situation I am not sure what to 
do - probably doing some investigation out of a live system?


Could somebody please help me in trying to fix my filesystem?

Greetings,
Stephan

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Austin S. Hemmelgarn


On 2016-08-01 12:19, Chris Murphy wrote:

On Mon, Aug 1, 2016 at 10:08 AM, Austin S. Hemmelgarn
 wrote:



MD and DM RAID handle this by starting kernel threads to do the scrub. They
then store the info about the scrub in the array itself, so you can query it
externally.  If you watch, neither of those commands runs longer than it
takes to start the operation, so there's nothing for systemd to kill.


pvmove continues to run and report progress so it can be killed off,
but it only polls for statistics, it's not actually recording them. So
even though it gets killed, subsequent pvmove command shows correct
statistics.
Because all that the pvmove command is doing is polling for statistics. 
It actually works kind of like a scrub, all the actual work is done in 
the kernel, the userspace component just handles reporting.  The 
difference is that the move operation is accounted and mutexed in the 
kernel itself, instead of userspace like scrub does.  This model is 
actually essentially what I think scrub (and balance for that matter) 
should look like, and if implemented right, we could actually store 
scrub results in the FS itself (that is, in the metadata, not in special 
files or anything like that).


So that makes me wonder how btrfs device add and remove will behave,
if issued in a DE which is then logged out of. Those commands do not
return to prompt until they complete.
They work via balance, so they should behave the same as a balance 
command, which means it will likely run part way then get cancelled 
because of the SIGTERM to the userspace component (assuming of course 
that it is still running when you log out).




Replace was implemented the way scrub should have been.  It's done entirely
in the kernel, and the userspace tools just start, stop and check status.
We should just get rid of the whole scrub state file crap and have a way to
query the last scrub status directly from the FS. That would fix this
particular issue, and make scrub more consistent with everything else (and
solve the stale scrub status bug too).


OK, I'll update the bug report.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 06/17] xfs: run xfs_repair at the end of each test

2016-08-01 Thread Darrick J. Wong

On Sun, Jul 31, 2016 at 11:27:19PM -0700, Christoph Hellwig wrote:
> On Thu, Jul 21, 2016 at 04:46:54PM -0700, Darrick J. Wong wrote:
> > Run xfs_repair twice at the end of each test -- once to rebuild
> > the btree indices, and again with -n to check the rebuild work.
> 
> This looks fine to me in general, but shouldn't we have specific
> tests that test the rebuilding in a normal auto run?

We do have specific tests that examine the outputs of rebuilding the
indices (all the fuzzer group tests do this too); this patch enables a
test runner to expand that coverage to all tests.  Running a
rebuilding xfs_repair for all the tests shook out some bugs in the
xfs_repair rmap handling code that only triggered under some of the
non-rmap non-reflink stressor tests.

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

On Mon, Aug 1, 2016 at 10:58 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-08-01 12:19, Chris Murphy wrote:
>>
>> On Mon, Aug 1, 2016 at 10:08 AM, Austin S. Hemmelgarn
>>  wrote:
>>
>>>
>>> MD and DM RAID handle this by starting kernel threads to do the scrub.
>>> They
>>> then store the info about the scrub in the array itself, so you can query
>>> it
>>> externally.  If you watch, neither of those commands runs longer than it
>>> takes to start the operation, so there's nothing for systemd to kill.
>>
>>
>> pvmove continues to run and report progress so it can be killed off,
>> but it only polls for statistics, it's not actually recording them. So
>> even though it gets killed, subsequent pvmove command shows correct
>> statistics.
>
> Because all that the pvmove command is doing is polling for statistics. It
> actually works kind of like a scrub, all the actual work is done in the
> kernel, the userspace component just handles reporting.  The difference is
> that the move operation is accounted and mutexed in the kernel itself,
> instead of userspace like scrub does.  This model is actually essentially
> what I think scrub (and balance for that matter) should look like, and if
> implemented right, we could actually store scrub results in the FS itself
> (that is, in the metadata, not in special files or anything like that).
>>
>>
>> So that makes me wonder how btrfs device add and remove will behave,
>> if issued in a DE which is then logged out of. Those commands do not
>> return to prompt until they complete.
>
> They work via balance, so they should behave the same as a balance command,
> which means it will likely run part way then get cancelled because of the
> SIGTERM to the userspace component (assuming of course that it is still
> running when you log out).

I've been using balance with &, and when I logout, the btrfs command
continues to flip between status D and R, just like before logout and
it appears to complete. I still get status messages of the balance
after logout, in kernel messages.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Austin S. Hemmelgarn


On 2016-08-01 13:15, Chris Murphy wrote:

On Mon, Aug 1, 2016 at 10:58 AM, Austin S. Hemmelgarn
 wrote:

On 2016-08-01 12:19, Chris Murphy wrote:


On Mon, Aug 1, 2016 at 10:08 AM, Austin S. Hemmelgarn
 wrote:



MD and DM RAID handle this by starting kernel threads to do the scrub.
They
then store the info about the scrub in the array itself, so you can query
it
externally.  If you watch, neither of those commands runs longer than it
takes to start the operation, so there's nothing for systemd to kill.



pvmove continues to run and report progress so it can be killed off,
but it only polls for statistics, it's not actually recording them. So
even though it gets killed, subsequent pvmove command shows correct
statistics.


Because all that the pvmove command is doing is polling for statistics. It
actually works kind of like a scrub, all the actual work is done in the
kernel, the userspace component just handles reporting.  The difference is
that the move operation is accounted and mutexed in the kernel itself,
instead of userspace like scrub does.  This model is actually essentially
what I think scrub (and balance for that matter) should look like, and if
implemented right, we could actually store scrub results in the FS itself
(that is, in the metadata, not in special files or anything like that).



So that makes me wonder how btrfs device add and remove will behave,
if issued in a DE which is then logged out of. Those commands do not
return to prompt until they complete.


They work via balance, so they should behave the same as a balance command,
which means it will likely run part way then get cancelled because of the
SIGTERM to the userspace component (assuming of course that it is still
running when you log out).


I've been using balance with &, and when I logout, the btrfs command
continues to flip between status D and R, just like before logout and
it appears to complete. I still get status messages of the balance
after logout, in kernel messages.

Interesting, maybe balance is explicitly white-listed?  Either that, or 
it just ignores whatever signal systemd uses to kill stuff in this 
context (I initially thought SIGTERM, but SIGHUP would make more sense 
in this context), which wouldn't surprise me either.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Fixup direct bi_rw modifiers

2016-08-01 Thread Jens Axboe


On 08/01/2016 09:17 AM, Jens Axboe wrote:

On 08/01/2016 05:47 AM, Christoph Hellwig wrote:

On Sat, Jul 30, 2016 at 04:45:48PM -0500, Shaun Tancheff wrote:

bi_rw should be using bio_set_op_attrs to set bi_rw.


Looks fine,

Reviewed-by: Christoph Hellwig 


Added, thanks Shaun.


Jens,

what do you think about renaming bi_rw?  There aren't too many users
left, and any old code that would keep using it is alsmost guranteed
to be broken, so sending a post-rc1 patch to rename it might make
everyone else life easier.  Especially as it's also grossly misnamed
now.


I was planning on doing that, after -rc1. Much better to get build
breakage, than potentially much worse breakage.


Set of three patches, where the target one is an actual bug fix...
Temporary branch, I'll rebase it once -rc1 is out, if more
changes/fixups need to be made in the next week until that happens.

http://git.kernel.dk/cgit/linux-block/log/?h=for-4.8/bi_rwf

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Fixup direct bi_rw modifiers

2016-08-01 Thread Jens Axboe


On 08/01/2016 05:47 AM, Christoph Hellwig wrote:

On Sat, Jul 30, 2016 at 04:45:48PM -0500, Shaun Tancheff wrote:

bi_rw should be using bio_set_op_attrs to set bi_rw.


Looks fine,

Reviewed-by: Christoph Hellwig 


Added, thanks Shaun.


Jens,

what do you think about renaming bi_rw?  There aren't too many users
left, and any old code that would keep using it is alsmost guranteed
to be broken, so sending a post-rc1 patch to rename it might make
everyone else life easier.  Especially as it's also grossly misnamed
now.


I was planning on doing that, after -rc1. Much better to get build
breakage, than potentially much worse breakage.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

Aug 01 09:29:59 localhost.localdomain sudo[1875]:chris : TTY=pts/1
; PWD=/home/chris ; USER=root ; COMMAND=/sbin/btrfs scrub start /mnt/x
Aug 01 09:30:16 localhost.localdomain systemd[1]: user@1000.service:
Killing process 1883 (btrfs) with signal SIGKILL.
Aug 01 09:43:34 localhost.localdomain sudo[2574]:chris : TTY=pts/1
; PWD=/home/chris ; USER=root ; COMMAND=/sbin/btrfs scrub start /mnt/x
Aug 01 09:43:53 localhost.localdomain systemd[1]: user@1000.service:
Killing process 2579 (btrfs) with signal SIGKILL.
Aug 01 11:41:00 localhost.localdomain sudo[3479]:chris : TTY=pts/1
; PWD=/home/chris ; USER=root ; COMMAND=/sbin/btrfs fi show
Aug 01 11:41:04 localhost.localdomain sudo[3492]:chris : TTY=pts/1
; PWD=/home/chris ; USER=root ; COMMAND=/sbin/btrfs balance start
/mnt/x
Aug 01 11:41:14 localhost.localdomain kernel: BTRFS info (device vdb):
relocating block group 24800919552 flags 130
Aug 01 11:41:14 localhost.localdomain kernel: BTRFS info (device vdb):
relocating block group 23727177728 flags 132
Aug 01 11:41:14 localhost.localdomain kernel: BTRFS info (device vdb):
found 6 extents
Aug 01 11:41:14 localhost.localdomain kernel: BTRFS info (device vdb):
relocating block group 21512585216 flags 129
Aug 01 11:41:24 localhost.localdomain kernel: BTRFS info (device vdb):
found 27447 extents
Aug 01 11:41:26 localhost.localdomain kernel: BTRFS info (device vdb):
found 27446 extents
Aug 01 11:41:26 localhost.localdomain kernel: BTRFS info (device vdb):
relocating block group 19365101568 flags 129
Aug 01 11:41:36 localhost.localdomain systemd[1]: user@1000.service:
Killing process 3499 (btrfs) with signal SIGKILL.


It's using SIGKILL. The process goes Z for scrub but nothing happens
for balance. Weird. It's definitely not exempt though. Hmm, when I
don't filter the journal for btrfs...

Aug 01 11:54:38 localhost.localdomain systemd[3623]: Starting Exit the
Session...
Aug 01 11:54:38 localhost.localdomain systemd[3623]: Received
SIGRTMIN+24 from PID 4269 (kill).
Aug 01 11:54:38 localhost.localdomain systemd[1]: user@1000.service:
Killing process 4206 (sudo) with signal SIGKILL.
Aug 01 11:54:38 localhost.localdomain systemd[1]: user@1000.service:
Killing process 4213 (btrfs) with signal SIGKILL.
Aug 01 11:54:38 localhost.localdomain systemd[1]: Stopped User Manager
for UID 1000.
Aug 01 11:54:38 localhost.localdomain audit[1]: SERVICE_STOP pid=1
uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=user@1000 comm="systemd" exe="/usr/lib/systemd/systemd"
hostname=? addr=? terminal=? res=success'
Aug 01 11:54:38 localhost.localdomain systemd[1]: Removed slice User
Slice of chris.

In this case PID 4213 is the process that's still flipping between
status R and D. Kill is sent, but ignored. But not for scrubbing...


Aug 01 11:58:21 localhost.localdomain systemd[4294]: Starting Exit the
Session...
Aug 01 11:58:21 localhost.localdomain systemd[4294]: Received
SIGRTMIN+24 from PID 4933 (kill).
Aug 01 11:58:21 localhost.localdomain systemd[4301]:
pam_unix(systemd-user:session): session closed for user chris
Aug 01 11:58:21 localhost.localdomain systemd[1]: user@1000.service:
Killing process 4866 (btrfs) with signal SIGKILL.
Aug 01 11:58:21 localhost.localdomain systemd[1]: Stopped User Manager
for UID 1000.
Aug 01 11:58:21 localhost.localdomain audit[1]: SERVICE_STOP pid=1
uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=user@1000 comm="systemd" exe="/usr/lib/systemd/systemd"
hostname=? addr=? terminal=? res=success'
Aug 01 11:58:21 localhost.localdomain systemd[1]: Removed slice User
Slice of chris.


Must have something to do with the use of & with balance, which scrub
doesn't need to go to the background.




Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: FIDEDUPERANGE with src_length == 0

2016-08-01 Thread Darrick J. Wong

On Thu, Jul 14, 2016 at 11:16:47AM -0700, Omar Sandoval wrote:
> On Thu, Jul 14, 2016 at 02:12:58PM -0400, Chris Mason wrote:
> > 
> > 
> > On 07/14/2016 02:06 PM, Darrick J. Wong wrote:
> > > On Wed, Jul 13, 2016 at 03:19:38PM +0200, David Sterba wrote:
> > > > On Tue, Jul 12, 2016 at 10:26:43PM -0700, Darrick J. Wong wrote:
> > > > > On Mon, Jul 11, 2016 at 05:35:37PM -0700, Omar Sandoval wrote:
> > > > > > Hey, Darrick,
> > > > > > 
> > > > > > generic/182 is failing on Btrfs for me with the following output:
> > > > > > 
> > > > > > --- tests/generic/182.out   2016-07-07 19:51:54.0 -0700
> > > > > > +++ /tmp/fixxfstests/xfstests/results//generic/182.out.bad  
> > > > > > 2016-07-11 17:28:28.230039216 -0700
> > > > > > @@ -1,12 +1,10 @@
> > > > > >  QA output created by 182
> > > > > >  Create the original files
> > > > > > -dedupe: Extents did not match.
> > > > > >  f4820540fc0ac02750739896fe028d56  TEST_DIR/test-182/file1
> > > > > >  69ad53078a16243d98e21d9f8704a071  TEST_DIR/test-182/file2
> > > > > >  69ad53078a16243d98e21d9f8704a071  TEST_DIR/test-182/file2.chk
> > > > > >  Compare against check files
> > > > > >  Make the original file almost dedup-able
> > > > > > -dedupe: Extents did not match.
> > > > > >  f4820540fc0ac02750739896fe028d56  TEST_DIR/test-182/file1
> > > > > >  158d4e3578b94b89cbb44493a2110fb9  TEST_DIR/test-182/file2
> > > > > >  158d4e3578b94b89cbb44493a2110fb9  TEST_DIR/test-182/file2.chk
> > > > > > 
> > > > > > It looks like that test is checking that a dedupe with length == 0 
> > > > > > is
> > > > > > treated as a dedupe to EOF, but Btrfs doesn't do that [1]. As far 
> > > > > > as I
> > > > > > can tell, it never did, but maybe I'm just confused. What was the
> > > > > > behavior when you introduced that test? That seems like a reasonable
> > > > > > thing to do, but I wanted to clear this up before changing/fixing 
> > > > > > Btrfs.
> > > > > 
> > > > > It's a shortcut that we're introducing in the upcoming XFS 
> > > > > implementation,
> > > > > since it shares the same back end as clone/clonerange, which both have
> > > > > this behavior.
> > > > 
> > > > The support for zero length does not seem to be mentioned anywhere with
> > > > the dedupe range ioctl [1], so the current implemetnation is "up to
> > > > spec". That it should be valid is hidden in clone_verify_area where a
> > > > zero length is substituted with OFFSET_MAX
> > > > 
> > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__lxr.free-2Delectrons.com_source_fs_read-5Fwrite.c-23L1607=CwIBAg=5VD0RTtNlTh3ycd41b3MUw=9QPtTAxcitoznaWRKKHoEQ=CKo3CgE8Up_NBDdC9t7fCuwHwsdf6nZG2nKcl5-NqnI=ZymMvbZ2mZOYBKya3guibggSaaqOHZUqedhz0pT5PPc=
> > > > 
> > > > So it looks like it's up to the implementation in the filesystem to
> > > > handle that. As the btrfs ioctl was extent-based, a zero length extent
> > > > does not make sense, so this case was not handled. But in your patch
> > > > 
> > > > 2b3909f8a7fe94e0234850aa9d120cca15b6e1f7
> > > > btrfs: use new dedupe data function pointer
> > > > 
> > > > it was suddenly expected to work. So the missing bits are either 'not
> > > > supported' for zero length or actually implement iteration over the
> > > > whole file.
> > > > 
> > > > [1] 
> > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mankier.com_2_ioctl-5Ffideduperange=CwIBAg=5VD0RTtNlTh3ycd41b3MUw=9QPtTAxcitoznaWRKKHoEQ=CKo3CgE8Up_NBDdC9t7fCuwHwsdf6nZG2nKcl5-NqnI=NYdHr9JyZZNKPLsOf_VmtZ-3X2B1azTYfyE4Lf1Fa5w=
> > > 
> > > Well, we can't change the semantics now because there could be programs 
> > > that
> > > aren't expecting a nonzero return from a length == 0 dedupe, so like 
> > > Christoph
> > > said, I'll just change generic/182 and make the VFS wrapper emulate the 
> > > btrfs
> > > behavior so that any subsequent implementation won't hit this.
> > > 
> > > I'll update the clone/clonerange manpages to mention the 0 -> EOF 
> > > behavior.
> > 
> > Its fine with me if we change btrfs to do the 0->EOF.  It's a corner case
> > I'm happy to include.
> > 
> > -chris
> 
> Yeah, I think it's a nice shortcut. Are there any programs which
> wouldn't want this, though? It's a milder sort of correctness problem
> since dedupe is "safe", but maybe there's some tool which is being dumb
> and trying to dedupe nothing.

The only users of extent-same that I know of are duperemove and
xfs_io.  duperemove doesn't seem to send zero-length dedupe requests.
xfs_io will if you tell it to, but the xfs_io feature is there
primarily to run xfstests.

Christoph has a valid point that we don't know the full set of users,
so we could be breaking them by changing this aspect.  OTOH this was
an undocumented ioctl for quite a while...

--D

> 
> -- 
> Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

On Mon, Aug 1, 2016 at 11:19 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-08-01 13:15, Chris Murphy wrote:

>> I've been using balance with &, and when I logout, the btrfs command
>> continues to flip between status D and R, just like before logout and
>> it appears to complete. I still get status messages of the balance
>> after logout, in kernel messages.
>>
> Interesting, maybe balance is explicitly white-listed?  Either that, or it
> just ignores whatever signal systemd uses to kill stuff in this context (I
> initially thought SIGTERM, but SIGHUP would make more sense in this
> context), which wouldn't surprise me either.

I'm not aware of any program specific white listing method with
KillUserProcesses=yes. However, there is KillExcludeUsers which by
default is KillExcludeUsers=root. Everything I run as sudo appears in
top and ps as use root. So are these processes exempt? And if so, why
is btrfs scrub becoming a zombie process? I don't know if it's
appropriate, but I asked about it (no response yet), whether all
things sudo should just be moved out of the user session. In my own
head I don't associate sudo commands with my user or my user session,
and at least top and ps agree with the former, so why not have sudo'd
processes put in a different scope from the outset?

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: systemd KillUserProcesses=yes and btrfs scrub

2016-08-01 Thread Chris Murphy

All of these have status R and D for their duration, and while all get
a SIGKILL from systemd on logout, none of the processes change status
or die until their kernel task is done. And each of these operations
complete successfully with no worse for the wear.

btrfs balance &
btrfs dev rem &
btrfs replace start

Only 'btrfs scrub' has status S, and once it gets SIGKILL, it goes Z
and all of its accounting is wrong. But the kernel tasks continue and
appear to complete.

I did all of this with a btrfs raid5, 3 and 4 disks, in a libvirt VM.

---
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Btrfs send to send out metadata and data separately

2016-08-01 Thread Filipe Manana

On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo  wrote:
> Hi Filipe, and maintainers,
>
> I'm recently working on the root fix to free send from calling backref walk.
>
> My current idea is to send data and metadata separately, and only do clone
> detection inside the send subvolume.
>
> This method needs two new send commands:
> (And new send attribute, A_DATA_BYTENR)
> 1) SEND_C_DATA
>much like SEND_C_WRITE, with a little change in the 1st TLV.
>
>TLVs:
>A_DATA_BYTENR:bytenr of the data extent
>A_FILE_OFFSET:offset inside the data extent
>A_DATA:   real data
>
> 2) SEND_C_CLONE_DATA
>A little like SEND_C_CLONE, with unneeded parameters striped
>
>TLVs:
>A_PATH:   filename
>A_DATA_BYTENR:disk_bytenr of the EXTENT_DATA
>A_FILE_OFFSET:file offset
>A_FILE_OFFSET:offset inside the EXTENT_DATA
>A_CLONE_LEN:  num_bytes of the EXTENT_DATA
>
>
> The send part is different in how to sending out a EXTENT_DATA.
> The send work follow is:
>
> 1) Found a EXTENT_DATA to send.
>Check rb_tree of "disk_bytenr".
>if "disk_bytenr" in rb_tree
>  goto 2) Reflink data
>/* Initiate a SEND_C_DATA */
>Send out the *whole* *uncompressed* extent of "disk_bytenr".
>Adds "disk_bytenr" into rb_tree
>
>
> 2) Reflink data
>/* Initiate a SEND_C_CLONE_DATA */
>Filling disk_bytenr, offset and num_bytes, and send out the command.
>
> That's to say, send will send out extent data and referencer separately.
>
> So for kernel part, it's quite easy and *NO* time consuming backref walk
> ever.
> And no other part is modified.
>
>
> The main trick happens in the receive part.
>
> Receive will do the following thing first before recovering the
> subvolume/snapshot:
>
> 0) Create temporary dir for data extents
>Create a new dir with temporary name($data_extent), to put data extents
> into it.
>
> Then for SEND_C_DATA command:
> 1) Create file with file name $filename under $data_extent dir
>filename = $(printf "0x%x" $disk_bytenr)
>$disk_bytenr is the first u64 TLV of SEND_A_DATA command.
> 2) Write data into $data_extent/$filename
>
> Then handle the SEND_C_CLONE_DATA command
> It would be like
>   xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
> $file_offset $num_bytes" $filename
> disk_bytenr=2nd TLV (string converted to u64, with "0x%x")
> extent_offset=3rd TLV, u64
> file_offset=4th TLV, u64
> num_bytes=5th TLV, u64
> filename=1th TLV, string
>
> Finally, after the snapshot/subvolume is recovered, remove the $data_extent
> directory.
>
>
> The whole idea is to completely remove the time consuming backref walk in
> send.
>
> So pros:
> 1) No backref walk, no soft lockup, no super long execution time
>Under worst case O(N^2), best case O(N)
>Memory usage worst case O(N), best case O(1)
>Where N is the number of reference to extents.
>
> 2) Almost the same metadata layout
>Including the overlap extents
>
> Cons:
> 1) Not full fs clone detection
>Such clone detection is only inside the send snapshot.
>
>For case that one extent is referred only once in the send snapshot,
>but also referred by source subvolume, then in the received
>subvolume, it will be a new extent, but not a clone.
>
>Only extent that is referred twice by send snapshot, that extent
>will be shared.
>
>(Although much better than disabling the whole clone detection)
> 2) Extra space usage
>Since it completely recovers the overlap extents
> 3) As many fragments as source subvolume
> 4) Possible slow recovery due to reflink speed.
>
>
> I am still concerned about the following problems:
>
> 1) Is it OK to add not only 1, but 2 new send commands?
> 2) Is such clone detection range change OK?
>
> Any ideas and suggestion is welcomed.


Qu,

I don't like the idea at all, for several reasons:

1) Too complex to implement. We should really avoid making things more
complex than they are already.
   Your earlier suggestion to cache backref lookups is much simpler
and solves the problem for the vast majority of cases (assuming a
bounded cache of course).
There's really no need for such high complexity.

2) By adding new commands to the stream, you break backwards compatibility.
   Think about all the tools out there that interpret send streams and
not just the receive command (for example snapper).

3) By requiring a new different behaviour for the receiver, suddenly
older versions of it will no longer be able to receive from new
kernels.

4) By keeping temporary files on the receiver end that contains whole
extents, you're creating periods of time where stale data is exposed.

Thanks.

>
> Thanks,
> Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David

Re: Btrfs send to send out metadata and data separately

2016-08-01 Thread Qu Wenruo




At 08/02/2016 02:00 AM, Filipe Manana wrote:

On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo  wrote:

Hi Filipe, and maintainers,

I'm recently working on the root fix to free send from calling backref walk.

My current idea is to send data and metadata separately, and only do clone
detection inside the send subvolume.

This method needs two new send commands:
(And new send attribute, A_DATA_BYTENR)
1) SEND_C_DATA
   much like SEND_C_WRITE, with a little change in the 1st TLV.

   TLVs:
   A_DATA_BYTENR:bytenr of the data extent
   A_FILE_OFFSET:offset inside the data extent
   A_DATA:   real data

2) SEND_C_CLONE_DATA
   A little like SEND_C_CLONE, with unneeded parameters striped

   TLVs:
   A_PATH:   filename
   A_DATA_BYTENR:disk_bytenr of the EXTENT_DATA
   A_FILE_OFFSET:file offset
   A_FILE_OFFSET:offset inside the EXTENT_DATA
   A_CLONE_LEN:  num_bytes of the EXTENT_DATA


The send part is different in how to sending out a EXTENT_DATA.
The send work follow is:

1) Found a EXTENT_DATA to send.
   Check rb_tree of "disk_bytenr".
   if "disk_bytenr" in rb_tree
 goto 2) Reflink data
   /* Initiate a SEND_C_DATA */
   Send out the *whole* *uncompressed* extent of "disk_bytenr".
   Adds "disk_bytenr" into rb_tree


2) Reflink data
   /* Initiate a SEND_C_CLONE_DATA */
   Filling disk_bytenr, offset and num_bytes, and send out the command.

That's to say, send will send out extent data and referencer separately.

So for kernel part, it's quite easy and *NO* time consuming backref walk
ever.
And no other part is modified.


The main trick happens in the receive part.

Receive will do the following thing first before recovering the
subvolume/snapshot:

0) Create temporary dir for data extents
   Create a new dir with temporary name($data_extent), to put data extents
into it.

Then for SEND_C_DATA command:
1) Create file with file name $filename under $data_extent dir
   filename = $(printf "0x%x" $disk_bytenr)
   $disk_bytenr is the first u64 TLV of SEND_A_DATA command.
2) Write data into $data_extent/$filename

Then handle the SEND_C_CLONE_DATA command
It would be like
  xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
$file_offset $num_bytes" $filename
disk_bytenr=2nd TLV (string converted to u64, with "0x%x")
extent_offset=3rd TLV, u64
file_offset=4th TLV, u64
num_bytes=5th TLV, u64
filename=1th TLV, string

Finally, after the snapshot/subvolume is recovered, remove the $data_extent
directory.


The whole idea is to completely remove the time consuming backref walk in
send.

So pros:
1) No backref walk, no soft lockup, no super long execution time
   Under worst case O(N^2), best case O(N)
   Memory usage worst case O(N), best case O(1)
   Where N is the number of reference to extents.

2) Almost the same metadata layout
   Including the overlap extents

Cons:
1) Not full fs clone detection
   Such clone detection is only inside the send snapshot.

   For case that one extent is referred only once in the send snapshot,
   but also referred by source subvolume, then in the received
   subvolume, it will be a new extent, but not a clone.

   Only extent that is referred twice by send snapshot, that extent
   will be shared.

   (Although much better than disabling the whole clone detection)
2) Extra space usage
   Since it completely recovers the overlap extents
3) As many fragments as source subvolume
4) Possible slow recovery due to reflink speed.


I am still concerned about the following problems:

1) Is it OK to add not only 1, but 2 new send commands?
2) Is such clone detection range change OK?

Any ideas and suggestion is welcomed.





Hi Filipe,

Thanks for your comment, it helps to refine the idea to fix the problem.

New idea is stated at the ending of the mail, hopes it will address all 
your conern.



Qu,

I don't like the idea at all, for several reasons:

1) Too complex to implement. We should really avoid making things more
complex than they are already.


Yes, new command new TLV new receiver behavior, the whole idea itself is 
complex.


But the core function for clone detection is simple.
Rb_tree for sent extent bytenr, and avoid sending sent extents.
At least it avoids doing expensive backref walk at all.

My new idea will keep the core function, while stripe all the new 
command/TLV/receiver behavior.



   Your earlier suggestion to cache backref lookups is much simpler
and solves the problem for the vast majority of cases (assuming a
bounded cache of course).


In fact, my earlier suggestion is not to cache backref walk result, but 
just like this one, implement a internal, simpler backref mapper.


The biggest problem of backref cache is, it conflicts with snapshot.
Any snapshot will easily trash backrefs of a tree.

It means either we do a full tree walk to trash all backref cache, 
making snapshot much slower, or a broken cache.

(And it adds more

40 matches

Mail list logo