Re: How to recover a filesystem without formatting nor using the btrfs check command.

2016-10-24 Thread Qu Wenruo



At 10/25/2016 01:54 AM, none wrote:

So do you mean lowmem is also low cpu?


Not sure, but lowmem is high IO.
And by design, it won't cause a dead loop unless there is a looping tree 
block. But that will be detected by check_tree_block().


So, it just avoids any possible dead loop AFAIK.


Indeed, here's the output if the metadata image isn't enough (it
terminates correctly with the --lowmem option). I must admit that even
without the --repair option, btrfs check hangs.


I just forgot you have uploaded the image dump.
I'll check it.

But according to the lowmem output, it seems your whole extent tree is 
screwed up; maybe that's the cause of the problem?


Thanks,
Qu


regards,

On 24/10/2016 at 03:15, Qu Wenruo wrote:


You could try to use --mode lowmem, which never uses any loop to get the
next block, but instead iterates the trees.
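
For example, something like:

  # btrfs check --mode lowmem /dev/mapper/isw_bdffeeeijj_Volume0p7

(without --repair this is a read-only check).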

Currently in mainline btrfs-progs, the low memory mode code only checks
the extent/chunk trees; file/subvolume trees are still checked by the
original mode.

You could try the devel branch from David, which now contains the full
low memory mode check code:
https://github.com/kdave/btrfs-progs/tree/devel

Although low memory mode doesn't support repair yet, it would give us
enough info on what's corrupted, so we can later fix it by hand or
enhance the original mode.

Thanks,
Qu

At 10/24/2016 03:42 AM, none wrote:


Hello,
I have the following bug
https://bugzilla.kernel.org/show_bug.cgi?id=178781 in btrfs check. Is
there a way to recover my filesystem to a clean state without formatting
or using btrfsck? Of course, the point is that I no longer need the files
which are damaged.
So is there a way to recover a btrfs filesystem by deleting the
corrupted data instead of trying to restore it?

btrfs fi df /mnt/Opera_Mobile_Emulator_12.1_Linux
Data, single: total=66.01GiB, used=0.00B
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=5.00GiB, used=28.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=4.00MiB, used=0.00B

btrfs progs version 4.7.3 from Devuan

Label: 'backup'  uuid: 56040bbb-ed5c-47f2-82e2-34457bd7b4f3
Total devices 1 FS bytes used 44.00KiB
devid1 size 298.91GiB used 76.04GiB path
/dev/mapper/isw_bdffeeeijj_Volume0p7

uname -a
Linux localhost 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
(2016-04-20) x86_64 GNU/Linux

Result of btrfs-image on /dev/mapper/isw_bdffeeeijj_Volume0p7 :
https://web.archive.org/web/20161020220914/https://filebin.net/7ni8kfpog1dxw4jc/btrfs-image_capture.xz


Thanks,







Re: [PATCH v13.1 00/14] Btrfs In-band De-duplication

2016-10-24 Thread Qu Wenruo



At 10/25/2016 01:46 AM, David Sterba wrote:

On Thu, Oct 20, 2016 at 10:03:39AM +0800, Qu Wenruo wrote:

Qu Wenruo (4):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON
  btrfs: dedupe: Introduce new reconfigure ioctl

Wang Xiaoguang (10):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: dedupe: Add ioctl for inband deduplication
  btrfs: improve inode's outstanding_extents computation


I've noticed that this patch is different from what Wang sent
independently. Is the patch necessary for the dedupe patchset? If not,
please drop it.


Yes, Wang is still working on the ENOSPC fix, which doesn't only affect 
dedupe, but also compression.




The merge conflict is in the checks for free space inode,

  if (root == root->fs_info->tree_root)

vs

  if (!btrfs_is_free_space_inode(...))

If you need the patch, then please update it to the latest version. Then
at least I won't get the merge conflicts, which is acceptable for the
testing 'for-next' branch.


I'll re-order the patches, maybe put his ENOSPC fixes at the end or 
beginning.


Since Wang hopes to fix ENOSPC for both compression and dedupe, the 
patchset rebase will be blocked for several days.


Thanks,
Qu




[PATCH v2 3/4] btrfs-progs: raid56: Introduce new function to calculate raid5 parity or data stripe

2016-10-24 Thread Qu Wenruo
Introduce new function raid5_gen_result() to calculate parity or data
stripe.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
Changelog:
v2:
  None
---
 disk-io.h |  1 +
 raid56.c  | 63 +++
 2 files changed, 64 insertions(+)

diff --git a/disk-io.h b/disk-io.h
index 8ab36aa..245626c 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -192,5 +192,6 @@ int write_and_map_eb(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 
 /* raid56.c */
 void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
+int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);
 
 #endif
diff --git a/raid56.c b/raid56.c
index 833df5f..8c79c45 100644
--- a/raid56.c
+++ b/raid56.c
@@ -26,6 +26,8 @@
 #include "kerncompat.h"
 #include "ctree.h"
 #include "disk-io.h"
+#include "volumes.h"
+#include "utils.h"
 
 /*
  * This is the C data type to use
@@ -107,3 +109,64 @@ void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs)
}
 }
 
+static void xor_range(char *dst, const char *src, size_t size)
+{
+   /* Move to DWORD aligned */
+   while (size && ((unsigned long)dst & (sizeof(unsigned long) - 1))) {
+   *dst++ ^= *src++;
+   size--;
+   }
+
+   /* DWORD aligned part */
+   while (size >= sizeof(unsigned long)) {
+   *(unsigned long *)dst ^= *(unsigned long *)src;
+   src += sizeof(unsigned long);
+   dst += sizeof(unsigned long);
+   size -= sizeof(unsigned long);
+   }
+   /* Remaining */
+   while (size) {
+   *dst++ ^= *src++;
+   size--;
+   }
+}
+
+/*
+ * Generate desired data/parity stripe for RAID5
+ *
+ * @nr_devs:   Total number of devices, including parity
+ * @stripe_len:Stripe length
+ * @data:  Data, with special layout:
+ * data[0]: Data stripe 0
+ * data[nr_devs-2]: Last data stripe
+ * data[nr_devs-1]: RAID5 parity
+ * @dest:  Which stripe to generate, following the above data layout
+ */
+int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data)
+{
+   int i;
+   char *buf = data[dest];
+
+   /* Validation check */
+   if (stripe_len <= 0 || stripe_len != BTRFS_STRIPE_LEN) {
+   error("invalid parameter for %s", __func__);
+   return -EINVAL;
+   }
+
+   if (dest >= nr_devs || nr_devs < 2) {
+   error("invalid parameter for %s", __func__);
+   return -EINVAL;
+   }
+   /* Shortcut for 2 devs RAID5, which is just RAID1 */
+   if (nr_devs == 2) {
+   memcpy(data[dest], data[1 - dest], stripe_len);
+   return 0;
+   }
+   memset(buf, 0, stripe_len);
+   for (i = 0; i < nr_devs; i++) {
+   if (i == dest)
+   continue;
+   xor_range(buf, data[i], stripe_len);
+   }
+   return 0;
+}
-- 
2.10.1
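
A minimal usage sketch (not part of the patch; the 4-device layout and the
wrapper function are illustrative assumptions, using the declarations from
disk-io.h): rebuilding one lost stripe from the survivors plus parity.

    /* Rebuild stripe 'lost' of a 4-device RAID5 (3 data + 1 parity). */
    static int rebuild_lost_stripe(char stripes[4][BTRFS_STRIPE_LEN], int lost)
    {
            void *data[4];
            int i;

            for (i = 0; i < 4; i++)
                    data[i] = stripes[i];
            /*
             * data[0..2] are the data stripes, data[3] is the parity.
             * Since parity = XOR of all data stripes, XOR-ing the parity
             * with the surviving data regenerates the lost stripe in place.
             */
            return raid5_gen_result(4, BTRFS_STRIPE_LEN, lost, data);
    }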





[PATCH v2 2/4] btrfs-progs: volumes: Remove BUG_ON in raid56 write routine

2016-10-24 Thread Qu Wenruo
Remove various BUG_ON()s in the raid56 write routine, including:
1) Memory allocation error
   The old code allocated memory as needed inside a loop and caught
   allocation failures with BUG_ON().
   The new code allocates everything in an allocation loop first; if any
   allocation fails, it frees what was already allocated and returns
   -ENOMEM.

2) Write error
   Change BUG_ON() to a proper error return value.

3) Validation check
   Same as write error.
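
In generic form, the allocation pattern the new code uses looks roughly
like this (a sketch; the names are illustrative, not from the patch):

    #include <errno.h>
    #include <stdlib.h>

    static int alloc_stripe_bufs(char **bufs, int n, size_t len)
    {
            int i;

            for (i = 0; i < n; i++) {
                    bufs[i] = calloc(1, len);
                    if (!bufs[i])
                            goto clean_up;
            }
            return 0;
    clean_up:
            /* free only what was successfully allocated */
            while (--i >= 0)
                    free(bufs[i]);
            return -ENOMEM;
    }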

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
changelog:
v2:
  Fix error in calloc usage which leads to double free() on data
  stripes.
  Thanks David for pointing it out.
---
 volumes.c | 89 +++
 1 file changed, 61 insertions(+), 28 deletions(-)

diff --git a/volumes.c b/volumes.c
index b9ca550..70d8940 100644
--- a/volumes.c
+++ b/volumes.c
@@ -2063,25 +2063,39 @@ static int rmw_eb(struct btrfs_fs_info *info,
return 0;
 }
 
-static void split_eb_for_raid56(struct btrfs_fs_info *info,
-   struct extent_buffer *orig_eb,
+static int split_eb_for_raid56(struct btrfs_fs_info *info,
+  struct extent_buffer *orig_eb,
   struct extent_buffer **ebs,
   u64 stripe_len, u64 *raid_map,
   int num_stripes)
 {
-   struct extent_buffer *eb;
+   struct extent_buffer **tmp_ebs;
u64 start = orig_eb->start;
u64 this_eb_start;
int i;
-   int ret;
+   int ret = 0;
+
+   tmp_ebs = calloc(num_stripes, sizeof(*tmp_ebs));
+   if (!tmp_ebs)
+   return -ENOMEM;
 
+   /* Alloc memory in a row for data stripes */
for (i = 0; i < num_stripes; i++) {
if (raid_map[i] >= BTRFS_RAID5_P_STRIPE)
break;
 
-   eb = calloc(1, sizeof(struct extent_buffer) + stripe_len);
-   if (!eb)
-   BUG();
+   tmp_ebs[i] = calloc(1, sizeof(**tmp_ebs) + stripe_len);
+   if (!tmp_ebs[i]) {
+   ret = -ENOMEM;
+   goto clean_up;
+   }
+   }
+
+   for (i = 0; i < num_stripes; i++) {
+   struct extent_buffer *eb = tmp_ebs[i];
+
+   if (raid_map[i] >= BTRFS_RAID5_P_STRIPE)
+   break;
 
eb->start = raid_map[i];
eb->len = stripe_len;
@@ -2095,12 +2109,21 @@ static void split_eb_for_raid56(struct btrfs_fs_info *info,
if (start > this_eb_start ||
start + orig_eb->len < this_eb_start + stripe_len) {
ret = rmw_eb(info, eb, orig_eb);
-   BUG_ON(ret);
+   if (ret)
+   goto clean_up;
} else {
-   memcpy(eb->data, orig_eb->data + eb->start - start, stripe_len);
+   memcpy(eb->data, orig_eb->data + eb->start - start,
+  stripe_len);
}
ebs[i] = eb;
}
+   free(tmp_ebs);
+   return ret;
+clean_up:
+   for (i = 0; i < num_stripes; i++)
+   free(tmp_ebs[i]);
+   free(tmp_ebs);
+   return ret;
 }
 
 int write_raid56_with_parity(struct btrfs_fs_info *info,
@@ -2113,15 +2136,20 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
int j;
int ret;
int alloc_size = eb->len;
+   void **pointers;
 
-   ebs = kmalloc(sizeof(*ebs) * multi->num_stripes, GFP_NOFS);
-   BUG_ON(!ebs);
+   ebs = malloc(sizeof(*ebs) * multi->num_stripes);
+   pointers = malloc(sizeof(*pointers) * multi->num_stripes);
+   if (!ebs || !pointers) {
+   free(ebs);
+   free(pointers);
+   return -ENOMEM;
+   }
 
if (stripe_len > alloc_size)
alloc_size = stripe_len;
 
-   split_eb_for_raid56(info, eb, ebs, stripe_len, raid_map,
-   multi->num_stripes);
+   ret = split_eb_for_raid56(info, eb, ebs, stripe_len, raid_map,
+ multi->num_stripes);
+   if (ret)
+   goto out;
 
for (i = 0; i < multi->num_stripes; i++) {
struct extent_buffer *new_eb;
@@ -2129,11 +2157,17 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
ebs[i]->dev_bytenr = multi->stripes[i].physical;
ebs[i]->fd = multi->stripes[i].dev->fd;
multi->stripes[i].dev->total_ios++;
-   BUG_ON(ebs[i]->start != raid_map[i]);
+   if (ebs[i]->start != raid_map[i]) {
+   ret = -EINVAL;
+   goto out_free_split;
+   }
continue;
}
-   new_eb = kmalloc(sizeof(*eb) + alloc_size, GFP_NOFS);
- 

[PATCH v2 4/4] btrfs-progs: volumes: Use new raid5_gen_result to calculate raid5 parity

2016-10-24 Thread Qu Wenruo
Use the new raid5_gen_result() function to calculate raid5 parity.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
Changelog:
v2:
  None
---
 volumes.c | 19 ++-
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/volumes.c b/volumes.c
index 70d8940..6d7adef 100644
--- a/volumes.c
+++ b/volumes.c
@@ -2133,7 +2133,6 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 {
struct extent_buffer **ebs, *p_eb = NULL, *q_eb = NULL;
int i;
-   int j;
int ret;
int alloc_size = eb->len;
void **pointers;
@@ -2188,18 +2187,12 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
raid6_gen_syndrome(multi->num_stripes, stripe_len, pointers);
} else {
ebs[multi->num_stripes - 1] = p_eb;
-   memcpy(p_eb->data, ebs[0]->data, stripe_len);
-   for (j = 1; j < multi->num_stripes - 1; j++) {
-   for (i = 0; i < stripe_len; i += sizeof(u64)) {
-   u64 p_eb_data;
-   u64 ebs_data;
-
-   p_eb_data = get_unaligned_64(p_eb->data + i);
-   ebs_data = get_unaligned_64(ebs[j]->data + i);
-   p_eb_data ^= ebs_data;
-   put_unaligned_64(p_eb_data, p_eb->data + i);
-   }
-   }
+   for (i = 0; i < multi->num_stripes; i++)
+   pointers[i] = ebs[i]->data;
+   ret = raid5_gen_result(multi->num_stripes, stripe_len,
+  multi->num_stripes - 1, pointers);
+   if (ret < 0)
+   goto out_free_split;
}
 
for (i = 0; i < multi->num_stripes; i++) {
-- 
2.10.1





[PATCH v2 1/4] btrfs-progs: rename raid6.c to raid56.c

2016-10-24 Thread Qu Wenruo
This allows us to put the raid5 code into that file rather than creating a
new raid5.c.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
Changelog
v2:
  None
---
 Makefile.in | 2 +-
 disk-io.h   | 2 +-
 raid6.c => raid56.c | 0
 3 files changed, 2 insertions(+), 2 deletions(-)
 rename raid6.c => raid56.c (100%)

diff --git a/Makefile.in b/Makefile.in
index 5187a93..b53cf2c 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -91,7 +91,7 @@ CHECKER_FLAGS := -include $(check_defs) -D__CHECKER__ \
 objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
- qgroup.o raid6.o free-space-cache.o kernel-lib/list_sort.o props.o \
+ qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
diff --git a/disk-io.h b/disk-io.h
index 1fc0bb7..8ab36aa 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -190,7 +190,7 @@ int write_tree_block(struct btrfs_trans_handle *trans,
 int write_and_map_eb(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 struct extent_buffer *eb);
 
-/* raid6.c */
+/* raid56.c */
 void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
 
 #endif
diff --git a/raid6.c b/raid56.c
similarity index 100%
rename from raid6.c
rename to raid56.c
-- 
2.10.1





Re: Scrubbing Errors after restoring backup

2016-10-24 Thread Nicholas Steeves
On 24 October 2016 at 17:53, Stefan Malte Schumacher  wrote:
> Hello
>
> For reference please see this post.
> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg58461.html
> Please note that I downgraded to btrfs-progs 4.6.1 as advised.
>
> After exchanging the malfunctioning drive I re-created the filesystem
> and restored the backup from my NAS. (I didn't entirely trust the
> filesystem after so many errors.) On completing the restoration I
> manually started scrubbing, which ended with hundreds of checksum and
> read errors on /dev/sda.
> The drive checks out fine in smart and passed through all scheduled
> SMART Self-Tests. The model is not identical to the two drives
> recently added to the system - the new drives are WD Blue, the four
> original ones are WD Greens.
>
> I have reset the output from btrfs dev stats and restarted the
> scrubbing process. I am unsure how to interpret or explain the errors
> of the last scrub run. I scrubbed regularly each month for nearly
> three years and never had any errors. I would be grateful for any
> advice on how to proceed.
>
> Yours sincerely
> Stefan

Hi Stefan,

What kernel version are you using?  Was the backup a file-level
archive or a btrfs send stream?  I'm confused about the evolution of your
hardware.  Originally you had a four disk raid1?  Or a six disk raid1?
The one that failed was /dev/sdf, which seems to suggest:

/dev/sdc - WD green
/dev/sdd - WD green
/dev/sde - WD green
/dev/sdf - WD green <- failed

I would expect that the new volume is something like:

/dev/sdc - New unnamed model or 3 year old WD Green?
/dev/sdd - New unnamed model or 3 year old WD Green?
/dev/sde - New WD Blue
/dev/sdf - New WD Blue

Did you move the sata cables to use:

/dev/sda - Unknown. New disk or 3 year old disk?
/dev/sdb - Unknown. New disk or 3 year old disk?
/dev/sdc - New WD Blue
/dev/sdd - New WD Blue

And this is a freshly-created btrfs volume?  When you restored from
backup, your hard drive firmware should have detected any bad sectors
and relocated the write to a reserve sector, and I'm assuming none of
the logs have anything in them that would indicate a failed
write.  If sda is from the 3 year old batch of WD greens I would
distrust it.  Frequent culprits of similar problems are flaky sata
cables or a flaky PSU.  In the case of flaky sata cables, dmesg
(usually?) shows PHY and "hard resetting link" errors.

I also wonder if the sata0 port on your motherboard might be bad.  The
only reason I mention this is because I've seen two H67/P67
cougarpoint chipset motherboards lose their sata0 channel.  It also
happens with other brands' chipsets...

Whatever the case, when stuff like this happened to me I've always
used something like a combination of a cpuburnP6 per logical CPU,
memtester (in Linux; do this after a clean 24h memtest86+ run), a huge
and intense bonnie++ run, with as many things plugged into the USB
ports as possible--including charging at least one high-power
device--while burning a DVD and/or running something that stresses the
GPU...to try to shake down potential PSU issues.  Maybe passmark
(under Linux) has similar functionality with an easier interface?
I've also used parallel diskscan (https://github.com/baruch/diskscan)
runs to test old disks and to check for statistical anomalies.  If you
do:

1. use tape to number your cables; record which drives are connected
into which sata ports with which cables.  Do simultaneous runs of
diskscan on /dev/disk/by-id/$relevant_disks, check dmesg, and record
the results.
2. unplug sata cables from drives and shuffle; document specifics and
test.
3. unplug sata cables from motherboard and shuffle; document
specifics and test.

For the cost of new sata cables, you might as well just buy new ones
because then these tests can be used to check for bad ones among the
new cables; it's a better use of time, because it's possible that
you'll detect a bad cable, replace it, test the new cable, and find
out that the new cable is defective.  Fountain of Bad Luck™ <- If
something can fail, it will fail when I use it ;-)

That said, I've never tested a WD green drive...the reds' performance
smoothly decreases towards the end of the drive (outer tracks are
quite a bit faster than inner tracks).  For all I know the greens have
erratic performance baked into their power-saving design...  If
there's consistently a latency spike at the same location for the test
associated with a particular drive that can indicate a relocated bad
sector.  Does anyone know if this method reliably indicates when a
drive is lying about its SMART 5 Reallocated_Sector_Ct report?

Cheers,
Nicholas




Re: bio linked list corruption.

2016-10-24 Thread Andy Lutomirski
On Oct 24, 2016 5:00 PM, "Linus Torvalds"  wrote:
>
> On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski  wrote:
>
> > Now the fallocate thread catches up and *exits*.  Dave's test makes a
> > new thread that reuses the stack (the vmap area or the backing store).
> >
> > Now the shmem_fault thread continues on its merry way and takes
> > q->lock.  But oh crap, q->lock is pointing at some random spot on some
> > other thread's stack.  Kaboom!
>
> Note that q->lock should be entirely immaterial, since inode->i_lock
> nests outside of it in all uses.
>
> Now, if there is some code that runs *without* the inode->i_lock, then
> that would be a big bug.
>
> But I'm not seeing it.
>
> I do agree that some race on some stack data structure could easily be
> the cause of these issues. And yes, the vmap code obviously starts
> reusing the stack much earlier, and would trigger problems that would
> essentially be hidden by the fact that the kernel stack used to stay
> around not just until exit(), but until the process was reaped.
>
> I just think that in this case i_lock really looks like it should
> serialize things correctly.
>
> Or are you seeing something I'm not?

No, I missed that.

--Andy


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski  wrote:
>
> Here's my theory: I think you're looking at the right code but the
> wrong stack.  shmem_fault_wait is fine, but shmem_fault_waitq looks
> really dicey.

Hmm.

> Consider:
>
> fallocate calls wake_up_all(), which calls autoremove_wait_function().
> That wakes up the shmem_fault thread.  Suppose that the shmem_fault
> thread gets moving really quickly (presumably because it never went to
> sleep in the first place -- suppose it hadn't made it to schedule(),
> so schedule() turns into a no-op).  It calls finish_wait() *before*
> autoremove_wake_function() does the list_del_init().  finish_wait()
> gets to the list_empty_careful() call and it returns true.

All of this happens under inode->i_lock, so the different parts are
serialized. So if this happens before the wakeup, then finish_wait()
will see that the wait-queue entry is not empty (it points to the wait
queue head in shmem_falloc_waitq).

But then it will just remove itself carefully with list_del_init()
under the waitqueue lock, and things are fine.

Yes, it uses the waitqueue lock on the other stack, but that stack is
still ok since
 (a) we're serialized by inode->i_lock
 (b) this code ran before the fallocate thread catches up and exits.

In other words, your scenario seems to be dependent on those two
threads being able to race. But they both hold inode->i_lock in the
critical region you are talking about.

> Now the fallocate thread catches up and *exits*.  Dave's test makes a
> new thread that reuses the stack (the vmap area or the backing store).
>
> Now the shmem_fault thread continues on its merry way and takes
> q->lock.  But oh crap, q->lock is pointing at some random spot on some
> other thread's stack.  Kaboom!

Note that q->lock should be entirely immaterial, since inode->i_lock
nests outside of it in all uses.

Now, if there is some code that runs *without* the inode->i_lock, then
that would be a big bug.

But I'm not seeing it.

I do agree that some race on some stack data structure could easily be
the cause of these issues. And yes, the vmap code obviously starts
reusing the stack much earlier, and would trigger problems that would
essentially be hidden by the fact that the kernel stack used to stay
around not just until exit(), but until the process was reaped.

I just think that in this case i_lock really looks like it should
serialize things correctly.

Or are you seeing something I'm not?

   Linus


Re: bio linked list corruption.

2016-10-24 Thread Andy Lutomirski
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds
 wrote:
> On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski  wrote:
>>>
>>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>
>> This is an unhandled kernel page fault.  The string "Oops" is so helpful :-/
>
> I think there was a line above it that DaveJ just didn't include.
>
>>
>>> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 
>>> 4.9.0-rc1-think+ #11
>>> [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000
>>> [69943.468077] RIP: 0010:[]
>>> [69943.472704]  [] __lock_acquire.isra.32+0x6b/0x8c0
>>> [69943.477489] RSP: 0018:c9000b10b9e8  EFLAGS: 00010086
>>> [69943.482368] RAX: 81789b90 RBX: 8804f8dd3740 RCX: 
>>> 
>>> [69943.487410] RDX:  RSI:  RDI: 
>>> 
>>> [69943.492515] RBP: c9000b10ba18 R08: 0001 R09: 
>>> 
>>> [69943.497666] R10: 0001 R11: 3f9cfa7f4e73 R12: 
>>> 
>>> [69943.502880] R13:  R14: c9000af7bd48 R15: 
>>> 8804f8dd3740
>>> [69943.508163] FS:  7f64904a2b40() GS:880507a0() 
>>> knlGS:
>>> [69943.513591] CS:  0010 DS:  ES:  CR0: 80050033
>>> [69943.518917] CR2: 81789d28 CR3: 0004a8f16000 CR4: 
>>> 001406e0
>>> [69943.524253] DR0: 7f5b97fd4000 DR1:  DR2: 
>>> 
>>> [69943.529488] DR3:  DR6: 0ff0 DR7: 
>>> 0600
>>> [69943.534771] Stack:
>>> [69943.540023]  880507bd74c0
>>> [69943.545317]  8804f8dd3740 0046 
>>> 0286[69943.545456]  c9000af7bd08
>>> [69943.550930]  0100 c9000b10ba50 
>>> 810c4b68[69943.551069]  810ba40c
>>> [69943.556657]  8804  
>>> c9000af7bd48[69943.556796] Call Trace:
>>> [69943.562465]  [] lock_acquire+0x58/0x70
>>> [69943.568354]  [] ? finish_wait+0x3c/0x70
>>> [69943.574306]  [] _raw_spin_lock_irqsave+0x42/0x80
>>> [69943.580335]  [] ? finish_wait+0x3c/0x70
>>> [69943.586237]  [] finish_wait+0x3c/0x70
>>> [69943.591992]  [] shmem_fault+0x167/0x1b0
>>> [69943.597807]  [] ? prepare_to_wait_event+0x100/0x100
>>> [69943.603741]  [] __do_fault+0x6d/0x1b0
>>> [69943.609743]  [] handle_mm_fault+0xc58/0x1170
>>> [69943.615822]  [] ? handle_mm_fault+0x43/0x1170
>>> [69943.621971]  [] __do_page_fault+0x172/0x4e0
>>> [69943.628184]  [] do_page_fault+0x20/0x70
>>> [69943.634449]  [] ? debug_smp_processor_id+0x17/0x20
>>> [69943.640784]  [] page_fault+0x1f/0x30
>>> [69943.647170]  [] ? strncpy_from_user+0x5c/0x170
>>> [69943.653480]  [] ? strncpy_from_user+0x46/0x170
>>> [69943.659632]  [] setxattr+0x57/0x170
>>> [69943.665846]  [] ? debug_smp_processor_id+0x17/0x20
>>> [69943.672172]  [] ? get_lock_stats+0x19/0x50
>>> [69943.678558]  [] ? sched_clock_cpu+0xb6/0xd0
>>> [69943.685007]  [] ? __lock_acquire.isra.32+0x1cf/0x8c0
>>> [69943.691542]  [] ? __this_cpu_preempt_check+0x13/0x20
>>> [69943.698130]  [] ? preempt_count_add+0x7c/0xc0
>>> [69943.704791]  [] ? __mnt_want_write+0x61/0x90
>>> [69943.711519]  [] SyS_fsetxattr+0x78/0xa0
>>> [69943.718300]  [] do_syscall_64+0x5c/0x170
>>> [69943.724949]  [] entry_SYSCALL64_slow_path+0x25/0x25
>>> [69943.731521] Code:
>>> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 
>>> 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00  
>>> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85
>>
>> That's lock incl 0x198(%rax).  I think this is:
>>
>> atomic_inc((atomic_t *)&class->ops);
>>
>> I suppose this could be stack corruption at work, but after a fair
>> amount of staring, I still haven't found anything in the vmap_stack
>> code that would cause stack corruption.
>
> Well, it is intriguing that what faults is this:
>
> finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
>
> where 'shmem_fault_wait' is a on-stack wait queue. So it really looks
> very much like stack corruption.
>
> What strikes me is that "finish_wait()" does this optimistic "has my
> entry been removed" without holding the waitqueue lock (and uses
> list_empty_careful() to make sure it does that "safely").
>
> It has that big comment too:
>
> /*
>  * shmem_falloc_waitq points into the shmem_fallocate()
>  * stack of the hole-punching task: shmem_falloc_waitq
>  * is usually invalid by the time we reach here, but
>  * finish_wait() does not dereference it in that case;
>  * though i_lock needed lest racing with wake_up_all().
>  */
>
> the stack it comes from is the wait queue head from shmem_fallocate(),
> which will do "wake_up_all()" under the inode lock.

Here's my theory: I think you're 

Re: bio linked list corruption.

2016-10-24 Thread Chris Mason

On 10/24/2016 05:50 PM, Linus Torvalds wrote:

On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds
 wrote:


The vmalloc/vfree code itself is a bit scary. In particular, we have a
rather insane model of TLB flushing. We leave the virtual area on a
lazy purge-list, and we delay flushing the TLB and actually freeing
the virtual memory for it so that we can batch things up.


Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in
vmap_debug_free_range() should have forced a synchronous TLB flush for
vmalloc ranges too, so that doesn't explain it either.



My big fear here is that we're just triggering an old stack corruption 
more reliably.  I wonder if we can make it happen much faster by 
restricting the stacks to a static list and cycling through them in lifo 
fashion?


-chris



Scrubbing Errors after restoring backup

2016-10-24 Thread Stefan Malte Schumacher
Hello

For reference please see this post.
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg58461.html
Please note that I downgraded to btrfs-progs 4.6.1 as advised.

After exchanging the malfunctioning drive I re-created the filesystem
and restored the backup from my NAS. (I didn't entirely trust the
filesystem after so many errors.) On completing the restoration I
manually started scrubbing, which ended with hundreds of checksum and
read errors on /dev/sda.
The drive checks out fine in smart and passed through all scheduled
SMART Self-Tests. The model is not identical to the two drives
recently added to the system - the new drives are WD Blue, the four
original ones are WD Greens.

I have reset the output from btrfs dev stats and restarted the
scrubbing process. I am unsure how to interpret or explain the errors
of the last scrub run. I scrubbed regularly each month for nearly
three years and never had any errors. I would be grateful for any
advice on how to proceed.

Yours sincerely
Stefan


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds
 wrote:
>
> The vmalloc/vfree code itself is a bit scary. In particular, we have a
> rather insane model of TLB flushing. We leave the virtual area on a
> lazy purge-list, and we delay flushing the TLB and actually freeing
> the virtual memory for it so that we can batch things up.

Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in
vmap_debug_free_range() should have forced a synchronous TLB flush for
vmalloc ranges too, so that doesn't explain it either.

 Linus


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds
 wrote:
>
> So this is all some really subtle code, but I'm not seeing that it
> would be wrong.

Ahh... Except maybe..

The vmalloc/vfree code itself is a bit scary. In particular, we have a
rather insane model of TLB flushing. We leave the virtual area on a
lazy purge-list, and we delay flushing the TLB and actually freeing
the virtual memory for it so that we can batch things up.

But we've freed the physical pages that are *mapped* by that area
when we do the vfree(). So there can be stale TLB entries that point
to pages that have gotten re-used. They shouldn't matter, because
nothing should be writing to those pages, but it strikes me that this
may also be hurting the DEBUG_PAGEALLOC thing. Maybe we're not getting
the page faults that we *should* be getting, and there are hidden
reads and writes to those pages that already got freed.

There was some nasty reason why we did that insane thing. I think it
was just that there are a few high-frequency vmalloc/vfree users and
the TLB flushing was killing some performance.

But it does strike me that we are playing very fast and loose with the
TLB on the vmalloc area.

So maybe all the new VMAP code is fine, and it's really vmalloc/vfree
that has been subtly broken but nobody has ever cared before?

  Linus


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski  wrote:
>>
>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>
> This is an unhandled kernel page fault.  The string "Oops" is so helpful :-/

I think there was a line above it that DaveJ just didn't include.

>
>> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 
>> 4.9.0-rc1-think+ #11
>> [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000
>> [69943.468077] RIP: 0010:[]
>> [69943.472704]  [] __lock_acquire.isra.32+0x6b/0x8c0
>> [69943.477489] RSP: 0018:c9000b10b9e8  EFLAGS: 00010086
>> [69943.482368] RAX: 81789b90 RBX: 8804f8dd3740 RCX: 
>> 
>> [69943.487410] RDX:  RSI:  RDI: 
>> 
>> [69943.492515] RBP: c9000b10ba18 R08: 0001 R09: 
>> 
>> [69943.497666] R10: 0001 R11: 3f9cfa7f4e73 R12: 
>> 
>> [69943.502880] R13:  R14: c9000af7bd48 R15: 
>> 8804f8dd3740
>> [69943.508163] FS:  7f64904a2b40() GS:880507a0() 
>> knlGS:
>> [69943.513591] CS:  0010 DS:  ES:  CR0: 80050033
>> [69943.518917] CR2: 81789d28 CR3: 0004a8f16000 CR4: 
>> 001406e0
>> [69943.524253] DR0: 7f5b97fd4000 DR1:  DR2: 
>> 
>> [69943.529488] DR3:  DR6: 0ff0 DR7: 
>> 0600
>> [69943.534771] Stack:
>> [69943.540023]  880507bd74c0
>> [69943.545317]  8804f8dd3740 0046 
>> 0286[69943.545456]  c9000af7bd08
>> [69943.550930]  0100 c9000b10ba50 
>> 810c4b68[69943.551069]  810ba40c
>> [69943.556657]  8804  
>> c9000af7bd48[69943.556796] Call Trace:
>> [69943.562465]  [] lock_acquire+0x58/0x70
>> [69943.568354]  [] ? finish_wait+0x3c/0x70
>> [69943.574306]  [] _raw_spin_lock_irqsave+0x42/0x80
>> [69943.580335]  [] ? finish_wait+0x3c/0x70
>> [69943.586237]  [] finish_wait+0x3c/0x70
>> [69943.591992]  [] shmem_fault+0x167/0x1b0
>> [69943.597807]  [] ? prepare_to_wait_event+0x100/0x100
>> [69943.603741]  [] __do_fault+0x6d/0x1b0
>> [69943.609743]  [] handle_mm_fault+0xc58/0x1170
>> [69943.615822]  [] ? handle_mm_fault+0x43/0x1170
>> [69943.621971]  [] __do_page_fault+0x172/0x4e0
>> [69943.628184]  [] do_page_fault+0x20/0x70
>> [69943.634449]  [] ? debug_smp_processor_id+0x17/0x20
>> [69943.640784]  [] page_fault+0x1f/0x30
>> [69943.647170]  [] ? strncpy_from_user+0x5c/0x170
>> [69943.653480]  [] ? strncpy_from_user+0x46/0x170
>> [69943.659632]  [] setxattr+0x57/0x170
>> [69943.665846]  [] ? debug_smp_processor_id+0x17/0x20
>> [69943.672172]  [] ? get_lock_stats+0x19/0x50
>> [69943.678558]  [] ? sched_clock_cpu+0xb6/0xd0
>> [69943.685007]  [] ? __lock_acquire.isra.32+0x1cf/0x8c0
>> [69943.691542]  [] ? __this_cpu_preempt_check+0x13/0x20
>> [69943.698130]  [] ? preempt_count_add+0x7c/0xc0
>> [69943.704791]  [] ? __mnt_want_write+0x61/0x90
>> [69943.711519]  [] SyS_fsetxattr+0x78/0xa0
>> [69943.718300]  [] do_syscall_64+0x5c/0x170
>> [69943.724949]  [] entry_SYSCALL64_slow_path+0x25/0x25
>> [69943.731521] Code:
>> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 
>> 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00  ff 
>> 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85
>
> That's lock incl 0x198(%rax).  I think this is:
>
> atomic_inc((atomic_t *)&class->ops);
>
> I suppose this could be stack corruption at work, but after a fair
> amount of staring, I still haven't found anything in the vmap_stack
> code that would cause stack corruption.

Well, it is intriguing that what faults is this:

finish_wait(shmem_falloc_waitq, &shmem_fault_wait);

where 'shmem_fault_wait' is a on-stack wait queue. So it really looks
very much like stack corruption.

What strikes me is that "finish_wait()" does this optimistic "has my
entry been removed" without holding the waitqueue lock (and uses
list_empty_careful() to make sure it does that "safely").

It has that big comment too:

/*
 * shmem_falloc_waitq points into the shmem_fallocate()
 * stack of the hole-punching task: shmem_falloc_waitq
 * is usually invalid by the time we reach here, but
 * finish_wait() does not dereference it in that case;
 * though i_lock needed lest racing with wake_up_all().
 */

the stack it comes from is the wait queue head from shmem_fallocate(),
which will do "wake_up_all()" under the inode lock.

On the face of it, the inode lock should make that safe and serialize
everything. And yes, finish_wait() does not touch the unsafe stuff if
the wait-queue (in the local stack) is empty, which wake_up_all()
*should* have guaranteed. It's just a regular 

[PATCH 4/5] writeback: introduce super_operations->write_metadata

2016-10-24 Thread Josef Bacik
Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to deal with writing back any dirty metadata we need based
on the writeback needs of the system.  Since there is no inode to key off of we
need a list in the bdi for dirty super blocks to be added.  From there we can
find any dirty sb's on the bdi we are currently doing writeback on and call into
their ->write_metadata callback.
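
As an illustration, a filesystem would hook this up roughly as below;
'myfs' and its flush helper are hypothetical sketches, with the real
prototype being the one declared in the fs.h hunk of this patch:

    static void myfs_write_metadata(struct super_block *sb,
                                    struct writeback_control *wbc)
    {
            long wrote;

            /* Flush up to wbc->nr_to_write worth of dirty metadata and
             * report progress back through wbc->nr_to_write. */
            wrote = myfs_flush_dirty_metadata(sb->s_fs_info, wbc->nr_to_write);
            wbc->nr_to_write -= wrote;
    }

    static const struct super_operations myfs_super_ops = {
            /* ... the usual ops ... */
            .write_metadata = myfs_write_metadata,
    };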

Signed-off-by: Josef Bacik 
Reviewed-by: Jan Kara 
---
 fs/fs-writeback.c| 72 
 fs/super.c   |  7 
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h   |  4 +++
 mm/backing-dev.c |  2 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a5cb1dd..8e392b2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1465,6 +1465,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
+{
+   struct writeback_control wbc = {
+   .sync_mode  = work->sync_mode,
+   .tagged_writepages  = work->tagged_writepages,
+   .for_kupdate= work->for_kupdate,
+   .for_background = work->for_background,
+   .for_sync   = work->for_sync,
+   .range_cyclic   = work->range_cyclic,
+   .range_start= 0,
+   .range_end  = LLONG_MAX,
+   };
+   long write_chunk;
+
+   write_chunk = writeback_chunk_size(wb, work);
+   wbc.nr_to_write = write_chunk;
+   sb->s_op->write_metadata(sb, &wbc);
+   work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+   return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1491,6 +1516,7 @@ static long writeback_sb_inodes(struct super_block *sb,
unsigned long start_time = jiffies;
long write_chunk;
long wrote = 0;  /* count both pages and inodes */
+   bool done = false;
 
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1607,12 +1633,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 * background threshold and other termination conditions.
 */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+   if (!done && sb->s_op->write_metadata) {
+   spin_unlock(&wb->list_lock);
+   wrote += writeback_sb_metadata(sb, wb, work);
+   spin_lock(&wb->list_lock);
+   }
return wrote;
 }
 
@@ -1621,6 +1653,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 {
unsigned long start_time = jiffies;
long wrote = 0;
+   bool done = false;
 
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1640,12 +1673,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 
/* refer to the same tests at the end of writeback_sb_inodes */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+
+   if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+   LIST_HEAD(list);
+
+   spin_unlock(&wb->list_lock);
+   spin_lock(&wb->bdi->sb_list_lock);
+   list_splice_init(&wb->bdi->dirty_sb_list, &list);
+   while (!list_empty(&list)) {
+   struct super_block *sb;
+
+   sb = list_first_entry(&list, struct super_block,
+ s_bdi_dirty_list);
+   list_move_tail(&sb->s_bdi_dirty_list,
+  &wb->bdi->dirty_sb_list);
+   if (!sb->s_op->write_metadata)
+   continue;
+   

[PATCH 1/5] remove mapping from balance_dirty_pages*()

2016-10-24 Thread Josef Bacik
The only reason we pass in the mapping is to get the inode in order to see if
writeback cgroups is enabled, and even then it only checks the bdi and a super
block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
block directly so we can avoid using mapping.  This will allow us to still use
balance_dirty_pages for dirty metadata pages that are not backed by an
address_space mapping.
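
The call-site conversion is mechanical, e.g. (taken from the btrfs hunks
below):

    /* before */
    balance_dirty_pages_ratelimited(inode->i_mapping);
    /* after */
    balance_dirty_pages_ratelimited(inode_to_bdi(inode), inode->i_sb);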

Signed-off-by: Josef Bacik 
Reviewed-by: Jan Kara 
---
 drivers/mtd/devices/block2mtd.c | 12 
 fs/btrfs/disk-io.c  |  4 ++--
 fs/btrfs/file.c |  3 ++-
 fs/btrfs/ioctl.c|  3 ++-
 fs/btrfs/relocation.c   |  3 ++-
 fs/buffer.c |  3 ++-
 fs/iomap.c  |  3 ++-
 fs/ntfs/attrib.c| 10 +++---
 fs/ntfs/file.c  |  4 ++--
 include/linux/backing-dev.h | 29 +++--
 include/linux/writeback.h   |  3 ++-
 mm/filemap.c|  4 +++-
 mm/memory.c |  9 +++--
 mm/page-writeback.c | 15 +++
 14 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f1..7892d0b 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
struct page *page;
int index = to >> PAGE_SHIFT;   // page index
int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
memset(page_address(page), 0xff, PAGE_SIZE);
set_page_dirty(page);
unlock_page(page);
-   balance_dirty_pages_ratelimited(mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
break;
}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
loff_t to, size_t len, size_t *retlen)
 {
struct page *page;
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
int index = to >> PAGE_SHIFT;   // page index
int offset = to & ~PAGE_MASK;   // page offset
int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
memcpy(page_address(page) + offset, buf, cpylen);
set_page_dirty(page);
unlock_page(page);
-   balance_dirty_pages_ratelimited(mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
}
put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e720d3e..fa07e81 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4085,8 +4085,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_root *root,
ret = percpu_counter_compare(&root->fs_info->dirty_metadata_bytes,
 BTRFS_DIRTY_METADATA_THRESH);
if (ret > 0) {
-   balance_dirty_pages_ratelimited(
-  root->fs_info->btree_inode->i_mapping);
+   balance_dirty_pages_ratelimited(&root->fs_info->bdi,
+   root->fs_info->sb);
}
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72a180d..cbefdc8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1711,7 +1711,8 @@ again:
 
cond_resched();
 
-   balance_dirty_pages_ratelimited(inode->i_mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
if (dirty_pages < (root->nodesize >> PAGE_SHIFT) + 1)
btrfs_btree_balance_dirty(root);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index af69129..f68692a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1410,7 +1410,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
}
 
defrag_count += 

[PATCH 0/5] Support for metadata specific accounting

2016-10-24 Thread Josef Bacik
(Dave, again I apologize: for some reason our email server hates you, so I
didn't get your previous responses again and didn't notice until I was looking
at the patchwork history for my previous submissions. I'll try to answer all
your questions here, and then I'll be much more judicious about making sure I'm
getting your emails.)

This is my current batch of patches to add some better infrastructure for
handling metadata related memory.  There's been a few changes to these patches
since the last go around but the approach is essentially the same.

Changes since last go around:
-WB_WRITTEN/WB_DIRTIED are now being counted in bytes.  This is necessary to
deal with metadata that is smaller than page size.  Currently btrfs only does
things in pagesize increments, but these patches will allow us to easily
support sub-pagesize blocksizes, so we need these counters to be in bytes.
-Added a METADATA_BYTES counter to account for the total number of bytes used
for metadata.  We need this because the VM adjusts the dirty ratios based on how
much memory it thinks we can dirty, which is just the amount of free memory
plus the pagecache.
-Patch 5, which is a doozy, and I will explain below.

This is a more complete approach to keeping track of memory that is in use for
metadata.  Previously Btrfs had been using a special inode to allocate all of
our metadata pages out of, so if we had a heavy metadata workload we still
benefited from the balance_dirty_pages() throttling which kept everything sane.
We want to kill this in order to make it easier to support pagesize >
blocksize in btrfs, in much the same way XFS does.

One concern that was brought up was the global accounting vs one bad actor.  We
still need the global accounting so we can enforce the global dirty limits, but
we have the per-bdi accounting as well to make sure we are punishing the bad
actor and not all the other file systems that happened to be hooked into this
infrastructure.

The *REFERENCED patch is to fix a behavior I noticed when testing these patches
on a more memory-constrained system than I had previously used.  Now that the
eviction of metadata pages is controlled soley through the shrinker interface I
was seeing a big perf drop when we had to start doing memory reclaim.  This was
because most of the RAM on the box (10gig of 16gig) was used solely by icache and
dcache. Because this workload just created files and never touched them again,
most of this cache was easily evictable. However, because we by default send
every object in the icache/dcache through the LRU twice, it would take around
10,000 iterations through the shrinker before it would start to evict things
(there were ~10 million objects between the two caches). The pagecache solves
this by putting everything on the INACTIVE list first, and then there's a
two-part activation phase before it's moved to the active list, so you have to
really touch a page a lot before it is considered active.  However we assume
active first, which is very latency inducing when we want to start reclaiming
objects.  Making this change keeps us in line with what pagecache does and
eliminates the performance drop I was seeing.

Let me know what you think.  I've tried to keep this as minimal and sane as
possible so we can build on it in the future as we want to handle more nuanced
scenarios.  Also if we think this is a bad approach I won't feel as bad throwing
it out ;).  Thanks,

Josef


Re: bio linked list corruption.

2016-10-24 Thread Andy Lutomirski
On Sun, Oct 23, 2016 at 9:40 PM, Dave Jones  wrote:
> On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote:
>  >
>  >
>  > On 10/22/2016 11:20 AM, Dave Jones wrote:
>  > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:
>  > >
>  > >  >  > It could be worth trying this, too:
>  > >  >  >
>  > >  >  > 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack=174531fef4e8
>  > >  >  >
>  > >  >  > It occurred to me that the current code is a little bit fragile.
>  > >  >
>  > >  > It's been nearly 24hrs with the above changes, and it's been pretty 
> much
>  > >  > silent the whole time.
>  > >  >
>  > >  > The only thing of note over that time period has been a btrfs lockdep
>  > >  > warning that's been around for a while, and occasional btrfs checksum
>  > >  > failures, which I've been seeing for a while, but seem to have gotten
>  > >  > worse since 4.8.
>  > >  >
>  > >  > I'm pretty confident in the disk being ok in this machine, so I think
>  > >  > the checksum warnings are bogus.  Chris suggested they may be the 
> result
>  > >  > of memory corruption, but there's little else going on.
>  > >
>  > > The only interesting thing last nights run was this..
>  > >
>  > > BUG: Bad page state in process kworker/u8:1  pfn:4e2b70
>  > > page:ea00138adc00 count:0 mapcount:0 mapping:88046e9fc2e0 
> index:0xdf0
>  > > flags: 0x400c(referenced|uptodate)
>  > > page dumped because: non-NULL mapping
>  > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
>  > > Workqueue: writeback wb_workfn (flush-btrfs-2)
>  >
>  > Well crud, we're back to wondering if this is Btrfs or the stack
>  > corruption.  Since the pagevecs are on the stack and this is a new
>  > crash, my guess is you'll be able to trigger it on xfs/ext4 too.  But we
>  > should make sure.
>
> Here's an interesting one from today, pointing the finger at xattrs again.
>
>
> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC

This is an unhandled kernel page fault.  The string "Oops" is so helpful :-/

> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 
> 4.9.0-rc1-think+ #11
> [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000
> [69943.468077] RIP: 0010:[]
> [69943.472704]  [] __lock_acquire.isra.32+0x6b/0x8c0
> [69943.477489] RSP: 0018:c9000b10b9e8  EFLAGS: 00010086
> [69943.482368] RAX: 81789b90 RBX: 8804f8dd3740 RCX: 
> 
> [69943.487410] RDX:  RSI:  RDI: 
> 
> [69943.492515] RBP: c9000b10ba18 R08: 0001 R09: 
> 
> [69943.497666] R10: 0001 R11: 3f9cfa7f4e73 R12: 
> 
> [69943.502880] R13:  R14: c9000af7bd48 R15: 
> 8804f8dd3740
> [69943.508163] FS:  7f64904a2b40() GS:880507a0() 
> knlGS:
> [69943.513591] CS:  0010 DS:  ES:  CR0: 80050033
> [69943.518917] CR2: 81789d28 CR3: 0004a8f16000 CR4: 
> 001406e0
> [69943.524253] DR0: 7f5b97fd4000 DR1:  DR2: 
> 
> [69943.529488] DR3:  DR6: 0ff0 DR7: 
> 0600
> [69943.534771] Stack:
> [69943.540023]  880507bd74c0
> [69943.545317]  8804f8dd3740 0046 
> 0286[69943.545456]  c9000af7bd08
> [69943.550930]  0100 c9000b10ba50 
> 810c4b68[69943.551069]  810ba40c
> [69943.556657]  8804  
> c9000af7bd48[69943.556796] Call Trace:
> [69943.562465]  [] lock_acquire+0x58/0x70
> [69943.568354]  [] ? finish_wait+0x3c/0x70
> [69943.574306]  [] _raw_spin_lock_irqsave+0x42/0x80
> [69943.580335]  [] ? finish_wait+0x3c/0x70
> [69943.586237]  [] finish_wait+0x3c/0x70
> [69943.591992]  [] shmem_fault+0x167/0x1b0
> [69943.597807]  [] ? prepare_to_wait_event+0x100/0x100
> [69943.603741]  [] __do_fault+0x6d/0x1b0
> [69943.609743]  [] handle_mm_fault+0xc58/0x1170
> [69943.615822]  [] ? handle_mm_fault+0x43/0x1170
> [69943.621971]  [] __do_page_fault+0x172/0x4e0
> [69943.628184]  [] do_page_fault+0x20/0x70
> [69943.634449]  [] ? debug_smp_processor_id+0x17/0x20
> [69943.640784]  [] page_fault+0x1f/0x30
> [69943.647170]  [] ? strncpy_from_user+0x5c/0x170
> [69943.653480]  [] ? strncpy_from_user+0x46/0x170
> [69943.659632]  [] setxattr+0x57/0x170
> [69943.665846]  [] ? debug_smp_processor_id+0x17/0x20
> [69943.672172]  [] ? get_lock_stats+0x19/0x50
> [69943.678558]  [] ? sched_clock_cpu+0xb6/0xd0
> [69943.685007]  [] ? __lock_acquire.isra.32+0x1cf/0x8c0
> [69943.691542]  [] ? __this_cpu_preempt_check+0x13/0x20
> [69943.698130]  [] ? preempt_count_add+0x7c/0xc0
> [69943.704791]  [] ? __mnt_want_write+0x61/0x90
> [69943.711519]  [] SyS_fsetxattr+0x78/0xa0
> [69943.718300]  [] do_syscall_64+0x5c/0x170
> [69943.724949]  [] entry_SYSCALL64_slow_path+0x25/0x25
> [69943.731521] Code:
> 

Re: [PATCH] btrfs: improve delayed refs iterations

2016-10-24 Thread Liu Bo
On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote:
> This issue was found when I tried to delete a heavily reflinked file.
> When deleting such files, other transaction operations will not have a
> chance to make progress; for example, start_transaction() will be blocked
> in wait_current_trans(root) for a long time, and sometimes it even triggers
> soft lockups. The time taken to delete such a heavily reflinked file is
> also very large, often hundreds of seconds. Using perf top, it reports
> that:
> 
> PerfTop:7416 irqs/sec  kernel:99.8%  exact:  0.0% [4000Hz cpu-clock],  
> (all, 4 CPUs)
> ---
> 84.37%  [btrfs] [k] __btrfs_run_delayed_refs.constprop.80
> 11.02%  [kernel][k] delay_tsc
>  0.79%  [kernel][k] _raw_spin_unlock_irq
>  0.78%  [kernel][k] _raw_spin_unlock_irqrestore
>  0.45%  [kernel][k] do_raw_spin_lock
>  0.18%  [kernel][k] __slab_alloc
> It seems __btrfs_run_delayed_refs() took most of the cpu time. After some
> debug work, I found it's select_delayed_ref() causing this issue: for a
> delayed head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes,
> but select_delayed_ref() will first try to iterate the node list to find
> BTRFS_ADD_DELAYED_REF nodes; obviously that's a disaster in this case and
> wastes much time.
> 
> To fix this issue, we introduce a new ref_add_list in struct
> btrfs_delayed_ref_head; then in select_delayed_ref(), if this list is not
> empty, we can directly use nodes in this list. With this patch, it just
> took about 10~15 seconds to delete the same file. Now using perf top, it
> reports that:
> 
> PerfTop:2734 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cpu-clock],  
> (all, 4 CPUs)
> 
> 
> 20.74%  [kernel]  [k] _raw_spin_unlock_irqrestore
> 16.33%  [kernel]  [k] __slab_alloc
>  5.41%  [kernel]  [k] lock_acquired
>  4.42%  [kernel]  [k] lock_acquire
>  4.05%  [kernel]  [k] lock_release
>  3.37%  [kernel]  [k] _raw_spin_unlock_irq
> 
> For normal files, this patch also helps; at least we do not need to
> iterate the whole list to find BTRFS_ADD_DELAYED_REF nodes.
> 
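A sketch of the fast path this enables in select_delayed_ref(), simplified
and assuming the list fields added below:

    static struct btrfs_delayed_ref_node *
    select_delayed_ref(struct btrfs_delayed_ref_head *head)
    {
            if (list_empty(&head->ref_list))
                    return NULL;
            /*
             * Prefer ADD refs: ref_add_list links only the
             * BTRFS_ADD_DELAYED_REF nodes, so no full-list scan is needed.
             */
            if (!list_empty(&head->ref_add_list))
                    return list_first_entry(&head->ref_add_list,
                                            struct btrfs_delayed_ref_node,
                                            add_list);
            return list_first_entry(&head->ref_list,
                                    struct btrfs_delayed_ref_node, list);
    }
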
> Signed-off-by: Wang Xiaoguang 
> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  8 
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 15 +--
>  4 files changed, 33 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 8d93854..39c28e0 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -189,6 +189,8 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>   } else {
>   assert_spin_locked(&head->lock);
>   list_del(&ref->list);
> + if (!list_empty(&ref->add_list))
> + list_del(&ref->add_list);
>   }
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
> @@ -431,6 +433,11 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
>   exist->action = ref->action;
>   mod = -exist->ref_mod;
>   exist->ref_mod = ref->ref_mod;
> + if (ref->action == BTRFS_ADD_DELAYED_REF)
> + list_add_tail(&exist->add_list,
> +   &href->ref_add_list);
> + else if (!list_empty(&exist->add_list))
> + list_del(&exist->add_list);

->action is either BTRFS_ADD_DELAYED_REF or BTRFS_DROP_DELAYED_REF, so
in the 'else' section, (!list_empty(&exist->add_list)) is indeed true.

>  		} else
>  			mod = -ref->ref_mod;
>  	}
> @@ -444,6 +451,8 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
>  
>  add_tail:
>  	list_add_tail(&ref->list, &href->ref_list);
> +	if (ref->action == BTRFS_ADD_DELAYED_REF)
> +		list_add_tail(&ref->add_list, &href->ref_add_list);
>  	atomic_inc(&root->num_entries);
>  	trans->delayed_ref_updates++;
>  	spin_unlock(&href->lock);
> @@ -590,6 +599,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>  	head_ref->must_insert_reserved = must_insert_reserved;
>  	head_ref->is_data = is_data;
>  	INIT_LIST_HEAD(&head_ref->ref_list);
> +	INIT_LIST_HEAD(&head_ref->ref_add_list);
>  	head_ref->processing = 0;
>  	head_ref->total_ref_mod = count_mod;
>  	head_ref->qgroup_reserved = 0;
> @@ -671,6 +681,8 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>  	ref->is_head = 0;
>  	ref->in_tree = 1;
>  	ref->seq = seq;
> +	INIT_LIST_HEAD(&ref->list);
> +	INIT_LIST_HEAD(&ref->add_list);
>  
>  	full_ref = btrfs_delayed_node_to_tree_ref(ref);
>  	full_ref->parent = parent;
> @@ -726,6 +738,8 @@ 
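
To illustrate the fast path this list enables, here is a minimal sketch of
how select_delayed_ref() can consume ref_add_list (the hunk with the actual
caller is not quoted above, so take this as an illustration, not the exact
patch):

static struct btrfs_delayed_ref_node *
select_delayed_ref(struct btrfs_delayed_ref_head *head)
{
	if (list_empty(&head->ref_list))
		return NULL;

	/* Prefer ADD refs so a ref count never transiently drops to zero. */
	if (!list_empty(&head->ref_add_list))
		return list_first_entry(&head->ref_add_list,
					struct btrfs_delayed_ref_node,
					add_list);

	/* No ADD refs queued: take the first node of the plain list. */
	return list_first_entry(&head->ref_list,
				struct btrfs_delayed_ref_node, list);
}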

Re: Btrfs: fix free space tree bitmaps on big-endian systems

2016-10-24 Thread Geert Uytterhoeven
Hi Omar,

On Mon, Oct 24, 2016 at 5:16 PM, Omar Sandoval  wrote:
> On Mon, Oct 24, 2016 at 09:23:04AM +0200, Geert Uytterhoeven wrote:
>> On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List
>>  wrote:
>> > Web:
>> > https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc
>> > Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc
>>
>> 520f16abf003952d in v4.7.10
>> 1ff6341b5d92dd6b in v4.8.4
>>
>> > Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed
>> > Refname:refs/heads/master
>> > Author: Omar Sandoval 
>> > AuthorDate: Thu Sep 22 17:24:20 2016 -0700
>> > Committer:  David Sterba 
>> > CommitDate: Mon Oct 3 18:52:14 2016 +0200
>> >
>> > Btrfs: fix free space tree bitmaps on big-endian systems
>> >
>> > In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
>> > bitmaps in memory and copy them directly to/from the extent buffers 
>> > with
>> > {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
>> > granularity, which is equivalent to a little-endian bitmap. This means
>> > that on big-endian systems, the in-memory bitmaps will be written to
>> > disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
>> > memory.
>>
>> This change looks overly complex to me, and decreases performance.
>
> Do you have evidence of that? Sure, it can in theory, but the change

Nope, just reading the code.

> only affects the free space tree format conversion, which is a rare
> operation. In fact, I actually benchmarked the change with [1] and saw
no noticeable difference on x86-64. In any case, now that the on-disk
> format problem is fixed, we can improve the implementation.

Good to hear that.

>> > Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
>> > Cc: sta...@vger.kernel.org # 4.5+
>> > Tested-by: Holger Hoffstätte 
>> > Tested-by: Chandan Rajendra 
>> > Signed-off-by: Omar Sandoval 
>> > Signed-off-by: David Sterba 
>> > ---
>> >  fs/btrfs/extent_io.c   | 64 
>> > +-
>> >  fs/btrfs/extent_io.h   | 22 
>> >  fs/btrfs/free-space-tree.c | 17 ++--
>> >  3 files changed, 76 insertions(+), 27 deletions(-)
>> >
>> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> > index 44fe66b..c3ec30d 100644
>> > --- a/fs/btrfs/extent_io.c
>> > +++ b/fs/btrfs/extent_io.c
>> > @@ -5524,17 +5524,45 @@ void copy_extent_buffer(struct extent_buffer *dst, 
>> > struct extent_buffer *src,
>> > }
>> >  }
>> >
>> > -/*
>> > - * The extent buffer bitmap operations are done with byte granularity 
>> > because
>> > - * bitmap items are not guaranteed to be aligned to a word and therefore a
>> > - * single word in a bitmap may straddle two pages in the extent buffer.
>> > - */
>> > -#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
>> > -#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
>> > -#define BITMAP_FIRST_BYTE_MASK(start) \
>> > -   ((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
>> > -#define BITMAP_LAST_BYTE_MASK(nbits) \
>> > -   (BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
>> > +void le_bitmap_set(u8 *map, unsigned int start, int len)
>> > +{
>> > +   u8 *p = map + BIT_BYTE(start);
>>
>> You cannot use cpu_to_le32/cpu_to_le64 on the masks and operate on
>> unsigned longs in memory because there's no alignment guarantee, right?
>
> That's right.
>
>> > +   const unsigned int size = start + len;
>> > +   int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
>> > +   u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
>> > +
>> > +   while (len - bits_to_set >= 0) {
>> > +   *p |= mask_to_set;
>> > +   len -= bits_to_set;
>> > +   bits_to_set = BITS_PER_BYTE;
>> > +   mask_to_set = ~(u8)0;
>> > +   p++;
>> > +   }
>>
>> memset() for all but the first partial byte (if present)?
>
> Shrug, the original bitmap_set() helper doesn't do this. For our use
> case, we're not expecting to do big spans with this since our free space
> must be pretty fragmented for us to be using this in the first place.

The original bitmap_set() helper doesn't do byte writes, but 32/64-bit writes,
which is much closer to memset() from a performance point of view.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
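
For comparison, a stand-alone userspace sketch of the memset() variant Geert
suggests above, assuming the same byte-granular little-endian layout as
le_bitmap_set(); an illustration, not kernel code:

#include <stdint.h>
#include <string.h>

#define BITS_PER_BYTE 8
#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
#define BITMAP_FIRST_BYTE_MASK(start) \
	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
#define BITMAP_LAST_BYTE_MASK(nbits) \
	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))

/* Set bits [start, start + len) in a byte-granular (little-endian) bitmap. */
void le_bitmap_set_memset(uint8_t *map, unsigned int start, int len)
{
	uint8_t *p = map + BIT_BYTE(start);
	const unsigned int size = start + len;
	int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
	uint8_t mask_to_set = BITMAP_FIRST_BYTE_MASK(start);

	if (len >= bits_to_set) {
		/* Partial (or full) first byte. */
		*p++ |= mask_to_set;
		len -= bits_to_set;
		/* All the whole bytes in between, in one call. */
		memset(p, 0xff, len / BITS_PER_BYTE);
		p += len / BITS_PER_BYTE;
		len %= BITS_PER_BYTE;
		mask_to_set = ~(uint8_t)0;
	}
	if (len) {
		/* Partial last byte. */
		*p |= mask_to_set & BITMAP_LAST_BYTE_MASK(size);
	}
}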

Re: [PATCH 1/2] btrfs-progs: fsck: Add support to clear v1 free space cache.

2016-10-24 Thread David Sterba
On Thu, Oct 20, 2016 at 09:04:23AM +0800, Qu Wenruo wrote:
> >> +  /* Clear all free space cache inodes and its extent data */
> >> +  while (1) {
> >> +  bg_cache = btrfs_lookup_first_block_group(fs_info, current);
> >> +  if (!bg_cache)
> >> +  break;
> >> +  ret = btrfs_clear_free_space_cache(fs_info, bg_cache);
> >
> > The function can fail for a lot of reasons, what would be the filesystem
> > state when we exit here? Some of the inodes could be cleared completely,
> > the last one partially.  The function copes with a missing inode item
> > but I don't know how many other intermediate states could be left.
> 
> If we exit here, no damage will be done to the filesystem, since we are
> protected by the transaction.
> 
> As you can see in btrfs_clear_free_space_cache(),
> it will only commit the transaction once we have fully cleaned the whole
> inode and its free space header.

Ah I see, each blockgroup is killed inside a transaction. Then there's
one more to invalidate the super block cache generation.

> So we're quite safe here: the free space header and inode are cleaned
> atomically.
> 
> PS: We really need btrfs_abort_transaction(), or when we exit abnormally
> we will get a lot of backtraces/warnings about uncommitted transactions.

Yeah, the userspace is lagging behind the kernel code in transaction
error handling. Patches welcome; this might be a big change all over the
place, so I suggest doing it in smaller batches.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
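
For context, the clearing loop under discussion would look roughly like
this; a sketch only, with the error handling and the advance of 'current'
assumed rather than quoted from the patch:

	u64 current = 0;
	struct btrfs_block_group_cache *bg_cache;
	int ret;

	/* Clear all free space cache inodes and their extent data. */
	while (1) {
		bg_cache = btrfs_lookup_first_block_group(fs_info, current);
		if (!bg_cache)
			break;
		ret = btrfs_clear_free_space_cache(fs_info, bg_cache);
		/*
		 * Each inode is cleared inside its own committed transaction,
		 * so bailing out here leaves the filesystem consistent.
		 */
		if (ret < 0)
			return ret;
		current = bg_cache->key.objectid + bg_cache->key.offset;
	}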


Re: [PATCH v9.1 0/5] In-band de-duplication for btrfs-progs

2016-10-24 Thread David Sterba
On Thu, Oct 20, 2016 at 10:06:59AM +0800, Qu Wenruo wrote:
> Patchset can be fetched from github:
> https://github.com/adam900710/btrfs-progs.git dedupe_latest

Branch added to progs integration, for preview and testing.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v13.1 00/14] Btrfs In-band De-duplication

2016-10-24 Thread David Sterba
On Thu, Oct 20, 2016 at 10:03:39AM +0800, Qu Wenruo wrote:
> Qu Wenruo (4):
>   btrfs: delayed-ref: Add support for increasing data ref under spinlock
>   btrfs: dedupe: Inband in-memory only de-duplication implement
>   btrfs: relocation: Enhance error handling to avoid BUG_ON
>   btrfs: dedupe: Introduce new reconfigure ioctl
> 
> Wang Xiaoguang (10):
>   btrfs: dedupe: Introduce dedupe framework and its header
>   btrfs: dedupe: Introduce function to initialize dedupe info
>   btrfs: dedupe: Introduce function to add hash into in-memory tree
>   btrfs: dedupe: Introduce function to remove hash from in-memory tree
>   btrfs: dedupe: Introduce function to search for an existing hash
>   btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>   btrfs: ordered-extent: Add support for dedupe
>   btrfs: dedupe: Add ioctl for inband dedupelication
>   btrfs: improve inode's outstanding_extents computation

I've noticed that this patch is different from what Wang sent
independently. Is the patch necessary for the dedupe patchset? If not,
please drop it.

The merge conflict is in the checks for free space inode,

  if (root == root->fs_info->tree_root)

vs

  if (!btrfs_is_free_space_inode(...))

If you need the patch, then please update it to the latest version. I
won't get the merge conflicts at least, this is acceptable for the
testing 'for-next' branch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: imporve delayed refs iterations

2016-10-24 Thread Holger Hoffstätte
On 10/24/16 18:46, David Sterba wrote:
> On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote:
>> This issue was found when I tried to delete a heavily reflinked file;
>> when deleting such files, other transaction operations will not have a
>> chance to make progress. For example, start_transaction() will be blocked
>> in wait_current_trans(root) for a long time, sometimes it even triggers
>> soft lockups, and the time taken to delete such a heavily reflinked file
>> is also very large, often hundreds of seconds. Using perf top, it reports
>> that:
> 
>> [...] With this patch, it just took about 10~15 seconds to
>> delete the same file.
> 
> Great improvement! Patch looks good on a quick skim so I'll add it to
> next, but proper review is still required.

If it helps, I've been running it for ~2 days now with no negative side
effects, mostly rsync creating & deleting files with various levels of
reflinking (via snapshots). No problems at all.
Also tried to manually create & delete a large file with heavy CoW and
hundreds of reflinked copies - no problem either and pretty fast.

So..

Tested-by: Holger Hoffstätte 

cheers,
Holger

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: check: release path after usage

2016-10-24 Thread David Sterba
On Mon, Oct 24, 2016 at 10:18:14AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> While performing an fsck, an assertion failure occurs because of reusing path 
> in a loop.
> ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` 
> failed, value 0
> 
> Signed-off-by: Goldwyn Rodrigues 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Fix wrong tree block alignment for unalianged block group

2016-10-24 Thread David Sterba
On Mon, Oct 24, 2016 at 03:22:33PM +0800, Qu Wenruo wrote:
> Commit 854437ca(btrfs-progs: extent-tree: avoid allocating tree block
> that crosses stripe boundary) introduces check for logical bytenr not
> crossing stripe boundary.
> 
> However that check is not completely correct.
> It only checks the logical bytenr and length against the absolute logical
> offset.
> That is to say, it only checks whether a tree block lies in a 64K logical
> stripe.
> 
> But in fact, it's possible a block group starts at bytenr unaligned with
> 64K, just like the following case.
> 
> Then btrfsck will give false alert.
> 
> 0      32K     64K     96K     128K    160K ...
>        |--------------- Block group A ---------------
>        |<------ TB 32K ------>|
>        |<-- scrub stripe unit -->|
> |<----- WRONG UNIT ----->|
> 
> In that case, the TB (tree block) at bytenr 32K in fact fits into the
> kernel scrub stripe unit, but doesn't fit into the pure logical 64K
> stripe.
> 
> Fix check_crossing_stripes() to compare bytenr to block group start, not
> to absolute logical bytenr.
> 
> Reported-by: Jussi Kansanen 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
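
The fix is easy to picture with a tiny stand-alone example: the same tree
block either crosses or fits depending on whether the 64K grid is anchored
at the block group start (the new check) or at logical offset 0 (the old
one). The helper and numbers below are illustrative, not the btrfs-progs
code:

#include <stdio.h>
#include <stdint.h>

#define BTRFS_STRIPE_LEN (64 * 1024ULL)

/* Pass bg_start == 0 to reproduce the old absolute check. */
static int crosses_stripe(uint64_t bg_start, uint64_t bytenr, uint64_t len)
{
	uint64_t off = bytenr - bg_start;	/* offset inside the block group */

	return off / BTRFS_STRIPE_LEN != (off + len - 1) / BTRFS_STRIPE_LEN;
}

int main(void)
{
	uint64_t bg_start = 32 * 1024;	/* block group starts unaligned */
	uint64_t bytenr = 56 * 1024;	/* tree block inside that block group */
	uint64_t nodesize = 16 * 1024;

	/* [56K, 72K) straddles the logical 64K line: a false alert. */
	printf("absolute check: %d\n", crosses_stripe(0, bytenr, nodesize));
	/* 24K into the block group, well inside one scrub unit: fine. */
	printf("relative check: %d\n", crosses_stripe(bg_start, bytenr, nodesize));
	return 0;
}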


Re: [PATCH v2] btrfs: change btrfs_csum_final result param type to u8

2016-10-24 Thread David Sterba
On Fri, Oct 21, 2016 at 12:47:02PM +0100, Domagoj Tršan wrote:
> The csum member of struct btrfs_super_block is an array of u8, so it makes
> sense that btrfs_csum_final() should also be declared to accept u8 *. I
> changed the declaration from
> void btrfs_csum_final(u32 crc, char *result);
> to void btrfs_csum_final(u32 crc, u8 *result);

Ok that's better, just the signed-off-by line is missing. I would add it
myself on behalf of contributors I know, but as a matter of practice,
fix it please and resend the patch.  Using the "git commit -s" adds the
line for you.

It is a formality, established in linux kernel development. If you
need more explanation of the what and why, please refer to the following docs.

http://elinux.org/Developer_Certificate_Of_Origin
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
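
For reference, the prototype under discussion, with a sketch of the body;
the kernel stores the inverted crc32c little-endian via
put_unaligned_le32(), but treat the exact body as an assumption since the
patch itself is not quoted here:

void btrfs_csum_final(u32 crc, u8 *result)
{
	/* csums are stored inverted and little-endian on disk */
	put_unaligned_le32(~crc, result);
}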


Re: [PATCH] btrfs: imporve delayed refs iterations

2016-10-24 Thread David Sterba
On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote:
> This issue was found when I tried to delete a heavily reflinked file;
> when deleting such files, other transaction operations will not have a
> chance to make progress. For example, start_transaction() will be blocked
> in wait_current_trans(root) for a long time, sometimes it even triggers
> soft lockups, and the time taken to delete such a heavily reflinked file
> is also very large, often hundreds of seconds. Using perf top, it reports
> that:

> [...] With this patch, it just took about 10~15 seconds to
> delete the same file.

Great improvement! Patch looks good on a quick skim so I'll add it to
next, but proper review is still required.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: introduce priority based delalloc shrink mechanism

2016-10-24 Thread David Sterba
Hi Josef,

are you fine with V2?

On Thu, Oct 13, 2016 at 05:31:25PM +0800, Wang Xiaoguang wrote:
> Since commit b02441999efcc6152b87cd58e7970bb7843f76cf, we don't wait for all
> ordered extents, but I ran into some enospc errors when doing large file
> create and delete tests; it's because shrink_delalloc() does not write
> enough delalloc bytes and wait for them to finish:
> From: Miao Xie 
> Date: Mon, 4 Nov 2013 23:13:25 +0800
> Subject: [PATCH] Btrfs: don't wait for the completion of all the ordered 
> extents
> 
> It is very likely that there are lots of ordered extents in the filesystem;
> if we wait for the completion of all of them when we want to reclaim some
> space for the metadata space reservation, we would be blocked for a long
> time. The performance would drop down suddenly for a long time.
> 
> Here we introduce a simple reclaim_priority variable: the lower the
> value, the higher the priority, and 0 is the maximum priority. The core
> idea is:
> 
> delalloc_bytes = percpu_counter_sum_positive(&root->fs_info->delalloc_bytes);
> if (reclaim_priority)
> 	to_reclaim = orig * (2 << (BTRFS_DEFAULT_FLUSH_PRIORITY -
> 				   reclaim_priority));
> else
> 	to_reclaim = delalloc_bytes;
> 
> Here 'orig' is the amount of metadata space we want to reserve; as the
> priority rises, we will try to write more delalloc bytes, and if
> "reclaim_priority == 0" we'll also wait for all current ordered extents
> to finish.
> 
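
A worked example of that sizing rule as a stand-alone program, with made-up
numbers (BTRFS_DEFAULT_FLUSH_PRIORITY matches the value the patch below
defines):

#include <stdio.h>
#include <stdint.h>

#define BTRFS_DEFAULT_FLUSH_PRIORITY 3

static uint64_t calc_to_reclaim(uint64_t orig, uint64_t delalloc_bytes,
				int reclaim_priority)
{
	if (reclaim_priority)
		return orig * (2 << (BTRFS_DEFAULT_FLUSH_PRIORITY -
				     reclaim_priority));
	/* Maximum priority: reclaim all dirty delalloc bytes. */
	return delalloc_bytes;
}

int main(void)
{
	uint64_t orig = 1ULL << 20;		/* reserve 1 MiB of metadata */
	uint64_t delalloc = 512ULL << 20;	/* 512 MiB of dirty delalloc */
	int prio;

	/* Prints 2, 4, 8 and finally 512 MiB as the priority rises. */
	for (prio = BTRFS_DEFAULT_FLUSH_PRIORITY; prio >= 0; prio--)
		printf("priority %d -> reclaim %llu MiB\n", prio,
		       (unsigned long long)(calc_to_reclaim(orig, delalloc,
							    prio) >> 20));
	return 0;
}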
> Signed-off-by: Wang Xiaoguang 
> ---
>  fs/btrfs/extent-tree.c   | 63 
> ++--
>  include/trace/events/btrfs.h | 11 +++-
>  2 files changed, 42 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e08791d..7477c25 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4686,16 +4686,18 @@ static inline int calc_reclaim_items_nr(struct btrfs_root *root, u64 to_reclaim)
>  }
>  
>  #define EXTENT_SIZE_PER_ITEM SZ_256K
> +#define BTRFS_DEFAULT_FLUSH_PRIORITY	3
>  
>  /*
>   * shrink metadata reservation for delalloc
>   */
> -static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
> -			    bool wait_ordered)
> +static void shrink_delalloc(struct btrfs_root *root, u64 orig,
> +			    bool wait_ordered, int reclaim_priority)
>  {
>   struct btrfs_block_rsv *block_rsv;
>   struct btrfs_space_info *space_info;
>   struct btrfs_trans_handle *trans;
> + u64 to_reclaim;
>   u64 delalloc_bytes;
>   u64 max_reclaim;
>   long time_left;
> @@ -4703,22 +4705,36 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
>   int loops;
>   int items;
>   enum btrfs_reserve_flush_enum flush;
> + int items_to_wait;
> +
> +	delalloc_bytes = percpu_counter_sum_positive(
> +				&root->fs_info->delalloc_bytes);
> + if (reclaim_priority < 0)
> + reclaim_priority = 0;
> +
> + if (reclaim_priority)
> + to_reclaim = orig * (2 << (BTRFS_DEFAULT_FLUSH_PRIORITY -
> + reclaim_priority));
> + else
> + to_reclaim = delalloc_bytes;
>  
>   /* Calc the number of the pages we need flush for space reservation */
>   items = calc_reclaim_items_nr(root, to_reclaim);
>   to_reclaim = (u64)items * EXTENT_SIZE_PER_ITEM;
> + if (reclaim_priority)
> + items_to_wait = items;
> + else
> + items_to_wait = -1;
>  
>   trans = (struct btrfs_trans_handle *)current->journal_info;
>  	block_rsv = &root->fs_info->delalloc_block_rsv;
>   space_info = block_rsv->space_info;
>  
> -	delalloc_bytes = percpu_counter_sum_positive(
> -				&root->fs_info->delalloc_bytes);
>   if (delalloc_bytes == 0) {
>   if (trans)
>   return;
>   if (wait_ordered)
> - btrfs_wait_ordered_roots(root->fs_info, items,
> + btrfs_wait_ordered_roots(root->fs_info, items_to_wait,
>0, (u64)-1);
>   return;
>   }
> @@ -4763,7 +4779,7 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
>  
>   loops++;
>   if (wait_ordered && !trans) {
> - btrfs_wait_ordered_roots(root->fs_info, items,
> + btrfs_wait_ordered_roots(root->fs_info, items_to_wait,
>0, (u64)-1);
>   } else {
>   time_left = schedule_timeout_killable(1);
> @@ -4836,7 +4852,7 @@ struct reserve_ticket {
>  
>  static int flush_space(struct btrfs_root *root,
>  struct btrfs_space_info *space_info, u64 num_bytes,
> -

Re: Btrfs: fix free space tree bitmaps on big-endian systems

2016-10-24 Thread David Sterba
On Mon, Oct 24, 2016 at 09:23:04AM +0200, Geert Uytterhoeven wrote:
> On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List
>  wrote:
> > Web:
> > https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc
> > Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc
> 
> 520f16abf003952d in v4.7.10
> 1ff6341b5d92dd6b in v4.8.4
> 
> > Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed
> > Refname:refs/heads/master
> > Author: Omar Sandoval 
> > AuthorDate: Thu Sep 22 17:24:20 2016 -0700
> > Committer:  David Sterba 
> > CommitDate: Mon Oct 3 18:52:14 2016 +0200
> >
> > Btrfs: fix free space tree bitmaps on big-endian systems
> >
> > In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
> > bitmaps in memory and copy them directly to/from the extent buffers with
> > {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
> > granularity, which is equivalent to a little-endian bitmap. This means
> > that on big-endian systems, the in-memory bitmaps will be written to
> > disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
> > memory.
> 
> This change looks overly complex to me, and decreases performance.

We had to make it correct, performance is not an issue at the moment.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Initialize btrfs_path before use

2016-10-24 Thread Goldwyn Rodrigues


On 10/24/2016 09:57 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> While performing an fsck, an assertion failure occurs because of reusing path 
> in a loop.
> ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` 
> failed, value 0
> 
> Signed-off-by: Goldwyn Rodrigues 
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index 670ccd1..a6f281c 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -7541,6 +7541,7 @@ static int record_orphan_data_extents(struct 
> btrfs_fs_info *fs_info,
>   key.objectid = dback->owner;
>   key.type = BTRFS_EXTENT_DATA_KEY;
>   key.offset = dback->offset;
> + btrfs_init_path(path);
>  
> 	ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0);
>   /*
> 

This is an incorrect fix. Please ignore. I shall send the correct fix
shortly. It should be release_path() instead.

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Release path after usage.

2016-10-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

While performing an fsck, an assertion failure occurs because of reusing path 
in a loop.
ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` 
failed, value 0

Signed-off-by: Goldwyn Rodrigues 


diff --git a/cmds-check.c b/cmds-check.c
index 670ccd1..5d9c724 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7543,6 +7543,7 @@ static int record_orphan_data_extents(struct 
btrfs_fs_info *fs_info,
key.offset = dback->offset;
 
	ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0);
+   btrfs_release_path(path);
/*
 * For ret < 0, it's OK since the fs-tree may be corrupted,
 * we need to record it for inode/file extent rebuild.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: fix free space tree bitmaps on big-endian systems

2016-10-24 Thread Omar Sandoval
On Mon, Oct 24, 2016 at 09:23:04AM +0200, Geert Uytterhoeven wrote:
> On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List
>  wrote:
> > Web:
> > https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc
> > Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc
> 
> 520f16abf003952d in v4.7.10
> 1ff6341b5d92dd6b in v4.8.4
> 
> > Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed
> > Refname:refs/heads/master
> > Author: Omar Sandoval 
> > AuthorDate: Thu Sep 22 17:24:20 2016 -0700
> > Committer:  David Sterba 
> > CommitDate: Mon Oct 3 18:52:14 2016 +0200
> >
> > Btrfs: fix free space tree bitmaps on big-endian systems
> >
> > In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
> > bitmaps in memory and copy them directly to/from the extent buffers with
> > {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
> > granularity, which is equivalent to a little-endian bitmap. This means
> > that on big-endian systems, the in-memory bitmaps will be written to
> > disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
> > memory.
> 
> This change looks overly complex to me, and decreases performance.

Do you have evidence of that? Sure, it can in theory, but the change
only affects the free space tree format conversion, which is a rare
operation. In fact, I actually benchmarked the change with [1] and saw
no noticeable difference on x86-64. In any case, now that the on-disk
format problem is fixed, we can improve the implementation.

> >
> > Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
> > Cc: sta...@vger.kernel.org # 4.5+
> > Tested-by: Holger Hoffstätte 
> > Tested-by: Chandan Rajendra 
> > Signed-off-by: Omar Sandoval 
> > Signed-off-by: David Sterba 
> > ---
> >  fs/btrfs/extent_io.c   | 64 
> > +-
> >  fs/btrfs/extent_io.h   | 22 
> >  fs/btrfs/free-space-tree.c | 17 ++--
> >  3 files changed, 76 insertions(+), 27 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 44fe66b..c3ec30d 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5524,17 +5524,45 @@ void copy_extent_buffer(struct extent_buffer *dst, 
> > struct extent_buffer *src,
> > }
> >  }
> >
> > -/*
> > - * The extent buffer bitmap operations are done with byte granularity 
> > because
> > - * bitmap items are not guaranteed to be aligned to a word and therefore a
> > - * single word in a bitmap may straddle two pages in the extent buffer.
> > - */
> > -#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
> > -#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
> > -#define BITMAP_FIRST_BYTE_MASK(start) \
> > -   ((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
> > -#define BITMAP_LAST_BYTE_MASK(nbits) \
> > -   (BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
> > +void le_bitmap_set(u8 *map, unsigned int start, int len)
> > +{
> > +   u8 *p = map + BIT_BYTE(start);
> 
> You cannot use cpu_to_le32/cpu_to_le64 on the masks and operate on
> unsigned longs in memory because there's no alignment guarantee, right?

That's right.

> > +   const unsigned int size = start + len;
> > +   int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> > +   u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
> > +
> > +   while (len - bits_to_set >= 0) {
> > +   *p |= mask_to_set;
> > +   len -= bits_to_set;
> > +   bits_to_set = BITS_PER_BYTE;
> > +   mask_to_set = ~(u8)0;
> > +   p++;
> > +   }
> 
> memset() for all but the first partial byte (if present)?

Shrug, the original bitmap_set() helper doesn't do this. For our use
case, we're not expecting to do big spans with this since our free space
must be pretty fragmented for us to be using this in the first place.

> > +   if (len) {
> > +   mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
> > +   *p |= mask_to_set;
> > +   }
> > +}
> > +
> > +void le_bitmap_clear(u8 *map, unsigned int start, int len)
> > +{
> > +   u8 *p = map + BIT_BYTE(start);
> > +   const unsigned int size = start + len;
> > +   int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> > +   u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
> > +
> > +   while (len - bits_to_clear >= 0) {
> > +   *p &= ~mask_to_clear;
> > +   len -= bits_to_clear;
> > +   bits_to_clear = BITS_PER_BYTE;
> > +   mask_to_clear = ~(u8)0;
> > +   p++;
> > +   }
> 
> memset() for all but the first partial byte (if present)?
> 
> > +   if (len) {
> > +   mask_to_clear &= 

Re: [PATCH] btrfs-progs: Initialize btrfs_path before use

2016-10-24 Thread Filipe Manana
On Mon, Oct 24, 2016 at 3:57 PM, Goldwyn Rodrigues  wrote:
> From: Goldwyn Rodrigues 
>
> While performing an fsck, an assertion failure occurs because of reusing path 
> in a loop.
> ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` 
> failed, value 0
>
> Signed-off-by: Goldwyn Rodrigues 
>
> diff --git a/cmds-check.c b/cmds-check.c
> index 670ccd1..a6f281c 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -7541,6 +7541,7 @@ static int record_orphan_data_extents(struct 
> btrfs_fs_info *fs_info,
> key.objectid = dback->owner;
> key.type = BTRFS_EXTENT_DATA_KEY;
> key.offset = dback->offset;
> +   btrfs_init_path(path);

Hi Goldwyn,

You want to call btrfs_release_path(), otherwise you leak memory. And
it would be better placed before the 'continue' statement.

thanks

>
> 	ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0);
> /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
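
The pattern Filipe asks for, sketched with hypothetical names
(lookup_backrefs, backref_list and the element type stand in for the
cmds-check.c specifics; only the path alloc/search/release calls are real
btrfs-progs APIs):

static int lookup_backrefs(struct btrfs_root *dest_root,
			   struct list_head *backref_list)
{
	struct btrfs_path *path;
	struct data_backref *dback;	/* hypothetical element type */
	struct btrfs_key key;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	list_for_each_entry(dback, backref_list, list) {
		key.objectid = dback->owner;
		key.type = BTRFS_EXTENT_DATA_KEY;
		key.offset = dback->offset;

		ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0);
		if (ret < 0) {
			/* Corrupted fs-tree: record it and keep going. */
			btrfs_release_path(path);	/* before 'continue' */
			continue;
		}
		/* ... examine path->nodes[0] ... */
		btrfs_release_path(path);
	}
	btrfs_free_path(path);
	return 0;
}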


[PATCH] btrfs-progs: Initialize btrfs_path before use

2016-10-24 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

While performing an fsck, an assertion failure occurs because of reusing path 
in a loop.
ctree.c:1112: btrfs_search_slot: Warning: assertion `p->nodes[0] != NULL` 
failed, value 0

Signed-off-by: Goldwyn Rodrigues 

diff --git a/cmds-check.c b/cmds-check.c
index 670ccd1..a6f281c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7541,6 +7541,7 @@ static int record_orphan_data_extents(struct 
btrfs_fs_info *fs_info,
key.objectid = dback->owner;
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = dback->offset;
+   btrfs_init_path(path);
 
	ret = btrfs_search_slot(NULL, dest_root, &key, path, 0, 0);
/*
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs progs pre-release 4.8.2-rc1

2016-10-24 Thread David Sterba
Hi,

a pre-release has been tagged. There are build and misc bugfixes. I'm doing an
rc for a minor release as it proved to be useful in the past when dealing with
build errors. Please test.  ETA for 4.8.2 is in +2 days.

Changes since 4.8.1:
  * convert: also convert file attributes
  * check: quota verify fixes, handle reloc tree
  * build: add stub for FIEMAP_EXTENT_SHARED, compiles on ancient kernels
  * build: add stub for BUILD_ASSERT when ioctl.h is included
  * dump-tree: don't crash on unrecognized tree id for -t
  * tests:
* add more ioctl tests
* convert: more symlink tests, attribute tests
* quota verify for reloc tree
  * other cleanups

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (10):
  btrfs-progs: ioctl: add 32bit compat for SET_RECEIVED_SUBVOL
  btrfs-progs: ioctl: add 64bit compat for SEND
  btrfs-progs: tests: make the ioctl-test actually useful
  btrfs-progs: test: add default ioctl-test build
  btrfs-progs: mkfs: detect version of running kernel
  btrfs-progs: build: detect fiemap shared flag but don't fail build
  btrfs-progs: dump-tree: fix crash on unrecognized tree id
  btrfs-progs: ioctl: fix build failure if BUILD_ASSERT is not defined
  btrfs-progs: update CHANGES for v4.8.2
  Btrfs progs v4.8.2-rc1

Karel Zak (1):
  btrfs-progs: rename btrfs_scan_lblkid() to btrfs_scan_devices()

Lakshmipathi.G (2):
  btrfs-progs: btrfs-debugfs: cleanup unused variables reported by pylint
  btrfs-progs: Add fast,slow symlinks, fifo types to convert test

Lu Fengqi (1):
  btrfs-progs: move btrfs_extref_hash() to hash.h

Qu Wenruo (9):
  btrfs-progs: Copy btrfs inode flags from kernel header
  btrfs-progs: Make btrfs-debug-tree print all readable strings for inode 
flags
  btrfs-progs: convert: Convert ext inode flags to btrfs inode flags
  btrfs-progs: convert-test: Add test case for common inode flags
  btrfs-progs: Fix stack overflow for checking qgroup on tree reloc tree
  btrfs-progs: test: Add test image for btrfsck qgroup rescan detection
  btrfs-progs: test: Add image for quota verify stack overflow
  btrfs-progs: rename raid6.c to raid56.c
  btrfs-progs: volumes: Remove BUG_ON in raid56 write routine

Tsutomu Itoh (2):
  btrfs-progs: image: fix compiler warning
  btrfs-progs: send: remove unnecessary code
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-24 Thread Chris Mason



On 10/24/2016 12:40 AM, Dave Jones wrote:

On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote:
 >
 >
 > On 10/22/2016 11:20 AM, Dave Jones wrote:
 > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:
 > >
 > >  >  > It could be worth trying this, too:
 > >  >  >
 > >  >  > 
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack=174531fef4e8
 > >  >  >
 > >  >  > It occurred to me that the current code is a little bit fragile.
 > >  >
 > >  > It's been nearly 24hrs with the above changes, and it's been pretty much
 > >  > silent the whole time.
 > >  >
 > >  > The only thing of note over that time period has been a btrfs lockdep
 > >  > warning that's been around for a while, and occasional btrfs checksum
 > >  > failures, which I've been seeing for a while, but seem to have gotten
 > >  > worse since 4.8.
 > >  >
 > >  > I'm pretty confident in the disk being ok in this machine, so I think
 > >  > the checksum warnings are bogus.  Chris suggested they may be the result
 > >  > of memory corruption, but there's little else going on.
 > >
 > > The only interesting thing last nights run was this..
 > >
 > > BUG: Bad page state in process kworker/u8:1  pfn:4e2b70
 > > page:ea00138adc00 count:0 mapcount:0 mapping:88046e9fc2e0 
index:0xdf0
 > > flags: 0x400c(referenced|uptodate)
 > > page dumped because: non-NULL mapping
 > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
 > > Workqueue: writeback wb_workfn (flush-btrfs-2)
 >
 > Well crud, we're back to wondering if this is Btrfs or the stack
 > corruption.  Since the pagevecs are on the stack and this is a new
 > crash, my guess is you'll be able to trigger it on xfs/ext4 too.  But we
 > should make sure.

Here's an interesting one from today, pointing the finger at xattrs again.


[69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ 
#11
[69943.463510] task: 8804f8dd3740 task.stack: c9000b108000
[69943.468077] RIP: 0010:[]


Was this btrfs?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ioctl.h: add missing kernel compatibility header for BUILD_ASSERT

2016-10-24 Thread David Sterba
Hi,

On Mon, Oct 24, 2016 at 09:29:12AM +0100, sly...@gmail.com wrote:
> From: Sergei Trofimovich 
> 
> Header breakage noticed by cynede. Reproducible as:
> 
> $ gcc -c /usr/include/btrfs/ioctl.h -o /tmp/a.o
> /usr/include/btrfs/ioctl.h:42:14: error: expected declaration specifiers or '...' before 'sizeof'
>  BUILD_ASSERT(sizeof(struct btrfs_ioctl_vol_args) == 4096);
>   ^~
> 
> Basically gcc is telling us that BUILD_ASSERT is not visible.
> 
> BUILD_ASSERT lives in kerncompat.h which this change adds.

I think including the kerncompat.h is too intrusive here; I've fixed it by
providing an empty macro if it's not defined. I'll release 4.8.2 soon.
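
Presumably something along these lines; a sketch of the described fallback,
not the actual commit:

#ifndef BUILD_ASSERT
/* ioctl.h used standalone: degrade to a no-op instead of breaking the build */
#define BUILD_ASSERT(x)
#endif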
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Drive Replacement

2016-10-24 Thread Austin S. Hemmelgarn

On 2016-10-21 18:13, Peter Becker wrote:

> if you have >750 GB free you can simply remove one of the drives.
>
> btrfs device delete /dev/sd[x] /mnt
> #power off, replace device
> btrfs device add /dev/sd[y] /mnt
Make sure to balance afterwards if you do this, the new disk will be 
pretty much unused until you do.


> if not you can use a USB-SATA adapter or an eSATA port and do the following:
>
> btrfs device add /dev/sd[y] /mnt
> btrfs device delete /dev/sd[x] /mnt
> #power off, replace device
I will comment that eSATA is vastly preferred to USB in this case (even
a USB3.0 UAS device) as it is generally significantly more reliable
(even if you are just using an eSATA to SATA cable and a power adapter
for the drive).


> i avoid "btrfs device replace" because it's slower than add+delete
In my experience, this is only true if you add then delete (delete then
add then re-balance will move a lot of data twice), and even then only
if the device is more than about half full.  Device replace also has a
couple of specific advantages:
* It lets you get an exact percentage completion, unlike add+delete.
  This is only updated every few seconds, and doesn't show an estimate
  of time remaining, but is still better than nothing.
* Device IDs remain the same.  This can be an advantage or a
  disadvantage, but it's usually a good thing in a case like this,
  because the mapping between device number and device node (and
  therefore specific disks) will remain constant, which makes tracking
  which physical device is failing a bit easier if you have consistent
  device enumeration for storage devices (which you almost certainly do
  if you're using SATA).
* It doesn't write any data on any other device except for superblocks
  and the basic metadata that describes the device layout.  This is
  important because it means that it's safer for data that isn't on the
  device being replaced, and it has less impact on other operations when
  doing a live replacement.


> and don't forget to update fstab!
Assuming that he doesn't change which SATA ports the devices are 
connected to, he shouldn't have to change anything in /etc/fstab.  Mount 
by UUID or LABEL will just work regardless, and mount by device node 
will continue to work as long as the same device nodes are used (which 
will be the case if he doesn't change anything else in the storage stack).


2016-10-22 0:07 GMT+02:00 Hugo Mills :

On Sat, Oct 22, 2016 at 09:03:16AM +1100, Gareth Pye wrote:

I've got a BTRFS array that is of mixed size disks:
2x750G
3x1.5T
3x3T
And it's getting fuller than I'd like. The problem is that adding
disks is harder than one would like as the computer only has 8 sata
ports. Is it viable to do the following to upgrade one of the disks?

A) Take array offline
B) DD the contents of one of the 750G drives to a new 3T drive
C) Remove the 750G from the system
D) btrfs scan
E) Mount array
F) Run a balance

I know that not physically removing the old copy of the drive will
cause massive issues, but if I do that everything should be fine
right?


   Yes. The one thing missing here is running

# btrfs dev resize <devid>:max /mountpoint

on the new device between steps E and F to allow the FS to use the
full amount of the device. Otherwise, it'll still be the same size as
the original.

   Hugo.

--
Hugo Mills | Great films about cricket: Batsman Begins
hugo@... carfax.org.uk | starring Christian Bail
http://carfax.org.uk/  |
PGP: E2AB1DE4  |

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ioctl.h: add missing kernel compatibility header for BUILD_ASSERT

2016-10-24 Thread slyich
From: Sergei Trofimovich 

Header breakage noticed by cynede. Reproducible as:

$ gcc -c /usr/include/btrfs/ioctl.h -o /tmp/a.o
/usr/include/btrfs/ioctl.h:42:14: error: expected declaration specifiers or '...' before 'sizeof'
 BUILD_ASSERT(sizeof(struct btrfs_ioctl_vol_args) == 4096);
  ^~

Basically gcc is telling us that BUILD_ASSERT is not visible.

BUILD_ASSERT lives in kerncompat.h which this change adds.

Reported-by: Mikhail Pukhlikov 
Signed-off-by: Sergei Trofimovich 
---
 ioctl.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/ioctl.h b/ioctl.h
index 238e7ef..ee650a9 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -26,6 +26,12 @@ extern "C" {
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
+#if BTRFS_FLAT_INCLUDES
+#include "kerncompat.h"
+#else
+#include <btrfs/kerncompat.h>
+#endif /* BTRFS_FLAT_INCLUDES */
+
 #ifndef __user
 #define __user
 #endif
-- 
2.10.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: fix free space tree bitmaps on big-endian systems

2016-10-24 Thread Geert Uytterhoeven
On Sat, Oct 15, 2016 at 2:46 AM, Linux Kernel Mailing List
 wrote:
> Web:
> https://git.kernel.org/torvalds/c/2fe1d55134fce05c17ea118a2e37a4af771887bc
> Commit: 2fe1d55134fce05c17ea118a2e37a4af771887bc

520f16abf003952d in v4.7.10
1ff6341b5d92dd6b in v4.8.4

> Parent: 08895a8b6b06ed2323cd97a36ee40a116b3db8ed
> Refname:refs/heads/master
> Author: Omar Sandoval 
> AuthorDate: Thu Sep 22 17:24:20 2016 -0700
> Committer:  David Sterba 
> CommitDate: Mon Oct 3 18:52:14 2016 +0200
>
> Btrfs: fix free space tree bitmaps on big-endian systems
>
> In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
> bitmaps in memory and copy them directly to/from the extent buffers with
> {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
> granularity, which is equivalent to a little-endian bitmap. This means
> that on big-endian systems, the in-memory bitmaps will be written to
> disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
> memory.

This change looks overly complex to me, and decreases performance.

>
> Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
> Cc: sta...@vger.kernel.org # 4.5+
> Tested-by: Holger Hoffstätte 
> Tested-by: Chandan Rajendra 
> Signed-off-by: Omar Sandoval 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/extent_io.c   | 64 
> +-
>  fs/btrfs/extent_io.h   | 22 
>  fs/btrfs/free-space-tree.c | 17 ++--
>  3 files changed, 76 insertions(+), 27 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 44fe66b..c3ec30d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5524,17 +5524,45 @@ void copy_extent_buffer(struct extent_buffer *dst, 
> struct extent_buffer *src,
> }
>  }
>
> -/*
> - * The extent buffer bitmap operations are done with byte granularity because
> - * bitmap items are not guaranteed to be aligned to a word and therefore a
> - * single word in a bitmap may straddle two pages in the extent buffer.
> - */
> -#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
> -#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
> -#define BITMAP_FIRST_BYTE_MASK(start) \
> -   ((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
> -#define BITMAP_LAST_BYTE_MASK(nbits) \
> -   (BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
> +void le_bitmap_set(u8 *map, unsigned int start, int len)
> +{
> +   u8 *p = map + BIT_BYTE(start);

You cannot use cpu_to_le32/cpu_to_le64 on the masks and operate on
unsigned longs in memory because there's no alignment guarantee, right?

> +   const unsigned int size = start + len;
> +   int bits_to_set = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> +   u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(start);
> +
> +   while (len - bits_to_set >= 0) {
> +   *p |= mask_to_set;
> +   len -= bits_to_set;
> +   bits_to_set = BITS_PER_BYTE;
> +   mask_to_set = ~(u8)0;
> +   p++;
> +   }

memset() for all but the first partial byte (if present)?

> +   if (len) {
> +   mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
> +   *p |= mask_to_set;
> +   }
> +}
> +
> +void le_bitmap_clear(u8 *map, unsigned int start, int len)
> +{
> +   u8 *p = map + BIT_BYTE(start);
> +   const unsigned int size = start + len;
> +   int bits_to_clear = BITS_PER_BYTE - (start % BITS_PER_BYTE);
> +   u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(start);
> +
> +   while (len - bits_to_clear >= 0) {
> +   *p &= ~mask_to_clear;
> +   len -= bits_to_clear;
> +   bits_to_clear = BITS_PER_BYTE;
> +   mask_to_clear = ~(u8)0;
> +   p++;
> +   }

memset() for all but the first partial byte (if present)?

> +   if (len) {
> +   mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
> +   *p &= ~mask_to_clear;
> +   }
> +}
>
>  /*
>   * eb_bitmap_offset() - calculate the page and offset of the byte containing 
> the
> @@ -5578,7 +5606,7 @@ static inline void eb_bitmap_offset(struct 
> extent_buffer *eb,
>  int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
>unsigned long nr)
>  {
> -   char *kaddr;
> +   u8 *kaddr;
> struct page *page;
> unsigned long i;
> size_t offset;
> @@ -5600,13 +5628,13 @@ int extent_buffer_test_bit(struct extent_buffer *eb, 
> unsigned long start,
>  void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
>   unsigned long pos, unsigned long len)
>  {
> -   char *kaddr;
> +   u8 *kaddr;
> struct page *page;
> 

[PATCH] btrfs-progs: Fix wrong tree block alignment for unalianged block group

2016-10-24 Thread Qu Wenruo
Commit 854437ca(btrfs-progs: extent-tree: avoid allocating tree block
that crosses stripe boundary) introduces check for logical bytenr not
crossing stripe boundary.

However that check is not completely correct.
It only checks the logical bytenr and length against the absolute logical
offset.
That is to say, it only checks whether a tree block lies in a 64K logical
stripe.

But in fact, it's possible a block group starts at bytenr unaligned with
64K, just like the following case.

Then btrfsck will give false alert.

0      32K     64K     96K     128K    160K ...
       |--------------- Block group A ---------------
       |<------ TB 32K ------>|
       |<-- scrub stripe unit -->|
|<----- WRONG UNIT ----->|

In that case, the TB (tree block) at bytenr 32K in fact fits into the
kernel scrub stripe unit, but doesn't fit into the pure logical 64K
stripe.

Fix check_crossing_stripes() to compare bytenr to block group start, not
to absolute logical bytenr.

Reported-by: Jussi Kansanen 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c  | 10 ++
 extent-tree.c | 15 ---
 volumes.h | 23 ---
 3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 670ccd1..907d60c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4662,8 +4662,8 @@ static int add_extent_rec_nolookup(struct cache_tree *extent_cache,
bytes_used += rec->nr;
 
if (tmpl->metadata)
-   rec->crossing_stripes = check_crossing_stripes(rec->start,
-   global_info->tree_root->nodesize);
+   rec->crossing_stripes = check_crossing_stripes(global_info,
+   rec->start, global_info->tree_root->nodesize);
check_extent_type(rec);
return ret;
 }
@@ -4764,7 +4764,8 @@ static int add_extent_rec(struct cache_tree *extent_cache,
 */
if (tmpl->metadata)
rec->crossing_stripes = check_crossing_stripes(
-   rec->start, global_info->tree_root->nodesize);
+   global_info, rec->start,
+   global_info->tree_root->nodesize);
check_extent_type(rec);
maybe_free_extent_rec(extent_cache, rec);
return ret;
@@ -9359,7 +9360,8 @@ static int check_extent_item(struct btrfs_fs_info *fs_info,
 
if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
metadata = 1;
-   if (metadata && check_crossing_stripes(key.objectid, eb->len)) {
+   if (metadata && check_crossing_stripes(global_info, key.objectid,
+  eb->len)) {
error("bad metadata [%llu, %llu) crossing stripe boundary",
  key.objectid, key.objectid + nodesize);
err |= CROSSING_STRIPE_BOUNDARY;
diff --git a/extent-tree.c b/extent-tree.c
index f6d0a7c..3b1577e 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2606,11 +2606,20 @@ check_failed:
}
 
if (!(data & BTRFS_BLOCK_GROUP_DATA)) {
-   if (check_crossing_stripes(ins->objectid, num_bytes)) {
-   search_start = round_down(ins->objectid + num_bytes,
- BTRFS_STRIPE_LEN);
+   if (check_crossing_stripes(info, ins->objectid, num_bytes)) {
+   struct btrfs_block_group_cache *bg_cache;
+   u64 bg_offset;
+
+		bg_cache = btrfs_lookup_block_group(info, ins->objectid);
+   if (!bg_cache)
+   goto no_bg_cache;
+   bg_offset = ins->objectid - bg_cache->key.objectid;
+
+   search_start = round_up(bg_offset + num_bytes,
+   BTRFS_STRIPE_LEN) + bg_offset;
goto new_group;
}
+no_bg_cache:
block_group = btrfs_lookup_block_group(info, ins->objectid);
if (block_group)
trans->block_group = block_group;
diff --git a/volumes.h b/volumes.h
index d7b7d3c..7cb38b0 100644
--- a/volumes.h
+++ b/volumes.h
@@ -155,11 +155,28 @@ struct map_lookup {
  * Check if the given range cross stripes.
  * To ensure kernel scrub won't causing bug on with METADATA in mixed
  * block group
+ *
+ * Return 1 if the range crosses STRIPE boundary
+ * Return 0 if the range doesn't cross a STRIPE boundary or it
+ * doesn't belong to any block group (no boundary to cross)
  */
-static inline int check_crossing_stripes(u64 start, u64 len)
+static inline int check_crossing_stripes(struct btrfs_fs_info *fs_info,
+					 u64 start, u64 len)
 {
-   return (start / BTRFS_STRIPE_LEN) !=
-  ((start + len - 1) / BTRFS_STRIPE_LEN);
+   struct