Re: [PATCHv2 00/24] tree-log inode vs btrfs_inode cleanups

2017-01-19 Thread Nikolay Borisov


On 19.01.2017 20:21, David Sterba wrote:
> On Wed, Jan 18, 2017 at 12:31:26AM +0200, Nikolay Borisov wrote:
>> So here is a new set of patches cleaning up tree-log functions 
>> w.r.t. inode vs btrfs_inode. There are still some which remain, 
>> but I didn't find compelling arguments for cleaning them up, so 
>> I've left them unchanged. This time there is some size shrinkage:
>>
>>    text    data     bss     dec     hex filename
>> 2530598  174661   28288 2733547  29b5eb fs/btrfs/btrfs.ko - upstream master
>>
>>    text    data     bss     dec     hex filename
>> 2530774  174661   28288 2733723  29b69b fs/btrfs/btrfs.ko - before tree-log cleanup
>>
>>    text    data     bss     dec     hex filename
>> 2530163  174661   28288 2733112  29b438 fs/btrfs/btrfs.ko - both series applied
>>
>> So the net result of the 2 series is a 435 byte reduction, and I assume 
>> there will be further size reductions once further cleanups are made.
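These figures follow the output format of binutils' size(1); assuming a built
kernel tree, a comparable measurement can be taken with:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2530598  174661   28288 2733547  29b5eb fs/btrfs/btrfs.ko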
>>
>> Changes since v1: 
>>  * Rebased all patches to latest master
>>
>> Nikolay Borisov (24):
>>   btrfs: Make btrfs_must_commit_transaction take btrfs_inode
>>   btrfs: Make btrfs_record_unlink_dir take btrfs_inode
>>   btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode
>>   btrfs: Make btrfs_inode_in_log take btrfs_inode
>>   btrfs: Make btrfs_log_new_name take btrfs_inode
>>   btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode
>>   btrfs: Make btrfs_del_inode_ref take btrfs_inode
>>   btrfs: Make logged_inode_size take btrfs_inode
>>   btrfs: Make btrfs_check_ref_name_override take btrfs_inode
>>   btrfs: Make copy_items take btrfs_inode
>>   btrfs: Make btrfs_log_all_xattrs take btrfs_inode
>>   btrfs: Make btrfs_log_trailing_hole take btrfs_inode
>>   btrfs: Make btrfs_get_logged_extents take btrfs_inode
>>   btrfs: Make btrfs_log_changed_extents take btrfs_inode
>>   btrfs: Make log_dir_items take btrfs_inode
>>   btrfs: Make log_directory_changes take btrfs_inode
>>   btrfs: Make log_new_dir_dentries take btrfs_inode
>>   btrfs: Make btrfs_unlink_inode take btrfs_inode
>>   btrfs: Make drop_one_dir_item take btrfs_inode
>>   btrfs: Make __add_inode_ref take btrfs_inode
>>   btrfs: Make log_inode_item take btrfs_inode
>>   btrfs: Make btrfs_log_inode take btrfs_inode
>>   btrfs: Make count_inode_extrefs take btrfs_inode
>>   btrfs: Make count_inode_refs take btrfs_inode
> 
> Added to 4.11 queue, thanks. There were several 80+ character lines added
> by the patches; I've updated the patches during review. Please be more
> careful with changes like this
> 
> From https://patchwork.kernel.org/patch/9522051/
> 
> -	if (S_ISDIR(inode->i_mode) ||
> +	if (S_ISDIR(inode->vfs_inode.i_mode) ||
> 	    (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> -		       &BTRFS_I(inode)->runtime_flags) &&
> -	     inode_only >= LOG_INODE_EXISTS))
> +		       &inode->runtime_flags) &&
> +	     inode_only == LOG_INODE_EXISTS))
> 
> Notice the change from >= to == .

Yeah, I did notice this after the rebase I did when you told me patch 6
wasn't applying cleanly; however, I thought I had fixed it during the
rebase. Apparently I do need to recheck things one more time for good
measure.

Anyway, thanks for pulling!



Re: btrfs check lowmem vs original

2017-01-19 Thread Qu Wenruo



At 01/20/2017 12:38 PM, Chris Murphy wrote:

All of my Btrfs file systems, including new ones, have errors
according to lowmem mode, and no errors reported at all for original
mode. So which is correct? If lowmem mode is correct, then there are
obviously kernel bugs that are causing problems right away, even on
minutes old file systems.

I can't tell from the output whether these are serious errors or minor
errors either, because the output from check is totally
incomprehensible. Further, the usage summary at the end (extent bytes,
fs bytes, etc.) differs, sometimes by an order of magnitude,
between lowmem and original.

Errors like this:

ERROR: block group[46200258560 1073741824] used 1073741824 but extent items used 1144422400
ERROR: block group[85928706048 1073741824] used 1073741824 but extent items used 0
ERROR: block group[178270502912 1073741824] used 1073741824 but extent items used 1083211776
ERROR: block group[360806612992 1073741824] used 1073479680 but extent items used 1074769920


Christoph Anton Mitterer also reported this bug.

And it turns out to be a false alert.

Patch under test, should be out soon.
(But the image is not easy to create, so no test case yet)


ERROR: extent[366498091008, 134217728] referencer count mismatch (root: 818, owner: 73782, offset: 134217728) wanted: 4, have: 26


This is a little different.
Not sure which fsck is wrong until btrfs-debug-tree dumps of the extent
and fs trees are provided.




One file system has over 100 lines (exactly the same thing, no difference)

ERROR: data extent[16913485824 7577600] backref lost


Same here; the btrfs-debug-tree extent and fs tree dumps are needed (see the sketch below).



Another file system, 15 minutes old with two mounts in its whole
lifetime, and only written with kernel 4.10-rc3 has over 30 lines of
varying numbers:

ERROR: root 257 EXTENT DATA[150134 11317248] prealloc shouln't have datasum

That file system should have no preallocated extents (It's a clean
installation of Fedora Rawhide, using only rsync)


btrfs-debug-tree will help to make sure what is wrong.
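A plausible way to collect the dumps being asked for here (the device path is
a placeholder; tree id 2 is the extent tree and tree id 5 the first fs tree):

  # dump the extent tree and the fs tree with btrfs-progs 4.9
  $ btrfs-debug-tree -t 2 /dev/sdX > extent-tree.txt
  $ btrfs-debug-tree -t 5 /dev/sdX > fs-tree.txt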



Again, zero errors with original mode; but all file systems of all
ages have errors with lowmem. This is kind of annoying to say the
least; it's like we're adding dice to the check and repair situation
on Btrfs, which is already incredibly more complicated and unhelpful
than any other file system.


That's why lowmem mode is still not the default option.

The problem with original mode is that if you're checking a TB-level fs 
with only 2 or 4G of RAM, it's quite possible you'll run out of memory 
and never be able to check the fs at all, which is more severe than annoying.
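For anyone comparing the two modes themselves, both are read-only checks on an
unmounted device; a typical side-by-side run looks like this (device path is a
placeholder):

  $ btrfs check /dev/sdX                  # original mode, keeps all state in RAM
  $ btrfs check --mode=lowmem /dev/sdX    # lowmem mode, trades memory for runtime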


Thanks,
Qu



btrfs-progs 4.9.
kernels vary from 4.4 to 4.10-rc4, but one fs is only a month old
having used 4.7 and 4.8 kernels; and another one just 4.10-rc3 as I
said.







Re: [PATCH v4 2/3] fstests: test btrfs incremental send after moving a directory

2017-01-19 Thread Eryu Guan
On Thu, Jan 12, 2017 at 03:13:37AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that an incremental send operation works after moving a directory
> into a new parent directory, deleting its previous parent directory and
> creating a new inode that has the same inode number as the old parent.
> 
> This issue is fixed by the following patch for the linux kernel:
> 
>   "Btrfs: incremental send, do not delay rename when parent inode is new"
> 
> Signed-off-by: Robbie Ko 
> Signed-off-by: Filipe Manana 

I tested this patchset with and without the proposed kernel patches; all
tests worked as expected. Just one tiny update on the commit below.

> ---

> +# Remount the filesystem so that the next created inodes will have the
> +# numbers 258 and 259. This is because when a filesystem is mounted, btrfs
> +# sets the subvolume's inode counter to a value corresponding to the highest
> +# inode number in the subvolume plus 1. This inode counter is used to assign
> +# a unique number to each new inode and it's incremented by 1 after every
> +# inode creation.
> +# Note: we unmount and then mount instead of doing a mount with "-o remount"
> +# because otherwise the inode counter remains at value 260.
> +_scratch_unmount
> +_scratch_mount

I replaced these with _scratch_cycle_mount.
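For reference, the change simply collapses the explicit unmount/mount pair
into the equivalent fstests helper:

  # before
  _scratch_unmount
  _scratch_mount
  # after
  _scratch_cycle_mount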

Thanks,
Eryu


btrfs check lowmem vs original

2017-01-19 Thread Chris Murphy
All of my Btrfs file systems, including new ones, have errors
according to lowmem mode, and no errors reported at all for original
mode. So which is correct? If lowmem mode is correct, then there are
obviously kernel bugs that are causing problems right away, even on
minutes old file systems.

I can't tell from the output whether these are serious errors or minor
errors either, because the output from check is totally
incomprehensible. Further, the usage summary at the end (extent bytes,
fs bytes, etc.) differs, sometimes by an order of magnitude,
between lowmem and original.

Errors like this:

ERROR: block group[46200258560 1073741824] used 1073741824 but extent items used 1144422400
ERROR: block group[85928706048 1073741824] used 1073741824 but extent items used 0
ERROR: block group[178270502912 1073741824] used 1073741824 but extent items used 1083211776
ERROR: block group[360806612992 1073741824] used 1073479680 but extent items used 1074769920
ERROR: extent[366498091008, 134217728] referencer count mismatch (root: 818, owner: 73782, offset: 134217728) wanted: 4, have: 26

One file system has over 100 lines (exactly the same thing, no difference)

ERROR: data extent[16913485824 7577600] backref lost

Another file system, 15 minutes old with two mounts in its whole
lifetime, and only written with kernel 4.10-rc3 has over 30 lines of
varying numbers:

ERROR: root 257 EXTENT DATA[150134 11317248] prealloc shouln't have datasum

That file system should have no preallocated extents (It's a clean
installation of Fedora Rawhide, using only rsync)

Again, zero errors with original mode; but all file systems of all
ages have errors with lowmem. This is kind of annoying to say the
least; it's like we're adding dice to the check and repair situation
on Btrfs, which is already incredibly more complicated and unhelpful
than any other file system.

btrfs-progs 4.9.
kernels vary from 4.4 to 4.10-rc4, but one fs is only a month old
having used 4.7 and 4.8 kernels; and another one just 4.10-rc3 as I
said.


-- 
Chris Murphy


[RFC] btrfs-progs: btrfs-corrupt-block refactor

2017-01-19 Thread Qu Wenruo

This is the RFC proposal for refactoring btrfs-corrupt-block.

Just as Lakshmipathi and the xfstests folks commented on the btrfs ML, we need 
btrfs-corrupt-block either to craft test cases, or for the later RAID56 
corruption test cases.


However, the current btrfs-corrupt-block has several 
problems, and those problems prevent us from using it:


1) Badly arranged options
   Quite a lot of the combinations make no sense,
   and quite a lot of the functions overlap with each other.

   This is quite frustrating when someone wants to use it.

2) No documentation

3) No longer installed/compiled by default
   This is OK if such a tool is only used to corrupt things.

   But if the long term objective is to provide a tool to modify the fs
   offline at will, that's not a good idea.


So here I want to refactor the current btrfs-corrupt-block to achieve 
the following objectives:


1) Parameters divided by logical layer
   The current idea is like:

   btrfs-corrupt-block <destination> <options> <device>
   <destination> is one of
 "item" "data" "leaf" "node" (with more to come)

   "item"/"leaf"/"node" common options are:
 "--nocow":
Don't do CoW, but overwrite (default)
 "--cow":
Do normal tree block CoW
 "--pattern DATA"
The pattern to fill, default to 0xcd


   "item" options are:
 "--root NUMBER|ROOTID":
Mandatory
 "--key (OBJID,TYPE,OFFSET)":
Mandatory
 "--field FIELD"
Optional, default to corrupt all fields of the item
 "--delete"
Optional, if specified, the item will be deleted.
Conflicts with "--field"


   "leaf" options are:
 "--bytenr BYTENR" or "--key" with "--root"
Mandatory to locate the leaf
 "--field FIELD"
Optional, default to corrupt all fields of the leaf header
With --key, can also be used to corrupt certain item field
 NOTE: No delete is provided. Please use "node" "--delete" to
   delete a leaf.
   Or use "--filed ALL" with "--pattern 0x00" to wipe the
   leaf header.

   "node" options are:
 "--bytenr BYTENR" or "--key" "--root" and "--level"
Mandatory to locate the node
 "--field FIELD"
Optional, default to corrupt all fields of the node header
With --key, can also be used to corrupt a certain nodeptr field
 "--delete"
Optional, only works with "--key" "--root" "--level".
Delete the nodeptr.

   "data" options are:
 "--bytenr BYTENR"
Mandatory, logical address
 "--length LENGTH"
Optional, default to sectorsize of the fs
 "--mirror"
Optional, default to all mirror

May add "raid56" destination for raid56 tests later.

2) Better documentation
   Both asciidoc documentation and better help strings

3) Better error strings
   At least at the same level as in other btrfs commands

Any advice and ideas are welcome.
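To make the proposal concrete, here are a couple of hypothetical invocations of
the syntax sketched above (nothing is implemented yet; all keys, bytenrs and
device names are placeholders):

  # corrupt one field of a specific item, CoW-ing the tree block
  $ btrfs-corrupt-block item --cow --root 5 --key "(257,INODE_ITEM,0)" \
        --field generation /dev/sdX
  # overwrite one sector of data at a logical address, first mirror only
  $ btrfs-corrupt-block data --bytenr 366498091008 --length 4096 --mirror 1 /dev/sdX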

Thanks,
Qu




Re: btrfs recovery

2017-01-19 Thread Qu Wenruo



At 01/19/2017 06:06 PM, Sebastian Gottschall wrote:

Hello

I have a question. After a power outage my system turned into an
unrecoverable state using btrfs (kernel 4.9).
Since I've been running --init-extent-tree for 3 days now, I'm asking how
long this process normally takes and why it outputs millions of lines like


--init-extent-tree will trash *ALL* of the current extent tree, and *REBUILD* 
it from the fs trees.


This can take a long time depending on the size of the fs, and on how many 
shared extents there are (snapshots and reflinks all count).


Such a huge operation should only be used if you're sure only the extent 
tree is corrupted, and all the other trees are OK.


Or you'll just totally screw your fs further, especially when interrupted.
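For context, the operation under discussion is the btrfs check option below;
this is a sketch of last-resort usage, not a recommendation (device path is a
placeholder):

  # rebuild the entire extent tree from the fs trees; only when you are sure
  # the damage is confined to the extent tree
  $ btrfs check --repair --init-extent-tree /dev/sdX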



Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not
found in extent tree
Incorrect local backref count on 1562890240 root 262 owner 483059214
offset 0 found 1 wanted 0 back 0x23b0211d0
backpointer mismatch on [1562890240 4096]


This is common: since --init-extent-tree trashes the whole extent tree, every 
tree block/data extent will trigger such output.



adding new data backref on 1562890240 root 262 owner 483059214 offset 0
found 1
Repaired extent references for 1562890240


But as you see, it repaired the extent tree by adding the 
EXTENT_ITEM/METADATA_ITEM entries back into the extent tree, so far it works.


If you see such output with all the same bytenr, then things have gone really 
wrong and it may be stuck in a dead loop.



Personally speaking, a normal problem like a failure to mount should not need 
--init-extent-tree.


Especially since extent tree corruption is normally not related to mount 
failure, but to a sudden remount to RO and a kernel warning.


Thanks,
Qu



please avoid typical answers like "potentially dangerous operation" since
all repair options are declared as potentially dangerous.


Sebastian






btrfs_alloc_tree_block: Faulting instruction address: 0xc02d4584

2017-01-19 Thread Christian Kujau
Hi,

after upgrading this powerpc32 box from 4.10-rc2 to -rc4, the message 
below occured a few hours after boot. Full dmesg and .config:

  http://nerdbynature.de/bits/4.10-rc4/

Any ideas?

Thanks,
Christian.


Faulting instruction address: 0xc02d4584
Oops: Kernel access of bad area, sig: 11 [#1]
PowerMac
Modules linked in: ecb xt_tcpudp iptable_filter ip_tables x_tables 
nfnetlink_log nfnetlink sha256_generic twofish_generic twofish_common 
usb_storage therm_adt746x loop i2c_powermac arc4 firewire_sbp2 b43 
rng_core ssb bcma mac80211 cfg80211 ecryptfs [last unloaded: nbd]
CPU: 0 PID: 1395 Comm: btrfs-transacti Tainted: GW   
4.10.0-rc4-1-gab8184b #1
task: ee7162e0 task.stack: ee9cc000
NIP: c02d4584 LR: c02d4574 CTR: c00d0df0
REGS: ee9cdaa0 TRAP: 0300   Tainted: GW
(4.10.0-rc4-1-gab8184b)
MSR: 9032 
  CR: 24422248  XER: 
DAR: 10581054 DSISR: 4200 
GPR00: c02d4574 ee9cdb50 ee71d4b8 10581050 0001 dbc88118  0020 
GPR08: 1205 0004   24422444   93c3 
GPR16: f000 0001    ee260800   
GPR24:  0001 ee9cdc1b eef5c1a0 1000 ee47 dbc88118 ee470170 
NIP [c02d4584] btrfs_alloc_tree_block+0x18c/0x5c4
LR [c02d4574] btrfs_alloc_tree_block+0x17c/0x5c4
Call Trace:
[ee9cdb50] [c02d4574] btrfs_alloc_tree_block+0x17c/0x5c4 (unreliable)
[ee9cdbf0] [c02b86d4] __btrfs_cow_block+0x110/0x638
[ee9cdc70] [c02b8d74] btrfs_cow_block+0xdc/0x1b0
[ee9cdca0] [c02bc48c] btrfs_search_slot+0x1c0/0x904
[ee9cdd10] [c02dc680] btrfs_lookup_inode+0x3c/0x124
[ee9cdd50] [c02ec204] btrfs_update_inode_item+0x4c/0x10c
[ee9cdd80] [c02d05e4] cache_save_setup+0xc0/0x400
[ee9cdde0] [c02d4d54] btrfs_start_dirty_block_groups+0x184/0x47c
[ee9cde50] [c02e7e84] btrfs_commit_transaction+0x148/0xac4
[ee9cdeb0] [c02e313c] transaction_kthread+0x1d0/0x1ec
[ee9cdf00] [c004f1fc] kthread+0xf8/0x124
[ee9cdf40] [c0011480] ret_from_kernel_thread+0x5c/0x64
--- interrupt: 0 at   (null)
LR =   (null)
Instruction dump:
4800b3ed 7f838040 7c7e1b78 419d0430 806300d4 81db 81fb0004 4bdfe2b9 
3924 7ee6bb78 38630050 7fc5f378 <7dc91d2c> 7de01d2c 809501cf 807501cb 
---[ end trace 937683537ecd986b ]---



-- 
BOFH excuse #342:

HTTPD Error 4004 : very old Intel cpu - insufficient processing power


Re: dup vs raid1 in single disk

2017-01-19 Thread Austin S. Hemmelgarn

On 2017-01-19 13:23, Roman Mamedov wrote:

On Thu, 19 Jan 2017 17:39:37 +0100
"Alejandro R. Mosteo"  wrote:


I was wondering, from a point of view of data safety, if there is any
difference between using dup or making a raid1 from two partitions in
the same disk. This is thinking on having some protection against the
typical aging HDD that starts to have bad sectors.


RAID1 will write slower compared to DUP, as any optimization to make RAID1
devices work in parallel will cause a total performance disaster for you as
you will start trying to write to both partitions at the same time, turning
all linear writes into random ones, which are about two orders of magnitude
slower than linear on spinning hard drives. DUP shouldn't have this issue, but
it will still be twice as slow as single, since you are writing everything
twice.
As of right now, there will actually be near zero impact on write 
performance (or at least, it's way less than the theoretical 50%) 
because there really isn't any optimization to speak of in the 
multi-device code.  That will hopefully change over time, but it's not 
likely to do so any time soon, since nobody appears to be 
working on multi-device write performance.


You could consider DUP data for when a disk is already known to be getting bad
sectors from time to time -- but then it's a fringe exercise to try to keep
using such a disk in the first place. Yeah, with DUP data + DUP metadata you
can likely get some more life out of such a disk as throwaway storage space
for non-essential data, at half capacity, but is it worth the effort, as it's
likely to start failing progressively worse over time.

In all other cases the performance and storage space penalty of DUP within a
single device are way too great (and gained redundancy is too low) compared
to a proper system of single profile data + backups, or a RAID5/6 system (not
Btrfs-based) + backups.
That really depends on your usage.  In my case, I run DUP data on single 
disks regularly.  I still do backups of course, but the performance is 
worth far less for me (especially in the cases where I'm using NVMe 
SSD's which have performance measured in thousands of MB/s for both 
reads and writes) than the ability to recover from transient data 
corruption without needing to go to a backup.


As long as /home and any other write heavy directories are on a separate 
partition, I would actually advocate using DUP data on your root 
filesystem if you can afford the space simply because it's a whole lot 
easier to recover other data if the root filesystem still works.  Most 
of the root filesystem except some stuff under /var follows a WORM 
access pattern, and even the stuff that doesn't in /var is usually not 
performance critical, so the write performance penalty won't have 
anywhere near as much impact on how well the system runs as you might think.


There's also the fact that you're writing more metadata than data most 
of the time unless you're dealing with really big files, and metadata is 
already DUP mode (unless you are using an SSD), so the performance hit 
isn't 50%, it's actually a bit more than half the ratio of data writes 
to metadata writes.
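A sketch of how one might opt into DUP data on an existing single-device
filesystem, assuming the kernel and progs in use allow the DUP profile for
data (mount point is a placeholder):

  # convert both data and metadata chunks to the DUP profile
  $ btrfs balance start -dconvert=dup -mconvert=dup /mnt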



On a related note, I see this caveat about dup in the manpage:

"For example, a SSD drive can remap the blocks internally to a single
copy thus deduplicating them. This negates the purpose of increased
redunancy (sic) and just wastes space"


That ability is vastly overestimated in the man page. There is no miracle
content-addressable storage system working at 500 MB/sec speeds all within a
little cheap controller on SSDs. Likely most of what it can do is just
compress simple stuff, such as runs of zeroes or other repeating byte
sequences.
Most of those that do in-line compression don't implement it in 
firmware; they implement it in hardware, and even DEFLATE can get 500 
MB/second speeds if properly implemented in hardware.  The firmware may 
control how the hardware works, but it's usually hardware doing heavy 
lifting in that case, and getting a good ASIC made that can hit the 
required performance point for a reasonable compression algorithm like 
LZ4 or Snappy is insanely cheap once you've gotten past the VLSI work.


And the DUP mode is still useful on SSDs, for cases when one copy of the DUP
gets corrupted in-flight due to a bad controller or RAM or cable, you could
then restore that block from its good-CRC DUP copy.
The only window of time during which bad RAM could result in only one 
copy of a block being bad is after the first copy is written but before 
the second is, which is usually an insanely small amount of time.  As 
far as the cabling, the window for errors resulting in a single bad copy 
of a block is pretty much the same as for RAM, and if they're 
persistently bad, you're more likely to lose data for other reasons.


That said, I do still feel that DUP mode has value on SSD's.  The 
primary arguments against it are:

1. It 

Re: Raid 1 recovery

2017-01-19 Thread Chris Murphy
On Thu, Jan 19, 2017 at 12:15 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Wed, 18 Jan 2017 14:30:28 -0700 as excerpted:
>
>> On Wed, Jan 18, 2017 at 2:07 PM, Jon  wrote:
>>> So, I had a raid 1 btrfs system setup on my laptop. Recently I upgraded
>>> the drives and wanted to get my data back. I figured I could just plug
>>> in one drive, but I found that the volume simply would not mount. I
>>> tried the other drive alone and got the same thing. Plugging in both at
>>> the same time and the volume mounted without issue.
>>
>> Requires mount option degraded.
>>
>> If this is a boot volume, this is difficult because the current udev
>> rule prevents a mount attempt so long as all devices for a Btrfs volume
>> aren't present.
>
> OK, so I've known about this from the list for some time, but what is the
> status with regard to udev/systemd (has a bug/issue been filed, results,
> link?), and what are the alternatives, both for upstream, and for a dev,
> either trying to be proactive, or currently facing a refusal to boot due
> to the issue?

If the udev rule isn't there, there's a chance of a mount failure with
any multiple-device setup if one member device is late to
the party. If the udev rule is removed and if rootflags=degraded, now
whenever there's a late device, there's always a degraded boot and the
drive late to the party is out of sync. And we have no fast resync
like mdadm with write intent bitmaps, so it requires a complete volume
scrub (initiated manually) to avoid corruption, as soon as the volume
is made whole.
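For concreteness, the manual sequence implied above would look roughly like
this (device and mount point names are placeholders):

  # one-off degraded mount while a member device is missing
  $ mount -o degraded /dev/sda2 /mnt
  # after the missing device returns, resync the whole volume manually
  $ btrfs scrub start -Bd /mnt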

Now maybe the udev rule could be made smarter, I don't really know. If
it's multiple device you'd want a rule that just waits for say, 30
seconds or a minute or something sane, whatever that'd be. That way
normal operation just delays things a bit to make sure all member
drives are available at the same time, so that the mount command
(without degraded option) works. And the only failure case is when
there is in fact a bad drive. Someone willing to take the risk could
use such a udev rule along with rootflags=degraded, but this is asking
for trouble.

What's really needed is a daemon or other service that manages the
pool status. And that includes dealing with degradedness and resyncs
automatically.



-- 
Chris Murphy


Re: [PATCHv2 00/24] tree-log inode vs btrfs_inode cleanups

2017-01-19 Thread David Sterba
On Wed, Jan 18, 2017 at 12:31:26AM +0200, Nikolay Borisov wrote:
> So here is a new set of patches cleaning up tree-log functions 
> w.r.t. inode vs btrfs_inode. There are still some which remain, 
> but I didn't find compelling arguments for cleaning them up, so 
> I've left them unchanged. This time there is some size shrinkage:
> 
>    text    data     bss     dec     hex filename
> 2530598  174661   28288 2733547  29b5eb fs/btrfs/btrfs.ko - upstream master
> 
>    text    data     bss     dec     hex filename
> 2530774  174661   28288 2733723  29b69b fs/btrfs/btrfs.ko - before tree-log cleanup
> 
>    text    data     bss     dec     hex filename
> 2530163  174661   28288 2733112  29b438 fs/btrfs/btrfs.ko - both series applied
> 
> So the net result of the 2 series is a 435 byte reduction, and I assume 
> there will be further size reductions once further cleanups are made.
> 
> Changes since v1: 
>  * Rebased all patches to latest master
> 
> Nikolay Borisov (24):
>   btrfs: Make btrfs_must_commit_transaction take btrfs_inode
>   btrfs: Make btrfs_record_unlink_dir take btrfs_inode
>   btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode
>   btrfs: Make btrfs_inode_in_log take btrfs_inode
>   btrfs: Make btrfs_log_new_name take btrfs_inode
>   btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode
>   btrfs: Make btrfs_del_inode_ref take btrfs_inode
>   btrfs: Make logged_inode_size take btrfs_inode
>   btrfs: Make btrfs_check_ref_name_override take btrfs_inode
>   btrfs: Make copy_items take btrfs_inode
>   btrfs: Make btrfs_log_all_xattrs take btrfs_inode
>   btrfs: Make btrfs_log_trailing_hole take btrfs_inode
>   btrfs: Make btrfs_get_logged_extents take btrfs_inode
>   btrfs: Make btrfs_log_changed_extents take btrfs_inode
>   btrfs: Make log_dir_items take btrfs_inode
>   btrfs: Make log_directory_changes take btrfs_inode
>   btrfs: Make log_new_dir_dentries take btrfs_inode
>   btrfs: Make btrfs_unlink_inode take btrfs_inode
>   btrfs: Make drop_one_dir_item take btrfs_inode
>   btrfs: Make __add_inode_ref take btrfs_inode
>   btrfs: Make log_inode_item take btrfs_inode
>   btrfs: Make btrfs_log_inode take btrfs_inode
>   btrfs: Make count_inode_extrefs take btrfs_inode
>   btrfs: Make count_inode_refs take btrfs_inode

Added to 4.11 queue, thanks. There were several 80+ character lines added
by the patches; I've updated the patches during review. Please be more
careful with changes like this

From https://patchwork.kernel.org/patch/9522051/

-	if (S_ISDIR(inode->i_mode) ||
+	if (S_ISDIR(inode->vfs_inode.i_mode) ||
 	    (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-		       &BTRFS_I(inode)->runtime_flags) &&
-	     inode_only >= LOG_INODE_EXISTS))
+		       &inode->runtime_flags) &&
+	     inode_only == LOG_INODE_EXISTS))

Notice the change from >= to == .


Re: dup vs raid1 in single disk

2017-01-19 Thread Roman Mamedov
On Thu, 19 Jan 2017 17:39:37 +0100
"Alejandro R. Mosteo"  wrote:

> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.

RAID1 will write slower compared to DUP, as any optimization to make RAID1
devices work in parallel will cause a total performance disaster for you as
you will start trying to write to both partitions at the same time, turning
all linear writes into random ones, which are about two orders of magnitude
slower than linear on spinning hard drives. DUP shouldn't have this issue, but
it will still be twice as slow as single, since you are writing everything
twice.

You could consider DUP data for when a disk is already known to be getting bad
sectors from time to time -- but then it's a fringe exercise to try to keep
using such a disk in the first place. Yeah, with DUP data + DUP metadata you
can likely get some more life out of such a disk as throwaway storage space
for non-essential data, at half capacity, but is it worth the effort, as it's
likely to start failing progressively worse over time.

In all other cases the performance and storage space penalty of DUP within a
single device are way too great (and gained redundancy is too low) compared
to a proper system of single profile data + backups, or a RAID5/6 system (not
Btrfs-based) + backups.

> On a related note, I see this caveat about dup in the manpage:
> 
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"

That ability is vastly overestimated in the man page. There is no miracle
content-addressable storage system working at 500 MB/sec speeds all within a
little cheap controller on SSDs. Likely most of what it can do is just
compress simple stuff, such as runs of zeroes or other repeating byte
sequences.

And the DUP mode is still useful on SSDs, for cases when one copy of the DUP
gets corrupted in-flight due to a bad controller or RAM or cable, you could
then restore that block from its good-CRC DUP copy.

-- 
With respect,
Roman


Re: read-only fs, kernel 4.9.0, fs/btrfs/delayed-inode.c:1170 __btrfs_run_delayed_items,

2017-01-19 Thread Imran Geriskovan
I don't know if it is btrfs related but I'm getting
hard freezes on 4.8.17.

So I went back to 4.8.14 (with identical .config file).
It is one of my kernels which is known to be trouble
free for a long time.

Since they are hard lock-ups for real, I can't provide
anything. Does anyone experience anything like that?



Re: [PATCH] btrfs-progs: lowmem-check: Fix wrong extent tree iteration

2017-01-19 Thread Christoph Anton Mitterer
Hey Qu.


On Wed, 2017-01-18 at 16:48 +0800, Qu Wenruo wrote:
> To Christoph,
> 
> Would you please try this patch, and to see if it suppress the block
> group
> warning?
I did another round of fsck in both modes (original/lomem), first
WITHOUT your patch, then WITH it... both on progs version 4.9... no
further RW mount between these 4 runs:


btrfs-progs v4.9 WITHOUT patch:
***
# btrfs check /dev/nbd0 ; echo $?
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
Checking filesystem on /dev/nbd0
UUID: 326d292d-f97b-43ca-b1e8-c722d3474719
found 7469206884352 bytes used err is 0
total csum bytes: 7281779252
total tree bytes: 10837262336
total fs tree bytes: 2011906048
total extent tree bytes: 1015349248
btree space waste bytes: 922444044
file data blocks allocated: 7458369622016
 referenced 7579485159424
0

# btrfs check --mode=lowmem /dev/nbd0 ; echo $?
checking extents
ERROR: block group[74117545984 1073741824] used 1073741824 but extent items used 0
ERROR: block group[239473786880 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[500393050112 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[581997428736 1073741824] used 1073741824 but extent items used 0
ERROR: block group[626557714432 1073741824] used 1073741824 but extent items used 0
ERROR: block group[668433645568 1073741824] used 1073741824 but extent items used 0
ERROR: block group[948680261632 1073741824] used 1073741824 but extent items used 0
ERROR: block group[982503129088 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1039411445760 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1054443831296 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[1190809042944 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1279392743424 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1481256206336 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1620842643456 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[1914511032320 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[3055361720320 1073741824] used 1073741824 but extent items used 0
ERROR: block group[3216422993920 1073741824] used 1073741824 but extent items used 0
ERROR: block group[3670615785472 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[3801612288000 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[3828455833600 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[4250973241344 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4261710659584 1073741824] used 1073741824 but extent items used 1074266112
ERROR: block group[4392707162112 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4558063403008 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4607455526912 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4635372814336 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4640204652544 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4642352136192 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[4681006841856 1073741824] used 1073741824 but extent items used 0
ERROR: block group[5063795802112 1073741824] used 1073741824 but extent items used 0
ERROR: block group[5171169984512 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[5216267141120 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[5290355326976 1073741824] used 1073741824 but extent items used 0
ERROR: block group[5445511020544 1073741824] used 1073741824 but extent items used 1074266112
ERROR: block group[6084387405824 1073741824] used 1073741824 but extent items used 0
ERROR: block group[6104788500480 1073741824] used 1073741824 but extent items used 0
ERROR: block group[6878956355584 1073741824] used 1073741824 but extent items used 0
ERROR: block group[6997067956224 1073741824] used 1073741824 but extent items used 0
ERROR: block group[7702516334592 1073741824] used 1073741824 but extent items used 0
ERROR: block group[8051482427392 1073741824] used 1073741824 but extent items used 1084751872
ERROR: block group[8116980678656 1073741824] used 1073217536 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Checking filesystem on /dev/nbd0
UUID: 326d292d-f97b-43ca-b1e8-c722d3474719
found 7469206884352 bytes used err is -5
total csum bytes: 7281779252
total tree bytes: 10837262336
total fs tree bytes: 2011906048
total extent tree bytes: 1015349248
btree space waste bytes: 922444044
file data 

Re: Fwd: dup vs raid1 in single disk

2017-01-19 Thread Austin S. Hemmelgarn

On 2017-01-19 11:39, Alejandro R. Mosteo wrote:

Hello list,

I was wondering, from a point of view of data safety, if there is any
difference between using dup or making a raid1 from two partitions in
the same disk. This is thinking on having some protection against the
typical aging HDD that starts to have bad sectors.

On a related note, I see this caveat about dup in the manpage:

"For example, a SSD drive can remap the blocks internally to a single
copy thus deduplicating them. This negates the purpose of increased
redunancy (sic) and just wastes space"

SSDs failure modes are different (more an all or nothing thing, I'm
told) so it wouldn't apply to the use case above, but I'm curious for
curiosity's sake if there would be any difference too.


On a traditional HDD, there actually is a reasonable safety benefit to 
using 2 partitions in raid1 mode over using dup mode.  This is because 
most traditional HDD firmware still keeps the mapping of physical 
sectors to logical sectors mostly linear, so having separate partitions 
will (usually) mean that the two copies are not located near each other 
on physical media.  A similar but weaker version of the same effect can 
be achieved by using the 'ssd_spread' mount option, but I would not 
suggest relying on that.  This doesn't apply to hybrid drives (because 
they move stuff around however they want like SSD's), or SMR drives 
(because they rewrite large portions of the disk when one place gets 
rewritten, so physical separation of the data copies doesn't get you as 
much protection).
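A minimal sketch of the two setups being compared, with placeholder partition
names (newer btrfs-progs also accept -d dup on a single device):

  # raid1 across two partitions of the same spinning disk
  $ mkfs.btrfs -d raid1 -m raid1 /dev/sda1 /dev/sda2
  # DUP profile on a single partition
  $ mkfs.btrfs -d dup -m dup /dev/sda1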


For most SSD's, there is no practical benefit because the FTL in the SSD 
firmware generally maps physical sectors to logical sectors in whatever 
arbitrary way it wants, which is usually not going to be linear.


As far as failure modes on an SSD, you usually see one of two things 
happen, either the whole disk starts acting odd (or stops working), or 
individual blocks a few MB in size (which seem to move around the disk 
as they get over-written) start behaving odd.  The first case is the 
firmware or primary electronics going bad, while the second is 
individual erase blocks going bad.  As a general rule, SSD's will run 
longer as they're going bad than HDD's will, but in both cases you 
should look at replacing the device once you start seeing the error 
counters going up consistently over time (or if you see them suddenly 
jump to a much higher number).



Fwd: dup vs raid1 in single disk

2017-01-19 Thread Alejandro R. Mosteo
Hello list,

I was wondering, from a point of view of data safety, if there is any
difference between using dup or making a raid1 from two partitions in
the same disk. This is thinking on having some protection against the
typical aging HDD that starts to have bad sectors.

On a related note, I see this caveat about dup in the manpage:

"For example, a SSD drive can remap the blocks internally to a single
copy thus deduplicating them. This negates the purpose of increased
redunancy (sic) and just wastes space"

SSDs failure modes are different (more an all or nothing thing, I'm
told) so it wouldn't apply to the use case above, but I'm curious for
curiosity's sake if there would be any difference too.

Thanks,
Alex.


Re: [PATCH v3 2/4] xfstests: btrfs/132: add test for invaild update time by an incremental send

2017-01-19 Thread Filipe Manana
On Thu, Jan 5, 2017 at 11:31 AM, Filipe Manana  wrote:
> On Thu, Jan 5, 2017 at 2:45 AM, robbieko  wrote:
>> Filipe Manana wrote on 2017-01-04 21:09:
>>
>>> On Wed, Jan 4, 2017 at 10:53 AM, robbieko  wrote:

 From: Robbie Ko 

 Test that an incremental send operation doesn't work because
 it tries to update the times of a deleted directory after it finishes
 a move operation.

 The other one is that an operation is applied to a file using the old
 name rather than the new name.

 This test exercises scenarios that used to fail in btrfs and are fixed by
 the following patches for the linux kernel:

 "Btrfs: incremental send, add generation check for the inode waiting for
 rmdir operation."
 "Btrfs: incremental send, add generation check in existence
 determination for the parent directory"

 Signed-off-by: Robbie Ko 
 ---
 V3: remove "run_" based helpers
 V2: improve the change log

  tests/btrfs/132     | 123 ++++++++++++++++++++++++++++++++++++++++++++++++
  tests/btrfs/132.out |   7 +++
  tests/btrfs/group   |   1 +
  3 files changed, 131 insertions(+)
  create mode 100755 tests/btrfs/132
  create mode 100644 tests/btrfs/132.out

 diff --git a/tests/btrfs/132 b/tests/btrfs/132
 new file mode 100755
 index 0000000..f1bb698
 --- /dev/null
 +++ b/tests/btrfs/132
 @@ -0,0 +1,123 @@
 +#! /bin/bash
 +# FS QA Test No. btrfs/132
 +#
 +# Test that an incremental send operation doesn't work because
 +# it tries to update the times of a deleted directory after it finishes
 +# a move operation.
 +#
 +# The other one is that an operation is applied to a file using the old
 +# name rather than the new name.
 +#
 +#-----------------------------------------------------------------------
 +# Copyright (C) 2016 Synology Inc. All Rights Reserved.
 +# Author: Robbie Ko 
 +#
 +# This program is free software; you can redistribute it and/or
 +# modify it under the terms of the GNU General Public License as
 +# published by the Free Software Foundation.
 +#
 +# This program is distributed in the hope that it would be useful,
 +# but WITHOUT ANY WARRANTY; without even the implied warranty of
 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +# GNU General Public License for more details.
 +#
 +# You should have received a copy of the GNU General Public License
 +# along with this program; if not, write the Free Software Foundation,
 +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 +#-----------------------------------------------------------------------
 +#
 +
 +seq=`basename $0`
 +seqres=$RESULT_DIR/$seq
 +echo "QA output created by $seq"
 +
 +tmp=/tmp/$$
 +status=1   # failure is the default!
 +trap "_cleanup; exit \$status" 0 1 2 3 15
 +
 +_cleanup()
 +{
 +   cd /
 +   rm -fr $send_files_dir
 +   rm -f $tmp.*
 +}
 +
 +# get standard environment, filters and checks
 +. ./common/rc
 +. ./common/filter
 +
 +# real QA test starts here
 +_supported_fs btrfs
 +_supported_os Linux
 +_require_test
 +_require_scratch
 +_require_fssum
 +
 +send_files_dir=$TEST_DIR/btrfs-test-$seq
 +
 +rm -f $seqres.full
 +rm -fr $send_files_dir
 +mkdir $send_files_dir
 +
 +_scratch_mkfs >>$seqres.full 2>&1
 +_scratch_mount
 +
 +mkdir $SCRATCH_MNT/dir257
 +mkdir $SCRATCH_MNT/dir258
 +mkdir $SCRATCH_MNT/dir259
 +mv $SCRATCH_MNT/dir257 $SCRATCH_MNT/dir258/dir257
 +
 +# Filesystem looks like:
 +#
 +# .                      (ino 256)
 +# |--- dir258/           (ino 258)
 +# |    |--- dir257/      (ino 257)
 +# |
 +# |--- dir259/           (ino 259)
 +#
 +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
 +$SCRATCH_MNT/mysnap1 > /dev/null
 +
 +mv $SCRATCH_MNT/dir258/dir257 $SCRATCH_MNT/dir257
 +rmdir $SCRATCH_MNT/dir258
 +rmdir $SCRATCH_MNT/dir259
 +_scratch_unmount
 +_scratch_mount
>>>
>>>
>>> Why do need to remount the fs? There's also a _scratch_remount function.
>>>
>>> Again, I would like to see comments in the test explaining why and
>>> how/why did the incremental send operation used to fail (like most of
>>> the existing send/receive tests have).
>>>
>>> thanks
>>
>>
>> Because we need to recreate using the same inode number 258, we need to
>> delete the old dir258 and dir259, then remount the fs; the next new file
>> will be

[PATCH 2/3] Btrfs: incremental send, do not delay rename when parent inode is new

2017-01-19 Thread fdmanana
From: Filipe Manana 

When we are checking if we need to delay the rename operation for an
inode, we are not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not; that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.

Here follows an example where this happens.

Parent snapshot:

 .  (ino 256, gen 3)
 |--- dir258/   (ino 258, gen 7)
 |   |--- dir257/   (ino 257, gen 7)
 |
 |--- dir259/   (ino 259, gen 7)

Send snapshot:

 .  (ino 256, gen 3)
 |--- file258   (ino 258, gen 10)
 |
 |--- new_dir259/   (ino 259, gen 10)
  |--- dir257/  (ino 257, gen 7)

The following steps happen when computing the incremental send stream:

1) When processing inode 257, its new parent is created using its orphan
   name (o259-10-0), and the rename operation for inode 257 is delayed
   because its new parent (inode 259) was not yet processed - this
   decision to delay the rename operation does not make much sense
   because the inode 259 in the send snapshot is a new inode, it's not
   the same as inode 259 in the parent snapshot.

2) When processing inode 258 we end up delaying its rmdir operation,
   because inode 257 was not yet renamed (moved away from the directory
   inode 258 represents). We also create the new inode 258 using its
   orphan name "o258-10-0", then rename it to its final name of "file258"
   and then issue a truncate operation for it. However this truncate
   operation contains an incorrect name, which corresponds to the orphan
   name and not to the final name, which makes the receiver fail. This
   happens because when we attempt to compute the inode's current name
   we verify that there's another inode with the same number (258) that
   has its rmdir operation pending and because of that we generate an
   orphan name for the new inode 258 (we do this in the function
   get_cur_path()).

Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.

The following steps reproduce this example scenario.

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ mkdir /mnt/dir257
 $ mkdir /mnt/dir258
 $ mkdir /mnt/dir259
 $ mv /mnt/dir257 /mnt/dir258/dir257
 $ btrfs subvolume snapshot -r /mnt /mnt/snap1

 $ mv /mnt/dir258/dir257 /mnt/dir257
 $ rmdir /mnt/dir258
 $ rmdir /mnt/dir259

 # Remount the filesystem so that the next created inodes will have the
 # numbers 258 and 259. This is because when a filesystem is mounted,
 # btrfs sets the subvolume's inode counter to a value corresponding to
 # the highest inode number in the subvolume plus 1. This inode counter
 # is used to assign a unique number to each new inode and it's
 # incremented by 1 after every inode creation.
 # Note: we unmount and then mount instead of doing a mount with
 # "-o remount" because otherwise the inode counter remains at value 260.
 $ umount /mnt
 $ mount /dev/sdb /mnt
 $ touch /mnt/file258
 $ mkdir /mnt/new_dir259
 $ mv /mnt/dir257 /mnt/new_dir259/dir257
 $ btrfs subvolume snapshot -r /mnt /mnt/snap2

 $ btrfs send /mnt/snap1 -f /tmp/1.snap
 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap

 $ umount /mnt
 $ mkfs.btrfs -f /dev/sdc
 $ mount /dev/sdc /mnt
 $ btrfs receive /mnt -f /tmp/1.snap
 $ btrfs receive /mnt -f /tmp/2.snap -vv
 receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, 
ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
 utimes
 mkdir o259-10-0
 rename dir258 -> o258-7-0
 utimes
 mkfile o258-10-0
 rename o258-10-0 -> file258
 utimes
 truncate o258-10-0 size=0
 ERROR: truncate o258-10-0 failed: No such file or directory

Reported-by: Robbie Ko 
Signed-off-by: Filipe Manana 
---
 fs/btrfs/send.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2ae32c4..2742324 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3559,6 +3559,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
 {
int ret = 0;
u64 ino = parent_ref->dir;
+   u64 ino_gen = parent_ref->dir_gen;
u64 parent_ino_before, parent_ino_after;
struct fs_path *path_before = NULL;
struct fs_path *path_after = NULL;
@@ -3579,6 +3580,8 @@ static int wait_for_parent_move(struct send_ctx *sctx,
 * at 

[PATCH 3/3] Btrfs: incremental send, do not issue invalid rmdir operations

2017-01-19 Thread fdmanana
From: Robbie Ko 

When both the parent and send snapshots have a directory inode with the
same number but different generations (therefore they are different
inodes) and both have an entry with the same name, an incremental send
stream will contain an invalid rmdir operation that refers to the
orphanized name of the inode from the parent snapshot.

The following example scenario shows how this happens.

Parent snapshot:

 .
 | d259_old/   (ino 259, gen 9)
 | | d1/   (ino 258, gen 9)
 |
 | f   (ino 257, gen 9)

Send snapshot:

 .
 | d258/   (ino 258, gen 7)
 | d259/   (ino 259, gen 7)
 | d1/ (ino 257, gen 7)

When the kernel is processing inode 258 it notices that in both snapshots
there is an inode numbered 259 that is a parent of an inode 258. However
it ignores the fact that the inodes numbered 259 have different generations
in both snapshots, which means they are effectively different inodes.
Then it checks that both inodes 259 have a dentry named "d1" and because
of that it issues a rmdir operation with the orphanized name of the inode 258
from the parent snapshot. This happens at send.c:process_recorded_refs(),
which calls send.c:did_overwrite_first_ref() that returns true and because
of that later on at process_recorded_refs() such rmdir operation is issued
because the inode being currently processed (258) is a directory and it
was deleted in the send snapshot (and replaced with another inode that has
the same number and is a directory too).
Fix this issue by comparing the generations of parent directory inodes
that have the same number and making send.c:did_overwrite_first_ref()
return false when the generations are different.

The following steps reproduce the problem.

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ touch /mnt/f
 $ mkdir /mnt/d1
 $ mkdir /mnt/d259_old
 $ mv /mnt/d1 /mnt/d259_old/d1
 $ btrfs subvolume snapshot -r /mnt /mnt/snap1
 $ btrfs send /mnt/snap1 -f /tmp/1.snap
 $ umount /mnt

 $ mkfs.btrfs -f /dev/sdc
 $ mount /dev/sdc /mnt
 $ mkdir /mnt/d1
 $ mkdir /mnt/dir258
 $ mkdir /mnt/dir259
 $ mv /mnt/d1 /mnt/dir259/d1
 $ btrfs subvolume snapshot -r /mnt /mnt/snap2
 $ btrfs receive /mnt/ -f /tmp/1.snap
 # Take note that once the filesystem is created, its current
 # generation has value 7 so the inodes from the second snapshot all have
 # a generation value of 7. And after receiving the first snapshot
 # the filesystem is at a generation value of 10, because the call to
 # create the second snapshot bumps the generation to 8 (the snapshot
 # creation ioctl does a transaction commit), the receive command calls
 # the snapshot creation ioctl to create the first snapshot, which bumps
 # the filesystem's generation to 9, and finally when the receive
 # operation finishes it calls an ioctl to transition the first snapshot
 # (snap1) from RW mode to RO mode, which does another transaction commit
 # and bumps the filesystem's generation to 10. This means all the inodes
 # in the first snapshot (snap1) have a generation value of 9.
 $ rm -f /tmp/1.snap
 $ btrfs send /mnt/snap1 -f /tmp/1.snap
 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
 $ umount /mnt

 $ mkfs.btrfs -f /dev/sdd
 $ mount /dev/sdd /mnt
 $ btrfs receive /mnt -f /tmp/1.snap
 $ btrfs receive -vv /mnt -f /tmp/2.snap
 receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, 
ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
 utimes
 unlink f
 mkdir o257-7-0
 mkdir o259-7-0
 rename o257-7-0 -> o259-7-0/d1
 chown o259-7-0/d1 - uid=0, gid=0
 chmod o259-7-0/d1 - mode=0755
 utimes o259-7-0/d1
 rmdir o258-9-0
 ERROR: rmdir o258-9-0 failed: No such file or directory

Signed-off-by: Robbie Ko 
Reviewed-by: Filipe Manana 
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana 
---
 fs/btrfs/send.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2742324..712922e 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1937,6 +1937,19 @@ static int did_overwrite_ref(struct send_ctx *sctx,
if (ret <= 0)
goto out;
 
+	if (dir != BTRFS_FIRST_FREE_OBJECTID) {
+		ret = get_inode_info(sctx->send_root, dir, NULL, &gen, NULL,
+				     NULL, NULL, NULL);
+		if (ret < 0 && ret != -ENOENT)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+		if (gen != dir_gen)
+			goto out;
+	}
+
 	/* check if the ref was overwritten by another ref */
 	ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len,
 				    &ow_inode, &other_type);
-- 
2.7.0.rc3


Re: [PATCH v3 2/6] Btrfs: incremental send, fix invalid path for truncate operations

2017-01-19 Thread Filipe Manana
On Thu, Jan 5, 2017 at 8:24 AM, robbieko  wrote:
> From: Robbie Ko 
>
> Under certain situations, an incremental send operation can

Again, missing some word after the word "can", without it the phrase
doesn't make any sense.
Tip: don't copy paste change logs and then modify them for each patch,
it's a lot easier to make errors like that.

> a truncate operation that will make the receiving end fail when
> attempting to execute it, because the path is not exist.

The problem is generating wrong paths, and that can be for any
operation, not just truncates.
The subject should reflect that. As it is, it is misleading.

>
> Example scenario:
> Parent snapshot:
> | dir258/ (ino 258, gen 15, dir)
> |--- dir257/ (ino 257, gen 15, dir)
> | dir259/ (ino 259, gen 15, dir)
>
> Send snapshot:
> | file258 (ino 258, gen 21, file)
> | new_dir259/ (ino 259, gen 21, dir)
> |--- dir257/ (ino 257, gen 15, dir)
>
> utimes
> mkdir o259-21-0
> rename dir258 -> o258-15-0
> utimes
> mkfile o258-21-0
> rename o258-21-0 -> file258
> utimes
> truncate o258-21-0 size=0
> ERROR: truncate o258-21-0 failed: No such file or directory
>
> While computing the send stream the following steps happen:
>
> 1) While processing inode 257, we create o259-21-0 and delay
>dir257 rename operation, because its new parent in the send
>snapshot, inode 259, was not yet processed and therefore not
>yet renamed.

The problem is exactly here. There's no need to delay the rename of
257, because inode 259 is a new inode.
The whole logic for delaying renames is meant to be applied when
there's really a need to do so, when the child and parent inodes exist
in both snapshots.
So the problem is fully fixed by not delaying the rename for 257,
doing this also avoids the extra complexity and the need for all the
patches in your patchset except for patches 1/6 and 4/6.

I've added such a fix into my 4.11 integration branch and just sent it
to the list too.
It's here:

https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/commit/?h=integration-4.11&id=b271d6c72e1dc34e93ff6035096e29d45ffa933e

thanks



>
> 2) Later when processing inode 258, we delay rmdir operation for dir258
>because it's not empty, and then orphanize it.
>
> 3) After we create o258-21-0 and rename it to file258, we get ENOENT on
>truncate file258. The reason is that the dir258 with inode 258 is
>waiting for rmdir operation and file258's inode is also 258, then
>get_current_path called before truncate will return a unique name.
>
> Fix this by adding generation check for the inode waiting for rmdir
> operation.
>
> Signed-off-by: Robbie Ko 
> ---
> V3: improve the change log
>  fs/btrfs/send.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 2060e75..22eca86 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -311,7 +311,7 @@ static int is_waiting_for_move(struct send_ctx *sctx, u64 ino);
>  static struct waiting_dir_move *
>  get_waiting_dir_move(struct send_ctx *sctx, u64 ino);
>
> -static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino);
> +static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 dir_gen);
>
>  static int need_send_hole(struct send_ctx *sctx)
>  {
> @@ -2292,7 +2292,7 @@ static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen,
>
> fs_path_reset(name);
>
> -   if (is_waiting_for_rm(sctx, ino)) {
> +   if (is_waiting_for_rm(sctx, ino, gen)) {
> ret = gen_unique_name(sctx, ino, gen, name);
> if (ret < 0)
> goto out;
> @@ -2899,11 +2899,11 @@ get_orphan_dir_info(struct send_ctx *sctx, u64 dir_ino)
> return NULL;
>  }
>
> -static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino)
> +static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 dir_gen)
>  {
> struct orphan_dir_info *odi = get_orphan_dir_info(sctx, dir_ino);
>
> -   return odi != NULL;
> +   return (odi != NULL && odi->gen == dir_gen);
>  }
>
>  static void free_orphan_dir_info(struct send_ctx *sctx,
> @@ -3166,7 +3166,7 @@ static int path_loop(struct send_ctx *sctx, struct fs_path *name,
> while (ino != BTRFS_FIRST_FREE_OBJECTID) {
> fs_path_reset(name);
>
> -   if (is_waiting_for_rm(sctx, ino))
> +   if (is_waiting_for_rm(sctx, ino, gen))
> break;
> if (is_waiting_for_move(sctx, ino)) {
> if (*ancestor_ino == 0)
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people 

Re: [PATCH v3 1/6] Btrfs: incremental send, fix failure to rename with the name collision

2017-01-19 Thread Filipe Manana
On Thu, Jan 5, 2017 at 8:24 AM, robbieko  wrote:
> From: Robbie Ko 
>
> Under certain situations, an incremental send operation can

Missing some word after the word "can".

> a rename operation that will make the receiving end fail when
> attempting to execute it, because the target exists.
>
> Example scenario:
> Parent snapshot:

Two consecutive sentences ending with a colon is odd.

> |.(ino 256, gen 5)

. is not a child of anything, shouldn't have the | preceding it.

> | a1/ (ino 257, gen 5)
> | a2/ (ino 258, gen 5)
>
> Send snapshot:
> |.(ino 256, gen 7)

Same here.

> | a2  (ino 257, gen 7)
>
> rmdir a1
> mkfile o257-7-0
> rename o257-7-0 -> a2
> ERROR: rename o257-7-0 -> a2 failed: Is a directory

Alright what is this? So you just paste here the output of "btrfs
receive -vv" without mentioning where it comes from.
Let the reader guess and make some sense of it.

>
> While computing the send stream the following steps happen:
>
> 1) delete a1;
>
> 2) mkfile o257-7-0;
>
> 3) rename o257-7-0->a2;
>
> In step 3 we will check whether it will lead to overwrite.
>
> The generation number of inode 257's parent (ino 256) in the send snapshot
> is 7, which is inconsistent with the one in the parent snapshot (gen 5).
> For a general parent directory other than inode 256, if its generation
> is not identical between the parent and send snapshots, it will be removed
> and then re-created. Thus no overwrite can happen under the new directory.
> However, for the special inode 256 the above logic does not work, since it
> is a subvolume. So an overwrite check is required for inode 256.
> Fix by skipping the generation inconsistency check for inode 256.

I've heavily reworded the explanation and made other changes to the
changelog myself.
You should have given an example on how to create the snapshots so
that it's clear how to get different inodes with the same number but
different generations.

I've pushed it into an integration branch in my git repository with the
goal of including it for 4.11:

https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=integration-4.11

The subject was also changed to "Btrfs: send, fix failure to rename
top level inode due to name collision".

So for this patch, you don't need to do anything else.
Thanks.

>
> Signed-off-by: Robbie Ko 
> ---
> V3: improve the change log
>  fs/btrfs/send.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index a87675f..2060e75 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -1681,6 +1681,9 @@ static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen)
>  {
> int ret;
>
> +   if (ino == BTRFS_FIRST_FREE_OBJECTID)
> +   return 1;
> +
> ret = get_cur_inode_state(sctx, ino, gen);
> if (ret < 0)
> goto out;
> @@ -1866,7 +1869,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
>  * not deleted and then re-created, if it was then we have no overwrite
>  * and we can just unlink this entry.
>  */
> -   if (sctx->parent_root) {
> +   if (sctx->parent_root && dir != BTRFS_FIRST_FREE_OBJECTID) {
> ret = get_inode_info(sctx->parent_root, dir, NULL, &gen, NULL,
>  NULL, NULL, NULL);
> if (ret < 0 && ret != -ENOENT)
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/4] xfstests: btrfs/134: add test for incremental send which renames a directory already being deleted

2017-01-19 Thread Filipe Manana
On Wed, Nov 2, 2016 at 3:52 AM, robbieko  wrote:
> Hi Eryu Guan,
>
> Yes, it need apply
> [PATCH] "Btrfs: incremental send, do not skip generation inconsistency check
> for inode 256."
> and test again, it will failed.
>
> because current code there is a problem, but just will not happen.

Then it's not a problem...
What you're saying is confusing to say the least.

I really don't like the idea of adding a patch that is known to
introduce a regression and then adding another patch that fixes it.
Anyway, your patchset has been reviewed and those patches are not
needed anymore as there's a better solution that doesn't imply
introducing a regression temporarily, so this test is not really
needed.

thanks

>
> Thanks
> Robbie Ko
>
> Eryu Guan wrote on 2016-11-01 15:20:
>
>> On Fri, Oct 28, 2016 at 09:44:06AM +0800, robbieko wrote:
>>>
>>> From: Robbie Ko 
>>>
>>> Test that an incremental send operation doesn't work because
>>> it tries to rename a directory which is already deleted.
>>>
>>> This test exercises scenarios used to fail in btrfs and are fixed by
>>> the following patch for the linux kernel:
>>>
>>> "Btrfs: incremental send, add generation check for inode is waiting for
>>> move."
>>
>>
>> I was testing with v4.9-rc1+ kernel and btrfs-progs v4.6. Seems above
>> patch is not merged in 4.9-rc1 kernel, but test passed for me, is that
>> expected?
>>
>> Thanks,
>> Eryu
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 1/3] fstests: test btrfs incremental send after replacing a top level inode

2017-01-19 Thread fdmanana
From: Filipe Manana 

Test that an incremental send operation does not fail when a new inode
replaces an old inode that has the same number but different generation,
and both are direct children of the subvolume/snapshot root.

This is fixed by the following patch for the linux kernel:

  "Btrfs: send, fix failure to rename top level inode due to name
   collision"

Signed-off-by: Robbie Ko 
Signed-off-by: Filipe Manana 
---

v4: Improved changelog and added comments to the test that explain what is
being tested and the problem.

 tests/btrfs/133 | 127 
 tests/btrfs/133.out |   8 
 tests/btrfs/group   |   1 +
 3 files changed, 136 insertions(+)
 create mode 100755 tests/btrfs/133
 create mode 100644 tests/btrfs/133.out

diff --git a/tests/btrfs/133 b/tests/btrfs/133
new file mode 100755
index 0000000..5839f17
--- /dev/null
+++ b/tests/btrfs/133
@@ -0,0 +1,127 @@
+#! /bin/bash
+# FS QA Test No. btrfs/133
+#
+# Test that an incremental send operation does not fail when a new inode
+# replaces an old inode that has the same number but different generation,
+# and both are direct children of the subvolume/snapshot root.
+#
+#---
+# Copyright (C) 2017 Synology Inc. All Rights Reserved.
+# Author: Robbie Ko 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -fr $send_files_dir
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_fssum
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+mkdir $SCRATCH_MNT/a1
+mkdir $SCRATCH_MNT/a2
+
+# Filesystem looks like:
+#
+# . (ino 256)
+# |--- a1/  (ino 257)
+# |--- a2/  (ino 258)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+$SCRATCH_MNT/mysnap1 > /dev/null
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \
+$send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch
+
+_scratch_unmount
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+touch $SCRATCH_MNT/a2
+
+# Filesystem now looks like:
+#
+# . (ino 256)
+# |--- a2   (ino 257)
+#
+# Notice that at this point inode 257 has a generation with value 7, which is
+# the generation value for a brand new filesystem.
+
+# Now create the second snapshot. This causes the filesystem's current
+# generation value to increase to 8, due to a transaction commit performed
+# by the snapshot creation ioctl.
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+$SCRATCH_MNT/mysnap2 > /dev/null
+
+# Now receive the first snapshot created in the first filesystem.
+# Before creating any inodes, the receive command creates the first snapshot,
+# which causes a transaction commit and therefore bumps the filesystem's current
+# generation to the value 9. All the inodes created end up getting a generation
+# value of 9 and the snapshot's root inode (256) gets a generation value of 8.
+$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $send_files_dir/1.snap > /dev/null
+rm $send_files_dir/1.snap
+
+$FSSUM_PROG -A -f -w $send_files_dir/1.fssum $SCRATCH_MNT/mysnap1
+$FSSUM_PROG -A -f -w $send_files_dir/2.fssum $SCRATCH_MNT/mysnap2
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \
+$send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch
+$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
+   -f $send_files_dir/2.snap 2>&1 1>/dev/null | _filter_scratch
+
+# Now recreate the filesystem by receiving both send streams and verify we get
+# the same content 

[PATCH v4 3/3] fstests: test btrfs incremental send after replacing directory

2017-01-19 Thread fdmanana
From: Filipe Manana 

Test that an incremental send operation works when in both snapshots
there are two directory inodes that have the same number but different
generations and have an entry with the same name that corresponds to
different inodes in each snapshot.

The btrfs issue is fixed by the following patch for the linux kernel:

  "Btrfs: incremental send, do not issue invalid rmdir operations"

Signed-off-by: Robbie Ko 
Signed-off-by: Filipe Manana 
---

v4: Improved changelog and added comments to the test that explain what is
being tested and the problem.

 tests/btrfs/135 | 159 
 tests/btrfs/135.out |   8 +++
 tests/btrfs/group   |   1 +
 3 files changed, 168 insertions(+)
 create mode 100755 tests/btrfs/135
 create mode 100644 tests/btrfs/135.out

diff --git a/tests/btrfs/135 b/tests/btrfs/135
new file mode 100755
index 0000000..bdc70b3
--- /dev/null
+++ b/tests/btrfs/135
@@ -0,0 +1,159 @@
+#! /bin/bash
+# FS QA Test No. btrfs/135
+#
+# Test that an incremental send operation works when in both snapshots there are
+# two directory inodes that have the same number but different generations and
+# have an entry with the same name that corresponds to different inodes in each
+# snapshot.
+#
+#---
+# Copyright (C) 2017 Synology Inc. All Rights Reserved.
+# Author: Robbie Ko 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -fr $send_files_dir
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_fssum
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+touch $SCRATCH_MNT/f
+mkdir $SCRATCH_MNT/d1
+mkdir $SCRATCH_MNT/d259_old
+mv $SCRATCH_MNT/d1 $SCRATCH_MNT/d259_old/d1
+
+# Filesystem looks like:
+#
+# . (ino 256)
+# |--- f(ino 257)
+# |
+# |--- d259_old/(ino 259)
+# |--- d1/  (ino 258)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/mysnap1 > /dev/null
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \
+   $send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch
+
+_scratch_unmount
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+mkdir $SCRATCH_MNT/d1
+mkdir $SCRATCH_MNT/dir258
+mkdir $SCRATCH_MNT/dir259
+mv $SCRATCH_MNT/d1 $SCRATCH_MNT/dir259/d1
+
+# Filesystem now looks like:
+#
+# . (ino 256)
+# |--- dir258/  (ino 258)
+# |
+# |--- dir259/  (ino 259)
+#|--- d1/   (ino 257)
+#
+# Notice that at this point all inodes have a generation with value 7, which is
+# the generation value for a brand new filesystem.
+
+# Now create the second snapshot. This causes the filesystem's current
+# generation value to increase to 8, due to a transaction commit performed
+# by the snapshot creation ioctl.
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/mysnap2 > /dev/null
+
+# Now receive the first snapshot created in the first filesystem.
+# Before creating any inodes, the receive command creates the first snapshot,
+# which causes a transaction commit and therefore bumps the filesystem's current
+# generation to the value 9. All the inodes created end up getting a generation
+# value of 9 and the snapshot's root inode (256) gets a generation value of 8.
+$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $send_files_dir/1.snap > /dev/null
+rm $send_files_dir/1.snap

[PATCH v4 2/3] fstests: test btrfs incremental send after moving a directory

2017-01-19 Thread fdmanana
From: Filipe Manana 

Test that an incremental send operation works after moving a directory
into a new parent directory, deleting its previous parent directory and
creating a new inode that has the same inode number as the old parent.

This issue is fixed by the following patch for the linux kernel:

  "Btrfs: incremental send, do not delay rename when parent inode is new"

Signed-off-by: Robbie Ko 
Signed-off-by: Filipe Manana 
---

v4: Improved changelog and added comments to the test that explain what is
being tested and the problem.

 tests/btrfs/134 | 126 
 tests/btrfs/134.out |   6 +++
 tests/btrfs/group   |   1 +
 3 files changed, 133 insertions(+)
 create mode 100755 tests/btrfs/134
 create mode 100644 tests/btrfs/134.out

diff --git a/tests/btrfs/134 b/tests/btrfs/134
new file mode 100755
index 0000000..e4b4c6d
--- /dev/null
+++ b/tests/btrfs/134
@@ -0,0 +1,126 @@
+#! /bin/bash
+# FS QA Test No. btrfs/134
+#
+# Test that an incremental send operation works after moving a directory into
+# a new parent directory, deleting its previous parent directory and creating
+# a new inode that has the same inode number as the old parent.
+#
+#---
+# Copyright (C) 2017 Synology Inc. All Rights Reserved.
+# Author: Robbie Ko 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -fr $send_files_dir
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_fssum
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+mkdir $SCRATCH_MNT/dir257
+mkdir $SCRATCH_MNT/dir258
+mkdir $SCRATCH_MNT/dir259
+mv $SCRATCH_MNT/dir257 $SCRATCH_MNT/dir258/dir257
+
+# Filesystem looks like:
+#
+# . (ino 256, gen 3)
+# |--- dir258/  (ino 258, gen 7)
+# |   |--- dir257/  (ino 257, gen 7)
+# |
+# |--- dir259/  (ino 259, gen 7)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+$SCRATCH_MNT/mysnap1 > /dev/null
+
+mv $SCRATCH_MNT/dir258/dir257 $SCRATCH_MNT/dir257
+rmdir $SCRATCH_MNT/dir258
+rmdir $SCRATCH_MNT/dir259
+# Remount the filesystem so that the next created inodes will have the numbers
+# 258 and 259. This is because when a filesystem is mounted, btrfs sets the
+# subvolume's inode counter to a value corresponding to the highest inode
+# number in the subvolume plus 1. This inode counter is used to assign a
+# unique number to each new inode and it's incremented by 1 after every
+# inode creation.
+# Note: we unmount and then mount instead of doing a mount with "-o remount"
+# because otherwise the inode counter remains at value 260.
+_scratch_unmount
+_scratch_mount
+touch $SCRATCH_MNT/file258
+mkdir $SCRATCH_MNT/new_dir259
+mv $SCRATCH_MNT/dir257 $SCRATCH_MNT/new_dir259/dir257
+
+# Filesystem now looks like:
+#
+# .(ino 256, gen 3)
+# |--- file258 (ino 258, gen 10)
+# |
+# |--- new_dir259/ (ino 259, gen 10)
+#  |--- dir257/(ino 257, gen 7)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+$SCRATCH_MNT/mysnap2 > /dev/null
+
+$FSSUM_PROG -A -f -w $send_files_dir/1.fssum $SCRATCH_MNT/mysnap1
+$FSSUM_PROG -A -f -w $send_files_dir/2.fssum \
+-x $SCRATCH_MNT/mysnap2/mysnap1 $SCRATCH_MNT/mysnap2
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \
+   $send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch
+$BTRFS_UTIL_PROG send -p 

Re: [PATCH v3] btrfs-progs: Fix disable backtrace assert error

2017-01-19 Thread David Sterba
On Wed, Jan 18, 2017 at 09:42:45PM -0600, Goldwyn Rodrigues wrote:
> >>> +#define BUG_ON(c) ASSERT(!(c))
> >>
> >> The problem with this is that you are killing the value printed as a
> >> part of the trace for BUG_ON(), which is the main reason why commit
> >> 00e769d04c2c83029d6c71 was written. Please be careful with your NOT-ing,
> >> especially when the values are being printed.
> >>
> > 
> > This is by design.
> >
> > As ASSERT() is more meaningful than the abused BUG_ON(), I changed it to
> > print the correct value for ASSERT() and ignore BUG_ON().
> 
> If ASSERT is meaningful, correct it. But please don't make BUG_ON worse
> and leave it as it is, at least until the time you can remove the
> BUG_ONs or convert it. It should print the correct value of why it did
> bug, or else it will print just 1, which is of no debug value.

Agreed, we want to see the exact value.
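
For illustration, a userspace BUG_ON() that keeps the value visible could
look roughly like this (a sketch only; the real macros in btrfs-progs'
kerncompat.h differ in detail):

#include <stdio.h>
#include <stdlib.h>

/*
 * Print the failed condition and its evaluated value before aborting,
 * instead of routing BUG_ON() through ASSERT(!(c)) and printing just 1.
 */
#define BUG_ON(c)						\
do {								\
	long _c_ = (long)(c);					\
	if (_c_) {						\
		fprintf(stderr,					\
			"%s:%d: BUG_ON `%s` triggered, value %ld\n",\
			__FILE__, __LINE__, #c, _c_);		\
		abort();					\
	}							\
} while (0)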
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 4/6] Btrfs: incremental send, fix invalid path for rmdir operations

2017-01-19 Thread Filipe Manana
On Thu, Jan 5, 2017 at 8:24 AM, robbieko  wrote:
> From: Robbie Ko 
>
> Under certain situations, an incremental send operation can

Again, missing some word after the word "can".
Copy-pasting change logs is not that good.

> a rmdir operation that will make the receiving end fail when
> attempting to execute it, because the path does not exist.
>
> Example scenario:
> Parent snapshot:
> | d259_old/ (ino 259, gen 96)
> | d1/   (ino 258, gen 96)
> | f (ino 257, gen 96)
>
> Send snapshot:
> | d258/ (ino 258, gen 98)
> | d259/ (ino 259, gen 98)
> | d1/   (ino 257, gen 98)

Please try to align things for easier readability.

>
> unlink f
> mkdir o257-98-0
> mkdir o259-98-0
> chown o257-98-0 - uid=0, gid=0
> chmod o257-98-0 - mode=0755
> rmdir o258-96-0
> ERROR: rmdir o258-96-0 failed: No such file or directory

Same comment as before (patch 1/6). Don't just paste the output of
receive -vv out of nowhere without mentioning what it is.

>
> While computing the send stream the following steps happen:
>
> 1) While processing inode 257 we create o257-98-0 and o259-98-0,
>then delay o257-98-0 rename operation because its new parent
>in the send snapshot, inode 259, was not yet processed and therefore
>not yet renamed;
>
> 2) Later we want to delete d1 (ino 258, gen 96) while processing inode
>258. In order to get its path for deletion, we need to check if it is
>overwritten in the send snapshot. We find it is overwritten, so
>we delete it via a unique name, which leads to an error. The reason is
>that we find d1 under its parent directory (inode 259) in the send
>snapshot, and in this case, because d1 (inode 257, gen 98) is not the same
>as d1 (inode 258, gen 96), we conclude d1 has been overwritten.
>
> Fix this by adding a generation check for the parent directory. Because
> both parent directories are not identical, we can just skip the overwrite
> check. In addition, inode 256 should not be checked for this since it is a
> subvolume.

I've heavily reworded the change log and included the patch in my 4.11
integration branch.

Thanks.

>
> Signed-off-by: Robbie Ko 
> ---
> V3: improve the change log
>  fs/btrfs/send.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index eaf1c92..139f492 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -1938,6 +1938,19 @@ static int did_overwrite_ref(struct send_ctx *sctx,
> if (ret <= 0)
> goto out;
>
> +   if (dir != BTRFS_FIRST_FREE_OBJECTID) {
> +   ret = get_inode_info(sctx->send_root, dir, NULL, &gen, NULL,
> +NULL, NULL, NULL);
> +   if (ret < 0 && ret != -ENOENT)
> +   goto out;
> +   if (ret) {
> +   ret = 0;
> +   goto out;
> +   }
> +   if (gen != dir_gen)
> +   goto out;
> +   }
> +
> /* check if the ref was overwritten by another ref */
> ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len,
> &ow_inode, &other_type);
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Btrfs: send, fix failure to rename top level inode due to name collision

2017-01-19 Thread fdmanana
From: Robbie Ko 

Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.

Consider the following example scenario.

Parent snapshot:

  . (ino 256, gen 8)
  | a1/ (ino 257, gen 9)
  | a2/ (ino 258, gen 9)

Send snapshot:

  . (ino 256, gen 3)
  | a2/ (ino 257, gen 7)

In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (ran in verbose mode, -vv argument):

  rmdir a1
  mkfile o257-7-0
  rename o257-7-0 -> a2
  ERROR: rename o257-7-0 -> a2 failed: Is a directory

What happens when computing the incremental send stream is:

1) An operation to remove the directory with inode number 257 and
   generation 9 is issued.

2) An operation to create the inode with number 257 and generation 7 is
   issued. This creates the inode with an orphanized name of "o257-7-0".

3) An operation to rename the new inode 257 to its final name, "a2", is
   issued. This is incorrect because inode 258, which has the same name
   and it's a child of the same parent (root inode 256), was not yet
   processed and therefore no rmdir operation for it was yet issued.
   The rename operation is issued because we fail to detect that the
   name of the new inode 257 collides with inode 258, because their
   parent, a subvolume/snapshot root (inode 256) has a different
   generation in both snapshots.

So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.

We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir /mnt/a1
  $ mkdir /mnt/a2
  $ btrfs subvolume snapshot -r /mnt /mnt/snap1
  $ btrfs send /mnt/snap1 -f /tmp/1.snap
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ touch /mnt/a2
  $ btrfs subvolume snapshot -r /mnt /mnt/snap2
  $ btrfs receive /mnt -f /tmp/1.snap
  # Take note that once the filesystem is created, its current
  # generation has value 7 so the inode from the second snapshot has
  # a generation value of 7. And after receiving the first snapshot
  # the filesystem is at a generation value of 10, because the call to
  # create the second snapshot bumps the generation to 8 (the snapshot
  # creation ioctl does a transaction commit), the receive command calls
  # the snapshot creation ioctl to create the first snapshot, which bumps
  # the filesystem's generation to 9, and finally when the receive
  # operation finishes it calls an ioctl to transition the first snapshot
  # (snap1) from RW mode to RO mode, which does another transaction commit
  # and bumps the filesystem's generation to 10.
  $ rm -f /tmp/1.snap
  $ btrfs send /mnt/snap1 -f /tmp/1.snap
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt
  $ btrfs receive /mnt -f /tmp/1.snap
  # Receive of snapshot snap2 used to fail.
  $ btrfs receive /mnt -f /tmp/2.snap

Signed-off-by: Robbie Ko 
Reviewed-by: Filipe Manana 
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana 
---
 fs/btrfs/send.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index d145ce8..2ae32c4 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1681,6 +1681,9 @@ static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen)
 {
int ret;
 
+   if (ino == BTRFS_FIRST_FREE_OBJECTID)
+   return 1;
+
ret = get_cur_inode_state(sctx, ino, gen);
if (ret < 0)
goto out;
@@ -1866,7 +1869,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
 * not deleted and then re-created, if it was then we have no overwrite
 * and we can just unlink this entry.
 */
-   if (sctx->parent_root) {
+   if (sctx->parent_root && dir != BTRFS_FIRST_FREE_OBJECTID) {
ret = get_inode_info(sctx->parent_root, dir, NULL, &gen, NULL,
 NULL, NULL, NULL);
if (ret < 0 && ret != -ENOENT)
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-19 Thread Michal Hocko
On Thu 19-01-17 10:22:36, Jan Kara wrote:
> On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> > On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > > But before going to play with that I am really wondering whether we 
> > > > > > need
> > > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > > journal lock(s) which is the biggest problem from the reclaim 
> > > > > > recursion
> > > > > > point of view. What would cause a deadlock in no journal mode?
> > > > > 
> > > > > We still have the original problem for why we need GFP_NOFS even in
> > > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > > we don't want to recurse back into the file system's writeback path.
> > > > 
> > > > But we do not enter the writeback path from the direct reclaim. Or do
> > > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > > There is only try_to_release_page where we get back to the filesystems
> > > > but I do not see any NOFS protection in ext4_releasepage.
> > > 
> > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > > shrinkers. That's it. So the recursion possibilities are rather more 
> > > limited
> > > than they used to be several years ago and we likely do not need as much
> > > GFP_NOFS protection as we used to.
> > 
> > Thanks for making my remark more clear Jack! I would just want to add
> > that I was playing with the patch below (it is basically
> > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> > debugging patch which means they are called from within transaction) and
> > it didn't hit the lockdep when running xfstests both with or without the
> > enabled journal.
> > 
> > So am I still missing something or the nojournal mode is safe and the
> > current series is OK wrt. ext*?
> 
> I'm convinced the current series is OK, only real life will tell us whether
> we missed something or not ;)

I would like to extend the changelog of "jbd2: mark the transaction
context with the scope GFP_NOFS context".

"
Please note that setups without journal do not suffer from potential
recursion problems and so they do not need the scope protection because
neither ->releasepage nor ->evict_inode (which are the only fs entry
points from the direct reclaim) can reenter a locked context which is
doing the allocation currently.
"
 
> > The following patch in its current form is WIP and needs a proper review
> > before I post it.
> 
> So jbd2 changes look confusing (although technically correct) to me - we
> *always* should run in NOFS context in those places so having GFP_KERNEL
> there looks like it is unnecessarily hiding what is going on. So in those
> places I'd prefer to keep GFP_NOFS or somehow else make it very clear these
> allocations are expected to be GFP_NOFS (and assert that). Otherwise the
> changes look good to me.

I would really like to get rid of most direct NOFS usage and only
dictate it via the scope API, otherwise I suspect we will just grow more
users over time and end up in the same situation as we are in now.
In principle only the context which changes the reclaim reentrancy policy
should care about NOFS and everybody else should just pretend nothing
like that exists. There might be a few exceptions of course, I am not yet
sure whether jbd2 is that case. But I am not proposing this change yet
(thanks for checking anyway)...
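
For context, the scope API usage pattern looks roughly like this (a minimal
sketch, assuming the memalloc_nofs_save()/memalloc_nofs_restore() helpers
this series introduces; header locations and details may differ):

#include <linux/sched/mm.h>
#include <linux/slab.h>

/*
 * Mark a whole section as fs-reclaim-unsafe once, instead of passing
 * GFP_NOFS at every allocation site inside it. While the scope is
 * active, the page allocator implicitly drops __GFP_FS from requests.
 */
static void *alloc_inside_fs_critical_section(size_t size)
{
	unsigned int nofs_flags;
	void *p;

	nofs_flags = memalloc_nofs_save();
	p = kmalloc(size, GFP_KERNEL);	/* behaves like GFP_NOFS here */
	memalloc_nofs_restore(nofs_flags);
	return p;
}

The point is that callers inside the scope no longer have to know about
the reclaim policy at all.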
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid 1 recovery

2017-01-19 Thread Andrei Borzenkov
On Thu, Jan 19, 2017 at 10:15 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Wed, 18 Jan 2017 14:30:28 -0700 as excerpted:
>
>> On Wed, Jan 18, 2017 at 2:07 PM, Jon  wrote:
>>> So, I had a raid 1 btrfs system setup on my laptop. Recently I upgraded
>>> the drives and wanted to get my data back. I figured I could just plug
>>> in one drive, but I found that the volume simply would not mount. I
>>> tried the other drive alone and got the same thing. Plugging in both at
>>> the same time and the volume mounted without issue.
>>
>> Requires mount option degraded.
>>
>> If this is a boot volume, this is difficult because the current udev
>> rule prevents a mount attempt so long as all devices for a Btrfs volume
>> aren't present.
>
> OK, so I've known about this from the list for some time, but what is the
> status with regard to udev/systemd (has a bug/issue been filed, results,
> link?), and what are the alternatives, both for upstream, and for a dev,
> either trying to be proactive, or currently facing a refusal to boot due
> to the issue?
>

The only possible solution is to introduce a separate object that
represents btrfs "storage" (raid) and exports information on whether it
can be mounted. Then it is possible to start a timer and decide to mount
degraded after the timer expires.

Compare with Linux MD, which is assembled by udev as well and does exactly that.

Where this object is better implemented - I do not know. It can be on
the btrfs side (exporting a pseudo device similar to /dev/mdX for Linux
MD). It can be in user space - a special service that waits for device
assembly.

Abstracting it at the btrfs level is cleaner and does not depend on a
specific user space (a.k.a. udev/systemd). OTOH zfs shares the same
problem in the end, so a common solution to multi-device filesystem
handling would be good.

> IOW, I run btrfs raid1 on /, setup before the udev rule I believe as it
> worked back then, and I've known about the issue but haven't followed it
> closely enough to know what I should do if faced with a dead device, or
> what the current status is on alternatives to fix the problem longer
> term, either locally (does simply disabling the rule work?) or upstream,
> and what the alternatives might be along with the reasons this is still
> shipping instead.  And I'm asking in order to try to remedy that. =:^)
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs recovery

2017-01-19 Thread Sebastian Gottschall

Hello

I have a question. After a power outage my system turned into an
unrecoverable state using btrfs (kernel 4.9).
Since I've been running --init-extent-tree for 3 days now, I'm asking how
long this process normally takes and why it outputs millions of lines like


Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 1562890240 root 262 owner 483059214 offset 0 found 1 wanted 0 back 0x23b0211d0

backpointer mismatch on [1562890240 4096]
adding new data backref on 1562890240 root 262 owner 483059214 offset 0 found 1

Repaired extent references for 1562890240

Please avoid typical answers like "potentially dangerous operation", since
all repair options are declared as potentially dangerous.



Sebastian

--
Mit freundlichen Grüssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz:  Berliner Ring 101, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Geschäftsführer: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottsch...@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-19 Thread Jan Kara
On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > But before going to play with that I am really wondering whether we 
> > > > > need
> > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > journal lock(s) which is the biggest problem from the reclaim 
> > > > > recursion
> > > > > point of view. What would cause a deadlock in no journal mode?
> > > > 
> > > > We still have the original problem for why we need GFP_NOFS even in
> > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > we don't want to recurse back into the file system's writeback path.
> > > 
> > > But we do not enter the writeback path from the direct reclaim. Or do
> > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > There is only try_to_release_page where we get back to the filesystems
> > > but I do not see any NOFS protection in ext4_releasepage.
> > 
> > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > shrinkers. That's it. So the recursion possibilities are rather more limited
> > than they used to be several years ago and we likely do not need as much
> > GFP_NOFS protection as we used to.
> 
> Thanks for making my remark more clear Jack! I would just want to add
> that I was playing with the patch below (it is basically
> GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> debugging patch which means they are called from within transaction) and
> it didn't hit the lockdep when running xfstests both with or without the
> enabled journal.
> 
> So am I still missing something or the nojournal mode is safe and the
> current series is OK wrt. ext*?

I'm convinced the current series is OK, only real life will tell us whether
we missed something or not ;)

> The following patch in its current form is WIP and needs a proper review
> before I post it.

So jbd2 changes look confusing (although technically correct) to me - we
*always* should run in NOFS context in those places so having GFP_KERNEL
there looks like it is unnecessarily hiding what is going on. So in those
places I'd prefer to keep GFP_NOFS or somehow else make it very clear these
allocations are expected to be GFP_NOFS (and assert that). Otherwise the
changes look good to me.
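
One possible shape for such an assertion (purely a sketch; it assumes the
PF_MEMALLOC_NOFS task flag from the scope-API series, and the helper name
is made up):

#include <linux/sched.h>

/*
 * Make the GFP_NOFS expectation explicit at jbd2 allocation sites:
 * warn once if we are ever entered outside a NOFS scope.
 */
static inline void jbd2_assert_nofs_scope(void)
{
	WARN_ON_ONCE(!(current->flags & PF_MEMALLOC_NOFS));
}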

Honza

> ---
>  fs/ext4/inode.c   |  4 ++--
>  fs/ext4/mballoc.c | 14 +++---
>  fs/ext4/xattr.c   |  2 +-
>  fs/jbd2/journal.c |  4 ++--
>  fs/jbd2/revoke.c  |  2 +-
>  fs/jbd2/transaction.c |  2 +-
>  6 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b7d141c3b810..841cb8c4cb5e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
>   return __ext4_journalled_writepage(page, len);
>  
>   ext4_io_submit_init(&io_submit, wbc);
> - io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> + io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>   if (!io_submit.io_end) {
>   redirty_page_for_writepage(wbc, page);
>   unlock_page(page);
> @@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>   int err = 0;
>  
>   page = find_or_create_page(mapping, from >> PAGE_SHIFT,
> -mapping_gfp_constraint(mapping, ~__GFP_FS));
> +mapping_gfp_mask(mapping));
>   if (!page)
>   return -ENOMEM;
>  
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index d9fd184b049e..67b97cd6e3d6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
>  static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
> struct ext4_buddy *e4b)
>  {
> - return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
> + return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
>  }
>  
>  static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
> @@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>  
>   /* We only do this if the grp has never been initialized */
>   if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> - int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
> + int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
>   if (ret)
>   return ret;
>   }
> @@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>   BUG_ON(ac->ac_status != AC_STATUS_FOUND);
>   BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>  
> - pa = kmem_cache_alloc(ext4_pspace_cachep, 

Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-19 Thread Michal Hocko
On Tue 17-01-17 18:29:25, Jan Kara wrote:
> On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > But before going to play with that I am really wondering whether we need
> > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > point of view. What would cause a deadlock in no journal mode?
> > > 
> > > We still have the original problem for why we need GFP_NOFS even in
> > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > we don't want to recurse back into the file system's writeback path.
> > 
> > But we do not enter the writeback path from the direct reclaim. Or do
> > you mean something other than pageout()'s mapping->a_ops->writepage?
> > There is only try_to_release_page where we get back to the filesystems
> > but I do not see any NOFS protection in ext4_releasepage.
> 
> Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> shrinkers. That's it. So the recursion possibilities are rather more limited
> than they used to be several years ago and we likely do not need as much
> GFP_NOFS protection as we used to.

Thanks for making my remark more clear Jack! I would just want to add
that I was playing with the patch below (it is basically
GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
debugging patch which means they are called from within transaction) and
it didn't hit the lockdep when running xfstests both with or without the
enabled journal.

So am I still missing something or the nojournal mode is safe and the
current series is OK wrt. ext*?

The following patch in its current form is WIP and needs a proper review
before I post it.
---
 fs/ext4/inode.c   |  4 ++--
 fs/ext4/mballoc.c | 14 +++---
 fs/ext4/xattr.c   |  2 +-
 fs/jbd2/journal.c |  4 ++--
 fs/jbd2/revoke.c  |  2 +-
 fs/jbd2/transaction.c |  2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b7d141c3b810..841cb8c4cb5e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
return __ext4_journalled_writepage(page, len);
 
ext4_io_submit_init(&io_submit, wbc);
-   io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
+   io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
if (!io_submit.io_end) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
@@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
int err = 0;
 
page = find_or_create_page(mapping, from >> PAGE_SHIFT,
-  mapping_gfp_constraint(mapping, ~__GFP_FS));
+  mapping_gfp_mask(mapping));
if (!page)
return -ENOMEM;
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d9fd184b049e..67b97cd6e3d6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
  struct ext4_buddy *e4b)
 {
-   return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+   return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
 }
 
 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
@@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
/* We only do this if the grp has never been initialized */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-   int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+   int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
if (ret)
return ret;
}
@@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
BUG_ON(ac->ac_status != AC_STATUS_FOUND);
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
-   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
if (pa == NULL)
return -ENOMEM;
 
@@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
BUG_ON(ext4_pspace_cachep == NULL);
-   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
if (pa == NULL)
return -ENOMEM;
 
@@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}
}
 
-   ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
+   ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL);
if (!ac) {
ar->len = 0;
*errp = 

Re: [PATCH 1/2] btrfs-progs: cmds-check.c: supports inode nbytes fix in lowmem

2017-01-19 Thread Su Yue

Hi,

The test flag override mechanism only decides whether all tests run in
lowmem mode; it can't decide whether an individual test runs with repair
or not.

I have some ideas below:

1. Create a hidden empty file under the test dir that needs to be
repaired. Edit tests/common.local:_skip_spec() to decide whether to
repair based on the existence of the hidden file.


2. Or just add a test.sh under the test dir. I test my patches this way,
but it may not be graceful enough.


I'm wondering which one is preferred.

Thanks
Su

On 01/17/2017 11:40 PM, David Sterba wrote:

Hi,

I have some comments, see below.

On Mon, Jan 09, 2017 at 01:38:07PM +0800, Su Yue wrote:

Added 'repair_inode_item' which dispatches functions such as
'repair_inode_nbytes_lowmem' to correct errors and
'struct inode_item_fix_info' to store correct values and errors.

Signed-off-by: Su Yue 
---
 cmds-check.c | 161 +++
 1 file changed, 152 insertions(+), 9 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 1dba298..567ca80 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -371,6 +371,17 @@ struct root_item_info {
 };

 /*
+ * Use inode_item_fix_info as function check_inode_item's arg.
+ */
+struct inode_item_fix_info {
+   u64 ino;
+   u64 isize;
+   u64 nbytes;
+
+   int err;
+};
+
+/*
  * Error bit for low memory mode check.
  *
  * Currently no caller cares about it yet.  Just internal use for error
@@ -1866,13 +1877,16 @@ struct node_refs {
 static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
 struct node_refs *nrefs, u64 level);
 static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
-   unsigned int ext_ref);
-
+   unsigned int ext_ref,
+   struct inode_item_fix_info *info);
+static int repair_inode_item(struct btrfs_root *root,
+struct inode_item_fix_info *info);
 static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
   struct node_refs *nrefs, int *level, int ext_ref)
 {
struct extent_buffer *cur = path->nodes[0];
struct btrfs_key key;
+   struct inode_item_fix_info info;
u64 cur_bytenr;
u32 nritems;
u64 first_ino = 0;
@@ -1881,6 +1895,7 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
int ret = 0; /* Final return value */
int err = 0; /* Positive error bitmap */

+   memset(, 0, sizeof(info));
cur_bytenr = cur->start;

/* skip to first inode item or the first inode number change */
@@ -1900,8 +1915,26 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
path->slots[0] = i;

 again:
-   err |= check_inode_item(root, path, ext_ref);
+   err |= check_inode_item(root, path, ext_ref, &info);
+
+   if (repair && (err & ~LAST_ITEM)) {
+   ret = repair_inode_item(root, &info);

+   if (ret < 0)
+   goto out;
+   /*
+* if some errors was repaired, path shall be searched
+* again since path has been changed
+*/
+   if (ret == 0) {
+   btrfs_item_key_to_cpu(path->nodes[0], ,
+ path->slots[0]);
+   btrfs_release_path(path);
+   btrfs_search_slot(NULL, root, , path, 0, 0);
+
+   cur = path->nodes[0];
+   }
+   }
if (err & LAST_ITEM)
goto out;

@@ -2211,7 +2244,8 @@ out:
 }

 static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
-   unsigned int ext_ref);
+   unsigned int ext_ref,
+   struct inode_item_fix_info *info);

 static int walk_down_tree_v2(struct btrfs_root *root, struct btrfs_path *path,
 int *level, struct node_refs *nrefs, int ext_ref)
@@ -2293,7 +2327,7 @@ static int walk_down_tree_v2(struct btrfs_root *root, struct btrfs_path *path,
}

ret = check_child_node(root, cur, path->slots[*level], next);
-   if (ret < 0)
+   if (ret < 0)
break;

if (btrfs_is_leaf(next))
@@ -2383,6 +2417,105 @@ out:
return ret;
 }

+/*
+ * Set inode's nbytes to correct value in @info
+ *
+ * Returns <0  means on error
+ * Returns  0  means successful repair
+ */
+static int repair_inode_nbytes_lowmem(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct inode_item_fix_info *info)
+{
+   struct btrfs_inode_item *ei;
+   struct btrfs_key key;
+   struct btrfs_path path;
+   int ret;
+
+   ASSERT(info);
+   key.objectid