Re: Raid 0 setup doubt.

2016-03-28 Thread Duncan
Jose Otero posted on Mon, 28 Mar 2016 22:30:56 +0200 as excerpted:

> Duncan, you are right. I have 8 GB of RAM, and the most memory intensive
> thing I'll be doing is a VM for Windows. Now I double boot, but rarely
> go into Win, only to play some game occasionally. So, I think I'll be
> better off with Linux flat out and Win in a VM.

LOL.  That sounds /very/ much like me, tho obviously with different 
details given the timeframe, about 20 years ago... and thru to this day, 
as the following story explains.

This was in the first few years after I got my first computer of my own, 
back in the early 90s, so well before I switched to Linux when the 
alternative was upgrading to MS eXPrivacy, starting the weekend eXPrivacy 
was actually released, in 2001.  So it was on MS.

When MS Windows95 came out, I upgraded to it, and normally stayed booted 
into it for all my usual tasks.  But I had one very favorite (to this 
day, actually) game, Master of Orion, original DOS edition, that wouldn't 
work in the original W95 -- I had to reboot to DOS to play it.

I remember what a relief it was to upgrade to 95-OSR2 and finally get the 
ability to run it from the DOS within W95, as that finally allowed me to 
play it without the hassle of rebooting all the time.

That's the first time I realized just what a hassle rebooting to do 
something specific was, as despite that being -- to this day -- my 
favorite computer game ever, I really didn't reboot very often to play 
it, when I had to actually reboot /to/ play it.

Of course W95OSR2 was upgraded to W98 -- at the time I was actually 
running the public betas for IE4/OE4 and was really looking forward to 
the advances that came with the desktop integration that had until then 
been an IE4 add-on, and I remember standing in line at midnight to get my copy 
of W98 as soon as possible.  At the time I was volunteering in the MS IE/
OE newsgroups, programming in VB and the MS Windows API, and on my way 
toward MSMVP.

But that was the height of my MS involvement.  By the time W98 came out I 
was already hearing about this Linux thing, and by sometime in 1999 I was 
convinced of the soundness of the Free/Libre and Open Source approach, 
and shortly thereafter read Eric S. Raymond's "The Cathedral and the 
Bazaar" and related essays (in dead-tree book form), the experience of 
which, for me, was one of repeated YES!!, EXACTLY!!, I didn't know others 
thought that way!!, because I had come to an immature form of many of the 
same conclusions on my own, due to my own VB programming experience, 
which sealed the deal.

But while I was convinced of the moral and logical correctness of the 
FLOSS way, I was loath to simply dump all the technical and developer API 
knowledge and experience I had on MS Windows by that point, and truth be 
told, may never have actually jumped off MS if MS themselves hadn't 
pushed me.

While I had played with Linux a bit, I quickly found that it simply 
wasn't practical on that level, for much the same reason booting to DOS 
to play Master of Orion wasn't practical, despite it being my favorite 
game.  Rebooting was simply too much of a hassle, and I simply didn't do 
it often enough in practice to get much of anywhere.

But it's at that point that MS first introduced its own malware, first 
with Office eXPrivacy, then publishing the fact that MS Windows eXPrivacy 
was going to be just that as well, that they were shipping activation 
malware that would, upon upgrade of too much of the machine, demand 
reactivation.

To me, this abuse of the user via activation malware was both a bridge 
too far, and 100% proof positive that MS considered itself a de-facto 
monopoly, regardless of what it might say in court.   After all, back in 
the day, MS Office got where it got in part because unlike the 
competition, it didn't require clumsy hardware dongles to allow one to 
use the software.  Their policy when trying to actually compete was that 
they'd rather their software be pirated, if it made them the de-facto 
standard, which it ultimately did.  That MS was now actually shipping 
deactivation malware as part of its product was thus 100% proof positive 
that they no longer considered anything else a serious competitive 
threat, and thus, that they could get away with inconveniencing their 
users via deactivation malware, since in their viewpoint they were now a 
monopoly and the users no longer had any other practical alternative 
/but/ MS.

And that was all the push it took.  By the time I started what I knew by 
then was THE switch, because MS really wasn't giving me any choice, I had 
actually been verifying my hardware upgrades against Linux compatibility 
for two full years, so I knew my hardware would handle Linux.  But I just 
couldn't spend the time booted to Linux to learn how to actually /use/ 
it, until MS gave me that push, leaving me no other viable option.  But 
once I knew I was switching, I was dead serious about it, and asked on my 
then ISP's newsgroup 

Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-28 Thread Duncan
James Johnston posted on Mon, 28 Mar 2016 14:34:14 + as excerpted:

> Thanks for the corroborating report - it does sound to me like you ran
> into the same problem I've found.  (I don't suppose you ever captured
> any of the crashes?  If they assert on the same thing as me then it's
> even stronger evidence.)

No...  In fact, as I have compress=lzo on all my btrfs, until you found 
out that it didn't happen in the uncompressed case, I simply considered 
that part and parcel of btrfs not being fully stabilized and mature yet.  
I didn't even consider it a specific bug on its own, and thus didn't 
report it or trace it in any way, and simply worked around it, even tho I 
certainly found it frustrating.

>> The failure mode of this particular ssd was premature failure of more
>> and more sectors, about 3 MiB worth over several months based on the
>> raw count of reallocated sectors in smartctl -A, but using scrub to
>> rewrite them from the good device would normally work, forcing the
>> firmware to remap that sector to one of the spares as scrub corrected
>> the problem.
> 
> I wonder what the risk of a CRC collision was in your situation?
> 
> Certainly my test of "dd if=/dev/zero of=/dev/sdb" was very abusive, and
> I wonder if the result after scrubbing is trustworthy, or if there was
> some collisions.  But I wasn't checking to see if data coming out the
> other end was OK - I was just trying to see if the kernel crashes or not
> (e.g. a USB stick holding a bad btrfs file system should not crash a
> system).

I had absolutely no trouble with the scrubbed data, or at least none I 
attributed to that, tho I didn't have the data cross-hashed and didn't 
cross-check the post-scrub result against earlier hashes or anything, so a 
few CRC collisions could certainly have snuck thru.

But even were some to have done so -- or even if they didn't in practice, 
they could have in theory -- the standard crc checks are still so far 
beyond what's built into a normal filesystem like the reiserfs that's 
still my second (and non-btrfs) level backup, that it's not like I'm 
majorly concerned.  If I were paranoid, as I mentioned, I could certainly 
be doing cross-checks against multiple hashes, but I survived without any 
sort of routine data integrity checking for years, and even a practical 
worst-case-scenario crc-collision is already an infinite percentage 
better than that (just as 1 is an infinite percentage of 0), so it's 
nothing I'm going to worry about unless I actually start seeing real 
cases of it.
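(For a rough sense of scale, assuming btrfs's crc32c checksums and a 
uniformly random corruption: a corrupted block slips past a 32-bit check 
with probability about 2^-32, i.e. roughly one in 4.3 billion:

   $ python3 -c 'print(2**32, 1/2**32)'
   4294967296 2.3283064365386963e-10

so at scrub-scale error counts, collisions really are a mostly 
theoretical concern.)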

>> So I quickly learned that if I powered up and the kernel crashed at
>> that point, I could reboot with the emergency kernel parameter, which
>> would tell systemd to give me a maintenance-mode root login prompt
>> after doing its normal mounts but before starting the normal post-mount
>> services, and I could run scrub from there.  That would normally repair
>> things without triggering the crash, and when I had run scrub
>> repeatedly if necessary to correct any unverified errors in the first
>> runs, I could then exit emergency mode and let systemd start the normal
>> services, including the service that read all these files off the now
>> freshly scrubbed filesystem, without further issues.
> 
> That is one thing I did not test.  I only ever scrubbed after first
> doing the "cat all files to null" test.  So in the case of compression,
> I never got that far.  Probably someone should test the scrubbing more
> thoroughly (i.e. with that abusive "dd" test I did) just to be sure that
> it is stable to confirm your observations, and that the problem is only
> limited to ordinary file I/O on the file system.

I suspect that when the devs duplicate the bug and ultimately trace it 
down, we'll know from the code-path whether scrub could have hit it or 
not, without actually testing the scrub case on its own.

And along with the fix, it's a fair bet there will be an fstests patch that 
will verify no regressions there once fixed, as well.
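(For reference, a minimal sketch of running a single xfstests case from an 
xfstests checkout, assuming TEST_DEV/SCRATCH_DEV etc. are already 
configured in local.config:

   ./check btrfs/027
)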

Once the fstests patch is in, it should be just a small tweak to test 
whether scrub is subject to the problem if it uses a different code-path 
or not.  And in fact, once they find the problem here and verify it with a 
fix, even if scrub doesn't use that code-path, I expect they'll be 
verifying scrub's own code-paths as well.
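(A minimal sketch of the emergency-mode scrub sequence described upthread, 
assuming a systemd system; the mountpoint is hypothetical:

   # At the bootloader, append to the kernel command line:
   #   emergency          (or: systemd.unit=emergency.target)
   # Then, from the maintenance-mode root shell:
   btrfs scrub start -Bd /mnt/btrfs    # -B waits for completion
   # Repeat until no unverified errors remain, then resume normal boot:
   systemctl default
)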

>> And apparently the devs don't test the somewhat less common combination
>> of both compression and high numbers of raid1 correctable checksum
>> errors, or they would have probably detected and fixed the problem from
>> that.
> 
> Well, I've only tested with RAID-1.  I don't know if:
> 
> 1.  The problem occurs with other RAID levels like RAID-10, RAID5/6.
> 
> 2.  The kernel crashes in non-duplicated levels.  In these cases, data
> loss is inevitable since the data is missing, but these losses should be
> handled cleanly, and not by crashing the kernel.

Good points.  Again, I expect the extent of the bug based on its code-
path and what actually uses it, should be 

Re: [PATCH v2] fstest: btrfs: test single 4k extent after subpagesize buffered writes

2016-03-28 Thread Liu Bo
On Wed, Mar 23, 2016 at 09:52:21PM -0700, Liu Bo wrote:
> On Wed, Mar 23, 2016 at 07:53:38PM +0800, Eryu Guan wrote:
> > On Tue, Mar 22, 2016 at 03:12:25PM -0700, Liu Bo wrote:
> > > On Tue, Mar 22, 2016 at 12:00:13PM +0800, Eryu Guan wrote:
> > > > On Thu, Mar 17, 2016 at 03:56:38PM -0700, Liu Bo wrote:
> > > > > This is to test if COW enabled btrfs can end up with single 4k extents
> > > > > when doing subpagesize buffered writes.
> > > > 
> > > > What happens if btrfs is mounted with "nodatacow" option? Does it need
> > > > to _notrun if cow is disabled?
> > > 
> > > In my test, the test passes if mounting with "nodatacow".
> > > Yes, it makes sense to have a _notrun for nodatacow.
> > 
> > If "nodatacow" btrfs should pass the test as well, then I don't think
> > _notrun is needed, so when it failed, something went wrong.
> 
> Ok, and it should pass in theory.
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > The patch to fix the problem is
> > > > >   https://patchwork.kernel.org/patch/8527991/
> > > > > 
> > > > > Signed-off-by: Liu Bo 
> > > > > ---
> > > > > v2: - Teach awk to know system's pagesize.
> > > > > - Add "Silence is golden" to output.
> > > > > - Use local variables to lower case.
> > > > > - Add comments to make code clear.
> > > > 
> > > > This should be v3, and this patch was buried in the v2 thread :)
> > > 
> > > Oops, thanks for pointing it out.
> > > 
> > > > 
> > > > > 
> > > > >  tests/btrfs/027 | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  tests/btrfs/027.out |   2 ++
> > > > >  tests/btrfs/group   |   1 +
> > > > >  3 files changed, 105 insertions(+)
> > > > >  create mode 100755 tests/btrfs/027
> > > > >  create mode 100644 tests/btrfs/027.out
> > > > > 
> > > > > diff --git a/tests/btrfs/027 b/tests/btrfs/027
> > > > > new file mode 100755
> > > > > index 000..19d324b
> > > > > --- /dev/null
> > > > > +++ b/tests/btrfs/027
> > > > > @@ -0,0 +1,102 @@
> > > > > +#! /bin/bash
> > > > > +# FS QA Test 027
> > > > > +#
> > > > > +# When btrfs is using cow mode, buffered writes of sub-pagesize can end up with
> > > > > +# single 4k extents.
> > > > > +# Ref:
> > > > > +# "Stray 4k extents with slow buffered writes"
> > > > > +# https://www.spinics.net/lists/linux-btrfs/msg52628.html
> > > > 
> > > > After going through this thread, my understanding is that nodatacow
> > > > btrfs should pass this test even on unpatched kernel (e.g. v4.5). But
> > > > my test on v4.5 kernel failed with nodatacow mount option, pagesize
> > > > extent is still found.
> > > > 
> > > 
> > > I verified it again on my kvm box and it passed with an unpatched v4.5 
> > > kernel.
> > > 
> > > Can you please show me the 027.full file?
> > > 
> > > I can't think of a reason for this..
> > 
> > I'm using the v4.5 kernel and v4.4 btrfs-progs, and it's not reproduced 
> > every time.
> > 
> > SECTION   -- btrfs_nodatacow
> > RECREATING-- btrfs on /dev/sda5
> > FSTYP -- btrfs
> > PLATFORM  -- Linux/x86_64 dhcp-66-86-11 4.5.0
> > MKFS_OPTIONS  -- /dev/sda6
> > MOUNT_OPTIONS -- -o nodatacow -o context=system_u:object_r:nfs_t:s0 
> > /dev/sda6 /mnt/testarea/scratch
> > 
> > btrfs/027 28s ... - output mismatch (see 
> > /root/xfstests/results//btrfs_nodatacow/btrfs/027.out.bad)
> > --- tests/btrfs/027.out 2016-03-23 15:39:41.56200 +0800
> > +++ /root/xfstests/results//btrfs_nodatacow/btrfs/027.out.bad   
> > 2016-03-23 19:37:38.96200 +0800
> > @@ -1,2 +1,3 @@
> >  QA output created by 027
> >  Silence is golden
> > +8
> > ...
> > (Run 'diff -u tests/btrfs/027.out 
> > /root/xfstests/results//btrfs_nodatacow/btrfs/027.out.bad'  to see the 
> > entire diff)
> > Ran: btrfs/027
> > Failures: btrfs/027
> > Failed 1 of 1 tests
> > 
> > And btrfs/027.full shows:
> > 
> > /mnt/testarea/scratch/testfile:
> >  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> >0: [0..28863]:  2154496..2183359 28864   0x0
> >1: [28864..57751]:  2183360..2212247 2   0x0
> >2: [57752..85543]:  2212248..2240039 27792   0x0
> >3: [85544..113239]: 2240040..2267735 27696   0x0
> >4: [113240..113247]: 2267736..2267743 8   0x0
> >5: [113248..141999]: 2267744..2296495 28752   0x0
> >6: [142000..142023]: 2296496..229651924   0x0
> >7: [142024..159799]: 2296520..2314295 17776   0x1
> 
> I can reproduce it barely once in 100 runs... but anyway, if it is a bug,
> it's not a problem in this test case.  I'll send a v3 patch and
> work on this nocow case.

My trace results show that it's not a bug.

[0, 4096]
[4096, 8192]
...
[N-4096, N]
[N, N+4096]
[N+4096, N+8192]
...

There could be some latency between the writes against [N, N+4096] and the 
writes against [N+4096, N+8192], so when writeback starts between 
[N-4096, N] and [N, N+4096], btrfs will find the delayed allocation range 
ending at extent [N-4096, N], and then it creates an extent to cover that 
range.  Later 
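(A hypothetical reproduction sketch of the write pattern discussed here, 
not the actual btrfs/027 test, using sub-pagesize buffered writes and 
fiemap on a scratch file:

   xfs_io -f -c "pwrite -b 2048 0 64M" /mnt/scratch/testfile
   sync
   xfs_io -c "fiemap -v" /mnt/scratch/testfile   # look for stray 4k extents
)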

Re: Raid 0 setup doubt.

2016-03-28 Thread Chris Murphy
On Mon, Mar 28, 2016 at 6:35 AM, Austin S. Hemmelgarn
 wrote:

> The other caveat that nobody seems to mention outside of specific cases is
> that using suspend to disk exposes you to direct attack by anyone with the
> ability to either physically access the system, or boot an alternative OS on
> it.  This is however not a Linux-specific issue (although Windows and OS X
> do a much better job of validating the hibernation image than Linux does
> before resuming from it, so it's not as easy to trick them into loading
> arbitrary data).

OS X uses dynamically created swapfiles, and the hibernation file is a
separate file that's pre-allocated. Both are on the root file system,
so if you encrypt, then those files are also encrypted. Hibernate
involves a hint in NVRAM that hibernate resume is necessary, and the
firmware uses a hibernate recovery mechanism in the bootloader which
also has a way to unlock encrypted volumes (which are kinda like an
encrypted logical volume, as Apple now defaults to using a logical
volume manager of their own creation).



-- 
Chris Murphy


Re: scrub: Tree block spanning stripes, ignored

2016-03-28 Thread Qu Wenruo



Ivan P wrote on 2016/03/28 23:21 +0200:

Well, the file in this inode is fine, I was able to copy it off the
disk. However, rm-ing the file causes a segmentation fault. Shortly
after that, I get a kernel oops. Same thing happens if I attempt to
re-run scrub.

How can I delete that inode? Could deleting it destroy the filesystem
beyond repair?


The kernel oops should protect you from completely destroying the fs.

However, it seems the problem is beyond what the kernel can handle (hence 
the oops).

So there is no safe recovery method for now.

From now on, any repair advice from me *MAY* *destroy* your fs.
So please do a backup while you still can.


The best possible try would be "btrfsck --init-extent-tree --repair".

If it works, then mount it and run "btrfs balance start ".
Lastly, umount and use btrfsck to re-check if it fixes the problem.
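(Putting those three steps together as a sketch; the device and mountpoint 
are hypothetical, and per the warning above this can destroy the fs, so 
back up first:

   btrfsck --init-extent-tree --repair /dev/sdX
   mount /dev/sdX /mnt && btrfs balance start /mnt
   umount /mnt && btrfsck /dev/sdX
)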

Thanks,
Qu



Regards,
Ivan

On Mon, Mar 28, 2016 at 3:10 AM, Qu Wenruo  wrote:



Ivan P wrote on 2016/03/27 16:31 +0200:


Thanks for the reply,

the raid1 array was created from scratch, so not converted from ext*.
I used btrfs-progs version 4.2.3 on kernel 4.2.5 to create the array, btw.



I don't remember any strange behavior after 4.0, so no clue here.

Go to the subvolume 5 (the top-level subvolume), find inode 71723 and try to
remove it.
Then, use 'btrfs filesystem sync ' to sync the inode removal.

Finally use latest btrfs-progs to check if the problem disappears.

This problem seems to be quite strange, so I can't locate the root cause,
but try to remove the file and hope the kernel can handle it.

Thanks,
Qu



Is there a way to fix the current situation without taking the whole
data off the disk?
I'm not familiar with file systems terms, so what exactly could I have
lost, if anything?

Regards,
Ivan

On Sun, Mar 27, 2016 at 4:23 PM, Qu Wenruo wrote:



 On 03/27/2016 05:54 PM, Ivan P wrote:

 Read the info on the wiki, here's the rest of the requested
 information:

 # uname -r
 4.4.5-1-ARCH

 # btrfs fi show
 Label: 'ArchVault'  uuid: cd8a92b6-c5b5-4b19-b5e6-a839828d12d8
  Total devices 1 FS bytes used 2.10GiB
  devid1 size 14.92GiB used 4.02GiB path /dev/sdc1

 Label: 'Vault'  uuid: 013cda95-8aab-4cb2-acdd-2f0f78036e02
  Total devices 2 FS bytes used 800.72GiB
  devid1 size 931.51GiB used 808.01GiB path /dev/sda
  devid2 size 931.51GiB used 808.01GiB path /dev/sdb

 # btrfs fi df /mnt/vault/
 Data, RAID1: total=806.00GiB, used=799.81GiB
 System, RAID1: total=8.00MiB, used=128.00KiB
 Metadata, RAID1: total=2.00GiB, used=936.20MiB
 GlobalReserve, single: total=320.00MiB, used=0.00B

 On Fri, Mar 25, 2016 at 3:16 PM, Ivan P wrote:

 Hello,

 using kernel  4.4.5 and btrfs-progs 4.4.1, I today ran a
 scrub on my
 2x1Tb btrfs raid1 array and it finished with 36
 unrecoverable errors
 [1], all blaming the treeblock 741942071296. Running "btrfs
 check
 --readonly" on one of the devices lists that extent as
 corrupted [2].

 How can I recover, how much did I really lose, and how can I
 prevent
 it from happening again?
 If you need me to provide more info, do tell.

 [1] http://cwillu.com:8080/188.110.141.36/1


 This message itself is normal, it just means a tree block is
 crossing 64K stripe boundary.
 And due to scrub limitations, it can't check whether it's good or bad.
 But

 [2] http://pastebin.com/xA5zezqw

 This one is much more meaningful, showing several strange bugs.

 1. corrupt extent record: key 741942071296 168 1114112
 This means this is an EXTENT_ITEM (168), and according to the offset,
 the length of the extent is 1088K, definitely not a valid
 tree block size.

 But according to [1], the kernel thinks it's a tree block, which is quite
 strange.
 Normally, such a mismatch only happens in an fs converted from ext*.

 2. Backref 741942071296 root 5 owner 71723 offset 2589392896
 num_refs 0 not found in extent tree

 num_refs 0, this is also strange; a normal backref won't have a zero
 reference number.

 3. bad metadata [741942071296, 741943185408) crossing stripe boundary
 It could be a false warning fixed in latest btrfsck.
 But you're using 4.4.1, so I think that's the problem.

 4. bad extent [741942071296, 741943185408), type mismatch with chunk
 This seems to explain the problem, a data extent appears in a
 metadata chunk.
 It seems that you're really using converted btrfs.

 If so, just roll it back to ext*.  Current btrfs-convert has a known
 bug, but the fix is 

Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method

2016-03-28 Thread Qu Wenruo



Chris Mason wrote on 2016/03/28 10:09 -0400:

On Sat, Mar 26, 2016 at 09:11:53PM +0800, Qu Wenruo wrote:



On 03/25/2016 11:11 PM, Chris Mason wrote:

On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:



Chris Mason wrote on 2016/03/24 16:58 -0400:

Are you storing the entire hash, or just the parts not represented in
the key?  I'd like to keep the on-disk part as compact as possible for
this part.


Currently, it's entire hash.

More detailed can be checked in another mail.

Although it's OK with me to truncate the duplicated last 8 bytes (64 bits),
I still quite like the current implementation, as one memcpy() is simpler.


[ sorry FB makes urls look ugly, so I delete them from replies ;) ]

Right, I saw that but wanted to reply to the specific patch.  One of the
lessons learned from the extent allocation tree and file extent items is
they are just too big.  Lets save those bytes, it'll add up.


OK, I'll reduce the duplicated last 8 bytes.

And also, I'll remove the "length" member, as it can always be fetched from
dedupe_info->block_size.


This would mean dedup_info->block_size is a write once field.  I'm ok
with that (just like metadata blocksize) but we should make sure the
ioctls etc don't allow changing it.


Not a problem; the current block_size change is done by completely disabling 
dedupe (which implies a sync_fs), then re-enabling with the new block_size.

So it would be OK.





The length itself was used to verify whether we were in the transition to a 
new dedupe size, but since we now use a full sync_fs(), such behavior is 
not needed any more.









+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * all its content is hash.
+ * So no special item struct is needed.
+ */
+


Can we do this instead with a backref from the extent?  It'll save us a
huge amount of IO as we delete things.


That's the original implementation from Liu Bo.

The problem is, it changes the data backref rules (originally, only an
EXTENT_DATA item can cause a data backref), and will make dedupe INCOMPAT
rather than the current RO_COMPAT.
So I really don't like to change the data backref rule.


Let me reread this part; the cost of maintaining the second index is
dramatically higher than adding a backref.  I do agree that it's nice
to be able to delete the dedup trees without impacting the rest, but
over the long term I think we'll regret the added balances.


Thanks for pointing out the problem.  Yes, I didn't even consider this fact.

But on the other hand, such removal only happens when we remove the *last* 
reference of the extent.
So for the medium-to-high dedupe rate case, that routine is not that 
frequent, which will reduce the impact.
(This is quite different from the non-dedupe case.)


It's both addition and removal, and the efficiency hit does depend on
what level of sharing you're able to achieve.  But what we don't want is
for metadata usage to explode as people make small non-duplicate changes
to their FS.  If that happens, we'll only end up using dedup in backup
farms and other highly limited use cases.


Right; the current dedupe-specific backref does bring unavoidable 
metadata overhead.


[[People are trading off when using a non-default feature]]
IMHO, dedupe is not a generic feature; just like compression and possibly 
encryption, people choose them with the trade-offs in mind.


For example, compression can achieve quite high performance for easily 
compressible data, but can also get quite low performance for 
not-so-compressible data, like ISO files or videos.
(In my test with a 2-core VM and virtio-blk on HDD, dd'ing an ISO into a 
btrfs file runs at about 90MB/s with the default mount options, while with 
compression it's only about 40~50MB/s.)


If we combine all the overhead together (not only metadata overhead), 
almost all current transparent data processing methods will only benefit 
specific use cases while reducing generic performance.


So increased metadata overhead is acceptable for me, especially when the 
main overhead is CPU time spent on SHA256.


And we have workarounds, from setting the dedupe disable prop to setting a 
larger dedupe block_size, to keep small non-dedupe writes from filling 
the dedupe tree.





I do agree that delayed refs are error prone, but that's a good reason
to fix delayed refs, not to recreate the backrefs of the extent
allocation tree in a new dedicated tree.


[[We need an idea generic for both backends]]
Also I want to mention that dedupe now contains 2 different backends, so 
we'd better choose an idea that won't split the different backends across 
different incompat/ro_compat flags.


If using the backref method, the ondisk backend will definitely make dedupe 
an incompat feature, affecting the in-memory backend even though it's 
completely backward-compatible.


Or we split the dedupe flag into DEDUPE_ONDISK and DEDUPE_INMEMORY, where 
the former is INCOMPAT while the latter is at most RO_COMPAT (if using the 
dedupe tree).



[[Cleaner layout is less 

Re: Raid 0 setup doubt.

2016-03-28 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 28 Mar 2016 08:35:59 -0400 as
excerpted:

> The other caveat that nobody seems to mention outside of specific cases
> is that using suspend to disk exposes you to direct attack by anyone
> with the ability to either physically access the system, or boot an
> alternative OS on it.  This is however not a Linux-specific issue
> (although Windows and OS X do a much better job of validating the
> hibernation image than Linux does before resuming from it, so it's not
> as easy to trick them into loading arbitrary data).

I believe that within the kernel community, it's generally accepted that 
physical access is to be considered effectively full root access, 
because there are simply too many routes to get root if you have physical 
access to practically control them all.  I've certainly read that.

Which is what encryption is all about, including encrypted / (via initr*) 
if you're paranoid enough, as that's considered the only effective way to 
thwart physical-access == root-access.

And even that has some pretty big assumptions if physical access is 
available, including that no hardware keyloggers or the like are planted, 
as that would let an attacker simply log the password or other access key 
used.  One would have to for instance use a wired keyboard that they kept 
on their person (or inspect the keyboard, including taking it apart to 
check for loggers), and at minimum visually inspect its connection to the 
computer, including having a look inside the case, to be sure, before 
entering their password.  Or store the access key on a thumbdrive kept on 
the person, etc, and still inspect the computer left behind for listening/
logging devices...

In practice it's generally simpler to just control physical access 
entirely, to whatever degree (onsite video security systems with tamper-
evident timestamping... kept in a vault, missile silo, etc) matches the 
extant paranoia level.

Tho hosting the swap, and therefore hibernation data, on an encrypted 
device that's setup by the initr* is certainly possible, if it's 
considered worth the trouble.  Obviously that's going to require jumping 
thru many of the same hoops that (as mentioned upthread) splitting the 
hibernate image between devices will require, as it generally uses the 
same underlying initr*-based mechanisms.  I'd certainly imagine the 
Snowdens of the world will be doing that sort of thing, among the 
multitude of security options they must take.
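(A minimal sketch of such a setup, assuming a distro whose initr* supports 
unlocking swap for resume; the device names, keyfile, and mapping name are 
all hypothetical:

   # /etc/crypttab -- a fixed key, not a per-boot random key, so the
   # hibernation image survives the reboot:
   cryptswap  /dev/sda2  /etc/keys/swap.key  luks
   # /etc/fstab
   /dev/mapper/cryptswap  none  swap  sw  0 0
   # plus something like resume=/dev/mapper/cryptswap on the kernel
   # command line, handled by the initr*.
)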

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Investment~

2016-03-28 Thread David Cooper



Hello,
I am representing an investment interest from Dubai for which we seek your
participation as an overseas representative. Reply on email below if
interested.
Email: philp...@gmail.com




Re: scrub: Tree block spanning stripes, ignored

2016-03-28 Thread Ivan P
Well, the file in this inode is fine, I was able to copy it off the
disk. However, rm-ing the file causes a segmentation fault. Shortly
after that, I get a kernel oops. Same thing happens if I attempt to
re-run scrub.

How can I delete that inode? Could deleting it destroy the filesystem
beyond repair?

Regards,
Ivan

On Mon, Mar 28, 2016 at 3:10 AM, Qu Wenruo  wrote:
>
>
> Ivan P wrote on 2016/03/27 16:31 +0200:
>>
>> Thanks for the reply,
>>
>> the raid1 array was created from scratch, so not converted from ext*.
>> I used btrfs-progs version 4.2.3 on kernel 4.2.5 to create the array, btw.
>
>
> I don't remember any strange behavior after 4.0, so no clue here.
>
> Go to the subvolume 5 (the top-level subvolume), find inode 71723 and try to
> remove it.
> Then, use 'btrfs filesystem sync ' to sync the inode removal.
>
> Finally use latest btrfs-progs to check if the problem disappears.
>
> This problem seems to be quite strange, so I can't locate the root cause,
> but try to remove the file and hope the kernel can handle it.
>
> Thanks,
> Qu
>>
>>
>> Is there a way to fix the current situation without taking the whole
>> data off the disk?
>> I'm not familiar with file systems terms, so what exactly could I have
>> lost, if anything?
>>
>> Regards,
>> Ivan
>>
>> On Sun, Mar 27, 2016 at 4:23 PM, Qu Wenruo wrote:
>>
>>
>>
>> On 03/27/2016 05:54 PM, Ivan P wrote:
>>
>> Read the info on the wiki, here's the rest of the requested
>> information:
>>
>> # uname -r
>> 4.4.5-1-ARCH
>>
>> # btrfs fi show
>> Label: 'ArchVault'  uuid: cd8a92b6-c5b5-4b19-b5e6-a839828d12d8
>>  Total devices 1 FS bytes used 2.10GiB
>>  devid1 size 14.92GiB used 4.02GiB path /dev/sdc1
>>
>> Label: 'Vault'  uuid: 013cda95-8aab-4cb2-acdd-2f0f78036e02
>>  Total devices 2 FS bytes used 800.72GiB
>>  devid1 size 931.51GiB used 808.01GiB path /dev/sda
>>  devid2 size 931.51GiB used 808.01GiB path /dev/sdb
>>
>> # btrfs fi df /mnt/vault/
>> Data, RAID1: total=806.00GiB, used=799.81GiB
>> System, RAID1: total=8.00MiB, used=128.00KiB
>> Metadata, RAID1: total=2.00GiB, used=936.20MiB
>> GlobalReserve, single: total=320.00MiB, used=0.00B
>>
>> On Fri, Mar 25, 2016 at 3:16 PM, Ivan P wrote:
>>
>> Hello,
>>
>> using kernel  4.4.5 and btrfs-progs 4.4.1, I today ran a
>> scrub on my
>> 2x1Tb btrfs raid1 array and it finished with 36
>> unrecoverable errors
>> [1], all blaming the treeblock 741942071296. Running "btrfs
>> check
>> --readonly" on one of the devices lists that extent as
>> corrupted [2].
>>
>> How can I recover, how much did I really lose, and how can I
>> prevent
>> it from happening again?
>> If you need me to provide more info, do tell.
>>
>> [1] http://cwillu.com:8080/188.110.141.36/1
>>
>>
>> This message itself is normal, it just means a tree block is
>> crossing 64K stripe boundary.
>> And due to scrub limitations, it can't check whether it's good or bad.
>> But
>>
>> [2] http://pastebin.com/xA5zezqw
>>
>> This one is much more meaningful, showing several strange bugs.
>>
>> 1. corrupt extent record: key 741942071296 168 1114112
>> This means this is an EXTENT_ITEM (168), and according to the offset,
>> the length of the extent is 1088K, definitely not a valid
>> tree block size.
>>
>> But according to [1], the kernel thinks it's a tree block, which is quite
>> strange.
>> Normally, such a mismatch only happens in an fs converted from ext*.
>>
>> 2. Backref 741942071296 root 5 owner 71723 offset 2589392896
>> num_refs 0 not found in extent tree
>>
>> num_refs 0, this is also strange; a normal backref won't have a zero
>> reference number.
>>
>> 3. bad metadata [741942071296, 741943185408) crossing stripe boundary
>> It could be a false warning fixed in latest btrfsck.
>> But you're using 4.4.1, so I think that's the problem.
>>
>> 4. bad extent [741942071296, 741943185408), type mismatch with chunk
>> This seems to explain the problem, a data extent appears in a
>> metadata chunk.
>> It seems that you're really using converted btrfs.
>>
>> If so, just roll it back to ext*.  Current btrfs-convert has a known
>> bug, but the fix is still under review.
>>
>> If want to use btrfs, use a newly created one instead of
>> btrfs-convert.
>>
>> Thanks,
>> Qu
>>
>>
>> Regards,
>> Soukyuu
>>
>> P.S.: please add me to CC when replying as I did not
>> subscribe to the
>> 

Re: "bad metadata" not fixed by btrfs repair

2016-03-28 Thread Chris Murphy
On Mon, Mar 28, 2016 at 12:51 PM, Nazar Mokrynskyi  wrote:

>>> # btrfs check --repair /dev/mapper/fanbtr
>>> bad metadata [4425377054720, 4425377071104) crossing stripe boundary
>>> bad metadata [4425380134912, 4425380151296) crossing stripe boundary
>>> bad metadata [4427532795904, 4427532812288) crossing stripe boundary
>>> bad metadata [4568321753088, 4568321769472) crossing stripe boundary
>>> bad metadata [4568489656320, 4568489672704) crossing stripe boundary
>>> bad metadata [4571474493440, 4571474509824) crossing stripe boundary
>>> bad metadata [4571946811392, 4571946827776) crossing stripe boundary
>>> bad metadata [4572782919680, 4572782936064) crossing stripe boundary
>>> bad metadata [4573086351360, 4573086367744) crossing stripe boundary
>>> bad metadata [4574221041664, 4574221058048) crossing stripe boundary
>>> bad metadata [4574373412864, 4574373429248) crossing stripe boundary
>>> bad metadata [4574958649344, 4574958665728) crossing stripe boundary
>>> bad metadata [4575996018688, 4575996035072) crossing stripe boundary
>>> bad metadata [4580376772608, 4580376788992) crossing stripe boundary


http://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/tree/cmds-check.c
line 7722 discusses this error message, and it looks like there's no
repair function for it yet; it's uncertain what problems can result from
this.

line 4576 has a possible clue, but I don't know what "can't handle"
means -- whether real problems are just silently permitted, or what.



-- 
Chris Murphy


Re: Raid 0 setup doubt.

2016-03-28 Thread Jose Otero
Thanks a lot Duncan, Chris Murphy, James Johnston, and Austin.

Thanks for the clear answer and the extra information to chew on.

Duncan, you are right. I have 8 GB of RAM, and the most memory intensive
thing I'll be doing is a VM for Windows. Now I double boot, but rarely
go into Win, only to play some game occasionally. So, I think I'll be
better off with Linux flat out and Win in a VM.

I'm probably overshooting with the 16 GiB swap, so I may end up with
8 GiB swap. And I'll read up on the splitting thing with the priority
trick, because it sounds nice. Thanks for the tip.

Take care everybody,

JM.



On 03/28/2016 02:56 AM, Duncan wrote:
> Jose Otero posted on Sun, 27 Mar 2016 12:35:43 +0200 as excerpted:
> 
>> Hello,
>>
>> --
>> I apologize beforehand if I'm asking a too basic question for the
>> mailing list, or if it has been already answered at nauseam.
>> --
> 
> Actually, looks like pretty reasonable questions, to me. =:^)
> 
>> I have two hdd (Western Digital 750 GB approx. 700 GiB each), and I
>> planning to set up a RAID 0 through btrfs. UEFI firmware/boot, no dual
>> boot, only linux.
>>
>> My question is, given the UEFI partition plus linux swap partition, I
>> won't have two equal sized partitions for setting up the RAID 0 array.
>> So, I'm not quite sure how to do it. I'll have:
>>
>> /dev/sda:
>>
>>16 KiB (GPT partition table)
>> sda1:  512 MiB (EFI, fat32)
>> sda2:  16 GiB (linux-swap)
>> sda3:  rest of the disk /  (btrfs)
>>
>> /dev/sdb:
>>
>> sdb1:  (btrfs)
>>
>> The btrfs partitions on each hdd are not of the same size (admittedly by
>> an small difference, but still). Even if a backup copy of the EFI
>> partition is created in the second hdd (i.e. sdb) which it may be, not
>> sure, because the linux-swap partion is still left out.
>>
>> Should I stripe both btrfs partitions together no matter the size?
> 
> That should work without issue.
> 
>> mkfs.btrfs -m raid0 -d raid0 /dev/sda3 /dev/sdb1
>>
>> How will btrfs manage the difference in size?
> 
> Btrfs raid0 requires two devices, minimum, striping each chunk across the 
> two.  Therefore, with two devices, to the extent that one device is 
> larger, the larger (as partitioned) device will leave the difference in 
> space unusable, as there's no second device to stripe with.
> 
>> Or should I partition out the extra size of /dev/sdb for trying to match
>> equally sized partions? in other words:
>>
>> /dev/sdb:
>>
>> sdb1:  17 GiB approx. free or for whatever I want.
>> sdb2:  (btrfs)
>>
>> and then:
>>
>> mkfs.btrfs -m raid0 -d raid0 /dev/sda3 /dev/sdb2
> 
> This should work as well.
> 
> 
> But there's another option you didn't mention, that may be useful, 
> depending on your exact need and usage of that swap:
> 
> Split your swap space in half, say (roughly, you can make one slightly 
> larger than the other to allow for the EFI on one device) 8 GiB on each 
> of the hdds.  Then, in your fstab or whatever you use to list the swap 
> options, put the option pri=100 (or whatever number you find 
> appropriate) on /both/ swap partitions.
> 
> With an equal priority on both swaps and with both active, the kernel 
> will effectively raid0 your swap as well (until one runs out, of course), 
> which, given that on spinning rust the device speed is the definite 
> performance bottleneck for swap, should roughly double your swap 
> performance. =:^)  Given that swap on spinning rust is slower than real 
> RAM by several orders of magnitude, it'll still be far slower than real 
> RAM, but twice as fast as it would be is better than otherwise, so...
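> 
> For concreteness, the fstab entries for that might look like this 
> (a hedged sketch, with hypothetical device names):
> 
>   /dev/sda2  none  swap  sw,pri=100  0 0
>   /dev/sdb1  none  swap  sw,pri=100  0 0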
> 
> 
> Tho how much RAM /do/ you have, and are you sure you really need swap at 
> all?  Many systems today have enough RAM that they don't really need swap 
> (at least as swap, see below), unless they're going to be used for 
> something extremely memory intensive, where the much lower speed of swap 
> isn't a problem.
> 
> If you have 8 GiB of RAM or more, this may well be your situation.  With 
> 4 GiB, you probably have more than enough RAM for normal operation, but 
> it may still be useful to have at least some swap, so Linux can keep more 
> recently used files cached while swapping out some seldom used 
> application RAM, but by 8 GiB you likely have enough RAM for reasonable 
> cache AND all your apps and won't actually use swap much at all.
> 
> Tho if you frequently edit GiB+ video files and/or work with many virtual 
> machines, 8 GiB RAM will likely be actually used, and 16 GiB may be the 
> point at which you don't use swap much at all.  And of course if you are 
> using LOTS of VMs or doing heavy 4K video editing, 16 GiB or more may 
> well still be in heavy use, but with that kind of memory-intensive usage, 
> 32 GiB of RAM or more would likely be a good investment.
> 
> Anyway, for systems with enough memory to not need swap in /normal/ 
> circumstances, in the event that 

Re: "bad metadata" not fixed by btrfs repair

2016-03-28 Thread Austin S. Hemmelgarn

On 2016-03-28 10:37, Marc Haber wrote:

Hi,

I have a btrfs which btrfs check --repair doesn't fix:

# btrfs check --repair /dev/mapper/fanbtr
bad metadata [4425377054720, 4425377071104) crossing stripe boundary
bad metadata [4425380134912, 4425380151296) crossing stripe boundary
bad metadata [4427532795904, 4427532812288) crossing stripe boundary
bad metadata [4568321753088, 4568321769472) crossing stripe boundary
bad metadata [4568489656320, 4568489672704) crossing stripe boundary
bad metadata [4571474493440, 4571474509824) crossing stripe boundary
bad metadata [4571946811392, 4571946827776) crossing stripe boundary
bad metadata [4572782919680, 4572782936064) crossing stripe boundary
bad metadata [4573086351360, 4573086367744) crossing stripe boundary
bad metadata [4574221041664, 4574221058048) crossing stripe boundary
bad metadata [4574373412864, 4574373429248) crossing stripe boundary
bad metadata [4574958649344, 4574958665728) crossing stripe boundary
bad metadata [4575996018688, 4575996035072) crossing stripe boundary
bad metadata [4580376772608, 4580376788992) crossing stripe boundary
repaired damaged extent references
Fixed 0 roots.
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/mapper/fanbtr
UUID: 90f8d728-6bae-4fca-8cda-b368ba2c008e
cache and super generation don't match, space cache will be invalidated
found 97171628230 bytes used err is 0
total csum bytes: 91734220
total tree bytes: 3021848576
total fs tree bytes: 2762784768
total extent tree bytes: 148570112
btree space waste bytes: 545440822
file data blocks allocated: 308328280064
  referenced 177314340864
# btrfs check --repair /dev/mapper/fanbtr
checking extents
bad metadata [4425377054720, 4425377071104) crossing stripe boundary
bad metadata [4425380134912, 4425380151296) crossing stripe boundary
bad metadata [4427532795904, 4427532812288) crossing stripe boundary
bad metadata [4568321753088, 4568321769472) crossing stripe boundary
bad metadata [4568489656320, 4568489672704) crossing stripe boundary
bad metadata [4571474493440, 4571474509824) crossing stripe boundary
bad metadata [4571946811392, 4571946827776) crossing stripe boundary
bad metadata [4572782919680, 4572782936064) crossing stripe boundary
bad metadata [4573086351360, 4573086367744) crossing stripe boundary
bad metadata [4574221041664, 4574221058048) crossing stripe boundary
bad metadata [4574373412864, 4574373429248) crossing stripe boundary
bad metadata [4574958649344, 4574958665728) crossing stripe boundary
bad metadata [4575996018688, 4575996035072) crossing stripe boundary
bad metadata [4580376772608, 4580376788992) crossing stripe boundary
repaired damaged extent references
Fixed 0 roots.
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/mapper/fanbtr
UUID: 90f8d728-6bae-4fca-8cda-b368ba2c008e
cache and super generation don't match, space cache will be invalidated
found 97171628230 bytes used err is 0
total csum bytes: 91734220
total tree bytes: 3021848576
total fs tree bytes: 2762784768
total extent tree bytes: 148570112
btree space waste bytes: 545440822
file data blocks allocated: 308328280064
  referenced 177314340864

How do I fix this?

Does the kernel play a role in btrfs check --repair, or is this all a
userspace matter?

Greetings
Marc

I had been hoping somebody with a bit more knowledge of this would 
answer, but seeing as that hasn't happened...


Did you convert this filesystem from ext4 (or ext3)?  If so, then you 
appear to have done so with a faulty version of btrfs-convert (I don't 
remember when btrfs-convert started having issues, but I'm relatively 
certain it's not completely fixed yet).  If that is the case, that's 
probably the ultimate cause of the 'bad metadata (start, end) 
crossing stripe boundary' thing.


You hadn't mentioned what version of btrfs-progs you're using, and that 
is somewhat important for recovery.  I'm not sure if current versions of 
btrfs check can fix this issue, but I know for a fact that older 
versions (prior to at least 4.1) can not fix it.


As far as what the kernel is involved with, the easy way to check is if 
it's operating on a mounted filesystem or not.  If it only operates on 
mounted filesystems, it almost certainly goes through the kernel, if it 
only operates on unmounted filesystems, it's almost certainly done in 
userspace (except dev scan and technically fi show).  The typical advice 
is that you worry about kernel version for normal operation, and 
userspace version for initial setup (mkfs), and recovery (check, 
restore, recover, etc).
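(So, concretely, the two version numbers that matter are captured by:

   uname -r          # kernel side, for normal operation
   btrfs --version   # userspace btrfs-progs side, for mkfs and recovery
)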


Now, slightly higher level discussion:
1. If you converted this filesystem using an old version of 
btrfs-convert, I would suggest recreating it from backup if possible. 
Convert often results in sub-optimal data layout, and converted 
filesystems have (from what I've seen on the list at least) historically 

Re: linux 4.4.3 oops on aborted transaction, forces FS read-only

2016-03-28 Thread E V
It appears this issue has been caused by space cache corruption
preventing allocation of new metadata on the filesystem (8-device
RAID-10 for metadata). I've now been running nospace_cache for the
past couple of weeks of my heavy rsync traffic without any failures.
It's certainly slower, but the rsync now completes every time without
the fs once being forced read-only.
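(For reference, a sketch of the relevant mount options; the device and 
mountpoint are hypothetical.  nospace_cache disables the cache for that 
mount, while clear_cache invalidates and rebuilds the existing cache once:

   mount -o nospace_cache /dev/sdX /mnt
   mount -o clear_cache /dev/sdX /mnt
)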

On Fri, Mar 4, 2016 at 9:14 AM, E V  wrote:
> Looks like the transaction abort ends up causing the no-space error, if
> that's at all helpful.  Lots of free space seems to be irrelevant.
> Any chance this will be getting better soon?  It seems to happen to me a
> lot these days, and adding space doesn't change anything.
> [282713.823416] WARNING: CPU: 4 PID: 3978 at
> fs/btrfs/extent-tree.c:6549 __btrfs_free_extent+0x98c/0x9a8 [btrfs]()
> [282713.823466] BTRFS: Transaction aborted (error -28)
> [282713.823467] Modules linked in: ipmi_si mpt3sas raid_class
> scsi_transport_sas dell_rbu nfsv3 nfsv4 nfsd auth_rpcgss oid_registry
> nfs_acl nfs lockd grace fscache sunrpc ext2 intel_powerclamp coretemp
> crct10dif_pclmul crc32_pclmul sha256_generic hmac drbg aesni_intel
> joydev aes_x86_64 glue_helper lrw gf128mul evdev ipmi_devintf iTCO_wdt
> ablk_helper cryptd iTCO_vendor_support dcdbas psmouse serio_raw pcspkr
> i7core_edac edac_core lpc_ich mfd_core ipmi_msghandler
> acpi_power_meter button processor loop autofs4 ext4 crc16 mbcache jbd2
> btrfs xor raid6_pq hid_generic usbhid hid sg sd_mod crc32c_intel
> uhci_hcd ehci_pci ehci_hcd megaraid_sas ixgbe mdio usbcore ptp
> usb_common pps_core scsi_mod bnx2 [last unloaded: ipmi_si]
> [282713.823891] CPU: 4 PID: 3978 Comm: btrfs-transacti Tainted: G
> I 4.4.3 #1
> [282713.823979]  0006 811eba76 8807c4fbbb50
> 0009
> [282713.824029]  810442ee a0160c9e ffe4
> 8807c4fbbba8
> [282713.824080]  88041e0e8000  81044346
> a01db5aa
> [282713.824130] Call Trace:
> [282713.824155]  [] ? dump_stack+0x46/0x59
> [282713.824185]  [] ? warn_slowpath_common+0x94/0xa9
> [282713.824222]  [] ? __btrfs_free_extent+0x98c/0x9a8 
> [btrfs]
> [282713.824252]  [] ? warn_slowpath_fmt+0x43/0x4b
> [282713.824288]  [] ? __btrfs_free_extent+0x98c/0x9a8 
> [btrfs]
> [282713.824319]  [] ? __cache_free.isra.49+0x1df/0x1ee
> [282713.824358]  [] ?
> btrfs_merge_delayed_refs+0x59/0x23d [btrfs]
> [282713.824412]  [] ?
> __btrfs_run_delayed_refs+0xad5/0xc83 [btrfs]
> [282713.824466]  [] ?
> btrfs_run_delayed_refs+0x6a/0x1a7 [btrfs]
> [282713.824519]  [] ?
> btrfs_write_dirty_block_groups+0xd0/0x219 [btrfs]
> [282713.824573]  [] ? commit_cowonly_roots+0x1dd/0x275 
> [btrfs]
> [282713.824613]  [] ?
> btrfs_commit_transaction+0x476/0x924 [btrfs]
> [282713.824668]  [] ? start_transaction+0x2e1/0x46f [btrfs]
> [282713.824708]  [] ? transaction_kthread+0xde/0x18e [btrfs]
> [282713.824749]  [] ?
> btrfs_cleanup_transaction+0x3ee/0x3ee [btrfs]
> [282713.824794]  [] ? kthread+0xa7/0xaf
> [282713.824820]  [] ? kthread_parkme+0x16/0x16
> [282713.824849]  [] ? ret_from_fork+0x3f/0x70
> [282713.824876]  [] ? kthread_parkme+0x16/0x16
> [282713.824902] ---[ end trace 2857e44546172518 ]---
> [282713.824946] BTRFS: error (device sdh) in __btrfs_free_extent:6549:
> errno=-28 No space left
> [282713.824995] BTRFS info (device sdh): forced readonly
> [282713.825021] BTRFS: error (device sdh) in
> btrfs_run_delayed_refs:2927: errno=-28 No space left
> [282713.856362] BTRFS warning (device sdh): Skipping commit of aborted
> transaction.
> [282713.856413] BTRFS: error (device sdh) in cleanup_transaction:1746:
> errno=-28 No space left


Re: "bad metadata" not fixed by btrfs repair

2016-03-28 Thread Marc Haber
On Mon, Mar 28, 2016 at 06:51:02PM +, Hugo Mills wrote:
>"Could not find root 8" is harmless (and will be going away as a
> message soon). It just means that systemd is probing the FS for
> quotas, and you don't have quotas enabled.

*phew* That message was not what I wanted to read on this filesystem.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: "bad metadata" not fixed by btrfs repair

2016-03-28 Thread Nazar Mokrynskyi

I have the same thing with kernel 4.5 and btrfs-progs 4.4.

Wrote about it 2 weeks ago and didn't get any answer: 
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg51609.html


However, despite those messages everything seems to work fine.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: naza...@diaspora.mokrynskyi.com
Tox: 
A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 28.03.16 21:42, Marc Haber wrote:

On Mon, Mar 28, 2016 at 04:37:14PM +0200, Marc Haber wrote:

I have a btrfs which btrfs check --repair doesn't fix:

# btrfs check --repair /dev/mapper/fanbtr
bad metadata [4425377054720, 4425377071104) crossing stripe boundary
bad metadata [4425380134912, 4425380151296) crossing stripe boundary
bad metadata [4427532795904, 4427532812288) crossing stripe boundary
bad metadata [4568321753088, 4568321769472) crossing stripe boundary
bad metadata [4568489656320, 4568489672704) crossing stripe boundary
bad metadata [4571474493440, 4571474509824) crossing stripe boundary
bad metadata [4571946811392, 4571946827776) crossing stripe boundary
bad metadata [4572782919680, 4572782936064) crossing stripe boundary
bad metadata [4573086351360, 4573086367744) crossing stripe boundary
bad metadata [4574221041664, 4574221058048) crossing stripe boundary
bad metadata [4574373412864, 4574373429248) crossing stripe boundary
bad metadata [4574958649344, 4574958665728) crossing stripe boundary
bad metadata [4575996018688, 4575996035072) crossing stripe boundary
bad metadata [4580376772608, 4580376788992) crossing stripe boundary
repaired damaged extent references
Fixed 0 roots.
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/mapper/fanbtr
UUID: 90f8d728-6bae-4fca-8cda-b368ba2c008e
cache and super generation don't match, space cache will be invalidated
found 97171628230 bytes used err is 0
total csum bytes: 91734220
total tree bytes: 3021848576
total fs tree bytes: 2762784768
total extent tree bytes: 148570112
btree space waste bytes: 545440822
file data blocks allocated: 308328280064
  referenced 177314340864

Mounting this filesystem gives:
Mar 28 20:25:18 fan kernel: [   20.979673] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.979739] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.980900] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.980948] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.981428] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.981472] BTRFS error (device dm-16): could 
not find root 8

which is not detected by btrfs check.

What is going on here?

Greetings
Marc








Re: "bad metadata" not fixed by btrfs repair

2016-03-28 Thread Hugo Mills
On Mon, Mar 28, 2016 at 08:42:37PM +0200, Marc Haber wrote:
> On Mon, Mar 28, 2016 at 04:37:14PM +0200, Marc Haber wrote:
> > I have a btrfs which btrfs check --repair doesn't fix:
> > 
> > # btrfs check --repair /dev/mapper/fanbtr
> > bad metadata [4425377054720, 4425377071104) crossing stripe boundary
> > bad metadata [4425380134912, 4425380151296) crossing stripe boundary
> > bad metadata [4427532795904, 4427532812288) crossing stripe boundary
> > bad metadata [4568321753088, 4568321769472) crossing stripe boundary
> > bad metadata [4568489656320, 4568489672704) crossing stripe boundary
> > bad metadata [4571474493440, 4571474509824) crossing stripe boundary
> > bad metadata [4571946811392, 4571946827776) crossing stripe boundary
> > bad metadata [4572782919680, 4572782936064) crossing stripe boundary
> > bad metadata [4573086351360, 4573086367744) crossing stripe boundary
> > bad metadata [4574221041664, 4574221058048) crossing stripe boundary
> > bad metadata [4574373412864, 4574373429248) crossing stripe boundary
> > bad metadata [4574958649344, 4574958665728) crossing stripe boundary
> > bad metadata [4575996018688, 4575996035072) crossing stripe boundary
> > bad metadata [4580376772608, 4580376788992) crossing stripe boundary
> > repaired damaged extent references
> > Fixed 0 roots.
> > checking free space cache
> > checking fs roots
> > checking csums
> > checking root refs
> > enabling repair mode
> > Checking filesystem on /dev/mapper/fanbtr
> > UUID: 90f8d728-6bae-4fca-8cda-b368ba2c008e
> > cache and super generation don't match, space cache will be invalidated
> > found 97171628230 bytes used err is 0
> > total csum bytes: 91734220
> > total tree bytes: 3021848576
> > total fs tree bytes: 2762784768
> > total extent tree bytes: 148570112
> > btree space waste bytes: 545440822
> > file data blocks allocated: 308328280064
> >  referenced 177314340864
> 
> Mounting this filesystem gives:
> Mar 28 20:25:18 fan kernel: [   20.979673] BTRFS error (device dm-16): could 
> not find root 8
> Mar 28 20:25:18 fan kernel: [   20.979739] BTRFS error (device dm-16): could 
> not find root 8
> Mar 28 20:25:18 fan kernel: [   20.980900] BTRFS error (device dm-16): could 
> not find root 8
> Mar 28 20:25:18 fan kernel: [   20.980948] BTRFS error (device dm-16): could 
> not find root 8
> Mar 28 20:25:18 fan kernel: [   20.981428] BTRFS error (device dm-16): could 
> not find root 8
> Mar 28 20:25:18 fan kernel: [   20.981472] BTRFS error (device dm-16): could 
> not find root 8
> 
> which is not detected by btrfs check.
> 
> What is going on here?

   "Could not find root 8" is harmless (and will be going away as a
message soon). It just means that systemd is probing the FS for
quotas, and you don't have quotas enabled.

   Hugo.

-- 
Hugo Mills | Hey, Virtual Memory! Now I can have a *really big*
hugo@... carfax.org.uk | ramdisk!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: "bad metadata" not fixed by btrfs repair

2016-03-28 Thread Marc Haber
On Mon, Mar 28, 2016 at 04:37:14PM +0200, Marc Haber wrote:
> I have a btrfs which btrfs check --repair doesn't fix:
> 
> # btrfs check --repair /dev/mapper/fanbtr
> bad metadata [4425377054720, 4425377071104) crossing stripe boundary
> bad metadata [4425380134912, 4425380151296) crossing stripe boundary
> bad metadata [4427532795904, 4427532812288) crossing stripe boundary
> bad metadata [4568321753088, 4568321769472) crossing stripe boundary
> bad metadata [4568489656320, 4568489672704) crossing stripe boundary
> bad metadata [4571474493440, 4571474509824) crossing stripe boundary
> bad metadata [4571946811392, 4571946827776) crossing stripe boundary
> bad metadata [4572782919680, 4572782936064) crossing stripe boundary
> bad metadata [4573086351360, 4573086367744) crossing stripe boundary
> bad metadata [4574221041664, 4574221058048) crossing stripe boundary
> bad metadata [4574373412864, 4574373429248) crossing stripe boundary
> bad metadata [4574958649344, 4574958665728) crossing stripe boundary
> bad metadata [4575996018688, 4575996035072) crossing stripe boundary
> bad metadata [4580376772608, 4580376788992) crossing stripe boundary
> repaired damaged extent references
> Fixed 0 roots.
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> enabling repair mode
> Checking filesystem on /dev/mapper/fanbtr
> UUID: 90f8d728-6bae-4fca-8cda-b368ba2c008e
> cache and super generation don't match, space cache will be invalidated
> found 97171628230 bytes used err is 0
> total csum bytes: 91734220
> total tree bytes: 3021848576
> total fs tree bytes: 2762784768
> total extent tree bytes: 148570112
> btree space waste bytes: 545440822
> file data blocks allocated: 308328280064
>  referenced 177314340864

Mounting this filesystem gives:
Mar 28 20:25:18 fan kernel: [   20.979673] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.979739] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.980900] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.980948] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.981428] BTRFS error (device dm-16): could 
not find root 8
Mar 28 20:25:18 fan kernel: [   20.981472] BTRFS error (device dm-16): could 
not find root 8

which is not detected by btrfs check.

What is going on here?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: +49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: +49 6224 1600421


best way to make space_cache=v2 default?

2016-03-28 Thread Stefan Priebe

Hi,

what's the best way to make space_cache=v2 the default in my custom 
kernel build?


Greets,
Stefan
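
(One approach that avoids touching the kernel at all, assuming the goal is
just persistence rather than a true compile-time default: the free space
tree only has to be requested once, after which it is recorded on disk and
later mounts use it automatically.

# mount -o space_cache=v2 /dev/sdX /mnt    <- one-time conversion;
                                              /dev/sdX is a placeholder

A genuine compile-time default would mean carrying a local patch to the
btrfs mount-option defaults; nothing in this thread confirms either route.)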


"bad metadata" not fixed by btrfs repair

2016-03-28 Thread Marc Haber
Hi,

I have a btrfs which btrfs check --repair doesn't fix:

# btrfs check --repair /dev/mapper/fanbtr
bad metadata [4425377054720, 4425377071104) crossing stripe boundary
bad metadata [4425380134912, 4425380151296) crossing stripe boundary
bad metadata [4427532795904, 4427532812288) crossing stripe boundary
bad metadata [4568321753088, 4568321769472) crossing stripe boundary
bad metadata [4568489656320, 4568489672704) crossing stripe boundary
bad metadata [4571474493440, 4571474509824) crossing stripe boundary
bad metadata [4571946811392, 4571946827776) crossing stripe boundary
bad metadata [4572782919680, 4572782936064) crossing stripe boundary
bad metadata [4573086351360, 4573086367744) crossing stripe boundary
bad metadata [4574221041664, 4574221058048) crossing stripe boundary
bad metadata [4574373412864, 4574373429248) crossing stripe boundary
bad metadata [4574958649344, 4574958665728) crossing stripe boundary
bad metadata [4575996018688, 4575996035072) crossing stripe boundary
bad metadata [4580376772608, 4580376788992) crossing stripe boundary
repaired damaged extent references
Fixed 0 roots.
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/mapper/fanbtr
UUID: 90f8d728-6bae-4fca-8cda-b368ba2c008e
cache and super generation don't match, space cache will be invalidated
found 97171628230 bytes used err is 0
total csum bytes: 91734220
total tree bytes: 3021848576
total fs tree bytes: 2762784768
total extent tree bytes: 148570112
btree space waste bytes: 545440822
file data blocks allocated: 308328280064
 referenced 177314340864
# btrfs check --repair /dev/mapper/fanbtr
checking extents
bad metadata [4425377054720, 4425377071104) crossing stripe boundary
bad metadata [4425380134912, 4425380151296) crossing stripe boundary
bad metadata [4427532795904, 4427532812288) crossing stripe boundary
bad metadata [4568321753088, 4568321769472) crossing stripe boundary
bad metadata [4568489656320, 4568489672704) crossing stripe boundary
bad metadata [4571474493440, 4571474509824) crossing stripe boundary
bad metadata [4571946811392, 4571946827776) crossing stripe boundary
bad metadata [4572782919680, 4572782936064) crossing stripe boundary
bad metadata [4573086351360, 4573086367744) crossing stripe boundary
bad metadata [4574221041664, 4574221058048) crossing stripe boundary
bad metadata [4574373412864, 4574373429248) crossing stripe boundary
bad metadata [4574958649344, 4574958665728) crossing stripe boundary
bad metadata [4575996018688, 4575996035072) crossing stripe boundary
bad metadata [4580376772608, 4580376788992) crossing stripe boundary
repaired damaged extent references
Fixed 0 roots.
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/mapper/fanbtr
UUID: 90f8d728-6bae-4fca-8cda-b368ba2c008e
cache and super generation don't match, space cache will be invalidated
found 97171628230 bytes used err is 0
total csum bytes: 91734220
total tree bytes: 3021848576
total fs tree bytes: 2762784768
total extent tree bytes: 148570112
btree space waste bytes: 545440822
file data blocks allocated: 308328280064
 referenced 177314340864

How do I fix this?

Does the kernel play a role in btrfs check --repair, or is this all a
userspace matter?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: +49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: +49 6224 1600421


RE: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-28 Thread James Johnston
Hi,

Thanks for the corroborating report - it does sound to me like you ran into the
same problem I've found.  (I don't suppose you ever captured any of the
crashes?  If they assert on the same thing as me then it's even stronger
evidence.)

> The failure mode of this particular ssd was premature failure of more and
> more sectors, about 3 MiB worth over several months based on the raw
> count of reallocated sectors in smartctl -A, but using scrub to rewrite
> them from the good device would normally work, forcing the firmware to
> remap that sector to one of the spares as scrub corrected the problem.

I wonder what the risk of a CRC collision was in your situation?

Certainly my test of "dd if=/dev/zero of=/dev/sdb" was very abusive, and I
wonder if the result after scrubbing is trustworthy, or if there was some
collisions.  But I wasn't checking to see if data coming out the other end was
OK - I was just trying to see if the kernel crashes or not (e.g. a USB stick
holding a bad btrfs file system should not crash a system).

> But /home (on an entirely separate filesystem, but a filesystem still on
> a pair of partitions, one on each of the same two ssds) would often have
> more, and because I have a particular program that I start with my X and
> KDE session that reads a bunch of files into cache as it starts up, I had
> a systemd service configured to start at boot and cat all the files in
> that particular directory to /dev/null, thus caching them so when I later
> started X and KDE (I don't run a *DM and thus login at the text CLI and
> startx, with a kde session, from the CLI) and thus this program, all the
> files it reads would already be in cache.
>
>  If that service was allowed to run, it would read in all
> those files and the resulting errors would often crash the kernel.

This sounds oddly familiar to how I made it crash. :)
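
(A pre-caching unit along those lines might look roughly like this --
hypothetical paths and names, not the actual service described above:

# /etc/systemd/system/precache-home.service
[Unit]
Description=Pre-cache frequently read files at boot

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'cat /home/user/app-cache-dir/* > /dev/null'

[Install]
WantedBy=multi-user.target
)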

> So I quickly learned that if I powered up and the kernel crashed at that
> point, I could reboot with the emergency kernel parameter, which would
> tell systemd to give me a maintenance-mode root login prompt after doing
> its normal mounts but before starting the normal post-mount services, and
> I could run scrub from there.  That would normally repair things without
> triggering the crash, and when I had run scrub repeatedly if necessary to
> correct any unverified errors in the first runs, I could then exit
> emergency mode and let systemd start the normal services, including the
> service that read all these files off the now freshly scrubbed
> filesystem, without further issues.

That is one thing I did not test.  I only ever scrubbed after first doing the
"cat all files to null" test.  So in the case of compression, I never got that
far.  Probably someone should test the scrubbing more thoroughly (i.e. with
that abusive "dd" test I did) just to be sure that it is stable to confirm your
observations, and that the problem is only limited to ordinary file I/O on the
file system.

> And apparently the devs don't test the
> somewhat less common combination of both compression and high numbers of
> raid1 correctable checksum errors, or they would have probably detected
> and fixed the problem from that.

Well, I've only tested with RAID-1.  I don't know if:

1.  The problem occurs with other RAID levels like RAID-10, RAID5/6.

2.  The kernel crashes in non-duplicated levels.  In these cases, data loss is
inevitable since the data is missing, but these losses should be handled
cleanly, and not by crashing the kernel.  For example:

a.  Checksum errors in RAID-0.
b.  Checksum errors on a single hard drive (not multiple device array).

I guess more testing is needed, but I don't have time to do this more
exhaustive testing right now, especially for these other RAID levels I'm not
planning to use (as I'm doing this in my limited free time).  (For now, I can
just turn off compression & move on.)

Do any devs do regular regression testing for these sorts of edge cases once
they come up? (i.e. this problem won't come back, will it?)

> So thanks for the additional tests and narrowing it down to the
> compression on raid1 with many checksum errors case.  Now that you've
> found out how the problem can be replicated, I'd guess we'll have a fix
> patch in relatively short order. =:^)

Hopefully!  Like I said, it might not be limited to RAID-1 though.  I only
tested RAID-1.

> That said, based on my own experience, I don't consider the problem dire
> enough to switch off compression on my btrfs raid1s here.  After all, I
> both figured out how to live with the problem on my failing ssd before I
> knew all this detail, and have eliminated the symptoms for the time being
> at least, as the devices I'm using now are currently reliable enough that
> I don't have to deal with this issue.
> 
> And in the event that I do encounter the problem again, in severe enough
> form that I can't even get a successful scrub in to fix it, possibly due
> to catastrophic failure of a 

Re: btrfs_destroy_inode WARN_ON.

2016-03-28 Thread Markus Trippelsdorf
On 2016.03.28 at 10:05 -0400, Josef Bacik wrote:
> >Mar 24 10:37:27 x4 kernel: WARNING: CPU: 3 PID: 11838 at 
> >fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x22b/0x2a0
> 
> I saw this running some xfstests on our internal kernels but haven't been
> able to reproduce it on my latest enospc work (which is obviously perfect).
> What were you doing when you tripped this?  I'd like to see if I actually
> did fix it or if I still need to run it down.  Thanks,

I cannot really tell. Looking at the backtrace, both Dave and I were
running rm. 
This warning happened just once on my machine, so the issue is obviously
very hard to trigger.

-- 
Markus


Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method

2016-03-28 Thread Chris Mason
On Sat, Mar 26, 2016 at 09:11:53PM +0800, Qu Wenruo wrote:
> 
> 
> On 03/25/2016 11:11 PM, Chris Mason wrote:
> >On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Chris Mason wrote on 2016/03/24 16:58 -0400:
> >>>Are you storing the entire hash, or just the parts not represented in
> >>>the key?  I'd like to keep the on-disk part as compact as possible for
> >>>this part.
> >>
> >>Currently, it's entire hash.
> >>
> >>More detailed can be checked in another mail.
> >>
> >>Although it's OK to truncate the last duplicated 8 bytes(64bit) for me,
> >>I still quite like current implementation, as one memcpy() is simpler.
> >
> >[ sorry FB makes urls look ugly, so I delete them from replies ;) ]
> >
> >Right, I saw that but wanted to reply to the specific patch.  One of the
> >lessons learned from the extent allocation tree and file extent items is
> >that they are just too big.  Let's save those bytes, it'll add up.
> 
> OK, I'll reduce the duplicated last 8 bytes.
> 
> And also, removing the "length" member, as it can be always fetched from
> dedupe_info->block_size.

This would mean dedupe_info->block_size is a write-once field.  I'm ok
with that (just like metadata blocksize) but we should make sure the
ioctls etc don't allow changing it.

> 
> The length itself is used to verify if we are at the transition to a new
> dedupe size, but since we later use a full sync_fs(), such behavior is not
> needed any more.
> 
> 
> >
> >>
> >>>
> +
> +/*
> + * Objectid: bytenr
> + * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
> + * offset: Last 64 bit of the hash
> + *
> + * Used for bytenr <-> hash search (for free_extent)
> + * all its content is hash.
> + * So no special item struct is needed.
> + */
> +
> >>>
> >>>Can we do this instead with a backref from the extent?  It'll save us a
> >>>huge amount of IO as we delete things.
> >>
> >>That's the original implementation from Liu Bo.
> >>
> >>The problem is, it changes the data backref rules(originally, only
> >>EXTENT_DATA item can cause data backref), and will make dedupe INCOMPAT
> >>rather than the current RO_COMPAT.
> >>So I really don't like to change the data backref rule.
> >
> >Let me reread this part, the cost of maintaining the second index is
> >dramatically higher than adding a backref.  I do agree that's its nice
> >to be able to delete the dedup trees without impacting the rest, but
> >over the long term I think we'll regret the added balances.
> 
> Thanks for pointing the problem. Yes, I didn't even consider this fact.
> 
> But, on the other hand, such removal only happens when we remove the *last*
> reference of the extent.
> So, for medium to high dedupe rate cases, such a routine is not that
> frequent, which will reduce the impact.
> (Which is quite different from the non-dedupe case.)

It's both addition and removal, and the efficiency hit does depend on
what level of sharing you're able to achieve.  But what we don't want is
for metadata usage to explode as people make small non-duplicate changes
to their FS.  If that happens, we'll only end up using dedup in backup
farms and other highly limited use cases.

I do agree that delayed refs are error prone, but that's a good reason
to fix delayed refs, not to recreate the backrefs of the extent
allocation tree in a new dedicated tree.

-chris



Re: btrfs_destroy_inode WARN_ON.

2016-03-28 Thread Josef Bacik

On 03/25/2016 04:25 AM, Markus Trippelsdorf wrote:

On 2016.03.24 at 18:54 -0400, Dave Jones wrote:

Just hit this on a tree from earlier this morning, v4.5-11140 or so.

WARNING: CPU: 2 PID: 32570 at fs/btrfs/inode.c:9261 
btrfs_destroy_inode+0x389/0x3f0 [btrfs]
CPU: 2 PID: 32570 Comm: rm Not tainted 4.5.0-think+ #14
  c039baf9 ef721ef0 88025966fc08 8957bcdb
    88025966fc50 890b41f1
  88045d918040 242d4eed6048 88024eed6048 88024eed6048
Call Trace:
  [] ? btrfs_destroy_inode+0x389/0x3f0 [btrfs]
  [] dump_stack+0x68/0x9d
  [] __warn+0x111/0x130
  [] warn_slowpath_null+0x1d/0x20
  [] btrfs_destroy_inode+0x389/0x3f0 [btrfs]
  [] destroy_inode+0x67/0x90
  [] evict+0x1b7/0x240
  [] iput+0x3ae/0x4e0
  [] ? dput+0x20e/0x460
  [] do_unlinkat+0x256/0x440
  [] ? do_rmdir+0x350/0x350
  [] ? syscall_trace_enter_phase1+0x87/0x260
  [] ? enter_from_user_mode+0x50/0x50
  [] ? __lock_is_held+0x25/0xd0
  [] ? mark_held_locks+0x22/0xc0
  [] ? syscall_trace_enter_phase2+0x12d/0x3d0
  [] ? SyS_rmdir+0x20/0x20
  [] SyS_unlinkat+0x1b/0x30
  [] do_syscall_64+0xf4/0x240
  [] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace a48ce4e6a1b5e409 ]---


That's WARN_ON(BTRFS_I(inode)->csum_bytes);

*maybe* it's a bad disk, but there's no indication in dmesg of anything awry.
Spinning rust on SATA, nothing special.


Same thing here:

Mar 24 10:37:27 x4 kernel: ------------[ cut here ]------------
Mar 24 10:37:27 x4 kernel: WARNING: CPU: 3 PID: 11838 at fs/btrfs/inode.c:9261 
btrfs_destroy_inode+0x22b/0x2a0
Mar 24 10:37:27 x4 kernel: CPU: 3 PID: 11838 Comm: rm Not tainted 
4.5.0-11787-ga24e3d414e59-dirty #64
Mar 24 10:37:27 x4 kernel: Hardware name: System manufacturer System Product 
Name/M4A78T-E, BIOS 3503 04/13/2011
Mar 24 10:37:27 x4 kernel:  813c0d1a 81b8bb84 
812ffd0b
Mar 24 10:37:27 x4 kernel: 81099a9a  880149b86088 
88021585f000
Mar 24 10:37:27 x4 kernel: 812ffd0b  88005f526000 

Mar 24 10:37:27 x4 kernel: Call Trace:
Mar 24 10:37:27 x4 kernel: [] ? dump_stack+0x46/0x6c
Mar 24 10:37:27 x4 kernel: [] ? 
btrfs_destroy_inode+0x22b/0x2a0
Mar 24 10:37:27 x4 kernel: [] ? warn_slowpath_null+0x5a/0xe0
Mar 24 10:37:27 x4 kernel: [] ? 
btrfs_destroy_inode+0x22b/0x2a0
Mar 24 10:37:27 x4 kernel: [] ? do_unlinkat+0x13c/0x3e0
Mar 24 10:37:27 x4 kernel: [] ? 
entry_SYSCALL_64_fastpath+0x13/0x8f
Mar 24 10:37:27 x4 kernel: ---[ end trace e9bae5be848e7a9e ]---



I saw this running some xfstests on our internal kernels but haven't 
been able to reproduce it on my latest enospc work (which is obviously 
perfect).  What were you doing when you tripped this?  I'd like to see 
if I actually did fix it or if I still need to run it down.  Thanks,


Josef


Re: Raid 0 setup doubt.

2016-03-28 Thread Austin S. Hemmelgarn

On 2016-03-27 20:56, Duncan wrote:


But there's another option you didn't mention, that may be useful,
depending on your exact need and usage of that swap:

Split your swap space in half, say (roughly, you can make one slightly
larger than the other to allow for the EFI on one device) 8 GiB on each
of the hdds.  Then, in your fstab or whatever you use to list the swap
options, put the option priority=100 (or whatever number you find
appropriate) on /both/ swap partitions.
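
For example (a sketch; the UUIDs are placeholders, and in fstab syntax the
option is spelled pri=):

UUID=<swap-on-hdd-1>  none  swap  defaults,pri=100  0  0
UUID=<swap-on-hdd-2>  none  swap  defaults,pri=100  0  0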

With an equal priority on both swaps and with both active, the kernel
will effectively raid0 your swap as well (until one runs out, of course),
which, given that on spinning rust the device speed is the definite
performance bottleneck for swap, should roughly double your swap
performance. =:^)  Given that swap on spinning rust is slower than real
RAM by several orders of magnitude, it'll still be far slower than real
RAM, but twice as fast as it would be is better than otherwise, so...

I'm not 100% certain that it will double swap bandwidth unless you're 
constantly swapping, and even then it would only on average double the 
write bandwidth.  The kernel swaps pages in groups (8 pages by default, 
which is 32k, I usually up this to 16 pages on my systems because when 
I'm hitting swap, it usually means I'm hitting it hard), and I'm pretty 
certain that each group of pages only goes to one swap device.  This 
means that by default, with two devices, you would get 32k written at a 
time to alternating devices.  However, there is no guarantee that when 
you swap things in they will be from alternating devices, so you could 
be reading multiple MB of data from one device without even touching the 
other one.  Thus, for writes, this works like a raid0 setup with a large 
stripe size, but for reads it ends up somewhere between raid0 and single 
disk performance, depending on how lucky you are and what type of 
workload you are dealing with.
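
(The knob behind that group size is vm.page-cluster, which is a power of
two: the default of 3 gives 2^3 = 8 pages per group, and 4 gives 16.  As
a sketch:

# sysctl vm.page-cluster        <- show the current value (default 3)
# sysctl -w vm.page-cluster=4   <- 16 pages, i.e. 64 KiB, per group
)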


Tho how much RAM /do/ you have, and are you sure you really need swap at
all?  Many systems today have enough RAM that they don't really need swap
(at least as swap, see below), unless they're going to be used for
something extremely memory intensive, where the much lower speed of swap
isn't a problem.

If you have 8 GiB of RAM or more, this may well be your situation.  With
4 GiB, you probably have more than enough RAM for normal operation, but
it may still be useful to have at least some swap, so Linux can keep more
recently used files cached while swapping out some seldom used
application RAM, but by 8 GiB you likely have enough RAM for reasonable
cache AND all your apps and won't actually use swap much at all.

Tho if you frequently edit GiB+ video files and/or work with many virtual
machines, 8 GiB RAM will likely be actually used, and 16 GiB may be the
point at which you don't use swap much at all.  And of course if you are
using LOTS of VMs or doing heavy 4K video editing, 16 GiB or more may
well still be in heavy use, but with that kind of memory-intensive usage,
32 GiB of RAM or more would likely be a good investment.

Anyway, for systems with enough memory to not need swap in /normal/
circumstances, in the event that something's actually leaking memory
badly enough that swap is needed, there's a very good chance that you'll
never outrun the leak with swap anyway, as if it's really leaking gigs of
memory, it'll just eat up whatever gigs of swap you throw at it as well
and /still/ run out of memory.

Meanwhile, swap to spinning rust really is /slow/.  You're talking 16 GiB
of swap, and spinning rust speeds of 50 MiB/sec for swap isn't unusual.
That's ~20 seconds worth of swap-thrashing waiting per GiB, ~320 seconds
or over five minutes worth of swap thrashing to use the full 16 GiB.  OK,
so you take that priority= idea and raid0 over two devices, it'll still
be ~2:40 worth of waiting, to fully use that swap.  Is 16 GiB of swap
/really/ both needed and worth that sort of wait if you do actually use
it?

Tho again, if you're running a half dozen VMs and only actually use a
couple of them once or twice a day, having enough swap to let them swap
out the rest of the day, so the memory they took can be used for more
frequently accessed applications and cached files, can be useful.  But
that's a somewhat limited use-case.


So swap, for its original use as slow memory at least, really isn't that
much used any longer, tho it can still be quite useful in specific use-
cases.

I would tend to disagree here.  Using the default settings under Linux, 
it isn't used much, but there are many people (myself included), who 
turn off memory over-commit, and thus need reasonable amounts of swap 
space.  Many programs will allocate huge chunks of memory that they 
never need or even touch, either 'just in case', or because they want to 
manage their own memory usage.  To account for this, Linux has a knob 
for the virtual memory subsystem that controls how it handles 
allocations beyond the system's effective 
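
(The knobs being described are presumably vm.overcommit_memory and
vm.overcommit_ratio; a hedged sketch of the strict-accounting setup:

# sysctl -w vm.overcommit_memory=2    <- refuse allocations past the limit
# sysctl -w vm.overcommit_ratio=100   <- commit limit = swap + 100% of RAM

With overcommit off, swap largely serves as backing for allocations that
are never actually touched, which is the use case described above.)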

Re: Possible Raid Bug

2016-03-28 Thread Patrik Lundquist
On 28 March 2016 at 05:54, Anand Jain  wrote:
>
> On 03/26/2016 07:51 PM, Patrik Lundquist wrote:
>>
>> # btrfs device stats /mnt
>>
>> [/dev/sde].write_io_errs   11
>> [/dev/sde].read_io_errs0
>> [/dev/sde].flush_io_errs   2
>> [/dev/sde].corruption_errs 0
>> [/dev/sde].generation_errs 0
>>
>> The old counters are back. That's good, but wtf?
>
>
>  No. I doubt they are old counters. The steps above didn't
>  show old error counts, but since you have created a file
>  test3, there will be some write_io_errors, which we don't
>  see after the balance. So I doubt they are old counters;
>  instead they are new flush errors.

No, /mnt/test3 doesn't generate errors, only 'single' block groups.
The old counters seem to be cached somewhere and replace doesn't reset
them everywhere.

One more time with more device stats and I've upgraded the kernel to
Linux debian 4.5.0-trunk-amd64 #1 SMP Debian 4.5-1~exp1 (2016-03-20)
x86_64 GNU/Linux

# mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# mount /dev/sdb /mnt; dmesg | tail
# touch /mnt/test1; sync; btrfs device usage /mnt

Only raid10 profiles.

# echo 1 >/sys/block/sde/device/delete; dmesg | tail

[  426.831037] sd 5:0:0:0: [sde] Synchronizing SCSI cache
[  426.831517] sd 5:0:0:0: [sde] Stopping disk
[  426.845199] ata6.00: disabled

We lost a disk.

# touch /mnt/test2; sync; dmesg | tail

[  467.126471] BTRFS error (device sde): bdev /dev/sde errs: wr 1, rd
0, flush 0, corrupt 0, gen 0
[  467.127386] BTRFS error (device sde): bdev /dev/sde errs: wr 2, rd
0, flush 0, corrupt 0, gen 0
[  467.128125] BTRFS error (device sde): bdev /dev/sde errs: wr 3, rd
0, flush 0, corrupt 0, gen 0
[  467.128640] BTRFS error (device sde): bdev /dev/sde errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[  467.129215] BTRFS error (device sde): bdev /dev/sde errs: wr 4, rd
0, flush 1, corrupt 0, gen 0
[  467.129331] BTRFS warning (device sde): lost page write due to IO
error on /dev/sde
[  467.129334] BTRFS error (device sde): bdev /dev/sde errs: wr 5, rd
0, flush 1, corrupt 0, gen 0
[  467.129420] BTRFS warning (device sde): lost page write due to IO
error on /dev/sde
[  467.129422] BTRFS error (device sde): bdev /dev/sde errs: wr 6, rd
0, flush 1, corrupt 0, gen 0

We've got write errors on the lost disk.

# btrfs device usage /mnt

No 'single' profiles because we haven't remounted yet.

# btrfs device stat /mnt

[/dev/sde].write_io_errs   6
[/dev/sde].read_io_errs0
[/dev/sde].flush_io_errs   1
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0

# reboot
# wipefs -a /dev/sde; reboot

# mount -o degraded /dev/sdb /mnt; dmesg | tail

[   52.876897] BTRFS info (device sdb): allowing degraded mounts
[   52.876901] BTRFS info (device sdb): disk space caching is enabled
[   52.876902] BTRFS: has skinny extents
[   52.878008] BTRFS warning (device sdb): devid 4 uuid
231d7892-3f31-40b5-8dff-baf8fec1a8aa is missing
[   52.879057] BTRFS info (device sdb): bdev (null) errs: wr 6, rd 0,
flush 1, corrupt 0, gen 0

# btrfs device usage /mnt

Still only raid10 profiles.

# btrfs device stat /mnt

[(null)].write_io_errs   6
[(null)].read_io_errs0
[(null)].flush_io_errs   1
[(null)].corruption_errs 0
[(null)].generation_errs 0

/dev/sde is now called "(null)". Print device id instead? E.g.
"[devid:4].write_io_errs   6"

# touch /mnt/test3; sync; btrfs device usage /mnt
/dev/sdb, ID: 1
   Device size: 2.00GiB
   Data,single:   624.00MiB
   Data,RAID10:   102.38MiB
   Metadata,RAID10:   102.38MiB
   System,RAID10:   4.00MiB
   Unallocated: 1.19GiB

/dev/sdc, ID: 2
   Device size: 2.00GiB
   Data,RAID10:   102.38MiB
   Metadata,RAID10:   102.38MiB
   System,single:  32.00MiB
   System,RAID10:   4.00MiB
   Unallocated: 1.76GiB

/dev/sdd, ID: 3
   Device size: 2.00GiB
   Data,RAID10:   102.38MiB
   Metadata,single:   256.00MiB
   Metadata,RAID10:   102.38MiB
   System,RAID10:   4.00MiB
   Unallocated: 1.55GiB

missing, ID: 4
   Device size:   0.00B
   Data,RAID10:   102.38MiB
   Metadata,RAID10:   102.38MiB
   System,RAID10:   4.00MiB
   Unallocated: 1.80GiB

Now we've got 'single' profiles on all devices except the missing one.
Replace missing device before unmount or get stuck with a read-only
filesystem.

# btrfs device stat /mnt

Same as before. Only old errors on the missing device.

# btrfs replace start -B 4 /dev/sde /mnt; dmesg | tail

[ 1268.598652] BTRFS info (device sdb): dev_replace from <missing disk> (devid 4) to /dev/sde started
[ 1268.615601] BTRFS info (device sdb): dev_replace from <missing disk> (devid 4) to /dev/sde finished

# btrfs device stats /mnt

[/dev/sde].write_io_errs   0
[/dev/sde].read_io_errs0
[/dev/sde].flush_io_errs   0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0

Device "(null)" is back to /dev/sde and the error counts 

Re: csum errors in VirtualBox VDI files

2016-03-28 Thread Kai Krakow
On Sun, 27 Mar 2016 13:04:25 -0600, Chris Murphy wrote:

> As for the csum errors with this one single VDI file, you're going to
> have to come up with a way to reproduce it consistently. You'll need
> to have a good copy on a filesystem that comes up clean with btrfs
> check and scrub. And then reproduce the corruption somehow. One hint
> based on the other two users with similar setups or workload is they
> aren't using the discard mount option and you are. I'd say unless you
> have a newer SSD that supports queued trim, it probably shouldn't be
> used, it's known to cause the kinds of hangs you report with drives
> that only support non-queued trim. Those drives are better off getting
> fstrim e.g. once a week on a timer.
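
(On distros that ship util-linux's fstrim.timer, that weekly setup is a
one-liner -- a sketch, not something from this thread:

# systemctl enable --now fstrim.timer
)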

Let's get back to the csum errors later - that's only on the main drive
which has other corruptions, too, as I found out. So the csum errors
may well be a side effect.

I'm currently trying to fix the remaining problems of the backup drive
which uses no discard at all (it's no SSD).

I'd like to help you out of the confusion, here's the output of:

$ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
NAME        MODEL            FSTYPE LABEL      MOUNTPOINT
sda         Crucial_CT128MX1
├─sda1                       vfat   ESP        /boot
├─sda2
└─sda3                       bcache
  ├─bcache0                  btrfs  system
  ├─bcache1                  btrfs  system
  └─bcache2                  btrfs  system     /
sdb         SAMSUNG HD103SJ
├─sdb1                       swap   swap0      [SWAP]
└─sdb2                       bcache
  └─bcache2                  btrfs  system     /
sdc         SAMSUNG HD103SJ
├─sdc1                       swap   swap1      [SWAP]
└─sdc2                       bcache
  └─bcache0                  btrfs  system
sdd         SAMSUNG HD103UJ
├─sdd1                       swap   swap2      [SWAP]
└─sdd2                       bcache
  └─bcache1                  btrfs  system
sde         003-9VT166
└─sde1                       btrfs  usb-backup

(the mountpoint is pretty bogus due to multiple subvolumes, so I
corrected it)

BTW: This discard option ran smooth for the last 12 months or so
(apparently, the SSD drive is soon to die - smartctl lifetime counter
is almost used up, bcache + btrfs can be pretty stressful I think). I'm
not even sure if btrfs mount option "discard" has any effect at all if
it is mounted through bcache.

BTW2: Even fstrim will issue queued trim if it is supported by the
drive and will get you in the same trouble. It needs to be disabled in
the kernel, according to [1]. Side note: I have model MX100 with
firmware update applied which is supposed to fix the problem. I never
experienced the libata fault messages in dmesg.

[1]:
http://forums.crucial.com/t5/Crucial-SSDs/M500-M5x0-QUEUED-TRIM-data-corruption-alert-mostly-for-Linux/td-p/151028

-- 
Regards,
Kai

Replies to list-only preferred.




Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-03-28 Thread Duncan
James Johnston posted on Mon, 28 Mar 2016 04:41:24 + as excerpted:

> After puzzling over the btrfs failure I reported here a week ago, I
> think there is a bad incompatibility between compression and RAID-1
> (maybe other RAID levels too?).  I think it is unsafe for users to use
> compression, at least with multiple devices until this is
> fixed/investigated further.  That seems like a drastic claim, but I know
> I will not be using it for now.  Otherwise, checksum errors scattered
> across multiple devices that *should* be recoverable will render the
> file system unusable, even to read data from.  (One alternative
> hypothesis might be that defragmentation causes the issue, since I used
> defragment to compress existing files.)
> 
> I finally was able to simplify this to a hopefully easy to reproduce
> test case, described in lengthier detail below.  In summary, suppose we
> start with an uncompressed btrfs file system on only one disk containing
> the root file system,
> such as created by a clean install of a Linux distribution.  I then:
> (1) enable compress=lzo in fstab, reboot, and then defragment the disk
> to compress all the existing files, (2) add a second drive to the array
> and balance for RAID-1, (3) reboot for good measure, (4) cause a high
> level of I/O errors, such as hot-removal of the second drive, OR simply
> a high level of bit rot (i.e. use dd to corrupt most of the disk, while
> either mounted or unmounted). This is guaranteed to cause the kernel to
> crash.
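
(In command form, the sequence described above comes down to roughly the
following sketch -- device names are hypothetical, and the last step is
destructive:

# mount -o remount,compress=lzo /
# btrfs filesystem defragment -r -clzo /
# btrfs device add /dev/sdb /
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /
# dd if=/dev/zero of=/dev/sdb    <- or hot-remove the second device
)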

Described that way, my own experience confirms your tests, except that 
(1) I hadn't tested the no-compression case to know it was any different, 
and (2) in my case I was actually using btrfs raid1 mode and scrub to be 
able to continue to deal with a failing ssd out of a pair, for quite some 
while after I would have ordinarily had to replace it were I not using 
something like btrfs raid1 with checksummed file integrity and scrubbing 
errors with replacements from the good device.

Here's how it worked for me and why I ultimately agree with your 
conclusions, at least regarding compressed raid1 mode crashes due to too 
many failed checksum failures (since I have no reference to agree or 
disagree with the uncompressed case).

As I said above, I had one ssd failing, but was taking the opportunity 
while I had it to watch its behavior deeper into the failure than I 
normally would, and while I was at it, get familiar enough with btrfs 
scrub to repair errors that it became just another routine command for me 
(to the point that I even scripted up a custom scrub command complete 
with my normally used options, etc).  On the relatively small (largest 
was 24 GiB per device, paired device btrfs raid1) multiple btrfs on 
partitions on the two devices scrub was normally under a minute to run 
even when doing quite a few repairs, so it wasn't as if it was taking me 
the hours to days it can take at TB scale on spinning rust.

The failure mode of this particular ssd was premature failure of more and 
more sectors, about 3 MiB worth over several months based on the raw 
count of reallocated sectors in smartctl -A, but using scrub to rewrite 
them from the good device would normally work, forcing the firmware to 
remap that sector to one of the spares as scrub corrected the problem.

One not immediately intuitive thing I found with scrub, BTW, was that if 
it finished with unverified errors, I needed to rerun scrub again to do 
further repairs.  I've since confirmed with someone who can read code (I 
sort of do but more at the admin playing with patches level than the dev 
level) that my guess at the reason behind this behavior was correct.  
When a metadata node fails checksum verification and is repaired, the 
checksums that it in turn contained cannot be verified in that pass and 
show up as unverified errors.  A repeated scrub once those errors are 
fixed can verify and fix if necessary those additional nodes, and 
occasionally up to three or four runs were necessary to fully verify and 
repair all blocks, eliminating all unverified errors, at which point 
further scrubs found no further errors.
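
In shell terms that boils down to something like this sketch (mount point
hypothetical; -B waits for completion and -R prints the raw statistics,
including the unverified error count):

# while btrfs scrub start -BR /home | grep -q 'unverified_errors: [1-9]'; \
  do :; done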

It occurred to me as I write this, that the problem I saw and you have 
confirmed with testing and now reported, may actually be related to some 
interaction between these unverified errors and compressed blocks.

Anyway, as it happens, my / filesystem is normally mounted ro except 
during updates and by the end I was scrubbing after updates, and even 
after extended power-downs, so it generally had only a few errors.

But /home (on an entirely separate filesystem, but a filesystem still on 
a pair of partitions, one on each of the same two ssds) would often have 
more, and because I have a particular program that I start with my X and 
KDE session that reads a bunch of files into cache as it starts up, I had 
a systemd service configured to start at boot and cat all the files in 
that particular directory to /dev/null, thus caching 

bad metadata crossing stripe boundary (was: csum errors in VirtualBox VDI files)

2016-03-28 Thread Kai Krakow
Changing subject to reflect the current topic...

On Sun, 27 Mar 2016 21:55:40 +0800, Qu Wenruo wrote:

> > I finally got copy data:
> >
> > # before mounting let's check the FS:
> >
> > $ sudo btrfsck /dev/disk/by-label/usb-backup
> > Checking filesystem on /dev/disk/by-label/usb-backup
> > UUID: 1318ec21-c421-4e36-a44a-7be3d41f9c3f
> > checking extents
> > bad metadata [156041216, 156057600) crossing stripe boundary
> > bad metadata [181403648, 181420032) crossing stripe boundary
> > bad metadata [392167424, 392183808) crossing stripe boundary
> > bad metadata [783482880, 783499264) crossing stripe boundary
> > bad metadata [784924672, 784941056) crossing stripe boundary
> > bad metadata [130151612416, 130151628800) crossing stripe boundary
> > bad metadata [162826813440, 162826829824) crossing stripe boundary
> > bad metadata [162927083520, 162927099904) crossing stripe boundary
> > bad metadata [619740659712, 619740676096) crossing stripe boundary
> > bad metadata [619781947392, 619781963776) crossing stripe boundary
> > bad metadata [619795644416, 619795660800) crossing stripe boundary
> > bad metadata [619816091648, 619816108032) crossing stripe boundary
> > bad metadata [620011388928, 620011405312) crossing stripe boundary
> > bad metadata [890992459776, 890992476160) crossing stripe boundary
> > bad metadata [891022737408, 891022753792) crossing stripe boundary
> > bad metadata [891101773824, 891101790208) crossing stripe boundary
> > bad metadata [891301199872, 891301216256) crossing stripe boundary
> > bad metadata [1012219314176, 1012219330560) crossing stripe boundary
> > bad metadata [1017202409472, 1017202425856) crossing stripe boundary
> > bad metadata [1017365397504, 1017365413888) crossing stripe boundary
> > bad metadata [1020764422144, 1020764438528) crossing stripe boundary
> > bad metadata [1251103342592, 1251103358976) crossing stripe boundary
> > bad metadata [1251144695808, 1251144712192) crossing stripe boundary
> > bad metadata [1251147055104, 1251147071488) crossing stripe boundary
> > bad metadata [1259271225344, 1259271241728) crossing stripe boundary
> > bad metadata [1266223611904, 1266223628288) crossing stripe boundary
> > bad metadata [1304750063616, 1304750080000) crossing stripe boundary
> > bad metadata [1304790106112, 1304790122496) crossing stripe boundary
> > bad metadata [1304850792448, 1304850808832) crossing stripe boundary
> > bad metadata [1304869928960, 1304869945344) crossing stripe boundary
> > bad metadata [1305089540096, 1305089556480) crossing stripe boundary
> > bad metadata [1309561651200, 1309561667584) crossing stripe boundary
> > bad metadata [1309581443072, 1309581459456) crossing stripe boundary
> > bad metadata [1309583671296, 1309583687680) crossing stripe boundary
> > bad metadata [1309942808576, 1309942824960) crossing stripe boundary
> > bad metadata [1310050549760, 1310050566144) crossing stripe boundary
> > bad metadata [1313031585792, 1313031602176) crossing stripe boundary
> > bad metadata [1313232912384, 1313232928768) crossing stripe boundary
> > bad metadata [1555210764288, 1555210780672) crossing stripe boundary
> > bad metadata [1555395182592, 1555395198976) crossing stripe boundary
> > bad metadata [2050576744448, 2050576760832) crossing stripe boundary
> > bad metadata [2050803957760, 2050803974144) crossing stripe boundary
> > bad metadata [2050969108480, 2050969124864) crossing stripe
> > boundary  
> 
> Already mentioned in another reply, this *seems* to be a false alert.
> Latest btrfs-progs would help.

No, btrfs-progs 4.5 reports those, too (as far as I understood, this
includes the fixes for bogus "bad metadata" errors, tho I thought that
had already been fixed in 4.2.1; I used 4.4.1 before). There were some
"nbytes wrong" errors before, which I already repaired using "--repair".
I think that's okay; I had those in the past and it looks like btrfsck
can repair them now (so I don't have to delete and recreate the files).
They caused problems with "du" and "df" in the past, a problem that I'm
currently facing too, so it was better to fix them.

With that done, the backup fs now only reports "bad metadata" which
have been there before space cache v2. Full output below.

> > checking free space tree cache and super generation don't match,
> > space cache will be invalidated checking fs roots  
> Err, I found a missing '\n' before "checking fs roots".

Copy and paste problem. Claws Mail pretends to be smarter than me;
I missed fixing that one. ;-)

> And it seems that fs roots and extent tree are all OK.
> 
> Quite surprising.
> The only possible problem seems to be outdated space cache.
> 
> Maybe mount with "-o clear_cache" will help, but I don't think that's 
> the cause.

Helped, it automatically reverted the FS back to space cache v1 with
incompat flag cleared. (I wouldn't have enabled v2 if it wasn't
documented that this is possible)
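
(For reference, that one-shot reset was just a mount with the option set,
mount point assumed:

# mount -o clear_cache /dev/disk/by-label/usb-backup /mnt
)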

> > checking csums
> > checking root refs
> > found 1860217443214 bytes used err 

Re: Raid 0 setup doubt.

2016-03-28 Thread Duncan
James Johnston posted on Mon, 28 Mar 2016 05:26:56 + as excerpted:

> For me, I use swap on an SSD, which is orders of magnitude faster than
> HDD.
> Swap can still be useful on an SSD and can really close the gap between
> RAM speeds and swap speeds.  (The original poster would do well to use
> one.)

FWIW, swap on ssd is an entirely different beast, and /can/ still make 
quite a lot of sense.  I'll absolutely agree with you there.

However, this wasn't about swap on ssd, it was about swap on hdd, and the 
post was already long enough, without adding in the quite different 
discussion of swap on ssd.  My posts already tend to be longer than most, 
and I have to pick /somewhere/ to draw the line.  This was simply the 
"somewhere" that I drew it in this case.

So thanks for raising the issue and filling in the missing pieces.  I 
think we agree, in general, about swap on ssd.

That said, here for example is a bit of why I ask the question, ssd or no 
ssd (spacing on free shrunk a bit for posting):

$ uptime
 00:07:50 up 11:28,  2 users,  load average: 0.04, 0.43, 1.01

$ free -m
              total   used    free  shared  buff/cache  available
Mem:          16073    725   12632    1231        2715      13961
Swap:             0      0       0

16 GiB RAM, 12.5 GiB entirely free even with cache and buffers taking a 
bit under 3 GiB of RAM.  That's in kde/plasma5, after nearly 12 hours 
uptime.  (Tho I am running gentoo with more stuff turned off at build-
time than will be the case on most general-purpose binary distros, where 
lots of libraries that most people won't use are linked in for the sake 
of the few that will use them.  Significantly, I also have baloo turned 
off at build time, which still unfortunately requires some trivial 
patching on gentoo/kde, and stay /well/ clear of anything kdepim/akonadi 
related as both too bloated and far too unstable to handle my mail, 
etc.)  Triple full-hd 1080 monitors.

OK, startup firefox playing a full-screen 1080p video and let it run a 
bit... about half a GiB initial difference, 1.2 GiB used, only about 12 
GiB free, then up another 200 MiB used in a few minutes.

Now this is gentoo and it's my build machine.  It's only a six-core so I 
don't go hog-wild with the parallel builds, but portage is pointed at a 
tmpfs for its temporary build environment and my normal build settings 
allow 12 builds at a time, upto a load-average of 6, and each of those 
builds is set for upto 10 parallel jobs to a load average of 8 (thus 
encouraging parallelism at the individual package level first, and only 
where that doesn't utilize all cores does it load more packages to build 
in parallel).  I sometimes see upto 9 packages building at once and 
sometimes a 1-minute load of 10 or higher when build process that are 
already setup push it above the configured load-average of 8.
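
(In make.conf terms, settings matching that description would look roughly
like this -- a sketch, not necessarily the exact configuration:

EMERGE_DEFAULT_OPTS="--jobs=12 --load-average=6"
MAKEOPTS="-j10 -l8"
)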

I don't run any VMs (but for an old DOS game in DOSSHELL, which qualifies 
as a VM, but from an age when machines with memory in the double-digit 
MiB were high-dollar, so it hardly counts), I keep / mounted ro except 
when I'm updating it, and the partition with all the build trees, 
sources, ccache and binpkgs is kept unmounted as well when I'm not using 
it.  Further, my media partition is unmounted by default as well.

But even during a build, I seldom use up all memory and start actually 
dumping cache, so which is when stuff would start getting pushed to swap 
as well if I had it, so I don't bother.

Back on my old machine I had 8 GiB RAM and swap, with swappiness[1] set 
to 100, I'd occasionally see a few hundred MB in swap, but seldom over a 
gig.  That was with a four-device spinning-rust mdraid1 setup, with swap 
similarly set to 4-way-striped via equal swap priority, but that machine 
was an old original dual-socket 3-digit opteron maxed out with dual-core 
Opteron 290s, so 2x2=4-core, and I had it accordingly a bit more limited 
in terms of parallel build jobs.
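
(The knob is vm.swappiness; setting it that aggressively is just:

# sysctl -w vm.swappiness=100

which biases reclaim toward swapping out idle application pages instead of
dropping page cache.)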

These days the main system is on dual ssds partitioned up in parallel, 
running multiple separate btrfs-raid1s on the pairs of partitions, one on 
each of the ssds.  Only media and backups is still on spinning rust, but 
given those numbers and the fact that suspend-to-ram works well on this 
machine and I never even tried suspend-to-disk, I just didn't see the 
point of setting up swap.

When I upgraded to the new machine, given the 6-core instead of 4-core, I 
decided I wanted more memory as well.  But altho 16 GiB is the next power-
of-two above the 8 GiB I was running (actually only 6 GiB by the time I 
upgraded, as a stick had died that I hadn't replaced) and I got 16 GiB 
for that reason, 12 GiB would have actually been plenty, and would have 
served my generally don't dump cache rule pretty well.  

That became even more the case when I upgraded to SSDs shortly 
thereafter, as recaching on ssd isn't the big deal it was with spinning 
rust, where I really did hate to reboot and lose all that cache that I'd 
have to read off of slow spinning 

Re: btrfs filesystem du - Failed to lookup root id - Inappropriate ioctl for device

2016-03-28 Thread Alexander Fougner
2016-03-27 22:26 GMT+02:00 Peter Becker :
> Hi, I found the described error if I execute du with btrfs-progs
> v4.5 on kernel v4.5.
>
> floyd@nas ~ $ sudo btrfs version
> btrfs-progs v4.5
>
> floyd@nas ~ $ uname -r
> 4.5.0-040500-generic
>
> floyd@nas ~ $ sudo btrfs fi show
> Label: 'RAID'  uuid: 3247737b-87f9-4e8c-8db3-2beed50fb104
> Total devices 4 FS bytes used 3.71TiB
> devid1 size 2.73TiB used 1.40TiB path /dev/sdf
> devid2 size 2.73TiB used 1.40TiB path /dev/sdb
> devid3 size 2.73TiB used 1.40TiB path /dev/sde
> devid4 size 4.55TiB used 3.22TiB path /dev/sdd
>
> Label: 'BACKUP'  uuid: 35e6ff5f-2612-4ef2-9cdb-07b3ccd0f517
> Total devices 1 FS bytes used 27.29GiB
> devid1 size 1.36TiB used 29.06GiB path /dev/sdc
>
> floyd@nas ~ $ sudo btrfs subv list -o /media/RAID/
> ID 258 gen 214551 top level 5 path apps
> ID 259 gen 214722 top level 5 path downloads
> ID 260 gen 214711 top level 5 path filme
> ID 261 gen 214553 top level 5 path misc
> ID 262 gen 214722 top level 5 path musik
> ID 263 gen 214555 top level 5 path owncloud
> ID 264 gen 214556 top level 5 path serien
>
> -- now the error --
>
> floyd@nas ~ $ sudo btrfs fi du /media/RAID/
>  Total   Exclusive  Set shared  Filename
>  500.00KiB   0.00B   -  /media/RAID//apps/Drive Snapshot
>  167.96MiB   0.00B   -  /media/RAID//apps/vmware
>  .. [46 rows] ..
> ERROR: cannot check space of '/media/RAID/': Unknown error -1
>
> floyd@nas ~ $ sudo btrfs fi du /media/RAID/apps
>  Total   Exclusive  Set shared  Filename
>  500.00KiB   0.00B   -  /media/RAID//apps/Drive Snapshot
>  167.96MiB   0.00B   -  /media/RAID//apps/vmware
>  .. [46 rows] ..
> ERROR: cannot check space of '/media/RAID/': Unknown error -1
>
> floyd@nas ~ $ sudo btrfs fi du /media/RAID/musik
> ERROR: cannot check space of '/media/RAID/': Unknown error -1
>

Fixed in btrfs-progs devel branch. Will probably be released as 4.5.1 soon.

> but for subvolume "downloads" it finished without error
>
> floyd@nas ~ $ sudo btrfs fi du /media/RAID/
>  Total   Exclusive  Set shared  Filename
>  500.00KiB   0.00B   -  /media/RAID//apps/Drive Snapshot
>  167.96MiB   0.00B   -  /media/RAID//apps/vmware
>  ..
>
> -- more details --
>
> floyd@nas ~ $ sudo btrfs inspect-internal rootid /media/RAID/
> 5
>
> floyd@nas ~ $ sudo btrfs inspect-internal rootid /media/RAID/apps/
> 258
>
> floyd@nas ~ $ sudo btrfs inspect-internal dump-super
> /media/RAID/apps/Drive\ Snapshot/snapshot.exe
> superblock: bytenr=65536, device=/media/RAID/apps/Drive Snapshot/snapshot.exe
> -
> ERROR: bad magic on superblock on /media/RAID/apps/Drive
> Snapshot/snapshot.exe at 65536

dump-super takes a device as argument, not mountpoints.
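
For example, against one of the devices from the fi show output above:

# btrfs inspect-internal dump-super /dev/sdf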
