Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Goffredo Baroncelli
On 2017-08-03 06:02, Duncan wrote:
> Liu Bo posted on Wed, 02 Aug 2017 14:27:21 -0600 as excerpted:
> 
> It is correct reading this as: all data is written two times ?
> 
> If as is being discussed the log is mirrored by default that'd be three 
> times...

And for raid6 you need to do it 4 times... (!)

> Parity-raid is slow and of course normally has the infamous write hole 
> this patch set is trying to close.  Yes, closing the write hole is 
> possible, but for sure it's going to make the performance bite of parity-
> raid even worse. =:^(

This is the reason for looking for possible optimizations from the beginning: a 
full-stripe write (datacow only) doesn't require logging at all. This could 
be a big optimization (if you need to write a lot of data, only the head and 
tail are NOT full stripes). However, this requires knowing whether the data is 
[no]cow at the time it is logged, and I think that is not so simple: possible, 
but not simple.

> 
> Or are logged only the stripes involved by a RMW cycle (i.e. if a
> stripe is fully written, the log is bypassed )?

 For data, only data in bios from high level will be logged, while for
 parity, the whole parity will be logged.

 Full stripe write still logs all data and parity, as full stripe
 write may not survive from unclean shutdown.
>>>
>>> Does this matter ? Due to the COW nature of BTRFS if a transaction is
>>> interrupted (by an unclean shutdown) the transaction data are all lost.
>>> Am I missing something ?
>>>
>>> What I want to understand, is if it is possible to log only the
>>> "partial stripe"  RMW cycle.
>>>
>>>
>> I think your point is valid if all data is written with datacow.  In
>> case of nodatacow, btrfs does overwrite in place, so a full stripe write
>> may pollute on-disk data after unclean shutdown.  Checksum can detect
>> errors but repair thru raid5 may not recover the correct data.
> 
> But nodatacow doesn't have checksum...

True, but Liu is correct in stating that a "nocow" write is not protected by a 
transaction.
The funny part is that in the raid5 case we need to duplicate the written data 
for nocow, while for cow it would be possible to avoid the duplication (in 
the full-stripe case)!


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Duncan
Liu Bo posted on Wed, 02 Aug 2017 14:27:21 -0600 as excerpted:

>> >> It is correct reading this as: all data is written two times ?

If as is being discussed the log is mirrored by default that'd be three 
times...

Parity-raid is slow and of course normally has the infamous write hole 
this patch set is trying to close.  Yes, closing the write hole is 
possible, but for sure it's going to make the performance bite of parity-
raid even worse. =:^(

>> >> Or are logged only the stripes involved by a RMW cycle (i.e. if a
>> >> stripe is fully written, the log is bypassed )?
>> > 
>> > For data, only data in bios from high level will be logged, while for
>> > parity, the whole parity will be logged.
>> > 
>> > Full stripe write still logs all data and parity, as full stripe
>> > write may not survive from unclean shutdown.
>> 
>> Does this matter ? Due to the COW nature of BTRFS if a transaction is
>> interrupted (by an unclean shutdown) the transaction data are all lost.
>> Am I missing something ?
>> 
>> What I want to understand, is if it is possible to log only the
>> "partial stripe"  RMW cycle.
>>
>>
> I think your point is valid if all data is written with datacow.  In
> case of nodatacow, btrfs does overwrite in place, so a full stripe write
> may pollute on-disk data after unclean shutdown.  Checksum can detect
> errors but repair thru raid5 may not recover the correct data.

But nodatacow doesn't have checksum...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Massive loss of disk space

2017-08-02 Thread Duncan
Goffredo Baroncelli posted on Wed, 02 Aug 2017 19:52:30 +0200 as
excerpted:

> it seems that BTRFS always allocate the maximum space required, without
> consider the one already allocated. Is it too conservative ? I think no:
> consider the following scenario:
> 
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
> 
> after b), the expectation is that c) always succeed [1]: i.e. there is
> enough space on the filesystem. Due to the COW nature of BTRFS, you
> cannot rely on the already allocated space because there could be a
> small time window where both the old and the new data exists on the
> disk.

Not only a small time, perhaps (effectively) permanently, due to either 
of two factors:

1) If the existing extents are reflinked by snapshots or other files they 
obviously won't be released at all when the overwrite is completed.  
fallocate must account for this possibility, and behaving differently in 
the context of other reflinks would be confusing, so the best policy is 
consistently behave as if the existing data will not be freed.

2) As the devs have commented a number of times, an extent isn't freed if 
there's still a reflink to part of it.  If the original extent was a full 
1 GiB data chunk (the chunk being the max size of a native btrfs extent, 
one of the reasons a balance and defrag after conversion from ext4 and 
deletion of the ext4-saved subvolume is recommended, to break up the 
longer ext4 extents so they won't cause btrfs problems later) and all but 
a single 4 KiB block has been rewritten, the full 1 GiB extent will 
remain referenced and continue to take that original full 1 GiB space, 
*plus* the space of all the new-version extents of the overwritten data, 
of course.

So in our fallocate and overwrite scenario, we again must reserve space 
for two copies of the data, the original which may well not be freed even 
without other reflinks, if a single 4 KiB block of an extent remains 
unoverwritten, and the new version of the data.

At least that /was/ the behavior explained on-list previous to the hole-
punching changes.  I'm not a dev and haven't seen a dev comment on 
whether that remains the behavior after hole-punching, which may at least 
naively be expected to automatically handle and free overwritten data 
using hole-punching, or not.  I'd be interested in seeing someone who can 
read the code confirm one way or the other whether hole-punching changed 
that previous behavior, or not.
 
> My opinion is that in general this behavior is correct due to the COW
> nature of BTRFS.
> The only exception that I can find, is about the "nocow" file. For these
> cases, taking into account the already allocated space would be better.

I'd say it's dangerously optimistic even then, considering that "nocow" 
is actually "cow1" in the presence of snapshots.


Meanwhile, it's worth keeping in mind that it's exactly these sorts of 
corner-cases that are why btrfs is taking so long to stabilize.  
Supposedly "simple" expectations aren't always so simple, and if a 
filesystem gets it wrong, it's somebody's data hanging in the balance!  
(Tho if they've any wisdom at all, they'll ensure they're aware of the 
stability status of a filesystem before they put data on it, and will 
adjust their backup policies accordingly if they're using a still not 
fully stabilized filesystem such as btrfs, so the data won't actually be 
in any danger anyway unless it was literally throw-away value, only 
whatever specific instance of it was involved in that corner-case.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Chris Murphy
On Wed, Aug 2, 2017 at 2:38 AM, Brendan Hide  wrote:
> The title seems alarmist to me - and I suspect it is going to be
> misconstrued. :-/

Josef pushed back on the HN thread with very sound reasoning about why
this is totally unsurprising. RHEL runs old kernels, and they have no
upstream Btrfs developers, so it's a huge PITA to backport the tons of
changes Btrfs has been going through (thousands of lines changed per
kernel cycle).

What's more interesting to me is whether this means
-  CONFIG_BTRFS_FS=m
+  # CONFIG_BTRFS_FS is not set

In particular in elrepo.org kernels.

Also more interesting is this Stratis project that started up a few months ago:

https://github.com/stratis-storage/stratisd

Which also includes this design document:
https://stratis-storage.github.io/StratisSoftwareDesign.pdf

Basically they're creating a file system manager manifesting as a
daemon, new CLI tools, and new metadata formats for the volume
manager. So it's going to use existing device mapper, md, some LVM
stuff, XFS, in a layered approach abstracted from the user.

-- 
Chris Murphy


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Fajar A. Nugraha
On Thu, Aug 3, 2017 at 1:44 AM, Chris Mason  wrote:
>
> On 08/02/2017 04:38 AM, Brendan Hide wrote:
>>
>> The title seems alarmist to me - and I suspect it is going to be 
>> misconstrued. :-/
>
>
> Supporting any filesystem is a huge amount of work.  I don't have a problem 
> with Redhat or any distro picking and choosing the projects they want to 
> support.
>

It'd help a lot of people if pages like
https://btrfs.wiki.kernel.org/index.php/Status were kept up-to-date and
'promoted', so at least users are better informed about what they're
getting into and can choose which features (stable / still in dev / likely
to destroy your data) they want to use.

For example, https://btrfs.wiki.kernel.org/index.php/Status says
compression is 'mostly OK' ('auto-repair and compression may crash'
looks pretty scary, as from a newcomer's perspective it might be
interpreted as 'potential data loss'), while
https://en.opensuse.org/SDB:BTRFS#Compressed_btrfs_filesystems says
they support compression on newer openSUSE versions.


>
> At least inside of FB, our own internal btrfs usage is continuing to grow.  
> Btrfs is becoming a big part of how we ship containers and other workloads 
> where snapshots improve performance.
>

Ubuntu also supports btrfs as part of their container implementation
(lxd), and (reading the lxd mailing list) some people use lxd+btrfs in
their production environments. IIRC the last btrfs problem posted on the
lxd list was about how 'btrfs send/receive (used by lxd copy) is slower
than rsync for a full/initial copy'.

-- 
Fajar


[PATCH] Btrfs: search parity device wisely

2017-08-02 Thread Liu Bo
After mapping the block with BTRFS_MAP_WRITE, parities have been sorted to
the end positions, so this search can start from the first parity
stripe.

Signed-off-by: Liu Bo 
---
 fs/btrfs/raid56.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d8ea0eb..0c5ed68 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2225,12 +2225,13 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
ASSERT(!bio->bi_iter.bi_size);
rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
 
-   for (i = 0; i < rbio->real_stripes; i++) {
+   for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
if (bbio->stripes[i].dev == scrub_dev) {
rbio->scrubp = i;
break;
}
}
+   ASSERT(i < rbio->real_stripes);
 
/* Now we just support the sectorsize equals to page size */
ASSERT(fs_info->sectorsize == PAGE_SIZE);
-- 
2.9.4



Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Liu Bo
On Wed, Aug 02, 2017 at 10:41:30PM +0200, Goffredo Baroncelli wrote:
> Hi Liu,
> 
> thanks for your reply, below my comments
> On 2017-08-02 19:57, Liu Bo wrote:
> > On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
> >> On 2017-08-01 19:24, Liu Bo wrote:
> >>> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
>  Hi Liu,
> 
>  On 2017-08-01 18:14, Liu Bo wrote:
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
> >
> 
>  it would be possible to have more information ?
>  - what is logged ? data, parity or data + parity ?
> >>>
> >>> Patch 5 has more details(sorry for not making it clear that in the
> >>> cover letter).
> >>>
> >>> So both data and parity are logged so that while replaying the journal
> >>> everything is written to whichever disk it should be written to.
> >>
> >> It is correct reading this as: all data is written two times ? Or are 
> >> logged only the stripes involved by a RMW cycle (i.e. if a stripe is fully 
> >> written, the log is bypassed )?
> > 
> > For data, only data in bios from high level will be logged, while for
> > parity, the whole parity will be logged.
> > 
> > Full stripe write still logs all data and parity, as full stripe write
> > may not survive from unclean shutdown.
> 
> Does this matter ? Due to the COW nature of BTRFS if a transaction is 
> interrupted (by an unclean shutdown) the transaction data are all lost. Am I 
> missing something ?
> 
> What I want to understand, is if it is possible to log only the "partial 
> stripe"  RMW cycle.
>

I think your point is valid if all data is written with datacow.  In
case of nodatacow, btrfs does overwrite in place, so a full stripe
write may pollute on-disk data after unclean shutdown.  Checksum can
detect errors but repair thru raid5 may not recover the correct data.

> > 
> > Taking a raid5 setup with 3 disks as an example, doing an overwrite
> > of 4k will log 4K(data) + 64K(parity).
> > 
> >>>
>  - in the past I thought that it would be sufficient to log only the 
>  stripe position involved by a RMW cycle, and then start a scrub on these 
>  stripes in case of an unclean shutdown: do you think that it is feasible 
>  ?
> >>>
> >>> An unclean shutdown causes inconsistence between data and parity, so
> >>> scrub won't help as it's not able to tell which one (data or parity)
> >>> is valid
> >> Scrub compares data against its checksum; so it knows if the data is 
> >> correct. If no disk is lost, a scrub process is sufficient/needed to 
> >> rebuild the parity/data.
> >>
> > 
> > If no disk is lost, it depends on whether the number of errors caused
> > by an unclean shutdown can be tolerated by the raid setup.
> 
> see below
> > 
> >> The problem born when after "an unclean shutdown" a disk failure happens. 
> >> But these  are *two* distinct failures. These together break the BTRFS 
> >> raid5 redundancy. But if you run a scrub process between these two 
> >> failures, the btrfs raid5 redundancy is still effective.
> >>
> > 
> > I wouldn't say that the redundancy is still effective after a scrub
> > process, but rather those data which match their checksum can still be
> > read out while the mismatched data are lost forever after unclean
> > shutdown.
> 
> 
> I think that this is the point where we are in disagreement: until now I 
> understood that in BTRFS
> a) a transaction is fully completed or fully not-completed. 
> b) a transaction is completed after both the data *and* the parity are 
> written.
> 
> With these assumption, due to the COW nature of BTRFS an unclean shutdown 
> might invalidate only data of the current transaction. Of course the unclean 
> shutdown prevent the transaction to be completed, and this means that all the 
> data of this transaction is lost in any case.
> 
> For the parity this is different, because it is possible a misalignment 
> between the parity and the data (which might be of different transactions).
> 
> Let me to explain with the help of your example:
> 
> > Taking a raid5 setup with 3 disks as an example, doing an overwrite
> > of 4k will log 4K(data) + 64K(parity).
> 
> If the transaction is aborted, 128k-4k = 124k are untouched, and these still 
> be valid. The last 4k might be wrong, but in any case this data is not 
> referenced because the transaction was never completed. 
> The parity need to be rebuild because we are not able to know if the 
> transaction was aborted before/after the data and/or parity writing
>

True, the 4k of data is not referenced, but again, after rebuilding the
parity, the remaining 124K and the 4k holding random data are not
consistent with the rebuilt parity.

The point is to keep parity and data consistent at any point of time
so that raid5 tolerance is 

Re: Massive loss of disk space

2017-08-02 Thread Goffredo Baroncelli
On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>> Hi,
>>
[...]

>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeed [1]: i.e. there is 
>> enough space on the filesystem. Due to the COW nature of BTRFS, you cannot 
>> rely on the already allocated space because there could be a small time 
>> window where both the old and the new data exists on the disk.

> There is also an expectation based on pretty much every other FS in existence 
> that calling fallocate() on a range that is already in use is a (possibly 
> expensive) no-op, and by extension using fallocate() with an offset of 0 like 
> a ftruncate() call will succeed as long as the new size will fit.

The man page of fallocate doesn't guarantee that.

Unfortunately, in a COW filesystem the assumption that an allocated area may 
be simply overwritten is not true. 

Let me say it in other words: as a general rule, if you want to _write_ 
something on a cow filesystem, you need space. It doesn't matter whether you 
are *over-writing* existing data or *appending* to a file.


> 
> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), 
> NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ 
> on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log 
> structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and 
> _all_ of them behave correctly here and succeed with the test I listed, while 
> BTRFS does not.  This isn't codified in POSIX, but it's also not something 
> that is listed as implementation defined, which in turn means that we should 
> be trying to match the other implementations.

[...]

> 
>>
>> My opinion is that in general this behavior is correct due to the COW nature 
>> of BTRFS.
>> The only exception that I can find, is about the "nocow" file. For these 
>> cases, taking into account the already allocated space would be better.
> There are other, saner ways to make that expectation hold though, and I'm not 
> even certain that it does as things are implemented (I believe we still CoW 
> unwritten extents when data is written to them, because I _have_ had writes 
> to fallocate'ed files fail on BTRFS before with -ENOSPC).
> 
> The ideal situation IMO is as follows:
> 
> 1. This particular case (using fallocate() with an offset of 0 to extend a 
> file that is already larger than half the remaining free space on the FS) 
> _should_ succeed.  

This description is not accurate. What happens is the following:
1) you have a file *with valid data*
2) you want to prepare an update of this file and want to be sure to have 
enough space

at this point fallocate has to guarantee:
a) you still have your old data available
b) you have allocated the space for the update

In terms of a COW filesystem, you need the space of a) + the space of b)


> Short of very convoluted configurations, extending a file with fallocate will 
> not result in over-committing space on a CoW filesystem unless it would 
> extend the file by more than the remaining free space, and therefore barring 
> long external interactions, subsequent writes will also succeed.  Proof of 
> this for a general case is somewhat complicated, but in the very specific 
> case of the script I posted as a reproducer in the other thread about this 
> and the test case I gave in this thread, it's trivial to prove that the 
> writes will succeed.  Either way, the behavior of SnapRAID, while not optimal 
> in this case, is still a legitimate usage (I've seen programs do things like 
> that just to make sure the file isn't sparse).
> 
> 2. Conversion of unwritten extents to written ones should not require new 
> allocation.  Ideally, we need to be allocating not just space for the data, 
> but also reasonable space for the associated metadata when allocating an 
> unwritten extent, and there should be no CoW involved when they are written 
> to except for the small metadata updates required to account the new blocks.  
> Unless we're doing this, then we have edge cases where the the above listed 
> expectation does not hold (also note that GlobalReserve does not count IMO, 
> it's supposed to be for temporary usage only and doesn't ever appear to be 
> particularly large).
> 
> 3. There should be some small amount of space reserved globally for not just 
> metadata, but data too, so that a 'full' filesystem can still update existing 
> files reliably.  I'm not sure that we're not doing this already, but AIUI, 
> GlobalReserve is metadata only.  If we do this, we don't have to worry _as 
> much_ about avoiding CoW when converting unwritten extents to regular ones.
>>
>> Comments are welcome.
>>
>> BR
>> G.Baroncelli
>>
>> [1] from man 2 fallocate
>> [...]
>> After  a  successful call, subsequent writes into the range 
>> specified by 

Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Goffredo Baroncelli
Hi Liu,

thanks for your reply, below my comments
On 2017-08-02 19:57, Liu Bo wrote:
> On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
>> On 2017-08-01 19:24, Liu Bo wrote:
>>> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
 Hi Liu,

 On 2017-08-01 18:14, Liu Bo wrote:
> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.
>

 it would be possible to have more information ?
 - what is logged ? data, parity or data + parity ?
>>>
>>> Patch 5 has more details(sorry for not making it clear that in the
>>> cover letter).
>>>
>>> So both data and parity are logged so that while replaying the journal
>>> everything is written to whichever disk it should be written to.
>>
>> It is correct reading this as: all data is written two times ? Or are logged 
>> only the stripes involved by a RMW cycle (i.e. if a stripe is fully written, 
>> the log is bypassed )?
> 
> For data, only data in bios from high level will be logged, while for
> parity, the whole parity will be logged.
> 
> Full stripe write still logs all data and parity, as full stripe write
> may not survive from unclean shutdown.

Does this matter? Due to the COW nature of BTRFS, if a transaction is 
interrupted (by an unclean shutdown) the transaction's data is all lost. Am I 
missing something?

What I want to understand is whether it is possible to log only the "partial 
stripe" RMW cycles.

> 
> Taking a raid5 setup with 3 disks as an example, doing an overwrite
> of 4k will log 4K(data) + 64K(parity).
> 
>>>
 - in the past I thought that it would be sufficient to log only the stripe 
 position involved by a RMW cycle, and then start a scrub on these stripes 
 in case of an unclean shutdown: do you think that it is feasible ?
>>>
>>> An unclean shutdown causes inconsistence between data and parity, so
>>> scrub won't help as it's not able to tell which one (data or parity)
>>> is valid
>> Scrub compares data against its checksum; so it knows if the data is 
>> correct. If no disk is lost, a scrub process is sufficient/needed to rebuild 
>> the parity/data.
>>
> 
> If no disk is lost, it depends on whether the number of errors caused
> by an unclean shutdown can be tolerated by the raid setup.

see below
> 
>> The problem born when after "an unclean shutdown" a disk failure happens. 
>> But these  are *two* distinct failures. These together break the BTRFS raid5 
>> redundancy. But if you run a scrub process between these two failures, the 
>> btrfs raid5 redundancy is still effective.
>>
> 
> I wouldn't say that the redundancy is still effective after a scrub
> process, but rather those data which match their checksum can still be
> read out while the mismatched data are lost forever after unclean
> shutdown.


I think that this is the point where we are in disagreement: until now I 
understood that in BTRFS
a) a transaction is either fully completed or fully not completed. 
b) a transaction is completed only after both the data *and* the parity are 
written.

With these assumptions, due to the COW nature of BTRFS, an unclean shutdown 
can invalidate only the data of the current transaction. Of course the unclean 
shutdown prevents the transaction from completing, and this means that all the 
data of this transaction is lost in any case.

For the parity this is different, because a misalignment between the parity 
and the data (which might belong to different transactions) is possible.

Let me to explain with the help of your example:

> Taking a raid5 setup with 3 disks as an example, doing an overwrite
> of 4k will log 4K(data) + 64K(parity).

If the transaction is aborted, 128k-4k = 124k are untouched, and these are 
still valid. The last 4k might be wrong, but in any case this data is not 
referenced because the transaction was never completed. 
The parity needs to be rebuilt because we are not able to know whether the 
transaction was aborted before/after the data and/or parity writing.


> 
> Thanks,
> 
> -liubo
>>
>>>
>>> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
>>> With datacow, we don't do overwrite, but the following situation may happen,
>>> say we have a raid5 setup with 3 disks, the stripe length is 64k, so
>>>
>>> 1) write 64K  --> now the raid layout is
>>> [64K data + 64K random + 64K parity]
>>> 2) write another 64K --> now the raid layout after RMW is
>>> [64K 1)'s data + 64K 2)'s data + 64K new parity]
>>>
>>> If unclean shutdown occurs before 2) finishes, then parity may be
>>> corrupted and then 1)'s data may be recovered wrongly if the disk
>>> which holds 1)'s data is offline.
>>>
 - does this journal disk also host other btrfs log ?

>>>
>>> No, purely data/parity and some associated metadata.
>>>
>>> Thanks,
>>>
>>> -liubo
>>>

[PATCH] btrfs: pass fs_info to routines that always take tree_root

2017-08-02 Thread jeffm
From: Jeff Mahoney 

btrfs_find_root and btrfs_del_root always use the tree_root.  Let's pass
fs_info instead.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h   |  7 ---
 fs/btrfs/disk-io.c |  2 +-
 fs/btrfs/extent-tree.c |  4 ++--
 fs/btrfs/free-space-tree.c |  2 +-
 fs/btrfs/qgroup.c  |  3 +--
 fs/btrfs/root-tree.c   | 15 +--
 6 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3f3eb7b17cac..eed7cc991a80 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2973,8 +2973,8 @@ int btrfs_del_root_ref(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info,
   u64 root_id, u64 ref_id, u64 dirid, u64 *sequence,
   const char *name, int name_len);
-int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-  const struct btrfs_key *key);
+int btrfs_del_root(struct btrfs_trans_handle *trans,
+  struct btrfs_fs_info *fs_info, const struct btrfs_key *key);
 int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
  const struct btrfs_key *key,
  struct btrfs_root_item *item);
@@ -2982,7 +2982,8 @@ int __must_check btrfs_update_root(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct btrfs_key *key,
   struct btrfs_root_item *item);
-int btrfs_find_root(struct btrfs_root *root, const struct btrfs_key *search_key,
+int btrfs_find_root(struct btrfs_fs_info *fs_info,
+   const struct btrfs_key *search_key,
struct btrfs_path *path, struct btrfs_root_item *root_item,
struct btrfs_key *root_key);
 int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 080e2ebb8aa0..ea1959937875 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1581,7 +1581,7 @@ static struct btrfs_root *btrfs_read_tree_root(struct btrfs_root *tree_root,
 
__setup_root(root, fs_info, key->objectid);
 
-   ret = btrfs_find_root(tree_root, key, path,
+   ret = btrfs_find_root(fs_info, key, path,
  &root->root_item, &root->root_key);
if (ret) {
if (ret > 0)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82d53a7b6652..12fa33accdcc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9192,14 +9192,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
if (err)
goto out_end_trans;
 
-   ret = btrfs_del_root(trans, tree_root, &root->root_key);
+   ret = btrfs_del_root(trans, fs_info, &root->root_key);
if (ret) {
btrfs_abort_transaction(trans, ret);
goto out_end_trans;
}
 
if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
-   ret = btrfs_find_root(tree_root, &root->root_key, path,
+   ret = btrfs_find_root(fs_info, &root->root_key, path,
  NULL, NULL);
if (ret < 0) {
btrfs_abort_transaction(trans, ret);
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index a5e34de06c2f..684f12247db7 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1257,7 +1257,7 @@ int btrfs_clear_free_space_tree(struct btrfs_fs_info *fs_info)
if (ret)
goto abort;
 
-   ret = btrfs_del_root(trans, tree_root, &free_space_root->root_key);
+   ret = btrfs_del_root(trans, fs_info, &free_space_root->root_key);
if (ret)
goto abort;
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4ce351efe281..ba60523a443c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -946,7 +946,6 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
 int btrfs_quota_disable(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_root *tree_root = fs_info->tree_root;
struct btrfs_root *quota_root;
int ret = 0;
 
@@ -968,7 +967,7 @@ int btrfs_quota_disable(struct btrfs_trans_handle *trans,
if (ret)
goto out;
 
-   ret = btrfs_del_root(trans, tree_root, &quota_root->root_key);
+   ret = btrfs_del_root(trans, fs_info, &quota_root->root_key);
if (ret)
goto out;
 
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 460db0cb2d07..31c0e7265f44 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -62,7 +62,7 @@ static void btrfs_read_root_item(struct extent_buffer *eb, int slot,
 
 /*
  * btrfs_find_root - lookup the root by the key.
- * root: the root of the root tree
+ * fs_info: the fs_info for the file system to search
  * search_key: the key to 

Re: [PATCH 01/14] Btrfs: raid56: add raid56 log via add_dev v2 ioctl

2017-08-02 Thread Nikolay Borisov


On  1.08.2017 19:14, Liu Bo wrote:
> This introduces add_dev_v2 ioctl to add a device as raid56 journal
> device.  With the help of a journal device, raid56 is able to to get
> rid of potential write holes.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/ctree.h|  6 ++
>  fs/btrfs/ioctl.c| 48 
> -
>  fs/btrfs/raid56.c   | 42 
>  fs/btrfs/raid56.h   |  1 +
>  fs/btrfs/volumes.c  | 26 --
>  fs/btrfs/volumes.h  |  3 ++-
>  include/uapi/linux/btrfs.h  |  3 +++
>  include/uapi/linux/btrfs_tree.h |  4 
>  8 files changed, 125 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 643c70d..d967627 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -697,6 +697,7 @@ struct btrfs_stripe_hash_table {
>  void btrfs_init_async_reclaim_work(struct work_struct *work);
>  
>  /* fs_info */
> +struct btrfs_r5l_log;
>  struct reloc_control;
>  struct btrfs_device;
>  struct btrfs_fs_devices;
> @@ -1114,6 +1115,9 @@ struct btrfs_fs_info {
>   u32 nodesize;
>   u32 sectorsize;
>   u32 stripesize;
> +
> + /* raid56 log */
> + struct btrfs_r5l_log *r5log;
>  };
>  
>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> @@ -2932,6 +2936,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_fs_info *fs_info)
>  
>  static inline void free_fs_info(struct btrfs_fs_info *fs_info)
>  {
> + if (fs_info->r5log)
> + kfree(fs_info->r5log);
>   kfree(fs_info->balance_ctl);
>   kfree(fs_info->delayed_root);
>   kfree(fs_info->extent_root);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index e176375..3d1ef4d 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2653,6 +2653,50 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
>   return ret;
>  }
>  
> +/* identical to btrfs_ioctl_add_dev, but this is with flags */
> +static long btrfs_ioctl_add_dev_v2(struct btrfs_fs_info *fs_info, void __user *arg)
> +{
> + struct btrfs_ioctl_vol_args_v2 *vol_args;
> + int ret;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
> + return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> +
> + mutex_lock(&fs_info->volume_mutex);
> + vol_args = memdup_user(arg, sizeof(*vol_args));
> + if (IS_ERR(vol_args)) {
> + ret = PTR_ERR(vol_args);
> + goto out;
> + }
> +
> + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG &&
> + fs_info->r5log) {
> + ret = -EEXIST;
> + btrfs_info(fs_info, "r5log: attempting to add another log device!");
> + goto out_free;
> + }
> +
> + vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
> + ret = btrfs_init_new_device(fs_info, vol_args->name, vol_args->flags);
> + if (!ret) {
> + if (vol_args->flags & BTRFS_DEVICE_RAID56_LOG) {
> + ASSERT(fs_info->r5log);
> + btrfs_info(fs_info, "disk added %s as raid56 log", vol_args->name);
> + } else {
> + btrfs_info(fs_info, "disk added %s", vol_args->name);
> + }
> + }
> +out_free:
> + kfree(vol_args);
> +out:
> + mutex_unlock(&fs_info->volume_mutex);
> + clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
> + return ret;
> +}
> +
>  static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
>  {
>   struct btrfs_ioctl_vol_args *vol_args;
> @@ -2672,7 +2716,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
>   }
>  
>   vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
> - ret = btrfs_init_new_device(fs_info, vol_args->name);
> + ret = btrfs_init_new_device(fs_info, vol_args->name, 0);
>  
>   if (!ret)
>   btrfs_info(fs_info, "disk added %s", vol_args->name);
> @@ -5539,6 +5583,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>   return btrfs_ioctl_resize(file, argp);
>   case BTRFS_IOC_ADD_DEV:
>   return btrfs_ioctl_add_dev(fs_info, argp);
> + case BTRFS_IOC_ADD_DEV_V2:
> + return btrfs_ioctl_add_dev_v2(fs_info, argp);
>   case BTRFS_IOC_RM_DEV:
>   return btrfs_ioctl_rm_dev(file, argp);
>   case BTRFS_IOC_RM_DEV_V2:
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d8ea0eb..2b91b95 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -177,6 +177,25 @@ struct btrfs_raid_bio {
>   unsigned long *dbitmap;
>  };
>  
> +/* raid56 log */
> +struct btrfs_r5l_log {
> + /* protect this struct and log io */
> + struct mutex io_mutex;
> +
> + /* r5log device */
> + struct btrfs_device *dev;
> +
> + /* allocation range for log 

Re: Massive loss of disk space

2017-08-02 Thread Austin S. Hemmelgarn

On 2017-08-02 13:52, Goffredo Baroncelli wrote:

Hi,

On 2017-08-01 17:00, Austin S. Hemmelgarn wrote:

OK, I just did a dead simple test by hand, and it looks like I was right.  The 
method I used to check this is as follows:
1. Create and mount a reasonably small filesystem (I used an 8G temporary LV 
for this, a file would work too though).
2. Using dd or a similar tool, create a test file that takes up half of the 
size of the filesystem.  It is important that this _not_ be fallocated, but 
just written out.
3. Use `fallocate -l` to try and extend the size of the file beyond half the 
size of the filesystem.

For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed 
with no error.  Based on this and some low-level inspection, it looks like 
BTRFS treats the full range of the fallocate call as unallocated, and thus is 
trying to allocate space for regions of that range that are already allocated.


I can confirm this behavior; below are some steps to reproduce it [2]. However,
I don't think that it is a bug; this is the correct behavior for a COW
filesystem (see below).


Looking at the function btrfs_fallocate() (file fs/btrfs/file.c)


static long btrfs_fallocate(struct file *file, int mode,
 loff_t offset, loff_t len)
{
[...]
 alloc_start = round_down(offset, blocksize);
 alloc_end = round_up(offset + len, blocksize);
[...]
 /*
  * Only trigger disk allocation, don't trigger qgroup reserve
  *
  * For qgroup space, it will be checked later.
  */
 ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
 alloc_end - alloc_start)


it seems that BTRFS always allocates the maximum space required, without
considering the space already allocated. Is it too conservative? I think not:
consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeeds [1]: i.e. there is enough
space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the
already allocated space, because there could be a small time window where both
the old and the new data exist on disk.
There is also an expectation based on pretty much every other FS in 
existence that calling fallocate() on a range that is already in use is 
a (possibly expensive) no-op, and by extension using fallocate() with an 
offset of 0 like a ftruncate() call will succeed as long as the new size 
will fit.


I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel 
driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, 
UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different 
name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, 
and VxFS on HP-UX, and _all_ of them behave correctly here and succeed 
with the test I listed, while BTRFS does not.  This isn't codified in 
POSIX, but it's also not something that is listed as implementation 
defined, which in turn means that we should be trying to match the other 
implementations.




My opinion is that in general this behavior is correct due to the COW nature of
BTRFS.
The only exception that I can find is "nocow" files; for those, taking into
account the already allocated space would be better.
There are other, saner ways to make that expectation hold though, and 
I'm not even certain that it does as things are implemented (I believe 
we still CoW unwritten extents when data is written to them, because I 
_have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC).


The ideal situation IMO is as follows:

1. This particular case (using fallocate() with an offset of 0 to extend 
a file that is already larger than half the remaining free space on the 
FS) _should_ succeed.  Short of very convoluted configurations, 
extending a file with fallocate will not result in over-committing space 
on a CoW filesystem unless it would extend the file by more than the 
remaining free space, and therefore barring long external interactions, 
subsequent writes will also succeed.  Proof of this for a general case 
is somewhat complicated, but in the very specific case of the script I 
posted as a reproducer in the other thread about this and the test case 
I gave in this thread, it's trivial to prove that the writes will 
succeed.  Either way, the behavior of SnapRAID, while not optimal in 
this case, is still a legitimate usage (I've seen programs do things 
like that just to make sure the file isn't sparse).


2. Conversion of unwritten extents to written ones should not require 
new allocation.  Ideally, we need to be allocating not just space for 
the data, but also reasonable space for the associated metadata when 
allocating an unwritten extent, and there should be no CoW involved when 
they are written to except for the small metadata updates required to 
account the new blocks.  Unless we're 

Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Liu Bo
On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
> On 2017-08-01 19:24, Liu Bo wrote:
> > On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> >> Hi Liu,
> >>
> >> On 2017-08-01 18:14, Liu Bo wrote:
> >>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> >>> separate disk as a journal (aka raid5/6 log), so that after unclean
> >>> shutdown we can make sure data and parity are consistent on the raid
> >>> array by replaying the journal.
> >>>
> >>
> >> it would be possible to have more information ?
> >> - what is logged ? data, parity or data + parity ?
> > 
> > Patch 5 has more details(sorry for not making it clear that in the
> > cover letter).
> > 
> > So both data and parity are logged so that while replaying the journal
> > everything is written to whichever disk it should be written to.
> 
> It is correct reading this as: all data is written two times ? Or are logged 
> only the stripes involved by a RMW cycle (i.e. if a stripe is fully written, 
> the log is bypassed )?

For data, only data in bios from high level will be logged, while for
parity, the whole parity will be logged.

Full stripe write still logs all data and parity, as full stripe write
may not survive from unclean shutdown.

Taking a raid5 setup with 3 disks as an example, doing an overwrite
of 4k will log 4K(data) + 64K(parity).

> > 
> >> - in the past I thought that it would be sufficient to log only the stripe 
> >> position involved by a RMW cycle, and then start a scrub on these stripes 
> >> in case of an unclean shutdown: do you think that it is feasible ?
> > 
> > An unclean shutdown causes inconsistence between data and parity, so
> > scrub won't help as it's not able to tell which one (data or parity)
> > is valid
> Scrub compares data against its checksum; so it knows if the data is correct. 
> If no disk is lost, a scrub process is sufficient/needed to rebuild the 
> parity/data.
>

If no disk is lost, it depends on whether the number of errors caused
by an unclean shutdown can be tolerated by the raid setup.

> The problem born when after "an unclean shutdown" a disk failure happens. But 
> these  are *two* distinct failures. These together break the BTRFS raid5 
> redundancy. But if you run a scrub process between these two failures, the 
> btrfs raid5 redundancy is still effective.
>

I wouldn't say that the redundancy is still effective after a scrub
process, but rather those data which match their checksum can still be
read out while the mismatched data are lost forever after unclean
shutdown.

Thanks,

-liubo
> 
> > 
> > With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> > With datacow, we don't do overwrite, but the following situation may happen,
> > say we have a raid5 setup with 3 disks, the stripe length is 64k, so
> > 
> > 1) write 64K  --> now the raid layout is
> > [64K data + 64K random + 64K parity]
> > 2) write another 64K --> now the raid layout after RMW is
> > [64K 1)'s data + 64K 2)'s data + 64K new parity]
> > 
> > If unclean shutdown occurs before 2) finishes, then parity may be
> > corrupted and then 1)'s data may be recovered wrongly if the disk
> > which holds 1)'s data is offline.
> > 
> >> - does this journal disk also host other btrfs log ?
> >>
> > 
> > No, purely data/parity and some associated metadata.
> > 
> > Thanks,
> > 
> > -liubo
> > 
> >>> The idea and the code are similar to the write-through mode of md
> >>> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> >>> (If you've been familiar with md, you may find this patch set is
> >>> boring to read...)
> >>>
> >>> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> >>> the implementation, the rest patches are improvements and bugfixes,
> >>> eg. readahead for recovery, checksum.
> >>>
> >>> Two btrfs-progs patches are required to play with this patch set, one
> >>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> >>> option '-L', the other is to teach 'btrfs-show-super' to show
> >>> %journal_tail.
> >>>
> >>> This is currently based on 4.12-rc3.
> >>>
> >>> The patch set is tagged with RFC, and comments are always welcome,
> >>> thanks.
> >>>
> >>> Known limitations:
> >>> - Deleting a log device is not implemented yet.
> >>>
> >>>
> >>> Liu Bo (14):
> >>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >>>   Btrfs: raid56: do not allocate chunk on raid56 log
> >>>   Btrfs: raid56: detect raid56 log on mount
> >>>   Btrfs: raid56: add verbose debug
> >>>   Btrfs: raid56: add stripe log for raid5/6
> >>>   Btrfs: raid56: add reclaim support
> >>>   Btrfs: raid56: load r5log
> >>>   Btrfs: raid56: log recovery
> >>>   Btrfs: raid56: add readahead for recovery
> >>>   Btrfs: raid56: use the readahead helper to get page
> >>>   Btrfs: raid56: add csum support
> >>>   Btrfs: raid56: fix error handling while adding a log device
> >>>   Btrfs: raid56: initialize raid5/6 log after 

[PATCH 2/3] fixed android.mk

2017-08-02 Thread filipbystricky
From: Filip Bystricky 

Signed-off-by: Filip Bystricky 
Reviewed-by: Mark Salyzyn 
---
 Android.mk | 53 +
 1 file changed, 21 insertions(+), 32 deletions(-)

diff --git a/Android.mk b/Android.mk
index 52fe9ab4..9516c2d1 100644
--- a/Android.mk
+++ b/Android.mk
@@ -1,18 +1,19 @@
 LOCAL_PATH:= $(call my-dir)
 
-#include $(call all-subdir-makefiles)
+# temporary flags to reduce the number of emitted warnings until they can be
+# fixed properly
+TEMP_CFLAGS := -Wno-pointer-arith -Wno-tautological-constant-out-of-range-compare \
+   -Wno-sign-compare -Wno-format -Wno-unused-parameter
 
 CFLAGS := -g -O1 -Wall -D_FORTIFY_SOURCE=2 -include config.h \
-   -DBTRFS_FLAT_INCLUDES -D_XOPEN_SOURCE=700 -fno-strict-aliasing -fPIC
+   -DBTRFS_FLAT_INCLUDES -D_XOPEN_SOURCE=700 -fno-strict-aliasing -fPIC \
+   -Wno-macro-redefined -Wno-typedef-redefinition -Wno-address-of-packed-member \
+   -Wno-missing-field-initializers $(TEMP_CFLAGS)
 
-LDFLAGS := -static -rdynamic
-
-LIBS := -luuid   -lblkid   -lz   -llzo2 -L. -lpthread
-LIBBTRFS_LIBS := $(LIBS)
-
-STATIC_CFLAGS := $(CFLAGS) -ffunction-sections -fdata-sections
-STATIC_LDFLAGS := -static -Wl,--gc-sections
-STATIC_LIBS := -luuid   -lblkid -luuid -lz   -llzo2 -L. -pthread
+STATIC_CFLAGS := $(CFLAGS) -ffunction-sections -fdata-sections \
+   -D_GNU_SOURCE=1 \
+   -DPACKAGE_STRING=\"btrfs\" \
+   -DPACKAGE_URL=\"http://btrfs.wiki.kernel.org\"
 
 btrfs_shared_libraries := libext2_uuid \
libext2_blkid
@@ -23,7 +24,8 @@ objects := ctree.c disk-io.c kernel-lib/radix-tree.c extent-tree.c print-tree.c
   qgroup.c free-space-cache.c kernel-lib/list_sort.c props.c \
   kernel-shared/ulist.c qgroup-verify.c backref.c string-table.c task-utils.c \
   inode.c file.c find-root.c free-space-tree.c help.c send-dump.c \
-  fsfeatures.c kernel-lib/tables.c kernel-lib/raid56.c
+  fsfeatures.c raid56.c
+
 cmds_objects := cmds-subvolume.c cmds-filesystem.c cmds-device.c cmds-scrub.c \
cmds-inspect.c cmds-balance.c cmds-send.c cmds-receive.c \
cmds-quota.c cmds-qgroup.c cmds-replace.c cmds-check.c \
@@ -38,12 +40,11 @@ libbtrfs_headers := send-stream.h send-utils.h send.h kernel-lib/rbtree.h btrfs-
kernel-lib/crc32c.h kernel-lib/list.h kerncompat.h \
kernel-lib/radix-tree.h kernel-lib/sizes.h kernel-lib/raid56.h \
extent-cache.h extent_io.h ioctl.h ctree.h btrfsck.h version.h
-TESTS := fsck-tests.sh convert-tests.sh
-blkid_objects := partition/ superblocks/ topology/
-
 
 # external/e2fsprogs/lib is needed for uuid/uuid.h
-common_C_INCLUDES := $(LOCAL_PATH) external/e2fsprogs/lib/ external/lzo/include/ external/zlib/
+common_C_INCLUDES := $(LOCAL_PATH) external/e2fsprogs/lib/ external/lzo/include/ external/zlib/ \
+   $(LOCAL_PATH)/kernel-lib
+
 
 #--
 include $(CLEAR_VARS)
@@ -56,23 +57,18 @@ include $(BUILD_STATIC_LIBRARY)
 #--
 include $(CLEAR_VARS)
 LOCAL_MODULE := btrfs
-#LOCAL_FORCE_STATIC_EXECUTABLE := true
 LOCAL_SRC_FILES := \
$(objects) \
$(cmds_objects) \
-   btrfs.c \
-   help.c \
+   btrfs.c
 
 LOCAL_C_INCLUDES := $(common_C_INCLUDES)
 LOCAL_CFLAGS := $(STATIC_CFLAGS)
-#LOCAL_LDLIBS := $(LIBBTRFS_LIBS)
-#LOCAL_LDFLAGS := $(STATIC_LDFLAGS)
 LOCAL_SHARED_LIBRARIES := $(btrfs_shared_libraries)
 LOCAL_STATIC_LIBRARIES := libbtrfs liblzo-static libz
 LOCAL_SYSTEM_SHARED_LIBRARIES := libc libcutils
-
 LOCAL_EXPORT_C_INCLUDES := $(common_C_INCLUDES)
-#LOCAL_MODULE_TAGS := optional
+
 include $(BUILD_EXECUTABLE)
 
 #--
@@ -85,14 +81,11 @@ LOCAL_SRC_FILES := \
 
 LOCAL_C_INCLUDES := $(common_C_INCLUDES)
 LOCAL_CFLAGS := $(STATIC_CFLAGS)
-#LOCAL_LDLIBS := $(LIBBTRFS_LIBS)
-#LOCAL_LDFLAGS := $(STATIC_LDFLAGS)
 LOCAL_SHARED_LIBRARIES := $(btrfs_shared_libraries)
 LOCAL_STATIC_LIBRARIES := libbtrfs liblzo-static
 LOCAL_SYSTEM_SHARED_LIBRARIES := libc libcutils
-
 LOCAL_EXPORT_C_INCLUDES := $(common_C_INCLUDES)
-#LOCAL_MODULE_TAGS := optional
+
 include $(BUILD_EXECUTABLE)
 
 #---
@@ -105,13 +98,9 @@ LOCAL_SRC_FILES := \
 LOCAL_C_INCLUDES := $(common_C_INCLUDES)
 LOCAL_CFLAGS := $(STATIC_CFLAGS)
 LOCAL_SHARED_LIBRARIES := $(btrfs_shared_libraries)
-#LOCAL_LDLIBS := $(LIBBTRFS_LIBS)
-#LOCAL_LDFLAGS := $(STATIC_LDFLAGS)
-LOCAL_SHARED_LIBRARIES := $(btrfs_shared_libraries)
 LOCAL_STATIC_LIBRARIES := libbtrfs liblzo-static
 LOCAL_SYSTEM_SHARED_LIBRARIES := libc libcutils
-
 LOCAL_EXPORT_C_INCLUDES := $(common_C_INCLUDES)
-LOCAL_MODULE_TAGS := optional
+
 include 

[PATCH 3/3] compile error fixes

2017-08-02 Thread filipbystricky
From: Filip Bystricky 

Android currently does not fully support libblkid, and Android's bionic 
doesn't implement some pthread extras such as pthread_tryjoin_np and 
pthread_cancel. This patch fixes the resulting errors while trying to 
be as unobtrusive as possible, and is therefore just a temporary fix. 
For complete support of tools that use background tasks, the way those
are managed (in particular, how they are cancelled) would need to be 
reworked.

Signed-off-by: Filip Bystricky 
Reviewed-by: Mark Salyzyn 
---
 androidcompat.h | 38 --
 cmds-scrub.c|  5 +
 mkfs/common.c   |  8 
 mkfs/main.c |  7 +++
 task-utils.c|  1 +
 utils.c | 18 ++
 utils.h |  1 +
 7 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/androidcompat.h b/androidcompat.h
index eec76dad..bd0be172 100644
--- a/androidcompat.h
+++ b/androidcompat.h
@@ -7,22 +7,48 @@
 #ifndef __ANDROID_H__
 #define __ANDROID_H__
 
-#ifdef ANDROID
-
-#define pthread_setcanceltype(type, oldtype)   (0)
-#define pthread_setcancelstate(state, oldstate)(0)
+#ifdef __BIONIC__
 
+/*
+ * Bionic doesn't implement pthread_cancel or helpers.
+ *
+ * TODO: this is a temporary fix to just get the tools to compile.
+ * What we really want is to rework how background tasks are managed.
+ * All of the threads that are being cancelled are running in infinite loops.
+ * They should instead be checking a flag at each iteration to see if they
+ * should continue. Then cancelling would just be a matter of setting the flag.
+ *
+ * Most background tasks are managed using btrfs's task_utils library, in which
+ * case they are passed a task_ctx struct pointer.
+ *
+ * However, in two cases, they are created and cancelled directly with the pthread library:
+ *   - chunk-recover.c:scan_devices creates a thread for each device to scan, giving
+ * each a struct device_scan*.
+ *   - cmds-scrub.c:scrub_start creates a single thread and gives it a struct task_ctx*.
+ *
+ * Breakdown by command:
+ *   - btrfs check (cmds-check.c) uses a task (task_ctx) for indicating progress
+ *   - mkfs.btrfs (mkfs/main.c) doesn't appear to use any background tasks.
+ */
 #define pthread_cancel(ret)   pthread_kill((ret), SIGUSR1)
 
+/*
+ * If given pointers are non-null, just zero out the pointed-to value.
+ * This also eliminates some unused variable warnings.
+ */
+#define pthread_setcanceltype(type, oldtype)   ((oldtype) ? (*(oldtype) = 0) : 0)
+#define pthread_setcancelstate(state, oldstate)((oldstate) ? (*(oldstate) = 0) : 0)
+#define pthread_tryjoin_np(thread, retval) ((retval) ? ((int)(*(retval) = NULL)) : 0)
+
 typedef struct blkid_struct_probe *blkid_probe;
 
 #include 
 #define direct dirent
 
-#else  /* !ANDROID */
+#else  /* !__BIONIC__ */
 
 #include 
 
-#endif /* !ANDROID */
+#endif /* !__BIONIC__ */
 
 #endif /* __ANDROID_H__ */
diff --git a/cmds-scrub.c b/cmds-scrub.c
index 5388fdcf..5d8f6c24 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -46,6 +46,11 @@
 #include "commands.h"
 #include "help.h"
 
+#if defined(__BIONIC__) && !defined(PTHREAD_CANCELED)
+/* bionic's pthread does not define PTHREAD_CANCELED */
+#define PTHREAD_CANCELED   ((void *)-1)
+#endif
+
 static const char * const scrub_cmd_group_usage[] = {
"btrfs scrub  [options] |",
NULL
diff --git a/mkfs/common.c b/mkfs/common.c
index 1e8f26ea..0e4d5c39 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -549,6 +549,13 @@ out:
  *  0 for nothing found
  * -1 for internal error
  */
+#ifdef ANDROID /* none of these blkid functions exist in Android */
+static int check_overwrite(const char *device)
+{
+   /* We can't tell, so assume there is an existing fs or partition */
+   return 1;
+}
+#else
 static int check_overwrite(const char *device)
 {
const char  *type;
@@ -619,6 +626,7 @@ out:
  "existing filesystem.\n", device);
return ret;
 }
+#endif /* ANDROID */
 
 /*
  * Check if a device is suitable for btrfs
diff --git a/mkfs/main.c b/mkfs/main.c
index 61f746b3..8ebb11a4 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1149,6 +1149,12 @@ static int zero_output_file(int out_fd, u64 size)
return ret;
 }
 
+#ifdef ANDROID /* all Androids use ssd (and android currently does not fully support libblkid) */
+static int is_ssd(const char *file)
+{
+   return 1;
+}
+#else
 static int is_ssd(const char *file)
 {
blkid_probe probe;
@@ -1196,6 +1202,7 @@ static int is_ssd(const char *file)
 
return rotational == '0';
 }
+#endif /* ANDROID */
 
 static int _cmp_device_by_id(void *priv, struct list_head *a,
 struct list_head *b)
diff --git a/task-utils.c b/task-utils.c
index 12b00027..1e89f13c 100644
--- a/task-utils.c
+++ b/task-utils.c
@@ -21,6 +21,7 @@
 #include 
 
 #include "task-utils.h"
+#include 

[PATCH 1/3] copied android.mk from devel branch

2017-08-02 Thread filipbystricky
From: Filip Bystricky 

This series of patches fixes some compile errors that trigger when
compiling for Android devices.

This first patch just brings in devel's Android.mk, to which
kdave@ added a few fixes recently.

Signed-off-by: Filip Bystricky 
Reviewed-by: Mark Salyzyn 
---
 Android.mk | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/Android.mk b/Android.mk
index fe3209b6..52fe9ab4 100644
--- a/Android.mk
+++ b/Android.mk
@@ -17,22 +17,27 @@ STATIC_LIBS := -luuid   -lblkid -luuid -lz   -llzo2 -L. -pthread
 btrfs_shared_libraries := libext2_uuid \
libext2_blkid
 
-objects := ctree.c disk-io.c radix-tree.c extent-tree.c print-tree.c \
+objects := ctree.c disk-io.c kernel-lib/radix-tree.c extent-tree.c print-tree.c \
   root-tree.c dir-item.c file-item.c inode-item.c inode-map.c \
   extent-cache.c extent_io.c volumes.c utils.c repair.c \
-  qgroup.c raid6.c free-space-cache.c list_sort.c props.c \
-  ulist.c qgroup-verify.c backref.c string-table.c task-utils.c \
-  inode.c file.c find-root.c
+  qgroup.c free-space-cache.c kernel-lib/list_sort.c props.c \
+  kernel-shared/ulist.c qgroup-verify.c backref.c string-table.c task-utils.c \
+  inode.c file.c find-root.c free-space-tree.c help.c send-dump.c \
+  fsfeatures.c kernel-lib/tables.c kernel-lib/raid56.c
 cmds_objects := cmds-subvolume.c cmds-filesystem.c cmds-device.c cmds-scrub.c \
cmds-inspect.c cmds-balance.c cmds-send.c cmds-receive.c \
cmds-quota.c cmds-qgroup.c cmds-replace.c cmds-check.c \
cmds-restore.c cmds-rescue.c chunk-recover.c super-recover.c \
-   cmds-property.c cmds-fi-usage.c
-libbtrfs_objects := send-stream.c send-utils.c rbtree.c btrfs-list.c crc32c.c \
+   cmds-property.c cmds-fi-usage.c cmds-inspect-dump-tree.c \
+   cmds-inspect-dump-super.c cmds-inspect-tree-stats.c cmds-fi-du.c \
+   mkfs/common.c
+libbtrfs_objects := send-stream.c send-utils.c kernel-lib/rbtree.c btrfs-list.c \
+   kernel-lib/crc32c.c messages.c \
uuid-tree.c utils-lib.c rbtree-utils.c
-libbtrfs_headers := send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
-   crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \
-   extent_io.h ioctl.h ctree.h btrfsck.h version.h
+libbtrfs_headers := send-stream.h send-utils.h send.h kernel-lib/rbtree.h btrfs-list.h \
+   kernel-lib/crc32c.h kernel-lib/list.h kerncompat.h \
+   kernel-lib/radix-tree.h kernel-lib/sizes.h kernel-lib/raid56.h \
+   extent-cache.h extent_io.h ioctl.h ctree.h btrfsck.h 
version.h
 TESTS := fsck-tests.sh convert-tests.sh
 blkid_objects := partition/ superblocks/ topology/
 
@@ -75,7 +80,8 @@ include $(CLEAR_VARS)
 LOCAL_MODULE := mkfs.btrfs
 LOCAL_SRC_FILES := \
 $(objects) \
-mkfs.c
+mkfs/common.c \
+mkfs/main.c
 
 LOCAL_C_INCLUDES := $(common_C_INCLUDES)
 LOCAL_CFLAGS := $(STATIC_CFLAGS)
@@ -108,4 +114,4 @@ LOCAL_SYSTEM_SHARED_LIBRARIES := libc libcutils
 LOCAL_EXPORT_C_INCLUDES := $(common_C_INCLUDES)
 LOCAL_MODULE_TAGS := optional
 include $(BUILD_EXECUTABLE)
-#--
+#--
\ No newline at end of file
-- 
2.14.0.rc1.383.gd1ce394fe2-goog

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Chris Mason



On 08/01/2017 01:39 PM, Austin S. Hemmelgarn wrote:

On 2017-08-01 13:25, Roman Mamedov wrote:

On Tue,  1 Aug 2017 10:14:23 -0600
Liu Bo  wrote:


This aims to fix write hole issue on btrfs raid5/6 setup by adding a
separate disk as a journal (aka raid5/6 log), so that after unclean
shutdown we can make sure data and parity are consistent on the raid
array by replaying the journal.


Could it be possible to designate areas on the in-array devices to be used as
journal?

While md doesn't have much spare room in its metadata for extraneous things
like this, Btrfs could use almost as much as it wants to, adding to size of the
FS metadata areas. Reliability-wise, the log could be stored as RAID1 chunks.


It doesn't seem convenient to need having an additional storage device around
just for the log, and also needing to maintain its fault tolerance yourself (so
the log device would better be on a mirror, such as mdadm RAID1? more expense
and maintenance complexity).

I agree, MD pretty much needs a separate device simply because they 
can't allocate arbitrary space on the other array members.  BTRFS can do 
that though, and I would actually think that that would be _easier_ to 
implement than having a separate device.


That said, I do think that it would need to be a separate chunk type, 
because things could get really complicated if the metadata is itself 
using a parity raid profile.


Thanks for running with this Liu, I'm reading through all the patches. 
I do agree that it's better to put the logging into a dedicated chunk 
type, that way we can have it default to either double or triple mirroring.


-chris



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Chris Mason

On 08/02/2017 04:38 AM, Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


Supporting any filesystem is a huge amount of work.  I don't have a 
problem with Redhat or any distro picking and choosing the projects they 
want to support.


At least inside of FB, our own internal btrfs usage is continuing to 
grow.  Btrfs is becoming a big part of how we ship containers and other 
workloads where snapshots improve performance.


We also heavily use XFS, so I'm happy to see RH's long standing 
investment there continue.


-chris


Re: Massive loss of disk space

2017-08-02 Thread Goffredo Baroncelli
Hi,

On 2017-08-01 17:00, Austin S. Hemmelgarn wrote:
> OK, I just did a dead simple test by hand, and it looks like I was right.  
> The method I used to check this is as follows:
> 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV 
> for this, a file would work too though).
> 2. Using dd or a similar tool, create a test file that takes up half of the 
> size of the filesystem.  It is important that this _not_ be fallocated, but 
> just written out.
> 3. Use `fallocate -l` to try and extend the size of the file beyond half the 
> size of the filesystem.
> 
> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will 
> succeed with no error.  Based on this and some low-level inspection, it looks 
> like BTRFS treats the full range of the fallocate call as unallocated, and 
> thus is trying to allocate space for regions of that range that are already 
> allocated.

I can confirm this behavior; below are some steps to reproduce it [2]. However,
I don't think that it is a bug; this is the correct behavior for a COW
filesystem (see below).


Looking at the function btrfs_fallocate() (file fs/btrfs/file.c)


static long btrfs_fallocate(struct file *file, int mode,
loff_t offset, loff_t len)
{
[...]
alloc_start = round_down(offset, blocksize);
alloc_end = round_up(offset + len, blocksize);
[...]
/*
 * Only trigger disk allocation, don't trigger qgroup reserve
 *
 * For qgroup space, it will be checked later.
 */
ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
alloc_end - alloc_start)


it seems that BTRFS always allocates the maximum space required, without
considering the space already allocated. Is it too conservative? I think not:
consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeeds [1]: i.e. there is enough
space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the
already allocated space, because there could be a small time window where both
the old and the new data exist on disk.

My opinion is that in general this behavior is correct due to the COW nature of 
BTRFS. 
The only exception that I can find is the "nocow" file: for these cases, 
taking into account the already allocated space would be better.

Comments are welcome.

BR
G.Baroncelli

[1] from man 2 fallocate
[...]
   After a successful call, subsequent writes into the range specified by
   offset and len are guaranteed not to fail because of lack of disk space.
[...]


[2]

-- create a 5G btrfs filesystem

# mkdir t1
# truncate --size 5G disk
# losetup /dev/loop0 disk
# mkfs.btrfs /dev/loop0
# mount /dev/loop0 t1

-- test
-- create a 1500 MB file, then expand it to 4000 MB
-- expected result: the file is 4000 MB in size
-- actual result: the expansion fails

# fallocate -l $((1024*1024*100*15))  file.bin
# fallocate -l $((1024*1024*100*40))  file.bin
fallocate: fallocate failed: No space left on device
# ls -lh file.bin 
-rw-r--r-- 1 root root 1.5G Aug  2 19:09 file.bin


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] btrfs: copy fsid to super_block s_uuid

2017-08-02 Thread Darrick J. Wong
On Wed, Aug 02, 2017 at 02:02:11PM +0800, Anand Jain wrote:
> 
> Hi Darrick,
> 
>  Thanks for commenting..
> 
> >>+   memcpy(&sb->s_uuid, fs_info->fsid, BTRFS_FSID_SIZE);
> >
> >uuid_copy()?
> 
>   It requires a larger migration to use uuid_t; IMO it can be done all
>   together, in a separate patch?
> 
>   Just as an experiment, starting with struct btrfs_fs_info.fsid and
>   to check its footprints, I renamed fsid to fs_id and compiled.
>   It reports 73 'has no member named fsid' errors.
>   So it looks like redefining u8 fsid[] to uuid_t fsid, and further
>   updating all its uses, has to be simplified. Any suggestions?

Coccinelle script?

 It was a fairly simple transition for xfs and others, though
from a simple grep it looks like btrfs uses open-coded u8 arrays in a
few more places.

--D

> 
> Thanks, Anand
> 


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Austin S. Hemmelgarn

On 2017-08-02 08:55, Lutz Vieweg wrote:

On 08/02/2017 01:25 PM, Austin S. Hemmelgarn wrote:

And this is a worst-case result of the fact that most
distros added BTRFS support long before it was ready.


RedHat still advertises "Ceph", and given Ceph initially recommended 
btrfs as

the filesystem to use for its nodes, it is interesting to read how clearly
they recommend against btrfs now:

http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/ 


We recommend against using btrfs due to the lack of a stable version
to test against and frequent bugs in the ENOSPC handling.

Yes, and the one thing they don't mention there is that Ceph is already 
doing most of the same things that BTRFS is, so you end up having 
performance issues due to duplicated work too.  What they specifically 
call out, though, is first the reason that it should not yet be supported 
in RHEL, OEL, and many other distros (I'm explicitly leaving 
SLES/OpenSUSE off of that list because, while I disagree with their 
choices of default behavior WRT BTRFS, they are actively involved in 
its development, unlike most of the other distros that 'support' it), 
and then second one of the biggest issues for regular usage.


German IT magazine "Golem" speculates that RedHat's decision
is influenced by its recent acquisition of Permabit.

But I don't really see how XFS or Permabit tackle the problem
that if you need to create consistent backups of file systems while they 
are

in use, block-device level snapshots damage the write performance
big time.

When you're talking about data safety, though, most people are willing to 
sacrifice write performance in favor of significantly lowering perceived 
risk.  The misguided early support of BTRFS by many distros, without 
sufficient explanation of exactly how in-development it was, means 
that there are a lot more stories of issues and failures with BTRFS than 
ones of success (partly also because a filesystem is one of those 
things that people tend to complain about if it breaks, and not praise 
all that much if it works), and as a result, the general perception 
outside of people who use it actively is that it's pretty risky to use 
(which is absolutely accurate if you don't do routine maintenance on it).


(That backup topic is the one reason we use btrfs for a lot of
/home/ directories.)

I understand that XFS is expected to get some COW-features in the future
as well - but it remains to be seen what performance and robustness
implications that will have on XFS.

I believe basic reflink functionality is already upstream, and I wasn't 
aware of any other specific COW development for XFS.



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Lutz Vieweg

On 08/02/2017 01:25 PM, Austin S. Hemmelgarn wrote:

And this is a worst-case result of the fact that most
distros added BTRFS support long before it was ready.


RedHat still advertises "Ceph", and given Ceph initially recommended btrfs as
the filesystem to use for its nodes, it is interesting to read how clearly
they recommend against btrfs now:

http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/

We recommend against using btrfs due to the lack of a stable version
to test against and frequent bugs in the ENOSPC handling.


German IT magazine "Golem" speculates that RedHat's decision
is influenced by its recent acquisition of Permabit.

But I don't really see how XFS or Permabit tackle the problem
that if you need to create consistent backups of file systems while they are
in use, block-device level snapshots damage the write performance
big time.

(That backup topic is the one reason we use btrfs for a lot of
/home/ directories.)

I understand that XFS is expected to get some COW-features in the future
as well - but it remains to be seen what performance and robustness
implications that will have on XFS.

Regards,

Lutz Vieweg



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Austin S. Hemmelgarn

On 2017-08-02 04:38, Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


 From the release notes at 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html 



"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the 
initial release of Red Hat Enterprise Linux 6. Red Hat will not be 
moving Btrfs to a fully supported feature and it will be removed in a 
future major release of Red Hat Enterprise Linux.


The Btrfs file system did receive numerous updates from the upstream in 
Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat 
Enterprise Linux 7 series. However, this is the last planned update to 
this feature.


Red Hat will continue to invest in future technologies to address the 
use cases of our customers, specifically those related to snapshots, 
compression, NVRAM, and ease of use. We encourage feedback through your 
Red Hat representative on features and requirements you have for file 
systems and storage technology."


And this is a worst-case result of the fact that most distros added 
BTRFS support long before it was ready.


I'm betting some RH customer lost a lot of data because they didn't pay 
attention to the warnings and didn't do their research and were using 
raid5/6, and thus RH is considering it not worth investing in.  That, or 
they got fed up with the grandiose plans with no realistic timeline. 
There have been a number of cases of mishandled patches (chunk-level 
degraded check, anyone?), and a lot of features important in an 
enterprise-usage sense that have been proposed but, to a naive outsider, 
have seen little to no progress (hot-spare support, device failure 
detection and handling, higher-order replication, working erasure coding 
(raid56), etc.), and from both aspects I can understand them not wanting 
to deal with it.



Re: Massive loss of disk space

2017-08-02 Thread Austin S. Hemmelgarn

On 2017-08-02 00:14, Duncan wrote:

Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
excerpted:


I think I _might_ understand what's going on here.  Is that test program
calling fallocate using the desired total size of the file, or just
trying to allocate the range beyond the end to extend the file?  I've
seen issues with the first case on BTRFS before, and I'm starting to
think that it might actually be trying to allocate the exact amount of
space requested by fallocate, even if part of the range is already
allocated space.


If I've interpreted correctly (not being a dev, only a btrfs user,
sysadmin, and list regular) previous discussions I've seen on this list...

That's exactly what it's doing, and it's _intended_ behavior.

The reasoning is something like this:  fallocate is supposed to 
preallocate some space, with the intent that writes into that space 
won't fail, because the space is already allocated.

For an existing file with some data already in it, ext4 and xfs do that
counting the existing space.

But btrfs is copy-on-write, meaning it's going to have to write the new
data to a different location than the existing data, and it may well not
free up the existing allocation (if even a single 4k block of the
existing allocation remains unwritten, it will remain to hold down the
entire previous allocation, which isn't released until *none* of it is
still in use -- of course in normal usage "in use" can be due to old
snapshots or other reflinks to the same extent, as well, tho in these
test cases it's not).

So in order to provide the guarantee that writes to preallocated space 
shouldn't ENOSPC, btrfs can't count currently used space as part of the 
fallocate.

The different behavior is entirely due to btrfs being COW, and thus a
choice having to be made, do we worst-case fallocate-reserve for writes
over currently used data that will have to be COWed elsewhere, possibly
without freeing the existing extents because there's still something
referencing them, or do we risk ENOSPCing on write to a previously
fallocated area?

The choice was to worst-case-reserve and take the ENOSPC risk at fallocate
time, so the write into that fallocated space could then proceed without
the ENOSPC risk that COW would otherwise imply.

Make sense, or is my understanding a horrible misunderstanding? =:^)

Your reasoning is sound, except that, at least on older kernels (not 
sure if this is still the case), BTRFS will still perform a COW 
operation when updating a fallocated region.


So if you're actually only appending, fallocate the /additional/ space,
not the /entire/ space, and you'll get what you need.  But if you're
potentially overwriting what's there already, better to fallocate the entire
space, which triggers the btrfs worst-case allocation behavior you see,
in order to guarantee it won't ENOSPC during the actual write.

Of course the only time the behavior actually differs is with COW, but
then there's a BIG difference, but that BIG difference has a GOOD BIG
reason!  =:^)

Tho that difference will certainly necessitate some relearning the
/correct/ way to do it, for devs who were doing it the COW-worst-case way
all along, even if they didn't actually need to, because it didn't happen
to make a difference on what they happened to be testing on, which
happened not to be COW...

Reminds me of the way newer versions of gcc, or trying to build with
clang as well, tend to trigger relearning, because newer versions are
stricter in order to allow better optimization, and other
implementations are simply different in what they're strict on, /because/
they're a different implementation.  Well, btrfs is stricter... because
it's a different implementation that /has/ to be stricter... due to COW.

Except that this strictness breaks userspace programs that are doing 
perfectly reasonable things.




Re: Crashed filesystem, nothing helps

2017-08-02 Thread Thomas Wurfbaum
With the help of btrfs-corrupt-block I was able to get a little bit further.
I marked some of my problem blocks as corrupt.

Now I am at this stage:
mainframe:~ # btrfs restore /dev/sdb1 /mnt
parent transid verify failed on 29409280 wanted 1486829 found 1488801
parent transid verify failed on 29409280 wanted 1486829 found 1488801
parent transid verify failed on 29409280 wanted 1486829 found 1488801
parent transid verify failed on 29409280 wanted 1486829 found 1488801
Ignoring transid failure
parent transid verify failed on 29376512 wanted 1327723 found 1489835
parent transid verify failed on 29376512 wanted 1327723 found 1489835
parent transid verify failed on 29376512 wanted 1327723 found 1489835
parent transid verify failed on 29376512 wanted 1327723 found 1489835
Ignoring transid failure
parent transid verify failed on 29786112 wanted 1489835 found 1489871
parent transid verify failed on 29786112 wanted 1489835 found 1489871
parent transid verify failed on 29786112 wanted 1489835 found 1489871
parent transid verify failed on 29786112 wanted 1489835 found 1489871
Ignoring transid failure
leaf parent key incorrect 29786112
Error searching -1

Regards,
Thomas

-- 
Thomas Wurfbaum
Starkertshofen 15
85084 Reichertshofen

Tel.: +49-160-3696336
Mail: tho...@wurfbaum.net

Google+:http://google.com/+ThomasWurfbaum
Facebook: https://www.facebook.com/profile.php?id=16061335414
Xing: https://www.xing.com/profile/Thomas_Wurfbaum



Re: Crashed filesystem, nothing helps

2017-08-02 Thread Thomas Wurfbaum
Am Mittwoch, 2. August 2017, 11:31:41 CEST schrieb Roman Mamedov:
> Did it just abruptly exit there? Or you terminated it?

It abruptly stopped there.

Regards,
Thomas



Re: Crashed filesystem, nothing helps

2017-08-02 Thread Roman Mamedov
On Wed, 02 Aug 2017 11:17:04 +0200
Thomas Wurfbaum  wrote:
 
> A restore does also not help:
> mainframe:~ # btrfs restore /dev/sdb1 /mnt
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> Ignoring transid failure
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> Ignoring transid failure
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> Ignoring transid failure

Did it just abruptly exit there? Or you terminated it?

IIRC these messages (about ignoring) are not a problem for restore, it should
be able to continue. Or if not, it would print a more definitive error
message, e.g. "Couldn't read tree root" or such.

-- 
With respect,
Roman


Re: Crashed filesystem, nothing helps

2017-08-02 Thread Thomas Wurfbaum
Maybe you are right, but I just followed the SUSE guide:
https://en.opensuse.org/SDB:BTRFS
("How to repair a broken/unmountable btrfs filesystem")

I already tried mounting with the -o usebackuproot option (and
-o usebackuproot,ro as well), but it just produces this in dmesg:
[61054.470771] BTRFS info (device sdb1): trying to use backup root at mount 
time
[61054.470778] BTRFS info (device sdb1): disk space caching is enabled
[61054.470782] BTRFS info (device sdb1): has skinny extents
[61054.560876] BTRFS error (device sdb1): parent transid verify failed on 
29392896 wanted 1486833 found 1486836
[61054.563423] BTRFS error (device sdb1): parent transid verify failed on 
29392896 wanted 1486833 found 1486836
[61054.604057] BTRFS error (device sdb1): open_ctree failed
[61079.137435] BTRFS info (device sdb1): trying to use backup root at mount 
time
[61079.137443] BTRFS info (device sdb1): disk space caching is enabled
[61079.137445] BTRFS info (device sdb1): has skinny extents
[61079.227242] BTRFS error (device sdb1): parent transid verify failed on 
29392896 wanted 1486833 found 1486836
[61079.230087] BTRFS error (device sdb1): parent transid verify failed on 
29392896 wanted 1486833 found 1486836
[61079.260062] BTRFS error (device sdb1): open_ctree failed

And on the CLI I get the following:
mainframe:~ # mount -o usebackuproot,ro /dev/sdb1 /data
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

A restore does also not help:
mainframe:~ # btrfs restore /dev/sdb1 /mnt
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
Ignoring transid failure
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
Ignoring transid failure
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
Ignoring transid failure








Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Wang Shilong
I haven't seen active btrfs developers from Red Hat for some time; Red Hat
looks to have put most of its effort into XFS. It is time to switch to
SLES/openSUSE!


On Wed, Aug 2, 2017 at 4:38 PM, Brendan Hide  wrote:
> The title seems alarmist to me - and I suspect it is going to be
> misconstrued. :-/
>
> From the release notes at
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html
>
> "Btrfs has been deprecated
>
> The Btrfs file system has been in Technology Preview state since the initial
> release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a
> fully supported feature and it will be removed in a future major release of
> Red Hat Enterprise Linux.
>
> The Btrfs file system did receive numerous updates from the upstream in Red
> Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise
> Linux 7 series. However, this is the last planned update to this feature.
>
> Red Hat will continue to invest in future technologies to address the use
> cases of our customers, specifically those related to snapshots,
> compression, NVRAM, and ease of use. We encourage feedback through your Red
> Hat representative on features and requirements you have for file systems
> and storage technology."
>
>


Re: Crashed filesystem, nothing helps

2017-08-02 Thread Hugo Mills
On Wed, Aug 02, 2017 at 10:27:50AM +0200, Thomas Wurfbaum wrote:
> Hello,
> 
> Yesterday morning i recognized a hard reboot of my system, but the /data 
> filesystem was 
> not possible to mount.
> 
> 
> mainframe:~ # uname -a
> Linux mainframe 4.11.8-2-default #1 SMP PREEMPT Thu Jun 29 14:37:33 UTC 2017 
> (42bd7a0) x86_64 x86_64 x86_64 GNU/Linux
> mainframe:~ # btrfs --version
> btrfs-progs v4.10.2+20170406
> mainframe:~ # btrfs fi show
> Label: none  uuid: 2276-0885-4683-ac04-477c27cfab80
> Total devices 1 FS bytes used 2.88TiB
> devid1 size 4.53TiB used 2.92TiB path /dev/sdb1
> mainframe:~ # btrfs restore /dev/sdb1 /mnt 
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> Ignoring transid failure
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> Ignoring transid failure
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> Ignoring transid failure
> mainframe:~ # mount /dev/sdb1 /data 
> mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
>missing codepage or helper program, or other error
> 
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
> mainframe:~ # mount -o usebackuproot /dev/sdb1 /data
> mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
>missing codepage or helper program, or other error
> 
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
> mainframe:~ # btrfs check /dev/sdb1
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> Ignoring transid failure
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> Ignoring transid failure
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> Ignoring transid failure
> Checking filesystem on /dev/sdb1
> UUID: 2276-0885-4683-ac04-477c27cfab80
> checking extents
> parent transid verify failed on 290766848 wanted 1486826 found 1486085
> parent transid verify failed on 290766848 wanted 1486826 found 1486085
> parent transid verify failed on 290766848 wanted 1486826 found 1486085
> parent transid verify failed on 290766848 wanted 1486826 found 1486085
> Ignoring transid failure
> parent transid verify failed on 292339712 wanted 1486826 found 1486086
> parent transid verify failed on 292339712 wanted 1486826 found 1486086
> parent transid verify failed on 291078144 wanted 1486826 found 1486085
> parent transid verify failed on 291078144 wanted 1486826 found 1486085
> parent transid verify failed on 291078144 wanted 1486826 found 1486085
> parent transid verify failed on 291078144 wanted 1486826 found 1486085
> Ignoring transid failure
> parent transid verify failed on 292978688 wanted 1486826 found 1486086
> parent transid verify failed on 292978688 wanted 1486826 found 1486086
> parent transid verify failed on 292978688 wanted 1486826 found 1486086
> parent transid verify failed on 292978688 wanted 1486826 found 1486086
> Ignoring transid failure
> parent transid verify failed on 292519936 wanted 1486826 found 1486086
> parent transid verify failed on 292519936 wanted 1486826 found 1486086
> parent transid verify failed on 292536320 wanted 1486826 found 1486086
> parent transid verify failed on 292536320 wanted 1486826 found 1486086
> parent transid verify failed on 292552704 wanted 1486826 found 1486086
> parent transid verify failed on 292552704 wanted 1486826 found 1486086
> parent transid verify failed on 292585472 wanted 1486826 found 1486086
> parent transid verify failed on 292585472 wanted 1486826 found 1486086
> parent transid verify failed on 292585472 wanted 1486826 found 1486086
> 

RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Brendan Hide
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


From the release notes at 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html


"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the 
initial release of Red Hat Enterprise Linux 6. Red Hat will not be 
moving Btrfs to a fully supported feature and it will be removed in a 
future major release of Red Hat Enterprise Linux.


The Btrfs file system did receive numerous updates from the upstream in 
Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat 
Enterprise Linux 7 series. However, this is the last planned update to 
this feature.


Red Hat will continue to invest in future technologies to address the 
use cases of our customers, specifically those related to snapshots, 
compression, NVRAM, and ease of use. We encourage feedback through your 
Red Hat representative on features and requirements you have for file 
systems and storage technology."





[no subject]

2017-08-02 Thread Thomas Wurfbaum
subscribe linux-btrfs



Re: BTRFS error: bad tree block start 0 623771648

2017-08-02 Thread marcel.cochem
Thank you for the information. It indeed looks like some important
parts are all zeroed... I will try to hack on the code to get easy
access to some config directories.
I already reinstalled the operating system, now with DUP metadata, and
will take a deeper look at the discard flag.
The restore tools don't work out of the box because, as Liu Bo
mentioned, they check metadata and exit on an error.

Thanks for your support
marcel


On Tue, Aug 1, 2017 at 11:45 PM, Liu Bo  wrote:
> On Tue, Aug 01, 2017 at 11:04:10AM +0500, Roman Mamedov wrote:
>> On Mon, 31 Jul 2017 11:12:01 -0700
>> Liu Bo  wrote:
>>
>> > Superblock and chunk tree root is OK, looks like the header part of
>> > the tree root is now all-zero, but I'm unable to think of a btrfs bug
>> > which can lead to that (if there is, it is a serious enough one)
>>
>> I see that the FS is being mounted with "discard". So maybe it was a TRIM 
>> gone
>> bad (wrong location or in a wrong sequence).
>>
>
> By checking discard path in btrfs, looks OK to me, more likely it's
> caused by problems from underlying stuff.
>
> Thanks,
>
> -liubo
>
>> Generally it appears to be not recommended to use "discard" by now (because 
>> of
>> its performance impact, and maybe possible issues like this), instead 
>> schedule
>> to call "fstrim " once a day or so, and/or on boot-up.
>>
>> > on ssd like disks, by default there is only one copy for metadata.
>>
>> Time and time again, the default of "single" metadata for SSD is a terrible
>> idea. Most likely DUP metadata would save the FS in this case.
>>
>> --
>> With respect,
>> Roman


Re: Crashed filesystem, nothing helps

2017-08-02 Thread Thomas Wurfbaum
Please find attached my dmesg.log

Regards,
Thomas

[0.00] Linux version 4.11.8-2-default (geeko@buildhost) (gcc version 7.1.1 20170629 [gcc-7-branch revision 249772] (SUSE Linux) ) #1 SMP PREEMPT Thu Jun 29 14:37:33 UTC 2017 (42bd7a0)
[0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.8-2-default root=UUID=6b92e93a-86f2-4007-b374-4c7ad6a57063 resume=/dev/disk/by-id/scsi-1AMCC_J0827296748EC30067A8-part2 splash=silent quiet showopts
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009e7ff] usable
[0.00] BIOS-e820: [mem 0x0009e800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xbfea] usable
[0.00] BIOS-e820: [mem 0xbfeb-0xbfee2fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xbfee3000-0xbfee] ACPI data
[0.00] BIOS-e820: [mem 0xbfef-0xbfef] reserved
[0.00] BIOS-e820: [mem 0xf000-0xf3ff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00023fff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.4 present.
[0.00] DMI: System manufacturer System Product Name/M2N32 WS Professional, BIOS ASUS M2N32 WS Pro ACPI BIOS Revision 2001 05/05/2008
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] AGP: No AGP bridge found
[0.00] e820: last_pfn = 0x24 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C7FFF write-protect
[0.00]   C8000-F uncachable
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask FF8000 write-back
[0.00]   1 base 008000 mask FFC000 write-back
[0.00]   2 base 00BFF0 mask F0 uncachable
[0.00]   3 base 01 mask FF write-back
[0.00]   4 base 02 mask FFC000 write-back
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] TOM2: 00024000 aka 9216M
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- WT  
[0.00] e820: update [mem 0xbff0-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xbfeb0 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000f6040-0x000f604f] mapped at [8808000f6040]
[0.00] Scanning 1 areas for low memory corruption
[0.00] Base memory trampoline at [880800098000] 98000 size 24576
[0.00] BRK [0x1b3269000, 0x1b3269fff] PGTABLE
[0.00] BRK [0x1b326a000, 0x1b326afff] PGTABLE
[0.00] BRK [0x1b326b000, 0x1b326bfff] PGTABLE
[0.00] BRK [0x1b326c000, 0x1b326cfff] PGTABLE
[0.00] BRK [0x1b326d000, 0x1b326dfff] PGTABLE
[0.00] BRK [0x1b326e000, 0x1b326efff] PGTABLE
[0.00] BRK [0x1b326f000, 0x1b326] PGTABLE
[0.00] BRK [0x1b327, 0x1b3270fff] PGTABLE
[0.00] RAMDISK: [mem 0x36b91000-0x375b]
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F7F20 24 (v02 Nvidia)
[0.00] ACPI: XSDT 0xBFEE3100 54 (v01 Nvidia ASUSACPI 42302E31 AWRD )
[0.00] ACPI: FACP 0xBFEEB480 F4 (v03 Nvidia ASUSACPI 42302E31 AWRD )
[0.00] ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Pm1aEventBlock: 32/8 (20170119/tbfadt-603)
[0.00] ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Pm1aControlBlock: 16/8 (20170119/tbfadt-603)
[0.00] ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/PmTimerBlock: 32/8 (20170119/tbfadt-603)
[0.00] ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 64/8 (20170119/tbfadt-603)
[0.00] ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe1Block: 128/8 (20170119/tbfadt-603)
[0.00] ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aEventBlock: 8, using default 32 (20170119/tbfadt-708)
[0.00] ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 8, using default 16 (20170119/tbfadt-708)
[0.00] ACPI BIOS Warning (bug): Invalid length for FADT/PmTimerBlock: 8, using default 32 (20170119/tbfadt-708)
[0.00] ACPI: DSDT 0xBFEE3280 008189 (v01 NVIDIA AWRDACPI 1000 MSFT 0300)
[0.00] ACPI: FACS 0xBFEB 40
[0.00] ACPI: FACS 0xBFEB 40
[0.00] ACPI: TCPA 0xBFEEB6C0 32 (v01 HTC

Crashed filesystem, nothing helps

2017-08-02 Thread Thomas Wurfbaum
Hello,

Yesterday morning I noticed that my system had hard-rebooted, and since then
the /data filesystem cannot be mounted.


mainframe:~ # uname -a
Linux mainframe 4.11.8-2-default #1 SMP PREEMPT Thu Jun 29 14:37:33 UTC 2017 
(42bd7a0) x86_64 x86_64 x86_64 GNU/Linux
mainframe:~ # btrfs --version
btrfs-progs v4.10.2+20170406
mainframe:~ # btrfs fi show
Label: none  uuid: 2276-0885-4683-ac04-477c27cfab80
Total devices 1 FS bytes used 2.88TiB
devid1 size 4.53TiB used 2.92TiB path /dev/sdb1
mainframe:~ # btrfs restore /dev/sdb1 /mnt 
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
Ignoring transid failure
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
Ignoring transid failure
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
Ignoring transid failure
mainframe:~ # mount /dev/sdb1 /data 
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.
mainframe:~ # mount -o usebackuproot /dev/sdb1 /data
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.
mainframe:~ # btrfs check /dev/sdb1
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
parent transid verify failed on 29392896 wanted 1486833 found 1486836
Ignoring transid failure
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
parent transid verify failed on 29409280 wanted 1486829 found 1486833
Ignoring transid failure
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
parent transid verify failed on 29376512 wanted 1327723 found 1486833
Ignoring transid failure
Checking filesystem on /dev/sdb1
UUID: 2276-0885-4683-ac04-477c27cfab80
checking extents
parent transid verify failed on 290766848 wanted 1486826 found 1486085
parent transid verify failed on 290766848 wanted 1486826 found 1486085
parent transid verify failed on 290766848 wanted 1486826 found 1486085
parent transid verify failed on 290766848 wanted 1486826 found 1486085
Ignoring transid failure
parent transid verify failed on 292339712 wanted 1486826 found 1486086
parent transid verify failed on 292339712 wanted 1486826 found 1486086
parent transid verify failed on 291078144 wanted 1486826 found 1486085
parent transid verify failed on 291078144 wanted 1486826 found 1486085
parent transid verify failed on 291078144 wanted 1486826 found 1486085
parent transid verify failed on 291078144 wanted 1486826 found 1486085
Ignoring transid failure
parent transid verify failed on 292978688 wanted 1486826 found 1486086
parent transid verify failed on 292978688 wanted 1486826 found 1486086
parent transid verify failed on 292978688 wanted 1486826 found 1486086
parent transid verify failed on 292978688 wanted 1486826 found 1486086
Ignoring transid failure
parent transid verify failed on 292519936 wanted 1486826 found 1486086
parent transid verify failed on 292519936 wanted 1486826 found 1486086
parent transid verify failed on 292536320 wanted 1486826 found 1486086
parent transid verify failed on 292536320 wanted 1486826 found 1486086
parent transid verify failed on 292552704 wanted 1486826 found 1486086
parent transid verify failed on 292552704 wanted 1486826 found 1486086
parent transid verify failed on 292585472 wanted 1486826 found 1486086
parent transid verify failed on 292585472 wanted 1486826 found 1486086
parent transid verify failed on 292585472 wanted 1486826 found 1486086
parent transid verify failed on 292585472 wanted 1486826 found 1486086
Ignoring transid failure
parent transid verify failed on 290766848 wanted 1486826 found 1486085
Ignoring transid failure
leaf parent key incorrect 290766848
bad block 290766848

Re: [PATCH 0/2] More nritems range checking

2017-08-02 Thread Philipp Hahn
Hello,

On 02.06.2017 at 12:08, Philipp Hahn wrote:
> thank you for applying my last patch, but regarding my corrupted file system
> I found two other cases where btrfs crashes:
> - btrfs_del_items() was overlooked by me
> - deleting from an empty node
> 
> Find attached two patches to improve that.
> Please check hunk 2 of the second patch, as I'm unsure whether "mid == nritems"
> is valid.
> 
> (If someone can give me a hand on how to get my FS fixed again, I would
> appreciate that.)
> 
> Philipp Hahn (2):
>   btrfs-progs: Check slot + nr >= nritems overflow
>   btrfs-progs: Check nritems under-/overflow
> 
>  ctree.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)

Ping?

Philipp


[PATCH v3] btrfs: preserve i_mode if __btrfs_set_acl() fails

2017-08-02 Thread Ernesto A . Fernández
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by restoring the original mode bits if __btrfs_set_acl
fails.

Signed-off-by: Ernesto A. Fernández 
---
Please ignore the two previous versions, this is far simpler and has the
same effect. To Josef Bacik: thank you for your review, I'm sorry I
wasted your time.

 fs/btrfs/acl.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index 8d8370d..1ba49eb 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -114,13 +114,17 @@ static int __btrfs_set_acl(struct btrfs_trans_handle 
*trans,
 int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
int ret;
+   umode_t old_mode = inode->i_mode;
 
if (type == ACL_TYPE_ACCESS && acl) {
		ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
if (ret)
return ret;
}
-   return __btrfs_set_acl(NULL, inode, acl, type);
+   ret = __btrfs_set_acl(NULL, inode, acl, type);
+   if (ret)
+   inode->i_mode = old_mode;
+   return ret;
 }
 
 /*
-- 
2.1.4




Re: [PATCH] btrfs: verify_dir_item fails in replay_xattr_deletes

2017-08-02 Thread Nikolay Borisov


On  2.08.2017 08:35, Lu Fengqi wrote:
> From: Su Yue 
> 
> In replay_xattr_deletes(), the argument @slot of verify_dir_item()
> should be variable @i instead of path->slots[0].


This was already fixed by a patch from Filipe, titled:

[PATCH] Btrfs: fix dir item validation when replaying xattr deletes

> 
> The bug causes failure of generic/066 and shared/002 in xfstest.
> dmesg:
> [12507.810781] BTRFS critical (device dm-0): invalid dir item name len: 10
> [12507.811185] BTRFS: error (device dm-0) in btrfs_replay_log:2475: errno=-5 
> IO failure (Failed to recover log tree)
> [12507.811928] BTRFS error (device dm-0): cleaner transaction attach returned 
> -30
> [12507.821020] BTRFS error (device dm-0): open_ctree failed
> [12508.131526] BTRFS info (device dm-0): disk space caching is enabled
> [12508.132145] BTRFS info (device dm-0): has skinny extents
> [12508.136265] BTRFS critical (device dm-0): invalid dir item name len: 10
> [12508.136678] BTRFS: error (device dm-0) in btrfs_replay_log:2475: errno=-5 
> IO failure (Failed to recover log tree)
> [12508.137501] BTRFS error (device dm-0): cleaner transaction attach returned 
> -30
> [12508.147982] BTRFS error (device dm-0): open_ctree failed
> 
> Signed-off-by: Su Yue 
> ---
>  fs/btrfs/tree-log.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index f20ef211a73d..3a11ae63676e 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2153,8 +2153,7 @@ static int replay_xattr_deletes(struct 
> btrfs_trans_handle *trans,
>   u32 this_len = sizeof(*di) + name_len + data_len;
>   char *name;
>  
> - ret = verify_dir_item(fs_info, path->nodes[0],
> -   path->slots[0], di);
> + ret = verify_dir_item(fs_info, path->nodes[0], i, di);
>   if (ret) {
>   ret = -EIO;
>   goto out;
> 