Re: Btrfs/SSD

2017-05-15 Thread Duncan
Kai Krakow posted on Mon, 15 May 2017 21:12:06 +0200 as excerpted:

> On Mon, 15 May 2017 14:09:20 +0100, Tomasz Kusmierz wrote:
>> 
>> Not true. When an HDD uses 10% of its space as spares (10% is just an
>> easy example), then the alignment on disk is (US - used sector, SS -
>> spare sector, BS - bad sector)
>> 
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> 
>> if failure occurs - drive actually shifts sectors up:
>> 
>> US US US US US US US US US SS
>> US US US BS BS BS US US US US
>> US US US US US US US US US US
>> US US US US US US US US US US
>> US US US US US US US US US SS
>> US US US BS US US US US US US
>> US US US US US US US US US SS
>> US US US US US US US US US SS
> 
> This makes sense... "Reserve area" somehow implies it is contiguous and
> as such located at one far end of the platter. But your diagram totally
> makes sense.

Thanks Tomasz.  It makes a lot of sense indeed, and had I thought about
it I think I already "knew" it, but I simply hadn't stopped to think about
it that hard, so you disabused me of the vague idea of spares all at one
end of the disk, too. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH] btrfs-progs: tests: remove variable quotation from convert-tests

2017-05-15 Thread Lakshmipathi.G
Oops, sorry, I introduced those two issues in recent patches and
missed (skipped?) them while testing.  With the above patch, the 008/009
test cases are working fine now.  Thanks.
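(For anyone who wants to re-check locally, a quick sketch; the TEST= mask
and exact paths follow the usual tests/ harness conventions and may differ
in your tree:)

  # Run the whole convert suite from the top of the btrfs-progs tree:
  make test-convert

  # Or limit the run to the two affected cases via the runner's TEST mask:
  cd tests && TEST="00[89]-*" ./convert-tests.sh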

On 5/16/17, Tsutomu Itoh  wrote:
> In btrfs-progs-v4.11-rc1, the following convert-tests failed.
>
> [TEST/conv]   008-readonly-image
> [TEST/conv] readonly image test, btrfs defaults
> failed: mke2fs -t ext4 -b 4096 -F
> /Build/btrfs-progs-v4.11-rc1/tests/test.img
> test failed for case 008-readonly-image
> Makefile:271: recipe for target 'test-convert' failed
> make: *** [test-convert] Error 1
> [TEST/conv]   009-common-inode-flags
> [TEST/conv] common inode flags test, btrfs defaults
> failed: mke2fs -t ext4 -b 4096 -F
> /Build/btrfs-progs-v4.11-rc1/tests/test.img
> test failed for case 009-common-inode-flags
> Makefile:271: recipe for target 'test-convert' failed
> make: *** [test-convert] Error 1
>
> So, remove quotes from $default_mke2fs.
>
> Signed-off-by: Tsutomu Itoh 
> ---
>  tests/convert-tests/008-readonly-image/test.sh | 2 +-
>  tests/convert-tests/009-common-inode-flags/test.sh | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tests/convert-tests/008-readonly-image/test.sh
> b/tests/convert-tests/008-readonly-image/test.sh
> index b2f1ae37..27c9373e 100755
> --- a/tests/convert-tests/008-readonly-image/test.sh
> +++ b/tests/convert-tests/008-readonly-image/test.sh
> @@ -10,7 +10,7 @@ check_prereq btrfs-convert
>
>  default_mke2fs="mke2fs -t ext4 -b 4096"
>  convert_test_preamble '' 'readonly image test' 16k "$default_mke2fs"
> -convert_test_prep_fs "$default_mke2fs"
> +convert_test_prep_fs $default_mke2fs
>  run_check_umount_test_dev
>  convert_test_do_convert
>  run_check_mount_test_dev
> diff --git a/tests/convert-tests/009-common-inode-flags/test.sh
> b/tests/convert-tests/009-common-inode-flags/test.sh
> index a5828790..02823e14 100755
> --- a/tests/convert-tests/009-common-inode-flags/test.sh
> +++ b/tests/convert-tests/009-common-inode-flags/test.sh
> @@ -11,7 +11,7 @@ check_prereq btrfs-convert
>  fail=0
>  default_mke2fs="mke2fs -t ext4 -b 4096"
>  convert_test_preamble '' 'common inode flags test' 16k "$default_mke2fs"
> -convert_test_prep_fs "$default_mke2fs"
> +convert_test_prep_fs $default_mke2fs
>
>  # create file with specific flags
>  run_check $SUDO_HELPER touch "$TEST_MNT/flag_test"
> --
> 2.12.2
>
> 
> Tsutomu Itoh  t-i...@jp.fujitsu.com


-- 

Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org


[PATCH] btrfs-progs: tests: remove variable quotation from convert-tests

2017-05-15 Thread Tsutomu Itoh
In btrfs-progs-v4.11-rc1, the following convert-tests failed.

[TEST/conv]   008-readonly-image
[TEST/conv] readonly image test, btrfs defaults
failed: mke2fs -t ext4 -b 4096 -F /Build/btrfs-progs-v4.11-rc1/tests/test.img
test failed for case 008-readonly-image
Makefile:271: recipe for target 'test-convert' failed
make: *** [test-convert] Error 1
[TEST/conv]   009-common-inode-flags
[TEST/conv] common inode flags test, btrfs defaults
failed: mke2fs -t ext4 -b 4096 -F /Build/btrfs-progs-v4.11-rc1/tests/test.img
test failed for case 009-common-inode-flags
Makefile:271: recipe for target 'test-convert' failed
make: *** [test-convert] Error 1

So, remove the quotes from $default_mke2fs, letting the shell word-split
the value back into the mke2fs command and its options instead of passing
the whole string as a single argument.
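(For context, the general shell behaviour behind the fix, as a standalone
illustration rather than btrfs-progs test code: quoting the variable passes
the whole command line as one argument, the unquoted form is word-split.)

  default_mke2fs="mke2fs -t ext4 -b 4096"

  count_args() {
      echo "$# argument(s): $*"
  }

  count_args "$default_mke2fs"   # 1 argument:  "mke2fs -t ext4 -b 4096"
  count_args $default_mke2fs     # 5 arguments: mke2fs -t ext4 -b 4096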

Signed-off-by: Tsutomu Itoh 
---
 tests/convert-tests/008-readonly-image/test.sh | 2 +-
 tests/convert-tests/009-common-inode-flags/test.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/convert-tests/008-readonly-image/test.sh 
b/tests/convert-tests/008-readonly-image/test.sh
index b2f1ae37..27c9373e 100755
--- a/tests/convert-tests/008-readonly-image/test.sh
+++ b/tests/convert-tests/008-readonly-image/test.sh
@@ -10,7 +10,7 @@ check_prereq btrfs-convert
 
 default_mke2fs="mke2fs -t ext4 -b 4096"
 convert_test_preamble '' 'readonly image test' 16k "$default_mke2fs"
-convert_test_prep_fs "$default_mke2fs"
+convert_test_prep_fs $default_mke2fs
 run_check_umount_test_dev
 convert_test_do_convert
 run_check_mount_test_dev
diff --git a/tests/convert-tests/009-common-inode-flags/test.sh 
b/tests/convert-tests/009-common-inode-flags/test.sh
index a5828790..02823e14 100755
--- a/tests/convert-tests/009-common-inode-flags/test.sh
+++ b/tests/convert-tests/009-common-inode-flags/test.sh
@@ -11,7 +11,7 @@ check_prereq btrfs-convert
 fail=0
 default_mke2fs="mke2fs -t ext4 -b 4096"
 convert_test_preamble '' 'common inode flags test' 16k "$default_mke2fs"
-convert_test_prep_fs "$default_mke2fs"
+convert_test_prep_fs $default_mke2fs
 
 # create file with specific flags
 run_check $SUDO_HELPER touch "$TEST_MNT/flag_test"
-- 
2.12.2


Tsutomu Itoh  t-i...@jp.fujitsu.com


Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 22:05:05 +0200, Tomasz Torcz wrote:

> On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> >   
> > > It's worth noting also that on average, COW filesystems like BTRFS
> > > (or log-structured-filesystems will not benefit as much as
> > > traditional filesystems from SSD caching unless the caching is
> > > built into the filesystem itself, since they don't do in-place
> > > rewrites (so any new write by definition has to drop other data
> > > from the cache).  
> > 
> > Yes, I considered that, too. And when I tried, there was almost no
> > perceivable performance difference between bcache-writearound and
> > bcache-writeback. But the latency of performance improvement was
> > much longer in writearound mode, so I sticked to writeback mode.
> > Also, writing random data is faster because bcache will defer it to
> > background and do writeback in sector order. Sequential access is
> > passed around bcache anyway, harddisks are already good at that.  
> 
>   Let me add my 2 cents.  bcache-writearound does not cache writes
> on SSD, so there are less writes overall to flash.  It is said
> to prolong the life of the flash drive.
>   I've recently switched from bcache-writeback to bcache-writearound,
> because my SSD caching drive is at the edge of it's lifetime. I'm
> using bcache in following configuration:
> http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My SSD
> is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
> 
>   Now, according to
> http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> 120GB and 250GB warranty only covers 75 TBW (terabytes written).

According to your chart, all your data is written twice to bcache. It
may have been better to buy two drives, one per mirror. I don't think
that SSD firmware does deduplication - so the data really is written twice.

It may do compression, but that would be per-block compression rather
than streaming compression, so it won't help as a deduplicator here.
Also, due to the internal structure, compression would probably work
similarly to how zswap works: by combining compressed blocks into "buddy
blocks", so only compression ratios above 2:1 will merge compressed
blocks into single blocks. That won't be true for most of your data, so
effectively it has no overall effect. For this reason, I doubt that any
firmware bothers with compression - the gains are just too small versus
the management overhead and complexity it adds to the already
complicated FTL layer.


> My drive has  # smartctl -a /dev/sda  | grep LBA
> 241 Total_LBAs_Written  0x0032   099   099   000    Old_age   Always   -   136025596053

Doesn't this say 99% remaining? The threshold is far from being
reached...

I'm curious, what is Wear_Leveling_Count reporting?
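(It can be read with something along these lines; attribute names vary by
drive and smartctl version:)

  # Show the wear/endurance related vendor attributes for the caching SSD.
  smartctl -A /dev/sda | grep -E 'Wear_Leveling_Count|Total_LBAs_Written'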

> which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
> 
> [35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [35354.697516] sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current]
> [35354.697518] sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
> [35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 0c 30 82 9f 00 00 48 00
> [35354.697524] blk_update_request: I/O error, dev sda, sector 204505785
> 
> Above started appearing recently.  So, I was really surprised that:
> - this drive is only rated for 120 TBW
> - I went through this limit in only 2 years
> 
>   The workload is lightly utilised home server / media center.

I think bcache is a real SSD killer for drives around 120GB or below...
I saw similar wear on my previous small SSD after just one year. But I
never had a sense error because I took it out of service early. And I
switched to writearound, too.

I think the write pattern of bcache cannot be handled well by the FTL.
It behaves like a log-structured file system, with new writes only
appended, and garbage collection occasionally done by freeing complete
erase blocks. Maybe it could work better if btrfs could pass information
about freed blocks down to bcache; btrfs frees a lot of blocks due to
its COW nature.

I wonder whether this is already covered by turning on discard in
btrfs? Does anyone know?
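(One way to at least check whether the stacked device advertises discard
before experimenting with the mount option - a sketch, device and mount
point are placeholders:)

  # Non-zero DISC-GRAN / DISC-MAX columns mean the device accepts discards.
  lsblk --discard /dev/bcache0

  # If it does, discard can be tried per mount, or via periodic fstrim:
  mount -o remount,discard /mnt/pool
  fstrim -v /mnt/pool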


-- 
Regards,
Kai

Replies to list-only preferred.




Re: Btrfs/SSD

2017-05-15 Thread Tomasz Torcz
On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> 
> > It's worth noting also that on average, COW filesystems like BTRFS
> > (or log-structured-filesystems will not benefit as much as
> > traditional filesystems from SSD caching unless the caching is built
> > into the filesystem itself, since they don't do in-place rewrites (so
> > any new write by definition has to drop other data from the cache).
> 
> Yes, I considered that, too. And when I tried, there was almost no
> perceivable performance difference between bcache-writearound and
> bcache-writeback. But the latency of performance improvement was much
> longer in writearound mode, so I sticked to writeback mode. Also,
> writing random data is faster because bcache will defer it to
> background and do writeback in sector order. Sequential access is
> passed around bcache anyway, harddisks are already good at that.

  Let me add my 2 cents.  bcache-writearound does not cache writes
on the SSD, so there are fewer writes overall to flash.  It is said
to prolong the life of the flash drive.
  I've recently switched from bcache-writeback to bcache-writearound,
because my SSD caching drive is at the edge of its lifetime. I'm
using bcache in the following configuration:
https://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg
My SSD is a Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
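(For reference, switching an existing bcache device between modes is just
a sysfs write - a sketch assuming the device shows up as bcache0:)

  # The currently active mode is shown in [brackets].
  cat /sys/block/bcache0/bcache/cache_mode

  # Switch to writearound so writes bypass the SSD; takes effect immediately.
  echo writearound > /sys/block/bcache0/bcache/cache_mode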

  Now, according to
http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
the warranty for the 120GB and 250GB models only covers 75 TBW (terabytes written).
My drive has  # smartctl -a /dev/sda  | grep LBA
241 Total_LBAs_Written  0x0032   099   099   000    Old_age   Always   -   136025596053

which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
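(Checking that arithmetic as a one-liner:)

  # 136025596053 LBAs x 512 bytes each, in decimal terabytes:
  echo "scale=1; 136025596053 * 512 / 10^12" | bc    # -> 69.6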

[35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35354.697516] sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current]
[35354.697518] sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 0c 30 82 9f 00 00 48 00
[35354.697524] blk_update_request: I/O error, dev sda, sector 204505785

Above started appearing recently.  So, I was really surprised that:
- this drive is only rated for 120 TBW
- I went through this limit in only 2 years

  The workload is lightly utilised home server / media center.

-- 
Tomasz Torcz              Only gods can safely risk perfection,
xmpp: zdzich...@chrome.pl it's a dangerous thing for a man.  -- Alia



Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 08:03:48 -0400, "Austin S. Hemmelgarn" wrote:

> > That's why I don't trust any of my data to them. But I still want
> > the benefit of their speed. So I use SSDs mostly as frontend caches
> > to HDDs. This gives me big storage with fast access. Indeed, I'm
> > using bcache successfully for this. A warm cache is almost as fast
> > as native SSD (at least it feels almost that fast, it will be
> > slower if you threw benchmarks at it).  
> That's to be expected though, most benchmarks don't replicate actual 
> usage patterns for client systems, and using SSD's for caching with 
> bcache or dm-cache for most server workloads except a file server
> will usually get you a performance hit.

You mean "performance boost"? Almost every read-mostly server workload
should benefit... A file server may be the exact opposite...

Also, I think dm-cache and bcache work very differently and are not
directly comparable. Their benefit depends a lot on the applied workload.

If I remember right, dm-cache is more about keeping "hot data" in the
flash storage while bcache is more about reducing seeking. So dm-cache
optimizes for the higher throughput of SSDs while bcache optimizes for
the almost-zero seek overhead of SSDs. Depending on your underlying
storage, one or the other may even give zero benefit or worsen
performance - which is what I'd call a "performance hit"... I never
tried dm-cache, though. For reasons I don't remember exactly, I didn't
like something about how it's implemented; I think it was related to
crash recovery. I don't know if that still holds true with modern
kernels. It may have changed, but I never looked back to revise that
decision.


> It's worth noting also that on average, COW filesystems like BTRFS
> (or log-structured-filesystems will not benefit as much as
> traditional filesystems from SSD caching unless the caching is built
> into the filesystem itself, since they don't do in-place rewrites (so
> any new write by definition has to drop other data from the cache).

Yes, I considered that, too. And when I tried, there was almost no
perceivable performance difference between bcache-writearound and
bcache-writeback. But the latency of the performance improvement was
much longer in writearound mode, so I stuck with writeback mode. Also,
writing random data is faster because bcache will defer it to the
background and do the writeback in sector order. Sequential access is
passed around bcache anyway; harddisks are already good at that.

But of course, the COW nature of btrfs will lower the hit rate I can
get on writes. That's why I see no benefit in using bcache-writethrough
with btrfs.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 07:46:01 -0400, "Austin S. Hemmelgarn" wrote:

> On 2017-05-12 14:27, Kai Krakow wrote:
> > On Tue, 18 Apr 2017 15:02:42 +0200, Imran Geriskovan wrote:
> >  
> >> On 4/17/17, Austin S. Hemmelgarn  wrote:  
>  [...]  
> >>
> >> I'm trying to have a proper understanding of what "fragmentation"
> >> really means for an ssd and interrelation with wear-leveling.
> >>
> >> Before continuing lets remember:
> >> Pages cannot be erased individually, only whole blocks can be
> >> erased. The size of a NAND-flash page size can vary, and most
> >> drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have
> >> blocks of 128 or 256 pages, which means that the size of a block
> >> can vary between 256 KB and 4 MB.
> >> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> >>
> >> Lets continue:
> >> Since block sizes are between 256k-4MB, data smaller than this will
> >> "probably" will not be fragmented in a reasonably empty and trimmed
> >> drive. And for a brand new ssd we may speak of contiguous series
> >> of blocks.
> >>
> >> However, as drive is used more and more and as wear leveling
> >> kicking in (ie. blocks are remapped) the meaning of "contiguous
> >> blocks" will erode. So any file bigger than a block size will be
> >> written to blocks physically apart no matter what their block
> >> addresses says. But my guess is that accessing device blocks
> >> -contiguous or not- are constant time operations. So it would not
> >> contribute performance issues. Right? Comments?
> >>
> >> So your the feeling about fragmentation/performance is probably
> >> related with if the file is spread into less or more blocks. If #
> >> of blocks used is higher than necessary (ie. no empty blocks can be
> >> found. Instead lots of partially empty blocks have to be used
> >> increasing the total # of blocks involved) then we will notice
> >> performance loss.
> >>
> >> Additionally if the filesystem will gonna try something to reduce
> >> the fragmentation for the blocks, it should precisely know where
> >> those blocks are located. Then how about ssd block informations?
> >> Are they available and do filesystems use it?
> >>
> >> Anyway if you can provide some more details about your experiences
> >> on this we can probably have better view on the issue.  
> >
> > What you really want for SSD is not defragmented files but
> > defragmented free space. That increases life time.
> >
> > So, defragmentation on SSD makes sense if it cares more about free
> > space but not file data itself.
> >
> > But of course, over time, fragmentation of file data (be it meta
> > data or content data) may introduce overhead - and in btrfs it
> > probably really makes a difference if I scan through some of the
> > past posts.
> >
> > I don't think it is important for the file system to know where the
> > SSD FTL located a data block. It's just important to keep
> > everything nicely aligned with erase block sizes, reduce rewrite
> > patterns, and free up complete erase blocks as good as possible.
> >
> > Maybe such a process should be called "compaction" and not
> > "defragmentation". In the end, the more continuous blocks of free
> > space there are, the better the chance for proper wear leveling.  
> 
> There is one other thing to consider though.  From a practical 
> perspective, performance on an SSD is a function of the number of 
> requests and what else is happening in the background.  The second 
> aspect isn't easy to eliminate on most systems, but the first is
> pretty easy to mitigate by defragmenting data.
> 
> Reiterating the example I made elsewhere in the thread:
> Assume you have an SSD and storage controller that can use DMA to 
> transfer up to 16MB of data off of the disk in a single operation.
> If you need to load a 16MB file off of this disk and it's properly
> aligned (it usually will be with most modern filesystems if the
> partition is properly aligned) and defragmented, it will take exactly
> one operation (assuming that doesn't get interrupted).  By contrast,
> if you have 16 fragments of 1MB each, that will take at minimum 2
> operations, and more likely 15-16 (depends on where everything is
> on-disk, and how smart the driver is about minimizing the number of
> required operations).  Each request has some amount of overhead to
> set up and complete, so the first case (one single extent) will take
> less total time to transfer the data than the second one.
> 
> This particular effect actually impacts almost any data transfer, not 
> just pulling data off of an SSD (this is why jumbo frames are
> important for high-performance networking, and why a higher latency
> timer on the PCI bus will improve performance (but conversely
> increase latency)), even when fetching data from a traditional hard
> drive (but it's not very noticeable there unless your fragments are
> tightly grouped, 

Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
On Mon, 15 May 2017 14:09:20 +0100, Tomasz Kusmierz wrote:

> > Traditional hard drives usually do this too these days (they've
> > been under-provisioned since before SSD's existed), which is part
> > of why older disks tend to be noisier and slower (the reserved
> > space is usually at the far inside or outside of the platter, so
> > using sectors from there to replace stuff leads to long seeks).  
> 
> Not true. When an HDD uses 10% of its space as spares (10% is just an
> easy example), then the alignment on disk is (US - used sector, SS -
> spare sector, BS - bad sector)
> 
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> 
> if failure occurs - drive actually shifts sectors up:
> 
> US US US US US US US US US SS
> US US US BS BS BS US US US US
> US US US US US US US US US US
> US US US US US US US US US US
> US US US US US US US US US SS
> US US US BS US US US US US US
> US US US US US US US US US SS
> US US US US US US US US US SS

This makes sense... "Reserve area" somehow implies it is contiguous and
as such located at one far end of the platter. But your diagram totally
makes sense.


> that strategy is in place to actually mitigate the problem that
> you've described; it has actually been in place since drives were
> using PATA :) so if your drive gets noisier over time it's either a
> broken bearing or a demagnetised arm magnet causing it not to aim
> properly - so the drive has to readjust its position multiple times
> before hitting the right track

I can confirm that such drives usually do not get noisier unless
something is broken beyond just a few sectors. A faulty bearing in
notebook drives is the most common scenario I see. I always recommend
replacing such drives early because they will usually fail completely.
Such notebooks are good candidates for SSD replacements, btw. ;-)

The demagnetised arm magnet is an interesting error scenario - didn't
think of it. Thanks for the pointer.

But still, there's one noise you can easily identify as bad sectors:
when the drive starts clicking for 30 or more seconds while trying to
read data, usually also freezing the OS during that time. Such drives
can be "repaired" by rewriting the offending sectors (because they will
then be moved to the reserve area). But I guess it's best to replace
such a drive by that point anyway.

Earlier, back in PATA times, I often had hard disks exposing seemingly
bad sectors when power was cut while the drive was writing data. I
usually used dd to rewrite such sectors and the drive was as good as new
again - except that I maybe lost some file data. Luckily, modern drives
don't show such behavior. And SSDs have also learned to handle this...
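(The kind of rewrite I mean looked roughly like this - destructive by
design; the device name and sector number are placeholders taken from the
kernel/SMART error log:)

  # DANGER: zeroes the named sector and destroys whatever was stored there;
  # only do this for a sector you have already given up on. The rewrite
  # lets the drive remap it to the reserve area.
  BAD_SECTOR=204505785    # placeholder: take it from the error log
  dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek="$BAD_SECTOR" conv=fsync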


-- 
Regards,
Kai

Replies to list-only preferred.




Re: [PATCH v4 15/27] fs: retrofit old error reporting API onto new infrastructure

2017-05-15 Thread Jeff Layton
On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > Now that we have a better way to store and report errors that occur
> > during writeback, we need to convert the existing codebase to use it. We
> > could just adapt all of the filesystem code and related infrastructure
> > to the new API, but that's a lot of churn.
> > 
> > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > a drop-in replacement for mapping_set_error. Turn that function into a
> > simple wrapper around the new one.
> > 
> > Because we want to ensure that writeback errors are always reported at
> > fsync time, inject filemap_report_wb_error calls much closer to the
> > syscall boundary, in call_fsync.
> > 
> > For fsync calls (and things like the nfsd equivalent), we either return
> > the error that the fsync operation returns, or the one returned by
> > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > the latest value. This allows us to provide new fsync semantics that
> > will return errors that may have occurred previously and been viewed
> > via other file descriptors.
> > 
> > The final piece of the puzzle is what to do about filemap_check_errors
> > calls that are being called directly or via filemap_* functions. Here,
> > we must take a little "creative license".
> > 
> > Since we now handle advancing the file->f_wb_err value at the generic
> > filesystem layer, we no longer need those callers to clear errors out
> > of the mapping or advance an errseq_t.
> > 
> > A lot of the existing codebase relies on getting an error back
> > from those functions when there is a writeback problem, so we do still
> > want to have them report writeback errors somehow.
> > 
> > When reporting writeback errors, we will always report errors that have
> > occurred since a particular point in time. With the old writeback error
> > reporting, the time we used was "since it was last tested/cleared" which
> > is entirely arbitrary and potentially racy. Now, we can at least report
> > the latest error that has occurred since an arbitrary point in time
> > (represented as a sampled errseq_t value).
> > 
> > In the case where we don't have a struct file to work with, this patch
> > just has the wrappers sample the current mapping->wb_err value, and use
> > that as an arbitrary point from which to check for errors.
> 
> I think this is really dangerous and we shouldn't do this. You are quite
> likely to lose IO errors in such calls because you will ignore all errors
> that happened during previous background writeback or even for IO that
> managed to complete before we called filemap_fdatawait(). Maybe we need to
> keep the original set-clear-bit IO error reporting for these cases, until
> we can convert them to fdatawait_range_since()?
> 

Yes that is a danger.

In fact, late last week I was finally able to make my xfstest work with
btrfs, and started hitting oopses with it because of the way the error
reporting changed. I rolled up a patch to fix that (and it simplifies
the code some, IMO), but other callers of filemap_fdatawait may have
similar problems here.

I'll pick up working on this again in a more piecemeal way. I had
originally looked at doing that, but there are some problematic places
(e.g. buffer.c) where the code is shared by a bunch of different
filesystems, which makes it difficult.

We may end up needing some sort of FS_WB_ERRSEQ flag to do this
correctly, at least until the transition to errseq_t is done.

> > That's probably not "correct" in all cases, particularly in the case of
> > something like filemap_fdatawait, but I'm not sure it's any worse than
> > what we already have, and this gives us a basis from which to work.
> > 
> > A lot of those callers will likely want to change to a model where they
> > sample the errseq_t much earlier (perhaps when starting a transaction),
> > store it in an appropriate place and then use that value later when
> > checking to see if an error occurred.
> > 
> > That will almost certainly take some involvement from other subsystem
> > maintainers. I'm quite open to adding new API functions to help enable
> > this if that would be helpful, but I don't really want to do that until
> > I better understand what's needed.
> > 
> > Signed-off-by: Jeff Layton 
> 
> ...
> 
> > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > index 5f7317875a67..7ce13281925f 100644
> > --- a/fs/f2fs/file.c
> > +++ b/fs/f2fs/file.c
> > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> > start, loff_t end,
> > .nr_to_write = LONG_MAX,
> > .for_reclaim = 0,
> > };
> > +   errseq_t since = READ_ONCE(file->f_wb_err);
> >  
> > if (unlikely(f2fs_readonly(inode->i_sb)))
> > return 0;
> > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> > start, loff_t end,
> > }
> >  
> > ret = 

Re: [xfstests PATCH v2 2/3] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-05-15 Thread Darrick J. Wong
On Tue, May 09, 2017 at 12:12:44PM -0400, Jeff Layton wrote:
> The writeback error handling test requires that you put the journal on a
> separate device. This allows us to use dmerror to simulate data
> writeback failure, without affecting the journal.
> 
> xfs already has infrastructure for this (a'la $SCRATCH_LOGDEV), so wire
> up the ext4 code so that it can do the same thing when _scratch_mkfs is
> called.
> 
> Signed-off-by: Jeff Layton 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D
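(For reference, an xfstests local.config sketch that exercises the quoted
change - the device paths are placeholders:)

  # Hypothetical local.config; adjust devices and mount points to your setup.
  export FSTYP=ext4
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch
  export USE_EXTERNAL=yes
  export SCRATCH_LOGDEV=/dev/vdd   # journal lives here, so dmerror only hits data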

> ---
>  common/rc | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/common/rc b/common/rc
> index 257b1903359d..8b815d9c8c33 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -675,6 +675,9 @@ _scratch_mkfs_ext4()
>   local tmp=`mktemp`
>   local mkfs_status
>  
> + [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
> + $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
> + mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
>  
>   _scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 
> 1>$tmp.mkfsstd
>   mkfs_status=$?
> -- 
> 2.9.3
> 


[PATCHv2] btrfs-progs: Fix slot >= nritems

2017-05-15 Thread Philipp Hahn
Running "btrfsck --repair /dev/sdd2" crashed because, as can happen in
(corrupted) file systems, slot > nritems:
> (gdb) bt full
> #0  0x77020e71 in __memmove_sse2_unaligned_erms () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00438764 in btrfs_del_ptr (trans=, 
> root=0x6e4fe0, path=0x1d17880, level=0, slot=7)
> at ctree.c:2611
> parent = 0xcd96980
> nritems = 
> __func__ = "btrfs_del_ptr"
> #2  0x00421b15 in repair_btree (corrupt_blocks=, 
> root=) at cmds-check.c:3539
> key = {objectid = 77990592512, type = 168 '\250', offset = 16384}
> trans = 0x8f48c0
> path = 0x1d17880
> level = 0
> #3  check_fs_root (wc=, root_cache=, 
> root=) at cmds-check.c:3703
> corrupt = 0x1d17880
> corrupt_blocks = {root = {rb_node = 0x6e80c60}}
> path = {nodes = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, slots = {0, 
> 0, 0, 0, 0, 0, 0, 0}, locks = {0, 0,
> 0, 0, 0, 0, 0, 0}, reada = 0, lowest_level = 0, search_for_split 
> = 0, skip_check_block = 0}
> nrefs = {bytenr = {271663104, 271646720, 560021504, 0, 0, 0, 0, 0}, 
> refs = {1, 1, 1, 0, 0, 0, 0, 0}}
> wret = 215575372
> root_node = {cache = {rb_node = {__rb_parent_color = 0, rb_right = 
> 0x0, rb_left = 0x0}, objectid = 0,
> start = 0, size = 0}, root_cache = {root = {rb_node = 0x0}}, 
> inode_cache = {root = {
>   rb_node = 0x781c80}}, current = 0x819530, refs = 0}
> status = 215575372
> rec = 0x1
> #4  check_fs_roots (root_cache=0xcd96b6d, root=) at 
> cmds-check.c:3809
> path = {nodes = {0x6eed90, 0x6a2f40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
> slots = {18, 2, 0, 0, 0, 0, 0, 0},
>   locks = {0, 0, 0, 0, 0, 0, 0, 0}, reada = 0, lowest_level = 0, 
> search_for_split = 0,
>   skip_check_block = 0}
> key = {objectid = 323, type = 132 '\204', offset = 
> 18446744073709551615}
> wc = {shared = {root = {rb_node = 0x0}}, nodes = {0x0, 0x0, 
> 0x7fffe428, 0x0, 0x0, 0x0, 0x0, 0x0},
>   active_node = 2, root_level = 2}
> leaf = 0x6e4fe0
> tmp_root = 0x6e4fe0
> #5  0x004287c3 in cmd_check (argc=215575372, argv=0x1d17880) at 
> cmds-check.c:11521
> root_cache = {root = {rb_node = 0x98c2940}}
> info = 0x6927b0
> bytenr = 6891440
> tree_root_bytenr = 0
> uuidbuf = "f65ff1a1-76ef-456e-beb5-c6c3841e7534"
> num = 215575372
> readonly = 218080104
> qgroups_repaired = 0
> #6  0x0040a41f in main (argc=3, argv=0x7fffebe8) at btrfs.c:243
> cmd = 0x689868
> bname = 
> ret = 

In that case the count of remaining items (nritems - slot - 1) goes
negative. That negative count is then cast to an unsigned long len,
where it wraps around to a huge value and leads to the observed crash
in the memmove.

Change the tests before the move to handle only the non-corrupted case,
where slot < nritems.

This does not fix the corruption, but it allows btrfsck to finish
without crashing.

Signed-off-by: Philipp Hahn 
---
 ctree.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/ctree.c b/ctree.c
index 02c7180..798bde9 100644
--- a/ctree.c
+++ b/ctree.c
@@ -1487,7 +1487,8 @@ static int insert_ptr(struct btrfs_trans_handle *trans, 
struct btrfs_root
BUG();
if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
BUG();
-   if (slot != nritems) {
+   if (slot < nritems) {
+   /* shift the items */
memmove_extent_buffer(lower,
  btrfs_node_key_ptr_offset(slot + 1),
  btrfs_node_key_ptr_offset(slot),
@@ -2249,7 +2250,7 @@ split:
 
nritems = btrfs_header_nritems(leaf);
 
-   if (slot != nritems) {
+   if (slot < nritems) {
/* shift the items */
memmove_extent_buffer(leaf, btrfs_item_nr_offset(slot + 1),
  btrfs_item_nr_offset(slot),
@@ -2500,7 +2501,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
slot = path->slots[0];
BUG_ON(slot < 0);
 
-   if (slot != nritems) {
+   if (slot < nritems) {
unsigned int old_data = btrfs_item_end_nr(leaf, slot);
 
if (old_data < data_end) {
@@ -2603,7 +2604,8 @@ int btrfs_del_ptr(struct btrfs_root *root, struct 
btrfs_path *path,
int ret = 0;
 
nritems = btrfs_header_nritems(parent);
-   if (slot != nritems -1) {
+   if (slot < nritems -1) {
+   /* shift the items */
memmove_extent_buffer(parent,
  btrfs_node_key_ptr_offset(slot),
  btrfs_node_key_ptr_offset(slot + 1),
-- 
2.11.0



Re: [PATCH] Btrfs: tolerate errors if we have retried successfully

2017-05-15 Thread David Sterba
On Tue, May 09, 2017 at 12:40:53PM -0700, Liu Bo wrote:
> On Fri, May 05, 2017 at 06:52:45PM +0200, David Sterba wrote:
> > On Thu, Apr 13, 2017 at 06:11:56PM -0700, Liu Bo wrote:
> > > With raid1 profile, dio read isn't tolerating IO errors if read length is
> > > less than the stripe length (64K).
> > 
> > Can you please write more details why this is true? Some pointers to
> > code etc, I'm lost. Eg. where the errors is tolerated. Thanks.
> 
> Sure.
> 
> Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags &
> BTRFS_DIO_ORIG_BIO_SUBMITTED) is true.  If the underlying device returns error
> somehow, bio->bi_error has recorded that error.
> 
> If we could recover the correct data from another copy in profile 
> raid1/10/5/6,
> with btrfs_subio_endio_read() returning 0, bio would have the correct data in
> its vector, but bio->bi_error is not updated accordingly so that the following
> dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed.

Great, thanks. Please update the patch and resend.


Re: [PATCH] btrfs-progs: Fix slot >= nritems

2017-05-15 Thread David Sterba
On Mon, May 15, 2017 at 09:57:09AM +0200, Philipp Hahn wrote:
> Running "btrfsck --repair /dev/sdd2" crashed:
> > (gdb) bt full
> > #0  0x77020e71 in __memmove_sse2_unaligned_erms () from 
> > /lib/x86_64-linux-gnu/libc.so.6
> > No symbol table info available.
> > #1  0x00438764 in btrfs_del_ptr (trans=, 
> > root=0x6e4fe0, path=0x1d17880, level=0, slot=7)
> > at ctree.c:2611
> > parent = 0xcd96980
> > nritems = 
> > __func__ = "btrfs_del_ptr"
> > #2  0x00421b15 in repair_btree (corrupt_blocks=, 
> > root=) at cmds-check.c:3539
> > key = {objectid = 77990592512, type = 168 '\250', offset = 16384}
> > trans = 0x8f48c0
> > path = 0x1d17880
> > level = 0
> > #3  check_fs_root (wc=, root_cache=, 
> > root=) at cmds-check.c:3703
> > corrupt = 0x1d17880
> > corrupt_blocks = {root = {rb_node = 0x6e80c60}}
> > path = {nodes = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, slots = 
> > {0, 0, 0, 0, 0, 0, 0, 0}, locks = {0, 0,
> > 0, 0, 0, 0, 0, 0}, reada = 0, lowest_level = 0, 
> > search_for_split = 0, skip_check_block = 0}
> > nrefs = {bytenr = {271663104, 271646720, 560021504, 0, 0, 0, 0, 0}, 
> > refs = {1, 1, 1, 0, 0, 0, 0, 0}}
> > wret = 215575372
> > root_node = {cache = {rb_node = {__rb_parent_color = 0, rb_right = 
> > 0x0, rb_left = 0x0}, objectid = 0,
> > start = 0, size = 0}, root_cache = {root = {rb_node = 0x0}}, 
> > inode_cache = {root = {
> >   rb_node = 0x781c80}}, current = 0x819530, refs = 0}
> > status = 215575372
> > rec = 0x1
> > #4  check_fs_roots (root_cache=0xcd96b6d, root=) at 
> > cmds-check.c:3809
> > path = {nodes = {0x6eed90, 0x6a2f40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
> > slots = {18, 2, 0, 0, 0, 0, 0, 0},
> >   locks = {0, 0, 0, 0, 0, 0, 0, 0}, reada = 0, lowest_level = 0, 
> > search_for_split = 0,
> >   skip_check_block = 0}
> > key = {objectid = 323, type = 132 '\204', offset = 
> > 18446744073709551615}
> > wc = {shared = {root = {rb_node = 0x0}}, nodes = {0x0, 0x0, 
> > 0x7fffe428, 0x0, 0x0, 0x0, 0x0, 0x0},
> >   active_node = 2, root_level = 2}
> > leaf = 0x6e4fe0
> > tmp_root = 0x6e4fe0
> > #5  0x004287c3 in cmd_check (argc=215575372, argv=0x1d17880) at 
> > cmds-check.c:11521
> > root_cache = {root = {rb_node = 0x98c2940}}
> > info = 0x6927b0
> > bytenr = 6891440
> > tree_root_bytenr = 0
> > uuidbuf = "f65ff1a1-76ef-456e-beb5-c6c3841e7534"
> > num = 215575372
> > readonly = 218080104
> > qgroups_repaired = 0
> > #6  0x0040a41f in main (argc=3, argv=0x7fffebe8) at btrfs.c:243
> > cmd = 0x689868
> > bname = 
> > ret = 

The output from the debugging session is fine, but some textual
description is still needed to explain why the change is needed and
what the visible effects are.


Btrfs progs pre-release 4.11-rc1

2017-05-15 Thread David Sterba
Hi,

a pre-release has been tagged. The 4.11 release is going to be a small one,
just a handful of updates. I was too busy with 4.12 kernel patches. Something
will have to change regarding btrfs-progs management, as the number of
unreviewed and unmerged patches is not decreasing. I'll write more about that
after 4.11 is out.

Changes:
  * receive: fix handling empty stream with -e (multi-stream)
  * send dump: fix printing long file names
  * stability fixes for: dump-super, print-tree, check
  * option parser updates: global options (old xfstests will fail)
  * new and updated tests
  * documentation updates

ETA for 4.11 is in +2 days (2017-05-17).

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Christian Brauner (3):
  btrfs-progs: fix btrfs send & receive with -e flag
  btrfs-progs: tests: misc/018-receive use receive -e to terminate on end 
marker
  btrfs-progs: receiv: fail on first -ENODATA only

Christophe de Dinechin (3):
  btrfs-progs: check: disambiguate between cases where add_tree_backref 
fails
  btrfs-progs: check: prevent attempt to insert extent record with 
max_size==0
  btrfs-progs: check: make max_size consistent with nr

David Sterba (10):
  btrfs-progs: docs: mount options, enhance ssd/nossd
  btrfs-progs: rework option parser to use getopt for global options
  btrfs-progs: README: sort bug reports means by preference
  btrfs-progs: docs: btrfs-replace, fix typo
  btrfs-progs: balance: minor wording adjustment for full balance warning
  btrfs-progs: docs: update formatting and wording for btrfs(5)
  btrfs-progs: fssum: fix warning, include correct header for major()
  btrfs-progs: tests: fix misc/019-receive-clones-on-munted-subvol
  btrfs-progs: update CHANGES for v4.11
  Btrfs progs v4.11-rc1

Evan Danaher (1):
  btrfs-progs: send: always print a space after path in dump

Hans van Kranenburg (2):
  btrfs-progs: docs: Fix missing newline in man 5 btrfs
  btrfs-progs: docs: Fix newlines for man btrfstune

Lakshmipathi.G (3):
  btrfs-progs: misc-tests: Superblock corruption and recovery using backup
  btrfs-progs: tests: add variable quotation to fsck-tests
  btrfs-progs: tests: add variable quotation to convert-tests

Lu Fengqi (4):
  btrfs-progs: dump-super: check array_size in print_sys_chunk_array
  btrfs-progs: print-tree: add validation to print_chunk
  btrfs-progs: tests: fssum, fix memory leak
  btrfs-progs: tests: Fix fuzz-test for bko-161821.raw.txt

Qu Wenruo (6):
  btrfs-progs: Fix memory leak when 0 sized block group item is found
  btrfs-progs: check: Fix memory leak in check_chunks_and_extents
  btrfs-progs: Use more strict check to read out tree root
  btrfs-progs: check: Avoid reading beyond item boundary for inode_ref
  btrfs-progs: check: Avoid reading beyond item boundary for dir_item and 
dir_index
  btrfs-progs: print-tree: Add leaf flags and backref revision output

Su Yue (1):
  btrfs-progs: check: Fix heap use after free



Re: Btrfs/SSD

2017-05-15 Thread Tomasz Kusmierz

> Traditional hard drives usually do this too these days (they've been 
> under-provisioned since before SSD's existed), which is part of why older 
> disks tend to be noisier and slower (the reserved space is usually at the far 
> inside or outside of the platter, so using sectors from there to replace 
> stuff leads to long seeks).

Not true. When an HDD uses 10% of its space as spares (10% is just an easy
example), then the alignment on disk is (US - used sector, SS - spare sector, BS - bad sector)

US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS

if failure occurs - drive actually shifts sectors up:

US US US US US US US US US SS
US US US BS BS BS US US US US
US US US US US US US US US US
US US US US US US US US US US
US US US US US US US US US SS
US US US BS US US US US US US
US US US US US US US US US SS
US US US US US US US US US SS

that strategy is in place to actually mitigate the problem that you've
described; it has actually been in place since drives were using PATA :) so
if your drive gets noisier over time it's either a broken bearing or a
demagnetised arm magnet causing it not to aim properly - so the drive has to
readjust its position multiple times before hitting the right track.


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-15 Thread Austin S. Hemmelgarn

On 2017-05-15 04:14, Hugo Mills wrote:

On Sun, May 14, 2017 at 04:16:52PM -0700, Marc MERLIN wrote:

On Sun, May 14, 2017 at 09:21:11PM +, Hugo Mills wrote:

2) balance -musage=0
3) balance -musage=20


   In most cases, this is going to make ENOSPC problems worse, not
better. The reason for doing this kind of balance is to recover unused
space and allow it to be reallocated. The typical behaviour is that
data gets overallocated, and it's metadata which runs out. So, the
last thing you want to be doing is reducing the metadata allocation,
because that's the scarce resource.

   Also, I'd usually recommend using limit=n, where n is approximately
the amount of data overallocation (allocated space less used
space). It's much more controllable than usage.



Thanks for that.
So, would you just remove the balance -musage=20 altogether?


   Yes.
The advantages to doing that depend also on how much excess free space 
you have and what your usual usage is.  If you're balancing a filesystem 
for a mail server that has lots of free space, you may indeed want to 
re-balance metadata chunks regularly because you're likely to be 
rewriting significant amounts of metadata regularly.



As for limit= I'm not sure if it would be helpful since I run this
nightly. Anything that doesn't get done tonight due to limit, would be
done tomorrow?


   I'm suggesting limit= on its own. It's a fixed amount of work
compared to usage=, which may not do anything at all. For example,
it's perfectly possible to have a filesystem which is, say, 30% full,
and yet is still a fully-allocated filesystem with more than 20% of
every chunk used. In that case your usage= wouldn't balance anything,
and you'd still be left in the situation of risking ENOSPC from
running out of metadata.
FWIW, I normally use '-dusage=80 -mlimit=16' for my nightly balances. 
The usage filter at 80% means you won't waste time re-balancing full or 
mostly full chunks, and the limit filter of 16 takes on average about 5 
minutes on the consumer SSD's I have.
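(As a concrete sketch of that nightly job - the mount point is a
placeholder:)

  # Nightly balance: only rework data chunks that are at most 80% used and
  # cap metadata work at 16 chunks per run.
  MNT=/mnt/pool
  btrfs balance start -dusage=80 -mlimit=16 "$MNT"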


   All you need to do is ensure that you have enough unallocated space
for the metadata to expand into if it needs to. That's the ultimate
goal of all this.

   If you have SSDs, it may also be beneficial to use nossd as a mount
option, because that seems to have some pathology in overallocating
chunks in normal usage. Hans investigated this in detail a month or
two ago.
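(If you want to try that, it's just a mount option and can be flipped on a
live filesystem - the mount point is a placeholder:)

  # nossd disables the ssd allocation heuristics; a remount is enough.
  mount -o remount,nossd /mnt/pool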



Re: [PATCH 04/10] fs: Introduce RWF_NOWAIT

2017-05-15 Thread Jan Kara
On Thu 11-05-17 14:17:04, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> RWF_NOWAIT informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.
> 
> The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This
> is called by most filesystems, either through fsops.write_iter() or through
> the function defined by write_iter(). If not, we perform the check defined
> by .write_iter() which is called for direct IO specifically.
> 
> Filesystems xfs, btrfs and ext4 would be supported in the following patches.
...
> diff --git a/fs/aio.c b/fs/aio.c
> index 020fa0045e3c..34027b67e2f4 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   goto out_put_req;
>   }
>  
> + if ((req->common.ki_flags & IOCB_NOWAIT) &&
> + !(req->common.ki_flags & IOCB_DIRECT)) {
> + ret = -EOPNOTSUPP;
> + goto out_put_req;
> + }
> +
>   ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");

I think you also need to check here that the IO is a write, so that
NOWAIT reads don't silently pass.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Btrfs/SSD

2017-05-15 Thread Austin S. Hemmelgarn

On 2017-05-12 14:36, Kai Krakow wrote:

On Fri, 12 May 2017 15:02:20 +0200, Imran Geriskovan wrote:


On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:

FWIW, I'm in the market for SSDs ATM, and remembered this from a
couple weeks ago so went back to find it.  Thanks. =:^)

(I'm currently still on quarter-TB generation ssds, plus spinning
rust for the larger media partition and backups, and want to be rid
of the spinning rust, so am looking at half-TB to TB, which seems
to be the pricing sweet spot these days anyway.)


Since you are taking ssds to mainstream based on your experience,
I guess your perception of data retension/reliability is better than
that of spinning rust. Right? Can you eloborate?

Or an other criteria might be physical constraints of spinning rust
on notebooks which dictates that you should handle the device
with care when running.

What was your primary motivation other than performance?


Personally, I don't really trust SSDs so much. They are much more
robust when it comes to physical damage because there are no physical
parts. That's absolutely not my concern. Regarding this, I trust SSDs
better than HDDs.

My concern is with fail scenarios of some SSDs which die unexpected and
horribly. I found some reports of older Samsung SSDs which failed
suddenly and unexpected, and in a way that the drive completely died:
No more data access, everything gone. HDDs start with bad sectors and
there's a good chance I can recover most of the data except a few
sectors.
Older is the key here.  Some early SSD's did indeed behave like that, 
but most modern ones do generally show signs that they will fail in the 
near future.  There's also the fact that traditional hard drives _do_ 
fail like that sometimes, even without rough treatment.


When SSD blocks die, they are probably huge compared to a sector (256kB
to 4MB usually because that's erase block sizes). If this happens, the
firmware may decide to either allow read-only access or completely deny
access. There's another situation where dying storage chips may
completely mess up the firmware and there's no longer any access to
data.
I've yet to see an SSD that blocks user access to an erase block. 
Almost every one I've seen will instead rewrite the block (possibly with 
the corrupted data intact (that is, without mangling it further)) to one 
of the reserve blocks, and then just update its internal mapping so
that the old block doesn't get used and the new one points to the
right place.  Some of the really good SSD's even use erasure coding in 
the FTL for data verification instead of CRC's, so they can actually 
reconstruct the missing bits when they do this.


Traditional hard drives usually do this too these days (they've been 
under-provisioned since before SSD's existed), which is part of why 
older disks tend to be noisier and slower (the reserved space is usually 
at the far inside or outside of the platter, so using sectors from there 
to replace stuff leads to long seeks).


That's why I don't trust any of my data to them. But I still want the
benefit of their speed. So I use SSDs mostly as frontend caches to
HDDs. This gives me big storage with fast access. Indeed, I'm using
bcache successfully for this. A warm cache is almost as fast as native
SSD (at least it feels almost that fast, it will be slower if you threw
benchmarks at it).
That's to be expected though, most benchmarks don't replicate actual 
usage patterns for client systems, and using SSD's for caching with 
bcache or dm-cache for most server workloads except a file server will 
usually get you a performance hit.


It's worth noting also that on average, COW filesystems like BTRFS (or
log-structured filesystems) will not benefit as much as traditional
filesystems from SSD caching unless the caching is built into the
filesystem itself, since they don't do in-place rewrites (so any new
write by definition has to drop other data from the cache).



Re: [PATCH v4 21/27] mm: clean up error handling in write_one_page

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:24, Jeff Layton wrote:
> Don't try to check PageError since that's potentially racy and not
> necessarily going to be set after writepage errors out.
> 
> Instead, sample the mapping error early on, and use that value to tell
> us whether we got a writeback error since then.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/page-writeback.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index de0dbf12e2c1..1643456881b4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2373,11 +2373,12 @@ int do_writepages(struct address_space *mapping, 
> struct writeback_control *wbc)
>  int write_one_page(struct page *page)
>  {
>   struct address_space *mapping = page->mapping;
> - int ret = 0;
> + int ret = 0, ret2;
>   struct writeback_control wbc = {
>   .sync_mode = WB_SYNC_ALL,
>   .nr_to_write = 1,
>   };
> + errseq_t since = filemap_sample_wb_error(mapping);
>  
>   BUG_ON(!PageLocked(page));
>  
> @@ -2386,16 +2387,14 @@ int write_one_page(struct page *page)
>   if (clear_page_dirty_for_io(page)) {
>   get_page(page);
>   ret = mapping->a_ops->writepage(page, &wbc);
> - if (ret == 0) {
> + if (ret == 0)
>   wait_on_page_writeback(page);
> - if (PageError(page))
> - ret = -EIO;
> - }
>   put_page(page);
>   } else {
>   unlock_page(page);
>   }
> - return ret;
> + ret2 = filemap_check_wb_error(mapping, since);
> + return ret ? : ret2;
>  }
>  EXPORT_SYMBOL(write_one_page);
>  
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 22/27] jbd2: don't reset error in journal_finish_inode_data_buffers

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:25, Jeff Layton wrote:
> Now that we don't clear writeback errors after fetching them, there is
> no need to reset them. This is also potentially racy.
> 
> Signed-off-by: Jeff Layton 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/jbd2/commit.c | 13 ++---
>  1 file changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b6b194ec1b4f..4c6262652028 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -264,17 +264,8 @@ static int journal_finish_inode_data_buffers(journal_t 
> *journal,
>   jinode->i_flags |= JI_COMMIT_RUNNING;
>   spin_unlock(&journal->j_list_lock);
>   err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
> - if (err) {
> - /*
> -  * Because AS_EIO is cleared by
> -  * filemap_fdatawait_range(), set it again so
> -  * that user process can get -EIO from fsync().
> -  */
> - mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
> -
> - if (!ret)
> - ret = err;
> - }
> + if (err && !ret)
> + ret = err;
>   spin_lock(&journal->j_list_lock);
>   jinode->i_flags &= ~JI_COMMIT_RUNNING;
>   smp_mb();
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 19/27] buffer: set errors in mapping at the time that the error occurs

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:22, Jeff Layton wrote:
> I noticed on xfs that I could still sometimes get back an error on fsync
> on a fd that was opened after the error condition had been cleared.
> 
> The problem is that the buffer code sets the write_io_error flag and
> then later checks that flag to set the error in the mapping. That flag
> persists for quite a while, however. If the file is later opened with
> O_TRUNC, the buffers will then be invalidated and the mapping's error
> set such that a subsequent fsync will return error. I think this is
> incorrect, as there was no writeback between the open and fsync.
> 
> Add a new mark_buffer_write_io_error operation that sets the flag and
> the error in the mapping at the same time. Replace all calls to
> set_buffer_write_io_error with mark_buffer_write_io_error, and remove
> the places that check this flag in order to set the error in the
> mapping.
> 
> This sets the error in the mapping earlier, at the time that it's first
> detected.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Small nits below.

> @@ -354,7 +354,7 @@ void end_buffer_async_write(struct buffer_head *bh, int 
> uptodate)
>   } else {
>   buffer_io_error(bh, ", lost async page write");
>   mapping_set_error(page->mapping, -EIO);
> - set_buffer_write_io_error(bh);
> + mark_buffer_write_io_error(bh);

No need to call mapping_set_error() here when it gets called in
mark_buffer_write_io_error() again?

> @@ -1182,6 +1180,17 @@ void mark_buffer_dirty(struct buffer_head *bh)
>  }
>  EXPORT_SYMBOL(mark_buffer_dirty);
>  
> +void mark_buffer_write_io_error(struct buffer_head *bh)
> +{
> + set_buffer_write_io_error(bh);
> + /* FIXME: do we need to set this in both places? */
> + if (bh->b_page && bh->b_page->mapping)
> + mapping_set_error(bh->b_page->mapping, -EIO);
> + if (bh->b_assoc_map)
> + mapping_set_error(bh->b_assoc_map, -EIO);
> +}
> +EXPORT_SYMBOL(mark_buffer_write_io_error);

So buffers that are shared by several inodes cannot have bh->b_assoc_map
set. So for filesystems that have metadata like this, setting the error
in bh->b_assoc_map doesn't really help, and they have to check the block
device's mapping anyway. OTOH, if a filesystem doesn't have that type of
metadata relevant for fsync, this could help it. So maybe it is worth it.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Btrfs/SSD

2017-05-15 Thread Austin S. Hemmelgarn

On 2017-05-12 14:27, Kai Krakow wrote:

On Tue, 18 Apr 2017 15:02:42 +0200,
Imran Geriskovan  wrote:


On 4/17/17, Austin S. Hemmelgarn  wrote:

Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount
option actually does, I'm inclined to recommend that people who are
using high-end SSDs _NOT_ use it, as it will heavily increase
fragmentation and will likely have near-zero impact on actual device
lifetime (but may _hurt_ performance).  It will still probably help
with mid- and low-end SSDs.


I'm trying to have a proper understanding of what "fragmentation"
really means for an SSD and how it interrelates with wear-leveling.

Before continuing, let's remember: pages cannot be erased individually,
only whole blocks can be erased. The size of a NAND-flash page can vary,
and most drives have pages of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have
blocks of 128 or 256 pages, which means that the size of a block can
vary between 256 KB and 4 MB.
codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/

Let's continue: since block sizes are between 256 KB and 4 MB, data
smaller than this will "probably" not be fragmented on a reasonably
empty and trimmed drive. And for a brand new SSD we may speak of
contiguous series of blocks.

However, as the drive is used more and more and wear leveling kicks in
(i.e. blocks are remapped), the meaning of "contiguous blocks" will
erode. So any file bigger than a block will be written to blocks that
are physically apart, no matter what their block addresses say. But my
guess is that accessing device blocks - contiguous or not - is a
constant-time operation, so it should not contribute to performance
issues. Right? Comments?

So your feeling about fragmentation/performance is probably related to
whether the file is spread across fewer or more blocks. If the number of
blocks used is higher than necessary (i.e. no empty blocks can be found,
so lots of partially empty blocks have to be used, increasing the total
number of blocks involved), then we will notice a performance loss.

Additionally, if the filesystem is going to try to reduce fragmentation
at the block level, it needs to know precisely where those blocks are
located. So what about SSD block information - is it available, and do
filesystems use it?

Anyway, if you can provide some more details about your experiences with
this, we can probably get a better view of the issue.
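
On the question of SSD block information: as far as I know the FTL's
internal mapping is not exposed at all, but the erase-block-related
values the device itself reports to the kernel are; a quick way to look
at them, with /dev/sda standing in for the SSD in question:

  # What the device reports about its I/O topology and TRIM granularity
  lsblk --topology /dev/sda

  # The same values, straight from sysfs
  cat /sys/block/sda/queue/minimum_io_size
  cat /sys/block/sda/queue/optimal_io_size
  cat /sys/block/sda/queue/discard_granularity

Whether those reported values reflect the real internal page and erase
block sizes is up to the drive's firmware, so treat them as hints.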


What you really want for an SSD is not defragmented files but
defragmented free space. That increases lifetime.

So, defragmentation on an SSD makes sense if it cares more about free
space than about the file data itself.

But of course, over time, fragmentation of file data (be it metadata or
content data) may introduce overhead - and judging from some of the past
posts, in btrfs it probably really makes a difference.

I don't think it is important for the filesystem to know where the SSD
FTL has located a data block. It's just important to keep everything
nicely aligned with erase block sizes, reduce rewrite patterns, and free
up complete erase blocks as well as possible.

Maybe such a process should be called "compaction" rather than
"defragmentation". In the end, the more contiguous blocks of free space
there are, the better the chance for proper wear leveling.


There is one other thing to consider though.  From a practical 
perspective, performance on an SSD is a function of the number of 
requests and what else is happening in the background.  The second 
aspect isn't easy to eliminate on most systems, but the first is pretty 
easy to mitigate by defragmenting data.


Reiterating the example I made elsewhere in the thread:
Assume you have an SSD and storage controller that can use DMA to 
transfer up to 16MB of data off of the disk in a single operation.  If 
you need to load a 16MB file off of this disk and it's properly aligned 
(it usually will be with most modern filesystems if the partition is 
properly aligned) and defragmented, it will take exactly one operation 
(assuming that doesn't get interrupted).  By contrast, if you have 16 
fragments of 1MB each, that will take at minimum 2 operations, and more 
likely 15-16 (depends on where everything is on-disk, and how smart the 
driver is about minimizing the number of required operations).  Each 
request has some amount of overhead to set up and complete, so the first 
case (one single extent) will take less total time to transfer the data 
than the second one.
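
For what it's worth, the extent count that drives the request count in
that example is easy to inspect and to fix from userspace; a minimal
sketch, with the file and mount point as placeholder paths:

  # Show how many extents (fragments) a file occupies
  filefrag -v /mnt/data/bigfile.bin

  # Defragment a single file on btrfs
  btrfs filesystem defragment -v /mnt/data/bigfile.bin

  # Recursively defragment a directory, only rewriting extents smaller
  # than 16MiB (this rewrites data, so it breaks reflink/snapshot
  # sharing on the files it touches)
  btrfs filesystem defragment -r -t 16M /mnt/data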


This particular effect actually impacts almost any data transfer, not 
just pulling data off of an SSD (this is why jumbo frames are important 
for high-performance networking, and why a higher latency timer on the 
PCI bus will improve performance (but conversely increase latency)), 
even when fetching data from a traditional hard drive (but it's not very 
noticeable there unless your fragments are tightly grouped, because seek 
latency dominates).

Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-15 Thread Lionel Bouton
On 15/05/2017 at 10:14, Hugo Mills wrote:
> [...]
>> As for limit= I'm not sure if it would be helpful since I run this
>> nightly. Anything that doesn't get done tonight due to limit, would be
>> done tomorrow?
>I'm suggesting limit= on its own. It's a fixed amount of work
> compared to usage=, which may not do anything at all. For example,
> it's perfectly possible to have a filesystem which is, say, 30% full,
> and yet is still a fully-allocated filesystem, with more than 20% of
> every chunk used. In that case your usage= wouldn't balance anything,
> and you'd still be left in the situation of risking ENOSPC from
> running out of metadata.

Hugo, as I haven't had any feedback on my approach to this problem,
could you have a look at my script, or simply at the principle: is there
any drawback, compared to using limit=, in calling balance multiple
times with an increasing usage= value (using the same value for data and
metadata) until you get enough free space?

For reference:

https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb
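
The principle boils down to something like the following shell sketch
(the mount point, the target threshold and the output parsing are
illustrative only - the real logic lives in the Ruby script above):

  #!/bin/bash
  # Keep raising the usage filter until enough space is unallocated again.
  MNT=/mnt/data                        # example mount point
  WANT=$((10 * 1024 * 1024 * 1024))    # example target: 10GiB unallocated

  for usage in 10 20 30 40 50 60 70 80 90; do
      unalloc=$(btrfs filesystem usage -b "$MNT" | \
                awk '/Device unallocated:/ {print $3}')
      [ "$unalloc" -ge "$WANT" ] && break
      # Same filter value for data and metadata, as in the script
      btrfs balance start -dusage="$usage" -musage="$usage" "$MNT"
  done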


Lionel


Re: Btrfs/SSD

2017-05-15 Thread Imran Geriskovan
On 5/15/17, Tomasz Kusmierz  wrote:
> Theoretically all sectors in over provision are erased - practically they
> are either erased or waiting to be erased or broken.
> Over provisioned area does have more uses than that. For example if you have
> a 1TB drive where you store 500GB of data that you never modify -> SSD will
> copy part of that data to over provisioned area -> free sectors that were
> unwritten for a while -> free sectors that were continuously hammered by
> writes and write a static data there. This mechanism is wear levelling - it
> means that SSD internals make sure that sectors on SSD have an equal use
> over time. Despite of some thinking that it’s pointless imagine situation
> where you’ve got a 1TB drive with 1GB free and you keep writing and
> modifying data in this 1GB free … those sectors will quickly die due to
> short flash life expectancy ( some as short as 1k erases ! ).

Thanks for the info. It can be understood that the drive
has a pool of erase blocks, of which some portion (say 90-95%)
is provided as usable. Trimmed blocks are candidates
for new allocations. If the drive is not trimmed, that allocatable
pool becomes smaller than it could be, and new allocations
under the wear-levelling logic are made from a smaller group.
This will probably increase data traffic on that "small group"
of blocks, eating into their erase cycles.

However, this logic only holds if the drive does NOT move existing
data to trimmed/available blocks on its own.

Under some advanced wear-leveling operations, the drive may
decide to swap two blocks (one occupied, one vacant) if the
cumulative erase count of the former is much lower than that of
the latter, to provide some balancing effect.

Theoretically, swapping may even occur as the flash tends
to lose charge (and thus data), based on the age of the
data and/or block health.

But in any case, I understand that trimming gives the drive an
important degree of freedom and keeps it healthy.
Without trimming, the drive will continue to deal with worthless
blocks simply because it doesn't know they are worthless...
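
In practice that mostly comes down to making sure trim actually
happens; a minimal sketch, with the mount point as a placeholder:

  # One-off manual trim of a mounted filesystem
  fstrim -v /mnt/data

  # Periodic trim via the systemd timer shipped with util-linux on many
  # distributions
  systemctl enable --now fstrim.timer

  # Or mount with continuous discard (can be slower on some drives than
  # a periodic fstrim); example /etc/fstab line:
  # UUID=<fs-uuid>  /mnt/data  btrfs  defaults,discard  0  0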


Re: [PATCH v4 15/27] fs: retrofit old error reporting API onto new infrastructure

2017-05-15 Thread Jan Kara
On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> Now that we have a better way to store and report errors that occur
> during writeback, we need to convert the existing codebase to use it. We
> could just adapt all of the filesystem code and related infrastructure
> to the new API, but that's a lot of churn.
> 
> When it comes to setting errors in the mapping, filemap_set_wb_error is
> a drop-in replacement for mapping_set_error. Turn that function into a
> simple wrapper around the new one.
> 
> Because we want to ensure that writeback errors are always reported at
> fsync time, inject filemap_report_wb_error calls much closer to the
> syscall boundary, in call_fsync.
> 
> For fsync calls (and things like the nfsd equivalent), we either return
> the error that the fsync operation returns, or the one returned by
> filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> the latest value. This allows us to provide new fsync semantics that
> will return errors that may have occurred previously and been viewed
> via other file descriptors.
> 
> The final piece of the puzzle is what to do about filemap_check_errors
> calls that are being called directly or via filemap_* functions. Here,
> we must take a little "creative license".
> 
> Since we now handle advancing the file->f_wb_err value at the generic
> filesystem layer, we no longer need those callers to clear errors out
> of the mapping or advance an errseq_t.
> 
> A lot of the existing codebase relies on getting an error back
> from those functions when there is a writeback problem, so we do still
> want to have them report writeback errors somehow.
> 
> When reporting writeback errors, we will always report errors that have
> occurred since a particular point in time. With the old writeback error
> reporting, the time we used was "since it was last tested/cleared" which
> is entirely arbitrary and potentially racy. Now, we can at least report
> the latest error that has occurred since an arbitrary point in time
> (represented as a sampled errseq_t value).
> 
> In the case where we don't have a struct file to work with, this patch
> just has the wrappers sample the current mapping->wb_err value, and use
> that as an arbitrary point from which to check for errors.

I think this is really dangerous and we shouldn't do this. You are quite
likely to lose IO errors in such calls because you will ignore all errors
that happened during previous background writeback or even for IO that
managed to complete before we called filemap_fdatawait(). Maybe we need to
keep the original set-clear-bit IO error reporting for these cases, until
we can convert them to fdatawait_range_since()?

> That's probably not "correct" in all cases, particularly in the case of
> something like filemap_fdatawait, but I'm not sure it's any worse than
> what we already have, and this gives us a basis from which to work.
> 
> A lot of those callers will likely want to change to a model where they
> sample the errseq_t much earlier (perhaps when starting a transaction),
> store it in an appropriate place and then use that value later when
> checking to see if an error occurred.
> 
> That will almost certainly take some involvement from other subsystem
> maintainers. I'm quite open to adding new API functions to help enable
> this if that would be helpful, but I don't really want to do that until
> I better understand what's needed.
> 
> Signed-off-by: Jeff Layton 

...

> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 5f7317875a67..7ce13281925f 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> start, loff_t end,
>   .nr_to_write = LONG_MAX,
>   .for_reclaim = 0,
>   };
> + errseq_t since = READ_ONCE(file->f_wb_err);
>  
>   if (unlikely(f2fs_readonly(inode->i_sb)))
>   return 0;
> @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t 
> start, loff_t end,
>   }
>  
>   ret = wait_on_node_pages_writeback(sbi, ino);
> + if (ret == 0)
> + ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
>   if (ret)
>   goto out;

So this conversion looks wrong and actually points to a larger issue with
the scheme. The problem is that there are two mappings that come into play
here: file_inode(file)->i_mapping, which is the data mapping, and
NODE_MAPPING(sbi), which is the metadata mapping (and this is not a problem
specific to f2fs; for example ext2 also uses this scheme, where the block
device's mapping is the metadata mapping). And we need to merge error
information from these two mappings, so for the stamping scheme to work
we'd need two stamps stored in struct file: one for the data mapping and
one for the metadata mapping. Or maybe there's some more clever scheme, but
for now I don't see one...

Honza
-- 
Jan Kara 

Re: [PATCH v2] btrfs-progs: btrfs-convert: Add larger device support

2017-05-15 Thread Lakshmipathi.G
On Mon, May 15, 2017 at 09:40:29AM +0800, Qu Wenruo wrote:
> >bug: https://bugzilla.kernel.org/show_bug.cgi?id=194795
> 
> Errr, it seems that you forgot to update ext2_open_fs() to update how we get
> cctx->block_counts.
> 
> Without that update, we still get wrong total size of original fs, so
> converted image will be corrupted.
> 
> In this 22T case, it can't pass convert test since after conversion,
> converted image can't pass e2fsck.
> 
> Thanks,
> Qu
> 

Thanks for pointing out the issue. So we need to update 
common.h/cctx->block_count from u32 to u64?

do we also need to change other fields like inodes_count and
free_inode_count?

Cheers.
Lakshmipathi.G


[PATCH 3/5] btrfs-progs: check: Reuse btrfs_check_chunk_valid in lowmem mode

2017-05-15 Thread Qu Wenruo
Before this patch, btrfs check lowmem mode manually checks the found
chunk item, even though we already have the generic chunk validation
checker, btrfs_check_chunk_valid().

This patch will use btrfs_check_chunk_valid() to replace open-coded
chunk validation checker in check_chunk_item().

Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 24 ++--
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 17b7efbf..58b73608 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -0,11 +0,9 @@ static int check_chunk_item(struct btrfs_fs_info 
*fs_info,
struct btrfs_block_group_item *bi;
struct btrfs_block_group_item bg_item;
struct btrfs_dev_extent *ptr;
-   u32 sectorsize = btrfs_super_sectorsize(fs_info->super_copy);
u64 length;
u64 chunk_end;
u64 type;
-   u64 profile;
int num_stripes;
u64 offset;
u64 objectid;
@@ -11126,25 +11124,15 @@ static int check_chunk_item(struct btrfs_fs_info 
*fs_info,
chunk = btrfs_item_ptr(eb, slot, struct btrfs_chunk);
length = btrfs_chunk_length(eb, chunk);
chunk_end = chunk_key.offset + length;
-   if (!IS_ALIGNED(length, sectorsize)) {
-   error("chunk[%llu %llu) not aligned to %u",
-   chunk_key.offset, chunk_end, sectorsize);
-   err |= BYTES_UNALIGNED;
+   ret = btrfs_check_chunk_valid(extent_root, eb, chunk, slot,
+ chunk_key.offset);
+   if (ret < 0) {
+   error("chunk[%llu %llu) is invalid", chunk_key.offset,
+   chunk_end);
+   err |= BYTES_UNALIGNED | UNKNOWN_TYPE;
goto out;
}
-
type = btrfs_chunk_type(eb, chunk);
-   profile = type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
-   if (!(type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
-   error("chunk[%llu %llu) has no chunk type",
-   chunk_key.offset, chunk_end);
-   err |= UNKNOWN_TYPE;
-   }
-   if (profile && (profile & (profile - 1))) {
-   error("chunk[%llu %llu) multiple profiles detected: %llx",
-   chunk_key.offset, chunk_end, profile);
-   err |= UNKNOWN_TYPE;
-   }
 
bg_key.objectid = chunk_key.offset;
bg_key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
-- 
2.12.2





[PATCH 5/5] btrfs-progs: lowmem check: Fix false alert on missing chunk or dev extent

2017-05-15 Thread Qu Wenruo
When checking a chunk or dev extent, lowmem mode uses the chunk length
as the dev extent length, and if they mismatch, it reports a missing
chunk or dev extent like:
--
ERROR: chunk[256 4324327424) stripe 0 did not find the related dev extent
ERROR: chunk[256 4324327424) stripe 1 did not find the related dev extent
ERROR: chunk[256 4324327424) stripe 2 did not find the related dev extent
--

However, the chunk length is the same as the dev extent length only for
the single/DUP/RAID1 profiles.
For other profiles, this causes tons of false alerts.

Fix it by using correct stripe length when checking chunk and dev extent
items.

Reported-by: Kai Krakow 
Signed-off-by: Qu Wenruo 
---
Test cases for this bug will follow as another patchset, since we need
some infrastructure update to support multi-device mkfs/check/mount.
---
 cmds-check.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 58b73608..eb3eb9e9 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -10872,7 +10872,12 @@ static int check_dev_extent_item(struct btrfs_fs_info 
*fs_info,
 
l = path.nodes[0];
chunk = btrfs_item_ptr(l, path.slots[0], struct btrfs_chunk);
-   if (btrfs_chunk_length(l, chunk) != length)
+   ret = btrfs_check_chunk_valid(chunk_root, l, chunk, path.slots[0],
+ chunk_key.offset);
+   if (ret < 0)
+   goto out;
+
+   if (btrfs_stripe_length(fs_info, l, chunk) != length)
goto out;
 
num_stripes = btrfs_chunk_num_stripes(l, chunk);
@@ -2,6 +7,7 @@ static int check_chunk_item(struct btrfs_fs_info 
*fs_info,
struct btrfs_dev_extent *ptr;
u64 length;
u64 chunk_end;
+   u64 stripe_len;
u64 type;
int num_stripes;
u64 offset;
@@ -11161,6 +11167,7 @@ static int check_chunk_item(struct btrfs_fs_info 
*fs_info,
}
 
num_stripes = btrfs_chunk_num_stripes(eb, chunk);
+   stripe_len = btrfs_stripe_length(fs_info, eb, chunk);
for (i = 0; i < num_stripes; i++) {
btrfs_release_path(&path);
btrfs_init_path(&path);
@@ -11180,7 +11187,7 @@ static int check_chunk_item(struct btrfs_fs_info 
*fs_info,
offset = btrfs_dev_extent_chunk_offset(leaf, ptr);
if (objectid != chunk_key.objectid ||
offset != chunk_key.offset ||
-   btrfs_dev_extent_length(leaf, ptr) != length)
+   btrfs_dev_extent_length(leaf, ptr) != stripe_len)
goto not_match_dev;
continue;
 not_match_dev:
-- 
2.12.2





[PATCH 4/5] btrfs-progs: Introduce function to get correct stripe length

2017-05-15 Thread Qu Wenruo
Introduce a new function, btrfs_stripe_length(), to get the correct
stripe length.
This is very handy for lowmem mode, which checks the mapping between
device extent and chunk item.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 44 
 volumes.h |  3 +++
 2 files changed, 47 insertions(+)

diff --git a/volumes.c b/volumes.c
index 1e352869..8f529051 100644
--- a/volumes.c
+++ b/volumes.c
@@ -2284,3 +2284,47 @@ out:
 
return ret;
 }
+
+/*
+ * Get stripe length from chunk item and its stripe items
+ *
+ * Caller should only call this function after validating the chunk item
+ * by using btrfs_check_chunk_valid().
+ */
+u64 btrfs_stripe_length(struct btrfs_fs_info *fs_info,
+   struct extent_buffer *leaf,
+   struct btrfs_chunk *chunk)
+{
+   u64 stripe_len;
+   u64 chunk_len;
+   u32 num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+   u64 profile = btrfs_chunk_type(leaf, chunk) &
+ BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+   chunk_len = btrfs_chunk_length(leaf, chunk);
+
+   switch (profile) {
+   case 0: /* Single profile */
+   case BTRFS_BLOCK_GROUP_RAID1:
+   case BTRFS_BLOCK_GROUP_DUP:
+   stripe_len = chunk_len;
+   break;
+   case BTRFS_BLOCK_GROUP_RAID0:
+   stripe_len = chunk_len / num_stripes;
+   break;
+   case BTRFS_BLOCK_GROUP_RAID5:
+   stripe_len = chunk_len / (num_stripes - 1);
+   break;
+   case BTRFS_BLOCK_GROUP_RAID6:
+   stripe_len = chunk_len / (num_stripes - 2);
+   break;
+   case BTRFS_BLOCK_GROUP_RAID10:
+   stripe_len = chunk_len / (num_stripes /
+   btrfs_chunk_sub_stripes(leaf, chunk));
+   break;
+   default:
+   /* Invalid chunk profile found */
+   BUG_ON(1);
+   }
+   return stripe_len;
+}
diff --git a/volumes.h b/volumes.h
index 699b0bae..fc0a775b 100644
--- a/volumes.h
+++ b/volumes.h
@@ -246,4 +246,7 @@ int btrfs_check_chunk_valid(struct btrfs_root *root,
struct extent_buffer *leaf,
struct btrfs_chunk *chunk,
int slot, u64 logical);
+u64 btrfs_stripe_length(struct btrfs_fs_info *fs_info,
+   struct extent_buffer *leaf,
+   struct btrfs_chunk *chunk);
 #endif
-- 
2.12.2





[PATCH 1/5] btrfs-progs: Cleanup open-coded btrfs_chunk_item_size

2017-05-15 Thread Qu Wenruo
In btrfs_check_chunk_valid() we calculate the chunk item size using
open code.

Use btrfs_chunk_item_size() to replace that.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/volumes.c b/volumes.c
index b350e259..62e23aee 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1685,6 +1685,7 @@ int btrfs_check_chunk_valid(struct btrfs_root *root,
u16 num_stripes;
u16 sub_stripes;
u64 type;
+   u32 chunk_ondisk_size;
 
length = btrfs_chunk_length(leaf, chunk);
stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
@@ -1724,16 +1725,16 @@ int btrfs_check_chunk_valid(struct btrfs_root *root,
BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
return -EIO;
}
+
+   chunk_ondisk_size = btrfs_chunk_item_size(num_stripes);
/*
 * Btrfs_chunk contains at least one stripe, and for sys_chunk
 * it can't exceed the system chunk array size
 * For normal chunk, it should match its chunk item size.
 */
if (num_stripes < 1 ||
-   (slot == -1 && sizeof(struct btrfs_stripe) * num_stripes >
-BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) ||
-   (slot >= 0 && sizeof(struct btrfs_stripe) * (num_stripes - 1) >
-btrfs_item_size_nr(leaf, slot))) {
+   (slot == -1 && chunk_ondisk_size > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) ||
+   (slot >= 0 && chunk_ondisk_size > btrfs_item_size_nr(leaf, slot))) {
error("invalid num_stripes: %u", num_stripes);
return -EIO;
}
-- 
2.12.2





[PATCH 2/5] btrfs-progs: Enhance chunk item validation check

2017-05-15 Thread Qu Wenruo
btrfs_check_chunk_valid() doesn't check whether:
1) the chunk type has conflicting flags
   For example, chunk type DATA|METADATA|RAID1|RAID10 is completely
   invalid, yet the current btrfs_check_chunk_valid() can't detect it.
2) num_stripes is invalid for RAID10
   A num_stripes of 5 is not valid for RAID10.

This patch will enhance btrfs_check_chunk_valid() to handle above cases.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 62e23aee..1e352869 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1725,6 +1725,19 @@ int btrfs_check_chunk_valid(struct btrfs_root *root,
BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
return -EIO;
}
+   if (!(type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
+   error("missing chunk type flag: %llu", type);
+   return -EIO;
+   }
+   if (!is_power_of_2(type & BTRFS_BLOCK_GROUP_PROFILE_MASK)) {
+   error("conflicting chunk type detected: %llu", type);
+   return -EIO;
+   }
+   if ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) &&
+   !is_power_of_2(type & BTRFS_BLOCK_GROUP_PROFILE_MASK)) {
+   error("conflicting chunk profile detected: %llu", type);
+   return -EIO;
+   }
 
chunk_ondisk_size = btrfs_chunk_item_size(num_stripes);
/*
@@ -1741,7 +1754,8 @@ int btrfs_check_chunk_valid(struct btrfs_root *root,
/*
 * Device number check against profile
 */
-   if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes == 0) ||
+   if ((type & BTRFS_BLOCK_GROUP_RAID10 && (sub_stripes != 2 ||
+ !IS_ALIGNED(num_stripes, sub_stripes))) ||
(type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
(type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
(type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
-- 
2.12.2





Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-15 Thread Hugo Mills
On Sun, May 14, 2017 at 04:16:52PM -0700, Marc MERLIN wrote:
> On Sun, May 14, 2017 at 09:21:11PM +, Hugo Mills wrote:
> > > 2) balance -musage=0
> > > 3) balance -musage=20
> > 
> >In most cases, this is going to make ENOSPC problems worse, not
> > better. The reason for doign this kind of balance is to recover unused
> > space and allow it to be reallocated. The typical behaviour is that
> > data gets overallocated, and it's metadata which runs out. So, the
> > last thing you want to be doing is reducing the metadata allocation,
> > because that's the scarce resource.
> > 
> >Also, I'd usually recommend using limit=n, where n is approximately
> > the amount of data overallcation (allocated space less used
> > space). It's much more controllable than usage.
> 
> 
> Thanks for that.
> So, would you just remove the balance -musage=20 altogether?

   Yes.

> As for limit= I'm not sure if it would be helpful since I run this
> nightly. Anything that doesn't get done tonight due to limit, would be
> done tomorrow?

   I'm suggesting limit= on its own. It's a fixed amount of work
compared to usage=, which may not do anything at all. For example,
it's perfectly possible to have a filesystem which is, say, 30% full,
and yet is still a fully-allocated filesystem, with more than 20% of
every chunk used. In that case your usage= wouldn't balance anything,
and you'd still be left in the situation of risking ENOSPC from
running out of metadata.

   All you need to do is ensure that you have enough unallocated space
for the metadata to expand into if it needs to. That's the ultimate
goal of all this.
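
A hedged example of what the nightly job could look like with limit=
(the numbers are illustrative; pick them based on how much
over-allocation you typically see):

  # Compact at most 4 data chunks per run, regardless of how full each
  # individual chunk is, freeing space that metadata can then grow into
  btrfs balance start -dlimit=4 /mnt/data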

   If you have SSDs, it may also be beneficial to use nossd as a mount
option, because the ssd option seems to have some pathology of
overallocating chunks in normal usage. Hans investigated this in detail
a month or two ago.

   Hugo.

-- 
Hugo Mills | "You know, the British have always been nice to mad
hugo@... carfax.org.uk | people."
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Laura Jesson, Brief Encounter




[PATCH] btrfs-progs: Fix slot >= nritems

2017-05-15 Thread Philipp Hahn
Running "btrfsck --repair /dev/sdd2" crashed:
> (gdb) bt full
> #0  0x77020e71 in __memmove_sse2_unaligned_erms () from 
> /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #1  0x00438764 in btrfs_del_ptr (trans=, 
> root=0x6e4fe0, path=0x1d17880, level=0, slot=7)
> at ctree.c:2611
> parent = 0xcd96980
> nritems = 
> __func__ = "btrfs_del_ptr"
> #2  0x00421b15 in repair_btree (corrupt_blocks=, 
> root=) at cmds-check.c:3539
> key = {objectid = 77990592512, type = 168 '\250', offset = 16384}
> trans = 0x8f48c0
> path = 0x1d17880
> level = 0
> #3  check_fs_root (wc=, root_cache=, 
> root=) at cmds-check.c:3703
> corrupt = 0x1d17880
> corrupt_blocks = {root = {rb_node = 0x6e80c60}}
> path = {nodes = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, slots = {0, 
> 0, 0, 0, 0, 0, 0, 0}, locks = {0, 0,
> 0, 0, 0, 0, 0, 0}, reada = 0, lowest_level = 0, search_for_split 
> = 0, skip_check_block = 0}
> nrefs = {bytenr = {271663104, 271646720, 560021504, 0, 0, 0, 0, 0}, 
> refs = {1, 1, 1, 0, 0, 0, 0, 0}}
> wret = 215575372
> root_node = {cache = {rb_node = {__rb_parent_color = 0, rb_right = 
> 0x0, rb_left = 0x0}, objectid = 0,
> start = 0, size = 0}, root_cache = {root = {rb_node = 0x0}}, 
> inode_cache = {root = {
>   rb_node = 0x781c80}}, current = 0x819530, refs = 0}
> status = 215575372
> rec = 0x1
> #4  check_fs_roots (root_cache=0xcd96b6d, root=) at 
> cmds-check.c:3809
> path = {nodes = {0x6eed90, 0x6a2f40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
> slots = {18, 2, 0, 0, 0, 0, 0, 0},
>   locks = {0, 0, 0, 0, 0, 0, 0, 0}, reada = 0, lowest_level = 0, 
> search_for_split = 0,
>   skip_check_block = 0}
> key = {objectid = 323, type = 132 '\204', offset = 
> 18446744073709551615}
> wc = {shared = {root = {rb_node = 0x0}}, nodes = {0x0, 0x0, 
> 0x7fffe428, 0x0, 0x0, 0x0, 0x0, 0x0},
>   active_node = 2, root_level = 2}
> leaf = 0x6e4fe0
> tmp_root = 0x6e4fe0
> #5  0x004287c3 in cmd_check (argc=215575372, argv=0x1d17880) at 
> cmds-check.c:11521
> root_cache = {root = {rb_node = 0x98c2940}}
> info = 0x6927b0
> bytenr = 6891440
> tree_root_bytenr = 0
> uuidbuf = "f65ff1a1-76ef-456e-beb5-c6c3841e7534"
> num = 215575372
> readonly = 218080104
> qgroups_repaired = 0
> #6  0x0040a41f in main (argc=3, argv=0x7fffebe8) at btrfs.c:243
> cmd = 0x689868
> bname = 
> ret = 

Signed-off-by: Philipp Hahn 
---
 ctree.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/ctree.c b/ctree.c
index 02c7180..798bde9 100644
--- a/ctree.c
+++ b/ctree.c
@@ -1487,7 +1487,8 @@ static int insert_ptr(struct btrfs_trans_handle *trans, 
struct btrfs_root
BUG();
if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
BUG();
-   if (slot != nritems) {
+   if (slot < nritems) {
+   /* shift the items */
memmove_extent_buffer(lower,
  btrfs_node_key_ptr_offset(slot + 1),
  btrfs_node_key_ptr_offset(slot),
@@ -2249,7 +2250,7 @@ split:
 
nritems = btrfs_header_nritems(leaf);
 
-   if (slot != nritems) {
+   if (slot < nritems) {
/* shift the items */
memmove_extent_buffer(leaf, btrfs_item_nr_offset(slot + 1),
  btrfs_item_nr_offset(slot),
@@ -2500,7 +2501,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
slot = path->slots[0];
BUG_ON(slot < 0);
 
-   if (slot != nritems) {
+   if (slot < nritems) {
unsigned int old_data = btrfs_item_end_nr(leaf, slot);
 
if (old_data < data_end) {
@@ -2603,7 +2604,8 @@ int btrfs_del_ptr(struct btrfs_root *root, struct 
btrfs_path *path,
int ret = 0;
 
nritems = btrfs_header_nritems(parent);
-   if (slot != nritems -1) {
+   if (slot < nritems -1) {
+   /* shift the items */
memmove_extent_buffer(parent,
  btrfs_node_key_ptr_offset(slot),
  btrfs_node_key_ptr_offset(slot + 1),
-- 
2.11.0
