Re: [PATCH preview] btrfs: allow to set compression level for zlib
On 8/4/17, 6:27 PM, "Adam Borowski" wrote:
> On Fri, Aug 04, 2017 at 09:51:44PM +, Nick Terrell wrote:
> > On 07/25/2017 01:29 AM, David Sterba wrote:
> > > Preliminary support for setting compression level for zlib, the
> > > following works:
> >
> > Thanks for working on this, I think it is a great feature.
> > I have a few comments relating to how it would work with zstd.
>
> Like, currently crashing because of ->set_level being 0? :p
>
> > > --- a/fs/btrfs/compression.c
> > > +++ b/fs/btrfs/compression.c
> > > @@ -866,6 +866,11 @@ static void free_workspaces(void)
> > >   * Given an address space and start and length, compress the bytes into @pages
> > >   * that are allocated on demand.
> > >   *
> > > + * @type_level is encoded algorithm and level, where level 0 means whatever
> > > + * default the algorithm chooses and is opaque here;
> > > + * - compression algo are 0-3
> > > + * - the level are bits 4-7
> >
> > zstd has 19 levels, but we can either only allow the first 15 + default, or
> > provide a mapping from zstd-level to BtrFS zstd-level.
>
> Or give it more bits. Issues like this are exactly why this patch is marked
> "preview".
>
> But, does zstd give any gains with high compression level but input data
> capped at 128KB? I don't see levels above 15 on your benchmark, and certain
> compression algorithms give worse results at highest levels for small
> blocks.

Yeah, I stopped my benchmarks at 15, since without configurable compression
level, high levels didn't seem useful. But level 19 could be interesting if
you are building a base image that is widely distributed. When testing BtrFS
on the Silesia corpus, the compression ratio improved all the way to level 19.
> > > @@ -888,9 +893,11 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
> > >  {
> > >  	struct list_head *workspace;
> > >  	int ret;
> > > +	int type = type_level & 0xF;
> > >
> > >  	workspace = find_workspace(type);
> > >
> > > +	btrfs_compress_op[type - 1]->set_level(workspace, type_level);
> >
> > zlib uses the same amount of memory independently of the compression level,
> > but zstd uses a different amount of memory for each level. zstd will have
> > to allocate memory here if it doesn't have enough (or has way too much),
> > will that be okay?
>
> We can instead store workspaces per the encoded type+level, that'd allow
> having different levels on different mounts (then props, once we get there).
>
> Depends on whether you want highest levels, though (asked above) -- the
> highest ones take drastically more memory, so if they're out, blindly
> reserving space for the highest supported level might be not too wasteful.

Looking at the memory usage of BtrFS zstd, the 128 KB window size keeps the
memory usage very reasonable up to level 19. The zstd compression levels are
computed using a tool that selects the parameters that give the best
compression ratio for a given compression speed target. Since BtrFS has a
fixed window size, the default compression levels might not be optimal. We
could compute our own compression levels for a 128 KB window size.

| Level | Memory |
|-------|--------|
| 1     | 0.8 MB |
| 2     | 1.0 MB |
| 3     | 1.3 MB |
| 4     | 0.9 MB |
| 5     | 1.4 MB |
| 6     | 1.5 MB |
| 7     | 1.4 MB |
| 8     | 1.8 MB |
| 9     | 1.8 MB |
| 10    | 1.8 MB |
| 11    | 1.8 MB |
| 12    | 1.8 MB |
| 13    | 2.4 MB |
| 14    | 2.6 MB |
| 15    | 2.6 MB |
| 16    | 3.1 MB |
| 17    | 3.1 MB |
| 18    | 3.1 MB |
| 19    | 3.1 MB |

The workspace memory usage for each compression level.

> (I have only briefly looked at memory usage and set_level(), please ignore
> me if I babble incoherently -- in bed on a N900 so I can't test it right
> now.)
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
> ⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
> ⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
> ⠈⠳⣄ • use glitches to walk on water
Re: [PATCH preview] btrfs: allow to set compression level for zlib
On Fri, Aug 04, 2017 at 09:51:44PM +, Nick Terrell wrote:
> On 07/25/2017 01:29 AM, David Sterba wrote:
> > Preliminary support for setting compression level for zlib, the
> > following works:
>
> Thanks for working on this, I think it is a great feature.
> I have a few comments relating to how it would work with zstd.

Like, currently crashing because of ->set_level being 0? :p

> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -866,6 +866,11 @@ static void free_workspaces(void)
> >   * Given an address space and start and length, compress the bytes into @pages
> >   * that are allocated on demand.
> >   *
> > + * @type_level is encoded algorithm and level, where level 0 means whatever
> > + * default the algorithm chooses and is opaque here;
> > + * - compression algo are 0-3
> > + * - the level are bits 4-7
>
> zstd has 19 levels, but we can either only allow the first 15 + default, or
> provide a mapping from zstd-level to BtrFS zstd-level.

Or give it more bits. Issues like this are exactly why this patch is marked
"preview".

But, does zstd give any gains with high compression level but input data
capped at 128KB? I don't see levels above 15 on your benchmark, and certain
compression algorithms give worse results at highest levels for small
blocks.

> > @@ -888,9 +893,11 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
> >  {
> >  	struct list_head *workspace;
> >  	int ret;
> > +	int type = type_level & 0xF;
> >
> >  	workspace = find_workspace(type);
> >
> > +	btrfs_compress_op[type - 1]->set_level(workspace, type_level);
>
> zlib uses the same amount of memory independently of the compression level,
> but zstd uses a different amount of memory for each level. zstd will have
> to allocate memory here if it doesn't have enough (or has way too much),
> will that be okay?

We can instead store workspaces per the encoded type+level, that'd allow
having different levels on different mounts (then props, once we get there).
Depends on whether you want highest levels, though (asked above) -- the
highest ones take drastically more memory, so if they're out, blindly
reserving space for the highest supported level might be not too wasteful.

(I have only briefly looked at memory usage and set_level(), please ignore
me if I babble incoherently -- in bed on a N900 so I can't test it right
now.)

Meow!
--
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
Hi Qu,

On Fri, Aug 4, 2017 at 10:05 PM, Qu Wenruo wrote:
>
> On 2017年08月02日 16:38, Brendan Hide wrote:
>>
>> The title seems alarmist to me - and I suspect it is going to be
>> misconstrued. :-/
>>
>> From the release notes at
>> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html
>>
>> "Btrfs has been deprecated
>>
>> The Btrfs file system has been in Technology Preview state since the
>> initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving
>> Btrfs to a fully supported feature and it will be removed in a future major
>> release of Red Hat Enterprise Linux.
>>
>> The Btrfs file system did receive numerous updates from the upstream in
>> Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat
>> Enterprise Linux 7 series. However, this is the last planned update to this
>> feature.
>>
>> Red Hat will continue to invest in future technologies to address the use
>> cases of our customers, specifically those related to snapshots,
>> compression, NVRAM, and ease of use. We encourage feedback through your Red
>> Hat representative on features and requirements you have for file systems
>> and storage technology."
>
> Personally speaking, unlike most of the btrfs supporters, I think Red Hat
> is doing the correct thing for their enterprise use case.
>
> (To clarify, I'm not going to Red Hat, just in case anyone wonders why I'm
> not supporting btrfs)
>
> [Good things of btrfs]
> Btrfs is indeed a technical pioneer in a lot of aspects (at least in the
> Linux world):
>
> 1) Metadata CoW instead of a traditional journal
> 2) Snapshot and delta-backup
>    I think this is the killer feature of Btrfs, and why SUSE is using it
>    for the root fs.
> 3) Default data CoW
> 4) Data checksum and scrubbing
> 5) Multi-device management
> 6) Online resize/balancing
> And a lot more.
> [Bad things of btrfs]
> But for enterprise usage, it's too advanced and has several problems
> preventing it from being widely applied:
>
> 1) Low performance from metadata/data CoW
>    This is a complicated dilemma.
>    Although Btrfs can disable data CoW, nodatacow also disables data
>    checksums, which is another main feature of btrfs.
>    So Btrfs can't default to nodatacow, unlike XFS.
>
>    And metadata CoW causes extra metadata writes along with superblock
>    updates (FUA), further degrading performance.
>
>    Such a pioneering design makes traditional performance-intense use
>    cases very unhappy, especially almost all kinds of databases. (Note
>    that nodatacow can't always solve the performance problem.)
>    Most performance-intense usage is still based on traditional fs design
>    (journal with no CoW).
>
> 2) Low concurrency caused by the tree design
>    Unlike the traditional one-tree-for-one-inode design, btrfs uses
>    one-tree-for-one-subvolume.
>    The design makes snapshot implementation very easy, while making the
>    tree very hot when a lot of modifiers are trying to modify any metadata.
>
>    Btrfs has a lot of different ways to address this.
>    For the extent tree (the busiest tree), we are using delayed-refs to
>    speed up extent tree updates.
>    For fs tree fsync, we have the log tree to speed things up.
>    These approaches work, at the cost of complexity and bugs, and we
>    still have slow fs tree modification speed.
>
> 3) Low code reuse of device-mapper
>    I totally understand that, due to the unique support for data csum,
>    btrfs can't use device-mapper directly, as we must verify the data
>    read out from the device before passing it to a higher level.
>    So Btrfs uses its own device-mapper-like implementation to handle
>    multi-device management.
>
>    The result is mixed. For easy-to-handle cases like RAID0/1/10 btrfs
>    is doing well, while for RAID5/6, everyone knows the result.
>
>    Such a btrfs *enhanced* re-implementation not only makes btrfs larger
>    but also more complex and bug-prone.
> In short, btrfs is too advanced for generic use cases (performance) and
> developers (bugs), unfortunately.
>
> And even SUSE is just pushing btrfs as the root fs, mainly for the
> snapshot feature, but still ext4/xfs for data or performance-intense
> use cases.
>
> [Other solutions on the table]
> On the other hand, I think Red Hat is pushing storage technology based
> on LVM (thin) and XFS.
>
> Traditional LVM is stable, but its snapshot design is old-fashioned and
> low-performance, while new thin-provisioned LVM solves the problem using
> a method just like Btrfs, but at the block level.
>
> And XFS is still traditionally designed: journal based,
> one-tree-for-one-inode, but with fancy new features like data CoW.
>
> Even though XFS + LVM-thin lacks the ability to shrink the fs, scrub
> data, or do delta backup, it can do a lot of things just like Btrfs,
> from snapshots to multi-device management.
>
> And more importantly, has better
Re: [PATCH v4 4/5] squashfs: Add zstd support
Signed-off-by: Sean Purcell

On Fri, Aug 4, 2017 at 4:19 PM, Nick Terrell wrote:
> Add zstd compression and decompression support to SquashFS. zstd is a
> great fit for SquashFS because it can compress at ratios approaching xz,
> while decompressing twice as fast as zlib. For SquashFS in particular,
> it can decompress as fast as lzo and lz4. It also has the flexibility
> to turn down the compression ratio for faster compression times.
>
> The compression benchmark is run on the file tree from the SquashFS archive
> found in ubuntu-16.10-desktop-amd64.iso [1]. It uses `mksquashfs` with the
> default block size (128 KB) and various compression algorithms/levels.
> xz and zstd are also benchmarked with 256 KB blocks. The decompression
> benchmark times how long it takes to `tar` the file tree into `/dev/null`.
> See the benchmark file in the upstream zstd source repository located under
> `contrib/linux-kernel/squashfs-benchmark.sh` [2] for details.
>
> I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
> The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
> 16 GB of RAM, and an SSD.
>
> | Method         | Ratio | Compression MB/s | Decompression MB/s |
> |----------------|-------|------------------|--------------------|
> | gzip           | 2.92  | 15               | 128                |
> | lzo            | 2.64  | 9.5              | 217                |
> | lz4            | 2.12  | 94               | 218                |
> | xz             | 3.43  | 5.5              | 35                 |
> | xz 256 KB      | 3.53  | 5.4              | 40                 |
> | zstd 1         | 2.71  | 96               | 210                |
> | zstd 5         | 2.93  | 69               | 198                |
> | zstd 10        | 3.01  | 41               | 225                |
> | zstd 15        | 3.13  | 11.4             | 224                |
> | zstd 16 256 KB | 3.24  | 8.1              | 210                |
>
> This patch was written by Sean Purcell, but I will be
> taking over the submission process.
> [1] http://releases.ubuntu.com/16.10/
> [2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/squashfs-benchmark.sh
>
> zstd source repository: https://github.com/facebook/zstd
>
> Cc: Sean Purcell
> Signed-off-by: Nick Terrell
> ---
> v3 -> v4:
> - Fix minor linter warnings
>
>  fs/squashfs/Kconfig        | 14 +
>  fs/squashfs/Makefile       | 1 +
>  fs/squashfs/decompressor.c | 7 +++
>  fs/squashfs/decompressor.h | 4 ++
>  fs/squashfs/squashfs_fs.h  | 1 +
>  fs/squashfs/zstd_wrapper.c | 149 +
>  6 files changed, 176 insertions(+)
>  create mode 100644 fs/squashfs/zstd_wrapper.c
>
> diff --git a/fs/squashfs/Kconfig b/fs/squashfs/Kconfig
> index ffb093e..1adb334 100644
> --- a/fs/squashfs/Kconfig
> +++ b/fs/squashfs/Kconfig
> @@ -165,6 +165,20 @@ config SQUASHFS_XZ
>
>  	  If unsure, say N.
>
> +config SQUASHFS_ZSTD
> +	bool "Include support for ZSTD compressed file systems"
> +	depends on SQUASHFS
> +	select ZSTD_DECOMPRESS
> +	help
> +	  Saying Y here includes support for reading Squashfs file systems
> +	  compressed with ZSTD compression. ZSTD gives better compression than
> +	  the default ZLIB compression, while using less CPU.
> +
> +	  ZSTD is not the standard compression used in Squashfs and so most
> +	  file systems will be readable without selecting this option.
> +
> +	  If unsure, say N.
> +
>  config SQUASHFS_4K_DEVBLK_SIZE
>  	bool "Use 4K device block size?"
>  	depends on SQUASHFS
> diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
> index 246a6f3..6655631 100644
> --- a/fs/squashfs/Makefile
> +++ b/fs/squashfs/Makefile
> @@ -15,3 +15,4 @@ squashfs-$(CONFIG_SQUASHFS_LZ4) += lz4_wrapper.o
>  squashfs-$(CONFIG_SQUASHFS_LZO) += lzo_wrapper.o
>  squashfs-$(CONFIG_SQUASHFS_XZ) += xz_wrapper.o
>  squashfs-$(CONFIG_SQUASHFS_ZLIB) += zlib_wrapper.o
> +squashfs-$(CONFIG_SQUASHFS_ZSTD) += zstd_wrapper.o
> diff --git a/fs/squashfs/decompressor.c b/fs/squashfs/decompressor.c
> index d2bc136..8366398 100644
> --- a/fs/squashfs/decompressor.c
> +++ b/fs/squashfs/decompressor.c
> @@ -65,6 +65,12 @@ static const struct squashfs_decompressor squashfs_zlib_comp_ops = {
>  };
>  #endif
>
> +#ifndef CONFIG_SQUASHFS_ZSTD
> +static const struct squashfs_decompressor squashfs_zstd_comp_ops = {
> +	NULL, NULL, NULL, NULL, ZSTD_COMPRESSION, "zstd", 0
> +};
> +#endif
> +
>  static const struct squashfs_decompressor squashfs_unknown_comp_ops = {
>  	NULL, NULL, NULL, NULL, 0, "unknown", 0
>  };
> @@ -75,6 +81,7 @@ static const struct squashfs_decompressor *decompressor[] = {
>  	&squashfs_lzo_comp_ops,
>  	&squashfs_xz_comp_ops,
>  	&squashfs_lzma_unsupported_comp_ops,
> +	&squashfs_zstd_comp_ops,
>  	&squashfs_unknown_comp_ops
Re: [PATCH v4 4/5] squashfs: Add zstd support
On 8/4/17, 3:10 PM, "linus...@gmail.com on behalf of Linus Torvalds" wrote:
> On Fri, Aug 4, 2017 at 1:19 PM, Nick Terrell wrote:
> >
> > This patch was written by Sean Purcell, but I will be
> > taking over the submission process.
>
> Please, if so, get Sean's sign-off, and also make sure that the patch
> gets submitted with
>
>   From: Sean Purcell
>
> at the top of the body of the email so that authorship gets properly
> attributed by all the usual tools.
>
>              Linus

Thanks for the help, I'll fix it for the next version.
Re: [PATCH v4 4/5] squashfs: Add zstd support
On Fri, Aug 4, 2017 at 1:19 PM, Nick Terrell wrote:
>
> This patch was written by Sean Purcell, but I will be
> taking over the submission process.

Please, if so, get Sean's sign-off, and also make sure that the patch
gets submitted with

  From: Sean Purcell

at the top of the body of the email so that authorship gets properly
attributed by all the usual tools.

             Linus
Re: [PATCH preview] btrfs: allow to set compression level for zlib
On 07/25/2017 01:29 AM, David Sterba wrote:
> Preliminary support for setting compression level for zlib, the
> following works:

Thanks for working on this, I think it is a great feature.
I have a few comments relating to how it would work with zstd.

>   $ mount -o compress=zlib		# default
>   $ mount -o compress=zlib0		# same
>   $ mount -o compress=zlib9		# level 9, slower sync, less data
>   $ mount -o compress=zlib1		# level 1, faster sync, more data
>   $ mount -o remount,compress=zlib3	# level set by remount
>
> The level is visible in the same format in /proc/mounts. Level set via
> file property does not work yet.
>
> Required patch: "btrfs: prepare for extensions in compression options"
>
> Signed-off-by: David Sterba
> ---
>  fs/btrfs/compression.c | 20 +++-
>  fs/btrfs/compression.h | 6 +-
>  fs/btrfs/ctree.h       | 1 +
>  fs/btrfs/inode.c       | 5 +-
>  fs/btrfs/lzo.c         | 5 +
>  fs/btrfs/super.c       | 7 +--
>  fs/btrfs/zlib.c        | 12 +++-
>  7 files changed, 50 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 8ba1b86c9b72..142206d68495 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -866,6 +866,11 @@ static void free_workspaces(void)
>   * Given an address space and start and length, compress the bytes into @pages
>   * that are allocated on demand.
>   *
> + * @type_level is encoded algorithm and level, where level 0 means whatever
> + * default the algorithm chooses and is opaque here;
> + * - compression algo are 0-3
> + * - the level are bits 4-7

zstd has 19 levels, but we can either only allow the first 15 + default, or
provide a mapping from zstd-level to BtrFS zstd-level.
> + *
>   * @out_pages is an in/out parameter, holds maximum number of pages to allocate
>   * and returns number of actually allocated pages
>   *
> @@ -880,7 +885,7 @@ static void free_workspaces(void)
>   * @max_out tells us the max number of bytes that we're allowed to
>   * stuff into pages
>   */
> -int btrfs_compress_pages(int type, struct address_space *mapping,
> +int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
>  			 u64 start, struct page **pages,
>  			 unsigned long *out_pages,
>  			 unsigned long *total_in,
> @@ -888,9 +893,11 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
>  {
>  	struct list_head *workspace;
>  	int ret;
> +	int type = type_level & 0xF;
>
>  	workspace = find_workspace(type);
>
> +	btrfs_compress_op[type - 1]->set_level(workspace, type_level);

zlib uses the same amount of memory independently of the compression level,
but zstd uses a different amount of memory for each level. zstd will have
to allocate memory here if it doesn't have enough (or has way too much),
will that be okay?
>  	ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
>  						      start, pages,
>  						      out_pages,
> @@ -1047,3 +1054,14 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
>
>  	return 1;
>  }
> +
> +unsigned int btrfs_compress_str2level(const char *str)
> +{
> +	if (strncmp(str, "zlib", 4) != 0)
> +		return 0;
> +
> +	if ('1' <= str[4] && str[4] <= '9' )
> +		return str[4] - '0';
> +
> +	return 0;
> +}
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 89bcf975efb8..8a6db02d8732 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -76,7 +76,7 @@ struct compressed_bio {
>  void btrfs_init_compress(void);
>  void btrfs_exit_compress(void);
>
> -int btrfs_compress_pages(int type, struct address_space *mapping,
> +int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
>  			 u64 start, struct page **pages,
>  			 unsigned long *out_pages,
>  			 unsigned long *total_in,
> @@ -95,6 +95,8 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  				 int mirror_num, unsigned long bio_flags);
>
> +unsigned btrfs_compress_str2level(const char *str);
> +
>  enum btrfs_compression_type {
>  	BTRFS_COMPRESS_NONE = 0,
>  	BTRFS_COMPRESS_ZLIB = 1,
> @@ -124,6 +126,8 @@ struct btrfs_compress_op {
>  			  struct page *dest_page,
>  			  unsigned long start_byte,
>  			  size_t srclen, size_t destlen);
> +
> +	void (*set_level)(struct list_head *ws, unsigned int type);
>  };
>
>  extern const struct btrfs_compress_op btrfs_zlib_compress;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
Re: please include 17024ad0a0fd ("Btrfs: fix early ENOSPC due to delalloc") to 4.12 stable
On Fri, Aug 04, 2017 at 11:25:14PM +0300, Nikolay Borisov wrote:
> Hello,
>
> I'd like the aforementioned patch to be applied to stable 4.9/4.12. The
> attached backport applies cleanly to both of them.

Thanks, I'll queue it up after this next release happens.

greg k-h
[PATCH v4 5/5] crypto: Add zstd support
Adds zstd support to crypto and scompress. Only supports the default level.

Signed-off-by: Nick Terrell
---
 crypto/Kconfig   |   9 ++
 crypto/Makefile  |   1 +
 crypto/testmgr.c |  10 +++
 crypto/testmgr.h |  71 +++
 crypto/zstd.c    | 265 +++
 5 files changed, 356 insertions(+)
 create mode 100644 crypto/zstd.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index caa770e..4fc3936 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1662,6 +1662,15 @@ config CRYPTO_LZ4HC
 	help
 	  This is the LZ4 high compression mode algorithm.

+config CRYPTO_ZSTD
+	tristate "Zstd compression algorithm"
+	select CRYPTO_ALGAPI
+	select CRYPTO_ACOMP2
+	select ZSTD_COMPRESS
+	select ZSTD_DECOMPRESS
+	help
+	  This is the zstd algorithm.
+
 comment "Random Number Generation"

 config CRYPTO_ANSI_CPRNG
diff --git a/crypto/Makefile b/crypto/Makefile
index d41f033..b22e1e8 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -133,6 +133,7 @@ obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
 obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
 obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
+obj-$(CONFIG_CRYPTO_ZSTD) += zstd.o

 ecdh_generic-y := ecc.o
 ecdh_generic-y += ecdh.o
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 7125ba3..8a124d3 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3603,6 +3603,16 @@ static const struct alg_test_desc alg_test_descs[] = {
 				.decomp = __VECS(zlib_deflate_decomp_tv_template)
 			}
 		}
+	}, {
+		.alg = "zstd",
+		.test = alg_test_comp,
+		.fips_allowed = 1,
+		.suite = {
+			.comp = {
+				.comp = __VECS(zstd_comp_tv_template),
+				.decomp = __VECS(zstd_decomp_tv_template)
+			}
+		}
 	}
 };

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 6ceb0e2..e6b5920 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -34631,4 +34631,75 @@ static const struct comp_testvec lz4hc_decomp_tv_template[] = {
 	},
 };

+static const struct comp_testvec zstd_comp_tv_template[] = {
+	{
+		.inlen	= 68,
+		.outlen	= 39,
+		.input	= "The algorithm is zstd. "
+			  "The algorithm is zstd. "
+			  "The algorithm is zstd.",
+		.output	= "\x28\xb5\x2f\xfd\x00\x50\xf5\x00\x00\xb8\x54\x68\x65"
+			  "\x20\x61\x6c\x67\x6f\x72\x69\x74\x68\x6d\x20\x69\x73"
+			  "\x20\x7a\x73\x74\x64\x2e\x20\x01\x00\x55\x73\x36\x01"
+			  ,
+	},
+	{
+		.inlen	= 244,
+		.outlen	= 151,
+		.input	= "zstd, short for Zstandard, is a fast lossless "
+			  "compression algorithm, targeting real-time "
+			  "compression scenarios at zlib-level and better "
+			  "compression ratios. The zstd compression library "
+			  "provides in-memory compression and decompression "
+			  "functions.",
+		.output	= "\x28\xb5\x2f\xfd\x00\x50\x75\x04\x00\x42\x4b\x1e\x17"
+			  "\x90\x81\x31\x00\xf2\x2f\xe4\x36\xc9\xef\x92\x88\x32"
+			  "\xc9\xf2\x24\x94\xd8\x68\x9a\x0f\x00\x0c\xc4\x31\x6f"
+			  "\x0d\x0c\x38\xac\x5c\x48\x03\xcd\x63\x67\xc0\xf3\xad"
+			  "\x4e\x90\xaa\x78\xa0\xa4\xc5\x99\xda\x2f\xb6\x24\x60"
+			  "\xe2\x79\x4b\xaa\xb6\x6b\x85\x0b\xc9\xc6\x04\x66\x86"
+			  "\xe2\xcc\xe2\x25\x3f\x4f\x09\xcd\xb8\x9d\xdb\xc1\x90"
+			  "\xa9\x11\xbc\x35\x44\x69\x2d\x9c\x64\x4f\x13\x31\x64"
+			  "\xcc\xfb\x4d\x95\x93\x86\x7f\x33\x7f\x1a\xef\xe9\x30"
+			  "\xf9\x67\xa1\x94\x0a\x69\x0f\x60\xcd\xc3\xab\x99\xdc"
+			  "\x42\xed\x97\x05\x00\x33\xc3\x15\x95\x3a\x06\xa0\x0e"
+			  "\x20\xa9\x0e\x82\xb9\x43\x45\x01",
+	},
+};
+
+static const struct comp_testvec zstd_decomp_tv_template[] = {
+	{
+		.inlen	= 43,
+		.outlen	= 68,
+		.input	= "\x28\xb5\x2f\xfd\x04\x50\xf5\x00\x00\xb8\x54\x68\x65"
+			  "\x20\x61\x6c\x67\x6f\x72\x69\x74\x68\x6d\x20\x69\x73"
+			  "\x20\x7a\x73\x74\x64\x2e\x20\x01\x00\x55\x73\x36\x01"
+			  "\x6b\xf4\x13\x35",
+		.output	= "The algorithm is zstd. "
+			  "The algorithm is zstd. "
+			  "The algorithm is zstd.",
+	},
+	{
+		.inlen	= 155,
[PATCH v4 4/5] squashfs: Add zstd support
Add zstd compression and decompression support to SquashFS. zstd is a
great fit for SquashFS because it can compress at ratios approaching xz,
while decompressing twice as fast as zlib. For SquashFS in particular,
it can decompress as fast as lzo and lz4. It also has the flexibility
to turn down the compression ratio for faster compression times.

The compression benchmark is run on the file tree from the SquashFS archive
found in ubuntu-16.10-desktop-amd64.iso [1]. It uses `mksquashfs` with the
default block size (128 KB) and various compression algorithms/levels.
xz and zstd are also benchmarked with 256 KB blocks. The decompression
benchmark times how long it takes to `tar` the file tree into `/dev/null`.
See the benchmark file in the upstream zstd source repository located under
`contrib/linux-kernel/squashfs-benchmark.sh` [2] for details.

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and an SSD.

| Method         | Ratio | Compression MB/s | Decompression MB/s |
|----------------|-------|------------------|--------------------|
| gzip           | 2.92  | 15               | 128                |
| lzo            | 2.64  | 9.5              | 217                |
| lz4            | 2.12  | 94               | 218                |
| xz             | 3.43  | 5.5              | 35                 |
| xz 256 KB      | 3.53  | 5.4              | 40                 |
| zstd 1         | 2.71  | 96               | 210                |
| zstd 5         | 2.93  | 69               | 198                |
| zstd 10        | 3.01  | 41               | 225                |
| zstd 15        | 3.13  | 11.4             | 224                |
| zstd 16 256 KB | 3.24  | 8.1              | 210                |

This patch was written by Sean Purcell, but I will be taking over the
submission process.
[1] http://releases.ubuntu.com/16.10/
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/squashfs-benchmark.sh

zstd source repository: https://github.com/facebook/zstd

Cc: Sean Purcell
Signed-off-by: Nick Terrell
---
v3 -> v4:
- Fix minor linter warnings

 fs/squashfs/Kconfig        | 14 +
 fs/squashfs/Makefile       | 1 +
 fs/squashfs/decompressor.c | 7 +++
 fs/squashfs/decompressor.h | 4 ++
 fs/squashfs/squashfs_fs.h  | 1 +
 fs/squashfs/zstd_wrapper.c | 149 +
 6 files changed, 176 insertions(+)
 create mode 100644 fs/squashfs/zstd_wrapper.c

diff --git a/fs/squashfs/Kconfig b/fs/squashfs/Kconfig
index ffb093e..1adb334 100644
--- a/fs/squashfs/Kconfig
+++ b/fs/squashfs/Kconfig
@@ -165,6 +165,20 @@ config SQUASHFS_XZ

 	  If unsure, say N.

+config SQUASHFS_ZSTD
+	bool "Include support for ZSTD compressed file systems"
+	depends on SQUASHFS
+	select ZSTD_DECOMPRESS
+	help
+	  Saying Y here includes support for reading Squashfs file systems
+	  compressed with ZSTD compression. ZSTD gives better compression than
+	  the default ZLIB compression, while using less CPU.
+
+	  ZSTD is not the standard compression used in Squashfs and so most
+	  file systems will be readable without selecting this option.
+
+	  If unsure, say N.
+
 config SQUASHFS_4K_DEVBLK_SIZE
 	bool "Use 4K device block size?"
 	depends on SQUASHFS
diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
index 246a6f3..6655631 100644
--- a/fs/squashfs/Makefile
+++ b/fs/squashfs/Makefile
@@ -15,3 +15,4 @@ squashfs-$(CONFIG_SQUASHFS_LZ4) += lz4_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_LZO) += lzo_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_XZ) += xz_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_ZLIB) += zlib_wrapper.o
+squashfs-$(CONFIG_SQUASHFS_ZSTD) += zstd_wrapper.o
diff --git a/fs/squashfs/decompressor.c b/fs/squashfs/decompressor.c
index d2bc136..8366398 100644
--- a/fs/squashfs/decompressor.c
+++ b/fs/squashfs/decompressor.c
@@ -65,6 +65,12 @@ static const struct squashfs_decompressor squashfs_zlib_comp_ops = {
 };
 #endif

+#ifndef CONFIG_SQUASHFS_ZSTD
+static const struct squashfs_decompressor squashfs_zstd_comp_ops = {
+	NULL, NULL, NULL, NULL, ZSTD_COMPRESSION, "zstd", 0
+};
+#endif
+
 static const struct squashfs_decompressor squashfs_unknown_comp_ops = {
 	NULL, NULL, NULL, NULL, 0, "unknown", 0
 };
@@ -75,6 +81,7 @@ static const struct squashfs_decompressor *decompressor[] = {
 	&squashfs_lzo_comp_ops,
 	&squashfs_xz_comp_ops,
 	&squashfs_lzma_unsupported_comp_ops,
+	&squashfs_zstd_comp_ops,
 	&squashfs_unknown_comp_ops
 };
diff --git a/fs/squashfs/decompressor.h b/fs/squashfs/decompressor.h
index a25713c..0f5a8e4 100644
--- a/fs/squashfs/decompressor.h
+++ b/fs/squashfs/decompressor.h
@@ -58,4 +58,8 @@ extern const struct squashfs_decompressor squashfs_lzo_comp_ops;
 extern const struct squashfs_decompressor squashfs_zlib_comp_ops;
 #endif

+#ifdef
please include 17024ad0a0fd ("Btrfs: fix early ENOSPC due to delalloc") to 4.12 stable
Hello,

I'd like the aforementioned patch to be applied to stable 4.9/4.12. The
attached backport applies cleanly to both of them.

From 278e5d0839f4ecc6d7bfb7a95cb735b9034e8315 Mon Sep 17 00:00:00 2001
From: Omar Sandoval
Date: Thu, 20 Jul 2017 15:10:35 -0700
Subject: [PATCH] Btrfs: fix early ENOSPC due to delalloc

If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem, as well
as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer
Cc: sta...@vger.kernel.org
Reviewed-by: Josef Bacik
Signed-off-by: Omar Sandoval
Signed-off-by: David Sterba
Signed-off-by: Nikolay Borisov
---
 fs/btrfs/extent-tree.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a8d6ad4042b7..adb285a93753 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4813,10 +4813,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
 		else
 			flush = BTRFS_RESERVE_NO_FLUSH;
 		spin_lock(&space_info->lock);
-		if (can_overcommit(root, space_info, orig, flush)) {
-			spin_unlock(&space_info->lock);
-			break;
-		}
 		if (list_empty(&space_info->tickets) &&
 		    list_empty(&space_info->priority_tickets)) {
 			spin_unlock(&space_info->lock);
--
2.7.4
[PATCH v4 3/5] btrfs: Add zstd support
Add zstd compression and decompression support to BtrFS. zstd at its fastest level compresses almost as well as zlib, while offering much faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo compression, and zlib compression. I benchmarked two scenarios: copying a set of files to btrfs, and then reading the files; copying a tarball to btrfs, extracting it to btrfs, and then reading the extracted files. After every operation, I call `sync` and include the sync time. Between every pair of operations I unmount and remount the filesystem to avoid caching. The benchmark files can be found in the upstream zstd source repository under `contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}` [1] [2].

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and an SSD.

The first compression benchmark is copying 10 copies of the unzipped Silesia corpus [3] into a BtrFS filesystem mounted with `-o compress-force=Method`. The decompression benchmark times how long it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is measured by comparing the output of `df` and `du`. See the benchmark file [1] for details. I benchmarked multiple zstd compression levels, although the patch uses zstd level 1.

| Method  | Ratio | Compression MB/s | Decompression MB/s |
|---------|-------|------------------|--------------------|
| None    | 0.99  | 504              | 686                |
| lzo     | 1.66  | 398              | 442                |
| zlib    | 2.58  | 65               | 241                |
| zstd 1  | 2.57  | 260              | 383                |
| zstd 3  | 2.71  | 174              | 408                |
| zstd 6  | 2.87  | 70               | 398                |
| zstd 9  | 2.92  | 43               | 406                |
| zstd 12 | 2.93  | 21               | 408                |
| zstd 15 | 3.01  | 11               | 354                |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it measures the compression ratio, extracts the tar, and deletes the tar. Then it measures the compression ratio again, and `tar`s the extracted files into `/dev/null`.
See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s) | Read (s) |
|--------|-----------|---------------|----------|-------------|----------|
| None   | 0.97      | 0.78          | 0.981    | 5.501       | 8.807    |
| lzo    | 2.06      | 1.38          | 1.631    | 8.458       | 8.585    |
| zlib   | 3.40      | 1.86          | 7.750    | 21.544      | 11.744   |
| zstd 1 | 3.57      | 1.85          | 2.579    | 11.479      | 9.389    |

[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell
---
v2 -> v3:
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff
- Change default compression level for BtrFS to 3

v3 -> v4:
- Add missing includes, which fixes the aarch64 build
- Fix minor linter warnings

 fs/btrfs/Kconfig           |   2 +
 fs/btrfs/Makefile          |   2 +-
 fs/btrfs/compression.c     |   1 +
 fs/btrfs/compression.h     |   6 +-
 fs/btrfs/ctree.h           |   1 +
 fs/btrfs/disk-io.c         |   2 +
 fs/btrfs/ioctl.c           |   6 +-
 fs/btrfs/props.c           |   6 +
 fs/btrfs/super.c           |  12 +-
 fs/btrfs/sysfs.c           |   2 +
 fs/btrfs/zstd.c            | 432 +
 include/uapi/linux/btrfs.h |   8 +-
 12 files changed, 468 insertions(+), 12 deletions(-)
 create mode 100644 fs/btrfs/zstd.c

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 80e9c18..a26c63b 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -6,6 +6,8 @@ config BTRFS_FS
 	select ZLIB_DEFLATE
 	select LZO_COMPRESS
 	select LZO_DECOMPRESS
+	select ZSTD_COMPRESS
+	select ZSTD_DECOMPRESS
 	select RAID6_PQ
 	select XOR_BLOCKS
 	select SRCU

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..962a95a 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -6,7 +6,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
	   transaction.o inode.o file.o tree-defrag.o \
	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
-	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
+	   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
	   compression.o
[PATCH v4 1/5] lib: Add xxhash module
Adds xxhash kernel module with xxh32 and xxh64 hashes. xxhash is an extremely fast non-cryptographic hash algorithm for checksumming. The zstd compression and decompression modules added in the next patch require xxhash. I extracted it out from zstd since it is useful on its own. I copied the code from the upstream XXHash source repository and translated it into kernel style.

I ran benchmarks and tests in the kernel and tests in userland. I benchmarked xxhash as a special character device. I ran in four modes: no-op, xxh32, xxh64, and crc32. The no-op mode simply copies the data to kernel space and ignores it. The xxh32, xxh64, and crc32 modes compute hashes on the copied data. I also ran it with four different buffer sizes. The benchmark file is located in the upstream zstd source repository under `contrib/linux-kernel/xxhash_test.c` [1].

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and an SSD. I benchmarked using the file `filesystem.squashfs` from `ubuntu-16.10-desktop-amd64.iso`, which is 1,536,217,088 B large. Run the following commands for the benchmark:

modprobe xxhash_test
mknod xxhash_test c 245 0
time cp filesystem.squashfs xxhash_test

The time is reported by the time of the userland `cp`. The GB/s is computed with 1,536,217,088 B / time(buffer size, hash), which includes the time to copy from userland. The Adjusted GB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).
| Buffer Size (B) | Hash  | Time (s) | GB/s | Adjusted GB/s |
|-----------------|-------|----------|------|---------------|
| 1024            | none  | 0.408    | 3.77 | -             |
| 1024            | xxh32 | 0.649    | 2.37 | 6.37          |
| 1024            | xxh64 | 0.542    | 2.83 | 11.46         |
| 1024            | crc32 | 1.290    | 1.19 | 1.74          |
| 4096            | none  | 0.380    | 4.04 | -             |
| 4096            | xxh32 | 0.645    | 2.38 | 5.79          |
| 4096            | xxh64 | 0.500    | 3.07 | 12.80         |
| 4096            | crc32 | 1.168    | 1.32 | 1.95          |
| 8192            | none  | 0.351    | 4.38 | -             |
| 8192            | xxh32 | 0.614    | 2.50 | 5.84          |
| 8192            | xxh64 | 0.464    | 3.31 | 13.60         |
| 8192            | crc32 | 1.163    | 1.32 | 1.89          |
| 16384           | none  | 0.346    | 4.43 | -             |
| 16384           | xxh32 | 0.590    | 2.60 | 6.30          |
| 16384           | xxh64 | 0.466    | 3.30 | 12.80         |
| 16384           | crc32 | 1.183    | 1.30 | 1.84          |

Tested in userland using the test-suite in the zstd repo under `contrib/linux-kernel/test/XXHashUserlandTest.cpp` [2] by mocking the kernel functions. A line in each branch of every function in `xxhash.c` was commented out to ensure that the test-suite fails. Additionally tested while testing zstd and with SMHasher [3].

[1] https://phabricator.intern.facebook.com/P57526246
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/test/XXHashUserlandTest.cpp
[3] https://github.com/aappleby/smhasher

zstd source repository: https://github.com/facebook/zstd
XXHash source repository: https://github.com/cyan4973/xxhash

Signed-off-by: Nick Terrell
---
v1 -> v2:
- Make pointer in lib/xxhash.c:394 non-const

 include/linux/xxhash.h | 236 +++
 lib/Kconfig            |   3 +
 lib/Makefile           |   1 +
 lib/xxhash.c           | 500 +
 4 files changed, 740 insertions(+)
 create mode 100644 include/linux/xxhash.h
 create mode 100644 lib/xxhash.c

diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h
new file mode 100644
index 000..9e1f42c
--- /dev/null
+++ b/include/linux/xxhash.h
@@ -0,0 +1,236 @@
+/*
+ * xxHash - Extremely Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet.
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *   * Redistributions of source code must retain the above copyright
+ *     notice, this list of conditions and the following disclaimer.
+ *   * Redistributions in binary form must reproduce the above
+ *     copyright notice, this list of conditions and the following disclaimer
+ *     in the documentation and/or other materials provided with the
+ *     distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
[PATCH v4 0/5] Add xxhash and zstd modules
Hi all,

This patch set adds xxhash, zstd compression, and zstd decompression modules. It also adds zstd support to BtrFS and SquashFS. Each patch has relevant summaries, benchmarks, and tests.

Best,
Nick Terrell

Changelog:

v1 -> v2:
- Make pointer in lib/xxhash.c:394 non-const (1/5)
- Use div_u64() for division of u64s (2/5)
- Reduce stack usage of ZSTD_compressSequences(), ZSTD_buildSeqTable(), ZSTD_decompressSequencesLong(), FSE_buildDTable(), FSE_decompress_wksp(), HUF_writeCTable(), HUF_readStats(), HUF_readCTable(), HUF_compressWeights(), HUF_readDTableX2(), and HUF_readDTableX4() (2/5)
- No zstd function uses more than 400 B of stack space (2/5)

v2 -> v3:
- Work around gcc-7 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388 (2/5)
- Fix bug in dictionary compression from upstream commit cc1522351f (2/5)
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff (3/5)
- Change default compression level for BtrFS to 3 (3/5)

v3 -> v4:
- Fix compiler warnings (2/5)
- Add missing includes (3/5)
- Fix minor linter warnings (3/5, 4/5)
- Add crypto patch (5/5)

Nick Terrell (5):
  lib: Add xxhash module
  lib: Add zstd modules
  btrfs: Add zstd support
  squashfs: Add zstd support
  crypto: Add zstd support

 crypto/Kconfig             |    9 +
 crypto/Makefile            |    1 +
 crypto/testmgr.c           |   10 +
 crypto/testmgr.h           |   71 +
 crypto/zstd.c              |  265
 fs/btrfs/Kconfig           |    2 +
 fs/btrfs/Makefile          |    2 +-
 fs/btrfs/compression.c     |    1 +
 fs/btrfs/compression.h     |    6 +-
 fs/btrfs/ctree.h           |    1 +
 fs/btrfs/disk-io.c         |    2 +
 fs/btrfs/ioctl.c           |    6 +-
 fs/btrfs/props.c           |    6 +
 fs/btrfs/super.c           |   12 +-
 fs/btrfs/sysfs.c           |    2 +
 fs/btrfs/zstd.c            |  432 ++
 fs/squashfs/Kconfig        |   14 +
 fs/squashfs/Makefile       |    1 +
 fs/squashfs/decompressor.c |    7 +
 fs/squashfs/decompressor.h |    4 +
 fs/squashfs/squashfs_fs.h  |    1 +
 fs/squashfs/zstd_wrapper.c |  149 ++
 include/linux/xxhash.h     |  236 +++
 include/linux/zstd.h       | 1157 +++
 include/uapi/linux/btrfs.h |    8 +-
 lib/Kconfig                |   11 +
 lib/Makefile               |    3 +
 lib/xxhash.c               |  500 +++
 lib/zstd/Makefile          |   18 +
 lib/zstd/bitstream.h       |  374 +
 lib/zstd/compress.c        | 3479
 lib/zstd/decompress.c      | 2528
 lib/zstd/entropy_common.c  |  243
 lib/zstd/error_private.h   |   53 +
 lib/zstd/fse.h             |  575
 lib/zstd/fse_compress.c    |  795 ++
 lib/zstd/fse_decompress.c  |  332 +
 lib/zstd/huf.h             |  212 +++
 lib/zstd/huf_compress.c    |  770 ++
 lib/zstd/huf_decompress.c  |  960
 lib/zstd/mem.h             |  151 ++
 lib/zstd/zstd_common.c     |   75 +
 lib/zstd/zstd_internal.h   |  250
 lib/zstd/zstd_opt.h        | 1014 +
 44 files changed, 14736 insertions(+), 12 deletions(-)
 create mode 100644 crypto/zstd.c
 create mode 100644 fs/btrfs/zstd.c
 create mode 100644 fs/squashfs/zstd_wrapper.c
 create mode 100644 include/linux/xxhash.h
 create mode 100644 include/linux/zstd.h
 create mode 100644 lib/xxhash.c
 create mode 100644 lib/zstd/Makefile
 create mode 100644 lib/zstd/bitstream.h
 create mode 100644 lib/zstd/compress.c
 create mode 100644 lib/zstd/decompress.c
 create mode 100644 lib/zstd/entropy_common.c
 create mode 100644 lib/zstd/error_private.h
 create mode 100644 lib/zstd/fse.h
 create mode 100644 lib/zstd/fse_compress.c
 create mode 100644 lib/zstd/fse_decompress.c
 create mode 100644 lib/zstd/huf.h
 create mode 100644 lib/zstd/huf_compress.c
 create mode 100644 lib/zstd/huf_decompress.c
 create mode 100644 lib/zstd/mem.h
 create mode 100644 lib/zstd/zstd_common.c
 create mode 100644 lib/zstd/zstd_internal.h
 create mode 100644 lib/zstd/zstd_opt.h
--
2.9.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FAILED: patch "[PATCH] Btrfs: fix early ENOSPC due to delalloc" failed to apply to 4.12-stable tree
Hey.

Could someone of the devs put some attention on this...?

Thanks,
Chris :-)

On Mon, 2017-07-31 at 18:06 -0700, gre...@linuxfoundation.org wrote:
> The patch below does not apply to the 4.12-stable tree.
> If someone wants it applied there, or to any other stable or longterm
> tree, then please email the backport, including the original git
> commit id to .
>
> thanks,
>
> greg k-h
>
> -- original commit in Linus's tree --
>
> From 17024ad0a0fdfcfe53043afb969b813d3e020c21 Mon Sep 17 00:00:00 2001
> From: Omar Sandoval
> Date: Thu, 20 Jul 2017 15:10:35 -0700
> Subject: [PATCH] Btrfs: fix early ENOSPC due to delalloc
>
> If a lot of metadata is reserved for outstanding delayed allocations, we
> rely on shrink_delalloc() to reclaim metadata space in order to fulfill
> reservation tickets. However, shrink_delalloc() has a shortcut where if
> it determines that space can be overcommitted, it will stop early. This
> made sense before the ticketed enospc system, but now it means that
> shrink_delalloc() will often not reclaim enough space to fulfill any
> tickets, leading to an early ENOSPC. (Reservation tickets don't care
> about being able to overcommit, they need every byte accounted for.)
>
> Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
> all of the metadata it is supposed to. This fixes early ENOSPCs we were
> seeing when doing a btrfs receive to populate a new filesystem, as well
> as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.
>
> Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
> Tested-by: Christoph Anton Mitterer
> Cc: sta...@vger.kernel.org
> Reviewed-by: Josef Bacik
> Signed-off-by: Omar Sandoval
> Signed-off-by: David Sterba
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a6635f07b8f1..e3b0b4196d3d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4825,10 +4825,6 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
> 		else
> 			flush = BTRFS_RESERVE_NO_FLUSH;
> 		spin_lock(&space_info->lock);
> -		if (can_overcommit(fs_info, space_info, orig, flush, false)) {
> -			spin_unlock(&space_info->lock);
> -			break;
> -		}
> 		if (list_empty(&space_info->tickets) &&
> 		    list_empty(&space_info->priority_tickets)) {
> 			spin_unlock(&space_info->lock);
Re: Massive loss of disk space
On 2017-08-04 10:45, Goffredo Baroncelli wrote:

On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
On 2017-08-03 12:37, Goffredo Baroncelli wrote:
On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
[...]
Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave like BTRFS does.

It seems that ZFS on Linux doesn't support fallocate; see https://github.com/zfsonlinux/zfs/issues/326

So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment.

Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).

For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.

http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212

Following the chain of function pointers

http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110

it seems that the FreeBSD vop_allocate() is implemented in vop_stdallocate()

http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912

which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures the block allocation. Of course it is an expensive solution. So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate but tries to simulate it. Of course this doesn't

From a practical perspective though, posix_fallocate() doesn't matter, because almost everything uses the native fallocate call if at all possible. As you mention, FreeBSD is emulating it, but that 'emulation' provides behavior that is close enough to what is required that it doesn't matter. As a matter of perspective, posix_fallocate() is emulated on Linux too; see my reply below to your later comment about posix_fallocate() on BTRFS.
Internally ZFS also keeps _some_ space reserved so it doesn't get wedged like BTRFS does when near full, and they don't do the whole data versus metadata segregation crap, so from a practical perspective, what FreeBSD's ZFS implementation does is sufficient because of the internal structure and handling of writes in ZFS.

That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all. Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS.

posix_fallocate in BTRFS is not reliable for another reason. This syscall guarantees that a BG is allocated, but I think that the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.

As mentioned above, posix_fallocate() is emulated in libc on Linux by calling the regular fallocate() if the FS supports it (which BTRFS does), or by writing out data like FreeBSD does in the kernel if the FS doesn't support fallocate(). IOW, posix_fallocate() has the exact same issues on BTRFS as Linux's fallocate() syscall does.

My opinion is that BTRFS is not reliable when space is exhausted, so it needs to work with an amount of free disk space. The size of this disk space should be O(2*size_of_biggest_write), and for operations like fallocate this means O(2*length). Again, this arises from how we handle writes.
If we were to track blocks that have had fallocate called on them and only use those (for the first write at least) for writes to the file that had fallocate called on them (as well as breaking reflinks on them when fallocate is called), then we can get away with just using the size of the biggest write plus a little bit more space for _data_, but even then we need space for metadata (which we don't appear to track right now).

I think it is no accident that the fallocate implemented by ZFSONLINUX only works in FALLOC_FL_PUNCH_HOLE mode.

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662

[...]
/*
 * The only flag combination which matches the behavior of zfs_space()
 * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. The FALLOC_FL_PUNCH_HOLE
 * flag was introduced in the 2.6.38 kernel.
 */
#if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
long
zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
{
	int error = -EOPNOTSUPP;
#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
	cred_t *cr = CRED();
	flock64_t bf;
	loff_t olen;
	fstrans_cookie_t cookie;

	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
Re: Massive loss of disk space
On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
[...]
>>> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a
>>> CoW filesystem _does not_ need to behave like BTRFS does.
>>
>> It seems that ZFS on Linux doesn't support fallocate
>>
>> see https://github.com/zfsonlinux/zfs/issues/326
>>
>> So I think that you are referring to posix_fallocate and ZFS on Solaris,
>> which I can't test so I can't comment.
> Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).

For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.

http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212

Following the chain of function pointers

http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110

it seems that the FreeBSD vop_allocate() is implemented in vop_stdallocate()

http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912

which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures the block allocation. Of course it is an expensive solution. So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate but tries to simulate it. Of course this doesn't

> That said, I'm starting to wonder if just failing fallocate() calls to
> allocate space is actually the right thing to do here after all. Aside from
> this, we don't reserve metadata space for checksums and similar things for
> the eventual writes (so it's possible to get -ENOSPC on a write to an
> fallocate'ed region anyway because of metadata exhaustion), and splitting
> extents can also cause it to fail, so it's perfectly possible for the
> fallocate assumption to not hold on BTRFS.
posix_fallocate in BTRFS is not reliable for another reason. This syscall guarantees that a BG is allocated, but I think that the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.

My opinion is that BTRFS is not reliable when space is exhausted, so it needs to work with an amount of free disk space. The size of this disk space should be O(2*size_of_biggest_write), and for operations like fallocate this means O(2*length).

I think it is no accident that the fallocate implemented by ZFSONLINUX only works in FALLOC_FL_PUNCH_HOLE mode.

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662

[...]
/*
 * The only flag combination which matches the behavior of zfs_space()
 * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. The FALLOC_FL_PUNCH_HOLE
 * flag was introduced in the 2.6.38 kernel.
 */
#if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
long
zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
{
	int error = -EOPNOTSUPP;
#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
	cred_t *cr = CRED();
	flock64_t bf;
	loff_t olen;
	fstrans_cookie_t cookie;

	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
		return (error);
[...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017年08月02日 16:38, Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be misconstrued. :-/

From the release notes at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html

"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.

The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature.

Red Hat will continue to invest in future technologies to address the use cases of our customers, specifically those related to snapshots, compression, NVRAM, and ease of use. We encourage feedback through your Red Hat representative on features and requirements you have for file systems and storage technology."

Personally speaking, unlike most of the btrfs supporters, I think Red Hat is doing the correct thing for their enterprise use case. (To clarify, I'm not going to Red Hat, just in case anyone wonders why I'm not supporting btrfs.)

[Good things about btrfs]
Btrfs is indeed a technical pioneer in a lot of aspects (at least in the Linux world):
1) Metadata CoW instead of a traditional journal
2) Snapshots and delta backup — I think this is the killer feature of Btrfs, and why SUSE is using it for the root fs.
3) Default data CoW
4) Data checksumming and scrubbing
5) Multi-device management
6) Online resize/balancing
And a lot more.

[Bad things about btrfs]
But for enterprise usage, it's too advanced and has several problems preventing it from being widely applied:
1) Low performance from metadata/data CoW. This is a somewhat complicated dilemma.
Although Btrfs can disable data CoW, nodatacow also disables data checksumming, which is another main feature of btrfs. So Btrfs can't default to nodatacow, unlike XFS. And metadata CoW causes extra metadata writes along with superblock updates (FUA), further degrading performance. Such a pioneering design makes traditional performance-intensive use cases very unhappy, especially almost all kinds of databases. (Note that nodatacow can't always solve the performance problem.) Most performance-intensive usage is still based on the traditional fs design (journal with no CoW).

2) Low concurrency caused by the tree design. Unlike the traditional one-tree-for-one-inode design, btrfs uses one-tree-for-one-subvolume. The design makes snapshot implementation very easy, while making the trees very hot when a lot of modifiers try to modify any metadata. Btrfs has a lot of different ways to address this. For the extent tree (the busiest tree), we use delayed refs to speed up extent tree updates. For fs tree fsync, we have the log tree to speed things up. These approaches work, at the cost of complexity and bugs, and we still have slow fs tree modification speed.

3) Low code reuse of device-mapper. I totally understand that, due to the unique support for data csums, btrfs can't use device-mapper directly, as we must verify the data read from the device before passing it to a higher level. So Btrfs uses its own device-mapper-like implementation to handle multi-device management. The result is mixed. For easy-to-handle cases like RAID0/1/10, btrfs is doing well, while for RAID5/6, everyone knows the result. Such a btrfs *enhanced* re-implementation not only makes btrfs larger but also more complex and bug-prone.

In short, btrfs is too advanced for generic use cases (performance) and developers (bugs), unfortunately. And even SUSE is just pushing btrfs as the root fs, mainly for the snapshot feature, while still recommending ext4/xfs for data or performance-intensive use cases.
[Other solutions on the table]
On the other hand, I think Red Hat is pushing storage technology based on LVM (thin) and XFS.

Traditional LVM is stable, but its snapshot design is old-fashioned and slow, while the new thin-provisioning LVM solves the problem using a method much like Btrfs, but at the block level.

And XFS is still traditionally designed — journal-based, one-tree-for-one-inode — but with fancy new features like data CoW.

Even though XFS + LVM-thin lacks the ability to shrink the fs, scrub data, or do delta backups, it can do a lot of things just like Btrfs, from snapshots to multi-device management. And, more importantly, it has better performance for things like DBs.

So, for old use cases, the performance stays almost the same. For developers, people stay focused on their old fields, with less to worry about and more focused debugging. The old UNIX method still works here: do one thing and do it well.
Re: Power down tests...
Thanks guys. I've enabled that option now. Let's see how it goes.

One general question regarding the stability of btrfs in kernel version 4.4: is this okay for power-off test cases? Or are there many important fixes in newer kernels?

On Fri, Aug 4, 2017 at 5:24 PM, Dmitrii Tcvetkov wrote:
> On Fri, 4 Aug 2017 13:19:39 +0530 Shyam Prasad N wrote:
>
>> Oh ok. I read this in the man page and assumed that it's on by default:
>>
>>     flushoncommit, noflushoncommit
>>         (default: on)
>>
>>         This option forces any data dirtied by a write in a prior
>>         transaction to commit as part of the current commit. This makes
>>         the committed state a fully consistent view of the file system
>>         from the application's perspective (i.e., it includes all
>>         completed file system operations). This was previously the
>>         behavior only when a snapshot was created.
>>
>>         Disabling flushing may improve performance but is not crash-safe.
>>
>> Maybe this needs a correction?
>
> In 4.12 btrfs-progs man pages it's already updated.
>
> $ man 5 btrfs
> ...
>     flushoncommit, noflushoncommit
>         (default: off)
>
>         This option forces any data dirtied by a write in a prior
>         transaction to commit as part of the current commit, effectively
>         a full filesystem sync.
>
>         This makes the committed state a fully consistent view of the
>         file system from the application's perspective (i.e., it includes
>         all completed file system operations). This was previously the
>         behavior only when a snapshot was created.
>
>         When off, the filesystem is consistent but buffered writes may
>         last more than one transaction commit.

--
-Shyam
[PATCH] btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2
The buffer passed to the btrfs_ioctl_tree_search* functions has to be at least sizeof(struct btrfs_ioctl_search_header). If this is not the case, then the ioctl should return -EOVERFLOW and set uarg->buf_size to the minimum required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW error with ->buf_size being set to the value passed by user space. Fix this by removing the size check and relying on search_ioctl, which already includes it and correctly sets buf_size.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/ioctl.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fa1b78cf25f6..e80950b3f340 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2201,9 +2201,6 @@ static noinline int btrfs_ioctl_tree_search_v2(struct file *file,

 	buf_size = args.buf_size;

-	if (buf_size < sizeof(struct btrfs_ioctl_search_header))
-		return -EOVERFLOW;
-
 	/* limit result size to 16MB */
 	if (buf_size > buf_limit)
 		buf_size = buf_limit;
--
2.7.4
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-03 16:45, Brendan Hide wrote:
> On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote:
>> On 2017-08-03 14:29, Christoph Anton Mitterer wrote:
>>> On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:
>>>> There are no higher-level management tools (e.g. RAID
>>>> management/monitoring, etc.)...
>>> [snip]
>> As far as 'higher-level' management tools, you're using your system
>> wrong if you _need_ them. There is no need for there to be a GUI, or a
>> web interface, or a DBus interface, or any other such bloat in the
>> main management tools; they work just fine as is and are mostly on par
>> with the interfaces provided by LVM, MD, and ZFS (other than the lack
>> of machine-parseable output). I'd also argue that if you can't
>> reassemble your storage stack by hand without using 'higher-level'
>> tools, you should not be using that storage stack, as you don't
>> properly understand it.
>>
>> On the subject of monitoring specifically, part of the issue there is
>> kernel-side: any monitoring system currently needs to be
>> polling-based, not event-based, and as a result monitoring tends to be
>> a very system-specific affair based on how much overhead you're
>> willing to tolerate. The limited stuff that does exist is also trivial
>> to integrate with many pieces of existing monitoring infrastructure
>> (like Nagios or monit), so the people who care about it a lot (like
>> me) are either monitoring by hand, or are just using the tools with
>> their existing infrastructure (for example, I already use monit on all
>> my systems, so I just make sure to have entries in its config to check
>> error counters and scrub results). So there's not much incentive for
>> the concerned parties to reinvent the wheel.
>
> To counter, I think this is a big problem with btrfs, especially in
> terms of user attrition. We don't need "GUI" tools. At all. But we do
> need btrfs to be self-sufficient enough that regular users don't get
> burnt by what they would view as unexpected behaviour.
>
> We currently have a situation where btrfs is too demanding on
> inexperienced users. I feel we need better worst-case behaviours. For
> example, if *I* have a btrfs on its second-to-last available chunk, it
> means I'm not micro-managing properly. But users shouldn't have to
> micro-manage in the first place. Btrfs (or a management tool) should
> just know to balance the least-used chunk and/or delete the
> lowest-priority snapshot, etc. It shouldn't cause my services/apps to
> get disk-space errors when, clearly, there is free space available.

That's not just an issue with BTRFS, it's an issue with the distros too.
The only one that ships any kind of scheduled regular maintenance, as far
as I know, is SUSE. We don't need some daemon, or even special handling
in the kernel; we just need to provide people with standard maintenance
tools and proper advice for monitoring. I've been meaning to write up
some wrappers and a couple of cron files to handle this a bit better, but
just haven't had time. I may look at getting that done either today or
early next week.

> The other "high-level" aspect would be along the lines of better
> guidance and standardisation for distros on how best to configure
> btrfs. This would include guidance/best practices for things like
> appropriate subvolume mountpoints and snapshot paths, and sensible
> schedules or logic (or perhaps even example tools/scripts) for
> balancing and scrubbing the filesystem.

There are currently three standards for this:

1. The snapper way, used by at least SUSE and Ubuntu, which IMO ends up
   being way too complicated for not much benefit.
2. The traditional filesystem way, used by most other distros, which
   doesn't use subvolumes at all.
3. The user-choice way, used by stuff like Arch and Gentoo, which pretty
   much says the rest of the OS couldn't care less how the filesystems
   and subvolumes are organized, as long as things work.

Overall, other than the first one, this is no different than with
regular filesystems.

> I don't have all the answers. But I also don't want to have to tell
> people they can't adopt it because a) they don't (or never will)
> understand it; and b) they're going to resent me for their
> irresponsibly losing their own data.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
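[Editorial note: the cron-based maintenance Austin describes could be as simple as one cron file. A minimal sketch, with a hypothetical mountpoint and schedules; the usage thresholds are examples, and `btrfs device stats --check` needs a reasonably recent btrfs-progs:]

```shell
# /etc/cron.d/btrfs-maintenance -- hypothetical example, tune to taste.

# Weekly scrub: re-reads all data and surfaces latent csum errors while
# a good copy may still exist on another device.
0 3 * * 0  root  btrfs scrub start -Bdq /mnt/data

# Monthly filtered balance: compacts mostly-empty chunks so the
# filesystem doesn't hit premature ENOSPC while free space is shown.
0 4 1 * *  root  btrfs balance start -dusage=50 -musage=30 /mnt/data

# Daily error-counter check: exits non-zero (and so mails root under
# most cron setups) if any device error counter is above zero.
0 5 * * *  root  btrfs device stats --check /mnt/data
```

The point of the filtered balance (`-dusage=50`) is that it only rewrites chunks under 50% full, which is cheap, rather than rewriting the whole filesystem.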
Re: SQLite Re: csum errors on top of dm-crypt
Roman Mamedov posted on Fri, 04 Aug 2017 12:44:44 +0500 as excerpted:

> On Fri, 4 Aug 2017 12:18:58 +0500 Roman Mamedov wrote:
>
>> What I find weird is why the expected csum is the same on all of
>> these. Any idea what this might point to as the cause?
>>
>> What is 0x98f94189, is it not a csum of a block of zeroes by any
>> chance?
>
> It does seem to be something of that sort, as it appears in
> https://www.spinics.net/lists/linux-btrfs/msg67281.html (though as
> factual csum, not the expected one).
>
>> a few files turned out to be unreadable
>
> Actually, turns out ALL of those are sqlite files(!)
>
> .mozilla/firefox/.../places.sqlite <- 4 instances (for 4 users)
> .moonchild productions/pale moon/.../urlclassifier3.sqlite
> .config/chromium/Default/Application Cache/Cache/data_3 <- twice (for 2
> users)
> .config/chromium/Default/History
> .config/chromium/Default/Top Sites
>
> nothing else affected.
>
> Forgot to mention that the kernel version is 4.9.40.

Not very scientific but FWIW... Kernel 4.9 or perhaps a couple kernel
cycles earlier is about the time I had some similar issues with my
firefox database files, too. I lost extensions and their settings and had
to bisect to the file level and restore the files from backup. The
problem in my case was very likely an ungraceful shutdown.

But I've had a couple such shutdowns recently (I stay current and am now
on 4.13-rc3) as well, due to summer storm season and a power supply going
out, that have had much better results -- clean scrubs after remount and
nothing lost that I can see, even after the one that blinked back out
right as I was rebooting.

So it may be something fixed in newer kernels or just happenstance, but I
won't argue with more reliable recent-kernel btrfs, even if it /is/
"just me". =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
Austin S. Hemmelgarn posted on Thu, 03 Aug 2017 15:03:53 -0400 as
excerpted:

>> Same thing with the trim feature that is marked OK. It clearly says
>> that it has performance implications. It is marked OK so one would
>> expect it to not cause the filesystem to fail, but if the performance
>> becomes so slow that the filesystem gets practically unusable it is of
>> course not "OK". The relevant information is missing for people to
>> make a decent choice and I certainly don't know how serious these
>> performance implications are, if they are at all relevant...
>
> The performance implications bit shouldn't be listed, that's a given
> for any filesystem with discard (TRIM is the ATA and eMMC command,
> UNMAP is the SCSI one, and ERASE is the name on SD cards; discard is
> the generic kernel term) support. The issue arises from devices that
> don't have support for queuing such commands, which is quite rare for
> SSDs these days.

Not so entirely rare. The generally well-regarded Samsung EVO/Pro 850
ssd series don't support queued trim, and indeed, due to a fiasco where
new firmware lied about such support[1], the kernel now blacklists
queued trim on all samsung ssds.

(I actually bought a pair of samsung evo 1TB ssds after seeing them well
recommended both on this list and in various reviews. Only AFTER I had
them, and was wondering if I could now add discard to my btrfs mount
options and was therefore googling for samsung evo queued trim
specifically, did I find out about this fiasco and samsung not supporting
linux "because anyone can write the code", or I'd have certainly
reconsidered and would very likely have spent my money elsewhere.

I did actually check the current kernel's blacklisting code and verified
it, tho I also noted it whitelists samsung ssds for actually honoring
flush directives, where the code treats non-whitelisted ssds as not
honoring them, apparently because too many claim to do so while not
actually doing so, to get better performance. So it's a mixed bag: one
whitelisting for actually flushing when it claims to, one blacklisting
for not reliably handling queued trim despite some firmware claiming to
do so.

But the worst IMO is samsung support blackballing linux because anyone
can write the code. =:^ That's worth blackballing samsung for, in my
book; I just wish I'd found out before the purchase instead of after,
tho the linux devs have at least made sure samsung ssd users don't lose
data on linux due to samsung's lies, despite samsung's horrible support
policy blackballing linux, at least at the time.)

---
[1] The firmware claimed support for a new ata standard where queued
trim is apparently mandatory, but the result was repeatedly corrupted
data. Samsung support repeatedly said they don't support Linux "because
anyone can write code to execute", and that they weren't seeing the
problem on MS simply because MS hadn't yet issued a release supporting
the new standard, and had queued trim disabled by default with the older
standards due to such problems when it was enabled.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: csum errors on top of dm-crypt
On Fri, 4 Aug 2017 12:44:44 +0500 Roman Mamedov wrote:

>> What is 0x98f94189, is it not a csum of a block of zeroes by any
>> chance?
>
> It does seem to be something of that sort

Actually, I think I know what happened.

I used "dd bs=1M conv=sparse" to copy the source FS onto a LUKS device,
which skipped copying 1M-sized areas of zeroes from the source device by
seeking over those areas on the destination device. This only works OK
if the destination device is entirely zeroed beforehand.

But I also use --allow-discards for the LUKS device; so it may be that
after a discard passthrough to the underlying SSD, which will then
return zeroes for discarded areas, LUKS will not take care to pass
zeroes back "upwards" when reading from those areas; instead it may
attempt to decrypt them with its crypto process, making them read back
to userspace as random data. So after an initial TRIM the destination
crypto device was not actually zeroed, far from it. :)

As a result, every large non-sparse file with at least a 1MB-long run of
zeroes in it (those sqlite ones appear to fit the bill) was not written
out entirely onto the destination device by dd, and the intended zero
areas were left full of crypto-randomness instead.

Sorry for the noise, I hope at least this catch was somewhat
entertaining. And Btrfs saves the day once again. :)

--
With respect,
Roman
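[Editorial note: Roman's diagnosis is easy to reproduce with plain files. A minimal sketch (hypothetical file names) showing that `dd conv=sparse` seeks over all-zero output blocks rather than writing them, so a destination that isn't already zeroed keeps its old contents there:]

```shell
# Source: 1 MiB of data, 1 MiB of zeroes, 1 MiB of data.
dd if=/dev/urandom of=src.img bs=1M count=1 2>/dev/null
dd if=/dev/zero of=src.img bs=1M count=1 seek=1 conv=notrunc 2>/dev/null
dd if=/dev/urandom of=src.img bs=1M count=1 seek=2 conv=notrunc 2>/dev/null

# Destination pre-filled with non-zero garbage -- this stands in for the
# discarded-but-not-actually-zero LUKS device.
dd if=/dev/urandom of=dst.img bs=1M count=3 2>/dev/null

# conv=sparse seeks over the all-zero middle block instead of writing it.
dd if=src.img of=dst.img bs=1M conv=sparse,notrunc 2>/dev/null

# The copies differ: the middle MiB of dst.img still holds old garbage.
cmp -s src.img dst.img && echo "identical" || echo "differ"   # → differ
```

Zeroing the destination first (or copying without `conv=sparse`) avoids the problem, at the cost of writing out the zero runs.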
Re: Power down tests...
Oh ok. I read this in the man page and assumed that it's on by default:

    flushoncommit, noflushoncommit
        (default: on)

        This option forces any data dirtied by a write in a prior
        transaction to commit as part of the current commit. This makes
        the committed state a fully consistent view of the file system
        from the application's perspective (i.e., it includes all
        completed file system operations). This was previously the
        behavior only when a snapshot was created. Disabling flushing
        may improve performance but is not crash-safe.

Maybe this needs a correction?

On Fri, Aug 4, 2017 at 12:52 PM, Adam Borowski wrote:
> On Fri, Aug 04, 2017 at 12:15:12PM +0530, Shyam Prasad N wrote:
>> Is flushoncommit not a default option on version 4.4? Do I need to
>> specifically set this option?
>
> It's not the default.
>
> --
> ⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
> ⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
> ⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
> ⠈⠳⣄ • use glitches to walk on water

--
-Shyam
SQLite Re: csum errors on top of dm-crypt
On Fri, 4 Aug 2017 12:18:58 +0500 Roman Mamedov wrote:

> What I find weird is why the expected csum is the same on all of these.
> Any idea what this might point to as the cause?
>
> What is 0x98f94189, is it not a csum of a block of zeroes by any
> chance?

It does seem to be something of that sort, as it appears in
https://www.spinics.net/lists/linux-btrfs/msg67281.html (though as
factual csum, not the expected one).

> a few files turned out to be unreadable

Actually, turns out ALL of those are sqlite files(!)

.mozilla/firefox/.../places.sqlite <- 4 instances (for 4 users)
.moonchild productions/pale moon/.../urlclassifier3.sqlite
.config/chromium/Default/Application Cache/Cache/data_3 <- twice (for 2 users)
.config/chromium/Default/History
.config/chromium/Default/Top Sites

nothing else affected.

Forgot to mention that the kernel version is 4.9.40.

--
With respect,
Roman
Re: Power down tests...
On Fri, Aug 04, 2017 at 12:15:12PM +0530, Shyam Prasad N wrote:
> Is flushoncommit not a default option on version 4.4? Do I need
> specifically set this option?

It's not the default.

--
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water
csum errors on top of dm-crypt
Hello,

I've migrated my home dir to a luks dm-crypt device some time ago, and
today during a scheduled backup a few files turned out to be unreadable,
with csum errors from Btrfs in dmesg.

What I find weird is why the expected csum is the same on all of these.
Any idea what this might point to as the cause?

What is 0x98f94189, is it not a csum of a block of zeroes by any chance?

(I use a patch from Qu Wenruo to improve the error reporting).

[483575.992252] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.994518] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.995640] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 2785280 csum 0x7f97f4a6 expected csum 0x98f94189 mirror 1
[483575.996599] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 1736704 csum 0x7476ddf8 expected csum 0x98f94189 mirror 1
[483585.020047] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023036] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023702] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1900544 csum 0x26c571dc expected csum 0x98f94189 mirror 1
[483585.023761] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 2949120 csum 0x27726fbe expected csum 0x98f94189 mirror 1
[483599.026289] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17645568 csum 0xdd5bf4de expected csum 0x98f94189 mirror 1
[483599.027425] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.032396] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.092709] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093080] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093242] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1736704 csum 0x1d4087fc expected csum 0x98f94189 mirror 1
[483627.708625] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709459] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709799] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2965504 csum 0xfaff212d expected csum 0x98f94189 mirror 1
[483634.462684] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 5062656 csum 0x8c7df392 expected csum 0x98f94189 mirror 1
[483634.462703] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.466602] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 7159808 csum 0xfc06d954 expected csum 0x98f94189 mirror 1
[483634.466604] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 6111232 csum 0xc802b3b4 expected csum 0x98f94189 mirror 1
[483634.470118] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.470257] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 10305536 csum 0x3d8c1843 expected csum 0x98f94189 mirror 1
[483634.471085] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 9256960 csum 0xba3fede3 expected csum 0x98f94189 mirror 1
[483634.471128] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 8302592 csum 0x7de15198 expected csum 0x98f94189 mirror 1
[484152.178497] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.180422] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1736704 csum 0xf01ac658 expected csum 0x98f94189 mirror 1
[484152.181598] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.182242] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 2785280 csum 0xc78988ec expected csum 0x98f94189 mirror 1
[484158.569489] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 2138112 csum 0xab34e90e expected csum 0x98f94189 mirror 1
[484158.571885] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 2785280 csum 0xd611911e expected csum 0x98f94189 mirror 1
[484158.575191] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 3833856 csum 0x6277c8a6 expected csum 0x98f94189 mirror 1
[484158.575620] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 4882432 csum 0x3293c3e7 expected csum 0x98f94189 mirror 1
[484158.578637] BTRFS warning
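[Editorial note: the guess that 0x98f94189 is the checksum of an all-zero block can be checked directly. Btrfs data checksums are standard CRC-32C (Castagnoli) over each 4 KiB block (seed ~0, final inversion); a dependency-free bitwise sketch, slow but adequate for one block — if the conjecture in the thread holds, it prints the "expected csum" from the log above:]

```python
def crc32c(data: bytes, crc: int = 0xFFFFFFFF) -> int:
    """Standard CRC-32C (reflected polynomial 0x82F63B78), which btrfs
    uses for data-block checksums (initial value ~0, final inversion)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; fold in the reflected polynomial when the
            # low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# CRC-32C of one all-zero 4 KiB data block -- compare against the
# repeated "expected csum" in the dmesg output.
print(f"{crc32c(bytes(4096)):#010x}")
```

This only covers data blocks; metadata checksums also cover the block header (bytenr, fsid), so zeroed metadata produces a different value.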
Re: Power down tests...
On Fri, Aug 04, 2017 at 11:21:15AM +0530, Shyam Prasad N wrote:
> We're running a couple of experiments on our servers with btrfs (kernel
> version 4.4), and we're running some abrupt power-off tests for a
> couple of scenarios:
>
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks), i.e. our filesystem lays out data and
> metadata on top of these two filesystems. With the test workload, it is
> going to generate a good amount of 16MB files on top of the system. On
> abrupt power-off and the following reboot, what are the recommended
> steps to run? We're attempting a btrfs mount, which seems to fail
> sometimes. If it fails, we run a fsck and then mount the btrfs. The
> issue we're facing is that a few files end up zero-sized. As a result,
> there is either data loss, or inconsistency in the stacked filesystem's
> metadata.

Sounds like you want to mount with -o flushoncommit.

> We're mounting the btrfs with a commit period of 5s. However, I do
> expect btrfs to journal the I/Os that are still dirty. Why then are we
> seeing the above behaviour?

By default, btrfs guarantees only metadata consistency, like most
filesystems. This improves performance at the cost of failing use cases
like yours.

--
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water
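[Editorial note: Adam's suggestion amounts to a one-line mount-option change. A sketch with hypothetical device and mountpoint names, keeping the 5-second commit period from the thread:]

```shell
# One-off mount: flush dirty file data as part of every transaction
# commit, with commits every 5 seconds.
mount -o flushoncommit,commit=5 /dev/sdb1 /mnt/stacked-fs

# Or persistently, via /etc/fstab:
# /dev/sdb1  /mnt/stacked-fs  btrfs  flushoncommit,commit=5  0  0
```

flushoncommit trades write performance for crash consistency of data, not just metadata; zero-length files after power loss are the classic symptom of the default behaviour.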