Re: [PATCH preview] btrfs: allow to set compression level for zlib

2017-08-04 Thread Nick Terrell
On 8/4/17, 6:27 PM, "Adam Borowski"  wrote:
> On Fri, Aug 04, 2017 at 09:51:44PM +, Nick Terrell wrote:
> > On 07/25/2017 01:29 AM, David Sterba wrote:
> > > Preliminary support for setting compression level for zlib, the
> > > following works:
> > 
> > Thanks for working on this, I think it is a great feature.
> > I have a few comments relating to how it would work with zstd.
> 
> Like, currently crashing because of ->set_level being 0? :p
> 
> > > --- a/fs/btrfs/compression.c
> > > +++ b/fs/btrfs/compression.c
> > > @@ -866,6 +866,11 @@ static void free_workspaces(void)
> > >   * Given an address space and start and length, compress the bytes into 
> > > @pages
> > >   * that are allocated on demand.
> > >   *
> > > + * @type_level is encoded algorithm and level, where level 0 means 
> > > whatever
> > > + * default the algorithm chooses and is opaque here;
> > > + * - compression algo is in bits 0-3
> > > + * - the level is in bits 4-7
> > 
> > zstd has 19 levels, but we can either only allow the first 15 + default, or
> > provide a mapping from zstd-level to BtrFS zstd-level.
> 
> Or give it more bits.  Issues like this are exactly why this patch is marked
> "preview".
> 
> But, does zstd give any gains with high compression level but input data
> capped at 128KB?  I don't see levels above 15 on your benchmark, and certain
> compression algorithms give worse results at highest levels for small
> blocks.

Yeah, I stopped my benchmarks at 15, since without configurable compression
level, high levels didn't seem useful. But level 19 could be interesting if
you are building a base image that is widely distributed. When testing BtrFS
on the Silesia corpus, the compression ratio improved all the way to level
19.
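For reference, the encoding described above (algorithm in bits 0-3, level in bits 4-7) can be sketched in plain C. The clamp to 15 is a hypothetical way of squeezing zstd's 19 levels into the 4-bit field, not something the patch does:

```c
/* Hypothetical helpers mirroring the type_level encoding from the patch:
 * compression algorithm in bits 0-3, level in bits 4-7 (level 0 means
 * "whatever default the algorithm chooses").
 */
static unsigned int encode_type_level(unsigned int type, unsigned int level)
{
	/* Only 4 bits are available for the level, so clamp: zstd's 19
	 * levels would need either this clamp or a remapping table. */
	if (level > 15)
		level = 15;
	return (type & 0xF) | (level << 4);
}

static unsigned int type_from_type_level(unsigned int type_level)
{
	return type_level & 0xF;
}

static unsigned int level_from_type_level(unsigned int type_level)
{
	return (type_level >> 4) & 0xF;
}
```

Decoding the type with `type_level & 0xF` matches the hunk quoted above; everything else here is illustrative.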

> 
> > > @@ -888,9 +893,11 @@ int btrfs_compress_pages(int type, struct 
> > > address_space *mapping,
> > >  {
> > >   struct list_head *workspace;
> > >   int ret;
> > > + int type = type_level & 0xF;
> > >  
> > >   workspace = find_workspace(type);
> > >  
> > > + btrfs_compress_op[type - 1]->set_level(workspace, type_level);
> > 
> > zlib uses the same amount of memory independently of the compression level,
> > but zstd uses a different amount of memory for each level. zstd will have
> > to allocate memory here if it doesn't have enough (or has way too much),
> > will that be okay?
> 
> We can instead store workspaces per the encoded type+level, that'd allow
> having different levels on different mounts (then props, once we get there).
> 
> Depends on whether you want highest levels, though (asked above) -- the
> highest ones take drastically more memory, so if they're out, blindly
> reserving space for the highest supported level might be not too wasteful.

Looking at the memory usage of BtrFS zstd, the 128 KB window size keeps the
memory usage very reasonable up to level 19. The zstd compression levels
are computed using a tool that selects the parameters that give the best
compression ratio for a given compression speed target. Since BtrFS has a
fixed window size, the default compression levels might not be optimal. We
could compute our own compression levels for a 128 KB window size.
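The 128 KB cap is why the memory stays flat at the high levels: zstd never benefits from a match window larger than the source. A minimal sketch of that clamp (hypothetical; this mirrors what zstd's parameter adjustment, e.g. `ZSTD_adjustCParams()`, does internally, it is not btrfs code):

```c
/* Hypothetical sketch: clamp zstd's windowLog to the source size, as
 * zstd's own parameter adjustment does. With btrfs's 128 KB input cap,
 * windowLog never exceeds 17, which keeps workspace memory flat at the
 * highest levels. */
static unsigned int clamp_window_log(unsigned int window_log,
				     unsigned long src_size)
{
	unsigned int src_log = 0;

	/* Smallest power of two >= src_size. */
	while ((1UL << src_log) < src_size)
		src_log++;
	return window_log < src_log ? window_log : src_log;
}
```

So even level 19's default windowLog collapses to 17 for 128 KB blocks, which is why the table below tops out at ~3.1 MB.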

| Level | Memory |
|-------|--------|
| 1 | 0.8 MB |
| 2 | 1.0 MB |
| 3 | 1.3 MB |
| 4 | 0.9 MB |
| 5 | 1.4 MB |
| 6 | 1.5 MB |
| 7 | 1.4 MB |
| 8 | 1.8 MB |
| 9 | 1.8 MB |
| 10| 1.8 MB |
| 11| 1.8 MB |
| 12| 1.8 MB |
| 13| 2.4 MB |
| 14| 2.6 MB |
| 15| 2.6 MB |
| 16| 3.1 MB |
| 17| 3.1 MB |
| 18| 3.1 MB |
| 19| 3.1 MB |

The workspace memory usage for each compression level.

> 
> (I have only briefly looked at memory usage and set_level(), please ignore
> me if I babble incoherently -- in bed on a N900 so I can't test it right
> now.)
> 
> 
> Meow!
> -- 
> ⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
> ⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
> ⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
> ⠈⠳⣄ • use glitches to walk on water
> 




Re: [PATCH preview] btrfs: allow to set compression level for zlib

2017-08-04 Thread Adam Borowski
On Fri, Aug 04, 2017 at 09:51:44PM +, Nick Terrell wrote:
> On 07/25/2017 01:29 AM, David Sterba wrote:
> > Preliminary support for setting compression level for zlib, the
> > following works:
> 
> Thanks for working on this, I think it is a great feature.
> I have a few comments relating to how it would work with zstd.

Like, currently crashing because of ->set_level being 0? :p

> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -866,6 +866,11 @@ static void free_workspaces(void)
> >   * Given an address space and start and length, compress the bytes into 
> > @pages
> >   * that are allocated on demand.
> >   *
> > + * @type_level is encoded algorithm and level, where level 0 means whatever
> > + * default the algorithm chooses and is opaque here;
> > + * - compression algo is in bits 0-3
> > + * - the level is in bits 4-7
> 
> zstd has 19 levels, but we can either only allow the first 15 + default, or
> provide a mapping from zstd-level to BtrFS zstd-level.

Or give it more bits.  Issues like this are exactly why this patch is marked
"preview".

But, does zstd give any gains with high compression level but input data
capped at 128KB?  I don't see levels above 15 on your benchmark, and certain
compression algorithms give worse results at highest levels for small
blocks.

> > @@ -888,9 +893,11 @@ int btrfs_compress_pages(int type, struct 
> > address_space *mapping,
> >  {
> > struct list_head *workspace;
> > int ret;
> > +   int type = type_level & 0xF;
> >  
> > workspace = find_workspace(type);
> >  
> > +   btrfs_compress_op[type - 1]->set_level(workspace, type_level);
> 
> zlib uses the same amount of memory independently of the compression level,
> but zstd uses a different amount of memory for each level. zstd will have
> to allocate memory here if it doesn't have enough (or has way too much),
> will that be okay?

We can instead store workspaces per the encoded type+level, that'd allow
having different levels on different mounts (then props, once we get there).

Depends on whether you want highest levels, though (asked above) -- the
highest ones take drastically more memory, so if they're out, blindly
reserving space for the highest supported level might be not too wasteful.
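The per-(type, level) workspace idea above can be sketched roughly as follows. This is purely illustrative, keyed by the encoded type_level byte; the real btrfs workspace manager uses per-type free lists with locking and idle-thread reaping, none of which is shown here:

```c
#include <stddef.h>

/* Illustrative only: index idle workspaces by the encoded type_level byte
 * (256 slots), so each (algorithm, level) pair keeps its own workspace and
 * zstd never has to resize one on the fly. Real btrfs code would keep a
 * locked free list per slot instead of a single pointer. */
struct ws_cache {
	void *slot[256];	/* one idle workspace per encoded type_level */
};

static void *ws_take(struct ws_cache *c, unsigned int type_level)
{
	void *ws = c->slot[type_level & 0xFF];

	c->slot[type_level & 0xFF] = NULL;	/* caller now owns it */
	return ws;	/* NULL means: allocate one for this type+level */
}

static void ws_put(struct ws_cache *c, unsigned int type_level, void *ws)
{
	c->slot[type_level & 0xFF] = ws;	/* return workspace for reuse */
}
```

The trade-off Adam notes still applies: with one slot per level, memory is only reserved for levels actually in use, at the cost of more allocation churn when levels change.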

(I have only briefly looked at memory usage and set_level(), please ignore
me if I babble incoherently -- in bed on a N900 so I can't test it right
now.)


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-04 Thread Wang Shilong
Hi Qu,

On Fri, Aug 4, 2017 at 10:05 PM, Qu Wenruo  wrote:
>
>
> On 2017年08月02日 16:38, Brendan Hide wrote:
>>
>> The title seems alarmist to me - and I suspect it is going to be
>> misconstrued. :-/
>>
>>  From the release notes at
>> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html
>>
>> "Btrfs has been deprecated
>>
>> The Btrfs file system has been in Technology Preview state since the
>> initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving
>> Btrfs to a fully supported feature and it will be removed in a future major
>> release of Red Hat Enterprise Linux.
>>
>> The Btrfs file system did receive numerous updates from the upstream in
>> Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat
>> Enterprise Linux 7 series. However, this is the last planned update to this
>> feature.
>>
>> Red Hat will continue to invest in future technologies to address the use
>> cases of our customers, specifically those related to snapshots,
>> compression, NVRAM, and ease of use. We encourage feedback through your Red
>> Hat representative on features and requirements you have for file systems
>> and storage technology."
>
>
> Personally speaking, unlike most of the btrfs supporters, I think Red Hat is
> doing the correct thing for their enterprise use case.
>
> (To clarify, I'm not going to Red Hat, just in case anyone wonders why I'm
> not supporting btrfs)
>
> [Good things of btrfs]
> Btrfs is indeed a technical pioneer in a lot of aspects (at least in the
> Linux world):
>
> 1) Metadata CoW instead of traditional journal
> 2) Snapshot and delta-backup
> I think this is the killer feature of Btrfs, and why SUSE is using it
> for root fs.
> 3) Default data CoW
> 4) Data checksum and scrubbing
> 5) Multi-device management
> 6) Online resize/balancing
> And a lot of more.
>
> [Bad things of btrfs]
> But for enterprise usage, it's too advanced and has several problems
> preventing it from being widely applied:
>
> 1) Low performance from metadata/data CoW
> This is a little complicated dilemma.
> Although Btrfs can disable data CoW, nodatacow also disables data
> checksum, which is another main feature for btrfs.
> So Btrfs can't default to nodatacow, unlike XFS.
>
> And metadata CoW causes extra metadata write along with superblock
> update (FUA), further degrading the performance.
>
> Such a pioneering design makes traditional performance-intensive use cases
> very unhappy.
> Especially almost all kinds of databases. (Note that nodatacow can't
> always solve the performance problem.)
> Most performance-intensive usage is still based on traditional fs design
> (journal with no CoW).
>
> 2) Low concurrency caused by tree design.
>  Unlike traditional one-tree-for-one-inode design, btrfs uses
> one-tree-for-one-subvolume.
>  The design makes snapshot implementation very easy, while making the tree
> very hot when a lot of modifiers are trying to modify any metadata.
>
>  Btrfs has a lot of different ways to address this.
>  For the extent tree (the busiest tree), we use delayed refs to speed
> up extent tree update.
>  For fs tree fsync, we have log tree to speed things up.
>  These approaches work, at the cost of complexity and bugs, and we still
> have slow fs tree modification speed.
>
> 3) Low code reuse of device-mapper.
>  I totally understand that, due to the unique support for data csum,
> btrfs can't use device-mapper directly, as we must verify the data read out
> from device before passing it to higher level.
> So Btrfs uses its own device-mapper-like implementation to handle
> multi-device management.
>
> The result is mixed. For easy-to-handle cases like RAID0/1/10 btrfs is
> doing well.
> While for RAID5/6, everyone knows the result.
>
> Such btrfs *enhanced* re-implementation not only makes btrfs larger but
> also more complex and bug-prone.
>
> In short, btrfs is too advanced for generic use cases (performance) and
> developers (bugs), unfortunately.
>
> And even SUSE is just pushing btrfs as root fs, mainly for the snapshot
> feature.
> But still ext4/xfs for data or performance-intensive use cases.
>
>
> [Other solution on the table]
> On the other hand, I think RedHat is pushing storage technology based on LVM
> (thin) and Xfs.
>
> For traditional LVM, it's stable but its snapshot design is old-fashioned and
> low-performance.
> The new thin-provisioning LVM solves the problem using a method just like
> Btrfs, but at block level.
>
> And for XFS, it's still traditionally designed: journal-based,
> one-tree-for-one-inode,
> but with fancy new features like data CoW.
>
> Even though XFS + LVM-thin lacks the ability to shrink the fs, scrub data, or
> do delta backup, it can do a lot of things just like Btrfs,
> from snapshots to multi-device management.
>
> And more importantly, has better 

Re: [PATCH v4 4/5] squashfs: Add zstd support

2017-08-04 Thread Sean Purcell
Signed-off-by: Sean Purcell 

On Fri, Aug 4, 2017 at 4:19 PM, Nick Terrell  wrote:
> Add zstd compression and decompression support to SquashFS. zstd is a
> great fit for SquashFS because it can compress at ratios approaching xz,
> while decompressing twice as fast as zlib. For SquashFS in particular,
> it can decompress as fast as lzo and lz4. It also has the flexibility
> to turn down the compression ratio for faster compression times.
>
> The compression benchmark is run on the file tree from the SquashFS archive
> found in ubuntu-16.10-desktop-amd64.iso [1]. It uses `mksquashfs` with the
> default block size (128 KB) and various compression algorithms/levels.
> xz and zstd are also benchmarked with 256 KB blocks. The decompression
> benchmark times how long it takes to `tar` the file tree into `/dev/null`.
> See the benchmark file in the upstream zstd source repository located under
> `contrib/linux-kernel/squashfs-benchmark.sh` [2] for details.
>
> I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
> The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
> 16 GB of RAM, and an SSD.
>
> | Method | Ratio | Compression MB/s | Decompression MB/s |
> |----------------|-------|------------------|--------------------|
> | gzip   |  2.92 |   15 |128 |
> | lzo|  2.64 |  9.5 |217 |
> | lz4|  2.12 |   94 |218 |
> | xz |  3.43 |  5.5 | 35 |
> | xz 256 KB  |  3.53 |  5.4 | 40 |
> | zstd 1 |  2.71 |   96 |210 |
> | zstd 5 |  2.93 |   69 |198 |
> | zstd 10|  3.01 |   41 |225 |
> | zstd 15|  3.13 | 11.4 |224 |
> | zstd 16 256 KB |  3.24 |  8.1 |210 |
>
> This patch was written by Sean Purcell , but I will be
> taking over the submission process.
>
> [1] http://releases.ubuntu.com/16.10/
> [2] 
> https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/squashfs-benchmark.sh
>
> zstd source repository: https://github.com/facebook/zstd
>
> Cc: Sean Purcell 
> Signed-off-by: Nick Terrell 
> ---
> v3 -> v4:
> - Fix minor linter warnings
>
>  fs/squashfs/Kconfig|  14 +
>  fs/squashfs/Makefile   |   1 +
>  fs/squashfs/decompressor.c |   7 +++
>  fs/squashfs/decompressor.h |   4 ++
>  fs/squashfs/squashfs_fs.h  |   1 +
>  fs/squashfs/zstd_wrapper.c | 149 
> +
>  6 files changed, 176 insertions(+)
>  create mode 100644 fs/squashfs/zstd_wrapper.c
>
> diff --git a/fs/squashfs/Kconfig b/fs/squashfs/Kconfig
> index ffb093e..1adb334 100644
> --- a/fs/squashfs/Kconfig
> +++ b/fs/squashfs/Kconfig
> @@ -165,6 +165,20 @@ config SQUASHFS_XZ
>
>   If unsure, say N.
>
> +config SQUASHFS_ZSTD
> +   bool "Include support for ZSTD compressed file systems"
> +   depends on SQUASHFS
> +   select ZSTD_DECOMPRESS
> +   help
> + Saying Y here includes support for reading Squashfs file systems
> + compressed with ZSTD compression.  ZSTD gives better compression 
> than
> + the default ZLIB compression, while using less CPU.
> +
> + ZSTD is not the standard compression used in Squashfs and so most
> + file systems will be readable without selecting this option.
> +
> + If unsure, say N.
> +
>  config SQUASHFS_4K_DEVBLK_SIZE
> bool "Use 4K device block size?"
> depends on SQUASHFS
> diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
> index 246a6f3..6655631 100644
> --- a/fs/squashfs/Makefile
> +++ b/fs/squashfs/Makefile
> @@ -15,3 +15,4 @@ squashfs-$(CONFIG_SQUASHFS_LZ4) += lz4_wrapper.o
>  squashfs-$(CONFIG_SQUASHFS_LZO) += lzo_wrapper.o
>  squashfs-$(CONFIG_SQUASHFS_XZ) += xz_wrapper.o
>  squashfs-$(CONFIG_SQUASHFS_ZLIB) += zlib_wrapper.o
> +squashfs-$(CONFIG_SQUASHFS_ZSTD) += zstd_wrapper.o
> diff --git a/fs/squashfs/decompressor.c b/fs/squashfs/decompressor.c
> index d2bc136..8366398 100644
> --- a/fs/squashfs/decompressor.c
> +++ b/fs/squashfs/decompressor.c
> @@ -65,6 +65,12 @@ static const struct squashfs_decompressor 
> squashfs_zlib_comp_ops = {
>  };
>  #endif
>
> +#ifndef CONFIG_SQUASHFS_ZSTD
> +static const struct squashfs_decompressor squashfs_zstd_comp_ops = {
> +   NULL, NULL, NULL, NULL, ZSTD_COMPRESSION, "zstd", 0
> +};
> +#endif
> +
>  static const struct squashfs_decompressor squashfs_unknown_comp_ops = {
> NULL, NULL, NULL, NULL, 0, "unknown", 0
>  };
> @@ -75,6 +81,7 @@ static const struct squashfs_decompressor *decompressor[] = 
> {
> &squashfs_lzo_comp_ops,
> &squashfs_xz_comp_ops,
> &squashfs_lzma_unsupported_comp_ops,
> +   &squashfs_zstd_comp_ops,
> &squashfs_unknown_comp_ops

Re: [PATCH v4 4/5] squashfs: Add zstd support

2017-08-04 Thread Nick Terrell
On 8/4/17, 3:10 PM, "linus...@gmail.com on behalf of Linus Torvalds" 
 wrote:
> On Fri, Aug 4, 2017 at 1:19 PM, Nick Terrell  wrote:
> >
> > This patch was written by Sean Purcell , but I will be
> > taking over the submission process.
> 
> Please, if so, get Sean's sign-off, and also make sure that the patch
> gets submitted with
> 
>    From: Sean Purcell 
> 
> at the top of the body of the email so that authorship gets properly
> attributed by all the usual tools.
> 
>  Linus
> 

Thanks for the help, I'll fix it for the next version.



Re: [PATCH v4 4/5] squashfs: Add zstd support

2017-08-04 Thread Linus Torvalds
On Fri, Aug 4, 2017 at 1:19 PM, Nick Terrell  wrote:
>
> This patch was written by Sean Purcell , but I will be
> taking over the submission process.

Please, if so, get Sean's sign-off, and also make sure that the patch
gets submitted with

   From: Sean Purcell 

at the top of the body of the email so that authorship gets properly
attributed by all the usual tools.

 Linus


Re: [PATCH preview] btrfs: allow to set compression level for zlib

2017-08-04 Thread Nick Terrell
On 07/25/2017 01:29 AM, David Sterba wrote:
> Preliminary support for setting compression level for zlib, the
> following works:

Thanks for working on this, I think it is a great feature.
I have a few comments relating to how it would work with zstd.

> 
> $ mount -o compress=zlib  # default
> $ mount -o compress=zlib0 # same
> $ mount -o compress=zlib9 # level 9, slower sync, less data
> $ mount -o compress=zlib1 # level 1, faster sync, more data
> $ mount -o remount,compress=zlib3 # level set by remount
> 
> The level is visible in the same format in /proc/mounts. Level set via
> file property does not work yet.
> 
> Required patch: "btrfs: prepare for extensions in compression options"
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/compression.c | 20 +++-
>  fs/btrfs/compression.h |  6 +-
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/inode.c   |  5 -
>  fs/btrfs/lzo.c |  5 +
>  fs/btrfs/super.c   |  7 +--
>  fs/btrfs/zlib.c| 12 +++-
>  7 files changed, 50 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 8ba1b86c9b72..142206d68495 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -866,6 +866,11 @@ static void free_workspaces(void)
>   * Given an address space and start and length, compress the bytes into 
> @pages
>   * that are allocated on demand.
>   *
> + * @type_level is encoded algorithm and level, where level 0 means whatever
> + * default the algorithm chooses and is opaque here;
> + * - compression algo is in bits 0-3
> + * - the level is in bits 4-7

zstd has 19 levels, but we can either only allow the first 15 + default, or
provide a mapping from zstd-level to BtrFS zstd-level.

> + *
>   * @out_pages is an in/out parameter, holds maximum number of pages to 
> allocate
>   * and returns number of actually allocated pages
>   *
> @@ -880,7 +885,7 @@ static void free_workspaces(void)
>   * @max_out tells us the max number of bytes that we're allowed to
>   * stuff into pages
>   */
> -int btrfs_compress_pages(int type, struct address_space *mapping,
> +int btrfs_compress_pages(unsigned int type_level, struct address_space 
> *mapping,
>u64 start, struct page **pages,
>unsigned long *out_pages,
>unsigned long *total_in,
> @@ -888,9 +893,11 @@ int btrfs_compress_pages(int type, struct address_space 
> *mapping,
>  {
>   struct list_head *workspace;
>   int ret;
> + int type = type_level & 0xF;
>  
>   workspace = find_workspace(type);
>  
> + btrfs_compress_op[type - 1]->set_level(workspace, type_level);

zlib uses the same amount of memory independently of the compression level,
but zstd uses a different amount of memory for each level. zstd will have
to allocate memory here if it doesn't have enough (or has way too much),
will that be okay?

>   ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
> start, pages,
> out_pages,
> @@ -1047,3 +1054,14 @@ int btrfs_decompress_buf2page(const char *buf, 
> unsigned long buf_start,
>  
>   return 1;
>  }
> +
> +unsigned int btrfs_compress_str2level(const char *str)
> +{
> + if (strncmp(str, "zlib", 4) != 0)
> + return 0;
> +
> + if ('1' <= str[4] && str[4] <= '9' )
> + return str[4] - '0';
> +
> + return 0;
> +}
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 89bcf975efb8..8a6db02d8732 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -76,7 +76,7 @@ struct compressed_bio {
>  void btrfs_init_compress(void);
>  void btrfs_exit_compress(void);
>  
> -int btrfs_compress_pages(int type, struct address_space *mapping,
> +int btrfs_compress_pages(unsigned int type_level, struct address_space 
> *mapping,
>u64 start, struct page **pages,
>unsigned long *out_pages,
>unsigned long *total_in,
> @@ -95,6 +95,8 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 
> start,
>  int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>int mirror_num, unsigned long bio_flags);
>  
> +unsigned btrfs_compress_str2level(const char *str);
> +
>  enum btrfs_compression_type {
>   BTRFS_COMPRESS_NONE  = 0,
>   BTRFS_COMPRESS_ZLIB  = 1,
> @@ -124,6 +126,8 @@ struct btrfs_compress_op {
> struct page *dest_page,
> unsigned long start_byte,
> size_t srclen, size_t destlen);
> +
> + void (*set_level)(struct list_head *ws, unsigned int type);
>  };
>  
>  extern const struct btrfs_compress_op btrfs_zlib_compress;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h

Re: please include 17024ad0a0fd ("Btrfs: fix early ENOSPC due to delalloc") to 4.12 stable

2017-08-04 Thread Greg KH
On Fri, Aug 04, 2017 at 11:25:14PM +0300, Nikolay Borisov wrote:
> Hello,
> 
> I'd like the aforementioned patch to be applied to stable 4.9/4.12. The
> attached backport applies cleanly to both of them.

Thanks, I'll queue it up after this next release happens.

greg k-h


[PATCH v4 5/5] crypto: Add zstd support

2017-08-04 Thread Nick Terrell
Adds zstd support to crypto and scompress. Only supports the default
level.

Signed-off-by: Nick Terrell 
---
 crypto/Kconfig   |   9 ++
 crypto/Makefile  |   1 +
 crypto/testmgr.c |  10 +++
 crypto/testmgr.h |  71 +++
 crypto/zstd.c| 265 +++
 5 files changed, 356 insertions(+)
 create mode 100644 crypto/zstd.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index caa770e..4fc3936 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1662,6 +1662,15 @@ config CRYPTO_LZ4HC
help
  This is the LZ4 high compression mode algorithm.
 
+config CRYPTO_ZSTD
+   tristate "Zstd compression algorithm"
+   select CRYPTO_ALGAPI
+   select CRYPTO_ACOMP2
+   select ZSTD_COMPRESS
+   select ZSTD_DECOMPRESS
+   help
+ This is the zstd algorithm.
+
 comment "Random Number Generation"
 
 config CRYPTO_ANSI_CPRNG
diff --git a/crypto/Makefile b/crypto/Makefile
index d41f033..b22e1e8 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -133,6 +133,7 @@ obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
 obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
 obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
+obj-$(CONFIG_CRYPTO_ZSTD) += zstd.o
 
 ecdh_generic-y := ecc.o
 ecdh_generic-y += ecdh.o
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 7125ba3..8a124d3 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3603,6 +3603,16 @@ static const struct alg_test_desc alg_test_descs[] = {
.decomp = 
__VECS(zlib_deflate_decomp_tv_template)
}
}
+   }, {
+   .alg = "zstd",
+   .test = alg_test_comp,
+   .fips_allowed = 1,
+   .suite = {
+   .comp = {
+   .comp = __VECS(zstd_comp_tv_template),
+   .decomp = __VECS(zstd_decomp_tv_template)
+   }
+   }
}
 };
 
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 6ceb0e2..e6b5920 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -34631,4 +34631,75 @@ static const struct comp_testvec 
lz4hc_decomp_tv_template[] = {
},
 };
 
+static const struct comp_testvec zstd_comp_tv_template[] = {
+   {
+   .inlen  = 68,
+   .outlen = 39,
+   .input  = "The algorithm is zstd. "
+ "The algorithm is zstd. "
+ "The algorithm is zstd.",
+   .output = "\x28\xb5\x2f\xfd\x00\x50\xf5\x00\x00\xb8\x54\x68\x65"
+ "\x20\x61\x6c\x67\x6f\x72\x69\x74\x68\x6d\x20\x69\x73"
+ "\x20\x7a\x73\x74\x64\x2e\x20\x01\x00\x55\x73\x36\x01"
+ ,
+   },
+   {
+   .inlen  = 244,
+   .outlen = 151,
+   .input  = "zstd, short for Zstandard, is a fast lossless "
+ "compression algorithm, targeting real-time "
+ "compression scenarios at zlib-level and better "
+ "compression ratios. The zstd compression library "
+ "provides in-memory compression and decompression "
+ "functions.",
+   .output = "\x28\xb5\x2f\xfd\x00\x50\x75\x04\x00\x42\x4b\x1e\x17"
+ "\x90\x81\x31\x00\xf2\x2f\xe4\x36\xc9\xef\x92\x88\x32"
+ "\xc9\xf2\x24\x94\xd8\x68\x9a\x0f\x00\x0c\xc4\x31\x6f"
+ "\x0d\x0c\x38\xac\x5c\x48\x03\xcd\x63\x67\xc0\xf3\xad"
+ "\x4e\x90\xaa\x78\xa0\xa4\xc5\x99\xda\x2f\xb6\x24\x60"
+ "\xe2\x79\x4b\xaa\xb6\x6b\x85\x0b\xc9\xc6\x04\x66\x86"
+ "\xe2\xcc\xe2\x25\x3f\x4f\x09\xcd\xb8\x9d\xdb\xc1\x90"
+ "\xa9\x11\xbc\x35\x44\x69\x2d\x9c\x64\x4f\x13\x31\x64"
+ "\xcc\xfb\x4d\x95\x93\x86\x7f\x33\x7f\x1a\xef\xe9\x30"
+ "\xf9\x67\xa1\x94\x0a\x69\x0f\x60\xcd\xc3\xab\x99\xdc"
+ "\x42\xed\x97\x05\x00\x33\xc3\x15\x95\x3a\x06\xa0\x0e"
+ "\x20\xa9\x0e\x82\xb9\x43\x45\x01",
+   },
+};
+
+static const struct comp_testvec zstd_decomp_tv_template[] = {
+   {
+   .inlen  = 43,
+   .outlen = 68,
+   .input  = "\x28\xb5\x2f\xfd\x04\x50\xf5\x00\x00\xb8\x54\x68\x65"
+ "\x20\x61\x6c\x67\x6f\x72\x69\x74\x68\x6d\x20\x69\x73"
+ "\x20\x7a\x73\x74\x64\x2e\x20\x01\x00\x55\x73\x36\x01"
+ "\x6b\xf4\x13\x35",
+   .output = "The algorithm is zstd. "
+ "The algorithm is zstd. "
+ "The algorithm is zstd.",
+   },
+   {
+   .inlen  = 155,
+

[PATCH v4 4/5] squashfs: Add zstd support

2017-08-04 Thread Nick Terrell
Add zstd compression and decompression support to SquashFS. zstd is a
great fit for SquashFS because it can compress at ratios approaching xz,
while decompressing twice as fast as zlib. For SquashFS in particular,
it can decompress as fast as lzo and lz4. It also has the flexibility
to turn down the compression ratio for faster compression times.

The compression benchmark is run on the file tree from the SquashFS archive
found in ubuntu-16.10-desktop-amd64.iso [1]. It uses `mksquashfs` with the
default block size (128 KB) and various compression algorithms/levels.
xz and zstd are also benchmarked with 256 KB blocks. The decompression
benchmark times how long it takes to `tar` the file tree into `/dev/null`.
See the benchmark file in the upstream zstd source repository located under
`contrib/linux-kernel/squashfs-benchmark.sh` [2] for details.

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and an SSD.

| Method | Ratio | Compression MB/s | Decompression MB/s |
|----------------|-------|------------------|--------------------|
| gzip   |  2.92 |   15 |128 |
| lzo|  2.64 |  9.5 |217 |
| lz4|  2.12 |   94 |218 |
| xz |  3.43 |  5.5 | 35 |
| xz 256 KB  |  3.53 |  5.4 | 40 |
| zstd 1 |  2.71 |   96 |210 |
| zstd 5 |  2.93 |   69 |198 |
| zstd 10|  3.01 |   41 |225 |
| zstd 15|  3.13 | 11.4 |224 |
| zstd 16 256 KB |  3.24 |  8.1 |210 |

This patch was written by Sean Purcell , but I will be
taking over the submission process.

[1] http://releases.ubuntu.com/16.10/
[2] 
https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/squashfs-benchmark.sh

zstd source repository: https://github.com/facebook/zstd

Cc: Sean Purcell 
Signed-off-by: Nick Terrell 
---
v3 -> v4:
- Fix minor linter warnings

 fs/squashfs/Kconfig|  14 +
 fs/squashfs/Makefile   |   1 +
 fs/squashfs/decompressor.c |   7 +++
 fs/squashfs/decompressor.h |   4 ++
 fs/squashfs/squashfs_fs.h  |   1 +
 fs/squashfs/zstd_wrapper.c | 149 +
 6 files changed, 176 insertions(+)
 create mode 100644 fs/squashfs/zstd_wrapper.c

diff --git a/fs/squashfs/Kconfig b/fs/squashfs/Kconfig
index ffb093e..1adb334 100644
--- a/fs/squashfs/Kconfig
+++ b/fs/squashfs/Kconfig
@@ -165,6 +165,20 @@ config SQUASHFS_XZ

  If unsure, say N.

+config SQUASHFS_ZSTD
+   bool "Include support for ZSTD compressed file systems"
+   depends on SQUASHFS
+   select ZSTD_DECOMPRESS
+   help
+ Saying Y here includes support for reading Squashfs file systems
+ compressed with ZSTD compression.  ZSTD gives better compression than
+ the default ZLIB compression, while using less CPU.
+
+ ZSTD is not the standard compression used in Squashfs and so most
+ file systems will be readable without selecting this option.
+
+ If unsure, say N.
+
 config SQUASHFS_4K_DEVBLK_SIZE
bool "Use 4K device block size?"
depends on SQUASHFS
diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
index 246a6f3..6655631 100644
--- a/fs/squashfs/Makefile
+++ b/fs/squashfs/Makefile
@@ -15,3 +15,4 @@ squashfs-$(CONFIG_SQUASHFS_LZ4) += lz4_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_LZO) += lzo_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_XZ) += xz_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_ZLIB) += zlib_wrapper.o
+squashfs-$(CONFIG_SQUASHFS_ZSTD) += zstd_wrapper.o
diff --git a/fs/squashfs/decompressor.c b/fs/squashfs/decompressor.c
index d2bc136..8366398 100644
--- a/fs/squashfs/decompressor.c
+++ b/fs/squashfs/decompressor.c
@@ -65,6 +65,12 @@ static const struct squashfs_decompressor 
squashfs_zlib_comp_ops = {
 };
 #endif

+#ifndef CONFIG_SQUASHFS_ZSTD
+static const struct squashfs_decompressor squashfs_zstd_comp_ops = {
+   NULL, NULL, NULL, NULL, ZSTD_COMPRESSION, "zstd", 0
+};
+#endif
+
 static const struct squashfs_decompressor squashfs_unknown_comp_ops = {
NULL, NULL, NULL, NULL, 0, "unknown", 0
 };
@@ -75,6 +81,7 @@ static const struct squashfs_decompressor *decompressor[] = {
&squashfs_lzo_comp_ops,
&squashfs_xz_comp_ops,
&squashfs_lzma_unsupported_comp_ops,
+   &squashfs_zstd_comp_ops,
&squashfs_unknown_comp_ops
 };

diff --git a/fs/squashfs/decompressor.h b/fs/squashfs/decompressor.h
index a25713c..0f5a8e4 100644
--- a/fs/squashfs/decompressor.h
+++ b/fs/squashfs/decompressor.h
@@ -58,4 +58,8 @@ extern const struct squashfs_decompressor 
squashfs_lzo_comp_ops;
 extern const struct squashfs_decompressor squashfs_zlib_comp_ops;
 #endif

+#ifdef 

please include 17024ad0a0fd ("Btrfs: fix early ENOSPC due to delalloc") to 4.12 stable

2017-08-04 Thread Nikolay Borisov
Hello,

I'd like the aforementioned patch to be applied to stable 4.9/4.12. The
attached backport applies cleanly to both of them.
From 278e5d0839f4ecc6d7bfb7a95cb735b9034e8315 Mon Sep 17 00:00:00 2001
From: Omar Sandoval 
Date: Thu, 20 Jul 2017 15:10:35 -0700
Subject: [PATCH] Btrfs: fix early ENOSPC due to delalloc

If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem, as well
as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer 
Cc: sta...@vger.kernel.org
Reviewed-by: Josef Bacik 
Signed-off-by: Omar Sandoval 
Signed-off-by: David Sterba 
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/extent-tree.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a8d6ad4042b7..adb285a93753 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4813,10 +4813,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
 		else
 			flush = BTRFS_RESERVE_NO_FLUSH;
 		spin_lock(&space_info->lock);
-		if (can_overcommit(root, space_info, orig, flush)) {
-			spin_unlock(&space_info->lock);
-			break;
-		}
 		if (list_empty(&space_info->tickets) &&
 		    list_empty(&space_info->priority_tickets)) {
 			spin_unlock(&space_info->lock);
-- 
2.7.4



[PATCH v4 3/5] btrfs: Add zstd support

2017-08-04 Thread Nick Terrell
Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.

The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.

| Method  | Ratio | Compression MB/s | Decompression MB/s |
|---------|-------|------------------|--------------------|
| None|  0.99 |  504 | 686 |
| lzo |  1.66 |  398 | 442 |
| zlib|  2.58 |   65 | 241 |
| zstd 1  |  2.57 |  260 | 383 |
| zstd 3  |  2.71 |  174 | 408 |
| zstd 6  |  2.87 |   70 | 398 |
| zstd 9  |  2.92 |   43 | 406 |
| zstd 12 |  2.93 |   21 | 408 |
| zstd 15 |  3.01 |   11 | 354 |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s) | Read (s) |
|--------|-----------|---------------|----------|-------------|----------|
| None   |  0.97 |  0.78 |0.981 |  5.501 |8.807 |
| lzo|  2.06 |  1.38 |1.631 |  8.458 |8.585 |
| zlib   |  3.40 |  1.86 |7.750 | 21.544 |   11.744 |
| zstd 1 |  3.57 |  1.85 |2.579 | 11.479 |9.389 |

[1] 
https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] 
https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell 
---
v2 -> v3:
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff
- Change default compression level for BtrFS to 3

v3 -> v4:
- Add missing includes, which fixes the aarch64 build
- Fix minor linter warnings

 fs/btrfs/Kconfig   |   2 +
 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/compression.c |   1 +
 fs/btrfs/compression.h |   6 +-
 fs/btrfs/ctree.h   |   1 +
 fs/btrfs/disk-io.c |   2 +
 fs/btrfs/ioctl.c   |   6 +-
 fs/btrfs/props.c   |   6 +
 fs/btrfs/super.c   |  12 +-
 fs/btrfs/sysfs.c   |   2 +
 fs/btrfs/zstd.c| 432 +
 include/uapi/linux/btrfs.h |   8 +-
 12 files changed, 468 insertions(+), 12 deletions(-)
 create mode 100644 fs/btrfs/zstd.c

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 80e9c18..a26c63b 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -6,6 +6,8 @@ config BTRFS_FS
select ZLIB_DEFLATE
select LZO_COMPRESS
select LZO_DECOMPRESS
+   select ZSTD_COMPRESS
+   select ZSTD_DECOMPRESS
select RAID6_PQ
select XOR_BLOCKS
select SRCU
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..962a95a 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -6,7 +6,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   transaction.o inode.o file.o tree-defrag.o \
   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
-  export.o tree-log.o free-space-cache.o zlib.o lzo.o \
+  export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
   compression.o 

[PATCH v4 1/5] lib: Add xxhash module

2017-08-04 Thread Nick Terrell
Adds xxhash kernel module with xxh32 and xxh64 hashes. xxhash is an
extremely fast non-cryptographic hash algorithm for checksumming.
The zstd compression and decompression modules added in the next patch
require xxhash. I extracted it out from zstd since it is useful on its
own. I copied the code from the upstream XXHash source repository and
translated it into kernel style. I ran benchmarks and tests in the kernel
and tests in userland.

I benchmarked xxhash as a special character device. I ran in four modes,
no-op, xxh32, xxh64, and crc32. The no-op mode simply copies the data to
kernel space and ignores it. The xxh32, xxh64, and crc32 modes compute
hashes on the copied data. I also ran it with four different buffer sizes.
The benchmark file is located in the upstream zstd source repository under
`contrib/linux-kernel/xxhash_test.c` [1].

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using the file `filesystem.squashfs`
from `ubuntu-16.10-desktop-amd64.iso`, which is 1,536,217,088 B large.
Run the following commands for the benchmark:

modprobe xxhash_test
mknod xxhash_test c 245 0
time cp filesystem.squashfs xxhash_test

The time is reported by the time of the userland `cp`.
The GB/s is computed with

1,536,217,088 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted GB/s is computed with

1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).


| Buffer Size (B) | Hash  | Time (s) | GB/s | Adjusted GB/s |
|-----------------|-------|----------|------|---------------|
|1024 | none  |0.408 | 3.77 | - |
|1024 | xxh32 |0.649 | 2.37 |  6.37 |
|1024 | xxh64 |0.542 | 2.83 | 11.46 |
|1024 | crc32 |1.290 | 1.19 |  1.74 |
|4096 | none  |0.380 | 4.04 | - |
|4096 | xxh32 |0.645 | 2.38 |  5.79 |
|4096 | xxh64 |0.500 | 3.07 | 12.80 |
|4096 | crc32 |1.168 | 1.32 |  1.95 |
|8192 | none  |0.351 | 4.38 | - |
|8192 | xxh32 |0.614 | 2.50 |  5.84 |
|8192 | xxh64 |0.464 | 3.31 | 13.60 |
|8192 | crc32 |1.163 | 1.32 |  1.89 |
|   16384 | none  |0.346 | 4.43 | - |
|   16384 | xxh32 |0.590 | 2.60 |  6.30 |
|   16384 | xxh64 |0.466 | 3.30 | 12.80 |
|   16384 | crc32 |1.183 | 1.30 |  1.84 |

Tested in userland using the test-suite in the zstd repo under
`contrib/linux-kernel/test/XXHashUserlandTest.cpp` [2] by mocking the
kernel functions. A line in each branch of every function in `xxhash.c`
was commented out to ensure that the test-suite fails. Additionally
tested while testing zstd and with SMHasher [3].

[1] https://phabricator.intern.facebook.com/P57526246
[2] 
https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/test/XXHashUserlandTest.cpp
[3] https://github.com/aappleby/smhasher

zstd source repository: https://github.com/facebook/zstd
XXHash source repository: https://github.com/cyan4973/xxhash

Signed-off-by: Nick Terrell 
---
v1 -> v2:
- Make pointer in lib/xxhash.c:394 non-const

 include/linux/xxhash.h | 236 +++
 lib/Kconfig|   3 +
 lib/Makefile   |   1 +
 lib/xxhash.c   | 500 +
 4 files changed, 740 insertions(+)
 create mode 100644 include/linux/xxhash.h
 create mode 100644 lib/xxhash.c

diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h
new file mode 100644
index 000..9e1f42c
--- /dev/null
+++ b/include/linux/xxhash.h
@@ -0,0 +1,236 @@
+/*
+ * xxHash - Extremely Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet.
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *   * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *   * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,

[PATCH v4 0/5] Add xxhash and zstd modules

2017-08-04 Thread Nick Terrell
Hi all,

This patch set adds xxhash, zstd compression, and zstd decompression
modules. It also adds zstd support to BtrFS and SquashFS.

Each patch has relevant summaries, benchmarks, and tests.

Best,
Nick Terrell

Changelog:

v1 -> v2:
- Make pointer in lib/xxhash.c:394 non-const (1/5)
- Use div_u64() for division of u64s (2/5)
- Reduce stack usage of ZSTD_compressSequences(), ZSTD_buildSeqTable(),
  ZSTD_decompressSequencesLong(), FSE_buildDTable(), FSE_decompress_wksp(),
  HUF_writeCTable(), HUF_readStats(), HUF_readCTable(),
  HUF_compressWeights(), HUF_readDTableX2(), and HUF_readDTableX4() (2/5)
- No zstd function uses more than 400 B of stack space (2/5)

v2 -> v3:
- Work around gcc-7 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388
  (2/5)
- Fix bug in dictionary compression from upstream commit cc1522351f (2/5)
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff (3/5)
- Change default compression level for BtrFS to 3 (3/5)

v3 -> v4:
- Fix compiler warnings (2/5)
- Add missing includes (3/5)
- Fix minor linter warnings (3/5, 4/5)
- Add crypto patch (5/5)

Nick Terrell (5):
  lib: Add xxhash module
  lib: Add zstd modules
  btrfs: Add zstd support
  squashfs: Add zstd support
  crypto: Add zstd support

 crypto/Kconfig |9 +
 crypto/Makefile|1 +
 crypto/testmgr.c   |   10 +
 crypto/testmgr.h   |   71 +
 crypto/zstd.c  |  265 
 fs/btrfs/Kconfig   |2 +
 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |1 +
 fs/btrfs/compression.h |6 +-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |6 +-
 fs/btrfs/props.c   |6 +
 fs/btrfs/super.c   |   12 +-
 fs/btrfs/sysfs.c   |2 +
 fs/btrfs/zstd.c|  432 ++
 fs/squashfs/Kconfig|   14 +
 fs/squashfs/Makefile   |1 +
 fs/squashfs/decompressor.c |7 +
 fs/squashfs/decompressor.h |4 +
 fs/squashfs/squashfs_fs.h  |1 +
 fs/squashfs/zstd_wrapper.c |  149 ++
 include/linux/xxhash.h |  236 +++
 include/linux/zstd.h   | 1157 +++
 include/uapi/linux/btrfs.h |8 +-
 lib/Kconfig|   11 +
 lib/Makefile   |3 +
 lib/xxhash.c   |  500 +++
 lib/zstd/Makefile  |   18 +
 lib/zstd/bitstream.h   |  374 +
 lib/zstd/compress.c| 3479 
 lib/zstd/decompress.c  | 2528 
 lib/zstd/entropy_common.c  |  243 
 lib/zstd/error_private.h   |   53 +
 lib/zstd/fse.h |  575 
 lib/zstd/fse_compress.c|  795 ++
 lib/zstd/fse_decompress.c  |  332 +
 lib/zstd/huf.h |  212 +++
 lib/zstd/huf_compress.c|  770 ++
 lib/zstd/huf_decompress.c  |  960 
 lib/zstd/mem.h |  151 ++
 lib/zstd/zstd_common.c |   75 +
 lib/zstd/zstd_internal.h   |  250 
 lib/zstd/zstd_opt.h| 1014 +
 44 files changed, 14736 insertions(+), 12 deletions(-)
 create mode 100644 crypto/zstd.c
 create mode 100644 fs/btrfs/zstd.c
 create mode 100644 fs/squashfs/zstd_wrapper.c
 create mode 100644 include/linux/xxhash.h
 create mode 100644 include/linux/zstd.h
 create mode 100644 lib/xxhash.c
 create mode 100644 lib/zstd/Makefile
 create mode 100644 lib/zstd/bitstream.h
 create mode 100644 lib/zstd/compress.c
 create mode 100644 lib/zstd/decompress.c
 create mode 100644 lib/zstd/entropy_common.c
 create mode 100644 lib/zstd/error_private.h
 create mode 100644 lib/zstd/fse.h
 create mode 100644 lib/zstd/fse_compress.c
 create mode 100644 lib/zstd/fse_decompress.c
 create mode 100644 lib/zstd/huf.h
 create mode 100644 lib/zstd/huf_compress.c
 create mode 100644 lib/zstd/huf_decompress.c
 create mode 100644 lib/zstd/mem.h
 create mode 100644 lib/zstd/zstd_common.c
 create mode 100644 lib/zstd/zstd_internal.h
 create mode 100644 lib/zstd/zstd_opt.h

--
2.9.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FAILED: patch "[PATCH] Btrfs: fix early ENOSPC due to delalloc" failed to apply to 4.12-stable tree

2017-08-04 Thread Christoph Anton Mitterer
Hey.

Could someone of the devs put some attention on this...?

Thanks,
Chris :-)


On Mon, 2017-07-31 at 18:06 -0700, gre...@linuxfoundation.org wrote:
> The patch below does not apply to the 4.12-stable tree.
> If someone wants it applied there, or to any other stable or longterm
> tree, then please email the backport, including the original git
> commit
> id to .
> 
> thanks,
> 
> greg k-h
> 
> -- original commit in Linus's tree --
> 
> From 17024ad0a0fdfcfe53043afb969b813d3e020c21 Mon Sep 17 00:00:00
> 2001
> From: Omar Sandoval 
> Date: Thu, 20 Jul 2017 15:10:35 -0700
> Subject: [PATCH] Btrfs: fix early ENOSPC due to delalloc
> 
> If a lot of metadata is reserved for outstanding delayed allocations,
> we
> rely on shrink_delalloc() to reclaim metadata space in order to
> fulfill
> reservation tickets. However, shrink_delalloc() has a shortcut where
> if
> it determines that space can be overcommitted, it will stop early.
> This
> made sense before the ticketed enospc system, but now it means that
> shrink_delalloc() will often not reclaim enough space to fulfill any
> tickets, leading to an early ENOSPC. (Reservation tickets don't care
> about being able to overcommit, they need every byte accounted for.)
> 
> Fix it by getting rid of the shortcut so that shrink_delalloc()
> reclaims
> all of the metadata it is supposed to. This fixes early ENOSPCs we
> were
> seeing when doing a btrfs receive to populate a new filesystem, as
> well
> as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.
> 
> Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc
> infrastructure")
> Tested-by: Christoph Anton Mitterer 
> Cc: sta...@vger.kernel.org
> Reviewed-by: Josef Bacik 
> Signed-off-by: Omar Sandoval 
> Signed-off-by: David Sterba 
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a6635f07b8f1..e3b0b4196d3d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4825,10 +4825,6 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  		else
>  			flush = BTRFS_RESERVE_NO_FLUSH;
>  		spin_lock(&space_info->lock);
> -		if (can_overcommit(fs_info, space_info, orig, flush, false)) {
> -			spin_unlock(&space_info->lock);
> -			break;
> -		}
>  		if (list_empty(&space_info->tickets) &&
>  		    list_empty(&space_info->priority_tickets)) {
>  			spin_unlock(&space_info->lock);
> 



Re: Massive loss of disk space

2017-08-04 Thread Austin S. Hemmelgarn

On 2017-08-04 10:45, Goffredo Baroncelli wrote:
> On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
>> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> [...]
>
>>>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means
>>>> that a CoW filesystem _does not_ need to behave the way BTRFS does.
>>>
>>> It seems that ZFS on linux doesn't support fallocate
>>>
>>> see https://github.com/zfsonlinux/zfs/issues/326
>>>
>>> So I think that you are referring to a posix_fallocate and ZFS on
>>> solaris, which I can't test so I can't comment.
>> Both Solaris, and FreeBSD (I've got a FreeNAS system at work I checked on).
>
> For fun I checked the freebsd source and zfs source. To me it seems that
> ZFS on freebsd doesn't implement posix_fallocate() (VOP_ALLOCATE in
> FreeBSD jargon), but instead relies on the freebsd default one.
>
> http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212
>
> Following the chain of function pointers
>
> http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110
>
> it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()
>
> http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912
>
> which simply calls read() and write() on the range [offset...offset+len),
> which for a "conventional" filesystem ensures the block allocation. Of
> course it is an expensive solution.
>
> So I think (but I am not familiar with freebsd) that ZFS doesn't implement
> a real posix_fallocate but tries to simulate it. Of course this don't
From a practical perspective though, posix_fallocate() doesn't matter, 
because almost everything uses the native fallocate call if at all 
possible.  As you mention, FreeBSD is emulating it, but that 'emulation' 
provides behavior that is close enough to what is required that it 
doesn't matter.  As a matter of perspective, posix_fallocate() is 
emulated on Linux too, see my reply below to your later comment about 
posix_fallocate() on BTRFS.

Internally ZFS also keeps _some_ space reserved so it doesn't get wedged 
like BTRFS does when near full, and they don't do the whole data versus 
metadata segregation crap, so from a practical perspective, what 
FreeBSD's ZFS implementation does is sufficient because of the internal 
structure and handling of writes in ZFS.

>> That said, I'm starting to wonder if just failing fallocate() calls to
>> allocate space is actually the right thing to do here after all.  Aside
>> from this, we don't reserve metadata space for checksums and similar
>> things for the eventual writes (so it's possible to get -ENOSPC on a
>> write to an fallocate'ed region anyway because of metadata exhaustion),
>> and splitting extents can also cause it to fail, so it's perfectly
>> possible for the fallocate assumption to not hold on BTRFS.
>
> posix_fallocate in BTRFS is not reliable for another reason. This syscall
> guarantees that a BG is allocated, but I think that the allocated BG is
> available to all processes, so a parallel process may exhaust all the
> available space before the first process uses it.
As mentioned above, posix_fallocate() is emulated in libc on Linux by 
calling the regular fallocate() if the FS supports it (which BTRFS 
does), or by writing out data like FreeBSD does in the kernel if the FS 
doesn't support fallocate().  IOW, posix_fallocate() has the exact same 
issues on BTRFS as Linux's fallocate() syscall does.

> My opinion is that BTRFS is not reliable when the space is exhausted, so
> it needs to work with an amount of disk space free. The size of this disk
> space should be O(2*size_of_biggest_write), and for operation like
> fallocate this means O(2*length).
Again, this arises from how we handle writes.  If we were to track 
blocks that have had fallocate called on them and only use those (for 
the first write at least) for writes to the file that had fallocate 
called on them (as well as breaking reflinks on them when fallocate is 
called), then we can get away with just using the size of the biggest 
write plus a little bit more space for _data_, but even then we need 
space for metadata (which we don't appear to track right now).

> I think that is not casual that the fallocate implemented by ZFSONLINUX
> works with the flag FALLOC_FL_PUNCH_HOLE mode.
>
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
> [...]
> /*
>  * The only flag combination which matches the behavior of zfs_space()
>  * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.  The FALLOC_FL_PUNCH_HOLE
>  * flag was introduced in the 2.6.38 kernel.
>  */
> #if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
> long
> zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
> {
> 	int error = -EOPNOTSUPP;
>
> #if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
> 	cred_t *cr = CRED();
> 	flock64_t bf;
> 	loff_t olen;
> 	fstrans_cookie_t cookie;
>
> 	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> 		return (error);

Re: Massive loss of disk space

2017-08-04 Thread Goffredo Baroncelli
On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
[...]

>>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a 
>>> CoW filesystem _does not_ need to behave the way BTRFS does.
>>
>> It seems that ZFS on linux doesn't support fallocate
>>
>> see https://github.com/zfsonlinux/zfs/issues/326
>>
>> So I think that you are referring to a posix_fallocate and ZFS on solaris, 
>> which I can't test so I can't comment.
> Both Solaris, and FreeBSD (I've got a FreeNAS system at work I checked on).

For fun I checked the freebsd source and zfs source. To me it seems that ZFS on 
freebsd doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), 
but instead relies on the freebsd default one.


http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212

Following the chain of function pointers

http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110

it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()

http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912

which simply calls read() and write() on the range [offset...offset+len), which 
for a "conventional" filesystem ensures the block allocation. Of course it is an 
expensive solution.

So I think (but I am not familiar with freebsd) that ZFS doesn't implement a 
real posix_fallocate but tries to simulate it. Of course this don't


> 
> That said, I'm starting to wonder if just failing fallocate() calls to 
> allocate space is actually the right thing to do here after all.  Aside from 
> this, we don't reserve metadata space for checksums and similar things for 
> the eventual writes (so it's possible to get -ENOSPC on a write to an 
> fallocate'ed region anyway because of metadata exhaustion), and splitting 
> extents can also cause it to fail, so it's perfectly possible for the 
> fallocate assumption to not hold on BTRFS.  

posix_fallocate in BTRFS is not reliable for another reason. This syscall 
guarantees that a BG is allocated, but I think that the allocated BG is 
available to all processes, so a parallel process may exhaust all the available 
space before the first process uses it.

My opinion is that BTRFS is not reliable when the space is exhausted, so it 
needs to work with an amount of disk space free. The size of this disk space 
should be O(2*size_of_biggest_write), and for operation like fallocate this 
means O(2*length).

I think that is not casual that the fallocate implemented by ZFSONLINUX works 
with the flag FALLOC_FL_PUNCH_HOLE mode.

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
[...]
/*
 * The only flag combination which matches the behavior of zfs_space()
 * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.  The FALLOC_FL_PUNCH_HOLE
 * flag was introduced in the 2.6.38 kernel.
 */
#if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
long
zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
{
int error = -EOPNOTSUPP;

#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
cred_t *cr = CRED();
flock64_t bf;
loff_t olen;
fstrans_cookie_t cookie;

if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return (error);

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-04 Thread Qu Wenruo



On 2017年08月02日 16:38, Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


 From the release notes at 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html 



"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the 
initial release of Red Hat Enterprise Linux 6. Red Hat will not be 
moving Btrfs to a fully supported feature and it will be removed in a 
future major release of Red Hat Enterprise Linux.


The Btrfs file system did receive numerous updates from the upstream in 
Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat 
Enterprise Linux 7 series. However, this is the last planned update to 
this feature.


Red Hat will continue to invest in future technologies to address the 
use cases of our customers, specifically those related to snapshots, 
compression, NVRAM, and ease of use. We encourage feedback through your 
Red Hat representative on features and requirements you have for file 
systems and storage technology."


Personally speaking, unlike most of the btrfs supporters, I think Red 
Hat is doing the correct thing for their enterprise use case.


(To clarify, I'm not going to Red Hat, just in case anyone wonders why 
I'm not supporting btrfs)


[Good things of btrfs]
Btrfs is indeed a technical pioneer in a lot of aspects (at least in linux 
world):


1) Metadata CoW instead of traditional journal
2) Snapshot and delta-backup
I think this is the killer feature of Btrfs, and why SUSE is using 
it for root fs.

3) Default data CoW
4) Data checksum and scrubbing
5) Multi-device management
6) Online resize/balancing
And a lot of more.

[Bad things of btrfs]
But for enterprise usage, it's too advanced and has several problems 
preventing it from being widely adopted:


1) Low performance from metadata/data CoW
This is a little complicated dilemma.
Although Btrfs can disable data CoW, nodatacow also disables data 
checksum, which is another main feature for btrfs.

So Btrfs can't default to nodatacow, unlike XFS.

And metadata CoW causes extra metadata write along with superblock 
update (FUA), further degrading the performance.


Such a pioneering design makes traditional performance-intensive use 
cases very unhappy, especially almost all kinds of databases. (Note 
that nodatacow can't always solve the performance problem.)
Most performance-intensive usage is still based on traditional fs 
design (journal with no CoW)


2) Low concurrency caused by tree design.
 Unlike traditional one-tree-for-one-inode design, btrfs uses 
one-tree-for-one-subvolume.
 The design makes snapshot implementation very easy, while making 
the tree very hot when many writers try to modify metadata concurrently.


 Btrfs has a lot of different way to solve it.
 For extent tree (the most busy tree), we are using delayed-ref to 
speed up extent tree update.

 For fs tree fsync, we have log tree to speed things up.
 These approaches work, at the cost of complexity and bugs, and we 
still have slow fs tree modification speed.


3) Low code reusage of device-mapper.
 I totally understand that, due to the unique support for data 
csum, btrfs can't use device-mapper directly, as we must verify the data 
read out from device before passing it to higher level.
So Btrfs uses its own device-mapper-like implementation to handle 
multi-device management.


The result is mixed. For easy to handle case like RAID0/1/10 btrfs 
is doing well.

While for RAID5/6, everyone knows the result.

Such a btrfs *enhanced* re-implementation not only makes btrfs larger 
but also more complex and bug-prone.


In short, btrfs is too advanced for generic use cases (performance) and 
developers (bugs), unfortunately.


And even SUSE is just pushing btrfs as root fs, mainly for the snapshot 
feature.

But still ext4/xfs for data or performance intense use case.


[Other solution on the table]
On the other hand, I think RedHat is pushing storage technology based on 
LVM (thin) and Xfs.


For traditional LVM, it's stable but its snapshot design is old-fashioned 
and low-performance.
While new thin-provision LVM solves the problem using a method just like 
Btrfs, but at block level.


And for XFS, it's still traditional designed, journal based, 
one-tree-for-one-inode.

But with fancy new features like data CoW.

Even though XFS + LVM-thin lacks the ability to shrink the fs, scrub 
data, or do delta backups, it can do a lot of things just like Btrfs.

From snapshot to multi-device management.

And more importantly, has better performance for things like DB.

So, for old use cases, the performance stays almost the same.
For developers, guys are still focusing on their old fields, less to 
concern and more focused to debug. The old UNIX method still works here, 
do one thing and do it well.

Re: Power down tests...

2017-08-04 Thread Shyam Prasad N
Thanks guys. I've enabled that option now. Let's see how it goes.
One general question regarding the stability of btrfs in kernel
version 4.4. Is this okay for power off test cases? Or are there many
important fixes in newer kernels?

On Fri, Aug 4, 2017 at 5:24 PM, Dmitrii Tcvetkov  wrote:
> On Fri, 4 Aug 2017 13:19:39 +0530
> Shyam Prasad N  wrote:
>
>> Oh ok. I read this in the man page and assumed that it's on by
>> default: flushoncommit, noflushoncommit
>>(default: on)
>>
>>This option forces any data dirtied by a write in a prior
>> transaction to commit as part of the current commit. This makes the
>> committed state a fully consistent view of the file system from the
>>application’s perspective (i.e., it includes all completed
>> file system operations). This was previously the behavior only when a
>> snapshot was created.
>>
>>Disabling flushing may improve performance but is not
>> crash-safe.
>>
>>
>> Maybe this needs a correction?
>
> In 4.12 btrfs-progs man pages it's already updated.
>
> $ man 5 btrfs
> ...
>flushoncommit, noflushoncommit
>(default: off)
>
>This option forces any data dirtied by a write in a prior
>transaction to commit as part of the current commit,
>effectively a full filesystem sync.
>
>This makes the committed state a fully consistent view of
>the file system from the application’s perspective (i.e., it
>includes all completed file system operations). This was
>previously the behavior only when a snapshot was created.
>
>When off, the filesystem is consistent but buffered writes
>may last more than one transaction commit.
>
>



-- 
-Shyam


[PATCH] btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2

2017-08-04 Thread Nikolay Borisov
The buffer passed to the btrfs_ioctl_tree_search* functions has to be at least
sizeof(struct btrfs_ioctl_search_header). If this is not the case then the
ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum
required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW
error with ->buf_size being set to the value passed by user space. Fix this by
removing the size check and relying on search_ioctl, which already includes it
and correctly sets buf_size.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ioctl.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fa1b78cf25f6..e80950b3f340 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2201,9 +2201,6 @@ static noinline int btrfs_ioctl_tree_search_v2(struct file *file,
 
buf_size = args.buf_size;
 
-   if (buf_size < sizeof(struct btrfs_ioctl_search_header))
-   return -EOVERFLOW;
-
/* limit result size to 16MB */
if (buf_size > buf_limit)
buf_size = buf_limit;
-- 
2.7.4



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-04 Thread Austin S. Hemmelgarn

On 2017-08-03 16:45, Brendan Hide wrote:



On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote:

On 2017-08-03 14:29, Christoph Anton Mitterer wrote:

On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:
There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)...

[snip]

As far as 'higher-level' management tools go, you're using your system 
wrong if you _need_ them.  There is no need for a GUI, or a web 
interface, or a DBus interface, or any other such bloat in the main 
management tools; they work just fine as is and are mostly on par with 
the interfaces provided by LVM, MD, and ZFS (other than the lack of 
machine-parseable output).  I'd also argue that if you can't reassemble 
your storage stack by hand without 'higher-level' tools, you should not 
be using that storage stack, as you don't properly understand it.


On the subject of monitoring specifically, part of the issue there is 
kernel-side: any monitoring system currently needs to be polling-based, 
not event-based, so monitoring tends to be a very system-specific 
affair based on how much overhead you're willing to tolerate.  The 
limited tooling that does exist is also trivial to integrate with many 
pieces of existing monitoring infrastructure (like Nagios or monit).  
As a result, the people who care about it a lot (like me) are either 
monitoring by hand or are just using the tools with their existing 
infrastructure (for example, I already use monit on all my systems, so 
I just make sure its config has entries to check error counters and 
scrub results), so there's not much incentive for the concerned parties 
to reinvent the wheel.


To counter, I think this is a big problem with btrfs, especially in 
terms of user attrition. We don't need "GUI" tools. At all. But we do 
need btrfs to be self-sufficient enough that regular users don't get 
burnt by what they would view as unexpected behaviour.  We currently 
have a situation where btrfs is too demanding of inexperienced users.


I feel we need better worst-case behaviours. For example, if *I* have a 
btrfs on its second-to-last-available chunk, it means I'm not 
micro-managing properly. But users shouldn't have to micro-manage in the 
first place. Btrfs (or a management tool) should just know to balance 
the least-used chunk and/or delete the lowest-priority snapshot, etc. It 
shouldn't cause my services/apps to give diskspace errors when, clearly, 
there is free space available.
That's not just an issue with BTRFS; it's an issue with the distros too. 
 The only one that ships any kind of scheduled regular maintenance as 
far as I know is SUSE.  We don't need some daemon, or even special 
handling in the kernel; we just need to provide people with standard 
maintenance tools and proper advice for monitoring.  I've been meaning 
to write up some wrappers and a couple of cron files to handle this a 
bit better, but just haven't had time.  I may look at getting that done 
either today or early next week.
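A minimal sketch of the sort of cron wrapper meant here: flag any nonzero error counter from `btrfs device stats`.  The output format is assumed from btrfs-progs, and the sample is hardcoded so the parsing logic can be shown without a live btrfs filesystem; in a real job you would feed in `btrfs device stats <mountpoint>` and alert on a nonzero exit.

```shell
#!/bin/sh
# Hedged sketch of a cron-style health check.  The sample below stands
# in for real `btrfs device stats /mnt` output (format assumed from
# btrfs-progs).
sample='[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs     0
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  2
[/dev/sda1].generation_errs  0'

check_stats() {
    # Print counters that are nonzero; exit status 1 if any were found.
    printf '%s\n' "$1" | awk '$2 != 0 { print; bad = 1 } END { exit bad }'
}

if check_stats "$sample"; then
    echo "btrfs device stats: OK"
else
    echo "btrfs device stats: errors detected!" >&2
fi
```

The same filter drops straight into a monit `check program` or a `cron` + `mail` pipeline.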


The other "high-level" aspect would be along the lines of better 
guidance and standardisation for distros on how best to configure btrfs. 
This would include guidance/best practices for things like appropriate 
subvolume mountpoints and snapshot paths, sensible schedules or logic 
(or perhaps even example tools/scripts) for balancing and scrubbing the 
filesystem.

There are currently three standards for this:
1. The snapper way, used by at least SUSE and Ubuntu, which IMO ends up 
being way too complicated for not much benefit.
2. The traditional filesystem way, used by most other distros, which 
doesn't use subvolumes at all.
3. The user choice way, used by stuff like Arch and Gentoo, which pretty 
much says the rest of the OS couldn't care less how the filesystems and 
subvolumes are organized, as long as things work.


Overall, other than the first one, this is no different than with 
regular filesystems.


I don't have all the answers. But I also don't want to have to tell 
people they can't adopt it because a) they don't (or never will) 
understand it; and b) they're going to resent me for irresponsibly 
losing their own data.






Re: SQLite Re: csum errors on top of dm-crypt

2017-08-04 Thread Duncan
Roman Mamedov posted on Fri, 04 Aug 2017 12:44:44 +0500 as excerpted:

> On Fri, 4 Aug 2017 12:18:58 +0500 Roman Mamedov  wrote:
> 
>> What I find weird is why the expected csum is the same on all of these.
>> Any idea what this might point to as the cause?
>> 
>> What is 0x98f94189, is it not a csum of a block of zeroes by any
>> chance?
> 
> It does seem to be something of that sort, as it appears in
> https://www.spinics.net/lists/linux-btrfs/msg67281.html (though as
> factual csum, not the expected one).
> 
>> a few files turned out to be unreadable
> 
> Actually, turns out ALL of those are sqlite files(!)
> 
> .mozilla/firefox/.../places.sqlite <- 4 instances (for 4 users)
> .moonchild productions/pale moon/.../urlclassifier3.sqlite
> .config/chromium/Default/Application Cache/Cache/data_3 <- twice (for 2
> users)
> .config/chromium/Default/History .config/chromium/Default/Top Sites
> 
> nothing else affected.
> 
> Forgot to mention that the kernel version is 4.9.40.

Not very scientific but FWIW...

Kernel 4.9 or perhaps a couple kernel cycles earlier is about the time I 
had some similar issues with my firefox database files, too.  I lost 
extensions and their settings and had to bisect to the file level and 
restore the files from backup.

The problem in my case was very likely an ungraceful shutdown.  But 
I've had a couple more such shutdowns recently (I stay current and am 
now on 4.13-rc3), due to summer storm season and a power supply going 
out, and they have had much better results: clean scrubs after remount 
and nothing lost that I can see, even after the one that blinked back 
out right as I was rebooting.

So it may be something fixed in newer kernels or just happenstance, but I 
won't argue with more reliable recent kernel btrfs, even if it /is/ "just 
me". =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-04 Thread Duncan
Austin S. Hemmelgarn posted on Thu, 03 Aug 2017 15:03:53 -0400 as
excerpted:

>> Same thing with the trim feature that is marked OK. It clearly says
>> that it has performance implications. It is marked OK, so one would
>> expect it to not cause the filesystem to fail, but if the performance
>> becomes so slow that the filesystem gets practically unusable it is of
>> course not "OK". The relevant information is missing for people to make
>> a decent choice and I certainly don't know how serious these
>> performance implications are, if they are at all relevant...
> The performance implications bit shouldn't be listed; that's a given for
> any filesystem with discard (TRIM is the ATA and eMMC command, UNMAP is
> the SCSI one, and ERASE is the name on SD cards; discard is the generic
> kernel term) support.  The issue arises from devices that don't support
> queuing such commands, which is quite rare for SSDs these
> days.

Not so entirely rare.  The generally well regarded Samsung EVO/Pro 850 
ssd series don't support queued-trim, and indeed, due to a fiasco where 
new firmware lied about such support[1], the kernel now blacklists queued-
trim on all samsung ssds.

(I actually bought a pair of samsung evo 1TB ssds after seeing them well 
recommended both on this list and in various reviews.  Only AFTER I had 
them, while googling for samsung evo queued trim specifically to see 
whether I could now add discard to my btrfs mount options, did I find 
out about this fiasco and about samsung support not supporting linux 
"because anyone can write the code"; had I known, I'd certainly have 
reconsidered and very likely spent my money elsewhere.  I did check the 
current kernel's code and verified the blacklisting.  I also noted that 
the kernel whitelists samsung ssds for actually honoring flush 
directives, where it treats non-whitelisted ssds as not honoring them, 
apparently because too many claim to do so while skipping the flush to 
get better performance.  So it's a mixed bag: one whitelisting for 
actually flushing when claimed, one blacklisting for not reliably 
handling queued-trim despite some firmware claiming to support it.  But 
the worst, IMO, is samsung support blackballing linux because anyone 
can write the code. =:^  That's worth blackballing samsung for, in my 
book; I just wish I'd found out before the purchase instead of after.  
At least the linux devs have made sure samsung ssd users don't lose 
data on linux due to samsung's lies, despite samsung's horrible support 
policy blackballing linux, at least at the time.)

---
[1] The firmware claimed support for a new ata standard where queued 
trim is apparently mandatory, but the result was repeatedly corrupted 
data.  Samsung support repeatedly said they don't support Linux because 
anyone can write code to execute.  They weren't yet seeing the problem 
on MS simply because MS hadn't issued a release that supported the new 
standard, and had queued-trim disabled by default with the older 
standards due to such problems when it was enabled.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: csum errors on top of dm-crypt

2017-08-04 Thread Roman Mamedov
On Fri, 4 Aug 2017 12:44:44 +0500
Roman Mamedov  wrote:

> > What is 0x98f94189, is it not a csum of a block of zeroes by any chance?
> 
> It does seem to be something of that sort

Actually, I think I know what happened.

I used "dd bs=1M conv=sparse" to copy source FS onto a LUKS device, which
skipped copying 1M-sized areas of zeroes from the source device by seeking
over those areas on the destination device.

This only works OK if the destination device is entirely zeroed beforehand.

But I also use --allow-discards for the LUKS device. So it may be that
after a discard passes through to the underlying SSD (which will then
return zeroes for the discarded areas), LUKS does not take care to pass
those zeroes back "upwards" on read; instead it runs them through its
usual decryption, making them read back to userspace as random data.

So after an initial TRIM the destination crypto device was not actually zeroed,
far from it. :)

As a result, every large non-sparse file with at least a 1MB-long run of
zeroes in it (those sqlite ones appear to fit the bill) was not written
out entirely onto the destination device by dd, and the intended zero
areas were left full of crypto-randomness instead.
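The failure mode is easy to simulate in a few lines (a hedged sketch: block size and data are made up, and random bytes stand in for the post-TRIM crypto noise):

```python
import os

BLOCK = 1024 * 1024  # dd bs=1M

def copy_sparse(src: bytes, dst: bytearray) -> None:
    """Mimic `dd conv=sparse`: seek over all-zero source blocks
    instead of writing them, leaving dst untouched there."""
    for off in range(0, len(src), BLOCK):
        block = src[off:off + BLOCK]
        if block.count(0) != len(block):   # not all zeroes: write it
            dst[off:off + len(block)] = block
        # all-zero block: skipped (seek), dst keeps its old contents

# Source: data, a 1M hole of zeroes, more data (like a sqlite file).
src = b"A" * BLOCK + b"\x00" * BLOCK + b"B" * BLOCK

# Destination 1: freshly zeroed device; conv=sparse is safe here.
zeroed = bytearray(len(src))
copy_sparse(src, zeroed)
assert bytes(zeroed) == src

# Destination 2: dm-crypt device after TRIM; it reads back as noise,
# not zeroes, so the "hole" keeps random garbage and csums mismatch.
noisy = bytearray(os.urandom(len(src)))
copy_sparse(src, noisy)
print("identical after sparse copy onto noisy device:",
      bytes(noisy) == src)
```

Only the middle block differs afterwards, which matches files being readable except for their zero runs.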

Sorry for the noise, I hope at least this catch was somewhat entertaining.

And Btrfs saves the day once again. :)

-- 
With respect,
Roman


Re: Power down tests...

2017-08-04 Thread Shyam Prasad N
Oh ok. I read this in the man page and assumed that it's on by default:

    flushoncommit, noflushoncommit
    (default: on)

    This option forces any data dirtied by a write in a prior
    transaction to commit as part of the current commit. This makes the
    committed state a fully consistent view of the file system from the
    application’s perspective (i.e., it includes all completed file
    system operations). This was previously the behavior only when a
    snapshot was created.

    Disabling flushing may improve performance but is not crash-safe.


Maybe this needs a correction?

On Fri, Aug 4, 2017 at 12:52 PM, Adam Borowski  wrote:
> On Fri, Aug 04, 2017 at 12:15:12PM +0530, Shyam Prasad N wrote:
>> Is flushoncommit not a default option on version
>> 4.4? Do I need specifically set this option?
>
> It's not the default.
>
> --
> ⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
> ⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
> ⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
> ⠈⠳⣄ • use glitches to walk on water



-- 
-Shyam


SQLite Re: csum errors on top of dm-crypt

2017-08-04 Thread Roman Mamedov
On Fri, 4 Aug 2017 12:18:58 +0500
Roman Mamedov  wrote:

> What I find weird is why the expected csum is the same on all of these.
> Any idea what this might point to as the cause?
> 
> What is 0x98f94189, is it not a csum of a block of zeroes by any chance?

It does seem to be something of that sort, as it appears in
https://www.spinics.net/lists/linux-btrfs/msg67281.html 
(though as factual csum, not the expected one).

> a few files turned out to be unreadable

Actually, turns out ALL of those are sqlite files(!)

.mozilla/firefox/.../places.sqlite <- 4 instances (for 4 users)
.moonchild productions/pale moon/.../urlclassifier3.sqlite
.config/chromium/Default/Application Cache/Cache/data_3 <- twice (for 2 users)
.config/chromium/Default/History
.config/chromium/Default/Top Sites

nothing else affected.

Forgot to mention that the kernel version is 4.9.40.

-- 
With respect,
Roman


Re: Power down tests...

2017-08-04 Thread Adam Borowski
On Fri, Aug 04, 2017 at 12:15:12PM +0530, Shyam Prasad N wrote:
> Is flushoncommit not a default option on version
> 4.4? Do I need specifically set this option?

It's not the default.

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water


csum errors on top of dm-crypt

2017-08-04 Thread Roman Mamedov
Hello,

I've migrated my home dir to a luks dm-crypt device some time ago, and today
during a scheduled backup a few files turned out to be unreadable, with csum
errors from Btrfs in dmesg.

What I find weird is why the expected csum is the same on all of these.
Any idea what this might point to as the cause?

What is 0x98f94189, is it not a csum of a block of zeroes by any chance?

(I use a patch from Qu Wenruo to improve the error reporting).
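Whether 0x98f94189 could be the CRC-32C of an all-zero 4 KiB block is checkable with a bit-by-bit implementation. Btrfs data csums are CRC-32C, but the exact seed/inversion btrfs applies to the raw CRC is an assumption here, so the zero-block value below is something to compare against, not a known answer:

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli): reflected polynomial 0x82F63B78,
    init 0xFFFFFFFF, final xor 0xFFFFFFFF.  Standard check value:
    crc32c(b"123456789") == 0xE3069283."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 * (crc & 1))
    return crc ^ 0xFFFFFFFF

# Sanity check against the documented CRC-32/ISCSI check value:
print(hex(crc32c(b"123456789")))

# Candidate csum of a 4 KiB zeroed block, to compare with 0x98f94189
# after applying whatever seed/finalization btrfs actually uses:
print(hex(crc32c(b"\x00" * 4096)))
```

If the zero-block value matches (modulo the btrfs seed), the repeated expected csum would mean the metadata says those extents should be all zeroes.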

[483575.992252] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.994518] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 32 csum 0xe2a2e6eb expected csum 0x98f94189 mirror 1
[483575.995640] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 2785280 csum 0x7f97f4a6 expected csum 0x98f94189 mirror 1
[483575.996599] BTRFS warning (device dm-1): csum failed root 5 ino 481 off 1736704 csum 0x7476ddf8 expected csum 0x98f94189 mirror 1
[483585.020047] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023036] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1011712 csum 0xbadf2d3e expected csum 0x98f94189 mirror 1
[483585.023702] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 1900544 csum 0x26c571dc expected csum 0x98f94189 mirror 1
[483585.023761] BTRFS warning (device dm-1): csum failed root 5 ino 6172 off 2949120 csum 0x27726fbe expected csum 0x98f94189 mirror 1
[483599.026289] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17645568 csum 0xdd5bf4de expected csum 0x98f94189 mirror 1
[483599.027425] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.032396] BTRFS warning (device dm-1): csum failed root 5 ino 14988 off 17465344 csum 0x42bf4f44 expected csum 0x98f94189 mirror 1
[483599.092709] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093080] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1110016 csum 0xbca8fc65 expected csum 0x98f94189 mirror 1
[483599.093242] BTRFS warning (device dm-1): csum failed root 5 ino 15002 off 1736704 csum 0x1d4087fc expected csum 0x98f94189 mirror 1
[483627.708625] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709459] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2613248 csum 0xe1952338 expected csum 0x98f94189 mirror 1
[483627.709799] BTRFS warning (device dm-1): csum failed root 5 ino 29039 off 2965504 csum 0xfaff212d expected csum 0x98f94189 mirror 1
[483634.462684] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 5062656 csum 0x8c7df392 expected csum 0x98f94189 mirror 1
[483634.462703] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.466602] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 7159808 csum 0xfc06d954 expected csum 0x98f94189 mirror 1
[483634.466604] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 6111232 csum 0xc802b3b4 expected csum 0x98f94189 mirror 1
[483634.470118] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 4108288 csum 0x6005cecd expected csum 0x98f94189 mirror 1
[483634.470257] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 10305536 csum 0x3d8c1843 expected csum 0x98f94189 mirror 1
[483634.471085] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 9256960 csum 0xba3fede3 expected csum 0x98f94189 mirror 1
[483634.471128] BTRFS warning (device dm-1): csum failed root 5 ino 30602 off 8302592 csum 0x7de15198 expected csum 0x98f94189 mirror 1
[484152.178497] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.180422] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1736704 csum 0xf01ac658 expected csum 0x98f94189 mirror 1
[484152.181598] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 1163264 csum 0x341f3c2a expected csum 0x98f94189 mirror 1
[484152.182242] BTRFS warning (device dm-1): csum failed root 5 ino 142886 off 2785280 csum 0xc78988ec expected csum 0x98f94189 mirror 1
[484158.569489] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 2138112 csum 0xab34e90e expected csum 0x98f94189 mirror 1
[484158.571885] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 2785280 csum 0xd611911e expected csum 0x98f94189 mirror 1
[484158.575191] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 3833856 csum 0x6277c8a6 expected csum 0x98f94189 mirror 1
[484158.575620] BTRFS warning (device dm-1): csum failed root 5 ino 143605 off 4882432 csum 0x3293c3e7 expected csum 0x98f94189 mirror 1
[484158.578637] BTRFS warning 

Re: Power down tests...

2017-08-04 Thread Adam Borowski
On Fri, Aug 04, 2017 at 11:21:15AM +0530, Shyam Prasad N wrote:
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4).
> And we're running some abrupt power-off tests for a couple of scenarios:
> 
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks). i.e. Our filesystem lays out data and
> metadata on top of these two filesystems. With the test workload, it
> is going to generate a good amount of 16MB files on top of the system.
> On abrupt power-off and the following reboot, what are the recommended
> steps to run? We're attempting a btrfs mount, which sometimes fails.
> If it fails, we run fsck and then mount the btrfs. The issue we're
> facing is that a few files end up zero-sized. As a result, there is
> either data loss or inconsistency in the stacked filesystem's
> metadata.

Sounds like you want to mount with -o flushoncommit.

> We're mounting the btrfs with a commit period of 5s. However, I do
> expect btrfs to journal the I/Os that are still dirty. Why, then, are
> we seeing the above behaviour?

By default, btrfs guarantees only metadata consistency, like most
filesystems. This improves performance at the cost of failing use cases
like yours.
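For completeness, a hypothetical invocation (the device and mountpoint names are placeholders; flushoncommit trades some throughput for crash-consistent data):

```shell
# /dev/sdb1 and /mnt/data are placeholder names.
# commit=5 matches the 5-second commit period mentioned in the thread.
mount -o flushoncommit,commit=5 /dev/sdb1 /mnt/data

# Confirm the options took effect:
grep ' /mnt/data ' /proc/mounts
```
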

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water