Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper
On 10 Nov 2020, at 13:39, Christoph Hellwig wrote:

> On Mon, Nov 09, 2020 at 02:01:41PM -0500, Chris Mason wrote:
>> You do consistently ask for a shim layer, but you haven’t explained what we gain by diverging from the documented and tested API of the upstream zstd project. It’s an important discussion given that we hope to regularly update the kernel side as they make improvements in zstd.
>
> An API that looks like every other kernel API, and doesn't cause endless amount of churn because someone decided they need a new API flavor of the day. Btw, I'm not asking for a shim layer - that was the compromise we ended up with. If zstd folks can't maintain a sane code base maybe we should just drop this childish churning code base from the tree.

I think APIs change based on the needs of the project. We do this all the time in the kernel, and we don’t think twice about updating users of the API as needed. The zstd changes look awkward and large today because it’s a long time period, but we’ve all been pretty vocal in the past about the importance of being able to advance APIs.

-chris
Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper
On 6 Nov 2020, at 13:38, Christoph Hellwig wrote:

> You just keep resending this crap, don't you? Haven't you been told multiple times to provide a proper kernel API by now?

You do consistently ask for a shim layer, but you haven’t explained what we gain by diverging from the documented and tested API of the upstream zstd project. It’s an important discussion given that we hope to regularly update the kernel side as they make improvements in zstd.

The only benefit described so far seems to be camelcase related, but if there are problems in the API beyond that, I haven’t seen you describe them. I don’t think the camelcase alone justifies the added costs of the shim.

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 12:20, Vincent Guittot wrote:

> On Monday 26 Oct 2020 at 12:04:45 (-0400), Rik van Riel wrote:
>> On Mon, 26 Oct 2020 16:42:14 +0100 Vincent Guittot wrote:
>>> On Mon, 26 Oct 2020 at 16:04, Rik van Riel wrote:
>>>> Could utilization estimates be off, either lagging or simply having a wrong estimate for a task, resulting in no task getting pulled sometimes, while doing a migrate_task imbalance always moves over something?
>>>
>>> task and cpu utilization are not always fully synced and may lag a bit, which explains why LB can sometimes fail to migrate for a small diff
>>
>> OK, running with this little snippet below, I see latencies improve back to near where they used to be:
>>
>> Latency percentiles (usec) runtime 150 (s)
>> 	50.0th: 13
>> 	75.0th: 31
>> 	90.0th: 69
>> 	95.0th: 90
>> 	*99.0th: 761
>> 	99.5th: 2268
>> 	99.9th: 9104
>> 	min=1, max=16158
>>
>> I suspect the right/cleaner approach might be to use migrate_task more in !CPU_NOT_IDLE cases?
>>
>> Running a task to an idle CPU immediately, instead of refusing to have the load balancer move it, improves latencies for fairly obvious reasons.
>>
>> I am not entirely clear on why the load balancer should need to be any more conservative about moving tasks than the wakeup path is in eg. select_idle_sibling.
>
> what you are suggesting is something like:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4978964e75e5..3b6fbf33abc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9156,7 +9156,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	 * emptying busiest.
>  	 */
>  	if (local->group_type == group_has_spare) {
> -		if (busiest->group_type > group_fully_busy) {
> +		if ((busiest->group_type > group_fully_busy) &&
> +		    !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
>  			/*
>  			 * If busiest is overloaded, try to fill spare
>  			 * capacity. This might end up creating spare capacity
>
> which also fixes the problem for me and aligns LB with wakeup path regarding the migration in the LLC

Vincent’s patch on top of 5.10-rc1 looks pretty great:

Latency percentiles (usec) runtime 90 (s) (3320 total samples)
	50.0th: 161 (1687 samples)
	75.0th: 200 (817 samples)
	90.0th: 228 (488 samples)
	95.0th: 254 (164 samples)
	*99.0th: 314 (131 samples)
	99.5th: 330 (17 samples)
	99.9th: 356 (13 samples)
	min=29, max=358

Next we test in prod, which probably won’t have answers until tomorrow. Thanks again Vincent!

-chris
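For readers following the diff above, the condition it changes can be modelled in isolation. This is only a simplified sketch of the branch taken when the local group has spare capacity: the enum ordering mirrors the scheduler's group classification, but `pick_migration()`, `SKETCH_SD_SHARE_PKG_RESOURCES` and its value are hypothetical names made up for illustration, and the real calculate_imbalance() also computes an imbalance amount and can fall through when that amount is zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Ordered from least to most busy, so ">" means "busier than". */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded,
};

enum migration_type { migrate_util, migrate_task };

/* Illustrative bit only; not the kernel's actual flag value. */
#define SKETCH_SD_SHARE_PKG_RESOURCES 0x1u

/*
 * Model of the patched check, assuming the local group has spare
 * capacity: inside a domain that shares the LLC, never try to "fill
 * spare capacity" by utilization, always move whole tasks instead.
 */
static enum migration_type pick_migration(enum group_type busiest,
					  unsigned int sd_flags)
{
	bool shares_llc = (sd_flags & SKETCH_SD_SHARE_PKG_RESOURCES) != 0;

	assert(busiest >= group_has_spare && busiest <= group_overloaded);

	if (busiest > group_fully_busy && !shares_llc)
		return migrate_util;
	return migrate_task;
}
```

With this model, an overloaded busiest group inside an LLC domain yields migrate_task, which is the behaviour that matches the wakeup path; the same group in a domain without the flag still yields migrate_util, as before the patch.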
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 11:05, Chris Mason wrote:

> On 26 Oct 2020, at 10:24, Vincent Guittot wrote:
>> Could you try the fix below ?
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  	 * emptying busiest.
>>  	 */
>>  	if (local->group_type == group_has_spare) {
>> -		if (busiest->group_type > group_fully_busy) {
>> +		if ((busiest->group_type > group_fully_busy) &&
>> +		    (busiest->group_weight > 1)) {
>>  			/*
>>  			 * If busiest is overloaded, try to fill spare
>>  			 * capacity. This might end up creating spare capacity
>>
>> When we calculate an imbalance at the smallest level, ie between CPUs (group_weight == 1), we should try to spread tasks on cpus instead of trying to fill spare capacity.
>
> With this patch on top of v5.9, my latencies are unchanged. I’m building against current Linus now just in case I’m missing other fixes.

I reran things to make sure nothing changed on my test box this weekend:

5.4.0-rc1-9-gfcf0553db6f4 (last good kernel)

Latency percentiles (usec) runtime 30 (s) (1000 total samples)
	50.0th: 180 (502 samples)
	75.0th: 227 (251 samples)
	90.0th: 268 (147 samples)
	95.0th: 300 (50 samples)
	*99.0th: 338 (41 samples)
	99.5th: 344 (4 samples)
	99.9th: 1186 (5 samples)
	min=25, max=1185

5.4.0-rc1-00010-g0b0695f2b34a (first bad kernel)

Latency percentiles (usec) runtime 150 (s) (960 total samples)
	50.0th: 166 (488 samples)
	75.0th: 210 (232 samples)
	90.0th: 254 (145 samples)
	95.0th: 299 (47 samples)
	*99.0th: 12688 (39 samples)
	99.5th: 13008 (5 samples)
	99.9th: 13104 (4 samples)
	min=24, max=13100

3650b228f83adda7e5ee532e2b90429c03f7b9ec (v5.10-rc1) + your patch

Latency percentiles (usec) runtime 30 (s) (1000 total samples)
	50.0th: 169 (505 samples)
	75.0th: 210 (246 samples)
	90.0th: 267 (151 samples)
	95.0th: 305 (48 samples)
	*99.0th: 12656 (40 samples)
	99.5th: 12944 (5 samples)
	99.9th: 13168 (5 samples)
	min=44, max=13155

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 10:24, Vincent Guittot wrote:

> On Monday 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
>> On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
>>> Hi Chris
>>>
>>> On Sat, 24 Oct 2020 at 01:49, Chris Mason wrote:
>>>> Hi everyone,
>>>>
>>>> We’re validating a new kernel in the fleet, and compared with v5.2,
>>>
>>> Which version are you using ? several improvements have been added since v5.5 and the rework of load_balance
>>
>> We’re validating v5.6, but all of the numbers referenced in this patch are against v5.9. I usually try to back port my way to victory on this kind of thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt this benchmark.
>
> ok. Thanks for the confirmation
>
> I have been able to reproduce the problem on my setup.

Thanks for taking a look! Can I ask what parameters you used on schbench, and what kind of results you saw? Mostly I’m trying to make sure it’s a useful tool, but also the patch didn’t change things here.

> Could you try the fix below ?
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	 * emptying busiest.
>  	 */
>  	if (local->group_type == group_has_spare) {
> -		if (busiest->group_type > group_fully_busy) {
> +		if ((busiest->group_type > group_fully_busy) &&
> +		    (busiest->group_weight > 1)) {
>  			/*
>  			 * If busiest is overloaded, try to fill spare
>  			 * capacity. This might end up creating spare capacity
>
> When we calculate an imbalance at the smallest level, ie between CPUs (group_weight == 1), we should try to spread tasks on cpus instead of trying to fill spare capacity.

With this patch on top of v5.9, my latencies are unchanged. I’m building against current Linus now just in case I’m missing other fixes.

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 4:39, Vincent Guittot wrote:

> Hi Chris
>
> On Sat, 24 Oct 2020 at 01:49, Chris Mason wrote:
>> Hi everyone,
>>
>> We’re validating a new kernel in the fleet, and compared with v5.2,
>
> Which version are you using ? several improvements have been added since v5.5 and the rework of load_balance

We’re validating v5.6, but all of the numbers referenced in this patch are against v5.9. I usually try to back port my way to victory on this kind of thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt this benchmark.

>> performance is ~2-3% lower for some of our workloads. After some digging, Johannes found that our involuntary context switch rate was ~2x higher, and we were leaving a CPU idle a higher percentage of the time, even though the workload was trying to saturate the system.
>>
>> We were able to reproduce the problem with schbench, and Johannes bisected down to:
>>
>> commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
>> Author: Vincent Guittot
>> Date: Fri Oct 18 15:26:31 2019 +0200
>>
>>     sched/fair: Rework load_balance()
>>
>> Our working theory is the load balancing changes are leaving processes behind busy CPUs instead of moving them onto idle ones. I made a few schbench modifications to make this easier to demonstrate:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/
>>
>> My VM has 40 cpus (20 cores, 2 threads per core), and my schbench command line is:
>
> What is the topology ? are they all part of the same LLC ?

We’ve seen the regression on both single socket and dual socket bare metal intel systems. On the VM I reproduced with, I saw similar latencies with and without siblings configured into the topology.

-chris
[PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
Hi everyone,

We’re validating a new kernel in the fleet, and compared with v5.2, performance is ~2-3% lower for some of our workloads. After some digging, Johannes found that our involuntary context switch rate was ~2x higher, and we were leaving a CPU idle a higher percentage of the time, even though the workload was trying to saturate the system.

We were able to reproduce the problem with schbench, and Johannes bisected down to:

commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
Author: Vincent Guittot
Date: Fri Oct 18 15:26:31 2019 +0200

    sched/fair: Rework load_balance()

Our working theory is the load balancing changes are leaving processes behind busy CPUs instead of moving them onto idle ones. I made a few schbench modifications to make this easier to demonstrate:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/

My VM has 40 cpus (20 cores, 2 threads per core), and my schbench command line is:

schbench -t 20 -r 0 -c 100 -s 1000 -i 30 -z 120

This has two message threads, and 20 workers per message thread. Once woken up, the workers think for a full second, which means you’ll have some long latencies if you’re stuck behind one of these workers in the runqueue. The message thread does a little bit of work and then sleeps, so we end up with 40 threads hammering full blast on the CPU and 2 threads popping in and out of idle.

schbench times the delay from when a message thread wakes a worker to when the worker runs.

On a good kernel, the output looks like this:

Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
	50.0th: 155 (1653 samples)
	75.0th: 189 (808 samples)
	90.0th: 216 (501 samples)
	95.0th: 227 (163 samples)
	*99.0th: 256 (123 samples)
	99.5th: 1510 (16 samples)
	99.9th: 3132 (13 samples)
	min=21, max=3286

With 0b0695f2b34a, we get this:

Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
	50.0th: 147 (2261 samples)
	75.0th: 182 (1116 samples)
	90.0th: 205 (671 samples)
	95.0th: 224 (215 samples)
	*99.0th: 12240 (173 samples) <-- much higher p99 and up
	99.5th: 12752 (22 samples)
	99.9th: 13104 (18 samples)
	min=21, max=13172

Since the idea is to fully load the machine with schbench, use schbench -t , and make sure the box doesn’t have other stuff running in the background. I used a VM because it ended up giving more consistent results on our kernel test machines, which have some periodic noise running in the background.

We’ve tried a few different approaches, but don’t quite have a solid fix yet. I thought I’d kick off the discussion with my most useful hunks so far:

diff a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

-chris
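As an aside on reading the tables above: the median barely moves between the good and bad kernels, while the 99th percentile jumps by roughly two orders of magnitude, which is what a small number of wakeups stuck behind a one-second worker would look like. The sketch below shows one simple way to get percentiles from raw latency samples (nearest-rank on a sorted array). It is not schbench's own implementation; `latency_percentile()` and `cmp_usec()` are hypothetical helper names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() callback: ascending order of latencies in microseconds. */
static int cmp_usec(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile. pct_tenths is the percentile in tenths of a
 * percent (990 means 99.0th) so no floating point is needed. The sample
 * array is sorted in place.
 */
static unsigned int latency_percentile(unsigned int *usec, size_t n,
				       unsigned int pct_tenths)
{
	size_t rank;

	assert(n > 0 && pct_tenths > 0 && pct_tenths <= 1000);

	qsort(usec, n, sizeof(*usec), cmp_usec);
	rank = (n * pct_tenths + 999) / 1000;	/* ceil(n * pct / 100) */
	return usec[rank - 1];
}
```

Feeding it ten samples where nine are between 150 and 230 usec and one is 12000 usec leaves the 50th and 90th percentiles in the normal range while the 99th lands on the outlier, the same shape as the 0b0695f2b34a table.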
Re: [PATCH v4 0/9] Update to zstd-1.4.6
On 2 Oct 2020, at 2:54, Christoph Hellwig wrote:

> On Wed, Sep 30, 2020 at 08:05:45PM +, Nick Terrell wrote:
>> On Sep 29, 2020, at 11:53 PM, Christoph Hellwig wrote:
>>> As you keep resend this I keep retelling you that should not do it. Please provide a proper Linux API, and switch to that. Versioned APIs have absolutely no business in the Linux kernel.
>>
>> The API is not versioned. We provide a stable ABI for a large section of our API, and the parts that aren’t ABI stable don’t change in semantics, and undergo long deprecation periods before being removed.
>>
>> The change of callers is a one-time change to transition from the existing API in the kernel, which was never upstream's API, to upstream's API.
>
> Again, please transition it to a sane kernel API. We don't have an "upstream" in this case.

The upstream is the zstd project where all this code originates, and where the active development takes place. As Eric Biggers pointed out, it also receives a lot of Q/A separate from the kernel.

I think we gain a great deal by leveraging the testing and documentation of the zstd project in the kernel interfaces we use. We lose some consistency with the kernel coding style, but we gain the ability to search for docs, issues, and fixes directly against the zstd project and git repo.

-chris
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 17 Sep 2020, at 6:04, Christoph Hellwig wrote:

> On Wed, Sep 16, 2020 at 09:35:51PM -0400, Rik van Riel wrote:
>>> One possibility is to have a kernel wrapper on top of the zstd API to make it more ergonomic. I personally don’t really see the value in it, since it adds another layer of indirection between zstd and the caller, but it could be done.
>>
>> Zstd would not be the first part of the kernel to come from somewhere else, and have wrappers when it gets integrated into the kernel. There certainly is precedence there.
>>
>> It would be interesting to know what Christoph's preference is.
>
> Yes, I think kernel wrappers would be a pretty sensible step forward. That also avoids the need to do strange upgrades to a new version, and instead we can just change APIs on a as-needed basis.

When we add wrappers, we end up creating a kernel specific API that doesn’t match the upstream zstd docs, and it doesn’t leverage as much of the zstd fuzzing and testing. So we’re actually making kernel zstd slightly less usable in hopes that our kernel specific part of the API is familiar enough to us that it makes zstd more usable.

There’s no way to compare the two until the wrappers are done, but given the code today I’d prefer that we focus on making it really easy to track upstream. I really understand Christoph’s side here, but I’d rather ride a camel with the group than go it alone.

I’d also much rather spend time on any problems where the structure of the zstd APIs doesn’t fit the kernel’s needs. The btrfs streaming compression/decompression looks pretty clean to me, but I think Johannes mentioned some possibilities to improve things for zswap (optimizations for page-at-a-time). If there are places where the zstd memory management or error handling don’t fit naturally into the kernel, that would also be higher on my list.

Fixing those is probably going to be much easier if we’re close to the zstd upstream, again so that we can leverage testing and long term code maintenance done there.

-chris
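To make the "wrapper" being debated concrete, here is a minimal sketch of what a thin kernel-style shim over an upstream camelCase function could look like. It is an illustration under stated assumptions, not code from the patch series: `ZSTD_compressBound` is a real upstream function name, but the body below is a stand-in stub with placeholder arithmetic so the sketch compiles without libzstd, and `zstd_compress_bound` is a hypothetical snake_case name chosen for the example.

```c
#include <stddef.h>

/*
 * Stand-in for the upstream function of the same name. The arithmetic
 * is a placeholder for illustration only, NOT zstd's real bound.
 */
static size_t ZSTD_compressBound(size_t srcSize)
{
	return srcSize + (srcSize >> 8) + 64;
}

/*
 * Hypothetical kernel-style wrapper: it renames the entry point to
 * match kernel naming and adds no behaviour of its own.
 */
static size_t zstd_compress_bound(size_t src_size)
{
	return ZSTD_compressBound(src_size);
}
```

The trade-off argued in the thread is visible even at this size: callers get a familiar name, but anyone searching the upstream zstd documentation or issue tracker has to map `zstd_compress_bound` back to `ZSTD_compressBound` first.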
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 16 Sep 2020, at 4:49, Christoph Hellwig wrote:

> On Tue, Sep 15, 2020 at 08:42:59PM -0700, Nick Terrell wrote:
>> From: Nick Terrell
>>
>> Move away from the compatibility wrapper to the zstd-1.4.6 API. This code is functionally equivalent.
>
> Again, please use sensible names. And no one gives a fuck if this bad API is "zstd-1.4.6" as the Linux kernel uses its own APIs, not some random mess from a badly written userspace package.

Hi Christoph,

It’s not completely clear what you’re asking for here. If the API matches what’s in zstd-1.4.6, that seems like a reasonable way to label it. That’s what the upstream is for this code.

I’m also not sure why we’re taking extra time to shit on the zstd userspace package. Can we please be constructive or at least actionable?

-chris
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 16 Sep 2020, at 10:46, Christoph Hellwig wrote:

> On Wed, Sep 16, 2020 at 10:43:04AM -0400, Chris Mason wrote:
>> Otherwise we just end up with drift and kernel-specific bugs that are harder to debug. To the extent those APIs make us contort the kernel code, I’m sure Nick is interested in improving things in both places.
>
> Seriously, we do not care elsewhere. Why would zlib be any different?

Is the zlib upstream active? Or trying to sync active development with the kernel? I’d suggest the same path for them if they were.

>> There are probably 1000 constructive ways to have that conversation. Please choose one of those instead of being an asshole.
>
> I think you are the asshole here by ignoring the practices we are using elsewhere and think your employer's pet project is somehow special. It is not, and claiming so is everything but constructive.

I’m happy to advocate for more constructive discussion for anyone’s project. I tend to pick threads where I have context and I know the people involved.

The kernel best practices are pragmatic. As one of many users of any established-non-kernel project, there’s a compromise between the APIs they are using for a broad base of users and us. I’m sure they are interested in improving life for all of their users, while also improving maintainability for us.

-chris
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 16 Sep 2020, at 10:30, Christoph Hellwig wrote:

> On Wed, Sep 16, 2020 at 10:20:52AM -0400, Chris Mason wrote:
>> It’s not completely clear what you’re asking for here. If the API matches what’s in zstd-1.4.6, that seems like a reasonable way to label it. That’s what the upstream is for this code.
>>
>> I’m also not sure why we’re taking extra time to shit on the zstd userspace package. Can we please be constructive or at least actionable?
>
> Because it really doesn't matter that these crappy APIs he is introducing match anything, especially not something done as horribly as the zstd API. We'll need to do this properly, and claiming compliance to some version of this lousy API is completely irrelevant for the kernel.

If the underlying goal is to closely follow the upstream of another project, we’re much better off using those APIs as provided. Otherwise we just end up with drift and kernel-specific bugs that are harder to debug.

To the extent those APIs make us contort the kernel code, I’m sure Nick is interested in improving things in both places. There are probably 1000 constructive ways to have that conversation. Please choose one of those instead of being an asshole.

-chris
Re: [PATCH] mm : fix pte _PAGE_DIRTY bit when fallback migrate page
On 16 Jul 2020, at 6:15, Robbie Ko wrote:

Kirill A. Shutemov wrote on 2020/7/15 at 4:11 PM:
On Wed, Jul 15, 2020 at 10:45:39AM +0800, Robbie Ko wrote:
Kirill A. Shutemov wrote on 2020/7/14 at 6:19 PM:
On Tue, Jul 14, 2020 at 11:46:12AM +0200, Vlastimil Babka wrote:
On 7/13/20 3:57 AM, Robbie Ko wrote:
Vlastimil Babka wrote on 2020/7/10 at 11:31 PM:
On 7/9/20 4:48 AM, robbieko wrote:
From: Robbie Ko

When a migrate page occurs, we first create a migration entry to replace the original pte, and then go to fallback_migrate_page to execute a writeout if the migratepage is not supported. In the writeout, we will clear the dirty bit of the page and use page_mkclean to clear the dirty bit along with the corresponding pte, but page_mkclean does not support migration entry.

I don't follow the scenario. When we establish migration entries with try_to_unmap(), it transfers dirty bit from PTE to the page.

Sorry, I meant _PAGE_RW with pte_write.

When we establish migration entries with try_to_unmap(), we create a migration entry, and if pte_write we set it to SWP_MIGRATION_WRITE, which will replace the migration entry with the original pte.

When migratepage, we go to fallback_migrate_page to execute a writeout if the migratepage is not supported. In the writeout, we call clear_page_dirty_for_io to clear the dirty bit of the page and use page_mkclean to clear pte _PAGE_RW with pte_wrprotect in page_mkclean_one.

However, page_mkclean_one does not support migration entries, so the migration entry is still SWP_MIGRATION_WRITE. In writeout, we then call remove_migration_ptes to remove the migration entry, and because it is still SWP_MIGRATION_WRITE, _PAGE_RW is set on the pte via pte_mkwrite.

Therefore, a subsequent mmap write will not trigger page_mkwrite, which causes data loss.

Hm, okay. Folks, is there any good reason why try_to_unmap(TTU_MIGRATION) should not clear PTE (make the PTE none) for file page?

This, I'm not sure. But I think that for the fs that support migratepage, when migratepage is finished, the page should still be dirty, and the pte should still have _PAGE_RW, when the next mmap write occurs, we don't need to trigger the page_mkwrite again.

I don’t know the page migration code well, but you’ll need this one as well on the 4.4 kernel you mentioned:

commit 25f3c5021985e885292980d04a1423fd83c967bb
Author: Chris Mason
Date: Tue Jan 21 11:51:42 2020 -0500

    Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker

And this one as well:

commit 7703bdd8d23e6ef057af3253958a793ec6066b28
Author: Chris Mason
Date: Wed Jun 20 07:56:11 2018 -0700

    Btrfs: don't clean dirty pages during buffered writes

With those two in place, we haven’t found lost data from the migration code, but we did see the fallback migration helper dirtying pages without going through page_mkwrite, which triggers the suboptimal btrfs fixup worker code path.

This isn’t a yea or nay on the patch, just additional info.

-chris
Re: [Ksummit-discuss] [PATCH] CodingStyle: Inclusive Terminology
On 6 Jul 2020, at 10:06, Laurent Pinchart wrote:

> Hi Chris,
>
> On Mon, Jul 06, 2020 at 12:45:34PM +, Chris Mason via Ksummit-discuss wrote:
>> On 5 Jul 2020, at 0:55, Willy Tarreau wrote:
>>> Maybe instead of providing an explicit list of a few words it should simply say that terms that take their roots in the non-technical world and whose meaning can only be understood based on history or local culture ought to be avoided, because *that* actually is the real root cause of the problem you're trying to address.
>>
>> I’d definitely agree that it’s a good goal to keep out non-technical terms. Even though we already try, every subsystem has its own set of patterns that reflect the most frequent contributors.
>
> That's an interesting point, because to me, it's the exact opposite. One of the intellectual rewards I find in working with the kernel is that our community is international and multicultural, allowing me to learn about other cultures. Aiming for the lowest common denominator seems to me to be closer to erasing cultural differences than including them.

I hadn’t thought of it from this angle, but I do agree with you. I think the cultural side comes through more in discussions and in-person conferences than it does from the code itself.

I do try to avoid local idioms or culture references unless I’m explaining them as part of a discussion or a personal story, mostly because I’ve gotten feedback from coworkers who had a hard time following my bad (ok, terrible) jokes or sarcasm.

One internal example is commands that take --clowntown as an argument. It’s pretty therapeutic to type when you’re grumpy about tooling, but a lot of people probably have to look it up before it makes sense.

-chris
Re: [PATCH] CodingStyle: Inclusive Terminology
On 5 Jul 2020, at 0:55, Willy Tarreau wrote:

> On Sat, Jul 04, 2020 at 01:02:51PM -0700, Dan Williams wrote:
>> +Non-inclusive terminology has that same distracting effect which is why
>> +it is a style issue for Linux, it injures developer efficiency.
>
> I'm personally thinking that for a non-native speaker it's already difficult to find the best term to describe something, but having to apply an extra level of filtering on the found words to figure whether they are allowed by the language police is even more difficult.

Since our discussions are public, we’ve always had to deal with comments from people outside the community on a range of topics. But inside the kernel, it’s just a group of developers trying to help each other produce the best quality of code. We’ve got a long history together and in general I think we’re pretty good at assuming good intent.

> *This* injures developers efficiency. What could improve developers efficiency is to take care of removing *all* idiomatic or cultural words then. For example I've been participating to projects using the term "blueprint", I didn't understand what that meant. It was once explained to me and given that it had no logical reason for being called this way, I now forgot. If we follow your reasoning, Such words should be banned for exactly the same reasons. Same for colors that probably don't mean anything to those born blind.
>
> For example if in my local culture we eat tomatoes at starters and apples for dessert, it could be convenient for me to use "tomato" and "apple" as list elements to name the pointers leading to the beginning and the end of the list, and it might sound obvious to many people, but not at all for many others.
>
> Maybe instead of providing an explicit list of a few words it should simply say that terms that take their roots in the non-technical world and whose meaning can only be understood based on history or local culture ought to be avoided, because *that* actually is the real root cause of the problem you're trying to address.

I’d definitely agree that it’s a good goal to keep out non-technical terms. Even though we already try, every subsystem has its own set of patterns that reflect the most frequent contributors.

-chris
Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
On 17 Jun 2020, at 13:20, Filipe Manana wrote:

> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov wrote:
>> ---
>>  fs/btrfs/extent_io.c | 45
>>  1 file changed, 29 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index c59e07360083..f6758ebbb6a2 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>>  	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>>  	num_pages = num_extent_pages(eb);
>>  	atomic_set(&eb->io_pages, num_pages);
>> +	/*
>> +	 * It is possible for releasepage to clear the TREE_REF bit before we
>> +	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
>> +	 */
>> +	check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the changelog, as this is in the write path. Why do we need this one?

This was Josef’s idea, but I really like the symmetry. You set io_pages, you do the tree_ref dance. Everyone fiddling with the writeback bit right now correctly clears writeback after doing the atomic_dec on io_pages, but the race is tiny and prone to getting exposed again by shifting code around. Tree ref checks around io_pages are the most reliable way to prevent this bug from coming back again later.

-chris
Re: [PATCH 10/12] btrfs: flag files as supporting buffered async reads
On 26 May 2020, at 15:51, Jens Axboe wrote:

> btrfs uses generic_file_read_iter(), which already supports this.
>
> Signed-off-by: Jens Axboe

Really looking forward to this!

Acked-by: Chris Mason
Re: linux-next: cleanup the btrfs trees
On 19 Oct 2019, at 23:47, Stephen Rothwell wrote:

> Hi all,
>
> The btrfs tree (git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git#next) has not been updated in more than a year, so I have removed it and then renamed the btrfs-kdave tree to btrfs. I hope this is OK and if any other changes are needed, please let me know.

Thanks Stephen

-chris
Re: linux-next: Signed-off-by missing for commits in the net-next tree
On 16 Aug 2019, at 5:15, Andy Grover wrote:

> On 8/16/19 3:06 PM, Gerd Rausch wrote:
>> Hi,
>>
>> Just added the e-mail addresses I found using a simple "google search", in order to reach out to the original authors of these commits: Chris Mason and Andy Grover.
>>
>> I'm hoping they still remember their work from 7-8 years ago.
>
> Yes looks like what I was working on. What did you need from me? It's too late to amend the commitlogs...

Same question ;) The missing signed-off-by is a mistake, but from the point of view of the DCO, these patches are totally fine by me.

-chris
Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
I'm being pretty liberal with chopping down quoted material to help emphasize a particular opinion about how to bootstrap existing out-of-tree projects into the kernel. My goal here is to talk more about the process and less about the technical details, so please forgive me if I've ignored or changed the technical meaning of anything below.

On 30 May 2019, at 12:15, Kris Van Hees wrote:

> On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
>
> ... I believe that the discussion that has been going on in other emails has shown that while introducing a program type that provides a generic (abstracted) context is a different approach from what has been done so far, it is a new use case that provides for additional ways in which BPF can be used.
>
> [ ... ]
>
> Yes and no. It depends on what you are trying to do with the BPF program that is attached to the different events. From a tracing perspective, providing a single BPF program with an abstract context would ...

[ ... ]

> In this model kprobe/ksys_write and tracepoint/syscalls/sys_enter_write are equivalent for most tracing purposes ...

[ ... ]

> I agree with what you are saying but I am presenting an additional use case

[ ... ]

>> All that aside the kernel support for shared libraries is an awesome feature to have and a bunch of folks want to see it happen, but it's not a blocker for 'dtrace to bpf' user space work. libbpf can be taught to do this 'pseudo shared library' feature while 'dtrace to bpf' side doesn't need to do anything special.

[ ... ]

This thread intermixes some abstract conceptual changes with smaller technical improvements, and in general it follows a familiar pattern other out-of-tree projects have hit while trying to adapt the kernel to their existing code. Just from this one email, I quoted the abstract models with use cases etc, and this is often where the discussions side track into less productive areas.

> So you are basically saying that I should redesign DTrace?

In your place, I would have removed features and adapted dtrace as much as possible to require the absolute minimum of kernel patches, or even better, no patches at all. I'd document all of the features that worked as expected, and underline anything either missing or suboptimal that needed additional kernel changes. Then I'd focus on expanding the community of people using dtrace against the mainline kernel, and work through the series of features and improvements one by one upstream over time.

Your current approach relies on an all-or-nothing landing of patches upstream, and this consistently leads to conflict every time a project tries it. A more incremental approach will require bigger changes on the dtrace application side, but over time it'll be much easier to justify your kernel changes. You won't have to talk in abstract models, and you'll have many more concrete examples of people asking for dtrace features against mainline.

Most importantly, you'll make dtrace available on more kernels than just the absolute latest mainline, and removing dependencies makes the project much easier for new users to try.

-chris
Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait
On 7 May 2019, at 17:22, Dave Chinner wrote:

> On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
>> The code in xlog_wait uses the spinlock to make adding the task to
>> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
>> with respect to the waker.
>>
>> Doing the wakeup after releasing the spinlock opens up the following
>> race condition:
>>
>> - add task to wait queue
>>
>> - wake up task
>>
>> - set task state to UNINTERRUPTIBLE
>>
>> Simply moving the spin_unlock to after the wake_up_all results
>> in the waker not being able to see a task on the waitqueue before
>> it has set its state to UNINTERRUPTIBLE.
>
> Yup, seems like an issue. Good find, Rik.
>
> So, what problem is this actually fixing? Was it noticed by
> inspection, or is it actually manifesting on production machines?
> If it is manifesting IRL, what are the symptoms (e.g. hang running
> out of log space?) and do you have a test case or any way to
> exercise it easily?

The steps to reproduce are semi-complicated, they create a bunch of files, do stuff, and then delete all the files in a loop. I think they shotgunned it across 500 or so machines to trigger 5 times, and then left the wreckage for us to poke at.

The symptoms were identical to the bug fixed here:

commit 696a562072e3c14bcd13ae5acc19cdf27679e865
Author: Brian Foster
Date: Tue Mar 28 14:51:44 2017 -0700

    xfs: use dedicated log worker wq to avoid deadlock with cil wq

But since our 4.16 kernel is newer than that, I briefly hoped that m_sync_workqueue needed to be flagged with WQ_MEM_RECLAIM. I don't have a great picture of how all of these workqueues interact, but I do think it needs WQ_MEM_RECLAIM. It can't be the cause of this deadlock, the workqueue watchdog would have fired.

Rik mentioned that I found sleeping procs with an empty iclog waitqueue list, which is when he noticed this race. We sent a wakeup to the sleeping process, and ftrace showed the process looping back around to sleep on the iclog again.

Long story short, Rik's patch definitely wouldn't have prevented the deadlock, and the iclog waitqueue I was poking must not have been the same one that process was sleeping on.

The actual problem ended up being the blkmq IO schedulers sitting on a request. Switching schedulers makes the box come back to life, so it's either a kyber bug or slightly higher up in blkmqland.

That's a huge tangent around acking Rik's patch, but it's hard to be sure if we've hit the lost wakeup in prod. I could search through all the related hung task timeouts, but they are probably all stuck in blkmq.

Acked-but-I'm-still-blaming-Jens-by: Chris Mason

-chris
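The lost wakeup Rik describes can be replayed in a small userspace model. This is only a sketch in Python, not kernel code: `Task`, `WaitQueue`, `wake_up_all` and `run` are hypothetical stand-ins for the waiter's two steps in xlog_wait and for the waker, and the interleavings are scripted by hand rather than produced by real threads.

```python
RUNNING = "RUNNING"
UNINTERRUPTIBLE = "UNINTERRUPTIBLE"


class Task:
    def __init__(self):
        self.state = RUNNING


class WaitQueue:
    def __init__(self):
        self.tasks = []


def wake_up_all(wq):
    # Mark every queued task runnable and drop it from the queue.
    for task in wq.tasks:
        task.state = RUNNING
    wq.tasks.clear()


def run(order):
    """Apply waiter and waker steps in the given order.

    Returns True when the waiter would go to sleep with nobody left
    to wake it: its state is UNINTERRUPTIBLE and it is no longer on
    the wait queue.
    """
    task, wq = Task(), WaitQueue()
    steps = {
        "enqueue": lambda: wq.tasks.append(task),
        "set_state": lambda: setattr(task, "state", UNINTERRUPTIBLE),
        "wake": lambda: wake_up_all(wq),
    }
    for name in order:
        steps[name]()
    on_queue = any(t is task for t in wq.tasks)
    return task.state == UNINTERRUPTIBLE and not on_queue


# Waker slips in between the waiter's two steps: the wakeup is lost.
racy = run(["enqueue", "wake", "set_state"])

# With both sides under the lock, the waker sees either no waiter at all
# (the task stays queued for a later wakeup) or a fully prepared one.
wake_first = run(["wake", "enqueue", "set_state"])
wake_last = run(["enqueue", "set_state", "wake"])

print(racy, wake_first, wake_last)
```

Only the first ordering leaves the task stuck, which is the ordering that holding the spinlock across wake_up_all rules out.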
Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
On 30 Jan 2019, at 20:34, Dave Chinner wrote:

> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
>>
>> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner
>>>
>>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>>>
>>> This change causes serious changes to page cache and inode cache
>>> behaviour and balance, resulting in major performance regressions
>>> when combining workloads such as large file copies and kernel
>>> compiles.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
>>
>> I'm a little confused by the latest comment in the bz:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
>
> Which says the first patch that changed the shrinker behaviour is
> the underlying cause of the regression.
>
>> Are these reverts sufficient?
>
> I think so.

Based on the latest comment:

"If I had been less strict in my testing I probably would have discovered that the problem was present earlier than 4.19.3. Mr Gushins commit made it more visible. I'm going back to work after two days off, so I might not be able to respond inside your working hours, but I'll keep checking in on this as I get a chance."

I don't think the reverts are sufficient.

>> Roman beat me to suggesting Rik's followup. We hit a different
>> problem in prod with small slabs, and have a lot of instrumentation
>> on Rik's code helping.
>
> I think that's just another nasty, expedient hack that doesn't solve
> the underlying problem. Solving the underlying problem does not
> require changing core reclaim algorithms and upsetting a page
> reclaim/shrinker balance that has been stable and worked well for
> just about everyone for years.

Things are definitely breaking down in non-specialized workloads, and have been for a long time.

-chris
Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
On 29 Jan 2019, at 23:17, Dave Chinner wrote:

> From: Dave Chinner
>
> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>
> This change causes serious changes to page cache and inode cache
> behaviour and balance, resulting in major performance regressions
> when combining workloads such as large file copies and kernel
> compiles.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=202441

I'm a little confused by the latest comment in the bz:

https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24

Are these reverts sufficient?

Roman beat me to suggesting Rik's followup. We hit a different problem in prod with small slabs, and have a lot of instrumentation on Rik's code helping.

-chris
Re: [LKP] [lkp-robot] [brd] 316ba5736c: aim7.jobs-per-min -11.2% regression
On 18 Dec 2018, at 13:57, Jens Axboe wrote:

> On 12/18/18 2:11 AM, kemi wrote:
>> Hi, All
>> Do we have special reason to keep this patch (316ba5736c9:brd: Mark
>> as non-rotational).
>> which leads to a performance regression when BRD is used as a disk on
>> btrfs.
>
> I really suspect that this is a btrfs issue, as this is just flagging
> what is pretty obvious, that a ramdisk is NOT a rotational drive.
> So whatever btrfs is doing with that information is causing it to
> run slower - this really doesn't make any sense, but there we are.
>
> CC'ing Chris, leaving the report below.

Btrfs is changing the allocator decisions slightly for an SSD, especially the cluster size for metadata, which should show up as more system time spent in the btrfs allocator, but I'm not seeing that below. It also changes how quickly btrfs dispatches synchronous IO.

But, some parts of the differential don't quite make sense to me:

     47.50 ± 58%  +1355.8%     691.50 ± 92%  meminfo.Mlocked

Are these changes expected?

-chris

>> On 2018/7/10 1:27 PM, kemi wrote:
>>> Hi, SeongJae
>>> Do you have any input for this regression? thanks
>>>
>>> On 2018-06-04 13:52, kernel test robot wrote:
>>>> Greeting,
>>>>
>>>> FYI, we noticed a -11.2% regression of aim7.jobs-per-min due to commit:
>>>>
>>>> commit: 316ba5736c9caa5dbcd84085989862d2df57431d ("brd: Mark as non-rotational")
>>>> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-4.18/block
>>>>
>>>> in testcase: aim7
>>>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
>>>> with following parameters:
>>>>
>>>>     disk: 1BRD_48G
>>>>     fs: btrfs
>>>>     test: disk_rw
>>>>     load: 1500
>>>>     cpufreq_governor: performance
>>>>
>>>> test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
>>>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>>>
>>>> Details are as below:
>>>>
>>>> compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
>>>>   gcc-7/performance/1BRD_48G/btrfs/x86_64-rhel-7.2/1500/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_rw/aim7
>>>>
>>>> commit:
>>>>   522a777566 ("block: consolidate struct request timestamp fields")
>>>>   316ba5736c ("brd: Mark as non-rotational")
>>>>
>>>> 522a777566f56696             316ba5736c9caa5dbcd8408598
>>>> ----------------             --------------------------
>>>>          %stddev    %change           %stddev
>>>>     28321           -11.2%       25147          aim7.jobs-per-min
>>>>    318.19           +12.6%      358.23          aim7.time.elapsed_time
>>>>    318.19           +12.6%      358.23          aim7.time.elapsed_time.max
>>>>   1437526 ±  2%     +14.6%     1646849 ±  2%    aim7.time.involuntary_context_switches
>>>>     11986           +14.2%       13691          aim7.time.system_time
>>>>     73.06 ±  2%      -3.6%       70.43          aim7.time.user_time
>>>>   2449470 ±  2%     -25.0%     1837521 ±  4%    aim7.time.voluntary_context_switches
>>>>     20.25 ± 58%   +1681.5%      360.75 ±109%    numa-meminfo.node1.Mlocked
>>>>    456062           -16.3%      381859          softirqs.SCHED
>>>>      9015 ±  7%     -21.3%        7098 ± 22%    meminfo.CmaFree
>>>>     47.50 ± 58%   +1355.8%      691.50 ± 92%    meminfo.Mlocked
>>>>      5.24 ±  3%      -1.2         3.99 ±  2%    mpstat.cpu.idle%
>>>>      0.61 ±  2%      -0.1         0.52 ±  2%    mpstat.cpu.usr%
>>>>     16627           +12.8%       18762 ±  4%    slabinfo.Acpi-State.active_objs
>>>>     16627           +12.9%       18775 ±  4%    slabinfo.Acpi-State.num_objs
>>>>     57.00 ±  2%     +17.5%       67.00          vmstat.procs.r
>>>>     20936           -24.8%       15752 ±  2%    vmstat.system.cs
>>>>     45474            -1.7%       44681          vmstat.system.in
>>>>      6.50 ± 59%   +1157.7%       81.75 ± 75%    numa-vmstat.node0.nr_mlock
>>>>    242870 ±  3%     +13.2%      274913 ±  7%    numa-vmstat.node0.nr_written
>>>>      2278 ±  7%     -22.6%        1763 ± 21%    numa-vmstat.node1.nr_free_cma
>>>>      4.75 ± 58%   +1789.5%       89.75 ±109%    numa-vmstat.node1.nr_mlock
>>>>  88018135 ±  3%     -48.9%    44980457 ±  7%    cpuidle.C1.time
>>>>   1398288 ±  3%     -51.1%      683493 ±  9%    cpuidle.C1.usage
>>>>   3499814 ±  2%     -38.5%     2153158 ±  5%    cpuidle.C1E.time
>>>>     52722 ±  4%     -45.6%       28692 ±  6%
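For readers checking the report by hand, the %change column for most rows is the relative difference between the two commits. The sketch below recomputes three rows from the values quoted above; `pct_change` is a helper name made up for this example, not part of the lkp tooling. The mpstat rows are the exception: they are already percentages, so the robot reports them as a plain difference in points.

```python
def pct_change(before, after):
    # Relative change from the parent commit to the tested commit, in percent.
    return (after - before) / before * 100.0


rows = {
    "aim7.jobs-per-min": (28321, 25147),
    "aim7.time.elapsed_time": (318.19, 358.23),
    "meminfo.Mlocked": (47.50, 691.50),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

The Mlocked jump looks dramatic as a percentage, but the absolute values are small and the reported stddev on both sides is large (± 58% and ± 92%).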
Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

Friendly reminder that the TAB elections are coming soon.

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Linux Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. We're also working with kernel maintainers to help refine the new code of conduct, and serving as the initial point of contact for code of conduct issues.

The board has ten members, one of whom sits on the Linux Foundation board of directors.

The election to select five TAB members will be held at the 2018 Kernel Summit in Vancouver, Canada. The elections will take place at the conference center on Tuesday November 13th, at 5:30pm. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Vancouver.

Anyone is eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Nominations are accepted up until the beginning of the event where the election is held.

In past years, everyone running for the TAB has given a short speech before the voting began. We've received feedback that the speeches add logistical complexity for the election, and may not be the best indicator of how well qualified someone is for the TAB.

Instead of speeches, this year we're asking candidates to include statements about why they would like to participate in the TAB. These will be combined into a slideshow running during the election, and available via a public google doc at this location:

https://goo.gl/rPEc2v

Even though the deadline for nominations is right before voting begins, any statements must be received by Monday November 12th at 5PM Pacific, so that we have time to set up the slideshow.

Current TAB members, and their election year:

Chris Mason          2016
H. Peter Anvin       2016
Olof Johansson       2016
Rik van Riel         2016
Dan Williams         2016
Jon Corbet           2017
Greg Kroah-Hartman   2017
Steven Rostedt       2017
Ted Tso              2017
Tim Bird             2017

The five slots from 2016 are all up for election. As always, please let us know if you have questions, and please do consider running.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Re: [PATCH 2/2] code-of-conduct: Strip the enforcement paragraph pending community discussion
On 6 Oct 2018, at 17:37, James Bottomley wrote:

> Significant concern has been expressed about the responsibilities
> outlined in the enforcement clause of the new code of conduct. Since
> there is concern that this becomes binding on the release of the 4.19
> kernel, strip the enforcement clauses to give the community time to
> consider and debate how this should be handled.

Even in the places where I don't agree with the discussion about what our code of conduct should be, I love that we're having it. Removing the enforcement clause basically goes back to the way things were. We'd be recognizing that we know issues happen, and explicitly stating that when serious events do happen, the community as a whole isn't committing to helping.

It's true there are a lot of questions about how the community resolves problems and holds each other accountable for maintaining any code of conduct. I think the enforcement section leaves us the room we need to continue discussions and still make it clear that we're making an effort to shift away from the harsh discussions in the past.

-chris
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote:
>> As the first step in development of bpfilter project [1] the
>> request_module() code is extended to allow user mode helpers
>> to be invoked. Idea is that user mode helpers are built as part
>> of the kernel build and installed as traditional kernel modules
>> with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different
>> than regular kernel modules. Thus, allow request_module() logic to
>> load such user mode helper (umh) modules via:
>
> [,,]
>
> I like this, but I have one request: can we make sure that this action
> is visible in the system messages?
>
> When we load a regular module, at least it shows in lsmod afterwards,
> although I have a few times wanted to really see module load as an
> event in the logs too.
>
> When we load a module that just executes a user program, and there is
> no sign of it in the module list, I think we *really* need to make
> that event show to the admin some way.
>
> .. and yes, maybe we'll need to rate-limit the messages, and maybe it
> turns out that I'm entirely wrong and people will hate the messages
> after they get used to the concept of these pseudo-modules, but
> particularly for the early implementation when this is a new thing, I
> really want a message like
>
>     executed user process xyz-abc as a pseudo-module
>
> or something in dmesg.
>
> I do *not* want this to be a magical way to hide things.

Especially early on, this makes a lot of sense. But I wanted to plug bps and the hopefully growing set of bpf introspection tools:

https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going on.

-chris
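The rate-limited message Linus asks for can be sketched as an interval/burst limiter. This is a userspace illustration in Python under stated assumptions, not the kernel's implementation: `RateLimit` and `log_pseudo_module` are hypothetical names, the interval and burst values are arbitrary, and only the message text comes from the email above.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.missed = 0  # messages suppressed so far

    def allow(self, now):
        # Open a fresh window once the previous one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False


def log_pseudo_module(limiter, log, name, now):
    # Record the event unless the limiter says we are being too noisy.
    if limiter.allow(now):
        log.append(f"executed user process {name} as a pseudo-module")


limiter = RateLimit(interval=5.0, burst=2)
log = []
for timestamp in (0.0, 0.1, 0.2, 6.0):
    log_pseudo_module(limiter, log, "xyz-abc", timestamp)

print(len(log), "logged,", limiter.missed, "suppressed")
```

Keeping a count of suppressed messages means the admin can still be told that events were dropped, which fits the goal of not hiding things.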
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/30/2017 12:23 PM, David Sterba wrote:

> On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote:
>> On 11/29/2017 12:05 PM, Tejun Heo wrote:
>>> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>>>> What has happened with this patch set?
>>>>
>>>> No idea. cc'ing Chris directly. Chris, if the patchset looks
>>>> good, can you please route them through the btrfs tree?
>>>
>>> lol looking at the patchset again, I'm not sure that's obviously the
>>> right tree. It can either be cgroup, block or btrfs. If no one
>>> objects, I'll just route them through cgroup.
>>
>> We'll have to coordinate a bit during the next merge window but I don't
>> have a problem with these going in through cgroup. Dave does this sound
>> good to you?
>
> There are only minor changes to btrfs code so cgroup tree would be
> better.
>
>> I'd like to include my patch to do all crcs inline (instead of handing
>> off to helper threads) when io controls are in place. By the merge
>> window we should have some good data on how much it's all helping.
>
> Are there any problems in sight if the inline crc and cgroup changes
> go separately? I assume there's a runtime dependency, not a code
> dependency, so it could be sorted by the right merge order.

The feature is just more useful with the inline crcs. Without them we end up with kworkers doing both high and low prio submissions and it all boils down to the speed of the lowest priority.

-chris
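A toy model shows why a shared helper thread flattens priorities. This is a sketch under simplifying assumptions, not a description of the btrfs code: a single FIFO helper stands in for the kworkers, the cost numbers are arbitrary time units, and a throttled low-priority submission is modeled simply as a larger cost. Both function names are made up for the example.

```python
def shared_worker_finish_times(jobs):
    """One FIFO helper handles every submission in arrival order."""
    clock = 0
    finished = {}
    for label, cost in jobs:
        clock += cost  # the helper is busy until this job is done
        finished[label] = clock
    return finished


def inline_finish_times(jobs):
    """Each submitter does its own work, so it only pays its own cost."""
    return {label: cost for label, cost in jobs}


# A throttled low-priority submission arrives just ahead of a cheap
# high-priority one.
jobs = [("low-prio", 10), ("high-prio", 1)]

shared = shared_worker_finish_times(jobs)
inline = inline_finish_times(jobs)

print("shared helper:", shared)
print("inline:", inline)
```

In the shared case the high-priority submission finishes only after the throttled one ahead of it, while in the inline case it finishes after its own single unit of work.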
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/29/2017 12:05 PM, Tejun Heo wrote:

> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>> What has happened with this patch set?
>>
>> No idea. cc'ing Chris directly. Chris, if the patchset looks
>> good, can you please route them through the btrfs tree?
>
> lol looking at the patchset again, I'm not sure that's obviously the
> right tree. It can either be cgroup, block or btrfs. If no one
> objects, I'll just route them through cgroup.

We'll have to coordinate a bit during the next merge window but I don't have a problem with these going in through cgroup. Dave does this sound good to you?

I'd like to include my patch to do all crcs inline (instead of handing off to helper threads) when io controls are in place. By the merge window we should have some good data on how much it's all helping.

-chris
Reminder v2: Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

Quick update on the TAB elections: we have 6 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso
Tim Bird

The elections are coming soon, please feel free to contact me if you have any questions about the TAB.

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand.

The board has ten members, one of whom sits on the LF board of directors.

The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague.

Anyone is eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here:

https://goo.gl/ADVFtT

Nominations are accepted up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Reminder: Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

Quick update on the TAB elections. We have 5 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso

The elections are next week. Please feel free to contact me if you have any questions about the TAB.

-

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. The board has ten members, one of whom sits on the LF board of directors.

The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague.

Anyone is eligible to stand for election. Simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here:

https://goo.gl/ADVFtT

Nominations will be accepted up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. The board has ten members, one of whom sits on the LF board of directors.

The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague.

Anyone is eligible to stand for election. Simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here:

https://goo.gl/ADVFtT

Nominations will be accepted up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request.

Herbert had asked about the crypto patch when we discussed the pull, but I didn't realize he really meant not-right-now. I've rebased it out of this branch, and none of the other patches depended on it.

I have things in my zstd-minimal branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal

There's a trivial conflict with the main btrfs pull from last week. Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge.

zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd commit:

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and an SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark:

    sudo modprobe zstd_compress_test
    sudo mknod zstd_compress_test c 245 0
    sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests.
| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.
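As a rough cross-check, the Ratio, MB/s and Adjusted MB/s columns of the compression table can be re-derived from the reported sizes and times. The sketch below assumes the numerator is the 211,988,480 B size of `silesia.tar` and that 1 MB means 10^6 bytes; that assumption reproduces the zstd -1 row, while the larger byte count quoted in the formula text is not used here. The function names are made up for illustration.

```python
# Sketch: re-derive one row of the compression table (zstd -1).
# Assumption: throughput is input size / time, with the input size being
# the 211,988,480 B silesia.tar and 1 MB = 10**6 bytes.
INPUT_BYTES = 211_988_480


def compression_ratio(compressed_bytes):
    """Uncompressed size divided by compressed size."""
    return INPUT_BYTES / compressed_bytes


def mb_per_s(seconds):
    """Raw throughput, including the time to copy from userland."""
    return INPUT_BYTES / seconds / 1e6


def adjusted_mb_per_s(seconds, baseline_seconds):
    """Throughput after subtracting the 'none' (copy-only) baseline time."""
    return INPUT_BYTES / (seconds - baseline_seconds) / 1e6


# zstd -1 row: 73645762 B in 1.044 s; the 'none' baseline took 0.100 s.
print(round(compression_ratio(73_645_762), 3))    # 2.878
print(round(mb_per_s(1.044), 2))                  # 203.05
print(round(adjusted_mb_per_s(1.044, 0.100), 2))  # 224.56
```

The adjusted figure isolates compressor time by removing the cost of the plain copy, which is why it diverges most from the raw figure at the fast levels.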
Nick Terrell (3) commits (+14222/-12):
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (4) commits (+14400/-12)

 fs/btrfs/Kconfig       |   2 +
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/compression.c |   1 +
 fs/btrfs/compression.h |   6 +-
 fs/btrfs/ctree.h       |   1 +
 fs/btrfs/disk-io.c     |   2 +
 fs/btrfs/ioctl.c       |   6 +-
 fs/btrfs/props.c       |   6 +
 fs/btrfs/super.c       |  12 +-
 fs/btrfs/sysfs.c       |   2 +
 fs/btrfs/zstd.c        | 432 ++
 fs/squashfs/Kconfig    |  14 +
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote:
> On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote:
> >  crypto/Kconfig   |   9 +
> >  crypto/Makefile  |   1 +
> >  crypto/testmgr.c |  10 +
> >  crypto/testmgr.h |  71 +
> >  crypto/zstd.c    | 265
>
> Is there anyone going to use zstd through the crypto API? If not then I don't see the point in adding it at this point. Especially as the compression API is still in a state of flux.

That part was requested by Intel, but I'm happy to leave it out for another time. The rest of the patch series doesn't depend on it at all.

-chris
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On 09/08/2017 03:33 PM, Chris Mason wrote:
> Hi Linus,
>
> Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request.
>
> I have it in my zstd branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd
>
> There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal.
>
> zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Just to clarify, we've been testing the kernel side of this here at FB, but our zstd use in prod is limited to the application side.

-chris
[GIT PULL] zstd support (lib, btrfs, squashfs)
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd commit:

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and an SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark:

    sudo modprobe zstd_compress_test
    sudo mknod zstd_compress_test c 245 0
    sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests.
| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.
Nick Terrell (4) commits (+14578/-12):
    crypto: Add zstd support (+356/-0)
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (5) commits (+14756/-12)
[GIT PULL] zstd support (lib, btrfs, squashfs)
Hi Linus, Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request. I have it in my zstd branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal. zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code. Nick has a number of benchmarks for the main zstd code in his lib/zstd commit: I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark: sudo modprobe zstd_compress_test sudo mknod zstd_compress_test c 245 0 sudo cp silesia.tar zstd_compress_test The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests. 
| Method | Size (B) | Time (s) | Ratio | MB/s| Adj MB/s | Mem (MB) | |--|--|--|---|-|--|--| | none | 11988480 |0.100 | 1 | 2119.88 |- |- | | zstd -1 | 73645762 |1.044 | 2.878 | 203.05 | 224.56 | 1.23 | | zstd -3 | 66988878 |1.761 | 3.165 | 120.38 | 127.63 | 2.47 | | zstd -5 | 65001259 |2.563 | 3.261 | 82.71 |86.07 | 2.86 | | zstd -10 | 60165346 | 13.242 | 3.523 | 16.01 |16.13 |13.22 | | zstd -15 | 58009756 | 47.601 | 3.654 |4.45 | 4.46 |21.61 | | zstd -19 | 54014593 | 102.835 | 3.925 |2.06 | 2.06 |60.15 | | zlib -1 | 77260026 |2.895 | 2.744 | 73.23 |75.85 | 0.27 | | zlib -3 | 72972206 |4.116 | 2.905 | 51.50 |52.79 | 0.27 | | zlib -6 | 68190360 |9.633 | 3.109 | 22.01 |22.24 | 0.27 | | zlib -9 | 67613382 | 22.554 | 3.135 |9.40 | 9.44 | 0.27 | I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level. | Method | Time (s) | MB/s| Adjusted MB/s | Memory (MB) | |--|--|-|---|-| | none |0.025 | 8479.54 | - | - | | zstd -1 |0.358 | 592.15 |636.60 |0.84 | | zstd -3 |0.396 | 535.32 |571.40 |1.46 | | zstd -5 |0.396 | 535.32 |571.40 |1.46 | | zstd -10 |0.374 | 566.81 |607.42 |2.51 | | zstd -15 |0.379 | 559.34 |598.84 |4.61 | | zstd -19 |0.412 | 514.54 |547.77 |8.80 | | zlib -1 |0.940 | 225.52 |231.68 |0.04 | | zlib -3 |0.883 | 240.08 |247.07 |0.04 | | zlib -6 |0.844 | 251.17 |258.84 |0.04 | | zlib -9 |0.837 | 253.27 |287.64 |0.04 | === I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran. 
Nick Terrell (4) commits (+14578/-12):
    crypto: Add zstd support (+356/-0)
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (5) commits (+14756/-12)
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:25 PM, Hugo Mills wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
> > On 08/10/2017 04:30 AM, Eric Biggers wrote:
> > > These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).
> >
> > I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.
> >
> > It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.
>
> Could we please not add more mount options? I get that they're easy to implement, but it's a very blunt instrument. What we tend to see (with both nodatacow and compress) is people using the mount options, then asking for exceptions, discovering that they can't do that, and then falling back to doing it with attributes or btrfs properties. Could we just start with btrfs properties this time round, and cut out the mount option part of this cycle?
>
> In the long run, it'd be great to see most of the btrfs-specific mount options get deprecated and ultimately removed entirely, in favour of attributes/properties, where feasible.

It's a good point, and as was commented later down I'd just do mount -o compress=zstd:3 or something. But I do prefer properties in general for this. My big point was just that next step is outside of Nick's scope.

-chris
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:00 PM, Eric Biggers wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
> > On 08/10/2017 04:30 AM, Eric Biggers wrote:
> > > On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
> > > > The memory reported is the amount of memory the compressor requests.
> > > >
> > > > | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
> > > > |----------|-----------|----------|-------|---------|----------|----------|
> > > > | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
> > > > | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
> > > > | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
> > > > | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
> > > > | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
> > > > | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
> > > > | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
> > > > | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
> > > > | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
> > > > | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
> > > > | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
> > >
> > > These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).
> >
> > I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.
> >
> > It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.
>
> I am not surprised --- Zstandard is closer to the state of the art, both format-wise and implementation-wise, than the other choices in BTRFS. My point is that benchmarks need to account for how much data is compressed at a time. This is a common mistake when comparing different compression algorithms; the algorithm name and compression level do not tell the whole story. The dictionary size is extremely significant. No one is going to compress or decompress a 200 MB file as a single stream in kernel mode, so it does not make sense to justify adding Zstandard *to the kernel* based on such a benchmark. It is going to be divided into chunks. How big are the chunks in BTRFS? I thought that it compressed only one page (4 KiB) at a time, but I hope that has been, or is being, improved; 32 KiB - 128 KiB should be a better amount. (And if the amount of data compressed at a time happens to be different between the different algorithms, note that BTRFS benchmarks are likely to be measuring that as much as the algorithms themselves.)

Btrfs hooks the compression code into the delayed allocation mechanism we use to gather large extents for COW. So if you write 100MB to a file, we'll have 100MB to compress at a time (within the limits of the amount of pages we allow to collect before forcing it down).

But we want to balance how much memory you might need to uncompress during random reads. So we have an artificial limit of 128KB that we send at a time to the compression code. It's easy to change this, it's just a tradeoff made to limit the cost of reading small bits.

It's the same for zlib, lzo and the new zstd patch.

-chris
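The effect discussed above, that compressing in independently decompressible chunks costs some ratio compared with one long stream, can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: it uses the standard-library `zlib` module rather than zstd (zstd bindings are not assumed to be available), the sample data is synthetic, and the chunk sizes 4 KiB, 32 KiB and 128 KiB are simply the figures mentioned in the thread. None of this is btrfs code.

```python
import random
import zlib


def make_sample(size, seed=0):
    """Build deterministic, text-like sample data (a synthetic stand-in for real files)."""
    rng = random.Random(seed)
    words = [b"extent", b"inode", b"btrfs", b"compress",
             b"kernel", b"page", b"chunk", b"stream"]
    out = bytearray()
    while len(out) < size:
        out += rng.choice(words) + b" "
    return bytes(out[:size])


def single_stream_size(data, level=6):
    """Compressed size when the whole buffer is one stream."""
    return len(zlib.compress(data, level))


def chunked_size(data, chunk_size, level=6):
    """Total compressed size when every chunk is compressed on its own.

    Each chunk starts with an empty history window, so it can be
    decompressed without reading any other chunk.
    """
    total = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        packed = zlib.compress(chunk, level)
        # Check that the chunk really is independently decompressible.
        assert zlib.decompress(packed) == chunk
        total += len(packed)
    return total


data = make_sample(1 << 20)  # 1 MiB of sample input
whole = single_stream_size(data)
by_chunk = {size: chunked_size(data, size) for size in (4096, 32768, 131072)}

print("single stream:", whole, "bytes")
for size, total in by_chunk.items():
    print(f"{size // 1024:>3} KiB chunks:", total, "bytes")
```

Because zlib's history window is only 32 KiB, the gap between 128 KiB chunks and a single stream is small here; with an algorithm that can use a much larger window, such as zstd, the chunk size would be expected to matter more, which is the point being made about the benchmark.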
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 04:30 AM, Eric Biggers wrote:
> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
> > The memory reported is the amount of memory the compressor requests.
> >
> > | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
> > |----------|-----------|----------|-------|---------|----------|----------|
> > | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
> > | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
> > | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
> > | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
> > | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
> > | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
> > | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
> > | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
> > | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
> > | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
> > | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
>
> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).

I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.

-chris
Re: Moving ndctl development into the kernel tree?
On 07/22/2017 02:49 PM, Dan Williams wrote:
> On Fri, Jul 21, 2017 at 7:52 PM, Dan Williams wrote:
> > [ adding Chris ]
> >
> > On Fri, Jul 21, 2017 at 4:44 PM, Dan Williams wrote:
> > > On Fri, Jul 21, 2017 at 3:58 PM, Ingo Molnar wrote:
> > > > * Dan Williams wrote:
> > > > > [...]
> > > > > * Like perf, ndctl borrows the sub-command architecture and option parsing from git. So, this code could be refactored into something shared / generic, i.e. the bits in tools/perf/util/.
> > > >
> > > > Just as a side note, stacktool (tools/stacktool/) is using the Git sub-command and options parsing code as well, and it's already sharing it with perf, via the tools/lib/subcmd/ library. ndctl could use that as well.
> > >
> > > Ah, nice, that refactoring happened about a year after ndctl was born. Which brings up the next question about what to do with the git history, but I'd want to know if ndctl is even welcome upstream before digging any deeper.
> >
> > I suspect this would be similar to what Chris did to merge btrfs while retaining the standalone history. Chris, any pointers on what worked well and what if anything you would do differently? I.e. I'm looking to use git filter-branch to rewrite ndctl history as if it had always been in tools/ndctl in the kernel tree. I found this old thread https://lkml.org/lkml/2008/10/30/523 and it seems to also recommend using an older kernel as the branch base.
>
> So it wasn't as painful as I thought it would be, I just used the script Linus recommended in that thread. Here is what I came up with merging the last ndctl release on top of v4.9, and then applying the pending development patches re-filtered to tools/ndctl:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=for-4.14/ndctl
>
> ...the next thing would be to rework the versioning to use the kernel version and switch to using tools/lib/subcmd/.

I'd like to say I figured it all out back then, but the truth is that Linus held my hand the whole way. My memory of it is that his script worked really well, I just ran that and verified the results.

-chris
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.12 branch has some fixes that Dave Sterba collected:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.12

We've been hitting an early ENOSPC problem on production machines that Omar tracked down to an old int->u64 mistake. I waited a bit on this pull to make sure it was really the problem from production, but it's on ~2100 hosts now and I think we're good.

Omar also noticed a commit in the queue would make new early ENOSPC problems. I pulled that out for now, which is why the top three commits are younger than the rest.

Otherwise these are all fixes, some explaining very old bugs that we've been poking at for a while.

Jeff Mahoney (2) commits (+4/-3):
    btrfs: fix race with relocation recovery and fs_root setup (+3/-3)
    btrfs: fix memory leak in update_space_info failure path (+1/-0)

Liu Bo (1) commits (+1/-1):
    Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

Colin Ian King (1) commits (+1/-1):
    btrfs: fix incorrect error return ret being passed to mapping_set_error

Omar Sandoval (1) commits (+2/-2):
    Btrfs: fix delalloc accounting leak caused by u32 overflow

Qu Wenruo (1) commits (+122/-2):
    btrfs: fiemap: Cache and merge fiemap extent before submit it to user

David Sterba (1) commits (+2/-2):
    btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1) commits (+6/-4):
    btrfs: Make flush bios explicitely sync

Su Yue (1) commits (+1/-1):
    btrfs: tree-log.c: Wrong printk information about namelen

Total: (9) commits (+139/-16)

 fs/btrfs/ctree.h       |   4 +-
 fs/btrfs/dir-item.c    |   2 +-
 fs/btrfs/disk-io.c     |  10 ++--
 fs/btrfs/extent-tree.c |   7 +--
 fs/btrfs/extent_io.c   | 126 +++--
 fs/btrfs/inode.c       |   6 +--
 6 files changed, 139 insertions(+), 16 deletions(-)
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 06/06/2017 05:21 AM, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
> > On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> > > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > > > Please test..
> > >
> > > Results are still coming in but things do look better with your patch applied. It does look like there's a regression when running hackbench in process mode and when the CPUs are not fully utilised, e.g. check this out:
> >
> > This turned out to be a false positive; your patch improves things as far as I can see.
>
> Hooray, I'll move it to a part of the queue intended for merging.

It's a little late, but Roman Gushchin helped get some runs of this with our production workload. The patch is ever so slightly better.

Thanks!

-chris
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 05/17/2017 06:53 AM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> > sched/fair, cpumask: Export for_each_cpu_wrap()
> >
> > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
> > -{
> > - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
> > -}
>
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap() implementation. The above 'n+1' should be 'n', and the effect is that it'll skip over CPUs, potentially resulting in an iteration that only sees every other CPU (for a fully contiguous mask).
>
> This in turn causes hackbench to further suffer from the regression introduced by commit:
>
> 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
>
> So it's well past time to fix this.
>
> Where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping to scan entirely, we limit how many CPUs we scan.
>
> Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as bad as the old thing.
>
> It also appears to recover the tbench high-end, which also suffered like hackbench.
>
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
>
> Please test..

We'll get some tests going here too.

-chris
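For readers unfamiliar with the idea, the two concepts in the quoted description (visiting every CPU in a mask once, starting at a given CPU and wrapping around, and bounding how many CPUs a scan may visit instead of giving up entirely) can be sketched in a few lines of Python. This is a toy model only: the function names, the `limit` parameter and the way the limit is chosen are hypothetical and are not taken from the kernel patch.

```python
def iter_cpus_wrap(mask, start):
    """Yield every CPU id in `mask` exactly once.

    Iteration begins at the first CPU >= start and wraps around to the
    lowest-numbered CPUs once the end of the mask is reached.
    """
    cpus = sorted(mask)
    yield from (cpu for cpu in cpus if cpu >= start)
    yield from (cpu for cpu in cpus if cpu < start)


def scan_for_idle(mask, start, idle, limit):
    """Return the first idle CPU found, visiting at most `limit` CPUs.

    `limit` is a hypothetical knob standing in for whatever bound the
    scheduler applies; returning None means the bounded scan found nothing.
    """
    for scanned, cpu in enumerate(iter_cpus_wrap(mask, start)):
        if scanned >= limit:
            return None
        if cpu in idle:
            return cpu
    return None


# Example: 8 CPUs, scan starts at CPU 6, only CPU 1 is idle.
print(list(iter_cpus_wrap(range(8), 6)))
print(scan_for_idle(range(8), 6, idle={1}, limit=4))
print(scan_for_idle(range(8), 6, idle={1}, limit=3))
```

With a larger limit the scan behaves like a full wrap-around search; shrinking the limit trades the chance of finding an idle CPU against the cost of scanning, which is the gradual behaviour described in the quoted text.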
Re: [GIT PULL] Btrfs
On 05/09/2017 01:56 PM, Chris Mason wrote:
> Hi Linus,
>
> My for-linus-4.12 branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> for-linus-4.12

I hit send too soon, sorry. There's a trivial conflict with our WARN_ON fix that went into 4.11. I pushed the resolution to for-linus-4.12-merged.

diff --cc fs/btrfs/qgroup.c
index afbea61,3f75b5c..deffbeb
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str
  qgroup->excl += sign * num_bytes;
  qgroup->excl_cmpr += sign * num_bytes;
  if (sign > 0) {
+     trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
-     if (WARN_ON(qgroup->reserved < num_bytes))
+     if (qgroup->reserved < num_bytes)
          report_reserved_underflow(fs_info, qgroup, num_bytes);
      else
          qgroup->reserved -= num_bytes;
@@@ -1103,7 -1057,9 +1060,9 @@@
  WARN_ON(sign < 0 && qgroup->excl < num_bytes);
  qgroup->excl += sign * num_bytes;
  if (sign > 0) {
+     trace_qgroup_update_reserve(fs_info, qgroup,
+                                 -(s64)num_bytes);
-     if (WARN_ON(qgroup->reserved < num_bytes))
+     if (qgroup->reserved < num_bytes)
          report_reserved_underflow(fs_info, qgroup,
                                    num_bytes);
      else
@@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b
  qg = unode_aux_to_qgroup(unode);
+ trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
- if (WARN_ON(qg->reserved < num_bytes))
+ if (qg->reserved < num_bytes)
      report_reserved_underflow(fs_info, qg, num_bytes);
  else
      qg->reserved -= num_bytes;
[GIT PULL] Btrfs
    bdev_get_queue (+3/-4)
    btrfs: check if the device is flush capable (+4/-0)
    btrfs: delete unused member nobarriers (+0/-4)

Edmund Nadolski (2) commits (+25/-20):
    btrfs: provide enumeration for __merge_refs mode argument (+13/-10)
    btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10)

Goldwyn Rodrigues (2) commits (+24/-3):
    btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1)
    btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2)

Chris Mason (1) commits (+2/-2):
    btrfs: fix the gfp_mask for the reada_zones radix tree

Adam Borowski (1) commits (+9/-3):
    btrfs: fix a bogus warning when converting only data or metadata

Deepa Dinamani (1) commits (+2/-1):
    btrfs: Use ktime_get_real_ts for root ctime

Dan Carpenter (1) commits (+15/-26):
    Btrfs: handle only applicable errors returned by btrfs_get_extent

Dmitry V. Levin (1) commits (+2/-0):
    MAINTAINERS: add btrfs file entries for include directories

Hans van Kranenburg (1) commits (+5/-5):
    Btrfs: consistent usage of types in balance_args

Total: (71) commits

 MAINTAINERS                  |   2 +
 fs/btrfs/backref.c           |  41 ++-
 fs/btrfs/btrfs_inode.h       |   7 +
 fs/btrfs/compression.c       |  18 +-
 fs/btrfs/ctree.c             |  20 +-
 fs/btrfs/ctree.h             |  34 +-
 fs/btrfs/delayed-inode.c     |  46 +--
 fs/btrfs/delayed-inode.h     |   6 +-
 fs/btrfs/delayed-ref.c       |   8 +-
 fs/btrfs/delayed-ref.h       |   8 +-
 fs/btrfs/dev-replace.c       |   9 +-
 fs/btrfs/disk-io.c           |  13 +-
 fs/btrfs/disk-io.h           |   4 +-
 fs/btrfs/extent-tree.c       |  35 +-
 fs/btrfs/extent_io.c         |  59 +--
 fs/btrfs/extent_io.h         |   8 +-
 fs/btrfs/extent_map.c        |  10 +-
 fs/btrfs/extent_map.h        |   3 +-
 fs/btrfs/file.c              |  82 -
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode.c             | 289 +++
 fs/btrfs/ioctl.c             |  33 +-
 fs/btrfs/ordered-data.c      |  20 +-
 fs/btrfs/ordered-data.h      |   2 +-
 fs/btrfs/qgroup.c            | 102 ++
 fs/btrfs/qgroup.h            |  51 ++-
 fs/btrfs/raid56.c            |  38 +-
 fs/btrfs/reada.c             |  37 +-
 fs/btrfs/root-tree.c         |   3 +-
 fs/btrfs/scrub.c             | 331 +++--
 fs/btrfs/send.c              |  23 +-
 fs/btrfs/super.c             |   3 +-
 fs/btrfs/tests/btrfs-tests.c |   1 -
 fs/btrfs/transaction.c       |  48 ++-
 fs/btrfs/transaction.h       |   6 +-
 fs/btrfs/tree-log.c          |   2 +-
 fs/btrfs/volumes.c           | 854 +++
 fs/btrfs/volumes.h           |   8 +-
 include/trace/events/btrfs.h | 187 +-
 include/uapi/linux/btrfs.h   |  10 +-
 40 files changed, 1629 insertions(+), 834 deletions(-)
Re: [PATCH] btrfs: always write superblocks synchronously
On 05/03/2017 04:36 AM, Jan Kara wrote:
On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote:

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed the REQ_SYNC flag from the WRITE_FUA implementation. Since the REQ_FUA and REQ_FLUSH flags are stripped from submitted IO when the disk doesn't have a volatile write cache, this effectively makes the write async. This was seen to cause performance regressions of up to 90% in disk IO related benchmarks such as reaim and dbench[1].

Fix the problem by making sure the first superblock write is also treated as synchronous, since superblock writes can block progress of the journalling (commit, log syncs) machinery and thus the whole filesystem.

Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous)
Cc: stable
Cc: Jan Kara
Signed-off-by: Davidlohr Bueso

I wasn't patient enough and already sent the fix as part of my series fixing other filesystems [1]. It also fixes one more place in btrfs that needs REQ_SYNC to return to the original behavior.

Thanks guys.

-chris
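The flag interaction described in the commit message can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: the DEMO_* bits and the demo_* helpers are hypothetical stand-ins, not the kernel's definitions, and the model covers writes only. It shows why a FUA-only superblock write ends up async on a device without a volatile write cache, and why setting the sync bit explicitly restores the old behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration only; not the kernel's values. */
enum {
	DEMO_REQ_SYNC     = 1u << 0,
	DEMO_REQ_FUA      = 1u << 1,
	DEMO_REQ_PREFLUSH = 1u << 2,
};

/*
 * Model of the stripping step: without a volatile write cache, the
 * FUA and PREFLUSH bits are dropped from the submitted write.
 */
static unsigned int demo_strip_flush_flags(unsigned int flags, bool volatile_cache)
{
	if (!volatile_cache)
		flags &= ~(unsigned int)(DEMO_REQ_FUA | DEMO_REQ_PREFLUSH);
	return flags;
}

/* Model of the sync test: any of SYNC, FUA or PREFLUSH marks a write sync. */
static bool demo_write_is_sync(unsigned int flags)
{
	return (flags & (DEMO_REQ_SYNC | DEMO_REQ_FUA | DEMO_REQ_PREFLUSH)) != 0;
}

/* Walk through the regression and the fix on a cache-less device. */
static void demo_check_superblock_write(void)
{
	unsigned int before = demo_strip_flush_flags(DEMO_REQ_FUA, false);
	unsigned int after = demo_strip_flush_flags(DEMO_REQ_FUA | DEMO_REQ_SYNC, false);

	assert(!demo_write_is_sync(before)); /* FUA alone: ends up async */
	assert(demo_write_is_sync(after));   /* explicit SYNC survives stripping */
}
```

Calling demo_check_superblock_write() exercises both cases. On a device that does have a volatile cache, the FUA bit is kept in this model and the write stays sync either way.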
[GIT PULL] Btrfs
Hi Linus,

We have one more for btrfs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

This is dropping a new WARN_ON from rc1 that ended up making more noise than we really want. The larger fix for the underflow got delayed a bit and it's better for now to put it under CONFIG_BTRFS_DEBUG.

David Sterba (1) commits (+7/-4):
    btrfs: qgroup: move noisy underflow warning to debugging build

Total: (1) commits (+7/-4)

 fs/btrfs/qgroup.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)
Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
On 04/25/2017 04:49 PM, Tejun Heo wrote:
On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:

Will try that too. I can't see why HT would change it because I see single CPU queues misevaluated. Just in case, you need to tune the test params so that it doesn't load the machine too much and that there are some non-CPU intensive workloads going on to perturb things a bit. Anyways, I'm gonna try disabling HT.

It's finickier but after changing the duty cycle a bit, it reproduces w/ HT off. I think the trick is setting the number of threads to the number of logical CPUs and tune -s/-c so that p99 starts climbing up. The following is from the root cgroup.

Since it's only measuring wakeup latency, schbench is best at exposing problems when the machine is just barely below saturated. At saturation, everyone has to wait for the CPUs, and if we're relatively idle there's always a CPU to be found.

There's schbench -a to try and find this magic tipping point, but I haven't found a great way to automate for every kind of machine yet (sorry).

-chris
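For readers who want to watch a p99 "climb" in their own measurements, here is a minimal sketch of a nearest-rank percentile over a set of wakeup latency samples. It is not how schbench computes its percentiles; demo_percentile and demo_cmp_ulong are hypothetical names, and the percentile is passed as an integer number of tenths of a percent to avoid floating-point rounding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for unsigned long latency samples (ascending). */
static int demo_cmp_ulong(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile. `permille` is the percentile times ten, so
 * 990 means p99 and 999 means p99.9. Sorts `samples` in place.
 */
static unsigned long demo_percentile(unsigned long *samples, size_t n,
				     unsigned int permille)
{
	size_t rank;

	assert(n > 0 && permille > 0 && permille <= 1000);
	qsort(samples, n, sizeof(*samples), demo_cmp_ulong);
	rank = ((size_t)permille * n + 999) / 1000; /* ceil(permille * n / 1000) */
	return samples[rank - 1];
}
```

Comparing demo_percentile(samples, n, 500) with demo_percentile(samples, n, 990) across runs at different load levels is one way to see the tail pull away from the median as the machine approaches saturation.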
[GIT PULL] Btrfs
Hi Linus,

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch we're testing/sending by hand for this release.

Liu Bo (3) commits (+11/-13):
    Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
    Btrfs: fix potential use-after-free for cloned bio (+1/-1)
    Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
    btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

 fs/btrfs/inode.c   | 22 ++
 fs/btrfs/super.c   |  3 +++
 fs/btrfs/volumes.c |  2 +-
 3 files changed, 14 insertions(+), 13 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
    btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
    Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
    Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c     |  6 +++---
 fs/btrfs/qgroup.c    | 10 +-
 fs/btrfs/send.c      |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

We have a small set of fixes for the next RC:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Zygo tracked down a very old bug with inline compressed extents. I didn't tag this one for stable because I want to do individual tested backports. It's a little tricky and I'd rather do some extra testing on it along the way.

Otherwise they are pretty obvious:

Liu Bo (1) commits (+2/-1):
    Btrfs: fix regression in lock_delalloc_pages

Dmitry V. Levin (1) commits (+0/-27):
    btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h

Zygo Blaxell (1) commits (+14/-0):
    btrfs: add missing memset while reading compressed inline extents

Total: (3) commits (+16/-28)

 fs/btrfs/extent_io.c       |  3 ++-
 fs/btrfs/inode.c           | 14 ++
 include/uapi/linux/btrfs.h | 27 ---
 3 files changed, 16 insertions(+), 28 deletions(-)
Re: [PATCH] jump_label: Fix anonymous union initialization
On 03/02/2017 04:42 PM, Steven Rostedt wrote:
On Thu, 2 Mar 2017 16:07:19 -0500 Jason Baron <jba...@akamai.com> wrote:
On 02/28/2017 11:32 AM, Boris Ostrovsky wrote:

Pre-4.6 gcc do not allow direct static initialization of members of anonymous structs/unions. After commit 3821fd35b58d ("jump_label: Reduce the size of struct static_key") STATIC_KEY_INIT_{TRUE|FALSE} definitions cannot be compiled with those older compilers.

Placing initializers inside curved brackets works around this problem.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 include/linux/jump_label.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 8e06d75..518020b 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -166,10 +166,10 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry,
  */
 #define STATIC_KEY_INIT_TRUE \
 	{ .enabled = { 1 }, \
-	.entries = (void *)JUMP_TYPE_TRUE }
+	{ .entries = (void *)JUMP_TYPE_TRUE } }
 #define STATIC_KEY_INIT_FALSE \
 	{ .enabled = { 0 }, \
-	.entries = (void *)JUMP_TYPE_FALSE }
+	{ .entries = (void *)JUMP_TYPE_FALSE } }
 #else /* !HAVE_JUMP_LABEL */

(Adding Steve to 'cc)

Thanks for the fix.

Reviewed-by: Jason Baron <jba...@akamai.com>

Funny, Chris pinged me on IRC telling me that jump labels broke with my latest tree. And we discovered it was because of anonymous unions and he was using an older compiler (4.4 or something). I didn't know how to make it work, and we were just going to say "tough, jump labels are not for 4.4". Although, didn't goto asm get added into 4.5? Did someone backport it to the gcc 4.4 compilers?

I believe 4.5 handles anonymous unions.

Since the broken commit went through my tree, I'll take this patch. I'm getting ready for another git pull request to Linus.

Compiled-by: Chris Mason <c...@fb.com>

-chris
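The workaround in the patch can be tried outside the kernel with a small stand-in structure. This is a sketch under stated assumptions: struct demo_key, DEMO_TYPE_TRUE, DEMO_TYPE_FALSE and the DEMO_KEY_INIT_* macros are hypothetical names modeled loosely on struct static_key, not the kernel's definitions. The point is the shape of the initializer: the anonymous union gets its own pair of braces instead of having its member designated directly at the outer level.

```c
#include <assert.h>

/* Hypothetical stand-ins for JUMP_TYPE_TRUE / JUMP_TYPE_FALSE. */
#define DEMO_TYPE_FALSE 0UL
#define DEMO_TYPE_TRUE  1UL

struct demo_counter {
	int counter;
};

struct demo_key {
	struct demo_counter enabled;
	union {			/* anonymous union: members are accessed directly */
		unsigned long type;
		void *entries;
	};
};

/*
 * The inner braces initialize the anonymous union positionally, and
 * the designator inside them picks the union member. This mirrors the
 * "+" lines of the patch rather than the "-" lines.
 */
#define DEMO_KEY_INIT_TRUE \
	{ .enabled = { 1 }, \
	  { .entries = (void *)DEMO_TYPE_TRUE } }
#define DEMO_KEY_INIT_FALSE \
	{ .enabled = { 0 }, \
	  { .entries = (void *)DEMO_TYPE_FALSE } }

static struct demo_key demo_true_key = DEMO_KEY_INIT_TRUE;
static struct demo_key demo_false_key = DEMO_KEY_INIT_FALSE;

/* Runtime sanity check that the braced initializers filled both members. */
static void demo_check_keys(void)
{
	assert(demo_true_key.enabled.counter == 1);
	assert(demo_true_key.entries == (void *)DEMO_TYPE_TRUE);
	assert(demo_false_key.enabled.counter == 0);
	assert(demo_false_key.entries == (void *)DEMO_TYPE_FALSE);
}
```

Current compilers accept both the braced form and the direct `.entries = ...` form, so the braced form costs nothing on new toolchains while staying buildable on the older gcc releases the patch description mentions.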
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Has Btrfs round two. These are mostly a continuation of Dave Sterba's collection of cleanups, but Filipe also has some bug fixes and performance improvements.

Nikolay Borisov (42) commits (+611/-579):
    btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode (+14/-14)
    btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode (+39/-38)
    btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode (+10/-8)
    btrfs: all btrfs_delalloc_release_metadata take btrfs_inode (+22/-19)
    btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode (+3/-4)
    btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode (+7/-6)
    btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode (+3/-3)
    btrfs: Make btrfs_orphan_release_metadata take btrfs_inode (+8/-8)
    btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode (+7/-7)
    btrfs: Make check_parent_dirs_for_sync take btrfs_inode (+14/-14)
    btrfs: make btrfs_free_io_failure_record take btrfs_inode (+9/-7)
    btrfs: Make btrfs_lookup_ordered_range take btrfs_inode (+19/-18)
    btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode (+17/-16)
    btrfs: make btrfs_print_data_csum_error take btrfs_inode (+8/-7)
    btrfs: make btrfs_is_free_space_inode take btrfs_inode (+20/-19)
    btrfs: make btrfs_set_inode_index_count take btrfs_inode (+8/-8)
    btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode (+5/-5)
    btrfs: Make clone_update_extent_map take btrfs_inode (+13/-14)
    btrfs: Make btrfs_mark_extent_written take btrfs_inode (+6/-6)
    btrfs: Make btrfs_drop_extent_cache take btrfs_inode (+30/-26)
    btrfs: Make calc_csum_metadata_size take btrfs_inode (+12/-15)
    btrfs: Make drop_outstanding_extent take btrfs_inode (+11/-12)
    btrfs: Make btrfs_del_delalloc_inode take btrfs_inode (+7/-7)
    btrfs: make btrfs_log_inode_parent take btrfs_inode (+24/-26)
    btrfs: Make btrfs_set_inode_index take btrfs_inode (+13/-13)
    btrfs: Make btrfs_clear_bit_hook take btrfs_inode (+25/-21)
    btrfs: Make check_extent_to_block take btrfs_inode (+6/-5)
    btrfs: make check_compressed_csum take btrfs_inode (+4/-5)
    btrfs: Make btrfs_insert_dir_item take btrfs_inode (+7/-7)
    btrfs: Make btrfs_log_all_parents take btrfs_inode (+5/-5)
    btrfs: Make btrfs_i_size_write take btrfs_inode (+18/-19)
    btrfs: make repair_io_failure take btrfs_inode (+12/-11)
    btrfs: Make btrfs_orphan_add take btrfs_inode (+24/-22)
    btrfs: make btrfs_orphan_del take btrfs_inode (+20/-20)
    btrfs: make clean_io_failure take btrfs_inode (+15/-14)
    btrfs: Make btrfs_add_nondir take btrfs_inode (+13/-9)
    btrfs: make free_io_failure take btrfs_inode (+13/-11)
    btrfs: Make check_can_nocow take btrfs_inode (+12/-10)
    btrfs: Make btrfs_add_link take btrfs_inode (+26/-23)
    btrfs: Make get_extent_t take btrfs_inode (+59/-54)
    btrfs: Make hole_mergeable take btrfs_inode (+5/-4)
    btrfs: Make fill_holes take btrfs_inode (+18/-19)

David Sterba (16) commits (+139/-124):
    btrfs: use predefined limits for calculating maximum number of pages for compression (+6/-5)
    btrfs: derive maximum output size in the compression implementation (+9/-14)
    btrfs: merge nr_pages input and output parameter in compress_pages (+11/-15)
    btrfs: merge length input and output parameter in compress_pages (+18/-20)
    btrfs: add dummy callback for readpage_io_failed and drop checks (+10/-3)
    btrfs: do proper error handling in btrfs_insert_xattr_item (+2/-1)
    btrfs: drop checks for mandatory extent_io_ops callbacks (+3/-4)
    btrfs: constify device path passed to relevant helpers (+22/-18)
    btrfs: document existence of extent_io ops callbacks (+26/-11)
    btrfs: handle allocation error in update_dev_stat_item (+2/-1)
    btrfs: export compression buffer limits in a header (+15/-10)
    btrfs: constify name of subvolume in creation helpers (+3/-3)
    btrfs: constify buffers used by compression helpers (+3/-3)
    btrfs: remove BUG_ON from __tree_mod_log_insert (+0/-2)
    btrfs: constify input buffer of btrfs_csum_data (+3/-3)
    btrfs: let writepage_end_io_hook return void (+6/-11)

Filipe Manana (8) commits (+163/-27):
    Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled (+5/-0)
    Btrfs: try harder to migrate items to left sibling before splitting a leaf (+7/-0)
    Btrfs: fix assertion failure when freeing block groups at close_ctree() (+9/-6)
    Btrfs: incremental send, fix unnecessary hole writes for sparse files (+86/-2)
    Btrfs: fix use-after-free due to wrong order of destroying work queues (+7/-2)
    Btrfs: incremental send, do not delay rename when parent inode is new (+16/-3)
    Btrfs: fix data loss after truncate when using the no-holes feature (+6/-13)
    Btrfs: bulk delete checksum items in the same leaf (+27/-1)

Robbie Ko (3) commits
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Has a series of fixes and cleanups that Dave Sterba has been collecting:

There is a pretty big variety here, cleaning up internal APIs and fixing corner cases.

David Sterba (46) commits (+235/-313):
    btrfs: remove unused parameter from btrfs_subvolume_release_metadata (+6/-11)
    btrfs: remove pointless rcu protection from btrfs_qgroup_inherit (+0/-2)
    btrfs: check quota status earlier and don't do unnecessary frees (+3/-2)
    btrfs: remove unused parameter from btrfs_prepare_extent_commit (+3/-5)
    btrfs: remove unnecessary mutex lock in qgroup_account_snapshot (+1/-5)
    btrfs: embed extent_changeset::range_changed to the structure (+11/-17)
    btrfs: remove unused parameter from cleanup_write_cache_enospc (+2/-3)
    btrfs: remove unused parameters from __btrfs_write_out_cache (+3/-8)
    btrfs: remove unused parameter from clone_copy_inline_extent (+2/-3)
    btrfs: remove unused parameter from extent_write_cache_pages (+2/-4)
    btrfs: remove unused parameter from tree_move_next_or_upnext (+2/-4)
    btrfs: remove unused parameter from btrfs_check_super_valid (+3/-5)
    btrfs: remove unused logic of limiting async delalloc pages (+0/-7)
    btrfs: fix over-80 lines introduced by previous cleanups (+74/-63)
    btrfs: remove unused parameter from read_block_for_search (+5/-5)
    btrfs: remove unused parameter from adjust_slots_upwards (+2/-3)
    btrfs: remove unused parameter from init_first_rw_device (+3/-5)
    btrfs: make space cache inode readahead failure nonfatal (+3/-7)
    btrfs: remove unused parameters from scrub_setup_wr_ctx (+3/-7)
    btrfs: remove unused parameter from __btrfs_alloc_chunk (+4/-6)
    btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE (+23/-31)
    btrfs: remove unused parameter from submit_extent_page (+3/-9)
    btrfs: remove unused parameter from clean_tree_block (+17/-19)
    btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation (+2/-2)
    btrfs: remove unused parameter from __add_inline_refs (+2/-3)
    btrfs: remove unused parameter from add_pending_csums (+2/-4)
    btrfs: remove unused parameter from update_nr_written (+4/-4)
    btrfs: remove unused parameter from __push_leaf_right (+2/-3)
    btrfs: remove unused parameter from check_async_write (+2/-2)
    btrfs: remove unused parameter from btrfs_fill_super (+2/-3)
    btrfs: remove unused parameter from __push_leaf_left (+2/-3)
    btrfs: remove unused parameter from write_dev_supers (+3/-3)
    btrfs: remove unused parameter from __add_inode_ref (+1/-2)
    btrfs: remove unused parameters from btrfs_cmp_data (+2/-3)
    btrfs: remove unused parameter from create_snapshot (+2/-2)
    btrfs: ulist: make the finalization function public (+2/-1)
    btrfs: remove unused parameter from tree_move_down (+2/-2)
    btrfs: ulist: rename ulist_fini to ulist_release (+10/-10)
    btrfs: qgroups: make __del_qgroup_relation static (+1/-1)
    btrfs: use GFP_KERNEL in btrfs_read_qgroup_config (+1/-1)
    btrfs: remove unused parameter from split_item (+2/-3)
    btrfs: merge two superblock writing helpers (+4/-11)
    btrfs: qgroups: opencode qgroup_free helper (+9/-9)
    btrfs: use GFP_KERNEL in btrfs_quota_enable (+1/-1)
    btrfs: use GFP_KERNEL in create_snapshot (+2/-2)
    btrfs: remove unused ulist members (+0/-7)

Nikolay Borisov (36) commits (+476/-480):
    btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode (+8/-8)
    btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode (+5/-5)
    btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode (+6/-6)
    btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode (+5/-6)
    btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode (+5/-5)
    btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode (+6/-6)
    btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode (+5/-5)
    btrfs: Make btrfs_check_ref_name_override take btrfs_inode (+4/-5)
    btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode (+6/-6)
    btrfs: Make btrfs_must_commit_transaction take btrfs_inode (+9/-9)
    btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode (+7/-7)
    btrfs: Make btrfs_log_changed_extents take btrfs_inode (+11/-11)
    btrfs: Make btrfs_record_unlink_dir take btrfs_inode (+14/-14)
    btrfs: Make btrfs_remove_delayed_node take btrfs_inode (+5/-5)
    btrfs: Make btrfs_get_logged_extents take btrfs_inode (+4/-4)
    btrfs: Make btrfs_log_trailing_hole take btrfs_inode (+4/-4)
    btrfs: Make btrfs_get_delayed_node take btrfs_inode (+8/-9)
    btrfs: Make btrfs_ino take a struct btrfs_inode (+151/-151)
    btrfs: Make log_directory_changes take btrfs_inode (+5/-6)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Has a series of fixes and cleanups that Dave Sterba has been collecting:

There is a pretty big variety here, cleaning up internal APIs and fixing corner cases.

David Sterba (46) commits (+235/-313):
    btrfs: remove unused parameter from btrfs_subvolume_release_metadata (+6/-11)
    btrfs: remove pointless rcu protection from btrfs_qgroup_inherit (+0/-2)
    btrfs: check quota status earlier and don't do unnecessary frees (+3/-2)
    btrfs: remove unused parameter from btrfs_prepare_extent_commit (+3/-5)
    btrfs: remove unnecessary mutex lock in qgroup_account_snapshot (+1/-5)
    btrfs: embed extent_changeset::range_changed to the structure (+11/-17)
    btrfs: remove unused parameter from cleanup_write_cache_enospc (+2/-3)
    btrfs: remove unused parameters from __btrfs_write_out_cache (+3/-8)
    btrfs: remove unused parameter from clone_copy_inline_extent (+2/-3)
    btrfs: remove unused parameter from extent_write_cache_pages (+2/-4)
    btrfs: remove unused parameter from tree_move_next_or_upnext (+2/-4)
    btrfs: remove unused parameter from btrfs_check_super_valid (+3/-5)
    btrfs: remove unused logic of limiting async delalloc pages (+0/-7)
    btrfs: fix over-80 lines introduced by previous cleanups (+74/-63)
    btrfs: remove unused parameter from read_block_for_search (+5/-5)
    btrfs: remove unused parameter from adjust_slots_upwards (+2/-3)
    btrfs: remove unused parameter from init_first_rw_device (+3/-5)
    btrfs: make space cache inode readahead failure nonfatal (+3/-7)
    btrfs: remove unused parameters from scrub_setup_wr_ctx (+3/-7)
    btrfs: remove unused parameter from __btrfs_alloc_chunk (+4/-6)
    btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE (+23/-31)
    btrfs: remove unused parameter from submit_extent_page (+3/-9)
    btrfs: remove unused parameter from clean_tree_block (+17/-19)
    btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation (+2/-2)
    btrfs: remove unused parameter from __add_inline_refs (+2/-3)
    btrfs: remove unused parameter from add_pending_csums (+2/-4)
    btrfs: remove unused parameter from update_nr_written (+4/-4)
    btrfs: remove unused parameter from __push_leaf_right (+2/-3)
    btrfs: remove unused parameter from check_async_write (+2/-2)
    btrfs: remove unused parameter from btrfs_fill_super (+2/-3)
    btrfs: remove unused parameter from __push_leaf_left (+2/-3)
    btrfs: remove unused parameter from write_dev_supers (+3/-3)
    btrfs: remove unused parameter from __add_inode_ref (+1/-2)
    btrfs: remove unused parameters from btrfs_cmp_data (+2/-3)
    btrfs: remove unused parameter from create_snapshot (+2/-2)
    btrfs: ulist: make the finalization function public (+2/-1)
    btrfs: remove unused parameter from tree_move_down (+2/-2)
    btrfs: ulist: rename ulist_fini to ulist_release (+10/-10)
    btrfs: qgroups: make __del_qgroup_relation static (+1/-1)
    btrfs: use GFP_KERNEL in btrfs_read_qgroup_config (+1/-1)
    btrfs: remove unused parameter from split_item (+2/-3)
    btrfs: merge two superblock writing helpers (+4/-11)
    btrfs: qgroups: opencode qgroup_free helper (+9/-9)
    btrfs: use GFP_KERNEL in btrfs_quota_enable (+1/-1)
    btrfs: use GFP_KERNEL in create_snapshot (+2/-2)
    btrfs: remove unused ulist members (+0/-7)

Nikolay Borisov (36) commits (+476/-480):
    btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode (+8/-8)
    btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode (+5/-5)
    btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode (+6/-6)
    btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode (+5/-6)
    btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode (+5/-5)
    btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode (+6/-6)
    btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode (+5/-5)
    btrfs: Make btrfs_check_ref_name_override take btrfs_inode (+4/-5)
    btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode (+6/-6)
    btrfs: Make btrfs_must_commit_transaction take btrfs_inode (+9/-9)
    btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode (+7/-7)
    btrfs: Make btrfs_log_changed_extents take btrfs_inode (+11/-11)
    btrfs: Make btrfs_record_unlink_dir take btrfs_inode (+14/-14)
    btrfs: Make btrfs_remove_delayed_node take btrfs_inode (+5/-5)
    btrfs: Make btrfs_get_logged_extents take btrfs_inode (+4/-4)
    btrfs: Make btrfs_log_trailing_hole take btrfs_inode (+4/-4)
    btrfs: Make btrfs_get_delayed_node take btrfs_inode (+8/-9)
    btrfs: Make btrfs_ino take a struct btrfs_inode (+151/-151)
    btrfs: Make log_directory_changes take btrfs_inode (+5/-6)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.10

Has two last minute fixes. The highest priority here is a regression fix for the decompression code, but we also fixed up a problem with the 32 bit compat ioctls.

The decompression bug could hand back the wrong data on big reads when zlib was used. I have a larger cleanup to make the math here less error prone, but at this stage in the release Omar's patch is the best choice.

Omar Sandoval (1) commits (+24/-15):
    Btrfs: fix btrfs_decompress_buf2page()

Jeff Mahoney (1) commits (+4/-2):
    btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

Total: (2) commits (+28/-17)

 fs/btrfs/compression.c | 39 ---
 fs/btrfs/ioctl.c | 6 --
 2 files changed, 28 insertions(+), 17 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.10

Has some fixes that we've collected from the list. We still have one more pending to nail down a regression in lzo compression, but I wanted to get this batch out the door.

Omar Sandoval (3) commits (+2/-6):
    Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations (+0/-2)
    Btrfs: remove old tree_root case in btrfs_read_locked_inode() (+1/-4)
    Btrfs: disable xattr operations on subvolume directories (+1/-0)

Liu Bo (1) commits (+12/-1):
    Btrfs: fix truncate down when no_holes feature is enabled

Chandan Rajendra (1) commits (+2/-2):
    Btrfs: Fix deadlock between direct IO and fast fsync

Wang Xiaoguang (1) commits (+1/-0):
    btrfs: fix false enospc error when truncating heavily reflinked file

Total: (6) commits (+17/-9)

 fs/btrfs/inode.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)
[GIT PULL] Btrfs fixes
Hi Linus,

Dave Sterba queued up a few fixes for btrfs. I have them in my for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.10

These are all over the place. The tracepoint part of the pull fixes a crash and adds a little more information to two tracepoints, while the rest are good old fashioned fixes.

Liu Bo (5) commits (+34/-11):
    Btrfs: adjust outstanding_extents counter properly when dio write is split (+9/-2)
    Btrfs: add truncated_len for ordered extent tracepoints (+4/-0)
    Btrfs: use down_read_nested to make lockdep silent (+2/-1)
    Btrfs: add 'inode' for extent map tracepoint (+9/-5)
    Btrfs: fix lockdep warning about log_mutex (+10/-3)

David Sterba (2) commits (+80/-69):
    btrfs: fix crash when tracepoint arguments are freed by wq callbacks (+24/-13)
    btrfs: make tracepoint format strings more compact (+56/-56)

Jeff Mahoney (2) commits (+4/-1):
    btrfs: fix locking when we put back a delayed ref that's too new (+1/-1)
    btrfs: fix error handling when run_delayed_extent_op fails (+3/-0)

Pan Bian (1) commits (+1/-3):
    btrfs: return the actual error value from from btrfs_uuid_tree_iterate

Total: (10) commits (+119/-84)

 fs/btrfs/async-thread.c | 15 +++--
 fs/btrfs/extent-tree.c | 8 ++-
 fs/btrfs/inode.c | 13 +++-
 fs/btrfs/tree-log.c | 13 +++-
 fs/btrfs/uuid-tree.c | 4 +-
 include/trace/events/btrfs.h | 146 +++
 6 files changed, 117 insertions(+), 82 deletions(-)
Re: [Regression 4.7-rc1] btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl
On 01/06/2017 12:22 PM, Joseph Salisbury wrote:
> Hi Luke,
>
> A kernel bug report was opened against Ubuntu [0]. This bug was fixed by the following commit in v4.7-rc1:
>
> commit 4c63c2454eff996c5e27991221106eb511f7db38
> Author: Luke Dashjr
> Date: Thu Oct 29 08:22:21 2015 +
>
>     btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl
>
> However, this commit introduced a new regression. With this commit applied, "btrfs fi show" no longer works and the btrfs snapshot functionality breaks.
>
> I was hoping to get your feedback, since you are the patch author. Do you think gathering any additional data will help diagnose this issue, or would it be best to submit a revert request?

This is working for me, could you please include an strace of the problem?

Thanks!

-chris
Re: OOM: Better, but still there on
On Wed, Dec 21, 2016 at 12:16:53PM +0100, Michal Hocko wrote:
> On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:
>
> One thing to note here, when we are talking about 32b kernel, things have changed in 4.8 when we moved from the zone based to node based reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") and associated patches). It is possible that the reporter is hitting some pathological path which needs fixing but it might be also related to something else. So I am rather not trying to blame 32b yet...

It might be interesting to put tracing on releasepage and see if btrfs is pinning pages around. I can't see how 32bit kernels would be different, but maybe we're hitting a weird corner.

-chris
Re: OOM: Better, but still there on 4.9
On 12/16/2016 05:14 PM, Michal Hocko wrote:
> On Fri 16-12-16 13:15:18, Chris Mason wrote:
>> On 12/16/2016 02:39 AM, Michal Hocko wrote:
>> [...]
>>> I believe the right way to go around this is to pursue what I've started in [1]. I will try to prepare something for testing today for you. Stay tuned. But I would be really happy if somebody from the btrfs camp could check the NOFS aspect of this allocation. We have already seen allocation stalls from this path quite recently
>>
>> Just double checking, are you asking why we're using GFP_NOFS to avoid going into btrfs from the btrfs writepages call, or are you asking why we aren't allowing highmem?
>
> I am more interested in the NOFS part. Why cannot this be a full GFP_KERNEL context? What kind of locks we would lock up when recursing to the fs via slab shrinkers?

Since this is our writepages call, any jump into direct reclaim would go to writepage, which would end up calling the same set of code to read metadata blocks, which would do a GFP_KERNEL allocation and end up back in writepage again.

We'd also have issues with blowing through transaction reservations since the writepage recursion would have to nest into the running transaction.

-chris