Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper

2020-11-10 Thread Chris Mason

On 10 Nov 2020, at 13:39, Christoph Hellwig wrote:


On Mon, Nov 09, 2020 at 02:01:41PM -0500, Chris Mason wrote:
You do consistently ask for a shim layer, but you haven’t explained what
we gain by diverging from the documented and tested API of the upstream zstd
project.  It’s an important discussion given that we hope to regularly
update the kernel side as they make improvements in zstd.


An API that looks like every other kernel API, and doesn't cause endless
amount of churn because someone decided they need a new API flavor of
the day.  Btw, I'm not asking for a shim layer - that was the compromise
we ended up with.

If zstd folks can't maintain a sane code base maybe we should just drop
this childish churning code base from the tree.


I think APIs change based on the needs of the project.  We do this all 
the time in the kernel, and we don’t think twice about updating users 
of the API as needed.  The zstd changes look awkward and large today 
because it’s a long time period, but we’ve all been pretty vocal in 
the past about the importance of being able to advance APIs.


-chris


Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper

2020-11-09 Thread Chris Mason




On 6 Nov 2020, at 13:38, Christoph Hellwig wrote:


You just keep resending this crap, don't you?  Haven't you been told
multiple times to provide a proper kernel API by now?


You do consistently ask for a shim layer, but you haven’t explained 
what we gain by diverging from the documented and tested API of the 
upstream zstd project.  It’s an important discussion given that we 
hope to regularly update the kernel side as they make improvements in 
zstd.


The only benefit described so far seems to be camelcase related, but if 
there are problems in the API beyond that, I haven’t seen you describe 
them.  I don’t think the camelcase alone justifies the added costs of 
the shim.


-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason

On 26 Oct 2020, at 12:20, Vincent Guittot wrote:


Le lundi 26 oct. 2020 à 12:04:45 (-0400), Rik van Riel a écrit :

On Mon, 26 Oct 2020 16:42:14 +0100
Vincent Guittot  wrote:

On Mon, 26 Oct 2020 at 16:04, Rik van Riel  wrote:



Could utilization estimates be off, either lagging or
simply having a wrong estimate for a task, resulting
in no task getting pulled sometimes, while doing a
migrate_task imbalance always moves over something?


task and cpu utilization are not always up to fully synced and may lag
a bit which explains that sometimes LB can fail to migrate for a small
diff


OK, running with this little snippet below, I see latencies
improve back to near where they used to be:

Latency percentiles (usec) runtime 150 (s)
50.0th: 13
75.0th: 31
90.0th: 69
95.0th: 90
*99.0th: 761
99.5th: 2268
99.9th: 9104
min=1, max=16158

I suspect the right/cleaner approach might be to use
migrate_task more in !CPU_NOT_IDLE cases?

Running a task to an idle CPU immediately, instead of refusing
to have the load balancer move it, improves latencies for fairly
obvious reasons.

I am not entirely clear on why the load balancer should need to
be any more conservative about moving tasks than the wakeup
path is in eg. select_idle_sibling.



what you are suggesting is something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4978964e75e5..3b6fbf33abc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9156,7 +9156,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * emptying busiest.
 	 */
 	if (local->group_type == group_has_spare) {
-		if (busiest->group_type > group_fully_busy) {
+		if ((busiest->group_type > group_fully_busy) &&
+		    !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
 			/*
 			 * If busiest is overloaded, try to fill spare
 			 * capacity. This might end up creating spare capacity


which also fixes the problem for me and aligns LB with the wakeup path
regarding migration in the LLC
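The three conditions discussed in this thread can be compared side by side with a small user-space sketch. This is a simplified model under stated assumptions, not kernel code: struct sketch_group, the fill_spare_* helpers, and the SKETCH_SD_SHARE_PKG_RESOURCES bit value are made up for illustration; only the enum ordering and the shape of the conditions follow the quoted diffs.

```c
#include <assert.h>
#include <stdbool.h>

/* Same ordering as the kernel's enum group_type; only the order matters here. */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded,
};

/* Placeholder bit for this sketch; the real flag value is kernel-defined. */
#define SKETCH_SD_SHARE_PKG_RESOURCES 0x1u

/* Hypothetical stand-in for the fields of sg_lb_stats used by the check. */
struct sketch_group {
	enum group_type type;
	unsigned int weight;	/* number of CPUs in the group */
};

/* Condition before either proposed fix. */
static bool fill_spare_original(const struct sketch_group *busiest)
{
	assert(busiest);
	return busiest->type > group_fully_busy;
}

/* First proposed fix: skip the branch when the busiest group is one CPU. */
static bool fill_spare_weight_fix(const struct sketch_group *busiest)
{
	assert(busiest);
	return busiest->type > group_fully_busy && busiest->weight > 1;
}

/* Second proposed fix: skip the branch inside a domain that shares the LLC. */
static bool fill_spare_llc_fix(const struct sketch_group *busiest,
			       unsigned int sd_flags)
{
	assert(busiest);
	return busiest->type > group_fully_busy &&
	       !(sd_flags & SKETCH_SD_SHARE_PKG_RESOURCES);
}
```

In this model the weight check only changes the outcome for single-CPU groups, while the flag check changes it for every group inside an LLC-sharing domain, whatever its size.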


Vincent’s patch on top of 5.10-rc1 looks pretty great:

Latency percentiles (usec) runtime 90 (s) (3320 total samples)
50.0th: 161 (1687 samples)
75.0th: 200 (817 samples)
90.0th: 228 (488 samples)
95.0th: 254 (164 samples)
*99.0th: 314 (131 samples)
99.5th: 330 (17 samples)
99.9th: 356 (13 samples)
min=29, max=358

Next we test in prod, which probably won’t have answers until 
tomorrow.  Thanks again Vincent!


-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason

On 26 Oct 2020, at 11:05, Chris Mason wrote:


On 26 Oct 2020, at 10:24, Vincent Guittot wrote:



Could you try the fix below ?

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * emptying busiest.
 	 */
 	if (local->group_type == group_has_spare) {
-		if (busiest->group_type > group_fully_busy) {
+		if ((busiest->group_type > group_fully_busy) &&
+		    (busiest->group_weight > 1)) {
 			/*
 			 * If busiest is overloaded, try to fill spare
 			 * capacity. This might end up creating spare capacity



When we calculate an imbalance at the smallest level, ie between CPUs
(group_weight == 1), we should try to spread tasks on cpus instead of
trying to fill spare capacity.


With this patch on top of v5.9, my latencies are unchanged.  I’m 
building against current Linus now just in case I’m missing other 
fixes.




I reran things to make sure nothing changed on my test box this 
weekend:


5.4.0-rc1-9-gfcf0553db6f4 (last good kernel)
Latency percentiles (usec) runtime 30 (s) (1000 total samples)
50.0th: 180 (502 samples)
75.0th: 227 (251 samples)
90.0th: 268 (147 samples)
95.0th: 300 (50 samples)
*99.0th: 338 (41 samples)
99.5th: 344 (4 samples)
99.9th: 1186 (5 samples)
min=25, max=1185

5.4.0-rc1-00010-g0b0695f2b34a (first bad kernel)
Latency percentiles (usec) runtime 150 (s) (960 total samples)
50.0th: 166 (488 samples)
75.0th: 210 (232 samples)
90.0th: 254 (145 samples)
95.0th: 299 (47 samples)
*99.0th: 12688 (39 samples)
99.5th: 13008 (5 samples)
99.9th: 13104 (4 samples)
min=24, max=13100

3650b228f83adda7e5ee532e2b90429c03f7b9ec (v5.10-rc1) + your patch

Latency percentiles (usec) runtime 30 (s) (1000 total samples)
50.0th: 169 (505 samples)
75.0th: 210 (246 samples)
90.0th: 267 (151 samples)
95.0th: 305 (48 samples)
*99.0th: 12656 (40 samples)
99.5th: 12944 (5 samples)
99.9th: 13168 (5 samples)
min=44, max=13155

-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason




On 26 Oct 2020, at 10:24, Vincent Guittot wrote:


Le lundi 26 oct. 2020 à 08:45:27 (-0400), Chris Mason a écrit :

On 26 Oct 2020, at 4:39, Vincent Guittot wrote:


Hi Chris

On Sat, 24 Oct 2020 at 01:49, Chris Mason  wrote:


Hi everyone,

We’re validating a new kernel in the fleet, and compared with v5.2,


Which version are you using ?
several improvements have been added since v5.5 and the rework of
load_balance


We’re validating v5.6, but all of the numbers referenced in this patch are
against v5.9.  I usually try to back port my way to victory on this kind of
thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt
this benchmark.


ok. Thanks for the confirmation

I have been able to reproduce the problem on my setup.


Thanks for taking a look!  Can I ask what parameters you used on 
schbench, and what kind of results you saw?  Mostly I’m trying to make 
sure it’s a useful tool, but also the patch didn’t change things 
here.




Could you try the fix below ?

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * emptying busiest.
 	 */
 	if (local->group_type == group_has_spare) {
-		if (busiest->group_type > group_fully_busy) {
+		if ((busiest->group_type > group_fully_busy) &&
+		    (busiest->group_weight > 1)) {
 			/*
 			 * If busiest is overloaded, try to fill spare
 			 * capacity. This might end up creating spare capacity



When we calculate an imbalance at the smallest level, ie between CPUs
(group_weight == 1), we should try to spread tasks on cpus instead of
trying to fill spare capacity.


With this patch on top of v5.9, my latencies are unchanged.  I’m 
building against current Linus now just in case I’m missing other 
fixes.


-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason

On 26 Oct 2020, at 4:39, Vincent Guittot wrote:


Hi Chris

On Sat, 24 Oct 2020 at 01:49, Chris Mason  wrote:


Hi everyone,

We’re validating a new kernel in the fleet, and compared with v5.2,


Which version are you using ?
several improvements have been added since v5.5 and the rework of 
load_balance


We’re validating v5.6, but all of the numbers referenced in this patch 
are against v5.9.  I usually try to back port my way to victory on this 
kind of thing, but mainline seems to behave exactly the same as 
0b0695f2b34a wrt this benchmark.





performance is ~2-3% lower for some of our workloads.  After some
digging, Johannes found that our involuntary context switch rate was ~2x
higher, and we were leaving a CPU idle a higher percentage of the time,
even though the workload was trying to saturate the system.

We were able to reproduce the problem with schbench, and Johannes
bisected down to:

commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
Author: Vincent Guittot 
Date:   Fri Oct 18 15:26:31 2019 +0200

 sched/fair: Rework load_balance()

Our working theory is the load balancing changes are leaving processes
behind busy CPUs instead of moving them onto idle ones.  I made a few
schbench modifications to make this easier to demonstrate:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/

My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
command line is:


What is the topology ? are they all part of the same LLC ?


We’ve seen the regression on both single socket and dual socket bare 
metal intel systems.  On the VM I reproduced with, I saw similar 
latencies with and without siblings configured into the topology.


-chris


[PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-23 Thread Chris Mason

Hi everyone,

We’re validating a new kernel in the fleet, and compared with v5.2, 
performance is ~2-3% lower for some of our workloads.  After some 
digging, Johannes found that our involuntary context switch rate was ~2x 
higher, and we were leaving a CPU idle a higher percentage of the time, 
even though the workload was trying to saturate the system.


We were able to reproduce the problem with schbench, and Johannes 
bisected down to:


commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
Author: Vincent Guittot 
Date:   Fri Oct 18 15:26:31 2019 +0200

sched/fair: Rework load_balance()

Our working theory is the load balancing changes are leaving processes 
behind busy CPUs instead of moving them onto idle ones.  I made a few 
schbench modifications to make this easier to demonstrate:


https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/

My VM has 40 cpus (20 cores, 2 threads per core), and my schbench 
command line is:


schbench -t 20 -r 0 -c 100 -s 1000 -i 30 -z 120

This has two message threads, and 20 workers per message thread.  Once 
woken up, the workers think for a full second, which means you’ll have 
some long latencies if you’re stuck behind one of these workers in the 
runqueue.  The message thread does a little bit of work and then sleeps, 
so we end up with 40 threads hammering full blast on the CPU and 2 
threads popping in and out of idle.


schbench times the delay from when a message thread wakes a worker to 
when the worker runs.  On a good kernel, the output looks like this:


Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
50.0th: 155 (1653 samples)
75.0th: 189 (808 samples)
90.0th: 216 (501 samples)
95.0th: 227 (163 samples)
*99.0th: 256 (123 samples)
99.5th: 1510 (16 samples)
99.9th: 3132 (13 samples)
min=21, max=3286

With 0b0695f2b34a, we get this:

Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
50.0th: 147 (2261 samples)
75.0th: 182 (1116 samples)
90.0th: 205 (671 samples)
95.0th: 224 (215 samples)
*99.0th: 12240 (173 samples) <-- much higher p99 and up
99.5th: 12752 (22 samples)
99.9th: 13104 (18 samples)
min=21, max=13172
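
To make the percentile lines above easier to read, here is a minimal sketch of a nearest-rank percentile over wakeup latency samples. It is an assumption about how such numbers can be derived, not schbench's actual code; the names cmp_usec and percentile_usec and the per-mille argument are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() callback: ascending order of latencies in microseconds. */
static int cmp_usec(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile.  permille is the percentile times ten, so
 * 990 means the 99.0th percentile.  Sorts the samples in place and
 * returns the value at rank ceil(permille * n / 1000).
 */
static unsigned int percentile_usec(unsigned int *samples, size_t n,
				    unsigned int permille)
{
	size_t rank;

	assert(samples && n > 0 && permille >= 1 && permille <= 1000);
	qsort(samples, n, sizeof(*samples), cmp_usec);
	rank = ((size_t)permille * n + 999) / 1000;	/* integer ceiling */
	return samples[rank - 1];
}
```

With 100 samples where only two wakeups got stuck behind a busy worker, the 50th and 95th percentiles stay at the fast value and only the 99th jumps, which is the same shape as the bad-kernel output above.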

Since the idea is to fully load the machine with schbench, use schbench 
-t , and make sure the box doesn’t have other stuff 
running in the background.  I used a VM because it ended up giving more 
consistent results on our kernel test machines, which have some periodic 
noise running in the background.


We’ve tried a few different approaches, but don’t quite have a solid 
fix yet.  I thought I’d kick off the discussion with my most useful 
hunks so far:


diff a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

-chris


Re: [PATCH v4 0/9] Update to zstd-1.4.6

2020-10-02 Thread Chris Mason

On 2 Oct 2020, at 2:54, Christoph Hellwig wrote:


On Wed, Sep 30, 2020 at 08:05:45PM +, Nick Terrell wrote:



On Sep 29, 2020, at 11:53 PM, Christoph Hellwig wrote:


As you keep resending this I keep retelling you that you should not do it.
Please provide a proper Linux API, and switch to that.  Versioned APIs
have absolutely no business in the Linux kernel.


The API is not versioned. We provide a stable ABI for a large section of our API,
and the parts that aren’t ABI stable don’t change in semantics, and undergo long
deprecation periods before being removed.

The change of callers is a one-time change to transition from the existing API
in the kernel, which was never upstream's API, to upstream's API.


Again, please transition it to a sane kernel API.  We don't have an
"upstream" in this case.


The upstream is the zstd project where all this code originates, and 
where the active development takes place.  As Eric Biggers pointed out, 
it also receives a lot of Q/A separate from the kernel.  I think we gain 
a great deal by leveraging the testing and documentation of the zstd 
project in the kernel interfaces we use.


We lose some consistency with the kernel coding style, but we gain the 
ability to search for docs, issues, and fixes directly against the zstd 
project and git repo.


-chris


Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-17 Thread Chris Mason

On 17 Sep 2020, at 6:04, Christoph Hellwig wrote:


On Wed, Sep 16, 2020 at 09:35:51PM -0400, Rik van Riel wrote:
One possibility is to have a kernel wrapper on top of the zstd API to make it
more ergonomic. I personally don’t really see the value in it, since it adds
another layer of indirection between zstd and the caller, but it
could be done.


Zstd would not be the first part of the kernel to
come from somewhere else, and have wrappers when
it gets integrated into the kernel. There certainly
is precedent there.

It would be interesting to know what Christoph's
preference is.


Yes, I think kernel wrappers would be a pretty sensible step forward.
That also avoids the need to do strange upgrades to a new version,
and instead we can just change APIs on an as-needed basis.


When we add wrappers, we end up creating a kernel specific API that 
doesn’t match the upstream zstd docs, and it doesn’t leverage as 
much of the zstd fuzzing and testing.


So we’re actually making kernel zstd slightly less usable in hopes 
that our kernel specific part of the API is familiar enough to us that 
it makes zstd more usable.  There’s no way to compare the two until 
the wrappers are done, but given the code today I’d prefer that we 
focus on making it really easy to track upstream.  I really understand 
Christoph’s side here, but I’d rather ride a camel with the group 
than go it alone.
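
As a rough illustration of what a wrapper layer means in practice, here is a minimal sketch of a kernel-style shim over a camelCase library function. Everything in it is an assumption made for the example: UPSTREAM_compressBound is a made-up stand-in (including its formula, which is not zstd's), and upstream_compress_bound is a hypothetical wrapper name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for an upstream library function.  The name and the formula
 * are invented for this sketch and do not come from zstd.
 */
static size_t UPSTREAM_compressBound(size_t src_size)
{
	assert(src_size <= (size_t)-1 / 2);	/* keep the sum from wrapping */
	return src_size + (src_size >> 8) + 64;
}

/*
 * Kernel-style shim: identical behaviour under a snake_case name.
 * Callers now use a name that the upstream docs, issues and tests
 * never mention.
 */
static inline size_t upstream_compress_bound(size_t src_size)
{
	return UPSTREAM_compressBound(src_size);
}
```

The shim adds no behaviour of its own, so its cost is mostly the second name: each wrapped call is one more thing to keep in sync and to translate when searching upstream material.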


I’d also much rather spend time on any problems where the structure of 
the zstd APIs doesn’t fit the kernel’s needs.  The btrfs streaming 
compression/decompression looks pretty clean to me, but I think Johannes 
mentioned some possibilities to improve things for zswap (optimizations 
for page-at-a-time).  If there are places where the zstd memory 
management or error handling don’t fit naturally into the kernel, that 
would also be higher on my list.


Fixing those is probably going to be much easier if we’re close to 
the zstd upstream, again so that we can leverage testing and long term 
code maintenance done there.


-chris


Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-16 Thread Chris Mason

On 16 Sep 2020, at 4:49, Christoph Hellwig wrote:


On Tue, Sep 15, 2020 at 08:42:59PM -0700, Nick Terrell wrote:

From: Nick Terrell 

Move away from the compatibility wrapper to the zstd-1.4.6 API. This
code is functionally equivalent.


Again, please use sensible names.  And no one gives a fuck if this bad
API is "zstd-1.4.6" as the Linux kernel uses its own APIs, not some
random mess from a badly written userspace package.


Hi Christoph,

It’s not completely clear what you’re asking for here.  If the API 
matches what’s in zstd-1.4.6, that seems like a reasonable way to 
label it.  That’s what the upstream is for this code.


I’m also not sure why we’re taking extra time to shit on the zstd 
userspace package.  Can we please be constructive or at least 
actionable?


-chris


Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-16 Thread Chris Mason

On 16 Sep 2020, at 10:46, Christoph Hellwig wrote:


On Wed, Sep 16, 2020 at 10:43:04AM -0400, Chris Mason wrote:
Otherwise we just end up with drift and kernel-specific bugs that are harder
to debug.  To the extent those APIs make us contort the kernel code, I’m
sure Nick is interested in improving things in both places.


Seriously, we do not care elsewhere.  Why would zlib be any different?


Is the zlib upstream active?  Or trying to sync active development with 
the kernel?  I’d suggest the same path for them if they were.




There are probably 1000 constructive ways to have that conversation.  Please
choose one of those instead of being an asshole.


I think you are the asshole here by ignoring the practices we are using
elsewhere and think your employers pet project is somehow special.  It
is not, and claiming so is everything but constructive.


I’m happy to advocate for more constructive discussion for anyone’s 
project.  I tend to pick threads where I have context and I know the 
people involved.


The kernel best practices are pragmatic.  As one of many users of any 
established-non-kernel project, there’s a compromise between the APIs 
they are using for a broad base of users and us.  I’m sure they are 
interested in improving life for all of their users, while also 
improving maintainability for us.


-chris



Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-16 Thread Chris Mason

On 16 Sep 2020, at 10:30, Christoph Hellwig wrote:


On Wed, Sep 16, 2020 at 10:20:52AM -0400, Chris Mason wrote:
It’s not completely clear what you’re asking for here.  If the API
matches what’s in zstd-1.4.6, that seems like a reasonable way to label
it.  That’s what the upstream is for this code.

I’m also not sure why we’re taking extra time to shit on the zstd
userspace package.  Can we please be constructive or at least actionable?


Because it really doesn't matter that these crappy APIs he is
introducing match anything, especially not something done as horribly
as the zstd API.  We'll need to do this properly, and claiming
compliance to some version of this lousy API is completely irrelevant
for the kernel.


If the underlying goal is to closely follow the upstream of another 
project, we’re much better off using those APIs as provided.


Otherwise we just end up with drift and kernel-specific bugs that are 
harder to debug.  To the extent those APIs make us contort the kernel 
code, I’m sure Nick is interested in improving things in both places.


There are probably 1000 constructive ways to have that conversation.  
Please choose one of those instead of being an asshole.


-chris


Re: [PATCH] mm : fix pte _PAGE_DIRTY bit when fallback migrate page

2020-07-17 Thread Chris Mason

On 16 Jul 2020, at 6:15, Robbie Ko wrote:


Kirill A. Shutemov 於 2020/7/15 下午4:11 寫道:

On Wed, Jul 15, 2020 at 10:45:39AM +0800, Robbie Ko wrote:

Kirill A. Shutemov 於 2020/7/14 下午6:19 寫道:

On Tue, Jul 14, 2020 at 11:46:12AM +0200, Vlastimil Babka wrote:

On 7/13/20 3:57 AM, Robbie Ko wrote:

Vlastimil Babka 於 2020/7/10 下午11:31 寫道:

On 7/9/20 4:48 AM, robbieko wrote:

From: Robbie Ko 

When a migrate page occurs, we first create a migration entry
to replace the original pte, and then go to fallback_migrate_page
to execute a writeout if the migratepage is not supported.

In the writeout, we will clear the dirty bit of the page and use
page_mkclean to clear the dirty bit along with the corresponding pte,
but page_mkclean does not support migration entry.

I don't follow the scenario.

When we establish migration entries with try_to_unmap(), it transfers
dirty bit from PTE to the page.

Sorry, I mean is _PAGE_RW with pte_write

When we establish migration entries with try_to_unmap(),
we create a migration entry, and if pte_write we set it to SWP_MIGRATION_WRITE,
which will replace the migration entry with the original pte.

When migratepage, we go to fallback_migrate_page to execute a writeout
if the migratepage is not supported.

In the writeout, we call clear_page_dirty_for_io to clear the dirty bit of the page
and use page_mkclean to clear pte _PAGE_RW with pte_wrprotect in page_mkclean_one.

However, page_mkclean_one does not support migration entries, so the
migration entry is still SWP_MIGRATION_WRITE.

In writeout, then we call remove_migration_ptes to remove the migration entry,
because it is still SWP_MIGRATION_WRITE so set _PAGE_RW to pte via pte_mkwrite.

Therefore, subsequent mmap write will not trigger page_mkwrite to cause data loss.

Hm, okay.

Folks, is there any good reason why try_to_unmap(TTU_MIGRATION) should not
clear PTE (make the PTE none) for file page?


This, I'm not sure.
But I think that for the fs that support migratepage, when migratepage is finished,
the page should still be dirty, and the pte should still have _PAGE_RW,
when the next mmap write occurs, we don't need to trigger the page_mkwrite again.


I don’t know the page migration code well, but you’ll need this one 
as well on the 4.4 kernel you mentioned:


commit 25f3c5021985e885292980d04a1423fd83c967bb
Author: Chris Mason 
Date:   Tue Jan 21 11:51:42 2020 -0500

Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker

And this one as well:

commit 7703bdd8d23e6ef057af3253958a793ec6066b28
Author: Chris Mason 
Date:   Wed Jun 20 07:56:11 2018 -0700

Btrfs: don't clean dirty pages during buffered writes

With those two in place, we haven’t found lost data from the migration 
code, but we did see the fallback migration helper dirtying pages 
without going through page_mkwrite, which triggers the suboptimal btrfs 
fixup worker code path.  This isn’t a yea or nay on the patch, just 
additional info.
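
The sequence Robbie describes in the quoted text can be modelled with a toy state machine. This is a sketch under heavy simplification, not the mm code: the sketch_* types and helpers are hypothetical, and a single boolean stands in for both _PAGE_RW and the write flavour of the migration entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy pte: either a present mapping or a migration entry. */
struct sketch_pte {
	bool is_migration_entry;
	bool write;	/* _PAGE_RW, or SWP_MIGRATION_WRITE while migrating */
};

struct sketch_page {
	bool dirty;
};

/* try_to_unmap() step: install a migration entry that remembers "write". */
static void sketch_unmap_for_migration(struct sketch_pte *pte)
{
	pte->is_migration_entry = true;
}

/* Fallback writeout step: clean the page, write-protect present ptes only. */
static void sketch_writeout(struct sketch_page *page, struct sketch_pte *pte)
{
	page->dirty = false;
	if (!pte->is_migration_entry)
		pte->write = false;	/* skipped for migration entries */
}

/* remove_migration_ptes() step: turn the entry back into a present pte. */
static void sketch_restore(struct sketch_pte *pte)
{
	assert(pte->is_migration_entry);
	pte->is_migration_entry = false;	/* write flag carried over */
}

/* A later mmap write reaches page_mkwrite only if the pte is read-only. */
static bool sketch_write_triggers_mkwrite(const struct sketch_pte *pte)
{
	return !pte->write;
}
```

Running unmap, writeout, and restore in that order leaves a clean page behind a writable pte, so in this model the next mmap write dirties the page without a page_mkwrite call.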


-chris


Re: [Ksummit-discuss] [PATCH] CodingStyle: Inclusive Terminology

2020-07-06 Thread Chris Mason




On 6 Jul 2020, at 10:06, Laurent Pinchart wrote:


Hi Chris,

On Mon, Jul 06, 2020 at 12:45:34PM +, Chris Mason via Ksummit-discuss wrote:

On 5 Jul 2020, at 0:55, Willy Tarreau wrote:



Maybe instead of providing an explicit list of a few words it should
simply say that terms that take their roots in the non-technical world
and whose meaning can only be understood based on history or local
culture ought to be avoided, because *that* actually is the real
root cause of the problem you're trying to address.


I’d definitely agree that it’s a good goal to keep out non-technical
terms.  Even though we already try, every subsystem has its own set of
patterns that reflect the most frequent contributors.


That's an interesting point, because to me, it's the exact opposite. One
of the intellectual rewards I find in working with the kernel is that
our community is international and multicultural, allowing me to learn
about other cultures. Aiming for the lowest common denominator seems to
me to be closer to erasing cultural differences than including them.


I hadn’t thought of it from this angle, but I do agree with you.  I 
think the cultural side comes through more in discussions and in-person 
conferences than it does from the code itself.


I do try to avoid local idioms or culture references unless I’m 
explaining them as part of a discussion or a personal story, mostly 
because I’ve gotten feedback from coworkers who had a hard time 
following my bad (ok, terrible) jokes or sarcasm.  One internal example 
is commands that take --clowntown as an argument.  It’s pretty 
therapeutic to type when you’re grumpy about tooling, but a lot of 
people probably have to look it up before it makes sense.


-chris


Re: [PATCH] CodingStyle: Inclusive Terminology

2020-07-06 Thread Chris Mason
On 5 Jul 2020, at 0:55, Willy Tarreau wrote:

> On Sat, Jul 04, 2020 at 01:02:51PM -0700, Dan Williams wrote:
>> +Non-inclusive terminology has that same distracting effect which is why
>> +it is a style issue for Linux, it injures developer efficiency.
>
> I'm personally thinking that for a non-native speaker it's already
> difficult to find the best term to describe something, but having to
> apply an extra level of filtering on the found words to figure whether
> they are allowed by the language police is even more difficult.

Since our discussions are public, we’ve always had to deal with 
comments from people outside the community on a range of topics.  But 
inside the kernel, it’s just a group of developers trying to help each 
other produce the best quality of code.  We’ve got a long history 
together and in general I think we’re pretty good at assuming good 
intent.

> *This*
> injures developers efficiency. What could improve developers efficiency
> is to take care of removing *all* idiomatic or cultural words then. For
> example I've been participating to projects using the term "blueprint",
> I didn't understand what that meant. It was once explained to me and
> given that it had no logical reason for being called this way, I now
> forgot. If we follow your reasoning, Such words should be banned for
> exactly the same reasons. Same for colors that probably don't mean
> anything to those born blind.
>
> For example if in my local culture we eat tomatoes at starters and
> apples for dessert, it could be convenient for me to use "tomato" and
> "apple" as list elements to name the pointers leading to the beginning
> and the end of the list, and it might sound obvious to many people, but
> not at all for many others.
>
> Maybe instead of providing an explicit list of a few words it should
> simply say that terms that take their roots in the non-technical world
> and whose meaning can only be understood based on history or local
> culture ought to be avoided, because *that* actually is the real
> root cause of the problem you're trying to address.

I’d definitely agree that it’s a good goal to keep out non-technical 
terms.  Even though we already try, every subsystem has its own set of 
patterns that reflect the most frequent contributors.

-chris

Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Chris Mason

On 17 Jun 2020, at 13:20, Filipe Manana wrote:


On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov  wrote:


---
 fs/btrfs/extent_io.c | 45
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb);
 	atomic_set(&eb->io_pages, num_pages);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);


This is a whole different case from the one described in the
changelog, as this is in the write path.
Why do we need this one?


This was Josef’s idea, but I really like the symmetry.  You set 
io_pages, you do the tree_ref dance.  Everyone fiddling with the writeback 
bit right now correctly clears writeback after doing the atomic_dec 
on io_pages, but the race is tiny and prone to getting exposed again by 
shifting code around.  Tree ref checks around io_pages are the most 
reliable way to prevent this bug from coming back again later.
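
The "set io_pages, then do the tree_ref dance" pattern can be sketched in user space with C11 atomics. This is a simplified model, not the btrfs code: struct sketch_eb, SKETCH_TREE_REF_BIT, and the sketch_* helpers are hypothetical names, and the model leaves out the locking the kernel takes around the bit and the refcount.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_TREE_REF_BIT 0x1ul	/* stand-in for EXTENT_BUFFER_TREE_REF */

/* Hypothetical cut-down extent buffer: just the fields the pattern touches. */
struct sketch_eb {
	atomic_int refs;
	atomic_int io_pages;
	atomic_ulong bflags;
};

/*
 * Re-assert the tree reference.  If the bit was cleared, set it again and
 * take the reference that goes with it.  A no-op when the bit is already
 * set, so it is safe to call at every site that sets io_pages.
 */
static void sketch_check_tree_ref(struct sketch_eb *eb)
{
	unsigned long old = atomic_fetch_or(&eb->bflags, SKETCH_TREE_REF_BIT);

	if (!(old & SKETCH_TREE_REF_BIT))
		atomic_fetch_add(&eb->refs, 1);
}

/* The symmetric pattern: whoever sets io_pages also re-checks the tree ref. */
static void sketch_start_io(struct sketch_eb *eb, int num_pages)
{
	assert(num_pages > 0);
	atomic_store(&eb->io_pages, num_pages);
	sketch_check_tree_ref(eb);
}
```

Because the check is idempotent, repeating it in the write path costs little, and the invariant no longer depends on the relative ordering of the surrounding code.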


-chris


Re: [PATCH 10/12] btrfs: flag files as supporting buffered async reads

2020-05-26 Thread Chris Mason
On 26 May 2020, at 15:51, Jens Axboe wrote:

> btrfs uses generic_file_read_iter(), which already supports this.
>
> Signed-off-by: Jens Axboe 

Really looking forward to this!

Acked-by: Chris Mason 


Re: linux-next: cleanup the btrfs trees

2019-10-21 Thread Chris Mason
On 19 Oct 2019, at 23:47, Stephen Rothwell wrote:

> Hi all,
>
> The btrfs tree
> (git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git#next)
> has not been updated in more than a year, so I have removed it and then
> renamed the btrfs-kdave tree to btrfs.  I hope this is OK and if any
> other changes are needed, please let me know.


Thanks Stephen

-chris


Re: linux-next: Signed-off-by missing for commits in the net-next tree

2019-08-16 Thread Chris Mason
On 16 Aug 2019, at 5:15, Andy Grover wrote:

> On 8/16/19 3:06 PM, Gerd Rausch wrote:
>> Hi,
>>
>> Just added the e-mail addresses I found using a simple "google search",
>> in order to reach out to the original authors of these commits:
>> Chris Mason and Andy Grover.
>>
>> I'm hoping they still remember their work from 7-8 years ago.
>
> Yes looks like what I was working on. What did you need from me? It's
> too late to amend the commitlogs...

Same question ;)  The missing signed-off-by is a mistake, but from the 
point of view of the DCO, these patches are totally fine by me.

-chris


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-31 Thread Chris Mason

I'm being pretty liberal with chopping down quoted material to help 
emphasize a particular opinion about how to bootstrap existing 
out-of-tree projects into the kernel.  My goal here is to talk more 
about the process and less about the technical details, so please 
forgive me if I've ignored or changed the technical meaning of anything 
below.

On 30 May 2019, at 12:15, Kris Van Hees wrote:

> On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
>
> ... I believe that the discussion that has been going on in other
> emails has shown that while introducing a program type that provides a
> generic (abstracted) context is a different approach from what has been done
> so far, it is a new use case that provides for additional ways in which BPF
> can be used.
>

[ ... ]

>
> Yes and no.  It depends on what you are trying to do with the BPF program that
> is attached to the different events.  From a tracing perspective, providing a
> single BPF program with an abstract context would ...

[ ... ]

>
> In this model kprobe/ksys_write and tracepoint/syscalls/sys_enter_write are
> equivalent for most tracing purposes ...

[ ... ]

>
> I agree with what you are saying but I am presenting an additional use case

[ ... ]

>>
>> All that aside the kernel support for shared libraries is an awesome
>> feature to have and a bunch of folks want to see it happen, but
>> it's not a blocker for 'dtrace to bpf' user space work.
>> libbpf can be taught to do this 'pseudo shared library' feature
>> while 'dtrace to bpf' side doesn't need to do anything special.

[ ... ]

This thread intermixes some abstract conceptual changes with smaller
technical improvements, and in general it follows a familiar pattern
other out-of-tree projects have hit while trying to adapt the kernel to
their existing code.  Just from this one email, I quoted the abstract
models with use cases etc., and this is often where the discussions
sidetrack into less productive areas.

>
> So you are basically saying that I should redesign DTrace?

In your place, I would have removed features and adapted dtrace as much 
as possible to require the absolute minimum of kernel patches, or even 
better, no patches at all.  I'd document all of the features that worked 
as expected, and underline anything either missing or suboptimal that 
needed additional kernel changes.  Then I'd focus on expanding the
community of people using dtrace against the mainline kernel, and work
through the series of features and improvements one by one upstream over
time.

Your current approach relies on an all-or-nothing landing of patches 
upstream, and this consistently leads to conflict every time a project 
tries it.  A more incremental approach will require bigger changes on 
the dtrace application side, but over time it'll be much easier to 
justify your kernel changes.  You won't have to talk in abstract models, 
and you'll have many more concrete examples of people asking for dtrace 
features against mainline.  Most importantly, you'll make dtrace 
available on more kernels than just the absolute latest mainline, and 
removing dependencies makes the project much easier for new users to 
try.

-chris


Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait

2019-05-08 Thread Chris Mason

On 7 May 2019, at 17:22, Dave Chinner wrote:

> On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
>> The code in xlog_wait uses the spinlock to make adding the task to
>> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
>> with respect to the waker.
>>
>> Doing the wakeup after releasing the spinlock opens up the following
>> race condition:
>>
>> - add task to wait queue
>>
>> -  wake up task
>>
>> - set task state to UNINTERRUPTIBLE
>>
>> Simply moving the spin_unlock to after the wake_up_all results
>> in the waker not being able to see a task on the waitqueue before
>> it has set its state to UNINTERRUPTIBLE.
>
> Yup, seems like an issue. Good find, Rik.
>
> So, what problem is this actually fixing? Was it noticed by
> inspection, or is it actually manifesting on production machines?
> If it is manifesting IRL, what are the symptoms (e.g. hang running
> out of log space?) and do you have a test case or any way to
> exercise it easily?

The steps to reproduce are semi-complicated: they create a bunch of
files, do stuff, and then delete all the files in a loop.  I think they
shotgunned it across 500 or so machines to trigger 5 times, and then 
left the wreckage for us to poke at.

The symptoms were identical to the bug fixed here:

commit 696a562072e3c14bcd13ae5acc19cdf27679e865
Author: Brian Foster 
Date:   Tue Mar 28 14:51:44 2017 -0700

    xfs: use dedicated log worker wq to avoid deadlock with cil wq

But since our 4.16 kernel is newer than that, I briefly hoped that
m_sync_workqueue needed to be flagged with WQ_MEM_RECLAIM.  I don't have
a great picture of how all of these workqueues interact, but I do think
it needs WQ_MEM_RECLAIM.  It can't be the cause of this deadlock, though:
the workqueue watchdog would have fired.

Rik mentioned that I found sleeping procs with an empty iclog waitqueue 
list, which is when he noticed this race.  We sent a wakeup to the 
sleeping process, and ftrace showed the process looping back around to 
sleep on the iclog again.  Long story short, Rik's patch definitely 
wouldn't have prevented the deadlock, and the iclog waitqueue I was 
poking must not have been the same one that process was sleeping on.

The actual problem ended up being the blkmq IO schedulers sitting on a 
request.  Switching schedulers makes the box come back to life, so it's 
either a kyber bug or slightly higher up in blkmqland.

That's a huge tangent around acking Rik's patch, but it's hard to be 
sure if we've hit the lost wakeup in prod.  I could search through all 
the related hung task timeouts, but they are probably all stuck in 
blkmq.

Acked-but-I'm-still-blaming-Jens-by: Chris Mason 

-chris
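
For readers following the race Rik describes above: the property the fix
restores is that the wakeup is issued while the waker still holds the lock
under which the waiter queued itself and went to sleep.  What follows is a
userspace Python analogy, not the XFS code.  The names LogWaiter,
wait_for_space and free_space are made up for illustration, and
threading.Condition stands in for the kernel wait queue plus spinlock.

```python
import threading


class LogWaiter:
    """Toy stand-in for a wait queue protected by a single lock."""

    def __init__(self):
        self.cond = threading.Condition(threading.Lock())
        self.space_available = False

    def wait_for_space(self, timeout=5.0):
        # The predicate check and the waiter registration both happen with
        # the lock held, so a waker cannot run between "queued" and "asleep".
        with self.cond:
            return self.cond.wait_for(lambda: self.space_available, timeout)

    def free_space(self):
        # Analogue of the fix: issue the wakeup before dropping the lock.
        with self.cond:
            self.space_available = True
            self.cond.notify_all()


if __name__ == "__main__":
    log = LogWaiter()
    woke = []
    waiter = threading.Thread(target=lambda: woke.append(log.wait_for_space()))
    waiter.start()
    log.free_space()
    waiter.join()
    print("waiter saw space:", woke[0])
```

Python refuses to notify without the lock held, so this sketch cannot
reproduce the lost wakeup itself; it only shows the ordering the patch
establishes.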


Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"

2019-01-31 Thread Chris Mason

On 30 Jan 2019, at 20:34, Dave Chinner wrote:

> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
>>
>>
>> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner 
>>>
>>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>>>
>>> This change causes serious changes to page cache and inode cache
>>> behaviour and balance, resulting in major performance regressions
>>> when combining workloads such as large file copies and kernel
>>> compiles.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
>>
>> I'm a little confused by the latest comment in the bz:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
>
> Which says the first patch that changed the shrinker behaviour is
> the underlying cause of the regression.
>
>> Are these reverts sufficient?
>
> I think so.

Based on the latest comment:

"If I had been less strict in my testing I probably would have 
discovered that the problem was present earlier than 4.19.3. Mr Gushins 
commit made it more visible.
I'm going back to work after two days off, so I might not be able to 
respond inside your working hours, but I'll keep checking in on this as 
I get a chance."

I don't think the reverts are sufficient.

>
>> Roman beat me to suggesting Rik's followup.  We hit a different problem
>> in prod with small slabs, and have a lot of instrumentation on Rik's
>> code helping.
>
> I think that's just another nasty, expedient hack that doesn't solve
> the underlying problem. Solving the underlying problem does not
> require changing core reclaim algorithms and upsetting a page
> reclaim/shrinker balance that has been stable and worked well for
> just about everyone for years.
>

Things are definitely breaking down in non-specialized workloads, and 
have been for a long time.

-chris


Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"

2019-01-30 Thread Chris Mason



On 29 Jan 2019, at 23:17, Dave Chinner wrote:

> From: Dave Chinner 
>
> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>
> This change causes serious changes to page cache and inode cache
> behaviour and balance, resulting in major performance regressions
> when combining workloads such as large file copies and kernel
> compiles.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=202441

I'm a little confused by the latest comment in the bz:

https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24

Are these reverts sufficient?

Roman beat me to suggesting Rik's followup.  We hit a different problem 
in prod with small slabs, and have a lot of instrumentation on Rik's 
code helping.

-chris


Re: [LKP] [lkp-robot] [brd] 316ba5736c: aim7.jobs-per-min -11.2% regression

2018-12-19 Thread Chris Mason

On 18 Dec 2018, at 13:57, Jens Axboe wrote:

> On 12/18/18 2:11 AM, kemi wrote:
>> Hi, All
>>   Do we have a special reason to keep this patch (316ba5736c9: brd: Mark
>> as non-rotational), which leads to a performance regression when BRD is
>> used as a disk on btrfs?
>
> I really suspect that this is a btrfs issue, as this is just flagging
> what is pretty obvious, that a ramdisk is NOT a rotational drive.
> So whatever btrfs is doing with that information is causing it to
> run slower - this really doesn't make any sense, but there we are.
>
> CC'ing Chris, leaving the report below.

Btrfs is changing the allocator decisions slightly for an SSD,
especially the cluster size for metadata, which should show up as more
system time spent in the btrfs allocator, but I'm not seeing that below.
It also changes how quickly btrfs dispatches synchronous IO.
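
As a side note on where the SSD decision comes from: the block layer
exposes the rotational flag per device in sysfs as queue/rotational, and
that flag is what the brd commit changes for ramdisks.  A minimal sketch of
reading it follows.  The helper names and the fake directory tree are
assumptions made so the example runs anywhere; on a real machine the
default root /sys/block would be used, and "ram0" is only an example device
name.

```python
import tempfile
from pathlib import Path


def is_rotational(device, sysfs_root="/sys/block"):
    """Return True if the block layer reports `device` as rotational."""
    flag = Path(sysfs_root) / device / "queue" / "rotational"
    # The attribute holds "1" for rotational media and "0" otherwise.
    return flag.read_text().strip() == "1"


def make_fake_sysfs(root, device, rotational):
    """Build a throwaway directory that mimics the sysfs layout (demo only)."""
    queue = Path(root) / device / "queue"
    queue.mkdir(parents=True)
    (queue / "rotational").write_text("1\n" if rotational else "0\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_fake_sysfs(root, "ram0", rotational=False)
        print("ram0 rotational:", is_rotational("ram0", sysfs_root=root))
```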

But, some parts of the differential don't quite make sense to me:

  47.50 ± 58%   +1355.8% 691.50 ± 92%  meminfo.Mlocked

Are these changes expected?

-chris
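
For anyone checking the robot's arithmetic, the %change column is the
relative change from the base commit to the patched commit.  A short
sketch, using only values copied from the report below; the function name
percent_change is made up for illustration.

```python
def percent_change(base, patched):
    """Relative change from base to patched, in percent."""
    return (patched - base) / base * 100.0


# (base commit value, patched commit value), copied from the report.
rows = {
    "aim7.jobs-per-min": (28321, 25147),
    "aim7.time.elapsed_time": (318.19, 358.23),
    "meminfo.Mlocked": (47.50, 691.50),
}

if __name__ == "__main__":
    for name, (base, patched) in rows.items():
        print(f"{name}: {percent_change(base, patched):+.1f}%")
    # aim7.jobs-per-min: -11.2%
    # aim7.time.elapsed_time: +12.6%
    # meminfo.Mlocked: +1355.8%
```

The Mlocked numbers start from a very small base and carry large
run-to-run deviations (± 58% and ± 92%), so the size of that percentage by
itself may say little.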

>
>> On 2018/7/10 1:27 PM, kemi wrote:
>>> Hi, SeongJae
>>>   Do you have any input for this regression? thanks
>>>
>>> On 2018-06-04 13:52, kernel test robot wrote:

 Greeting,

 FYI, we noticed a -11.2% regression of aim7.jobs-per-min due to commit:

 commit: 316ba5736c9caa5dbcd84085989862d2df57431d ("brd: Mark as non-rotational")
 https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-4.18/block

 in testcase: aim7
 on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
 with following parameters:

     disk: 1BRD_48G
     fs: btrfs
     test: disk_rw
     load: 1500
     cpufreq_governor: performance

 test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
 test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/



 Details are as below:

 compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
   gcc-7/performance/1BRD_48G/btrfs/x86_64-rhel-7.2/1500/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_rw/aim7

 commit:
   522a777566 ("block: consolidate struct request timestamp fields")
   316ba5736c ("brd: Mark as non-rotational")

 522a777566f56696             316ba5736c9caa5dbcd8408598
 ----------------             --------------------------
         %stddev    %change          %stddev
      28321           -11.2%       25147        aim7.jobs-per-min
     318.19           +12.6%      358.23        aim7.time.elapsed_time
     318.19           +12.6%      358.23        aim7.time.elapsed_time.max
    1437526 ±  2%     +14.6%     1646849 ±  2%  aim7.time.involuntary_context_switches
      11986           +14.2%       13691        aim7.time.system_time
      73.06 ±  2%      -3.6%       70.43        aim7.time.user_time
    2449470 ±  2%     -25.0%     1837521 ±  4%  aim7.time.voluntary_context_switches
      20.25 ± 58%   +1681.5%      360.75 ±109%  numa-meminfo.node1.Mlocked
     456062           -16.3%      381859        softirqs.SCHED
       9015 ±  7%     -21.3%        7098 ± 22%  meminfo.CmaFree
      47.50 ± 58%   +1355.8%      691.50 ± 92%  meminfo.Mlocked
       5.24 ±  3%       -1.2        3.99 ±  2%  mpstat.cpu.idle%
       0.61 ±  2%       -0.1        0.52 ±  2%  mpstat.cpu.usr%
      16627           +12.8%       18762 ±  4%  slabinfo.Acpi-State.active_objs
      16627           +12.9%       18775 ±  4%  slabinfo.Acpi-State.num_objs
      57.00 ±  2%     +17.5%       67.00        vmstat.procs.r
      20936           -24.8%       15752 ±  2%  vmstat.system.cs
      45474            -1.7%       44681        vmstat.system.in
       6.50 ± 59%   +1157.7%       81.75 ± 75%  numa-vmstat.node0.nr_mlock
     242870 ±  3%     +13.2%      274913 ±  7%  numa-vmstat.node0.nr_written
       2278 ±  7%     -22.6%        1763 ± 21%  numa-vmstat.node1.nr_free_cma
       4.75 ± 58%   +1789.5%       89.75 ±109%  numa-vmstat.node1.nr_mlock
   88018135 ±  3%     -48.9%    44980457 ±  7%  cpuidle.C1.time
    1398288 ±  3%     -51.1%      683493 ±  9%  cpuidle.C1.usage
    3499814 ±  2%     -38.5%     2153158 ±  5%  cpuidle.C1E.time
      52722 ±  4%     -45.6%       28692 ±  6%


Linux Foundation Technical Advisory Board Elections -- Call for nominations

2018-11-04 Thread Chris Mason

Hello everyone,

Friendly reminder that the TAB elections are coming soon.

The Linux Foundation Technical Advisory Board (TAB) serves as the 
interface between the kernel development community and the Linux 
Foundation. The TAB advises the Foundation on kernel-related matters, 
helps member companies learn to work with the community, and works to 
resolve community-related problems before they get out of hand.  We're 
also working with kernel maintainers to help refine the new code of 
conduct, and serving as the initial point of contact for code of conduct 
issues.

The board has ten members, one of whom sits on the Linux Foundation 
board of directors.

The election to select five TAB members will be held at the 2018 Kernel 
Summit in Vancouver, Canada.  The elections will take place at the 
conference center on Tuesday November 13th, at 5:30pm.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Vancouver.  Anyone is
eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Nominations will be accepted until the beginning of the event where the
election is held.

In past years, everyone running for the TAB has given a short speech 
before the voting began.  We've received feedback that the speeches add 
logistical complexity for the election, and may not be the best 
indicator of how well qualified someone is for the TAB.

Instead of speeches, this year we're asking candidates to include 
statements about why they would like to participate in the TAB.  These 
will be combined into a slideshow running during the election, and 
available via a public google doc at this location:

https://goo.gl/rPEc2v

Even though the deadline for nominations is right before voting begins,
any statements must be received by Monday November 12th at 5PM Pacific,
so that we have time to set up the slideshow.

Current TAB members, and their election year:

Chris Mason         2016
H. Peter Anvin      2016
Olof Johansson      2016
Rik van Riel        2016
Dan Williams        2016

Jon Corbet          2017
Greg Kroah-Hartman  2017
Steven Rostedt      2017
Ted Tso             2017
Tim Bird            2017

The five slots from 2016 are all up for election.  As always, please let 
us know if you have questions, and please do consider running.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.
The other five are halfway through their term and will be up for
election next year.


Linux Foundation Technical Advisory Board Elections -- Call for nominations

2018-10-22 Thread Chris Mason



Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the 
interface between the kernel development community and the Linux 
Foundation. The TAB advises the Foundation on kernel-related matters, 
helps member companies learn to work with the community, and works to 
resolve community-related problems before they get out of hand.  We're 
also working with kernel maintainers to help refine the new code of 
conduct, and serving as the initial point of contact for code of conduct 
issues.


The board has ten members, one of whom sits on the Linux Foundation 
board of directors.


The election to select five TAB members will be held at the 2018 Kernel 
Summit in Vancouver, Canada.  The elections will take place at the 
conference center on Tuesday November 13th, at 5:30pm.


The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Vancouver.  Anyone is
eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Nominations will be accepted until the beginning of the event where the
election is held.


In past years, everyone running for the TAB has given a short speech 
before the voting began.  We've received feedback that the speeches add 
logistical complexity for the election, and may not be the best 
indicator of how well qualified someone is for the TAB.


Instead of speeches, this year we're asking candidates to include 
statements about why they would like to participate in the TAB.  These 
will be combined into a slideshow running during the election, and 
available via a public google doc at this location:


https://goo.gl/rPEc2v

Even though the deadline for nominations is right before voting begins,
any statements must be received by Monday November 12th at 5PM Pacific,
so that we have time to set up the slideshow.


Current TAB members, and their election year:

Chris Mason         2016
H. Peter Anvin      2016
Olof Johansson      2016
Rik van Riel        2016
Dan Williams        2016

Jon Corbet          2017
Greg Kroah-Hartman  2017
Steven Rostedt      2017
Ted Tso             2017
Tim Bird            2017

The five slots from 2016 are all up for election.  As always, please let 
us know if you have questions, and please do consider running.


Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.
The other five are halfway through their term and will be up for
election next year.


Re: [PATCH 2/2] code-of-conduct: Strip the enforcement paragraph pending community discussion

2018-10-08 Thread Chris Mason

On 6 Oct 2018, at 17:37, James Bottomley wrote:

> Significant concern has been expressed about the responsibilities
> outlined in the enforcement clause of the new code of conduct.  Since
> there is concern that this becomes binding on the release of the 4.19
> kernel, strip the enforcement clauses to give the community time to
> consider and debate how this should be handled.


Even in the places where I don't agree with the discussion about what 
our code of conduct should be, I love that we're having it.  Removing 
the enforcement clause basically goes back to the way things were.  We'd 
be recognizing that we know issues happen, and explicitly stating that 
when serious events do happen, the community as a whole isn't committing 
to helping.


It's true there are a lot of questions about how the community resolves 
problems and holds each other accountable for maintaining any code of 
conduct.  I think the enforcement section leaves us the room we need to 
continue discussions and still make it clear that we're making an effort 
to shift away from the harsh discussions in the past.


-chris




Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Chris Mason

On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote:
>> As the first step in development of bpfilter project [1] the
>> request_module() code is extended to allow user mode helpers to be
>> invoked. Idea is that user mode helpers are built as part of the kernel
>> build and installed as traditional kernel modules with .ko file
>> extension into distro specified location, such that from a distribution
>> point of view, they are no different than regular kernel modules. Thus,
>> allow request_module() logic to load such user mode helper (umh)
>> modules via:
>
> [,,]
>
> I like this, but I have one request: can we make sure that this action
> is visible in the system messages?
>
> When we load a regular module, at least it shows in lsmod afterwards,
> although I have a few times wanted to really see module load as an
> event in the logs too.
>
> When we load a module that just executes a user program, and there is
> no sign of it in the module list, I think we *really* need to make
> that event show to the admin some way.
>
> .. and yes, maybe we'll need to rate-limit the messages, and maybe it
> turns out that I'm entirely wrong and people will hate the messages
> after they get used to the concept of these pseudo-modules, but
> particularly for the early implementation when this is a new thing, I
> really want a message like
>
>      executed user process xyz-abc as a pseudo-module
>
> or something in dmesg.
>
> I do *not* want this to be a magical way to hide things.


Especially early on, this makes a lot of sense.  But I wanted to plug 
bps and the hopefully growing set of bpf introspection tools:


https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going 
on.


-chris


Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-30 Thread Chris Mason



On 11/30/2017 12:23 PM, David Sterba wrote:

> On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote:
>> On 11/29/2017 12:05 PM, Tejun Heo wrote:
>>> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>>>> What has happened with this patch set?
>>>>
>>>> No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
>>>> can you please route them through the btrfs tree?
>>>
>>> lol looking at the patchset again, I'm not sure that's obviously the
>>> right tree.  It can either be cgroup, block or btrfs.  If no one
>>> objects, I'll just route them through cgroup.
>>
>> We'll have to coordinate a bit during the next merge window but I don't
>> have a problem with these going in through cgroup.  Dave, does this sound
>> good to you?
>
> There are only minor changes to btrfs code so cgroup tree would be
> better.
>
>> I'd like to include my patch to do all crcs inline (instead of handing
>> off to helper threads) when io controls are in place.  By the merge
>> window we should have some good data on how much it's all helping.
>
> Are there any problems in sight if the inline crc and cgroup changes go
> separately? I assume there's a runtime dependency, not a code
> dependency, so it could be sorted by the right merge order.



The feature is just more useful with the inline crcs.  Without them we 
end up with kworkers doing both high and low prio submissions and it all 
boils down to the speed of the lowest priority.


-chris



Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-29 Thread Chris Mason

On 11/29/2017 12:05 PM, Tejun Heo wrote:

> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>> What has happened with this patch set?
>>
>> No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
>> can you please route them through the btrfs tree?
>
> lol looking at the patchset again, I'm not sure that's obviously the
> right tree.  It can either be cgroup, block or btrfs.  If no one
> objects, I'll just route them through cgroup.


We'll have to coordinate a bit during the next merge window but I don't 
have a problem with these going in through cgroup.  Dave does this sound 
good to you?


I'd like to include my patch to do all crcs inline (instead of handing 
off to helper threads) when io controls are in place.  By the merge 
window we should have some good data on how much it's all helping.


-chris



Reminder v2: Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-22 Thread Chris Mason

Hello everyone,

Quick update on the TAB elections, we have 6 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso
Tim Bird

The elections are coming soon, please feel free to contact me if you 
have any questions about the TAB.


-

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.  The
election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.


Reminder: Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-16 Thread Chris Mason

Hello everyone,

Quick update on the TAB elections, we have 5 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso

The elections are next week, please feel free to contact me if you have 
any questions about the TAB.


-

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.  The
election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.


Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-09 Thread Chris Mason

Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.
The election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.


[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)

2017-09-11 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert
and Phillip, we decided to send the whole thing in as one pull request.

Herbert had asked about the crypto patch when we discussed the pull, but
I didn't realize he really meant not-right-now.  I've rebased it out of
this branch, and none of the other patches depended on it.

I have things in my zstd-minimal branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal

There's a trivial conflict with the main btrfs pull from last week.
Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.

zstd is a big win in speed over zlib and in compression ratio over lzo,
and the compression team here at FB has gotten great results using it in
production.  Nick will continue to update the kernel side with new
improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

211,988,480 B / time(method)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

211,988,480 B / (time(method) - time(none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
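
As a sanity check on the table above, the MB/s and Adj MB/s columns can be reproduced from the Time column. This is a minimal sketch that assumes the byte count is the 211,988,480 B size of `silesia.tar` given in the text and that MB means 10^6 bytes; the function names are illustrative, not part of the benchmark module.

```python
INPUT_BYTES = 211_988_480   # size of silesia.tar as stated in the text
BASELINE_SECONDS = 0.100    # "none" row: copy from userland with no compression


def mb_per_s(seconds):
    """Throughput including the time to copy from userland."""
    return INPUT_BYTES / seconds / 1e6


def adjusted_mb_per_s(seconds, baseline=BASELINE_SECONDS):
    """Throughput with the plain-copy baseline time subtracted."""
    return INPUT_BYTES / (seconds - baseline) / 1e6


# Times (in seconds) taken from a few rows of the compression table.
times = {"zstd -1": 1.044, "zstd -3": 1.761, "zlib -6": 9.633}

for method, seconds in times.items():
    print(f"{method:8s} {mb_per_s(seconds):8.2f} MB/s "
          f"{adjusted_mb_per_s(seconds):8.2f} adjusted MB/s")
```

For these three rows the printed values match the MB/s and Adj MB/s columns to two decimal places.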

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (3) commits (+14222/-12):
btrfs: Add zstd support (+468/-12)
lib: Add zstd modules (+13014/-0)
lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
squashfs: Add zstd support

Total: (4) commits (+14400/-12)

 fs/btrfs/Kconfig       |    2 +
 fs/btrfs/Makefile      |    2 +-
 fs/btrfs/compression.c |    1 +
 fs/btrfs/compression.h |    6 +-
 fs/btrfs/ctree.h       |    1 +
 fs/btrfs/disk-io.c     |    2 +
 fs/btrfs/ioctl.c       |    6 +-
 fs/btrfs/props.c       |    6 +
 fs/btrfs/super.c       |   12 +-
 fs/btrfs/sysfs.c       |    2 +
 fs/btrfs/zstd.c        |  432 ++
 fs/squashfs/Kconfig    |   14 +

Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason

On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote:

On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote:


 crypto/Kconfig     |    9 +
 crypto/Makefile    |    1 +
 crypto/testmgr.c   |   10 +
 crypto/testmgr.h   |   71 +
 crypto/zstd.c      |  265


Is there anyone going to use zstd through the crypto API? If not
then I don't see the point in adding it at this point.  Especially
as the compression API is still in a state of flux.


That part was requested by Intel, but I'm happy to leave it out for
another time.  The rest of the patch series doesn't depend on it at all.


-chris


Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason



On 09/08/2017 03:33 PM, Chris Mason wrote:

Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the
open source zstd userland code.


Just to clarify, we've been testing the kernel side of this here at FB, 
but our zstd use in prod is limited to the application side.


-chris


[GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the 
open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

211,988,480 B / time(method)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

211,988,480 B / (time(method) - time(none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
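
As a quick cross-check, the throughput and ratio columns above can be reproduced from the stated size of `silesia.tar` and the reported times. The following is a minimal sketch in Python; the function names are made up for illustration and are not part of the benchmark module.

```python
# Size of silesia.tar in bytes, as stated in the benchmark description.
FILE_SIZE_B = 211_988_480


def mb_per_s(time_s):
    """Throughput in MB/s (1 MB = 10**6 B), including the userland copy."""
    return FILE_SIZE_B / time_s / 1e6


def adjusted_mb_per_s(time_s, none_time_s):
    """Throughput after subtracting the time of the 'none' (copy-only) run."""
    return FILE_SIZE_B / (time_s - none_time_s) / 1e6


def ratio(compressed_size_b):
    """Compression ratio: original size over compressed size."""
    return FILE_SIZE_B / compressed_size_b


if __name__ == "__main__":
    # zstd -1 row: 73645762 B in 1.044 s, with 0.100 s for the 'none' run.
    print(f"{mb_per_s(1.044):.2f}")                  # 203.05
    print(f"{adjusted_mb_per_s(1.044, 0.100):.2f}")  # 224.56
    print(f"{ratio(73645762):.3f}")                  # 2.878
```

The adjusted figure subtracts the copy-only time, so it approximates the cost of the compressor alone rather than the whole `cp`.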

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (4) commits (+14578/-12):  
crypto: Add zstd support (+356/-0)  
btrfs: Add zstd support (+468/-12)  
lib: Add zstd modules (+13014/-0)   
lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0): 
squashfs: Add zstd support  

Total: (5) commits (+14756/-12)

Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-11 Thread Chris Mason



On 08/10/2017 03:25 PM, Hugo Mills wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:


These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression level.  It could easily be a mount option; it was just
outside the scope of Nick's initial work.


Could we please not add more mount options? I get that they're easy
to implement, but it's a very blunt instrument. What we tend to see
(with both nodatacow and compress) is people using the mount options,
then asking for exceptions, discovering that they can't do that, and
then falling back to doing it with attributes or btrfs properties.
Could we just start with btrfs properties this time round, and cut out
the mount option part of this cycle.

In the long run, it'd be great to see most of the btrfs-specific
mount options get deprecated and ultimately removed entirely, in
favour of attributes/properties, where feasible.



It's a good point, and as was commented later down I'd just do mount -o 
compress=zstd:3 or something.


But I do prefer properties in general for this.  My big point was just
that the next step is outside of Nick's scope.


-chris



Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 03:00 PM, Eric Biggers wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression level.  It could easily be a mount option; it was just
outside the scope of Nick's initial work.



I am not surprised --- Zstandard is closer to the state of the art, both
format-wise and implementation-wise, than the other choices in BTRFS.  My point
is that benchmarks need to account for how much data is compressed at a time.
This is a common mistake when comparing different compression algorithms; the
algorithm name and compression level do not tell the whole story.  The
dictionary size is extremely significant.  No one is going to compress or
decompress a 200 MB file as a single stream in kernel mode, so it does not make
sense to justify adding Zstandard *to the kernel* based on such a benchmark.  It
is going to be divided into chunks.  How big are the chunks in BTRFS?  I thought
that it compressed only one page (4 KiB) at a time, but I hope that has been, or
is being, improved; 32 KiB - 128 KiB should be a better amount.  (And if the
amount of data compressed at a time happens to be different between the
different algorithms, note that BTRFS benchmarks are likely to be measuring that
as much as the algorithms themselves.)


Btrfs hooks the compression code into the delayed allocation mechanism 
we use to gather large extents for COW.  So if you write 100MB to a 
file, we'll have 100MB to compress at a time (within the limits of the 
amount of pages we allow to collect before forcing it down).


But we want to balance how much memory you might need to uncompress
during random reads.  So we have an artificial limit of 128KB that we
send at a time to the compression code.  It's easy to change this; it's
just a tradeoff made to limit the cost of reading small bits.


It's the same for zlib, lzo and the new zstd patch.
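
To illustrate the tradeoff described above, here is a minimal sketch of compressing data in independent 128 KiB chunks so that a small random read only has to decompress the chunks it touches. It is not btrfs code: Python's `zlib` module stands in for the kernel compressors, and `compress_chunked` and `read_range` are hypothetical helper names.

```python
import zlib

CHUNK = 128 * 1024  # per-call limit mentioned above, taken here as 128 KiB


def compress_chunked(data, chunk=CHUNK, level=6):
    """Compress each chunk independently so it can be decompressed alone."""
    return [zlib.compress(data[i:i + chunk], level)
            for i in range(0, len(data), chunk)]


def read_range(chunks, offset, length, chunk=CHUNK):
    """Random read: decompress only the chunks overlapping the request."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    buf = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    start = offset - first * chunk
    return buf[start:start + length]


if __name__ == "__main__":
    # Synthetic, repetitive sample data; real file contents will behave differently.
    data = b"".join(b"record %08d: some repetitive payload\n" % i
                    for i in range(40000))
    chunks = compress_chunked(data)
    single = zlib.compress(data, 6)
    print("chunks:", len(chunks))
    print("chunked total bytes:", sum(len(c) for c in chunks))
    print("single stream bytes:", len(single))
    assert read_range(chunks, 300000, 100) == data[300000:300100]
```

A smaller chunk makes small reads cheaper but gives the compressor less history to work with, which is why the amount of data compressed at a time matters when comparing algorithms.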

-chris



Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as 
well.  The numbers were in line with what Nick is posting here.  zstd is 
a big win over both lzo and zlib from a btrfs point of view.


It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression level.  It could easily be a mount option; it was just
outside the scope of Nick's initial work.


-chris





Re: Moving ndctl development into the kernel tree?

2017-07-25 Thread Chris Mason

On 07/22/2017 02:49 PM, Dan Williams wrote:

On Fri, Jul 21, 2017 at 7:52 PM, Dan Williams  wrote:

[ adding Chris ]

On Fri, Jul 21, 2017 at 4:44 PM, Dan Williams  wrote:

On Fri, Jul 21, 2017 at 3:58 PM, Ingo Molnar  wrote:


* Dan Williams  wrote:


[...]

* Like perf, ndctl borrows the sub-command architecture and option
parsing from git. So, this code could be refactored into something
shared / generic, i.e. the bits in tools/perf/util/.


Just as a side note, stacktool (tools/stacktool/) is using the Git sub-command
and options parsing code as well, and it's already sharing it with perf, via
the tools/lib/subcmd/ library.

ndctl could use that as well.


Ah, nice, that refactoring happened about a year after ndctl was born.
Which brings up the next question about what to do with the git
history, but I'd want to know if ndctl is even welcome upstream before
digging any deeper.


I suspect this would be similar to what Chris did to merge btrfs while
retaining the standalone history. Chris, any pointers on what worked
well and what if anything you would do differently? I.e. I'm looking
to use git filter-branch to rewrite ndctl history as if it had always
been in tools/ndctl in the kernel tree. I found this old thread
https://lkml.org/lkml/2008/10/30/523 and it seems to also recommend
using an older kernel as the branch base.


So it wasn't as painful as I thought it would be, I just used the
script Linus recommended in that thread. Here is what I came up with
merging the last ndctl release on top of v4.9, and then applying the
pending development patches re-filtered to tools/ndctl:

 
https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=for-4.14/ndctl

...the next thing would be to rework the versioning to use the kernel
version and switch to using tools/lib/subcmd/.



I'd like to say I figured it all out back then, but the truth is that 
Linus held my hand the whole way.  My memory of it is that his script 
worked really well, I just ran that and verified the results.


-chris


[GIT PULL] Btrfs

2017-06-10 Thread Chris Mason
Hi Linus,

My for-linus-4.12 branch has some fixes that Dave Sterba collected:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.12

We've been hitting an early ENOSPC problem on production machines that
Omar tracked down to an old int->u64 mistake.  I waited a bit on
this pull to make sure it was really the problem from production,
but it's on ~2100 hosts now and I think we're good.

Omar also noticed a commit in the queue would introduce new early ENOSPC
problems.  I pulled that out for now, which is why the top three commits
are younger than the rest.

Otherwise these are all fixes, some explaining very old bugs that we've
been poking at for a while.

Jeff Mahoney (2) commits (+4/-3):
btrfs: fix race with relocation recovery and fs_root setup (+3/-3)
btrfs: fix memory leak in update_space_info failure path (+1/-0)

Liu Bo (1) commits (+1/-1):
Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

Colin Ian King (1) commits (+1/-1):
btrfs: fix incorrect error return ret being passed to mapping_set_error

Omar Sandoval (1) commits (+2/-2):
Btrfs: fix delalloc accounting leak caused by u32 overflow

Qu Wenruo (1) commits (+122/-2):
btrfs: fiemap: Cache and merge fiemap extent before submit it to user

David Sterba (1) commits (+2/-2):
btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1) commits (+6/-4):
btrfs: Make flush bios explicitely sync

Su Yue (1) commits (+1/-1):
btrfs: tree-log.c: Wrong printk information about namelen

Total: (9) commits (+139/-16)

 fs/btrfs/ctree.h   |   4 +-
 fs/btrfs/dir-item.c|   2 +-
 fs/btrfs/disk-io.c |  10 ++--
 fs/btrfs/extent-tree.c |   7 +--
 fs/btrfs/extent_io.c   | 126 +++--
 fs/btrfs/inode.c   |   6 +--
 6 files changed, 139 insertions(+), 16 deletions(-)


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-06-09 Thread Chris Mason

On 06/06/2017 05:21 AM, Peter Zijlstra wrote:

On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:

On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:

On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:


Please test..


Results are still coming in but things do look better with your patch
applied.

It does look like there's a regression when running hackbench in
process mode and when the CPUs are not fully utilised, e.g. check this
out:


This turned out to be a false positive; your patch improves things as
far as I can see.


Hooray, I'll move it to a part of the queue intended for merging.


It's a little late, but Roman Gushchin helped get some runs of this with
our production workload.  The patch is ever so slightly better.


Thanks!

-chris



Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-05-17 Thread Chris Mason

On 05/17/2017 06:53 AM, Peter Zijlstra wrote:

On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:

sched/fair, cpumask: Export for_each_cpu_wrap()



-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{



-   next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);



-}


OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
implementation. The above 'n+1' should be 'n', and the effect is that
it'll skip over CPUs, potentially resulting in an iteration that only
sees every other CPU (for a fully contiguous mask).
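
The skipping effect can be reproduced with a simplified model. This is a sketch of the general off-by-one pattern, not the kernel implementation: `find_next_bit` here is a plain Python stand-in for the kernel helper of the same name, and `cpus_wrap` with its `extra` parameter is hypothetical. The loop already advances past the CPU it just visited, so searching one position further (`extra=1`) advances twice per step.

```python
def find_next_bit(mask, size, offset):
    """Return the index of the first set bit at or after offset, or size if none."""
    for i in range(offset, size):
        if (mask >> i) & 1:
            return i
    return size


def cpus_wrap(mask, size, start, extra=0):
    """List the set bits of mask in visiting order, starting at start and wrapping once.

    extra=0 searches from the next candidate position; extra=1 models the
    off-by-one by searching one position further than intended.
    """
    visited = []
    n = start
    wrapped = False
    while True:
        nxt = find_next_bit(mask, size, n + extra)
        if nxt >= size:
            if wrapped:
                break
            wrapped = True
            n = 0  # restart from the beginning of the mask
            continue
        if wrapped and nxt >= start:
            break  # back at the starting point, iteration is complete
        visited.append(nxt)
        n = nxt + 1  # the loop itself already moves past the visited CPU
    return visited


if __name__ == "__main__":
    full = 0xFF  # eight contiguous CPUs
    print(cpus_wrap(full, 8, 3))           # [3, 4, 5, 6, 7, 0, 1, 2]
    print(cpus_wrap(full, 8, 3, extra=1))  # [4, 6, 1]
```

With a fully contiguous mask the off-by-one variant visits only every other CPU, which matches the symptom described above.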

This in turn causes hackbench to further suffer from the regression
introduced by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

So it's well past time to fix this.

Where the old scheme was a cliff-edge throttle on idle scanning, this
introduces a more gradual approach. Instead of stopping the scan
entirely, we limit how many CPUs we scan.
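
The difference between a cliff-edge throttle and a gradual limit can be
sketched as below. This is only a model under stated assumptions: the
function names, the inputs (span_weight, avg_idle, avg_cost) and the floor
of 4 CPUs are illustrative choices, not taken from the patch itself.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MIN_SCAN 4	/* assumed floor, chosen for illustration */

/* Cliff-edge model: scan the whole span or nothing at all. */
static unsigned int demo_cliff_limit(unsigned int span_weight,
				     uint64_t avg_idle, uint64_t avg_cost)
{
	return avg_idle < avg_cost ? 0 : span_weight;
}

/*
 * Gradual model: the number of CPUs to scan scales with the average idle
 * time relative to the average cost of a scan, with a small floor and a
 * cap at the size of the span.
 */
static unsigned int demo_gradual_limit(unsigned int span_weight,
				       uint64_t avg_idle, uint64_t avg_cost)
{
	uint64_t span_avg, nr;

	assert(avg_cost > 0);
	span_avg = (uint64_t)span_weight * avg_idle;
	if (span_avg <= DEMO_MIN_SCAN * avg_cost)
		return DEMO_MIN_SCAN;
	nr = span_avg / avg_cost;
	return nr < span_weight ? (unsigned int)nr : span_weight;
}
```

For a 16-CPU span with avg_cost = 500, an avg_idle of 400 makes the
cliff-edge model scan nothing while the gradual model still scans 12 CPUs.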

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

I'm also hoping it will fix/preserve kitsunyan's interactivity issue.

Please test..


We'll get some tests going here too.

-chris


Re: [GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
On 05/09/2017 01:56 PM, Chris Mason wrote:
> Hi Linus,
> 
> My for-linus-4.12 branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> for-linus-4.12

I hit send too soon, sorry.  There's a trivial conflict with our WARN_ON
fix that went into 4.11.  I pushed the resolution to
for-linus-4.12-merged.

diff --cc fs/btrfs/qgroup.c
index afbea61,3f75b5c..deffbeb
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str
qgroup->excl += sign * num_bytes;
qgroup->excl_cmpr += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup, num_bytes);
else
qgroup->reserved -= num_bytes;
@@@ -1103,7 -1057,9 +1060,9 @@@
WARN_ON(sign < 0 && qgroup->excl < num_bytes);
qgroup->excl += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup,
+   -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup,
  num_bytes);
else
@@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b
  
qg = unode_aux_to_qgroup(unode);
  
+   trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
 -  if (WARN_ON(qg->reserved < num_bytes))
 +  if (qg->reserved < num_bytes)
report_reserved_underflow(fs_info, qg, num_bytes);
else
qg->reserved -= num_bytes;


[GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
 bdev_get_queue (+3/-4)
btrfs: check if the device is flush capable (+4/-0)
btrfs: delete unused member nobarriers (+0/-4)

Edmund Nadolski (2) commits (+25/-20):
btrfs: provide enumeration for __merge_refs mode argument (+13/-10)
btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10)

Goldwyn Rodrigues (2) commits (+24/-3):
btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1)
btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2)

Chris Mason (1) commits (+2/-2):
btrfs: fix the gfp_mask for the reada_zones radix tree

Adam Borowski (1) commits (+9/-3):
btrfs: fix a bogus warning when converting only data or metadata

Deepa Dinamani (1) commits (+2/-1):
btrfs: Use ktime_get_real_ts for root ctime

Dan Carpenter (1) commits (+15/-26):
Btrfs: handle only applicable errors returned by btrfs_get_extent

Dmitry V. Levin (1) commits (+2/-0):
MAINTAINERS: add btrfs file entries for include directories

Hans van Kranenburg (1) commits (+5/-5):
Btrfs: consistent usage of types in balance_args

Total: (71) commits

 MAINTAINERS                  |   2 +
 fs/btrfs/backref.c           |  41 ++-
 fs/btrfs/btrfs_inode.h       |   7 +
 fs/btrfs/compression.c       |  18 +-
 fs/btrfs/ctree.c             |  20 +-
 fs/btrfs/ctree.h             |  34 +-
 fs/btrfs/delayed-inode.c     |  46 +--
 fs/btrfs/delayed-inode.h     |   6 +-
 fs/btrfs/delayed-ref.c       |   8 +-
 fs/btrfs/delayed-ref.h       |   8 +-
 fs/btrfs/dev-replace.c       |   9 +-
 fs/btrfs/disk-io.c           |  13 +-
 fs/btrfs/disk-io.h           |   4 +-
 fs/btrfs/extent-tree.c       |  35 +-
 fs/btrfs/extent_io.c         |  59 +--
 fs/btrfs/extent_io.h         |   8 +-
 fs/btrfs/extent_map.c        |  10 +-
 fs/btrfs/extent_map.h        |   3 +-
 fs/btrfs/file.c              |  82 -
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode.c             | 289 +++
 fs/btrfs/ioctl.c             |  33 +-
 fs/btrfs/ordered-data.c      |  20 +-
 fs/btrfs/ordered-data.h      |   2 +-
 fs/btrfs/qgroup.c            | 102 ++
 fs/btrfs/qgroup.h            |  51 ++-
 fs/btrfs/raid56.c            |  38 +-
 fs/btrfs/reada.c             |  37 +-
 fs/btrfs/root-tree.c         |   3 +-
 fs/btrfs/scrub.c             | 331 +++--
 fs/btrfs/send.c              |  23 +-
 fs/btrfs/super.c             |   3 +-
 fs/btrfs/tests/btrfs-tests.c |   1 -
 fs/btrfs/transaction.c       |  48 ++-
 fs/btrfs/transaction.h       |   6 +-
 fs/btrfs/tree-log.c          |   2 +-
 fs/btrfs/volumes.c           | 854 +++
 fs/btrfs/volumes.h           |   8 +-
 include/trace/events/btrfs.h | 187 +-
 include/uapi/linux/btrfs.h   |  10 +-
 40 files changed, 1629 insertions(+), 834 deletions(-)


Re: [PATCH] btrfs: always write superblocks synchronously

2017-05-03 Thread Chris Mason



On 05/03/2017 04:36 AM, Jan Kara wrote:

On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote:

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed the REQ_SYNC flag from the WRITE_FUA implementation.
Because the REQ_FUA and REQ_FLUSH flags are stripped from submitted IO
when the disk doesn't have a volatile write cache, this effectively
makes the write async. This was seen to cause regressions of up to 90%
in disk IO related benchmarks such as reaim and dbench[1].

Fix the problem by making sure the first superblock write is also
treated as synchronous, since it can block progress of the
journalling (commit, log syncs) machinery and thus the whole filesystem.
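
The effect described above can be modelled with a few bit flags. This is a
toy sketch, not the block layer: the DEMO_* values and both helper functions
are hypothetical and exist only to show why a write that relies on
FUA/PREFLUSH alone stops looking synchronous once those flags are stripped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
enum {
	DEMO_REQ_SYNC     = 1u << 0,
	DEMO_REQ_FUA      = 1u << 1,
	DEMO_REQ_PREFLUSH = 1u << 2,
};

/*
 * Model of the flush-related flags being dropped when the disk has no
 * volatile write cache.
 */
static unsigned int demo_strip_flush(unsigned int flags, bool volatile_cache)
{
	if (!volatile_cache)
		flags &= ~(unsigned int)(DEMO_REQ_FUA | DEMO_REQ_PREFLUSH);
	return flags;
}

/* A write counts as synchronous if any of the three flags is still set. */
static bool demo_is_sync(unsigned int flags)
{
	return (flags & (DEMO_REQ_SYNC | DEMO_REQ_FUA | DEMO_REQ_PREFLUSH)) != 0;
}
```

In this model a write tagged only with DEMO_REQ_FUA | DEMO_REQ_PREFLUSH is
no longer synchronous on a cache-less disk, while one that also carries
DEMO_REQ_SYNC stays synchronous.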





Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous)
Cc: stable 
Cc: Jan Kara 
Signed-off-by: Davidlohr Bueso 


I wasn't patient enough and already sent the fix as part of my series
fixing other filesystems [1]. It also fixes one more place in btrfs that
needs REQ_SYNC to return to the original behavior.




Thanks guys.

-chris



[GIT PULL] Btrfs

2017-04-27 Thread Chris Mason

Hi Linus,

We have one more for btrfs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

This is dropping a new WARN_ON from rc1 that ended up making more noise 
than we really want.  The larger fix for the underflow got delayed a bit 
and it's better for now to put it under CONFIG_BTRFS_DEBUG.
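
Gating a noisy warning behind a debug option can be sketched like this. It
is a user-space model only: DEMO_DEBUG stands in for a config option such as
CONFIG_BTRFS_DEBUG, struct demo_qgroup and both functions are invented for
the example, and clamping the counter to zero on underflow is an assumption
about the recovery path rather than something stated in this mail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct demo_qgroup {
	uint64_t reserved;
	unsigned int warnings;	/* how many times the warning fired */
};

/* Handle an underflow; the noisy part only exists in debug builds. */
static void demo_report_underflow(struct demo_qgroup *qg, uint64_t num_bytes)
{
#ifdef DEMO_DEBUG
	fprintf(stderr, "reserved underflow: have %llu, freeing %llu\n",
		(unsigned long long)qg->reserved,
		(unsigned long long)num_bytes);
	qg->warnings++;
#else
	(void)num_bytes;
#endif
	qg->reserved = 0;	/* assumed recovery: clamp instead of wrapping */
}

/* Release num_bytes from the reservation, reporting an underflow if needed. */
static void demo_release(struct demo_qgroup *qg, uint64_t num_bytes)
{
	assert(qg != NULL);
	if (qg->reserved < num_bytes)
		demo_report_underflow(qg, num_bytes);
	else
		qg->reserved -= num_bytes;
}
```

Built without -DDEMO_DEBUG, an underflow is handled quietly; built with it,
the same call path also prints the warning and counts it.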


David Sterba (1) commits (+7/-4):
   btrfs: qgroup: move noisy underflow warning to debugging build

Total: (1) commits (+7/-4)

fs/btrfs/qgroup.c | 11 +++
1 file changed, 7 insertions(+), 4 deletions(-)


Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

2017-04-25 Thread Chris Mason



On 04/25/2017 04:49 PM, Tejun Heo wrote:

On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:

Will try that too.  I can't see why HT would change it because I see
single CPU queues misevaluated.  Just in case, you need to tune the
test params so that it doesn't load the machine too much and that
there are some non-CPU intensive workloads going on to perturb things
a bit.  Anyways, I'm gonna try disabling HT.


It's finickier but after changing the duty cycle a bit, it reproduces
w/ HT off.  I think the trick is setting the number of threads to the
number of logical CPUs and tune -s/-c so that p99 starts climbing up.
The following is from the root cgroup.


Since it's only measuring wakeup latency, schbench is best at exposing
problems when the machine is just barely below saturation.  At
saturation, everyone has to wait for the CPUs, and if we're relatively
idle there's always a CPU to be found.


There's schbench -a to try and find this magic tipping point, but I 
haven't found a great way to automate for every kind of machine yet (sorry).
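
For readers unfamiliar with the p99 figure mentioned above, here is a small
sketch of a nearest-rank percentile over a set of wakeup latency samples.
It is an illustration only: demo_percentile() is an invented helper, and
schbench's own percentile calculation may differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for unsigned 64-bit latency samples. */
static int demo_cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Sorts 'samples' in place and returns the pct-th percentile (nearest rank). */
static uint64_t demo_percentile(uint64_t *samples, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct >= 1 && pct <= 100);
	qsort(samples, n, sizeof(*samples), demo_cmp_u64);
	rank = (n * pct + 99) / 100;	/* ceil(n * pct / 100) */
	return samples[rank - 1];
}
```

Watching this value while adjusting the load is one way to notice the point
where the tail latency starts to climb.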


-chris


[GIT PULL] Btrfs

2017-04-14 Thread Chris Mason

Hi Linus

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch
we're testing/sending by hand for this release.


Liu Bo (3) commits (+11/-13):
   Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
   Btrfs: fix potential use-after-free for cloned bio (+1/-1)
   Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
   btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

fs/btrfs/inode.c   | 22 ++
fs/btrfs/super.c   |  3 +++
fs/btrfs/volumes.c |  2 +-
3 files changed, 14 insertions(+), 13 deletions(-)


[GIT PULL] Btrfs

2017-03-31 Thread Chris Mason
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c     |  6 +++---
 fs/btrfs/qgroup.c    | 10 +-
 fs/btrfs/send.c      |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)


[GIT PULL] Btrfs

2017-03-23 Thread Chris Mason

Hi Linus

We have a small set of fixes for the next RC:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Zygo tracked down a very old bug with inline compressed extents.
I didn't tag this one for stable because I want to do individual tested
backports.  It's a little tricky and I'd rather do some extra testing
on it along the way.

Otherwise they are pretty obvious:

Liu Bo (1) commits (+2/-1):
   Btrfs: fix regression in lock_delalloc_pages

Dmitry V. Levin (1) commits (+0/-27):
   btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h

Zygo Blaxell (1) commits (+14/-0):
   btrfs: add missing memset while reading compressed inline extents

Total: (3) commits (+16/-28)

fs/btrfs/extent_io.c       |  3 ++-
fs/btrfs/inode.c           | 14 ++
include/uapi/linux/btrfs.h | 27 ---
3 files changed, 16 insertions(+), 28 deletions(-)


Re: [PATCH] jump_label: Fix anonymous union initialization

2017-03-02 Thread Chris Mason

On 03/02/2017 04:42 PM, Steven Rostedt wrote:

On Thu, 2 Mar 2017 16:07:19 -0500
Jason Baron <jba...@akamai.com> wrote:


On 02/28/2017 11:32 AM, Boris Ostrovsky wrote:

GCC versions before 4.6 do not allow direct static initialization of
members of anonymous structs/unions. After commit 3821fd35b58d ("jump_label:
Reduce the size of struct static_key") STATIC_KEY_INIT_{TRUE|FALSE}
definitions cannot be compiled with those older compilers.

Placing initializers inside curly braces works around this problem.
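
The brace workaround can be tried outside the kernel with a cut-down
structure. This is a sketch under assumptions: struct demo_static_key,
DEMO_TYPE_TRUE and demo_key_enabled() are simplified stand-ins invented for
the example; only the shape of the initializer follows the patch.

```c
#include <assert.h>

/* Simplified stand-in for the kernel structure. */
struct demo_static_key {
	struct { int counter; } enabled;
	union {			/* anonymous union */
		unsigned long type;
		void *entries;
	};
};

#define DEMO_TYPE_TRUE 1UL	/* illustrative value */

/*
 * The form older compilers reject names the union member directly:
 *   { .enabled = { 1 }, .entries = (void *)DEMO_TYPE_TRUE }
 * The workaround wraps the union's initializer in its own braces, so the
 * anonymous union is initialized positionally as the second member.
 */
#define DEMO_KEY_INIT_TRUE				\
	{ .enabled = { 1 },				\
	  { .entries = (void *)DEMO_TYPE_TRUE } }

/* Read back the counter so the initializer can be checked. */
static int demo_key_enabled(const struct demo_static_key *key)
{
	return key->enabled.counter;
}
```

A variable declared as struct demo_static_key key = DEMO_KEY_INIT_TRUE;
then has its counter set to 1 and its entries pointer set to the type value.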

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 include/linux/jump_label.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 8e06d75..518020b 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -166,10 +166,10 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry,
  */
 #define STATIC_KEY_INIT_TRUE   \
{ .enabled = { 1 }, \
- .entries = (void *)JUMP_TYPE_TRUE }
+ { .entries = (void *)JUMP_TYPE_TRUE } }
 #define STATIC_KEY_INIT_FALSE  \
{ .enabled = { 0 }, \
- .entries = (void *)JUMP_TYPE_FALSE }
+ { .entries = (void *)JUMP_TYPE_FALSE } }

 #else  /* !HAVE_JUMP_LABEL */




(Adding Steve to 'cc)

Thanks for the fix.

Reviewed-by: Jason Baron <jba...@akamai.com>


Funny, Chris pinged me on IRC telling me that jump labels broke with my
latest tree. And we discovered it was because of anonymous unions and
he was using an older compiler (4.4 or something). I didn't know how to
make it work, and we were just going to say "tough, jump labels are not
for 4.4". Although, didn't goto asm get added into 4.5? Did someone
backport it to the gcc 4.4 compilers? I believe 4.5 handles anonymous
unions.

Since the broken commit went through my tree, I'll take this patch.
I'm getting ready for another git pull request to Linus.



Compiled-by: Chris Mason <c...@fb.com>

-chris



[GIT PULL] Btrfs

2017-03-02 Thread Chris Mason
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Has Btrfs round two.  These are mostly a continuation of Dave Sterba's
collection of cleanups, but Filipe also has some bug fixes and performance
improvements.

Nikolay Borisov (42) commits (+611/-579):
btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode (+14/-14)
btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode (+39/-38)
btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode (+10/-8)
btrfs: all btrfs_delalloc_release_metadata take btrfs_inode (+22/-19)
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode (+3/-4)
btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode (+7/-6)
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode (+3/-3)
btrfs: Make btrfs_orphan_release_metadata take btrfs_inode (+8/-8)
btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode (+7/-7)
btrfs: Make check_parent_dirs_for_sync take btrfs_inode (+14/-14)
btrfs: make btrfs_free_io_failure_record take btrfs_inode (+9/-7)
btrfs: Make btrfs_lookup_ordered_range take btrfs_inode (+19/-18)
btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode (+17/-16)
btrfs: make btrfs_print_data_csum_error take btrfs_inode (+8/-7)
btrfs: make btrfs_is_free_space_inode take btrfs_inode (+20/-19)
btrfs: make btrfs_set_inode_index_count take btrfs_inode (+8/-8)
btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode (+5/-5)
btrfs: Make clone_update_extent_map take btrfs_inode (+13/-14)
btrfs: Make btrfs_mark_extent_written take btrfs_inode (+6/-6)
btrfs: Make btrfs_drop_extent_cache take btrfs_inode (+30/-26)
btrfs: Make calc_csum_metadata_size take btrfs_inode (+12/-15)
btrfs: Make drop_outstanding_extent take btrfs_inode (+11/-12)
btrfs: Make btrfs_del_delalloc_inode take btrfs_inode (+7/-7)
btrfs: make btrfs_log_inode_parent take btrfs_inode (+24/-26)
btrfs: Make btrfs_set_inode_index take btrfs_inode (+13/-13)
btrfs: Make btrfs_clear_bit_hook take btrfs_inode (+25/-21)
btrfs: Make check_extent_to_block take btrfs_inode (+6/-5)
btrfs: make check_compressed_csum take btrfs_inode (+4/-5)
btrfs: Make btrfs_insert_dir_item take btrfs_inode (+7/-7)
btrfs: Make btrfs_log_all_parents take btrfs_inode (+5/-5)
btrfs: Make btrfs_i_size_write take btrfs_inode (+18/-19)
btrfs: make repair_io_failure take btrfs_inode (+12/-11)
btrfs: Make btrfs_orphan_add take btrfs_inode (+24/-22)
btrfs: make btrfs_orphan_del take btrfs_inode (+20/-20)
btrfs: make clean_io_failure take btrfs_inode (+15/-14)
btrfs: Make btrfs_add_nondir take btrfs_inode (+13/-9)
btrfs: make free_io_failure take btrfs_inode (+13/-11)
btrfs: Make check_can_nocow take btrfs_inode (+12/-10)
btrfs: Make btrfs_add_link take btrfs_inode (+26/-23)
btrfs: Make get_extent_t take btrfs_inode (+59/-54)
btrfs: Make hole_mergeable take btrfs_inode (+5/-4)
btrfs: Make fill_holes take btrfs_inode (+18/-19)

David Sterba (16) commits (+139/-124):
btrfs: use predefined limits for calculating maximum number of pages for compression (+6/-5)
btrfs: derive maximum output size in the compression implementation (+9/-14)
btrfs: merge nr_pages input and output parameter in compress_pages (+11/-15)
btrfs: merge length input and output parameter in compress_pages (+18/-20)
btrfs: add dummy callback for readpage_io_failed and drop checks (+10/-3)
btrfs: do proper error handling in btrfs_insert_xattr_item (+2/-1)
btrfs: drop checks for mandatory extent_io_ops callbacks (+3/-4)
btrfs: constify device path passed to relevant helpers (+22/-18)
btrfs: document existence of extent_io ops callbacks (+26/-11)
btrfs: handle allocation error in update_dev_stat_item (+2/-1)
btrfs: export compression buffer limits in a header (+15/-10)
btrfs: constify name of subvolume in creation helpers (+3/-3)
btrfs: constify buffers used by compression helpers (+3/-3)
btrfs: remove BUG_ON from __tree_mod_log_insert (+0/-2)
btrfs: constify input buffer of btrfs_csum_data (+3/-3)
btrfs: let writepage_end_io_hook return void (+6/-11)

Filipe Manana (8) commits (+163/-27):
Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled (+5/-0)
Btrfs: try harder to migrate items to left sibling before splitting a leaf (+7/-0)
Btrfs: fix assertion failure when freeing block groups at close_ctree() (+9/-6)
Btrfs: incremental send, fix unnecessary hole writes for sparse files (+86/-2)
Btrfs: fix use-after-free due to wrong order of destroying work queues (+7/-2)
Btrfs: incremental send, do not delay rename when parent inode is new (+16/-3)
Btrfs: bulk delete checksum items in the same leaf (+27/-1)

Robbie Ko (3) commits 

[GIT PULL] Btrfs

2017-03-02 Thread Chris Mason
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Has Btrfs round two.  These are mostly a continuation of Dave Sterba's 
collection
of cleanups, but Filipe also has some bug fixes and performance improvements.

Nikolay Borisov (42) commits (+611/-579):
btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode (+14/-14)
btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode (+39/-38)
btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode (+10/-8)
btrfs: all btrfs_delalloc_release_metadata take btrfs_inode (+22/-19)
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode (+3/-4)
btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode (+7/-6)
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode (+3/-3)
btrfs: Make btrfs_orphan_release_metadata take btrfs_inode (+8/-8)
btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode (+7/-7)
btrfs: Make check_parent_dirs_for_sync take btrfs_inode (+14/-14)
btrfs: make btrfs_free_io_failure_record take btrfs_inode (+9/-7)
btrfs: Make btrfs_lookup_ordered_range take btrfs_inode (+19/-18)
btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode (+17/-16)
btrfs: make btrfs_print_data_csum_error take btrfs_inode (+8/-7)
btrfs: make btrfs_is_free_space_inode take btrfs_inode (+20/-19)
btrfs: make btrfs_set_inode_index_count take btrfs_inode (+8/-8)
btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode (+5/-5)
btrfs: Make clone_update_extent_map take btrfs_inode (+13/-14)
btrfs: Make btrfs_mark_extent_written take btrfs_inode (+6/-6)
btrfs: Make btrfs_drop_extent_cache take btrfs_inode (+30/-26)
btrfs: Make calc_csum_metadata_size take btrfs_inode (+12/-15)
btrfs: Make drop_outstanding_extent take btrfs_inode (+11/-12)
btrfs: Make btrfs_del_delalloc_inode take btrfs_inode (+7/-7)
btrfs: make btrfs_log_inode_parent take btrfs_inode (+24/-26)
btrfs: Make btrfs_set_inode_index take btrfs_inode (+13/-13)
btrfs: Make btrfs_clear_bit_hook take btrfs_inode (+25/-21)
btrfs: Make check_extent_to_block take btrfs_inode (+6/-5)
btrfs: make check_compressed_csum take btrfs_inode (+4/-5)
btrfs: Make btrfs_insert_dir_item take btrfs_inode (+7/-7)
btrfs: Make btrfs_log_all_parents take btrfs_inode (+5/-5)
btrfs: Make btrfs_i_size_write take btrfs_inode (+18/-19)
btrfs: make repair_io_failure take btrfs_inode (+12/-11)
btrfs: Make btrfs_orphan_add take btrfs_inode (+24/-22)
btrfs: make btrfs_orphan_del take btrfs_inode (+20/-20)
btrfs: make clean_io_failure take btrfs_inode (+15/-14)
btrfs: Make btrfs_add_nondir take btrfs_inode (+13/-9)
btrfs: make free_io_failure take btrfs_inode (+13/-11)
btrfs: Make check_can_nocow take btrfs_inode (+12/-10)
btrfs: Make btrfs_add_link take btrfs_inode (+26/-23)
btrfs: Make get_extent_t take btrfs_inode (+59/-54)
btrfs: Make hole_mergeable take btrfs_inode (+5/-4)
btrfs: Make fill_holes take btrfs_inode (+18/-19)

David Sterba (16) commits (+139/-124):
btrfs: use predefined limits for calculating maximum number of pages for compression (+6/-5)
btrfs: derive maximum output size in the compression implementation (+9/-14)
btrfs: merge nr_pages input and output parameter in compress_pages (+11/-15)
btrfs: merge length input and output parameter in compress_pages (+18/-20)
btrfs: add dummy callback for readpage_io_failed and drop checks (+10/-3)
btrfs: do proper error handling in btrfs_insert_xattr_item (+2/-1)
btrfs: drop checks for mandatory extent_io_ops callbacks (+3/-4)
btrfs: constify device path passed to relevant helpers (+22/-18)
btrfs: document existence of extent_io ops callbacks (+26/-11)
btrfs: handle allocation error in update_dev_stat_item (+2/-1)
btrfs: export compression buffer limits in a header (+15/-10)
btrfs: constify name of subvolume in creation helpers (+3/-3)
btrfs: constify buffers used by compression helpers (+3/-3)
btrfs: remove BUG_ON from __tree_mod_log_insert (+0/-2)
btrfs: constify input buffer of btrfs_csum_data (+3/-3)
btrfs: let writepage_end_io_hook return void (+6/-11)

Filipe Manana (8) commits (+163/-27):
Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled (+5/-0)
Btrfs: try harder to migrate items to left sibling before splitting a leaf (+7/-0)
Btrfs: fix assertion failure when freeing block groups at close_ctree() (+9/-6)
Btrfs: incremental send, fix unnecessary hole writes for sparse files (+86/-2)
Btrfs: fix use-after-free due to wrong order of destroying work queues (+7/-2)
Btrfs: incremental send, do not delay rename when parent inode is new (+16/-3)
Btrfs: fix data loss after truncate when using the no-holes feature (+6/-13)
Btrfs: bulk delete checksum items in the same leaf (+27/-1)

Robbie Ko (3) commits 

[GIT PULL] Btrfs

2017-02-24 Thread Chris Mason
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Has a series of fixes and cleanups that Dave Sterba has been collecting:

There is a pretty big variety here, cleaning up internal APIs and fixing 
corner cases.

David Sterba (46) commits (+235/-313):
 btrfs: remove unused parameter from btrfs_subvolume_release_metadata (+6/-11)
 btrfs: remove pointless rcu protection from btrfs_qgroup_inherit (+0/-2)
 btrfs: check quota status earlier and don't do unnecessary frees (+3/-2)
 btrfs: remove unused parameter from btrfs_prepare_extent_commit (+3/-5)
 btrfs: remove unnecessary mutex lock in qgroup_account_snapshot (+1/-5)
 btrfs: embed extent_changeset::range_changed to the structure (+11/-17)
 btrfs: remove unused parameter from cleanup_write_cache_enospc (+2/-3)
 btrfs: remove unused parameters from __btrfs_write_out_cache (+3/-8)
 btrfs: remove unused parameter from clone_copy_inline_extent (+2/-3)
 btrfs: remove unused parameter from extent_write_cache_pages (+2/-4)
 btrfs: remove unused parameter from tree_move_next_or_upnext (+2/-4)
 btrfs: remove unused parameter from btrfs_check_super_valid (+3/-5)
 btrfs: remove unused logic of limiting async delalloc pages (+0/-7)
 btrfs: fix over-80 lines introduced by previous cleanups (+74/-63)
 btrfs: remove unused parameter from read_block_for_search (+5/-5)
 btrfs: remove unused parameter from adjust_slots_upwards (+2/-3)
 btrfs: remove unused parameter from init_first_rw_device (+3/-5)
 btrfs: make space cache inode readahead failure nonfatal (+3/-7)
 btrfs: remove unused parameters from scrub_setup_wr_ctx (+3/-7)
 btrfs: remove unused parameter from __btrfs_alloc_chunk (+4/-6)
 btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE (+23/-31)
 btrfs: remove unused parameter from submit_extent_page (+3/-9)
 btrfs: remove unused parameter from clean_tree_block (+17/-19)
 btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation (+2/-2)
 btrfs: remove unused parameter from __add_inline_refs (+2/-3)
 btrfs: remove unused parameter from add_pending_csums (+2/-4)
 btrfs: remove unused parameter from update_nr_written (+4/-4)
 btrfs: remove unused parameter from __push_leaf_right (+2/-3)
 btrfs: remove unused parameter from check_async_write (+2/-2)
 btrfs: remove unused parameter from btrfs_fill_super (+2/-3)
 btrfs: remove unused parameter from __push_leaf_left (+2/-3)
 btrfs: remove unused parameter from write_dev_supers (+3/-3)
 btrfs: remove unused parameter from __add_inode_ref (+1/-2)
 btrfs: remove unused parameters from btrfs_cmp_data (+2/-3)
 btrfs: remove unused parameter from create_snapshot (+2/-2)
 btrfs: ulist: make the finalization function public (+2/-1)
 btrfs: remove unused parameter from tree_move_down (+2/-2)
 btrfs: ulist: rename ulist_fini to ulist_release (+10/-10)
 btrfs: qgroups: make __del_qgroup_relation static (+1/-1)
 btrfs: use GFP_KERNEL in btrfs_read_qgroup_config (+1/-1)
 btrfs: remove unused parameter from split_item (+2/-3)
 btrfs: merge two superblock writing helpers (+4/-11)
 btrfs: qgroups: opencode qgroup_free helper (+9/-9)
 btrfs: use GFP_KERNEL in btrfs_quota_enable (+1/-1)
 btrfs: use GFP_KERNEL in create_snapshot (+2/-2)
 btrfs: remove unused ulist members (+0/-7)

Nikolay Borisov (36) commits (+476/-480):
 btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode (+8/-8)
 btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode (+5/-5)
 btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode (+4/-4)
 btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode (+6/-6)
 btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode (+5/-6)
 btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode (+4/-4)
 btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode (+5/-5)
 btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode (+6/-6)
 btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode (+5/-5)
 btrfs: Make btrfs_check_ref_name_override take btrfs_inode (+4/-5)
 btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode (+6/-6)
 btrfs: Make btrfs_must_commit_transaction take btrfs_inode (+9/-9)
 btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode (+7/-7)
 btrfs: Make btrfs_log_changed_extents take btrfs_inode (+11/-11)
 btrfs: Make btrfs_record_unlink_dir take btrfs_inode (+14/-14)
 btrfs: Make btrfs_remove_delayed_node take btrfs_inode (+5/-5)
 btrfs: Make btrfs_get_logged_extents take btrfs_inode (+4/-4)
 btrfs: Make btrfs_log_trailing_hole take btrfs_inode (+4/-4)
 btrfs: Make btrfs_get_delayed_node take btrfs_inode (+8/-9)
 btrfs: Make btrfs_ino take a struct btrfs_inode (+151/-151)
 btrfs: Make log_directory_changes take btrfs_inode (+5/-6)
   

[GIT PULL] Btrfs

2017-02-11 Thread Chris Mason
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Has two last minute fixes.  The highest priority here is a regression 
fix for the decompression code, but we also fixed up a problem with the 
32 bit compat ioctls.

The decompression bug could hand back the wrong data on big reads when 
zlib was used.  I have a larger cleanup to make the math here less error 
prone, but at this stage in the release Omar's patch is the best choice.

Omar Sandoval (1) commits (+24/-15):
 Btrfs: fix btrfs_decompress_buf2page()

Jeff Mahoney (1) commits (+4/-2):
 btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

Total: (2) commits (+28/-17)

  fs/btrfs/compression.c | 39 ---
  fs/btrfs/ioctl.c   |  6 --
  2 files changed, 28 insertions(+), 17 deletions(-)
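
The compat ioctl fix listed above is easier to picture with a toy dispatch model. The sketch below is only an assumption about the general shape of this kind of bug, not the actual btrfs code: a compat handler that translates a few 32-bit command numbers but rejects everything else, next to one that passes untranslated commands on to the native handler. Every name and command number here is hypothetical.

```c
#include <errno.h>

/* Hypothetical command numbers; real ioctl numbers are encoded differently. */
enum {
    TOY_IOC32_GETFLAGS = 1,  /* 32-bit variant that needs translation */
    TOY_IOC_GETFLAGS   = 2,  /* native equivalent */
    TOY_IOC_SNAP       = 3,  /* same number for 32-bit and 64-bit callers */
};

/* Stand-in for a kernel-internal "not handled here" return code. */
#define TOY_ENOIOCTLCMD 515

/* Native handler: knows only the native command numbers. */
static long toy_native_ioctl(unsigned int cmd)
{
    switch (cmd) {
    case TOY_IOC_GETFLAGS:
    case TOY_IOC_SNAP:
        return 0;
    default:
        return -ENOTTY;
    }
}

/* Strict variant: anything that is not a known 32-bit command is rejected. */
static long toy_compat_ioctl_strict(unsigned int cmd)
{
    switch (cmd) {
    case TOY_IOC32_GETFLAGS:
        cmd = TOY_IOC_GETFLAGS;
        break;
    default:
        return -TOY_ENOIOCTLCMD;
    }
    return toy_native_ioctl(cmd);
}

/* Pass-through variant: translate what needs it, forward everything else. */
static long toy_compat_ioctl_passthrough(unsigned int cmd)
{
    switch (cmd) {
    case TOY_IOC32_GETFLAGS:
        cmd = TOY_IOC_GETFLAGS;
        break;
    }
    return toy_native_ioctl(cmd);
}
```

In this model a 32-bit caller issuing `TOY_IOC_SNAP` gets an error from the strict variant and succeeds with the pass-through variant, while the translated command works in both.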


[GIT PULL] Btrfs

2017-01-27 Thread Chris Mason
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Has some fixes that we've collected from the list.  We still have one 
more pending to nail down a regression in lzo compression, but I wanted 
to get this batch out the door.

Omar Sandoval (3) commits (+2/-6):
 Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations (+0/-2)
 Btrfs: remove old tree_root case in btrfs_read_locked_inode() (+1/-4)
 Btrfs: disable xattr operations on subvolume directories (+1/-0)

Liu Bo (1) commits (+12/-1):
 Btrfs: fix truncate down when no_holes feature is enabled

Chandan Rajendra (1) commits (+2/-2):
 Btrfs: Fix deadlock between direct IO and fast fsync

Wang Xiaoguang (1) commits (+1/-0):
 btrfs: fix false enospc error when truncating heavily reflinked file

Total: (6) commits (+17/-9)

  fs/btrfs/inode.c | 26 +-
  1 file changed, 17 insertions(+), 9 deletions(-)


[GIT PULL] Btrfs fixes

2017-01-13 Thread Chris Mason
Hi Linus,

Dave Sterba queued up a few fixes for btrfs.  I have them in my
for-linus-4.10 branch:

These are all over the place.  The tracepoint part of the pull fixes a
crash and adds a little more information to two tracepoints, while the
rest are good old fashioned fixes.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Liu Bo (5) commits (+34/-11):
Btrfs: adjust outstanding_extents counter properly when dio write is split (+9/-2)
Btrfs: add truncated_len for ordered extent tracepoints (+4/-0)
Btrfs: use down_read_nested to make lockdep silent (+2/-1)
Btrfs: add 'inode' for extent map tracepoint (+9/-5)
Btrfs: fix lockdep warning about log_mutex (+10/-3)

David Sterba (2) commits (+80/-69):
btrfs: fix crash when tracepoint arguments are freed by wq callbacks (+24/-13)
btrfs: make tracepoint format strings more compact (+56/-56)

Jeff Mahoney (2) commits (+4/-1):
btrfs: fix locking when we put back a delayed ref that's too new (+1/-1)
btrfs: fix error handling when run_delayed_extent_op fails (+3/-0)

Pan Bian (1) commits (+1/-3):
btrfs: return the actual error value from  from btrfs_uuid_tree_iterate

Total: (10) commits (+119/-84)

 fs/btrfs/async-thread.c  |  15 +++--
 fs/btrfs/extent-tree.c   |   8 ++-
 fs/btrfs/inode.c |  13 +++-
 fs/btrfs/tree-log.c  |  13 +++-
 fs/btrfs/uuid-tree.c |   4 +-
 include/trace/events/btrfs.h | 146 +++
 6 files changed, 117 insertions(+), 82 deletions(-)


Re: [Regression 4.7-rc1] btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl

2017-01-06 Thread Chris Mason

On 01/06/2017 12:22 PM, Joseph Salisbury wrote:

Hi Luke,

A kernel bug report was opened against Ubuntu [0].  This bug was fixed
by the following commit in v4.7-rc1:


commit 4c63c2454eff996c5e27991221106eb511f7db38

Author: Luke Dashjr 
Date:   Thu Oct 29 08:22:21 2015 +

btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in
btrfs_ioctl


However, this commit introduced a new regression.  With this commit
applied, "btrfs fi show" no longer works and the btrfs snapshot
functionality breaks.



I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?


This is working for me.  Could you please include an strace of the problem?

Thanks!

-chris



Re: OOM: Better, but still there on

2016-12-21 Thread Chris Mason

On Wed, Dec 21, 2016 at 12:16:53PM +0100, Michal Hocko wrote:

On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:

One thing to note here, when we are talking about 32b kernel, things
have changed in 4.8 when we moved from the zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing but it
might be also related to something else. So I am rather not trying to
blame 32b yet...


It might be interesting to put tracing on releasepage and see if btrfs 
is pinning pages around.  I can't see how 32bit kernels would be 
different, but maybe we're hitting a weird corner.


-chris



Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Chris Mason

On 12/16/2016 05:14 PM, Michal Hocko wrote:

On Fri 16-12-16 13:15:18, Chris Mason wrote:

On 12/16/2016 02:39 AM, Michal Hocko wrote:

[...]

I believe the right way to go around this is to pursue what I've started
in [1]. I will try to prepare something for testing today for you. Stay
tuned. But I would be really happy if somebody from the btrfs camp could
check the NOFS aspect of this allocation. We have already seen
allocation stalls from this path quite recently


Just double checking, are you asking why we're using GFP_NOFS to avoid going
into btrfs from the btrfs writepages call, or are you asking why we aren't
allowing highmem?


I am more interested in the NOFS part. Why cannot this be a full
GFP_KERNEL context? What kind of locks we would lock up when recursing
to the fs via slab shrinkers?



Since this is our writepages call, any jump into direct reclaim would go 
to writepage, which would end up calling the same set of code to read 
metadata blocks, which would do a GFP_KERNEL allocation and end up back 
in writepage again.


We'd also have issues with blowing through transaction reservations 
since the writepage recursion would have to nest into the running 
transaction.
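
The recursion described above can be sketched as a toy userspace model. This is not kernel code: `TOY_GFP_FS`, `toy_alloc`, `toy_direct_reclaim`, `toy_writepage` and the depth cap are all hypothetical names, chosen only to show why an allocation made from the writepage path is not allowed to re-enter the filesystem.

```c
/* Hypothetical flag: set when the allocator may call back into the fs. */
#define TOY_GFP_FS     0x1u
#define TOY_GFP_KERNEL (TOY_GFP_FS)  /* models GFP_KERNEL: fs reentry allowed */
#define TOY_GFP_NOFS   0x0u          /* models GFP_NOFS: fs reentry forbidden */

/* Stand-in for "the stack or the transaction reservation runs out". */
#define TOY_MAX_DEPTH 8

static int toy_max_depth_seen;

static void toy_writepage(unsigned int gfp, int depth);

/* Direct reclaim writes dirty pages back only if the flags permit fs entry. */
static void toy_direct_reclaim(unsigned int gfp, int depth)
{
    if (gfp & TOY_GFP_FS)
        toy_writepage(gfp, depth + 1);
}

/* Assumption of the model: every allocation hits memory pressure. */
static void toy_alloc(unsigned int gfp, int depth)
{
    toy_direct_reclaim(gfp, depth);
}

/* writepage has to read metadata blocks, which requires an allocation. */
static void toy_writepage(unsigned int gfp, int depth)
{
    if (depth > toy_max_depth_seen)
        toy_max_depth_seen = depth;
    if (depth >= TOY_MAX_DEPTH)
        return;  /* the model stops here; a real system has no such safety net */
    toy_alloc(gfp, depth);
}

/* Returns the deepest writepage nesting reached with the given flags. */
static int toy_run(unsigned int gfp)
{
    toy_max_depth_seen = 0;
    toy_writepage(gfp, 1);
    return toy_max_depth_seen;
}
```

In this model `toy_run(TOY_GFP_NOFS)` never nests beyond the first writepage call, while `toy_run(TOY_GFP_KERNEL)` keeps re-entering writepage until it hits the artificial cap.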


-chris


