[ANNOUNCE] Three things.

2019-08-29 Thread Daniel Phillips
Hi folks, how's it going? Over here, we have been rather busy lately,
and for the last five years or so, to be honest. Today it is my pleasure
to announce three GPL open source projects:

1) Shardmap

Shardmap is the next generation directory index developed for Tux3,
which we are now offering as a much needed replacement for Ext4 HTree.
Shardmap matches HTree at all scales and usually beats it, has far
better readdir characteristics, and goes where HTree never did: up into
the billions of files per directory, with ease. Shardmap is also well on
its way to becoming a full-blown standalone KVS in user space, with
sub-microsecond ACID operations in persistent memory.[1]

Code for Shardmap is here:

https://github.com/danielbot/Shardmap

2) Teamachine

Teamachine is a direct threaded code virtual machine with a cycle time
of 0.7 nanoseconds, which may just make it the fastest interpreter in
the known universe. Teamachine embeds Shardmap as a set of micro-ops.
With Teamachine you can rapidly set up a suite of Shardmap unit tests,
or you can build a world-beating query engine. Or just kick back and
script your game engine; the possibilities are endless.
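
For anyone who has not seen the technique, here is a minimal sketch of
direct threaded dispatch using the GCC/Clang computed goto extension. It
is purely illustrative and is not Teamachine code; the ops and the
little program are made up for this example.

/* Minimal sketch of direct threading (not actual Teamachine code).
 * Each slot of the threaded "program" holds the address of its handler
 * label, so dispatch is one indirect jump per op - no central switch. */
#include <stdio.h>

static long run(void)
{
        void *program[] = { &&op_inc, &&op_inc, &&op_dbl, &&op_halt };
        void **ip = program;
        long acc = 0;

#define NEXT goto **ip++        /* fetch next handler address and jump */
        NEXT;

op_inc: acc += 1; NEXT;
op_dbl: acc *= 2; NEXT;
op_halt: return acc;            /* (0 + 1 + 1) * 2 = 4 */
#undef NEXT
}

int main(void)
{
        printf("%ld\n", run()); /* prints 4 */
        return 0;
}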

Code for Teamachine is here:

https://github.com/danielbot/TeaMachine

3) Tux3

Tux3 is still alive, is still maintained against current mainline, and
is still faster, lighter, and more ACID than any other general purpose
Linux file system. Inasmuch as other devs have discovered that the same
issue cited as the blocker for merging Tux3 (get_user_pages) is also
problematic for kernel code that is already merged, I propose today that
we merge Tux3 without further ado, so that we can proceed to develop a
good solution together, as is right, proper and just.

Code for Tux3 is here:

https://github.com/OGAWAHirofumi/tux3/tree/hirofumi

Everyone is welcome to join OFTC #tux3 and/or post to:

   http://tux3.org/cgi-bin/mailman/listinfo/tux3

to discuss these things, or anything at all. Fun times. See you there!

STANDARD DISCLAIMER: SHARDMAP WILL EAT YOUR DATA.[2] TEAMACHINE WILL
HALT YOUR MACHINE AND SET IT ON FIRE. DOWNLOAD AND RUN OF YOUR OWN FREE
WILL.

[1] Big shoutout to Yahoo! Japan for supporting Shardmap work.

[2] Tux3 is actually pretty good about not eating your data, but that is
another thing.

NB: followup posts are in the works regarding the detailed nature and
status of Shardmap, Teamachine and Tux3.

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:

Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.


Now, let me try to lead the way by being specific. RDMA was raised
as a potential failure case for Tux3 page forking. But the RDMA API
does not let you use memory mmapped by Tux3 as a source or destination
of IO. Instead, it sets up its own pages and hands them out to the
RDMA app from a pool. So no issue. One down, right?

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote:

On Fri, 31 Jul 2015, Daniel Phillips wrote:


On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ...


you weren't asking about any particular feature of Tux, you 
were asking if we were still willing to push out stuff that 
breaks for users and fix it later.


I think you left a key word out of my ask: "theoretical".

Especially for filesystems that can lose the data of whoever
is using it, the answer seems to be a clear no.


there may be bugs in what's pushed out that we don't know 
about. But we don't push out potential data corruption bugs that 
we do know about (or think we do)


so if you think this should be pushed out with this known 
corner case that's not handled properly, you have to convince 
people that it's _so_ improbable that they shouldn't care about 
it.


There should also be an onus on the person posing the worry
to prove their case beyond a reasonable doubt, which has not been
done in the case we are discussing here. Note: that is a technical
assessment to which a technical response is appropriate.

I do think that we should put a cap on this fencing and make
a real effort to get Tux3 into mainline. We should at least
set a ground rule that a problem should be proved real before it
becomes a reason to derail a project in the way that our project
has been derailed. Otherwise, it's hard to see what interest is
served.

OK, let's get back to the program. I accept your assertion that
we should convince people that the issue is improbable. To do
that, I need a specific issue to address. So far, no such issue
has been provided with specificity. Do you see why this is
frustrating?

Please, community. Give us specific issues to address, or give us
some way out of this eternal limbo. Or better, let's go back to the
old way of doing things in Linux, which is what got us where we
are today. Not this.

Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:

We, the Linux Community have less tolerance for losing people's data and 
preventing them from operating than we used to when it was all tinkerer's 
personal data and secondary systems.

So rather than pushing optimizations out to everyone and seeing what breaks, we 
now do more testing and checking for failures before pushing things out.


By the way, I am curious about whose data you think will get lost
as a result of pushing out Tux3 with a possible theoretical bug
in a wildly improbable scenario that has not actually been
described with sufficient specificity to falsify, let alone
demonstrated.

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:

If you define this as "losing our mojo", then yes we have.


A pity. There remains so much to do that simply will not get
done in the absence of mojo.

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:

Returning ENOSPC when you have free space you can't yet prove is safer than
not returning it and risking a data loss when you get hit by a write/commit
storm. :)


Remember when delayed allocation was scary and unproven, because proving
that ENOSPC will always be returned when needed is extremely difficult?
But the performance advantage was compelling, so we just worked at it
until it worked. There were times when it didn't work properly, but the
code was in the tree so it got fixed.

It's like that now with page forking - a new technique with compelling
advantages, and some challenges. In the past, we (the Linux community)
would rise to the challenge and err on the side of pushing optimizations
in early. That was our mojo, and that is how Linux became the dominant
operating system it is today. Do we, the Linux community, still have that
mojo?

Regards,

Daniel


Re: Tux3 Report: How fast can we fail?

2015-05-27 Thread Daniel Phillips


On 05/27/2015 02:39 PM, Pavel Machek wrote:
> On Wed 2015-05-27 11:28:50, Daniel Phillips wrote:
>> On Tuesday, May 26, 2015 11:41:39 PM PDT, Mosis Tembo wrote:
>>> On Tue, May 26, 2015 at 6:03 PM, Pavel Machek  wrote:
>>>
>>>>
>>>>> We identified the following quality metrics for this algorithm:
>>>>>
>>>>> 1) Never fails to detect out of space in the front end.
>>>>> 2) Always fills a volume to 100% before reporting out of space.
>>>>> 3) Allows rm, rmdir and truncate even when a volume is full.
>>>
>>> This is definitely nonsense. You can not rm, rmdir and truncate
>>> when the volume is full. You will need a free space on disk to perform
>>> such operations. Do you know why?
>>
>> Because some extra space needs to be on the volume in order to do the
>> atomic commit. Specifically, there must be enough extra space to keep
>> both old and new copies of any changed metadata, plus enough space for
>> new data or metadata. You are almost right: we can't support rm, rmdir
>> or truncate _with atomic commit_ unless some space is available on the
>> volume. So we keep a small reserve to handle those operations, which
>> only those operations can access. We define the volume as "full" when
>> only the reserve remains. The reserve is not included in "available"
>> blocks reported to statfs, so the volume appears to be 100% full when
>> only the reserve remains.
>>
>> For Tux3, that reserve is variable - about 1% of free space, declining
>> to a minimum of 10 blocks as free space runs out. Eventually, we will
>> reduce the minimum a bit as we develop finer control over how free
>> space is used in very low space conditions, but 10 blocks is not bad
>> at all. With no journal and only 10 blocks of unusable space, we do
>> pretty well with tiny volumes.
> 
> Yeah. Filesystem that could not do rm on full filesystem would be
> braindead.
> 
> Now, what about
> 
> 1) writing to already-allocated space in existing files?

As I mentioned earlier, it seems to work pretty well in Tux3. But do
user applications really expect it to work? I do not know of any;
perhaps you do.

Incidentally, I have been torture testing this very property using a
32K filesystem consisting of 64 x 512 byte blocks, with repeated dd,
mknod, rm, etc. Just to show that we are serious about getting this
part right.

> 2) writing to already-allocated space in existing files using mmap?

Not part of the preliminary nospace patch, but planned. I intend to
work on that detail after merge.

The problem is almost the same as write(2) in that the reserve must be
large enough to accommodate both old and new versions of all data
blocks, otherwise we lose our ACID, which we will go to great lengths
to avoid losing. The thing that makes this work nicely is the way the
delta shrinks as freespace runs out, which is the central point of our
new nospace algorithm.
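
To make that concrete, the front end accounting works roughly along the
lines of the sketch below. The helper name force_delta_commit() is
invented for illustration; this is not the actual Tux3 code (the real
thing is in the preliminary nospace patch).

/* Rough sketch only: every front end change charges its worst case
 * block cost against the delta budget. When the budget runs out, the
 * front end forces the current delta to commit, which recomputes a
 * smaller budget from the remaining free blocks, so deltas naturally
 * shrink as the volume fills up. */
static int charge_blocks(struct sb *sb, int cost)
{
        while (atomic_sub_return(cost, &sb->budget) < 0) {
                atomic_add(cost, &sb->budget);  /* undo the failed charge */
                if (!force_delta_commit(sb))    /* invented helper name */
                        return -ENOSPC;         /* really out of space */
        }
        return 0;                               /* charge accepted */
}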

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-05-27 Thread Daniel Phillips


On 05/27/2015 02:37 PM, Pavel Machek wrote:
> On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
>> On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
>>> On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
>>>> On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
>>>
>>> Umm. Why do you think it is only issue for executable files?
>>
>> I meant: files with code in them, that will be executed. Please excuse
>> me for colliding with the chmod sense. I will say "code files" to avoid
>> ambiguity.
>>
>>> I'm free to mmap() any file, and then execute from it.
>>>
>>> /lib/ld-linux.so /path/to/binary
>>>
>>> is known way to exec programs that do not have x bit set.
>>
>> So... why would I write to a code file at the same time as stepping
>> through it with ptrace? Should I expect ptrace to work perfectly if
>> I do that? What would "work perfectly" mean, if the code is changing
>> at the same time as being traced?
> 
> Do you have any imagination at all?

[Non-collegial rhetoric alert; it would be helpful to avoid that.]

> Reasons I should expect ptrace to work perfectly if I'm writing to
> file:
> 
> 1) it used to work before
> 
> 2) it used to work before
> 
> 3) it used to work before and regressions are not allowed

Are you sure that ptrace will work perfectly on a file that you are
writing to at the same time as tracing? If so, it has magic that I
do not understand. Could you please explain.

> 4) some kind of just in time compiler

A JIT that can tolerate being written to by a task it knows nothing
about, at the same time as it is generating code in the file? I do
not know of any such JIT.

> 5) some kind of malware, playing tricks so that you have trouble
> analyzing it

By writing to a code file? Then it already has write access to the
code file, so it has already gotten inside your security perimeter
without needing help from page fork. That said, we should be alert
for any new holes that page fork might open. But if there are any,
they should be actual holes, not theoretical ones.

> and of course,
> 
> 6) it used to work before.

I look forward to your explanation of how.

Regards,

Daniel



Re: Tux3 Report: How fast can we fail?

2015-05-27 Thread Daniel Phillips

On Tuesday, May 26, 2015 11:41:39 PM PDT, Mosis Tembo wrote:

On Tue, May 26, 2015 at 6:03 PM, Pavel Machek  wrote:




We identified the following quality metrics for this algorithm:

 1) Never fails to detect out of space in the front end.
 2) Always fills a volume to 100% before reporting out of space.
 3) Allows rm, rmdir and truncate even when a volume is full.


This is definitely nonsense. You can not rm, rmdir and truncate
when the volume is full. You will need a free space on disk to perform
such operations. Do you know why?


Because some extra space needs to be on the volume in order to do the
atomic commit. Specifically, there must be enough extra space to keep
both old and new copies of any changed metadata, plus enough space for
new data or metadata. You are almost right: we can't support rm, rmdir
or truncate _with atomic commit_ unless some space is available on the
volume. So we keep a small reserve to handle those operations, which
only those operations can access. We define the volume as "full" when
only the reserve remains. The reserve is not included in "available"
blocks reported to statfs, so the volume appears to be 100% full when
only the reserve remains.

For Tux3, that reserve is variable - about 1% of free space, declining
to a minimum of 10 blocks as free space runs out. Eventually, we will
reduce the minimum a bit as we develop finer control over how free
space is used in very low space conditions, but 10 blocks is not bad
at all. With no journal and only 10 blocks of unusable space, we do
pretty well with tiny volumes.
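
In code terms, the reserve calculation is roughly the sketch below. The
constants are approximate and this is not the exact Tux3 code; see the
preliminary nospace patch for the real thing.

/* Sketch of the variable reserve described above: roughly 1% of the
 * current free space, never falling below a small floor, so that rm,
 * rmdir and truncate can always make forward progress on a full volume. */
static block_t volume_reserve(block_t freeblocks)
{
        block_t reserve = freeblocks >> 7;      /* about 1% of free space */

        if (reserve < 10)                       /* floor of 10 blocks */
                reserve = 10;
        return reserve;
}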

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-27 Thread Daniel Phillips

On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:

On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:

On 05/14/2015 08:06 PM, Rik van Riel wrote: ...


Umm. Why do you think it is only issue for executable files?


I meant: files with code in them, that will be executed. Please excuse
me for colliding with the chmod sense. I will say "code files" to avoid
ambiguity.


I'm free to mmap() any file, and then execute from it.

/lib/ld-linux.so /path/to/binary

is known way to exec programs that do not have x bit set.


So... why would I write to a code file at the same time as stepping
through it with ptrace? Should I expect ptrace to work perfectly if
I do that? What would "work perfectly" mean, if the code is changing
at the same time as being traced?

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips
On 05/26/2015 02:36 PM, Rik van Riel wrote:
> On 05/26/2015 04:22 PM, Daniel Phillips wrote:
>> On 05/26/2015 02:00 AM, Jan Kara wrote:
>>> So my opinion is: Don't fork the page if page_count is elevated. You can
>>> just wait for the IO if you need stable pages in that case. It's slow but
>>> it's safe and it should be pretty rare. Is there any problem with that?
>>
>> That would be our fallback if anybody discovers a specific case where page
>> fork breaks something, which so far has not been demonstrated.
>>
>> With a known fallback, it is hard to see why we should delay merging over
>> that. Perfection has never been a requirement for merging filesystems. On
> 
> However, avoiding data corruption by erring on the side of safety is
> a pretty basic requirement.

Erring on the side of safety is still an error. As a community we have
never been fond of adding code or overhead to fix theoretical bugs. I
do not see why we should relax that principle now.

We can fix actual bugs, but theoretical bugs are only shapeless specters
passing in the night. We should not become frozen in fear of them.

>> the contrary, imperfection is a reason for merging, so that the many
>> eyeballs effect may prove its value.
> 
> If you skip the page fork when there is an elevated page count, tux3
> should be safe (at least from that aspect). Only do the COW when there
> is no "strange" use of the page going on.

Then you break the I in ACID. There must be a compelling reason to do
that.

Regards,

Daniel




Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips
On 05/26/2015 02:00 AM, Jan Kara wrote:
> On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
>> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
>>>  E.g. video drivers (or infiniband or direct IO for that matter) which
>>> have buffers in user memory (may be mmapped file), grab references to pages
>>> and hand out PFNs of those pages to the hardware to store data in them...
>>> If you fork a page after the driver has handed PFNs to the hardware, you've
>>> just lost all the writes hardware will do.
>>
>> Hi Jan,
>>
>> The page forked because somebody wrote to it with write(2) or mmap write at
>> the same time as a video driver (or infiniband or direct IO) was
>> doing io to
>> it. Isn't the application trying hard to lose data in that case? It
>> would not need page fork to lose data that way.
> 
> So I can think of two valid uses:
> 
> 1) You setup IO to part of a page and modify from userspace a different
>part of a page.

Suppose the use case is reading textures from video memory into an
mmapped file, and at the same time, the application is allowed to update
the textures in the file via mmap or write(2). Fork happens at mkwrite
time. If the page is already dirty, we do not fork it. The video API
must have made the page writable and dirty, so I do not see an issue.
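
Roughly, the decision described above looks like the sketch below. The
helper names and the delta test are invented for illustration; this is
not the actual Tux3 code.

/* Illustrative sketch only. The fork decision happens at ->page_mkwrite
 * time: a page already dirty for the delta now being built is simply
 * redirtied in place; otherwise a stable copy is split off for the
 * committing delta while the writer keeps dirtying the page cache copy.
 * page_delta() and split_stable_copy() are invented names. */
static void cow_on_mkwrite(struct page *page, unsigned delta)
{
        if (PageDirty(page) && page_delta(page) == delta)
                return;                         /* already ours, no fork */
        split_stable_copy(page, delta);         /* old data goes to writeout */
}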

> 2) At least for video drivers there is one ioctl() which creates object
>with buffers in memory and another ioctl() to actually ship it to hardware
>(may be called repeatedly). So in theory app could validly dirty the pages
>before it ships them to hardware. If this happens repeatedly and interacts
>badly with background writeback, you will end up with a forked page in a
>buffer and from that point on things are broken.

Writeback does not fork pages. An app may dirty a page that is in the
process of being shipped to hardware (it must be a distinct part of the
page, or it is a race) and the data being sent to hardware will not be
disturbed. If there is an issue here, I do not see it.

> So my opinion is: Don't fork the page if page_count is elevated. You can
> just wait for the IO if you need stable pages in that case. It's slow but
> it's safe and it should be pretty rare. Is there any problem with that?

That would be our fallback if anybody discovers a specific case where page
fork breaks something, which so far has not been demonstrated.

With a known fallback, it is hard to see why we should delay merging over
that. Perfection has never been a requirement for merging filesystems. On
the contrary, imperfection is a reason for merging, so that the many
eyeballs effect may prove its value.

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips
Hi Sergey,

On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote:
> 
> Hello,
> 
> is it possible to page-fork-bomb the system by some 'malicious' app?

Not in any new way. A page fork can happen either in the front end,
where it has to wait for memory like any other normal memory user,
or in the backend, where Tux3 may have privileged access to low
memory reserves and therefore must place bounds on its memory use
like any other user of low memory reserves.

This is not specific to page fork. We must place such bounds for
any memory that the backend uses. Fortunately, the backend does not
allocate memory extravagantly, for fork or anything else, so when
this does get to the top of our to-do list it should not be too
hard to deal with. We plan to attack that after merge, as we have
never observed a problem in practice. Rather, Tux3 already seems
to survive low memory situations pretty well compared to some other
filesystems.


Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips

On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:
I'm assuming that Rik is talking about whatever has the 
reference to the page via one of the methods that he talked 
about.


This would be a good moment to provide specifics.

Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips

On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:

  E.g. video drivers (or infiniband or direct IO for that matter) which
have buffers in user memory (may be mmapped file), grab references to pages
and hand out PFNs of those pages to the hardware to store data in them...
If you fork a page after the driver has handed PFNs to the hardware, you've
just lost all the writes hardware will do.


Hi Jan,

The page forked because somebody wrote to it with write(2) or mmap
write at the same time as a video driver (or infiniband or direct IO)
was doing IO to it. Isn't the application trying hard to lose data in
that case? It would not need page fork to lose data that way.


Regards,

Daniel


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips

On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
if the page gets modified again, will that cause any issues? 
what if the page gets modified before the copy gets written out, 
so that there are two dirty copies of the page in the process of 
being written?


David Lang


How is the page going to get modified again? A forked page isn't
mapped by a pte, so userspace can't modify it by mmap. The forked
page is not in the page cache, so userspace can't modify it by
posix file ops. So the writer would have to be in kernel. Tux3
knows what it is doing, so it won't modify the page. What kernel
code besides Tux3 will modify the page?

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-25 Thread Daniel Phillips

On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote:

On 05/21/2015 03:53 PM, Daniel Phillips wrote:

On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:

how do you prevent it from continuing to interact with the old version
of the page and never see updates or have it's changes reflected on
the current page?


Why would it do that, and what would be surprising about it? Did
you have a specific case in mind?


After a get_page(), page_cache_get(), or other equivalent
function, a piece of code has the expectation that it can
continue using that page until after it has released the
reference count.

This can be an arbitrarily long period of time.


It is perfectly welcome to keep using that page as long as it
wants, Tux3 does not care. When it lets go of the last reference
(and Tux3 has finished with it) then the page is freeable. Did
you have a more specific example where this would be an issue?
Are you talking about kernel or userspace code?

Regards,

Daniel



Re: [FYI] tux3: Core changes

2015-05-21 Thread Daniel Phillips

On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
how do you prevent it from continuing to interact with the old 
version of the page and never see updates or have it's changes 
reflected on the current page?


Why would it do that, and what would be surprising about it? Did
you have a specific case in mind?

Regards,

Daniel


[WIP][PATCH] tux3: preliminary nospace handling

2015-05-21 Thread Daniel Phillips
Hi Josef,

This is a rollup patch for preliminary nospace handling in Tux3, in 
line with my post here:

   http://lkml.iu.edu/hypermail/linux/kernel/1505.1/03167.html

You still have ENOSPC issues. Maybe it would be helpful to look at 
what we have done. Last week I saw a reproducible case where 1,000 tasks 
running in parallel went nospace with the volume only 28% full. You are 
also not giving a very good picture of how full the volume really is 
via df.

Our algorithm is pretty simple, reliable and fast. I do not see any 
reason why Btrfs could not do it basically the same way. In one way it 
is easier for you - you are not forced to commit the entire delta; you 
can choose the bits you want to force to disk as convenient. You have 
more kinds of cache objects to account for, but that should be just a 
detail. Your current frontend accounting looks plausible.

We're trying something a bit different with df, to see how it flies: 
rather than always returning the same number for f_blocks, we return 
the volume size less the accounting reserve, which is variable. The 
reserve gets smaller as free space gets smaller, so seeing it change is 
not a nasty surprise to the user, rather a pleasant one. What it does 
is make the 100% really be 100%, less just a handful of blocks, and it 
makes "used" and "available" add up exactly to "blocks". If the user 
wants to know how many blocks they really have, they can look at 
/proc/partitions.
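
Concretely, the statfs arithmetic is along the lines of the sketch
below. This is only an illustration of the description above, not the
code in the patch, and the superblock field names are assumptions.

/* Sketch only: report the volume size less the current accounting
 * reserve, so "used" + "available" adds up exactly to "blocks" and 100%
 * means full apart from the reserve. Field names are assumed. */
static void tux3_statfs_sketch(struct sb *sb, struct kstatfs *buf)
{
        block_t avail = sb->freeblocks > sb->reserve ?
                        sb->freeblocks - sb->reserve : 0;

        buf->f_blocks = sb->volblocks - sb->reserve;    /* reported size */
        buf->f_bfree  = avail;                          /* writable blocks */
        buf->f_bavail = avail;                          /* no root bonus */
}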

Regards,

Daniel

diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index 909a222..7043580 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -297,6 +297,7 @@ static int commit_delta(struct sb *sb)
tux3_wake_delta_commit(sb);
 
/* Commit was finished, apply defered bfree. */
+   sb->defreed = 0;
return unstash(sb, &sb->defree, apply_defered_bfree);
 }
 
@@ -321,13 +322,13 @@ static int need_unify(struct sb *sb)
 /* For debugging */
 void tux3_start_backend(struct sb *sb)
 {
-   assert(current->journal_info == NULL);
+   assert(!change_active());
current->journal_info = sb;
 }
 
 void tux3_end_backend(void)
 {
-   assert(current->journal_info);
+   assert(change_active());
current->journal_info = NULL;
 }
 
@@ -337,12 +338,103 @@ int tux3_under_backend(struct sb *sb)
return current->journal_info == sb;
 }
 
+/* Internal use only */
+static struct delta_ref *to_delta_ref(struct sb *sb, unsigned delta)
+{
+   return &sb->delta_refs[tux3_delta(delta)];
+}
+
+static block_t newfree(struct sb *sb)
+{
+   return sb->freeblocks + sb->defreed;
+}
+
+/*
+ * Reserve size should vary with budget. The reserve can include the
+ * log block overhead on the assumption that every block in the budget
+ * is a data block that generates one log record (or two?).
+ */
+block_t set_budget(struct sb *sb)
+{
+   block_t reserve = sb->freeblocks >> 7; /* FIXME: magic number */
+
+   if (1) {
+   if (reserve > max_reserve_blocks)
+   reserve = max_reserve_blocks;
+   if (reserve < min_reserve_blocks)
+   reserve = min_reserve_blocks;
+   } else if (0)
+   reserve = 10;
+
+   block_t budget = newfree(sb) - reserve;
+   if (1)
+   tux3_msg(sb, "set_budget: free %Li, budget %Li, reserve %Li", newfree(sb), budget, reserve);
+   sb->reserve = reserve;
+   atomic_set(&sb->budget, budget);
+   return reserve;
+}
+
+/*
+ * After transition, the front delta may have used some of the balance
+ * left over from this delta. The charged amount of the back delta is
+ * now stable and gives the exact balance at transition by subtracting
+ * from the old budget. The difference between the new budget and the
+ * balance at transition, which must never be negative, is added to
+ * the current balance, so the effect is exactly the same as if we had
+ * set the new budget and balance atomically at transition time. But
+ * we do not know the new balance at transition time and even if we
+ * did, we would need to add serialization against frontend changes,
+ * which are currently lockless and would like to stay that way. So we 
+ * let the current delta charge against the remaining balance until
+ * flush is done, here, then adjust the balance to what it would have
+ * been if the budget had been reset exactly at transition.
+ *
+ * We have:
+ *
+ *consumed = oldfree - free
+ *oldbudget = oldfree - reserve
+ *newbudget = free - reserve
+ *transition_balance = oldbudget - charged
+ * 
+ * Factoring out the reserve, the balance adjustment is:
+ * 
+ *adjust = newbudget - transition_balance
+ *   = (free - reserve) - ((oldfree - reserve) - charged)
+ *   = free + (charged - oldfree)
+ *   = charged + (free - oldfree)
+ *   = charged - consumed
+ *
+ * To extend for variable reserve size, add the difference between
+ * old and new reserve size to the balance adjustment.
+ */
+void reset_balance(struct sb *sb, unsigned 
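
A quick sanity check of the adjustment above, with invented numbers:
say oldfree = 1000 and reserve = 100 (held fixed), so oldbudget = 900.
Suppose the back delta charged 300 blocks but only consumed 250, so
free = 750 and newbudget = 650. The balance at transition was
900 - 300 = 600, and the adjustment is 650 - 600 = 50, which is exactly
charged - consumed = 300 - 250.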

Re: [FYI] tux3: Core changes

2015-05-21 Thread Daniel Phillips

On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
how do you prevent it from continuing to interact with the old 
version of the page and never see updates or have its changes 
reflected on the current page?


Why would it do that, and what would be surprising about it? Did
you have a specific case in mind?

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-20 Thread Daniel Phillips
On 05/20/2015 03:51 PM, Daniel Phillips wrote:
> On 05/20/2015 12:53 PM, Rik van Riel wrote:
>> How does tux3 prevent a user of find_get_page() from reading from
>> or writing into the pre-COW page, instead of the current page?
> 
> Careful control of the dirty bits (we have two of them, one each
> for front and back). That is what pagefork_for_blockdirty is about.

Ah, and of course it does not matter if a reader is on the
pre-cow page. It would be reading the earlier copy, which might
no longer be the current copy, but it raced with the write so
nobody should be surprised. That is a race even without page fork.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-20 Thread Daniel Phillips


On 05/20/2015 12:53 PM, Rik van Riel wrote:
> On 05/20/2015 12:22 PM, Daniel Phillips wrote:
>> On 05/20/2015 07:44 AM, Jan Kara wrote:
>>> On Tue 19-05-15 13:33:31, David Lang wrote:
> 
>>>   Yeah, that's what I meant. If you create a function which manipulates
>>> page cache, you better make it work with other functions manipulating page
>>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>>> developer. Sure you can document all the conditions under which the
>>> function is safe to use but a function that has several paragraphs in front
>>> of it explaining when it is safe to use isn't very good API...
>>
>> Violent agreement, of course. To put it in concrete terms, each of
>> the page fork support functions must be examined and determined
>> sane. They are:
>>
>>  * cow_replace_page_cache
>>  * cow_delete_from_page_cache
>>  * cow_clone_page
>>  * page_cow_one
>>  * page_cow_file
>>
>> Would it be useful to drill down into those, starting from the top
>> of the list?
> 
> How do these interact with other page cache functions, like
> find_get_page() ?

Nicely:

   
https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/filemap_mmap.c#L182

> How does tux3 prevent a user of find_get_page() from reading from
> or writing into the pre-COW page, instead of the current page?

Careful control of the dirty bits (we have two of them, one each
for front and back). That is what pagefork_for_blockdirty is about.
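
As a rough sketch of the decision that helper makes (not the actual
Tux3 code; page_dirty_in_back_delta() is an assumed predicate and the
signature of cow_clone_page(), listed above, is assumed too):

struct page *pagefork_sketch(struct page *page, unsigned newdelta)
{
	/*
	 * Already dirty in the delta now being flushed? Hand the
	 * frontend a fresh copy so the back copy stays stable for
	 * writeout; otherwise it is safe to redirty in place.
	 */
	if (page_dirty_in_back_delta(page, newdelta))
		return cow_clone_page(page);
	return page;
}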

Regards,

Daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-20 Thread Daniel Phillips


On 05/20/2015 07:44 AM, Jan Kara wrote:
> On Tue 19-05-15 13:33:31, David Lang wrote:
>> On Tue, 19 May 2015, Daniel Phillips wrote:
>>
>>>> I understand that Tux3 may avoid these issues due to some other mechanisms
>>>> it internally has but if page forking should get into mm subsystem, the
>>>> above must work.
>>>
>>> It does work, and by example, it does not need a lot of code to make
>>> it work, but the changes are not trivial. Tux3's delta writeback model
>>> will not suit everyone, so you can't just lift our code and add it to
>>> Ext4. Using it in Ext4 would require a per-inode writeback model, which
>>> looks practical to me but far from a weekend project. Maybe something
>>> to consider for Ext5.
>>>
>>> It is the job of new designs like Tux3 to chase after that final drop
>>> of performance, not our trusty Ext4 workhorse. Though stranger things
>>> have happened - as I recall, Ext4 had O(n) directory operations at one
>>> time. Fixing that was not easy, but we did it because we had to. Fixing
>>> Ext4's write performance is not urgent by comparison, and the barrier
>>> is high, you would want jbd3 for one thing.
>>>
>>> I think the meta-question you are asking is, where is the second user
>>> for this new CoW functionality? With a possible implication that if
>>> there is no second user then Tux3 cannot be merged. Is that is the
>>> question?
>>
>> I don't think they are asking for a second user. What they are
>> saying is that for this functionality to be accepted in the mm
>> subsystem, these problem cases need to work reliably, not just work
>> for Tux3 because of your implementation.
>>
>> So for things that you don't use, you need to make it an error if
>> they get used on a page that's been forked (or not be an error and
>> 'do the right thing')
>>
>> For cases where it doesn't matter because Tux3 controls the
>> writeback, and it's undefined in general what happens if writeback
>> is triggered twice on the same page, you will need to figure out how
>> to either prevent the second writeback from triggering if there's
>> one in process, or define how the two writebacks are going to happen
>> so that you can't end up with them re-ordered by some other
>> filesystem.
>>
>> I think that that's what's meant by the top statement that I left in
>> the quote. Even if your implementation details make it safe, these
>> need to be safe even without your implementation details to be
>> acceptable in the core kernel.
>   Yeah, that's what I meant. If you create a function which manipulates
> page cache, you better make it work with other functions manipulating page
> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
> developer. Sure you can document all the conditions under which the
> function is safe to use but a function that has several paragraphs in front
> of it explaining when it is safe to use isn't very good API...

Violent agreement, of course. To put it in concrete terms, each of
the page fork support functions must be examined and determined
sane. They are:

 * cow_replace_page_cache
 * cow_delete_from_page_cache
 * cow_clone_page
 * page_cow_one
 * page_cow_file

Would it be useful to drill down into those, starting from the top
of the list?

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-19 Thread Daniel Phillips
Hi Jan,

On 05/19/2015 07:00 AM, Jan Kara wrote:
> On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1]
>>
>>LSFMM: Page forking
>>http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
>   So here are a few things I find problematic about page forking (besides
> the cases with elevated page_count already discussed in this thread - there
> I believe that anything more complex than "wait for the IO instead of
> forking when page has elevated use count" isn't going to work. There are
> too many users depending on too subtle details of the behavior...). Some
> of them are actually mentioned in the above LWN article:
> 
> When you create a copy of a page and replace it in the radix tree, nobody
> in mm subsystem is aware that oldpage may be under writeback. That causes
> interesting issues:
> * truncate_inode_pages() can finish before all IO for the file is finished.
>   So far filesystems rely on the fact that once truncate_inode_pages()
>   finishes all running IO against the file is completed and new cannot be
>   submitted.

We do not use truncate_inode_pages because of issues like that. We use
some truncate helpers, which were available in some cases, or else had
to be implemented in Tux3 to make everything work properly. The details
are Hirofumi's stomping grounds. I am pretty sure that his solution is
good and tight, or Tux3 would not pass its torture tests.

> * Writeback can come and try to write newpage while oldpage is still under
>   IO. Then you'll have two IOs against one block which has undefined
>   results.

Those writebacks only come from Tux3 (or indirectly from fs-writeback,
through our writeback) so we are able to ensure that a dirty block is
only written once. (If redirtied, the block will fork so two dirty
blocks are written in two successive deltas.)
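
Concretely, as an invented timeline: block B is dirty in delta N and is
being flushed; a write(2) redirties B while that I/O is in flight; page
fork hands the frontend a fresh copy B' and leaves B stable; delta N
writes B once and delta N+1 writes B' once, so no delta ever issues two
I/Os against the same block.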

> * filemap_fdatawait() called from fsync() has additional problem that it is
>   not aware of oldpage and thus may return although IO hasn't finished yet.

We do not use filemap_fdatawait, instead, we wait on completion of our
own writeback, which is under our control.

> I understand that Tux3 may avoid these issues due to some other mechanisms
> it internally has but if page forking should get into mm subsystem, the
> above must work.

It does work, and by example, it does not need a lot of code to make
it work, but the changes are not trivial. Tux3's delta writeback model
will not suit everyone, so you can't just lift our code and add it to
Ext4. Using it in Ext4 would require a per-inode writeback model, which
looks practical to me but far from a weekend project. Maybe something
to consider for Ext5.

It is the job of new designs like Tux3 to chase after that final drop
of performance, not our trusty Ext4 workhorse. Though stranger things
have happened - as I recall, Ext4 had O(n) directory operations at one
time. Fixing that was not easy, but we did it because we had to. Fixing
Ext4's write performance is not urgent by comparison, and the barrier
is high, you would want jbd3 for one thing.

I think the meta-question you are asking is, where is the second user
for this new CoW functionality? With a possible implication that if
there is no second user then Tux3 cannot be merged. Is that is the
question?

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-18 Thread Daniel Phillips
On 05/17/2015 07:20 PM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
> 
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.
> 
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
> 
> I don't think your requirement is workable or reasonable.

Hi Rik,

You are quite right to poke at this aggressively. Whether or not
there is an issue needing fixing, we want to know the details. We
really need to do a deep dive in ptrace and know exactly what it
does, and whether Tux3 creates any new kind of hole. I really know
very little about ptrace at the moment, I only have heard that it
is a horrible hack we inherited from some place far away and a time
long ago.

A little guidance from you would help. Somewhere ptrace must modify
the executable page. Unlike uprobes, which makes sense to me, I did
not find where ptrace actually does that on a quick inspection.
Perhaps you could provide a pointer?

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-15 Thread Daniel Phillips


On 05/15/2015 01:09 AM, Mel Gorman wrote:
> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>> The issue is that things like ptrace, AIO, infiniband
>>>> RDMA, and other direct memory access subsystems can take
>>>> a reference to page A, which Tux3 clones into a new page B
>>>> when the process writes it.
>>>>
>>>> However, while the process now points at page B, ptrace,
>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>
>>>> This causes the process and the other subsystem to each
>>>> look at a different page, instead of at shared state,
>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>> invisible (or corrupted), etc...
>>>
>>> Is this a bit like page migration?
>>
>> Yes. Page migration will fail if there is an "extra"
>> reference to the page that is not accounted for by
>> the migration code.
> 
> When I said it's not like page migration, I was referring to the fact
> that a COW on a pinned page for RDMA is a different problem to page
> migration. The COW of a pinned page can lead to lost writes or
> corruption depending on the ordering of events.

I see the lost writes case, but not the corruption case. Do you
mean corruption by changing a page already in writeout? If so,
don't all filesystems have that problem?

If RDMA to a mmapped file races with write(2) to the same file,
maybe it is reasonable and expected to lose some data.

> Page migration fails
> when there are unexpected problems to avoid this class of issue which is
> fine for page migration but may be a critical failure in a filesystem
> depending on exactly why the copy is required.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-15 Thread Daniel Phillips
On 05/14/2015 08:06 PM, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>> This causes the process and the other subsystem to each
>>> look at a different page, instead of at shared state,
>>> causing ptrace to do nothing, AIO and RDMA data to be
>>> invisible (or corrupted), etc...
>>
>> Is this a bit like page migration?
> 
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
> 
> Only pages that have no extra refcount can be migrated.
> 
> Similarly, your cow code needs to fail if there is an
> extra reference count pinning the page. As long as
> the page has a user that you cannot migrate, you cannot
> move any of the other users over. They may rely on data
> written by the hidden-to-you user, and the hidden-to-you
> user may write to the page when you think it is a read
> only stable snapshot.

Please bear with me as I study these cases one by one.

The first one is ptrace. Only for executable files, right?
Maybe we don't need to fork pages in executable files.

Uprobes... If somebody puts a breakpoint in a page and
we fork it, the replacement page has a copy of the
breakpoint, and all the code on the page. Did anything
break?

Note: we have the option of being cowardly and just not
doing page forking for mmapped files, or certain kinds
of mmapped files, etc. But first we should give it the
old college try, to see if absolute perfection is
possible and practical.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [FYI] tux3: Core changes

2015-05-14 Thread Daniel Phillips
Hi Rik,

Added Mel, Andrea and Peterz to CC as interested parties. There are
probably others, please just jump in.

On 05/14/2015 05:59 AM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1]
>>
>>LSFMM: Page forking
>>http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
> 
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
> 
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
> 
> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...

Is this a bit like page migration?

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[FYI] tux3: Core changes

2015-05-14 Thread Daniel Phillips
Hi Rik,

Our linux-tux3 tree currently carries this 652 line diff
against core, to make Tux3 work. This is mainly by Hirofumi, except
the fs-writeback.c hook, which is by me. The main part you may be
interested in is rmap.c, which addresses the issues raised at the
2013 Linux Storage Filesystem and MM Summit in San Francisco.[1]

   LSFMM: Page forking
   http://lwn.net/Articles/548091/

This is just a FYI. An upcoming Tux3 report will be a tour of the page
forking design and implementation. For now, this is just to give a
general sense of what we have done. We heard there are concerns about
how ptrace will work. I really am not familiar with the issue, could
you please explain what you were thinking of there?

Enjoy,

Daniel

[1] Which happened to be a 15 minute bus ride away from me at the time.

diffstat tux3.core.patch
 fs/Makefile   |1 
 fs/fs-writeback.c |  100 +
 include/linux/fs.h|6 +
 include/linux/mm.h|5 +
 include/linux/pagemap.h   |2 
 include/linux/rmap.h  |   14 
 include/linux/writeback.h |   23 +++
 mm/filemap.c  |   82 +++
 mm/rmap.c |  139 ++
 mm/truncate.c |   98 
 10 files changed, 411 insertions(+), 59 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index 91fcfa3..44d7192 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS) += ext4/
 obj-$(CONFIG_JBD)  += jbd/
 obj-$(CONFIG_JBD2) += jbd2/
 obj-$(CONFIG_TUX3) += tux3/
-obj-$(CONFIG_TUX3_MMAP)+= tux3/
 obj-$(CONFIG_CRAMFS)   += cramfs/
 obj-$(CONFIG_SQUASHFS) += squashfs/
 obj-y  += ramfs/
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..fcd1c61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-   long nr_pages;
-   struct super_block *sb;
-   unsigned long *older_than_this;
-   enum writeback_sync_modes sync_mode;
-   unsigned int tagged_writepages:1;
-   unsigned int for_kupdate:1;
-   unsigned int range_cyclic:1;
-   unsigned int for_background:1;
-   unsigned int for_sync:1;/* sync(2) WB_SYNC_ALL writeback */
-   enum wb_reason reason;  /* why was writeback initiated? */
-
-   struct list_head list;  /* pending work list */
-   struct completion *done;/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   spin_lock(&bdi->wb.list_lock);
+   spin_lock(&inode->i_lock);
+   if (!(inode->i_state & I_DIRTY))
+   list_del_init(&inode->i_wb_list);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   spin_lock(&bdi->wb.list_lock);
+   inode->dirtied_when = jiffies;
+   list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+   spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info 
*bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-   struct bdi_writeback *wb,
-   struct wb_writeback_work *work)
+static long generic_writeback_sb_inodes(struct super_block *sb,
+   struct bdi_writeback *wb,
+   struct wb_writeback_work *work)
 {
struct writeback_control wbc = {
.sync_mode  = work->sync_mode,
@@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb,
return wrote;
 }
 
+static long writeback_sb_inodes(struct super_block *sb,
+   struct bdi_writeback *wb,
+   struct wb_writeback_work *work)
+{
+   if (sb->s_op->writeback) {
+   long ret;
+
+   spin_unlock(&wb->list_lock);
+   ret = 
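
For orientation, a hedged sketch (not from the Tux3 tree) of how a
filesystem backend might use the two helpers exported above;
myfs_flush_inode() and myfs_inode_redirtied() are assumed names:

static int myfs_flush_one(struct inode *inode)
{
	int err = myfs_flush_inode(inode);	/* assumed: write out this inode's dirty pages */

	if (myfs_inode_redirtied(inode))	/* assumed predicate */
		inode_writeback_touch(inode);	/* keep it queued, with a fresh timestamp */
	else
		inode_writeback_done(inode);	/* off the wb list if it is now clean */
	return err;
}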

[WIP] tux3: Optimized fsync

2015-05-14 Thread Daniel Phillips
Greetings,

This diff against head (f59558a04c5ad052dc03ceeda62ccf31f4ab0004) of

   https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi-user

provides the optimized fsync code that was used to generate the
benchmark results here:

   https://lkml.org/lkml/2015/4/28/838
   "How fast can we fsync?"

This patch also applies to:

   https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi

which is a 3.19 kernel cloned from mainline. (Preferred)

Build instructions are on the wiki:

   https://github.com/OGAWAHirofumi/linux-tux3/wiki

There is some slight skew in the instructions because this is
not on master yet.


***  Caveat: No out of space handling on this branch!  ***
*** If you run out of space you will get a mysterious assert ***


Enjoy!

Daniel

diff --git a/fs/tux3/buffer.c b/fs/tux3/buffer.c
index ef0d917..a141687 100644
--- a/fs/tux3/buffer.c
+++ b/fs/tux3/buffer.c
@@ -29,7 +29,7 @@ TUX3_DEFINE_STATE_FNS(unsigned long, buf, BUFDELTA_AVAIL, 
BUFDELTA_BITS,
  * may not work on all arch (If set_bit() and cmpxchg() is not
  * exclusive, this has race).
  */
-static void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
+void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
 {
unsigned long state, old_state;
 
diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index 909a222..955c441a 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -289,12 +289,13 @@ static int commit_delta(struct sb *sb)
req_flag |= REQ_NOIDLE | REQ_FLUSH | REQ_FUA;
}
 
-   trace("commit %i logblocks", be32_to_cpu(sb->super.logcount));
+   trace("commit %i logblocks", logcount(sb));
err = save_metablock(sb, req_flag);
if (err)
return err;
 
-   tux3_wake_delta_commit(sb);
+   if (!fsync_mode(sb))
+   tux3_wake_delta_commit(sb);
 
/* Commit was finished, apply defered bfree. */
	return unstash(sb, &sb->defree, apply_defered_bfree);
@@ -314,8 +315,7 @@ static void post_commit(struct sb *sb, unsigned delta)
 
 static int need_unify(struct sb *sb)
 {
-   static unsigned crudehack;
-   return !(++crudehack % 3);
+   return logcount(sb) > 300; /* FIXME: should be based on bandwidth and 
tunable */
 }
 
 /* For debugging */
@@ -359,7 +359,7 @@ static int do_commit(struct sb *sb, int flags)
 * FIXME: there is no need to commit if normal inodes are not
 * dirty? better way?
 */
-   if (!(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
+   if (0 && !(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
goto out;
 
/* Prepare to wait I/O */
@@ -402,6 +402,7 @@ static int do_commit(struct sb *sb, int flags)
 #endif
 
if ((!no_unify && need_unify(sb)) || (flags & __FORCE_UNIFY)) {
+   trace("unify %u, delta %u", sb->unify, delta);
err = unify_log(sb);
if (err)
goto error; /* FIXME: error handling */
diff --git a/fs/tux3/commit_flusher.c b/fs/tux3/commit_flusher.c
index 59d6781..31cd51e 100644
--- a/fs/tux3/commit_flusher.c
+++ b/fs/tux3/commit_flusher.c
@@ -198,6 +198,8 @@ long tux3_writeback(struct super_block *super, struct 
bdi_writeback *wb,
if (work->reason == WB_REASON_SYNC)
goto out;
 
+   trace("tux3_writeback, reason = %i", work->reason);
+   
if (work->reason == WB_REASON_TUX3_PENDING) {
struct tux3_wb_work *wb_work;
/* Specified target delta for staging. */
@@ -343,3 +345,7 @@ static void schedule_flush_delta(struct sb *sb, struct 
delta_ref *delta_ref)
sb->delta_pending++;
	wake_up_all(&sb->delta_transition_wq);
 }
+
+#ifdef __KERNEL__
+#include "commit_fsync.c"
+#endif
diff --git a/fs/tux3/commit_fsync.c b/fs/tux3/commit_fsync.c
new file mode 100644
index 000..9a59c59
--- /dev/null
+++ b/fs/tux3/commit_fsync.c
@@ -0,0 +1,341 @@
+/*
+ * Optimized fsync.
+ *
+ * Copyright (c) 2015 Daniel Phillips
+ */
+
+#include <linux/delay.h>
+
+static inline int fsync_pending(struct sb *sb)
+{
+   return atomic_read(&sb->fsync_pending);
+}
+
+static inline int delta_needed(struct sb *sb)
+{
+   return waitqueue_active(&sb->delta_transition_wq);
+}
+
+static inline int fsync_drain(struct sb *sb)
+{
+   return test_bit(TUX3_FSYNC_DRAIN_BIT, &sb->backend_state);
+}
+
+static inline unsigned fsync_group(struct sb *sb)
+{
+   return atomic_read(&sb->fsync_group);
+}
+
+static int suspend_transition(struct sb *sb)
+{
+   while (sb->suspended == NULL) {
+   if (!test_and_set_bit(TUX3_STATE_TRANSITION_BIT, 
>backend_state)) {
+   sb->suspended = delta_get(sb);
+ 

Re: [FYI] tux3: Core changes

2015-05-14 Thread Daniel Phillips
Hi Rik,

Added Mel, Andrea and Peterz to CC as interested parties. There are
probably others, please just jump in.

On 05/14/2015 05:59 AM, Rik van Riel wrote:
 On 05/14/2015 04:26 AM, Daniel Phillips wrote:
 Hi Rik,

 Our linux-tux3 tree currently currently carries this 652 line diff
 against core, to make Tux3 work. This is mainly by Hirofumi, except
 the fs-writeback.c hook, which is by me. The main part you may be
 interested in is rmap.c, which addresses the issues raised at the
 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]

LSFMM: Page forking
http://lwn.net/Articles/548091/

 This is just a FYI. An upcoming Tux3 report will be a tour of the page
 forking design and implementation. For now, this is just to give a
 general sense of what we have done. We heard there are concerns about
 how ptrace will work. I really am not familiar with the issue, could
 you please explain what you were thinking of there?
 
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.
 
 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.
 
 This causes the process and the other subsystem to each
 look at a different page, instead of at shared state,
 causing ptrace to do nothing, AIO and RDMA data to be
 invisible (or corrupted), etc...

Is this a bit like page migration?

Regards,

Daniel
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[WIP] tux3: Optimized fsync

2015-05-14 Thread Daniel Phillips
Greetings,

This diff against head (f59558a04c5ad052dc03ceeda62ccf31f4ab0004) of

   https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi-user

provides the optimized fsync code that was used to generate the
benchmark results here:

   https://lkml.org/lkml/2015/4/28/838
   How fast can we fsync?

This patch also applies to:

   https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi

which is a 3.19 kernel cloned from mainline. (Preferred)

Build instructions are on the wiki:

   https://github.com/OGAWAHirofumi/linux-tux3/wiki

There is some slight skew in the instructions because this is
not on master yet.


*  Caveat: No out of space handling on this branch!  ***
*** If you run out of space you will get a mysterious assert ***


Enjoy!

Daniel

diff --git a/fs/tux3/buffer.c b/fs/tux3/buffer.c
index ef0d917..a141687 100644
--- a/fs/tux3/buffer.c
+++ b/fs/tux3/buffer.c
@@ -29,7 +29,7 @@ TUX3_DEFINE_STATE_FNS(unsigned long, buf, BUFDELTA_AVAIL, 
BUFDELTA_BITS,
  * may not work on all arch (If set_bit() and cmpxchg() is not
  * exclusive, this has race).
  */
-static void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
+void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
 {
unsigned long state, old_state;
 
diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index 909a222..955c441a 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -289,12 +289,13 @@ static int commit_delta(struct sb *sb)
req_flag |= REQ_NOIDLE | REQ_FLUSH | REQ_FUA;
}
 
-   trace(commit %i logblocks, be32_to_cpu(sb-super.logcount));
+   trace(commit %i logblocks, logcount(sb));
err = save_metablock(sb, req_flag);
if (err)
return err;
 
-   tux3_wake_delta_commit(sb);
+   if (!fsync_mode(sb))
+   tux3_wake_delta_commit(sb);
 
/* Commit was finished, apply defered bfree. */
return unstash(sb, sb-defree, apply_defered_bfree);
@@ -314,8 +315,7 @@ static void post_commit(struct sb *sb, unsigned delta)
 
 static int need_unify(struct sb *sb)
 {
-   static unsigned crudehack;
-   return !(++crudehack % 3);
+   return logcount(sb)  300; /* FIXME: should be based on bandwidth and 
tunable */
 }
 
 /* For debugging */
@@ -359,7 +359,7 @@ static int do_commit(struct sb *sb, int flags)
 * FIXME: there is no need to commit if normal inodes are not
 * dirty? better way?
 */
-   if (!(flags  __FORCE_DELTA)  !tux3_has_dirty_inodes(sb, delta))
+   if (0  !(flags  __FORCE_DELTA)  !tux3_has_dirty_inodes(sb, delta))
goto out;
 
/* Prepare to wait I/O */
@@ -402,6 +402,7 @@ static int do_commit(struct sb *sb, int flags)
 #endif
 
if ((!no_unify  need_unify(sb)) || (flags  __FORCE_UNIFY)) {
+   trace(unify %u, delta %u, sb-unify, delta);
err = unify_log(sb);
if (err)
goto error; /* FIXME: error handling */
diff --git a/fs/tux3/commit_flusher.c b/fs/tux3/commit_flusher.c
index 59d6781..31cd51e 100644
--- a/fs/tux3/commit_flusher.c
+++ b/fs/tux3/commit_flusher.c
@@ -198,6 +198,8 @@ long tux3_writeback(struct super_block *super, struct 
bdi_writeback *wb,
if (work-reason == WB_REASON_SYNC)
goto out;
 
+   trace(tux3_writeback, reason = %i, work-reason);
+   
if (work-reason == WB_REASON_TUX3_PENDING) {
struct tux3_wb_work *wb_work;
/* Specified target delta for staging. */
@@ -343,3 +345,7 @@ static void schedule_flush_delta(struct sb *sb, struct 
delta_ref *delta_ref)
sb-delta_pending++;
wake_up_all(sb-delta_transition_wq);
 }
+
+#ifdef __KERNEL__
+#include commit_fsync.c
+#endif
diff --git a/fs/tux3/commit_fsync.c b/fs/tux3/commit_fsync.c
new file mode 100644
index 000..9a59c59
--- /dev/null
+++ b/fs/tux3/commit_fsync.c
@@ -0,0 +1,341 @@
+/*
+ * Optimized fsync.
+ *
+ * Copyright (c) 2015 Daniel Phillips
+ */
+
+#include linux/delay.h
+
+static inline int fsync_pending(struct sb *sb)
+{
+   return atomic_read(sb-fsync_pending);
+}
+
+static inline int delta_needed(struct sb *sb)
+{
+   return waitqueue_active(sb-delta_transition_wq);
+}
+
+static inline int fsync_drain(struct sb *sb)
+{
+   return test_bit(TUX3_FSYNC_DRAIN_BIT, sb-backend_state);
+}
+
+static inline unsigned fsync_group(struct sb *sb)
+{
+   return atomic_read(sb-fsync_group);
+}
+
+static int suspend_transition(struct sb *sb)
+{
+   while (sb-suspended == NULL) {
+   if (!test_and_set_bit(TUX3_STATE_TRANSITION_BIT, 
sb-backend_state)) {
+   sb-suspended = delta_get(sb);
+   return 1;
+   }
+   cpu_relax();
+   }
+   return 0;
+}
+
+static void resume_transition(struct sb *sb

[FYI] tux3: Core changes

2015-05-14 Thread Daniel Phillips
Hi Rik,

Our linux-tux3 tree currently currently carries this 652 line diff
against core, to make Tux3 work. This is mainly by Hirofumi, except
the fs-writeback.c hook, which is by me. The main part you may be
interested in is rmap.c, which addresses the issues raised at the
2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]

   LSFMM: Page forking
   http://lwn.net/Articles/548091/

This is just a FYI. An upcoming Tux3 report will be a tour of the page
forking design and implementation. For now, this is just to give a
general sense of what we have done. We heard there are concerns about
how ptrace will work. I really am not familiar with the issue, could
you please explain what you were thinking of there?

Enjoy,

Daniel

[1] Which happened to be a 15 minute bus ride away from me at the time.

diffstat tux3.core.patch
 fs/Makefile   |1 
 fs/fs-writeback.c |  100 +
 include/linux/fs.h|6 +
 include/linux/mm.h|5 +
 include/linux/pagemap.h   |2 
 include/linux/rmap.h  |   14 
 include/linux/writeback.h |   23 +++
 mm/filemap.c  |   82 +++
 mm/rmap.c |  139 ++
 mm/truncate.c |   98 
 10 files changed, 411 insertions(+), 59 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index 91fcfa3..44d7192 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS) += ext4/
 obj-$(CONFIG_JBD)  += jbd/
 obj-$(CONFIG_JBD2) += jbd2/
 obj-$(CONFIG_TUX3) += tux3/
-obj-$(CONFIG_TUX3_MMAP)+= tux3/
 obj-$(CONFIG_CRAMFS)   += cramfs/
 obj-$(CONFIG_SQUASHFS) += squashfs/
 obj-y  += ramfs/
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..fcd1c61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-   long nr_pages;
-   struct super_block *sb;
-   unsigned long *older_than_this;
-   enum writeback_sync_modes sync_mode;
-   unsigned int tagged_writepages:1;
-   unsigned int for_kupdate:1;
-   unsigned int range_cyclic:1;
-   unsigned int for_background:1;
-   unsigned int for_sync:1;/* sync(2) WB_SYNC_ALL writeback */
-   enum wb_reason reason;  /* why was writeback initiated? */
-
-   struct list_head list;  /* pending work list */
-   struct completion *done;/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   spin_lock(&bdi->wb.list_lock);
+   spin_lock(&inode->i_lock);
+   if (!(inode->i_state & I_DIRTY))
+   	list_del_init(&inode->i_wb_list);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   spin_lock(&bdi->wb.list_lock);
+   inode->dirtied_when = jiffies;
+   list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+   spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-   struct bdi_writeback *wb,
-   struct wb_writeback_work *work)
+static long generic_writeback_sb_inodes(struct super_block *sb,
+   struct bdi_writeback *wb,
+   struct wb_writeback_work *work)
 {
struct writeback_control wbc = {
 		.sync_mode  = work->sync_mode,
@@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb,
return wrote;
 }
 
+static long writeback_sb_inodes(struct super_block *sb,
+   struct bdi_writeback *wb,
+   struct wb_writeback_work *work)
+{
+   if (sb->s_op->writeback) {
+   long ret;
+
+   	spin_unlock(&wb->list_lock);
+   

Re: Tux3 Report: How fast can we fail?

2015-05-13 Thread Daniel Phillips
Addendum to that post...

On 05/12/2015 10:46 AM, I wrote:
> ...For example, we currently
> overestimate the cost of a rewrite because we would need to go poking
> around in btrees to do that more accurately. Fixing that will be quite
> a bit of work...

Ah no, I was wrong about that, it will not be a lot of work because
it does not need to be done.

Charging the full cost of a rewrite as if it is a new write is the
right thing to do because we need to be sure the commit can allocate
space to redirect the existing blocks before it frees the old ones.
So that means there is no need for the front end to go delving into
file metadata, ever, which is a relief because that would have been
expensive and messy.
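
To make that concrete, here is a toy sketch of the accounting (illustrative
only, nothing below is actual Tux3 code and the names are made up): the
front end charges every block it dirties at full price, whether or not it
overwrites existing data, so the check needs nothing but a running total.

#include <stdio.h>

struct budget { long free_blocks; };

/* Returns 0 on success, -1 (out of space) if the delta cannot reserve. */
static int charge_blocks(struct budget *b, long blocks)
{
	if (b->free_blocks < blocks)
		return -1;
	b->free_blocks -= blocks;	/* commit refunds whatever goes unused */
	return 0;
}

int main(void)
{
	struct budget b = { .free_blocks = 100 };

	/* A 60 block rewrite is charged exactly like a 60 block new write... */
	printf("rewrite:   %d\n", charge_blocks(&b, 60));
	/* ...so the second one correctly reports out of space up front. */
	printf("new write: %d\n", charge_blocks(&b, 60));
	return 0;
}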

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips

On Wednesday, May 13, 2015 1:25:38 PM PDT, Martin Steigerwald wrote:

On Wednesday, 13 May 2015 at 12:37:41, Daniel Phillips wrote:

On 05/13/2015 12:09 PM, Martin Steigerwald wrote: ...


Daniel, if you want to change the process of patch review and 
inclusion into 
the kernel, model an example of how you would like it to be. This has way 
better chances to inspire others to change their behaviors themselves than 
accusing them of bad faith.


Its yours to choose. 


What outcome do you want to create?


The outcome I would like is:

 * Everybody has a good think about what has gone wrong in the past,
   not only with troublesome submitters, but with mutual respect and
   collegial conduct.

 * Tux3 is merged on its merits so we get more developers and
   testers and move it along faster.

 * I left LKML better than I found it.

 * Group hugs

Well, group hugs are optional, that one would be situational.

Regards,

Daniel
   


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips

On Wednesday, May 13, 2015 1:02:34 PM PDT, Jeremy Allison wrote:

On Wed, May 13, 2015 at 12:37:41PM -0700, Daniel Phillips wrote:

On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
 ...


Daniel, please listen to Martin. He speaks a fundamental truth
here.

As you know, I am also interested in Tux3, and would love to
see it as a filesystem option for NAS servers using Samba. But
please think about the way you're interacting with people on the
list, and whether that makes this outcome more or less likely.


Thanks Jeremy, that means more from you than anyone.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips
On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
> Daniel, what are you trying to achieve here?
> 
> I thought you wanted to create interest for your filesystem and acceptance 
> for merging it.
> 
> What I see you are actually creating tough is something different.
> 
> Is what you see after you send your mails really what you want to see? If 
> not… why not? And if you seek change, where can you create change?

That is the question indeed, whether to try and change the system
while merging, or just keep smiling and get the job done. The problem
is, I am just too stupid to realize that I can't change the system,
which is famously unpleasant for submitters.

> I really like to see Tux3 inside the kernel for easier testing, yet I also 
> see that the way you, in your oppinion, "defend" it, does not seem to move 
> that goal any closer, quite the opposite. It triggers polarity and 
> resistance.
> 
> I believe it to be more productive to work together with the people who will 
> decide about what goes into the kernel and the people whose oppinions are 
> respected by them, instead of against them.

Obviously true.

> "Assume good faith" can help here. No amount of accusing people of bad 
> intention will change them. The only thing you have the power to change is 
> your approach. You absolutely and ultimately do not have the power to change 
> other people. You can´t force Tux3 in by sheer willpower or attacking 
> people.
> 
> On any account for anyone discussing here: I believe that any personal 
> attacks, counter-attacks or "you are wrong" kind of speech will not help to 
> move this discussion out of the circling it seems to be in at the moment.

Thanks for the sane commentary. I have the power to change my behavior.
But if nobody else changes their behavior, the process remains just as
unpleasant for us as it ever was (not just me!). Obviously, this is
not the first time I have been through this, and it has never been
pleasant. After a while, contributors just get tired of the grind and
move on to something more fun. I know I did, and I am far from the
only one.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips
On 05/13/2015 06:08 AM, Mike Galbraith wrote:
> On Wed, 2015-05-13 at 04:31 -0700, Daniel Phillips wrote:
>> Third possibility: build from our repository, as Mike did.
> 
> Sorry about that folks.  I've lost all interest, it won't happen again.

Thanks for your valuable contribution. Now we are seeing a steady
stream of people heading to the repository, after you showed
it could be done.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips
On 05/13/2015 04:31 AM, Daniel Phillips wrote:
Let me be the first to catch that arithmetic error

> Let's say our delta size is 400MB (typical under load) and we leave
> a "nice big gap" of 112 MB after flushing each one. Let's say we do
> two thousand of those before deciding that we have enough information
> available to switch to some smarter strategy. We used one GB of a
> a 4TB disk, say. The media transfer rate decreased by a factor of:
> 
> (1 - 2/1000) = .2%.

Ahem, no, we used 1/8th of the disk. The time/data rate increased
from unity to 1.125, for an average of 1.0625 across the region.
If we only use 1/10th of the disk instead, by not leaving gaps,
then the average time/data across the region is 1.05. The
difference is, 1.0625 - 1.05, so the gap strategy increases media
transfer time by 1.25%, which is not significant compared to the
performance deficit in question of 400%. So, same argument:
change in media transfer rate is just a distraction from the
original question.
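
To spell that arithmetic out (a toy check only, assuming the time per unit
of data rises linearly from 1.0 at the start of the disk to 2.0 at the end,
so the mean over the first fraction f of the disk is 1 + f/2):

#include <stdio.h>

/* Mean transfer time over the first fraction f of the disk, under the
 * linear model above: the average of 1 and 1 + f, which is 1 + f/2. */
static double mean_slowdown(double f)
{
	return 1 + f / 2;
}

int main(void)
{
	double gaps = mean_slowdown(1.0 / 8);	/* deltas plus gaps: 1.0625 */
	double packed = mean_slowdown(1.0 / 10);	/* deltas packed: 1.05 */

	printf("with gaps %.4f, packed %.4f, difference %.2f%%\n",
	       gaps, packed, 100 * (gaps - packed));
	return 0;
}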

In any case, we probably want to start using a smarter strategy
sooner than 1000 commits, maybe after ten or a hundred commits,
which would make the change in media transfer rate even less
relevant.

The thing is, when data first starts landing on media, we do not
have much information about what the long term load will be. So
just analyze the clues we have in the early commits and put those
early deltas onto disk in the most efficient format, which for
Tux3 seems to be linear per delta. There would be exceptions, but
that is the common case.

Then get smarter later. The intent is to get the best of both:
early efficiency, and long term nice aging behavior. I do not
accept the proposition that one must be sacrificed for the
other, I find that reasoning faulty.
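
For concreteness, a toy sketch of that policy (illustrative only, this is
not the Tux3 allocator; the 4K block size, the gap size and the cutover
count are all assumptions made up for the example):

#include <stdio.h>

enum { gap_blocks = 112 << 8, smart_after = 100 };	/* 112 MB of 4K blocks */

static long next_delta_start(long prev_start, long prev_blocks, int deltas_so_far)
{
	if (deltas_so_far < smart_after)
		return prev_start + prev_blocks + gap_blocks;	/* linear, leave a gap */
	return -1;	/* placeholder: hand off to a smarter, load-aware allocator */
}

int main(void)
{
	long blocks = 400 << 8;	/* a 400 MB delta in 4K blocks */
	long start = 0;

	for (int i = 1; i <= 3; i++) {
		printf("delta %d starts at block %ld\n", i, start);
		start = next_delta_start(start, blocks, i);
	}
	return 0;
}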

> The performance deficit in question and the difference in media rate are
> three orders of magnitude apart, does that justify the term "similar or
> identical?".

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips
On 05/13/2015 12:25 AM, Pavel Machek wrote:
> On Mon 2015-05-11 16:53:10, Daniel Phillips wrote:
>> Hi Pavel,
>>
>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>> It is a fact of life that when you change one aspect of an intimately 
>>>>> interconnected system,
>>>>> something else will change as well. You have naive/nonexistent free space 
>>>>> management now; when you
>>>>> design something workable there it is going to impact everything else 
>>>>> you've already done. It's an
>>>>> easy bet that the impact will be negative, the only question is to what 
>>>>> degree.
>>>>
>>>> You might lose that bet. For example, suppose we do strictly linear 
>>>> allocation
>>>> each delta, and just leave nice big gaps between the deltas for future
>>>> expansion. Clearly, we run at similar or identical speed to the current 
>>>> naive
>>>> strategy until we must start filling in the gaps, and at that point our 
>>>> layout
>>>> is not any worse than XFS, which started bad and stayed that way.
>>>
>>> Umm, are you sure. If "some areas of disk are faster than others" is
>>> still true on todays harddrives, the gaps will decrease the
>>> performance (as you'll "use up" the fast areas more quickly).
>>
>> That's why I hedged my claim with "similar or identical". The
>> difference in media speed seems to be a relatively small effect
> 
> When you knew it can't be identical? That's rather confusing, right?

Maybe. The top of thread is about a measured performance deficit of
a factor of five. Next to that, a media transfer rate variation by
a factor of two already starts to look small, and gets smaller when
scrutinized.

Let's say our delta size is 400MB (typical under load) and we leave
a "nice big gap" of 112 MB after flushing each one. Let's say we do
two thousand of those before deciding that we have enough information
available to switch to some smarter strategy. We used one GB of a
a 4TB disk, say. The media transfer rate decreased by a factor of:

(1 - 2/1000) = .2%.

The performance deficit in question and the difference in media rate are
three orders of magnitude apart, does that justify the term "similar or
identical?".

> Perhaps you should post more details how your benchmark is structured
> next time, so we can see you did not make any trivial mistakes...?

Makes sense to me, though I do take considerable care to ensure that
my results are reproducible. That is borne out by the fact that Mike
did reproduce, albeit from the published branch, which is a bit behind
current work. And he went on to do some original testing of his own.

I had no idea Tux3 was so much faster than XFS on the Git self test,
because we never specifically tested anything like that, or optimized
for it. Of course I was interested in why. And that was not all, Mike
also noticed a really interesting fact about latency that I failed to
reproduce. That went on to the list of things to investigate as time
permits.

I reproduced Mike's results according to his description, by actually
building Git in the VM and running the selftests just to see if the same
thing happened, which it did. I didn't think that was worth mentioning
at the time, because if somebody publishes benchmarks, my first instinct
is to trust them. Trust and verify.

> Or just clean the code up so that it can get merged, so that we can
> benchmark ourselves...

Third possibility: build from our repository, as Mike did. Obviously,
we need to merge to master so the build process matches the Wiki. But
Hirofumi is busy with other things, so please be patient.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 03:35 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> On 05/12/2015 02:30 PM, David Lang wrote:
>>> You need to get out of the mindset that Ted and Dave are Enemies that you 
>>> need to overcome, they are
>>> friendly competitors, not Enemies.
>>
>> You are wrong about Dave These are not the words of any friend:
>>
>>   "I don't think I'm alone in my suspicion that there was something
>>   stinky about your numbers." -- Dave Chinner
> 
> you are looking for offense. That just means that something is wrong with 
> them, not that they were
> deliberatly falsified.

I am not mistaken. Dave made sure to eliminate any doubt about
what he meant. He said "Oh, so nicely contrived. But terribly
obvious now that I've found it" among other things.

Good work, Dave. Never mind that we did not hide it.

Let's look at some more of the story. Hirofumi ran the test and
I posted the results and explained the significance. I did not
even know that dbench had fsyncs at that time, since I had never
used it myself, nor that Hirofumi had taken them out in order to
test the things he was interested in. Which turned out to be very
interesting, don't you agree?

Anyway, Hirofumi followed up with a clear explanation, here:

   http://phunq.net/pipermail/tux3/2013-May/002022.html

Instead of accepting that, Dave chose to ride right over it and
carry on with his thinly veiled allegations of intellectual fraud,
using such words as "it's deceptive at best." Dave managed to
insult two people that day.

Dave dismissed the basic breakthrough we had made as "silly
marketing fluff". By now I hope you understand that the result in
question was anything but silly marketing fluff. There are real,
technical reasons that Tux3 wins benchmarks, and the specific
detail that Dave attacked so ungraciously is one of them.

Are you beginning to see who the victim of this mugging was?

>> Basically allegations of cheating. And wrong. Maybe Dave just
>> lives in his own dreamworld where everybody is out to get him, so
>> he has to attack people he views as competitors first.
> 
> you are the one doing the attacking.

Defending, not attacking. There is a distinction.

> Please stop. Take a break if needed, and then get back to
> producing software rather than complaining about how everyone is out to get 
> you.

Dave is not "everyone", and a "shut up" will not fix this.

What will fix this is a simple, professional statement that
an error was made, that there was no fraud or anything even
remotely resembling it, and that instead a technical
contribution was made. It is not even important that it come
from Dave. But it is important that the aspersions that were
cast be recognized for what they were.

By the way, do you remember the scene from "Unforgiven" where
the sheriff is kicking the guy on the ground and saying "I'm
not kicking you?" It feels like that.

As far as who should take a break goes, note that either of
us can stop the thread. Does it necessarily have to be me?

If you would prefer some light reading, you could read "How fast
can we fail?", which I believe is relevant to the question of
whether Tux3 is mergeable or not.

   https://lkml.org/lkml/2015/5/12/663

Regards,

Daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 02:30 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> Phoronix published a headline that identifies Dave Chinner as
>> someone who takes shots at other projects. Seems pretty much on
>> the money to me, and it ought to be obvious why he does it.
> 
> Phoronix turns any correction or criticism into an attack.

Phoronix gets attacked in an unseemly way by a number of people
in the developer community who should behave better. You are
doing it yourself, seemingly oblivious to the valuable role that
the publication plays in our community. Google for filesystem
benchmarks. Where do you find them? Right. Not to mention the
Xorg coverage, community issues, etc etc. The last thing we need
is a monoculture in Linux news, and we are dangerously close to
that now.

So, how is "EXT4 is not as stable or as well tested as most
people think" not a cheap shot? By my first hand experience, that
claim is absurd. Add to that the first hand experience of roughly
two billion other people. Seems to be a bit self serving too, or
was that just an accident.

> You need to get out of the mindset that Ted and Dave are Enemies that you 
> need to overcome, they are
> friendly competitors, not Enemies.

You are wrong about Dave. These are not the words of any friend:

   "I don't think I'm alone in my suspicion that there was something
   stinky about your numbers." -- Dave Chinner

Basically allegations of cheating. And wrong. Maybe Dave just
lives in his own dreamworld where everybody is out to get him, so
he has to attack people he views as competitors first.

Ted has more taste and his FUD attack was more artful, but it
still amounted to nothing more than piling on. He just picked up
Dave's straw man uncritically and proceeded to knock it down
some more. Nice way of distracting attention from the fact that
we actually did what we claimed, and instead of getting the
appropriate recognition for it, we were called cheaters. More or
less in so many words by Dave, and more subtly by Ted, but the
intent is clear and unmistakable. Apologies from both are still
in order, but it will be a rainy day in that hot place before we
ever see either of them do the right thing.

That said, Ted is no enemy, he is brilliant and usually conducts
himself admirably. Except sometimes. I wish I would say the same
about Dave, but what I see there is a guy who has invested his
entire identity in his XFS career and is insecure that something
might conspire against him to disrupt it. I mean, come on, if you
convince Redhat management to elevate your life's work to the
status of something that most of the paid for servers in the
world are going to run, do you continue attacking your peers or
do you chill a bit?

> They assume that you are working in good faith (but are
> inexperienced compared to them), and you need to assume that they are working 
> in good faith. If they
> ever do resort to underhanded means to sabotage you, Linus and the other 
> kernel developers will take
> action. But pointing out limits in your current implementation, problems in 
> your benchmarks based on
> how they are run, and concepts that are going to be difficult to merge is not 
> underhanded, it's
> exactly the type of assistance that you should be greatful for in friendly 
> competition.
> 
> You were the one who started crowing about how badly XFS performed.

Not at all, somebody else posted the terrible XFS benchmark result,
then Dave put up a big smokescreen to try to deflect attention from
it. There is a term for that kind of logical fallacy:

   http://en.wikipedia.org/wiki/Proof_by_intimidation

Seems to have worked well on you. But after all those words, XFS
does not run any faster, and it clearly needs to.

> Dave gave a long and detailed explination about the reasons for the 
> differences, and showing
benchmarks on other hardware that
> showed that XFS works very well there. That's not an attack on EXT4 (or 
> Tux3), it's an explination.

Long, detailed, and bogus. Summary: "oh, XFS doesn't work well on
that hardware? Get new hardware." Excuse me, but other filesystems
do work well on that hardware, the problem is not with the hardware.

> I have my own concerns about how things are going to work (I've voiced some 
> of them), but no, I
> haven't tried running Tux3 because you say it's not ready yet.

I did not say that. I said it is not ready for users. It is more
than ready for anybody who wants to develop it, or benchmark it,
or put test data on it, and has been for a long time. Except for
enospc, and that was apparently not an issue for Btrfs, was it.

>> You know what to do about checking for faulty benchmarks.
> 
> That requires that the code be readily available, which last I heard, Tux3 
> wasn't. Has this been fixed?

You heard wrong. The code is readily available and you can clone 

Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 11:39 AM, David Lang wrote:
> On Mon, 11 May 2015, Daniel Phillips wrote:
>>> ...it's the mm and core kernel developers that need to
>>> review and accept that code *before* we can consider merging tux3.
>>
>> Please do not say "we" when you know that I am just as much a "we"
>> as you are. Merging Tux3 is not your decision. The people whose
>> decision it actually is are perfectly capable of recognizing your
>> agenda for what it is.
>>
>>   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>>   "XFS Developer Takes Shots At Btrfs, EXT4"
> 
> umm, Phoronix has no input on what gets merged into the kernel. they also hae 
> a reputation for
> trying to turn anything into click-bait by making it sound like a fight when 
> it isn't.

Perhaps you misunderstood. Linus decides what gets merged. Andrew
decides. Greg decides. Dave Chinner does not decide, he just does
his level best to create the impression that our project is unfit
to merge. Any chance there might be an agenda?

Phoronix published a headline that identifies Dave Chinner as
someone who takes shots at other projects. Seems pretty much on
the money to me, and it ought to be obvious why he does it.

>> The real question is, has the Linux development process become
>> so political and toxic that worthwhile projects fail to benefit
>> from supposed grassroots community support. You are the poster
>> child for that.
> 
> The linux development process is making code available, responding to 
> concerns from the experts in
> the community, and letting the code talk for itself.

Nice idea, but it isn't working. Did you let the code talk to you?
Right, you let the code talk to Dave Chinner, then you listen to
what Dave Chinner has to say about it. Any chance that there might
be some creative licence acting somewhere in that chain?

> There have been many people pushing code for inclusion that has not gotten 
> into the kernel, or has
> not been used by any distros after it's made it into the kernel, in spite of 
> benchmarks being posted
> that seem to show how wonderful the new code is. ReiserFS was one of the 
> first, and part of what
> tarnished it's reputation with many people was how much they were pushing the 
> benchmarks that were
> shown to be faulty (the one I remember most vividly was that the entire 
> benchmark completed in <30
> seconds, and they had the FS tuned to not start flushing data to disk for 30 
> seconds, so the entire
> 'benchmark' ran out of ram without ever touching the disk)

You know what to do about checking for faulty benchmarks.

> So when Ted and Dave point out problems with the benchmark (the difference in 
> behavior between a
> single spinning disk, different partitions on the same disk, SSDs, and 
> ramdisks), you would be
> better off acknowledging them and if you can't adjust and re-run the 
> benchmarks, don't start
> attacking them as a result.

Ted and Dave failed to point out any actual problem with any
benchmark. They invented issues with benchmarks and promoted those
as FUD.

> As Dave says above, it's not the other filesystem people you have to 
> convince, it's the core VFS and
> Memory Mangement folks you have to convince. You may need a little 
> benchmarking to show that there
> is a real advantage to be gained, but the real discussion is going to be on 
> the impact that page
> forking is going to have on everything else (both in complexity and in 
> performance impact to other
> things)

Yet he clearly wrote "we" as if he believes he is part of it.

Now that ENOSPC is done to a standard way beyond what Btrfs had
when it was merged, the next item on the agenda is writeback. That
involves us and VFS people as you say, and not Dave Chinner, who
only intends to obstruct the process as much as he possibly can. He
should get back to work on his own project. Nobody will miss his
posts if he doesn't make them. They contribute nothing of value,
create a lot of bad blood, and just serve to further besmirch the
famously tarnished reputation of LKML.

>> You know that Tux3 is already fast. Not just that of course. It
>> has a higher standard of data integrity than your metadata-only
>> journalling filesystem and a small enough code base that it can
>> be reasonably expected to reach the quality expected of an
>> enterprise class filesystem, quite possibly before XFS gets
>> there.
> 
> We wouldn't expect anyone developing a new filesystem to believe any 
> differently.

It is not a matter of belief, it is a matter of testable fact. For
example, you can count the lines. You can run the same benchmarks.

Proving the data consistency claims would be a little harder, you
need tools for that, and some o

Tux3 Report: How fast can we fail?

2015-05-12 Thread Daniel Phillips
efinition of 100%. Btrfs never gets
this right: full for it tends to range from 96% to 98%, and sometimes is
much lower, like 28%. It has its own definition of disk full in its own
utility, but that does not seem to be very accurate either. This part of
Btrfs needs major work. Even at this early stage, Tux3 is much better
than that.

One thing we can all rejoice over: nobody ever hit out of space while
trying to commit. At least, nobody ever admitted it. And nobody oopsed,
or asserted, though XFS did exhibit some denial of service issues where
the filesystem was unusable for tens of seconds.

Once again, in the full disclosure department: there are some known
holes remaining in Tux3's out of space handling. The unify suspend
algorithm is not yet implemented, without which we cannot guarantee
that out of space will never happen in commit. With the simple expedient
of a 100 block emergency reserve, it has never yet happened, but no
doubt some as yet untested load can make it happen. ENOSPC handling for
mmap is not yet implemented. Cost estimates for namespace operations
are too crude and ignore btree depth. Cost estimates could be tighter
than they are, to give better performance and report disk full more
promptly. The emergency reserve should be set each delta according to
delta budget. Big truncates need to be split over multiple commits
so they always free more blocks than they consume before commit. That
is about it. On the whole, I am really happy with the way this
has worked out.
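
For illustration, a minimal sketch of the reserve idea (not the Tux3
implementation; the numbers and the names are made up): the front end
charges a pessimistic cost per operation against the free count, while
holding back an emergency reserve that only the commit itself may use.

#include <stdio.h>

struct sbinfo { long free_blocks, reserve_blocks; };

/* Succeeds only if the operation fits without touching the reserve. */
static int reserve_cost(struct sbinfo *sb, long cost)
{
	if (sb->free_blocks - cost < sb->reserve_blocks)
		return -1;		/* report out of space up front */
	sb->free_blocks -= cost;	/* commit refunds the overestimate */
	return 0;
}

int main(void)
{
	struct sbinfo sb = { .free_blocks = 500, .reserve_blocks = 100 };

	printf("%d\n", reserve_cost(&sb, 350));	/* fits: prints 0 */
	printf("%d\n", reserve_cost(&sb, 100));	/* would eat the reserve: -1 */
	return 0;
}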

Well, that is that for today. Tux3 now has decent out of space handling
that appears to work well and has a good strong theoretical basis. It
needs more work, but is no longer a reason to block Tux3 from merging,
if it ever really was.

Regards,

Daniel

[1] Overhead of an uncontended bus locked add is about 6 nanoseconds on
my i5, and about ten times higher when contended.
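
For anyone who wants to reproduce that kind of number, a rough
single-thread sketch (this is not the measurement quoted above, just an
approximation of it; the contended case needs several threads hammering
the same cache line):

#define _POSIX_C_SOURCE 200112L	/* for clock_gettime under c99 */
#include <stdio.h>
#include <time.h>

int main(void)
{
	enum { runs = 100 * 1000 * 1000 };
	volatile long counter = 0;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < runs; i++)
		__sync_fetch_and_add(&counter, 1);	/* bus locked add (lock xadd on x86) */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per uncontended locked add\n", ns / runs);
	return 0;
}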

 /*
 * Blurt v0.0
 *
 * A trivial multitasking filesystem load generator
 *
 * Daniel Phillips, June 2015
 *
 * to build: c99 -Wall blurt.c -oblurt
 * to run: blurt <basename> <steps> <tasks>
 */

#define _POSIX_C_SOURCE 200809L	/* expose POSIX declarations under c99 */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>

enum { chunk = 1024, do_sync = 0 };	/* "do_sync": avoids clashing with sync(2) */

char text[chunk] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
	const char *basename = argc < 2 ? "foo" : argv[1];
	char name[100];
	int steps = argc < 3 ? 1 : atoi(argv[2]);
	int tasks = argc < 4 ? 1 : atoi(argv[3]);
	int fd, status, errors = 0;

	for (int t = 0; t < tasks; t++) {
		snprintf(name, sizeof name, "%s%i", basename, t);
		if (!fork())
			goto child;
	}
	for (int t = 0; t < tasks; t++) {
		wait(&status);
		if (WIFEXITED(status) && WEXITSTATUS(status))
			errors++;
	}
	return !!errors;

child:
	fd = creat(name, S_IRWXU);
	if (fd == -1)
		goto fail1;
	for (int i = 0; i < steps; i++) {
		int ret = write(fd, text, sizeof text);
		if (ret == -1)
			goto fail2;
		if (do_sync)
			fsync(fd);
	}
	return 0;
fail1:
	perror("create failed");
	return 1;
fail2:
	perror("write failed");
	return 1;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 02:03 AM, Pavel Machek wrote:
> On Mon 2015-05-11 19:34:34, Daniel Phillips wrote:
>> On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
>>> and another way that people
>>> doing competitive benchmarking can screw up and produce misleading
>>> numbers.
>>
>> If you think we screwed up or produced misleading numbers, could you
>> please be up front about it instead of making insinuations and
>> continuing your tirade against benchmarking and those who do it.
> 
> Are not you little harsh with Ted? He was polite.

Polite language does not include words like "screw up" and "misleading
numbers", those are combative words intended to undermine and disparage.
It is not clear how repeating the same words can be construed as less
polite than the original utterance.

>> The ram disk removes seek overhead and greatly reduces media transfer
>> overhead. This does not change things much: it confirms that Tux3 is
>> significantly faster than the others at synchronous loads. This is
>> apparently true independently of media type, though to be sure SSD
>> remains to be tested.
>>
>> The really interesting result is how much difference there is between
>> filesystems, even on a ram disk. Is it just CPU or is it synchronization
>> strategy and lock contention? Does our asynchronous front/back design
>> actually help a lot, instead of being a disadvantage as you predicted?
>>
>> It is too bad that fs_mark caps number of tasks at 64, because I am
>> sure that some embarrassing behavior would emerge at high task counts,
>> as with my tests on spinning disk.
> 
> I'd call system with 65 tasks doing heavy fsync load at the some time
> "embarrassingly misconfigured" :-). It is nice if your filesystem can
> stay fast in that case, but...

Well, Tux3 wins the fsync race now whether it is 1 task, 64 tasks or
10,000 tasks. At the high end, maybe it is just a curiosity, or maybe
it tells us something about how Tux3 will scale on the big machines
that XFS currently lays claim to. And Java programmers are busy doing
all kinds of wild and crazy things with lots of tasks. Java almost
makes them do it. If they need their data durable then they can easily
create loads like my test case.

Suppose you have a web server meant to serve 10,000 transactions
simultaneously and it needs to survive crashes without losing client
state. How will you do it? You could install an expensive, finicky
database, or you could write some Java code that happens to work well
because Linux has a scheduler and a filesystem that can handle it.
Oh wait, we don't have the second one yet, but maybe we soon will.

I will not claim that stupidly fast and scalable fsync is the main
reason that somebody should want Tux3, however, the lack of a high
performance fsync was in fact used as a means of spreading FUD about
Tux3, so I had some fun going way beyond the call of duty to answer
that. By the way, I am still waiting for the original source of the
FUD to concede the point politely, but maybe he is waiting for the
code to land, which it still has not as of today, so I guess that is
fair. Note that it would have landed quite some time ago if Tux3 was
already merged.

Historical note: didn't Java motivate the O(1) scheduler?

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On Monday, May 11, 2015 10:38:42 PM PDT, Dave Chinner wrote:
> I think Ted and I are on the same page here. "Competitive
> benchmarks" only matter to the people who are trying to sell
> something. You're trying to sell Tux3, but

By "same page", do you mean "transparently obvious about
obstructing other projects"?

> The "except page forking design" statement is your biggest hurdle
> for getting tux3 merged, not performance.

No, the "except page forking design" is because the design is
already good and effective. The small adjustments needed in core
are well worth merging because the benefits are proved by benchmarks.
So benchmarks are key and will not stop just because you don't like
the attention they bring to XFS issues.

> Without page forking, tux3
> cannot be merged at all. But it's not filesystem developers you need
> to convince about the merits of the page forking design and
> implementation - it's the mm and core kernel developers that need to
> review and accept that code *before* we can consider merging tux3.

Please do not say "we" when you know that I am just as much a "we"
as you are. Merging Tux3 is not your decision. The people whose
decision it actually is are perfectly capable of recognizing your
agenda for what it is.

   http://www.phoronix.com/scan.php?page=news_item=MTA0NzM
   "XFS Developer Takes Shots At Btrfs, EXT4"

The real question is, has the Linux development process become
so political and toxic that worthwhile projects fail to benefit
from supposed grassroots community support. You are the poster
child for that.

> IOWs, you need to focus on the important things needed to acheive
> your stated goal of getting tux3 merged. New filesystems should be
> faster than those based on 20-25 year old designs, so you don't need
> to waste time trying to convince people that tux3, when complete,
> will be fast.

You know that Tux3 is already fast. Not just that of course. It
has a higher standard of data integrity than your metadata-only
journalling filesystem and a small enough code base that it can
be reasonably expected to reach the quality expected of an
enterprise class filesystem, quite possibly before XFS gets
there.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 11:39 AM, David Lang wrote:
 On Mon, 11 May 2015, Daniel Phillips wrote:
 ...it's the mm and core kernel developers that need to
 review and accept that code *before* we can consider merging tux3.

 Please do not say "we" when you know that I am just as much a "we"
 as you are. Merging Tux3 is not your decision. The people whose
 decision it actually is are perfectly capable of recognizing your
 agenda for what it is.

   http://www.phoronix.com/scan.php?page=news_itempx=MTA0NzM
   "XFS Developer Takes Shots At Btrfs, EXT4"
 
 umm, Phoronix has no input on what gets merged into the kernel. they also hae 
 a reputation for
 trying to turn anything into click-bait by making it sound like a fight when 
 it isn't.

Perhaps you misunderstood. Linus decides what gets merged. Andrew
decides. Greg decides. Dave Chinner does not decide, he just does
his level best to create the impression that our project is unfit
to merge. Any chance there might be an agenda?

Phoronix published a headline that identifies Dave Chinner as
someone who takes shots at other projects. Seems pretty much on
the money to me, and it ought to be obvious why he does it.

 The real question is, has the Linux development process become
 so political and toxic that worthwhile projects fail to benefit
 from supposed grassroots community support. You are the poster
 child for that.
 
 The linux development process is making code available, responding to 
 concerns from the experts in
 the community, and letting the code talk for itself.

Nice idea, but it isn't working. Did you let the code talk to you?
Right, you let the code talk to Dave Chinner, then you listen to
what Dave Chinner has to say about it. Any chance that there might
be some creative licence acting somewhere in that chain?

 There have been many people pushing code for inclusion that has not gotten 
 into the kernel, or has
 not been used by any distros after it's made it into the kernel, in spite of 
 benchmarks being posted
 that seem to show how wonderful the new code is. ReiserFS was one of the 
 first, and part of what
 tarnished it's reputation with many people was how much they were pushing the 
 benchmarks that were
 shown to be faulty (the one I remember most vividly was that the entire 
 benchmark completed in 30
 seconds, and they had the FS tuned to not start flushing data to disk for 30 
 seconds, so the entire
 'benchmark' ran out of ram without ever touching the disk)

You know what to do about checking for faulty benchmarks.

 So when Ted and Dave point out problems with the benchmark (the difference in 
 behavior between a
 single spinning disk, different partitions on the same disk, SSDs, and 
 ramdisks), you would be
 better off acknowledging them and if you can't adjust and re-run the 
 benchmarks, don't start
 attacking them as a result.

Ted and Dave failed to point out any actual problem with any
benchmark. They invented issues with benchmarks and promoted those
as FUD.

 As Dave says above, it's not the other filesystem people you have to 
 convince, it's the core VFS and
 Memory Mangement folks you have to convince. You may need a little 
 benchmarking to show that there
 is a real advantage to be gained, but the real discussion is going to be on 
 the impact that page
 forking is going to have on everything else (both in complexity and in 
 performance impact to other
 things)

Yet he clearly wrote "we" as if he believes he is part of it.

Now that ENOSPC is done to a standard way beyond what Btrfs had
when it was merged, the next item on the agenda is writeback. That
involves us and VFS people as you say, and not Dave Chinner, who
only intends to obstruct the process as much as he possibly can. He
should get back to work on his own project. Nobody will miss his
posts if he doesn't make them. They contribute nothing of value,
create a lot of bad blood, and just serve to further besmirch the
famously tarnished reputation of LKML.

 You know that Tux3 is already fast. Not just that of course. It
 has a higher standard of data integrity than your metadata-only
 journalling filesystem and a small enough code base that it can
 be reasonably expected to reach the quality expected of an
 enterprise class filesystem, quite possibly before XFS gets
 there.
 
 We wouldn't expect anyone developing a new filesystem to believe any 
 differently.

It is not a matter of belief, it is a matter of testable fact. For
example, you can count the lines. You can run the same benchmarks.

Proving the data consistency claims would be a little harder, you
need tools for that, and some of those aren't built yet. Or, if you
have technical ability, you can read the code and the copious design
material that has been posted and convince yourself that, yes, there
is something cool here, why didn't anybody do it that way before?
But of course that starts to sound like work. Debating nontechnical
issues and playing politics seems so much more like fun.

 If they didn't
 believe

Tux3 Report: How fast can we fail?

2015-05-12 Thread Daniel Phillips
 lower, like 28%. It has its own definition of disk full in its own
utility, but that does not seem to be very accurate either. This part of
Btrfs needs major work. Even at this early stage, Tux3 is much better
than that.

One thing we can all rejoice over: nobody ever hit out of space while
trying to commit. At least, nobody ever admitted it. And nobody oopsed,
or asserted, though XFS did exhibit some denial of service issues where
the filesystem was unusable for tens of seconds.

Once again, in the full disclosure department: there are some known
holes remaining in Tux3's out of space handling. The unify suspend
algorithm is not yet implemented, without which we cannot guarantee
that out of space will never happen in commit. With the simple expedient
of a 100 block emergency reserve, it has never yet happened, but no
doubt some as yet untested load can make it happen. ENOSPC handling for
mmap is not yet implemented. Cost estimates for namespace operations
are too crude and ignore btree depth. Cost estimates could be tighter
than they are, to give better performance and report disk full more
promptly. The emergency reserve should be set each delta according to
delta budget. Big truncates need to be split over multiple commits
so they always free more blocks than they consume before commit. That
is about it. On the whole, I am really happy with the way this
has worked out.
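
To make the reserve idea concrete, here is a rough sketch in C of the
kind of frontend accounting described above. The names (delta_budget,
cost_create and so on) are invented for illustration and are not the
actual Tux3 interfaces; the point is only that each operation charges a
pessimistic block cost before dirtying anything, so commit itself can
never hit out of space.

/* Hedged sketch of per-delta ENOSPC accounting; all names hypothetical. */
#include <errno.h>

struct delta_budget {
	long free_blocks;   /* free blocks on media */
	long charged;       /* pessimistic cost of operations already admitted */
	long emergency;     /* e.g. 100 blocks held back for the commit itself */
};

/* Crude cost of a create: dirent + inode table + btree path blocks.
 * As noted above, the real estimate should account for btree depth. */
static long cost_create(void) { return 4; }

static int charge(struct delta_budget *budget, long cost)
{
	if (budget->free_blocks - budget->charged - budget->emergency < cost)
		return -ENOSPC;        /* report disk full up front, never in commit */
	budget->charged += cost;       /* backend refunds whatever goes unused at commit */
	return 0;
}

static int create_op(struct delta_budget *budget)
{
	int err = charge(budget, cost_create());
	if (err)
		return err;            /* caller sees ENOSPC before anything is dirtied */
	/* ... dirty the dirent block, inode table block, etc ... */
	return 0;
}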

Well, that is that for today. Tux3 now has decent out of space handling
that appears to work well and has a good strong theoretical basis. It
needs more work, but is no longer a reason to block Tux3 from merging,
if it ever really was.

Regards,

Daniel

[1] Overhead of an uncontended bus locked add is about 6 nanoseconds on
my i5, and about ten times higher when contended.
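
(A number like that can be roughly reproduced from user space with a
sketch along these lines; this is an illustration, not the test behind
the figure above, and the contended case needs several threads hammering
the same counter.)

/* Sketch: time an uncontended locked add. Build: gcc -O2 lockadd.c -o lockadd */
#include <stdio.h>
#include <time.h>

int main(void)
{
	enum { loops = 100000000 };
	static long counter;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < loops; i++)
		__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED); /* gcc/clang builtin */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.2f ns per locked add (count %ld)\n", ns / loops, counter);
	return 0;
}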

 /*
 * Blurt v0.0
 *
 * A trivial multitasking filesystem load generator
 *
 * Daniel Phillips, June 2015
 *
 * to build: c99 -Wall blurt.c -oblurt
 * to run: blurt basename steps tasks
 */

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

enum { chunk = 1024, sync = 0 };

char text[chunk] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
const char *basename = argc < 2 ? "foo" : argv[1];
char name[100];
int steps = argc < 3 ? 1 : atoi(argv[2]);
int tasks = argc < 4 ? 1 : atoi(argv[3]);
int fd, status, errors = 0;

/* parent: spawn one writer task per file */
for (int t = 0; t < tasks; t++) {
snprintf(name, sizeof name, "%s%i", basename, t);
if (!fork())
goto child;
}
/* parent: reap children and report any write failures */
for (int t = 0; t < tasks; t++) {
wait(&status);
if (WIFEXITED(status) && WEXITSTATUS(status))
errors++;
}
return !!errors;

child:
/* child: write steps chunks to its own file, optionally fsyncing each one */
fd = creat(name, S_IRWXU);
if (fd == -1)
goto fail1;
for (int i = 0; i < steps; i++) {
int ret = write(fd, text, sizeof text);
if (ret == -1)
goto fail2;
if (sync)
fsync(fd);
}
return 0;
fail1:
perror("create failed");
return 1;
fail2:
perror("write failed");
return 1;
}
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 03:35 PM, David Lang wrote:
 On Tue, 12 May 2015, Daniel Phillips wrote:
 On 05/12/2015 02:30 PM, David Lang wrote:
 You need to get out of the mindset that Ted and Dave are Enemies that you 
 need to overcome, they are
 friendly competitors, not Enemies.

 You are wrong about Dave. These are not the words of any friend:

   "I don't think I'm alone in my suspicion that there was something
   stinky about your numbers." -- Dave Chinner
 
 you are looking for offense. That just means that something is wrong with 
 them, not that they were
 deliberatly falsified.

I am not mistaken. Dave made sure to eliminate any doubt about
what he meant. He said "Oh, so nicely contrived. But terribly
obvious now that I've found it" among other things.

Good work, Dave. Never mind that we did not hide it.

Let's look at some more of the story. Hirofumi ran the test and
I posted the results and explained the significance. I did not
even know that dbench had fsyncs at that time, since I had never
used it myself, nor that Hirofumi had taken them out in order to
test the things he was interested in. Which turned out to be very
interesting, don't you agree?

Anyway, Hirofumi followed up with a clear explanation, here:

   http://phunq.net/pipermail/tux3/2013-May/002022.html

Instead of accepting that, Dave chose to ride right over it and
carry on with his thinly veiled allegations of intellectual fraud,
using such words as "it's deceptive at best". Dave managed to
insult two people that day.

Dave dismissed the basic breakthrough we had made as "silly
marketing fluff". By now I hope you understand that the result in
question was anything but silly marketing fluff. There are real,
technical reasons that Tux3 wins benchmarks, and the specific
detail that Dave attacked so ungraciously is one of them.

Are you beginning to see who the victim of this mugging was?

 Basically allegations of cheating. And wrong. Maybe Dave just
 lives in his own dreamworld where everybody is out to get him, so
 he has to attack people he views as competitors first.
 
 you are the one doing the attacking.

Defending, not attacking. There is a distinction.

 Please stop. Take a break if needed, and then get back to
 producing software rather than complaining about how everyone is out to get 
 you.

Dave is not everyone, and a "shut up" will not fix this.

What will fix this is a simple, professional statement that
an error was made, that there was no fraud or anything even
remotely resembling it, and that instead a technical
contribution was made. It is not even important that it come
from Dave. But it is important that the aspersions that were
cast be recognized for what they were.

By the way, do you remember the scene from Unforgiven where
the sheriff is kicking the guy on the ground and saying "I'm
not kicking you"? It feels like that.

As far as who should take a break goes, note that either of
us can stop the thread. Does it necessarily have to be me?

If you would prefer some light reading, you could read "How fast
can we fail?", which I believe is relevant to the question of
whether Tux3 is mergeable or not.

   https://lkml.org/lkml/2015/5/12/663

Regards,

Daniel

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 02:30 PM, David Lang wrote:
 On Tue, 12 May 2015, Daniel Phillips wrote:
 Phoronix published a headline that identifies Dave Chinner as
 someone who takes shots at other projects. Seems pretty much on
 the money to me, and it ought to be obvious why he does it.
 
 Phoronix turns any correction or criticism into an attack.

Phoronix gets attacked in an unseemly way by a number of people
in the developer community who should behave better. You are
doing it yourself, seemingly oblivious to the valuable role that
the publication plays in our community. Google for filesystem
benchmarks. Where do you find them? Right. Not to mention the
Xorg coverage, community issues, etc etc. The last thing we need
is a monoculture in Linux news, and we are dangerously close to
that now.

So, how is "EXT4 is not as stable or as well tested as most
people think" not a cheap shot? By my first hand experience, that
claim is absurd. Add to that the first hand experience of roughly
two billion other people. Seems to be a bit self serving too, or
was that just an accident.

 You need to get out of the mindset that Ted and Dave are Enemies that you 
 need to overcome, they are
 friendly competitors, not Enemies.

You are wrong about Dave. These are not the words of any friend:

   "I don't think I'm alone in my suspicion that there was something
   stinky about your numbers." -- Dave Chinner

Basically allegations of cheating. And wrong. Maybe Dave just
lives in his own dreamworld where everybody is out to get him, so
he has to attack people he views as competitors first.

Ted has more taste and his FUD attack was more artful, but it
still amounted to nothing more than piling on, He just picked up
Dave's straw man uncritically and proceeded to knock it down
some more. Nice way of distracting attention from the fact that
we actually did what we claimed, and instead of getting the
appropriate recognition for it, we were called cheaters. More or
less in so many words by Dave, and more subtly by Ted, but the
intent is clear and unmistakable. Apologies from both are still
in order, but it will be a rainy day in that hot place before we
ever see either of them do the right thing.

That said, Ted is no enemy, he is brilliant and usually conducts
himself admirably. Except sometimes. I wish I would say the same
about Dave, but what I see there is a guy who has invested his
entire identity in his XFS career and is insecure that something
might conspire against him to disrupt it. I mean, come on, if you
convince Redhat management to elevate your life's work to the
status of something that most of the paid for servers in the
world are going to run, do you continue attacking your peers or
do you chill a bit?

 They assume that you are working in good faith (but are
 inexperienced compared to them), and you need to assume that they are working 
 in good faith. If they
 ever do resort to underhanded means to sabotage you, Linus and the other 
 kernel developers will take
 action. But pointing out limits in your current implementation, problems in 
 your benchmarks based on
 how they are run, and concepts that are going to be difficult to merge is not 
 underhanded, it's
 exactly the type of assistance that you should be greatful for in friendly 
 competition.
 
 You were the one who started crowing about how badly XFS performed.

Not at all, somebody else posted the terrible XFS benchmark result,
then Dave put up a big smokescreen to try to deflect attention from
it. There is a term for that kind of logical fallacy:

   http://en.wikipedia.org/wiki/Proof_by_intimidation

Seems to have worked well on you. But after all those words, XFS
does not run any faster, and it clearly needs to.

 Dave gave a long and detailed explination about the reasons for the 
 differences, and showing
benchmarks on other hardware that
 showed that XFS works very well there. That's not an attack on EXT4 (or 
 Tux3), it's an explination.

Long, detailed, and bogus. Summary: oh, XFS doesn't work well on
that hardware? Get new hardware. Excuse me, but other filesystems
do work well on that hardware, the problem is not with the hardware.

 I have my own concerns about how things are going to work (I've voiced some 
 of them), but no, I
 haven't tried running Tux3 because you say it's not ready yet.

I did not say that. I said it is not ready for users. It is more
than ready for anybody who wants to develop it, or benchmark it,
or put test data on it, and has been for a long time. Except for
enospc, and that was apparently not an issue for Btrfs, was it.

 You know what to do about checking for faulty benchmarks.
 
 That requires that the code be readily available, which last I heard, Tux3 
 wasn't. Has this been fixed?

You heard wrong. The code is readily available and you can clone it
from here:

https://github.com/OGAWAHirofumi/linux-tux3.git

The hirofumi-user branch has the user tools including mkfs and basic
fsck, and the hirofumi branch is a 3.19 Linus


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips


On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
> On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
>> Umm, are you sure. If "some areas of disk are faster than others" is
>> still true on todays harddrives, the gaps will decrease the
>> performance (as you'll "use up" the fast areas more quickly).
> 
> It's still true.  The difference between O.D. and I.D. (outer diameter
> vs inner diameter) LBA's is typically a factor of 2.  This is why
> "short-stroking" works as a technique,

That is true, and the effect is not dominant compared to introducing
a lot of extra seeks.

> and another way that people
> doing competitive benchmarking can screw up and produce misleading
> numbers.

If you think we screwed up or produced misleading numbers, could you
please be up front about it instead of making insinuations and
continuing your tirade against benchmarking and those who do it.

> (If you use partitions instead of the whole disk, you have
> to use the same partition in order to make sure you aren't comparing
> apples with oranges.)

You can rest assured I did exactly that.

Somebody complained that things would look much different with seeks
factored out, so here are some new "competitive benchmarks" using
fs_mark on a ram disk:

   tasks       1     16      64

   ext4:     231   2154    5439
   btrfs:    152    962    2230
   xfs:      268   2729    6466
   tux3:     315   5529   20301

(Files per second, more is better)

The shell commands are:

   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s1048576 -w4096 -n1000 -t1
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s65536 -w4096 -n1000 -t16
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s4096 -w4096 -n1000 -t64

The ram disk removes seek overhead and greatly reduces media transfer
overhead. This does not change things much: it confirms that Tux3 is
significantly faster than the others at synchronous loads. This is
apparently true independently of media type, though to be sure SSD
remains to be tested.

The really interesting result is how much difference there is between
filesystems, even on a ram disk. Is it just CPU or is it synchronization
strategy and lock contention? Does our asynchronous front/back design
actually help a lot, instead of being a disadvantage as you predicted?

It is too bad that fs_mark caps number of tasks at 64, because I am
sure that some embarrassing behavior would emerge at high task counts,
as with my tests on spinning disk.

Anyway, everybody but you loves competitive benchmarks, that is why I
post them. They are not only useful for tracking down performance bugs,
but as you point out, they help us advertise the reasons why Tux3 is
interesting and ought to be merged.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips
Hi Pavel,

On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>> It is a fact of life that when you change one aspect of an intimately 
>>> interconnected system,
>>> something else will change as well. You have naive/nonexistent free space 
>>> management now; when you
>>> design something workable there it is going to impact everything else 
>>> you've already done. It's an
>>> easy bet that the impact will be negative, the only question is to what 
>>> degree.
>>
>> You might lose that bet. For example, suppose we do strictly linear 
>> allocation
>> each delta, and just leave nice big gaps between the deltas for future
>> expansion. Clearly, we run at similar or identical speed to the current naive
>> strategy until we must start filling in the gaps, and at that point our 
>> layout
>> is not any worse than XFS, which started bad and stayed that way.
> 
> Umm, are you sure. If "some areas of disk are faster than others" is
> still true on todays harddrives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).

That's why I hedged my claim with "similar or identical". The
difference in media speed seems to be a relatively small effect
compared to extra seeks. It seems that XFS puts big spaces between
new directories, and suffers a lot of extra seeks because of it.
I propose to batch new directories together initially, then change
the allocation goal to a new, relatively empty area if a big batch
of files lands on a directory in a crowded region. The "big" gaps
would be on the order of delta size, so not really very big.

Anyway, some people seem to have pounced on the words "naive" and
"linear allocation" and jumped to the conclusion that our whole
strategy is naive. Far from it. We don't just throw files randomly
at the disk. We sort and partition files and metadata, and we
carefully arrange the order of our allocation operations so that
linear allocation produces a nice layout for both read and write.

This turned out to be so much better than fiddling with the goal
of individual allocations that we concluded we would get best
results by sticking with linear allocation, but improve our sort
step. The new plan is to partition updates into batches according
to some affinity metrics, and set the linear allocation goal per
batch. So for example, big files and append-type files can get
special treatment in separate batches, while files that seem to
be related because of having the same directory parent and being
written in the same delta will continue to be streamed out using
"naive" linear allocation, which is not necessarily as naive as
one might think.
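
To illustrate (and only to illustrate: the batch classes, the metric and
the names below are invented for this sketch, not the actual Tux3 code),
partitioning by affinity and allocating linearly per batch might look
roughly like this, with same-parent files handled within the ordinary
batch:

/* Sketch of per-batch linear allocation goals; everything here is a
 * hypothetical illustration of the partitioning described above. */
typedef unsigned long block_t;

enum batch { BATCH_ORDINARY, BATCH_BIG, BATCH_APPEND, BATCH_COUNT };

struct dirty_file {
	unsigned long size;
	int append_type;         /* looks like a log/append-mostly file */
};

/* Affinity metric: big files and append-type files get their own batches,
 * everything else streams out next to its directory neighbours. */
static enum batch classify(const struct dirty_file *file)
{
	if (file->append_type)
		return BATCH_APPEND;
	if (file->size >= 1UL << 20)
		return BATCH_BIG;
	return BATCH_ORDINARY;
}

/* Each batch allocates linearly from its own goal, so files written in the
 * same delta land together instead of scattering across the volume. */
static block_t alloc_next(block_t goal[BATCH_COUNT], const struct dirty_file *file)
{
	return goal[classify(file)]++;
}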

It will take time and a lot of performance testing to get this
right, but nobody should get the idea that it is any inherent
design limitation. The opposite is true: we have no restrictions
at all in media layout.

Compared to Ext4, we do need to address the issue that data moves
around when updated. This can cause rapid fragmentation. Btrfs has
shown issues with that for big, randomly updated files. We want to
fix it without falling back on update-in-place as Btrfs does.

Actually, Tux3 already has update-in-place, and unlike Btrfs, we
can switch to it for non-empty files. But we think that perfect data
isolation per delta is something worth fighting for, and we would
rather not force users to fiddle around with mode settings just to
make something work as well as it already does on Ext4. We will
tackle this issue by partitioning as above, and use a dedicated
allocation strategy for such files, which are easy to detect.

Metadata moving around per update does not seem to be a problem
because it is all single blocks that need very little slack space
to stay close to home.

> Anyway... you have brand new filesystem. Of course it should be
> faster/better/nicer than the existing filesystems. So don't be too
> harsh with XFS people.

They have done a lot of good work, but they still have a long way
to go. I don't see any shame in that.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips
Hi David,

On 05/11/2015 05:12 PM, David Lang wrote:
> On Mon, 11 May 2015, Daniel Phillips wrote:
> 
>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>> It is a fact of life that when you change one aspect of an intimately 
>>>>> interconnected system,
>>>>> something else will change as well. You have naive/nonexistent free space 
>>>>> management now; when you
>>>>> design something workable there it is going to impact everything else 
>>>>> you've already done. It's an
>>>>> easy bet that the impact will be negative, the only question is to what 
>>>>> degree.
>>>>
>>>> You might lose that bet. For example, suppose we do strictly linear 
>>>> allocation
>>>> each delta, and just leave nice big gaps between the deltas for future
>>>> expansion. Clearly, we run at similar or identical speed to the current 
>>>> naive
>>>> strategy until we must start filling in the gaps, and at that point our 
>>>> layout
>>>> is not any worse than XFS, which started bad and stayed that way.
>>>
>>> Umm, are you sure. If "some areas of disk are faster than others" is
>>> still true on todays harddrives, the gaps will decrease the
>>> performance (as you'll "use up" the fast areas more quickly).
>>
>> That's why I hedged my claim with "similar or identical". The
>> difference in media speed seems to be a relatively small effect
>> compared to extra seeks. It seems that XFS puts big spaces between
>> new directories, and suffers a lot of extra seeks because of it.
>> I propose to batch new directories together initially, then change
>> the allocation goal to a new, relatively empty area if a big batch
>> of files lands on a directory in a crowded region. The "big" gaps
>> would be on the order of delta size, so not really very big.
> 
> This is an interesting idea, but what happens if the files don't arrive as a 
> big batch, but rather
> trickle in over time (think a logserver that if putting files into a bunch of 
> directories at a
> fairly modest rate per directory)

If files are trickling in then we can afford to spend a lot more time
finding nice places to tuck them in. Log server files are an especially
irksome problem for a redirect-on-write filesystem because the final
block tends to be rewritten many times and we must move it to a new
location each time, so every extent ends up as one block. Oh well. If
we just make sure to have some free space at the end of the file that
only that file can use (until everywhere else is full) then the long
term result will be slightly ravelled blocks that nonetheless tend to
be on the same track or flash block as their logically contiguous
neighbours. There will be just zero or one empty data blocks mixed
into the file tail as we commit the tail block over and over with the
same allocation goal. Sometimes there will be a block or two of
metadata as well, which will eventually bake themselves into the
middle of contiguous data and stop moving around.

Putting this together, we have:

  * At delta flush, break out all the log type files
  * Dedicate some block groups to append type files
  * Leave lots of space between files in those block groups
  * Peek at the last block of the file to set the allocation goal

Something like that. What we don't want is to throw those files into
the middle of a lot of rewrite-all files, messing up both kinds of file.
We don't care much about keeping these files near the parent directory
because one big seek per log file in a grep is acceptable, we just need
to avoid thousands of big seeks within the file, and not dribble single
blocks all over the disk.
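
As a rough sketch of the last bullet above (the type and the names here
are hypothetical, not the real Tux3 allocator), seeding the allocation
goal from the file's current tail block might look like this:

/* Sketch only: peek at a file's tail block to pick the allocation goal,
 * so rewrites of the tail stay near their logical neighbours. */
typedef unsigned long blocknum_t;

struct tail_hint {
	blocknum_t last_block;   /* media address of the current tail block, 0 if none */
	int append_type;         /* detected log/append behavior */
};

static blocknum_t alloc_goal(const struct tail_hint *hint, blocknum_t delta_goal)
{
	/* Append-type files aim just past their own tail; everything else
	 * follows the per-delta linear goal. */
	if (hint->append_type && hint->last_block)
		return hint->last_block + 1;
	return delta_goal;
}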

It would also be nice to merge together extents somehow as the final
block is rewritten. One idea is to retain the final block dirty until
the next delta, and write it again into a contiguous position, so the
final block is always flushed twice. We already have the opportunistic
merge logic, but the redirty behavior and making sure it only happens
to log files would be a bit fiddly.

We will also play the incremental defragmentation card at some point,
but first we should try hard to control fragmentation in the first
place. Tux3 is well suited to online defragmentation because the delta
commit model makes it easy to move things around efficiently and safely,
but it does generate extra IO, so as a basic mechanism it is not ideal.
When we get to piling on features, that will be high on the list,
because it is relatively easy, and having that fallback gives a certain
sense of security.

> And when you then decide that you have to move the directory/file info, 
> doesn't that create a
> potentially large amount of unexpected IO that could end up interfering with 
> what the user is trying
> to do?

Right, we don't like that and don't plan to rely on it. What we hope
for is behavior that, when you slowly stir the pot, tends to improve the
layout just as often as it degrades it. It may indeed become harder to
find ideal places to put things as time goes by, but we also gain more
information to base decisions on.

Regards,

Daniel
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message

Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips


On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
 On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
 Umm, are you sure? If some areas of disk are faster than others is
 still true on today's hard drives, the gaps will decrease the
 performance (as you'll use up the fast areas more quickly).
 
 It's still true.  The difference between O.D. and I.D. (outer diameter
 vs inner diameter) LBA's is typically a factor of 2.  This is why
 short-stroking works as a technique,

That is true, and the effect is not dominant compared to introducing
a lot of extra seeks.

 and another way that people
 doing competitive benchmarking can screw up and produce misleading
 numbers.

If you think we screwed up or produced misleading numbers, could you
please be up front about it instead of making insinuations and
continuing your tirade against benchmarking and those who do it.

 (If you use partitions instead of the whole disk, you have
 to use the same partition in order to make sure you aren't comparing
 apples with oranges.)

You can rest assured I did exactly that.

Somebody complained that things would look much different with seeks
factored out, so here are some new competitive benchmarks using
fs_mark on a ram disk:

   Tasks:      1     16      64

   ext4:     231   2154    5439
   btrfs:    152    962    2230
   xfs:      268   2729    6466
   tux3:     315   5529   20301

(Files per second, more is better)

The shell commands are:

   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s1048576 -w4096 -n1000 -t1
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s65536 -w4096 -n1000 -t16
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s4096 -w4096 -n1000 -t64

The ram disk removes seek overhead and greatly reduces media transfer
overhead. This does not change things much: it confirms that Tux3 is
significantly faster than the others at synchronous loads. This is
apparently true independently of media type, though to be sure SSD
remains to be tested.

The really interesting result is how much difference there is between
filesystems, even on a ram disk. Is it just CPU or is it synchronization
strategy and lock contention? Does our asynchronous front/back design
actually help a lot, instead of being a disadvantage as you predicted?

It is too bad that fs_mark caps the number of tasks at 64, because I am
sure that some embarrassing behavior would emerge at high task counts,
as with my tests on spinning disk.

Anyway, everybody but you loves competitive benchmarks, that is why I
post them. They are not only useful for tracking down performance bugs,
but as you point out, they help us advertise the reasons why Tux3 is
interesting and ought to be merged.

Regards,

Daniel


Re: Tux3 Report: How fast can we fsync?

2015-05-02 Thread Daniel Phillips

On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:

On Fri, 1 May 2015, Daniel Phillips wrote:

On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:


Well, yes - I never claimed XFS is a general purpose filesystem.  It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...


OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.


keep in mind that if you optimize only for the small systems 
you may not scale as well to the larger ones.


Tux3 is designed to scale, and it will when the time comes. I look 
forward to putting Shardmap through its billion file test in due course. 
However, right now it would be wise to stay focused on basic 
functionality suited to a workstation because volunteer devs tend to 
have those. After that, phones are a natural direction, where hard core 
ACID commit and really smooth file ops are particularly attractive.


per the ramdisk, but possibly not as relevant as you may think. 
This is why it's good to test on as many different systems as 
you can. As you run into different types of performance you can 
then pick ones to keep and test all the time.


I keep being surprised how well it works for things we never tested 
before.


Single spinning disk is interesting now, but will be less 
interesting later. multiple spinning disks in an array of some 
sort is going to remain very interesting for quite a while.


The way to do md well is to integrate it into the block layer like 
FreeBSD does (GEOM) and expose a richer interface for the filesystem. 
That is how I think Tux3 should work with big iron raid. I hope to be

able to tackle that sometime before the stars start winking out.

now, some things take a lot more work to test than others. 
Getting time on a system with a high performance, high capacity 
RAID is hard, but getting hold of an SSD from Fry's is much 
easier. If it's a budget item, ping me directly and I can donate 
one for testing (the cost of a drive is within my unallocated 
budget and using that to improve Linux is worthwhile)


Thanks.

As I'm reading Dave's comments, he isn't attacking you the way 
you seem to think he is. He is pointing out that there are 
problems with your data, but he's also taking a lot of time to 
explain what's happening (and yes, some of this is probably 
because your simple tests with XFS made it look so bad)


I hope the lightening up trend is a trend.

the other filesystems don't use naive algorithms, they use 
something more complex, and while your current numbers are 
interesting, they are only preliminary until you add something 
to handle fragmentation. That can cause very significant 
problems.


Fsync is pretty much agnostic to fragmentation, so those results are 
unlikely to change substantially even if we happen to do a lousy job on 
allocation policy, which I naturally consider unlikely. In fact, Tux3 
fsync is going to get faster over time for a couple of reasons: the 
minimum blocks per commit will be reduced, and we will get rid of most 
of the seeks to beginning of volume that we currently suffer per commit.


Remember how fabulous btrfs looked in the initial 
reports? and then corner cases were found that caused real 
problems and as the algorithms have been changed to prevent 
those corner cases from being so easy to hit, the common case 
has suffered somewhat. This isn't an attack on Tux2 or btrfs, 
it's just a reality of programming. If you are not accounting 
for all the corner cases, everything is easier, and faster.



Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.


If you are doing tests with a 4G ramdisk on a machine with only 
4G of RAM, it seems like you end up testing a lot more than just 
the filesystem. Testing in such low memory situations can 
identify significant issues, but it is questionable as a 'which 
filesystem is better' benchmark.


A 1.3 GB tmpfs, and sorry, it is 10 GB (the machine next to it is 4G). 
I am careful to ensure the test environment does not have spurious 
memory or cpu hogs. I will not claim that this is the most sterile test 
environment possible, but it is adequate for the task at hand. Nearly 
always, when I find big variations in the test numbers it turns out to 
be a quirk of one filesystem that is not exhibited by the others. 
Everything gets multiple runs and lands in a spreadsheet. Any fishy 
variance is investigated.


By the way, the low variance kings by far are Ext4 and Tux3, and of 
those two, guess which one is more consistent. XFS is usually steady, 
but can get "emotional" with lots of tasks, and Btrfs has regular wild 
mood swings whenever the stars change alignment. And while I'm making 
gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.


Just

Re: Tux3 Report: How fast can we fsync?

2015-05-01 Thread Daniel Phillips
On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:

 Well, yes - I never claimed XFS is a general purpose filesystem.  It
 is a high performance filesystem. It is also becoming more relevant
 to general purpose systems as low cost storage gains capabilities
 that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.

 So, to demonstrate, I'll run the same tests but using a 256GB
 samsung 840 EVO SSD and show how much the picture changes.

 I will go you one better, I ran a series of fsync tests using
 tmpfs, and I now have a very clear picture of how the picture
 changes. The executive summary is: Tux3 is still way faster, and
 still scales way better to large numbers of tasks. I have every
 confidence that the same is true of SSD.

 /dev/ramX can't be compared to an SSD.  Yes, they both have low
 seek/IO latency but they have very different dispatch and IO
 concurrency models.  One is synchronous, the other is fully
 asynchronous.

I had ram available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure. I ran
some tests on a ramdisk just now and was mortified to find that I have
to reboot to empty the disk. It would take a compelling reason before
I do that again.

 This is an important distinction, as we'll see later on

I regard it as predictive of Tux3 performance on NVM.

 These trees:

 git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
 git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git

 have not been updated for 11 months. I thought tux3 had died long
 ago.

 You should keep them up to date, and send patches for xfstests to
 support tux3, and then you'll get a lot more people running,
 testing and breaking tux3

People are starting to show up to do testing now, pretty much the first
time, so we must do some housecleaning. It is gratifying that Tux3 never
broke for Mike, but of course it will assert just by running out of
space at the moment. As you rightly point out, that fix is urgent and is
my current project.

 Running the same thing on tmpfs, Tux3 is significantly faster:

 Ext4:   1.40s
 XFS:    1.10s
 Btrfs:  1.56s
 Tux3:   1.07s

 3% is not significantly faster. It's within run to run variation!

You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:

   Ext4:   1.59s
   XFS:    1.11s
   Btrfs:  1.70s
   Tux3:   1.11s

A distinct performance gap appears between Tux3 and XFS as parallel
tasks increase.
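
(For anyone wondering roughly what is being measured here, below is a
minimal sketch of a parallel fsync micro-benchmark in the same spirit.
It is not the test-fsync program referred to in this thread, which is
not reproduced here; the file name pattern, write size and per-task
iteration count are placeholders.)

   /*
    * Minimal sketch of a parallel fsync benchmark: fork N tasks, each
    * writing and fsyncing its own small file, and time the whole run.
    * Placeholder parameters; not the actual test-fsync used above.
    *
    * Build: cc -O2 -o fsync-bench fsync-bench.c
    * Run:   ./fsync-bench /mnt/test 1000
    */
   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/types.h>
   #include <sys/wait.h>
   #include <time.h>
   #include <unistd.h>

   #define WRITES_PER_TASK 10
   #define WRITE_SIZE      4096

   static void task(const char *dir, int id)
   {
       char path[4096], buf[WRITE_SIZE];
       snprintf(path, sizeof(path), "%s/foo%d", dir, id);
       memset(buf, 'x', sizeof(buf));

       int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
       if (fd < 0) { perror("open"); _exit(1); }

       for (int i = 0; i < WRITES_PER_TASK; i++) {
           if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); _exit(1); }
           if (fsync(fd) != 0) { perror("fsync"); _exit(1); }
       }
       close(fd);
       _exit(0);
   }

   int main(int argc, char **argv)
   {
       if (argc != 3) {
           fprintf(stderr, "usage: %s <dir> <ntasks>\n", argv[0]);
           return 1;
       }
       int ntasks = atoi(argv[2]);
       struct timespec t0, t1;

       clock_gettime(CLOCK_MONOTONIC, &t0);
       for (int i = 0; i < ntasks; i++) {
           pid_t pid = fork();
           if (pid < 0) { perror("fork"); return 1; }
           if (pid == 0)
               task(argv[1], i);
       }
       while (wait(NULL) > 0)
           ;                            /* wait for every task to finish */
       clock_gettime(CLOCK_MONOTONIC, &t1);

       double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
       printf("%d tasks: %.2f s\n", ntasks, secs);
       return 0;
   }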

 You wish. In fact, Tux3 is a lot faster. ...

 Yes, it's easy to be fast when you have simple, naive algorithms and
 an empty filesystem.

No it isn't or the others would be fast too. In any case our algorithms
are far from naive, except for allocation. You can rest assured that
when allocation is brought up to a respectable standard in the fullness
of time, it will be competitive and will not harm our clean filesystem
performance at all.

There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work, you know as well as anyone how hard it is. However your
denial of our current result is irritating and creates the impression
that you have an agenda. If you want to complain about something real,
complain that our current code drop is not done yet. I will humbly
apologize, and the same for enospc.

 triple checked and reproducible:

Tasks:    10     100    1,000   10,000
Ext4:   0.05    0.14     1.53    26.56
XFS:    0.05    0.16     2.10    29.76
Btrfs:  0.08    0.37     3.18    34.54
Tux3:   0.02    0.05     0.18     2.16

 Yet I can't reproduce those XFS or ext4 numbers you are quoting
 there. eg. XFS on a 4GB ram disk:

 $ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time
 ./test-fsync /mnt/test/foo 10 $i; done

 real    0m0.030s
 user    0m0.000s
 sys     0m0.014s

 real    0m0.031s
 user    0m0.008s
 sys     0m0.157s

 real    0m0.305s
 user    0m0.029s
 sys     0m1.555s

 real    0m3.624s
 user    0m0.219s
 sys     0m17.631s
 $

 That's roughly 10x faster than your numbers. Can you describe your
 test setup in detail? e.g.  post the full log from block device
 creation to benchmark completion so I can reproduce what you are
 doing exactly?

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.

Clearly the curve is the same: your numbers increase 10x going from 100
to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
significantly flatter and starts from a lower base, so it ends with a
really wide gap. You will 

Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-04-30 Thread Daniel Phillips
Hi Ted,

On 04/30/2015 07:57 AM, Theodore Ts'o wrote:
> This is one of the reasons why I find head-to-head "competitions"
> between file systems to be not very helpful for anything other than
> benchmarketing.  It's almost certain that the benchmark won't be
> "fair" in some way, and it doesn't really matter whether the person
> doing the benchmark was doing it with malice aforethought, or was just
> incompetent and didn't understand the issues --- or did understand the
> issues and didn't really care, because what they _really_ wanted to do
> was to market their file system.

Your proposition, as I understand it, is that nobody should ever do
benchmarks because any benchmark must be one of: 1) malicious; 2)
incompetent; or 3) careless. When in fact, a benchmark may be perfectly
honest, competently done, and informative.

> And even if the benchmark is fair, it might not match up with the end
> user's hardware, or their use case.  There will always be some use
> case where file system A is better than file system B, for pretty much
> any file system.  Don't get me wrong --- I will do comparisons between
> file systems, but only so I can figure out ways of making _my_ file
> system better.  And more often than not, it's comparisons of the same
> file system before and after adding some new feature which is the most
> interesting.

I cordially invite you to replicate our fsync benchmarks, or invent
your own. I am confident that you will find that the numbers are
accurate, that the test cases were well chosen, that the results are
informative, and that there is no sleight of hand.

As for whether or not people should "market" their filesystems as you
put it, that is easy for you to disparage when you are the incumbent.
If we don't tell people what is great about Tux3 then how will they
ever find out? Sure, it might be "advertising", but the important
question is, is it _truthful_ advertising? Surely you remember how
Linus got started... that was really blatant, and I am glad he did it.

>> Those are the allocation groups. I always wondered how it can be beneficial
>> to spread the allocations onto 4 areas of one partition on expensive seek
>> media. Now that makes better sense for me. I always had the gut impression
>> that XFS may not be the fastest in all cases, but it is one of the
>> filesystems with the most consistent performance over time, but I never was
>> able to fully explain why that is.
> 
> Yep, pretty much all of the traditional update-in-place file systems
> since the BSD FFS have done this, and for the same reason.  For COW
> file systems which are constantly moving data and metadata blocks
> around, they will need different strategies for trying to avoid the
> free space fragmentation problem as the file system ages.

Right, different problems, but I have a pretty good idea how to go
about it now. I made a failed attempt a while back and learned a lot,
my mistake was to try to give every object a fixed home position based
on where it was first written and the result was worse for both read
and write. Now the interesting thing is, naive linear allocation is
great for both read and read, so my effort now is directed towards
ways of doing naive linear allocation but choosing carefully which
order we do the allocation in. I will keep you posted on how that
progresses of course.

Anyway, how did we get onto allocation? I thought my post was about
fsync, and after all, you are the guest of honor.

Regards,

Daniel



Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-04-30 Thread Daniel Phillips
On 04/30/2015 07:33 AM, Mike Galbraith wrote:
> Well ok, let's forget bad blood, straw men... and answering my question
> too I suppose.  Not having any sexy  IO gizmos in my little desktop box,
> I don't care deeply which stomps the other flat on beastly boxen.

I'm with you, especially the forget bad blood part. I did my time in
big storage and I will no doubt do it again, but right now, what I care
about is bringing truth and beauty to small storage, which includes
that spinning rust of yours and also the cheap SSD you are about to
run out and buy.

I hope you caught the bit about how Tux3 is doing really well running
in tmpfs? According to my calculations, that means good things for SSD
performance.

Regards,

Daniel


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-04-30 Thread Daniel Phillips
On 04/30/2015 07:28 AM, Howard Chu wrote:
> Daniel Phillips wrote:
>>
>>
>> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
>>> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>>>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>>>
>>>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>>>> even with seek time factored out of the equation.
>>>>>
>>>>> Hm.  Do you have big-storage comparison numbers to back that?  I'm no
>>>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>>>> crap;), but Dave's big-storage guy words made sense to me.
>>>>
>>>> This has nothing to do with big storage. The proposition was that seek
>>>> time is the reason for Tux3's fsync performance. That claim was easily
>>>> falsified by removing the seek time.
>>>>
>>>> Dave's big storage words are there to draw attention away from the fact
>>>> that XFS ran the Git tests four times slower than Tux3 and three times
>>>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>>>> is, XFS obviously sucks at little storage.
>>>
>>> If you allocate spanning the disk from start of life, you're going to
>>> eat seeks that others don't until later.  That seemed rather obvious and
>>> straight forward.
>>
>> It is a logical fallacy. It mixes a grain of truth (spreading all over the
>> disk causes extra seeks) with an obvious falsehood (it is not necessarily
>> the only possible way to avoid long term fragmentation).
> 
> You're reading into it what isn't there. Spreading over the disk isn't (just)
> about avoiding fragmentation - it's about delivering consistent and
> predictable latency. It is undeniable that if you start by only allocating
> from the fastest portion of the platter, you are going to see performance
> slow down over time. If you start by spreading allocations across the entire
> platter, you make the worst-case and average-case latency equal, which is
> exactly what a lot of folks are looking for.

Another fallacy: intentionally running slower than necessary is not necessarily
the only way to deliver consistent and predictable latency. Not only that, but
intentionally running slower than necessary does not necessarily guarantee
performing better than some alternate strategy later.

Anyway, let's not be silly. Everybody in the room who wants Git to run 4 times
slower with no guarantee of any benefit in the future, please raise your hand.

>>> He flat stated that xfs has passable performance on
>>> single bit of rust, and openly explained why.  I see no misdirection,
>>> only some evidence of bad blood between you two.
>>
>> Raising the spectre of theoretical fragmentation issues when we have not
>> even begun that work is a straw man and intellectually dishonest. You have
>> to wonder why he does it. It is destructive to our community image and
>> harmful to progress.
> 
> It is a fact of life that when you change one aspect of an intimately
> interconnected system, something else will change as well. You have
> naive/nonexistent free space management now; when you design something
> workable there it is going to impact everything else you've already done.
> It's an easy bet that the impact will be negative, the only question is to
> what degree.

You might lose that bet. For example, suppose we do strictly linear allocation
each delta, and just leave nice big gaps between the deltas for future
expansion. Clearly, we run at similar or identical speed to the current naive
strategy until we must start filling in the gaps, and at that point our layout
is not any worse than XFS, which started bad and stayed that way.

Now here is where you lose the bet: we already know that linear allocation
with wrap ends horribly, right? However, as above, we start linear, without
compromise, but because of the gaps we leave, we are able to switch to a
slower strategy, but not nearly as slow as the ugly tangle we get with
simple wrap. So impact over the lifetime of the filesystem is positive, not
negative, and what seemed to be self evident to you turns out to be wrong.
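
To illustrate the shape of that strategy (a sketch only, with invented
names and numbers, not a statement of how Tux3 will actually implement
it): each delta is laid down linearly and the cursor then skips ahead,
leaving a gap behind the delta for future rewrites.

   /*
    * Illustrative sketch of "linear allocation with gaps". Every name
    * and constant here is made up for the example. The interesting
    * part -- what to do once the linear pass wraps and the gaps must
    * be filled -- is exactly the future work discussed above, so it is
    * deliberately left out.
    */
   #include <stdint.h>
   #include <stdio.h>

   typedef uint64_t block_t;

   #define DELTA_GAP 256ULL    /* free blocks left behind each delta */

   struct allocator {
       block_t cursor;         /* next linear allocation goal */
   };

   /* Allocate 'count' blocks for one delta, then leave a gap. */
   static block_t delta_alloc(struct allocator *a, block_t count)
   {
       block_t start = a->cursor;
       a->cursor += count + DELTA_GAP;
       return start;
   }

   int main(void)
   {
       struct allocator a = { 0 };
       for (int delta = 0; delta < 5; delta++)
           printf("delta %d starts at block %llu\n", delta,
                  (unsigned long long)delta_alloc(&a, 1000));
       return 0;
   }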

In short, we would rather deliver as much performance as possible, all the
time. I really don't need to think about it very hard to know that is what I
want, and what most users want.

I will make you a bet in return: when we get to doing that part properly, the
quality of the work will be just as high as everything else we have completed
so far. Why would we suddenly get lazy?

Regards,

Daniel

