On 7/22/13 4:52 AM, KONDO Mitsumasa wrote:
> The writeback source code that I pointed to is almost the same as in
> the community kernel (I also read Linux kernel 3.9.7, and this part
> is almost the same).

The main source code difference comes from going back to the Red Hat 5 kernel, which means 2.6.18. For many of the newer versions, you are right that only the tuning parameters were changed.

Optimizing performance for the old RHEL5 kernel isn't the most important thing, but it's helpful to know the things it does very badly.

> My fsync patch only sleeps when fsync returns success, and the maximum
> sleep time is set to 10 seconds. It does not make this problem worse.

It's easy to have hundreds of relations that are getting fsync calls during a checkpoint. If you have 100 relations getting a 10 second sleep each, you could potentially delay checkpoints by 17 minutes this way. I regularly see systems where shared_buffers=8GB and there are 200 to 400 relation segments that need a sync during a checkpoint.
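
As a rough sketch of that arithmetic, assuming every relation segment hits the maximum sleep (the function name is only illustrative and is not part of the patch):

```python
def worst_case_sync_delay_seconds(relations: int, max_sleep_seconds: float) -> float:
    """Upper bound on extra checkpoint time if every fsync is followed by a full sleep."""
    return relations * max_sleep_seconds


# 100 relations is the example above; 200 and 400 are the segment counts
# seen with shared_buffers=8GB.
for relations in (100, 200, 400):
    delay = worst_case_sync_delay_seconds(relations, 10)
    print(f"{relations} relations: {delay:.0f} s = {delay / 60:.1f} min")
```

This is a worst-case bound only; the real delay depends on how long each sleep actually lasts.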

This is the biggest problem with your submission. Once you give up following the checkpoint schedule carefully, it is very easy to end up with large checkpoint deadline misses on production servers. If someone thinks they are doing a checkpoint every 5 minutes, but your patch makes them take 20 minutes instead, that is bad. They will not expect that a crash might have to replay that much activity before the server is useful again.

>> You also don't seem afraid of how exceeding the
>> checkpoint timeout is a very bad thing yet.
> I think it is important to know why this problem was caused. We should
> try to find out which program has the bug or problem.

The checkpointer process is the problem. There are no filesystem bugs or complicated issues involved in many of the bad cases. Here is a simple example that shows how the toughest problem cases happen:

-64GB of RAM
-10% dirty_background_ratio = 6GB of dirty writes = 6144MB
-2MB/s random I/O when concurrent reads are heavy
-3072 seconds to clear the cache = 51 minutes
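
The same estimate as a small sketch, assuming the 6GB of dirty data drains at a steady 2MB/s (the function name is made up for illustration):

```python
def seconds_to_drain(dirty_mb: float, write_mb_per_s: float) -> float:
    """Time to write out a backlog of dirty pages at a constant rate."""
    return dirty_mb / write_mb_per_s


dirty_mb = 6 * 1024          # 6GB of dirty writes, expressed in MB
random_io_mb_per_s = 2.0     # random I/O rate under heavy concurrent reads

seconds = seconds_to_drain(dirty_mb, random_io_mb_per_s)
print(f"{seconds:.0f} s = {seconds / 60:.0f} min")
```

A steady rate is a simplification, but it gives the right order of magnitude: roughly 51 minutes to clear the cache.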

That's how you get to an example like the one in my slides:

LOG: checkpoint complete: wrote 33282 buffers (3.2%); 0 transaction log file(s) added, 60 removed, 129 recycled; write=228.848 s, sync=4628.879 s, total=4858.859 s
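
To see how much of that checkpoint went to the sync phase, here is a small sketch that pulls the timings out of a log line in this format. The helper name is hypothetical, and the regular expression is only written against the line shown here:

```python
import re

LOG_LINE = (
    "LOG: checkpoint complete: wrote 33282 buffers (3.2%); "
    "0 transaction log file(s) added, 60 removed, 129 recycled; "
    "write=228.848 s, sync=4628.879 s, total=4858.859 s"
)


def checkpoint_phase_seconds(line: str) -> dict:
    """Return the write, sync and total times (in seconds) from one log line."""
    pairs = re.findall(r"(write|sync|total)=([0-9.]+) s", line)
    return {name: float(value) for name, value in pairs}


phases = checkpoint_phase_seconds(LOG_LINE)
sync_share = phases["sync"] / phases["total"]
print(f"sync took {phases['sync'] / 60:.0f} min, {sync_share:.0%} of the checkpoint")
```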

It's very hard to do better on these, and I don't expect any change to help this a lot. But I don't want to see a change committed that makes this sort of checkpoint 17 minutes longer if there are 100 relations involved either.

> My patch not only improves throughput but also achieves
> stable response time during the fsync phase of a checkpoint.

The main reason your patch improves latency and throughput is that it makes checkpoints farther apart. That's why I drew you a graph showing how the time between checkpoints lined up perfectly with TPS. If it were only a small problem it would be worth considering, but I think it's likely to end up with the >15 minute delays I've outlined here instead.

> And I surveyed the ext3 file system.

I wouldn't worry too much about the problems ext3 has. Like the old RHEL5 kernel I was commenting about above, there are a lot of ext3 systems out there. But we can't do a lot about getting good performance from them. It's only important to test that you're not making them a lot worse with a change.

> My system block size is 4096, but 8192 or more seems to be better.
> It will decrease the number of inodes and get larger sequential
> disk fields.

I normally increase read-ahead on Linux systems to get faster speed on sequential disk throughput. Changing the block size might work better in some cases, but not many people are willing to do that. Read-ahead is very easy to change at any time.

