Agreed.

On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy <somnath....@sandisk.com> wrote:
> Thanks, Sage, for digging into this... I suspected something similar. As I
> mentioned in today's call, even when the system is idle, syncfs is taking
> ~60 ms. I have 64 GB of RAM in the system.
> The workaround I was talking about today is working pretty well so far. In
> this implementation I am not giving much work to syncfs, since each worker
> thread writes in O_DSYNC mode. I issue syncfs before trimming the
> journal, and most of the time I saw it take < 100 ms.
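(Inline, for readers following the thread: a minimal sketch of the O_DSYNC
pattern described above. This is not FileStore code; the paths, write size,
and mount point are illustrative assumptions.)

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Each worker writes through an O_DSYNC fd, so pwrite() returns only
  // once the data (and the metadata needed to read it back) is stable.
  int fd = open("/var/lib/osd/objects/obj.0",
                O_WRONLY | O_CREAT | O_DSYNC, 0644);
  if (fd < 0) { perror("open"); return 1; }

  char buf[4096] = {0};
  if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
    perror("pwrite"); return 1;
  }
  close(fd);

  // Before trimming the journal, a single syncfs() on the data partition
  // catches anything that did not go through an O_DSYNC fd.
  int mnt = open("/var/lib/osd", O_RDONLY);
  if (mnt >= 0) {
    syncfs(mnt);    // Linux-specific call
    close(mnt);
  }
  return 0;
}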

Actually, I'd prefer that we stop using syncfs altogether. I'd rather use
"aio+dio+FileStore custom cache" to replace the whole "syncfs+pagecache"
approach. That way we can make the cache smarter and aware of the upper
layers, instead of relying on fadvise* calls. Second, we can use a
"checkpoint" method like MySQL InnoDB does: we know the bandwidth of the
frontend (FileJournal) and can decide how much and how often we want to
flush (using aio+dio).
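A rough, minimal sketch of what that aio+dio direction could look like
(libaio on an O_DIRECT fd, built with -laio; the path, queue depth, and
4 KB alignment are illustrative assumptions, and a real store would reap
completions from its own cache/checkpoint logic):

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <cstdio>

int main() {
  // O_DIRECT bypasses the page cache entirely; the store then owns caching.
  int fd = open("/var/lib/osd/data", O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  io_context_t ctx = 0;
  if (io_setup(32, &ctx) < 0) { perror("io_setup"); return 1; }

  // O_DIRECT requires aligned buffers, lengths, and offsets.
  void *buf;
  if (posix_memalign(&buf, 4096, 4096)) return 1;
  memset(buf, 0, 4096);

  struct iocb cb;
  struct iocb *cbs[1] = { &cb };
  io_prep_pwrite(&cb, fd, buf, 4096, 0);
  if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

  // A checkpoint policy would decide how many completions to wait for and
  // how much to flush, based on how fast the journal is filling.
  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, NULL);

  io_destroy(ctx);
  free(buf);
  close(fd);
  return 0;
}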

Anyway, because this is a big project, we may prefer to do the work in
newstore instead of filestore.

> I now have to wake up the sync_thread after each worker thread finishes
> writing. I will benchmark both approaches. As we discussed earlier, in the
> fsync-only approach we still need to do a db sync to make sure the
> leveldb data is persisted, right?
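(Inline: a minimal sketch of that fsync-only sequence, assuming an
fdatasync on the touched object file followed by a leveldb write with
sync = true as the "db sync" step; the paths and keys are made up for
illustration and are not the actual FileStore/omap layout.)

#include <fcntl.h>
#include <unistd.h>
#include <leveldb/db.h>
#include <cassert>

int main() {
  // 1. Persist the object data this transaction touched.
  int fd = open("/var/lib/osd/objects/obj.0", O_WRONLY);
  if (fd >= 0) {
    fdatasync(fd);
    close(fd);
  }

  // 2. The "db sync": a synchronous leveldb write fsyncs leveldb's log
  //    before Put() returns, so the omap/pg metadata is durable too.
  leveldb::DB *db = nullptr;
  leveldb::Options opts;
  opts.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(opts, "/var/lib/osd/omap", &db);
  assert(s.ok());

  leveldb::WriteOptions wo;
  wo.sync = true;
  db->Put(wo, "pg_log_key", "pg_log_value");
  delete db;
  return 0;
}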
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Wednesday, August 05, 2015 2:27 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
> Subject: FileStore should not use syncfs(2)
>
> Today I learned that syncfs(2) does an O(n) search of the superblock's inode 
> list searching for dirty items.  I've always assumed that it was only 
> traversing dirty inodes (e.g., a list of dirty inodes), but that appears not 
> to be the case, even on the latest kernels.
>
> That means that the more RAM in the box, the larger (generally) the inode 
> cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
> it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
> servicing a very light workload, and each syncfs(2) call was taking ~7 
> seconds (usually to write out a single inode).
>
> A possible workaround for such boxes is to turn 
> /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
> pages instead of inodes/dentries)...
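(That workaround amounts to raising vm.vfs_cache_pressure above its default
of 100, e.g. sysctl vm.vfs_cache_pressure=200, or programmatically as below;
200 is an illustrative value, not a tuned recommendation.)

#include <fstream>

int main() {
  // Equivalent to `sysctl vm.vfs_cache_pressure=200`: reclaim dentry/inode
  // caches more aggressively relative to page cache.  Default is 100.
  std::ofstream f("/proc/sys/vm/vfs_cache_pressure");
  f << 200 << "\n";
  f.flush();
  return f.good() ? 0 : 1;
}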
>
> I think the take-away though is that we do need to bite the bullet and make 
> FileStore f[data]sync all the right things so that the syncfs call can be 
> avoided.  This is the path you were originally headed down, Somnath, and I 
> think it's the right one.
>
> The main thing to watch out for is that according to POSIX you really need to 
> fsync directories.  With XFS that isn't the case since all metadata 
> operations are going into the journal and that's fully ordered, but we don't 
> want to allow data loss on e.g. ext4 (we need to check what the metadata 
> ordering behavior is there) or other file systems.

I guess there are only a few directory-modifying operations, is that right?
Maybe we only need to do syncfs when modifying directories?
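For reference, the directory-fsync pattern Sage is referring to is roughly
the following (a sketch only; the paths are illustrative, and as he notes,
on XFS the directory fsync isn't strictly needed because metadata operations
are journaled and fully ordered):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Write the file itself and make its data durable.
  int fd = open("/var/lib/osd/current/obj.tmp", O_WRONLY | O_CREAT, 0644);
  if (fd < 0) { perror("open"); return 1; }
  /* ... write data ... */
  fdatasync(fd);
  close(fd);

  // Publish it with rename(), then fsync the parent directory so the new
  // directory entry itself survives a crash on ext4 and other filesystems.
  rename("/var/lib/osd/current/obj.tmp", "/var/lib/osd/current/obj");
  int dfd = open("/var/lib/osd/current", O_RDONLY | O_DIRECTORY);
  if (dfd < 0) { perror("open dir"); return 1; }
  fsync(dfd);
  close(dfd);
  return 0;
}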

>
> :(
>
> sage
>



-- 
Best Regards,

Wheat