I think this could be part of what I am seeing. I found this post from back in 
2003:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

That post seems to describe a workaround for the behaviour I am seeing. The 
constant small-block IO I was seeing looks like it was either the pg log and 
info updates or FS metadata. I have been going through the blktraces I did 
today, and 90% of the time I am just seeing 8kb writes and journal writes. 
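For what it's worth, this is roughly how I was tallying write sizes out of the 
blktrace data. It is only a sketch: it assumes blkparse's default text output, 
and the helper name and field layout are my own, not anything from the tools 
themselves.

```python
import re
from collections import Counter

def write_size_histogram(blkparse_lines):
    """Tally I/O sizes (in KiB) of write completions from blkparse text output.

    Assumption: a completion line looks roughly like
        '8,0 1 123 0.000123 456 C W 1000 + 16 [pid]'
    where 'C' = complete, 'W' = write, and '+ 16' means 16 sectors of 512 bytes.
    """
    sizes = Counter()
    # Match a 'C' (complete) event for a write ('W', 'WS', 'WM', ...),
    # capturing the sector count after the '+'.
    pat = re.compile(r"\sC\s+W\S*\s+\d+\s\+\s(\d+)\s")
    for line in blkparse_lines:
        m = pat.search(line)
        if m:
            sectors = int(m.group(1))
            sizes[sectors * 512 // 1024] += 1  # bucket by size in KiB
    return sizes
```

Running something like this over my traces is what showed the pile of 8 KiB 
writes dominating everything else.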

I think the journal and filestore settings I have been adjusting have just 
been moving the "data" sync around the benchmark timeline and altering when the 
journal starts throttling. It seems that with small IOs the metadata overhead 
takes several times longer than the actual data writes. This probably also 
explains why a full-SSD OSD is faster than HDD+SSD even for brief bursts of 
IO.
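For reference, these are the FileStore-era knobs I have been playing with. The 
values here are illustrative only, not recommendations:

```ini
[osd]
; Illustrative values only -- tune to your hardware.
filestore max sync interval = 5       ; seconds between forced filestore syncs
filestore min sync interval = 0.01
journal max write bytes = 10485760    ; cap on a single journal write
journal queue max bytes = 33554432    ; journal throttling kicks in when full
filestore queue max bytes = 104857600
```

Shrinking the sync interval or the queue limits just seems to move where in the 
benchmark the stall shows up, which is what made me suspect the metadata 
overhead rather than the settings themselves.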

In the thread linked above, it seems that adding something like flashcache 
can massively help overcome this problem, so that is something I might look 
into. It’s a shame I didn't get BBWC with my OSD nodes, as that would likely 
have alleviated this problem with a lot less hassle.


> Ah, no, you're right. With the bench command it all goes into one object;
> it's just a separate transaction for each 64k write. But again, depending on
> flusher and throttler settings in the OSD, and the backing FS'
> configuration, it can be a lot of individual updates — in particular, every
> time there's a sync it has to update the inode.
> Certainly that'll be the case in the described configuration, with
> relatively low writeahead limits on the journal but high sync intervals —
> once you hit the limits, every write will get an immediate flush request.
> 
> But none of that should have much impact on your write amplification tests
> unless you're actually using "osd bench" to test it. You're more likely to be
> seeing the overhead of the pg log entry, pg info change, etc that's associated
> with each write.
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
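
The per-write overhead described in the quoted reply can be sketched as rough 
arithmetic. The per-item sizes below are my assumptions (each update padded to 
a 4 KiB FS block), not measured values:

```python
def write_amplification(data_bytes, pg_log_bytes=4096, pg_info_bytes=4096,
                        inode_update_bytes=4096):
    """Illustrative only: assume each client write also rewrites a pg log
    entry, the pg info, and (on sync) the inode, each costing one 4 KiB
    filesystem block. The sizes are assumptions, not measurements."""
    total = data_bytes + pg_log_bytes + pg_info_bytes + inode_update_bytes
    return total / data_bytes

# A 4 KiB client write would hit the disk as ~16 KiB total:
print(write_amplification(4096))     # 4.0x amplification
# while a 1 MiB write barely notices the same fixed overhead:
print(write_amplification(1048576))  # ~1.01x
```

Under these assumptions, small writes spend most of their disk time on 
metadata, which matches what I am seeing in the blktraces.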



