RE: Object Write Latency

Sage Weil Mon, 23 Sep 2013 09:20:19 -0700

On Mon, 23 Sep 2013, Andreas Joachim Peters wrote:
> We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.
> 
> The FS does 140k/s append+sync and the latency is now:
> 
> ~1 ms for few byte objects with single replica
> ~2 ms for few byte objects three replica  (instead of 65-80ms)
> 
> This gives probably the base-line of the best you can do with the 
> current implementation.
> 
> ==> the 80ms are probably just a 'feature' of the hardware (JBOD 
> disks/controller) and we might try to find some tuning parameters to 
> improve the latency slightly.
> 
> Could you just explain how the async api functions (is_complete, 
> is_safe) map to the three states
> 
> 1) object is transferred from client to all OSDs and is present in memory 
> there


Nothing happens yet.

> 2) object is written to the OSD journal

Client gets a COMMIT, which implies ACK.

> 3) object is committed from OSD journal to the OSD filesystem

OSD now allows subsequent reads, or read/modify/write operations.

> Is it correct that the object is visible by clients only when 3) has 
> happened?

Yeah.

The distinction between ACK (operation is serialized and visible) and 
COMMIT (operation is now durable) was conceived under the assumption that 
the serialized+visible step would be cheaper than making the operation 
durable.  This is the case for btrfs.  Because of this, the COMMIT message 
implies ACK, so the client will see either ACK followed by COMMIT, or 
COMMIT alone, but never COMMIT followed by ACK.

For ext4 and xfs, we need to do write-ahead journaling just for 
consistency, so the commit happens first.
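
To make the two orderings concrete, here is a small toy model of what the 
client sees.  It is only a sketch, not the librados API: WriteCompletion, 
on_reply and replay are hypothetical names, and it assumes is_complete 
tracks ACK and is_safe tracks COMMIT.

```python
from dataclasses import dataclass


@dataclass
class WriteCompletion:
    """Hypothetical client-side view of one async write (not librados)."""
    acked: bool = False      # ACK seen: operation is serialized and visible
    committed: bool = False  # COMMIT seen: operation is durable

    def on_reply(self, msg: str) -> None:
        if msg == "ACK":
            self.acked = True
        elif msg == "COMMIT":
            # COMMIT implies ACK, so a lone COMMIT sets both flags.
            self.acked = True
            self.committed = True
        else:
            raise ValueError(f"unexpected reply: {msg!r}")

    def is_complete(self) -> bool:
        # Assumption: is_complete corresponds to having been ACKed.
        return self.acked

    def is_safe(self) -> bool:
        # Assumption: is_safe corresponds to having been COMMITted.
        return self.committed


def replay(replies):
    """Return (is_complete, is_safe) after each reply, in order."""
    completion = WriteCompletion()
    states = []
    for msg in replies:
        completion.on_reply(msg)
        states.append((completion.is_complete(), completion.is_safe()))
    return states


# btrfs-style ordering: visible first, durable later.
print(replay(["ACK", "COMMIT"]))  # [(True, False), (True, True)]
# ext4/xfs-style ordering: the commit happens first, so only COMMIT arrives.
print(replay(["COMMIT"]))         # [(True, True)]
```

In the ext4/xfs case both flags flip at the same moment, so waiting for 
"complete" rather than "safe" would not be expected to save any latency 
in this model.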

Hope that helps!

I still think you should look at the logs for the JBOD hardware to see 
where the time is spent; it sounds like there is room for improvement.

sage

