On Tue, 20 Oct 2015, Z Zhang wrote:
> Thanks, Sage, for pointing out the PR and ceph branch. I will take a 
> closer look.
> 
> Yes, I am trying the KVStore backend. The reason we are trying it is 
> that a few of our users don't have such strict requirements about 
> occasional data loss. It seems the KVStore backend without a 
> synchronized WAL could achieve better performance than filestore. And 
> if we use the WAL without synchronization, only data still in the page 
> cache would be lost on a machine crash, not a process crash. What do 
> you think?

That sounds dangerous.  The OSDs are recording internal metadata about the 
cluster (peering, replication, etc.)... even if you don't care so much 
about recent user data writes you probably don't want to risk breaking 
RADOS itself.  If the kv backend is giving you a stale point-in-time 
consistent copy it's not so bad, but in a power-loss event it could give 
you problems...
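
To make the tradeoff concrete, here is a minimal rocksdb sketch of what 
"WAL but no synchronization" looks like at the API level (illustrative 
only, not Ceph code; the database path and keys are arbitrary):

    // Minimal rocksdb sketch (illustrative only, not Ceph code).
    #include <rocksdb/db.h>

    int main() {
      rocksdb::DB* db;
      rocksdb::Options opts;
      opts.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kvtest", &db);
      if (!s.ok()) return 1;

      rocksdb::WriteOptions synced;
      synced.sync = true;        // WAL is synced before the write returns;
                                 // survives power loss, pays a per-write sync

      rocksdb::WriteOptions unsynced;
      unsynced.sync = false;     // WAL is written but left in the page cache;
                                 // a process crash is fine, power loss is not

      db->Put(synced, "key1", "value1");
      db->Put(unsynced, "key2", "value2");

      delete db;
      return 0;
    }

With sync=false the write returns once the WAL record is in the OS page 
cache, which is exactly why a process crash is recoverable but a machine 
crash or power loss can drop the most recent writes.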

sage

> 
>     Thanks. Zhi Zhang (David)
> 
> Date: Tue, 20 Oct 2015 05:47:44 -0700
> From: s...@newdream.net
> To: zhangz.da...@outlook.com
> CC: ceph-us...@lists.ceph.com; ceph-devel@vger.kernel.org
> Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore
> 
> On Tue, 20 Oct 2015, Z Zhang wrote:
> > Hi Guys,
> > 
> > I am trying the latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with 
> > rocksdb 3.11 as the OSD backend. I am using rbd to test performance, 
> > and the following is my cluster info.
> > 
> > [ceph@xxx ~]$ ceph -s
> >     cluster b74f3944-d77f-4401-a531-fa5282995808
> >      health HEALTH_OK
> >      monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
> >             election epoch 1, quorum 0 xxx
> >      osdmap e338: 44 osds: 44 up, 44 in
> >             flags sortbitwise
> >       pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
> >             1940 MB used, 81930 GB / 81932 GB avail
> >                 2048 active+clean
> > 
> > All the disks are spinning ones with the write cache turned on. 
> > Rocksdb's WAL and sst files are on the same disk as each OSD.
>  
> Are you using the KeyValueStore backend?
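
(For readers following along: "KeyValueStore backend" refers to an OSD 
object store configured roughly like the hypothetical ceph.conf fragment 
below. The option names are from memory for this release line and should 
be treated as assumptions, not a verified configuration.)

    [osd]
        # hypothetical sketch; verify option names against your release
        osd objectstore = keyvaluestore
        keyvaluestore backend = rocksdb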
>  
> > Using fio to generate the following write load: 
> > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K 
> > -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1  
> > 
> > Test result:
> > WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
> > WAL enabled + sync: true (default) + disk write cache: on|off  will get 
> > only ~25 IOPS.
> > 
> > I tuned some other rocksdb options, but with no luck.
>  
> The wip-newstore-frags branch sets some defaults for rocksdb that I think 
> look pretty reasonable (at least given how newstore is using rocksdb).
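
I have not listed that branch's actual settings here, but for 
illustration, the kind of rocksdb knobs usually tuned for this sort of 
workload look like the sketch below (the values are placeholders, not 
the wip-newstore-frags defaults):

    // Illustrative rocksdb tuning sketch; placeholder values, not the
    // wip-newstore-frags defaults.
    #include <rocksdb/options.h>

    rocksdb::Options make_tuned_options() {
      rocksdb::Options opts;
      opts.create_if_missing = true;
      opts.write_buffer_size = 64 << 20;          // larger memtables
      opts.max_write_buffer_number = 4;           // more memtables in flight
      opts.min_write_buffer_number_to_merge = 2;  // merge before flushing
      opts.max_background_flushes = 2;            // parallel flush threads
      opts.max_background_compactions = 4;        // parallel compactions
      opts.target_file_size_base = 64 << 20;      // larger sst files
      opts.compression = rocksdb::kSnappyCompression;
      return opts;
    }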
>  
> > I tracked down the rocksdb code and found each writer's Sync operation 
> > takes ~30ms to finish. And as shown above, it is strange that the 
> > performance shows little difference no matter whether the disk write 
> > cache is on or off.
> > 
> > Did you guys encounter a similar issue? Or am I missing something that 
> > causes rocksdb's poor write performance?
>  
> Yes, I saw the same thing.  This PR addresses the problem and is nearing 
> merge upstream:
>  
>       https://github.com/facebook/rocksdb/pull/746
>  
> There is also an XFS performance bug that is contributing to the problem, 
> but it looks like Dave Chinner just put together a fix for that.
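
One way to see how much of that ~30ms comes from the filesystem and the 
device rather than from rocksdb itself is a stand-alone append+fdatasync 
probe along these lines (a rough sketch; the file path, block size, and 
iteration count are arbitrary):

    // Rough fdatasync latency probe (illustrative; path and sizes arbitrary).
    #include <fcntl.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      int fd = open("/var/lib/ceph/osd/test.wal",
                    O_WRONLY | O_CREAT | O_APPEND, 0644);
      if (fd < 0) { perror("open"); return 1; }
      std::vector<char> buf(4096, 'x');
      for (int i = 0; i < 100; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        if (write(fd, buf.data(), buf.size()) < 0) { perror("write"); break; }
        if (fdatasync(fd) < 0) { perror("fdatasync"); break; }
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("append+fdatasync %d: %.2f ms\n", i, ms);
      }
      close(fd);
      return 0;
    }

If each append+fdatasync on the raw filesystem already costs tens of 
milliseconds, the bottleneck is below rocksdb.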
>  
> But... we likely won't be using KeyValueStore in its current form over 
> rocksdb (or any other kv backend).  It stripes object data over key/value 
> pairs, which IMO is not the best approach.
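
"Striping object data over key/value pairs" means roughly the following 
(a conceptual sketch, not the actual KeyValueStore key schema or stripe 
size):

    // Conceptual sketch of striping object data across kv pairs
    // (not the actual KeyValueStore key schema or stripe size).
    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>
    #include <cstdio>
    #include <string>

    static const size_t STRIPE_SIZE = 4096;  // illustrative stripe width

    // Store one object's data as fixed-size stripes, each under its own
    // key such as "obj123.000000", "obj123.000001", ...
    void put_object(rocksdb::DB* db, const std::string& oid,
                    const std::string& data) {
      rocksdb::WriteBatch batch;
      size_t n = 0;
      for (size_t off = 0; off < data.size(); off += STRIPE_SIZE, ++n) {
        char suffix[16];
        snprintf(suffix, sizeof(suffix), ".%06zu", n);
        batch.Put(oid + suffix, data.substr(off, STRIPE_SIZE));
      }
      db->Write(rocksdb::WriteOptions(), &batch);
    }

Reads and partial overwrites then have to reassemble or rewrite whole 
stripes, and the old stripe values linger until compaction.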
>  
> sage
> 