Re: Blueprint: Add LevelDB support to ceph cluster backend store

2013-07-30 Thread Haomai Wang
2013-7-31, 2:01, Sage Weil wrote: > Hi Haomai, > > On Wed, 31 Jul 2013, Haomai Wang wrote: >> Every node of a Ceph cluster has a backend filesystem such as btrfs, >> xfs and ext4 that provides storage for data objects, whose locations >> are determined by the CRUSH algorithm. There should exist an ab

Re: Blueprint: Add LevelDB support to ceph cluster backend store

2013-07-30 Thread 袁冬
A better format of the result:
1KB Block: LevelDB with Compress: 1.77MB/s; LevelDB without Compress: 1.12MB/s; Btrfs: 13.84MB/s
4KB Block: LevelDB with Compress: 5.15MB/s; LevelDB without Compress: 3.21MB/s; Btrfs: 12.96MB/s
8KB Block: LevelDB with Compress: 6.44MB/s; LevelDB without Compress: 4.57MB/s; Btrfs:

Re: Blueprint: Add LevelDB support to ceph cluster backend store

2013-07-30 Thread 袁冬
We have the same idea and have already tested LevelDB performance vs. Btrfs. The result is negative, especially for big-block IO. 1KB Block / 4KB Block / 8KB Block / 128KB Block / 1MB Block; LevelDB with Compress: 1.77MB/s 5.15MB/s
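For context, a write-throughput test of this kind can be run directly against the LevelDB C++ API; the sketch below uses an arbitrary path, key count and value size, and toggling options.compression between kSnappyCompression and kNoCompression gives the "with/without Compress" cases. It illustrates the method only, not the exact benchmark that produced the numbers above:

#include <leveldb/db.h>

#include <chrono>
#include <iostream>
#include <string>

// Rough LevelDB write-throughput sketch.  Build with something like:
//   g++ -std=c++11 leveldb_bench.cc -lleveldb
int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  options.compression = leveldb::kSnappyCompression;  // or leveldb::kNoCompression

  leveldb::DB* db = NULL;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/leveldb_bench", &db);
  if (!s.ok()) {
    std::cerr << s.ToString() << std::endl;
    return 1;
  }

  const size_t value_size = 4 * 1024;   // the "4KB Block" case
  const int count = 10000;
  std::string value(value_size, 'x');

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < count; ++i)
    db->Put(leveldb::WriteOptions(), "key" + std::to_string(i), value);
  auto secs = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();

  std::cout << (count * value_size) / secs / (1024 * 1024) << " MB/s" << std::endl;
  delete db;
  return 0;
}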

Re: Blueprint: Add LevelDB support to ceph cluster backend store

2013-07-30 Thread Sage Weil
Hi Haomai, On Wed, 31 Jul 2013, Haomai Wang wrote: > Every node of a Ceph cluster has a backend filesystem such as btrfs, > xfs and ext4 that provides storage for data objects, whose locations > are determined by the CRUSH algorithm. There should exist an abstract > interface sitting between the osd and bac

Re: Blueprint: Add LevelDB support to ceph cluster backend store

2013-07-30 Thread Gregory Farnum
On Tue, Jul 30, 2013 at 3:54 PM, Alex Elsayed wrote: > I posted this as a comment on the blueprint, but I figured I'd say it here: > > The thing I'd worry about here is that LevelDB's performance (along with > that of various other K/V stores) falls off a cliff for large values. > > Symas (who mak

Re: Blueprint: Add LevelDB support to ceph cluster backend store

2013-07-30 Thread Alex Elsayed
I posted this as a comment on the blueprint, but I figured I'd say it here: The thing I'd worry about here is that LevelDB's performance (along with that of various other K/V stores) falls off a cliff for large values. Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows dra

Re: Re: question about striped_read

2013-07-30 Thread majianpeng
>On Wed, Jul 31, 2013 at 9:36 AM, majianpeng wrote: >> [snip] >> I think this patch can do the work: >> The cases which I tested: >> A: filesize=0, buffer=1M >> B: data[2M] | hole | data[2M], bs=6M/7M > >I don't think your zero buffer change is correct for this test case. > dd if=/dev/urandom of=fil

RE: Read ahead affect Ceph read performance much

2013-07-30 Thread Chen, Xiaoxi
My 0.02: we have done some readahead tuning tests on the server (ceph-osd) side; the results show that when readahead = 0.5 * object_size (4M by default), we get the maximum read throughput. Readahead values larger than this generally will not help, but also will not harm performance. For your case, seems
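As a worked example (not from the original mail): with the default 4 MB object size this rule suggests a 2 MB readahead window per device, which corresponds to read_ahead_kb = 2048, or a value of 4096 if set with blockdev --setra, since that tool counts 512-byte sectors.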

Blueprint: Add LevelDB support to ceph cluster backend store

2013-07-30 Thread Haomai Wang
Every node of a Ceph cluster has a backend filesystem such as btrfs, xfs or ext4 that provides storage for data objects, whose locations are determined by the CRUSH algorithm. There should exist an abstract interface sitting between the OSD and the backend store, allowing different backend store implementations.
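As an illustration of the kind of interface being proposed, here is a minimal C++ sketch with invented method names and a toy in-memory backend standing in for a LevelDB-backed one; it is not Ceph's actual ObjectStore API, only the shape of the abstraction:

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical backend-store abstraction sitting between the OSD and the
// on-disk store; names are illustrative, not Ceph's real interface.
struct BackendStore {
  virtual ~BackendStore() {}
  virtual int write(const std::string& oid, size_t off,
                    const std::vector<char>& data) = 0;
  virtual int read(const std::string& oid, size_t off, size_t len,
                   std::vector<char>* out) = 0;
  virtual int remove(const std::string& oid) = 0;
};

// Trivial in-memory implementation; a LevelDB-backed one would map object
// extents onto key/value pairs behind the same interface.
class MemStore : public BackendStore {
  std::map<std::string, std::vector<char> > objects_;
public:
  int write(const std::string& oid, size_t off,
            const std::vector<char>& data) {
    std::vector<char>& obj = objects_[oid];
    if (obj.size() < off + data.size())
      obj.resize(off + data.size());
    std::copy(data.begin(), data.end(), obj.begin() + off);
    return 0;
  }
  int read(const std::string& oid, size_t off, size_t len,
           std::vector<char>* out) {
    auto it = objects_.find(oid);
    if (it == objects_.end())
      return -2;                      // behaves like -ENOENT
    if (off >= it->second.size()) {
      out->clear();                   // read past the end: nothing to return
      return 0;
    }
    size_t end = std::min(off + len, it->second.size());
    out->assign(it->second.begin() + off, it->second.begin() + end);
    return 0;
  }
  int remove(const std::string& oid) {
    return objects_.erase(oid) ? 0 : -2;
  }
};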

Re: Re: question about striped_read

2013-07-30 Thread majianpeng
[snip] I think this patch can do the work. The cases which I tested: A: filesize=0, buffer=1M; B: data[2M] | hole | data[2M], bs=6M/7M; C: data[4M] | hole | hole | data[2M], bs=16M/18M. Are there any cases I have missed? Thanks! Jianpeng Ma diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 2ddf061..96ce893
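All of the hole cases above come down to zero-filling the part of the destination buffer that maps to a missing or short stripe object before moving on to the next one. A userspace C++ sketch of that idea (invented names, not the kernel patch itself):

#include <cstddef>
#include <cstring>

// 'want' bytes of the caller's buffer map to this stripe object, but only
// 'got' bytes came back (got == 0 models ENOENT, i.e. a hole).  If the file
// logically continues past this object, the rest of the window must be
// zeroed so the following stripe objects land at the right offsets with
// zeros in between.
void fill_stripe_hole(char* buf, size_t want, size_t got, bool more_follows) {
  if (got < want && more_follows)
    std::memset(buf + got, 0, want - got);
}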

Re: Re: question about striped_read

2013-07-30 Thread Sage Weil
On Wed, 31 Jul 2013, majianpeng wrote: > >On Wed, 31 Jul 2013, majianpeng wrote: > >> >On Tue, Jul 30, 2013 at 7:41 PM, majianpeng wrote: > [snip] > > > >For ceph_osdc_readpages(), > > > >> A: ret = ENOENT > > > From the original code, for this case we should zero the area. > Why? If an object is

Re: Re: question about striped_read

2013-07-30 Thread majianpeng
>On Wed, 31 Jul 2013, majianpeng wrote: >> >On Tue, Jul 30, 2013 at 7:41 PM, majianpeng wrote: [snip] > >For ceph_osdc_readpages(), > >> A: ret = ENOENT > From the original code, for this case we should zero the area. Why? Thanks! Jianpeng Ma >The object does not exist. > >> B: ret = 0 > >The obj

Re: Re: question about striped_read

2013-07-30 Thread Sage Weil
On Wed, 31 Jul 2013, majianpeng wrote: > >On Tue, Jul 30, 2013 at 7:41 PM, majianpeng wrote: > > >dd if=/dev/urandom bs=1M count=2 of=file_with_holes > >dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc > >dd if=file_with_holes bs=8M >/dev/null > > >

Re: Re: question about striped_read

2013-07-30 Thread majianpeng
>On Tue, Jul 30, 2013 at 7:41 PM, majianpeng wrote: >dd if=/dev/urandom bs=1M count=2 of=file_with_holes >dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc >dd if=file_with_holes bs=8M >/dev/null > diff --git a/fs/ceph/file.c b/fs/ceph/file.c in

Re: Blueprint: inline data support (step 2)

2013-07-30 Thread Loic Dachary
Hi, It would be nice to have URLs to the current implementation and the benchmark results you got in the blueprint. http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Inline_data_support_%28Step_2%29 Cheers On 31/07/2013 02:10, Li Wang wrote: > We have worked out a preliminary implementation

Blueprint: inline data support (step 2)

2013-07-30 Thread Li Wang
We have worked out a preliminary implementation of inline data support, and observed an obvious speed-up for small-file access. Step 2 will focus on: (1) trying to make things simpler by eliminating the half-inlined state of a file; (2) efficiently dealing with shared write or read/write; (3) P

blueprint : erasure code

2013-07-30 Thread Loic Dachary
Hi, I submitted a blueprint about the current status of the erasure coded pool implementation. As of now there is still a lot of work to be done to refactor PG and ReplicatedPG, but Samuel Just found the right way to do it not too long ago. In a nutshell, a PGBackend base class from which Replic
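A rough C++ sketch of the split being described, with invented method names rather than the real interface:

#include <string>
#include <vector>

// Common base class: the PG logic talks to this and no longer needs to know
// whether the pool is replicated or erasure coded.
class PGBackend {
public:
  virtual ~PGBackend() {}
  virtual void submit_write(const std::string& oid,
                            const std::vector<char>& data) = 0;
  virtual void recover_object(const std::string& oid, int missing_shard) = 0;
};

class ReplicatedBackend : public PGBackend {
public:
  void submit_write(const std::string& oid, const std::vector<char>& data) {
    // send the whole object to every replica (sketch only)
  }
  void recover_object(const std::string& oid, int missing_shard) {
    // copy the whole object from a surviving replica (sketch only)
  }
};

class ErasureCodedBackend : public PGBackend {
public:
  void submit_write(const std::string& oid, const std::vector<char>& data) {
    // split into k data chunks + m coding chunks, one chunk per shard
  }
  void recover_object(const std::string& oid, int missing_shard) {
    // read k surviving chunks and decode the missing chunk
  }
};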

Re: [PATCH] libceph: fix deadlock in ceph_build_auth()

2013-07-30 Thread David Miller
From: Alexey Khoroshilov Date: Mon, 29 Jul 2013 06:58:08 +0400 > ceph_build_auth() locks ac->mutex and then calls ceph_auth_build_hello(), > which locks the same mutex, i.e. it deadlocks itself. > > The patch moves the actual code from ceph_auth_build_hello() to > ceph_build_hello_auth_request()
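The pattern of both the bug and the fix can be illustrated in userspace with std::mutex; this shows the locking shape only, not the libceph code:

#include <mutex>

std::mutex ac_mutex;  // stands in for ac->mutex

// Before: the hello builder took the lock itself, so calling it from a path
// that already held the lock self-deadlocked on the non-recursive mutex.
// After: the real work lives in a helper that expects the lock to be held.
static void build_hello_locked() {
  // assemble the hello request; caller holds ac_mutex
}

void build_hello() {                       // external entry point
  std::lock_guard<std::mutex> lock(ac_mutex);
  build_hello_locked();
}

void build_auth() {                        // the path that used to deadlock
  std::lock_guard<std::mutex> lock(ac_mutex);
  build_hello_locked();                    // no second lock attempt
}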

Re: Fwd: [ceph-users] Small fix for ceph.spec

2013-07-30 Thread Erik Logtenberg
Hi, I will report the issue there as well. Please note that Ceph seems to support Fedora 17, even though that release is considered end-of-life by Fedora. This issue with the leveldb package cannot be fixed for Fedora 17, only for 18 and 19. So if Ceph wants to continue supporting Fedora 17, addin

blueprint: cache pool overlay

2013-07-30 Thread Sage Weil
I posted a blueprint with an approach to tiered storage that is an alternative to the redirects I mentioned yesterday. Instead of demoting data out of an existing pool to a colder pool, we could put a faster pool logically in front of an existing pool as a cache. Think SSD or fusionio or similar. I t
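The read path of such an overlay could be sketched like this (invented names and a toy in-memory "pool", not the actual OSD code):

#include <map>
#include <string>
#include <vector>

// Trivial stand-in for a RADOS pool: an in-memory object map.
struct Pool {
  std::map<std::string, std::vector<char> > objects;
  bool read(const std::string& oid, std::vector<char>* out) {
    auto it = objects.find(oid);
    if (it == objects.end()) return false;
    *out = it->second;
    return true;
  }
  void write(const std::string& oid, const std::vector<char>& data) {
    objects[oid] = data;
  }
};

// Try the fast (SSD) cache pool first, fall back to the cold base pool on a
// miss, and promote the object into the cache on the way back.
bool overlay_read(Pool& cache, Pool& base,
                  const std::string& oid, std::vector<char>* out) {
  if (cache.read(oid, out))
    return true;                 // cache hit, served from the fast pool
  if (!base.read(oid, out))
    return false;                // the object does not exist anywhere
  cache.write(oid, *out);        // promote so the next read is a cache hit
  return true;
}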

[RFC] Factors affect CephFS read performance

2013-07-30 Thread Li Wang
We measured CephFS read performance using iozone on a 32-node HPC cluster. The Ceph cluster configuration: 24 OSDs (one per node), 1 MDS, 1-4 clients (one thread per client per node). The hardware of a node: CPU and network are both powerful enough not to be a bottleneck during the test, me

Re: krbd & live resize

2013-07-30 Thread Loic Dachary
Hi Laurent, Your patch can be applied to 3.8 as described here: http://dachary.org/?p=2179 Thanks again On 30/07/2013 12:07, Laurent Barbe wrote: > Live resize has been added in 3.6.10 for krbd client. > > There is a need to do revalidate_disk() on rbd resize : > https://git.kernel.org/cgit/li

Re: Negative degradation?

2013-07-30 Thread David McBride
On 30/07/13 09:20, Roald van Loon wrote: > Came across it this morning when booting my development environment, > has anyone seen this before? > > It's with 0.67-rc2; > > 2013-07-30 08:09:09.230349 mon.0 [INF] pgmap v4172: 1216 pgs: 123 > active, 219 active+clean, 161 active+clean+replay, 713 pee

[PATCH 8/9] ceph: WQ_NON_REENTRANT is meaningless and going away

2013-07-30 Thread Tejun Heo
Hello, Please route this through the subsystem tree. As written in the description, this shouldn't make any functional difference and just prepares for the removal of WQ_NON_REENTRANT, which is already a noop. Thanks. -- 8< --- dbf2576e37 ("workqueue: make all workqueues non-reentrant") ma

Re: Re: question about striped_read

2013-07-30 Thread Yan, Zheng
On Tue, Jul 30, 2013 at 7:41 PM, majianpeng wrote: >>> dd if=/dev/urandom bs=1M count=2 of=file_with_holes dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc dd if=file_with_holes bs=8M >/dev/null >>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c >>> index 2ddf

Re: Re: question about striped_read

2013-07-30 Thread majianpeng
>On Tue, Jul 30, 2013 at 7:01 PM, majianpeng wrote: >>>On Mon, Jul 29, 2013 at 11:00 AM, majianpeng wrote: [snip] >I don't think the later was_short can handle the hole case. For the hole >case, >we should try reading next strip object instead of return. how about >

Re: Re: question about striped_read

2013-07-30 Thread Yan, Zheng
On Tue, Jul 30, 2013 at 7:01 PM, majianpeng wrote: >>On Mon, Jul 29, 2013 at 11:00 AM, majianpeng wrote: >>> >>> [snip] >>> >I don't think the later was_short can handle the hole case. For the hole >>> >case, >>> >we should try reading next strip object instead of return. how about >>> >below pa

Re: Re: question about striped_read

2013-07-30 Thread majianpeng
>On Mon, Jul 29, 2013 at 11:00 AM, majianpeng wrote: >> >> [snip] >> >I don't think the later was_short can handle the hole case. For the hole >> >case, >> >we should try reading next strip object instead of return. how about >> >below patch. >> > >> Hi Yan, >> I used this demo to test h

Re: krbd & live resize

2013-07-30 Thread Loic Dachary
Hi Laurent, Thanks for the solution ! On 30/07/2013 12:07, Laurent Barbe wrote: > Live resize has been added in 3.6.10 for krbd client. > > There is a need to do revalidate_disk() on rbd resize : > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d98df63ea7e87d5df4

Re: krbd & live resize

2013-07-30 Thread Laurent Barbe
Live resize has been added in 3.6.10 for the krbd client. There is a need to do revalidate_disk() on rbd resize: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d98df63ea7e87d5df4dce0cece0210e2a777ac00 Cheers Laurent On 30/07/2013 11:57, Loic Dachary wrote:

Re: krbd & live resize

2013-07-30 Thread Loic Dachary
Hi, Tried on another machine running 3.8.0-25-generic #37~precise1-Ubuntu SMP and the behavior is the same. Cheers On 30/07/2013 11:57, Loic Dachary wrote: > > > On 30/07/2013 11:55, Laurent Barbe wrote: >> Hello Loic, >> >> which version of kernel do you use for krbd ? > > Linux i-csnces-

Re: krbd & live resize

2013-07-30 Thread Loic Dachary
On 30/07/2013 11:55, Laurent Barbe wrote: > Hello Loic, > > which version of the kernel do you use for krbd? Linux i-csnces- 3.2.0-41-generic #66-Ubuntu SMP That may explain a few things ... :-) > > Laurent > > > On 29/07/2013 23:50, Loic Dachary wrote: >> Hi, >> >> This works: >> >> l

Re: krbd & live resize

2013-07-30 Thread Laurent Barbe
Hello Loic, which version of the kernel do you use for krbd? Laurent On 29/07/2013 23:50, Loic Dachary wrote: Hi, This works: lvcreate --name tmp --size 10G all Logical volume "tmp" created mkfs.ext4 /dev/all/tmp mount /dev/all/tmp /mnt blockdev --getsize64 /dev/all/tmp 10737418240 lvext

Re: mds.0 crashed with 0.61.7

2013-07-30 Thread Andreas Friedrich
On Mon, Jul 29, 2013 at 08:47:00AM -0700, Sage Weil wrote: > Hi Andreas, > > Can you reproduce this (from mkcephfs onward) with debug mds = 20 and > debug ms = 1? I've seen this crash several times but never been able to > get to the bottom of it. ... done. The mds.0 log file is appended.

Negative degradation?

2013-07-30 Thread Roald van Loon
Came across it this morning when booting my development environment, has anyone seen this before? It's with 0.67-rc2; 2013-07-30 08:09:09.230349 mon.0 [INF] pgmap v4172: 1216 pgs: 123 active, 219 active+clean, 161 active+clean+replay, 713 peering; 241 MB data, 630 MB used, 5483 MB / 6114 MB avail

Re: Fwd: [ceph-users] Small fix for ceph.spec

2013-07-30 Thread Danny Al-Gaaf
Hi, then the Fedora package is broken. If you check the spec file of http://dl.fedoraproject.org/pub/fedora/linux/updates/19/SRPMS/leveldb-1.12.0-3.fc19.src.rpm you can see the spec file sets "BuildRequires: snappy-devel" but not the corresponding "Requires: snappy-devel" for the devel pac

[PATCH] Add missing buildrequires for Fedora

2013-07-30 Thread Erik Logtenberg
Hi, This patch adds two BuildRequires to the ceph.spec file that are needed to build the RPMs under Fedora. Danny Al-Gaaf commented that the snappy-devel dependency should actually be added to the leveldb-devel package. I will try to get that fixed too; in the meantime, this patch does make sure

Re: Fwd: [ceph-users] Small fix for ceph.spec

2013-07-30 Thread Erik Logtenberg
Hi, Fedora, in this case Fedora 19, x86_64. Kind regards, Erik. On 07/30/2013 09:29 AM, Danny Al-Gaaf wrote: > Hi, > > I think this is a bug in the packaging of the leveldb package in this case > since the spec file already sets dependencies on leveldb-devel. > > leveldb depends on snappy, th

Re: Fwd: [ceph-users] Small fix for ceph.spec

2013-07-30 Thread Danny Al-Gaaf
Hi, I think this is a bug in the packaging of the leveldb package in this case since the spec file already sets dependencies on leveldb-devel. leveldb depends on snappy, therefore the leveldb package should set a dependency on snappy-devel for leveldb-devel (check the SUSE spec file for leveldb: h