RE: Aggregate failure report in ceph -s

2015-11-21 Thread Chen, Xiaoxi
On Fri, 20 Nov 2015, Chen, Xiaoxi wrote: Hi Sage, as we are looking at the failure detection part of

RE: Cannot start osd due to permission of journal raw device

2015-11-09 Thread Chen, Xiaoxi
On Mon, 9 Nov 2015, Chen, Xiaoxi wrote: There are no such rules (only 70-persistent-net.rules) in my /etc/udev/rules.d/

RE: Cannot start osd due to permission of journal raw device

2015-11-08 Thread Chen, Xiaoxi
On Fri, 6 Nov 2015, Chen, Xiaoxi wrote: Hi, I tried infernalis (version 9.1.0 (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but

Cannot start osd due to permission of journal raw device

2015-11-05 Thread Chen, Xiaoxi
Hi, I tried infernalis (version 9.1.0 (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but failed due to the permissions on the journal; the OSD was upgraded from hammer (the same is true for a newly created OSD). I am using a raw device as the journal, and this is because the default privilege of a raw block device is
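
Since infernalis the OSD runs as the unprivileged ceph user, so a raw journal block device owned by root:disk cannot be opened. One common fix is a udev rule that hands the device to ceph:ceph. A minimal sketch, assuming the journal is /dev/sdb1 (the file name and device match are placeholders you would adapt):

    # hypothetical /etc/udev/rules.d/99-ceph-journal.rules
    KERNEL=="sdb1", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"

After adding the rule, `udevadm trigger` (or a reboot) reapplies the ownership.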

RE: Specify omap path for filestore

2015-11-04 Thread Chen, Xiaoxi
Hi Ning, yes, we don't save any IO, and may even need more IO because of read amplification in LevelDB. But the tradeoff is spending SSD IOPS instead of HDD IOPS: IOPS per dollar on an SSD (10K+ IOPS per $100) is two orders of magnitude cheaper than on an HDD (~100 IOPS per $100). Some use cases: 1. When we have enough

RE: Specify omap path for filestore

2015-11-01 Thread Chen, Xiaoxi
As we use submit_transaction (instead of submit_transaction_sync) in DBObjectMap, and we also don't use a kv_sync_thread for the DB, it seems we need to rely on syncfs(2) at commit time to persist everything? If that is the case, moving the DB out of the same FS as the data may cause issues?
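
For context, a minimal sketch of what a syncfs(2)-based commit barrier covers, assuming a Linux build (the data path is just an example): syncfs flushes only the filesystem containing the given fd, which is exactly why a DB moved to a different filesystem would get no durability from it.

    // C++ sketch: syncfs(2) flushes only the filesystem that contains `fd`.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      int fd = open("/var/lib/ceph/osd/ceph-0", O_RDONLY);  // example data dir
      if (fd < 0) { perror("open"); return 1; }
      if (syncfs(fd) < 0) { perror("syncfs"); return 1; }   // does not touch other mounts
      close(fd);
      return 0;
    }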

RE: chooseleaf may cause some unnecessary pg migrations

2015-10-23 Thread Chen, Xiaoxi
I just realized the measurement I mentioned last time is not precise. It should be 'number of changed mappings' instead of 'number of remapped

RE: newstore direction

2015-10-21 Thread Chen, Xiaoxi
Mark Nelson wrote: Thanks Allen! The devil is always in the details

RE: chooseleaf may cause some unnecessary pg migrations

2015-10-19 Thread Chen, Xiaoxi

RE: newstore direction

2015-10-19 Thread Chen, Xiaoxi
+1. Nowadays K-V DBs care more about very small key-value pairs, say several bytes to a few KB, but in the SSD case we only care about 4KB or 8KB. In this sense, NVMKV is a good design, and it seems some SSD vendors are also trying to build this kind of interface; we had an NVM-L library but still

RE: newstore direction

2015-10-19 Thread Chen, Xiaoxi
There is something like http://pmem.io/nvml/libpmemobj/ to adapt NVMe to transactional object storage, but it definitely needs some more work.
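
For anyone curious what the libpmemobj transactional API looks like, a rough sketch (the pool path, layout name, and root struct are invented for illustration and are not tied to any Ceph backend); link with -lpmemobj:

    #include <libpmemobj.h>
    #include <cstdint>
    #include <cstdio>

    struct my_root {            // illustrative layout, not a real Ceph structure
      char name[64];
      uint64_t bytes;
    };

    int main() {
      PMEMobjpool *pop = pmemobj_create("/pmem/objpool", "example_layout",
                                        PMEMOBJ_MIN_POOL, 0666);
      if (!pop) { perror("pmemobj_create"); return 1; }

      PMEMoid root = pmemobj_root(pop, sizeof(struct my_root));
      struct my_root *rp = (struct my_root *)pmemobj_direct(root);

      // Stores wrapped in TX_BEGIN/TX_END become atomic with respect to crashes.
      TX_BEGIN(pop) {
        pmemobj_tx_add_range(root, 0, sizeof(struct my_root));
        snprintf(rp->name, sizeof(rp->name), "object-0");
        rp->bytes = 4096;
      } TX_END

      pmemobj_close(pop);
      return 0;
    }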

RE: chooseleaf may cause some unnecessary pg migrations

2015-10-18 Thread Chen, Xiaoxi
Sorry if I didn't state that clearly. Like you did, the performance is measured by the number of PGs remapped between

RE: chooseleaf may cause some unnecessary pg migrations

2015-10-16 Thread Chen, Xiaoxi

RE: chooseleaf may cause some unnecessary pg migrations

2015-10-15 Thread Chen, Xiaoxi
I did some tests using crushtool --test (together with this PR and PR #6004); it doesn't help in a quantitative way. In a demo crush map with 40 OSDs, 10 OSDs per node, I tested with rep = 2 and 4096 PGs; in each run I randomly kick out an OSD (reweight to 0) and compare the PG mapping. If any OSD in
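
For anyone wanting to reproduce that comparison, something along these lines should work (the map file names and the reweighted OSD are placeholders):

    # dump mappings before and after reweighting one OSD to 0, then count the changes
    crushtool -i crushmap.bin --test --num-rep 2 --show-mappings > before.txt
    crushtool -i crushmap.bin --reweight-item osd.7 0 -o crushmap.reweighted
    crushtool -i crushmap.reweighted --test --num-rep 2 --show-mappings > after.txt
    diff before.txt after.txt | grep -c '^>'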

RE: does newstore skip some data ??

2015-10-15 Thread Chen, Xiaoxi
How many OSDs do you have? I wonder if the overlay layer is large enough to keep 160K objects (which is 64K * 32 by default, per OSD)?

RE: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-14 Thread Chen, Xiaoxi
Hi Mark, the Async result at 128K drops quickly after some point; is that because of the testing methodology? The other conclusion, as it looks to me, is that SimpleMessenger + jemalloc is the best practice so far, as it has the same performance as Async while using much less memory?

RE: Backend ObjectStore engine performance bench with FIO

2015-09-29 Thread Chen, Xiaoxi
Hi Casey, would it be better if we created an integration branch at ceph/ceph/wip-fio-objstore to allow more people to try and improve it? It seems James has some patches. -Xiaoxi

RE: Very slow recovery/peering with latest master

2015-09-28 Thread Chen, Xiaoxi
FWIW, blkid works well with both GPT (created by parted) and MSDOS (created by fdisk) partition tables in my environment. But blkid doesn't show information for disks in the external bay (which is connected via a JBOD controller) in my setup. See below: sdb and sdh are SSDs attached to the front panel, but the rest

RE: 2 replications, flapping cannot stop for a very long time

2015-09-14 Thread Chen, Xiaoxi
This is kind of an unsolvable problem: in CAP terms we choose Consistency and Availability, so we have to give up Partition tolerance. There are three networks here: mon <-> osd, osd <-public-> osd, and osd <-cluster-> osd. If some of the networks are reachable but some are not, the flapping will likely

RE: create OSD on IBM's GPFS file system

2015-08-12 Thread Chen, Xiaoxi
That requires some kind of driver in Ceph (see XFSFileStoreBackend.cc/h and BTRFSFileStoreBackend.cc/h); you would need to implement a GPFSFileStoreBackend in Ceph. But why do you want an OSD on top of GPFS?

RE: FileStore should not use syncfs(2)

2015-08-07 Thread Chen, Xiaoxi
FWIW, I often see a performance increase when favoring the inode/dentry cache, but probably with far fewer inodes than the setup you just saw. It sounds like there needs to be some maximum limit on the inode/dentry cache to prevent this kind of behavior but still favor it up until that point.
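
There is no hard cap on the inode/dentry cache today; the closest knob I am aware of is vm.vfs_cache_pressure, which only biases reclaim rather than enforcing a limit:

    # values below 100 make the kernel prefer keeping dentries/inodes over page cache
    sysctl vm.vfs_cache_pressure=50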

RE: newstore performance update

2015-04-29 Thread Chen, Xiaoxi
Hi Mark, really good test :) I have only played a bit on SSD; the parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case. I tried this (https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515) by hacking RocksDB, but the performance

RE: newstore performance update

2015-04-29 Thread Chen, Xiaoxi
Hi Mark, you may have missed this tunable: newstore_sync_wal_apply, which defaults to true but would be better set to false. If sync_wal_apply is true, the WAL apply will be done synchronously (in kv_sync_thread) instead of in the WAL thread. See: if (g_conf->newstore_sync_wal_apply) {
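
In ceph.conf that would look roughly like this (the option name is taken from this thread; newstore tunables were experimental and may have since changed):

    [osd]
        newstore_sync_wal_apply = false   # apply WAL entries in the WAL threads, not kv_sync_thread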

RE: newstore and rocksdb column families

2015-04-22 Thread Chen, Xiaoxi
I think this is great, since when we were trying to optimize the WAL we set the write_buffer and memtable very aggressively, which will cause read amplification. I was worrying about it, but now we can have separate column families: write-optimized for the big stuff (WAL and overlay), trying to
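
A rough sketch of what per-use-case tuning through column families looks like with the RocksDB C++ API (the CF names, sizes, and path are made up for illustration, not the actual NewStore wiring):

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <cassert>
    #include <vector>

    int main() {
      // Write-optimized CF for WAL/overlay blobs, smaller memtables for metadata.
      rocksdb::ColumnFamilyOptions wal_cf;
      wal_cf.write_buffer_size = 256 * 1024 * 1024;
      rocksdb::ColumnFamilyOptions meta_cf;
      meta_cf.write_buffer_size = 32 * 1024 * 1024;

      std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
        {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
        {"wal", wal_cf},
        {"meta", meta_cf},
      };

      rocksdb::DBOptions db_opts;
      db_opts.create_if_missing = true;
      db_opts.create_missing_column_families = true;

      std::vector<rocksdb::ColumnFamilyHandle*> handles;
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(db_opts, "/tmp/cf_demo", cfs, &handles, &db);
      assert(s.ok());

      db->Put(rocksdb::WriteOptions(), handles[1], "wal_key", "big value");      // "wal" CF
      db->Put(rocksdb::WriteOptions(), handles[2], "onode_key", "small value");  // "meta" CF

      for (auto* h : handles) delete h;
      delete db;
      return 0;
    }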

RE: NewStore performance analysis

2015-04-21 Thread Chen, Xiaoxi
work. What do you think? Xiaoxi

RE: NewStore performance analysis

2015-04-21 Thread Chen, Xiaoxi

Re: NewStore performance analysis

2015-04-21 Thread Chen, Xiaoxi
Sage Weil wrote: On Tue, 21 Apr 2015, Chen, Xiaoxi wrote: Haomai is right in theory, but I am not sure whether all users (mon, filestore, kvstore) of the submit_transaction API clearly hold the expectation that their data is not persistent and may be lost on failure. So in RocksDB now

NewStore performance analysis

2015-04-20 Thread Chen, Xiaoxi
[Resent in plain text] Hi, I have played with some RocksDB tunables these days, trying to optimize the performance of NewStore. From the data so far, it seems the write amplification (WA) of RocksDB is not the issue blocking performance, and it also seems not to be the fragment part (aio/dio, etc.). The issue might be

Re: NewStore performance analysis

2015-04-20 Thread Chen, Xiaoxi
Sage Weil wrote: On Mon, 20 Apr 2015, Chen, Xiaoxi wrote: [Resent in plain text] Hi, I have played with some RocksDB tunables these days, trying to optimize the performance of NewStore. From the data so far, it seems the WA of RocksDB is not the issue blocking

RE: Regarding newstore performance

2015-04-17 Thread Chen, Xiaoxi

RE: Regarding newstore performance

2015-04-17 Thread Chen, Xiaoxi
10519532 145653264 7% /var/lib/ceph/osd/ceph-0

RE: Regarding newstore performance

2015-04-17 Thread Chen, Xiaoxi
batch, 29.4 MB user ingest, stall time: 0 us. Interval WAL: 15180 writes, 15179 syncs, 1.00 writes per sync, 0.03 MB written

RE: Regarding newstore performance

2015-04-16 Thread Chen, Xiaoxi

RE: Regarding newstore performance

2015-04-15 Thread Chen, Xiaoxi
Hi Somnath, you could try applying this one :) https://github.com/ceph/ceph/pull/4356 BTW, the previous RocksDB configuration had a bug that set rocksdb_disableDataSync to true by default, which may cause data loss on failure. So please update newstore to the latest or manually set it

RE: Initial newstore vs filestore results

2015-04-07 Thread Chen, Xiaoxi
Hi Mark, really, thanks for the data. Not sure if this PR will be merged soon (https://github.com/ceph/ceph/pull/4266). Some known bugs around: `rados ls` will cause an assert failure (which was fixed by the PR); `rbd list` will also cause an assert failure (because omap_iter hasn't

RE: [ceph-users] keyvaluestore backend metadata overhead

2015-02-01 Thread Chen, Xiaoxi
We can always use a structured database in an unstructured way; I think it's workable in theory, but why choose MySQL? As discussed a while ago, any LSM-structured database design will suffer in performance due to write amplification; is the reason for going to MySQL only about preventing

RE: Memstore issue on v0.91

2015-01-22 Thread Chen, Xiaoxi
This is due to an implicit type cast by the compiler: in st->f_blocks - (used_bytes / st->f_bsize), the result should be negative, but the compiler treats it as an unsigned value. A fix is proposed in https://github.com/ceph/ceph/pull/3451
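
A tiny illustration of that class of bug (statvfs-style fields are unsigned, so a "negative" result silently wraps to a huge number):

    #include <cstdint>
    #include <iostream>

    int main() {
      uint64_t f_blocks = 100;                 // total blocks reported
      uint64_t used     = 150;                 // more than the total
      uint64_t wrapped  = f_blocks - used;     // wraps instead of going negative
      int64_t  fixed    = (int64_t)f_blocks - (int64_t)used;  // compute signed, then clamp
      std::cout << wrapped << " vs " << fixed << std::endl;   // 18446744073709551566 vs -50
      return 0;
    }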

Re: Memstore issue on v0.91

2015-01-22 Thread Chen, Xiaoxi
4GB if you're going to be writing for 60 seconds with 4K objects at 20K IOPS. Thanks, Stephen

RE: straw2 and ln calculation

2015-01-16 Thread Chen, Xiaoxi
2527428193 169 std dev 12.5935 vs 12.5983 (expected). Xiaoxi

RE: Question about Transaction::get_data_alignment

2014-11-25 Thread Chen, Xiaoxi

RE: Question about Transaction::get_data_alignment

2014-11-20 Thread Chen, Xiaoxi
/621c2a7dc2bc9724e9d2106b52aa9eedd2c793e8 Xiaoxi

RE: 10/14/2014 Weekly Ceph Performance Meeting

2014-10-16 Thread Chen, Xiaoxi
We have also seen this before; it seems to be because QEMU uses a single thread for IO. I tried to enable the debug

RE: 10/14/2014 Weekly Ceph Performance Meeting

2014-10-15 Thread Chen, Xiaoxi
We have also seen this before; it seems to be because QEMU uses a single thread for IO. I tried enabling the debug log in librbd and found that the thread ID is always the same. Assuming the backend is powerful enough, how many IOs can be sent out by QEMU == how many IOPS we can get. The upper bound may

RE: Impact of page cache on OSD read performance for SSD

2014-09-24 Thread Chen, Xiaoxi
Have you ever seen a large read_ahead_kb hurt random performance? We usually set it very large (2M), and the random read performance stays steady, even in an all-SSD setup. Maybe with your optimization code for OP_QUEUE, things may be different?

RE: severe librbd performance degradation in Giant

2014-09-18 Thread Chen, Xiaoxi
Same question as Somnath: some of our customers do not feel that comfortable with the cache; they still have some consistency concerns.

RE: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Chen, Xiaoxi
Hi Nicheal, 1. The main purpose of the journal is to provide transaction semantics (prevent partial updates). Peering is not enough for this because Ceph writes all replicas at the same time, so after a crash you have no idea which replica has the right data. For example, say we have 2 replicas,

RE: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Chen, Xiaoxi

RE: [ceph-users] Crushmap ruleset for rack aware PG placement

2014-09-17 Thread Chen, Xiaoxi
The rule has max_size; can we just use that value?

RE: osd cpu usage is bigger than 100%

2014-09-11 Thread Chen, Xiaoxi
1. 12% wa is quite normal; with more disks and more load you could even see 30%+ in the random write case. 2. Our BKM (best known method) is to set osd_op_threads to 20.
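
For reference, a ceph.conf fragment along those lines (20 is the value used in this thread, not a universal recommendation):

    [osd]
        osd op threads = 20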

RE: Cache tiering slow request issue: currently waiting for rw locks

2014-09-09 Thread Chen, Xiaoxi
Can we set cache_min_evict_age to a reasonably larger number (say 5 min? 10 min?) to work around the window? If a request cannot finish within minutes, that indicates there is some issue in the cluster.
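
cache_min_evict_age is a per-pool setting on the cache tier, so the workaround would look roughly like this (the pool name and the 600-second value are placeholders):

    ceph osd pool set hot-cache cache_min_evict_age 600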

Better output of ceph df

2014-09-09 Thread Chen, Xiaoxi
Hi list, I tried to understand the output of `ceph df`; I finally got it, but it's really confusing in the POOLS section, so I am sending out this mail to see if there are any good suggestions to make it better. Here is an example from my cluster: GLOBAL: SIZE AVAIL RAW

Moving CrushWrapper::crush from public member to private

2014-09-04 Thread Chen, Xiaoxi
Hi list, CrushWrapper (https://github.com/ceph/ceph/blob/master/src/crush/CrushWrapper.cc) is a wrapper for crush so that we can use the C++ wrapper instead of playing directly with the crush C API. But currently, in CrushWrapper, the member struct crush_map *crush is a public
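
A sketch of the shape of the proposed change (not the actual Ceph code; the accessor name is invented for illustration):

    struct crush_map;  // opaque C structure from the crush library

    class CrushWrapper {
    private:
      struct crush_map *crush = nullptr;  // was public; direct access is what we want to remove
    public:
      const struct crush_map *get_crush_map() const { return crush; }
      // mutations would go through methods that keep derived state consistent
    };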

Re: Add a converter in OSDMap to split the ruleset into rule

2014-08-15 Thread Chen, Xiaoxi
On 2014-8-15, 17:26, Loic Dachary l...@dachary.org wrote: On 15/08/2014 11:20, Loic Dachary wrote: Hi, I've added a few comments inline at https://github.com/xiaoxichen/ceph/commit/354c09131a64ac1e1a67c71794d1a3bab8334ca8 . Could you explain in pseudo code, in the commit message, what

Re: Resolving the ruleno / ruleset confusion

2014-08-08 Thread Chen, Xiaoxi
I think before we start fixing bugs or try to get rid of the ruleset concept, we should start by defining a reasonable use case: how do we expect users to work with rules and pools? There is no CLI to create/modify a ruleset; even worse, you are not able to get the ruleset id without dumping a rule. Currently

RE: Resolving the ruleno / ruleset confusion

2014-08-08 Thread Chen, Xiaoxi
, the ID for myrule1 is 3. So they simply type `ceph osd pool set mypool1 crush_ruleset 3`. In most cases this works, but actually it is not the right way to do it.
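
The safer sequence is to read the ruleset id out of the rule dump instead of assuming it matches the rule's position, for example:

    ceph osd crush rule dump myrule1      # note the "ruleset" field in the output
    ceph osd pool set mypool1 crush_ruleset <ruleset-id-from-dump>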

Re: Resolving the ruleno / ruleset confusion

2014-08-08 Thread Chen, Xiaoxi
Makes sense; would you mind if I take this job? I will start with the conversion function in the monitor. On 2014-8-9, 0:08, Sage Weil wrote: On Fri, 8 Aug 2014, Chen, Xiaoxi wrote: For my side, I have seen some guys (actually more than 80% of the users I have seen in university

RE: Could we introduce launchpad/gerrit for ceph

2013-08-09 Thread Chen, Xiaoxi
estimated date or plan for when we will introduce this stuff?

Could we introduce launchpad/gerrit for ceph

2013-08-08 Thread Chen, Xiaoxi
Hi, now it's a bit hard for us to track the bugs, review the submissions, and track the blueprints. We do have a bug tracking system, but most of the time it isn't connected with a GitHub submission link. We have email review, pull requests, and also some internal mechanism inside Inktank; we do

RE: Read ahead affect Ceph read performance much

2013-07-30 Thread Chen, Xiaoxi
My $0.02: we have done some readahead tuning tests on the server (ceph-osd) side; the results show that when readahead = 0.5 * object_size (4M by default), we get maximum read throughput. A readahead value larger than this generally will not help, but also will not harm performance. For your case,
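
On the OSD side that tuning is just the block-layer readahead knob; with the default 4M objects, 0.5 * object_size works out to 2048 KB (the device name is an example):

    echo 2048 > /sys/block/sdb/queue/read_ahead_kb
    cat /sys/block/sdb/queue/read_ahead_kb    # verify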

RE: Any concern about Ceph on CentOS

2013-07-17 Thread Chen, Xiaoxi
Hi Xiaoxi, we are really running Ceph on CentOS-6.4 (6 server nodes, 3 client nodes, 160 OSDs). We put a 3.8.13 kernel on top and installed the ceph-0.61.4 cluster with mkcephfs

Any concern about Ceph on CentOS

2013-07-16 Thread Chen, Xiaoxi
Hi list, I would like to ask if anyone really runs Ceph on CentOS/RHEL? Since the kernel version for CentOS/RHEL is much older than that of Ubuntu, I am wondering whether there are any known performance or functionality issues? Thanks to everyone who can share their insight on Ceph+CentOS.

How many Pipes will a Ceph OSD daemon keep?

2013-06-06 Thread Chen, Xiaoxi
Hi, from the code, each pipe (containing a TCP socket) will fork 2 threads, a reader and a writer. We really do observe 100+ threads per OSD daemon with 30 instances of rados bench as clients. But this number seems a bit crazy: if I have a 40-disk node, I will have 40 OSDs,

Re: How many Pipes will a Ceph OSD daemon keep?

2013-06-06 Thread Chen, Xiaoxi
threads. This is still too high for 8-core or 16-core CPUs and will waste a lot of cycles in context switching. Sent from my iPhone. On 2013-6-7, 0:21, Gregory Farnum g...@inktank.com wrote: On Thu, Jun 6, 2013 at 12:25 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote: Hi, from the code, each pipe

RE: [ceph-users] Ceph killed by OS because of OOM under high load

2013-06-04 Thread Chen, Xiaoxi

Ceph killed by OS because of OOM under high load

2013-06-03 Thread Chen, Xiaoxi
Hi, as my previous mail reported some weeks ago, we are suffering from OSD crashes / OSD flapping / system reboots, etc.; all these stability issues really stop us from digging further into Ceph characterization. The good news is that we seem to have found the cause; let me explain our

RE: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-22 Thread Chen, Xiaoxi
Uploaded to /home/cephdrop/xiaoxi_flip_osd/osdlog.tar.gz. Thanks.

RE: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-15 Thread Chen, Xiaoxi
4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0)

Re: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-15 Thread Chen, Xiaoxi
Thanks, but I don't quite understand how to determine whether the monitor is overloaded? And if yes, will starting several monitors help? Sent from my iPhone. On 2013-5-15, 23:07, Jim Schutt jasc...@sandia.gov wrote: On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: How responsive, generally, is the machine under load

Re: pg balancing

2013-05-14 Thread Chen, Xiaoxi
From which release can we get this? Sent from my iPhone. On 2013-5-14, 8:36, Sage Weil s...@inktank.com wrote: Hi Jim - You mentioned the other day your concerns about the uniformity of the PG and data distribution. There are several ways to attack it (including increasing the number of PGs), but one

Re: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-14 Thread Chen, Xiaoxi
% io wait). Enabling jumbo frames **seems** to make things worse (just a feeling, no data to support it). Sent from my iPhone. On 2013-5-14, 23:36, Mark Nelson mark.nel...@inktank.com wrote: On 05/14/2013 10:30 AM, Sage Weil wrote: On Tue, 14 May 2013, Chen, Xiaoxi wrote: Hi, we are suffering from our OSDs flapping

RE: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-14 Thread Chen, Xiaoxi
related to the CPU scheduler? The heartbeat thread (in a busy OSD) fails to get enough CPU cycles.

Re: ceph and efficient access of distributed resources

2013-04-15 Thread Chen, Xiaoxi
, I believe, though I'm not sure how much detail they include there versus in the QAs). On Fri, Apr 12, 2013 at 7:32 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote: We are also discussing this internally, and came up with an idea to work around it (only for the RBD case; haven't thought about the object store

Re: ceph and efficient access of distributed resources

2013-04-12 Thread Chen, Xiaoxi
We are also discussing this internally and came up with an idea to work around it (only for the RBD case; we haven't thought about the object store), but it is not yet tested. If Mark and Greg can provide some feedback, that would be great. We are trying to write a script to generate some pools: for rack A, there is a

RE: prototype incremental rbd backup

2013-03-26 Thread Chen, Xiaoxi
If this feature works, I suppose we can incrementally back up RBD (2 copies, SSD based) to RADOS (3 copies, HDD based) to achieve higher HA at a bit of extra cost :)

RE: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi
Rephrasing it to make it clearer.

RE: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

2013-03-25 Thread Chen, Xiaoxi

Re: github pull requests

2013-03-21 Thread Chen, Xiaoxi
Can we have a review system like review.openstack.com? Sent from my iPhone. On 2013-3-20, 7:10, Guilhem Lettron guil...@lettron.fr wrote: Glad to see this openness! Not everyone is like you. And I hope to see fewer [PATCH] mails on the mailing list, but maybe it's only a dream. Just my two cents.

RE: Increase number of pg in running system

2013-02-05 Thread Chen, Xiaoxi
But can we change the pg_num of a pool when the pool contains data? If yes, how do we do it?

RE: some performance issue

2013-02-04 Thread Chen, Xiaoxi
I doubt your data is correct, even the ext4 data; did you use O_DIRECT when doing the test? It's unusual to get 2X the random write IOPS compared to random read. The CephFS kernel client seems not stable enough; think twice before you use it. From your previous mail I guess you would like to do some caching

RE: Slow request in XFS

2013-02-01 Thread Chen, Xiaoxi
] 1.523 deep-scrub ok 2013-02-01 16:38:12.301511 osd.117 [INF] 1.442 deep-scrub ok 2013-02-01 16:38:12.390220 osd.214 [INF] 2.26c deep-scrub ok

Slow request in XFS

2013-01-31 Thread Chen, Xiaoxi
Hi list, I just rebuilt my Ceph setup with 6 nodes (20 SATA + 4 SSD as journal + 10GbE per node); the software stack is Ubuntu 12.10 + kernel 3.6.3 + XFS + Ceph 0.56.2. Before building up the Ceph cluster, I checked that all my disks can reach 90MB+/s for sequential write and 100MB+/s for sequential

RE: Ceph Production Environment Setup and Configurations?

2013-01-29 Thread Chen, Xiaoxi
[The following views are only on behalf of myself, not related to Intel.] Looking forward to the performance data on Atom. Atom performs badly with Swift, but since Ceph is slightly more efficient than Swift, it should be better. I have some concerns about whether Atom can support such high throughput (you

Assert failed in PG.cc:5235 (we got a bad state machine event)

2013-01-24 Thread Chen, Xiaoxi
Hi list, I got the following log while running a test on top of Ceph. It seems this part of the code is quite fresh (it does not yet appear in 0.56.1); any idea what happened? pgs=714 cs=11 l=0).reader got old message 1 <= 6 0x4552800 osd_map(363..375 src has 1..375) v3, discarding 2013-01-25

RE: handling fs errors

2013-01-22 Thread Chen, Xiaoxi
Is there any known connection with the previous discussions "Hit suicide timeout after adding new osd" or "Ceph unstable on XFS"?

Will multi-monitor speed up pg initializing?

2013-01-22 Thread Chen, Xiaoxi
Hi list, when I start my Ceph cluster for the first time, it takes more than 15 minutes to get all the PGs active+clean. It's fast at first (say 100 PGs/s) but quite slow when only hundreds of PGs are left peering. Is this a common situation? Since there is quite a bit of disk IO and network IO

/etc/init.d/ceph bug for multi-host when using -a option

2013-01-22 Thread Chen, Xiaoxi
Hi list, here is part of the /etc/init.d/ceph script: case $command in start) # Increase max_open_files, if the configuration calls for it. get_conf max_open_files "8192" "max open files" if [ $max_open_files != 0 ]; then # Note: Don't try

RE: Ceph slow request unstable issue

2013-01-16 Thread Chen, Xiaoxi
No, on the OSD node, not the same node. OSD node with 3.2 kernel while client node with 3.6

RE: Slow requests

2013-01-15 Thread Chen, Xiaoxi
Hi, I have also seen the same warning even when using v0.56.1 (on both the kernel RBD and OSD side) when the write stress is high enough (say I have 3 OSDs but 4~5 clients doing dd on top of RBD): 2013-01-15 15:54:05.990052 7ff97dd0c700 0 log [WRN] : slow request 32.545624 seconds old,

Ceph slow request unstable issue

2013-01-15 Thread Chen, Xiaoxi
Hi list, we are suffering from OSD or OS crashes when there is continuous high pressure on the Ceph rack. Basically we are on Ubuntu 12.04 + Ceph 0.56.1, 6 nodes, each node with 20 spindles + 4 SSDs as journal (120 spindles in total). We create a lot of RBD volumes

Adding flashcache for data disk to cache Ceph metadata writes

2013-01-15 Thread Chen, Xiaoxi
Hi Sage, FlashCache works well for this scenario; I created a hybrid disk with 1 SSD partition (sharing the same SSD

RE: Separate metadata disk for OSD

2013-01-14 Thread Chen, Xiaoxi

RE: Separate metadata disk for OSD

2013-01-12 Thread Chen, Xiaoxi
Mark Nelson wrote: Hi Xiaoxi and Zheng, We've played with both of these some internally, but not for a production deployment. Mostly just

Separate metadata disk for OSD

2013-01-11 Thread Chen, Xiaoxi
Hi list, for an RBD write request, Ceph needs to do 3 writes: 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d790 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1 36015~147 2013-01-10

RE: Crushmap Design Question

2013-01-08 Thread Chen, Xiaoxi
Hi, setting the rep size to 3 only makes the data triple-replicated; that means when you fail all OSDs in 2 out of 3 DCs, the data is still accessible. But the monitor is another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and indeed this is why you

RE: RBD fio Performance concerns

2012-11-23 Thread Chen, Xiaoxi
Hi Han, I have a cluster with 8 nodes (each node with 1 SSD as journal and 3 7200 rpm SATA disks as data disks); each OSD consists of 1 SATA disk together with one 30G partition from the SSD, so in total I have 24 OSDs. My test method is to start 24 VMs and 24 RBD volumes, make the VM and

RE: [Discussion] Enhancement for CRUSH rules

2012-11-22 Thread Chen, Xiaoxi
Hi list, I am thinking about the possibility of adding some primitives to CRUSH to meet the following user stories: A. Same host, same rack. To balance between availability and performance, one may like a rule such as: 3 replicas, where replica 1 and replica 2 should be in the same rack while
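
For the "replica 1 and replica 2 in the same rack" story, something close is already expressible with today's rule syntax, though less directly than a dedicated primitive would allow; a sketch (ruleset id and bucket names are placeholders):

    rule two_in_one_rack {
            ruleset 1
            type replicated
            min_size 3
            max_size 3
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }

    # with size 3 the first three candidates are used: two hosts from the first
    # rack and one from the second, i.e. replicas 1 and 2 share a rack.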