RE: High-availability testing of ceph

2012-07-31 Thread Eric_YH_Chen
Hi, Josh:

Thanks for your reply. However, I had asked a question about the replica setting
before:
http://www.spinics.net/lists/ceph-devel/msg07346.html

There I assumed that if the performance of an rbd device is n MB/s under replica=2,
the total I/O throughput on the hard disks is over 3 * n MB/s, because I thought
the total number of copies was 3.

So that assumption was not correct: the total number of copies is only 2,
and the total I/O throughput on disk should be 2 * n MB/s. Right?

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com] 
Sent: Tuesday, July 31, 2012 1:56 PM
To: Eric YH Chen/WYHQ/Wiwynn
Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY 
Chang/WYHQ/Wiwynn
Subject: Re: High-availability testing of ceph

On 07/30/2012 07:46 PM, eric_yh_c...@wiwynn.com wrote:
 Hi, all:

 I am testing high-availability of ceph.

 Environment: two servers, with 12 hard disks on each server. Version: Ceph 0.48
   Kernel: 3.2.0-27

 We created a ceph cluster with 24 osds.
 osd.0 ~ osd.11 are on server1
 osd.12 ~ osd.23 are on server2

 The crush rule is the default rule:
 rule rbd {
  ruleset 2
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
 }

 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 
 1536 pgp_num 1536 last_change 1172 owner 0

 Test case 1:
 1. Create an rbd device and read/write to it
 2. Randomly turn off one osd on server1 (service ceph stop osd.0)
 3. Check the read/write of the rbd device

 Test case 2:
 1. Create an rbd device and read/write to it
 2. Randomly turn off one osd on server1 (service ceph stop osd.0)
 3. Randomly turn off one osd on server2 (service ceph stop osd.12)
 4. Check the read/write of the rbd device

 In test case 1 we can access the rbd device as normal, but in test case 2
 it hangs with no response.
 Is this the expected behavior?

 I imagined that we could turn off any two osds when replication is set to 2:
 besides the primary copy there would be two other copies on two different
 osds, so even with two osds turned off the data could still be found on a
 third osd.
 Did I misunderstand something? Thanks!

rep size is the total number of copies, so stopping two osds with rep size 2 
may cause you to lose access to some objects.
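
For reference, the replication level can be checked and raised per pool on an
argonaut-era cluster; a minimal sketch, assuming the pool is named rbd as above:

# show the current replication level of each pool
ceph osd dump | grep 'rep size'

# raise the rbd pool to three copies, so any two osds can be stopped without
# losing access to objects (this triggers re-replication of existing data)
ceph osd pool set rbd size 3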

Josh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


The cluster does not notice that some osds have disappeared

2012-07-31 Thread Eric_YH_Chen
Dear All:

My environment: two servers, with 12 hard disks on each server.
 Version: Ceph 0.48, Kernel: 3.2.0-27

We created a ceph cluster with 24 osds and 3 monitors.
osd.0 ~ osd.11 are on server1
osd.12 ~ osd.23 are on server2
mon.0 is on server1
mon.1 is on server2
mon.2 is on server3, which has no osd

When I turn off the network of server1, we expect server2 to notice that the 12
osds on server1 have disappeared.
However, when I type ceph -s, it still shows 24 osds up.

From the logs of osd.0 and osd.11 we can see heartbeat_check messages on
server1, but nothing similar on server2.
What happened to server2? Can we restart the heartbeat service? Thanks!

root@wistor-002:~# ceph -s
   health HEALTH_WARN 1 mons down, quorum 1,2 008,009
   monmap e1: 3 mons at 
{006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0},
 election epoch 522, quorum 1,2 008,009
   osdmap e1388: 24 osds: 24 up, 24 in
pgmap v288663: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 
GB / 22320 GB avail
   mdsmap e1: 0/0/1 up

Log of ceph -w (we turned off server1 around 15:20, which caused the new monitor
election)
2012-07-31 15:21:25.966572 mon.0 [INF] pgmap v288658: 4608 pgs: 4608 
active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:20:10.400566 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.030473 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.079772 mon.2 [INF] mon.009 calling new monitor election
2012-07-31 15:21:46.102587 mon.1 [INF] mon.008@1 won leader election with 
quorum 1,2
2012-07-31 15:21:46.273253 mon.1 [INF] pgmap v288659: 4608 pgs: 4608 
active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:21:46.273379 mon.1 [INF] mdsmap e1: 0/0/1 up
2012-07-31 15:21:46.273495 mon.1 [INF] osdmap e1388: 24 osds: 24 up, 24 in
2012-07-31 15:21:46.273814 mon.1 [INF] monmap e1: 3 mons at 
{006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}
2012-07-31 15:21:46.587679 mon.1 [INF] pgmap v288660: 4608 pgs: 4608 
active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:01.245813 mon.1 [INF] pgmap v288661: 4608 pgs: 4608 
active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:33.970838 mon.1 [INF] pgmap v288662: 4608 pgs: 4608 
active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail

Log of osd.0 (on server 1)
2012-07-31 15:20:25.309264 7fdc06470700  0 -- 192.168.200.81:6825/12162 >> 
192.168.200.82:6840/8772 pipe(0x4dbea00 sd=52 pgs=0 cs=0 l=0).accept 
connect_seq 0 vs existing 0 state 1
2012-07-31 15:20:25.310887 7fdc1c551700  0 -- 192.168.200.81:6825/12162 >> 
192.168.200.82:6833/15570 pipe(0x4dbec80 sd=51 pgs=0 cs=0 l=0).accept 
connect_seq 0 vs existing 0 state 1
2012-07-31 15:21:46.861458 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.12 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861496 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.13 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861506 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.14 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861514 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.15 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861522 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.16 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861530 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.17 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861538 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.18 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861546 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.19 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861556 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.20 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861576 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.21 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861609 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.22 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861618 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply 
from osd.23 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)

Log of osd.12 (on server 2)
2012-07-31 15:20:31.475815 7f9eac5ba700  0 osd.12 1387 pg[2.16f( v 1356'10485 
(465'9480,1356'10485] n=42 ec=1 les/c 1387/1387 1383/1383/1383) [12,0] r=0 
lpr=1383 mlcod 0'0 active+clean] watch: oi.user_version=45
2012-07-31 
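
A few commands that may help narrow this down while the monitors still report
all 24 osds up (a sketch; the osd id below is only an example):

ceph osd tree     # which osds the surviving monitors currently consider up/in
ceph health       # current warnings beyond the lost monitor
ceph osd down 0   # manually mark an unreachable osd down while investigating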

How to integrate ceph with opendedup.

2012-07-31 Thread ramu
Hi all,

I want to integrate ceph with opendedup (sdfs) using java-rados.
Please help me with integrating ceph with opendedup.

Thanks,
Ramu.

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to integrate ceph with opendedup.

2012-07-31 Thread Wido den Hollander



On 07/31/2012 11:18 AM, ramu wrote:

Hi all,

I want to integrate ceph with opendedup (sdfs) using java-rados.
Please help me with integrating ceph with opendedup.


What is the exact use case for this? I get the point of de-duplication,
but why run a filesystem on top of RADOS instead of using CephFS?


That doesn't seem like a trivial integration.

Wido



Thanks,
Ramu.

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


cannot start up one of the osds

2012-07-31 Thread Eric_YH_Chen
Hi, all:

My environment: two servers, with 12 hard disks on each server.
 Version: Ceph 0.48, Kernel: 3.2.0-27

We created a ceph cluster with 24 osds and 3 monitors.
osd.0 ~ osd.11 are on server1
osd.12 ~ osd.23 are on server2
mon.0 is on server1
mon.1 is on server2
mon.2 is on server3, which has no osd

root@ubuntu:~$ ceph -s
   health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs 
recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 4808/138644 
degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are down
   monmap e1: 3 mons at 
{006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0},
 election epoch 564, quorum 0,1,2 006,008,009
   osdmap e1911: 24 osds: 23 up, 24 in
pgmap v292031: 4608 pgs: 4251 active+clean, 85 active+recovering+degraded, 
37 active+remapped, 58 down+peering, 142 active+degraded, 35 
down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 21390 GB avail; 
4808/138644 degraded (3.468%); 202/69322 unfound (0.291%)
   mdsmap e1: 0/0/1 up

I find that one of the osds cannot start up anymore. Before that, I was testing
HA of the Ceph cluster:

Step 1:  shut down server1, wait 5 min
Step 2:  boot up server1, wait 5 min until ceph reaches a healthy status
Step 3:  shut down server2, wait 5 min
Step 4:  boot up server2, wait 5 min until ceph reaches a healthy status
I repeated Step 1 ~ Step 4 several times, then hit this problem.


Log of ceph-osd.22.log
2012-07-31 17:18:15.120678 7f9375300780  0 filestore(/srv/disk10/data) mount 
found snaps 
2012-07-31 17:18:15.122081 7f9375300780  0 filestore(/srv/disk10/data) mount: 
enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.128544 7f9375300780  1 journal _open /srv/disk10/journal fd 
23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.257302 7f9375300780  1 journal _open /srv/disk10/journal fd 
23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.273163 7f9375300780  1 journal close /srv/disk10/journal
2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) limited 
size xattrs -- filestore_xattr_use_omap enabled
2012-07-31 17:18:15.275169 7f9375300780  0 filestore(/srv/disk10/data) mount 
FIEMAP ioctl is supported and appears to work
2012-07-31 17:18:15.275180 7f9375300780  0 filestore(/srv/disk10/data) mount 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2012-07-31 17:18:15.275312 7f9375300780  0 filestore(/srv/disk10/data) mount 
did NOT detect btrfs
2012-07-31 17:18:15.276060 7f9375300780  0 filestore(/srv/disk10/data) mount 
syncfs(2) syscall fully supported (by glib and kernel)
2012-07-31 17:18:15.276154 7f9375300780  0 filestore(/srv/disk10/data) mount 
found snaps 
2012-07-31 17:18:15.277031 7f9375300780  0 filestore(/srv/disk10/data) mount: 
enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.280906 7f9375300780  1 journal _open /srv/disk10/journal fd 
32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.307761 7f9375300780  1 journal _open /srv/disk10/journal fd 
32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:19.466921 7f9360a97700  0 -- 192.168.200.82:6830/18744 >> 
192.168.200.83:0/3485583732 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer 
addr is really 192.168.200.83:0/3485583732 (socket is 192.168.200.83:45653/0)
2012-07-31 17:18:19.671681 7f9363a9d700 -1 os/DBObjectMap.cc: In function 
'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 
7f9363a9d700 time 2012-07-31 17:18:19.670082
os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())

ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-osd() [0x6a3123]
 2: (ReplicatedPG::send_push(int, ObjectRecoveryInfo, ObjectRecoveryProgress, ObjectRecoveryProgress*)+0x684) [0x53f314]
 3: (ReplicatedPG::push_start(ReplicatedPG::ObjectContext*, hobject_t const&, int, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&)+0x333) [0x54c873]
 4: (ReplicatedPG::push_to_replica(ReplicatedPG::ObjectContext*, hobject_t const&, int)+0x343) [0x54cdc3]
 5: (ReplicatedPG::recover_object_replicas(hobject_t const&, eversion_t)+0x35f) [0x5527bf]
 6: (ReplicatedPG::wait_for_degraded_object(hobject_t const&, std::tr1::shared_ptr<OpRequest>)+0x17b) [0x55406b]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x9de) [0x56305e]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x199) [0x5fda89]
 9: (OSD::dequeue_op(PG*)+0x238) [0x5bf668]
 10: (ThreadPool::worker()+0x605) [0x796d55]
 11: (ThreadPool::WorkThread::entry()+0xd) [0x5d5d0d]
 12: (()+0x7e9a) [0x7f9374794e9a]
 13: (clone()+0x6d) [0x7f93734344bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

--- begin dump of recent events ---
   -21 2012-07-31 
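
Before repeating the test, it may help to restart the failing osd by hand with
more verbose logging; a sketch, using the osd id and default config path from
the report above:

ceph-osd -i 22 -c /etc/ceph/ceph.conf --debug-osd 20 --debug-filestore 20 --debug-ms 1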

Re: Ceph Benchmark HowTo

2012-07-31 Thread Mehdi Abaakouk
Hi all,

I have updated the how-to here:
http://ceph.com/wiki/Benchmark

And published the results of my latest tests:
http://ceph.com/wiki/Benchmark#First_Example

All results are good; my benchmark is clearly limited by my network
connection (~110 MB/s).

The exception is the rest-api bench, where the value seems really low.

I have configured radosgw following this:
http://ceph.com/docs/master/radosgw/config/
I cleared the disk caches on all servers before the bench,
and ran rest-bench for 900 seconds with the default values.

Is my rest-bench result normal? Have I missed something?

Don't hesitate to ask if you need more information on my setup.

I also have another question: how is the standard deviation calculated
for rados bench and rest-bench? Is it computed from the values printed
each second by the benchmark client?
If so, when latency is too high the reported bandwidth is sometimes zero,
so does the calculated StdDev for bandwidth still mean anything?
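
One way to check is to recompute the statistics from the per-second bandwidth
column of a saved run; a sketch, assuming the per-second lines were saved to
bench.log and that the current bandwidth is in column 7 (adjust the column for
your output format):

awk '{ n++; s += $7; ss += $7*$7 }
     END { m = s/n; printf "mean=%.2f MB/s stddev=%.2f MB/s\n", m, sqrt(ss/n - m*m) }' bench.log

If the zero-bandwidth seconds are included in the samples, they pull the mean
down and push the standard deviation up, which matches the suspicion above.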


Cheers,
-- 
Mehdi Abaakouk for eNovance
mail: sil...@sileht.net
irc: sileht


signature.asc
Description: Digital signature


About teuthology

2012-07-31 Thread Mehdi Abaakouk
Hi,

I have taken a look at teuthology. The automation of all these tests
is good, but is there any way to run them against an already-installed
ceph cluster?

Thanks in advance.

Cheers,

-- 
Mehdi Abaakouk for eNovance
mail: sil...@sileht.net
irc: sileht


signature.asc
Description: Digital signature


Re: About teuthology

2012-07-31 Thread Mark Nelson

On 7/31/12 8:59 AM, Mehdi Abaakouk wrote:

Hi,

I have taken a look at teuthology. The automation of all these tests
is good, but is there any way to run them against an already-installed
ceph cluster?

Thanks in advance.

Cheers,



Hi Mehdi,

I think a number of the test related tasks should run fine without 
strictly requiring the ceph task.  You may have to change binary 
locations for things like rados, but those should be pretty minor.


Best way to find out is to give it a try!

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


another performance-related thread

2012-07-31 Thread Andrey Korolyov
Hi,

I`ve finally managed to run rbd-related tests on relatively powerful
machines, and here is what I got:

1) Reads on an almost fairly balanced cluster (eight nodes) did very well,
utilizing almost all disk and network bandwidth (dual gbit 802.3ad nics, sata
disks behind an lsi sas 2108 with wt cache gave me ~1.6 Gbyte/s on linear
and sequential reads, which is close to the overall disk throughput)
2) Writes are much worse, both with rados bench and with the fio test when I
ran fio simultaneously on 120 vms - at best, overall performance is
about 400 Mbyte/s, using rados bench -t 12 on three host nodes

fio config:

rw=(randread|randwrite|seqread|seqwrite)
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
name=random-ead-direct
bs=1M
loops=12

for 120 vm set, Mbyte/s
linear reads:
MEAN: 14156
STDEV: 612.596
random reads:
MEAN: 14128
STDEV: 911.789
linear writes:
MEAN: 2956
STDEV: 283.165
random writes:
MEAN: 2986
STDEV: 361.311

Each node holds 15 vms, and with a 64M rbd cache all three possible states
- wb, wt and no-cache - give almost the same numbers in the tests. I wonder
if it is possible to raise the write/read ratio somehow. It seems that the osd
underutilizes itself; e.g. I am not able to get a single-threaded rbd
write above 35 Mb/s. Adding a second osd on the same disk only raises
iowait time, not the benchmark results.
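
For anyone wanting to reproduce one of the runs above by hand inside a guest,
the job can be written to a file and run directly; a sketch of the randwrite
case, reusing the same parameters:

cat > rbd-randwrite.fio <<'EOF'
[rbd-randwrite]
rw=randwrite
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
bs=1M
loops=12
EOF
fio rbd-randwrite.fio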
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXTERNAL] Re: avoiding false detection of down OSDs

2012-07-31 Thread Jim Schutt

On 07/30/2012 06:24 PM, Gregory Farnum wrote:

On Mon, Jul 30, 2012 at 3:47 PM, Jim Schuttjasc...@sandia.gov  wrote:


Above you mentioned that you are seeing these issues as you scaled
out a storage cluster, but none of the solutions you mentioned
address scaling.  Let's assume your preferred solution handles
this issue perfectly on the biggest cluster anyone has built
today.  What do you predict will happen when that cluster size
is scaled up by a factor of 2, or 10, or 100?

Sage should probably describe in more depth what we've seen since he's
looked at it the most, but I can expand on it a little. In argonaut
and earlier version of Ceph, processing a new OSDMap for an OSD is
very expensive. I don't remember the precise numbers we'd whittled it
down to but it required at least one disk sync as well as pausing all
request processing for a while. If you combined this expense with a
large number of large maps (if, perhaps, one quarter of your 800-OSD
system had been down but not out for 6+ hours), you could cause memory
thrashing on OSDs as they came up, which could force them to become
very, very, veeery slow. In the next version of Ceph, map processing
is much less expensive (no syncs or full-system pauses required),
which will prevent request backup. And there are a huge number of ways
to reduce the memory utilization of maps, some of which can be
backported to argonaut and some of which can't.
Now, if we can't prevent our internal processes from running an OSD
out of memory, we'll have failed. But we don't think this is an
intractable problem; in fact we have reason to hope we've cleared it
up now that we've seen the problem — although we don't think it's
something that we can absolutely prevent on argonaut (too much code
churn).
So we're looking for something that we can apply to argonaut as a
band-aid, but that we can also keep around in case forces external to
Ceph start causing similar cluster-scale resource shortages beyond our
control (runaway co-located process eats up all the memory on lots of
boxes, switch fails and bandwidth gets cut in half, etc). If something
happens that means Ceph can only supply half as much throughput as it
was previously, then Ceph should provide that much throughput; right
now if that kind of incident occurs then Ceph won't provide any
throughput because it'll all be eaten by spurious recovery work.


Ah, thanks for the extra context.  I hadn't fully appreciated
the proposal was primarily a mitigation for argonaut, and
otherwise as a fail-safe mechanism.



As I mentioned above, I'm concerned this is addressing
symptoms, rather than root causes.  I'm concerned the
root cause has something to do with how the map processing
work scales with number of OSDs/PGs, and that this will
limit the maximum size of a Ceph storage cluster.

I think I discussed this above enough already? :)


Yep, thanks.




But, if you really just want to not mark down an OSD that is
laggy, I know this will sound simplistic, but I keep thinking
that the OSD knows for itself if it's up, even when the
heartbeat mechanism is backed up.  Couldn't there be some way
to ask an OSD suspected of being down whether it is or not,
separate from the heartbeat mechanism?  I mean, if you're
considering having the monitor ignore OSD down reports for a
while based on some estimate of past behavior, wouldn't it be
better for the monitor to just ask such an OSD, hey, are you
still there?  If it gets an immediate I'm busy, come back later,
extend the grace period; otherwise, mark the OSD down.

Hmm. The concern is that if an OSD is stuck on disk swapping then it's
going to be just as stuck for the monitors as the OSDs — they're all
using the same network in the basic case, etc. We want to be able to
make that guess before the OSD is able to answer such questions.
But I'll think on if we could try something else similar.


OK - thanks.

Also, FWIW I've been running my Ceph servers with no swap,
and I've recently doubled the size of my storage cluster.
Is it possible to have map processing do a little memory
accounting and log it, or to provide some way to learn
that map processing is chewing up significant amounts of
memory?  Or maybe there's already a way to learn this that
I need to learn about?  I sometimes run into something that
shares some characteristics with what you describe, but is
primarily triggered by high client write load.  I'd like
to be able to confirm or deny it's the same basic issue
you've described.

Thanks -- Jim


-Greg





--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: another performance-related thread

2012-07-31 Thread Mark Nelson

Hi Andrey!

On 07/31/2012 10:03 AM, Andrey Korolyov wrote:

Hi,

I`ve finally managed to run rbd-related test on relatively powerful
machines and what I have got:

1) Reads on almost fair balanced cluster(eight nodes) did very well,
utilizing almost all disk and bandwidth (dual gbit 802.3ad nics, sata
disks beyond lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear
and sequential reads, which is close to overall disk throughput)


Does your 2108 have the RAID or JBOD firmware?  I'm guessing the RAID 
firmware given that you are able to change the caching behavior?  How do 
you have the arrays setup for the OSDs?



2) Writes get much worse, both on rados bench and on fio test when I
ran fio simularly on 120 vms - at it best, overall performance is
about 400Mbyte/s, using rados bench -t 12 on three host nodes

fio config:

rw=(randread|randwrite|seqread|seqwrite)
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
name=random-ead-direct
bs=1M
loops=12

for 120 vm set, Mbyte/s
linear reads:
MEAN: 14156
STDEV: 612.596
random reads:
MEAN: 14128
STDEV: 911.789
linear writes:
MEAN: 2956
STDEV: 283.165
random writes:
MEAN: 2986
STDEV: 361.311

each node holds 15 vms and for 64M rbd cache all possible three states
- wb, wt and no-cache has almost same numbers at the tests. I wonder
if it possible to raise write/read ratio somehow. Seems that osd
underutilize itself, e.g. I am not able to get single-threaded rbd
write to get above 35Mb/s. Adding second osd on same disk only raising
iowait time, but not benchmark results.


I've seen high IO wait times (especially with small writes) via rados 
bench as well.  It's something we are actively investigating.  Part of 
the issue with rados bench is that every single request is getting 
written to a separate file, so especially at small IO sizes there is a 
lot of underlying filesystem metadata traffic.  For us, this is 
happening on 9260 controllers with RAID firmware.  I think we may see 
some improvement by switching to 2X08 cards with the JBOD (i.e. IT) 
firmware, but we haven't confirmed it yet.


We actually just purchased a variety of alternative RAID and SAS 
controllers to test with to see how universal this problem is. 
Theoretically RBD shouldn't suffer from this as badly as small writes to 
the same file should get buffered.  The same is true for CephFS when 
doing buffered IO to a single file due to the Linux buffer cache.  Small 
writes to many files will likely suffer in the same way that rados bench 
does though.



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Mark Nelson
Performance Engineer
Inktank
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: another performance-related thread

2012-07-31 Thread Josh Durgin

On 07/31/2012 08:03 AM, Andrey Korolyov wrote:

Hi,

I`ve finally managed to run rbd-related test on relatively powerful
machines and what I have got:

1) Reads on almost fair balanced cluster(eight nodes) did very well,
utilizing almost all disk and bandwidth (dual gbit 802.3ad nics, sata
disks beyond lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear
and sequential reads, which is close to overall disk throughput)
2) Writes get much worse, both on rados bench and on fio test when I
ran fio simularly on 120 vms - at it best, overall performance is
about 400Mbyte/s, using rados bench -t 12 on three host nodes


How are your osd journals configured? What's your ceph.conf for the
osds?


fio config:

rw=(randread|randwrite|seqread|seqwrite)
size=256m
direct=1
directory=/test
numjobs=1
iodepth=12
group_reporting
name=random-ead-direct
bs=1M
loops=12

for 120 vm set, Mbyte/s
linear reads:
MEAN: 14156
STDEV: 612.596
random reads:
MEAN: 14128
STDEV: 911.789
linear writes:
MEAN: 2956
STDEV: 283.165
random writes:
MEAN: 2986
STDEV: 361.311

each node holds 15 vms and for 64M rbd cache all possible three states
- wb, wt and no-cache has almost same numbers at the tests. I wonder
if it possible to raise write/read ratio somehow. Seems that osd
underutilize itself, e.g. I am not able to get single-threaded rbd
write to get above 35Mb/s. Adding second osd on same disk only raising
iowait time, but not benchmark results.


Are these write tests using direct I/O? That will bypass the cache for
writes, which would explain the similar numbers with different cache
modes.
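
For reference, the client-side settings behind a 64M rbd cache look roughly
like this in ceph.conf (a sketch with argonaut-era option names; write-through
is approximated by allowing no dirty data):

[client]
    rbd cache = true
    rbd cache size = 67108864     # 64 MB
    rbd cache max dirty = 0       # 0 = write-through; raise it for write-back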
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: another performance-related thread

2012-07-31 Thread Andrey Korolyov
On 07/31/2012 07:17 PM, Mark Nelson wrote:
 Hi Andrey!

 On 07/31/2012 10:03 AM, Andrey Korolyov wrote:
 Hi,

 I`ve finally managed to run rbd-related test on relatively powerful
 machines and what I have got:

 1) Reads on almost fair balanced cluster(eight nodes) did very well,
 utilizing almost all disk and bandwidth (dual gbit 802.3ad nics, sata
 disks beyond lsi sas 2108 with wt cache gave me ~1.6Gbyte/s on linear
 and sequential reads, which is close to overall disk throughput)

 Does your 2108 have the RAID or JBOD firmware?  I'm guessing the RAID
 firmware given that you are able to change the caching behavior?  How
 do you have the arrays setup for the OSDs?

Exactly, I am able to change the cache behavior on the fly using the 'famous'
megacli binary. Each node contains three disks, each of them configured
as a single-disk raid0 - two 7200 rpm server sata drives and an intel 313 for
the journal. On the satas I am using xfs with default mount options, and on the
ssd I`ve put ext4 with the journal disabled and of course with discard/noatime.
This 2108 comes with SuperMicro firmware 2.120.243-1482 - I am guessing it is
the RAID variant, and I haven`t tried to reflash it yet. For the tests, I have
forced the write-through cache on - this should be very good at aggregating
small writes. Before using this config, I had configured two disks as RAID0
and got slightly worse results on the write bench. Thanks for suggesting to
try the JBOD firmware; I`ll run tests with it this week and
post the results.
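
For comparison runs, the controller cache policy can be checked and flipped
from the command line; a sketch (the binary may be installed as MegaCli or
MegaCli64 depending on the package):

MegaCli -LDGetProp -Cache -LALL -aALL   # show the current cache policy of each logical drive
MegaCli -LDSetProp WT -LALL -aALL       # force write-through on all logical drives
MegaCli -LDSetProp WB -LALL -aALL       # switch back to write-back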
 2) Writes get much worse, both on rados bench and on fio test when I
 ran fio simularly on 120 vms - at it best, overall performance is
 about 400Mbyte/s, using rados bench -t 12 on three host nodes

 fio config:

 rw=(randread|randwrite|seqread|seqwrite)
 size=256m
 direct=1
 directory=/test
 numjobs=1
 iodepth=12
 group_reporting
 name=random-ead-direct
 bs=1M
 loops=12

 for 120 vm set, Mbyte/s
 linear reads:
 MEAN: 14156
 STDEV: 612.596
 random reads:
 MEAN: 14128
 STDEV: 911.789
 linear writes:
 MEAN: 2956
 STDEV: 283.165
 random writes:
 MEAN: 2986
 STDEV: 361.311

 each node holds 15 vms and for 64M rbd cache all possible three states
 - wb, wt and no-cache has almost same numbers at the tests. I wonder
 if it possible to raise write/read ratio somehow. Seems that osd
 underutilize itself, e.g. I am not able to get single-threaded rbd
 write to get above 35Mb/s. Adding second osd on same disk only raising
 iowait time, but not benchmark results.

 I've seen high IO wait times (especially with small writes) via rados
 bench as well.  It's something we are actively investigating.  Part of
 the issue with rados bench is that every single request is getting
 written to a seperate file, so especially at small IO sizes there is a
 lot of underlying filesystem metadata traffic.  For us, this is
 happening on 9260 controllers with RAID firmware.  I think we may see
 some improvement by switching to 2X08 cards with the JBOD (ie IT)
 firmware, but we haven't confirmed it yet.

For 24 HT cores I have seen 2 percent iowait at most (during writes), so
almost surely there is no IO bottleneck at all (except when breaking the
'one osd per physical disk' rule, when iowait rises up to 50 percent on the
entire system). Rados bench is not a universal measurement tool,
though - using the VMs' IO requests instead of manipulating rados objects
directly gives an almost fair result, in my opinion.


 We actually just purchased a variety of alternative RAID and SAS
 controllers to test with to see how universal this problem is.
 Theoretically RBD shouldn't suffer from this as badly as small writes
 to the same file should get buffered.  The same is true for CephFS
 when doing buffered IO to a single file due to the Linux buffer
 cache.  Small writes to many files will likely suffer in the same way
 that rados bench does though.

 -- 
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About teuthology

2012-07-31 Thread Mehdi Abaakouk
On Tue, Jul 31, 2012 at 09:27:54AM -0500, Mark Nelson wrote:
 On 7/31/12 8:59 AM, Mehdi Abaakouk wrote:
 Hi Mehdi,
 
 I think a number of the test related tasks should run fine without
 strictly requiring the ceph task.  You may have to change binary
 locations for things like rados, but those should be pretty minor.
 
 Best way to find out is to give it a try!

Thanks for your quick answer :)

I have already tried, but the code refers heavily to files in
/tmp/cephtest/; it seems to me that changing the path of the
binaries isn't enough, since some of them are built by the ceph task.

Perhaps a quicker (a bit dirty) way is to create a new task 'cephdist',
that prepares the required files in /tmp/cephtest.
ie:
- link dist binary to /tmp/cephtest/binary/usr/local/bin/...
- link /etc/ceph/ceph.conf to /tmp/cephtest/ceph.conf
- ship cephtest tool in /tmp/cephtest (like ceph task)
- make dummy script for coverage (because distributed ceph doesn't seem
  to have ceph-coverage)

What do you think about it ?

Cheers

-- 
Mehdi Abaakouk for eNovance
mail: sil...@sileht.net
irc: sileht


signature.asc
Description: Digital signature


Re: [PATCH v3] rbd: fix the memory leak of bio_chain_clone

2012-07-31 Thread Guangliang Zhao
On Mon, Jul 30, 2012 at 02:54:44PM -0700, Yehuda Sadeh wrote:
 On Thu, Jul 26, 2012 at 11:20 PM, Guangliang Zhao gz...@suse.com wrote:
  The bio_pair allocated in bio_chain_clone would not be freed;
  this causes a memory leak. It can actually only be freed
  after three releases, because the reference count of a bio_pair
  is initialized to 3 by bio_split, and bio_pair_release only
  drops the reference count.
 
  The function bio_pair_release must therefore be called three times to
  release the bio_pair, and the callback functions of the bios on the
  requests are invoked on the last release in bio_pair_release;
  however, these functions are also called in rbd_req_cb. In
  other words, they get called twice, and that may cause serious
  consequences.
 
  This patch clones the bio chain from the original directly and does not
  use bio_split (so no bio_pair). The new bio chain can be released
  whenever we no longer need it.
 
  Signed-off-by: Guangliang Zhao gz...@suse.com
  ---
   drivers/block/rbd.c |   73 
  +-
   1 files changed, 31 insertions(+), 42 deletions(-)
 
  diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
  index 013c7a5..356657d 100644
  --- a/drivers/block/rbd.c
  +++ b/drivers/block/rbd.c
  @@ -712,51 +712,46 @@ static void zero_bio_chain(struct bio *chain, int 
  start_ofs)
  }
   }
 
  -/*
  - * bio_chain_clone - clone a chain of bios up to a certain length.
  - * might return a bio_pair that will need to be released.
  +/**
  + *  bio_chain_clone - clone a chain of bios up to a certain length.
  + *  @old: bio to clone
  + *  @offset: start point for bio clone
  + *  @len: length of bio chain
  + *  @gfp_mask: allocation priority
  + *
  + *  RETURNS:
  + *  Pointer to new bio chain on success, NULL on failure.
*/
  -static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
  -  struct bio_pair **bp,
  +static struct bio *bio_chain_clone(struct bio **old, int *offset,
 int len, gfp_t gfpmask)
   {
  struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail = NULL;
  int total = 0;
 
  -   if (*bp) {
  -   bio_pair_release(*bp);
  -   *bp = NULL;
  -   }
  -
   while (old_chain && (total < len)) {
   +   int need = len - total;
   +
   tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
  if (!tmp)
  goto err_out;
 
   -   if (total + old_chain->bi_size > len) {
  -   struct bio_pair *bp;
  -
  -   /*
  -* this split can only happen with a single paged 
  bio,
  -* split_bio will BUG_ON if this is not the case
  -*/
   -   dout("bio_chain_clone split! total=%d remaining=%d "
   -"bi_size=%d\n",
   -(int)total, (int)len-total,
   -(int)old_chain->bi_size);
  -
  -   /* split the bio. We'll release it either in the 
  next
  -  call, or it will have to be released outside */
  -   bp = bio_split(old_chain, (len - total) / 
  SECTOR_SIZE);
  -   if (!bp)
  -   goto err_out;
  -
   -   __bio_clone(tmp, bp->bio1);
   -
   -   *next = bp->bio2;
   +   __bio_clone(tmp, old_chain);
   +   tmp->bi_sector += *offset >> SECTOR_SHIFT;
   +   tmp->bi_io_vec->bv_offset += *offset >> SECTOR_SHIFT;
  +   /*
  +* The bios span across multiple osd objects must be
  +* single paged, rbd_merge_bvec would guarantee it.
  +* So we needn't worry about other things.
  +*/
   +   if (tmp->bi_size - *offset > need) {
   +   tmp->bi_size = need;
   +   tmp->bi_io_vec->bv_len = need;
   +   *offset += need;
   } else {
   -   __bio_clone(tmp, old_chain);
   -   *next = old_chain->bi_next;
   +   old_chain = old_chain->bi_next;
   +   tmp->bi_size -= *offset;
   +   tmp->bi_io_vec->bv_len -= *offset;
   +   *offset = 0;
  }
 
 There's still some inherent issue here, which is that it assumes
 tmp->bi_io_vec points to the only iovec for this bio. I don't think
 that is necessarily true; there may be multiple iovecs, 

Yes, the bios on the requests may have one or more pages, but the ones that span
across multiple osds *must* be single-page bios because of rbd_merge_bvec.

With rbd_merge_bvec, a new bvec will not be permitted to merge if it would make
the bio cross the osd boundary, except the 

Re: cannot start up one of the osds

2012-07-31 Thread Samuel Just
This crash happens on each startup?
-Sam

On Tue, Jul 31, 2012 at 2:32 AM,  eric_yh_c...@wiwynn.com wrote:
 Hi, all:

 My Environment:  two servers, and 12 hard-disk on each server.
  Version: Ceph 0.48, Kernel: 3.2.0-27

 We create a ceph cluster with 24 osd, 3 monitors
 Osd.0 ~ osd.11 is on server1
 Osd.12 ~ osd.23 is on server2
 Mon.0 is on server1
 Mon.1 is on server2
 Mon.2 is on server3 which has no osd

 root@ubuntu:~$ ceph -s
health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs 
 recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 
 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are 
 down
monmap e1: 3 mons at 
 {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0},
  election epoch 564, quorum 0,1,2 006,008,009
osdmap e1911: 24 osds: 23 up, 24 in
 pgmap v292031: 4608 pgs: 4251 active+clean, 85 
 active+recovering+degraded, 37 active+remapped, 58 down+peering, 142 
 active+degraded, 35 down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 
 21390 GB avail; 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%)
mdsmap e1: 0/0/1 up

 I find one of the osd cannot startup anymore. Before that, I am testing HA of 
 Ceph cluster.

 Step1:  shutdown server1, wait 5 min
 Step2:  bootup server1, wait 5 min until ceph enter health status
 Step3:  shutdown server2, wait 5 min
 Step4:  bootup server2, wait 5 min until ceph enter health status
 Repeat Step1~ Step4 several times, then I met this problem.


 Log of ceph-osd.22.log
 2012-07-31 17:18:15.120678 7f9375300780  0 filestore(/srv/disk10/data) mount 
 found snaps 
 2012-07-31 17:18:15.122081 7f9375300780  0 filestore(/srv/disk10/data) mount: 
 enabling WRITEAHEAD journal mode: btrfs not detected
 2012-07-31 17:18:15.128544 7f9375300780  1 journal _open /srv/disk10/journal 
 fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
 2012-07-31 17:18:15.257302 7f9375300780  1 journal _open /srv/disk10/journal 
 fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
 2012-07-31 17:18:15.273163 7f9375300780  1 journal close /srv/disk10/journal
 2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) 
 limited size xattrs -- filestore_xattr_use_omap enabled
 2012-07-31 17:18:15.275169 7f9375300780  0 filestore(/srv/disk10/data) mount 
 FIEMAP ioctl is supported and appears to work
 2012-07-31 17:18:15.275180 7f9375300780  0 filestore(/srv/disk10/data) mount 
 FIEMAP ioctl is disabled via 'filestore fiemap' config option
 2012-07-31 17:18:15.275312 7f9375300780  0 filestore(/srv/disk10/data) mount 
 did NOT detect btrfs
 2012-07-31 17:18:15.276060 7f9375300780  0 filestore(/srv/disk10/data) mount 
 syncfs(2) syscall fully supported (by glib and kernel)
 2012-07-31 17:18:15.276154 7f9375300780  0 filestore(/srv/disk10/data) mount 
 found snaps 
 2012-07-31 17:18:15.277031 7f9375300780  0 filestore(/srv/disk10/data) mount: 
 enabling WRITEAHEAD journal mode: btrfs not detected
 2012-07-31 17:18:15.280906 7f9375300780  1 journal _open /srv/disk10/journal 
 fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
 2012-07-31 17:18:15.307761 7f9375300780  1 journal _open /srv/disk10/journal 
 fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
 2012-07-31 17:18:19.466921 7f9360a97700  0 -- 192.168.200.82:6830/18744 >> 
 192.168.200.83:0/3485583732 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer 
 addr is really 192.168.200.83:0/3485583732 (socket is 192.168.200.83:45653/0)
 2012-07-31 17:18:19.671681 7f9363a9d700 -1 os/DBObjectMap.cc: In function 
 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 
 7f9363a9d700 time 2012-07-31 17:18:19.670082
 os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())

 ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-osd() [0x6a3123]
  2: (ReplicatedPG::send_push(int, ObjectRecoveryInfo, ObjectRecoveryProgress, ObjectRecoveryProgress*)+0x684) [0x53f314]
  3: (ReplicatedPG::push_start(ReplicatedPG::ObjectContext*, hobject_t const&, int, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&)+0x333) [0x54c873]
  4: (ReplicatedPG::push_to_replica(ReplicatedPG::ObjectContext*, hobject_t const&, int)+0x343) [0x54cdc3]
  5: (ReplicatedPG::recover_object_replicas(hobject_t const&, eversion_t)+0x35f) [0x5527bf]
  6: (ReplicatedPG::wait_for_degraded_object(hobject_t const&, std::tr1::shared_ptr<OpRequest>)+0x17b) [0x55406b]
  7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x9de) [0x56305e]
  8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x199) [0x5fda89]
  9: (OSD::dequeue_op(PG*)+0x238) [0x5bf668]
  10: (ThreadPool::worker()+0x605) [0x796d55]
  11: (ThreadPool::WorkThread::entry()+0xd) [0x5d5d0d]
  12: (()+0x7e9a) [0x7f9374794e9a]
  

Re: About teuthology

2012-07-31 Thread Tommi Virtanen
On Tue, Jul 31, 2012 at 6:59 AM, Mehdi Abaakouk sil...@sileht.net wrote:
 Hi,

 I have taken a look at teuthology. The automation of all these tests
 is good, but is there any way to run them against an already-installed
 ceph cluster?

 Thanks in advance.

Many of the actual tests being run are already independent
functionality or stress tests or benchmarks; for example, ffsb will
run against any filesystem.

The things that are specifically written for teuthology are currently
quite tied to its internal details.

There is a longer-term plan to rework teuthology into using
package-based installation, and at that time I hope we will be able to
modularize the tests out of the teuthology core, and to make them
easier to run from just the command line.

This work depends on a bunch of internal changes to our testing lab
infrastructure -- package-based testing is not feasible until we have
lab machine reinstallation 100% automated, and currently it still
tends to need too much manual care.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: High-availability testing of ceph

2012-07-31 Thread Tommi Virtanen
On Tue, Jul 31, 2012 at 12:31 AM,  eric_yh_c...@wiwynn.com wrote:
 If the performance of rbd device is n MB/s under replica=2,
 then that means the total io throughputs on hard disk is over 3 * n MB/s.
 Because I think the total number of copies is 3 in original.

 So, it seems not correct now, the total number of copies is only 2.
 The total io through puts on disk should be 2 * n MB/s. Right?

Yes, each replica needs to independently write the data to disk. On
top of that, there are journal writes, and filesystems have overhead
too. If you create a 1 GB object in a pool replicated 3 times, you
should expect about 3*1 GB writes in total to your osd data disks, and
at least 3*1 GB writes in total to your osd journal disks.
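
As a back-of-the-envelope sketch of that write amplification (the numbers are
only examples):

CLIENT_MBS=100   # write rate seen by the client, e.g. from rados bench
REPLICAS=2       # pool rep size
echo "data disks: $((CLIENT_MBS * REPLICAS)) MB/s, journals: $((CLIENT_MBS * REPLICAS)) MB/s"
echo "raw total:  $((2 * CLIENT_MBS * REPLICAS)) MB/s for a client rate of ${CLIENT_MBS} MB/s"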

In normal use, you have many servers, and use CRUSH rules to ensure
the different replicas are not stored on the same server.
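
The placement rules themselves can be inspected and edited with the crush
tools; a minimal sketch:

ceph osd getcrushmap -o /tmp/crushmap              # grab the compiled map from the cluster
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt    # decompile it for editing
# edit /tmp/crushmap.txt (e.g. the chooseleaf type), then:
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new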
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXTERNAL] Re: avoiding false detection of down OSDs

2012-07-31 Thread Gregory Farnum
On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt jasc...@sandia.gov wrote:
 On 07/30/2012 06:24 PM, Gregory Farnum wrote:
 Hmm. The concern is that if an OSD is stuck on disk swapping then it's
 going to be just as stuck for the monitors as the OSDs — they're all
 using the same network in the basic case, etc. We want to be able to
 make that guess before the OSD is able to answer such questions.
 But I'll think on if we could try something else similar.


 OK - thanks.

 Also, FWIW I've been running my Ceph servers with no swap,
 and I've recently doubled the size of my storage cluster.
 Is it possible to have map processing do a little memory
 accounting and log it, or to provide some way to learn
 that map processing is chewing up significant amounts of
 memory?  Or maybe there's already a way to learn this that
 I need to learn about?  I sometimes run into something that
 shares some characteristics with what you describe, but is
 primarily triggered by high client write load.  I'd like
 to be able to confirm or deny it's the same basic issue
 you've described.

I think that we've done all our diagnosis using profiling tools, but
there's now a map cache and it probably wouldn't be too difficult to
have it dump data via perfcounters if you poked around... does anything like
this exist yet, Sage?
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Ceph changes for 3.6

2012-07-31 Thread Sage Weil
Hi Linus,

Please pull the following Ceph changes for 3.6 from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are several trivial conflicts to resolve; sorry!  Stephen is 
carrying fixes for them in linux-next as well.

Lots of stuff this time around:

 * lots of cleanup and refactoring in the libceph messenger code, and many 
   hard to hit races and bugs closed as a result.
 * lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in 
   preparation for the layering functionality that will be coming in 3.7.
 * some misc rbd cleanups from Josh Durgin that are finally going upstream
 * support for CRUSH tunables (used by newer clusters to improve the data
   placement)
 * some cleanup in our use of d_parent that Al brought up a while back
 * a random collection of fixes across the tree

There is another patch coming that fixes up our -atomic_open() behavior, 
but I'm going to hammer on it a bit more before sending it.

Thanks!
sage


Alan Cox (1):
  ceph: fix potential double free

Alex Elder (76):
  libceph: eliminate connection state DEAD
  libceph: kill bad_proto ceph connection op
  libceph: rename socket callbacks
  libceph: rename kvec_reset and kvec_add functions
  libceph: embed ceph messenger structure in ceph_client
  libceph: start separating connection flags from state
  libceph: start tracking connection socket state
  libceph: provide osd number when creating osd
  libceph: set CLOSED state bit in con_init
  libceph: osd_client: don't drop reply reference too early
  libceph: embed ceph connection structure in mon_client
  libceph: init monitor connection when opening
  libceph: fully initialize connection in con_init()
  libceph: tweak ceph_alloc_msg()
  libceph: have messages point to their connection
  libceph: have messages take a connection reference
  libceph: make ceph_con_revoke() a msg operation
  libceph: make ceph_con_revoke_message() a msg op
  libceph: encapsulate out message data setup
  libceph: encapsulate advancing msg page
  libceph: don't mark footer complete before it is
  libceph: move init_bio_*() functions up
  libceph: move init of bio_iter
  libceph: don't use bio_iter as a flag
  libceph: SOCK_CLOSED is a flag, not a state
  libceph: don't change socket state on sock event
  libceph: just set SOCK_CLOSED when state changes
  libceph: don't touch con state in con_close_socket()
  libceph: clear CONNECTING in ceph_con_close()
  libceph: clear NEGOTIATING when done
  libceph: define and use an explicit CONNECTED state
  libceph: separate banner and connect writes
  libceph: distinguish two phases of connect sequence
  libceph: small changes to messenger.c
  libceph: add some fine ASCII art
  libceph: drop declaration of ceph_con_get()
  libceph: fix off-by-one bug in ceph_encode_filepath()
  rbd: drop a useless local variable
  libceph: define ceph_extract_encoded_string()
  rbd: define dup_token()
  rbd: rename rbd_dev-block_name
  rbd: create pool_id device attribute
  rbd: dynamically allocate pool name
  rbd: dynamically allocate object prefix
  rbd: dynamically allocate image header name
  rbd: dynamically allocate image name
  rbd: dynamically allocate snapshot name
  rbd: use rbd_dev consistently
  rbd: rename some fields in struct rbd_dev
  rbd: more symbol renames
  rbd: option symbol renames
  rbd: kill num_reply parameters
  rbd: don't use snapc-seq that way
  rbd: preserve snapc-seq in rbd_header_set_snap()
  rbd: set snapc-seq only when refreshing header
  rbd: kill rbd_image_header-snap_seq
  rbd: drop extra header_rwsem init
  rbd: simplify __rbd_remove_all_snaps()
  rbd: clean up a few dout() calls
  ceph: define snap counts as u32 everywhere
  rbd: encapsulate header validity test
  rbd: rename rbd_device-id
  rbd: snapc is unused in rbd_req_sync_read()
  rbd: drop rbd_header_from_disk() gfp_flags parameter
  rbd: drop rbd_dev parameter in snap functions
  rbd: drop object_name from rbd_req_sync_watch()
  rbd: drop object_name from rbd_req_sync_notify()
  rbd: drop object_name from rbd_req_sync_notify_ack()
  rbd: drop object_name from rbd_req_sync_unwatch()
  rbd: have __rbd_add_snap_dev() return a pointer
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: pass null version pointer in add_snap()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: fixes in rbd_header_from_disk()
  rbd: return obj version in __rbd_refresh_header()
  rbd: create rbd_refresh_helper()

Dan Carpenter (2):
  rbd: endian bug in rbd_req_cb()
  libceph: fix NULL dereference in reset_connection()

Guanjun He (1):
  libceph: prevent the 

Re: How to integrate ceph with opendedup.

2012-07-31 Thread Tommi Virtanen
On Tue, Jul 31, 2012 at 2:18 AM, ramu ramu.freesyst...@gmail.com wrote:
 I want to integrate ceph with opendedup (sdfs) using java-rados.
 Please help me with integrating ceph with opendedup.

It sounds like you could use radosgw and just use S3ChunkStore.

If you really want to implement your own ChunkStore straight on top of
RADOS, well, it sounds like FileBasedChunkStore should be an easy
model; copy-paste it, and replace all file operations with RADOS
object reads/writes/etc.


BTW it sounds like SDFS is not really a distributed file system, or at
least the architecture slides don't point at anything about multiple
mounting the same metadata store. It sounds like they made all
operations for one file system go through the same metadata server.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXTERNAL] Re: avoiding false detection of down OSDs

2012-07-31 Thread Sage Weil
On Tue, 31 Jul 2012, Gregory Farnum wrote:
 On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt jasc...@sandia.gov wrote:
  On 07/30/2012 06:24 PM, Gregory Farnum wrote:
  Hmm. The concern is that if an OSD is stuck on disk swapping then it's
  going to be just as stuck for the monitors as the OSDs ? they're all
  using the same network in the basic case, etc. We want to be able to
  make that guess before the OSD is able to answer such questions.
  But I'll think on if we could try something else similar.
 
 
  OK - thanks.
 
  Also, FWIW I've been running my Ceph servers with no swap,
  and I've recently doubled the size of my storage cluster.
  Is it possible to have map processing do a little memory
  accounting and log it, or to provide some way to learn
  that map processing is chewing up significant amounts of
  memory?  Or maybe there's already a way to learn this that
  I need to learn about?  I sometimes run into something that
  shares some characteristics with what you describe, but is
  primarily triggered by high client write load.  I'd like
  to be able to confirm or deny it's the same basic issue
  you've described.
 
 I think that we've done all our diagnosis using profiling tools, but
 there's now a map cache and it probably wouldn't be too difficult to
 have it dump data via perfcounters if you poked around...anything like
 this exist yet, Sage?

Much of the bad behavior was triggered by #2860, fixes for which just went 
into the stable and master branches yesterday.  It's difficult to fully 
observe the bad behavior, though (lots of time spent in 
generate_past_intervals, reading old maps off disk).  With the fix, we 
pretty much only process maps during handle_osd_map.

Adding perfcounters in the methods that grab a map out of the cache or 
(more importantly) read it off disk will give you better visibility into 
that.  It should be pretty easy to instrument that (and I'll gladly 
take patches that implement that... :).  Without knowing more about what 
you're seeing, it's hard to say if its related, though.  This was 
triggered by long periods of unclean pgs and lots of data migration, not 
high load.
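
Until such counters exist, the counters that are already there can at least be
pulled over the admin socket while the cluster is busy; a sketch (socket path
per the defaults, and heap stats only if the daemons are built with tcmalloc):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump   # older releases call this perfcounters_dump
ceph osd tell 0 heap stats                                    # rough per-daemon heap usage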

sage

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Puppet modules for Ceph

2012-07-31 Thread Tommi Virtanen
On Tue, Jul 24, 2012 at 6:15 AM,  loic.dach...@enovance.com wrote:
 Note that if puppet client was run on nodeB before it was run on nodeA, all
 three steps would have been run in sequence instead of being spread over two
 puppet client invocations.

Unfortunately, even that is not enough.

The relevant keys (client.admin, client.bootstrap-osd, later
bootstrap-mds radosgw etc also) can only be created once the mons have
reached quorum. This is some time after they have started, even in the
best case. Making the puppet/chef run wait for that sounds like a bad
idea; especially since I use further chef-client runs to feed ceph-mon
information about its peers, which may be necessary for it to ever
reach quorum.

While I can find ways of making the key generation happen as soon as
quorum is reached, communicating the keys to other nodes only happens
at the mercy of the configuration management system; both puppet and
chef seem to be in the mindset of a "run every N minutes" option. So
even if we generate the keys best case 2 seconds after ceph-mon
startup, it needs a full configuration manager run on the source node,
and then a run on the destination node, before OSD bring-up etc can
succeed.

I have found no satisfying solution to this, so far.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Puppet modules for Ceph

2012-07-31 Thread Sage Weil
On Tue, 31 Jul 2012, Tommi Virtanen wrote:
 On Tue, Jul 24, 2012 at 6:15 AM,  loic.dach...@enovance.com wrote:
  Note that if puppet client was run on nodeB before it was run on nodeA, all
  three steps would have been run in sequence instead of being spread over two
  puppet client invocations.
 
 Unfortunately, even that is not enough.
 
 The relevant keys (client.admin, client.bootstrap-osd, later
 bootstrap-mds radosgw etc also) can only be created once the mons have
 reached quorum. This is some time after they have started, even in the
 best case. Making the puppet/chef run wait for that sounds like a bad
 idea; especially since I use further chef-client runs to feed ceph-mon
 information about its peers, which may be necessary for it to ever
 reach quorum.
 
 While I can find ways of making the key generation happen as soon as
 quorum is reached, communicating the keys to other nodes only happens
 at the mercy of the configuration management system; both puppet and
 chef seem to be in the mindset of run every N minutes option. So
 even if we generate the keys best case 2 seconds after ceph-mon
 startup, it needs a full configuration manager run on the source node,
 and then a run on the destination node, before OSD bring-up etc can
 succeed.
 
 I have found no satisfying solution to this, so far.

It is also possible to feed initial keys to the monitors during the 'mkfs' 
stage.  If the keys can be agreed on somehow beforehand, then they will 
already be in place when the initial quorum is reached.  Not sure if that 
helps in this situation or not...
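
A minimal sketch of that approach, assuming the keyring is generated by the
admin or the CM system first and shipped to every mon before mkfs (the caps
shown are only an example, and the usual --monmap/--fsid arguments are
omitted):

  ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. \
      --cap mon 'allow *'
  ceph-authtool /tmp/ceph.mon.keyring --gen-key -n client.admin \
      --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
  ceph-mon --mkfs -i a --keyring /tmp/ceph.mon.keyring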

sage


Re: Puppet modules for Ceph

2012-07-31 Thread Tommi Virtanen
On Tue, Jul 31, 2012 at 11:51 AM, Sage Weil s...@inktank.com wrote:
 It is also possible to feed initial keys to the monitors during the 'mkfs'
 stage.  If the keys can be agreed on somehow beforehand, then they will
 already be in place when the initial quorum is reached.  Not sure if that
 helps in this situation or not...

Yeah, we're going that way for the mon. key in the chef cookbooks
(to get the mons talking to each other at all, that *has* to be done
that way), but putting more and more stuff there is not very nice.

Your typical CM framework does not let the recipe run arbitrary code
at that sort of instantiation time, and pushing this work onto the
admin makes it laborious and brittle; what happens when we need a new
type of bootstrap-foo key? Get all admins to cram an extra entry
into their environment json?
http://ceph.com/docs/master/config-cluster/chef/#configure-your-ceph-environment

That just does not seem like a good way.
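
To illustrate: it would mean asking every admin to paste something like the
following into their environment (the attribute names here are made up for
the example, not the actual cookbook schema):

  "ceph": {
    "bootstrap-osd-secret": "AQ...",
    "bootstrap-mds-secret": "AQ...",
    "bootstrap-rgw-secret": "AQ..."
  }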

Juju seems to provide a real-time notification mechanism between
peers, using its name-relation-changed hook. Other CM frameworks
may need to step up their game, or be subject to the "keep re-running
chef-client until it works" limitation.

If the CM makes it safe to trigger a run manually (e.g. 'sudo
chef-client' whenever you feel like it), we can trigger that locally
when we finally create the keys. This still doesn't help the receiving
side to notice any faster.

If the CM makes it safe for us to change node attributes outside of
the full CM run, we can trigger that when we finally create the
keys. Chef seems to have a "full overwrite only" semantic, so this is
probably not safe with it. And as above, this does not help the
receiving side to notice that it has information to fetch.

What I want to do longer term is make the Chef cookbook for Ceph very
thin, push everything except the cross-node communication into Ceph
proper, and then write a mkcephfs v2.0 that uses SSH connections as
appropriate, from a central workstation host that can SSH anywhere,
to trigger these actions ASAP. Then that becomes the goal for CM
frameworks: provide me a communication mechanism between these nodes
that can do *this*.
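
As a rough sketch of what that could look like from such a central host (the
hostnames are placeholders and the auth subcommand is the generic form, not a
finished tool):

  # as soon as the mons report quorum, pull the bootstrap key and push it out
  ssh mon1 ceph auth get-or-create client.bootstrap-osd \
      mon 'allow profile bootstrap-osd' \
      | ssh osd1 'cat > /var/lib/ceph/bootstrap-osd/ceph.keyring'
  # ...then kick off whatever OSD bring-up the recipe would normally do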


Re: another performance-related thread

2012-07-31 Thread Andrey Korolyov
On 07/31/2012 07:53 PM, Josh Durgin wrote:
 On 07/31/2012 08:03 AM, Andrey Korolyov wrote:
 Hi,

 I've finally managed to run an rbd-related test on relatively powerful
 machines and what I have got:

 1) Reads on an almost fairly balanced cluster (eight nodes) did very well,
 utilizing almost all disk and network bandwidth (dual gbit 802.3ad nics, sata
 disks behind an lsi sas 2108 with wt cache gave me ~1.6 Gbyte/s on linear
 and sequential reads, which is close to the overall disk throughput)
 2) Writes did much worse, both in rados bench and in the fio test where I
 ran fio simultaneously on 120 vms - at its best, overall performance was
 about 400 Mbyte/s, using rados bench -t 12 on three host nodes

 How are your osd journals configured? What's your ceph.conf for the
 osds?

 fio config:

 rw=(randread|randwrite|seqread|seqwrite)
 size=256m
 direct=1
 directory=/test
 numjobs=1
 iodepth=12
 group_reporting
 name=random-read-direct
 bs=1M
 loops=12

 for the 120-vm set, in Mbyte/s:
 linear reads:
 MEAN: 14156
 STDEV: 612.596
 random reads:
 MEAN: 14128
 STDEV: 911.789
 linear writes:
 MEAN: 2956
 STDEV: 283.165
 random writes:
 MEAN: 2986
 STDEV: 361.311

 each node holds 15 vms, and with a 64M rbd cache all three possible states
 - wb, wt and no-cache - give almost the same numbers in the tests. I wonder
 if it is possible to raise the write/read ratio somehow. It seems the osds
 underutilize themselves; e.g. I am not able to get a single-threaded rbd
 write above 35 Mb/s. Adding a second osd on the same disk only raises
 iowait time, not the benchmark results.

 Are these write tests using direct I/O? That will bypass the cache for
 writes, which would explain the similar numbers with different cache
 modes.

I had previously forgotten that the direct flag may affect rbd cache behaviour.

Without it, with the wb cache, the read rate remained the same and writes
increased by roughly 15%:
random writes:
MEAN: 3370
STDEV: 939.99

linear writes:
MEAN: 3561
STDEV: 824.954
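
For reference, the wb/wt/no-cache states above correspond to the qemu drive
cache modes; a typical invocation of that era looks roughly like this (pool
and image names are placeholders):

  qemu -drive format=rbd,file=rbd:rbd/test-vm,cache=writeback ...
  # cache=writethrough and cache=none select the other two modes; the cache
  # size can be tuned with 'rbd cache size' in the [client] section of
  # ceph.conf, e.g. rbd cache size = 67108864 for the 64M used here.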



Re: Quick CentOS/RHEL question ...

2012-07-31 Thread 袁冬
Ceph works well on CentOS 6.2, including file access and RBD, while
radosgw is not yet covered by our testing.

To install ceph on CentOS 6, the main problem is the difference in
package names between CentOS and Ubuntu; 'yum search' may help. And
sometimes 'ldconfig' is needed after 'make install'.
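
A rough sketch of a source build on CentOS 6 (the -devel package names below
are best-effort guesses; 'yum search <name>' is the way to resolve the ones
that differ from Ubuntu):

  yum install gcc-c++ make automake autoconf libtool \
      libuuid-devel keyutils-libs-devel fuse-devel libedit-devel \
      libatomic_ops-devel boost-devel expat-devel nss-devel
  ./autogen.sh && ./configure && make && make install
  ldconfig    # refresh the shared-library cache after 'make install'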

On 1 August 2012 06:17, Joe Landman land...@scalableinformatics.com wrote:
 Hi folks

   I was struggling and failing to get Ceph properly built/installed for
 CentOS 6 (and 5) last week.  Is this simply not a recommended platform?
 Please advise.  Thanks!



 --
 Joseph Landman, Ph.D
 Founder and CEO
 Scalable Informatics Inc.
 email: land...@scalableinformatics.com
 web  : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
 phone: +1 734 786 8423 x121
 fax  : +1 866 888 3112
 cell : +1 734 612 4615



--
袁冬
Tel:13573888215
Email:yuandong1...@gmail.com
QQ:10200230
MSN:yuandong1...@hotmail.com




RE: cannot startup one of the osd

2012-07-31 Thread Eric_YH_Chen
Hi, Samuel:

It happens on every startup; I have not been able to fix it so far.

-Original Message-
From: Samuel Just [mailto:sam.j...@inktank.com] 
Sent: Wednesday, August 01, 2012 1:36 AM
To: Eric YH Chen/WYHQ/Wiwynn
Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY 
Chang/WYHQ/Wiwynn
Subject: Re: cannot startup one of the osd

This crash happens on each startup?
-Sam

On Tue, Jul 31, 2012 at 2:32 AM,  eric_yh_c...@wiwynn.com wrote:
 Hi, all:

 My Environment:  two servers, and 12 hard-disk on each server.
  Version: Ceph 0.48, Kernel: 3.2.0-27

 We create a ceph cluster with 24 osd, 3 monitors
 Osd.0 ~ osd.11 is on server1
 Osd.12 ~ osd.23 is on server2
 Mon.0 is on server1
 Mon.1 is on server2
 Mon.2 is on server3 which has no osd

 root@ubuntu:~$ ceph -s
health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs 
 recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 
 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are 
 down
monmap e1: 3 mons at 
 {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0},
  election epoch 564, quorum 0,1,2 006,008,009
osdmap e1911: 24 osds: 23 up, 24 in
 pgmap v292031: 4608 pgs: 4251 active+clean, 85 
 active+recovering+degraded, 37 active+remapped, 58 down+peering, 142 
 active+degraded, 35 down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 
 21390 GB avail; 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%)
mdsmap e1: 0/0/1 up

 I find that one of the osds cannot start up anymore. Before that, I was testing HA of 
 the Ceph cluster.

 Step1:  shut down server1, wait 5 min
 Step2:  boot up server1, wait 5 min until ceph returns to a healthy status
 Step3:  shut down server2, wait 5 min
 Step4:  boot up server2, wait 5 min until ceph returns to a healthy status 
 Repeat Step1 ~ Step4 several times (roughly the loop sketched below); after a few iterations I hit this problem.
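
 Roughly, the loop looks like this (host names are placeholders, and the
 power-on step is whatever the lab uses, e.g. IPMI):

   for i in 1 2 3 4 5; do
       ssh server1 poweroff; sleep 300
       # ...power server1 back on...
       until ssh server2 ceph health | grep -q HEALTH_OK; do sleep 30; done
       ssh server2 poweroff; sleep 300
       # ...power server2 back on...
       until ssh server1 ceph health | grep -q HEALTH_OK; do sleep 30; done
   done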


 Log of ceph-osd.22.log
 2012-07-31 17:18:15.120678 7f9375300780  0 filestore(/srv/disk10/data) 
 mount found snaps 
 2012-07-31 17:18:15.122081 7f9375300780  0 filestore(/srv/disk10/data) 
 mount: enabling WRITEAHEAD journal mode: btrfs not detected
 2012-07-31 17:18:15.128544 7f9375300780  1 journal _open 
 /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, 
 directio = 1, aio = 0
 2012-07-31 17:18:15.257302 7f9375300780  1 journal _open 
 /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, 
 directio = 1, aio = 0
 2012-07-31 17:18:15.273163 7f9375300780  1 journal close 
 /srv/disk10/journal
 2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) 
 limited size xattrs -- filestore_xattr_use_omap enabled
 2012-07-31 17:18:15.275169 7f9375300780  0 filestore(/srv/disk10/data) 
 mount FIEMAP ioctl is supported and appears to work
 2012-07-31 17:18:15.275180 7f9375300780  0 filestore(/srv/disk10/data) 
 mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
 2012-07-31 17:18:15.275312 7f9375300780  0 filestore(/srv/disk10/data) 
 mount did NOT detect btrfs
 2012-07-31 17:18:15.276060 7f9375300780  0 filestore(/srv/disk10/data) 
 mount syncfs(2) syscall fully supported (by glibc and kernel)
 2012-07-31 17:18:15.276154 7f9375300780  0 filestore(/srv/disk10/data) 
 mount found snaps 
 2012-07-31 17:18:15.277031 7f9375300780  0 filestore(/srv/disk10/data) 
 mount: enabling WRITEAHEAD journal mode: btrfs not detected
 2012-07-31 17:18:15.280906 7f9375300780  1 journal _open 
 /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, 
 directio = 1, aio = 0
 2012-07-31 17:18:15.307761 7f9375300780  1 journal _open 
 /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, 
 directio = 1, aio = 0
 2012-07-31 17:18:19.466921 7f9360a97700  0 -- 
 192.168.200.82:6830/18744 >> 192.168.200.83:0/3485583732 
 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer addr is really 
 192.168.200.83:0/3485583732 (socket is 192.168.200.83:45653/0)
 2012-07-31 17:18:19.671681 7f9363a9d700 -1 os/DBObjectMap.cc: In 
 function 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' 
 thread 7f9363a9d700 time 2012-07-31 17:18:19.670082
 os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())

 ceph version 0.48argonaut 
 (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-osd() [0x6a3123]
  2: (ReplicatedPG::send_push(int, ObjectRecoveryInfo, 
 ObjectRecoveryProgress, ObjectRecoveryProgress*)+0x684) [0x53f314]
  3: (ReplicatedPG::push_start(ReplicatedPG::ObjectContext*, hobject_t 
 const&, int, eversion_t, interval_set<unsigned long>&, 
 std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, 
 std::allocator<std::pair<hobject_t const, interval_set<unsigned long> 
 > > >&)+0x333) [0x54c873]
  4: (ReplicatedPG::push_to_replica(ReplicatedPG::ObjectContext*, 
 hobject_t const&, int)+0x343) [0x54cdc3]
  5: (ReplicatedPG::recover_object_replicas(hobject_t const&, 
 eversion_t)+0x35f) [0x5527bf]
  6: (ReplicatedPG::wait_for_degraded_object(hobject_t const&,