Re: OSD crash on 0.48.2argonaut

2012-11-15 Thread Josh Durgin

On 11/14/2012 11:31 PM, eric_yh_c...@wiwynn.com wrote:

Dear All:

I hit this issue on one of the osd nodes. Is this a known issue? Thanks!

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
  1: /usr/bin/ceph-osd() [0x6edaba]
  2: (()+0xfcb0) [0x7f08b112dcb0]
  3: (gsignal()+0x35) [0x7f08afd09445]
  4: (abort()+0x17b) [0x7f08afd0cbab]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f08b065769d]
  6: (()+0xb5846) [0x7f08b0655846]
  7: (()+0xb5873) [0x7f08b0655873]
  8: (()+0xb596e) [0x7f08b065596e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1de) [0x7a82fe]
  10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x693) [0x530f83]
  11: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, 
eversion_t)+0x159) [0x531ac9]
  12: 
(ReplicatedPG::sub_op_modify_reply(std::tr1::shared_ptr)+0x15c) 
[0x53251c]
  13: (ReplicatedPG::do_sub_op_reply(std::tr1::shared_ptr)+0x81) 
[0x54d241]
  14: (PG::do_request(std::tr1::shared_ptr)+0x1e3) [0x600883]
  15: (OSD::dequeue_op(PG*)+0x238) [0x5bfaf8]
  16: (ThreadPool::worker()+0x4d5) [0x79f835]
  17: (ThreadPool::WorkThread::entry()+0xd) [0x5d87cd]
  18: (()+0x7e9a) [0x7f08b1125e9a]
  19: (clone()+0x6d) [0x7f08afdc54bd]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


The log of the crashed osd should show which assert actually failed.
It could be this bug, but I can't tell without knowing which
assert was triggered:

http://tracker.newdream.net/issues/2956

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: improve speed with auth supported=none

2012-11-15 Thread Stefan Priebe - Profihost AG

Am 14.11.2012 14:24, schrieb Soporte:

El 13/11/2012 04:52 a.m., Stefan Priebe escribió:

Am 13.11.2012 08:42, schrieb Josh Durgin:

On 11/12/2012 01:57 PM, Stefan Priebe wrote:

Thanks,

this gives another iops boost. I'm now at 23.000 iops ;-) So for
random 4k iops, ceph auth and especially the logging are a lot of
overhead.


How much difference did disabling auth make vs only disabling logging?


disable debug logging: 3000 iops
disable auth logging: 2000 iops

Is anybody on the ceph team also interested in a call graph of kvm
while the VM is doing random 4k write io?



Hi.

How do I disable debug and auth logging in ceph.conf?


Right now I have put this stuff in every section - this is definitely 
overkill, but I didn't know where to put which of them:

[global]
auth cluster required = none
auth service required = none
auth supported = none
auth client required = none

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

[mon]
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

[osd]
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
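Since settings in [global] apply to every daemon, the [mon] and [osd] blocks above only repeat what [global] already covers. Assuming standard ceph.conf section semantics (and noting that which auth options exist depends on the release - 'auth supported' is the older single-option form, the four split options are newer), a trimmed sketch would be:

```ini
[global]
auth supported = none
auth cluster required = none
auth service required = none
auth client required = none

; each 'debug x = 0/0' line appears here exactly once
debug lockdep = 0/0
debug osd = 0/0
debug ms = 0/0
debug filestore = 0/0
; ...and so on for the remaining subsystems in the list above
```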


Stefan


Re: endless flying slow requests

2012-11-15 Thread Stefan Priebe - Profihost AG

Am 14.11.2012 15:59, schrieb Sage Weil:

Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').


Yes it is. So I have to specify the admin socket on the KVM host? How do I 
query the admin socket for requests?


Stefan



On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:


Hello list,

I see this several times: endless flying slow requests that never stop
until I restart the affected osd.

2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed

When I restart osd 24, they go away and everything is fine again.

Stefan
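Warnings like these are easy to summarize mechanically. A small hypothetical helper (the regex and field names are inferred from the log lines above, not from any ceph tooling) that tracks the oldest age seen per op:

```python
import re

# Two of the [WRN] lines from above, joined for the demo.
LOG = (
    "2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 "
    "seconds old, received at 2012-11-14 01:22:07.654922: "
    "osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 "
    "[write 282624~4096] 3.3f6d2373) v4 currently delayed\n"
    "2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 "
    "seconds old, received at 2012-11-14 01:22:07.654922: "
    "osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 "
    "[write 282624~4096] 3.3f6d2373) v4 currently delayed\n"
)

# Capture the age in seconds and the client op id from each record.
PAT = re.compile(r"slow request (\d+\.\d+) seconds old.*?osd_op\((\S+)")

def oldest_per_op(log):
    """Return {op_id: max age in seconds} for every slow request seen."""
    ages = {}
    for age, op in PAT.findall(log):
        ages[op] = max(float(age), ages.get(op, 0.0))
    return ages

print(oldest_per_op(LOG))
```

Here both records are the same stuck op (client.30286.0:6719), blocked for almost nine hours - which matches the "restart the osd to clear it" symptom.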







Re: endless flying slow requests

2012-11-15 Thread Josh Durgin

On 11/15/2012 12:09 AM, Stefan Priebe - Profihost AG wrote:

Am 14.11.2012 15:59, schrieb Sage Weil:

Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').


Yes it is. So I have to specify the admin socket on the KVM host? How do I
query the admin socket for requests?


Yes, add 'admin socket = /path/to/admin/socket' to the [client] section
in ceph.conf, and when a guest is running, show outstanding requests
with:

ceph --admin-daemon /path/to/admin/socket objecter_requests

Josh


Re: OSD network failure

2012-11-15 Thread Josh Durgin

On 11/13/2012 06:15 AM, Gandalf Corvotempesta wrote:

Hi,
what happens in case of an OSD network failure? Is ceph smart enough to
isolate OSDs that are not synced?
Should I use LACP on the OSD network, or should a single 10GbE link per server be ok?

LACP would require stackable switches and a much larger hardware investment.


OSDs send heartbeats to each other and report failure to receive
a heartbeat in a certain interval to the monitor cluster.
When the monitor cluster receives enough of these reports,
it marks the OSD 'down' in the OSD map, and after a grace period
to allow for flapping or daemon restarts, marks the osd 'out'
as well. This makes the cluster rebalance any data that was on the
failed OSD, and places no new data there.

A lot of this is configurable, but that's the basic model.
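The knobs involved look roughly like this in ceph.conf. The option names below match the argonaut-era documentation and the defaults are illustrative; verify both against your release before relying on them:

```ini
[osd]
; seconds without a heartbeat before peers report this osd to the monitors
osd heartbeat grace = 20

[mon]
; seconds a 'down' osd is given (for flapping/restarts) before it is
; also marked 'out' and rebalancing starts
mon osd down out interval = 300
```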

In this model, a network failure is equivalent to extreme slowness or a
crashed OSD - everything results in an updated map of the cluster
eventually, and the OSDs maintain strong consistency of the data
through the peering and recovery processes.

So basically you'd only need a single NIC per storage node. Multiple
NICs can be useful to separate frontend and backend traffic, but ceph
is designed to maintain strong consistency when failures occur.

Josh


Re: problem with ceph and btrfs patch: set journal_info in async trans commit worker

2012-11-15 Thread Stefan Priebe - Profihost AG

Hi Miao,

Am 15.11.2012 06:18, schrieb Miao Xie:

Hi, Stefan

On wed, 14 Nov 2012 14:42:07 +0100, Stefan Priebe - Profihost AG wrote:

Hello list,

I wanted to try out ceph with the latest vanilla kernel, 3.7-rc5, and saw a 
massive performance degradation. I see around 22 btrfs-endio-write processes 
every 10-20 seconds, and they run a long time while consuming a massive amount 
of CPU.

So my performance oscillates between 23.000 iops and 0 - the average is now 
2500 iops instead of 23.000.

Git bisect shows me commit: e209db7ace281ca347b1ac699bf1fb222eac03fe "Btrfs: set 
journal_info in async trans commit worker" as the problematic patch.

When i revert this one everything is fine again.

Is this known?


Could you try the following patch?

http://marc.info/?l=linux-btrfs&m=135175512030453&w=2

I think the patch

   Btrfs: set journal_info in async trans commit worker

is not the real cause of the regression.

I guess it is caused by a bug in the reservation code. When we join the
same transaction handle more than twice, the pointer to the reservation
in the transaction handle is lost, and the statistical data in the
reservation gets corrupted. Then we trigger the space flush,
which may block your tasks.


I applied your whole patchset. It looks a lot better now, but average iops is 
5000, not the 23.000 I get when reverting the mentioned commit 
(e209db7ace281ca347b1ac699bf1fb222eac03fe).


Stefan


Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible

2012-11-15 Thread Danny Al-Gaaf
Hi Sage,

Am 15.11.2012 01:12, schrieb Sage Weil:
> Hi Danny,
> 
> Have you had a chance to work on this?  I'd like to include this 
> in bobtail.  If you don't have time we can go ahead and implement it, but 
> I'd like to avoid duplicating effort.

I'm already working on it. Do you have a deadline for bobtail?

Danny

> Thanks!
> sage



ceph-osd cpu usage

2012-11-15 Thread Stefan Priebe - Profihost AG

Hello list,

my main problem right now is that ceph does not scale for me (more VMs 
using rbd): ceph-osd uses all 8 of my CPU cores all the time with just 
4 SSDs, while the SSDs themselves are far from loaded.


What is the best way to find out what the ceph-osd process is doing all 
the time?


A gperf graph is attached.

Greets,
Stefan


out.pdf
Description: Adobe PDF document


Re: ceph-osd cpu usage

2012-11-15 Thread Alexandre DERUMIER
Is cpu usage the same for read and write?


- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: ceph-devel@vger.kernel.org 
Envoyé: Jeudi 15 Novembre 2012 11:56:37 
Objet: ceph-osd cpu usage 

Hello list, 

my main problem right now is that ceph does not scale for me (more vms 
using rbd). It does not scale as the ceph-osd is using all my CPU core 
all the time (8 cores) with just 4 SSDs. The SSDs are far away from 
being loaded. 

What is the best way to find out what the ceph-osd process is doing all 
the time? 

A gperf graph is attached. 

Greets, 
Stefan 


Re: ceph-osd cpu usage

2012-11-15 Thread Stefan Priebe - Profihost AG

Am 15.11.2012 12:18, schrieb Alexandre DERUMIER:

Is cpu usage the same for read and write?


No, for read it is just around 25%, and I get "full" (limited by rbd / 
librbd) 23.000 iops per VM.




[quoted original message snipped]




Re: endless flying slow requests

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
> Am 14.11.2012 15:59, schrieb Sage Weil:
> > Hi Stefan,
> > 
> > It would be nice to confirm that no clients are waiting on replies for
> > these requests; currently we suspect that the OSD request tracking is the
> > buggy part.  If you query the OSD admin socket you should be able to dump
> > requests and see the client IP, and then query the client.
> > 
> > Is it librbd?  In that case you likely need to change the config so that
> > it is listening on an admin socket ('admin socket = path').
> 
> Yes it is. So I have to specify the admin socket on the KVM host?

Right.  IIRC the disk line is a ; (or \;) separated list of key/value 
pairs.

> How do I query the admin socket for requests?

ceph --admin-daemon /path/to/socket help
ceph --admin-daemon /path/to/socket objecter_dump (i think)

sage

> 
> Stefan
> 
> 
> > On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
> > 
> > > Hello list,
> > > 
> > > i see this several times. Endless flying slow requests. And they never
> > > stop
> > > until i restart the mentioned osd.
> > > 
> > > [slow request log snipped]
> > > 
> > > When i now restart osd 24 they go away and everything is fine again.
> > > 
> > > Stefan


Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Danny Al-Gaaf wrote:
> Hi Sage,
> 
> Am 15.11.2012 01:12, schrieb Sage Weil:
> > Hi Danny,
> > 
> > Have you had a chance to work on this?  I'd like to include this 
> > in bobtail.  If you don't have time we can go ahead and implement it, but 
> > I'd like to avoid duplicating effort.
> 
> I'm already working on it. Do you have a deadline for bobtail?

Release is ~3 weeks off, but it is technically frozen.  This week would be 
best so that we can make sure it is well tested.

Thanks!
sage


Re: ceph-osd cpu usage

2012-11-15 Thread Mark Nelson
Out of curiosity, does it help much if you disable crc32c calculations? 
Use the "nocrc" option in your ceph.conf file.  I've had my eye on 
crcutil as an alternative to how we do crc32c now.


http://code.google.com/p/crcutil/

Mark

On 11/15/2012 06:19 AM, Stefan Priebe - Profihost AG wrote:

Am 15.11.2012 12:18, schrieb Alexandre DERUMIER:

cpu usage is same for read and write  ?


no for read it is just around 25%. And i get "full" (limited by rbd /
librbd) 23.000 iops per vm.



[quoted original message snipped]






Re: ceph-osd cpu usage

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
> Hello list,
> 
> my main problem right now is that ceph does not scale for me (more vms using
> rbd). It does not scale as the ceph-osd is using all my CPU core all the time
> (8 cores) with just 4 SSDs. The SSDs are far away from being loaded.
> 
> What is the best way to find out what the ceph-osd process is doing all the
> time?
> 
> A gperf graph is attached.

Hmm, most significant time seems to be in the allocator and doing 
fsetxattr(2) (10%!).  Also some path traversal stuff.

Can you try the wip-fd-simple-cache branch, which tries to spend less time 
closing and reopening files?  I'm curious how much of a difference it will 
make for you for both IOPS and CPU utilization.

It is also possible to use leveldb for most attrs.  If you set 
'filestore xattr use omap = true' it should put most attrs in leveldb.
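In ceph.conf terms that would presumably go in the osd section, e.g. (whether it takes effect on an existing store or only at mkfs time is worth checking for your release):

```ini
[osd]
filestore xattr use omap = true
```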

sage


Re: ceph-osd cpu usage

2012-11-15 Thread Stefan Priebe - Profihost AG

Am 15.11.2012 16:14, schrieb Sage Weil:

On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:

Hello list,

my main problem right now is that ceph does not scale for me (more vms using
rbd). It does not scale as the ceph-osd is using all my CPU core all the time
(8 cores) with just 4 SSDs. The SSDs are far away from being loaded.

What is the best way to find out what the ceph-osd process is doing all the
time?

A gperf graph is attached.


Hmm, most significant time seems to be in the allocator and doing
fsetxattr(2) (10%!).  Also some path traversal stuff.

Can you try the wip-fd-simple-cache branch, which tries to spend less time
closing and reopening files?  I'm curious how much of a difference it will
make for you for both IOPS and CPU utilization.


Will try that this evening.


It is also possible to use leveldb for most attrs.  If you set
'filestore xattr use omap = true' it should put most attrs in leveldb.


Do I have to recreate the cephfs?

Stefan


Re: ceph-osd cpu usage

2012-11-15 Thread Stefan Priebe

Am 15.11.2012 16:12, schrieb Mark Nelson:

Out of curiosity, does it help much if you disable crc32c calculations?
Use the "nocrc" option in your ceph.conf file.  I've had my eye on
crcutil as an alternative to how we do crc32c now.

http://code.google.com/p/crcutil/


Will try that. How and where do I set the nocrc option?

Is it
[global]
nocrc = true

Stefan


Re: ceph-osd cpu usage

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Stefan Priebe wrote:
> Am 15.11.2012 16:12, schrieb Mark Nelson:
> > Out of curiosity, does it help much if you disable crc32c calculations?
> > Use the "nocrc" option in your ceph.conf file.  I've had my eye on
> > crcutil as an alternative to how we do crc32c now.
> > 
> > http://code.google.com/p/crcutil/
> 
> Will try that. How and where do I set the nocrc option?
> 
> Is it
> [global]
> nocrc = true

ms nocrc = true

sage


new process: cherry-picking to stable releases

2012-11-15 Thread Sage Weil
If you are *ever* cherry-picking something to an older stable branch, 
please use 

 git cherry-pick -x 

That will append a '(cherry picked from )' message to the bottom of 
the commit, allowing us to always find the original commit that we are 
duplicating.

This implies that we are always committing to the master/next branch 
*first*, and then cherry-picking to stable.  As a general rule, a fix 
should always be 'upstream' first in the new branches before it is 
backported to something older.  The exception is a fix that is unique to the 
stable branch, for example because the new code has changed.
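The effect of -x is easy to verify in a throwaway repo (repo layout, file names, and commit messages below are made up for the demo):

```shell
# Demonstrate that `git cherry-pick -x` records the origin commit.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name Dev
git config commit.gpgsign false
git commit -q --allow-empty -m "initial"
git branch stable                   # stable forks off before the fix
echo fix > f.txt
git add f.txt
git commit -q -m "fix: a bug"       # the fix lands upstream first
fix=$(git rev-parse HEAD)
git checkout -q stable
git cherry-pick -x "$fix"           # backport, recording the origin sha
git log -1 --format=%B              # body ends with "(cherry picked from commit ...)"
```

The last line of the backported commit message now contains the original sha, so the upstream commit can always be located from the stable branch.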

Thanks!
sage



Re: new process: cherry-picking to stable releases

2012-11-15 Thread Alex Elder
On 11/15/2012 11:30 AM, Sage Weil wrote:
> If you are *ever* cherry-picking something to an older stable branch, 
> please use 
> 
>  git cherry-pick -x 
> 
> That will append a '(cherry-picked from )' message to the bottom of 
> the commit, allowing us to always find the original commit that we are 
> duplicating.
> 
> This implies that we are always committing to the master/next branch 
> *first*, and then cherry-picking to stable.  As a general rule, a fix 
> should always be 'upstream' first in the new branches before it is 
> backported to something older.  The exception is a fix that is unique to the 
> stable branch, for example because the new code has changed.

It also implies--or to be useful requires--that anything committed
to master is pretty much permanent (and never rebased).  The master
branch should be considered append-only.

-Alex


master <=> next

2012-11-15 Thread Stefan Priebe

Hello list,

maybe I do not understand the difference between master and next, but is 
it correct that the following commits are in next but NOT in master?


b40387d msg/Pipe: fix leak of Authorizer
0fb23cf Merge remote-tracking branch 'gh/wip-3477' into next
12c2b7f msg/DispatchQueue: release throttle on messages when dropping an id
5f214b2 PrioritizedQueue: allow remove_by_class to return removed items
98b93b5 librbd: use delete[] properly
4a7a81b objecter: fix leak of out_handlers
ef4e4c8 mon: calculate failed_since relative to message receive time
9267d8a rgw: update post policy parser
f6cb078 mon: set default port when binding to random local ip
dfeb8de Merge remote-tracking branch 'gh/wip-asok' into next
ce28455 rgw: relax date format check
4a34965 client: register admin socket commands without lock held
4db9442 objecter: separate locked and unlocked init/shutdown
9c31d09 mon: kick failures when we lose leadership
e43f9d7 mon: process failures when osds go down
763d348 mon: ignore failure messages if already pending a failure

Greets,
Stefan
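A list like the one above falls out of git's range syntax: `git log master..next` shows commits reachable from next but not from master. A throwaway demo (branch and file names are made up):

```shell
# Throwaway repo: list commits on next that master lacks.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name Dev
git config commit.gpgsign false
git commit -q --allow-empty -m "initial"
git branch -M master                  # normalize the default branch name
git checkout -q -b next
echo fix > f.txt
git add f.txt
git commit -q -m "mon: some fix"      # exists only on next
git log --oneline master..next        # prints just that one commit
```

Once next is merged back into master, the same range prints nothing.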


Re: master <=> next

2012-11-15 Thread Yehuda Sadeh
On Thu, Nov 15, 2012 at 11:17 AM, Stefan Priebe  wrote:
> Hello list,
>
> maybe i do not understand the difference between master and next but is it
> correct that the following commits are in next but NOT in master?
>
> b40387d msg/Pipe: fix leak of Authorizer
> 0fb23cf Merge remote-tracking branch 'gh/wip-3477' into next
> 12c2b7f msg/DispatchQueue: release throttle on messages when dropping an id
> 5f214b2 PrioritizedQueue: allow remove_by_class to return removed items
> 98b93b5 librbd: use delete[] properly
> 4a7a81b objecter: fix leak of out_handlers
> ef4e4c8 mon: calculate failed_since relative to message receive time
> 9267d8a rgw: update post policy parser
> f6cb078 mon: set default port when binding to random local ip
> dfeb8de Merge remote-tracking branch 'gh/wip-asok' into next
> ce28455 rgw: relax date format check
> 4a34965 client: register admin socket commands without lock held
> 4db9442 objecter: separate locked and unlocked init/shutdown
> 9c31d09 mon: kick failures when we lose leadership
> e43f9d7 mon: process failures when osds go down
> 763d348 mon: ignore failure messages if already pending a failure
>

Yes. The next branch will be merged into master later.

Yehuda


Re: master <=> next

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Stefan Priebe wrote:
> Hello list,
> 
> maybe i do not understand the difference between master and next but is it
> correct that the following commits are in next but NOT in master?
> 
> b40387d msg/Pipe: fix leak of Authorizer
> 0fb23cf Merge remote-tracking branch 'gh/wip-3477' into next
> 12c2b7f msg/DispatchQueue: release throttle on messages when dropping an id
> 5f214b2 PrioritizedQueue: allow remove_by_class to return removed items
> 98b93b5 librbd: use delete[] properly
> 4a7a81b objecter: fix leak of out_handlers
> ef4e4c8 mon: calculate failed_since relative to message receive time
> 9267d8a rgw: update post policy parser
> f6cb078 mon: set default port when binding to random local ip
> dfeb8de Merge remote-tracking branch 'gh/wip-asok' into next
> ce28455 rgw: relax date format check
> 4a34965 client: register admin socket commands without lock held
> 4db9442 objecter: separate locked and unlocked init/shutdown
> 9c31d09 mon: kick failures when we lose leadership
> e43f9d7 mon: process failures when osds go down
> 763d348 mon: ignore failure messages if already pending a failure

Temporarily, yes.  Normally fixes go into next and are then merged back 
into master.  Since we're in the 'frozen but stabilizing' phase that's 
less work than cherry-picking every fix from master->next.

We then regularly merge next back into master so that master gets 
everything, but that hasn't happened in a day or so; I'll do it now.

sage


Re: ceph-osd cpu usage

2012-11-15 Thread Stefan Priebe

Hi Mark,

Am 15.11.2012 16:12, schrieb Mark Nelson:

Out of curiosity, does it help much if you disable crc32c calculations?
Use the "nocrc" option in your ceph.conf file.  I've had my eye on
crcutil as an alternative to how we do crc32c now.

http://code.google.com/p/crcutil/

Mark
This changes nothing; CPU load on the osds stays the same. This was a bit 
tricky ;-) I had to set nocrc on the KVM host as well, otherwise KVM 
wouldn't start due to bad crc errors.


Greets,
Stefan


Re: ceph-osd cpu usage

2012-11-15 Thread Stefan Priebe

Am 15.11.2012 16:14, schrieb Sage Weil:

On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
Hmm, most significant time seems to be in the allocator and doing
fsetxattr(2) (10%!).  Also some path traversal stuff.

Yes fsetxattr seems to be CPU hungry.


Can you try the wip-fd-simple-cache branch, which tries to spend less time
closing and reopening files?  I'm curious how much of a difference it will
make for you for both IOPS and CPU utilization.

It seems to give me around 1000 iops across 3 VMs.


It is also possible to use leveldb for most attrs.  If you set
'filestore xattr use omap = true' it should put most attrs in leveldb.

Tried this but this raises CPU by 20%.
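For reference, the setting being tested here is a one-line ceph.conf change (placing it in the [osd] section is an assumption):

```ini
[osd]
        ; store most xattrs in leveldb (omap) instead of as filesystem
        ; xattrs, avoiding the fsetxattr(2) overhead discussed above
        filestore xattr use omap = true
```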

Any other ideas how to reduce ceph-osd while doing randwrite?

Randread gives me with 3 VMs: 60,000 iops
Randwrite gives me with 3 VMs: 25,000 iops

Stefan


Re: poor performance

2012-11-15 Thread Gregory Farnum
On Sun, Nov 4, 2012 at 7:13 AM, Aleksey Samarin  wrote:
> What may be possible solutions?
> Update centos to 6.3?

From what I've heard the RHEL libc doesn't support the syncfs syscall
(even though the kernel does have it). :( So you'd need to make sure
the kernel supports it and then build a custom glibc, and then make
sure your Ceph software is built to use it.


> About issue with writes to lots of disk, i think parallel dd command
> will be good as test! :)

Yes — it really looks like maybe some of your disks are much slower
than the others. Try benchmarking each individually one-at-a-time, and
then in groups. I suspect you'll see a problem below the Ceph layers.

>
> 2012/11/4 Mark Nelson :
>> On 11/04/2012 07:18 AM, Aleksey Samarin wrote:
>>>
>>> Well, i create ceph cluster with 2 osd ( 1 osd per node),  2 mon, 2 mds.
>>> here is what I did:
>>>   ceph osd pool create bench
>>>   ceph osd tell \* bench
>>>   rados -p bench bench 30 write --no-cleanup
>>> output:
>>>
>>>   Maintaining 16 concurrent writes of 4194304 bytes for at least 30
>>> seconds.
>>>   Object prefix: benchmark_data_host01_11635
>>>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>      0       0         0         0         0         0         -         0
>>>      1      16        16         0         0         0         -         0
>>>      2      16        37        21   41.9911        42  0.139005   1.08941
>>>      3      16        53        37   49.3243        64  0.754114   1.09392
>>>      4      16        75        59   58.9893        88  0.284647  0.914221
>>>      5      16        89        73   58.3896        56  0.072228  0.881008
>>>      6      16        95        79   52.6575        24   1.56959  0.961477
>>>      7      16       111        95   54.2764        64  0.046105   1.08791
>>>      8      16       128       112   55.9906        68  0.035714   1.04594
>>>      9      16       150       134   59.5457        88  0.046298   1.04415
>>>     10      16       166       150   59.9901        64  0.048635  0.986384
>>>     11      16       176       160   58.1723        40  0.727784  0.988408
>>>     12      16       206       190   63.3231       120   0.28869  0.946624
>>>     13      16       225       209   64.2976        76   1.34472  0.919464
>>>     14      16       263       247   70.5605       152  0.070926   0.90046
>>>     15      16       295       279   74.3887       128  0.041517  0.830466
>>>     16      16       315       299   74.7388        80  0.296037  0.841527
>>>     17      16       333       317   74.5772        72  0.286097  0.849558
>>>     18      16       340       324   71.9891        28  0.295084   0.83922
>>>     19      16       343       327   68.8317        12   1.46948  0.845797
>>> 2012-11-04 17:14:52.090941 min lat: 0.035714 max lat: 2.64841 avg lat: 0.861539
>>>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>     20      16       378       362    72.389       140  0.566232  0.861539
>>>     21      16       400       384   73.1313        88  0.038835  0.857785
>>>     22      16       404       388   70.5344        16  0.801216  0.857002
>>>     23      16       413       397   69.0327        36  0.062256   0.86376
>>>     24      16       428       412   68.6543        60  0.042583   0.89389
>>>     25      16       450       434   69.4277        88  0.383877  0.905833
>>>     26      16       472       456   70.1415        88  0.269878  0.898023
>>>     27      16       472       456   67.5437         0         -  0.898023
>>>     28      16       512       496   70.8448        80  0.056798  0.891163
>>>     29      16       530       514   70.8843        72   1.20653  0.898112
>>>     30      16       542       526   70.1212        48  0.744383  0.890733
>>>   Total time run: 30.174151
>>> Total writes made:  543
>>> Write size: 4194304
>>> Bandwidth (MB/sec): 71.982
>>>
>>> Stddev Bandwidth:   38.318
>>> Max bandwidth (MB/sec): 152
>>> Min bandwidth (MB/sec): 0
>>> Average Latency:0.889026
>>> Stddev Latency: 0.677425
>>> Max latency:2.94467
>>> Min latency:0.035714
>>>
>>
>> Much better for 1 disk per node!  I suspect that lack of syncfs is hurting
>> you, or perhaps some other issue with writes to lots of disks at the same
>> time.
>>
>>
>>>
>>> 2012/11/4 Aleksey Samarin :

 Ok!
 Well, I'll take these tests and write about the results.

 btw,
 the disks are the same; how can some be faster than others?

 2012/11/4 Gregory Farnum :
>
> That's only nine — where are the other three? If you have three slow
> disks that could definitely cause the troubles you're seeing.
>
> Also, what Mark said about sync versus syncfs.
>
> On Sun, Nov 4, 2012 at 1:26 PM, Aleksey Samarin wrote:

ceph-osd crashing (os/FileStore.cc: 4500: FAILED assert(replaying))

2012-11-15 Thread Stefan Priebe

Hello list,

The actual master incl. upstream/wip-fd-simple-cache results in this crash 
when I try to start some of my osds (others work fine) today, on multiple 
nodes:


-2> 2012-11-15 22:04:09.226945 7f3af1c7a780  0 osd.52 pg_epoch: 657 
pg[3.3b( v 632'823 (632'823,632'823] n=5 ec=17 les/c 18/18 656/656/17) 
[] r=0 lpr=0 pi=17-655/2 (info mismatch, log(632'823,0'0]) (log bound 
mismatch, empty) lcod 0'0 mlcod 0'0 inactive] Got exception 
'read_log_error: read_log got 0 bytes, expected 126086-0=126086' while 
reading log. Moving corrupted log file to 
'corrupt_log_2012-11-15_22:04_3.3b' for later analysis.
-1> 2012-11-15 22:04:09.233563 7f3af1c7a780  0 osd.52 pg_epoch: 657 
pg[3.557( v 632'753 (0'0,632'753] n=2 ec=17 les/c 18/18 656/656/17) [] 
r=0 lpr=0 pi=17-655/2 (info mismatch, log(0'0,0'0]) lcod 0'0 mlcod 0'0 
inactive] Got exception 'read_log_error: read_log got 0 bytes, expected 
115488-0=115488' while reading log. Moving corrupted log file to 
'corrupt_log_2012-11-15_22:04_3.557' for later analysis.
 0> 2012-11-15 22:04:09.234536 7f3ae87d0700 -1 os/FileStore.cc: In 
function 'int FileStore::_collection_add(coll_t, coll_t, const 
hobject_t&, const SequencerPosition&)' thread 7f3ae87d0700 time 
2012-11-15 22:04:09.233672

os/FileStore.cc: 4500: FAILED assert(replaying)

 ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
 1: (FileStore::_collection_add(coll_t, coll_t, hobject_t const&, 
SequencerPosition const&)+0x77d) [0x72ff0d]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int)+0x25fb) [0x73481b]
 3: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) 
[0x73952c]

 4: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
 7: (()+0x68ca) [0x7f3af16578ca]
 8: (clone()+0x6d) [0x7f3aefac6bfd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   1/ 5 mon
   0/ 0 monc
   0/ 5 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new  100
  log_file /var/log/ceph/ceph-osd.52.log
--- end dump of recent events ---
2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal (Aborted) **
 in thread 7f3ae87d0700

 ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
 1: /usr/bin/ceph-osd() [0x799769]
 2: (()+0xeff0) [0x7f3af165fff0]
 3: (gsignal()+0x35) [0x7f3aefa29215]
 4: (abort()+0x180) [0x7f3aefa2c020]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5]
 6: (()+0xcb166) [0x7f3af02bc166]
 7: (()+0xcb193) [0x7f3af02bc193]
 8: (()+0xcb28e) [0x7f3af02bc28e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x7c9) [0x7fd069]
 10: (FileStore::_collection_add(coll_t, coll_t, hobject_t const&, 
SequencerPosition const&)+0x77d) [0x72ff0d]
 11: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int)+0x25fb) [0x73481b]
 12: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) 
[0x73952c]

 13: (FileStore::_do_op(FileStore::OpSequencer*)+0x195) [0x705c45]
 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x830f1b]
 15: (ThreadPool::WorkThread::entry()+0x10) [0x833700]
 16: (()+0x68ca) [0x7f3af16578ca]
 17: (clone()+0x6d) [0x7f3aefac6bfd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


--- begin dump of recent events ---
 0> 2012-11-15 22:04:09.235734 7f3ae87d0700 -1 *** Caught signal 
(Aborted) **

 in thread 7f3ae87d0700

 ceph version 0.54-607-gf89e101 (f89e1012bafabd6875a4a1e1832d76ffdf45b039)
 1: /usr/bin/ceph-osd() [0x799769]
 2: (()+0xeff0) [0x7f3af165fff0]
 3: (gsignal()+0x35) [0x7f3aefa29215]
 4: (abort()+0x180) [0x7f3aefa2c020]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f3af02bddc5]
 6: (()+0xcb166) [0x7f3af02bc166]
 7: (()+0xcb193) [0x7f3af02bc193]
 8: (()+0xcb28e) [0x7f3af02bc28e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x7c9) [0x7fd069]
 10: (FileStore::_collection_add(coll_t, coll_t, hobject_t const&, 
SequencerPosition const&)+0x77d) [0x72ff0d]
 11: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int)+0x25fb) [0x73481b]
 12: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned

Re: mon can't start

2012-11-15 Thread Gregory Farnum
Sorry we missed this — everybody's been very busy!
If you're still having trouble, can you install the ceph debug symbol
packages and get this again? The backtrace isn't very helpful without
that, unfortunately.
-Greg

On Wed, Oct 24, 2012 at 7:21 PM, jie sun <0maid...@gmail.com> wrote:
> Hi,
>
> My ceph file system consists of 1 mon, 1 mds and 2 osds, and 1 mon
> 1mds 1 osd on the same machine(called machine a).
>
> Yesterday, I have to change the IP of mon(because the environment our
> machine room changed), I wonder if there is any way to do this without
> stop cephfs and deploy it again( the data before will disappear).
>
> My solution is add 2 new mons so that there are 3 mons, then remove 2
> mons, and only stay the mon I want to use.
> But without enough machines, I add 1 internal IP for the machine a and
> then do as 
> "http://ceph.com/docs/master/cluster-ops/add-or-rm-mons/#adding-a-monitor-manual"
> says.
> I don't understand the 6th step "ceph mon add <name> <ip>[:<port>]",
> so I execute "ceph mon add mon.b 192.168.66.146:6790", and after this
> my mon goes down.
> If a start mon with "/etc/ceph/init.d/ceph -a start mon.a", the log is:
> "
> terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
>   what():  buffer::end_of_buffer
> *** Caught signal (Aborted) **
>  in thread 7f14069e1760
>  ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
>  1: /usr/local/bin/ceph-mon() [0x55f7a9]
>  2: (()+0xf2d0) [0x7f14065d92d0]
>  3: (gsignal()+0x35) [0x7f1404fe3ab5]
>  4: (abort()+0x186) [0x7f1404fe4fb6]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f1405849a9d]
>  6: (()+0xbccb6) [0x7f1405847cb6]
>  7: (()+0xbcce3) [0x7f1405847ce3]
>  8: (()+0xbcdee) [0x7f1405847dee]
>  9: /usr/local/bin/ceph-mon() [0x5fa8af]
>  10: (main()+0x1f19) [0x485069]
>  11: (__libc_start_main()+0xfd) [0x7f1404fcfbfd]
>  12: /usr/local/bin/ceph-mon() [0x482f99]
> 2012-10-25 18:02:03.725089 7f14069e1760 -1 *** Caught signal (Aborted) **
>  in thread 7f14069e1760
>
>  ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
>  1: /usr/local/bin/ceph-mon() [0x55f7a9]
>  2: (()+0xf2d0) [0x7f14065d92d0]
>  3: (gsignal()+0x35) [0x7f1404fe3ab5]
>  4: (abort()+0x186) [0x7f1404fe4fb6]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f1405849a9d]
>  6: (()+0xbccb6) [0x7f1405847cb6]
>  7: (()+0xbcce3) [0x7f1405847ce3]
>  8: (()+0xbcdee) [0x7f1405847dee]
>  9: /usr/local/bin/ceph-mon() [0x5fa8af]
>  10: (main()+0x1f19) [0x485069]
>  11: (__libc_start_main()+0xfd) [0x7f1404fcfbfd]
>  12: /usr/local/bin/ceph-mon() [0x482f99]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
>  0> 2012-10-25 18:02:03.725089 7f14069e1760 -1 *** Caught signal
> (Aborted) **
>  in thread 7f14069e1760
>
>  ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
>  1: /usr/local/bin/ceph-mon() [0x55f7a9]
>  2: (()+0xf2d0) [0x7f14065d92d0]
>  3: (gsignal()+0x35) [0x7f1404fe3ab5]
>  4: (abort()+0x186) [0x7f1404fe4fb6]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f1405849a9d]
>  6: (()+0xbccb6) [0x7f1405847cb6]
>  7: (()+0xbcce3) [0x7f1405847ce3]
>  8: (()+0xbcdee) [0x7f1405847dee]
>  9: /usr/local/bin/ceph-mon() [0x5fa8af]
>  10: (main()+0x1f19) [0x485069]
>  11: (__libc_start_main()+0xfd) [0x7f1404fcfbfd]
>  12: /usr/local/bin/ceph-mon() [0x482f99]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> bash: line 1: 31426 Aborted (core dumped)
> /usr/local/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c
> /tmp/ceph.conf.16468
> failed: 'ssh osd01  /usr/local/bin/ceph-mon -i a --pid-file
> /var/run/ceph/mon.a.pid -c /tmp/ceph.conf.16468 '
> "
>
> Now I don't know how to do with this situation.
>
> Thank you!


RFC: incompatible change to rbd tool behavior on copy

2012-11-15 Thread Dan Mick
A user has noticed some surprising behavior with the rbd command-line 
tool: with rbd copy, if the destination pool is not set (either with 
--dest-pool or by specifying destpool/image), then it is assumed to be 
the source pool name.


This seems to me only marginally convenient, and much more confusing; 
given how easy it is to specify the pool if it matters, and given the 
fact that the default pool is always 'rbd' for all other operations, I 
think it makes much more sense for the destination of a copy to default 
to 'rbd' unless otherwise specified.
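To make the two behaviors concrete, here is a toy sketch (not rbd's actual code; the function names are invented for illustration) of parsing a pool/image spec and resolving the destination pool both ways:

```python
def parse_spec(spec):
    """Split 'pool/image' into (pool, image); pool is None if omitted."""
    pool, _, image = spec.rpartition("/")
    return (pool or None, image)

def dest_pool(src_spec, dest_spec, current_behavior=True):
    """Resolve the pool a copy's destination image lands in."""
    src_pool, _ = parse_spec(src_spec)
    dst_pool, _ = parse_spec(dest_spec)
    if dst_pool is not None:
        return dst_pool                  # explicitly specified: always wins
    if current_behavior:
        return src_pool or "rbd"         # today: inherit the source pool
    return "rbd"                         # proposal: default to 'rbd', like
                                         # every other rbd operation

# rbd cp install/img testimg
assert dest_pool("install/img", "testimg", current_behavior=True) == "install"
assert dest_pool("install/img", "testimg", current_behavior=False) == "rbd"
```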


I propose to make this change.  Does anyone think it's a bad idea?


Re: Authorization issues in the 0.54

2012-11-15 Thread Andrey Korolyov
On Thu, Nov 15, 2012 at 5:03 PM, Andrey Korolyov  wrote:
> On Thu, Nov 15, 2012 at 5:12 AM, Yehuda Sadeh  wrote:
>> On Wed, Nov 14, 2012 at 4:20 AM, Andrey Korolyov  wrote:
>>> Hi,
>>> In the 0.54 cephx is probably broken somehow:
>>>
>>> $ ceph auth add client.qemukvm osd 'allow *' mon 'allow *' mds 'allow
>>> *' -i qemukvm.key
>>> 2012-11-14 15:51:23.153910 7ff06441f780 -1 read 65 bytes from qemukvm.key
>>> added key for client.qemukvm
>>>
>>> $ ceph auth list
>>> ...
>>> client.admin
>>> key: [xx]
>>> caps: [mds] allow *
>>
>> Note that for mds you just specify 'allow' and not 'allow *'. It
>> shouldn't affect the stuff that you're testing though.
>>
>
> Thanks for the hint!
>
>>> caps: [mon] allow *
>>> caps: [osd] allow *
>>> client.qemukvm
>>> key: [yy]
>>> caps: [mds] allow *
>>> caps: [mon] allow *
>>> caps: [osd] allow *
>>> ...
>>> $ virsh secret-set-value --secret uuid --base64 yy
>>> set username in the VM` xml...
>>> $ virsh start testvm
>>> kvm: -drive 
>>> file=rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789,if=none,id=drive-virtio-disk0,format=raw:
>>> could not open disk image
>>> rbd:rbd/vm0:id=qemukvm:key=yy:auth_supported=cephx\;none:mon_host=192.168.10.125\:6789\;192.168.10.127\:6789\;192.168.10.129\:6789:
>>> Operation not permitted
>>> $ virsh secret-set-value --secret uuid --base64 xx
>>> set username again to admin for the VM` disk
>>> $ virsh start testvm
>>> Finally, vm started successfully.
>>>
>>> All rbd commands issued from cli works okay with the appropriate
>>> credentials, qemu binary was linked with same librbd as running one.
>>> Does anyone have a suggestion?
>>
>> There wasn't any change that I'm aware of that should make that
>> happening. Can you reproduce it with 'debug ms = 1' and 'debug auth =
>> 20'?
>>
>
> I'll provide detailed logs some time later, after I do an upgrade of
> the production rack.
>
> The situation is quite strange - when I upgraded from an older
> version (tested for 0.51 and 0.53), auth stopped working exactly
> as above, and any action with the key (importing and elevating privileges,
> or importing with the maximum possible privileges) did nothing for an
> rbd-backed QEMU vm; only ``admin'' credentials were able to pass
> authentication. When I finally reformatted the cluster using mkcephfs for
> 0.54, authentication worked with ``rwx'' rights on the osd, where earlier
> ``rw'' was enough. It seems that this is some kind of bug in the monfs
> resulting in misworking authentication. Also, 0.53 to 0.54 was the
> first upgrade which made a version rollback impossible - the mons
> complain about an empty set of some ``missing features'' on start, so
> I recreated the monfs on every mon during the online downgrade (I know
> that downgrading is bad by nature, but since the on-disk format for the
> osds was unchanged, I tried to do it).


Sorry, it was three overlapping factors: my own inattention, the additional 
``x'' attribute in the required key capabilities, and a ``backup'' mon 
left over from the time of the upgrade. I had simply forgotten to kill it, 
and this mon alone caused authentication requests from qemu VMs to be 
dropped, while in the meantime allowing plain cluster operations using the 
``rbd'' command and the same credentials (very, very strange). By the way, 
it seems that a monitor not included in the cluster can easily flood any 
of the existing mons if it has the same name, even if it is completely 
outside the authentication keyring. Output from the flooded mon is very 
close to #2645 by footprint. I suggest that it would be reasonable to 
introduce temporary bans, or some other type of foolproof behavior for bad 
authentication requests on the monitors, in the future.

Thanks!


Re: changed rbd cp behavior in 0.53

2012-11-15 Thread Dan Mick
It's a bit different with rbd, as there's no "current dir", but I do 
tend to agree that "like every other place pool defaults, which means 
'rbd' literally" is more correct.  See my RFC from today:


"RFC: incompatible change to rbd tool behavior on copy"


On 11/15/2012 08:43 AM, Deb Barba wrote:

This is not common UNIX/posix behavior.

If you give just a file name, it should assume "." (the current
directory) as its location, not whatever path you started from.

I would expect most UNIX users would lose a lot of files if they
try to copy from path x/y/z and just provide a new name.  That would
indicate they wanted it stashed in ".", not cloned in path x/y/z.

I am concerned this would confuse most users out in the field.

Thanks,
Deborah Barba

On Wed, Nov 14, 2012 at 10:43 PM, Andrey Korolyov <and...@xdel.ru> wrote:

On Thu, Nov 15, 2012 at 4:56 AM, Dan Mick <dan.m...@inktank.com> wrote:
 >
 >
 > On 11/12/2012 02:47 PM, Josh Durgin wrote:
 >>
 >> On 11/12/2012 08:30 AM, Andrey Korolyov wrote:
 >>>
 >>> Hi,
 >>>
 >>> For this version, rbd cp assumes that destination pool is the
same as
 >>> source, not 'rbd', if pool in the destination path is omitted.
 >>>
 >>> rbd cp install/img testimg
 >>> rbd ls install
 >>> img testimg
 >>>
 >>>
 >>> Is this change permanent?
 >>>
 >>> Thanks!
 >>
 >>
 >> This is a regression. The previous behavior will be restored for
0.54.
 >> I added http://tracker.newdream.net/issues/3478 to track it.
 >
 >
 > Actually, on detailed examination, it looks like this has been
the behavior
 > for a long time; I think the wiser course would be not to change this
 > defaulting.  One could argue the value of such defaulting, but
it's also
 > true that you can specify the source and destination pools
explicitly.
 >
 > Andrey, any strong objection to leaving this the way it is?

I'm not complaining - this behavior seems more logical in the first
place, and of course I use the full path even when doing something by hand.





Re: changed rbd cp behavior in 0.53

2012-11-15 Thread Andrey Korolyov
On Thu, Nov 15, 2012 at 8:43 PM, Deb Barba  wrote:
> This is not common UNIX/posix behavior.
>
> if you just give the source a file name, it should assume "." (current
> directory) as it's location, not whatever path you started from.
>
> I would expect most UNIX users would be losing a lot of files if they try to
> copy from path x/y/z, and just provide a new name.  that would indicate they
> wanted it stashed in ".".  Not cloned in path x/y/z  .
>
> I am concerned this would confuse most users out in the field.
>
> Thanks,
> Deborah Barba

Speaking of standards, the rbd layout is closer to the /dev layout, or
at least to iSCSI targets, where not specifying the full path, or using
some predefined default prefix, makes no sense at all.

>
> On Wed, Nov 14, 2012 at 10:43 PM, Andrey Korolyov  wrote:
>>
>> On Thu, Nov 15, 2012 at 4:56 AM, Dan Mick  wrote:
>> >
>> >
>> > On 11/12/2012 02:47 PM, Josh Durgin wrote:
>> >>
>> >> On 11/12/2012 08:30 AM, Andrey Korolyov wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> For this version, rbd cp assumes that destination pool is the same as
>> >>> source, not 'rbd', if pool in the destination path is omitted.
>> >>>
>> >>> rbd cp install/img testimg
>> >>> rbd ls install
>> >>> img testimg
>> >>>
>> >>>
>> >>> Is this change permanent?
>> >>>
>> >>> Thanks!
>> >>
>> >>
>> >> This is a regression. The previous behavior will be restored for 0.54.
>> >> I added http://tracker.newdream.net/issues/3478 to track it.
>> >
>> >
>> > Actually, on detailed examination, it looks like this has been the
>> > behavior
>> > for a long time; I think the wiser course would be not to change this
>> > defaulting.  One could argue the value of such defaulting, but it's also
>> > true that you can specify the source and destination pools explicitly.
>> >
>> > Andrey, any strong objection to leaving this the way it is?
>>
>> I'm not complaining - this behavior seems more logical in the first
>> place, and of course I use the full path even when doing something by hand.


Re: rbd map command hangs for 15 minutes during system start up

2012-11-15 Thread Nick Bartos
Sorry I guess this e-mail got missed.  I believe those patches came
from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
branch patches, which seem to all be fine.  We'll stick with 3.5 and
this backport for now until we can figure out what's wrong with 3.6.

I typically ignore the wip branches just due to the naming when I'm
looking for updates.  Where should I typically look for updates that
aren't in released kernels?  Also, is there anything else in the wip*
branches that you think we may find particularly useful?


On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil  wrote:
> On Mon, 12 Nov 2012, Nick Bartos wrote:
>> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>> seems we no longer have this hang.
>
> Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> stable series?  I recently prepared a new one that backports *all* of the
> fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> be curious if you see problems with that.
>
> So far, with these fixes in place, we have not seen any unexplained kernel
> crashes in this code.
>
> I take it you're going back to a 3.5 kernel because you weren't able to
> get rid of the sync problem with 3.6?
>
> sage
>
>
>
>>
>> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin  wrote:
>> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>> >>
>> >> We are seeing a somewhat random, but frequent hang on our systems
>> >> during startup.  The hang happens at the point where an "rbd map
>> >> <image>" command is run.
>> >>
>> >> I've attached the ceph logs from the cluster.  The map command happens
>> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>> >> be seen in the log as 172.18.0.15:0/1143980479.
>> >>
>> >> It appears as if the TCP socket is opened to the OSD, but then times
>> >> out 15 minutes later, the process gets data when the socket is closed
>> >> on the client server and it retries.
>> >>
>> >> Please help.
>> >>
>> >> We are using ceph version 0.48.2argonaut
>> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>> >>
>> >> We are using a 3.5.7 kernel with the following list of patches applied:
>> >>
>> >> 1-libceph-encapsulate-out-message-data-setup.patch
>> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>> >> 3-libceph-move-init-of-bio_iter.patch
>> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>> >> 11-rbd-set-image-size-when-header-is-updated.patch
>> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>> >> 17-libceph-check-for-invalid-mapping.patch
>> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>> >> 19-rbd-BUG-on-invalid-layout.patch
>> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>> >>
>> >> Any suggestions?
>> >
>> >
>> > The log shows your monitors don't have time synchronized enough among
>> > them to make much progress (including authenticating new connections).
>> > That's probably the real issue. 0.2s is pretty large clock drift.
>> >
>> >
>> >> One thought is that the following patch (which we could not apply) is
>> >> what is required:
>> >>
>> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>> >
>> >
>> > This is certainly useful too, but I don't think it's the cause of
>> > the delay in this case.
>> >
>> > Josh


Re: rbd map command hangs for 15 minutes during system start up

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Nick Bartos wrote:
> Sorry I guess this e-mail got missed.  I believe those patches came
> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
> branch patches, which seem to all be fine.  We'll stick with 3.5 and
> this backport for now until we can figure out what's wrong with 3.6.
> 
> I typically ignore the wip branches just due to the naming when I'm
> looking for updates.  Where should I typically look for updates that
> aren't in released kernels?  Also, is there anything else in the wip*
> branches that you think we may find particularly useful?

You were looking in the right place.  The problem was we weren't super 
organized with our stable patches, and changed our minds about what to 
send upstream.  These are 'wip' in the sense that they were in preparation 
for going upstream.  The goal is to push them to the mainline stable 
kernels and ideally not keep them in our tree at all.

wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but 
we're keeping it so that ubuntu can pick it up for quantal.

I'll make sure these are more clearly marked as stable.

sage


> 
> 
> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil  wrote:
> > On Mon, 12 Nov 2012, Nick Bartos wrote:
> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
> >> seems we no longer have this hang.
> >
> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> > stable series?  I recently prepared a new one that backports *all* of the
> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> > be curious if you see problems with that.
> >
> > So far, with these fixes in place, we have not seen any unexplained kernel
> > crashes in this code.
> >
> > I take it you're going back to a 3.5 kernel because you weren't able to
> > get rid of the sync problem with 3.6?
> >
> > sage
> >
> >
> >
> >>
> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin  
> >> wrote:
> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> >> >>
> >> >> We are seeing a somewhat random, but frequent hang on our systems
> >> >> during startup.  The hang happens at the point where an "rbd map
> >> >> <image>" command is run.
> >> >>
> >> >> I've attached the ceph logs from the cluster.  The map command happens
> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
> >> >> be seen in the log as 172.18.0.15:0/1143980479.
> >> >>
> >> >> It appears as if the TCP socket is opened to the OSD, but then times
> >> >> out 15 minutes later, the process gets data when the socket is closed
> >> >> on the client server and it retries.
> >> >>
> >> >> Please help.
> >> >>
> >> >> We are using ceph version 0.48.2argonaut
> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
> >> >>
> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
> >> >>
> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> >> >> 3-libceph-move-init-of-bio_iter.patch
> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> >> >> 17-libceph-check-for-invalid-mapping.patch
> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> >> >> 19-rbd-BUG-on-invalid-layout.patch
> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> >> >>
> >> >> Any suggestions?
> >> >
> >> >
> >> > The log shows your monitors don't have time synchronized enough among
> >> > them to make much progress (including authenticating new connections).
> >> > That's probably the real issue. 0.2s is pretty large clock drift.
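As a rough check of the drift Josh mentions, NTP peer offsets on each monitor host should stay well under the monitor's allowed drift (the 0.05 s default for `mon clock drift allowed` is an assumption worth verifying for this release):

```shell
# Run on each monitor host; the 'offset' column is in milliseconds
ntpq -p
```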
> >> >
> >> >
> >> >> One thought is that the following patch (which we could not apply) is
> >> >> what is required:
> >> >>
> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
> >> >
> >> >
> >> > This is certainly useful too, but I don't think it's the cause of
> >> > the delay in this case.
> >> >
> >> > Josh
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> > the body of a message to majord...@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: osd not in tree

2012-11-15 Thread Josh Durgin

On 11/15/2012 11:21 PM, Drunkard Zhang wrote:

I installed 1 mon, 1 mds and 11 osds on one host, then added some osds
from other hosts, but they are not in the osd tree and are not usable.
How can I fix this?

The crush command I used:
ceph osd crush set 11 osd.11 3 pool=data datacenter=dh-1L, room=room1,
row=02, rack=05, host=squid87-log13


Remove the commas in that command and it'll work. I fixed the docs for
this.
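For reference, the same command with the commas removed (bucket names taken from Drunkard's original command) would be:

```shell
ceph osd crush set 11 osd.11 3 pool=data datacenter=dh-1L room=room1 \
    row=02 rack=05 host=squid87-log13
```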

Josh



Re: poor performance

2012-11-15 Thread Aleksey Samarin
Thanks for your reply!

It was easier to switch from RHEL to Ubuntu. Now everything is fast and
stable! :) If interested, I can attach logs.

All the best, Alex!

2012/11/16 Gregory Farnum :
> On Sun, Nov 4, 2012 at 7:13 AM, Aleksey Samarin  wrote:
>> What may be possible solutions?
>> Update centos to 6.3?
>
> From what I've heard the RHEL libc doesn't support the syncfs syscall
> (even though the kernel does have it). :( So you'd need to make sure
> the kernel supports it and then build a custom glibc, and then make
> sure your Ceph software is built to use it.
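A quick way to see which glibc a host is running (syncfs() was added in glibc 2.14, while RHEL/CentOS 6 ships 2.12, which is why the wrapper is missing there even when the kernel has the syscall):

```shell
# Print the glibc version; anything older than 2.14 lacks the syncfs() wrapper
getconf GNU_LIBC_VERSION
```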
>
>
>> About issue with writes to lots of disk, i think parallel dd command
>> will be good as test! :)
>
> Yes — it really looks like maybe some of your disks are much slower
> than the others. Try benchmarking each individually one-at-a-time, and
> then in groups. I suspect you'll see a problem below the Ceph layers.
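A minimal sketch of the individual-then-grouped dd test being discussed (the mount points and write sizes are assumptions; adjust them to your layout):

```shell
# One at a time: establish each disk's baseline throughput
for d in /mnt/osd0 /mnt/osd1 /mnt/osd2; do
    dd if=/dev/zero of=$d/ddtest bs=4M count=256 oflag=direct
done

# All at once: if aggregate throughput collapses here, the bottleneck
# is below Ceph (controller, expander, or one slow disk)
for d in /mnt/osd0 /mnt/osd1 /mnt/osd2; do
    dd if=/dev/zero of=$d/ddtest bs=4M count=256 oflag=direct &
done
wait
```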
>
>>
>> 2012/11/4 Mark Nelson :
>>> On 11/04/2012 07:18 AM, Aleksey Samarin wrote:

 Well, i create ceph cluster with 2 osd ( 1 osd per node),  2 mon, 2 mds.
 here is what I did:
   ceph osd pool create bench
   ceph osd tell \* bench
   rados -p bench bench 30 write --no-cleanup
 output:

   Maintaining 16 concurrent writes of 4194304 bytes for at least 30
 seconds.
   Object prefix: benchmark_data_host01_11635
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        16         0         0         0         -         0
    2      16        37        21   41.9911        42  0.139005   1.08941
    3      16        53        37   49.3243        64  0.754114   1.09392
    4      16        75        59   58.9893        88  0.284647  0.914221
    5      16        89        73   58.3896        56  0.072228  0.881008
    6      16        95        79   52.6575        24   1.56959  0.961477
    7      16       111        95   54.2764        64  0.046105   1.08791
    8      16       128       112   55.9906        68  0.035714   1.04594
    9      16       150       134   59.5457        88  0.046298   1.04415
   10      16       166       150   59.9901        64  0.048635  0.986384
   11      16       176       160   58.1723        40  0.727784  0.988408
   12      16       206       190   63.3231       120   0.28869  0.946624
   13      16       225       209   64.2976        76   1.34472  0.919464
   14      16       263       247   70.5605       152  0.070926   0.90046
   15      16       295       279   74.3887       128  0.041517  0.830466
   16      16       315       299   74.7388        80  0.296037  0.841527
   17      16       333       317   74.5772        72  0.286097  0.849558
   18      16       340       324   71.9891        28  0.295084   0.83922
   19      16       343       327   68.8317        12   1.46948  0.845797
 2012-11-04 17:14:52.090941 min lat: 0.035714 max lat: 2.64841 avg lat: 0.861539
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20      16       378       362    72.389       140  0.566232  0.861539
   21      16       400       384   73.1313        88  0.038835  0.857785
   22      16       404       388   70.5344        16  0.801216  0.857002
   23      16       413       397   69.0327        36  0.062256   0.86376
   24      16       428       412   68.6543        60  0.042583   0.89389
   25      16       450       434   69.4277        88  0.383877  0.905833
   26      16       472       456   70.1415        88  0.269878  0.898023
   27      16       472       456   67.5437         0         -  0.898023
   28      16       512       496   70.8448        80  0.056798  0.891163
   29      16       530       514   70.8843        72   1.20653  0.898112
   30      16       542       526   70.1212        48  0.744383  0.890733
   Total time run: 30.174151
 Total writes made:  543
 Write size: 4194304
 Bandwidth (MB/sec): 71.982

 Stddev Bandwidth:   38.318
 Max bandwidth (MB/sec): 152
 Min bandwidth (MB/sec): 0
 Average Latency:0.889026
 Stddev Latency: 0.677425
 Max latency:2.94467
 Min latency:0.035714

>>>
>>> Much better for 1 disk per node!  I suspect that lack of syncfs is hurting
>>> you, or perhaps some other issue with writes to lots of disks at the same
>>> time.
>>>
>>>

 2012/11/4 Aleksey Samarin :
>
> Ok!
> Well, I'll run these tests and write about the results.
>
> btw, the disks are all the same model; could some still be faster than others?
>>