Re: [ceph-users] ceph-osd mkfs mkkey hangs on ARM

2014-11-12 Thread Sage Weil
On Wed, 12 Nov 2014, Harm Weites wrote:
> Hi,
> 
> When trying to add a new OSD to my cluster the ceph-osd process hangs:
> 
> # ceph-osd -i $id --mkfs --mkkey
> 
> 
> At this point I have to explicitly kill -9 the ceph-osd since it doesn't
> respond to anything. It also didn't adhere to my foreground debug log
> request; the logs are empty. Stracing the ceph-osd [1] shows it's very
> busy with this:
> 
>  nanosleep({0, 201}, NULL)   = 0
>  gettimeofday({1415741192, 862216}, NULL) = 0
>  nanosleep({0, 201}, NULL)   = 0
>  gettimeofday({1415741192, 864563}, NULL) = 0

Can you gdb attach to the ceph-osd process while it is in this state and 
see what 'bt' says?
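
For example, something like this should capture backtraces from all threads 
of the stuck process (assuming the hung ceph-osd is the only one running on 
that node):

 # gdb -p $(pidof ceph-osd)
 (gdb) thread apply all bt
 (gdb) detach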

sage


> 
> I've rebuilt python to undo a threading regression [2], though that's
> unrelated to this issue. It did fix ceph not returning properly after
> commands like 'ceph osd tree', so it is useful.
> 
> This machine is Fedora 21 on ARM with ceph-0.80.7-1.fc21.armv7hl. The
> mon/mds/osd are all x86, CentOS 7. Could this be a configuration issue
> on my end or is something just broken on my platform?
> 
> # lscpu
> Architecture:          armv7l
> Byte Order:            Little Endian
> CPU(s):                2
> On-line CPU(s) list:   0,1
> Thread(s) per core:    1
> Core(s) per socket:    2
> Socket(s):             1
> Model name:            ARMv7 Processor rev 4 (v7l)
> 
> [1] http://paste.openstack.org/show/132555/
> [2] http://bugs.python.org/issue21963
> 
> Regards,
> Harm
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problem with radosgw-admin subuser rm

2014-11-12 Thread Seth Mason
Hi --

I'm trying to remove a subuser but it's not removing the S3 keys when I
pass in --purge-keys.

First I create a sub-user:
$ radosgw-admin subuser create --uid=smason --subuser='smason:test' \
--access=full --key-type=s3 --gen-secret

"subusers": [
{ "id": "smason:test",
  "permissions": "full-control"}],
  "keys": [
{ "user": "smason",
  "access_key": "B8D062SWPB560CBA3HHX",
  "secret_key": ""},
{ "user": "smason:test",
  "access_key": "ERKTY5JJ1H2IXE9T5TY3",
  "secret_key": ""}],


Then I try to remove the user and the keys:
$ radosgw-admin subuser rm --subuser='smason:test' --purge-keys
 "subusers": [],
  "keys": [
{ "user": "smason",
  "access_key": "B8D062SWPB560CBA3HHX",
  "secret_key": ""},
{ "user": "smason:test",
  "access_key": "ERKTY5JJ1H2IXE9T5TY3",
  "secret_key": ""}],

I'm running ceph version 0.80.5
(38b73c67d375a2552d8ed67843c8a65c2c0feba6). FWIW, I've observed the same
behavior when I use the admin ops REST API.
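
In the meantime, a possible workaround I'm considering (untested) is to delete
the leftover key explicitly, something like:

$ radosgw-admin key rm --uid=smason --subuser='smason:test' \
--key-type=s3 --access-key=ERKTY5JJ1H2IXE9T5TY3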

Let me know if I can provide any more information.

Thanks in advance,

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-osd mkfs mkkey hangs on ARM

2014-11-12 Thread Harm Weites
Hi,

When trying to add a new OSD to my cluster the ceph-osd process hangs:

# ceph-osd -i $id --mkfs --mkkey


At this point I have to explicitly kill -9 the ceph-osd since it doesn't
respond to anything. It also didn't adhere to my foreground debug log
request; the logs are empty. Stracing the ceph-osd [1] shows it's very
busy with this:

 nanosleep({0, 201}, NULL)   = 0
 gettimeofday({1415741192, 862216}, NULL) = 0
 nanosleep({0, 201}, NULL)   = 0
 gettimeofday({1415741192, 864563}, NULL) = 0

I've rebuilt python to undo a threading regression [2], though that's
unrelated to this issue. It did fix ceph not returning properly after
commands like 'ceph osd tree', so it is useful.

This machine is Fedora 21 on ARM with ceph-0.80.7-1.fc21.armv7hl. The
mon/mds/osd are all x86, CentOS 7. Could this be a configuration issue
on my end or is something just broken on my platform?

# lscpu
Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
Model name:            ARMv7 Processor rev 4 (v7l)

[1] http://paste.openstack.org/show/132555/
[2] http://bugs.python.org/issue21963

Regards,
Harm
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] incorrect pool size, wrong ruleset?

2014-11-12 Thread houmles
Hi,

I have 2 hosts with 8 2TB drives in each.
I want to have 2 replicas between both hosts and then 2 replicas between the osds 
on each host. That way, even if I lose one host I still have 2 replicas.

Currently I have this ruleset:

rule repl {
ruleset 5
type replicated
min_size 1
max_size 10
step take asterix
step choose firstn -2 type osd
step emit
step take obelix
step choose firstn 2 type osd
step emit
}

Which works ok. I have 4 replicas as I want and the PGs are distributed perfectly, 
but when I run ceph df I see only half of the capacity I should have.
In total it's 32TB, 16TB in each host. If there are 2 replicas on each host it 
should report around 8TB, right? It's reporting only 4TB in the pool, which is 1/8 
of total capacity.
Can anyone tell me what is wrong?
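
For reference, the rule mapping itself can be checked offline against the
compiled crush map (the file name here is just an example):

# ceph osd getcrushmap -o crushmap.bin
# crushtool --test -i crushmap.bin --rule 5 --num-rep 4 --show-mappings | head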

Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Federated gateways

2014-11-12 Thread Craig Lewis
http://tracker.ceph.com/issues/9206

My post to the ML: http://www.spinics.net/lists/ceph-users/msg12665.html


IIRC, the system users didn't see the other user's buckets in a bucket
listing, but they could read and write the objects fine.



On Wed, Nov 12, 2014 at 11:16 AM, Aaron Bassett 
wrote:

> In playing around with this a bit more, I noticed that the two users on
> the secondary node can't see each other's buckets. Is this a problem?
>

IIRC, the system users couldn't see each other's buckets, but they could
read and write the objects.

> On Nov 11, 2014, at 6:56 PM, Craig Lewis 
> wrote:
>
> I see you're running 0.80.5.  Are you using Apache 2.4?  There is a known
>> issue with Apache 2.4 on the primary and replication.  It's fixed, just
>> waiting for the next firefly release.  Although, that causes 40x errors
>> with Apache 2.4, not 500 errors.
>>
>> It is apache 2.4, but I’m actually running 0.80.7 so I probably have that
>> bug fix?
>>
>>
> No, the unreleased 0.80.8 has the fix.
>
>
>
>>
>> Have you verified that both system users can read and write to both
>> clusters?  (Just make sure you clean up the writes to the slave cluster).
>>
>> Yes I can write everywhere and radosgw-agent isn’t getting any 403s like
>> it was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index
>> pool is syncing properly, as are the users. It seems like really the only
>> thing that isn’t syncing is the .zone.rgw.buckets pool.
>>
>
> That's pretty much the same behavior I was seeing with Apache 2.4.
>
> Try downgrading the primary cluster to Apache 2.2.  In my testing, the
> secondary cluster could run 2.2 or 2.4.
>
> Do you have a link to that bug#? I want to see if it gives me any clues.
>
> Aaron
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-12 Thread Udo Lembke
Hi Wido,
On 12.11.2014 12:55, Wido den Hollander wrote:
> (back to list)
>
>
> Indeed, there must be something! But I can't figure it out yet. Same
> controllers, tried the same OS, direct cables, but the latency is 40%
> higher.
>
>
perhaps something with PCIe ordering / interrupts?
Have you checked the BIOS settings or tried another PCIe slot?
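
For a quick check of the negotiated PCIe link and the interrupt distribution
(the PCI address and interface name are only examples):

# lspci -vv -s 03:00.0 | grep -i lnksta
# cat /proc/interrupts | grep -i eth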

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-12 Thread Scott Laird
Here are the first 33k lines or so:
https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt

This is a different (but more or less identical) machine from the past set
of logs.  This system doesn't have quite as many drives in it, so I
couldn't spot a same-host error burst, but it's logging tons of the same
errors while trying to talk to 10.2.0.34.

On Wed Nov 12 2014 at 10:47:30 AM Gregory Farnum  wrote:

> On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird  wrote:
> > I'm having a problem with my cluster.  It's running 0.87 right now, but I
> > saw the same behavior with 0.80.5 and 0.80.7.
> >
> > The problem is that my logs are filling up with "replacing existing
> (lossy)
> > channel" log lines (see below), to the point where I'm filling drives to
> > 100% almost daily just with logs.
> >
> > It doesn't appear to be network related, because it happens even when
> > talking to other OSDs on the same host.
>
> Well, that means it's probably not physical network related, but there
> can still be plenty wrong with the networking stack... ;)
>
> > The logs pretty much all point to
> > port 0 on the remote end.  Is this an indicator that it's failing to
> resolve
> > port numbers somehow, or is this normal at this point in connection
> setup?
>
> That's definitely unusual, but I'd need to see a little more to be
> sure if it's bad. My guess is that these pipes are connections from
> the other OSD's Objecter, which is treated as a regular client and
> doesn't bind to a socket for incoming connections.
>
> The repetitive channel replacements are concerning, though — they can
> be harmless in some circumstances but this looks more like the
> connection is simply failing to establish and so it's retrying over
> and over again. Can you restart the OSDs with "debug ms = 10" in their
> config file and post the logs somewhere? (There is not really any
> documentation available on what they mean, but the deeper detail ones
> might also be more understandable to you.)
> -Greg
>
> >
> > The systems that are causing this problem are somewhat unusual; they're
> > running OSDs in Docker containers, but they *should* be configured to
> run as
> > root and have full access to the host's network stack.  They manage to
> work,
> > mostly, but things are still really flaky.
> >
> > Also, is there documentation on what the various fields mean, short of
> > digging through the source?  And how does Ceph resolve OSD numbers into
> > host/port addresses?
> >
> >
> > 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1e070580).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
> > 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
> > c=0x1f3db2e0).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1e070420).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
> > 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
> > c=0x1f3d8420).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1e070840).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1b2d6260).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
> > 10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
> > c=0x1f3d9600).accept replacing existing (lossy) channel (new one lossy=1)
> >
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Federated gateways

2014-11-12 Thread Aaron Bassett
In playing around with this a bit more, I noticed that the two users on the 
secondary node can't see each other's buckets. Is this a problem?
> On Nov 11, 2014, at 6:56 PM, Craig Lewis  wrote:
> 
>> I see you're running 0.80.5.  Are you using Apache 2.4?  There is a known 
>> issue with Apache 2.4 on the primary and replication.  It's fixed, just 
>> waiting for the next firefly release.  Although, that causes 40x errors with 
>> Apache 2.4, not 500 errors.
> It is apache 2.4, but I’m actually running 0.80.7 so I probably have that bug 
> fix?
> 
> 
> No, the unreleased 0.80.8 has the fix.
>  
>  
>> 
>> Have you verified that both system users can read and write to both 
>> clusters?  (Just make sure you clean up the writes to the slave cluster).
> Yes I can write everywhere and radosgw-agent isn’t getting any 403s like it 
> was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index pool is 
> syncing properly, as are the users. It seems like really the only thing that 
> isn’t syncing is the .zone.rgw.buckets pool.
> 
> That's pretty much the same behavior I was seeing with Apache 2.4.
> 
> Try downgrading the primary cluster to Apache 2.2.  In my testing, the 
> secondary cluster could run 2.2 or 2.4.
Do you have a link to that bug#? I want to see if it gives me any clues. 

Aaron 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-12 Thread Gregory Farnum
On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird  wrote:
> I'm having a problem with my cluster.  It's running 0.87 right now, but I
> saw the same behavior with 0.80.5 and 0.80.7.
>
> The problem is that my logs are filling up with "replacing existing (lossy)
> channel" log lines (see below), to the point where I'm filling drives to
> 100% almost daily just with logs.
>
> It doesn't appear to be network related, because it happens even when
> talking to other OSDs on the same host.

Well, that means it's probably not physical network related, but there
can still be plenty wrong with the networking stack... ;)

> The logs pretty much all point to
> port 0 on the remote end.  Is this an indicator that it's failing to resolve
> port numbers somehow, or is this normal at this point in connection setup?

That's definitely unusual, but I'd need to see a little more to be
sure if it's bad. My guess is that these pipes are connections from
the other OSD's Objecter, which is treated as a regular client and
doesn't bind to a socket for incoming connections.

The repetitive channel replacements are concerning, though — they can
be harmless in some circumstances but this looks more like the
connection is simply failing to establish and so it's retrying over
and over again. Can you restart the OSDs with "debug ms = 10" in their
config file and post the logs somewhere? (There is not really any
documentation available on what they mean, but the deeper detail ones
might also be more understandable to you.)
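
Something like this in ceph.conf before the restart:

[osd]
    debug ms = 10

(A runtime "ceph tell osd.* injectargs '--debug-ms 10'" should also work, but
restarting captures the connection setup from the beginning.)
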
-Greg

>
> The systems that are causing this problem are somewhat unusual; they're
> running OSDs in Docker containers, but they *should* be configured to run as
> root and have full access to the host's network stack.  They manage to work,
> mostly, but things are still really flaky.
>
> Also, is there documentation on what the various fields mean, short of
> digging through the source?  And how does Ceph resolve OSD numbers into
> host/port addresses?
>
>
> 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
> 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
> c=0x1e070580).accept replacing existing (lossy) channel (new one lossy=1)
>
> 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
> 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
> c=0x1f3db2e0).accept replacing existing (lossy) channel (new one lossy=1)
>
> 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
> 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
> c=0x1e070420).accept replacing existing (lossy) channel (new one lossy=1)
>
> 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
> 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
> c=0x1f3d8420).accept replacing existing (lossy) channel (new one lossy=1)
>
> 2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
> 10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
> c=0x1e070840).accept replacing existing (lossy) channel (new one lossy=1)
>
> 2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
> 10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
> c=0x1b2d6260).accept replacing existing (lossy) channel (new one lossy=1)
>
> 2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
> 10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
> c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)
>
> 2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
> 10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
> c=0x1f3d9600).accept replacing existing (lossy) channel (new one lossy=1)
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub, cache pools, replica 1

2014-11-12 Thread Gregory Farnum
On Tue, Nov 11, 2014 at 2:32 PM, Christian Balzer  wrote:
> On Tue, 11 Nov 2014 10:21:49 -0800 Gregory Farnum wrote:
>
>> On Mon, Nov 10, 2014 at 10:58 PM, Christian Balzer  wrote:
>> >
>> > Hello,
>> >
>> > One of my clusters has become busy enough (I'm looking at you, evil
>> > Window VMs that I shall banish elsewhere soon) to experience client
>> > noticeable performance impacts during deep scrub.
>> > Before this I instructed all OSDs to deep scrub in parallel at Saturday
>> > night and that finished before Sunday morning.
>> > So for now I'll fire them off one by one to reduce the load.
>> >
>> > Looking forward, that cluster doesn't need more space so instead of
>> > adding more hosts and OSDs I was thinking of a cache pool instead.
>> >
>> > I suppose that will keep the clients happy while the slow pool gets
>> > scrubbed.
>> > Is there anybody who tested cache pools with Firefly and compared the
>> > performance to Giant?
>> >
>> > For testing I'm currently playing with a single storage node and 8 SSD
>> > backed OSDs.
>> > Now what very much blew my mind is that a pool with a replication of 1
>> > still does quite the impressive read orgy, clearly reading all the
>> > data in the PGs.
>> > Why? And what is it comparing that data with, the cosmic background
>> > radiation?
>>
>> Yeah, cache pools currently do full-object promotions whenever an
>> object is accessed. There are some ideas and projects to improve this
>> or reduce its effects, but they're mostly just getting started.
> Thanks for confirming that, so probably not much better than Firefly
> _aside_ from the fact that SSD pools should be quite a bit faster in and
> by themselves in Giant.
> Guess there is no other way to find out than to test things, I have a
> feeling that determining the "hot" working set otherwise will be rather
> difficult.
>
>> At least, I assume that's what you mean by a read orgy; perhaps you
>> are seeing something else entirely?
>>
> Indeed I did, this was just an observation that any pool with a replica of
> 1 will still read ALL the data during a deep-scrub. What good would that
> do?

Oh, I see what you're saying; you mean it was reading all the data
during a scrub, not just that it was promoting things.

Anyway, reading all the data during a deep scrub verifies that we
*can* read all the data. That's one of the fundamental tasks of
scrubbing data in a storage system. It's often accompanied by other
checks or recovery behaviors to easily repair issues that are
discovered, but simply maintaining confidence that the data actually
exists is the principal goal. :)
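
As for the staggered scheduling mentioned above, a simple per-OSD loop is
usually enough; the interval here is just a placeholder:

for osd in $(ceph osd ls); do ceph osd deep-scrub $osd; sleep 600; done
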
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados -p cache-flush-evict-all surprisingly slow

2014-11-12 Thread Gregory Farnum
My recollection is that the RADOS tool is issuing a special eviction
command on every object in the cache tier using primitives we don't use
elsewhere. Their existence is currently vestigial from our initial tiering
work (rather than the present caching), but I have some hope we'll extend
them again in the future.

The usual flushing and eviction routines, meanwhile, run as an agent inside
of the OSD and are extremely parallel. I think there's documentation about
how to flush entire cache pools in preparation for removing them; I'd check
those out. :)
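
Roughly, that documented sequence looks like this (the pool names are just
placeholders):

# ceph osd tier cache-mode ssd-cache forward
# rados -p ssd-cache cache-flush-evict-all
# ceph osd tier remove-overlay rbd
# ceph osd tier remove rbd ssd-cache
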
-Greg
On Wed, Nov 12, 2014 at 7:46 AM Martin Millnert  wrote:

> Dear Cephers,
>
> I have a lab setup with 6x dual-socket hosts, 48GB RAM, 2x10Gbps hosts,
> each equipped with 2x S3700 100GB SSDs and 4x 500GB HDD, where the HDDs
> are mapped in a tree under a 'platter' root tree similar to guidance from
> Seb at http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-
> and-ssd-within-the-same-box/ ,
> and SSDs similarily under an 'ssd' root.  Replication is set to 3.
> Journals on tmpfs (simulating NVRAM).
>
> I have put an ssd pool as a cache tier in front of an hdd pool ("rbd"),
> and run
> fio-rbd against "rbd".  In the benchmarks, at bs=32kb, QD=128 from a
> single separate client machine, I reached at peak throughput of around
> 1.2 GB/s.  So there is some capability.  IOPS-wise I see a max of around
> 15k iops currently.
>
> After having filled the SSD cache tier, I ran rados -p rbd
> cache-flush-evict-all - and I was expecting to see the 6 SSD OSDs start
> to evict all the cache-tier pg's to the underlying pool, rbd, which maps
> to the HDDs.  I would have expected parallellism and high throughput,
> but what I now observe is ~80 MB/s on average flush speed.
>
> Which leads me to the question:  Is "rados -p 
> cache-flush-evict-all" supposed to work in a parallell manner?
>
> Cursory viewing in tcpdump suggests to me that eviction operation is
> serial, in which case the performance could make a little bit sense,
> since it is basically limited by the write speed of a single hdd.
>
> What should I see?
>
> If it is indeed a serial operation, is this different from the regular
> cache tier eviction routines that are triggered by full_ratios, max
> objects or max storage volume?
>
> Regards,
> Martin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Triggering shallow scrub on OSD where scrub is already in progress

2014-11-12 Thread Gregory Farnum
Yes, this is expected behavior. You're telling the OSD to scrub every PG it
holds, and it is doing so. The list of PGs to scrub is getting reset each
time, but none of the individual scrubs are getting restarted. (I believe
that if you instruct a PG to scrub while it's already doing so, nothing
happens.)
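
If you only need to re-scrub something specific, you can also target a single
PG instead of the whole OSD, e.g.:

# ceph pg scrub 1.44

rather than re-issuing the scrub for all of osd.10.
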
-Greg
On Tue, Nov 11, 2014 at 9:54 PM Mallikarjun Biradar <
mallikarjuna.bira...@gmail.com> wrote:

> Hi Greg,
>
> I am using 0.86
>
> referring to the osd logs to check scrub behaviour. Please have a look at the log
> snippet from the osd log
>
> ##Triggered scrub on osd.10--->
> 2014-11-12 16:24:21.393135 7f5026f31700  0 log_channel(default) log [INF]
> : 0.4 scrub ok
> 2014-11-12 16:24:24.393586 7f5026f31700  0 log_channel(default) log [INF]
> : 0.20 scrub ok
> 2014-11-12 16:24:30.393989 7f5026f31700  0 log_channel(default) log [INF]
> : 0.21 scrub ok
> 2014-11-12 16:24:33.394764 7f5026f31700  0 log_channel(default) log [INF]
> : 0.23 scrub ok
> 2014-11-12 16:24:34.395293 7f5026f31700  0 log_channel(default) log [INF]
> : 0.36 scrub ok
> 2014-11-12 16:24:35.941704 7f5026f31700  0 log_channel(default) log [INF]
> : 1.1 scrub ok
> 2014-11-12 16:24:39.533780 7f5026f31700  0 log_channel(default) log [INF]
> : 1.d scrub ok
> 2014-11-12 16:24:41.811185 7f5026f31700  0 log_channel(default) log [INF]
> : 1.44 scrub ok
> 2014-11-12 16:24:54.257384 7f5026f31700  0 log_channel(default) log [INF]
> : 1.5b scrub ok
> 2014-11-12 16:25:02.973101 7f5026f31700  0 log_channel(default) log [INF]
> : 1.67 scrub ok
> 2014-11-12 16:25:17.597546 7f5026f31700  0 log_channel(default) log [INF]
> : 1.6b scrub ok
> ##Previous scrub is still in progress, triggered scrub on osd.10 again-->
> CEPH re-started scrub operation
> 20104-11-12 16:25:19.394029 7f5026f31700  0 log_channel(default) log [INF]
> : 0.4 scrub ok
> 2014-11-12 16:25:22.402630 7f5026f31700  0 log_channel(default) log [INF]
> : 0.20 scrub ok
> 2014-11-12 16:25:24.695565 7f5026f31700  0 log_channel(default) log [INF]
> : 0.21 scrub ok
> 2014-11-12 16:25:25.408821 7f5026f31700  0 log_channel(default) log [INF]
> : 0.23 scrub ok
> 2014-11-12 16:25:29.467527 7f5026f31700  0 log_channel(default) log [INF]
> : 0.36 scrub ok
> 2014-11-12 16:25:32.558838 7f5026f31700  0 log_channel(default) log [INF]
> : 1.1 scrub ok
> 2014-11-12 16:25:35.763056 7f5026f31700  0 log_channel(default) log [INF]
> : 1.d scrub ok
> 2014-11-12 16:25:38.166853 7f5026f31700  0 log_channel(default) log [INF]
> : 1.44 scrub ok
> 2014-11-12 16:25:40.602758 7f5026f31700  0 log_channel(default) log [INF]
> : 1.5b scrub ok
> 2014-11-12 16:25:42.169788 7f5026f31700  0 log_channel(default) log [INF]
> : 1.67 scrub ok
> 2014-11-12 16:25:45.851419 7f5026f31700  0 log_channel(default) log [INF]
> : 1.6b scrub ok
> 2014-11-12 16:25:51.259453 7f5026f31700  0 log_channel(default) log [INF]
> : 1.a8 scrub ok
> 2014-11-12 16:25:53.012220 7f5026f31700  0 log_channel(default) log [INF]
> : 1.a9 scrub ok
> 2014-11-12 16:25:54.009265 7f5026f31700  0 log_channel(default) log [INF]
> : 1.cb scrub ok
> 2014-11-12 16:25:56.516569 7f5026f31700  0 log_channel(default) log [INF]
> : 1.e2 scrub ok
>
>
>  -Thanks & regards,
> Mallikarjun Biradar
>
> On Tue, Nov 11, 2014 at 12:18 PM, Gregory Farnum  wrote:
>
>> On Sun, Nov 9, 2014 at 9:29 PM, Mallikarjun Biradar
>>  wrote:
>> > Hi all,
>> >
>> > Triggering shallow scrub on OSD where scrub is already in progress,
>> restarts
>> > scrub from beginning on that OSD.
>> >
>> >
>> > Steps:
>> > Triggered shallow scrub on an OSD (Cluster is running heavy IO)
>> > While scrub is in progress, triggered shallow scrub again on that OSD.
>> >
>> > Observed behavior, is scrub restarted from beginning on that OSD.
>> >
>> > Please let me know, whether its expected behaviour?
>>
>> What version of Ceph are you seeing this on? How are you identifying
>> that scrub is restarting from the beginning? It sounds sort of
>> familiar to me, but I thought this was fixed so it was a no-op if you
>> issue another scrub. (That's not authoritative though; I might just be
>> missing a reason we want to restart it.)
>> -Greg
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread Scottix
I would say it depends on your system and where the drives are connected.
Some HBAs have a CLI tool to manage the connected drives the way a RAID
card would.
One other method I found is that the kernel sometimes exposes the LEDs for you;
http://fabiobaltieri.com/2011/09/21/linux-led-subsystem/ has an
article on /sys/class/leds, but there's no guarantee.

On my laptop I could turn the lights on, but our server didn't expose
anything. It seems like a feature either Linux or smartctl should
have. I have run into this problem before and used a couple of tricks to
figure it out.

I guess the best solution is just to track the drives' serial numbers. Maybe
that's a good note to add to the Ceph docs so cluster operators are aware of it.
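
If the enclosure supports SES, the ledctl tool from the ledmon package may also
handle the blinking part (the device name is just an example):

# ledctl locate=/dev/sdk
# ledctl locate_off=/dev/sdk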

On Wed, Nov 12, 2014 at 9:06 AM, Erik Logtenberg  wrote:
> I have no experience with the DELL SAS controller, but usually the
> advantage of using a simple controller (instead of a RAID card) is that
> you can use full SMART directly.
>
> $ sudo smartctl -a /dev/sda
>
> === START OF INFORMATION SECTION ===
> Device Model: INTEL SSDSA2BW300G3H
> Serial Number:PEPR2381003E300EGN
>
> Personally, I make sure that I know which serial number drive is in
> which bay, so I can easily tell which drive I'm talking about.
>
> So you can use SMART both to notice (pre)failing disks -and- to
> physically identify them.
>
> The same smartctl command also returns the health status like so:
>
> 233 Media_Wearout_Indicator 0x0032   099   099   000Old_age   Always
>   -   0
>
> This specific SSD has 99% media lifetime left, so it's in the green. But
> it will continue to gradually degrade, and at some time It'll hit a
> percentage where I like to replace it. To keep an eye on the speed of
> decay, I'm graphing those SMART values in Cacti. That way I can somewhat
> predict how long a disk will last, especially SSD's which die very
> gradually.
>
> Erik.
>
>
> On 12-11-14 14:43, JF Le Fillâtre wrote:
>>
>> Hi,
>>
>> May or may not work depending on your JBOD and the way it's identified
>> and set up by the LSI card and the kernel:
>>
>> cat /sys/block/sdX/../../../../sas_device/end_device-*/bay_identifier
>>
>> The weird path and the wildcards are due to the way the sysfs is set up.
>>
>> That works with a Dell R520, 6GB HBA SAS cards and Dell MD1200s, running
>> CentOS release 6.5.
>>
>> Note that you can make your life easier by writing an udev script that
>> will create a symlink with a sane identifier for each of your external
>> disks. If you match along the lines of
>>
>> KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*"
>>
>> then you'll just have to cat "/sys/class/sas_device/${1}/bay_identifier"
>> in a script (with $1 being the $id of udev after that match, so the
>> string "end_device-X:Y:Z") to obtain the bay ID.
>>
>> Thanks,
>> JF
>>
>>
>>
>> On 12/11/14 14:05, SCHAER Frederic wrote:
>>> Hi,
>>>
>>>
>>>
>>> I’m used to RAID software giving me the failing disks  slots, and most
>>> often blinking the disks on the disk bays.
>>>
>>> I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
>>> 2008 one, and I now have to identify 3 pre-failed disks (so says
>>> S.M.A.R.T) .
>>>
>>>
>>>
>>> Since this is an LSI, I thought I’d use MegaCli to identify the disks
>>> slot, but MegaCli does not see the HBA card.
>>>
>>> Then I found the LSI “sas2ircu” utility, but again, this one fails at
>>> giving me the disk slots (it finds the disks, serials and others, but
>>> slot is always 0)
>>>
>>> Because of this, I’m going to head over to the disk bay and unplug the
>>> disk which I think corresponds to the alphabetical order in linux, and
>>> see if it’s the correct one…. But even if this is correct this time, it
>>> might not be next time.
>>>
>>>
>>>
>>> But this makes me wonder : how do you guys, Ceph users, manage your
>>> disks if you really have JBOD servers ?
>>>
>>> I can’t imagine having to guess slots that each time, and I can’t
>>> imagine neither creating serial number stickers for every single disk I
>>> could have to manage …
>>>
>>> Is there any specific advice reguarding JBOD cards people should (not)
>>> use in their systems ?
>>>
>>> Any magical way to “blink” a drive in linux ?
>>>
>>>
>>>
>>> Thanks && regards
>>>
>>>
>>>



-- 
Follow Me: @Taijutsun
scot...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread Erik Logtenberg
I have no experience with the DELL SAS controller, but usually the
advantage of using a simple controller (instead of a RAID card) is that
you can use full SMART directly.

$ sudo smartctl -a /dev/sda

=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSA2BW300G3H
Serial Number:PEPR2381003E300EGN

Personally, I make sure that I know which serial number drive is in
which bay, so I can easily tell which drive I'm talking about.

So you can use SMART both to notice (pre)failing disks -and- to
physically identify them.
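
A quick way to dump the serial number of every disk at once (the device glob
is just an example):

$ for d in /dev/sd[a-z]; do echo -n "$d "; sudo smartctl -i $d | grep 'Serial Number'; done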

The same smartctl command also returns the health status like so:

233 Media_Wearout_Indicator 0x0032   099   099   000   Old_age   Always       -       0

This specific SSD has 99% media lifetime left, so it's in the green. But
it will continue to gradually degrade, and at some point it'll hit a
percentage where I'd like to replace it. To keep an eye on the speed of
decay, I'm graphing those SMART values in Cacti. That way I can somewhat
predict how long a disk will last, especially SSDs, which die very
gradually.

Erik.


On 12-11-14 14:43, JF Le Fillâtre wrote:
> 
> Hi,
> 
> May or may not work depending on your JBOD and the way it's identified
> and set up by the LSI card and the kernel:
> 
> cat /sys/block/sdX/../../../../sas_device/end_device-*/bay_identifier
> 
> The weird path and the wildcards are due to the way the sysfs is set up.
> 
> That works with a Dell R520, 6GB HBA SAS cards and Dell MD1200s, running
> CentOS release 6.5.
> 
> Note that you can make your life easier by writing an udev script that
> will create a symlink with a sane identifier for each of your external
> disks. If you match along the lines of
> 
> KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*"
> 
> then you'll just have to cat "/sys/class/sas_device/${1}/bay_identifier"
> in a script (with $1 being the $id of udev after that match, so the
> string "end_device-X:Y:Z") to obtain the bay ID.
> 
> Thanks,
> JF
> 
> 
> 
> On 12/11/14 14:05, SCHAER Frederic wrote:
>> Hi,
>>
>>  
>>
>> I’m used to RAID software giving me the failing disks  slots, and most
>> often blinking the disks on the disk bays.
>>
>> I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
>> 2008 one, and I now have to identify 3 pre-failed disks (so says
>> S.M.A.R.T) .
>>
>>  
>>
>> Since this is an LSI, I thought I’d use MegaCli to identify the disks
>> slot, but MegaCli does not see the HBA card.
>>
>> Then I found the LSI “sas2ircu” utility, but again, this one fails at
>> giving me the disk slots (it finds the disks, serials and others, but
>> slot is always 0)
>>
>> Because of this, I’m going to head over to the disk bay and unplug the
>> disk which I think corresponds to the alphabetical order in linux, and
>> see if it’s the correct one…. But even if this is correct this time, it
>> might not be next time.
>>
>>  
>>
>> But this makes me wonder : how do you guys, Ceph users, manage your
>> disks if you really have JBOD servers ?
>>
>> I can’t imagine having to guess slots that each time, and I can’t
>> imagine neither creating serial number stickers for every single disk I
>> could have to manage …
>>
>> Is there any specific advice reguarding JBOD cards people should (not)
>> use in their systems ?
>>
>> Any magical way to “blink” a drive in linux ?
>>
>>  
>>
>> Thanks && regards
>>
>>
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Solaris 10 VMs extremely slow in KVM on Ceph RBD Devices

2014-11-12 Thread Christoph Adomeit
Hi,

I installed a Ceph cluster with 50 OSDs on 4 hosts and I am finally really 
happy with it.

Linux and Windows VMs run really fast in KVM on the Ceph storage.

Only my Solaris 10 guests are terribly slow on Ceph RBD storage. A Solaris guest on 
Ceph storage needs 15 minutes to boot. When I move the Solaris image to the old 
Nexenta NFS storage and start it on the same KVM host, it flies and boots in 
1.5 minutes.

I have tested Ceph firefly and giant, and the problem occurs with both Ceph versions.

The performance problem is not only with booting. It continues when 
the server is up: everything is terribly slow.

So the only difference here that could cause the big performance problem is Ceph 
vs. Nexenta NFS storage.

The Solaris guests have a standard ZFS root installation.

Does anybody have an idea or a hint about what might be going on here and what I 
should try to make Solaris 10 guests faster on Ceph storage?
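
One thing I still have to rule out myself is whether librbd caching is enabled
for these guests, i.e. something like this on the KVM hosts (plus
cache=writeback on the guest's disk definition):

[client]
    rbd cache = true
    rbd cache writethrough until flush = true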

Many Thanks
  Christoph

-- 
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

christoph.adom...@gatworks.de Internetloesungen vom Feinsten
Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG's incomplete after OSD failure

2014-11-12 Thread Chad Seys
Would love to hear if you discover a way to zap incomplete PGs!

Perhaps this is a common enough problem to warrant opening a tracker issue?

Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The strategy of auto-restarting crashed OSD

2014-11-12 Thread Adeel Nazir


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> David Z
> Sent: Wednesday, November 12, 2014 8:16 AM
> To: Ceph Community; Ceph-users
> Subject: [ceph-users] The strategy of auto-restarting crashed OSD
> 
> Hi Guys,
> 
> We are experiencing some OSD crashing issues recently, like messenger
> crash, some strange crash (still being investigating), etc. Those crashes 
> seems
> not to reproduce after restarting OSD.
>
> So we are thinking about the strategy of auto-restarting crashed OSD for 1 or
> 2 times, then leave it as down if restarting doesn't work. This strategy might
> help us on pg peering and recovering impact to online traffic to some extent,
> since we won't mark OSD out automatically even if it is down unless we are
> sure it is disk failure.
> 
> However, we are also aware that this strategy may bring us some problems.
> Since your guys have more experience on CEPH, so we would like to hear
> some suggestions from you.
> 
> Thanks.
> 
> David Zhang


I'm currently looking at the same scenario of having to restart crashed OSDs. 
I'm looking towards using runit (http://smarden.org/runit/ & 
http://smarden.org/runit/useinit.html) to manage the OSDs... I'll probably 
modify my init script to send me an SNMP trap or email when an OSD is restarted, though.
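
A minimal runit run script for a single OSD could look like this (the OSD id
and cluster name are placeholders; the daemon has to stay in the foreground so
runit can supervise it):

#!/bin/sh
exec 2>&1
exec /usr/bin/ceph-osd -f --cluster ceph -i 12
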
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados -p cache-flush-evict-all surprisingly slow

2014-11-12 Thread Martin Millnert
Dear Cephers,

I have a lab setup with 6x dual-socket hosts, 48GB RAM, 2x10Gbps hosts,
each equipped with 2x S3700 100GB SSDs and 4x 500GB HDD, where the HDDs
are mapped in a tree under a 'platter' root tree similar to guidance from
Seb at 
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
 ,
and SSDs similarly under an 'ssd' root.  Replication is set to 3.
Journals on tmpfs (simulating NVRAM).

I have put an ssd pool as a cache tier in front of an hdd pool ("rbd"), and run
fio-rbd against "rbd".  In the benchmarks, at bs=32kb, QD=128 from a
single separate client machine, I reached a peak throughput of around
1.2 GB/s.  So there is some capability.  IOPS-wise I see a max of around
15k iops currently.

After having filled the SSD cache tier, I ran rados -p rbd
cache-flush-evict-all - and I was expecting to see the 6 SSD OSDs start
to evict all the cache-tier pg's to the underlying pool, rbd, which maps
to the HDDs.  I would have expected parallelism and high throughput,
but what I now observe is ~80 MB/s on average flush speed.

Which leads me to the question:  Is "rados -p 
cache-flush-evict-all" supposed to work in a parallell manner?

Cursory viewing in tcpdump suggests to me that the eviction operation is
serial, in which case the performance would make a little bit of sense,
since it is basically limited by the write speed of a single hdd.

What should I see?

If it is indeed a serial operation, is this different from the regular
cache tier eviction routines that are triggered by full_ratios, max
objects or max storage volume?

Regards,
Martin


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Compute on same hardware?

2014-11-12 Thread gaoxingxing

I think you should also consider risks like kernel crashes etc., since storage and 
compute nodes are sharing the same box.
Date: Wed, 12 Nov 2014 14:51:47 +
From: pieter.koo...@me.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph and Compute on same hardware?

Hi,
Thanks for the replies. Likely will not choose this method but wanted to make 
sure that it was a good technical reason rather than just a "best practice". I 
did not quite think of "conntracker" at the time so this is a good one to 
consider.
Thanks
Pieter
On 12 November 2014 14:30, Haomai Wang  wrote:
Actually, our production cluster(up to ten) all are that ceph-osd ran
on compute-node(KVM).

The primary action is that you need to constrain the cpu and memory.
For example, you can alloc a ceph cpu-set and memory group, let
ceph-osd run with it within limited cores and memory.

The another risk is the network. Because compute-node and ceph-osd
shared the same kernel network stack, it exists some risks that VM may
ran out of network resources such as conntracker in netfilter
framework.

On Wed, Nov 12, 2014 at 10:23 PM, Mark Nelson  wrote:
> Technically there's no reason it shouldn't work, but it does complicate
> things.  Probably the biggest worry would be that if something bad happens
> on the compute side (say it goes nuts with network or memory transfers) it
> could slow things down enough that OSDs start failing heartbeat checks
> causing ceph to go into recovery and maybe cause a vicious cycle of
> nastiness.
>
> You can mitigate some of this with cgroups and try to dedicate specific
> sockets and memory banks to Ceph/Compute, but we haven't done a lot of
> testing yet afaik.
>
> Mark
>
> On 11/12/2014 07:45 AM, Pieter Koorts wrote:
>> Hi,
>>
>> A while back on a blog I saw mentioned that Ceph should not be run on
>> compute nodes and in the general sense should be on dedicated hardware.
>> Does this really still apply?
>>
>> An example, if you have nodes comprised of
>>
>> 16+ cores
>> 256GB+ RAM
>> Dual 10GBE Network
>> 2+8 OSD (SSD log + HDD store)
>>
>> I understand that Ceph can use a lot of IO and CPU in some cases but if
>> the nodes are powerful enough does it not make it an option to run
>> compute and storage on the same hardware to either increase density of
>> compute or save money on additional hardware?
>>
>> What are the reasons for not running Ceph on the Compute nodes.
>>
>> Thanks
>>
>> Pieter

--
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Compute on same hardware?

2014-11-12 Thread Robert van Leeuwen
> A while back on a blog I saw mentioned that Ceph should not be run on compute 
> nodes and in the general
> sense should be on dedicated hardware. Does this really still apply?

In my opinion storage needs to be rock-solid.
Running other (complex) software on a Ceph node increases the chances of stuff 
falling over.
Worst-case, a cascading effect takes down your whole storage platform.
If your storage platform bites the dust, your whole compute cloud also falls 
over (assuming you boot instances from Ceph).

Troubleshooting issues (especially those that have no obvious cause) becomes 
more complex when you have to rule out more potential causes.

Not saying it cannot work perfectly fine.
I'd rather just not take any chances with the storage system...

Cheers,
Robert van Leeuwen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Compute on same hardware?

2014-11-12 Thread Pieter Koorts
Hi,

Thanks for the replies. I likely will not choose this method, but I wanted to
make sure that there was a good technical reason rather than just a "best
practice". I had not quite thought of "conntracker" at the time, so this is a
good one to consider.

Thanks

Pieter

On 12 November 2014 14:30, Haomai Wang  wrote:

> Actually, our production cluster(up to ten) all are that ceph-osd ran
> on compute-node(KVM).
>
> The primary action is that you need to constrain the cpu and memory.
> For example, you can alloc a ceph cpu-set and memory group, let
> ceph-osd run with it within limited cores and memory.
>
> The another risk is the network. Because compute-node and ceph-osd
> shared the same kernel network stack, it exists some risks that VM may
> ran out of network resources such as conntracker in netfilter
> framework.
>
> On Wed, Nov 12, 2014 at 10:23 PM, Mark Nelson 
> wrote:
> > Technically there's no reason it shouldn't work, but it does complicate
> > things.  Probably the biggest worry would be that if something bad
> happens
> > on the compute side (say it goes nuts with network or memory transfers)
> it
> > could slow things down enough that OSDs start failing heartbeat checks
> > causing ceph to go into recovery and maybe cause a vicious cycle of
> > nastiness.
> >
> > You can mitigate some of this with cgroups and try to dedicate specific
> > sockets and memory banks to Ceph/Compute, but we haven't done a lot of
> > testing yet afaik.
> >
> > Mark
> >
> >
> > On 11/12/2014 07:45 AM, Pieter Koorts wrote:
> >>
> >> Hi,
> >>
> >> A while back on a blog I saw mentioned that Ceph should not be run on
> >> compute nodes and in the general sense should be on dedicated hardware.
> >> Does this really still apply?
> >>
> >> An example, if you have nodes comprised of
> >>
> >> 16+ cores
> >> 256GB+ RAM
> >> Dual 10GBE Network
> >> 2+8 OSD (SSD log + HDD store)
> >>
> >> I understand that Ceph can use a lot of IO and CPU in some cases but if
> >> the nodes are powerful enough does it not make it an option to run
> >> compute and storage on the same hardware to either increase density of
> >> compute or save money on additional hardware?
> >>
> >> What are the reasons for not running Ceph on the Compute nodes.
> >>
> >> Thanks
> >>
> >> Pieter
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Best Regards,
>
> Wheat
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Compute on same hardware?

2014-11-12 Thread Andrey Korolyov
On Wed, Nov 12, 2014 at 5:30 PM, Haomai Wang  wrote:
> Actually, our production cluster(up to ten) all are that ceph-osd ran
> on compute-node(KVM).
>
> The primary action is that you need to constrain the cpu and memory.
> For example, you can alloc a ceph cpu-set and memory group, let
> ceph-osd run with it within limited cores and memory.
>
> The another risk is the network. Because compute-node and ceph-osd
> shared the same kernel network stack, it exists some risks that VM may
> ran out of network resources such as conntracker in netfilter
> framework.
>
> On Wed, Nov 12, 2014 at 10:23 PM, Mark Nelson  wrote:
>> Technically there's no reason it shouldn't work, but it does complicate
>> things.  Probably the biggest worry would be that if something bad happens
>> on the compute side (say it goes nuts with network or memory transfers) it
>> could slow things down enough that OSDs start failing heartbeat checks
>> causing ceph to go into recovery and maybe cause a vicious cycle of
>> nastiness.
>>
>> You can mitigate some of this with cgroups and try to dedicate specific
>> sockets and memory banks to Ceph/Compute, but we haven't done a lot of
>> testing yet afaik.
>>
>> Mark
>>
>>
>> On 11/12/2014 07:45 AM, Pieter Koorts wrote:
>>>
>>> Hi,
>>>
>>> A while back on a blog I saw mentioned that Ceph should not be run on
>>> compute nodes and in the general sense should be on dedicated hardware.
>>> Does this really still apply?
>>>
>>> An example, if you have nodes comprised of
>>>
>>> 16+ cores
>>> 256GB+ RAM
>>> Dual 10GBE Network
>>> 2+8 OSD (SSD log + HDD store)
>>>
>>> I understand that Ceph can use a lot of IO and CPU in some cases but if
>>> the nodes are powerful enough does it not make it an option to run
>>> compute and storage on the same hardware to either increase density of
>>> compute or save money on additional hardware?
>>>
>>> What are the reasons for not running Ceph on the Compute nodes.
>>>
>>> Thanks
>>>
>>> Pieter
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Best Regards,
>
> Wheat
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Yes, the essential part is resource management, which can be either
dynamic or static. In Flops we implemented dynamic resource
control, which allows packing VMs and OSDs more densely than static
cgroup-based jails do (though it requires deep orchestration
modifications for every open source cloud orchestrator,
unfortunately). As long as you are able to manage strong traffic
isolation for the storage and VM segments, there is absolutely no problem
(it can be static limits via Linux QoS or tricky flow management with
OpenFlow, depending on what your orchestration allows). The possibility
of putting compute and storage roles together without a significant
impact on performance characteristics was one of the key features that
led us to select Ceph as a storage backend three years ago.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Compute on same hardware?

2014-11-12 Thread Haomai Wang
Actually, in all of our production clusters (up to ten of them), ceph-osd runs
on the compute nodes (KVM).

The primary requirement is that you constrain CPU and memory.
For example, you can allocate a dedicated cpuset and memory cgroup and let
ceph-osd run within those limited cores and that memory.

The other risk is the network. Because the compute node and ceph-osd
share the same kernel network stack, there is a risk that VMs run out of
network resources such as connection-tracking ("conntrack") entries in the
netfilter framework.
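
A rough sketch with the libcgroup tools (the core list, memory limit and OSD id
are just examples):

# cgcreate -g cpuset,memory:ceph
# cgset -r cpuset.cpus=0-3 -r cpuset.mems=0 ceph
# cgset -r memory.limit_in_bytes=32G ceph
# cgexec -g cpuset,memory:ceph /usr/bin/ceph-osd -f -i 12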

On Wed, Nov 12, 2014 at 10:23 PM, Mark Nelson  wrote:
> Technically there's no reason it shouldn't work, but it does complicate
> things.  Probably the biggest worry would be that if something bad happens
> on the compute side (say it goes nuts with network or memory transfers) it
> could slow things down enough that OSDs start failing heartbeat checks
> causing ceph to go into recovery and maybe cause a vicious cycle of
> nastiness.
>
> You can mitigate some of this with cgroups and try to dedicate specific
> sockets and memory banks to Ceph/Compute, but we haven't done a lot of
> testing yet afaik.
>
> Mark
>
>
> On 11/12/2014 07:45 AM, Pieter Koorts wrote:
>>
>> Hi,
>>
>> A while back on a blog I saw mentioned that Ceph should not be run on
>> compute nodes and in the general sense should be on dedicated hardware.
>> Does this really still apply?
>>
>> An example, if you have nodes comprised of
>>
>> 16+ cores
>> 256GB+ RAM
>> Dual 10GBE Network
>> 2+8 OSD (SSD log + HDD store)
>>
>> I understand that Ceph can use a lot of IO and CPU in some cases but if
>> the nodes are powerful enough does it not make it an option to run
>> compute and storage on the same hardware to either increase density of
>> compute or save money on additional hardware?
>>
>> What are the reasons for not running Ceph on the Compute nodes.
>>
>> Thanks
>>
>> Pieter
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stackforge Puppet Module

2014-11-12 Thread David Moreau Simard
What comes to mind is that you need to make sure that you've cloned the git 
repository to /etc/puppet/modules/ceph and not /etc/puppet/modules/puppet-ceph.
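
Something like this, adjusting the module path to your puppet setup:

$ git clone https://github.com/ceph/puppet-ceph.git /etc/puppet/modules/ceph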

Feel free to hop on IRC to discuss about puppet-ceph on freenode in 
#puppet-openstack.
You can find me there as dmsimard.

--
David Moreau Simard

> On Nov 12, 2014, at 8:58 AM, Nick Fisk  wrote:
> 
> Hi David,
> 
> Many thanks for your reply.
> 
> I must admit I have only just started looking at puppet, but a lot of what
> you said makes sense to me and understand the reason for not having the
> module auto discover disks.
> 
> I'm currently having a problem with the ceph::repo class when trying to push
> this out to a test server:-
> 
> Error: Could not retrieve catalog from remote server: Error 400 on SERVER:
> Could not find class ceph::repo for ceph-puppet-test on node
> ceph-puppet-test
> Warning: Not using cache on failed catalog
> Error: Could not retrieve catalog; skipping run
> 
> I'm a bit stuck but will hopefully work out why it's not working soon and
> then I can attempt your idea of using a script to dynamically pass disks to
> the puppet module.
> 
> Thanks,
> Nick
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> David Moreau Simard
> Sent: 11 November 2014 12:05
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Stackforge Puppet Module
> 
> Hi Nick,
> 
> The great thing about puppet-ceph's implementation on Stackforge is that it
> is both unit and integration tested.
> You can see the integration tests here:
> https://github.com/ceph/puppet-ceph/tree/master/spec/system
> 
> Where I'm getting at is that the tests allow you to see how you can use the
> module to a certain extent.
> For example, in the OSD integration tests:
> -
> https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_spec.rb
> #L24 and then:
> -
> https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_spec.rb
> #L82-L110
> 
> There's no auto discovery mechanism built-in the module right now. It's kind
> of dangerous, you don't want to format the wrong disks.
> 
> Now, this doesn't mean you can't "discover" the disks yourself and pass them
> to the module from your site.pp or from a composition layer.
> Here's something I have for my CI environment that uses the $::blockdevices
> fact to discover all devices, split that fact into a list of the devices and
> then reject the drives I don't want (such as the OS disk):
> 
># Assume OS is installed on xvda/sda/vda.
># On an Openstack VM, vdb is ephemeral, we don't want to use vdc.
># WARNING: ALL OTHER DISKS WILL BE FORMATTED/PARTITIONED BY CEPH!
>$block_devices = reject(split($::blockdevices, ','),
> '(xvda|sda|vda|vdc|sr0)')
>$devices = prefix($block_devices, '/dev/')
> 
> And then you can pass $devices to the module.
> 
> Let me know if you have any questions !
> --
> David Moreau Simard
> 
>> On Nov 11, 2014, at 6:23 AM, Nick Fisk  wrote:
>> 
>> Hi,
>> 
>> I'm just looking through the different methods of deploying Ceph, and I
>> particularly liked the idea, advertised by the stackforge puppet module,
>> of using discovery to automatically add new disks. I understand the
>> principle of how it should work (using ceph-disk list to find unknown
>> disks), but I would like to see in a little more detail how it's been
>> implemented.
>> 
>> I've been looking through the puppet module on Github, but I can't see
>> anywhere that this discovery is carried out.
>> 
>> Could anyone confirm whether this puppet module currently supports the
>> auto discovery, and where in the code it's carried out?
>> 
>> Many Thanks,
>> Nick
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Compute on same hardware?

2014-11-12 Thread Mark Nelson
Technically there's no reason it shouldn't work, but it does complicate 
things.  Probably the biggest worry is that if something bad happens on 
the compute side (say it goes nuts with network or memory transfers), it 
could slow things down enough that OSDs start failing heartbeat checks, 
causing Ceph to go into recovery and possibly setting off a vicious cycle 
of nastiness.


You can mitigate some of this with cgroups and try to dedicate specific 
sockets and memory banks to Ceph/Compute, but we haven't done a lot of 
testing yet afaik.
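
To illustrate the idea, a cpuset sketch could look roughly like this
(untested; uses the cpuset cgroup interface, and the core/NUMA node
numbers are completely made up):

  # dedicate cores 0-7 and NUMA node 0 to the OSDs
  sudo mkdir -p /sys/fs/cgroup/cpuset/ceph
  echo 0-7 | sudo tee /sys/fs/cgroup/cpuset/ceph/cpuset.cpus
  echo 0   | sudo tee /sys/fs/cgroup/cpuset/ceph/cpuset.mems
  for pid in $(pgrep -f ceph-osd); do
      echo $pid | sudo tee /sys/fs/cgroup/cpuset/ceph/tasks
  done

You'd then put the VM/compute processes in a second cpuset with the other
cores and memory node so the two don't stomp on each other.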


Mark

On 11/12/2014 07:45 AM, Pieter Koorts wrote:

Hi,

A while back I saw it mentioned on a blog that Ceph should not be run on
compute nodes and, generally speaking, should be on dedicated hardware.
Does this really still apply?

An example, if you have nodes comprised of

16+ cores
256GB+ RAM
Dual 10GBE Network
2+8 OSD (SSD log + HDD store)

I understand that Ceph can use a lot of IO and CPU in some cases, but if
the nodes are powerful enough, does it not make it an option to run
compute and storage on the same hardware, either to increase compute
density or to save money on additional hardware?

What are the reasons for not running Ceph on the compute nodes?

Thanks

Pieter


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stackforge Puppet Module

2014-11-12 Thread Nick Fisk
Hi David,

Many thanks for your reply.

I must admit I have only just started looking at Puppet, but a lot of what
you said makes sense to me, and I understand the reason for not having the
module auto-discover disks.

I'm currently having a problem with the ceph::repo class when trying to push
this out to a test server:-

Error: Could not retrieve catalog from remote server: Error 400 on SERVER:
Could not find class ceph::repo for ceph-puppet-test on node
ceph-puppet-test
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

I'm a bit stuck but will hopefully work out why it's not working soon and
then I can attempt your idea of using a script to dynamically pass disks to
the puppet module.

Thanks,
Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
David Moreau Simard
Sent: 11 November 2014 12:05
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Stackforge Puppet Module

Hi Nick,

The great thing about puppet-ceph's implementation on Stackforge is that it
is both unit and integration tested.
You can see the integration tests here:
https://github.com/ceph/puppet-ceph/tree/master/spec/system

What I'm getting at is that the tests allow you to see how you can use the
module, to a certain extent.
For example, in the OSD integration tests:
-
https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_spec.rb
#L24 and then:
-
https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_spec.rb
#L82-L110

There's no auto-discovery mechanism built into the module right now. It's kind
of dangerous; you don't want to format the wrong disks.

Now, this doesn't mean you can't "discover" the disks yourself and pass them
to the module from your site.pp or from a composition layer.
Here's something I have for my CI environment that uses the $::blockdevices
fact to discover all devices, split that fact into a list of the devices and
then reject the drives I don't want (such as the OS disk):

# Assume OS is installed on xvda/sda/vda.
# On an Openstack VM, vdb is ephemeral, we don't want to use vdc.
# WARNING: ALL OTHER DISKS WILL BE FORMATTED/PARTITIONED BY CEPH!
$block_devices = reject(split($::blockdevices, ','),
'(xvda|sda|vda|vdc|sr0)')
$devices = prefix($block_devices, '/dev/')

And then you can pass $devices to the module.

Let me know if you have any questions !
--
David Moreau Simard

> On Nov 11, 2014, at 6:23 AM, Nick Fisk  wrote:
> 
> Hi,
> 
> I'm just looking through the different methods of deploying Ceph, and I
> particularly liked the idea, advertised by the stackforge puppet module,
> of using discovery to automatically add new disks. I understand the
> principle of how it should work (using ceph-disk list to find unknown
> disks), but I would like to see in a little more detail how it's been
> implemented.
> 
> I've been looking through the puppet module on Github, but I can't see
> anywhere that this discovery is carried out.
> 
> Could anyone confirm whether this puppet module currently supports the
> auto discovery, and where in the code it's carried out?
> 
> Many Thanks,
> Nick
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and Compute on same hardware?

2014-11-12 Thread Pieter Koorts
Hi,

A while back I saw it mentioned on a blog that Ceph should not be run on
compute nodes and, generally speaking, should be on dedicated hardware.
Does this really still apply?

An example, if you have nodes comprised of

16+ cores
256GB+ RAM
Dual 10GBE Network
2+8 OSD (SSD log + HDD store)

I understand that Ceph can use a lot of IO and CPU in some cases, but if the
nodes are powerful enough, does it not make it an option to run compute and
storage on the same hardware, either to increase compute density or to save
money on additional hardware?

What are the reasons for not running Ceph on the compute nodes?

Thanks

Pieter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread JF Le Fillâtre

Hi,

May or may not work depending on your JBOD and the way it's identified
and set up by the LSI card and the kernel:

cat /sys/block/sdX/../../../../sas_device/end_device-*/bay_identifier

The weird path and the wildcards are due to the way the sysfs is set up.

That works with a Dell R520, 6GB HBA SAS cards and Dell MD1200s, running
CentOS release 6.5.

Note that you can make your life easier by writing an udev script that
will create a symlink with a sane identifier for each of your external
disks. If you match along the lines of

KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*"

then you'll just have to cat "/sys/class/sas_device/${1}/bay_identifier"
in a script (with $1 being the $id of udev after that match, so the
string "end_device-X:Y:Z") to obtain the bay ID.
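
For example, something along these lines (untested sketch; file names are
just examples):

# /etc/udev/rules.d/60-sas-bay.rules
KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*", PROGRAM="/usr/local/bin/sas-bay.sh $id", SYMLINK+="bay/%c"

# /usr/local/bin/sas-bay.sh (make it executable)
#!/bin/bash
# $1 is the end_device-X:Y:Z name matched by KERNELS above
cat "/sys/class/sas_device/${1}/bay_identifier"

That should leave you with /dev/bay/<bay number> symlinks pointing at the
right disks.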

Thanks,
JF



On 12/11/14 14:05, SCHAER Frederic wrote:
> Hi,
> 
>  
> 
> I’m used to RAID software giving me the failing disks’ slots, and most
> often blinking the disks on the disk bays.
> 
> I recently installed a DELL “6GB HBA SAS” JBOD card, said to be an LSI
> 2008 one, and I now have to identify 3 pre-failed disks (so says
> S.M.A.R.T.).
> 
>  
> 
> Since this is an LSI, I thought I’d use MegaCli to identify the disk
> slots, but MegaCli does not see the HBA card.
> 
> Then I found the LSI “sas2ircu” utility, but again, this one fails at
> giving me the disk slots (it finds the disks, serials and others, but
> slot is always 0)
> 
> Because of this, I’m going to head over to the disk bay and unplug the
> disk which I think corresponds to the alphabetical order in linux, and
> see if it’s the correct one…. But even if this is correct this time, it
> might not be next time.
> 
>  
> 
> But this makes me wonder: how do you guys, Ceph users, manage your
> disks if you really have JBOD servers?
> 
> I can’t imagine having to guess slots like that each time, and I can’t
> imagine creating serial number stickers for every single disk I could
> have to manage either…
> 
> Is there any specific advice regarding JBOD cards people should (not)
> use in their systems?
> 
> Any magical way to “blink” a drive in Linux?
> 
>  
> 
> Thanks && regards
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.87 Giant released

2014-11-12 Thread debian Only
Dear expert

could you help provide some guidance on upgrading Ceph from firefly to giant?

many thanks !

2014-10-30 15:37 GMT+07:00 Joao Eduardo Luis :

> On 10/30/2014 05:54 AM, Sage Weil wrote:
>
>> On Thu, 30 Oct 2014, Nigel Williams wrote:
>>
>>> On 30/10/2014 8:56 AM, Sage Weil wrote:
>>>
 * *Degraded vs misplaced*: the Ceph health reports from 'ceph -s' and
 related commands now make a distinction between data that is
 degraded (there are fewer than the desired number of copies) and
 data that is misplaced (stored in the wrong location in the
 cluster).

>>>
>>> Is someone able to briefly describe how/why misplaced happens please?
>>> Is it
>>> repaired eventually? I've not seen misplaced (yet).
>>>
>>
>> Sure.  An easy way to get misplaced objects is to do 'ceph osd
>> out N' on an OSD.  Nothing is down, we still have as many copies
>> as we had before, but Ceph now wants to move them somewhere
>> else. Starting with giant, you will see the misplaced % in 'ceph -s' and
>> not degraded.
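>> For example (the osd id is arbitrary):
>>
>>   ceph osd out 5    # osd.5 stays up, so no copies are lost,
>>   ceph -s           # but health should now report misplaced rather than degraded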
>>
>>leveldb_write_buffer_size = 32*1024*1024  = 33554432  // 32MB
   leveldb_cache_size= 512*1024*1204 = 536870912 // 512MB

>>>
>>> I noticed the typo, wondered about the code, but I'm not seeing the same
>>> values anyway?
>>>
>>> https://github.com/ceph/ceph/blob/giant/src/common/config_opts.h
>>>
>>> OPTION(leveldb_write_buffer_size, OPT_U64, 8 *1024*1024) // leveldb
>>> write
>>> buffer size
>>> OPTION(leveldb_cache_size, OPT_U64, 128 *1024*1024) // leveldb cache size
>>>
>>
>> Hmm!  Not sure where that 32MB number came from.  I'll fix it, thanks!
>>
>
> Those just happen to be the values used on the monitors (in ceph_mon.cc).
> Maybe that's where the mix up came from. :)
>
>   -Joao
>
>
> --
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] The strategy of auto-restarting crashed OSD

2014-11-12 Thread David Z
Hi Guys,

We are experiencing some OSD crashing issues recently, like a messenger crash, 
some strange crashes (still being investigated), etc. Those crashes seem not to 
reproduce after restarting the OSD.

So we are thinking about a strategy of auto-restarting a crashed OSD 1 or 2 
times, then leaving it down if restarting doesn't work. This strategy might 
help us limit the impact of PG peering and recovery on online traffic to some 
extent, since we won't mark an OSD out automatically when it is down unless we 
are sure it is a disk failure.
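
Roughly what we have in mind is something like this (just a sketch, not what
we actually run; the retry count and sleep are made up):

#!/bin/bash
# try to restart osd.$1 a couple of times, then give up and leave it down
OSD_ID=$1
MAX_TRIES=2
tries=0
until pgrep -f "ceph-osd.*-i ${OSD_ID}($| )" > /dev/null || [ $tries -ge $MAX_TRIES ]; do
    service ceph start osd.${OSD_ID}   # sysvinit syntax; adjust for upstart/systemd
    tries=$((tries + 1))
    sleep 30
done

combined with a long enough 'mon osd down out interval' so the OSD is not
marked out while we retry.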

However, we are also aware that this strategy may bring us some problems. Since 
you guys have more experience with Ceph, we would like to hear some 
suggestions from you.

Thanks.

David Zhang  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread SCHAER Frederic
Hi,

I'm used to RAID software giving me the failing disks' slots, and most often 
blinking the disks on the disk bays.
I recently installed a DELL "6GB HBA SAS" JBOD card, said to be an LSI 2008 
one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.).

Since this is an LSI, I thought I'd use MegaCli to identify the disk slots, but 
MegaCli does not see the HBA card.
Then I found the LSI "sas2ircu" utility, but again, this one fails at giving me 
the disk slots (it finds the disks, serials and others, but slot is always 0)
Because of this, I'm going to head over to the disk bay and unplug the disk 
which I think corresponds to the alphabetical order in Linux, and see if it's 
the correct one... But even if this is correct this time, it might not be next 
time.

But this makes me wonder: how do you guys, Ceph users, manage your disks if 
you really have JBOD servers?
I can't imagine having to guess slots like that each time, and I can't imagine 
creating serial number stickers for every single disk I could have to 
manage either...
Is there any specific advice regarding JBOD cards people should (not) use in 
their systems?
Any magical way to "blink" a drive in Linux?

Thanks && regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-12 Thread Wido den Hollander
(back to list)

On 11/10/2014 06:57 PM, Gary M wrote:
> Hi Wido,
> 
> That is a bit weird.. I'd also check the Ethernet controller firmware
> version and settings between the other configurations. There must be
> something different.
> 

Indeed, there must be something! But I can't figure it out yet. Same
controllers, tried the same OS, direct cables, but the latency is 40%
higher.

> I can understand wanting to do a simple latency test.. But as we get closer
> to hw speeds and microsecond measurements, measures appear to be more
> unstable through software stacks.
> 

I fully agree with you. But a basic ICMP test on an idle machine should
be a baseline from which you can start further diagnosing network
latency with better tools like netperf.
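
For example, something like this should give a better picture of the
request/response latency (assuming netserver is already running on the
remote host):

  netperf -H <remote> -t TCP_RR -l 30 -- -r 1,1
  # TCP_RR reports transactions/sec; 1 / (transactions/sec) is roughly the round-trip time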

Wido

> 
> 
> -gary
> 
> On Mon, Nov 10, 2014 at 9:22 AM, Wido den Hollander  wrote:
> 
>> On 08-11-14 02:42, Gary M wrote:
>>> Wido,
>>>
>>> Take the switch out of the path between nodes and remeasure.. ICMP-echo
>>> requests are very low priority traffic for switches and network stacks.
>>>
>>
>> I tried with a direct TwinAx and fiber cable. No difference.
>>
>>> If you really want to know, place a network analyzer between the nodes
>>> to measure the request packet to response packet latency.. The ICMP
>>> traffic to the "ping application" is not accurate in the sub-millisecond
>>> range. And should only be used as a rough estimate.
>>>
>>
>> True, I fully agree with you. But, why is everybody showing a lower
>> latency here? My latencies are about 40% higher than what I see in this
>> setup and other setups.
>>
>>> You also may want to install the high resolution timer patch, sometimes
>>> called HRT, to the kernel which may give you different results.
>>>
>>> ICMP traffic takes a different path than the TCP traffic and should not
>>> be considered an indicator of defect.
>>>
>>
>> Yes, I'm aware. But it still doesn't explain to me why the latency on other
>> systems, which are in production, is lower than on this idle system.
>>
>>> I believe the ping app calls the sendto system call.(sorry its been a
>>> while since I last looked)  Systems calls can take between .1us and .2us
>>> each. However, the ping application makes several of these calls and
>>> waits for a signal from the kernel. The wait for a signal means the ping
>>> application must wait to be rescheduled to report the time.Rescheduling
>>> will depend on a lot of other factors in the os. eg, timers, card
>>> interrupts other tasks with higher priorities.  Reporting the time must
>>> add a few more systems calls for this to happen. As the ping application
>>> loops to post the next ping request which again requires a few systems
>>> calls which may cause a task switch while in each system call.
>>>
>>> For the above factors, the ping application is not a good representation
>>> of network performance due to factors in the application and network
>>> traffic shaping performed at the switch and the tcp stacks.
>>>
>>
>> I think that netperf is probably a better tool, but that also does TCP
>> latencies.
>>
>> I want the real IP latency, so I assumed that ICMP would be the most
>> simple one.
>>
>> The other setups I have access to are in production and do not have any
>> special tuning, yet their latency is still lower than on this new
>> deployment.
>>
>> That's what gets me confused.
>>
>> Wido
>>
>>> cheers,
>>> gary
>>>
>>>
>>> On Fri, Nov 7, 2014 at 4:32 PM, Łukasz Jagiełło
>>>  wrote:
>>>
>>> Hi,
>>>
>>> rtt min/avg/max/mdev = 0.070/0.177/0.272/0.049 ms
>>>
>>> 04:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit
>>> SFI/SFP+ Network Connection (rev 01)
>>>
>>> at both hosts and Arista 7050S-64 between.
>>>
>>> Both hosts were part of active ceph cluster.
>>>
>>>
>>> On Thu, Nov 6, 2014 at 5:18 AM, Wido den Hollander  wrote:
>>>
>>> Hello,
>>>
>>> While working at a customer I've ran into a 10GbE latency which
>>> seems
>>> high to me.
>>>
>>> I have access to a couple of Ceph cluster and I ran a simple
>>> ping test:
>>>
>>> $ ping -s 8192 -c 100 -n 
>>>
>>> Two results I got:
>>>
>>> rtt min/avg/max/mdev = 0.080/0.131/0.235/0.039 ms
>>> rtt min/avg/max/mdev = 0.128/0.168/0.226/0.023 ms
>>>
>>> Both these environment are running with Intel 82599ES 10Gbit
>>> cards in
>>> LACP. One with Extreme Networks switches, the other with Arista.
>>>
>>> Now, on a environment with Cisco Nexus 3000 and Nexus 7000
>>> switches I'm
>>> seeing:
>>>
>>> rtt min/avg/max/mdev = 0.160/0.244/0.298/0.029 ms
>>>
>>> As you can see, the Cisco Nexus network has high latency
>>> compared to the
>>> other setup.
>>>
>>> You would say the switches are to blame, but we also tried with
>>> a direct
>>> TwinAx connection, but that didn't h

Re: [ceph-users] mds isn't working anymore after osd's running full

2014-11-12 Thread Jasper Siero
Hello Greg,

The specific PG was always deep scrubbing (ceph pg dump all showed the last 
deep scrub of this PG was in August), but now when I look at it again the deep 
scrub is finished and everything is healthy. Maybe it is solved because the mds 
is running fine now and it unlocked something.
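
For reference, this is roughly how I was checking it (3.30 is our pg id):

ceph pg dump | awk '$1 == "3.30"'   # the dump includes the deep_scrub_stamp column
ceph pg 3.30 query | grep -i scrub  # also shows the last (deep) scrub stamps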

The problem is solved now :)

Thanks!

Jasper

From: Gregory Farnum [g...@gregs42.com]
Sent: Tuesday, 11 November 2014 19:19
To: Jasper Siero
CC: ceph-users
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Tue, Nov 11, 2014 at 5:06 AM, Jasper Siero
 wrote:
> No problem thanks for helping.
> I don't want to disable the deep scrubbing process itself because it's very 
> useful, but one placement group (3.30) is continuously deep scrubbing and it 
> should finish after some time but it won't.

Hmm, how are you determining that this one PG won't stop scrubbing?
This doesn't sound like any issues familiar to me.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help regarding Installing ceph on a single machine with cephdeploy on ubuntu 14.04 64 bit

2014-11-12 Thread tej ak
Hi,

I am new to Ceph and desperately trying to figure out how to install and
deploy Ceph on a single machine with ceph-deploy. I have Ubuntu 14.04
64-bit installed in a virtual machine (on Windows 8.1 through VMware
Player) and have installed devstack on Ubuntu. I am trying to install Ceph
on the same machine (Ubuntu) and interface it with OpenStack. I have tried
the following steps, but it says that mkcephfs does not exist, and I read
that it is deprecated and ceph-deploy replaces it. The documentation,
however, talks about multiple nodes. I am lost as to how to use ceph-deploy
to install and set up Ceph on a single machine. Please guide me. I tried
the following steps earlier, which were given for mkcephfs.

<< (reference: http://eu.ceph.com/docs/wip-6919/start/quick-start/)

(1) sudo apt-get update && sudo apt-get install ceph

(2) Execute hostname -s on the command line to retrieve the name of your
host. Then, replace {hostname} in the sample configuration file with your
host name. Execute ifconfig on the command line to retrieve the IP address
of your host. Then, replace {ip-address} with the IP address of your host.
Finally, copy the contents of the modified configuration file and save it
to /etc/ceph/ceph.conf. This file will configure Ceph to operate a monitor,
two OSD daemons and one metadata server on your local machine:

[osd]
osd journal size = 1000
filestore xattr use omap = true

# Execute $ hostname to retrieve the name of your host,
# and replace {hostname} with the name of your host.
# For the monitor, replace {ip-address} with the IP
# address of your host.
[mon.a]
host = {hostname}
mon addr = {ip-address}:6789

[osd.0]
host = {hostname}

[osd.1]
host = {hostname}

[mds.a]
host = {hostname}

Then:

sudo mkdir /var/lib/ceph/osd/ceph-0
sudo mkdir /var/lib/ceph/osd/ceph-1
sudo mkdir /var/lib/ceph/mon/ceph-a
sudo mkdir /var/lib/ceph/mds/ceph-a
cd /etc/ceph
sudo mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring
sudo service ceph start
ceph health
>
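
From what I understand of the ceph-deploy docs, the single-node sequence
should look roughly like this ('ubuntu-vm' is just a placeholder for my
hostname; I have not managed to get it working yet), but please correct me:

mkdir my-cluster && cd my-cluster
ceph-deploy new ubuntu-vm
# single node: allow the cluster to go healthy with all OSDs on one host
echo "osd pool default size = 2" >> ceph.conf
echo "osd crush chooseleaf type = 0" >> ceph.conf
ceph-deploy install ubuntu-vm
ceph-deploy mon create-initial
# directory-backed OSDs, just for testing
sudo mkdir -p /var/local/osd0 /var/local/osd1
ceph-deploy osd prepare ubuntu-vm:/var/local/osd0 ubuntu-vm:/var/local/osd1
ceph-deploy osd activate ubuntu-vm:/var/local/osd0 ubuntu-vm:/var/local/osd1
ceph-deploy admin ubuntu-vm
ceph health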
Regards,
Bobby
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-12 Thread Alexandre DERUMIER
>>Is this with an 8192 byte payload?
Oh, sorry it was with 1500.
I'll try to send a report with 8192 tomorrow.

- Original Message - 

De: "Robert LeBlanc"  
À: "Alexandre DERUMIER"  
Cc: "Wido den Hollander" , ceph-users@lists.ceph.com 
Envoyé: Mardi 11 Novembre 2014 23:13:17 
Objet: Re: [ceph-users] Typical 10GbE latency 


Is this with an 8192 byte payload? The theoretical transfer time at 1 Gbps (you 
are only sending one packet, so LACP won't help) is 0.061 ms one direction; 
double that and you are at 0.122 ms of bits in flight. Then there is context 
switching, switch latency (store and forward assumed for 1 Gbps), etc., which I'm 
not sure would fit in the rest of the 0.057 ms of your min time. If it is an 8192 
byte payload, then I'm really impressed! 


On Tue, Nov 11, 2014 at 11:56 AM, Alexandre DERUMIER < aderum...@odiso.com > 
wrote: 


I don't have 10GbE yet, but here is my result with simple LACP on 2 gigabit 
links and a Cisco 6500: 

rtt min/avg/max/mdev = 0.179/0.202/0.221/0.019 ms 


(Seems to be lower than your 10GbE Nexus.) 


- Original Message - 

De: "Wido den Hollander" < w...@42on.com > 
À: ceph-users@lists.ceph.com 
Envoyé: Lundi 10 Novembre 2014 17:22:04 
Objet: Re: [ceph-users] Typical 10GbE latency 



On 08-11-14 02:42, Gary M wrote: 
> Wido, 
> 
> Take the switch out of the path between nodes and remeasure.. ICMP-echo 
> requests are very low priority traffic for switches and network stacks. 
> 

I tried with a direct TwinAx and fiber cable. No difference. 

> If you really want to know, place a network analyzer between the nodes 
> to measure the request packet to response packet latency.. The ICMP 
> traffic to the "ping application" is not accurate in the sub-millisecond 
> range. And should only be used as a rough estimate. 
> 

True, I fully agree with you. But, why is everybody showing a lower 
latency here? My latencies are about 40% higher than what I see in this 
setup and other setups. 

> You also may want to install the high resolution timer patch, sometimes 
> called HRT, to the kernel which may give you different results. 
> 
> ICMP traffic takes a different path than the TCP traffic and should not 
> be considered an indicator of defect. 
> 

Yes, I'm aware. But it still doesn't explain to me why the latency on other 
systems, which are in production, is lower than on this idle system. 

> I believe the ping app calls the sendto system call.(sorry its been a 
> while since I last looked) Systems calls can take between .1us and .2us 
> each. However, the ping application makes several of these calls and 
> waits for a signal from the kernel. The wait for a signal means the ping 
> application must wait to be rescheduled to report the time.Rescheduling 
> will depend on a lot of other factors in the os. eg, timers, card 
> interrupts other tasks with higher priorities. Reporting the time must 
> add a few more systems calls for this to happen. As the ping application 
> loops to post the next ping request which again requires a few systems 
> calls which may cause a task switch while in each system call. 
> 
> For the above factors, the ping application is not a good representation 
> of network performance due to factors in the application and network 
> traffic shaping performed at the switch and the tcp stacks. 
> 

I think that netperf is probably a better tool, but that also does TCP 
latencies. 

I want the real IP latency, so I assumed that ICMP would be the most 
simple one. 

The other setups I have access to are in production and do not have any 
special tuning, yet their latency is still lower than on this new 
deployment. 

That's what gets me confused. 

Wido 

> cheers, 
> gary 
> 
> 
> On Fri, Nov 7, 2014 at 4:32 PM, Łukasz Jagiełło 
> < jagiello.luk...@gmail.com > wrote: 
> 
> Hi, 
> 
> rtt min/avg/max/mdev = 0.070/0.177/0.272/0.049 ms 
> 
> 04:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit 
> SFI/SFP+ Network Connection (rev 01) 
> 
> at both hosts and Arista 7050S-64 between. 
> 
> Both hosts were part of active ceph cluster. 
> 
> 
> On Thu, Nov 6, 2014 at 5:18 AM, Wido den Hollander < w...@42on.com 
> > wrote: 
> 
> Hello, 
> 
> While working at a customer I've ran into a 10GbE latency which 
> seems 
> high to me. 
> 
> I have access to a couple of Ceph cluster and I ran a simple 
> ping test: 
> 
> $ ping -s 8192 -c 100 -n  
> 
> Two results I got: 
> 
> rtt min/avg/max/mdev = 0.080/0.131/0.235/0.039 ms 
> rtt min/avg/max/mdev = 0.128/0.168/0.226/0.023 ms 
> 
> Both these environment are running with Intel 82599ES 10Gbit 
> cards in 
> LACP. One with Extreme Networks switches, the other with Arista. 
> 
> Now, on a environment with Cisco Nexus 3000 and Nexus 7000 
> switches I'm 
> seeing: 
> 
> rtt min/avg/max/mdev = 0.160/0.244/0.298/0.029 ms 
> 
> As you can see, the Cisco Nexus network has high latency 
> compared to the 
> other setup. 
> 
> You would say the switches are to blame, but we also tried with 
> a direct 
> TwinAx connectio