Re: [ceph-users] Can't fix down+incomplete PG

2016-02-10 Thread Scott Laird
Ah, I should have mentioned--size=3, min_size=1.

I'm pretty sure that 'down_osds_we_would_probe' is the problem, but it's
not clear if there's a way to fix that.
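
For the archives, the usual approach looks roughly like the following, with
two caveats: 'ceph osd lost' only has an effect while the dead OSD ids are
still in the OSD map, and marking an OSD lost can discard any data that
existed only on it, so it assumes the surviving replicas really are complete.

# see which dead OSDs each stuck PG still wants to probe
ceph pg 18.c1 query | grep -A3 down_osds_we_would_probe

# declare those OSDs permanently gone so peering can stop waiting for them
ceph osd lost 7 --yes-i-really-mean-it
ceph osd lost 10 --yes-i-really-mean-it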



On Tue, Feb 9, 2016 at 11:30 PM Arvydas Opulskis <
arvydas.opuls...@adform.com> wrote:

> Hi,
>
>
>
> What is min_size for this pool? Maybe you need to decrease it for cluster
> to start recovering.
>
>
>
> Arvydas
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Scott Laird
> *Sent:* Wednesday, February 10, 2016 7:22 AM
> *To:* 'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com) <
> ceph-users@lists.ceph.com>
> *Subject:* [ceph-users] Can't fix down+incomplete PG
>
>
>
> I lost a few OSDs recently.  Now my cluster is unhealthy and I can't figure
> out how to get it healthy again.
>
>
>
> OSD 3, 7, 10, and 40 died in a power outage.  Now I have 10 PGs that are
> down+incomplete, but all of them seem like they should have surviving
> replicas of all data.
>
>
>
> I'm running 9.2.0.
>
>
>
> $ ceph health detail | grep down
>
> pg 18.c1 is down+incomplete, acting [11,18,9]
>
> pg 18.47 is down+incomplete, acting [11,9,22]
>
> pg 18.1d7 is down+incomplete, acting [5,31,24]
>
> pg 18.1d6 is down+incomplete, acting [22,11,5]
>
> pg 18.2af is down+incomplete, acting [19,24,18]
>
> pg 18.2dd is down+incomplete, acting [15,11,22]
>
> pg 18.2de is down+incomplete, acting [15,17,11]
>
> pg 18.3e is down+incomplete, acting [25,8,18]
>
> pg 18.3d6 is down+incomplete, acting [22,39,24]
>
> pg 18.3e6 is down+incomplete, acting [9,23,8]
>
>
>
> $ ceph pg 18.c1 query
>
> {
>
> "state": "down+incomplete",
>
> "snap_trimq": "[]",
>
> "epoch": 960905,
>
> "up": [
>
> 11,
>
> 18,
>
> 9
>
> ],
>
> "acting": [
>
> 11,
>
> 18,
>
> 9
>
> ],
>
> "info": {
>
> "pgid": "18.c1",
>
> "last_update": "0'0",
>
> "last_complete": "0'0",
>
> "log_tail": "0'0",
>
> "last_user_version": 0,
>
> "last_backfill": "MAX",
>
> "last_backfill_bitwise": 0,
>
> "purged_snaps": "[]",
>
> "history": {
>
> "epoch_created": 595523,
>
> "last_epoch_started": 954170,
>
> "last_epoch_clean": 954170,
>
> "last_epoch_split": 0,
>
> "last_epoch_marked_full": 0,
>
> "same_up_since": 959988,
>
> "same_interval_since": 959988,
>
> "same_primary_since": 959988,
>
> "last_scrub": "613947'7736",
>
> "last_scrub_stamp": "2015-11-11 21:18:35.118057",
>
> "last_deep_scrub": "613947'7736",
>
> "last_deep_scrub_stamp": "2015-11-11 21:18:35.118057",
>
> "last_clean_scrub_stamp": "2015-11-11 21:18:35.118057"
>
> },
>
> ...
>
> "probing_osds": [
>
> "9",
>
> "11",
>
> "18",
>
> "23",
>
> "25"
>
> ],
>
> "down_osds_we_would_probe": [
>
> 7,
>
> 10
>
> ],
>
> "peering_blocked_by": []
>
> },
>
> {
>
> "name": "Started",
>
> "enter_time": "2016-02-09 20:35:57.627376"
>
> }
>
> ],
>
> "agent_state": {}
>
> }
>
>
>
> I tried replacing disks. I created new OSDs 3 and 7, but neither will
> start up; the ceph-osd process starts but never actually makes it to 'up', with
> nothing obvious in the logs.  I can post logs if that helps.  Since the
> OSDs were removed a few days ago, 'ceph osd lost' doesn't seem to help.
>
>
>
> Is there a way to fix these PGs and get my cluster healthy again?
>
>
>
>
>
> Scott
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can't fix down+incomplete PG

2016-02-09 Thread Scott Laird
I lost a few OSDs recently.  Now my cluster is unhealthy and I can't figure
out how to get it healthy again.

OSD 3, 7, 10, and 40 died in a power outage.  Now I have 10 PGs that are
down+incomplete, but all of them seem like they should have surviving
replicas of all data.

I'm running 9.2.0.

$ ceph health detail | grep down
pg 18.c1 is down+incomplete, acting [11,18,9]
pg 18.47 is down+incomplete, acting [11,9,22]
pg 18.1d7 is down+incomplete, acting [5,31,24]
pg 18.1d6 is down+incomplete, acting [22,11,5]
pg 18.2af is down+incomplete, acting [19,24,18]
pg 18.2dd is down+incomplete, acting [15,11,22]
pg 18.2de is down+incomplete, acting [15,17,11]
pg 18.3e is down+incomplete, acting [25,8,18]
pg 18.3d6 is down+incomplete, acting [22,39,24]
pg 18.3e6 is down+incomplete, acting [9,23,8]

$ ceph pg 18.c1 query
{
"state": "down+incomplete",
"snap_trimq": "[]",
"epoch": 960905,
"up": [
11,
18,
9
],
"acting": [
11,
18,
9
],
"info": {
"pgid": "18.c1",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": "[]",
"history": {
"epoch_created": 595523,
"last_epoch_started": 954170,
"last_epoch_clean": 954170,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 959988,
"same_interval_since": 959988,
"same_primary_since": 959988,
"last_scrub": "613947'7736",
"last_scrub_stamp": "2015-11-11 21:18:35.118057",
"last_deep_scrub": "613947'7736",
"last_deep_scrub_stamp": "2015-11-11 21:18:35.118057",
"last_clean_scrub_stamp": "2015-11-11 21:18:35.118057"
},
...
"probing_osds": [
"9",
"11",
"18",
"23",
"25"
],
"down_osds_we_would_probe": [
7,
10
],
"peering_blocked_by": []
},
{
"name": "Started",
"enter_time": "2016-02-09 20:35:57.627376"
}
],
"agent_state": {}
}

I tried replacing disks. I created new OSDs 3 and 7, but neither will start
up; the ceph-osd process starts but never actually makes it to 'up', with
nothing obvious in the logs.  I can post logs if that helps.  Since the
OSDs were removed a few days ago, 'ceph osd lost' doesn't seem to help.
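
(For reference, the commands in question, with made-up placeholders for the
new OSD; whether 'ceph osd lost' helps depends on the OSD id still being
present in the OSD map at the time it is run:)

# check whether the old ids are still in the map
ceph osd tree | grep -E 'osd\.(3|7|10|40)'

# mark a still-mapped OSD as permanently gone
ceph osd lost 10 --yes-i-really-mean-it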

Is there a way to fix these PGs and get my cluster healthy again?


Scott
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] download.ceph.com unreachable IPv6 [was: v9.1.0 Infernalis release candidate released]

2015-11-07 Thread Scott Laird
Ah, it looks like a pMTU problem.  We negotiate a 1440-byte MSS, but that's
apparently too large for something along the way; if you look at the
tcpdump below, you'll see that the sequence number from download.ceph.com
jumps from 1 up to 4285 without the packets in between ever arriving.

08:39:23.459684 IP6 2001:470:e959:201:f652:14ff:fe09:530.50714 >
2607:f298:6050:51f3:f816:3eff:fe50:5ec.80: Flags [S], seq 1826359551, win
26820, options [mss 8940,sackOK,TS val 60514795 ecr 0,nop,wscale 7], length
0
08:39:23.577649 IP6 2607:f298:6050:51f3:f816:3eff:fe50:5ec.80 >
2001:470:e959:201:f652:14ff:fe09:530.50714: Flags [S.], seq 3196528483, ack
1826359552, win 28560, options [mss 1440,sackOK,TS val 917313247 ecr
60514795,nop,wscale 7], length 0
08:39:23.577679 IP6 2001:470:e959:201:f652:14ff:fe09:530.50714 >
2607:f298:6050:51f3:f816:3eff:fe50:5ec.80: Flags [.], ack 1, win 210,
options [nop,nop,TS val 60514825 ecr 917313247], length 0
08:39:23.577749 IP6 2001:470:e959:201:f652:14ff:fe09:530.50714 >
2607:f298:6050:51f3:f816:3eff:fe50:5ec.80: Flags [P.], seq 1:82, ack 1, win
210, options [nop,nop,TS val 60514825 ecr 917313247], length 81
08:39:23.655644 IP6 2607:f298:6050:51f3:f816:3eff:fe50:5ec.80 >
2001:470:e959:201:f652:14ff:fe09:530.50714: Flags [.], ack 82, win 224,
options [nop,nop,TS val 917313267 ecr 60514825], length 0
08:39:23.818794 IP6 2607:f298:6050:51f3:f816:3eff:fe50:5ec.80 >
2001:470:e959:201:f652:14ff:fe09:530.50714: Flags [P.], seq 4285:4764, ack
82, win 224, options [nop,nop,TS val 917313308 ecr 60514825], length 479
08:39:23.818806 IP6 2001:470:e959:201:f652:14ff:fe09:530.50714 >
2607:f298:6050:51f3:f816:3eff:fe50:5ec.80: Flags [.], ack 1, win 218,
options [nop,nop,TS val 60514885 ecr 917313267,nop,nop,sack 1 {4285:4764}],
length 0
08:39:28.706360 IP6 2607:f298:6050:51f3:f816:3eff:fe50:5ec.80 >
2001:470:e959:201:f652:14ff:fe09:530.50714: Flags [F.], seq 4764, ack 82,
win 224, options [nop,nop,TS val 917314529 ecr 60514885], length 0
08:39:28.706384 IP6 2001:470:e959:201:f652:14ff:fe09:530.50714 >
2607:f298:6050:51f3:f816:3eff:fe50:5ec.80: Flags [.], ack 1, win 218,
options [nop,nop,TS val 60516107 ecr 917313267,nop,nop,sack 1 {4285:4765}],
length 0
08:40:28.709596 IP6 2001:470:e959:201:f652:14ff:fe09:530.50714 >
2607:f298:6050:51f3:f816:3eff:fe50:5ec.80: Flags [.], ack 1, win 218,
options [nop,nop,TS val 60531108 ecr 917313267,nop,nop,sack 1 {4285:4765}],
length 0
08:40:28.875876 IP6 2607:f298:6050:51f3:f816:3eff:fe50:5ec.80 >
2001:470:e959:201:f652:14ff:fe09:530.50714: Flags [.], ack 82, win 224,
options [nop,nop,TS val 917329571 ecr 60516107], length 0

FWIW, it looks like eu.ceph.com uses an MSS of 1420, not 1440.
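
The usual workaround on the tunnel side (an assumption, not something tested
in this thread) is to lower the tunnel MTU, or clamp the MSS of forwarded TCP
SYNs to the path MTU, so segments never exceed what the path can actually
carry.  The interface name below is made up:

# shrink the HE tunnel MTU so the negotiated MSS drops with it
ip link set dev he-ipv6 mtu 1280

# or, on the router carrying the tunnel, clamp MSS to the discovered path MTU
ip6tables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu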


Scott

On Sat, Nov 7, 2015 at 8:30 AM Scott Laird  wrote:

> FWIW, I'm also having problems connecting to download.ceph.com over IPv6,
> from a HE tunnel.  I can talk to eu.ceph.com just fine.
>
> I'm seeing 100% failures over HTTP.  Here's a traceroute, including my
> address:
>
> # traceroute6 download.ceph.com
> traceroute to download.ceph.com (2607:f298:6050:51f3:f816:3eff:fe50:5ec)
> from 2001:470:e959:201:f652:14ff:fe09:530, 30 hops max, 24 byte packets
>  1  2001:470:e959:201:52c5:8dff:febe:a981
> (2001:470:e959:201:52c5:8dff:febe:a981)  1.013 ms  0.923 ms  0.8 ms
>  2  2001:470:e959:300:210:db01:2cff:1000
> (2001:470:e959:300:210:db01:2cff:1000)  0.313 ms  0.343 ms  0.177 ms
>  3  2001:470:b:685::1 (2001:470:b:685::1)  0.728 ms  0.408 ms  0.452 ms
>  4  scottlaird-2.tunnel.tserv14.sea1.ipv6.he.net (2001:470:a:685::1)
>  7.75 ms  7.831 ms  7.487 ms
>  5  v225.core1.sea1.he.net (2001:470:0:9b::1)  4.928 ms  15.432 ms  5.014
> ms
>  6  10ge13-4.core1.sjc2.he.net (2001:470:0:1c7::1)  34.031 ms  24.544 ms
>  25.762 ms
>  7  2001:470:0:34f::2 (2001:470:0:34f::2)  23.92 ms  23.942 ms  24.185 ms
>  8  2001:428::205:171:203:158 (2001:428::205:171:203:158)  74.452 ms
>  74.568 ms  74.53 ms
>  9  2001:428:2402:10:0:d:0:2 (2001:428:2402:10:0:d:0:2)  75.029 ms  74.792
> ms  75.145 ms
> 10  border11-bbnet1.wdc002.pnap.net (2600:c08:0:101:0:2:1:11)  160.907 ms
>  245.192 ms  204.006 ms
> 11  2600:c08:2002:d::2 (2600:c08:2002:d::2)  78.141 ms  78.102 ms  77.949
> ms
> 12  ip-2607-f298-5-cc01--1.dreamhost.com (2607:f298:5:cc01::1)  79.807 ms
>  80.133 ms  77.979 ms
> 13  ip-2607-f298-5-cc08--2.dreamhost.com (2607:f298:5:cc08::2)  76.345 ms
>  76.229 ms  77.439 ms
> 14  2607:f298:5:110d:f816:3eff:fe79:ad5d
> (2607:f298:5:110d:f816:3eff:fe79:ad5d)  84.761 ms  79.603 ms  78.729 ms
> 15  * * *
> 16  * * *
>
>
> On Thu, Oct 15, 2015 at 11:32 PM Corin Langosch <
> corin.lango...@netskin.com> wrote:
>
>> download.ceph.com resolves to 2607:f298:6050:51f3:f816:3eff:fe50:5ec
>> here. Ping seems to be blocked. Connect to port 80
>> works every few requests, probably 50%. So I assume there's

Re: [ceph-users] download.ceph.com unreachable IPv6 [was: v9.1.0 Infernalis release candidate released]

2015-11-07 Thread Scott Laird
FWIW, I'm also having problems connecting to download.ceph.com over IPv6,
from a HE tunnel.  I can talk to eu.ceph.com just fine.

I'm seeing 100% failures over HTTP.  Here's a traceroute, including my
address:

# traceroute6 download.ceph.com
traceroute to download.ceph.com (2607:f298:6050:51f3:f816:3eff:fe50:5ec)
from 2001:470:e959:201:f652:14ff:fe09:530, 30 hops max, 24 byte packets
 1  2001:470:e959:201:52c5:8dff:febe:a981
(2001:470:e959:201:52c5:8dff:febe:a981)  1.013 ms  0.923 ms  0.8 ms
 2  2001:470:e959:300:210:db01:2cff:1000
(2001:470:e959:300:210:db01:2cff:1000)  0.313 ms  0.343 ms  0.177 ms
 3  2001:470:b:685::1 (2001:470:b:685::1)  0.728 ms  0.408 ms  0.452 ms
 4  scottlaird-2.tunnel.tserv14.sea1.ipv6.he.net (2001:470:a:685::1)  7.75
ms  7.831 ms  7.487 ms
 5  v225.core1.sea1.he.net (2001:470:0:9b::1)  4.928 ms  15.432 ms  5.014 ms
 6  10ge13-4.core1.sjc2.he.net (2001:470:0:1c7::1)  34.031 ms  24.544 ms
 25.762 ms
 7  2001:470:0:34f::2 (2001:470:0:34f::2)  23.92 ms  23.942 ms  24.185 ms
 8  2001:428::205:171:203:158 (2001:428::205:171:203:158)  74.452 ms
 74.568 ms  74.53 ms
 9  2001:428:2402:10:0:d:0:2 (2001:428:2402:10:0:d:0:2)  75.029 ms  74.792
ms  75.145 ms
10  border11-bbnet1.wdc002.pnap.net (2600:c08:0:101:0:2:1:11)  160.907 ms
 245.192 ms  204.006 ms
11  2600:c08:2002:d::2 (2600:c08:2002:d::2)  78.141 ms  78.102 ms  77.949 ms
12  ip-2607-f298-5-cc01--1.dreamhost.com (2607:f298:5:cc01::1)  79.807 ms
 80.133 ms  77.979 ms
13  ip-2607-f298-5-cc08--2.dreamhost.com (2607:f298:5:cc08::2)  76.345 ms
 76.229 ms  77.439 ms
14  2607:f298:5:110d:f816:3eff:fe79:ad5d
(2607:f298:5:110d:f816:3eff:fe79:ad5d)  84.761 ms  79.603 ms  78.729 ms
15  * * *
16  * * *


On Thu, Oct 15, 2015 at 11:32 PM Corin Langosch 
wrote:

> download.ceph.com resolves to 2607:f298:6050:51f3:f816:3eff:fe50:5ec
> here. Ping seems to be blocked. Connect to port 80
> works every few requests, probably 50%. So I assume there's some
> load-balancer there with a dead backend, which the
> load-balancer didn't detect/ kick...just guessing. Best Corin
>
> Am 16.10.2015 um 08:27 schrieb Björn Lässig:
> > Getting the same error here.
> > With sixxs 4 out of 5 wgets are failing. ping6 is dropped.
> >
> > We tried from different sites in .at .uk and .de. Tries from uk and de
> are failing mostly. at is working.
> >
> > does ''ping6 download.ceph.com'' works for you?
> >
> > regards
> >   Björn Lässig
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Difference between CephFS and RBD

2015-07-06 Thread Scott Laird
CephFS is a filesystem, rbd is a block device.  CephFS is a lot like NFS;
it's a filesystem shared over the network where different machines can
access it all at the same time.  RBD is more like a hard disk image, shared
over the network.  It's easy to put a normal filesystem (like ext2) on top
of it and mount it on a computer, but if you mount the same RBD device on
multiple computers at once then Really Bad Things are going to happen to
the filesystem.

In general, if you want to share a bunch of files between multiple
machines, then CephFS is your best bet.  If you want to store a disk image,
perhaps for use with virtual machines, then you want RBD.  If you want
storage that is mostly compatible with Amazon's S3, then use radosgw.
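
A minimal sketch of the difference in practice (image, pool, and mount point
names are made up, and the CephFS line assumes the kernel client with an
admin key):

# RBD: a block device image, mapped and formatted on ONE machine
rbd create vm-disk0 --size 10240        # 10 GB image in the default 'rbd' pool
rbd map vm-disk0                        # shows up as e.g. /dev/rbd0
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt/vm-disk0

# CephFS: a shared filesystem, mountable from many machines at once
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secret=<key>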

On Mon, Jul 6, 2015 at 8:04 AM Hadi Montakhabi  wrote:

> Hello Cephers,
>
> I can't quite grasp the difference between CephFS and Rados Block Device
> (RBD).
> In both cases we do mount the storage on the client and it is using the
> storage from the storage cluster.
> Would someone explain it to me like I am five please?
>
> Thanks,
> Hadi
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cost- and Powerefficient OSD-Nodes

2015-04-29 Thread Scott Laird
FWIW, I tried using some 256G MX100s with ceph and had horrible performance
issues within a month or two.  I was seeing 100% utilization with high
latency but only 20 MB/s writes.  I had a number of S3500s in the same pool
that were dramatically better.  Which is to say that they were actually
faster than the hard disk pool they were fronting, rather than slower.

If you do go with MX200s, I'd recommend only using at most 80% of the
drive; most cheap SSDs perform *much* better at sustained writes if you
give them more overprovisioning space to work with.
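
A hedged sketch of one way to leave that headroom (device name is
illustrative; the drive needs to be trimmed or secure-erased first so the
unused LBAs actually return to the SSD's spare pool):

# discard the whole device so unwritten space is free again
blkdiscard /dev/sdX

# then partition only ~80% of it and never touch the rest
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart journal 0% 80%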


Scott

On Tue, Apr 28, 2015, 4:30 PM Dominik Hannen  wrote:

> > It's all about the total latency per operation. Most IO sizes over 10GB
> > don't make much difference to the Round Trip Time. But comparatively even
> > 128KB IO's over 1GB take quite a while. For example ping a host with a
> > payload of 64k over 1GB and 10GB networks and look at the difference in
> > times. Now double this for Ceph (Client->Prim OSD->Sec OSD)
> >
> > When you are using SSD journals you normally end up with write latency of
> > 3-4ms over 10GB, 1GB networking will probably increase this by another
> > 2-4ms. IOPs=1000/latency
> >
> > I guess it all really depends on how important performance is
>
> I reckon we are talking about single-threaded IOPs? It looks like 10ms
> latency is in the worst-case region... 100 IOPs will do fine.
>
> At least in my understanding heavily multi-threaded load should be able to
> get higher IOPs regardless of latency?
>
> Some presentation material suggested that the adverse effects of higher
> latency, due to 1Gbit, begin above IO sizes of 2k, maybe there is room to
> tune IOPs hungry applications/vms accordingly.
>
> > Just had a look and the Seagate Surveillance disks spin at 7200RPM
> (missed
> > that you put that there), whereas the WD ones that I am familiar with
> spin
> > at 5400rpm, so not as bad as I thought.
> >
> > So probably ok to use, but I don't see many people using them for Ceph/
> > generic NAS so can't be sure there's no hidden gotchas.
>
> I am not sure how trustworthy newegg-reviews are, but somehow I get some
> doubts about them now.
> I guess it does not matter that much, at least if not more than a disk a
> month
> is failing? The 3-year warranty gives some hope..
>
> Are there some cost-efficient HDDs that someone can suggest? (Most likely
> 3TB
> drives, that seems to be the sweet-spot at the moment.)
>
> > Sorry nothing in detail, I did actually build a ceph cluster on the same
> 8
> > core CPU as you have listed. I didn't have any performance problems but
> I do
> > remember with SSD journals when doing high queue depth writes I could get
> > the CPU quite high. It's like what I said before about the 1vs10Gb
> > networking, how important is performance, If using this CPU gives you an
> > extra 1ms of latency per OSD, is that acceptable?
> >
> > Agree 12cores (guessing 2.5Ghz each) will be an overkill for just 12
> OSDs. I
> > have a very similar spec and see exactly the same as you, but will change
> > the nodes to 1CPU each when I expand and use the spare CPU's for the new
> > nodes.
> >
> > I'm using this:-
> >
> > http://www.supermicro.nl/products/system/4U/F617/SYS-F617H6-FTPTL_.cfm
> >
> > Mainly because of rack density, which I know doesn't apply to you. But
> the
> > fact they share PSU's/Rails/Chassis helps reduce power a bit and drives
> down
> > cost
> >
> > I can get 14 disks in each and they have 10GB on board. The SAS
> controller
> > is flashable to JBOD mode.
> >
> > Maybe one of the other Twin solutions might be suitable?
>
> I did consider that exact model (It was mentioned on the list some time
> ago)
> I could get about the same effective storage-capacity with it, but
> 10G-Networking is just too expensive on the Switch-side.
>
> Also those nodes and 10G-Switches consume a lot more power.
>
> By my estimates and numbers I found, the Avoton-Nodes should run at about
> 55W
> each. The Switches (EX3300) according to tech-specs would need 76W at max
> each.
>
> ___
> Dominik
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cost- and Powerefficient OSD-Nodes

2015-04-28 Thread Scott Laird
FYI, most Juniper switches hash LAGs on IP+port, so you'd get somewhat
better performance than you would with simple MAC or IP hashing.  10G is
better if you can afford it, though.

On Tue, Apr 28, 2015 at 9:55 AM Nick Fisk  wrote:

>
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Dominik Hannen
> > Sent: 28 April 2015 17:08
> > To: Nick Fisk
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Cost- and Powerefficient OSD-Nodes
> >
> > >> Interconnect as currently planned:
> > >> 4 x 1Gbit LACP Bonds over a pair of MLAG-capable switches (planned:
> > >> EX3300)
> >
> > > If you can do 10GB networking its really worth it. I found that with
> > > 1G, latency effects your performance before you max out the bandwidth.
> > > We got some Supermicro servers with 10GB-T onboard for a tiny price
> > > difference and some basic 10GB-T switches.
> >
> > I do not expect to max out the bandwidth. My estimation would be 200 MB/s
> > r/w are needed at maximum.
> >
> > The performance-metric that suffers most as far as I read would be IOPS?
> > How many IOps do you think will be possible with 8 x 4osd-nodes with
> > 4x1Gbit (distributed among all the clients, VMs, etc)
>
> It's all about the total latency per operation. Most IO sizes over 10GB
> don't make much difference to the Round Trip Time. But comparatively even
> 128KB IO's over 1GB take quite a while. For example ping a host with a
> payload of 64k over 1GB and 10GB networks and look at the difference in
> times. Now double this for Ceph (Client->Prim OSD->Sec OSD)
>
> When you are using SSD journals you normally end up with write latency of
> 3-4ms over 10GB, 1GB networking will probably increase this by another
> 2-4ms. IOPs=1000/latency
>
> I guess it all really depends on how important performance is
>
>
> >
> > >> 250GB SSD - Journal (MX200 250GB with extreme over-provisioning,
> > >> staggered deployment, monitored for TBW-Value)
> >
> > > Not sure if that SSD would be suitable for a journal. I would
> > > recommend going with one of the Intel 3700's. You could also save a
> > > bit and run the OS from it.
> >
> > I am still on the fence about ditching the SATA-DOM and install the OS on
> the
> > SSD as well.
> >
> > If the MX200 turn out to be unsuited, I can still use them for other
> purposes
> > and fetch some better SSDs later.
> >
> > >> Seagate Surveillance HDD (ST3000VX000) 7200rpm
> >
> > > Would also possibly consider a more NAS/Enterprise friendly HDD
> >
> > I thought video-surveillance HDDs would be a nice fit; they are built to run
> > 24/7 and to write multiple data streams to disk at the same time.
> > Also cheap, which enables me to get more nodes from the start.
>
> Just had a look and the Seagate Surveillance disks spin at 7200RPM (missed
> that you put that there), whereas the WD ones that I am familiar with spin
> at 5400rpm, so not as bad as I thought.
>
> So probably ok to use, but I don't see many people using them for Ceph/
> generic NAS so can't be sure there's no hidden gotchas.
>
> >
> > > CPU might be on the limit, but would probably suffice. If anything you
> > > won't max out all the cores, but the overall speed of the CPU might
> > > increase latency, which may or may not be a problem for you.
> >
> > Do you have some values, so that I can imagine the difference?
> > I also maintain another cluster with dual-socket hexa-core Xeon
> 12osd-nodes
> > and all the CPUs do is idling. And the 2x10G LACP Link is usually never
> used
> > above 1 Gbit.
> > Hence the focus on cost-efficiency with this build.
>
> Sorry nothing in detail, I did actually build a ceph cluster on the same 8
> core CPU as you have listed. I didn't have any performance problems but I
> do
> remember with SSD journals when doing high queue depth writes I could get
> the CPU quite high. It's like what I said before about the 1vs10Gb
> networking, how important is performance, If using this CPU gives you an
> extra 1ms of latency per OSD, is that acceptable?
>
> Agree 12cores (guessing 2.5Ghz each) will be an overkill for just 12 OSDs.
> I
> have a very similar spec and see exactly the same as you, but will change
> the nodes to 1CPU each when I expand and use the spare CPU's for the new
> nodes.
>
> I'm using this:-
>
> http://www.supermicro.nl/products/system/4U/F617/SYS-F617H6-FTPTL_.cfm
>
> Mainly because of rack density, which I know doesn't apply to you. But the
> fact they share PSU's/Rails/Chassis helps reduce power a bit and drives
> down
> cost
>
> I can get 14 disks in each and they have 10GB on board. The SAS controller
> is flashable to JBOD mode.
>
> Maybe one of the other Twin solutions might be suitable?
>
> >
> > >> Are there any cost-effective suggestions to improve this
> configuration?
> >
> > > Have you looked at a normal Xeon based server but with more disks per
> > > node? Depending on how much capacity you need spending a little more
> > > per server but all

Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer

2015-04-21 Thread Scott Laird
So, this seems to work:

ceph-objectstore-tool --op list-pgs --data-path /var/lib/ceph/osd/ceph-36/
--journal-path /var/lib/ceph/osd/ceph-36/journal > /tmp/pgs

Examine /tmp/pgs, compare to 'ceph osd pool ls detail', produce a list of
invalid pgs.
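
The comparison step, roughly. This assumes 'ceph osd pool ls detail' prints
lines starting with "pool <id>", that each pgid in /tmp/pgs looks like
"<pool>.<hash>", and that the OSD stays stopped while ceph-objectstore-tool
runs:

# pool ids that still exist
ceph osd pool ls detail | awk '$1 == "pool" {print $2}' | sort -u > /tmp/live_pools

# pgids whose pool prefix no longer exists are the stale ones
awk -F. 'NR==FNR {live[$1]; next} !($1 in live)' /tmp/live_pools /tmp/pgs > /tmp/stale_pgs

# each line of /tmp/stale_pgs is then a $id for the remove command below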

Then run

ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-36/
--journal-path /var/lib/ceph/osd/ceph-36/journal --pgid $id

This OSD is now up and running; I'll start in on the rest of them.  Thanks
for the help.


Scott

On Tue, Apr 21, 2015 at 1:04 AM Samuel Just  wrote:

> Yep, you have hit bug 11429.  At some point, you removed a pool and then
> restarted these osds.  Due to the original bug, 10617, those osds never
> actually removed the pgs in that pool.  I'm working on a fix, or you can
> manually remove pgs corresponding to pools which no longer exist from the
> crashing osds using the ceph-objectstore-tool.
> -Sam
>
> ----- Original Message -
> From: "Scott Laird" 
> To: "Samuel Just" 
> Cc: "Robert LeBlanc" , "'ceph-users@lists.ceph.com'
> (ceph-users@lists.ceph.com)" 
> Sent: Monday, April 20, 2015 6:13:06 AM
> Subject: Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer
>
> They're kind of big; here are links:
>
> https://dl.dropboxusercontent.com/u/104949139/osdmap
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.36.log
>
> On Sun, Apr 19, 2015 at 8:42 PM Samuel Just  wrote:
>
> > I have a suspicion about what caused this.  Can you restart one of the
> > problem osds with
> >
> > debug osd = 20
> > debug filestore = 20
> > debug ms = 1
> >
> > and attach the resulting log from startup to crash along with the osdmap
> > binary (ceph osd getmap -o ).
> > -Sam
> >
> > - Original Message -
> > From: "Scott Laird" 
> > To: "Robert LeBlanc" 
> > Cc: "'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)" <
> > ceph-users@lists.ceph.com>
> > Sent: Sunday, April 19, 2015 6:13:55 PM
> > Subject: Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer
> >
> > Nope. Straight from 0.87 to 0.94.1. FWIW, at someone's suggestion, I just
> > upgraded the kernel on one of the boxes from 3.14 to 3.18; no
> improvement.
> > Rebooting didn't help, either. Still failing with the same error in the
> > logs.
> >
> > On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc < rob...@leblancnet.us >
> > wrote:
> >
> >
> >
> > Did you upgrade from 0.92? If you did, did you flush the logs before
> > upgrading?
> >
> > On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird < sc...@sigkill.org >
> wrote:
> >
> >
> >
> > I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs
> > die (and stay dead) with this error in the logs:
> >
> > 2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function
> > 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time
> > 2015-04-19 11:53:36.794951
> > osd/OSD.h: 716: FAILED assert(ret)
> >
> > ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x8b) [0xbc271b]
> > 2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f]
> > 3: (OSD::load_pgs()+0x1769) [0x6c35d9]
> > 4: (OSD::init()+0x71f) [0x6c4c7f]
> > 5: (main()+0x2860) [0x651fc0]
> > 6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5]
> > 7: /usr/bin/ceph-osd() [0x66aff7]
> > NOTE: a copy of the executable, or `objdump -rdS ` is needed
> > to interpret this.
> >
> > This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu
> > 14.04. So far, every single server that I've upgraded has had at least
> one
> > disk that has failed to restart with this error, and one has had several
> > disks in this state.
> >
> > Restarting the OSD after it dies with this doesn't help.
> >
> > I haven't lost any data through this due to my slow rollout, but it's
> > really annoying.
> >
> > Here are two full logs from OSDs on two different machines:
> >
> > https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log
> > https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log
> >
> > Any suggestions?
> >
> >
> > Scott
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer

2015-04-20 Thread Scott Laird
They're kind of big; here are links:

https://dl.dropboxusercontent.com/u/104949139/osdmap
https://dl.dropboxusercontent.com/u/104949139/ceph-osd.36.log

On Sun, Apr 19, 2015 at 8:42 PM Samuel Just  wrote:

> I have a suspicion about what caused this.  Can you restart one of the
> problem osds with
>
> debug osd = 20
> debug filestore = 20
> debug ms = 1
>
> and attach the resulting log from startup to crash along with the osdmap
> binary (ceph osd getmap -o ).
> -Sam
>
> ----- Original Message -
> From: "Scott Laird" 
> To: "Robert LeBlanc" 
> Cc: "'ceph-users@lists.ceph.com' (ceph-users@lists.ceph.com)" <
> ceph-users@lists.ceph.com>
> Sent: Sunday, April 19, 2015 6:13:55 PM
> Subject: Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer
>
> Nope. Straight from 0.87 to 0.94.1. FWIW, at someone's suggestion, I just
> upgraded the kernel on one of the boxes from 3.14 to 3.18; no improvement.
> Rebooting didn't help, either. Still failing with the same error in the
> logs.
>
> On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc < rob...@leblancnet.us >
> wrote:
>
>
>
> Did you upgrade from 0.92? If you did, did you flush the logs before
> upgrading?
>
> On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird < sc...@sigkill.org > wrote:
>
>
>
> I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs
> die (and stay dead) with this error in the logs:
>
> 2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function
> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time
> 2015-04-19 11:53:36.794951
> osd/OSD.h: 716: FAILED assert(ret)
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xbc271b]
> 2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f]
> 3: (OSD::load_pgs()+0x1769) [0x6c35d9]
> 4: (OSD::init()+0x71f) [0x6c4c7f]
> 5: (main()+0x2860) [0x651fc0]
> 6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5]
> 7: /usr/bin/ceph-osd() [0x66aff7]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
> This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu
> 14.04. So far, every single server that I've upgraded has had at least one
> disk that has failed to restart with this error, and one has had several
> disks in this state.
>
> Restarting the OSD after it dies with this doesn't help.
>
> I haven't lost any data through this due to my slow rollout, but it's
> really annoying.
>
> Here are two full logs from OSDs on two different machines:
>
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log
>
> Any suggestions?
>
>
> Scott
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs failing on upgrade from Giant to Hammer

2015-04-19 Thread Scott Laird
Nope.  Straight from 0.87 to 0.94.1.  FWIW, at someone's suggestion, I just
upgraded the kernel on one of the boxes from 3.14 to 3.18; no improvement.
Rebooting didn't help, either.  Still failing with the same error in the
logs.

On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc  wrote:

> Did you upgrade from 0.92? If you did, did you flush the logs before
> upgrading?
>
> On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird  wrote:
>
>> I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs
>> die (and stay dead) with this error in the logs:
>>
>> 2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function
>> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time
>> 2015-04-19 11:53:36.794951
>> osd/OSD.h: 716: FAILED assert(ret)
>>
>>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x8b) [0xbc271b]
>>  2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f]
>>  3: (OSD::load_pgs()+0x1769) [0x6c35d9]
>>  4: (OSD::init()+0x71f) [0x6c4c7f]
>>  5: (main()+0x2860) [0x651fc0]
>>  6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5]
>>  7: /usr/bin/ceph-osd() [0x66aff7]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>> to interpret this.
>>
>> This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu
>> 14.04.  So far, every single server that I've upgraded has had at least one
>> disk that has failed to restart with this error, and one has had several
>> disks in this state.
>>
>> Restarting the OSD after it dies with this doesn't help.
>>
>> I haven't lost any data through this due to my slow rollout, but it's
>> really annoying.
>>
>> Here are two full logs from OSDs on two different machines:
>>
>> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log
>> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log
>>
>> Any suggestions?
>>
>>
>> Scott
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs failing on upgrade from Giant to Hammer

2015-04-19 Thread Scott Laird
I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs
die (and stay dead) with this error in the logs:

2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function
'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time
2015-04-19 11:53:36.794951
osd/OSD.h: 716: FAILED assert(ret)

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0xbc271b]
 2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f]
 3: (OSD::load_pgs()+0x1769) [0x6c35d9]
 4: (OSD::init()+0x71f) [0x6c4c7f]
 5: (main()+0x2860) [0x651fc0]
 6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5]
 7: /usr/bin/ceph-osd() [0x66aff7]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu
14.04.  So far, every single server that I've upgraded has had at least one
disk that has failed to restart with this error, and one has had several
disks in this state.

Restarting the OSD after it dies with this doesn't help.

I haven't lost any data through this due to my slow rollout, but it's
really annoying.

Here are two full logs from OSDs on two different machines:

https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log
https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log

Any suggestions?


Scott
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Force an OSD to try to peer

2015-04-14 Thread Scott Laird
Things *mostly* work if hosts on the same network have different MTUs, at
least with TCP, because the hosts will negotiate the MSS for each
connection.  UDP will still break, but large UDP packets are less common.
You don't want to run that way for very long, but there's no need for an
atomic MTU swap.

What *really* screws things up is when the host MTU is bigger than the
switch MTU.

On Tue, Apr 14, 2015 at 1:42 AM Martin Millnert  wrote:

> On Tue, Mar 31, 2015 at 10:44:51PM +0300, koukou73gr wrote:
> > On 03/31/2015 09:23 PM, Sage Weil wrote:
> > >
> > >It's nothing specific to peering (or ceph).  The symptom we've seen is
> > >just that byte stop passing across a TCP connection, usually when there
> is
> > >some largish messages being sent.  The ping/heartbeat messages get
> through
> > >because they are small and we disable nagle so they never end up in
> large
> > >frames.
> >
> > Is there any special route one should take in order to transition a
> > live cluster to use jumbo frames and avoid such pitfalls with OSD
> > peering?
>
> 1. Configure entire switch infrastructure for jumbo frames.
> 2. Enable config versioning of switch infrastructure configurations
> 3. Bonus points: Monitor config changes of switch infrastructure
> 4. Run ping test using e.g. fping from each node to every other node,
> with large frames.
> 5. Bonus points: Setup such a test in some monitor infrastructure.
> 6. Once you trust the config (and monitoring), up all the nodes MTU
> to jumbo size, simultaneously.  This is the critical step and perhaps
> it could be further perfected. Ideally you would like an atomic
> MTU-upgrade command on the entire cluster.
>
> /M
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network redundancy pro and cons, best practice, suggestions?

2015-04-13 Thread Scott Laird
Redundancy is a means to an end, not an end itself.

If you can afford to lose component X, manually replace it, and then return
everything impacted to service, then there's no point in making X redundant.

If you can afford to lose a single disk (which Ceph certainly can), then
there's no point in local RAID.

If you can afford to lose a single machine, then there's no point in
redundant power supplies (although they can make power maintenance work a
lot less complex).

If you can afford to lose everything attached to a switch, then there's no
point in making it redundant.


Doing redundant networking to the host adds a lot of complexity that isn't
really there with single-attached hosts.  For instance, what happens if one
of the switches loses its connection to the outside world?  With LACP,
you'll probably lose connectivity to half of your peers.  Doing something
like OSPF, possibly with ECMP, avoids that problem, but certainly doesn't
make things less complicated.

In most cases, I'd avoid switch redundancy.  If I had more than 10 racks,
there's really no point, because you should be able to lose a rack without
massive disruption.  If I only had a rack or two, than I quite likely
wouldn't bother, simply because it ends up being a bigger part of the cost
and the added complexity and cost isn't worth it in most cases.

It comes down to engineering tradeoffs and money, and the right balance is
different in just about every situation.  It's a function of money,
acceptance of risk, scale, performance, networking experience, and the cost
of outages.


Scott

On Mon, Apr 13, 2015 at 4:02 AM Christian Balzer  wrote:

>
> Hello,
>
> On Mon, 13 Apr 2015 11:03:24 +0200 Götz Reinicke - IT Koordinator wrote:
>
> > Dear ceph users,
> >
> > we are planing a ceph storage cluster from scratch. Might be up to 1 PB
> > within the next 3 years, multiple buildings, new network infrastructure
> > for the cluster etc.
> >
> > I had some excellent trainings on ceph, so the essential fundamentals
> > are familiar to me, and I know our goals/dreams can be reached. :)
> >
> > There is just "one tiny piece" in the design I'm currently unsure
> > about :)
> >
> > Ceph follows some sort of keep it small and simple, e.g. dont use raid
> > controllers, use more boxes and disks, fast network etc.
> >
> While small and plenty is definitely true, some people actually use RAID
> for OSDs (like RAID1) to avoid ever having to deal with a failed OSD and
> getting a 4x replication in the end.
> Your needs and budget may of course differ.
>
> > So from our current design we plan 40Gb Storage and Client LAN.
> >
> > Would you suggest to connect the OSD nodes redundant to both networks?
> > That would end up with 4 * 40Gb ports in each box, two Switches to
> > connect to.
> >
> If you can afford it, fabric switches are quite nice, as they allow for
> LACP over 2 switches, so if everything is working you get twice the speed,
> if not still full redundancy. The Brocade VDX stuff comes to mind.
>
> However if you're not tied into an Ethernet network, you might do better
> and cheaper with an Infiniband network on the storage side of things.
> This will become even more attractive as RDMA support improves with Ceph.
>
> Separating public (client) and private (storage, OSD interconnect)
> networks with Ceph makes only sense if your storage node can actually
> utilize all that bandwidth.
>
> So at your storage node density of 12 HDDs (16 HDD chassis are not space
> efficient), 40GbE is overkill with a single link/network, insanely so with
> 2 networks.
>
> > I'd think of OSD nodes with 12 - 16 * 4TB SATA disks for "high" io
> > pools. (+ currently SSD for journal, but may be until we start, levelDB,
> > rocksDB are ready ... ?)
> >
> > Later some less io bound pools for data archiving/backup. (bigger and
> > more Disks per node)
> >
> > We would also do some Cache tiering for some pools.
> >
> > From HP, Intel, Supermicron etc reference documentations, they use
> > usually non-redundant network connection. (single 10Gb)
> >
> > I know: redundancy keeps some headaches small, but also adds some more
> > complexity and increases the budget. (add network adapters, other
> > server, more switches, etc)
> >
> Complexity not so much, cost yes.
>
> > So what would you suggest, what are your experiences?
> >
> It all depends on how small (large really) you can start.
>
> I have only small clusters with few nodes, so for me redundancy is a big
> deal.
> Thus those cluster use Infiniband, 2 switches and dual-port HCAs on the
> nodes in an active-standby mode.
>
> If you however can start with something like 10 racks (ToR switches),
> loosing one switch would mean a loss of 10% of your cluster, which is
> something it should be able to cope with.
> Especially if you configured Ceph to _not_ start re-balancing data
> automatically if a rack goes down (so that you have a chance to put a
> replacement switch in place, which you of course kept handy on-site for
> su

Re: [ceph-users] low power single disk nodes

2015-04-09 Thread Scott Laird
Minnowboard Max?  2 atom cores, 1 SATA port, and a real (non-USB) Ethernet
port.

On Thu, Apr 9, 2015, 8:03 AM p...@philw.com  wrote:

> Rather expensive option:
>
> Applied Micro X-Gene, overkill for a single disk, and only really
> available in a
> development kit format right now.
>
>  c1-development-kits/>
>
> Better Option:
>
> Ambedded CY7 - 7 nodes in 1U half Depth, 6 positions for SATA disks, and
> one
> node with mSATA SSD
>
> 
>
> --phil
>
> > On 09 April 2015 at 15:57 Quentin Hartman 
> > wrote:
> >
> >  I'm skeptical about how well this would work, but a Banana Pi might be a
> > place to start. Like a raspberry pi, but it has a SATA connector:
> > http://www.bananapi.org/
> >
> >  On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg  >  > wrote:
> >> >Hello ceph users,
> > >
> > >Is anyone running any low powered single disk nodes with Ceph now?
> > > Calxeda seems to be no more according to Wikipedia. I do not think HP
> > > moonshot is what I am looking for - I want stand-alone nodes, not
> server
> > > cartridges integrated into server chassis. And I do not want to be
> locked to
> > > a single vendor.
> > >
> > >I was playing with Raspberry Pi 2 for signage when I thought of my
> old
> > > experiments with Ceph.
> > >
> > >I am thinking of for example Odroid-C1 or Odroid-XU3 Lite or maybe
> > > something with a low-power Intel x64/x86 processor. Together with one
> SSD or
> > > one low power HDD the node could get all power via PoE (via splitter or
> > > integrated into board if such boards exist). PoE provide remote
> power-on
> > > power-off even for consumer grade nodes.
> > >
> > >The cost for a single low power node should be able to compete with
> > > traditional PC-servers price per disk. Ceph take care of redundancy.
> > >
> > >I think simple custom casing should be good enough - maybe just
> strap or
> > > velcro everything on trays in the rack, at least for the nodes with
> SSD.
> > >
> > >Kind regards,
> > >--
> > >Jerker Nyberg, Uppsala, Sweden.
> > >___
> > >ceph-users mailing list
> > >ceph-users@lists.ceph.com 
> > >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > >  >  ___
> >  ceph-users mailing list
> >  ceph-users@lists.ceph.com
> >  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-14 Thread Scott Laird
I ended up destroying the EC pool and starting over.  It was killing all of
my OSD machines, and I couldn't keep anything working right with EC in
use.  So, no core dumps and I'm not in a place to reproduce easily
anymore.  This was with Giant on Ubuntu 14.04.

On Thu Feb 12 2015 at 7:07:38 AM Mark Nelson  wrote:

> On 02/08/2015 10:41 PM, Scott Laird wrote:
> > Does anyone have a good recommendation for per-OSD memory for EC?  My EC
> > test blew up in my face when my OSDs suddenly spiked to 10+ GB per OSD
> > process as soon as any reconstruction was needed.  Which (of course)
> > caused OSDs to OOM, which meant more reconstruction, which fairly
> > immediately led to a dead cluster.  This was with Giant.  Is this
> typical?
>
> Doh, that shouldn't happen.  Can you reproduce it?  Would be especially
> nice if we could get a core dump or if you could make it happen under
> valgrind.  If the CPUs are spinning, even a perf report might prove useful.
>
> >
> > On Fri Feb 06 2015 at 2:41:50 AM Mohamed Pakkeer  > <mailto:mdfakk...@gmail.com>> wrote:
> >
> > Hi all,
> >
> > We are building EC cluster with cache tier for CephFS. We are
> > planning to use the following 1U chassis along with Intel SSD DC
> > S3700 for cache tier. It has 10 * 2.5" slots. Could you recommend a
> > suitable Intel processor and amount of RAM to cater 10 * SSDs?.
> >
> > http://www.supermicro.com/products/system/1U/1028/SYS-1028R-WTRT.cfm
> >
> >
> > Regards
> >
> > K.Mohamed Pakkeer
> >
> >
> >
> > On Fri, Feb 6, 2015 at 2:57 PM, Stephan Seitz
> > mailto:s.se...@heinlein-support.de>>
> > wrote:
> >
> > Hi,
> >
> > Am Dienstag, den 03.02.2015, 15:16 + schrieb Colombo Marco:
> > > Hi all,
> > >  I have to build a new Ceph storage cluster, after i‘ve read
> the
> > > hardware recommendations and some mail from this mailing list
> i would
> > > like to buy these servers:
> >
> > just FYI:
> >
> > SuperMicro already focuses on ceph with a productline:
> > http://www.supermicro.com/solutions/datasheet_Ceph.pdf
> > http://www.supermicro.com/solutions/storage_ceph.cfm
> >
> >
> >
> > regards,
> >
> >
> > Stephan Seitz
> >
> > --
> >
> > Heinlein Support GmbH
> > Schwedter Str. 8/9b, 10119 Berlin
> >
> > http://www.heinlein-support.de
> >
> > Tel: 030 / 405051-44
> > Fax: 030 / 405051-19
> >
> > Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht
> > Berlin-Charlottenburg,
> > Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> > --
> > Thanks & Regards
> > K.Mohamed Pakkeer
> > Mobile- 0091-8754410114
> >
> > _
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
> > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-08 Thread Scott Laird
Does anyone have a good recommendation for per-OSD memory for EC?  My EC
test blew up in my face when my OSDs suddenly spiked to 10+ GB per OSD
process as soon as any reconstruction was needed.  Which (of course) caused
OSDs to OOM, which meant more reconstruction, which fairly immediately led
to a dead cluster.  This was with Giant.  Is this typical?

On Fri Feb 06 2015 at 2:41:50 AM Mohamed Pakkeer 
wrote:

> Hi all,
>
> We are building EC cluster with cache tier for CephFS. We are planning to
> use the following 1U chassis along with Intel SSD DC S3700 for cache tier.
> It has 10 * 2.5" slots. Could you recommend a suitable Intel processor and
> amount of RAM to cater 10 * SSDs?.
>
> http://www.supermicro.com/products/system/1U/1028/SYS-1028R-WTRT.cfm
>
>
> Regards
>
> K.Mohamed Pakkeer
>
>
>
> On Fri, Feb 6, 2015 at 2:57 PM, Stephan Seitz  > wrote:
>
>> Hi,
>>
>> Am Dienstag, den 03.02.2015, 15:16 + schrieb Colombo Marco:
>> > Hi all,
>> >  I have to build a new Ceph storage cluster, after i‘ve read the
>> > hardware recommendations and some mail from this mailing list i would
>> > like to buy these servers:
>>
>> just FYI:
>>
>> SuperMicro already focuses on ceph with a productline:
>> http://www.supermicro.com/solutions/datasheet_Ceph.pdf
>> http://www.supermicro.com/solutions/storage_ceph.cfm
>>
>>
>>
>> regards,
>>
>>
>> Stephan Seitz
>>
>> --
>>
>> Heinlein Support GmbH
>> Schwedter Str. 8/9b, 10119 Berlin
>>
>> http://www.heinlein-support.de
>>
>> Tel: 030 / 405051-44
>> Fax: 030 / 405051-19
>>
>> Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht
>> Berlin-Charlottenburg,
>> Geschäftsführer: Peer Heinlein -- Sitz: Berlin
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Thanks & Regards
> K.Mohamed Pakkeer
> Mobile- 0091-8754410114
>
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multiple OSDs crashing constantly

2015-01-14 Thread Scott Laird
I'm having a problem with 0.87 on Ubuntu.  I created a cephfs filesystem on
top of a 2,2 EC pool with a cache tier and copied a bunch of data
(non-critical) onto it, and now 4 of my OSDs (on 3 physical servers) are
crash-looping on startup.  If I stop one of the crashing OSDs, then a
different OSD starts crashing.

I don't see anything useful in the OSD logs with debug osd = 20/20 and
debug ms=20/20.  It stops logging and exits.
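
(For reference, those are the stock debug knobs; they can go in the [osd]
section of ceph.conf, or be injected into a daemon that is still running.
The osd id below just matches the log linked underneath.)

[osd]
    debug osd = 20
    debug ms = 20

# or, on a live daemon:
ceph tell osd.22 injectargs '--debug-osd 20 --debug-ms 20'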

Here are the logs (500M before gzip):
https://dl.dropboxusercontent.com/u/104949139/ceph-osd-22-v0.87-crashloop2.log.gz

Rebooting the cephfs client system seems to temporarily fix the problem,
but attempting to remount cephfs causes OSDs to crash again.

Any suggestions on how to debug this?


Scott
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating from replicated pool to erasure coding

2014-12-08 Thread Scott Laird
Is it possible to move CephFS on a replicated/mirrored pool to using
erasure coding?  Assuming that it's not, is that on the roadmap for any
future release?

I have 10T in CephFS now, and I'm trying to decide if I'd be better off
blowing it away and recreating CephFS with an SSD cache tier over an EC pool.

Thanks.


Scott
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-18 Thread Scott Laird
I think I just solved at least part of the problem.

Because of the somewhat peculiar way that I have Docker configured, docker
instances on another system were being assigned my OSD's IP address,
running for a couple seconds, and then failing (for unrelated reasons).
Effectively, there was something sitting on the network throwing random
RSTs at my TCP connections and then vanishing.

Amazingly, Ceph seems to have been able to handle it *just* well enough to
make it non-obvious that the problem was external and network related.

That doesn't quite explain the issues with local OSDs acting up, though.

For now, I've moved all of my OSDs back to Ubuntu; it's more work to
manage, but on the other hand it's actually working.


Scott

On Tue Nov 18 2014 at 3:14:54 PM Gregory Farnum  wrote:

> It's a little strange, but with just the one-sided log it looks as
> though the OSD is setting up a bunch of connections and then
> deliberately tearing them down again within  second or two (i.e., this
> is not a direct messenger bug, but it might be an OSD one, or it might
> be something else).
> Is it possible that you have some firewalls set up that are allowing
> through some traffic but not others? The OSDs use a bunch of ports and
> it looks like maybe there are at least intermittent issues with them
> heartbeating.
> -Greg
>
> On Wed, Nov 12, 2014 at 11:32 AM, Scott Laird  wrote:
> > Here are the first 33k lines or so:
> > https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt
> >
> > This is a different (but more or less identical) machine from the past
> set
> > of logs.  This system doesn't have quite as many drives in it, so I
> couldn't
> > spot a same-host error burst, but it's logging tons of the same errors
> while
> > trying to talk to 10.2.0.34.
> >
> > On Wed Nov 12 2014 at 10:47:30 AM Gregory Farnum 
> wrote:
> >>
> >> On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird  wrote:
> >> > I'm having a problem with my cluster.  It's running 0.87 right now,
> but
> >> > I
> >> > saw the same behavior with 0.80.5 and 0.80.7.
> >> >
> >> > The problem is that my logs are filling up with "replacing existing
> >> > (lossy)
> >> > channel" log lines (see below), to the point where I'm filling drives
> to
> >> > 100% almost daily just with logs.
> >> >
> >> > It doesn't appear to be network related, because it happens even when
> >> > talking to other OSDs on the same host.
> >>
> >> Well, that means it's probably not physical network related, but there
> >> can still be plenty wrong with the networking stack... ;)
> >>
> >> > The logs pretty much all point to
> >> > port 0 on the remote end.  Is this an indicator that it's failing to
> >> > resolve
> >> > port numbers somehow, or is this normal at this point in connection
> >> > setup?
> >>
> >> That's definitely unusual, but I'd need to see a little more to be
> >> sure if it's bad. My guess is that these pipes are connections from
> >> the other OSD's Objecter, which is treated as a regular client and
> >> doesn't bind to a socket for incoming connections.
> >>
> >> The repetitive channel replacements are concerning, though — they can
> >> be harmless in some circumstances but this looks more like the
> >> connection is simply failing to establish and so it's retrying over
> >> and over again. Can you restart the OSDs with "debug ms = 10" in their
> >> config file and post the logs somewhere? (There is not really any
> >> documentation available on what they mean, but the deeper detail ones
> >> might also be more understandable to you.)
> >> -Greg
> >>
> >> >
> >> > The systems that are causing this problem are somewhat unusual;
> they're
> >> > running OSDs in Docker containers, but they *should* be configured to
> >> > run as
> >> > root and have full access to the host's network stack.  They manage to
> >> > work,
> >> > mostly, but things are still really flaky.
> >> >
> >> > Also, is there documentation on what the various fields mean, short of
> >> > digging through the source?  And how does Ceph resolve OSD numbers
> into
> >> > host/port addresses?
> >> >
> >> >
> >> > 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.

Re: [ceph-users] Cache tiering and cephfs

2014-11-17 Thread Scott Laird
Hmm.  I'd rather not recreate my cephfs filesystem from scratch if I don't
have to.  Has anyone managed to add a cache tier to a running cephfs
filesystem?

On Sun Nov 16 2014 at 1:39:47 PM Erik Logtenberg  wrote:

> I know that it is possible to run CephFS with a cache tier on the data
> pool in Giant, because that's what I do. However when I configured it, I
> was on the previous release. When I upgraded to Giant, everything just
> kept working.
>
> By the way, when I set it up, I used the following commands:
>
> ceph osd pool create cephfs-data 192 192 erasure
> ceph osd pool create cephfs-metadata 192 192 replicated ssd
> ceph osd pool create cephfs-data-cache 192 192 replicated ssd
> ceph osd pool set cephfs-data-cache crush_ruleset 1
> ceph osd pool set cephfs-metadata crush_ruleset 1
> ceph osd tier add cephfs-data cephfs-data-cache
> ceph osd tier cache-mode cephfs-data-cache writeback
> ceph osd tier set-overlay cephfs-data cephfs-data-cache
> ceph osd dump
> ceph mds newfs 5 6 --yes-i-really-mean-it
>
> So actually I didn't add a cache tier to an existing CephFS, but first
> made the pools and added CephFS directly after. In my case, the "ssd"
> pool is ssd-backed (obviously), while the default pool is on rotating
> media; the crush_ruleset 1 is meant to place both the cache pool and the
> metadata pool on the ssd's.
>
> Erik.
>
>
> On 11/16/2014 08:01 PM, Scott Laird wrote:
> > Is it possible to add a cache tier to cephfs's data pool in giant?
> >
> > I'm getting an error:
> >
> > $ ceph osd tier set-overlay data data-cache
> >
> > Error EBUSY: pool 'data' is in use by CephFS via its tier
> >
> >
> > From what I can see in the code, that comes from
> > OSDMonitor::_check_remove_tier; I don't understand why set-overlay needs
> > to call _check_remove_tier.  A quick look makes it look like set-overlay
> > will always fail once MDS has been set up.  Is this a bug, or am I doing
> > something wrong?
> >
> >
> > Scott
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache tiering and cephfs

2014-11-16 Thread Scott Laird
Is it possible to add a cache tier to cephfs's data pool in giant?

I'm getting an error:

$ ceph osd tier set-overlay data data-cache

Error EBUSY: pool 'data' is in use by CephFS via its tier


From what I can see in the code, that comes from
OSDMonitor::_check_remove_tier; I don't understand why set-overlay needs to
call _check_remove_tier.  A quick look makes it look like set-overlay will
always fail once MDS has been set up.  Is this a bug, or am I doing
something wrong?


Scott
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-12 Thread Scott Laird
Here are the first 33k lines or so:
https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt

This is a different (but more or less identical) machine from the past set
of logs.  This system doesn't have quite as many drives in it, so I
couldn't spot a same-host error burst, but it's logging tons of the same
errors while trying to talk to 10.2.0.34.
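For anyone hitting the same thing: the "debug ms = 10" change Greg asks for
below is just a ceph.conf entry plus an OSD restart, roughly

[osd]
debug ms = 10

Something like "ceph tell osd.* injectargs '--debug-ms 10'" should bump it on
running daemons without a restart, though treat that as a sketch rather than
gospel.  Either way it's worth turning back down afterwards; at this level
the logs grow very quickly.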

On Wed Nov 12 2014 at 10:47:30 AM Gregory Farnum  wrote:

> On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird  wrote:
> > I'm having a problem with my cluster.  It's running 0.87 right now, but I
> > saw the same behavior with 0.80.5 and 0.80.7.
> >
> > The problem is that my logs are filling up with "replacing existing
> (lossy)
> > channel" log lines (see below), to the point where I'm filling drives to
> > 100% almost daily just with logs.
> >
> > It doesn't appear to be network related, because it happens even when
> > talking to other OSDs on the same host.
>
> Well, that means it's probably not physical network related, but there
> can still be plenty wrong with the networking stack... ;)
>
> > The logs pretty much all point to
> > port 0 on the remote end.  Is this an indicator that it's failing to
> resolve
> > port numbers somehow, or is this normal at this point in connection
> setup?
>
> That's definitely unusual, but I'd need to see a little more to be
> sure if it's bad. My guess is that these pipes are connections from
> the other OSD's Objecter, which is treated as a regular client and
> doesn't bind to a socket for incoming connections.
>
> The repetitive channel replacements are concerning, though — they can
> be harmless in some circumstances but this looks more like the
> connection is simply failing to establish and so it's retrying over
> and over again. Can you restart the OSDs with "debug ms = 10" in their
> config file and post the logs somewhere? (There is not really any
> documentation available on what they mean, but the deeper detail ones
> might also be more understandable to you.)
> -Greg
>
> >
> > The systems that are causing this problem are somewhat unusual; they're
> > running OSDs in Docker containers, but they *should* be configured to
> run as
> > root and have full access to the host's network stack.  They manage to
> work,
> > mostly, but things are still really flaky.
> >
> > Also, is there documentation on what the various fields mean, short of
> > digging through the source?  And how does Ceph resolve OSD numbers into
> > host/port addresses?
> >
> >
> > 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1e070580).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
> > 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
> > c=0x1f3db2e0).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1e070420).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
> > 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
> > c=0x1f3d8420).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1e070840).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x1b2d6260).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
> > 10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
> > c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)
> >
> > 2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
> > 10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
> > c=0x1f3d9600).accept replacing existing (lossy) channel (new one lossy=1)
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-11 Thread Scott Laird
I'm having a problem with my cluster.  It's running 0.87 right now, but I
saw the same behavior with 0.80.5 and 0.80.7.

The problem is that my logs are filling up with "replacing existing (lossy)
channel" log lines (see below), to the point where I'm filling drives to
100% almost daily just with logs.

It doesn't appear to be network related, because it happens even when
talking to other OSDs on the same host.  The logs pretty much all point to
port 0 on the remote end.  Is this an indicator that it's failing to
resolve port numbers somehow, or is this normal at this point in connection
setup?

The systems that are causing this problem are somewhat unusual; they're
running OSDs in Docker containers, but they *should* be configured to run
as root and have full access to the host's network stack.  They manage to
work, mostly, but things are still really flaky.

Also, is there documentation on what the various fields mean, short of
digging through the source?  And how does Ceph resolve OSD numbers into
host/port addresses?
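(On that last question: the OSD number to address mapping comes from the OSD
map rather than DNS or config.  Something along these lines shows the public,
cluster, and heartbeat addresses each OSD registered when it started -- exact
fields vary a bit between releases:

$ ceph osd dump | grep '^osd\.'

Each osd.N line includes the addresses the daemon bound to at startup.)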


2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
c=0x1e070580).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
c=0x1f3db2e0).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
c=0x1e070420).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
c=0x1f3d8420).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
c=0x1e070840).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
c=0x1b2d6260).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
c=0x1f3d9600).accept replacing existing (lossy) channel (new one lossy=1)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-27 Thread Scott Laird
Double-check that you did it right.  Does 'ls -lL
/var/lib/ceph/osd/ceph-33/journal' resolve to a block-special device?
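For comparison, on a healthy OSD the dereferenced listing should show a
block-special device -- i.e. a leading 'b' -- something like this, with the
device numbers and date obviously illustrative:

$ ls -lL /var/lib/ceph/osd/ceph-33/journal
brw-rw---- 1 root disk 8, 65 Oct 27 11:00 /var/lib/ceph/osd/ceph-33/journal

If it shows '?' fields or "No such file or directory", the symlink target is
wrong or missing.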

On Mon Oct 27 2014 at 12:12:20 PM Steve Anthony  wrote:

>  Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using
> /dev/disk/by-id for future ODSs.
>
> I tried stopping an existing OSD on another node (which is working -
> osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal to point
> to the same partition using /dev/disk/by-id, and starting the OSD again,
> but it fails to start with:
>
> 2014-10-27 11:03:31.607060 7fa65018e780 -1
> filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal
> /var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory
> 2014-10-27 11:03:31.617262 7fa65018e780 -1  ** ERROR: error converting
> store /var/lib/ceph/osd/ceph-33: (2) No such file or directory
>
> The journal symlink exists and points to the same partition as before when
> it was /dev/sde1. Can I not change these existing symlinks manually to
> point to the same partition using /dev/disk/by-id?
>
>
> -Steve
>
>
> On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote:
> > * /dev/disk/by-id
> >
> > by-path will change if you connect it to different controller, or
> > replace your controller with other model, or put it in different pci
> > slot
> >
> > On Sat, 25 Oct 2014 17:20:58 +, Scott Laird 
> 
> > wrote:
> >
> >> You'd be best off using /dev/disk/by-path/ or similar links; that way
> they
> >> follow the disks if they're renamed again.
> >>
> >> On Fri, Oct 24, 2014, 9:40 PM Steve Anthony 
>  wrote:
> >>
> >>> Hello,
> >>>
> >>> I was having problems with a node in my cluster (Ceph v0.80.7/Debian
> >>> Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when
> >>> it came back up. Now all the symlinks to the journals are broken. The
> >>> SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde:
> >>>
> >>> root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
> >>> lrwxrwxrwx 1 root root 9 Oct 20 16:47
> /var/lib/ceph/osd/ceph-150/journal
> >>> -> /dev/sde1
> >>> lrwxrwxrwx 1 root root 9 Oct 20 16:53
> /var/lib/ceph/osd/ceph-157/journal
> >>> -> /dev/sdd1
> >>> lrwxrwxrwx 1 root root 9 Oct 21 08:31
> /var/lib/ceph/osd/ceph-164/journal
> >>> -> /dev/sdc1
> >>> lrwxrwxrwx 1 root root 9 Oct 21 16:33
> /var/lib/ceph/osd/ceph-171/journal
> >>> -> /dev/sde2
> >>> lrwxrwxrwx 1 root root 9 Oct 22 10:50
> /var/lib/ceph/osd/ceph-178/journal
> >>> -> /dev/sdc2
> >>> lrwxrwxrwx 1 root root 9 Oct 22 15:48
> /var/lib/ceph/osd/ceph-184/journal
> >>> -> /dev/sdd2
> >>> lrwxrwxrwx 1 root root 9 Oct 23 10:46
> /var/lib/ceph/osd/ceph-191/journal
> >>> -> /dev/sde3
> >>> lrwxrwxrwx 1 root root 9 Oct 23 15:22
> /var/lib/ceph/osd/ceph-195/journal
> >>> -> /dev/sdc3
> >>> lrwxrwxrwx 1 root root 9 Oct 23 16:59
> /var/lib/ceph/osd/ceph-201/journal
> >>> -> /dev/sdd3
> >>> lrwxrwxrwx 1 root root 9 Oct 24 21:32
> /var/lib/ceph/osd/ceph-214/journal
> >>> -> /dev/sde4
> >>> lrwxrwxrwx 1 root root 9 Oct 24 21:33
> /var/lib/ceph/osd/ceph-215/journal
> >>> -> /dev/sdd4
> >>>
> >>> Any way to fix this without just removing all the OSDs and re-adding
> >>> them? I thought about recreating the symlinks to point at the new SSD
> >>> labels, but I figured I'd check here first. Thanks!
> >>>
> >>> -Steve
> >>>
> >>> --
> >>> Steve Anthony
> >>> LTS HPC Support Specialist
> >>> Lehigh University
> >>> sma...@lehigh.edu
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >
> >
> >
>
> --
> Steve Anthony
> LTS HPC Support Specialist
> Lehigh University
> sma...@lehigh.edu
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-25 Thread Scott Laird
You'd be best off using /dev/disk/by-path/ or similar links; that way they
follow the disks if they're renamed again.
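As a rough illustration (drive model and serial made up here), each entry
under those directories is just a symlink back to whatever sdX name the
kernel picked on this boot, so the by-id/by-path name stays stable across
renames:

$ ls -l /dev/disk/by-id/ | grep sde
lrwxrwxrwx 1 root root  9 Oct 24 21:00 ata-ST4000DM000-1F2168_Z300XXXX -> ../../sde
lrwxrwxrwx 1 root root 10 Oct 24 21:00 ata-ST4000DM000-1F2168_Z300XXXX-part1 -> ../../sde1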

On Fri, Oct 24, 2014, 9:40 PM Steve Anthony  wrote:

> Hello,
>
> I was having problems with a node in my cluster (Ceph v0.80.7/Debian
> Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when
> it came back up. Now all the symlinks to the journals are broken. The
> SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde:
>
> root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
> lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal
> -> /dev/sde1
> lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal
> -> /dev/sdd1
> lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal
> -> /dev/sdc1
> lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal
> -> /dev/sde2
> lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal
> -> /dev/sdc2
> lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal
> -> /dev/sdd2
> lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal
> -> /dev/sde3
> lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal
> -> /dev/sdc3
> lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal
> -> /dev/sdd3
> lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal
> -> /dev/sde4
> lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal
> -> /dev/sdd4
>
> Any way to fix this without just removing all the OSDs and re-adding
> them? I thought about recreating the symlinks to point at the new SSD
> labels, but I figured I'd check here first. Thanks!
>
> -Steve
>
> --
> Steve Anthony
> LTS HPC Support Specialist
> Lehigh University
> sma...@lehigh.edu
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network hardware recommendations

2014-10-07 Thread Scott Laird
IIRC, one thing to look out for is that there are two ways to do IP over
Infiniband.  You can either do IP over Infiniband directly (IPoIB), or
encapsulate Ethernet in Infiniband (EoIB), and then do IP over the fake
Ethernet network.

IPoIB is more common, but I'd assume that IB<->Ethernet bridges really only
bridge EoIB.

On Tue Oct 07 2014 at 5:34:57 PM Christian Balzer  wrote:

> On Tue, 07 Oct 2014 20:40:31 +0000 Scott Laird wrote:
>
> > I've done this two ways in the past.  Either I'll give each machine an
> > Infiniband network link and a 1000baseT link and use the Infiniband one
> > as the private network for Ceph, or I'll throw an Infiniband card into a
> > PC and run something like Vyatta/VyOS on it and make it a router, so IP
> > traffic can get out of the IB network.  Of course, those have both been
> > for test labs.  YMMV.
> >
>
> That.
>
> Of course in a production environment you would want something with 2
> routers in a failover configuration.
> And there are switches/gateways that combine IB and Ethernet, but they
> tend to be not so cheap. ^^
>
> More below.
>
> > On Tue Oct 07 2014 at 11:05:23 AM Massimiliano Cuttini
> >  wrote:
> >
> > >  Hi Christian,
> > >
> > >  When you say "10 gig infiniband", do you mean QDRx4 Infiniband
> > > (usually flogged as 40Gb/s even though it is 32Gb/s, but who's
> > > counting), which tends to be the same basic hardware as the 10Gb/s
> > > Ethernet offerings from Mellanox?
> > >
> > > A brand new 18 port switch of that caliber will only cost about 180$
> > > per port, too.
> > >
> > >
> > >
> > > I investigate about infiniband but i didn't found affordable prices at
> > > all.
>
> Then you're doing it wrong or comparing apples to oranges (you of course
> need to compare IB switches to similar 10GbE ones).
> And the prices of HCA (aka network cards in the servers) and cabling.
>
> > > Moreover how do you connect your *legacy node servers* to your
> > > *brand new storages* if you have Infiniband only on storages &
> > > switches? Is there any mixed switch that allow you both to connect
> > > with Infiniband and Ethernet?
> > >
> > > If there is, please send specs because i cannot find just by google it.
> > >
> The moment you type in "infiniband et" google will already predict amongst
> other pertinent things "infiniband ethernet gateway" and "infiniband
> ethernet bridge".
> But even "infiniband ethernet switch" has a link telling you pretty much
> what was said here now at the 6th position:
> http://www.tomshardware.com/forum/44997-42-connect-
> infiniband-switch-ethernet
>
> Christian
> > > Thanks,
> > > Max
> > >
> > >  ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network hardware recommendations

2014-10-07 Thread Scott Laird
I've done this two ways in the past.  Either I'll give each machine an
Infiniband network link and a 1000baseT link and use the Infiniband one as
the private network for Ceph, or I'll throw an Infiniband card into a PC
and run something like Vyatta/VyOS on it and make it a router, so IP
traffic can get out of the IB network.  Of course, those have both been for
test labs.  YMMV.

On Tue Oct 07 2014 at 11:05:23 AM Massimiliano Cuttini 
wrote:

>  Hi Christian,
>
>  When you say "10 gig infiniband", do you mean QDRx4 Infiniband (usually
> flogged as 40Gb/s even though it is 32Gb/s, but who's counting), which
> tends to be the same basic hardware as the 10Gb/s Ethernet offerings from
> Mellanox?
>
> A brand new 18 port switch of that caliber will only cost about 180$ per
> port, too.
>
>
>
> I investigate about infiniband but i didn't found affordable prices at all
> Moreover how do you connect your *legacy node servers* to your *brand
> new storages* if you have Infiniband only on storages & switches?
> Is there any mixed switch that allow you both to connect with Infiniband
> and Ethernet?
>
> If there is, please send specs because i cannot find just by google it.
>
> Thanks,
> Max
>
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Scott Laird
IOPS are weird things with SSDs.  In theory, you'd see 25% of the write
IOPS when writing to a 4-way RAID5 device, since you write to all 4 devices
in parallel.  Except that's not actually true--unlike HDs where an IOP is
an IOP, SSD IOPS limits are really just a function of request size.
 Because each operation would be ~1/3rd the size, you should see a net of
about 3x the performance of one drive overall, or 75% of the sum of the
drives.  The CPU use will be higher, but it may or may not be a substantial
hit for your use case.  Journals are basically write-only, and 200G S3700s
are supposed to be able to sustain around 360 MB/sec, so RAID 5 would give
you somewhere around 1 GB/sec writing on paper.  Depending on your access
patterns, that may or may not be a win vs single SSDs; it should give you
slightly lower latency for uncongested writes at the very least.  It's
probably worth benchmarking if you have the time.
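(Back-of-the-envelope version of that estimate: with a 4-drive RAID5 one
drive's worth of each stripe goes to parity, so usable write bandwidth is
roughly (4 - 1) x 360 MB/s = 1080 MB/s, or about 1 GB/s, before any
controller or CPU overhead.)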

OTOH, S3700s seem to be pretty reliable, and if your cluster is big enough
to handle the loss of 5 OSDs without a big hit, then the lack of complexity
may be a bigger win all on its own.


Scott

On Sat Sep 06 2014 at 9:28:32 AM Dan Van Der Ster 
wrote:

>  RAID5... Hadn't considered it due to the IOPS penalty (it would get
> 1/4th of the IOPS of separated journal devices, according to some online
> raid calc). Compared to RAID10, I guess we'd get 50% more capacity, but
> lower performance.
>
> After the anecdotes that the DCS3700 is very rarely failing, and without a
> stable bcache to build upon, I'm leaning toward the usual 5 journal
> partitions per SSD. But that will leave at least 100GB free per drive, so I
> might try running an OSD there.
>
> Cheers, Dan
> On Sep 6, 2014 6:07 PM, Scott Laird  wrote:
>  Backing up slightly, have you considered RAID 5 over your SSDs?
>  Practically speaking, there's no performance downside to RAID 5 when your
> devices aren't IOPS-bound.
>
> On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer  wrote:
>
>> On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:
>>
>> > September 6 2014 4:01 PM, "Christian Balzer"  wrote:
>> > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
>> > >
>> > >> Hi Christian,
>> > >>
>> > >> Let's keep debating until a dev corrects us ;)
>> > >
>> > > For the time being, I give the recent:
>> > >
>> > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
>> > >
>> > > And not so recent:
>> > > http://www.spinics.net/lists/ceph-users/msg04152.html
>> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
>> > >
>> > > And I'm not going to use BTRFS for mainly RBD backed VM images
>> > > (fragmentation city), never mind the other stability issues that crop
>> > > up here ever so often.
>> >
>> >
>> > Thanks for the links... So until I learn otherwise, I better assume the
>> > OSD is lost when the journal fails. Even though I haven't understood
>> > exactly why :( I'm going to UTSL to understand the consistency better.
>> > An op state diagram would help, but I didn't find one yet.
>> >
>> Using the source as an option of last resort is always nice, having to
>> actually do so for something like this feels a bit lacking in the
>> documentation department (that or my google foo being weak). ^o^
>>
>> > BTW, do you happen to know, _if_ we re-use an OSD after the journal has
>> > failed, are any object inconsistencies going to be found by a
>> > scrub/deep-scrub?
>> >
>> No idea.
>> And really a scenario I hope to never encounter. ^^;;
>>
>> > >>
>> > >> We have 4 servers in a 3U rack, then each of those servers is
>> > >> connected to one of these enclosures with a single SAS cable.
>> > >>
>> > >>>> With the current config, when I dd to all drives in parallel I can
>> > >>>> write at 24*74MB/s = 1776MB/s.
>> > >>>
>> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
>> > >>> lanes, so as far as that bus goes, it can do 4GB/s.
>> > >>> And given your storage pod I assume it is connected with 2 mini-SAS
>> > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
>> > >>> bandwidth.
>> > >>
>> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
>> > >
>> > > A

Re: [ceph-users] SSD journal deployment experiences

2014-09-06 Thread Scott Laird
Backing up slightly, have you considered RAID 5 over your SSDs?
 Practically speaking, there's no performance downside to RAID 5 when your
devices aren't IOPS-bound.

On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer  wrote:

> On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote:
>
> > September 6 2014 4:01 PM, "Christian Balzer"  wrote:
> > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote:
> > >
> > >> Hi Christian,
> > >>
> > >> Let's keep debating until a dev corrects us ;)
> > >
> > > For the time being, I give the recent:
> > >
> > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> > >
> > > And not so recent:
> > > http://www.spinics.net/lists/ceph-users/msg04152.html
> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > >
> > > And I'm not going to use BTRFS for mainly RBD backed VM images
> > > (fragmentation city), never mind the other stability issues that crop
> > > up here ever so often.
> >
> >
> > Thanks for the links... So until I learn otherwise, I better assume the
> > OSD is lost when the journal fails. Even though I haven't understood
> > exactly why :( I'm going to UTSL to understand the consistency better.
> > An op state diagram would help, but I didn't find one yet.
> >
> Using the source as an option of last resort is always nice, having to
> actually do so for something like this feels a bit lacking in the
> documentation department (that or my google foo being weak). ^o^
>
> > BTW, do you happen to know, _if_ we re-use an OSD after the journal has
> > failed, are any object inconsistencies going to be found by a
> > scrub/deep-scrub?
> >
> No idea.
> And really a scenario I hope to never encounter. ^^;;
>
> > >>
> > >> We have 4 servers in a 3U rack, then each of those servers is
> > >> connected to one of these enclosures with a single SAS cable.
> > >>
> >  With the current config, when I dd to all drives in parallel I can
> >  write at 24*74MB/s = 1776MB/s.
> > >>>
> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
> > >>> lanes, so as far as that bus goes, it can do 4GB/s.
> > >>> And given your storage pod I assume it is connected with 2 mini-SAS
> > >>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
> > >>> bandwidth.
> > >>
> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> > >
> > > Alright, that explains that then. Any reason for not using both ports?
> > >
> >
> > Probably to minimize costs, and since the single 10Gig-E is a bottleneck
> > anyway. The whole thing is suboptimal anyway, since this hardware was
> > not purchased for Ceph to begin with. Hence retrofitting SSDs, etc...
> >
> The single 10Gb/s link is the bottleneck for sustained stuff, but when
> looking at spikes...
> Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
> might also get some loving. ^o^
>
> The cluster I'm currently building is based on storage nodes with 4 SSDs
> (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8
> HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for
> redundancy, not speed. ^^
>
> > >>> Impressive, even given your huge cluster with 1128 OSDs.
> > >>> However that's not really answering my question, how much data is on
> > >>> an average OSD and thus gets backfilled in that hour?
> > >>
> > >> That's true -- our drives have around 300TB on them. So I guess it
> > >> will take longer - 3x longer - when the drives are 1TB full.
> > >
> > > On your slides, when the crazy user filled the cluster with 250 million
> > > objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> > >
> >
> > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not
> > close to 1PB. The point was that to fill the cluster with RBD, we'd need
> > 250 million (4MB) objects. So, object-count-wise this was a full
> > cluster, but for the real volume it was more like 70TB IIRC (there were
> > some other larger objects too).
> >
> Ah, I see. ^^
>
> > In that case, the backfilling was CPU-bound, or perhaps
> > wbthrottle-bound, I don't remember... It was just that there were many
> > tiny tiny objects to synchronize.
> >
> Indeed. This is something me and others have seen as well, as in
> backfilling being much slower than the underlying HW would permit and
> being CPU intensive.
>
> > > Anyway, I guess the lesson to take away from this is that size and
> > > parallelism does indeed help, but even in a cluster like yours
> > > recovering from a 2TB loss would likely be in the 10 hour range...
> >
> > Bigger clusters probably backfill faster simply because there are more
> > OSDs involved in the backfilling. In our cluster we initially get 30-40
> > backfills in parallel after 1 OSD fails. That's even with max backfills
> > = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is
> > spent backfilling the last 20% of the PGs, just because some OSDs
> > randomly get more new PGs than

Re: [ceph-users] question about monitor and paxos relationship

2014-08-31 Thread Scott Laird
If you want your data to be N+2 redundant (able to handle 2 failures, more
or less), then you need to set size=3 and have 3 replicas of your data.

If you want your monitors to be N+2 redundant, then you need 5 monitors.

If you feel that your data is worth size=3, then you should really try to
have 5 monitors.  Unless you're building a cluster with <5 servers, of
course.


This is common to pretty much every quorum-based system in existence, not
just Ceph.  In my experience, 1 replica is fine for test instances that
have no expectation of data persistence or availability, 3 replicas is okay
for small instances that don't need any sort of strong availability
guarantee, and 5 replicas is really where you need to be for any sort of
large-scale production use.  I've been stuck using 3-way replicated quorum
systems in large-scale production systems, and it made any sort of planned
maintenance absolutely terrifying.  Or really any back-end outage at all,
because you're left operating completely without a net.  Any additional
failure and the service craters spectacularly and publicly.  Since I really
hate reading newspaper articles about outages in my systems, I use 5-way
quorums whenever possible.


Scott

On Sat Aug 30 2014 at 7:40:18 PM Joao Eduardo Luis 
wrote:

> Nigel mistakenly replied just to me, CC'ing the list.
>
> On 08/30/2014 08:12 AM, Nigel Williams wrote:
> > On Sat, Aug 30, 2014 at 11:59 AM, Joao Eduardo Luis
> >  wrote:
> >> But yeah, if you're going with 2 or 4, you'll be better off with 3 or
> 5.  As
> >> long as you don't go with 1 you should be okay.
> >
> > On a recent panel discussion one member strongly advocated 5 as the
> > minimum number of MONs for a large Ceph deployment. Large in this case
> > was PBs of storage.
> >
> > For a Ceph cluster with 100s of OSDs and 100s of TB across multiple
> > racks (therefore many paths involved) is 5 x MONs a good rule-of-thumb
> > or is three sufficient?
>
> Whoever stated that was probably right.  I don't often like to speak
> about what works best for (really) large deployments as I don't often
> see them.  In theory, 5 monitors will fare better than 3 for 100s of OSDs.
>
> As far as the monitors are concerned, this will be so mostly because 5
> monitors are able to serve more maps concurrently than 3 monitors would.
>   I don't think we have tests to back my reasoning here, but I don't
> think that the cluster workload or its size would have that much of an
> impact on the number of monitors.  Albeit a technical detail, the fact
> is that every message that an OSD would send to a monitor that would
> trigger an update to a map is *always* forwarded to the leader monitor.
>   This means that regardless of how many monitors you have, you'll
> always end up with the same monitor dealing with the map updates and
> that always puts a cap on map update throughput -- this is not that big
> of a deal, usually, and knobs may be adjusted if need be.
>
> On the other hand, given you have 5 monitors instead of 3 means that
> you'll be able to spread OSD connections throughout more monitors, and
> even if updates are forwarded to the leader, connection-wise the load is
> more spread out -- the message is forwarded by the monitor the OSD
> connects to, and said monitor will act as a proxy in replying to the
> OSD, so there's less hammering the leader directly.
>
> But the point where this actually may make a real difference is in
> serving osdmap updates.  So, the OSDs need those updates.  Even
> considering that OSDs will share maps amongst themselves, they still
> need to get them from somewhere -- and that somewhere is the monitor
> cluster.  If you have 100s of OSDs connected to just 3 monitors, each
> monitor will end up serving bunches of reads (sending map updates to
> OSDs) while dealing with messages that will trigger map updates (which
> will in turn be forwarded to the leader).  Given that any client (OSDs
> included) connect to monitors at random at start and maintain that
> connection for a while, a "rule of thumb" would tell us that the leader
> would be responsible for serving 1/3 of all map reads while still
> handling map updates.  Having 5 monitors would reduce this load to 1/5.
>
> However, I don't know of a good indicator to whether a given cluster
> should go with 5 monitors instead of 3.  Or 7 monitors instead of 5.  I
> don't think there are many clusters running 7 monitors, but it may so be
> that for even larger clusters, having 5 or 7 monitors serving updates
> makes up for the increased number of messages required to commit an
> update -- keep in mind that due to Paxos nature one always needs an ack
> for an update from at least (N+1)/2 monitors.  Again, this is twofold:
> we may have more messages being passed around, but given each monitor is
> under a lower load we may even get to them faster.
>
> I think I went a bit offtrack.
>
> Let me know if this led to further confusion instead.
>
>-Joao
>
>
> --
> Joao Eduardo Luis
> S

Re: [ceph-users] Ceph networks, to bond or not to bond?

2014-06-05 Thread Scott Laird
Doing bonding without LACP is probably going to end up being painful.
 Sooner or later you're going to end up with one end thinking that bonding
is working while the other end thinks that it's not, and half of your
traffic is going to get black-holed.

I've had moderately decent luck running Ceph on top of a weird network by
carefully controlling the source address that every outbound connection
uses and then telling Ceph that it's running with a 1-network config.  With
Linux, the default source address of an outbound TCP connection is a
function of the route that the kernel picks to send traffic to the remote
end, and you can override it on a per-route basis (it's visible as the
'src' attribute in iproute).  I have a mixed Infiniband+GigE network with
each host running an OSPF routing daemon (for non-Ceph reasons, mostly),
and the only two ways that I could get Ceph to be happy were:

1.  Turn off the Infiniband network.  Slow, and causes other problems.
2.  Tell Ceph that there was no cluster network, and tell the OSPF daemon
to always set src=$eth0_ip on routes that it adds.  Then just pretend that
the Ethernet network is the only one that exists, and sometimes you get a
sudden and unexpected boost in bandwidth due to /32 routes that send
traffic via Infiniband instead of Ethernet.
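Concretely, the per-route source override in option 2 boils down to something
like this (addresses and interface names here are purely illustrative):

# make outbound TCP connections over this route originate from the eth0 address
ip route replace 10.2.0.0/24 dev eth0 src 10.2.0.36
# routes injected for the IB side can carry the same src attribute
ip route replace 10.2.0.40/32 dev ib0 src 10.2.0.36

With every route carrying the Ethernet address as its preferred source, Ceph
only ever sees one address per host, which is what lets the 1-network config
work.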

It works, but I wouldn't recommend it for production.  It would have been
cheaper for me to buy a 10 GigE switch and cards for my garage than to have
debugged all of this, and that's just for a hobby project.

OTOH, it's probably the only way to get working multipathing for Ceph.


On Thu, Jun 5, 2014 at 10:50 AM, Cedric Lemarchand 
wrote:

> On 05/06/2014 18:27, Sven Budde wrote:
> > Hi Alexandre,
> >
> > thanks for the reply. As said, my switches are not stackable, so using
> LCAP seems not to be my best option.
> >
> > I'm seeking for an explanation how Ceph is utilizing two (or more)
> independent links on both the public and the cluster network.
> AFAIK, Ceph do not support multiple IP link in the same "designated
> network" (aka client/osd networks). Ceph is not aware of links
> aggregations, it has to be done at the Ethernet layer, so :
>
> - if your switchs are stackable, you can use traditional LACP on both
> sides (switch and Ceph)
> - if they are not, and as Mariusz said, use the appropriate bonding mode
> on the Ceph side and do not use LCAP on switchs.
>
> More infos here :
> http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding
>
> Cheers !
> >
> > If I configure two IPs for the public network on two NICs, will Ceph
> route traffic from its (multiple) OSDs on this node over both IPs?
> >
> > Cheers,
> > Sven
> >
> > -----Original Message-----
> > From: Alexandre DERUMIER [mailto:aderum...@odiso.com]
> > Sent: Thursday, 5 June 2014 18:14
> > To: Sven Budde
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Ceph networks, to bond or not to bond?
> >
> > Hi,
> >
> >>> My low-budget setup consists of two gigabit switches, capable of LACP,
> >>> but not stackable. For redundancy, I'd like to have my links spread
> >>> evenly over both switches.
> > If you want to do lacp with both switches, they need to be stackable.
> >
> > (or use active-backup bonding)
> >
> >>> My question where I didn't find a conclusive answer in the
> >>> documentation and mailing archives:
> >>> Will the OSDs utilize both 'single' interfaces per network, if I
> >>> assign two IPs per public and per cluster network? Or will all OSDs
> >>> just bind on one IP and use only a single link?
> > you just need 1 ip by bond.
> >
> > with lacp, the load balacing use an hash algorithm, to loadbalance tcp
> connections.
> > (that also mean than 1 connection can't use more than 1 link)
> >
> > check that your switch support ip+port hash algorithm,
> (xmit_hash_policy=layer3+4  is linux lacp bonding)
> >
> > like this, each osd->osd can be loadbalanced, same for your clients->osd.
> >
> >
> >
> >
> >
> >
> > ----- Original Mail -----
> >
> > De: "Sven Budde" 
> > À: ceph-users@lists.ceph.com
> > Envoyé: Jeudi 5 Juin 2014 16:20:04
> > Objet: [ceph-users] Ceph networks, to bond or not to bond?
> >
> > Hello all,
> >
> > I'm currently building a new small cluster with three nodes, each node
> having 4x 1 Gbit/s network interfaces available and 8-10 OSDs running per
> node.
> >
> > I thought I assign 2x 1 Gb/s for the public network, and the other 2x 1
> Gb/s for the cluster network.
> >
> > My low-budget setup consists of two gigabit switches, capable of LACP,
> but not stackable. For redundancy, I'd like to have my links spread evenly
> over both switches.
> >
> > My question where I didn't find a conclusive answer in the documentation
> and mailing archives:
> > Will the OSDs utilize both 'single' interfaces per network, if I assign
> two IPs per public and per cluster network? Or will all OSDs just bind on
> one IP and use only a single link?
> >
> > I'd rather avoid bonding the NICs, as if one switch fa

Re: [ceph-users] btrfs + cache tier = disaster

2014-06-02 Thread Scott Laird
Oh, and thanks for the "filestore btrfs snap = false" pointer.  In
ceph.conf, under [osd], I assume?
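(For the archives: filestore settings are OSD-side, so [osd] -- or [global] --
is where this goes; the snippet is simply

[osd]
filestore btrfs snap = false

and the OSDs need a restart to pick it up.)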


On Mon, Jun 2, 2014 at 10:07 AM, Scott Laird  wrote:

> FWIW, I figured out the ceph "out of memory" error that was keeping me
> from recovering one FS:
>
> # ls -l /mnt
> ls: cannot access /mnt/snap_3415219: Cannot allocate memory
> total 5242920
> -rw-r--r-- 1 root root472 May 23 19:16 activate.monmap
> -rw-r--r-- 1 root root  3 May 23 19:16 active
> -rw-r--r-- 1 root root 37 May 23 19:14 ceph_fsid
> drwxr-xr-x 1 root root  11688 May 31 15:10 current
> -rw-r--r-- 1 root root 37 May 23 19:14 fsid
> -rw-r--r-- 1 root root 5368709120 Jun  2 08:11 journal
> -rw--- 1 root root 56 May 23 19:16 keyring
> -rw-r--r-- 1 root root 21 May 23 19:14 magic
> -rw-r--r-- 1 root root  6 May 23 19:16 ready
> d? ? ?? ?? snap_3415219
> drwxr-xr-x 1 root root  11688 May 31 15:10 snap_3415260
> drwxr-xr-x 1 root root  11688 May 31 15:10 snap_3415296
> -rw-r--r-- 1 root root  4 May 23 19:16 store_version
> -rw-r--r-- 1 root root 69 May 29 22:34 superblock
> -rw-r--r-- 1 root root  0 May 23 19:16 upstart
> -rw-r--r-- 1 root root  2 May 23 19:16 whoami
>
> snap_3415219 is clearly corrupt.  I'm going to duplicate the filesystem
> (it's only 50G, doesn't take long) without the file and see if that'll work.
>
>
> On Mon, Jun 2, 2014 at 9:51 AM, Dmitry Smirnov 
> wrote:
>
>> On Mon, 2 Jun 2014 17:47:57 Thorwald Lundqvist wrote:
>> > I'd say don't use btrfs at all, it has proven unstable for us in
>> production
>> > even without cache. It's just not ready for production use.
>>
>> Perception of stability depends on experience. For instance some consider
>> XFS
>> to be ready for production but it does not tolerate power loss which lead
>> to
>> loss of data. Also fixing corrupted XFS may not be possible due to
>> xfs_repair
>> memory requirements.
>>
>> Ready for production or not depends on testing (building confidence) and
>> understanding limitations. As a matter of fact Btrfs is very stable and
>> reliable on recent kernels (3.11+) if used pretty much as ext4 i.e.
>> without
>> advanced features (e.g. snapshots, subvolumes etc.).
>>
>> Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in
>> later releases.
>>
>> Unfortunately even latest Linux can crash and corrupt Btrfs file system if
>> OSDs are using snapshots (which is the default). Due to kernel bugs
>> related to
>> Btrfs snapshots I also lost some OSDs until I found that snapshotting can
>> be
>> disabled with "filestore btrfs snap = false".
>>
>> So far I'm very happy with Btrfs stability on OSDs when snapshots are
>> disabled.
>>
>> --
>> Cheers,
>>  Dmitry Smirnov
>>  GPG key : 4096R/53968D1B
>>
>> ---
>>
>> Every decent man is ashamed of the government he lives under.
>> -- H. L. Mencken
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs + cache tier = disaster

2014-06-02 Thread Scott Laird
FWIW, I figured out the ceph "out of memory" error that was keeping me from
recovering one FS:

# ls -l /mnt
ls: cannot access /mnt/snap_3415219: Cannot allocate memory
total 5242920
-rw-r--r-- 1 root root        472 May 23 19:16 activate.monmap
-rw-r--r-- 1 root root  3 May 23 19:16 active
-rw-r--r-- 1 root root 37 May 23 19:14 ceph_fsid
drwxr-xr-x 1 root root  11688 May 31 15:10 current
-rw-r--r-- 1 root root 37 May 23 19:14 fsid
-rw-r--r-- 1 root root 5368709120 Jun  2 08:11 journal
-rw------- 1 root root 56 May 23 19:16 keyring
-rw-r--r-- 1 root root 21 May 23 19:14 magic
-rw-r--r-- 1 root root  6 May 23 19:16 ready
d? ? ?? ?? snap_3415219
drwxr-xr-x 1 root root  11688 May 31 15:10 snap_3415260
drwxr-xr-x 1 root root  11688 May 31 15:10 snap_3415296
-rw-r--r-- 1 root root  4 May 23 19:16 store_version
-rw-r--r-- 1 root root 69 May 29 22:34 superblock
-rw-r--r-- 1 root root  0 May 23 19:16 upstart
-rw-r--r-- 1 root root  2 May 23 19:16 whoami

snap_3415219 is clearly corrupt.  I'm going to duplicate the filesystem
(it's only 50G, doesn't take long) without the file and see if that'll work.


On Mon, Jun 2, 2014 at 9:51 AM, Dmitry Smirnov 
wrote:

> On Mon, 2 Jun 2014 17:47:57 Thorwald Lundqvist wrote:
> > I'd say don't use btrfs at all, it has proven unstable for us in
> production
> > even without cache. It's just not ready for production use.
>
> Perception of stability depends on experience. For instance some consider
> XFS
> to be ready for production but it does not tolerate power loss which lead
> to
> loss of data. Also fixing corrupted XFS may not be possible due to
> xfs_repair
> memory requirements.
>
> Ready for production or not depends on testing (building confidence) and
> understanding limitations. As a matter of fact Btrfs is very stable and
> reliable on recent kernels (3.11+) if used pretty much as ext4 i.e. without
> advanced features (e.g. snapshots, subvolumes etc.).
>
> Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in
> later releases.
>
> Unfortunately even latest Linux can crash and corrupt Btrfs file system if
> OSDs are using snapshots (which is the default). Due to kernel bugs
> related to
> Btrfs snapshots I also lost some OSDs until I found that snapshotting can
> be
> disabled with "filestore btrfs snap = false".
>
> So far I'm very happy with Btrfs stability on OSDs when snapshots are
> disabled.
>
> --
> Cheers,
>  Dmitry Smirnov
>  GPG key : 4096R/53968D1B
>
> ---
>
> Every decent man is ashamed of the government he lives under.
> -- H. L. Mencken
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs + cache tier = disaster

2014-06-02 Thread Scott Laird
I can cope with single-FS failures, within reason.  It's the coordinated
failures across multiple servers that really freak me out.


On Mon, Jun 2, 2014 at 8:47 AM, Thorwald Lundqvist 
wrote:

> I'm sorry to hear about that.
>
> I'd say don't use btrfs at all, it has proven unstable for us in
> production even without cache. It's just not ready for production use.
>
>
> On Mon, Jun 2, 2014 at 5:20 PM, Scott Laird  wrote:
>
>> I found a fun failure mode this weekend.
>>
>> I have 6 SSDs in my 6-node Ceph cluster at home.  The SSDs are
>> partitioned; about half of the SSD is used for journal space for other
>> OSDs, and half holds an OSD for a cache tier.  I finally turned on the
>> cache late last week, and everything was great, until yesterday morning,
>> when my whole cluster was down, hard.
>>
>> Apparently, I mis-set target_max_bytes, because 5 of the 6 SSD partitions
>> were 100% full.  On the 5 full machines (running Ubuntu's 3.14.1 kernel),
>> the cache filesystem was unreadable; any attempt to access it threw kernel
>> errors.  Rebooting cleared up 2 of those, leaving me with 3 of 6 devices
>> alive in the pool, and 3 devices with corrupt filesystems.
>>
>> Apparently btrfs really, *REALLY* doesn't like full filesystems, because
>> filling them 100% full seems to have fatally corrupted them.  No power
>> loss, etc. involved.
>>
>> Trying to mount the filesystems fails, giving btrfs messages like this:
>>
>> [81720.111053] BTRFS: device fsid 319cbd8a-71ac-4b42-9d5c-b02658e75cdc
>> devid 1 transid 61429 /dev/sde9
>> [81720.113074] BTRFS info (device sde9): disk space caching is enabled
>> [81720.188759] BTRFS: detected SSD devices, enabling SSD mode
>> [81720.195442] BTRFS error (device sde9): block group 36528193536 has
>> wrong amount of free space
>> [81720.195488] BTRFS error (device sde9): failed to load free space cache
>> for block group 36528193536
>> [81720.205248] btree_readpage_end_io_hook: 69 callbacks suppressed
>> [81720.205252] BTRFS: bad tree block start 0 395247616
>> [81720.205622] BTRFS: bad tree block start 0 395247616
>> [81720.212772] BTRFS: bad tree block start 0 39714816
>> [81720.213152] BTRFS: bad tree block start 0 39714816
>> [81720.213551] BTRFS: bad tree block start 0 39714816
>> [81720.213925] BTRFS: bad tree block start 0 39714816
>> [81720.214324] BTRFS: bad tree block start 0 39714816
>> [81720.214697] BTRFS: bad tree block start 0 39714816
>> [81720.215070] BTRFS: bad tree block start 0 39714816
>> [81720.215441] BTRFS: bad tree block start 0 39714816
>> [81720.246457] BTRFS: error (device sde9) in open_ctree:2839: errno=-5 IO
>> failure (Failed to recover log tree)
>> [81720.277276] BTRFS: open_ctree failed
>>
>> btrfsck wasn't helpful on the one system that I tried it on.  Nor was
>> mounting with -o ro,recovery.  I can mount the filesystems if I run
>> btrfs-zero-log (after dding a FS image), but Ceph is unhappy:
>>
>>
>> # ceph-osd -i 9 -d
>> 2014-06-02 08:10:49.217019 7f9873cc4800  0 ceph version 0.80.1
>> (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 17600
>> starting osd.9 at :/0 osd_data /var/lib/ceph/osd/ceph-9
>> /var/lib/ceph/osd/ceph-9/journal
>> 2014-06-02 08:10:49.219400 7f9873cc4800  0
>> filestore(/var/lib/ceph/osd/ceph-9) mount detected btrfs
>> 2014-06-02 08:10:49.232826 7f9873cc4800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
>> ioctl is supported and appears to work
>> 2014-06-02 08:10:49.232838 7f9873cc4800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
>> ioctl is disabled via 'filestore fiemap' config option
>> 2014-06-02 08:10:49.247357 7f9873cc4800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features:
>> syncfs(2) syscall fully supported (by glibc and kernel)
>> 2014-06-02 08:10:49.247677 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: CLONE_RANGE
>> ioctl is supported
>> 2014-06-02 08:10:49.261718 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: SNAP_CREATE
>> is supported
>> 2014-06-02 08:10:49.262442 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
>> SNAP_DESTROY is supported
>> 2014-06-02 08:10:49.263020 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: START_SYNC
>> is supported (transid 71371)
>> 2014-06-02 08:10:49.269221 7f9873cc4800  0
>> btrfsfilestoreback

[ceph-users] btrfs + cache tier = disaster

2014-06-02 Thread Scott Laird
I found a fun failure mode this weekend.

I have 6 SSDs in my 6-node Ceph cluster at home.  The SSDs are partitioned;
about half of the SSD is used for journal space for other OSDs, and half
holds an OSD for a cache tier.  I finally turned on the cache late last
week, and everything was great, until yesterday morning, when my whole
cluster was down, hard.

Apparently, I mis-set target_max_bytes, because 5 of the 6 SSD partitions
were 100% full.  On the 5 full machines (running Ubuntu's 3.14.1 kernel),
the cache filesystem was unreadable; any attempt to access it threw kernel
errors.  Rebooting cleared up 2 of those, leaving me with 3 of 6 devices
alive in the pool, and 3 devices with corrupt filesystems.
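(The setting involved, for anyone tuning a cache tier: target_max_bytes is a
per-pool property, and the safe move is to cap it well below the real
partition size and lean on the ratio knobs as well.  Pool name and values
here are purely illustrative:

ceph osd pool set ssd-cache target_max_bytes 80000000000
ceph osd pool set ssd-cache cache_target_full_ratio 0.8

The flushing/eviction agent only works relative to these limits, so if
they're set above what the devices can actually hold, nothing stops the pool
from filling up.)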

Apparently btrfs really, *REALLY* doesn't like full filesystems, because
filling them 100% full seems to have fatally corrupted them.  No power
loss, etc. involved.

Trying to mount the filesystems fails, giving btrfs messages like this:

[81720.111053] BTRFS: device fsid 319cbd8a-71ac-4b42-9d5c-b02658e75cdc
devid 1 transid 61429 /dev/sde9
[81720.113074] BTRFS info (device sde9): disk space caching is enabled
[81720.188759] BTRFS: detected SSD devices, enabling SSD mode
[81720.195442] BTRFS error (device sde9): block group 36528193536 has wrong
amount of free space
[81720.195488] BTRFS error (device sde9): failed to load free space cache
for block group 36528193536
[81720.205248] btree_readpage_end_io_hook: 69 callbacks suppressed
[81720.205252] BTRFS: bad tree block start 0 395247616
[81720.205622] BTRFS: bad tree block start 0 395247616
[81720.212772] BTRFS: bad tree block start 0 39714816
[81720.213152] BTRFS: bad tree block start 0 39714816
[81720.213551] BTRFS: bad tree block start 0 39714816
[81720.213925] BTRFS: bad tree block start 0 39714816
[81720.214324] BTRFS: bad tree block start 0 39714816
[81720.214697] BTRFS: bad tree block start 0 39714816
[81720.215070] BTRFS: bad tree block start 0 39714816
[81720.215441] BTRFS: bad tree block start 0 39714816
[81720.246457] BTRFS: error (device sde9) in open_ctree:2839: errno=-5 IO
failure (Failed to recover log tree)
[81720.277276] BTRFS: open_ctree failed

btrfsck wasn't helpful on the one system that I tried it on.  Nor was
mounting with -o ro,recovery.  I can mount the filesystems if I run
btrfs-zero-log (after dding a FS image), but Ceph is unhappy:


# ceph-osd -i 9 -d
2014-06-02 08:10:49.217019 7f9873cc4800  0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 17600
starting osd.9 at :/0 osd_data /var/lib/ceph/osd/ceph-9
/var/lib/ceph/osd/ceph-9/journal
2014-06-02 08:10:49.219400 7f9873cc4800  0
filestore(/var/lib/ceph/osd/ceph-9) mount detected btrfs
2014-06-02 08:10:49.232826 7f9873cc4800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
ioctl is supported and appears to work
2014-06-02 08:10:49.232838 7f9873cc4800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2014-06-02 08:10:49.247357 7f9873cc4800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-02 08:10:49.247677 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: CLONE_RANGE
ioctl is supported
2014-06-02 08:10:49.261718 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: SNAP_CREATE
is supported
2014-06-02 08:10:49.262442 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
SNAP_DESTROY is supported
2014-06-02 08:10:49.263020 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: START_SYNC
is supported (transid 71371)
2014-06-02 08:10:49.269221 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: WAIT_SYNC
is supported
2014-06-02 08:10:49.270902 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
SNAP_CREATE_V2 is supported
2014-06-02 08:10:49.275792 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) list_checkpoints: stat
'/var/lib/ceph/osd/ceph-9/snap_3415219' failed: (12) Cannot allocate memory
2014-06-02 08:10:49.275900 7f9873cc4800 -1
filestore(/var/lib/ceph/osd/ceph-9) FileStore::mount : error in
_list_snaps: (12) Cannot allocate memory
2014-06-02 08:10:49.275936 7f9873cc4800 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-9: (12) Cannot allocate memory


Similarly, I can recover most of the data via 'btrfs restore', but Ceph has
a different failure mode:

# ceph-osd -i 16 -d
2014-06-02 08:12:41.590122 7fdfda65e800  0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 5094
starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
/var/lib/ceph/osd/ceph-16/journal
2014-06-02 08:12:41.621624 7fdfda65e800  0
filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
2014-06-02 08:12:41.693025 7fdfda65e800  0
genericfilestorebackend(/var/

Re: [ceph-users] Is there a way to repair placement groups? [Offtopic - ZFS]

2014-05-28 Thread Scott Laird
IMHO, you were probably either benchmarking the wrong thing or had a really
unusual use profile.  RAIDZ* always does full-stripe reads so it can verify
checksums, so even small reads hit all of the devices in the vdev.  That
means that you get 0 parallelism on small reads, unlike most other RAID5+
systems where disks can read independently.  So, with an 8-disk RAIDZ
(RAIDZ2, RAIDZ3...) setup, you'll only get 1 disk worth of read IOPS,
instead of 6-8 disks worth.  That's a pretty massive hit.

The bandwidth, availability, and general hassle of RAIDZ2 is nice, though.


On Tue, May 27, 2014 at 3:49 PM, Craig Lewis wrote:

>  On 5/27/14 13:40 , phowell wrote:
>
> Hi
>
> First apologies if this is the wrong place to ask this question.
>
> We are running a small Ceph (0.79) cluster with about 12 osd's which are
> on top of a zfs raid 1+0 (for another discussion)... which were created on
> this version.
>
>
> Just a reminder to benchmark everything, especially things you have known
> to be true since the dawn of time.  I benchmarked RAID10 vs. RAID5 so long
> ago, I had to find a 3.5" floppy to open the spreadsheet.
>
>
> Recently, I was testing ZFS on software encrypted volumes, and wanted to
> see how badly it would impact a PostgreSQL server.  My test setup was using
> RAIDZ2, so I just ran the benchmark on that zpool.
>
> Imagine my surprise when an untuned and encrypted RAIDZ2 posted better
> benchmarks than a tuned ZFS RAID10.
>
>
> I really think the "RAID5 is bad for performance" is a nasty hold-over
> from when parity calculations needed dedicated hardware.  I won't be
> building any more ZFS RAID10 arrays.
>
>
> --
>
>  *Craig Lewis*
>  Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
>  *Central Desktop. Work together in ways you never thought possible.*
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Feature request: stable naming for external journals

2014-05-22 Thread Scott Laird
I recently created a few OSDs with journals on a partitioned SSD.  Example:

$ ceph-deploy osd prepare v2:sde:sda8

It worked fine at first, but after rebooting, the new OSD failed to start.
 I discovered that the journal drive had been renamed from /dev/sda to
/dev/sdc, so the journal symlink in /var/lib/ceph/osd/ceph-XX no longer
pointed to the correct block device.

I have a couple requests/suggestions:

1.  Make this clearer in the logs.  I've seen at least a couple cases where
a simple "Unable to open journal" message would have saved me a bunch of
time.

2.  Consider some method of generating more stable journal names under the
hood.  I'm using /dev/disk/by-id/... under Ubuntu, but that's probably not
generally portable.  I've been tempted to put a filesystem on my journal
devices, mount it by UUID, and then symlink to a file on the mounted
device.  It's not as fast, but at least it'd have a stable name.
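In the meantime, one way to repoint an existing journal symlink by hand is
roughly the following (device path is illustrative, and the upstart syntax
assumes Ubuntu):

stop ceph-osd id=12
cd /var/lib/ceph/osd/ceph-12
ln -sf /dev/disk/by-id/ata-INTEL_SSDSC2BA200G3_XXXXXXXX-part8 journal
start ceph-osd id=12

That at least survives the sdX shuffle, even if it's not something ceph-deploy
sets up for you.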

(This was caused by adding an SSD and then moving / onto it; during the
reboots needed for migrating /, drive ordering changed several times.  It
probably wouldn't have happened if I'd started with hardware bought new and
dedicated to Ceph)


Scott
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com