Re: [ceph-users] Canonical Livepatch broke CephFS client

2019-08-14 Thread Tim Bishop
On Wed, Aug 14, 2019 at 12:44:15PM +0200, Ilya Dryomov wrote:
> On Tue, Aug 13, 2019 at 10:56 PM Tim Bishop  wrote:
> > This email is mostly a heads up for others who might be using
> > Canonical's livepatch on Ubuntu on a CephFS client.
> >
> > I have an Ubuntu 18.04 client with the standard kernel currently at
> > version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
> > with the kernel client. Cluster is running mimic 13.2.6. I've got
> > livepatch running and this evening it did an update:
> >
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch 
> > service.
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
> > Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 
> > for 4.15.0-54.58-generic
> > Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not 
> > signed with a trusted key
> > Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 
> > 'lkp_Ubuntu_4_15_0_54_58_generic_54'
> > Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
> > 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
> > Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
> > 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
> > Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 
> > 54.1 to 4.15.0-54.58-generic
> >
> > And then immediately I saw:
> >
> > Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 
> > 1.2.3.4:6800 socket closed (con state OPEN)
> >
> > And on the MDS:
> >
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message signature 
> > does not match contents.
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on 
> > message:
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
> > 10517606059379971075
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally calculated 
> > signature:
> > 2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
> > sig_check:4899837294009305543
> > 2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
> > 2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
> > 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
> > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
> > Signature check failed
> >
> > Thankfully I was able to umount -f to unfreeze the client, but I have
> > been unsuccessful remounting the file system using the kernel client.
> > The fuse client worked fine as a workaround, but is slower.
> >
> > Taking a look at livepatch 54.1 I can see it touches Ceph code in the
> > kernel:
> >
> > https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29
> >
> > But the relevance of those changes isn't immediately clear to me. I
> > expect after a reboot it'll be fine, but as yet untested.
> 
> These changes are very relevant.  They introduce support for CEPHX_V2
> protocol, where message signatures are computed slightly differently:
> same algorithm but a different set of inputs.  The live-patched kernel
> likely started signing using CEPHX_V2 without renegotiating.

Ah - thanks for looking. It sounds like that change wasn't a security
fix, so arguably it shouldn't have been included in the live patch.

> This is a good example of how live-patching can go wrong.  A reboot
> should definitely help.

Yup, it certainly has its tradeoffs (although not having to reboot so
regularly is a definite positive). I've replicated the issue on a test
machine and confirmed that a reboot does indeed fix the problem.
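
For anyone wanting to hold off further livepatches on CephFS clients
until a reboot can be scheduled, something along these lines should do
it (check the subcommands against your canonical-livepatch version):

  canonical-livepatch status --verbose   # show the applied patch version and state
  sudo canonical-livepatch disable       # detach the machine from the livepatch service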

Thanks,

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



[ceph-users] Canonical Livepatch broke CephFS client

2019-08-13 Thread Tim Bishop
Hi,

This email is mostly a heads up for others who might be using
Canonical's livepatch on Ubuntu on a CephFS client.

I have an Ubuntu 18.04 client with the standard kernel currently at
version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
with the kernel client. Cluster is running mimic 13.2.6. I've got
livepatch running and this evening it did an update:

Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch 
service.
Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 for 
4.15.0-54.58-generic
Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not signed 
with a trusted key
Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 
'lkp_Ubuntu_4_15_0_54_58_generic_54'
Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 
'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 
'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 54.1 
to 4.15.0-54.58-generic

And then immediately I saw:

Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)
Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)
Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)
Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 1.2.3.4:6800 
socket closed (con state OPEN)

And on the MDS:

2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 Message signature 
does not match contents.
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Signature on message:
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367sig: 
10517606059379971075
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367Locally calculated 
signature:
2019-08-13 17:34:18.286 7ff165e75700  0 SIGN: MSG 9241367 
sig_check:4899837294009305543
2019-08-13 17:34:18.286 7ff165e75700  0 Signature failed.
2019-08-13 17:34:18.286 7ff165e75700  0 -- 1.2.3.4:6800/512468759 >> 
4.3.2.1:0/928333509 conn(0xe6b9500 :6800 >> 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process >> 
Signature check failed

Thankfully I was able to umount -f to unfreeze the client, but I have
been unsuccessful remounting the file system using the kernel client.
The fuse client worked fine as a workaround, but is slower.
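
For reference, the workaround was roughly as follows (the mount point
and monitor address are just examples):

  umount -f /mnt/cephfs                                  # force-unmount the hung kernel mount
  ceph-fuse -n client.myid -m 1.2.3.4:6789 /mnt/cephfs   # temporary FUSE mount in its place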

Taking a look at livepatch 54.1 I can see it touches Ceph code in the
kernel:

https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29

But the relevance of those changes isn't immediately clear to me. I
expect after a reboot it'll be fine, but as yet untested.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



[ceph-users] cephfs kernel client hung after eviction

2019-01-24 Thread Tim Bishop
Hi,

I have a cephfs kernel client (Ubuntu 18.04 4.15.0-34-generic) that's
completely hung after the client was evicted by the MDS.

The client logged:

Jan 24 17:31:46 client kernel: [10733559.309496] libceph: FULL or reached pool 
quota
Jan 24 17:32:26 client kernel: [10733599.232213] libceph: mon0 n.n.n.n:6789 
session lost, hunting for new mon

And the MDS logged:

2019-01-24 17:36:38.859 7f3ac7844700  0 log_channel(cluster) log [WRN] : 
evicting unresponsive client client:cephfs-client (86527773), after 300.081 
seconds

Looking in mdsc shows:

% head /sys/kernel/debug/ceph/[id].client86527773/mdsc
20      mds0    getattr  #103d042
21      mds0    getattr  #103d042
22      mds0    getattr  #103d042
23      mds0    getattr  #103d042
24      mds0    getattr  #103d042
25      mds0    getattr  #103d042
26      mds0    getattr  #103d042
27      mds0    getattr  #103d042
28      mds0    getattr  #103d042
29      mds0    getattr  #103d042

But osdc hangs when I try to access it.

I've tried umount -f but it hangs too. umount -l hides the problem (df
no longer hangs), but any processes that were trying to access the mount
are still blocked.

I've also tried switching back and forth to standby MDSs in case that
unblocked something. There are no current OSD blacklist entries either.
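
For completeness, this is roughly what I've been checking (the MDS name
is an example; the daemon command runs on the MDS host):

  ceph osd blacklist ls              # no entries for this client
  ceph daemon mds.mds1 session ls    # client sessions as the MDS sees them
  cat /sys/kernel/debug/ceph/*/osdc  # this is the file that hangs for me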

It looks like rebooting is the only option, but that's somewhat of a
pain to do. There are lots of people using this machine :-(

Any ideas?

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] Slow requests from bluestore osds

2018-09-05 Thread Tim Bishop
On Sat, Sep 01, 2018 at 12:45:06PM -0400, Brett Chancellor wrote:
> Hi Cephers,
>   I am in the process of upgrading a cluster from Filestore to bluestore,
> but I'm concerned about frequent warnings popping up against the new
> bluestore devices. I'm frequently seeing messages like this, although the
> specific osd changes, it's always one of the few hosts I've converted to
> bluestore.
> 
> 6 ops are blocked > 32.768 sec on osd.219
> 1 osds have slow requests
> 
> I'm running 12.2.4, have any of you seen similar issues? It seems as though
> these messages pop up more frequently when one of the bluestore pgs is
> involved in a scrub.  I'll include my bluestore creation process below, in
> case that might cause an issue. (sdb, sdc, sdd are SATA, sde and sdf are
> SSD)

I had the same issue for some time after a bunch of updates, including
switching to Luminous and from Filestore to Bluestore. I was unable to
track down what was causing it.
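
For what it's worth, this is the sort of digging I mean - roughly the
following, borrowing osd.219 from your example (the daemon commands run
on the OSD's host):

  ceph health detail | grep -i slow
  ceph daemon osd.219 dump_ops_in_flight       # what is currently blocked and why
  ceph daemon osd.219 dump_historic_slow_ops   # recent slow ops, if your release has it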

Then recently I did another bunch of updates including Ceph 12.2.7 and
the latest Ubuntu 16.04 updates (which would have included microcode
updates), and the problem has as good as gone away. My monitoring also
shows reduced latency on my RBD volumes and end users are reporting
improved performance.

Just throwing this out there in case it helps anyone else - I appreciate
it's very anecdotal.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] Disk write cache - safe?

2018-03-15 Thread Tim Bishop
Thank you Christian, David, and Reed for your responses.

My servers have the Dell H730 RAID controller in them, but I have the
OSD disks in Non-RAID mode. When initially testing I compared single
RAID-0 containers with Non-RAID and the Non-RAID performance was
acceptable, so I opted for the configuration with fewer components
between Ceph and the disks. This seemed to be the safer approach at the
time.

What I obviously hadn't realised was that the drive caches were enabled.
Without those caches the difference is much greater, and the latency
is now becoming a problem.
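
For anyone wanting to quantify it, a queue-depth-1 sync write test shows
the difference clearly - something like this (the filename, size and
runtime are just examples):

  fio --name=synclat --filename=/var/lib/test.fio --size=1G --rw=randwrite \
      --bs=4k --ioengine=libaio --iodepth=1 --direct=1 --fsync=1 \
      --runtime=30 --time_based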

My reading of the documentation led me to think along the lines
Christian mentions below - that is, that data in flight would be lost,
but that the disks should be consistent and still usable. But it would
be nice to get confirmation of whether that holds for Bluestore.
However, it looks like this wasn't the case for Reed, although perhaps
that was at an earlier time when Ceph and/or Linux didn't handle things
as well?

I had also thought that our power supply was pretty safe - redundant
PSUs with independent feeds, redundant UPSs, and a generator. But Reed's
experiences certainly highlight that even that can fail, so it was good
to hear that from someone else rather than experience it first hand.

I do have tape backups, but recovery would be a pain, so based on all
your comments I'll leave the drive caches off and look at using the RAID
controller cache with its BBU instead.

Tim.

On Thu, Mar 15, 2018 at 04:13:49PM +0900, Christian Balzer wrote:
> Hello,
> 
> what has been said by others before is essentially true, as in if you want:
> 
> - as much data conservation as possible and have
> - RAID controllers with decent amounts of cache and a BBU
> 
> then disabling the on disk cache is the way to go.
> 
> But as you found out, w/o those caches and a controller cache to replace
> them, performance will tank.
> 
> And of course any data only in the pagecache (dirty) and not yet flushed
> to the controller/disks is lost anyway in a power failure.
> 
> All current FS _should_ be powerfail safe (barriers) in the sense that you
> may loose the data in the disk caches (if properly exposed to the OS and
> the controller or disk not lying about having written data to disk) but
> the FS will be consistent and not "all will be lost".
> 
> I'm hoping that this is true for Bluestore, but somebody needs to do that
> testing.
> 
> So if you can live with the loss of the in-transit data in the disk caches
> in addition to the pagecache and/or you trust your DC never to lose
> power, go ahead and re-enable the disk caches.
> 
> If you have the money and need for a sound happy sleep, do the BBU
> controller cache dance. 
> Some controllers (Areca comes to mind) actually manage IT-mode-style
> exposure of the disks and still use their HW cache.
> 
> Christian

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



[ceph-users] Disk write cache - safe?

2018-03-14 Thread Tim Bishop
I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A recent [1]
update to the PERC firmware disabled the disk write cache by default,
which made a noticeable difference to the latency on my disks (spinning
disks, not SSD) - by as much as a factor of 10.

For reference their change list says:

"Changes default value of drive cache for 6 Gbps SATA drive to disabled.
This is to align with the industry for SATA drives. This may result in a
performance degradation especially in non-Raid mode. You must perform an
AC reboot to see existing configurations change."

It's fairly straightforward to re-enable the cache either in the PERC
BIOS, or by using hdparm, and doing so returns the latency back to what
it was before.
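
For the hdparm route it's just the write-caching flag (the device name
is an example):

  hdparm -W /dev/sdb     # query the current drive write cache setting
  hdparm -W1 /dev/sdb    # enable the drive write cache
  hdparm -W0 /dev/sdb    # disable it again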

Checking the Ceph documentation I can see that older versions [2]
recommended disabling the write cache for older kernels. But given I'm
using a newer kernel, and there's no mention of this in the Luminous
docs, is it safe to assume it's ok to enable the disk write cache now?

If it makes a difference, I'm using a mixture of filestore and bluestore
OSDs - migration is still ongoing.

Thanks,

Tim.

[1] - 
https://www.dell.com/support/home/uk/en/ukdhs1/Drivers/DriversDetails?driverId=8WK8N
[2] - 
http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



[ceph-users] Dashboard runs on all manager instances?

2018-01-09 Thread Tim Bishop
Hi,

I've recently upgraded from Jewel to Luminous and I'm therefore new to
using the Dashboard. I noted this section in the documentation:

http://docs.ceph.com/docs/master/mgr/dashboard/#load-balancer

"Please note that the dashboard will only start on the
manager which is active at that moment. Query the Ceph
cluster status to see which manager is active (e.g., ceph
mgr dump). In order to make the dashboard available via a
consistent URL regardless of which manager daemon is currently
active, you may want to set up a load balancer front-end
to direct traffic to whichever manager endpoint is available.
If you use a reverse http proxy that forwards a subpath to
the dashboard, you need to configure url_prefix (see above)."

However, from what I can see the dashboard is actually started on all
manager instances. On the standby instances it simply serves a redirect
to the active instance. So the above documentation appears to be
incorrect?
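
(For reference, this is how I'm checking which instance is active and
what the standbys do - the hostname and the default port 7000 are
examples:)

  ceph mgr dump | grep active_name   # which mgr currently owns the dashboard
  curl -sI http://mgr2:7000/         # a standby just answers with a redirect to the active mgr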

I was planning to use Apache's mod_proxy_balancer to just pick the
active instance, but the above makes that tricky. How have others solved
this? Or do you just pick the right instance and go direct? Or maybe I'm
missing a config option to make the dashboard only run on the active
manager?

Thanks,
Tim.

-- 
Tim Bishop
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] Is anyone seeing issues with task_numa_find_cpu?

2016-06-28 Thread Tim Bishop
0x3b/0x50
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686173]
> [] SYSC_recvfrom+0xe1/0x160
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686202]
> [] ? ktime_get_ts64+0x45/0xf0
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686230]
> [] SyS_recvfrom+0xe/0x10
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686259]
> [] entry_SYSCALL_64_fastpath+0x16/0x71
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686287] Code: 55 b0 4c
> 89 f7 e8 53 cd ff ff 48 8b 55 b0 49 8b 4e 78 48 8b 82 d8 01 00 00 48
> 83 c1 01 31 d2 49 0f af 86 b0 00 00 00 4c 8b 73 78 <48> f7 f1 48 8b 4b
> 20 49 89 c0 48 29 c1 48 8b 45 d0 4c 03 43 48
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686512] RIP
> [] task_numa_find_cpu+0x22e/0x6f0
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686544]  RSP 
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686896] ---[ end trace
> 544cb9f68cb55c93 ]---
> Jun 28 09:52:15 roc04r-sca090 kernel: [138246.669713] mpt2sas_cm0:
> log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
> Jun 28 09:55:01 roc0


Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



[ceph-users] New Ceph mirror

2016-06-20 Thread Tim Bishop
Hi Wido,

Six months or so ago you asked for new Ceph mirrors. I saw there wasn't
currently one in the UK, so I've set one up following your guidelines
here:

https://github.com/ceph/ceph/tree/master/mirroring

The mirror is available at:

http://ceph.mirrorservice.org/

Over both IPv4 and IPv6, and rsync is enabled too. The vhost is
configured to respond to uk.ceph.com should you feel it's appropriate to
use that for our site.

Our service is hosted on the UK academic network and has an 8 Gbit
uplink speed. We're also users of Ceph ourselves within the School of
Computing here at the University, so I'm following the Ceph developments
closely.

If you need any further information from us please let me know.

Tim.

-- 
Tim Bishop,
Computing Officer, School of Computing, University of Kent.
PGP Key: 0x6C226B37FDF38D55



[ceph-users] OSD - single drive RAID 0 or JBOD?

2016-05-07 Thread Tim Bishop
Hi all,

I've got servers (Dell R730xd) with a number of drives in connected to a
Dell H730 RAID controller. I'm trying to make a decision about whether I
should put the drives in "Non-RAID" mode, or if I should make individual
RAID 0 arrays for each drive.

Going for the RAID 0 approach would mean that the cache on the
controller will be used for the drives giving some increased
performance. But it comes at the expense of less direct access to the
disks for the operating system and Ceph, and I have had one oddity[1]
with a drive that went away when switched from RAID 0 to Non-RAID.

There are currently no SSDs being used as journals in the machines, so
the increased performance is beneficial, but data integrity is obviously
paramount.

I've seen recommendations both ways on this list, so I'm hoping for
feedback from people who've made the same decision, possibly with
evidence (positive or negative) about either approach.

Thanks!

Tim.

[1] The OS was getting I/O errors when reading certain files which gave
scrub errors. No problems were shown in the RAID controller, and a check
of the disk didn't reveal any issues. Switching to Non-RAID mode and
recreating the OSD fixed the problem and there haven't been any issues
since.
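
(For context, the scrub errors in [1] showed up via the usual commands,
something like the following - the pg id is made up:)

  ceph health detail | grep -i inconsistent   # lists the affected PG(s)
  rados list-inconsistent-obj 1.2f            # recent releases only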

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] krbd map on Jewel, sysfs write failed when rbd map

2016-04-18 Thread Tim Bishop



Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] Calamari Goes Open Source

2014-05-31 Thread Tim Bishop
This is great news! Are there (or will there be) binary packages
available?

Tim.

On Fri, May 30, 2014 at 06:04:42PM -0400, Patrick McGarry wrote:
> Hey cephers,
> 
> Sorry to push this announcement so late on a Friday but...
> 
> Calamari has arrived!
> 
> The source code bits have been flipped, the ticket tracker has been
> moved, and we have even given you a little bit of background from both
> a technical and vision point of view:
> 
> Technical (ceph.com):
> http://ceph.com/community/ceph-calamari-goes-open-source/
> 
> Vision (inktank.com):
> http://www.inktank.com/software/future-of-calamari/
> 
> The ceph.com link should give you everything you need to know about
> what tech comprises Calamari, where the source lives, and where the
> discussions will take place.  If you have any questions feel free to
> hit the new ceph-calamari list or stop by IRC and we'll get you
> started.  Hope you all enjoy the GUI!
> 
> Best Regards,
> 
> Patrick McGarry

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



[ceph-users] Dell H310

2014-03-07 Thread Tim Bishop
Hi all,

I'm hoping to get some feedback on the Dell H310 (LSI SAS2008 chipset).
Based on searching I'd done previously I got the impression that people
generally recommended avoiding it in favour of the higher specced H710
(LSI SAS2208 chipset).

But yesterday I read Inktank's Hardware Configuration Guide[1] document
(great document by the way!), and it mentions the H310 as a possible
controller. So maybe it's not as bad as I'd assumed?

It'd be great to hear from people using it successfully, or from people
who've had bad experiences with it.

Tim.

[1] http://info.inktank.com/hardware_configuration_guide

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] Ubuntu 13.10 packages

2014-02-27 Thread Tim Bishop
Hi Michael,

On Tue, Feb 25, 2014 at 10:01:31PM +, Michael wrote:
> Just wondering if there was a reason for no packages for Ubuntu Saucy in 
> http://ceph.com/packages/ceph-extras/debian/dists/. Could do with 
> upgrading to fix a few bugs but would hate to have to drop Ceph from 
> being handled through the package manager!

The raring packages work fine with saucy. Not ideal, but it's the
easiest solution at the moment.
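
i.e. something along these lines in sources.list, pointing saucy
machines at the raring packages (the component name may differ):

  deb http://ceph.com/packages/ceph-extras/debian raring main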

I'd like to see native saucy packages too, and actually it'd be good to
get trusty sorted soon too so it's all ready for the release.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] How does Ceph deal with OSDs that have been away for a while?

2014-02-21 Thread Tim Bishop
Thanks Greg. Can I just confirm, does it do a full backfill
automatically in the case where the log no longer overlaps?

I guess the key question is - do I have to worry about it, or will it
always "do the right thing"?

Tim.

On Fri, Feb 21, 2014 at 11:57:09AM -0800, Gregory Farnum wrote:
> It depends on how long ago (in terms of data writes) it disappeared.
> Each PG has a log of the changes that have been made (by default I
> think it's 3000? Maybe just 1k), and if an OSD goes away and comes
> back while the logs still overlap it will just sync up the changed
> objects. Otherwise it has to do a full backfill across the PG.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> On Fri, Feb 21, 2014 at 10:33 AM, Tim Bishop  wrote:
> > I'm wondering how Ceph deals with OSDs that have been away for a while.
> > Do they need to be completely rebuilt, or does it know which objects are
> > good and which need to go?
> >
> > I know Ceph handles well the situation of an OSD going away, and
> > rebalances etc to maintain the required redundancy levels. But I'm
> > unsure what it does when an OSD comes back some time later still
> > containing data.
> >
> > Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



[ceph-users] How does Ceph deal with OSDs that have been away for a while?

2014-02-21 Thread Tim Bishop
I'm wondering how Ceph deals with OSDs that have been away for a while.
Do they need to be completely rebuilt, or does it know which objects are
good and which need to go?

I know Ceph handles well the situation of an OSD going away, and
rebalances etc to maintain the required redundancy levels. But I'm
unsure what it does when an OSD comes back some time later still
containing data.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] Ceph network topology with redundant switches

2013-12-23 Thread Tim Bishop
On Fri, Dec 20, 2013 at 09:26:35AM -0800, Kyle Bader wrote:
> > The area I'm currently investigating is how to configure the
> > networking. To avoid a SPOF I'd like to have redundant switches for
> > both the public network and the internal network, most likely running
> > at 10Gb. I'm considering splitting the nodes in to two separate racks
> > and connecting each half to its own switch, and then trunk the
> > switches together to allow the two halves of the cluster to see each
> > other. The idea being that if a single switch fails I'd only lose half
> > of the cluster.
> 
> This is fine if you are using a replication factor of 2, you would
> need 2/3 of the cluster surviving if using a replication factor 3 with
> "osd pool default min size" set to 2.

Ah! Thanks for pointing that out. I'd not appreciated the impact of that
setting. I'll have to consider my options here. This also fits well with
Wido's reply (in this same thread) about splitting the nodes in to three
groups rather than two, although the cost of 10Gb networking starts to
become prohibitive at that point.
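
(Noting the relevant pool settings here for my own benefit - the pool
name is just an example:)

  ceph osd pool get rbd size        # replica count
  ceph osd pool get rbd min_size    # replicas needed to keep serving I/O
  ceph osd pool set rbd min_size 2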

> > My question is about configuring the public network. If it's all one
> > subnet then the clients consuming the Ceph resources can't have both
> > links active, so they'd be configured in an active/standby role. But
> > this results in quite heavy usage of the trunk between the two
> > switches when a client accesses nodes on the other switch than the one
> > they're actively connected to.
> 
> The linux bonding driver supports several strategies for teaming network
> adapters on L2 networks.

Across switches? Wido's reply mentions using mlag to span LACP trunks
across switches. This isn't something I'd seen before, so I'd assumed I
couldn't do it. Certainly an area I need to look into more.
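
As a note to myself, the Linux side of an LACP bond looks simple enough
- roughly the following with a recent iproute2 (interface names are
placeholders, and the switches obviously need MLAG or equivalent):

  ip link add bond0 type bond mode 802.3ad miimon 100
  ip link set eth0 down && ip link set eth0 master bond0
  ip link set eth1 down && ip link set eth1 master bond0
  ip link set bond0 up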

> > So, can I configure multiple public networks? I think so, based on the
> > documentation, but I'm not completely sure. Can I have one half of the
> > cluster on one subnet, and the other half on another? And then the
> > client machine can have interfaces in different subnets and "do the
> > right thing" with both interfaces to talk to all the nodes. This seems
> > like a fairly simple solution that avoids a SPOF in Ceph or the network
> > layer.
> 
> You can have multiple networks for both the public and cluster networks,
> the only restriction is that all subnets for a given type be within the same
> supernet. For example
> 
> 10.0.0.0/16 - Public supernet (configured in ceph.conf)
> 10.0.1.0/24 - Public rack 1
> 10.0.2.0/24 - Public rack 2
> 10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
> 10.1.1.0/24 - Cluster rack 1
> 10.1.2.0/24 - Cluster rack 2

Thanks, that clarifies how this works.
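
So in ceph.conf that would presumably look something like this (using
your example supernets):

  [global]
      public network  = 10.0.0.0/16
      cluster network = 10.1.0.0/16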

> > As an aside, there's a similar issue on the cluster network side with
> > heavy traffic on the trunk between the two cluster switches. But I
> > can't see that's avoidable, and presumably it's something people just
> > have to deal with in larger Ceph installations?
> 
> A proper CRUSH configuration is going to place a replica on a node in
> each rack, this means every write is going to cross the trunk. Other
> traffic that you will see on the trunk:
> 
> * OSDs gossiping with one another
> * OSD/Monitor traffic in the case where an OSD is connected to a
>   monitor connected in the adjacent rack (map updates, heartbeats).

Am I right in saying that the first of these happens over the cluster
network and the second over the public network? It looks like monitors
don't have a cluster network address.

> * OSD/Client traffic where the OSD and client are in adjacent racks
> 
> If you use all 4 40GbE uplinks (common on 10GbE ToR) then your
> cluster level bandwidth is oversubscribed 4:1. To lower oversubscription
> you are going to have to steal some of the other 48 ports, 12 for 2:1 and
> 24 for a non-blocking fabric. Given number of nodes you have/plan to
> have you will be utilizing 6-12 links per switch, leaving you with 12-18
> links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1.

I guess this is an area where instrumentation can be used to measure the
load on the trunk and add more links if required.

Thank you for your help.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Tim Bishop
Hi Wido,

Thanks for the reply.

On Fri, Dec 20, 2013 at 08:14:13AM +0100, Wido den Hollander wrote:
> On 12/18/2013 09:39 PM, Tim Bishop wrote:
> > I'm investigating and planning a new Ceph cluster starting with 6
> > nodes with currently planned growth to 12 nodes over a few years. Each
> > node will probably contain 4 OSDs, maybe 6.
> >
> > The area I'm currently investigating is how to configure the
> > networking. To avoid a SPOF I'd like to have redundant switches for
> > both the public network and the internal network, most likely running
> > at 10Gb. I'm considering splitting the nodes in to two separate racks
> > and connecting each half to its own switch, and then trunk the
> > switches together to allow the two halves of the cluster to see each
> > other. The idea being that if a single switch fails I'd only lose half
> > of the cluster.
> 
> Why not three switches in total and use VLANs on the switches to 
> separate public/cluster traffic?
> 
> This way you can configure the CRUSH map to have one replica go to each 
> "switch" so that when you loose a switch you still have two replicas 
> available.
> 
> Saves you a lot of switches and makes the network simpler.

I was planning to use VLANs to separate the public and cluster traffic
on the same switches.

Two switches cost less than three switches :-) I think on a slightly
larger scale cluster it might make more sense to go up to three (or even
more) switches, but I'm not sure the extra cost is worth it at this
level. I was planning two switches, using VLANs to separate the public
and cluster traffic, and connecting half of the cluster to each switch.

> > (I'm not touching on the required third MON in a separate location and
> > the CRUSH rules to make sure data is correctly replicated - I'm happy
> > with the setup there)
> >
> > To allow consumers of Ceph to see the full cluster they'd be directly
> > connected to both switches. I could have another layer of switches for
> > them and interlinks between them, but I'm not sure it's worth it on
> > this sort of scale.
> >
> > My question is about configuring the public network. If it's all one
> > subnet then the clients consuming the Ceph resources can't have both
> > links active, so they'd be configured in an active/standby role. But
> > this results in quite heavy usage of the trunk between the two
> > switches when a client accesses nodes on the other switch than the one
> > they're actively connected to.
> >
> 
> Why can't the clients have both links active? You could use LACP? Some 
> switches support mlag to span LACP trunks over two switches.
> 
> Or use some intelligent bonding mode in the Linux kernel.

I've only ever used LACP to the same switch, and I hadn't realised there
were options for spanning LACP links across multiple switches. Thanks
for the information there.

> > So, can I configure multiple public networks? I think so, based on the
> > documentation, but I'm not completely sure. Can I have one half of the
> > cluster on one subnet, and the other half on another? And then the
> > client machine can have interfaces in different subnets and "do the
> > right thing" with both interfaces to talk to all the nodes. This seems
> > like a fairly simple solution that avoids a SPOF in Ceph or the network
> > layer.
> 
> There is no restriction on the IPs of the OSDs. All they need is a Layer 
> 3 route to the WHOLE cluster and monitors.
> 
> It doesn't have to be in a Layer 2 network, everything can be simply
> Layer 3. You just have to make sure all the nodes can reach each other.

Thanks, that makes sense and makes planning simpler. I suppose it's
logical really... in a HUGE cluster you'd probably have all manner of
networks spread around the datacenter.

> > Or maybe I'm missing an alternative that would be better? I'm aiming
> > for something that keeps things as simple as possible while meeting
> > the redundancy requirements.
> >
>          client
>            |
>            |
>       core switch
>        /   |   \
>       /    |    \
>      /     |     \
>  switch1 switch2 switch3
>     |       |       |
>    OSD     OSD     OSD
> 
> 
> You could build something like that. That would be fairly simple.

Isn't the core switch in that diagram a SPOF? Or is it presumed to
already be a redundant setup?

> Keep in mind that you can always lose a switch and stil

[ceph-users] Ceph network topology with redundant switches

2013-12-18 Thread Tim Bishop
Hi all,

I'm investigating and planning a new Ceph cluster starting with 6
nodes with currently planned growth to 12 nodes over a few years. Each
node will probably contain 4 OSDs, maybe 6.

The area I'm currently investigating is how to configure the
networking. To avoid a SPOF I'd like to have redundant switches for
both the public network and the internal network, most likely running
at 10Gb. I'm considering splitting the nodes in to two separate racks
and connecting each half to its own switch, and then trunk the
switches together to allow the two halves of the cluster to see each
other. The idea being that if a single switch fails I'd only lose half
of the cluster.

(I'm not touching on the required third MON in a separate location and
the CRUSH rules to make sure data is correctly replicated - I'm happy
with the setup there)

To allow consumers of Ceph to see the full cluster they'd be directly
connected to both switches. I could have another layer of switches for
them and interlinks between them, but I'm not sure it's worth it on
this sort of scale.

My question is about configuring the public network. If it's all one
subnet then the clients consuming the Ceph resources can't have both
links active, so they'd be configured in an active/standby role. But
this results in quite heavy usage of the trunk between the two
switches when a client accesses nodes on the other switch than the one
they're actively connected to.

So, can I configure multiple public networks? I think so, based on the
documentation, but I'm not completely sure. Can I have one half of the
cluster on one subnet, and the other half on another? And then the
client machine can have interfaces in different subnets and "do the
right thing" with both interfaces to talk to all the nodes. This seems
like a fairly simple solution that avoids a SPOF in Ceph or the network
layer.

Or maybe I'm missing an alternative that would be better? I'm aiming
for something that keeps things as simple as possible while meeting
the redundancy requirements.

As an aside, there's a similar issue on the cluster network side with
heavy traffic on the trunk between the two cluster switches. But I
can't see that's avoidable, and presumably it's something people just
have to deal with in larger Ceph installations?

Finally, this is all theoretical planning to try and avoid designing
in bottlenecks at the outset. I don't have any concrete ideas of
loading so in practice none of it may be an issue.

Thanks for your thoughts.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55



Re: [ceph-users] ZFS on RBD?

2013-05-24 Thread Tim Bishop
On Fri, May 24, 2013 at 05:10:14PM +0200, Wido den Hollander wrote:
> On 05/23/2013 11:34 PM, Tim Bishop wrote:
> > I'm evaluating Ceph and one of my workloads is a server that provides
> > home directories to end users over both NFS and Samba. I'm looking at
> > whether this could be backed by Ceph provided storage.
> >
> > So to test this I built a single node Ceph instance (Ubuntu precise,
> > ceph.com packages) in a VM and popped a couple of OSDs on it. I then
> > built another VM and used it to mount an RBD from the Ceph node. No
> > problems... it all worked as described in the documentation.
> >
> > Then I started to look at the filesystem I was using on top of the RBD.
> > I'd tested ext4 without any problems. I'd been testing ZFS (from stable
> > zfs-native PPA) separately against local storage on the client VM too,
> > so I thought I'd try that on top of the RBD. This is when I hit
> > problems, and the VM paniced (trace at the end of this email).
> >
> > Now I am just experimenting, so this isn't a huge deal right now. But
> > I'm wondering if this is something that should work? Am I overlooking
> > something? Is it a silly idea to even try it?
> 
> It should work, but I'm not sure what is happening here. But I'm 
> wondering, what's the reasoning behind this? You can use ZFS on multiple 
> machines, so you are exporting via RBD from one machine to another.
> 
> Wouldn't it be easier to just use NBD or iSCSI in this case?
> 
> I can't find the usecase here for using RBD, since that is designed to 
> work in a distributed load.
> 
> Is this just a test you wanted to run or something you were thinking 
> about deploying?

Thank you for the reply. It's a bit of both; at this stage I'm just
testing, but it's something I might deploy, if it works.

I'll briefly explain the scenario.

So I have various systems that I'd like to move on to Ceph, including
stuff like VM servers. But this particular workload is a set of home
directories that are mounted across a mixture of Unix-based servers,
some Linux, some Solaris, and also end user desktops using Windows and
MacOS.

Since I can't directly mount the filesystem on all the end user machines
I thought a proxy host would be a good idea. It could mount the RBD
directly and then reshare it using NFS and Samba to the various other
machines. It could have 10Gbit networking to make full use of the
available storage from Ceph.

I could make the filesystem on the proxy host just ext4, but I pondered
ZFS for some of the extra features it offers. For example, creating a
file system per user and easy snapshots.
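
Concretely, the sort of thing I have in mind (the pool and dataset names
are made up):

  zfs create tank/home/alice             # one filesystem per user
  zfs set quota=20G tank/home/alice
  zfs snapshot tank/home/alice@nightly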

The overall idea is to consolidate storage from various different
systems using locally attached storage arrays to a central storage pool
based on Ceph. It's just an idea at this stage, so I'm testing to see
what's feasible, and what works.

Please do let me know if I'm approaching this in the wrong way!

Thank you,

Tim.

(I submitted a bug report to the ZFS folk:
https://github.com/zfsonlinux/spl/issues/241 )

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984


[ceph-users] ZFS on RBD?

2013-05-23 Thread Tim Bishop
Hi all,

I'm evaluating Ceph and one of my workloads is a server that provides
home directories to end users over both NFS and Samba. I'm looking at
whether this could be backed by Ceph provided storage.

So to test this I built a single node Ceph instance (Ubuntu precise,
ceph.com packages) in a VM and popped a couple of OSDs on it. I then
built another VM and used it to mount an RBD from the Ceph node. No
problems... it all worked as described in the documentation.
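
(For completeness, the RBD setup was nothing exotic - roughly the
following, with the image name and size as examples:)

  rbd create homes --size 102400   # 100 GB test image
  rbd map homes
  mkfs.ext4 /dev/rbd0
  mount /dev/rbd0 /mnt/homes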

Then I started to look at the filesystem I was using on top of the RBD.
I'd tested ext4 without any problems. I'd been testing ZFS (from stable
zfs-native PPA) separately against local storage on the client VM too,
so I thought I'd try that on top of the RBD. This is when I hit
problems, and the VM paniced (trace at the end of this email).

Now I am just experimenting, so this isn't a huge deal right now. But
I'm wondering if this is something that should work? Am I overlooking
something? Is it a silly idea to even try it?

The trace looks to be in the ZFS code, so if there's a bug that needs
fixing it's probably over there rather than in Ceph, but I thought here
might be a good starting point for advice.

Thanks in advance everyone,

Tim.

[  504.644120] divide error:  [#1] SMP
[  504.644298] Modules linked in: coretemp(F) ppdev(F) vmw_balloon(F) 
microcode(F) psmouse(F) serio_raw(F) parport_pc(F) vmwgfx(F) i2c_piix4(F) 
mac_hid(F) ttm(F) shpchp(F) drm(F) rbd(F) libceph(F) lp(F) parport(F) zfs(POF) 
zcommon(POF) znvpair(POF) zavl(POF) zunicode(POF) spl(OF) floppy(F) e1000(F) 
mptspi(F) mptscsih(F) mptbase(F) btrfs(F) zlib_deflate(F) libcrc32c(F)
[  504.646156] CPU 0
[  504.646234] Pid: 2281, comm: txg_sync Tainted: PF   B  O 
3.8.0-21-generic #32~precise1-Ubuntu VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform
[  504.646550] RIP: 0010:[]  [] 
spa_history_write+0x82/0x1d0 [zfs]
[  504.646816] RSP: 0018:88003ae3dab8  EFLAGS: 00010246
[  504.646940] RAX:  RBX:  RCX: 
[  504.647091] RDX:  RSI: 0020 RDI: 
[  504.647242] RBP: 88003ae3db28 R08: 88003b2afc00 R09: 0002
[  504.647423] R10: 88003b9a4512 R11: 6d206b6e61742066 R12: 88003add6600
[  504.647600] R13: 88003cfc2000 R14: 88003d3c9000 R15: 0008
[  504.647778] FS:  () GS:88003fc0() 
knlGS:
[  504.647997] CS:  0010 DS:  ES:  CR0: 8005003b
[  504.648153] CR2: 7fbc1ef54a38 CR3: 3bf3e000 CR4: 07f0
[  504.648380] DR0:  DR1:  DR2: 
[  504.648586] DR3:  DR6: 0ff0 DR7: 0400
[  504.648766] Process txg_sync (pid: 2281, threadinfo 88003ae3c000, task 
88003b7c45c0)
[  504.648990] Stack:
[  504.649087]  0002 a01e3360 88003b2afc00 
88003ae3dba0
[  504.649461]  88003d3c9000 0008 88003cfc2000 
5530ebc2
[  504.649835]  88003d22ac40 88003d22ac40 88003cfc2000 
88003b2afc00
[  504.650209] Call Trace:
[  504.650351]  [] spa_history_log_sync+0x235/0x650 [zfs]
[  504.650554]  [] dsl_sync_task_group_sync+0x123/0x210 [zfs]
[  504.650760]  [] dsl_pool_sync+0x41b/0x530 [zfs]
[  504.650953]  [] spa_sync+0x3a8/0xa50 [zfs]
[  504.651117]  [] ? ktime_get_ts+0x4c/0xe0
[  504.651302]  [] txg_sync_thread+0x2df/0x540 [zfs]
[  504.651501]  [] ? txg_init+0x250/0x250 [zfs]
[  504.651676]  [] thread_generic_wrapper+0x78/0x90 [spl]
[  504.651856]  [] ? __thread_create+0x310/0x310 [spl]
[  504.652029]  [] kthread+0xc0/0xd0
[  504.652174]  [] ? flush_kthread_worker+0xb0/0xb0
[  504.652339]  [] ret_from_fork+0x7c/0xb0
[  504.652492]  [] ? flush_kthread_worker+0xb0/0xb0
[  504.652655] Code: 55 b0 48 89 fa 48 29 f2 48 01 c2 48 39 55 b8 0f 82 bc 00 
00 00 4c 8b 75 b0 41 bf 08 00 00 00 48 29 c8 31 d2 49 8b b5 70 08 00 00 <48> f7 
f7 4c 8d 45 c0 4c 89 f7 48 01 ca 48 29 d3 48 83 fb 08 49
[  504.659810] RIP  [] spa_history_write+0x82/0x1d0 [zfs]
[  504.660045]  RSP 
[  504.660187] ---[ end trace e69c7eee3ba17773 ]---

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984