[ceph-users] Build Raw Volume from Recovered RBD Objects
All, I was called in to assist in a failed Ceph environment with the cluster in an inoperable state. No rbd volumes are mountable/exportable due to missing PGs. The previous operator was using a replica count of 2. The cluster suffered a power outage and various non-catastrophic hardware issues as they were starting it back up. At some point during recovery, drives were removed from the cluster leaving several PGs missing. Efforts to restore the missing PGs from the data on the removed drives failed using the process detailed in a Red Hat Customer Support blog post [0]. Upon starting the OSDs with recovered PGs, a segfault halts progress. The original operator isn't clear on when, but there may have been a software upgrade applied after the drives were pulled. I believe the cluster may be irrecoverable at this point. My recovery assistance has focused on a plan to: 1) Scrape all objects for several key rbd volumes from live OSDs and the removed former OSD drives. 2) Compare and deduplicate the two copies of each object. 3) Recombine the objects for each volume into a raw image. I have completed steps 1 and 2 with apparent success. My initial stab at step 3 yielded a raw image that could be mounted and had signs of a filesystem, but it could not be read. Could anyone assist me with the following questions? 1) Are the rbd objects in order by filename? If not, what is the method to determine their order? 2) How should objects smaller than the default 4MB chunk size be handled? Should they be padded somehow? 3) If any objects were completely missing and therefore unavailable to this process, how should they be handled? I assume we need to offset/pad to compensate. -- Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 M: 317-490-3018 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
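Regarding the three questions above: in both rbd image formats, the trailing hex field of an object's name encodes its index into the image, so byte offset = index × object size; objects shorter than the chunk size are simply sparse at the tail; and wholly missing objects should become zero-filled holes at their offsets. A rough, untested sketch of step 3 under those assumptions (4 MiB is only the default object size; verify against the image header if it can be recovered, and note `reassemble`/`object_index` are hypothetical helpers, not Ceph tooling):

```python
#!/usr/bin/env python
# Illustrative sketch only -- not a supported Ceph tool. Assumes the
# scraped object files keep their rados names, whose trailing hex field
# is the object's index into the image (e.g. rb.0.75a7.238e1f29.000000000005
# would be the sixth 4 MiB chunk).
import os
import re
import sys

OBJECT_SIZE = 4 * 1024 * 1024  # rbd default order (22) = 4 MiB objects

def object_index(name):
    """Parse the trailing hex index from an rbd object name (question 1)."""
    m = re.search(r'\.([0-9a-f]{12,16})$', name)
    return int(m.group(1), 16) if m else None

def reassemble(obj_dir, out_path, object_size=OBJECT_SIZE):
    with open(out_path, 'wb') as out:
        for name in sorted(os.listdir(obj_dir)):
            idx = object_index(name)
            if idx is None:
                continue  # not an rbd object file
            with open(os.path.join(obj_dir, name), 'rb') as f:
                data = f.read()
            # Questions 2 and 3: seeking makes short objects sparse at the
            # tail and turns wholly missing objects into holes; both read
            # back as zeros, so no explicit padding is required. If the
            # image's *final* objects are missing, truncate out_path to the
            # full image size afterwards so the device length is right.
            out.seek(idx * object_size)
            out.write(data)

if __name__ == '__main__' and len(sys.argv) >= 3:
    reassemble(sys.argv[1], sys.argv[2])
```

If the resulting image still will not mount cleanly, a zero-filled hole where a filesystem metadata object used to live is the likely cause; fsck on a copy of the image may recover some of it.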
Re: [ceph-users] Discuss: New default recovery config settings
With a write-heavy RBD workload, I add the following to ceph.conf: osd_max_backfills = 2 osd_recovery_max_active = 2 If things are going well during recovery (i.e. guests happy and no slow requests), I will often bump both up to three: # ceph tell osd.* injectargs '--osd-max-backfills 3 --osd-recovery-max-active 3' If I see slow requests, I drop them down. The biggest downside to setting either to 1 seems to be the long tail issue detailed in: http://tracker.ceph.com/issues/9566 Thanks, Mike Dawson On 6/3/2015 6:44 PM, Sage Weil wrote: On Mon, 1 Jun 2015, Gregory Farnum wrote: On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz pvonstamw...@us.fujitsu.com wrote: On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote: Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows: osd_max_backfills to 1 (from 10) osd_recovery_max_active to 3 (from 15) osd_recovery_op_priority to 1 (from 10) osd_recovery_max_single_start to 1 (from 5) I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;) -Greg -- Greg, When we set... osd recovery max active = 1 osd max backfills = 1 We see rebalance times go down by more than half and client write performance increase significantly while rebalancing. We initially played with these settings to improve client IO expecting recovery time to get worse, but we got a 2-for-1. 
This was with firefly using replication, downing an entire node with lots of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and osd_recovery_max_single_start default. We dropped osd_recovery_max_active and osd_max_backfills together. If you're right, do you think osd_recovery_max_active=1 is the primary reason for the improvement? (higher osd_max_backfills helps recovery time with erasure coding.) Well, recovery max active and max backfills are similar in many ways. Both are about moving data into a new or outdated copy of the PG; the difference is that recovery refers to our log-based recovery (where we compare the PG logs and move over the objects which have changed) whereas backfill requires us to incrementally move through the entire PG's hash space and compare. I suspect dropping down max backfills is more important than reducing max recovery (gathering recovery metadata happens largely in memory) but I don't really know either way. My comment was meant to convey that I'd prefer we not reduce the recovery op priority levels. :) We could make a less extreme move than to 1, but IMO we have to reduce it one way or another. Every major operator I've talked to does this, our PS folks have been recommending it for years, and I've yet to see a single complaint about recovery times... meanwhile we're drowning in a sea of complaints about the impact on clients. How about osd_max_backfills to 1 (from 10) osd_recovery_max_active to 3 (from 15) osd_recovery_op_priority to 3 (from 10) osd_recovery_max_single_start to 1 (from 5) (same as above, but 1/3rd the recovery op prio instead of 1/10th)? sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Negative amount of objects degraded
Erik, I reported a similar issue 22 months ago. I don't think any developer has ever really prioritized these issues. http://tracker.ceph.com/issues/3720 I was able to recover that cluster. The method I used is in the comments. I have no idea if my cluster was broken for the same reason as yours. Your results may vary. - Mike Dawson On 10/30/2014 4:50 PM, Erik Logtenberg wrote: Thanks for pointing that out. Unfortunately, those tickets contain only a description of the problem, but no solution or workaround. One was opened 8 months ago and the other more than a year ago. No love since. Is there any way I can get my cluster back in a healthy state? Thanks, Erik. On 10/30/2014 05:13 PM, John Spray wrote: There are a couple of open tickets about bogus (negative) stats on PGs: http://tracker.ceph.com/issues/5884 http://tracker.ceph.com/issues/7737 Cheers, John On Thu, Oct 30, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Yesterday I removed two OSD's, to replace them with new disks. Ceph was not able to completely reach an all active+clean state, but some degraded objects remain. However, the amount of degraded objects is negative (-82), see below: 2014-10-30 13:31:32.862083 mon.0 [INF] pgmap v209175: 768 pgs: 761 active+clean, 7 active+remapped; 1644 GB data, 2524 GB used, 17210 GB / 19755 GB avail; 2799 B/s wr, 1 op/s; -82/1439391 objects degraded (-0.006%) According to rados df, the -82 degraded objects are part of the cephfs-data-cache pool, which is an SSD-backed replicated pool, that functions as a cache pool for an HDD-backed erasure coded pool for cephfs. The cache should be empty, because I issued the rados cache-flush-evict-all command, and rados -p cephfs-data-cache ls indeed shows zero objects in this pool. rados df however does show 192 objects for this pool, with just 35KB used and -82 degraded: pool name category KB objects clones degraded unfound rd rd KB wr wr KB cephfs-data-cache - 35 192 0 -82 0 1119 348800 1198371 1703673493 Please advise...
Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] converting legacy puppet-ceph configured OSDs to look like ceph-deployed OSDs
On 10/15/2014 4:20 PM, Dan van der Ster wrote: Hi Ceph users, (sorry for the novel, but perhaps this might be useful for someone) During our current project to upgrade our cluster from disks-only to SSD journals, we've found it useful to convert our legacy puppet-ceph deployed cluster (using something like the enovance module) to one that looks like it has had its OSD created with ceph-disk prepare. It's been educational for me, and I thought it would be good experience to share. To start, the old puppet-ceph configures OSDs explicitly in ceph.conf, like this: [osd.211] host = p05151113489275 devs = /dev/disk/by-path/pci-:02:00.0-sas-...-lun-0-part1 and ceph-disk list says this about the disks: /dev/sdh : /dev/sdh1 other, xfs, mounted on /var/lib/ceph/osd/osd.211 In other words, ceph-disk doesn't know anything about the OSD living on that disk. Before deploying our SSD journals I was trying to find the best way to map OSDs to SSD journal partitions (in puppet!), but basically there is no good way to do this with the legacy puppet-ceph module. (What we'd have to do is puppetize the partitioning of SSDs, then manually map OSDs to SSD partitions. This would be tedious, and also error prone after disk replacements and reboots). However, I've found that by using ceph-deploy, i.e. ceph-disk, to prepare and activate OSDs, this becomes very simple, trivial even. Using ceph-disk we keep the OSD/SSD mapping out of puppet; instead the state is stored in the OSD itself. (1.5 years ago when we deployed this cluster, ceph-deploy was advertised as a quick tool to spin up small clusters, so we didn't dare use it. I realize now that it (or the puppet/chef/... recipes based on it) is _the_only_way_ to build a cluster if you're starting out today.) Now our problem was that I couldn't go and re-ceph-deploy the whole cluster, since we've got some precious user data there.
Instead, I needed to learn how ceph-disk is labeling and preparing disks, and modify our existing OSDs in place to look like they'd been prepared and activated with ceph-disk. In the end, I've worked out all the configuration and sgdisk magic and put the recipes into a couple of scripts here [1]. Note that I do not expect these to work for any other cluster unmodified. In fact, that would be dangerous, so don't blame me if you break something. But they might be helpful for understanding how the ceph-disk udev magic works and could be a basis for upgrading other clusters. The scripts are: ceph-deployifier/ceph-create-journals.sh: - this script partitions SSDs (assuming sda to sdd) with 5 partitions each - the only trick is to add the partition name 'ceph journal' and set the typecode to the magic JOURNAL_UUID along with a random partition guid ceph-deployifier/ceph-label-disks.sh: - this script discovers the next OSD which is not prepared with ceph-disk, finds an appropriate unused journal partition, and converts the OSD to a ceph-disk prepared lookalike. - aside from the discovery part, the main magic is to: - create the files active, sysvinit and journal_uuid on the OSD - rename the partition to 'ceph data', set the typecode to the magic OSD_UUID, and the partition guid to the OSD's uuid. - link to the /dev/disk/by-partuuid/ journal symlink, and make the new journal - at the end, udev is triggered and the OSD is started (via the ceph-disk activation magic) The complete details are of course in the scripts. (I also have another version of ceph-label-disks.sh that doesn't expect an SSD journal but instead prepares the single disk 2 partitions scheme.) After running these scripts you'll get a nice shiny ceph-disk list output: /dev/sda : /dev/sda1 ceph journal, for /dev/sde1 /dev/sda2 ceph journal, for /dev/sdf1 /dev/sda3 ceph journal, for /dev/sdg1 ...
/dev/sde : /dev/sde1 ceph data, active, cluster ceph, osd.2, journal /dev/sda1 /dev/sdf : /dev/sdf1 ceph data, active, cluster ceph, osd.8, journal /dev/sda2 /dev/sdg : /dev/sdg1 ceph data, active, cluster ceph, osd.12, journal /dev/sda3 ... And all of the udev magic is working perfectly. I've tested all of the reboot, failed OSD, and failed SSD scenarios and it all works as it should. And the puppet-ceph manifest for osd's is now just a very simple wrapper around ceph-disk prepare. (I haven't published ours to github yet, but it is very similar to the stackforge puppet-ceph manifest). There you go, sorry that was so long. I hope someone finds this useful :) Best Regards, Dan [1] https://github.com/cernceph/ceph-scripts/tree/master/tools/ceph-deployifier Dan, Thank you for publishing this! I put some time into this very issue earlier this year, but got pulled in another direction before completing the work. I'd like to bring a production cluster deployed with mkcephfs out of the stone ages, so your work will be very useful to me. Thanks again, Mike Dawson ___ ceph-users mailing list
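For readers curious about the "sgdisk magic" without opening Dan's scripts, the relabeling boils down to a few sgdisk calls per partition. A hypothetical sketch that only assembles those command lines (the two type-code GUIDs are the well-known ceph-disk partition types; the device names and the `label_cmds` helper are made up for illustration, not part of any Ceph tooling):

```python
# Illustrative sketch of the labeling step; it only *builds* the sgdisk
# command lines rather than running them, so nothing here touches a disk.
# The two type-code GUIDs are the well-known ceph-disk partition types;
# device names, partition numbers, and label_cmds itself are hypothetical.
CEPH_DATA_GUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"      # 'ceph data'
CEPH_JOURNAL_GUID = "45b0969e-9b03-4f30-b4c6-b4b80ceff106"   # 'ceph journal'

def label_cmds(dev, partnum, part_uuid, journal=False):
    """Commands that rename a partition and set its ceph type code and
    partition GUID, mirroring what the ceph-label-disks.sh recipe does."""
    name = "ceph journal" if journal else "ceph data"
    typecode = CEPH_JOURNAL_GUID if journal else CEPH_DATA_GUID
    return [
        "sgdisk --change-name=%d:'%s' %s" % (partnum, name, dev),
        "sgdisk --typecode=%d:%s %s" % (partnum, typecode, dev),
        "sgdisk --partition-guid=%d:%s %s" % (partnum, part_uuid, dev),
    ]
```

Once a partition carries the 'ceph data' typecode, a udev trigger (or reboot) is what lets the ceph-disk activation rules find and start the OSD, which is the behavior Dan's scripts rely on.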
Re: [ceph-users] v0.67.11 dumpling released
On 9/25/2014 11:09 AM, Sage Weil wrote: v0.67.11 Dumpling === This stable update for Dumpling fixes several important bugs that affect a small set of users. We recommend that all Dumpling users upgrade at their convenience. If none of these issues are affecting your deployment there is no urgency. Notable Changes --- * common: fix sending dup cluster log items (#9080 Sage Weil) * doc: several doc updates (Alfredo Deza) * libcephfs-java: fix build against older JNI headers (Greg Farnum) * librados: fix crash in op timeout path (#9362 Matthias Kiefer, Sage Weil) * librbd: fix crash using clone of flattened image (#8845 Josh Durgin) * librbd: fix error path cleanup when failing to open image (#8912 Josh Durgin) * mon: fix crash when adjusting pg_num before any OSDs are added (#9052 Sage Weil) * mon: reduce log noise from paxos (Aanchal Agrawal, Sage Weil) * osd: allow scrub and snap trim thread pool IO priority to be adjusted (Sage Weil) Sage, Thanks for the great work! Could you provide any links describing how to tune the scrub and snap trim thread pool IO priority? I couldn't find these settings in the docs. IIUC, 0.67.11 does not include the proposed changes to address #9487 or #9503, right? Thanks, Mike Dawson * osd: fix mount/remount sync race (#9144 Sage Weil) Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-0.67.11.tar.gz * For packages, see http://ceph.com/docs/master/install/get-packages * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.11 dumpling released
Looks like the packages have partially hit the repo, but at least the following are missing: Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/librbd1_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/librados2_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/python-ceph_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/ceph_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/libcephfs1_0.67.11-1precise_amd64.deb 404 Not Found Based on the timestamps of the files that made it, it looks like the process to publish the packages isn't still in progress, but rather failed yesterday. Thanks, Mike Dawson On 9/25/2014 11:09 AM, Sage Weil wrote: v0.67.11 Dumpling === This stable update for Dumpling fixes several important bugs that affect a small set of users. We recommend that all Dumpling users upgrade at their convenience. If none of these issues are affecting your deployment there is no urgency.
Notable Changes --- * common: fix sending dup cluster log items (#9080 Sage Weil) * doc: several doc updates (Alfredo Deza) * libcephfs-java: fix build against older JNI headers (Greg Farnum) * librados: fix crash in op timeout path (#9362 Matthias Kiefer, Sage Weil) * librbd: fix crash using clone of flattened image (#8845 Josh Durgin) * librbd: fix error path cleanup when failing to open image (#8912 Josh Durgin) * mon: fix crash when adjusting pg_num before any OSDs are added (#9052 Sage Weil) * mon: reduce log noise from paxos (Aanchal Agrawal, Sage Weil) * osd: allow scrub and snap trim thread pool IO priority to be adjusted (Sage Weil) * osd: fix mount/remount sync race (#9144 Sage Weil) Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-0.67.11.tar.gz * For packages, see http://ceph.com/docs/master/install/get-packages * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Best practice K/M-parameters EC pool
On 8/28/2014 11:17 AM, Loic Dachary wrote: On 28/08/2014 16:29, Mike Dawson wrote: On 8/28/2014 12:23 AM, Christian Balzer wrote: On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote: On 27/08/2014 04:34, Christian Balzer wrote: Hello, On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote: Hi Craig, I assume the reason for the 48 hours recovery time is to keep the cost of the cluster low ? I wrote 1h recovery time because it is roughly the time it would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to reduce the recovery time to less than two hours ? Or are there factors other than cost that prevent this ? I doubt Craig is operating on a shoestring budget. And even if his network were to be just GbE, that would still make it only 10 hours according to your wishful thinking formula. He probably has set the max_backfills to 1 because that is the level of I/O his OSDs can handle w/o degrading cluster performance too much. The network is unlikely to be the limiting factor. The way I see it most Ceph clusters are in sort of steady state when operating normally, i.e. a few hundred VM RBD images ticking over, most actual OSD disk ops are writes, as nearly all hot objects that are being read are in the page cache of the storage nodes. Easy peasy. Until something happens that breaks this routine, like a deep scrub, all those VMs rebooting at the same time or a backfill caused by a failed OSD. Now all of a sudden client ops compete with the backfill ops, page caches are no longer hot, the spinners are seeking left and right. Pandemonium. I doubt very much that even with a SSD backed cluster you would get away with less than 2 hours for 4TB. To give you some real life numbers, I currently am building a new cluster but for the time being have only one storage node to play with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it. 
So I took out one OSD (reweight 0 first, then the usual removal steps) because the actual disk was wonky. Replaced the disk and re-added the OSD. Both operations took about the same time, 4 minutes for evacuating the OSD (having 7 write targets clearly helped) for a measly 12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the OSD. And that is on one node (thus no network latency) that has the default parameters (so a max_backfill of 10) which was otherwise totally idle. In other words, in this pretty ideal case it would have taken 22 hours to re-distribute 4TB. That makes sense to me :-) When I wrote 1h, I thought about what happens when an OSD becomes unavailable with no planning in advance. In the scenario you describe the risk of a data loss does not increase since the objects are evicted gradually from the disk being decommissioned and the number of replicas stays the same at all times. There is not a sudden drop in the number of replicas which is what I had in mind. That may be, but I'm rather certain that there is no difference in speed and priority of a rebalancing caused by an OSD set to weight 0 or one being set out. If the lost OSD was part of 100 PG, the other disks (let's say 50 of them) will start transferring a new replica of the objects they have to the new OSD in their PG. The replacement will not be a single OSD although nothing prevents the same OSD from being used in more than one PG as a replacement for the lost one. If the cluster network is connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new duplicates do not originate from a single OSD but from at least dozens of them and since they target more than one OSD, I assume we can expect an actual throughput of 5Gb/s. I should have written 2h instead of 1h to account for the fact that the cluster network is never idle. Am I being too optimistic? Vastly. Do you see another blocking factor that would significantly slow down recovery?
As Craig and I keep telling you, the network is not the limiting factor. Concurrent disk IO is, as I pointed out in the other thread. Completely agree. On a production cluster with OSDs backed by spindles, even with OSD journals on SSDs, it is insufficient to calculate single-disk replacement backfill time based solely on network throughput. IOPS will likely be the limiting factor when backfilling a single failed spinner in a production cluster. Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 3:1), with dual 1GbE bonded NICs. Using only the throughput math, backfill could have theoretically completed in a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times with similar results. Why? Spindle contention on the replacement drive. Graph the '%util' metric from something like 'iostat -xt 2' during a single disk backfill to get a very clear view that spindle contention is the true limiting factor.
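Mike's numbers are easy to sanity-check with back-of-the-envelope arithmetic (all figures are approximations taken from the paragraph above, not fresh measurements):

```python
# Back-of-the-envelope check of the replacement-drive example above
# (3 TB drive ~75% full, dual bonded 1GbE ~ 2 Gb/s ~ 250 MB/s raw).
data_bytes = 0.75 * 3e12            # ~2.25 TB to backfill
link_bytes_per_s = 250e6            # theoretical bonded-pair throughput

theoretical_h = data_bytes / link_bytes_per_s / 3600   # ~2.5 hours
actual_h = 15.0                                        # observed
effective_mb_per_s = data_bytes / (actual_h * 3600) / 1e6

# The drive was actually fed at roughly 42 MB/s -- a ~6x gap that
# throughput math cannot explain, consistent with spindle contention
# (seeks competing between backfill writes and client IO) being the cap.
```

The same arithmetic explains why a "1h recovery over 10Gb/s" estimate is optimistic for spinner-backed OSDs: the disk, not the wire, sets the ceiling.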
Re: [ceph-users] Best practice K/M-parameters EC pool
On 8/28/2014 4:17 PM, Craig Lewis wrote: My initial experience was similar to Mike's, causing a similar level of paranoia. :-) I'm dealing with RadosGW though, so I can tolerate higher latencies. I was running my cluster with noout and nodown set for weeks at a time. I'm sure Craig will agree, but wanted to add this for other readers: I find value in the noout flag for temporary intervention, but prefer to set mon osd down out interval for dealing with events that may occur in the future to give an operator time to intervene. The nodown flag is another beast altogether. The nodown flag tends to be *a bad thing* when attempting to provide reliable client io. For our use case, we want OSDs to be marked down quickly if they are in fact unavailable for any reason, so client io doesn't hang waiting for them. If OSDs are flapping during recovery (i.e. the wrongly marked me down log messages), I've found far superior results by tuning the recovery knobs than by permanently setting the nodown flag. - Mike Recovery of a single OSD might cause other OSDs to crash. In the primary cluster, I was always able to get it under control before it cascaded too wide. In my secondary cluster, it did spiral out to 40% of the OSDs, with 2-5 OSDs down at any time. I traced my problems to a combination of osd max backfills was too high for my cluster, and my mkfs.xfs arguments were causing memory starvation issues. I lowered osd max backfills, added SSD journals, and reformatted every OSD with better mkfs.xfs arguments. Now both clusters are stable, and I don't want to break it. I only have 45 OSDs, so the risk with a 24-48 hours recovery time is acceptable to me. It will be a problem as I scale up, but scaling up will also help with the latency problems. On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson mike.daw...@cloudapt.com mailto:mike.daw...@cloudapt.com wrote: We use 3x replication and have drives that have relatively high steady-state IOPS. 
Therefore, we tend to prioritize client-side IO more than a reduction from 3 copies to 2 during the loss of one disk. The disruption to client io is so great on our cluster, we don't want our cluster to be in a recovery state without operator supervision. Letting OSDs get marked out without operator intervention was a disaster in the early going of our cluster. For example, an OSD daemon crash would trigger automatic recovery where it was unneeded. Ironically, the unneeded recovery would often trigger additional daemons to crash, making a bad situation worse. During the recovery, rbd client io would often go to 0. To deal with this issue, we set mon osd down out interval = 14400, so as operators we have 4 hours to intervene before Ceph attempts to self-heal. When hardware is at fault, we remove the osd, replace the drive, re-add the osd, then allow backfill to begin, thereby completely skipping step B in your timeline above. - Mike ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to avoid deep-scrubbing performance hit?
Craig, I've struggled with the same issue for quite a while. If your i/o is similar to mine, I believe you are on the right track. For the past month or so, I have been running this cronjob: * * * * * for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' | sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg; done That roughly handles my 20672 PGs that are set to be deep-scrubbed every 7 days. Your script may be a bit better, but this quick and dirty method has helped my cluster maintain more consistency. The real key for me is to avoid the clumpiness I have observed without that hack, where concurrent deep-scrubs sit at zero for a long period of time (despite having PGs that were months overdue for a deep-scrub), then concurrent deep-scrubs suddenly spike up and stay in the teens for hours, killing client writes/second. The scrubbing behavior table[0] indicates that a periodic tick initiates scrubs on a per-PG basis. Perhaps the timing of ticks isn't sufficiently randomized when you restart lots of OSDs concurrently (for instance via pdsh). On my cluster I suffer a significant drag on client writes/second when I exceed perhaps four or five concurrent PGs in deep-scrub. When concurrent deep-scrubs get into the teens, I get a massive drop in client writes/second. Greg, is there locking involved when a PG enters deep-scrub? If so, is the entire PG locked for the duration or is each individual object inside the PG locked as it is processed? Some of my PGs will be in deep-scrub for minutes at a time. 0: http://ceph.com/docs/master/dev/osd_internals/scrub/ Thanks, Mike Dawson On 6/9/2014 6:22 PM, Craig Lewis wrote: I've correlated a large deep scrubbing operation to cluster stability problems. My primary cluster does a small amount of deep scrubs all the time, spread out over the whole week. It has no stability problems. My secondary cluster doesn't spread them out. It saves them up, and tries to do all of the deep scrubs over the weekend.
The secondary starts losing OSDs about an hour after these deep scrubs start. To avoid this, I'm thinking of writing a script that continuously scrubs the oldest outstanding PG. In pseudo-bash: # Sort by the deep-scrub timestamp, taking the single oldest PG while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' | sort | head -1 | read date time pg do ceph pg deep-scrub ${pg} while ceph status | grep scrubbing+deep do sleep 5 done sleep 30 done Does anybody think this will solve my problem? I'm also considering disabling deep-scrubbing until the secondary finishes replicating from the primary. Once it's caught up, the write load should drop enough that opportunistic deep scrubs should have a chance to run. It should only take another week or two to catch up. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
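If the shell parsing in the pseudo-bash above gets unwieldy, the selection step can be sketched in Python. The column positions mirror the $20/$21 used in the awk, but they differ between Ceph releases, so verify them against your own `ceph pg dump` output first; `oldest_deep_scrubbed` is an illustrative helper, not an official tool:

```python
# Hedged sketch: pick the PG with the oldest deep-scrub stamp from plain
# `ceph pg dump` output. awk's $20/$21 (1-indexed) are fields[19]/[20]
# here (0-indexed); check these offsets against your Ceph version.
import re

PG_RE = re.compile(r'^[0-9a-f]+\.[0-9a-f]+$')

def oldest_deep_scrubbed(pg_dump_text):
    """Return the pgid whose deep-scrub date+time stamp sorts earliest,
    or None if no PG lines are found."""
    candidates = []
    for line in pg_dump_text.splitlines():
        fields = line.split()
        if len(fields) > 20 and PG_RE.match(fields[0]):
            # ISO-style timestamps sort correctly as plain strings
            candidates.append((fields[19] + ' ' + fields[20], fields[0]))
    return min(candidates)[1] if candidates else None
```

A driving loop would then shell out to `ceph pg deep-scrub <pg>` and wait for `scrubbing+deep` to clear, exactly as in Craig's pseudo-bash.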
Re: [ceph-users] Calamari Goes Open Source
Great work Inktank / Red Hat! An open source Calamari will be a great benefit to the community! Cheers, Mike Dawson On 5/30/2014 6:04 PM, Patrick McGarry wrote: Hey cephers, Sorry to push this announcement so late on a Friday but... Calamari has arrived! The source code bits have been flipped, the ticket tracker has been moved, and we have even given you a little bit of background from both a technical and vision point of view: Technical (ceph.com): http://ceph.com/community/ceph-calamari-goes-open-source/ Vision (inktank.com): http://www.inktank.com/software/future-of-calamari/ The ceph.com link should give you everything you need to know about what tech comprises Calamari, where the source lives, and where the discussions will take place. If you have any questions feel free to hit the new ceph-calamari list or stop by IRC and we'll get you started. Hope you all enjoy the GUI! Best Regards, Patrick McGarry Director, Community || Inktank http://ceph.com || http://inktank.com @scuttlemonkey || @ceph || @inktank ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multiple L2 LAN segments with Ceph
Travis, We run a routed ECMP spine-leaf network architecture with Ceph and have no issues on the network side whatsoever. Each leaf switch has an L2 cidr block inside a common L3 supernet. We do not currently split cluster_network and public_network. If we did, we'd likely build a separate spine-leaf network with its own L3 supernet. A simple IPv4 example: - ceph-cluster: 10.1.0.0/16 - cluster-leaf1: 10.1.1.0/24 - node1: 10.1.1.1/24 - node2: 10.1.1.2/24 - cluster-leaf2: 10.1.2.0/24 - ceph-public: 10.2.0.0/16 - public-leaf1: 10.2.1.0/24 - node1: 10.2.1.1/24 - node2: 10.2.1.2/24 - public-leaf2: 10.2.2.0/24 ceph.conf would be: cluster_network: 10.1.0.0/255.255.0.0 public_network: 10.2.0.0/255.255.0.0 - Mike Dawson On 5/28/2014 1:01 PM, Travis Rhoden wrote: Hi folks, Does anybody know if there are any issues running Ceph with multiple L2 LAN segments? I'm picturing a large multi-rack/multi-row deployment where you may give each rack (or row) its own L2 segment, then connect them all with L3/ECMP in a leaf-spine architecture. I'm wondering how cluster_network (or public_network) in ceph.conf works in this case. Does that directive just tell a daemon starting on a particular node which network to bind to? Or is it a CIDR that has to be accurate for every OSD and MON in the entire cluster? Thanks, - Travis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
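In this design, each daemon only needs its own local address to fall inside the configured network CIDR, which the /16 supernet guarantees for every leaf /24. A tiny containment check using the hypothetical addresses from the example above:

```python
# Containment check for the supernet example above: every node address
# in a leaf /24 must fall inside the /16 given to cluster_network or
# public_network. Addresses are the hypothetical ones from the example.
import ipaddress

cluster_net = ipaddress.ip_network('10.1.0.0/16')
public_net = ipaddress.ip_network('10.2.0.0/16')

def binds_ok(addr, net):
    """Would a daemon with this local address match the configured network?"""
    return ipaddress.ip_address(addr) in net
```

Running the check against node addresses from both leaves (e.g. 10.1.1.1 and 10.1.2.x against cluster_net) confirms the supernet covers them all, so no per-rack ceph.conf differences are needed.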
Re: [ceph-users] How to find the disk partitions attached to a OSD
Perhaps: # mount | grep ceph - Mike Dawson On 5/21/2014 11:00 AM, Sharmila Govind wrote: Hi, I am new to Ceph. I have a storage node with 2 OSDs. I am trying to figure out which physical device/partition each of the OSDs is attached to. Is there a command that can be executed on the storage node to find out the same? Thanks in Advance, Sharmila ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to find the disk partitions attached to a OSD
Looks like you may not have any OSDs properly set up and mounted. It should look more like:

user@host:~# mount | grep ceph
/dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime,inode64)
/dev/sdc1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime,inode64)
/dev/sdd1 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noatime,inode64)

Confirm the OSDs in your ceph cluster with:

user@host:~# ceph osd tree

- Mike

On 5/21/2014 11:15 AM, Sharmila Govind wrote: Hi Mike, Thanks for your quick response. When I try mount on the storage node this is what I get:

root@cephnode4:~# mount
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/sdb on /mnt/CephStorage1 type ext4 (rw)
/dev/sdc on /mnt/CephStorage2 type ext4 (rw)
/dev/sda7 on /mnt/Storage type ext4 (rw)
/dev/sda2 on /boot type ext4 (rw)
/dev/sda5 on /home type ext4 (rw)
/dev/sda6 on /mnt/CephStorage type ext4 (rw)

Is there anything wrong with the setup I have? I don't have any 'ceph' related mounts. Thanks, Sharmila

On Wed, May 21, 2014 at 8:34 PM, Mike Dawson mike.daw...@cloudapt.com wrote: Perhaps: # mount | grep ceph - Mike Dawson On 5/21/2014 11:00 AM, Sharmila Govind wrote: Hi, I am new to Ceph. I have a storage node with 2 OSDs. I am trying to figure out which physical device/partition each of the OSDs is attached to. Is there a command that can be executed on the storage node to find this out?
Thanks in Advance, Sharmila ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] PG Selection Criteria for Deep-Scrub
Today I noticed that deep-scrub is consistently missing some of my Placement Groups, leaving me with the following distribution of PGs and the last day they were successfully deep-scrubbed.

# ceph pg dump all | grep active | awk '{ print $20}' | sort -k1 | uniq -c
   5 2013-11-06
 221 2013-11-20
   1 2014-02-17
  25 2014-02-19
  60 2014-02-20
   4 2014-03-06
   3 2014-04-03
   6 2014-04-04
   6 2014-04-05
  13 2014-04-06
   4 2014-04-08
   3 2014-04-10
   2 2014-04-11
  50 2014-04-12
  28 2014-04-13
  14 2014-04-14
   3 2014-04-15
  78 2014-04-16
  44 2014-04-17
   8 2014-04-18
   1 2014-04-20
  16 2014-05-02
  69 2014-05-04
 140 2014-05-05
 569 2014-05-06
9231 2014-05-07
 103 2014-05-08
 514 2014-05-09
1593 2014-05-10
 393 2014-05-16
2563 2014-05-17
1283 2014-05-18
1640 2014-05-19
1979 2014-05-20

I have been running the default osd deep scrub interval of once per week, but have disabled deep-scrub on several occasions in an attempt to avoid the associated degraded cluster performance I have written about before. To get the PGs longest in need of a deep-scrub started, I set the nodeep-scrub flag and wrote a script to manually kick off deep-scrubs according to age. It is processing as expected. Do you consider this a feature request or a bug? Perhaps the code that schedules PGs to deep-scrub could be improved to prioritize PGs that have needed a deep-scrub the longest. Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
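The "kick off deep-scrubs by age" script mentioned above is not shown, but a hedged sketch might look like the following. Column positions follow the awk command in this message ($1 = pgid, $20 = last deep-scrub date on this version); they shift between releases, so verify against the `ceph pg dump` header row first.

```python
# Hedged sketch of a script that deep-scrubs the PGs that have waited
# longest. The live-cluster path requires the ceph CLI and admin keyring.
import subprocess

def oldest_pgs(pg_dump_lines, limit):
    """Return the `limit` pgids with the oldest deep-scrub dates."""
    stamped = []
    for line in pg_dump_lines:
        cols = line.split()
        if len(cols) >= 20 and "active" in line:
            stamped.append((cols[19], cols[0]))  # (date, pgid)
    stamped.sort()
    return [pgid for _, pgid in stamped[:limit]]

def kick_deep_scrubs(limit=10):
    # Live-cluster path; not exercised by the demo below.
    dump = subprocess.check_output(["ceph", "pg", "dump"]).decode()
    for pgid in oldest_pgs(dump.splitlines(), limit):
        subprocess.check_call(["ceph", "pg", "deep-scrub", pgid])

# Demo with synthetic 20-column rows in pg dump order
demo = [
    "2.b active+clean " + "x " * 17 + "2014-05-20",
    "1.a active+clean " + "x " * 17 + "2013-11-06",
]
print(oldest_pgs(demo, 1))  # ['1.a']
```

Run `kick_deep_scrubs()` in small batches (the `limit` parameter) so manually triggered deep-scrubs don't reproduce the very IO contention being avoided.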
Re: [ceph-users] PG Selection Criteria for Deep-Scrub
I tend to set it whenever I don't want to be bothered by storage performance woes (nights I value sleep, etc). This cluster is bounded by relentless small writes (it has a couple dozen rbd volumes backing video surveillance DVRs). Some of the software we run is completely unaffected whereas other software falls apart during periods of deep-scrubs. I theorize it has to do with the individual software's attitude about flushing to disk / buffering. - Mike

On 5/20/2014 8:31 PM, Aaron Ten Clay wrote: For what it's worth, version 0.79 has different headers, and the awk command needs $19 instead of $20. But here is the output I have on a small cluster that I recently rebuilt:

$ ceph pg dump all | grep active | awk '{ print $19}' | sort -k1 | uniq -c
dumped all in format plain
   1 2014-05-15
   2 2014-05-17
  19 2014-05-18
 193 2014-05-19
 105 2014-05-20

I have set noscrub and nodeep-scrub, as well as noout and nodown, off and on while I performed various maintenance, but that hasn't (apparently) impeded the regular schedule. With what frequency are you setting the nodeep-scrub flag? -Aaron

On Tue, May 20, 2014 at 5:21 PM, Mike Dawson mike.daw...@cloudapt.com wrote: Today I noticed that deep-scrub is consistently missing some of my Placement Groups, leaving me with the following distribution of PGs and the last day they were successfully deep-scrubbed.
# ceph pg dump all | grep active | awk '{ print $20}' | sort -k1 | uniq -c
   5 2013-11-06
 221 2013-11-20
   1 2014-02-17
  25 2014-02-19
  60 2014-02-20
   4 2014-03-06
   3 2014-04-03
   6 2014-04-04
   6 2014-04-05
  13 2014-04-06
   4 2014-04-08
   3 2014-04-10
   2 2014-04-11
  50 2014-04-12
  28 2014-04-13
  14 2014-04-14
   3 2014-04-15
  78 2014-04-16
  44 2014-04-17
   8 2014-04-18
   1 2014-04-20
  16 2014-05-02
  69 2014-05-04
 140 2014-05-05
 569 2014-05-06
9231 2014-05-07
 103 2014-05-08
 514 2014-05-09
1593 2014-05-10
 393 2014-05-16
2563 2014-05-17
1283 2014-05-18
1640 2014-05-19
1979 2014-05-20

I have been running the default osd deep scrub interval of once per week, but have disabled deep-scrub on several occasions in an attempt to avoid the associated degraded cluster performance I have written about before. To get the PGs longest in need of a deep-scrub started, I set the nodeep-scrub flag, and wrote a script to manually kick off deep-scrub according to age. It is processing as expected. Do you consider this a feature request or a bug? Perhaps the code that schedules PGs to deep-scrub could be improved to prioritize PGs that have needed a deep-scrub the longest. Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Occasional Missing Admin Sockets
All, I have a recurring issue where the admin sockets (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the daemons keep running (or restart without my knowledge). I see this issue on a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with ceph-deploy, using Upstart to control daemons. I never see this issue on Ubuntu / Dumpling / sysvinit. Has anyone else seen this issue or know the likely cause? -- Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Monitoring ceph statistics using rados python module
Adrian, Yes, it is single-OSD oriented. Like Haomai, we monitor perf dumps from individual OSD admin sockets. On new enough versions of Ceph, you can run 'ceph daemon osd.x perf dump', which is a shorter way to ask for the same output as 'ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump'. Keep in mind, either version has to be run locally on the host where osd.x is running. We use Sensu to take samples and push them to Graphite. We can then build dashboards showing the whole cluster, units in our CRUSH tree, hosts, or individual OSDs. I have found that monitoring each OSD's admin daemon is critical. Oftentimes a single OSD can affect performance of the entire cluster. Without individual data, these types of issues can be quite difficult to pinpoint. Also, note that Inktank has developed Calamari. There are rumors that it may be open sourced at some point in the future. Cheers, Mike Dawson

On 5/13/2014 12:33 PM, Adrian Banasiak wrote: Thanks for the suggestion about the admin daemon, but it looks single-OSD oriented. I have used perf dump on a mon socket and it outputs some interesting data for monitoring the whole cluster:

{ cluster: { num_mon: 4,
      num_mon_quorum: 4,
      num_osd: 29,
      num_osd_up: 29,
      num_osd_in: 29,
      osd_epoch: 1872,
      osd_kb: 20218112516,
      osd_kb_used: 5022202696,
      osd_kb_avail: 15195909820,
      num_pool: 4,
      num_pg: 3500,
      num_pg_active_clean: 3500,
      num_pg_active: 3500,
      num_pg_peering: 0,
      num_object: 400746,
      num_object_degraded: 0,
      num_object_unfound: 0,
      num_bytes: 1678788329609,
      num_mds_up: 0,
      num_mds_in: 0,
      num_mds_failed: 0,
      mds_epoch: 1},

Unfortunately, cluster-wide IO statistics are still missing.

2014-05-13 17:17 GMT+02:00 Haomai Wang haomaiw...@gmail.com: Not sure about your demand. I use ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump to get the monitor info. And the result can be parsed by simplejson easily via python.
On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak adr...@banasiak.it wrote: Hi, I am working with a test Ceph cluster and now I want to implement Zabbix monitoring with items such as: - whole cluster IO (for example ceph -s - recovery io 143 MB/s, 35 objects/s) - pg statistics. I would like to create a single script in python to retrieve values using the rados python module, but there is little information in the documentation about module usage. I've created a single function which calculates all pools' current read/write statistics, but I can't find out how to add recovery IO usage and pg statistics:

read = 0
write = 0
for pool in conn.list_pools():
    io = conn.open_ioctx(pool)
    stats[pool] = io.get_stats()
    read += int(stats[pool]['num_rd'])
    write += int(stats[pool]['num_wr'])

Could someone share his knowledge about the rados module for retrieving ceph statistics? BTW Ceph is awesome! -- Best regards, Adrian Banasiak email: adr...@banasiak.it -- Best Regards, Wheat -- Regards, Adrian Banasiak email: adr...@banasiak.it ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
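A cleaned-up, hedged version of Adrian's pool-stats loop. The pure aggregation is separated from the cluster calls so it can be checked on its own; the `collect_pool_stats` path requires python-rados and a reachable cluster, and the `conffile` default is an assumption.

```python
# Sketch based on the snippet above: sum cumulative read/write op counts
# across per-pool get_stats() dicts (keys 'num_rd'/'num_wr' as in the
# original code).
def total_io(stats_by_pool):
    """Return (total_reads, total_writes) across per-pool stat dicts."""
    read = sum(int(s["num_rd"]) for s in stats_by_pool.values())
    write = sum(int(s["num_wr"]) for s in stats_by_pool.values())
    return read, write

def collect_pool_stats(conffile="/etc/ceph/ceph.conf"):
    import rados  # requires python-rados on a cluster node
    conn = rados.Rados(conffile=conffile)
    conn.connect()
    try:
        stats = {}
        for pool in conn.list_pools():
            ioctx = conn.open_ioctx(pool)
            try:
                stats[pool] = ioctx.get_stats()
            finally:
                ioctx.close()
        return stats
    finally:
        conn.shutdown()

print(total_io({"rbd": {"num_rd": 10, "num_wr": 4},
                "data": {"num_rd": 5, "num_wr": 6}}))  # (15, 10)
```

Note these counters are cumulative, so a monitoring item like Zabbix's should sample `total_io(collect_pool_stats())` periodically and derive rates from the deltas.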
Re: [ceph-users] Occasional Missing Admin Sockets
Greg/Loic, I can confirm that logrotate --force /etc/logrotate.d/ceph removes the monitor admin socket on my boxes running 0.80.1 just like the description in Issue 7188 [0]. 0: http://tracker.ceph.com/issues/7188 Should that bug be reopened? Thanks, Mike Dawson On 5/13/2014 2:10 PM, Gregory Farnum wrote: On Tue, May 13, 2014 at 9:06 AM, Mike Dawson mike.daw...@cloudapt.com wrote: All, I have a recurring issue where the admin sockets (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the daemons keep running Hmm. (or restart without my knowledge). I'm guessing this might be involved: I see this issue on a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with ceph-deploy using Upstart to control daemons. I never see this issue on Ubuntu / Dumpling / sysvinit. *goes and greps the git log* I'm betting it was commit 45600789f1ca399dddc5870254e5db883fb29b38 (which has, in fact, been backported to dumpling and emperor), intended so that turning on a new daemon wouldn't remove the admin socket of an existing one. But I think that means that if you activate the new daemon before the old one has finished shutting down and unlinking, you would end up with a daemon that had no admin socket. Perhaps it's an incomplete fix and we need a tracker ticket? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.80 Firefly released
Andrey, In initial testing, it looks like it may work rather efficiently.

1) Upgrade all mon, osd, and clients to Firefly. Restart everything so no legacy ceph code is running.

2) Add "mon osd allow primary affinity = true" to ceph.conf and distribute ceph.conf to the nodes.

3) Inject it into the monitors to make it immediately active:

# ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity true'

Ignore the "mon.a: injectargs: failed to parse arguments: true" warnings; this appears to be a bug [0].

4) Check to see how many PGs have osd.0 as their primary:

# ceph pg dump | awk '{ print $15 $14 $1}' | egrep ^0 | wc -l

5) Set primary affinity to zero on osd.0:

# ceph osd primary-affinity osd.0 0

If you didn't set mon_osd_allow_primary_affinity properly above, you'll get a helpful error message.

6) Confirm it worked by comparing how many PGs have osd.0 as their primary:

# ceph pg dump | awk '{ print $15 }' | egrep ^0 | wc -l

On my small dev cluster, the number goes to 0 in less than 10 seconds.

7) Perform maintenance and watch ceph -w. If you didn't get all your clients updated, you'll likely see a bunch of errors in ceph -w like:

2014-05-09 21:12:42.534900 osd.0 [WRN] client.130959 x.x.x.x:0/1015056 misdirected client.130959.0:619497 pg 4.90eaebe to osd.0 not [6,1,0] in e1650/1650

8) After you are done with maintenance, reset the primary affinity:

# ceph osd primary-affinity osd.0 1

I have not scaled up my testing, but it looks like this has the potential to work well in preventing unnecessary read starvation in certain situations.

0: http://tracker.ceph.com/issues/8323#note-1

Cheers, Mike Dawson

On 5/8/2014 8:20 AM, Andrey Korolyov wrote: Mike, would you mind writing up your experience if you manage to get this flow through first? I hope I'll be able to conduct some tests related to 0.80 only next week, including maintenance combined with primary pointer relocation - one of the most crucial things remaining in Ceph for production performance.
On Wed, May 7, 2014 at 10:18 PM, Mike Dawson mike.daw...@cloudapt.com wrote: On 5/7/2014 11:53 AM, Gregory Farnum wrote: On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi, Sage Weil wrote: * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. Can you please elaborate a bit on this one? I found the blueprint [1] but still don't quite understand how it works. Does this only change the crush calculation for reads? i.e writes still go to the usual primary, but reads are distributed across the replicas? If so, does this change the consistency model in any way. It changes the calculation of who becomes the primary, and that primary serves both reads and writes. In slightly more depth: Previously, the primary has always been the first OSD chosen as a member of the PG. For erasure coding, we added the ability to specify a primary independent of the selection ordering. This was part of a broad set of changes to prevent moving the EC shards around between different members of the PG, and means that the primary might be the second OSD in the PG, or the fourth. Once this work existed, we realized that it might be useful in other cases, because primaries get more of the work for their PG (serving all reads, coordinating writes). So we added the ability to specify a primary affinity, which is like the CRUSH weights but only impacts whether you become the primary. So if you have 3 OSDs that each have primary affinity = 1, it will behave as normal. If two have primary affinity = 0, the remaining OSD will be the primary. Etc. Is it possible (and/or advisable) to set primary affinity low while backfilling / recovering an OSD in an effort to prevent unnecessary slow reads that could be directed to less busy replicas? 
I suppose if the cost of setting/unsetting primary affinity is low and clients are starved for reads during backfill/recovery from the osd in question, it could be a win. Perhaps the workflow for maintenance on osd.0 would be something like: - Stop osd.0, do some maintenance on osd.0 - Read primary affinity of osd.0, store it for later - Set primary affinity on osd.0 to 0 - Start osd.0 - Enjoy a better backfill/recovery experience. RBD clients happier. - Reset primary affinity on osd.0 to previous value If the cost of setting primary affinity is low enough, perhaps this strategy could be automated by the ceph daemons. Thanks, Mike Dawson -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http
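The "how many PGs does this OSD lead" check from the primary-affinity procedure earlier in this thread can also be done against JSON output instead of awk column positions. A hedged sketch, assuming `ceph pg dump --format json` returns a "pg_stats" list and treating the first entry of "acting" as the primary (newer releases also expose an "acting_primary" field, which is used when present):

```python
# Sketch: count PGs per primary OSD from `ceph pg dump --format json`.
# Field names are assumptions about this version's output; verify locally.
import json
import subprocess
from collections import Counter

def primary_counts(pg_dump_json):
    """Return a Counter mapping primary osd id -> number of PGs it leads."""
    counts = Counter()
    for pg in json.loads(pg_dump_json)["pg_stats"]:
        counts[pg.get("acting_primary", pg["acting"][0])] += 1
    return counts

def live_dump():
    # Requires a reachable cluster; not exercised by the sample below.
    return subprocess.check_output(
        ["ceph", "pg", "dump", "--format", "json"]).decode()

sample = json.dumps({"pg_stats": [
    {"pgid": "4.1", "acting": [0, 3]},
    {"pgid": "4.2", "acting": [0, 5]},
    {"pgid": "4.3", "acting": [3, 0]},
]})
print(primary_counts(sample)[0])  # 2
```

After `ceph osd primary-affinity osd.0 0`, `primary_counts(live_dump())[0]` should drop to 0, the same signal as the awk/egrep pipeline.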
Re: [ceph-users] v0.80 Firefly released
On 5/7/2014 11:53 AM, Gregory Farnum wrote: On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi, Sage Weil wrote: * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. Can you please elaborate a bit on this one? I found the blueprint [1] but still don't quite understand how it works. Does this only change the crush calculation for reads? i.e writes still go to the usual primary, but reads are distributed across the replicas? If so, does this change the consistency model in any way. It changes the calculation of who becomes the primary, and that primary serves both reads and writes. In slightly more depth: Previously, the primary has always been the first OSD chosen as a member of the PG. For erasure coding, we added the ability to specify a primary independent of the selection ordering. This was part of a broad set of changes to prevent moving the EC shards around between different members of the PG, and means that the primary might be the second OSD in the PG, or the fourth. Once this work existed, we realized that it might be useful in other cases, because primaries get more of the work for their PG (serving all reads, coordinating writes). So we added the ability to specify a primary affinity, which is like the CRUSH weights but only impacts whether you become the primary. So if you have 3 OSDs that each have primary affinity = 1, it will behave as normal. If two have primary affinity = 0, the remaining OSD will be the primary. Etc. Is it possible (and/or advisable) to set primary affinity low while backfilling / recovering an OSD in an effort to prevent unnecessary slow reads that could be directed to less busy replicas? 
I suppose if the cost of setting/unsetting primary affinity is low and clients are starved for reads during backfill/recovery from the osd in question, it could be a win. Perhaps the workflow for maintenance on osd.0 would be something like: - Stop osd.0, do some maintenance on osd.0 - Read primary affinity of osd.0, store it for later - Set primary affinity on osd.0 to 0 - Start osd.0 - Enjoy a better backfill/recovery experience. RBD clients happier. - Reset primary affinity on osd.0 to previous value If the cost of setting primary affinity is low enough, perhaps this strategy could be automated by the ceph daemons. Thanks, Mike Dawson -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 16 osds: 11 up, 16 in
Craig, I suspect the disks in question are seeking constantly and the spindle contention is causing significant latency. A strategy of throttling backfill/recovery and reducing client traffic tends to work for me.

1) You should make sure recovery and backfill are throttled:

ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'

2) We run a not-particularly-critical service with a constant stream of 95% write / 5% read small, random IO. During recovery/backfill, we are heavily bound by IOPS. It often feels like a net win to throttle unessential client traffic in an effort to get spindle contention under control if step 1 wasn't enough.

If that all fails, you can try 'ceph osd set nodown', which will prevent OSDs from being marked down (with or without proper cause), but that tends to cause me more trouble than it's worth.

Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250

On 5/7/2014 1:28 PM, Craig Lewis wrote: The 5 OSDs that are down have all been kicked out for being unresponsive. The 5 OSDs are getting kicked out faster than they can complete the recovery+backfill. The number of degraded PGs is growing over time.
root@ceph0c:~# ceph -w
    cluster 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
     health HEALTH_WARN 49 pgs backfill; 926 pgs degraded; 252 pgs down; 30 pgs incomplete; 291 pgs peering; 1 pgs recovery_wait; 175 pgs stale; 255 pgs stuck inactive; 175 pgs stuck stale; 1234 pgs stuck unclean; 66 requests are blocked 32 sec; recovery 6820014/3806 objects degraded (17.921%); 4/16 in osds are down; noout flag(s) set
     monmap e2: 2 mons at {ceph0c=10.193.0.6:6789/0,ceph1c=10.193.0.7:6789/0}, election epoch 238, quorum 0,1 ceph0c,ceph1c
     osdmap e38673: 16 osds: 12 up, 16 in
            flags noout
      pgmap v7325233: 2560 pgs, 17 pools, 14090 GB data, 18581 kobjects
            28456 GB used, 31132 GB / 59588 GB avail
            6820014/3806 objects degraded (17.921%)
                   1 stale+active+clean+scrubbing+deep
                  15 active
                1247 active+clean
                   1 active+recovery_wait
                  45 stale+active+clean
                  39 peering
                  29 stale+active+degraded+wait_backfill
                 252 down+peering
                 827 active+degraded
                  50 stale+active+degraded
                  20 stale+active+degraded+remapped+wait_backfill
                  30 stale+incomplete
                   4 active+clean+scrubbing+deep

Here's a snippet of ceph.log for one of these OSDs:

2014-05-07 09:22:46.747036 mon.0 10.193.0.6:6789/0 39981 : [INF] osd.3 marked down after no pg stats for 901.212859 seconds
2014-05-07 09:47:17.930251 mon.0 10.193.0.6:6789/0 40561 : [INF] osd.3 10.193.0.6:6812/2830 boot
2014-05-07 09:47:16.914519 osd.3 10.193.0.6:6812/2830 823 : [WRN] map e38649 wrongly marked me down

root@ceph0c:~# uname -a
Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@ceph0c:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise
root@ceph0c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

Any ideas what I can do to make these OSDs stop dying after 15 minutes? -- Craig Lewis Senior Systems Engineer Office +1.714.602.1309 Email cle...@centraldesktop.com Central Desktop.
Work together in ways you never thought possible. Connect with us: Website http://www.centraldesktop.com/ | Twitter http://www.twitter.com/centraldesktop | Facebook http://www.facebook.com/CentralDesktop | LinkedIn http://www.linkedin.com/groups?gid=147417 | Blog http://cdblog.centraldesktop.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
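The injectargs throttles suggested earlier in this thread last only until the daemons restart. Hedged ceph.conf equivalents (same options, persistent across restarts) would be:

```ini
# Persistent counterparts to the runtime injectargs throttles above.
# Applied to all OSDs on daemon (re)start.
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
```

Keeping the persistent values conservative and raising them temporarily via injectargs when a fast rebuild matters more than client latency is one reasonable split.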
[ceph-users] Deep-Scrub Scheduling
My write-heavy cluster struggles under the additional load created by deep-scrub from time to time. As I have instrumented the cluster more, it has become clear that there is something I cannot explain happening in the scheduling of PGs to undergo deep-scrub. Please refer to these images [0][1] to see two graphical representations of how deep-scrub goes awry in my cluster. These were two separate incidents. Both show a period of happy scrub and deep-scrubs and stable writes/second across the cluster, then an approximately 5x jump in concurrent deep-scrubs where client IO is cut by nearly 50%. The first image (deep-scrub-issue1.jpg) shows a happy cluster with low numbers of scrub and deep-scrub running until about 10pm, then something triggers deep-scrubs to increase about 5x and remain high until I manually 'ceph osd set nodeep-scrub' at approx 10am. During the time of higher concurrent deep-scrubs, IOPS drop significantly due to OSD spindle contention preventing qemu/rbd clients from writing like normal. The second image (deep-scrub-issue2.jpg) shows a similar approx 5x jump in concurrent deep-scrubs and associated drop in writes/second. This image also adds a summary of the 'dump historic ops' which show the to be expected jump in the slowest ops in the cluster. Does anyone have an idea of what is happening when the spike in concurrent deep-scrub occurs and how to prevent the adverse effects, outside of disabling deep-scrub permanently? 0: http://www.mikedawson.com/deep-scrub-issue1.jpg 1: http://www.mikedawson.com/deep-scrub-issue2.jpg Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Deep-Scrub Scheduling
Perhaps, but if that were the case, would you expect the max concurrent number of deep-scrubs to approach the number of OSDs in the cluster? I have 72 OSDs in this cluster and concurrent deep-scrubs seem to peak at a max of 12. Do pools (two in use) and replication settings (3 copies in both pools) factor in? 72 OSDs / (2 pools * 3 copies) = 12 max concurrent deep-scrubs. That seems plausible (without looking at the code). But, if I 'ceph osd set nodeep-scrub' then 'ceph osd unset nodeep-scrub', the count of concurrent deep-scrubs doesn't resume the high level, but rather stays low, seemingly for days at a time, until the next onslaught. If driven by the max scrub interval, shouldn't it jump quickly back up? Is there a way to find the last scrub time for a given PG via the CLI to know for sure? Thanks, Mike Dawson

On 5/7/2014 10:59 PM, Gregory Farnum wrote: Is it possible you're running into the max scrub intervals and jumping up to one-per-OSD from a much lower normal rate? On Wednesday, May 7, 2014, Mike Dawson mike.daw...@cloudapt.com wrote: My write-heavy cluster struggles under the additional load created by deep-scrub from time to time. As I have instrumented the cluster more, it has become clear that there is something I cannot explain happening in the scheduling of PGs to undergo deep-scrub. Please refer to these images [0][1] to see two graphical representations of how deep-scrub goes awry in my cluster. These were two separate incidents. Both show a period of happy scrub and deep-scrubs and stable writes/second across the cluster, then an approximately 5x jump in concurrent deep-scrubs where client IO is cut by nearly 50%. The first image (deep-scrub-issue1.jpg) shows a happy cluster with low numbers of scrub and deep-scrub running until about 10pm, then something triggers deep-scrubs to increase about 5x and remain high until I manually 'ceph osd set nodeep-scrub' at approx 10am.
During the time of higher concurrent deep-scrubs, IOPS drop significantly due to OSD spindle contention preventing qemu/rbd clients from writing like normal. The second image (deep-scrub-issue2.jpg) shows a similar approx 5x jump in concurrent deep-scrubs and associated drop in writes/second. This image also adds a summary of the 'dump historic ops', which shows the to-be-expected jump in the slowest ops in the cluster. Does anyone have an idea of what is happening when the spike in concurrent deep-scrubs occurs and how to prevent the adverse effects, outside of disabling deep-scrub permanently? 0: http://www.mikedawson.com/deep-scrub-issue1.jpg 1: http://www.mikedawson.com/deep-scrub-issue2.jpg Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
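On the per-PG scrub-time question in this thread: `ceph pg <pgid> query` returns JSON which, on recent releases, carries the last deep-scrub time; the `info -> stats -> last_deep_scrub_stamp` path below is an assumption to verify against your version's output. A hedged sketch, with the parsing separated from the subprocess call so it can be checked against sample data:

```python
# Sketch: read a PG's last deep-scrub time from `ceph pg <pgid> query`.
# The JSON path is an assumption about this release's output format.
import json
import subprocess

def last_deep_scrub(pg_query_json):
    """Extract last_deep_scrub_stamp from pg query JSON output."""
    return json.loads(pg_query_json)["info"]["stats"]["last_deep_scrub_stamp"]

def query_pg(pgid):
    # Live-cluster path; requires the ceph CLI and admin keyring.
    return subprocess.check_output(["ceph", "pg", pgid, "query"]).decode()

sample = json.dumps(
    {"info": {"stats": {"last_deep_scrub_stamp": "2014-05-07 22:01:11.719254"}}})
print(last_deep_scrub(sample))  # 2014-05-07 22:01:11.719254
```

On a live cluster, `last_deep_scrub(query_pg("4.90"))` would confirm whether the max scrub interval theory holds for a suspect PG.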
Re: [ceph-users] ceph-deploy osd activate error: AttributeError: 'module' object has no attribute 'logger' exception
Victor, This is a verified issue reported earlier today: http://tracker.ceph.com/issues/8260 Cheers, Mike

On 4/30/2014 3:10 PM, Victor Bayon wrote: Hi all, I am following the quick-ceph-deploy tutorial [1] and I am getting an error when running ceph-deploy osd activate; see the exception below [2]. I am following the quick tutorial step by step, except that ceph-deploy mon create-initial does not seem to gather the keys and I have to execute that manually with ceph-deploy gatherkeys node01. Any help greatly appreciated. I am using the same configuration, with: - one admin node (myhost) - 1 monitoring node (node01) - 2 osds (node02, node03). I am on Ubuntu Server 12.04 LTS (precise) and using ceph emperor. Many thanks, Best regards /V

[1] http://ceph.com/docs/master/start/quick-ceph-deploy/
[2] Error:

ceph@myhost:~/cluster$ ceph-deploy osd activate node02:/var/local/osd0 node03:/var/local/osd1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.0): /usr/bin/ceph-deploy osd activate node02:/var/local/osd0 node03:/var/local/osd1
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks node02:/var/local/osd0: node03:/var/local/osd1:
[node02][DEBUG ] connected to host: node02
[node02][DEBUG ] detect platform information from remote host
[node02][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 12.04 precise
[ceph_deploy.osd][DEBUG ] activating host node02 disk /var/local/osd0
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[node02][INFO ] Running command: sudo ceph-disk-activate --mark-init upstart --mount /var/local/osd0
[node02][WARNIN] got latest monmap
[node02][WARNIN] 2014-04-30 19:36:30.268882 7f506fd07780 -1 journal FileJournal::_open: disabling aio for non-block journal.
Use journal_force_aio to force use of aio anyway
[node02][WARNIN] 2014-04-30 19:36:30.298239 7f506fd07780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
[node02][WARNIN] 2014-04-30 19:36:30.301091 7f506fd07780 -1 filestore(/var/local/osd0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
[node02][WARNIN] 2014-04-30 19:36:30.307474 7f506fd07780 -1 created object store /var/local/osd0 journal /var/local/osd0/journal for osd.0 fsid 76de3b72-44e3-47eb-8bd7-2b5b6e3666eb
[node02][WARNIN] 2014-04-30 19:36:30.307512 7f506fd07780 -1 auth: error reading file: /var/local/osd0/keyring: can't open /var/local/osd0/keyring: (2) No such file or directory
[node02][WARNIN] 2014-04-30 19:36:30.307547 7f506fd07780 -1 created new key in keyring /var/local/osd0/keyring
[node02][WARNIN] added key for osd.0
Traceback (most recent call last):
  File "/usr/bin/ceph-deploy", line 21, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/util/decorators.py", line 62, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/cli.py", line 147, in main
    return args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/osd.py", line 532, in osd
    activate(args, cfg)
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/osd.py", line 338, in activate
    catch_osd_errors(distro.conn, distro.logger, args)
AttributeError: 'module' object has no attribute 'logger'
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Backfill and Recovery traffic shaping
Hi Greg, On 4/19/2014 2:20 PM, Greg Poirier wrote: We have a cluster in a sub-optimal configuration with data and journal colocated on OSDs (that coincidentally are spinning disks). During recovery/backfill, the entire cluster suffers degraded performance because of the IO storm that backfills cause. Client IO becomes extremely latent. Graph '%util' or simply watch it with 'iostat -xt 2'. It will likely show you that the bottleneck is the IOPS available from your spinning disks. Client IO can see significant latency (or at worst, complete stalls) as your disks approach saturation. I've tried to decrease the impact that recovery/backfill has with the following: ceph tell osd.* injectargs '--osd-max-backfills 1' ceph tell osd.* injectargs '--osd-max-recovery-threads 1' ceph tell osd.* injectargs '--osd-recovery-op-priority 1' ceph tell osd.* injectargs '--osd-client-op-priority 63' ceph tell osd.* injectargs '--osd-recovery-max-active 1' On our cluster, these settings can be an effective method for minimizing disruption. I'd also recommend disabling deep scrub with: ceph osd set nodeep-scrub Re-enable it later with: ceph osd unset nodeep-scrub I have some clients that are much more susceptible to disruptions from spindle contention during recovery/backfill. Others operate without disruption. I am working to quantify the difference, but I believe it is related to the caching or syncing behavior of the individual application/OS. The only other option I have left would be to use linux traffic shaping to artificially reduce the bandwidth available to the interface tagged for cluster traffic (instead of separate physical networks, we use VLAN tagging). We are nowhere _near_ the point where network saturation would cause the latency we're seeing, so I am left to believe that it is simply disk IO saturation. I could be wrong about this assumption, though, as iostat doesn't terrify me. This could be suboptimal network configuration on the cluster as well. 
I'm still looking into that possibility, but I wanted to get feedback on what I'd done already first--as well as the proposed traffic shaping idea. Thoughts? I would exhaust all troubleshooting / tuning related to spindle contention before spending much more than a cursory look at network sanity. It sounds to me like you simply don't have enough IOPS available in your cluster as configured to handle your client IO workload while also absorbing the performance hit of recovery/backfill. With a workload consisting of lots of small writes, I've seen client IO starved with as little as 5Mbps of traffic per host due to spindle contention once deep-scrub and/or recovery/backfill start. Co-locating OSD journals on the same spinners, as you have, will double that likelihood. Possible solutions include moving OSD journals to SSD (with a reasonable ratio), expanding the cluster, or increasing the performance of the underlying storage. Cheers, Mike
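Mike's advice to watch '%util' can be automated. Below is a hedged sketch (not a tool from this thread) that scans `iostat -x`-style output for disks approaching saturation; it assumes the sysstat extended device report layout, where %util is the last column of each device row.

```python
# Hypothetical helper: flag disks nearing IOPS saturation from `iostat -xt 2`
# output, using the %util column (assumed to be the last field of each row).
def saturated_devices(iostat_text, threshold=90.0):
    """Return [(device, util)] for device rows whose %util exceeds threshold."""
    flagged = []
    in_table = False
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            in_table = False  # blank line ends a device table
            continue
        if fields[0] == "Device:":
            in_table = True   # header row starts a device table
            continue
        if in_table:
            try:
                util = float(fields[-1])
            except ValueError:
                continue
            if util > threshold:
                flagged.append((fields[0], util))
    return flagged

sample = """Device:         rrqm/s   wrqm/s     r/s     w/s   %util
sda               0.00     1.20    4.00   12.00    3.10
sdb               0.00    88.00  140.00  310.00   97.40
sdc               0.00    79.50  120.00  280.00   91.20
"""
print(saturated_devices(sample))  # [('sdb', 97.4), ('sdc', 91.2)]
```

If this regularly flags the same spindles during recovery/backfill, that supports the disk-saturation theory over a network problem.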
Re: [ceph-users] RBD write access patterns and atime
Thanks Dan! Thanks, Mike Dawson On 4/17/2014 4:06 AM, Dan van der Ster wrote: Mike Dawson wrote: Dan, Could you describe how you harvested and analyzed this data? Even better, could you share the code? Cheers, Mike First enable debug_filestore=10, then you'll see logs like this: 2014-04-17 09:40:34.466749 7fb39df16700 10 filestore(/var/lib/ceph/osd/osd.0) write 4.206_head/57186206/rbd_data.1f7ccd36575a0ed.1620/head//4 651264~4096 = 4096 and this for reads: 2014-04-17 09:46:10.449577 7fb392427700 10 filestore(/var/lib/ceph/osd/osd.0) FileStore::read 4.fe9_head/f7281fe9/rbd_data.10bb48f705289c0.6a24/head//4 1994752~4096/4096 The last num is the size of the write/read. Then run this: https://github.com/cernceph/ceph-scripts/blob/master/tools/rbd-io-stats.pl Cheers, Dan
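For those who prefer Python, the core of what rbd-io-stats.pl does can be sketched as follows. The regex is inferred from the sample log lines in Dan's message, not taken from the actual script, so treat it as an assumption.

```python
import re
from collections import Counter

# Tally write sizes from FileStore debug logs (debug_filestore = 10).
# Pattern inferred from the sample line above: "... write <obj> <offset>~<len> = <len>"
WRITE_RE = re.compile(r'filestore\(.*?\) write .* (\d+)~(\d+) = \2')

def write_length_histogram(log_lines):
    """Map write length -> count, e.g. {4096: 1}."""
    hist = Counter()
    for line in log_lines:
        m = WRITE_RE.search(line)
        if m:
            hist[int(m.group(2))] += 1
    return dict(hist)

logs = [
    "2014-04-17 09:40:34.466749 7fb39df16700 10 "
    "filestore(/var/lib/ceph/osd/osd.0) write "
    "4.206_head/57186206/rbd_data.1f7ccd36575a0ed.1620/head//4 651264~4096 = 4096",
]
print(write_length_histogram(logs))  # {4096: 1}
```

Run over a day of OSD logs, this yields exactly the kind of "writes per length" table Dan posted below.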
Re: [ceph-users] RBD write access patterns and atime
Dan, Could you describe how you harvested and analyzed this data? Even better, could you share the code? Cheers, Mike On 4/16/2014 11:08 AM, Dan van der Ster wrote: Dear ceph-users, I've recently started looking through our FileStore logs to better understand the VM/RBD IO patterns, and noticed something interesting. Here is a snapshot of the write lengths for one OSD server (with 24 OSDs) -- I've listed the top write lengths ordered by number of writes in one day:

Writes per length:
4096: 2011442
8192: 438259
4194304: 207293
12288: 175848
16384: 148274
20480: 69050
24576: 58961
32768: 54771
28672: 43627
65536: 34208
49152: 31547
40960: 28075

There were ~4M writes to that server on that day, so you see that ~50% of the writes were 4096 bytes, and then the distribution drops off sharply before a peak again at 4MB (the object size, i.e. the max write size). (For those interested, read lengths are below in the P.S.) I'm trying to understand that distribution, and the best explanation I've come up with is that these are ext4/xfs metadata updates, probably atime updates. Based on that theory, I'm going to test noatime on a few VMs and see if I notice a change in the distribution. Did anyone already go through such an exercise, or does anyone already enforce/recommend specific mount options for their clients' RBD volumes? Of course I realize that noatime is a generally recommended mount option for performance, but I've never heard a discussion about noatime specifically in relation to RBD volumes. Best Regards, Dan

P.S. Reads per length:
524288: 1235401
4096: 675012
8192: 488194
516096: 342771
16384: 187577
65536: 87783
131072: 87279
12288: 66735
49152: 50170
24576: 47794
262144: 45199
466944: 23064

So reads are mostly 512kB, which is probably some default read-ahead size. 
-- Dan van der Ster || Data Storage Services || CERN IT Department --
[ceph-users] Migrate from mkcephfs to ceph-deploy
Hello, I have a production cluster that was deployed with mkcephfs around the Bobtail release. Quite a bit has changed in regards to ceph.conf conventions, ceph-deploy, symlinks to journal partitions, udev magic, and upstart. Is there any path to migrate these OSDs up to the new style setup? For obvious reasons I'd prefer to avoid redeploying the OSDs. With each release, I get a bit more worried that this legacy setup will cause issues. If you are an operator with a cluster older than a year or so, what have you done? Thanks, Mike
Re: [ceph-users] Error while provisioning my first OSD
Adam, I believe you need the command 'ceph osd create' prior to 'ceph-osd -i X --mkfs --mkkey' for each OSD you add. http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-an-osd-manual Cheers, Mike On 4/5/2014 7:37 PM, Adam Clark wrote: HI all, I am trying to setup a Ceph cluster for the first time. I am following the manual deployment guide at http://ceph.com/docs/master/install/manual-deployment/ as I want to orchestrate it with puppet. All is going well until I want to add the OSD to the crush map. I get the following error: ceph osd crush add osd.0 1.0 host=ceph-osd133 Error ENOENT: osd.0 does not exist. create it before updating the crush map Here is the process that I went through: ceph -v ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) cat /etc/ceph/ceph.conf [global] osd_pool_default_pgp_num = 100 osd_pool_default_min_size = 1 auth_service_required = cephx mon_initial_members = ceph-mon01,ceph-mon02,ceph-mon03 fsid = 983a74a9-1e99-42ef-8a1d-097553c3e6ce cluster_network = 172.16.34.0/24 auth_supported = cephx auth_cluster_required = cephx mon_host = 172.16.33.20,172.16.33.21,172.16.33.22 auth_client_required = cephx osd_pool_default_size = 2 osd_pool_default_pg_num = 100 public_network = 172.16.33.0/24 ceph -s cluster 983a74a9-1e99-42ef-8a1d-097553c3e6ce health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds monmap e3: 3 mons at {ceph-mon01=172.16.33.20:6789/0,ceph-mon02=172.16.33.21:6789/0,ceph-mon03=172.16.33.22:6789/0}, election epoch 6, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e3: 0 osds: 0 up, 0 in pgmap v4: 192 pgs, 3 pools, 0 bytes data, 0 objects 0 kB used, 0 kB / 0 kB avail 192 creating ceph-disk list /dev/fd0 other, unknown /dev/sda : /dev/sda1 other, ext2, mounted on /boot /dev/sda2 other /dev/sda5 other, LVM2_member /dev/sdb : /dev/sdb1 ceph data, active, cluster 
ceph, osd.0, journal /dev/sdb2 /dev/sdb2 ceph journal, for /dev/sdb1 /dev/sr0 other, unknown mount /dev/sdb1 /var/lib/ceph/osd/ceph-0 ceph-osd -i 0 --mkfs --mkkey ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring ceph osd crush add-bucket ceph-osd133 host ceph osd crush move ceph-osd133 root=default ceph osd crush add osd.0 1.0 host=ceph-osd133 Error ENOENT: osd.0 does not exist. create it before updating the crush map I have seen that in earlier versions it could show this message but happily proceed. Is the doco out of date, or am I missing something? Cheers Adam
Re: [ceph-users] Pause i/o from time to time
What version of qemu do you have? The issues I had were fixed once I upgraded qemu to >= 1.4.2, which includes a critical rbd patch for asynchronous io from Josh Durgin. Cheers, Mike On 12/28/2013 4:09 PM, Andrei Mikhailovsky wrote: Hi guys, Did anyone figure out what could be causing this problem and a workaround? I've noticed a very annoying behaviour with my vms. It seems to happen randomly about 5-10 times a day and the pauses last between 2-10 minutes. It happens across all vms on all host servers in my cluster. I am running 0.67.4 on ubuntu 12.04 with the 3.11 kernel from backports. Initially I thought that these pauses were caused by the scrubbing issue reported by Mike, however, I've also noticed the stalls when the cluster is not scrubbing. Both of my osd servers are pretty idle (load around 1 to 2) with osds less than 10% utilised. Unlike Uwe's case, I am not using iscsi, but plain rbd with qemu, and I do not see any i/o errors in dmesg or kernel panics. The vms just freeze and become unresponsive, so I can't ssh into them or run simple commands like ls. VMs do respond to pings though. Thanks Andrei *From: *Uwe Grohnwaldt u...@grohnwaldt.eu *To: *ceph-users@lists.ceph.com *Sent: *Thursday, 24 October, 2013 8:31:42 AM *Subject: *Re: [ceph-users] Pause i/o from time to time Hello ceph-users, we hit a similar problem last Thursday and today. We have a cluster consisting of 6 storage nodes containing 70 osds (JBOD configuration). We created several rbd devices, mapped them on a dedicated server, and exported them via targetcli. These iscsi targets are connected to Citrix XenServer 6.1 (with HF30) and XenServer 6.2 (HF4). Recently, some disks died. 
After this, some errors occurred on this dedicated iscsi target: Oct 23 15:19:42 targetcli01 kernel: [673836.709887] end_request: I/O error, dev rbd4, sector 2034037064 Oct 23 15:19:42 targetcli01 kernel: [673836.713596] test_bit(BIO_UPTODATE) failed for bio: 880127546c00, err: -6 Oct 23 15:19:43 targetcli01 kernel: [673837.497382] end_request: I/O error, dev rbd4, sector 2034037064 Oct 23 15:19:43 targetcli01 kernel: [673837.501323] test_bit(BIO_UPTODATE) failed for bio: 880124d933c0, err: -6 These errors go through up to the virtual machines and lead to readonly filesystems. We could trigger this behavior by setting one disk to out. We are using Ubuntu 13.04 with latest stable ceph (ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)). Our ceph.conf is like this: [global] filestore_xattr_use_omap = true mon_host = 10.200.20.1,10.200.20.2,10.200.20.3 osd_journal_size = 1024 public_network = 10.200.40.0/16 mon_initial_members = ceph-mon01, ceph-mon02, ceph-mon03 cluster_network = 10.210.40.0/16 auth_supported = none fsid = 9283e647-2b57-4077-b427-0d3d656233b3 [osd] osd_max_backfills = 4 osd_recovery_max_active = 1 [osd.0] public_addr = 10.200.40.1 cluster_addr = 10.210.40.1 After the first outage we set osd_max_backfills to 8, after the second one to 4, but it didn't help. It seems like it is the bug mentioned at http://tracker.ceph.com/issues/6278 . The problem is that this is a production environment and the problems began after we moved several VMs to it. In our test environment we can't reproduce it, but we are working on a larger test installation. Does anybody have an idea how to investigate further without destroying virtual machines? ;) Sometimes these IO errors lead to kernel panics on the iscsi target machine. The targetcli/lio config is a simple default config without any tuning or big configurations. 
Mit freundlichen Grüßen / Best Regards, Uwe Grohnwaldt - Original Message - From: Timofey timo...@koolin.ru To: Mike Dawson mike.daw...@cloudapt.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, 17 September 2013 22:37:44 Subject: Re: [ceph-users] Pause i/o from time to time I have examined the logs. Yes, the first time it could have been scrubbing. It repaired itself. I had 2 servers before the first problem: one dedicated to an osd (osd.0), and a second with an osd and websites (osd.1). After the problem I added a third server dedicated to an osd (osd.2) and ran 'ceph osd out osd.1' to move the data off it. In ceph -s I saw a normal replacing process and everything worked well for about 5-7 hours. Then I got many misdirected records (a few hundred per second): osd.0 [WRN] client.359671 misdirected client.359671.1:220843 pg 2.3ae744c0 to osd.0 not [2,0] in e1040/1040 and errors in i/o operations. Now I have about 20GB of ceph logs with these errors. (I don't work with the cluster now - I copied all data out onto an hdd and work from the hdd.) Is there any way to have local software raid1 with a ceph rbd and a local image (to keep working when ceph fails or is slow for any reason)? I tried mdadm but it worked badly - the server hung up every few hours. You could be suffering from a known, but unfixed issue [1] where spindle contention from scrub
Re: [ceph-users] rebooting nodes in a ceph cluster
It is also useful to mention that you can set the noout flag when maintenance of any given length needs to exceed the 'mon osd down out interval'. $ ceph osd set noout ** no re-balancing will happen ** $ ceph osd unset noout ** normal re-balancing rules will resume ** - Mike Dawson On 12/19/2013 7:51 PM, Sage Weil wrote: On Thu, 19 Dec 2013, John-Paul Robinson wrote: What impact does rebooting nodes in a ceph cluster have on the health of the ceph cluster? Can it trigger rebalancing activities that then have to be undone once the node comes back up? I have a 4 node ceph cluster; each node has 11 osds. There is a single pool with redundant storage. If it takes 15 minutes for one of my servers to reboot, is there a risk that some sort of needless automatic processing will begin? By default, we start rebalancing data after 5 minutes. You can adjust this (to, say, 15 minutes) with mon osd down out interval = 900 in ceph.conf. sage I'm assuming that the ceph cluster can go into a not ok state, but that in this particular configuration all the data is protected against the single node failure and there is no place for the data to migrate to, so nothing bad will happen. Thanks for any feedback. ~jpr
Re: [ceph-users] rebooting nodes in a ceph cluster
I think my wording was a bit misleading in my last message. Instead of "no re-balancing will happen", I should have said that no OSDs will be marked out of the cluster with the noout flag set. - Mike On 12/21/2013 2:06 PM, Mike Dawson wrote: It is also useful to mention that you can set the noout flag when maintenance of any given length needs to exceed the 'mon osd down out interval'. $ ceph osd set noout ** no re-balancing will happen ** $ ceph osd unset noout ** normal re-balancing rules will resume ** - Mike Dawson On 12/19/2013 7:51 PM, Sage Weil wrote: On Thu, 19 Dec 2013, John-Paul Robinson wrote: What impact does rebooting nodes in a ceph cluster have on the health of the ceph cluster? Can it trigger rebalancing activities that then have to be undone once the node comes back up? I have a 4 node ceph cluster; each node has 11 osds. There is a single pool with redundant storage. If it takes 15 minutes for one of my servers to reboot, is there a risk that some sort of needless automatic processing will begin? By default, we start rebalancing data after 5 minutes. You can adjust this (to, say, 15 minutes) with mon osd down out interval = 900 in ceph.conf. sage I'm assuming that the ceph cluster can go into a not ok state, but that in this particular configuration all the data is protected against the single node failure and there is no place for the data to migrate to, so nothing bad will happen. Thanks for any feedback. ~jpr
Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)
Christian, I think you are going to suffer the effects of spindle contention with this type of setup. Based on your email and my assumptions, I will use the following inputs: - 4 OSDs, each backed by a 12-disk RAID6 set - 75 iops for each 7200rpm 3TB drive - RAID6 write penalty of 6 - OSD journal co-located with OSD - Ceph replication size of 2

4 osds * 12 disks * 75 iops / 6 (RAID6 write penalty) / 2 (co-located journal) / 2 (replication) = 150 writes/second max
4 osds * 12 disks * 75 iops / 2 (replication) = 1800 reads/second max

My guess is 150 writes/second is far lower than your 500 VMs will require. After all, this setup will likely give you lower writes/second than a single 15K SAS drive. Further, if you need to replace a drive, I suspect this setup would grind to a halt as the RAID6 set attempts to repair. On the other hand, if you planned for 48 individual drives with OSD journals on SSDs in a typical setup of perhaps a 5:1 or lower ratio of HDDs:SSDs, the calculation would look like:

48 osds * 75 iops / 2 (replication) = 1800 writes/second max
48 osds * 75 iops / 2 (replication) = 1800 reads/second max

As you can see, I estimate 12x more random writes without RAID6 (6x) and co-located osd journals (2x). Plus you'll be able to configure 12x more placement groups in your CRUSH rules by going from 4 osds to 48 osds. That will allow Ceph's pseudo-random placement rules to significantly improve the distribution of data and io load across the cluster and decrease the risk of hot-spots. A few other notes: - You'll certainly want QEMU 1.4.2 or later to get asynchronous io for RBD. - You'll likely want to enable RBD writeback cache. It helps coalesce small writes before they hit the disks. Cheers, Mike On 12/17/2013 2:44 AM, Christian Balzer wrote: Hello, I've been doing a lot of reading and am looking at the following design for a storage cluster based on Ceph. 
I will address all the likely knee-jerk reactions and reasoning below, so hold your guns until you've read it all. I also have a number of questions I've not yet found the answer to or determined by experimentation. Hardware: 2x 4U (can you say Supermicro? ^.^) servers with 24 3.5" hotswap bays, 2 internal OS (journal?) drives, probably Opteron 4300 CPUs (see below), an Areca 1882 controller with 4GB cache, and 2 or 3 2-port Infiniband HCAs. 24 3TB HDs (30% of the price of a 4TB one!) in one or two RAID6 sets, 2 of them hotspares, giving us 60TB per node and thus, with a replication factor of 2, that's also the usable space. Space for 2 more identical servers if need be. Network: Infiniband QDR, 2x 18-port switches (interconnected of course), redundant paths everywhere, including to the clients (compute nodes). Ceph configuration: an additional server with a mon, mons also on the 2 storage nodes, at least 2 OSDs per node (see below). This is for a private cloud with about 500 VMs at most. There will be 2 types of VMs, the majority writing a small amount of log chatter to their volumes, the other type (a few dozen) writing a more substantial data stream. I estimate less than 100MB/s of reads/writes at full build out, which should be well within the abilities of this setup. Now for the rationale of this design that goes contrary to anything normal Ceph layouts suggest: 1. Idiot (aka NOC monkey) proof hotswap of disks. This will be deployed in a remote data center, meaning that qualified people will not be available locally and thus would have to travel there each time a disk or two fails. In short, telling somebody to pull the disk tray with the red flashing LED and put a new one from the spare pile in there is a lot more likely to result in success than telling them to pull the 3rd row, 4th column disk in server 2. ^o^ 2. 
Density, TCO Ideally I would love to deploy something like this: http://www.mbx.com/60-drive-4u-storage-server/ but they seem to not really have a complete product description, price list, etc. ^o^ With a monster like that, I'd be willing to reconsider local raids and just overspec things in a way that a LOT of disks can fail before somebody (with a clue) needs to visit that DC. However failing that, the typical approach of using many smaller servers for OSDs increases the costs and/or reduces density. Replacing the 4U servers with 2U ones (that hold 12 disks) would require some sort of controller (to satisfy my #1 requirement) and similar amounts of HCAs per node, clearly driving the TCO up. 1U servers with typically 4 disks would be even worse. 3. Increased reliability/stability Failure of a single disk has no impact on the whole cluster, no need for any CPU/network intensive rebalancing. Questions/remarks: Due to the fact that there will be redundancy and reliability on the disk level, and that there will be only 2 storage nodes initially, I'm planning to disable rebalancing. Or will Ceph realize that making replicas on the same server won't really save the day and refrain from doing so? If more nodes are added
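Mike's back-of-the-envelope IOPS estimate above can be expressed as a small helper. The penalty factors (RAID6 write penalty, co-located-journal double write, replication factor) are the ones from his message; the function itself is just an illustrative sketch, not a cluster sizing tool.

```python
def max_write_iops(osds, disks_per_osd, iops_per_disk,
                   raid_write_penalty=1, journal_colocated=False, replicas=2):
    """Rough ceiling on cluster random-write IOPS: raw spindle IOPS divided
    by the RAID write penalty, the co-located-journal double write, and the
    replication factor."""
    iops = osds * disks_per_osd * iops_per_disk
    iops /= raid_write_penalty
    if journal_colocated:
        iops /= 2  # each write hits the journal and the data partition
    return iops / replicas

# 4 OSDs on 12-disk RAID6 sets, co-located journals, 2x replication:
print(max_write_iops(4, 12, 75, raid_write_penalty=6, journal_colocated=True))  # 150.0
# 48 individual OSDs with SSD journals, 2x replication:
print(max_write_iops(48, 1, 75))  # 1800.0
```

The 12x gap between the two configurations is exactly the RAID6 penalty (6x) times the co-located journal (2x).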
Re: [ceph-users] Adding new OSDs, need to increase PGs?
Robert, Interesting results on the effect of # of PGs/PGPs. My cluster struggles a bit under the strain of heavy random small-sized writes. The IOPS you mention seem high to me given 30 drives and 3x replication unless they were pure reads or on high-rpm drives. Instead of assuming, I want to pose a few questions: - How are you testing? rados bench, rbd bench, rbd bench with writeback cache, etc? - Were the 2000-2500 random 4k IOPS more reads than writes? If you test 100% 4k random reads, what do you get? If you test 100% 4k random writes, what do you get? - What drives do you have? Any RAID involved under your OSDs? Thanks, Mike Dawson On 12/3/2013 1:31 AM, Robert van Leeuwen wrote: On 2 dec. 2013, at 18:26, Brian Andrus brian.and...@inktank.com wrote: Setting your pg_num and pgp_num to say... 1024 would A) increase data granularity, B) likely lend no noticeable increase to resource consumption, and C) allow some room for future OSDs to be added while still within range of acceptable pg numbers. You could probably safely double even that number if you plan on expanding at a rapid rate and want to avoid splitting PGs every time a node is added. In general, you can conservatively err on the larger side when it comes to pg/p_num. Any excess resource utilization will be negligible (up to a certain point). If you have a comfortable amount of available RAM, you could experiment with increasing the multiplier in the equation you are using and see how it affects your final number. The pg_num and pgp_num parameters can safely be changed before or after your new nodes are integrated. I would be a bit conservative with the PGs / PGPs. I've experimented with the PG number a bit and noticed the following random IO performance drop. (This could be something specific to our setup, but since the PG count is easily increased and impossible to decrease, I would be conservative.) The setup: 3 OSD nodes with 128GB ram, 2 * 6 core CPUs (12 with ht). 
Nodes have 10 OSDs running on 1 tb disks and 2 SSDs for journals. We use a replica count of 3, so the optimum according to the formula is about 1000. With 1000 PGs I got about 2000-2500 random 4k IOPS. Because the nodes are fast enough and I expect the cluster to be expanded with 3 more nodes, I set the PGs to 2000. Performance dropped to about 1200-1400 IOPS. I noticed that the spinning disks were no longer maxing out at 100% usage. Memory and CPU did not seem to be a problem. Since I had the option to recreate the pool and I was not using the recommended settings, I did not really dive into the issue. I will not stray too far from the recommended settings in the future though :) Cheers, Robert van Leeuwen
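The formula Robert refers to is presumably the commonly cited sizing rule of roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two. A minimal sketch, assuming that rule:

```python
def suggested_pg_num(num_osds, replicas, target_pgs_per_osd=100):
    """Commonly cited PG sizing rule: (OSDs * target) / replicas,
    rounded up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replicas
    pg = 1
    while pg < raw:
        pg *= 2
    return pg

# Robert's cluster: 3 nodes * 10 OSDs, replica count 3 -> ~1000, rounded to 1024
print(suggested_pg_num(30, 3))  # 1024
```

His observed slowdown at 2000 PGs suggests the rounding/overshoot headroom in this rule is not free on all hardware, which is why he now stays close to the recommended value.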
Re: [ceph-users] Adding new OSDs, need to increase PGs?
Robert, Do you have rbd writeback cache enabled on these volumes? That could certainly explain the higher than expected write performance. Any chance you could re-test with rbd writeback on vs. off? Thanks, Mike Dawson On 12/3/2013 10:37 AM, Robert van Leeuwen wrote: Hi Mike, I am using filebench within a kvm virtual machine (like an actual workload we will have), using 100% synchronous 4k writes with a 50GB file on a 100GB volume with 32 writer threads. Also tried from multiple KVM machines on multiple hosts. Aggregate performance stays at 2k+ IOPS. The disks are 7200RPM 2.5 inch drives, no RAID whatsoever. I agree the amount of IOPS seems high. Maybe the journal on SSD (2 x Intel 3500) helps a bit in this regard, but the SSDs were not maxed out yet. The writes seem to be limited by the spinning disks: as soon as the benchmark starts they are at 100% utilization. Also, the usage dropped to 0% pretty much immediately after the benchmark, so it looks like it's not lagging behind the journal. Did not really test reads yet; since we have so much read cache (128 GB per node) I assume we will mostly be write limited. Cheers, Robert van Leeuwen Sent from my iPad On 3 dec. 2013, at 16:15, Mike Dawson mike.daw...@cloudapt.com wrote: Robert, Interesting results on the effect of # of PGs/PGPs. My cluster struggles a bit under the strain of heavy random small-sized writes. The IOPS you mention seem high to me given 30 drives and 3x replication unless they were pure reads or on high-rpm drives. Instead of assuming, I want to pose a few questions: - How are you testing? rados bench, rbd bench, rbd bench with writeback cache, etc? - Were the 2000-2500 random 4k IOPS more reads than writes? If you test 100% 4k random reads, what do you get? If you test 100% 4k random writes, what do you get? - What drives do you have? Any RAID involved under your OSDs? Thanks, Mike Dawson On 12/3/2013 1:31 AM, Robert van Leeuwen wrote: On 2 dec. 
2013, at 18:26, Brian Andrus brian.and...@inktank.com wrote: Setting your pg_num and pgp_num to say... 1024 would A) increase data granularity, B) likely lend no noticeable increase to resource consumption, and C) allow some room for future OSDs to be added while still within range of acceptable pg numbers. You could probably safely double even that number if you plan on expanding at a rapid rate and want to avoid splitting PGs every time a node is added. In general, you can conservatively err on the larger side when it comes to pg/p_num. Any excess resource utilization will be negligible (up to a certain point). If you have a comfortable amount of available RAM, you could experiment with increasing the multiplier in the equation you are using and see how it affects your final number. The pg_num and pgp_num parameters can safely be changed before or after your new nodes are integrated. I would be a bit conservative with the PGs / PGPs. I've experimented with the PG number a bit and noticed the following random IO performance drop. (This could be something specific to our setup, but since the PG count is easily increased and impossible to decrease, I would be conservative.) The setup: 3 OSD nodes with 128GB ram, 2 * 6 core CPUs (12 with ht). Nodes have 10 OSDs running on 1 tb disks and 2 SSDs for journals. We use a replica count of 3, so the optimum according to the formula is about 1000. With 1000 PGs I got about 2000-2500 random 4k IOPS. Because the nodes are fast enough and I expect the cluster to be expanded with 3 more nodes, I set the PGs to 2000. Performance dropped to about 1200-1400 IOPS. I noticed that the spinning disks were no longer maxing out at 100% usage. Memory and CPU did not seem to be a problem. Since I had the option to recreate the pool and I was not using the recommended settings, I did not really dive into the issue. 
I will not stray too far from the recommended settings in the future though :) Cheers, Robert van Leeuwen
Re: [ceph-users] how to enable rbd cache
Greg is right, you need to enable RBD admin sockets. This can be a bit tricky though, so here are a few tips: 1) In ceph.conf on the compute node, explicitly set a location for the admin socket: [client.volumes] admin socket = /var/run/ceph/rbd-$pid.asok In this example, libvirt/qemu is running with permissions from ceph.client.volumes.keyring. If you use something different, adjust accordingly. You can put this under a more generic [client] section, but there are some downsides (like a new admin socket for each ceph cli command). 2) Watch for permissions issues creating the admin socket at the path you used above. For me, I needed to explicitly grant some permissions in /etc/apparmor.d/abstractions/libvirt-qemu, specifically I had to add: # for rbd capability mknod, and # for rbd /etc/ceph/ceph.conf r, /var/log/ceph/* rw, /{,var/}run/ceph/** rw, 3) Be aware that if you have multiple rbd volumes attached to a single VM, you'll only get an admin socket for the volume mounted last. If you can set admin_socket via the libvirt xml for each volume, you can avoid this issue. This thread will explain better: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg16168.html 4) Once you get an RBD admin socket, query it like: ceph --admin-daemon /var/run/ceph/rbd-29050.asok config show | grep rbd Cheers, Mike Dawson On 11/25/2013 11:12 AM, Gregory Farnum wrote: On Mon, Nov 25, 2013 at 5:58 AM, Mark Nelson mark.nel...@inktank.com wrote: On 11/25/2013 07:21 AM, Shu, Xinxin wrote: Recently, I want to enable the rbd cache to identify the performance benefit. I added the rbd_cache=true option to my ceph configuration file and use 'virsh attach-device' to attach the rbd to a vm; below is my vdb xml file. Ceph configuration files are a bit confusing because sometimes you'll see something like rbd_cache listed somewhere, but in the ceph.conf file you'll want a space instead: rbd cache = true with no underscore. That should (hopefully) fix it for you! 
I believe the config file will take either format. The RBD cache is a client-side thing, though, so it's not ever going to show up in the OSD! You want to look at the admin socket created by QEMU (via librbd) to see if it's working. :) -Greg <disk type='network' device='disk'> <driver name='qemu' type='raw' cache='writeback'/> <source protocol='rbd' name='rbd/node12_2:rbd_cache=true:rbd_cache_writethrough_until_flush=true'/> <target dev='vdb' bus='virtio'/> <serial>6b5ff6f4-9f8c-4fe0-84d6-9d795967c7dd</serial> <address type='pci' domain='0x' bus='0x00' slot='0x06' function='0x0'/> </disk> I do not know if this is ok to enable rbd cache. I see perf counters for rbd cache in the source code, but when I used the admin daemon to check rbd cache statistics, ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump I did not get any rbd cache flags. My question is how to enable rbd cache and check the rbd cache perf counters, or how can I make sure rbd cache is enabled? Any tips will be appreciated. Thanks in advance. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Running on disks that lose their head
Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 11/7/2013 2:12 PM, Kyle Bader wrote: Once I know a drive has had a head failure, do I trust that the rest of the drive isn't going to go at an inconvenient moment vs just fixing it right now when it's not 3AM on Christmas morning? (true story) As good as Ceph is, do I trust that Ceph is smart enough to prevent spreading corrupt data all over the cluster if I leave bad disks in place and they start doing terrible things to the data? I have a lot more disks than I have trust in disks. If a drive lost a head then I want it gone. I love the idea of using smart data but can foresee some implementation issues. We have seen some raid configurations where polling smart will halt all raid operations momentarily. Also, some controllers require you to use their CLI tool to poll for smart vs smartmontools. It would be similarly awesome to embed something like an apdex score against each osd, especially if it factored in hierarchy to identify poorly performing osds, nodes, racks, etc.. Kyle, I think you are spot-on here. Apdex or similar scoring for gear performance is important for Ceph, IMO. Due to pseudo-random placement and replication, it can be quite difficult to identify 1) if hardware, software, or configuration are the cause of slowness, and 2) which hardware (if any) is slow. I recently discovered a method that seems to address both points. Zackc, Loicd, and I have been the main participants in a weekly Teuthology call the past few weeks. We've talked mostly about methods to extend Teuthology to capture performance metrics. Would you be willing to join us during the Teuthology and Ceph-Brag sessions at the Firefly Developer Summit? Cheers, Mike ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
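The Apdex idea Kyle and Mike discuss above can be sketched numerically. This is a hypothetical scoring helper, not anything Ceph ships; the 50 ms target and the 4x tolerating zone are assumed thresholds, following the standard Apdex formula (satisfied + tolerating/2) / total:

```python
# Hypothetical Apdex-style score for per-OSD op latencies.
# Apdex = (satisfied + tolerating / 2) / total samples, where
# "satisfied" ops finish under the target and "tolerating" ops
# finish within 4x the target (both thresholds are assumptions).
def apdex(latencies_ms, target_ms=50.0):
    """Return 1.0 when every op is under target, None for no samples."""
    if not latencies_ms:
        return None
    satisfied = sum(1 for l in latencies_ms if l <= target_ms)
    tolerating = sum(1 for l in latencies_ms if target_ms < l <= 4 * target_ms)
    return (satisfied + tolerating / 2.0) / len(latencies_ms)

if __name__ == "__main__":
    # 3 satisfied, 1 tolerating, 1 frustrated -> (3 + 0.5) / 5
    print(apdex([10, 20, 30, 120, 900], target_ms=50))  # 0.7
```

Aggregating such a score per OSD, per host, and per rack would give the hierarchy-aware view Kyle describes.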
Re: [ceph-users] Ceph User Committee
I also have time I could spend. Thanks for getting this started Loic! Thanks, Mike Dawson On 11/6/2013 12:35 PM, Loic Dachary wrote: Hi Ceph, I would like to open a discussion about organizing a Ceph User Committee. We briefly discussed the idea with Ross Turk, Patrick McGarry and Sage Weil today during the OpenStack summit. A pad was created and roughly summarizes the idea: http://pad.ceph.com/p/user-committee If there is enough interest, I'm willing to devote one day a week working for the Ceph User Committee. And yes, that includes sitting at the Ceph booth during the FOSDEM :-) And interviewing Ceph users and describing their use cases, which I enjoy very much. But also contribute to a user centric roadmap, which is what ultimately matters for the company I work for. If you'd like to see this happen but don't have time to participate in this discussion, please add your name + email at the end of the pad. What do you think ? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. What SSDs were you using that were so slow? Cheers, Mike On 11/6/2013 12:39 PM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 500-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander) is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph = 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson mark.nel...@inktank.com wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like collectl -sD -oT do the writes look balanced across the spinning disks? Do any devices have really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo and then grep for duration in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node.
Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. 
With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs):
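Mark's dump_historic_ops tip above can be automated instead of grepped by hand. A minimal sketch, assuming a top-level "Ops" list with "duration" and "description" fields (the real JSON layout of `dump_historic_ops` varies by Ceph release, so treat the field names as assumptions):

```python
import json

# Sketch: rank ops from `ceph --admin-daemon <sock> dump_historic_ops`
# by duration. Field names ("Ops", "duration", "description") are
# assumptions; adjust to the JSON your Ceph release emits.
def slowest_ops(dump_json, top=5):
    ops = json.loads(dump_json).get("Ops", [])
    return sorted(ops, key=lambda op: op.get("duration", 0), reverse=True)[:top]

# Illustrative stand-in for real admin-socket output.
sample = json.dumps({"Ops": [
    {"description": "osd_op(client.4123 write)", "duration": 0.004},
    {"description": "osd_op(client.4911 write)", "duration": 2.731},
    {"description": "osd_sub_op(replica write)", "duration": 0.150},
]})

for op in slowest_ops(sample, top=2):
    print(op["duration"], op["description"])
```

Run across every `*.asok` on a node, this gives the same "slowest operations" list as the grep, sorted and ready to compare between OSDs.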
Re: [ceph-users] ceph cluster performance
No, in our case flashing the firmware to the latest release cured the problem. If you build a new cluster with the slow SSDs, I'd be interested in the results of ioping[0] or fsync-tester[1]. I theorize that you may see spikes of high latency. [0] https://code.google.com/p/ioping/ [1] https://github.com/gregsfortytwo/fsync-tester Thanks, Mike Dawson On 11/6/2013 4:18 PM, Dinu Vlad wrote: ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. By fixed - you mean replaced the SSDs? Thanks, Dinu On Nov 6, 2013, at 10:25 PM, Mike Dawson mike.daw...@cloudapt.com wrote: We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. What SSDs were you using that were so slow? Cheers, Mike On 11/6/2013 12:39 PM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 500-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander) is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph = 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson mark.nel...@inktank.com wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects.
We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like collectl -sD -oT do the writes look balanced across the spinning disks? Do any devices have really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo and then grep for duration in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node.
For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs
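The latency-spike test Mike proposes with ioping/fsync-tester can be approximated in a few lines. This is a minimal sketch in the spirit of fsync-tester, not the tool itself: it times repeated small write+fsync cycles on a file placed on the journal device, which is roughly the I/O pattern an osd journal generates:

```python
import os
import tempfile
import time

# Sketch in the spirit of fsync-tester: time write+fsync cycles to
# expose latency spikes on a journal SSD. Block size and iteration
# count are arbitrary choices, not tuned values.
def fsync_latencies(path, iterations=20, block=b"x" * 4096):
    """Return per-cycle write+fsync latencies in seconds."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            t0 = time.monotonic()
            os.write(fd, block)
            os.fsync(fd)
            latencies.append(time.monotonic() - t0)
    finally:
        os.close(fd)
    return latencies

if __name__ == "__main__":
    # Point this at a file on the SSD under test instead of a temp file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        target = f.name
    lats = fsync_latencies(target)
    os.unlink(target)
    print("max fsync latency: %.6fs" % max(lats))
```

A healthy SSD should show a tight latency distribution; the 100x spikes Mike saw would show up as outliers in the max versus the median.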
Re: [ceph-users] Ceph health checkup
Narendra, This is an issue. You really want your cluster to be HEALTH_OK with all PGs active+clean. Some exceptions apply (like scrub / deep-scrub). What do 'ceph health detail' and 'ceph osd tree' show? Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 10/31/2013 6:53 PM, Trivedi, Narendra wrote: My Ceph cluster health checkup tells me the following. Should I be concerned? What's the remedy? What is missing? I issued this command from the monitor node. Please correct me if I am wrong, but I think the admin node's job is done after the installation unless I want to add additional OSD/MONs. [ceph@ceph-node1-mon-centos-6-4 ceph]$ sudo ceph health HEALTH_WARN 145 pgs degraded; 43 pgs down; 47 pgs peering; 76 pgs stale; 47 pgs stuck inactive; 76 pgs stuck stale; 192 pgs stuck unclean Thanks a lot in advance! Narendra This message contains information which may be confidential and/or privileged. Unless you are the intended recipient (or authorized to receive for the intended recipient), you may not read, use, copy or disclose to anyone the message or any information contained in the message. If you have received the message in error, please advise the sender by reply e-mail and delete the message and any attachment(s) thereto without retaining any copies. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
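A HEALTH_WARN summary like Narendra's packs several PG states into one line. A hypothetical helper (not part of the ceph CLI) can split it into per-state counts for easier triage:

```python
import re

# Hypothetical parser for a `ceph health` HEALTH_WARN summary line.
# Returns {pg_state: count}; the line format is the pre-Luminous
# "<N> pgs <state>; ..." style shown in the thread above.
def pg_warn_counts(health_line):
    return {state: int(n) for n, state in
            re.findall(r"(\d+) pgs ([a-z ]+?)(?:;|$)", health_line)}

line = ("HEALTH_WARN 145 pgs degraded; 43 pgs down; 47 pgs peering; "
        "76 pgs stale; 47 pgs stuck inactive; 76 pgs stuck stale; "
        "192 pgs stuck unclean")
counts = pg_warn_counts(line)
print(counts["stuck unclean"])  # 192
```

The down and stuck inactive buckets are the ones to chase first; 'ceph health detail' then names the exact PGs.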
Re: [ceph-users] How can I check the image's IO ?
Vernon, You can use the rbd command bench-write documented here: http://ceph.com/docs/next/man/8/rbd/#commands The command might look something like: rbd --pool test-pool bench-write --io-size 4096 --io-threads 16 --io-total 1GB test-image Some other interesting flags are --rbd-cache, --no-rbd-cache, and --io-pattern {seq|rand} Cheers, Mike On 10/30/2013 3:23 AM, vernon1987 wrote: Hi cephers, I use qemu-img create -f rbd rbd:test-pool/test-image to create an image. I want to know how I can check this image's IO. Or how to check the IO for each block? Thanks. 2013-10-30 vernon ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph monitor problems
Aaron, Don't mistake valid for advisable. For documentation purposes, three monitors is the advisable initial configuration for multi-node ceph clusters. If there is a valid need for more than three monitors, it is advisable to add them two at a time to maintain an odd number of total monitors. -Mike On 10/30/2013 4:46 PM, Aaron Ten Clay wrote: On Wed, Oct 30, 2013 at 1:43 PM, Joao Eduardo Luis joao.l...@inktank.com mailto:joao.l...@inktank.com wrote: A quorum of 2 monitors is completely fine as long as both monitors are up. A quorum is always possible regardless of how many monitors you have, as long as a majority is up and able to form it (1 out of 1, 2 out of 2, 2 out of 3, 3 out of 4, 3 out of 5, 4 out of 6,...). -Joao Joao, The page at http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ only lists 1; 3 out of 5; 4 out of 6; etc.. Perhaps it should be updated if 2 out of 2 is a valid configuration? -Aaron ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
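The quorum rule Joao describes is simple majority arithmetic, which a few lines make concrete (a sketch of the rule, not of the actual monitor election code):

```python
# Sketch of the monitor quorum rule: a quorum needs a strict majority
# of the monmap, i.e. floor(n/2) + 1 monitors up.
def quorum_needed(num_mons):
    return num_mons // 2 + 1

for n in range(1, 7):
    print(n, "monitors -> quorum of", quorum_needed(n))
# Note that 2 monitors need both up (no failure tolerance), and even
# counts like 4 tolerate no more failures than 3 -- which is why an odd
# number of monitors is the advisable configuration.
```

This reproduces Joao's list: 1 of 1, 2 of 2, 2 of 3, 3 of 4, 3 of 5, 4 of 6.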
Re: [ceph-users] About use same SSD for OS and Journal
Kurt, When you had OS and osd journals co-located, how many osd journals were on the SSD containing the OS? You mention you now use a 5:1 ratio. Was the ratio something like 11:1 before (one SSD for OS plus 11 osd journals to 11 OSDs in a 12-disk chassis)? Also, what throughput per drive were you seeing on the cluster during the periods where things got laggy due to backfills, etc? Last, did you attempt to throttle using ceph config settings in the old setup? Do you need to throttle in your current setup? Thanks, Mike Dawson On 10/24/2013 10:40 AM, Kurt Bauer wrote: Hi, we had a setup like this and ran into trouble, so I would strongly discourage you from setting it up like this. Under normal circumstances there's no problem, but when the cluster is under heavy load, for example when it has a lot of pgs backfilling, for whatever reason (increasing num of pgs, adding OSDs,..), there's obviously a lot of entries written to the journals. What we saw then was extremely laggy behavior of the cluster, and when looking at the iostats of the SSD, they were at 100% most of the time. I don't exactly know what causes this and why the SSDs can't cope with the amount of IOs, but separating OS and journals did the trick. We now have quick 15k HDDs in Raid1 for OS and Monitor journal, and one SSD per 5 OSD journals, with one partition per journal (used as a raw partition). Hope that helps, best regards, Kurt Martin Catudal schrieb: Hi, Here is my scenario: I will have a small cluster (4 nodes) with 4 (4 TB) OSD's per node. I will have the OS installed on two SSDs in a raid 1 configuration. Have any of you successfully and efficiently run a Ceph cluster built with the journal on a separate partition on the OS SSDs? I know that a lot of IO may occur on the journal SSD and I'm scared my OS will suffer from too much IO. Any background experience?
Martin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
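The sizing question behind Kurt's 5:1 ratio comes down to whether one SSD can absorb the combined journal write stream of the OSDs behind it. A back-of-the-envelope sketch (the per-OSD throughput figure is an assumption for illustration, not a measured value):

```python
# Rough journal-SSD sizing sketch: every byte written to an OSD also
# hits its journal, so an SSD hosting j journals must sustain roughly
# j x the per-OSD write throughput during backfill-heavy periods.
def ssd_write_load_mb_s(osds_per_ssd, per_osd_write_mb_s):
    return osds_per_ssd * per_osd_write_mb_s

print(ssd_write_load_mb_s(5, 80))   # 5:1 ratio at an assumed 80 MB/s/OSD -> 400 MB/s
print(ssd_write_load_mb_s(11, 80))  # 11:1 ratio -> 880 MB/s, beyond many SATA SSDs
```

That gap between the two ratios is consistent with Kurt's experience: the co-located setup saturated the SSD under backfill, while 5:1 keeps the load inside what one SATA SSD can sustain.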
Re: [ceph-users] saucy salamander support?
For the time being, you can install the Raring debs on Saucy without issue. echo deb http://ceph.com/debian-dumpling/ raring main | sudo tee /etc/apt/sources.list.d/ceph.list I'd also like to register a +1 request for official builds targeted at Saucy. Cheers, Mike On 10/22/2013 11:42 AM, LaSalle, Jurvis wrote: Hi, I accidentally installed Saucy Salamander. Does the project have a timeframe for supporting this Ubuntu release? Thanks, JL ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multiply OSDs per host strategy ?
Andrija, You can use a single pool and the proper CRUSH rule step chooseleaf firstn 0 type host to accomplish your goal. http://ceph.com/docs/master/rados/operations/crush-map/ Cheers, Mike Dawson On 10/16/2013 5:16 PM, Andrija Panic wrote: Hi, I have 2 x 2TB disks, in 3 servers, so total of 6 disks... I have deployed total of 6 OSDs. ie: host1 = osd.0 and osd.1 host2 = osd.2 and osd.3 host4 = osd.4 and osd.5 Now, since I will have total of 3 replica (original + 2 replicas), I want my replica placement to be such, that I don't end up having 2 replicas on 1 host (replica on osd0, osd1 (both on host1) and replica on osd2. I want all 3 replicas spread on different hosts... I know this is to be done via crush maps, but I'm not sure if it would be better to have 2 pools, 1 pool on osd0,2,4 and and another pool on osd1,3,5. If possible, I would want only 1 pool, spread across all 6 OSDs, but with data placement such, that I don't end up having 2 replicas on 1 host...not sure if this is possible at all... Is that possible, or maybe I should go for RAID0 in each server (2 x 2Tb = 4TB for osd0) or maybe JBOD (1 volume, so 1 OSD per host) ? Any suggesting about best practice ? Regards, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
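For reference, a full rule using that step might look like the following sketch (the rule name and ruleset number are placeholders; adjust to your crushmap, then recompile and inject it):

```
rule rbd_host_spread {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

With `chooseleaf firstn 0 type host`, CRUSH picks as many hosts as the pool's replica count and one OSD under each, so no two replicas ever share a host, while all six OSDs still serve the single pool.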
Re: [ceph-users] Ceph and RAID
Currently Ceph uses replication. Each pool is set with a replication factor. A replication factor of 1 obviously offers no redundancy. Replication factors of 2 or 3 are common. So, Ceph currently halves or thirds your usable storage, accordingly. Also, note you can co-mingle pools of various replication factors, so the actual math can get more complicated. There is a team of developers building an Erasure Coding backend for Ceph that will allow for more options. http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29 Initial release is scheduled for Ceph's Firefly release in February 2014. Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC On 10/3/2013 2:44 PM, Aronesty, Erik wrote: Does Ceph really halve your storage like that? If you specify N+1, does it really store two copies, or just compute checksums across MxN stripes? I guess Raid5+Ceph with a large array (12 disks say) would be not too bad (2.2TB for each 1). But it would be nicer, if I had 12 storage units in a single rack on a single network, for me to tell CEPH to stripe across them in a RAIDZ fashion, so that I'm only losing 10% of my storage to redundancy... not 50%. -Original Message- From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John-Paul Robinson Sent: Thursday, October 03, 2013 12:08 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph and RAID What is the take on such a configuration? Is it worth the effort of tracking rebalancing at two layers, RAID mirror and possibly Ceph if the pool has a redundancy policy? Or is it better to just let ceph rebalance itself when you lose a non-mirrored disk? If following the raid mirror approach, would you then skip redundancy at the ceph layer to keep your total overhead the same?
It seems that would be risky in the event you lose your storage server with the raid-1'd drives. No Ceph level redundancy would then be fatal. But if you do raid-1 plus ceph redundancy, doesn't that mean it takes 4TB for each 1 real TB? ~jpr On 10/02/2013 10:03 AM, Dimitri Maziuk wrote: I would consider (mdadm) raid-1, dep. on the hardware budget, because this way a single disk failure will not trigger a cluster-wide rebalance. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
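The raw-to-usable arithmetic the thread keeps circling can be written down once (a minimal sketch of the overhead math only; erasure coding, which changes this math, is not modeled):

```python
# Sketch of the raw-to-usable capacity math discussed above:
# n-way replication divides raw capacity by n, and stacking RAID-1
# underneath halves it again -- giving jpr's 4:1 overhead for
# RAID-1 + 2x replication.
def usable_tb(raw_tb, replication, raid_mirror=False):
    raw = raw_tb / 2.0 if raid_mirror else float(raw_tb)
    return raw / replication

print(usable_tb(12, 3))                    # 12 TB raw, 3x pool -> 4.0 TB usable
print(usable_tb(12, 2, raid_mirror=True))  # RAID-1 under a 2x pool -> 3.0 TB usable
```

The erasure-coded backend Mike mentions is exactly what replaces this divide-by-n with a k/(k+m) fraction, bringing overhead closer to the 10% Erik wants.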
Re: [ceph-users] RBD Snap removal priority
[cc ceph-devel] Travis, RBD doesn't behave well when Ceph maintenance operations create spindle contention (i.e. 100% util from iostat). More about that below. Do you run XFS under your OSDs? If so, can you check for extent fragmentation? Should be something like: xfs_db -c frag -r /dev/sdb1 We recently saw fragmentation factors of over 80%, with lots of ino's having hundreds of extents. After 24 hours+ of defrag'ing, we got it under control, but we're seeing the fragmentation factor grow by ~1.5% daily. We experienced spindle contention issues even after the defrag. Sage, Sam, etc, I think the real issue is Ceph has several states where it performs what I would call maintenance operations that saturate the underlying storage without properly yielding to client i/o (which should have a higher priority). I have experienced or seen reports of Ceph maintenance affecting rbd client i/o in many ways: - QEMU/RBD Client I/O Stalls or Halts Due to Spindle Contention from Ceph Maintenance [1] - Recovery and/or Backfill Cause QEMU/RBD Reads to Hang [2] - rbd snap rm (Travis' report below) [1] http://tracker.ceph.com/issues/6278 [2] http://tracker.ceph.com/issues/6333 I think this family of issues speaks to the need for Ceph to have more visibility into the underlying storage's limitations (especially spindle contention) when performing known expensive maintenance operations. Thanks, Mike Dawson On 9/27/2013 12:25 PM, Travis Rhoden wrote: Hello everyone, I'm running a Cuttlefish cluster that hosts a lot of RBDs. I recently removed a snapshot of a large one (rbd snap rm -- 12TB), and I noticed that all of the clients had markedly decreased performance. Looking at iostat on the OSD nodes had most disks pegged at 100% util. I know there are thread priorities that can be set for clients vs recovery, but I'm not sure what deleting a snapshot falls under. I couldn't really find anything relevant. Is there anything I can tweak to lower the priority of such an operation?
I didn't need it to complete fast, as rbd snap rm returns immediately and the actual deletion is done asynchronously. I'd be fine with it taking longer at a lower priority, but as it stands now it brings my cluster to a crawl and is causing issues with several VMs. I see an osd snap trim thread timeout option in the docs -- Is the operation occurring here what you would call snap trimming? If so, any chance of adding an option for osd snap trim priority just like there is for osd client op and osd recovery op? Hope what I am saying makes sense... - Travis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Pause i/o from time to time
You could be suffering from a known, but unfixed issue [1] where spindle contention from scrub and deep-scrub cause periodic stalls in RBD. You can try to disable scrub and deep-scrub with: # ceph osd set noscrub # ceph osd set nodeep-scrub If your problem stops, Issue #6278 is likely the cause. To re-enable scrub and deep-scrub: # ceph osd unset noscrub # ceph osd unset nodeep-scrub Because you seem to only have two OSDs, you may also be saturating your disks even without scrub or deep-scrub. http://tracker.ceph.com/issues/6278 Cheers, Mike Dawson On 9/16/2013 12:30 PM, Timofey wrote: I use ceph for HA-cluster. Some time ceph rbd go to have pause in work (stop i/o operations). Sometime it can be when one of OSD slow response to requests. Sometime it can be my mistake (xfs_freeze -f for one of OSD-drive). I have 2 storage servers with one osd on each. This pauses can be few minutes. 1. Is any settings for fast change primary osd if current osd work bad (slow, don't response). 2. Can I use ceph-rbd in software raid-array with local drive, for use local drive instead of ceph if ceph cluster fail? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] status of glance/cinder/nova integration in openstack grizzly
Darren, I can confirm Copy on Write (show_image_direct_url = True) does work in Grizzly. It sounds like you are close. To check permissions, run 'ceph auth list', and reply with client.images and client.volumes (or whatever keys you use in Glance and Cinder). Cheers, Mike Dawson On 9/10/2013 10:12 AM, Darren Birkett wrote: Hi All, tl;dr - does glance/rbd and cinder/rbd play together nicely in grizzly? I'm currently testing a ceph/rados back end with an openstack installation. I have the following things working OK: 1. cinder configured to create volumes in RBD 2. nova configured to boot from RBD backed cinder volumes (libvirt UUID secret set etc) 3. glance configured to use RBD as a back end store for images With this setup, when I create a bootable volume in cinder, passing an id of an image in glance, the image gets downloaded, converted to raw, and then created as an RBD object and made available to cinder. The correct metadata field for the cinder volume is populated (volume_image_metadata) and so the cinder client marks the volume as bootable. This is all fine. If I want to take advantage of the fact that both glance images and cinder volumes are stored in RBD, I can add the following flag to the glance-api.conf: show_image_direct_url = True This enables cinder to see that the glance image is stored in RBD, and the cinder rbd driver then, instead of downloading the image and creating an RBD image from it, just issues an 'rbd clone' command (seen in the cinder-volume.log): rbd clone --pool images --image dcb2f16d-a09d-4064-9198-1965274e214d --snap snap --dest-pool volumes --dest volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d This is all very nice, and the cinder volume is available immediately as you'd expect. The problem is that the metadata field is not populated so it's not seen as bootable. Even manually populating this field leaves the volume unbootable. The volume can not even be attached to another instance for inspection. 
libvirt doesn't seem to be able to access the rbd device. From nova-compute.log: qemu-system-x86_64: -drive file=rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,serial=20987f9d-b4fb-463d-8b8f-fa667bd47c6d,cache=none: error reading header from volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d qemu-system-x86_64: -drive file=rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,serial=20987f9d-b4fb-463d-8b8f-fa667bd47c6d,cache=none: could not open disk image rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none: Operation not permitted It's almost like a permission issue, but my ceph/rbd knowledge is still fledgeling. I know that the cinder rbd driver has been rewritten to use librbd in havana, and I'm wondering if this will change any of this behaviour? I'm also wondering if anyone has actually got this working with grizzly, and how? Many thanks Darren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] status of glance/cinder/nova integration in openstack grizzly
On 9/10/2013 4:50 PM, Darren Birkett wrote: Hi Mike, That led me to realise what the issue was. My cinder (volumes) client did not have the correct perms on the images pool. I ran the following to update the perms for that client: ceph auth caps client.volumes mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images' ...and was then able to successfully boot an instance from a cinder volume that was created by cloning a glance image from the images pool! Glad you found it. This has been a sticking point for several people. One last question: I presume the fact that the 'volume_image_metadata' field is not populated when cloning a glance image into a cinder volume is a bug? It means that the cinder client doesn't show the volume as bootable, though I'm not sure what other detrimental effect it actually has (clearly the volume can be booted from). I think you are talking about data in the cinder table of your database backend (mysql?). I don't have 'volume_image_metadata' at all here. I don't think this is the issue. To create a Cinder volume from Glance, I do something like: cinder --os_tenant_name MyTenantName create --image-id 00e0042e-d007-400a-918a-d5e00cea8b0f --display-name MyVolumeName 40 I can then spin up an instance backed by MyVolumeName and boot as expected. Thanks Darren On 10 September 2013 21:04, Darren Birkett darren.birk...@gmail.com mailto:darren.birk...@gmail.com wrote: Hi Mike, Thanks - glad to hear it definitely works as expected! 
Here's my client.glance and client.volumes from 'ceph auth list':

client.glance
    key: AQAWFi9SOKzAABAAPV1ZrpWkx72tmJ5E7nOi3A==
    caps: [mon] allow r
    caps: [osd] allow rwx pool=images, allow class-read object_prefix rbd_children
client.volumes
    key: AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==
    caps: [mon] allow r
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes

Thanks Darren On 10 September 2013 20:08, Mike Dawson mike.daw...@cloudapt.com mailto:mike.daw...@cloudapt.com wrote: Darren, I can confirm Copy on Write (show_image_direct_url = True) does work in Grizzly. It sounds like you are close. To check permissions, run 'ceph auth list', and reply with client.images and client.volumes (or whatever keys you use in Glance and Cinder). Cheers, Mike Dawson
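The permission gap behind Darren's 'Operation not permitted' can be checked mechanically before touching libvirt. A minimal sketch, assuming 'ceph auth list' output in the shape quoted in this thread — the sample below is text copied from the thread, not a live query; on a real node you would pipe `ceph auth list` in instead:

```shell
# Check whether client.volumes has any grant on the images pool.
# Sample output copied from this thread (assumption: your live
# 'ceph auth list' output has the same shape).
auth_list=$(cat <<'EOF'
client.volumes
    key: AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==
    caps: [mon] allow r
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes
EOF
)

# Isolate the osd caps line for client.volumes, then look for pool=images.
caps=$(printf '%s\n' "$auth_list" | sed -n '/^client\.volumes/,/^client\./p' | grep '\[osd\]')
if printf '%s\n' "$caps" | grep -q 'pool=images'; then
    echo "client.volumes can reach the images pool"
else
    echo "client.volumes has NO grant on the images pool"
fi
```

Darren's eventual fix was re-issuing the caps with an extra `allow rx pool=images` grant, after which a check like this would report access.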
Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling
Sam and Oliver, We've had tons of issues with Dumpling rbd volumes showing sporadic periods of high latency for Windows guests doing lots of small writes. We saw the issue occasionally with Cuttlefish, but it got significantly worse with Dumpling. Initial results with wip-dumpling-perf2 appear very promising. Thanks for your work! I'll report back tomorrow if I have any new results. Thanks, Mike Dawson Co-Founder Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/29/2013 2:52 PM, Oliver Daudey wrote: Hey Mark and list, FYI for you and the list: Samuel and I seem to have found and fixed the remaining performance-problems. For those who can't wait, fixes are in wip-dumpling-perf2 and will probably be in the next point-release. Regards, Oliver On 27-08-13 17:13, Mark Nelson wrote: Ok, definitely let us know how it goes! For what it's worth, I'm testing Sam's wip-dumpling-perf branch with the wbthrottle code disabled now and comparing it both to that same branch with it enabled along with 0.67.1. Don't have any perf data, but quite a bit of other data to look through, both in terms of RADOS bench and RBD. Mark On 08/27/2013 10:07 AM, Oliver Daudey wrote: Hey Mark, That will take a day or so for me to know with enough certainty. With the low CPU-usage and preliminary results today, I'm confident enough to upgrade all OSDs in production and test the cluster all-Dumpling tomorrow. For now, I only upgraded a single OSD and measured CPU-usage and whatever performance-effects that had on the cluster, so if I would lose that OSD, I could recover. :-) Will get back to you. Regards, Oliver On 27-08-13 15:04, Mark Nelson wrote: Hi Olver/Matthew, Ignoring CPU usage, has speed remained slower as well? Mark On 08/27/2013 03:08 AM, Oliver Daudey wrote: Hey Samuel, The PGLog::check() is now no longer visible in profiling, so it helped for that. Unfortunately, it doesn't seem to have helped to bring down the OSD's CPU-loading much. 
Leveldb still uses much more than in Cuttlefish. On my test-cluster, I didn't notice any difference in the RBD bench-results, either, so I have to assume that it didn't help performance much. Here's the `perf top' I took just now on my production-cluster with your new version, under regular load. Also note the memcmp and memcpy, which also don't show up when running a Cuttlefish-OSD:

 15.65%  [kernel]              [k] intel_idle
  7.20%  libleveldb.so.1.9     [.] 0x3ceae
  6.28%  libc-2.11.3.so        [.] memcmp
  5.22%  [kernel]              [k] find_busiest_group
  3.92%  kvm                   [.] 0x2cf006
  2.40%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
  1.95%  [kernel]              [k] _raw_spin_lock
  1.69%  [kernel]              [k] default_send_IPI_mask_sequence_phys
  1.46%  libc-2.11.3.so        [.] memcpy
  1.17%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
  1.16%  [kernel]              [k] hrtimer_interrupt
  1.07%  [kernel]              [k] native_write_cr0
  1.01%  [kernel]              [k] __hrtimer_start_range_ns
  1.00%  [kernel]              [k] clockevents_program_event
  0.93%  [kernel]              [k] find_next_bit
  0.93%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
  0.89%  [kernel]              [k] cpumask_next_and
  0.87%  [kernel]              [k] __schedule
  0.85%  [kernel]              [k] _raw_spin_unlock_irqrestore
  0.85%  [kernel]              [k] do_select
  0.84%  [kernel]              [k] apic_timer_interrupt
  0.80%  [kernel]              [k] fget_light
  0.79%  [kernel]              [k] native_write_msr_safe
  0.76%  [kernel]              [k] _raw_spin_lock_irqsave
  0.66%  libc-2.11.3.so        [.] 0xdc6d8
  0.61%  libpthread-2.11.3.so  [.] pthread_mutex_lock
  0.61%  [kernel]              [k] tg_load_down
  0.59%  [kernel]              [k] reschedule_interrupt
  0.59%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
  0.56%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
  0.54%  [kvm_intel]           [k] vmx_vcpu_run
  0.53%  [kernel]              [k] copy_user_generic_string
  0.53%  [kernel]              [k] load_balance
  0.50%  [kernel]              [k] rcu_needs_cpu
  0.45%  [kernel]              [k] fput

Regards, Oliver On Mon, 2013-08-26 at 23:33 -0700, Samuel Just wrote: I just pushed a patch to wip-dumpling-log-assert (based on current dumpling head).
I had disabled most of the code in PGLog::check() but left an (I thought) innocuous assert. It seems that with (at least) g
Re: [ceph-users] Openstack glance ceph rbd_store_user authentification problem
Steffan, It works for me. I have:

user@node:/etc/ceph# cat /etc/glance/glance-api.conf | grep rbd
default_store = rbd
# glance.store.rbd.Store,
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = images
rbd_store_pool = images
rbd_store_chunk_size = 4

Thanks, Mike Dawson On 8/8/2013 9:01 AM, Steffen Thorhauer wrote: Hi, recently I had a problem with openstack glance and ceph. I used the http://ceph.com/docs/master/rbd/rbd-openstack/#configuring-glance documentation and the http://docs.openstack.org/developer/glance/configuring.html documentation. I'm using ubuntu 12.04 LTS with grizzly from Ubuntu Cloud Archive and ceph 0.61.7. glance-api.conf had the following config options:

default_store = rbd
rbd_store_user=images
rbd_store_pool = images
rbd_store_ceph_conf = /etc/ceph/ceph.conf

Every time I did a glance image create, I got errors. In the glance api log I only found errors like:

2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images Traceback (most recent call last):
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/glance/api/v1/images.py, line 444, in _upload
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     image_meta['size'])
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/glance/store/rbd.py, line 241, in add
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     with rados.Rados(conffile=self.conf_file, rados_id=self.user) as conn:
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/rados.py, line 134, in __enter__
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     self.connect()
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/rados.py, line 192, in connect
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     raise make_ex(ret, error calling connect)
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images ObjectNotFound: error calling connect

This trace message helped
me not very much :-( My google search for "glance.api.v1.images ObjectNotFound: error calling connect" only found http://irclogs.ceph.widodh.nl/index.php?date=2012-10-26 This pointed me to a ceph authentication problem. But the ceph tools worked fine for me. Then I tried the debug option in glance-api.conf and found the following entries:

DEBUG glance.common.config [-] rbd_store_pool = images log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485
DEBUG glance.common.config [-] rbd_store_user = glance log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485

The glance-api service did not use my rbd_store_user = images option!! Then I configured a client.glance auth and it worked with the implicit glance user!!! Now my question: Am I the only one with this problem?? Regards, Steffen Thorhauer
Re: [ceph-users] how to recover the osd.
Looks like you didn't get osd.0 deployed properly. Can you show: - ls /var/lib/ceph/osd/ceph-0 - cat /etc/ceph/ceph.conf Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/8/2013 9:13 AM, Suresh Sadhu wrote: HI, My storage cluster health is in warning state; one of the osds is down and even if I try to start the osd it fails to start.

sadhu@ubuntu3:~$ ceph osd stat
e22: 2 osds: 1 up, 1 in

sadhu@ubuntu3:~$ ls /var/lib/ceph/osd/
ceph-0  ceph-1

sadhu@ubuntu3:~$ ceph osd tree
# id    weight    type name       up/down reweight
-1      0.14      root default
-2      0.14          host ubuntu3
0       0.06999           osd.0   down    0
1       0.06999           osd.1   up      1

sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start 0
/etc/init.d/ceph: 0. not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start osd.0
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

Ceph health status in warning mode:

pg 4.10 is active+degraded, acting [1]
pg 3.17 is active+degraded, acting [1]
pg 5.16 is active+degraded, acting [1]
pg 4.17 is active+degraded, acting [1]
pg 3.10 is active+degraded, acting [1]
recovery 62/124 degraded (50.000%)
mds.ceph@ubuntu3 at 10.147.41.3:6803/2148 is laggy/unresponsi

regards sadhu
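For reference, with a sysvinit/mkcephfs-style layout the "osd.0 not found (/etc/ceph/ceph.conf defines , ...)" error usually means the init script cannot map osd.0 to a host via ceph.conf. A minimal sketch of the missing stanza — the hostname is taken from the `ceph osd tree` output above, and whether this matches the actual deployment method is an assumption:

```ini
[osd.0]
    host = ubuntu3
```

With ceph-deploy-style clusters the daemon list instead comes from marker files in the OSD data directory (e.g. /var/lib/ceph/osd/ceph-0), so this stanza may not apply there.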
Re: [ceph-users] Large storage nodes - best practices
On 8/5/2013 12:51 PM, Brian Candler wrote: On 05/08/2013 17:15, Mike Dawson wrote: Short answer: Ceph generally is used with multiple OSDs per node. One OSD per storage drive with no RAID is the most common setup. At 24- or 36-drives per chassis, there are several potential bottlenecks to consider. Mark Nelson, the Ceph performance guy at Inktank, has published several articles you should consider reading. A few of interest are [0], [1], and [2]. The last link is a 5-part series. Yep, I saw [0] and [1]. He tries a 6-disk RAID0 array and generally gets lower throughput than 6 x JBOD disks (although I think he's using the controller RAID0 functionality, rather than mdraid). AFAICS he has a 36-disk chassis but only runs tests with 6 disks, which is a shame as it would be nice to know which other bottleneck you could hit first with this type of setup. The third link I sent shows Mark's results with 24 spinners and 8 SSDs for journals. Specifically read: http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/#setup Florian Haas has also published some thoughts on bottlenecks: http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals Also, note that there is on-going work to add erasure coding as an optional backend (as opposed to the current replication scheme). If you prioritize bulk storage over performance, you may be interested in following the progress [3], [4], and [5].
[0]: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[1]: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[2]: http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/
[3]: http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[4]: http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[5]: http://www.inktank.com/about-inktank/roadmap/

Thank you - erasure coding is very much of interest for the archival-type storage I'm looking at. However your links [3] and [4] are identical, did you mean to link to another one? Oops. http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29 Cheers, Brian.
Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]
Josh, Logs are uploaded to cephdrop with the file name mikedawson-rbd-qemu-deadlock. - At about 2013-08-05 19:46 or 47, we hit the issue, traffic went to 0 - At about 2013-08-05 19:53:51, ran a 'virsh screenshot' Environment is: - Ceph 0.61.7 (client is co-mingled with three OSDs) - rbd cache = true and cache=writeback - qemu 1.4.0 1.4.0+dfsg-1expubuntu4 - Ubuntu Raring with 3.8.0-25-generic This issue is reproducible in my environment, and I'm willing to run any wip branch you need. What else can I provide to help? Thanks, Mike Dawson On 8/5/2013 3:48 AM, Stefan Hajnoczi wrote: On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote: Am 02.08.2013 um 23:47 schrieb Mike Dawson mike.daw...@cloudapt.com: We can un-wedge the guest by opening a NoVNC session or running a 'virsh screenshot' command. After that, the guest resumes and runs as expected. At that point we can examine the guest. Each time we'll see: If virsh screenshot works then this confirms that QEMU itself is still responding. Its main loop cannot be blocked since it was able to process the screendump command. This supports Josh's theory that a callback is not being invoked. The virtio-blk I/O request would be left in a pending state. Now here is where the behavior varies between configurations: On a Windows guest with 1 vCPU, you may see the symptom that the guest no longer responds to ping. On a Linux guest with multiple vCPUs, you may see the hung task message from the guest kernel because other vCPUs are still making progress. Just the vCPU that issued the I/O request and whose task is in UNINTERRUPTIBLE state would really be stuck. Basically, the symptoms depend not just on how QEMU is behaving but also on the guest kernel and how many vCPUs you have configured. I think this can explain how both problems you are observing, Oliver and Mike, are a result of the same bug. At least I hope they are :). 
Stefan
Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process
Oliver, We've had a similar situation occur. For about three months, we've run several Windows 2008 R2 guests with virtio drivers that record video surveillance. We have long suffered an issue where the guest appears to hang indefinitely (or until we intervene). For the sake of this conversation, we call this state wedged, because it appears something (rbd, qemu, virtio, etc) gets stuck on a deadlock. When a guest gets wedged, we see the following:

- the guest will not respond to pings
- the qemu-system-x86_64 process drops to 0% cpu
- graphite graphs show the interface traffic dropping to 0bps
- the guest will stay wedged forever (or until we intervene)
- strace of qemu-system-x86_64 shows QEMU is making progress [1][2]

We can un-wedge the guest by opening a NoVNC session or running a 'virsh screenshot' command. After that, the guest resumes and runs as expected. At that point we can examine the guest. Each time we'll see:

- No Windows error logs whatsoever while the guest is wedged
- A time sync typically occurs right after the guest gets un-wedged
- Scheduled tasks do not run while wedged
- Windows error logs do not show any evidence of suspend, sleep, etc

We had so many issues with guests becoming wedged, we wrote a script to 'virsh screenshot' them via cron. Then we installed some updates and had a month or so of higher stability (wedging happened maybe 1/10th as often). Until today we couldn't figure out why. Yesterday, I realized qemu was starting the instances without specifying cache=writeback. We corrected that, and let them run overnight. With RBD writeback re-enabled, wedging came back as often as we had seen in the past. I've counted ~40 occurrences in the past 12-hour period. So I feel like writeback caching in RBD certainly makes the deadlock more likely to occur.
Joshd asked us to gather RBD client logs: joshd it could very well be the writeback cache not doing a callback at some point - if you could gather logs of a vm getting stuck with debug rbd = 20, debug ms = 1, and debug objectcacher = 30 that would be great We'll do that over the weekend. If you could as well, we'd love the help! [1] http://www.gammacode.com/kvm/wedged-with-timestamps.txt [2] http://www.gammacode.com/kvm/not-wedged.txt Thanks, Mike Dawson Co-Founder Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/2/2013 6:22 AM, Oliver Francke wrote: Well, I believe, I'm the winner of buzzwords-bingo for today. But seriously speaking... as I don't have this particular problem with qcow2 with kernel 3.2 nor qemu-1.2.2 nor newer kernels, I hope I'm not alone here? We have a raising number of tickets from people reinstalling from ISO's with 3.2-kernel. Fast fallback is to start all VM's with qemu-1.2.2, but we then lose some features ala latency-free-RBD-cache ;) I just opened a bug for qemu per: https://bugs.launchpad.net/qemu/+bug/1207686 with all dirty details. Installing a backport-kernel 3.9.x or upgrade Ubuntu-kernel to 3.8.x fixes it. So we have a bad combination for all distros with 3.2-kernel and rbd as storage-backend, I assume. Any similar findings? Any idea of tracing/debugging ( Josh? ;) ) very welcome, Oliver. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
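The debug levels Josh asked for can be staged ahead of the next wedge by setting them on the client side of ceph.conf on the compute node. A sketch — the log file path and the $name/$pid metavariables are assumptions, adjust for your own logging setup:

```ini
[client]
    debug rbd = 20
    debug ms = 1
    debug objectcacher = 30
    log file = /var/log/ceph/client.$name.$pid.log
```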
Re: [ceph-users] Why is my mon store.db is 220GB?
220GB is way, way too big. I suspect your monitors need to go through a successful leveldb compaction. The early releases of Cuttlefish suffered several issues with store.db growing unbounded. Most were fixed by 0.61.5, I believe. You may have luck stopping all Ceph daemons, then starting the monitor by itself. When there were bugs, leveldb compaction tended to work better without OSD traffic hitting the monitors. Also, there are some settings to force a compact on startup like 'mon compact on start = true' and 'mon compact on trim = true'. I don't think either are required anymore, though. See some history here: http://tracker.ceph.com/issues/4895 Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 6:52 PM, Jeppesen, Nelson wrote: My Mon store.db has been at 220GB for a few months now. Why is this and how can I fix it? I have one monitor in this cluster and I suspect that I can’t add monitors to the cluster because it is too big. Thank you.
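For reference, the two compaction settings mentioned live in the [mon] section of ceph.conf; a sketch (as noted above, neither should normally be required on a fixed point release):

```ini
[mon]
    mon compact on start = true
    mon compact on trim = true
```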
Re: [ceph-users] Defective ceph startup script
Greg, You can check the currently running version (and much more) using the admin socket: http://ceph.com/docs/master/rados/operations/monitoring/#using-the-admin-socket For me, this looks like:

# ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok version
{"version":"0.61.7"}
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version
{"version":"0.61.7"}

Also, I use 'service ceph restart' on Ubuntu 13.04 running a mkcephfs deployment. It may be different when using ceph-deploy. Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 7/31/2013 2:51 PM, Greg Chavez wrote: I am running on Ubuntu 13.04. There is something amiss with /etc/init.d/ceph on all of my ceph nodes. I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I realized that service ceph-all restart wasn't actually doing anything. I saw nothing in /var/log/ceph.log - it just kept printing pg statuses - and the PIDs of the osd and mon daemons did not change. Stops failed as well. Then, when I tried to do individual osd restarts like this:

root@kvm-cs-sn-14i:/var/lib/ceph/osd# service ceph -v status osd.10
/etc/init.d/ceph: osd.10 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

Despite the fact that I have this directory: /var/lib/ceph/osd/ceph-10/. I have the same issue with mon restarts:

root@kvm-cs-sn-14i:/var/lib/ceph/mon# ls
ceph-kvm-cs-sn-14i
root@kvm-cs-sn-14i:/var/lib/ceph/mon# service ceph -v status mon.kvm-cs-sn-14i
/etc/init.d/ceph: mon.kvm-cs-sn-14i not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

I'm very worried that I have all my packages at 0.61.7 while my osd and mon daemons could be running as old as 0.61.1! Can anyone help me figure this out? Thanks.
-- \*..+.- --Greg Chavez +//..;};
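To check every daemon on a node in one go, the same admin-socket query can be looped over whatever sockets exist; a sketch assuming the default /var/run/ceph socket directory (on a machine with no running Ceph daemons it just reports that nothing was found):

```shell
# Query the running version of each local Ceph daemon via its admin socket.
found=0
for sock in /var/run/ceph/*.asok; do
    [ -S "$sock" ] || continue   # skips the unexpanded glob when no sockets exist
    found=1
    printf '%s: ' "$sock"
    ceph --admin-daemon "$sock" version
done
[ "$found" -eq 1 ] || echo "no admin sockets found on this node"
```

Comparing the reported versions against the installed package version shows immediately whether a restart actually took effect.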
Re: [ceph-users] Production/Non-production segmentation
Greg, IMO the most critical risks when running Ceph are bugs that affect daemon stability and the upgrade process. Due to the speed of releases in the Ceph project, I feel having separate physical hardware is the safer way to go, especially in light of your mention of an SLA for your production services. A separate non-production cluster will allow you to test and validate new versions (including point releases within a stable series) before you attempt to upgrade your production cluster. Cheers, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 7/31/2013 10:47 AM, Greg Poirier wrote: Does anyone here have multiple clusters or segment their single cluster in such a way as to try to maintain different SLAs for production vs non-production services? We have been toying with the idea of running separate clusters (on the same hardware, but reserve a portion of the OSDs for the production cluster), but I'd rather have a single cluster in order to more evenly distribute load across all of the spindles. Thoughts or observations from people with Ceph in production would be greatly appreciated. Greg
Re: [ceph-users] Production/Non-production segmentation
On 7/31/2013 3:34 PM, Greg Poirier wrote: On Wed, Jul 31, 2013 at 12:19 PM, Mike Dawson mike.daw...@cloudapt.com mailto:mike.daw...@cloudapt.com wrote: Due to the speed of releases in the Ceph project, I feel having separate physical hardware is the safer way to go, especially in light of your mention of an SLA for your production services. Ah. I guess I should offer a little more background as to what I mean by production vs. non-production: customer-facing, and not. That makes more sense. We're using Ceph primarily for volume storage with OpenStack at the moment and operate two OS clusters: one for all of our customer-facing services (which require a higher SLA) and one for all of our internal services. The idea being that all of the customer-facing stuff is segmented physically from anything our developers might be testing internally. What I'm wondering: Does anyone else here do this? Have you looked at Ceph Pools? I think you may find they address many of your concerns while maintaining a single cluster. If so, do you run multiple Ceph clusters? Do you let Ceph sort itself out? Can this be done with a single physical cluster, but multiple logical clusters? Should it be? I know that, mathematically speaking, the larger your Ceph cluster is, the more evenly distributed the load (thanks to CRUSH). I'm wondering if, in practice, RBD can still create hotspots (say from a runaway service with multiple instances and volumes that is suddenly doing a ton of IO). This would increase IO latency across the Ceph cluster, I'd assume, and could impact the performance of customer-facing services. So, to some degree, physical segmentation makes sense to me. But can we simply reserve some OSDs per physical host for a production logical cluster and then use the rest for the development logical cluster (separate MON clusters for each, but all running on the same hardware). Or, given a sufficiently large cluster, is this not even a concern? 
I'm also interested in hearing about experience using CephFS, Swift, and RBD all on a single cluster or if people have chosen to use multiple clusters for these as well. For example, if you need faster volume storage in RBD, so you go for more spindles and smaller disks vs. larger disks with fewer spindles for object storage, which can have a higher allowance for latency than volume storage. See the response from Greg F. from Inktank to a similar question: http://comments.gmane.org/gmane.comp.file-systems.ceph.user/2090 A separate non-production cluster will allow you to test and validate new versions (including point releases within a stable series) before you attempt to upgrade your production cluster. Oh yeah. I'm doing that for sure. Thanks, Greg
Re: [ceph-users] Cinder volume creation issues
You can specify the uuid in the secret.xml file like:

<secret ephemeral='no' private='no'>
  <uuid>bdf77f5d-bf0b-1053-5f56-cd76b32520dc</uuid>
  <usage type='ceph'>
    <name>client.volumes secret</name>
  </usage>
</secret>

Then use that same uuid on all machines in cinder.conf:

rbd_secret_uuid=bdf77f5d-bf0b-1053-5f56-cd76b32520dc

Also, the column you are referring to in the OpenStack Dashboard lists the machine running the Cinder APIs, not specifically the server hosting the storage. Like Greg stated, Ceph stripes the storage across your cluster. Fix your uuids and cinder.conf and you'll be moving in the right direction. Cheers, Mike On 7/26/2013 1:32 PM, johnu wrote: Greg, :) I am not getting where the mistake in the configuration was. virsh secret-define gave different secrets:

sudo virsh secret-define --file secret.xml
(uuid of secret is output here)
sudo virsh secret-set-value --secret {uuid of secret} --base64 $(cat client.volumes.key)

On Fri, Jul 26, 2013 at 10:16 AM, Gregory Farnum g...@inktank.com mailto:g...@inktank.com wrote: On Fri, Jul 26, 2013 at 10:11 AM, johnu johnugeorge...@gmail.com mailto:johnugeorge...@gmail.com wrote: Greg, Yes, the outputs match Nope, they don't. :) You need the secret_uuid to be the same on each node, because OpenStack is generating configuration snippets on one node (which contain these secrets) and then shipping them to another node where they're actually used. Your secrets are also different despite having the same rbd user specified, so that's broken too; not quite sure how you got there...
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

master node:
ceph auth get-key client.volumes
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
virsh secret-get-value bdf77f5d-bf0b-1053-5f56-cd76b32520dc
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
/etc/cinder/cinder.conf:
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
glance_api_version=2
rbd_user=volumes
rbd_secret_uuid=bdf77f5d-bf0b-1053-5f56-cd76b32520dc

slave1 /etc/cinder/cinder.conf:
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
glance_api_version=2
rbd_user=volumes
rbd_secret_uuid=62d0b384-50ad-2e17-15ed-66bfeda40252
virsh secret-get-value 62d0b384-50ad-2e17-15ed-66bfeda40252
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==

slave2 /etc/cinder/cinder.conf:
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
glance_api_version=2
rbd_user=volumes
rbd_secret_uuid=33651ba9-5145-1fda-3e61-df6a5e6051f5
virsh secret-get-value 33651ba9-5145-1fda-3e61-df6a5e6051f5
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==

Yes, Openstack horizon is showing the same host for all volumes. Somehow, if a volume is attached to an instance lying on the same host, it works; otherwise, it doesn't. Might be a coincidence. And I am surprised that no one else has seen or reported this issue. Any idea? On Fri, Jul 26, 2013 at 9:45 AM, Gregory Farnum g...@inktank.com mailto:g...@inktank.com wrote: On Fri, Jul 26, 2013 at 9:35 AM, johnu johnugeorge...@gmail.com mailto:johnugeorge...@gmail.com wrote: Greg, I verified in all cluster nodes that rbd_secret_uuid is the same as in virsh secret-list. And if I do virsh secret-get-value of this uuid, I get back the auth key for client.volumes. What did you mean by same configuration? Did you mean same secret for all compute nodes? If you run virsh secret-get-value with that rbd_secret_uuid on each compute node, does it return the right secret for client.volumes? When we log in as admin, there is a column in the admin panel which gives the 'host' where the volumes lie.
I know that volumes are striped across the cluster, but it gives the same host for all volumes. That is why I got a little confused.

That's not something you can get out of the RBD stack itself; is this something that OpenStack is showing you? I suspect it's just making up information to fit some API expectations, but somebody more familiar with the OpenStack guts can probably chime in. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
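Greg's fix (one shared secret UUID on every compute node) can be sketched end to end. A hedged sketch: the UUID is the example from this thread (generate your own with uuidgen), client.volumes.key holding the Cinder key is an assumption, and the virsh/cinder steps are shown as comments since they need a live libvirt and Ceph setup:

```shell
# One UUID, generated once, reused verbatim on every compute node.
UUID=bdf77f5d-bf0b-1053-5f56-cd76b32520dc   # example from the thread

cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>$UUID</uuid>
  <usage type='ceph'>
    <name>client.volumes secret</name>
  </usage>
</secret>
EOF

# On each compute node (commented; needs libvirt and the client.volumes key):
#   sudo virsh secret-define --file secret.xml
#   sudo virsh secret-set-value --secret $UUID --base64 $(cat client.volumes.key)
# And in /etc/cinder/cinder.conf on every node:
#   rbd_secret_uuid=$UUID

grep "<uuid>" secret.xml
```

The point is that secret-define must consume the same secret.xml (same uuid element) everywhere, rather than letting each node mint its own UUID.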
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
Darryl, I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf. Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... 
health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 20/20 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.c.log --- end dump of recent events --- The contents of this electronic message and any attachments are intended only for the addressee and may contain legally privileged, personal, sensitive or confidential information. If you are not the intended addressee, and have received this email, any transmission, distribution, downloading, printing or photocopying of the contents of this message or attachments is strictly prohibited. Any legal privilege or confidentiality attached to this message and attachments is not waived, lost or destroyed by reason of delivery to any person other than intended addressee. If you have received this message and are not the intended addressee you should notify the sender by return email and destroy all copies of the message and any attachments. Unless expressly
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
I've typically moved it off to a non-conflicting path in lieu of deleting it outright, but either way should work. IIRC, I used something like:

sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak
sudo mkdir /var/lib/ceph/mon/ceph-c

- Mike

On 6/25/2013 11:08 PM, Darryl Bond wrote: Thanks for your prompt response. Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated, should I delete its contents after removing the monitor and before re-adding it? Darryl On 06/26/13 12:50, Mike Dawson wrote: Darryl, I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf.
Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.
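Condensing the exchange above into one place, a hedged sketch of the remove/re-add workaround. The directory shuffle is dry-run in a scratch location here so it is safe to execute (on a real node the base is /var/lib/ceph/mon, and the moves need sudo); the cluster-facing commands are commented because they need a live quorum:

```shell
# Scratch stand-in for /var/lib/ceph/mon so the directory shuffle is safe to run.
BASE=$(mktemp -d)
MON=c
STORE="$BASE/ceph-$MON"
mkdir -p "$STORE" && touch "$STORE/store.db"   # stand-in for the old mon store

#   service ceph stop mon.$MON
#   ceph mon remove $MON                # drop mon.c from the monmap
mv "$STORE" "$STORE-bak"                # keep the old store as a fallback
mkdir -p "$STORE"                       # fresh, empty data dir
#   ceph mon getmap -o /tmp/monmap
#   ceph auth get mon. -o /tmp/mon.keyring
#   ceph-mon -i $MON --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
#   service ceph start mon.$MON         # mon.c rejoins and syncs from a,b

ls "$BASE"
```

Keeping the -bak copy around, as Mike suggests, costs only disk space and gives you something to fall back on if the re-add goes sideways.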
Re: [ceph-users] Multi Rack Reference architecture
Behind a registration form, but iirc, this is likely what you are looking for: http://www.inktank.com/resource/dreamcompute-architecture-blueprint/ - Mike On 5/31/2013 3:26 AM, Gandalf Corvotempesta wrote: In reference architecture PDF, downloadable from your website, there was some reference to a multi rack architecture described in another doc. Is this paper available ?
Re: [ceph-users] mon IO usage
Sylvain, I can confirm I see a similar traffic pattern. Any time I have lots of writes going to my cluster (like heavy writes from RBD or remapping/backfilling after losing an OSD), I see all sorts of monitor issues. If my monitor leveldb store.db directories grow past some unknown point (maybe ~1GB or so), 'compact on trim' is insufficiently slow. The store.db grows faster than compact can trim the garbage. After that point, the only hope to rein in the store.db size is to stop the OSDs and get leveldb to compact without any ongoing writes. I sent Sage and Joao a transaction dump of the growth yesterday. Sage looked, but the files are so large it is tough to get useful info. http://tracker.ceph.com/issues/4895 I believe this issue has existed since 0.48. - Mike On 5/21/2013 8:16 AM, Sylvain Munaut wrote: Hi, I've just added some monitoring to the IO usage of mon (trying to track down that growing mon issue), and I'm kind of surprised by the amount of IO generated by the monitor process. I get continuous 4 Mo/s / 75 iops with added big spikes at each compaction every 3 min or so. Is there a description somewhere of what the monitor does exactly ? I mean the monmap / pgmap / osdmap / mdsmap / election epoch don't change that often (pgmap is like 1 per second and that's the fastest change by several orders of magnitude). So what exactly does the monitor do with all that IO ??? Cheers, Sylvain ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
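For reference, the compaction behavior discussed above is tunable on the monitor side. A hedged ceph.conf fragment, assuming the cuttlefish-era option names:

```ini
[mon]
    # assumption: option names as spelled in cuttlefish-era releases
    # compact the whole leveldb store every time the monitor starts
    mon compact on start = true
    # compact trimmed prefixes as old paxos states are discarded
    mon compact on trim = true
```

If memory serves, builds of this vintage also accept an online compaction command (ceph tell mon.{id} compact), which avoids a daemon restart; treat that as an assumption and check it against your version.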
Re: [ceph-users] Running Ceph issues: HEALTH_WARN, unknown auth protocol, others
Wyatt, A few notes:

- Yes, the second "host = ceph" under mon.a is redundant and should be deleted.
- "auth client required = cephx [osd]" should be simply "auth client required = cephx"; the stray "[osd]" fused onto that line is what triggers the "unknown auth protocol" warning, and it should start its own section.
- Looks like you only have one OSD. You need at least as many (and hopefully more) OSDs than the highest replication level of your pools.

Mike

On 5/1/2013 12:23 PM, Wyatt Gorman wrote: Here is my ceph.conf. I just figured out that the second "host =" isn't necessary, though it is like that on the 5-minute quick start guide... (Perhaps I'll submit my couple of fixes that I've had to implement so far). That fixes the redefined host issue, but none of the others.

[global]
    # For version 0.55 and beyond, you must explicitly enable or
    # disable authentication with auth entries in [global].
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx

[osd]
    osd journal size = 1000
    # The following assumes ext4 filesystem.
    filestore xattr use omap = true
    # For Bobtail (v 0.56) and subsequent versions, you may add
    # settings for mkcephfs so that it will create and mount the file
    # system on a particular OSD for you. Remove the comment `#`
    # character for the following settings and replace the values in
    # braces with appropriate values, or leave the following settings
    # commented out to accept the default values. You must specify
    # the --mkfs option with mkcephfs in order for the deployment
    # script to utilize the following settings, and you must define
    # the 'devs' option for each osd instance; see below.
    # osd mkfs type = {fs-type}
    # osd mkfs options {fs-type} = {mkfs options}   # default for xfs is -f
    # osd mount options {fs-type} = {mount options} # default mount option is rw,noatime
    # For example, for ext4, the mount option might look like this:
    # osd mkfs options ext4 = user_xattr,rw,noatime

    # Execute $ hostname to retrieve the name of your host, and
    # replace {hostname} with the name of your host. For the
    # monitor, replace {ip-address} with the IP address of your
    # host.

[mon.a]
    host = ceph
    mon addr = 10.81.2.100:6789

[osd.0]
    host = ceph
    # For Bobtail (v 0.56) and subsequent versions, you may add
    # settings for mkcephfs so that it will create and mount the
    # file system on a particular OSD for you. Remove the comment
    # `#` character for the following setting for each OSD and
    # specify a path to the device if you use mkcephfs with the
    # --mkfs option.
    # devs = {path-to-device}

[osd.1]
    host = ceph
    # devs = {path-to-device}

[mds.a]
    host = ceph

On Wed, May 1, 2013 at 12:14 PM, Mike Dawson mike.daw...@scholarstack.com wrote: Wyatt, Please post your ceph.conf. - mike On 5/1/2013 12:06 PM, Wyatt Gorman wrote: Hi everyone, I'm setting up a test ceph cluster and am having trouble getting it running (great for testing, huh?). I went through the installation on Debian squeeze, had to modify the mkcephfs script a bit because it calls monmaptool with too many parameters in the $args variable (mine had --add a [ip address]:[port] [osd1] and I had to get rid of the [osd1] part for the monmaptool command to take it). Anyway, so I got it installed, started the service, waited a little while for it to build the fs, and ran ceph health and got (and am still getting after a day and a reboot) the following error: (note: I have also been getting the first line in various calls, unsure why it is complaining, I followed the instructions...)

warning: line 34: 'host' in section 'mon.a' redefined
2013-05-01 12:04:39.801102 b733b710 -1 WARNING: unknown auth protocol defined: [osd]
HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42 degraded (50.000%)

Can anybody tell me the root of this issue, and how I can fix it? Thank you!
- Wyatt Gorman
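Putting Mike's notes into practice, the cleaned-up conf would look something like this (a sketch of only the affected sections, not a verified configuration):

```ini
[global]
    auth cluster required = cephx
    auth service required = cephx
    # the stray "[osd]" fused onto this line now starts its own section
    auth client required = cephx

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[mon.a]
    # exactly one "host =" line per section
    host = ceph
    mon addr = 10.81.2.100:6789

[osd.0]
    host = ceph

[osd.1]
    host = ceph

[mds.a]
    host = ceph
```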
Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out
Sage, I confirm this issue. The requested info is listed below. *Note that due to the pre-Cuttlefish monitor sync issues, this deployment has been running three monitors (mon.b and mon.c working properly in quorum. mon.a stuck forever synchronizing). For the past two hours, no OSD processes have been running on any host, yet some OSDs are still marked as up. http://www.gammacode.com/upload/ceph-osd-tree The mon* sections of ceph.conf are: [mon] debug mon = 20 debug paxos = 20 debug ms = 1 [mon.a] host = node2 mon addr = 10.1.0.3:6789 [mon.b] host = node26 mon addr = 10.1.0.67:6789 [mon.c] host = node49 mon addr = 10.1.0.130:6789 root@controller1:~# ceph -s health HEALTH_WARN 43 pgs degraded; 13308 pgs peering; 27932 pgs stale; 13308 pgs stuck inactive; 27932 pgs stuck stale; 13582 pgs stuck unclean; recovery 7264/7986546 degraded (0.091%); 47/66 in osds are down; 1 mons down, quorum 1,2 b,c monmap e1: 3 mons at {a=10.1.0.3:6789/0,b=10.1.0.67:6789/0,c=10.1.0.130:6789/0}, election epoch 1428, quorum 1,2 b,c osdmap e1323: 66 osds: 19 up, 66 in pgmap v427324: 28864 pgs: 257 active+clean, 231 stale+active, 15025 stale+active+clean, 675 peering, 12633 stale+peering, 43 stale+active+degraded; 448 GB data, 1402 GB used, 178 TB / 180 TB avail; 7264/7986546 degraded (0.091%) mdsmap e1: 0/0/1 up For reference, this is ceph version 0.60-666-ga5cade1 (a5cade1fe7338602fb2bbfa867433d825f337c87) from gitbuilder. Thanks, Mike On 4/25/2013 12:17 PM, Sage Weil wrote: On Thu, 25 Apr 2013, Martin Mailand wrote: Hi, if I shutdown an OSD, the OSD gets marked down after 20 seconds, after 300 seconds the osd should get marked out, an the cluster should resync. But that doesn't happened, the OSD stays in the status down/in forever, therefore the cluster stays forever degraded. I can reproduce it with a new installed cluster. If I manually set the osd out (ceph osd out 1), the cluster resync starts immediately. 
I think thats a release critical bug, because the cluster health is not automatically recovered. What is the output from 'ceph osd tree' and the contents of your [mon*] sections of ceph.conf? Thanks! sage And I reported this behavior a while ago http://article.gmane.org/gmane.comp.file-systems.ceph.user/603/ -martin Log: root@store1:~# ceph -s health HEALTH_OK monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e204: 24 osds: 24 up, 24 in pgmap v106709: 5056 pgs: 5056 active+clean; 526 GB data, 1068 GB used, 173 TB / 174 TB avail mdsmap e1: 0/0/1 up root@store1:~# ceph --version ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c) root@store1:~# /etc/init.d/ceph stop osd.1 === osd.1 === Stopping Ceph osd.1 on store1...bash: warning: setlocale: LC_ALL: cannot change locale (en_GB.utf8) kill 5492...done root@store1:~# ceph -s health HEALTH_OK monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e204: 24 osds: 24 up, 24 in pgmap v106709: 5056 pgs: 5056 active+clean; 526 GB data, 1068 GB used, 173 TB / 174 TB avail mdsmap e1: 0/0/1 up root@store1:~# date -R Thu, 25 Apr 2013 13:09:54 +0200 root@store1:~# ceph -s date -R health HEALTH_WARN 423 pgs degraded; 423 pgs stuck unclean; recovery 10999/269486 degraded (4.081%); 1/24 in osds are down monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e206: 24 osds: 23 up, 24 in pgmap v106715: 5056 pgs: 4633 active+clean, 423 active+degraded; 526 GB data, 1068 GB used, 173 TB / 174 TB avail; 10999/269486 degraded (4.081%) mdsmap e1: 0/0/1 up Thu, 25 Apr 2013 13:10:14 +0200 root@store1:~# ceph -s date -R health HEALTH_WARN 423 pgs degraded; 423 pgs stuck unclean; recovery 10999/269486 degraded (4.081%); 1/24 in osds are down monmap e1: 3 mons at 
{a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e206: 24 osds: 23 up, 24 in pgmap v106719: 5056 pgs: 4633 active+clean, 423 active+degraded; 526 GB data, 1068 GB used, 173 TB / 174 TB avail; 10999/269486 degraded (4.081%) mdsmap e1: 0/0/1 up Thu, 25 Apr 2013 13:23:01 +0200 On 25.04.2013 01:46, Sage Weil wrote: Hi everyone- We are down to a handful of urgent bugs (3!) and a cuttlefish release date that is less than a week away. Thank you to everyone who has been involved in coding, testing, and stabilizing this release. We are close! If you would like to test the current release candidate, your efforts would be much appreciated! For deb systems, you can do wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' | sudo apt-key add - echo deb
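For reference, the mark-out behavior Martin expects is driven by a mon-side timer. A hedged ceph.conf fragment, assuming the cuttlefish-era option name and default:

```ini
[mon]
    # default 300 s: how long an OSD may stay down before it is
    # automatically marked out (0 would disable auto mark-out)
    mon osd down out interval = 300
```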
Re: [ceph-users] Crushmap doesn't match osd tree
Mike, I use a process like:

crushtool -c new-crushmap.txt -o new-crushmap
ceph osd setcrushmap -i new-crushmap

I did not attempt to validate your crush map. If that command fails, I would scrutinize your crushmap for validity/correctness. Once you have the new crushmap injected, you can do something like:

ceph osd crush move ec02sv35 root=default datacenter=site-hd room=room-CR3.11391 rack=rack-9.41933-pehdpw09a

- Mike

On 4/25/2013 6:11 AM, Mike Bryant wrote: Hi, On version 0.56.4, I'm having a problem with my crush map. The output of osd tree is:

# id    weight  type name    up/down  reweight
0       0       osd.0        up       1
1       0       osd.1        up       1
2       0       osd.2        up       1
3       0       osd.3        up       1
4       0       osd.4        up       1
5       0       osd.5        up       1

But there are buckets set in the crush map (attached). How can I fix this? Editing the crush map and doing setcrushmap doesn't appear to change anything. Cheers Mike
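The compile/inject cycle above can be made concrete. A hedged sketch: the toy decompiled map below uses the old (bobtail/cuttlefish-era) text format with made-up IDs and only one host, and the cluster-facing commands are commented since they need crushtool and a live cluster; note the nonzero item weights, since CRUSH places no data on weight-0 items:

```shell
# A minimal decompiled crushmap, in the text format crushtool -c expects.
cat > crushmap.txt <<'EOF'
# devices
device 0 osd.0
# types
type 0 osd
type 1 host
type 2 root
# buckets
host ec02sv35 {
        id -2
        alg straw
        hash 0
        item osd.0 weight 1.000
}
root default {
        id -1
        alg straw
        hash 0
        item ec02sv35 weight 1.000
}
# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
EOF

#   ceph osd getcrushmap -o crushmap.bin       # fetch the live map
#   crushtool -d crushmap.bin -o crushmap.txt  # decompile, edit, then:
#   crushtool -c crushmap.txt -o crushmap.new  # compile; errors out on a bad map
#   ceph osd setcrushmap -i crushmap.new       # inject into the cluster

grep -c '^root default' crushmap.txt
```

Running the decompile/compile round trip before injecting is a cheap sanity check: if crushtool -c rejects the text, setcrushmap would silently get you nowhere.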
Re: [ceph-users] Monitor Access Denied message to itself?
Greg, Looks like Sage has a fix for this problem. In case it matters, I have seen a few cases that conflict with your notes in this thread and the bug report. I have seen the bug exclusively on new Ceph installs (without upgrading from bobtail), so it is not isolated to upgrades. Further, I have seen it on test deployments with a single monitor, so it doesn't seem to be limited to deployments with a leader and followers. Thanks for getting this bug moving forward. Thanks, Mike On 4/18/2013 6:23 PM, Gregory Farnum wrote: There's a little bit of python called ceph-create-keys, which is invoked by the upstart scripts. You can kill the running processes, and edit them out of the scripts, without direct harm. (Their purpose is to create some standard keys which the newer deployment tools rely on to do things like create OSDs, etc.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Apr 18, 2013 at 3:20 PM, Matthew Roy imjustmatt...@gmail.com wrote: On 04/18/2013 06:03 PM, Joao Eduardo Luis wrote: There's definitely some command messages being forwarded, but AFAICT they're being forwarded to the monitor, not by the monitor, which by itself is a good omen towards the monitor being the leader :-) In any case, nothing in the trace's code path indicates we could be a peon, unless the monitor itself believed to be the leader. If you take a closer look, you'll see that we come from 'handle_last()', which is bound to happen only on the leader (we'll assert otherwise). For the monitor to be receiving these messages it must mean the peons believe him to be the leader -- or we have so many bugs going around that it's just madness! In all seriousness, when I was chasing after this bug, Matthew sent me his logs with higher debug levels -- no craziness going around :-) -Joao Is there a way to tell who's being denied? Even if it's just log pollution I'd like to know which client is misconfigured.
There are similar messages in all the mon logs: mon.a: 2013-04-18 18:16:51.254378 7fc7c6d10700 1 -- [2001:470:8:dd9::20]:6789/0 -- [2001:470:8:dd9::21]:6789/0 -- route(mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v775211) v1 tid 8867608) v2 -- ?+0 0x7fc61a18b160 con 0x253f700 mon.b: 2013-04-18 18:16:49.670758 7f37c7afa700 20 -- [2001:470:8:dd9::21]:6789/0 [2001:470:8:dd9::21]:0/22372 pipe(0x7f383c070b70 sd=90 :6789 s=2 pgs=1 cs=1 l=1).writer encoding 7 0x7f37f49876a0 mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v775209) v1 (mon.c was removed since the first log file in the thread) mon.d: 2013-04-18 18:16:51.304897 7f927d40f700 1 -- [2001:470:8:dd9:7271:bcff:febd:e398]:6789/0 -- client.? [2001:470:8:dd9::21]:0/26333 -- mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v775211) v1 -- ?+0 0x7f923c0230a0 The spacing on these messages is about 0.001s so there's a lot of them going around. All these systems are running 0.60-472-g327002e Matthew -- Matthew ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Monitor Access Denied message to itself?
Matthew, I have seen the same behavior on 0.59. Ran through some troubleshooting with Dan and Joao on March 21st and 22nd, but I haven't looked at it since then. If you look at running processes, I believe you'll see an instance of ceph-create-keys start each time you start a Monitor. So, if you restart the monitor several times, you'll have several ceph-create-keys processes piling up, essentially leaking processes. IIRC, the tmp files you see in /etc/ceph correspond with the ceph-create-keys PID. Can you confirm that's what you are seeing? I haven't looked in a couple weeks, but I hope to start 0.60 later today. - Mike On 4/8/2013 12:43 AM, Matthew Roy wrote: I'm seeing weird messages in my monitor logs that don't correlate to admin activity: 2013-04-07 22:54:11.528871 7f2e9e6c8700 1 -- [2001:something::20]:6789/0 -- [2001:something::20]:0/1920 -- mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v134192) v1 -- ?+0 0x37bfc00 con 0x3716840 It's also writing out a bunch of empty files along the lines of ceph.client.admin.keyring.1008.tmp in /etc/ceph/ Could this be related to the mon running "Starting ceph-create-keys" when starting? This could be the cause of, or just associated with, some general instability of the monitor cluster.
After increasing the logging level I did catch one crash: ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c) 1: /usr/bin/ceph-mon() [0x5834fa] 2: (()+0xfcb0) [0x7f4b03328cb0] 3: (gsignal()+0x35) [0x7f4b01efe425] 4: (abort()+0x17b) [0x7f4b01f01b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f4b0285069d] 6: (()+0xb5846) [0x7f4b0284e846] 7: (()+0xb5873) [0x7f4b0284e873] 8: (()+0xb596e) [0x7f4b0284e96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x636c8f] 10: (PaxosService::propose_pending()+0x46d) [0x4dee3d] 11: (MDSMonitor::tick()+0x1c62) [0x51cdd2] 12: (MDSMonitor::on_active()+0x1a) [0x512ada] 13: (PaxosService::_active()+0x31d) [0x4e067d] 14: (Context::complete(int)+0xa) [0x4b7b4a] 15: (finish_contexts(CephContext*, std::listContext*, std::allocatorContext* , int)+0x95) [0x4ba5a5] 16: (Paxos::handle_last(MMonPaxos*)+0xbef) [0x4da92f] 17: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4dad8b] 18: (Monitor::_ms_dispatch(Message*)+0x149f) [0x4b310f] 19: (Monitor::ms_dispatch(Message*)+0x32) [0x4c9d12] 20: (DispatchQueue::entry()+0x341) [0x698da1] 21: (DispatchQueue::DispatchThread::entry()+0xd) [0x626c5d] 22: (()+0x7e9a) [0x7f4b03320e9a] 23: (clone()+0x6d) [0x7f4b01fbbcbd] The complete log is at: http://goo.gl/UmNs3 Does anyone recognize what's going on? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
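A quick way to check for the leaked processes described above (a sketch; pgrep from procps and the tmp-file path from Matthew's report are the assumptions):

```shell
# Count ceph-create-keys processes; one can pile up per monitor restart,
# per the thread. A count of 0 is healthy.
COUNT=$(pgrep -fc ceph-create-keys || true)
COUNT=${COUNT:-0}
echo "ceph-create-keys processes: $COUNT"

# Cleanup (commented; needs root):
#   sudo pkill -f ceph-create-keys
#   sudo rm -f /etc/ceph/ceph.client.admin.keyring.*.tmp
```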