[ceph-users] resolved - unusual growth in cluster after replacing journal SSDs

2018-02-06 Thread Jogi Hofmüller
Dear all,

we finally found the reason for the unexpected growth in our cluster. 
The data was created by a collectd plugin [1] that measures latency by
running rados bench once a minute.  Since our cluster was stressed out
for a while, removing the objects created by rados bench failed.  We
completely overlooked the log messages that should have given us the
hint a lot earlier.  e.g.:

Jan 18 23:26:09 ceph1 ceph-osd: 2018-01-18 23:26:09.931638
7f963389f700  0 -- IP:6802/1986 submit_message osd_op_reply(374
benchmark_data_ceph3_31746_object158 [delete] v21240'22867646
uv22867646 ack = 0) v7 remote, IP:0/3091801967, failed lossy con,
dropping message 0x7f96672a6680

Over time we "collected" some 1.5TB of benchmark data :(
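For anyone who ends up in the same situation, a rough sketch of how the
leftovers can be removed with plain rados commands (the pool name is just a
placeholder; newer rados versions also have a cleanup subcommand for bench
data):

  POOL=our-pool                        # placeholder: the pool rados bench wrote to
  rados -p "$POOL" ls | grep '^benchmark_data_' | while read -r obj; do
      rados -p "$POOL" rm "$obj"       # delete each leftover bench object
  done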

Furthermore, due to a misunderstanding, the collectd plugin that runs the
benchmarks was active on two machines, doubling the stress on the
cluster.

And finally we created benchmark data in our main production pool,
which also was a bad idea.

Hope this info will be useful for someone :)

[1]  https://github.com/rochaporto/collectd-ceph

Cheers,
-- 
J.Hofmüller
We are all idiots with deadlines.
- Mike West




Re: [ceph-users] unusual growth in cluster after replacing journal SSDs

2017-11-16 Thread Jogi Hofmüller
Hi,

On Thursday, 16.11.2017 at 13:44 +0100, Burkhard Linke wrote:

> > What remains is the growth of used data in the cluster.
> > 
> > I put background information of our cluster and some graphs of
> > different metrics on a wiki page:
> > 
> >    https://wiki.mur.at/Dokumentation/CephCluster
> > 
> > Basically we need to reduce the growth in the cluster, but since we
> > are
> > not sure what causes it we don't have an idea.
> 
> Just a wild guess (wiki page is not accessible yet):

Oh damn, sorry! Fixed that.  The wiki page is accessible now.

> Are you sure that the journals were created on the new SSD? If the 
> journals were created as files in the OSD directory, their size might
> be accounted for in the cluster size report (assuming OSDs are
> reporting their free space, not a sum of all object sizes).

Yes, I am sure.  Just checked and all the journal links point to the
correct devices.  See OSD 5 as an example:

ls -l /var/lib/ceph/osd/ceph-5
total 64
-rw-r--r--   1 root root   481 Mar 30  2017 activate.monmap
-rw-r--r--   1 ceph ceph     3 Mar 30  2017 active
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 ceph_fsid
drwxr-xr-x 342 ceph ceph 12288 Apr  6  2017 current
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 fsid
lrwxrwxrwx   1 root root    58 Oct 17 14:43 journal -> /dev/disk/by-partuuid/f04832e3-2f09-460e-806f-4a6fe7aa1425
-rw-r--r--   1 ceph ceph    37 Oct 25 11:12 journal_uuid
-rw-------   1 ceph ceph    56 Mar 30  2017 keyring
-rw-r--r--   1 ceph ceph    21 Mar 30  2017 magic
-rw-r--r--   1 ceph ceph     6 Mar 30  2017 ready
-rw-r--r--   1 ceph ceph     4 Mar 30  2017 store_version
-rw-r--r--   1 ceph ceph    53 Mar 30  2017 superblock
-rw-r--r--   1 ceph ceph     0 Nov  7 11:45 systemd
-rw-r--r--   1 ceph ceph    10 Mar 30  2017 type
-rw-r--r--   1 ceph ceph     2 Mar 30  2017 whoami

Regards,
-- 
J.Hofmüller

   Nisiti
   - Abie Nathan, 1927-2008





[ceph-users] unusual growth in cluster after replacing journal SSDs

2017-11-16 Thread Jogi Hofmüller
Dear all,

for about a month we have been experiencing something strange in our small
cluster.  Let me first describe what happened along the way.

On Oct 4th smartmon told us that the journal SSD in one of our two
ceph nodes was about to fail.  Since getting replacements took way longer
than expected, we decided to place the journal on a spare HDD rather than
have the SSD fail and leave us in an uncertain state.

On Oct 17th we finally got the replacement SSDs.  First we replaced the
broken/soon-to-be-broken SSD and moved the journals from the temporarily
used HDD to the new SSD.  Then we also replaced the journal SSD on the
other ceph node, since it would probably fail sooner rather than later.

We performed all operations by setting noout first, then taking down
the OSDs, flushing journals, replacing disks, creating new journals and
starting OSDs again.  We waited until the cluster was back in HEALTH_OK
state before we proceeded to the next node.
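For the record, per OSD this boiled down to roughly the following (a sketch
only; OSD id 5, systemd units and filestore journals assumed):

  ceph osd set noout
  systemctl stop ceph-osd@5
  ceph-osd -i 5 --flush-journal
  # replace the SSD, recreate the journal partition and point the
  # 'journal' symlink in /var/lib/ceph/osd/ceph-5 at the new partition
  ceph-osd -i 5 --mkjournal
  systemctl start ceph-osd@5
  ceph osd unset noout      # once all OSDs of the node are back up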

AFAIR mkjournal crashed once on the second node.  So we ran the command
again and the journals were created.

The next morning at 6:25 (the time cron.daily jobs run on Debian
systems) we registered almost 2000 slow requests.  We have had slow
requests before, but never more than 900 per day, and that was rare.

Another odd thing we noticed is that the cluster had grown overnight
by 50GB!  We currently run 12 vservers from ceph images and none of them
is really busy.  Usually used data would grow by 2GB per week or
less.  Network traffic between our three monitors roughly doubled
at the same time and has stayed at that level until now.

We eventually got rid of all the slow requests by removing all but one
snapshot per image.  We used to take nightly snapshots of all images
and keep 14 snapshots per image.

Now we take one snapshot per image per night, use export-diff and
offload the diff to storage outside of ceph and remove the nightly
snapshot right away.  The only snapshot we keep is the one that the
diffs are based on.
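In shell terms the nightly job now does roughly this per image (a sketch;
pool/image names, the base snapshot name and the backup target are
placeholders):

  IMG=rbd/vm-disk                      # placeholder
  BASE=base                            # the snapshot the diffs are based on
  SNAP=nightly-$(date +%F)

  rbd snap create "$IMG@$SNAP"
  rbd export-diff --from-snap "$BASE" "$IMG@$SNAP" - \
      | ssh backup-host "cat > /backup/${IMG##*/}-$SNAP.diff"
  rbd snap rm "$IMG@$SNAP"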

What remains is the growth of used data in the cluster.

I put background information of our cluster and some graphs of
different metrics on a wiki page:

  https://wiki.mur.at/Dokumentation/CephCluster

Basically we need to reduce the growth in the cluster, but since we are
not sure what causes it, we have no idea where to start.

So the main question I have is: what went wrong when we replaced the
journal disks?  And of course: how can we fix it?

As always, any hint appreciated!

Regards,
-- 
J.Hofmüller

   Ich zitiere wie Espenlaub.
   - https://twitter.com/TheGurkenkaiser/status/463444397678690304




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread Jogi Hofmüller
Hi,

On Tuesday, 18.04.2017 at 18:34, Peter Maloney wrote:

> The 'slower with every snapshot even after CoW totally flattens it'
> issue I just find easy to test, and I didn't test it on hammer or
> earlier, and others confirmed it, but didn't keep track of the
> versions. Just make an rbd image, map it (probably... but my tests
> were with qemu librbd), do fio randwrite tests with sync and direct
> on the device (no need for a fs, or anything), and then make a few
> snaps and watch it go way slower. 
> 
> How about we make this thread a collection of versions then. And I'll
> redo my test on Thursday maybe.

I did some tests now and provide the results and observations here:

This is the fio config file I used:


[global]
ioengine=rbd
clientname=admin
pool=images
rbdname=benchmark
invalidate=0    # mandatory
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32


Results from fio on image 'benchmark' without any snapshots:

rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.16
Starting 1 process
rbd engine: RBD version: 0.1.10
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/3620KB/0KB /s] [0/905/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=14192: Thu Apr 20 13:11:27 2017
  write: io=8192.0MB, bw=1596.2KB/s, iops=399, runt=5252799msec
    slat (usec): min=1, max=6708, avg=173.27, stdev=97.65
    clat (msec): min=9, max=14505, avg=79.97, stdev=456.86
     lat (msec): min=9, max=14505, avg=80.15, stdev=456.86
    clat percentiles (msec):
     |  1.00th=[   26],  5.00th=[   28], 10.00th=[   28], 20.00th=[   30],
     | 30.00th=[   31], 40.00th=[   32], 50.00th=[   33], 60.00th=[   35],
     | 70.00th=[   37], 80.00th=[   39], 90.00th=[   43], 95.00th=[   47],
     | 99.00th=[ 1516], 99.50th=[ 3621], 99.90th=[ 7046], 99.95th=[ 8094],
     | 99.99th=[10159]
    lat (msec) : 10=0.01%, 20=0.29%, 50=96.17%, 100=1.49%, 250=0.31%
    lat (msec) : 500=0.21%, 750=0.15%, 1000=0.14%, 2000=0.38%, >=2000=0.85%
  cpu          : usr=31.95%, sys=58.32%, ctx=5392823, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=2097152/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=8192.0MB, aggrb=1596KB/s, minb=1596KB/s, maxb=1596KB/s, mint=5252799msec, maxt=5252799msec

Disk stats (read/write):
  vdb: ios=6/20, merge=0/29, ticks=76/12168, in_queue=12244, util=0.23%

sudo fio rbd.fio  2023.87s user 3216.33s system 99% cpu 1:27:31.92 total

Now I created three snapshots of image 'benchmark'.  The cluster became
unresponsive (slow requests started to appear) and a new run of fio never
got past 0.0%.

Then I removed all three snapshots.  The cluster became responsive again and
fio started to work like before (I left it running during snapshot removal).

Then I created one snapshot of 'benchmark' while fio was running.  The
cluster became unresponsive after a few minutes, and fio got nothing done
from the moment the snapshot was taken.

Stopped here ;)
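For completeness, Peter's variant of the test on a mapped device (kernel
client) instead of the rbd ioengine would look roughly like this (a sketch;
image name, size and the /dev/rbd0 device name are assumptions):

  rbd create images/benchmark-map --size 10240
  rbd map images/benchmark-map                 # device name may differ
  fio --name=snaptest --filename=/dev/rbd0 --rw=randwrite --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=1 --sync=1 \
      --runtime=300 --time_based
  rbd snap create images/benchmark-map@snap1   # then re-run fio and compare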

Regards,
-- 
J.Hofmüller

   mur.sat -- a space art project
   http://sat.mur.at/




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Jogi Hofmüller
Hi,

On Tuesday, 18.04.2017 at 13:02 +0200, mj wrote:
> 
> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:
> > This might have been true for hammer and older versions of ceph.
> > From
> > what I can tell now, every snapshot taken reduces performance of
> > the
> > entire cluster :(
> 
> Really? Can others confirm this? Is this a 'wellknown fact'?
> (unknown only to us, perhaps...)

I have to add some more/new details now.  We started removing snapshots
for VMs today.  We did this VM by VM and waited some time in between
while monitoring the cluster.

After having removed all snapshots for the third VM, the cluster went
back to a 'normal' state again: no more slow requests.  I/O waits for
VMs are down to acceptable numbers again (<10% peaks, <5% average).

So, either there is one VM/image that irritates the entire cluster or
we reached some kind of threshold or it's something completely
different.

As for the well known fact: Peter Maloney pointed that out in this
thread (mail from last Thursday).

Regards,
-- 
J.Hofmüller

   http://thesix.mur.at/




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Jogi Hofmüller
Hi,

thanks for all you comments so far.

On Thursday, 13.04.2017 at 16:53 +0200, Lionel Bouton wrote:
> Hi,
> 
> On 13/04/2017 at 10:51, Peter Maloney wrote:
> > Ceph snapshots really slow things down.

I can confirm that now :(

> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
> measurable impact on performance... until we tried to remove them. We
> usually have at least one snapshot per VM image, often 3 or 4.

This might have been true for hammer and older versions of ceph. From
what I can tell now, every snapshot taken reduces performance of the
entire cluster :(

So it looks like we were too naive in thinking that snapshots of VMs
done in ceph could be a viable backup solution. Which brings me to the
question, what are others doing for VM backup?

Regards,
-- 
J.Hofmüller

   http://thesix.mur.at/




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Jogi Hofmüller
Dear David,

On Wednesday, 12.04.2017 at 13:46, David Turner wrote:
> I can almost guarantee what you're seeing is PG subfolder splitting. 

Every day there's something new to learn about ceph ;)

> When the subfolders in a PG get X number of objects, it splits into
> 16 subfolders.  Every cluster I manage has blocked requests and OSDs
> that get marked down while this is happening.  To stop the OSDs
> getting marked down, I increase the osd_heartbeat_grace until the
> OSDs no longer mark themselves down during this process.

Thanks for the hint. I adjusted the values accordingly and will monitor
our cluster. This morning there were no troubles at all btw. Still
wondering what caused yesterday's mayhem ...
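For reference, adjusting the grace period looks roughly like this (a sketch;
the value 240 is only an example, not necessarily what we set):

  # at runtime
  ceph tell osd.* injectargs '--osd_heartbeat_grace 240'

  # and persisted in ceph.conf so it survives restarts
  [osd]
  osd heartbeat grace = 240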

Regards,
-- 
J.Hofmüller

   Nisiti
   - Abie Nathan, 1927-2008





[ceph-users] slow requests and short OSD failures in small cluster

2017-04-12 Thread Jogi Hofmüller
Dear all,

we run a small cluster [1] that is exclusively used for virtualisation
(kvm/libvirt). Recently we started to run into performance problems
(slow requests, failing OSDs) for no *obvious* reason (at least not for
us).

We do nightly snapshots of VM images and keep the snapshots for 14
days. Currently we run 8 VMs in the cluster.

At first it looked like the problem was related to snapshotting images
of VMs that were up and running (and, respectively, to deleting those
snapshots after 14 days).  So we changed the procedure to first suspend the
VM and then snapshot its image(s).  Snapshots are made at 4 am.
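Per VM that boils down to something like this (a sketch; the libvirt domain
and rbd image names are placeholders):

  virsh suspend vm-name
  rbd snap create rbd/vm-name-disk@nightly-$(date +%F)
  virsh resume vm-name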

When we removed *all* the old snapshots (the ones taken of running VMs)
the cluster suddenly behaved 'normally' again, but after two days of
creating snapshots (not deleting any) of suspended VMs, the slow
requests started again (although by far not as frequently as before).

This morning we experienced successive failures (e.g. osd.2
IPv4:6800/1621 failed (2 reporters from different host after 49.976472
>= grace 46.444312)) of 4 of our 6 OSDs, resulting in HEALTH_WARN with
up to about 20% of PGs active+undersized+degraded, stale+active+clean
or remapped+peering.  No OSD failure lasted longer than 4 minutes.  After
15 minutes everything was back to normal again.  The noise started at
6:25 am, a time when cron.daily scripts run here.

We have no clue what could have caused this behavior :( There seems to
be no shortage of resources (CPU, RAM, network) that would explain what
happened, but maybe we did not look in the right places. So any hint on
where to look/what to look for would be greatly appreciated :)

[1]  cluster setup

Three nodes: ceph1, ceph2, ceph3

ceph1 and ceph2

1x Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz
32 GB RAM
RAID1 for OS
1x Intel 530 Series SSDs (120GB) for Journals
3x WDC WD2500BUCT-63TWBY0 for OSDs (1TB)
2x Gbit Ethernet bonded (802.3ad) on HP 2920 Stack 

ceph3

virtual machine
1 CPU
4 GB RAM 

Software

Debian GNU/Linux Jessie (8.7)
Kernel 3.16
ceph 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f) 

Ceph Services

3 Monitors: ceph1, ceph2, ceph3

6 OSDs: ceph1 (3), ceph2 (3) 

Regards,
-- 
J.Hofmüller

   Nisiti
   - Abie Nathan, 1927-2008





Re: [ceph-users] solved: ceph-deploy mon create-initial fails on Debian/Jessie

2015-11-25 Thread Jogi Hofmüller
Hi all,

Well, after repeating the procedure a few times I once ran ceph-deploy
forgetkeys and voila, that did it.
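For the archives, the sequence that did it was essentially (a sketch):

  ceph-deploy forgetkeys
  ceph-deploy mon create-initial
  ceph-deploy gatherkeys ceph1    # should now find the bootstrap keyrings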

Sorry for the noise,
-- 
J.Hofmüller

A literary masterpiece is only a dictionary out of order.
  - Jean Cocteau





[ceph-users] ceph-deploy mon create-initial fails on Debian/Jessie

2015-11-25 Thread Jogi Hofmüller
Hi all,

I am reinstalling our test cluster and run into problems when running

  ceph-deploy mon create-initial

It fails stating:

[ceph_deploy.gatherkeys][WARNIN] Unable to find
/var/lib/ceph/bootstrap-osd/ceph.keyring on ceph1
[ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file:
/var/lib/ceph/bootstrap-osd/ceph.keyring on host ceph1


ceph1 is one of the two nodes I use in our test cluster.  After running

  ceph-deploy install ceph1 ceph2

I noticed that the directories under /var/lib/ceph/ stay empty.  So there
really are no keys, although according to people on IRC they should be there.

I basically (as always) followed the instructions from here:

  http://docs.ceph.com/docs/v0.94.5/start/quick-ceph-deploy/

using Debian/Jessie (8.2) systems.  sudo for my ceph user works fine as
does everything up to the point when I run the above mentioned creation
of the initial monitor.

Did I hit a bug?

Cheers,
-- 
j.hofmüller

mur.sat -- a space art project    http://sat.mur.at/





Re: [ceph-users] cant get cluster to become healthy. "stale+undersized+degraded+peered"

2015-09-30 Thread Jogi Hofmüller
Hi Kurt,

On 2015-09-30 at 17:09, Kurt Bauer wrote:

> You have two nodes but repl.size 3 for your test-data pool. With the
> default crushmap this won't work as it tries to replicate on different
> nodes.
> 
> So either change to rep.size 2, or add another node ;-)

Thanks a lot!  I did not set anything specific when creating the pool;
3 is the default, as I now know.  Setting size manually to two worked.

  ceph osd pool set test-data size 2

and I put that in my config too :)
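i.e. something along these lines in ceph.conf, so that new pools default to
two replicas (the min_size value is just an example):

  [global]
  osd_pool_default_size = 2
  osd_pool_default_min_size = 1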

Regards,
-- 
j.hofmüller

We are all idiots with deadlines.   - Mike West





Re: [ceph-users] cant get cluster to become healthy. "stale+undersized+degraded+peered"

2015-09-30 Thread Jogi Hofmüller
Hi,

Some more info:

ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 3.59998 root default
-2 1.7 host ceph1
 0 0.8 osd.0   up  1.0  1.0
 1 0.8 osd.1   up  1.0  1.0
-3 1.7 host ceph2
 2 0.8 osd.2   up  1.0  1.0
 3 0.8 osd.3   up  1.0  1.0


With one pool that contains no objects:

ceph status
cluster 2d766dc4-0705-46f9-b559-664e49e0da5c
 health HEALTH_WARN
128 pgs degraded
128 pgs stuck degraded
128 pgs stuck unclean
128 pgs stuck undersized
128 pgs undersized
 monmap e1: 1 mons at {ceph1=172.16.16.17:6789/0}
election epoch 2, quorum 0 ceph1
 osdmap e22: 4 osds: 4 up, 4 in
  pgmap v45: 128 pgs, 1 pools, 0 bytes data, 0 objects
6768 kB used, 3682 GB / 3686 GB avail
 128 active+undersized+degraded

ceph osd dump
epoch 22
fsid 2d766dc4-0705-46f9-b559-664e49e0da5c
created 2015-09-30 16:09:58.109963
modified 2015-09-30 16:46:00.625417
flags
pool 1 'test-data' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 128 pgp_num 128 last_change 21 flags
hashpspool stripe_width 0
max_osd 4
osd.0 up   in  weight 1 up_from 4 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.17:6800/11953 172.16.16.17:6800/11953
172.16.16.17:6801/11953 PUB.17:6801/11953 exists,up
e384b160-d213-40a4-b3f1-a9146aaa41e1
osd.1 up   in  weight 1 up_from 8 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.17:6802/12839 172.16.16.17:6802/12839
172.16.16.17:6803/12839 PUB.17:6803/12839 exists,up
4c14bda4-3c31-4188-976e-7f59fd717294
osd.2 up   in  weight 1 up_from 12 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.18:6800/6583 172.16.16.18:6800/6583
172.16.16.18:6801/6583 89.106.208.18:6801/6583 exists,up
3dd88154-63b7-476d-b8c2-8a34483eb358
osd.3 up   in  weight 1 up_from 17 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.18:6802/7453 172.16.16.18:6802/7453
172.16.16.18:6803/7453 PUB.18:6803/7453 exists,up
1a96aa8d-c13d-4536-b772-b4189e0069ff

After deleting the pool:

ceph status
cluster 2d766dc4-0705-46f9-b559-664e49e0da5c
 health HEALTH_WARN
too few PGs per OSD (0 < min 30)
 monmap e1: 1 mons at {ceph1=172.16.16.17:6789/0}
election epoch 2, quorum 0 ceph1
 osdmap e23: 4 osds: 4 up, 4 in
  pgmap v48: 0 pgs, 0 pools, 0 bytes data, 0 objects
6780 kB used, 3682 GB / 3686 GB avail
ceph osd dump
epoch 23
fsid 2d766dc4-0705-46f9-b559-664e49e0da5c
created 2015-09-30 16:09:58.109963
modified 2015-09-30 16:56:24.678984
flags
max_osd 4
osd.0 up   in  weight 1 up_from 4 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.17:6800/11953 172.16.16.17:6800/11953
172.16.16.17:6801/11953 PUB.17:6801/11953 exists,up
e384b160-d213-40a4-b3f1-a9146aaa41e1
osd.1 up   in  weight 1 up_from 8 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.17:6802/12839 172.16.16.17:6802/12839
172.16.16.17:6803/12839 89.106.208.17:6803/12839 exists,up
4c14bda4-3c31-4188-976e-7f59fd717294
osd.2 up   in  weight 1 up_from 12 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.18:6800/6583 172.16.16.18:6800/6583
172.16.16.18:6801/6583 PUB.18:6801/6583 exists,up
3dd88154-63b7-476d-b8c2-8a34483eb358
osd.3 up   in  weight 1 up_from 17 up_thru 21 down_at 0
last_clean_interval [0,0) PUB.18:6802/7453 172.16.16.18:6802/7453
172.16.16.18:6803/7453 PUB.18:6803/7453 exists,up
1a96aa8d-c13d-4536-b772-b4189e0069ff

Regards,
-- 
j.hofmüller

Gerüchtegenerator  http://plagi.at/geruecht





Re: [ceph-users] cant get cluster to become healthy. "stale+undersized+degraded+peered"

2015-09-30 Thread Jogi Hofmüller
Hi,

On 2015-09-17 at 19:02, Stefan Eriksson wrote:

> I purged all nodes and did purgedata aswell and restarted, after this
> Everything was fine. You are most certainly right, if anyone else have
> this error, reinitialize the cluster might be the fastest way forward.

Great that it worked for you; it didn't for me.  This is the second
installation of ceph on two nodes with 4 OSDs, and I still oscillate between
your original problem (with a default pool from the installation that I
cannot explain where it came from) and

too few PGs per OSD (0 < min 30)

when I delete the default pool.
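Creating a pool with a sensible number of placement groups makes that
warning go away again, e.g. (the numbers are only an example for 4 OSDs):

  ceph osd pool create test-data 128 128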

I basically followed the procedure described here [1] and made some
modifications to the config before calling 'ceph-deploy install' on my
nodes.  Here is the config I use (fsid and IPs deleted):


[global]
fsid = ID
mon_initial_members = ceph1
mon_host = private-ip
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = public-network
cluster_network = private-network
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 150
osd_pool_default_pgp_num = 150
osd_crush_chooseleaf_type = 1

[osd]
osd_journal_size = 1



[1]  http://docs.ceph.com/docs/master/start/quick-ceph-deploy/

-- 
J.Hofmüller

A literary masterpiece is only a dictionary out of order.
  - Jean Cocteau





Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-30 Thread Jogi Hofmüller
Hi,

On 2015-09-29 at 15:54, Gregory Farnum wrote:

> Can you create a ceph-deploy ticket at tracker.ceph.com, please?
> And maybe make sure you're running the latest ceph-deploy, but
> honestly I've no idea what it's doing these days or if this is a
> resolved issue.

Just filed a bug.

The ceph-deploy version installed here is 1.5.28.  I installed it
according to the docs [1] via apt-get.

FWIW I managed to get ceph installed on Debian Jessie by doing the
following (on each node):

1)  install the repository key manually
2)  set /etc/apt/sources.list.d/ceph.list to read

  deb http://ceph.com/debian-hammer wheezy main

3)  add an entry for wheezy packages in /etc/apt/sources.list
4)  set 'adjust_repos = False' in cephdeploy.conf
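In shell terms that boils down to roughly the following on each node (a
sketch; the key URL is an assumption and may differ):

  wget -qO- https://download.ceph.com/keys/release.asc | sudo apt-key add -
  echo 'deb http://ceph.com/debian-hammer wheezy main' \
      | sudo tee /etc/apt/sources.list.d/ceph.list
  sudo apt-get update
  # plus a wheezy entry in /etc/apt/sources.list for dependencies and
  # 'adjust_repos = False' in cephdeploy.conf on the admin host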

[1]
http://docs.ceph.com/docs/master/start/quick-start-preflight/#ceph-deploy-setup

Regards,
-- 
j.hofmüller

mur.sat -- a space art projecthttp://sat.mur.at/





Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-29 Thread Jogi Hofmüller
Hi,

On 2015-09-25 at 22:23, Udo Lembke wrote:

> you can use this sources-list
> 
> cat /etc/apt/sources.list.d/ceph.list
> deb http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/v0.94.3
> jessie main

The thing is:  whatever I write into ceph.list, ceph-deploy just
overwrites it with "deb http://ceph.com/debian-hammer/ jessie main"
which does not exist :(

Here is what the log says after "ceph-deploy install":

[ceph1][DEBUG ] Err http://ceph.com jessie/main amd64 Packages
[ceph1][DEBUG ]   404  Not Found [IP:
2607:f298:6050:51f3:f816:3eff:fe50:5ec 80]
[ceph1][DEBUG ] Ign http://ceph.com jessie/main Translation-en_US
[ceph1][DEBUG ] Ign http://ceph.com jessie/main Translation-en
[ceph1][WARNIN] W: Failed to fetch
http://ceph.com/debian-hammer/dists/jessie/main/binary-amd64/Packages
404  Not Found [IP: 2607:f298:6050:51f3:f816:3eff:fe50:5ec 80]
[ceph1][WARNIN]
[ceph1][WARNIN] E: Some index files failed to download. They have been
ignored, or old ones used instead.
[ceph1][ERROR ] RuntimeError: command returned non-zero exit status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get
--assume-yes -q update

Advice needed.

Cheers,
-- 
J.Hofmüller

Facts do not disappear just because one ignores them.
  - after Aldous Huxley





Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-25 Thread Jogi Hofmüller
Hi,

On 2015-09-25 at 22:23, Udo Lembke wrote:
> you can use this sources-list
> 
> cat /etc/apt/sources.list.d/ceph.list
> deb http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/v0.94.3
> jessie main

Thanks!  Will test it as soon as I get back to work next week.

Regards,
-- 
j.hofmüller

mur.sat -- a space art project    http://sat.mur.at/





Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-25 Thread Jogi Hofmüller
Hi,

On 2015-09-11 at 13:20, Florent B wrote:

> Jessie repository will be available on next Hammer release ;)

And how should I continue installing ceph meanwhile?  ceph-deploy new ...
overwrites the /etc/apt/sources.list.d/ceph.list and hence throws an
error :(

Any hint appreciated.

Cheers,
-- 
J.Hofmüller

wash your hands and say your prayers
because jesus and germs are everywhere





Re: [ceph-users] new cluster does not reach active+clean

2013-10-03 Thread Jogi Hofmüller
Hi Tyler,

On 2013-10-03 13:22, Tyler Brekke wrote:

> You can add this to your ceph conf to distribute by device rather than node.
> 
> osd crush chooseleaf type = 0

Great!  Thanks for reminding me.  I had that in previous setups but
forgot it this time.

> This information is also available on the docs :)

I am painfully aware of that ;)

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol





Re: [ceph-users] trouble adding OSDs - which documentation to use

2013-10-03 Thread Jogi Hofmüller
Dear all,

This is getting weird now ...

On 2013-10-03 11:18, Jogi Hofmüller wrote:

> root@ceph-server1:~# service ceph start
> === osd.0 ===
> No filesystem type defined!

This message is generated by /etc/init.d/ceph (OK, most of you know that
I guess), which is looking for "osd mkfs type" in ceph.conf.  This is
where it failed for me before adding these lines to ceph.conf:

[osd]
osd mkfs type = xfs

Now, with the correct devs = /dev/sdaX in the corresponding [osdX]
section everything works.
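So the relevant part of ceph.conf now looks something like this (the device
name is only an example; use the actual data partition of that OSD):

  [osd]
  osd mkfs type = xfs

  [osd.0]
  host = ceph-server1
  devs = /dev/sda1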

I have been searching the entire documentation for these two parameters
and did not find much useful explanation or guidance there.

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol





[ceph-users] new cluster does not reach active+clean

2013-10-03 Thread Jogi Hofmüller
Dear all,

Hope I am not getting on everyone's nerves by now ;)

Just started over and created a new cluster:

  one monitor (ceph-mon0)
  one osd-server (ceph-rd0)

After activating the two OSDs on ceph-rd0 the cluster reaches a state
active+degraded and never becomes healthy.  Unfortunately this
particular state is not documented here [1].

Some output:

ceph@ceph-admin:~/cl0$ ceph -w
  cluster 6f1dfb78-e917-4286-a8f0-2e389d295e43
   health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
   monmap e1: 1 mons at {ceph-mon0=192.168.122.56:6789/0}, election
epoch 2, quorum 0 ceph-mon0
   osdmap e8: 2 osds: 2 up, 2 in
pgmap v15: 192 pgs: 192 active+degraded; 0 bytes data, 69924 KB
used, 6053 MB / 6121 MB avail
   mdsmap e1: 0/0/1 up


2013-10-03 13:09:59.99 osd.0 [INF] pg has no unfound objects

ceph@ceph-admin:~/cl0$ ceph health detail
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
pg 0.3f is stuck unclean since forever, current state active+degraded,
last acting [0]
pg 1.3e is stuck unclean since forever, current state active+degraded,
last acting [0]
pg 2.3d is stuck unclean since forever, current state active+degraded,
last acting [0]
(cut some lines)
pg 1.0 is active+degraded, acting [0]
pg 0.1 is active+degraded, acting [0]
pg 2.2 is active+degraded, acting [0]
pg 1.1 is active+degraded, acting [0]
pg 0.0 is active+degraded, acting [0]

Any idea what went wrong here?

[1]  http://eu.ceph.com/docs/wip-3060/ops/manage/failures/osd/

Regards!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol





Re: [ceph-users] trouble adding OSDs - which documentation to use

2013-10-03 Thread Jogi Hofmüller
Seas Wolfgang,

On 2013-10-02 09:01, Wolfgang Hennerbichler wrote:
> On 10/01/2013 05:08 PM, Jogi Hofmüller wrote:

>> Is this [1] outdated?  If not, why are the links to chef-* not
>> working? Is chef-* still recommended/used?
> 
> I believe this is a matter of taste. I can not say if this is
> outdated, but I prefer not to use chef but only ceph-deploy.

Ah, good.  That's what I was thinking somehow.

> Others might have different opinions on that, but I am the
> old-fashioned guy who puts the stuff into his configuration file (like
> bobtail used to be).
> This works for me (ceph.conf):
> 
> [osd.0]
> host = rd-c2
> devs = /dev/sdb
> 
> [osd.1]
> host = rd-c2
> devs = /dev/sdc
> 
> ...
> 
> On startup ceph mounts the disk to /var/lib/ceph/osd/ceph-[OSD-Number]
> and works.

Actually that is what I expected.  Thing is (just reproduced it) that my
OSDs won't start after rebooting the host.  For example:

root@ceph-server1:~# service ceph start
=== osd.0 ===
No filesystem type defined!

This is the relevant part of the config:

[osd.0]
host = ceph-server1
devs = /dev/sdb

And now I see that ceph-deploy disk list run on the admin host tells me:

ceph@ceph-admin:~/mur-cluster$ ceph-deploy disk list ceph-server1
[ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo
[ceph_deploy.osd][INFO  ] Distro info: Debian 7.1 wheezy
[ceph_deploy.osd][DEBUG ] Listing disks on ceph-server1...
[ceph-server1][INFO  ] Running command: ceph-disk list
[ceph-server1][INFO  ] /dev/sda :
[ceph-server1][INFO  ]  /dev/sda1 ceph data, prepared, unknown cluster
b134da22-a3dd-41cb-95c2-fb6a75af8c1f, osd.0, journal /dev/sda2
[ceph-server1][INFO  ]  /dev/sda2 ceph journal, for /dev/sda1
[ceph-server1][INFO  ] /dev/sdb :
[ceph-server1][INFO  ]  /dev/sdb1 ceph data, prepared, unknown cluster
b134da22-a3dd-41cb-95c2-fb6a75af8c1f, osd.1, journal /dev/sdb2
[ceph-server1][INFO  ]  /dev/sdb2 ceph journal, for /dev/sdb1


Which completely fries my brain (unknown cluster
b134da22-a3dd-41cb-95c2-fb6a75af8c1f) ...

Any hint on what went wrong here?  Is the unknown cluster the reason for
the unknown filesystem?

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol





Re: [ceph-users] OSD: Newbie question regarding ceph-deploy odd create

2013-10-01 Thread Jogi Hofmüller
Hi Piers,

On 2013-09-27 22:59, Piers Dawson-Damer wrote:

> I'm trying to setup my first cluster,   (have never manually
> bootstrapped a cluster)

I am about at the same stage here ;)

> Is ceph-deploy odd activate/prepare supposed to write to the master
> ceph.conf file, specific entries for each OSD along the lines
> of http://ceph.com/docs/master/rados/configuration/osd-config-ref/ ?

All I can say is that it does not do so.  Still waiting for an answer to a
similar question I posed yesterday ...

I'll let you know if I get closer to solving these things ;)

Cheers!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol





[ceph-users] trouble adding OSDs - which documentation to use

2013-10-01 Thread Jogi Hofmüller
Dear all,

I am back to managing the cluster before starting to use it even on a
test host.  First of all a question regarding the docs:

Is this [1] outdated?  If not, why are the links to chef-* not working?
 Is chef-* still recommended/used?

After adding a new OSD (with ceph-deploy version 1.2.6) and starting the
daemon after a reboot of the osd-server it complains:

root@ceph-server1:~# service ceph start
=== osd.0 ===
No filesystem type defined!

I could not find anything in the docs on how to specify the fs-type.
How is mounting the data-partition done usually?  It works if I mount it
via an entry in /etc/fstab (or manually) but I would have to edit that
manually.

All this is done using ceph "dumpling" installed/deployed according to
the getting started info from [2].

[1]  http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
[2]  http://ceph.com/docs/master/start/quick-ceph-deploy/

Regards!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol





[ceph-users] authentication trouble

2013-09-26 Thread Jogi Hofmüller
Dear all,

I am fairly new to ceph and just in the process of testing it using
several virtual machines.

Now I tried to create a block device on a client and fumbled with
settings for about an hour or two until the command line

  rbd --id dovecot create home --size=1024

finally succeeded.  The keyring is /etc/ceph/ceph.keyring and I thought
the name [client.dovecot] would be used by rbd.

I would appreciate any hint on how to configure the client.NAME in the
config to ease operation.
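For context, this is the kind of thing I am after (the keyring path is a
guess on my part; adjust it to wherever the key actually lives):

  [client.dovecot]
  keyring = /etc/ceph/ceph.client.dovecot.keyring

  # then e.g.:  rbd --id dovecot create home --size=1024
  # or export CEPH_ARGS="--id dovecot" to avoid typing --id every time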

Regards!
-- 
j.hofmüller

Optimism doesn't alter the laws of physics. - Subcommander T'Pol


