[ceph-users] cephfs modification time

2015-01-09 Thread Lorieri
Hi,

I have a program that tails a file, and this file is created on another machine.

Some tail programs do not work because the modification time is not
updated on the remote machines.

I've found this old thread:
http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/11001

It mentions the problem and suggests an NTP sync.

I tried to re-sync NTP and restart the Ceph cluster, but the issue persists.

Do you know if it is possible to avoid this behavior?

thanks
-lorieri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph as backend for Swift

2015-01-09 Thread Mark Kirkwood
It is not too difficult to get going, once you add various patches so it 
works:


- missing __init__.py
- Allow to set ceph.conf
- Fix write issue: ioctx.write() does not return the written length
- Add param to async_update call (for swift in Juno)

There are a number of forks/pulls etc for these. Mine is here 
(fix_install branch):


https://github.com/markir9/swift-ceph-backend/commits/fix_install
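
If you want to try it, something along these lines should fetch the code with
the accumulated fixes (how you then install it into the environment running
the Swift object server depends on your setup, e.g. python setup.py install):

git clone -b fix_install https://github.com/markir9/swift-ceph-backend.git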


However, do take a look at the thread on the openstack list about this:

http://www.gossamer-threads.com/lists/openstack/dev/43482

In particular the geo replication side of things needs more coding 
before it works.


Cheers

Mark

On 09/01/15 15:51, Sebastien Han wrote:

You can have a look at what I did here with Christian:

* https://github.com/stackforge/swift-ceph-backend
* https://github.com/enovance/swiftceph-ansible

If you have further questions, just let us know.


On 08 Jan 2015, at 15:51, Robert LeBlanc rob...@leblancnet.us wrote:

Anyone have a reference for documentation to get Ceph to be a backend for Swift?

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Cheers.

Sébastien Han
Cloud Architect

Always give 100%. Unless you're giving blood.

Phone: +33 (0)1 49 70 99 72
Mail: sebastien@enovance.com
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon problem after power failure

2015-01-09 Thread Joao Eduardo Luis

On 01/09/2015 04:31 PM, Jeff wrote:

We had a power failure last night and our five node cluster has
two nodes with mon's that fail to start.  Here's what we see:

#  /usr/bin/ceph-mon --cluster=ceph -i ceph2 -f
2015-01-09 11:28:45.579267 b6c10740 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={6=support isa/lrc erasure code}
2015-01-09 11:28:45.606896 b6c10740 -1 error checking features: (1) Operation not permitted

and

# /usr/local/bin/ceph-mon --cluster=ceph -i ceph4 -f
Corruption: 6 missing files; e.g.: /var/lib/ceph/mon/ceph-ceph4/store.db/4011258.ldb
Corruption: 6 missing files; e.g.: /var/lib/ceph/mon/ceph-ceph4/store.db/4011258.ldb
2015-01-09 11:30:32.024445 b6ea1740 -1 failed to create new leveldb store

Does anyone have any suggestions for how to get these two monitors running
again?


Recreate them.  That's the only way I'm aware of, especially considering the
leveldb corruption.
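
For the archives, the recreate procedure goes roughly like this (the monitor
id ceph4 and the temp paths are just examples, adjust to your setup):

ceph mon remove ceph4
mv /var/lib/ceph/mon/ceph-ceph4 /var/lib/ceph/mon/ceph-ceph4.bad
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i ceph4 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
# then start the mon again and check that it rejoins the quorum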


  -Joao


--
Joao Eduardo Luis
Software Engineer | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coded PGs incomplete

2015-01-09 Thread Nick Fisk
Hi Italo,

 

If you check for a post from me from a couple of days back, I have done exactly 
this.

 

I created a k=5 m=3 pool over 4 hosts. This ensured that I could lose a whole host
and then an OSD on another host and the cluster would still be fully operational.

 

I’m not sure if the method I used in the CRUSH map was the best way to achieve
what I did, but it seemed to work.
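
Roughly speaking (from memory, so treat it as a sketch rather than a recipe),
the profile was something like

ceph osd erasure-code-profile set ec53 k=5 m=3

combined with a crush rule for the pool along the lines of

step take default
step choose indep 4 type host
step chooseleaf indep 2 type osd
step emit

so the 8 chunks end up as 2 per host across the 4 hosts.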

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Italo 
Santos
Sent: 08 January 2015 22:35
To: Loic Dachary
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Erasure coded PGs incomplete

 

Thanks for your answer. But another doubt has come up…

 

Suppose I have 4 hosts with an erasure pool created with k=3, m=1 and failure
domain by host, and I lose a host. In this case I'll face the same issue as at the
beginning of this thread, because k+m > number of hosts, right?

 

- In this scenario, with one host less, am I still able to read and write data on the
cluster?

- To solve the issue, will I need to add another host to the cluster?

 

Regards.

 

Italo Santos

 http://italosantos.com.br/

 

On Wednesday, December 17, 2014 at 20:19, Loic Dachary wrote:

 

 

On 17/12/2014 19:46, Italo Santos wrote:

Understood.

Thanks for your help, the cluster is healthy now :D

 

Also, using for example k=6, m=1 and failure domain by host, I'll be able to lose
all OSDs on the same host, but if I lose 2 disks on different hosts I can lose
data, right? So, is it possible to have a failure domain which allows me to lose an
OSD or a host?

 

That's actually a good way to put it :-)

 

 

Regards.

 

*Italo Santos*

http://italosantos.com.br/

 

On Wednesday, December 17, 2014 at 4:27 PM, Loic Dachary wrote:

 

 

 

On 17/12/2014 19:22, Italo Santos wrote:

Loic,

 

So, if I want to have a failure domain by host, I'll need to set up an erasure profile
in which k+m = the total number of hosts I have, right?

 

Yes, k+m has to be <= the number of hosts.

 

 

Regards.

 

*Italo Santos*

http://italosantos.com.br/

 

On Wednesday, December 17, 2014 at 3:24 PM, Loic Dachary wrote:

 

 

 

On 17/12/2014 18:18, Italo Santos wrote:

Hello,

 

I’ve taken a look at this documentation (which helped a lot), and if I understand
right, when I set a profile like:

 

===

ceph osd erasure-code-profile set isilon k=8 m=2 ruleset-failure-domain=host

===

 

And create a pool following the recommendations in the docs, I'll need (100*16)/2 =
800 PGs. Will I need a sufficient number of hosts to support creating all those PGs?

 

You will need k+m = 10 OSDs, one per host. If you only have 10 hosts that should be
ok and the 800 PGs will use these 10 OSDs in various orders. It also means that
you will end up having 800 PGs per OSD, which is a bit too much. If you have 20
OSDs that will be better: each PG will get 10 OSDs out of 20 and each OSD will
have 400 PGs. Ideally you want the number of PGs per OSD to be in the range
(approximately) [20,300].
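
Written out, the arithmetic is simply:

PGs per OSD = number of PGs * (k+m) / number of OSDs
            = 800 * 10 / 10 = 800  (too many)
            = 800 * 10 / 20 = 400  (better, still on the high side)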

 

Cheers

 

 

Regards.

 

*Italo Santos*

http://italosantos.com.br/

 

On Wednesday, December 17, 2014 at 2:42 PM, Loic Dachary wrote:

 

Hi,

 

Thanks for the update: good news is much appreciated :-) Would you have time
to review the documentation at https://github.com/ceph/ceph/pull/3194/files ?
It was partly motivated by the problem you had.

 

Cheers

 

On 17/12/2014 14:03, Italo Santos wrote:

Hello Loic,

 

Thanks for your help. I’ve taken a look at my crush map and replaced “step
chooseleaf indep 0 type osd” with “step choose indep 0 type osd”, and all PGs were
created successfully.
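
(For the archives, the edit is the usual dump / edit / recompile / inject
cycle; the file names below are just examples:)

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: in the erasure rule, change
#   step chooseleaf indep 0 type osd
# to
#   step choose indep 0 type osd
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new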

 

At.

 

*Italo Santos*

http://italosantos.com.br/

 

On Tuesday, December 16, 2014 at 8:39 PM, Loic Dachary wrote:

 

Hi,

 

The 2147483647 means that CRUSH did not find enough OSDs for a given PG. If you
check the crush rule associated with the erasure coded pool, you will most
probably find out why.
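
A couple of commands that usually show where placement gives up (the pg id
and rule number are examples, use your own):

ceph osd crush rule dump
ceph pg 2.9 query
ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 1 --num-rep 8 --show-bad-mappings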

 

Cheers

 

On 16/12/2014 23:32, Italo Santos wrote:

Hello,

 

I'm trying to create an erasure pool following
http://docs.ceph.com/docs/master/rados/operations/erasure-code/, but when I try to
create a pool with a specific erasure-code-profile (myprofile) the PGs end up in an
incomplete state.

 

Can anyone help me?

 

Below the profile I created:

root@ceph0001:~# ceph osd erasure-code-profile get myprofile

directory=/usr/lib/ceph/erasure-code

k=6

m=2

plugin=jerasure

technique=reed_sol_van

 

The status of cluster:

root@ceph0001:~# ceph health

HEALTH_WARN 12 pgs incomplete; 12 pgs stuck inactive; 12 pgs stuck unclean

 

health detail:

root@ceph0001:~# ceph health detail

HEALTH_WARN 12 pgs incomplete; 12 pgs stuck inactive; 12 pgs stuck unclean

pg 2.9 is stuck inactive since forever, current state incomplete, last acting 
[4,10,15,2147483647,3,2147483647,2147483647,2147483647]

pg 2.8 is stuck inactive since forever, current state incomplete, last acting 
[0,2147483647,4,2147483647,10,2147483647,15,2147483647]

pg 2.b is stuck inactive since forever, current state incomplete, last acting 

Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Christian Eichelmann
Hi Lionel,

we have a Ceph cluster with about 1PB in total: 12 OSD servers with 60 disks,
divided into 4 racks in 2 rooms, all connected with a dedicated 10G
cluster network. Of course with a replication level of 3.

We did about 9 months of intensive testing. Just like you, we had never
experienced that kind of problem before, and an incomplete PG would
recover as soon as at least one OSD holding a copy of it came back up.

We still don't know what caused this specific error, but at no point
were there more than two hosts down at the same time. Our pool has a
min_size of 1. And after everything was up again, we had completely LOST
2 of 3 PG copies (the directories on the OSDs were empty) and the third
copy was obviously broken, because even manually injecting this PG into
the other OSDs didn't change anything.

My main problem here is that with even one incomplete PG your pool is
rendered unusable. And there is currently no way to make Ceph forget
about the data of this PG and recreate it as an empty one. So the only way
to make this pool usable again is to lose all your data in there, which
for me is just not acceptable.

Regards,
Christian

Am 07.01.2015 21:10, schrieb Lionel Bouton:
 On 12/30/14 16:36, Nico Schottelius wrote:
 Good evening,

 we also tried to rescue data *from* our old / broken pool by map'ing the
 rbd devices, mounting them on a host and rsync'ing away as much as
 possible.

 However, after some time rsync got completely stuck and eventually the
 host which mounted the rbd mapped devices decided to kernel panic at
 which time we decided to drop the pool and go with a backup.

 This story and the one of Christian makes me wonder:

 Is anyone using ceph as a backend for qemu VM images in production?
 
 Yes with Ceph 0.80.5 since September after extensive testing over
 several months (including an earlier version IIRC) and some hardware
 failure simulations. We plan to upgrade one storage host and one monitor
 to 0.80.7 to validate this version over several months too before
 migrating the others.
 

 And:

 Has anyone on the list been able to recover from a pg incomplete /
 stuck situation like ours?
 
 Only by adding back an OSD with the data needed to reach min_size for
 said pg, which is expected behavior. Even with some experimentation
 with isolated unstable OSDs I've not yet witnessed a case where Ceph
 lost multiple replicas simultaneously (we lost one OSD to disk failure
 and another to a BTRFS bug, but without trying to recover the filesystem,
 so we might have been able to recover this OSD).
 
 If your setup is susceptible to situations where you can lose all
 replicas you will lose data, but there's not much that can be done
 about that. Ceph actually begins to generate new replicas to replace
 the missing ones after mon osd down out interval, so the actual loss
 should not happen unless you lose (and can't recover) size OSDs on
 separate hosts (with the default crush map) simultaneously. Before going into
 production you should know how long Ceph will take to fully recover from
 a disk or host failure by testing it with load. Your setup might not be
 robust if it doesn't have the available disk space or the speed needed to
 recover quickly from such a failure.
 
 Lionel
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Christian Eichelmann
Systemadministrator

11 Internet AG - IT Operations Mail  Media Advertising  Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
christian.eichelm...@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Documentation of ceph pg num query

2015-01-09 Thread Gregory Farnum
On Fri, Jan 9, 2015 at 1:24 AM, Christian Eichelmann
christian.eichelm...@1und1.de wrote:
 Hi all,

 as mentioned last year, our ceph cluster is still broken and unusable.
 We are still investigating what has happened and I am taking a deeper
 look into the output of ceph pg <pgnum> query.

 The problem is that I can find some information about what some of the
 sections mean, but mostly I can only guess. Is there any kind of
 documentation where I can find some explanation of what is stated there?
 Because without that the output is barely useful.

There is unfortunately not really any documentation around this right now.
If you have specific questions someone can probably help you with
them, though.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uniform distribution

2015-01-09 Thread Gregory Farnum
100GB objects (or ~40 on a hard drive!) are way too large for you to
get an effective random distribution.
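
(If the goal is just to fill the cluster, spreading the data over many
smaller objects will give a much smoother distribution, for example with
rados bench and cleanup disabled; the numbers here are arbitrary:)

rados -p rbd bench 600 write -b 4194304 --no-cleanup
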
-Greg

On Thu, Jan 8, 2015 at 5:25 PM, Mark Nelson mark.nel...@inktank.com wrote:
 On 01/08/2015 03:35 PM, Michael J Brewer wrote:

 Hi all,

 I'm working on filling a cluster to near capacity for testing purposes.
 Though I'm noticing that it isn't storing the data uniformly between
 OSDs during the filling process. I currently have the following levels:

 Node 1:
 /dev/sdb1  3904027124  2884673100  1019354024  74%
 /var/lib/ceph/osd/ceph-0
 /dev/sdc1  3904027124  2306909388  1597117736  60%
 /var/lib/ceph/osd/ceph-1
 /dev/sdd1  3904027124  3296767276   607259848  85%
 /var/lib/ceph/osd/ceph-2
 /dev/sde1  3904027124  3670063612   233963512  95%
 /var/lib/ceph/osd/ceph-3

 Node 2:
 /dev/sdb1  3904027124  3250627172   653399952  84%
 /var/lib/ceph/osd/ceph-4
 /dev/sdc1  3904027124  3611337492   292689632  93%
 /var/lib/ceph/osd/ceph-5
 /dev/sdd1  3904027124  2831199600  1072827524  73%
 /var/lib/ceph/osd/ceph-6
 /dev/sde1  3904027124  2466292856  1437734268  64%
 /var/lib/ceph/osd/ceph-7

 I am using rados put to upload 100g files to the cluster, doing two at
 a time from two different locations. Is this expected behavior, or can
 someone shed light on why it is doing this? We're using the open source
 version 0.80.7. We're also using the default CRUSH configuration.


 So crush utilizes pseudo-random distributions, but sadly random
 distributions tend to be clumpy and not perfectly uniform until you get to
 very high sample counts. The gist of it is that if you have a really low
 density of PGs/OSD and/or are very unlucky, you can end up with a skewed
 distribution.  If you are even more unlucky, you could compound that with a
 streak of objects landing on PGs associated with some specific OSD.  This
 particular case looks rather bad.  How many PGs and OSDs do you have?


 Regards,
 MICHAEL J. BREWER

 Phone: 1-512-286-5596 | Tie-Line: 363-5596
 E-mail: mjbre...@us.ibm.com


 11501 Burnet Rd
 Austin, TX 78758-3400
 United States




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on peta scale

2015-01-09 Thread Gregory Farnum
On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote:
 I just finished configuring Ceph up to 100 TB with OpenStack ... Since we
 are also using Lustre on our HPC machines, I am just wondering what the
 bottleneck is for Ceph going to petabyte scale like Lustre.

 Any ideas? Or has someone tried it?

If you're talking about people building a petabyte Ceph system, there
are *many* who run clusters of that size. If you're talking about the
Ceph filesystem as a replacement for Lustre at that scale, the concern
is less about the raw amount of data and more about the resiliency of
the current code base at that size...but if you want to try it out and
tell us what problems you run into we will love you forever. ;)
(The scalable file system use case is what actually spawned the Ceph
project, so in theory there shouldn't be any serious scaling
bottlenecks. In practice it will depend on what kind of metadata
throughput you need because the multi-MDS stuff is improving but still
less stable.)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph configuration on multiple public networks.

2015-01-09 Thread J-P Methot

Hi,

We've set up Ceph and OpenStack on a fairly peculiar network 
configuration (or at least I think it is) and I'm looking for 
information on how to make it work properly.


Basically, we have 3 networks, a management network, a storage network 
and a cluster network. The management network is over a 1 gbps link, 
while the storage network is over 2 bonded 10 gbps links. The cluster 
network can be ignored for now, as it works well.


Now, the main problem is that the Ceph OSD nodes are plugged into the 
management, storage and cluster networks, but the monitors are only 
plugged into the management network. When I do tests, I see that all the 
traffic ends up going through the management network, slowing down 
Ceph's performance. Because of the current network setup, I can't hook 
up the monitor nodes to the storage network, as we're missing ports 
on the switch.


Would it be possible to maintain access to the management nodes while 
forcing the ceph cluster to use the storage network for data transfer? 
As a reference, here's my ceph.conf.


[global]
osd_pool_default_pgp_num = 800
osd_pg_bits = 12
auth_service_required = cephx
osd_pool_default_size = 3
filestore_xattr_use_omap = true
auth_client_required = cephx
osd_pool_default_pg_num = 800
auth_cluster_required = cephx
mon_host = 10.251.0.51
public_network = 10.251.0.0/24, 10.21.0.0/24
mon_initial_members = cephmon1
cluster_network = 192.168.31.0/24
fsid = 60e1b557-e081-4dab-aa76-e68ba38a159e
osd_pgp_bits = 12

As you can see I've set up 2 public networks, 10.251.0.0 being the 
management network and 10.21.0.0 being the storage network. Would it be 
possible to maintain cluster functionality and remove 10.251.0.0/24 from 
the public_network list? For example, if I were to remove it from the 
public network list and reference each monitor node IP in the config 
file, would I be able to maintain connectivity?
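
For what it's worth, the kind of change I have in mind would look something
like this (whether the monitors stay reachable this way is exactly what I'm
unsure about):

[global]
public_network = 10.21.0.0/24
cluster_network = 192.168.31.0/24
mon_initial_members = cephmon1
mon_host = 10.251.0.51

[mon.cephmon1]
mon_addr = 10.251.0.51:6789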


--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backfill_toofull, but OSDs not full

2015-01-09 Thread Udo Lembke
Hi,
I had a similar effect two weeks ago - 1 PG was backfill_toofull, and after
reweighting and deleting there was enough free space, but the rebuild
process stopped after a while.

After stopping and starting Ceph on the second node, the rebuild process ran
without trouble and the backfill_toofull was gone.

This happens with firefly.

Udo

On 09.01.2015 21:29, c3 wrote:
 In this case the root cause was half denied reservations.

 http://tracker.ceph.com/issues/9626

 This stopped backfills since, those listed as backfilling were
 actually half denied and doing nothing. The toofull status is not
 checked until a free backfill slot happens, so everything was just stuck.

 Interestingly, the toofull was created by other backfills which were
 not stopped.
 http://tracker.ceph.com/issues/9594

 Quite the log jam to clear.


 Quoting Craig Lewis cle...@centraldesktop.com:

 What was the osd_backfill_full_ratio?  That's the config that controls
 backfill_toofull.  By default, it's 85%.  The mon_osd_*_ratio affect the
 ceph status.

 I've noticed that it takes a while for backfilling to restart after
 changing osd_backfill_full_ratio.  Backfilling usually restarts for me in
 10-15 minutes.  Some PGs will stay in that state until the cluster is
 nearly done recovering.

 I've only seen backfill_toofull happen after the OSD exceeds the ratio (so
 it's reactive, not proactive).  Mine usually happen when I'm rebalancing a
 nearfull cluster, and an OSD backfills itself toofull.




 On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:

 Hi,

 I am wondering how a PG gets marked backfill_toofull.

 I reweighted several OSDs using ceph osd crush reweight. As
 expected, PG
 began moving around (backfilling).

 Some PGs got marked +backfilling (~10), some +wait_backfill (~100).

 But some are marked +backfill_toofull. My OSDs are between 25% and 72%
 full.

 Looking at ceph pg dump, I can find the backfill_toofull PGs and
 verified
 the OSDs involved are less than 72% full.

 Do backfill reservations include a size? Are these OSDs projected to be
 toofull, once the current backfilling complete? Some of the
 backfill_toofull and backfilling point to the same OSDs.

 I did adjust the full ratios, but that did not change the
 backfill_toofull
 status.
 ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
 ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RHEL 7 Installs

2015-01-09 Thread John Wilkins
Ken,

I had a number of issues installing Ceph on RHEL 7, which I think are
mostly due to dependencies. I followed the quick start guide, which
gets the latest major release--e.g., Firefly, Giant.

ceph.conf is here: http://goo.gl/LNjFp3
ceph.log common errors included: http://goo.gl/yL8UsM

To resolve these, I had to download and install libunwind and python-jinja2.

It also seems that the Giant repo had 0.86 and 0.87 packages for
python-ceph, and ceph-deploy didn't like that.

ceph.log error: http://goo.gl/oeKGUv

To resolve this, I had to download and install python-ceph v0.87.
Then, run the ceph-deploy install command again.


-- 
John Wilkins
Red Hat
jowil...@redhat.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Jiri Kanicky

Hi Nico,

I would probably recommend upgrading to 0.87 (Giant). I have been running this 
version for some time now and it works very well. I also upgraded from 
Firefly and it was easy.


The issue you are experiencing seems quite complex and it would require 
debug logs to troubleshoot.


Apologies that I could not help much.

-Jiri

On 9/01/2015 20:23, Nico Schottelius wrote:

Good morning Jiri,

sure, let me catch up on this:

- Kernel 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x)/ubuntu (2x)

Cheers,

Nico

Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]:

Hi Nico.

If you are experiencing such issues it would be good if you provide more info 
about your deployment: ceph version, kernel versions, OS, filesystem btrfs/xfs.

Thx Jiri

- Reply message -
From: Nico Schottelius nico-eph-us...@schottelius.org
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = 
Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.

However, after some time rsync got completely stuck and eventually the
host which mounted the rbd mapped devices decided to kernel panic at
which time we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder:

Is anyone using ceph as a backend for qemu VM images in production?

And:

Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?

Reading about the issues on the list here gives me the impression that
ceph as a software is stuck/incomplete and has not yet become ready/clean
for production (sorry for the word joke).

Cheers,

Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:

Hi Nico and all others who answered,

After some more attempts to somehow get the PGs into a working state (I've
tried force_create_pg, which was putting them in the creating state. But
that was obviously not true, since after rebooting one of the containing
OSDs it went back to incomplete), I decided to save what could be saved.

I've created a new pool, created a new image there, mapped the old image
from the old pool and the new image from the new pool to a machine, to
copy data on posix level.

Unfortunately, formatting the image from the new pool hangs after some
time. So it seems that the new pool is suffering from the same problem
as the old pool, which is totally not understandable to me.

Right now, it seems like Ceph is giving me no options to either save
some of the still intact rbd volumes, or to create a new pool along the
old one to at least enable our clients to send data to ceph again.

To tell the truth, I guess that will result in the end of our Ceph
project (which has already been running for 9 months).

Regards,
Christian

Am 29.12.2014 15:59, schrieb Nico Schottelius:

Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:

[incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation
with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
directories to allow OSDs to start after the disk filled up completly.

So I am sorry not to be able to give you a good hint, but I am very
interested in seeing your problem solved, as it is a show stopper for
us, too. (*)

Cheers,

Nico

(*) We migrated from sheepdog to gluster to ceph and so far sheepdog
 seems to run much smoother. The first one is however not supported
 by opennebula directly, the second one not flexible enough to host
 our heterogeneous infrastructure (mixed disk sizes/amounts) - so we
 are using ceph at the moment.



--
Christian Eichelmann
Systemadministrator

11 Internet AG - IT Operations Mail  Media Advertising  Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
christian.eichelm...@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren

--
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Documentation of ceph pg num query

2015-01-09 Thread John Wilkins
Have you looked at

http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
http://ceph.com/docs/master/rados/operations/pg-states/
http://ceph.com/docs/master/rados/operations/pg-concepts/
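
The raw state those pages describe comes from commands like these (the pg id
is just an example):

ceph health detail
ceph pg dump_stuck inactive
ceph pg map 2.9
ceph pg 2.9 query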

On Fri, Jan 9, 2015 at 1:24 AM, Christian Eichelmann
christian.eichelm...@1und1.de wrote:
 Hi all,

 as mentioned last year, our ceph cluster is still broken and unusable.
 We are still investigating what has happened and I am taking a deeper
 look into the output of ceph pg <pgnum> query.

 The problem is that I can find some information about what some of the
 sections mean, but mostly I can only guess. Is there any kind of
 documentation where I can find some explanation of what is stated there?
 Because without that the output is barely useful.

 Regards,
 Christian
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
John Wilkins
Red Hat
jowil...@redhat.com
(415) 425-9599
http://redhat.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uniform distribution

2015-01-09 Thread Mark Nelson
I didn't actually calculate the per-OSD object density but yes, I agree 
that will hurt.


On 01/09/2015 12:09 PM, Gregory Farnum wrote:

100GB objects (or ~40 on a hard drive!) are way too large for you to
get an effective random distribution.
-Greg

On Thu, Jan 8, 2015 at 5:25 PM, Mark Nelson mark.nel...@inktank.com wrote:

On 01/08/2015 03:35 PM, Michael J Brewer wrote:


Hi all,

I'm working on filling a cluster to near capacity for testing purposes.
Though I'm noticing that it isn't storing the data uniformly between
OSDs during the filling process. I currently have the following levels:

Node 1:
/dev/sdb1  3904027124  2884673100  1019354024  74%
/var/lib/ceph/osd/ceph-0
/dev/sdc1  3904027124  2306909388  1597117736  60%
/var/lib/ceph/osd/ceph-1
/dev/sdd1  3904027124  3296767276   607259848  85%
/var/lib/ceph/osd/ceph-2
/dev/sde1  3904027124  3670063612   233963512  95%
/var/lib/ceph/osd/ceph-3

Node 2:
/dev/sdb1  3904027124  3250627172   653399952  84%
/var/lib/ceph/osd/ceph-4
/dev/sdc1  3904027124  3611337492   292689632  93%
/var/lib/ceph/osd/ceph-5
/dev/sdd1  3904027124  2831199600  1072827524  73%
/var/lib/ceph/osd/ceph-6
/dev/sde1  3904027124  2466292856  1437734268  64%
/var/lib/ceph/osd/ceph-7

I am using rados put to upload 100g files to the cluster, doing two at
a time from two different locations. Is this expected behavior, or can
someone shed light on why it is doing this? We're using the open source
version 0.80.7. We're also using the default CRUSH configuration.



So crush utilizes pseudo-random distributions, but sadly random
distributions tend to be clumpy and not perfectly uniform until you get to
very high sample counts. The gist of it is that if you have a really low
density of PGs/OSD and/or are very unlucky, you can end up with a skewed
distribution.  If you are even more unlucky, you could compound that with a
streak of objects landing on PGs associated with some specific OSD.  This
particular case looks rather bad.  How many PGs and OSDs do you have?



Regards,
MICHAEL J. BREWER

Phone: 1-512-286-5596 | Tie-Line: 363-5596
E-mail: mjbre...@us.ibm.com


11501 Burnet Rd
Austin, TX 78758-3400
United States




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow/Hung IOs

2015-01-09 Thread Craig Lewis
It doesn't seem like the problem here, but I've noticed that slow OSDs have
a large fan-out.  I have less than 100 OSDs, so every OSD talks to every
other OSD in my cluster.

I was getting slow notices from all of my OSDs.  Nothing jumped out, so I
started looking at disk write latency graphs.  I noticed that all the OSDs
in one node had 10x the write latency of the other nodes.  After that, I
graphed the number of slow notices per OSD, and noticed a much higher
number of slow requests on that node.

Long story short, I lost a battery on my write cache.  But it wasn't at all
obvious from the slow request notices, not until I dug deeper.
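
A quick and dirty way to get that per-OSD tally straight from the logs (the
awk field number matches the log format quoted below, adjust if yours
differs):

grep 'slow request' /var/log/ceph/ceph.log | awk '{print $3}' | sort | uniq -c | sort -rn | head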



On Mon, Jan 5, 2015 at 4:07 PM, Sanders, Bill bill.sand...@teradata.com
wrote:

  Thanks for the reply.

 14 and 18 happened to show up during that run, but it's certainly not only
 those OSD's.  It seems to vary each run.  Just from the runs I've done
 today I've seen the following pairs of OSD's:

 ['0,13', '0,18', '0,24', '0,25', '0,32', '0,34', '0,36', '10,22', '11,30',
 '12,28', '13,30', '14,22', '14,24', '14,27', '14,30', '14,31', '14,33',
 '14,34', '14,35', '14,39', '16,20', '16,27', '18,38', '19,30', '19,31',
 '19,39', '20,38', '22,30', '26,37', '26,38', '27,33', '27,34', '27,36',
 '28,32', '28,34', '28,36', '28,37', '3,18', '3,27', '3,29', '3,37', '4,10',
 '4,29', '5,19', '5,37', '6,25', '9,28', '9,29', '9,37']

 Which is almost all of the OSD's in the system.

 Bill

  --
 *From:* Lincoln Bryant [linco...@uchicago.edu]
 *Sent:* Monday, January 05, 2015 3:40 PM
 *To:* Sanders, Bill
 *Cc:* ceph-users@lists.ceph.com
 *Subject:* Re: [ceph-users] Slow/Hung IOs

  Hi Bill,

  From your log excerpt, it looks like your slow requests are happening on
 OSDs 14 and 18. Is it always these two OSDs?

  If you don't have a long recovery time (e.g., the cluster is just full
 of test data), maybe you could try setting OSDs 14 and 18 out and
 re-benching?

  Alternatively I suppose you could just use bonnie++ or dd etc to write
 to those OSDs (careful to not clobber any Ceph dirs) and see how the
 performance looks.

  Cheers,
 Lincoln

   On Jan 5, 2015, at 4:36 PM, Sanders, Bill wrote:

   Hi Ceph Users,

 We've got a Ceph cluster we've built, and we're experiencing issues with
 slow or hung IO's, even running 'rados bench' on the OSD cluster.  Things
 start out great, ~600 MB/s, then rapidly drops off as the test waits for
 IO's. Nothing seems to be taxed... the system just seems to be waiting.
 Any help trying to figure out what could cause the slow IO's is appreciated.

 For example, 'rados -p rbd bench 60 write -t 32' takes over 900s to
 complete:

 A typical rados bench:
  Total time run: 957.458274
 Total writes made:  9251
 Write size: 4194304
 Bandwidth (MB/sec): 38.648

 Stddev Bandwidth:   157.323
 Max bandwidth (MB/sec): 964
 Min bandwidth (MB/sec): 0
 Average Latency:3.21126
 Stddev Latency: 51.9546
 Max latency:910.72
 Min latency:0.04516


 According to ceph.log, we're not experiencing any OSD flapping or monitor
 election cycles, just slow requests:

 # grep slow /var/log/ceph/ceph.log:
 2015-01-05 13:42:42.937678 osd.18 39.7.48.7:6803/11185 220 : [WRN] 3 slow
 requests, 1 included below; oldest blocked for  513.611379 secs
 2015-01-05 13:42:42.937685 osd.18 39.7.48.7:6803/11185 221 : [WRN] slow
 request 30.136429 seconds old, received at 2015-01-05 13:42:12.801205:
 osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write
 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
 from 3,37
 2015-01-05 13:42:49.938681 osd.18 39.7.48.7:6803/11185 222 : [WRN] 3 slow
 requests, 1 included below; oldest blocked for  520.612372 secs
 2015-01-05 13:42:49.938688 osd.18 39.7.48.7:6803/11185 223 : [WRN] slow
 request 480.636547 seconds old, received at 2015-01-05 13:34:49.302080:
 osd_op(client.92008.1:3100010 rb.0.140d.238e1f29.0c77 [write
 3622400~512] 3.d031a69f ondisk+write e994) v4 currently waiting for subops
 from 26,37
 2015-01-05 13:43:12.941838 osd.18 39.7.48.7:6803/11185 224 : [WRN] 3 slow
 requests, 1 included below; oldest blocked for  543.615545 secs
 2015-01-05 13:43:12.941844 osd.18 39.7.48.7:6803/11185 225 : [WRN] slow
 request 60.140595 seconds old, received at 2015-01-05 13:42:12.801205:
 osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write
 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
 from 3,37
 2015-01-05 13:44:04.933440 osd.14 39.7.48.7:6818/11640 251 : [WRN] 4 slow
 requests, 1 included below; oldest blocked for  606.941954 secs
 2015-01-05 13:44:04.933469 osd.14 39.7.48.7:6818/11640 252 : [WRN] slow
 request 240.101138 seconds old, received at 2015-01-05 13:40:04.832272:
 osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.0010 [write
 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops
 from 27,33
 2015-01-05 13:44:12.950805 osd.18 

Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Gregory Farnum
On Fri, Jan 9, 2015 at 2:00 AM, Nico Schottelius
nico-ceph-us...@schottelius.org wrote:
 Lionel, Christian,

 we do have the exactly same trouble as Christian,
 namely

 Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
 We still don't know what caused this specific error...

 and

 ...there is currently no way to make ceph forget about the data of this pg 
 and create it as an empty one. So the only way
 to make this pool usable again is to lose all your data in there.

 I wonder what is the position of ceph developers regarding
 dropping (emptying) specific pgs?
 Is that a use case that was never thought of or tested?

I've never worked directly on any of the clusters this has happened to,
but I believe every time we've seen issues like this with somebody we
have a relationship with, it's either:
1) been resolved by using the existing tools to mark stuff lost, or
2) been the result of local filesystems/disks silently losing data due
to some fault or other.

The second case means the OSDs have corrupted state and trusting them
is tricky. Also, most people we've had relationships with that this
has happened to really want to not lose all the data in the PG, which
necessitates manually mucking around anyway. ;)

Mailing list issues are obviously a lot harder to categorize, but the
ones we've taken time on where people say the commands don't work have
generally fallen into the second bucket.

If you want to experiment, I think all the manual mucking around has
been done with the objectstore tool and removing bad PGs, moving them
around, or faking journal entries, but I've not done it myself so I
could be mistaken.
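
For the record, the kind of invocation involved looks roughly like this (the
OSD has to be stopped first, and the tool/option names have changed between
releases, so check against your version; paths and ids are examples):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
  --journal-path /var/lib/ceph/osd/ceph-3/journal \
  --pgid 2.9 --op export --file /tmp/pg.2.9.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
  --journal-path /var/lib/ceph/osd/ceph-3/journal \
  --pgid 2.9 --op remove
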
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backfill_toofull, but OSDs not full

2015-01-09 Thread Craig Lewis
What was the osd_backfill_full_ratio?  That's the config that controls
backfill_toofull.  By default, it's 85%.  The mon_osd_*_ratio affect the
ceph status.

I've noticed that it takes a while for backfilling to restart after
changing osd_backfill_full_ratio.  Backfilling usually restarts for me in
10-15 minutes.  Some PGs will stay in that state until the cluster is
nearly done recovering.

I've only seen backfill_toofull happen after the OSD exceeds the ratio (so
it's reactive, not proactive).  Mine usually happen when I'm rebalancing a
nearfull cluster, and an OSD backfills itself toofull.
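
To see what a given OSD is actually using, the admin socket is handy (the
socket path and OSD id are examples):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'backfill_full|full_ratio'
ceph pg dump | grep backfill_toofull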




On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:

 Hi,

 I am wondering how a PG gets marked backfill_toofull.

 I reweighted several OSDs using ceph osd crush reweight. As expected, PG
 began moving around (backfilling).

 Some PGs got marked +backfilling (~10), some +wait_backfill (~100).

 But some are marked +backfill_toofull. My OSDs are between 25% and 72%
 full.

 Looking at ceph pg dump, I can find the backfill_toofull PGs and verified
 the OSDs involved are less than 72% full.

 Do backfill reservations include a size? Are these OSDs projected to be
 toofull, once the current backfilling complete? Some of the
 backfill_toofull and backfilling point to the same OSDs.

 I did adjust the full ratios, but that did not change the backfill_toofull
 status.
 ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
 ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backfill_toofull, but OSDs not full

2015-01-09 Thread c3

In this case the root cause was half denied reservations.

http://tracker.ceph.com/issues/9626

This stopped backfills since, those listed as backfilling were  
actually half denied and doing nothing. The toofull status is not  
checked until a free backfill slot happens, so everything was just  
stuck.


Interestingly, the toofull was created by other backfills which were
not stopped.

http://tracker.ceph.com/issues/9594

Quite the log jam to clear.


Quoting Craig Lewis cle...@centraldesktop.com:


What was the osd_backfill_full_ratio?  That's the config that controls
backfill_toofull.  By default, it's 85%.  The mon_osd_*_ratio affect the
ceph status.

I've noticed that it takes a while for backfilling to restart after
changing osd_backfill_full_ratio.  Backfilling usually restarts for me in
10-15 minutes.  Some PGs will stay in that state until the cluster is
nearly done recovering.

I've only seen backfill_toofull happen after the OSD exceeds the ratio (so
it's reactive, not proactive).  Mine usually happen when I'm rebalancing a
nearfull cluster, and an OSD backfills itself toofull.




On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:


Hi,

I am wondering how a PG gets marked backfill_toofull.

I reweighted several OSDs using ceph osd crush reweight. As expected, PG
began moving around (backfilling).

Some PGs got marked +backfilling (~10), some +wait_backfill (~100).

But some are marked +backfill_toofull. My OSDs are between 25% and 72%
full.

Looking at ceph pg dump, I can find the backfill_toofull PGs and verified
the OSDs involved are less than 72% full.

Do backfill reservations include a size? Are these OSDs projected to be
toofull, once the current backfilling complete? Some of the
backfill_toofull and backfilling point to the same OSDs.

I did adjust the full ratios, but that did not change the backfill_toofull
status.
ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RHEL 7 Installs

2015-01-09 Thread Travis Rhoden
Hi John,

For the last part, there being two different versions of packages in
Giant, I don't think that's the actual problem.

What's really happening there is that python-ceph has been obsoleted
by other packages that are getting picked up by Yum.  See the line
that says Package python-ceph is obsoleted by python-rados...

It's the same deal as http://tracker.ceph.com/issues/10476

You could try the same fix there.
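
i.e. something along these lines (this is the commonly mentioned workaround;
double-check it against the tracker notes):

yum install yum-plugin-priorities
# make sure /etc/yum/pluginconf.d/priorities.conf has
#   [main]
#   enabled = 1
#   check_obsoletes = 1
# and that the entries in /etc/yum.repos.d/ceph.repo carry priority=1,
# then re-run:
ceph-deploy install <node>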

On Fri, Jan 9, 2015 at 4:50 PM, John Wilkins john.wilk...@inktank.com wrote:
 Ken,

 I had a number of issues installing Ceph on RHEL 7, which I think are
 mostly due to dependencies. I followed the quick start guide, which
 gets the latest major release--e.g., Firefly, Giant.

 ceph.conf is here: http://goo.gl/LNjFp3
 ceph.log common errors included: http://goo.gl/yL8UsM

 To resolve these, I had to download and install libunwind and python-jinja2.

 It also seems that the Giant repo had 0.86 and 0.87 packages for
 python-ceph, and ceph-deploy didn't like that.

 ceph.log error: http://goo.gl/oeKGUv

 To resolve this, I had to download and install python-ceph v0.87.
 Then, run the ceph-deploy install command again.


 --
 John Wilkins
 Red Hat
 jowil...@redhat.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-09 Thread Irek Fasikhov
Very very good :)

Fri, 9 Jan 2015, 2:17, William Bloom (wibloom) wibl...@cisco.com:

  Awesome, thanks Michael.



 Regards

 William



 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michael J. Kidd
 Sent: Wednesday, January 07, 2015 2:09 PM
 To: ceph-us...@ceph.com
 Subject: [ceph-users] PG num calculator live on Ceph.com



 Hello all,

   Just a quick heads up that we now have a PG calculator to help determine
 the proper PG per pool numbers to achieve a target PG per OSD ratio.

 http://ceph.com/pgcalc

 Please check it out!  Happy to answer any questions, and always welcome
 any feedback on the tool / verbiage, etc...

 As an aside, we're also working to update the documentation to reflect the
 best practices.  See Ceph.com tracker for this at:
 http://tracker.ceph.com/issues/9867
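
For anyone curious, the rule of thumb behind the calculator is roughly the
one from the docs: total PGs ~= (number of OSDs * target PGs per OSD,
usually 100) / pool size, rounded to the nearest power of two, then split
across pools by their expected share of the data. For example, 20 OSDs *
100 / 3 ~= 667, so 512 or 1024 PGs depending on how much growth you expect.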

 Thanks!

 Michael J. Kidd
 Sr. Storage Consultant
 Inktank Professional Services

  - by Red Hat
   ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question about S3 multipart upload ignores request headers

2015-01-09 Thread baijia...@126.com
I applied the patch from http://tracker.ceph.com/issues/8452,
ran the S3 test suite, and it still fails.
Error log: ERROR: failed to get obj attrs,
obj=test-client.0-31zepqoawd8dxfa-212:_multipart_mymultipart.2/0IQGoJ7hG8ZtTyfAnglChBO79HUsjeC.meta
 ret=-2

I found code that may have a problem:
when the function executes ret = get_obj_attrs(store, s, meta_obj, attrs, NULL, NULL);,
should it execute meta_obj.set_in_extra_data(true); before it?
Because meta_obj is in the extra bucket.





baijia...@126.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Nico Schottelius
Lionel, Christian,

we do have the exactly same trouble as Christian,
namely

Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
 We still don't know what caused this specific error...

and

 ...there is currently no way to make ceph forget about the data of this pg 
 and create it as an empty one. So the only way
 to make this pool usable again is to lose all your data in there. 

I wonder what is the position of ceph developers regarding
dropping (emptying) specific pgs?
Is that a use case that was never thought of or tested?

For us it is essential to be able to keep the pool/cluster
running even in case we have lost pgs.

Even though I do not like the fact that we lost a pg for
an unknown reason, I would prefer ceph to handle that case to recover to
the best possible situation.

Namely I wonder if we can integrate a tool that shows 
which (parts of) rbd images would be affected by dropping
a pg. That would give us the chance to selectively restore
VMs in case this happens again.

Cheers,

Nico

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-09 Thread Christian Balzer
On Thu, 8 Jan 2015 21:17:12 -0700 Robert LeBlanc wrote:

 On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer ch...@gol.com wrote:
  On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
  Which of course currently means a strongly consistent lockup in these
  scenarios. ^o^
 
 That is one way of putting it
 
If I had the time and more importantly the talent to help with code, I'd
do so. 
Failing that, pointing out the often painful truth is something I can do.

  Slightly off-topic and snarky, that strong consistency is of course of
  limited use when in the case of a corrupted PG Ceph basically asks you
  to toss a coin.
  As in minor corruption, impossible for a mere human to tell which
  replica is the good one, because one OSD is down and the 2 remaining
  ones differ by one bit or so.
 
 This is where checksumming is supposed to come in. I think Sage has been
 leading that initiative. 

Yeah, I'm aware of that effort. 
Of course in the meantime even a very simple majority vote would be most
welcome and helpful in nearly all cases (with 3 replicas available).

One wonders if this is basically acknowledging that while offloading some
things like checksums to the underlying layer/FS is desirable from a
codebase/effort/complexity view, neither BTRFS nor ZFS is fully production
ready, and they won't be for some time.

 Basically, when an OSD reads an object it should
 be able to tell if there was bit rot by hashing what it just read and
 checking it against the MD5SUM it computed when it first received the object. If it
 doesn't match it can ask another OSD until it finds one that matches.
 
 This provides a number of benefits:
 
1. Protect against bit rot. Checked on read and on deep scrub.
2. Automatically recover the correct version of the object.
   3. If the client computes the MD5SUM before it is sent over the wire, the
data can be guaranteed through the memory of several
machines/devices/cables/etc.
4. Getting by with size 2 is less risky for those who really want to
do that.
 
 With all these benefits, there is a trade-off associated with it, mostly
 CPU. However, with the inclusion of AES in silicon, it may not be a huge
 issue now. But I'm not a programmer familiar with that aspect of the
 Ceph code, so I can't be authoritative in any way.

Yup, all very useful and pertinent points.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Documentation of ceph pg num query

2015-01-09 Thread Christian Eichelmann
Hi all,

as mentioned last year, our ceph cluster is still broken and unusable.
We are still investigating what has happened and I am taking more deep
looks into the output of ceph pg pgnum query.

The problem is that I can find some informations about what some of the
sections mean, but mostly I can only guess. Is there any kind of
documentation where I can find some explanations of whats state there?
Because without that the output is barely usefull.

Regards,
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Minimum Cluster Install (ARM)

2015-01-09 Thread Christian Balzer
On Thu, 8 Jan 2015 01:35:03 + Garg, Pankaj wrote:

 Hi,
 I am trying to get a very minimal Ceph cluster up and running (on ARM)
 and I'm wondering what is the smallest unit that I can run rados-bench
 on ? Documentation at
 (http://ceph.com/docs/next/start/quick-ceph-deploy/) seems to refer to 4
 different nodes. Admin Node, Monitor Node and 2 OSD only nodes.
 
 Can the Admin node be an x86 machine even if the deployment is ARM based?
 
 Or can the Admin Node and Monitor node co-exist.
 
 Finally, I'm assuming I can get by with only 1 independent OSD node.
 
 If that's possible, I can get by with 2 ARM systems only. Can someone
 please shed some light on whether this will work?
 
You can do everything on one node even (need to set replica size to 1).
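
Concretely that is something along these lines (the pool name is just the
default rbd pool):

ceph osd pool set rbd size 1
ceph osd pool set rbd min_size 1
# and for a fresh single-node deployment, put this in ceph.conf before
# creating the OSDs so CRUSH picks OSDs instead of hosts:
# osd crush chooseleaf type = 0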

However that will not realistically reflect reality at all and any
benchmarks will be skewed (as always with very small clusters).

In addition to that your systems need to be fast enough, meaning CPU for
OSD and MON, fast storage for MON(OS) and sufficient RAM to handle
everything.

Again, learning some things about Ceph is feasible with a minimal cluster,
but drawing conclusions about performance with what you have at hand will
be tricky.

Lastly, Ceph can get very compute intense (OSDs, particularly with small
I/Os), so I'm skeptical that ARM will cut the mustard.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-09 Thread Robert LeBlanc
On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer ch...@gol.com wrote:
 On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
 Which of course currently means a strongly consistent lockup in these
 scenarios. ^o^

That is one way of putting it

 Slightly off-topic and snarky, that strong consistency is of course of
 limited use when in the case of a corrupted PG Ceph basically asks you to
 toss a coin.
 As in minor corruption, impossible for a mere human to tell which
 replica is the good one, because one OSD is down and the 2 remaining ones
 differ by one bit or so.

This is where checksumming is supposed to come in. I think Sage has been
leading that initiative. Basically, when an OSD reads an object it should
be able to tell if there was bit rot by hashing what it just read and
checking it against the MD5SUM it computed when it first received the object. If it
doesn't match it can ask another OSD until it finds one that matches.

This provides a number of benefits:

   1. Protect against bit rot. Checked on read and on deep scrub.
   2. Automatically recover the correct version of the object.
   3. If the client computes the MD5SUM before it is sent over the wire, the
   data can be guaranteed through the memory of several
   machines/devices/cables/etc.
   4. Getting by with size 2 is less risky for those who really want to
   do that.

With all these benefits, there is a trade-off associated with it, mostly
CPU. However, with the inclusion of AES in silicon, it may not be a huge
issue now. But I'm not a programmer familiar with that aspect of the
Ceph code, so I can't be authoritative in any way.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Nico Schottelius
Good morning Jiri,

sure, let me catch up on this:

- Kernel 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x)/ubuntu (2x)

Cheers,

Nico

Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]:
 Hi Nico.
 
 If you are experiencing such issues it would be good if you provide more info 
 about your deployment: ceph version, kernel versions, OS, filesystem 
 btrfs/xfs.
 
 Thx Jiri
 
 - Reply message -
 From: Nico Schottelius nico-eph-us...@schottelius.org
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = 
 Cluster unusable]
 Date: Wed, Dec 31, 2014 02:36
 
 Good evening,
 
 we also tried to rescue data *from* our old / broken pool by map'ing the
 rbd devices, mounting them on a host and rsync'ing away as much as
 possible.
 
 However, after some time rsync got completely stuck and eventually the
 host which mounted the rbd mapped devices decided to kernel panic at
 which time we decided to drop the pool and go with a backup.
 
 This story and the one of Christian makes me wonder:
 
 Is anyone using ceph as a backend for qemu VM images in production?
 
 And:
 
 Has anyone on the list been able to recover from a pg incomplete /
 stuck situation like ours?
 
 Reading about the issues on the list here gives me the impression that
 ceph as a software is stuck/incomplete and has not yet become ready/clean
 for production (sorry for the word joke).
 
 Cheers,
 
 Nico
 
 Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
  Hi Nico and all others who answered,
  
  After some more attempts to somehow get the PGs into a working state (I've
  tried force_create_pg, which was putting them in the creating state. But
  that was obviously not true, since after rebooting one of the containing
  OSDs it went back to incomplete), I decided to save what could be saved.
  
  I've created a new pool, created a new image there, mapped the old image
  from the old pool and the new image from the new pool to a machine, to
  copy data on posix level.
  
  Unfortunately, formatting the image from the new pool hangs after some
  time. So it seems that the new pool is suffering from the same problem
  as the old pool, which is totally not understandable to me.
  
  Right now, it seems like Ceph is giving me no options to either save
  some of the still intact rbd volumes, or to create a new pool along the
  old one to at least enable our clients to send data to ceph again.
  
  To tell the truth, I guess that will result in the end of our Ceph
  project (which has already been running for 9 months).
  
  Regards,
  Christian
  
  Am 29.12.2014 15:59, schrieb Nico Schottelius:
   Hey Christian,
   
   Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
   [incomplete PG / RBD hanging, osd lost also not helping]
   
   that is very interesting to hear, because we had a similar situation
   with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
   directories to allow OSDs to start after the disk filled up completly.
   
    So I am sorry not to be able to give you a good hint, but I am very
   interested in seeing your problem solved, as it is a show stopper for
   us, too. (*)
   
   Cheers,
   
   Nico
   
   (*) We migrated from sheepdog to gluster to ceph and so far sheepdog
   seems to run much smoother. The first one is however not supported
   by opennebula directly, the second one not flexible enough to host
   our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
   are using ceph at the moment.
   
  
  
  -- 
  Christian Eichelmann
  Systemadministrator
  
  11 Internet AG - IT Operations Mail  Media Advertising  Targeting
  Brauerstraße 48 · DE-76135 Karlsruhe
  Telefon: +49 721 91374-8026
  christian.eichelm...@1und1.de
  
  Amtsgericht Montabaur / HRB 6484
  Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
  Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
  Aufsichtsratsvorsitzender: Michael Scheeren
 
 -- 
 New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com