Re: [ceph-users] Ceph RBD and Backup.

2014-07-03 Thread Wolfgang Hennerbichler
If the RBD filesystem ‘belongs’ to you, you can do something like this:

http://www.wogri.com/linux/ceph-vm-backup/
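
For illustration, a minimal sketch of the snapshot-and-export style of backup that page describes (pool, image and snapshot names are placeholders, not taken from the post):

  # take a point-in-time snapshot of the image
  rbd snap create rbd/vm-disk@backup-2014-07-03
  # full export of that snapshot to external backup storage
  rbd export rbd/vm-disk@backup-2014-07-03 /backup/vm-disk-full.img
  # or export only the changes since the previous snapshot (incremental backup)
  rbd export-diff --from-snap backup-2014-07-02 rbd/vm-disk@backup-2014-07-03 /backup/vm-disk-incr.diff
  # drop the old snapshot once the export has been verified
  rbd snap rm rbd/vm-disk@backup-2014-07-02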

On Jul 3, 2014, at 7:21 AM, Irek Fasikhov  wrote:

> 
> Hi, all.
> 
> Dear community: how do you make backups of Ceph RBD?
> 
> Thanks
> 
> -- 
> Fasihov Irek (aka Kataklysm).
> Best regards, Fasikhov Irek Nurgayazovich
> Mobile: +79229045757
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD and Backup.

2014-07-03 Thread Christian Kauhaus
On 03.07.2014 07:21, Irek Fasikhov wrote:
> Dear community: how do you make backups of Ceph RBD?

We @ gocept are currently in the process of developing "backy", a new-style
backup tool that works directly with block level snapshots / diffs.

The tool is not quite finished, but it is making rapid progress. It would be
great if you'd try it, spot bugs, contribute code etc. Help is appreciated. :-)

PyPI page: https://pypi.python.org/pypi/backy/

Pull requests go here: https://bitbucket.org/ctheune/backy

Christian Theune  is the primary contact.

HTH

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] release date for 0.80.2

2014-07-03 Thread Andrei Mikhailovsky
Hi guys, 

Was wondering if 0.80.2 is coming any time soon? I am planning an upgrade from
Emperor and was wondering if I should wait for 0.80.2 to come out if the
release date is pretty soon. Otherwise, I will go for 0.80.1.

Cheers 
Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Hi Wido, thanks for answers - I have mons and OSD on each host... server1:
mon + 2 OSDs, same for server2 and server3.

Any Proposed upgrade path, or just start with 1 server and move along to
others ?

Thanks again.
Andrija


On 2 July 2014 16:34, Wido den Hollander  wrote:

> On 07/02/2014 04:08 PM, Andrija Panic wrote:
>
>> Hi,
>>
>> I have existing CEPH cluster of 3 nodes, versions 0.72.2
>>
>> I'm in a process of installing CEPH on 4th node, but now CEPH version is
>> 0.80.1
>>
>> Will this make problems running mixed CEPH versions ?
>>
>>
> No, but the recommendation is not to have this running for a very long
> period. Try to upgrade all nodes to the same version within a reasonable
> amount of time.
>
>
>  I intend to upgrade CEPH on existing 3 nodes anyway ?
>> Recommended steps ?
>>
>>
> Always upgrade the monitors first! Then to the OSDs one by one.
>
>  Thanks
>>
>> --
>>
>> Andrija Panić
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Andrija Panić
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] release date for 0.80.2

2014-07-03 Thread Wido den Hollander

On 07/03/2014 10:27 AM, Andrei Mikhailovsky wrote:

Hi guys,

Was wondering if 0.80.2 is coming any time soon? I am planning na
upgrade from Emperor and was wondering if I should wait for 0.80.2 to
come out if the release date is pretty soon. Otherwise, I will go for
the 0.80.1.



Why bother? Upgrading from 0.80.1 to .2 is not that much work.

Or is there a specific bug in 0.80.1 which you don't want to run into?


Cheers
Andrei


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Wido den Hollander

On 07/03/2014 10:59 AM, Andrija Panic wrote:

Hi Wido, thanks for answers - I have mons and OSD on each host...
server1: mon + 2 OSDs, same for server2 and server3.

Any Proposed upgrade path, or just start with 1 server and move along to
others ?



Upgrade the packages, but don't restart the daemons yet, then:

1. Restart the mon leader
2. Restart the two other mons
3. Restart all the OSDs one by one

I suggest that you wait for the cluster to become fully healthy again 
before restarting the next OSD.


Wido
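
As an illustrative sketch only (not part of Wido's mail), the rolling restart he describes could look roughly like this, using the sysvinit service commands that appear elsewhere in this thread; the OSD ids and the health-check loop are assumptions:

  # packages already upgraded on all nodes, daemons not yet restarted
  service ceph restart mon            # run on the mon leader first, then on the other two mons
  ceph health                         # wait for HEALTH_OK before touching the OSDs
  for id in 0 1; do                   # the two OSD ids on this host (placeholders)
      service ceph restart osd.$id
      until ceph health | grep -q HEALTH_OK; do sleep 10; done
  done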


Thanks again.
Andrija


On 2 July 2014 16:34, Wido den Hollander mailto:w...@42on.com>> wrote:

On 07/02/2014 04:08 PM, Andrija Panic wrote:

Hi,

I have existing CEPH cluster of 3 nodes, versions 0.72.2

I'm in a process of installing CEPH on 4th node, but now CEPH
version is
0.80.1

Will this make problems running mixed CEPH versions ?


No, but the recommendation is not to have this running for a very
long period. Try to upgrade all nodes to the same version within a
reasonable amount of time.


I intend to upgrade CEPH on existing 3 nodes anyway ?
Recommended steps ?


Always upgrade the monitors first! Then to the OSDs one by one.

Thanks

--

Andrija Panić


_
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902 
Skype: contact42on
_
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Andrija Panić



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Pools do not respond

2014-07-03 Thread Iban Cabrillo
Hi folks,
  I am following the test installation step by step, and checking some
configuration before trying to deploy a production cluster.

  Now I have a healthy cluster with 3 mons + 4 OSDs.
  I have created one pool containing all the osd.x, and two more pools: one for
two of the servers and the other for the remaining two.

  The general pool works fine (I can create images and mount them on remote
machines).

  But the other two do not work (the commands rados put, or rbd ls "pool",
hang forever).

  this is the tree:

   [ceph@cephadm ceph-cloud]$ sudo ceph osd tree
# id weight type name up/down reweight
-7 5.4 root 4x1GbFCnlSAS
-3 2.7 host node04
1 2.7 osd.1 up 1
-4 2.7 host node03
2 2.7 osd.2 up 1
-6 8.1 root 4x4GbFCnlSAS
-5 5.4 host node01
3 2.7 osd.3 up 1
4 2.7 osd.4 up 1
-2 2.7 host node04
0 2.7 osd.0 up 1
-1 13.5 root default
-2 2.7 host node04
0 2.7 osd.0 up 1
-3 2.7 host node04
1 2.7 osd.1 up 1
-4 2.7 host node03
2 2.7 osd.2 up 1
-5 5.4 host node01
3 2.7 osd.3 up 1
4 2.7 osd.4 up 1


And this is the crushmap:

...
root 4x4GbFCnlSAS {
id -6 #do not change unnecessarily
alg straw
hash 0  # rjenkins1
item node01 weight 5.400
item node04 weight 2.700
}
root 4x1GbFCnlSAS {
id -7 #do not change unnecessarily
alg straw
hash 0  # rjenkins1
item node04 weight 2.700
item node03 weight 2.700
}
# rules
rule 4x4GbFCnlSAS {
ruleset 1
type replicated
min_size 1
max_size 10
step take 4x4GbFCnlSAS
step choose firstn 0 type host
step emit
}
rule 4x1GbFCnlSAS {
ruleset 2
type replicated
min_size 1
max_size 10
step take 4x1GbFCnlSAS
step choose firstn 0 type host
step emit
}
..
I of course set the crush_rules:
sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1

but it seems something is wrong (4x4GbFCnlSAS.pool is a 512 MB file):
   sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object 4x4GbFCnlSAS.pool
   !!HANGS forever!!

from the ceph client the same thing happens:
 rbd ls cloud-4x1GbFCnlSAS
 !!HANGS forever!!


[root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS
4x1GbFCnlSAS.object
osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' -> pg
3.114ae7a9 (3.29) -> up ([], p-1) acting ([], p-1)

Any idea what I am doing wrong?

Thanks in advance, I
Bertrand Russell:
"The problem with the world is that the stupid are cocksure and the
intelligent are full of doubt."
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Joao Eduardo Luis

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

On 03/07/2014 00:55, Samuel Just wrote:

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


The only thing that comes to mind that could cause this is if we changed 
the leader's in-memory map, proposed it, it failed, and only the leader 
got to write the map to disk somehow.  This happened once on a totally 
different issue (although I can't pinpoint right now which).


In such a scenario, the leader would serve the incorrect osdmap to 
whoever asked osdmaps from it, the remaining quorum would serve the 
correct osdmaps to all the others.  This could cause this divergence. 
Or it could be something else.


Are there logs for the monitors for the timeframe this may have happened in?

  -Joao



Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the update to
firefly I was in the state: HEALTH_WARN crush map has legacy tunables, and
I saw "feature set mismatch" in the logs.

So, if I remember correctly, I ran: ceph osd crush tunables optimal to address
the "crush map" warning, and I updated my client and server kernels to 3.16rc.

Could that be it?

Pierre


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just  wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
 wrote:

The files

When I upgrade :
  ceph-deploy install --stable firefly servers...
  on each servers service ceph restart mon
  on each servers service ceph restart osd
  on each servers service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace,
etc ... I
have some PG which pass in peering state.

I thought why not try the version 0.82, it could solve my problem. (
It's my mistake ). So, I upgrade from firefly to 0.83 with :
  ceph-deploy install --testing servers...
  ..

Now, all programs are in version 0.82.
I have 3 mons, 36 OSD and 3 mds.

Pierre

PS : I find also "inc\uosdmap.13258__0_469271DE__none" on each meta
directory.

On 03/07/2014 00:10, Samuel Just wrote:


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just 
wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something
like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).

What version of ceph is running on your mons?  How many mons do
you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 wrote:


Hi,

I did it; the log files are available here:
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSDs' log files are really big, +/- 80M each.

After starting osd.20, some other OSDs crash. I go from 31 OSDs up to 16.
I notice that after this the number of down+peering PGs decreases from 367
to 248. Is that "normal"? Maybe it's temporary, the time the cluster needs
to verify all the PGs?

Regards
Pierre

On 02/07/2014 19:16, Samuel Just wrote:


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 wrote:



Yes, but how do I do that?

With a command like this?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modifying /etc/ceph/ceph.conf? This file is really minimal because I
use udev detection.

Once I have made these changes, do you want all three log files or only
osd.20's?

Thank you so much for the help

Regards
Pierre

On 01/07/2014 23:51, Samuel Just wrote:


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 wrote:




Hi,

I attach:
 - osd.20 is one of the OSDs that, I found, makes other OSDs crash.
 - osd.23 is one of the OSDs which crash when I start osd.20
 - mds is one of my MDSes

I cut the log files because they are too big. Everything is here:
https://blondeau.users.greyc.fr/cephlog/

Regards

On 30/06/2014 17:35, Gregory Farnum wrote:


What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's generally best not to
upgrade
to unnamed versions like 0.82 (but it's probably too late to go
back
now).




I will remember it the ne

Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)

2014-07-03 Thread Marc
On 03/07/2014 07:32, Kyle Bader wrote:
>> I was wondering, having a cache pool in front of an RBD pool is all fine
>> and dandy, but imagine you want to pull backups of all your VMs (or one
>> of them, or multiple...). Going to the cache for all those reads isn't
>> only pointless, it'll also potentially fill up the cache and possibly
>> evict actually frequently used data. Which got me thinking... wouldn't
>> it be nifty if there was a special way of doing specific backup reads
>> where you'd bypass the cache, ensuring the dirty cache contents get
>> written to cold pool first? Or at least doing special reads where a
>> cache-miss won't actually cache the requested data?
>>
>> AFAIK the backup routine for an RBD-backed KVM usually involves creating
>> a snapshot of the RBD and putting that into a backup storage/tape, all
>> done via librbd/API.
>>
>> Maybe something like that even already exists?
> When used in the context of OpenStack Cinder, it does:
>
> http://ceph.com/docs/next/rbd/rbd-openstack/#configuring-cinder-backup
>
> You can have the backup pool use the default crush rules, assuming the
> default isn't your hot pool. Another option might be to put backups on
> an erasure coded pool, I'm not sure if that has been tested, but in
> principle should work since objects composing a snapshot should be
> immutable.
>
Hm... considering that the RBDs are accessed via the default crush rule
due to the overlaying, they don't actually use the cache ruleset for
anything, I wouldn't think.

Also the whole idea of backups is to have them on a separate medium.
Storing RBD-backups on Ceph is no better than just taking snapshots and
keeping them around.

Thanks for the input though!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Thanks a lot Wido, will do...

Andrija


On 3 July 2014 13:12, Wido den Hollander  wrote:

> On 07/03/2014 10:59 AM, Andrija Panic wrote:
>
>> Hi Wido, thanks for answers - I have mons and OSD on each host...
>> server1: mon + 2 OSDs, same for server2 and server3.
>>
>> Any Proposed upgrade path, or just start with 1 server and move along to
>> others ?
>>
>>
> Upgrade the packages, but don't restart the daemons yet, then:
>
> 1. Restart the mon leader
> 2. Restart the two other mons
> 3. Restart all the OSDs one by one
>
> I suggest that you wait for the cluster to become fully healthy again
> before restarting the next OSD.
>
> Wido
>
>  Thanks again.
>> Andrija
>>
>>
>> On 2 July 2014 16:34, Wido den Hollander > > wrote:
>>
>> On 07/02/2014 04:08 PM, Andrija Panic wrote:
>>
>> Hi,
>>
>> I have existing CEPH cluster of 3 nodes, versions 0.72.2
>>
>> I'm in a process of installing CEPH on 4th node, but now CEPH
>> version is
>> 0.80.1
>>
>> Will this make problems running mixed CEPH versions ?
>>
>>
>> No, but the recommendation is not to have this running for a very
>> long period. Try to upgrade all nodes to the same version within a
>> reasonable amount of time.
>>
>>
>> I intend to upgrade CEPH on existing 3 nodes anyway ?
>> Recommended steps ?
>>
>>
>> Always upgrade the monitors first! Then to the OSDs one by one.
>>
>> Thanks
>>
>> --
>>
>> Andrija Panić
>>
>>
>> _
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> 
>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902 
>> Skype: contact42on
>> _
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> 
>>
>>
>>
>>
>> --
>>
>> Andrija Panić
>>
>
>
> --
> Wido den Hollander
> Ceph consultant and trainer
> 42on B.V.
>
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
>



-- 

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Wido,
one final question:
since I compiled libvirt 1.2.3 using ceph-devel 0.72 - do I need to
recompile libvirt again now with ceph-devel 0.80 ?

Perhaps not a smart question, but I need to make sure I don't screw something up...
Thanks for your time,
Andrija


On 3 July 2014 14:27, Andrija Panic  wrote:

> Thanks a lot Wido, will do...
>
> Andrija
>
>
> On 3 July 2014 13:12, Wido den Hollander  wrote:
>
>> On 07/03/2014 10:59 AM, Andrija Panic wrote:
>>
>>> Hi Wido, thanks for answers - I have mons and OSD on each host...
>>> server1: mon + 2 OSDs, same for server2 and server3.
>>>
>>> Any Proposed upgrade path, or just start with 1 server and move along to
>>> others ?
>>>
>>>
>> Upgrade the packages, but don't restart the daemons yet, then:
>>
>> 1. Restart the mon leader
>> 2. Restart the two other mons
>> 3. Restart all the OSDs one by one
>>
>> I suggest that you wait for the cluster to become fully healthy again
>> before restarting the next OSD.
>>
>> Wido
>>
>>  Thanks again.
>>> Andrija
>>>
>>>
>>> On 2 July 2014 16:34, Wido den Hollander >> > wrote:
>>>
>>> On 07/02/2014 04:08 PM, Andrija Panic wrote:
>>>
>>> Hi,
>>>
>>> I have existing CEPH cluster of 3 nodes, versions 0.72.2
>>>
>>> I'm in a process of installing CEPH on 4th node, but now CEPH
>>> version is
>>> 0.80.1
>>>
>>> Will this make problems running mixed CEPH versions ?
>>>
>>>
>>> No, but the recommendation is not to have this running for a very
>>> long period. Try to upgrade all nodes to the same version within a
>>> reasonable amount of time.
>>>
>>>
>>> I intend to upgrade CEPH on existing 3 nodes anyway ?
>>> Recommended steps ?
>>>
>>>
>>> Always upgrade the monitors first! Then to the OSDs one by one.
>>>
>>> Thanks
>>>
>>> --
>>>
>>> Andrija Panić
>>>
>>>
>>> _
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> 
>>>
>>>
>>>
>>> --
>>> Wido den Hollander
>>> 42on B.V.
>>> Ceph trainer and consultant
>>>
>>> Phone: +31 (0)20 700 9902 
>>> Skype: contact42on
>>> _
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> 
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Andrija Panić
>>>
>>
>>
>> --
>> Wido den Hollander
>> Ceph consultant and trainer
>> 42on B.V.
>>
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>>
>
>
>
> --
>
> Andrija Panić
> --
>   http://admintweets.com
> --
>



-- 

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] write performance per disk

2014-07-03 Thread VELARTIS Philipp Dürhammer
Hi,

I have a Ceph cluster setup (45 SATA disks, journals on the same disks) and get only
450 MB/s sequential writes (the maximum when playing around with threads in rados
bench) with a replica count of 2.
That is about ~20 MB/s of writes per disk (which is also what I see in atop).
Theoretically, with replica 2 and journals on the data disks, it should be 45 x 100 MB/s
(SATA) / 2 (replica) / 2 (journal writes), which makes 1125 MB/s.
SATA disks in reality do 120 MB/s, so the theoretical figure should be even higher.

I would expect between 40-50 MB/s for each SATA disk.

Can somebody confirm that they can reach this speed with a setup with journals on
the SATA disks (with journals on SSD the speed should be ~100 MB/s per disk)?
Or does Ceph only give about ¼ of a disk's speed (and not the ½ expected
because of the journals)?


My setup is 3 servers, each with: 2 x 2.6 GHz Xeons, 128 GB RAM, 15 SATA disks for
Ceph (and SSDs for the system), 1 x 10 GbE for external traffic, 1 x 10 GbE for OSD traffic.
With reads I can saturate the network, but writes are far away from that. And I would
expect at least to saturate the 10 GbE with sequential writes as well.

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Wido den Hollander

On 07/03/2014 03:07 PM, Andrija Panic wrote:

Wido,
one final question:
since I compiled libvirt1.2.3 usinfg ceph-devel 0.72 - do I need to
recompile libvirt again now with ceph-devel 0.80 ?

Perhaps not smart question, but need to make sure I don't screw something...


No, no need to. The librados API didn't change in case you are using RBD 
storage pool support.


Otherwise it just talks to Qemu and that talks to librbd/librados.

Wido


Thanks for your time,
Andrija


On 3 July 2014 14:27, Andrija Panic mailto:andrija.pa...@gmail.com>> wrote:

Thanks a lot Wido, will do...

Andrija


On 3 July 2014 13:12, Wido den Hollander mailto:w...@42on.com>> wrote:

On 07/03/2014 10:59 AM, Andrija Panic wrote:

Hi Wido, thanks for answers - I have mons and OSD on each
host...
server1: mon + 2 OSDs, same for server2 and server3.

Any Proposed upgrade path, or just start with 1 server and
move along to
others ?


Upgrade the packages, but don't restart the daemons yet, then:

1. Restart the mon leader
2. Restart the two other mons
3. Restart all the OSDs one by one

I suggest that you wait for the cluster to become fully healthy
again before restarting the next OSD.

Wido

Thanks again.
Andrija


On 2 July 2014 16:34, Wido den Hollander mailto:w...@42on.com>
>> wrote:

 On 07/02/2014 04:08 PM, Andrija Panic wrote:

 Hi,

 I have existing CEPH cluster of 3 nodes, versions
0.72.2

 I'm in a process of installing CEPH on 4th node,
but now CEPH
 version is
 0.80.1

 Will this make problems running mixed CEPH versions ?


 No, but the recommendation is not to have this running
for a very
 long period. Try to upgrade all nodes to the same
version within a
 reasonable amount of time.


 I intend to upgrade CEPH on existing 3 nodes anyway ?
 Recommended steps ?


 Always upgrade the monitors first! Then to the OSDs one
by one.

 Thanks

 --

 Andrija Panić


 ___
 ceph-users mailing list
ceph-users@lists.ceph.com 
>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



>



 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902


 Skype: contact42on
 ___
 ceph-users mailing list
ceph-users@lists.ceph.com 
>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



>




--

Andrija Panić



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.


Phone: +31 (0)20 700 9902 
Skype: contact42on




--

Andrija Panić
--
http://admintweets.com
--




--

Andrija Panić
--
http://admintweets.com
--



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Thanks again a lot.


On 3 July 2014 15:20, Wido den Hollander  wrote:

> On 07/03/2014 03:07 PM, Andrija Panic wrote:
>
>> Wido,
>> one final question:
>> since I compiled libvirt1.2.3 usinfg ceph-devel 0.72 - do I need to
>> recompile libvirt again now with ceph-devel 0.80 ?
>>
>> Perhaps not smart question, but need to make sure I don't screw
>> something...
>>
>
> No, no need to. The librados API didn't change in case you are using RBD
> storage pool support.
>
> Otherwise it just talks to Qemu and that talks to librbd/librados.
>
> Wido
>
>  Thanks for your time,
>> Andrija
>>
>>
>> On 3 July 2014 14:27, Andrija Panic > > wrote:
>>
>> Thanks a lot Wido, will do...
>>
>> Andrija
>>
>>
>> On 3 July 2014 13:12, Wido den Hollander > > wrote:
>>
>> On 07/03/2014 10:59 AM, Andrija Panic wrote:
>>
>> Hi Wido, thanks for answers - I have mons and OSD on each
>> host...
>> server1: mon + 2 OSDs, same for server2 and server3.
>>
>> Any Proposed upgrade path, or just start with 1 server and
>> move along to
>> others ?
>>
>>
>> Upgrade the packages, but don't restart the daemons yet, then:
>>
>> 1. Restart the mon leader
>> 2. Restart the two other mons
>> 3. Restart all the OSDs one by one
>>
>> I suggest that you wait for the cluster to become fully healthy
>> again before restarting the next OSD.
>>
>> Wido
>>
>> Thanks again.
>> Andrija
>>
>>
>> On 2 July 2014 16:34, Wido den Hollander > 
>> >> wrote:
>>
>>  On 07/02/2014 04:08 PM, Andrija Panic wrote:
>>
>>  Hi,
>>
>>  I have existing CEPH cluster of 3 nodes, versions
>> 0.72.2
>>
>>  I'm in a process of installing CEPH on 4th node,
>> but now CEPH
>>  version is
>>  0.80.1
>>
>>  Will this make problems running mixed CEPH versions ?
>>
>>
>>  No, but the recommendation is not to have this running
>> for a very
>>  long period. Try to upgrade all nodes to the same
>> version within a
>>  reasonable amount of time.
>>
>>
>>  I intend to upgrade CEPH on existing 3 nodes anyway ?
>>  Recommended steps ?
>>
>>
>>  Always upgrade the monitors first! Then to the OSDs one
>> by one.
>>
>>  Thanks
>>
>>  --
>>
>>  Andrija Panić
>>
>>
>>  ___
>>
>>  ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> > >
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> > >
>>
>>
>>
>>  --
>>  Wido den Hollander
>>  42on B.V.
>>  Ceph trainer and consultant
>>
>>  Phone: +31 (0)20 700 9902
>> 
>> 
>>  Skype: contact42on
>>  ___
>>
>>  ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> > >
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> > >
>>
>>
>>
>>
>> --
>>
>> Andrija Panić
>>
>>
>>
>> --
>> Wido den Hollander
>> Ceph consultant and trainer
>> 42on B.V.
>>
>>
>> Phone: +31 (0)20 700 9902 
>>
>> Skype: contact42on
>>
>>
>>
>>
>> --
>>
>> Andrija Panić
>> --
>> http://admintweets.com
>> --
>>
>>
>>
>>
>> --
>>
>> Andrija Panić
>> --
>> http://admintweets.com
>> --
>>
>
>
> --
> Wido den Hollander
> Ceph consultant and trainer
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on

Re: [ceph-users] write performance per disk

2014-07-03 Thread Wido den Hollander

On 07/03/2014 03:11 PM, VELARTIS Philipp Dürhammer wrote:

Hi,

I have a ceph cluster setup (with 45 sata disk journal on disks) and get
only 450mb/sec writes seq (maximum playing around with threads in rados
bench) with replica of 2



How many threads?


Which is about ~20Mb writes per disk (what y see in atop also)
theoretically with replica2 and having journals on disk should be 45 X
100mb (sata) / 2 (replica) / 2 (journal writes) which makes it 1125
satas in reality have 120mb/sec so the theoretical output should be more.

I would expect to have between 40-50mb/sec for each sata disk

Can somebody confirm that he can reach this speed with a setup with
journals on the satas (with journals on ssd speed should be 100mb per disk)?
or does ceph only give about ¼ of the speed for a disk? (and not the ½
as expected because of journals)



Did you verify how much each machine is doing? It could be that the data 
is not distributed evenly and that on a certain machine the drives are 
doing 50MB/sec.



My setup is 3 servers with: 2 x 2.6ghz xeons, 128gb ram 15 satas for
ceph (and ssds for system) 1 x 10gig for external traffic, 1 x 10gig for
osd traffic
with reads I can saturate the network but writes is far away. And I
would expect at least to saturate the 10gig with sequential writes also



Should be possible, but with 3 servers the data distribution might not 
be optimal causing a lower write performance.


I've seen 10Gbit write performance on multiple clusters without any 
problems.



Thank you



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] what is the difference between snapshot and clone in theory?

2014-07-03 Thread yalogr
Hi, all,

What is the difference between a snapshot and a clone, in theory?
 
 
 
thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] write performance per disk

2014-07-03 Thread VELARTIS Philipp Dürhammer
Hi,

Ceph.conf:
   osd journal size = 15360
   rbd cache = true
rbd cache size = 2147483648
rbd cache max dirty = 1073741824
rbd cache max dirty age = 100
osd recovery max active = 1
 osd max backfills = 1
 osd mkfs options xfs = "-f -i size=2048"
 osd mount options xfs = 
"rw,noatime,nobarrier,logbsize=256k,logbufs=8,inode64,allocsize=4M"
 osd op threads = 8

so it should be 8 threads?
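
For what it's worth, "osd op threads" sets the size of the OSD's server-side worker pool, while the thread count Wido asked about is the client-side concurrency of rados bench, set with its -t option. A hedged example of varying it (pool name, block size and values are placeholders, not taken from this thread):

  rados bench -p test 60 write -t 16 -b 4194304 --no-cleanup   # 16 concurrent 4 MB writes for 60 s
  rados bench -p test 60 write -t 64 -b 4194304 --no-cleanup   # same test with higher client concurrency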

All 3 machines have more or less the same disk load at the same time,
and so do the individual disks:
sdb  35.5687.10  6849.09 617310   48540806
sdc  26.7572.62  5148.58 514701   36488992
sdd  35.1553.48  6802.57 378993   48211141
sde  31.0479.04  6208.48 560141   44000710
sdf  32.7938.35  6238.28 271805   44211891
sdg  31.6777.84  5987.45 551680   42434167
sdh  32.9551.29  6315.76 363533   44761001
sdi  31.6756.93  5956.29 403478   42213336
sdj  35.8377.82  6929.31 551501   49109354
sdk  36.8673.84  7291.00 523345   51672704
sdl  36.02   112.90  7040.47 800177   49897132
sdm  33.2538.02  6455.05 269446   45748178
sdn  33.5239.10  6645.19 277101   47095696
sdo  33.2646.22  6388.20 327541   45274394
sdp  33.3874.12  6480.62 525325   45929369


The question is: is it poor performance to get a maximum of 500 MB/s of writes with 45
disks and replica 2, or is this what I should expect?


-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido
den Hollander
Sent: Thursday, 03 July 2014 15:22
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] write performance per disk

On 07/03/2014 03:11 PM, VELARTIS Philipp Dürhammer wrote:
> Hi,
>
> I have a ceph cluster setup (with 45 sata disk journal on disks) and 
> get only 450mb/sec writes seq (maximum playing around with threads in 
> rados
> bench) with replica of 2
>

How many threads?

> Which is about ~20Mb writes per disk (what y see in atop also) 
> theoretically with replica2 and having journals on disk should be 45 X 
> 100mb (sata) / 2 (replica) / 2 (journal writes) which makes it 1125 
> satas in reality have 120mb/sec so the theoretical output should be more.
>
> I would expect to have between 40-50mb/sec for each sata disk
>
> Can somebody confirm that he can reach this speed with a setup with 
> journals on the satas (with journals on ssd speed should be 100mb per disk)?
> or does ceph only give about ¼ of the speed for a disk? (and not the ½ 
> as expected because of journals)
>

Did you verify how much each machine is doing? It could be that the data is not 
distributed evenly and that on a certain machine the drives are doing 50MB/sec.

> My setup is 3 servers with: 2 x 2.6ghz xeons, 128gb ram 15 satas for 
> ceph (and ssds for system) 1 x 10gig for external traffic, 1 x 10gig 
> for osd traffic with reads I can saturate the network but writes is 
> far away. And I would expect at least to saturate the 10gig with 
> sequential writes also
>

Should be possible, but with 3 servers the data distribution might not be 
optimal causing a lower write performance.

I've seen 10Gbit write performance on multiple clusters without any problems.

> Thank you
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] why lock th whole osd handle thread

2014-07-03 Thread baijia...@126.com
When I look at the function "OSD::OpWQ::_process", I see that the PG lock is held for
the whole function. So when I use multiple threads to write to the same object, must
they be serialized all the way from the OSD handling thread to the journal write thread?



baijia...@126.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multipart upload on ceph 0.8 doesn't work?

2014-07-03 Thread Patrycja Szabłowska
Hi,

I'm trying to make multi part upload work. I'm using ceph
0.80-702-g9bac31b (from the ceph's github).

I've tried the code provided by Mark Kirkwood here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/034940.html


But unfortunately, it gives me the error:

(multitest)pszablow@pat-desktop:~/$ python boto_multi.py
  begin upload of abc.yuv
  size 746496, 7 parts
Traceback (most recent call last):
  File "boto_multi.py", line 36, in 
part = bucket.initiate_multipart_upload(objname)
  File 
"/home/pszablow/venvs/multitest/local/lib/python2.7/site-packages/boto/s3/bucket.py",
line 1742, in initiate_multipart_upload
response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
AccessDenied


The single part upload works for me. I am able to create buckets and objects.
I've tried also other similar examples, but none of them works.


Any ideas what's wrong? Does the ceph's multi part upload actually
work for anybody?


Thanks,

Patrycja Szabłowska
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multipart upload on ceph 0.8 doesn't work?

2014-07-03 Thread Luis Periquito
I was looking at this issue this morning. It seems radosgw requires you to have a
pool named '' to work with multipart. I just created a pool with that name:
rados mkpool ''

Either that, or allow the pool to be created by radosgw...
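
A hedged sketch of those two workarounds; the pool name is left as a placeholder since it is not given above, and client.radosgw.gateway is only the conventional key name from the docs, not confirmed for this setup:

  # option 1: create the pool radosgw is missing by hand
  rados lspools                                   # see which rgw pools already exist
  ceph osd pool create <missing-pool-name> 8 8    # take the exact name from the radosgw log
  # option 2: give the radosgw key mon caps that allow it to create pools itself
  ceph auth caps client.radosgw.gateway mon 'allow rwx' osd 'allow rwx'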


On 3 July 2014 16:27, Patrycja Szabłowska 
wrote:

> Hi,
>
> I'm trying to make multi part upload work. I'm using ceph
> 0.80-702-g9bac31b (from the ceph's github).
>
> I've tried the code provided by Mark Kirkwood here:
>
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/034940.html
>
>
> But unfortunately, it gives me the error:
>
> (multitest)pszablow@pat-desktop:~/$ python boto_multi.py
>   begin upload of abc.yuv
>   size 746496, 7 parts
> Traceback (most recent call last):
>   File "boto_multi.py", line 36, in 
> part = bucket.initiate_multipart_upload(objname)
>   File
> "/home/pszablow/venvs/multitest/local/lib/python2.7/site-packages/boto/s3/bucket.py",
> line 1742, in initiate_multipart_upload
> response.status, response.reason, body)
> boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
>  encoding="UTF-8"?>AccessDenied
>
>
> The single part upload works for me. I am able to create buckets and
> objects.
> I've tried also other similar examples, but none of them works.
>
>
> Any ideas what's wrong? Does the ceph's multi part upload actually
> work for anybody?
>
>
> Thanks,
>
> Patrycja Szabłowska
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Luis Periquito

Unix Engineer

Ocado.com 

Head Office, Titan Court, 3 Bishop Square, Hatfield Business Park,
Hatfield, Herts AL10 9NE

-- 


Notice:  This email is confidential and may contain copyright material of 
members of the Ocado Group. Opinions and views expressed in this message 
may not necessarily reflect the opinions and views of the members of the 
Ocado Group.

If you are not the intended recipient, please notify us immediately and 
delete all copies of this message. Please note that it is your 
responsibility to scan this message for viruses.  

References to the “Ocado Group” are to Ocado Group plc (registered in 
England and Wales with number 7098618) and its subsidiary undertakings (as 
that expression is defined in the Companies Act 2006) from time to time.  
The registered office of Ocado Group plc is Titan Court, 3 Bishops Square, 
Hatfield Business Park, Hatfield, Herts. AL10 9NE.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Pierre BLONDEAU

On 03/07/2014 13:49, Joao Eduardo Luis wrote:

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

On 03/07/2014 00:55, Samuel Just wrote:

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


The only thing that comes to mind that could cause this is if we changed
the leader's in-memory map, proposed it, it failed, and only the leader
got to write the map to disk somehow.  This happened once on a totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect osdmap to
whoever asked osdmaps from it, the remaining quorum would serve the
correct osdmaps to all the others.  This could cause this divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have happened
in?


Exactly which timeframe do you want? I have 7 days of logs; I should have
information about the upgrade from firefly to 0.82.

Which mon's log do you want? All three?

Regards


   -Joao



Pierre: do you recall how and when that got set?


I am not sure to understand, but if I good remember after the update in
firefly, I was in state : HEALTH_WARN crush map has legacy tunables and
I see "feature set mismatch" in log.

So if I good remeber, i do : ceph osd crush tunables optimal for the
problem of "crush map" and I update my client and server kernel to
3.16rc.

It's could be that ?

Pierre


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just 
wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
 wrote:

The files

When I upgrade :
  ceph-deploy install --stable firefly servers...
  on each servers service ceph restart mon
  on each servers service ceph restart osd
  on each servers service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace,
etc ... I
have some PG which pass in peering state.

I thought why not try the version 0.82, it could solve my problem. (
It's my mistake ). So, I upgrade from firefly to 0.83 with :
  ceph-deploy install --testing servers...
  ..

Now, all programs are in version 0.82.
I have 3 mons, 36 OSD and 3 mds.

Pierre

PS : I find also "inc\uosdmap.13258__0_469271DE__none" on each meta
directory.

On 03/07/2014 00:10, Samuel Just wrote:


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just 
wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something
like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).

What version of ceph is running on your mons?  How many mons do
you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 wrote:


Hi,

I do it, the log files are available here :
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD's files are really big +/- 80M .

After starting the osd.20 some other osd crash. I pass from 31
osd up to
16.
I remark that after this the number of down+peering PG decrease
from 367
to
248. It's "normal" ? May be it's temporary, the time that the
cluster
verifies all the PG ?

Regards
Pierre

On 02/07/2014 19:16, Samuel Just wrote:


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 wrote:



Yes, but how i do that ?

With a command like that ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modify the /etc/ceph/ceph.conf ? This file is really poor
because I
use
udev detection.

When I have made these changes, you want the three log files or
only
osd.20's ?

Thank you so much for the help

Regards
Pierre

On 01/07/2014 23:51, Samuel Just wrote:


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 wrote:




Hi,

I join :
 - osd.20 is one of osd that I detect which makes crash
other
OSD.
 - osd.23 is one of osd which crash when i start osd.20
 - mds, is one of my MDS

I cut log file because they are to big but. All is here :
https://blondeau.users.greyc.fr/cephlog/

Regards

On 30/06/2014 17:35, Gregory Farnum wrote:


W

Re: [ceph-users] Performance is really bad when I run from vstart.sh

2014-07-03 Thread Zhe Zhang
That makes sense. Thank you!

Zhe

From: David Zafman [mailto:david.zaf...@inktank.com]
Sent: Wednesday, July 02, 2014 9:46 PM
To: Zhe Zhang
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance is really bad when I run from vstart.sh


By default the vstart.sh setup would put all data below a directory called 
"dev" in the source tree.  In that case you're using a single spindle.  The 
vstart script isn't intended for performance testing.

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com
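
As a rough illustration of David's point (not taken from his mail), a vstart.sh dev cluster is typically started from the source tree like this, and all of its mon/OSD data ends up under ./dev on whatever single disk holds the tree; the daemon counts and flags here are just an example:

  cd src
  MON=1 OSD=3 ./vstart.sh -n -x -l    # new local cluster: 1 mon, 3 OSDs, cephx on, bind to localhost
  ./ceph -s                           # check status; data lives under ./dev, logs under ./out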

On Jul 2, 2014, at 5:48 PM, Zhe Zhang 
mailto:zhe_zh...@symantec.com>> wrote:


Hi folks,

I run Ceph on a single node which contains 25 hard drives, each at 7200 RPM. When I
write raw data to the array it achieves 2 GB/s, so I presumed the performance
of Ceph could go beyond 1 GB/s. But when I compile the Ceph code and run in
development mode with vstart.sh, the average throughput is only 200 MB/s for
rados bench writes.
I suspected it was due to debug mode when configuring the source code, and
I disabled debugging with ./configure CFLAGS='-O3' CXXFLAGS='O3' (avoiding the '-g'
flag). But it did not help at all.
I switched to the repository packages and installed Ceph with ceph-deploy, and the
performance reached 800 MB/s. Since I did not fully succeed in setting up Ceph
with ceph-deploy, and there are still some PGs in the "creating+incomplete" state, I
guess this could impact the performance.
Anyway, could someone give me some suggestions? Why is it so slow when I run
from vstart.sh?

Best,
Zhe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread Gregory Farnum
It looks like you're just putting in data faster than your cluster can
handle (in terms of IOPS).
The first big hole (queue_op_wq->reached_pg) is it sitting in a queue
and waiting for processing. The second parallel blocks are
1) write_thread_in_journal_buffer->journaled_completion_queued, and
that is again a queue while it's waiting to be written to disk,
2) waiting for subops from [19,9]->sub_op_commit_received(x2) is
waiting for the replica OSDs to write the transaction to disk.

You might be able to tune it a little, but right now bucket indices
live in one object, so every write has to touch the same set of OSDs
(twice! to mark an object as "putting", and "put"). 2 * 30,000 / 360 ≈ 166
index writes per second, which is probably past what those disks can do,
and artificially increasing the latency.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com  wrote:
> hi, everyone
>
> when I user rest bench testing RGW with cmd : rest-bench --access-key=ak
> --secret=sk  --bucket=bucket --seconds=360 -t 200  -b 524288  --no-cleanup
> write
>
> I found when RGW call the method "bucket_prepare_op " is very slow. so I
> observed from 'dump_historic_ops',to see:
> { "description": "osd_op(client.4211.0:265984 .dir.default.4148.1 [call
> rgw.bucket_prepare_op] 3.b168f3d0 e37)",
>   "received_at": "2014-07-03 11:07:02.465700",
>   "age": "308.315230",
>   "duration": "3.401743",
>   "type_data": [
> "commit sent; apply or cleanup",
> { "client": "client.4211",
>   "tid": 265984},
> [
> { "time": "2014-07-03 11:07:02.465852",
>   "event": "waiting_for_osdmap"},
> { "time": "2014-07-03 11:07:02.465875",
>   "event": "queue op_wq"},
> { "time": "2014-07-03 11:07:03.729087",
>   "event": "reached_pg"},
> { "time": "2014-07-03 11:07:03.729120",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.729126",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.804366",
>   "event": "waiting for subops from [19,9]"},
> { "time": "2014-07-03 11:07:03.804431",
>   "event": "commit_queued_for_journal_write"},
> { "time": "2014-07-03 11:07:03.804509",
>   "event": "write_thread_in_journal_buffer"},
> { "time": "2014-07-03 11:07:03.934419",
>   "event": "journaled_completion_queued"},
> { "time": "2014-07-03 11:07:05.297282",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.297319",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.311217",
>   "event": "op_applied"},
> { "time": "2014-07-03 11:07:05.867384",
>   "event": "op_commit finish lock"},
> { "time": "2014-07-03 11:07:05.867385",
>   "event": "op_commit"},
> { "time": "2014-07-03 11:07:05.867424",
>   "event": "commit_sent"},
> { "time": "2014-07-03 11:07:05.867428",
>   "event": "op_commit finish"},
> { "time": "2014-07-03 11:07:05.867443",
>   "event": "done"}]]}]}
>
> so I find 2 performance degradation. one is from "queue op_wq" to
> "reached_pg" , anothor is from "journaled_completion_queued" to "op_commit".
> and I must stess that there are so many ops write to one bucket object, so
> how to reduce Latency ?
>
>
> 
> baijia...@126.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pools do not respond

2014-07-03 Thread Gregory Farnum
The PG in question isn't being properly mapped to any OSDs. There's a
good chance that those trees (with 3 OSDs in 2 hosts) aren't going to
map well anyway, but the immediate problem should resolve itself if
you change the "choose" to "chooseleaf" in your rules.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
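
For reference, here is what one of the rules from the post below would look like with Greg's change applied, plus a crushtool check that the rule now produces mappings; this mirrors the poster's rule as a sketch, not a verified drop-in config:

  rule 4x1GbFCnlSAS {
          ruleset 2
          type replicated
          min_size 1
          max_size 10
          step take 4x1GbFCnlSAS
          step chooseleaf firstn 0 type host
          step emit
  }

  # test the compiled map before injecting it into the cluster:
  crushtool -i crushmap.compiled --test --rule 2 --num-rep 2 --show-mappings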


On Thu, Jul 3, 2014 at 4:17 AM, Iban Cabrillo  wrote:
> Hi folk,
>   I am following step by step the test intallation, and checking some
> configuration before try to deploy a production cluster.
>
>   Now I have a Health cluster with 3 mons + 4 OSDs.
>   I have created a pool with belonging all osd.x and two more one for two
> servers o the other for the other two.
>
>   The general pool work fine (I can create images and mount it on remote
> machines).
>
>   But the other two does not work (the commands rados put, or rbd ls "pool"
> hangs for ever).
>
>   this is the tree:
>
>[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
> # id weight type name up/down reweight
> -7 5.4 root 4x1GbFCnlSAS
> -3 2.7 host node04
> 1 2.7 osd.1 up 1
> -4 2.7 host node03
> 2 2.7 osd.2 up 1
> -6 8.1 root 4x4GbFCnlSAS
> -5 5.4 host node01
> 3 2.7 osd.3 up 1
> 4 2.7 osd.4 up 1
> -2 2.7 host node04
> 0 2.7 osd.0 up 1
> -1 13.5 root default
> -2 2.7 host node04
> 0 2.7 osd.0 up 1
> -3 2.7 host node04
> 1 2.7 osd.1 up 1
> -4 2.7 host node03
> 2 2.7 osd.2 up 1
> -5 5.4 host node01
> 3 2.7 osd.3 up 1
> 4 2.7 osd.4 up 1
>
>
> And this is the crushmap:
>
> ...
> root 4x4GbFCnlSAS {
> id -6 #do not change unnecessarily
> alg straw
> hash 0  # rjenkins1
> item node01 weight 5.400
> item node04 weight 2.700
> }
> root 4x1GbFCnlSAS {
> id -7 #do not change unnecessarily
> alg straw
> hash 0  # rjenkins1
> item node04 weight 2.700
> item node03 weight 2.700
> }
> # rules
> rule 4x4GbFCnlSAS {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take 4x4GbFCnlSAS
> step choose firstn 0 type host
> step emit
> }
> rule 4x1GbFCnlSAS {
> ruleset 2
> type replicated
> min_size 1
> max_size 10
> step take 4x1GbFCnlSAS
> step choose firstn 0 type host
> step emit
> }
> ..
> I of course set the crush_rules:
> sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
> sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1
>
> but seems that are something wrong (4x4GbFCnlSAS.pool is 512MB file):
>sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object
> 4x4GbFCnlSAS.pool
> !!HANGS for eve!
>
> from the ceph-client happen the same
>  rbd ls cloud-4x1GbFCnlSAS
>  !!HANGS for eve!
>
>
> [root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS
> 4x1GbFCnlSAS.object
> osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' -> pg
> 3.114ae7a9 (3.29) -> up ([], p-1) acting ([], p-1)
>
> Any idea what i am doing wrong??
>
> Thanks in advance, I
> Bertrand Russell:
> "El problema con el mundo es que los estúpidos están seguros de todo y los
> inteligentes están llenos de dudas"
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)

2014-07-03 Thread Gregory Farnum
On Wed, Jul 2, 2014 at 3:06 PM, Marc  wrote:
> Hi,
>
> I was wondering, having a cache pool in front of an RBD pool is all fine
> and dandy, but imagine you want to pull backups of all your VMs (or one
> of them, or multiple...). Going to the cache for all those reads isn't
> only pointless, it'll also potentially fill up the cache and possibly
> evict actually frequently used data. Which got me thinking... wouldn't
> it be nifty if there was a special way of doing specific backup reads
> where you'd bypass the cache, ensuring the dirty cache contents get
> written to cold pool first? Or at least doing special reads where a
> cache-miss won't actually cache the requested data?

Yeah, these are nifty features but the cache coherency implications
are a bit difficult. More options will come as we are able to develop
and (more importantly, by far) validate them.
-Greg

>
> AFAIK the backup routine for an RBD-backed KVM usually involves creating
> a snapshot of the RBD and putting that into a backup storage/tape, all
> done via librbd/API.
>
> Maybe something like that even already exists?
>
>
> KR,
> Marc
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why lock th whole osd handle thread

2014-07-03 Thread Gregory Farnum
On Thu, Jul 3, 2014 at 8:24 AM, baijia...@126.com  wrote:
> when I see the function "OSD::OpWQ::_process ". I find pg lock locks the
> whole function. so when I  use multi-thread write the same object , so are
> they must
> serialize from osd handle thread to journal write thread ?

It's serialized while processing the write, but that doesn't include
the wait time for the data to be placed on disk — merely sequencing it
and feeding it into the journal queue. Writes have to be ordered, so
that's not likely to change.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread baijia...@126.com
I find that the function "OSD::OpWQ::_process" holds the PG lock for the whole
function, so this means the OSD threads can't handle other ops that write to the same
object in parallel.
By adding logging to ReplicatedPG::op_commit, I find that taking the PG lock sometimes
costs a long time, but I don't know where the PG is being locked.
Where is the PG locked for such a long time?

thanks



baijia...@126.com

From: Gregory Farnum
Date: 2014-07-04 01:02
To: baijia...@126.com
CC: ceph-users
Subject: Re: [ceph-users] RGW performance test , put 30 thousands objects to 
one bucket, average latency 3 seconds
It looks like you're just putting in data faster than your cluster can
handle (in terms of IOPS).
The first big hole (queue_op_wq->reached_pg) is it sitting in a queue
and waiting for processing. The second parallel blocks are
1) write_thread_in_journal_buffer->journaled_completion_queued, and
that is again a queue while it's waiting to be written to disk,
2) waiting for subops from [19,9]->sub_op_commit_received(x2) is
waiting for the replica OSDs to write the transaction to disk.

You might be able to tune it a little, but right now bucket indices
live in one object, so every write has to touch the same set of OSDs
(twice! to mark an object as "putting", and "put"). 2*3/360 = 166,
which is probably past what those disks can do, and artificially
increasing the latency.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com  wrote:
> hi, everyone
>
> when I user rest bench testing RGW with cmd : rest-bench --access-key=ak
> --secret=sk  --bucket=bucket --seconds=360 -t 200  -b 524288  --no-cleanup
> write
>
> I found when RGW call the method "bucket_prepare_op " is very slow. so I
> observed from 'dump_historic_ops',to see:
> { "description": "osd_op(client.4211.0:265984 .dir.default.4148.1 [call
> rgw.bucket_prepare_op] 3.b168f3d0 e37)",
>   "received_at": "2014-07-03 11:07:02.465700",
>   "age": "308.315230",
>   "duration": "3.401743",
>   "type_data": [
> "commit sent; apply or cleanup",
> { "client": "client.4211",
>   "tid": 265984},
> [
> { "time": "2014-07-03 11:07:02.465852",
>   "event": "waiting_for_osdmap"},
> { "time": "2014-07-03 11:07:02.465875",
>   "event": "queue op_wq"},
> { "time": "2014-07-03 11:07:03.729087",
>   "event": "reached_pg"},
> { "time": "2014-07-03 11:07:03.729120",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.729126",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.804366",
>   "event": "waiting for subops from [19,9]"},
> { "time": "2014-07-03 11:07:03.804431",
>   "event": "commit_queued_for_journal_write"},
> { "time": "2014-07-03 11:07:03.804509",
>   "event": "write_thread_in_journal_buffer"},
> { "time": "2014-07-03 11:07:03.934419",
>   "event": "journaled_completion_queued"},
> { "time": "2014-07-03 11:07:05.297282",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.297319",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.311217",
>   "event": "op_applied"},
> { "time": "2014-07-03 11:07:05.867384",
>   "event": "op_commit finish lock"},
> { "time": "2014-07-03 11:07:05.867385",
>   "event": "op_commit"},
> { "time": "2014-07-03 11:07:05.867424",
>   "event": "commit_sent"},
> { "time": "2014-07-03 11:07:05.867428",
>   "event": "op_commit finish"},
> { "time": "2014-07-03 11:07:05.867443",
>   "event": "done"}]]}]}
>
> So I see 2 points of performance degradation: one is from "queue op_wq" to
> "reached_pg", another is from "journaled_completion_queued" to "op_commit".
> And I must stress that there are very many ops writing to one bucket object, so
> how can I reduce the latency?
>
>
> 
> baijia...@126.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread baijia...@126.com
I put the .rgw.buckets.index pool on SSD OSDs, so the bucket index objects must be
written to the SSDs, and disk utilization is less than 50%, so I don't think the disk
is the bottleneck.




baijia...@126.com
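(For reference, pointing the index pool at an SSD-backed CRUSH rule uses the same
command shown elsewhere in this digest for data pools; a sketch, assuming a ruleset
numbered 3 for the SSD OSDs, which is a hypothetical value:

  sudo ceph osd pool set .rgw.buckets.index crush_ruleset 3
  sudo ceph osd dump | grep rgw.buckets.index

The second command just confirms which crush_ruleset the pool is actually using.)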

From: baijia...@126.com
Date: 2014-07-04 01:29
To: Gregory Farnum
CC: ceph-users
Subject: Re: Re: [ceph-users] RGW performance test , put 30 thousands objects 
to one bucket, average latency 3 seconds
I find that the function "OSD::OpWQ::_process" takes the PG lock around the whole
function, so this means OSD threads can't handle ops that write to the same object
concurrently.
Though I added logging to ReplicatedPG::op_commit, I find the PG lock sometimes costs
a long time, but I don't know where the PG gets locked.
Where is the PG held locked for so long?

thanks



baijia...@126.com

From: Gregory Farnum
Date: 2014-07-04 01:02
To: baijia...@126.com
CC: ceph-users
Subject: Re: [ceph-users] RGW performance test , put 30 thousands objects to 
one bucket, average latency 3 seconds
It looks like you're just putting in data faster than your cluster can
handle (in terms of IOPS).
The first big hole (queue_op_wq->reached_pg) is it sitting in a queue
and waiting for processing. The second parallel blocks are
1) write_thread_in_journal_buffer->journaled_completion_queued, and
that is again a queue while it's waiting to be written to disk,
2) waiting for subops from [19,9]->sub_op_commit_received(x2) is
waiting for the replica OSDs to write the transaction to disk.

You might be able to tune it a little, but right now bucket indices
live in one object, so every write has to touch the same set of OSDs
(twice! to mark an object as "putting", and "put"). 2*30,000/360 ≈ 166 index writes/s,
which is probably past what those disks can do, and artificially
increasing the latency.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com  wrote:
> hi, everyone
>
> When I use rest-bench to test RGW with the command: rest-bench --access-key=ak
> --secret=sk  --bucket=bucket --seconds=360 -t 200  -b 524288  --no-cleanup
> write
>
> I found that when RGW calls the method "bucket_prepare_op" it is very slow, so I
> looked at 'dump_historic_ops' and see:
> { "description": "osd_op(client.4211.0:265984 .dir.default.4148.1 [call
> rgw.bucket_prepare_op] 3.b168f3d0 e37)",
>   "received_at": "2014-07-03 11:07:02.465700",
>   "age": "308.315230",
>   "duration": "3.401743",
>   "type_data": [
> "commit sent; apply or cleanup",
> { "client": "client.4211",
>   "tid": 265984},
> [
> { "time": "2014-07-03 11:07:02.465852",
>   "event": "waiting_for_osdmap"},
> { "time": "2014-07-03 11:07:02.465875",
>   "event": "queue op_wq"},
> { "time": "2014-07-03 11:07:03.729087",
>   "event": "reached_pg"},
> { "time": "2014-07-03 11:07:03.729120",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.729126",
>   "event": "started"},
> { "time": "2014-07-03 11:07:03.804366",
>   "event": "waiting for subops from [19,9]"},
> { "time": "2014-07-03 11:07:03.804431",
>   "event": "commit_queued_for_journal_write"},
> { "time": "2014-07-03 11:07:03.804509",
>   "event": "write_thread_in_journal_buffer"},
> { "time": "2014-07-03 11:07:03.934419",
>   "event": "journaled_completion_queued"},
> { "time": "2014-07-03 11:07:05.297282",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.297319",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-07-03 11:07:05.311217",
>   "event": "op_applied"},
> { "time": "2014-07-03 11:07:05.867384",
>   "event": "op_commit finish lock"},
> { "time": "2014-07-03 11:07:05.867385",
>   "event": "op_commit"},
> { "time": "2014-07-03 11:07:05.867424",
>   "event": "commit_sent"},
> { "time": "2014-07-03 11:07:05.867428",
>   "event": "op_commit finish"},
> { "time": "2014-07-03 11:07:05.867443",
>   "event": "done"}]]}]}
>
> So I see 2 points of performance degradation: one is from "queue op_wq" to
> "reached_pg", another is from "journaled_completion_queued" to "op_commit".
> And I must stress that there are very many ops writing to one bucket object, so
> how can I reduce the latency?
>
>
> 
> baijia...@126.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-user

Re: [ceph-users] Pools do not respond

2014-07-03 Thread Iban Cabrillo
Hi Gregory,
  Thanks a lot, I am beginning to understand how Ceph works.
  I added a couple of OSD servers and balanced the disks between them.

[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
# id    weight  type name             up/down reweight
-7      16.2    root 4x1GbFCnlSAS
-9      5.4       host node02
7       2.7         osd.7             up      1
8       2.7         osd.8             up      1
-4      5.4       host node03
2       2.7         osd.2             up      1
9       2.7         osd.9             up      1
-3      5.4       host node04
1       2.7         osd.1             up      1
10      2.7         osd.10            up      1
-6      16.2    root 4x4GbFCnlSAS
-5      5.4       host node01
3       2.7         osd.3             up      1
4       2.7         osd.4             up      1
-8      5.4       host node02
5       2.7         osd.5             up      1
6       2.7         osd.6             up      1
-2      5.4       host node04
0       2.7         osd.0             up      1
11      2.7         osd.11            up      1
-1      32.4    root default
-2      5.4       host node04
0       2.7         osd.0             up      1
11      2.7         osd.11            up      1
-3      5.4       host node04
1       2.7         osd.1             up      1
10      2.7         osd.10            up      1
-4      5.4       host node03
2       2.7         osd.2             up      1
9       2.7         osd.9             up      1
-5      5.4       host node01
3       2.7         osd.3             up      1
4       2.7         osd.4             up      1
-8      5.4       host node02
5       2.7         osd.5             up      1
6       2.7         osd.6             up      1
-9      5.4       host node02
7       2.7         osd.7             up      1
8       2.7         osd.8             up      1

The idea is to have at least 4 servers, with 3 disks (2.7 TB, SAN-attached) per
server, in each pool.
Now I have to adjust pg_num and pgp_num and run some performance tests (see the
example commands below).

PS: what is the difference between choose and chooseleaf?

Thanks a lot!
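(A rough sketch of the pg_num/pgp_num adjustment mentioned above, assuming the pool
names used earlier in this thread, 6 OSDs per pool, 3 replicas, and the usual
"OSDs * 100 / replicas, rounded up to a power of two" rule of thumb, i.e.
6 * 100 / 3 = 200, rounded to 256; pg_num should be raised before pgp_num:

  sudo ceph osd pool set cloud-4x1GbFCnlSAS pg_num 256
  sudo ceph osd pool set cloud-4x1GbFCnlSAS pgp_num 256
  sudo ceph osd pool set cloud-4x4GbFCnlSAS pg_num 256
  sudo ceph osd pool set cloud-4x4GbFCnlSAS pgp_num 256

The exact target is a judgment call; the point is only that pg_num can be increased
online with these commands.)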


2014-07-03 19:06 GMT+02:00 Gregory Farnum :

> The PG in question isn't being properly mapped to any OSDs. There's a
> good chance that those trees (with 3 OSDs in 2 hosts) aren't going to
> map well anyway, but the immediate problem should resolve itself if
> you change the "choose" to "chooseleaf" in your rules.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Thu, Jul 3, 2014 at 4:17 AM, Iban Cabrillo 
> wrote:
> > Hi folk,
> >   I am following the test installation step by step, and checking some
> > configuration before trying to deploy a production cluster.
> >   Now I have a healthy cluster with 3 mons + 4 OSDs.
> >   Now I have a Health cluster with 3 mons + 4 OSDs.
> >   I have created a pool containing all the osd.x, and two more: one for two of the
> > servers and the other for the other two.
> >
> >   The general pool works fine (I can create images and mount them on remote
> > machines).
> >
> >   But the other two do not work (the commands rados put, or rbd ls "pool",
> > hang forever).
> >
> >   this is the tree:
> >
> >[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
> > # id weight type name up/down reweight
> > -7 5.4 root 4x1GbFCnlSAS
> > -3 2.7 host node04
> > 1 2.7 osd.1 up 1
> > -4 2.7 host node03
> > 2 2.7 osd.2 up 1
> > -6 8.1 root 4x4GbFCnlSAS
> > -5 5.4 host node01
> > 3 2.7 osd.3 up 1
> > 4 2.7 osd.4 up 1
> > -2 2.7 host node04
> > 0 2.7 osd.0 up 1
> > -1 13.5 root default
> > -2 2.7 host node04
> > 0 2.7 osd.0 up 1
> > -3 2.7 host node04
> > 1 2.7 osd.1 up 1
> > -4 2.7 host node03
> > 2 2.7 osd.2 up 1
> > -5 5.4 host node01
> > 3 2.7 osd.3 up 1
> > 4 2.7 osd.4 up 1
> >
> >
> > And this is the crushmap:
> >
> > ...
> > root 4x4GbFCnlSAS {
> > id -6 #do not change unnecessarily
> > alg straw
> > hash 0  # rjenkins1
> > item node01 weight 5.400
> > item node04 weight 2.700
> > }
> > root 4x1GbFCnlSAS {
> > id -7 #do not change unnecessarily
> > alg straw
> > hash 0  # rjenkins1
> > item node04 weight 2.700
> > item node03 weight 2.700
> > }
> > # rules
> > rule 4x4GbFCnlSAS {
> > ruleset 1
> > type replicated
> > min_size 1
> > max_size 10
> > step take 4x4GbFCnlSAS
> > step choose firstn 0 type host
> > step emit
> > }
> > rule 4x1GbFCnlSAS {
> > ruleset 2
> > type replicated
> > min_size 1
> > max_size 10
> > step take 4x1GbFCnlSAS
> > step choose firstn 0 type host
> > step emit
> > }
> > ..
> > I of course set the crush_rules:
> > sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
> > sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1
> >
> > but it seems something is wrong (4x4GbFCnlSAS.pool is a 512MB file):
> >sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object
> > 4x4GbFCnlSAS.pool
> > !!HANGS forever!!
> >
> > From the ceph client the same thing happens:
> >  rbd ls cloud-4x1GbFCnlSAS
> >  !!HANGS forever!!
> >
> >
> > [root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS
> > 4x1GbFCnlSAS.object
> > osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' ->
> pg
> > 3.114ae7a

Re: [ceph-users] Pools do not respond

2014-07-03 Thread Gregory Farnum
On Thu, Jul 3, 2014 at 11:17 AM, Iban Cabrillo  wrote:
> Hi Gregory,
>   Thanks a lot, I am beginning to understand how Ceph works.
>   I added a couple of OSD servers and balanced the disks between them.
>
>
> [ceph@cephadm ceph-cloud]$ sudo ceph osd tree
> # id    weight  type name             up/down reweight
> -7      16.2    root 4x1GbFCnlSAS
> -9      5.4       host node02
> 7       2.7         osd.7             up      1
> 8       2.7         osd.8             up      1
> -4      5.4       host node03
> 2       2.7         osd.2             up      1
> 9       2.7         osd.9             up      1
> -3      5.4       host node04
> 1       2.7         osd.1             up      1
> 10      2.7         osd.10            up      1
> -6      16.2    root 4x4GbFCnlSAS
> -5      5.4       host node01
> 3       2.7         osd.3             up      1
> 4       2.7         osd.4             up      1
> -8      5.4       host node02
> 5       2.7         osd.5             up      1
> 6       2.7         osd.6             up      1
> -2      5.4       host node04
> 0       2.7         osd.0             up      1
> 11      2.7         osd.11            up      1
> -1      32.4    root default
> -2      5.4       host node04
> 0       2.7         osd.0             up      1
> 11      2.7         osd.11            up      1
> -3      5.4       host node04
> 1       2.7         osd.1             up      1
> 10      2.7         osd.10            up      1
> -4      5.4       host node03
> 2       2.7         osd.2             up      1
> 9       2.7         osd.9             up      1
> -5      5.4       host node01
> 3       2.7         osd.3             up      1
> 4       2.7         osd.4             up      1
> -8      5.4       host node02
> 5       2.7         osd.5             up      1
> 6       2.7         osd.6             up      1
> -9      5.4       host node02
> 7       2.7         osd.7             up      1
> 8       2.7         osd.8             up      1
>
> The idea is to have at least 4 servers, with 3 disks (2.7 TB, SAN-attached) per
> server, in each pool.
> Now I have to adjust pg_num and pgp_num and run some performance tests.
>
> PS: what is the difference between choose and chooseleaf?

"choose" instructs the system to choose N different buckets of the
given type (where N is specified by the "firstn 0" block to be the
replication level, but could be 1: "firstn 1", or replication - 1:
"firstn -1"). Since you're saying "choose firstn 0 type host", that's
what you're getting out, and then you're emitting those 3 (by default)
hosts. But they aren't valid "devices" (OSDs), so it's not a valid
mapping; you're supposed to then say "choose firstn 1 device" or
similar.
"chooseleaf" instead tells the system to choose N different buckets,
and then descend from each of those buckets to a leaf ("device") in
the CRUSH hierarchy. It's a little more robust against different
mappings and failure conditions, so generally a better choice than
"choose" if you don't need the finer granularity provided by choose.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
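(Concretely, the 4x1GbFCnlSAS rule quoted earlier in this thread would only need its
choose step changed; a sketch, keeping the same ruleset number and root as the
original crush map:

rule 4x1GbFCnlSAS {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take 4x1GbFCnlSAS
        step chooseleaf firstn 0 type host
        step emit
}

With chooseleaf, each selected host bucket is then descended to an OSD, so the rule
emits devices and produces a valid mapping.)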
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Joao Luis
Do those logs have a higher debugging level than the default? If not, never mind,
as they will not have enough information. If they do, however, we'd be interested
in the portion around the moment you set the tunables: say, from before the upgrade
until a bit after you set the tunable. If you want to be finer grained, then ideally
it would be the moment when those maps were created, but you'd have to grep the logs
for that.

Or drop the logs somewhere and I'll take a look.
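(If the existing logs were captured at default levels, the monitor debug level can be
raised for a future attempt; a minimal sketch, where mon.a is a placeholder for each
monitor id:

  ceph tell mon.a injectargs '--debug_mon 20 --debug_ms 1'

or set "debug mon = 20" and "debug ms = 1" under [mon] in ceph.conf and restart the
monitors.)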

  -Joao
On Jul 3, 2014 5:48 PM, "Pierre BLONDEAU" 
wrote:

> Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :
>
>> On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:
>>
>>> Le 03/07/2014 00:55, Samuel Just a écrit :
>>>
 Ah,

 ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
 /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
 /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
 ../ceph/src/osdmaptool: osdmap file
 'osd-20_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
 ../ceph/src/osdmaptool: osdmap file
 'osd-23_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
 6d5
 < tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

>>>
>> The only thing that comes to mind that could cause this is if we changed
>> the leader's in-memory map, proposed it, it failed, and only the leader
>> got to write the map to disk somehow.  This happened once on a totally
>> different issue (although I can't pinpoint right now which).
>>
>> In such a scenario, the leader would serve the incorrect osdmap to
>> whoever asked osdmaps from it, the remaining quorum would serve the
>> correct osdmaps to all the others.  This could cause this divergence. Or
>> it could be something else.
>>
>> Are there logs for the monitors for the timeframe this may have happened
>> in?
>>
>
> Exactly which timeframe do you want? I have 7 days of logs, so I should have
> information about the upgrade from firefly to 0.82.
> Which mon's logs do you want? All three?
>
> Regards
>
> -Joao
>>
>>
 Pierre: do you recall how and when that got set?

>>>
>>> I am not sure I understand, but if I remember correctly, after the update to
>>> firefly I was in the state HEALTH_WARN "crush map has legacy tunables", and
>>> I saw "feature set mismatch" in the logs.
>>>
>>> So, if I remember correctly, I ran "ceph osd crush tunables optimal" for the
>>> crush map problem, and I updated my client and server kernels to 3.16rc.
>>>
>>> Could that be it?
>>>
>>> Pierre
>>>
>>>  -Sam

 On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just 
 wrote:

> Yeah, divergent osdmaps:
> 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_
> 4E62BB79__none
> 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_
> 4E62BB79__none
>
> Joao: thoughts?
> -Sam
>
> On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
>  wrote:
>
>> The files
>>
>> When I upgraded:
>>   ceph-deploy install --stable firefly servers...
>>   on each server: service ceph restart mon
>>   on each server: service ceph restart osd
>>   on each server: service ceph restart mds
>>
>> I upgraded from emperor to firefly. After repair, remap, replace, etc ... I
>> had some PGs stuck in the peering state.
>>
>> I thought: why not try version 0.82, it could solve my problem (that was my
>> mistake). So I upgraded from firefly to 0.83 with:
>>   ceph-deploy install --testing servers...
>>   ..
>>
>> Now, all programs are in version 0.82.
>> I have 3 mons, 36 OSDs and 3 mds.
>>
>> Pierre
>>
>> PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta
>> directory.
>>
>> Le 03/07/2014 00:10, Samuel Just a écrit :
>>
>>  Also, what version did you upgrade from, and how did you upgrade?
>>> -Sam
>>>
>>> On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just 
>>> wrote:
>>>

 Ok, in current/meta on osd 20 and osd 23, please attach all files
 matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something
 like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do
 you have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
  wrote:

>
> Hi,
>
> I do it, the log files are available here :
> https://blondeau.users.greyc.fr/cephlog/debug20/
>
> The OSD's files are really big +/- 80M .
>
> After starting the osd.20 some other osd crash. I pass from 31
> osd up to
> 16.
> I remark that after t

[ceph-users] mon: leveldb checksum mismatch

2014-07-03 Thread Jason Harley
Hi list —

I’ve got a small dev. cluster: 3 OSD nodes with 6 disks/OSDs each and a single 
monitor (this, it seems, was my mistake).  The monitor node went down hard and 
it looks like the monitor’s db is in a funny state.  Running ‘ceph-mon’ 
manually with ‘debug_mon 20’ and ‘debug_ms 20’ gave the following:

> /usr/bin/ceph-mon -i monhost --mon-data /var/lib/ceph/mon/ceph-monhost 
> --debug_mon 20 --debug_ms 20 -d
> 2014-07-03 23:20:55.800512 7f973918e7c0  0 ceph version 0.67.7 
> (d7ab4244396b57aac8b7e80812115bbd079e6b73), process ceph-mon, pid 24930
> Corruption: checksum mismatch
> Corruption: checksum mismatch
> 2014-07-03 23:20:56.455797 7f973918e7c0 -1 failed to create new leveldb store

I attempted to make use of the leveldb Python library’s ‘RepairDB’ function, 
which just moves enough files into ‘lost’ that when running the monitor again 
I’m asked if I ran mkcephfs.

Any insight into resolving these two checksum mismatches so I can access my OSD 
data would be greatly appreciated.

Thanks,
./JRH

p.s. I’m assuming that without the maps from the monitor, my OSD data is 
unrecoverable also.

  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon: leveldb checksum mismatch

2014-07-03 Thread Joao Eduardo Luis

On 07/04/2014 12:29 AM, Jason Harley wrote:

Hi list —

I’ve got a small dev. cluster: 3 OSD nodes with 6 disks/OSDs each and a single 
monitor (this, it seems, was my mistake).  The monitor node went down hard and 
it looks like the monitor’s db is in a funny state.  Running ‘ceph-mon’ 
manually with ‘debug_mon 20’ and ‘debug_ms 20’ gave the following:


/usr/bin/ceph-mon -i monhost --mon-data /var/lib/ceph/mon/ceph-monhost 
--debug_mon 20 --debug_ms 20 -d
2014-07-03 23:20:55.800512 7f973918e7c0  0 ceph version 0.67.7 
(d7ab4244396b57aac8b7e80812115bbd079e6b73), process ceph-mon, pid 24930
Corruption: checksum mismatch
Corruption: checksum mismatch
2014-07-03 23:20:56.455797 7f973918e7c0 -1 failed to create new leveldb store


I attempted to make use of the leveldb Python library’s ‘RepairDB’ function, 
which just moves enough files into ‘lost’ that when running the monitor again 
I’m asked if I ran mkcephfs.

Any insight into resolving these two checksum mismatches so I can access my OSD 
data would be greatly appreciated.

Thanks,
./JRH

p.s. I’m assuming that without the maps from the monitor, my OSD data is 
unrecoverable also.


Hello Jason,

We don't have a way to repair leveldb.  Having multiple monitors usually
helps with such tricky situations.
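(For a future deployment, extra monitors can be added with the same tooling used
elsewhere in this digest; a rough sketch, assuming ceph-deploy is in use and node02
and node03 are hypothetical hostnames:

  ceph-deploy mon create node02 node03
  ceph quorum_status --format json-pretty

The second command confirms that all monitors have joined the quorum.)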


According to this [1], the Python bindings you're using may not be linked against
snappy, which we were (mistakenly, until recently) using to compress data as it goes
into leveldb.  Not having the snappy bindings may be what's causing all those files
to be moved to lost instead.
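(A quick way to check whether a given Python leveldb binding was built with snappy
support is to inspect its shared object; a sketch, where the path is an assumption
for the Ubuntu python-leveldb package and can be located with
"python -c 'import leveldb; print(leveldb.__file__)'" if it differs:

  ldd /usr/lib/python2.7/dist-packages/leveldb.so | grep -i snappy

No output would suggest the binding was built without snappy and so cannot read
snappy-compressed sst files correctly.)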


The suggestion that the thread in [1] offers is to have the repair 
functionality directly in the 'application' itself.  We could do this by 
adding a repair option to ceph-kvstore-tool -- which could help.


I'll be happy to get that into ceph-kvstore-tool tomorrow and push a 
branch for you to compile and test.


  -Joao


[1] - https://groups.google.com/forum/#!topic/leveldb/YvszWNio2-Q

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon: leveldb checksum mismatch

2014-07-03 Thread Jason Harley
Hi Joao,

On Jul 3, 2014, at 7:57 PM, Joao Eduardo Luis  wrote:

> We don't have a way to repair leveldb.  Having multiple monitors usually help 
> with such tricky situations.

I know this, but for this small dev cluster I wasn’t thinking about corruption 
of my mon’s backing store.  Silly me :)

> 
> According to this [1] the python bindings you're using may not be linked into 
> snappy, which we were using (mistakenly until recently) to compress data as 
> it goes into leveldb.  Not having those snappy bindings may be what's causing 
> all those files to be moved to lost instead.

I found the same posting, and confirmed that the 'leveldb.so' that ships with
the 'python-leveldb' package on Ubuntu 13.10 links against 'snappy'.

> The suggestion that the thread in [1] offers is to have the repair 
> functionality directly in the 'application' itself.  We could do this by 
> adding a repair option to ceph-kvstore-tool -- which could help.
> 
> I'll be happy to get that into ceph-kvstore-tool tomorrow and push a branch 
> for you to compile and test.

I would be more than happy to try this out.  Without fixing these checksums, I 
think I’m reinitializing my cluster. :\

Thank you,
./JRH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com