[ceph-users] Cross-posting to users and ceph-devel

2015-10-14 Thread Wido den Hollander
Hi,

Not to complain or flame about it, but I see a lot of messages being
sent to both users and ceph-devel.

IMHO that defeats the purpose of having separate users and devel lists,
doesn't it?

The problem is that messages go to both lists, users hit reply-all
again, and so it continues.

For example, the release announcements go to both devel and users; should
they? Shouldn't users, or even just announce, be enough?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Ilya Dryomov
On Wed, Oct 14, 2015 at 6:05 PM, Jan Schermer  wrote:
> But that's exactly what filesystems and their own journals do already :-)

They do it for filesystem "transactions", not ceph transactions.  It's
true that there is quite a bit of double journaling going on - newstore
should help with that quite a lot.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Jan Schermer
But that's exactly what filesystems and their own journals do already :-)

Jan

> On 14 Oct 2015, at 17:02, Somnath Roy  wrote:
> 
> Jan,
> Journal helps FileStore to maintain the transactional integrity in the event 
> of a crash. That's the main reason.
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
> Schermer
> Sent: Wednesday, October 14, 2015 2:28 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
> 
> Hi,
> I've been thinking about this for a while now - does Ceph really need a 
> journal? Filesystems are already pretty good at committing data to disk when 
> asked (and much faster too), and we have external journals in XFS and Ext4...
> In a scenario where a client does an ordinary write, there's no need to flush 
> it anywhere (the app didn't ask for it), so it ends up in the pagecache and 
> gets committed eventually.
> If a client asks for the data to be flushed, then fdatasync/fsync on the 
> filestore object takes care of that, including ordering and stuff.
> For reads, you just read from the filestore (no need to differentiate between 
> filestore/journal) - the pagecache gives you the right version already.
> 
> Or is the journal there to achieve some tiering for writes when running 
> spindles alongside SSDs? This is IMO the only thing ordinary filesystems don't 
> do out of the box even when the filesystem journal is put on an SSD - the data 
> gets flushed to the spindle whenever it is fsync-ed (even with data=journal). 
> But in reality, most of the data will hit the spindle either way, and when you 
> run with SSDs it will always be much slower. And even for tiering - there are 
> already many options (bcache, flashcache or even ZFS L2ARC) that are much more 
> performant and proven stable. I think the fact that people have a need to 
> combine Ceph with stuff like that already proves the point.
> 
> So a very interesting scenario would be to disable the Ceph journal and at 
> most use data=journal on ext4. The complexity of the data path would drop 
> significantly, latencies would decrease, CPU time would be saved...
> I just feel that Ceph has lots of unnecessary complexity inside that 
> duplicates what filesystems (and the pagecache...) have been doing for a while 
> now without eating most of our CPU cores - why don't we use that? Is it 
> possible to disable the journal completely?
> 
> Did I miss something that makes the journal essential?
> 
> Jan
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Jan Schermer
Can you elaborate on that? I don't think there needs to be a difference. Ceph 
is hosting mostly filesystems, so it's all just a bunch of filesystem 
transactions anyway...

Jan

> On 14 Oct 2015, at 18:14, Ilya Dryomov  wrote:
> 
> On Wed, Oct 14, 2015 at 6:05 PM, Jan Schermer  wrote:
>> But that's exactly what filesystems and their own journals do already :-)
> 
> They do it for filesystem "transactions", not ceph transactions.  It's
> true that there is quite a bit of double journaling going on - newstore
> should help with that quite a lot.
> 
> Thanks,
> 
>Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph PGs stuck creating after running force_create_pg

2015-10-14 Thread James Green
Hello,

We recently had 2 nodes go down in our Ceph cluster; one was repaired and the 
other had all 12 OSDs destroyed when it went down. We brought everything back 
online, but several PGs were showing as down+peering as well as down. After 
marking the failed OSDs as lost and removing them from the cluster, we now have 
around 90 PGs that are showing as incomplete. At this point we just want to get 
the cluster back up and into a healthy state. I tried recreating the PGs using 
force_create_pg, and now they are all stuck in creating.

PG dump shows 90 pgs all with the same output
2.182  0  0  0  0  0  0  0  0  creating  2015-10-14 10:31:28.832527  0'0  0:0  []  -1  []  -1  0'0  0.00  0'0  0.00

When I ran a pg query on one of the stuck groups, I noticed that one of the 
failed OSDs was listed under "down_osds_we_would_probe". I already removed 
that OSD from the cluster, and trying to mark it lost says the OSD does not 
exist.

Here is my crushmap http://pastebin.com/raw.php?i=vyk9vMT1

Why are the PGs trying to query OSDs that have been lost and removed from the 
cluster?
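
For reference, an easy way to pull every "down_osds_we_would_probe" entry out
of the 'ceph pg <pgid> query' JSON is a small script like the sketch below
(illustrative only, not a supported tool; the PG id is just an example):

#!/usr/bin/env python
# Sketch: report down_osds_we_would_probe for a list of stuck PGs by
# recursively searching the JSON that `ceph pg <pgid> query` prints.
import json
import subprocess

def find_key(node, key):
    # Collect every value stored under `key`, wherever it appears in the JSON.
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_key(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found

for pgid in ["2.182"]:          # example; list the stuck PG ids here
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    probes = find_key(json.loads(out.decode()), "down_osds_we_would_probe")
    print(pgid, probes)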

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Somnath Roy
Jan,
Journal helps FileStore to maintain the transactional integrity in the event of 
a crash. That's the main reason.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
Schermer
Sent: Wednesday, October 14, 2015 2:28 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

Hi,
I've been thinking about this for a while now - does Ceph really need a 
journal? Filesystems are already pretty good at committing data to disk when 
asked (and much faster too), and we have external journals in XFS and Ext4...
In a scenario where a client does an ordinary write, there's no need to flush it 
anywhere (the app didn't ask for it), so it ends up in the pagecache and gets 
committed eventually.
If a client asks for the data to be flushed, then fdatasync/fsync on the 
filestore object takes care of that, including ordering and stuff.
For reads, you just read from the filestore (no need to differentiate between 
filestore/journal) - the pagecache gives you the right version already.

Or is the journal there to achieve some tiering for writes when running 
spindles alongside SSDs? This is IMO the only thing ordinary filesystems don't 
do out of the box even when the filesystem journal is put on an SSD - the data 
gets flushed to the spindle whenever it is fsync-ed (even with data=journal). 
But in reality, most of the data will hit the spindle either way, and when you 
run with SSDs it will always be much slower. And even for tiering - there are 
already many options (bcache, flashcache or even ZFS L2ARC) that are much more 
performant and proven stable. I think the fact that people have a need to 
combine Ceph with stuff like that already proves the point.

So a very interesting scenario would be to disable the Ceph journal and at most 
use data=journal on ext4. The complexity of the data path would drop 
significantly, latencies would decrease, CPU time would be saved...
I just feel that Ceph has lots of unnecessary complexity inside that duplicates 
what filesystems (and the pagecache...) have been doing for a while now without 
eating most of our CPU cores - why don't we use that? Is it possible to disable 
the journal completely?

Did I miss something that makes the journal essential?

Jan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] download.ceph.com unreachable IPv6 [was: v9.1.0 Infernalis release candidate released]

2015-10-14 Thread Wido den Hollander


On 14-10-15 16:30, Björn Lässig wrote:
> On 10/13/2015 11:01 PM, Sage Weil wrote:
>>   http://download.ceph.com/debian-testing
> 
> unfortunately this site is not reachable at the moment.
> 
> 
> $ wget http://download.ceph.com/debian-testing/dists/wheezy/InRelease  -O -
> --2015-10-14 16:06:55--
> http://download.ceph.com/debian-testing/dists/wheezy/InRelease
> Resolving download.ceph.com (download.ceph.com)...
> 2607:f298:6050:51f3:f816:3eff:fe50:5ec, 173.236.253.173
> Connecting to download.ceph.com
> (download.ceph.com)|2607:f298:6050:51f3:f816:3eff:fe50:5ec|:80... connected.
> HTTP request sent, awaiting response...
> 
> $ telnet -6 2607:f298:6050:51f3:f816:3eff:fe50:5ec 80
> Trying 2607:f298:6050:51f3:f816:3eff:fe50:5ec...
> 
> Then it waits until it timeouts.
> 

Works for me here from an XS4All connection in the Netherlands over IPv6.

wido@wido-desktop:~$ wget -6
http://download.ceph.com/debian-testing/dists/wheezy/InRelease
--2015-10-14 17:10:41--
http://download.ceph.com/debian-testing/dists/wheezy/InRelease
Resolving download.ceph.com (download.ceph.com)...
2607:f298:6050:51f3:f816:3eff:fe50:5ec
Connecting to download.ceph.com
(download.ceph.com)|2607:f298:6050:51f3:f816:3eff:fe50:5ec|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6873 (6,7K)
Saving to: ‘InRelease’

100%[==>]
6.873   --.-K/s   in 0s

2015-10-14 17:10:41 (306 MB/s) - ‘InRelease’ saved [6873/6873]

wido@wido-desktop:~$


> Thanks for your great work,
> 
>   Björn Lässig
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What are linger_ops in the output of objecter_requests ?

2015-10-14 Thread Ilya Dryomov
On Wed, Oct 14, 2015 at 5:13 PM, Saverio Proto  wrote:
> Hello,
>
> While debugging slow request behaviour on our Rados Gateway, I ran into
> this linger_ops field and I cannot understand its meaning.
>
> I would expect to find slow requests stuck in the "ops" field.
> Actually, most of the time I have "ops": [], and it looks like ops gets
> emptied very quickly.
>
> However, linger_ops is populated, and it is always the same requests;
> it looks like those are there forever.
>
> Any explanation of what linger_ops are?

Just as the name suggests, these are requests that are supposed to
linger.  This is an internal implementation detail, part of watch/notify
infrastructure.
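
As a toy illustration (this is not Ceph's Objecter code): ordinary requests
complete and drop out of "ops", while a watch registration is an op that is
deliberately kept alive - and re-established after reconnects - so the OSD can
deliver notify callbacks later. That is why radosgw's watches on its notify.*
control objects look permanent.

# Toy model of why a watch shows up under "linger_ops" (illustrative only).
class ToyObjecter(object):
    def __init__(self):
        self.ops = {}         # short-lived requests; emptied as they complete
        self.linger_ops = {}  # watch registrations; kept until unwatch
        self._next_id = 0

    def _new_id(self):
        self._next_id += 1
        return self._next_id

    def read(self, obj):
        op_id = self._new_id()
        self.ops[op_id] = ("read", obj)
        del self.ops[op_id]   # completes (immediately, in this toy model)
        return "data-of-%s" % obj

    def watch(self, obj, callback):
        linger_id = self._new_id()
        self.linger_ops[linger_id] = (obj, callback)   # stays registered
        return linger_id

    def notify(self, obj, msg):
        for target, callback in self.linger_ops.values():
            if target == obj:
                callback(msg)

def on_notify(msg):
    print("got notify: %s" % msg)

o = ToyObjecter()
o.read("notify.7")
o.watch("notify.7", on_notify)
o.notify("notify.7", "cache invalidation")
print(o.ops)          # {} - plain ops drain quickly
print(o.linger_ops)   # the watch is still there, "lingering"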

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What are linger_ops in the output of objecter_requests ?

2015-10-14 Thread Saverio Proto
Hello,

While debugging slow request behaviour on our Rados Gateway, I ran into
this linger_ops field and I cannot understand its meaning.

I would expect to find slow requests stuck in the "ops" field.
Actually, most of the time I have "ops": [], and it looks like ops gets
emptied very quickly.

However, linger_ops is populated, and it is always the same requests;
it looks like those are there forever.

Any explanation of what linger_ops are?

thanks !

Saverio


r...@os.zhdk.cloud /home/proto ; ceph daemon
/var/run/ceph/ceph-radosgw.gateway.asok objecter_requests
{
"ops": [],
"linger_ops": [
{
"linger_id": 8,
"pg": "10.84ada7c9",
"osd": 9,
"object_id": "notify.7",
"object_locator": "@10",
"target_object_id": "notify.7",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 2,
"pg": "10.16dafda0",
"osd": 27,
"object_id": "notify.1",
"object_locator": "@10",
"target_object_id": "notify.1",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 6,
"pg": "10.31099063",
"osd": 52,
"object_id": "notify.5",
"object_locator": "@10",
"target_object_id": "notify.5",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 3,
"pg": "10.88aa5c95",
"osd": 66,
"object_id": "notify.2",
"object_locator": "@10",
"target_object_id": "notify.2",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 5,
"pg": "10.a204812d",
"osd": 66,
"object_id": "notify.4",
"object_locator": "@10",
"target_object_id": "notify.4",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 4,
"pg": "10.f8c99aee",
"osd": 68,
"object_id": "notify.3",
"object_locator": "@10",
"target_object_id": "notify.3",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 1,
"pg": "10.4322fa9f",
"osd": 82,
"object_id": "notify.0",
"object_locator": "@10",
"target_object_id": "notify.0",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 7,
"pg": "10.97c520d4",
"osd": 103,
"object_id": "notify.6",
"object_locator": "@10",
"target_object_id": "notify.6",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
}
],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

It seems in our situation the cluster is just busy, usually with
really small RBD I/O. We have gotten things to where it doesn't happen
as much in a steady state, but when we have an OSD fail (mostly from
an XFS log bug we hit at least once a week), it is very painful as the
OSD exits and enters the cluster. We are working to split the PGs a
couple of fold, but this is a painful process for the reasons
mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
on IRC about getting the other primaries to throttle back when such a
situation occurs so that each primary OSD has some time to service
client I/O and to push back on the clients to slow down in these
situations.

In our case a single OSD can lock up a VM for a very long time while
others are happily going about their business. Instead of looking like
the cluster is out of I/O, it looks like there is an error. If
pressure is pushed back to clients, it would show up as all of the
clients slowing down a little instead of one or two just hanging for
even over 1,000 seconds.

My thought is that each OSD should have some percentage of time given
to servicing client I/O, whereas now it seems that replica I/O can
completely starve client I/O. I understand why replica traffic needs a
higher priority, but I think some balance needs to be attained.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWHne4CRDmVDuy+mK58QAAwYUP/RzTrmsYV7Vi6e64Yikh
YMMI4Cxt4mBWbTIOsb8iRY98EkqhUWd/kz45OoFQgwE4hS3O5Lksf3u0pcmS
I+Gz6jQ4/K0B6Mc3Rt19ofD1cA9s6BLnHSqTFZEUVapiHftj84ewIRLts9dg
YCJJeaaOV8fu07oZvnumRTAKOzWPyQizQKBGx7nujIg13Us0st83C8uANzoX
hKvlA2qVMXO4rLgR7nZMcgj+X+/79v7MDycM3WP/Q21ValsNfETQVhN+XxC8
D/IUfX4/AKUEuF4WBEck4Z/Wx9YD+EvpLtQVLy21daazRApWES/iy089F63O
k9RHp189c4WCduFBaTvZj2cdekAq/Wl50O1AdafYFptWqYhw+aKpihI+yMrX
+LhWgoYALD6wyXr0KVDZZszIRZbO/PSjct8z13aXBJoJm9r0Vyazfhi9jNW9
Z/1GD7gv5oHymf7eR9u7T8INdjNzn6Qllj7XCyZfQv5TYxsRWMZxf5vEkpMB
nAYANoZcNs4ZSIy+OdFOb6nM66ujrytWL1DqWusJUEM/GauBw0fxnQ/i+pMy
XU8gYbG1um5YY8jrtvvkhnbHdeO/k24/cH7MGslxeezBPnMNzmqj3qVdiX1H
EBbyBBtp8OF+pKExrmZc2w01W/Nxl6GbVoG+IKJ61FgwKOXEiMwb0wv5mu30
eP3D
=R0O9
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang  wrote:
> On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
>> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> After a weekend, I'm ready to hit this from a different direction.
>>>
>>> I replicated the issue with Firefly so it doesn't seem an issue that
>>> has been introduced or resolved in any nearby version. I think overall
>>> we may be seeing [1] to a great degree. From what I can extract from
>>> the logs, it looks like in situations where OSDs are going up and
>>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>>> the PG to become clean before dispatching the I/O to the replicas.
>>>
>>> In an effort to understand the flow of the logs, I've attached a small
>>> 2 minute segment of a log I've extracted what I believe to be
>>> important entries in the life cycle of an I/O along with my
>>> understanding. If someone would be kind enough to help my
>>> understanding, I would appreciate it.
>>>
>>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>>> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
>>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>
>>> - ->Messenger has recieved the message from the client (previous
>>> entries in the 7fb9d2c68700 thread are the individual segments that
>>> make up this message).
>>>
>>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>>> <== client.6709 192.168.55.12:0/2013622 19 
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>  235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>>>
>>> - ->OSD process acknowledges that it has received the write.
>>>
>>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>
>>> - ->Not sure exactly what is going on here, the op is being enqueued 
>>> somewhere..
>>>
>>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>>> 0x3052b300 

[ceph-users] Proc for Impl XIO mess with Infernalis

2015-10-14 Thread German Anders
Hi all,

I would like to know whether, with this new release of Infernalis, there is
a procedure somewhere for implementing the XIO messenger with IB and Ceph.
Also, is it possible to change an existing Ceph cluster over to this kind of
new setup? (The existing cluster does not have any production data yet.)

Thanks in advance,

Cheers,

*German* 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-14 Thread Chen, Xiaoxi
Hi Mark,
 The Async result in 128K drops quickly after some point; is that because 
of the testing methodology?

 The other conclusion, as I read it, is that SimpleMessenger + jemalloc is 
the best practice for now, since it has the same performance as Async but 
uses much less memory?

-Xiaoxi

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: Tuesday, October 13, 2015 9:03 PM
> To: Haomai Wang
> Cc: ceph-devel; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger vs
> AsyncMessenger results
> 
> Hi Haomai,
> 
> Great!  I haven't had a chance to dig in and look at it with valgrind yet, 
> but if I
> get a chance after I'm done with newstore fragment testing and somnath's
> writepath work I'll try to go back and dig in if you haven't had a chance yet.
> 
> Mark
> 
> On 10/12/2015 09:56 PM, Haomai Wang wrote:
> > resend
> >
> > On Tue, Oct 13, 2015 at 10:56 AM, Haomai Wang 
> wrote:
> >> COOL
> >>
> >> Interesting that async messenger will consume more memory than
> >> simple, in my mind I always think async should use less memory. I
> >> will give a look at this
> >>
> >> On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson 
> wrote:
> >>>
> >>> Hi Guy,
> >>>
> >>> Given all of the recent data on how different memory allocator
> >>> configurations improve SimpleMessenger performance (and the effect
> >>> of memory allocators and transparent hugepages on RSS memory usage),
> >>> I thought I'd run some tests looking how AsyncMessenger does in
> >>> comparison.  We spoke about these a bit at the last performance
> meeting but here's the full write up.
> >>> The rough conclusion as of right now appears to be:
> >>>
> >>> 1) AsyncMessenger performance is not dependent on the memory
> >>> allocator like with SimpleMessenger.
> >>>
> >>> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc +
> >>> 32MB (ie
> >>> default) thread cache.
> >>>
> >>> 3) AsyncMessenger is consistently faster than SimpleMessenger for
> >>> 128K random reads.
> >>>
> >>> 4) AsyncMessenger is sometimes slower than SimpleMessenger when
> >>> memory allocator optimizations are used.
> >>>
> >>> 5) AsyncMessenger currently uses far more RSS memory than
> SimpleMessenger.
> >>>
> >>> Here's a link to the paper:
> >>>
> >>> https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view
> >>>
> >>> Mark
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >>
> >> --
> >>
> >> Best Regards,
> >>
> >> Wheat
> >
> >
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Sage Weil
On Wed, 14 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> It seems in our situation the cluster is just busy, usually with
> really small RBD I/O. We have gotten things to where it doesn't happen
> as much in a steady state, but when we have an OSD fail (mostly from
> an XFS log bug we hit at least once a week), it is very painful as the
> OSD exits and enters the cluster. We are working to split the PGs a
> couple of fold, but this is a painful process for the reasons
> mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
> on IRC about getting the other primaries to throttle back when such a
> situation occurs so that each primary OSD has some time to service
> client I/O and to push back on the clients to slow down in these
> situations.
> 
> In our case a single OSD can lock up a VM for a very long time while
> others are happily going about their business. Instead of looking like
> the cluster is out of I/O, it looks like there is an error. If
> pressure is pushed back to clients, it would show up as all of the
> clients slowing down a little instead of one or two just hanging for
> even over 1,000 seconds.

This 1000 seconds figure is very troubling.  Do you have logs?  I suspect 
this is a different issue than the prioritization one in the log from the 
other day (which only waited about 30s for higher-priority replica 
requests).

> My thought is that each OSD should have some percentage of time given
> to servicing client I/O, whereas now it seems that replica I/O can
> completely starve client I/O. I understand why replica traffic needs a
> higher priority, but I think some balance needs to be attained.

We currently do 'fair' prioritized queueing with a token bucket filter 
only for requests with priorities <= 63.  Simply increasing this threshold 
so that it covers replica requests might be enough.  But... we'll be 
starting client requests locally at the expense of in-progress client 
writes elsewhere.  Given that the amount of (our) client-related work we 
do is always bounded by the msgr throttle, I think this is okay since we 
only make the situation worse by a fixed factor.  (We still don't address 
the possibility that we are replica for every other OSD in the system and 
could be flooded by N*(max client ops per OSD).)

It's this line:

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8334
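
To make that concrete, here is a deliberately oversimplified sketch of a
cutoff-based op queue (illustrative only: the real queue is a weighted /
token-bucket scheme rather than plain FIFO, and the replica priority value
below is made up). Ops at or below the cutoff share the "fair" tier; ops above
it are strict-priority and always win, which is the starvation Robert is
describing. Raising the cutoff above the replica priority moves replica ops
into the fair tier.

# Illustrative sketch, not the OSD's actual queue.
import heapq
from collections import deque

class TwoTierQueue(object):
    def __init__(self, cutoff=63):
        self.cutoff = cutoff
        self.strict = []       # (neg priority, seq, op): always served first
        self.fair = deque()    # served only when the strict tier is empty
        self.seq = 0

    def enqueue(self, priority, op):
        self.seq += 1
        if priority > self.cutoff:
            heapq.heappush(self.strict, (-priority, self.seq, op))
        else:
            self.fair.append(op)   # real code: weighted by priority and cost

    def dequeue(self):
        if self.strict:
            return heapq.heappop(self.strict)[2]
        return self.fair.popleft()

q = TwoTierQueue(cutoff=63)
q.enqueue(63, "client write A")
q.enqueue(127, "replica op from another OSD")   # hypothetical priority value
q.enqueue(63, "client write B")
print([q.dequeue() for _ in range(3)])
# The replica op comes out first; with the cutoff raised above its priority it
# would instead take its turn alongside the client writes.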

sage



> 
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWHne4CRDmVDuy+mK58QAAwYUP/RzTrmsYV7Vi6e64Yikh
> YMMI4Cxt4mBWbTIOsb8iRY98EkqhUWd/kz45OoFQgwE4hS3O5Lksf3u0pcmS
> I+Gz6jQ4/K0B6Mc3Rt19ofD1cA9s6BLnHSqTFZEUVapiHftj84ewIRLts9dg
> YCJJeaaOV8fu07oZvnumRTAKOzWPyQizQKBGx7nujIg13Us0st83C8uANzoX
> hKvlA2qVMXO4rLgR7nZMcgj+X+/79v7MDycM3WP/Q21ValsNfETQVhN+XxC8
> D/IUfX4/AKUEuF4WBEck4Z/Wx9YD+EvpLtQVLy21daazRApWES/iy089F63O
> k9RHp189c4WCduFBaTvZj2cdekAq/Wl50O1AdafYFptWqYhw+aKpihI+yMrX
> +LhWgoYALD6wyXr0KVDZZszIRZbO/PSjct8z13aXBJoJm9r0Vyazfhi9jNW9
> Z/1GD7gv5oHymf7eR9u7T8INdjNzn6Qllj7XCyZfQv5TYxsRWMZxf5vEkpMB
> nAYANoZcNs4ZSIy+OdFOb6nM66ujrytWL1DqWusJUEM/GauBw0fxnQ/i+pMy
> XU8gYbG1um5YY8jrtvvkhnbHdeO/k24/cH7MGslxeezBPnMNzmqj3qVdiX1H
> EBbyBBtp8OF+pKExrmZc2w01W/Nxl6GbVoG+IKJ61FgwKOXEiMwb0wv5mu30
> eP3D
> =R0O9
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang  wrote:
> > On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
> >> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
> >>> -BEGIN PGP SIGNED MESSAGE-
> >>> Hash: SHA256
> >>>
> >>> After a weekend, I'm ready to hit this from a different direction.
> >>>
> >>> I replicated the issue with Firefly so it doesn't seem an issue that
> >>> has been introduced or resolved in any nearby version. I think overall
> >>> we may be seeing [1] to a great degree. From what I can extract from
> >>> the logs, it looks like in situations where OSDs are going up and
> >>> down, I see I/O blocked at the primary OSD waiting for peering and/or
> >>> the PG to become clean before dispatching the I/O to the replicas.
> >>>
> >>> In an effort to understand the flow of the logs, I've attached a small
> >>> 2 minute segment of a log I've extracted what I believe to be
> >>> important entries in the life cycle of an I/O along with my
> >>> understanding. If someone would be kind enough to help my
> >>> understanding, I would appreciate it.
> >>>
> >>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
> >>> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
> >>> l=1 c=0x32c85440).reader got message 19 0x2af81700
> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) 

[ceph-users] Fwd: Proc for Impl XIO mess with Infernalis

2015-10-14 Thread German Anders
Let me be more specific about what I need in order to move forward with
this kind of install:

setup:

3 mon servers
8 OSD servers (4 with SAS disks and SSD journals - a 1:3 ratio - and 4 with
SSD disks, with OSD & journal on the same disk)


running ceph version 0.94.3

I've already installed and tested *fio*, *accelio* and *rdma* on all the nodes
(mon + osd).

The Ceph cluster is already set up and running (with no production data on
it). What I want to do is change this cluster over to support the XIO
messenger with RDMA.

a couple of questions...

First, is this a valid option? Is it possible to change it without redoing
the whole cluster?
Second, is there anyone around who has already done this?
Third, I know that it's not production ready, but if someone has a
procedure for turning an existing cluster into one with RDMA/XIO support, it
would be really appreciated.

thanks in advance,

cheers,

*German* 

-- Forwarded message --
From: German Anders 
Date: 2015-10-14 12:46 GMT-03:00
Subject: Proc for Impl XIO mess with Infernalis
To: ceph-users 


Hi all,

I would like to know whether, with this new release of Infernalis, there is
a procedure somewhere for implementing the XIO messenger with IB and Ceph.
Also, is it possible to change an existing Ceph cluster over to this kind of
new setup? (The existing cluster does not have any production data yet.)

Thanks in advance,

Cheers,

*German* 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Sage Weil
On Wed, 14 Oct 2015, Kyle Hutson wrote:
> > Which bug?  We want to fix hammer, too!
> 
> This
> one: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg23915.html
> 
> (Adam sits about 5' from me.)

Oh... that fix is already in the hammer branch and will be in 0.94.4.  
Since you have to go to that anyway before infernalis you may as well stop 
there (unless there is something else you want from infernalis!).

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] download.ceph.com unreachable IPv6 [was: v9.1.0 Infernalis release candidate released]

2015-10-14 Thread Björn Lässig
On 10/14/2015 05:11 PM, Wido den Hollander wrote:
> 
> 
> On 14-10-15 16:30, Björn Lässig wrote:
>> On 10/13/2015 11:01 PM, Sage Weil wrote:
>>>   http://download.ceph.com/debian-testing
>>
>> unfortunately this site is not reachable at the moment.
>>
> wido@wido-desktop:~$ wget -6
> http://download.ceph.com/debian-testing/dists/wheezy/InRelease
> […]
> 2015-10-14 17:10:41 (306 MB/s) - ‘InRelease’ saved [6873/6873]

We tried from different locations with

 * sixxs
 * our AS (AS201824)
 * some testing from hetzner. (AS24940)

From some locations ''wget -6
http://download.ceph.com/debian-testing/dists/wheezy/InRelease'' does
not work at all; from some of them, it only works 1 out of 5 times.

Testing is hard: they drop ICMPv6 ping, and I have no idea why they are
doing this.

Their PTR record for download.ceph.com is missing, but that's not the
point here.
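
For what it's worth, with ICMPv6 dropped, a plain TCP connect to port 80 over
IPv6 is still a usable reachability probe. A quick sketch (host and number of
attempts are arbitrary):

# Probe IPv6-only TCP reachability of an HTTP endpoint.
import socket

host, port, attempts = "download.ceph.com", 80, 5
ok = 0
for i in range(attempts):
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        # Restricting getaddrinfo to AF_INET6 forces the v6 address.
        family, _, _, _, sockaddr = socket.getaddrinfo(
            host, port, socket.AF_INET6, socket.SOCK_STREAM)[0]
        s.connect(sockaddr)
        ok += 1
    except (socket.timeout, socket.error) as e:
        print("attempt %d failed: %s" % (i, e))
    finally:
        s.close()
print("%d/%d IPv6 TCP connects to %s:%d succeeded" % (ok, attempts, host, port))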


regards,

 Björn Lässig

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-14 Thread Mark Nelson

Hi Xiaoxi,

I would ignore the tails on those tests.  I suspect it's just some fio 
processes finishing earlier than others and the associated aggregate 
performance dropping off.  These read tests are so fast that my 
original guess at reasonable volume sizes for 300-second tests appears to 
be off.


Mark

On 10/14/2015 10:57 AM, Chen, Xiaoxi wrote:

Hi Mark,
  The Async result in 128K drops quickly after some point; is that because 
of the testing methodology?

  The other conclusion, as I read it, is that SimpleMessenger + jemalloc is 
the best practice for now, since it has the same performance as Async but 
uses much less memory?

-Xiaoxi


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: Tuesday, October 13, 2015 9:03 PM
To: Haomai Wang
Cc: ceph-devel; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger vs
AsyncMessenger results

Hi Haomai,

Great!  I haven't had a chance to dig in and look at it with valgrind yet, but 
if I
get a chance after I'm done with newstore fragment testing and somnath's
writepath work I'll try to go back and dig in if you haven't had a chance yet.

Mark

On 10/12/2015 09:56 PM, Haomai Wang wrote:

resend

On Tue, Oct 13, 2015 at 10:56 AM, Haomai Wang 

wrote:

COOL

Interesting that async messenger will consume more memory than
simple, in my mind I always think async should use less memory. I
will give a look at this

On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson 

wrote:


Hi Guy,

Given all of the recent data on how different memory allocator
configurations improve SimpleMessenger performance (and the effect
of memory allocators and transparent hugepages on RSS memory usage),
I thought I'd run some tests looking how AsyncMessenger does in
comparison.  We spoke about these a bit at the last performance

meeting but here's the full write up.

The rough conclusion as of right now appears to be:

1) AsyncMessenger performance is not dependent on the memory
allocator like with SimpleMessenger.

2) AsyncMessenger is faster than SimpleMessenger with TCMalloc +
32MB (ie
default) thread cache.

3) AsyncMessenger is consistently faster than SimpleMessenger for
128K random reads.

4) AsyncMessenger is sometimes slower than SimpleMessenger when
memory allocator optimizations are used.

5) AsyncMessenger currently uses far more RSS memory than

SimpleMessenger.


Here's a link to the paper:

https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Best Regards,

Wheat





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Kyle Hutson
Nice! Thanks!

On Wed, Oct 14, 2015 at 1:23 PM, Sage Weil  wrote:

> On Wed, 14 Oct 2015, Kyle Hutson wrote:
> > > Which bug?  We want to fix hammer, too!
> >
> > This
> > one:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg23915.html
> >
> > (Adam sits about 5' from me.)
>
> Oh... that fix is already in the hammer branch and will be in 0.94.4.
> Since you have to go to that anyway before infernalis you may as well stop
> there (unless there is something else you want from infernalis!).
>
> sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'm sure I have a log of a 1,000 second block somewhere, I'll have to
look around for it.

I'll try turning that knob and see what happens. I'll come back with
the results.

Thanks,

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 14, 2015 at 11:08 AM, Sage Weil  wrote:
> On Wed, 14 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> It seems in our situation the cluster is just busy, usually with
>> really small RBD I/O. We have gotten things to where it doesn't happen
>> as much in a steady state, but when we have an OSD fail (mostly from
>> an XFS log bug we hit at least once a week), it is very painful as the
>> OSD exits and enters the cluster. We are working to split the PGs a
>> couple of fold, but this is a painful process for the reasons
>> mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
>> on IRC about getting the other primaries to throttle back when such a
>> situation occurs so that each primary OSD has some time to service
>> client I/O and to push back on the clients to slow down in these
>> situations.
>>
>> In our case a single OSD can lock up a VM for a very long time while
>> others are happily going about their business. Instead of looking like
>> the cluster is out of I/O, it looks like there is an error. If
>> pressure is pushed back to clients, it would show up as all of the
>> clients slowing down a little instead of one or two just hanging for
>> even over 1,000 seconds.
>
> This 1000 seconds figure is very troubling.  Do you have logs?  I suspect
> this is a different issue than the prioritization one in the log from the
> other day (which only waited about 30s for higher-priority replica
> requests).
>
>> My thought is that each OSD should have some percentage of time given
>> to servicing client I/O, whereas now it seems that replica I/O can
>> completely starve client I/O. I understand why replica traffic needs a
>> higher priority, but I think some balance needs to be attained.
>
> We currently do 'fair' prioritized queueing with a token bucket filter
> only for requests with priorities <= 63.  Simply increasing this threshold
> so that it covers replica requests might be enough.  But... we'll be
> starting client requests locally at the expense of in-progress client
> writes elsewhere.  Given that the amount of (our) client-related work we
> do is always bounded by the msgr throttle, I think this is okay since we
> only make the situation worse by a fixed factor.  (We still don't address
> the possibilty that we are replica for every other osd in the system and
> could be flooded by N*(max client ops per osd).
>
> It's this line:
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8334
>
> sage
>
>
>
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWHne4CRDmVDuy+mK58QAAwYUP/RzTrmsYV7Vi6e64Yikh
>> YMMI4Cxt4mBWbTIOsb8iRY98EkqhUWd/kz45OoFQgwE4hS3O5Lksf3u0pcmS
>> I+Gz6jQ4/K0B6Mc3Rt19ofD1cA9s6BLnHSqTFZEUVapiHftj84ewIRLts9dg
>> YCJJeaaOV8fu07oZvnumRTAKOzWPyQizQKBGx7nujIg13Us0st83C8uANzoX
>> hKvlA2qVMXO4rLgR7nZMcgj+X+/79v7MDycM3WP/Q21ValsNfETQVhN+XxC8
>> D/IUfX4/AKUEuF4WBEck4Z/Wx9YD+EvpLtQVLy21daazRApWES/iy089F63O
>> k9RHp189c4WCduFBaTvZj2cdekAq/Wl50O1AdafYFptWqYhw+aKpihI+yMrX
>> +LhWgoYALD6wyXr0KVDZZszIRZbO/PSjct8z13aXBJoJm9r0Vyazfhi9jNW9
>> Z/1GD7gv5oHymf7eR9u7T8INdjNzn6Qllj7XCyZfQv5TYxsRWMZxf5vEkpMB
>> nAYANoZcNs4ZSIy+OdFOb6nM66ujrytWL1DqWusJUEM/GauBw0fxnQ/i+pMy
>> XU8gYbG1um5YY8jrtvvkhnbHdeO/k24/cH7MGslxeezBPnMNzmqj3qVdiX1H
>> EBbyBBtp8OF+pKExrmZc2w01W/Nxl6GbVoG+IKJ61FgwKOXEiMwb0wv5mu30
>> eP3D
>> =R0O9
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang  wrote:
>> > On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
>> >> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> >>> -BEGIN PGP SIGNED MESSAGE-
>> >>> Hash: SHA256
>> >>>
>> >>> After a weekend, I'm ready to hit this from a different direction.
>> >>>
>> >>> I replicated the issue with Firefly so it doesn't seem an issue that
>> >>> has been introduced or resolved in any nearby version. I think overall
>> >>> we may be seeing [1] to a great degree. From what I can extract from
>> >>> the logs, it looks like in situations where OSDs are going up and
>> >>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>> >>> the PG to become clean before dispatching the I/O to the replicas.
>> >>>
>> >>> In an effort to understand the flow of the logs, I've attached a small
>> >>> 2 minute segment of a log I've extracted what I believe to be
>> >>> important entries in the life cycle of an I/O along with my
>> >>> understanding. If someone would be kind enough to help my
>> >>> understanding, I 

Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Kyle Hutson
A couple of questions related to this, especially since we have a hammer
bug that's biting us so we're anxious to upgrade to Infernalis.

1) RE: librbd and librados ABI compatibility is broken.  Be careful installing
this RC on client machines (e.g., those running qemu). It will be fixed in
the final v9.2.0 release.

We have several qemu clients. If we upgrade the ceph servers (and not the
qemu clients), will this affect us?

2) RE: Upgrading directly from Firefly v0.80.z is not possible.  All
clusters must first upgrade to Hammer v0.94.4 or a later v0.94.z release;
only then is it possible to upgrade to Infernalis 9.2.z.

I think I understand this, but want to verify. We're on 0.94.3. Can we
upgrade to the RC 9.1.0 and then safely upgrade to 9.2.z when it is
finalized? Any foreseen issues with this upgrade path?

On Wed, Oct 14, 2015 at 7:30 AM, Sage Weil  wrote:

> On Wed, 14 Oct 2015, Dan van der Ster wrote:
> > Hi Goncalo,
> >
> > On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges
> >  wrote:
> > > Hi Sage...
> > >
> > > I've seen that the rh6 derivatives have been ruled out.
> > >
> > > This is a problem in our case since the OS choice in our systems is,
> > > somehow, imposed by CERN. The experiments software is certified for
> SL6 and
> > > the transition to SL7 will take some time.
> >
> > Are you accessing Ceph directly from "physics" machines? Here at CERN
> > we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
> > time we upgrade to Infernalis the servers will all be CentOS 7 as
> > well. Batch nodes running SL6 don't (currently) talk to Ceph directly
> > (in the future they might talk to Ceph-based storage via an xroot
> > gateway). But if there are use-cases then perhaps we could find a
> > place to build and distributing the newer ceph clients.
> >
> > There's a ML ceph-t...@cern.ch where we could take this discussion.
> > Mail me if have trouble joining that e-Group.
>
> Also note that it *is* possible to build infernalis on el6, but it
> requires a lot more effort... enough that we would rather spend our time
> elsewhere (at least as far as ceph.com packages go).  If someone else
> > wants to do that work we'd be happy to take patches to update the build
> > and/or release process.
>
> > IIRC the thing that eventually made me stop going down this path was the
> > fact that the newer gcc had a runtime dependency on the newer libstdc++,
> > which wasn't part of the base distro... which means we'd also need to
> publish those packages in the ceph.com repos, or users would have to
> add some backport repo or ppa or whatever to get things running.  Bleh.
>
> sage
>
>
> >
> > Cheers, Dan
> > CERN IT-DSS
> >
> > > This is kind of a showstopper specially if we can't deploy clients in
> SL6 /
> > > Centos6.
> > >
> > > Is there any alternative?
> > >
> > > TIA
> > > Goncalo
> > >
> > >
> > >
> > > On 10/14/2015 08:01 AM, Sage Weil wrote:
> > >>
> > >> This is the first Infernalis release candidate.  There have been some
> > >> major changes since hammer, and the upgrade process is non-trivial.
> > >> Please read carefully.
> > >>
> > >> Getting the release candidate
> > >> -
> > >>
> > >> The v9.1.0 packages are pushed to the development release
> repositories::
> > >>
> > >>http://download.ceph.com/rpm-testing
> > >>http://download.ceph.com/debian-testing
> > >>
> > >> For more info, see::
> > >>
> > >>http://docs.ceph.com/docs/master/install/get-packages/
> > >>
> > >> Or install with ceph-deploy via::
> > >>
> > >>ceph-deploy install --testing HOST
> > >>
> > >> Known issues
> > >> 
> > >>
> > >> * librbd and librados ABI compatibility is broken.  Be careful
> > >>installing this RC on client machines (e.g., those running qemu).
> > >>It will be fixed in the final v9.2.0 release.
> > >>
> > >> Major Changes from Hammer
> > >> -
> > >>
> > >> * *General*:
> > >>* Ceph daemons are now managed via systemd (with the exception of
> > >>  Ubuntu Trusty, which still uses upstart).
> > >>* Ceph daemons run as 'ceph' user instead root.
> > >>* On Red Hat distros, there is also an SELinux policy.
> > >> * *RADOS*:
> > >>* The RADOS cache tier can now proxy write operations to the base
> > >>  tier, allowing writes to be handled without forcing migration of
> > >>  an object into the cache.
> > >>* The SHEC erasure coding support is no longer flagged as
> > >>  experimental. SHEC trades some additional storage space for
> faster
> > >>  repair.
> > >>* There is now a unified queue (and thus prioritization) of client
> > >>  IO, recovery, scrubbing, and snapshot trimming.
> > >>* There have been many improvements to low-level repair tooling
> > >>  (ceph-objectstore-tool).
> > >>* The internal ObjectStore API has been significantly cleaned up in
> > >> order
> > >>  to facilitate new storage backends like 

Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Kyle Hutson
> Which bug?  We want to fix hammer, too!

This one:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg23915.html

(Adam sits about 5' from me.)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Somnath Roy
A filesystem like XFS guarantees a single file write, but in a Ceph transaction 
we are touching the file, xattrs, and leveldb (omap), so there is no way the 
filesystem can guarantee that transaction. That's why FileStore implements a 
write-ahead journal. Basically, it writes the entire transaction object there 
and only trims it from the journal once it has actually been applied (all the 
operations executed) and persisted in the backend.
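
A minimal sketch of that write-ahead pattern (illustrative Python only, nothing
like the real FileStore code; the "omap" file below is just a stand-in for
leveldb). The point is the ordering: the whole transaction is made durable in
the journal first, then each op is applied to the backing store, and only then
is the journal entry trimmed.

import json, os

class ToyFileStore(object):
    def __init__(self, root):                     # root directory must exist
        self.root = root
        self.journal = open(os.path.join(root, "journal"), "a+b")

    def submit(self, txn):
        # 1) journal the whole transaction and make it durable first
        self.journal.write(json.dumps(txn).encode("utf-8") + b"\n")
        self.journal.flush()
        os.fsync(self.journal.fileno())
        # 2) apply the individual ops (file data, xattr, omap) to the store
        for op in txn:
            self.apply(op)
        # 3) only now is it safe to trim the journal entry
        self.journal.truncate(0)

    def apply(self, op):
        path = os.path.join(self.root, op["obj"])
        if op["op"] == "write":
            with open(path, "ab") as f:
                f.write(op["data"].encode("utf-8"))
                f.flush()
                os.fsync(f.fileno())
        elif op["op"] == "setxattr":              # Linux / Python 3 only
            os.setxattr(path, "user." + op["key"], op["val"].encode("utf-8"))
        elif op["op"] == "omap_set":              # stand-in for the leveldb omap
            with open(path + ".omap", "a") as f:
                f.write("%s=%s\n" % (op["key"], op["val"]))

store = ToyFileStore("/tmp/toystore")             # assumes this directory exists
store.submit([
    {"op": "write",    "obj": "obj1", "data": "hello"},
    {"op": "setxattr", "obj": "obj1", "key": "ceph._", "val": "oi"},
    {"op": "omap_set", "obj": "obj1", "key": "snapset", "val": "..."},
])

The real journal is a preallocated file or block device written with
direct/async I/O rather than a text file, but that journal-apply-trim ordering
is exactly what a single per-file fsync cannot give you across data + xattrs +
omap.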

Thanks & Regards
Somnath

-Original Message-
From: Jan Schermer [mailto:j...@schermer.cz] 
Sent: Wednesday, October 14, 2015 9:06 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

But that's exactly what filesystems and their own journals do already :-)

Jan

> On 14 Oct 2015, at 17:02, Somnath Roy  wrote:
> 
> Jan,
> Journal helps FileStore to maintain the transactional integrity in the event 
> of a crash. That's the main reason.
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
> Schermer
> Sent: Wednesday, October 14, 2015 2:28 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
> 
> Hi,
> I've been thinking about this for a while now - does Ceph really need a 
> journal? Filesystems are already pretty good at committing data to disk when 
> asked (and much faster too), and we have external journals in XFS and Ext4...
> In a scenario where a client does an ordinary write, there's no need to flush 
> it anywhere (the app didn't ask for it), so it ends up in the pagecache and 
> gets committed eventually.
> If a client asks for the data to be flushed, then fdatasync/fsync on the 
> filestore object takes care of that, including ordering and stuff.
> For reads, you just read from the filestore (no need to differentiate between 
> filestore/journal) - the pagecache gives you the right version already.
> 
> Or is the journal there to achieve some tiering for writes when running 
> spindles alongside SSDs? This is IMO the only thing ordinary filesystems don't 
> do out of the box even when the filesystem journal is put on an SSD - the data 
> gets flushed to the spindle whenever it is fsync-ed (even with data=journal). 
> But in reality, most of the data will hit the spindle either way, and when you 
> run with SSDs it will always be much slower. And even for tiering - there are 
> already many options (bcache, flashcache or even ZFS L2ARC) that are much more 
> performant and proven stable. I think the fact that people have a need to 
> combine Ceph with stuff like that already proves the point.
> 
> So a very interesting scenario would be to disable the Ceph journal and at 
> most use data=journal on ext4. The complexity of the data path would drop 
> significantly, latencies would decrease, CPU time would be saved...
> I just feel that Ceph has lots of unnecessary complexity inside that 
> duplicates what filesystems (and the pagecache...) have been doing for a while 
> now without eating most of our CPU cores - why don't we use that? Is it 
> possible to disable the journal completely?
> 
> Did I miss something that makes the journal essential?
> 
> Jan
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Sage Weil
On Wed, 14 Oct 2015, Kyle Hutson wrote:
> A couple of questions related to this, especially since we have a hammer
> bug that's biting us so we're anxious to upgrade to Infernalis.

Which bug?  We want to fix hammer, too!

> 1) RE: librbd and librados ABI compatibility is broken.  Be careful installing
> this RC on client machines (e.g., those running qemu). It will be fixed in
> the final v9.2.0 release.
> 
> We have several qemu clients. If we upgrade the ceph servers (and not the
> qemu clients), will this affect us?

Nope! That will be fine.

> 2) RE: Upgrading directly from Firefly v0.80.z is not possible.  All
> clusters must first upgrade to Hammer v0.94.4 or a later v0.94.z release;
> only then is it possible to upgrade to Infernalis 9.2.z.
> 
> I think I understand this, but want to verify. We're on 0.94.3. Can we
> upgrade to the RC 9.1.0 and then safely upgrade to 9.2.z when it is
> finalized? Any foreseen issues with this upgrade path?

You need to first upgrade to 0.94.4 (or the latest hammer branch) before going 
to 9.1.0.  You'll of course be able to upgrade from there to 9.2.z.

sage

> 
> On Wed, Oct 14, 2015 at 7:30 AM, Sage Weil  wrote:
> 
> > On Wed, 14 Oct 2015, Dan van der Ster wrote:
> > > Hi Goncalo,
> > >
> > > On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges
> > >  wrote:
> > > > Hi Sage...
> > > >
> > > > I've seen that the rh6 derivatives have been ruled out.
> > > >
> > > > This is a problem in our case since the OS choice in our systems is,
> > > > somehow, imposed by CERN. The experiments software is certified for
> > SL6 and
> > > > the transition to SL7 will take some time.
> > >
> > > Are you accessing Ceph directly from "physics" machines? Here at CERN
> > > we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
> > > time we upgrade to Infernalis the servers will all be CentOS 7 as
> > > well. Batch nodes running SL6 don't (currently) talk to Ceph directly
> > > (in the future they might talk to Ceph-based storage via an xroot
> > > gateway). But if there are use-cases then perhaps we could find a
> > > place to build and distributing the newer ceph clients.
> > >
> > > There's a ML ceph-t...@cern.ch where we could take this discussion.
> > > Mail me if have trouble joining that e-Group.
> >
> > Also note that it *is* possible to build infernalis on el6, but it
> > requires a lot more effort... enough that we would rather spend our time
> > elsewhere (at least as far as ceph.com packages go).  If someone else
> > wants to do that work we'd be happy to take patches to update the build
> > and/or release process.
> >
> > IIRC the thing that eventually made me stop going down this path was the
> > fact that the newer gcc had a runtime dependency on the newer libstdc++,
> > which wasn't part of the base distro... which means we'd also need to
> > publish those packages in the ceph.com repos, or users would have to
> > add some backport repo or ppa or whatever to get things running.  Bleh.
> >
> > sage
> >
> >
> > >
> > > Cheers, Dan
> > > CERN IT-DSS
> > >
> > > > This is kind of a showstopper specially if we can't deploy clients in
> > SL6 /
> > > > Centos6.
> > > >
> > > > Is there any alternative?
> > > >
> > > > TIA
> > > > Goncalo
> > > >
> > > >
> > > >
> > > > On 10/14/2015 08:01 AM, Sage Weil wrote:
> > > >>
> > > >> This is the first Infernalis release candidate.  There have been some
> > > >> major changes since hammer, and the upgrade process is non-trivial.
> > > >> Please read carefully.
> > > >>
> > > >> Getting the release candidate
> > > >> -
> > > >>
> > > >> The v9.1.0 packages are pushed to the development release
> > repositories::
> > > >>
> > > >>http://download.ceph.com/rpm-testing
> > > >>http://download.ceph.com/debian-testing
> > > >>
> > > >> For more info, see::
> > > >>
> > > >>http://docs.ceph.com/docs/master/install/get-packages/
> > > >>
> > > >> Or install with ceph-deploy via::
> > > >>
> > > >>ceph-deploy install --testing HOST
> > > >>
> > > >> Known issues
> > > >> 
> > > >>
> > > >> * librbd and librados ABI compatibility is broken.  Be careful
> > > >>installing this RC on client machines (e.g., those running qemu).
> > > >>It will be fixed in the final v9.2.0 release.
> > > >>
> > > >> Major Changes from Hammer
> > > >> -
> > > >>
> > > >> * *General*:
> > > >>* Ceph daemons are now managed via systemd (with the exception of
> > > >>  Ubuntu Trusty, which still uses upstart).
> > > >>* Ceph daemons run as 'ceph' user instead root.
> > > >>* On Red Hat distros, there is also an SELinux policy.
> > > >> * *RADOS*:
> > > >>* The RADOS cache tier can now proxy write operations to the base
> > > >>  tier, allowing writes to be handled without forcing migration of
> > > >>  an object into the cache.
> > > >>* The SHEC erasure coding support is no longer flagged as
> > > 

Re: [ceph-users] CephFS "corruption" -- Nulled bytes

2015-10-14 Thread Adam Tygart
Not thoroughly tested, but I've got a quick and dirty script to fix
these up. Worst-case scenario, it does nothing. In my limited testing,
the contents of the files come back without a remount of CephFS.

https://github.com/BeocatKSU/admin/blob/master/ec_cephfs_fixer.py
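
For anyone curious, the core of it is just the truncate-up/then-down dance that
Sage describes further down, applied to files whose leading bytes read back as
NULs. A rough sketch of that idea follows (this is not the script above, just an
illustration; it also restores the original atime/mtime around the fix):

#!/usr/bin/env python
import os, sys

CHECK_BYTES = 4096   # how much of the head of each file to inspect

def looks_zeroed(path):
    with open(path, "rb") as f:
        head = f.read(CHECK_BYTES)
    return len(head) > 0 and head.count(b"\0") == len(head)

def fix(path):
    st = os.stat(path)
    if st.st_size == 0 or not looks_zeroed(path):
        return False
    with open(path, "r+b") as f:
        f.truncate(st.st_size + 1)   # truncate up...
        f.truncate(st.st_size)       # ...and back down to the original size
    os.utime(path, (st.st_atime, st.st_mtime))
    return True

for root, dirs, files in os.walk(sys.argv[1]):
    for name in files:
        path = os.path.join(root, name)
        if fix(path):
            print("fixed %s" % path)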

--
Adam

On Thu, Oct 8, 2015 at 11:11 AM, Lincoln Bryant  wrote:
> Hi Sage,
>
> Will this patch be in 0.94.4? We've got the same problem here.
>
> -Lincoln
>
>> On Oct 8, 2015, at 12:11 AM, Sage Weil  wrote:
>>
>> On Wed, 7 Oct 2015, Adam Tygart wrote:
>>> Does this patch fix files that have been corrupted in this manner?
>>
>> Nope, it'll only prevent it from happening to new files (that haven't yet
>> been migrated between the cache and base tier).
>>
>>> If not, or I guess even if it does, is there a way to walk the
>>> metadata and data pools and find objects that are affected?
>>
>> Hmm, this may actually do the trick... find a file that appears to be
>> zeroed, and truncate it up and then down again.  For example, if foo is
>> 100 bytes, do
>>
>> truncate --size 101 foo
>> truncate --size 100 foo
>>
>> then unmount and remount the client and see if the content reappears.
>>
>> Assuming that works (it did in my simple test) it'd be pretty easy to
>> write something that walks the tree and does the truncate trick for any
>> file whose first however many bytes are 0 (though it will mess up
>> mtime...).
>>
>>> Is that '_' xattr in hammer? If so, how can I access it? Doing a
>>> listxattr on the inode just lists 'parent', and doing the same on the
>>> parent directory's inode simply lists 'parent'.
>>
>> This is the file in /var/lib/ceph/osd/ceph-NNN/current.  For example,
>>
>> $ attr -l ./3.0_head/100.__head_F0B56F30__3
>> Attribute "cephos.spill_out" has a 2 byte value for 
>> ./3.0_head/100.__head_F0B56F30__3
>> Attribute "cephos.seq" has a 23 byte value for 
>> ./3.0_head/100.__head_F0B56F30__3
>> Attribute "ceph._" has a 250 byte value for 
>> ./3.0_head/100.__head_F0B56F30__3
>> Attribute "ceph._@1" has a 5 byte value for 
>> ./3.0_head/100.__head_F0B56F30__3
>> Attribute "ceph.snapset" has a 31 byte value for 
>> ./3.0_head/100.__head_F0B56F30__3
>>
>> ...but hopefully you won't need to touch any of that ;)
>>
>> sage
>>
>>
>>>
>>> Thanks for your time.
>>>
>>> --
>>> Adam
>>>
>>>
>>> On Mon, Oct 5, 2015 at 9:36 AM, Sage Weil  wrote:
 On Mon, 5 Oct 2015, Adam Tygart wrote:
> Okay, this has happened several more times. Always seems to be a small
> file that should be read-only (perhaps simultaneously) on many
> different clients. It is just through the cephfs interface that the
> files are corrupted, the objects in the cachepool and erasure coded
> pool are still correct. I am beginning to doubt these files are
> getting a truncation request.

 This is still consistent with the #12551 bug.  The object data is correct,
 but the cephfs truncation metadata on the object is wrong, causing it to
 be implicitly zeroed out on read.  It's easily triggered by writers who
 use O_TRUNC on open...

> Twice now have been different perl files, once was someones .bashrc,
> once was an input file for another application, timestamps on the
> files indicate that the files haven't been modified in weeks.
>
> Any other possibilites? Or any way to figure out what happened?

 You can confirm by extracting the '_' xattr on the object (append any @1
 etc fragments) and feeding it to ceph-dencoder with

 ceph-dencoder type object_info_t import  decode 
 dump_json

 and confirming that truncate_seq is 0, and verifying that the truncate_seq
 on the read request is non-zero.. you'd need to turn up the osd logs with
 debug ms = 1 and look for the osd_op that looks like "read 0~$length
 [$truncate_seq@$truncate_size]" (with real values in there).
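
(Spelled out, that check might look roughly like this on the OSD holding the
object -- the osd id, PG directory, object file and temp file are all
placeholders, and the xattr names are the ones shown in the attr -l output
above; if attr appends a trailing newline on your version, getfattr
--only-values is an alternative:

  cd /var/lib/ceph/osd/ceph-NNN/current/PGID_head
  attr -q -g ceph._   OBJECT_FILE  >  /tmp/oi     # object_info_t
  attr -q -g ceph._@1 OBJECT_FILE  >> /tmp/oi     # append any @1 fragment
  ceph-dencoder type object_info_t import /tmp/oi decode dump_json \
      | grep truncate_seq
)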

 ...but it really sounds like you're hitting the bug.  Unfortunately
 the fix is not backported to hammer just yet.  You can follow
http://tracker.ceph.com/issues/13034

 sage



>
> --
> Adam
>
> On Sun, Sep 27, 2015 at 10:44 PM, Adam Tygart  wrote:
>> I've done some digging into cp and mv's semantics (from coreutils). If
>> the inode is existing, the file will get truncated, then data will get
>> copied in. This is definitely within the scope of the bug above.
>>
>> --
>> Adam
>>
>> On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart  wrote:
>>> It may have been. Although the timestamp on the file was almost a
>>> month ago. The typical workflow for this particular file is to copy an
>>> updated version overtop of it.
>>>
>>> i.e. 'cp qss kstat'
>>>
>>> I'm not sure if cp semantics would keep the 

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Haomai Wang
On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> After a weekend, I'm ready to hit this from a different direction.
>>
>> I replicated the issue with Firefly, so it doesn't seem to be an issue that
>> was introduced or resolved in any nearby version. I think overall
>> we may be seeing [1] to a great degree. From what I can extract from
>> the logs, it looks like in situations where OSDs are going up and
>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>> the PG to become clean before dispatching the I/O to the replicas.
>>
>> In an effort to understand the flow of the logs, I've attached a small
>> 2 minute segment of a log I've extracted what I believe to be
>> important entries in the life cycle of an I/O along with my
>> understanding. If someone would be kind enough to help my
>> understanding, I would appreciate it.
>>
>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Messenger has received the message from the client (previous
>> entries in the 7fb9d2c68700 thread are the individual segments that
>> make up this message).
>>
>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>> <== client.6709 192.168.55.12:0/2013622 19 
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>  235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>>
>> - ->OSD process acknowledges that it has received the write.
>>
>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Not sure exactly what is going on here, the op is being enqueued 
>> somewhere..
>>
>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>> 0x3052b300 prio 63 cost 4194304 latency 30.017094
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
>> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
>> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
>> active+clean]
>>
>> - ->The op is dequeued from this mystery queue 30 seconds later in a
>> different thread.
>
> ^^ This is the problem.  Everything after this looks reasonable.  Looking
> at the other dequeue_op calls over this period, it looks like we're just
> overwhelmed with higher priority requests.  New clients are 63, while
> osd_repop (replicated write from another primary) are 127 and replies from
> our own replicated ops are 196.  We do process a few other prio 63 items,
> but you'll see that their latency is also climbing up to 30s over this
> period.
>
> The question is why we suddenly get a lot of them.. maybe the peering on
> other OSDs just completed so we get a bunch of these?  It's also not clear
> to me what makes osd.4 or this op special.  We expect a mix of primary and
> replica ops on all the OSDs, so why would we suddenly have more of them
> here

I guess the bug tracker issue (http://tracker.ceph.com/issues/13482) is
related to this thread.

So does this mean there is a live lock between client ops and repops?
We let clients issue too many ops, which bottlenecks some OSDs while
other OSDs stay idle enough to accept even more client ops. Eventually
all OSDs end up stuck behind the bottlenecked OSD. That seems
reasonable, but why does it last so long?
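
(As an aside, pulling the dequeue_op latencies out of an OSD log -- the same
lines quoted above -- is a quick way to see when the op queue starts backing
up; the log path is a placeholder:

  grep dequeue_op /var/log/ceph/ceph-osd.4.log \
      | awk '{ for (i = 1; i <= NF; i++) if ($i == "latency") print $1, $2, $(i+1) }' \
      | sort -k3 -n | tail
)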

>
> sage
>
>
>>
>> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
>>
>> - ->Not sure what this message is. Look up of secondary OSDs?
>>
>> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 

Re: [ceph-users] ceph same rbd on multiple client

2015-10-14 Thread gjprabu
Hi Tyler,



 Thanks for your reply. We have disabled rbd_cache but the issue still
persists. Please find our configuration file below.


# cat /etc/ceph/ceph.conf

[global]
fsid = 944fa0af-b7be-45a9-93ff-b9907cfaee3f
mon_initial_members = integ-hm5, integ-hm6, integ-hm7
mon_host = 192.168.112.192,192.168.112.193,192.168.112.194
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2

[mon]
mon_clock_drift_allowed = .500

[client]
rbd_cache = false



--


 cluster 944fa0af-b7be-45a9-93ff-b9907cfaee3f
  health HEALTH_OK
  monmap e2: 3 mons at {integ-hm5=192.168.112.192:6789/0,integ-hm6=192.168.112.193:6789/0,integ-hm7=192.168.112.194:6789/0}
         election epoch 480, quorum 0,1,2 integ-hm5,integ-hm6,integ-hm7
  osdmap e49780: 2 osds: 2 up, 2 in
   pgmap v2256565: 190 pgs, 2 pools, 1364 GB data, 410 kobjects
         2559 GB used, 21106 GB / 24921 GB avail
              190 active+clean
  client io 373 kB/s rd, 13910 B/s wr, 103 op/s





Regards

Prabu


  On Tue, 13 Oct 2015 19:59:38 +0530 Tyler Bishop 
tyler.bis...@beyondhosting.net wrote 




You need to disable RBD caching.







Tyler Bishop
Chief Technical Officer
513-299-7108 x10
tyler.bis...@beyondhosting.net

 
 

 









From: "gjprabu" gjpr...@zohocorp.com

To: "Frédéric Nass" frederic.n...@univ-lorraine.fr

Cc: "ceph-users@lists.ceph.com" ceph-users@lists.ceph.com, 
"Siva Sokkumuthu" sivaku...@zohocorp.com, "Kamal Kannan 
Subramani(kamalakannan)" ka...@manageengine.com

Sent: Tuesday, October 13, 2015 9:11:30 AM

Subject: Re: [ceph-users] ceph same rbd on multiple client




Hi ,




 We have Ceph RBD with OCFS2 mounted on several servers. We are facing I/O
errors when we move a folder from one node: on the other nodes the replicated
data shows the errors below (copying does not cause any problem). As a
workaround, remounting the partition resolves the issue, but after some time
the problem reoccurs. Please help with this issue.



Note: we have 5 nodes in total; two nodes are working fine, while the other
nodes show input/output errors like those below on the moved data.



ls -althr 

ls: cannot access LITE_3_0_M4_1_TEST: Input/output error 

ls: cannot access LITE_3_0_M4_1_OLD: Input/output error 

total 0 

d? ? ? ? ? ? LITE_3_0_M4_1_TEST 

d? ? ? ? ? ? LITE_3_0_M4_1_OLD 



Regards

Prabu






 On Fri, 22 May 2015 17:33:04 +0530 Frédéric Nass 
frederic.n...@univ-lorraine.fr wrote 




Hi,



While waiting for CephFS, you can use a clustered filesystem like OCFS2 or GFS2
on top of RBD mappings, so that each host can access the same device through a
cluster-aware filesystem.
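
A minimal sketch of that setup, with a made-up image name and size, and
assuming the OCFS2 cluster stack (o2cb) is already configured on every host:

  rbd create shared-img --size 102400   # once, from any node (size in MB)
  rbd map shared-img                    # on each host, maps to e.g. /dev/rbd0
  mkfs.ocfs2 /dev/rbd0                  # once, from ONE host only
  mount -t ocfs2 /dev/rbd0 /mnt         # on each host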



Regards,



Frédéric.



On 21/05/2015 16:10, gjprabu wrote:





-- Frédéric Nass Sous direction des Infrastructures, Direction du Numérique, 
Université de Lorraine. Tél : 03.83.68.53.83
___ 

ceph-users mailing list 

ceph-users@lists.ceph.com 

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


Hi All,



We are using RBD and map the same RBD image to an RBD device on two different
clients, but I can't see data written from one client on the other until I
umount and mount the partition again. Kindly share a solution for this issue.



Example

create rbd image named foo

map foo to /dev/rbd0 on server A,   mount /dev/rbd0 to /mnt

map foo to /dev/rbd0 on server B,   mount /dev/rbd0 to /mnt



Regards

Prabu








___ ceph-users mailing list 
ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Dan van der Ster
Hi Goncalo,

On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges
 wrote:
> Hi Sage...
>
> I've seen that the rh6 derivatives have been ruled out.
>
> This is a problem in our case since the OS choice in our systems is,
> somehow, imposed by CERN. The experiments software is certified for SL6 and
> the transition to SL7 will take some time.

Are you accessing Ceph directly from "physics" machines? Here at CERN
we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
time we upgrade to Infernalis the servers will all be CentOS 7 as
well. Batch nodes running SL6 don't (currently) talk to Ceph directly
(in the future they might talk to Ceph-based storage via an xroot
gateway). But if there are use-cases then perhaps we could find a
place to build and distribute the newer ceph clients.

There's a ML ceph-t...@cern.ch where we could take this discussion.
Mail me if you have trouble joining that e-Group.

Cheers, Dan
CERN IT-DSS

> This is kind of a showstopper especially if we can't deploy clients in SL6 /
> Centos6.
>
> Is there any alternative?
>
> TIA
> Goncalo
>
>
>
> On 10/14/2015 08:01 AM, Sage Weil wrote:
>>
>> This is the first Infernalis release candidate.  There have been some
>> major changes since hammer, and the upgrade process is non-trivial.
>> Please read carefully.
>>
>> Getting the release candidate
>> -
>>
>> The v9.1.0 packages are pushed to the development release repositories::
>>
>>http://download.ceph.com/rpm-testing
>>http://download.ceph.com/debian-testing
>>
> >> For more info, see::
>>
>>http://docs.ceph.com/docs/master/install/get-packages/
>>
>> Or install with ceph-deploy via::
>>
>>ceph-deploy install --testing HOST
>>
>> Known issues
>> 
>>
>> * librbd and librados ABI compatibility is broken.  Be careful
>>installing this RC on client machines (e.g., those running qemu).
>>It will be fixed in the final v9.2.0 release.
>>
>> Major Changes from Hammer
>> -
>>
>> * *General*:
>>* Ceph daemons are now managed via systemd (with the exception of
>>  Ubuntu Trusty, which still uses upstart).
>>* Ceph daemons run as the 'ceph' user instead of root.
>>* On Red Hat distros, there is also an SELinux policy.
>> * *RADOS*:
>>* The RADOS cache tier can now proxy write operations to the base
>>  tier, allowing writes to be handled without forcing migration of
>>  an object into the cache.
>>* The SHEC erasure coding support is no longer flagged as
>>  experimental. SHEC trades some additional storage space for faster
>>  repair.
>>* There is now a unified queue (and thus prioritization) of client
>>  IO, recovery, scrubbing, and snapshot trimming.
>>* There have been many improvements to low-level repair tooling
>>  (ceph-objectstore-tool).
>>* The internal ObjectStore API has been significantly cleaned up in
>> order
>>  to facilitate new storage backends like NewStore.
>> * *RGW*:
>>* The Swift API now supports object expiration.
>>* There are many Swift API compatibility improvements.
>> * *RBD*:
>>* The ``rbd du`` command shows actual usage (quickly, when
>>  object-map is enabled).
>>* The object-map feature has seen many stability improvements.
>>* Object-map and exclusive-lock features can be enabled or disabled
>>  dynamically.
>>* You can now store user metadata and set persistent librbd options
>>  associated with individual images.
>>* The new deep-flatten features allows flattening of a clone and all
>>  of its snapshots.  (Previously snapshots could not be flattened.)
>>* The export-diff command is now faster (it uses aio).  There
>> is also
>>  a new fast-diff feature.
>>* The --size argument can be specified with a suffix for units
>>  (e.g., ``--size 64G``).
>>* There is a new ``rbd status`` command that, for now, shows who has
>>  the image open/mapped.
>> * *CephFS*:
>>* You can now rename snapshots.
>>* There have been ongoing improvements around administration,
>> diagnostics,
>>  and the check and repair tools.
>>* The caching and revocation of client cache state due to unused
>>  inodes has been dramatically improved.
>>* The ceph-fuse client behaves better on 32-bit hosts.
>>
>> Distro compatibility
>> 
>>
>> We have decided to drop support for many older distributions so that we
>> can
>> move to a newer compiler toolchain (e.g., C++11).  Although it is still
>> possible
>> to build Ceph on older distributions by installing backported development
>> tools,
>> we are not building and publishing release packages for ceph.com.
>>
>> In particular,
>>
>> * CentOS 7 or later; we have dropped support for CentOS 6 (and other
>>RHEL 6 derivatives, like Scientific Linux 6).
>> * Debian Jessie 8.x or later; Debian Wheezy 7.x's g++ has incomplete
>>support for C++11 (and no 

[ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Jan Schermer
Hi,
I've been thinking about this for a while now - does Ceph really need a 
journal? Filesystems are already pretty good at committing data to disk when 
asked (and much faster too), we have external journals in XFS and Ext4...
In a scenario where client does an ordinary write, there's no need to flush it 
anywhere (the app didn't ask for it) so it ends up in pagecache and gets 
committed eventually.
If a client asks for the data to be flushed then fdatasync/fsync on the 
filestore object takes care of that, including ordering and stuff.
For reads, you just read from filestore (no need to differentiate between 
filestore/journal) - pagecache gives you the right version already.
 
Or is journal there to achieve some tiering for writes when the running 
spindles with SSDs? This is IMO the only thing ordinary filesystems don't do 
out of box even when filesystem journal is put on SSD - the data get flushed to 
spindle whenever fsync-ed (even with data=journal). But in reality, most of the 
data will hit the spindle either way and when you run with SSDs it will always 
be much slower. And even for tiering - there are already many options (bcache, 
flashcache or even ZFS L2ARC) that are much more performant and proven stable. 
I think the fact that people  have a need to combine Ceph with stuff like that 
already proves the point.

So a very interesting scenario would be to disable Ceph journal and at most use 
data=journal on ext4. The complexity of the data path would drop significantly, 
latencies decrease, CPU time is saved...  
I just feel that Ceph has lots of unnecessary complexity inside that duplicates 
what filesystems (and pagecache...) have been doing for a while now without 
eating most of our CPU cores - why don't we use that? Is it possible to disable 
journal completely?

Did I miss something that makes journal essential? 

Jan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can we place the release key on download.ceph.com?

2015-10-14 Thread Wido den Hollander
Hi,

Currently the public keys for signing the packages can be found on
git.ceph.com:
https://git.ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc

git.ceph.com doesn't have IPv6, but it also isn't mirrored to any system.

It would be handy if http://download.ceph.com/release.asc would exist.

Any objections against mirroring the pubkey there as well? If not, could
somebody do it?
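
For context, this key is what the documented package install steps pull in;
today that looks something like (Debian/Ubuntu shown):

  wget -q -O- 'https://git.ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' \
      | sudo apt-key add -

so the ask is simply that http://download.ceph.com/release.asc work as the
URL instead.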

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] download.ceph.com unreachable IPv6 [was: v9.1.0 Infernalis release candidate released]

2015-10-14 Thread Wido den Hollander
On 10/14/2015 06:50 PM, Björn Lässig wrote:
> On 10/14/2015 05:11 PM, Wido den Hollander wrote:
>>
>>
>> On 14-10-15 16:30, Björn Lässig wrote:
>>> On 10/13/2015 11:01 PM, Sage Weil wrote:
   http://download.ceph.com/debian-testing
>>>
>>> unfortunately this site is not reachable at the moment.
>>>
>> wido@wido-desktop:~$ wget -6
>> http://download.ceph.com/debian-testing/dists/wheezy/InRelease
>> […]
>> 2015-10-14 17:10:41 (306 MB/s) - ‘InRelease’ saved [6873/6873]
> 
> We tried from different locations with
> 
>  * sixxs
>  * our AS (AS201824)
>  * some testing from hetzner. (AS24940)
> 
> On some locations ''wget -6
> http://download.ceph.com/debian-testing/dists/wheezy/InRelease'' does
> not work, on some of them, it only works 1 out of 5 times.
> 

From all the locations I have access to it works just fine via IPv6 only. No
idea why it's not working for you.

Wido

> Testing is hard. They drop icmp6-ping. I have no idea, why they are
> doing this.
> 
> Their PTR record for download.ceph.com is missing, but that's not the
> point here.
> 
> 
> regards,
> 
>  Björn Lässig
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] download.ceph.com unreachable IPv6 [was: v9.1.0 Infernalis release candidate released]

2015-10-14 Thread Wido den Hollander
On 10/14/2015 06:50 PM, Björn Lässig wrote:
> On 10/14/2015 05:11 PM, Wido den Hollander wrote:
>>
>>
>> On 14-10-15 16:30, Björn Lässig wrote:
>>> On 10/13/2015 11:01 PM, Sage Weil wrote:
   http://download.ceph.com/debian-testing
>>>
>>> unfortunately this site is not reachable at the moment.
>>>
>> wido@wido-desktop:~$ wget -6
>> http://download.ceph.com/debian-testing/dists/wheezy/InRelease
>> […]
>> 2015-10-14 17:10:41 (306 MB/s) - ‘InRelease’ saved [6873/6873]
> 
> We tried from different locations with
> 
>  * sixxs
>  * our AS (AS201824)
>  * some testing from hetzner. (AS24940)
> 
> On some locations ''wget -6
> http://download.ceph.com/debian-testing/dists/wheezy/InRelease'' does
> not work, on some of them, it only works 1 out of 5 times.
> 

Could you try this:

http://eu.ceph.com/debian-testing/dists/wheezy/InRelease

eu.ceph.com is a mirror of download.ceph.com and can also be reached via
IPv6.

Wido

> Testing is hard. They drop icmp6-ping. I have no idea, why they are
> doing this.
> 
> Their PTR record for download.ceph.com is missing, but that's not the
> point here.
> 
> 
> regards,
> 
>  Björn Lässig
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Francois Lafont
Hi, and thanks to all for this good news ;)

On 13/10/2015 23:01, Sage Weil wrote:

>#. Fix the data ownership during the upgrade.  This is the preferred 
> option,
>   but is more work.  The process for each host would be to:
> 
>   #. Upgrade the ceph package.  This creates the ceph user and group.  For
>example::
> 
>  ceph-deploy install --stable infernalis HOST
> 
>   #. Stop the daemon(s).::
> 
>  service ceph stop   # fedora, centos, rhel, debian
>  stop ceph-all   # ubuntu
>  
>   #. Fix the ownership::
> 
>  chown -R ceph:ceph /var/lib/ceph
> 
>   #. Restart the daemon(s).::
> 
>  start ceph-all# ubuntu
>  systemctl start ceph.target   # debian, centos, fedora, rhel

With this (preferred) option, if I understand correctly, I should
repeat the commands above host by host. Personally, my monitors
are hosted on the OSD servers (I have no dedicated monitor server).
So, with this option, I will have OSD daemons upgraded before
monitor daemons. Is that a problem?

I ask because, during a migration to a new release, it's generally
recommended to upgrade _all_ the monitors before upgrading the
first OSD daemon.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Sage Weil
On Thu, 15 Oct 2015, Goncalo Borges wrote:
> Hi Sage, Dan...
> 
> In our case, we have strongly invested in the testing of CephFS. It seems as a
> good solution to some of the issues we currently experience regarding the use
> cases from our researchers.
> 
> While I do not see a problem in deploying Ceph cluster in SL7, I suspect that
> we will need CephFS clients in SL6 for quite some time. The problem here is
> that our researchers use a whole bunch of software provided by the CERN
> experiments to generate MC data or analyse experimental data. This software is
> currently certified for SL6 and I think that a SL7 version will take a
> considerable amount of time. So we need a CephFS client that allows our
> researchers to access and analyse the data in that environment.
> 
> If you guys did not think it was worthwhile the effort to built for those
> flavors, that actually tells me this is a complicated task that, most
> probably, I can not do it on my own.

I don't think it will be much of a problem.

First, if you're using the CephFS kernel client, the important bit is the 
kernel--you'll want something quite recent.  The OS doesn't really matter 
much.  The only piece that is of any use is mount.ceph, but it is 
optional.  It only does two semi-useful things: it resolves DNS if you 
identify your monitor(s) with something other than an IP (and actually the 
kernel can do this too if it's built with the right options) and it will 
turn a '-o secretfile=' into a '-o 
secret='.  In other words, it's optional, although it makes it 
slightly awkward not to put the ceph key in /etc/fstab.  In any case, it's 
trivial to build that binary and install/distribute it in some other 
manner.
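
For example -- monitor names/IPs and the key below are placeholders -- with
the helper you can write:

  mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret

whereas without it you typically give the monitor as an IP (unless the kernel
was built to resolve names, as above) and paste the key inline:

  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
      -o name=admin,secret=AQD...base64-key...==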

Or, you can build the ceph packages with the newer gcc.. it isn't 
that painful.  I stopped because I didn't want to have us distributing 
newer versions of the libstdc++ libraries in the ceph repositories.

If you're talking about using libcephfs or ceph-fuse, then building those 
packages is inevitable... but probably not that onerous.

sage



> 
> I am currently interacting with Dan and other colleagues in a CERN mailing
> list. Let us see what would be the outcome of that discussion.
> 
> But at the moment I am open to suggestions.
> 
> TIA
> Goncalo
> 
> On 10/14/2015 11:30 PM, Sage Weil wrote:
> > On Wed, 14 Oct 2015, Dan van der Ster wrote:
> > > Hi Goncalo,
> > > 
> > > On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges
> > >  wrote:
> > > > Hi Sage...
> > > > 
> > > > I've seen that the rh6 derivatives have been ruled out.
> > > > 
> > > > This is a problem in our case since the OS choice in our systems is,
> > > > somehow, imposed by CERN. The experiments software is certified for SL6
> > > > and
> > > > the transition to SL7 will take some time.
> > > Are you accessing Ceph directly from "physics" machines? Here at CERN
> > > we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
> > > time we upgrade to Infernalis the servers will all be CentOS 7 as
> > > well. Batch nodes running SL6 don't (currently) talk to Ceph directly
> > > (in the future they might talk to Ceph-based storage via an xroot
> > > gateway). But if there are use-cases then perhaps we could find a
> > > place to build and distribute the newer ceph clients.
> > > 
> > > There's a ML ceph-t...@cern.ch where we could take this discussion.
> > > Mail me if you have trouble joining that e-Group.
> > Also note that it *is* possible to build infernalis on el6, but it
> > requires a lot more effort... enough that we would rather spend our time
> > elsewhere (at least as far as ceph.com packages go).  If someone else
> > wants to do that work we'd be happy to take patches to update the and/or
> > release process.
> > 
> > IIRC the thing that eventually made me stop going down this patch was the
> > fact that the newer gcc had a runtime dependency on the newer libstdc++,
> > which wasn't part of the base distro... which means we'd need also to
> > publish those packages in the ceph.com repos, or users would have to
> > add some backport repo or ppa or whatever to get things running.  Bleh.
> > 
> > sage
> > 
> > 
> > > Cheers, Dan
> > > CERN IT-DSS
> > > 
> > > > This is kind of a showstopper especially if we can't deploy clients in
> > > > SL6 /
> > > > Centos6.
> > > > 
> > > > Is there any alternative?
> > > > 
> > > > TIA
> > > > Goncalo
> > > > 
> > > > 
> > > > 
> > > > On 10/14/2015 08:01 AM, Sage Weil wrote:
> > > > > This is the first Infernalis release candidate.  There have been some
> > > > > major changes since hammer, and the upgrade process is non-trivial.
> > > > > Please read carefully.
> > > > > 
> > > > > Getting the release candidate
> > > > > -
> > > > > 
> > > > > The v9.1.0 packages are pushed to the development release
> > > > > repositories::
> > > > > 
> > > > > http://download.ceph.com/rpm-testing
> > > > > http://download.ceph.com/debian-testing
> > > > 

Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Francois Lafont
Sorry, another remark.

On 13/10/2015 23:01, Sage Weil wrote:

> The v9.1.0 packages are pushed to the development release repositories::
> 
>   http://download.ceph.com/rpm-testing
>   http://download.ceph.com/debian-testing

I don't see the 9.1.0 available for Ubuntu Trusty :


http://download.ceph.com/debian-testing/dists/trusty/main/binary-amd64/Packages
(the string "9.1" is not present in this page currently)

The 9.0.3 is available but, after a quick test, this version of
the package doesn't create the ceph unix account.

Have I forgotten something?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Goncalo Borges

Hi Sage, Dan...

In our case, we have invested heavily in testing CephFS. It seems like a 
good solution to some of the issues we currently experience with the use 
cases from our researchers.


While I do not see a problem in deploying Ceph cluster in SL7, I suspect 
that we will need CephFS clients in SL6 for quite some time. The problem 
here is that our researchers use a whole bunch of software provided by 
the CERN experiments to generate MC data or analyse experimental data. 
This software is currently certified for SL6 and I think that a SL7 
version will take a considerable amount of time. So we need a CephFS 
client that allows our researchers to access and analyse the data in 
that environment.


If you did not think it was worth the effort to build for those flavors, 
that actually tells me this is a complicated task that, most probably, I 
cannot do on my own.


I am currently interacting with Dan and other colleagues in a CERN 
mailing list. Let us see what would be the outcome of that discussion.


But at the moment I am open to suggestions.

TIA
Goncalo

On 10/14/2015 11:30 PM, Sage Weil wrote:

On Wed, 14 Oct 2015, Dan van der Ster wrote:

Hi Goncalo,

On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges
 wrote:

Hi Sage...

I've seen that the rh6 derivatives have been ruled out.

This is a problem in our case since the OS choice in our systems is,
somehow, imposed by CERN. The experiments software is certified for SL6 and
the transition to SL7 will take some time.

Are you accessing Ceph directly from "physics" machines? Here at CERN
we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
time we upgrade to Infernalis the servers will all be CentOS 7 as
well. Batch nodes running SL6 don't (currently) talk to Ceph directly
(in the future they might talk to Ceph-based storage via an xroot
gateway). But if there are use-cases then perhaps we could find a
place to build and distribute the newer ceph clients.

There's a ML ceph-t...@cern.ch where we could take this discussion.
Mail me if you have trouble joining that e-Group.

Also note that it *is* possible to build infernalis on el6, but it
requires a lot more effort... enough that we would rather spend our time
elsewhere (at least as far as ceph.com packages go).  If someone else
wants to do that work we'd be happy to take patches to update the and/or
release process.

IIRC the thing that eventually made me stop going down this patch was the
fact that the newer gcc had a runtime dependency on the newer libstdc++,
which wasn't part of the base distro... which means we'd need also to
publish those packages in the ceph.com repos, or users would have to
add some backport repo or ppa or whatever to get things running.  Bleh.

sage



Cheers, Dan
CERN IT-DSS


This is kind of a showstopper especially if we can't deploy clients in SL6 /
Centos6.

Is there any alternative?

TIA
Goncalo



On 10/14/2015 08:01 AM, Sage Weil wrote:

This is the first Infernalis release candidate.  There have been some
major changes since hammer, and the upgrade process is non-trivial.
Please read carefully.

Getting the release candidate
-

The v9.1.0 packages are pushed to the development release repositories::

http://download.ceph.com/rpm-testing
http://download.ceph.com/debian-testing

For more info, see::

http://docs.ceph.com/docs/master/install/get-packages/

Or install with ceph-deploy via::

ceph-deploy install --testing HOST

Known issues


* librbd and librados ABI compatibility is broken.  Be careful
installing this RC on client machines (e.g., those running qemu).
It will be fixed in the final v9.2.0 release.

Major Changes from Hammer
-

* *General*:
* Ceph daemons are now managed via systemd (with the exception of
  Ubuntu Trusty, which still uses upstart).
* Ceph daemons run as the 'ceph' user instead of root.
* On Red Hat distros, there is also an SELinux policy.
* *RADOS*:
* The RADOS cache tier can now proxy write operations to the base
  tier, allowing writes to be handled without forcing migration of
  an object into the cache.
* The SHEC erasure coding support is no longer flagged as
  experimental. SHEC trades some additional storage space for faster
  repair.
* There is now a unified queue (and thus prioritization) of client
  IO, recovery, scrubbing, and snapshot trimming.
* There have been many improvements to low-level repair tooling
  (ceph-objectstore-tool).
* The internal ObjectStore API has been significantly cleaned up in
order
  to facilitate new storage backends like NewStore.
* *RGW*:
* The Swift API now supports object expiration.
* There are many Swift API compatibility improvements.
* *RBD*:
* The ``rbd du`` command shows actual usage (quickly, when
  object-map is enabled).
* The object-map feature has seen 

[ceph-users] Does SSD Journal improve the performance?

2015-10-14 Thread hzwuli...@gmail.com
Hi, 

An SSD journal should certainly improve IOPS, but unfortunately that is not
what I see in my test.

I have two pools with the same number of osds:
pool1, ssdj_sas:
9 osd servers, 8 OSDs(SAS) on every server
Journal on SSD, one SSD disk for 4 SAS disks.

pool 2, sas:
9 osd servers, 8 OSDs(SAS) on every server
Journal on the SAS disk itself.
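
(For reference, journal placement like this is normally chosen at OSD creation
time; with ceph-deploy the data disk and journal device are given as a pair,
e.g. -- host and device names below are placeholders:

  ceph-deploy osd create osd-server-1:sdb:/dev/sdf1   # journal on an SSD partition
  ceph-deploy osd create osd-server-1:sdc             # journal co-located on the data disk
)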

I use rbd to create a volume in pool1 and pool2 separately and use fio to test
the random-write IOPS. Here is the fio configuration:

rw=randwrite
ioengine=libaio
direct=1
iodepth=128
bs=4k
numjobs=1

The result I got is:
volume in pool1, about 5k
volume in pool2, about 12k

That's a big gap. Can anyone give me some suggestions?

ceph version: hammer(0.94.3)
kernel: 3.10



hzwuli...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does SSD Journal improve the performance?

2015-10-14 Thread Christian Balzer

Hello,

Firstly, this is clearly a ceph-users question, don't cross post to
ceph-devel.

On Thu, 15 Oct 2015 09:29:03 +0800 hzwuli...@gmail.com wrote:

> Hi, 
> 
> It should be sure SSD Journal will improve the performance of IOPS. But
> unfortunately it's not in my test.
> 
> I have two pools with the same number of osds:
> pool1, ssdj_sas:
> 9 osd servers, 8 OSDs(SAS) on every server
> Journal on SSD, one SSD disk for 4 SAS disks.
> 
Details. All of them.
Specific HW (CPU, RAM, etc.) of these servers and the network, what type of
SSDs, HDDs, controllers.

> pool 2, sas:
> 9 osd servers, 8 OSDs(SAS) on every server
> Journal on SAS disk itself.
> 
Is the HW identical to pool1 except for the journal placement?

> I use rbd to create a volume in pool1 and pool2 separately and use fio
> to test the rand write IOPS. Here is the fio configuration:
> 
> rw=randwrite
> ioengine=libaio
> direct=1
> iodepth=128
> bs=4k
> numjobs=1
> 
> The result i got is:
> volume in pool1, about 5k
> volume in pool2, about 12k
> 
Now this job will stress the CPUs quite a bit (which you should be able to
see with atop or the likes). 

However, if the HW is identical in both pools, your SSD may be one of those
that perform abysmally with direct IO.

There are plenty of threads in the ML archives about this topic.
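
The usual quick check from those threads is to benchmark the journal SSD
itself with synchronous direct 4k writes, something like (the device name is
a placeholder, and note this writes to the device):

  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60

An SSD that can't sustain that pattern will drag down every OSD journaling to
it, whatever its datasheet says.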
 
Christian

> It's a big gap here, anyone can give me some suggestion here?
> 
> ceph version: hammer(0.94.3)
> kernel: 3.10
> 
> 
> 
> hzwuli...@gmail.com


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Sage Weil
On Thu, 15 Oct 2015, Francois Lafont wrote:
> Hi and thanks at all for this good news, ;)
> 
> On 13/10/2015 23:01, Sage Weil wrote:
> 
> >#. Fix the data ownership during the upgrade.  This is the preferred 
> > option,
> >   but is more work.  The process for each host would be to:
> > 
> >   #. Upgrade the ceph package.  This creates the ceph user and group.  
> > For
> >  example::
> > 
> >ceph-deploy install --stable infernalis HOST
> > 
> >   #. Stop the daemon(s).::
> > 
> >service ceph stop   # fedora, centos, rhel, debian
> >stop ceph-all   # ubuntu
> >
> >   #. Fix the ownership::
> > 
> >chown -R ceph:ceph /var/lib/ceph
> > 
> >   #. Restart the daemon(s).::
> > 
> >start ceph-all# ubuntu
> >systemctl start ceph.target   # debian, centos, fedora, rhel
> 
> With this (preferred) option, if I understand well, I should
> repeat these commands above host-by-host. Personally, my monitors
> are hosted in the OSD servers (I have no dedicated monitor server).
> So, with this option, I will have osd daemons upgraded before
> monitor daemons. Is it a problem?

No.  You can also chown -R /var/lib/ceph/mon and /var/lib/ceph/osd 
separately. 

> I ask the question because, during a migration to a new release,
> it's generally recommended to upgrade _all_ the monitors before
> to upgrade the first osd daemon.

Doing all the monitors is recommended, but not strictly required.

Also note that the chown on the OSD dirs can take a very long time 
(hours).  I suspect we should revise the recommendation to do it for the 
mons and not the osds... or at least give a better warning about how long 
it takes.  (And I'm very interested in hearing what peoples' experiences 
are here...)
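
(For what it's worth, a rough per-host sequence -- assuming the default
/var/lib/ceph layout, with the parallel chown only as an idea for shortening
the OSD step; anything else under /var/lib/ceph needs the same treatment:

  service ceph stop                          # or: stop ceph-all on Ubuntu
  chown -R ceph:ceph /var/lib/ceph/mon       # quick
  for d in /var/lib/ceph/osd/ceph-*; do
      chown -R ceph:ceph "$d" &              # the slow part, one job per OSD
  done
  wait
  systemctl start ceph.target                # or: start ceph-all on Ubuntu
)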

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Sage Weil
On Thu, 15 Oct 2015, Francois Lafont wrote:

> Sorry, another remark.
> 
> On 13/10/2015 23:01, Sage Weil wrote:
> 
> > The v9.1.0 packages are pushed to the development release repositories::
> > 
> >   http://download.ceph.com/rpm-testing
> >   http://download.ceph.com/debian-testing
> 
> I don't see the 9.1.0 available for Ubuntu Trusty :
> 
> 
> http://download.ceph.com/debian-testing/dists/trusty/main/binary-amd64/Packages
> (the string "9.1" is not present in this page currently)
> 
> The 9.0.3 is available but, after a quick test, this version of
> the package doesn't create the ceph unix account.

You're right.. I see jessie but not trusty in the archive.  Alfredo, can 
you verify it synced properly?

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] avoid 3-mds fs laggy on 1 rejoin?

2015-10-14 Thread Dzianis Kahanovich

Yan, Zheng writes:


2) I have 3 active MDS now. I tried it, it works, and I am keeping it.
Restart is still problematic.



multiple active MDS is not ready for production.


OK, so if I am running 3 active now (and it looks good), is it better to go back to 1?


3) Yes, there are more caps on the master VM (4.2.3 kernel mount; it is a
web+mail+heartbeat cluster of 2 VMs) on the apache root. That is where the
previously described CLONE_FS -> CLONE_VFORK deadlocks occurred (they no
longer do). But 4.2.3 was installed just before these tests; it was 4.1.8
before, with similar effects (the logs are from 4.2.3 on the VM clients).


I suspect your problem is due to some mount having too many open files.
During MDS failover, the MDS needs to open these files, which takes a
long time.


Can some kind of cache improve behaviour?

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Sage Weil
On Wed, 14 Oct 2015, Dan van der Ster wrote:
> Hi Goncalo,
> 
> On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges
>  wrote:
> > Hi Sage...
> >
> > I've seen that the rh6 derivatives have been ruled out.
> >
> > This is a problem in our case since the OS choice in our systems is,
> > somehow, imposed by CERN. The experiments software is certified for SL6 and
> > the transition to SL7 will take some time.
> 
> Are you accessing Ceph directly from "physics" machines? Here at CERN
> we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
> time we upgrade to Infernalis the servers will all be CentOS 7 as
> well. Batch nodes running SL6 don't (currently) talk to Ceph directly
> (in the future they might talk to Ceph-based storage via an xroot
> gateway). But if there are use-cases then perhaps we could find a
> place to build and distribute the newer ceph clients.
> 
> There's a ML ceph-t...@cern.ch where we could take this discussion.
> Mail me if you have trouble joining that e-Group.

Also note that it *is* possible to build infernalis on el6, but it 
requires a lot more effort... enough that we would rather spend our time 
elsewhere (at least as far as ceph.com packages go).  If someone else 
wants to do that work we'd be happy to take patches to update the and/or 
release process.

IIRC the thing that eventually made me stop going down this patch was the 
fact that the newer gcc had a runtime dependency on the newer libstdc++, 
which wasn't part of the base distro... which means we'd need also to 
publish those packages in the ceph.com repos, or users would have to 
add some backport repo or ppa or whatever to get things running.  Bleh.

sage


> 
> Cheers, Dan
> CERN IT-DSS
> 
> > This is kind of a showstopper especially if we can't deploy clients in SL6 /
> > Centos6.
> >
> > Is there any alternative?
> >
> > TIA
> > Goncalo
> >
> >
> >
> > On 10/14/2015 08:01 AM, Sage Weil wrote:
> >>
> >> This is the first Infernalis release candidate.  There have been some
> >> major changes since hammer, and the upgrade process is non-trivial.
> >> Please read carefully.
> >>
> >> Getting the release candidate
> >> -
> >>
> >> The v9.1.0 packages are pushed to the development release repositories::
> >>
> >>http://download.ceph.com/rpm-testing
> >>http://download.ceph.com/debian-testing
> >>
> >> For more info, see::
> >>
> >>http://docs.ceph.com/docs/master/install/get-packages/
> >>
> >> Or install with ceph-deploy via::
> >>
> >>ceph-deploy install --testing HOST
> >>
> >> Known issues
> >> 
> >>
> >> * librbd and librados ABI compatibility is broken.  Be careful
> >>installing this RC on client machines (e.g., those running qemu).
> >>It will be fixed in the final v9.2.0 release.
> >>
> >> Major Changes from Hammer
> >> -
> >>
> >> * *General*:
> >>* Ceph daemons are now managed via systemd (with the exception of
> >>  Ubuntu Trusty, which still uses upstart).
> >>* Ceph daemons run as the 'ceph' user instead of root.
> >>* On Red Hat distros, there is also an SELinux policy.
> >> * *RADOS*:
> >>* The RADOS cache tier can now proxy write operations to the base
> >>  tier, allowing writes to be handled without forcing migration of
> >>  an object into the cache.
> >>* The SHEC erasure coding support is no longer flagged as
> >>  experimental. SHEC trades some additional storage space for faster
> >>  repair.
> >>* There is now a unified queue (and thus prioritization) of client
> >>  IO, recovery, scrubbing, and snapshot trimming.
> >>* There have been many improvements to low-level repair tooling
> >>  (ceph-objectstore-tool).
> >>* The internal ObjectStore API has been significantly cleaned up in
> >> order
> >>  to facilitate new storage backends like NewStore.
> >> * *RGW*:
> >>* The Swift API now supports object expiration.
> >>* There are many Swift API compatibility improvements.
> >> * *RBD*:
> >>* The ``rbd du`` command shows actual usage (quickly, when
> >>  object-map is enabled).
> >>* The object-map feature has seen many stability improvements.
> >>* Object-map and exclusive-lock features can be enabled or disabled
> >>  dynamically.
> >>* You can now store user metadata and set persistent librbd options
> >>  associated with individual images.
> >>* The new deep-flatten features allows flattening of a clone and all
> >>  of its snapshots.  (Previously snapshots could not be flattened.)
> >>* The export-diff command is now faster (it uses aio).  There
> >> is also
> >>  a new fast-diff feature.
> >>* The --size argument can be specified with a suffix for units
> >>  (e.g., ``--size 64G``).
> >>* There is a new ``rbd status`` command that, for now, shows who has
> >>  the image open/mapped.
> >> * *CephFS*:
> >>* You can now rename snapshots.
> 

Re: [ceph-users] Ceph OSD on ZFS

2015-10-14 Thread Christian Balzer

Hello,

On Wed, 14 Oct 2015 09:25:41 +1000 Lindsay Mathieson wrote:

> I'm adding a node (4 * WD RED 3TB) to our small cluster to bring it up to
> replica 3. 

Can we assume from this node that your current setup is something like
2 nodes with 4 drives each?

> Given how much headache it has been managing multiple osd's

Any particularities?

> (including disk failures) 
They happen, but given your (assumed) cluster size they should be pretty rare.

>on my other nodes, I've decided to put all 4
> disks on the new node in a ZFS RAID 10 config with SSD SLOG & Cache with
> just one OSD running on top of a ZFS filesystem mount.
> 

Well, well, well...

I assume you did read all the past threads about ZFS as backing FS for
Ceph?

Can your current cluster handle a deep scrub during average, peak time
utilization without falling apart and getting unusably slow?

> I only have one SSD available for cache and journaling, would I be better
> off doing one the following?
> 
> Option 1:
> - Journal on the ZFS Filesystem (advantage - simplicity)
> 
COW and journal... not so good, I'd imagine.

> Option 2:
> - Journal on a SSD partition, for a total of 3 (SLOG, Cache, Ceph
> Journal). Advantage - possibly better ceph performance?
>
Probably.
 
> Option 3
> - Something else?
>

The more OSDs, the better Ceph will perform (up to a point). 

Note that I have similar clusters to avoid dealing with disk failures, but
unless you can afford to upgrade in a timely fashion and pay for what
is effectively 4-way replication, you may want to reconsider your
approach.

If your cluster is really so lightly loaded that you can sustain
performance with a reduced number of OSDs, I'd go with RAID1 OSDs and a
replication of 2.
That way, adding the additional node and rebuilding the old ones will
only slightly decrease your OSD count (from an assumed 8 to 6).

> Also - should I put the monitor on ZFS as well?
> 
leveldb and COW, also probably not so good.

Christian
> If this works out, I'll probably migrate the other two nodes  to a
> similar setup.
> 
> thanks,
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Jason Dillaman
> Can you elaborate on that? I don't think there needs to be a difference. Ceph
> is hosting mostly filesystems, so it's all just a bunch of filesystem
> transactions anyway...
> 

There is some additional background information here [1].  The XFS journal 
protects "atomic" (for lack of a better word) actions that actually require 
multiple disk writes in implementation.  The Ceph journal acts similarly to 
ensure that Ceph-level "atomic" actions can be consistently applied to the 
underlying filesystem in case of failure (e.g. if ceph-osd crashed in the 
middle of one of these compound "atomic" Ceph metadata updates -- XFS wouldn't 
know how to get the system back to a consistent state).

As Ilya alluded to, the forthcoming NewStore is able to avoid costly 
double-writes in certain scenarios such as create, append and overwrite 
operations by decoupling objects from the underlying filesystem's actual 
storage path.

[1] 
https://github.com/ceph/ceph/blob/master/doc/rados/configuration/journal-ref.rst

-- 

Jason Dillaman 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS file to rados object mapping

2015-10-14 Thread Francois Lafont
Hi,

On 14/10/2015 06:45, Gregory Farnum wrote:

>> Ok, however during my tests I had been careful to replace the correct
>> file by a bad file with *exactly* the same size (the content of the
>> file was just a little string and I have changed it by a string with
>> exactly the same size). I had been careful to undo the mtime update
>> too (I had restore the mtime of the file before the change). Despite
>> this, the "repair" command worked well. Tested twice: 1. with the change
>> on the primary OSD and 2. on the secondary OSD. And I was surprised
>> because I though the test 1. (in primary OSD) will fail.
> 
> Hm. I'm a little confused by that, actually. Exactly what was the path
> to the files you changed, and do you have before-and-after comparisons
> on the content and metadata?

I didn't remember exactly the process I had used, so I have just retried it
today. Here is my process. I have a healthy cluster with 3 nodes (Ubuntu
Trusty) running Ceph Hammer (version 0.94.3). I have mounted cephfs on
/mnt on one of the nodes.

~# cat /mnt/file.txt # yes it's a little file. ;)
123456

~# ls -i /mnt/file.txt 
1099511627776 /mnt/file.txt

~# printf "%x\n" 1099511627776
100

~# rados -p data ls - | grep 100
100.

I have the name of the object mapped to my "file.txt".

~# ceph osd map data 100.
osdmap e76 pool 'data' (3) object '100.' -> pg 3.f0b56f30 
(3.30) -> up ([1,2], p1) acting ([1,2], p1)
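
(The same lookup, wrapped as a tiny helper -- a sketch that assumes the file
fits in a single object and lives in the "data" pool; CephFS data objects are
named <inode in hex>.<object index as 8 hex digits>:

  cephfs_locate() {
      ino=$(stat -c %i "$1")
      obj=$(printf '%x.%08x' "$ino" 0)
      ceph osd map data "$obj"
  }

  cephfs_locate /mnt/file.txt
)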

So my object is in the primary OSD OSD-1 and in the secondary OSD OSD-2.
So I open a terminal in the node which hosts the primary OSD OSD-1 and
then:

~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 
123456

~# ll 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 
-rw-r--r-- 1 root root 7 Oct 15 03:46 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3

Now, I change the content with this script called "change_content.sh" to
preserve the mtime after the change:

-
#!/bin/sh

f="$1"
f_tmp="${f}.tmp"
content="$2"
cp --preserve=all "$f" "$f_tmp"
echo "$content" >"$f"
touch -r "$f_tmp" "$f" # to restore the mtime after the change
rm "$f_tmp"
-

So, let's go: I replace the content with new content of exactly
the same size (i.e. "ABCDEF" in this example):

~# ./change_content.sh 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 ABCDEF

~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 
ABCDEF

~# ll 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 
-rw-r--r-- 1 root root 7 Oct 15 03:46 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3

Now, the secondary OSD contains the good version of the object and
the primary a bad version. Now, I launch a "ceph pg repair":

~# ceph pg repair 3.30
instructing pg 3.30 on osd.1 to repair

# I'm in the primary OSD and the file below has been repaired correctly.
~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 
123456

As you can see, the repair command has worked well.
Maybe my little test is too trivial?

>> Greg, if I understand you well, I shouldn't have too much confidence in
>> the "ceph pg repair" command, is it correct?
>>
>> But, if yes, what is the good way to repair a PG?
> 
> Usually what we recommend is for those with 3 copies to find the
> differing copy, delete it, and run a repair — then you know it'll
> repair from a good version. But yeah, it's not as reliable as we'd
> like it to be on its own.

I would like to be sure to well understand. The process could be (in
the case where size == 3):

1. In each of the 3 OSDs where my object is put:

md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

2. Normally, I will have the same result in 2 OSDs, and in the other
OSD, let's call it OSD-X, the result will be different. So, in the OSD-X,
I run:

rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

3. And now I can run the "ceph pg repair" command without risk:

ceph pg repair $pg_id
 
Is it the correct process?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com