[ceph-users] Effect of tunables on client system load

2017-06-08 Thread Nathanial Byrnes
Hi All,
   First, some background:
   I have been running a small (4 compute nodes) xen server cluster
backed by both a small ceph cluster (4 other nodes with a total of 18x 1-spindle
OSDs) and a small gluster cluster (2 nodes each with a 14-spindle RAID
array). I started with gluster 3-4 years ago, at first using NFS to access
gluster, then upgraded to gluster FUSE. However, I had been fascinated with
ceph since I first read about it, and probably added ceph as soon as XCP
released a kernel with RBD support, possibly approaching 2 years ago.
   With Ceph, since I started out with the kernel RBD, I believe it
locked me to Bobtail tunables. I connected to XCP via a project that tricks
XCP into running LVM on the RBDs, managing all this through the iSCSI mgmt
infrastructure somehow... Only recently I've switched to a newer project
that uses the RBD-NBD mapping instead. This should let me use whatever
tunables my client SW supports, AFAIK. I have not yet changed my tunables as
the data re-org will probably take a day or two (only 1Gb networking...).

   Over this time period, I've observed that my gluster backed guests tend
not to consume as much of domain-0's (the Xen VM management host) resources
as do my Ceph backed guests. To me, this is somewhat intuitive, as the ceph
client has to do more "thinking" than the gluster client. However, it seems
to me that the IO performance of the VM guests is well outside what the
difference in spindle count would suggest. I am open to the notion that
there are probably quite a few sub-optimal design choices/constraints
within the environment. However, I haven't the resources to conduct all
that many experiments and benchmarks. So, over time I've ended up
treating ceph as my resilient storage, and gluster as my more performant
(3x vs 2x replication, and, as mentioned above, my gluster guests had
quicker guest IO and lower dom-0 load).

So, on to my questions:

   Would setting my tunables to jewel (my present release), or anything
newer than bobtail (which is what I think I am set to if I read the ceph
status warning correctly) reduce my dom-0 load and/or improve any aspects
of the client IO performance?
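
   For reference, the commands I believe are involved here (a sketch only; I
have not run the second one yet, since it will trigger a large data reshuffle
over my 1Gb network):

   ceph osd crush show-tunables     # show the tunables profile currently in effect
   ceph osd crush tunables jewel    # switch profiles; this causes data rebalancing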

   Will adding nodes to the ceph cluster reduce load on dom-0, and/or
improve client IO performance (I doubt the former and would expect the
latter...)?

   So, why did I bring up gluster at all? In an ideal world, I would like
to have just one storage environment that would satisfy all my
organization's needs. If forced to choose with the knowledge I have today, I
would have to select gluster. I am hoping to come up with some actionable
data points that might help me discover some of my mistakes which might
explain my experience to date and maybe even help remedy said mistakes. As
I mentioned earlier, I like ceph, more so than gluster, and would like to
employ more within my environment. But, given budgetary constraints, I need
to do what's best for my organization.

   Thanks in advance,
   Nate
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD node type/count mixes in the cluster

2017-06-08 Thread Deepak Naidu
Wanted to check if anyone has a ceph cluster which has mixed-vendor servers,
all with the same disk size (i.e. 8TB) but different disk counts. For example,
10 OSD servers from Dell with 60 disks per server and another 10 OSD servers
from HP with 26 disks per server.

If so, does that change any performance dynamics, or is it not advisable?
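
For context, one way to see how CRUSH would weight and fill such a mix is to
compare per-host weights and per-OSD utilization (a sketch; output columns vary
a bit by release):

ceph osd tree       # per-host CRUSH weights
ceph osd df tree    # weights plus per-OSD/per-host utilization (hammer and later)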

--
Deepak
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-08 Thread Brad Hubbard
On Thu, Jun 8, 2017 at 11:31 PM, Alejandro Comisario
 wrote:
> Hi Brad.
> Taking into consideration the unlikely possibility that someone
> realizes what the problem is in this specific case, that would be
> highly appreciated.
>
> I presume that, since I am running jewel, even if you can somehow remediate
> this, it will not be something I will be able to have on this deployment, right?

I can propose a backport but whether it is included or not is not up to me.
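
In the meantime, the retry limit itself (the choose_total_tries crush tunable)
can be raised by editing the decompiled crushmap. A rough sketch -- 100 is only
an example value, and re-injecting the map can move data around:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, e.g.:  tunable choose_total_tries 100
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new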

>
> best.
>
> On Thu, Jun 8, 2017 at 2:20 AM, Brad Hubbard  wrote:
>> On Thu, Jun 8, 2017 at 2:59 PM, Alejandro Comisario
>>  wrote:
>>> ha!
>>> is there ANY way of knowing when this peering maximum has been reached for a
>>> PG?
>>
>> Not currently AFAICT.
>>
>> It takes place deep in this C code that is shared between the kernel
>> and userspace implementations.
>>
>> https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L444
>>
>> Whilst the kernel implementation generates some output the userspace
>> code does not. I'm looking at how that situation can be improved.
>>
>>>
>>> On Jun 7, 2017 20:21, "Brad Hubbard"  wrote:

 On Wed, Jun 7, 2017 at 5:13 PM, Peter Maloney
  wrote:

 >
 > Now if only there was a log or warning seen in ceph -s that said the
 > tries was exceeded,

 Challenge accepted.

 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Cheers,
 Brad
>>
>>
>>
>> --
>> Cheers,
>> Brad
>
>
>
> --
> Alejandro Comisario
> CTO | NUBELIU
> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache mode readforward mode will eat your babies?

2017-06-08 Thread Christian Balzer
On Thu, 8 Jun 2017 07:06:04 -0400 Alfredo Deza wrote:

> On Thu, Jun 8, 2017 at 3:38 AM, Christian Balzer  wrote:
> > On Thu, 8 Jun 2017 17:03:15 +1000 Brad Hubbard wrote:
> >  
> >> On Thu, Jun 8, 2017 at 3:47 PM, Christian Balzer  wrote:  
> >> > On Thu, 8 Jun 2017 15:29:05 +1000 Brad Hubbard wrote:
> >> >  
> >> >> On Thu, Jun 8, 2017 at 3:10 PM, Christian Balzer  wrote: 
> >> >>  
> >> >> > On Thu, 8 Jun 2017 14:21:43 +1000 Brad Hubbard wrote:
> >> >> >  
> >> >> >> On Thu, Jun 8, 2017 at 1:06 PM, Christian Balzer  
> >> >> >> wrote:  
> >> >> >> >
> >> >> >> > Hello,
> >> >> >> >
> >> >> >> > New cluster, Jewel, setting up cache-tiering:
> >> >> >> > ---
> >> >> >> > Error EPERM: 'readforward' is not a well-supported cache mode and 
> >> >> >> > may corrupt your data.  pass --yes-i-really-mean-it to force.
> >> >> >> > ---
> >> >> >> >
> >> >> >> > That's new and certainly wasn't there in Hammer, nor did it whine 
> >> >> >> > about
> >> >> >> > this when upgrading my test cluster to Jewel.
> >> >> >> >
> >> >> >> > And speaking of whining, I did that about this and readproxy, but 
> >> >> >> > not
> >> >> >> > their stability (readforward has been working nearly a year 
> >> >> >> > flawlessly in
> >> >> >> > the test cluster) but their lack of documentation.
> >> >> >> >
> >> >> >> > So while of course there is no warranty for anything with OSS, is 
> >> >> >> > there
> >> >> >> > any real reason for the above scaremongering or is that based 
> >> >> >> > solely on
> >> >> >> > lack of testing/experience?  
> >> >> >>
> >> >> >> https://github.com/ceph/ceph/pull/8210 and
> >> >> >> https://github.com/ceph/ceph/pull/8210/commits/90fe8e3d0b1ded6d14a6a43ecbd6c8634f691fbe
> >> >> >> may offer some insight.
> >> >> >>  
> >> >> > They do, alas of course immediately raise the following questions:
> >> >> >
> >> >> > 1. Where is that mode documented?  
> >> >>
> >> >> It *was* documented by,
> >> >> https://github.com/ceph/ceph/pull/7023/commits/d821acada39937b9dacf87614c924114adea8a58
> >> >> in https://github.com/ceph/ceph/pull/7023 but was removed by
> >> >> https://github.com/ceph/ceph/commit/6b6b38163b7742d97d21457cf38bdcc9bde5ae1a
> >> >> in https://github.com/ceph/ceph/pull/9070
> >> >>  
> >> >
> >> > I was talking about proxy, which isn't AFAICT, nor is there a BIG bold 
> >> > red  
> >>
> >> That was hard to follow for me, in a thread titled "Cache mode
> >> readforward mode will eat your babies?".
> >>  
> > Context, the initial github bits talk about proxy.
> >
> > Anyways, the documentation is in utter shambles and wrong and this really
> > really should have been mentioned more clearly in the release notes, but
> > then again none of the other cache changes were, never mind the wrong
> > osd_tier_promote_max* defaults.
> >
> > So for the record:
> >
> > The readproxy mode does what the old documentation states and proxies
> > objects through the cache-tier when being read w/o promoting them[*], while
> > writing objects will go into cache-tier as usual and with the
> > rate configured.
> >
> > [*]
> > Pro-Tip: It does however do the silent 0 byte object creation for reads,
> > so your cache-tier storage performance will be somewhat impacted, in
> > addition to the CPU usage there that readforward would have also avoided.
> > This is important when considering the value for "target_max_objects", as a
> > writeback mode cache will likely evict things based on space used and
> > reach a natural upper object limit.
> > For example an existing cache-tier in writeback mode here has a 2GB size
> > and 560K objects, 13.4TB and 3.6M objects on the backing storage.
> > With readproxy and a similar sized cluster I'll be setting
> > "target_max_objects" to something around 2M to avoid needless eviction and
> > then re-creation of null objects when things are read.  
> 
> Thank you for taking the time to explain this in the mailing list,
> could you help us in submitting a pull request with this
> documentation addition?
> 

I'll review that whole page again, it's riddled with stuff.
Like the eviction settings suddenly talking about flushing,
which doesn't help when most people are confused by those two things
initially anyway.
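
For the record, the knobs discussed above boil down to something like this (a
sketch; the pool name is a placeholder, and depending on the release readproxy
may also want the --yes-i-really-mean-it flag):

ceph osd tier cache-mode <cache-pool> readproxy
ceph osd pool set <cache-pool> target_max_objects 2000000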

Christian

> I would be happy to review and merge.
> >
> > Christian
> >  
> >> > statement in the release notes (or docs) for everybody to switch from
> >> > (read)forward to (read)proxy.
> >> >
> >> > And the two bits up there have _very_ conflicting statements about what
> >> > readproxy does, the older one would do what I want (at the cost of
> >> > shuffling all through the cache-tier network pipes), the newer one seems
> >> > to be actually describing the proxy functionality (no new objects i.e 
> >> > from
> >> > writes being added).
> >> >
> >> > I'll be ready to play with my new cluster in a bit and shall investigate
> >> > what does actually what.
> >> >
> >> > Christian
> >> >  
> >> >> HTH.
> >> >>  
> >> >> >
> >> >> > 2. The release notes aren't any pa

Re: [ceph-users] rados rm: device or resource busy

2017-06-08 Thread Brad Hubbard
I can reproduce this.

The key is to look at debug logging on the primary.

2017-06-09 09:30:14.776355 7f9cf26a4700 20 
/home/brad/working/src/ceph3/src/cls/lock/cls_lock.cc:247: lock_op
2017-06-09 09:30:14.776359 7f9cf26a4700 20 
/home/brad/working/src/ceph3/src/cls/lock/cls_lock.cc:162: requested
lock_type=exclusive fail_if_exists=1
2017-06-09 09:30:14.776363 7f9cf26a4700 10 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] do_osd_op 0:6d521d9c:::testfile.:head
[getxattr lock.striper.lock]
2017-06-09 09:30:14.776372 7f9cf26a4700 10 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] do_osd_op  getxattr lock.striper.lock
2017-06-09 09:30:14.776383 7f9cf26a4700 15
filestore(/home/brad/working/src/ceph3/build/dev/osd0) getattr
0.6_head/#0:6d521d9c:::testfile.:head#
'_lock.striper.lock'
2017-06-09 09:30:14.776408 7f9cf26a4700 10
filestore(/home/brad/working/src/ceph3/build/dev/osd0) getattr
0.6_head/#0:6d521d9c:::testfile.:head#
'_lock.striper.lock' = 126
2017-06-09 09:30:14.776419 7f9cf26a4700 20 
/home/brad/working/src/ceph3/src/cls/lock/cls_lock.cc:189: cannot take
lock on object, conflicting tag
2017-06-09 09:30:14.776422 7f9cf26a4700 10 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] method called response length=0
2017-06-09 09:30:14.776432 7f9cf26a4700 10 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean]  dropping ondisk_read_lock
2017-06-09 09:30:14.776445 7f9cf26a4700 20 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean]  op order client.4122 tid 1 (first)
2017-06-09 09:30:14.776453 7f9cf26a4700 20 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] execute_ctx update_log_only -- result=-16
2017-06-09 09:30:14.776468 7f9cf26a4700 20 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] record_write_error r=-16
2017-06-09 09:30:14.776478 7f9cf26a4700 10 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] submit_log_entries 10'32 (0'0) error
0:6d521d9c:::testfile.:head by client.4122.0:1
0.00 -16
2017-06-09 09:30:14.776490 7f9cf26a4700 10 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] new_repop: repgather(0x565246704a80 10'32 rep_tid=33
committed?=0 applied?=0 r=-16)
2017-06-09 09:30:14.776502 7f9cf26a4700 10 osd.0 pg_epoch: 10 pg[0.6(
v 10'31 (0'0,10'31] local-lis/les=8/9 n=2 ec=1/1 lis/c 8/8 les/c/f
9/9/0 8/8/4) [0,1,2] r=0 lpr=8 crt=10'28 lcod 10'30 mlcod 10'27
active+clean] merge_new_log_entries 10'32 (0'0) error
0:6d521d9c:::testfile.:head by client.4122.0:1
0.00 -16
2017-06-09 09:30:14.776514 7f9cf26a4700 20 update missing, append
10'32 (0'0) error0:6d521d9c:::testfile.:head by
client.4122.0:1 0.00 -16

Specifically this.

/home/brad/working/src/ceph3/src/cls/lock/cls_lock.cc:189: cannot take
lock on object, conflicting tag

That's here where you will notice it is returning EBUSY which is error
code 16, "Device or resource busy".

https://github.com/badone/ceph/blob/wip-ceph_test_admin_socket_output/src/cls/lock/cls_lock.cc#L189

In order to remove the existing parts of the file you should be able
to just run "rados --pool testpool ls" and remove the listed objects
belonging to "testfile".

Example:
rados --pool testpool ls
testfile.0004
testfile.0001
testfile.
testfile.0003
testfile.0005
testfile.0002

rados --pool testpool rm testfile.
rados --pool testpool rm testfile.0001
...
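
Or, to sweep up every part in one go (a sketch, assuming the "testfile." prefix
shown above):

for obj in $(rados --pool testpool ls | grep '^testfile\.'); do
    rados --pool testpool rm "$obj"
done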

Please open a tracker for this so it can be investigated further.

On Fri, Jun 9, 2017 at 1:43 AM, Jan Kasprzak  wrote:
> Hello,
>
> David Turner wrote:
> : How long have you waited?
>
> About a day.
>
> : I don't do much with rados objects directly.  I usually use RBDs and
> : cephfs.  If you just need to clean things up, you can delete the pool and
> : recreate it since it looks like it's testing.  However this is prob

[ceph-users] Living with huge bucket sizes

2017-06-08 Thread Bryan Stillwell
This has come up quite a few times before, but since I was only working with
RBD before I didn't pay too close attention to the conversation.  I'm looking
for the best way to handle existing clusters that have buckets with a large
number of objects (>20 million) in them.  The cluster I'm doing tests on is
currently running hammer (0.94.10), so if things got better in jewel I would
love to hear about it!

One idea I've played with is to create a new SSD pool by adding an OSD
to every journal SSD.  My thinking was that our data is mostly small
objects (~100KB) so the journal drives were unlikely to be getting close
to any throughput limitations.  They should also have plenty of IOPs
left to handle the .rgw.buckets.index pool.

So on our test cluster I created a separate root that I called
rgw-buckets-index, I added all the OSDs I created on the journal SSDs,
and created a new crush rule to place data on it:

ceph osd crush rule create-simple rgw-buckets-index_ruleset rgw-buckets-index 
chassis

Once everything was set up correctly I tried switching the
.rgw.buckets.index pool over to it by doing:

ceph osd set norebalance
ceph osd pool set .rgw.buckets.index crush_ruleset 1
# Wait for peering to complete
ceph osd unset norebalance

Things started off well, but once it got to backfilling the PGs which
have the large buckets on them, I started seeing a large number of slow
requests like these:

  ack+ondisk+write+known_if_redirected e68708) currently waiting for degraded 
object
  ondisk+write+known_if_redirected e68708) currently waiting for degraded object
  ack+ondisk+write+known_if_redirected e68708) currently waiting for rw locks

Digging in on the OSDs, it seems they would either restart or die after
seeing a lot of these messages:

  heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f8f5d604700' had timed 
out after 30

or:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f99ec2e4700' had timed out 
after 15

The ones that died saw messages like these:

  heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd59e7c700' had timed 
out after 60

Followed by:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd48c1d700' had suicide 
timed out after 150
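
For what it's worth, those messages map onto the following OSD options (shown
here at what I believe are their defaults); bumping them temporarily during the
index migration is one possible stopgap, not a fix:

[osd]
osd_op_thread_timeout = 15
osd_op_thread_suicide_timeout = 150
osd_recovery_thread_timeout = 30
filestore_op_thread_timeout = 60
filestore_op_thread_suicide_timeout = 180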


The backfilling process would appear to hang on some of the PGs, but I
figured out that they were recovering omap data and was able to keep an
eye on the process by running:

watch 'ceph pg 272.22 query | grep omap_recovered_to'

A lot of the timeouts happened after the PGs finished the omap recovery,
which took over an hour on one of the PGs.

Has anyone found a good solution for this for existing large buckets?  I
know sharding is the solution going forward, but afaik it can't be done
on existing buckets yet (although the dynamic resharding work mentioned
on today's performance call sounds promising).
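
For anyone trying to identify which buckets are the problem, per-bucket object
counts can be pulled from bucket stats (a sketch; the bucket name is a
placeholder):

radosgw-admin bucket stats --bucket=<bucket> | grep num_objects
radosgw-admin bucket stats | grep -E '"bucket"|"num_objects"'    # all buckets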

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-08 Thread Vaibhav Bhembre
We have an internal management service that works at a higher layer
upstream on top of multiple Ceph clusters. It needs a way to
differentiate and connect separately to each of those clusters.
Presently making that distinction is relatively easy since we create
those connections based on /etc/ceph/$cluster.conf, where each cluster
name is unique. I am not sure how this will work for us if we go away
from the way of uniquely identifying multiple clusters from a single client.

On Thu, Jun 8, 2017 at 3:37 PM, Sage Weil  wrote:
>
> At CDM yesterday we talked about removing the ability to name your ceph
> clusters.  There are a number of hurdles that make it difficult to fully
> get rid of this functionality, not the least of which is that some
> (many?) deployed clusters make use of it.  We decided that the most we can
> do at this point is remove support for it in ceph-deploy and ceph-ansible
> so that no new clusters or deployed nodes use it.
>
> The first PR in this effort:
>
> https://github.com/ceph/ceph-deploy/pull/441
>
> Background:
>
> The cluster name concept was added to allow multiple clusters to have
> daemons coexist on the same host.  At the time it was a hypothetical
> requirement for a user that never actually made use of it, and the
> support is kludgey:
>
>  - default cluster name is 'ceph'
>  - default config is /etc/ceph/$cluster.conf, so that the normal
> 'ceph.conf' still works
>  - daemon data paths include the cluster name,
>  /var/lib/ceph/osd/$cluster-$id
>which is weird (but mostly people are used to it?)
>  - any cli command you want to touch a non-ceph cluster name
> needs -C $name or --cluster $name passed to it.
>
> Also, as of jewel,
>
>  - systemd only supports a single cluster per host, as defined by $CLUSTER
> in /etc/{sysconfig,default}/ceph
>
> which you'll notice removes support for the original "requirement".
>
> Also note that you can get the same effect by specifying the config path
> explicitly (-c /etc/ceph/foo.conf) along with the various options that
> substitute $cluster in (e.g., osd_data=/var/lib/ceph/osd/$cluster-$id).
>
>
> Crap preventing us from removing this entirely:
>
>  - existing daemon directories for existing clusters
>  - various scripts parse the cluster name out of paths
>
>
> Converting an existing cluster "foo" back to "ceph":
>
>  - rename /etc/ceph/foo.conf -> ceph.conf
>  - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-*
>  - remove the CLUSTER=foo line in /etc/{default,sysconfig}/ceph
>  - reboot
>
>
> Questions:
>
>  - Does anybody on the list use a non-default cluster name?
>  - If so, do you have a reason not to switch back to 'ceph'?
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-08 Thread mmokhtar
Hi Sage, 

We do use cluster names, we do not use ceph-deploy or ceph-ansible so in
the short term it is not an issue. We have scripts that call cli
commands with the --cluster XX parameter, would that still work? What
time frame do you have in mind for removing this?

Cheers /Maged 

On 2017-06-08 21:37, Sage Weil wrote:

> At CDM yesterday we talked about removing the ability to name your ceph 
> clusters.  There are a number of hurdles that make it difficult to fully
> get rid of this functionality, not the least of which is that some 
> (many?) deployed clusters make use of it.  We decided that the most we can 
> do at this point is remove support for it in ceph-deploy and ceph-ansible 
> so that no new clusters or deployed nodes use it.
> 
> The first PR in this effort:
> 
> https://github.com/ceph/ceph-deploy/pull/441
> 
> Background:
> 
> The cluster name concept was added to allow multiple clusters to have 
> daemons coexist on the same host.  At the time it was a hypothetical
> requirement for a user that never actually made use of it, and the 
> support is kludgey:
> 
> - default cluster name is 'ceph'
> - default config is /etc/ceph/$cluster.conf, so that the normal 
> 'ceph.conf' still works
> - daemon data paths include the cluster name,
> /var/lib/ceph/osd/$cluster-$id
> which is weird (but mostly people are used to it?)
> - any cli command you want to touch a non-ceph cluster name 
> needs -C $name or --cluster $name passed to it.
> 
> Also, as of jewel,
> 
> - systemd only supports a single cluster per host, as defined by $CLUSTER 
> in /etc/{sysconfig,default}/ceph
> 
> which you'll notice removes support for the original "requirement".
> 
> Also note that you can get the same effect by specifying the config path 
> explicitly (-c /etc/ceph/foo.conf) along with the various options that 
> substitute $cluster in (e.g., osd_data=/var/lib/ceph/osd/$cluster-$id).
> 
> Crap preventing us from removing this entirely:
> 
> - existing daemon directories for existing clusters
> - various scripts parse the cluster name out of paths
> 
> Converting an existing cluster "foo" back to "ceph":
> 
> - rename /etc/ceph/foo.conf -> ceph.conf
> - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-*
> - remove the CLUSTER=foo line in /etc/{default,sysconfig}/ceph 
> - reboot
> 
> Questions:
> 
> - Does anybody on the list use a non-default cluster name?
> - If so, do you have a reason not to switch back to 'ceph'?
> 
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-08 Thread Benjeman Meekhof
Hi Sage,

We did at one time run multiple clusters on our OSD nodes and RGW
nodes (with Jewel).  We accomplished this by putting code in our
puppet-ceph module that would create additional systemd units with
appropriate CLUSTER=name environment settings for clusters not named
ceph.  IE, if the module were asked to configure OSD for a cluster
named 'test' it would copy/edit the ceph-osd service to create a
'test-osd@.service' unit that would start instances with CLUSTER=test
so they would point to the right config file, etc   Eventually on the
RGW side I started doing instance-specific overrides like
'/etc/systemd/system/ceph-rado...@client.name.d/override.conf' so as
to avoid replicating the stock systemd unit.

We gave up on multiple clusters on the OSD nodes because it wasn't
really that useful to maintain a separate 'test' cluster on the same
hardware.  We continue to need ability to reference multiple clusters
for RGW nodes and other clients. For the other example, users of our
project might have their own Ceph clusters in addition to wanting to
use ours.

If the daemon solution in the no-clustername future is to 'modify
systemd unit files to do something' we're already doing that so it's
not a big issue.  However the current modification of over-riding
CLUSTER in the environment section of systemd files does seem cleaner
than over-riding an exec command to specify a different config file
and keyring path.   Maybe systemd units could ship with those
arguments as variables for easily over-riding.
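
For anyone curious, the drop-in approach ends up looking roughly like this (a
sketch; the unit path and cluster name are just examples):

# /etc/systemd/system/ceph-osd@.service.d/override.conf
[Service]
Environment=CLUSTER=test

followed by a "systemctl daemon-reload".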

thanks,
Ben

On Thu, Jun 8, 2017 at 3:37 PM, Sage Weil  wrote:
> At CDM yesterday we talked about removing the ability to name your ceph
> clusters.  There are a number of hurdles that make it difficult to fully
> get rid of this functionality, not the least of which is that some
> (many?) deployed clusters make use of it.  We decided that the most we can
> do at this point is remove support for it in ceph-deploy and ceph-ansible
> so that no new clusters or deployed nodes use it.
>
> The first PR in this effort:
>
> https://github.com/ceph/ceph-deploy/pull/441
>
> Background:
>
> The cluster name concept was added to allow multiple clusters to have
> daemons coexist on the same host.  At the time it was a hypothetical
> requirement for a user that never actually made use of it, and the
> support is kludgey:
>
>  - default cluster name is 'ceph'
>  - default config is /etc/ceph/$cluster.conf, so that the normal
> 'ceph.conf' still works
>  - daemon data paths include the cluster name,
>  /var/lib/ceph/osd/$cluster-$id
>which is weird (but mostly people are used to it?)
>  - any cli command you want to touch a non-ceph cluster name
> needs -C $name or --cluster $name passed to it.
>
> Also, as of jewel,
>
>  - systemd only supports a single cluster per host, as defined by $CLUSTER
> in /etc/{sysconfig,default}/ceph
>
> which you'll notice removes support for the original "requirement".
>
> Also note that you can get the same effect by specifying the config path
> explicitly (-c /etc/ceph/foo.conf) along with the various options that
> substitute $cluster in (e.g., osd_data=/var/lib/ceph/osd/$cluster-$id).
>
>
> Crap preventing us from removing this entirely:
>
>  - existing daemon directories for existing clusters
>  - various scripts parse the cluster name out of paths
>
>
> Converting an existing cluster "foo" back to "ceph":
>
>  - rename /etc/ceph/foo.conf -> ceph.conf
>  - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-*
>  - remove the CLUSTER=foo line in /etc/{default,sysconfig}/ceph
>  - reboot
>
>
> Questions:
>
>  - Does anybody on the list use a non-default cluster name?
>  - If so, do you have a reason not to switch back to 'ceph'?
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-08 Thread Dan van der Ster
Hi Sage,

We need named clusters on the client side. RBD or CephFS clients, or
monitoring/admin machines all need to be able to access several clusters.

Internally, each cluster is indeed called "ceph", but the clients use
distinct names to differentiate their configs/keyrings.

Cheers, Dan


On Jun 8, 2017 9:37 PM, "Sage Weil"  wrote:

At CDM yesterday we talked about removing the ability to name your ceph
clusters.  There are a number of hurdles that make it difficult to fully
get rid of this functionality, not the least of which is that some
(many?) deployed clusters make use of it.  We decided that the most we can
do at this point is remove support for it in ceph-deploy and ceph-ansible
so that no new clusters or deployed nodes use it.

The first PR in this effort:

https://github.com/ceph/ceph-deploy/pull/441

Background:

The cluster name concept was added to allow multiple clusters to have
daemons coexist on the same host.  At the time it was a hypothetical
requirement for a user that never actually made use of it, and the
support is kludgey:

 - default cluster name is 'ceph'
 - default config is /etc/ceph/$cluster.conf, so that the normal
'ceph.conf' still works
 - daemon data paths include the cluster name,
 /var/lib/ceph/osd/$cluster-$id
   which is weird (but mostly people are used to it?)
 - any cli command you want to touch a non-ceph cluster name
needs -C $name or --cluster $name passed to it.

Also, as of jewel,

 - systemd only supports a single cluster per host, as defined by $CLUSTER
in /etc/{sysconfig,default}/ceph

which you'll notice removes support for the original "requirement".

Also note that you can get the same effect by specifying the config path
explicitly (-c /etc/ceph/foo.conf) along with the various options that
substitute $cluster in (e.g., osd_data=/var/lib/ceph/osd/$cluster-$id).


Crap preventing us from removing this entirely:

 - existing daemon directories for existing clusters
 - various scripts parse the cluster name out of paths


Converting an existing cluster "foo" back to "ceph":

 - rename /etc/ceph/foo.conf -> ceph.conf
 - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-*
 - remove the CLUSTER=foo line in /etc/{default,sysconfig}/ceph
 - reboot


Questions:

 - Does anybody on the list use a non-default cluster name?
 - If so, do you have a reason not to switch back to 'ceph'?

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-08 Thread Sage Weil
On Thu, 8 Jun 2017, Bassam Tabbara wrote:
> Thanks Sage.
> 
> > At CDM yesterday we talked about removing the ability to name your ceph 
> > clusters. 
> 
> Just to be clear, it would still be possible to run multiple ceph 
> clusters on the same nodes, right?

Yes, but you'd need to either (1) use containers (so that different 
daemons see a different /etc/ceph/ceph.conf) or (2) modify the systemd 
unit files to do... something.  

This is actually no different from Jewel. It's just that currently you can 
run a single cluster on a host (without containers) but call it 'foo' and 
knock yourself out by passing '--cluster foo' every time you invoke the 
CLI.
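
A sketch of that equivalence on the CLI (the keyring default also follows
$cluster, hence the extra flag):

ceph --cluster foo health
ceph -c /etc/ceph/foo.conf --keyring /etc/ceph/foo.client.admin.keyring health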

I'm guessing you're in the (1) case anyway and this doesn't affect you at 
all :)

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-08 Thread Bassam Tabbara
Thanks Sage.

> At CDM yesterday we talked about removing the ability to name your ceph 
> clusters. 


Just to be clear, it would still be possible to run multiple ceph clusters on 
the same nodes, right?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] removing cluster name support

2017-06-08 Thread Sage Weil
At CDM yesterday we talked about removing the ability to name your ceph 
clusters.  There are a number of hurdles that make it difficult to fully 
get rid of this functionality, not the least of which is that some 
(many?) deployed clusters make use of it.  We decided that the most we can 
do at this point is remove support for it in ceph-deploy and ceph-ansible 
so that no new clusters or deployed nodes use it.

The first PR in this effort:

https://github.com/ceph/ceph-deploy/pull/441

Background:

The cluster name concept was added to allow multiple clusters to have 
daemons coexist on the same host.  At the time it was a hypothetical 
requirement for a user that never actually made use of it, and the 
support is kludgey:

 - default cluster name is 'ceph'
 - default config is /etc/ceph/$cluster.conf, so that the normal 
'ceph.conf' still works
 - daemon data paths include the cluster name,
 /var/lib/ceph/osd/$cluster-$id
   which is weird (but mostly people are used to it?)
 - any cli command you want to touch a non-ceph cluster name 
needs -C $name or --cluster $name passed to it.

Also, as of jewel,

 - systemd only supports a single cluster per host, as defined by $CLUSTER 
in /etc/{sysconfig,default}/ceph

which you'll notice removes support for the original "requirement".

Also note that you can get the same effect by specifying the config path 
explicitly (-c /etc/ceph/foo.conf) along with the various options that 
substitute $cluster in (e.g., osd_data=/var/lib/ceph/osd/$cluster-$id).


Crap preventing us from removing this entirely:

 - existing daemon directories for existing clusters
 - various scripts parse the cluster name out of paths


Converting an existing cluster "foo" back to "ceph":

 - rename /etc/ceph/foo.conf -> ceph.conf
 - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-*
 - remove the CLUSTER=foo line in /etc/{default,sysconfig}/ceph 
 - reboot
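
A minimal shell sketch of those steps (assuming standard paths; adjust for
sysconfig vs. default):

mv /etc/ceph/foo.conf /etc/ceph/ceph.conf
for d in /var/lib/ceph/*/foo-*; do mv "$d" "${d/foo-/ceph-}"; done
sed -i '/^CLUSTER=foo/d' /etc/sysconfig/ceph    # or /etc/default/ceph
reboot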


Questions:

 - Does anybody on the list use a non-default cluster name?
 - If so, do you have a reason not to switch back to 'ceph'?

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-08 Thread nokia ceph
Thanks Jake, can you confirm which ceph version you are testing on - the one
where you noticed the out-of-memory errors? There is already a memory leak issue
reported in kraken v11.2.0, which is addressed in this tracker:
http://tracker.ceph.com/issues/18924

#ceph -v

OK, so you are mounting/mapping ceph as an RBD and writing into it.

We are discussing the luminous v12.0.3 issue here; I think we are all on the
same path.

Thanks
Jayaram


On Thu, Jun 8, 2017 at 8:13 PM, Jake Grimmett  wrote:

> Hi Mark / Jayaram,
>
> After running the cluster last night, I noticed lots of
> "Out Of Memory" errors in /var/log/messages, many of these correlate to
> dead OSD's. If this is the problem, this might now be another case of
> the high memory use issues reported in Kraken.
>
> e.g. my script logs:
> Thu 8 Jun 08:26:37 BST 2017  restart OSD  1
>
> and /var/log/messages states...
>
> Jun  8 08:26:35 ceph1 kernel: Out of memory: Kill process 7899
> (ceph-osd) score 113 or sacrifice child
> Jun  8 08:26:35 ceph1 kernel: Killed process 7899 (ceph-osd)
> total-vm:8569516kB, anon-rss:7518836kB, file-rss:0kB, shmem-rss:0kB
> Jun  8 08:26:36 ceph1 systemd: ceph-osd@1.service: main process exited,
> code=killed, status=9/KILL
> Jun  8 08:26:36 ceph1 systemd: Unit ceph-osd@1.service entered failed
> state.
>
> The OSD nodes have 64GB RAM, presumably enough RAM for 10 OSD's doing
> 4+1 EC ?
>
> I've added "bluestore_cache_size = 104857600" to ceph.conf, and am
> retesting. I will see if OSD problems occur, and report back.
>
> As to loading the cluster, I run an rsync job on each node, pulling data
> from an NFS mounted Isilon. A single node pulls ~200MB/s, with all 7
> nodes running, the ceph -w reports between 700 > 1500MB/s writes.
>
> as requested, here is my "restart_OSD_and_log-this.sh" script:
>
> 
> #!/bin/bash
> # catches single failed OSDs, log and restart
> while : ; do
> OSD=`ceph osd tree 2> /dev/null | grep down | \
> awk '{ print $3}' | awk -F "." '{print $2 }'`
> if [ "$OSD" != "" ] ; then
> DATE=`date`
> echo $DATE " restart OSD " $OSD  >> /root/osd_restart_log
> echo "OSD" $OSD "is down, restarting.."
> OSDHOST=`ceph osd find $OSD | grep host | awk -F '"' '{print $4}'`
> ssh $OSDHOST systemctl restart ceph-osd@$OSD
> sleep 30
> else
> echo -ne "\r\033[k"
> echo -ne "all OSD OK"
> fi
> sleep 1
> done
> 
>
> thanks again,
>
> Jake
>
> On 08/06/17 12:08, nokia ceph wrote:
> > Hello Mark,
> >
> > Raised tracker for the issue  -- http://tracker.ceph.com/issues/20222
> >
> > Jake can you share the restart_OSD_and_log-this.sh script
> >
> > Thanks
> > Jayaram
> >
> > On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett  > > wrote:
> >
> > Hi Mark & List,
> >
> > Unfortunately, even when using yesterdays master version of ceph,
> > I'm still seeing OSDs go down, same error as before:
> >
> > OSD log shows lots of entries like this:
> >
> > (osd38)
> > 2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
> > 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> >
> > (osd3)
> > 2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
> > 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> > 2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165
> heartbeat_check:
> > no reply from 10.1.0.86:6811  osd.2 since
> > back 2017-06-07 17:00:19.640002
> > front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
> >
> >
> > [root@ceph4 ceph]# ceph -v
> > ceph version 12.0.2-2399-ge38ca14
> > (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
> >
> >
> > I'll continue running the cluster with my
> "restart_OSD_and_log-this.sh"
> > workaround...
> >
> > thanks again for your help,
> >
> > Jake
> >
> > On 06/06/17 15:52, Jake Grimmett wrote:
> > > Hi Mark,
> > >
> > > OK, I'll upgrade to the current master and retest...
> > >
> > > best,
> > >
> > > Jake
> > >
> > > On 06/06/17 15:46, Mark Nelson wrote:
> > >> Hi Jake,
> > >>
> > >> I just happened to notice this was on 12.0.3.  Would it be
> > possible to
> > >> test this out with current master and see if it still is a
> problem?
> > >>
> > >> Mark
> > >>
> > >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
> > >>> Hi Jake,
> > >>>
> > >>> Thanks much.  I'm guessing at this point this is probably a
> > bug.  Would
> > >>> you (or nokiauser) mind creating a bug in the tracker with a
> short
> > >>> description of what's going on and the collectl sample showing
> > this is
> > >>> not IOs backing up on the disk?
> > >>>
> > >>> If you want to try it, we have a gdb based wallclock profi

Re: [ceph-users] RGW lifecycle not expiring objects

2017-06-08 Thread Graham Allan
Sorry I didn't get to reply until now. The thing is I believe I *do* 
have a lifecycle configured on at least one bucket. As noted in that 
issue, I get an error returned when trying to set the lifecycle, but it 
does appear to get stored:


% aws --endpoint-url https://xxx.xxx.xxx.xxx s3api \
  get-bucket-lifecycle-configuration --bucket=testgta
{
"Rules": [
{
"Status": "Enabled",
"Prefix": "",
"Expiration": {
"Days": 3
},
"ID": "test"
}
]
}

Does that look valid? Is there some other configuration which is needed 
besides this? The rgw config values for lifecycle look as expected.
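
For completeness, this is roughly how the rule above gets (re)applied -- a
sketch, with lc.json holding the JSON shown:

% aws --endpoint-url https://xxx.xxx.xxx.xxx s3api \
  put-bucket-lifecycle-configuration --bucket=testgta \
  --lifecycle-configuration file://lc.json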


Thanks,

Graham

On 06/06/2017 01:43 PM, Ben Hines wrote:
If you have nothing listed in 'lc list', you probably need to add a 
lifecycle configuration using the S3 API. It's not automatic and has to 
be added per-bucket.



Here's some sample code for doing so: http://tracker.ceph.com/issues/19587

-Ben

On Tue, Jun 6, 2017 at 9:07 AM, Graham Allan > wrote:


I still haven't seen anything get expired from our kraken (11.2.0)
system.

When I run "radosgw-admin lc list" I get no output, besides debug
output (I have "debug rgw = 10" at present):

# radosgw-admin lc list
2017-06-06 10:57:49.319576 7f2b26ffd700  2
RGWDataChangesLog::ChangesRenewThread: start
2017-06-06 10:57:49.350646 7f2b49558c80 10 Cannot find current
period zone using local zone
2017-06-06 10:57:49.379065 7f2b49558c80  2 all 8 watchers are set,
enabling cache
[]
2017-06-06 10:57:49.399538 7f2b49558c80  2 removed watcher,
disabling cache

Unclear to me whether the debug message about "Cannot find current
period zone using local zone" is related or indicates a problem.

Currently all the lc config is more or less default, eg a few values:

# ceph --show-config|grep rgw_|grep lifecycle
rgw_lifecycle_enabled = true
rgw_lifecycle_thread = 1
rgw_lifecycle_work_time = 00:00-06:00


Graham

On 06/05/2017 01:07 PM, Ben Hines wrote:

FWIW lifecycle is working for us. I did have to research to find
the appropriate lc config file settings, the documentation for
which is found in a git pull request (waiting for another
release?) rather than on the Ceph docs site.
https://github.com/ceph/ceph/pull/13990



Try these:
debug rgw = 20
rgw lifecycle work time = 00:01-23:59


and see if you have lifecycles listed when you run:


radosgw-admin lc list


2017-06-05 10:58:00.473957 7f3429f77c80  0 System already converted
[
  {
  "bucket": ":bentest:default.653959.6",
  "status": "COMPLETE"
  },
  {
  "bucket": "::default.24713983.1",
  "status": "PROCESSING"
  },
  {
  "bucket": "::default.24713983.2",
  "status": "PROCESSING"
  },




At 10 loglevel, the lifecycle processor logs 'DELETED' each time
it deletes something:
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_lc.cc#L388


   grep --text DELETED client..log | wc -l
121853


-Ben

On Mon, Jun 5, 2017 at 6:16 AM, Daniel Gryniewicz
mailto:d...@redhat.com>
>> wrote:

 Kraken has lifecycle, Jewel does not.

 Daniel


 On 06/04/2017 07:16 PM, ceph.nov...@habmalnefrage.de

 > wrote:


 grrr... sorry && and again as text :|


 Gesendet: Montag, 05. Juni 2017 um 01:12 Uhr
 Von: ceph.nov...@habmalnefrage.de

 >
 An: "Yehuda Sadeh-Weinraub" mailto:yeh...@redhat.com>
 >>
 Cc: "ceph-users@lists.ceph.com

 >" mailto:ceph-users@lists.ceph.com>
 >>, ceph-de...@vger.kernel.org

 >
 Betreff: Re: [ceph-users] RGW lifecycle not expiring
objects



 Hi (again) Yehuda.

  

Re: [ceph-users] CephFS Snapshot questions

2017-06-08 Thread John Spray
On Thu, Jun 8, 2017 at 3:33 PM, McFarland, Bruce
 wrote:
> John,
>
> Thanks for your answers. I have a clarification on my questions see below
> inline.
>
> Bruce
>
>
>
> From: John Spray 
> Date: Thursday, June 8, 2017 at 1:45 AM
> To: "McFarland, Bruce" 
> Cc: "ceph-users@lists.ceph.com" 
> Subject: Re: [ceph-users] CephFS Snapshot questions
>
>
>
> On Wed, Jun 7, 2017 at 11:46 PM, McFarland, Bruce
>
>  wrote:
>
> I have a couple of CephFS snapshot questions
>
>
>
> -  Is there any functionality similar to rbd clone/flatten such that
>
> the snapshot can be made writable?  Or is that as simple as copying the
>
> .snap/ to another cluster?
>
>
>
> No, there's no cloning.  You don't need another cluster though -- you
>
> can "cp -r" your snapshot anywhere on any filesystem, and you'll end
>
> up with fresh files that you can write to.
>
>
>
>
>
> -  If the first object write since the snapid was created is a user
>
> error how is that object recovered if it isn’t added to the snapid until
>
> it’s 1st write after snapid creation?
>
>
>
> Don't understand the question at all.  "user error"?
>
>
>
> I think I’ve answered this for myself. The case would be a user’s first
> write to an object after the snap is created being an error they wanted to
> “fix” by restoring the object from the objects clone. So when the user
> writes the “error” to the object it is copied to the snap while it is also
> being written. The object can then be restored from the clone in this case
> where the first write to it is in error and it can be recovered from its
> clone which hadn’t been populated with that object until that write.

It seems like you're imagining a situation where we do COW at the
CephFS layer?  We don't do that.  The snapshot stuff is all
implemented at a lower level, and exactly how it works is up to the
disk store.

What you see as a user of CephFS is just the snapshotted file and the
current revision of the file -- when you want to restore something,
you do it by over-writing the current revision from the snapshot.



> -  If I want to clone the .snap// and not all objects have
>
> been written since .snap// was created how do I know if or get all
>
> objects into the snap if I wanted to move the snap to another cluster?
>
>
>
> There's no concept of moving a snapshot between clusters.  If you're
>
> just talking about doing a "cp -r" of the snapshot, then the MDS
>
> should do the right thing in terms of blocking your reads on files
>
> that have dirty data in client caches -- when we make a snapshot then
>
> clients doing buffered writes are asked to flush those buffers.
>
>
>
> There are 2 cases I’m wondering about here that I didn’t accurately
> describe. 1 would be data migration between clusters which might not be
> possible and 2 would be storing clones on a second cluster.
>
> 1.  Is it possible to snap a directory tree on its source cluster and
> then copy it to a new/different destination cluster? Would that be
> prohibited due to the snaps MDS being on the source cluster? I can see that
> being useful for migrating data/users between clusters, but that it might
> not be possible.

You can copy around the files from snapshots, but the snapshots
themselves are not something that's import/exportable.

>
> 2.  I would expect this to be possible where a snap is created, it’s
> then compressed into a tarball, and that tarball is stored on a second
> cluster for any future DR at which point it’s copied back to the source
> cluster, extracted, restoring the directory tree to its state at the time of
> snap creation.

You can totally do this -- it's all just files, from the point of view
of a tool like tar.
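
For reference, the mechanics look roughly like this (paths are just examples,
and snapshots may need to be enabled on the filesystem first):

mkdir /mnt/cephfs/mydir/.snap/snap1           # take a snapshot of mydir
cp -r /mnt/cephfs/mydir/.snap/snap1 /backup/  # copy it anywhere, tar it, etc.
cp /mnt/cephfs/mydir/.snap/snap1/somefile /mnt/cephfs/mydir/somefile   # restore one file
rmdir /mnt/cephfs/mydir/.snap/snap1           # drop the snapshot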

John

>
>
>
> John
>
>
>
>
>
>
>
> I might not be making complete sense yet and am in the process of testing to
>
> see how CephFS snapshots behave.
>
>
>
>
>
>
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados rm: device or resource busy

2017-06-08 Thread Jan Kasprzak
Hello,

David Turner wrote:
: How long have you waited?

About a day.

: I don't do much with rados objects directly.  I usually use RBDs and
: cephfs.  If you just need to clean things up, you can delete the pool and
: recreate it since it looks like it's testing.  However this is probably a
: prime time to figure out how to get past this in case it happens in the
: future in production.

Yes. This is why I am asking now.

-Yenya

: On Thu, Jun 8, 2017 at 11:04 AM Jan Kasprzak  wrote:
: > I have created a RADOS striped object using
: >
: > $ dd someargs | rados --pool testpool --striper put testfile -
: >
: > and interrupted it in the middle of writing. Now I cannot remove this
: > object:
: >
: > $ rados --pool testpool --striper rm testfile
: > error removing testpool>testfile: (16) Device or resource busy
: >
: > How can I tell CEPH that the writer is no longer around and does not come
: > back,
: > so that I can remove the object "testfile"?

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados rm: device or resource busy

2017-06-08 Thread David Turner
How long have you waited? Watchers of objects in ceph time out after a
while and you should be able to delete it.  I'm talking around the range of
30 minutes, so it's likely this isn't the problem if you've been wrestling
with it long enough to write in about.
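
A quick way to check whether something still holds a watch on a given object (a
sketch; note that with --striper the actual object names carry a numeric suffix,
so take the names from "rados --pool testpool ls"):

rados --pool testpool listwatchers <object-name>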

I don't do much with rados objects directly.  I usually use RBDs and
cephfs.  If you just need to clean things up, you can delete the pool and
recreate it since it looks like it's testing.  However this is probably a
prime time to figure out how to get past this in case it happens in the
future in production.

Hopefully someone that has more experience with manually creating and
removing rados objects chimes in.
On Thu, Jun 8, 2017 at 11:04 AM Jan Kasprzak  wrote:

> Hello,
>
> I have created a RADOS striped object using
>
> $ dd someargs | rados --pool testpool --striper put testfile -
>
> and interrupted it in the middle of writing. Now I cannot remove this
> object:
>
> $ rados --pool testpool --striper rm testfile
> error removing testpool>testfile: (16) Device or resource busy
>
> How can I tell CEPH that the writer is no longer around and does not come
> back,
> so that I can remove the object "testfile"?
>
> Thanks,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak 
> |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
> |
> > That's why this kind of vulnerability is a concern: deploying stuff is  <
> > often about collecting an obscene number of .jar files and pushing them <
> > up to the application server.  --pboddie at LWN <
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing SSD Landscape

2017-06-08 Thread Reed Dier
I did stumble across Samsung PM1725/a in both AIC and 2.5” U.2 form factor.

AIC starts at 1.6T and goes up to 6.4T, while 2.5” goes from 800G up to 6.4T.

The thing that caught my eye with this model is the x8 lanes in AIC, and the 
5DWPD over 5 years.

No idea on how available it is, or how it compares price wise, but comparing to 
the Micron 9100, you can get 5DWPD compared to 3DWPD, which when talking in 
terms of journal devices, which could be a big difference in lifespan.
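
The back-of-the-envelope conversion, for anyone comparing endurance figures (a
sketch; warranty periods assumed, and the P4800X line matches the 12.3 PBW
quoted further down):

# total endurance (TB written) ~= capacity(TB) x DWPD x 365 x warranty years
echo '0.375 * 30 * 365 * 3' | bc   # 375G P4800X at 30 DWPD over 3 yr  ~= 12.3 PB
echo '1.6 * 5 * 365 * 5' | bc      # 1.6T PM1725a at 5 DWPD over 5 yr  ~= 14.6 PB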

And from what I read, while the PM1725a isn’t as performant as, say, the P3700 or some 
other enterprise NVMe drives like the HGST SN100, it’s still NVMe, with leaps-and-bounds 
lower latency and deeper queuing compared to SATA SSDs.

Reed

> On Jun 8, 2017, at 2:43 AM, Luis Periquito  wrote:
> 
> Looking at that anandtech comparison it seems the Micron usually is
> worse than the P3700.
> 
> This week I asked for a few nodes with P3700 400G and got an answer as
> they're end of sale, and the supplier wouldn't be able to get it
> anywhere in the world. Has anyone got a good replacement for these?
> 
> The official replacement is the P4600, but those start at 2T and has
> the appropriate price rise (it's slightly cheaper per GB than the
> P3700), and it hasn't been officially released yet.
> 
> The P4800X (Optane) costs about the same as the P4600 and is small...
> 
> Not really sure about the Micron 9100, and couldn't find anything
> interesting/comparable in the Samsung range...
> 
> 
> On Wed, May 17, 2017 at 5:03 PM, Reed Dier  wrote:
>> Agreed, the issue I have seen is that the P4800X (Optane) is demonstrably
>> more expensive than the P3700 for a roughly equivalent amount of storage
>> space (400G v 375G).
>> 
>> However, the P4800X is perfectly suited to a Ceph environment, with 30 DWPD,
>> or 12.3 PBW. And on top of that, it seems to generally outperform the P3700
>> in terms of latency, iops, and raw throughput, especially at greater queue
>> depths. The biggest thing I took away was performance consistency.
>> 
>> Anandtech did a good comparison against the P3700 and the Micron 9100 MAX,
>> ironically the 9100 MAX has been the model I have been looking at to replace
>> P3700’s in future OSD nodes.
>> 
>> http://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-deep-dive-into-3d-xpoint-enterprise-performance/
>> 
>> There are also the DC P4500 and P4600 models in the pipeline from Intel,
>> also utilizing 3D NAND, however I have been told that they will not be
>> shipping in volume until mid to late Q3.
>> And as was stated earlier, these are all starting at much larger storage
>> sizes, 1-4T in size, and with respective endurance ratings of 1.79 PBW and
>> 10.49 PBW for endurance on the 2TB versions of each of those. Which should
>> equal about .5 and ~3 DWPD for most workloads.
>> 
>> At least the Micron 5100 MAX are finally shipping in volume to offer a
>> replacement to Intel S3610, though no good replacement for the S3710 yet
>> that I’ve seen on the endurance part.
>> 
>> Reed
>> 
>> On May 17, 2017, at 5:44 AM, Luis Periquito  wrote:
>> 
>> Anyway, in a couple months we'll start testing the Optane drives. They
>> are small and perhaps ideal journals, or?
>> 
>> The problem with optanes is price: from what I've seen they cost 2x or
>> 3x as much as the P3700...
>> But at least from what I've read they do look really great...
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados rm: device or resource busy

2017-06-08 Thread Jan Kasprzak
Hello,

I have created a RADOS striped object using

$ dd someargs | rados --pool testpool --striper put testfile -

and interrupted it in the middle of writing. Now I cannot remove this object:

$ rados --pool testpool --striper rm testfile
error removing testpool>testfile: (16) Device or resource busy

How can I tell CEPH that the writer is no longer around and does not come back,
so that I can remove the object "testfile"?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-08 Thread Jake Grimmett
Hi Mark / Jayaram,

After running the cluster last night, I noticed lots of
"Out Of Memory" errors in /var/log/messages, many of these correlate to
dead OSD's. If this is the problem, this might now be another case of
the high memory use issues reported in Kraken.

e.g. my script logs:
Thu 8 Jun 08:26:37 BST 2017  restart OSD  1

and /var/log/messages states...

Jun  8 08:26:35 ceph1 kernel: Out of memory: Kill process 7899
(ceph-osd) score 113 or sacrifice child
Jun  8 08:26:35 ceph1 kernel: Killed process 7899 (ceph-osd)
total-vm:8569516kB, anon-rss:7518836kB, file-rss:0kB, shmem-rss:0kB
Jun  8 08:26:36 ceph1 systemd: ceph-osd@1.service: main process exited,
code=killed, status=9/KILL
Jun  8 08:26:36 ceph1 systemd: Unit ceph-osd@1.service entered failed state.

The OSD nodes have 64GB RAM, presumably enough RAM for 10 OSD's doing
4+1 EC ?

I've added "bluestore_cache_size = 104857600" to ceph.conf, and am
retesting. I will see if OSD problems occur, and report back.

As to loading the cluster, I run an rsync job on each node, pulling data
from an NFS mounted Isilon. A single node pulls ~200MB/s, with all 7
nodes running, the ceph -w reports between 700 > 1500MB/s writes.

as requested, here is my "restart_OSD_and_log-this.sh" script:


#!/bin/bash
# catches single failed OSDs, log and restart
while : ; do
OSD=`ceph osd tree 2> /dev/null | grep down | \
awk '{ print $3}' | awk -F "." '{print $2 }'`
if [ "$OSD" != "" ] ; then
DATE=`date`
echo $DATE " restart OSD " $OSD  >> /root/osd_restart_log
echo "OSD" $OSD "is down, restarting.."
OSDHOST=`ceph osd find $OSD | grep host | awk -F '"' '{print $4}'`
ssh $OSDHOST systemctl restart ceph-osd@$OSD
sleep 30
else
echo -ne "\r\033[k"
echo -ne "all OSD OK"
fi
sleep 1
done


thanks again,

Jake

On 08/06/17 12:08, nokia ceph wrote:
> Hello Mark,
> 
> Raised tracker for the issue  -- http://tracker.ceph.com/issues/20222
> 
> Jake can you share the restart_OSD_and_log-this.sh script 
> 
> Thanks
> Jayaram
> 
> On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett  > wrote:
> 
> Hi Mark & List,
> 
> Unfortunately, even when using yesterdays master version of ceph,
> I'm still seeing OSDs go down, same error as before:
> 
> OSD log shows lots of entries like this:
> 
> (osd38)
> 2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> 
> (osd3)
> 2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> 2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
> no reply from 10.1.0.86:6811  osd.2 since
> back 2017-06-07 17:00:19.640002
> front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
> 
> 
> [root@ceph4 ceph]# ceph -v
> ceph version 12.0.2-2399-ge38ca14
> (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
> 
> 
> I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
> workaround...
> 
> thanks again for your help,
> 
> Jake
> 
> On 06/06/17 15:52, Jake Grimmett wrote:
> > Hi Mark,
> >
> > OK, I'll upgrade to the current master and retest...
> >
> > best,
> >
> > Jake
> >
> > On 06/06/17 15:46, Mark Nelson wrote:
> >> Hi Jake,
> >>
> >> I just happened to notice this was on 12.0.3.  Would it be
> possible to
> >> test this out with current master and see if it still is a problem?
> >>
> >> Mark
> >>
> >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
> >>> Hi Jake,
> >>>
> >>> Thanks much.  I'm guessing at this point this is probably a
> bug.  Would
> >>> you (or nokiauser) mind creating a bug in the tracker with a short
> >>> description of what's going on and the collectl sample showing
> this is
> >>> not IOs backing up on the disk?
> >>>
> >>> If you want to try it, we have a gdb based wallclock profiler
> that might
> >>> be interesting to run while it's in the process of timing out. 
> It tries
> >>> to grab 2000 samples from the osd process which typically takes
> about 10
> >>> minutes or so.  You'll need to either change the number of
> samples to be
> >>> lower in the python code (maybe like 50-100), or change the
> timeout to
> >>> be something longer.
> >>>
> >>> You can find the code here:
> >>>
> >>> https://github.com/markhpc/gdbprof
> 
> >>>
> >>> and invoke it like:
> >>>
> >>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
> >>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'

Re: [ceph-users] CephFS Snapshot questions

2017-06-08 Thread McFarland, Bruce
John,
Thanks for your answers. I have a clarification on my questions; see below, 
inline.
Bruce

From: John Spray 
Date: Thursday, June 8, 2017 at 1:45 AM
To: "McFarland, Bruce" 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] CephFS Snapshot questions

On Wed, Jun 7, 2017 at 11:46 PM, McFarland, Bruce
<bruce.mcfarl...@teradata.com> wrote:
I have a couple of CephFS snapshot questions

-  Is there any functionality similar to rbd clone/flatten such that
the snapshot can be made writable?  Or is that as simple as copying the
.snap/ to another cluster?

No, there's no cloning.  You don't need another cluster though -- you
can "cp -r" your snapshot anywhere on any filesystem, and you'll end
up with fresh files that you can write to.


-  If the first object write since the snapid was created is a user
error how is that object recovered if it isn’t added to the snapid until
it’s 1st write after snapid creation?

Don't understand the question at all.  "user error"?

I think I’ve answered this for myself. The case would be a user whose first 
write to an object after the snap is created is a mistake they want to “fix” 
by restoring the object from the object’s clone. When the user writes the 
“mistake” to the object, the original contents are copied into the snap as 
part of that write. The object can then be restored from its clone, even 
though the clone had not been populated with that object until that first 
post-snapshot write.
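
In other words, something like this should work (a sketch, with made-up paths):

# take a snapshot of the directory before the risky change
mkdir /mnt/cephfs/mydir/.snap/pre-change
# ...the bad first write to important.dat happens here...
# restore the file from the snapshot's copy
cp /mnt/cephfs/mydir/.snap/pre-change/important.dat /mnt/cephfs/mydir/important.dat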

-  If I want to clone the .snap// and not all objects have
been written since .snap// was created how do I know if or get all
objects into the snap if I wanted to move the snap to another cluster?

There's no concept of moving a snapshot between clusters.  If you're
just talking about doing a "cp -r" of the snapshot, then the MDS
should do the right thing in terms of blocking your reads on files
that have dirty data in client caches -- when we make a snapshot then
clients doing buffered writes are asked to flush those buffers.

There are 2 cases I’m wondering about here that I didn’t describe accurately. 1 
would be data migration between clusters, which might not be possible, and 2 
would be storing clones on a second cluster.

1.  Is it possible to snap a directory tree on its source cluster and then 
copy it to a new/different destination cluster? Would that be prohibited 
because the snap’s MDS is on the source cluster? I can see that being useful 
for migrating data/users between clusters, but it might not be possible.

2.  I would expect this to be possible: a snap is created, compressed into a 
tarball, and that tarball is stored on a second cluster for any future DR. If 
disaster strikes, the tarball is copied back to the source cluster and 
extracted, restoring the directory tree to its state at the time of snap 
creation.
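
i.e. something along these lines (paths and names are examples):

# pack the snapshot into a tarball and park it on the second cluster
tar -czf /mnt/second-cluster/backups/mydir-2017-06-08.tgz \
    -C /mnt/cephfs/mydir/.snap/2017-06-08 .
# later, pull it back and extract it on the source cluster
tar -xzf /mnt/second-cluster/backups/mydir-2017-06-08.tgz -C /mnt/cephfs/mydir-restored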

John



I might not be making complete sense yet and am in the process of testing to
see how CephFS snapshots behave.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread David Turner
Whether or not 2x replica is possible has little to do with the technology
and EVERYTHING to do with your use case.  How redundant is your hardware
for instance?  If you have the best drives in the world that will never
fail after constant use over 100 years but you don't have redundant
power, bonded network, are running on used hardware, are in a cheap
datacenter that doesn't guarantee 99.999% uptime, etc, etc then you are
going to lose hosts regardless of what your disks are.

As Wido was quoted saying, the biggest problem with 2x replication is that
people use it with min_size=1.  That is cancer and will eventually cause
you to have inconsistent data and most likely data loss.  OTOH, min_size=2
and size=2 means that you need to schedule down time to restart your ceph
hosts for kernel updates, upgrading ceph, restarting the daemons with new
config file options that can't be injected, etc.  You can get around that
by using min_size=1 while you perform the scheduled maintenance.  If for
any reason you ever lose a server, NVMe, etc while running with 2 replica
and min_size=2, then you have unscheduled down time.
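
For what it's worth, that maintenance dance is just a pair of pool settings
(the pool name here is an example):

# before maintenance: temporarily allow I/O with a single surviving copy
ceph osd pool set rbd min_size 1
# ...restart the host, upgrade, etc...
# after maintenance: go back to requiring both copies
ceph osd pool set rbd min_size 2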

Running with 2x Replica right now is possible.  For that matter, people run
with 1x replication all the time (especially in testing).  You will never
get anyone to tell you that it is the optimal configuration because it is
and will always be a lie for general use cases no matter how robust and
bullet proof your drives are.  The primary problem is that nodes to be
restarted, power goes out, and Murphy's law.  Is your use case such that
having a certain percentage of data loss is acceptable?  Then run size=2
and min_size=1 and assume that you will eventually lose data.  Does your
use case allow for unexpected downtime?  Then run size=2 and min_size=2.
If you cannot lose data no matter what and must maintain as high of an
uptime as possible then you should be asking questions about multi-site
replication and the down sides of running 4x replication... 2x replication
shouldn't even cross your mind.

Now I'm assuming that you're broaching the topic because a 3x replica NVMe
cluster is super expensive.  I think all of us feel your pain there,
otherwise we'd all be running it.  A topic that has come up on the ML a
couple times is to use primary_affinity and an interesting distribution of
buckets in your crush map to build a cluster with both SSD storage and HDD
storage in a way that your data is well backed up, but all writes and reads
happen to the SSD/NVMe.  What you do here would be create 3 "racks" in your
crush map and use a rack failure domain.  1 rack has all of your SSD hosts,
and your HDD hosts with SSD/NVMe journals (matching what your other nodes
have) are split between your other 2 racks.  Now you set primary_affinity=0
for all of your HDD nodes forcing Ceph to use the SSD/NVMe OSD as the
primary for all of the PGs.  What you end up with is a 3 replica situation
where 1, and only 1, copy go onto an SSD and 2 copies go onto HDDs.  Once
you have this set up the way things will work is writes still happen to all
OSDs in a PG, so you will have 2 writes going to HDDs, except the write
acks once it is written to the SSD journal.  So your writes happen to all
flash storage.  Your reads are only ever done to your primary OSD for a PG,
so all reads will happen to the SSD/NVMe OSD.  Your recovery/backfilling
will be slower as you'll be reading a fair amount of your data from HDDs,
but that's a fairly insignificant sacrifice for what you are gaining.  For
each 1TB of flash storage, you need to have 2TB of HDD storage.  If you
have more HDD storage than this ratio, then it is wasted and won't be used.
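
A rough sketch of the commands involved (all names are examples, and on
Jewel you may also need "mon osd allow primary affinity = true" on the mons
before primary-affinity changes are accepted):

# three "racks" under the default root: one for the SSD/NVMe hosts,
# two for the HDD hosts
for r in rack-ssd rack-hdd-a rack-hdd-b; do
    ceph osd crush add-bucket $r rack
    ceph osd crush move $r root=default
done
ceph osd crush move ssd-host1 rack=rack-ssd
ceph osd crush move hdd-host1 rack=rack-hdd-a
ceph osd crush move hdd-host2 rack=rack-hdd-b
# replicate across racks instead of hosts, and point the pool at the new rule
ceph osd crush rule create-simple replicated-by-rack default rack
ceph osd pool set mypool crush_ruleset <rule-id>   # id from "ceph osd crush rule dump"
# never pick the HDD OSDs as primaries (repeat per HDD OSD id)
ceph osd primary-affinity osd.10 0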

To recap... The problem with 2x replica isn't the disk failure rate or how
bullet proof your hardware is.  Unless downtime or data loss is acceptable,
just don't talk about 2x replica.  But you can have 3 replicas that run as
fast as all flash with only having 1 replica of flash storage and enough
flash journals for the slower HDD replicas.  The trade off for this is that
you limit future customizations to your CRUSH map if you want to actually
configure logical racks for a growing/large cluster and you have generally
increased complexity when adding new storage nodes.

If downtime or data loss is not an acceptable running state and running
with a complex CRUSH map is not viable due to who will be in charge of
adding the storage... Then you're back to getting 3x replicas of the same
type of storage.

On Thu, Jun 8, 2017 at 9:32 AM  wrote:

> I'm thinking to delay this project until Luminous release to
> have Bluestore support.
>
> So are you telling me that checksum capability will be present in
> Bluestore and therefore considering using NVMe with 2x replica for
> production data will be possibile?
>
>
> --
> *From: *"nick" 
> *To: *"Vy Nguyen Tan" , i...@witeq.com
> *Cc: *"ceph-users" 
> *Sent: *Thursday, June 8, 2017 3:19:20 PM
> *Subject: *RE: [ceph-users] 2x replica w

Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Nick Fisk
Bluestore will make 2x replicas “safer” to use in theory. Until Bluestore is 
in use in the wild, I don’t think anyone can give any guarantees. 

 

From: i...@witeq.com [mailto:i...@witeq.com] 
Sent: 08 June 2017 14:32
To: nick 
Cc: Vy Nguyen Tan ; ceph-users 

Subject: Re: [ceph-users] 2x replica with NVMe

 

I'm thinking to delay this project until Luminous release to have Bluestore 
support.

 

So are you telling me that checksum capability will be present in Bluestore and 
therefore considering using NVMe with 2x replica for production data will be 
possibile?

 

  _  

From: "nick" 
To: "Vy Nguyen Tan" , i...@witeq.com
Cc: "ceph-users" 
Sent: Thursday, June 8, 2017 3:19:20 PM
Subject: RE: [ceph-users] 2x replica with NVMe

 

There are two main concerns with using 2x replicas, recovery speed and coming 
across inconsistent objects.

 

With spinning disks, the ratio of their size to their access speed means recovery can take a long 
time and increases the chance that additional failures may happen during the 
recovery process. NVME will recover a lot faster and so this risk is greatly 
reduced and means that using 2x replicas may be possible.

 

However, with Filestore there are no checksums and so there is no way to 
determine in the event of inconsistent objects, which one is corrupt. So even 
with NVME, I would not feel 100% confident using 2x replicas. With Bluestore 
this problem will go away.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vy 
Nguyen Tan
Sent: 08 June 2017 13:47
To: i...@witeq.com
Cc: ceph-users 
Subject: Re: [ceph-users] 2x replica with NVMe

 

Hi,

 

I think that the replica 2x on HDD/SSD are the same. You should read quote from 
Wido below:

 

""Hi,


As a Ceph consultant I get numerous calls throughout the year to help people 
with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that 
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these settings 
grew exponentially.

Usually a disk failed, recovery kicks in and while recovery is happening a 
second disk fails. Causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken disks and 
use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented easily by 
using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon 
fails. With size = 3 you always have two additional copies left thus keeping 
your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool with 
size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size = 2. 
The downtime and problems caused by missing objects/replicas are usually big 
and it takes days to recover from those. But very often data is lost and/or 
corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a SERIOUS 
hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""

 

On Thu, Jun 8, 2017 at 5:32 PM, <i...@witeq.com> wrote:

Hi all,

 

i'm going to build an all-flash ceph cluster, looking around the existing 
documentation i see lots of guides and use case scenarios from various 
vendor testing Ceph with replica 2x.

 

Now, i'm an old school Ceph user, I always considered 2x replica really 
dangerous for production data, especially when both OSDs can't decide which 
replica is the good one.

Why all NVMe storage vendor and partners use only 2x replica? 

They claim it's safe because NVMe is better in handling errors, but i usually 
don't trust marketing claims :)

Is it true? Can someone confirm that NVMe is different compared to HDD and 
therefore replica 2 can be considered safe to be put in production?

 

Many Thanks

Giordano


___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

 

 




 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-08 Thread Alejandro Comisario
Hi Brad.
Taking into consideration the unlikely posibility that someone
realizes what the problem is in this specific case, that would be
higly apreciated.

I presume that having jewel, if you can somehow remediate this, will
be something that i will not be able to have on this deploy right?

best.

On Thu, Jun 8, 2017 at 2:20 AM, Brad Hubbard  wrote:
> On Thu, Jun 8, 2017 at 2:59 PM, Alejandro Comisario
>  wrote:
>> ha!
>> is there ANY way of knowing when this peering maximum has been reached for a
>> PG?
>
> Not currently AFAICT.
>
> It takes place deep in this c code that is shared between the kernel
> and userspace implementations.
>
> https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L444
>
> Whilst the kernel implementation generates some output the userspace
> code does not. I'm looking at how that situation can be improved.
>
>>
>> On Jun 7, 2017 20:21, "Brad Hubbard"  wrote:
>>>
>>> On Wed, Jun 7, 2017 at 5:13 PM, Peter Maloney
>>>  wrote:
>>>
>>> >
>>> > Now if only there was a log or warning seen in ceph -s that said the
>>> > tries was exceeded,
>>>
>>> Challenge accepted.
>>>
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> Brad
>
>
>
> --
> Cheers,
> Brad



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread info
I'm thinking to delay this project until Luminous release to have Bluestore 
support. 

So are you telling me that checksum capability will be present in Bluestore and 
therefore considering using NVMe with 2x replica for production data will be 
possible? 



From: "nick"  
To: "Vy Nguyen Tan" , i...@witeq.com 
Cc: "ceph-users"  
Sent: Thursday, June 8, 2017 3:19:20 PM 
Subject: RE: [ceph-users] 2x replica with NVMe 



There are two main concerns with using 2x replicas, recovery speed and coming 
across inconsistent objects. 



With spinning disks, the ratio of their size to their access speed means recovery can take a long 
time and increases the chance that additional failures may happen during the 
recovery process. NVME will recover a lot faster and so this risk is greatly 
reduced and means that using 2x replicas may be possible. 



However, with Filestore there are no checksums and so there is no way to 
determine in the event of inconsistent objects, which one is corrupt. So even 
with NVME, I would not feel 100% confident using 2x replicas. With Bluestore 
this problem will go away. 




From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vy 
Nguyen Tan 
Sent: 08 June 2017 13:47 
To: i...@witeq.com 
Cc: ceph-users  
Subject: Re: [ceph-users] 2x replica with NVMe 





Hi, 





I think that the replica 2x on HDD/SSD are the same. You should read quote from 
Wido below: 





" " Hi, 



As a Ceph consultant I get numerous calls throughout the year to help people 
with getting their broken Ceph clusters back online . 

The causes of downtime vary vastly, but one of the biggest causes is that 
people use replication 2x. size = 2, min_size = 1. 

In 2016 the amount of cases I have where data was lost due to these settings 
grew exponentially. 

Usually a disk failed, recovery kicks in and while recovery is happening a 
second disk fails. Causing PGs to become incomplete. 

There have been too many times where I had to use xfs_repair on broken disks and 
use ceph-objectstore-tool to export/import PGs. 

I really don't like these cases, mainly because they can be prevented easily by 
using size = 3 and min_size = 2 for all pools. 

With size = 2 you go into the danger zone as soon as a single disk/daemon 
fails. With size = 3 you always have two additional copies left thus keeping 
your data safe(r). 

If you are running CephFS, at least consider running the 'metadata' pool with 
size = 3 to keep the MDS happy. 

Please, let this be a big warning to everybody who is running with size = 2. 
The downtime and problems caused by missing objects/replicas are usually big 
and it takes days to recover from those. But very often data is lost and/or 
corrupted which causes even more problems. 

I can't stress this enough. Running with size = 2 in production is a SERIOUS 
hazard and should not be done imho. 

To anyone out there running with size = 2, please reconsider this! 

Thanks, 

Wido"" 





On Thu, Jun 8, 2017 at 5:32 PM, <i...@witeq.com> wrote: 




Hi all, 





i'm going to build an all-flash ceph cluster, looking around the existing 
documentation i see lots of guides and use case scenarios from various 
vendor testing Ceph with replica 2x. 





Now, i'm an old school Ceph user, I always considered 2x replica really 
dangerous for production data, especially when both OSDs can't decide which 
replica is the good one. 


Why all NVMe storage vendor and partners use only 2x replica? 


They claim it's safe because NVMe is better in handling errors, but i usually 
don't trust marketing claims :) 


Is it true? Can someone confirm that NVMe is different compared to HDD and 
therefore replica 2 can be considered safe to be put in production? 





Many Thanks 


Giordano 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lumionous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-08 Thread nokia ceph
Hello Mark,

As this issue is noticed only while writing via librados (the C API), the
same can't be reproduced with the rados user-space utility.
Ref;- http://docs.ceph.com/docs/master/rados/api/librados/

Jake, I guess you are also creating the load via librados.

Thanks
Jayaram

On Thu, Jun 8, 2017 at 5:46 PM, Mark Nelson  wrote:

> Hi Jayaram,
>
> Thanks for creating a tracker entry! Any chance you could add a note about
> how you are generating the 200MB/s client workload?  I've not seen this
> problem in the lab, but any details you could give that would help us
> reproduce the problem would be much appreciated!
>
> Mark
>
> On 06/08/2017 06:08 AM, nokia ceph wrote:
>
>> Hello Mark,
>>
>> Raised tracker for the issue  -- http://tracker.ceph.com/issues/20222
>>
>> Jake can you share the restart_OSD_and_log-this.sh script
>>
>> Thanks
>> Jayaram
>>
>> On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett > > wrote:
>>
>> Hi Mark & List,
>>
>> Unfortunately, even when using yesterdays master version of ceph,
>> I'm still seeing OSDs go down, same error as before:
>>
>> OSD log shows lots of entries like this:
>>
>> (osd38)
>> 2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
>> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
>>
>> (osd3)
>> 2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
>> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
>> 2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
>> no reply from 10.1.0.86:6811  osd.2 since
>>
>> back 2017-06-07 17:00:19.640002
>> front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
>>
>>
>> [root@ceph4 ceph]# ceph -v
>> ceph version 12.0.2-2399-ge38ca14
>> (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
>>
>>
>> I'll continue running the cluster with my
>> "restart_OSD_and_log-this.sh"
>> workaround...
>>
>> thanks again for your help,
>>
>> Jake
>>
>> On 06/06/17 15:52, Jake Grimmett wrote:
>> > Hi Mark,
>> >
>> > OK, I'll upgrade to the current master and retest...
>> >
>> > best,
>> >
>> > Jake
>> >
>> > On 06/06/17 15:46, Mark Nelson wrote:
>> >> Hi Jake,
>> >>
>> >> I just happened to notice this was on 12.0.3.  Would it be
>> possible to
>> >> test this out with current master and see if it still is a problem?
>> >>
>> >> Mark
>> >>
>> >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
>> >>> Hi Jake,
>> >>>
>> >>> Thanks much.  I'm guessing at this point this is probably a
>> bug.  Would
>> >>> you (or nokiauser) mind creating a bug in the tracker with a short
>> >>> description of what's going on and the collectl sample showing
>> this is
>> >>> not IOs backing up on the disk?
>> >>>
>> >>> If you want to try it, we have a gdb based wallclock profiler
>> that might
>> >>> be interesting to run while it's in the process of timing out.
>> It tries
>> >>> to grab 2000 samples from the osd process which typically takes
>> about 10
>> >>> minutes or so.  You'll need to either change the number of
>> samples to be
>> >>> lower in the python code (maybe like 50-100), or change the
>> timeout to
>> >>> be something longer.
>> >>>
>> >>> You can find the code here:
>> >>>
>> >>> https://github.com/markhpc/gdbprof
>> 
>> >>>
>> >>> and invoke it like:
>> >>>
>> >>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
>> >>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
>> >>>
>> >>> where 27962 in this case is the PID of the ceph-osd process.
>> You'll
>> >>> need gdb with the python bindings and the ceph debug symbols for
>> it to
>> >>> work.
>> >>>
>> >>> This might tell us over time if the tp_osd_tp processes are just
>> sitting
>> >>> on pg::locks.
>> >>>
>> >>> Mark
>> >>>
>> >>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
>>  Hi Mark,
>> 
>>  Thanks again for looking into this problem.
>> 
>>  I ran the cluster overnight, with a script checking for dead
>> OSDs every
>>  second, and restarting them.
>> 
>>  40 OSD failures occurred in 12 hours, some OSDs failed multiple
>> times,
>>  (there are 50 OSDs in the EC tier).
>> 
>>  Unfortunately, the output of collectl doesn't appear to show any
>>  increase in disk queue depth and service times before the OSDs
>> die.
>> 
>>  I've put a couple of examples of collectl output for the disks
>>  associated with the OSDs here:
>> 
>>  https://hastebin.com/icuvotemot.scala
>> 
>> 
>>  please let me know if you need more info.

Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Nick Fisk
There are two main concerns with using 2x replicas, recovery speed and coming 
across inconsistent objects.

 

With spinning disks, the ratio of their size to their access speed means recovery can take a long 
time and increases the chance that additional failures may happen during the 
recovery process. NVME will recover a lot faster and so this risk is greatly 
reduced and means that using 2x replicas may be possible.

 

However, with Filestore there are no checksums and so there is no way to 
determine in the event of inconsistent objects, which one is corrupt. So even 
with NVME, I would not feel 100% confident using 2x replicas. With Bluestore 
this problem will go away.
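
As an aside, on Jewel you can at least list what scrub flagged, even if with
2x Filestore replicas Ceph still can't tell you which copy is the good one
(the pg id here is an example):

ceph health detail | grep inconsistent
rados list-inconsistent-obj 2.1f --format=json-pretty
# on Filestore, repair generally trusts the primary's copy, so use with care
ceph pg repair 2.1f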

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vy 
Nguyen Tan
Sent: 08 June 2017 13:47
To: i...@witeq.com
Cc: ceph-users 
Subject: Re: [ceph-users] 2x replica with NVMe

 

Hi,

 

I think that the replica 2x on HDD/SSD are the same. You should read quote from 
Wido below:

 

""Hi,


As a Ceph consultant I get numerous calls throughout the year to help people 
with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that 
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these settings 
grew exponentially.

Usually a disk failed, recovery kicks in and while recovery is happening a 
second disk fails. Causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken disks and 
use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented easily by 
using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon 
fails. With size = 3 you always have two additional copies left thus keeping 
your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool with 
size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size = 2. 
The downtime and problems caused by missing objects/replicas are usually big 
and it takes days to recover from those. But very often data is lost and/or 
corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a SERIOUS 
hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""

 

On Thu, Jun 8, 2017 at 5:32 PM, <i...@witeq.com> wrote:

Hi all,

 

i'm going to build an all-flash ceph cluster, looking around the existing 
documentation i see lots of guides and use case scenarios from various 
vendor testing Ceph with replica 2x.

 

Now, i'm an old school Ceph user, I always considered 2x replica really 
dangerous for production data, especially when both OSDs can't decide which 
replica is the good one.

Why all NVMe storage vendor and partners use only 2x replica? 

They claim it's safe because NVMe is better in handling errors, but i usually 
don't trust marketing claims :)

Is it true? Can someone confirm that NVMe is different compared to HDD and 
therefore replica 2 can be considered safe to be put in production?

 

Many Thanks

Giordano


___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Vy Nguyen Tan
Hi,

I think that the replica 2x on HDD/SSD are the same. You should read quote
from Wido below:

""Hi,

As a Ceph consultant I get numerous calls throughout the year to help people
 with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these
settings grew exponentially.

Usually a disk failed, recovery kicks in and while recovery is happening a
second disk fails. Causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken disks
and use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented
easily by using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon
fails. With size = 3 you always have two additional copies left thus
keeping your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool
with size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size =
2. The downtime and problems caused by missing objects/replicas are usually
big and it takes days to recover from those. But very often data is lost
and/or corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a
SERIOUS hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""

On Thu, Jun 8, 2017 at 5:32 PM,  wrote:

> Hi all,
>
> i'm going to build an all-flash ceph cluster, looking around the existing
> documentation i see lots of guides and use case scenarios from various
> vendor testing Ceph with replica 2x.
>
> Now, i'm an old school Ceph user, I always considered 2x replica really
> dangerous for production data, especially when both OSDs can't decide which
> replica is the good one.
> Why all NVMe storage vendor and partners use only 2x replica?
> They claim it's safe because NVMe is better in handling errors, but i
> usually don't trust marketing claims :)
> Is it true? Can someone confirm that NVMe is different compared to HDD and
> therefore replica 2 can be considered safe to be put in production?
>
> Many Thanks
> Giordano
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lumionous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-08 Thread Mark Nelson

Hi Jayaram,

Thanks for creating a tracker entry! Any chance you could add a note 
about how you are generating the 200MB/s client workload?  I've not seen 
this problem in the lab, but any details you could give that would help 
us reproduce the problem would be much appreciated!


Mark

On 06/08/2017 06:08 AM, nokia ceph wrote:

Hello Mark,

Raised tracker for the issue  -- http://tracker.ceph.com/issues/20222

Jake can you share the restart_OSD_and_log-this.sh script

Thanks
Jayaram

On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:

Hi Mark & List,

Unfortunately, even when using yesterdays master version of ceph,
I'm still seeing OSDs go down, same error as before:

OSD log shows lots of entries like this:

(osd38)
2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
'tp_osd_tp thread tp_osd_tp' had timed out after 60

(osd3)
2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
'tp_osd_tp thread tp_osd_tp' had timed out after 60
2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
no reply from 10.1.0.86:6811  osd.2 since
back 2017-06-07 17:00:19.640002
front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)


[root@ceph4 ceph]# ceph -v
ceph version 12.0.2-2399-ge38ca14
(e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)


I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
workaround...

thanks again for your help,

Jake

On 06/06/17 15:52, Jake Grimmett wrote:
> Hi Mark,
>
> OK, I'll upgrade to the current master and retest...
>
> best,
>
> Jake
>
> On 06/06/17 15:46, Mark Nelson wrote:
>> Hi Jake,
>>
>> I just happened to notice this was on 12.0.3.  Would it be
possible to
>> test this out with current master and see if it still is a problem?
>>
>> Mark
>>
>> On 06/06/2017 09:10 AM, Mark Nelson wrote:
>>> Hi Jake,
>>>
>>> Thanks much.  I'm guessing at this point this is probably a
bug.  Would
>>> you (or nokiauser) mind creating a bug in the tracker with a short
>>> description of what's going on and the collectl sample showing
this is
>>> not IOs backing up on the disk?
>>>
>>> If you want to try it, we have a gdb based wallclock profiler
that might
>>> be interesting to run while it's in the process of timing out.
It tries
>>> to grab 2000 samples from the osd process which typically takes
about 10
>>> minutes or so.  You'll need to either change the number of
samples to be
>>> lower in the python code (maybe like 50-100), or change the
timeout to
>>> be something longer.
>>>
>>> You can find the code here:
>>>
>>> https://github.com/markhpc/gdbprof

>>>
>>> and invoke it like:
>>>
>>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
>>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
>>>
>>> where 27962 in this case is the PID of the ceph-osd process.  You'll
>>> need gdb with the python bindings and the ceph debug symbols for
it to
>>> work.
>>>
>>> This might tell us over time if the tp_osd_tp processes are just
sitting
>>> on pg::locks.
>>>
>>> Mark
>>>
>>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
 Hi Mark,

 Thanks again for looking into this problem.

 I ran the cluster overnight, with a script checking for dead
OSDs every
 second, and restarting them.

 40 OSD failures occurred in 12 hours, some OSDs failed multiple
times,
 (there are 50 OSDs in the EC tier).

 Unfortunately, the output of collectl doesn't appear to show any
 increase in disk queue depth and service times before the OSDs die.

 I've put a couple of examples of collectl output for the disks
 associated with the OSDs here:

 https://hastebin.com/icuvotemot.scala


 please let me know if you need more info...

 best regards,

 Jake


>
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lumionous: bluestore 'tp_osd_tp thread tp_osd_tp' had timed out after 60

2017-06-08 Thread nokia ceph
Hello Mark,

Raised tracker for the issue  -- http://tracker.ceph.com/issues/20222

Jake can you share the restart_OSD_and_log-this.sh script

Thanks
Jayaram

On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett  wrote:

> Hi Mark & List,
>
> Unfortunately, even when using yesterdays master version of ceph,
> I'm still seeing OSDs go down, same error as before:
>
> OSD log shows lots of entries like this:
>
> (osd38)
> 2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
>
> (osd3)
> 2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> 2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
> no reply from 10.1.0.86:6811 osd.2 since back 2017-06-07 17:00:19.640002
> front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
>
>
> [root@ceph4 ceph]# ceph -v
> ceph version 12.0.2-2399-ge38ca14
> (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
>
>
> I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
> workaround...
>
> thanks again for your help,
>
> Jake
>
> On 06/06/17 15:52, Jake Grimmett wrote:
> > Hi Mark,
> >
> > OK, I'll upgrade to the current master and retest...
> >
> > best,
> >
> > Jake
> >
> > On 06/06/17 15:46, Mark Nelson wrote:
> >> Hi Jake,
> >>
> >> I just happened to notice this was on 12.0.3.  Would it be possible to
> >> test this out with current master and see if it still is a problem?
> >>
> >> Mark
> >>
> >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
> >>> Hi Jake,
> >>>
> >>> Thanks much.  I'm guessing at this point this is probably a bug.  Would
> >>> you (or nokiauser) mind creating a bug in the tracker with a short
> >>> description of what's going on and the collectl sample showing this is
> >>> not IOs backing up on the disk?
> >>>
> >>> If you want to try it, we have a gdb based wallclock profiler that
> might
> >>> be interesting to run while it's in the process of timing out.  It
> tries
> >>> to grab 2000 samples from the osd process which typically takes about
> 10
> >>> minutes or so.  You'll need to either change the number of samples to
> be
> >>> lower in the python code (maybe like 50-100), or change the timeout to
> >>> be something longer.
> >>>
> >>> You can find the code here:
> >>>
> >>> https://github.com/markhpc/gdbprof
> >>>
> >>> and invoke it like:
> >>>
> >>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
> >>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
> >>>
> >>> where 27962 in this case is the PID of the ceph-osd process.  You'll
> >>> need gdb with the python bindings and the ceph debug symbols for it to
> >>> work.
> >>>
> >>> This might tell us over time if the tp_osd_tp processes are just
> sitting
> >>> on pg::locks.
> >>>
> >>> Mark
> >>>
> >>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
>  Hi Mark,
> 
>  Thanks again for looking into this problem.
> 
>  I ran the cluster overnight, with a script checking for dead OSDs
> every
>  second, and restarting them.
> 
>  40 OSD failures occurred in 12 hours, some OSDs failed multiple times,
>  (there are 50 OSDs in the EC tier).
> 
>  Unfortunately, the output of collectl doesn't appear to show any
>  increase in disk queue depth and service times before the OSDs die.
> 
>  I've put a couple of examples of collectl output for the disks
>  associated with the OSDs here:
> 
>  https://hastebin.com/icuvotemot.scala
> 
>  please let me know if you need more info...
> 
>  best regards,
> 
>  Jake
> 
> 
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache mode readforward mode will eat your babies?

2017-06-08 Thread Alfredo Deza
On Thu, Jun 8, 2017 at 3:38 AM, Christian Balzer  wrote:
> On Thu, 8 Jun 2017 17:03:15 +1000 Brad Hubbard wrote:
>
>> On Thu, Jun 8, 2017 at 3:47 PM, Christian Balzer  wrote:
>> > On Thu, 8 Jun 2017 15:29:05 +1000 Brad Hubbard wrote:
>> >
>> >> On Thu, Jun 8, 2017 at 3:10 PM, Christian Balzer  wrote:
>> >> > On Thu, 8 Jun 2017 14:21:43 +1000 Brad Hubbard wrote:
>> >> >
>> >> >> On Thu, Jun 8, 2017 at 1:06 PM, Christian Balzer  wrote:
>> >> >> >
>> >> >> > Hello,
>> >> >> >
>> >> >> > New cluster, Jewel, setting up cache-tiering:
>> >> >> > ---
>> >> >> > Error EPERM: 'readforward' is not a well-supported cache mode and 
>> >> >> > may corrupt your data.  pass --yes-i-really-mean-it to force.
>> >> >> > ---
>> >> >> >
>> >> >> > That's new and certainly wasn't there in Hammer, nor did it whine 
>> >> >> > about
>> >> >> > this when upgrading my test cluster to Jewel.
>> >> >> >
>> >> >> > And speaking of whining, I did that about this and readproxy, but not
>> >> >> > their stability (readforward has been working nearly a year 
>> >> >> > flawlessly in
>> >> >> > the test cluster) but their lack of documentation.
>> >> >> >
>> >> >> > So while of course there is no warranty for anything with OSS, is 
>> >> >> > there
>> >> >> > any real reason for the above scaremongering or is that based solely 
>> >> >> > on
>> >> >> > lack of testing/experience?
>> >> >>
>> >> >> https://github.com/ceph/ceph/pull/8210 and
>> >> >> https://github.com/ceph/ceph/pull/8210/commits/90fe8e3d0b1ded6d14a6a43ecbd6c8634f691fbe
>> >> >> may offer some insight.
>> >> >>
>> >> > They do, alas of course immediately raise the following questions:
>> >> >
>> >> > 1. Where is that mode documented?
>> >>
>> >> It *was* documented by,
>> >> https://github.com/ceph/ceph/pull/7023/commits/d821acada39937b9dacf87614c924114adea8a58
>> >> in https://github.com/ceph/ceph/pull/7023 but was removed by
>> >> https://github.com/ceph/ceph/commit/6b6b38163b7742d97d21457cf38bdcc9bde5ae1a
>> >> in https://github.com/ceph/ceph/pull/9070
>> >>
>> >
>> > I was talking about proxy, which isn't AFAICT, nor is there a BIG bold red
>>
>> That was hard to follow for me, in a thread titled "Cache mode
>> readforward mode will eat your babies?".
>>
> Context, the initial github bits talk about proxy.
>
> Anyways, the documentation is in utter shambles and wrong and this really
> really should have been mentioned more clearly in the release notes, but
> then again none of the other cache changes were, never mind the wrong
> osd_tier_promote_max* defaults.
>
> So for the record:
>
> The readproxy mode does what the old documentation states and proxies
> objects through the cache-tier when being read w/o promoting them[*], while
> writing objects will go into cache-tier as usual and with the
> rate configured.
>
> [*]
> Pro-Tip: It does however do the silent 0 byte object creation for reads,
> so your cache-tier storage performance will be somewhat impacted, in
> addition to the CPU usage there that readforward would have also avoided.
> This is important when considering the value for "target_max_objects", as a
> writeback mode cache will likely evict things based on space used and
> reach a natural upper object limit.
> For example an existing cache-tier in writeback mode here has a 2GB size
> and 560K objects, 13.4TB and 3.6M objects on the backing storage.
> With readproxy and a similar sized cluster I'll be setting
> "target_max_objects" to something around 2M to avoid needless eviction and
> then re-creation of null objects when things are read.

Thank you for taking the time to explain this on the mailing list.
Could you help us by submitting a pull request with this
documentation addition?

I would be happy to review and merge.
>
> Christian
>
>> > statement in the release notes (or docs) for everybody to switch from
>> > (read)forward to (read)proxy.
>> >
>> > And the two bits up there have _very_ conflicting statements about what
>> > readproxy does, the older one would do what I want (at the cost of
>> > shuffling all through the cache-tier network pipes), the newer one seems
>> > to be actually describing the proxy functionality (no new objects i.e from
>> > writes being added).
>> >
>> > I'll be ready to play with my new cluster in a bit and shall investigate
>> > what does actually what.
>> >
>> > Christian
>> >
>> >> HTH.
>> >>
>> >> >
>> >> > 2. The release notes aren't any particular help there either and 
>> >> > issues/PR
>> >> > talk about forward, not readforward as the culprit.
>> >> >
>> >> > 3. What I can gleam from the bits I found, proxy just replaces the 
>> >> > forward
>> >> > functionality.  Alas what I'm after is a mode that will not promote 
>> >> > reads
>> >> > to the cache, aka readforward. Or another set of parameters that will
>> >> > produce the same results.
>> >> >
>> >> > Christian
>> >> >
>> >> >> >
>> >> >> > Christian
>> >> >> > --
>> >> >> > Christian BalzerNetwork/Systems Engineer
>> >> >> > ch...@gol.com 

[ceph-users] 2x replica with NVMe

2017-06-08 Thread info
Hi all, 

i'm going to build an all-flash ceph cluster, looking around the existing 
documentation i see lots of guides and use case scenarios from various 
vendor testing Ceph with replica 2x. 

Now, i'm an old school Ceph user, I always considered 2x replica really 
dangerous for production data, especially when both OSDs can't decide which 
replica is the good one. 
Why do all NVMe storage vendors and partners use only 2x replica? 
They claim it's safe because NVMe is better in handling errors, but i usually 
don't trust marketing claims :) 
Is it true? Can someone confirm that NVMe is different compared to HDD and 
therefore replica 2 can be considered safe to be put in production? 

Many Thanks 
Giordano 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Snapshot questions

2017-06-08 Thread John Spray
On Wed, Jun 7, 2017 at 11:46 PM, McFarland, Bruce
 wrote:
> I have a couple of CephFS snapshot questions
>
> -  Is there any functionality similar to rbd clone/flatten such that
> the snapshot can be made writable?  Or is that as simple as copying the
> .snap/ to another cluster?

No, there's no cloning.  You don't need another cluster though -- you
can "cp -r" your snapshot anywhere on any filesystem, and you'll end
up with fresh files that you can write to.
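
For example (paths are made up; snapshots are created with a mkdir in the
magic .snap directory):

# take a snapshot of a directory
mkdir /mnt/cephfs/projects/.snap/monday
# a plain recursive copy of the snapshot gives fresh, writable files
cp -r /mnt/cephfs/projects/.snap/monday /mnt/cephfs/projects-monday-copy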

>
> -  If the first object write since the snapid was created is a user
> error how is that object recovered if it isn’t added to the snapid until
> it’s 1st write after snapid creation?

Don't understand the question at all.  "user error"?

> -  If I want to clone the .snap// and not all objects have
> been written since .snap// was created how do I know if or get all
> objects into the snap if I wanted to move the snap to another cluster?

There's no concept of moving a snapshot between clusters.  If you're
just talking about doing a "cp -r" of the snapshot, then the MDS
should do the right thing in terms of blocking your reads on files
that have dirty data in client caches -- when we make a snapshot then
clients doing buffered writes are asked to flush those buffers.

John

>
>
> I might not be making complete sense yet and am in the process of testing to
> see how CephFS snapshots behave.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing SSD Landscape

2017-06-08 Thread Luis Periquito
Looking at that anandtech comparison it seems the Micron usually is
worse than the P3700.

This week I asked for a few nodes with P3700 400G and was told that
they're end of sale, and the supplier wouldn't be able to get any
anywhere in the world. Has anyone got a good replacement for these?

The official replacement is the P4600, but those start at 2T and have
a corresponding price rise (slightly cheaper per GB than the
P3700), and they haven't been officially released yet.

The P4800X (Optane) costs about the same as the P4600 and is small...

Not really sure about the Micron 9100, and couldn't find anything
interesting/comparable in the Samsung range...


On Wed, May 17, 2017 at 5:03 PM, Reed Dier  wrote:
> Agreed, the issue I have seen is that the P4800X (Optane) is demonstrably
> more expensive than the P3700 for a roughly equivalent amount of storage
> space (400G v 375G).
>
> However, the P4800X is perfectly suited to a Ceph environment, with 30 DWPD,
> or 12.3 PBW. And on top of that, it seems to generally outperform the P3700
> in terms of latency, iops, and raw throughput, especially at greater queue
> depths. The biggest thing I took away was performance consistency.
>
> Anandtech did a good comparison against the P3700 and the Micron 9100 MAX,
> ironically the 9100 MAX has been the model I have been looking at to replace
> P3700’s in future OSD nodes.
>
> http://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-deep-dive-into-3d-xpoint-enterprise-performance/
>
> There are also the DC P4500 and P4600 models in the pipeline from Intel,
> also utilizing 3D NAND, however I have been told that they will not be
> shipping in volume until mid to late Q3.
> And as was stated earlier, these are all starting at much larger storage
> sizes, 1-4T in size, and with respective endurance ratings of 1.79 PBW and
> 10.49 PBW for endurance on the 2TB versions of each of those. Which should
> equal about .5 and ~3 DWPD for most workloads.
>
> At least the Micron 5100 MAX are finally shipping in volume to offer a
> replacement to Intel S3610, though no good replacement for the S3710 yet
> that I’ve seen on the endurance part.
>
> Reed
>
> On May 17, 2017, at 5:44 AM, Luis Periquito  wrote:
>
> Anyway, in a couple months we'll start testing the Optane drives. They
> are small and perhaps ideal journals, or?
>
> The problem with optanes is price: from what I've seen they cost 2x or
> 3x as much as the P3700...
> But at least from what I've read they do look really great...
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache mode readforward mode will eat your babies?

2017-06-08 Thread Christian Balzer
On Thu, 8 Jun 2017 17:03:15 +1000 Brad Hubbard wrote:

> On Thu, Jun 8, 2017 at 3:47 PM, Christian Balzer  wrote:
> > On Thu, 8 Jun 2017 15:29:05 +1000 Brad Hubbard wrote:
> >  
> >> On Thu, Jun 8, 2017 at 3:10 PM, Christian Balzer  wrote:  
> >> > On Thu, 8 Jun 2017 14:21:43 +1000 Brad Hubbard wrote:
> >> >  
> >> >> On Thu, Jun 8, 2017 at 1:06 PM, Christian Balzer  wrote: 
> >> >>  
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> > New cluster, Jewel, setting up cache-tiering:
> >> >> > ---
> >> >> > Error EPERM: 'readforward' is not a well-supported cache mode and may 
> >> >> > corrupt your data.  pass --yes-i-really-mean-it to force.
> >> >> > ---
> >> >> >
> >> >> > That's new and certainly wasn't there in Hammer, nor did it whine 
> >> >> > about
> >> >> > this when upgrading my test cluster to Jewel.
> >> >> >
> >> >> > And speaking of whining, I did that about this and readproxy, but not
> >> >> > their stability (readforward has been working nearly a year 
> >> >> > flawlessly in
> >> >> > the test cluster) but their lack of documentation.
> >> >> >
> >> >> > So while of course there is no warranty for anything with OSS, is 
> >> >> > there
> >> >> > any real reason for the above scaremongering or is that based solely 
> >> >> > on
> >> >> > lack of testing/experience?  
> >> >>
> >> >> https://github.com/ceph/ceph/pull/8210 and
> >> >> https://github.com/ceph/ceph/pull/8210/commits/90fe8e3d0b1ded6d14a6a43ecbd6c8634f691fbe
> >> >> may offer some insight.
> >> >>  
> >> > They do, alas of course immediately raise the following questions:
> >> >
> >> > 1. Where is that mode documented?  
> >>
> >> It *was* documented by,
> >> https://github.com/ceph/ceph/pull/7023/commits/d821acada39937b9dacf87614c924114adea8a58
> >> in https://github.com/ceph/ceph/pull/7023 but was removed by
> >> https://github.com/ceph/ceph/commit/6b6b38163b7742d97d21457cf38bdcc9bde5ae1a
> >> in https://github.com/ceph/ceph/pull/9070
> >>  
> >
> > I was talking about proxy, which isn't AFAICT, nor is there a BIG bold red  
> 
> That was hard to follow for me, in a thread titled "Cache mode
> readforward mode will eat your babies?".
> 
Context, the initial github bits talk about proxy.

Anyways, the documentation is in utter shambles and wrong and this really
really should have been mentioned more clearly in the release notes, but
then again none of the other cache changes were, never mind the wrong
osd_tier_promote_max* defaults.

So for the record:

The readproxy mode does what the old documentation states and proxies
objects through the cache-tier when being read w/o promoting them[*], while
writing objects will go into cache-tier as usual and with the
rate configured.

[*]
Pro-Tip: It does however do the silent 0 byte object creation for reads,
so your cache-tier storage performance will be somewhat impacted, in
addition to the CPU usage there that readforward would have also avoided.
This is important when considering the value for "target_max_objects", as a
writeback mode cache will likely evict things based on space used and
reach a natural upper object limit. 
For example an existing cache-tier in writeback mode here has a 2GB size
and 560K objects, 13.4TB and 3.6M objects on the backing storage. 
With readproxy and a similar sized cluster I'll be setting
"target_max_objects" to something around 2M to avoid needless eviction and
then re-creation of null objects when things are read.
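
For the record, the knobs involved are just (assuming a cache pool named
"cache-pool" sitting in front of the backing pool):

# reads are proxied through the tier but not promoted; writes still land in it
ceph osd tier cache-mode cache-pool readproxy
# cap the object count so the 0-byte read objects don't force needless evictions
ceph osd pool set cache-pool target_max_objects 2000000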

Christian

> > statement in the release notes (or docs) for everybody to switch from
> > (read)forward to (read)proxy.
> >
> > And the two bits up there have _very_ conflicting statements about what
> > readproxy does, the older one would do what I want (at the cost of
> > shuffling all through the cache-tier network pipes), the newer one seems
> > to be actually describing the proxy functionality (no new objects i.e from
> > writes being added).
> >
> > I'll be ready to play with my new cluster in a bit and shall investigate
> > what does actually what.
> >
> > Christian
> >  
> >> HTH.
> >>  
> >> >
> >> > 2. The release notes aren't any particular help there either and 
> >> > issues/PR
> >> > talk about forward, not readforward as the culprit.
> >> >
> >> > 3. What I can gleam from the bits I found, proxy just replaces the 
> >> > forward
> >> > functionality.  Alas what I'm after is a mode that will not promote reads
> >> > to the cache, aka readforward. Or another set of parameters that will
> >> > produce the same results.
> >> >
> >> > Christian
> >> >  
> >> >> >
> >> >> > Christian
> >> >> > --
> >> >> > Christian BalzerNetwork/Systems Engineer
> >> >> > ch...@gol.com   Rakuten Communications
> >> >> > ___
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >> >>
> >> >>
> >> >>  
> >> >
> >> >
> >> > --
> >> > Christian BalzerNetwork/Systems Engineer
> >> > ch...@gol.com   

Re: [ceph-users] Cache mode readforward mode will eat your babies?

2017-06-08 Thread Brad Hubbard
On Thu, Jun 8, 2017 at 3:47 PM, Christian Balzer  wrote:
> On Thu, 8 Jun 2017 15:29:05 +1000 Brad Hubbard wrote:
>
>> On Thu, Jun 8, 2017 at 3:10 PM, Christian Balzer  wrote:
>> > On Thu, 8 Jun 2017 14:21:43 +1000 Brad Hubbard wrote:
>> >
>> >> On Thu, Jun 8, 2017 at 1:06 PM, Christian Balzer  wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > New cluster, Jewel, setting up cache-tiering:
>> >> > ---
>> >> > Error EPERM: 'readforward' is not a well-supported cache mode and may 
>> >> > corrupt your data.  pass --yes-i-really-mean-it to force.
>> >> > ---
>> >> >
>> >> > That's new and certainly wasn't there in Hammer, nor did it whine about
>> >> > this when upgrading my test cluster to Jewel.
>> >> >
>> >> > And speaking of whining, I did that about this and readproxy, but not
>> >> > their stability (readforward has been working nearly a year flawlessly 
>> >> > in
>> >> > the test cluster) but their lack of documentation.
>> >> >
>> >> > So while of course there is no warranty for anything with OSS, is there
>> >> > any real reason for the above scaremongering or is that based solely on
>> >> > lack of testing/experience?
>> >>
>> >> https://github.com/ceph/ceph/pull/8210 and
>> >> https://github.com/ceph/ceph/pull/8210/commits/90fe8e3d0b1ded6d14a6a43ecbd6c8634f691fbe
>> >> may offer some insight.
>> >>
>> > They do, alas of course immediately raise the following questions:
>> >
>> > 1. Where is that mode documented?
>>
>> It *was* documented by,
>> https://github.com/ceph/ceph/pull/7023/commits/d821acada39937b9dacf87614c924114adea8a58
>> in https://github.com/ceph/ceph/pull/7023 but was removed by
>> https://github.com/ceph/ceph/commit/6b6b38163b7742d97d21457cf38bdcc9bde5ae1a
>> in https://github.com/ceph/ceph/pull/9070
>>
>
> I was talking about proxy, which isn't AFAICT, nor is there a BIG bold red

That was hard to follow for me, in a thread titled "Cache mode
readforward mode will eat your babies?".

> statement in the release notes (or docs) for everybody to switch from
> (read)forward to (read)proxy.
>
> And the two bits up there have _very_ conflicting statements about what
> readproxy does, the older one would do what I want (at the cost of
> shuffling all through the cache-tier network pipes), the newer one seems
> to be actually describing the proxy functionality (no new objects i.e from
> writes being added).
>
> I'll be ready to play with my new cluster in a bit and shall investigate
> what does actually what.
>
> Christian
>
>> HTH.
>>
>> >
>> > 2. The release notes aren't any particular help there either and issues/PR
>> > talk about forward, not readforward as the culprit.
>> >
>> > 3. What I can gleam from the bits I found, proxy just replaces the forward
>> > functionality.  Alas what I'm after is a mode that will not promote reads
>> > to the cache, aka readforward. Or another set of parameters that will
>> > produce the same results.
>> >
>> > Christian
>> >
>> >> >
>> >> > Christian
>> >> > --
>> >> > Christian BalzerNetwork/Systems Engineer
>> >> > ch...@gol.com   Rakuten Communications
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Christian BalzerNetwork/Systems Engineer
>> > ch...@gol.com   Rakuten Communications
>>
>>
>>
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com