Re: [ceph-users] OSD Full Ratio Luminous - Unset

2017-07-06 Thread Ashley Merrick
After looking into this further it seems none of the:


ceph osd set-{full,nearfull,backfillfull}-ratio


commands seem to be having any effect on the cluster, including the backfillfull 
ratio. This command looks to have been added/changed since Jewel as a 
different way of setting the above; however, it does not seem to be giving the 
expected results.
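
For reference, a minimal sketch of the Luminous-style commands in question and one way to check whether they took (the ratio values here are just placeholders):

ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95
# in Luminous the ratios live in the OSDMap, so check them there
ceph osd dump | grep -E 'full_ratio|nearfull_ratio|backfillfull_ratio'

If "ceph osd dump" shows the new values while "ceph pg dump" still reports 0, the pg dump header is probably just no longer authoritative after the upgrade, which would make this a display issue rather than a lost setting.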


,Ashley


From: ceph-users  on behalf of Ashley 
Merrick 
Sent: 06 July 2017 12:44:09
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Full Ratio Luminous - Unset

Anyone have some feedback on this? Happy to log a bug ticket if it is one, but 
I want to make sure I'm not missing something related to a Luminous change.

,Ashley

Sent from my iPhone

On 4 Jul 2017, at 3:30 PM, Ashley Merrick <ash...@amerrick.co.uk> wrote:


Okay, noticed there is a new command to set these.


Tried these ("ceph osd set-{full,nearfull,backfillfull}-ratio") and the values are 
still showing as 0, along with the "full ratio(s) out of order" error.


,Ashley


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Ashley Merrick <ash...@amerrick.co.uk>
Sent: 04 July 2017 05:55:10
To: ceph-us...@ceph.com
Subject: [ceph-users] OSD Full Ratio Luminous - Unset


Hello,


On a Luminous cluster upgraded from Jewel I am seeing the following in ceph -s: "full 
ratio(s) out of order"


and


ceph pg dump | head
dumped all
version 44281
stamp 2017-07-04 05:52:08.337258
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0

I have tried to inject the values, however that has no effect. These were 
previously non-zero values and the issue only showed up after running "ceph osd 
require-osd-release luminous".


Thanks,

Ashley

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

On 17-07-06 09:39 PM, Jason Dillaman wrote:

On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek  wrote:

Is that deep copy an equivalent of what
Jewel librbd did at an unspecified point in time, or an extra one?


It's equivalent / replacement -- not an additional copy. This was
changed to support scatter/gather IO API methods which the latest
version of QEMU now directly utilizes (eliminating the need for a
bounce-buffer copy on every IO).


OK, that makes more sense now.


Once we get that librados issue resolved, that initial librbd IO
buffer copy will be dropped and librbd will become zero-copy for IO
(at least that's the goal). That's why I am recommending that you just
assume normal AIO semantics and not try to optimize for Luminous since
perhaps the next release will have that implementation detail of the
extra copy removed.


Is this: 
https://github.com/yuyuyu101/ceph/commit/794b49b5b860c538a349bdadb16bb6ae97ad9c20#commitcomment-15707924 
the issue you mention? Because at this point I'm considering switching to the 
C++ API and passing a static bufferptr buried in my bufferlist instead of 
having the extra copy done by the C API rbd_aio_write (that way I'd at least 
control the allocations).


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph @ OpenStack Sydney Summit

2017-07-06 Thread Blair Bethwaite
Oops, this time plain text...

On 7 July 2017 at 13:47, Blair Bethwaite  wrote:
>
> Hi all,
>
> Are there any "official" plans to have Ceph events co-hosted with OpenStack 
> Summit Sydney, like in Boston?
>
> The call for presentations closes in a week. The Forum will be organised 
> throughout September and (I think) that is the most likely place to have e.g. 
> Ceph ops sessions like we have in the past. Some of my local colleagues have 
> also expressed interest in having a CephFS BoF.
>
> --
> Cheers,
> ~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph @ OpenStack Sydney Summit

2017-07-06 Thread Blair Bethwaite
Hi all,

Are there any "official" plans to have Ceph events co-hosted with OpenStack
Summit Sydney, like in Boston?

The call for presentations closes in a week. The Forum will be organised
throughout September and (I think) that is the most likely place to have
e.g. Ceph ops sessions like we have in the past. Some of my local
colleagues have also expressed interest in having a CephFS BoF.

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to set Ceph client operation priority (ionice)

2017-07-06 Thread Christian Balzer

Hello,

On Thu, 6 Jul 2017 14:34:41 -0700 Su, Zhan wrote:

> Hi,
> 
> We are running a Ceph cluster serving both batch workload (e.g. data import
> / export, offline processing) and latency-sensitive workload. Currently
> batch traffic causes a huge slow down in serving latency-sensitive requests
> (e.g. streaming). When that happens, network is not the bottleneck (50%~60%
> usage of the 10Gbit link) and CPU looks to be fairly idle as well. Our
> hypothesis is that requests hit the same drive and caused this slowdown. We
> use spinning disks and they are bad at serving two sequential I/O at the
> same time.
> 
Don't hypothesize, verify it with atop, iostat, etc.
But if you're using plain disks w/o any SSDs for journals or otherwise,
that is most likely what happens, yes.

> We would like to know whether there is a way to set Ceph or Ceph client so
> operations for different workload are properly prioritized. Thanks.
> 
Not that I'm aware of; there isn't even a way to tell which client is
causing what activity at this time, something that has been brought up
multiple times here.

You could have different pools with different crush rules to separate the
distinct users so that those reads go to OSDs that aren't used by the
batch stuff. 

Beyond that, journal SSDs (future WAL SSDs for Bluestore), SSDs for bcache
or so to cache reads, SSD pools, cache-tiering, etc.
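
A rough sketch of the pool/crush-rule separation described above, assuming a dedicated CRUSH root (here called "fast") already exists for the latency-sensitive OSDs, and using the pre-Luminous pool property name crush_ruleset (Luminous renames it to crush_rule); names are made up:

# rule that only places data on OSDs under the "fast" root
ceph osd crush rule create-simple latency-rule fast host
# look up the id the rule was given
ceph osd crush rule dump latency-rule | grep rule_id
# point the latency-sensitive pool at it
ceph osd pool set latency-pool crush_ruleset <rule_id>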

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing very large buckets

2017-07-06 Thread Blair Bethwaite
How did you even get 60M objects into the bucket...?! The stuck requests
are only likely to be impacting the PG in which the bucket index is stored.
Hopefully you are not running other pools on those OSDs?

You'll need to upgrade to Jewel to gain the --bypass-gc radosgw-admin
flag; that speeds up the deletion considerably, but with a 60M-object
bucket I imagine you're still going to be waiting quite a few days for it
to finish. Without it, this is basically impossible.
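
For what it's worth, a hedged sketch of the Jewel-era invocation (verify the flags against your radosgw-admin build before running):

# remove the bucket and all of its objects, skipping the garbage collector
radosgw-admin bucket rm --bucket=<bucket-name> --purge-objects --bypass-gc
# watch progress from another shell
radosgw-admin bucket stats --bucket=<bucket-name>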

We are actually working through this issue right now on an old 6M object
bucket. We got impatient and tried resharding the bucket index to speed
things up further but now the bucket rm is doing nothing. Waiting for
support advice from RH...

Cheers,

On 7 Jul. 2017 02:44, "Eric Beerman"  wrote:

Hello,

We have a bucket that has 60 million + objects in it, and are trying to
delete it. To do so, we have tried doing:

radosgw-admin bucket list --bucket=

and then cycling through the list of object names and deleting them, 1,000
at a time. However, after ~3-4k objects deleted, the list call stops
working and just hangs. We have also noticed slow requests for the cluster
most of the time after running that command when it hangs. We know there is
also a "radosgw-admin bucket rm --bucket= --purge-objects" command,
but we are nervous that this will cause slowness in the cluster as well,
since the listing did - or that it might not work at all, considering the
list didn't work.

We are running Ceph version 0.94.3, and there is no bucket sharding on the
index.

What is the recommended way to delete a large bucket like that in
production, without incurring any downtime/slow requests?

Thanks,
- Eric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-06 Thread Christian Balzer

Hello,

On Thu, 6 Jul 2017 17:57:06 + george.vasilaka...@stfc.ac.uk wrote:

> Thanks for your response David.
> 
> What you've described has been what I've been thinking about too. We have 
> 1401 OSDs in the cluster currently and this output is from the tail end of 
> the backfill for +64 PG increase on the biggest pool.
> 
> The problem is we see this cluster do at most 20 backfills at the same time 
> and as the queue of PGs to backfill gets smaller there are fewer and fewer 
> actively backfilling which I don't quite understand.
> 

Welcome to the club.
You're not the first one to wonder about this and while David's comment
about max_backfill is valid, it simply doesn't explain all of this.

See this thread and my thoughts on the matter; unfortunately no developer ever
followed up on it:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009704.html

Christian

> Out of the PGs currently backfilling, all of them have completely changed 
> their sets (difference between acting and up sets is 11), which makes some 
> sense since what moves around are the newly spawned PGs. That's 5 PGs 
> currently in backfilling states which makes 110 OSDs blocked. What happened 
> to the other 1300? That's what's strange to me. There are another 7 waiting 
> to backfill.
> Out of all the OSDs in the up and acting sets of all PGs currently 
> backfilling or waiting to backfill there are 13 OSDs in common so I guess 
> that kind of answers it. I haven't checked to see but I suspect each 
> backfilling PG has at least one OSD in one of its sets in common with either 
> set of one of the waiting PGs.
> 
> So I guess we can't do much about the tail end taking so long: there's no way 
> for more of the PGs to actually be backfilling at the same time.
> 
> I think we'll have to try bumping osd_max_backfills. Has anyone tried bumping 
> the relative priorities of recovery vs others? What about noscrub?
> 
> Best regards,
> 
> George
> 
> 
> From: David Turner [drakonst...@gmail.com]
> Sent: 06 July 2017 16:08
> To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and or 
> adding OSDs
> 
> Just a quick place to start is osd_max_backfills.  You have this set to 1.  
> Each PG is on 11 OSDs.  When you have a PG moving, it is on the original 11 
> OSDs and the new X number of OSDs that it is going to.  For each of your PGs 
> that is moving, an OSD can only move 1 at a time (your osd_max_backfills), 
> and each PG is on 11 + X OSDs.
> 
> So with your cluster.  I don't see how many OSDs you have, but you have 25 
> PGs moving around and 8 of them are actively backfilling.  Assuming you were 
> only changing 1 OSD per backfill operation, that would mean that you had at 
> least 96 OSDs (11+1 * 8).  That would be a perfect distribution of OSDs for 
> the PGs backfilling.  Let's say now that you're averaging closer to 3 OSDs 
> changing per PG and that the remaining 17 PGs waiting to backfill are blocked 
> by a few OSDs each (because those OSDs are already included in the 8 active 
> backfilling PGs.  That would indicate that you have closer to 200+ OSDs.
> 
> Every time I'm backfilling and want to speed things up, I watch iostat on 
> some of my OSDs and increase osd_max_backfills until I'm consistently using 
> about 70% of the disk to allow for customer overhead.  You can always figure 
> out what's best for your use case though.  Generally I've been ok running 
> with osd_max_backfills=5 without much problem and bringing that up some when 
> I know that client IO will be minimal, but again it depends on your use case 
> and cluster.
> 
> On Thu, Jul 6, 2017 at 10:08 AM <george.vasilaka...@stfc.ac.uk> wrote:
> Hey folks,
> 
> We have a cluster that's currently backfilling from increasing PG counts. We 
> have tuned recovery and backfill way down as a "precaution" and would like to 
> start tuning it to bring up to a good balance between that and client I/O.
> 
> At the moment we're in the process of bumping up PG numbers for pools serving 
> production workloads. Said pools are EC 8+3.
> 
> It looks like we're having very low numbers of PGs backfilling as in:
> 
> 2567 TB used, 5062 TB / 7630 TB avail
> 145588/849529410 objects degraded (0.017%)
> 5177689/849529410 objects misplaced (0.609%)
> 7309 active+clean
>   23 active+clean+scrubbing
>   18 active+clean+scrubbing+deep
>   13 active+remapped+backfill_wait
>5 active+undersized+degraded+remapped+backfilling
>4 active+undersized+degraded+remapped+backfill_wait
>3 active+remapped+backfilling
>1 active+clean+inconsistent
> recovery io 1966 MB/s, 96 objects/s
>   client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr
> 
> Also, the rate of recovery in terms of data and object throughput varies a lot, 
> even with the number of PGs backfilling remaining constant.

Re: [ceph-users] Watch for fstrim running on your Ubuntu systems

2017-07-06 Thread Reed Dier
I could easily see that being the case, especially with Micron as a common 
thread, but it appears that I am on the latest FW for both the SATA and the 
NVMe:

> $ sudo ./msecli -L | egrep 'Device|FW'
> Device Name  : /dev/sda
> FW-Rev   : D0MU027
> Device Name  : /dev/sdb
> FW-Rev   : D0MU027
> Device Name  : /dev/sdc
> FW-Rev   : D0MU027
> Device Name  : /dev/sdd
> FW-Rev   : D0MU027
> Device Name  : /dev/sde
> FW-Rev   : D0MU027
> Device Name  : /dev/sdf
> FW-Rev   : D0MU027
> Device Name  : /dev/sdg
> FW-Rev   : D0MU027
> Device Name  : /dev/sdh
> FW-Rev   : D0MU027
> Device Name  : /dev/sdi
> FW-Rev   : D0MU027
> Device Name  : /dev/sdj
> FW-Rev   : D0MU027
> Device Name  : /dev/nvme0
> FW-Rev   : 0091634

D0MU027 and 1634 are the latest reported FW from Micron, current as of 
04/12/2017 and 12/07/2016, respectively.

Could be the current FW doesn't play nice, so that's on the table. But for now, it's 
a thread that can't be pulled any further.

Appreciate the feedback,

Reed

> On Jul 6, 2017, at 1:18 PM, Peter Maloney 
>  wrote:
> 
> Hey,
> 
> I have some SAS Micron S630DC-400 which came with firmware M013 which did the 
> same or worse (takes very long... 100% blocked for about 5min for 16GB 
> trimmed), and works just fine with firmware M017 (4s for 32GB trimmed). So 
> maybe you just need an update.
> 
> Peter
> 
> 
> 
> On 07/06/17 18:39, Reed Dier wrote:
>> Hi Wido,
>> 
>> I came across this ancient ML entry with no responses and wanted to follow 
>> up with you to see if you recalled any solution to this.
>> Copying the ceph-users list to preserve any replies that may result for 
>> archival.
>> 
>> I have a couple of boxes with 10x Micron 5100 SATA SSD’s, journaled on 
>> Micron 9100 NVMe SSD’s; ceph 10.2.7; Ubuntu 16.04 4.8 kernel.
>> 
>> I have noticed now twice that I’ve had SSD’s flapping due to the fstrim 
>> eating up the io 100%.
>> It eventually righted itself after a little less than 8 hours.
>> Noout flag was set, so it didn’t create any unnecessary rebalance or whatnot.
>> 
>> Timeline showing that only 1 OSD ever went down at a time, but they seemed 
>> to go down in a rolling fashion during the fstrim session.
>> You can actually see in the OSD graph all 10 OSD’s on this node go down 1 by 
>> 1 over time.
>> 
>> 
>> And the OSD’s were going down because of:
>> 
>>> 2017-07-02 13:47:32.618752 7ff612721700  1 heartbeat_map is_healthy 
>>> 'OSD::osd_op_tp thread 0x7ff5ecd0c700' had timed out after 15
>>> 2017-07-02 13:47:32.618757 7ff612721700  1 heartbeat_map is_healthy 
>>> 'FileStore::op_tp thread 0x7ff608d9e700' had timed out after 60
>>> 2017-07-02 13:47:32.618760 7ff612721700  1 heartbeat_map is_healthy 
>>> 'FileStore::op_tp thread 0x7ff608d9e700' had suicide timed out after 180
>>> 2017-07-02 13:47:32.624567 7ff612721700 -1 common/HeartbeatMap.cc 
>>> : In function 'bool 
>>> ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, 
>>> time_t)' thread 7ff612721700 time 2017-07-02 13:47:32.618784
>>> common/HeartbeatMap.cc : 86: FAILED assert(0 == 
>>> "hit suicide timeout")
>> 
>> 
>> I am curious if you were able to nice it or something similar to mitigate 
>> this issue?
>> Oddly, I have similar machines with Samsung SM863a’s with Intel P3700 
>> journals that do not appear to be affected by the fstrim load issue despite 
>> identical weekly cron jobs enabled. Only the Micron drives (newer) have had 
>> these issues.
>> 
>> Appreciate any pointers,
>> 
>> Reed
>> 
>>> Wido den Hollander wido at 42on.com  
>>> 
>>> Tue Dec 9 01:21:16 PST 2014
>>> Hi,
>>> 
>>> Last sunday I got a call early in the morning that a Ceph cluster was
>>> having some issues. Slow requests and OSDs marking each other down.
>>> 
>>> Since this is a 100% SSD cluster I was a bit confused and started
>>> investigating.
>>> 
>>> It took me about 15 minutes to see that fstrim was running and was
>>> utilizing the SSDs 100%.
>>> 
>>> On Ubuntu 14.04 there is a weekly CRON which executes fstrim-all. It
>>> detects all mountpoints which can be trimmed and starts to trim those.
>>> 
>>> On the Intel SSDs used here it caused them to become 100% busy for a
>>> couple of minutes. That was enough for them to no longer respond on
>>> heartbeats, thus timing out and being marked down.
>>> 
>>> Luckily we had the "out interval" set to 1800 seconds on that cluster,
>>> so no OSD was marked as "out".
>>> 
>>> fstrim-all does not execute fstrim with a ionice priority. From what I
>>> understand, but haven't tested yet, is that running fstrim with ionice
>> -c Idle should solve this.

[ceph-users] How to set Ceph client operation priority (ionice)

2017-07-06 Thread Su, Zhan
Hi,

We are running a Ceph cluster serving both batch workload (e.g. data import
/ export, offline processing) and latency-sensitive workload. Currently
batch traffic causes a huge slow down in serving latency-sensitive requests
(e.g. streaming). When that happens, network is not the bottleneck (50%~60%
usage of the 10Gbit link) and CPU looks to be fairly idle as well. Our
hypothesis is that requests hit the same drive and caused this slowdown. We
use spinning disks and they are bad at serving two sequential I/O at the
same time.

We would like to know whether there is a way to set Ceph or Ceph client so
operations for different workload are properly prioritized. Thanks.

Regards,
Zhan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd journal support

2017-07-06 Thread Jason Dillaman
There are no immediate plans to support RBD journaling in krbd.
The journaling feature requires a lot of code and, with limited
resources, the priority has been to provide alternative block device
options that pass through to librbd for such use cases, and to optimize
the performance of librbd / librados to shrink the performance gap.
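
If a kernel-visible block device with journaling is needed today, one pass-through option along the lines Jason describes is rbd-nbd, which maps the image through librbd in userspace. A sketch only; pool and image names are placeholders:

# journaling requires exclusive-lock, which recent image formats enable by default
rbd feature enable mypool/myimage journaling
# map through librbd via the NBD kernel driver; prints the device, e.g. /dev/nbd0
rbd-nbd map mypool/myimage
# ...use the device, then...
rbd-nbd unmap /dev/nbd0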

On Thu, Jul 6, 2017 at 12:40 PM, Maged Mokhtar  wrote:
> Hi all,
>
> Are there any plans to support rbd journal feature in kernel krbd ?
>
> Cheers /Maged
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Jason Dillaman
On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek  wrote:
> Is that deep copy an equivalent of what
> Jewel librbd did at an unspecified point in time, or an extra one?

It's equivalent / replacement -- not an additional copy. This was
changed to support scatter/gather IO API methods which the latest
version of QEMU now directly utilizes (eliminating the need for a
bounce-buffer copy on every IO).

Once we get that librados issue resolved, that initial librbd IO
buffer copy will be dropped and librbd will become zero-copy for IO
(at least that's the goal). That's why I am recommending that you just
assume normal AIO semantics and not try to optimize for Luminous since
perhaps the next release will have that implementation detail of the
extra copy removed.

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to set up bluestore manually?

2017-07-06 Thread Vasu Kulkarni
I recommend you file a tracker issue at http://tracker.ceph.com/ with all
details (ceph version, steps you ran and output, hiding anything you
don't want to share). I doubt it's a ceph-deploy issue,
but we can try to replicate it in our lab.
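
In the meantime, a hedged sketch of driving ceph-disk directly against plain GPT partitions (Luminous ceph-disk syntax assumed; LVM logical volumes for block.db/block.wal are exactly the path that appears to be failing here, so raw partitions are used instead):

# prepare a bluestore OSD with separate DB and WAL partitions
ceph-disk prepare --bluestore /dev/sdc --block.db /dev/sdb1 --block.wal /dev/sdb2
# activate the newly created data partition
ceph-disk activate /dev/sdc1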

On Thu, Jul 6, 2017 at 5:25 AM, Martin Emrich 
wrote:

> Hi!
>
> I changed the partitioning scheme to use a "real" primary partition
> instead of a logical volume. Ceph-deploy seems run fine now, but the OSD
> does not start.
>
> I see lots of these in the journal:
>
> Jul 06 13:53:42  sh[9768]: 0> 2017-07-06 13:53:42.794027 7fcf9918fb80 -1
> *** Caught signal (Aborted) **
> Jul 06 13:53:42  sh[9768]: in thread 7fcf9918fb80 thread_name:ceph-osd
> Jul 06 13:53:42  sh[9768]: ceph version 12.1.0 (
> 262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
> Jul 06 13:53:42  sh[9768]: 1: (()+0x9cd6af) [0x7fcf99b776af]
> Jul 06 13:53:42  sh[9768]: 2: (()+0xf370) [0x7fcf967d9370]
> Jul 06 13:53:42  sh[9768]: 3: (gsignal()+0x37) [0x7fcf958031d7]
> Jul 06 13:53:42  sh[9768]: 4: (abort()+0x148) [0x7fcf958048c8]
> Jul 06 13:53:42  sh[9768]: 5: (ceph::__ceph_assert_fail(char const*, char
> const*, int, char const*)+0x284) [0x7fcf99bb5394]
> Jul 06 13:53:42  sh[9768]: 6: (BitMapAreaIN::reserve_blocks(long)+0xb6)
> [0x7fcf99b6c486]
> Jul 06 13:53:42  sh[9768]: 7: (BitMapAllocator::reserve(unsigned
> long)+0x80) [0x7fcf99b6a240]
> Jul 06 13:53:42  sh[9768]: 8: (BlueFS::_allocate(unsigned char, unsigned
> long, std::vector mempool::pool_allocator<(mempool::pool_index_t)9,
> bluefs_extent_t> >*)+0xee) [0x7fcf99b31c0e]
> Jul 06 13:53:42  sh[9768]: 9: 
> (BlueFS::_flush_and_sync_log(std::unique_lock&,
> unsigned long, unsigned long)+0xbc4) [0x7fcf99b38be4]
> Jul 06 13:53:42  sh[9768]: 10: (BlueFS::sync_metadata()+0x215)
> [0x7fcf99b3d725]
> Jul 06 13:53:42  sh[9768]: 11: (BlueFS::umount()+0x74) [0x7fcf99b3dc44]
> Jul 06 13:53:42  sh[9768]: 12: (BlueStore::_open_db(bool)+0x579)
> [0x7fcf99a62859]
> Jul 06 13:53:42  sh[9768]: 13: (BlueStore::fsck(bool)+0x39b)
> [0x7fcf99a9581b]
> Jul 06 13:53:42  sh[9768]: 14: (BlueStore::mkfs()+0x1168) [0x7fcf99a6d118]
> Jul 06 13:53:42  sh[9768]: 15: (OSD::mkfs(CephContext*, ObjectStore*,
> std::string const&, uuid_d, int)+0x29b) [0x7fcf9964b75b]
> Jul 06 13:53:42  sh[9768]: 16: (main()+0xf83) [0x7fcf99590573]
> Jul 06 13:53:42  sh[9768]: 17: (__libc_start_main()+0xf5) [0x7fcf957efb35]
> Jul 06 13:53:42  sh[9768]: 18: (()+0x4826e6) [0x7fcf9962c6e6]
> Jul 06 13:53:42  sh[9768]: NOTE: a copy of the executable, or `objdump
> -rdS ` is needed to interpret this.
> Jul 06 13:53:42  sh[9768]: Traceback (most recent call last):
> Jul 06 13:53:42  sh[9768]: File "/usr/sbin/ceph-disk", line 9, in 
> Jul 06 13:53:42  sh[9768]: load_entry_point('ceph-disk==1.0.0',
> 'console_scripts', 'ceph-disk')()
>
>
> Also interesting is the message "-1 rocksdb: Invalid argument: db: does
> not exist (create_if_missing is false)"... Looks to me as if ceph-deploy
> did not create the RocksDB?
>
> So still no success with bluestore :(
>
> Thanks,
>
> Martin
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Martin Emrich
> Sent: Tuesday, 4 July 2017 22:02
> To: Loris Cuoghi ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to set up bluestore manually?
>
> Hi!
>
> After getting some other stuff done, I finally got around to continuing
> here.
>
> I set up a whole new cluster with ceph-deploy, but adding the first OSD
> fails:
>
> ceph-deploy osd create --bluestore ${HOST}:/dev/sdc --block-wal
> /dev/cl/ceph-waldb-sdc --block-db /dev/cl/ceph-waldb-sdc .
> .
> .
> [WARNIN] get_partition_dev: Try 9/10 : partition 1 for
> /dev/cl/ceph-waldb-sdc does not exist in /sys/block/dm-2  [WARNIN]
> get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid path is
> /sys/dev/block/253:2/dm/uuid  [WARNIN] get_dm_uuid: get_dm_uuid
> /dev/cl/ceph-waldb-sdc uuid is LVM-2r0bGcoyMB0VnWeGGS77eOD5IOu8wA
> PN3wPX4OWSS1XGkYZYoziXhfAFMjJf4FJR
>  [WARNIN]
>  [WARNIN] get_partition_dev: Try 10/10 : partition 1 for
> /dev/cl/ceph-waldb-sdc does not exist in /sys/block/dm-2  [WARNIN]
> get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid path is
> /sys/dev/block/253:2/dm/uuid  [WARNIN] get_dm_uuid: get_dm_uuid
> /dev/cl/ceph-waldb-sdc uuid is LVM-2r0bGcoyMB0VnWeGGS77eOD5IOu8wA
> PN3wPX4OWSS1XGkYZYoziXhfAFMjJf4FJR
>  [WARNIN]
>  [WARNIN] Traceback (most recent call last):
>  [WARNIN]   File "/usr/sbin/ceph-disk", line 9, in 
>  [WARNIN] load_entry_point('ceph-disk==1.0.0', 'console_scripts',
> 'ceph-disk')()
>  [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py",
> line 5687, in run
>  [WARNIN] main(sys.argv[1:])
>  [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py",
> line 5638, in main
>  [WARNIN] args.func(args)
>  [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py",
> line 2004, in main
>  [WARNIN] Prepare.factory(args).prepare()
>  [WARNIN]   File "/usr/lib/p

Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-06 Thread Andras Pataki

Hi Greg,

At the moment our cluster is all in balance.  We have one failed drive 
that will be replaced in a few days (the OSD has been removed from ceph 
and will be re-added with the replacement drive).  I'll document the 
state of the PGs before the addition of the drive and during the 
recovery process and report back.
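
A minimal sketch of what such a capture could look like (the PG id is just the one from the earlier example):

ceph -s > status_before.txt
ceph pg dump pgs_brief > pgs_before.txt
# after adding the OSD, snapshot the same views plus any degraded PG in detail
ceph -s > status_during.txt
ceph pg dump pgs_brief > pgs_during.txt
ceph pg 7.87c query > pg_7.87c_during.json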


We have a few pools; most are on 3 replicas now, and some with non-critical 
data that we have elsewhere are on 2.  But I've seen the degradation 
even on the 3-replica pools (I think my original example included such 
a pool as well).


Andras


On 06/30/2017 04:38 PM, Gregory Farnum wrote:
On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <apat...@flatironinstitute.org> wrote:


Hi cephers,

I noticed something I don't understand about ceph's behavior when
adding an OSD.  When I start with a clean cluster (all PG's
active+clean) and add an OSD (via ceph-deploy for example), the
crush map gets updated and PGs get reassigned to different OSDs,
and the new OSD starts getting filled with data.  As the new OSD
gets filled, I start seeing PGs in degraded states.  Here is an
example:

  pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
3164 TB used, 781 TB / 3946 TB avail
*8017/994261437 objects degraded (0.001%)*
2220581/994261437 objects misplaced (0.223%)
   42393 active+clean
  91 active+remapped+wait_backfill
   9 active+clean+scrubbing+deep
*   1 active+recovery_wait+degraded*
   1 active+clean+scrubbing
   1 active+remapped+backfilling


Any ideas why there would be any persistent degradation in the
cluster while the newly added drive is being filled? It takes
perhaps a day or two to fill the drive - and during all this time
the cluster seems to be running degraded.  As data is written to
the cluster, the number of degraded objects increases over time. 
Once the newly added OSD is filled, the cluster comes back to

clean again.

Here is the PG that is degraded in this picture:

7.87c10200419430477
active+recovery_wait+degraded2017-06-20 14:12:44.119921   
344610'7583572:2797[402,521] 402[402,521]402   
344610'72017-06-16 06:04:55.822503344610'72017-06-16

06:04:55.822503

The newly added osd here is 521.  Before it got added, this PG had
two replicas clean, but one got forgotten somehow?


This sounds a bit concerning at first glance. Can you provide some 
output of exactly what commands you're invoking, and the "ceph -s" 
output as it changes in response?


I really don't see how adding a new OSD can result in it "forgetting" 
about existing valid copies — it's definitely not supposed to — so I 
wonder if there's a collision in how it's deciding to remove old 
locations.


Are you running with only two copies of your data? It shouldn't matter 
but there could also be errors resulting in a behavioral difference 
between two and three copies.

-Greg


Other remapped PGs have 521 in their "up" set but still have the
two existing copies in their "acting" set - and no degradation is
shown.  Examples:

2.f241428201628564051014850801 3102   
3102active+remapped+wait_backfill 2017-06-20
14:12:42.650308583553'2033479 583573:2033266[467,521]   
467[467,499]467 582430'2072017-06-16

09:08:51.055131 582036'20308372017-05-31 20:37:54.831178
6.2b7d104990140209980 372428746873673   
3673 active+remapped+wait_backfill2017-06-20
14:12:42.070019583569'165163583572:342128 [541,37,521]   
541[541,37,532]541 582430'1618902017-06-18

09:42:49.148402 582430'1618902017-06-18 09:42:49.148402

We are running the latest Jewel patch level everywhere (10.2.7). 
Any insights would be appreciated.


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding storage to exiting clusters with minimal impact

2017-07-06 Thread Peter Maloney
Here's my possibly unique method... I had 3 nodes with 12 disks each,
and when adding 2 more nodes, I had issues with the common method you
describe, totally blocking clients for minutes, but this worked great
for me:

> my own method
> - osd max backfills = 1 and osd recovery max active = 1
> - create them with crush weight 0 so no peering happens
> - (starting here the script below does it, eg. `ceph_activate_osds 6`
> will set weight 6)
> - after they're up, set them reweight 0
> - then set crush weight to the TB of the disk
> - peering starts, but reweight is 0 so it doesn't block clients
> - when that's done, reweight 1 and it should be faster than the
> previous peering and not bug clients as much
>
>
> # list osds with hosts next to them for easy filtering with awk (doesn't support chassis, rack, etc. buckets)
> ceph_list_osd() {
> ceph osd tree | awk '
> BEGIN {found=0; host=""};
> $3 == "host" {found=1; host=$4; getline};
> $3 == "host" {found=0}
> found || $3 ~ /osd\./ {print $0 " " host}'
> }
>
> peering_sleep() {
> echo "sleeping"
> sleep 2
> while ceph health | grep -q peer; do
> echo -n .
> sleep 1
> done
> echo
> sleep 5
> }
>
> # after an osd is already created, this reweights them to 'activate' them
> ceph_activate_osds() {
> weight="$1"
> host=$(hostname -s)
> 
> if [ -z "$weight" ]; then
> # TODO: somehow make this automatic...
> # This assumes all disks are the same weight.
> weight=6.00099
> fi
> 
> # for crush weight 0 osds, set reweight 0 so the non-zero crush weight won't cause as many blocked requests
> for id in $(ceph_list_osd | awk '$2 == 0 {print $1}'); do
> ceph osd reweight $id 0 &
> done
> wait
> peering_sleep
> 
> # the harsh reweight which we do slowly
> for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
> echo ceph osd crush reweight "osd.$id" "$weight"
> ceph osd crush reweight "osd.$id" "$weight"
> peering_sleep
> done
> 
> # the light reweight
> for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
> ceph osd reweight $id 1 &
> done
> wait
> }


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Watch for fstrim running on your Ubuntu systems

2017-07-06 Thread Peter Maloney
Hey,

I have some SAS Micron S630DC-400 which came with firmware M013 which
did the same or worse (takes very long... 100% blocked for about 5min
for 16GB trimmed), and works just fine with firmware M017 (4s for 32GB
trimmed). So maybe you just need an update.

Peter



On 07/06/17 18:39, Reed Dier wrote:
> Hi Wido,
>
> I came across this ancient ML entry with no responses and wanted to
> follow up with you to see if you recalled any solution to this.
> Copying the ceph-users list to preserve any replies that may result
> for archival.
>
> I have a couple of boxes with 10x Micron 5100 SATA SSD’s, journaled on
> Micron 9100 NVMe SSD’s; ceph 10.2.7; Ubuntu 16.04 4.8 kernel.
>
> I have noticed now twice that I’ve had SSD’s flapping due to the
> fstrim eating up the io 100%.
> It eventually righted itself after a little less than 8 hours.
> Noout flag was set, so it didn’t create any unnecessary rebalance or
> whatnot.
>
> Timeline showing that only 1 OSD ever went down at a time, but they
> seemed to go down in a rolling fashion during the fstrim session.
> You can actually see in the OSD graph all 10 OSD’s on this node go
> down 1 by 1 over time.
>
> And the OSD’s were going down because of:
>
>> 2017-07-02 13:47:32.618752 7ff612721700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7ff5ecd0c700' had timed out after 15
>> 2017-07-02 13:47:32.618757 7ff612721700  1 heartbeat_map is_healthy
>> 'FileStore::op_tp thread 0x7ff608d9e700' had timed out after 60
>> 2017-07-02 13:47:32.618760 7ff612721700  1 heartbeat_map is_healthy
>> 'FileStore::op_tp thread 0x7ff608d9e700' had suicide timed out after 180
>> 2017-07-02 13:47:32.624567 7ff612721700 -1 common/HeartbeatMap.cc
>> : In function 'bool
>> ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const
>> char*, time_t)' thread 7ff612721700 time 2017-07-02 13:47:32.618784
>> common/HeartbeatMap.cc : 86: FAILED assert(0
>> == "hit suicide timeout")
>
> I am curious if you were able to nice it or something similar to
> mitigate this issue?
> Oddly, I have similar machines with Samsung SM863a’s with Intel P3700
> journals that do not appear to be affected by the fstrim load issue
> despite identical weekly cron jobs enabled. Only the Micron drives
> (newer) have had these issues.
>
> Appreciate any pointers,
>
> Reed
>
>> *Wido den Hollander* wido at 42on.com 
>> 
>> /Tue Dec 9 01:21:16 PST 2014/
>> Hi,
>>
>> Last sunday I got a call early in the morning that a Ceph cluster was
>> having some issues. Slow requests and OSDs marking each other down.
>>
>> Since this is a 100% SSD cluster I was a bit confused and started
>> investigating.
>>
>> It took me about 15 minutes to see that fstrim was running and was
>> utilizing the SSDs 100%.
>>
>> On Ubuntu 14.04 there is a weekly CRON which executes fstrim-all. It
>> detects all mountpoints which can be trimmed and starts to trim those.
>>
>> On the Intel SSDs used here it caused them to become 100% busy for a
>> couple of minutes. That was enough for them to no longer respond on
>> heartbeats, thus timing out and being marked down.
>>
>> Luckily we had the "out interval" set to 1800 seconds on that cluster,
>> so no OSD was marked as "out".
>>
>> fstrim-all does not execute fstrim with a ionice priority. From what I
>> understand, but haven't tested yet, is that running fstrim with ionice
>> -c Idle should solve this.
>>
>> It's weird that this issue didn't come up earlier on that cluster, but
>> after killing fstrim all problems we resolved and the cluster ran
>> happily again.
>>
>> So watch out for fstrim on early Sunday mornings on Ubuntu!
>>
>> -- 
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-06 Thread David Turner
ceph pg dump | grep backfill

Look through the output of that command and compare the acting set (the OSDs the
PG is on / moving off of) and the up set (where the PG will end up).  All it takes is
a single OSD that is part of a PG currently backfilling, and any other PGs that
OSD is listed on will sit in backfill_wait until an osd_max_backfills slot frees
up for them to start.
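
Something along these lines can make the overlaps easier to spot; a rough sketch that assumes the pgs_brief column order of pgid, state, up, up_primary, acting, acting_primary:

ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /backfill/ {print $1, $2, "up="$3, "acting="$5}'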

On Thu, Jul 6, 2017, 1:57 PM  wrote:

> Thanks for your response David.
>
> What you've described has been what I've been thinking about too. We have
> 1401 OSDs in the cluster currently and this output is from the tail end of
> the backfill for +64 PG increase on the biggest pool.
>
> The problem is we see this cluster do at most 20 backfills at the same
> time and as the queue of PGs to backfill gets smaller there are fewer and
> fewer actively backfilling which I don't quite understand.
>
> Out of the PGs currently backfilling, all of them have completely changed
> their sets (difference between acting and up sets is 11), which makes some
> sense since what moves around are the newly spawned PGs. That's 5 PGs
> currently in backfilling states which makes 110 OSDs blocked. What happened
> to the other 1300? That's what's strange to me. There are another 7 waiting
> to backfill.
> Out of all the OSDs in the up and acting sets of all PGs currently
> backfilling or waiting to backfill there are 13 OSDs in common so I guess
> that kind of answers it. I haven't checked to see but I suspect each
> backfilling PG has at least one OSD in one of its sets in common with
> either set of one of the waiting PGs.
>
> So I guess we can't do much about the tail end taking so long: there's no
> way for more of the PGs to actually be backfilling at the same time.
>
> I think we'll have to try bumping osd_max_backfills. Has anyone tried
> bumping the relative priorities of recovery vs others? What about noscrub?
>
> Best regards,
>
> George
>
> 
> From: David Turner [drakonst...@gmail.com]
> Sent: 06 July 2017 16:08
> To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and or
> adding OSDs
>
> Just a quick place to start is osd_max_backfills.  You have this set to
> 1.  Each PG is on 11 OSDs.  When you have a PG moving, it is on the
> original 11 OSDs and the new X number of OSDs that it is going to.  For
> each of your PGs that is moving, an OSD can only move 1 at a time (your
> osd_max_backfills), and each PG is on 11 + X OSDs.
>
> So with your cluster.  I don't see how many OSDs you have, but you have 25
> PGs moving around and 8 of them are actively backfilling.  Assuming you
> were only changing 1 OSD per backfill operation, that would mean that you
> had at least 96 OSDs (11+1 * 8).  That would be a perfect distribution of
> OSDs for the PGs backfilling.  Let's say now that you're averaging closer
> to 3 OSDs changing per PG and that the remaining 17 PGs waiting to backfill
> are blocked by a few OSDs each (because those OSDs are already included in
> the 8 active backfilling PGs.  That would indicate that you have closer to
> 200+ OSDs.
>
> Every time I'm backfilling and want to speed things up, I watch iostat on
> some of my OSDs and increase osd_max_backfills until I'm consistently using
> about 70% of the disk to allow for customer overhead.  You can always
> figure out what's best for your use case though.  Generally I've been ok
> running with osd_max_backfills=5 without much problem and bringing that up
> some when I know that client IO will be minimal, but again it depends on
> your use case and cluster.
>
> On Thu, Jul 6, 2017 at 10:08 AM <george.vasilaka...@stfc.ac.uk> wrote:
> Hey folks,
>
> We have a cluster that's currently backfilling from increasing PG counts.
> We have tuned recovery and backfill way down as a "precaution" and would
> like to start tuning it to bring up to a good balance between that and
> client I/O.
>
> At the moment we're in the process of bumping up PG numbers for pools
> serving production workloads. Said pools are EC 8+3.
>
> It looks like we're having very low numbers of PGs backfilling as in:
>
> 2567 TB used, 5062 TB / 7630 TB avail
> 145588/849529410 objects degraded (0.017%)
> 5177689/849529410 objects misplaced (0.609%)
> 7309 active+clean
>   23 active+clean+scrubbing
>   18 active+clean+scrubbing+deep
>   13 active+remapped+backfill_wait
>5 active+undersized+degraded+remapped+backfilling
>4 active+undersized+degraded+remapped+backfill_wait
>3 active+remapped+backfilling
>1 active+clean+inconsistent
> recovery io 1966 MB/s, 96 objects/s
>   client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr
>
> Also, the rate of recovery in terms of data and object throughput varies a
> lot, even with the number of PGs backfilling remaining constant.

Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Jason Dillaman
On Thu, Jul 6, 2017 at 11:46 AM, Piotr Dałek  wrote:
> How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy
> the buffer with the assumption that it won't change) and instead of
> constructing bufferlist containing bufferptr to copied data, construct a
> bufferlist containing bufferptr made with create_static(user_buffer)?

We must be talking past each other -- there was never such a
pre-Luminous contract since (1) it did copy the buffer on every IO and
(2) it could have potentially copied before the 'rbd_aio_write' call
returned or after (but at least before the completion was fired). Just
because it works sometimes doesn't mean it would always work since it
would be a race between two threads.

Unfortunately, until the librados issue is solved, you will still have
to copy the data once when using the C++ API. The only advantage is
that you would be responsible for the copying and it would only need
to be performed once instead of twice.

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-06 Thread george.vasilakakos
Thanks for your response David.

What you've described has been what I've been thinking about too. We have 1401 
OSDs in the cluster currently and this output is from the tail end of the 
backfill for +64 PG increase on the biggest pool.

The problem is we see this cluster do at most 20 backfills at the same time and 
as the queue of PGs to backfill gets smaller there are fewer and fewer actively 
backfilling which I don't quite understand.

Out of the PGs currently backfilling, all of them have completely changed their 
sets (difference between acting and up sets is 11), which makes some sense 
since what moves around are the newly spawned PGs. That's 5 PGs currently in 
backfilling states which makes 110 OSDs blocked. What happened to the other 
1300? That's what's strange to me. There are another 7 waiting to backfill.
Out of all the OSDs in the up and acting sets of all PGs currently backfilling 
or waiting to backfill there are 13 OSDs in common so I guess that kind of 
answers it. I haven't checked to see but I suspect each backfilling PG has at 
least one OSD in one of its sets in common with either set of one of the 
waiting PGs.

So I guess we can't do much about the tail end taking so long: there's no way 
for more of the PGs to actually be backfilling at the same time.

I think we'll have to try bumping osd_max_backfills. Has anyone tried bumping 
the relative priorities of recovery vs others? What about noscrub?
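
For the record, a sketch of bumping it at runtime without restarting OSDs (values are illustrative and can be walked back the same way):

# raise backfill concurrency on all OSDs
ceph tell osd.* injectargs '--osd-max-backfills 3'
# pause scrubbing while the backfill tail finishes
ceph osd set noscrub
ceph osd set nodeep-scrub
# and when the backfill is done
ceph osd unset noscrub
ceph osd unset nodeep-scrub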

Best regards,

George


From: David Turner [drakonst...@gmail.com]
Sent: 06 July 2017 16:08
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and or 
adding OSDs

Just a quick place to start is osd_max_backfills.  You have this set to 1.  
Each PG is on 11 OSDs.  When you have a PG moving, it is on the original 11 
OSDs and the new X number of OSDs that it is going to.  For each of your PGs 
that is moving, an OSD can only move 1 at a time (your osd_max_backfills), and 
each PG is on 11 + X OSDs.

So with your cluster.  I don't see how many OSDs you have, but you have 25 PGs 
moving around and 8 of them are actively backfilling.  Assuming you were only 
changing 1 OSD per backfill operation, that would mean that you had at least 96 
OSDs (11+1 * 8).  That would be a perfect distribution of OSDs for the PGs 
backfilling.  Let's say now that you're averaging closer to 3 OSDs changing per 
PG and that the remaining 17 PGs waiting to backfill are blocked by a few OSDs 
each (because those OSDs are already included in the 8 active backfilling PGs.  
That would indicate that you have closer to 200+ OSDs.

Every time I'm backfilling and want to speed things up, I watch iostat on some 
of my OSDs and increase osd_max_backfills until I'm consistently using about 
70% of the disk to allow for customer overhead.  You can always figure out 
what's best for your use case though.  Generally I've been ok running with 
osd_max_backfills=5 without much problem and bringing that up some when I know 
that client IO will be minimal, but again it depends on your use case and 
cluster.

On Thu, Jul 6, 2017 at 10:08 AM <george.vasilaka...@stfc.ac.uk> wrote:
Hey folks,

We have a cluster that's currently backfilling from increasing PG counts. We 
have tuned recovery and backfill way down as a "precaution" and would like to 
start tuning it to bring up to a good balance between that and client I/O.

At the moment we're in the process of bumping up PG numbers for pools serving 
production workloads. Said pools are EC 8+3.

It looks like we're having very low numbers of PGs backfilling as in:

2567 TB used, 5062 TB / 7630 TB avail
145588/849529410 objects degraded (0.017%)
5177689/849529410 objects misplaced (0.609%)
7309 active+clean
  23 active+clean+scrubbing
  18 active+clean+scrubbing+deep
  13 active+remapped+backfill_wait
   5 active+undersized+degraded+remapped+backfilling
   4 active+undersized+degraded+remapped+backfill_wait
   3 active+remapped+backfilling
   1 active+clean+inconsistent
recovery io 1966 MB/s, 96 objects/s
  client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr

Also, the rate of recovery in terms of data and object throughput varies a lot, 
even with the number of PGs backfilling remaining constant.

Here's the config in the OSDs:

"osd_max_backfills": "1",
"osd_min_recovery_priority": "0",
"osd_backfill_full_ratio": "0.85",
"osd_backfill_retry_interval": "10",
"osd_allow_recovery_below_min_size": "true",
"osd_recovery_threads": "1",
"osd_backfill_scan_min": "16",
"osd_backfill_scan_max": "64",
"osd_recovery_thread_timeout": "30",
"osd_recovery_thread_suicide_timeout": "300",
"osd_recovery_sleep": "0",
"osd_recovery_delay_start": "0",
"osd_recover

[ceph-users] Removing very large buckets

2017-07-06 Thread Eric Beerman
Hello,

We have a bucket that has 60 million + objects in it, and are trying to delete 
it. To do so, we have tried doing:

radosgw-admin bucket list --bucket=

and then cycling through the list of object names and deleting them, 1,000 at a 
time. However, after ~3-4k objects deleted, the list call stops working and 
just hangs. We have also noticed slow requests for the cluster most of the time 
after running that command when it hangs. We know there is also a 
"radosgw-admin bucket rm --bucket= --purge-objects" command, but we are 
nervous that this will cause slowness in the cluster as well, since the listing 
did - or that it might not work at all, considering the list didn't work.
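
A simplified sketch of that kind of batched deletion; the --max-entries flag and the jq-based name extraction are assumptions about the hammer-era radosgw-admin output, and a real loop would also have to carry the listing marker forward:

radosgw-admin bucket list --bucket=<bucket-name> --max-entries=1000 > batch.json
# pull the object names out of the JSON and remove them one by one
jq -r '.[].name' batch.json | while read -r obj; do
    radosgw-admin object rm --bucket=<bucket-name> --object="$obj"
done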

We are running Ceph version 0.94.3, and there is no bucket sharding on the 
index.

What is the recommended way to delete a large bucket like that in production, 
without incurring any downtime/slow requests?

Thanks,
- Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding storage to exiting clusters with minimal impact

2017-07-06 Thread Brian Andrus
On Thu, Jul 6, 2017 at 9:18 AM, Gregory Farnum  wrote:

> On Thu, Jul 6, 2017 at 7:04 AM  wrote:
>
>> Hi Ceph Users,
>>
>>
>>
>> We plan to add 20 storage nodes to our existing cluster of 40 nodes, each
>> node has 36 x 5.458 TiB drives. We plan to add the storage such that all
>> new OSDs are prepared, activated and ready to take data but not until we
>> start slowly increasing their weightings. We also expect this not to cause
>> any backfilling before we adjust the weightings.
>>
>>
>>
>> When testing the deployment on our development cluster, adding a new OSD
>> to the host bucket with a crush weight of 5.458 and an OSD reweight of 0
>> (we have set “noin”) causes the acting sets of a few pools to change, thus
>> triggering backfilling. Interestingly, none of the pool backfilling have
>> the new OSD in their acting set.
>>
>>
>>
>> This was not what we expected, so I have to ask, is what we are trying to
>> achieve possible and how best we should go about doing it.
>>
>
> Yeah, there's an understandable but unfortunate bit where when you add a
> new CRUSH device/bucket to a CRUSH bucket (so, a new disk to a host, or a
> new host to a rack) you change the overall weight of that bucket (the host
> or rack). So even though the new OSD might be added with a *reweight* of
> zero, it has a "real" weight of 5.458 and so a little bit more data is
> mapped into the host/rack, even though none gets directed to the new disk
> until you set its reweight value up.
>
> As you note below, if you add the disks with a weight of zero that doesn't
> happen, so you can try doing that and weighting them up gradually.
> -Greg
>

This works well for us: adding in OSDs with a crush weight of 0 (osd crush
initial weight = 0) and slowly crush-weighting them in while the reweight
remains at 1. This should also result in less overall data movement if that
is a concern, as sketched below.
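
A compact sketch of that workflow (weights and step size are illustrative):

# ceph.conf on the nodes being expanded, before creating the OSDs
[osd]
osd crush initial weight = 0

# once the new OSDs are up and in, walk each one up gradually,
# waiting for backfill to settle between steps
ceph osd crush reweight osd.44 1.0
ceph osd crush reweight osd.44 2.5
ceph osd crush reweight osd.44 5.458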



>
>>
>> Commands run:
>>
>> ceph osd crush add osd.43 0 host=ceph-sn833 - causes no backfilling
>>
>> ceph osd crush add osd.44 5.458 host=ceph-sn833 - does cause backfilling
>>
>>
>>
>> For multiple hosts and OSDs, we plan to prepare a new crushmap and inject
>> that into the cluster.
>>
>>
>>
>> Best wishes,
>>
>> Bruno
>>
>>
>>
>>
>>
>> Bruno Canning
>>
>> LHC Data Store System Administrator
>>
>> Scientific Computing Department
>>
>> STFC Rutherford Appleton Laboratory
>>
>> Harwell Oxford
>>
>> Didcot
>>
>> OX11 0QX
>>
>> Tel. +44 ((0)1235) 446621
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd journal support

2017-07-06 Thread Maged Mokhtar
Hi all,

Are there any plans to support rbd journal feature in kernel krbd ?

Cheers /Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Watch for fstrim running on your Ubuntu systems

2017-07-06 Thread Reed Dier
Hi Wido,

I came across this ancient ML entry with no responses and wanted to follow up 
with you to see if you recalled any solution to this.
Copying the ceph-users list to preserve any replies that may result for 
archival.

I have a couple of boxes with 10x Micron 5100 SATA SSD’s, journaled on Micron 
9100 NVMe SSD’s; ceph 10.2.7; Ubuntu 16.04 4.8 kernel.

I have noticed now twice that I’ve had SSD’s flapping due to the fstrim eating 
up the io 100%.
It eventually righted itself after a little less than 8 hours.
Noout flag was set, so it didn’t create any unnecessary rebalance or whatnot.

Timeline showing that only 1 OSD ever went down at a time, but they seemed to 
go down in a rolling fashion during the fstrim session.
You can actually see in the OSD graph all 10 OSD’s on this node go down 1 by 1 
over time.


And the OSD’s were going down because of:

> 2017-07-02 13:47:32.618752 7ff612721700  1 heartbeat_map is_healthy 
> 'OSD::osd_op_tp thread 0x7ff5ecd0c700' had timed out after 15
> 2017-07-02 13:47:32.618757 7ff612721700  1 heartbeat_map is_healthy 
> 'FileStore::op_tp thread 0x7ff608d9e700' had timed out after 60
> 2017-07-02 13:47:32.618760 7ff612721700  1 heartbeat_map is_healthy 
> 'FileStore::op_tp thread 0x7ff608d9e700' had suicide timed out after 180
> 2017-07-02 13:47:32.624567 7ff612721700 -1 common/HeartbeatMap.cc: In 
> function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, 
> const char*, time_t)' thread 7ff612721700 time 2017-07-02 13:47:32.618784
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")


I am curious if you were able to nice it or something similar to mitigate this 
issue?
Oddly, I have similar machines with Samsung SM863a’s with Intel P3700 journals 
that do not appear to be affected by the fstrim load issue despite identical 
weekly cron jobs enabled. Only the Micron drives (newer) have had these issues.

Appreciate any pointers,

Reed

> Wido den Hollander wido at 42on.com  
> 
> Tue Dec 9 01:21:16 PST 2014
> Hi,
> 
> Last sunday I got a call early in the morning that a Ceph cluster was
> having some issues. Slow requests and OSDs marking each other down.
> 
> Since this is a 100% SSD cluster I was a bit confused and started
> investigating.
> 
> It took me about 15 minutes to see that fstrim was running and was
> utilizing the SSDs 100%.
> 
> On Ubuntu 14.04 there is a weekly CRON which executes fstrim-all. It
> detects all mountpoints which can be trimmed and starts to trim those.
> 
> On the Intel SSDs used here it caused them to become 100% busy for a
> couple of minutes. That was enough for them to no longer respond on
> heartbeats, thus timing out and being marked down.
> 
> Luckily we had the "out interval" set to 1800 seconds on that cluster,
> so no OSD was marked as "out".
> 
> fstrim-all does not execute fstrim with a ionice priority. From what I
> understand, but haven't tested yet, is that running fstrim with ionice
> -c Idle should solve this.
> 
> It's weird that this issue didn't come up earlier on that cluster, but
> after killing fstrim all problems we resolved and the cluster ran
> happily again.
> 
> So watch out for fstrim on early Sunday mornings on Ubuntu!
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
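
For anyone wanting to keep the weekly trim but take the edge off, a hedged sketch of running fstrim under idle I/O scheduling in place of the stock job (the idle class only has an effect with the CFQ scheduler, and fstrim --all needs a reasonably recent util-linux):

# e.g. a replacement /etc/cron.weekly/fstrim
#!/bin/sh
ionice -c3 nice -n19 fstrim --all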
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-06 Thread Wido den Hollander

> On 6 July 2017 at 18:27, Massimiliano Cuttini wrote:
> 
> 
> WOW!
> 
> Thanks to everybody!
> A ton of suggestions and good tips!
> 
> At the moment we are already using 100Gb/s cards and we have already 
> adopted a 100Gb/s switch, so we can go with 40Gb/s cards that are fully 
> compatible with our switch.
> About the CPU I was wrong, the model that we are looking at is not the 2603 but 
> the 2630, which is quite different.
> Bad mistake!
> 
> This processor has 10 cores at 2.20GHz.
> I think it's the best price/quality ratio from Intel.
> 
> About that, it seems that most of your recommendations go in the 
> direction of fewer cores but much faster clock speed.
> Is this right? So having 10 cores is not as good as having faster ones?

Partially. Make sure you have at least one physical CPU core per OSD.

And then the GHz starts counting if you really want to push IOps, especially 
over NVMe: you will need very fast CPUs to fully utilize those cards.

Wido

> 
> 
> 
> On 05/07/2017 12:51, Wido den Hollander wrote:
> >> On 5 July 2017 at 12:39, c...@jack.fr.eu.org wrote:
> >>
> >>
> >> Beware, a single 10G NIC is easily saturated by a single NVMe device
> >>
> > Yes, it is. But that was what I was pointing at. Bandwidth is usually 
> > not a problem, latency is.
> >
> > Take a look at a Ceph cluster running out there, it is probably doing a lot 
> > of IOps, but not that much bandwidth.
> >
> > A production cluster I took a look at:
> >
> > "client io 405 MB/s rd, 116 MB/s wr, 12211 op/s rd, 13272 op/s wr"
> >
> > This cluster is 15 machines with 10 OSDs (SSD, PM863a) each.
> >
> > So 405/15 = 27MB/sec
> >
> > It's doing 13k IOps now, that increases to 25k during higher load, but the 
> > bandwidth stays below 500MB/sec in TOTAL.
> >
> > So yes, you are right, a NVMe device can sature a single NIC, but most of 
> > the time latency and IOps are what count. Not bandwidth.
> >
> > Wido
> >
> >> On 05/07/2017 11:54, Wido den Hollander wrote:
>  On 5 July 2017 at 11:41, "Van Leeuwen, Robert" wrote:
> 
> 
>  Hi Max,
> 
>  You might also want to look at the PCIE lanes.
>  I am not an expert on the matter but my guess would be the 8 NVME drives 
>  + 2x100Gbit would be too much for
>  the current Xeon generation (40 PCIE lanes) to fully utilize.
> 
> >>> Fair enough, but you might want to think about if you really, really need 
> >>> 100Gbit. Those cards are expensive, and the same goes for the GBICs and switches.
> >>>
> >>> Storage is usually latency bound and not so much bandwidth. Imho a lot of 
> >>> people focus on raw TBs and bandwidth, but in the end IOps and latency 
> >>> are what usually matter.
> >>>
> >>> I'd probably stick with 2x10Gbit for now and use the money I saved on 
> >>> more memory and faster CPUs.
> >>>
> >>> Wido
> >>>
>  I think the upcoming AMD/Intel offerings will improve that quite a bit 
>  so you may want to wait for that.
>  As mentioned earlier. Single Core cpu speed matters for latency so you 
>  probably want to up that.
> 
>  You can also look at the DIMM configuration.
>  TBH I am not sure how much it impacts Ceph performance, but having just 2 
>  DIMM slots populated will not give you max memory bandwidth.
>  Having some extra memory for read-cache probably won’t hurt either 
>  (unless you know your workload won’t include any cacheable reads)
> 
>  Cheers,
>  Robert van Leeuwen
> 
>  From: ceph-users  on behalf of 
>  Massimiliano Cuttini 
>  Organization: PhoenixWeb Srl
>  Date: Wednesday, July 5, 2017 at 10:54 AM
>  To: "ceph-users@lists.ceph.com" 
>  Subject: [ceph-users] New cluster - configuration tips and 
>  reccomendation - NVMe
> 
> 
>  Dear all,
> 
>  Luminous is coming and soon we should be able to avoid double 
>  writes.
>  This means using 100% of the speed of SSDs and NVMe.
>  Clusters made entirely of SSDs and NVMe will not be penalized and will start 
>  to make sense.
> 
>  Looking forward, I'm building the next pool of storage, which we'll set up 
>  next term.
>  We are taking into consideration a pool of 4 with the following single 
>  node configuration:
> 
> *   2x E5-2603 v4 - 6 cores - 1.70GHz
> *   2x 32Gb of RAM
> *   2x NVMe M2 for OS
> *   6x NVMe U2 for OSD
> *   2x 100Gib ethernet cards
> 
>  We are not yet sure about which Intel CPU and how much RAM we should put in 
>  it to avoid a CPU bottleneck.
>  Can you help me choose the right pair of CPUs?
>  Did you see any issue on the configuration proposed?
> 
>  Thanks,
>  Max
>  ___
>  ceph-users mailing list
>  ceph-users@lists.ceph.com
>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> 

Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-06 Thread Massimiliano Cuttini

WOW!

Thanks to everybody!
A tons of suggestion and good tips!

At the moment we are already using 100Gb/s cards and have already 
adopted a 100Gb/s switch, so we can go with 40Gb/s cards that are fully 
compatible with our switch.
About the CPU I was wrong: the model we are looking at is not the 2603 but 
the 2630, which is quite different.

Bad mistake!

This processor has 10 cores at 2.20GHz.
I think it's the best price/quality ratio from Intel.

It seems that most of your recommendations go in the 
direction of fewer cores but much faster clock speed.

Is this right? So having 10 cores is not as good as having fewer, faster ones?



Il 05/07/2017 12:51, Wido den Hollander ha scritto:

Op 5 juli 2017 om 12:39 schreef c...@jack.fr.eu.org:


Beware, a single 10G NIC is easily saturated by a single NVMe device


Yes, it is. But that was what I was pointing at. Bandwidth is usually not a 
problem, latency is.

Take a look at a Ceph cluster running out there, it is probably doing a lot of 
IOps, but not that much bandwidth.

A production cluster I took a look at:

"client io 405 MB/s rd, 116 MB/s wr, 12211 op/s rd, 13272 op/s wr"

This cluster is 15 machines with 10 OSDs (SSD, PM863a) each.

So 405/15 = 27MB/sec

It's doing 13k IOps now, that increases to 25k during higher load, but the 
bandwidth stays below 500MB/sec in TOTAL.

So yes, you are right, an NVMe device can saturate a single NIC, but most of the 
time latency and IOps are what count. Not bandwidth.

Wido


On 05/07/2017 11:54, Wido den Hollander wrote:

Op 5 juli 2017 om 11:41 schreef "Van Leeuwen, Robert" :


Hi Max,

You might also want to look at the PCIE lanes.
I am not an expert on the matter but my guess would be the 8 NVME drives + 
2x100Gbit would be too much for
the current Xeon generation (40 PCIE lanes) to fully utilize.


Fair enough, but you might want to think about if you really, really need 
100Gbit. Those cards are expensive, and the same goes for the GBICs and switches.

Storage is usually latency bound and not so much bandwidth. Imho a lot of 
people focus on raw TBs and bandwidth, but in the end IOps and latency are what 
usually matter.

I'd probably stick with 2x10Gbit for now and use the money I saved on more 
memory and faster CPUs.

Wido


I think the upcoming AMD/Intel offerings will improve that quite a bit so you 
may want to wait for that.
As mentioned earlier. Single Core cpu speed matters for latency so you probably 
want to up that.

You can also look at the DIMM configuration.
TBH I am not sure how much it impacts Ceph performance, but having just 2 DIMM 
slots populated will not give you max memory bandwidth.
Having some extra memory for read-cache probably won’t hurt either (unless you 
know your workload won’t include any cacheable reads)

Cheers,
Robert van Leeuwen

From: ceph-users  on behalf of Massimiliano 
Cuttini 
Organization: PhoenixWeb Srl
Date: Wednesday, July 5, 2017 at 10:54 AM
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] New cluster - configuration tips and reccomendation - NVMe


Dear all,

Luminous is coming and soon we should be able to avoid double writes.
This means using 100% of the speed of SSDs and NVMe.
Clusters made entirely of SSDs and NVMe will not be penalized and will start to make sense.

Looking forward, I'm building the next pool of storage, which we'll set up next 
term.
We are taking into consideration a pool of 4 with the following single node 
configuration:

   *   2x E5-2603 v4 - 6 cores - 1.70GHz
   *   2x 32Gb of RAM
   *   2x NVMe M2 for OS
   *   6x NVMe U2 for OSD
   *   2x 100Gib ethernet cards

We are not yet sure about which Intel CPU and how much RAM we should put in it to 
avoid a CPU bottleneck.
Can you help me choose the right pair of CPUs?
Did you see any issue on the configuration proposed?

Thanks,
Max
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-06 Thread Gregory Farnum
On Tue, Jul 4, 2017 at 10:47 PM Eino Tuominen  wrote:

> ​Hello,
>
>
> I noticed the same behaviour in our cluster.
>
>
> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>
>
>
> cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
>
>  health HEALTH_WARN
>
> 1 pgs backfill_toofull
>
> 4366 pgs backfill_wait
>
> 11 pgs backfilling
>
> 45 pgs degraded
>
> 45 pgs recovery_wait
>
> 45 pgs stuck degraded
>
> 4423 pgs stuck unclean
>
> recovery 181563/302722835 objects degraded (0.060%)
>
> recovery 57192879/302722835 objects misplaced (18.893%)
>
> 1 near full osd(s)
>
> noout,nodeep-scrub flag(s) set
>
>  monmap e3: 3 mons at {0=
> 130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
>
> election epoch 356, quorum 0,1,2 0,1,2
>
>  osdmap e388588: 260 osds: 260 up, 242 in; 4378 remapped pgs
>
> flags nearfull,noout,nodeep-scrub,require_jewel_osds
>
>   pgmap v80658624: 25728 pgs, 8 pools, 202 TB data, 89212 kobjects
>
> 612 TB used, 300 TB / 912 TB avail
>
> 181563/302722835 objects degraded (0.060%)
>
> 57192879/302722835 objects misplaced (18.893%)
>
>21301 active+clean
>
> 4366 active+remapped+wait_backfill
>
>   45 active+recovery_wait+degraded
>
>   11 active+remapped+backfilling
>
>4 active+clean+scrubbing
>
>1 active+remapped+backfill_toofull
>
> recovery io 421 MB/s, 155 objects/s
>
>   client io 201 kB/s rd, 2034 B/s wr, 75 op/s rd, 0 op/s wr
>
> I'm currently doing a rolling migration from Puppet on Ubuntu to Ansible
> on RHEL, and I started with a healthy cluster, evacuated some nodes by
> setting their weight to 0, removed them from the cluster and re-added them
> with the ansible playbook.
>
> Basically I ran
>
> ceph osd crush remove osd.$num
>
> ceph osd rm $num
>
> ceph auth del osd.$num
>
> in a loop for the osds I was replacing, and then let the ansible ceph-osd
> playbook bring the host back to the cluster. The crushmap is attached.
>

This case is different. If you are removing OSDs before they've had the
chance to offload themselves, objects are going to be degraded since you're
removing a copy! :)
-Greg


> ​
> --
>   Eino Tuominen
>
>
> --
> *From:* ceph-users  on behalf of
> Gregory Farnum 
> *Sent:* Friday, June 30, 2017 23:38
> *To:* Andras Pataki; ceph-users
> *Subject:* Re: [ceph-users] Degraded objects while OSD is being
> added/filled
>
> On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <
> apat...@flatironinstitute.org> wrote:
>
>> Hi cephers,
>>
>> I noticed something I don't understand about ceph's behavior when adding
>> an OSD.  When I start with a clean cluster (all PG's active+clean) and add
>> an OSD (via ceph-deploy for example), the crush map gets updated and PGs
>> get reassigned to different OSDs, and the new OSD starts getting filled
>> with data.  As the new OSD gets filled, I start seeing PGs in degraded
>> states.  Here is an example:
>>
>>   pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
>> 3164 TB used, 781 TB / 3946 TB avail
>> *8017/994261437 objects degraded (0.001%)*
>> 2220581/994261437 objects misplaced (0.223%)
>>42393 active+clean
>>   91 active+remapped+wait_backfill
>>9 active+clean+scrubbing+deep
>> *   1 active+recovery_wait+degraded*
>>1 active+clean+scrubbing
>>1 active+remapped+backfilling
>>
>>
>> Any ideas why there would be any persistent degradation in the cluster
>> while the newly added drive is being filled?  It takes perhaps a day or two
>> to fill the drive - and during all this time the cluster seems to be
>> running degraded.  As data is written to the cluster, the number of
>> degraded objects increases over time.  Once the newly added OSD is filled,
>> the cluster comes back to clean again.
>>
>> Here is the PG that is degraded in this picture:
>>
>> 7.87c10200419430477
>> active+recovery_wait+degraded2017-06-20 14:12:44.119921344610'7
>> 583572:2797[402,521]402[402,521]402344610'7
>> 2017-06-16 06:04:55.822503344610'72017-06-16 06:04:55.822503
>>
>> The newly added osd here is 521.  Before it got added, this PG had two
>> replicas clean, but one got forgotten somehow?
>>
>
> This sounds a bit concerning at first glance. Can you provide some output
> of exactly what commands you're invoking, and the "ceph -s" output as it
> changes in response?
>
> I really don't see how adding a new OSD can result in it "forgetting"
> about existing valid copies — it's definitely not supposed to — so I wonder
> if there's a collision in how it's 

Re: [ceph-users] Adding storage to exiting clusters with minimal impact

2017-07-06 Thread Gregory Farnum
On Thu, Jul 6, 2017 at 7:04 AM  wrote:

> Hi Ceph Users,
>
>
>
> We plan to add 20 storage nodes to our existing cluster of 40 nodes; each
> node has 36 x 5.458 TiB drives. We plan to add the storage such that all
> new OSDs are prepared, activated and ready to take data, but taking no data
> until we start slowly increasing their weightings. We also expect this not
> to cause any backfilling before we adjust the weightings.
>
>
>
> When testing the deployment on our development cluster, adding a new OSD
> to the host bucket with a crush weight of 5.458 and an OSD reweight of 0
> (we have set “noin”) causes the acting sets of a few pools to change, thus
> triggering backfilling. Interestingly, none of the pools backfilling have
> the new OSD in their acting set.
>
>
>
> This was not what we expected, so I have to ask: is what we are trying to
> achieve possible, and how best should we go about doing it?
>

Yeah, there's an understandable but unfortunate bit where when you add a
new CRUSH device/bucket to a CRUSH bucket (so, a new disk to a host, or a
new host to a rack) you change the overall weight of that bucket (the host
or rack). So even though the new OSD might be added with a *reweight* of
zero, it has a "real" weight of 5.458 and so a little bit more data is
mapped into the host/rack, even though none gets directed to the new disk
until you set its reweight value up.

As you note below, if you add the disks with a weight of zero that doesn't
happen, so you can try doing that and weighting them up gradually.
-Greg


>
> Commands run:
>
> ceph osd crush add osd.43 0 host=ceph-sn833 - causes no backfilling
>
> ceph osd crush add osd.44 5.458 host=ceph-sn833 - does cause backfilling
>
>
>
> For multiple hosts and OSDs, we plan to prepare a new crushmap and inject
> that into the cluster.
>
>
>
> Best wishes,
>
> Bruno
>
>
>
>
>
> Bruno Canning
>
> LHC Data Store System Administrator
>
> Scientific Computing Department
>
> STFC Rutherford Appleton Laboratory
>
> Harwell Oxford
>
> Didcot
>
> OX11 0QX
>
> Tel. +44 ((0)1235) 446621
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

On 17-07-06 04:40 PM, Jason Dillaman wrote:

On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek  wrote:

So I really see two problems here: lack of API docs and
backwards-incompatible change in API behavior.


Docs are always in need of update, so any pull requests would be
greatly appreciated.

However, I disagree that the behavior has substantively changed -- it
was always possible for pre-Luminous to (sometimes) copy the buffer
before the "rbd_aio_write" method completed.


But that copy was buried somewhere deep in the librbd internals and - 
looking at the Jewel version - most would assume that it's not really copied and 
that the user is responsible for keeping the buffer intact until the write is 
complete. An API user doesn't really care about what's going on internally; it is 
beyond their control.



With Luminous, this
behavior is more consistent -- but in a future release memory may be
zero-copied. If your application can properly conform to the
(unwritten) contract that the buffers should remain unchanged, there
would be no need for the application to pre-copy the buffers.


So far I am forced to do a copy anyway (see below). The question is whether 
it's me doing it, or librbd. It doesn't make sense to have both of us do the 
same -- especially if it's going to handle tens of terabytes of data, which 
for 10TB of data could mean at least 83 886 080 memory allocations, releases 
and copies plus 2 684 354 560 page faults (assuming 4KB pages) -- and these 
are the best case scenario numbers, assuming a 128KB I/O size. What I 
understand you expect from me is to have the number of memory copies at least 
doubled, and to push not "just" 20TB over the memory bus (reading 10TB 
from one buffer and writing those 10TB to another), but 40TB.
In other words, if I wrote my code considering how Jewel librbd works, 
there would be no real issue, apart from the fact that suddenly my program 
would consume more memory and burn more CPU cycles once librbd is 
upgraded to Luminous which, considering the amount of data, would be a 
noticeable change.



If the libfuse implementation requires that the memory is not-in-use
by the time you return control to it (i.e. it's a synchronous API and
you are using async methods), you will always need to copy it.
Yes, libfuse expects that once I leave the entry point, it is free to do anything 
it wishes with previously provided buffers -- and that's what it actually does.


> The C++
> API allows you to control the copying since you need to pass
> "bufferlist"s to the API methods and since they utilize a reference
> counter, there is no internal copying within librbd / librados.

How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy 
the buffer, with the assumption that it won't change) and, instead of 
constructing a bufferlist containing a bufferptr to copied data, construct a 
bufferlist containing a bufferptr made with create_static(user_buffer)?
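
For the sake of discussion, this is roughly what it looks like from the API
user's side with the current C++ API (just a sketch, assuming the usual
signatures from rbd/librbd.hpp and the buffer helpers shipped with the
librados headers; the function and variable names below are mine):

#include <rbd/librbd.hpp>

// Sketch: wrap a caller-owned buffer into a bufferlist without copying it.
// 'buf' must stay valid and unchanged until the completion fires.
int write_wrapping_user_buffer(librbd::Image &image, uint64_t off,
                               char *buf, size_t len)
{
  ceph::bufferlist bl;
  // create_static() only wraps 'buf' instead of allocating and copying it
  bl.append(ceph::bufferptr(ceph::buffer::create_static(len, buf)));

  librbd::RBD::AioCompletion *c = new librbd::RBD::AioCompletion(NULL, NULL);
  int r = image.aio_write(off, len, bl, c);
  if (r < 0) {
    c->release();
    return r;
  }
  c->wait_for_complete();    // only after this may 'buf' be reused or freed
  r = c->get_return_value();
  c->release();
  return r;
}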



--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-06 Thread David Turner
Just a quick place to start is osd_max_backfills.  You have this set to 1.
Each PG is on 11 OSDs.  When you have a PG moving, it is on the original 11
OSDs and the new X number of OSDs that it is going to.  For each of your
PGs that is moving, an OSD can only move 1 at a time (your
osd_max_backfills), and each PG is on 11 + X OSDs.

So, with your cluster: I don't see how many OSDs you have, but you have 25
PGs moving around and 8 of them are actively backfilling.  Assuming you
were only changing 1 OSD per backfill operation, that would mean that you
had at least 96 OSDs ((11+1) * 8).  That would be a perfect distribution of
OSDs for the PGs backfilling.  Let's say now that you're averaging closer
to 3 OSDs changing per PG and that the remaining 17 PGs waiting to backfill
are blocked by a few OSDs each (because those OSDs are already included in
the 8 actively backfilling PGs).  That would indicate that you have closer to
200+ OSDs.

Every time I'm backfilling and want to speed things up, I watch iostat on
some of my OSDs and increase osd_max_backfills until I'm consistently using
about 70% of the disk to allow for customer overhead.  You can always
figure out what's best for your use case though.  Generally I've been ok
running with osd_max_backfills=5 without much problem and bringing that up
some when I know that client IO will be minimal, but again it depends on
your use case and cluster.

On Thu, Jul 6, 2017 at 10:08 AM  wrote:

> Hey folks,
>
> We have a cluster that's currently backfilling from increasing PG counts.
> We have tuned recovery and backfill way down as a "precaution" and would
> like to start tuning it to bring it up to a good balance between that and
> client I/O.
>
> At the moment we're in the process of bumping up PG numbers for pools
> serving production workloads. Said pools are EC 8+3.
>
> It looks like we're having very low numbers of PGs backfilling as in:
>
> 2567 TB used, 5062 TB / 7630 TB avail
> 145588/849529410 objects degraded (0.017%)
> 5177689/849529410 objects misplaced (0.609%)
> 7309 active+clean
>   23 active+clean+scrubbing
>   18 active+clean+scrubbing+deep
>   13 active+remapped+backfill_wait
>5 active+undersized+degraded+remapped+backfilling
>4 active+undersized+degraded+remapped+backfill_wait
>3 active+remapped+backfilling
>1 active+clean+inconsistent
> recovery io 1966 MB/s, 96 objects/s
>   client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr
>
> Also, the rate of recovery in terms of data and object throughput varies a
> lot, even with the number of PGs backfilling remaining constant.
>
> Here's the config in the OSDs:
>
> "osd_max_backfills": "1",
> "osd_min_recovery_priority": "0",
> "osd_backfill_full_ratio": "0.85",
> "osd_backfill_retry_interval": "10",
> "osd_allow_recovery_below_min_size": "true",
> "osd_recovery_threads": "1",
> "osd_backfill_scan_min": "16",
> "osd_backfill_scan_max": "64",
> "osd_recovery_thread_timeout": "30",
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_sleep": "0",
> "osd_recovery_delay_start": "0",
> "osd_recovery_max_active": "5",
> "osd_recovery_max_single_start": "1",
> "osd_recovery_max_chunk": "8388608",
> "osd_recovery_max_omap_entries_per_chunk": "64000",
> "osd_recovery_forget_lost_objects": "false",
> "osd_scrub_during_recovery": "false",
> "osd_kill_backfill_at": "0",
> "osd_debug_skip_full_check_in_backfill_reservation": "false",
> "osd_debug_reject_backfill_probability": "0",
> "osd_recovery_op_priority": "5",
> "osd_recovery_priority": "5",
> "osd_recovery_cost": "20971520",
> "osd_recovery_op_warn_multiple": "16",
>
> What I'm looking for, first of all, is a better understanding of the
> mechanism that schedules the backfilling/recovery work; the end goal is to
> understand how to tune this safely to achieve as close to an optimal
> balance between rate at which recovery and client work is performed.
>
> I'm thinking things like osd_max_backfills,
> osd_backfill_scan_min/osd_backfill_scan_max might be prime candidates for
> tuning.
>
> Any thoughts/insights from the Ceph community will be greatly appreciated,
>
> George
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Jason Dillaman
On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek  wrote:
> So I really see two problems here: lack of API docs and
> backwards-incompatible change in API behavior.

Docs are always in need of update, so any pull requests would be
greatly appreciated.

However, I disagree that the behavior has substantively changed -- it
was always possible for pre-Luminous to (sometimes) copy the buffer
before the "rbd_aio_write" method completed. With Luminous, this
behavior is more consistent -- but in a future release memory may be
zero-copied. If your application can properly conform to the
(unwritten) contract that the buffers should remain unchanged, there
would be no need for the application to pre-copy the buffers.

If the libfuse implementation requires that the memory is not-in-use
by the time you return control to it (i.e. it's a synchronous API and
you are using async methods), you will always need to copy it. The C++
API allows you to control the copying since you need to pass
"bufferlist"s to the API methods and since they utilize a reference
counter, there is no internal copying within librbd / librados.
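
As a rough sketch of that difference (assuming the stock bufferlist API from
the installed librados headers; the two helper names are only illustrative,
and which one you use decides on whose side the single copy happens):

#include <rados/buffer.h>

// Copies the bytes into memory owned by the bufferlist.
void append_by_copy(ceph::bufferlist &bl, const char *data, size_t len)
{
  bl.append(data, len);
}

// Wraps the caller's buffer; the bufferlist keeps a reference-counted pointer
// to it and no bytes are copied, so 'data' must outlive every user of 'bl',
// including the in-flight librbd/librados operation.
void append_by_reference(ceph::bufferlist &bl, char *data, size_t len)
{
  bl.append(ceph::bufferptr(ceph::buffer::create_static(len, data)));
}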

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon leader election problem, should it be improved ?

2017-07-06 Thread Sage Weil
On Thu, 6 Jul 2017, Z Will wrote:
> Hi Joao :
> 
>  Thanks for the thorough analysis. My initial concern is that, I think,
> in some cases a network failure will make a low-rank monitor see few
> siblings (not enough to form a quorum), while some high-rank monitor
> can see more siblings, so I want to try to choose the one who can see
> the most to be the leader, to tolerate the network error to the biggest
> extent, not just to solve the corner case. Yes, you are right.
> This kind of complex network failure is rare to occur. Trying to find
> out who can contact the highest number of monitors can only cover
> some of the situations, and will introduce some other complexities
> and slow things down. This is not good. Blacklisting a problematic
> monitor is a simple and good idea. The implementation in the monitor now is
> like this: no matter which high-rank monitor loses its connection
> with the leader, this lost monitor will constantly try to call a leader
> election, affect its siblings, and then affect the whole cluster.
> Because the leader election procedure is fast, it will be OK for a
> short time, but soon a leader election starts again, and the cluster will
> become unstable. I think the probability of this kind of network error
> is high, yes? So, based on your idea, here is a little change:
> 
>  - send a probe to all monitors
>  - receive acks
>  - After receiving acks, it will know the current quorum and how many
> monitors it can reach.
>    If it can reach the current leader, then it will try to join the
> current quorum.
>    If it cannot reach the current leader, then it will decide
> whether to stand by for a while and try later, or to start a leader
> election, based on the information got from the probing phase.
> 
> Do you think this will be OK ?

I'm worried that even if we can form an initial quorum, we are currently 
very casual about the "call new election" logic.  If a mon is not part of 
the quorum it will currently trigger a new election... and with this 
change it will then not be included in it because it can't reach all mons.  
The logic there will also have to change so that it confirms that it can 
reach a majority of mon peers before requesting a new election.
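
To make the shape of that concrete, a very rough sketch of the decision being
discussed (made-up structures and names, not the actual Elector code):

#include <set>

struct ProbeResult {
  std::set<int> reachable;   // ranks that acked our probe (not counting us)
  int leader_rank;           // current leader if known, otherwise -1
  int monmap_size;           // total number of monitors in the monmap
};

enum Action { JOIN_QUORUM, STAND_BY_AND_RETRY, CALL_ELECTION };

Action decide(const ProbeResult &p)
{
  // If we can reach the current leader, just try to join the existing quorum.
  if (p.leader_rank >= 0 && p.reachable.count(p.leader_rank))
    return JOIN_QUORUM;
  // Only ask for a new election if we could reach a majority ourselves
  // (the extra check discussed above); otherwise back off and retry later.
  if ((int)p.reachable.size() + 1 > p.monmap_size / 2)
    return CALL_ELECTION;
  return STAND_BY_AND_RETRY;
}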

sage


> 
> 
> On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis  wrote:
> > On 07/05/2017 08:01 AM, Z Will wrote:
> >>
> >> Hi Joao:
> >> I think this is all because we choose the monitor with the
> >> smallest rank number to be leader. For this kind of network error, no
> >> matter which mon has lost connection with the  mon who has the
> >> smallest rank num , will be constantly calling an election, that say
> >> ,will constantly affact the cluster until it is stopped by human . So
> >> do you think it make sense if I try to figure out a way to choose the
> >> monitor who can see the most monitors ,  or with  the smallest rank
> >> num if the view num is same , to be leader ?
> >> In probing phase:
> >>they will know their own view, so they can set a view num.
> >> In election phase:
> >>they send the view num , rank num .
> >>when receiving the election message, it compare the view num (
> >> higher is leader ) and rank num ( lower is leader).
> >
> >
> > As I understand it, our elector trades-off reliability in case of network
> > failure for expediency in forming a quorum. This by itself is not a problem
> > since we don't see many real-world cases where this behaviour happens, and
> > we are a lot more interested in making sure we have a quorum - given without
> > a quorum your cluster is effectively unusable.
> >
> > Currently, we form a quorum with a minimal number of messages passed.
> > From my poor recollection, I think the Elector works something like
> >
> > - 1 probe message to each monitor in the monmap
> > - receives defer from a monitor, or defers to a monitor
> > - declares victory if number of defers is an absolute majority (including
> > one's defer).
> >
> > An election cycle takes about 4-5 messages to complete, with roughly two
> > round-trips (in the best case scenario).
> >
> > Figuring out which monitor is able to contact the highest number of
> > monitors, and having said monitor being elected the leader, will necessarily
> > increase the number of messages transferred.
> >
> > A rough idea would be
> >
> > - all monitors will send probes to all other monitors in the monmap;
> > - all monitors need to ack the other's probes;
> > - each monitor will count the number of monitors it can reach, and then send
> > a message proposing itself as the leader to the other monitors, with the
> > list of monitors they see;
> > - each monitor will propose itself as the leader, or defer to some other
> > monitor.
> >
> > This is closer to 3 round-trips.
> >
> > Additionally, we'd have to account for the fact that some monitors may be
> > able to reach all other monitors, while some may only be able to reach a
> > portion. How do we handle this scenario?
> >
> > - What do we do with m

Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

On 17-07-06 03:43 PM, Jason Dillaman wrote:

I've learned the hard way that pre-luminous, even if it copies the buffer,
it does so too late. In my specific case, my FUSE module does enter the
write call and issues rbd_aio_write there, then exits the write - expecting
the buffer provided by FUSE to be copied by librbd (as it happens now in
Luminous). I didn't expect that it's a new behavior and once my code was
deployed to use Jewel librbd, it started to consistently corrupt data during
write.

The correct (POSIX-style) program behavior should treat the buffer as
immutable until the IO operation completes. It is never safe to assume
the buffer can be re-used while the IO is in-flight. You should not
add any logic to assume the buffer is safely copied prior to the
completion of the IO.


Indeed, most systems - not only POSIX ones - supporting asynchronous writes 
expect the buffer to remain unchanged until the write is done. I wasn't sure 
how rbd_aio_write operates and consulted the source, as there are no docs for 
the API itself. That intermediate copy in librbd deceived me -- because if 
librbd copies the data, why should I do the same before calling 
rbd_aio_write? To stress-test the memory bus? So I really see two problems here: 
a lack of API docs and a backwards-incompatible change in API behavior.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-06 Thread george.vasilakakos
Hey folks,

We have a cluster that's currently backfilling after increasing PG counts. We 
have tuned recovery and backfill way down as a "precaution" and would like to 
start tuning it to bring it up to a good balance between that and client I/O.

At the moment we're in the process of bumping up PG numbers for pools serving 
production workloads. Said pools are EC 8+3.

It looks like we're having very low numbers of PGs backfilling as in:

2567 TB used, 5062 TB / 7630 TB avail
145588/849529410 objects degraded (0.017%)
5177689/849529410 objects misplaced (0.609%)
7309 active+clean
  23 active+clean+scrubbing
  18 active+clean+scrubbing+deep
  13 active+remapped+backfill_wait
   5 active+undersized+degraded+remapped+backfilling
   4 active+undersized+degraded+remapped+backfill_wait
   3 active+remapped+backfilling
   1 active+clean+inconsistent
recovery io 1966 MB/s, 96 objects/s
  client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr

Also, the rate of recovery in terms of data and object throughput varies a lot, 
even with the number of PGs backfilling remaining constant.

Here's the config in the OSDs:

"osd_max_backfills": "1",
"osd_min_recovery_priority": "0",
"osd_backfill_full_ratio": "0.85",
"osd_backfill_retry_interval": "10",
"osd_allow_recovery_below_min_size": "true",
"osd_recovery_threads": "1",
"osd_backfill_scan_min": "16",
"osd_backfill_scan_max": "64",
"osd_recovery_thread_timeout": "30",
"osd_recovery_thread_suicide_timeout": "300",
"osd_recovery_sleep": "0",
"osd_recovery_delay_start": "0",
"osd_recovery_max_active": "5",
"osd_recovery_max_single_start": "1",
"osd_recovery_max_chunk": "8388608",
"osd_recovery_max_omap_entries_per_chunk": "64000",
"osd_recovery_forget_lost_objects": "false",
"osd_scrub_during_recovery": "false",
"osd_kill_backfill_at": "0",
"osd_debug_skip_full_check_in_backfill_reservation": "false",
"osd_debug_reject_backfill_probability": "0",
"osd_recovery_op_priority": "5",
"osd_recovery_priority": "5",
"osd_recovery_cost": "20971520",
"osd_recovery_op_warn_multiple": "16",

What I'm looking for, first of all, is a better understanding of the mechanism 
that schedules the backfilling/recovery work; the end goal is to understand how 
to tune this safely to achieve as close to an optimal balance as possible 
between the rates at which recovery and client work are performed.

I'm thinking things like osd_max_backfills, 
osd_backfill_scan_min/osd_backfill_scan_max might be prime candidates for 
tuning.

Any thoughts/insights from the Ceph community will be greatly appreciated,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adding storage to exiting clusters with minimal impact

2017-07-06 Thread bruno.canning
Hi Ceph Users,

We plan to add 20 storage nodes to our existing cluster of 40 nodes; each node 
has 36 x 5.458 TiB drives. We plan to add the storage such that all new OSDs 
are prepared, activated and ready to take data, but taking no data until we 
start slowly increasing their weightings. We also expect this not to cause any 
backfilling before we adjust the weightings.

When testing the deployment on our development cluster, adding a new OSD to the 
host bucket with a crush weight of 5.458 and an OSD reweight of 0 (we have set 
"noin") causes the acting sets of a few pools to change, thus triggering 
backfilling. Interestingly, none of the pools backfilling have the new OSD in 
their acting set.

This was not what we expected, so I have to ask: is what we are trying to 
achieve possible, and how best should we go about doing it?

Commands run:
ceph osd crush add osd.43 0 host=ceph-sn833 - causes no backfilling
ceph osd crush add osd.44 5.458 host=ceph-sn833 - does cause backfilling

For multiple hosts and OSDs, we plan to prepare a new crushmap and inject that 
into the cluster.

Best wishes,
Bruno


Bruno Canning
LHC Data Store System Administrator
Scientific Computing Department
STFC Rutherford Appleton Laboratory
Harwell Oxford
Didcot
OX11 0QX
Tel. +44 ((0)1235) 446621

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Jason Dillaman
The correct (POSIX-style) program behavior should treat the buffer as
immutable until the IO operation completes. It is never safe to assume
the buffer can be re-used while the IO is in-flight. You should not
add any logic to assume the buffer is safely copied prior to the
completion of the IO.
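
Put differently, the usual pattern with the C API is something like the
following (a minimal sketch, using only the standard librbd C calls from
rbd/librbd.h; the wrapper name is illustrative):

#include <rbd/librbd.h>

// The buffer handed to rbd_aio_write may only be reused or freed after the
// completion has fired (or, equivalently, from the completion callback).
int write_and_wait(rbd_image_t image, uint64_t off, const char *buf, size_t len)
{
  rbd_completion_t c;
  int r = rbd_aio_create_completion(NULL, NULL, &c);
  if (r < 0)
    return r;

  r = rbd_aio_write(image, off, len, buf, c);
  if (r < 0) {
    rbd_aio_release(c);
    return r;
  }

  rbd_aio_wait_for_complete(c);   // 'buf' must stay intact up to this point
  r = rbd_aio_get_return_value(c);
  rbd_aio_release(c);
  return r;                       // only now is it safe to touch 'buf' again
}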

On Thu, Jul 6, 2017 at 9:33 AM, Piotr Dałek  wrote:
> On 17-07-06 03:03 PM, Jason Dillaman wrote:
>>
>> On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek 
>> wrote:
>>>
>>> Hi,
>>>
>>> If you're using "rbd_aio_write()" in your code, be aware of the fact that
>>> before Luminous release, this function expects buffer to remain unchanged
>>> until write op ends, and on Luminous and later this function internally
>>> copies the buffer, allocating memory where needed, freeing it once write
>>> is
>>> done.
>
>
>> Pre-Luminous also copies the provided buffer when using the C API --
>> it just copies it at a later point and not immediately. The eventual
>> goal is to eliminate the copy completely, but that requires some
>> additional plumbing work deep down within the librados messenger
>> layer.
>
>
> I've learned the hard way that pre-luminous, even if it copies the buffer,
> it does so too late. In my specific case, my FUSE module does enter the
> write call and issues rbd_aio_write there, then exits the write - expecting
> the buffer provided by FUSE to be copied by librbd (as it happens now in
> Luminous). I didn't expect that it's a new behavior and once my code was
> deployed to use Jewel librbd, it started to consistently corrupt data during
> write.
>
>
> --
> Piotr Dałek
> piotr.da...@corp.ovh.com
> https://www.ovh.com/us/
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

On 17-07-06 03:03 PM, Jason Dillaman wrote:

On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek  wrote:

Hi,

If you're using "rbd_aio_write()" in your code, be aware of the fact that
before Luminous release, this function expects buffer to remain unchanged
until write op ends, and on Luminous and later this function internally
copies the buffer, allocating memory where needed, freeing it once write is
done.



Pre-Luminous also copies the provided buffer when using the C API --
it just copies it at a later point and not immediately. The eventual
goal is to eliminate the copy completely, but that requires some
additional plumbing work deep down within the librados messenger
layer.


I've learned the hard way that pre-luminous, even if it copies the buffer, 
it does so too late. In my specific case, my FUSE module does enter the 
write call and issues rbd_aio_write there, then exits the write - expecting 
the buffer provided by FUSE to be copied by librbd (as it happens now in 
Luminous). I didn't expect that it's a new behavior and once my code was 
deployed to use Jewel librbd, it started to consistently corrupt data during 
write.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Jason Dillaman
Pre-Luminous also copies the provided buffer when using the C API --
it just copies it at a later point and not immediately. The eventual
goal is to eliminate the copy completely, but that requires some
additional plumbing work deep down within the librados messenger
layer.

On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek  wrote:
> Hi,
>
> If you're using "rbd_aio_write()" in your code, be aware of the fact that
> before Luminous release, this function expects buffer to remain unchanged
> until write op ends, and on Luminous and later this function internally
> copies the buffer, allocating memory where needed, freeing it once write is
> done.
>
> If you write an app that may need to work with Luminous *and* pre-Luminous
> versions of librbd, you may want to provide a version check (using
> rbd_version() for example) so either your buffers won't change before write
> is done or you don't incur a penalty for unnecessary memory allocation and
> copy on your side (though it's probably unavoidable with current state of
> Luminous).
>
> --
> Piotr Dałek
> piotr.da...@corp.ovh.com
> https://www.ovh.com/us/
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to force "rbd unmap"

2017-07-06 Thread Ilya Dryomov
On Thu, Jul 6, 2017 at 2:23 PM, Stanislav Kopp  wrote:
> 2017-07-06 14:16 GMT+02:00 Ilya Dryomov :
>> On Thu, Jul 6, 2017 at 1:28 PM, Stanislav Kopp  wrote:
>>> Hi,
>>>
>>> 2017-07-05 20:31 GMT+02:00 Ilya Dryomov :
 On Wed, Jul 5, 2017 at 7:55 PM, Stanislav Kopp  wrote:
> Hello,
>
> I have problem that sometimes I can't unmap rbd device, I get "sysfs
> write failed rbd: unmap failed: (16) Device or resource busy", there
> is no open files and "holders" directory is empty. I saw on the
> mailling list that you can "force" unmapping the device, but I cant
> find how does it work. "man rbd" only mentions "force" as "KERNEL RBD
> (KRBD) OPTION", but "modinfo rbd" doesn't show this option. Did I miss
> something?

 Forcing unmap on an open device is not a good idea.  I'd suggest
 looking into what's holding the device and fixing that instead.
>>>
>>> We use pacemaker's resource agent for rbd mount/unmount
>>> (https://github.com/ceph/ceph/blob/master/src/ocf/rbd.in)
>>> I've reproduced the failure again and now saw in the ps output that there
>>> is still an unmount fs process in D state:
>>>
>>> root 29320  0.0  0.0 21980 1272 ?D09:18   0:00
>>> umount /export/rbd1
>>>
>>> this explains the rbd unmap problem, but strangely enough I don't see this
>>> mount in /proc/mounts, so it looks like it was successfully unmounted.
>>> If I try to strace the "umount" process it hangs (the strace shows no
>>> output); looks like a kernel problem? Do you have some tips for further
>>> debugging?
>>
>> Check /sys/kernel/debug/ceph//osdc.  It lists
>> in-flight requests, that's what umount is blocked on.
>
> I see this in my output, but honestly I don't know what it means:
>
> root@nfs-test01:~# cat
> /sys/kernel/debug/ceph/4f23f683-21e6-49f3-ae2c-c95b150b9dc6.client138566/osdc
> REQUESTS 2 homeless 0
> 658 osd9 0.75514984 [9,1,6]/9 [9,1,6]/9
> rbd_data.6e28c6b8b4567. 0x400024 10'0
> set-alloc-hint,write
> 659 osd15 0.40f1ea02 [15,7,9]/15 [15,7,9]/15
> rbd_data.6e28c6b8b4567.0001 0x400024 10'0
> set-alloc-hint,write

It means you have two pending writes (OSD requests), to osd9 and osd15.
What is the output of

$ ceph -s
$ ceph pg dump pgs_brief

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

Hi,

If you're using "rbd_aio_write()" in your code, be aware of the fact that 
before Luminous release, this function expects buffer to remain unchanged 
until write op ends, and on Luminous and later this function internally 
copies the buffer, allocating memory where needed, freeing it once write is 
done.


If you write an app that may need to work with Luminous *and* pre-Luminous 
versions of librbd, you may want to provide a version check (using 
rbd_version() for example) so that either your buffers won't change before the 
write is done, or you don't incur a penalty for an unnecessary memory allocation 
and copy on your side (though it's probably unavoidable with the current state 
of Luminous).
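
A minimal sketch of such a check (only the rbd_version() call from
rbd/librbd.h is assumed; which exact version triple to compare against is left
to the caller, since it depends on the librbd builds you target):

#include <rbd/librbd.h>

// Returns nonzero if the run-time librbd reports at least the given version.
static int librbd_at_least(int req_major, int req_minor, int req_extra)
{
  int major = 0, minor = 0, extra = 0;
  rbd_version(&major, &minor, &extra);
  if (major != req_major)
    return major > req_major;
  if (minor != req_minor)
    return minor > req_minor;
  return extra >= req_extra;
}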


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to set up bluestore manually?

2017-07-06 Thread Martin Emrich
Hi!

I changed the partitioning scheme to use a "real" primary partition instead of 
a logical volume. Ceph-deploy seems to run fine now, but the OSD does not start.

I see lots of these in the journal:

Jul 06 13:53:42  sh[9768]: 0> 2017-07-06 13:53:42.794027 7fcf9918fb80 -1 *** 
Caught signal (Aborted) **
Jul 06 13:53:42  sh[9768]: in thread 7fcf9918fb80 thread_name:ceph-osd
Jul 06 13:53:42  sh[9768]: ceph version 12.1.0 
(262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
Jul 06 13:53:42  sh[9768]: 1: (()+0x9cd6af) [0x7fcf99b776af]
Jul 06 13:53:42  sh[9768]: 2: (()+0xf370) [0x7fcf967d9370]
Jul 06 13:53:42  sh[9768]: 3: (gsignal()+0x37) [0x7fcf958031d7]
Jul 06 13:53:42  sh[9768]: 4: (abort()+0x148) [0x7fcf958048c8]
Jul 06 13:53:42  sh[9768]: 5: (ceph::__ceph_assert_fail(char const*, char 
const*, int, char const*)+0x284) [0x7fcf99bb5394]
Jul 06 13:53:42  sh[9768]: 6: (BitMapAreaIN::reserve_blocks(long)+0xb6) 
[0x7fcf99b6c486]
Jul 06 13:53:42  sh[9768]: 7: (BitMapAllocator::reserve(unsigned long)+0x80) 
[0x7fcf99b6a240]
Jul 06 13:53:42  sh[9768]: 8: (BlueFS::_allocate(unsigned char, unsigned long, 
std::vector >*)+0xee) [0x7fcf99b31c\
0e]
Jul 06 13:53:42  sh[9768]: 9: 
(BlueFS::_flush_and_sync_log(std::unique_lock&, unsigned long, 
unsigned long)+0xbc4) [0x7fcf99b38be4]
Jul 06 13:53:42  sh[9768]: 10: (BlueFS::sync_metadata()+0x215) [0x7fcf99b3d725]
Jul 06 13:53:42  sh[9768]: 11: (BlueFS::umount()+0x74) [0x7fcf99b3dc44]
Jul 06 13:53:42  sh[9768]: 12: (BlueStore::_open_db(bool)+0x579) 
[0x7fcf99a62859]
Jul 06 13:53:42  sh[9768]: 13: (BlueStore::fsck(bool)+0x39b) [0x7fcf99a9581b]
Jul 06 13:53:42  sh[9768]: 14: (BlueStore::mkfs()+0x1168) [0x7fcf99a6d118]
Jul 06 13:53:42  sh[9768]: 15: (OSD::mkfs(CephContext*, ObjectStore*, 
std::string const&, uuid_d, int)+0x29b) [0x7fcf9964b75b]
Jul 06 13:53:42  sh[9768]: 16: (main()+0xf83) [0x7fcf99590573]
Jul 06 13:53:42  sh[9768]: 17: (__libc_start_main()+0xf5) [0x7fcf957efb35]
Jul 06 13:53:42  sh[9768]: 18: (()+0x4826e6) [0x7fcf9962c6e6]
Jul 06 13:53:42  sh[9768]: NOTE: a copy of the executable, or `objdump -rdS 
` is needed to interpret this.
Jul 06 13:53:42  sh[9768]: Traceback (most recent call last):
Jul 06 13:53:42  sh[9768]: File "/usr/sbin/ceph-disk", line 9, in 
Jul 06 13:53:42  sh[9768]: load_entry_point('ceph-disk==1.0.0', 
'console_scripts', 'ceph-disk')()


Also interesting is the message "-1 rocksdb: Invalid argument: db: does not 
exist (create_if_missing is false)"... Looks to me as if ceph-deploy did not 
create the RocksDB?

So still no success with bluestore :(

Thanks,

Martin

-Ursprüngliche Nachricht-
Von: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] Im Auftrag von 
Martin Emrich
Gesendet: Dienstag, 4. Juli 2017 22:02
An: Loris Cuoghi ; ceph-users@lists.ceph.com
Betreff: Re: [ceph-users] How to set up bluestore manually?

Hi!

After getting some other stuff done, I finally got around to continuing here.

I set up a whole new cluster with ceph-deploy, but adding the first OSD fails:

ceph-deploy osd create --bluestore ${HOST}:/dev/sdc --block-wal 
/dev/cl/ceph-waldb-sdc --block-db /dev/cl/ceph-waldb-sdc .
.
.
[WARNIN] get_partition_dev: Try 9/10 : partition 1 for /dev/cl/ceph-waldb-sdc 
does not exist in /sys/block/dm-2  [WARNIN] get_dm_uuid: get_dm_uuid 
/dev/cl/ceph-waldb-sdc uuid path is /sys/dev/block/253:2/dm/uuid  [WARNIN] 
get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid is 
LVM-2r0bGcoyMB0VnWeGGS77eOD5IOu8wAPN3wPX4OWSS1XGkYZYoziXhfAFMjJf4FJR
 [WARNIN]
 [WARNIN] get_partition_dev: Try 10/10 : partition 1 for /dev/cl/ceph-waldb-sdc 
does not exist in /sys/block/dm-2  [WARNIN] get_dm_uuid: get_dm_uuid 
/dev/cl/ceph-waldb-sdc uuid path is /sys/dev/block/253:2/dm/uuid  [WARNIN] 
get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid is 
LVM-2r0bGcoyMB0VnWeGGS77eOD5IOu8wAPN3wPX4OWSS1XGkYZYoziXhfAFMjJf4FJR
 [WARNIN]
 [WARNIN] Traceback (most recent call last):
 [WARNIN]   File "/usr/sbin/ceph-disk", line 9, in 
 [WARNIN] load_entry_point('ceph-disk==1.0.0', 'console_scripts', 
'ceph-disk')()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
5687, in run
 [WARNIN] main(sys.argv[1:])
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
5638, in main
 [WARNIN] args.func(args)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2004, in main
 [WARNIN] Prepare.factory(args).prepare()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
1993, in prepare
 [WARNIN] self._prepare()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2074, in _prepare
 [WARNIN] self.data.prepare(*to_prepare_list)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2807, in prepare
 [WARNIN] self.prepare_device(*to_prepare_list)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2983, in prepare_device
 [WARNIN] to_prepare.prepare()
 [WARNIN]   File "/

Re: [ceph-users] How to force "rbd unmap"

2017-07-06 Thread Ilya Dryomov
On Thu, Jul 6, 2017 at 1:28 PM, Stanislav Kopp  wrote:
> Hi,
>
> 2017-07-05 20:31 GMT+02:00 Ilya Dryomov :
>> On Wed, Jul 5, 2017 at 7:55 PM, Stanislav Kopp  wrote:
>>> Hello,
>>>
>>> I have problem that sometimes I can't unmap rbd device, I get "sysfs
>>> write failed rbd: unmap failed: (16) Device or resource busy", there
>>> is no open files and "holders" directory is empty. I saw on the
>>> mailling list that you can "force" unmapping the device, but I cant
>>> find how does it work. "man rbd" only mentions "force" as "KERNEL RBD
>>> (KRBD) OPTION", but "modinfo rbd" doesn't show this option. Did I miss
>>> something?
>>
>> Forcing unmap on an open device is not a good idea.  I'd suggest
>> looking into what's holding the device and fixing that instead.
>
> We use pacemaker's resource agent for rbd mount/unmount
> (https://github.com/ceph/ceph/blob/master/src/ocf/rbd.in)
> I've reproduced the failure again and now saw in the ps output that there
> is still an unmount fs process in D state:
>
> root 29320  0.0  0.0  21980  1272 ?D09:18   0:00
> umount /export/rbd1
>
> this explains the rbd unmap problem, but strangely enough I don't see this
> mount in /proc/mounts, so it looks like it was successfully unmounted.
> If I try to strace the "umount" process it hangs (the strace shows no
> output); looks like a kernel problem? Do you have some tips for further
> debugging?

Check /sys/kernel/debug/ceph//osdc.  It lists
in-flight requests, that's what umount is blocked on.

>
>
>> Did you see http://tracker.ceph.com/issues/12763?
>
> yes, I saw it, but we don't use "multipath" so I thought this is not
> relevant for us, am I wrong?
>
>>>
>>> As client where rbd is mapped I use Debian stretch with kernel 4.9,
>>> ceph cluster is on version 11.2.
>>
>> rbd unmap -o force $DEV
>
> thanks, tried but it hung too, I must to fix the root cause with fs
> unmount it seems.

Yeah, -o force makes unmap ignore the open count, but doesn't abort
pending I/O.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Full Ratio Luminous - Unset

2017-07-06 Thread Ashley Merrick
Anyone have some feedback on this? Happy to log a bug ticket if it is one, but 
want to make sure not missing something Luminous change related.

,Ashley

Sent from my iPhone

On 4 Jul 2017, at 3:30 PM, Ashley Merrick 
mailto:ash...@amerrick.co.uk>> wrote:


Okie, noticed there is a new command to set these.


Tried these and they are still showing as 0, with the full ratio out of order error: "ceph 
osd set-{full,nearfull,backfillfull}-ratio"


,Ashley


From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Ashley Merrick 
mailto:ash...@amerrick.co.uk>>
Sent: 04 July 2017 05:55:10
To: ceph-us...@ceph.com
Subject: [ceph-users] OSD Full Ratio Luminous - Unset


Hello,


On a Luminous upgraded from Jewel I am seeing the following in ceph -s  : "Full 
ratio(s) out of order"


and


ceph pg dump | head
dumped all
version 44281
stamp 2017-07-04 05:52:08.337258
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0

I have tried to inject the values, however it has no effect; these were 
previously non-zero values and the issue only showed after running "ceph osd 
require-osd-release luminous"


Thanks,

Ashley

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon leader election problem, should it be improved ?

2017-07-06 Thread Z Will
Hi Joao :

 Thanks for the thorough analysis. My initial concern is that, I think,
in some cases a network failure will make a low-rank monitor see few
siblings (not enough to form a quorum), while some high-rank monitor
can see more siblings, so I want to try to choose the one who can see
the most to be the leader, to tolerate the network error to the biggest
extent, not just to solve the corner case. Yes, you are right.
This kind of complex network failure is rare to occur. Trying to find
out who can contact the highest number of monitors can only cover
some of the situations, and will introduce some other complexities
and slow things down. This is not good. Blacklisting a problematic
monitor is a simple and good idea. The implementation in the monitor now is
like this: no matter which high-rank monitor loses its connection
with the leader, this lost monitor will constantly try to call a leader
election, affect its siblings, and then affect the whole cluster.
Because the leader election procedure is fast, it will be OK for a
short time, but soon a leader election starts again, and the cluster will
become unstable. I think the probability of this kind of network error
is high, yes? So, based on your idea, here is a little change:

 - send a probe to all monitors
 - receive acks
 - After receiving acks, it will know the current quorum and how many
monitors it can reach.
   If it can reach the current leader, then it will try to join the
current quorum.
   If it cannot reach the current leader, then it will decide
whether to stand by for a while and try later, or to start a leader
election, based on the information got from the probing phase.

Do you think this will be OK ?


On Wed, Jul 5, 2017 at 6:26 PM, Joao Eduardo Luis  wrote:
> On 07/05/2017 08:01 AM, Z Will wrote:
>>
>> Hi Joao:
>> I think this is all because we choose the monitor with the
>> smallest rank number to be leader. For this kind of network error, no
>> matter which mon has lost connection with the  mon who has the
>> smallest rank num , will be constantly calling an election, that say
>> ,will constantly affact the cluster until it is stopped by human . So
>> do you think it make sense if I try to figure out a way to choose the
>> monitor who can see the most monitors ,  or with  the smallest rank
>> num if the view num is same , to be leader ?
>> In probing phase:
>>they will know their own view, so they can set a view num.
>> In election phase:
>>they send the view num , rank num .
>>when receiving the election message, it compare the view num (
>> higher is leader ) and rank num ( lower is leader).
>
>
> As I understand it, our elector trades-off reliability in case of network
> failure for expediency in forming a quorum. This by itself is not a problem
> since we don't see many real-world cases where this behaviour happens, and
> we are a lot more interested in making sure we have a quorum - given without
> a quorum your cluster is effectively unusable.
>
> Currently, we form a quorum with a minimal number of messages passed.
> From my poor recollection, I think the Elector works something like
>
> - 1 probe message to each monitor in the monmap
> - receives defer from a monitor, or defers to a monitor
> - declares victory if number of defers is an absolute majority (including
> one's defer).
>
> An election cycle takes about 4-5 messages to complete, with roughly two
> round-trips (in the best case scenario).
>
> Figuring out which monitor is able to contact the highest number of
> monitors, and having said monitor being elected the leader, will necessarily
> increase the number of messages transferred.
>
> A rough idea would be
>
> - all monitors will send probes to all other monitors in the monmap;
> - all monitors need to ack the other's probes;
> - each monitor will count the number of monitors it can reach, and then send
> a message proposing itself as the leader to the other monitors, with the
> list of monitors they see;
> - each monitor will propose itself as the leader, or defer to some other
> monitor.
>
> This is closer to 3 round-trips.
>
> Additionally, we'd have to account for the fact that some monitors may be
> able to reach all other monitors, while some may only be able to reach a
> portion. How do we handle this scenario?
>
> - What do we do with monitors that do not reach all other monitors?
> - Do we ignore them for electoral purposes?
> - Are they part of the final quorum?
> - What if we need those monitors to form a quorum?
>
> Personally, I think the easiest solution to this problem would be
> blacklisting a problematic monitor (for a given amount a time, or until a
> new election is needed due to loss of quorum, or by human intervention).
>
> For example, if a monitor believes it should be the leader, and if all other
> monitors are deferring to someone else that is not reachable, the monitor
> could then enter a special case branch:
>
> - send a probe to all monitors
> - receive acks
> - s