Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread Piotr Dałek
One thing you can check is the CPU performance (cpu governor in particular). 
On such light loads I've seen CPUs sitting in low performance mode (slower 
clocks), giving MUCH worse performance results than when tried with heavier 
loads. Try "cpupower monitor" on OSD nodes in a loop and observe the core 
frequencies.
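
A rough sketch of that check, assuming the cpupower tool (from the
linux-tools/kernel-tools packages) is available on the OSD nodes:

  # watch core frequencies and idle states while the fio test runs
  while true; do cpupower monitor; sleep 2; done

  # if cores stay at low clocks, pinning the governor is one thing to try
  cpupower frequency-set -g performance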


On 2019-03-19 3:17 p.m., jes...@krogh.cc wrote:

Hi All.

I'm trying to get my head around which applications we can stretch our Ceph
cluster into. Parallelism works excellently, but baseline throughput
is - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD-daemons have 16GB of cache.

Fio files attached - 4KB random read and 4KB random write - the test file is
"only" 1GB.
In this I ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated .. one backed by SSDs (S4510, 14x1TB)
and one by HDDs (84x10TB).

Network latency from rbd mount to one of the osd-hosts.
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x
    N        Min        Max     Median        Avg     Stddev
x  38    1727.07    2033.66    1954.71  1949.4789  46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x
    N        Min        Max     Median        Avg     Stddev
x  36     400.05     455.26     436.58  433.91417  12.468187

The double (or triple) network penalty of course kicks in and delivers a
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With a 1GB test file I would really expect this to be memory cached in
the OSD/bluestore cache, and thus deliver read IOPS closer to the theoretical
max: 1s/0.108ms => 9.2K IOPS.

Again on the write side - all OSDs are backed by battery-backed write
cache, thus writes should go directly into the memory of the controller ..
still slower than reads - due to having to visit 3 hosts .. but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x
    N        Min        Max     Median        Avg     Stddev
x  38      36.91      118.8      69.14  72.926842   21.75198

This should have the same performance characteristics as the SSDs, as the
writes should be hitting the BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x
    N        Min        Max     Median        Avg     Stddev
x  39      26.18     181.51      48.16  50.574872   24.01572

Same here - should be cached in the bluestore cache, as it is 16GB x 84
OSDs .. with a 1GB test file.

Any thoughts - suggestions - insights ?

Jesper



--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph block - volume with RAID#0

2019-01-31 Thread Piotr Dałek

On 2019-01-31 6:05 a.m., M Ranga Swami Reddy wrote:

My thought was - a Ceph block volume with RAID#0 (meaning I mount several ceph
block volumes to an instance/VM, and there I would like to configure them
as a RAID0 volume).

Just to know, if anyone doing the same as above, if yes what are the
constraints?


Exclusive lock on RBD images will kill any (theoretical) performance gains.
Without exclusive lock, you lose some RBD features.


Plus, using 2+ clients with a single image doesn't sound like a good idea.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: what are the potential risks of mixed cluster and client ms_type

2018-11-18 Thread Piotr Dałek

On 2018-11-19 8:17 a.m., Honggang(Joseph) Yang wrote:

Thank you, but I encountered a problem:
https://tracker.ceph.com/issues/37300

I don't know if this is because of the mixed use of messenger types.


Have you done basic troubleshooting, like checking osd.179's networking?
Usually this means firewall or network hardware issues.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: what are the potential risks of mixed cluster and client ms_type

2018-11-18 Thread Piotr Dałek

On 2018-11-19 5:05 a.m., Honggang(Joseph) Yang wrote:

hello,

Our cluster-side ms_type is async, while the client-side ms_type is
simple. I want to know if this is a proper way to use it, and what the
potential risks are.


None, as long as Ceph doesn't complain about the async messenger being
experimental - both messengers use the same wire protocol.
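
For reference, a minimal sketch of how such a split setup is usually expressed
in ceph.conf (section placement is an assumption - clients read their own
config):

  [global]
  ms type = async        # daemons / cluster side

  [client]
  ms type = simple       # override for librados/librbd clients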


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD image "lightweight snapshots"

2018-08-09 Thread Piotr Dałek

Hello,

At OVH we're heavily utilizing snapshots for our backup system. We think 
there's an interesting optimization opportunity regarding snapshots I'd like 
to discuss here.


The idea is to introduce the concept of a "lightweight" snapshot - such a
snapshot would not contain data, only the information about what has
changed on the image since it was created (so basically only the object map
part of snapshots).


Our backup solution (which seems to be a pretty common practice) is as follows
(a rough CLI sketch follows the list):

1. Create a snapshot of the image we want to back up
2. If there's a previous backup snapshot, export a diff and apply it to the
backup image
3. If there's no older snapshot, just do a full backup of the image
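
A rough CLI sketch of those steps (pool, image and snapshot names are made up;
the backup cluster is assumed to be reachable via --cluster):

  # 1. create today's snapshot
  rbd snap create rbd/vm-disk@backup-20180809

  # 2. incremental case: export the delta between snapshots and apply it remotely
  rbd export-diff --from-snap backup-20180808 rbd/vm-disk@backup-20180809 - \
    | rbd --cluster backup import-diff - rbd/vm-disk

  # 3. first backup: full export/import instead
  rbd export rbd/vm-disk@backup-20180809 - | rbd --cluster backup import - rbd/vm-disk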

This introduces one big issue: it enforces COW snapshots on the image, meaning
that the original image's access latencies and consumed space increase.
"Lightweight" snapshots would remove these inefficiencies - no COW
performance and storage overhead.


At first glance, it seems like it could be implemented as an extension to the
current RBD snapshot system, leaving out the machinery required for
copy-on-write. In theory it could even co-exist with regular snapshots.
Removal of these "lightweight" snapshots would be instant (or near instant).


So what do others think about this?

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to use rados -p rbd cleanup?

2018-07-16 Thread Piotr Dałek

On 18-07-16 01:40 PM, Wido den Hollander wrote:



On 07/15/2018 11:12 AM, Mehmet wrote:

hello guys,

in my production cluster I've many objects like this

"#> rados -p rbd ls | grep 'benchmark'"
... .. .
benchmark_data_inkscope.example.net_32654_object1918
benchmark_data_server_26414_object1990
... .. .

Is it safe to run "rados -p rbd cleanup" or is there any risk for my
images?


the cleanup will require more than just that, as you will need to specify
the benchmark prefix as well.


Yes and no. "rados -p rbd cleanup" will try to locate the benchmark metadata
object and remove only the objects indexed by that metadata. "--prefix" is used
when that metadata is lost or overwritten.
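
Both variants, for illustration (double-check the prefix against
"rados -p rbd ls" output first):

  # let rados find its own benchmark metadata and remove the indexed objects
  rados -p rbd cleanup

  # or remove leftovers explicitly by name prefix when the metadata is gone
  rados -p rbd cleanup --prefix benchmark_data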


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to use rados -p rbd cleanup?

2018-07-16 Thread Piotr Dałek

On 18-07-15 11:12 AM, Mehmet wrote:

hello guys,

in my production cluster I've many objects like this

"#> rados -p rbd ls | grep 'benchmark'"
... .. .
benchmark_data_inkscope.example.net_32654_object1918
benchmark_data_server_26414_object1990
... .. .

Is it safe to run "rados -p rbd cleanup" or is there any risk for my images?


It'll probably fail due to a hostname mismatch (rados bench write produces
objects with the caller's hostname embedded in the object name). Try what Wido
suggested to clean up all benchmark-made objects.

Otherwise yes, it's safe, as objects for rbd images are named differently.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for data drives

2018-07-11 Thread Piotr Dałek

On 18-07-11 02:35 PM, David Blundell wrote:

Hi,

I’m looking at 4TB Intel DC P4510 for data drives running BlueStore with WAL, 
DB and data on the same drives.  Has anyone had any good / bad experiences with 
them?  As Intel’s new data centre NVMe SSD it should be fast and reliable but 
then I would have thought the same about the DC S4600 drives which currently 
seem best to avoid…

David


tl;dr - try to avoid TLC NAND flash at all costs if consistent write 
performance is your target.


Lately I was benchmarking the Intel DC P4500 (not the DC P4510, mind you) and I
easily ran into performance issues. Both the DC P4500 and DC P4510 utilize 3D
TLC NAND flash chips, so you won't get great speeds at very low queue
depths, but what's interesting about the DC P4500 is that it seems to use an SLC
cache that provides fast qd=1 4k random writes, close to 300MB/s (or ~90k IOPS),
while qd=1 4k random reads are from a totally different league (~38MB/s, ~10k
IOPS). What is worse, it's not that difficult to exhaust that SLC cache, and
then your overall write performance drops BADLY. In my case, I was getting
RBD write IOPS varying from 10 to 40k IOPS depending on whether and for how long
the write test was running and how heavy it was.
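
For reference, a minimal fio sketch of the kind of qd=1 test described above
(device path and runtimes are placeholders, the write test is destructive to
data on the device, and a long runtime helps expose SLC cache exhaustion):

  fio --name=qd1-randwrite --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=600 --time_based

  fio --name=qd1-randread --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
      --rw=randread --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based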


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritize recovery over backfilling

2018-06-07 Thread Piotr Dałek

On 18-06-07 12:43 PM, Caspar Smit wrote:

Hi Piotr,

Thanks for your answer! I've set nodown and now it doesn't mark any OSDs as
down anymore :)

Any tips for when everything is recovered/backfilled and I unset the nodown
flag?


When all pgs are reported as active+clean (any scrubbing/deep scrubbing is 
fine).


> Shutdown all activity to the ceph cluster before that moment?

Depends on whether it's actually possible in your case and what load your 
users generate - you have to decide.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritize recovery over backfilling

2018-06-07 Thread Piotr Dałek

On 18-06-06 09:29 PM, Caspar Smit wrote:

Hi all,

We have a Luminous 12.2.2 cluster with 3 nodes, and I recently added a node
to it.


osd-max-backfills is at the default 1, so backfilling didn't go very fast, but
that doesn't matter.


Once it started backfilling everything looked ok:

~300 pgs in backfill_wait
~10 pgs backfilling (~number of new osd's)

But I noticed the degraded objects increasing a lot. I presume a pg that is
in backfill_wait state doesn't accept any new writes anymore? Hence the
increasing number of degraded objects?


So far so good, but once in a while I noticed a random OSD flapping (they come
back up automatically). This isn't because the disk is saturated, but a
driver/controller/kernel incompatibility which 'hangs' the disk for a short
time (scsi abort_task error in syslog). Investigating further, I noticed this
was already the case before the node expansion.

These flapping OSDs result in lots of pg states which are a bit worrying:

              109 active+remapped+backfill_wait
              80  active+undersized+degraded+remapped+backfill_wait
              51  active+recovery_wait+degraded+remapped
              41  active+recovery_wait+degraded
              27  active+recovery_wait+undersized+degraded+remapped
              14  active+undersized+remapped+backfill_wait
              4   active+undersized+degraded+remapped+backfilling

I think the recovery_wait is more important than the backfill_wait, so I'd
like to prioritize these, because the recovery_wait was triggered by the
flapping OSDs.

Furthermore, should the undersized ones get absolute priority, or is that
already the case?


I was thinking about setting "nobackfill" to prioritize recovery instead of 
backfilling.

Would that help in this situation? Or am I making it even worse?

PS. I tried increasing the heartbeat values for the OSDs to no avail; they
still get flagged as down once in a while after a hiccup of the driver.


First of all, use the "nodown" flag so osds won't be marked down automatically,
and unset it once everything backfills/recovers and settles for good -- note
that there might be lingering osd down reports, so unsetting nodown might
cause some of the problematic osds to be instantly marked as down.


Second, since Luminous you can use "ceph pg force-recovery" to ask
particular pgs to recover first, even if there are other pgs waiting to
backfill and/or recover.
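
A short sketch of both suggestions (the pg ids are placeholders):

  # keep flapping osds from being marked down while you sort things out
  ceph osd set nodown
  # ... and once everything is active+clean again:
  ceph osd unset nodown

  # ask specific pgs to jump the queue (Luminous and later)
  ceph pg force-recovery 1.2f
  ceph pg force-backfill 1.30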


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reduced productivity because of slow requests

2018-06-06 Thread Piotr Dałek

On 18-06-06 01:57 PM, Grigory Murashov wrote:

Hello cephers!

I have a luminous 12.2.5 cluster of 3 nodes, 5 OSDs each, with an S3 RGW. All
OSDs are HDD.

I often (about twice a day) have a slow request problem which reduces cluster
efficiency. It can start both at the daily peak and at night time. Doesn't matter.


That's what I have in ceph health detail 
https://avatars.mds.yandex.net/get-pdb/234183/9ba023d0-4352-4235-8826-76b412016e9f/s1200 
[..]
Since it starts at any time, but roughly twice a day and for a fixed period of
time, I assume it could be some recovery or rebalancing operations.

I tried to find something out in the osd logs, but there is nothing about it.

Any thoughts on how to avoid it?


Have you tried disabling scrub and deep scrub?
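
If you want to test that quickly, the flags can be toggled at runtime (remember
to unset them afterwards, scrubbing is still needed for data integrity):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # later:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub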

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a big cluster or several small

2018-05-15 Thread Piotr Dałek

On 18-05-14 06:49 PM, Marc Boisis wrote:


Hi,

Currently we have a 294 OSD (21 hosts / 3 racks) cluster with RBD clients
only, and a single pool (size=3).


We want to divide this cluster into several to minimize the risk in case of 
failure/crash.
For example, a cluster for the mail, another for the file servers, a test 
cluster ...

Do you think it's a good idea ?


If reliability and data availability are your main concern, and you don't
share data between clusters - yes.


Do you have experience feedback on multiple clusters in production on the 
same hardware:

- containers (LXD or Docker)
- multiple clusters on the same host without virtualization (with ceph-deploy
... --cluster ...)

- multiple pools
...

Do you have any advice?


We're using containers to host OSDs, but we don't host multiple clusters on the
same machine (in other words, a single physical machine hosts containers for
one and the same cluster). We're using Ceph for RBD images, so having
multiple clusters isn't a problem for us.


Our main reason for using multiple clusters is that Ceph has a bad
reliability history when scaling up, and even now there are many unresolved
issues (https://tracker.ceph.com/issues/21761 for example), so by
dividing a single, large cluster into a few smaller ones, we reduce the impact
on customers when things go fatally wrong - when one cluster goes down or
its performance is at single-ESDI-drive level due to recovery, the other
clusters - and their users - are unaffected. For us this has already proved
useful in the past.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries

2018-04-25 Thread Piotr Dałek

On 18-04-25 02:29 PM, Marc Schöchlin wrote:

Hello list,

we are trying to integrate a storage repository in xenserver.
(i also describe the problem as a issue in the ceph bugtracker:
https://tracker.ceph.com/issues/23853)

Summary:

The slowness is a real pain for us, because it prevents the xen
storage repository from working efficiently.
Gathering information for XEN pools with hundreds of virtual machines
(using "--format json") would be a real pain...
The high user time consumption and the really huge number of threads
suggest that there is something really inefficient in the "rbd" utility.

So what can I do to make "rbd ls -l" faster, or to get comparable
information regarding the snapshot hierarchy?


Can you run this command with the extra argument
"--rbd_concurrent_management_ops=1" and share the timing of that?


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High apply latency

2018-02-02 Thread Piotr Dałek

On 18-02-02 09:55 AM, Jakub Jaszewski wrote:

Hi,

So I have changed merge & split settings to
filestore_merge_threshold = 40
filestore_split_multiple = 8

and restarted all OSDs, host by host.

Let me ask a question: although the pool default.rgw.buckets.data that was
affected prior to the above change has higher write bandwidth, it is very
random now. Writes are random for other pools (same for EC and replicated
types) too; before the change, writes to replicated pools were much more stable.

Reads from pools look fine and stable.

Is this the result of the mentioned change? Is the PG directory structure
updating, or ...?


The HUGE problem with filestore is that it can't handle a large number of
small objects well. Sure, if the number only grows slowly (the case with RBD
images) then it's probably not that noticeable, but in the case of 31 million
objects that come and go at a random pace, you're going to hit frequent
problems with filestore collections splitting and merging. Pre-Luminous, it
happened on all osds hosting a particular collection at once, and in Luminous
there's "filestore split rand factor" which, according to the docs:


Description:  A random factor added to the split threshold to avoid
  too many filestore splits occurring at once. See
  ``filestore split multiple`` for details.
  This can only be changed for an existing osd offline,
  via ceph-objectstore-tool's apply-layout-settings command.

You may want to try the above as well.
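
A hedged sketch of applying that offline (the osd id is an example, the pool
name is taken from this thread, and the osd has to be stopped first):

  systemctl stop ceph-osd@12
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --op apply-layout-settings --pool default.rgw.buckets.data
  systemctl start ceph-osd@12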

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] formatting bytes and object counts in ceph status ouput

2018-01-02 Thread Piotr Dałek

On 18-01-02 11:43 AM, Jan Fajerski wrote:

Hi lists,
Currently the ceph status output formats all numbers with binary unit 
prefixes, i.e. 1MB equals 1048576 bytes and an object count of 1M equals 
1048576 objects. I received a bug report from a user that printing object 
counts with a base 2 multiplier is confusing (I agree) so I opened a bug and 
https://github.com/ceph/ceph/pull/19117.
In the PR discussion a couple of questions arose that I'd like to get some 
opinions on:
- Should we print binary unit prefixes (MiB, GiB, ...) since that would be 
  technically correct?


+1

- Should counters (like object counts) be formatted with a base 10
multiplier or a multiplier with base 2?


+1

My proposal would be to both use binary unit prefixes and use base 10 
multipliers for counters. I think this aligns with user expectations as well 
as the relevant standard(s?).


Most users expect that non-size counters - like object counts - use base-10
units, and size counters use base-2 units. Ceph's "standard" of using base-2
everywhere was confusing for me as well initially, but I got used to it...
Still, I wouldn't mind if that got sorted out once and for all.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snap trim queue length issues

2017-12-18 Thread Piotr Dałek

On 17-12-15 03:58 PM, Sage Weil wrote:

On Fri, 15 Dec 2017, Piotr Dałek wrote:

On 17-12-14 05:31 PM, David Turner wrote:

I've tracked this in a much more manual way.  I would grab a random subset
[..]

This was all on a Hammer cluster.  The changes to the snap trimming queues
going into the main osd thread made it so that our use case was not viable
on Jewel until changes to Jewel that happened after I left.  It's exciting
that this will actually be a reportable value from the cluster.

Sorry that this story doesn't really answer your question, except to say
that people aware of this problem likely have a work around for it.  However
I'm certain that a lot more clusters are impacted by this than are aware of
it and being able to quickly see that would be beneficial to troubleshooting
problems.  Backporting would be nice.  I run a few Jewel clusters that have
some VM's and it would be nice to see how well the cluster handle snap
trimming.  But they are much less critical on how much snapshots they do.


Thanks for your response, it pretty much confirms what I thought:
- users aware of the issue have their own hacks that don't need to be efficient
or convenient.
- users unaware of the issue are, well, unaware, and at risk of serious service
disruption once disk space is all used up.

Hopefully it'll be convincing enough for the devs. ;)


Your PR looks great!  I commented with a nit on the format of the warning
itself.


I just addressed the comments.


I expect this is trivial to backport to luminous; it will need to be
partially reimplemented for jewel (with some care around the pg_stat_t and
a different check for the jewel-style health checks).


Yeah, that's why I expected some resistance here and asked for comments. I 
really don't mind reimplementing this, it's not a big deal.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snap trim queue length issues

2017-12-15 Thread Piotr Dałek

On 17-12-14 05:31 PM, David Turner wrote:
I've tracked this in a much more manual way.  I would grab a random subset 
[..]


This was all on a Hammer cluster.  The changes to the snap trimming queues 
going into the main osd thread made it so that our use case was not viable 
on Jewel until changes to Jewel that happened after I left.  It's exciting 
that this will actually be a reportable value from the cluster.


Sorry that this story doesn't really answer your question, except to say 
that people aware of this problem likely have a work around for it.  However 
I'm certain that a lot more clusters are impacted by this than are aware of 
it and being able to quickly see that would be beneficial to troubleshooting 
problems.  Backporting would be nice.  I run a few Jewel clusters that have 
some VM's and it would be nice to see how well the cluster handle snap 
trimming.  But they are much less critical on how much snapshots they do.


Thanks for your response, it pretty much confirms what I thought:
- users aware of the issue have their own hacks that don't need to be efficient
or convenient.
- users unaware of the issue are, well, unaware, and at risk of serious service
disruption once disk space is all used up.

Hopefully it'll be convincing enough for the devs. ;)

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Snap trim queue length issues

2017-12-14 Thread Piotr Dałek

Hi,

We recently ran into low disk space issues on our clusters, and it wasn't
because of actual data. On the affected clusters we're hosting VMs and
volumes, so naturally there are snapshots involved. For some time, we
observed increased disk space usage that we couldn't explain, as there was a
discrepancy between what Ceph reported and the actual space used on disks. We
finally found out that snap trim queues were both long and not getting any
shorter, and decreasing snap trim sleep and increasing max concurrent snap
trims helped reverse the trend - we're safe now.
The problem is, we haven't been aware of this issue for some time, and
there's no easy (and fast[1]) way to check it. I made a pull request[2]
that makes snap trim queue lengths available to monitoring tools
and also generates a health warning when things go out of control, so an admin
can act before hell breaks loose.


My question is, how many Jewel users would be interested in such a feature?
There are a lot of changes between Luminous and Jewel, and it's not going to
be a straight backport, but it's not a big patch either, so I won't mind
doing it myself. But having some support from users would be helpful in
pushing this into the next Jewel release.


Thanks!


[1] one of our guys hacked up a bash one-liner that printed out snap trim queue
lengths for all pgs, but a full run takes over an hour to complete on a
cluster with over 20k pgs...

[2] https://github.com/ceph/ceph/pull/19520
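
For reference, the kind of one-liner meant in [1] would look roughly like this
- assuming your release prints a snap_trimq field in "ceph pg <pgid> query"
output (adjust the grep to whatever your version actually dumps); the per-pg
queries are exactly why it is so slow:

  for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do
      echo -n "$pg: "
      ceph pg "$pg" query | grep -m1 snap_trimq || echo "n/a"
  done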

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph.conf tuning ... please comment

2017-12-06 Thread Piotr Dałek

On 17-12-06 07:01 AM, Stefan Kooman wrote:

[osd]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd crush update on start = false
osd heartbeat interval = 1 # default 6
osd mon heartbeat interval = 10# default 30
osd mon report interval min = 1# default 5
osd mon report interval max = 15   # default 120

The osds would almost immediately see a "cut off" to their partner OSDs
in the placement group. By default they wait 6 seconds before sending
their report to the monitors. During our analysis this is exactly the
time the monitors were keeping an election. By tuning all of the above
we could get them to send their reports faster, and by the time the
election process was finished the monitors would handle the reports from
the OSDs and come to the conclusion that a DC is down, flag it down
and allow for normal client IO again.

Of course, stability and data safety is most important to us. So if any
of these settings make you worry please let us know.


Heartbeats, especially in Luminous, are quite heavy bandwidth-wise if you
have a lot of OSDs in the cluster. You may want to keep osd heartbeat interval
at 3 at the lowest, or if that's not acceptable, then at least set "osd
heartbeat min size" to 0.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Single disk per OSD ?

2017-12-01 Thread Piotr Dałek

On 17-12-01 12:23 PM, Maged Mokhtar wrote:

Hi all,

I believe most existing setups use 1 disk per OSD. Is this going to be the
most common setup in the future? With the move to lvm, will this favor the
use of multiple disks per OSD? On the other side I also see nvme vendors
recommending multiple OSDs (2, 4) per disk as disks are getting too fast for
a single OSD process.


Can anyone shed some light/recommendations into this please ?


You don't put more than one OSD on a spinning disk because access times will
kill your performance - they already do [kill your performance], and asking
hdds to do double/triple/quadruple/... duty is only going to make it far
worse. On the other hand, SSD drives have access times so short that
they're most often bottlenecked by SSD users and not the SSD itself, so it makes
perfect sense to put 2-4 OSDs on one SSD.
LVM isn't going to change much in that pattern; it may be easier to set up
RAID0 HDD OSDs, but that's a questionable use case, and OSDs with JBODs under
them are counterproductive (a single disk failure would be caught by Ceph, but
replacing failed drives will be more difficult -- plus, JBOD OSDs
significantly extend the damage area once such an OSD fails).


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk is now deprecated

2017-11-28 Thread Piotr Dałek

On 17-11-28 09:12 AM, Wido den Hollander wrote:



On 27 November 2017 at 14:36, Alfredo Deza <ad...@redhat.com> wrote:


For the upcoming Luminous release (12.2.2), ceph-disk will be
officially in 'deprecated' mode (bug fixes only). A large banner with
deprecation information has been added, which will try to raise
awareness.



As much as I like ceph-volume and the work being done, is it really a good idea 
to use a minor release to deprecate a tool?

Can't we just introduce ceph-volume and deprecate ceph-disk at the release of 
M? Because when you upgrade to 12.2.2 suddenly existing integrations will have 
deprecation warnings being thrown at them while they haven't upgraded to a new 
major version.

As ceph-deploy doesn't support ceph-volume yet, I don't think it's a good idea
to deprecate ceph-disk right now.

How do others feel about this?


Same, although we don't have a *big* problem with this (we haven't upgraded
to Luminous yet, so we can skip to the next point release and move to
ceph-volume together with Luminous). It's still a problem, though - now we
have more of our infrastructure to migrate and test, meaning even more
delays in production upgrades.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restart is required?

2017-11-16 Thread Piotr Dałek

On 17-11-16 05:34 PM, Jaroslaw Owsiewski wrote:
Thanks for your reply and the information. Yes, we are using filestore. Will it
still work in Luminous?


http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/ :

|"filestore merge threshold|

Description:	Min number of files in a subdir before merging into parent 
NOTE: A negative value means to disable subdir merging


"

Will a variable definition like "filestore_merge_threshold = -50" (negative
value) still work? (in Jewel it worked like a charm)


Yes, I don't see any changes to that.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restart is required?

2017-11-16 Thread Piotr Dałek

On 17-11-16 02:44 PM, Jaroslaw Owsiewski wrote:

HI,

What exactly does this message mean:

filestore_split_multiple = '24' (not observed, change may require restart)

This happened after the command:

# ceph tell osd.0 injectargs '--filestore-split-multiple 24'


It means that "filestore split multiple" is not observed for runtime 
changes, meaning that new value will be stored in osd.0 process memory, but 
not used at all.



Do I really need to restart the OSD to make the change take effect?

ceph version 12.2.1 () luminous (stable)


Yes.
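
For completeness, a sketch of making the change stick (the osd id is an
example):

  # ceph.conf on the osd host
  [osd]
  filestore split multiple = 24

  # then restart the daemon so the new value is actually observed
  systemctl restart ceph-osd@0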

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Piotr Dałek

On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:

Hi,

I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu 
1:2.8+dfsg-6+deb9u3

I'm running 3 nodes with 3 monitors and 8 osds on my nodes, all on IPV6.

When I tested the cluster, I detected a strange and severe problem.
On the first node I'm running qemu hosts with a librados disk connection to the
cluster and all 3 monitors mentioned in the connection.

On the second node I stopped the mon and osd with the command

kill -STOP MONPID OSDPID

Within one minute all my qemu hosts on the first node freeze, so they don't
even respond to ping. [..]


Why would you want to *stop* (as in, freeze) a process instead of killing it?
Anyway, with the processes still there, it may take a few minutes before the
cluster realizes that the daemons are stopped and kicks them out of the cluster,
restoring normal behavior (assuming correctly set crush rules).


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Piotr Dałek

On 17-10-25 03:30 PM, Jason Dillaman wrote:

Hmm, hard to say off the top of my head. If you could enable "debug
librbd = 20" logging on the buggy client that owns the lock, create a
new snapshot, and attempt to delete it, it would be interesting to
verify that the image is being properly refreshed.


I'd love to, but that would require us to restart that client - not an 
option. We'll try to reproduce this somehow anyway and let you know if 
something interesting shows up.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Piotr Dałek

On 17-10-25 02:39 PM, Jason Dillaman wrote:

That log is showing that a snap remove request was made from a client
that couldn't acquire the lock to a client that currently owns the
lock. The client that currently owns the lock responded w/ an -ENOENT
error that the snapshot doesn't exist. Depending on the maintenance
operation requested, different errors codes are filtered out to handle
the case where Ceph double (or more) delivers the request message to
the lock owner. Normally this isn't an issue since the local client
pre-checks the image state before sending the RPC message (i.e. snap
remove will first locally ensure the snap exists and respond w/
-ENOENT if it doesn't).

Therefore, in this case, the question is who is this rogue client that
still owns the lock and is responding the a snap remove request but
hasn't refreshed its state to know that the snapshot exists.


Thanks, that makes things clear.

Seems like we have some Cinder instances utilizing Infernalis (9.2.1) librbd.
Are you aware of any bugs in 9.2.x that could cause such behavior? We've seen
this for the first time...


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Piotr Dałek
693 7f752da04700 20 librbd::AioImageRequestWQ: 
clear_require_lock_on_read
2017-10-24 09:50:29.654694 7f752da04700  5 librbd::AioImageRequestWQ: 
unblock_writes: 0x7f7557932f50, num=0
2017-10-24 09:50:29.654697 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 handle_shut_down_exclusive_lock: r=0
2017-10-24 09:50:29.654700 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 send_flush_readahead
2017-10-24 09:50:29.654702 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 handle_flush_readahead: r=0
2017-10-24 09:50:29.654702 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 send_shut_down_cache
2017-10-24 09:50:29.654789 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 handle_shut_down_cache: r=0
2017-10-24 09:50:29.654793 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 send_flush_op_work_queue
2017-10-24 09:50:29.654796 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 handle_flush_op_work_queue: r=0
2017-10-24 09:50:29.654799 7f752da04700 10 librbd::image::CloseRequest: 
0x7f7557939090 handle_flush_image_watcher: r=0
2017-10-24 09:50:29.654812 7f752da04700 10 librbd::ImageState: 
0x7f7557933d90 handle_close: r=0


According to the log above, the exclusive lock code set the error code to
EBUSY, which makes sense considering that the client owns the lock and is still
alive. Then it's translated to EAGAIN, again making sense (the client may go
away at some point and just drop the lock). Then, all of a sudden, that gets
translated to ENOENT, which gets swallowed by the filter in
C_InvokeAsyncRequest::finish(). These two things don't make any sense at all.


So, two questions:
1. Why is it possible to create snapshots but not remove them when an exclusive
lock on the image is taken? (jewel bug?)

2. Why is the error transformed and then ignored?

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A new SSD for journals - everything sucks?

2017-10-11 Thread Piotr Dałek

On 17-10-11 09:50 AM, Josef Zelenka wrote:

Hello everyone,
lately, we've had issues with buying the SSDs that we use for
journaling (Kingston stopped making them) - the Kingston V300 - so we decided to
start using a different model and started researching which one would be the
best price/value for us. We compared five models to check if they are
compatible with our needs - SSDNow V300, HyperX Fury, SSDNow KC400, SSDNow
UV400 and SSDNow A400. The best one is still the V300, with the highest iops
of 59 001. Second best and still usable was the HyperX Fury with 45 000
iops. The other three had terrible results; the max iops we got were around
13 000 with the dsync and direct flags. We also tested Samsung SSDs (the EVO
series) and we got similarly bad results. To get to the root of my question
- I am pretty sure we are not the only ones affected by the V300's death. Is
there anyone else out there with some benchmarking data/knowledge about some
good price/performance SSDs for ceph journaling? I can also share the
complete benchmarking data my coworker made, if someone is interested.


Never, absolutely never pick consumer-grade SSDs for a Ceph cluster, and in
particular - never pick a drive with a low TBW rating for the journal. Ceph is
going to kill it within a few months. Besides, consumer-grade drives are not
optimized for Ceph-like/enterprise workloads, resulting in weird performance
characteristics, like tens of thousands of IOPS for the first few seconds,
then dropping to 1K IOPS (typical for drives with TLC NAND and an SLC NAND
cache), or performing reasonably until some write queue depth is hit, then
degrading badly (underperforming controller), or killing your OSD journals
on power failure (no BBU or capacitors to power the drive while flushing
when the PSU goes down).


You may want to look at this:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
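
That article's test boils down, roughly, to a single-threaded sync write like
the one below (destructive for data on the target device; the device path is a
placeholder). A journal-worthy drive sustains thousands of IOPS here, while
consumer drives often drop to a few hundred:

  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based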

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread Piotr Dałek

On 17-10-04 08:51 AM, lists wrote:

Hi,

Yesterday I chowned our /var/lib/ceph to ceph, to completely finalize our jewel
migration, and noticed something interesting.

After I brought the OSDs I had just chowned back up, the system had some
recovery to do. During that recovery, the system went to HEALTH_ERR for a
short moment:


See below, for consecutive ceph -s outputs:


[..]
root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
 health HEALTH_ERR
    2 pgs are stuck inactive for more than 300 seconds


^^ that.


    761 pgs degraded
    2 pgs recovering
    181 pgs recovery_wait
    2 pgs stuck inactive
    273 pgs stuck unclean
    543 pgs undersized
    recovery 1394085/8384166 objects degraded (16.628%)
    4/24 in osds are down
    noout flag(s) set
 monmap e3: 3 mons at 
{0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}

    election epoch 256, quorum 0,1,2 0,1,2
 osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
    flags noout,sortbitwise,require_jewel_osds
  pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
    32724 GB used, 56656 GB / 89380 GB avail
    1394085/8384166 objects degraded (16.628%)
 543 active+undersized+degraded
 310 active+clean
 181 active+recovery_wait+degraded
  26 active+degraded
  13 active
   9 activating+degraded
   4 activating
   2 active+recovering+degraded
recovery io 133 MB/s, 37 objects/s
  client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
[..]
It was only very brief, but it did worry me a bit. Fortunately, we went
back to the expected HEALTH_WARN very quickly, and everything finished fine,
so I guess there is nothing to worry about.


But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR?

No smart errors, and apply and commit latencies are all within the expected
ranges; the system basically is healthy.


Curious :-)


Since Jewel (AFAIR), when (re)starting OSDs, the pg status is reset to "never
contacted", resulting in "pgs are stuck inactive for more than 300 seconds"
being reported until the osds regain connections between themselves.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-07 Thread Piotr Dałek

On 17-07-06 09:39 PM, Jason Dillaman wrote:

On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek <bra...@predictor.org.pl> wrote:

Is that deep copy an equivalent of what
Jewel librbd did at an unspecified point in time, or an extra one?


It's equivalent / replacement -- not an additional copy. This was
changed to support scatter/gather IO API methods which the latest
version of QEMU now directly utilizes (eliminating the need for a
bounce-buffer copy on every IO).


OK, that makes more sense now.


Once we get that librados issue resolved, that initial librbd IO
buffer copy will be dropped and librbd will become zero-copy for IO
(at least that's the goal). That's why I am recommending that you just
assume normal AIO semantics and not try to optimize for Luminous since
perhaps the next release will have that implementation detail of the
extra copy removed.


Is this:
https://github.com/yuyuyu101/ceph/commit/794b49b5b860c538a349bdadb16bb6ae97ad9c20#commitcomment-15707924
the issue you mention? Because at this point I'm considering switching to the
C++ API and passing a static bufferptr buried in my bufferlist instead of
having the extra copy done by the C API's rbd_aio_write (that way I'd at least
control the allocations).


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

On 17-07-06 04:40 PM, Jason Dillaman wrote:

On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.da...@corp.ovh.com> wrote:

So I really see two problems here: lack of API docs and
backwards-incompatible change in API behavior.


Docs are always in need of update, so any pull requests would be
greatly appreciated.

However, I disagree that the behavior has substantively changed -- it
was always possible for pre-Luminous to (sometimes) copy the buffer
before the "rbd_aio_write" method completed.


But that copy was buried somewhere deep in the librbd internals and - looking
at the Jewel version - most would assume that it's not really copied and the
user is responsible for keeping the buffer intact until the write is complete.
The API user doesn't really care about what's going on internally, as it is
beyond their control.



With Luminous, this
behavior is more consistent -- but in a future release memory may be
zero-copied. If your application can properly conform to the
(unwritten) contract that the buffers should remain unchanged, there
would be no need for the application to pre-copy the buffers.


So far I am forced to do a copy anyway (see below). The question is whether
it's me doing it, or librbd. It doesn't make sense to have both do the
same -- especially if it's going to handle tens of terabytes of data, which
would mean, for 10TB of data, at least 83 886 080 memory allocations, releases
and copies, plus 2 684 354 560 page faults (assuming 4KB pages) -- and these
are the best-case-scenario numbers assuming a 128KB I/O size. What I
understand that you expect from me is to have at least the number of memory
copies doubled and push not "just" 20TB over the memory bus (reading 10TB
from one buffer and writing these 10TB to another), but 40.
In other words, if I wrote my code considering how Jewel librbd works,
there would be no real issue, apart from the fact that suddenly my program
would consume more memory and burn more CPU cycles once librbd is
upgraded to Luminous, which, considering the amount of data, would be a
noticeable change.



If the libfuse implementation requires that the memory is not-in-use
by the time you return control to it (i.e. it's a synchronous API and
you are using async methods), you will always need to copy it.

Yes, libfuse expects that once I leave the entrypoint, it is free to do anything
it wishes with previously provided buffers -- and that's what it actually does.


> The C++
> API allows you to control the copying since you need to pass
> "bufferlist"s to the API methods and since they utilize a reference
> counter, there is no internal copying within librbd / librados.

How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy
the buffer, with the assumption that it won't change) and, instead of
constructing a bufferlist containing a bufferptr to copied data, construct a
bufferlist containing a bufferptr made with create_static(user_buffer)?



--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

On 17-07-06 03:43 PM, Jason Dillaman wrote:

I've learned the hard way that pre-Luminous, even if it copies the buffer,
it does so too late. In my specific case, my FUSE module enters the
write call and issues rbd_aio_write there, then exits the write - expecting
the buffer provided by FUSE to be copied by librbd (as happens now in
Luminous). I didn't expect that this is new behavior, and once my code was
deployed to use Jewel librbd, it started to consistently corrupt data during
writes.

The correct (POSIX-style) program behavior should treat the buffer as
immutable until the IO operation completes. It is never safe to assume
the buffer can be re-used while the IO is in-flight. You should not
add any logic to assume the buffer is safely copied prior to the
completion of the IO.


Indeed, most systems - not only POSIX ones - supporting asynchronous writes
expect the buffer to remain unchanged until the write is done. I wasn't sure
how rbd_aio_write operates and consulted the source, as there are no docs for
the API itself. That intermediate copy in librbd deceived me -- because if
librbd copies the data, why should I do the same before calling
rbd_aio_write? To stress-test the memory bus? So I really see two problems here:
lack of API docs and a backwards-incompatible change in API behavior.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

On 17-07-06 03:03 PM, Jason Dillaman wrote:

On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.da...@corp.ovh.com> wrote:

Hi,

If you're using "rbd_aio_write()" in your code, be aware of the fact that
before Luminous release, this function expects buffer to remain unchanged
until write op ends, and on Luminous and later this function internally
copies the buffer, allocating memory where needed, freeing it once write is
done.



Pre-Luminous also copies the provided buffer when using the C API --
it just copies it at a later point and not immediately. The eventual
goal is to eliminate the copy completely, but that requires some
additional plumbing work deep down within the librados messenger
layer.


I've learned the hard way that pre-Luminous, even if it copies the buffer,
it does so too late. In my specific case, my FUSE module enters the
write call and issues rbd_aio_write there, then exits the write - expecting
the buffer provided by FUSE to be copied by librbd (as happens now in
Luminous). I didn't expect that this is new behavior, and once my code was
deployed to use Jewel librbd, it started to consistently corrupt data during
writes.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Note about rbd_aio_write usage

2017-07-06 Thread Piotr Dałek

Hi,

If you're using "rbd_aio_write()" in your code, be aware of the fact that
before the Luminous release this function expects the buffer to remain
unchanged until the write op ends, while on Luminous and later this function
internally copies the buffer, allocating memory where needed and freeing it
once the write is done.


If you write an app that may need to work with Luminous *and* pre-Luminous
versions of librbd, you may want to add a version check (using
rbd_version(), for example) so that either your buffers won't change before the
write is done, or you don't incur a penalty for an unnecessary memory allocation
and copy on your side (though it's probably unavoidable with the current state
of Luminous).
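
A sketch of that version check in C (the cut-off below is an assumption for
illustration only - verify against the librbd changelog which version actually
started copying):

  #include <rbd/librbd.h>
  #include <stdbool.h>

  /* true if the caller must keep (or copy) the buffer itself until the
   * aio completion fires; the 1.12 threshold is an assumed example value */
  static bool need_caller_side_copy(void) {
      int major = 0, minor = 0, extra = 0;
      rbd_version(&major, &minor, &extra);
      return (major < 1) || (major == 1 && minor < 12);
  }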


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Piotr Dałek

On 17-06-21 03:24 PM, Sage Weil wrote:

On Wed, 21 Jun 2017, Piotr Dałek wrote:

On 17-06-14 03:44 PM, Sage Weil wrote:

On Wed, 14 Jun 2017, Paweł Sadowski wrote:

On 04/13/2017 04:23 PM, Piotr Dałek wrote:

On 04/06/2017 03:25 PM, Sage Weil wrote:

On Thu, 6 Apr 2017, Piotr Dałek wrote:

[snip]


I think the solution here is to use sparse_read during recovery.  The
PushOp data representation already supports it; it's just a matter of
skipping the zeros.  The recovery code could also have an option to
check
for fully-zero regions of the data and turn those into holes as
well.  For
ReplicatedBackend, see build_push_op().


So far it turns out that there's an even easier solution, we just enabled
"filestore seek hole" on some test cluster and that seems to fix the
problem for us. We'll see if fiemap works too.



Is it safe to enable "filestore seek hole", are there any tests that
verifies that everything related to RBD works fine with this enabled?
Can we make this enabled by default?


We would need to enable it in the qa environment first.  The risk here is
that users run a broad range of kernels and we are exposing ourselves to
any bugs in any kernel version they may run.  I'd prefer to leave it off
by default.


Is that a common regression? If not, we could blacklist particular kernels and
call it a day.

>> We can enable it in the qa suite, though, which covers
centos7 (latest kernel) and ubuntu xenial and trusty.


+1. Do you need some particular PR for that?


Sure.  How about a patch that adds the config option to several of the
files in qa/suites/rados/thrash/thrashers?


OK.


I tested on few of our production images and it seems that about 30% is
sparse. This will be lost on any cluster wide event (add/remove nodes,
PG grow, recovery).

How this is/will be handled in BlueStore?


BlueStore exposes the same sparseness metadata that enabling the
filestore seek hole or fiemap options does, so it won't be a problem
there.

I think the only thing that we could potentially add is zero detection
on writes (so that explicitly writing zeros consumes no space).  We'd
have to be a bit careful measuring the performance impact of that check on
non-zero writes.


I saw that RBD (librbd) does that - replacing writes with discards when the
buffer contains only zeros. Some code that does the same in librados could be
added, and it shouldn't impact performance much; the current implementation of
mem_is_zero is fast and shouldn't be a big problem.


I'd rather not have librados silently translating requests; I think it
makes more sense to do any zero checking in bluestore.  _do_write_small
and _do_write_big already break writes into (aligned) chunks; that would
be an easy place to add the check.


That leaves out filestore.

And while I get your point, doing it at the librados level would reduce network
usage for zeroed-out regions as well, and the check could be done just once, not
replica_size times...
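
As an illustration of the idea at the client end (not the librados-level change
discussed above), the check only has to happen once, before the write is
issued; a hedged C sketch:

  #include <rbd/librbd.h>
  #include <stdint.h>
  #include <string.h>

  /* crude all-zero test; a real implementation would use something like
   * mem_is_zero with wide loads instead of this memcmp trick */
  static int buf_is_zero(const char *buf, size_t len) {
      return len == 0 || (buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0);
  }

  static int write_or_discard(rbd_image_t image, uint64_t off, size_t len,
                              const char *buf, rbd_completion_t comp) {
      /* caveat: on cloned images a discard is not equivalent to writing
       * zeros, since it can re-expose parent data */
      if (len > 0 && buf_is_zero(buf, len))
          return rbd_aio_discard(image, off, len, comp);
      return rbd_aio_write(image, off, len, buf, comp);
  }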


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritise recovery on specific PGs/OSDs?

2017-06-21 Thread Piotr Dałek

On 17-06-20 02:44 PM, Richard Hesketh wrote:

Is there a way, either by individual PG or by OSD, I can prioritise 
backfill/recovery on a set of PGs which are currently particularly important to 
me?

For context, I am replacing disks in a 5-node Jewel cluster, on a node-by-node 
basis - mark out the OSDs on a node, wait for them to clear, replace OSDs, 
bring up and in, mark out the OSDs on the next set, etc. I've done my first 
node, but the significant CRUSH map changes means most of my data is moving. I 
only currently care about the PGs on my next set of OSDs to replace - the other 
remapped PGs I don't care about settling because they're only going to end up 
moving around again after I do the next set of disks. I do want the PGs 
specifically on the OSDs I am about to replace to backfill because I don't want 
to compromise data integrity by downing them while they host active PGs. If I 
could specifically prioritise the backfill on those PGs/OSDs, I could get on 
with replacing disks without worrying about causing degraded PGs.

I'm in a situation right now where there is merely a couple of dozen PGs on the 
disks I want to replace, which are all remapped and waiting to backfill - but 
there are 2200 other PGs also waiting to backfill because they've moved around 
too, and it's extremely frustating to be sat waiting to see when the ones I 
care about will finally be handled so I can get on with replacing those disks.


You could prioritize recovery on a pool, if that would work for you (as others
wrote), or +1 this PR: https://github.com/ceph/ceph/pull/13723 (it's a bit
outdated, as I'm constantly low on time, but I promise to push it forward!).
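
For the pool-level approach mentioned above, the knobs look like this (the pool
name is a placeholder; whether your release supports these pool options is
something to verify first):

  ceph osd pool set rbd recovery_priority 5
  ceph osd pool set rbd recovery_op_priority 5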


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Piotr Dałek

On 17-06-14 03:44 PM, Sage Weil wrote:

On Wed, 14 Jun 2017, Paweł Sadowski wrote:

On 04/13/2017 04:23 PM, Piotr Dałek wrote:

On 04/06/2017 03:25 PM, Sage Weil wrote:

On Thu, 6 Apr 2017, Piotr Dałek wrote:

[snip]


I think the solution here is to use sparse_read during recovery.  The
PushOp data representation already supports it; it's just a matter of
skipping the zeros.  The recovery code could also have an option to
check
for fully-zero regions of the data and turn those into holes as
well.  For
ReplicatedBackend, see build_push_op().


So far it turns out that there's an even easier solution, we just enabled
"filestore seek hole" on some test cluster and that seems to fix the
problem for us. We'll see if fiemap works too.



Is it safe to enable "filestore seek hole", are there any tests that
verifies that everything related to RBD works fine with this enabled?
Can we make this enabled by default?


We would need to enable it in the qa environment first.  The risk here is
that users run a broad range of kernels and we are exposing ourselves to
any bugs in any kernel version they may run.  I'd prefer to leave it off
by default.


Is that a common regression? If not, we could blacklist particular kernels
and call it a day.

 > We can enable it in the qa suite, though, which covers

centos7 (latest kernel) and ubuntu xenial and trusty.


+1. Do you need some particular PR for that?


I tested a few of our production images and it seems that about 30% is
sparse. This will be lost on any cluster-wide event (adding/removing nodes,
PG growth, recovery).

How this is/will be handled in BlueStore?


BlueStore exposes the same sparseness metadata that enabling the
filestore seek hole or fiemap options does, so it won't be a problem
there.

I think the only thing that we could potentially add is zero detection
on writes (so that explicitly writing zeros consumes no space).  We'd
have to be a bit careful measuring the performance impact of that check on
non-zero writes.


I saw that RBD (librbd) does that - replacing writes with discards when the 
buffer contains only zeros. Code that does the same in librados could be 
added, and it shouldn't impact performance much; the current implementation 
of mem_is_zero is fast and shouldn't be a big problem.
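
For reference, the "filestore seek hole" option discussed earlier in this 
thread can be checked on a running OSD roughly like this (a sketch; the full 
option name is assumed to be filestore_seek_data_hole, and it requires access 
to the OSD admin socket):

# check whether the hole-seeking code path is currently enabled on osd.0
ceph daemon osd.0 config get filestore_seek_data_hole
# to enable it, set "filestore seek data hole = true" under [osd] in ceph.conf
# and restart the OSD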


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Socket errors, CRC, lossy con messages

2017-04-11 Thread Piotr Dałek

On 04/10/2017 08:16 PM, Alex Gorbachev wrote:

I am trying to understand the cause of a problem we started
encountering a few weeks ago.  There are 30 or so messages per hour on the
OSD nodes of the type:

ceph-osd.33.log:2017-04-10 13:42:39.935422 7fd7076d8700  0 bad crc in
data 2227614508 != exp 2469058201

and

2017-04-10 13:42:39.939284 7fd722c42700  0 -- 10.80.3.25:6826/5752
submit_message osd_op_reply(1826606251
rbd_data.922d95238e1f29.000101bf [set-alloc-hint object_size
16777216 write_size 16777216,write 6328320~12288] v103574'18626765
uv18626765 ondisk = 0) v6 remote, 10.80.3.216:0/1934733503, failed
lossy con, dropping message 0x3b55600 [..]


Is that happening on the entire cluster, or just on specific OSDs? That is a 
clear indication of data corruption: in the example above, osd.33 calculated 
the crc for the received data block and found that it doesn't match what was 
precalculated by the sending side. Try gathering some more examples of such 
crc errors and isolate the osd/host that sends the malformed data, then run 
the usual diagnostics, such as a memory test, on that machine.
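
A rough way to do that aggregation (a sketch; log paths and message formats 
are assumed to match the examples quoted above):

# count crc errors per OSD log to see whether they cluster on one daemon/host
grep -c "bad crc in data" /var/log/ceph/ceph-osd.*.log
# pull the remote address out of the "failed lossy con" lines to spot a common sender
grep "failed lossy con" /var/log/ceph/ceph-osd.*.log | \
  grep -o "remote, [0-9.]*" | sort | uniq -c | sort -rn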


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow perfomance: sanity check

2017-04-06 Thread Piotr Dałek

On 04/06/2017 09:34 AM, Stanislav Kopp wrote:

Hello,

I'm evaluating a ceph cluster to see if we can use it for our
virtualization solution (proxmox). I'm using 3 nodes running Ubuntu
16.04 with stock ceph (10.2.6); every OSD uses a separate 8 TB spinning
drive (XFS), the MONs are installed on the same nodes, and all nodes are
connected via a 10G switch.

The problem is that on the client I only get ~25-30 MB/s with sequential
writes (dd with "oflag=direct"). [..]


The 8 TB size suggests these are some kind of "archive" drives (SMR drives). Is 
that correct? If so, you may want to use non-SMR drives instead, because Ceph is 
not optimized for SMR.
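
One quick way to check what the drives report about themselves (a sketch; it 
assumes smartmontools is installed and that /dev/sda is one of the OSD disks):

# the model string often gives SMR/archive drives away (e.g. Seagate "Archive" models)
smartctl -i /dev/sda
# model, size and rotation info for all disks at a glance
lsblk -d -o NAME,MODEL,SIZE,ROTA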


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recompiling source code - to find exact RPM

2017-03-24 Thread Piotr Dałek

On 03/23/2017 06:10 PM, nokia ceph wrote:

Hello Piotr,

I didn't understand - could you please elaborate on the procedure you
mentioned in your last reply? It would be really helpful if you could share a
link/doc explaining what you actually meant. Yes, correct, normally we follow
this procedure, but it takes more time. My intention here is to find out which
rpm contains the change. I think we are going in opposite
directions.



Building Ceph from source is described here ("Build Ceph" paragraph): 
http://docs.ceph.com/docs/master/install/build-ceph/
And installing the built binaries is described here: 
http://docs.ceph.com/docs/master/install/install-storage-cluster/#installing-a-build
That's enough to build and install the Ceph binaries on a specific host without 
building RPMs. After a code change, "make install" is enough to update the 
binaries; a restart of the Ceph daemons is still required.
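
A minimal sketch of that cycle (assuming a current cmake-based checkout as in 
the linked docs; older autotools-based branches differ):

# one-time setup: clone the tree and configure the build
git clone --recursive https://github.com/ceph/ceph.git && cd ceph
./install-deps.sh
./do_cmake.sh
cd build && make -j$(nproc)

# after each code change: rebuild, reinstall the binaries, restart the affected daemon
make -j$(nproc) && sudo make install
sudo systemctl restart ceph-osd@0    # or whichever daemon you changed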


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recompiling source code - to find exact RPM

2017-03-23 Thread Piotr Dałek

On 03/23/2017 02:02 PM, nokia ceph wrote:


Hello Piotr,

We do customize the ceph code for our testing purposes. It's a part of our R&D :)

Recompiling the source code creates 38 rpms; out of these I need to find
which one contains the change I made to the source code. That's
what I'm trying to figure out.


Yes, I understand that. But wouldn't it be faster and/or more convenient to 
just recompile the binaries in place (or use network symlinks) instead of 
packaging all of Ceph and (re)installing its packages each time you make a 
change? Generating RPMs takes a while.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recompiling source code - to find exact RPM

2017-03-23 Thread Piotr Dałek

On 03/23/2017 01:41 PM, nokia ceph wrote:

Hey brad,

Thanks for the info.

Yes, we know that these are test rpms.

The idea behind my question is: if I make a change in the ceph source
code and recompile it, I need to find the rpm that maps to the changed
file. If I find the exact RPM, I can apply just that RPM to
our existing ceph cluster instead of applying/overwriting all the compiled
rpms.

I hope this clears things up.


And why exactly do you want to rebuild RPMs each time? If the machines are 
powerful enough, you could recompile the binaries in place. Or symlink them via 
nfs (or whatever) to the build machine and build once there.
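
For example, something along these lines (a rough sketch; the hostnames are 
made up, it assumes only ceph-osd was changed, and it overwrites the packaged 
binary on the test nodes):

# build once on the build host, then push just the changed binary to the test nodes
for h in node1 node2 node3; do
  rsync -a build/bin/ceph-osd root@$h:/usr/bin/ceph-osd
  ssh root@$h systemctl restart ceph-osd.target
done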


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-13 Thread Piotr Dałek

On 03/13/2017 11:07 AM, Dan van der Ster wrote:

On Sat, Mar 11, 2017 at 12:21 PM, <cephmailingl...@mosibi.nl> wrote:


The next and biggest problem we encountered had to do with the CRC errors on 
the OSD map. On every map update, the OSDs that were not upgraded yet got that 
CRC error and asked the monitor for a full OSD map instead of just a delta 
update. At first we did not understand what exactly was happening. We ran the 
upgrade per node using a script; in that script we watch the state of the 
cluster and, when the cluster is healthy again, we upgrade the next host. Every 
time we started the script (skipping the already upgraded hosts) the first 
host(s) upgraded without issues and then we got blocked I/O on the cluster. The 
blocked I/O went away within a minute or two (not measured). After investigating 
we found out that the blocked I/O happened when nodes were asking the monitor 
for a (full) OSD map, which briefly resulted in a fully saturated network link 
on our monitor.



Thanks for the detailed upgrade report. I wanted to zoom in on this
CRC/fullmap issue because it could be quite disruptive for us when we
upgrade from hammer to jewel.

I've read various reports that the foolproof way to avoid the full
map DoS would be to upgrade all OSDs to jewel before the mons.
Did anyone have success with that workaround? I'm cc'ing Bryan because
he knows this issue very well.


With https://github.com/ceph/ceph/pull/13131 merged into 10.2.6, this issue 
shouldn't be a problem (at least we don't see it anymore).
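
To double-check which version the daemons are actually running before and 
during the upgrade, asking them directly is the easiest way (a sketch):

# report the version each running OSD is on; everything should be at 10.2.6 or newer
ceph tell osd.* version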


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with upgrade from 0.94.9 to 10.2.5

2017-01-26 Thread Piotr Dałek

On 01/24/2017 03:57 AM, Mike Lovell wrote:

i was just testing an upgrade of some monitors in a test cluster from hammer
(0.94.7) to jewel (10.2.5). after upgrading each of the first two monitors, i
stopped and restarted a single osd to cause changes in the maps. the same
error messages showed up in ceph -w. i haven't dug into it much but just
wanted to second that i've seen this happen on a recent hammer to recent
jewel upgrade.


Thanks for the confirmation.
We've prepared the patch which fixes the issue for us:
https://github.com/ceph/ceph/pull/13131


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with upgrade from 0.94.9 to 10.2.5

2017-01-18 Thread Piotr Dałek

On 01/17/2017 12:52 PM, Piotr Dałek wrote:

During our testing we found out that during the upgrade from 0.94.9 to 10.2.5
we're hitting issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6
-> 0.94.9 saturating mon node networking"). Apparently, there are a few
commits for both hammer and jewel which are supposed to fix this issue for
upgrades from 0.94.6 to 0.94.9 (and possibly others), but we're still
seeing it when upgrading to Jewel, and the symptoms are exactly the same - after
upgrading the MONs, each not-yet-upgraded OSD fetches the full OSDMap from the
monitors after failing the CRC check. Has anyone else encountered this?


http://tracker.ceph.com/issues/18582

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Issue with upgrade from 0.94.9 to 10.2.5

2017-01-17 Thread Piotr Dałek

Hello,

During our testing we found out that during the upgrade from 0.94.9 to 10.2.5 
we're hitting issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6 
-> 0.94.9 saturating mon node networking"). Apparently, there are a few 
commits for both hammer and jewel which are supposed to fix this issue for 
upgrades from 0.94.6 to 0.94.9 (and possibly others), but we're still 
seeing it when upgrading to Jewel, and the symptoms are exactly the same - after 
upgrading the MONs, each not-yet-upgraded OSD fetches the full OSDMap from the 
monitors after failing the CRC check. Has anyone else encountered this?
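
One way to see whether you're being hit is to watch the monitor's outbound 
traffic while osdmaps churn (a rough sketch; the interface name is a 
placeholder):

# watch outbound traffic on the monitor host; sustained txkB/s bursts right after
# an osdmap change suggest that not-yet-upgraded OSDs are pulling full maps
sar -n DEV 1 | grep eth0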


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Any librados C API users out there?

2017-01-11 Thread Piotr Dałek

Hello,

As the subject says - are there any users/consumers of the librados C API? I'm 
asking because we're researching whether this PR: 
https://github.com/ceph/ceph/pull/12216 will actually be beneficial for a 
larger group of users. This PR adds a bunch of new APIs that perform object 
writes without an intermediate data copy, which will reduce CPU and memory 
load on clients. If you're using the librados C API for object writes, feel 
free to comment here or in the pull request.



--
Piotr Dałek


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com