Re: [ceph-users] iSCSI on Ubuntu and HA / Multipathing

2019-07-10 Thread Frédéric Nass
Hi Edward, 

What "Red Hat Enterprise Linux/CentOS 7.5 (or newer); Linux kernel v4.16 (or 
newer)" means is that you either need to use RHEL/CentOS 7.5 distribution with 
a 3.10.0-852+ kernel or any other distribution with a 4.16+ upstream kernel. 
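
So on Ubuntu, a quick sanity check is simply (a sketch): 

$ uname -r                    # should report 4.16 or newer for the iSCSI gateway stack 
$ modinfo target_core_user    # TCMU support must be available in that kernel 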

Regards, 
Frédéric. 

- On 10 Jul 19, at 22:34, Edward Kalk wrote: 

> The Docs say : [ http://docs.ceph.com/docs/nautilus/rbd/iscsi-targets/ |
> http://docs.ceph.com/docs/nautilus/rbd/iscsi-targets/ ]

> * Red Hat Enterprise Linux/CentOS 7.5 (or newer); Linux kernel v4.16 (or 
> newer)

> ^^Is there a version combination of CEPH and Ubuntu that works? Is anyone
> running iSCSI on Ubuntu ?
> -Ed





Re: [ceph-users] Major ceph disaster

2019-05-17 Thread Frédéric Nass



On 14/05/2019 at 10:04, Kevin Flöh wrote:


On 13.05.19 11:21 PM, Dan van der Ster wrote:
Presumably the 2 OSDs you marked as lost were hosting those 
incomplete PGs?

It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore_history_les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)


yes, but as written in my other mail, we still have enough shards, at 
least I think so.




If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

I guess that is not possible.


Hi Kevin,

You want to make sure of this.

Unless you recreated the OSDs 4 and 23 and had new data written on them, 
they should still host the data you need.
What Dan suggested (export the 7 inconsistent PGs and import them on a 
healthy OSD) seems to be the only way to recover your lost data, as with 
4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when 
you actually need 3 to access it. Reducing min_size to 3 will not help.
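
Roughly, the export/import Dan mentioned would look like this (just a sketch; 
the OSD ids, pg id and paths are examples taken from your health output, adjust 
them to your setup and keep a copy of everything you export): 

$ systemctl stop ceph-osd@4          # one of the "lost" OSDs that may still hold the shard 
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
      --journal-path /var/lib/ceph/osd/ceph-4/journal \
      --pgid 1.5dd --op export --file /tmp/pg1.5dd.export 
$ systemctl stop ceph-osd@<healthy osd> 
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<healthy osd> \
      --journal-path /var/lib/ceph/osd/ceph-<healthy osd>/journal \
      --op import --file /tmp/pg1.5dd.export 
$ systemctl start ceph-osd@<healthy osd> 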


Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html

This is probably the best path to follow from now on.

Regards,
Frédéric.



If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.

-- dan


Would this let us recover at least some of the data on the PGs? If not, 
we would just set up a new ceph cluster directly, without fixing the old 
one, and copy whatever is left.


Best regards,

Kevin





On Mon, May 13, 2019 at 4:20 PM Kevin Flöh  wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, let me
first show you the current ceph status:

    cluster:
        id:     23e72372-0d44-4cad-b24f-3641b14b86f4
        health: HEALTH_ERR
                1 MDSs report slow metadata IOs
                1 MDSs report slow requests
                1 MDSs behind on trimming
                1/126319678 objects unfound (0.000%)
                19 scrub errors
                Reduced data availability: 2 pgs inactive, 2 pgs incomplete
                Possible data damage: 7 pgs inconsistent
                Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded
                118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91

    services:
        mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
        mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
        mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby
        osd: 96 osds: 96 up, 96 in

    data:
        pools:   2 pools, 4096 pgs
        objects: 126.32M objects, 260TiB
        usage:   372TiB used, 152TiB / 524TiB avail
        pgs:     0.049% pgs not active
                 1/500333881 objects degraded (0.000%)
                 1/126319678 objects unfound (0.000%)
                 4076 active+clean
                 10   active+clean+scrubbing+deep
                 7    active+clean+inconsistent
                 2    incomplete
                 1    active+recovery_wait+degraded

    io:
        client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
MDS_TRIM 1 MDSs behind on trimming
    mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128) max_segments: 128, num_segments: 46034
OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
OSD_SCRUB_ERRORS 19 scrub errors
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
    pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
PG_DAMAGED Possible data damage: 7 pgs inco

Re: [ceph-users] Bluestore with so many small files

2019-04-23 Thread Frédéric Nass
Hi, 

You probably forgot to recreate the OSD after changing 
bluestore_min_alloc_size. 
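
As a sketch (the OSD id and device below are examples; the option is only read 
at OSD creation time, so OSDs built before the change keep the old value): 

# /etc/ceph/ceph.conf on the OSD nodes 
[osd] 
bluestore_min_alloc_size = 4096 

$ ceph osd out 12                            # wait for rebalancing to finish 
$ systemctl stop ceph-osd@12 
$ ceph osd purge 12 --yes-i-really-mean-it 
$ ceph-volume lvm zap /dev/sdc --destroy 
$ ceph-volume lvm create --data /dev/sdc     # the recreated OSD picks up the 4096 value 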

Regards, 
Frédéric. 

- On 22 Apr 19, at 5:41, 刘 俊 wrote: 

> Hi All ,
> I still see this issue with latest ceph Luminous 12.2.11 and 12.2.12.
> I have set bluestore_min_alloc_size = 4096 before the test.
> when I write 10 small objects less than 64KB through rgw, the RAW USED
> showed in "ceph df" looks incorrect.
> For example, I test three times and clean up the rgw data pool each time, the
> object size for first time is 4KB, for second time is 32KB, for third time is
> 64KB.
> The RAW USED shown in "ceph df" is the same each time (18GB), looks like it is
> always equal to 64KB*10/1024*3 (replication is 3 here).
> Any thought?
> Jamie
> Hi Behnam,
>
> On 2/12/2018 4:06 PM, Behnam Loghmani wrote:
>> Hi there,
>>
>> I am using ceph Luminous 12.2.2 with:
>>
>> 3 osds (each osd is 100G) - no WAL/DB separation.
>> 3 mons
>> 1 rgw
>> cluster size 3
>>
>> I stored lots of thumbnails with very small size on ceph with radosgw.
>> Actual size of files is something about 32G but it filled 70G of each osd.
>> What's the reason of this high disk usage?
>
> Most probably the major reason is BlueStore allocation granularity, e.g.
> an object of 1K bytes length needs 64K of disk space if the default
> bluestore_min_alloc_size_hdd (=64K) is applied.
> Additional inconsistency in space reporting might also appear since
> BlueStore adds up DB volume space when accounting total store space,
> while free space is taken from the block device only. As a result, the
> reported "Used" space always contains that total DB space part (i.e.
> Used = Total(Block+DB) - Free(Block)). That correlates with other
> comments in this thread about RocksDB space usage.
> There is a pending PR to fix that:
> https://github.com/ceph/ceph/pull/19454/commits/144fb9663778f833782bdcb16acd707c3ed62a86
> You may look for "Bluestore: inaccurate disk usage statistics problem"
> in this mailing list for previous discussion as well.
>
>> Should I change "bluestore_min_alloc_size_hdd"? And if I change it and
>> set it to a smaller size, does it impact performance?
>
> Unfortunately I haven't benchmarked "small writes over hdd" cases much,
> hence I don't have exact answers here. Indeed, the 'min_alloc_size'
> family of parameters might impact the performance quite significantly.
>
>> What is the best practice for storing small files on bluestore?
>>
>> Best regards,
>> Behnam Loghmani
>>
>> On Mon, Feb 12, 2018 at 5:06 PM, David Turner <drakonstein at gmail.com> wrote:
>>> Some of your overhead is the WAL and rocksdb that are on the OSDs.
>>> The WAL is pretty static in size, but rocksdb grows with the amount
>>> of objects you have. You also have copies of the osdmap on each osd.
>>> There's just overhead that adds up. The biggest is going to be
>>> rocksdb with how many objects you have.




Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-10 Thread Frédéric Nass

Hi everyone,

So if the kernel is able to reclaim those pages, is there still a point 
in running the heap release on a regular basis?


Regards,
Frédéric.

On 09/04/2019 at 19:33, Olivier Bonvalet wrote:

Good point, thanks !

By making memory pressure (by playing with vm.min_free_kbytes), memory
is freed by the kernel.

So I think I essentially need to update monitoring rules, to avoid
false positive.

Thanks, I continue to read your resources.


On Tuesday 09 April 2019 at 09:30 -0500, Mark Nelson wrote:

My understanding is that basically the kernel is either unable or
uninterested (maybe due to lack of memory pressure?) in reclaiming
the
memory .  It's possible you might have better behavior if you set
/sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or
maybe disable transparent huge pages entirely.
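
For reference, those knobs can be set like this (a sketch; 'madvise' instead of 
'never' is a middle ground, and you'd make it persistent via rc.local, a tuned 
profile or the kernel command line as appropriate): 

# echo 0 > /sys/kernel/mm/khugepaged/max_ptes_none 
# echo never > /sys/kernel/mm/transparent_hugepage/enabled 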


Some background:

https://github.com/gperftools/gperftools/issues/1073

https://blog.nelhage.com/post/transparent-hugepages/

https://www.kernel.org/doc/Documentation/vm/transhuge.txt


Mark


On 4/9/19 7:31 AM, Olivier Bonvalet wrote:

Well, Dan seems to be right :

_tune_cache_size
  target: 4294967296
heap: 6514409472
unmapped: 2267537408
  mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


On Monday 08 April 2019 at 16:09 -0500, Mark Nelson wrote:

One of the difficulties with the osd_memory_target work is that
we
can't
tune based on the RSS memory usage of the process. Ultimately
it's up
to
the kernel to decide to reclaim memory and especially with
transparent
huge pages it's tough to judge what the kernel is going to do
even
if
memory has been unmapped by the process.  Instead the autotuner
looks
at
how much memory has been mapped and tries to balance the caches
based
on
that.


In addition to Dan's advice, you might also want to enable debug
bluestore at level 5 and look for lines containing "target:" and
"cache_size:".  These will tell you the current target, the
mapped
memory, unmapped memory, heap size, previous aggregate cache
size,
and
new aggregate cache size.  The other line will give you a break
down
of
how much memory was assigned to each of the bluestore caches and
how
much each case is using.  If there is a memory leak, the
autotuner
can
only do so much.  At some point it will reduce the caches to fit
within
cache_min and leave it there.
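
Concretely, something like this (a sketch; osd.147 is an example id): 

$ ceph tell osd.147 injectargs '--debug_bluestore 5/5' 
$ grep -E 'target:|cache_size:' /var/log/ceph/ceph-osd.147.log | tail -20 
$ ceph tell osd.147 injectargs '--debug_bluestore 1/5'     # back to something quieter afterwards 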


Mark


On 4/8/19 5:18 AM, Dan van der Ster wrote:

Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell
osd.0
heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
ceph.l...@daevel.fr> wrote:

Hi,

on a Luminous 12.2.11 deploiement, my bluestore OSD exceed
the
osd_memory_target :

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph      3646 17.1 12.0 6828916 5893136 ? Ssl  mars29 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph --setgroup ceph
ceph      3991 12.9 11.2 6342812 5485356 ? Ssl  mars29 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph --setgroup ceph
ceph      4361 16.9 11.8 6718432 5783584 ? Ssl  mars29 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph --setgroup ceph
ceph      4731 19.7 12.2 6949584 5982040 ? Ssl  mars29 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph --setgroup ceph
ceph      5073 16.7 11.6 6639568 5701368 ? Ssl  mars29 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph --setgroup ceph
ceph      5417 14.6 11.2 6386764 5519944 ? Ssl  mars29 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph --setgroup ceph
ceph      5760 16.9 12.0 6806448 5879624 ? Ssl  mars29 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph --setgroup ceph
ceph      6105 16.0 11.6 6576336 5694556 ? Ssl  mars29 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          47771       45210        1643          17         917       43556
Swap:             0           0           0

# ceph daemon osd.147 config show | grep memory_target
   "osd_memory_target": "4294967296",


And there is no recovery / backfilling, the cluster is fine :

  $ ceph status
cluster:
  id: de035250-323d-4cf6-8c4b-cf0faf6296b1
  health: HEALTH_OK

services:
  mon: 5 daemons, quorum
tolriq,tsyne,olkas,lorunde,amphel
  mgr: tsyne(active), standbys: olkas, tolriq,
lorunde,
amphel
  osd: 120 osds: 116 up, 116 in

data:
  pools:   20 pool

Re: [ceph-users] CephFS and many small files

2019-04-02 Thread Frédéric Nass

Hello,

I haven't had any issues either with 4k allocation size in cluster 
holding 358M objects for 116TB (237TB raw) and 2.264B chunks/replicas.


This is an average of 324k per object and 12.6M of chunks/replicas per 
OSD with RocksDB sizes going from 12.1GB to 21.14GB depending on how 
much PGs the OSDs have.
RocksDB sizes will lower as we add more OSDs to the cluster by the end 
of this year.


We've seen a huge latency improvement by moving OSDs to Bluestore. 
Filestore (XFS) wouldn't operate well anymore with over 10M files, 
with a negligible fragmentation factor and 8/40 split/merge thresholds.


Frédéric.

On 01/04/2019 at 14:47, Sergey Malinin wrote:

I haven't had any issues with 4k allocation size in cluster holding 189M files.

April 1, 2019 2:04 PM, "Paul Emmerich"  wrote:


I'm not sure about the real-world impacts of a lower min alloc size or
the rationale behind the default values for HDDs (64kb) and SSDs (16kb).

Paul








Re: [ceph-users] Prioritize recovery over backfilling

2019-02-20 Thread Frédéric Nass
Hi Sage,

Would be nice to have this one backported to Luminous if easy. 

Cheers,
Frédéric.

> On 7 June 2018 at 13:33, Sage Weil wrote:
> 
> On Wed, 6 Jun 2018, Caspar Smit wrote:
>> Hi all,
>> 
>> We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a node
>> to it.
>> 
>> osd-max-backfills is at the default 1 so backfilling didn't go very fast
>> but that doesn't matter.
>> 
>> Once it started backfilling everything looked ok:
>> 
>> ~300 pgs in backfill_wait
>> ~10 pgs backfilling (~number of new osd's)
>> 
>> But i noticed the degraded objects increasing a lot. I presume a pg that is
>> in backfill_wait state doesn't accept any new writes anymore? Hence
>> increasing the degraded objects?
>> 
>> So far so good, but once a while i noticed a random OSD flapping (they come
>> back up automatically). This isn't because the disk is saturated but a
>> driver/controller/kernel incompatibility which 'hangs' the disk for a short
>> time (scsi abort_task error in syslog). Investigating further i noticed
>> this was already the case before the node expansion.
>> 
>> These OSD's flapping results in lots of pg states which are a bit worrying:
>> 
>> 109 active+remapped+backfill_wait
>> 80  active+undersized+degraded+remapped+backfill_wait
>> 51  active+recovery_wait+degraded+remapped
>> 41  active+recovery_wait+degraded
>> 27  active+recovery_wait+undersized+degraded+remapped
>> 14  active+undersized+remapped+backfill_wait
>> 4   active+undersized+degraded+remapped+backfilling
>> 
>> I think the recovery_wait is more important then the backfill_wait, so i
>> like to prioritize these because the recovery_wait was triggered by the
>> flapping OSD's
> 
> Just a note: this is fixed in mimic.  Previously, we would choose the 
> highest-priority PG to start recovery on at the time, but once recovery 
> had started, the appearance of a new PG with a higher priority (e.g., 
> because it finished peering after the others) wouldn't preempt/cancel the 
> other PG's recovery, so you would get behavior like the above.
> 
> Mimic implements that preemption, so you should not see behavior like 
> this.  (If you do, then the function that assigns a priority score to a 
> PG needs to be tweaked.)
> 
> sage




Re: [ceph-users] Prioritize recovery over backfilling

2019-02-20 Thread Frédéric Nass
Hi,

Please keep in mind that setting the 'nodown' flag will prevent PGs from 
becoming degraded, but it will also prevent client requests from being served by 
other OSDs that would otherwise have taken over from the non-responsive one in a 
healthy manner, and this for as long as the OSD stays non-responsive.
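
For reference, the flag handling and the targeted recovery discussed below look 
like this (a sketch; pg ids are placeholders to pick from ceph health detail or 
ceph pg dump): 

$ ceph osd set nodown 
$ ceph pg force-recovery <pgid> [<pgid> ...] 
# once recovery/backfill has settled for good: 
$ ceph osd unset nodown 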

Regards,
Frédéric.


> On 7 June 2018 at 08:47, Piotr Dałek wrote:
> 
> On 18-06-06 09:29 PM, Caspar Smit wrote:
>> Hi all,
>> We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a node 
>> to it.
>> osd-max-backfills is at the default 1 so backfilling didn't go very fast but 
>> that doesn't matter.
>> Once it started backfilling everything looked ok:
>> ~300 pgs in backfill_wait
>> ~10 pgs backfilling (~number of new osd's)
>> But i noticed the degraded objects increasing a lot. I presume a pg that is 
>> in backfill_wait state doesn't accept any new writes anymore? Hence 
>> increasing the degraded objects?
>> So far so good, but once a while i noticed a random OSD flapping (they come 
>> back up automatically). This isn't because the disk is saturated but a 
>> driver/controller/kernel incompatibility which 'hangs' the disk for a short 
>> time (scsi abort_task error in syslog). Investigating further i noticed this 
>> was already the case before the node expansion.
>> These OSD's flapping results in lots of pg states which are a bit worrying:
>>  109 active+remapped+backfill_wait
>>  80  active+undersized+degraded+remapped+backfill_wait
>>  51  active+recovery_wait+degraded+remapped
>>  41  active+recovery_wait+degraded
>>  27  active+recovery_wait+undersized+degraded+remapped
>>  14  active+undersized+remapped+backfill_wait
>>  4   active+undersized+degraded+remapped+backfilling
>> I think the recovery_wait is more important then the backfill_wait, so i 
>> like to prioritize these because the recovery_wait was triggered by the 
>> flapping OSD's
> >
>> furthermore the undersized ones should get absolute priority or is that 
>> already the case?
>> I was thinking about setting "nobackfill" to prioritize recovery instead of 
>> backfilling.
>> Would that help in this situation? Or am i making it even worse then?
>> ps. i tried increasing the heartbeat values for the OSD's to no avail, they 
>> still get flagged as down once in a while after a hiccup of the driver.
> 
> First of all, use "nodown" flag so osds won't be marked down automatically 
> and unset it once everything backfills/recovers and settles for good -- note 
> that there might be lingering osd down reports, so unsetting nodown might 
> cause some of problematic osds to be instantly marked as down.
> 
> Second, since Luminous you can use "ceph pg force-recovery" to ask particular 
> pgs to recover first, even if there are other pgs to backfill and/or recovery.
> 
> -- 
> Piotr Dałek
> piotr.da...@corp.ovh.com 
> https://www.ovhcloud.com 




Re: [ceph-users] Using FC with LIO targets

2018-10-31 Thread Frédéric Nass
Hi Mike,

Thank you for your answer. I thought maybe FC would just be the transport 
protocol to LIO and all would be fine, but I forgot the tcmu-runner part, which I 
suppose is where some iSCSI specifics were hard coded.

FC was interesting in that (where already in place) it would avoid having to 
dedicate specific NICs to iSCSI traffic. Nowadays, many ESXi blades only have 
2x 10Gbs or 25Gbs NICs configured as an LACP LAG in ESXi, so dedicating NICs to 
iSCSI traffic may not be easy anymore. I don't know. We'll see how DELL NPAR can 
help us create virtual hardware NICs that we can use for iSCSI traffic.

Best regards,
Frédéric.

> On 31 Oct 2018 at 04:59, Mike Christie wrote:
> 
> On 10/28/2018 03:18 AM, Frédéric Nass wrote:
>> Hello Mike, Jason,
>> 
>> Assuming we adapt the current LIO configuration scripts and put QLogic HBAs 
>> in our SCSI targets, could we use FC instead of iSCSI as a SCSI transport 
>> protocol with LIO ? Would this still work with multipathing and ALUA ?
>> Do you see any issues coming from this type of configuration ?
> 
> The FC drivers have a similar problem as iscsi.
> 
> The general problem is making sure the transport paths are flushed when
> we failover/back. I had thought using explicit failover would fix this,
> but for vpshere HA type of setups and for the single host with multiple
> initiator ports to connected to the same target port we still hit issues.
> 
> For iscsi, I am working on this patchset (maybe half is now merged but
> the patchset is larger due to some other requested fixes in sort of
> related code):
> 
> https://www.spinics.net/lists/target-devel/msg16943.html
> 
> where from userspace we can flush the iscsi paths when performing
> failover so we know there are no stale IOs in that iscsi/code path.
> 
> For FC drivers I was planning something similar where we would send a FC
> echo like we do for the iscsi nop.
> 
> If you are asking if you can just drop in one of the FC target drivers
> into the ceph-iscsi-config/cli/tcmu-runner stuff then it would not work,
> because there are a lot of places where iscsi references are hard coded now.
> 
> 
> 
>> Best regards,
>> Frédéric.
> 



[ceph-users] Using FC with LIO targets

2018-10-28 Thread Frédéric Nass
Hello Mike, Jason,

Assuming we adapt the current LIO configuration scripts and put QLogic HBAs in 
our SCSI targets, could we use FC instead of iSCSI as a SCSI transport protocol 
with LIO ? Would this still work with multipathing and ALUA ?
Do you see any issues coming from this type of configuration ?

Best regards,
Frédéric.


Re: [ceph-users] Disk write cache - safe?

2018-03-19 Thread Frédéric Nass

Hi Steven,

On 16/03/2018 at 17:26, Steven Vacaroaia wrote:

Hi All,

Can someone confirm please that, for a perfect performance/safety 
compromise, the following would be the best settings  ( id 0 is SSD, 
id 1 is HDD )
Alternatively, any suggestions / sharing configuration / advice would 
be greatly appreciated

Note
server is a DELL R620 with PERC 710 , 1GB cache
SSD is entreprise Toshiba PX05SMB040Y
HDD is Entreprise Seagate  ST600MM0006


 megacli -LDGetProp  -DskCache -Lall -a0

Adapter 0-VD 0(target id: 0): Disk Write Cache : Enabled
Adapter 0-VD 1(target id: 1): Disk Write Cache : Disabled


Sounds good to me as Toshiba PX05SMB040Y SSDs include power-loss 
protection 
(https://toshiba.semicon-storage.com/eu/product/storage-products/enterprise-ssd/px05smbxxx.html)


megacli -LDGetProp  -Cache -Lall -a0

Adapter 0-VD 0(target id: 0): Cache Policy:WriteBack, ReadAdaptive, 
Direct, No Write Cache if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, 
Cached, Write Cache OK if bad BBU


I've always wondered about ReadAdaptive with no real answer. This would 
need clarification from RHCS / Ceph performance team.


With a 1GB PERC cache, my guess is that you should set SSDs to 
writethrough whatever your workload is, so that the whole cache is 
dedicated to HDDs only, and your nodes don't hit a PERC cache full issue 
that would be hard to diagnose. Besides, write caching should always be 
avoided with a bad BBU.
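
If you go that route, it would be something like this (a sketch; the VD/adapter 
ids match the LDGetProp output above, double check them before applying): 

megacli -LDSetProp WT -L0 -a0            # PERC cache to write-through for the SSD virtual disk 
megacli -LDSetProp -EnDskCache -L0 -a0   # keep the SSD's own (power-loss protected) cache enabled 
megacli -LDGetProp -Cache -L0 -a0        # verify 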


Regards,

Frédéric.


Many thanks

Steven





On 16 March 2018 at 06:20, Frédéric Nass 
<frederic.n...@univ-lorraine.fr> wrote:


Hi Tim,

I wanted to share our experience here as we've been in a situation
in the past (on a friday afternoon of course...) that injecting a
snaptrim priority of 40 to all OSDs in the cluster (to speed up
snaptrimming) resulted in all OSD nodes crashing at the same time,
in all 3 datacenters. My first thought at that particular moment
was: call your wife and tell her you'll be late home. :-D

And this event was not related to a power outage.

Fortunately I had spent some time (when building the cluster)
thinking how each option should be set along the I/O path for #1
data consistency and #2 best possible performance, and that was :

- Single SATA disks Raid0 with writeback PERC caching on each
virtual disk
- write barriers kept enabled on XFS mounts (I had measured a 1.5
% performance gap, so disabling barriers was no good choice, and it
never is, actually)
- SATA disks write buffer disabled (as volatile)
- SSD journal disks write buffer enabled (as persistent)

We hardly believed it but when all nodes came back online, all
OSDs rejoined the cluster and service was back as it was before.
We didn't face any XFS errors nor did we have any further scrub or
deep-scrub errors.

My assumption was that the extra power demand for snaptrimming may
have led to node power instability or that we hit a SATA firmware
or maybe a kernel bug.

We also had SSDs as Raid0 with writeback PERC cache ON but changed
that to write-through as we could get more IOPS from them
regarding our workloads.

Thanks for sharing the information about DELL changing the default
disk buffer policy. What's odd is that all buffers were
disabled after the node rebooted, including the SSDs!
I am now changing them back to enabled for the SSDs only.

As said by others, you'd better keep the disks buffers disabled
and rebuild the OSDs after setting the disks as Raid0 with
writeback enabled.

Best,

Frédéric.

On 14/03/2018 at 20:42, Tim Bishop wrote:

I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A
recent [1]
update to the PERC firmware disabled the disk write cache by
default
which made a noticable difference to the latency on my disks
(spinning
disks, not SSD) - by as much as a factor of 10.

For reference their change list says:

"Changes default value of drive cache for 6 Gbps SATA drive to
disabled.
This is to align with the industry for SATA drives. This may
result in a
performance degradation especially in non-Raid mode. You must
perform an
AC reboot to see existing configurations change."

It's fairly straightforward to re-enable the cache either in
the PERC
BIOS, or by using hdparm, and doing so returns the latency
back to what
it was before.

Checking the Ceph documentation I can see that older versions [2]
recommended disabling the write cache for older kernels. But
given I'm
using a newer kernel, and there's no mention of this in the
Luminous
docs, is it safe to assume it's ok to enable the disk write
cac

Re: [ceph-users] Disk write cache - safe?

2018-03-16 Thread Frédéric Nass

Hi Tim,

I wanted to share our experience here as we've been in a situation in 
the past (on a friday afternoon of course...) that injecting a snaptrim 
priority of 40 to all OSDs in the cluster (to speed up snaptimming) 
resulted in alls OSD nodes crashing at the same time, in all 3 
datacenters. My first thought at that particular moment was : call your 
wife and tell her you'll be late home. :-D


And this event was not related to a power outage.

Fortunately I had spent some time (when building the cluster) thinking 
how each option should be set along the I/O path for #1 data consistency 
and #2 best possible performance, and that was :


- Single SATA disks Raid0 with writeback PERC caching on each virtual disk
- write barriers kept enabled on XFS mounts (I had measured a 1.5 % 
performance gap, so disabling barriers was no good choice, and it never 
is, actually)

- SATA disks write buffer disabled (as volatile)
- SSD journal disks write buffer enabled (as persistent)

We hardly believed it but when all nodes came back online, all OSDs 
rejoined the cluster and service was back as it was before. We didn't 
face any XFS errors nor did we have any further scrub or deep-scrub errors.


My assumption was that the extra power demand for snaptrimming may have 
led to node power instability or that we hit a SATA firmware or maybe a 
kernel bug.


We also had SSDs as Raid0 with writeback PERC cache ON but changed that 
to write-through as we could get more IOPS from them regarding our 
workloads.


Thanks for sharing the information about DELL changing the default disk 
buffer policy. What's odd is that all buffers were disabled after the 
node rebooted, including the SSDs!

I am now changing them back to enabled for the SSDs only.

As said by others, you'd better keep the disks buffers disabled and 
rebuild the OSDs after setting the disks as Raid0 with writeback enabled.
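
For what it's worth, those per-drive buffers can be toggled and checked like 
this (a sketch; device names are examples, and the PERC equivalent is megacli 
-LDSetProp -EnDskCache / -DisDskCache): 

hdparm -W0 /dev/sdb     # disable the volatile write buffer on a SATA data disk 
hdparm -W1 /dev/sda     # enable it on a power-loss protected SSD journal disk 
hdparm -W /dev/sdb      # report the current setting 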


Best,

Frédéric.

On 14/03/2018 at 20:42, Tim Bishop wrote:

I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A recent [1]
update to the PERC firmware disabled the disk write cache by default
which made a noticable difference to the latency on my disks (spinning
disks, not SSD) - by as much as a factor of 10.

For reference their change list says:

"Changes default value of drive cache for 6 Gbps SATA drive to disabled.
This is to align with the industry for SATA drives. This may result in a
performance degradation especially in non-Raid mode. You must perform an
AC reboot to see existing configurations change."

It's fairly straightforward to re-enable the cache either in the PERC
BIOS, or by using hdparm, and doing so returns the latency back to what
it was before.

Checking the Ceph documentation I can see that older versions [2]
recommended disabling the write cache for older kernels. But given I'm
using a newer kernel, and there's no mention of this in the Luminous
docs, is it safe to assume it's ok to enable the disk write cache now?

If it makes a difference, I'm using a mixture of filestore and bluestore
OSDs - migration is still ongoing.

Thanks,

Tim.

[1] - 
https://www.dell.com/support/home/uk/en/ukdhs1/Drivers/DriversDetails?driverId=8WK8N
[2] - 
http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/





Re: [ceph-users] High apply latency

2018-02-06 Thread Frédéric Nass

Hi Jakub,


On 06/02/2018 at 16:03, Jakub Jaszewski wrote:

​Hi Frederic,

I've not enabled debug level logging on all OSDs, just on one for the 
test; need to double check that. But it looks like merging is ongoing on 
a few OSDs, or some OSDs are faulty; I will dig into that tomorrow.

Write bandwidth is very random


I just reread the whole thread:

- Splitting is not happening anymore - if it ever did - that's for sure.
- Regarding the write bandwidth variations, it seems that these 
variations only concern EC 6+3 pools.
- As you get more than 1.2 GB/s on replicated pools with 4MB IOs, I 
would think that neither the NVMe, the PERC nor the HDDs are to blame.


Did you check CPU load during EC 6+3 writes on pool 
default.rgw.buckets.data ?


If you don't see any 100% CPU load, nor any 100% iostat issues on either 
the NVMe disk or HDDs, then I would benchmark the network for bandwidth 
or latency issues.
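
Something along these lines (a sketch; host names are placeholders): 

iostat -xm 2                       # watch %util on the NVMe and the HDDs during the bench 
sar -n DEV 2                       # per-NIC throughput 
iperf3 -s                          # on one OSD node... 
iperf3 -c <osd-node> -P 4 -t 30    # ...and from another, to rule out bandwidth issues 
ping -c 100 -i 0.2 <osd-node>      # rough latency/jitter check 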


BTW, did you see that some of your OSDs were not tagged as 'hdd' (see 
ceph osd df tree)?
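
If some of them ended up without a device class, it can be set like this (a 
sketch, with a hypothetical osd id): 

ceph osd df tree | grep -v hdd 
ceph osd crush rm-device-class osd.12 
ceph osd crush set-device-class hdd osd.12 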





# rados bench -p default.rgw.buckets.data 120 write
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size 4194432 for up to 120 seconds or 0 objects

Object prefix: benchmark_data_sg08-09_59104
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)

0       0         0         0         0         0  -           0
1      16       155       139    555.93   556.017  0.0750027     0.10687
2      16       264       248   495.936   436.013 0.154185    0.118693
3      16       330       314   418.616   264.008 0.118476    0.142667
4      16       415       399   398.953    340.01  0.0873379     0.15102
5      16       483       467   373.557   272.008 0.750453    0.159819
6      16       532       516   343.962   196.006  0.0298334    0.171218
7      16       617       601   343.391    340.01 0.192698    0.177288
8      16       700       684   341.963    332.01  0.0281355    0.171277
9      16       762       746   331.521   248.008  0.0962037    0.163734
 10      16       804       788   315.167   168.005  1.40356    0.196298
 11      16       897       881    320.33   372.011  0.0369085     0.19496
 12      16       985       969   322.966   352.011  0.0290563    0.193986
 13      15      1106      1091   335.657   488.015  0.0617642    0.188703
 14      16      1166      1150   328.537   236.007  0.0401884    0.186206
 15      16      1251      1235   329.299    340.01 0.171256    0.190974
 16      16      1339      1323   330.716   352.011 0.024222    0.189901
 17      16      1417      1401   329.613    312.01  0.0289473    0.186562
 18      16      1465      1449   321.967   192.006 0.028123    0.189153
 19      16      1522      1506    317.02   228.007 0.265448    0.188288
2018-02-06 13:43:21.412512 min lat: 0.0204657 max lat: 3.61509 avg lat: 0.18918
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)

 20      16      1564      1548   309.568   168.005  0.0327581     0.18918
 21      16      1636      1620    308.54   288.009  0.0715159    0.187381
 22      16      1673      1657   301.242   148.005  1.57285    0.191596
 23      16      1762      1746   303.621   356.011  6.00352    0.206217
 24      16      1885      1869   311.468   492.015  0.0298435    0.203874
 25      16      2010      1994   319.008   500.015  0.0258761    0.199652
 26      16      2116      2100   323.044   424.013  0.0533319     0.19631
 27      16      2201      2185    323.67    340.01 0.134796    0.195953
 28      16      2257      2241    320.11   224.007 0.473629    0.196464
 29      16      2333      2317   319.554   304.009  0.0362741    0.198054
 30      16      2371      2355   313.968   152.005 0.438141    0.200265
 31      16      2459      2443   315.194   352.011  0.0610629    0.200858
 32      16      2525      2509   313.593   264.008  0.0234799    0.201008
 33      16      2612      2596   314.635   348.011 0.072019    0.199094
 34      16      2682      2666   313.615   280.009  0.10062    0.197586
 35      16      2757      2741   313.225   300.009  0.0552581    0.196981
 36      16      2849      2833   314.746   368.011 0.257323     0.19565
 37      16      2891      2875   310.779   168.005  0.0918386     0.19556
 38      16      2946      2930    308.39   220.007  0.0276621    0.195792
 39      16      2975      2959   303.456   116.004  0.0588971     0.19952
2018-02-06 13:43:41.415107 min lat: 0.0204657 max lat: 7.9873 avg lat: 0.198749
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)

 40      16      3060      3044   304.369    340.01  0.0217136    0.198749
 41      16      3098      3082   300.652   152.005  0.0717398    0.199052
 42      16      3141      3125   297.589   172.005  0.0257422    0.201899
 43      15      3241      3226   300.063   404.012  0.0733869    0.209446
 44      16      3332      3316   301.424   360.011  0.0327249    0.206686
 45      16      3430      3414   303.436   392.012  0.0413156    0.2037

Re: [ceph-users] High apply latency

2018-02-05 Thread Frédéric Nass

Hi Jakub,

On 05/02/2018 at 12:26, Jakub Jaszewski wrote:

Hi Frederic,

Many thanks for your contribution to the topic!

I've just set logging level 20 for filestore via

ceph tell osd.0 config set debug_filestore 20

but so far found nothing by keyword 'split' in /var/log/ceph/ceph-osd.0.log



So, if you're running ceph > 12.2.1, that means splitting is not 
happening. Did you check during writes? Did you check the other OSDs' logs?
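
For instance (a sketch, to run on each OSD node): 

grep -c 'split' /var/log/ceph/ceph-osd.*.log 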


Actually, splitting should not happen now that you've increased 
filestore_merge_threshold and filestore_split_multiple values.




​I've also run your script across the cluster nodes, results as follows

id=3, pool=volumes, objects=10454548, avg=160.28
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.2344
id=3, pool=volumes, objects=10454548, avg=159.22
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.9994
id=3, pool=volumes, objects=10454548, avg=159.843
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7435
id=3, pool=volumes, objects=10454548, avg=159.695
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.0579
id=3, pool=volumes, objects=10454548, avg=160.594
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7757
id=3, pool=volumes, objects=10454548, avg=160.099
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=33.8517
id=3, pool=volumes, objects=10454548, avg=159.912
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=37.5698
id=3, pool=volumes, objects=10454548, avg=159.407
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.4991
id=3, pool=volumes, objects=10454548, avg=160.075
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.481

Looks like there is nothing to be handled by split, am I right? But 
what about merging? Avg is less than 40, should the directory structure 
be reduced now?


It should, I guess. But then you'd see blocked requests on every object 
deletion. If you do, you might want to set filestore_merge_threshold 
to -40 (negative value) so merging does not happen anymore.

Splitting would still happen over 5120 files per subdirectory.
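
That would be something like this in ceph.conf (a sketch; with these values 
splitting starts at abs(-40) * 8 * 16 = 5120 files per subdirectory and merging 
never happens): 

[osd] 
filestore_merge_threshold = -40 
filestore_split_multiple = 8 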



    "filestore_merge_threshold": "40",
    "filestore_split_multiple": "8",
    "filestore_split_rand_factor": "20",

May I ask for the link to documentation where I can read more about 
OSD underlying directory structure?


I'm not aware of any related documentation.

Do you still observe slow or blocked requests now that you've increased 
filestore_merge_threshold and filestore_split_multiple?


Regards,

Frédéric.




​And just noticed log entries in /var/log/ceph/ceph-osd.0.log

​2018-02-05 11:22:03.346400 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 
<http://10.212.14.11:6818/4702> >> 10.212.14.17:6802/82845 
<http://10.212.14.17:6802/82845> conn(0xe254cca800 :6818 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg accept connect_seq 27 vs existing csq=27 
existing_state=STATE_STANDBY
2018-02-05 11:22:03.346583 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 
<http://10.212.14.11:6818/4702> >> 10.212.14.17:6802/82845 
<http://10.212.14.17:6802/82845> conn(0xe254cca800 :6818 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg accept connect_seq 28 vs existing csq=27 
existing_state=STATE_STANDBY

​


M
​
any thanks!​
​Jakub​
​

On Mon, Feb 5, 2018 at 9:56 AM, Frédéric Nass 
<frederic.n...@univ-lorraine.fr> wrote:


Hi,

In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops
should be logged with the default setting of debug level messages:
https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320
There's also an RFE for merging to be logged as well as splitting:
https://bugzilla.redhat.com/show_bug.cgi?id=1523532

Regards,

Frédéric.


On 02/02/2018 at 17:00, Frédéric Nass wrote:


Hi,

Split and merge operations happen during writes only, splitting
on file creation and merging on file deletion.

As you don't see any blocked requests during reads I would guess
your issue happens during splitting. Now that you increased
filestore_merge_threshold and filestore_split_multiple, you
shouldn't expect any splitting operations to happen any soon, nor
any merging operations, unless your workload consists of writing
a huge number of files and removing them.

You should check how many files are in each lower directories of
pool 20's PGs. This would help to confirm that the blocked
requests come with the splitting.

We now use the below script (on one of the OSD nodes) to get an
average value of the number of f

Re: [ceph-users] restrict user access to certain rbd image

2018-02-02 Thread Frédéric Nass

Hi,

We use this on our side:

$ rbd create rbd-image --size 1048576 --pool rbd --image-feature layering
$ rbd create rbd-other-image --size 1048576 --pool rbd --image-feature 
layering


$ rbd info rbd/rbd-image
rbd image 'rbd-image':
    size 1024 GB in 262144 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.2b36cf238e1f29
    format: 2
    features: layering
    flags:

$ ceph auth get-or-create client.rbd.image mon 'allow r' osd 'allow rwx 
pool rbd object_prefix rbd_data.2b36cf238e1f29; allow rwx pool rbd 
object_prefix rbd_header.2b36cf238e1f29; allow rx pool rbd object_prefix 
rbd_id.rbd-image' -o /etc/ceph/ceph.client.rbd.image.keyring


$ rbd -p rbd --keyring=/etc/ceph/ceph.client.rbd.image.keyring 
--id=rbd.image info rbd-image

rbd image 'rbd-image':
    size 1024 GB in 262144 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.2b36cf238e1f29
    format: 2
    features: layering
    flags:

$ rbd -p rbd --keyring=/etc/ceph/ceph.client.rbd.image.keyring 
--id=rbd.image info rbd-other-image

rbd: error opening image rbd-other-image: (1) Operation not permitted
2018-02-02 17:19:13.758624 7f38d76fd700 -1 librbd::image::OpenRequest: 
failed to stat v2 image header: (1) Operation not permitted
2018-02-02 17:19:13.758724 7f38d6efc700 -1 librbd::ImageState: 
0x55ac0ea6b7f0 failed to open image: (1) Operation not permitted


$ rbd --keyring=/etc/ceph/ceph.client.rbd.image.keyring --id=rbd.image 
-p rbd ls

rbd: list: (1) Operation not permitted

Regards,

Frédéric.

On 02/02/2018 at 17:05, Gregory Farnum wrote:
I don't think it's well-integrated with the tooling, but check out the 
cephx docs for the "prefix" level of access. It lets you grant access 
only to objects whose name matches a prefix, which for rbd would be 
the rbd volume ID (or name? Something easy to identify).

-Greg

On Fri, Feb 2, 2018 at 7:42 AM > wrote:


Hello!

I wonder if it's possible in ceph Luminous to manage user access
to rbd images on per image (but not
the whole rbd pool) basis?
I need to provide rbd images for my users but would like to
disable their ability to list all images
in a pool as well as to somehow access/use ones if a ceph admin
didn't authorize that.







Re: [ceph-users] Ceph-ISCSI

2017-10-17 Thread Frédéric Nass
Hi folks, 

For those who missed it, the fun was here :-) : 
https://youtu.be/IgpVOOVNJc0?t=3715 

Frederic. 

- On 11 Oct 17, at 17:05, Jake Young wrote: 

> On Wed, Oct 11, 2017 at 8:57 AM Jason Dillaman < [ mailto:jdill...@redhat.com 
> |
> jdill...@redhat.com ] > wrote:

>> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López < [
>> mailto:jorp...@unizar.es | jorp...@unizar.es ] > wrote:

>>> As far as I am able to understand there are 2 ways of setting iscsi for ceph

>>> 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...

>> The target_core_rbd approach is only utilized by SUSE (and its derivatives 
>> like
>> PetaSAN) as far as I know. This was the initial approach for Red Hat-derived
>> kernels as well until the upstream kernel maintainers indicated that they
>> really do not want a specialized target backend for just krbd. The next 
>> attempt
>> was to re-use the existing target_core_iblock to interface with krbd via the
>> kernel's block layer, but that hit similar upstream walls trying to get 
>> support
>> for SCSI command passthrough to the block layer.

>>> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)

>> The TCMU approach is what upstream and Red Hat-derived kernels will support
>> going forward.
>> The lrbd project was developed by SUSE to assist with configuring a cluster 
>> of
>> iSCSI gateways via the cli. The ceph-iscsi-config + ceph-iscsi-cli projects 
>> are
>> similar in goal but take a slightly different approach. ceph-iscsi-config
>> provides a set of common Python libraries that can be re-used by 
>> ceph-iscsi-cli
>> and ceph-ansible for deploying and configuring the gateway. The 
>> ceph-iscsi-cli
>> project provides the gwcli tool which acts as a cluster-aware replacement for
>> targetcli.

>>> I don't know which one is better, I am seeing that oficial support is 
>>> pointing
>>> to tcmu but i havent done any testbench.

>> We (upstream Ceph) provide documentation for the TCMU approach because that 
>> is
>> what is available against generic upstream kernels (starting with 4.14 when
>> it's out). Since it uses librbd (which still needs to undergo some 
>> performance
>> improvements) instead of krbd, we know that librbd 4k IO performance is 
>> slower
>> compared to krbd, but 64k and 128k IO performance is comparable. However, I
>> think most iSCSI tuning guides would already tell you to use larger block 
>> sizes
>> (i.e. 64K NTFS blocks or 32K-128K ESX blocks).

>>> Does anyone tried both? Do they give the same output? Are both able to 
>>> manage
>>> multiple iscsi targets mapped to a single rbd disk?

>> Assuming you mean multiple portals mapped to the same RBD disk, the answer is
>> yes, both approaches should support ALUA. The ceph-iscsi-config tooling will
>> only configure Active/Passive because we believe there are certain edge
>> conditions that could result in data corruption if configured for 
>> Active/Active
>> ALUA.

>> The TCMU approach also does not currently support SCSI persistent reservation
>> groups (needed for Windows clustering) because that support isn't available 
>> in
>> the upstream kernel. The SUSE kernel has an approach that utilizes two
>> round-trips to the OSDs for each IO to simulate PGR support. Earlier this
>> summer I believe SUSE started to look into how to get generic PGR support
>> merged into the upstream kernel using corosync/dlm to synchronize the states
>> between multiple nodes in the target. I am not sure of the current state of
>> that work, but it would benefit all LIO targets when complete.

>>> I will try to make my own testing but if anyone has tried in advance it 
>>> would be
>>> really helpful.

>>> Jorge Pinilla López
>>> [ mailto:jorp...@unizar.es | jorp...@unizar.es ]



>> --
>> Jason

> Thanks Jason!

> You should cut and paste that answer into a blog post on [ http://ceph.com/ |
> ceph.com ] . It is a great summary of where things stand

Re: [ceph-users] inconsistent pg on erasure coded pool

2017-10-05 Thread Frédéric Nass

Hi Kenneth,

You should check for drive or XFS related errors in /var/log/messages 
files on all nodes. We've had a similar issue in the past with a bad 
block on a hard drive.

We've had to :

1. Stop the OSD associated to the drive that had a bad block, flush its 
journal (ceph-osd -i $osd --flush-journal) and umount the filesystem,

2. Clear the bad blocks in the RAID/PERC Controller,
3. xfs_repair the partition, and partprobe the drive to start the OSD again,
4. ceph pg repair <pg.id>
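
In command form, that was roughly (a sketch; the osd id, device and pg id are 
examples from this thread): 

systemctl stop ceph-osd@81 
ceph-osd -i 81 --flush-journal 
umount /var/lib/ceph/osd/ceph-81 
# clear the bad blocks in the RAID/PERC controller, then: 
xfs_repair /dev/sdX1 
partprobe /dev/sdX                  # re-triggers udev activation of the OSD 
ceph pg repair 5.144 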

Regards,

Frédéric.

On 04/10/2017 at 14:02, Kenneth Waegeman wrote:

Hi,

We have some inconsistency / scrub error on a Erasure coded pool, that 
I can't seem to solve.


[root@osd008 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 5.144 is active+clean+inconsistent, acting 
[81,119,148,115,142,100,25,63,48,11,43]

1 scrub errors

In the log files, it seems there is 1 missing shard:

/var/log/ceph/ceph-osd.81.log.2.gz:2017-10-02 23:49:11.940624 
7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144s0 shard 63(7) 
missing 5:2297a2e1:::10014e2d8d5.:head
/var/log/ceph/ceph-osd.81.log.2.gz:2017-10-03 00:48:06.681941 
7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144s0 deep-scrub 1 
missing, 0 inconsistent objects
/var/log/ceph/ceph-osd.81.log.2.gz:2017-10-03 00:48:06.681947 
7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144 deep-scrub 1 
errors


I tried running ceph pg repair on the pg, but nothing changed. I also 
tried starting a new deep-scrub on the  osd 81 (ceph osd deep-scrub 
81) but I don't see any deep-scrub starting at the osd.


How can we solve this ?

Thank you!


Kenneth





Re: [ceph-users] Hammer to Jewel upgrade questions

2017-05-16 Thread Frédéric Nass
- On 16 May 17, at 20:43, Shain Miley wrote: 

> Hello,

> I am going to be upgrading our production Ceph cluster from
> Hammer/Ubuntu 14.04 to Jewel/Ubuntu 16.04 and I wanted to ask a question
> and sanity check my upgrade plan.

> Here are the steps I am planning to take during the upgrade:

Hi Shain, 

0) upgrade operating system packages first and reboot on new kernel if needed. 

> 1)Upgrade to latest hammer on current cluster
> 2)Remove or rename the existing ‘ceph’ user and ‘ceph’ group on each node
> 3)Upgrade the ceph packages to latest Jewel (mon, then osd, then rbd
> clients)
You might want to upgrade the RBD clients first. This may not be a mandatory 
step but a cautious one. 

> 4)stop ceph daemons
> 5)change permissions on ceph directories and osd journals:

> find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -type d|parallel chown -R
> 64045:64045
> chown 64045:64045 /var/lib/ceph
> chown 64045:64045 /var/lib/ceph/*
> chown 64045:64045 /var/lib/ceph/bootstrap-*/*

> for ID in $(ls /var/lib/ceph/osd/|cut -d '-' -f 2); do
> JOURNAL=$(readlink -f /var/lib/ceph/osd/ceph-${ID}/journal)
> chown ceph ${JOURNAL}
> done

You can avoid this step by adding setuser_match_path = 
/var/lib/ceph/$type/$cluster-$id to the [osd] section. This will make the Ceph 
daemons run as root if the daemon’s data directory is still owned by root. 
Newly deployed daemons will be created with data owned by user ceph and will 
run with reduced privileges, but upgraded daemons will continue to run as root. 
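
That is, something like this in ceph.conf (a sketch): 

[osd] 
setuser match path = /var/lib/ceph/$type/$cluster-$id 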

Or you can still change the ownership of the files to ceph, but it might take long 
depending on the number of objects and PGs you have in your cluster, for an 
average zero benefit, especially since when bluestore comes out, you'll 
recreate all that data. 

> 6)restart ceph daemons

> The two questions I have are:
> 1)Am I missing anything from the steps above...based on prior
> experiences performing upgrades of this kind?

> 2)Should I upgrade to Ubuntu 16.04 first and then upgrade Ceph...or vice
> versa?
This documentation (http://docs.ceph.com/docs/master/start/os-recommendations/) 
suggests sticking with Ubuntu 14.04, but the RHCS KB shows that RHCS 2.x (Jewel 
10.2.x) is only supported on Ubuntu 16.04. 
When upgrading from Hammer to Jewel, we upgraded the OS first from RHEL 7 to 7.1, 
then RHCS. I'm not sure whether you should temporarily run Hammer on Ubuntu 
16.04 or Jewel on Ubuntu 14.04. 
I would upgrade the lowest layer (OS) of a single OSD node first and see how it 
goes. 

Regards, 

Frederic. 

> Thanks in advance,
> Shain

> --
> NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org 
> |
> 202.513.3649



Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-01 Thread Frédéric Nass



On 28/04/2017 at 17:03, Mark Nelson wrote:

On 04/28/2017 08:23 AM, Frédéric Nass wrote:


On 28/04/2017 at 15:19, Frédéric Nass wrote:


Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on
Luminous dev (1st release) and came to the same (early) conclusion
regarding the performance drop with many small objects on bluestore,
whatever the number of PGs is on a pool. Here is the graph I generated
from the results:



The test was run on a 36 OSDs cluster (3x R730xd with 12x 4TB SAS
drives) with rocksdb and WAL on same SAS drives.
Test consisted of multiple runs of the following command on a size 1
pool : rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128
--no-cleanup


Correction: test was made on a size 1 pool hosted on a single 12x OSDs
node. The rados bench was run from this single host (to this single 
host).


Frédéric.


If you happen to have time, I would be very interested to see what the 
compaction statistics look like in rocksdb (available via the osd 
logs).  We actually wrote a tool that's in the cbt tools directory 
that can parse the data and look at what rocksdb is doing.  Here's 
some of the data we collected last fall:


https://drive.google.com/open?id=0B2gTBZrkrnpZRFdiYjFRNmxLblU

The idea there was to try to determine how WAL buffer size / count and 
min_alloc size affected the amount of compaction work that rocksdb was 
doing.  There are also some more general compaction statistics that 
are more human readable in the logs that are worth looking at (ie 
things like write amp and such).


The gist of it is that as you do lots of small writes the amount of 
metadata that has to be kept track of in rocksdb increases, and 
rocksdb ends up doing a *lot* of compaction work, with the associated 
read and write amplification.  The only ways to really deal with this 
are to either reduce the amount of metadata (onodes, extents, etc) or 
see if we can find any ways to reduce the amount of work rocksdb has 
to do.


On the first point, increasing the min_alloc size in bluestore tends 
to help, but with tradeoffs.  Any io smaller than the min_alloc size 
will be doubly-written like with filestore journals, so you trade 
reducing metadata for an extra WAL write. We did a bunch of testing 
last fall and at least on NVMe it was better to use a 16k min_alloc 
size and eat the WAL write than use a 4K min_alloc size, skip the WAL 
write, but shove more metadata at rocksdb.  For HDDs, I wouldn't 
expect too bad of behavior with the default 64k min alloc size, but it 
sounds like it could be a problem based on your results.  That's why 
it would be interesting to see if that's what's happening during your 
tests.
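
For reference, the knobs involved here are the bluestore min_alloc options; a sketch only, since exact names and defaults should be checked against your release, and they only take effect when the OSD is (re)created:

    [osd]
        # smaller min_alloc = less metadata in rocksdb, but more double-writes via the WAL
        bluestore min alloc size ssd = 16384    # 16k, per the NVMe testing above
        bluestore min alloc size hdd = 65536    # 64k default for spinners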


Another issue is that short lived WAL writes potentially can leak into 
level0 and cause additional compaction work.  Sage has a pretty clever 
idea to fix this but we need someone knowledgeable about rocksdb to go 
in and try to implement it (or something like it).


Anyway, we still see a significant amount of work being done by 
rocksdb due to compaction, most of it being random reads.  We actually 
spoke about this quite a bit yesterday at the performance meeting.  If 
you look at a wallclock profile of 4K random writes, you'll see a ton 
of work being done on compact (about 70% in total of thread 2):


https://paste.fedoraproject.org/paste/uS3LHRHw2Yma0iUYSkgKOl5M1UNdIGYhyRLivL9gydE= 



One thing we are still confused about is why rocksdb is doing 
random_reads for compaction rather than sequential reads.  It would be 
really great if someone that knows rocksdb well could help us 
understand why it's doing this.


Ultimately for something like RBD I suspect the performance will stop 
dropping once you've completely filled the disk with 4k random writes. 
For RGW type work, the more tiny objects you add the more data rocksdb 
has to keep track of and the more rocksdb is going to slow down.  It's 
not the same problem filestore suffers from, but it's similar in that 
the more keys/bytes/levels rocksdb has to deal with, the more data 
gets moved around between levels, the more background work that 
happens, the more likely we are waiting on rocksdb before we can write 
more data.


Mark



Hi Mark,

This is very interesting. I actually did use "bluefs buffered io = true" 
and "bluestore compression mode = aggressive" during the tests as I saw 
these 2 options were improving write performances (x4) but didn't look 
at the logs for compaction statistics. These nodes I used for the tests 
made it to production so I won't be able to reproduce the test any soon, 
but I will when we get new hardware.


Frederic.





I hope this will improve as the performance drop seems more related to
how many objects are in the pool (> 40M) rather than how many objects
are written each second.
Like Wido, I was thinking that we may have to increase the number of
dis

Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-04-28 Thread Frédéric Nass


On 28/04/2017 at 15:19, Frédéric Nass wrote:


Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on 
Luminous dev (1st release) and came to the same (early) conclusion 
regarding the performance drop with many small objects on bluestore, 
whatever the number of PGs is on a pool. Here is the graph I generated 
from the results:




The test was run on a 36 OSDs cluster (3x R730xd with 12x 4TB SAS 
drives) with rocksdb and WAL on same SAS drives.
Test consisted of multiple runs of the following command on a size 1 
pool : rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128 
--no-cleanup


Correction: test was made on a size 1 pool hosted on a single 12x OSDs 
node. The rados bench was run from this single host (to this single host).


Frédéric.



I hope this will improve, as the performance drop seems related more to 
how many objects are in the pool (> 40M) than to how many objects 
are written each second.
Like Wido, I was thinking that we may have to increase the number of 
disks in the future to keep up with the performance needed for our 
Zimbra messaging use case.
Or move data from the current EC pool to a replicated pool, as erasure 
coding doesn't help either for this type of use case.


Regards,

Frédéric.

On 26/04/2017 at 22:25, Wido den Hollander wrote:

On 24 April 2017 at 19:52, Florian Haas wrote:


Hi everyone,

so this will be a long email — it's a summary of several off-list
conversations I've had over the last couple of weeks, but the TL;DR
version is this question:

How can a Ceph cluster maintain near-constant performance
characteristics while supporting a steady intake of a large number of
small objects?

This is probably a very common problem, but we have a bit of a dearth of
truly adequate best practices for it. To clarify, what I'm talking about
is an intake on the order of millions per hour. That might sound like a
lot, but if you consider an intake of 700 objects/s at 20 KiB/object,
that's just 14 MB/s. That's not exactly hammering your cluster — but it
amounts to 2.5 million objects created per hour.


I have seen that the amount of objects at some point becomes a problem.

Eventually you will have scrubs running and especially a deep-scrub will cause 
issues.

I have never had the use-case to have a sustained intake of so many 
objects/hour, but it is interesting though.


Under those circumstances, two things tend to happen:

(1) There's a predictable decline in insert bandwidth. In other words, a
cluster that may allow inserts at a rate of 2.5M/hr rapidly goes down to
1.8M/hr and then 1.7M/hr ... and by "rapidly" I mean hours, not days. As
I understand it, this is mainly due to the FileStore's propensity to
index whole directories with a readdir() call which is a linear-time
operation.

(2) FileStore's mitigation strategy for this is to proactively split
directories so they never get so large as for readdir() to become a
significant bottleneck. That's fine, but in a cluster with a steadily
growing number of objects, that tends to lead to lots and lots of
directory splits happening simultaneously — causing inserts to slow to a
crawl.

For (2) there is a workaround: we can initialize a pool with an expected
number of objects, set a pool max_objects quota, and disable on-demand
splitting altogether by setting a negative filestore merge threshold.
That way, all splitting occurs at pool creation time, and before another
split were to happen, you hit the pool quota. So you never hit that
brick wall caused by the thundering herd of directory splits. Of course,
it also means that when you want to insert yet more objects, you need
another pool — but you can handle that at the application level.
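
Concretely, that workaround looks something like this (pool name, PG count, ruleset and object count are placeholders; the trailing expected-num-objects argument is what triggers the up-front splitting):

    # ceph.conf, before creating the pool
    [osd]
        filestore merge threshold = -10     # negative: no on-demand merging/splitting, splitting happens at pool creation
        filestore split multiple = 2

    # create the pool with a pre-declared object count, then cap it
    ceph osd pool create intake2017 4096 4096 replicated replicated_ruleset 100000000
    ceph osd pool set-quota intake2017 max_objects 100000000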

It's actually a bit of a dilemma: we want directory splits to happen
proactively, so that readdir() doesn't slow things down, but then we
also *don't* want them to happen, because while they do, inserts flatline.

(2) will likely be killed off completely by BlueStore, because there are
no more directories, hence nothing to split.

For (1) there really isn't a workaround that I'm aware of for FileStore.
And at least preliminary testing shows that BlueStore clusters suffer
from similar, if not the same, performance degradation (although, to be
fair, I haven't yet seen tests under the above parameters with rocksdb
and WAL on NVMe hardware).


Can you point me to this testing of BlueStore?


For (1) however I understand that there would be a potential solution in
FileStore itself, by throwing away Ceph's own directory indexing and
just rely on flat directory lookups — which should be logarithmic-time
operations in both btrfs and XFS, as both use B-trees for directory
indexing. But I understand that that would be a fairly massive operation
that looks even less attractive to undertake with BlueStore around the
corner.

One suggestion that has 

Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-04-28 Thread Frédéric Nass

Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on 
Luminous dev (1st release) and came to the same (early) conclusion 
regarding the performance drop with many small objects on bluestore, 
whatever the number of PGs is on a pool. Here is the graph I generated 
from the results:




The test was run on a 36 OSDs cluster (3x R730xd with 12x 4TB SAS 
drives) with rocksdb and WAL on same SAS drives.
Test consisted of multiple runs of the following command on a size 1 
pool : rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128 
--no-cleanup


I hope this will improve, as the performance drop seems related more to 
how many objects are in the pool (> 40M) than to how many objects 
are written each second.
Like Wido, I was thinking that we may have to increase the number of 
disks in the future to keep up with the performance needed for our 
Zimbra messaging use case.
Or move data from the current EC pool to a replicated pool, as erasure 
coding doesn't help either for this type of use case.


Regards,

Frédéric.

On 26/04/2017 at 22:25, Wido den Hollander wrote:

On 24 April 2017 at 19:52, Florian Haas wrote:


Hi everyone,

so this will be a long email — it's a summary of several off-list
conversations I've had over the last couple of weeks, but the TL;DR
version is this question:

How can a Ceph cluster maintain near-constant performance
characteristics while supporting a steady intake of a large number of
small objects?

This is probably a very common problem, but we have a bit of a dearth of
truly adequate best practices for it. To clarify, what I'm talking about
is an intake on the order of millions per hour. That might sound like a
lot, but if you consider an intake of 700 objects/s at 20 KiB/object,
that's just 14 MB/s. That's not exactly hammering your cluster — but it
amounts to 2.5 million objects created per hour.


I have seen that the amount of objects at some point becomes a problem.

Eventually you will have scrubs running and especially a deep-scrub will cause 
issues.

I have never had the use-case to have a sustained intake of so many 
objects/hour, but it is interesting though.


Under those circumstances, two things tend to happen:

(1) There's a predictable decline in insert bandwidth. In other words, a
cluster that may allow inserts at a rate of 2.5M/hr rapidly goes down to
1.8M/hr and then 1.7M/hr ... and by "rapidly" I mean hours, not days. As
I understand it, this is mainly due to the FileStore's propensity to
index whole directories with a readdir() call which is a linear-time
operation.

(2) FileStore's mitigation strategy for this is to proactively split
directories so they never get so large as for readdir() to become a
significant bottleneck. That's fine, but in a cluster with a steadily
growing number of objects, that tends to lead to lots and lots of
directory splits happening simultaneously — causing inserts to slow to a
crawl.

For (2) there is a workaround: we can initialize a pool with an expected
number of objects, set a pool max_objects quota, and disable on-demand
splitting altogether by setting a negative filestore merge threshold.
That way, all splitting occurs at pool creation time, and before another
split were to happen, you hit the pool quota. So you never hit that
brick wall caused by the thundering herd of directory splits. Of course,
it also means that when you want to insert yet more objects, you need
another pool — but you can handle that at the application level.

It's actually a bit of a dilemma: we want directory splits to happen
proactively, so that readdir() doesn't slow things down, but then we
also *don't* want them to happen, because while they do, inserts flatline.

(2) will likely be killed off completely by BlueStore, because there are
no more directories, hence nothing to split.

For (1) there really isn't a workaround that I'm aware of for FileStore.
And at least preliminary testing shows that BlueStore clusters suffer
from similar, if not the same, performance degradation (although, to be
fair, I haven't yet seen tests under the above parameters with rocksdb
and WAL on NVMe hardware).


Can you point me to this testing of BlueStore?


For (1) however I understand that there would be a potential solution in
FileStore itself, by throwing away Ceph's own directory indexing and
just rely on flat directory lookups — which should be logarithmic-time
operations in both btrfs and XFS, as both use B-trees for directory
indexing. But I understand that that would be a fairly massive operation
that looks even less attractive to undertake with BlueStore around the
corner.

One suggestion that has been made (credit to Greg) was to do object
packing, i.e. bunch up a lot of discrete data chunks into a single RADOS
object. But in terms of distribution and lookup logic that would have to
be built on top, that seems weird to me (CRUSH on top of CRUSH to find
out which RADOS object a chunk belongs to, or some such?)

Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-27 Thread Frédéric Nass

Hi Adam,

What Greg and Chris are referring to is the SSD write cliff aka write 
amplification:


- https://flashstorageguy.wordpress.com/tag/write-cliff/
- https://en.wikipedia.org/wiki/Write_amplification

This and the MTBF are the main reasons to choose enterprise-grade SSDs 
over consumer-grade SSDs, 
especially if each SSD serves many (4+) spinning disks, as the 
write cliff phenomenon will be hit sooner.


Regards,

Frédéric.

On 26/04/2017 at 16:53, Adam Carheden wrote:

What I'm trying to get from the list is /why/ the "enterprise" drives
are important. Performance? Reliability? Something else?

The Intel was the only one I was seriously considering. The others were
just ones I had for other purposes, so I thought I'd see how they fared
in benchmarks.

The Intel was the clear winner, but my tests did show that throughput
tanked with more threads. Hypothetically, if I was throwing 16 OSDs at
it, all with osd op threads = 2, do the benchmarks below not show that
the Hynix would be a better choice (at least for performance)?

Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
the single drive leaves more bays free for OSD disks, but is there any
other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
mean:

a) fewer OSDs go down if the SSD fails

b) better throughput (I'm speculating that the S3610 isn't 4 times
faster than the S3520)

c) load spread across 4 SATA channels (I suppose this doesn't really
matter since the drives can't throttle the SATA bus).




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-04-26 Thread Frédéric Nass
Hi Greg,

Thanks a lot for your work on this one. It really helps us right now.

Would it be easy to add the snaptrim speed to the 'ceph -s' output, like "snaptrim io 144 
MB/s, 721 objects/s" (or just objects/s if sizes are unknown)? 
It would help to see how the snaptrim speed changes along with the snap trimming 
options. 

When a snapshot is removed, all primary OSDs seem to start trimming at the same 
time. Can we avoid this or limit their number ?
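
Judging from the changes quoted below, the knobs for that would be something like the following once we're on a fixed release (values purely illustrative):

    [osd]
        osd snap trim sleep = 0.05      # seconds to sleep between trim operations
        osd max trimming pgs = 1        # limit concurrent trimming PGs per primary OSD

    # or at runtime:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.05 --osd_max_trimming_pgs 1'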

Best regards,

Frédéric Nass.

- On 26 Apr 17, at 20:24, Gregory Farnum gfar...@redhat.com wrote:

> Hey all,

> Resurrecting this thread because I just wanted to let you know that
> Sam's initial work in master has been backported to Jewel and will be
> in the next (10.2.8, I think?) release:
> https://github.com/ceph/ceph/pull/14492/

> Once upgraded, it will be safe to use the "osd snap trim sleep" option
> again. It also adds a new "osd max trimming pgs" (default 2) that
> limits the number of PGs each primary will simultaneously trim on, and
> adds "snaptrim" and "snaptrim_wait" to the list of reported PG states.
> :)

> (For those of you running Kraken, its backport hasn't merged yet but
> is at https://github.com/ceph/ceph/pull/14597)
> -Greg

> On Tue, Feb 21, 2017 at 3:32 PM, Nick Fisk  wrote:
> > Yep sure, will try and present some figures at tomorrow’s meeting again.



> > From: Samuel Just [mailto:sj...@redhat.com]
> > Sent: 21 February 2017 18:14


> > To: Nick Fisk 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?



> > Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
> > non-blocking implementation) to that branch. Care to take it for a whirl?

> > -Sam



> > On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk  wrote:

> > Building now



> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Samuel Just
> > Sent: 09 February 2017 19:22
> > To: Nick Fisk 
> > Cc: ceph-users@lists.ceph.com


> > Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?



> > Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
> > master) passed a rados suite. It adds a configurable limit to the number of
> > pgs which can be trimming on any OSD (default: 2). PGs trimming will be in
> > snaptrim state, PGs waiting to trim will be in snaptrim_wait state. I
> > suspect this'll be adequate to throttle the amount of trimming. If not, I
> > can try to add an explicit limit to the rate at which the work items trickle
> > into the queue. Can someone test this branch? Tester beware: this has not
> > merged into master yet and should only be run on a disposable cluster.

> > -Sam



> > On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk  wrote:

> > Yeah it’s probably just the fact that they have more PG’s so they will hold
> > more data and thus serve more IO. As they have a fixed IO limit, they will
> > always hit the limit first and become the bottleneck.



> > The main problem with reducing the filestore queue is that I believe you
> > will start to lose the benefit of having IO’s queued up on the disk, so that
> > the scheduler can re-arrange them to action them in the most efficient manor
> > as the disk head moves across the platters. You might possibly see up to a
> > 20% hit on performance, in exchange for more consistent client latency.



> > From: Steve Taylor [mailto:steve.tay...@storagecraft.com]
> > Sent: 07 February 2017 20:35
> > To: n...@fisk.me.uk; ceph-users@lists.ceph.com


> > Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> > sleep?



> > Thanks, Nick.



> > One other data point that has come up is that nearly all of the blocked
> > requests that are waiting on subops are waiting for OSDs with more PGs than
> > the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> > OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> > individually have 33% more PGs than the others and are causing almost all of
> > the blocked requests. It appears that maps updates are generally not
> > blocking long enough to show up as blocked requests.



> > I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> > I’ll test some more when the PG counts per OSD are more balanced and see
> > what I get. I’ll also play with the filestore queue. I was telling some of
> > my colleagues yesterday that this looked likely to be related to buffer
> > bloat somewhere. I appreciate the suggestion.



> > 

Re: [ceph-users] PG calculator improvement

2017-04-14 Thread Frédéric Nass
  I know it seems a bit self-serving
to make this suggestion as I work at Red Hat, but there is a lot on
the line when any establishment is storing potentially business
critical data.

I suspect the answer lies in a combination of the above or in
something I've not thought of. Please do weigh in as any and all
suggestions are more than welcome.

Thanks,
    Michael J. Kidd
Principal Software Maintenance Engineer
Red Hat Ceph Storage
+1 919-442-8878 


On Wed, Apr 12, 2017 at 6:35 AM, Frédéric Nass
<frederic.n...@univ-lorraine.fr> wrote:


Hi,

I wanted to share a bad experience we had due to how the PG
calculator works.

When we set our production cluster months ago, we had to
decide on the number of PGs to give to each pool in the cluster.
As you know, the PG calc recommends giving a lot of
PGs to pools that are heavy in size, regardless of the number of
objects in the pools. How bad...

We essentially had 3 pools to set on 144 OSDs :

1. an EC 5+4 pool for the radosGW (.rgw.buckets) that would hold
80% of all data in the cluster. PG calc recommended 2048 PGs.
2. an EC 5+4 pool for Zimbra's data (emails) that would hold 20%
of all data. PG calc recommended 512 PGs.
3. a replicated pool for Zimbra's metadata (null-size objects
holding xattrs - used for deduplication) that would hold 0% of
all data. PG calc recommended 128 PGs, but we decided on 256.

With 120M objects in pool #3, as soon as we upgraded to
Jewel, we hit the Jewel scrubbing bug (OSDs flapping).
Before we could upgrade to a patched Jewel and scrub the whole
cluster again prior to increasing the number of PGs on this
pool, we had to take more than a hundred snapshots (for
backup/restoration purposes), with the number of objects still
increasing in the pool. Then, when a snapshot was removed, we
hit the current Jewel snap trimming bug affecting pools with
too many objects for their number of PGs. The only way we could
stop the trimming was to stop OSDs, resulting in PGs being
degraded and no longer trimming (snap trimming only happens
on active+clean PGs).

We're now just getting out of this hole, thanks to Nick's post
regarding osd_snap_trim_sleep and RHCS support expertise.

If the PG calc had considered not only the pools' weight but
also the number of expected objects in each pool (which we knew
at that time), we wouldn't have hit these 2 bugs.
We hope this will help improve the ceph.com
and RHCS PG calculators.

Regards,

Frédéric.

-- 


Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG calculator improvement

2017-04-12 Thread Frédéric Nass


Hi,

I wanted to share a bad experience we had due to how the PG calculator 
works.


When we set our production cluster months ago, we had to decide on the 
number of PGs to give to each pool in the cluster.
As you know, the PG calc recommends giving a lot of PGs to pools that are heavy 
in size, regardless of the number of objects in the pools. How bad...


We essentially had 3 pools to set on 144 OSDs :

1. an EC 5+4 pool for the radosGW (.rgw.buckets) that would hold 80% of 
all data in the cluster. PG calc recommended 2048 PGs.
2. an EC 5+4 pool for Zimbra's data (emails) that would hold 20% of all 
data. PG calc recommended 512 PGs.
3. a replicated pool for Zimbra's metadata (null-size objects holding 
xattrs - used for deduplication) that would hold 0% of all data. PG 
calc recommended 128 PGs, but we decided on 256.


With 120M objects in pool #3, as soon as we upgraded to Jewel, we hit 
the Jewel scrubbing bug (OSDs flapping).
Before we could upgrade to a patched Jewel and scrub the whole cluster again 
prior to increasing the number of PGs on this pool, we had to take more 
than a hundred snapshots (for backup/restoration purposes), with the 
number of objects still increasing in the pool. Then, when a snapshot was 
removed, we hit the current Jewel snap trimming bug affecting pools with 
too many objects for their number of PGs. The only way we could stop the 
trimming was to stop OSDs, resulting in PGs being degraded and no longer 
trimming (snap trimming only happens on active+clean PGs).


We're now just getting out of this hole, thanks to Nick's post regarding 
osd_snap_trim_sleep and RHCS support expertise.


If the PG calc had considered not only the pools' weight but also the 
number of expected objects in each pool (which we knew at that time), we 
wouldn't have hit these 2 bugs.

We hope this will help improve the ceph.com and RHCS PG calculators.

Regards,

Frédéric.

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of objects 'in' a snapshot ?

2017-03-31 Thread Frédéric Nass
I just realized that Nick's post is only a few days old. Found the 
tracker : http://tracker.ceph.com/issues/19241


Frederic.

On 31/03/2017 at 10:12, Frédéric Nass wrote:


Hi,

Can we get the number of objects in a pool snapshot? (That is, how 
many objects will be removed when the snapshot is deleted.)


We're facing terrible performance issues due to snapshot removal in 
Jewel. Nick warns about using "osd_snap_trim_sleep" in Jewel 
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36429.html)

How bad is it exactly ? Is there a tracker somewhere about this ?

Regards,



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Number of objects 'in' a snapshot ?

2017-03-31 Thread Frédéric Nass


Hi,

Can we get the number of objects in a pool snapshot? (That is, how many 
objects will be removed when the snapshot is deleted.)


We're facing terrible performance issues due to snapshot removal in 
Jewel. Nick warns about using "osd_snap_trim_sleep" in Jewel 
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36429.html)

How bad is it exactly ? Is there a tracker somewhere about this ?

Regards,

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues on Jewel 10.2.2

2016-12-16 Thread Frédéric Nass

Hi,

1 - rados or rbd bug ? We're using rados bench.

2 - This is not bandwidth related. If it were, it would happen almost 
instantly and not 15 minutes after I start writing to the pool.
Once it has happened on the pool, I can then reproduce it with fewer 
--concurrent-ios, like 12 or even 1.


This happens with :
OSDs journals on SSDs with the SAS drives in Raid0 writeback with XFS 
and split/merge threshold 10/2 (default)
OSDs journals on SSDs with the SAS drives in Raid0 writeback with XFS 
and split/merge threshold 40/8
OSDs journals on SSDs with the SAS drives in Raid0 writeback with btrfs 
and split/merge threshold 10/2 (default)
OSDs journals on SAS drives (not using the SSDs) in Raid0 writeback with 
XFS and split/merge threshold 10/2 (default)
OSDs journals on SAS drives (not using the SSDs) in Raid0 write-through 
with XFS and split/merge threshold 10/2 (default). PERC H730p mini is 
not the culprit apparently.


I tried with bluestore but the OSDs wouldn't launch (even with an 
experimental ... = * set; I suppose it's disabled within RHCS 2.0), so I 
couldn't tell whether this is filestore related.


When the rados bench stops writing, we can see slow requests, and one or 
more SAS drives hitting 100% iostat usage, even with --concurrent-ios=1. 
With full debug on this particular OSD, we don't see any filestore 
operation anymore.

Just some recurring sched_scrub task and then some :

   -25> 2016-12-16 10:08:41.891756 7f8855051700  1 heartbeat_map 
is_healthy 'FileStore::op_tp thread 0x7f8865903700' had timed out after 60
   -24> 2016-12-16 10:08:41.891758 7f8855051700  1 heartbeat_map 
is_healthy 'FileStore::op_tp thread 0x7f8866104700' had timed out after 60
   -23> 2016-12-16 10:08:41.891759 7f8855051700  1 heartbeat_map 
is_healthy 'FileStore::op_tp thread 0x7f885f0f6700' had timed out after 60
   -22> 2016-12-16 10:08:41.891772 7f8856b57700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7f8842f10700' had timed out after 15
   -21> 2016-12-16 10:08:41.891775 7f8856b57700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7f884641b700' had timed out after 15
   -20> 2016-12-16 10:08:41.891777 7f8856b57700  1 heartbeat_map 
is_healthy 'FileStore::op_tp thread 0x7f885f8f7700' had timed out after 60
   -19> 2016-12-16 10:08:41.891779 7f8856b57700  1 heartbeat_map 
is_healthy 'FileStore::op_tp thread 0x7f88600f8700' had timed out after 60


then the OSD hit the suicide timeout :

 0> 2016-12-16 10:08:42.031740 7f8856b57700 -1 
common/HeartbeatMap.cc: In function 'bool 
ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, 
time_t)' thread 7f8856b57700 time 2016-12-16 10:08:42.029391

common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x85) [0x7f887873be25]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char 
const*, long)+0x2e1) [0x7f88786783a1]

 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f8878678bfe]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f88780b206f]
 5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f88780b329b]
 6: (DispatchQueue::entry()+0x78a) [0x7f88787fcd0a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f887871761d]
 8: (()+0x7dc5) [0x7f887666adc5]
 9: (clone()+0x6d) [0x7f8874cf673d]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
[...]
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.16.log
--- end dump of recent events ---

and comes back to life on its own 2'42" later.

We use ceph version 10.2.2-41.el7cp 
(1ac1c364ca12fa985072174e75339bfb1f50e9ee) (RHCS 2.0).


We're hitting something here.

Regards,

Frederic.


On 15/12/2016 at 21:04, Vincent Godin wrote:

Hello,

I didn't look at your video but i already can tell you some tracks :

1 - there is a bug in 10.2.2 which makes the client cache not work. 
The client cache behaves as if it never received a flush, so it will stay in 
writethrough mode. This bug is fixed in 10.2.3.


2 - 2 SSDs in JBOD and 12 x 4TB NL SAS in RAID0 is not a very well 
optimized layout if your workload is write-heavy. Your writes will run at 
the max speed of your two SSDs only. I don't know the real 
speed of your SSDs or your SAS disks, but let's say:


your SSD can reach a 400 MB/s in write throughput
your SAS can reach a 130 MB/s in write throughput

I suppose that you use 1 SSD to host the journals of 6 SAS drives.
Your max write throughput will be 2 x 400 MB/s, so 800 MB/s, compared 
to the 12 x 130 MB/s = 1560 MB/s of your SAS drives.


if you had 4 SSD for the journal, 1 SSD for 3 SAS
Your max throughput would be 4 x 400

[ceph-users] Performance issues on Jewel 10.2.2.

2016-12-14 Thread Frédéric Nass


Hi,

We're having performance issues on a Jewel 10.2.2 cluster. It started 
with IOs taking several seconds to be acknowledged so we did some 
benchmarks.


We could reproduce it with a rados bench on a new pool set up on a single host 
(R730xd with 2 SSDs in JBOD and 12x 4TB NL SAS in RAID0 writeback) with 
no replication (min_size 1, size 1).
We suspect this could be related to the XFS filestore split operation or some 
other filestore operation.


Could someone have a look at this video : 
https://youtu.be/JQV3VfpAjbM?vq=hd1080


Video shows :

- admin node with commands and comments (top left)
- htop (middle left)
- rados bench (bottom left)
- iostat (top right)
- growing number of dirs in all PGs of that pool on osd.12 (/dev/sdd) 
and growing number of objects in the pool. (bottom right)


OSD debug log, perf report and osd params :

ceph-osd.12.log (http://u2l.fr/ceph-osd-12-log-tgz) with full debug log 
on from 12:00:26 to 12:00:36. On the video at 17'26" we can see that 
osd.12 (/dev/sdd) is 100% busy at 12:00:26.
test_perf_report.txt (http://u2l.fr/test-perf-report-txt) based on 
perf.data from 12:02:50 to 12:03:44.

mom02h06_osd.12_config_show.txt (http://u2l.fr/osd-12-config-show)
mom02h06_osd.12_config_diff.txt (http://u2l.fr/osd-12-config-diff)
ceph-conf-osd-params.txt (http://u2l.fr/ceph-conf-osd-params)

Regards,

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore_split_multiple hardcoded maximum?

2016-12-08 Thread Frédéric Nass
David, you might also be interested in the new Jewel 10.2.4 tool called 
'ceph-objectstore-tool' from Josh. 

It allows splitting filestore directories offline 
(http://tracker.ceph.com/issues/17220), though apparently not merging them. 
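
If I read the tracker right, usage would be along these lines, run against a stopped OSD (untested here, so double-check the --op name and options against your 10.2.4 binaries): 

    systemctl stop ceph-osd@12
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --op apply-layout-settings --pool .rgw.buckets
    systemctl start ceph-osd@12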

Regards, 

Frédéric. 

- On 27 Sep 16, at 0:42, David Turner wrote: 

> We are running on Hammer 0.94.7 and have had very bad experiences with PG
> folders splitting a sub-directory further. OSDs being marked out, hundreds of
> blocked requests, etc. We have modified our settings and watched the behavior
> match the ceph documentation for splitting, but right now the subfolders are
> splitting outside of what the documentation says they should.

> filestore_split_multiple * abs(filestore_merge_threshold) * 16

> Our filestore_merge_threshold is set to 40. When we had our
> filestore_split_multiple set to 8, we were splitting subfolders when a
> subfolder had (8 * 40 * 16 = ) 5120 objects in the directory. In a different
> cluster we had to push that back again with elevated settings and the
> subfolders split when they had (16 * 40 * 16 = ) 10240 objects.

> We have another cluster that we're working with that is splitting at a value
> that seems to be a hardcoded maximum. The settings are (32 * 40 * 16 = ) 20480
> objects before it should split, but it seems to be splitting subfolders at
> 12800 objects.

> Normally I would expect this number to be a power of 2, but we recently found
> another hardcoded maximum of the object map only allowing RBD's with a maximum
> 256,000,000 objects in them. The 12800 matches that as being a base 2 followed
> by a set of zero's to be the hardcoded maximum.

> Has anyone else encountered what seems to be a hardcoded maximum here? Are we
> missing a setting elsewhere that is capping us, or diminishing our value? Much
> more to the point, though, is there any way to mitigate how painful it is to
> split subfolders in PGs? So far it seems like the only way we can do it is to
> push up the setting to later drop it back down during a week that we plan to
> have our cluster plagued with blocked requests all while cranking our
> osd_heartbeat_grace so that we don't have flapping osds.

> A little more about our setup is that we have 32x 4TB HGST drives with 4x 
> 200GB
> Intel DC3710 journals (8 drives per journal), dual hyper-threaded octa-core
> Xeon (32 virtual cores), 192GB memory, 10Gb redundant network... per storage
> node.


>   David Turner | Cloud Operations Engineer | StorageCraft Technology 
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943

>   If you are not the intended recipient of this message or received it
>   erroneously, please notify the sender and delete it, together with any
>   attachments, and be advised that any dissemination or copying of this 
> message
>   is prohibited.

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore_split_multiple hardcoded maximum?

2016-12-08 Thread Frédéric Nass
Hi David, 

I'm surprised your message hasn't gotten any echo yet. I guess it depends on how 
many files your OSDs end up storing on the filesystem, which depends essentially 
on the use case. 

We're having similar issues with a 144-OSD cluster running 2 pools. Each one 
holds 100 M objects. One is replicated x3 (256 PGs) and the other is EC k=5, 
m=4 (512 PGs). 
That's 300 M + 900 M = 1.2 B files stored on XFS filesystems. 

We're observing that our PG subfolders only hold around 120 files each when 
they should hold around 320 (we're using the default split / merge values). 
All objects were created when the cluster was running Hammer. We're now running 
Jewel (RHCS 2.0 actually). 

We ran some tests on a Jewel backup infrastructure. Split happens at around 320 
files per directory, as expected. 
We have no idea why we're not seeing 320 files per PG subfolder on our 
production cluster pools. 

Everything we read suggests to raise the filestore_merge_threshold and 
filestore_split_multiple values to 40 / 8 : 

https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
 
https://bugzilla.redhat.com/show_bug.cgi?id=1219974 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041179.html 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html 

We now need to merge directories (while you need to split, apparently :-) 

We will do so by increasing filestore_merge_threshold in steps of 10 units, 
up to maybe 120, then lowering it back to 40. 
Between each step we'll run 'rados bench' (in cleanup mode) on both pools to 
generate enough delete operations to trigger merge operations on each PG. 
By running 'rados bench' at night, our clients won't be much impacted by 
blocked requests. 

Running this on your cluster would also trigger splits when rados bench writes to 
the pools. 

Also, note that you can set the merge and split values for a specific OSD in 
ceph.conf ([osd.123]), so you can see how that OSD reorganizes its PG trees when 
running a 'rados bench'. 
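
For example, something like this (OSD id and values are just an illustration): 

    [osd]
        filestore merge threshold = 40
        filestore split multiple = 8

    [osd.123]
        # test a different layout on a single OSD first
        filestore merge threshold = 50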

Regarding the OSDs flapping, does this happen when scrubbing? You may be hitting the 
Jewel scrubbing bug Sage reported about 3 weeks ago (look for 'stalls caused by 
scrub on jewel'). 
It's fixed in 10.2.4 and is waiting for QA to make it into RHCS >= 2.0. 

We are impacted by this bug because we have a lot of objects (200k) per PG 
with, I think, bad split / merge values. Lowering vfs_cache_pressure to 1 might 
also help to avoid the flapping. 

Regards, 

Frederic Nass, 
Université de Lorraine. 

- On 27 Sep 16, at 0:42, David Turner wrote: 

> We are running on Hammer 0.94.7 and have had very bad experiences with PG
> folders splitting a sub-directory further. OSDs being marked out, hundreds of
> blocked requests, etc. We have modified our settings and watched the behavior
> match the ceph documentation for splitting, but right now the subfolders are
> splitting outside of what the documentation says they should.

> filestore_split_multiple * abs(filestore_merge_threshold) * 16

> Our filestore_merge_threshold is set to 40. When we had our
> filestore_split_multiple set to 8, we were splitting subfolders when a
> subfolder had (8 * 40 * 16 = ) 5120 objects in the directory. In a different
> cluster we had to push that back again with elevated settings and the
> subfolders split when they had (16 * 40 * 16 = ) 10240 objects.

> We have another cluster that we're working with that is splitting at a value
> that seems to be a hardcoded maximum. The settings are (32 * 40 * 16 = ) 20480
> objects before it should split, but it seems to be splitting subfolders at
> 12800 objects.

> Normally I would expect this number to be a power of 2, but we recently found
> another hardcoded maximum of the object map only allowing RBD's with a maximum
> 256,000,000 objects in them. The 12800 matches that as being a base 2 followed
> by a set of zero's to be the hardcoded maximum.

> Has anyone else encountered what seems to be a hardcoded maximum here? Are we
> missing a setting elsewhere that is capping us, or diminishing our value? Much
> more to the point, though, is there any way to mitigate how painful it is to
> split subfolders in PGs? So far it seems like the only way we can do it is to
> push up the setting to later drop it back down during a week that we plan to
> have our cluster plagued with blocked requests all while cranking our
> osd_heartbeat_grace so that we don't have flapping osds.

> A little more about our setup is that we have 32x 4TB HGST drives with 4x 
> 200GB
> Intel DC3710 journals (8 drives per journal), dual hyper-threaded octa-core
> Xeon (32 virtual cores), 192GB memory, 10Gb redundant network... per storage
> node.


>   David Turner | Cloud Operations Engineer | StorageCraft Technology 
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943

>   If you are not the intended recipient of this message or received it
>   erroneously

Re: [ceph-users] stalls caused by scrub on jewel

2016-12-01 Thread Frédéric Nass
Hi Yoann,

Thank you for your input. I was just told by RH support that it's going to make it 
into RHCS 2.0 (10.2.3). Thank you guys for the fix!

We thought about increasing the number of PGs just after changing the 
merge/split threshold values, but this would have led to a _lot_ of data 
movement (1.2 billion XFS files) over weeks, without any possibility to 
scrub / deep-scrub to ensure data consistency. Still, as soon as we get the fix, 
we will increase the number of PGs.

Regards,

Frederic.



> On 1 Dec 2016 at 16:47, Yoann Moulin wrote:
> 
> Hello,
> 
>> We're impacted by this bug (case 01725311). Our cluster is running RHCS 2.0 
> >> and is no longer able to scrub or deep-scrub.
>> 
>> [1] http://tracker.ceph.com/issues/17859
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007
>> [3] https://github.com/ceph/ceph/pull/11898
>> 
>> I'm worried we'll have to live with a cluster that can't scrub/deep-scrub 
>> until March 2017 (ETA for RHCS 2.2 running Jewel 10.2.4).
>> 
>> Can we have this fix any sooner ?
> 
> As far as I know about that bug, it appears if you have big PGs, a workaround 
> could be increasing the pg_num of the pool that has the biggest PGs.
> 
> -- 
> Yoann Moulin
> EPFL IC-IT

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stalls caused by scrub on jewel

2016-12-01 Thread Frédéric Nass


Hi Sage, Sam,

We're impacted by this bug (case 01725311). Our cluster is running RHCS 
2.0 and is no longer able to scrub or deep-scrub.


[1] http://tracker.ceph.com/issues/17859
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007
[3] https://github.com/ceph/ceph/pull/11898

I'm worried we'll have to live with a cluster that can't 
scrub/deep-scrub until March 2017 (ETA for RHCS 2.2 running Jewel 10.2.4).


Can we have this fix any sooner ?

Regards

Frédéric.

On 15/11/2016 at 23:35, Sage Weil wrote:

Hi everyone,

There was a regression in jewel that can trigger long OSD stalls during
scrub.  How long the stalls are depends on how many objects are in your
PGs, how fast your storage device is, and what is cached, but in at least
one case they were long enough that the OSD internal heartbeat check
failed and it committed suicide (120 seconds).

The workaround for now is to simply

  ceph osd set noscrub

as the bug is only triggered by scrub.  A fix is being tested and will be
available shortly.

If you've seen any kind of weird latencies or slow requests on jewel, I
suggest setting noscrub and seeing if they go away!

The tracker bug is

  http://tracker.ceph.com/issues/17859

Big thanks to Yoann Moulin for helping track this down!

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] effectively reducing scrub io impact

2016-10-20 Thread Frédéric Nass
- On 20 Oct 16, at 15:03, Oliver Dzombic wrote: 

> Hi Christian,

> thank you for your time.

> The problem is deep scrub only.

> Jewel 10.2.2 is used.

> Thank you for your hint with manual deep scrubs on specific OSD's. I
> didnt come up with that idea.

> -

> Where do you know

> osd_scrub_sleep

> from ?

> I have seen many "hidden" config options mentioned on the mailing list
> lately (where "hidden" means everything which is not mentioned in
> the documentation at ceph.com).

> The ceph.com documentation does not know about the osd_scrub_sleep config option
> (except for mentions in past release notes).

> The search engine finds it mainly in github or the bug tracker.

> Is there any source for a (complete) list of available config options,
> usable by normal admins?

Hi Oliver, 

This is probably what you're looking for: 
https://github.com/ceph/ceph/blob/master/src/common/config_opts.h 

You can change the Branch on the left to match the version of your cluster. 
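
You can also dump what a running daemon actually uses, and inject one of those options on the fly, e.g. (run the first command on the host holding the OSD's admin socket; the osd id is just an example): 

    ceph daemon osd.0 config show | grep scrub
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'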

Regards, 

Frederic. 

> Or is it really necessary to dig through source code and release
> notes to collect that kind of information on your own?

> --
> Mit freundlichen Gruessen / Best regards

> Oliver Dzombic
> IP-Interactive

> mailto:i...@ip-interactive.de

> Anschrift:

> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen

> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic

> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107

> Am 20.10.2016 um 14:39 schrieb Christian Balzer:

> > Hello,

> > On Thu, 20 Oct 2016 11:23:54 +0200 Oliver Dzombic wrote:

> >> Hi,

> >> we have here globally:

> >> osd_client_op_priority = 63
> >> osd_disk_thread_ioprio_class = idle
> >> osd_disk_thread_ioprio_priority = 7
> >> osd_max_scrubs = 1

> > If you google for osd_max_scrubs you will find plenty of threads, bug
> > reports, etc.

> > The most significant and beneficial impact for client I/O can be achieved
> > by telling scrub to release its deadly grip on the OSDs with something like
> > osd_scrub_sleep = 0.1

> > Also which version, Hammer IIRC?
> > Jewel's unified queue should help as well, but no first hand experience
> > here.

> >> to influence the scrubbing performance and

> >> osd_scrub_begin_hour = 1
> >> osd_scrub_end_hour = 7

> >> to influence the scrubbing time frame


> >> Now, it seems, this time frame was not enough, so ceph started
> >> scrubbing all the time, I assume because of the age of the objects.

> > You may want to line things up, so that OSDs/PGs are evenly spread out.
> > For example with 6 OSDs, manually initiate a deep scrub each day (at 01:00
> > in your case), so that only a specific subset is doing deep scrub conga.


> >> And it does it with:

> >> 4 active+clean+scrubbing+deep

> >> ( instead of the configured 1 )

> > That's per OSD, not global, see above, google.


> >> So now we experience a situation where the spinning drives are so
> >> busy that the IO performance has become too bad.

> >> The only reason it's not a catastrophe is that we have a cache tier
> >> in front of it, which lowers the IO load on the spinning drives.

> >> Unfortunately, we also have some pools going directly to the spinning drives.

> >> So these pools experience very bad IO performance.

> >> So we had to disable scrubbing during business hours (which is not
> >> really a solution).

> > It is, unfortunately, for many people.
> > As mentioned many times, if your cluster is having issues with deep-scrubs
> > during peak hours, it will also be unhappy if you loose an OSD and
> > backfills happen.
> > If it is unhappy with normal scrubs, you need to upgrade/expand HW
> > immediately.

> >> So any idea why

> >> 1. 4-5 scrubs we can see, while osd_max_scrubs = 1 is set ?
> > See above.

> > With BlueStore in the wings and reduced (negated?) need for deep-scrubs, I
> > doubt this will see much coding effort.

> >> 2. Why the impact on the spinning drives is so hard, while we lowered
> >> the IO priority for it ?

> > That has only a small impact, deep-scrub by its very nature reads all
> > objects and thus kills I/Os by seeks and polluting caches.


> > Christian

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMWare

2016-10-18 Thread Frédéric Nass


Hi Alex,

Just out of curiosity, what kind of backstore are you using within Storcium? 
vdisk_fileio or vdisk_blockio?


I see your agents can handle both : 
http://www.spinics.net/lists/ceph-users/msg27817.html


Regards,

Frédéric.

On 06/10/2016 at 16:01, Alex Gorbachev wrote:

On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry  wrote:

Hey guys,

Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.

If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:

1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc

Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.


Hi Patrick,

We have Storcium certified with VMWare, and we use it ourselves:

Ceph Hammer latest

SCST redundant Pacemaker based delivery front ends - our agents are
published on github

EnhanceIO for read caching at delivery layer

NFS v3, and iSCSI and FC delivery

Our deployment size we use ourselves is 700 TB raw.

Challenges are as others described, but HA and multi host access works
fine courtesy of SCST.  Write amplification is a challenge on spinning
disks.

Happy to share more.

Alex


--

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journal pool

2016-10-12 Thread Frédéric Nass

I would have tried it but our cluster is still running RHCS 1.3. :-)

Frederic.

On 12/10/2016 at 08:45, Frédéric Nass wrote:

Hello,

Can we use rbd journaling without using rbd mirroring in Jewel? So 
that we can set rbd journals on SSD pools and improve write IOPS on 
standard (non-mirrored) RBD images.

Assuming IOs are acknowledged when written to the journal pool.

Everything I read regarding RBD journaling is related to RBD mirroring.

Regards,



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD journal pool

2016-10-11 Thread Frédéric Nass

Hello,

Can we use rbd journaling without using rbd mirroring in Jewel? So that 
we can set rbd journals on SSD pools and improve write IOPS on standard 
(non-mirrored) RBD images.

Assuming IOs are acknowledged when written to the journal pool.

Everything I read regarding RBD journaling is related to RBD mirroring.
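
What I had in mind is something along these lines (a sketch only; the option and feature names are taken from the Jewel config, and whether the write really gets acknowledged from the journal without mirroring is exactly my question):

    # ceph.conf, client side
    [client]
        rbd journal pool = ssd-pool      # keep journal objects on an SSD-backed pool

    # per image (journaling requires exclusive-lock)
    rbd feature enable rbd/myimage exclusive-lock
    rbd feature enable rbd/myimage journaling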

Regards,

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMWare

2016-10-11 Thread Frédéric Nass

Hi Patrick,

1) Université de Lorraine. (7,000 researchers and staff members, 60,000 
students, 42 schools and education structures, 60 research labs).


2) RHCS cluster: 144 OSDs on 12 nodes for 520 TB raw capacity.
VMware clusters: 7 VMware clusters (40 ESXi hosts). The first need is 
to provide capacity-oriented storage (Ceph) to VMs running in a VMware vRA IaaS 
cluster (6 ESXi hosts).


3) Deployment growth ?
RHCS cluster: Initial need was 750 TB of usable storage, so a x4 
growth in the next 3 years is expected to reach 1 PB of usable storage.
VMware clusters: We just started to offer a IaaS service to 
research laboratories and education structures whithin our university.
We can expect to host several hundreds of VMs in the next 2 years 
(~600-800).


4) Integration method ? Clearly native.
I spent some of the last 6 months working on building an HA gateway 
cluster (iSCSI and NFS) to provide RHCS Ceph storage to our VMware IaaS 
Cluster. Here are my findings:


* iSCSI ?

Gives better performance than NFS, we know that. BUT we cannot go 
into production with iSCSI because ESXi hosts enter a never-ending 
iSCSI 'Abort Task' loop when the Ceph cluster fails to acknowledge a 4MB 
IO in less than 5s, resulting in VMs crashing. I've been told by a 
VMware engineer that this 5s limit cannot be raised as it's hardcoded in 
the ESXi iSCSI software initiator.
Why would an IO take more than 5s? Under heavy load on 
the Ceph cluster, in a Ceph failure scenario (network isolation, OSD 
crash), with deep-scrubbing getting in the way of client IOs, or any combination of 
these or others I didn't think of...


What I have tested:
iSCSI Active/Active HA cluster. Each ESXi sees the same datastore 
through both targets but only accesses it through one target at a time via a 
statically defined preferred path.
3 ESXi work on one target, 3 ESXi work on the other. If a target 
goes down, the other paths are used.


- LIO iSCSI targets with kernel RBD mapping (no cache). VAAI 
methods. Easy to configure (a minimal configuration sketch follows this list). 
Delivers good performance with eager-zeroed virtual disks. The 'Abort Task' 
loop makes the ESXi disconnect from the vCenter Server.

Restarting the target gets them back in, but some VMs certainly crashed.
- FreeBSD / FreeNAS running in KVM (on top of CentOS) mapping RBD 
images through librbd. Found that the fileio backstore was used. Found it hard 
to make HA with librbd cache. And still the 'Abort Task' loop...
- SCST ESOS targets with kernel RBD mapping (no cache). VAAI 
methods, ALUA. Easy to configure too. 'Abort Task' still happens but the 
ESX does not get disconnected from the vCenter Server. Still, targets 
have to be restarted to fix this situation.
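
For reference, a minimal sketch of the LIO variant above (kernel RBD 
mapping, no cache); image, backstore and IQN names are made up and 
ACLs/portals are omitted:

    # map the image with the kernel client on the target node
    rbd map rbd/vmware-ds1
    # export the mapped block device through LIO
    targetcli /backstores/block create name=vmware-ds1 dev=/dev/rbd0
    targetcli /iscsi create iqn.2016-07.fr.univ-lorraine:target1
    targetcli /iscsi/iqn.2016-07.fr.univ-lorraine:target1/tpg1/luns create /backstores/block/vmware-ds1
    targetcli saveconfig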


* NFS ?

Gives lower performance than iSCSI, we know that too. BUT, it's 
probably the best option right now. It's very easy to make it HA with 
Pacemaker/Corosync as VMware doesn't make use of the NFS lock manager. 
Here is a good start: 
https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
We're still benchmarking IOPS to decide whether we can go into 
production with this infrastructure but we're actually very satisfied 
with the HA mechanism.
Running synchronous writes on multiple VMs (on virtual disks hosted 
on NFS datastores with 'sync' exports of RBD images) while Storage 
vMotioning those multiple disks between NFS RBD datastores and flapping 
the VIP (and thus the NFS exports) from one server to the other at the same 
time never kills any VM nor makes any datastore unavailable.
And every Storage vMotion task completes! These are excellent 
results. Note that it's important to run VMware Tools in the VMs, as the 
VMware Tools installation extends the disk write timeout on the guests' local 
SCSI devices.
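
To make the 'sync' export setup concrete, here is roughly what each NFS 
gateway runs (paths, image names and the client subnet are illustrative, not 
our exact production values):

    # map the image and mount it as XFS on the NFS gateway
    rbd map rbd/nfs-ds1
    mkfs.xfs /dev/rbd0                      # first time only
    mount /dev/rbd0 /exports/nfs-ds1
    # /etc/exports -- 'sync' so writes are only acknowledged once committed:
    # /exports/nfs-ds1  192.168.0.0/24(rw,sync,no_root_squash,no_subtree_check)
    exportfs -ra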


What I have tested:
- NFS exports with async mode sharing RBD images with XFS on top of 
it. Gives the best performance but, obviously, no one would want to 
use this mode in production.
- NFS exports with sync mode sharing RBD images with XFS on top of 
it. Gives mixed performance. We would clearly announce this type of 
storage as capacity-oriented rather than performance-oriented through our 
IaaS service.
  As VMs cache writes, IOPS might be good enough for tier 2 or 3 
applications. We would probably be able to increase the number of IOPS 
by using more RBD images and NFS shares.
- NFS exports with sync mode sharing RBD images with ZFS (with 
compression) on top of it. The idea is to provide better performance by 
putting the SLOG (write journal) on fast SSD drives.
  See this real life (love-)story : 
https://virtualexistenz.wordpress.com/2013/02/01/using-zfs-storage-as-vmware-nfs-datastores-a-real-life-love-story/
  Each NFS server has 2 mirrored SSDs (RAID1). Each NFS server 
exports partitions of this SSD volume through iSCSI.
  Each NFS server is a client of both the local and the remote iSCSI targets. 
The SLOG device is then made of a ZFS mirror of 2 disks: the local iSCSI 
device and the remote iSCSI device (sketched below).
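
A rough sketch of that SLOG layout, assuming the RBD image is mapped as 
/dev/rbd0 and each head sees its local SSD partition plus its peer's one over 
iSCSI (pool and device names are placeholders):

    # pool on the mapped RBD image, with compression
    zpool create -O compression=lz4 tank /dev/rbd0
    # intent log mirrored across the local and the remote iSCSI SSD devices
    zpool add tank log mirror /dev/mapper/slog-local /dev/mapper/slog-remote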

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 22/07/2016 14:10, Nick Fisk wrote:


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 11:19
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 11:48, Nick Fisk a écrit :

*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users
[mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of
*Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
<mailto:jak3...@gmail.com>; Jan Schermer 
<mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer
mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
    mailto:mchri...@redhat.com>>
wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with
VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non
HA support though.
>
>>
>> Knowing that HA iSCSI target was on the
roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS
targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD
images. Each iSCSI target
>> has all VAAI primitives enabled and run the
same configuration.
>> - RBD images are mapped on each target using
the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs
through both targets,
>> but in a failover manner so that each ESXi
always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives
are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anthing risky regarding this
configuration ?
>
> If you use a application that uses scsi
persistent reservations then you
> could run into troubles, because some apps
expect the reservation info
> to be on the failover nodes as well as the
active ones.
>
> Depending on the how you do failover and the
issue that caused the
> failover, IO could be stuck on the old active
node and cause data
> corruption. If the initial active node looses
its network connectivity
> and you failover, you have to make sure that the
in

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 22/07/2016 11:48, Nick Fisk wrote:


*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
    Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
<mailto:jak3...@gmail.com>; Jan Schermer 
<mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets
when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD
images. Each iSCSI target
>> has all VAAI primitives enabled and run the same
configuration.
>> - RBD images are mapped on each target using the
kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs
through both targets,
>> but in a failover manner so that each ESXi always
access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are
enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anthing risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect
the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue
that caused the
> failover, IO could be stuck on the old active node
and cause data
> corruption. If the initial active node looses its
network connectivity
> and you failover, you have to make sure that the
initial active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist
and make sure IO on
> that node is flushed and failed before
unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO
is treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you
don't know what in-flight IO happened before the outage
and which di

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 22/07/2016 10:23, Nick Fisk wrote:


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
    Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young  <mailto:jak3...@gmail.com>;
Jan Schermer  <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>> wrote:
    >
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when
it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images.
Each iSCSI target
>> has all VAAI primitives enabled and run the same
configuration.
>> - RBD images are mapped on each target using the kernel
client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through
both targets,
>> but in a failover manner so that each ESXi always
access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are
enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anthing risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that
caused the
> failover, IO could be stuck on the old active node and
cause data
> corruption. If the initial active node looses its
network connectivity
> and you failover, you have to make sure that the initial
active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist and
make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO is
treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't
know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to
the persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV -
some people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you
need to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there i

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 22/07/2016 09:47, Nick Fisk wrote:


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 08:11
*To:* Jake Young ; Jan Schermer 
*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi
client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support
though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose
iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each
iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel
client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through
both targets,
>> but in a failover manner so that each ESXi always access
the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled
client side
>> (except UNMAP as per default).
>>
>> Do you see anthing risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that
caused the
> failover, IO could be stuck on the old active node and cause
data
> corruption. If the initial active node looses its network
connectivity
> and you failover, you have to make sure that the initial
active node is
> fenced off and IO stuck on that node will never be executed.
So do
> something like add it to the ceph monitor blacklist and make
sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only
use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor can't
tell what type of data the VMs are writing, all IO is treated as
needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't
know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some
people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you need
to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there is no
fsck for VMFS...


Some (non opensource) solutions exist, Solaris supposedly does
this in some(?) way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's
possible without client support
(you essentialy have to do something like transactions and
replay the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring
between the targets) without making it synchronous all the way.

This is why the SAN vendors wrote their own clients and drivers.
It is not possible to dynamically make all OS's do what 

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer <mailto:j...@schermer.cz>> wrote:



> On 20 Jul 2016, at 18:38, Mike Christie > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client
? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI
over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI
target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client
(so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both
targets,
>> but in a failover manner so that each ESXi always access the
same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled
client side
>> (except UNMAP as per default).
>>
>> Do you see anthing risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations
then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network
connectivity
> and you failover, you have to make sure that the initial active
node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make
sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use
synchronous IO.


VMware does only use synchronous IO. Since the hypervisor can't tell 
what type of data the VMs are writing, all IO is treated as needing to 
be synchronous.


(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know
what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people
run it like that without realizing
the dangers and have never had a problem, so it may be strictly
theoretical, and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images
on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck 
for VMFS...



Some (non opensource) solutions exist, Solaris supposedly does
this in some(?) way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's
possible without client support
(you essentialy have to do something like transactions and replay
the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between
the targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is 
not possible to dynamically make all OS's do what your iSCSI target 
expects.


Something like VMware does the right thing pretty much all the time 
(there are some iSCSI initiator bugs in earlier ESXi 5.x).  If you 
have control of your ESXi hosts then attempting to set up HA iSCSI 
targets is possible.


If you have a mixed client environment with various versions of 
Windows connecting to the target, you may be better off buying some 
SAN appliances.



The one time I had to use it I resorted to simply mirroring in via
mdraid on the client side over two targets sharing the same
DAS, and this worked fine during testing but never went to
production in the end.

Jan

>
>>
>> Wo

Re: [ceph-users] ceph + vmware

2016-07-20 Thread Frédéric Nass


Hi Mike,

Thanks for the update on the RHCS iSCSI target.

Will the RHCS 2.1 iSCSI target be compliant with the VMware ESXi client? (Or is 
it too early to say / announce?)


Knowing that an HA iSCSI target was on the roadmap, we chose iSCSI over NFS, 
so we'll just have to remap RBDs to RHCS targets when it's available.


So we're currently running :

- 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target 
has all VAAI primitives enabled and runs the same configuration.
- RBD images are mapped on each target using the kernel client (so no 
RBD cache).
- 6 ESXi. Each ESXi can access the same LUNs through both targets, 
but in a failover manner so that each ESXi always accesses the same LUN 
through one target at a time (path policy sketched below).
- LUNs are VMFS datastores and VAAI primitives are enabled client side 
(except UNMAP, as per default).
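
For the record, the "one target at a time" part is done on each ESXi with the 
Fixed path selection policy and a preferred path, along these lines (the 
device NAA id and path name are placeholders):

    esxcli storage nmp device set --device <naa_id> --psp VMW_PSP_FIXED
    esxcli storage nmp psp fixed deviceconfig set --device <naa_id> --path vmhba64:C0:T0:L0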


Do you see anything risky regarding this configuration?

Would you recommend a LIO or STGT (with the rbd bs-type) target for ESXi clients?

Best regards,

Frederic.

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35



On 11/07/2016 17:45, Mike Christie wrote:

On 07/08/2016 02:22 PM, Oliver Dzombic wrote:

Hi,

does anyone have experience how to connect vmware with ceph smart ?

iSCSI multipath does not really worked well.

Are you trying to export rbd images from multiple iscsi targets at the
same time or just one target?

For the HA/multiple target setup, I am working on this for Red Hat. We
plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
someone mentioned.

We just got a large chunk of code in the upstream kernel (it is in the
block layer maintainer's tree for the next kernel) so it should be
simple to add COMPARE_AND_WRITE support now. We should be posting krbd
exclusive lock support in the next couple weeks.



NFS could be, but i think thats just too much layers in between to have
some useable performance.

Systems like ScaleIO have developed a vmware addon to talk with it.

Is there something similar out there for ceph ?

What are you using ?

Thank you !


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Write barriers, controler cache and disk cache.

2015-10-05 Thread Frédéric Nass

Hello,

We are building a new Ceph cluster and have a few questions regarding 
the use of write barriers, controller cache, and disk cache (buffer).


Greg said that barriers should be used 
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/002854.html) 
for data safety, which is the default mount option with XFS.


But what if we're using RAID controllers with battery-backed cache?

The XFS FAQ recommends disabling barriers when using battery-backed 
RAID controllers AND making sure the disks' caches are disabled, which we are 
able to do on our PERC controllers (Dell R730xd) 
(http://xfs.org/index.php/XFS_FAQ#Write_barrier_support). We should also 
make sure to disable battery relearning cycles.


Questions are:

Is it safe to use the controller cache in front of SAS/SATA data disks? 
(Our tests showed 1.95x more read/write IOPS when using the cache)


Is it safe to use the controller cache in front of SSD metadata disks? 
(Our tests showed 1.38x more read/write IOPS when using the cache). SSD 
metadata disks are protected from power loss 
(http://toshiba.semicon-storage.com/us/product/storage-products/enterprise-ssd/px02smb-px02smfxxx.html)


When using the controller cache (with multiple single-drive RAID0 
volumes), should we disable the disk cache in any scenario? Should we use 
barriers or not?


It's not clear to me whether the barrier mechanism applies to the 
controller cache only or passes through the controller cache down to the 
physical disks.
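
To make the scenario concrete, here is roughly what the XFS FAQ guidance would 
translate to on an OSD data disk (device and mount point are examples; on 
single-drive RAID0 virtual disks the disk-cache toggle would more likely go 
through the PERC management tools than through hdparm):

    hdparm -W 0 /dev/sdb                   # turn off the drive's volatile write cache
    mount -o noatime,nobarrier /dev/sdb1 /var/lib/ceph/osd/ceph-0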


Regards,

Frederic.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph same rbd on multiple client

2015-05-22 Thread Frédéric Nass

Hi,

While waiting for CephFS, you can use a clustered filesystem like OCFS2 or GFS2 
on top of RBD mappings so that each host can access the same device and 
filesystem.
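
A minimal sketch with OCFS2, assuming the o2cb cluster stack is already 
configured on both hosts (image, label and mount point are examples):

    # on one host only: format the shared image with 2 node slots
    rbd map rbd/shared-image
    mkfs.ocfs2 -N 2 -L shared /dev/rbd0
    # on both hosts: map the same image and mount it
    rbd map rbd/shared-image
    mount -t ocfs2 /dev/rbd0 /mnt/shared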


Regards,

Frédéric.

On 21/05/2015 16:10, gjprabu wrote:

Hi All,

We are using rbd and map the same rbd image to the rbd device 
on two different client but i can't see the data until i umount and 
mount -a partition. Kindly share the solution for this issue.


*Example*
create rbd image named foo
map foo to /dev/rbd0 on server A,   mount /dev/rbd0 to /mnt
map foo to /dev/rbd0 on server B,   mount /dev/rbd0 to /mnt

Regards
Prabu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Frédéric Nass

Sous direction des Infrastructures,
Direction du Numérique,
Université de Lorraine.

Tél : 03.83.68.53.83

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshots and fstrim with cache tiers ?

2015-03-27 Thread Frédéric Nass


Hello,

The snapshots-with-a-cache-tier part was answered by Greg Farnum 
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg18329.html).


What about fstrim with a cache tier? It doesn't seem to work (see the test 
sketch below).

Also, is there a background task that recovers freed blocks?
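
For context, this is the kind of client-side test that shows the freed blocks 
not coming back (mount point and image names are examples):

    fstrim -v /mnt/rbd-image               # one-shot trim of the mounted filesystem
    # or continuous discard at mount time:
    # mount -o discard /dev/rbd0 /mnt/rbd-image
    ceph df                                # pool usage does not drop afterwards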

Best regards,

Frédéric.


On 25/03/2015 11:14, Frédéric Nass wrote:


Hello,


I have a few questions regarding snapshots and fstrim with cache tiers.


In the "cache tier and erasure coding FAQ" related to ICE 1.2 (based 
on Firefly), Inktank says "Snapshots are not supported in conjunction 
with cache tiers."


What are the risks of using snapshots with cache tiers ? Would this 
"better not use it recommandation" still be true with Giant or Hammer ?



Regarding the fstrim command, it doesn't seem to work with cache 
tiers. The freed up blocks don't get back in the ceph cluster.
Can someone confirm this ? Is there something we can do to get those 
freed up blocks back in the cluster ?



Also, can we run an fstrim task from the cluster side ? That is, 
without having to map and mount each rbd image or rely on the client 
to operate this task ?



Best regards,


--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

email : frederic.n...@univ-lorraine.fr
Tél : +33 3 83 68 53 83


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Frédéric Nass

Sous direction des Infrastructures,
Direction du Numérique,
Université de Lorraine.

Tél : 03.83.68.53.83

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] error creating image in rbd-erasure-pool

2015-03-25 Thread Frédéric Nass
Hi Greg, 

Thank you for this clarification. It helps a lot. 

Does this "can't think of any issues" apply to both rbd and pool snapshots ? 
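
Just to be explicit about the two cases I mean (names are examples), and as 
far as I know a given pool cannot mix the two kinds:

    rbd snap create rbd/vm-disk-01@before-upgrade   # rbd (self-managed) snapshot
    ceph osd pool mksnap rbd my-pool-snap           # pool snapshot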

Frederic. 

- Original Message -

> On Tue, Mar 24, 2015 at 12:09 PM, Brendan Moloney  wrote:
> >
> >> Hi Loic and Markus,
> >> By the way, Inktank do not support snapshot of a pool with cache tiering :
> >>
> >> *
> >> https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf
> >
> > Hi,
> >
> > You seem to be talking about pool snapshots rather than RBD snapshots. But
> > in the linked document it is not clear that there is a distinction:
> >
> > Can I use snapshots with a cache tier?
> > Snapshots are not supported in conjunction with cache tiers.
> >
> > Can anyone clarify if this is just pool snapshots?

> I think that was just a decision based on the newness and complexity
> of the feature for product purposes. Snapshots against cache tiered
> pools certainly should be fine in Giant/Hammer and we can't think of
> any issues in Firefly off the tops of our heads.
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 

Cordialement, 

Frédéric Nass. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Snapshots and fstrim with cache tiers ?

2015-03-25 Thread Frédéric Nass


Hello, 




I have a few questions regarding snapshots and fstrim with cache tiers. 




In the "cache tier and erasure coding FAQ" related to ICE 1.2 (based on 
Firefly), Inktank says "Snapshots are not supported in conjunction with cache 
tiers." 

What are the risks of using snapshots with cache tiers? Would this "better not 
use it" recommendation still be true with Giant or Hammer? 




Regarding the fstrim command, it doesn't seem to work with cache tiers. The 
freed-up blocks don't get back into the Ceph cluster. 
Can someone confirm this? Is there something we can do to get those freed-up 
blocks back into the cluster? 




Also, can we run an fstrim task from the cluster side ? That is, without having 
to map and mount each rbd image or rely on the client to operate this task ? 




Best regards, 





-- 

Frédéric Nass 

Sous-direction Infrastructures 
Direction du Numérique 
Université de Lorraine 

email : frederic.n...@univ-lorraine.fr 
Tél : +33 3 83 68 53 83 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com