Re: [ceph-users] cephfs and erasure coding

2017-03-08 Thread Maxime Guyot
Hi,

>“The answer as to how to move an existing cephfs pool from replication to 
>erasure coding (and vice versa) is to create the new pool and rsync your data 
>between them.”
Shouldn’t it be possible to just do the “ceph osd tier add  ecpool cachepool && 
ceph osd tier cache-mode cachepool writeback” and let Ceph redirect the 
requests (CephFS or other) to the cache pool?
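I.e. the full sequence would be something along these lines (untested from my
side, and the pool names here are just placeholders):

ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool   # so that client I/O is actually redirected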

Cheers,
Maxime

From: ceph-users  on behalf of David Turner 

Date: Wednesday 8 March 2017 22:27
To: Rhian Resnick , "ceph-us...@ceph.com" 

Subject: Re: [ceph-users] cephfs and erasure coding

I use CephFS on erasure coding at home using a cache tier.  It works fine for 
my use case, but we know nothing about your use case to know if it will work 
well for you.

The answer as to how to move an existing cephfs pool from replication to 
erasure coding (and vice versa) is to create the new pool and rsync your data 
between them.


David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rhian Resnick 
[rresn...@fau.edu]
Sent: Wednesday, March 08, 2017 12:54 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] cephfs and erasure coding

Two questions on Cephfs and erasure coding that Google couldn't answer.





1) How well does cephfs work with erasure coding?



2) How would you move an existing cephfs pool that uses replication to erasure 
coding?



Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology



Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

 [image] 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Posix AIO vs libaio read performance

2017-03-08 Thread Xavier Trilla
Hi,

I'm trying to debug why there is such a big difference between POSIX AIO and libaio 
when performing read tests from inside a VM using librbd.

The results I'm getting using FIO are:

POSIX AIO Read:

Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 2.54 MB/s
Average: 632 IOPS

Libaio Read:

Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 147.88 MB/s
Average: 36967 IOPS

When performing writes the differences aren't so big, because the cluster 
-which is in production right now- is CPU bound:

POSIX AIO Write:

Type: Random Write - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 14.87 MB/s
Average: 3713 IOPS

Libaio Write:

Type: Random Write - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:

Average: 14.51 MB/s
Average: 3622 IOPS


Even if the write results are CPU bound, as the machines containing the OSDs 
don't have enough CPU to handle all the IOPS (CPU upgrades are on their way), I 
cannot really understand why I'm seeing so much difference in the read tests.

Some configuration background:

- Cluster and clients are using Hammer 0.94.9
- It's a full SSD cluster running over Samsung Enterprise SATA SSDs, with all 
the typical tweaks (Customized ceph.conf, optimized sysctl, etc...)
- Tried QEMU 2.0 and 2.7 - Similar results
- Tried virtio-blk and virtio-scsi - Similar results

I've been reading about POSIX AIO and libaio, and I can see there are several 
differences in how they work (like one being user space and the other one being 
kernel based), but I don't really get why Ceph has such problems handling POSIX AIO 
read operations, but not write operations, and how to avoid them.

Right now I'm trying to identify if it's something wrong with our Ceph cluster 
setup, with Ceph in general or with QEMU (virtio-scsi or virtio-blk as both 
have the same behavior)

If you would like to try to reproduce the issue here are the two command lines 
I'm using:

fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32
fio --name=randread-libaio --output ./test --runtime 60 --ioengine=libaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32


If you could shed any light on this it would be really helpful, as right now, 
although I still have some ideas left to try, I don't have much idea about 
why this is happening...

Thanks!
Xavier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How does ceph preserve read/write consistency?

2017-03-08 Thread 许雪寒
Hi, everyone.

Recently, in our test, we found a strange phenomenon: a READ req from client A 
that arrived later than a WRITE req from client B is finished earlier than that 
WRITE req.

The logs are as follows (we changed the level of some log messages to 1 in order 
to get some insight into the OSDs while avoiding too many logs filling 
up the disk):

2017-03-07 18:58:27.439107 7f80ae9fc700  1 2017-03-07 18:58:27.439109 
7f2dbefe5700  1 -- 10.208.129.17:6800/1653080 <== client.5234411 
10.208.129.31:0/1029364 99819  osd_op(client.5234411.0:1312595 
rbd_data.4ff53c1322a29c.0cad [stat,set-alloc-hint object_size 
4194304 write_size 4194304,write 3981312~12288] 4.395b22ac snapc 
2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) v5  
291+0+12288 (84537638 0 1181937587) 0x7f80113c5e80 con 0x7f80988c7800-- 
10.208.140.17:6802/1653275 --> 10.208.140.34:6802/545067 -- 
osd_repop_reply(client.5278878.0:1311414 4.19a ondisk, result = 0) v1 -- ?+0 
0x7f2ce94bf800 con 0x7f2d62169100
2017-03-07 18:58:27.439130 7f80e33ff700  1 osd.14 15531 dequeue_op 
0x7f7fe2827f00 prio 63 cost 4096 latency 0.019538 
osd_op(client.5234411.0:1312144 rbd_data.4ff53c1322a29c.0cad 
[stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2719744~4096] 
4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) 
v5 pg pg[4.ac( v 15531'8325461 (15531'8322451,15531'8325461] local-les=11579 
n=176 ec=1384 les/c 11579/11579 11578/11578/11551) [14,6,15] r=0 lpr=11578 
luod=15531'8325460 lua=15531'8325460 crt=15531'8325459 lcod 15531'8325459 mlcod 
15531'8325459 active+clean]
2017-03-07 18:58:27.439176 7f80c3fff700  1 osd.14 15531 dequeue_op 
0x7f7fe281d100 prio 63 cost 8192 latency 0.019477 
osd_op(client.5234411.0:1312145 rbd_data.4ff53c1322a29c.0cad 
[stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3112960~8192] 
4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) 
v5 pg pg[4.ac( v 15531'8325461 (15531'8322451,15531'8325461] local-les=11579 
n=176 ec=1384 les/c 11579/11579 11578/11578/11551) [14,6,15] r=0 lpr=11578 
luod=15531'8325460 lua=15531'8325460 crt=15531'8325459 lcod 15531'8325459 mlcod 
15531'8325459 active+clean]
2017-03-07 18:58:27.439191 7f80ae9fc700  1 -- 10.208.129.17:6800/1653080 <== 
client.5234411 10.208.129.31:0/1029364 99820  
osd_op(client.5234411.0:1312596 rbd_data.4ff53c1322a29c.0cad 
[stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1990656~4096] 
4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) 
v5  291+0+4096 (2656609427 0 1238378996) 0x7f80113c6100 con 0x7f80988c7800
2017-03-07 18:58:27.439230 7f80e33ff700  1 osd.14 15531 dequeue_op 
0x7f7fe281d200 prio 63 cost 4096 latency 0.019387 
osd_op(client.5234411.0:1312148 rbd_data.4ff53c1322a29c.0cad 
[stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3104768~4096] 
4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) 
v5 pg pg[4.ac( v 15531'8325461 (15531'8322451,15531'8325461] local-les=11579 
n=176 ec=1384 les/c 11579/11579 11578/11578/11551) [14,6,15] r=0 lpr=11578 
luod=15531'8325460 lua=15531'8325460 crt=15531'8325459 lcod 15531'8325459 mlcod 
15531'8325459 active+clean]
2017-03-07 18:58:27.439258 7f80ae9fc700  1 -- 10.208.129.17:6800/1653080 <== 
client.5234411 10.208.129.31:0/1029364 99821  
osd_op(client.5234411.0:1312603 rbd_data.4ff53c1322a29c.0cad 
[stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1384448~4096] 
4.395b22ac snapc 2d1c=[2b6a,2ae2] ack+ondisk+write+known_if_redirected e15531) 
v5  291+0+4096 (1049117786 0 1515194573) 0x7f80113c6380 con 0x7f80988c7800
.
.
.
.
.
.
2017-03-07 18:59:55.022570 7f80a3cdf700  1 -- 10.208.129.17:6800/1653080 <== 
client.5239664 10.208.129.12:0/1005804 302  osd_op(client.5239664.0:5646 
rbd_data.24e35147272369.0b0b@99 [sparse-read 0~4194304] 4.e3bddcca 
ack+read+localize_reads+known_if_redirected e15533) v5  199+0+0 (32967759 0 
0) 0x7f7fc165ef80 con 0x7f8098810c80
2017-03-07 18:59:55.026579 7f2d4b7ff700  1 -- 10.208.129.17:6802/1653275 --> 
10.208.129.12:0/1006309 -- osd_op_reply(6751 
rbd_data.4ff52432edaf14.0d99 [sparse-read 0~4194304 
[fadvise_sequential+fadvise_nocache]] v0'0 uv8885110 ondisk = 0) v6 -- ?+0 
0x7f2c62c18c00 con 0x7f2cfb237300
2017-03-07 18:59:55.030936 7f80b31fb700  1 -- 10.208.129.17:6800/1653080 <== 
client.5290815 10.208.129.12:0/1005958 330  osd_op(client.5290815.0:6476 
rbd_data.4ff53c1322a29c.0cad@2d4c [sparse-read 0~4194304 
[fadvise_sequential+fadvise_nocache]] 4.395b22ac ack+read+known_if_redirected 
e15533) v5  199+0+0 (1930673694 0 0) 0x7f7fbc287800 con 0x7f809882c280
2017-03-07 18:59:55.032485 7f2d1af49700  1 -- 10.208.129.17:6802/1653275 <== 
client.5290821 10.208.129.12:0/1006079 427  osd_op(client.5290821.0:6249 

[ceph-users] Jewel problems with sysv-init and non ceph-deploy (udev trickery) OSDs

2017-03-08 Thread Christian Balzer

Hello,

Yes, this is Debian Jessie with sysv-init, not systemd.
I prefer my servers to be deterministic.

Firstly an issue with /var/run/ceph.

The init.d/ceph script has these lines:
---
if [ ! -d $run_dir ]; then
# assume /var/run exists
install -d -m0770 -o ceph -g ceph /var/run/ceph
fi
---

Which should do the right thing.
However with 10.2.6 (and probably before) under Debian Jessie something
else seems to create the directory with root ownership before the startup
script runs.
So the above check finds the directory and never installs it.
Which consequently leads to a failure to start any services.
Removing the check and making the "install" unconditional fixes this.
http://tracker.ceph.com/issues/19242
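I.e. just doing this unconditionally:
---
install -d -m0770 -o ceph -g ceph /var/run/ceph
---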


The second issue is related in the sense that it's also caused by the
desired ceph:ceph ownership of things.

My OSDs are all manually deployed since at the very least the SSD based
ones share the disk with OS/swap partitions. 
And mounted via fstab, no ceph-deploy, GUID and udev magic.
All this works fine with Hammer and Jewel when running as root with 
"setuser match path = /var/lib/ceph/$type/$cluster-$id" in ceph.conf.

Of course this fails predictably when trying to do things the Jewel way as
ceph:ceph when it comes to external journals, as ownership is not
preserved with udev between reboots.

Last time I checked, ceph-deploy still doesn't understand dealing with
just a partition (for the data) instead of a full disk.

So from where I'm standing the alternatives seem to be:

a) drink the ceph-deploy cool-aid, buy more SSDs to house the OS. 
b) run everything as root, until kingdom come.
c) spend loads of time to construct something fragile that will do the
   right thing via udev rules.

Correct?
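For c), what I mean is a hand-maintained rule in /etc/udev/rules.d/ per journal
partition, something like this untested sketch (the device name is just an
example and would need updating whenever disks get shuffled around):
---
KERNEL=="sdb2", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"
---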

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error with ceph to cloudstack integration.

2017-03-08 Thread frank

Hi,

We have made sure that the key, ceph user and ceph admin keys are correct. 
Could you let us know if there is any other possibility that would mess 
up the integration?


Regards,
Frank



On 03/06/2017 01:22 PM, Wido den Hollander wrote:

On 6 March 2017 at 6:26, frank wrote:


Hi,

We have setup a ceph server and cloudstack server. All the osds are up
with ceph status currently OK.



[root@admin-ceph ~]#  ceph status
  cluster ebac75fc-e631-4c9f-a310-880cbcdd1d25
   health HEALTH_OK
   monmap e1: 1 mons at {mon1=10.10.48.7:6789/0}
  election epoch 3, quorum 0 mon1
   osdmap e32: 2 osds: 2 up, 2 in
  flags sortbitwise,require_jewel_osds
pgmap v10253: 240 pgs, 8 pools, 25009 kB data, 39 objects
  121 MB used, 1852 GB / 1852 GB avail
   240 active+clean



But when integrating with cloudstack we are receiving the below error.

===

2017-03-02 21:03:02,944 DEBUG [c.c.a.t.Request]
(AgentManager-Handler-15:null) Seq 1-653584895922143294: Processing:  {
Ans: , MgmtId: 207381009036, via: 1, Ver: v1, Flags: 10,
[{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
Failed to create storage pool:
9c51d737-3a6f-3bb3-8f28-109954fc2ef0\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:524)\n\tat
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:277)\n\tat
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:271)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:2823)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1325)\n\tat
com.cloud.agent.Agent.processRequest(Agent.java:501)\n\tat
com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:808)\n\tat
com.cloud.utils.nio.Task.run(Task.java:84)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:745)\n","wait":0}}] }
2017-03-02 21:03:02,944 DEBUG [c.c.a.t.Request]
(catalina-exec-6:ctx-f293a10c ctx-093b4faf) Seq 1-653584895922143294:
Received:  { Ans: , MgmtId: 207381009036, via: 1, Ver: v1, Flags: 10, {
Answer } }
2017-03-02 21:03:02,944 DEBUG [c.c.a.m.AgentManagerImpl]
(catalina-exec-6:ctx-f293a10c ctx-093b4faf) Details from executing class
com.cloud.agent.api.ModifyStoragePoolCommand:
com.cloud.utils.exception.CloudRuntimeException: Failed to create
storage pool: 9c51d737-3a6f-3bb3-8f28-109954fc2ef0
  at
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:524)
  at
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:277)
  at
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:271)

==


Please have a check and let us know if there is anything missing on our
end. The ceph server is set up with CentOS 7 and Jewel as its ceph version.

Any help will be greatly appreciated.

You should check the logs of the CloudStack Agent where it tried to create the 
pool.

But also verify the data you entered into CloudStack:

- Hostname of Monitor
- Cephx user
- Cephx key
- Pool
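For example, something along these lines, run on the KVM host, should quickly
show whether the monitor/user/key combination you entered actually works (the
monitor address, user name and keyring path below are only examples, adjust to
your values):

ceph -m 10.10.48.7 --id cloudstack --keyring /etc/ceph/ceph.client.cloudstack.keyring -s
virsh secret-list    # check that the cephx secret is defined in libvirt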

Wido



Regards,

Frank



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bogus "inactive" errors during OSD restarts with Jewel

2017-03-08 Thread Christian Balzer


Hello,

during OSD restarts with Jewel (10.2.5 and .6 at least) I've seen
"stuck inactive for more than 300 seconds" errors like this when observing
things with "watch ceph -s" : 
---
 health HEALTH_ERR
59 pgs are stuck inactive for more than 300 seconds
223 pgs degraded
74 pgs peering
84 pgs stale
59 pgs stuck inactive
297 pgs stuck unclean
223 pgs undersized
recovery 38420/179352 objects degraded (21.422%)
2/16 in osds are down
---

Now this is neither reflected in any logs, nor true of course (the
restarts take a few seconds per OSD and the cluster is fully recovered
to HEALTH_OK in 12 seconds or so).

But it surely is a good scare for somebody not doing this on a test
cluster.

Anybody else seeing this?

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librados for Python so Neglected?

2017-03-08 Thread Josh Durgin

On 03/08/2017 02:15 PM, Kent Borg wrote:

On 03/08/2017 05:08 PM, John Spray wrote:

Specifically?
I'm not saying you're wrong, but I am curious which bits in particular
you missed.



Object maps. Those transaction-y things. Object classes. Maybe more I
don't know about because I have been learning via Python.


There are certainly gaps in the python bindings, but those are all
covered since jewel.

Hmm, you may have been confused by the docs website - I'd thought the
reference section was autogenerated from the docstrings, like it is for
librbd, but it's just static text: http://tracker.ceph.com/issues/19238

For reference, take a look at 'help(rados)' from the python
interpreter, or check out the source and tests:

https://github.com/ceph/ceph/blob/jewel/src/pybind/rados/rados.pyx
https://github.com/ceph/ceph/blob/jewel/src/test/pybind/test_rados.py

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librados for Python so Neglected?

2017-03-08 Thread Kent Borg

On 03/08/2017 05:08 PM, John Spray wrote:

Specifically?
I'm not saying you're wrong, but I am curious which bits in particular
you missed.



Object maps. Those transaction-y things. Object classes. Maybe more I 
don't know about because I have been learning via Python.


-kb, the Kent who has been figuring he would rewrite in Rust once he 
knew what he wanted to write.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is librados for Python so Neglected?

2017-03-08 Thread John Spray
On Wed, Mar 8, 2017 at 9:28 PM, Kent Borg  wrote:
> Python is such a great way to learn things. Such a shame the librados Python
> library is missing so much. It makes RADOS look so much more limited than it
> is.

Specifically?

I'm not saying you're wrong, but I am curious which bits in particular
you missed.

John

>
> -kb
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object Map Costs (Was: Snapshot Costs (Was: Re: Pool Sizes))

2017-03-08 Thread Kent Borg

I'm slowly working my way through Ceph's features...

I recently happened upon object maps. (I had heard of LevelDB being in 
there but never saw how to use it: That's because I have been using 
Python! And the Python library is missing lots of features! Grrr.)


How fast are those omap calls?

Which is faster: a single LevelDB query yielding a few bytes vs. a 
single RADOS object read of that many bytes at a specific offset?


How about iterating through a whole set of values vs. reading a RADOS 
object holding the same amount of data?


Thanks,

-kb, the Kent who is guessing LevelDB will be slower in both cases, 
because he really isn't using the key/value aspect of LevelDB but is 
still paying for it.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Why is librados for Python so Neglected?

2017-03-08 Thread Kent Borg
Python is such a great way to learn things. Such a shame the librados 
Python library is missing so much. It makes RADOS look so much more 
limited than it is.


-kb
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs and erasure coding

2017-03-08 Thread John Spray
On Wed, Mar 8, 2017 at 7:54 PM, Rhian Resnick  wrote:

> Two questions on Cephfs and erasure coding that Google couldn't answer.
>
>
>
> 1) How well does cephfs work with erasure coding?
>

In the current released versions, you cannot use erasure coded pools with
CephFS, unless there is a replicated cache tier in between.  This isn't
generally advisable because cache tiers bring their own complexity.

The reason for all this is that currently, erasure coded pools only support
a subset of operations.  Notably they do not support overwriting objects
(i.e. modifying in place, the way we would in a filesystem when someone
writes to an existing file).

There is work underway to remove that limitation.  In the current master
(development) code it is already possible to use erasure coded pools
directly as cephfs data pools when a special setting is used to enable
overwrites in EC pools.
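For reference, in current master that is a pool setting along the lines of the
following (the flag name has changed during development, so treat this as an
example rather than gospel and check the up to date docs):

ceph osd pool set ecpool allow_ec_overwrites true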

John

>
> 2) How would you move an existing cephfs pool that uses replication to
> erasure coding?
>




>
> Rhian Resnick
>
> Assistant Director Middleware and HPC
>
> Office of Information Technology
>
>
> Florida Atlantic University
>
> 777 Glades Road, CM22, Rm 173B
>
> Boca Raton, FL 33431
>
> Phone 561.297.2647
>
> Fax 561.297.0222
>
>  [image: image] 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs and erasure coding

2017-03-08 Thread Rhian Resnick
Two questions on Cephfs and erasure coding that Google couldn't answer.



1) How well does cephfs work with erasure coding?


2) How would you move an existing cephfs pool that uses replication to erasure 
coding?


Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

 [image] 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] broken links to ceph papers

2017-03-08 Thread Gregory Farnum
You'd might have an easier time grabbing the source out of ceph.git/doc and
converting the raw rst files to whatever you want. :)
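From memory (untested), something like this should give you local HTML that you
can then print or convert to PDF however you like:

git clone https://github.com/ceph/ceph.git
cd ceph
./admin/build-doc     # Sphinx build, output ends up under build-doc/output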

On Wed, Mar 8, 2017 at 10:33 AM Daniel W Corley 
wrote:

> On this subject,  I have noticed there are no downloads available for the
> documentation at http://docs.ceph.com/docs/master/.   Would there be any
> concern if this were pulled via wget scripts and made into a PDF for
> offline reading or printing ? Possibly even being made available to share.
> ---
> --
>
>
>
> *Daniel W Corley - RHCE*
>
> *(713)301-1615*
>
>
>
> On 2017-03-08 11:46, Patrick McGarry wrote:
>
> Hey Martin,
>
> All of the links should be updated with the exception of the SK
> Telecom paper that was linked to IEEE. I'm working on getting a hard
> copy of that paper to host on ceph.com. Thanks for letting us know.
>
>
> On Wed, Mar 8, 2017 at 4:22 AM, Martin Bukatovic 
> wrote:
>
> Dear Ceph community,
>
> I noticed that many links on publications page[1]
> are broken, including link to weil-thesis.pdf
>
> Could you fix broken links so that the old links
> are working again?
>
> [1] http://ceph.com/publications/
>
> --
> Martin Bukatovic
> USM QE team
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] broken links to ceph papers

2017-03-08 Thread Daniel W Corley
On this subject,  I have noticed there are no downloads available for
the documentation at http://docs.ceph.com/docs/master/.   Would there be
any concern if this were pulled via wget scripts and made into a PDF for
offline reading or printing ? Possibly even being made available to
share. 

---

-

DANIEL W CORLEY - RHCE 

(713)301-1615 

On 2017-03-08 11:46, Patrick McGarry wrote:

> Hey Martin,
> 
> All of the links should be updated with the exception of the SK
> Telecom paper that was linked to IEEE. I'm working on getting a hard
> copy of that paper to host on ceph.com. Thanks for letting us know.
> 
> On Wed, Mar 8, 2017 at 4:22 AM, Martin Bukatovic  wrote: 
> 
>> Dear Ceph community,
>> 
>> I noticed that many links on publications page[1]
>> are broken, including link to weil-thesis.pdf
>> 
>> Could you fix broken links so that the old links
>> are working again?
>> 
>> [1] http://ceph.com/publications/
>> 
>> --
>> Martin Bukatovic
>> USM QE team
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

Links:
--
[1] http://ceph.com/publications/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] broken links to ceph papers

2017-03-08 Thread Patrick McGarry
Hey Martin,

All of the links should be updated with the exception of the SK
Telecom paper that was linked to IEEE. I'm working on getting a hard
copy of that paper to host on ceph.com. Thanks for letting us know.


On Wed, Mar 8, 2017 at 4:22 AM, Martin Bukatovic  wrote:
> Dear Ceph community,
>
> I noticed that many links on publications page[1]
> are broken, including link to weil-thesis.pdf
>
> Could you fix broken links so that the old links
> are working again?
>
> [1] http://ceph.com/publications/
>
> --
> Martin Bukatovic
> USM QE team
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] clarification for rgw installation and conflagration ( jwel )

2017-03-08 Thread Abhishek Lekshmanan


On 03/08/2017 04:55 PM, Yair Magnezi wrote:
> Hello Guys .
> 
> I'm new to RGW and need some clarification (I'm running 10.2.5).
> As much as I understand, 'jewel' uses Civetweb instead of Apache and
> FastCGI, but in the configuration guide (just the next step in the
> install guide) it says "Configuring a Ceph Object Gateway requires a
> running Ceph Storage Cluster, and an Apache web server with the FastCGI
> module"
> 
> http://docs.ceph.com/docs/jewel/radosgw/config/
> 
> 
> Civetweb is not mentioned at all and there are no instructions which
> relate to civetweb at all .
> I'd like to move on with configuration ( 'connecting' the rgw to my ceph
> cluster ) but don't understand how to do it .
> The section "ADDING A GATEWAY CONFIGURATION TO CEPH" has instructions
> only for Apache.
> Any clarification is much appreciated .
> 

There is some info in the install section:
http://docs.ceph.com/docs/jewel/install/install-ceph-gateway/#change-the-default-port

Essentially, the main configuration needed to run civetweb is the value of
`rgw_frontends` with civetweb and a port, as shown in the example there. There is
also some info in the migration section of the same doc; the docs
require some love from a willing community member ;)
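For example, something like this in ceph.conf is usually all that's needed to
get civetweb listening (the instance name is just an example):

[client.rgw.gateway-node1]
rgw_frontends = "civetweb port=7480"

followed by a restart of the radosgw service.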

Best,
Abhishek

> Thanks
> 
> 
> 
> 
> 
> 
> 
> This e-mail, as well as any attached document, may contain material
> which is confidential and privileged and may include trademark,
> copyright and other intellectual property rights that are proprietary to
> Kenshoo Ltd,  its subsidiaries or affiliates ("Kenshoo"). This e-mail
> and its attachments may be read, copied and used only by the addressee
> for the purpose(s) for which it was disclosed herein. If you have
> received it in error, please destroy the message and any attachment, and
> contact us immediately. If you are not the intended recipient, be aware
> that any review, reliance, disclosure, copying, distribution or use of
> the contents of this message without Kenshoo's express permission is
> strictly prohibited.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph PG repair

2017-03-08 Thread Reed Dier
This PG/object is still doing something rather odd.

I attempted to repair the object, and the repair supposedly ran, but now I appear 
to have less visibility.

> $ ceph health detail
> HEALTH_ERR 3 pgs inconsistent; 4 scrub errors; mds0: Many clients (20) 
> failing to respond to cache pressure; noout,sortbitwise,require_jewel_osds 
> flag(s) set
> pg 10.2d8 is active+clean+inconsistent, acting [18,17,22]
> pg 10.7bd is active+clean+inconsistent, acting [8,23,17]
> pg 17.ec is active+clean+inconsistent, acting [23,2,21]
> 4 scrub errors
> noout,sortbitwise,require_jewel_osds flag(s) set


23 is the osd scheduled for replacement, generated another read error.

However, 17.ec does not show in the rados list inconsistent pg objects command

> $ rados list-inconsistent-pg objects
> ["10.2d8","10.7bd”]

And examining 10.2d8 as before, I’m presented with this:

> $ rados list-inconsistent-obj 10.2d8 --format=json-pretty
> {
> "epoch": 21094,
> "inconsistents": []
> }

This is even though, in the logs, the deep scrub and repair both show that the object 
was not repaired.

> $ zgrep 10.2d8 ceph-*
> ceph-osd.18.log.2.gz:2017-03-06 15:10:08.729827 7fc8dfeb8700  0 
> log_channel(cluster) log [INF] : 10.2d8 repair starts
> ceph-osd.18.log.2.gz:2017-03-06 15:13:49.793839 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 recorded data digest 0x7fa9879c != on 
> disk 0xa6798e03 on {object.name}:head
> ceph-osd.18.log.2.gz:2017-03-06 15:13:49.793941 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : repair 10.2d8 {object.name}:head on disk 
> size (15913) does not match object info size (10280) adjusted for ondisk to 
> (10280)
> ceph-osd.18.log.2.gz:2017-03-06 15:46:13.286268 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 repair 2 errors, 0 fixed
> ceph-osd.18.log.4.gz:2017-03-04 18:16:23.693057 7fc8dd6b3700  0 
> log_channel(cluster) log [INF] : 10.2d8 deep-scrub starts
> ceph-osd.18.log.4.gz:2017-03-04 18:19:25.471322 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 recorded data digest 0x7fa9879c != on 
> disk 0xa6798e03 on {object.name}:head
> ceph-osd.18.log.4.gz:2017-03-04 18:19:25.471403 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : deep-scrub 10.2d8 {object.name}:head on disk 
> size (15913) does not match object info size (10280) adjusted for ondisk to 
> (10280)
> ceph-osd.18.log.4.gz:2017-03-04 18:55:39.617841 7fc8dd6b3700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 deep-scrub 2 errors


File size and md5 still match.

> ls -la 
> /var/lib/ceph/osd/ceph-*/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> -rw-r--r-- 1 ceph ceph 15913 Mar  2 17:24 
> /var/lib/ceph/osd/ceph-17/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}

> -rw-r--r-- 1 ceph ceph 15913 Mar  2 17:24 
> /var/lib/ceph/osd/ceph-18/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> -rw-r--r-- 1 ceph ceph 15913 Mar  2 17:24 
> /var/lib/ceph/osd/ceph-22/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}

> md5sum 
> /var/lib/ceph/osd/ceph-*/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> 55a76349b758d68945e5028784c59f24  
> /var/lib/ceph/osd/ceph-17/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> 55a76349b758d68945e5028784c59f24  
> /var/lib/ceph/osd/ceph-18/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> 55a76349b758d68945e5028784c59f24  
> /var/lib/ceph/osd/ceph-22/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}


So is the object actually inconsistent?
Is rados somehow behind on something, not showing the third inconsistent PG?

Appreciate any help.

Reed

> On Mar 2, 2017, at 9:21 AM, Reed Dier  wrote:
> 
> Over the weekend, two inconsistent PG’s popped up in my cluster. This being 
> after having scrubs disabled for close to 6 weeks after a very long rebalance 
> after adding 33% more OSD’s, an OSD failing, increasing PG’s, etc.
> 
> It appears we came out the other end with 2 inconsistent PG’s and I’m trying 
> to resolve them, and not seeming to have much luck.
> Ubuntu 16.04, Jewel 10.2.5, 3x replicated pool for reference.
> 
>> $ ceph health detail
>> HEALTH_ERR 2 pgs inconsistent; 3 scrub errors; 
>> noout,sortbitwise,require_jewel_osds flag(s) set
>> pg 10.7bd is active+clean+inconsistent, acting [8,23,17]
>> pg 10.2d8 is active+clean+inconsistent, acting [18,17,22]
>> 3 scrub errors
> 
>> $ rados list-inconsistent-pg objects
>> ["10.2d8","10.7bd”]
> 
> Pretty straight forward, 2 PG’s with inconsistent copies. Lets dig deeper.
> 
>> $ rados list-inconsistent-obj 10.2d8 --format=json-pretty
>> {
>> "epoch": 21094,
>> "inconsistents": [
>> {
>> "object": {
>> "name": “object.name",
>> "nspace": “namespace.name",
>> "locator": "",
>> "snap": "head"
>> },
>> "errors": [],
>> 

[ceph-users] clarification for rgw installation and conflagration ( jwel )

2017-03-08 Thread Yair Magnezi
Hello Guys .

I'm new to RGW and need some clarification (I'm running 10.2.5).
As much as I understand, 'jewel' uses Civetweb instead of Apache and FastCGI,
but in the configuration guide (just the next step in the install
guide) it says "Configuring a Ceph Object Gateway requires a running Ceph
Storage Cluster, and an Apache web server with the FastCGI module"

http://docs.ceph.com/docs/jewel/radosgw/config/

Civetweb is not mentioned at all and there are no instructions which relate
to civetweb at all .
I'd like to move on with configuration ( 'connecting' the rgw to my ceph
cluster ) but don't understand how to do it .
The section "ADDING A GATEWAY CONFIGURATION TO CEPH" has instructions only
for Apache.
Any clarification is much appreciated .

Thanks

-- 
This e-mail, as well as any attached document, may contain material which 
is confidential and privileged and may include trademark, copyright and 
other intellectual property rights that are proprietary to Kenshoo Ltd, 
 its subsidiaries or affiliates ("Kenshoo"). This e-mail and its 
attachments may be read, copied and used only by the addressee for the 
purpose(s) for which it was disclosed herein. If you have received it in 
error, please destroy the message and any attachment, and contact us 
immediately. If you are not the intended recipient, be aware that any 
review, reliance, disclosure, copying, distribution or use of the contents 
of this message without Kenshoo's express permission is strictly prohibited.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Jewel] upgrade 10.2.3 => 10.2.5 KO : first OSD server freeze every two days :)

2017-03-08 Thread pascal.pu...@pci-conseil.net

Hello,

No new information. Every two nights OSD server 1 freezes with a load > 500.

It's every 2 days. Sometimes during scrub, sometimes during fstrim, 
sometimes during nothing...


But this night, this OSD server did not come back to life after some minutes as 
before... 8 hours without this server and all its OSDs (12/36).


This morning, I restarted it and now, after some hours:

HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; 
recovery 304/46002595 objects degraded (0.001%); recovery 11288/46002595 
objects misplaced (0.025%); recovery 3/9779473 unfound (0.000%)
pg 50.2dd is stuck unclean for 23531.224308, current state 
active+recovering+degraded+remapped, last acting [7,28]

pg 50.2dd is active+recovering+degraded+remapped, acting [7,28], 3 unfound
recovery 304/46002595 objects degraded (0.001%)
recovery 11288/46002595 objects misplaced (0.025%)
recovery 3/9779473 unfound (0.000%)

PG 50.2dd is in an RBD pool (XFS filesystem on the images), with 2x replication.

So what is the best solution ?

#ceph pg 50.2dd mark_unfound_lost delete

or

#ceph pg 50.2dd mark_unfound_lost revert

?

Which one would have more impact on the RBD/XFS filesystem? Is an xfs_repair required 
afterwards?


So, I will probably try ceph version 10.2.6 this evening because I 
really found nothing to fix...


Why this freeze? Why does only this OSD server freeze and not the others? Why 
every 2 days? It's crazy.


I already checked all : disk, network, soft, all servers are equals.

(all issues started the day after upgrade to 10.2.5 from 10.2.3).

Thanks for your help.

Regards,

On 02/03/2017 at 15:34, pascal.pu...@pci-conseil.net wrote:


Hello,

So, I maybe need some advice: 1 week ago (last Feb 19), I upgraded 
my stable Ceph Jewel from 10.2.3 to 10.2.5 (YES, it was maybe a bad idea).


I never had a problem with Ceph 10.2.3 since the previous upgrade, last 23 
September.


So since my upgrade (10.2.5), every 2 days, the first OSD server 
totally freezes. Load goes > 500 and comes back after some minutes… I lose 
all OSDs from this server (12/36) during the issue.




[...]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shrinking lab cluster to free hardware for a new deployment

2017-03-08 Thread Henrik Korkuc

On 17-03-08 15:39, Kevin Olbrich wrote:

Hi!

Currently I have a cluster with 6 OSDs (5 hosts, 7TB RAID6 each).
We want to shut down the cluster but it holds some semi-productive VMs 
we might or might not need in the future.
To keep them, we would like to shrink our cluster from 6 to 2 OSDs (we 
use size 2 and min_size 1).


Should I set the OSDs out one by one or with norefill, norecovery 
flags set but all at once?

If last is the case, which flags should be set also?

Just set the OSDs out and wait for them to rebalance; the OSDs will be active and 
serve traffic while data is moved off them. I had a case where 
some PGs wouldn't move out, so after everything settles, you may need to 
remove OSDs from crush one by one.
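E.g. roughly:

ceph osd out <id>    # repeat for each OSD you want to retire
ceph -s              # wait until rebalancing finishes and the cluster is HEALTH_OK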



Thanks!

Kind regards,
Kevin Olbrich.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shrinking lab cluster to free hardware for a new deployment

2017-03-08 Thread Maxime Guyot
Hi Kevin,

I don’t know about those flags, but if you want to shrink your cluster you can 
simply set the weight of the OSDs to be removed to 0 like so: “ceph osd 
reweight osd.X 0”
You can either do it gradually if your are concerned about client I/O (probably 
not since you speak of a test / semi prod cluster) or all at once.
This should take care of all the data movements.

Once the cluster is back to HEALTH_OK, you can then proceed with the standard 
remove OSD procedure: 
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
You should be able to delete all the OSDs in a short period of time since the 
data movement has already been taken care of with the reweight.
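Roughly, per OSD to be removed (X being the OSD id, commands as per the doc
above, init system specifics may differ on your setup):

ceph osd out X                # no-op if you already reweighted it to 0
systemctl stop ceph-osd@X     # or: service ceph stop osd.X
ceph osd crush remove osd.X
ceph auth del osd.X
ceph osd rm X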

I hope that helps.

Cheers,
Maxime

From: ceph-users  on behalf of Kevin Olbrich 

Date: Wednesday 8 March 2017 14:39
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] Shrinking lab cluster to free hardware for a new 
deployment

Hi!

Currently I have a cluster with 6 OSDs (5 hosts, 7TB RAID6 each).
We want to shut down the cluster but it holds some semi-productive VMs we might 
or might not need in the future.
To keep them, we would like to shrink our cluster from 6 to 2 OSDs (we use size 
2 and min_size 1).

Should I set the OSDs out one by one or with norefill, norecovery flags set but 
all at once?
If last is the case, which flags should be set also?

Thanks!

Kind regards,
Kevin Olbrich.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] re enable scrubbing

2017-03-08 Thread Peter Maloney
On 03/08/17 13:50, Laszlo Budai wrote:
>
> In my case we have 72 OSDs. We are experiencing some performance
> issues. We believe that the reason is the scrubbing, so we want to
> turn scrubbing off for a few days.
> Given the default parameters of 1 day for scrub and 7 days for deep
> scrub. We turn off scrub for let's say 6 days, then when we turn it on
> will it try to do all the scrubbing that were supposed to be done in
> those days when it was turned off?
>
It looks at the last completed scrub time to see which ones to do as
soon as possible. It will do any outstanding scrubbing as soon as it
can. So just limit it, and it won't ruin performance.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Shrinking lab cluster to free hardware for a new deployment

2017-03-08 Thread Kevin Olbrich
Hi!

Currently I have a cluster with 6 OSDs (5 hosts, 7TB RAID6 each).
We want to shut down the cluster but it holds some semi-productive VMs we
might or might not need in the future.
To keep them, we would like to shrink our cluster from 6 to 2 OSDs (we use
size 2 and min_size 1).

Should I set the OSDs out one by one or with norefill, norecovery flags set
but all at once?
If last is the case, which flags should be set also?

Thanks!

Kind regards,
Kevin Olbrich.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange read results using FIO inside RBD QEMU VM ...

2017-03-08 Thread Xavier Trilla
After some investigation, it looks like the bottleneck is the OSDs' IOPS: the 
time it takes to complete each I/O operation seems to be too high.

We'll apply the following upgrades:


  *   Ceph.conf modifications to allow better utilization of SSD Drives
  *   Some extra sysctl modifications (Although in that front everything seems 
quite well adjusted)
  *   Jewel (Mainly to allow use of TCMalloc 2.4 with TC 128MB)
  *   Newer Kernel (We are using a fair old 3.14 kernel as standard in our 
platform)
  *   TC Malloc 2.4 (Not clear if we'll upgrade to Ubuntu 16.04 or we'll 
upgrade TCMalloc on 14.04)
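Regarding the TCMalloc thread cache point, my understanding is that on Ubuntu it
boils down to something like the following in /etc/default/ceph (file location and
variable handling may differ per distro/version) plus an OSD restart:

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728    # 128MB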

We still have to consider:


  *   Using Jemalloc in OSD Servers
  *   Using Jemalloc in QEMU

Any comments or suggestions are welcome :)

Thanks!

Best regards,
Xavier Trilla P.
SiliconHosting

A Cloud Server with SSDs, redundant
and available in less than 30 seconds?

Try it now at Clouding.io!

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Xavier 
Trilla
Sent: Tuesday, 7 March 2017 21:50
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Strange read results using FIO inside RBD QEMU VM ...

Hi,

We have a pure SSD based Ceph cluster (+100 OSDs with Enterprise SSDs and IT 
mode cards) Hammer 0.94.9 over 10G. It's really stable and we are really happy 
with the performance we are getting. But after a customer ran some tests, we 
noticed something quite strange. Our user did some tests using FIO, and 
the strange thing is that Write tests did work as expected, but some Read tests 
did not.

The VM he used was artificially limited via QEMU to 3200  read and 3200  write 
IOPS. In the write department everything works more or less as expected. The 
results get close to 3200 IOPS but the read tests are the ones we don't really 
understand.

We ran tests using different IO engines: sync, libaio and POSIX AIO. During the 
write tests the 3 of them performed quite similarly (which is something I did not 
really expect), but in the read department there is a huge difference:

Read Results (Random Read - Buffered: No - Direct: Yes - Block Size: 4KB):

LibAIO - Average: 3196 IOPS
POSIX AIO - Average: 878 IOPS
Sync -   Average: 929 IOPS

Write Results (Random Write - Buffered: No - Direct: Yes - Block Size: 4KB):

LibAIO -Average: 2741 IOPS
POSIX AIO -Average: 2673 IOPS
Sync -  Average: 2795 IOPS

I would expect a difference when using LibAIO or POSIX AIO, but I would expect 
it in both read and write results,  not only during reads.

So, I'm quite disoriented with this one... Does anyone have an idea about what 
might be going on?

Thanks!

Best regards,
Xavier Trilla P.
Clouding.io

A Cloud Server with SSDs, redundant
and available in less than 30 seconds?

Try it now at Clouding.io!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] re enable scrubbing

2017-03-08 Thread Laszlo Budai


In my case we have 72 OSDs. We are experiencing some performance issues. We 
believe that the reason is the scrubbing, so we want to turn scrubbing off for 
a few days.
Given the default parameters of 1 day for scrub and 7 days for deep scrub: if we 
turn off scrub for, let's say, 6 days, then when we turn it back on will it try to do 
all the scrubbing that was supposed to be done in those days when it was 
turned off?
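For the record, what we plan to run to disable and later re-enable it (assuming I
have the right flags):

ceph osd set noscrub
ceph osd set nodeep-scrub
# ... a few days later:
ceph osd unset noscrub
ceph osd unset nodeep-scrub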

Kind regards,
Laszlo

On 08.03.2017 12:51, Peter Maloney wrote:

It will stick to the config. If you limit the amount of work scrub does
at a time, then you can let it do whatever it wants without issues
(except 10.2.x which had a bug fixed in 10.2.4, but skip to 10.2.5 to
fix a regression).

For example:

# less scrub work at a time, with delay
osd scrub chunk min = 1  # default 5
osd scrub chunk max = 1 # default 25
osd scrub sleep = 0.5   # default 0

# lower scrub priority (possibly no effect since Jewel)
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 3


And this is already default:

osd deep scrub stride = 524288  # 512 KiB
osd max scrubs = 1


And I set this, but not recommending it. The reason I post it here is
just to show that the above is slowed down enough that everything is
scrubbed within this long scrub interval, but might need adjustment for
a more normal setting here:

# 60 days ... default is 7 days
osd deep scrub interval = 5259488


And more inline answers below


On 03/08/17 10:46, Laszlo Budai wrote:

Hello,

is there any risk related to cluster overload when the scrub is re
enabled after a certain amount of time being disabled?

I am thinking of the following scenario:
1. scrub/deep scrub are disabled.
2. after a while (few days) we re enable them. How will the cluster
perform?

should be as normal during normal scrubbing... just no/short breaks in
between. (use osd scrub sleep for this)

Will it run all the scrub jobs that were supposed to be running in the
meantime, or it will just start scheduling scrub jobs according to the
scrub related parameters?

It will run them 1 at a time, or however you have configured it, until
all are within the target time range. Why shouldn't it obey its config?

And maybe as a side effect, the next time they are scrubbed will also be
timed closely together too.



Can you point me to some documentation about this topic?

Nothing interesting with descriptions, just the reference manual for the
options listed above. Someone on IRC gave me the above options and I
tested and fiddled with them to see how ceph behaves.


Thank you,
Laszlo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replication vs Erasure Coding with only 2 elementsinthe failure-domain.

2017-03-08 Thread Maxime Guyot
Hi,

If using Erasure Coding, I think that should be using “choose indep” rather 
than “firstn” (according to 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/007306.html)

“- min_size 4
- max_size 4
- step take 
- step chooseleaf firstn 2 type host
- step emit
- step take 
- step chooseleaf firstn 2 type host
- step emit

Unfortunately I'm not aware of a solution. It would require replacing 'step 
take ' with 'step take ' and 'step take ' 
with 'step take '. Iteration is not part of crush as far as I 
know. Maybe someone else can give some more insight into this.”

How about something like this:

“rule eck2m2_ruleset {
  ruleset 0
  type erasure
  min_size 4
  max_size 4
  step take default
  step choose indep 2 type room
 step choose indep 2 type host
  step emit
}
“
Such a rule should put 2 shards in each room, on 4 different hosts in total.
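You can sanity check the placement such a rule produces offline with crushtool
before injecting anything into the cluster, roughly like this (rule number and
file names to be adapted):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt    # edit the rule in crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 0 --num-rep 4 --show-mappings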

If you are serious about surviving the loss of one of the rooms, you might want 
to consider the recovery time and how likely it is to have an OSD failure in 
the surviving room during the recovery phase. Something like EC(n,n+1) or  LRC 
(http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/) might 
help.

Cheers,
Maxime

From: ceph-users  on behalf of Burkhard 
Linke 
Date: Wednesday 8 March 2017 08:05
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Replication vs Erasure Coding with only 2 
elementsinthe failure-domain.


Hi,

On 03/07/2017 05:53 PM, Francois Blondel wrote:

Hi all,



We have (only) 2 separate "rooms" (crush bucket) and would like to build a 
cluster being able to handle the complete loss of one room.

*snipsnap*

Second idea would be to use Erasure Coding, as it fits our performance 
requirements and would use less raw space.



Creating an EC profile like:

   “ceph osd erasure-code-profile set eck2m2room k=2 m=2 
ruleset-failure-domain=room”

and a pool using that EC profile, with “ceph osd pool create ecpool 128 128 
erasure eck2m2room” of course leads to having “128 creating+incomplete” PGs, as 
we only have 2 rooms.



Is there somehow a way to store the “parity chuncks” (m) on both rooms, so that 
the loss of a room would be possible ?



If I understood correctly, an Erasure Coding of for example k=2, m=2, would use 
the same space as a replication with a size of 2, but be more reliable, as we 
could afford the loss of more OSDs at the same time.

Would it be possible to instruct the crush rule to store the first k and m 
chuncks in room 1, and the second k and m chuncks in room 2 ?

As far as I understand erasure coding there's no special handling for parity or 
data chunks. To assemble an EC object you just need any k chunks, regardless of 
whether they are data or parity chunks.

You should be able to distribute the chunks among two rooms by creating a new 
crush rule:

- min_size 4
- max_size 4
- step take 
- step chooseleaf firstn 2 type host
- step emit
- step take 
- step chooseleaf firstn 2 type host
- step emit

I'm not 100% sure about whether chooseleaf is correct or whether another choose step is 
necessary to ensure that two OSDs from different hosts are chosen (if 
necessary). The important point is using two choose-emit cycles and using the 
correct start points. Just insert the crush labels for the rooms.

This approach should work, but it has two drawbacks:

- crash handling
In case of a host failing in a room, the PGs from that host will be replicated to 
other hosts in the same room. You have to ensure that there's enough capacity 
in each room (vs. having enough capacity in the complete cluster), which might 
be tricky.

- bandwidth / host utilization
Almost all ceph based applications/libraries use the 'primary' osd for 
accessing data in a PG. The primary OSD is the first one generated by the crush 
rule. In the above example, the primary OSDs will all be located in the first 
room. All client traffic will be heading to hosts in that room. Depending on 
your setup this might not be a desired solution.

Unfortunately I'm not aware of a solution. It would require to replace 'step 
take ' with 'step take ' and 'step take ' 
with 'step take '. Iteration is not part of crush as far as I 
know. Maybe someone else can give some more insight into this.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel v10.2.6 released

2017-03-08 Thread Abhishek L
This point release fixes several important bugs in RBD mirroring, RGW 
multi-site, CephFS, and RADOS.

We recommend that all v10.2.x users upgrade.

For more detailed information, see the complete changelog[1] and the release 
notes[2]

Notable Changes
---
* build/ops: add hostname sanity check to run-{c}make-check.sh (issue#18134 , 
pr#12302 , Nathan Cutler)
* build/ops: add ldap lib to rgw lib deps based on build config (issue#17313 , 
pr#13183 , Nathan Cutler)
* build/ops: ceph-create-keys loops forever (issue#17753 , pr#11884 , Alfredo 
Deza)
* build/ops: ceph daemons DUMPABLE flag is cleared by setuid preventing 
coredumps (issue#17650 , pr#11736 , Patrick Donnelly)
* build/ops: fixed compilation error when --with-radowsgw=no (issue#18512 , 
pr#12729 , Pan Liu)
* build/ops: fixed the issue when --disable-server, compilation fails. 
(issue#18120 , pr#12239 , Pan Liu)
* build/ops: fix undefined crypto references with --with-xio (issue#18133 , 
pr#12296 , Nathan Cutler)
* build/ops: install-deps.sh based on /etc/os-release (issue#18466 , 
issue#18198 , pr#12405 , Jan Fajerski, Nitin A Kamble, Nathan Cutler)
* build/ops: Remove the runtime dependency on lsb_release (issue#17425 , 
pr#11875 , John Coyle, Brad Hubbard)
* build/ops: rpm: /etc/ceph/rbdmap is packaged with executable access rights 
(issue#17395 , pr#11855 , Ken Dreyer)
* build/ops: selinux: Allow ceph to manage tmp files (issue#17436 , pr#13048 , 
Boris Ranto)
* build/ops: systemd: Restart Mon after 10s in case of failure (issue#18635 , 
pr#13058 , Wido den Hollander)
* build/ops: systemd restarts Ceph Mon to quickly after failing to start 
(issue#18635 , pr#13184 , Wido den Hollander)
* ceph-disk: fix flake8 errors (issue#17898 , pr#11976 , Ken Dreyer)
* cephfs: fuse client crash when adding a new osd (issue#17270 , pr#11860 , 
John Spray)
* cli: ceph-disk: convert none str to str before printing it (issue#18371 , 
pr#13187 , Kefu Chai)
* client: Fix lookup of "/.." in jewel (issue#18408 , pr#12766 , Jeff Layton)
* client: fix stale entries in command table (issue#17974 , pr#12137 , John 
Spray)
* client: populate metadata during mount (issue#18361 , pr#13085 , John Spray)
* cli: implement functionality for adding, editing and removing omap values 
with binary keys (issue#18123 , pr#12755 , Jason Dillaman)
* common: Improve linux dcache hash algorithm (issue#17599 , pr#11529 , Yibo 
Cai)
* common: utime.h: fix timezone issue in round_to_* funcs.  (issue#14862 , 
pr#11508 , Zhao Chao)
* doc: Python Swift client commands in Quick Developer Guide don't match 
configuration in vstart.sh (issue#17746 , pr#13043 , Ronak Jain)
* librbd: allow to open an image without opening parent image (issue#18325 , 
pr#13130 , Ricardo Dias)
* librbd: metadata_set API operation should not change global config setting 
(issue#18465 , pr#13168 , Mykola Golub)
* librbd: new API method to force break a peer's exclusive lock (issue#15632 , 
issue#16773 , issue#17188 , issue#16988 , issue#17210 , issue#17251 , 
issue#18429 , issue#17227 , issue#18327 , issue#17015 , pr#12890 , Danny 
Al-Gaaf, Mykola Golub, Jason Dillaman)
* librbd: properly order concurrent updates to the object map (issue#16176 , 
pr#12909 , Jason Dillaman)
* librbd: restore journal access when force disabling mirroring (issue#17588 , 
pr#11916 , Mykola Golub)
* mds: Cannot create deep directories when caps contain path=/somepath 
(issue#17858 , pr#12154 , Patrick Donnelly)
* mds: cephfs metadata pool: deep-scrub error omap_digest != best guess 
omap_digest (issue#17177 , pr#12380 , Yan, Zheng)
* mds: cephfs test failures (ceph.com/qa is broken, should be 
download.ceph.com/qa) (issue#18574 , pr#13023 , John Spray)
* mds: ceph-fuse crash during snapshot tests (issue#18460 , pr#13120 , Yan, 
Zheng)
* mds: ceph_volume_client: fix recovery from partial auth update  (issue#17216 
, pr#11656 , Ramana Raja)
* mds: ceph_volume_client.py : Error: Can't handle arrays of non-strings 
(issue#17800 , pr#12325 , Ramana Raja)
* mds: Cleanly reject session evict command when in replay (issue#17801 , 
pr#12153 , Yan, Zheng)
* mds: client segfault on ceph_rmdir path / (issue#9935 , pr#13029 , Michal 
Jarzabek)
* mds: Clients without pool-changing caps shouldn't be allowed to change 
pool_namespace (issue#17798 , pr#12155 , John Spray)
* mds: Decode errors on backtrace will crash MDS (issue#18311 , pr#12836 , 
Nathan Cutler, John Spray)
* mds: false "failing to respond to cache pressure" warning (issue#17611 , 
pr#11861 , Yan, Zheng)
* mds: finish clientreplay requests before requesting active state (issue#18461 
, pr#13113 , Yan, Zheng)
* mds: fix incorrect assertion in Server::_dir_is_nonempty() (issue#18578 , 
pr#13459 , Yan, Zheng)
* mds: fix MDSMap upgrade decoding (issue#17837 , pr#13139 , John Spray, 
Patrick Donnelly)
* mds: fix missing ll_get for ll_walk (issue#18086 , pr#13125 , Gui Hecheng)
* mds: Fix mount root for ceph_mount users and change tarball format 
(issue#18312 , issue#18254 , pr#12592 

Re: [ceph-users] re enable scrubbing

2017-03-08 Thread Peter Maloney
It will stick to its config. If you limit the amount of work scrub does
at a time, you can let it run whenever it wants without issues (except
on 10.2.x, which had a bug fixed in 10.2.4; skip straight to 10.2.5 to
also avoid a regression).

For example:
> # less scrub work at a time, with delay
> osd scrub chunk min = 1  # default 5
> osd scrub chunk max = 1 # default 25
> osd scrub sleep = 0.5   # default 0
>
> # lower scrub priority (possibly no effect since Jewel)
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 3

And these are already the defaults:
> osd deep scrub stride = 524288  # 512 KiB
> osd max scrubs = 1

I also set the following, though I'm not recommending it. I'm posting it
only to show that the settings above slow scrubbing down enough that
everything still gets scrubbed within this long interval; for a more
typical interval you may need to adjust them:
> # 60 days ... default is 7 days
> osd deep scrub interval = 5259488
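
As an aside, if you want to push this throttling into running OSDs before
re-enabling scrub, something like this should work (a rough sketch, assuming
a Jewel-era cluster; the values are the ones from above):

  # inject the scrub throttling into all OSDs at runtime, no restart needed
  ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 1 --osd_scrub_sleep 0.5'

  # then lift the flags, if scrubbing was disabled with them
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub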

And more inline answers below


On 03/08/17 10:46, Laszlo Budai wrote:
> Hello,
>
> is there any risk related to cluster overload when the scrub is re
> enabled after a certain amount of time being disabled?
>
> I am thinking of the following scenario:
> 1. scrub/deep scrub are disabled.
> 2. after a while (few days) we re enable them. How will the cluster
> perform? 
It should perform as it does during normal scrubbing, just with no or
only short breaks in between (use osd scrub sleep to control this).
> Will it run all the scrub jobs that were supposed to be running in the
> meantime, or it will just start scheduling scrub jobs according to the
> scrub related parameters?
It will run them 1 at a time, or however you have configured it, until
all are within the target time range. Why shouldn't it obey its config?

And perhaps as a side effect, the next time they are scrubbed they will
again end up scheduled closely together.
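
To confirm the backlog is actually shrinking, you can check the per-PG scrub
timestamps, e.g. (a sketch; the exact column layout differs between releases):

  # last_scrub_stamp / last_deep_scrub_stamp show when each PG was last scrubbed
  ceph pg dump pgs | less -S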
>
>
> Can you point me to some documentation about this topic?
Nothing interesting with descriptions, just the reference manual for the
options listed above. Someone on IRC gave me the above options and I
tested and fiddled with them to see how ceph behaves.
>
> Thank you,
> Laszlo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Much more dentries than inodes, is that normal?

2017-03-08 Thread John Spray
On Tue, Mar 7, 2017 at 3:05 PM, Xiaoxi Chen  wrote:
> Thanks John.
>
> Very likely. Note that mds_mem::ino + mds_cache::strays_created ~=
> mds::inodes; also, this MDS was the standby one and became active a few
> days ago due to a failover.
>
> mds": {
> "inodes": 1291393,
> }
> "mds_cache": {
> "num_strays": 3559,
> "strays_created": 706120,
> "strays_purged": 702561
> }
> "mds_mem": {
> "ino": 584974,
> }
>
> I do have a cache dump from the MDS via the admin socket; is there
> anything I can check in it to make 100% sure?

You could go through that dump and look for the dentries with no inode
number set, but honestly if this is a previously-standby-replay daemon
and you're running pre-Kraken code I'd be pretty sure it's the known
issue.
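
For example, something along these lines gives a rough count (a sketch only;
the dump format and the NULL marker for unlinked dentries vary by release, and
mds.<id> is a placeholder for your MDS name):

  # write the MDS cache to a file on the MDS host via the admin socket
  ceph daemon mds.<id> dump cache /tmp/mdscache.txt

  # total dentries vs. dentries printed without a linked inode
  grep -c '\[dentry' /tmp/mdscache.txt
  grep '\[dentry' /tmp/mdscache.txt | grep -c NULL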

John

>
> Xiaoxi
>
> 2017-03-07 22:20 GMT+08:00 John Spray :
>> On Tue, Mar 7, 2017 at 9:17 AM, Xiaoxi Chen  wrote:
>>> Hi,
>>>
>>>   From the admin socket of mds, I got following data on our
>>> production cephfs env, roughly we have 585K inodes and almost same
>>> amount of caps, but we have>2x dentries than inodes.
>>>
>>>   I am pretty sure we don't use hard links intensively (if at all).
>>> And the #ino matches the object count from "rados ls --pool $my_data_pool".
>>>
>>>   Thanks for any explanations, appreciate it.
>>>
>>>
>>> "mds_mem": {
>>> "ino": 584974,
>>> "ino+": 1290944,
>>> "ino-": 705970,
>>> "dir": 25750,
>>> "dir+": 25750,
>>> "dir-": 0,
>>> "dn": 1291393,
>>> "dn+": 1997517,
>>> "dn-": 706124,
>>> "cap": 584560,
>>> "cap+": 2657008,
>>> "cap-": 2072448,
>>> "rss": 24599976,
>>> "heap": 166284,
>>> "malloc": 18446744073708721289,
>>> "buf": 0
>>> },
>>>
>>
>> One possibility is that you have many "null" dentries, which are
>> created when we do a lookup and a file is not found -- we create a
>> special dentry to remember that that filename does not exist, so that
>> we can return ENOENT quickly next time.  On pre-Kraken versions, null
>> dentries can also be left behind after file deletions when the
>> deletion is replayed on a standby-replay MDS
>> (http://tracker.ceph.com/issues/16919)
>>
>> John
>>
>>
>>
>>>
>>> Xiaoxi
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] re enable scrubbing

2017-03-08 Thread Laszlo Budai

Hello,

is there any risk of cluster overload when scrubbing is re-enabled after 
having been disabled for some time?

I am thinking of the following scenario:
1. scrub/deep scrub are disabled.
2. after a while (a few days) we re-enable them. How will the cluster perform? 
Will it run all the scrub jobs that were supposed to run in the meantime, or 
will it just start scheduling scrub jobs according to the scrub-related 
parameters?


Can you point me to some documentation about this topic?

Thank you,
Laszlo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] broken links to ceph papers

2017-03-08 Thread Martin Bukatovic
Dear Ceph community,

I noticed that many links on the publications page[1]
are broken, including the link to weil-thesis.pdf.

Could you fix the broken links so that the old URLs
work again?

[1] http://ceph.com/publications/

-- 
Martin Bukatovic
USM QE team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-08 Thread Matteo Dacrema
Ok, thank you guys.

I changed the InnoDB flush method to O_DIRECT and it seems to perform 
noticeably better.
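
For the record, the sync-write fio test Adrian suggested can be approximated
with something like this (just a sketch; the target file, size and runtime are
placeholders):

  fio --name=syncwrite --filename=/var/lib/mysql/fio.test --size=1G \
      --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --sync=1 \
      --iodepth=1 --runtime=60 --time_based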

Regards
Matteo




> On 08 Mar 2017, at 09:08, Wido den Hollander wrote:
> 
>> 
>> On 8 March 2017 at 0:35, Matteo Dacrema wrote:
>> 
>> 
>> Thank you Adrian!
>> 
>> I’ve forgot this option and I can reproduce the problem.
>> 
>> Now, what could be the problem on ceph side with O_DSYNC writes?
>> 
> 
> As mentioned nothing, but what you can do with MySQL is provide it multiple 
> RBD disks, eg:
> 
> - Disk for Operating System
> - Disk for /var/lib/mysql
> - Disk for InnoDB data
> - Disk for InnoDB log
> - Disk for /var/log/mysql (binary logs)
> 
> That way you can send in more parallel I/O into the Ceph cluster and gain 
> more performance.
> 
> Wido
> 
>> Regards
>> Matteo
>> 
>> 
>> 
>> 
>>> On 08 Mar 2017, at 00:25, Adrian Saul wrote:
>>> 
>>> 
>>> Possibly MySQL is doing sync writes, where as your FIO could be doing 
>>> buffered writes.
>>> 
>>> Try enabling the sync option on fio and compare results.
>>> 
>>> 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Matteo Dacrema
 Sent: Wednesday, 8 March 2017 7:52 AM
 To: ceph-users
 Subject: [ceph-users] MySQL and ceph volumes
 
 Hi All,
 
 I have a galera cluster running on openstack with data on ceph volumes
 capped at 1500 iops for read and write ( 3000 total ).
 I can’t understand why with fio I can reach 1500 iops without IOwait and
 MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
 
 I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) 
 and I
 can’t reproduce the problem.
 
 Anyone can tell me where I’m wrong?
 
 Thank you
 Regards
 Matteo
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> 

Re: [ceph-users] MySQL and ceph volumes

2017-03-08 Thread Wido den Hollander

> On 8 March 2017 at 0:35, Matteo Dacrema wrote:
> 
> 
> Thank you Adrian!
> 
> I’ve forgot this option and I can reproduce the problem.
> 
> Now, what could be the problem on ceph side with O_DSYNC writes?
> 

As mentioned nothing, but what you can do with MySQL is provide it multiple RBD 
disks, eg:

- Disk for Operating System
- Disk for /var/lib/mysql
- Disk for InnoDB data
- Disk for InnoDB log
- Disk for /var/log/mysql (binary logs)

That way you can send in more parallel I/O into the Ceph cluster and gain more 
performance.
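
A minimal my.cnf sketch of such a layout, assuming the extra RBD volumes are
mounted under /mysql/* (the mount points are only examples):

  [mysqld]
  datadir                   = /mysql/data              # InnoDB data on its own volume
  innodb_log_group_home_dir = /mysql/log               # redo logs on a separate volume
  log_bin                   = /mysql/binlog/mysql-bin  # binary logs on another volume
  innodb_flush_method       = O_DIRECT                 # bypass the page cache, avoid double buffering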

Wido

> Regards
> Matteo
> 
> 
> 
> 
> > On 08 Mar 2017, at 00:25, Adrian Saul wrote:
> > 
> > 
> > Possibly MySQL is doing sync writes, where as your FIO could be doing 
> > buffered writes.
> > 
> > Try enabling the sync option on fio and compare results.
> > 
> > 
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> >> Matteo Dacrema
> >> Sent: Wednesday, 8 March 2017 7:52 AM
> >> To: ceph-users
> >> Subject: [ceph-users] MySQL and ceph volumes
> >> 
> >> Hi All,
> >> 
> >> I have a galera cluster running on openstack with data on ceph volumes
> >> capped at 1500 iops for read and write ( 3000 total ).
> >> I can’t understand why with fio I can reach 1500 iops without IOwait and
> >> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
> >> 
> >> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) 
> >> and I
> >> can’t reproduce the problem.
> >> 
> >> Anyone can tell me where I’m wrong?
> >> 
> >> Thank you
> >> Regards
> >> Matteo
> >> 
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com