[ceph-users] Issue of S3 API: x-amz-acl: public-read-write and authenticated-read
Hi ceph-users,

I am trying the S3-compatible API of Ceph, but am hitting the following issues:

1. x-amz-acl: public-read-write

I upload an object with the public-read-write ACL. I can then GET this object directly, without an access key:

    curl -v -s http://radosgw_server/mybucket0/20131015_1
    ...
    HTTP/1.1 200
    ...

But I can't write or delete this object without an access key:

    curl -v -s http://ceph7.dev.mobstor.corp.bf1.yahoo.com/mybucket0/20131015_1 -XPUT -d 1234

or

    curl -v -s http://ceph7.dev.mobstor.corp.bf1.yahoo.com/mybucket0/20131015_1 -XDELETE
    ...
    HTTP/1.1 403
    ...
    <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code></Error>

2. x-amz-acl: authenticated-read

I have created two radosgw users. I upload an object with the authenticated-read ACL using the access key of one radosgw user. I can then GET this object using that user's access key, but I can't GET it using the other user's access key.

I am not sure whether I am using the authenticated-read ACL correctly; please correct me if I am wrong. Thanks.

--
Regards,
Zhi
[ceph-users] how to make ceph with hadoop
hi all!

My ceph is 0.62, and I want to build it with hadoop:

    ./configure --with-hadoop

but it returns "jni.h not found". I found jni.h in /usr/java/jdk/include/jni.h. How can I fix this problem? Thanks.

pengft
Re: [ceph-users] Ceph, Keystone and S3
Hi All,

Does anyone know if it'll be possible to use the radosgw admin API when using Keystone users? I suspect not, due to the user requiring specific caps, however it'd be great if someone could validate (I'm still running v0.67.4 so can't play with this much). Thanks!

-Matt

On Tue, Oct 15, 2013 at 6:34 PM, Carlos Gimeno Yañez <cgim...@bifi.es> wrote:
> Thank you very much Yehuda, that was the missing piece of my puzzle! I
> think that this should be added to the official documentation.
>
> Regards
>
> 2013/10/15 Yehuda Sadeh <yeh...@inktank.com>:
>> On Tue, Oct 15, 2013 at 7:17 AM, Carlos Gimeno Yañez <cgim...@bifi.es> wrote:
>>> Hi,
>>>
>>> I've deployed Ceph using ceph-deploy, following the official
>>> documentation. I've created a user to use with Swift and everything is
>>> working fine: my users can create buckets and upload files if they use
>>> the Horizon dashboard or the Swift CLI. However, everything changes if
>>> they try to do it with the S3 API. When they download their credentials
>>> from the Horizon dashboard to get their keys, they can't connect to Ceph
>>> using the S3 API; they only get a 403 Access Denied error message. I'm
>>> using Ceph 0.70, so if I'm not wrong, Ceph should be able to validate S3
>>> tokens against Keystone since version 0.69.
>>>
>>> Here is my ceph.conf:
>>>
>>> [client.radosgw.gateway]
>>> host = server2
>>> keyring = /etc/ceph/keyring.radosgw.gateway
>>> rgw socket path = /var/run/ceph/radosgw.sock
>>> log file = /var/log/ceph/radosgw.log
>>> rgw keystone url = server4:35357
>>> rgw keystone admin token = admintoken
>>> rgw keystone accepted roles = admin _member_ Member
>>> rgw print continue = false
>>> rgw keystone token cache size = 500
>>> rgw keystone revocation interval = 500
>>> nss db path = /var/ceph/nss
>>> #Add DNS hostname to enable S3 subdomain calls
>>> rgw dns name = server2
>>>
>>> And this is the error message (with s3-curl):
>>>
>>> GET / HTTP/1.1
>>> User-Agent: curl/7.29.0
>>> Host: host_ip
>>> Accept: */*
>>> Date: Tue, 15 Oct 2013 14:07:24 +0000
>>> Authorization: AWS 3a1ecdea87d6493a9922c13a06d392cf:SNu/sjTuDtvunOQKJaU8Besm1RQ=
>>>
>>> HTTP/1.1 403 Forbidden
>>> Date: Tue, 15 Oct 2013 14:07:24 GMT
>>> Server: Apache/2.2.22 (Ubuntu)
>>> Accept-Ranges: bytes
>>> Content-Length: 78
>>> Content-Type: application/xml
>>> <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code></Error>
>>>
>>> Regards
>>
>> Try adding:
>>
>> rgw s3 auth use keystone = true
>>
>> to your ceph.conf
>>
>> Yehuda
[ceph-users] radosgw public access problem
Hello,

When I set a read permission for all users on the bucket, I can read the content of the bucket, but I receive "access denied" for all directories and sub-directories inside this bucket.

Where am I going wrong?

Many thanks,
Fabio
[ceph-users] bit correctness and checksumming
Hi all,

There has been some confusion the past couple of days at the CHEP conference during conversations about Ceph and protection from bit flips or other subtle data corruption. Can someone please summarise the current state of data integrity protection in Ceph, assuming we have an XFS backend filesystem? i.e. don't rely on the protection offered by btrfs. I saw in the docs that wire messages and journal writes are CRC'd, but nothing explicit about the objects themselves.

We also have some specific questions:

1. Is an object checksum stored on the OSD somewhere? Is this in user.ceph._? It wasn't obvious when looking at the code...

2. When is the checksum verified? Surely it is checked during the deep scrub, but what about during an object read?

2b. Can a user read corrupted data if the master replica has a bit flip but this hasn't yet been found by a deep scrub?

3. During deep scrub of an object with 2 replicas, suppose the checksum is different for the two objects -- which object wins? (I.e. if you store the checksum locally, this is trivial since the consistency of objects can be evaluated locally. Without the local checksum, you can have conflicts.)

4. If the checksum is already stored per object in the OSD, is it retrievable by librados? We have some applications which also need to know the checksum of the data, and this would be handy if it was already calculated by Ceph.

Thanks in advance!

Dan van der Ster
CERN IT
Re: [ceph-users] how to make ceph with hadoop
The --with-hadoop option has been removed. The Ceph Hadoop bindings are now located in git://github.com/ceph/hadoop-common (cephfs/branch-1.0), and the required CephFS Java bindings can be built from the Ceph Git repository using the --enable-cephfs-java configure option.

On Wed, Oct 16, 2013 at 12:26 AM, 鹏 <wkp4...@126.com> wrote:
> hi all!
>
> My ceph is 0.62, and I want to build it with hadoop:
>
>     ./configure --with-hadoop
>
> but it returns "jni.h not found". I found jni.h in
> /usr/java/jdk/include/jni.h. How can I fix this problem? Thanks.
>
> pengft
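For reference, when configure can't find jni.h, one possible workaround is to point the preprocessor at the JDK headers. This is an untested sketch, assuming the JDK lives at /usr/java/jdk as in the original mail and that the tree supports --enable-cephfs-java:

    ./configure --enable-cephfs-java \
        CPPFLAGS="-I/usr/java/jdk/include -I/usr/java/jdk/include/linux"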
Re: [ceph-users] radosgw public access problem
On 10/16/13 5:15 AM, Fabio - NS3 srl wrote:
> Hello,
> When I set a read permission for all users on the bucket, I can read the
> content of the bucket, but I receive "access denied" for all directories
> and sub-directories inside this bucket. Where am I going wrong?

Hi Fabio,

This is the default S3 behavior. The default canned ACL grants FULL_CONTROL to the user who writes the key. You will have to iterate the keys and grant a specific read ACL; you can also specify the ACL at upload time for each key.

Also, we have a patch pending [1] that provides some relief for this use case: it would allow the bucket ACLs to be evaluated and be authoritative before the key ACLs. It needs to get cleaned up a bit, but I think it would be very useful in your case. We are about to go into production running this on two different Ceph object stores.

[1] - https://github.com/ceph/ceph/pull/672

Thanks,
derek

--
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
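For example, iterating the keys and granting a read ACL can be done with s3cmd (a sketch, assuming s3cmd is configured against the gateway; the bucket name is hypothetical):

    # grant anonymous read on every existing key under the bucket
    s3cmd setacl --acl-public --recursive s3://mybucket/

    # or set the ACL at upload time for a single key
    s3cmd put --acl-public localfile s3://mybucket/dir/localfile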
Re: [ceph-users] Missing Dependency for ceph-deploy 1.2.7
On Tue, Oct 15, 2013 at 9:54 PM, Luke Jing Yuan <jyl...@mimos.my> wrote:
> Hi,
>
> I am trying to install/upgrade to 1.2.7 but Ubuntu (Precise) is complaining
> about an unmet dependency, which seems to be python-pushy 0.5.3, which
> seems to be missing. Am I correct to assume so?
>
> Regards,
> Luke

That is odd, we still have pushy packages available for the version that you are having issues with, see:

http://ceph.com/debian-dumpling/pool/main/p/python-pushy/

It might be that you need to update your repos?
[ceph-users] poor read performance on rbd+LVM, LVM overload
Hello ceph & LVM communities!

I noticed very slow reads from an xfs mount that is on a ceph client (rbd + gpt partition + LVM PV + xfs on LE). To find a cause, I created another rbd in the same pool, formatted it straight away with xfs, and mounted it.

Write performance for both xfs mounts is similar, ~12MB/s. Reads with

    dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null

are as follows:

    with LVM: ~4MB/s
    pure xfs: ~30MB/s

I watched performance while doing reads with atop. In the LVM case, atop shows the LVM device overloaded:

    LVM | s-LV_backups | busy 95% | read 21515 | write 0 | KiB/r 4 |
        | KiB/w 0 | MBr/s 4.20 | MBw/s 0.00 | avq 1.00 | avio 0.85 ms |

client kernel: 3.10.10
ceph version: 0.67.4

My considerations: I have expanded the rbd under LVM a couple of times (accordingly expanding the gpt partition, PV, VG, LV, and xfs afterwards), but that should have no impact on performance (I tested a clean rbd+LVM; same read performance as for the expanded one). As with device-mapper, after LVM is initialized it is just a small table with LE-PE mappings that should reside in a close CPU cache. I am guessing this could be related to the old CPU used; probably caching near the CPU does not work well (I also tested local HDDs with/without LVM and got read speeds of ~13MB/s vs 46MB/s, with atop showing the same overload in the LVM case).

What could make so great a difference when LVM is used, and what/how should I tune? As write performance does not differ, DM extent lookup should not be lagging -- where is the trick?

CPU used:

    # cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 4
    model name      : Intel(R) Xeon(TM) CPU 3.20GHz
    stepping        : 10
    microcode       : 0x2
    cpu MHz         : 3200.077
    cache size      : 2048 KB
    physical id     : 0
    siblings        : 2
    core id         : 0
    cpu cores       : 1
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 5
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl cid cx16 xtpr lahf_lm
    bogomips        : 6400.15
    clflush size    : 64
    cache_alignment : 128
    address sizes   : 36 bits physical, 48 bits virtual
    power management:

Br,
Ugis
Re: [ceph-users] ceph-deploy zap disk failure
On Tue, Oct 15, 2013 at 9:19 PM, Guang <yguan...@yahoo.com> wrote:
> -bash-4.1$ which sgdisk
> /usr/sbin/sgdisk
>
> Which path does ceph-deploy use?

That is unexpected... these are the paths that ceph-deploy uses:

    '/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin'

So /usr/sbin/ is there. I believe this is a case where $PATH gets altered because of sudo (resetting the env variable). This should be fixed in the next release.

In the meantime, you could set the $PATH for non-interactive sessions (which is what ceph-deploy uses) for all users. I *think* that would be in /etc/profile.

> Thanks,
> Guang
>
> On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote:
>> On Tue, Oct 15, 2013 at 10:52 AM, Guang <yguan...@yahoo.com> wrote:
>>> Hi ceph-users,
>>> I am trying the new ceph-deploy utility on RHEL 6.4 and I came across a
>>> new issue:
>>>
>>> -bash-4.1$ ceph-deploy --version
>>> 1.2.7
>>> -bash-4.1$ ceph-deploy disk zap server:/dev/sdb
>>> [ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap server:/dev/sdb
>>> [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server
>>> [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from remote host
>>> [ceph_deploy.osd][INFO ] Distro info: Red Hat Enterprise Linux Server 6.4 Santiago
>>> [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device
>>> [osd2.ceph.mobstor.bf1.yahoo.com][INFO ] Running command: sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb
>>> [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found
>>>
>>> When I run disk zap on the host directly, it works without issues. Has
>>> anyone met the same issue?
>>
>> Can you run `which sgdisk` on that host? I want to make sure this is not
>> a $PATH problem. ceph-deploy tries to use the proper path remotely but it
>> could be that this one is not there.
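If sudo's PATH reset is indeed the culprit, one possible workaround (an assumption, not a confirmed fix) is to make sure the sbin directories appear in sudo's secure_path. On RHEL that is set in /etc/sudoers; a sketch (edit with visudo):

    # /etc/sudoers -- ensure the sbin dirs holding sgdisk are searched
    Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin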
[ceph-users] snapshots on CephFS
Hi Greg,

On http://comments.gmane.org/gmane.comp.file-systems.ceph.user/1705 I found a statement from you regarding snapshots on CephFS:

---snip---
Filesystem snapshots exist and you can experiment with them on CephFS (there's a hidden .snaps folder; you can create or remove snapshots by creating directories in that folder; navigate up and down it, etc).
---snip---

Can you please explain in more detail, or with example commands, how to create/list/remove snapshots in CephFS? I assume they will be created on a directory level?

How will the CephFS snapshots cohere with the underlying pools? (e.g. using cephfs /mnt/cephfs/dir-1/dir2 set_layout -p 18)

Thanks,
-Dieter
Re: [ceph-users] kvm live migrate with ceph
I wouldn't go so far as to say putting a vm in a file on a networked filesystem is wrong. It is just not the best choice if you have a ceph cluster at hand, in my opinion. Networked filesystems have a bunch of extra stuff to implement posix semantics and live in kernel space. You just need simple block device semantics, and you don't need to entangle the hypervisor's kernel space. What it boils down to is the engineering first principle of selecting the least complicated solution that satisfies the requirements of the problem. You don't get anything when you trade the simplicity of rbd for the complexity of a networked filesystem.

For format 2, I think the only caveat is that it requires newer clients, and the kernel client takes some time to catch up to the user-space clients. You may not be able to mount filesystems on rbd devices with the kernel client, depending on kernel version; this may or may not be important to you. You can always use a vm to mount a filesystem on an rbd device as a workaround.

On Oct 16, 2013, at 9:11 AM, Jon <three1...@gmail.com> wrote:
> Could you possibly expound on why using a clustered filesystem approach is
> wrong (or conversely, why using RBDs is the correct approach)?
>
> As for format 2 rbd images, it looks like they provide exactly the
> copy-on-write functionality that I am looking for. Any caveats or things I
> should look out for when going from format 1 to format 2 images?
> [...]
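For reference, the format 2 golden-image workflow mentioned above looks roughly like this on a recent (cuttlefish/dumpling) rbd CLI -- a sketch, with pool, image, and snapshot names made up:

    # create a format 2 image and install the golden OS into it
    rbd create --image-format 2 --size 10240 rbd/golden

    # snapshot it, protect the snapshot, then clone per-guest COW children
    rbd snap create rbd/golden@base
    rbd snap protect rbd/golden@base
    rbd clone rbd/golden@base rbd/guest1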
Re: [ceph-users] radosgw-admin doesn't list user anymore
On 10/16/13 4:26 AM, Valery Tschopp wrote:
> Hi Derek,
>
> Thanks for your example. I've added caps='metadata=*', but I still have an
> error and get:
>
> send: 'GET /admin/metadata/user?format=json HTTP/1.1\r\nHost: objects.bcc.switch.ch\r\nAccept-Encoding: identity\r\nDate: Wed, 16 Oct 2013 08:09:57 GMT\r\nContent-Length: 0\r\nAuthorization: AWS VC***o=\r\nUser-Agent: Boto/2.12.0 Python/2.7.5 Darwin/12.5.0\r\n\r\n'
> reply: 'HTTP/1.1 405 Method Not Allowed\r\n'
>
> In which version of radosgw is the /admin/metadata REST endpoint available?
> I currently have 0.67.4.

We are using this on ceph-0.67.4. Do you have your gateways logging?

--
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
Re: [ceph-users] snapshots on CephFS
Dieter,

Creating snapshots using CephFS is quite simple... all you need to do is create a directory (mkdir) inside the hidden '.snap' directory. After that you can list (ls) and remove them (rm -r) just as you would any other directory:

    smiley@server1:/mnt/cephfs$ cd .snap
    smiley@server1:/mnt/cephfs/.snap$ ls
    snap1  snapshot-10-13-2013
    smiley@theneykov:/mnt/cephfs/.snap$ mkdir right_now
    smiley@theneykov:/mnt/1/.snap$ ls -l
    total 0
    drwxrwxrwx 1 root root 0 Oct 13 14:38 snap1
    drwxrwxrwx 1 root root 0 Oct 16 11:16 right_now
    drwxrwxrwx 1 root root 0 Oct 16 11:16 snapshot-10-13-2013

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: Kasper Dieter [dieter.kas...@ts.fujitsu.com]
Sent: Wednesday, October 16, 2013 11:01 AM
Subject: [ceph-users] snapshots on CephFS

> Can you please explain in more detail, or with example commands, how to
> create/list/remove snapshots in CephFS?
> [...]
Re: [ceph-users] poor read performance on rbd+LVM, LVM overload
Hi,

On Wed, 16 Oct 2013, Ugis wrote:
> I noticed very slow reads from an xfs mount that is on a ceph client
> (rbd + gpt partition + LVM PV + xfs on LE).
> [...]
> What could make so great a difference when LVM is used, and what/how
> should I tune? As write performance does not differ, DM extent lookup
> should not be lagging -- where is the trick?

My first guess is that LVM is shifting the content of the device such that it no longer aligns well with the RBD striping (by default, 4MB). The non-aligned reads/writes would need to touch two objects instead of one, and dd is generally doing these synchronously (i.e., lots of waiting).

I'm not sure what options LVM provides for aligning things to the underlying storage...

sage
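Along those lines, one thing worth checking (a sketch, not a verified fix; the device name is hypothetical) is where the PV's data area starts, and recreating the PV with explicit alignment if it isn't a multiple of the 4MB object size:

    # where does the first physical extent start? (pe_start, shown per PV)
    pvs -o +pe_start /dev/rbd1p1

    # recreate the PV so its data area aligns with the 4MB RBD object size
    # (destroys LVM metadata on the device -- only on an empty/backed-up PV!)
    pvcreate --dataalignment 4m /dev/rbd1p1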
Re: [ceph-users] snapshots on CephFS
On Wed, Oct 16, 2013 at 8:01 AM, Kasper Dieter <dieter.kas...@ts.fujitsu.com> wrote:
> Can you please explain in more detail, or with example commands, how to
> create/list/remove snapshots in CephFS?

As Shain described, you just do mkdir/ls/rmdir in the .snap folder.

> I assume they will be created on a directory level?

Snapshots cover the entire subtree starting with the folder you create them from. If a user puts one in their home directory, there will be a snapshot of all their document folders, source code folders, etc. as well.

> How will the CephFS snapshots cohere with the underlying pools?
> (e.g. using cephfs /mnt/cephfs/dir-1/dir2 set_layout -p 18)

CephFS snapshots store some metadata directly in the directory object (in the metadata pool), but the file data is stored using RADOS self-managed snapshots on the regular objects. If you specify that a file/folder goes in a different pool, the snapshots also live there as a matter of course.

Separately:
1) You will probably have a better time specifying layouts using the ceph.layout virtual xattrs, if your installation is new enough. (There's no new functionality there, but it's a lot friendlier and less fiddly than the cephfs tool is.)
2) Keep in mind that snapshots are noticeably less stable in use than the regular filesystem features. The ability to create new ones is turned off by default in the next branch (admins can enable them with a monitor command).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
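Point 1 above refers to the layout virtual xattrs; a sketch of what using them looks like, assuming a client new enough to expose them (the path matches Dieter's example, and the pool may be given by id or name):

    # inspect the current layout of a directory
    getfattr -n ceph.dir.layout /mnt/cephfs/dir-1/dir2

    # send new files in this directory to pool 18, as with the cephfs
    # set_layout example above
    setfattr -n ceph.dir.layout.pool -v 18 /mnt/cephfs/dir-1/dir2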
Re: [ceph-users] bit correctness and checksumming
Does Ceph log anywhere corrected (/caught) silent corruption? It would be interesting to know how much of a problem this is in a large-scale deployment. Something to gather in the league table mentioned at the London Ceph Day?

Just thinking out loud (please shout me down...) -- if the FS itself performs its own ECC, the ATA streaming command set might be of use to avoid performance degradation due to drive-level recovery at all.

On 2013-10-16 17:12, Sage Weil wrote:
> On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Can someone please summarise the current state of data integrity
>> protection in Ceph, assuming we have an XFS backend filesystem? ie.
>> don't rely on the protection offered by btrfs. I saw in the docs that
>> wire messages and journal writes are CRC'd, but nothing explicit about
>> the objects themselves.
>
> - Everything that passes over the wire is checksummed (crc32c). This is
>   mainly because the TCP checksum is so weak.
> - The journal entries have a crc.
> - During deep scrub, we read the objects and metadata, calculate a
>   crc32c, and compare across replicas. This detects missing objects,
>   bitrot, failing disks, or any other source of inconsistency.
> - Ceph does not calculate and store a per-object checksum. Doing so is
>   difficult because rados allows arbitrary overwrites of parts of an
>   object.
> - Ceph *does* have a new opportunistic checksum feature, which is
>   currently only enabled in QA. It calculates and stores checksums on
>   whatever block size you configure (e.g., 64k) if/when we
>   write/overwrite a complete block, and will verify any complete block
>   read against the stored crc, if one happens to be available. This can
>   help catch some but not all sources of corruption.
>
>> 1. Is an object checksum stored on the OSD somewhere? Is this in
>> user.ceph._? It wasn't obvious when looking at the code...
>
> No (except for the new/experimental opportunistic crc I mention above).
>
>> 2. When is the checksum verified? Surely it is checked during the deep
>> scrub, but what about during an object read?
>
> For non-btrfs, there is no crc to verify. For btrfs, the fs has its own
> crc and verifies it.
>
>> 2b. Can a user read corrupted data if the master replica has a bit flip
>> but this hasn't yet been found by a deep scrub?
>
> Yes.
>
>> 3. During deep scrub of an object with 2 replicas, suppose the checksum
>> is different for the two objects -- which object wins?
>
> In this case we normally choose the primary. The repair has to be
> explicitly triggered by the admin, however, and there are some options to
> control that choice.
>
>> 4. If the checksum is already stored per object in the OSD, is it
>> retrievable by librados? We have some applications which also need to
>> know the checksum of the data, and this would be handy if it was
>> already calculated by Ceph.
>
> It would! It may be that the way to get there is to build an API to
> expose the opportunistic checksums, and/or to extend that feature to
> maintain full checksums (by re-reading partially overwritten blocks on
> write). (Note, however, that even this wouldn't cover xattrs and omap
> content; really this is something that should be handled by the backend
> storage/file system.)
>
> sage
Re: [ceph-users] bit correctness and checksumming
At CERN, we have had cases in the past of silent corruptions. It is good to be able to identify the devices causing them and swap them out.

It's an old presentation, but the concepts are still relevant today:

http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

Tim

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ja...@peacon.co.uk
Sent: 16 October 2013 18:54
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bit correctness and checksumming

> Does Ceph log anywhere corrected (/caught) silent corruption? It would be
> interesting to know how much of a problem this is in a large-scale
> deployment.
> [...]
Re: [ceph-users] bit correctness and checksumming
Thank you Sage for the thorough answer.

It just occurred to me to also ask about the gateway. The docs explain that one can supply Content-MD5 during an object PUT (which I assume is verified by the RGW), but does a GET respond with the ETag md5? (Sorry, I don't have a gateway running at the moment to check for myself, and the answer is relevant to this discussion anyway.)

Cheers, Dan

Sage Weil <s...@inktank.com> wrote:
> On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Can someone please summarise the current state of data integrity
>> protection in Ceph, assuming we have an XFS backend filesystem?
> [...]
Re: [ceph-users] bit correctness and checksumming
It was long ago and Linux was very different. With respect to today, we found quite a few cases of bad RAID cards which had limited ECC checking on their memory. Stuck bits had serious impacts given our data transit volumes :-(

While the root causes we found in the past may be less likely today (as we move towards replicas and away from hardware RAID), keeping in place background scrubbing and a method to identify components which could potentially be causing corruption, via external probing and quality checks, is very useful.

Tim

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ja...@peacon.co.uk
Sent: 16 October 2013 20:06
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bit correctness and checksumming

Very interesting link. I don't suppose there is any data available separating 4K and 512-byte sectored drives?

On 2013-10-16 18:43, Tim Bell wrote:
> At CERN, we have had cases in the past of silent corruptions. It is good
> to be able to identify the devices causing them and swap them out.
>
> It's an old presentation, but the concepts are still relevant today:
> http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
> [...]
Re: [ceph-users] bit correctness and checksumming
On Wed, 16 Oct 2013, ja...@peacon.co.uk wrote:
> Does Ceph log anywhere corrected (/caught) silent corruption? It would be
> interesting to know how much of a problem this is in a large-scale
> deployment. Something to gather in the league table mentioned at the
> London Ceph Day?

It is logged, and causes the 'ceph health' check to complain. There are not currently any historical counts of how many inconsistencies have been found and subsequently repaired, though; this would be interesting to collect and report!

> Just thinking out loud (please shout me down...) -- if the FS itself
> performs its own ECC, the ATA streaming command set might be of use to
> avoid performance degradation due to drive-level recovery at all.

Maybe, I'm not familiar...

sage
Re: [ceph-users] bit correctness and checksumming
On Wed, Oct 16, 2013 at 6:12 PM, Sage Weil <s...@inktank.com> wrote:
>> 3. During deep scrub of an object with 2 replicas, suppose the checksum
>> is different for the two objects -- which object wins?
>
> In this case we normally choose the primary. The repair has to be
> explicitly triggered by the admin, however, and there are some options to
> control that choice.

Which options would those be? I only know about

    ceph pg repair <pg.id>

BTW, I read in a previous mail that...

> Repair does the equivalent of a deep-scrub to find problems. This mostly
> is reading object data/omap/xattr to create checksums and compares them
> across all copies. When a discrepancy is identified, an arbitrary copy
> which did not have I/O errors is selected and used to re-write the other
> replicas.

This seems like the right thing to do when inconsistencies are the result of I/O errors. But when they are caused by random bit flips, this sounds like an effective way to propagate corrupted data while making ceph health = HEALTH_OK.

Is that opportunistic checksum feature planned for emperor?

Cheers, Dan
[ceph-users] Ceph cluster access using s3 api with curl
Rookie question: What's the curl command / URL / steps to get an authentication token from the cluster without using the swift debug command first? Using the swift_key values should work, but I haven't found the right combination / URL. Here's what I've done:

1: Get user info from the ceph cluster:

    # radosgw-admin user info --uid rados
    2013-10-16 13:29:42.956578 7f166aeef780 0 WARNING: cannot read region map
    { "user_id": "rados",
      "display_name": "rados",
      "email": "n...@none.com",
      "suspended": 0,
      "max_buckets": 1000,
      "auid": 0,
      "subusers": [],
      "keys": [
            { "user": "rados",
              "access_key": "V92UJ5F24DF2CDGQINTK",
              "secret_key": "uzWaCMQnZ8uxyR3zte2Dthxbca\/H4qsm3p0QI29f"}],
      "swift_keys": [
            { "user": "rados:swift",
              "secret_key": "123"}],
      "caps": [],
      "op_mask": "read, write, delete",
      "default_placement": "",
      "placement_tags": []}

2: Jump through the (unnecessary) Swift debug hoop. Debug truncated the http command that holds the key:

    # swift --verbose --debug -V 1.0 -A http://10.113.193.189/auth -U rados:swift -K 123 list
    DEBUG:swiftclient:REQ: curl -i http://10.113.193.189/auth -X GET
    DEBUG:swiftclient:RESP STATUS: 204
    DEBUG:swiftclient:REQ: curl -i http://10.113.193.189/swift/v1?format=json -X GET -H "X-Auth-Token: AUTH_rgwtk0b007261646f733a73776966740ddca424fed74e69be4860524846912b0f99a7531ecda91ae47684ebd6b69e40f1dc6b45"
    DEBUG:swiftclient:RESP STATUS: 200
    DEBUG:swiftclient:RESP BODY: []

3: I should be able to pass user and password values from the user info command, but I haven't found the correct URL or path to use. This command (and variations: auth/v1.0, ...) fails. Is the directory structure / URL to get an authentication token documented somewhere?

    # curl -i http://10.113.193.189/auth -X GET -H 'X-Storage-User: rados:swift' -H 'X-Storage-Pass: 123'
    HTTP/1.1 403 Forbidden
    Date: Wed, 16 Oct 2013 20:33:31 GMT
    Server: Apache/2.2.22 (Ubuntu)
    Accept-Ranges: bytes
    Content-Length: 23
    Content-Type: application/json
    {"Code":"AccessDenied"}

Thanks,
Tim
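For reference, the standard Swift v1.0 handshake passes the credentials in the X-Auth-User and X-Auth-Key headers rather than X-Storage-User/X-Storage-Pass; a sketch against the host above (whether this is the exact form radosgw expects should be verified):

    curl -i http://10.113.193.189/auth -X GET \
         -H 'X-Auth-User: rados:swift' \
         -H 'X-Auth-Key: 123'
    # a 204 response should carry the token and endpoint in the
    # X-Auth-Token and X-Storage-Url response headers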
[ceph-users] Multiple OSDs per host strategy?
Hi,

I have 2 x 2TB disks in each of 3 servers, so a total of 6 disks. I have deployed a total of 6 OSDs, i.e.:

    host1 = osd.0 and osd.1
    host2 = osd.2 and osd.3
    host3 = osd.4 and osd.5

Now, since I will have a total of 3 replicas (original + 2 replicas), I want my replica placement to be such that I don't end up having 2 replicas on 1 host (a replica on osd.0 and osd.1 (both on host1) and a replica on osd.2). I want all 3 replicas spread across different hosts.

I know this is to be done via CRUSH maps, but I'm not sure whether it would be better to have 2 pools: 1 pool on osd.0/2/4 and another pool on osd.1/3/5. If possible, I would want only 1 pool, spread across all 6 OSDs, but with data placement such that I don't end up having 2 replicas on 1 host... not sure if this is possible at all.

Is that possible, or should I maybe go for RAID0 in each server (2 x 2TB = 4TB for osd.0), or maybe JBOD (1 volume, so 1 OSD per host)?

Any suggestions about best practice?

Regards,
--
Andrija Panić
Re: [ceph-users] Multiple OSDs per host strategy?
Andrija,

You can use a single pool and the proper CRUSH rule step

    step chooseleaf firstn 0 type host

to accomplish your goal.

http://ceph.com/docs/master/rados/operations/crush-map/

Cheers,
Mike Dawson

On 10/16/2013 5:16 PM, Andrija Panic wrote:
> I want all 3 replicas spread across different hosts...
> [...]
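For context, a sketch of what such a rule looks like in a decompiled CRUSH map (the bucket and rule names here are the stock defaults and may differ in a real map):

    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

With "chooseleaf firstn 0 type host", CRUSH picks as many distinct hosts as the pool's replica count and then one OSD leaf under each, so no two replicas land on the same host.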
Re: [ceph-users] Is there a way to query RBD usage
On 10/15/2013 08:56 PM, Blair Bethwaite wrote:
>> On 16/10/13 15:53, Wido den Hollander wrote:
>>> On 10/16/2013 03:15 AM, Blair Bethwaite wrote:
>>>> I.e., can we see what the actual allocated/touched size of an RBD is
>>>> in relation to its provisioned size?
>>>
>>> No, not an easy way. The only way would be to probe which RADOS objects
>>> exist, but that's a heavy operation you don't want to do with large
>>> images or with a large number of RBD images.
>>
>> So maybe a 'df' arg for rbd would be a nice addition to blueprints?
>
> Yes, I think so. It does seem a little conflicting to promote Ceph as
> doing thin-provisioned volumes, but then not actually be able to
> interrogate their real usage against the provisioned size.
>
> As a cloud admin using Ceph as my block-storage layer, I really want to
> be able to look at several metrics in relation to volumes and tenants:
> total GB quota, GB provisioned (i.e., total size of volumes & snaps),
> GB allocated. When users come crying for more quota I need to know
> whether they're making efficient use of what they've got.
>
> This actually leads into more of a conversation around the quota model
> of dishing out storage. IMHO it would be much more preferable to do
> things in a more EBS-oriented fashion, where we're able to see actual
> usage in the backend. Especially true with snapshots -- users are
> typically dismayed that their snapshots count towards their quota for
> the full size of the originally provisioned volume (despite the fact
> the snapshot could usually be truncated/shrunk by a factor of two or
> more).

You can see the space written in the image, and between snapshots (not including fs overhead on the osds), since cuttlefish:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/3684

It'd be nice to wrap that in a df or similar command though.
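The linked post computes usage from rbd diff output (offset, length, type per extent); a sketch along those lines, with image and snapshot names made up:

    # sum the extents that have actually been written in the image
    rbd diff rbd/myimage | awk '{ sum += $2 } END { print sum/1024/1024 " MB" }'

    # space written between two snapshots
    rbd diff --from-snap snap1 rbd/myimage@snap2 | \
        awk '{ sum += $2 } END { print sum/1024/1024 " MB" }'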
[ceph-users] changing from default journals to external journals
I configured my cluster using the default journal location for my OSDs. Can I migrate the default journals to explicit separate devices without a complete cluster teardown and reinstallation? How?

Thanks,
Tim
Re: [ceph-users] changing from default journals to external journals
On Wed, 16 Oct 2013, Snider, Tim wrote:
> I configured my cluster using the default journal location for my OSDs.
> Can I migrate the default journals to explicit separate devices without a
> complete cluster teardown and reinstallation? How?

- stop a ceph-osd daemon, then
- ceph-osd --flush-journal -i NNN
- set/adjust the journal symlink at /var/lib/ceph/osd/ceph-NNN/journal to point wherever you want
- ceph-osd --mkjournal -i NNN
- start ceph-osd

This won't set up the udev magic on the journal device, but that doesn't really matter if you're not hotplugging devices.

sage
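Putting those steps together for a single OSD, a sketch (the OSD id, init style, and journal partition are hypothetical; double-check before running):

    service ceph stop osd.12                      # stop the daemon
    ceph-osd --flush-journal -i 12                # flush pending writes into the store
    rm /var/lib/ceph/osd/ceph-12/journal          # drop the old journal file/link
    ln -s /dev/sdg1 /var/lib/ceph/osd/ceph-12/journal   # point at the new device
    ceph-osd --mkjournal -i 12                    # initialize the new journal
    service ceph start osd.12                     # bring the OSD back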
Re: [ceph-users] Multiple OSDs per host strategy?
Well, nice one :)

"step chooseleaf firstn 0 type host" is part of the default crush map (3 hosts, 2 OSDs per host). Does it mean: write 3 replicas (in my case) to 3 hosts, and randomly select an OSD from each host?

I already read all the docs... and am still not sure how to proceed...

On 16 October 2013 23:27, Mike Dawson <mike.daw...@cloudapt.com> wrote:
> Andrija,
>
> You can use a single pool and the proper CRUSH rule step
>
>     step chooseleaf firstn 0 type host
>
> to accomplish your goal.
>
> http://ceph.com/docs/master/rados/operations/crush-map/
> [...]

--
Andrija Panić
--
http://admintweets.com
Re: [ceph-users] kvm live migrate with ceph
Hello Michael, Thanks for the reply. It seems like ceph isn't actually mounting the rbd to the vm host, which is where I think I was getting hung up (I had previously been attempting to mount rbds directly to multiple hosts and, as you can imagine, having issues). Could you possibly expound on why using a clustered filesystem approach is wrong (or conversely why using RBDs is the correct approach)? As for format 2 rbd images, it looks like they provide exactly the Copy-On-Write functionality that I am looking for. Any caveats or things I should look out for when going from format 1 to format 2 images? (I think I read something about not being able to use both at the same time...) Thanks Again, Jon A On Mon, Oct 14, 2013 at 4:42 PM, Michael Lowe j.michael.l...@gmail.com wrote: I live migrate all the time using the rbd driver in qemu, no problems. Qemu will issue a flush as part of the migration so everything is consistent. It's the right way to use ceph to back vm's. I would strongly recommend against a network file system approach. You may want to look into format 2 rbd images, the cloning and writable snapshots may be what you are looking for. Sent from my iPad On Oct 14, 2013, at 5:37 AM, Jon three1...@gmail.com wrote: Hello, I would like to live migrate a VM between two hypervisors. Is it possible to do this with an rbd disk or should the vm disks be created as qcow images on a CephFS/NFS share (is it possible to do clvm over rbds? OR GlusterFS over rbds?) and point kvm at the network directory. As I understand it, rbds aren't cluster aware so you can't mount an rbd on multiple hosts at once, but maybe libvirt has a way to handle the transfer...? I like the idea of master or golden images where guests write any changes to a new image, I don't think rbds are able to handle copy-on-write in the same way kvm does so maybe a clustered filesystem approach is the ideal way to go. Thanks for your input. I think I'm just missing some piece... I just don't grok... Best Regards, Jon A ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CloudStack + KVM(Ubuntu 12.04, Libvirt 1.0.2) + Ceph [Seeking Help]
Hi, I have gotten so close to having Ceph work in my cloud but I have reached a roadblock. Any help would be greatly appreciated. I receive the following error when trying to get KVM to run a VM with an RBD volume: Libvirtd.log: 2013-10-16 22:05:15.516+0000: 9814: error : qemuProcessReadLogOutput:1477 : internal error Process exited while reading console log output: char device redirected to /dev/pts/3 kvm: -drive file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789,if=none,id=drive-ide0-0-1: error connecting kvm: -drive file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789,if=none,id=drive-ide0-0-1: could not open disk image rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789: Invalid argument Ceph Pool showing test volume exists: root@ubuntu-test-KVM-RBD:/opt# rbd -p libvirt-pool ls new-libvirt-image Ceph Auth: client.libvirt key: AQBx+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA== caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool KVM Drive Support: root@ubuntu-test-KVM-RBD:/opt# kvm --drive format=? Supported formats: vvfat vpc vmdk vdi sheepdog rbd raw host_cdrom host_floppy host_device file qed qcow2 qcow parallels nbd dmg tftp ftps ftp https http cow cloop bochs blkverify blkdebug Thank you if anyone can help Kelcey Damage | Infrastructure Systems Architect Strategy | Automation | Cloud Computing | Technology Development Backbone Technology, Inc 604-331-1152 ext. 114 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Very unbalanced osd data placement with differing sized devices
I stumbled across this today: 4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a play setup). - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal) - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal) I do a standard installation via ceph-deploy (1.2.7) of ceph (0.67.4) on each one [1]. The topology looks like:
$ ceph osd tree
# id    weight      type name        up/down  reweight
-1      0.01999     root default
-2      0           host ceph1
0       0               osd.0        up       1
-3      0           host ceph2
1       0               osd.1        up       1
-4      0.009995    host ceph3
2       0.009995        osd.2        up       1
-5      0.009995    host ceph4
3       0.009995        osd.3        up       1
So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd.2 and osd.3 (on ceph3,4) have weight 0.009995 - this suggests that data will flee osd.0,1 and live only on osd.2,3. Sure enough, putting in a few objects via rados put results in:
ceph1 $ df -m
Filesystem  1M-blocks  Used  Available  Use%  Mounted on
/dev/vda1        5038  2508       2275   53%  /
udev              994     1        994    1%  /dev
tmpfs             401     1        401    1%  /run
none                5     0          5    0%  /run/lock
none             1002     0       1002    0%  /run/shm
/dev/vdb1        5109    40       5070    1%  /var/lib/ceph/osd/ceph-0
(similarly for ceph2), whereas:
ceph3 $ df -m
Filesystem  1M-blocks  Used  Available  Use%  Mounted on
/dev/vda1        5038  2405       2377   51%  /
udev              994     1        994    1%  /dev
tmpfs             401     1        401    1%  /run
none                5     0          5    0%  /run/lock
none             1002     0       1002    0%  /run/shm
/dev/vdb1       10229  1315       8915   13%  /var/lib/ceph/osd/ceph-2
(similarly for ceph4). Obviously I can fix this by reweighting the first two osds to something like 0.005, but I'm wondering if there is something I've missed - clearly some kind of auto weighting has been performed on the basis of the size difference in the data volumes, but it looks to be skewing data far too much to the bigger ones. Is there perhaps a bug in the smarts for this? Or is it just because I'm using small volumes (5G => 0 weight)? Cheers Mark [1] i.e: $ ceph-deploy new ceph1 $ ceph-deploy mon create ceph1 $ ceph-deploy gatherkeys ceph1 $ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc ... $ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very unbalanced osd data placement with differing sized devices
I may be wrong, but I always thought that a weight of 0 means don't put anything there. All weights > 0 will be looked at proportionally. See http://ceph.com/docs/master/rados/operations/crush-map/ which recommends higher weights anyway: Weighting Bucket Items Ceph expresses bucket weights as double integers, which allows for fine weighting. A weight is the relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1TB storage device. In such a scenario, a weight of 0.5 would represent approximately 500GB, and a weight of 3.00 would represent approximately 3TB. Higher level buckets have a weight that is the sum total of the leaf items aggregated by the bucket. A bucket item weight is one dimensional, but you may also calculate your item weights to reflect the performance of the storage drive. For example, if you have many 1TB drives where some have relatively low data transfer rate and the others have a relatively high data transfer rate, you may weight them differently, even though they have the same capacity (e.g., a weight of 0.80 for the first set of drives with lower total throughput, and 1.20 for the second set of drives with higher total throughput). David Zafman Senior Developer http://www.inktank.com On Oct 16, 2013, at 8:15 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: I stumbled across this today: 4 osds on 4 hosts (names ceph1 - ceph4). [...] Is there perhaps a bug in the smarts for this? Or is it just because I'm using small volumes (5G => 0 weight)? Cheers Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very unbalanced osd data placement with differing sized devices
On Thu, 17 Oct 2013, Mark Kirkwood wrote: I stumbled across this today: 4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a play setup). - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal) - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal) [...] Is there perhaps a bug in the smarts for this? Or is it just because I'm using small volumes (5G => 0 weight)? Yeah, I think this is just rounding error. By default a weight of 1.0 == 1 TB, so these are just very small numbers. Internally, we're storing as a fixed-point 32-bit value where 1.0 == 0x10000, and 5GB is just too small for those units. You can disable this autoweighting with osd crush update on start = false in ceph.conf. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fedora package dependencies
Performing yum updates on Fedora 19 now breaks qemu. There is a different set of package names and contents between the default Fedora ceph packages and the ceph.com packages. There is no ceph-libs package in the ceph.com repository and qemu now enforces the dependency on ceph-libs. Yum update now produces this error:
Error: Package: 2:qemu-common-1.4.2-12.fc19.x86_64 (updates)
       Requires: ceph-libs >= 0.61
       Available: ceph-libs-0.56.4-1.fc19.i686 (fedora)
           ceph-libs = 0.56.4-1.fc19
       Available: ceph-libs-0.67.3-2.fc19.i686 (updates)
           ceph-libs = 0.67.3-2.fc19
The ceph-libs dependency enforcement is new as of this qemu update. Should not the ceph.com Fedora packages mirror the default Fedora packages in name and contents? Regards Darryl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com