[ceph-users] Issue of S3 API: x-amz-acl: public-read-write and authenticated-read

2013-10-16 Thread david zhang
Hi ceph-users,

I am trying the S3-compatible API of ceph, but am hitting the following issues:

1. x-amz-acl: public-read-write

I upload an object with a public-read-write acl. Then I can get this object
directly without an access key.

curl -v -s http://radosgw_server/mybucket0/20131015_1
...
 HTTP/1.1 200
...

But I can't write or delete this object without an access key:
curl -v -s http://ceph7.dev.mobstor.corp.bf1.yahoo.com/mybucket0/20131015_1 -XPUT -d 1234
or
curl -v -s http://ceph7.dev.mobstor.corp.bf1.yahoo.com/mybucket0/20131015_1 -XDELETE
...
 HTTP/1.1 403
...
<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code></Error>
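
For reference, a minimal sketch of the kind of signed request used to upload the object with the canned ACL in the first place (the access/secret keys below are placeholders; the bucket/object names are the ones from this test):

# upload an object with x-amz-acl using AWS v2 signing (keys are placeholders)
access_key=MYACCESSKEY
secret_key=MYSECRETKEY
resource="/mybucket0/20131015_1"
content_type="text/plain"
date=$(date -R)
string_to_sign="PUT\n\n${content_type}\n${date}\nx-amz-acl:public-read-write\n${resource}"
signature=$(printf "%b" "${string_to_sign}" | openssl sha1 -hmac "${secret_key}" -binary | base64)
curl -v -s -X PUT -d 1234 \
  -H "Date: ${date}" \
  -H "Content-Type: ${content_type}" \
  -H "x-amz-acl: public-read-write" \
  -H "Authorization: AWS ${access_key}:${signature}" \
  "http://radosgw_server${resource}"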


2. x-amz-acl: authenticated-read

I have created two radosgw users. I upload an object
with the authenticated-read acl using the access key of one radosgw user. Then I
can get this object using this user's access key, but I can't get this
object using the other user's access key.

I am not sure if I use this authenticated-read acl correctly, please
correct me if I am wrong.

Thanks.

-- 
Regards,
Zhi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to make ceph with hadoop

2013-10-16 Thread
 hi all!
  my ceph is 0.62, and I want to build it with hadoop:
  ./configure --with-hadoop

but it returns "jni.h not found".
I found jni.h in /usr/java/jdk/include/jni.h

   How can I fix this problem?

   Thanks,
pengft










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph, Keystone and S3

2013-10-16 Thread Matt Thompson
Hi All,

Does anyone know if it'll be possible to use the radosgw admin API when
using keystone users?  I suspect not due to the user requiring specific
caps, however it'd be great if someone can validate (I'm still running
v0.67.4 so can't play with this much).

Thanks!

-Matt


On Tue, Oct 15, 2013 at 6:34 PM, Carlos Gimeno Yañez cgim...@bifi.es wrote:

 Thank you very much Yehuda, that was the missing piece of my puzzle!

 I think that this should be added to the official documentation.

 Regards


 2013/10/15 Yehuda Sadeh yeh...@inktank.com

 On Tue, Oct 15, 2013 at 7:17 AM, Carlos Gimeno Yañez cgim...@bifi.es
 wrote:
  Hi
 
  I've deployed Ceph using Ceph-deploy and following the official
  documentation. I've created a user to use with Swift and everything is
  working fine, my users can create buckets and upload files if they use
  Horizon Dashboard or Swift CLI.
 
  However, everything changes if they try to do it with the S3 API. When they
  download their credentials from the Horizon dashboard to get their keys, they
  can't connect to ceph using the S3 API. They only get a 403 Access Denied
  error message. I'm using Ceph 0.70 so, if I'm not wrong, ceph should be able
  to validate S3 tokens against keystone since version 0.69.
 
  Here is my ceph.conf:
 
  [client.radosgw.gateway]
  host = server2
  keyring = /etc/ceph/keyring.radosgw.gateway
  rgw socket path = /var/run/ceph/radosgw.sock
  log file = /var/log/ceph/radosgw.log
  rgw keystone url = server4:35357
  rgw keystone admin token = admintoken
  rgw keystone accepted roles = admin _member_ Member
  rgw print continue = false
  rgw keystone token cache size = 500
  rgw keystone revocation interval = 500
  nss db path = /var/ceph/nss
 
  #Add DNS hostname to enable S3 subdomain calls
  rgw dns name = server2
 
 
  And this is the error message (with s3-curl):
 
 
  GET / HTTP/1.1
  User-Agent: curl/7.29.0
  Host: host_ip
  Accept: */*
  Date: Tue, 15 Oct 2013 14:07:24 +
  Authorization: AWS
  3a1ecdea87d6493a9922c13a06d392cf:SNu/sjTuDtvunOQKJaU8Besm1RQ=
 
   HTTP/1.1 403 Forbidden
   Date: Tue, 15 Oct 2013 14:07:24 GMT
   Server: Apache/2.2.22 (Ubuntu)
   Accept-Ranges: bytes
   Content-Length: 78
   Content-Type: application/xml
  
  { [data not shown]
  <?xml version="1.0" encoding="UTF-8"?>
  <Error>
  <Code>AccessDenied</Code>
  </Error>
 
  Regards


 Try adding:

 rgw s3 auth use keystone = true

 to your ceph.conf
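
 For reference, a sketch of what the keystone part of the gateway section
 would then look like, merging the config posted above with the suggested
 option (not a complete file):

 [client.radosgw.gateway]
 ...
 rgw keystone url = server4:35357
 rgw keystone admin token = admintoken
 rgw keystone accepted roles = admin _member_ Member
 rgw keystone token cache size = 500
 rgw keystone revocation interval = 500
 rgw s3 auth use keystone = true
 nss db path = /var/ceph/nss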


 Yehuda



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw public access problem

2013-10-16 Thread Fabio - NS3 srl

Hello,
when I set a read permission for all users on the bucket, I can read the
content (listing) of the bucket, but I receive "access denied" for all directories
and sub-directories inside this bucket.


Where am I wrong?

Many thanks
Fabio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bit correctness and checksumming

2013-10-16 Thread Dan Van Der Ster
Hi all,
There has been some confusion the past couple days at the CHEP conference 
during conversations about Ceph and protection from bit flips or other subtle 
data corruption. Can someone please summarise the current state of data 
integrity protection in Ceph, assuming we have an XFS backend filesystem? ie. 
don't rely on the protection offered by btrfs. I saw in the docs that wire 
messages and journal writes are CRC'd, but nothing explicit about the objects 
themselves.

We also have some specific questions:

1. Is an object checksum stored on the OSD somewhere? Is this in user.ceph._, 
because it wasn't obvious when looking at the code…
2. When is the checksum verified? Surely it is checked during the deep scrub, 
but what about during an object read?
2b. Can a user read corrupted data if the master replica has a bit flip but 
this hasn't yet been found by a deep scrub?
3. During deep scrub of an object with 2 replicas, suppose the checksum is 
different for the two objects -- which object wins? (I.e. if you store the 
checksum locally, this is trivial since the consistency of objects can be 
evaluated locally. Without the local checksum, you can have conflicts.)
4. If the checksum is already stored per object in the OSD, is this retrievable 
by librados? We have some applications which also need to know the checksum of 
the data and this would be handy if it was already calculated by Ceph.

Thanks in advance!

Dan van der Ster
CERN IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to make ceph with hadoop

2013-10-16 Thread Noah Watkins
The --with-hadoop option has been removed. The Ceph Hadoop bindings are now
located in git://github.com/ceph/hadoop-common cephfs/branch-1.0, and the
required CephFS Java bindings can be built from the Ceph Git repository
using the --enable-cephfs-java configure option.
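
A minimal sketch of that build path for the poster's layout (the JDK paths and install step are assumptions; adjust them to your system):

# build the CephFS Java bindings from the Ceph source tree;
# pointing CPPFLAGS at the JDK headers addresses the "jni.h not found" error
export JAVA_HOME=/usr/java/jdk
./configure --enable-cephfs-java \
    CPPFLAGS="-I${JAVA_HOME}/include -I${JAVA_HOME}/include/linux"
make && sudo make install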


On Wed, Oct 16, 2013 at 12:26 AM, 鹏 wkp4...@126.com wrote:

  hi all!
   my ceph is 0.62, and I want to build it wit hadoop.
   ./configure -with-hadoop

 but it return jni.h not found.
 I found the jni.h in /usr/java/jdk/include/jni.h

how can I fix this Problem!

thinks
 pengft













 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw public access problem

2013-10-16 Thread Derek Yarnell
On 10/16/13 5:15 AM, Fabio - NS3 srl wrote:
 Hello,
 when I set a read permission for all users on the bucket, I can read the
 content (listing) of the bucket, but I receive "access denied" for all directories
 and sub-directories inside this bucket.
 
 Where am I wrong?

Hi Fabio,

This is the default S3 behavior.  The default canned ACL grants FULL_CONTROL
only to the user who writes the key.  You will have to iterate the keys and
grant a specific read ACL, or specify the ACL when the keys are uploaded
(see the sketch after the link below).

Also, we have a patch pending[1] that provides some relief for this use
case: it allows the bucket ACLs to be evaluated and be authoritative
before the key ACLs.  It needs to get cleaned up a bit, but I think it
would be very useful in your case.  We are about to go into production
running this on two different Ceph Object Stores.

[1] - https://github.com/ceph/ceph/pull/672
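
For example, one way to grant anonymous read on the keys that already exist is to iterate them with a client tool; a sketch with s3cmd, assuming it is already configured against the gateway and the bucket name is a placeholder:

# grant public read on every key already in the bucket
s3cmd setacl --acl-public --recursive s3://mybucket

# or set the ACL at upload time for new keys
s3cmd put --acl-public somefile s3://mybucket/somefile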

Thanks,
derek

-- 
---
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Missing Dependency for ceph-deploy 1.2.7

2013-10-16 Thread Alfredo Deza
On Tue, Oct 15, 2013 at 9:54 PM, Luke Jing Yuan jyl...@mimos.my wrote:
 Hi,

 I am trying to install/upgrade to 1.2.7 but Ubuntu (Precise) is complaining 
 about an unmet dependency, which seems to be python-pushy 0.5.3, which appears 
 to be missing. Am I correct to assume so?

That is odd, we still have pushy packages available for the version
that you are having issues with, see:
http://ceph.com/debian-dumpling/pool/main/p/python-pushy/

It might be that you need to update your repos?

 Regards,
 Luke

 --
 -
 -
 DISCLAIMER:

 This e-mail (including any attachments) is for the addressee(s)
 only and may contain confidential information. If you are not the
 intended recipient, please note that any dealing, review,
 distribution, printing, copying or use of this e-mail is strictly
 prohibited. If you have received this email in error, please notify
 the sender  immediately and delete the original message.
 MIMOS Berhad is a research and development institution under
 the purview of the Malaysian Ministry of Science, Technology and
 Innovation. Opinions, conclusions and other information in this e-
 mail that do not relate to the official business of MIMOS Berhad
 and/or its subsidiaries shall be understood as neither given nor
 endorsed by MIMOS Berhad and/or its subsidiaries and neither
 MIMOS Berhad nor its subsidiaries accepts responsibility for the
 same. All liability arising from or in connection with computer
 viruses and/or corrupted e-mails is excluded to the fullest extent
 permitted by law.


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-16 Thread Ugis
Hello ceph & LVM communities!

I noticed very slow reads from an xfs mount that is on a ceph
client (rbd + gpt partition + LVM PV + xfs on LE).
To find the cause I created another rbd in the same pool, formatted it
straight away with xfs, and mounted it.

Write performance for both xfs mounts is similar ~12MB/s

reads with dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null as follows:
with LVM ~4MB/s
pure xfs ~30MB/s

Watched performance while doing reads with atop. In LVM case atop
shows LVM overloaded:
LVM | s-LV_backups | busy 95% | read 21515 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 4.20 | MBw/s 0.00 | avq 1.00 | avio 0.85 ms |

client kernel 3.10.10
ceph version 0.67.4

My considerations:
I have expanded the rbd under LVM a couple of times (accordingly expanding the
gpt partition, PV, VG, LV and xfs afterwards), but that should have no
impact on performance (I tested a clean rbd+LVM and got the same read
performance as for the expanded one).

As with device-mapper, after LVM is initialized it is just a small
table with the LE-PE mapping that should reside in a close CPU cache.
I am guessing this could be related to the old CPU used; probably caching
near the CPU does not work well (I also tested local HDDs with/without LVM
and got read speeds of ~13MB/s vs 46MB/s, with atop showing the same overload
in the LVM case).

What could make such a great difference when LVM is used, and what/how should
I tune? As write performance does not differ, the DM extent lookup should
not be lagging, so where is the trick?

CPU used:
# cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 4
model name  : Intel(R) Xeon(TM) CPU 3.20GHz
stepping: 10
microcode   : 0x2
cpu MHz : 3200.077
cache size  : 2048 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips: 6400.15
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Br,
Ugis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy zap disk failure

2013-10-16 Thread Alfredo Deza
On Tue, Oct 15, 2013 at 9:19 PM, Guang yguan...@yahoo.com wrote:
 -bash-4.1$ which sgdisk
 /usr/sbin/sgdisk

 Which path does ceph-deploy use?

That is unexpected... these are the paths that ceph-deploy uses:

'/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin'

So `/usr/sbin/` is there. I believe  this is a case where $PATH gets
altered because of sudo (resetting the env variable).

This should be fixed in the next release. In the meantime, you could
set the $PATH for non-interactive sessions (which is what ceph-deploy
does)
for all users. I *think* that would be in `/etc/profile`
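
A sketch of the two usual workarounds (the paths shown are typical RHEL defaults; adjust as needed):

# option 1: make sure sudo's own PATH includes /usr/sbin (edit /etc/sudoers with visudo)
Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

# option 2: export PATH for non-interactive sessions, e.g. in /etc/profile
export PATH=$PATH:/usr/sbin:/sbin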



 Thanks,
 Guang

 On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote:

 On Tue, Oct 15, 2013 at 10:52 AM, Guang yguan...@yahoo.com wrote:
 Hi ceph-users,
 I am trying with the new ceph-deploy utility on RHEL6.4 and I came across a
 new issue:

 -bash-4.1$ ceph-deploy --version
 1.2.7
 -bash-4.1$ ceph-deploy disk zap server:/dev/sdb
 [ceph_deploy.cli][INFO  ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap
 server:/dev/sdb
 [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from
 remote host
 [ceph_deploy.osd][INFO  ] Distro info: Red Hat Enterprise Linux Server 6.4
 Santiago
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device
 [osd2.ceph.mobstor.bf1.yahoo.com][INFO  ] Running command: sudo sgdisk
 --zap-all --clear --mbrtogpt -- /dev/sdb
 [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found

 While I run disk zap on the host directly, it can work without issues.
 Anyone meet the same issue?

 Can you run `which sgdisk` on that host? I want to make sure this is
 not a $PATH problem.

 ceph-deploy tries to use the proper path remotely but it could be that
 this one is not there.



 Thanks,
 Guang

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] snapshots on CephFS

2013-10-16 Thread Kasper Dieter
Hi Greg,

on http://comments.gmane.org/gmane.comp.file-systems.ceph.user/1705
I found a statement from you regarding snapshots on cephfs:

---snip---
Filesystem snapshots exist and you can experiment with them on CephFS
(there's a hidden .snaps folder; you can create or remove snapshots
by creating directories in that folder; navigate up and down it, etc).
---snip---

Can you please explain in more detail, or with example commands, how to
create/list/remove snapshots in CephFS?
I assume they will be created on a directory level?
How will the CephFS snapshots cohere with the underlying pools?
(e.g. using cephfs /mnt/cephfs/dir-1/dir2 set_layout -p 18)


Thanks,
-Dieter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kvm live migrate wil ceph

2013-10-16 Thread Mike Lowe
I wouldn't go so far as to say putting a vm in a file on a networked filesystem 
is wrong.  It is just not the best choice if you have a ceph cluster at hand, 
in my opinion.  Networked filesystems have a bunch of extra stuff to implement 
posix semantics and live in kernel space.  You just need simple block device 
semantics and you don't need to entangle the hypervisor's kernel space.  What 
it boils down to is the engineering first principle of selecting the least 
complicated solution that satisfies the requirements of the problem. You don't 
get anything when you trade the simplicity of rbd for the complexity of a 
networked filesystem.

For format 2 I think the only caveat is that it requires newer clients and the 
kernel client takes some time to catch up to the user space clients.  You may 
not be able to mount filesystems on rbd devices with the kernel client 
depending on kernel version, this may or may not be important to you.  You can 
always use a vm to mount a filesystem on a rbd device as a work around.  
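
For the golden-image / copy-on-write workflow discussed in this thread, a minimal sketch with format 2 images (pool and image names are made up; older releases spell the flag --format 2 instead of --image-format 2):

# create a format 2 image, snapshot it, protect the snapshot, then clone it
rbd create --size 10240 --image-format 2 rbd/golden
rbd snap create rbd/golden@base
rbd snap protect rbd/golden@base
rbd clone rbd/golden@base rbd/vm-disk-1    # copy-on-write child of the snapshot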

On Oct 16, 2013, at 9:11 AM, Jon three1...@gmail.com wrote:

 Hello Michael,
 
 Thanks for the reply.  It seems like ceph isn't actually mounting the rbd 
 to the vm host which is where I think I was getting hung up (I had previously 
 been attempting to mount rbds directly to multiple hosts and as you can 
 imagine having issues).
 
 Could you possibly expound on why using a clustered filesystem approach is 
 wrong (or conversely why using RBDs is the correct approach)?
 
 As for format2 rbd images, it looks like they provide exactly the 
 Copy-On-Write functionality that I am looking for.  Any caveats or things I 
 should look out for when going from format 1 to format 2 images? (I think I 
 read something about not being able to use both at the same time...)
 
 Thanks Again,
 Jon A
 
 
 On Mon, Oct 14, 2013 at 4:42 PM, Michael Lowe j.michael.l...@gmail.com 
 wrote:
 I live migrate all the time using the rbd driver in qemu, no problems.  Qemu 
 will issue a flush as part of the migration so everything is consistent.  
 It's the right way to use ceph to back vm's. I would strongly recommend 
 against a network file system approach.  You may want to look into format 2 
 rbd images, the cloning and writable snapshots may be what you are looking 
 for.
 
 Sent from my iPad
 
 On Oct 14, 2013, at 5:37 AM, Jon three1...@gmail.com wrote:
 
 Hello,
 
 I would like to live migrate a VM between two hypervisors.  Is it possible 
 to do this with a rbd disk or should the vm disks be created as qcow images 
 on a CephFS/NFS share (is it possible to do clvm over rbds? OR GlusterFS 
 over rbds?)and point kvm at the network directory.  As I understand it, rbds 
 aren't cluster aware so you can't mount an rbd on multiple hosts at once, 
 but maybe libvirt has a way to handle the transfer...?  I like the idea of 
 master or golden images where guests write any changes to a new image, I 
 don't think rbds are able to handle copy-on-write in the same way kvm does 
 so maybe a clustered filesystem approach is the ideal way to go.
 
 Thanks for your input. I think I'm just missing some piece. .. I just don't 
 grok...
 
 Best Regards,
 Jon A
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-admin doesn't list user anymore

2013-10-16 Thread Derek Yarnell
On 10/16/13 4:26 AM, Valery Tschopp wrote:
 Hi Derek,
 
 Thanks for your example.
 
 I've added caps='metadata=*', but I still have an error and get:
 
 send: 'GET /admin/metadata/user?format=json HTTP/1.1\r\nHost:
 objects.bcc.switch.ch\r\nAccept-Encoding: identity\r\nDate: Wed, 16 Oct
 2013 08:09:57 GMT\r\nContent-Length: 0\r\nAuthorization: AWS
 VC***o=\r\nUser-Agent: Boto/2.12.0 Python/2.7.5
 Darwin/12.5.0\r\n\r\n'
 reply: 'HTTP/1.1 405 Method Not Allowed\r\n'
 
 
 In which version of radosgw is the /admin/metadata REST endpoint
 available? I currently have 0.67.4.

We are using this on ceph-0.67.4.  Do you have your gateways logging?
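
For reference, a sketch of granting the metadata cap and turning up gateway logging to see why the 405 is returned (the uid and log levels are examples):

# grant the metadata cap to the admin user
radosgw-admin caps add --uid=admin-user --caps="metadata=*"

# in ceph.conf, under the [client.radosgw.gateway] section, then restart radosgw:
#   debug rgw = 20
#   debug ms = 1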

-- 
---
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] snapshots on CephFS

2013-10-16 Thread Shain Miley
Dieter,

Creating snapshots using cephfs is quite simple...all you need to do is create 
a directory (mkdir) inside the hidden '.snap' directory.

After that you can list (ls) and remove them (rm -r) just as you would any 
other directory:

smiley@server1:/mnt/cephfs$ cd .snap

smiley@server1:/mnt/cephfs/.snap$ ls
snap1  snapshot-10-13-2013

smiley@theneykov:/mnt/cephfs/.snap$ mkdir right_now


smiley@theneykov:/mnt/1/.snap$ ls -l
total 0
drwxrwxrwx 1 root root 0 Oct 13 14:38 snap1
drwxrwxrwx 1 root root 0 Oct 16 11:16 right_now
drwxrwxrwx 1 root root 0 Oct 16 11:16 snapshot-10-13-2013


Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649


From: ceph-users-boun...@lists.ceph.com [ceph-users-boun...@lists.ceph.com] on 
behalf of Kasper Dieter [dieter.kas...@ts.fujitsu.com]
Sent: Wednesday, October 16, 2013 11:01 AM
To: Gregory Farnum
Cc: ceph-users@lists.ceph.com
Subject: [ceph-users] snapshots on CephFS

Hi Greg,

on http://comments.gmane.org/gmane.comp.file-systems.ceph.user/1705
I found a statement from you regarding snapshots on cephfs:

---snip---
Filesystem snapshots exist and you can experiment with them on CephFS
(there's a hidden .snaps folder; you can create or remove snapshots
by creating directories in that folder; navigate up and down it, etc).
---snip---

Can you please explain in more detail or with example CMDs how to 
create/list/remove snapshots in CephFS ?
I assume they will be created on a directory level ?
How will the CephFS snapshots cohere with the underlying pools?
(e.g. using cephfs /mnt/cephfs/dir-1/dir2 set_layout -p 18)


Thanks,
-Dieter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-16 Thread Sage Weil
Hi,

On Wed, 16 Oct 2013, Ugis wrote:
 Hello cephLVM communities!
 
 I noticed very slow reads from xfs mount that is on ceph
 client(rbd+gpt partition+LVM PV + xfs on LE)
 To find a cause I created another rbd in the same pool, formatted it
 straight away with xfs, mounted.
 
 Write performance for both xfs mounts is similar ~12MB/s
 
 reads with dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null as follows:
 with LVM ~4MB/s
 pure xfs ~30MB/s
 
 Watched performance while doing reads with atop. In LVM case atop
 shows LVM overloaded:
 LVM | s-LV_backups | busy 95% | read 21515 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 4.20 | MBw/s 0.00 | avq 1.00 | avio 0.85 ms |
 
 client kernel 3.10.10
 ceph version 0.67.4
 
 My considerations:
 I have expanded rbd under LVM couple of times(accordingly expanding
 gpt partition, PV,VG,LV, xfs afterwards), but that should have no
 impact on performance(tested clean rbd+LVM, same read performance as
 for expanded one).
 
 As with device-mapper, after LVM is initialized it is just a small
 table with LE-PE  mapping that should reside in close CPU cache.
 I am guessing this could be related to old CPU used, probably caching
 near CPU does not work well(I tested also local HDDs with/without LVM
 and got read speed ~13MB/s vs 46MB/s with atop showing same overload
 in  LVM case).
 
 What could make so great difference when LVM is used and what/how to
 tune? As write performance does not differ, DM extent lookup should
 not be lagging, where is the trick?

My first guess is that LVM is shifting the content of the device such that 
it no longer aligns well with the RBD striping (by default, 4MB).  The 
non-aligned reads/writes would need to touch two objects instead of 
one, and dd is generally doing these synchronously (i.e., lots of 
waiting).

I'm not sure what options LVM provides for aligning things to the 
underlying storage...

sage
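
If alignment is the culprit, a sketch of how one might check and correct it (device names are examples, and pvcreate is destructive, so only on a scratch PV):

# check where the first physical extent starts on the existing PV
pvs -o +pe_start /dev/rbd1p1

# when (re)creating a PV on a test RBD, align the data area to the 4MB object size
pvcreate --dataalignment 4m /dev/rbd2
vgcreate vg_test /dev/rbd2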


 
 CPU used:
 # cat /proc/cpuinfo
 processor   : 0
 vendor_id   : GenuineIntel
 cpu family  : 15
 model   : 4
 model name  : Intel(R) Xeon(TM) CPU 3.20GHz
 stepping: 10
 microcode   : 0x2
 cpu MHz : 3200.077
 cache size  : 2048 KB
 physical id : 0
 siblings: 2
 core id : 0
 cpu cores   : 1
 apicid  : 0
 initial apicid  : 0
 fpu : yes
 fpu_exception   : yes
 cpuid level : 5
 wp  : yes
 flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl cid cx16 xtpr lahf_lm
 bogomips: 6400.15
 clflush size: 64
 cache_alignment : 128
 address sizes   : 36 bits physical, 48 bits virtual
 power management:
 
 Br,
 Ugis
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] snapshots on CephFS

2013-10-16 Thread Gregory Farnum
On Wed, Oct 16, 2013 at 8:01 AM, Kasper Dieter
dieter.kas...@ts.fujitsu.com wrote:
 Hi Greg,

 on http://comments.gmane.org/gmane.comp.file-systems.ceph.user/1705
 I found a statement from you regarding snapshots on cephfs:

 ---snip---
 Filesystem snapshots exist and you can experiment with them on CephFS
 (there's a hidden .snaps folder; you can create or remove snapshots
 by creating directories in that folder; navigate up and down it, etc).
 ---snip---

 Can you please explain in more detail or with example CMDs how to 
 create/list/remove snapshots in CephFS ?

As Shain described, you just do mkdir/ls/rmdir in the .snaps folder.

 I assume they will be created on a directory level ?

Snapshots cover the entire subtree starting with the folder you create
them from. If a user puts it in their home directory, there will be a
snapshot of all their document folders, source code folders, etc as
well.

 How will the CephFS snapshots cohere with the underlying pools?
 (e.g. using cephfs /mnt/cephfs/dir-1/dir2 set_layout -p 18)

CephFS snapshots store some metadata directly in the directory object
(in the metadata pool), but the file data is stored using RADOS
self-managed snapshots on the regular objects. If you specify that a
file/folder goes in a different pool, the snapshots also live there as
a matter of course.

Separately:
1) you will probably have a better time specifying layouts using the
ceph.layout virtual xattrs if your installation is new enough.
(There's no new functionality there, but it's a lot friendlier and
less fiddly than the cephfs tool is.)
2) Keep in mind that snapshots are noticeably less stable in use than
the regular filesystem features. The ability to create new ones is
turned off by default in the next branch (admins can enable them
with a monitor command).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
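
A sketch of the virtual-xattr interface mentioned in point 1, assuming a new enough kernel or ceph-fuse client; the paths and pool name are examples:

# read the current layout of a file
getfattr -n ceph.file.layout /mnt/cephfs/dir-1/dir2/somefile

# send new files created under this directory to another (already existing) pool
setfattr -n ceph.dir.layout.pool -v mypool /mnt/cephfs/dir-1/dir2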
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bit correctness and checksumming

2013-10-16 Thread james


Does Ceph log corrected (or caught) silent corruption anywhere? It would be 
interesting to know how much of a problem this is in a large-scale 
deployment.  Something to gather in the league table mentioned at the 
London Ceph day?


Just thinking out loud (please shout me down...) - if the FS itself 
performs its own ECC, the ATA streaming command set might be of use to 
avoid performance degradation due to drive-level recovery altogether.



On 2013-10-16 17:12, Sage Weil wrote:

 On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
  Hi all,
  There has been some confusion the past couple days at the CHEP
  conference during conversations about Ceph and protection from bit flips
  or other subtle data corruption. Can someone please summarise the
  current state of data integrity protection in Ceph, assuming we have an
  XFS backend filesystem? ie. don't rely on the protection offered by
  btrfs. I saw in the docs that wire messages and journal writes are
  CRC'd, but nothing explicit about the objects themselves.

 - Everything that passes over the wire is checksummed (crc32c).  This is
 mainly because the TCP checksum is so weak.

 - The journal entries have a crc.

 - During deep scrub, we read the objects and metadata, calculate a crc32c,
 and compare across replicas.  This detects missing objects, bitrot,
 failing disks, or any other source of inconsistency.

 - Ceph does not calculate and store a per-object checksum.  Doing so is
 difficult because rados allows arbitrary overwrites of parts of an object.

 - Ceph *does* have a new opportunistic checksum feature, which is
 currently only enabled in QA.  It calculates and stores checksums on
 whatever block size you configure (e.g., 64k) if/when we write/overwrite a
 complete block, and will verify any complete block read against the stored
 crc, if one happens to be available.  This can help catch some but not all
 sources of corruption.

  We also have some specific questions:

  1. Is an object checksum stored on the OSD somewhere? Is this in
  user.ceph._, because it wasn't obvious when looking at the code?

 No (except for the new/experimental opportunistic crc I mention above).

  2. When is the checksum verified. Surely it is checked during the
  deep scrub, but what about during an object read?

 For non-btrfs, no crc to verify.  For btrfs, the fs has its own crc and
 verifies it.

  2b. Can a user read corrupted data if the master replica has a bit
  flip but this hasn't yet been found by a deep scrub?

 Yes.

  3. During deep scrub of an object with 2 replicas, suppose the
  checksum is different for the two objects -- which object wins? (I.e.
  if you store the checksum locally, this is trivial since the
  consistency of objects can be evaluated locally. Without the local
  checksum, you can have conflicts.)

 In this case we normally choose the primary.  The repair has to be
 explicitly triggered by the admin, however, and there are some options to
 control that choice.

  4. If the checksum is already stored per object in the OSD, is this
  retrievable by librados? We have some applications which also need to
  know the checksum of the data and this would be handy if it was
  already calculated by Ceph.

 It would!  It may be that the way to get there is to build an API to
 expose the opportunistic checksums, and/or to extend that feature to
 maintain full checksums (by re-reading partially overwritten blocks on
 write).  (Note, however, that even this wouldn't cover xattrs and omap
 content; really this is something that should be handled by the backend
 storage/file system.)

 sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bit correctness and checksumming

2013-10-16 Thread Tim Bell

At CERN, we have had cases in the past of silent corruptions. It is good to be 
able to identify the devices causing them and swap them out.

It's an old presentation but the concepts are still relevant today ... 
http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

Tim


 -Original Message-
 From: ceph-users-boun...@lists.ceph.com 
 [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ja...@peacon.co.uk
 Sent: 16 October 2013 18:54
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] bit correctness and checksumming
 
 
 Does Ceph log anywhere corrected(/caught) silent corruption - would be 
 interesting to know how much a problem this is, in a large scale
 deployment.  Something to gather in the league table mentioned at the London 
 Ceph day?
 
 Just thinking out-loud (please shout me down...) - if the FS itself performs 
 it's own ECC, ATA streaming command set might be of use to
 avoid performance degradation due to drive level recovery at all.
 
 
 On 2013-10-16 17:12, Sage Weil wrote:
  On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
  Hi all,
  There has been some confusion the past couple days at the CHEP
  conference during conversations about Ceph and protection from bit
  flips or other subtle data corruption. Can someone please summarise
  the current state of data integrity protection in Ceph, assuming we
  have an XFS backend filesystem? ie. don't rely on the protection
  offered by btrfs. I saw in the docs that wire messages and journal
  writes are CRC'd, but nothing explicit about the objects themselves.
 
  - Everything that passes over the wire is checksummed (crc32c).  This
  is mainly because the TCP checksum is so weak.
 
  - The journal entries have a crc.
 
  - During deep scrub, we read the objects and metadata, calculate a
  crc32c, and compare across replicas.  This detects missing objects,
  bitrot, failing disks, or anything other source of inconistency.
 
  - Ceph does not calculate and store a per-object checksum.  Doing so
  is difficult because rados allows arbitrary overwrites of parts of an
  object.
 
  - Ceph *does* have a new opportunistic checksum feature, which is
  currently only enabled in QA.  It calculates and stores checksums on
  whatever block size you configure (e.g., 64k) if/when we
  write/overwrite a complete block, and will verify any complete block
  read against the stored crc, if one happens to be available.  This can
  help catch some but not all sources of corruption.
 
  We also have some specific questions:
 
  1. Is an object checksum stored on the OSD somewhere? Is this in
  user.ceph._, because it wasn't obvious when looking at the code?
 
  No (except for the new/experimental opportunistic crc I mention
  above).
 
  2. When is the checksum verified. Surely it is checked during the
  deep scrub, but what about during an object read?
 
  For non-btrfs, no crc to verify.  For btrfs, the fs has its own crc
  and verifies it.
 
  2b. Can a user read corrupted data if the master replica has a bit
  flip but this hasn't yet been found by a deep scrub?
 
  Yes.
 
  3. During deep scrub of an object with 2 replicas, suppose the
  checksum is different for the two objects -- which object wins? (I.e.
  if you store the checksum locally, this is trivial since the
  consistency of objects can be evaluated locally. Without the local
  checksum, you can have conflicts.)
 
  In this case we normally choose the primary.  The repair has to be
  explicitly triggered by the admin, however, and there are some options
  to control that choice.
 
  4. If the checksum is already stored per object in the OSD, is this
  retrievable by librados? We have some applications which also need to
  know the checksum of the data and this would be handy if it was
  already calculated by Ceph.
 
  It would!  It may be that the way to get there is to build and API to
  expose the opportunistic checksums, and/or to extend that feature to
  maintain full checksums (by re-reading partially overwritten blocks on
  write).  (Note, however, that even this wouldn't cover xattrs and omap
  content; really this is something that should be handled by the
  backend storage/file system.)
 
  sage
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bit correctness and checksumming

2013-10-16 Thread Dan van der Ster
Thank you Sage for the thorough answer.

It just occurred to me to also ask about the gateway. The docs explain that one 
can supply content-md5 during an object PUT (which I assume is verified by the 
RGW), but does a GET respond with the ETag md5? (Sorry, I don't have a gateway 
running at the moment to check for myself, and the answer is relevant to this 
discussion anyway).

Cheers,
Dan
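
For what it's worth, a quick way to check this against a running gateway (host, bucket and object names are placeholders; for multipart uploads the ETag is not a plain md5):

# compare the local md5 with the ETag the gateway returns on HEAD/GET
md5sum ./somefile
curl -sI http://radosgw_server/mybucket/somefile | grep -i etag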

Sage Weil s...@inktank.com wrote:
On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
 Hi all,
 There has been some confusion the past couple days at the CHEP 
 conference during conversations about Ceph and protection from bit
flips 
 or other subtle data corruption. Can someone please summarise the 
 current state of data integrity protection in Ceph, assuming we have
an 
 XFS backend filesystem? ie. don't rely on the protection offered by 
 btrfs. I saw in the docs that wire messages and journal writes are 
 CRC'd, but nothing explicit about the objects themselves.

- Everything that passes over the wire is checksummed (crc32c).  This
is 
mainly because the TCP checksum is so weak.

- The journal entries have a crc.

- During deep scrub, we read the objects and metadata, calculate a
crc32c, 
and compare across replicas.  This detects missing objects, bitrot, 
failing disks, or anything other source of inconistency.

- Ceph does not calculate and store a per-object checksum.  Doing so is

difficult because rados allows arbitrary overwrites of parts of an
object.

- Ceph *does* have a new opportunistic checksum feature, which is 
currently only enabled in QA.  It calculates and stores checksums on 
whatever block size you configure (e.g., 64k) if/when we
write/overwrite a 
complete block, and will verify any complete block read against the
stored 
crc, if one happens to be available.  This can help catch some but not
all 
sources of corruption.

 We also have some specific questions:
 
 1. Is an object checksum stored on the OSD somewhere? Is this in
user.ceph._, because it wasn't obvious when looking at the code?

No (except for the new/experimental opportunistic crc I mention above).

 2. When is the checksum verified. Surely it is checked during the
deep scrub, but what about during an object read?

For non-btrfs, no crc to verify.  For btrfs, the fs has its own crc and

verifies it.

 2b. Can a user read corrupted data if the master replica has a bit
flip but this hasn't yet been found by a deep scrub?

Yes.

 3. During deep scrub of an object with 2 replicas, suppose the
checksum is different for the two objects -- which object wins? (I.e.
if you store the checksum locally, this is trivial since the
consistency of objects can be evaluated locally. Without the local
checksum, you can have conflicts.)

In this case we normally choose the primary.  The repair has to be 
explicitly triggered by the admin, however, and there are some options
to 
control that choice.

 4. If the checksum is already stored per object in the OSD, is this
retrievable by librados? We have some applications which also need to
know the checksum of the data and this would be handy if it was already
calculated by Ceph.

It would!  It may be that the way to get there is to build and API to 
expose the opportunistic checksums, and/or to extend that feature to 
maintain full checksums (by re-reading partially overwritten blocks on 
write).  (Note, however, that even this wouldn't cover xattrs and omap 
content; really this is something that should be handled by the
backend 
storage/file system.)

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bit correctness and checksumming

2013-10-16 Thread Tim Bell

It was long ago and Linux was very different.

With respect to today, we found quite a few cases of bad RAID cards which had 
limited ECC checking on their memory. Stuck bits had serious impacts given our 
data transit volumes :-(

While the root causes we found in the past may be less likely today (as we move 
towards replicas and away from hardware RAID), keeping background scrubbing in 
place, along with a method to identify components which could potentially be 
causing corruption via external probing and quality checks, is very useful.

Tim


 -Original Message-
 From: ceph-users-boun...@lists.ceph.com 
 [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ja...@peacon.co.uk
 Sent: 16 October 2013 20:06
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] bit correctness and checksumming
 
 Very interesting link.  I don't suppose there is any data available 
 separating 4K and 512-byte sectored drives?
 
 
 On 2013-10-16 18:43, Tim Bell wrote:
  At CERN, we have had cases in the past of silent corruptions. It is
  good to be able to identify the devices causing them and swap them
  out.
 
  It's an old presentation but the concepts are still relevant today
  ...
  http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
 
  Tim
 
 
  -Original Message-
  From: ceph-users-boun...@lists.ceph.com
  [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
  ja...@peacon.co.uk
  Sent: 16 October 2013 18:54
  To: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] bit correctness and checksumming
 
 
  Does Ceph log anywhere corrected(/caught) silent corruption - would
  be interesting to know how much a problem this is, in a large scale
  deployment.  Something to gather in the league table mentioned at
  the London Ceph day?
 
  Just thinking out-loud (please shout me down...) - if the FS itself
  performs it's own ECC, ATA streaming command set might be of use to
  avoid performance degradation due to drive level recovery at all.
 
 
  On 2013-10-16 17:12, Sage Weil wrote:
   On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
   Hi all,
   There has been some confusion the past couple days at the CHEP
   conference during conversations about Ceph and protection from
  bit
   flips or other subtle data corruption. Can someone please
  summarise
   the current state of data integrity protection in Ceph, assuming
  we
   have an XFS backend filesystem? ie. don't rely on the protection
   offered by btrfs. I saw in the docs that wire messages and
  journal
   writes are CRC'd, but nothing explicit about the objects
  themselves.
  
   - Everything that passes over the wire is checksummed (crc32c).
  This
   is mainly because the TCP checksum is so weak.
  
   - The journal entries have a crc.
  
   - During deep scrub, we read the objects and metadata, calculate a
   crc32c, and compare across replicas.  This detects missing
  objects,
   bitrot, failing disks, or anything other source of inconistency.
  
   - Ceph does not calculate and store a per-object checksum.  Doing
  so
   is difficult because rados allows arbitrary overwrites of parts of
  an
   object.
  
   - Ceph *does* have a new opportunistic checksum feature, which is
   currently only enabled in QA.  It calculates and stores checksums
  on
   whatever block size you configure (e.g., 64k) if/when we
   write/overwrite a complete block, and will verify any complete
  block
   read against the stored crc, if one happens to be available.  This
  can
   help catch some but not all sources of corruption.
  
   We also have some specific questions:
  
   1. Is an object checksum stored on the OSD somewhere? Is this in
   user.ceph._, because it wasn't obvious when looking at the code?
  
   No (except for the new/experimental opportunistic crc I mention
   above).
  
   2. When is the checksum verified. Surely it is checked during the
   deep scrub, but what about during an object read?
  
   For non-btrfs, no crc to verify.  For btrfs, the fs has its own
  crc
   and verifies it.
  
   2b. Can a user read corrupted data if the master replica has a
  bit
   flip but this hasn't yet been found by a deep scrub?
  
   Yes.
  
   3. During deep scrub of an object with 2 replicas, suppose the
   checksum is different for the two objects -- which object wins?
  (I.e.
   if you store the checksum locally, this is trivial since the
   consistency of objects can be evaluated locally. Without the
  local
   checksum, you can have conflicts.)
  
   In this case we normally choose the primary.  The repair has to be
   explicitly triggered by the admin, however, and there are some
  options
   to control that choice.
  
   4. If the checksum is already stored per object in the OSD, is
  this
   retrievable by librados? We have some applications which also
  need to
   know the checksum of the data and this would be handy if it was
   already calculated by Ceph.
  
   It would!  It may be that the way to get there is to build and API
  

Re: [ceph-users] bit correctness and checksumming

2013-10-16 Thread Sage Weil
On Wed, 16 Oct 2013, ja...@peacon.co.uk wrote:
 Does Ceph log anywhere corrected(/caught) silent corruption - would be
 interesting to know how much a problem this is, in a large scale deployment.
 Something to gather in the league table mentioned at the London Ceph day?

It is logged, and causes the 'ceph health' check to complain. There are 
not currently any historical counts on how many inconsistencies have been 
found and subsequently repaired, though; this would be interested to 
collect and report!

 Just thinking out-loud (please shout me down...) - if the FS itself performs
 it's own ECC, ATA streaming command set might be of use to avoid performance
 degradation due to drive level recovery at all.

Maybe, I'm not familiar... 

sage

 
 
 On 2013-10-16 17:12, Sage Weil wrote:
  On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
   Hi all,
   There has been some confusion the past couple days at the CHEP
   conference during conversations about Ceph and protection from bit flips
   or other subtle data corruption. Can someone please summarise the
   current state of data integrity protection in Ceph, assuming we have an
   XFS backend filesystem? ie. don't rely on the protection offered by
   btrfs. I saw in the docs that wire messages and journal writes are
   CRC'd, but nothing explicit about the objects themselves.
  
  - Everything that passes over the wire is checksummed (crc32c).  This is
  mainly because the TCP checksum is so weak.
  
  - The journal entries have a crc.
  
  - During deep scrub, we read the objects and metadata, calculate a crc32c,
  and compare across replicas.  This detects missing objects, bitrot,
  failing disks, or anything other source of inconistency.
  
  - Ceph does not calculate and store a per-object checksum.  Doing so is
  difficult because rados allows arbitrary overwrites of parts of an object.
  
  - Ceph *does* have a new opportunistic checksum feature, which is
  currently only enabled in QA.  It calculates and stores checksums on
  whatever block size you configure (e.g., 64k) if/when we write/overwrite a
  complete block, and will verify any complete block read against the stored
  crc, if one happens to be available.  This can help catch some but not all
  sources of corruption.
  
   We also have some specific questions:
   
   1. Is an object checksum stored on the OSD somewhere? Is this in
   user.ceph._, because it wasn't obvious when looking at the code?
  
  No (except for the new/experimental opportunistic crc I mention above).
  
   2. When is the checksum verified. Surely it is checked during the deep
   scrub, but what about during an object read?
  
  For non-btrfs, no crc to verify.  For btrfs, the fs has its own crc and
  verifies it.
  
   2b. Can a user read corrupted data if the master replica has a bit flip
   but this hasn't yet been found by a deep scrub?
  
  Yes.
  
   3. During deep scrub of an object with 2 replicas, suppose the checksum is
   different for the two objects -- which object wins? (I.e. if you store the
   checksum locally, this is trivial since the consistency of objects can be
   evaluated locally. Without the local checksum, you can have conflicts.)
  
  In this case we normally choose the primary.  The repair has to be
  explicitly triggered by the admin, however, and there are some options to
  control that choice.
  
   4. If the checksum is already stored per object in the OSD, is this
   retrievable by librados? We have some applications which also need to know
   the checksum of the data and this would be handy if it was already
   calculated by Ceph.
  
  It would!  It may be that the way to get there is to build and API to
  expose the opportunistic checksums, and/or to extend that feature to
  maintain full checksums (by re-reading partially overwritten blocks on
  write).  (Note, however, that even this wouldn't cover xattrs and omap
  content; really this is something that should be handled by the backend
  storage/file system.)
  
  sage
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bit correctness and checksumming

2013-10-16 Thread Dan van der Ster
On Wed, Oct 16, 2013 at 6:12 PM, Sage Weil s...@inktank.com wrote:
 3. During deep scrub of an object with 2 replicas, suppose the checksum is 
 different for the two objects -- which object wins? (I.e. if you store the 
 checksum locally, this is trivial since the consistency of objects can be 
 evaluated locally. Without the local checksum, you can have conflicts.)

 In this case we normally choose the primary.  The repair has to be
 explicitly triggered by the admin, however, and there are some options to
 control that choice.

Which options would those be? I only know about ceph pg repair pg.id

BTW, I read in a previous mail that...

 Repair does the equivalent of a deep-scrub to find problems.  This mostly is 
 reading object data/omap/xattr to create checksums and compares them across 
 all copies.  When a discrepancy is identified an arbitrary copy which did not 
 have I/O errors is selected and used to re-write the other replicas.

This seems like the right thing to do when inconsistencies are the
result of i/o errors. But when caused by random bit flips, this sounds
like an effective way to propagate corrupted data while making ceph
health = HEALTH_OK.

Is that opportunistic checksum feature planned for Emperor?

Cheers, Dan
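
For reference, a sketch of the scrub/repair commands involved (the pg id is a placeholder):

# find PGs flagged inconsistent after a deep scrub
ceph health detail | grep inconsistent

# re-run a deep scrub on a suspect PG, then repair it if needed
ceph pg deep-scrub 2.1f
ceph pg repair 2.1f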
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster access using s3 api with curl

2013-10-16 Thread Snider, Tim
Rookie question: What's the curl command / URL / steps to get an authentication 
token from the cluster without using the swift debug command first?
Using the swift_key values should work, but I haven't found the right 
combination / URL.
Here's what I've done:
1: Get user info from ceph cluster:
# radosgw-admin user info --uid rados
2013-10-16 13:29:42.956578 7f166aeef780  0 WARNING: cannot read region 
map
{ "user_id": "rados",
  "display_name": "rados",
  "email": "n...@none.com",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [],
  "keys": [
        { "user": "rados",
          "access_key": "V92UJ5F24DF2CDGQINTK",
          "secret_key": "uzWaCMQnZ8uxyR3zte2Dthxbca\/H4qsm3p0QI29f"}],
  "swift_keys": [
        { "user": "rados:swift",
          "secret_key": "123"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": []}

2:  Jump through the (unnecessary) Swift debug hoop. Debug truncated the http 
command that holds the key:
# swift --verbose --debug -V 1.0 -A http://10.113.193.189/auth -U 
rados:swift  -K 123 list
DEBUG:swiftclient:REQ: curl -i http://10.113.193.189/auth -X GET

DEBUG:swiftclient:RESP STATUS: 204

DEBUG:swiftclient:REQ: curl -i 
http://10.113.193.189/swift/v1?format=json -X GET -H X-Auth-Token: 
AUTH_rgwtk0b007261646f733a73776966740ddca424fed74e69be4860524846912b0f99a7531ecda91ae47684ebd6b69e40f1dc6b45

DEBUG:swiftclient:RESP STATUS: 200

DEBUG:swiftclient:RESP BODY: []

3: I should be able to pass user and password values from the user info 
command. I haven't found the correct url or path to use.  This command (and 
variations : auth/v1.0 ...) fails. Is the directory structure / URL to get an 
authentication token documented somewhere?
# curl -i http://10.113.193.189/auth -X GET -H 'X-Storage-User: 
rados:swift' -H 'X-Storage-Pass: 123'
HTTP/1.1 403 Forbidden
Date: Wed, 16 Oct 2013 20:33:31 GMT
Server: Apache/2.2.22 (Ubuntu)
Accept-Ranges: bytes
Content-Length: 23
Content-Type: application/json

{Code:AccessDenied}

Thanks,
Tim
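
For comparison, the swift CLI above authenticates by sending the subuser and swift key as headers; a sketch of the equivalent raw request, assuming the Swift v1.0 auth header names (the reply should carry X-Auth-Token and X-Storage-Url headers):

curl -i http://10.113.193.189/auth \
    -H "X-Auth-User: rados:swift" \
    -H "X-Auth-Key: 123"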
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multiply OSDs per host strategy ?

2013-10-16 Thread Andrija Panic
Hi,

I have 2 x  2TB disks, in 3 servers, so total of 6 disks... I have deployed
total of 6 OSDs.
ie:
host1 = osd.0 and osd.1
host2 = osd.2 and osd.3
host4 = osd.4 and osd.5

Now, since I will have a total of 3 replicas (original + 2 replicas), I want
my replica placement to be such that I don't end up having 2 replicas on 1
host (a replica on osd0, osd1 (both on host1) and a replica on osd2). I want all
3 replicas spread across different hosts...

I know this is to be done via crush maps, but I'm not sure if it would be
better to have 2 pools, 1 pool on osd0,2,4 and and another pool on osd1,3,5.

If possible, I would want only 1 pool, spread across all 6 OSDs, but with
data placement such, that I don't end up having 2 replicas on 1 host...not
sure if this is possible at all...

Is that possible, or maybe I should go for RAID0 in each server (2 x 2Tb =
4TB for osd0) or maybe JBOD  (1 volume, so 1 OSD per host) ?

Any suggesting about best practice ?

Regards,

-- 

Andrija Panić
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiply OSDs per host strategy ?

2013-10-16 Thread Mike Dawson

Andrija,

You can use a single pool and the proper CRUSH rule


step chooseleaf firstn 0 type host


to accomplish your goal.

http://ceph.com/docs/master/rados/operations/crush-map/
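
For reference, that step sits inside an ordinary replicated rule, roughly like the stock one shipped in the default crush map (names here follow that default):

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}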


Cheers,
Mike Dawson


On 10/16/2013 5:16 PM, Andrija Panic wrote:

Hi,

I have 2 x  2TB disks, in 3 servers, so total of 6 disks... I have
deployed total of 6 OSDs.
ie:
host1 = osd.0 and osd.1
host2 = osd.2 and osd.3
host4 = osd.4 and osd.5

Now, since I will have total of 3 replica (original + 2 replicas), I
want my replica placement to be such, that I don't end up having 2
replicas on 1 host (replica on osd0, osd1 (both on host1) and replica on
osd2. I want all 3 replicas spread on different hosts...

I know this is to be done via crush maps, but I'm not sure if it would
be better to have 2 pools, 1 pool on osd0,2,4 and and another pool on
osd1,3,5.

If possible, I would want only 1 pool, spread across all 6 OSDs, but
with data placement such, that I don't end up having 2 replicas on 1
host...not sure if this is possible at all...

Is that possible, or maybe I should go for RAID0 in each server (2 x 2Tb
= 4TB for osd0) or maybe JBOD  (1 volume, so 1 OSD per host) ?

Any suggesting about best practice ?

Regards,

--

Andrija Panić


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is there a way to query RBD usage

2013-10-16 Thread Josh Durgin

On 10/15/2013 08:56 PM, Blair Bethwaite wrote:


  Date: Wed, 16 Oct 2013 16:06:49 +1300
  From: Mark Kirkwood mark.kirkw...@catalyst.net.nz
  To: Wido den Hollander w...@42on.com, ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Is there a way to query RBD usage
  Message-ID: 525e02c9.9050...@catalyst.net.nz
  Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 
  On 16/10/13 15:53, Wido den Hollander wrote:
   On 10/16/2013 03:15 AM, Blair Bethwaite wrote:
   I.e., can we see what the actual allocated/touched size of an RBD
is in
   relation to its provisioned size?
  
  
   No, not an easy way. The only way would be to probe which RADOS
   objects exist, but that's a heavy operation you don't want to do with
   large images or with a large number of RBD images.
  
 
  So maybe a 'df' arg for rbd would be a nice addition to blueprints?

Yes, I think so. It does seem a little conflicting to promote Ceph as
doing thin-provisioned volumes, but then not actually be able to
interrogate their real usage against the provisioned size. As a cloud
admin using Ceph as my block-storage layer I really want to be able to
look at several metrics in relation to volumes and tenants:
total GB quota, GB provisioned (i.e., total size of volumes + snaps), GB
allocated.
When users come crying for more quota I need to know whether they're making
efficient use of what they've got.

This actually leads into more of a conversation around the quota model
of dishing out storage. IMHO it would be much more preferable to do
things in a more EBS oriented fashion, where we're able to see actual
usage in the backend. Especially true with snapshots - users are
typically dismayed that their snapshots count towards their quota for
the full size of the originally provisioned volume (despite the fact the
snapshot could usually be truncated/shrunk by a factor of two or more).


You can see the space written in the image and between snapshots (not
including fs overhead on the osds) since cuttlefish:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/3684

It'd be nice to wrap that in a df or similar command though.
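
For anyone searching the archives later, that boils down to summing the
extents that rbd diff reports; a quick sketch (the image and snapshot names
here are made up):

 # approximate data actually written to an image (extent length is column 2)
 rbd diff rbd/myimage | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'

 # the same, but only the delta between two snapshots
 rbd diff --from-snap snap1 rbd/myimage@snap2 | \
   awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'

As noted above, that reflects RADOS-level allocation and won't include
filesystem overhead on the OSDs.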

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] changing from default journals to external journals

2013-10-16 Thread Snider, Tim
I configured my cluster using the default journal location for my osds. Can I 
migrate the default journals to explicit separate devices without a complete 
cluster teardown and reinstallation? How?

Thanks,
Tim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] changing from default journals to external journals

2013-10-16 Thread Sage Weil
On Wed, 16 Oct 2013, Snider, Tim wrote:
 I configured my cluster using the default journal location for my osds. Can
 I migrate the default journals to explicit separate devices without a
 complete cluster teardown and reinstallation? How?

- stop a ceph-osd daemon, then
- ceph-osd --flush-journal -i NNN
- set/adjust the journal symlink at /var/lib/ceph/osd/ceph-NNN/journal to 
  point wherever you want
- ceph-osd --mkjournal -i NNN
- start ceph-osd

This won't set up the udev magic on the journal device, but that doesn't 
really matter if you're not hotplugging devices.
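
A minimal shell sketch of those steps, assuming OSD id 12, the default
/var/lib/ceph layout, and /dev/sdg1 as the (hypothetical) new journal
partition:

 # stop the daemon (init syntax varies between sysvinit/upstart)
 service ceph stop osd.12

 # flush outstanding journal entries into the object store
 ceph-osd -i 12 --flush-journal

 # repoint the journal symlink at the new partition
 rm /var/lib/ceph/osd/ceph-12/journal
 ln -s /dev/sdg1 /var/lib/ceph/osd/ceph-12/journal

 # initialize the new journal and bring the OSD back up
 ceph-osd -i 12 --mkjournal
 service ceph start osd.12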

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiply OSDs per host strategy ?

2013-10-16 Thread Andrija Panic
well, nice one :)

*step chooseleaf firstn 0 type host* - it is part of the default crush map
(3 hosts, 2 OSDs per host)

So it means: write 3 replicas (in my case) to 3 hosts...and randomly select
an OSD from each host?

I already read all the docs...and I'm still not sure how to proceed...
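
For reference, the resulting placement can be sanity-checked with something
like the following (the pool and object names are just examples):

 # see which OSDs an object maps to - the acting set should span three hosts
 ceph osd map rbd some-test-object

 # or exercise the compiled CRUSH map offline
 ceph osd getcrushmap -o crush.bin
 crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-statistics

If the offline test reports bad mappings, or ceph osd map shows two OSDs from
the same host, the rule is not doing what's wanted.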


On 16 October 2013 23:27, Mike Dawson mike.daw...@cloudapt.com wrote:

 Andrija,

 You can use a single pool and the proper CRUSH rule


 step chooseleaf firstn 0 type host


 to accomplish your goal.

 http://ceph.com/docs/master/rados/operations/crush-map/


 Cheers,
 Mike Dawson



 On 10/16/2013 5:16 PM, Andrija Panic wrote:

 Hi,

 I have 2 x  2TB disks, in 3 servers, so total of 6 disks... I have
 deployed total of 6 OSDs.
 ie:
 host1 = osd.0 and osd.1
 host2 = osd.2 and osd.3
 host4 = osd.4 and osd.5

 Now, since I will have total of 3 replica (original + 2 replicas), I
 want my replica placement to be such, that I don't end up having 2
 replicas on 1 host (replica on osd0, osd1 (both on host1) and replica on
 osd2. I want all 3 replicas spread on different hosts...

 I know this is to be done via crush maps, but I'm not sure if it would
 be better to have 2 pools, 1 pool on osd0,2,4 and and another pool on
 osd1,3,5.

 If possible, I would want only 1 pool, spread across all 6 OSDs, but
 with data placement such, that I don't end up having 2 replicas on 1
 host...not sure if this is possible at all...

 Is that possible, or maybe I should go for RAID0 in each server (2 x 2Tb
 = 4TB for osd0) or maybe JBOD  (1 volume, so 1 OSD per host) ?

 Any suggesting about best practice ?

 Regards,

 --

 Andrija Panić


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kvm live migrate wil ceph

2013-10-16 Thread Jon
Hello Michael,

Thanks for the reply.  It seems like ceph isn't actually mounting the rbd
on the vm host, which is where I think I was getting hung up (I had
previously been attempting to mount rbds directly on multiple hosts and, as
you can imagine, having issues).

Could you possibly expound on why the clustered filesystem approach is
wrong (or conversely why using RBDs is the correct approach)?

As for format 2 rbd images, it looks like they provide exactly the
copy-on-write functionality that I am looking for.  Any caveats or things I
should look out for when going from format 1 to format 2 images? (I think I
read something about not being able to use both at the same time...)
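
For reference, the golden-image workflow with format 2 looks roughly like the
sketch below (pool and image names are invented, and on some rbd versions the
flag is spelled --format 2 rather than --image-format 2):

 # create a format 2 master image (size in MB) and install the guest OS into it
 rbd create --image-format 2 --size 20480 rbd/golden-master

 # snapshot the master, protect the snapshot, then clone copy-on-write children
 rbd snap create rbd/golden-master@base
 rbd snap protect rbd/golden-master@base
 rbd clone rbd/golden-master@base rbd/vm-guest-01

Format 1 and format 2 images can live in the same pool; the usual caveat is
that older kernel rbd clients don't understand format 2, so cloning is mostly
a librbd/qemu feature at this point.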

Thanks Again,
Jon A


On Mon, Oct 14, 2013 at 4:42 PM, Michael Lowe j.michael.l...@gmail.com wrote:

 I live migrate all the time using the rbd driver in qemu, no problems.
  Qemu will issue a flush as part of the migration so everything is
 consistent.  It's the right way to use ceph to back vm's. I would strongly
 recommend against a network file system approach.  You may want to look
 into format 2 rbd images, the cloning and writable snapshots may be what
 you are looking for.

 Sent from my iPad

 On Oct 14, 2013, at 5:37 AM, Jon three1...@gmail.com wrote:

 Hello,

 I would like to live migrate a VM between two hypervisors.  Is it
 possible to do this with a rbd disk or should the vm disks be created as
 qcow images on a CephFS/NFS share (is it possible to do clvm over rbds? OR
  GlusterFS over rbds?) and point kvm at the network directory.  As I
 understand it, rbds aren't cluster aware so you can't mount an rbd on
 multiple hosts at once, but maybe libvirt has a way to handle the
 transfer...?  I like the idea of master or golden images where guests
 write any changes to a new image, I don't think rbds are able to handle
 copy-on-write in the same way kvm does so maybe a clustered filesystem
 approach is the ideal way to go.

 Thanks for your input. I think I'm just missing some piece. .. I just
 don't grok...

  Best Regards,
 Jon A

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CloudStack + KVM(Ubuntu 12.04, Libvirt 1.0.2) + Ceph [Seeking Help]

2013-10-16 Thread Kelcey Jamison Damage
Hi, 

I have gotten so close to having Ceph work in my cloud, but I have reached a
roadblock. Any help would be greatly appreciated.

I receive the following error when trying to get KVM to run a VM with an RBD 
volume: 

Libvirtd.log: 

2013-10-16 22:05:15.516+0000: 9814: error : qemuProcessReadLogOutput:1477 : 
internal error Process exited while reading console log output: 
char device redirected to /dev/pts/3 
kvm: -drive file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789,if=none,id=drive-ide0-0-1: error connecting 
kvm: -drive file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789,if=none,id=drive-ide0-0-1: could not open disk image rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789: Invalid argument 

Ceph Pool showing test volume exists: 

root@ubuntu-test-KVM-RBD:/opt# rbd -p libvirt-pool ls 
new-libvirt-image 

Ceph Auth: 

client.libvirt 
key: AQBx+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA== 
caps: [ mon ] allow r 
caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=libvirt-pool 

KVM Drive Support: 

root@ubuntu-test-KVM-RBD:/opt# kvm --drive format=?
Supported formats: vvfat vpc vmdk vdi sheepdog rbd raw host_cdrom host_floppy 
host_device file qed qcow2 qcow parallels nbd dmg tftp ftps ftp https http cow 
cloop bochs blkverify blkdebug 

Thank you if anyone can help 
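
In case it is useful while this is being debugged: the usual way to hand the
cephx key to qemu through libvirt is a libvirt secret referenced from the
guest's disk XML, rather than a key= string on the command line. A rough
sketch for a manual virsh test, with secret.xml containing:

 <secret ephemeral='no' private='no'>
   <usage type='ceph'>
     <name>client.libvirt secret</name>
   </usage>
 </secret>

and then (the UUID is whatever secret-define prints, not a real value here):

 virsh secret-define --file secret.xml
 # attach the base64 key for the rados user
 virsh secret-set-value --secret <uuid-printed-above> \
     --base64 "$(ceph auth get-key client.libvirt)"

The disk element then carries an <auth username='libvirt'> block with a
<secret type='ceph' uuid='...'/> child instead of embedding the key, while the
source stays protocol rbd, name libvirt-pool/new-libvirt-image, host
10.0.1.83:6789. Not necessarily the cause of the error above, but it rules the
key handling out.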

Kelcey Damage | Infrastructure Systems Architect 
Strategy | Automation | Cloud Computing | Technology Development 

Backbone Technology, Inc 
604-331-1152 ext. 114 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Very unbalanced osd data placement with differing sized devices

2013-10-16 Thread Mark Kirkwood

I stumbled across this today:

4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a 
play setup).


- ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal)
- ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal)

I do a standard installation via ceph-deploy (1.2.7) of ceph (0.67.4) on 
each one [1]. The topology looks like:


$ ceph osd tree
# id    weight      type name       up/down  reweight
-1      0.01999     root default
-2      0               host ceph1
0       0                   osd.0   up       1
-3      0               host ceph2
1       0                   osd.1   up       1
-4      0.009995        host ceph3
2       0.009995            osd.2   up       1
-5      0.009995        host ceph4
3       0.009995            osd.3   up       1

So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd.2 and osd.3 (on 
ceph3,4) have weight 0.009995; this suggests that data will flee osd.0,1 
and live only on osd.2,3. Sure enough, putting in a few objects via rados 
put results in:


ceph1 $ df -m
Filesystem 1M-blocks  Used Available Use% Mounted on
/dev/vda1   5038  2508  2275  53% /
udev 994 1   994   1% /dev
tmpfs401 1   401   1% /run
none   5 0 5   0% /run/lock
none1002 0  1002   0% /run/shm
/dev/vdb1   510940  5070   1% /var/lib/ceph/osd/ceph-0

(similarly for ceph2), whereas:

ceph3 $df -m
Filesystem 1M-blocks  Used Available Use% Mounted on
/dev/vda1   5038  2405  2377  51% /
udev 994 1   994   1% /dev
tmpfs401 1   401   1% /run
none   5 0 5   0% /run/lock
none1002 0  1002   0% /run/shm
/dev/vdb1  10229  1315  8915  13% /var/lib/ceph/osd/ceph-2

(similarly for ceph4). Obviously I can fix this by reweighting the 
first two osds to something like 0.005, but I'm wondering if there is 
something I've missed - clearly some kind of auto weighting has been 
performed on the basis of the size difference in the data volumes, but 
it looks to be skewing data far too much towards the bigger ones. Is there 
perhaps a bug in the smarts for this? Or is it just because I'm using 
small volumes (5G = 0 weight)?


Cheers

Mark

[1] i.e:

$ ceph-deploy new ceph1
$ ceph-deploy mon create ceph1
$ ceph-deploy gatherkeys ceph1
$ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc
...
$ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very unbalanced osd data placement with differing sized devices

2013-10-16 Thread David Zafman

I may be wrong, but I always thought that a weight of 0 means don't put 
anything there.  All weights > 0 will be looked at proportionally.

See http://ceph.com/docs/master/rados/operations/crush-map/ which recommends 
higher weights anyway:

Weighting Bucket Items

Ceph expresses bucket weights as double integers, which allows for fine 
weighting. A weight is the relative difference between device capacities. We 
recommend using 1.00 as the relative weight for a 1TB storage device. In such a 
scenario, a weight of 0.5 would represent approximately 500GB, and a weight of 
3.00 would represent approximately 3TB. Higher level buckets have a weight that 
is the sum total of the leaf items aggregated by the bucket.

A bucket item weight is one dimensional, but you may also calculate your item 
weights to reflect the performance of the storage drive. For example, if you 
have many 1TB drives where some have relatively low data transfer rate and the 
others have a relatively high data transfer rate, you may weight them 
differently, even though they have the same capacity (e.g., a weight of 0.80 
for the first set of drives with lower total throughput, and 1.20 for the 
second set of drives with higher total throughput).


David Zafman
Senior Developer
http://www.inktank.com




On Oct 16, 2013, at 8:15 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz 
wrote:

 I stumbled across this today:
 
 4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a play 
 setup).
 
 - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal)
 - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal)
 
 I do a standard installation via ceph-deploy (1.2.7) of ceph (0.67.4) on each 
 one [1]. The topology looks like:
 
 $ ceph osd tree
 # id    weight      type name       up/down  reweight
 -1      0.01999     root default
 -2      0               host ceph1
 0       0                   osd.0   up       1
 -3      0               host ceph2
 1       0                   osd.1   up       1
 -4      0.009995        host ceph3
 2       0.009995            osd.2   up       1
 -5      0.009995        host ceph4
 3       0.009995            osd.3   up       1
 
 So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd2 and osd.3 (on 
 ceph3,4) have weight 0.009995 this suggests that data will flee osd.0,1 and 
 live only on osd.3.4. Sure enough putting in a few objects via radus put 
 results in:
 
 ceph1 $ df -m
 Filesystem 1M-blocks  Used Available Use% Mounted on
 /dev/vda1   5038  2508  2275  53% /
 udev 994 1   994   1% /dev
 tmpfs401 1   401   1% /run
 none   5 0 5   0% /run/lock
 none1002 0  1002   0% /run/shm
 /dev/vdb1   510940  5070   1% /var/lib/ceph/osd/ceph-0
 
 (similarly for ceph2), whereas:
 
 ceph3 $df -m
 Filesystem 1M-blocks  Used Available Use% Mounted on
 /dev/vda1   5038  2405  2377  51% /
 udev 994 1   994   1% /dev
 tmpfs401 1   401   1% /run
 none   5 0 5   0% /run/lock
 none1002 0  1002   0% /run/shm
 /dev/vdb1  10229  1315  8915  13% /var/lib/ceph/osd/ceph-2
 
 (similarly for ceph4). Obviously I can fix this via the reweighting the first 
 two osds to something like 0.005, but I'm wondering if there is something 
 I've missed - clearly some kind of auto weighting is has been performed on 
 the basis of the size difference in the data volumes, but looks to be skewing 
 data far too much to the bigger ones. Is there perhaps a bug in the smarts 
 for this? Or is it just because I'm using small volumes (5G = 0 weight)?
 
 Cheers
 
 Mark
 
 [1] i.e:
 
 $ ceph-deploy new ceph1
 $ ceph-deploy mon create ceph1
 $ ceph-deploy gatherkeys ceph1
 $ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc
 ...
 $ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very unbalanced osd data placement with differing sized devices

2013-10-16 Thread Sage Weil
On Thu, 17 Oct 2013, Mark Kirkwood wrote:
 I stumbled across this today:
 
 4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a play
 setup).
 
 - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal)
 - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal)
 
 I do a standard installation via ceph-deploy (1.2.7) of ceph (0.67.4) on each
 one [1]. The topology looks like:
 
 $ ceph osd tree
 # id    weight      type name       up/down  reweight
 -1      0.01999     root default
 -2      0               host ceph1
 0       0                   osd.0   up       1
 -3      0               host ceph2
 1       0                   osd.1   up       1
 -4      0.009995        host ceph3
 2       0.009995            osd.2   up       1
 -5      0.009995        host ceph4
 3       0.009995            osd.3   up       1
 
 So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd2 and osd.3 (on ceph3,4)
 have weight 0.009995 this suggests that data will flee osd.0,1 and live only
 on osd.3.4. Sure enough putting in a few objects via radus put results in:
 
 ceph1 $ df -m
 Filesystem 1M-blocks  Used Available Use% Mounted on
 /dev/vda1   5038  2508  2275  53% /
 udev 994 1   994   1% /dev
 tmpfs401 1   401   1% /run
 none   5 0 5   0% /run/lock
 none1002 0  1002   0% /run/shm
 /dev/vdb1   510940  5070   1% /var/lib/ceph/osd/ceph-0
 
 (similarly for ceph2), whereas:
 
 ceph3 $df -m
 Filesystem 1M-blocks  Used Available Use% Mounted on
 /dev/vda1   5038  2405  2377  51% /
 udev 994 1   994   1% /dev
 tmpfs401 1   401   1% /run
 none   5 0 5   0% /run/lock
 none1002 0  1002   0% /run/shm
 /dev/vdb1  10229  1315  8915  13% /var/lib/ceph/osd/ceph-2
 
 (similarly for ceph4). Obviously I can fix this via the reweighting the first
 two osds to something like 0.005, but I'm wondering if there is something I've
 missed - clearly some kind of auto weighting is has been performed on the
 basis of the size difference in the data volumes, but looks to be skewing data
 far too much to the bigger ones. Is there perhaps a bug in the smarts for
 this? Or is it just because I'm using small volumes (5G = 0 weight)?

Yeah, I think this is just rounding error.  By default a weight of 1.0 == 
1 TB, so these are just very small numbers.  Internally, we're storing 
as a fixed-point 32-bit value where 1.0 == 0x10000, and 5MB is just too 
small for those units.

You can disable this autoweighting with 

 osd crush update on start = false

in ceph.conf.
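
For anyone hitting the same thing, the manual alternative mentioned above is 
to set the CRUSH weights explicitly, following the usual 1.0-per-TB 
convention (so roughly 0.005 for a 5 GB device):

 ceph osd crush reweight osd.0 0.005
 ceph osd crush reweight osd.1 0.005

After that, ceph osd tree should show non-zero weights for all four OSDs and 
data should spread roughly in proportion to size.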

sage


 
 Cheers
 
 Mark
 
 [1] i.e:
 
 $ ceph-deploy new ceph1
 $ ceph-deploy mon create ceph1
 $ ceph-deploy gatherkeys ceph1
 $ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc
 ...
 $ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fedora package dependencies

2013-10-16 Thread Darryl Bond

Performing yum updates on Fedora 19 now breaks qemu.
There is a different set of package names and contents between the
default Fedora ceph packages and the ceph.com packages.
There is no ceph-libs package in the ceph.com repository, and qemu now
enforces the dependency on ceph-libs.
Yum update now produces this error:
Error: Package: 2:qemu-common-1.4.2-12.fc19.x86_64 (updates)
   Requires: ceph-libs >= 0.61
   Available: ceph-libs-0.56.4-1.fc19.i686 (fedora)
   ceph-libs = 0.56.4-1.fc19
   Available: ceph-libs-0.67.3-2.fc19.i686 (updates)
   ceph-libs = 0.67.3-2.fc19

The ceph-libs dependency enforcement is new as of this qemu update.

Shouldn't the ceph.com Fedora packages mirror the default Fedora packages
in name and contents?

Regards
Darryl


The contents of this electronic message and any attachments are intended only 
for the addressee and may contain legally privileged, personal, sensitive or 
confidential information. If you are not the intended addressee, and have 
received this email, any transmission, distribution, downloading, printing or 
photocopying of the contents of this message or attachments is strictly 
prohibited. Any legal privilege or confidentiality attached to this message and 
attachments is not waived, lost or destroyed by reason of delivery to any 
person other than intended addressee. If you have received this message and are 
not the intended addressee you should notify the sender by return email and 
destroy all copies of the message and any attachments. Unless expressly 
attributed, the views expressed in this email do not necessarily represent the 
views of the company.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com