[ceph-users] Issue of S3 API: x-amz-acl: public-read-write and authenticated-read
Hi ceph-users,

I am trying the S3-compatible API of Ceph, but am hitting the following issues:

1. x-amz-acl: public-read-write

I upload an object with the public-read-write ACL. I can then GET this object directly, without an access key:

    curl -v -s http://radosgw_server/mybucket0/20131015_1
    ...
    HTTP/1.1 200
    ...

But I can't write or delete this object without an access key:

    curl -v -s http://ceph7.dev.mobstor.corp.bf1.yahoo.com/mybucket0/20131015_1 -XPUT -d 1234

or

    curl -v -s http://ceph7.dev.mobstor.corp.bf1.yahoo.com/mybucket0/20131015_1 -XDELETE
    ...
    HTTP/1.1 403
    ...
    <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code></Error>

2. x-amz-acl: authenticated-read

I have created two radosgw users. I upload an object with the authenticated-read ACL using the access key of one radosgw user. I can then GET this object using that user's access key, but I can't GET it using the other user's access key.

I am not sure whether I am using the authenticated-read ACL correctly; please correct me if I am wrong. Thanks.

--
Regards,
Zhi
[ceph-users] how to make ceph with hadoop
hi all!

My ceph is 0.62, and I want to build it with hadoop:

    ./configure --with-hadoop

but it returns "jni.h not found". I found jni.h in /usr/java/jdk/include/jni.h. How can I fix this problem? Thanks.

pengft
Re: [ceph-users] Ceph, Keystone and S3
Hi All,

Does anyone know if it'll be possible to use the radosgw admin API when using Keystone users? I suspect not, due to the user requiring specific caps, however it'd be great if someone could validate (I'm still running v0.67.4 so can't play with this much). Thanks!

-Matt

On Tue, Oct 15, 2013 at 6:34 PM, Carlos Gimeno Yañez <cgim...@bifi.es> wrote:
> Thank you very much Yehuda, that was the missing piece of my puzzle! I
> think that this should be added to the official documentation.
>
> Regards
>
> 2013/10/15 Yehuda Sadeh <yeh...@inktank.com>:
>> On Tue, Oct 15, 2013 at 7:17 AM, Carlos Gimeno Yañez <cgim...@bifi.es> wrote:
>>> Hi,
>>>
>>> I've deployed Ceph using ceph-deploy, following the official
>>> documentation. I've created a user to use with Swift and everything is
>>> working fine: my users can create buckets and upload files if they use
>>> the Horizon dashboard or the Swift CLI. However, everything changes if
>>> they try to do it with the S3 API. When they download their credentials
>>> from the Horizon dashboard to get their keys, they can't connect to Ceph
>>> using the S3 API; they only get a 403 Access Denied error message. I'm
>>> using Ceph 0.70, so if I'm not wrong, Ceph should be able to validate S3
>>> tokens against Keystone since version 0.69.
>>>
>>> Here is my ceph.conf:
>>>
>>> [client.radosgw.gateway]
>>> host = server2
>>> keyring = /etc/ceph/keyring.radosgw.gateway
>>> rgw socket path = /var/run/ceph/radosgw.sock
>>> log file = /var/log/ceph/radosgw.log
>>> rgw keystone url = server4:35357
>>> rgw keystone admin token = admintoken
>>> rgw keystone accepted roles = admin _member_ Member
>>> rgw print continue = false
>>> rgw keystone token cache size = 500
>>> rgw keystone revocation interval = 500
>>> nss db path = /var/ceph/nss
>>> #Add DNS hostname to enable S3 subdomain calls
>>> rgw dns name = server2
>>>
>>> And this is the error message (with s3-curl):
>>>
>>> GET / HTTP/1.1
>>> User-Agent: curl/7.29.0
>>> Host: host_ip
>>> Accept: */*
>>> Date: Tue, 15 Oct 2013 14:07:24 +0000
>>> Authorization: AWS 3a1ecdea87d6493a9922c13a06d392cf:SNu/sjTuDtvunOQKJaU8Besm1RQ=
>>>
>>> HTTP/1.1 403 Forbidden
>>> Date: Tue, 15 Oct 2013 14:07:24 GMT
>>> Server: Apache/2.2.22 (Ubuntu)
>>> Accept-Ranges: bytes
>>> Content-Length: 78
>>> Content-Type: application/xml
>>> <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code></Error>
>>>
>>> Regards
>>
>> Try adding:
>>
>> rgw s3 auth use keystone = true
>>
>> to your ceph.conf
>>
>> Yehuda
[ceph-users] radosgw public access problem
Hello,

When I set a read permission for all users on the bucket, I can read the content of the bucket, but I receive "access denied" for all directories and sub-directories inside this bucket.

Where am I going wrong?

Many thanks,
Fabio
[ceph-users] bit correctness and checksumming
Hi all,

There has been some confusion the past couple of days at the CHEP conference during conversations about Ceph and protection from bit flips or other subtle data corruption. Can someone please summarise the current state of data integrity protection in Ceph, assuming we have an XFS backend filesystem? i.e. don't rely on the protection offered by btrfs. I saw in the docs that wire messages and journal writes are CRC'd, but nothing explicit about the objects themselves.

We also have some specific questions:

1. Is an object checksum stored on the OSD somewhere? Is this in user.ceph._? It wasn't obvious when looking at the code...

2. When is the checksum verified? Surely it is checked during the deep scrub, but what about during an object read?

2b. Can a user read corrupted data if the master replica has a bit flip but this hasn't yet been found by a deep scrub?

3. During deep scrub of an object with 2 replicas, suppose the checksum is different for the two objects -- which object wins? (I.e. if you store the checksum locally, this is trivial since the consistency of objects can be evaluated locally. Without the local checksum, you can have conflicts.)

4. If the checksum is already stored per object in the OSD, is it retrievable by librados? We have some applications which also need to know the checksum of the data, and this would be handy if it was already calculated by Ceph.

Thanks in advance!

Dan van der Ster
CERN IT
Re: [ceph-users] how to make ceph with hadoop
The --with-hadoop option has been removed. The Ceph Hadoop bindings are now located in git://github.com/ceph/hadoop-common (cephfs/branch-1.0), and the required CephFS Java bindings can be built from the Ceph Git repository using the --enable-cephfs-java configure option.

On Wed, Oct 16, 2013 at 12:26 AM, 鹏 <wkp4...@126.com> wrote:
> hi all!
>
> My ceph is 0.62, and I want to build it with hadoop:
>
>     ./configure --with-hadoop
>
> but it returns "jni.h not found". I found jni.h in
> /usr/java/jdk/include/jni.h. How can I fix this problem? Thanks.
>
> pengft
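For reference, when configure can't find jni.h, one possible workaround is to point the preprocessor at the JDK headers. This is an untested sketch, assuming the JDK lives at /usr/java/jdk as in the original mail and that the tree supports --enable-cephfs-java:

    ./configure --enable-cephfs-java \
        CPPFLAGS="-I/usr/java/jdk/include -I/usr/java/jdk/include/linux"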
Re: [ceph-users] radosgw public access problem
On 10/16/13 5:15 AM, Fabio - NS3 srl wrote:
> Hello,
> When I set a read permission for all users on the bucket, I can read the
> content of the bucket, but I receive "access denied" for all directories
> and sub-directories inside this bucket. Where am I going wrong?

Hi Fabio,

This is the default S3 behavior. The default canned ACL grants FULL_CONTROL to the user who writes the key. You will have to iterate the keys and grant a specific read ACL; you can also specify the ACL at upload time for each key.

Also, we have a patch pending [1] that provides some relief for this use case: it would allow the bucket ACLs to be evaluated and be authoritative before the key ACLs. It needs to get cleaned up a bit, but I think it would be very useful in your case. We are about to go into production running this on two different Ceph object stores.

[1] - https://github.com/ceph/ceph/pull/672

Thanks,
derek

--
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
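For example, iterating the keys and granting a read ACL can be done with s3cmd (a sketch, assuming s3cmd is configured against the gateway; the bucket name is hypothetical):

    # grant anonymous read on every existing key under the bucket
    s3cmd setacl --acl-public --recursive s3://mybucket/

    # or set the ACL at upload time for a single key
    s3cmd put --acl-public localfile s3://mybucket/dir/localfile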
Re: [ceph-users] Missing Dependency for ceph-deploy 1.2.7
On Tue, Oct 15, 2013 at 9:54 PM, Luke Jing Yuan <jyl...@mimos.my> wrote:
> Hi,
>
> I am trying to install/upgrade to 1.2.7 but Ubuntu (Precise) is complaining
> about an unmet dependency, which seems to be python-pushy 0.5.3, which
> seems to be missing. Am I correct to assume so?
>
> Regards,
> Luke

That is odd, we still have pushy packages available for the version that you are having issues with, see:

http://ceph.com/debian-dumpling/pool/main/p/python-pushy/

It might be that you need to update your repos?
[ceph-users] poor read performance on rbd+LVM, LVM overload
Hello ceph & LVM communities!

I noticed very slow reads from an xfs mount that is on a ceph client (rbd + gpt partition + LVM PV + xfs on LE). To find a cause, I created another rbd in the same pool, formatted it straight away with xfs, and mounted it.

Write performance for both xfs mounts is similar, ~12MB/s. Reads with

    dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null

are as follows:

    with LVM: ~4MB/s
    pure xfs: ~30MB/s

I watched performance while doing reads with atop. In the LVM case, atop shows the LVM device overloaded:

    LVM | s-LV_backups | busy 95% | read 21515 | write 0 | KiB/r 4 |
        | KiB/w 0 | MBr/s 4.20 | MBw/s 0.00 | avq 1.00 | avio 0.85 ms |

client kernel: 3.10.10
ceph version: 0.67.4

My considerations: I have expanded the rbd under LVM a couple of times (accordingly expanding the gpt partition, PV, VG, LV, and xfs afterwards), but that should have no impact on performance (I tested a clean rbd+LVM; same read performance as for the expanded one). As with device-mapper, after LVM is initialized it is just a small table with LE-PE mappings that should reside in a close CPU cache. I am guessing this could be related to the old CPU used; probably caching near the CPU does not work well (I also tested local HDDs with/without LVM and got read speeds of ~13MB/s vs 46MB/s, with atop showing the same overload in the LVM case).

What could make so great a difference when LVM is used, and what/how should I tune? As write performance does not differ, DM extent lookup should not be lagging -- where is the trick?

CPU used:

    # cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 4
    model name      : Intel(R) Xeon(TM) CPU 3.20GHz
    stepping        : 10
    microcode       : 0x2
    cpu MHz         : 3200.077
    cache size      : 2048 KB
    physical id     : 0
    siblings        : 2
    core id         : 0
    cpu cores       : 1
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 5
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl cid cx16 xtpr lahf_lm
    bogomips        : 6400.15
    clflush size    : 64
    cache_alignment : 128
    address sizes   : 36 bits physical, 48 bits virtual
    power management:

Br,
Ugis
Re: [ceph-users] ceph-deploy zap disk failure
On Tue, Oct 15, 2013 at 9:19 PM, Guang <yguan...@yahoo.com> wrote:
> -bash-4.1$ which sgdisk
> /usr/sbin/sgdisk
>
> Which path does ceph-deploy use?

That is unexpected... these are the paths that ceph-deploy uses:

    '/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin'

So /usr/sbin/ is there. I believe this is a case where $PATH gets altered because of sudo (resetting the env variable). This should be fixed in the next release.

In the meantime, you could set the $PATH for non-interactive sessions (which is what ceph-deploy uses) for all users. I *think* that would be in /etc/profile.

> Thanks,
> Guang
>
> On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote:
>> On Tue, Oct 15, 2013 at 10:52 AM, Guang <yguan...@yahoo.com> wrote:
>>> Hi ceph-users,
>>> I am trying the new ceph-deploy utility on RHEL 6.4 and I came across a
>>> new issue:
>>>
>>> -bash-4.1$ ceph-deploy --version
>>> 1.2.7
>>> -bash-4.1$ ceph-deploy disk zap server:/dev/sdb
>>> [ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap server:/dev/sdb
>>> [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server
>>> [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from remote host
>>> [ceph_deploy.osd][INFO ] Distro info: Red Hat Enterprise Linux Server 6.4 Santiago
>>> [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device
>>> [osd2.ceph.mobstor.bf1.yahoo.com][INFO ] Running command: sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb
>>> [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found
>>>
>>> When I run disk zap on the host directly, it works without issues. Has
>>> anyone met the same issue?
>>
>> Can you run `which sgdisk` on that host? I want to make sure this is not
>> a $PATH problem. ceph-deploy tries to use the proper path remotely but it
>> could be that this one is not there.
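If sudo's PATH reset is indeed the culprit, one possible workaround (an assumption, not a confirmed fix) is to make sure the sbin directories appear in sudo's secure_path. On RHEL that is set in /etc/sudoers; a sketch (edit with visudo):

    # /etc/sudoers -- ensure the sbin dirs holding sgdisk are searched
    Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin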
[ceph-users] snapshots on CephFS
Hi Greg,

On http://comments.gmane.org/gmane.comp.file-systems.ceph.user/1705 I found a statement from you regarding snapshots on CephFS:

---snip---
Filesystem snapshots exist and you can experiment with them on CephFS (there's a hidden .snaps folder; you can create or remove snapshots by creating directories in that folder; navigate up and down it, etc).
---snip---

Can you please explain in more detail, or with example commands, how to create/list/remove snapshots in CephFS? I assume they will be created on a directory level?

How will the CephFS snapshots cohere with the underlying pools? (e.g. using cephfs /mnt/cephfs/dir-1/dir2 set_layout -p 18)

Thanks,
-Dieter
Re: [ceph-users] kvm live migrate with ceph
I wouldn't go so far as to say putting a vm in a file on a networked filesystem is wrong. It is just not the best choice if you have a ceph cluster at hand, in my opinion. Networked filesystems have a bunch of extra stuff to implement posix semantics and live in kernel space. You just need simple block device semantics, and you don't need to entangle the hypervisor's kernel space. What it boils down to is the engineering first principle of selecting the least complicated solution that satisfies the requirements of the problem. You don't get anything when you trade the simplicity of rbd for the complexity of a networked filesystem.

For format 2, I think the only caveat is that it requires newer clients, and the kernel client takes some time to catch up to the user-space clients. You may not be able to mount filesystems on rbd devices with the kernel client, depending on kernel version; this may or may not be important to you. You can always use a vm to mount a filesystem on an rbd device as a workaround.

On Oct 16, 2013, at 9:11 AM, Jon <three1...@gmail.com> wrote:
> Could you possibly expound on why using a clustered filesystem approach is
> wrong (or conversely, why using RBDs is the correct approach)?
>
> As for format 2 rbd images, it looks like they provide exactly the
> copy-on-write functionality that I am looking for. Any caveats or things I
> should look out for when going from format 1 to format 2 images?
> [...]
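For reference, the format 2 golden-image workflow mentioned above looks roughly like this on a recent (cuttlefish/dumpling) rbd CLI -- a sketch, with pool, image, and snapshot names made up:

    # create a format 2 image and install the golden OS into it
    rbd create --image-format 2 --size 10240 rbd/golden

    # snapshot it, protect the snapshot, then clone per-guest COW children
    rbd snap create rbd/golden@base
    rbd snap protect rbd/golden@base
    rbd clone rbd/golden@base rbd/guest1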
Re: [ceph-users] radosgw-admin doesn't list user anymore
On 10/16/13 4:26 AM, Valery Tschopp wrote:
> Hi Derek,
>
> Thanks for your example. I've added caps='metadata=*', but I still have an
> error and get:
>
> send: 'GET /admin/metadata/user?format=json HTTP/1.1\r\nHost: objects.bcc.switch.ch\r\nAccept-Encoding: identity\r\nDate: Wed, 16 Oct 2013 08:09:57 GMT\r\nContent-Length: 0\r\nAuthorization: AWS VC***o=\r\nUser-Agent: Boto/2.12.0 Python/2.7.5 Darwin/12.5.0\r\n\r\n'
> reply: 'HTTP/1.1 405 Method Not Allowed\r\n'
>
> In which version of radosgw is the /admin/metadata REST endpoint available?
> I currently have 0.67.4.

We are using this on ceph-0.67.4. Do you have your gateways logging?

--
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
Re: [ceph-users] snapshots on CephFS
Dieter,

Creating snapshots using CephFS is quite simple... all you need to do is create a directory (mkdir) inside the hidden '.snap' directory. After that you can list (ls) and remove them (rm -r) just as you would any other directory:

    smiley@server1:/mnt/cephfs$ cd .snap
    smiley@server1:/mnt/cephfs/.snap$ ls
    snap1  snapshot-10-13-2013
    smiley@theneykov:/mnt/cephfs/.snap$ mkdir right_now
    smiley@theneykov:/mnt/1/.snap$ ls -l
    total 0
    drwxrwxrwx 1 root root 0 Oct 13 14:38 snap1
    drwxrwxrwx 1 root root 0 Oct 16 11:16 right_now
    drwxrwxrwx 1 root root 0 Oct 16 11:16 snapshot-10-13-2013

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smi...@npr.org | 202.513.3649

From: Kasper Dieter [dieter.kas...@ts.fujitsu.com]
Sent: Wednesday, October 16, 2013 11:01 AM
Subject: [ceph-users] snapshots on CephFS

> Can you please explain in more detail, or with example commands, how to
> create/list/remove snapshots in CephFS?
> [...]
Re: [ceph-users] poor read performance on rbd+LVM, LVM overload
Hi,

On Wed, 16 Oct 2013, Ugis wrote:
> I noticed very slow reads from an xfs mount that is on a ceph client
> (rbd + gpt partition + LVM PV + xfs on LE).
> [...]
> What could make so great a difference when LVM is used, and what/how
> should I tune? As write performance does not differ, DM extent lookup
> should not be lagging -- where is the trick?

My first guess is that LVM is shifting the content of the device such that it no longer aligns well with the RBD striping (by default, 4MB). The non-aligned reads/writes would need to touch two objects instead of one, and dd is generally doing these synchronously (i.e., lots of waiting).

I'm not sure what options LVM provides for aligning things to the underlying storage...

sage
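Along those lines, one thing worth checking (a sketch, not a verified fix; the device name is hypothetical) is where the PV's data area starts, and recreating the PV with explicit alignment if it isn't a multiple of the 4MB object size:

    # where does the first physical extent start? (pe_start, shown per PV)
    pvs -o +pe_start /dev/rbd1p1

    # recreate the PV so its data area aligns with the 4MB RBD object size
    # (destroys LVM metadata on the device -- only on an empty/backed-up PV!)
    pvcreate --dataalignment 4m /dev/rbd1p1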
Re: [ceph-users] snapshots on CephFS
On Wed, Oct 16, 2013 at 8:01 AM, Kasper Dieter <dieter.kas...@ts.fujitsu.com> wrote:
> Can you please explain in more detail, or with example commands, how to
> create/list/remove snapshots in CephFS?

As Shain described, you just do mkdir/ls/rmdir in the .snap folder.

> I assume they will be created on a directory level?

Snapshots cover the entire subtree starting with the folder you create them from. If a user puts one in their home directory, there will be a snapshot of all their document folders, source code folders, etc. as well.

> How will the CephFS snapshots cohere with the underlying pools?
> (e.g. using cephfs /mnt/cephfs/dir-1/dir2 set_layout -p 18)

CephFS snapshots store some metadata directly in the directory object (in the metadata pool), but the file data is stored using RADOS self-managed snapshots on the regular objects. If you specify that a file/folder goes in a different pool, the snapshots also live there as a matter of course.

Separately:
1) You will probably have a better time specifying layouts using the ceph.layout virtual xattrs, if your installation is new enough. (There's no new functionality there, but it's a lot friendlier and less fiddly than the cephfs tool is.)
2) Keep in mind that snapshots are noticeably less stable in use than the regular filesystem features. The ability to create new ones is turned off by default in the next branch (admins can enable them with a monitor command).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
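Point 1 above refers to the layout virtual xattrs; a sketch of what using them looks like, assuming a client new enough to expose them (the path matches Dieter's example, and the pool may be given by id or name):

    # inspect the current layout of a directory
    getfattr -n ceph.dir.layout /mnt/cephfs/dir-1/dir2

    # send new files in this directory to pool 18, as with the cephfs
    # set_layout example above
    setfattr -n ceph.dir.layout.pool -v 18 /mnt/cephfs/dir-1/dir2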
Re: [ceph-users] bit correctness and checksumming
Does Ceph log anywhere corrected (/caught) silent corruption? It would be interesting to know how much of a problem this is in a large-scale deployment. Something to gather in the league table mentioned at the London Ceph Day?

Just thinking out loud (please shout me down...) -- if the FS itself performs its own ECC, the ATA streaming command set might be of use to avoid performance degradation due to drive-level recovery at all.

On 2013-10-16 17:12, Sage Weil wrote:
> On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Can someone please summarise the current state of data integrity
>> protection in Ceph, assuming we have an XFS backend filesystem? ie.
>> don't rely on the protection offered by btrfs. I saw in the docs that
>> wire messages and journal writes are CRC'd, but nothing explicit about
>> the objects themselves.
>
> - Everything that passes over the wire is checksummed (crc32c). This is
>   mainly because the TCP checksum is so weak.
> - The journal entries have a crc.
> - During deep scrub, we read the objects and metadata, calculate a
>   crc32c, and compare across replicas. This detects missing objects,
>   bitrot, failing disks, or any other source of inconsistency.
> - Ceph does not calculate and store a per-object checksum. Doing so is
>   difficult because rados allows arbitrary overwrites of parts of an
>   object.
> - Ceph *does* have a new opportunistic checksum feature, which is
>   currently only enabled in QA. It calculates and stores checksums on
>   whatever block size you configure (e.g., 64k) if/when we
>   write/overwrite a complete block, and will verify any complete block
>   read against the stored crc, if one happens to be available. This can
>   help catch some but not all sources of corruption.
>
>> 1. Is an object checksum stored on the OSD somewhere? Is this in
>> user.ceph._? It wasn't obvious when looking at the code...
>
> No (except for the new/experimental opportunistic crc I mention above).
>
>> 2. When is the checksum verified? Surely it is checked during the deep
>> scrub, but what about during an object read?
>
> For non-btrfs, there is no crc to verify. For btrfs, the fs has its own
> crc and verifies it.
>
>> 2b. Can a user read corrupted data if the master replica has a bit flip
>> but this hasn't yet been found by a deep scrub?
>
> Yes.
>
>> 3. During deep scrub of an object with 2 replicas, suppose the checksum
>> is different for the two objects -- which object wins?
>
> In this case we normally choose the primary. The repair has to be
> explicitly triggered by the admin, however, and there are some options to
> control that choice.
>
>> 4. If the checksum is already stored per object in the OSD, is it
>> retrievable by librados? We have some applications which also need to
>> know the checksum of the data, and this would be handy if it was
>> already calculated by Ceph.
>
> It would! It may be that the way to get there is to build an API to
> expose the opportunistic checksums, and/or to extend that feature to
> maintain full checksums (by re-reading partially overwritten blocks on
> write). (Note, however, that even this wouldn't cover xattrs and omap
> content; really this is something that should be handled by the backend
> storage/file system.)
>
> sage
Re: [ceph-users] bit correctness and checksumming
At CERN, we have had cases in the past of silent corruptions. It is good to be able to identify the devices causing them and swap them out.

It's an old presentation, but the concepts are still relevant today:

http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

Tim

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ja...@peacon.co.uk
Sent: 16 October 2013 18:54
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bit correctness and checksumming

> Does Ceph log anywhere corrected (/caught) silent corruption? It would be
> interesting to know how much of a problem this is in a large-scale
> deployment.
> [...]
Re: [ceph-users] bit correctness and checksumming
Thank you Sage for the thorough answer.

It just occurred to me to also ask about the gateway. The docs explain that one can supply Content-MD5 during an object PUT (which I assume is verified by the RGW), but does a GET respond with the ETag md5? (Sorry, I don't have a gateway running at the moment to check for myself, and the answer is relevant to this discussion anyway.)

Cheers, Dan

Sage Weil <s...@inktank.com> wrote:
> On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Can someone please summarise the current state of data integrity
>> protection in Ceph, assuming we have an XFS backend filesystem?
> [...]
Re: [ceph-users] bit correctness and checksumming
It was long ago and Linux was very different. With respect to today, we found quite a few cases of bad RAID cards which had limited ECC checking on their memory. Stuck bits had serious impacts given our data transit volumes :-(

While the root causes we found in the past may be less likely today (as we move towards replicas and away from hardware RAID), keeping in place background scrubbing and a method to identify components which could potentially be causing corruption, via external probing and quality checks, is very useful.

Tim

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ja...@peacon.co.uk
Sent: 16 October 2013 20:06
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bit correctness and checksumming

Very interesting link. I don't suppose there is any data available separating 4K and 512-byte sectored drives?

On 2013-10-16 18:43, Tim Bell wrote:
> At CERN, we have had cases in the past of silent corruptions. It is good
> to be able to identify the devices causing them and swap them out.
>
> It's an old presentation, but the concepts are still relevant today:
> http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
> [...]
Re: [ceph-users] bit correctness and checksumming
On Wed, 16 Oct 2013, ja...@peacon.co.uk wrote:
> Does Ceph log anywhere corrected (/caught) silent corruption? It would be
> interesting to know how much of a problem this is in a large-scale
> deployment. Something to gather in the league table mentioned at the
> London Ceph Day?

It is logged, and causes the 'ceph health' check to complain. There are not currently any historical counts of how many inconsistencies have been found and subsequently repaired, though; this would be interesting to collect and report!

> Just thinking out loud (please shout me down...) -- if the FS itself
> performs its own ECC, the ATA streaming command set might be of use to
> avoid performance degradation due to drive-level recovery at all.

Maybe, I'm not familiar...

sage
Re: [ceph-users] bit correctness and checksumming
On Wed, Oct 16, 2013 at 6:12 PM, Sage Weil <s...@inktank.com> wrote:
>> 3. During deep scrub of an object with 2 replicas, suppose the checksum
>> is different for the two objects -- which object wins?
>
> In this case we normally choose the primary. The repair has to be
> explicitly triggered by the admin, however, and there are some options to
> control that choice.

Which options would those be? I only know about

    ceph pg repair <pg.id>

BTW, I read in a previous mail that...

> Repair does the equivalent of a deep-scrub to find problems. This mostly
> is reading object data/omap/xattr to create checksums and compares them
> across all copies. When a discrepancy is identified, an arbitrary copy
> which did not have I/O errors is selected and used to re-write the other
> replicas.

This seems like the right thing to do when inconsistencies are the result of I/O errors. But when they are caused by random bit flips, this sounds like an effective way to propagate corrupted data while making ceph health = HEALTH_OK.

Is that opportunistic checksum feature planned for emperor?

Cheers, Dan
[ceph-users] Ceph cluster access using s3 api with curl
Rookie question: What's the curl command / URL / steps to get an authentication token from the cluster without using the swift debug command first? Using the swift_key values should work, but I haven't found the right combination / URL. Here's what I've done:

1: Get user info from the ceph cluster:

    # radosgw-admin user info --uid rados
    2013-10-16 13:29:42.956578 7f166aeef780 0 WARNING: cannot read region map
    { "user_id": "rados",
      "display_name": "rados",
      "email": "n...@none.com",
      "suspended": 0,
      "max_buckets": 1000,
      "auid": 0,
      "subusers": [],
      "keys": [
            { "user": "rados",
              "access_key": "V92UJ5F24DF2CDGQINTK",
              "secret_key": "uzWaCMQnZ8uxyR3zte2Dthxbca\/H4qsm3p0QI29f"}],
      "swift_keys": [
            { "user": "rados:swift",
              "secret_key": "123"}],
      "caps": [],
      "op_mask": "read, write, delete",
      "default_placement": "",
      "placement_tags": []}

2: Jump through the (unnecessary) Swift debug hoop. Debug truncated the http command that holds the key:

    # swift --verbose --debug -V 1.0 -A http://10.113.193.189/auth -U rados:swift -K 123 list
    DEBUG:swiftclient:REQ: curl -i http://10.113.193.189/auth -X GET
    DEBUG:swiftclient:RESP STATUS: 204
    DEBUG:swiftclient:REQ: curl -i http://10.113.193.189/swift/v1?format=json -X GET -H "X-Auth-Token: AUTH_rgwtk0b007261646f733a73776966740ddca424fed74e69be4860524846912b0f99a7531ecda91ae47684ebd6b69e40f1dc6b45"
    DEBUG:swiftclient:RESP STATUS: 200
    DEBUG:swiftclient:RESP BODY: []

3: I should be able to pass user and password values from the user info command, but I haven't found the correct URL or path to use. This command (and variations: auth/v1.0, ...) fails. Is the directory structure / URL to get an authentication token documented somewhere?

    # curl -i http://10.113.193.189/auth -X GET -H 'X-Storage-User: rados:swift' -H 'X-Storage-Pass: 123'
    HTTP/1.1 403 Forbidden
    Date: Wed, 16 Oct 2013 20:33:31 GMT
    Server: Apache/2.2.22 (Ubuntu)
    Accept-Ranges: bytes
    Content-Length: 23
    Content-Type: application/json
    {"Code":"AccessDenied"}

Thanks,
Tim
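For reference, the standard Swift v1.0 handshake passes the credentials in the X-Auth-User and X-Auth-Key headers rather than X-Storage-User/X-Storage-Pass; a sketch against the host above (whether this is the exact form radosgw expects should be verified):

    curl -i http://10.113.193.189/auth -X GET \
         -H 'X-Auth-User: rados:swift' \
         -H 'X-Auth-Key: 123'
    # a 204 response should carry the token and endpoint in the
    # X-Auth-Token and X-Storage-Url response headers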
[ceph-users] Multiple OSDs per host strategy?
Hi,

I have 2 x 2TB disks in each of 3 servers, so a total of 6 disks. I have deployed a total of 6 OSDs, i.e.:

    host1 = osd.0 and osd.1
    host2 = osd.2 and osd.3
    host3 = osd.4 and osd.5

Now, since I will have a total of 3 replicas (original + 2 replicas), I want my replica placement to be such that I don't end up having 2 replicas on 1 host (a replica on osd.0 and osd.1 (both on host1) and a replica on osd.2). I want all 3 replicas spread across different hosts.

I know this is to be done via CRUSH maps, but I'm not sure whether it would be better to have 2 pools: 1 pool on osd.0/2/4 and another pool on osd.1/3/5. If possible, I would want only 1 pool, spread across all 6 OSDs, but with data placement such that I don't end up having 2 replicas on 1 host... not sure if this is possible at all.

Is that possible, or should I maybe go for RAID0 in each server (2 x 2TB = 4TB for osd.0), or maybe JBOD (1 volume, so 1 OSD per host)?

Any suggestions about best practice?

Regards,
--
Andrija Panić
Re: [ceph-users] Multiple OSDs per host strategy?
Andrija,

You can use a single pool and the proper CRUSH rule step

    step chooseleaf firstn 0 type host

to accomplish your goal.

http://ceph.com/docs/master/rados/operations/crush-map/

Cheers,
Mike Dawson

On 10/16/2013 5:16 PM, Andrija Panic wrote:
> I want all 3 replicas spread across different hosts...
> [...]
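For context, a sketch of what such a rule looks like in a decompiled CRUSH map (the bucket and rule names here are the stock defaults and may differ in a real map):

    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

With "chooseleaf firstn 0 type host", CRUSH picks as many distinct hosts as the pool's replica count and then one OSD leaf under each, so no two replicas land on the same host.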
Re: [ceph-users] Is there a way to query RBD usage
On 10/15/2013 08:56 PM, Blair Bethwaite wrote:
>> On 16/10/13 15:53, Wido den Hollander wrote:
>>> On 10/16/2013 03:15 AM, Blair Bethwaite wrote:
>>>> I.e., can we see what the actual allocated/touched size of an RBD is
>>>> in relation to its provisioned size?
>>>
>>> No, not an easy way. The only way would be to probe which RADOS objects
>>> exist, but that's a heavy operation you don't want to do with large
>>> images or with a large number of RBD images.
>>
>> So maybe a 'df' arg for rbd would be a nice addition to blueprints?
>
> Yes, I think so. It does seem a little conflicting to promote Ceph as
> doing thin-provisioned volumes, but then not actually be able to
> interrogate their real usage against the provisioned size.
>
> As a cloud admin using Ceph as my block-storage layer, I really want to
> be able to look at several metrics in relation to volumes and tenants:
> total GB quota, GB provisioned (i.e., total size of volumes & snaps),
> GB allocated. When users come crying for more quota I need to know
> whether they're making efficient use of what they've got.
>
> This actually leads into more of a conversation around the quota model
> of dishing out storage. IMHO it would be much more preferable to do
> things in a more EBS-oriented fashion, where we're able to see actual
> usage in the backend. Especially true with snapshots -- users are
> typically dismayed that their snapshots count towards their quota for
> the full size of the originally provisioned volume (despite the fact
> the snapshot could usually be truncated/shrunk by a factor of two or
> more).

You can see the space written in the image, and between snapshots (not including fs overhead on the osds), since cuttlefish:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/3684

It'd be nice to wrap that in a df or similar command though.
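The linked post computes usage from rbd diff output (offset, length, type per extent); a sketch along those lines, with image and snapshot names made up:

    # sum the extents that have actually been written in the image
    rbd diff rbd/myimage | awk '{ sum += $2 } END { print sum/1024/1024 " MB" }'

    # space written between two snapshots
    rbd diff --from-snap snap1 rbd/myimage@snap2 | \
        awk '{ sum += $2 } END { print sum/1024/1024 " MB" }'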
[ceph-users] changing from default journals to external journals
I configured my cluster using the default journal location for my OSDs. Can I migrate the default journals to explicit separate devices without a complete cluster teardown and reinstallation? How?

Thanks,
Tim
Re: [ceph-users] changing from default journals to external journals
On Wed, 16 Oct 2013, Snider, Tim wrote:
> I configured my cluster using the default journal location for my OSDs.
> Can I migrate the default journals to explicit separate devices without a
> complete cluster teardown and reinstallation? How?

- stop a ceph-osd daemon, then
- ceph-osd --flush-journal -i NNN
- set/adjust the journal symlink at /var/lib/ceph/osd/ceph-NNN/journal to point wherever you want
- ceph-osd --mkjournal -i NNN
- start ceph-osd

This won't set up the udev magic on the journal device, but that doesn't really matter if you're not hotplugging devices.

sage
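Putting those steps together for a single OSD, a sketch (the OSD id, init style, and journal partition are hypothetical; double-check before running):

    service ceph stop osd.12                      # stop the daemon
    ceph-osd --flush-journal -i 12                # flush pending writes into the store
    rm /var/lib/ceph/osd/ceph-12/journal          # drop the old journal file/link
    ln -s /dev/sdg1 /var/lib/ceph/osd/ceph-12/journal   # point at the new device
    ceph-osd --mkjournal -i 12                    # initialize the new journal
    service ceph start osd.12                     # bring the OSD back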
Re: [ceph-users] Multiple OSDs per host strategy?
Well, nice one :)

"step chooseleaf firstn 0 type host" is part of the default crush map (3 hosts, 2 OSDs per host). Does it mean: write 3 replicas (in my case) to 3 hosts, and randomly select an OSD from each host?

I already read all the docs... and am still not sure how to proceed...

On 16 October 2013 23:27, Mike Dawson <mike.daw...@cloudapt.com> wrote:
> Andrija,
>
> You can use a single pool and the proper CRUSH rule step
>
>     step chooseleaf firstn 0 type host
>
> to accomplish your goal.
>
> http://ceph.com/docs/master/rados/operations/crush-map/
> [...]

--
Andrija Panić
--
http://admintweets.com
Re: [ceph-users] kvm live migrate with ceph
Hello Michael, Thanks for the reply. It seems like ceph isn't actually mounting the rbd to the vm host, which is where I think I was getting hung up (I had previously been attempting to mount rbds directly to multiple hosts and, as you can imagine, having issues). Could you possibly expound on why using a clustered filesystem approach is wrong (or conversely why using RBDs is the correct approach)? As for format 2 rbd images, it looks like they provide exactly the Copy-On-Write functionality that I am looking for. Any caveats or things I should look out for when going from format 1 to format 2 images? (I think I read something about not being able to use both at the same time...) Thanks Again, Jon A On Mon, Oct 14, 2013 at 4:42 PM, Michael Lowe j.michael.l...@gmail.com wrote: I live migrate all the time using the rbd driver in qemu, no problems. Qemu will issue a flush as part of the migration so everything is consistent. It's the right way to use ceph to back vm's. I would strongly recommend against a network file system approach. You may want to look into format 2 rbd images, the cloning and writable snapshots may be what you are looking for. Sent from my iPad On Oct 14, 2013, at 5:37 AM, Jon three1...@gmail.com wrote: Hello, I would like to live migrate a VM between two hypervisors. Is it possible to do this with an rbd disk or should the vm disks be created as qcow images on a CephFS/NFS share (is it possible to do clvm over rbds? OR GlusterFS over rbds?) and point kvm at the network directory. As I understand it, rbds aren't cluster aware so you can't mount an rbd on multiple hosts at once, but maybe libvirt has a way to handle the transfer...? I like the idea of master or golden images where guests write any changes to a new image, I don't think rbds are able to handle copy-on-write in the same way kvm does so maybe a clustered filesystem approach is the ideal way to go. Thanks for your input. I think I'm just missing some piece... I just don't grok... Best Regards, Jon A ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CloudStack + KVM(Ubuntu 12.04, Libvirt 1.0.2) + Ceph [Seeking Help]
Hi, I have gotten so close to having Ceph work in my cloud but I have reached a roadblock. Any help would be greatly appreciated. I receive the following error when trying to get KVM to run a VM with an RBD volume: Libvirtd.log: 2013-10-16 22:05:15.516+0000: 9814: error : qemuProcessReadLogOutput:1477 : internal error Process exited while reading console log output: char device redirected to /dev/pts/3 kvm: -drive file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789,if=none,id=drive-ide0-0-1: error connecting kvm: -drive file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789,if=none,id=drive-ide0-0-1: could not open disk image rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789: Invalid argument Ceph Pool showing test volume exists: root@ubuntu-test-KVM-RBD:/opt# rbd -p libvirt-pool ls new-libvirt-image Ceph Auth: client.libvirt key: AQBx+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA== caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool KVM Drive Support: root@ubuntu-test-KVM-RBD:/opt# kvm --drive format=? Supported formats: vvfat vpc vmdk vdi sheepdog rbd raw host_cdrom host_floppy host_device file qed qcow2 qcow parallels nbd dmg tftp ftps ftp https http cow cloop bochs blkverify blkdebug Thank you if anyone can help Kelcey Damage | Infrastructure Systems Architect Strategy | Automation | Cloud Computing | Technology Development Backbone Technology, Inc 604-331-1152 ext. 114 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Very unbalanced osd data placement with differing sized devices
I stumbled across this today: 4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a play setup). - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal) - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal) I do a standard installation via ceph-deploy (1.2.7) of ceph (0.67.4) on each one [1]. The topology looks like:
$ ceph osd tree
# id    weight      type name        up/down  reweight
-1      0.01999     root default
-2      0           host ceph1
0       0               osd.0        up       1
-3      0           host ceph2
1       0               osd.1        up       1
-4      0.009995    host ceph3
2       0.009995        osd.2        up       1
-5      0.009995    host ceph4
3       0.009995        osd.3        up       1
So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd.2 and osd.3 (on ceph3,4) have weight 0.009995 - this suggests that data will flee osd.0,1 and live only on osd.2,3. Sure enough, putting in a few objects via rados put results in:
ceph1 $ df -m
Filesystem  1M-blocks  Used  Available  Use%  Mounted on
/dev/vda1        5038  2508       2275   53%  /
udev              994     1        994    1%  /dev
tmpfs             401     1        401    1%  /run
none                5     0          5    0%  /run/lock
none             1002     0       1002    0%  /run/shm
/dev/vdb1        5109    40       5070    1%  /var/lib/ceph/osd/ceph-0
(similarly for ceph2), whereas:
ceph3 $ df -m
Filesystem  1M-blocks  Used  Available  Use%  Mounted on
/dev/vda1        5038  2405       2377   51%  /
udev              994     1        994    1%  /dev
tmpfs             401     1        401    1%  /run
none                5     0          5    0%  /run/lock
none             1002     0       1002    0%  /run/shm
/dev/vdb1       10229  1315       8915   13%  /var/lib/ceph/osd/ceph-2
(similarly for ceph4). Obviously I can fix this by reweighting the first two osds to something like 0.005, but I'm wondering if there is something I've missed - clearly some kind of auto weighting has been performed on the basis of the size difference in the data volumes, but it looks to be skewing data far too much to the bigger ones. Is there perhaps a bug in the smarts for this? Or is it just because I'm using small volumes (5G => 0 weight)? Cheers Mark [1] i.e: $ ceph-deploy new ceph1 $ ceph-deploy mon create ceph1 $ ceph-deploy gatherkeys ceph1 $ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc ... $ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very unbalanced osd data placement with differing sized devices
I may be wrong, but I always thought that a weight of 0 means don't put anything there. All weights > 0 will be looked at proportionally. See http://ceph.com/docs/master/rados/operations/crush-map/ which recommends higher weights anyway: Weighting Bucket Items Ceph expresses bucket weights as double integers, which allows for fine weighting. A weight is the relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1TB storage device. In such a scenario, a weight of 0.5 would represent approximately 500GB, and a weight of 3.00 would represent approximately 3TB. Higher level buckets have a weight that is the sum total of the leaf items aggregated by the bucket. A bucket item weight is one dimensional, but you may also calculate your item weights to reflect the performance of the storage drive. For example, if you have many 1TB drives where some have relatively low data transfer rate and the others have a relatively high data transfer rate, you may weight them differently, even though they have the same capacity (e.g., a weight of 0.80 for the first set of drives with lower total throughput, and 1.20 for the second set of drives with higher total throughput). David Zafman Senior Developer http://www.inktank.com On Oct 16, 2013, at 8:15 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: I stumbled across this today: 4 osds on 4 hosts (names ceph1 - ceph4). [...] Is there perhaps a bug in the smarts for this? Or is it just because I'm using small volumes (5G => 0 weight)? Cheers Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very unbalanced osd data placement with differing sized devices
On Thu, 17 Oct 2013, Mark Kirkwood wrote: I stumbled across this today: 4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a play setup). - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal) - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal) [...] Is there perhaps a bug in the smarts for this? Or is it just because I'm using small volumes (5G => 0 weight)? Yeah, I think this is just rounding error. By default a weight of 1.0 == 1 TB, so these are just very small numbers. Internally, we're storing as a fixed-point 32-bit value where 1.0 == 0x10000, and 5GB is just too small for those units. You can disable this autoweighting with osd crush update on start = false in ceph.conf. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fedora package dependencies
Performing yum updates on Fedora 19 now breaks qemu. There is a different set of package names and contents between the default Fedora ceph packages and the ceph.com packages. There is no ceph-libs package in the ceph.com repository and qemu now enforces the dependency on ceph-libs. Yum update now produces this error:
Error: Package: 2:qemu-common-1.4.2-12.fc19.x86_64 (updates)
       Requires: ceph-libs >= 0.61
       Available: ceph-libs-0.56.4-1.fc19.i686 (fedora)
           ceph-libs = 0.56.4-1.fc19
       Available: ceph-libs-0.67.3-2.fc19.i686 (updates)
           ceph-libs = 0.67.3-2.fc19
The ceph-libs dependency enforcement is new as of this qemu update. Should not the ceph.com Fedora packages mirror the default Fedora packages in name and contents? Regards Darryl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com