Re: [ceph-users] Ceph for home use
On 16 March 2016 at 04:34, Edward Wingate wrote:
> Given my resources, I'd still only run a single node with 3 OSDs and
> replica count of 2. I'd then have a VM mount a Ceph RBD to serve
> Samba/NFS shares.

Fun and instructive to play with Ceph that way, but not really a good use of it -- Ceph's main thing is to provide replication across nodes for redundancy and failover, which you aren't going to get with one node :)

I really recommend setting up your NAS using ZFS (under BSD or Linux); it's an excellent use case for your setup. You can configure mirrored disks for redundancy and extend the storage indefinitely by adding extra disks. Plus you get all the ZFS goodies: excellent command-line tools, snapshots, checksums, SSD caches and more. You can share ZFS datasets directly via SMB, NFS or iSCSI, or via your VM. And you will get much better performance.

--
Lindsay

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
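A minimal sketch of Lindsay's suggestion, using standard ZFS-on-Linux commands (the pool name and device names here are made up for illustration):

```shell
# Hypothetical devices: create a mirrored pool for redundancy.
zpool create tank mirror /dev/sda /dev/sdb

# Extend the storage later by adding another mirrored pair.
zpool add tank mirror /dev/sdc /dev/sdd

# The "ZFS goodies": datasets, snapshots, and built-in sharing.
zfs create tank/media
zfs set sharenfs=on tank/media     # export over NFS
zfs set sharesmb=on tank/media     # export over SMB (needs Samba installed)
zfs snapshot tank/media@nightly

# Optional SSD read cache (L2ARC) on a spare SSD partition.
zpool add tank cache /dev/sde1
```

These commands need a real pool and root privileges, so treat them as a starting point rather than a copy-paste recipe.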
Re: [ceph-users] Local SSD cache for ceph on each compute node.
I am using OpenStack, so I need this to be fully automated and to apply to all my VMs. If I could do what you mention at the hypervisor level, that would be much easier. The options you mention are, I guess, for very specific use cases and need to be configured on a per-VM basis, whilst I am looking for a general "Ceph on steroids" approach for all my VMs without any maintenance.

Thanks again :)

-Daniel

-----Original Message-----
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: 16 March 2016 01:42
To: Daniel Niasoff
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

[snip]
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Indeed, well understood. As a shorter-term workaround, if you have control over the VMs, you could always just slice out an LVM volume from local SSD/NVMe and pass it through to the guest. Within the guest, use dm-cache (or similar) to add a cache front-end to your RBD volume.

Others have also reported improvements by using the QEMU x-data-plane option and RAIDing several RBD images together within the VM.

--
Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: "Jason Dillaman"
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 9:32:50 PM
> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.
>
> [snip]
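Inside the guest, Jason's LVM + dm-cache workaround could look roughly like this (a hedged sketch: the device names /dev/vdb and /dev/vdc, the VG/LV names, and the sizing are all illustrative):

```shell
# /dev/vdb: the RBD-backed disk; /dev/vdc: the local SSD/NVMe slice
# passed through from the hypervisor (both names illustrative).
vgcreate vg_cached /dev/vdb /dev/vdc

# Origin LV on the RBD device, cache pool on the SSD slice.
lvcreate -n data -l 100%PVS vg_cached /dev/vdb
lvcreate --type cache-pool -n ssdcache -l 100%PVS vg_cached /dev/vdc

# Attach the cache pool; writeback mode also absorbs Ceph writes, at the
# cost that unflushed data lives only on the local SSD until written back.
lvconvert --type cache --cachepool vg_cached/ssdcache \
          --cachemode writeback vg_cached/data
```

Note the trade-off writeback implies: if the hypervisor's SSD dies before dirty blocks are flushed, those writes never reach the Ceph cluster.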
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Ah, actually, I think there will be duplicates only around half the time -- either the old link or the new link could be orphaned depending on which xfs decides to list first. Only if the old link is orphaned will it match the name of the object once it's recreated.

I should be able to find time to put together a branch in the next week or two if you want to wait. It's still probably worth trying removing that object in 70.459.
-Sam

On Tue, Mar 15, 2016 at 6:03 PM, Samuel Just wrote:
> [snip]
Re: [ceph-users] cephx capabilities to forbid rbd creation
Perhaps something like this?

    mon 'allow r'
    osd 'allow class-read object_prefix rbd_children, allow r class-read object_prefix rbd_directory, allow rwx object_prefix rbd_header., allow rwx object_prefix rbd_data., allow rwx object_prefix rbd_id.'

As Greg mentioned, this won't stop you from just creating random objects in the pool that match the substrings listed in the cap, but it will prevent you from creating new images.

--
Jason Dillaman
Red Hat Ceph Storage Engineering
dilla...@redhat.com
http://www.redhat.com

----- Original Message -----
> From: "Gregory Farnum"
> To: "Loris Cuoghi"
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 5:54:43 PM
> Subject: Re: [ceph-users] cephx capabilities to forbid rbd creation
>
> On Tue, Mar 15, 2016 at 2:44 PM, Loris Cuoghi wrote:
> > So, one key per RBD.
> > Or, dynamically enable/disable access to each RBD in each hypervisor's key.
> > Uhm, something doesn't scale here. :P
> > (I wonder if there's any limit to a key's capabilities string...)
> >
> > But, as it appears, I share your view that it is the only available
> > approach right now.
> >
> > Anyone would like to prove us wrong? :)
>
> The OSD capabilities aren't fine-grained enough to prevent you from
> creating objects, except by specifying that you only get access to
> certain prefixes or namespaces. So, either you lock down a key to a
> specific set of RBD volumes, or you let it create RBD volumes
> arbitrarily.
> ...unless, maybe, you can keep it from writing to the RBD index
> objects? But that doesn't prevent the user from scribbling across your
> cluster, just registering it. ;)
>
> That said, it is *possible* (although probably *unwise*) to give
> hypervisor keys access to all of the RBD volumes they host. cephx keys
> can have an arbitrary number of "allow" clauses, although I imagine if
> you get them large enough it could cause trouble (or maybe not?) in
> terms of memory usage or just plain old permission parsing time.
> And you'd likely run into issues with newly-created or newly-migrated
> instances ending up on a hypervisor which has an old version of its
> keyring cached. I'm not certain if there's a way to refresh those
> on-demand from the monitor.
> -Greg
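Applying Jason's capability string with the standard cephx tooling would look something like the following (the client name is made up; the caps are verbatim from his message):

```shell
# Create (or fetch) a restricted key that allows RBD I/O on existing
# images but blocks registering new ones. "client.qemu-host1" is an
# illustrative name, not from the thread.
ceph auth get-or-create client.qemu-host1 \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow r class-read object_prefix rbd_directory, allow rwx object_prefix rbd_header., allow rwx object_prefix rbd_data., allow rwx object_prefix rbd_id.'
```

This needs a running cluster and an admin keyring, so it is a sketch of the mechanism rather than something testable in isolation.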
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Thanks.

Reassuring, but I could do with something today :)

-----Original Message-----
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: 16 March 2016 01:25
To: Daniel Niasoff
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

[snip]
Re: [ceph-users] Local SSD cache for ceph on each compute node.
The good news is such a feature is in the early stage of design [1]. Hopefully this is a feature that will land in the Kraken release timeframe.

[1] http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension

--
Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 8:47:04 PM
> Subject: [ceph-users] Local SSD cache for ceph on each compute node.
>
> [snip]
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
The bug is entirely independent of hardware issues -- entirely a ceph bug.

xfs doesn't let us specify an ordering when reading a directory, so we have to keep directory sizes small. That means that when one of those pg collection subfolders has 320 files in it, we split it into up to 16 smaller directories. Overwriting or removing an ec object requires us to rename the old version out of the way in case we need to roll back (that's the generation number I mentioned above). For crash safety, this involves first creating a link to the new name, then removing the old one. Both the old and new link will be in the same subdirectory. If creating the new link pushes the directory to 320 files then we do a split while both links are present. If the file in question is using the special long filename handling, then a bug in the resulting link juggling causes us to orphan the old version of the file. Your cluster seems to have an unusual number of objects with very long names, which is why it is so visible on your cluster.

There are critical pool sizes where the PGs will all be close to one of those limits. It's possible you are not close to one of those limits. It's also possible you are nearing one now. In any case, the remapping gave the orphaned files an opportunity to cause trouble, but they don't appear due to remapping.
-Sam

On Tue, Mar 15, 2016 at 5:41 PM, Jeffrey McDonald wrote:
> [snip]
[ceph-users] Local SSD cache for ceph on each compute node.
Hi,

Let me start. Ceph is amazing, no it really is!

But a hypervisor reading and writing all its data off the network will add some latency to reads and writes. So the hypervisor could do with a local cache, possibly SSD or even NVMe. Spent a while looking into this, but it seems really strange that few people see the value of this.

Basically the cache would be used in two ways:

a) cache hot data
b) writeback cache for ceph writes

There is the RBD cache, but that isn't disk based, and on a hypervisor memory is at a premium.

A simple solution would be to put a journal on each compute node and get each hypervisor to use its own journal. Would this work? Something like this:
http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png

Can this be achieved? A better explanation of what I am trying to achieve is here:
http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/

This talk, if it gets voted in, looks interesting:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827

Can anyone help?

Thanks

Daniel
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
One more question... did we hit the bug because we had hardware issues during the remapping, or would it have happened regardless of the hardware issues? E.g. I'm not planning to add any additional hardware soon, but would the bug pop up again on an (unpatched) system not subject to any remapping?

thanks,
jeff

On Tue, Mar 15, 2016 at 7:27 PM, Samuel Just wrote:
> [snip]
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
[back on list] ceph-objectstore-tool has a whole bunch of machinery for modifying an offline objectstore. It would be the easiest place to put it -- you could add a ceph-objectstore-tool --op filestore-repair-orphan-links [--dry-run] ... command which would mount the filestore in a special mode and iterate over all collections and repair them. If you want to go that route, we'd be happy to help you get it written. Once it fixes your cluster, we'd then be able to merge and backport it in case anyone else hits it. You'd probably be fine doing it while the OSD is live...but as a rule I usually prefer to do my osd surgery offline. Journal doesn't matter here, the orphaned files are basically invisible to the filestore (except when doing a collection scan for scrub) since they are in the wrong directory. I don't think the orphans are necessarily going to be 0 size. There might be quirk of how radosgw creates these objects that always causes them to be created 0 size than then overwritten with a writefull -- if that's true it might be the case that you would only see 0 size ones. -Sam On Tue, Mar 15, 2016 at 4:02 PM, Jeffrey McDonaldwrote: > Thanks, I can try to write a tool to do this. Does ceph-objectstore-tool > provide a framework? > > Can I safely delete the files while the OSD is alive or should I take it > offline? Any concerns about the journal? > > Are there any other properties of the orphans, e.g. will the orphans always > be size 0? > > Thanks! > Jeff > > On Tue, Mar 15, 2016 at 5:35 PM, Samuel Just wrote: >> >> Ok, a branch merged to master which should fix this >> (https://github.com/ceph/ceph/pull/8136). It'll be backported in due >> course. The problem is that that patch won't clean orphaned files >> that already exist. >> >> Let me explain a bit about what the orphaned files look like. The >> problem is files with object names that result in escaped filenames >> longer than the max filename ceph will create (~250 iirc). 
Normally, >> the name of the file is an escaped and sanitized version of the object >> name: >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/default.325674.107\u\ushadow\u.KePEE8heghHVnlb1\uEIupG0I5eROwRn\u77__head_C1DCD459__46__0 >> >> corresponds to an object like >> >> >> c1dcd459/default.325674.107__shadow_.KePEE8heghHVnlb1_EIupG0I5eROwRn_77/head//70 >> >> the DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ path is derived from the hash >> starting with the last value: cd459 -> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ >> >> It ends up in DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ because that's the >> longest path that exists (DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D does not >> exist -- if DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ ever gets too full, >> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D would be created and this file >> would be moved into it). >> >> When the escaped filename gets too long, we truncate the filename, and >> then append a hash and a number yielding a name like: >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long >> >> The _long at the end is always present with files like this. >> fa202ec9b4b3b217275a is the hash of the filename. The 0 indicates >> that it's the 0th file with this prefix and this hash -- if there are >> hash collisions with the same prefix, you'll see _1_ and _2_ and so on >> to distinguish them (very very unlikely). 
When the filename has been >> truncated as with this one, you will find the full file name in the >> attrs (attr user.cephosd.lfn3): >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: >> user.cephos.lfn3: >> >> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0 >> >> Let's look at one of the orphaned files (the one with the same >> file-name as the previous one, actually): >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: >> user.cephos.lfn3: >> >>
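If the repair op suggested above ever lands, an offline invocation might look like the following. This is purely hypothetical: the --op name comes from Sam's suggestion in this thread and does not exist; --data-path and --journal-path are real ceph-objectstore-tool flags, but the OSD id and paths are examples.

```shell
# Hypothetical: --op filestore-repair-orphan-links is only proposed in this
# thread, not implemented. Run with the OSD stopped, as Sam advises.
systemctl stop ceph-osd@307
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-307 \
    --journal-path /var/lib/ceph/osd/ceph-307/journal \
    --op filestore-repair-orphan-links --dry-run
systemctl start ceph-osd@307
```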
[ceph-users] mon create-initial failed after installation (ceph-deploy: 1.5.31 / ceph: 10.0.2)
Hello, I've tried to install ceph using ceph-deploy as usual. [ceph@octopus conf]$ ceph-deploy install --mon --mds --testing octopus *install* completed without any surprises. But *mon create-initial* failed: ### Take1 ### Log [ceph@octopus conf]$ ceph-deploy mon create-initial [ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.31): /usr/bin/ceph-deploy mon create-initial [ceph_deploy.cli][INFO ] ceph-deploy options: [ceph_deploy.cli][INFO ] username : None [ceph_deploy.cli][INFO ] verbose : False [ceph_deploy.cli][INFO ] overwrite_conf: False [ceph_deploy.cli][INFO ] subcommand: create-initial [ceph_deploy.cli][INFO ] quiet : False [ceph_deploy.cli][INFO ] cd_conf : [ceph_deploy.cli][INFO ] cluster : ceph [ceph_deploy.cli][INFO ] func : [ceph_deploy.cli][INFO ] ceph_conf : None [ceph_deploy.cli][INFO ] default_release : False [ceph_deploy.cli][INFO ] keyrings : None [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts octopus [ceph_deploy.mon][DEBUG ] detecting platform for host octopus ... 
[octopus][DEBUG ] connection detected need for sudo [octopus][DEBUG ] connected to host: octopus [octopus][DEBUG ] detect platform information from remote host [octopus][DEBUG ] detect machine type [octopus][DEBUG ] find the location of an executable [ceph_deploy.mon][INFO ] distro info: CentOS Linux 7.2.1511 Core [octopus][DEBUG ] determining if provided host has same hostname in remote [octopus][DEBUG ] get remote short hostname [octopus][DEBUG ] deploying mon to octopus [octopus][DEBUG ] get remote short hostname [octopus][DEBUG ] remote hostname: octopus [octopus][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf [octopus][DEBUG ] create the mon path if it does not exist [octopus][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-octopus/done [octopus][DEBUG ] done path does not exist: /var/lib/ceph/mon/ceph-octopus/done [octopus][INFO ] creating keyring file: /var/lib/ceph/tmp/ceph-octopus.mon.keyring [octopus][DEBUG ] create the monitor keyring file [octopus][INFO ] Running command: sudo ceph-mon --cluster ceph --mkfs -i octopus --keyring /var/lib/ceph/tmp/ceph-octopus.mon.keyring --setuser 1000 --setgroup 1000 [octopus][DEBUG ] ceph-mon: mon.noname-a 172.16.0.2:6789/0 is local, renaming to mon.octopus [octopus][DEBUG ] ceph-mon: set fsid to 53dab59e-3e16-4fdf-a029-b92c32aabde8 [octopus][DEBUG ] ceph-mon: created monfs at /var/lib/ceph/mon/ceph-octopus for mon.octopus [octopus][INFO ] unlinking keyring file /var/lib/ceph/tmp/ceph-octopus.mon.keyring [octopus][DEBUG ] create a done file to avoid re-doing the mon deployment [octopus][DEBUG ] create the init path if it does not exist [octopus][INFO ] Running command: sudo systemctl enable ceph.target [octopus][WARNIN] Failed to execute operation: Access denied [octopus][ERROR ] RuntimeError: command returned non-zero exit status: 1 [ceph_deploy.mon][ERROR ] Failed to execute command: systemctl enable ceph.target [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors Log Interestingly 
after rebooting the host, it's completed -; ### Take2 ### Log [ceph@octopus conf]$ ceph-deploy mon create-initial [ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.31): /usr/bin/ceph-deploy mon create-initial [ceph_deploy.cli][INFO ] ceph-deploy options: [ceph_deploy.cli][INFO ] username : None [ceph_deploy.cli][INFO ] verbose : False [ceph_deploy.cli][INFO ] overwrite_conf: False [ceph_deploy.cli][INFO ] subcommand: create-initial [ceph_deploy.cli][INFO ] quiet : False [ceph_deploy.cli][INFO ] cd_conf : [ceph_deploy.cli][INFO ] cluster : ceph [ceph_deploy.cli][INFO ] func : [ceph_deploy.cli][INFO ] ceph_conf : None [ceph_deploy.cli][INFO ] default_release : False [ceph_deploy.cli][INFO ] keyrings : None [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts octopus [ceph_deploy.mon][DEBUG ] detecting platform for host octopus ... [octopus][DEBUG ] connection detected need for sudo [octopus][DEBUG ] connected to host: octopus [octopus][DEBUG ] detect platform information from remote host [octopus][DEBUG ] detect machine type [octopus][DEBUG ] find the location of an executable [ceph_deploy.mon][INFO ] distro info: CentOS Linux 7.2.1511 Core [octopus][DEBUG ] determining if provided host has same hostname in remote [octopus][DEBUG ] get remote short
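For what it's worth, "Failed to execute operation: Access denied" from systemctl right after a fresh package install is often stale systemd manager state rather than anything ceph-specific; re-executing systemd is a commonly suggested alternative to the reboot that worked here. This is a hedged suggestion, not a confirmed diagnosis from the thread:

```shell
# Assumption: stale systemd state after package install on CentOS 7.2;
# re-exec the manager and retry instead of rebooting the whole host.
sudo systemctl daemon-reexec
sudo systemctl enable ceph.target
```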
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Ok, a branch merged to master which should fix this (https://github.com/ceph/ceph/pull/8136). It'll be backported in due course. The problem is that that patch won't clean orphaned files that already exist. Let me explain a bit about what the orphaned files look like. The problem is files with object names that result in escaped filenames longer than the max filename ceph will create (~250 iirc). Normally, the name of the file is an escaped and sanitized version of the object name: ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/default.325674.107\u\ushadow\u.KePEE8heghHVnlb1\uEIupG0I5eROwRn\u77__head_C1DCD459__46__0 corresponds to an object like c1dcd459/default.325674.107__shadow_.KePEE8heghHVnlb1_EIupG0I5eROwRn_77/head//70 the DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ path is derived from the hash starting with the last value: cd459 -> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ It ends up in DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ because that's the longest path that exists (DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D does not exist -- if DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ ever gets too full, DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D would be created and this file would be moved into it). When the escaped filename gets too long, we truncate the filename, and then append a hash and a number yielding a name like: ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long The _long at the end is always present with files like this. fa202ec9b4b3b217275a is the hash of the filename. The 0 indicates that it's the 0th file with this prefix and this hash -- if there are hash collisions with the same prefix, you'll see _1_ and _2_ and so on to distinguish them (very very unlikely). 
When the filename has been truncated as with this one, you will find the full file name in the attrs (attr user.cephos.lfn3): ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: user.cephos.lfn3: default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0 Let's look at one of the orphaned files (the one with the same file-name as the previous one, actually): ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: user.cephos.lfn3: default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0 This one has the same filename as the previous object, but is an orphan. What makes it an orphan is that it has hash 79CED459, but is in ./DIR_9/DIR_5/DIR_4/DIR_D even though ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E exists (object files are always at the farthest directory from the root matching their hash). All of the orphans will be long-file-name objects (but most long-file-name objects are fine and are neither orphans nor have duplicates -- it's a fairly low-occurrence bug). 
In your case, I think *all* of the orphans will probably happen to have files with duplicate names in the correct directory -- though some might not, if the object had actually been deleted since the bug happened. When there are duplicates, the full object names will either be the same or differ by the generation number at the end (_0 vs 3189d_0, as in this case). Once the orphaned files are cleaned up, your cluster should be back to normal. If you want to wait, someone might get time to build a patch for ceph-objectstore-tool to automate this. You can try removing the orphan we identified in pg 70.459 and re-scrubbing to confirm that that fixes the pg. -Sam On Wed, Mar 9, 2016 at 6:58 AM, Jeffrey McDonald wrote: > Hi, I went back to the mon logs to see if I could elicit any additional > information about this PG. > Prior to 1/27/16, the deep-scrub on this OSD passes (then I see obsolete > rollback objects found): > > ceph.log.4.gz:2016-01-20 09:43:36.195640 osd.307 10.31.0.67:6848/127170 538 > : cluster [INF] 70.459 deep-scrub ok > ceph.log.4.gz:2016-01-27 09:51:49.952459 osd.307 10.31.0.67:6848/127170 583 > : cluster [INF] 70.459 deep-scrub starts > ceph.log.4.gz:2016-01-27
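Sam's placement rule above -- descend DIR_&lt;digit&gt; directories, taking the hash's hex digits from the last one, only as deep as they exist -- can be sketched as a small shell helper for spotting orphan candidates. The helper name is invented here; this only illustrates the rule and is not a vetted repair tool:

```shell
#!/bin/sh
# Sketch of the placement rule described above -- NOT a ceph tool; the
# helper name is invented for illustration. Given a collection root and
# an object hash (e.g. 79CED459), walk the hash's hex digits from the
# last one, descending only while the matching DIR_<digit> exists.
deepest_dir_for_hash() {
    path="$1"; s="$2"
    while [ -n "$s" ]; do
        n="${s#"${s%?}"}"     # last remaining character of the hash
        s="${s%?}"
        [ -d "$path/DIR_$n" ] || break
        path="$path/DIR_$n"
    done
    printf '%s\n' "$path"
}

# A *_long file is then an orphan candidate when its actual directory is
# shallower than the deepest existing directory for its hash, e.g.:
#   [ "$(dirname "$f")" != "$(deepest_dir_for_hash "$root" "$hash")" ]
```

On the thread's example, a file with hash 79CED459 sitting in .../DIR_9/DIR_5/DIR_4/DIR_D while .../DIR_9/DIR_5/DIR_4/DIR_D/DIR_E exists would be flagged, matching Sam's description of the orphan.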
Re: [ceph-users] cephx capabilities to forbid rbd creation
On Tue, Mar 15, 2016 at 2:44 PM, Loris Cuoghi wrote: > So, one key per RBD. > Or, dynamically enable/disable access to each RBD in each hypervisor's key. > Uhm, something doesn't scale here. :P > (I wonder if there's any limit to a key's capabilities string...) > > But, as it appears, I share your view that it is the only available > approach right now. > > Anyone would like to prove us wrong? :) The OSD capabilities aren't fine-grained enough to prevent you from creating objects, except by specifying that you only get access to certain prefixes or namespaces. So, either you lock down a key to a specific set of RBD volumes, or you let it create RBD volumes arbitrarily. ...unless, maybe, you can keep it from writing to the RBD index objects? But that doesn't prevent the user from scribbling across your cluster, just registering it. ;) That said, it is *possible* (although probably *unwise*) to give hypervisor keys access to all of the RBD volumes they host. cephx keys can have an arbitrary number of "allow" clauses, although I imagine if you get them large enough it could cause trouble (or maybe not?) in terms of memory usage or just plain old permission parsing time. And you'd likely run into issues with newly-created or newly-migrated instances ending up on a hypervisor which has an old version of its keyring cached. I'm not certain if there's a way to refresh those on-demand from the monitor. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
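Greg's "lock down a key to a specific set of RBD volumes" could look roughly like the following for one image, narrowing each prefix from the thread's rbd-user caps. Treat it as a hedged sketch: the client name and image name are invented, and rbd_header./rbd_data. objects are named by the image's internal id rather than its name, so &lt;image-id&gt; must be looked up first (e.g. from rbd info); prefix caps like these are easy to get subtly wrong.

```shell
# Hypothetical per-image key: every prefix below follows the caps format
# already shown in this thread; <image-id> is a placeholder that must be
# replaced by the image's internal id, not its name.
ceph auth caps client.vm01 \
    mon 'allow r' \
    osd "allow x object_prefix rbd_children, \
         allow rwx object_prefix rbd_header.<image-id>, \
         allow rwx object_prefix rbd_id.vm01, \
         allow rw object_prefix rbd_data.<image-id>"
```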
Re: [ceph-users] cephx capabilities to forbid rbd creation
So, one key per RBD. Or, dynamically enable/disable access to each RBD in each hypervisor's key. Uhm, something doesn't scale here. :P (I wonder if there's any limit to a key's capabilities string...) But, as it appears, I share your view that it is the only available approach right now. Anyone would like to prove us wrong? :) On 15/03/2016 22:33, David Casier wrote: > Hi, > Maybe (not tested) : > [osd ]allow * object_prefix ? > > > > 2016-03-15 22:18 GMT+01:00 Loris Cuoghi: >> Hi David, >> >> One pool per virtualization host would make it impossible to live >> migrate a VM. :) >> >> Thanks, >> >> Loris >> >> >> On 15/03/2016 22:11, David Casier wrote: >>> Hi Loris, >>> If i'm not mistaken, there are no rbd ACL in cephx. >>> Why not 1 pool/client and pool quota ? >>> >>> David. >>> >>> 2016-02-12 3:34 GMT+01:00 Loris Cuoghi: Hi! We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt hosts any RBD creation capability ? We currently have an rbd-user key like so : caps: [mon] allow r caps: [osd] allow x object_prefix rbd_children, allow rwx object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw object_prefix rbd_data. And another rbd-manager key like the one suggested in the documentation, which is used in a central machine which is the only one allowed to create RBD images: caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=rbd Now, the libvirt hosts all share the same "rbd-user" secret. Our intention is to permit the QEMU processes to take full advantage of any single RBD functionality, but to forbid any new RBD creation with this same key. In the eventuality of a stolen key, or other hellish scenarios. What cephx capabilities did you guys configure for your virtualization hosts? 
Thanks, Loris
Re: [ceph-users] cephx capabilities to forbid rbd creation
Hi, Maybe (not tested) : [osd ]allow * object_prefix ? 2016-03-15 22:18 GMT+01:00 Loris Cuoghi: > > Hi David, > > One pool per virtualization host would make it impossible to live > migrate a VM. :) > > Thanks, > > Loris > > > On 15/03/2016 22:11, David Casier wrote: >> Hi Loris, >> If i'm not mistaken, there are no rbd ACL in cephx. >> Why not 1 pool/client and pool quota ? >> >> David. >> >> 2016-02-12 3:34 GMT+01:00 Loris Cuoghi : >>> Hi! >>> >>> We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. >>> >>> How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt >>> hosts any RBD creation capability ? >>> >>> We currently have an rbd-user key like so : >>> >>> caps: [mon] allow r >>> caps: [osd] allow x object_prefix rbd_children, allow rwx >>> object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw >>> object_prefix rbd_data. >>> >>> >>> And another rbd-manager key like the one suggested in the documentation, >>> which is used in a central machine which is the only one allowed to create >>> RBD images: >>> >>> caps: [mon] allow r >>> caps: [osd] allow class-read object_prefix rbd_children, allow rwx >>> pool=rbd >>> >>> Now, the libvirt hosts all share the same "rbd-user" secret. >>> Our intention is to permit the QEMU processes to take full advantage of any >>> single RBD functionality, but to forbid any new RBD creation with this same >>> key. In the eventuality of a stolen key, or other hellish scenarios. >>> >>> What cephx capabilities did you guys configure for your virtualization >>> hosts? >>> >>> Thanks, >>> >>> Loris >> >> -- Cordialement, David CASIER 3B Rue Taylor, CS20004 75481 PARIS Cedex 10 Paris Ligne directe: 01 75 98 53 85 Email: david.cas...@aevoo.fr
Re: [ceph-users] cephx capabilities to forbid rbd creation
Hi David, One pool per virtualization host would make it impossible to live migrate a VM. :) Thanks, Loris On 15/03/2016 22:11, David Casier wrote: > Hi Loris, > If i'm not mistaken, there are no rbd ACL in cephx. > Why not 1 pool/client and pool quota ? > > David. > > 2016-02-12 3:34 GMT+01:00 Loris Cuoghi: >> Hi! >> >> We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. >> >> How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt >> hosts any RBD creation capability ? >> >> We currently have an rbd-user key like so : >> >> caps: [mon] allow r >> caps: [osd] allow x object_prefix rbd_children, allow rwx >> object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw >> object_prefix rbd_data. >> >> >> And another rbd-manager key like the one suggested in the documentation, >> which is used in a central machine which is the only one allowed to create >> RBD images: >> >> caps: [mon] allow r >> caps: [osd] allow class-read object_prefix rbd_children, allow rwx >> pool=rbd >> >> Now, the libvirt hosts all share the same "rbd-user" secret. >> Our intention is to permit the QEMU processes to take full advantage of any >> single RBD functionality, but to forbid any new RBD creation with this same >> key. In the eventuality of a stolen key, or other hellish scenarios. >> >> What cephx capabilities did you guys configure for your virtualization >> hosts? >> >> Thanks, >> >> Loris > >
Re: [ceph-users] Disable cephx authentication ?
Interesting! Is it safe to do this? Perhaps "rados" is considered an internal command while rbd is a client using librados? In MonClient.cc:

  if (!cct->_conf->auth_supported.empty())
    method = cct->_conf->auth_supported;
  else if (entity_name.get_type() == CEPH_ENTITY_TYPE_OSD ||
           entity_name.get_type() == CEPH_ENTITY_TYPE_MDS ||
           entity_name.get_type() == CEPH_ENTITY_TYPE_MON)
    method = cct->_conf->auth_cluster_required;
  else
    method = cct->_conf->auth_client_required;

2016-03-15 9:35 GMT+01:00 Nguyen Hoang Nam: > Hi there, > > I set up a ceph cluster with cluster cephx authentication disabled and client cephx authentication enabled, as follows: > > auth_cluster_required = none > > auth_service_required = cephx > > auth_client_required = cephx > > I can run commands such as `ceph -s` and `rados -p rbd put`, but I cannot run `rbd ls`, `rbd create` ... The output of those commands is always: > > 2016-03-15 10:49:30.659194 7f1a6eda0700 0 cephx: verify_reply couldn't > decrypt with error: error decoding block for decryption > > 2016-03-15 10:49:30.659211 7f1a6eda0700 0 -- 172.30.6.101:0/954989888 >> > 172.30.6.103:6804/23638 pipe(0x7f1a8119f7f0 sd=4 :45067 s=1 pgs=0 cs=0 l=1 > c=0x7f1a8 1197000).failed verifying authorize reply > > Can you explain why RBD fails in this case? Thank you in advance
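Given the selection logic above, a client like rbd ends up using auth_client_required while the daemons authenticate each other via auth_cluster_required; mixing the settings is what appears to trip up rbd here. A hedged ceph.conf sketch, assuming the goal is simply to avoid the mixed-mode decrypt errors rather than a confirmed fix from this thread:

```ini
# Sketch: keep all three settings consistent (all cephx, or all none);
# the mixed configuration quoted above is what produced the
# "verify_reply couldn't decrypt" errors.
[global]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
```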
Re: [ceph-users] cephx capabilities to forbid rbd creation
Hi Loris, If i'm not mistaken, there are no rbd ACL in cephx. Why not 1 pool/client and pool quota ? David. 2016-02-12 3:34 GMT+01:00 Loris Cuoghi: > Hi! > > We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. > > How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt > hosts any RBD creation capability ? > > We currently have an rbd-user key like so : > > caps: [mon] allow r > caps: [osd] allow x object_prefix rbd_children, allow rwx > object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw > object_prefix rbd_data. > > > And another rbd-manager key like the one suggested in the documentation, > which is used in a central machine which is the only one allowed to create > RBD images: > > caps: [mon] allow r > caps: [osd] allow class-read object_prefix rbd_children, allow rwx > pool=rbd > > Now, the libvirt hosts all share the same "rbd-user" secret. > Our intention is to permit the QEMU processes to take full advantage of any > single RBD functionality, but to forbid any new RBD creation with this same > key. In the eventuality of a stolen key, or other hellish scenarios. > > What cephx capabilities did you guys configure for your virtualization > hosts? > > Thanks, > > Loris > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
My ceph-deploy came from the download.ceph.com site and it is 1.5.31-0. This code is in ceph itself though, the deploy logic is where the code appears to do the right thing ;-) Steve > On Mar 15, 2016, at 2:38 PM, Vasu Kulkarni wrote: > > Thanks for the steps; that should be enough to test it out. I hope you got the > latest ceph-deploy, either from pip or through GitHub. > > On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord wrote: > I would have to nuke my cluster right now, and I do not have a spare one.. > > The procedure though is literally this, given a 3 node redhat 7.2 cluster, > ceph00, ceph01 and ceph02 > > ceph-deploy install --testing ceph00 ceph01 ceph02 > ceph-deploy new ceph00 ceph01 ceph02 > > ceph-deploy mon create ceph00 ceph01 ceph02 > ceph-deploy gatherkeys ceph00 > > ceph-deploy osd create ceph00:sdb:/dev/sdi > ceph-deploy osd create ceph00:sdc:/dev/sdi > > All devices have their partition tables wiped before this. They are all just > SATA devices, no special devices in the way. > > sdi is an ssd and it is being carved up for journals. The first osd create > works, the second one gets stuck in a loop in the update_partition call in > ceph_disk for the 5 iterations before it gives up. When I look in > /sys/block/sdi the partition for the first osd is visible, the one for the > second is not. However looking at /proc/partitions it sees the correct thing. > So something about partprobe is not kicking udev into doing the right thing > when the second partition is added I suspect. > > If I do not use the separate journal device then it usually works, but > occasionally I see a single retry in that same loop. > > There is code in ceph_deploy which uses partprobe or partx depending on which > distro it detects, that is how I worked out what to change here. > > If I have to tear things down again I will reproduce and post here. 
> > Steve > > > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote: > > > > Do you mind giving the full failed logs somewhere in fpaste.org along with > > some os version details? > > There are some known issues on RHEL, If you use 'osd prepare' and 'osd > > activate'(specifying just the journal partition here) it might work better. > > > > On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord > > wrote: > > Not multipath if you mean using the multipath driver, just trying to setup > > OSDs which use a data disk and a journal ssd. If I run just a disk based > > OSD and only specify one device to ceph-deploy then it usually works > > although sometimes has to retry. In the case where I am using it to carve > > an SSD into several partitions for journals it fails on the second one. > > > > Steve > > > > > -- > The information contained in this transmission may be confidential. Any > disclosure, copying, or further distribution of confidential information is > not permitted unless such privilege is explicitly granted in writing by > Quantum. Quantum reserves the right to have electronic communications, > including email and attachments, sent across its networks filtered through > anti virus and spam software programs and retain such messages in order to > comply with applicable data security and retention requirements. Quantum is > not responsible for the proper and complete transmission of the substance of > this communication or for any delay in its receipt. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Thanks for the steps; that should be enough to test it out. I hope you got the latest ceph-deploy, either from pip or through GitHub. On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord wrote: > I would have to nuke my cluster right now, and I do not have a spare one.. > > The procedure though is literally this, given a 3 node redhat 7.2 cluster, > ceph00, ceph01 and ceph02 > > ceph-deploy install --testing ceph00 ceph01 ceph02 > ceph-deploy new ceph00 ceph01 ceph02 > > ceph-deploy mon create ceph00 ceph01 ceph02 > ceph-deploy gatherkeys ceph00 > > ceph-deploy osd create ceph00:sdb:/dev/sdi > ceph-deploy osd create ceph00:sdc:/dev/sdi > > All devices have their partition tables wiped before this. They are all > just SATA devices, no special devices in the way. > > sdi is an ssd and it is being carved up for journals. The first osd create > works, the second one gets stuck in a loop in the update_partition call in > ceph_disk for the 5 iterations before it gives up. When I look in > /sys/block/sdi the partition for the first osd is visible, the one for the > second is not. However looking at /proc/partitions it sees the correct > thing. So something about partprobe is not kicking udev into doing the > right thing when the second partition is added I suspect. > > If I do not use the separate journal device then it usually works, but > occasionally I see a single retry in that same loop. > > There is code in ceph_deploy which uses partprobe or partx depending on > which distro it detects, that is how I worked out what to change here. > > If I have to tear things down again I will reproduce and post here. > > Steve > > > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote: > > > > Do you mind giving the full failed logs somewhere in fpaste.org along > with some os version details? > > There are some known issues on RHEL; if you use 'osd prepare' and 'osd > activate' (specifying just the journal partition here) it might work better. 
> > > > On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord > wrote: > > Not multipath if you mean using the multipath driver, just trying to > setup OSDs which use a data disk and a journal ssd. If I run just a disk > based OSD and only specify one device to ceph-deploy then it usually works > although sometimes has to retry. In the case where I am using it to carve > an SSD into several partitions for journals it fails on the second one. > > > > Steve > > > > > -- > The information contained in this transmission may be confidential. Any > disclosure, copying, or further distribution of confidential information is > not permitted unless such privilege is explicitly granted in writing by > Quantum. Quantum reserves the right to have electronic communications, > including email and attachments, sent across its networks filtered through > anti virus and spam software programs and retain such messages in order to > comply with applicable data security and retention requirements. Quantum is > not responsible for the proper and complete transmission of the substance > of this communication or for any delay in its receipt. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
I would have to nuke my cluster right now, and I do not have a spare one.. The procedure though is literally this, given a 3 node redhat 7.2 cluster, ceph00, ceph01 and ceph02 ceph-deploy install --testing ceph00 ceph01 ceph02 ceph-deploy new ceph00 ceph01 ceph02 ceph-deploy mon create ceph00 ceph01 ceph02 ceph-deploy gatherkeys ceph00 ceph-deploy osd create ceph00:sdb:/dev/sdi ceph-deploy osd create ceph00:sdc:/dev/sdi All devices have their partition tables wiped before this. They are all just SATA devices, no special devices in the way. sdi is an ssd and it is being carved up for journals. The first osd create works, the second one gets stuck in a loop in the update_partition call in ceph_disk for the 5 iterations before it gives up. When I look in /sys/block/sdi the partition for the first osd is visible, the one for the second is not. However looking at /proc/partitions it sees the correct thing. So something about partprobe is not kicking udev into doing the right thing when the second partition is added I suspect. If I do not use the separate journal device then it usually works, but occasionally I see a single retry in that same loop. There is code in ceph_deploy which uses partprobe or partx depending on which distro it detects, that is how I worked out what to change here. If I have to tear things down again I will reproduce and post here. Steve > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote: > > Do you mind giving the full failed logs somewhere in fpaste.org along with > some os version details? > There are some known issues on RHEL; if you use 'osd prepare' and 'osd > activate' (specifying just the journal partition here) it might work better. > > On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord wrote: > Not multipath if you mean using the multipath driver, just trying to setup > OSDs which use a data disk and a journal ssd. 
If I run just a disk based OSD > and only specify one device to ceph-deploy then it usually works although > sometimes has to retry. In the case where I am using it to carve an SSD into > several partitions for journals it fails on the second one. > > Steve > -- The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Do you mind giving the full failed logs somewhere on fpaste.org along with some OS version details? There are some known issues on RHEL. If you use 'osd prepare' and 'osd activate' (specifying just the journal partition here) it might work better.

On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord wrote:
> Not multipath, if you mean the multipath driver; just trying to set up
> OSDs which use a data disk and a journal SSD. If I run just a disk-based
> OSD and only specify one device to ceph-deploy then it usually works,
> although sometimes it has to retry. In the case where I am using it to
> carve an SSD into several partitions for journals, it fails on the
> second one.
>
> Steve
>
> [earlier messages in the thread trimmed]
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Not multipath, if you mean the multipath driver; just trying to set up OSDs which use a data disk and a journal SSD. If I run just a disk-based OSD and only specify one device to ceph-deploy then it usually works, although sometimes it has to retry. In the case where I am using it to carve an SSD into several partitions for journals, it fails on the second one.

Steve

> On Mar 15, 2016, at 1:45 PM, Vasu Kulkarni wrote:
>
> The ceph-deploy suite and also the selinux suite (which isn't merged yet)
> indirectly test ceph-disk and have been run on Jewel as well. I guess the
> issue Stephen is seeing is on a multipath device, which I believe is a
> known issue.
>
> [earlier messages in the thread trimmed]
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
The ceph-deploy suite and also the selinux suite (which isn't merged yet) indirectly test ceph-disk and have been run on Jewel as well. I guess the issue Stephen is seeing is on a multipath device, which I believe is a known issue.

On Tue, Mar 15, 2016 at 11:42 AM, Gregory Farnum wrote:
> There's a ceph-disk suite from last August that Loïc set up, but based
> on the qa list it wasn't running for a while and isn't in great shape.
> :/ I know there are some CentOS7 boxes in the sepia lab but it might
> not be enough for a small and infrequently-run test to reliably get
> tested against them.
> -Greg
>
> [earlier messages in the thread trimmed]
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
There's a ceph-disk suite from last August that Loïc set up, but based on the qa list it wasn't running for a while and isn't in great shape. :/ I know there are some CentOS7 boxes in the sepia lab, but it might not be enough for a small and infrequently-run test to reliably get tested against them.
-Greg

On Tue, Mar 15, 2016 at 11:04 AM, Ben Hines wrote:
> It seems like ceph-disk is often breaking on CentOS/Red Hat systems. Does it
> have automated tests in the ceph release structure?
>
> -Ben
>
> [earlier messages in the thread trimmed]
[ceph-users] Ceph for home use
Wanting to play around with Ceph, I have a single-node Ceph cluster with 1 monitor and 3 OSDs running on a VM. I am loving the flexibility that Ceph provides (and perhaps just the novelty of it). I've been planning for some time to build a NAS for home use and am seriously thinking about running Ceph on real hardware (Core2 Quad Q9550 with 8-16GB RAM and 3x 4TB HDDs) as the backing store. Given my resources, I'd still only run a single node with 3 OSDs and a replica count of 2. I'd then have a VM mount a Ceph RBD to serve Samba/NFS shares.

I realize mine is not the usual or ideal use case for Ceph, but do you see any reason not to do this? I don't need high performance, just good enough to serve 2 movie streams, which my test VM is already able to do, and it will be one of my backup data stores, not my main store (yet). I just like Ceph's ability to add storage as needed with minimal fuss.

Also, I have some older HDDs (500-750GB, 5+ years old) that are still chugging along fine. I don't want to entrust main data to them, but feel I could use them for temporary backfill purposes. If I add them to another node, can Ceph be configured to use them only for backfill purposes, should the need arise?

Anyway, just wanted a sanity check, in case the novelty of running Ceph is clouding my judgement. Feel free to set me straight if this is just a silly idea for such relatively small-scale storage!
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
It seems like ceph-disk is often breaking on CentOS/Red Hat systems. Does it have automated tests in the ceph release structure?

-Ben

On Tue, Mar 15, 2016 at 8:52 AM, Stephen Lord wrote:
>
> Hi,
>
> The ceph-disk (10.0.4 version) command seems to have problems operating on
> a Red Hat 7 system. It uses the partprobe command unconditionally to update
> the partition table; I had to change this to partx -u to get past this.
>
> @@ -1321,13 +1321,13 @@
>      processed, i.e. the 95-ceph-osd.rules actions and mode changes,
>      group changes etc. are complete.
>      """
> -    LOG.debug('Calling partprobe on %s device %s', description, dev)
> +    LOG.debug('Calling partx on %s device %s', description, dev)
>      partprobe_ok = False
>      error = 'unknown error'
>      for i in (1, 2, 3, 4, 5):
>          command_check_call(['udevadm', 'settle', '--timeout=600'])
>          try:
> -            _check_output(['partprobe', dev])
> +            _check_output(['partx', '-u', dev])
>              partprobe_ok = True
>              break
>          except subprocess.CalledProcessError as e:
>
> It really needs to be doing that conditional on the operating system
> version.
>
> Steve
Re: [ceph-users] Calculating PG in an mixed environment
Thank you both for the quick reply, and I found my answer:

"Number of OSDs which this Pool will have PGs in. Typically, this is the entire Cluster OSD count, but could be less based on CRUSH rules. (e.g. Separate SSD and SATA disk sets)"

@Michael: So the ratio of PGs per OSD should be between 100 and 200. This means that if I calculate the PGs of a pool with the first formula and get, let's say, 8192, and I have 4 pools, I'm way overcommitting the PG-per-OSD ratio when each of the 4 pools uses 8192 PGs. Right? So I should lower the PGs on the pools?

Best,
Martin

On Tue, Mar 15, 2016 at 4:47 PM, Michael Kidd wrote:
> Hello Martin,
> The proper way is to perform the following process.
>
> For all pools utilizing the same bucket of OSDs:
>
>   (Pool1_pg_num * Pool1_size) + (Pool2_pg_num * Pool2_size) + ... + (Pool(n)_pg_num * Pool(n)_size)
>   -------------------------------------------------------------------------------------------------
>                                             OSD count
>
> This value should be between 100 and 200 PGs and is the actual ratio of PGs
> per OSD in that bucket of OSDs.
>
> [earlier messages in the thread trimmed]
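Michael's ratio formula makes the overcommit worry easy to check numerically. A minimal Python sketch, where the 100-OSD count and identical replica sizes are illustrative assumptions rather than figures from this thread:

```python
def pg_per_osd_ratio(pools, osd_count):
    """pools: (pg_num, replica_size) pairs sharing one bucket of OSDs.

    Returns the actual PG-per-OSD ratio; the healthy range quoted in
    the thread is roughly 100-200.
    """
    return sum(pg_num * size for pg_num, size in pools) / osd_count

# Hypothetical cluster: 4 pools of 8192 PGs each, replica size 3, on 100 OSDs.
ratio = pg_per_osd_ratio([(8192, 3)] * 4, osd_count=100)
print(ratio)  # 983.04 -- far above 200, so pg_num should indeed be lowered
```

So yes: sizing each of four pools from the single-pool formula overcommits badly, and pg_num per pool has to shrink until the combined ratio lands in range.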
Re: [ceph-users] data corruption with hammer
There are no monitors on the new node. It doesn't look like there has been any new corruption since we stopped changing the cache modes. Upon closer inspection, some files have been changed such that binary files are now ASCII files and vice versa. These are readable ASCII files, things like PHP or script files, or C files where ASCII files should be.

I've seen this type of corruption before when a SAN node misbehaved and both controllers were writing concurrently to the backend disks. The volume was only mounted by one host, but the writes were split between the controllers when it should have been active/passive.

We have killed off the OSDs on the new node as a precaution and will try to replicate this in our lab. My suspicion is that it has to do with the cache promotion code update, but I'm not sure how it would have caused this.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer wrote:
> [earlier messages in the thread trimmed]
Re: [ceph-users] data corruption with hammer
there are not any monitors running on the new nodes. the monitors are on separate nodes and running the 0.94.5 release. i spent some time thinking about this last night as well, and my thoughts went to the recency patches. i wouldn't think that caused this, but it's the only thing that seems close.

mike

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>
> > something weird happened on one of the ceph clusters that i administer
> > tonight which resulted in virtual machines using rbd volumes seeing
> > corruption in multiple forms.
> >
> > when everything was fine earlier in the day, the cluster was a number of
> > storage nodes spread across 3 different roots in the crush map. the first
> > bunch of storage nodes have both hard drives and ssds in them, with the
> > hard drives in one root and the ssds in another. there is a pool for
> > each, and the pool for the ssds is a cache tier for the hard drives. the
> > last set of storage nodes were in a separate root with their own pool
> > that is being used for burn-in testing.
> >
> > these nodes had run for a while with test traffic and we decided to move
> > them to the main root and pools. the main cluster is running 0.94.5 and
> > the new nodes got 0.94.6 due to them getting configured after that was
> > released. i removed the test pool and did a ceph osd crush move to move
> > the first node into the main cluster, the hard drives into the root for
> > that tier of storage and the ssds into the root and pool for the cache
> > tier. each set was done about 45 minutes apart and they ran for a couple
> > of hours while performing backfill without any issue other than high
> > load on the cluster.
> >
> Since I glanced what your setup looks like from Robert's posts and yours I
> won't say the obvious thing, as you aren't using EC pools.
>
> > we normally run the ssd tier in the forward cache-mode due to the ssds we
> > have not being able to keep up with the io of writeback. this results in
> > io on the hard drives slowly going up and performance of the cluster
> > starting to suffer. about once a week, i change the cache-mode between
> > writeback and forward for short periods of time to promote actively used
> > data to the cache tier. this moves io load from the hard drive tier to
> > the ssd tier and has been done multiple times without issue. i normally
> > don't do this while there are backfills or recoveries happening on the
> > cluster but decided to go ahead while backfill was happening due to the
> > high load.
> >
> As you might recall, I managed to have "rados bench" break (I/O error) when
> doing these switches with Firefly on my crappy test cluster, but not with
> Hammer.
> However I haven't done any such switches on my production cluster with a
> cache tier, both because the cache pool hasn't even reached 50% capacity
> after 2 weeks of pounding and because I'm sure that everything will hold
> up when it comes to the first flushing.
>
> Maybe the extreme load (as opposed to normal VM ops) of your cluster
> during the backfilling triggered the same or a similar bug.
>
> > i tried this procedure to change the ssd cache-tier between writeback and
> > forward cache-mode and things seemed okay from the ceph cluster. about 10
> > minutes after the first attempt at changing the mode, vms using the ceph
> > cluster for their storage started seeing corruption in multiple forms.
> > the mode was flipped back and forth multiple times in that time frame,
> > and it's unknown if the corruption was noticed with the first change or
> > subsequent changes. the vms were having issues of filesystems having
> > errors and getting remounted RO, and mysql databases seeing corruption
> > (both myisam and innodb). some of this was recoverable, but on some
> > filesystems there was corruption that led to things like lots of data
> > ending up in lost+found, and some of the databases were
> > un-recoverable (backups are helping there).
> >
> > i'm not sure what would have happened to cause this corruption. the
> > libvirt logs for the qemu processes for the vms did not provide any
> > output of problems from the ceph client code. it doesn't look like any
> > of the qemu processes had crashed. also, it has now been several hours
> > since this happened with no additional corruption noticed by the vms. it
> > doesn't appear that we had any corruption happen before i attempted the
> > flipping of the ssd tier cache-mode.
> >
> > the only thing i can think of that is different between this time doing
> > this procedure vs previous attempts was that there was the one storage
> > node running 0.94.6 where the remainder were running 0.94.5. is it
> > possible that something changed between these two releases that would
> > have caused problems with data consistency related to the cache tier? or
> > otherwise? any other thoughts or suggestions?
> >
> What comes to mind in terms of these
[ceph-users] ceph-disk from jewel has issues on redhat 7
Hi,

The ceph-disk (10.0.4 version) command seems to have problems operating on a Red Hat 7 system. It uses the partprobe command unconditionally to update the partition table; I had to change this to partx -u to get past this.

@@ -1321,13 +1321,13 @@
     processed, i.e. the 95-ceph-osd.rules actions and mode changes,
     group changes etc. are complete.
     """
-    LOG.debug('Calling partprobe on %s device %s', description, dev)
+    LOG.debug('Calling partx on %s device %s', description, dev)
     partprobe_ok = False
     error = 'unknown error'
     for i in (1, 2, 3, 4, 5):
         command_check_call(['udevadm', 'settle', '--timeout=600'])
         try:
-            _check_output(['partprobe', dev])
+            _check_output(['partx', '-u', dev])
             partprobe_ok = True
             break
         except subprocess.CalledProcessError as e:

It really needs to be doing that conditional on the operating system version.

Steve
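A distro-conditional version of that call, as Steve suggests, might look roughly like the sketch below. This is illustrative only, not the actual ceph-disk fix: the function names and the string-matching distro check are assumptions for the sake of the example.

```python
import platform
import subprocess

def partition_update_command(dev, distro):
    """Pick the partition-table re-read command for this distro.

    On RHEL/CentOS 7 the shipped partprobe does not reliably kick udev
    when a new partition is added, so fall back to 'partx -u' there.
    """
    if 'red hat' in distro.lower() or 'centos' in distro.lower():
        return ['partx', '-u', dev]
    return ['partprobe', dev]

def update_partition(dev, distro=None):
    """Settle udev, then ask the kernel to re-read dev's partition table."""
    if distro is None:
        distro = platform.platform()  # crude detection, for the sketch only
    subprocess.check_call(['udevadm', 'settle', '--timeout=600'])
    subprocess.check_output(partition_update_command(dev, distro))
```

The point is only that the command choice becomes a function of the detected OS, the same way ceph-deploy already switches between partprobe and partx.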
Re: [ceph-users] Calculating PG in an mixed environment
Hello Martin,

The proper way is to perform the following process. For all pools utilizing the same bucket of OSDs:

  (Pool1_pg_num * Pool1_size) + (Pool2_pg_num * Pool2_size) + ... + (Pool(n)_pg_num * Pool(n)_size)
  -------------------------------------------------------------------------------------------------
                                            OSD count

This value should be between 100 and 200 PGs and is the actual ratio of PGs per OSD in that bucket of OSDs.

For the actual recommendation from the Ceph devs (and written by myself), please see:
http://ceph.com/pgcalc/

NOTE: The tool is partially broken, but the explanation at the top/bottom is sound. I'll work to get the tool fully functional again.

Thanks,

Michael J. Kidd
Sr. Software Maintenance Engineer
Red Hat Ceph Storage
+1 919-442-8878

On Tue, Mar 15, 2016 at 11:41 AM, Martin Palma wrote:
> Hi all,
>
> The documentation [0] gives us the following formula for calculating
> the number of PGs if the cluster is bigger than 50 OSDs:
>
>                (OSDs * 100)
>   Total PGs = --------------
>                  pool size
>
> When we have mixed storage servers (HDD disks and SSD disks) and we
> have defined different roots in our crush map to map some pools only
> to HDD disks and some to SSD disks, as described by Sebastien Han [1]:
> in the above formula, what number of OSDs should be used to calculate
> the PGs for a pool only on the HDD disks? The total number of OSDs in
> the cluster, or only the number of OSDs which have an HDD disk as
> backend?
>
> Best,
> Martin
>
> [0] http://docs.ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
> [1] http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
Re: [ceph-users] Calculating PG in an mixed environment
You can find it at http://ceph.com/pgcalc/

2016-03-15 23:41 GMT+08:00 Martin Palma:
> Hi all,
>
> The documentation [0] gives us the following formula for calculating
> the number of PGs if the cluster is bigger than 50 OSDs:
>
>   Total PGs = (OSDs * 100) / pool size
>
> We have mixed storage servers (HDD disks and SSD disks) and we have
> defined different roots in our crush map to map some pools only to
> HDD disks and some to SSD disks, as described by Sebastien Han [1].
>
> In the above formula, what number of OSDs should be used to calculate
> the PGs for a pool only on the HDD disks? The total number of OSDs in
> the cluster, or only the number of OSDs which have an HDD disk as
> backend?
>
> Best,
> Martin
>
> [0] http://docs.ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
> [1] http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/

--
thanks
huangjun
[ceph-users] Calculating PG in an mixed environment
Hi all,

The documentation [0] gives us the following formula for calculating the number of PGs if the cluster is bigger than 50 OSDs:

  Total PGs = (OSDs * 100) / pool size

We have mixed storage servers (HDD disks and SSD disks) and we have defined different roots in our crush map to map some pools only to HDD disks and some to SSD disks, as described by Sebastien Han [1].

In the above formula, what number of OSDs should be used to calculate the PGs for a pool only on the HDD disks? The total number of OSDs in the cluster, or only the number of OSDs which have an HDD disk as backend?

Best,
Martin

[0] http://docs.ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
[1] http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
Re: [ceph-users] SSD and Journal
Yes, if you can manage the *cost*, separating the journal onto a different device should improve write performance. But you need to evaluate how many OSD journals you can dedicate to a single journal device, as at some point it will be bottlenecked by that journal device's bandwidth.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Yair Magnezi
Sent: Tuesday, March 15, 2016 6:44 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SSD and Journal

Hi Guys,

On a full-SSD cluster, is it meaningful to put the journal on a different drive? Does it have any impact on performance?

Thanks

Yair Magnezi
Storage & Data Protection // Kenshoo
Office +972 7 32862423 // Mobile +972 50 575-2955
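Somnath's point about the journal device becoming the bottleneck is simple arithmetic. A rough sketch -- the bandwidth figures are assumptions for illustration, not measured numbers:

```python
def max_osds_per_journal(journal_write_mb_s, per_osd_write_mb_s):
    """Every write hits the journal first, so a shared journal device
    saturates once the OSDs behind it can jointly exceed its
    sequential write bandwidth.  Returns the largest OSD count the
    journal can feed without becoming the bottleneck."""
    return int(journal_write_mb_s // per_osd_write_mb_s)

# e.g. a journal SSD sustaining ~500 MB/s of writes, in front of OSDs
# that each sustain ~110 MB/s:
max_osds_per_journal(500, 110)  # -> 4
```

Beyond that count, adding more OSDs behind the same journal device only divides its bandwidth further.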
[ceph-users] ceph client lost connection to primary osd
Hi,

Can the ceph client (via librbd) continue I/O if the connection to the primary OSD is lost?

Thanks
[ceph-users] SSD and Journal
Hi Guys,

On a full-SSD cluster, is it meaningful to put the journal on a different drive? Does it have any impact on performance?

Thanks

Yair Magnezi
Storage & Data Protection // Kenshoo
Office +972 7 32862423 // Mobile +972 50 575-2955
Re: [ceph-users] Understanding "ceph -w" output - cluster monitoring
On Tue, Mar 15, 2016 at 6:38 AM, Blade Doyle wrote:
> On Mon, Mar 14, 2016 at 3:48 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Mon, 14 Mar 2016 09:16:13 -0700 Blade Doyle wrote:
>>
>> > Hi Ceph Community,
>> >
>> > I am trying to use "ceph -w" output to monitor my ceph cluster. The
>> > basic setup is:
>> >
>> > A python script runs ceph -w and processes each line of output. It
>> > finds the data it wants and reports it to InfluxDB. I view the data
>> > using Grafana, and Ceph Dashboard.
>>
>> A much richer and more precise source of information would be the various
>> performance counters, using collectd to feed them into graphite and
>> friends.
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039953.html
>>
>> I'm using the DWM one, YMMV.
>
> Thanks much for your reply, Christian.
>
> Ugh. Ok, then it looks like the key info here is to get the data from the
> osd/mon sockets. Forgive me for not digging too deep yet, but it looks like
> I would do something like:
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok perf dump

Only if you want per-daemon stats.

> * which of that data is read/write bytes?
> * Is that data for the entire cluster, or just that osd? (would I need to
> read data from each individual osd sock in the cluster?)

Please have a look at the link I posted. There is an existing piece of code there for doing stats collection, and it supports both gathering stats from every daemon (you can sum them yourself) and gathering the already-summed stats from the mon (much simpler if you don't need more detail). Remember that the diamond code is free software: even if you don't want to use diamond, you're completely free to just copy what it does.

As for the meaning of the stats, you'll mostly find that they're either obvious ("num_read_kb", "num_write_kb", etc.) or completely obscure ("num_evict_mode_some"). As long as you only want the obvious ones, you'll be fine :-)

John
Re: [ceph-users] rbd cache on full ssd cluster
Thanks Christian.

Still -- "So yes, your numbers are normal for single client, low depth reads, as many threads in this ML confirm" -- we're facing very high latency (I expect much less latency from an SSD cluster):

    clat percentiles (usec):
     |  1.00th=[  350],  5.00th=[  390], 10.00th=[  414], 20.00th=[  454],
     | 30.00th=[  494], 40.00th=[  540], 50.00th=[  612], 60.00th=[  732],
     | 70.00th=[ 1064], 80.00th=[10304], 90.00th=[37632], 95.00th=[38656],
     | 99.00th=[40192], 99.50th=[41216], 99.90th=[43264], 99.95th=[43776],

Thanks

Yair Magnezi
Storage & Data Protection TL // Kenshoo
Office +972 7 32862423 // Mobile +972 50 575-2955

On Tue, Mar 15, 2016 at 2:28 AM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 15:51:11 +0200 Yair Magnezi wrote:
>
> > On Fri, Mar 11, 2016 at 2:01 AM, Christian Balzer wrote:
> > >
> > > Hello,
> > >
> > > As always there are many similar threads in here; googling and reading
> > > up on stuff are good for you.
> > >
> > > On Thu, 10 Mar 2016 16:55:03 +0200 Yair Magnezi wrote:
> > >
> > > > Hello Cephers.
> > > >
> > > > I wonder if anyone has some experience with a full-SSD cluster.
> > > > We're testing ceph ("firefly") with 4 nodes (supermicro
> > > > SYS-F628R3-R72BPT) * 1TB SSD, total of 12 osds.
> > > > Our network is 10 gig.
> > > Much more, relevant details, from SW versions (kernel, OS, Ceph) and
> > > configuration (replica size of your pool) to precise HW info.
> >
> > H/W --> 4 nodes supermicro (SYS-F628R3-R72BPT), every node has 64 GB mem,
> > MegaRAID SAS 2208: RAID0, 4 * 1 TB ssd (SAMSUNG MZ7KM960HAHP-5)
>
> SM863, they should be fine.
> However I've never seen any results of them with sync writes; if you have
> the time, something to test.
>
> > Cluster --> 4 nodes, 12 OSDs, replica size = 2, ubuntu 14.04.1 LTS
>
> Otherwise similar to my cache pool against which I tested below,
> 2 nodes with 4x 800GB Intel DC S3610 each, replica of 2, thus 8 OSDs.
> 2 E5-2623 (3GHz base speed) per node. > Network is QDR Infiniband, IPoIB. > > Debian Jessie and Ceph Hammer, though. > > > > > > > In particular your SSDs, exact maker/version/size. > > > Where are your journals? > > > > > > SAMSUNG MZ7KM960HAHP-5 , 893.752 GB > > Journals on the same drive data ( all SSD as mentioned ) > > > Again, should be fine but test these with sync writes. > And of course monitor their wearout over time. > > > > > > Also Firefly is EOL, Hammer and even more so the upcoming Jewel have > > > significant improvements with SSDs. > > > > > > > We used the ceph_deploy for installation with all defaults > > > > ( followed ceph documentation for integration with open-stack ) > > > > As much as we understand there is no need to enable the rbd cache as > > > > we're running on full ssd. > > > RBD cache as in the client side librbd cache is always very helpful, > > > fast backing storage or not. > > > It can significantly reduce the number of small writes, something Ceph > > > has to do a lot of heavy lifting for. > > > > > > > bench marking the cluster shows very poor performance write but > > > > mostly read ( clients are open-stack but also vmware instances ) . > > > > > > Benchmarking how (exact command line for fio for example) and with what > > > results? > > > You say poor, but that might be "normal" for your situation, we can't > > > really tell w/o hard data. > > > > > > > > > > >fio --name=randread --ioengine=libaio --iodepth=1 --rw=randread > > --bs=4k --direct=1 --size=256M --numjobs=10 --runtime=120 > > --group_reporting --directory=/ceph_test2 > > > > Just to make sure, this is run inside your VM? > > >root@open-compute1:~# fio --name=randread --ioengine=libaio > > --iodepth=1 --rw=randread --bs=4k --direct=1 --size=256M --numjobs=10 > > --runtime=120 --group_reporting --directory=/ceph_test2 > > randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, > > iodepth=1 > > ... 
> > randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, > > iodepth=1 > > fio-2.1.3 > > Starting 10 processes > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > Jobs: 10 (f=10): [rr] [100.0% done] [4616KB/0KB/0KB /s] [1154/0/0 > > iops] [eta 00m:00s] > > randread: (groupid=0, jobs=10): err= 0: pid=25393: Mon Mar 14 09:17:24 > > 2016 read : io=597360KB,
[ceph-users] TR: CEPH nightmare or not
Hi,

We have 3 ceph clusters (Hammer 0.94.5) on the same physical nodes, using LXC on Debian Wheezy. Each physical node has 12 x 4 TB 7200 RPM hard drives, 2 x 200 GB MLC SSDs, and 2 x 10 Gb Ethernet. On each physical drive we have an LXC container for one OSD, and the journal is on an SSD partition. One of our ceph clusters has 96 OSDs with pgp_num = 1024.

Last week we raised pgp_num from 1024 to 2048 in one pass. Bad idea :(. You need to read the fucking manual before changing this kind of parameter. Ceph was a bit stressed and couldn't return to normal; a few OSDs (~10%) were flapping.

On our physical nodes, we noticed some network problems:

    64 bytes from 127.0.0.1: icmp_req=1258 ttl=64 time=0.146 ms
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1260 ttl=64 time=0.023 ms
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1262 ttl=64 time=0.028 ms
    ping: sendmsg: Invalid argument
    ping: sendmsg: Invalid argument
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1266 ttl=64 time=0.026 ms
    64 bytes from 127.0.0.1: icmp_req=1267 ttl=64 time=0.142 ms
    ping: sendmsg: Invalid argument
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1270 ttl=64 time=0.137 ms
    ping: sendmsg: Invalid argument

With our kernel (3.16) there was nothing in the logs. After a few days of research, we tried upgrading the kernel to a newer version (4.4.4). Not so easy to backport to Debian Wheezy, but after a few hours it worked. The problem hadn't gone away, but we noticed a new message in the logs:

    arp_cache: Neighbour table overflow

In Debian, the level-1 ARP cache holds only 128 records! We added this to sysctl.conf on every physical node:

    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 8192
    net.ipv4.neigh.default.gc_stale_time = 86400

Immediately the network problems disappeared, and our cluster came back to a better state in a few hours: HEALTH_OK :)

To sum up:
- Do not raise your pgp_num in one pass!
- Look at your kernel parameters; you may need some tweaks to be fine.

Regards
Pierre DOUCET
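One way to avoid the "one pass" mistake is to script the increase in small steps, waiting for the cluster to settle between each. A minimal sketch -- the step size and the pool-name placeholder are assumptions, and pgp_num should follow pg_num once the new PGs have been created:

```python
def pg_num_steps(current, target, step=128):
    """Plan a gradual pg_num increase instead of one big jump.
    pg_num can only grow, so we only ever step upward, capping the
    final step at the target."""
    steps = []
    while current < target:
        current = min(current + step, target)
        steps.append(current)
    return steps

# Print the commands we would run, waiting for the cluster to settle
# between each one:
for n in pg_num_steps(1024, 2048):
    print('ceph osd pool set <pool> pg_num %d' % n)
```

Each intermediate value keeps the amount of data movement per step small, instead of triggering one massive rebalance.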
[ceph-users] Disable cephx authentication ?
Hi there,

I set up a ceph cluster with cephx cluster authentication disabled and cephx client authentication enabled, as follows:

    auth_cluster_required = none
    auth_service_required = cephx
    auth_client_required = cephx

I can run commands such as `ceph -s` and `rados -p rbd put`, but I cannot run `rbd ls`, `rbd create`, etc. The output of those commands is always:

    2016-03-15 10:49:30.659194 7f1a6eda0700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
    2016-03-15 10:49:30.659211 7f1a6eda0700 0 -- 172.30.6.101:0/954989888 >> 172.30.6.103:6804/23638 pipe(0x7f1a8119f7f0 sd=4 :45067 s=1 pgs=0 cs=0 l=1 c=0x7f1a81197000).failed verifying authorize reply

Can you explain why RBD fails in this case?

Thank you in advance
Re: [ceph-users] Understanding "ceph -w" output - cluster monitoring
On Mon, 14 Mar 2016 23:38:24 -0700 Blade Doyle wrote:

> On Mon, Mar 14, 2016 at 3:48 PM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Mon, 14 Mar 2016 09:16:13 -0700 Blade Doyle wrote:
> >
> > > Hi Ceph Community,
> > >
> > > I am trying to use "ceph -w" output to monitor my ceph cluster. The
> > > basic setup is:
> > >
> > > A python script runs ceph -w and processes each line of output. It
> > > finds the data it wants and reports it to InfluxDB. I view the data
> > > using Grafana, and Ceph Dashboard.
> >
> > A much richer and more precise source of information would be the
> > various performance counters, using collectd to feed them into
> > graphite and friends.
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039953.html
> >
> > I'm using the DWM one, YMMV.
>
> Thanks much for your reply, Christian.
>
> Ugh. Ok, then it looks like the key info here is to get the data from
> the osd/mon sockets. Forgive me for not digging too deep yet, but it
> looks like I would do something like:
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok perf dump

Correct.

> * which of that data is read/write bytes?

More than one choice; the obvious ones are "counter-osd_op_out_bytes" and "counter-osd_op_in_bytes". This is why collectd with graphite is so much fun: you just click and drool until the data (graph) makes sense.

> * Is that data for the entire cluster, or just that osd? (would I need
> to read data from each individual osd sock in the cluster?)

Indeed the latter; the mons don't keep track of this.

Christian

> Thanks,
> Blade.

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
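If you do collect per-OSD dumps and sum them yourself, the parsing is straightforward. A minimal sketch -- it assumes the "osd" section of perf dump exposes op_in_bytes / op_out_bytes, as on Hammer-era OSDs; check your version's actual output:

```python
import json

def total_op_bytes(perf_dumps):
    """Sum client I/O byte counters across per-OSD 'perf dump' JSON
    strings, e.g. one per OSD collected via:
        ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump
    Returns (bytes_in, bytes_out) totalled over all the dumps;
    counters missing from a dump are treated as zero."""
    total_in = total_out = 0
    for dump in perf_dumps:
        osd = json.loads(dump).get('osd', {})
        total_in += osd.get('op_in_bytes', 0)
        total_out += osd.get('op_out_bytes', 0)
    return total_in, total_out
```

The summed pair can then be pushed to InfluxDB on each polling interval, replacing the `ceph -w` scraping described above.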
Re: [ceph-users] Understanding "ceph -w" output - cluster monitoring
On Mon, Mar 14, 2016 at 3:48 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 09:16:13 -0700 Blade Doyle wrote:
>
> > Hi Ceph Community,
> >
> > I am trying to use "ceph -w" output to monitor my ceph cluster. The
> > basic setup is:
> >
> > A python script runs ceph -w and processes each line of output. It finds
> > the data it wants and reports it to InfluxDB. I view the data using
> > Grafana, and Ceph Dashboard.
>
> A much richer and more precise source of information would be the various
> performance counters, using collectd to feed them into graphite and
> friends.
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039953.html
>
> I'm using the DWM one, YMMV.

Thanks much for your reply, Christian.

Ugh. Ok, then it looks like the key info here is to get the data from the osd/mon sockets. Forgive me for not digging too deep yet, but it looks like I would do something like:

    ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok perf dump

* which of that data is read/write bytes?
* Is that data for the entire cluster, or just that osd? (would I need to read data from each individual osd sock in the cluster?)

Thanks,
Blade.