Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-03 Thread Bharath Krishna
Hi Gaurav,

There are several ways to do it depending on how you deployed your ceph
cluster. The easiest way is to use ceph-ansible with the ready-made
purge-cluster YAML to wipe off Ceph.

https://github.com/ceph/ceph-ansible/blob/master/purge-cluster.yml

You may need to configure ansible inventory with ceph hosts.
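
A minimal sketch of such an inventory, assuming three hosts named
ceph-node1..3 that carry both mons and osds (the hostnames, path and group
layout here are placeholders, so double check them against the ceph-ansible
docs for your version):

cat > /etc/ansible/hosts <<'EOF'
[mons]
ceph-node1
ceph-node2
ceph-node3

[osds]
ceph-node1
ceph-node2
ceph-node3
EOF

# then, from the ceph-ansible checkout:
ansible-playbook purge-cluster.yml -i /etc/ansible/hosts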

Otherwise, if you want to purge manually, you can do it using:
http://docs.ceph.com/docs/hammer/rados/deployment/ceph-deploy-purge/
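
For a ceph-deploy based cluster that boils down to something like the
following, run from the admin node (a sketch; node1 node2 node3 are
placeholder hostnames):

ceph-deploy purge node1 node2 node3       # remove the ceph packages
ceph-deploy purgedata node1 node2 node3   # remove /var/lib/ceph and /etc/ceph
ceph-deploy forgetkeys                    # drop the locally cached keyrings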


Thanks
Bharath

From: ceph-users  on behalf of Gaurav Goyal 

Date: Thursday, August 4, 2016 at 8:19 AM
To: David Turner 
Cc: ceph-users 
Subject: Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local 
Disks

Please suggest a procedure for this uninstallation process?


Regards
Gaurav Goyal

On Wed, Aug 3, 2016 at 5:58 PM, Gaurav Goyal 
> wrote:

Thanks for your  prompt
response!

The situation is a bit different now. The customer wants us to remove the ceph
storage configuration from scratch, let the openstack system work without ceph,
and later on install ceph with local disks.

So I need to know a procedure to uninstall ceph and unconfigure it from  
openstack.

Regards
Gaurav Goyal
On 03-Aug-2016 4:59 pm, "David Turner" 
> wrote:
If I'm understanding your question correctly that you're asking how to actually 
remove the SAN osds from ceph, then it doesn't matter what is using the storage 
(ie openstack, cephfs, krbd, etc) as the steps are the same.

I'm going to assume that you've already added the new storage/osds to the 
cluster, weighted the SAN osds to 0.0 and that the backfilling has finished.  
If that is true, then your disk used space on the SAN's should be basically 
empty while the new osds on the local disks should have a fair amount of data.  
If that is the case, then for every SAN osd, you just run the following 
commands replacing OSD_ID with the osd's id:

# On the server with the osd being removed
sudo stop ceph-osd id=OSD_ID
ceph osd down OSD_ID
ceph osd out OSD_ID
ceph osd crush remove osd.OSD_ID
ceph auth del osd.OSD_ID
ceph osd rm OSD_ID

Test running those commands on a test osd and if you had set the weight of the 
osd to 0.0 previously and if the backfilling had finished, then what you should 
see is that your cluster has 1 less osd than it used to, and no pgs should be 
backfilling.
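
If you have many SAN osds to retire, a small loop saves typing. This is only a
sketch (the osd ids 0..5 are placeholders, and the stop command still has to be
run on the host that owns each osd):

for OSD_ID in 0 1 2 3 4 5; do
    ceph osd down $OSD_ID
    ceph osd out $OSD_ID
    ceph osd crush remove osd.$OSD_ID
    ceph auth del osd.$OSD_ID
    ceph osd rm $OSD_ID
done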

HOWEVER, if my assumptions above are incorrect, please provide the output of 
the following commands and try to clarify your question.

ceph status
ceph osd tree

I hope this helps.

> Hello David,
>
> Can you help me with steps/Procedure to uninstall Ceph storage from openstack 
> environment?
>
>
> Regards
> Gaurav Goyal


David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-deploy on Jewel error

2016-08-03 Thread Chengwei Yang
On Thu, Aug 04, 2016 at 12:20:01AM +, EP Komarla wrote:
> Hi All,
> 
>  
> 
> I am trying to do a fresh install of Ceph Jewel on my cluster.  I went through
> all the steps in configuring the network, ssh, password, etc.  Now I am at the
> stage of running the ceph-deploy commands to install monitors and other 
> nodes. 
> I am getting the below error when I am deploying the first monitor.  Not able
> to figure out what it is that I am missing here.  Any pointers or help
> appreciated.
> 
>  
> 
> Thanks in advance.
> 
>  
> 
> - epk
> 
>  
> 
> [ep-c2-mon-01][DEBUG ] ---> Package librbd1.x86_64 1:0.94.7-0.el7 will be
> updated
> 
> [ep-c2-mon-01][DEBUG ] ---> Package librbd1.x86_64 1:10.2.2-0.el7 will be an
> update
> 
> [ep-c2-mon-01][DEBUG ] ---> Package python-cephfs.x86_64 1:0.94.7-0.el7 will 
> be
> updated
> 
> [ep-c2-mon-01][DEBUG ] ---> Package python-cephfs.x86_64 1:10.2.2-0.el7 will 
> be
> an update
> 
> [ep-c2-mon-01][DEBUG ] ---> Package python-rados.x86_64 1:0.94.7-0.el7 will be
> updated
> 
> [ep-c2-mon-01][DEBUG ] ---> Package python-rados.x86_64 1:10.2.2-0.el7 will be
> an update
> 
> [ep-c2-mon-01][DEBUG ] ---> Package python-rbd.x86_64 1:0.94.7-0.el7 will be
> updated
> 
> [ep-c2-mon-01][DEBUG ] ---> Package python-rbd.x86_64 1:10.2.2-0.el7 will be 
> an
> update
> 
> [ep-c2-mon-01][DEBUG ] --> Running transaction check
> 
> [ep-c2-mon-01][DEBUG ] ---> Package ceph-selinux.x86_64 1:10.2.2-0.el7 will be
> installed
> 
> [ep-c2-mon-01][DEBUG ] --> Processing Dependency: selinux-policy-base >=
> 3.13.1-60.el7_2.3 for package: 1:ceph-selinux-10.2.2-0.el7.x86_64
> 
> [ep-c2-mon-01][DEBUG ] ---> Package python-setuptools.noarch 0:0.9.8-4.el7 
> will
> be installed
> 
> [ep-c2-mon-01][DEBUG ] --> Finished Dependency Resolution
> 
> [ep-c2-mon-01][WARNIN] Error: Package: 1:ceph-selinux-10.2.2-0.el7.x86_64
> (ceph)
> 
> [ep-c2-mon-01][DEBUG ]  You could try using --skip-broken to work around the
> problem
> 
> [ep-c2-mon-01][WARNIN]Requires: selinux-policy-base >=
> 3.13.1-60.el7_2.3

It said it requires selinux-policy-base >= 3.13.1-60.el7_2.3

> 
> [ep-c2-mon-01][WARNIN]Installed:
> selinux-policy-targeted-3.13.1-60.el7.noarch (@CentOS/7)
> 
> [ep-c2-mon-01][WARNIN]selinux-policy-base = 3.13.1-60.el7
> 
> [ep-c2-mon-01][WARNIN]Available:
> selinux-policy-minimum-3.13.1-60.el7.noarch (CentOS-7)
> 
> [ep-c2-mon-01][WARNIN]selinux-policy-base = 3.13.1-60.el7
> 
> [ep-c2-mon-01][WARNIN]Available:
> selinux-policy-mls-3.13.1-60.el7.noarch (CentOS-7)
> 
> [ep-c2-mon-01][WARNIN]selinux-policy-base = 3.13.1-60.el7

However, neither the installed version nor the available versions meet the
requirement, so it fails.

You may have an incorrect repo configuration.
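
A quick way to check (a sketch; exact repo names depend on your mirror setup)
is to confirm the CentOS updates repo is enabled and actually offers the newer
selinux-policy build before re-running ceph-deploy:

yum repolist enabled
yum clean all && yum makecache
yum info selinux-policy-targeted   # an el7_2.x update should show as available
yum update selinux-policy-targeted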

> 
> [ep-c2-mon-01][DEBUG ]  You could try running: rpm -Va --nofiles --nodigest
> 
> [ep-c2-mon-01][ERROR ] RuntimeError: command returned non-zero exit status: 1
> 
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y install
> ceph ceph-radosgw
> 
>  
> 
>  
> 
> EP KOMARLA,
> 
> 
> Email: ep.koma...@flextronics.com
> 
> Address: 677 Gibraltor Ct, Building #2, Milpitas, CA 94035, USA
> 
> Phone: 408-674-6090 (mobile)
> 
>  
> 
> 
> Legal Disclaimer:
> The information contained in this message may be privileged and confidential.
> It is intended to be read only by the individual or entity to whom it is
> addressed or by their designee. If the reader of this message is not the
> intended recipient, you are on notice that any distribution of this message, 
> in
> any form, is strictly prohibited. If you have received this message in error,
> please immediately notify the sender and delete or destroy any copy of this
> message!
> SECURITY NOTE: file ~/.netrc must not be accessible by others



> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Thanks,
Chengwei


signature.asc
Description: Digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs issue - able to mount with user key, not able to write

2016-08-03 Thread Goncalo Borges

Hi ...

We also use a mount_user key to mount cephfs with ceph-fuse. I
remember that we had some troubles also. We use ceph-authtool to
generate the key with the following syntax:


   ceph-authtool --create-keyring <keyring path>
     --gen-key -n client.<user name>
     --cap mds '<mds caps>'
     --cap osd '<osd caps> pool=<pool name>'
     --cap mon '<mon caps>'

In our case, the command we executed was:

   # ceph-authtool --create-keyring
   /etc/ceph/ceph.client.mount_user.keyring --gen-key -n
   client.mount_user --cap mds 'allow' --cap osd 'allow rw
   pool=coepp_cephfs_data' --cap mon 'allow r'



Please note the following particularities:
   - The name of the keyring should be
<cluster>.client.<user>.keyring (i.e.
ceph.client.mount_user.keyring)
   - The name of the user should be client.<user> (i.e.
client.mount_user)

   - This key has the following permissions:
     --cap mds 'allow'
     --cap osd 'allow rw pool=coepp_cephfs_data'
     --cap mon 'allow r'

It seems you are following this structure, but please double check it. I
also remember some bugs in Jewel regarding path-based permissions, so you
might consider removing the path restriction from the key. After creation,
one should import the key into the cluster's auth list:

   # ceph auth import -i /etc/ceph/ceph.client.mount_user.keyring

   # ceph auth list
   installed auth entries:

   (...)

   client.mount_user
   key: 
   caps: [mds] allow
   caps: [mon] allow r
   caps: [osd] allow rw pool=coepp_cephfs_data


Finally I mount it as

   # ceph-fuse --id mount_user -k
   /etc/ceph/ceph.client.mount_user.keyring -m X.X.X.X:6789 -r /cephfs
   /coepp/cephfs/

where X.X.X.X stands for the mon ip address.
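
It may also be worth double checking that the pool named in the osd cap is
really the data pool used by the filesystem (a quick, read-only sketch;
compare the data pool reported by the first command against the pool= in the
osd cap shown by the second):

   # ceph fs ls
   # ceph auth get client.test1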

Does this help?
Cheers
G.


On 08/03/2016 06:23 PM, Daleep Singh Bais wrote:

Dear All,

I am trying to use CephFS in my setup. I have created a test setup with
01 MON and 03 OSD's.

I have created an MDS server and am able to mount it on a client using FUSE.
Using the admin keyring, I am able to write to cephfs and sub-directories also.

I am experiencing an issue when I try to write to cephfs using another
user. I have created the required keys with permissions.

When I try to write, I see that the metadata object count increases, but no
change in the data pool.

Also, this is what I see in the logs:

2016-08-03 08:17:20.771597 b16feff0  0 log_channel(cluster) log [INF] :
closing stale session client.165552 192.168.1.29:0/5671 after 302.321097
2016-08-03 08:19:16.049985 accfeff0  0 -- 192.168.1.201:6800/7088 >>
192.168.1.29:0/5707 pipe(0x8549ed00 sd=22 :6800 s=2 pgs=2 cs=1 l=0
c=0x857342e0).fault with nothing to send, going to standby

My cephx key is client.test1 created like :

ceph auth get-or-create client.test1 mon 'allow r' mds 'allow r, allow
rw path=/test1' osd 'allow rw pool=data' -o
/etc/ceph/ceph.client.test1.keyring

#ceph mds stat
e11: 1/1/1 up {0=mon1=up:active}

# ceph --version
ceph version 10.2.2-1-g502540f (502540faf67308fa595e03f9f446b4ba67df731d)


Any suggestion would be helpful.

Thanks.

Daleep Singh Bais

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-03 Thread Gaurav Goyal
Please suggest a procedure for this uninstallation process?


Regards
Gaurav Goyal

On Wed, Aug 3, 2016 at 5:58 PM, Gaurav Goyal 
wrote:

> Thanks for your  prompt
> response!
>
> The situation is a bit different now. The customer wants us to remove the ceph
> storage configuration from scratch, let the openstack system work without
> ceph, and later on install ceph with local disks.
>
> So I need to know a procedure to uninstall ceph and unconfigure it from
> openstack.
>
> Regards
> Gaurav Goyal
> On 03-Aug-2016 4:59 pm, "David Turner" 
> wrote:
>
>> If I'm understanding your question correctly that you're asking how to
>> actually remove the SAN osds from ceph, then it doesn't matter what is
>> using the storage (ie openstack, cephfs, krbd, etc) as the steps are the
>> same.
>>
>> I'm going to assume that you've already added the new storage/osds to the
>> cluster, weighted the SAN osds to 0.0 and that the backfilling has
>> finished.  If that is true, then your disk used space on the SAN's should
>> be basically empty while the new osds on the local disks should have a fair
>> amount of data.  If that is the case, then for every SAN osd, you just run
>> the following commands replacing OSD_ID with the osd's id:
>>
>> # On the server with the osd being removed
>> sudo stop ceph-osd id=OSD_ID
>> ceph osd down OSD_ID
>> ceph osd out OSD_ID
>> ceph osd crush remove osd.OSD_ID
>> ceph auth del osd.OSD_ID
>> ceph osd rm OSD_ID
>>
>> Test running those commands on a test osd and if you had set the weight
>> of the osd to 0.0 previously and if the backfilling had finished, then what
>> you should see is that your cluster has 1 less osd than it used to, and no
>> pgs should be backfilling.
>>
>> HOWEVER, if my assumptions above are incorrect, please provide the output
>> of the following commands and try to clarify your question.
>>
>> ceph status
>> ceph osd tree
>>
>> I hope this helps.
>>
>> > Hello David,
>> >
>> > Can you help me with steps/Procedure to uninstall Ceph storage from
>> openstack environment?
>> >
>> >
>> > Regards
>> > Gaurav Goyal
>>
>> --
>>
>>  David Turner | Cloud Operations Engineer | 
>> StorageCraft
>> Technology Corporation 
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2760 | Mobile: 385.224.2943
>>
>> --
>>
>> If you are not the intended recipient of this message or received it
>> erroneously, please notify the sender and delete it, together with any
>> attachments, and be advised that any dissemination or copying of this
>> message is prohibited.
>>
>> --
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-deploy on Jewel error

2016-08-03 Thread EP Komarla
Hi All,

I am trying to do a fresh install of Ceph Jewel on my cluster.  I went through 
all the steps in configuring the network, ssh, password, etc.  Now I am at the 
stage of running the ceph-deploy commands to install monitors and other nodes.  
I am getting the below error when I am deploying the first monitor.  Not able 
to figure out what it is that I am missing here.  Any pointers or help 
appreciated.

Thanks in advance.

- epk

[ep-c2-mon-01][DEBUG ] ---> Package librbd1.x86_64 1:0.94.7-0.el7 will be 
updated
[ep-c2-mon-01][DEBUG ] ---> Package librbd1.x86_64 1:10.2.2-0.el7 will be an 
update
[ep-c2-mon-01][DEBUG ] ---> Package python-cephfs.x86_64 1:0.94.7-0.el7 will be 
updated
[ep-c2-mon-01][DEBUG ] ---> Package python-cephfs.x86_64 1:10.2.2-0.el7 will be 
an update
[ep-c2-mon-01][DEBUG ] ---> Package python-rados.x86_64 1:0.94.7-0.el7 will be 
updated
[ep-c2-mon-01][DEBUG ] ---> Package python-rados.x86_64 1:10.2.2-0.el7 will be 
an update
[ep-c2-mon-01][DEBUG ] ---> Package python-rbd.x86_64 1:0.94.7-0.el7 will be 
updated
[ep-c2-mon-01][DEBUG ] ---> Package python-rbd.x86_64 1:10.2.2-0.el7 will be an 
update
[ep-c2-mon-01][DEBUG ] --> Running transaction check
[ep-c2-mon-01][DEBUG ] ---> Package ceph-selinux.x86_64 1:10.2.2-0.el7 will be 
installed
[ep-c2-mon-01][DEBUG ] --> Processing Dependency: selinux-policy-base >= 
3.13.1-60.el7_2.3 for package: 1:ceph-selinux-10.2.2-0.el7.x86_64
[ep-c2-mon-01][DEBUG ] ---> Package python-setuptools.noarch 0:0.9.8-4.el7 will 
be installed
[ep-c2-mon-01][DEBUG ] --> Finished Dependency Resolution
[ep-c2-mon-01][WARNIN] Error: Package: 1:ceph-selinux-10.2.2-0.el7.x86_64 (ceph)
[ep-c2-mon-01][DEBUG ]  You could try using --skip-broken to work around the 
problem
[ep-c2-mon-01][WARNIN]Requires: selinux-policy-base >= 
3.13.1-60.el7_2.3
[ep-c2-mon-01][WARNIN]Installed: 
selinux-policy-targeted-3.13.1-60.el7.noarch (@CentOS/7)
[ep-c2-mon-01][WARNIN]selinux-policy-base = 3.13.1-60.el7
[ep-c2-mon-01][WARNIN]Available: 
selinux-policy-minimum-3.13.1-60.el7.noarch (CentOS-7)
[ep-c2-mon-01][WARNIN]selinux-policy-base = 3.13.1-60.el7
[ep-c2-mon-01][WARNIN]Available: 
selinux-policy-mls-3.13.1-60.el7.noarch (CentOS-7)
[ep-c2-mon-01][WARNIN]selinux-policy-base = 3.13.1-60.el7
[ep-c2-mon-01][DEBUG ]  You could try running: rpm -Va --nofiles --nodigest
[ep-c2-mon-01][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y install 
ceph ceph-radosgw


EP KOMARLA,
Email: ep.koma...@flextronics.com
Address: 677 Gibraltor Ct, Building #2, Milpitas, CA 94035, USA
Phone: 408-674-6090 (mobile)


Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Stillwell, Bryan J
Thanks Somnath,

I'll try moving my testing to master tomorrow to see if that improves the
stability at all.

Bryan

On 8/3/16, 4:50 PM, "Somnath Roy"  wrote:

>Probably it is better to move to the latest master and reproduce this
>defect; a lot of stuff has changed in between.
>This is a good test case, and I doubt any of us are testing with fsck()
>enabled on mount/unmount.
>
>Thanks & Regards
>Somnath
>
>-Original Message-
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>Stillwell, Bryan J
>Sent: Wednesday, August 03, 2016 3:41 PM
>To: ceph-users@lists.ceph.com
>Subject: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures
>
>I've been doing some benchmarking of BlueStore in 10.2.2 the last few
>days and have come across a failure that keeps happening after stressing
>the cluster fairly heavily.  Some of the OSDs started failing and
>attempts to restart them fail to log anything in /var/log/ceph/, so I
>tried starting them manually and ran into these error messages:
>
># /usr/bin/ceph-osd --cluster=ceph -i 4 -f --setuser ceph --setgroup ceph
>2016-08-02 22:52:01.190226 7f97d75e1800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-02 22:52:01.190340 7f97d75e1800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-02 22:52:01.190497 7f97d75e1800 -1 WARNING: experimental feature
>'bluestore' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>important data.
>
>starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4/
>/var/lib/ceph/osd/ceph-4/journal
>2016-08-02 22:52:01.194461 7f97d75e1800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-02 22:52:01.237619 7f97d75e1800 -1 WARNING: experimental feature
>'rocksdb' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>important data.
>
>2016-08-02 22:52:01.501405 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/)  a#20:bac03f87:::4_454:head# nid
>67134 already in use
>2016-08-02 22:52:01.629900 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/)  9#20:e64f44a7:::4_258:head# nid
>78351 already in use
>2016-08-02 22:52:01.967599 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent
>256983760896~1245184 intersects allocated blocks
>2016-08-02 22:52:01.967605 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [256984940544~65536]
>2016-08-02 22:52:01.978635 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 258455044096~196608
>intersects allocated blocks
>2016-08-02 22:52:01.978640 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [258455175168~65536]
>2016-08-02 22:52:01.978647 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck leaked some space; free+used =
>[0~252138684416,252138815488~4844945408,256984940544~1470103552,2584551751
>6
>8~5732719067136] != expected 0~5991174242304
>2016-08-02 22:52:02.987479 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) mount fsck found 5 errors
>2016-08-02 22:52:02.987488 7f97d75e1800 -1 osd.4 0 OSD:init: unable to
>mount object store
>2016-08-02 22:52:02.987498 7f97d75e1800 -1  ** ERROR: osd init failed:
>(5) Input/output error
>
>
>Here's another one:
>
># /usr/bin/ceph-osd --cluster=ceph -i 11 -f --setuser ceph --setgroup ceph
>2016-08-03 22:16:49.052319 7f0e4d949800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-03 22:16:49.052445 7f0e4d949800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-03 22:16:49.052690 7f0e4d949800 -1 WARNING: experimental feature
>'bluestore' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>important data.
>
>starting osd.11 at :/0 osd_data /var/lib/ceph/osd/ceph-11/
>/var/lib/ceph/osd/ceph-11/journal
>2016-08-03 22:16:49.056779 7f0e4d949800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-03 22:16:49.095695 7f0e4d949800 -1 WARNING: experimental feature
>'rocksdb' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>important data.
>
>2016-08-03 22:16:49.821451 7f0e4d949800 -1
>bluestore(/var/lib/ceph/osd/ceph-11/)  6#20:2eed99bf:::4_257:head# nid
>72869 already in use
>2016-08-03 22:16:49.961943 7f0e4d949800 -1
>bluestore(/var/lib/ceph/osd/ceph-11/) fsck free extent 257123155968~65536
>intersects allocated blocks
>2016-08-03 

Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Somnath Roy
Yes Greg, agreed. I found some corruption during BlueFS replay, which could also
be caught in more detail if I run fsck(), maybe.
I will do it, but in a dev environment the time consumed by fsck() could be a
challenge (though I have no idea how long it will take per TB of data; I have
never run it), considering the number of times we need to restart OSDs.

Thanks & Regards
Somnath

-Original Message-
From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: Wednesday, August 03, 2016 4:03 PM
To: Somnath Roy
Cc: Stillwell, Bryan J; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

On Wed, Aug 3, 2016 at 3:50 PM, Somnath Roy  wrote:
> Probably it is better to move to the latest master and reproduce this defect;
> a lot of stuff has changed in between.
> This is a good test case, and I doubt any of us are testing with fsck()
> enabled on mount/unmount.

Given that the allocator keeps changing, running fsck frequently while testing 
is probably a good idea... ;)
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Gregory Farnum
On Wed, Aug 3, 2016 at 3:50 PM, Somnath Roy  wrote:
> Probably it is better to move to the latest master and reproduce this defect;
> a lot of stuff has changed in between.
> This is a good test case, and I doubt any of us are testing with fsck()
> enabled on mount/unmount.

Given that the allocator keeps changing, running fsck frequently while
testing is probably a good idea... ;)
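
For reference, that can be switched on per OSD in ceph.conf (a sketch; the
option names below are my recollection of the BlueStore fsck-on-mount knobs,
so double check them against your build before relying on them):

[osd]
    bluestore fsck on mount = true
    bluestore fsck on umount = true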
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Somnath Roy
Probably it is better to move to the latest master and reproduce this defect; a
lot of stuff has changed in between.
This is a good test case, and I doubt any of us are testing with fsck() enabled
on mount/unmount.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Stillwell, Bryan J
Sent: Wednesday, August 03, 2016 3:41 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

I've been doing some benchmarking of BlueStore in 10.2.2 the last few days and 
have come across a failure that keeps happening after stressing the cluster 
fairly heavily.  Some of the OSDs started failing and attempts to restart them 
fail to log anything in /var/log/ceph/, so I tried starting them manually and 
ran into these error messages:

# /usr/bin/ceph-osd --cluster=ceph -i 4 -f --setuser ceph --setgroup ceph
2016-08-02 22:52:01.190226 7f97d75e1800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-02 22:52:01.190340 7f97d75e1800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-02 22:52:01.190497 7f97d75e1800 -1 WARNING: experimental feature 
'bluestore' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4/ 
/var/lib/ceph/osd/ceph-4/journal
2016-08-02 22:52:01.194461 7f97d75e1800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-02 22:52:01.237619 7f97d75e1800 -1 WARNING: experimental feature 
'rocksdb' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

2016-08-02 22:52:01.501405 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  a#20:bac03f87:::4_454:head# nid
67134 already in use
2016-08-02 22:52:01.629900 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  9#20:e64f44a7:::4_258:head# nid
78351 already in use
2016-08-02 22:52:01.967599 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 256983760896~1245184 
intersects allocated blocks
2016-08-02 22:52:01.967605 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [256984940544~65536]
2016-08-02 22:52:01.978635 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 258455044096~196608 
intersects allocated blocks
2016-08-02 22:52:01.978640 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [258455175168~65536]
2016-08-02 22:52:01.978647 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck leaked some space; free+used =
[0~252138684416,252138815488~4844945408,256984940544~1470103552,25845517516
8~5732719067136] != expected 0~5991174242304
2016-08-02 22:52:02.987479 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) mount fsck found 5 errors
2016-08-02 22:52:02.987488 7f97d75e1800 -1 osd.4 0 OSD:init: unable to mount 
object store
2016-08-02 22:52:02.987498 7f97d75e1800 -1  ** ERROR: osd init failed: (5) 
Input/output error


Here's another one:

# /usr/bin/ceph-osd --cluster=ceph -i 11 -f --setuser ceph --setgroup ceph
2016-08-03 22:16:49.052319 7f0e4d949800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-03 22:16:49.052445 7f0e4d949800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-03 22:16:49.052690 7f0e4d949800 -1 WARNING: experimental feature 
'bluestore' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

starting osd.11 at :/0 osd_data /var/lib/ceph/osd/ceph-11/ 
/var/lib/ceph/osd/ceph-11/journal
2016-08-03 22:16:49.056779 7f0e4d949800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-03 22:16:49.095695 7f0e4d949800 -1 WARNING: experimental feature 
'rocksdb' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

2016-08-03 22:16:49.821451 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/)  6#20:2eed99bf:::4_257:head# nid
72869 already in use
2016-08-03 22:16:49.961943 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck free extent 257123155968~65536 
intersects allocated blocks
2016-08-03 22:16:49.961950 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck overlap: [257123155968~65536]
2016-08-03 22:16:49.962012 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck leaked some space; free+used = 

[ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Stillwell, Bryan J
I've been doing some benchmarking of BlueStore in 10.2.2 the last few days
and
have come across a failure that keeps happening after stressing the cluster
fairly heavily.  Some of the OSDs started failing and attempts to restart
them
fail to log anything in /var/log/ceph/, so I tried starting them manually
and
ran into these error messages:

# /usr/bin/ceph-osd --cluster=ceph -i 4 -f --setuser ceph --setgroup ceph
2016-08-02 22:52:01.190226 7f97d75e1800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-02 22:52:01.190340 7f97d75e1800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-02 22:52:01.190497 7f97d75e1800 -1 WARNING: experimental feature
'bluestore' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4/
/var/lib/ceph/osd/ceph-4/journal
2016-08-02 22:52:01.194461 7f97d75e1800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-02 22:52:01.237619 7f97d75e1800 -1 WARNING: experimental feature
'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

2016-08-02 22:52:01.501405 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  a#20:bac03f87:::4_454:head# nid
67134 already in use
2016-08-02 22:52:01.629900 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  9#20:e64f44a7:::4_258:head# nid
78351 already in use
2016-08-02 22:52:01.967599 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 256983760896~1245184
intersects allocated blocks
2016-08-02 22:52:01.967605 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [256984940544~65536]
2016-08-02 22:52:01.978635 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 258455044096~196608
intersects allocated blocks
2016-08-02 22:52:01.978640 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [258455175168~65536]
2016-08-02 22:52:01.978647 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck leaked some space; free+used =
[0~252138684416,252138815488~4844945408,256984940544~1470103552,25845517516
8~5732719067136] != expected 0~5991174242304
2016-08-02 22:52:02.987479 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) mount fsck found 5 errors
2016-08-02 22:52:02.987488 7f97d75e1800 -1 osd.4 0 OSD:init: unable to
mount object store
2016-08-02 22:52:02.987498 7f97d75e1800 -1  ** ERROR: osd init failed: (5)
Input/output error


Here's another one:

# /usr/bin/ceph-osd --cluster=ceph -i 11 -f --setuser ceph --setgroup ceph
2016-08-03 22:16:49.052319 7f0e4d949800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-03 22:16:49.052445 7f0e4d949800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-03 22:16:49.052690 7f0e4d949800 -1 WARNING: experimental feature
'bluestore' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

starting osd.11 at :/0 osd_data /var/lib/ceph/osd/ceph-11/
/var/lib/ceph/osd/ceph-11/journal
2016-08-03 22:16:49.056779 7f0e4d949800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-03 22:16:49.095695 7f0e4d949800 -1 WARNING: experimental feature
'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

2016-08-03 22:16:49.821451 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/)  6#20:2eed99bf:::4_257:head# nid
72869 already in use
2016-08-03 22:16:49.961943 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck free extent 257123155968~65536
intersects allocated blocks
2016-08-03 22:16:49.961950 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck overlap: [257123155968~65536]
2016-08-03 22:16:49.962012 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck leaked some space; free+used =
[0~241963433984,241963499520~5749210742784] != expected 0~5991174242304
2016-08-03 22:16:50.855099 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) mount fsck found 3 errors
2016-08-03 22:16:50.855109 7f0e4d949800 -1 osd.11 0 OSD:init: unable to
mount object store
2016-08-03 22:16:50.855118 7f0e4d949800 -1  ** ERROR: osd init failed: (5)
Input/output error


I currently have a total of 12 OSDs down (out of 46) which all appear to be
experiencing this problem.

Here are more details of the cluster (currently just a single node):

2x Xeon E5-2699 v4 @ 2.20GHz

[ceph-users] [Troubleshooting] I have a watcher I can't get rid of...

2016-08-03 Thread K.C. Wong
I'm having a hard time removing an RBD that I no longer need.

# rbd rm /
2016-08-03 15:00:01.085784 7ff9dfc997c0 -1 librbd: image has watchers - not 
removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again 
after closing/unmapping it or waiting 30s for the crashed client to timeout.

So, I use `rbd status` to identify the watcher:

# rbd status /
Watchers:
watcher=:0/705293879 client.1076985 cookie=1

I log onto that host, and did

# rbd showmapped

which returns nothing

I don't use snapshot and I don't use cloning, so, there shouldn't
be any image sharing. I ended up rebooting that host and the
watcher is still around, and my problem persist: I can't remove
the RBD.

At this point, I'm all out of ideas on how to troubleshoot this
problem. I'm running infernalis:

# ceph --version
ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)

in my set up, on CentOS 7.2 hosts

# uname -r
3.10.0-327.22.2.el7.x86_64
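
For completeness, two things I have not yet tried (sketch only; <pool>,
<image> and <id> are placeholders, the header object name assumes a format 2
image, and blacklisting will disrupt that client if it is actually alive):

# rbd info <pool>/<image>                       # block_name_prefix gives the <id>
# rados -p <pool> listwatchers rbd_header.<id>  # watchers on the header object
# ceph osd blacklist add <watcher address>      # force the stale watch to expire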

I appreciate any assistance,

-kc

K.C. Wong
kcw...@verseon.com
4096R/B8995EDE  E527 CBE8 023E 79EA 8BBB  5C77 23A6 92E9 B899 5EDE
hkps://hkps.pool.sks-keyservers.net



signature.asc
Description: Message signed with OpenPGP using GPGMail
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-03 Thread Gaurav Goyal
Thanks for your  prompt
response!

The situation is a bit different now. The customer wants us to remove the ceph
storage configuration from scratch, let the openstack system work without ceph,
and later on install ceph with local disks.

So I need to know a procedure to uninstall ceph and unconfigure it from
openstack.

Regards
Gaurav Goyal
On 03-Aug-2016 4:59 pm, "David Turner" 
wrote:

> If I'm understanding your question correctly that you're asking how to
> actually remove the SAN osds from ceph, then it doesn't matter what is
> using the storage (ie openstack, cephfs, krbd, etc) as the steps are the
> same.
>
> I'm going to assume that you've already added the new storage/osds to the
> cluster, weighted the SAN osds to 0.0 and that the backfilling has
> finished.  If that is true, then your disk used space on the SAN's should
> be basically empty while the new osds on the local disks should have a fair
> amount of data.  If that is the case, then for every SAN osd, you just run
> the following commands replacing OSD_ID with the osd's id:
>
> # On the server with the osd being removed
> sudo stop ceph-osd id=OSD_ID
> ceph osd down OSD_ID
> ceph osd out OSD_ID
> ceph osd crush remove osd.OSD_ID
> ceph auth del osd.OSD_ID
> ceph osd rm OSD_ID
>
> Test running those commands on a test osd and if you had set the weight of
> the osd to 0.0 previously and if the backfilling had finished, then what
> you should see is that your cluster has 1 less osd than it used to, and no
> pgs should be backfilling.
>
> HOWEVER, if my assumptions above are incorrect, please provide the output
> of the following commands and try to clarify your question.
>
> ceph status
> ceph osd tree
>
> I hope this helps.
>
> > Hello David,
> >
> > Can you help me with steps/Procedure to uninstall Ceph storage from
> openstack environment?
> >
> >
> > Regards
> > Gaurav Goyal
>
> --
>
>  David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation 
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
> --
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I use fio with randwrite io to ceph image , it's run 2000 IOPS in the first time , and run 6000 IOPS in second time

2016-08-03 Thread Warren Wang - ISD
It's probably rbd cache taking effect. If you know all your clients are
well behaved, you could set "rbd cache writethrough until flush" to false,
instead of the default true, but understand the ramification. You could
also just do it during benchmarking.
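
If you do try it, the setting goes in the client section of ceph.conf on the
benchmark host (a sketch; remember to revert it after benchmarking):

[client]
    rbd cache = true
    rbd cache writethrough until flush = false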

Warren Wang



From:  ceph-users  on behalf of
"m13913886...@yahoo.com" 
Reply-To:  "m13913886...@yahoo.com" 
Date:  Monday, August 1, 2016 at 11:30 PM
To:  Ceph-users 
Subject:  [ceph-users] I use fio with randwrite io to ceph image , it's
run 2000 IOPS in the first time , and run 6000 IOPS in second time



In version 10.2.2, fio initially runs at 2000 IOPS; then I interrupt fio,
and when I run fio again, it runs at 6000 IOPS.

But in version 0.94, fio always runs at 6000 IOPS, with or without
repeating fio.


What is the difference between these two versions in this respect?


My config is as follows:

I have three nodes, and two osds per node. A total of six osds.
All osds are SSD disks.


Here is my ceph.conf of osd:

[osd]

osd mkfs type=xfs
osd data = /data/$name
osd_journal_size = 8
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 1048576

journal max write bytes = 1073714824
journal max write entries = 1
journal queue max ops = 5
journal queue max bytes = 1048576

osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 4
osd recovery max active = 10
osd max backfills = 4


This email and any files transmitted with it are confidential and intended 
solely for the individual or entity to whom they are addressed. If you have 
received this email in error destroy it immediately. *** Walmart Confidential 
***
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-03 Thread David Turner
If I'm understanding your question correctly that you're asking how to actually 
remove the SAN osds from ceph, then it doesn't matter what is using the storage 
(ie openstack, cephfs, krbd, etc) as the steps are the same.

I'm going to assume that you've already added the new storage/osds to the 
cluster, weighted the SAN osds to 0.0 and that the backfilling has finished.  
If that is true, then your disk used space on the SAN's should be basically 
empty while the new osds on the local disks should have a fair amount of data.  
If that is the case, then for every SAN osd, you just run the following 
commands replacing OSD_ID with the osd's id:

# On the server with the osd being removed
sudo stop ceph-osd id=OSD_ID
ceph osd down OSD_ID
ceph osd out OSD_ID
ceph osd crush remove osd.OSD_ID
ceph auth del osd.OSD_ID
ceph osd rm OSD_ID

Test running those commands on a test osd and if you had set the weight of the 
osd to 0.0 previously and if the backfilling had finished, then what you should 
see is that your cluster has 1 less osd than it used to, and no pgs should be 
backfilling.

HOWEVER, if my assumptions above are incorrect, please provide the output of 
the following commands and try to clarify your question.

ceph status
ceph osd tree

I hope this helps.

> Hello David,
>
> Can you help me with steps/Procedure to uninstall Ceph storage from openstack 
> environment?
>
>
> Regards
> Gaurav Goyal



David Turner | Cloud Operations Engineer | StorageCraft Technology
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-03 Thread Gaurav Goyal
Hello David,

Can you help me with steps/Procedure to uninstall Ceph storage from
openstack environment?


Regards
Gaurav Goyal

On Tue, Aug 2, 2016 at 11:57 AM, Gaurav Goyal 
wrote:

> Hello David,
>
> Thanks a lot for detailed information!
>
> This is going to help me.
>
>
> Regards
> Gaurav Goyal
>
> On Tue, Aug 2, 2016 at 11:46 AM, David Turner <
> david.tur...@storagecraft.com> wrote:
>
>> I'm going to assume you know how to add and remove storage
>> http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/.  The
>> only other part of this process is reweighting the crush map for the old
>> osds to a new weight of 0.0
>> http://docs.ceph.com/docs/master/rados/operations/crush-map/.
>>
>> I would recommend setting the nobackfill and norecover flags.
>>
>> ceph osd set nobackfill
>> ceph osd set norecover
>>
>> Next you would add all of the new osds according to the ceph docs and
>> then reweight the old osds to 0.0.
>>
>> ceph osd crush reweight osd.1 0.0
>>
>> Once you have all of that set, unset nobackfill and norecover.
>>
>> ceph osd unset nobackfill
>> ceph osd unset norecover
>>
>> Wait until all of the backfilling finishes and then remove the old SAN
>> osds as per the ceph docs.
>>
>>
>> There is a thread from this mailing list about the benefits of weighting
>> osds to 0.0 instead of just removing them.  The best thing that you gain
>> from doing it this way is that you can remove multiple nodes/osds at the
>> same time without having degraded objects and especially without losing
>> objects.
>>
>> --
>>
>>  David Turner | Cloud Operations Engineer | 
>> StorageCraft
>> Technology Corporation 
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2760 | Mobile: 385.224.2943
>>
>> --
>>
>> If you are not the intended recipient of this message or received it
>> erroneously, please notify the sender and delete it, together with any
>> attachments, and be advised that any dissemination or copying of this
>> message is prohibited.
>>
>> --
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Read Stalls with Multiple OSD Servers

2016-08-03 Thread Christoph Adomeit
Hi Tom,

thank you very much for your hint regarding tcp_sack and sysctl network stack 
tuning. This pointed me in the right direction.

We occasionally had similar issues where reads stalled on osds under high
network load.

Enabling tcp_sack made the situation better for us and some more tuning 
completely solved the issue for us.
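
For anyone hitting the same thing, the tcp_sack part looked roughly like this
on our nodes (a sketch; the file name is arbitrary):

sysctl -w net.ipv4.tcp_sack=1
echo 'net.ipv4.tcp_sack = 1' > /etc/sysctl.d/99-ceph-net.conf
sysctl --system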

I learned one more time that you need absolutely clean and fast networking for
ceph, and that ceph uses network resources much more heavily than any other
network software.

However, I think ceph should be designed to be more fault tolerant regarding
minor network issues, since minor problems and a few lost packets can always happen.

Thanks
  Christoph

On Tue, Aug 02, 2016 at 07:14:27PM +, Helander, Thomas wrote:
> Hi David,
> 
> There’s a good amount of backstory to our configuration, but I’m happy to 
> report I found the source of my problem.
> 
> We were applying some “optimizations” for our 10GbE via sysctl, including 
> disabling net.ipv4.tcp_sack. Re-enabling net.ipv4.tcp_sack resolved the issue.
> 
> Thanks,
> Tom
> 
> From: David Turner [mailto:david.tur...@storagecraft.com]
> Sent: Monday, August 01, 2016 12:06 PM
> To: Helander, Thomas ; 
> ceph-users@lists.ceph.com
> Subject: RE: Read Stalls with Multiple OSD Servers
> 
> Why are you running Raid 6 osds?  Ceph's usefulness is a lot of osds that can 
> fail and be replaced.  With your processors/ram, you should be running these 
> as individual osds.  That will utilize your dual processor setup much more.  
> Ceph is optimal for 1 core per osd.  Extra cores are more or less wasted in 
> the storage node.  You only have 2 storage nodes, so you can't utilize a lot 
> of the benefits of Ceph.  Your setup looks like you're much better suited for 
> a Gluster cluster instead of a Ceph cluster.  I don't know what your needs 
> are, but that's what it looks like from here.
> 
> 
> David Turner | Cloud Operations Engineer | StorageCraft Technology 
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
> 
> 
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
> 
> 
> 
> From: Helander, Thomas [thomas.helan...@kla-tencor.com]
> Sent: Monday, August 01, 2016 11:10 AM
> To: David Turner; ceph-users@lists.ceph.com
> Subject: RE: Read Stalls with Multiple OSD Servers
> Hi David,
> 
> Thanks for the quick response and suggestion. I do have just a basic network 
> config (one network, no VLANs) and am able to ping between the storage 
> servers using hostnames and IPs.
> 
> Thanks,
> Tom
> 
> From: David Turner [mailto:david.tur...@storagecraft.com]
> Sent: Monday, August 01, 2016 9:14 AM
> To: Helander, Thomas 
> >; 
> ceph-users@lists.ceph.com
> Subject: RE: Read Stalls with Multiple OSD Servers
> 
> This could be explained by your osds not being able to communicate with each 
> other.  We have 2 vlans between our storage nodes, the public and private 
> networks for ceph to use.  We added 2 new nodes in a new rack on new switches 
> and as soon as we added a single osd for one of them to the cluster, the 
> peering never finished and we had a lot of blocked requests that never went 
> away.
> 
> In testing we found that the rest of the cluster could not communicate with 
> these nodes on the private vlan and after fixing the network switch config, 
> everything worked perfectly for adding in the 2 new nodes.
> 
> If you are using a basic network configuration with only one network and/or 
> vlan, then this is likely not to be your issue.  But to check and make sure, 
> you should test pinging between your nodes on all of the IPs they have.
> 
> 
> David Turner | Cloud Operations Engineer | StorageCraft Technology 
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
> 
> 
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
> 
> 
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Helander, 
> Thomas [thomas.helan...@kla-tencor.com]
> Sent: 

Re: [ceph-users] ceph-dbg package for Xenial (ubuntu-16.04.x) broken

2016-08-03 Thread J. Ryan Earl
Inspecting the ceph-dbg packages under
http://download.ceph.com/debian-jewel/pool/main/c/ceph/ it looks like this
is an ongoing issue and not specific to just 10.2.2.  Specifically there
are only 2 ceph-dbg package versions:

ceph-dbg_10.0.2-1trusty_amd64.deb
ceph-dbg_10.0.2-1~bpo80+1_amd64.deb

There aren't even 10.0.2 'ceph' packages there, only 10.1.x and 10.2.x
versions of the actual binaries.  So it seems that there are literally no
debug packages available for any of the Debian-based Jewel releases
available.  This seems like a systemic issue.

I've created an issue on the tracker: http://tracker.ceph.com/issues/16912

On Wed, Aug 3, 2016 at 1:30 PM, Ken Dreyer  wrote:

> For some reason, during the v10.2.2 release,
> ceph-dbg_10.0.2-1xenial_amd64.deb did not get transferred to
> http://download.ceph.com/debian-jewel/pool/main/c/ceph/
>
> - Ken
>
> On Wed, Aug 3, 2016 at 12:27 PM, J. Ryan Earl  wrote:
> > Hello,
> >
> > New to the list.  I'm working on performance tuning and testing a new
> Ceph
> > cluster built on Ubuntu 16.04 LTS and newest "Jewel" Ceph release.  I'm
> in
> > the process of collecting stack frames as part of a profiling inspection
> > using FlameGraph (https://github.com/brendangregg/FlameGraph) to inspect
> > where the CPU is spending time but need to load the 'dbg' packages to get
> > symbol information.  However, it appears the 'ceph-dbg' package has
> broken
> > dependencies:
> >
> > ceph1.oak:/etc/apt# apt-get install ceph-dbg
> > Reading package lists... Done
> > Building dependency tree
> > Reading state information... Done
> > Some packages could not be installed. This may mean that you have
> > requested an impossible situation or if you are using the unstable
> > distribution that some required packages have not yet been created
> > or been moved out of Incoming.
> > The following information may help to resolve the situation:
> >
> > The following packages have unmet dependencies:
> >  ceph-dbg : Depends: ceph (= 10.2.2-0ubuntu0.16.04.2) but 10.2.2-1xenial
> > is to be installed
> > E: Unable to correct problems, you have held broken packages.
> >
> > Any ideas on how to quickly work around this issue so I can continue
> > performance profiling?
> >
> > Thank you,
> > -JR
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cleaning Up Failed Multipart Uploads

2016-08-03 Thread Brian Felton
>
> The problem is that operations can happen concurrently, so the decision
> whether to remove or not to remove an entry is not very easy. We have seen
> before that application initiated multiple uploads of the same part, but
> the one that actually complete the last was not the last to upload (e.g.,
> due to networking timeouts and retries that happen in different layers).


I'm very aware of the issue since I reported that bug.  And your patch is
working great :)

Right, this is a separate issue. Did you try running 'radosgw-admin bucket
> check --fix'?


Yes.  Not only have I run 'bucket check' with all combinations of --fix and
--check-objects, I've also written a cleanup script to iterate through the
bucket shards in .rgw.buckets.index, iterate through the omap keys on each
shard, and check for entries that no longer exist in .rgw.buckets, removing
those stale omap keys.  While this at least cleans the 'multipart' objects
from the 'bucket list' output, it still doesn't kick off an update of the
bucket's stats.
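
For reference, the stale-key cleanup I described boils down to something like
this (a rough sketch; <bucket id> and the shard suffix are placeholders, and
rmomapkey is destructive, so test it on a throwaway bucket first):

# list the omap keys on one index shard
rados -p .rgw.buckets.index listomapkeys .dir.<bucket id>.0

# remove a key whose backing object no longer exists in .rgw.buckets
rados -p .rgw.buckets.index rmomapkey .dir.<bucket id>.0 '<stale key>'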

Brian

On Wed, Aug 3, 2016 at 1:19 PM, Yehuda Sadeh-Weinraub 
wrote:

>
>
> On Wed, Aug 3, 2016 at 10:57 AM, Brian Felton  wrote:
>
>> I should clarify:
>>
>> There doesn't seem to be a problem with list_multipart_parts -- upon
>> further review, it seems to be doing the right thing.  What tipped me off
>> is that when one aborts a multipart upload where parts have been uploaded
>> more than once, the last copy of each part uploaded is successfully removed
>> (not just removed from the bucket's stats, as with complete multipart, but
>> submitted for garbage collection).  The difference seems to be in the
>> following:
>>
>> In RGWCompleteMultipart::execute, the removal doesn't occur on the
>> entries returned from list_mutlpart_parts; instead, we initialize a
>> 'src_obj' rgw_obj structure and grab its index key
>> (src_obj.get_index_key(_key)), which is then pushed onto remove_objs.
>>
>
> iirc, we don't really remove the objects there. Only remove the entries
> from the index.
>
>
>>
>> In RGWAbortMultipart::execute, we operate directly on the
>> RGWUploadPartInfo value in the obj_parts map, submitting it for deletion
>> (gc) if its manifest is empty.
>>
>> If this is correct, there is no "fix" for list_multipart_parts; instead,
>> it would seem that the only fix is to not allow an upload part to generate
>> a new prefix in RGWPutObj::execute().
>>
>
> The problem is that operations can happen concurrently, so the decision
> whether to remove or not to remove an entry is not very easy. We have seen
> before that application initiated multiple uploads of the same part, but
> the one that actually complete the last was not the last to upload (e.g.,
> due to networking timeouts and retries that happen in different layers).
>
>
>> Since I don't really have any context on why a new prefix would be
>> generated if the object already exists, I'm not the least bit confident
>> that changing it will not have all sorts of unforeseen consequences.  That
>> said, since all knowledge of an uploaded part seems to vanish from
>> existence once it has been replaced, I don't see how the accounting of
>> multipart data will ever be correct.
>>
>
> Having a mutable part is problematic, since different uploads might step
> on each other (as with the example I provided above), and you end up with
> corrupted data.
>
>
>>
>> And yes, I've tried the orphan find, but I'm not really sure what to do
>> with the results.  The post I could find in the mailing list (mostly from
>> you), seemed to conclude that no action should be taken on the things that
>> it finds are orphaned.  Also, I have removed a significant number of
>> multipart and shadow files that are not valid, but none of that actually
>>
>
> The tool is not removing data, only reporting about possible leaked rados
> objects.
>
>
>> updates the buckets stats to the correct values.  If I had some mechanism
>> for forcing that, this would be much less of a big deal.
>>
>
> Right, this is a separate issue. Did you try running 'radosgw-admin bucket
> check --fix'?
>
> Yehuda
>
>
>>
>>
>> Brian
>>
>> On Wed, Aug 3, 2016 at 12:46 PM, Yehuda Sadeh-Weinraub > > wrote:
>>
>>>
>>>
>>> On Wed, Aug 3, 2016 at 10:10 AM, Brian Felton 
>>> wrote:
>>>
 This may just be me having a conversation with myself, but maybe this
 will be helpful to someone else.

 Having dug and dug and dug through the code, I've come to the following
 realizations:

1. When a multipart upload is completed, the function
list_multipart_parts in rgw_op.cc is called.  This seems to be the 
 start of
the problems, as it will only return those parts in the 'multipart'
namespace that include the upload id in the name, irrespective of how 
 many
copies of parts exist on the system with non-upload id prefixes
2. In the course of writing to the 

Re: [ceph-users] ceph-dbg package for Xenial (ubuntu-16.04.x) broken

2016-08-03 Thread Ken Dreyer
For some reason, during the v10.2.2 release,
ceph-dbg_10.2.2-1xenial_amd64.deb did not get transferred to
http://download.ceph.com/debian-jewel/pool/main/c/ceph/
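
Until that lands, one possible stop-gap (assuming the Ubuntu archive's own
build is acceptable for profiling) is to pin the whole stack to the distro
version so that the ceph and ceph-dbg versions match, e.g.:

  apt-get install ceph=10.2.2-0ubuntu0.16.04.2 ceph-dbg=10.2.2-0ubuntu0.16.04.2

though I have not tested that combination myself.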

- Ken

On Wed, Aug 3, 2016 at 12:27 PM, J. Ryan Earl  wrote:
> Hello,
>
> New to the list.  I'm working on performance tuning and testing a new Ceph
> cluster built on Ubuntu 16.04 LTS and newest "Jewel" Ceph release.  I'm in
> the process of collecting stack frames as part of a profiling inspection
> using FlameGraph (https://github.com/brendangregg/FlameGraph) to inspect
> where the CPU is spending time but need to load the 'dbg' packages to get
> symbol information.  However, it appears the 'ceph-dbg' package has broken
> dependencies:
>
> ceph1.oak:/etc/apt# apt-get install ceph-dbg
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> Some packages could not be installed. This may mean that you have
> requested an impossible situation or if you are using the unstable
> distribution that some required packages have not yet been created
> or been moved out of Incoming. The following information may help to
> resolve the situation:
>
> The following packages have unmet dependencies:
>  ceph-dbg : Depends: ceph (= 10.2.2-0ubuntu0.16.04.2) but 10.2.2-1xenial is to be installed
> E: Unable to correct problems, you have held broken packages.
>
> Any ideas on how to quickly work around this issue so I can continue
> performance profiling?
>
> Thank you,
> -JR
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-dbg package for Xenial (ubuntu-16.04.x) broken

2016-08-03 Thread J. Ryan Earl
Hello,

New to the list.  I'm working on performance tuning and testing a new Ceph
cluster built on Ubuntu 16.04 LTS and newest "Jewel" Ceph release.  I'm in
the process of collecting stack frames as part of a profiling inspection
using FlameGraph (https://github.com/brendangregg/FlameGraph) to inspect
where the CPU is spending time but need to load the 'dbg' packages to get
symbol information.  However, it appears the 'ceph-dbg' package has broken
dependencies:

ceph1.oak:/etc/apt# apt-get install ceph-dbg
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming. The following information may help to
resolve the situation:

The following packages have unmet dependencies:
 ceph-dbg : Depends: ceph (= 10.2.2-0ubuntu0.16.04.2) but 10.2.2-1xenial is to be installed
E: Unable to correct problems, you have held broken packages.

Any ideas on how to quickly work around this issue so I can continue
performance profiling?

Thank you,
-JR
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How using block device after cluster ceph on?

2016-08-03 Thread Patrick McGarry
Moving this to ceph-user so the broader community can weigh in.

However, I would recommend you please spell out your question in much
more detail if possible. Using a fragment like this will most likely
not get a response. Thanks.


On Tue, Aug 2, 2016 at 7:55 PM, Leandro  wrote:
> Hi.
> After getting Ceph running on two nodes plus an admin node: in order to use
> it, do I need another machine with only the Ceph client installed, so that I
> can then use the cluster's disk space as a block device?
> Thanks for help.



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cleaning Up Failed Multipart Uploads

2016-08-03 Thread Yehuda Sadeh-Weinraub
On Wed, Aug 3, 2016 at 10:57 AM, Brian Felton  wrote:

> I should clarify:
>
> There doesn't seem to be a problem with list_multipart_parts -- upon
> further review, it seems to be doing the right thing.  What tipped me off
> is that when one aborts a multipart upload where parts have been uploaded
> more than once, the last copy of each part uploaded is successfully removed
> (not just removed from the bucket's stats, as with complete multipart, but
> submitted for garbage collection).  The difference seems to be in the
> following:
>
> In RGWCompleteMultipart::execute, the removal doesn't occur on the entries
> returned from list_multipart_parts; instead, we initialize a 'src_obj'
> rgw_obj structure and grab its index key
> (src_obj.get_index_key(_key)), which is then pushed onto remove_objs.
>

iirc, we don't really remove the objects there. Only remove the entries
from the index.


>
> In RGWAbortMultipart::execute, we operate directly on the
> RGWUploadPartInfo value in the obj_parts map, submitting it for deletion
> (gc) if its manifest is empty.
>
> If this is correct, there is no "fix" for list_multipart_parts; instead,
> it would seem that the only fix is to not allow an upload part to generate
> a new prefix in RGWPutObj::execute().
>

The problem is that operations can happen concurrently, so the decision
whether to remove or not to remove an entry is not very easy. We have seen
before that application initiated multiple uploads of the same part, but
the one that actually complete the last was not the last to upload (e.g.,
due to networking timeouts and retries that happen in different layers).


> Since I don't really have any context on why a new prefix would be
> generated if the object already exists, I'm not the least bit confident
> that changing it will not have all sorts of unforeseen consequences.  That
> said, since all knowledge of an uploaded part seems to vanish from
> existence once it has been replaced, I don't see how the accounting of
> multipart data will ever be correct.
>

Having a mutable part is problematic, since different uploads might step on
each other (as with the example I provided above), and you end up with
corrupted data.


>
> And yes, I've tried the orphan find, but I'm not really sure what to do
> with the results.  The post I could find in the mailing list (mostly from
> you), seemed to conclude that no action should be taken on the things that
> it finds are orphaned.  Also, I have removed a significant number of
> multipart and shadow files that are not valid, but none of that actually
>

The tool is not removing data, only reporting about possible leaked rados
objects.


> updates the bucket's stats to the correct values.  If I had some mechanism
> for forcing that, this would be much less of a big deal.
>

Right, this is a separate issue. Did you try running 'radosgw-admin bucket
check --fix'?
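
(For reference, a minimal sketch of the invocation against a single bucket;
the bucket name below is a placeholder:

  radosgw-admin bucket check --bucket=mybucket
  radosgw-admin bucket check --bucket=mybucket --check-objects --fix
)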

Yehuda


>
>
> Brian
>
> On Wed, Aug 3, 2016 at 12:46 PM, Yehuda Sadeh-Weinraub 
> wrote:
>
>>
>>
>> On Wed, Aug 3, 2016 at 10:10 AM, Brian Felton  wrote:
>>
>>> This may just be me having a conversation with myself, but maybe this
>>> will be helpful to someone else.
>>>
>>> Having dug and dug and dug through the code, I've come to the following
>>> realizations:
>>>
>>>1. When a multipart upload is completed, the function
>>>list_multipart_parts in rgw_op.cc is called.  This seems to be the start 
>>> of
>>>the problems, as it will only return those parts in the 'multipart'
>>>namespace that include the upload id in the name, irrespective of how 
>>> many
>>>copies of parts exist on the system with non-upload id prefixes
>>>2. In the course of writing to the OSDs, a list (remove_objs) is
>>>processed in cls_rgw.cc:unaccount_entry(), causing bucket stats to be
>>>decremented
>>>3. These decremented stats are written to the bucket's index
>>>entry/entries in .rgw.buckets.index via the CEPH_OSD_OP_OMAPSETHEADER 
>>> case
>>>in ReplicatedPG::do_osd_ops
>>>
>>> So this explains why manually removing the multipart entries from
>>> .rgw.buckets and cleaning the shadow entries in .rgw.buckets.index does not
>>> cause the bucket's stats to be updated.  What I don't know how to do is
>>> force an update of the bucket's stats from the CLI.  I can retrieve the
>>> omap header from each of the bucket's shards in .rgw.buckets.index, but I
>>> don't have the first clue how to read the data or rebuild it into something
>>> valid.  I've searched the docs and mailing list archives, but I didn't find
>>> any solution to this problem.  For what it's worth, I've tried 'bucket
>>> check' with all combinations of '--check-objects' and '--fix' after
>>> cleaning up .rgw.buckets and .rgw.buckets.index.
>>>
>>> From a long-term perspective, it seems there are two possible fixes here:
>>>
>>>1. Update the logic in list_multipart_parts to return all the parts
>>>for a multipart 

Re: [ceph-users] Cleaning Up Failed Multipart Uploads

2016-08-03 Thread Brian Felton
I should clarify:

There doesn't seem to be a problem with list_multipart_parts -- upon
further review, it seems to be doing the right thing.  What tipped me off
is that when one aborts a multipart upload where parts have been uploaded
more than once, the last copy of each part uploaded is successfully removed
(not just removed from the bucket's stats, as with complete multipart, but
submitted for garbage collection).  The difference seems to be in the
following:

In RGWCompleteMultipart::execute, the removal doesn't occur on the entries
returned from list_multipart_parts; instead, we initialize a 'src_obj'
rgw_obj structure and grab its index key
(src_obj.get_index_key(_key)), which is then pushed onto remove_objs.

In RGWAbortMultipart::execute, we operate directly on the RGWUploadPartInfo
value in the obj_parts map, submitting it for deletion (gc) if its manifest
is empty.

If this is correct, there is no "fix" for list_multipart_parts; instead, it
would seem that the only fix is to not allow an upload part to generate a
new prefix in RGWPutObj::execute().  Since I don't really have any context
on why a new prefix would be generated if the object already exists, I'm
not the least bit confident that changing it will not have all sorts of
unforeseen consequences.  That said, since all knowledge of an uploaded
part seems to vanish from existence once it has been replaced, I don't see
how the accounting of multipart data will ever be correct.

And yes, I've tried the orphan find, but I'm not really sure what to do
with the results.  The post I could find in the mailing list (mostly from
you), seemed to conclude that no action should be taken on the things that
it finds are orphaned.  Also, I have removed a significant number of
multipart and shadow files that are not valid, but none of that actually
updates the bucket's stats to the correct values.  If I had some mechanism
for forcing that, this would be much less of a big deal.

Brian

On Wed, Aug 3, 2016 at 12:46 PM, Yehuda Sadeh-Weinraub 
wrote:

>
>
> On Wed, Aug 3, 2016 at 10:10 AM, Brian Felton  wrote:
>
>> This may just be me having a conversation with myself, but maybe this
>> will be helpful to someone else.
>>
>> Having dug and dug and dug through the code, I've come to the following
>> realizations:
>>
>>1. When a multipart upload is completed, the function
>>list_multipart_parts in rgw_op.cc is called.  This seems to be the start 
>> of
>>the problems, as it will only return those parts in the 'multipart'
>>namespace that include the upload id in the name, irrespective of how many
>>copies of parts exist on the system with non-upload id prefixes
>>2. In the course of writing to the OSDs, a list (remove_objs) is
>>processed in cls_rgw.cc:unaccount_entry(), causing bucket stats to be
>>decremented
>>3. These decremented stats are written to the bucket's index
>>entry/entries in .rgw.buckets.index via the CEPH_OSD_OP_OMAPSETHEADER case
>>in ReplicatedPG::do_osd_ops
>>
>> So this explains why manually removing the multipart entries from
>> .rgw.buckets and cleaning the shadow entries in .rgw.buckets.index does not
>> cause the bucket's stats to be updated.  What I don't know how to do is
>> force an update of the bucket's stats from the CLI.  I can retrieve the
>> omap header from each of the bucket's shards in .rgw.buckets.index, but I
>> don't have the first clue how to read the data or rebuild it into something
>> valid.  I've searched the docs and mailing list archives, but I didn't find
>> any solution to this problem.  For what it's worth, I've tried 'bucket
>> check' with all combinations of '--check-objects' and '--fix' after
>> cleaning up .rgw.buckets and .rgw.buckets.index.
>>
>> From a long-term perspective, it seems there are two possible fixes here:
>>
>>1. Update the logic in list_multipart_parts to return all the parts
>>for a multipart object, so that *all* parts in the 'multipart' namespace
>>can be properly removed
>>2. Update the logic in RGWPutObj::execute() to not restart a write if
>>the put_data_and_throttle() call returns -EEXIST but instead put the data
>>in the original file(s)
>>
>> While I think 2 would involve the least amount of yak shaving with the
>> multipart logic since the MP logic already assumes a happy path where all
>> objects have a prefix of the multipart upload id, I'm all but certain this
>> is going to horribly break many other parts of the system that I don't
>> fully understand.
>>
>
> #2 is dangerous. That was the original behavior, and it is racy and *will*
> lead to data corruption.  OTOH, I don't think #1 is an easy option. We only
> keep a single entry per part, so we don't really have a good way to see all
> the uploaded pieces. We could extend the meta object to keep record of all
> the uploaded parts, and at the end, when assembling everything remove the
> parts that aren't part of the final 

Re: [ceph-users] Cleaning Up Failed Multipart Uploads

2016-08-03 Thread Yehuda Sadeh-Weinraub
On Wed, Aug 3, 2016 at 10:10 AM, Brian Felton  wrote:

> This may just be me having a conversation with myself, but maybe this will
> be helpful to someone else.
>
> Having dug and dug and dug through the code, I've come to the following
> realizations:
>
>1. When a multipart upload is completed, the function
>list_multipart_parts in rgw_op.cc is called.  This seems to be the start of
>the problems, as it will only return those parts in the 'multipart'
>namespace that include the upload id in the name, irrespective of how many
>copies of parts exist on the system with non-upload id prefixes
>2. In the course of writing to the OSDs, a list (remove_objs) is
>processed in cls_rgw.cc:unaccount_entry(), causing bucket stats to be
>decremented
>3. These decremented stats are written to the bucket's index
>entry/entries in .rgw.buckets.index via the CEPH_OSD_OP_OMAPSETHEADER case
>in ReplicatedPG::do_osd_ops
>
> So this explains why manually removing the multipart entries from
> .rgw.buckets and cleaning the shadow entries in .rgw.buckets.index does not
> cause the bucket's stats to be updated.  What I don't know how to do is
> force an update of the bucket's stats from the CLI.  I can retrieve the
> omap header from each of the bucket's shards in .rgw.buckets.index, but I
> don't have the first clue how to read the data or rebuild it into something
> valid.  I've searched the docs and mailing list archives, but I didn't find
> any solution to this problem.  For what it's worth, I've tried 'bucket
> check' with all combinations of '--check-objects' and '--fix' after
> cleaning up .rgw.buckets and .rgw.buckets.index.
>
> From a long-term perspective, it seems there are two possible fixes here:
>
>1. Update the logic in list_multipart_parts to return all the parts
>for a multipart object, so that *all* parts in the 'multipart' namespace
>can be properly removed
>2. Update the logic in RGWPutObj::execute() to not restart a write if
>the put_data_and_throttle() call returns -EEXIST but instead put the data
>in the original file(s)
>
> While I think 2 would involve the least amount of yak shaving with the
> multipart logic since the MP logic already assumes a happy path where all
> objects have a prefix of the multipart upload id, I'm all but certain this
> is going to horribly break many other parts of the system that I don't
> fully understand.
>

#2 is dangerous. That was the original behavior, and it is racy and *will*
lead to data corruption.  OTOH, I don't think #1 is an easy option. We only
keep a single entry per part, so we don't really have a good way to see all
the uploaded pieces. We could extend the meta object to keep record of all
the uploaded parts, and at the end, when assembling everything remove the
parts that aren't part of the final assembly.

> The good news is that the assembly of the multipart object is being done
> correctly; what I can't figure out is how it knows about the non-upload id
> prefixes when creating the metadata on the multipart object in
> .rgw.buckets.  My best guess is that it's copying the metadata from the
> 'meta' object in .rgw.buckets.extra (which is correctly updated with the
> new part prefixes after each successful upload), but I haven't absolutely
> confirmed that.
>

Yeah, something along these lines.


> If one of the developer folk that are more familiar with this could weigh
> in, I would be greatly appreciative.
>

btw, did you try to run the radosgw-admin orphan find tool?
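
(i.e. something along the lines of the following; the pool and job id here
are placeholders:

  radosgw-admin orphans find --pool=.rgw.buckets --job-id=orphans1
  radosgw-admin orphans list-jobs
)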

Yehuda

> Brian
>
> On Tue, Aug 2, 2016 at 8:59 AM, Brian Felton  wrote:
>
>> I am actively working through the code and debugging everything.  I
>> figure the issue is with how RGW is listing the parts of a multipart upload
>> when it completes or aborts the upload (read: it's not getting *all* the
>> parts, just those that are either most recent or tagged with the upload
>> id).  As soon as I can figure out a patch, or, more importantly, how to
>> manually address the problem, I will respond with instructions.
>>
>> The reported bug contains detailed instructions on reproducing the
>> problem, so it's trivial to reproduce and test on a small and/or new
>> cluster.
>>
>> Brian
>>
>>
>> On Tue, Aug 2, 2016 at 8:53 AM, Tyler Bishop <
>> tyler.bis...@beyondhosting.net> wrote:
>>
>>> We're having the same issues.   I have a 1200TB pool at 90% utilization
>>> however disk utilization is only 40%
>>>
>>>
>>>
>>>
>>>
>>> *Tyler Bishop *Chief Technical Officer
>>> 513-299-7108 x10
>>>
>>> tyler.bis...@beyondhosting.net
>>>
>>> If you are not the intended recipient of this transmission you are
>>> notified that disclosing, copying, distributing or taking any action in
>>> reliance on the contents of this information is strictly prohibited.
>>>
>>>
>>>
>>> --
>>> *From: *"Brian Felton" 

Re: [ceph-users] Cleaning Up Failed Multipart Uploads

2016-08-03 Thread Brian Felton
This may just be me having a conversation with myself, but maybe this will
be helpful to someone else.

Having dug and dug and dug through the code, I've come to the following
realizations:

   1. When a multipart upload is completed, the function
   list_multipart_parts in rgw_op.cc is called.  This seems to be the start of
   the problems, as it will only return those parts in the 'multipart'
   namespace that include the upload id in the name, irrespective of how many
   copies of parts exist on the system with non-upload id prefixes
   2. In the course of writing to the OSDs, a list (remove_objs) is
   processed in cls_rgw.cc:unaccount_entry(), causing bucket stats to be
   decremented
   3. These decremented stats are written to the bucket's index
   entry/entries in .rgw.buckets.index via the CEPH_OSD_OP_OMAPSETHEADER case
   in ReplicatedPG::do_osd_ops

So this explains why manually removing the multipart entries from
.rgw.buckets and cleaning the shadow entries in .rgw.buckets.index does not
cause the bucket's stats to be updated.  What I don't know how to do is
force an update of the bucket's stats from the CLI.  I can retrieve the
omap header from each of the bucket's shards in .rgw.buckets.index, but I
don't have the first clue how to read the data or rebuild it into something
valid.  I've searched the docs and mailing list archives, but I didn't find
any solution to this problem.  For what it's worth, I've tried 'bucket
check' with all combinations of '--check-objects' and '--fix' after
cleaning up .rgw.buckets and .rgw.buckets.index.
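
(For anyone following along, this is roughly how I pulled the header, plus a
possible way to decode it for inspection; the index object name below is a
placeholder, and I am assuming rgw_bucket_dir_header is one of the types
ceph-dencoder knows about:

  rados -p .rgw.buckets.index getomapheader .dir.default.12345.1.0 /tmp/hdr
  ceph-dencoder type rgw_bucket_dir_header import /tmp/hdr decode dump_json
)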

From a long-term perspective, it seems there are two possible fixes here:

   1. Update the logic in list_multipart_parts to return all the parts for
   a multipart object, so that *all* parts in the 'multipart' namespace can be
   properly removed
   2. Update the logic in RGWPutObj::execute() to not restart a write if
   the put_data_and_throttle() call returns -EEXIST but instead put the data
   in the original file(s)

While I think 2 would involve the least amount of yak shaving with the
multipart logic since the MP logic already assumes a happy path where all
objects have a prefix of the multipart upload id, I'm all but certain this
is going to horribly break many other parts of the system that I don't
fully understand.

The good news is that the assembly of the multipart object is being done
correctly; what I can't figure out is how it knows about the non-upload id
prefixes when creating the metadata on the multipart object in
.rgw.buckets.  My best guess is that it's copying the metadata from the
'meta' object in .rgw.buckets.extra (which is correctly updated with the
new part prefixes after each successful upload), but I haven't absolutely
confirmed that.

If one of the developer folk that are more familiar with this could weigh
in, I would be greatly appreciative.

Brian

On Tue, Aug 2, 2016 at 8:59 AM, Brian Felton  wrote:

> I am actively working through the code and debugging everything.  I figure
> the issue is with how RGW is listing the parts of a multipart upload when
> it completes or aborts the upload (read: it's not getting *all* the parts,
> just those that are either most recent or tagged with the upload id).  As
> soon as I can figure out a patch, or, more importantly, how to manually
> address the problem, I will respond with instructions.
>
> The reported bug contains detailed instructions on reproducing the
> problem, so it's trivial to reproduce and test on a small and/or new
> cluster.
>
> Brian
>
>
> On Tue, Aug 2, 2016 at 8:53 AM, Tyler Bishop <
> tyler.bis...@beyondhosting.net> wrote:
>
>> We're having the same issues.   I have a 1200TB pool at 90% utilization
>> however disk utilization is only 40%
>>
>>
>>
>>
>>
>> *Tyler Bishop *Chief Technical Officer
>> 513-299-7108 x10
>>
>> tyler.bis...@beyondhosting.net
>>
>> If you are not the intended recipient of this transmission you are
>> notified that disclosing, copying, distributing or taking any action in
>> reliance on the contents of this information is strictly prohibited.
>>
>>
>>
>> --
>> *From: *"Brian Felton" 
>> *To: *"ceph-users" 
>> *Sent: *Wednesday, July 27, 2016 9:24:30 AM
>> *Subject: *[ceph-users] Cleaning Up Failed Multipart Uploads
>>
>> Greetings,
>>
>> Background: If an object storage client re-uploads parts to a multipart
>> object, RadosGW does not clean up all of the parts properly when the
>> multipart upload is aborted or completed.  You can read all of the gory
>> details (including reproduction steps) in this bug report:
>> http://tracker.ceph.com/issues/16767.
>>
>> My setup: Hammer 0.94.6 cluster only used for S3-compatible object
>> storage.  RGW stripe size is 4MiB.
>>
>> My problem: I have buckets that are reporting TB more utilization (and,
>> in one case, 200k more objects) than they should report.  

Re: [ceph-users] Automount Failovered Multi MDS CephFS

2016-08-03 Thread John Spray
On Wed, Aug 3, 2016 at 5:24 PM, Lazuardi Nasution
 wrote:
> Hi John,
>
> If I have multi MON, should I put all MON IPs on /etc/fstab?

Yes, you put all your mons in your /etc/fstab entry.
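
For example (the monitor addresses and secret file path here are placeholders):

  192.168.135.31:6789,192.168.135.32:6789,192.168.135.33:6789:/ /cephfs ceph name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime 0 0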

> Is there any
> way to overcome MDS metadata bottleneck when only single MDS active?

Nope, you're going to have to wait for a future release that has
stable multi-MDS.

> In case on loadbalanced file/web servers, which one is better, each server
> mount to other replicated/distributed FS (for example via GlusterFS) on top
> of CephFS (different directory) or just directly mounts the same directory
> of CephFS?

You *definitely* don't want to layer another distributed filesystem on
top of cephfs.  If you're going to use CephFS, you mount it directly
on the client nodes where you want to use it.

John

>
> Best regards,
>
> On Wed, Aug 3, 2016 at 11:14 PM, John Spray  wrote:
>>
>> On Wed, Aug 3, 2016 at 5:10 PM, Lazuardi Nasution
>>  wrote:
>> > Hi,
>> >
>> > I'm looking for example about what to put on /etc/fstab if I want to
>> > auto
>> > mount CephFS on failovered multi MDS (only one MDS is active) especially
>> > with Jewel. My target is to build loadbalanced file/web servers with
>> > CephFS
>> > backend.
>>
>> The MDS configuration doesn't affect what you put in fstab, because
>> the addresses in fstab are just the mons (the client talks to the mons
>> to learn where to find the MDSs).
>>
>> So refer to the docs (http://docs.ceph.com/docs/master/cephfs/fstab/),
>> and if you have multiple monitors you put them in a comma separated
>> list.
>>
>> John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Automount Failovered Multi MDS CephFS

2016-08-03 Thread Lazuardi Nasution
Hi John,

If I have multi MON, should I put all MON IPs on /etc/fstab? Is there any
way to overcome MDS metadata bottleneck when only single MDS active?

In case on loadbalanced file/web servers, which one is better, each server
mount to other replicated/distributed FS (for example via GlusterFS) on top
of CephFS (different directory) or just directly mounts the same directory
of CephFS?

Best regards,

On Wed, Aug 3, 2016 at 11:14 PM, John Spray  wrote:

> On Wed, Aug 3, 2016 at 5:10 PM, Lazuardi Nasution
>  wrote:
> > Hi,
> >
> > I'm looking for example about what to put on /etc/fstab if I want to auto
> > mount CephFS on failovered multi MDS (only one MDS is active) especially
> > with Jewel. My target is to build loadbalanced file/web servers with
> CephFS
> > backend.
>
> The MDS configuration doesn't affect what you put in fstab, because
> the addresses in fstab are just the mons (the client talks to the mons
> to learn where to find the MDSs).
>
> So refer to the docs (http://docs.ceph.com/docs/master/cephfs/fstab/),
> and if you have multiple monitors you put them in a comma separated
> list.
>
> John
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Automount Failovered Multi MDS CephFS

2016-08-03 Thread Lazuardi Nasution
Hi,

I'm looking for example about what to put on /etc/fstab if I want to auto
mount CephFS on failovered multi MDS (only one MDS is active) especially
with Jewel. My target is to build loadbalanced file/web servers with CephFS
backend.

Best regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CDM Starting in 15m

2016-08-03 Thread Patrick McGarry
Just a reminder, the Ceph Developer Monthly planning meeting is
starting online in approx 15m

http://wiki.ceph.com/Planning

-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Automount Failovered Multi MDS CephFS

2016-08-03 Thread John Spray
On Wed, Aug 3, 2016 at 5:10 PM, Lazuardi Nasution
 wrote:
> Hi,
>
> I'm looking for example about what to put on /etc/fstab if I want to auto
> mount CephFS on failovered multi MDS (only one MDS is active) especially
> with Jewel. My target is to build loadbalanced file/web servers with CephFS
> backend.

The MDS configuration doesn't affect what you put in fstab, because
the addresses in fstab are just the mons (the client talks to the mons
to learn where to find the MDSs).

So refer to the docs (http://docs.ceph.com/docs/master/cephfs/fstab/),
and if you have multiple monitors you put them in a comma separated
list.

John



>
> Best regards,
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Automount Failovered Multi MDS CephFS

2016-08-03 Thread Daniel Schwager
Maybe something like this?

192.168.135.31:6789:/ /cephfs ceph 
name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime  0 0

Best regards
Daniel

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Lazuardi Nasution
Sent: Wednesday, August 03, 2016 6:10 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Automount Failovered Multi MDS CephFS

Hi,

I'm looking for example about what to put on /etc/fstab if I want to auto mount 
CephFS on failovered multi MDS (only one MDS is active) especially with Jewel. 
My target is to build loadbalanced file/web servers with CephFS backend.

Best regards,


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-03 Thread Alex Gorbachev
On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev  wrote:
> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin  wrote:
>> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov  wrote:
 On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  
 wrote:
> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  
> wrote:
>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>> Hi Ilya,
>>>
>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
 On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev 
  wrote:
> RBD illustration showing RBD ignoring discard until a certain
> threshold - why is that?  This behavior is unfortunately incompatible
> with ESXi discard (UNMAP) behavior.
>
> Is there a way to lower the discard sensitivity on RBD devices?
>
>>> 
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 782336 KB

 Think about it in terms of underlying RADOS objects (4M by default).
 There are three cases:

 discard range   | command
 -
 whole object| delete
 object's tail   | truncate
 object's head   | zero

 Obviously, only delete and truncate free up space.  In all of your
 examples, except the last one, you are attempting to discard the head
 of the (first) object.

 You can free up as little as a sector, as long as it's the tail:

 Offset    Length  Type
 0 4194304 data

 # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28

 Offset    Length  Type
 0 4193792 data
>>>
>>> Looks like ESXi is sending in each discard/unmap with the fixed
>>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>>> is a slight reduction in size via rbd diff method, but now I
>>> understand that actual truncate only takes effect when the discard
>>> happens to clip the tail of an image.
>>>
>>> So far looking at
>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>>
>>> ...the only variable we can control is the count of 8192-sector chunks
>>> and not their size.  Which means that most of the ESXi discard
>>> commands will be disregarded by Ceph.
>>>
>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>
>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>> 1342099456, nr_sects 8192)
>>
>> Yes, correct. However, to make sure that VMware is not (erroneously) 
>> enforced to do this, you need to perform one more check.
>>
>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here 
>> correct granularity and alignment (4M, I guess?)
>
> This seems to reflect the granularity (4194304), which matches the
> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
> value.
>
> Can discard_alignment be specified with RBD?

 It's exported as a read-only sysfs attribute, just like
 discard_granularity:

 # cat /sys/block/rbd0/discard_alignment
 4194304
>>>
>>> Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
>>> discard_alignment in /sys/block//queue, but for RBD it's in
>>> /sys/block/ - could this be the source of the issue?
>>
>> No. As you can see below, the alignment reported correctly. So, this must be 
>> VMware
>> issue, because it is ignoring the alignment parameter. You can try to align 
>> your VMware
>> partition on 4M boundary, it might help.
>
> Is this not a mismatch:
>
> - From sg_inq: Unmap granularity alignment: 8192
>
> - From "cat /sys/block/rbd0/discard_alignment": 4194304
>
> I am compiling the latest SCST trunk now.

Scratch that, please, I just did a test that shows correct calculation
of 4MB in sectors.

- On iSCSI client node:

dd if=/dev/urandom of=/dev/sdf bs=1M count=800
blkdiscard -o 0 -l 4194304 /dev/sdf

- On iSCSI server node:

Aug  3 10:50:57 e1 kernel: [  893.444538] [1381]:
vdisk_unmap_range:3832:Discarding (start_sector 0, nr_sects 8192)

(8192 * 512 = 4194304)

Now proceeding to test discard again with the latest SCST trunk build.
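
As a further sanity check on the backstore side: per Ilya's breakdown, a
discard that exactly covers a whole 4M object should show up as freed space
in rbd diff (image and device names as earlier in this thread):

rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
blkdiscard -o 0 -l 4194304 /dev/rbd28
rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'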


>
> Thanks,
> Alex
>
>>
>>> Here is what I get querying the iscsi-exported RBD device on Linux:
>>>
>>> 

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-03 Thread Alex Gorbachev
On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin  wrote:
> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov  wrote:
>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  
>>> wrote:
 On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  
 wrote:
> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>> Hi Ilya,
>>
>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev 
>>>  wrote:
 RBD illustration showing RBD ignoring discard until a certain
 threshold - why is that?  This behavior is unfortunately incompatible
 with ESXi discard (UNMAP) behavior.

 Is there a way to lower the discard sensitivity on RBD devices?

>> 

 root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 819200 KB

 root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 782336 KB
>>>
>>> Think about it in terms of underlying RADOS objects (4M by default).
>>> There are three cases:
>>>
>>> discard range   | command
>>> -
>>> whole object| delete
>>> object's tail   | truncate
>>> object's head   | zero
>>>
>>> Obviously, only delete and truncate free up space.  In all of your
>>> examples, except the last one, you are attempting to discard the head
>>> of the (first) object.
>>>
>>> You can free up as little as a sector, as long as it's the tail:
>>>
>>> Offset    Length  Type
>>> 0 4194304 data
>>>
>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>
>>> Offset    Length  Type
>>> 0 4193792 data
>>
>> Looks like ESXi is sending in each discard/unmap with the fixed
>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>> is a slight reduction in size via rbd diff method, but now I
>> understand that actual truncate only takes effect when the discard
>> happens to clip the tail of an image.
>>
>> So far looking at
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>
>> ...the only variable we can control is the count of 8192-sector chunks
>> and not their size.  Which means that most of the ESXi discard
>> commands will be disregarded by Ceph.
>>
>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>
>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>> 1342099456, nr_sects 8192)
>
> Yes, correct. However, to make sure that VMware is not (erroneously) 
> enforced to do this, you need to perform one more check.
>
> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here 
> correct granularity and alignment (4M, I guess?)

 This seems to reflect the granularity (4194304), which matches the
 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
 value.

 Can discard_alignment be specified with RBD?
>>>
>>> It's exported as a read-only sysfs attribute, just like
>>> discard_granularity:
>>>
>>> # cat /sys/block/rbd0/discard_alignment
>>> 4194304
>>
>> Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
>> discard_alignment in /sys/block//queue, but for RBD it's in
>> /sys/block/ - could this be the source of the issue?
>
> No. As you can see below, the alignment reported correctly. So, this must be 
> VMware
> issue, because it is ignoring the alignment parameter. You can try to align 
> your VMware
> partition on 4M boundary, it might help.

Is this not a mismatch:

- From sg_inq: Unmap granularity alignment: 8192

- From "cat /sys/block/rbd0/discard_alignment": 4194304

I am compiling the latest SCST trunk now.

Thanks,
Alex

>
>> Here is what I get querying the iscsi-exported RBD device on Linux:
>>
>> root@kio1:/sys/block/sdf#  sg_inq -p 0xB0 /dev/sdf
>> VPD INQUIRY: Block limits page (SBC)
>>   Maximum compare and write length: 255 blocks
>>   Optimal transfer length granularity: 8 blocks
>>   Maximum transfer length: 16384 blocks
>>   Optimal transfer length: 1024 blocks
>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>   Maximum unmap LBA count: 8192
>>   Maximum unmap block descriptor count: 4294967295
>>   Optimal unmap granularity: 8192
>>   Unmap granularity alignment valid: 1
>>   Unmap granularity alignment: 8192
>
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Number of PGs: fix from start or change as we grow ?

2016-08-03 Thread Christian Balzer

Hello,

On Wed, 3 Aug 2016 15:15:21 +0300 Maged Mokhtar wrote:

> Hello,
> 
> I would like to build a small cluster with 20 disks to start but in the 
> future would like to gradually increase it to maybe 200 disks.
> Is it better to fix the number of PGs in the pool from the beginning or is it 
> better to start with a small number and then gradually change the number of 
> PGs as the system grows ?
> Is the act of changing the number of PGs in a running cluster something that 
> can be done regularly ? 
> 

This is both something that's strongly hinted at in the documentation as
well as discussed countless times on this ML (google is your friend),
along with the means to minimize the impact of this action.

Setting the "correct" PG value for a 200 OSD cluster (8192) at the start
with 20 OSDs (recommended value 512) is not going to be pretty and will
have your cluster in a warning state with about 1200 (!) PGs per OSD at
the very least.
Never mind CPU and RAM usage. 

Increasing PGs is an involved and costly operation, so it should be done
as little as possible. 
However if your cluster is designed/configured well and not operating
constantly at its breaking point, it's also an operation that should be
doable w/o major impacts. 

I'd start with 1024 PGs on those 20 OSDs, at 50 OSDs go to 4096 PGs and at
around 100 OSDs it is safe to go to 8192 PGs.
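
(As a sketch, assuming everything lives in a single pool named "rbd", each
step is just the pair:

ceph osd pool set rbd pg_num 4096
ceph osd pool set rbd pgp_num 4096

raising pg_num first and then bringing pgp_num up to match, since the latter
is what actually triggers the data movement.)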

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ubuntu 14.04 Striping / RBD / Single Thread Performance

2016-08-03 Thread w...@globe.de

Hi List,
I am using Ceph Infernalis and Ubuntu 14.04 with kernel 3.13.
18 data servers / 3 MONs / 3 RBD clients

I want to use RBD on the Client with image format 2 and Striping.
Is it supported?

I want to create rbd with:
rbd create testrbd -s 2T --image-format=2 --image-feature=striping 
--image-feature=exclusive-lock --stripe-unit 65536B --stripe-count 8


Do I get better single-thread performance with a higher stripe count?
If not, should I use Ubuntu 16.04 with kernel 4.4? Is striping supported with
that kernel?


The manpage says:

http://manpages.ubuntu.com/manpages/wily/man8/rbd.8.html


   *PARAMETERS*

   *--image-format*  *format*
  Specifies which object layout to use. The default is 1.

  · format  2  -  Use the second rbd format, which is supported by
*librbd and kernel since version 3.11 (except for striping).*
This adds support for cloning and is more easily extensible to
allow more features in the future.


Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of PGs: fix from start or change as we grow ?

2016-08-03 Thread Luis Periquito
Changing the number of PGs is one of the most expensive operations you
can run, and should be avoided as much as possible.

Having said that you should try to avoid having way too many PGs with
very few OSDs, but it's certainly preferable to splitting PGs...

On Wed, Aug 3, 2016 at 1:15 PM, Maged Mokhtar
 wrote:
> Hello,
>
> I would like to build a small cluster with 20 disks to start but in the
> future would like to gradually increase it to maybe 200 disks.
> Is it better to fix the number of PGs in the pool from the beginning or is
> it better to start with a small number and then gradually change the number
> of PGs as the system grows ?
> Is the act of changing the number of PGs in a running cluster something that
> can be done regularly ?
>
> Cheers
> /Maged
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Christian Balzer

Hello,

On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote:

> Christian, can you post your values for Power_Loss_Cap_Test on the drive 
> which is failing?
>
Sure:
---
175 Power_Loss_Cap_Test 0x0033   001   001   010Pre-fail  Always   
FAILING_NOW 1 (47 942)
---

Now according to the Intel data sheet that value of 1 means failed, NOT
the actual buffer time it usually means, like this on the neighboring SSD:
---
175 Power_Loss_Cap_Test 0x0033   100   100   010Pre-fail  Always   
-   614 (47 944)
---

And my 800GB DC S3610s have more than 10 times the endurance, my guess is
a combo of larger cache and slower writes:
---
175 Power_Loss_Cap_Test 0x0033   100   100   010Pre-fail  Always   
-   8390 (22 7948)
---

I'll definitely leave that "failing" SSD in place until it has done the
next self-check.
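
(For anyone wanting to poke at their own drives in the meantime, something
like the following kicks off a short self-test and rereads the attribute;
the device name is a placeholder:

smartctl -t short /dev/sdX
smartctl -l selftest /dev/sdX
smartctl -A /dev/sdX | grep -i power_loss
)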

Christian

> Thanks
> Jan
> 
> > On 03 Aug 2016, at 13:33, Christian Balzer  wrote:
> > 
> > 
> > Hello,
> > 
> > yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> > seemed to be such an odd thing to fail (given that's not single capacitor).
> > 
> > As for your Reallocated_Sector_Ct, that's really odd and definitely a RMA
> > worthy issue. 
> > 
> > For the record, Intel SSDs use (typically 24) sectors when doing firmware
> > upgrades, so this is a totally healthy 3610. ^o^
> > ---
> >  5 Reallocated_Sector_Ct   0x0032   099   099   000Old_age   Always 
> >   -   24
> > ---
> > 
> > Christian
> > 
> > On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> > 
> >> Right, I actually updated to smartmontools 6.5+svn4324, which now
> >> properly supports this drive model. Some of the smart attr names have
> >> changed, and make more sense now (and there are no more "Unknowns"):
> >> 
> >> ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
> >>  5 Reallocated_Sector_Ct   -O--CK   081   081   000-944
> >>  9 Power_On_Hours  -O--CK   100   100   000-1067
> >> 12 Power_Cycle_Count   -O--CK   100   100   000-7
> >> 170 Available_Reservd_Space PO--CK   085   085   010-0
> >> 171 Program_Fail_Count  -O--CK   100   100   000-0
> >> 172 Erase_Fail_Count-O--CK   100   100   000-68
> >> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000-6
> >> 175 Power_Loss_Cap_Test PO--CK   100   100   010-6510 (4 4307)
> >> 183 SATA_Downshift_Count-O--CK   100   100   000-0
> >> 184 End-to-End_ErrorPO--CK   100   100   090-0
> >> 187 Reported_Uncorrect  -O--CK   100   100   000-0
> >> 190 Temperature_Case-O---K   070   065   000-30 (Min/Max
> >> 25/35)
> >> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000-6
> >> 194 Temperature_Internal-O---K   100   100   000-30
> >> 197 Current_Pending_Sector  -O--C-   100   100   000-1100
> >> 199 CRC_Error_Count -OSRCK   100   100   000-0
> >> 225 Host_Writes_32MiB   -O--CK   100   100   000-20135
> >> 226 Workld_Media_Wear_Indic -O--CK   100   100   000-20
> >> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000-82
> >> 228 Workload_Minutes-O--CK   100   100   000-64012
> >> 232 Available_Reservd_Space PO--CK   084   084   010-0
> >> 233 Media_Wearout_Indicator -O--CK   100   100   000-0
> >> 234 Thermal_Throttle-O--CK   100   100   000-0/0
> >> 241 Host_Writes_32MiB   -O--CK   100   100   000-20135
> >> 242 Host_Reads_32MiB-O--CK   100   100   000-92945
> >> 243 NAND_Writes_32MiB   -O--CK   100   100   000-95289
> >> 
> >> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
> >> seems to be holding steady.
> >> 
> >> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
> >> death. The drive simply disappeared from the controller one day, and
> >> could no longer be detected.
> >> 
> >> On 03/08/16 12:15, Jan Schermer wrote:
> >>> Make sure you are reading the right attribute and interpreting it right.
> >>> update-smart-drivedb sometimes makes wonders :)
> >>> 
> >>> I wonder what isdct tool would say the drive's life expectancy is with 
> >>> this workload? Are you really writing ~600TB/month??
> >>> 
> >>> Jan
> >>> 
> >> 
> >> 
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> 
> > 
> > 
> > -- 
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com  

[ceph-users] Number of PGs: fix from start or change as we grow ?

2016-08-03 Thread Maged Mokhtar
Hello,

I would like to build a small cluster with 20 disks to start but in the future 
would like to gradually increase it to maybe 200 disks.
Is it better to fix the number of PGs in the pool from the beginning or is it 
better to start with a small number and then gradually change the number of PGs 
as the system grows ?
Is the act of changing the number of PGs in a running cluster something that 
can be done regularly ? 

Cheers 
/Maged



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Christian, can you post your values for Power_Loss_Cap_Test on the drive which 
is failing?

Thanks
Jan

> On 03 Aug 2016, at 13:33, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> seemed to be such an odd thing to fail (given that's not single capacitor).
> 
> As for your Reallocated_Sector_Ct, that's really odd and definitely a RMA
> worthy issue. 
> 
> For the record, Intel SSDs use (typically 24) sectors when doing firmware
> upgrades, so this is a totally healthy 3610. ^o^
> ---
>  5 Reallocated_Sector_Ct   0x0032   099   099   000Old_age   Always   
> -   24
> ---
> 
> Christian
> 
> On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> 
>> Right, I actually updated to smartmontools 6.5+svn4324, which now
>> properly supports this drive model. Some of the smart attr names have
>> changed, and make more sense now (and there are no more "Unknowns"):
>> 
>> ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
>>  5 Reallocated_Sector_Ct   -O--CK   081   081   000-944
>>  9 Power_On_Hours  -O--CK   100   100   000-1067
>> 12 Power_Cycle_Count   -O--CK   100   100   000-7
>> 170 Available_Reservd_Space PO--CK   085   085   010-0
>> 171 Program_Fail_Count  -O--CK   100   100   000-0
>> 172 Erase_Fail_Count-O--CK   100   100   000-68
>> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000-6
>> 175 Power_Loss_Cap_Test PO--CK   100   100   010-6510 (4 4307)
>> 183 SATA_Downshift_Count-O--CK   100   100   000-0
>> 184 End-to-End_ErrorPO--CK   100   100   090-0
>> 187 Reported_Uncorrect  -O--CK   100   100   000-0
>> 190 Temperature_Case-O---K   070   065   000-30 (Min/Max
>> 25/35)
>> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000-6
>> 194 Temperature_Internal-O---K   100   100   000-30
>> 197 Current_Pending_Sector  -O--C-   100   100   000-1100
>> 199 CRC_Error_Count -OSRCK   100   100   000-0
>> 225 Host_Writes_32MiB   -O--CK   100   100   000-20135
>> 226 Workld_Media_Wear_Indic -O--CK   100   100   000-20
>> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000-82
>> 228 Workload_Minutes-O--CK   100   100   000-64012
>> 232 Available_Reservd_Space PO--CK   084   084   010-0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000-0
>> 234 Thermal_Throttle-O--CK   100   100   000-0/0
>> 241 Host_Writes_32MiB   -O--CK   100   100   000-20135
>> 242 Host_Reads_32MiB-O--CK   100   100   000-92945
>> 243 NAND_Writes_32MiB   -O--CK   100   100   000-95289
>> 
>> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
>> seems to be holding steady.
>> 
>> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
>> death. The drive simply disappeared from the controller one day, and
>> could no longer be detected.
>> 
>> On 03/08/16 12:15, Jan Schermer wrote:
>>> Make sure you are reading the right attribute and interpreting it right.
>>> update-smart-drivedb sometimes makes wonders :)
>>> 
>>> I wonder what isdct tool would say the drive's life expectancy is with this 
>>> workload? Are you really writing ~600TB/month??
>>> 
>>> Jan
>>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs quota implement

2016-08-03 Thread Daleep Singh Bais
Dear all,

Further to my CephFS testing, I am trying to set a quota on the mount I
have made on the client end, but I am getting an error message when querying it.

ceph-fuse  fuse.ceph-fuse  2.8T  5.5G  2.8T   1% /cephfs

# setfattr -n ceph.quota.max_bytes -v 1 /cephfs/test1/

# getfattr -n ceph.quota.max_bytes /cephfs/test1/
*/cephfs/test1/: ceph.quota.max_bytes: No such attribute*
 
Can you please suggest the correct way to implement the quota and
resolve the above issue?
I am using Ceph Jewel for my test.

Thanks.

Daleep Singh Bais
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Christian Balzer

Hello,

yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
seemed to be such an odd thing to fail (given that's not single capacitor).

As for your Reallocated_Sector_Ct, that's really odd and definitely a RMA
worthy issue. 

For the record, Intel SSDs use (typically 24) sectors when doing firmware
upgrades, so this is a totally healthy 3610. ^o^
---
  5 Reallocated_Sector_Ct   0x0032   099   099   000Old_age   Always   
-   24
---

Christian

On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:

> Right, I actually updated to smartmontools 6.5+svn4324, which now
> properly supports this drive model. Some of the smart attr names have
> changed, and make more sense now (and there are no more "Unknowns"):
> 
> ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   -O--CK   081   081   000-944
>   9 Power_On_Hours  -O--CK   100   100   000-1067
>  12 Power_Cycle_Count   -O--CK   100   100   000-7
> 170 Available_Reservd_Space PO--CK   085   085   010-0
> 171 Program_Fail_Count  -O--CK   100   100   000-0
> 172 Erase_Fail_Count-O--CK   100   100   000-68
> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000-6
> 175 Power_Loss_Cap_Test PO--CK   100   100   010-6510 (4 4307)
> 183 SATA_Downshift_Count-O--CK   100   100   000-0
> 184 End-to-End_ErrorPO--CK   100   100   090-0
> 187 Reported_Uncorrect  -O--CK   100   100   000-0
> 190 Temperature_Case-O---K   070   065   000-30 (Min/Max
> 25/35)
> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000-6
> 194 Temperature_Internal-O---K   100   100   000-30
> 197 Current_Pending_Sector  -O--C-   100   100   000-1100
> 199 CRC_Error_Count -OSRCK   100   100   000-0
> 225 Host_Writes_32MiB   -O--CK   100   100   000-20135
> 226 Workld_Media_Wear_Indic -O--CK   100   100   000-20
> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000-82
> 228 Workload_Minutes-O--CK   100   100   000-64012
> 232 Available_Reservd_Space PO--CK   084   084   010-0
> 233 Media_Wearout_Indicator -O--CK   100   100   000-0
> 234 Thermal_Throttle-O--CK   100   100   000-0/0
> 241 Host_Writes_32MiB   -O--CK   100   100   000-20135
> 242 Host_Reads_32MiB-O--CK   100   100   000-92945
> 243 NAND_Writes_32MiB   -O--CK   100   100   000-95289
> 
> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
> seems to be holding steady.
> 
> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
> death. The drive simply disappeared from the controller one day, and
> could no longer be detected.
> 
> On 03/08/16 12:15, Jan Schermer wrote:
> > Make sure you are reading the right attribute and interpreting it right.
> > update-smart-drivedb sometimes makes wonders :)
> > 
> > I wonder what isdct tool would say the drive's life expectancy is with this 
> > workload? Are you really writing ~600TB/month??
> > 
> > Jan
> > 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Daniel Swarbrick
Right, I actually updated to smartmontools 6.5+svn4324, which now
properly supports this drive model. Some of the smart attr names have
changed, and make more sense now (and there are no more "Unknowns"):

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
  9 Power_On_Hours          -O--CK   100   100   000    -    1067
 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
170 Available_Reservd_Space PO--CK   085   085   010    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    68
174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
194 Temperature_Internal    -O---K   100   100   000    -    30
197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
228 Workload_Minutes        -O--CK   100   100   000    -    64012
232 Available_Reservd_Space PO--CK   084   084   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289

Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
seems to be holding steady.

AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
death. The drive simply disappeared from the controller one day, and
could no longer be detected.

On 03/08/16 12:15, Jan Schermer wrote:
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes makes wonders :)
> 
> I wonder what isdct tool would say the drive's life expectancy is with this 
> workload? Are you really writing ~600TB/month??
> 
> Jan
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
I'm a fool; I miscalculated the writes by a factor of 1000, of course :-)
600GB/month is not much for an S36xx at all, so it must be some sort of defect then...
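
For what it's worth, the GP log quoted below roughly backs that up
(back-of-the-envelope only, assuming 512-byte logical sectors):

# ~675 GB written over 1065 power-on hours (~44 days),
# i.e. somewhere around 450-460 GB/month for this drive
echo $(( 1319318736 * 512 / 1000000000 ))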

Jan


> On 03 Aug 2016, at 12:15, Jan Schermer  wrote:
> 
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes makes wonders :)
> 
> I wonder what isdct tool would say the drive's life expectancy is with this 
> workload? Are you really writing ~600TB/month??
> 
> Jan
> 
> 
>> On 03 Aug 2016, at 12:06, Maxime Guyot  wrote:
>> 
>> Hi,
>> 
>> I haven’t had problems with Power_Loss_Cap_Test so far. 
>> 
>> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the 
>> “Available Reserved Space” (SMART ID: 232/E8h), the data sheet 
>> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
>>  reads:
>> "This attribute reports the number of reserve blocks
>> 
>>  remaining. The normalized value 
>> begins at 100 (64h),
>> which corresponds to 100 percent availability of the
>> reserved space. The threshold value for this attribute is
>> 10 percent availability."
>> 
>> According to the SMART data you copied, it should be about 84% of the over 
>> provisioning left? Since the drive is pretty young, it might be some form of 
>> defect?
>> I have a number of S3610 with ~150 DW, all SMART counters are their initial 
>> values (except for the temperature).
>> 
>> Cheers,
>> Maxime
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" 
>> > daniel.swarbr...@profitbricks.com> wrote:
>> 
>>> Hi Christian,
>>> 
>>> Intel drives are good, but apparently not infallible. I'm watching a DC
>>> S3610 480GB die from reallocated sectors.
>>> 
>>> ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
>>> 5 Reallocated_Sector_Ct   -O--CK   081   081   000-756
>>> 9 Power_On_Hours  -O--CK   100   100   000-1065
>>> 12 Power_Cycle_Count   -O--CK   100   100   000-7
>>> 175 Program_Fail_Count_Chip PO--CK   100   100   010-17454078318
>>> 183 Runtime_Bad_Block   -O--CK   100   100   000-0
>>> 184 End-to-End_ErrorPO--CK   100   100   090-0
>>> 187 Reported_Uncorrect  -O--CK   100   100   000-0
>>> 190 Airflow_Temperature_Cel -O---K   070   065   000-30 (Min/Max
>>> 25/35)
>>> 192 Power-Off_Retract_Count -O--CK   100   100   000-6
>>> 194 Temperature_Celsius -O---K   100   100   000-30
>>> 197 Current_Pending_Sector  -O--C-   100   100   000-1288
>>> 199 UDMA_CRC_Error_Count-OSRCK   100   100   000-0
>>> 228 Power-off_Retract_Count -O--CK   100   100   000-63889
>>> 232 Available_Reservd_Space PO--CK   084   084   010-0
>>> 233 Media_Wearout_Indicator -O--CK   100   100   000-0
>>> 241 Total_LBAs_Written  -O--CK   100   100   000-20131
>>> 242 Total_LBAs_Read -O--CK   100   100   000-92945
>>> 
>>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>>> sure how many reserved sectors the drive has, i.e., how soon before it
>>> starts throwing write IO errors.
>>> 
>>> It's a very young drive, with only 1065 hours on the clock, and has not
>>> even done two full drive-writes:
>>> 
>>> Device Statistics (GP Log 0x04)
>>> Page Offset Size Value  Description
>>> 1  =  ==  == General Statistics (rev 2) ==
>>> 1  0x008  47  Lifetime Power-On Resets
>>> 1  0x018  6   1319318736  Logical Sectors Written
>>> 1  0x020  6137121729  Number of Write Commands
>>> 1  0x028  6   6091245600  Logical Sectors Read
>>> 1  0x030  6115252407  Number of Read Commands
>>> 
>>> Fortunately this drive is not used as a Ceph journal. It's in a mdraid
>>> RAID5 array :-|
>>> 
>>> Cheers,
>>> Daniel
>>> 
>>> On 03/08/16 07:45, Christian Balzer wrote:
 
 Hello,
 
 not a Ceph specific issue, but this is probably the largest sample size of
 SSD users I'm familiar with. ^o^
 
 This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
 religious experience.
 
 It turns out that the SMART check plugin I run to mostly get an early
 wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
 200GB DC S3700 used for journals.
 
 While SMART is of the opinion that this drive is failing and will explode
 spectacularly any moment that particular failure is of little worries to
 me, never mind that I'll eventually replace this unit.
 
 What brings me here is that this is the first time in over 3 years that an
 Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
 this particular failure has been seen by others.
 
 That 

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Make sure you are reading the right attribute and interpreting it correctly.
update-smart-drivedb sometimes works wonders :)

I wonder what the isdct tool would say the drive's life expectancy is with this 
workload? Are you really writing ~600TB/month??

Jan


> On 03 Aug 2016, at 12:06, Maxime Guyot  wrote:
> 
> Hi,
> 
> I haven’t had problems with Power_Loss_Cap_Test so far. 
> 
> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the 
> “Available Reserved Space” (SMART ID: 232/E8h), the data sheet 
> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
>  reads:
> "This attribute reports the number of reserve blocks
> 
>   remaining. The normalized value 
> begins at 100 (64h),
> which corresponds to 100 percent availability of the
> reserved space. The threshold value for this attribute is
> 10 percent availability."
> 
> According to the SMART data you copied, it should be about 84% of the over 
> provisioning left? Since the drive is pretty young, it might be some form of 
> defect?
> I have a number of S3610 with ~150 DW, all SMART counters are their initial 
> values (except for the temperature).
> 
> Cheers,
> Maxime
> 
> 
> 
> 
> 
> 
> 
> 
> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" 
>  daniel.swarbr...@profitbricks.com> wrote:
> 
>> Hi Christian,
>> 
>> Intel drives are good, but apparently not infallible. I'm watching a DC
>> S3610 480GB die from reallocated sectors.
>> 
>> ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
>> 5 Reallocated_Sector_Ct   -O--CK   081   081   000-756
>> 9 Power_On_Hours  -O--CK   100   100   000-1065
>> 12 Power_Cycle_Count   -O--CK   100   100   000-7
>> 175 Program_Fail_Count_Chip PO--CK   100   100   010-17454078318
>> 183 Runtime_Bad_Block   -O--CK   100   100   000-0
>> 184 End-to-End_ErrorPO--CK   100   100   090-0
>> 187 Reported_Uncorrect  -O--CK   100   100   000-0
>> 190 Airflow_Temperature_Cel -O---K   070   065   000-30 (Min/Max
>> 25/35)
>> 192 Power-Off_Retract_Count -O--CK   100   100   000-6
>> 194 Temperature_Celsius -O---K   100   100   000-30
>> 197 Current_Pending_Sector  -O--C-   100   100   000-1288
>> 199 UDMA_CRC_Error_Count-OSRCK   100   100   000-0
>> 228 Power-off_Retract_Count -O--CK   100   100   000-63889
>> 232 Available_Reservd_Space PO--CK   084   084   010-0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000-0
>> 241 Total_LBAs_Written  -O--CK   100   100   000-20131
>> 242 Total_LBAs_Read -O--CK   100   100   000-92945
>> 
>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>> sure how many reserved sectors the drive has, i.e., how soon before it
>> starts throwing write IO errors.
>> 
>> It's a very young drive, with only 1065 hours on the clock, and has not
>> even done two full drive-writes:
>> 
>> Device Statistics (GP Log 0x04)
>> Page Offset Size Value  Description
>> 1  =  ==  == General Statistics (rev 2) ==
>> 1  0x008  47  Lifetime Power-On Resets
>> 1  0x018  6   1319318736  Logical Sectors Written
>> 1  0x020  6137121729  Number of Write Commands
>> 1  0x028  6   6091245600  Logical Sectors Read
>> 1  0x030  6115252407  Number of Read Commands
>> 
>> Fortunately this drive is not used as a Ceph journal. It's in a mdraid
>> RAID5 array :-|
>> 
>> Cheers,
>> Daniel
>> 
>> On 03/08/16 07:45, Christian Balzer wrote:
>>> 
>>> Hello,
>>> 
>>> not a Ceph specific issue, but this is probably the largest sample size of
>>> SSD users I'm familiar with. ^o^
>>> 
>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>> religious experience.
>>> 
>>> It turns out that the SMART check plugin I run to mostly get an early
>>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>>> 200GB DC S3700 used for journals.
>>> 
>>> While SMART is of the opinion that this drive is failing and will explode
>>> spectacularly any moment that particular failure is of little worries to
>>> me, never mind that I'll eventually replace this unit.
>>> 
>>> What brings me here is that this is the first time in over 3 years that an
>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>> this particular failure has been seen by others.
>>> 
>>> That of course entails people actually monitoring for these things. ^o^
>>> 
>>> Thanks,
>>> 
>>> Christian
>>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing 

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Maxime Guyot
Hi,

I haven’t had problems with Power_Loss_Cap_Test so far. 

Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the “Available 
Reserved Space” (SMART ID: 232/E8h); the data sheet 
(http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
reads:
"This attribute reports the number of reserve blocks remaining. The normalized
value begins at 100 (64h), which corresponds to 100 percent availability of the
reserved space. The threshold value for this attribute is 10 percent
availability."

According to the SMART data you copied, it should be about 84% of the 
over-provisioning left? Since the drive is pretty young, it might be some form of 
defect?
I have a number of S3610s with ~150 DW; all SMART counters are at their initial 
values (except for the temperature).
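
If it helps, the normalized value and threshold for that attribute can be pulled
straight out of smartctl (just a sketch; /dev/sdX is a placeholder for the actual
device, and the field positions assume the usual 'smartctl -A' column layout):

# print normalized value vs. threshold for the reserved-space attribute
smartctl -A /dev/sdX | awk '/Available_Reservd_Space/ {print "value:", $4, "thresh:", $6}'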

Cheers,
Maxime








On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" 
 wrote:

>Hi Christian,
>
>Intel drives are good, but apparently not infallible. I'm watching a DC
>S3610 480GB die from reallocated sectors.
>
>ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
>  5 Reallocated_Sector_Ct   -O--CK   081   081   000-756
>  9 Power_On_Hours  -O--CK   100   100   000-1065
> 12 Power_Cycle_Count   -O--CK   100   100   000-7
>175 Program_Fail_Count_Chip PO--CK   100   100   010-17454078318
>183 Runtime_Bad_Block   -O--CK   100   100   000-0
>184 End-to-End_ErrorPO--CK   100   100   090-0
>187 Reported_Uncorrect  -O--CK   100   100   000-0
>190 Airflow_Temperature_Cel -O---K   070   065   000-30 (Min/Max
>25/35)
>192 Power-Off_Retract_Count -O--CK   100   100   000-6
>194 Temperature_Celsius -O---K   100   100   000-30
>197 Current_Pending_Sector  -O--C-   100   100   000-1288
>199 UDMA_CRC_Error_Count-OSRCK   100   100   000-0
>228 Power-off_Retract_Count -O--CK   100   100   000-63889
>232 Available_Reservd_Space PO--CK   084   084   010-0
>233 Media_Wearout_Indicator -O--CK   100   100   000-0
>241 Total_LBAs_Written  -O--CK   100   100   000-20131
>242 Total_LBAs_Read -O--CK   100   100   000-92945
>
>The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>sure how many reserved sectors the drive has, i.e., how soon before it
>starts throwing write IO errors.
>
>It's a very young drive, with only 1065 hours on the clock, and has not
>even done two full drive-writes:
>
>Device Statistics (GP Log 0x04)
>Page Offset Size Value  Description
>  1  =  ==  == General Statistics (rev 2) ==
>  1  0x008  47  Lifetime Power-On Resets
>  1  0x018  6   1319318736  Logical Sectors Written
>  1  0x020  6137121729  Number of Write Commands
>  1  0x028  6   6091245600  Logical Sectors Read
>  1  0x030  6115252407  Number of Read Commands
>
>Fortunately this drive is not used as a Ceph journal. It's in a mdraid
>RAID5 array :-|
>
>Cheers,
>Daniel
>
>On 03/08/16 07:45, Christian Balzer wrote:
>> 
>> Hello,
>> 
>> not a Ceph specific issue, but this is probably the largest sample size of
>> SSD users I'm familiar with. ^o^
>> 
>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>> religious experience.
>> 
>> It turns out that the SMART check plugin I run to mostly get an early
>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>> 200GB DC S3700 used for journals.
>> 
>> While SMART is of the opinion that this drive is failing and will explode
>> spectacularly any moment that particular failure is of little worries to
>> me, never mind that I'll eventually replace this unit.
>> 
>> What brings me here is that this is the first time in over 3 years that an
>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>> this particular failure has been seen by others.
>> 
>> That of course entails people actually monitoring for these things. ^o^
>> 
>> Thanks,
>> 
>> Christian
>> 
>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Daniel Swarbrick
Hi Christian,

Intel drives are good, but apparently not infallible. I'm watching a DC
S3610 480GB die from reallocated sectors.

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
  9 Power_On_Hours          -O--CK   100   100   000    -    1065
 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
194 Temperature_Celsius     -O---K   100   100   000    -    30
197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
232 Available_Reservd_Space PO--CK   084   084   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
242 Total_LBAs_Read         -O--CK   100   100   000    -    92945

The Reallocated_Sector_Ct is increasing about once a minute. I'm not
sure how many reserved sectors the drive has, i.e., how soon before it
starts throwing write IO errors.
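
For anyone wanting to watch the decay over time, a crude loop like this does the
job (sketch only; smartmontools assumed, /dev/sdX is a placeholder for the real
device):

# log the reallocation count and remaining reserved space once a minute
while true; do
    date
    smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Available_Reservd_Space'
    sleep 60
done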

It's a very young drive, with only 1065 hours on the clock, and has not
even done two full drive-writes:

Device Statistics (GP Log 0x04)
Page Offset Size       Value  Description
  1  =  ==  == General Statistics (rev 2) ==
  1  0x008  4          7  Lifetime Power-On Resets
  1  0x018  6  1319318736  Logical Sectors Written
  1  0x020  6   137121729  Number of Write Commands
  1  0x028  6  6091245600  Logical Sectors Read
  1  0x030  6   115252407  Number of Read Commands

Fortunately this drive is not used as a Ceph journal. It's in an mdraid
RAID5 array :-|

Cheers,
Daniel

On 03/08/16 07:45, Christian Balzer wrote:
> 
> Hello,
> 
> not a Ceph specific issue, but this is probably the largest sample size of
> SSD users I'm familiar with. ^o^
> 
> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
> religious experience.
> 
> It turns out that the SMART check plugin I run to mostly get an early
> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
> 200GB DC S3700 used for journals.
> 
> While SMART is of the opinion that this drive is failing and will explode
> spectacularly any moment that particular failure is of little worries to
> me, never mind that I'll eventually replace this unit.
> 
> What brings me here is that this is the first time in over 3 years that an
> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
> this particular failure has been seen by others.
> 
> That of course entails people actually monitoring for these things. ^o^
> 
> Thanks,
> 
> Christian
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map utilization issue

2016-08-03 Thread Rob Reus
Hi,


> I have never tried it, but gets back to my original question: Why the rack in 
> between and not add the hosts directly to the root?
>
> You should add the rack when you want to set the failure domain to racks and 
> thus replicate over multiple racks.
>
> In your case you want the failure domain to be 'host', so I'd suggest to 
> stick with that.


Currently I am building a Ceph test cluster, which will eventually have racks, 
or maybe even rooms, as the failure domain. Because of that, I have also included 
the rack bucket.
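
For later, once there are real racks, my understanding is that hosts can be
re-parented on a live cluster without hand-editing the map (bucket and host
names below are just placeholders):

ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move ceph2 rack=rack1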


> I prefer KISS and thus separate roots, but I think this is what you're
> after:

> http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map


I have used that article as an example, but the author runs into the same issue 
as I do, and it seems he also did not find a solution. It may very well be that 
what I am trying to do is just not possible; I was just hoping someone could 
explain why I am seeing this behaviour.


Thanks.






From: Christian Balzer
Sent: Wednesday, 3 August 2016 10:45:57
To: ceph-users@lists.ceph.com
CC: Rob Reus
Subject: Re: [ceph-users] CRUSH map utilization issue


Hello,

On Wed, 3 Aug 2016 08:35:49 + Rob Reus wrote:

> Hi Wido,
>
>
> This is indeed something I have tried, and confirmed to work, see the other 
> CRUSH map link I have provided in my original email.
>
>
> However, I was wondering if achieving that same goal, but with only 1 root, 
> is possible/feasible.
>
I prefer KISS and thus separate roots, but I think this is what you're
after:
http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map

Christian

>
> Thanks!
>
>
> 
> From: Wido den Hollander
> Sent: Wednesday, 3 August 2016 10:30
> To: Rob Reus; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CRUSH map utilization issue
>
>
> > On 3 August 2016 at 10:08, Rob Reus wrote:
> >
> >
> > Hi all,
> >
> >
> > I built a CRUSH map, with the goal to distinguish between SSD and HDD 
> > storage machines using only 1 root. The map can be found here: 
> > http://pastebin.com/VQdB0CE9
> >
> >
> > The issue I am having is this:
> >
> >
> > root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> > --rule 0 --num-rep 3
> > rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 3..3
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 0:84/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 1:437/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 2:438/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 3:65/1024
> >
> >
> > And then the same test using num-rep 46 (the lowest possible number that 
> > shows full utilization):
> >
> >
> > root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> > --rule 0 --num-rep 46
> > rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 46..46
> > rule 0 (replicated_ruleset_ssd) num_rep 46 result size == 3:1024/1024
> >
> >
> > Full output of above commands can be found here 
> > http://pastebin.com/2mbBnmSM and here http://pastebin.com/ar6SAFnX
> >
> >
> > The fact that amount of num-rep seems to scale with how many OSDs I am 
> > using, leads me to believe I am doing something wrong.
> >
> >
> > When using 2 roots (1 dedicated to SSD and 1 to HDD), everything works 
> > perfectly (example: http://pastebin.com/Uthxesut).
> >
> >
> > Would love to know what I am missing.
> >
>
> Can you tell me the reasoning behind adding the rack in between? Since there 
> is only one rack, why add the machines there?
>
> In your case I wouldn't add a new type either, but I would do this:
>
> host machineA-ssd {
>
> }
>
> host machineB-ssd {
>
> }
>
> host machineA-hdd {
>
> }
>
> host machineB-hdd {
>
> }
>
> root ssd {
> item machineA-sdd
> item machineB-ssd
> }
>
> root hdd {
> item machineA-hdd
> item machineB-hdd
> }
>
> rule replicated_ruleset_ssd {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take ssd
> step chooseleaf firstn 0 type host
> step emit
> }
>
> rule replicated_ruleset_hdd {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take hdd
> step chooseleaf firstn 0 type host
> step emit
> }
>
> And try again :)
>
> Wido
>
> >
> > Thanks!
> >
> >
> > - Rob
> >
> > 

Re: [ceph-users] CRUSH map utilization issue

2016-08-03 Thread Christian Balzer

Hello,

On Wed, 3 Aug 2016 08:35:49 + Rob Reus wrote:

> Hi Wido,
> 
> 
> This is indeed something I have tried, and confirmed to work, see the other 
> CRUSH map link I have provided in my original email.
> 
> 
> However, I was wondering if achieving that same goal, but with only 1 root, 
> is possible/feasible.
> 
I prefer KISS and thus separate roots, but I think this is what you're
after:
http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map

Christian

> 
> Thanks!
> 
> 
> 
> From: Wido den Hollander
> Sent: Wednesday, 3 August 2016 10:30
> To: Rob Reus; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CRUSH map utilization issue
> 
> 
> > On 3 August 2016 at 10:08, Rob Reus wrote:
> >
> >
> > Hi all,
> >
> >
> > I built a CRUSH map, with the goal to distinguish between SSD and HDD 
> > storage machines using only 1 root. The map can be found here: 
> > http://pastebin.com/VQdB0CE9
> >
> >
> > The issue I am having is this:
> >
> >
> > root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> > --rule 0 --num-rep 3
> > rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 3..3
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 0:84/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 1:437/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 2:438/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 3:65/1024
> >
> >
> > And then the same test using num-rep 46 (the lowest possible number that 
> > shows full utilization):
> >
> >
> > root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> > --rule 0 --num-rep 46
> > rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 46..46
> > rule 0 (replicated_ruleset_ssd) num_rep 46 result size == 3:1024/1024
> >
> >
> > Full output of above commands can be found here 
> > http://pastebin.com/2mbBnmSM and here http://pastebin.com/ar6SAFnX
> >
> >
> > The fact that amount of num-rep seems to scale with how many OSDs I am 
> > using, leads me to believe I am doing something wrong.
> >
> >
> > When using 2 roots (1 dedicated to SSD and 1 to HDD), everything works 
> > perfectly (example: http://pastebin.com/Uthxesut).
> >
> >
> > Would love to know what I am missing.
> >
> 
> Can you tell me the reasoning behind adding the rack in between? Since there 
> is only one rack, why add the machines there?
> 
> In your case I wouldn't add a new type either, but I would do this:
> 
> host machineA-ssd {
> 
> }
> 
> host machineB-ssd {
> 
> }
> 
> host machineA-hdd {
> 
> }
> 
> host machineB-hdd {
> 
> }
> 
> root ssd {
> item machineA-sdd
> item machineB-ssd
> }
> 
> root hdd {
> item machineA-hdd
> item machineB-hdd
> }
> 
> rule replicated_ruleset_ssd {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take ssd
> step chooseleaf firstn 0 type host
> step emit
> }
> 
> rule replicated_ruleset_hdd {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take hdd
> step chooseleaf firstn 0 type host
> step emit
> }
> 
> And try again :)
> 
> Wido
> 
> >
> > Thanks!
> >
> >
> > - Rob
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map utilization issue

2016-08-03 Thread Wido den Hollander

> On 3 August 2016 at 10:35, Rob Reus wrote:
> 
> 
> Hi Wido,
> 
> 
> This is indeed something I have tried, and confirmed to work, see the other 
> CRUSH map link I have provided in my original email.
> 

Ah, double e-mails.

> 
> However, I was wondering if achieving that same goal, but with only 1 root, 
> is possible/feasible.
> 

I have never tried it, but this gets back to my original question: why the rack 
in between, rather than adding the hosts directly to the root?

You should add the rack when you want to set the failure domain to racks and 
thus replicate over multiple racks.

In your case you want the failure domain to be 'host', so I'd suggest sticking 
with that.

Custom crush types are supported, no problem, however, 'host' is such a widely 
used type in CRUSH that I wouldn't change that.
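
If in doubt, the failure domain a rule actually uses can be checked by dumping
it on a live cluster; the 'type' field of the chooseleaf step is what matters
(rule name below is taken from your map):

ceph osd crush rule dump replicated_ruleset_ssd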

Wido

> 
> Thanks!
> 
> 
> 
> From: Wido den Hollander
> Sent: Wednesday, 3 August 2016 10:30
> To: Rob Reus; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CRUSH map utilization issue
> 
> 
> > On 3 August 2016 at 10:08, Rob Reus wrote:
> >
> >
> > Hi all,
> >
> >
> > I built a CRUSH map, with the goal to distinguish between SSD and HDD 
> > storage machines using only 1 root. The map can be found here: 
> > http://pastebin.com/VQdB0CE9
> >
> >
> > The issue I am having is this:
> >
> >
> > root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> > --rule 0 --num-rep 3
> > rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 3..3
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 0:84/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 1:437/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 2:438/1024
> > rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 3:65/1024
> >
> >
> > And then the same test using num-rep 46 (the lowest possible number that 
> > shows full utilization):
> >
> >
> > root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> > --rule 0 --num-rep 46
> > rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 46..46
> > rule 0 (replicated_ruleset_ssd) num_rep 46 result size == 3:1024/1024
> >
> >
> > Full output of above commands can be found here 
> > http://pastebin.com/2mbBnmSM and here http://pastebin.com/ar6SAFnX
> >
> >
> > The fact that amount of num-rep seems to scale with how many OSDs I am 
> > using, leads me to believe I am doing something wrong.
> >
> >
> > When using 2 roots (1 dedicated to SSD and 1 to HDD), everything works 
> > perfectly (example: http://pastebin.com/Uthxesut).
> >
> >
> > Would love to know what I am missing.
> >
> 
> Can you tell me the reasoning behind adding the rack in between? Since there 
> is only one rack, why add the machines there?
> 
> In your case I wouldn't add a new type either, but I would do this:
> 
> host machineA-ssd {
> 
> }
> 
> host machineB-ssd {
> 
> }
> 
> host machineA-hdd {
> 
> }
> 
> host machineB-hdd {
> 
> }
> 
> root ssd {
> item machineA-sdd
> item machineB-ssd
> }
> 
> root hdd {
> item machineA-hdd
> item machineB-hdd
> }
> 
> rule replicated_ruleset_ssd {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take ssd
> step chooseleaf firstn 0 type host
> step emit
> }
> 
> rule replicated_ruleset_hdd {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take hdd
> step chooseleaf firstn 0 type host
> step emit
> }
> 
> And try again :)
> 
> Wido
> 
> >
> > Thanks!
> >
> >
> > - Rob
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map utilization issue

2016-08-03 Thread Rob Reus
Hi Wido,


This is indeed something I have tried, and confirmed to work, see the other 
CRUSH map link I have provided in my original email.


However, I was wondering if achieving that same goal, but with only 1 root, is 
possible/feasible.


Thanks!



From: Wido den Hollander
Sent: Wednesday, 3 August 2016 10:30
To: Rob Reus; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CRUSH map utilization issue


> On 3 August 2016 at 10:08, Rob Reus wrote:
>
>
> Hi all,
>
>
> I built a CRUSH map, with the goal to distinguish between SSD and HDD storage 
> machines using only 1 root. The map can be found here: 
> http://pastebin.com/VQdB0CE9
>
>
> The issue I am having is this:
>
>
> root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> --rule 0 --num-rep 3
> rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 3..3
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 0:84/1024
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 1:437/1024
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 2:438/1024
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 3:65/1024
>
>
> And then the same test using num-rep 46 (the lowest possible number that 
> shows full utilization):
>
>
> root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> --rule 0 --num-rep 46
> rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 46..46
> rule 0 (replicated_ruleset_ssd) num_rep 46 result size == 3:1024/1024
>
>
> Full output of above commands can be found here http://pastebin.com/2mbBnmSM 
> and here http://pastebin.com/ar6SAFnX
>
>
> The fact that amount of num-rep seems to scale with how many OSDs I am using, 
> leads me to believe I am doing something wrong.
>
>
> When using 2 roots (1 dedicated to SSD and 1 to HDD), everything works 
> perfectly (example: http://pastebin.com/Uthxesut).
>
>
> Would love to know what I am missing.
>

Can you tell me the reasoning behind adding the rack in between? Since there is 
only one rack, why add the machines there?

In your case I wouldn't add a new type either, but I would do this:

host machineA-ssd {

}

host machineB-ssd {

}

host machineA-hdd {

}

host machineB-hdd {

}

root ssd {
item machineA-ssd
item machineB-ssd
}

root hdd {
item machineA-hdd
item machineB-hdd
}

rule replicated_ruleset_ssd {
ruleset 0
type replicated
min_size 1
max_size 10
step take ssd
step chooseleaf firstn 0 type host
step emit
}

rule replicated_ruleset_hdd {
ruleset 1
type replicated
min_size 1
max_size 10
step take hdd
step chooseleaf firstn 0 type host
step emit
}

And try again :)

Wido

>
> Thanks!
>
>
> - Rob
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH map utilization issue

2016-08-03 Thread Wido den Hollander

> On 3 August 2016 at 10:08, Rob Reus wrote:
> 
> 
> Hi all,
> 
> 
> I built a CRUSH map, with the goal to distinguish between SSD and HDD storage 
> machines using only 1 root. The map can be found here: 
> http://pastebin.com/VQdB0CE9
> 
> 
> The issue I am having is this:
> 
> 
> root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> --rule 0 --num-rep 3
> rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 3..3
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 0:84/1024
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 1:437/1024
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 2:438/1024
> rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 3:65/1024
> 
> 
> And then the same test using num-rep 46 (the lowest possible number that 
> shows full utilization):
> 
> 
> root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
> --rule 0 --num-rep 46
> rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 46..46
> rule 0 (replicated_ruleset_ssd) num_rep 46 result size == 3:1024/1024
> 
> 
> Full output of above commands can be found here http://pastebin.com/2mbBnmSM 
> and here http://pastebin.com/ar6SAFnX
> 
> 
> The fact that amount of num-rep seems to scale with how many OSDs I am using, 
> leads me to believe I am doing something wrong.
> 
> 
> When using 2 roots (1 dedicated to SSD and 1 to HDD), everything works 
> perfectly (example: http://pastebin.com/Uthxesut).
> 
> 
> Would love to know what I am missing.
> 

Can you tell me the reasoning behind adding the rack in between? Since there is 
only one rack, why add the machines there?

In your case I wouldn't add a new type either, but I would do this:

host machineA-ssd {

}

host machineB-ssd {

}

host machineA-hdd {

}

host machineB-hdd {

}

root ssd {
item machineA-ssd
item machineB-ssd
}

root hdd {
item machineA-hdd
item machineB-hdd
}

rule replicated_ruleset_ssd {
ruleset 0
type replicated
min_size 1
max_size 10
step take ssd
step chooseleaf firstn 0 type host
step emit
}

rule replicated_ruleset_hdd {
ruleset 1
type replicated
min_size 1
max_size 10
step take hdd
step chooseleaf firstn 0 type host
step emit
}

And try again :)
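
A quick way to round-trip and dry-run the edited map before touching the
cluster (file names are placeholders, and the rule ids assume the ssd rule ends
up as 0 and the hdd rule as 1):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt as above, then recompile and test offline
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --show-utilization --rule 0 --num-rep 3
crushtool -i crushmap.new --test --show-utilization --rule 1 --num-rep 3
# only inject it once the utilization looks sane
ceph osd setcrushmap -i crushmap.new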

Wido

> 
> Thanks!
> 
> 
> - Rob
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CRUSH map utilization issue

2016-08-03 Thread Rob Reus
Hi all,


I built a CRUSH map with the goal of distinguishing between SSD and HDD storage 
machines using only 1 root. The map can be found here: 
http://pastebin.com/VQdB0CE9


The issue I am having is this:


root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
--rule 0 --num-rep 3
rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 3..3
rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 0:84/1024
rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 1:437/1024
rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 2:438/1024
rule 0 (replicated_ruleset_ssd) num_rep 3 result size == 3:65/1024


And then the same test using num-rep 46 (the lowest possible number that shows 
full utilization):


root@ceph2:~/crush_files# crushtool -i crushmap --test --show-utilization 
--rule 0 --num-rep 46
rule 0 (replicated_ruleset_ssd), x = 0..1023, numrep = 46..46
rule 0 (replicated_ruleset_ssd) num_rep 46 result size == 3:1024/1024


Full output of above commands can be found here http://pastebin.com/2mbBnmSM 
and here http://pastebin.com/ar6SAFnX


The fact that the required num-rep seems to scale with how many OSDs I am using 
leads me to believe I am doing something wrong.
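
In case it helps anyone reproducing this, crushtool can also list the mappings
that fail outright for the same map and rule (same files as above):

crushtool -i crushmap --test --show-bad-mappings --rule 0 --num-rep 3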


When using 2 roots (1 dedicated to SSD and 1 to HDD), everything works 
perfectly (example: http://pastebin.com/Uthxesut).


Would love to know what I am missing.


Thanks!


- Rob



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CRUSH map utilization issue

2016-08-03 Thread Rob Reus
Hi all,


I have built a CRUSH map myself, with the goal of distinguishing between SSD 
storage machines and HDD storage machines using a custom type, while keeping 
only 1 root (default). The map can be found here: 
http://pastebin.com/VQdB0CE9


Now the issue I am seeing is that when I run `crushtool --test` on the map, it 
does not show full utilization.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com