Re: [ceph-users] Disk failures

2016-06-08 Thread Gandalf Corvotempesta
On 9 Jun 2016 at 02:09, "Christian Balzer" wrote:
> Ceph currently doesn't do any (relevant) checksumming at all, so if a
> PRIMARY PG suffers from bit-rot this will be undetected until the next
> deep-scrub.
>
> This is one of the longest and gravest outstanding issues with Ceph and
> supposed to be addressed with bluestore (which currently doesn't have
> checksum verified reads either).

So if bit rot happens on the primary PG, does Ceph spread the corrupted data
across the cluster?
What would be sent to the replicas, the original data or the stored (corrupted) data?

When bit rot happens I'll have 1 corrupted object and 2 good ones.
How do you manage this between deep scrubs? Which data would be used by
Ceph? I think that bit rot on a huge VM block device could lead to a mess,
with the whole device corrupted.
Would a VM affected by bit rot be able to stay up and running?
And what about bit rot on a qcow2 file?

Let me try to explain: when writing to the primary PG I have to write bit "1".
Due to bit rot, "0" is saved instead.
Would Ceph read back the written bit and spread that across the cluster (so it
would spread "0"), or spread the in-memory value "1"?

What if the journal fails during a read or a write? Is Ceph able to recover
by removing that journal from the affected OSD (and keep running at lower
speed), or should I use RAID1 on the SSDs used for journals?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel 10.2.1 compilation in SL6/Centos6

2016-06-08 Thread Goncalo Borges

Hi All...

For reasons which are not important here, I have to compile Ceph clients 
in SL6/CentOS6. In a previous thread, I posted instructions on how 
to do that for an Infernalis release. The instructions for Jewel 10.2.1 
follow. Maybe someone else can profit from them, since CentOS6 releases 
are no longer provided.


Please note that, as requirements, gcc > 4.7 and python 2.7 have to be 
available on the machine (you can compile and install those in 
alternative locations, or simply use a software collection).



1./ The tarball available at the ceph download site and the source tarball 
distributed with the srpm are different in size. I have used the source 
tarball distributed with the srpm.


2./ Set the environment:

   $ export
   
LD_LIBRARY_PATH=/usr/local/sw/sl6/x86_64/gcc/4.8.4/lib:/usr/local/sw/sl6/x86_64/gcc/4.8.4/lib64:$LD_LIBRARY_PATH
   $ export CC=/usr/local/sw/sl6/x86_64/gcc/4.8.4/bin/gcc
   $ export PATH=/usr/local/sw/sl6/x86_64/gcc/4.8.4/bin:$PATH
   $ export CPATH=/usr/local/sw/sl6/x86_64/gcc/4.8.4/include:$CPATH
   $ cat /etc/ld.so.conf.d/gcc-4.8.4.el6.x86_64.conf
   /usr/local/sw/sl6/x86_64/gcc/4.8.4/lib64
   /usr/local/sw/sl6/x86_64/gcc/4.8.4/lib

   $ export
   PYTHONPATH=/usr/local/sw/sl6/gcc484/x86_64/python/2.7.6:$PYTHONPATH
   $ export
   
LD_LIBRARY_PATH=/usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/lib:$LD_LIBRARY_PATH
   $ export PATH=/usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/bin:$PATH
   $ export
   CPATH=/usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/include/:$CPATH
   $ cat /etc/ld.so.conf.d/python-2.7.6.el6.x86_64.conf
   /usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/lib/

   Note that we will have to make the following symbolic link since
   most configuration tools rely on commands such as 'python-config
   --ldflags|--cflags|…' to set the proper configuration paths for
   compiling python bindings.

   $ ln -s
   /usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/bin/python2.7-config
   /usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/bin/python-config


3./ Download and install a more recent kernel, kernel-headers and 
kernel-devel and use them on the compilation node (some tools use 
definitions only provided by a more recent kernel-headers package). In our case, 
we have used the kernels available at 
http://elrepo.org/linux/kernel/el6/x86_64/RPMS/ 
(3.10.101-1.el6.elrepo.x86_64)


   $ rpm -qa | grep kernel
   kernel-lt-headers-3.10.101-1.el6.elrepo.x86_64
   kernel-lt-devel-3.10.101-1.el6.elrepo.x86_64
   dracut-kernel-004-388.el6.noarch
   kernel-lt-3.10.101-1.el6.elrepo.x86_64


4./ Install and start a virtualenv with python 2.7 (from your 
SL6/Centos6 repos)


   $ yum install virtual-env

   $ virtualenv -p
   /usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/bin/python2.7
   --system-site-packages python27
   Running virtualenv with interpreter
   /usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/bin/python2.7
   New python executable in python27/bin/python2.7
   Also creating executable in python27/bin/python
   Installing Setuptools...done.
   Installing Pip...done.

   $ source python27/bin/activate
   (python27)#
   (python27)# which python-config
   /usr/local/sw/sl6/gcc484/x86_64/python/2.7.6/bin/python-config

5./ Install some python packages needed during ceph 10.2.1 compilation

   (python27)# pip install Cython
   (python27)# pip install nose
   (python27)# pip install requests
   (python27)# pip install sphinx


6./ Install the following software (from your SL6/Centos6 repos)

   (python27)$ wget
   
http://mirror.centos.org/centos/6/os/x86_64/Packages/snappy-devel-1.1.0-1.el6.x86_64.rpm
   (SL6 does not provide this package)
   (python27)$ yum localinstall snappy-devel-1.1.0-1.el6.x86_64.rpm
   (python27)$ yum install junit4 fcgi-devel expat-devel
   libbabeltrace-devel lttng-ust-devel gperftools-devel
   libatomic_ops-devel keyutils-libs-devel nss-devel bzip2-devel yasm
   xmlstarlet xfsprogs-devel snappy-devel libudev-devel libblkid-devel
   libxml2-devel libedit-devel libcurl-devel libaio-devel leveldb-devel
   fuse-devel cmake sharutils java-devel libselinux-devel
   selinux-policy-doc openldap-devel openssl-devel


7./ Ceph requires boost-random, which is impossible to satisfy since it 
does not exist in CentOS6/SL6. I downloaded the SRPM from CentOS 7 
and rebuilt it. Please note that my first rebuild failed with problems 
producing the debuginfo rpm. To solve it, I modified the rpm spec to 
include the following line at the beginning: '%define debug_package 
%{nil}'. This directive bypasses the debuginfo rpm creation.


   (python27)$ wget
   
http://vault.centos.org/7.1.1503/os/Source/SPackages/boost-1.53.0-23.el7.src.rpm
   (python27)$ yum install mpich-devel openmpi-devel chrpath libicu-devel
   (python27)$ rpmbuild --rebuild boost-1.53.0-23.el7.src.rpm
   (python27)$ yum remove 'boost*'
   (python27)$ cd /root/
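
   For reference, a minimal sketch of the spec edit described above in 7./
   (the path assumes the default rpmbuild topdir):

   (python27)$ rpm -ivh boost-1.53.0-23.el7.src.rpm
   (python27)$ sed -i '1i %define debug_package %{nil}' ~/rpmbuild/SPECS/boost.spec
   (python27)$ rpmbuild -ba ~/rpmbuild/SPECS/boost.spec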

Re: [ceph-users] Disk failures

2016-06-08 Thread Christian Balzer

Hello,

On Wed, 08 Jun 2016 20:26:56 + Krzysztof Nowicki wrote:

> Hi,
> 
> On Wed, 8 Jun 2016 at 21:35, Gandalf Corvotempesta <
> gandalf.corvotempe...@gmail.com> wrote:
> 
> > 2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki <
> > krzysztof.a.nowi...@gmail.com>:
> > > From my own experience with failing HDDs I've seen cases where the
> > > drive
> > was
> > > failing silently initially. This manifested itself in repeated deep
> > > scrub failures. Correct me if I'm wrong here, but Ceph keeps
> > > checksums of data being written and in case that data is read back
> > > corrupted on one of the OSDs this will be detected by scrub and
> > > reported as inconsistency. In
> > such
> > > cases automatic repair should be sufficient as having the checksums
> > > it is possible to tell which copy is correct. In such case the OSD
> > > will not be removed automatically and it's for the cluster
> > > administrator to get suspicious in case such an inconsistency occurs
> > > repeatedly and remove the OSD in question.
> >
> > Ok but could this lead to data corruption? What would happen to the
> > client if a write fails?
> >
> If a write fails due to an IO error on the underlying HDD the OSD daemon
> will most likely abort.
Indeed it will.

> In case a write succeeds but gets corrupted by a silent HDD failure you
> will have corrupted data on this OSD. I'm not sure if Ceph verifies the
> checksums upon read, but if it doesn't then the data read back by the
> client could be corrupted in case the corruption happened on the primary
> OSD for that PG.
That.

Ceph currently doesn't do any (relevant) checksumming at all, so if a
PRIMARY PG suffers from bit-rot this will be undetected until the next
deep-scrub.

This is one of the longest and gravest outstanding issues with Ceph and
supposed to be addressed with bluestore (which currently doesn't have
checksum verified reads either).
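
(For reference, a rough sketch of how such an inconsistency surfaces and is
handled today -- the PG id is a placeholder:)

    # force a deep scrub and check for flagged PGs
    ceph pg deep-scrub <pgid>          # or: ceph osd deep-scrub osd.<N>
    ceph health detail | grep inconsistent
    # on Jewel, inspect the offending objects, then repair
    rados list-inconsistent-obj <pgid> --format=json-pretty
    ceph pg repair <pgid>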


> The behaviour could also be affected by the filesystem the OSD is
> running. For example BTRFS is known for keeping data checksums and in
> such case reading corrupted data will fail at filesystem level and the
> OSD will just see an IO error.
> 
Correct.

However BTRFS (and ZFS) as filestore for Ceph do open other cans of worms.

Regards,

Christian
> >
> > > When the drive fails more severely and causes IO failures then the
> > > effect will most likely be an abort of the OSD daemon which causes
> > > the relevant
> > OSD
> > > to go down. The cause of the abort can be determined by examining the
> > logs.
> >
> > In this case, healing and rebalancing is done automatically, right?
> > If I want a replica 3 and one OSD fails, the objects stored on that OSD
> > would
> > be automatically moved and replicated across the cluster to keep my
> > replica requirement?
> >
> Yes, this is correct.
> 
> >
> > > In any case SMART is your best friend and it is strongly advised to
> > > run smartd in order to get early warnings.
> >
> > Yes, but SMART is not always reliable.
> >
> True, but it won't harm to have it running anyway.
> 
> >
> > All modern RAID controllers are able to read the whole disk (or disks)
> > looking for bad sectors or inconsistency,
> > the smart extended test doesn't do this
> >
> Strange. From what I understood the extended SMART test actually goes
> over each sector and tests it for readability.
> 
> Regards
> Chris


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk failures

2016-06-08 Thread list

Unless there has been a change recently, Ceph does not store checksums.

Deep scrub compares checksums across replicas.

See 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/034646.html



On 8 June 2016 22:27:46, Krzysztof Nowicki
wrote:



Hi,

On Wed, 8 Jun 2016 at 21:35, Gandalf Corvotempesta <
gandalf.corvotempe...@gmail.com> wrote:


2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki <
krzysztof.a.nowi...@gmail.com>:
> From my own experience with failing HDDs I've seen cases where the drive
was
> failing silently initially. This manifested itself in repeated deep scrub
> failures. Correct me if I'm wrong here, but Ceph keeps checksums of data
> being written and in case that data is read back corrupted on one of the
> OSDs this will be detected by scrub and reported as inconsistency. In
such
> cases automatic repair should be sufficient as having the checksums it is
> possible to tell which copy is correct. In such case the OSD will not be
> removed automatically and it's for the cluster administrator to get
> suspicious in case such an inconsistency occurs repeatedly and remove the
> OSD in question.

Ok but could this lead to data corruption? What would happen to the client
if a write fails?


If a write fails due to an IO error on the underlying HDD the OSD daemon
will most likely abort.
In case a write succeeds but gets corrupted by a silent HDD failure you
will have corrupted data on this OSD. I'm not sure if Ceph verifies the
checksums upon read, but if it doesn't then the data read back by the
client could be corrupted in case the corruption happened on the primary
OSD for that PG.
The behaviour could also be affected by the filesystem the OSD is running.
For example BTRFS is known for keeping data checksums and in such case
reading corrupted data will fail at filesystem level and the OSD will just
see an IO error.



> When the drive fails more severely and causes IO failures then the effect
> will most likely be an abort of the OSD daemon which causes the relevant
OSD
> to go down. The cause of the abort can be determined by examining the
logs.

In this case, healing and rebalancing is done automatically, right?
If I want a replica 3 and one OSD fails, the objects stored on that OSD
would
be automatically moved and replicated across the cluster to keep my
replica requirement?


Yes, this is correct.



> In any case SMART is your best friend and it is strongly advised to run
> smartd in order to get early warnings.

Yes, but SMART is not always reliable.


True, but it won't harm to have it running anyway.



All modern RAID controllers are able to read the whole disk (or disks)
looking for bad sectors or inconsistency,
the smart extended test doesn't do this


Strange. From what I understood the extended SMART test actually goes over
each sector and tests it for readability.

Regards
Chris



--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Want a free ticket to Red Hat Summit?

2016-06-08 Thread Patrick McGarry
Hey cephers,

This year Red Hat Summit is 27-30 June in San Francisco at the Moscone
and I have one extra (exhibit hall and keynotes only) pass to the
event. If you’d like to attend to meet with vendors, chat with other
attendees, and hang with an irreverent community manager, let me know.

The only catch is you need to hang around the Ceph Booth for an hour
or two each day to let me grab some food and run a session or two. Hit
me up if you’re interested though. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating from one Ceph cluster to another

2016-06-08 Thread Marek Dohojda
I have a Ceph cluster (Hammer) and I just built a new cluster
(Infernalis).  The existing cluster contains VM images used by KVM.

What I would like to do is move all the data from one Ceph cluster to
the other.  However, the only way I could find from my Google searches would
be to export each image to local disk, copy the image across to the new cluster,
and import it.

I am hoping that there is a way to just sync the data (and I do realize
that the KVMs will have to be down for the full migration) from one cluster to
the other.
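
(For what it's worth, a minimal sketch of streaming images directly between the
two clusters -- it assumes both clusters' conf files and keyrings are reachable
from one host, and the pool/image names are placeholders:)

  rbd -c /etc/ceph/old-cluster.conf export rbd/vm-disk-1 - \
    | rbd -c /etc/ceph/new-cluster.conf import - rbd/vm-disk-1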

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk failures

2016-06-08 Thread Krzysztof Nowicki
Hi,

On Wed, 8 Jun 2016 at 21:35, Gandalf Corvotempesta <
gandalf.corvotempe...@gmail.com> wrote:

> 2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki <
> krzysztof.a.nowi...@gmail.com>:
> > From my own experience with failing HDDs I've seen cases where the drive
> was
> > failing silently initially. This manifested itself in repeated deep scrub
> > failures. Correct me if I'm wrong here, but Ceph keeps checksums of data
> > being written and in case that data is read back corrupted on one of the
> > OSDs this will be detected by scrub and reported as inconsistency. In
> such
> > cases automatic repair should be sufficient as having the checksums it is
> > possible to tell which copy is correct. In such case the OSD will not be
> > removed automatically and it's for the cluster administrator to get
> > suspicious in case such an inconsistency occurs repeatedly and remove the
> > OSD in question.
>
> Ok but could this lead to data corruption? What would happen to the client
> if a write fails?
>
If a write fails due to an IO error on the underlying HDD the OSD daemon
will most likely abort.
In case a write succeeds but gets corrupted by a silent HDD failure you
will have corrupted data on this OSD. I'm not sure if Ceph verifies the
checksums upon read, but if it doesn't then the data read back by the
client could be corrupted in case the corruption happened on the primary
OSD for that PG.
The behaviour could also be affected by the filesystem the OSD is running.
For example BTRFS is known for keeping data checksums and in such case
reading corrupted data will fail at filesystem level and the OSD will just
see an IO error.

>
> > When the drive fails more severely and causes IO failures then the effect
> > will most likely be an abort of the OSD daemon which causes the relevant
> OSD
> > to go down. The cause of the abort can be determined by examining the
> logs.
>
> In this case, healing and rebalancing is done automatically, right?
> If I want a replica 3 and one OSD fails, the objects stored on that OSD
> would
> be automatically moved and replicated across the cluster to keep my
> replica requirement?
>
Yes, this is correct.

>
> > In any case SMART is your best friend and it is strongly advised to run
> > smartd in order to get early warnings.
>
> Yes, but SMART is not always reliable.
>
True, but it won't harm to have it running anyway.

>
> All modern RAID controllers are able to read the whole disk (or disks)
> looking for bad sectors or inconsistency,
> the smart extended test doesn't do this
>
Strange. From what I understood the extended SMART test actually goes over
each sector and tests it for readability.

Regards
Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Filestore update script?

2016-06-08 Thread Wido den Hollander

> On 8 June 2016 at 21:48, "WRIGHT, JON R (JON R)"
> wrote:
> 
> 
> Wido,
> 
> Thanks for that advice, and I'll follow it.  To your knowledge, is there 
> a FileStore Update script around somewhere?
> 

Not that I'm aware of. Just don't try to manually do things to OSDs. If they 
fail, they fail. Let them. Recovery will kick in and protect your data.

Wido

> Jon
> 
> On 6/8/2016 3:11 AM, Wido den Hollander wrote:
> >> On 7 June 2016 at 23:08, "WRIGHT, JON R (JON R)"
> >> wrote:
> >>
> >>
> >> I'm trying to recover an OSD after running xfs_repair on the disk. It
> >> seems to be ok now.  There is a log message that includes the following:
> >> "Please run the FileStore update script before starting the OSD, or set
> >> filestore_update_to to 4"
> >>
> > why did you run the xfs_repair? My recommendation is always to wipe an OSD 
> > which has XFS errors. They don't show up by accident. Something has 
> > happened. Bit-rot on the disk? Controller failure?
> >
> > I'd say, wipe the OSD and let the Ceph recovery take care of it. Re-format 
> > it with XFS and rebuild the OSD.
> >
> > Wido
> >
> >> What is the FileStore update script?  Google search doesn't produce
> >> useful information on what or where it is.   Also the
> >> filestore_update_to option is set in what config file?
> >>
> >> Thanks
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Filestore update script?

2016-06-08 Thread WRIGHT, JON R (JON R)

Wido,

Thanks for that advice, and I'll follow it.  To your knowledge, is there 
a FileStore Update script around somewhere?


Jon

On 6/8/2016 3:11 AM, Wido den Hollander wrote:

On 7 June 2016 at 23:08, "WRIGHT, JON R (JON R)"
wrote:


I'm trying to recover an OSD after running xfs_repair on the disk. It
seems to be ok now.  There is a log message that includes the following:
"Please run the FileStore update script before starting the OSD, or set
filestore_update_to to 4"


why did you run the xfs_repair? My recommendation is always to wipe an OSD which 
has XFS errors. They don't show up by accident. Something has happened. Bit-rot 
on the disk? Controller failure?

I'd say, wipe the OSD and let the Ceph recovery take care of it. Re-format it 
with XFS and rebuild the OSD.

Wido


What is the FileStore update script?  Google search doesn't produce
useful information on what or where it is.   Also the
filestore_update_to option is set in what config file?

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk failures

2016-06-08 Thread Gandalf Corvotempesta
2016-06-08 20:49 GMT+02:00 Krzysztof Nowicki :
> From my own experience with failing HDDs I've seen cases where the drive was
> failing silently initially. This manifested itself in repeated deep scrub
> failures. Correct me if I'm wrong here, but Ceph keeps checksums of data
> being written and in case that data is read back corrupted on one of the
> OSDs this will be detected by scrub and reported as inconsistency. In such
> cases automatic repair should be sufficient as having the checksums it is
> possible to tell which copy is correct. In such case the OSD will not be
> removed automatically and it's for the cluster administrator to get
> suspicious in case such an inconsistency occurs repeatedly and remove the
> OSD in question.

Ok but could this lead to data corruption? What would happen to the client
if a write fails?

> When the drive fails more severely and causes IO failures then the effect
> will most likely be an abort of the OSD daemon which causes the relevant OSD
> to go down. The cause of the abort can be determined by examining the logs.

In this case, healing and rebalancing is done automatically, right?
If I want a replica 3 and one OSD fails, the objects stored on that OSD would
be automatically moved and replicated across the cluster to keep my
replica requirement?

> In any case SMART is your best friend and it is strongly advised to run
> smartd in order to get early warnings.

Yes, but SMART is not always reliable.

All modern RAID controllers are able to read the whole disk (or disks)
looking for bad sectors or inconsistencies;
the SMART extended test doesn't do this.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk failures

2016-06-08 Thread Krzysztof Nowicki
Hi,

From my own experience with failing HDDs I've seen cases where the drive
was failing silently initially. This manifested itself in repeated deep
scrub failures. Correct me if I'm wrong here, but Ceph keeps checksums of
data being written and in case that data is read back corrupted on one of
the OSDs this will be detected by scrub and reported as inconsistency. In
such cases automatic repair should be sufficient as having the checksums it
is possible to tell which copy is correct. In such case the OSD will not be
removed automatically and it's for the cluster administrator to get
suspicious in case such an inconsistency occurs repeatedly and remove the
OSD in question.

When the drive fails more severely and causes IO failures then the effect
will most likely be an abort of the OSD daemon which causes the relevant
OSD to go down. The cause of the abort can be determined by examining the
logs.

In any case SMART is your best friend and it is strongly advised to run
smartd in order to get early warnings.
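
(A minimal sketch of that -- the device name and schedule are just examples:)

  # one-off long self-test and status check
  smartctl -t long /dev/sdb
  smartctl -a /dev/sdb

  # /etc/smartd.conf: monitor all devices, long self-test every Saturday 03:00, mail on trouble
  DEVICESCAN -a -o on -S on -s L/../../6/03 -m root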

Regards
Chris

On Tue, 7 Jun 2016 at 22:06, Gandalf Corvotempesta <
gandalf.corvotempe...@gmail.com> wrote:

> Hi,
> How does Ceph detect and manage disk failures?  What happens if some data is
> written on a bad sector?
>
> Is there any chance of the bad sector getting "distributed" across the
> cluster due to the replication?
>
> Is ceph able to remove the OSD bound to the failed disk automatically?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error in OSD

2016-06-08 Thread Tu Holmes
Hey Cephers.

Is there a way to force a fix on this error?

/var/log/ceph/ceph-osd.46.log.2.gz:4845:2016-06-06 22:26:57.322073
7f3569b2a700 -1 log_channel(cluster) log [ERR] : 24.325 shard 20: soid
325/hit_set_24.325_archive_2016-05-17 06:35:28.136171_2016-06-01
14:55:35.910702/head/.ceph-internal/24 missing attr _, missing attr snapset

/var/log/ceph/ceph-osd.46.log.2.gz:4846:2016-06-06 22:28:45.469568
7f356b7a0700 -1 log_channel(cluster) log [ERR] : 24.325 scrub 0 missing, 1
inconsistent objects

/var/log/ceph/ceph-osd.46.log.2.gz:4847:2016-06-06 22:28:45.469571
7f356b7a0700 -1 log_channel(cluster) log [ERR] : 24.325 scrub 1 errors


To give some context:

I take each node offline during the upgrade to Jewel so I can modify permissions
and then complete the ceph-deploy portions.

When everything comes back online, I typically get "one" of
these types of errors.

Yesterday, I had 2 of these and a ceph pg repair fixed one, and the other
just magically cleared itself up.


I tried a ceph pg repair on this PG, but it hasn't cleared up.

Is this one of the situations where I should let it simmer and see if it
just fixes itself?
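
(In case it helps, a rough sketch of digging further -- assuming Jewel's
list-inconsistent tooling is available:)

  rados list-inconsistent-obj 24.325 --format=json-pretty
  ceph pg deep-scrub 24.325
  ceph pg repair 24.325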
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy prepare journal on software raid ( md device )

2016-06-08 Thread Oliver Dzombic
Hi,

I read that ceph-deploy does not support software RAID devices:

http://tracker.ceph.com/issues/13084

But that's already nearly a year ago, and the problem is different.

As it seems to me, the "only" major problem is that the newly created
journal partition remains in the "Device or resource busy" state, so
ceph-deploy gives up after some time.

Does anyone know a workaround?
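
An untested sketch of one possible workaround -- pre-create the journal
partition on the md device by hand (the sgdisk flags mirror what ceph-disk
logs below; the size and partition number are placeholders) and then point
ceph-deploy at the existing partition instead of the raw md device:

/usr/sbin/sgdisk --new=1:0:+5120M --change-name=1:'ceph journal' \
    --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/md128
partprobe /dev/md128
ceph-deploy osd prepare cephosd1:/dev/sdf:/dev/md128p1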


[root@cephmon1 ceph-cluster-gen2]# ceph-deploy osd prepare
cephosd1:/dev/sdf:/dev/md128
[ceph_deploy.conf][DEBUG ] found configuration file at:
/root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.33): /usr/bin/ceph-deploy osd
prepare cephosd1:/dev/sdf:/dev/md128
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  disk  : [('cephosd1',
'/dev/sdf', '/dev/md128')]
[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  bluestore : None
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: prepare
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
/etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   :

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : xfs
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  zap_disk  : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks
cephosd1:/dev/sdf:/dev/md128
[cephosd1][DEBUG ] connected to host: cephosd1
[cephosd1][DEBUG ] detect platform information from remote host
[cephosd1][DEBUG ] detect machine type
[cephosd1][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.2.1511 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to cephosd1
[cephosd1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[cephosd1][WARNIN] osd keyring does not exist yet, creating one
[cephosd1][DEBUG ] create a keyring file
[ceph_deploy.osd][DEBUG ] Preparing host cephosd1 disk /dev/sdf journal
/dev/md128 activate False
[cephosd1][DEBUG ] find the location of an executable
[cephosd1][INFO  ] Running command: /usr/sbin/ceph-disk -v prepare
--cluster ceph --fs-type xfs -- /dev/sdf /dev/md128
[cephosd1][WARNIN] command: Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=fsid
[cephosd1][WARNIN] command: Running command: /usr/bin/ceph-osd
--check-allows-journal -i 0 --cluster ceph
[cephosd1][WARNIN] command: Running command: /usr/bin/ceph-osd
--check-wants-journal -i 0 --cluster ceph
[cephosd1][WARNIN] command: Running command: /usr/bin/ceph-osd
--check-needs-journal -i 0 --cluster ceph
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cephosd1][WARNIN] command: Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=osd_journal_size
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdf1 uuid path is
/sys/dev/block/8:81/dm/uuid
[cephosd1][WARNIN] command: Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[cephosd1][WARNIN] command: Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/md128 uuid path is
/sys/dev/block/9:128/dm/uuid
[cephosd1][WARNIN] prepare_device: OSD will not be hot-swappable if
journal is not the same device as the osd data
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/md128 uuid path is
/sys/dev/block/9:128/dm/uuid
[cephosd1][WARNIN] ptype_tobe_for_name: name = journal
[cephosd1][WARNIN] get_dm_uuid: get_dm_uuid /dev/md128 uuid path is
/sys/dev/block/9:128/dm/uuid
[cephosd1][WARNIN] command: Running command: /usr/sbin/parted --machine
-- /dev/md128 print
BYT;
 lyzing
[cephosd1][WARNIN] /dev/md128:240GB:md:512:512:unknown:Linux Software
RAID Array:;
[cephosd1][WARNIN]
[cephosd1][WARNIN] create_partition: Creating journal partition num 1
size 2 on /dev/md128
[cephosd1][WARNIN] command_check_call: Running command: /usr/sbin/sgdisk
--new=1:0:+2M --change-name=1:ceph journal
--partition-guid=1:449fc1e0-ae4b-40ea-b214-02659682d0bd
--typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/md128
[cephosd1][DEBUG ] Creating new GPT entries.
[cephosd1][DEBUG ] The operation has compl

[ceph-users] radosgw issue resolved, documentation suggestions

2016-06-08 Thread Sylvain, Eric
Gentlemen, I have resolved my issue; it was resolved using [client.rgw.gateway].

Towards helping others, I have the following comments for the documentation
people, unless somehow I am missing a nuance in using [client.rgw.gateway] and
[client.rgw.] and [client.radosgw.gateway] and
[client.radosgw.]

Starting at the top:
http://docs.ceph.com/docs/jewel/radosgw/

Then to…

MANUAL INSTALL  (http://docs.ceph.com/docs/jewel/install/install-ceph-gateway/)
===
Could be consistent with ceph.conf specifiers:
CHANGE THE DEFAULT PORT
   Mentions: [client.rgw.gateway-node1]
-and then in another section-
MIGRATING FROM APACHE TO CIVETWEB
   Mentions:[client.radosgw.gateway-node1]
   Mentions:[client.radosgw.gateway-node]
Note the difference in “rgw” and “radosgw”

SIMPLE CONFIGURATION (http://docs.ceph.com/docs/jewel/radosgw/config/)
===
Talk about using name for “gateway” instance in first paragraph, which is not 
consistent with “Manual Install”, which uses  (i.e., hostname -s).
ADD A GATEWAY CONFIGURATION TO CEPH
   Mentions: [client.radosgw.gateway]
Not consistent with “rgw” vs “radosgw”
Talks about Apache and civetweb, these could be more distinct and presented as 
either this civetweb or apache config, not intermingled…



-Original Message-
From: Karol Mroz [mailto:km...@suse.de] 
Sent: Tuesday, June 07, 2016 1:59 PM
To: Sylvain, Eric 
Cc: LOPEZ Jean-Charles ; ceph-us...@ceph.com
Subject: Re: [ceph-users] New user questions with radosgw with Jewel 10.2.1

Hi Eric,

Please see inline...

On Tue, Jun 07, 2016 at 05:14:25PM +, Sylvain, Eric wrote:
> 
> Yes, my system is set to run as “ceph”:
>
> /etc/systemd/system/ceph-radosgw.target.wants/ceph-radosgw@rgw.p6-os1-mon7.service
>ExecStart=/usr/bin/radosgw -f --cluster ${CLUSTER} --name 
> client.%i -conf  --setuser ceph --setgroup ceph

The first thing to check is the heading line of the RGW section in your 
ceph.conf file.

Systemd passes --name as "client.%i" where "%i" expands to (in your case): 
rgw.p6-os1-mon7.
Thus, your ceph.conf RGW heading should be: [client.rgw.p6-os1-mon7]. If these 
entries do not match, RGW will not parse the necessary configuration section 
and will simply use defaults (i.e. port 7480, etc).
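
(Illustrative snippet only -- the section name must match what systemd passes
as --name, and the frontend line is just an example:)

[client.rgw.p6-os1-mon7]
host = p6-os1-mon7
rgw frontends = civetweb port=80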

> 
> Yet changing these to “root” has no effect.
>ExecStart=/usr/bin/radosgw -f --cluster ${CLUSTER} --name client.%i 
> -conf  --setuser root --setgroup root

Using user root is not needed for binding to privileged ports. When configured 
to use civetweb, RGW delegates the permission drop to civetweb. Civetweb does 
this _after_ binding to a port, reading certificates, etc. So, using the 
default "--user ceph" and "--group ceph" is best.

> 
> Also doing: chown root.root /usr/bin/radosgw; chmod 4755 
> /usr/bin/radosgw Had no effect.

This is not needed.

> 
> I feel the issue is in reading /etc/ceph/ceph.conf, because even if I change
>rgw frontends = “bogus bogus bogus”
> Expecting some failure, it still started up fine (on port 7480).
> The config still says:
> # ceph --admin-daemon /var/run/ceph/ceph-client.rgw.p6-os1-mon7.asok 
> config show
> …
> "rgw_frontends": "fastcgi, civetweb port=7480",
> …

Which points to RGW not parsing it's config section.

> 
> Again suspecting keyring I created “client.radosgw.gateway” and 
> changed ceph.conf for it, to see if that would help, no luck…
> 
> Could this be tied to having admin, mon and radosgw on same host?

RGW should reside happily alongside any of the ceph daemons.

> 
> Does keyring restrict what parts of ceph.conf are available?
> 
> Thanks in advance for anything you can provide
> 

Hope this helps.

--
Regards,
Karol
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Difference between step choose and step chooseleaf

2016-06-08 Thread Gregory Farnum
On Wed, Jun 8, 2016 at 8:22 AM, George Shuklin  wrote:
> Hello.
>
> Can someone help me to see the difference between step choose and step
> chooseleaf in a CRUSH map?

When you run "choose" on a CRUSH bucket type, it selects CRUSH bucket
nodes of that type. If you run chooseleaf, it selects leaf nodes
underneath that bucket type (i.e., it goes all the way down the tree to
picking specific OSDs). If you're doing a simple map where you want
to separate all copies across racks, or OSDs, or whatever, it's
generally best to just use chooseleaf on your failure domain. If
you're trying to do something more complicated like run 2x2 or
something, you'll want to choose your way down to the split points and
then run chooseleaf to go the rest of the way.
-Greg
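
(To illustrate the two cases with hedged example rules -- bucket names and
sizes are placeholders:)

    # simple case: spread each replica across racks
    rule replicated_racks {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type rack
            step emit
    }

    # 2x2 case: pick 2 racks, then 2 OSDs (via hosts) in each
    rule two_by_two {
            ruleset 2
            type replicated
            min_size 4
            max_size 4
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }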
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Difference between step choose and step chooseleaf

2016-06-08 Thread George Shuklin

Hello.

Can someone help me to see the difference between step choose and step 
chooseleaf in a CRUSH map?


Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how o understand pg full

2016-06-08 Thread lin zhou
Hi cephers,
I know what "osd full" means. AFAIK, a PG is just a logical concept, so what does "pg full" mean?


Thanks.

--
hnuzhoul...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SignatureDoesNotMatch when authorize v4 with HTTPS.

2016-06-08 Thread Khang Nguyễn Nhật
Hello all,
I'm having problems with AWS4 authentication when using HTTPS (my cluster
runs Ceph Jewel 10.2.1 on CentOS 7). I used boto3 to create a
presigned URL; here's my example:

s3 = boto3.client(service_name='s3', region_name='', use_ssl=False,
endpoint_url='https://rgw.x.x',
  aws_access_key_id= ,
  aws_secret_access_key= ,
  config=Config(signature_version='s3v4', region_name='')
 )
url = s3.generate_presigned_url(ClientMethod='list_buckets',
HttpMethod='GET', ExpiresIn=3600)
rsp = requests.get(url, proxies={'http': '', 'https': ''}, headers={'': ''})

Then I received error 403 SignatureDoesNotMatch. And this is my rgw.log:

SERVER_PORT = 0
SERVER_PORT_SECURE = 443
HTTP_HOST: rgw.x.x
format = canonical host headers: rgw.x.x: 0
..
failed to authorize the request
req 1: 0.007245: s3: GET /: list_buckets: http status = 403
..

I've seen this in
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rest_s3.cc:
int RGW_Auth_S3::authorize_v4(RGWRados *store, struct req_state *s){
  ..
  string port = s->info.env->get("SERVER_PORT", "");
  string secure_port = s->info.env->get("SERVER_PORT_SECURE", "");
 ...
if (using_qs && (token == "host")) {
  if (!port.empty() && port != "80") {
token_value = token_value + ":" + port;
  } else if (!secure_port.empty() && secure_port != "443") {
token_value = token_value + ":" + secure_port;
  }
}
.

So if SERVER_PORT = 0, then the canonical host header becomes "rgw.x.x: 0" and
it leads to a SignatureDoesNotMatch error?
I do not know how to make civetweb in RGW listen on ports 80 and 443s to avoid
this error.
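
(For reference, a hedged example of a civetweb frontend bound to 80 and 443s;
the section name and certificate path are placeholders:)

[client.rgw.gateway]
rgw frontends = civetweb port=80+443s ssl_certificate=/etc/ceph/private/rgw.pem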
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cache Tier

2016-06-08 Thread Adrien Gillard
Hello Vincent,

There was indeed a bug in hammer 0.94.6 that caused data corruption, only
if you were using min_read_recency_for_promote > 1.
That was discussed on the mailing list [0] and fixed in 0.94.7 [1]

AFAIK, infernalis releases were never affected.


[0] http://www.spinics.net/lists/ceph-users/msg26356.html
[1] http://tracker.ceph.com/issues/15171
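
(A quick way to check an existing cache pool -- the pool name is a placeholder:)

  ceph osd pool get <cache-pool> min_read_recency_for_promote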

On Wed, Jun 8, 2016 at 9:41 AM, Vincent Godin  wrote:

> Is there now a stable version of Ceph in Hammer and/or Infernalis whis
> which we can safely use cache tier in write back mode ?
> I saw few month ago a post saying that we have to wait for a next release
> to use it safely.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph file change monitor

2016-06-08 Thread John Spray
On Wed, Jun 8, 2016 at 8:40 AM, siva kumar <85s...@gmail.com> wrote:
> Dear Team,
>
> We are using ceph storage & cephFS for mounting .
>
> Our configuration :
>
> 3 osd
> 3 monitor
> 4 clients .
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>
> We would like to get file change notifications like what is the event
> (ADDED, MODIFIED,DELETED) and for which file the event has occurred. These
> notifications should be sent to our server.
> How to get these notifications?

This isn't a feature that CephFS has right now.  Still, I would be
interested to know what protocol/format your server would consume
these kinds of notifications in?

John

> Ultimately we would like to add our custom file watch notification hooks to
> ceph so that we can handle this notifications by our self .
>
> Additional Info :
>
> [test@ceph-zclient1 ~]$ ceph -s
>
>> cluster a8c92ae6-6842-4fa2-bfc9-8cdefd28df5c
>
>  health HEALTH_WARN
> mds0: ceph-client1 failing to respond to cache pressure
> mds0: ceph-client2 failing to respond to cache pressure
> mds0: ceph-client3 failing to respond to cache pressure
> mds0: ceph-client4 failing to respond to cache pressure
>  monmap e1: 3 mons at
> {ceph-zadmin=xxx.xxx.xxx.xxx:6789/0,ceph-zmonitor=xxx.xxx.xxx.xxx:6789/0,ceph-zmonitor1=xxx.xxx.xxx.xxx:6789/0}
> election epoch 16, quorum 0,1,2
> ceph-zadmin,ceph-zmonitor1,ceph-zmonitor
>  mdsmap e52184: 1/1/1 up {0=ceph-zstorage1=up:active}
>  osdmap e3278: 3 osds: 3 up, 3 in
>   pgmap v5068139: 384 pgs, 3 pools, 518 GB data, 7386 kobjects
> 1149 GB used, 5353 GB / 6503 GB avail
>  384 active+clean
>
>   client io 1259 B/s rd, 179 kB/s wr, 11 op/s
>
>
>
> Thanks,
> S.Sivakumar
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-06-08 Thread Bastian Rosner

Hi,

regarding clustered VyOS on KVM: in theory this sounds like a safe plan, 
but it will come with a great performance penalty because of all the 
context switches. And even with PCI passthrough you will feel 
increased latency.


Docker/LXC/LXD on the other hand does not share the context-switch 
dilemma. Not sure if VyOS likes to run in a docker container though.


I didn't have a chance to play with VPP[1] yet, but it sounds like it 
could be quite useful for high-performance routing/switching inside a 
container.


[1]: https://wiki.fd.io/view/VPP

Cheers, Bastian

On 2016-06-08 09:04, Josef Johansson wrote:

Hi,

Regarding single points of failure on the daemon on the host I was 
thinking
about doing a cluster setup with i.e. VyOS on kvm-machines on the host, 
and

they handle all the ospf stuff as well. I have not done any performance
benchmarks but it should be possible to do at least. Maybe even 
possible to
do in docker or straight in lxc since it's mostly route management in 
the

kernel.

Regards,
Josef

On Mon, 6 Jun 2016, 18:54 Jeremy Hanmer, 
wrote:

We do the same thing. OSPF between ToR switches, BGP to all of the 
hosts

with each one advertising its own /32 (each has 2 NICs).

On Mon, Jun 6, 2016 at 6:29 AM, Luis Periquito 
wrote:


Nick,

TL;DR: works brilliantly :)

Where I work we have all of the ceph nodes (and a lot of other stuff)
using OSPF and BGP server attachment. With that we're able to 
implement
solutions like Anycast addresses, removing the need to add load 
balancers,

for the radosgw solution.

The biggest issues we've had were around the per-flow vs per-packets
traffic load balancing, but as long as you keep it simple you 
shouldn't

have any issues.

Currently we have a P2P network between the servers and the ToR 
switches
on a /31 subnet, and then create a virtual loopback address, which is 
the
interface we use for all communications. Running tests like iperf 
we're
able to reach 19Gbps (on a 2x10Gbps network). OTOH we no longer have 
the

ability to separate traffic between public and osd network, but never
really felt the need for it.

Also spend a bit of time planning how the network will look like and 
it's
topology. If done properly (think details like route summarization) 
then

it's really worth the extra effort.



On Mon, Jun 6, 2016 at 11:57 AM, Nick Fisk  wrote:


Hi All,



Has anybody had any experience with running the network routed down 
all

the way to the host?



I know the standard way most people configured their OSD nodes is to
bond the two nics which will then talk via a VRRP gateway and then 
probably
from then on the networking is all Layer3. The main disadvantage I 
see here
is that you need a beefy inter switch link to cope with the amount 
of
traffic flowing between switches to the VRRP address. I’ve been 
trying to
design around this by splitting hosts into groups with different 
VRRP
gateways on either switch, but this relies on using active/passive 
bonding
on the OSD hosts to make sure traffic goes from the correct Nic to 
the

directly connected switch.



What I was thinking, instead of terminating the Layer3 part of the
network at the access switches, terminate it at the hosts. If each 
Nic of
the OSD host had a different subnet and the actual “OSD Server” 
address
bound to a loopback adapter, OSPF should advertise this loopback 
adapter
address as reachable via the two L3 links on the physically attached 
Nic’s.
This should give you a redundant topology which also will respect 
your
physically layout and potentially give you higher performance due to 
ECMP.




Any thoughts, any pitfalls?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-06-08 Thread Luis Periquito
> OTOH, running ceph on dynamically routed networks will put your routing
> daemon (e.g. bird) in a SPOF position...
>
I run a somewhat large estate with either BGP or OSPF attachment; not
only is Ceph happy with either of them, I have never had issues with
the routing daemons (after setting them up properly). However, I only
run RJ45 copper.

I have only had issues when both links, for several unrelated reasons,
became unavailable, and then the host is not contactable.
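
(For the record, a minimal bird 1.x sketch of OSPF to the host -- the router id,
interface names and loopback handling are assumptions, not a drop-in config:)

  log syslog all;
  router id 10.0.0.11;

  protocol kernel {
          export all;               # push learned routes into the kernel table
  }
  protocol device {
  }
  protocol ospf {
          area 0.0.0.0 {
                  interface "eth0", "eth1" {
                          type ptp;         # the /31 p2p links to the ToR switches
                  };
                  interface "lo" {
                          stub yes;         # advertise the loopback /32, no adjacency
                  };
          };
  }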
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Cache Tier

2016-06-08 Thread Vincent Godin
Is there now a stable version of Ceph in Hammer and/or Infernalis with
which we can safely use a cache tier in writeback mode?
I saw a post a few months ago saying that we have to wait for a future release
to use it safely.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph file change monitor

2016-06-08 Thread siva kumar
Dear Team,

We are using Ceph storage & CephFS for mounting.

Our configuration :

3 osd
3 monitor
4 clients .
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

We would like to get file change notifications specifying what the event is
(ADDED, MODIFIED, DELETED) and which file the event occurred on. These
notifications should be sent to our server.
How can we get these notifications?

Ultimately we would like to add our custom file-watch notification hooks to
Ceph so that we can handle these notifications ourselves.

Additional Info :

[test@ceph-zclient1 ~]$ ceph -s

cluster a8c92ae6-6842-4fa2-bfc9-8cdefd28df5c
>
 health HEALTH_WARN
mds0: ceph-client1 failing to respond to cache pressure
mds0: ceph-client2 failing to respond to cache pressure
mds0: ceph-client3 failing to respond to cache pressure
mds0: ceph-client4 failing to respond to cache pressure
 monmap e1: 3 mons at
{ceph-zadmin=xxx.xxx.xxx.xxx:6789/0,ceph-zmonitor=xxx.xxx.xxx.xxx:6789/0,ceph-zmonitor1=xxx.xxx.xxx.xxx:6789/0}
election epoch 16, quorum 0,1,2
ceph-zadmin,ceph-zmonitor1,ceph-zmonitor
 mdsmap e52184: 1/1/1 up {0=ceph-zstorage1=up:active}
 osdmap e3278: 3 osds: 3 up, 3 in
  pgmap v5068139: 384 pgs, 3 pools, 518 GB data, 7386 kobjects
1149 GB used, 5353 GB / 6503 GB avail
 384 active+clean

  client io 1259 B/s rd, 179 kB/s wr, 11 op/s



Thanks,
S.Sivakumar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can a pool tier to other pools more than once ? 回复: Must host bucket name be the same with hostname ?

2016-06-08 Thread Christian Balzer

Hello,

On Wed, 8 Jun 2016 15:16:32 +0800 秀才 wrote:

> Thanks!
> 
> 
> It seems to work!
> 
> 
> I configure my cluster's crush rulesest according to
> https://elkano.org/blog/ceph-sata-ssd-pools-server-editing-crushmap/.
> Then restart my cluster, things looks like ok.
> 
> 
> My tests have not finished.
> Go on making tier-cache.
> 
> 
> ceph osd tier add images images ssdpool
> ceph osd tier cache-mode ssdpool writeback
> ceph osd tier add images volumes ssdpool
> ceph osd tier cache-mode ssdpool writeback
> 
To answer your question below, no that won't work.
A cache pool can not be shared.
If you google for this, you may actually find a thread where I asked that
same question. 

> 
> 
> But 'ceph -s' replies:
> 
> 
> 1 cache pools are missing hit_sets
> 
If the 4 commands up there are all you did, your cache tier setup isn't
finished. 
Re-read the documentation and the various cache tier threads here,
including my "Cache tier operation clarifications" thread.

> 
> And then 'ceph osd tree' replies (a bit longer):
> 
> 
> ID  WEIGHT  TYPE NAME                     UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -10 0.5     root default-ssd
>  -6 0.14000     host bjd-01-control1-ssd
>   3 0.06999         osd.3                    down  1.0          1.0
>   4 0.06999         osd.4                    down  1.0          1.0
>  -7 0.14000     host bjd-01-control2-ssd
>  11 0.06999         osd.11                   down  1.0          1.0
>  12 0.06999         osd.12                   down  1.0          1.0
>  -8 0.17999     host bjd-01-compute1-ssd
>  18 0.09000         osd.18                   down  1.0          1.0
>  19 0.09000         osd.19                   down  1.0          1.0
>  -9 0.14000     host bjd-01-compute2-ssd
>  28 0.06999         osd.28                   down  1.0          1.0
>  29 0.06999         osd.29                   down  1.0          1.0
>  -1 6.06000 root default
>  -2 1.5         host bjd-01-control1
>   0 0.25000         osd.0                      up  1.0          1.0
>   2 0.25000         osd.2                      up  1.0          1.0
>   5 0.25000         osd.5                      up  1.0          1.0
>   6 0.25000         osd.6                      up  1.0          1.0
>  22 0.25000         osd.22                     up  1.0          1.0
>  23 0.25000         osd.23                     up  1.0          1.0
>  -3 1.5         host bjd-01-control2
>   7 0.25000         osd.7                      up  1.0          1.0
>   8 0.25000         osd.8                      up  1.0          1.0
>   9 0.25000         osd.9                      up  1.0          1.0
>  10 0.25000         osd.10                     up  1.0          1.0
>  13 0.25000         osd.13                     up  1.0          1.0
>  14 0.25000         osd.14                     up  1.0          1.0
>  -4 1.56000     host bjd-01-compute1
>  15 0.25000         osd.15                     up  1.0          1.0
>  16 0.25000         osd.16                     up  1.0          1.0
>  17 0.26999         osd.17                     up  1.0          1.0
>  20 0.26999         osd.20                     up  1.0          1.0
>  21 0.26999         osd.21                     up  1.0          1.0
>   1 0.25000         osd.1                      up  1.0          1.0
>  -5 1.5         host bjd-01-compute2
>  24 0.25000         osd.24                     up  1.0          1.0
>  25 0.25000         osd.25                     up  1.0          1.0
>  26 0.25000         osd.26                     up  1.0          1.0
>  27 0.25000         osd.27                     up  1.0          1.0
>  30 0.25000         osd.30                     up  1.0          1.0
>  31 0.25000         osd.31                     up  1.0          1.0
> 
> 
> In the end, i run ceph-osd manually, 'ceph-osd -i 12
> -c /etc/ceph/ceph.conf -f':
> 
> 
> SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00
> 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 SG_IO:
> bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00 20 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2016-06-08
> 06:21:58.335164 7fc376d74880 -1 osd.12 383 log_to_monitors

That's just very very bad, I don't know what you're doing there but you
either have HW or configuration problems.

Christian

> {default=true} ./include/interval_set.h: In function 'void
> interval_set::erase(T, T) [with T = snapid_t]' thread 7fc34dbe9700
> time 2016-06-08 06:21:58.341270 ./include/interval_set.h: 386: FAILED
> assert(_size >= 0) ./incl

[ceph-users] Can a pool tier to other pools more than once ? 回复: Must host bucket name be the same with hostname ?

2016-06-08 Thread 秀才
Thanks!


It seems to work!


I configured my cluster's CRUSH rulesets according to 
https://elkano.org/blog/ceph-sata-ssd-pools-server-editing-crushmap/.
Then I restarted my cluster, and things looked OK.


My tests have not finished.
I went on to set up the cache tier.


ceph osd tier add images images ssdpool
ceph osd tier cache-mode ssdpool writeback
ceph osd tier add images volumes ssdpool
ceph osd tier cache-mode ssdpool writeback



But 'ceph -s' replies:


1 cache pools are missing hit_sets


And then 'ceph osd tree' replies (a bit longer):


ID  WEIGHT  TYPE NAME                     UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10 0.5     root default-ssd
 -6 0.14000     host bjd-01-control1-ssd
  3 0.06999         osd.3                    down  1.0          1.0
  4 0.06999         osd.4                    down  1.0          1.0
 -7 0.14000     host bjd-01-control2-ssd
 11 0.06999         osd.11                   down  1.0          1.0
 12 0.06999         osd.12                   down  1.0          1.0
 -8 0.17999     host bjd-01-compute1-ssd
 18 0.09000         osd.18                   down  1.0          1.0
 19 0.09000         osd.19                   down  1.0          1.0
 -9 0.14000     host bjd-01-compute2-ssd
 28 0.06999         osd.28                   down  1.0          1.0
 29 0.06999         osd.29                   down  1.0          1.0
 -1 6.06000 root default
 -2 1.5         host bjd-01-control1
  0 0.25000         osd.0                      up  1.0          1.0
  2 0.25000         osd.2                      up  1.0          1.0
  5 0.25000         osd.5                      up  1.0          1.0
  6 0.25000         osd.6                      up  1.0          1.0
 22 0.25000         osd.22                     up  1.0          1.0
 23 0.25000         osd.23                     up  1.0          1.0
 -3 1.5         host bjd-01-control2
  7 0.25000         osd.7                      up  1.0          1.0
  8 0.25000         osd.8                      up  1.0          1.0
  9 0.25000         osd.9                      up  1.0          1.0
 10 0.25000         osd.10                     up  1.0          1.0
 13 0.25000         osd.13                     up  1.0          1.0
 14 0.25000         osd.14                     up  1.0          1.0
 -4 1.56000     host bjd-01-compute1
 15 0.25000         osd.15                     up  1.0          1.0
 16 0.25000         osd.16                     up  1.0          1.0
 17 0.26999         osd.17                     up  1.0          1.0
 20 0.26999         osd.20                     up  1.0          1.0
 21 0.26999         osd.21                     up  1.0          1.0
  1 0.25000         osd.1                      up  1.0          1.0
 -5 1.5         host bjd-01-compute2
 24 0.25000         osd.24                     up  1.0          1.0
 25 0.25000         osd.25                     up  1.0          1.0
 26 0.25000         osd.26                     up  1.0          1.0
 27 0.25000         osd.27                     up  1.0          1.0
 30 0.25000         osd.30                     up  1.0          1.0
 31 0.25000         osd.31                     up  1.0          1.0


In the end, I ran ceph-osd manually, 'ceph-osd -i 12 -c /etc/ceph/ceph.conf -f':


SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2016-06-08 06:21:58.335164 7fc376d74880 -1 osd.12 383 log_to_monitors 
{default=true}
./include/interval_set.h: In function 'void interval_set::erase(T, T) [with 
T = snapid_t]' thread 7fc34dbe9700 time 2016-06-08 06:21:58.341270
./include/interval_set.h: 386: FAILED assert(_size >= 0)
./include/interval_set.h: In function 'void interval_set::erase(T, T) [with 
T = snapid_t]' thread 7fc34d3e8700 time 2016-06-08 06:21:58.341246
./include/interval_set.h: 386: FAILED assert(_size >= 0)
./include/interval_set.h: In function 'void interval_set::erase(T, T) [with 
T = snapid_t]' thread 7fc34c3e6700 time 2016-06-08 06:21:58.342349
./include/interval_set.h: 386: FAILED assert(_size >= 0)
./include/interval_set.h: In function 'void interval_set::erase(T, T) [with 
T = snapid_t]' thread 7fc34ab

Re: [ceph-users] Filestore update script?

2016-06-08 Thread Wido den Hollander

> On 7 June 2016 at 23:08, "WRIGHT, JON R (JON R)"
> wrote:
> 
> 
> I'm trying to recover an OSD after running xfs_repair on the disk. It 
> seems to be ok now.  There is a log message that includes the following: 
> "Please run the FileStore update script before starting the OSD, or set 
> filestore_update_to to 4"
> 

why did you run the xfs_repair? My recommendation is always to wipe an OSD which 
has XFS errors. They don't show up by accident. Something has happened. Bit-rot 
on the disk? Controller failure?

I'd say, wipe the OSD and let the Ceph recovery take care of it. Re-format it 
with XFS and rebuild the OSD.
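
(For reference, a rough sketch of that procedure; the OSD id, device and init
system are placeholders, so adapt before use:)

  ceph osd out osd.12
  # let recovery/backfill finish before removing it
  systemctl stop ceph-osd@12        # or the equivalent for your init system
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm osd.12
  ceph-disk prepare --fs-type xfs /dev/sdX
  ceph-disk activate /dev/sdX1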

Wido

> What is the FileStore update script?  Google search doesn't produce 
> useful information on what or where it is.   Also the 
> filestore_update_to option is set in what config file?
> 
> Thanks
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-06-08 Thread Josef Johansson
Hi,

Regarding single points of failure in the routing daemon on the host, I was thinking
about doing a clustered setup with e.g. VyOS on KVM machines on the host, and
letting them handle all the OSPF stuff as well. I have not done any performance
benchmarks, but it should at least be possible. Maybe it is even possible to
do in Docker or straight in LXC, since it's mostly route management in the
kernel.

Regards,
Josef

On Mon, 6 Jun 2016, 18:54 Jeremy Hanmer, 
wrote:

> We do the same thing. OSPF between ToR switches, BGP to all of the hosts
> with each one advertising its own /32 (each has 2 NICs).
>
> On Mon, Jun 6, 2016 at 6:29 AM, Luis Periquito 
> wrote:
>
>> Nick,
>>
>> TL;DR: works brilliantly :)
>>
>> Where I work we have all of the ceph nodes (and a lot of other stuff)
>> using OSPF and BGP server attachment. With that we're able to implement
>> solutions like Anycast addresses, removing the need to add load balancers,
>> for the radosgw solution.
>>
>> The biggest issues we've had were around the per-flow vs per-packets
>> traffic load balancing, but as long as you keep it simple you shouldn't
>> have any issues.
>>
>> Currently we have a P2P network between the servers and the ToR switches
>> on a /31 subnet, and then create a virtual loopback address, which is the
>> interface we use for all communications. Running tests like iperf we're
>> able to reach 19Gbps (on a 2x10Gbps network). OTOH we no longer have the
>> ability to separate traffic between public and osd network, but never
>> really felt the need for it.
>>
>> Also spend a bit of time planning how the network will look like and it's
>> topology. If done properly (think details like route summarization) then
>> it's really worth the extra effort.
>>
>>
>>
>> On Mon, Jun 6, 2016 at 11:57 AM, Nick Fisk  wrote:
>>
>>> Hi All,
>>>
>>>
>>>
>>> Has anybody had any experience with running the network routed down all
>>> the way to the host?
>>>
>>>
>>>
>>> I know the standard way most people configured their OSD nodes is to
>>> bond the two nics which will then talk via a VRRP gateway and then probably
>>> from then on the networking is all Layer3. The main disadvantage I see here
>>> is that you need a beefy inter switch link to cope with the amount of
>>> traffic flowing between switches to the VRRP address. I’ve been trying to
>>> design around this by splitting hosts into groups with different VRRP
>>> gateways on either switch, but this relies on using active/passive bonding
>>> on the OSD hosts to make sure traffic goes from the correct Nic to the
>>> directly connected switch.
>>>
>>>
>>>
>>> What I was thinking, instead of terminating the Layer3 part of the
>>> network at the access switches, terminate it at the hosts. If each Nic of
>>> the OSD host had a different subnet and the actual “OSD Server” address
>>> bound to a loopback adapter, OSPF should advertise this loopback adapter
>>> address as reachable via the two L3 links on the physically attached Nic’s.
>>> This should give you a redundant topology which also will respect your
>>> physically layout and potentially give you higher performance due to ECMP.
>>>
>>>
>>>
>>> Any thoughts, any pitfalls?
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com