Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-19 Thread Christian Balzer

Hello,

On Fri, 20 May 2016 03:44:52 + EP Komarla wrote:

> Thanks Christian.  Point noted.  Going forward I will write text to make
> it easy to read.
> 
> Thanks for your response.  Losing a journal drive seems expensive as I
> will have to rebuild 5 OSDs in this eventuality.
>
Potentially, there are ways to avoid a full rebuild, but that depends on
some factors and is pretty advanced stuff.

It's expensive, but as Dyweni wrote, it is an expected situation your cluster
should be able to handle.

The chances of losing a journal SSD unexpectedly are of course going to be
very small if you choose the right type of SSD, Intel DC 37xx or at least
36xx for example.
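For the record, the no-full-rebuild path boils down to what the link quoted
below describes: recreate the journals on the replacement SSD and let the
OSDs carry on. A rough sketch for a single OSD (the OSD id, device handling
and unit names are assumptions); if the SSD died uncleanly the flush step is
obviously not possible, and whether the rest is still safe depends on the
factors mentioned above:

  ceph osd set noout
  systemctl stop ceph-osd@12          # or your init system's equivalent
  ceph-osd -i 12 --flush-journal      # only while the old journal is still readable
  # replace/repartition the SSD, point the OSD's journal at the new partition
  ceph-osd -i 12 --mkjournal
  systemctl start ceph-osd@12
  ceph osd unset noout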
 
Christian

> - epk
> 
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com] 
> Sent: Thursday, May 19, 2016 7:00 PM
> To: ceph-users@lists.ceph.com
> Cc: EP Komarla 
> Subject: Re: [ceph-users] Do you see a data loss if a SSD hosting
> several OSD journals crashes
> 
> 
> Hello,
> 
> first of all, wall of text. Don't do that. 
> Use returns and paragraphs liberally to make reading easy.
> I'm betting at least half of the people who could have answered your
> question took a look at this blob of text and ignored it.
> 
> Secondly, search engines are your friend.
> The first hit when googling for "ceph ssd journal failure" is this gem:
> http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/
> 
> Losing a journal SSD will at most cost you the data on all associated
> OSDs and thus the recovery/backfill traffic, if you don't feel like
> doing what the link above describes.
> 
> Ceph will not acknowledge a client write before all journals (replica
> size, 3 by default) have received the data, so losing one journal SSD
> will NEVER result in an actual data loss.
> 
> Christian
> 
> On Fri, 20 May 2016 01:38:08 + EP Komarla wrote:
> 
> >   *   We are trying to assess if we are going to see a data loss if an
> > SSD that is hosting journals for few OSDs crashes. In our 
> > configuration, each SSD is partitioned into 5 chunks and each chunk is 
> > mapped as a journal drive for one OSD. What I understand from the Ceph
> > documentation: "Consistency: Ceph OSD Daemons require a filesystem 
> > interface that guarantees atomic compound operations. Ceph OSD Daemons 
> > write a description of the operation to the journal and apply the 
> > operation to the filesystem. This enables atomic updates to an object 
> > (for example, placement group metadata). Every few seconds-between 
> > filestore max sync interval and filestore min sync interval-the Ceph 
> > OSD Daemon stops writes and synchronizes the journal with the 
> > filesystem, allowing Ceph OSD Daemons to trim operations from the 
> > journal and reuse the space. On failure, Ceph OSD Daemons replay the 
> > journal starting after the last synchronization operation." So, my 
> > question is what happens if an SSD fails - am I going to lose all the 
> > data that has not been written/synchronized to OSD?  In my case, am I 
> > going to lose data for all the 5 OSDs which can be bad?  This is of 
> > concern to us. What are the options to prevent any data loss at all?  
> > Is it better to have the journals on the same hard drive, i.e., to 
> > have one journal per OSD and host it on the same hard drive?  Of 
> > course, performance will not be as good as having an SSD for OSD 
> > journal. In this case, I am thinking I will not lose data as there are 
> > secondary OSDs where data is replicated (we are using triple 
> > replication).  Any thoughts?  What other solutions people have adopted 
> > for data reliability and consistency to address the case I am
> > mentioning?
> > 
> > 
> > 
> > Legal Disclaimer:
> > The information contained in this message may be privileged and 
> > confidential. It is intended to be read only by the individual or 
> > entity to whom it is addressed or by their designee. If the reader of 
> > this message is not the intended recipient, you are on notice that any 
> > distribution of this message, in any form, is strictly prohibited. If 
> > you have received this message in error, please immediately notify the 
> > sender and delete or destroy any copy of this message!
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mark out vs crush weight 0

2016-05-19 Thread Christian Balzer

Hello,

On Thu, 19 May 2016 13:26:33 +0200 Oliver Dzombic wrote:

> Hi,
> 
> a sparedisk is a nice idea.
> 
> But i think thats something you can also do with a shellscript.
> 

Definitely, but then you're very likely to get into conflict with your MONs
and what they want to do.

For example you would have to query the running, active configuration of
your timeouts from the monitors to make sure you act before they do.

Doable, yes. Easy and 100% safe, not so much.
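Just to illustrate the first step, such a script would have to read the
active timeouts from a monitor before making any decision, e.g. via the
admin socket on a monitor host (the monitor name is an assumption):

  ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval
  ceph daemon mon.$(hostname -s) config get mon_osd_report_timeout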

Christian

> Checking if an osd is down or out and just using your spare disk.
> 
> Maybe the programming resources should not be used for something most
> of us can do with a simple shell script checking every 5 seconds the
> situation.
> 
> 
> 
> Maybe better idea ( in my humble opinion ) is to solve this stuff by
> optimizing the code in recovery situations.
> 
> Currently we have things like
> 
> client-op-priority,
> recovery-op-priority,
> max-backfills,
> recovery-max-active and so on
> 
> to limit the performance impact in a recovery situation.
> 
> And still, in a situation of recovery the performance goes downhill (a
> lot) when all OSDs start to refill the to-be-recovered OSD.
> 
> In my case, i was removing old HDD's from a cluster.
> 
> If i down/out them ( 6 TB drives 40-50% full ) the cluster's performance
> will go down very dramatically. So i had to reduce the weight by 0.1
> steps to ease this pain, but could not remove it completely.
> 
> 
> So i think the tools / code to protect the cluster's performance ( even
> in recovery situation ) can be improved.
> 
> Of course, on one hand, we want to make sure, that asap the configured
> amount of replica's and this way, datasecurity is restored.
> 
> But on the other hand, it does not help too much if the recovery
> procedure will impact the cluster's performance on a level where the
> usability is too much reduced.
> 
> So maybe introduce another config option to control this ratio?
> 
> To control more effectively how much IOPS/bandwidth is used (maybe
> straight in numbers in the form of an IO rate limit) so that administrators
> have the chance to config, according to the hardware environment, the
> "perfect" settings for their individual usecase.
> 
> 
> Because, right now, when i reduce the weight of a 6 TB HDD, while having
> ~ 30 OSD's in the cluster, from 1.0 to 0.9, around 3-5% of data will be
> moved around the cluster ( replication 2 ).
> 
> While its moving, there is a true performance hit on the virtual servers.
> 
> So if this could be solved, by a IOPS/HDD Bandwidth rate limit, that i
> can simply tell the cluster to use max. 10 IOPS and/or 10 MB/s for the
> recovery, then i think it would be a great help for any usecase and
> administrator.
> 
> Thanks !
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-19 Thread Dyweni - Ceph-Users

Hi,

Yes and no, for the actual data loss.  This depends on your crush map.

If you're using the original map (which came with the installation),
then your smallest failure domain will be the host.  If you have replica
size 3 and 3 hosts with 5 OSDs per host (15 OSDs total), then losing the
journal SSD in one host will only result in the data on that specific
host being lost and having to be re-created from the other two hosts.
This is an anticipated failure.


If you changed the crush map to make the smallest failure domain the OSD,
and Ceph places all copies of a piece of data on OSDs belonging to the
SAME journal, then yes, you could end up with some pieces of data
completely lost when that journal dies.


If I were in your shoes and I didn't want to set the smallest 
failure domain to be the host, then I would create a new level 'ssd' and 
make that my failure domain.  This way, if I had 10 OSDs and 2 SSD 
Journals per host, my crush map would look like this:  5 hosts -> 2 
Journals/host -> 5 OSDs/Journal.  This way, if I lost a journal, I would 
be losing only one copy of my data.  Ceph would not place more than one 
copy of data per Journal (even though there are 5 OSDs behind that 
Journal).
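To make that concrete, here is a rough sketch of what the decompiled CRUSH
map could look like with such a level; all names, ids, weights and the type
renumbering are made up, and you would adapt the workflow to your own map:

  # ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o map.txt
  # in map.txt, add a bucket type between osd and host:
  type 0 osd
  type 1 journal
  type 2 host
  ...

  journal node1-ssd-a {
          id -11
          alg straw
          hash 0  # rjenkins1
          item osd.0 weight 1.000
          item osd.1 weight 1.000
          item osd.2 weight 1.000
          item osd.3 weight 1.000
          item osd.4 weight 1.000
  }

  host node1 {
          id -2
          alg straw
          hash 0  # rjenkins1
          item node1-ssd-a weight 5.000
          item node1-ssd-b weight 5.000
  }

  rule replicated_by_journal {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type journal
          step emit
  }

  # crushtool -c map.txt -o new.bin && ceph osd setcrushmap -i new.bin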


I know this is a bit advanced, but I hope this clarifies things for you.

Dyweni




On 2016-05-19 20:59, Christian Balzer wrote:

Hello,

first of all, wall of text. Don't do that.
Use returns and paragraphs liberally to make reading easy.
I'm betting at least half of the people who could have answered your
question took a look at this blob of text and ignored it.

Secondly, search engines are your friend.
The first hit when googling for "ceph ssd journal failure" is this gem:
http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/

Losing a journal SSD will at most cost you the data on all associated
OSDs and thus the recovery/backfill traffic, if you don't feel like doing
what the link above describes.

Ceph will not acknowledge a client write before all journals (replica
size, 3 by default) have received the data, so losing one journal SSD
will NEVER result in an actual data loss.

Christian

On Fri, 20 May 2016 01:38:08 + EP Komarla wrote:


  *   We are trying to assess if we are going to see a data loss if an
SSD that is hosting journals for few OSDs crashes. In our 
configuration,

each SSD is partitioned into 5 chunks and each chunk is mapped as a
journal drive for one OSD. What I understand from the Ceph
documentation: "Consistency: Ceph OSD Daemons require a filesystem
interface that guarantees atomic compound operations. Ceph OSD Daemons
write a description of the operation to the journal and apply the
operation to the filesystem. This enables atomic updates to an object
(for example, placement group metadata). Every few seconds-between
filestore max sync interval and filestore min sync interval-the Ceph 
OSD

Daemon stops writes and synchronizes the journal with the filesystem,
allowing Ceph OSD Daemons to trim operations from the journal and 
reuse

the space. On failure, Ceph OSD Daemons replay the journal starting
after the last synchronization operation." So, my question is what
happens if an SSD fails - am I going to lose all the data that has not
been written/synchronized to OSD?  In my case, am I going to lose data
for all the 5 OSDs which can be bad?  This is of concern to us. What 
are

the options to prevent any data loss at all?  Is it better to have the
journals on the same hard drive, i.e., to have one journal per OSD and
host it on the same hard drive?  Of course, performance will not be as
good as having an SSD for OSD journal. In this case, I am thinking I
will not lose data as there are secondary OSDs where data is 
replicated
(we are using triple replication).  Any thoughts?  What other 
solutions
people have adopted for data reliability and consistency to address 
the

case I am mentioning?



Legal Disclaimer:
The information contained in this message may be privileged and
confidential. It is intended to be read only by the individual or 
entity

to whom it is addressed or by their designee. If the reader of this
message is not the intended recipient, you are on notice that any
distribution of this message, in any form, is strictly prohibited. If
you have received this message in error, please immediately notify the
sender and delete or destroy any copy of this message!



--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-19 Thread Christian Balzer

Hello,

first of all, wall of text. Don't do that. 
Use returns and paragraphs liberally to make reading easy.
I'm betting at least half of the people who could have answered your
question took a look at this blob of text and ignored it.

Secondly, search engines are your friend.
The first hit when googling for "ceph ssd journal failure" is this gem:
http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/

Losing a journal SSD will at most cost you the data on all associated
OSDs and thus the recovery/backfill traffic, if you don't feel like doing
what the link above describes.

Ceph will not acknowledge a client write before all journals (replica
size, 3 by default) have received the data, so losing one journal SSD
will NEVER result in an actual data loss.

Christian

On Fri, 20 May 2016 01:38:08 + EP Komarla wrote:

>   *   We are trying to assess if we are going to see a data loss if an
> SSD that is hosting journals for few OSDs crashes. In our configuration,
> each SSD is partitioned into 5 chunks and each chunk is mapped as a
> journal drive for one OSD. What I understand from the Ceph
> documentation: "Consistency: Ceph OSD Daemons require a filesystem
> interface that guarantees atomic compound operations. Ceph OSD Daemons
> write a description of the operation to the journal and apply the
> operation to the filesystem. This enables atomic updates to an object
> (for example, placement group metadata). Every few seconds-between
> filestore max sync interval and filestore min sync interval-the Ceph OSD
> Daemon stops writes and synchronizes the journal with the filesystem,
> allowing Ceph OSD Daemons to trim operations from the journal and reuse
> the space. On failure, Ceph OSD Daemons replay the journal starting
> after the last synchronization operation." So, my question is what
> happens if an SSD fails - am I going to lose all the data that has not
> been written/synchronized to OSD?  In my case, am I going to lose data
> for all the 5 OSDs which can be bad?  This is of concern to us. What are
> the options to prevent any data loss at all?  Is it better to have the
> journals on the same hard drive, i.e., to have one journal per OSD and
> host it on the same hard drive?  Of course, performance will not be as
> good as having an SSD for OSD journal. In this case, I am thinking I
> will not lose data as there are secondary OSDs where data is replicated
> (we are using triple replication).  Any thoughts?  What other solutions
> people have adopted for data reliability and consistency to address the
> case I am mentioning?
> 
> 
> 
> Legal Disclaimer:
> The information contained in this message may be privileged and
> confidential. It is intended to be read only by the individual or entity
> to whom it is addressed or by their designee. If the reader of this
> message is not the intended recipient, you are on notice that any
> distribution of this message, in any form, is strictly prohibited. If
> you have received this message in error, please immediately notify the
> sender and delete or destroy any copy of this message!


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-19 Thread EP Komarla
  *   We are trying to assess if we are going to see a data loss if an SSD that 
is hosting journals for a few OSDs crashes. In our configuration, each SSD is 
partitioned into 5 chunks and each chunk is mapped as a journal drive for one 
OSD.

What I understand from the Ceph documentation: "Consistency: Ceph OSD Daemons 
require a filesystem interface that guarantees atomic compound operations. Ceph 
OSD Daemons write a description of the operation to the journal and apply the 
operation to the filesystem. This enables atomic updates to an object (for 
example, placement group metadata). Every few seconds-between filestore max sync 
interval and filestore min sync interval-the Ceph OSD Daemon stops writes and 
synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to trim 
operations from the journal and reuse the space. On failure, Ceph OSD Daemons 
replay the journal starting after the last synchronization operation."

So, my question is what happens if an SSD fails - am I going to lose all the 
data that has not been written/synchronized to the OSD?  In my case, am I going 
to lose data for all the 5 OSDs, which can be bad?  This is of concern to us. 
What are the options to prevent any data loss at all?

Is it better to have the journals on the same hard drive, i.e., to have one 
journal per OSD and host it on the same hard drive?  Of course, performance 
will not be as good as having an SSD for the OSD journal. In this case, I am 
thinking I will not lose data as there are secondary OSDs where data is 
replicated (we are using triple replication).

Any thoughts?  What other solutions have people adopted for data reliability 
and consistency to address the case I am mentioning?



Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread Christian Balzer

Hello,

On Fri, 20 May 2016 00:11:02 + David Turner wrote:

> You can also mount the rbd with the discard option. It works the same
> way as you would mount an ssd to free up the space when you delete
> things. I use the discard option on my ext4 rbds on Ubuntu and it frees
> up the used Ceph space immediately.
> 
While that certainly works (and has the advantage of being fully automatic
and not needing any human or CRON intervention), it has a big disadvantage
as well.

With normal storage (SSDs) the consensus already is that a manual fstrim at a
time of low utilization is preferable, since TRIM activity can slow things
down as the SSD does its housekeeping.

With Ceph this is even more pronounced (TRIM'ing RBD images is quite
costly in terms of IO/IOPS), so doing TRIMs only when actually needed and
during quiet times is much preferred.
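In practice that can be as simple as a cron entry that runs fstrim against
the RBD-backed mount point during a quiet window; the mount point, schedule
and log path below are of course just placeholders:

  # /etc/cron.d/fstrim-rbd
  30 3 * * 0   root   /sbin/fstrim -v /mnt/rbdvol >> /var/log/fstrim-rbd.log 2>&1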

Regards,

Christian
> Sent from my iPhone
> 
> On May 19, 2016, at 12:30 PM, Albert Archer
> > wrote:
> 
> Thank you for your great support .
> 
> Best Regards
> Albert
> 
> On Thu, May 19, 2016 at 10:41 PM, Udo Lembke
> > wrote: Hi Albert,
> to free unused space you must enable trim (or do an fstrim) in the vm -
> and all things in the storage chain must support this. The normal
> virtio-driver don't support trim, but if you use scsi-disks with
> virtio-scsi-driver you can use it. Work well but need some time for huge
> filesystems.
> 
> Udo
> 
> 
> On 19.05.2016 19:58, Albert Archer wrote:
> Hello All.
> I am newbie in ceph. and i use jewel release for testing purpose. it
> seems every thing is OK, HEALTH_OK , all of OSDs are in UP and IN state.
> I create some RBD images (rbd create  ) and map to some ubuntu
> host .
> I can read and write data to my volume , but when i delete some content
> from volume (e,g some huge files,...), populated capacity of cluster
> does not free and None of objects were clean.
> what is the problem ???
> 
> Regards
> Albert
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread David Turner
You can also mount the rbd with the discard option. It works the same way as 
you would mount an ssd to free up the space when you delete things. I use the 
discard option on my ext4 rbds on Ubuntu and it frees up the used Ceph space 
immediately.
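For reference, with a kernel-mapped rbd that is just the usual discard mount
option (device, image and mount point are placeholders, and the kernel has to
be recent enough to support RBD discard):

  mount -o discard,noatime /dev/rbd0 /mnt/rbdvol
  # or in /etc/fstab
  /dev/rbd/rbd/myimage  /mnt/rbdvol  ext4  defaults,noatime,discard  0 0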

Sent from my iPhone

On May 19, 2016, at 12:30 PM, Albert Archer 
> wrote:

Thank you for your great support .

Best Regards
Albert

On Thu, May 19, 2016 at 10:41 PM, Udo Lembke 
> wrote:
Hi Albert,
to free unused space you must enable trim (or do an fstrim) in the vm - and all 
things in the storage chain must support this.
The normal virtio-driver don't support trim, but if you use scsi-disks with 
virtio-scsi-driver you can use it.
Work well but need some time for huge filesystems.

Udo


On 19.05.2016 19:58, Albert Archer wrote:
Hello All.
I am newbie in ceph. and i use jewel release for testing purpose. it
seems every thing is OK, HEALTH_OK , all of OSDs are in UP and IN state.
I create some RBD images (rbd create  ) and map to some ubuntu
host .
I can read and write data to my volume , but when i delete some content
from volume (e,g some huge files,...), populated capacity of cluster
does not free and None of objects were clean.
what is the problem ???

Regards
Albert



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread Albert Archer
Thank you for your great support .

Best Regards
Albert

On Thu, May 19, 2016 at 10:41 PM, Udo Lembke  wrote:

> Hi Albert,
> to free unused space you must enable trim (or do an fstrim) in the vm -
> and all things in the storage chain must support this.
> The normal virtio-driver don't support trim, but if you use scsi-disks
> with virtio-scsi-driver you can use it.
> Work well but need some time for huge filesystems.
>
> Udo
>
>
> On 19.05.2016 19:58, Albert Archer wrote:
>
> Hello All.
> I am newbie in ceph. and i use jewel release for testing purpose. it
> seems every thing is OK, HEALTH_OK , all of OSDs are in UP and IN state.
> I create some RBD images (rbd create  ) and map to some ubuntu
> host .
> I can read and write data to my volume , but when i delete some content
> from volume (e,g some huge files,...), populated capacity of cluster
> does not free and None of objects were clean.
> what is the problem ???
>
> Regards
> Albert
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread Udo Lembke
Hi Albert,
to free unused space you must enable trim (or do an fstrim) in the VM -
and all things in the storage chain must support this.
The normal virtio driver doesn't support trim, but if you use SCSI disks
with the virtio-scsi driver you can use it.
It works well but needs some time for huge filesystems.
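For reference, in a libvirt/qemu setup that roughly means a virtio-scsi
controller plus discard='unmap' on the disk; a minimal sketch, assuming a
recent enough libvirt/qemu, with pool, image and monitor names made up:

  <controller type='scsi' model='virtio-scsi'/>
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' discard='unmap'/>
    <source protocol='rbd' name='rbd/vm-disk-1'>
      <host name='mon0.example.com' port='6789'/>
    </source>
    <target dev='sda' bus='scsi'/>
  </disk>

Inside the guest you can then mount with -o discard or run fstrim
periodically.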

Udo

On 19.05.2016 19:58, Albert Archer wrote:
> Hello All.
> I am newbie in ceph. and i use jewel release for testing purpose. it
> seems every thing is OK, HEALTH_OK , all of OSDs are in UP and IN state.
> I create some RBD images (rbd create  ) and map to some ubuntu
> host . 
> I can read and write data to my volume , but when i delete some content
> from volume (e,g some huge files,...), populated capacity of cluster
> does not free and None of objects were clean.
> what is the problem ???
>
> Regards 
> Albert
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread Edward R Huyer
That is normal behavior.  Ceph has no understanding of the filesystem living on 
top of the RBD, so it doesn’t know when space is freed up.  If you are running 
a sufficiently current kernel, you can use fstrim to cause the kernel to tell 
Ceph what blocks are free.  More details here:  
http://www.sebastien-han.fr/blog/2015/01/26/ceph-and-krbd-discard/


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Albert 
Archer
Sent: Thursday, May 19, 2016 1:59 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] ceph storage capacity does not free when deleting 
contents from RBD volumes

Hello All.
I am newbie in ceph. and i use jewel release for testing purpose. it
seems every thing is OK, HEALTH_OK , all of OSDs are in UP and IN state.
I create some RBD images (rbd create  ) and map to some ubuntu
host .
I can read and write data to my volume , but when i delete some content
from volume (e,g some huge files,...), populated capacity of cluster
does not free and None of objects were clean.
what is the problem ???

Regards
Albert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread Albert Archer
Hello All.
I am a newbie in Ceph and I use the Jewel release for testing purposes. It
seems everything is OK: HEALTH_OK, and all OSDs are in the UP and IN state.
I create some RBD images (rbd create  ) and map them to some Ubuntu
hosts.
I can read and write data to my volumes, but when I delete some content
from a volume (e.g. some huge files), the used capacity of the cluster
does not free up and none of the objects were cleaned.
What is the problem?

Regards
Albert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph hang on pg list_unfound

2016-05-19 Thread Samuel Just
Restart osd.1 with debugging enabled

debug osd = 20
debug filestore = 20
debug ms = 1

Then, run list_unfound once the pg is back in active+recovering.  If
it still hangs, post osd.1's log to the list along with the output of
ceph osd dump and ceph pg dump.
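For reference, should you need to adjust the levels again without another
restart, the corresponding runtime commands would be something like:

  # note the current level first (on the host running osd.1)
  ceph daemon osd.1 config get debug_osd
  # raise the levels on the running daemon
  ceph tell osd.1 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'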
-Sam

On Wed, May 18, 2016 at 6:20 PM, Don Waterloo  wrote:
> I am running 10.2.0-0ubuntu0.16.04.1.
> I've run into a problem w/ cephfs metadata pool. Specifically I have a pg w/
> an 'unfound' object.
>
> But i can't figure out which since when i run:
> ceph pg 12.94 list_unfound
>
> it hangs (as does ceph pg 12.94 query). I know its in the cephfs metadata
> pool since I run:
> ceph pg ls-by-pool cephfs_metadata |egrep "pg_stat|12\\.94"
>
> and it shows it there:
> pg_stat objects mip degrmispunf bytes   log disklog
> state   state_stamp v   reportedup  up_primary
> acting  acting_primary  last_scrub  scrub_stamp last_deep_scrub
> deep_scrub_stamp
> 12.94   231 1   1   0   1   90  30923092
> active+recovering+degraded  2016-05-18 23:49:15.718772  8957'386130
> 9472:367098 [1,4]   1   [1,4]   1   8935'385144 2016-05-18
> 10:46:46.123526 8337'379527 2016-05-14 22:37:05.974367
>
> OK, so what is hanging, and how can i get it to unhang so i can run a
> 'mark_unfound_lost' on it?
>
> pg 12.94 is on osd.0
>
> ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 5.48996 root default
> -2 0.8 host nubo-1
>  0 0.8 osd.0 up  1.0  1.0
> -3 0.8 host nubo-2
>  1 0.8 osd.1 up  1.0  1.0
> -4 0.8 host nubo-3
>  2 0.8 osd.2 up  1.0  1.0
> -5 0.92999 host nubo-19
>  3 0.92999 osd.3 up  1.0  1.0
> -6 0.92999 host nubo-20
>  4 0.92999 osd.4 up  1.0  1.0
> -7 0.92999 host nubo-21
>  5 0.92999 osd.5 up  1.0  1.0
>
> I cranked the logging on osd.0. I see a lot of messages, but nothing
> interesting.
>
> I've double checked all nodes can ping each other. I've run 'xfs_repair' on
> the underlying xfs storage to check for issues (there were none).
>
> Can anyone suggest how to uncrack this hang so i can try and repair this
> system?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enabling hammer rbd features on cluster with a few dumpling clients

2016-05-19 Thread Jason Dillaman
On Thu, May 19, 2016 at 12:15 PM, Dan van der Ster  wrote:
> I hope it will just refuse to
> attach, rather than attach but allow bad stuff to happen.

You are correct -- older librbd/krbd clients will refuse to open
images that have unsupported features enabled.

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cluster [ERR] osd.NN: inconsistent clone_overlap found for oid xxxxxxxx/rbd_data and OSD crashes

2016-05-19 Thread Frode Nordahl
Hello,

We recently had an outage on our Ceph storage cluster caused by what I believe 
to be a bug in Ceph. At the time of the incident all MONs, OSDs and clients 
(except for one) were running Ceph Hammer 0.94.6.

To start describing the incident I will portray a hierarchy of RBD 
volumes/snapshots/clones:
root: volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a
child:  
volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35
child:volumes/volume-02557bbd-d513-4c7c-9035-73dac6704a93

This is a long-lived cluster so I am not sure under what version the above 
construction of volumes was created. I guess we started out with Firefly at 
some point back in time.

At approximately 09:22 a request was issued to delete 
snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35.

Some time after this a «rbd -p volumes disk-usage» command was also issued from 
a 10.2.0 client. I don't know if it is relevant, but thought it was worth 
noting.

A short while after these messages occured in the logs:
2016-05-16 09:27:47.858897 osd.38 [xxx]:6805/2576 995 : cluster [ERR] osd.38: 
inconsistent clone_overlap found for oid 
46b786d5/rbd_data.26b3cb5fbccda4.0185/head//29 clone 0
2016-05-16 09:27:46.011693 osd.48 [xxx]:6801/2362 697 : cluster [ERR] osd.48: 
inconsistent clone_overlap found for oid 
9e88fdc5/rbd_data.26b3cb5fbccda4.0009/head//29 clone 0
2016-05-16 09:27:48.280009 osd.48 [xxx]:6801/2362 698 : cluster [ERR] osd.48: 
inconsistent clone_overlap found for oid 
81532b89/rbd_data.26b3cb5fbccda4.01ce/head//29 clone 0
2016-05-16 09:27:49.706609 osd.44 [xxx]:6801/2434 613 : cluster [ERR] osd.44: 
inconsistent clone_overlap found for oid 
bddaa5a1/rbd_data.26b3cb5fbccda4.0304/head//29 clone 0
2016-05-16 09:27:48.999400 osd.37 [xxx]:6803/2538 902 : cluster [ERR] osd.37: 
inconsistent clone_overlap found for oid 
ca18f5ca/rbd_data.26b3cb5fbccda4.026b/head//29 clone 0
2016-05-16 09:27:49.014679 osd.36 [xxx]:6807/2598 1015 : cluster [ERR] osd.36: 
inconsistent clone_overlap found for oid 
eca5f5dc/rbd_data.26b3cb5fbccda4.026c/head//29 clone 0
2016-05-16 09:27:49.235251 osd.36 [xxx]:6807/2598 1016 : cluster [ERR] osd.36: 
inconsistent clone_overlap found for oid 
9fb5b7dc/rbd_data.26b3cb5fbccda4.02a1/head//29 clone 0
2016-05-16 09:27:49.526915 osd.50 [xxx]:6805/2410 693 : cluster [ERR] osd.50: 
inconsistent clone_overlap found for oid 
ab5eff09/rbd_data.26b3cb5fbccda4.02c6/head//29 clone 0
2016-05-16 09:27:50.336825 osd.36 [xxx]:6807/2598 1017 : cluster [ERR] osd.36: 
inconsistent clone_overlap found for oid 
d7922b2b/rbd_data.26b3cb5fbccda4.0392/head//29 clone 0
2016-05-16 09:27:50.037706 osd.38 [xxx]:6805/2576 996 : cluster [ERR] osd.38: 
inconsistent clone_overlap found for oid 
d70156d5/rbd_data.26b3cb5fbccda4.034f/head//29 clone 0
2016-05-16 09:27:51.875372 osd.44 [xxx]:6801/2434 614 : cluster [ERR] osd.44: 
inconsistent clone_overlap found for oid 
e839050d/rbd_data.26b3cb5fbccda4.04f0/head//29 clone 0

At around 10:00 OSDs started to crash. After a long time of debugging we figured 
out that the OSD crashes were tied to write operations related to the volume at 
the bottom of the ancestry outlined above.

The tricky thing about this is that during the time we spent figuring this out, 
the crashing OSDs caused a ripple effect throughout the cluster. When the first 
OSD eventually was marked down+out, the next one to handle the requests would 
start crashing, and so on and so forth. This had a devastating effect on our 
cluster, and we were effectively more or less down for 12 - 17 hours while 
trying to figure out the root cause of the problem.

During the search for the root cause we first upgraded the cluster to Hammer 
0.94.7 and then to Jewel 10.2.1. So all these versions are still affected by 
the issue.

There are separate problems here:
1) Why was the attempt to delete the snapshot successful(?) / why did the 
attempt at deleting the snapshot cause the error messages?
2) Would it be useful to attempt to detect requests that crash the OSDs and at 
some point deny them?
3) We still have the «affected» volume in our cluster and cannot do anything 
with it. Attempts to delete or otherwise modify it still cause OSDs to crash, 
and we need some way of removing it.

Crashdump caught with gdb from one of the OSDs:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fc2e4f9f700 (LWP 10954)]
0x7fc30d45 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0x7fc30d45 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00889e99 in operator-- (this=) at 
/usr/include/c++/4.8/bits/stl_tree.h:204
#2  operator* (this=) at 
/usr/include/c++/4.8/bits/stl_iterator.h:163
#3  operator-> (this=) at 
/usr/include/c++/4.8/bits/stl_iterator.h:173
#4  ReplicatedPG::make_writeable 

[ceph-users] Enabling hammer rbd features on cluster with a few dumpling clients

2016-05-19 Thread Dan van der Ster
Hi,

We want to enable the hammer rbd features on newly created Cinder
volumes [1], but we still have a few VMs running with super old librbd
running (dumpling).

Perhaps its academic, but does anyone know the expected behaviour if
an old dumpling-linked qemu-kvm tries to attach an rbd with
exclusive-lock+objectmap enabled? I hope it will just refuse to
attach, rather than attach but allow bad stuff to happen.

Thanks in advance!

Dan

[1] By adding

  rbd default format = 2
  rbd default features = 13

to our client side ceph.conf.
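For reference, 13 is the sum of the rbd feature bits layering (1) +
exclusive-lock (4) + object-map (8); the other defined bits are striping v2
(2), fast-diff (16), deep-flatten (32) and journaling (64). You can check
what an existing image ended up with via (pool/image name is a placeholder):

  rbd info rbd/myimage | grep features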
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Maximum RBD image name length

2016-05-19 Thread Jason Dillaman
As of today, neither the rbd CLI nor librbd imposes any limit on the
maximum length of an RBD image name, whereas krbd has roughly a 100
character limit and the OSDs have a default object name limit of roughly
2000 characters. While there is a patch under review to increase the krbd
limit, it would still be bounded to a maximum in the low thousands of
characters.

Starting with the Kraken release, we would like to add validation for a
sensible maximum image name length when creating new images.  I am looking
for feedback from any users who require very long RBD image names (>100
characters) to help guide this limit.

Thanks,

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dense storage nodes

2016-05-19 Thread Benjeman Meekhof
Hi Christian,

Thanks for your insights.  To answer your question the NVMe devices
appear to be some variety of Samsung:

Model: Dell Express Flash NVMe 400GB
Manufacturer: SAMSUNG
Product ID: a820

regards,
Ben

On Wed, May 18, 2016 at 10:01 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Wed, 18 May 2016 12:32:25 -0400 Benjeman Meekhof wrote:
>
>> Hi Lionel,
>>
>> These are all very good points we should consider, thanks for the
>> analysis.  Just a couple clarifications:
>>
>> - NVMe in this system are actually slotted in hot-plug front bays so a
>> failure can be swapped online.  However I do see your point about this
>> otherwise being a non-optimal config.
>>
> What NVMes are these exactly? DC P3700?
> With Intel you can pretty much rely on them not to die before their time
> is up, so monitor wearout levels religiously and automatically (nagios
> etc).
> At a low node count like yours it is understandable to not want to lose
> 15 OSDs because an NVMe failed, but your performance and cost are both not
> ideal as Lionel said.
>
> I guess you're happy with what you have, but as I mentioned in this
> thread also about RAIDed OSDs, there is a chassis that does basically what
> you're having while saving 1U:
> https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
>
> This can also have optionally 6 NVMes, hot-swappable.
>
>> - Our 20 physical cores come out to be 40 HT cores to the system which
>> we are hoping is adequate to do 60 OSD without raid devices.  My
>> experiences in other contexts lead me to believe a hyper-threaded core
>> is pretty well the same as a phys core (perhaps with some exceptions
>> depending on specific cases).
>>
> It all depends, if you had no SSD journals at all I'd say you could scrape
> by, barely.
> With NVMes for journals, especially if you should decide to use them
> individually with 15 OSDs per NVMe, I'd expect CPU to become the
> bottleneck when dealing with a high number of small IOPS.
>
> Regards,
>
> Christian
>> regards,
>> Ben
>>
>> On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton
>>  wrote:
>> > Hi,
>> >
>> > I'm not yet familiar with Jewel, so take this with a grain of salt.
>> >
>> > Le 18/05/2016 16:36, Benjeman Meekhof a écrit :
>> >> We're in process of tuning a cluster that currently consists of 3
>> >> dense nodes with more to be added.  The storage nodes have spec:
>> >> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
>> >> - 384 GB RAM
>> >> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
>> >> LSI 9207-8e SAS 6Gbps
>> >
>> > I'm not sure if 20 cores is enough for 60 OSDs on Jewel. With Firefly I
>> > think your performance would be limited by the CPUs but Jewel is faster
>> > AFAIK.
>> > That said you could setup the 60 disks as RAID arrays to limit the
>> > number of OSDs. This can be tricky but some people have reported doing
>> > so successfully (IIRC using RAID5 in order to limit both the number of
>> > OSDs and the rebalancing events when a disk fails).
>> >
>> >> - XFS filesystem on OSD data devs
>> >> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>> >> raid-1 device)
>> >
>> > Your disks are rated at a maximum of ~200MB/s so even with a 100-150MB
>> > conservative estimate, for 30 disks you'd need a write bandwidth of
>> > 3GB/s to 4.5GB/s on each NVMe. Your NVMe will die twice as fast as they
>> > will take twice the amount of writes in RAID1. The alternative - using
>> > NVMe directly for journals - will get better performance and have less
>> > failures. The only drawback is that an NVMe failing entirely (I'm not
>> > familiar with NVMe but with SSD you often get write errors affecting a
>> > single OSD before a whole device failure) will bring down 15 OSDs at
>> > once. Note that replacing NVMe usually means stopping the whole node
>> > when not using hotplug PCIe, so not losing the journals when one fails
>> > may not gain you as much as anticipated if the cluster must rebalance
>> > anyway during the maintenance operation where your replace the faulty
>> > NVMe (and might perform other upgrades/swaps that were waiting).
>> >
>> >> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb
>> >
>> > Seems adequate although more bandwidth could be of some benefit.
>> >
>> > This is a total of ~12GB/s full duplex. If Ceph is able to use the
>> > whole disk bandwidth you will saturate this : if you get a hotspot on
>> > one node with a client capable of writing at 12GB/s on it and have a
>> > replication size of 3, you will get only half of this (as twice this
>> > amount will be sent on replicas). So ideally you would have room for
>> > twice the client bandwidth on the cluster network. In my experience
>> > this isn't a problem (hot spots like this almost never happen as
>> > client write traffic is mostly distributed evenly on nodes) but having
>> > the headroom avoids the risk of atypical access patterns becoming a
>> > problem 

Re: [ceph-users] Help...my cephfs client often occur error when mount -t ceph...

2016-05-19 Thread Yan, Zheng
On Thu, May 19, 2016 at 3:35 PM, 易明  wrote:
> Hi All,
>
> my cluster is Jewel Ceph. It is often that something goes wrong when mounting
> cephfs, but no error message can be found in the logs
>
>
> the following are some infos:
> [root@ceph2 ~]# mount /mnt/cephfs_stor/
> mount error 5 = Input/output error

Does it fail immediately or some time later?

>
>
> [root@ceph2 ~]# cat /etc/fstab | grep 6789
> ceph2:6789:/ /mnt/cephfs_stor ceph
> name=admin,secretfile=/etc/ceph/admin_keyring,_netdev,noatime 0 2
>
> [root@ceph2 ceph]# dmesg | tail
> [8147859.732786] libceph: client1504169 fsid
> 3fcc77ef-9fda-4f83-8b9f-efc9c769c857
> [8147859.774117] libceph: mon0 172.17.0.172:6789 session established
> [8148008.420636] libceph: client1529478 fsid
> 3fcc77ef-9fda-4f83-8b9f-efc9c769c857
> [8148008.422172] libceph: mon0 172.17.0.170:6789 session established
> [8148225.540589] SELinux: initialized (dev tmpfs, type tmpfs), uses
> transition SIDs
> [8148241.014225] SELinux: initialized (dev tmpfs, type tmpfs), uses
> transition SIDs
> [8148298.445636] libceph: client1504172 fsid
> 3fcc77ef-9fda-4f83-8b9f-efc9c769c857
> [8148298.486282] libceph: mon0 172.17.0.172:6789 session established
> [8149773.194866] libceph: client1504175 fsid
> 3fcc77ef-9fda-4f83-8b9f-efc9c769c857
> [8149773.196711] libceph: mon0 172.17.0.172:6789 session established
>
> and the /var/log/messages:
> May 19 15:23:04 ceph2 kernel: libceph: client1504175 fsid
> 3fcc77ef-9fda-4f83-8b9f-efc9c769c857
> May 19 15:23:04 ceph2 kernel: libceph: mon0 172.17.0.172:6789 session
> established
>
> my cluster status:
> [root@ceph2 ceph]# ceph health
> HEALTH_OK
> [root@ceph2 ceph]# ceph mds stat
> e584: 3/3/3 up
> {4:0=ceph2-mds0=up:active,4:2=ceph0-mds0=up:active,4:4=ceph1-mds1=up:active},
> 3 up:standby

You have multiple active MDSes. That's not stable; please don't do this.


>
> Though i have got some cephfs client mounted:
> [root@rgw0 ~]# df -h
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/mapper/centos-root   50G  2.4G   48G   5% /
> devtmpfs  32G 0   32G   0% /dev
> tmpfs 32G 0   32G   0% /dev/shm
> tmpfs 32G   26M   32G   1% /run
> tmpfs 32G 0   32G   0% /sys/fs/cgroup
> /dev/mapper/centos-home   51G   33M   51G   1% /home
> /dev/sda1497M  164M  333M  34% /boot
> 172.17.0.171:6789:/   44T  558G   44T   2% /mnt/ceph1_cephfs
> 172.17.0.172:6789:/   44T  558G   44T   2% /mnt/cephfs
> 172.17.0.170:6789:/   44T  558G   44T   2% /mnt/ceph0_cephfs
>
> This phenomenon are so weird, can someone explain and help me?
>
> Any info will be greatly appreciated.
>
> THANKS
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD process doesn't die immediately after device disappears

2016-05-19 Thread Marcel Lauhoff

Hi Somnath,

Somnath Roy  writes:

> FileStore doesn't subscribe for any such event from the device. Presently, it 
> is relying on filesystem (for the FileStore assert) to return back error 
> during IO and based on the error it is giving an assert.
> FileJournal assert you are getting in the aio path is relying on linux aio 
> system to report an error.
> It should get these asserts pretty quickly not couple of minutes if IO is on.

ACK:
I retried with either the journal or the data fs on a USB thumb drive:
the OSD took ~1 sec to crash.

> Are you saying this crash timestamp is couple of minutes after ?

Yes, but let me double check. The test I originally wrote about had the
disks behind a RAID controller.. I think there may be some weirdness
there :/

Thanks,
~marcel

--
Marcel Lauhoff
Mail: lauh...@uni-mainz.de
XMPP: mlauh...@jabber.uni-mainz.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pure SSD ceph - journal placement

2016-05-19 Thread George Shuklin

Hello.

I'm curious how to get maximum performance without losing significant 
space. Is putting an OSD and its journal on the same SSD a good solution? Or is 
using a separate SSD as journal for a few other SSD-based OSDs better?


Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dense storage nodes

2016-05-19 Thread Mark Nelson
FWIW, we ran tests back in the dumpling era that more or less showed the 
same thing.  Increasing the merge/split thresholds does help.  We 
suspect it's primarily due to the PG splitting being spread out over a 
longer period of time so the effect lessens.  We're looking at some 
options to introduce jitter after the threshold is hit so that PGs don't 
all split at exactly the same time.
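For anyone wanting to experiment, the knobs in question are the filestore
split/merge settings; a subdirectory splits at roughly
filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects. The
values below are only an example, not a recommendation:

  [osd]
  filestore merge threshold = 40
  filestore split multiple = 8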


Here are the old tests:

https://drive.google.com/open?id=0B2gTBZrkrnpZNTNicWwtT1NobUk

Mark

On 05/18/2016 09:31 PM, Kris Jurka wrote:



On 5/18/2016 7:15 PM, Christian Balzer wrote:


We have hit the following issues:

  - Filestore merge splits occur at ~40 MObjects with default settings.
This is a really, really bad couple of days while things settle.


Could you elaborate on that?
As in which settings affect this and what happens exactly as "merge
splits"
sounds like an oxymoron, so I suppose it's more of a split than a
merge to
be so painful?



Filestore merges directories when the leafs are largely empty and splits
when they're full.  So they're sort of the same thing.  Here's the
result of a test I ran storing objects into RGW as fast as possible and
you can see performance tank while directories split and recover
afterwards.

http://thread.gmane.org/gmane.comp.file-systems.ceph.user/27189/focus=27213

Kris Jurka
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD node memory sizing

2016-05-19 Thread Christian Balzer

Hello,

On Thu, 19 May 2016 10:51:20 +0200 Dietmar Rieder wrote:

> Hello,
> 
> On 05/19/2016 03:36 AM, Christian Balzer wrote:
> > 
> > Hello again,
> > 
> > On Wed, 18 May 2016 15:32:50 +0200 Dietmar Rieder wrote:
> > 
> >> Hello Christian,
> >>
> >>> Hello,
> >>>
> >>> On Wed, 18 May 2016 13:57:59 +0200 Dietmar Rieder wrote:
> >>>
>  Dear Ceph users,
> 
>  I've a question regarding the memory recommendations for an OSD
>  node.
> 
>  The official Ceph hardware recommendations say that an OSD node
>  should have 1GB Ram / TB OSD [1]
> 
>  The "Reference Architecture" whitpaper from Red Hat & Supermicro
>  says that "typically" 2GB of memory per OSD on a OSD node is used.
>  [2]
> 
> >>> This question has been asked and answered here countless times.
> >>>
> >>> Maybe something a bit more detailed ought to be placed in the first
> >>> location, or simply a reference to the 2nd one. 
> >>> But then again, that would detract from the RH added value.
> >>
> >> thanks for replying, nonetheless.
> >> I checked the list before but I failed to find a definitive answer,
> >> may be I was not looking hard enough. Anyway, thanks!
> >>
> > They tend to hidden sometimes in other threads, but there really is a
> > lot..
> 
> It seems so, have to dig deeper into the available discussions...
>
See the recent thread "journal or cache tier on SSDs ?", started by another
academic slightly to your west, for some insights; more below.

> > 
> >>>  
>  According to the recommendation in [1] an OSD node with 24x 8TB OSD
>  disks is "underpowered "  when it is equipped with 128GB of RAM.
>  However, following the "recommendation" in [2] 128GB should be
>  plenty enough.
> 
> >>> It's fine per se, the OSD processes will not consume all of that even
> >>> in extreme situations.
> >>
> >> Ok, if I understood this correctly, then 128GB should be enough also
> >> during rebalancing or backfilling.
> >>
> > Definitely, but realize that during this time of high memory
> > consumption cause by backfilling your system is also under strain from
> > objects moving in an out, so as per the high-density thread you will
> > want all your dentry and other important SLAB objects to stay in RAM.
> > 
> > That's a lot of objects potentially with 8TB, so when choosing DIMMs
> > pick ones that leave you with the option to go to 256GB later if need
> > be.
> 
> Good point, I'll keep this in mind
> 
> > 
> > Also you'll probably have loads of fun playing with CRUSH weights to
> > keep the utilization of these 8TB OSDs within 100GB of each other. 
> 
> I'm afraid that  finding the "optimal" settings will demand a lot of
> testing/playing
> 

Optimal settings are another topic; this is just making tiny adjustments to
your CRUSH weights so that the OSDs stay within a few percent of usage of
each other.

> > 
> >>>
> >>> Very large OSDs and high density storage nodes have other issues and
> >>> challenges, tuning and memory wise.
> >>> There are several threads about these recently, including today.
> >>
> >> Thanks, I'll study these...
> >>
>  I'm wondering which of the two is good enough for a Ceph cluster
>  with 10 nodes using EC (6+3)
> 
> >>> I would spend more time pondering about the CPU power of these
> >>> machines (EC need more) and what cache tier to get.
> >>
> >> We are planing to equip the OSD nodes with 2x2650v4 CPUs (24 cores @
> >> 2.2GHz), that is 1 core/OSD. For the cache tier each OSD node gets two
> >> 800Gb NVMe's. We hope this setup will give reasonable performance with
> >> EC.
> >>
> > So you have actually 26 OSDs per node then.
> > I'd say the CPUs are fine, but EC and the NVMes will eat a fair share
> > of it.
> 
> You're right, it is 26 OSDs, but still I assume that with these CPUs we
> will not be completely underpowered.
>
Since you stated your use case I'll say the same, not so much if this were
to be the storage for lots of high IOPS VMs.
 
> > That's why I prefer to have dedicated cache tier nodes with fewer but
> > faster cores, unless the cluster is going to be very large.
> > With Hammer a 800GB DC S3160 SSD based OSD can easily saturate a 
> > "E5-2623 v3" core @3.3GHz (nearly 2 cores to be precise) and Jewel has
> > optimization that will both make it faster by itself AND enable it to
> > use more CPU resources as well.
> > 
> 
> That's probably the best solution, but this will not be in our budget
> and rackspace limits for the first setup, however when expanding later
> on it will definitely be something to consider, also depending on the
> performance that we obtain with this first setup.
> 
Well, if you're gonna grow this cluster your shared setup will become more
and more effective (but still remain harder to design/specify just right).

> > The NVMes (DC P3700 one presumes?) just for cache tiering, no SSD
> > journals for the OSDs?
> 
> For now we have an offer for HPE  800GB NVMe MU (mixed use), 880MB/s
> write 2600MB/s read, 3 

Re: [ceph-users] mark out vs crush weight 0

2016-05-19 Thread Oliver Dzombic
Hi,

a sparedisk is a nice idea.

But i think thats something you can also do with a shellscript.

Checking if an osd is down or out and just using your spare disk.

Maybe the programming resources should not be used for something most
of us can do with a simple shell script that checks the situation every
5 seconds.
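Just to illustrate the idea, a toy sketch of such a watcher; everything here,
from the polling interval to what you actually do with the spare, is made up
and site-specific:

  #!/bin/sh
  # naive watcher: report OSDs that are down+out so an operator/script
  # can bring a prepared spare disk into service
  while true; do
      ceph osd dump 2>/dev/null |
      awk '$1 ~ /^osd\./ && $2 == "down" && $3 == "out" {print $1}' |
      while read osd; do
          logger -t ceph-spare-watch "$osd is down+out, spare disk wanted"
          # site-specific replacement procedure would be triggered here
      done
      sleep 5
  done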



Maybe a better idea (in my humble opinion) is to solve this by
optimizing the code for recovery situations.

Currently we have things like

client-op-priority,
recovery-op-priority,
max-backfills,
recovery-max-active and so on

to limit the performance impact in a recovery situation.
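For reference, these are typically tightened at runtime with something like
the following; the values are just an example, not a recommendation:

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'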

And still, in a recovery situation the performance goes downhill (a lot)
when all OSDs start to refill the to-be-recovered OSD.

In my case, I was removing old HDDs from a cluster.

If I down/out them (6 TB drives, 40-50% full), the cluster's performance
goes down very dramatically. So I had to reduce the weight in 0.1 steps
to ease the pain, but could not remove it completely.


So I think the tools/code to protect the cluster's performance (even in
a recovery situation) can be improved.

Of course, on the one hand we want to make sure that the configured
number of replicas, and with it data security, is restored as soon as
possible.

But on the other hand, it does not help much if the recovery procedure
impacts the cluster's performance to a level where usability is severely
reduced.

So maybe introduce another config option to control this ratio?

To control more effectively how much IOPS/bandwidth is used (maybe
straight numbers in the form of an IO rate limit), so that administrators
have the chance to configure, according to the hardware environment, the
"perfect" settings for their individual use case.


Because right now, when I reduce the weight of a 6 TB HDD from 1.0 to
0.9 in a cluster with ~30 OSDs, around 3-5% of the data will be moved
around the cluster (replication 2).

While it's moving, there is a real performance hit on the virtual servers.

So if this could be solved by an IOPS/HDD bandwidth rate limit, so that I
can simply tell the cluster to use at most 10 IOPS and/or 10 MB/s for
recovery, then I think it would be a great help for any use case and
administrator.

Thanks !


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 19.05.2016 um 04:57 schrieb Christian Balzer:
> 
> Hello Sage,
> 
> On Wed, 18 May 2016 17:23:00 -0400 (EDT) Sage Weil wrote:
> 
>> Currently, after an OSD has been down for 5 minutes, we mark the OSD 
>> "out", which redistributes the data to other OSDs in the cluster.  If the 
>> OSD comes back up, it marks the OSD back in (with the same reweight
>> value, usually 1.0).
>>
>> The good thing about marking OSDs out is that exactly the amount of data 
>> on the OSD moves.  (Well, pretty close.)  It is uniformly distributed 
>> across all other devices.
>>
> Others have commented already on how improve your initial suggestion
> (retaining CRUSH weights) etc.
> Let me butt in here with an even more invasive but impact reducing
> suggestion.
> 
> Your "good thing" up there is good as far as total data movement goes, but
> it still can unduly impact client performance when one OSD becomes both
> the target and source of data movement at the same time during
> backfill/recovery. 
> 
> So how about upping the ante with the (of course optional) concept of a
> "spare OSD" per node?
> People are already used to the concept, it also makes a full cluster
> situation massively more unlikely. 
> 
> So expanding on the concept below, lets say we have one spare OSD per node
> by default. 
> It's on a disk of the same size or larger than all the other OSDs in the
> node, it is fully prepared but has no ID yet. 
> 
> So we're experiencing an OSD failure and it's about to be set out by the
> MON; let's consider this sequence (OSD X is the dead one, S the spare one):
> 
> 1. Set nobackfill/norecovery
> 2. OSD X gets weighted 0
> 3. OSD X gets set out
> 4. OSD S gets activated with the original weight of X and its ID.
> 5. Unset nobackfill/norecovery
> 
> Now data will flow only to the new OSD, other OSDs will not be subject to
> simultaneous reads and writes by backfills. 
> 
> Of course in case there is no spare available (not replaced yet or
> multiple OSD failures), Ceph can go ahead and do its usual thing,
> hopefully enhanced by the logic below.
> 
> Alternatively, instead of just limiting the number of backfills per OSD
> make them directionally aware, that is don't allow concurrent read and
> write backfills on the same OSD.
> 
> Regards,
> 
> Christian
>> The bad thing is that if the OSD really is dead, and you remove it from 
>> the cluster, or replace it and recreate the new OSD with a new OSD id, 
>> 

Re: [ceph-users] dd testing from within the VM

2016-05-19 Thread Oliver Dzombic
Hi Ken,

wow, that's quite bad. That means you cannot use this cluster like that.

What does your ceph.conf look like?

What does ceph -s show?
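
For a start, something like the output of:

ceph -s
ceph osd tree
ceph df
ceph osd dump | grep pool

together with the [global] and [osd] sections of your ceph.conf would
already help.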


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 19.05.2016 at 12:56, Ken Peng wrote:
> Oliver,
> 
> Thanks for the info.
> We then ran sysbench for random IO testing; the result is even worse
> (757 KB/s).
> Each object has 3 replicas.
> Both networks are 10Gbps, so I don't think there are issues with the network.
> Maybe the lack of an SSD cache and a misconfigured cluster are the reason.
> 
> 
> 
> Extra file open flags: 0
> 128 files, 360Mb each
> 45Gb total file size
> Block size 16Kb
> Number of random requests for random IO: 0
> Read/Write ratio for combined random IO test: 1.50
> Periodic FSYNC enabled, calling fsync() each 100 requests.
> Calling fsync() at the end of test, Enabled.
> Using synchronous I/O mode
> Doing random r/w test
> Threads started!
> 
> Time limit exceeded, exiting...
> Done.
> 
> Operations performed:  8520 Read, 5680 Write, 18056 Other = 32256 Total
> Read 133.12Mb  Written 88.75Mb  Total transferred 221.88Mb  (757.33Kb/sec)
>    47.33 Requests/sec executed
> 
> Test execution summary:
> total time:  300.0012s
> total number of events:  14200
> total time taken by event execution: 21.6865
> per-request statistics:
>  min:  0.02ms
>  avg:  1.53ms
>  max:   1325.73ms
>  approx.  95 percentile:   1.92ms
> 
> Threads fairness:
> events (avg/stddev):   14200./0.00
> execution time (avg/stddev):   21.6865/0.00
> 
> 
> 
> 
> On 2016/5/19 (Thursday) 18:24, Oliver Dzombic wrote:
>> Hi Ken,
>>
>> dd is OK, but you should consider the fact that dd is a sequential
>> write.
>>
>> So if you have random writes in your later productive usage, then this
>> test is basically only good for measuring the maximum sequential write
>> performance of an idle cluster.
>>
>> And 250 MB/s from 200 HDDs is quite bad for a sequential write.
>>
>> The sequential write speed of a single 7200 RPM SATA HDD should be
>> around 70-100 MB/s, maybe more.
>>
>> So if you have 200 of them, idle, writing a sequence, and the result is
>> 250 MB/s, that does not look good to me.
>>
>> So either your network is not good, or your settings are not good, or
>> your replica count is too high, or something like that.
>>
>> Put another way, that is only about 1.2 MB/s of write performance per
>> HDD across your 200 HDDs.
>>
>> I assume that your 4 GB won't be spread over all 200 HDDs, but still,
>> the result does not look like good performance.
>>
>> FIO is a nice test with different settings.
>>
>> ---
>>
>> The effect of conv=fdatasync will only be as big as the RAM of your
>> test client.
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dd testing from within the VM

2016-05-19 Thread Ken Peng

Oliver,

Thanks for the info.
We then ran sysbench for random IO testing; the result is even worse
(757 KB/s).

Each object has 3 replicas.
Both networks are 10Gbps, so I don't think there are issues with the network.
Maybe the lack of an SSD cache and a misconfigured cluster are the reason.




Extra file open flags: 0
128 files, 360Mb each
45Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!

Time limit exceeded, exiting...
Done.

Operations performed:  8520 Read, 5680 Write, 18056 Other = 32256 Total
Read 133.12Mb  Written 88.75Mb  Total transferred 221.88Mb  (757.33Kb/sec)
   47.33 Requests/sec executed

Test execution summary:
total time:  300.0012s
total number of events:  14200
total time taken by event execution: 21.6865
per-request statistics:
 min:  0.02ms
 avg:  1.53ms
 max:   1325.73ms
 approx.  95 percentile:   1.92ms

Threads fairness:
events (avg/stddev):   14200./0.00
execution time (avg/stddev):   21.6865/0.00
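
For reference, the run above came from something along these lines (the
flags are reconstructed from the output header, so they may not match our
invocation exactly):

sysbench --test=fileio --file-total-size=45G prepare
sysbench --test=fileio --file-total-size=45G --file-test-mode=rndrw \
    --max-time=300 --max-requests=0 run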




On 2016/5/19 (Thursday) 18:24, Oliver Dzombic wrote:

Hi Ken,

dd is OK, but you should consider the fact that dd is a sequential write.

So if you have random writes in your later productive usage, then this
test is basically only good for measuring the maximum sequential write
performance of an idle cluster.

And 250 MB/s from 200 HDDs is quite bad for a sequential write.

The sequential write speed of a single 7200 RPM SATA HDD should be around
70-100 MB/s, maybe more.

So if you have 200 of them, idle, writing a sequence, and the result is
250 MB/s, that does not look good to me.

So either your network is not good, or your settings are not good, or
your replica count is too high, or something like that.

Put another way, that is only about 1.2 MB/s of write performance per HDD
across your 200 HDDs.

I assume that your 4 GB won't be spread over all 200 HDDs, but still, the
result does not look like good performance.

FIO is a nice test with different settings.

---

The effect of conv=fdatasync will only be as big as the RAM of your test
client.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dd testing from within the VM

2016-05-19 Thread Oliver Dzombic
Hi Ken,

dd is OK, but you should consider the fact that dd is a sequential write.

So if you have random writes in your later productive usage, then this
test is basically only good for measuring the maximum sequential write
performance of an idle cluster.

And 250 MB/s from 200 HDDs is quite bad for a sequential write.

The sequential write speed of a single 7200 RPM SATA HDD should be around
70-100 MB/s, maybe more.

So if you have 200 of them, idle, writing a sequence, and the result is
250 MB/s, that does not look good to me.

So either your network is not good, or your settings are not good, or
your replica count is too high, or something like that.

Put another way, that is only about 1.2 MB/s of write performance per HDD
across your 200 HDDs.

I assume that your 4 GB won't be spread over all 200 HDDs, but still, the
result does not look like good performance.

FIO is a nice test with different settings.

---

The effect of conv=fdatasync will only be as big as the RAM of your test
client.
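
For example, a random write test with fio could look roughly like this
(only a sketch; pick a size/runtime large enough to get past the client's
RAM and caches):

fio --name=randwrite --ioengine=libaio --direct=1 --rw=randwrite \
    --bs=4k --size=8G --numjobs=4 --iodepth=32 \
    --runtime=300 --time_based --group_reporting

And for dd, oflag=direct (or a file well above the client's RAM size)
avoids measuring the page cache:

dd if=/dev/zero of=test.file bs=4M count=4096 oflag=direct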


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 19.05.2016 at 04:40, Ken Peng wrote:
> Hi,
> 
> Our VM has been using ceph as block storage for both the system and data
> partitions.
> 
> This is what dd shows,
> 
> # dd if=/dev/zero of=test.file bs=4k count=1024k
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB) copied, 16.7969 s, 256 MB/s
> 
> When running dd again with the fdatasync argument, the result is similar.
> 
> # dd if=/dev/zero of=test.file bs=4k count=1024k conv=fdatasync
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB) copied, 17.6878 s, 243 MB/s
> 
> 
> My questions:
> 
> 1. For a cluster with more than 200 disks as OSD storage (SATA only),
> where both the cluster and data networks are 10Gbps, is the performance
> from within the VM shown above reasonable?
> 
> 2. Is "dd" suitable for testing block storage from within the VM?
> 
> 3. Why does "fdatasync" make no difference to the results?
> 
> Thank you.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Installing ceph monitor on Ubuntu denial: segmentation fault

2016-05-19 Thread Daniel Wilhelm
Hi

I am trying to install ceph with the ceph ansible role: 
https://github.com/shieldwed/ceph-ansible.

I had to fix some Ansible tasks to work correctly with Ansible 2.0.2.0, but
now it seems to work quite well.
Sadly, I have now come across a bug I cannot solve myself:

When ansible is starting the service ceph-mon@ceph-control01.service, 
ceph-create-keys@control01.service gets started as a dependency to create the 
admin key.

Within the unit log the following lines are shown:

May 19 11:42:14 control01 ceph-create-keys[21818]: INFO:ceph-create-keys:Talking to monitor...
May 19 11:42:14 control01 ceph-create-keys[21818]: INFO:ceph-create-keys:Cannot get or create admin key
May 19 11:42:15 control01 ceph-create-keys[21818]: INFO:ceph-create-keys:Talking to monitor...
May 19 11:42:15 control01 ceph-create-keys[21818]: INFO:ceph-create-keys:Cannot get or create admin key

And so on.

Since this script is calling “ceph --cluster=ceph --name=mon. 
--keyring=/var/lib/ceph/mon/ceph-control01/keyring auth get-or-create 
client.admin mon allow * osd allow * mds allow *”
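
For reference, when running that command by hand from a shell the
capability strings need quoting, otherwise the shell expands the *, i.e.
something like:

ceph --cluster=ceph --name=mon. \
    --keyring=/var/lib/ceph/mon/ceph-control01/keyring \
    auth get-or-create client.admin \
    mon 'allow *' osd 'allow *' mds 'allow *'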

I tried to call this command myself and got this as a result:
Segmentation fault (core dumped)

As for the ceph version, I tried two different ones, with the same result:
• Ubuntu integrated: ceph 10.1.2
• Official stable repo: http://download.ceph.com/debian-jewel so: 10.2.1

How can I circumvent this problem? Or is there any solution to that?

Thanks

Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD node memory sizing

2016-05-19 Thread Dietmar Rieder
Hello,

On 05/19/2016 03:36 AM, Christian Balzer wrote:
> 
> Hello again,
> 
> On Wed, 18 May 2016 15:32:50 +0200 Dietmar Rieder wrote:
> 
>> Hello Christian,
>>
>>> Hello,
>>>
>>> On Wed, 18 May 2016 13:57:59 +0200 Dietmar Rieder wrote:
>>>
 Dear Ceph users,

 I've a question regarding the memory recommendations for an OSD node.

 The official Ceph hardware recommendations say that an OSD node should
 have 1GB Ram / TB OSD [1]

 The "Reference Architecture" whitpaper from Red Hat & Supermicro says
 that "typically" 2GB of memory per OSD on a OSD node is used. [2]

>>> This question has been asked and answered here countless times.
>>>
>>> Maybe something a bit more detailed ought to be placed in the first
>>> location, or simply a reference to the 2nd one. 
>>> But then again, that would detract from the RH added value.
>>
>> thanks for replying, nonetheless.
>> I checked the list before but I failed to find a definitive answer;
>> maybe I was not looking hard enough. Anyway, thanks!
>>
> They tend to be hidden in other threads sometimes, but there really is a lot of it.

It seems so; I'll have to dig deeper into the available discussions...

> 
>>>  
 According to the recommendation in [1] an OSD node with 24x 8TB OSD
 disks is "underpowered "  when it is equipped with 128GB of RAM.
 However, following the "recommendation" in [2] 128GB should be plenty
 enough.

>>> It's fine per se, the OSD processes will not consume all of that even
>>> in extreme situations.
>>
>> Ok, if I understood this correctly, then 128GB should be enough also
>> during rebalancing or backfilling.
>>
> Definitely, but realize that during this time of high memory consumption
> caused by backfilling your system is also under strain from objects moving
> in and out, so as per the high-density thread you will want all your dentry
> and other important SLAB objects to stay in RAM.
> 
> That's a lot of objects potentially with 8TB, so when choosing DIMMs pick
> ones that leave you with the option to go to 256GB later if need be.

Good point, I'll keep this in mind

> 
> Also you'll probably have loads of fun playing with CRUSH weights to keep
> the utilization of these 8TB OSDs within 100GB of each other. 

I'm afraid that finding the "optimal" settings will demand a lot of
testing/playing.
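I guess the practical approach will be to watch the distribution with
something like

ceph osd df tree

and then nudge individual OSDs in small steps with

ceph osd crush reweight osd.<id> <weight>

(or let reweight-by-utilization do it), but we will see.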

> 
>>>
>>> Very large OSDs and high density storage nodes have other issues and
>>> challenges, tuning and memory wise.
>>> There are several threads about these recently, including today.
>>
>> Thanks, I'll study these...
>>
 I'm wondering which of the two is good enough for a Ceph cluster with
 10 nodes using EC (6+3)

>>> I would spend more time pondering about the CPU power of these machines
>>> (EC need more) and what cache tier to get.
>>
>> We are planning to equip the OSD nodes with 2x 2650v4 CPUs (24 cores @
>> 2.2GHz), that is 1 core/OSD. For the cache tier each OSD node gets two
>> 800GB NVMes. We hope this setup will give reasonable performance with
>> EC.
>>
> So you have actually 26 OSDs per node then.
> I'd say the CPUs are fine, but EC and the NVMes will eat a fair share of
> it.

You're right, it is 26 OSDs, but I still assume that with these CPUs we
will not be completely underpowered.

> That's why I prefer to have dedicated cache tier nodes with fewer but
> faster cores, unless the cluster is going to be very large.
> With Hammer an 800GB DC S3160 SSD based OSD can easily saturate an
> "E5-2623 v3" core @3.3GHz (nearly 2 cores to be precise) and Jewel has
> optimizations that will both make it faster by itself AND enable it to
> use more CPU resources as well.
> 

That's probably the best solution, but it will not fit within our budget
and rack space limits for the first setup. However, when expanding later
on it will definitely be something to consider, also depending on the
performance that we obtain with this first setup.

> The NVMes (DC P3700 one presumes?) just for cache tiering, no SSD
> journals for the OSDs?

For now we have an offer for HPE 800GB NVMe MU (mixed use) drives: 880MB/s
write, 2600MB/s read, 3 DW/D. So they are as fast as the DC P3700; we will
probably also check other options.

> What are your network plans then, as in is your node storage bandwidth a
> good match for your network bandwidth? 
>

For the network we will have 2x10GBit bonded for the internal cluster
network and 2x10GBit bonded towards the clients, plus 1GBit for
administration.


>>> That is, if performance is a requirement in your use case.
>>
>> Always, who wouldn't care about performance?  :-)
>>
> "Good enough" sometimes really is good enough.
> 
> Since you're going for 8TB OSDs, EC and 10 nodes it feels that for you
> space is important, so something like archival, not RBD images for high
> performance VMs.
> 
> What is your use case?


You're right, space is most important. Our use case is not serving RBD
for VMs.
We will mainly store genomic data on CephFS volumes and access it from a
computing cluster for analysis. This