Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-08 Thread Gregory Farnum
On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick  wrote:

> On a related note, we are very curious why the snapshot id is
> incremented when a snapshot is deleted; this creates lots of
> phantom entries in the deleted snapshots set. Interleaved
> deletions and creations will cause massive fragmentation in
> the interval set. The only reason we can come up with for this
> is to track if anything changed, but I suspect a different
> value that doesn't inject entries into the interval set might
> be better for this purpose.

Yes, it's because having a sequence number tied in with the snapshots
is convenient for doing comparisons. Those aren't leaked snapids that
will make holes; when we increment the snapid to delete something we
also stick it in the removed_snaps set. (I suppose if you alternate
deleting a snapshot with adding one that does increase the size until
you delete those snapshots; hrmmm. Another thing to avoid doing I
guess.)
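
A rough way to see how fragmented a pool's removed_snaps set already is (an
illustrative sketch, not from this thread): it assumes a pre-Octopus cluster
whose "ceph osd dump" output prints an interval set such as
"removed_snaps [1~5,8~2,...]" for every pool that has deleted snapshots, and
it simply counts the intervals per pool line.

$ ceph osd dump \
    | grep -o 'removed_snaps \[[^]]*\]' \
    | awk -F'[][]' '{ n = split($2, iv, ","); print n, "intervals:", substr($2, 1, 60) }'

The more intervals (holes) each set carries, the more work PGPool::update and
the OSDMap encoding have to do.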

>> It might really just be the osdmap update processing -- that would
>> make me happy as it's a much easier problem to resolve. But I'm also
>> surprised it's *that* expensive, even at the scales you've described.
> That would be nice, but unfortunately all the data is pointing
> to PGPool::Update(),

Yes, that's the OSDMap update processing I referred to. This is good
in terms of our ability to remove it without changing client
interfaces and things.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-08 Thread Mclean, Patrick
On 2017-09-08 01:36 PM, Gregory Farnum wrote:
> On Thu, Sep 7, 2017 at 1:46 PM, Mclean, Patrick wrote:
>> On 2017-09-05 02:41 PM, Gregory Farnum wrote:
>>> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas wrote:
>>>> Hi everyone,
>>>>
>>>> with the Luminous release out the door and the Labor Day weekend
>>>> over, I hope I can kick off a discussion on another issue that has
>>>> irked me a bit for quite a while. There doesn't seem to be a good
>>>> documented answer to this: what are Ceph's real limits when it
>>>> comes to RBD snapshots?
>>>>
>>>> For most people, any RBD image will have perhaps a single-digit
>>>> number of snapshots. For example, in an OpenStack environment we
>>>> typically have one snapshot per Glance image, a few snapshots per
>>>> Cinder volume, and perhaps a few snapshots per ephemeral Nova disk
>>>> (unless clones are configured to flatten immediately). Ceph
>>>> generally performs well under those circumstances.
>>>>
>>>> However, things sometimes start getting problematic when RBD
>>>> snapshots are generated frequently, and in an automated fashion.
>>>> I've seen Ceph operators configure snapshots on a daily or even
>>>> hourly basis, typically when using snapshots as a backup strategy
>>>> (where they promise to allow for very short RTO and RPO). In
>>>> combination with thousands or maybe tens of thousands of RBDs,
>>>> that's a lot of snapshots. And in such scenarios (and only in
>>>> those), users have been bitten by a few nasty bugs in the past —
>>>> here's an example where the OSD snap trim queue went berserk in the
>>>> event of lots of snapshots being deleted:
>>>>
>>>> http://tracker.ceph.com/issues/9487
>>>> https://www.spinics.net/lists/ceph-devel/msg20470.html
>>>>
>>>> It seems to me that there still isn't a good recommendation along
>>>> the lines of "try not to have more than X snapshots per RBD image"
>>>> or "try not to have more than Y snapshots in the cluster overall".
>>>> Or is the "correct" recommendation actually "create as many
>>>> snapshots as you might possibly want, none of that is allowed to
>>>> create any instability nor performance degradation and if it does,
>>>> that's a bug"?
>>>
>>> I think we're closer to "as many snapshots as you want", but there
>>> are some known shortages there.
>>>
>>> First of all, if you haven't seen my talk from the last OpenStack
>>> summit on snapshots and you want a bunch of details, go watch that. :p
>>> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1
>>>
>>> There are a few dimensions where there can be failures with snapshots:
>>>
>>> 1) right now the way we mark snapshots as deleted is suboptimal —
>>> when deleted they go into an interval_set in the OSDMap. So if you
>>> have a bunch of holes in your deleted snapshots, it is possible to
>>> inflate the osdmap to a size which causes trouble. But I'm not sure
>>> if we've actually seen this be an issue yet — it requires both a
>>> large cluster, and a large map, and probably some other failure
>>> causing osdmaps to be generated very rapidly.
>> In our use case, we are severely hampered by the size of removed_snaps
>> (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent in
>> PGPool::update and its interval calculation code. We have a cluster of
>> around 100k RBDs, with each RBD having up to 25 snapshots and only a small
>> portion of our RBDs mapped at a time (~500-1000). For size / performance
>> reasons we try to keep the number of snapshots low (<25) and need to
>> prune snapshots. Since in our use case RBDs 'age' at different rates,
>> snapshot pruning creates holes, to the point where the size of the
>> removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph
>> clusters. I think in general around 2 snapshot removal operations
>> currently happen a minute, just because of the volume of snapshots and
>> users we have.
>>
>> We found PGPool::update and the interval calculation code to be
>> quite inefficient. Some small changes made it a lot faster, giving more
>> breathing room; we shared these and most already got applied:
>> https://github.com/ceph/ceph/pull/17088
>> https://github.com/ceph/ceph/pull/17121
>> https://github.com/ceph/ceph/pull/17239
>> https://github.com/ceph/ceph/pull/17265
>> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>>
>> These patches helped for our use case, but overall CPU usage in
>> this area is still high (>70% or so), making the Ceph cluster slow and
>> causing blocked requests and many operations (e.g. rbd map) to take a
>> long time.
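
An illustrative sketch for anyone who wants to confirm the same hotspot on
their own cluster, assuming perf is available on the OSD host; the pgrep
line just grabs a ceph-osd PID, and in practice you would pick the busiest
one from top:

$ OSD_PID=$(pgrep -n ceph-osd)     # or pick the busiest ceph-osd PID from top
$ perf record -F 99 -g -p "$OSD_PID" -- sleep 30
$ perf report --stdio | grep -E 'PGPool::update|interval_set' | head -20

If PGPool::update and interval_set symbols dominate the report, you are
looking at the same bottleneck described above.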
>>
>> We are trying to work around these issues by changing our
>> snapshot strategy. In the short term we are manually defragmenting the
>> interval set by scanning for holes and trying to delete snapids in
>> between holes so that adjacent holes coalesce. This is not so nice to do.
>> In some cases we employ strategies 

Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-08 Thread Mclean, Patrick
On 2017-09-08 01:59 PM, Gregory Farnum wrote:
> On Fri, Sep 8, 2017 at 1:45 AM, Florian Haas  wrote:
>>> In our use case, we are severely hampered by the size of removed_snaps
>>> (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent in
>>> PGPool::update and its interval calculation code. We have a cluster of
>>> around 100k RBDs, with each RBD having up to 25 snapshots and only a small
>>> portion of our RBDs mapped at a time (~500-1000). For size / performance
>>> reasons we try to keep the number of snapshots low (<25) and need to
>>> prune snapshots. Since in our use case RBDs 'age' at different rates,
>>> snapshot pruning creates holes, to the point where the size of the
>>> removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph
>>> clusters. I think in general around 2 snapshot removal operations
>>> currently happen a minute, just because of the volume of snapshots and
>>> users we have.
>> Right. Greg, this is what I was getting at: 25 snapshots per RBD is
>> firmly in "one snapshot per day per RBD" territory — this is something
>> that a cloud operator might do, for example, offering daily snapshots
>> going back one month. But it still wrecks the cluster simply by having
>> lots of images (even though only a fraction of them, less than 1%, are
>> ever in use). That's rather counter-intuitive, it doesn't hit you
>> until you have lots of images, and once you're affected by it there's
>> no practical way out — where "out" is defined as "restoring overall
>> cluster performance to something acceptable".
>>
>>> We found PGPool::update and the interval calculation code to be
>>> quite inefficient. Some small changes made it a lot faster, giving more
>>> breathing room; we shared these and most already got applied:
>>> https://github.com/ceph/ceph/pull/17088
>>> https://github.com/ceph/ceph/pull/17121
>>> https://github.com/ceph/ceph/pull/17239
>>> https://github.com/ceph/ceph/pull/17265
>>> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>>>
>>> These patches helped for our use case, but overall CPU usage in
>>> this area is still high (>70% or so), making the Ceph cluster slow and
>>> causing blocked requests and many operations (e.g. rbd map) to take a
>>> long time.
>> I think this makes this very much a practical issue, not a
>> hypothetical/theoretical one.
>>
>>> We are trying to work around these issues by changing our
>>> snapshot strategy. In the short term we are manually defragmenting the
>>> interval set by scanning for holes and trying to delete snapids in
>>> between holes so that adjacent holes coalesce. This is not so nice to do.
>>> In some cases we employ strategies to 'recreate' old snapshots (as we need
>>> to keep them) at higher snapids. For our use case a 'snapid rename'
>>> feature would have been quite helpful.
>>>
>>> I hope this shines some light on practical Ceph clusters in which
>>> performance is bottlenecked not by I/O but by snapshot removal.
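
For illustration only, and not Patrick's actual tooling: a sketch of the
"scan for holes" step. It assumes GNU awk and a pre-Octopus "ceph osd dump"
that prints the removed_snaps interval set in hex as "[start~length,...]";
it prints the snapid gaps between adjacent removed intervals for the first
pool it sees. Deleting the still-live snapshots whose ids fall into the
smallest gaps is what lets neighbouring intervals merge.

$ ceph osd dump \
    | grep -o 'removed_snaps \[[^]]*\]' | head -n1 \
    | gawk -F'[][]' '{
        n = split($2, iv, ",");
        for (i = 1; i < n; i++) {
          split(iv[i],   a, "~"); lo = strtonum("0x" a[1]) + strtonum("0x" a[2]);
          split(iv[i+1], b, "~"); hi = strtonum("0x" b[1]) - 1;
          printf "gap of %d snapid(s): 0x%x..0x%x\n", hi - lo + 1, lo, hi;
        }
      }'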
>> For others following this thread or retrieving it from the list
>> archive some time down the road, I'd rephrase that as "bottlenecked
>> not by I/O but by CPU utilization associated with snapshot removal".
>> Is that fair to say, Patrick? Please correct me if I'm
>> misrepresenting.
>>
>> Greg (or Josh/Jason/Sage/anyone really :) ), can you provide
>> additional insight as to how these issues can be worked around or
>> mitigated, besides the PRs that Patrick and his colleagues have
>> already sent?
> Yeah. Like I said, we have a proposed solution for this (that we can
> probably backport to Luminous stable?), but that's the sort of thing I
> haven't heard about before. And the issue is indeed with the raw size
> of the removed_snaps member, which will be a problem for cloud
> operators of a certain scale.
>
> Theoretically, I'd expect you could control it if you are careful:
> 1) take all snapshots on your RBD images for a single time unit
> together, don't intersperse them (i.e., don't create daily snapshots
> on some images at the same time as hourly snapshots on others)
> 2) trim all snapshots from the same time unit on the same schedule
> 3) limit the number of live time units you keep around

That is basically our long-term strategy, but it does involve some
re-architecting of our code, which does take some time.

> There are obvious downsides to those steps, and it's a problem I look
> forward to us resolving soonish. But if you follow those I'd expect
> the removed_snaps interval_set to be proportional in size to the
> number of live time units you have, rather than the number of RBD
> volumes or anything else.
>
>
>
> On Wed, Sep 6, 2017 at 8:44 AM, Florian Haas  wrote:
>> Hi Greg,
>>
>> thanks for your insight! I do have a few follow-up questions.
>>
>> On 09/05/2017 11:39 PM, Gregory Farnum wrote:
 It seems to me that there still isn't a good recommendation along the
 lines of "try not to have 

Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-08 Thread Gregory Farnum
On Fri, Sep 8, 2017 at 1:45 AM, Florian Haas  wrote:
>> In our use case, we are severely hampered by the size of removed_snaps
>> (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent in
>> PGPool::update and its interval calculation code. We have a cluster of
>> around 100k RBDs, with each RBD having up to 25 snapshots and only a small
>> portion of our RBDs mapped at a time (~500-1000). For size / performance
>> reasons we try to keep the number of snapshots low (<25) and need to
>> prune snapshots. Since in our use case RBDs 'age' at different rates,
>> snapshot pruning creates holes, to the point where the size of the
>> removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph
>> clusters. I think in general around 2 snapshot removal operations
>> currently happen a minute, just because of the volume of snapshots and
>> users we have.
>
> Right. Greg, this is what I was getting at: 25 snapshots per RBD is
> firmly in "one snapshot per day per RBD" territory — this is something
> that a cloud operator might do, for example, offering daily snapshots
> going back one month. But it still wrecks the cluster simply by having
> lots of images (even though only a fraction of them, less than 1%, are
> ever in use). That's rather counter-intuitive, it doesn't hit you
> until you have lots of images, and once you're affected by it there's
> no practical way out — where "out" is defined as "restoring overall
> cluster performance to something acceptable".
>
>> We found PGPool::update and the interval calculation code to be
>> quite inefficient. Some small changes made it a lot faster, giving more
>> breathing room; we shared these and most already got applied:
>> https://github.com/ceph/ceph/pull/17088
>> https://github.com/ceph/ceph/pull/17121
>> https://github.com/ceph/ceph/pull/17239
>> https://github.com/ceph/ceph/pull/17265
>> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>>
>> These patches helped for our use case, but overall CPU usage in
>> this area is still high (>70% or so), making the Ceph cluster slow and
>> causing blocked requests and many operations (e.g. rbd map) to take a
>> long time.
>
> I think this makes this very much a practical issue, not a
> hypothetical/theoretical one.
>
>> We are trying to work around these issues by changing our
>> snapshot strategy. In the short term we are manually defragmenting the
>> interval set by scanning for holes and trying to delete snapids in
>> between holes so that adjacent holes coalesce. This is not so nice to do.
>> In some cases we employ strategies to 'recreate' old snapshots (as we need
>> to keep them) at higher snapids. For our use case a 'snapid rename'
>> feature would have been quite helpful.
>>
>> I hope this shines some light on practical Ceph clusters in which
>> performance is bottlenecked not by I/O but by snapshot removal.
>
> For others following this thread or retrieving it from the list
> archive some time down the road, I'd rephrase that as "bottlenecked
> not by I/O but by CPU utilization associated with snapshot removal".
> Is that fair to say, Patrick? Please correct me if I'm
> misrepresenting.
>
> Greg (or Josh/Jason/Sage/anyone really :) ), can you provide
> additional insight as to how these issues can be worked around or
> mitigated, besides the PRs that Patrick and his colleagues have
> already sent?

Yeah. Like I said, we have a proposed solution for this (that we can
probably backport to Luminous stable?), but that's the sort of thing I
haven't heard about before. And the issue is indeed with the raw size
of the removed_snaps member, which will be a problem for cloud
operators of a certain scale.

Theoretically, I'd expect you could control it if you are careful:
1) take all snapshots on your RBD images for a single time unit
together, don't intersperse them (i.e., don't create daily snapshots
on some images at the same time as hourly snapshots on others)
2) trim all snapshots from the same time unit on the same schedule
3) limit the number of live time units you keep around

There are obvious downsides to those steps, and it's a problem I look
forward to us resolving soonish. But if you follow those I'd expect
the removed_snaps interval_set to be proportional in size to the
number of live time units you have, rather than the number of RBD
volumes or anything else.
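
A toy sketch of what that cohort-style scheduling could look like with plain
RBD commands (illustrative only: the pool name, snapshot naming and retention
below are made up, and a real deployment would drive this from its
orchestration layer rather than a cron script):

#!/bin/bash
# Create one snapshot cohort across all images, named after the day, and trim
# the cohort that has aged out, so every pass adds and removes snapids in
# contiguous runs instead of punching random holes.
POOL=rbd                                   # assumed pool name
KEEP=30                                    # keep 30 daily cohorts
today=$(date +%Y%m%d)
expired=$(date -d "-${KEEP} days" +%Y%m%d)

for img in $(rbd ls "$POOL"); do
    rbd snap create "${POOL}/${img}@daily-${today}"
    rbd snap rm "${POOL}/${img}@daily-${expired}" 2>/dev/null || true
done

As long as nothing else creates snapshots in the same pool concurrently, the
snapids allocated in one pass are close to contiguous, and trimming the whole
cohort later removes a contiguous range rather than scattering holes through
removed_snaps.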



On Wed, Sep 6, 2017 at 8:44 AM, Florian Haas  wrote:
> Hi Greg,
>
> thanks for your insight! I do have a few follow-up questions.
>
> On 09/05/2017 11:39 PM, Gregory Farnum wrote:
>>> It seems to me that there still isn't a good recommendation along the
>>> lines of "try not to have more than X snapshots per RBD image" or "try
>>> not to have more than Y snapshots in the cluster overall". Or is the
>>> "correct" recommendation actually "create as many snapshots as you
>>> might possibly want, none of that is allowed to create any instability
>>> 

Re: [ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9

2017-09-08 Thread David Zafman


Robin,

Would you generate the values and keys for the various versions of at 
least one of the objects?   .dir.default.292886573.13181.12 is a good 
example because there are 3 variations for the same object.


If there isn't much activity to .dir.default.64449186.344176, you could 
do one osd at a time.  Otherwise, stop all 3 OSDs (1322, 990, 655) and execute 
these for all 3.  I suspect you'll need to pipe to "od -cx" to get 
printable output.


I created a simple object with ascii omap.

$ ceph-objectstore-tool --data-path ... --pgid 5.3d40 \
      .dir.default.64449186.344176 get-omaphdr
obj_header

$ for i in $(ceph-objectstore-tool --data-path ... --pgid 5.3d40 \
      .dir.default.64449186.344176 list-omap)
do
    echo -n "${i}: "
    ceph-objectstore-tool --data-path ... --pgid 5.3d40 \
        .dir.default.64449186.344176 get-omap $i
done

key1: val1
key2: val2
key3: val3

David


On 9/8/17 12:18 PM, David Zafman wrote:


Robin,

The only two changesets I can spot in Jewel that I think might be 
related are these:

1.
http://tracker.ceph.com/issues/20089
https://github.com/ceph/ceph/pull/15416


This should improve the repair functionality.


2.
http://tracker.ceph.com/issues/19404
https://github.com/ceph/ceph/pull/14204

This pull request fixes an issue that corrupted omaps.  It also finds 
and repairs them.  However, the repair process might resurrect deleted 
omaps which would show up as an omap digest error.


This could temporarily cause additional inconsistent PGs.  So if this 
has NOT been occurring longer than your deep-scrub interval since 
upgrading, I'd repair the pgs and monitor going forward to make sure 
the problem doesn't recur.


---

You have good examples of repair scenarios:


.dir.default.292886573.13181.12 only has an omap_digest_mismatch and 
no shard errors.  The automatic repair won't be sure which is a good 
copy.


In this case we can see that osd 1327 doesn't match the other two.  To 
assist the repair process in picking the right copy, remove the copy on 
osd.1327:


Stop osd 1327 and use "ceph-objectstore-tool --data-path .1327 
.dir.default.292886573.13181.12 remove"



.dir.default.64449186.344176 has selected_object_info with "od 
337cf025", so the shards have "omap_digest_mismatch_oi" except for osd 990.


The pg repair code will use osd.990 to fix the other 2 copies without 
further handling.



David



On 9/8/17 11:16 AM, Robin H. Johnson wrote:

On Thu, Sep 07, 2017 at 08:24:04PM +, Robin H. Johnson wrote:

pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]

Here is the output of 'rados list-inconsistent-obj' for the PGs:

$ sudo rados list-inconsistent-obj 5.f1c0 |json_pp -json_opt 
canonical,pretty

{
    "epoch" : 1221254,
    "inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.292886573.13181.12",
 "nspace" : "",
 "snap" : "head",
 "version" : 483490
  },
  "selected_object_info" : 
"5:038f1cff:::.dir.default.292886573.13181.12:head(1221843'483490 
client.417313345.0:19515832 dirty|omap|data_digest s 0 uv 483490 dd 
 alloc_hint [0 0])",

  "shards" : [
 {
    "data_digest" : "0x",
    "errors" : [],
    "omap_digest" : "0x928b0c0b",
    "osd" : 91,
    "size" : 0
 },
 {
    "data_digest" : "0x",
    "errors" : [],
    "omap_digest" : "0x928b0c0b",
    "osd" : 631,
    "size" : 0
 },
 {
    "data_digest" : "0x",
    "errors" : [],
    "omap_digest" : "0x6556c868",
    "osd" : 1327,
    "size" : 0
 }
  ],
  "union_shard_errors" : []
   }
    ]
}
$ sudo rados list-inconsistent-obj 5.3d40  |json_pp -json_opt 
canonical,pretty

{
    "epoch" : 1210895,
    "inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.64449186.344176",
 "nspace" : "",
 "snap" : "head",
 "version" : 1177199
  },
  "selected_object_info" : 
"5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 
osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 
dd  od 337cf025 alloc_hint [0 0])",

  "shards" : [
 {
    "data_digest" : "0x",
    "errors" : [
   "omap_digest_mismatch_oi"
    ],
    "omap_digest" : "0x3242b04e",
    "osd" : 655,
    "size" : 0
 },
 {
    "data_digest" : "0x",

Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-08 Thread Gregory Farnum
On Thu, Sep 7, 2017 at 1:46 PM, Mclean, Patrick  wrote:
> On 2017-09-05 02:41 PM, Gregory Farnum wrote:
>> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas wrote:
>>> Hi everyone,
>>>
>>> with the Luminous release out the door and the Labor Day weekend
>>> over, I hope I can kick off a discussion on another issue that has
>>> irked me a bit for quite a while. There doesn't seem to be a good
>>> documented answer to this: what are Ceph's real limits when it
>>> comes to RBD snapshots?
>>>
>>> For most people, any RBD image will have perhaps a single-digit
>>> number of snapshots. For example, in an OpenStack environment we
>>> typically have one snapshot per Glance image, a few snapshots per
>>> Cinder volume, and perhaps a few snapshots per ephemeral Nova disk
>>> (unless clones are configured to flatten immediately). Ceph
>>> generally performs well under those circumstances.
>>>
>>> However, things sometimes start getting problematic when RBD
>>> snapshots are generated frequently, and in an automated fashion.
>>> I've seen Ceph operators configure snapshots on a daily or even
>>> hourly basis, typically when using snapshots as a backup strategy
>>> (where they promise to allow for very short RTO and RPO). In
>>> combination with thousands or maybe tens of thousands of RBDs,
>>> that's a lot of snapshots. And in such scenarios (and only in
>>> those), users have been bitten by a few nasty bugs in the past —
>>> here's an example where the OSD snap trim queue went berserk in the
>>> event of lots of snapshots being deleted:
>>>
>>> http://tracker.ceph.com/issues/9487
>>> https://www.spinics.net/lists/ceph-devel/msg20470.html
>>>
>>> It seems to me that there still isn't a good recommendation along
>>> the lines of "try not to have more than X snapshots per RBD image"
>>> or "try not to have more than Y snapshots in the cluster overall".
>>> Or is the "correct" recommendation actually "create as many
>>> snapshots as you might possibly want, none of that is allowed to
>>> create any instability nor performance degradation and if it does,
>>> that's a bug"?
>>
>> I think we're closer to "as many snapshots as you want", but there
>> are some known shortages there.
>>
>> First of all, if you haven't seen my talk from the last OpenStack
>> summit on snapshots and you want a bunch of details, go watch that. :p
>> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1
>>
>> There are a few dimensions where there can be failures with snapshots:
>>
>> 1) right now the way we mark snapshots as deleted is suboptimal —
>> when deleted they go into an interval_set in the OSDMap. So if you
>> have a bunch of holes in your deleted snapshots, it is possible to
>> inflate the osdmap to a size which causes trouble. But I'm not sure
>> if we've actually seen this be an issue yet — it requires both a
>> large cluster, and a large map, and probably some other failure
>> causing osdmaps to be generated very rapidly.
> In our use case, we are severely hampered by the size of removed_snaps
> (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent in
> PGPool::update and its interval calculation code. We have a cluster of
> around 100k RBDs, with each RBD having up to 25 snapshots and only a small
> portion of our RBDs mapped at a time (~500-1000). For size / performance
> reasons we try to keep the number of snapshots low (<25) and need to
> prune snapshots. Since in our use case RBDs 'age' at different rates,
> snapshot pruning creates holes, to the point where the size of the
> removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph
> clusters. I think in general around 2 snapshot removal operations
> currently happen a minute, just because of the volume of snapshots and
> users we have.
>
> We found PGPool::update and the interval calculation code to be
> quite inefficient. Some small changes made it a lot faster, giving more
> breathing room; we shared these and most already got applied:
> https://github.com/ceph/ceph/pull/17088
> https://github.com/ceph/ceph/pull/17121
> https://github.com/ceph/ceph/pull/17239
> https://github.com/ceph/ceph/pull/17265
> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>
> These patches helped for our use case, but overall CPU usage in
> this area is still high (>70% or so), making the Ceph cluster slow and
> causing blocked requests and many operations (e.g. rbd map) to take a
> long time.
>
> We are trying to work around these issues by changing our
> snapshot strategy. In the short term we are manually defragmenting the
> interval set by scanning for holes and trying to delete snapids in
> between holes so that adjacent holes coalesce. This is not so nice to do.
> In some cases we employ strategies to 'recreate' old snapshots (as we need
> to keep them) at higher snapids. For our use case a 'snapid rename' feature
> would 

Re: [ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9

2017-09-08 Thread David Zafman


Robin,

The only two changesets I can spot in Jewel that I think might be 
related are these:

1.
http://tracker.ceph.com/issues/20089
https://github.com/ceph/ceph/pull/15416


This should improve the repair functionality.


2.
http://tracker.ceph.com/issues/19404
https://github.com/ceph/ceph/pull/14204

This pull request fixes an issue that corrupted omaps.  It also finds 
and repairs them.  However, the repair process might resurrect deleted 
omaps which would show up as an omap digest error.


This could temporarily cause additional inconsistent PGs.  So if this 
has NOT been occurring longer than your deep-scrub interval since 
upgrading, I'd repair the pgs and monitor going forward to make sure the 
problem doesn't recur.


---

You have good examples of repair scenarios:


.dir.default.292886573.13181.12 only has an omap_digest_mismatch and no 
shard errors.  The automatic repair won't be sure which is a good copy.


In this case we can see that osd 1327 doesn't match the other two.  To 
assist the repair process in picking the right copy, remove the copy on 
osd.1327:


Stop osd 1327 and use "ceph-objectstore-tool --data-path .1327 
.dir.default.292886573.13181.12 remove"



.dir.default.64449186.344176 has selected_object_info with "od 337cf025", 
so the shards have "omap_digest_mismatch_oi" except for osd 990.


The pg repair code will use osd.990 to fix the other 2 copies without 
further handling.
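
Spelled out a little more as a sketch (the data path and systemd unit names
are assumptions based on a standard /var/lib/ceph layout; the PG ids come
from Robin's list-inconsistent-obj output quoted below):

# On the node hosting osd.1327, with the OSD stopped:
$ systemctl stop ceph-osd@1327
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1327 \
      --pgid 5.f1c0 '.dir.default.292886573.13181.12' remove
$ systemctl start ceph-osd@1327

# Then let the primaries rewrite the bad/missing copies:
$ ceph pg repair 5.f1c0
$ ceph pg repair 5.3d40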



David



On 9/8/17 11:16 AM, Robin H. Johnson wrote:

On Thu, Sep 07, 2017 at 08:24:04PM +, Robin H. Johnson wrote:

pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]

Here is the output of 'rados list-inconsistent-obj' for the PGs:

$ sudo rados list-inconsistent-obj 5.f1c0 |json_pp -json_opt canonical,pretty
{
"epoch" : 1221254,
"inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.292886573.13181.12",
 "nspace" : "",
 "snap" : "head",
 "version" : 483490
  },
  "selected_object_info" : 
"5:038f1cff:::.dir.default.292886573.13181.12:head(1221843'483490 client.417313345.0:19515832 
dirty|omap|data_digest s 0 uv 483490 dd  alloc_hint [0 0])",
  "shards" : [
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x928b0c0b",
"osd" : 91,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x928b0c0b",
"osd" : 631,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x6556c868",
"osd" : 1327,
"size" : 0
 }
  ],
  "union_shard_errors" : []
   }
]
}
$ sudo rados list-inconsistent-obj 5.3d40  |json_pp -json_opt canonical,pretty
{
"epoch" : 1210895,
"inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.64449186.344176",
 "nspace" : "",
 "snap" : "head",
 "version" : 1177199
  },
  "selected_object_info" : 
"5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 osd.1322.0:537914 
dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd  od 337cf025 alloc_hint [0 0])",
  "shards" : [
 {
"data_digest" : "0x",
"errors" : [
   "omap_digest_mismatch_oi"
],
"omap_digest" : "0x3242b04e",
"osd" : 655,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x337cf025",
"osd" : 990,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [
   "omap_digest_mismatch_oi"
],
"omap_digest" : "0xc90d06a8",
"osd" : 1322,
"size" : 0
 }
  ],
  "union_shard_errors" : [
 "omap_digest_mismatch_oi"
  ]
   }
]
}





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD's flapping on ordinary scrub with cluster being static (after upgrade to 12.1.1

2017-09-08 Thread Tomasz Kusmierz
Hi there, 

Somebody told me that essentially I was running a pre-release version and I 
should upgrade to 12.2 and come back if the problem persists. 
Today I got the 12.2 upgrade available and installed it … so, I still have 
problems:


Sep 08 18:48:05 proxmox1 ceph-osd[3954]: *** Caught signal (Segmentation fault) 
**
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  in thread 7f5c883f0700 
thread_name:tp_osd_tp
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  ceph version 12.2.0 
(36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  1: (()+0xa07bb4) [0x55e157f22bb4]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  2: (()+0x110c0) [0x7f5ca94030c0]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  3: (()+0x1ff2f) [0x7f5caba05f2f]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  4: 
(rocksdb::BlockBasedTable::NewIndexIterator(rocksdb::ReadOptions const&, 
rocksdb::BlockIter*, 
rocksdb::BlockBasedTable::CachableEntry*)+0x4e6)
 [0x55e158306bb6]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  5: 
(rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice 
const&, rocksdb::GetContext*, bool)+0x283) [0x55e158307963]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  6: 
(rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, 
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, 
int)+0x13a) [0x55e1583e718a]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  7: 
(rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, 
rocksdb::PinnableSlice*, rocksdb::Status*, rocksdb::MergeContext*, 
rocksdb::RangeDelAggregator*, bool*, bool*, unsigned long*)+0x3f8) 
[0x55e1582c8c28]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  8: 
(rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, 
bool*)+0x552) [0x55e15838d682]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  9: 
(rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
rocksdb::PinnableSlice*)+0x13) [0x55e15838dab3]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  10: 
(rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, 
std::__cxx11::basic_string*)+0xc1) [0x55e157e6bb51]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  11: 
(RocksDBStore::get(std::__cxx11::basic_string const&, std::__cxx11::basic_string const&, 
ceph::buffer::list*)+0x1bb) [0x55e157e6308b]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  12: (()+0x885d71) [0x55e157da0d71]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  13: (()+0x885675) [0x55e157da0675]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  14: 
(BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned 
int)+0x5d7) [0x55e157de48c7]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  15: 
(BlueStore::_do_truncate(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long, 
std::set, 
std::allocator >*)+0x118) [0x55e157e05cd8]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  16: 
(BlueStore::_do_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr)+0xc5) [0x55e157e06755]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  17: 
(BlueStore::_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&)+0x7b) [0x55e157e0807b]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  18: 
(BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)+0x1f55) [0x55e157e1ec15]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  19: 
(BlueStore::queue_transactions(ObjectStore::Sequencer*, 
std::vector&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x536) 
[0x55e157e1f916]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  20: 
(PrimaryLogPG::queue_transactions(std::vector&, 
boost::intrusive_ptr)+0x66) [0x55e157b437f6]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  21: 
(ReplicatedBackend::do_repop(boost::intrusive_ptr)+0xbdc) 
[0x55e157c70d6c]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  22: 
(ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2b7) 
[0x55e157c73b47]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  23: 
(PGBackend::handle_message(boost::intrusive_ptr)+0x50) 
[0x55e157b810d0]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  24: 
(PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x4e3) [0x55e157ae6a83]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  25: 
(OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, 
ThreadPool::TPHandle&)+0x3ab) [0x55e15796b19b]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  26: 
(PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x5a) 
[0x55e157c0354a]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]:  27: 
(OSD::ShardedOpWQ::_process(unsigned int, 

Re: [ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9

2017-09-08 Thread Robin H. Johnson
On Thu, Sep 07, 2017 at 08:24:04PM +, Robin H. Johnson wrote:
> pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
> pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]
Here is the output of 'rados list-inconsistent-obj' for the PGs:

$ sudo rados list-inconsistent-obj 5.f1c0 |json_pp -json_opt canonical,pretty
{
   "epoch" : 1221254,
   "inconsistents" : [
  {
 "errors" : [
"omap_digest_mismatch"
 ],
 "object" : {
"locator" : "",
"name" : ".dir.default.292886573.13181.12",
"nspace" : "",
"snap" : "head",
"version" : 483490
 },
 "selected_object_info" : 
"5:038f1cff:::.dir.default.292886573.13181.12:head(1221843'483490 
client.417313345.0:19515832 dirty|omap|data_digest s 0 uv 483490 dd  
alloc_hint [0 0])",
 "shards" : [
{
   "data_digest" : "0x",
   "errors" : [],
   "omap_digest" : "0x928b0c0b",
   "osd" : 91,
   "size" : 0
},
{
   "data_digest" : "0x",
   "errors" : [],
   "omap_digest" : "0x928b0c0b",
   "osd" : 631,
   "size" : 0
},
{
   "data_digest" : "0x",
   "errors" : [],
   "omap_digest" : "0x6556c868",
   "osd" : 1327,
   "size" : 0
}
 ],
 "union_shard_errors" : []
  }
   ]
}
$ sudo rados list-inconsistent-obj 5.3d40  |json_pp -json_opt canonical,pretty
{
   "epoch" : 1210895,
   "inconsistents" : [
  {
 "errors" : [
"omap_digest_mismatch"
 ],
 "object" : {
"locator" : "",
"name" : ".dir.default.64449186.344176",
"nspace" : "",
"snap" : "head",
"version" : 1177199
 },
 "selected_object_info" : 
"5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 
osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd  
od 337cf025 alloc_hint [0 0])",
 "shards" : [
{
   "data_digest" : "0x",
   "errors" : [
  "omap_digest_mismatch_oi"
   ],
   "omap_digest" : "0x3242b04e",
   "osd" : 655,
   "size" : 0
},
{
   "data_digest" : "0x",
   "errors" : [],
   "omap_digest" : "0x337cf025",
   "osd" : 990,
   "size" : 0
},
{
   "data_digest" : "0x",
   "errors" : [
  "omap_digest_mismatch_oi"
   ],
   "omap_digest" : "0xc90d06a8",
   "osd" : 1322,
   "size" : 0
}
 ],
 "union_shard_errors" : [
"omap_digest_mismatch_oi"
 ]
  }
   ]
}



-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Asst. Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] Ceph release cadence

2017-09-08 Thread Gregory Farnum
I think I'm the resident train release advocate so I'm sure my
advocating that model will surprise nobody. I'm not sure I'd go all
the way to Lars' multi-release maintenance model (although it's
definitely something I'm interested in), but there are two big reasons
I wish we were on a train with more frequent real releases:

1) It reduces the cost of features missing a release. Right now if
something misses an LTS release, that's it for a year. And nobody
likes releasing an LTS without a bunch of big new features, so each
LTS is later than the one before as we scramble to get features merged
in.

...and then we deal with the fact that we scrambled to get a bunch of
features merged in and they weren't quite baked. (Luminous so far
seems to have gone much better in this regard! Hurray! But I think
that has a lot to do with our feature-release-scramble this year being
mostly peripheral stuff around user interfaces that got tacked on
about the time we'd initially planned the release to occur.)

2) Train releases increase predictability for downstreams, partners,
and users around when releases will happen. Right now, the release
process and schedule is entirely opaque to anybody who's not involved
in every single upstream meeting we have; and it's unpredictable even
to those who are. That makes things difficult, as Xiaoxi said.

There are other peripheral but serious benefits I'd expect to see from
fully-validated train releases as well. It would be *awesome* to have
more frequent known-stable points to do new development against. If
you're an external developer and you want a new feature, you have to
either keep it rebased against a fast-changing master branch, or you
need to settle for writing it against a long-out-of-date LTS and then
forward-porting it for merge. If you're an FS developer writing a very
small new OSD feature and you try to validate it against RADOS, you've
no idea if bugs that pop up and look random are because you really did
something wrong or if there's currently an intermittent issue in RADOS
master. I would have *loved* to be able to maintain CephFS integration
branches for features that didn't touch RADOS and were built on top of
the latest release instead of master, but it was utterly infeasible
because there were too many missing features with the long delays.

On Fri, Sep 8, 2017 at 9:16 AM, Sage Weil  wrote:
> I'm going to pick on Lars a bit here...
>
> On Thu, 7 Sep 2017, Lars Marowsky-Bree wrote:
>> On 2017-09-06T15:23:34, Sage Weil  wrote:
>> > Other options we should consider?  Other thoughts?
>>
>> With about 20-odd years in software development, I've become a big
>> believer in schedule-driven releases. If it's feature-based, you never
>> know when they'll get done.
>>
>> If the schedule intervals are too long though, the urge to press too
>> much in (so as not to miss the next merge window) is just too high,
>> meaning the train gets derailed. (Which cascades into the future,
>> because the next time the pressure will be even higher based on the
>> previous experience.) This requires strictness.
>>
>> We've had a few Linux kernel releases that were effectively feature
>> driven and never quite made it. 1.3.x? 1.5.x? My memory is bad, but they
>> were a disaster that eventually led Linus to evolve to the current
>> model.
>>
>> That serves them really well, and I believe it might be worth
>> considering for us.
>
> This model is very appealing.  The problem with it that I see is that the
> upstream kernel community doesn't really do stable releases.  Mainline
> developers are just getting their stuff upstream, and entire separate
> organizations and teams are doing the stable distro kernels.  (There are
> upstream stable kernels too, yes, but they don't get much testing AFAICS
> and I'm not sure who uses them.)
>
> More importantly, upgrade and on-disk format issues are present for almost
> everything that we change in Ceph.  Those things rarely come up for the
> kernel.  Even the local file systems (a small piece of the kernel) have
> comparatively fewer format changes that we do, it seems.
>
> These make the upgrade testing a huge concern and burden for the
> Ceph development community.
>
>> I'd try to move away from the major milestones. Features get integrated
>> into the next schedule-driven release when they deemed ready and stable;
>> when they're not, not a big deal, the next one is coming up "soonish".
>>
>> (This effectively decouples feature development slightly from the
>> release schedule.)
>>
>> We could even go for "a release every 3 months, sharp", merge window for
>> the first month, stabilization the second, release clean up the third,
>> ship.
>>
>> Interoperability hacks for the cluster/server side are maintained for 2
>> years, and then dropped.  Sharp. (Speaking as one of those folks
>> affected, we should not burden the community with this.) Client interop
>> is a different story, a bit.
>>
>> Basically, effectively 

Re: [ceph-users] [Ceph-maintainers] Ceph release cadence

2017-09-08 Thread Sage Weil
I'm going to pick on Lars a bit here...

On Thu, 7 Sep 2017, Lars Marowsky-Bree wrote:
> On 2017-09-06T15:23:34, Sage Weil  wrote:
> > Other options we should consider?  Other thoughts?
> 
> With about 20-odd years in software development, I've become a big
> believer in schedule-driven releases. If it's feature-based, you never
> know when they'll get done.
> 
> If the schedule intervals are too long though, the urge to press too
> much in (so as not to miss the next merge window) is just too high,
> meaning the train gets derailed. (Which cascades into the future,
> because the next time the pressure will be even higher based on the
> previous experience.) This requires strictness.
> 
> We've had a few Linux kernel releases that were effectively feature
> driven and never quite made it. 1.3.x? 1.5.x? My memory is bad, but they
> were a disaster that eventually led Linus to evolve to the current
> model.
> 
> That serves them really well, and I believe it might be worth
> considering for us.

This model is very appealing.  The problem with it that I see is that the 
upstream kernel community doesn't really do stable releases.  Mainline 
developers are just getting their stuff upstream, and entire separate 
organizations and teams are doing the stable distro kernels.  (There are 
upstream stable kernels too, yes, but they don't get much testing AFAICS 
and I'm not sure who uses them.)

More importantly, upgrade and on-disk format issues are present for almost 
everything that we change in Ceph.  Those things rarely come up for the 
kernel.  Even the local file systems (a small piece of the kernel) have 
comparatively fewer format changes that we do, it seems.

These make the upgrade testing a huge concern and burden for the 
Ceph development community.

> I'd try to move away from the major milestones. Features get integrated
> into the next schedule-driven release when they deemed ready and stable;
> when they're not, not a big deal, the next one is coming up "soonish".
> 
> (This effectively decouples feature development slightly from the
> release schedule.)
> 
> We could even go for "a release every 3 months, sharp", merge window for
> the first month, stabilization the second, release clean up the third,
> ship.
> 
> Interoperability hacks for the cluster/server side are maintained for 2
> years, and then dropped.  Sharp. (Speaking as one of those folks
> affected, we should not burden the community with this.) Client interop
> is a different story, a bit.
> 
> Basically, effectively edging towards continuous integration of features
> and bugfixes both. Nobody has to wait for anything much, and can
> schedule reasonably independently.

If I read between the lines a bit here, what this sounds like is:

 - keep the frequent major releases (but possibly shorten the 6mo 
   cadence)
 - do backports for all of them, not just the even ones
 - test upgrades between all of them within a 2 year horizon, instead 
   of just the last major one

Is that accurate?

Unfortunately it sounds to me like that would significantly increase the 
maintenance burden (double it even?) and slow development down.  The user 
base will also end up fragmented across a broader range of versions, which 
means we'll see a wider variety of bugs and each release will be less 
stable.

This is full of trade-offs... time we spend backporting or testing 
upgrades is time we don't spend fixing bugs or improving performance or 
adding features.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release cadence

2017-09-08 Thread Scottix
Personally I kind of like the current format; fundamentally we are talking
about data storage, which should be the most tested and scrutinized piece of
software on your computer. I'd rather have feature XYZ later than sooner than
end up with "oh, I lost all my data". I am thinking of a recent FS that had a
feature they shouldn't have released. I appreciate the extra time it takes to
release something resilient.

Having an LTS version to rely on provides good assurance that the upgrade
process will be thoroughly tested.
Having a version for more experimental features keeps the new features at
bay; it basically follows the Ubuntu model.

I feel there were a lot of underpinning features in Luminous that checked a
lot of boxes you have been wanting for a while. One thing to consider:
possibly a lot of the core features become more incremental from here.

I guess from my use case Ceph actually does everything I need it to do atm.
Yes new features and better processes make it better, but more or less I am
pretty content. Maybe I am a small minority in this logic.

On Fri, Sep 8, 2017 at 2:20 AM Matthew Vernon  wrote:

> Hi,
>
> On 06/09/17 16:23, Sage Weil wrote:
>
> > Traditionally, we have done a major named "stable" release twice a year,
> > and every other such release has been an "LTS" release, with fixes
> > backported for 1-2 years.
>
> We use the ceph version that comes with our distribution (Ubuntu LTS);
> those come out every 2 years (though we won't move to a brand-new
> distribution until we've done some testing!). So from my POV, LTS ceph
> releases that come out such that adjacent ceph LTSs fit neatly into
> adjacent Ubuntu LTSs is the ideal outcome. We're unlikely to ever try
> putting a non-LTS ceph version into production.
>
> I hope this isn't an unusual requirement :)
>
> Matthew
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client features by IP?

2017-09-08 Thread Bryan Stillwell
On 09/07/2017 01:26 PM, Josh Durgin wrote:
> On 09/07/2017 11:31 AM, Bryan Stillwell wrote:
>> On 09/07/2017 10:47 AM, Josh Durgin wrote:
>>> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
 I was reading this post by Josh Durgin today and was pretty happy to
 see we can get a summary of features that clients are using with the
 'ceph features' command:

 http://ceph.com/community/new-luminous-upgrade-complete/

 However, I haven't found an option to display the IP address of
 those clients with the older feature sets.  Is there a flag I can
 pass to 'ceph features' to list the IPs associated with each feature
 set?
>>>
>>> There is not currently, we should add that - it'll be easy to backport
>>> to luminous too. The only place both features and IP are shown is in
>>> 'debug mon = 10' logs right now.
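
In the meantime, a rough workaround (my own sketch, not an official
procedure) is to bump the mon debug level briefly and pull client addresses
and feature bits out of the monitor log; the exact log lines vary by version:

$ ceph tell mon.\* injectargs '--debug_mon 10/10'
$ # wait a while so clients reconnect / renew their sessions...
$ grep -i features /var/log/ceph/ceph-mon.*.log | less
$ ceph tell mon.\* injectargs '--debug_mon 1/5'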
>>
>> I think that would be great!  The first thing I would want to do after
>> seeing an old client listed would be to find it and upgrade it.  Having
>> the IP of the client would make that a ton easier!
>
> Yup, should've included that in the first place!
>
>> Anything I could do to help make that happen?  File a feature request
>> maybe?
>
> Sure, adding a short tracker.ceph.com ticket would help, that way we can
> track the backport easily too.

Ticket created:

http://tracker.ceph.com/issues/21315

Thanks Josh!

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Alexandre DERUMIER
Sorry, I didn't see that you use Proxmox 5.

As I'm a Proxmox contributor, I can tell you that I have errors with kernel 4.10 
(which is the Ubuntu kernel).

If you don't use ZFS, try kernel 4.12 from stretch-backports, or kernel 4.4 
from Proxmox 4 (with ZFS support).


Tell me if it works better for you.

(I'm currently trying to backport the latest mlx5 patches from kernel 4.12 to kernel 
4.10, to see if that helps.)

I have opened a thread on the pve-devel mailing list today.



- Mail original -
De: "Alexandre Derumier" 
À: "Burkhard Linke" 
Cc: "ceph-users" 
Envoyé: Vendredi 8 Septembre 2017 17:27:49
Objet: Re: [ceph-users] output discards (queue drops) on switchport

Hi, 

>> public network Mellanox ConnectX-4 Lx dual-port 25 GBit/s 

which kernel/distro do you use ? 

I have the same card, and I had problems with the CentOS 7 kernel 3.10 recently, with 
packet drops. 

I also have problems with the Ubuntu kernel 4.10 and LACP. 


Kernels 4.4 and 4.12 are working fine for me. 





- Mail original - 
De: "Burkhard Linke"  
À: "ceph-users"  
Envoyé: Vendredi 8 Septembre 2017 16:25:31 
Objet: Re: [ceph-users] output discards (queue drops) on switchport 

Hi, 


On 09/08/2017 04:13 PM, Andreas Herrmann wrote: 
> Hi, 
> 
> On 08.09.2017 15:59, Burkhard Linke wrote: 
>> On 09/08/2017 02:12 PM, Marc Roos wrote: 
>>> 
>>> Afaik ceph is not supporting/working with bonding. 
>>> 
>>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html 
>>> (thread: Maybe some tuning for bonded network adapters) 
>> CEPH works well with LACP bonds. The problem described in that thread is the 
>> fact that LACP is not using links in a round robin fashion, but distributes 
>> network stream depending on a hash of certain parameters like source and 
>> destination IP address. This is already set to layer3+4 policy by the OP. 
>> 
>> Regarding the drops (and without any experience with neither 25GBit ethernet 
>> nor the Arista switches): 
>> Do you have corresponding input drops on the server's network ports? 
> No input drops, just output drop 
Output drops on the switch are related to input drops on the server 
side. If the link uses flow control and the server signals the switch 
that its internal buffers are full, the switch has to drop further 
packets if the port buffer is also filled. If there's no flow control, 
and the network card is not able to store the packet (full buffers...), 
it should be noted as an overrun in the interface statistics (and if this 
is not correct, please correct me, I'm not a network guy). 
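
A quick way to check that on the server side (generic Linux commands, nothing
specific to this setup; "ens1f0" below is just a placeholder interface name):

$ ip -s link show dev ens1f0
$ ethtool -S ens1f0 | grep -iE 'drop|discard|overrun|pause|buf'

If the NIC counters stay clean while the switch port shows output discards,
the problem is more likely switch-side buffering/flow-control than the host
failing to keep up.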

> 
>> Did you tune the network settings on server side for high throughput, e.g. 
>> net.ipv4.tcp_rmem, wmem, ...? 
> sysctl tuning is disabled at the moment. I tried sysctl examples from 
> https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is 
> still 
> the same amount of output drops. 
> 
>> And are the CPUs fast enough to handle the network traffic? 
> Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's 
> my first Ceph cluster. 
The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid 
controller and 8 ssd based osds with it. You can use tools like atop or 
ntop to watch certain aspects of the system during the tests (network, 
cpu, disk). 

Regards, 
Burkhard 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Alexandre DERUMIER
Hi,

>> public network Mellanox ConnectX-4 Lx dual-port 25 GBit/s

which kernel/distro do you use ?

I have the same card, and I had problems with the CentOS 7 kernel 3.10 recently, with 
packet drops.

I also have problems with the Ubuntu kernel 4.10 and LACP. 


Kernels 4.4 and 4.12 are working fine for me.





- Mail original -
De: "Burkhard Linke" 
À: "ceph-users" 
Envoyé: Vendredi 8 Septembre 2017 16:25:31
Objet: Re: [ceph-users] output discards (queue drops) on switchport

Hi, 


On 09/08/2017 04:13 PM, Andreas Herrmann wrote: 
> Hi, 
> 
> On 08.09.2017 15:59, Burkhard Linke wrote: 
>> On 09/08/2017 02:12 PM, Marc Roos wrote: 
>>> 
>>> Afaik ceph is not supporting/working with bonding. 
>>> 
>>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html 
>>> (thread: Maybe some tuning for bonded network adapters) 
>> CEPH works well with LACP bonds. The problem described in that thread is the 
>> fact that LACP is not using links in a round robin fashion, but distributes 
>> network stream depending on a hash of certain parameters like source and 
>> destination IP address. This is already set to layer3+4 policy by the OP. 
>> 
>> Regarding the drops (and without any experience with neither 25GBit ethernet 
>> nor the Arista switches): 
>> Do you have corresponding input drops on the server's network ports? 
> No input drops, just output drop 
Output drops on the switch are related to input drops on the server 
side. If the link uses flow control and the server signals the switch 
that its internal buffers are full, the switch has to drop further 
packets if the port buffer is also filled. If there's no flow control, 
and the network card is not able to store the packet (full buffers...), 
it should be noted as an overrun in the interface statistics (and if this 
is not correct, please correct me, I'm not a network guy). 

> 
>> Did you tune the network settings on server side for high throughput, e.g. 
>> net.ipv4.tcp_rmem, wmem, ...? 
> sysctl tuning is disabled at the moment. I tried sysctl examples from 
> https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is 
> still 
> the same amount of output drops. 
> 
>> And are the CPUs fast enough to handle the network traffic? 
> Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's 
> my first Ceph cluster. 
The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid 
controller and 8 ssd based osds with it. You can use tools like atop or 
ntop to watch certain aspects of the system during the tests (network, 
cpu, disk). 

Regards, 
Burkhard 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw crashing after buffer overflows detected

2017-09-08 Thread Bryan Stillwell
For about a week we've been seeing a decent number of buffer overflows
detected across all our RGW nodes in one of our clusters.  This started
happening a day after we started weighting in some new OSD nodes, so
we're thinking it's probably related to that.  Could someone help us
determine the root cause of this?

Cluster details:
  Distro: CentOS 7.2
  Release: 0.94.10-0.el7.x86_64
  OSDs: 1120
  RGW nodes: 10

See log messages below.  If you know how to improve the call trace
below I would like to hear that too.  I tried installing the
ceph-debuginfo-0.94.10-0.el7.x86_64 package, but that didn't seem to
help.

Thanks,
Bryan


# From /var/log/messages:

Sep  7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:01:55 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 21:37:00 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:14:54 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  7 23:17:08 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 00:12:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:04:07 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:17:49 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:41:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated
Sep  8 07:59:29 p3cephrgw003 radosgw: *** buffer overflow detected ***: 
/bin/radosgw terminated


# From /var/log/ceph/client.radosgw.p3cephrgw003.log:

 0> 2017-09-08 07:59:29.696615 7f7b296a2700 -1 *** Caught signal (Aborted) 
**
 in thread 7f7b296a2700

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: /bin/radosgw() [0x6d3d92]
 2: (()+0xf100) [0x7f7f425e9100]
 3: (gsignal()+0x37) [0x7f7f4141d5f7]
 4: (abort()+0x148) [0x7f7f4141ece8]
 5: (()+0x75317) [0x7f7f4145d317]
 6: (__fortify_fail()+0x37) [0x7f7f414f5ac7]
 7: (()+0x10bc80) [0x7f7f414f3c80]
 8: (()+0x10da37) [0x7f7f414f5a37]
 9: (OS_Accept()+0xc1) [0x7f7f435bd8b1]
 10: (FCGX_Accept_r()+0x9c) [0x7f7f435bb91c]
 11: (RGWFCGXProcess::run()+0x7bf) [0x58136f]
 12: (RGWProcessControlThread::entry()+0xe) [0x5821fe]
 13: (()+0x7dc5) [0x7f7f425e1dc5]
 14: (clone()+0x6d) [0x7f7f414de21d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
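
For context, the kind of symbol resolution I was hoping to get out of the
debuginfo package is something along these lines (package names and debug
paths are my best guess for CentOS 7, so treat them as assumptions):

debuginfo-install ceph-radosgw                                 # should pull the matching debuginfo
addr2line -Cfe /usr/lib/debug/usr/bin/radosgw.debug 0x6d3d92   # frame 1 from the trace above
gdb /bin/radosgw /path/to/corefile                             # if a core dump was written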

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [PVE-User] OSD won't start, even created ??

2017-09-08 Thread Phil Schwarz
Hi,
any help would be really useful.
Does anyone have a clue about my issue?

Thanks in advance.
Best regards,


Le 05/09/2017 à 20:25, Phil Schwarz a écrit :
> Hi,
> I come back with the same issue as seen in a previous thread (link given).
> 
> Trying to add a 2TB SATA disk as OSD:
> Using the Proxmox GUI or CLI (command given) gives the same (bad) result.
> 
> Didn't want to use a direct 'ceph osd create', thus bypassing the pmxcfs
> redundant filesystem.
> 
> I tried to build an OSD with the same disk on another machine (a stronger one
> with an Opteron QuadCore), failing in the same way.
> 
> 
> Sorry for crossposting, but I think I'm failing against the pveceph wrapper.
> 
> 
> Any help or clue would be really useful..
> 
> Thanks
> Best regards.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -- Link to previous thread (but same problem):
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg38897.html
> 
> 
> -- commands :
> fdisk /dev/sdc ( mklabel msdos, w, q)
> ceph-disk zap /dev/sdc
> pveceph createosd /dev/sdc
> 
> -- dpkg -l
> 
>  dpkg -l |grep ceph
> ii  ceph           12.1.2-pve1  amd64  distributed storage and file system
> ii  ceph-base      12.1.2-pve1  amd64  common ceph daemon libraries and management tools
> ii  ceph-common    12.1.2-pve1  amd64  common utilities to mount and interact with a ceph storage cluster
> ii  ceph-mgr       12.1.2-pve1  amd64  manager for the ceph distributed storage system
> ii  ceph-mon       12.1.2-pve1  amd64  monitor server for the ceph storage system
> ii  ceph-osd       12.1.2-pve1  amd64  OSD server for the ceph storage system
> ii  libcephfs1     10.2.5-7.2   amd64  Ceph distributed file system client library
> ii  libcephfs2     12.1.2-pve1  amd64  Ceph distributed file system client library
> ii  python-cephfs  12.1.2-pve1  amd64  Python 2 libraries for the Ceph libcephfs library
> 
> -- tail -f /var/log/ceph/ceph-osd.admin.log
> 
> 2017-09-03 18:28:20.856641 7fad97e45e00  0 ceph version 12.1.2
> (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc), process
> (unknown), pid 5493
> 2017-09-03 18:28:20.857104 7fad97e45e00 -1 bluestore(/dev/sdc2)
> _read_bdev_label unable to decode label at offset 102:
> buffer::malformed_input: void
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode
> past end of struct encoding
> 2017-09-03 18:28:20.857200 7fad97e45e00  1 journal _open /dev/sdc2 fd 4:
> 2000293007360 bytes, block size 4096 bytes, directio = 0, aio = 0
> 2017-09-03 18:28:20.857366 7fad97e45e00  1 journal close /dev/sdc2
> 2017-09-03 18:28:20.857431 7fad97e45e00  0 probe_block_device_fsid
> /dev/sdc2 is filestore, ----
> 2017-09-03 18:28:21.937285 7fa5766a5e00  0 ceph version 12.1.2
> (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc), process
> (unknown), pid 5590
> 2017-09-03 18:28:21.944189 7fa5766a5e00 -1 bluestore(/dev/sdc2)
> _read_bdev_label unable to decode label at offset 102:
> buffer::malformed_input: void
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode
> past end of struct encoding
> 2017-09-03 18:28:21.944305 7fa5766a5e00  1 journal _open /dev/sdc2 fd 4:
> 2000293007360 bytes, block size 4096 bytes, directio = 0, aio = 0
> 2017-09-03 18:28:21.944527 7fa5766a5e00  1 journal close /dev/sdc2
> 2017-09-03 18:28:21.944588 7fa5766a5e00  0 probe_block_device_fsid
> /dev/sdc2 is filestore, ----
> ___
> pve-user mailing list
> pve-u...@pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Burkhard Linke

Hi,


On 09/08/2017 04:13 PM, Andreas Herrmann wrote:

Hi,

On 08.09.2017 15:59, Burkhard Linke wrote:

On 09/08/2017 02:12 PM, Marc Roos wrote:
  
Afaik ceph is not supporting/working with bonding.


https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
(thread: Maybe some tuning for bonded network adapters)

CEPH works well with LACP bonds. The problem described in that thread is the
fact that LACP is not using links in a round robin fashion, but distributes
network streams depending on a hash of certain parameters like source and
destination IP address. This is already set to layer3+4 policy by the OP.

Regarding the drops (and without any experience with neither 25GBit ethernet
nor the Arista switches):
Do you have corresponding input drops on the server's network ports?

No input drops, just output drop
Output drops on the switch are related to input drops on the server 
side. If the link uses flow control and the server signals the switch 
that its internal buffers are full, the switch has to drop further 
packets if the port buffer is also filled. If there's no flow control, 
and the network card is not able to store the packet (full buffers...), 
it should be noted as overrun in the interface statistics (and if this 
is not correct, please correct me, I'm not a network guy).





Did you tune the network settings on server side for high throughput, e.g.
net.ipv4.tcp_rmem, wmem, ...?

sysctl tuning is disabled at the moment. I tried sysctl examples from
https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is still
the same amount of output drops.


And are the CPUs fast enough to handle the network traffic?

Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's
my first Ceph cluster.
The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid 
controller and 8 ssd based osds with it. You can use tools like atop or 
ntop to watch certain aspects of the system during the tests (network, 
cpu, disk).


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Andreas Herrmann
Hi,

On 08.09.2017 15:59, Burkhard Linke wrote:
> On 09/08/2017 02:12 PM, Marc Roos wrote:
>>  
>> Afaik ceph is not supporting/working with bonding.
>>
>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
>> (thread: Maybe some tuning for bonded network adapters)
>
> CEPH works well with LACP bonds. The problem described in that thread is the
> fact that LACP is not using links in a round robin fashion, but distributes
> network streams depending on a hash of certain parameters like source and
> destination IP address. This is already set to layer3+4 policy by the OP.
> 
> Regarding the drops (and without any experience with neither 25GBit ethernet
> nor the Arista switches):
> Do you have corresponding input drops on the server's network ports?

No input drops, just output drop

> Did you tune the network settings on server side for high throughput, e.g.
> net.ipv4.tcp_rmem, wmem, ...?

sysctl tuning is disabled at the moment. I tried sysctl examples from
https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is still
the same amount of output drops.
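
For reference, the knobs that guide touches are along these lines (the numbers
below are placeholders to show the shape of the settings, not values I can
vouch for):

sysctl -w net.core.rmem_max=56623104
sysctl -w net.core.wmem_max=56623104
sysctl -w net.ipv4.tcp_rmem="4096 87380 56623104"
sysctl -w net.ipv4.tcp_wmem="4096 65536 56623104"
sysctl -w net.core.netdev_max_backlog=50000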

> And are the CPUs fast enough to handle the network traffic?

Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's
my first Ceph cluster.

Later I'll upgrade from 12.1.2 => 12.2.0 and will run some more tests.

Regards,
Andreas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Burkhard Linke

Hi,


On 09/08/2017 02:12 PM, Marc Roos wrote:
  


Afaik ceph is not supporting/working with bonding.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
(thread: Maybe some tuning for bonded network adapters)
CEPH works well with LACP bonds. The problem described in that thread is 
the fact that LACP is not using links in a round robin fashion, but 
distributes network streams depending on a hash of certain parameters 
like source and destination IP address. This is already set to layer3+4 
policy by the OP.


Regarding the drops (and without any experience with neither 25GBit 
ethernet nor the Arista switches):
Do you have corresponding input drops on the server's network ports? Did 
you tune the network settings on server side for high throughput, e.g. 
net.ipv4.tcp_rmem, wmem, ...? And are the CPUs fast enough to handle the 
network traffic?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Andreas Herrmann
I disabled the complete bond and just used a single 25 GBit/s link. The output
drops still appear on the switchports.

On 08.09.2017 14:12, Marc Roos wrote:
>  
> 
> Afaik ceph is not supporting/working with bonding. 
> 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
> (thread: Maybe some tuning for bonded network adapters)
> 
> 
> 
> 
> -Original Message-
> From: Andreas Herrmann [mailto:andr...@mx20.org] 
> Sent: vrijdag 8 september 2017 13:58
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] output discards (queue drops) on switchport
> 
> Hello,
> 
> I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F, 
> Xeon E5-1660 v4, 128 GB RAM) with each 8 Samsung SSD SM863 960GB 
> connected to a LSI-9300-8i (SAS3008) controller used as OSDs for Ceph 
> (12.1.2)
> 
> The servers are connected to two Arista DCS-7060CX-32S switches. I'm 
> using MLAG bond (bondmode LACP, xmit_hash_policy layer3+4, MTU 9000):
>  * backend network for Ceph: cluster network & public network
>Mellanox ConnectX-4 Lx dual-port 25 GBit/s
>  * frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
> dual-port
> 
> Ceph is quite a default installation with size=3.
> 
> My problem:
> I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in 
> a test virtual machine (the only one running in the cluster) with 
> around 210 MB/s. I get output drops on all switchports. The drop rate 
> is between 0.1 - 0.9 %. The drop rate of 0.9 % is reached when writing 
> with about 1300MB/s into ceph.
> 
> First I thought about a problem with the Mellanox cards and used the 
> Intel cards for ceph traffic. The problem also exists.
> 
> I tried quite a lot and nothing helped:
>  * changed the MTU from 9000 to 1500
>  * changed bond_xmit_hash_policy from layer3+4 to layer2+3
>  * deactivated the bond and just used a single link
>  * disabled offloading
>  * disabled power management in BIOS
>  * perf-bias 0
> 
> I analyzed the traffic via tcpdump and got some of those "errors":
>  * TCP Previous segment not captured
>  * TCP Out-of-Order
>  * TCP Retransmission
>  * TCP Fast Retransmission
>  * TCP Dup ACK
>  * TCP ACKed unseen segment
>  * TCP Window Update
> 
> Is that behaviour normal for ceph, or does anyone have ideas how to solve that 
> problem with the output drops at switch-side?
> 
> With iperf I can reach full 50 GBit/s on the bond with zero output 
> drops.
> 
> Andreas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph OSD journal (with dmcrypt) replacement

2017-09-08 Thread David Turner
You used to need a keyfile in Hammer. I think Jewel changed that to a
partition, but I don't have experience with that.
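
A rough sketch of what I mean for Hammer, assuming LUKS and a keyfile, and
assuming the OSD data itself is still good (device paths, the OSD id and the
key location are placeholders, and the details depend on how ceph-disk
originally set the OSD up; plain dm-crypt would need different cryptsetup
calls):

ceph osd set noout
service ceph stop osd.12
ceph-osd -i 12 --flush-journal
# format and open the new journal partition with a keyfile, so cryptsetup
# never prompts for a passphrase; reuse the old journal uuid as the mapper name
cryptsetup --key-file /etc/ceph/dmcrypt-keys/JOURNAL-UUID luksFormat /dev/sdX2
cryptsetup --key-file /etc/ceph/dmcrypt-keys/JOURNAL-UUID luksOpen /dev/sdX2 JOURNAL-UUID
ceph-osd -i 12 --mkjournal
service ceph start osd.12
ceph osd unset noout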

On Fri, Sep 8, 2017, 9:18 AM M Ranga Swami Reddy 
wrote:

> when I create a dmcrypted journal using the cryptsetup command, it's asking
> for a passphrase. Can I use an empty passphrase?
>
> On Wed, Sep 6, 2017 at 11:23 PM, M Ranga Swami Reddy
>  wrote:
> > Thank you. I am able to replace the dmcrypt journal successfully.
> >
> > On Sep 5, 2017 18:14, "David Turner"  wrote:
> >>
> >> Did the journal drive fail during operation? Or was it taken out during
> >> pre-failure? If it fully failed, then most likely you can't guarantee the
> >> consistency of the underlying osds. In this case, you just remove the
> >> affected osds and add them back in as new osds.
> >>
> >> In the case of having good data on the osds, you follow the standard
> >> process of closing the journal, creating the new partition, setting up all
> >> of the partition metadata so that the ceph udev rules will know what the
> >> journal is, and just creating a new dmcrypt volume on it. I would recommend
> >> using the same uuid as the old journal so that you don't need to update the
> >> symlinks and such on the osd. After everything is done, run the journal
> >> create command for the osd and start the osd.
> >>
> >>
> >> On Tue, Sep 5, 2017, 2:47 AM M Ranga Swami Reddy 
> >> wrote:
> >>>
> >>> Hello,
> >>> How to replace an OSD's journal created with dmcrypt, from one drive
> >>> to another drive, in case of current journal drive failed.
> >>>
> >>> Thanks
> >>> Swami
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph OSD journal (with dmcrypt) replacement

2017-09-08 Thread M Ranga Swami Reddy
when I create a dmcrypted journal using the cryptsetup command, it's asking
for a passphrase. Can I use an empty passphrase?

On Wed, Sep 6, 2017 at 11:23 PM, M Ranga Swami Reddy
 wrote:
> Thank you. I am able to replace the dmcrypt journal successfully.
>
> On Sep 5, 2017 18:14, "David Turner"  wrote:
>>
>> Did the journal drive fail during operation? Or was it taken out during
>> pre-failure? If it fully failed, then most likely you can't guarantee the
>> consistency of the underlying osds. In this case, you just remove the affected
>> osds and add them back in as new osds.
>>
>> In the case of having good data on the osds, you follow the standard
>> process of closing the journal, create the new partition, set up all of the
>> partition metadata so that the ceph udev rules will know what the journal
>> is, and just create a new dmcrypt volume on it. I would recommend using the
>> same uuid as the old journal so that you don't need to update the symlinks
>> and such on the osd. After everything is done, run the journal create
>> command for the osd and start the osd.
>>
>>
>> On Tue, Sep 5, 2017, 2:47 AM M Ranga Swami Reddy 
>> wrote:
>>>
>>> Hello,
>>> How to replace an OSD's journal created with dmcrypt, from one drive
>>> to another drive, in case of current journal drive failed.
>>>
>>> Thanks
>>> Swami
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Marc Roos
 

Afaik ceph is not supporting/working with bonding. 

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
(thread: Maybe some tuning for bonded network adapters)




-Original Message-
From: Andreas Herrmann [mailto:andr...@mx20.org] 
Sent: vrijdag 8 september 2017 13:58
To: ceph-users@lists.ceph.com
Subject: [ceph-users] output discards (queue drops) on switchport

Hello,

I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F, 
Xeon E5-1660 v4, 128 GB RAM) with each 8 Samsung SSD SM863 960GB 
connected to a LSI-9300-8i (SAS3008) controller used as OSDs for Ceph 
(12.1.2)

The servers are connected to two Arista DCS-7060CX-32S switches. I'm 
using MLAG bond (bondmode LACP, xmit_hash_policy layer3+4, MTU 9000):
 * backend network for Ceph: cluster network & public network
   Mellanox ConnectX-4 Lx dual-port 25 GBit/s
 * frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
dual-port

Ceph is quite a default installation with size=3.

My problem:
I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in 
a test virtual machine (the only one running in the cluster) with 
around 210 MB/s. I get output drops on all switchports. The drop rate 
is between 0.1 - 0.9 %. The drop rate of 0.9 % is reached when writing 
with about 1300MB/s into ceph.

First I thought about a problem with the Mellanox cards and used the 
Intel cards for ceph traffic. The problem also exists.

I tried quite a lot and nothing helped:
 * changed the MTU from 9000 to 1500
 * changed bond_xmit_hash_policy from layer3+4 to layer2+3
 * deactivated the bond and just used a single link
 * disabled offloading
 * disabled power management in BIOS
 * perf-bias 0

I analyzed the traffic via tcpdump and got some of those "errors":
 * TCP Previous segment not captured
 * TCP Out-of-Order
 * TCP Retransmission
 * TCP Fast Retransmission
 * TCP Dup ACK
 * TCP ACKed unseen segment
 * TCP Window Update

Is that behaviour normal for ceph, or does anyone have ideas how to solve that 
problem with the output drops at switch-side?

With iperf I can reach full 50 GBit/s on the bond with zero output 
drops.

Andreas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Andreas Herrmann
Hello,

I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F, Xeon
E5-1660 v4, 128 GB RAM) with each 8 Samsung SSD SM863 960GB connected to a
LSI-9300-8i (SAS3008) controller used as OSDs for Ceph (12.1.2)

The servers are connected to two Arista DCS-7060CX-32S switches. I'm using
MLAG bond (bondmode LACP, xmit_hash_policy layer3+4, MTU 9000):
 * backend network for Ceph: cluster network & public network
   Mellanox ConnectX-4 Lx dual-port 25 GBit/s
 * frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port
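
For completeness, this is roughly how I check the bond state and per-slave
counters (interface name is an example):

cat /proc/net/bonding/bond0   # bond mode, xmit hash policy, MII status and per-slave stats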

Ceph is quite a default installation with size=3.

My problem:
I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a test
virtual machine (the only one running in the cluster) with around 210 MB/s. I
get output drops on all switchports. The drop rate is between 0.1 - 0.9 %. The
drop rate of 0.9 % is reached when writing with about 1300MB/s into ceph.

First I thought about a problem with the Mellanox cards and used the Intel
cards for ceph traffic. The problem also exists.

I tried quite a lot and nothing helped:
 * changed the MTU from 9000 to 1500
 * changed bond_xmit_hash_policy from layer3+4 to layer2+3
 * deactivated the bond and just used a single link
 * disabled offloading
 * disabled power management in BIOS
 * perf-bias 0

I analyzed the traffic via tcpdump and got some of those "errors":
 * TCP Previous segment not captured
 * TCP Out-of-Order
 * TCP Retransmission
 * TCP Fast Retransmission
 * TCP Dup ACK
 * TCP ACKed unseen segment
 * TCP Window Update

Is that behaviour normal for ceph, or does anyone have ideas how to solve that
problem with the output drops at switch-side?

With iperf I can reach full 50 GBit/s on the bond with zero output drops.

Andreas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore "separate" WAL and DB

2017-09-08 Thread Richard Hesketh
Hi,

Reading the ceph-users list I'm obviously seeing a lot of people talking about 
using bluestore now that Luminous has been released. I note that many users 
seem to be under the impression that they need separate block devices for the 
bluestore data block, the DB, and the WAL... even when they are going to put 
the DB and the WAL on the same device!

As per the docs at 
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ this 
is nonsense:

> If there is only a small amount of fast storage available (e.g., less than a 
> gigabyte), we recommend using it as a WAL device. If there is more, 
> provisioning a DB
> device makes more sense. The BlueStore journal will always be placed on the 
> fastest device available, so using a DB device will provide the same benefit 
> that the WAL
> device would while also allowing additional metadata to be stored there (if 
> it will fix). [sic, I assume that should be "fit"]

I understand that if you've got three speeds of storage available, there may be 
some sense to dividing these. For instance, if you've got lots of HDD, a bit of 
SSD, and a tiny NVMe available in the same host, data on HDD, DB on SSD and WAL 
on NVMe may be a sensible division of data. That's not the case for most of the 
examples I'm reading; they're talking about putting DB and WAL on the same 
block device, but in different partitions. There's even one example of someone 
suggesting to try partitioning a single SSD to put data/DB/WAL all in separate 
partitions!

Are the docs wrong and/or I am missing something about optimal bluestore setup, 
or do people simply have the wrong end of the stick? I ask because I'm just 
going through switching all my OSDs over to Bluestore now and I've just been 
reusing the partitions I set up for journals on my SSDs as DB devices for 
Bluestore HDDs without specifying anything to do with the WAL, and I'd like to 
know sooner rather than later if I'm making some sort of horrible mistake.
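
For context, the kind of command I mean is along these lines (a ceph-disk
sketch based on my understanding of the Luminous flags; device paths are
placeholders):

# DB on the faster device, WAL left to live alongside the DB automatically
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sda5
# or, with three tiers of storage, WAL split out onto the fastest device
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sda5 --block.wal /dev/nvme0n1p2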

Rich
-- 
Richard Hesketh




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release cadence

2017-09-08 Thread Matthew Vernon
Hi,

On 06/09/17 16:23, Sage Weil wrote:

> Traditionally, we have done a major named "stable" release twice a year, 
> and every other such release has been an "LTS" release, with fixes 
> backported for 1-2 years.

We use the ceph version that comes with our distribution (Ubuntu LTS);
those come out every 2 years (though we won't move to a brand-new
distribution until we've done some testing!). So from my POV, LTS ceph
releases that come out such that adjacent ceph LTSs fit neatly into
adjacent Ubuntu LTSs is the ideal outcome. We're unlikely to ever try
putting a non-LTS ceph version into production.

I hope this isn't an unusual requirement :)

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-08 Thread Florian Haas
> In our use case, we are severly hampered by the size of removed_snaps
> (50k+) in the OSDMap to the point where ~80% of ALL cpu time is spent in
> PGPool::update and its interval calculation code. We have a cluster of
> around 100k RBDs with each RBD having upto 25 snapshots and only a small
> portion of our RBDs mapped at a time (~500-1000). For size / performance
> reasons we try to keep the number of snapshots low (<25) and need to
> prune snapshots. Since in our use case RBDs 'age' at different rates,
> snapshot pruning creates holes to the point where we the size of the
> removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph
> clusters. I think in general around 2 snapshot removal operations
> currently happen a minute just because of the volume of snapshots and
> users we have.

Right. Greg, this is what I was getting at: 25 snapshots per RBD is
firmly in "one snapshot per day per RBD" territory — this is something
that a cloud operator might do, for example, offering daily snapshots
going back one month. But it still wrecks the cluster simply by having
lots of images (even though only a fraction of them, less than 1%, are
ever in use). That's rather counter-intuitive, it doesn't hit you
until you have lots of images, and once you're affected by it there's
no practical way out — where "out" is defined as "restoring overall
cluster performance to something acceptable".
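
For anyone wanting to gauge how fragmented their own removed_snaps sets are,
something along these lines should give a rough idea (the output format is the
pre-Mimic osdmap dump, going from memory):

ceph osd dump | grep removed_snaps
# rough count of intervals per pool:
ceph osd dump | grep removed_snaps | awk -F',' '{print NF " intervals"}'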

> We found the PGPool::update and interval calculation code to be
> quite inefficient. Some small changes made it a lot faster, giving more
> breathing room; we shared these and most already got applied:
> https://github.com/ceph/ceph/pull/17088
> https://github.com/ceph/ceph/pull/17121
> https://github.com/ceph/ceph/pull/17239
> https://github.com/ceph/ceph/pull/17265
> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>
> However for our use case these patches helped, but overall CPU usage in
> this area is still high (>70% or so), making the Ceph cluster slow
> causing blocked requests and many operations (e.g. rbd map) to take a
> long time.

I think this makes this very much a practical issue, not a
hypothetical/theoretical one.

> We are trying to work around these issues by trying to change our
> snapshot strategy. In the short-term we are manually defragmenting the
> interval set by scanning for holes and trying to delete snapids in
> between holes to coalesce more holes. This is not so nice to do. In some
> cases we employ strategies to 'recreate' old snapshots (as we need to
> keep them) at higher snapids. For our use case a 'snapid rename' feature
> would have been quite helpful.
>
> I hope this shines some light on practical Ceph clusters in which
> performance is bottlenecked not by I/O but by snapshot removal.

For others following this thread or retrieving it from the list
archive some time down the road, I'd rephrase that as "bottlenecked
not by I/O but by CPU utilization associated with snapshot removal".
Is that fair to say, Patrick? Please correct me if I'm
misrepresenting.

Greg (or Josh/Jason/Sage/anyone really :) ), can you provide
additional insight as to how these issues can be worked around or
mitigated, besides the PRs that Patrick and his colleagues have
already sent?

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs(Kraken 11.2.1), Unable to write more file when one dir more than 100000 files, mds_bal_fragment_size_max = 5000000

2017-09-08 Thread Yan, Zheng

> On 8 Sep 2017, at 13:54, donglifec...@gmail.com wrote:
> 
> ZhengYan,
> 
> I'm sorry, here is a clearer description of my question.
> 
> when one dir has more than 100000 files, I can continue to write to it, but I
> can't find some files which were written in the past. For example:
> 1.  I write 100000 files named 512k.file$i
>
> 2. I continue to write 10000 files named aaa.file$i
> 
> 3. I continue to write 10000 files named bbb.file$i
> 
> 4.  I continue to write 10000 files named ccc.file$i
> 
> 5. I continue to write 10000 files named ddd.file$i
> 
> 6. I can't find all ddd.file$i; some ddd.file$i are missing. For example:
> 
> [root@yj43959-ceph-dev scripts]# find /mnt/cephfs/volumes -type f  |  grep 
> 512k.file | wc -l
> 100000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/aaa.file* | wc -l
> 10000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/bbb.file* | wc -l
> 10000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/ccc.file* | wc -l
> 10000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/ddd.file* | wc -l
> // some files missing
> 1072

It's likely caused by http://tracker.ceph.com/issues/18314. To support very
large directories, you should enable directory fragmentation instead of
enlarging mds_bal_fragment_size_max.
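
If I remember the flag name correctly, that is roughly (where "cephfs" is a
placeholder for your filesystem name; on Luminous this is enabled by default):

ceph fs set cephfs allow_dirfrags true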

Regards
Yan, Zheng

> 
> 
> 
> donglifec...@gmail.com
>  
> From: donglifec...@gmail.com
> Date: 2017-09-08 13:30
> To: zyan
> CC: ceph-users
> Subject: [ceph-users]cephfs(Kraken 11.2.1), Unable to write more file when 
> one dir more than 100000 files, mds_bal_fragment_size_max = 5000000
> ZhengYan,
> 
> I am testing cephfs (Kraken 11.2.1). I can't write more files when one dir has
> more than 100000 files, even though I have already set
> "mds_bal_fragment_size_max = 5000000".
> 
> Why is this the case? Is it a bug?
> 
> Thanks a lot.
> 
> donglifec...@gmail.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com