[ceph-users] Will you accept my invitation and join Ceph Berlin too?

2017-09-07 Thread Robert Sander

Ceph Berlin


Join Robert Sander and 406 other Cephalopods in Berlin. Stay up to date 
on new events in your area.

This is a group for anyone interested in Ceph. All skill levels are welcome. 
There is a growing user community around 
this Free Software distributed storage. Participants are exp...

--

Accept the invitation


--

---
This message was sent by Meetup on behalf of Robert Sander for Ceph Berlin.


Questions? Send us an e-mail at supp...@meetup.com

I no longer want to receive this kind of e-mail.

Meetup Inc. 
POB 4668 #37895 New York NY USA 10163
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCIe journal benefit for SSD OSDs

2017-09-07 Thread Marc Roos
 
Sorry to cut in your thread. 

> Have you disabled the FLUSH command for the Samsung ones?

We have a test cluster that currently only has a spinner pool, but we have 
SM863 drives available to create the SSD pool. Is there anything specific that 
needs to be done for the SM863?




-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag] 
Sent: donderdag 7 september 2017 8:04
To: Christian Balzer; ceph-users
Subject: Re: [ceph-users] PCIe journal benefit for SSD OSDs

Hello,
Am 07.09.2017 um 03:53 schrieb Christian Balzer:
> 
> Hello,
> 
> On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote:
> 
>> We are planning a Jewel filestore based cluster for a performance 
>> sensitive healthcare client, and the conservative OSD choice is 
>> Samsung SM863A.
>>
> 
> While I totally see where you're coming from and me having stated that 
> I'll give Luminous and Bluestore some time to mature, I'd also be 
> looking into that if I were being in the planning phase now, with like 
> 3 months before deployment.
> The inherent performance increase with Bluestore (and having something 
> that hopefully won't need touching/upgrading for a while) shouldn't be 
> ignored.

Yes, and that's the point where I am currently as well: thinking about how 
to design a new cluster based on Bluestore.

> The SSDs are fine, I've been starting to use those recently (though 
> not with Ceph yet) as Intel DC S36xx or 37xx are impossible to get.
> They're a bit slower in the write IOPS department, but good enough for me.

I've never used the Intel DC ones, only the Samsungs. Are the Intel drives 
really faster? Have you disabled the FLUSH command for the Samsung ones?
They don't skip the command automatically like the Intels do. Sadly the 
Samsung SM863 got more expensive over the last months. They were a lot 
cheaper in the first months of 2016. Maybe the 2.5" Intel Optane SSDs 
will change the game.

>> but was wondering if anyone has seen a positive impact from also 
>> using PCIe journals (e.g. Intel P3700 or even the older 910 series) 
>> in front of such SSDs?
>>
> NVMe journals (or WAL and DB space for Bluestore) are nice and can 
> certainly help, especially if Ceph is tuned accordingly.
> Avoid non DC NVMes, I doubt you can still get 910s, they are 
> officially EOL.
> You want to match capabilities and endurances, a DC P3700 800GB would 
> be an OK match for 3-4 SM863a 960GB for example.

That's a good point, but it makes the cluster more expensive. Currently, 
while using filestore, I use one SSD for journal and data, which works 
fine.

With Bluestore we have block, DB and WAL, so we need 3 block devices per 
OSD. If we need one PCIe or NVMe device per 3-4 devices, it gets much 
more expensive per host - we are currently running 10 OSDs / SSDs per node.

Have you already done tests on how the performance changes with Bluestore 
when putting all 3 block devices on the same SSD?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCIe journal benefit for SSD OSDs

2017-09-07 Thread Christian Balzer

Hello,

On Thu, 7 Sep 2017 08:03:31 +0200 Stefan Priebe - Profihost AG wrote:

> Hello,
> Am 07.09.2017 um 03:53 schrieb Christian Balzer:
> > 
> > Hello,
> > 
> > On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote:
> >   
> >> We are planning a Jewel filestore based cluster for a performance
> >> sensitive healthcare client, and the conservative OSD choice is
> >> Samsung SM863A.
> >>  
> > 
> > While I totally see where you're coming from and me having stated that
> > I'll give Luminous and Bluestore some time to mature, I'd also be looking
> > into that if I were being in the planning phase now, with like 3 months
> > before deployment.
> > The inherent performance increase with Bluestore (and having something
> > that hopefully won't need touching/upgrading for a while) shouldn't be
> > ignored.   
> 
> Yes and that's the point where i'm currently as well. Thinking about how
> to design a new cluster based on bluestore.
> 
> > The SSDs are fine, I've been starting to use those recently (though not
> > with Ceph yet) as Intel DC S36xx or 37xx are impossible to get.
> > They're a bit slower in the write IOPS department, but good enough for me.  
> 
> I've never used the Intel DC ones but always the Samsung are the Intel
> really faster? 
I don't have any configuration right now where I can directly compare them
(different HW, controllers, kernel and fio versions), but at least on
paper a 200GB DC S3700 (unobtainium) with 32K random 4k IOPS (confirmed in
tests) looks a lot better than the 240GB SM863A with 10K IOPS.

I dug into my archives, and for a wheezy (3.16 kernel) system I found
results that had the above DC S3700 at 32K IOPS as per the specs and
an 845DC EVO 960GB at 12K write IOPS, also as expected from the specs.

On newer HW with recent kernels (4.9) and fio (I suspect the latter)
things have changed to the point that the same fio command line as in the
old tests now gives me results of over 70K IOPS for both 400GB DC S3710s
and 960GB SM863As, both way higher than the specs.
It seems to be basically CPU/IRQ bound at that point, leading me to
believe that "--direct=1" no longer means the same thing.
After adding "--sync=1" to the fio command, things become more sane, but are
still odd and partially higher than expected.

Make of that what you will.
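
(For reference, a rough sketch of the kind of single-job sync-write test I
mean; the device name is only a placeholder and the run will overwrite
whatever is on it:

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test

Raising --numjobs and --iodepth then shows whether the drive or the CPU is
the limit.)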

> Have you disabled the FLUSH command for the Samsung ones?
How would one do that?
And since they have supposed full power loss protection, why wouldn't that
be the default?

> They don't skip the command automatically like the Intel do. Sadly the
> Samsung SM863 got more expensive over the last months. They were a lot
> cheaper  in the first month of 2016. May be the 2,5" optane intel ssds
> will change the game.
> 
The Optane offerings right now leave me rather unimpressed at 65K write
IOPS and 290MB/s write speed for their best (32GB) model. 
Not a fit for filestore journals given the write speed, and not so
much for the DB part of Bluestore either.

> >> but was wondering if anyone has seen a positive
> >> impact from also using PCIe journals (e.g. Intel P3700 or even the
> >> older 910 series) in front of such SSDs?
> >>  
> > NVMe journals (or WAL and DB space for Bluestore) are nice and can
> > certainly help, especially if Ceph is tuned accordingly.
> > Avoid non DC NVMes, I doubt you can still get 910s, they are officially
> > EOL.
> > You want to match capabilities and endurances, a DC P3700 800GB would be
> > an OK match for 3-4 SM863a 960GB for example.   
> 
> That's a good point but makes the cluster more expensive. Currently
> while using filestore i use one SSD for journal and data which works fine.
> 
Inline is fine if it fits your use case and the reduction in endurance is
also calculated in and/or compensated for.
I do the same with DC S3610s (very similar to the SM863As) on my
cache-tier nodes. 

> With bluestore we've block, db and wal so we need 3 block devices per
> OSD. If we need one PCIe or NVMe device per 3-4 devices it get's much
> more expensive per host - currently running 10 OSDs / SSDs per Node.
> 
Well, the OP was asking for performance, so price obviously goes up.
If you're running SSD OSDs you can put all 3 on the same device and should
be no worse off than before with filestore. 
Keep in mind that small writes also get "journaled" on the DB part, so
double writes and endurance may not improve depending on your write
patterns.
Something really fast for the WAL would likely help, but I have zero
experience and very few written reports here to base that on.
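
(If it helps, the provisioning difference is roughly this; an untested
sketch, device names are examples only:

  # everything (block, DB, WAL) on the one SSD:
  ceph-disk prepare --bluestore /dev/sdb

  # block on the SSD, DB and WAL carved out of a shared NVMe device:
  ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1

As far as I understand, ceph-disk sizes the DB/WAL partitions per the
bluestore_block_db_size / bluestore_block_wal_size settings in ceph.conf.)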

> Have you already done tests how the performance changes with bluestore
> while putting all 3 block devices on the same ssd?
> 
Nope, and given my test clusters, it's likely going to be a while before I
do anything with Bluestore on SSDs, never mind NVMes (of which I have none
as nothing we do requires them at this point). 

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
__

Re: [ceph-users] PCIe journal benefit for SSD OSDs

2017-09-07 Thread Stefan Priebe - Profihost AG
Am 07.09.2017 um 10:22 schrieb Marc Roos:
>  
> Sorry to cut in your thread. 
> 
>> Have you disabled the FLUSH command for the Samsung ones?
> 
> We have a test cluster that currently only has a spinner pool, but we have 
> SM863 drives available to create the SSD pool. Is there anything specific that 
> needs to be done for the SM863?

I've not tested how the SM863a behaves, but at least with the "older"
SV843 and SM863 you need to disable the FLUSH command for those SSDs.
This is safe because they have a working capacitor and can flush the
writes from their cache themselves.

You do this by writing the string "temporary write through" to
/sys/block/sdb/device/scsi_disk/*/cache_type
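
(Something along these lines, run as root; the device names are examples,
and as far as I know the setting does not survive a reboot, so you would
reapply it from rc.local or a udev rule:

  for dev in sdb sdc sdd; do
      echo "temporary write through" | tee /sys/block/$dev/device/scsi_disk/*/cache_type
  done

You can check the result with: cat /sys/block/sdb/device/scsi_disk/*/cache_type)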

Greets,
Stefan

> 
> -Original Message-
> From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag] 
> Sent: donderdag 7 september 2017 8:04
> To: Christian Balzer; ceph-users
> Subject: Re: [ceph-users] PCIe journal benefit for SSD OSDs
> 
> Hello,
> Am 07.09.2017 um 03:53 schrieb Christian Balzer:
>>
>> Hello,
>>
>> On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote:
>>
>>> We are planning a Jewel filestore based cluster for a performance 
>>> sensitive healthcare client, and the conservative OSD choice is 
>>> Samsung SM863A.
>>>
>>
>> While I totally see where you're coming from and me having stated that 
>> I'll give Luminous and Bluestore some time to mature, I'd also be 
>> looking into that if I were being in the planning phase now, with like 
>> 3 months before deployment.
>> The inherent performance increase with Bluestore (and having something 
>> that hopefully won't need touching/upgrading for a while) shouldn't be 
>> ignored.
> 
> Yes and that's the point where i'm currently as well. Thinking about how 
> to design a new cluster based on bluestore.
> 
>> The SSDs are fine, I've been starting to use those recently (though 
>> not with Ceph yet) as Intel DC S36xx or 37xx are impossible to get.
>> They're a bit slower in the write IOPS department, but good enough for me.
> 
> I've never used the Intel DC ones but always the Samsung are the Intel 
> really faster? Have you disabled the FLUSH command for the Samsung ones?
> They don't skip the command automatically like the Intel do. Sadly the 
> Samsung SM863 got more expensive over the last months. They were a lot 
> cheaper  in the first month of 2016. May be the 2,5" optane intel ssds 
> will change the game.
> 
>>> but was wondering if anyone has seen a positive impact from also 
>>> using PCIe journals (e.g. Intel P3700 or even the older 910 series) 
>>> in front of such SSDs?
>>>
>> NVMe journals (or WAL and DB space for Bluestore) are nice and can 
>> certainly help, especially if Ceph is tuned accordingly.
>> Avoid non DC NVMes, I doubt you can still get 910s, they are 
>> officially EOL.
>> You want to match capabilities and endurances, a DC P3700 800GB would 
>> be an OK match for 3-4 SM863a 960GB for example.
> 
> That's a good point but makes the cluster more expensive. Currently 
> while using filestore i use one SSD for journal and data which works 
> fine.
> 
> With bluestore we've block, db and wal so we need 3 block devices per 
> OSD. If we need one PCIe or NVMe device per 3-4 devices it get's much 
> more expensive per host - currently running 10 OSDs / SSDs per Node.
> 
> Have you already done tests how the performance changes with bluestore 
> while putting all 3 block devices on the same ssd?
> 
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw-admin orphans find -- Hammer

2017-09-07 Thread Daniel Schneller
Hello,

we need to reclaim a lot of wasted space by RGW orphans in our production 
Hammer cluster (0.94.10 on Ubuntu 14.04).

According to http://tracker.ceph.com/issues/18258 there is a bug in the 
radosgw-admin orphans find command that causes it to get stuck in an 
infinite loop.

From the bug report I cannot tell if there are unusual circumstances that need 
to be present to trigger the infinite-loop condition, or if I am more or less 
guaranteed to hit the issue.
The bug has been fixed, but not in Hammer.

Any chance of getting it backported into Hammer? 
Is the fix in the radosgw-admin tool itself, or are there more/other components 
that would have to be touched?

As the cluster has about 200 million objects, I would rather not just “try my 
luck” and get stuck in the middle.
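
(For context, the invocation in question would be something like the
following; pool name and job id are placeholders:

  radosgw-admin orphans find --pool=<rgw data pool> --job-id=orphans-scan-1
  radosgw-admin orphans finish --job-id=orphans-scan-1

and that find step is exactly what I am hesitant to start on a cluster
this size.)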

Any insight on this would be appreciated.

Thanks a lot,
Daniel

-- 
Daniel Schneller
Principal Cloud Engineer
 
CenterDevice GmbH  | Hochstraße 11
   | 42697 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.de   | www.centerdevice.de

Geschäftsführung: Dr. Patrick Peschlow, Dr. Lukas Pustina,
Michael Rosbach, Handelsregister-Nr.: HRB 18655,
HR-Gericht: Bonn, USt-IdNr.: DE-815299431


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCIe journal benefit for SSD OSDs

2017-09-07 Thread Stefan Priebe - Profihost AG
Am 07.09.2017 um 10:44 schrieb Christian Balzer:
> 
> Hello,
> 
> On Thu, 7 Sep 2017 08:03:31 +0200 Stefan Priebe - Profihost AG wrote:
> 
>> Hello,
>> Am 07.09.2017 um 03:53 schrieb Christian Balzer:
>>>
>>> Hello,
>>>
>>> On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote:
>>>   
 We are planning a Jewel filestore based cluster for a performance
 sensitive healthcare client, and the conservative OSD choice is
 Samsung SM863A.
  
>>>
>>> While I totally see where you're coming from and me having stated that
>>> I'll give Luminous and Bluestore some time to mature, I'd also be looking
>>> into that if I were being in the planning phase now, with like 3 months
>>> before deployment.
>>> The inherent performance increase with Bluestore (and having something
>>> that hopefully won't need touching/upgrading for a while) shouldn't be
>>> ignored.   
>>
>> Yes and that's the point where i'm currently as well. Thinking about how
>> to design a new cluster based on bluestore.
>>
>>> The SSDs are fine, I've been starting to use those recently (though not
>>> with Ceph yet) as Intel DC S36xx or 37xx are impossible to get.
>>> They're a bit slower in the write IOPS department, but good enough for me.  
>>
>> I've never used the Intel DC ones but always the Samsung are the Intel
>> really faster? 
> I don't have any configuration right now where to directly compare them
> (different HW, controllers, kernel and fio versions), but at least on
> paper a 200GB DC S3700 (unobtanium) with 32K random 4k IOPS (confirmed in
> tests) looks a lot better than the 240GB SM863A with 10K IOPS.
> 
> I dug into my archives and for a wheezy (3.16 kernel) system I found
> results that had the about DC S3700 with 32K IOPS as per the specs and
> for a 845DC EVO 960GB 12K write IOPS, also as expected from the specs.
> 
> On newer HW with recent kernels (4.9) and fio (I suspect the later)
> things have changed to the point that the same fio command line as in the
> old tests now gives me results of over 70K IOPS for both 400GB DC S3710s
> and 960GB SM863A, both way higher than the specs.
> It seems to be basically CPU/IRQ bound at that point, leading me to
> believe that "--direct=1" no longer means the same thing.
> Adding "--sync=1" to the fio command things become more sane, but are still
> odd and partially higher than expected.

OK, but the 845DC EVO is pretty old and also TLC, not MLC; I don't
think you can compare them.


>> Have you disabled the FLUSH command for the Samsung ones?
> How would one do that?
> And since they have supposed full power loss protection, why wouldn't that
> be the default?

Intel has this as a default but Samsung does not. I think it's a
different philosophy of handling it.

Intel says: hey, we have a capacitor, just ignore the flush command.
Samsung says: hey, we got a flush command, do what the user wants and
flush all cache.

>> They don't skip the command automatically like the Intel do. Sadly the
>> Samsung SM863 got more expensive over the last months. They were a lot
>> cheaper  in the first month of 2016. May be the 2,5" optane intel ssds
>> will change the game.
>>
> The Optane offerings right now leave me rather unimpressed at 65K write
> IOPS and 290MB write speed for their best (32GB) model. 
> Not a fit for filestore journals given the write speed and not so
> much for the DB part of Bluestore either.

Maybe the SSD DC S4600 will be interesting; I'm always using the 2TB models.
I don't know the pricing yet.

 but was wondering if anyone has seen a positive
 impact from also using PCIe journals (e.g. Intel P3700 or even the
 older 910 series) in front of such SSDs?
  
>>> NVMe journals (or WAL and DB space for Bluestore) are nice and can
>>> certainly help, especially if Ceph is tuned accordingly.
>>> Avoid non DC NVMes, I doubt you can still get 910s, they are officially
>>> EOL.
>>> You want to match capabilities and endurances, a DC P3700 800GB would be
>>> an OK match for 3-4 SM863a 960GB for example.   
>>
>> That's a good point but makes the cluster more expensive. Currently
>> while using filestore i use one SSD for journal and data which works fine.
>>
> Inline is fine if it fits your use case and the reduction in endurance is
> also calculated in and/or compensated for.
> I do the same with DC S3610s (very similar to the SM863As) on my
> cache-tier nodes. 

Currently I'm always observing a much higher lifetime with Ceph than the
manufacturers tell me. The wear-out indicator or lifetime-remaining SMART
values are much higher than expected.

>> With bluestore we've block, db and wal so we need 3 block devices per
>> OSD. If we need one PCIe or NVMe device per 3-4 devices it get's much
>> more expensive per host - currently running 10 OSDs / SSDs per Node.
>>
> Well, the OP was asking for performance, so price obviously goes up.
> If you're running SSD OSDs you can put all 3 on the same device and should
> be no worse off than before with filestore.
> Keep in mind t

[ceph-users] Separate WAL and DB Partitions for existing OSDs ?

2017-09-07 Thread Christoph Adomeit
Hi there,

is it possible to move WAL and DB Data for Existing bluestore OSDs to separate 
partitions ? 

I am looking for a method to maybe take an OSD out, do some magic and move some 
data to new SSD Devices and then take the OSD back in.

Any Ideas ?

Thanks
  Christoph


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr unknown version

2017-09-07 Thread John Spray
On Wed, Sep 6, 2017 at 4:31 PM, Piotr Dzionek  wrote:
> Hi,
> I ran a small test two node ceph cluster - 12.2.0 version. It has 28 osds, 1
> mon and 2 mgr. It runs fine, however I noticed this strange thing in output
> of ceph versions command:
>
> # ceph versions
> {
> "mon": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 1
> },
> "mgr": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 1,
> "unknown": 1
> },
> "osd": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 28
> },
> "mds": {},
> "overall": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 30,
> "unknown": 1
> }
> }
>
> As you can see one ceph manager is in unknown state. Why is that ? FYI, I
> checked rpms versions and did a restart of all mgr and I still get the same
> result.

Thanks for finding this bug -- I can reproduce locally after failing
one of the mgrs (on first startup it's all populated).  Fix
incoming...

John

>
> Kind regards,
> Piotr Dzionek
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr unknown version

2017-09-07 Thread John Spray
On Wed, Sep 6, 2017 at 4:47 PM, Piotr Dzionek  wrote:
> Oh, I see that this is probably a bug: http://tracker.ceph.com/issues/21260
>
> I also noticed following error in mgr logs:
>
> 2017-09-06 16:41:08.537577 7f34c0a7a700  1 mgr send_beacon active
> 2017-09-06 16:41:08.539161 7f34c0a7a700  1 mgr[restful] Unknown request ''
> 2017-09-06 16:41:08.543830 7f34a77de700  0 mgr[restful] Traceback (most
> recent call last):
>   File "/usr/lib64/ceph/mgr/restful/module.py", line 248, in serve
> self._serve()
>   File "/usr/lib64/ceph/mgr/restful/module.py", line 299, in _serve
> raise RuntimeError('no certificate configured')
> RuntimeError: no certificate configured
>
> Probably not related, but what kind of certificate it might refer to ?
>

That's the `restful` mgr module that doesn't come up cleanly until it
has an SSL certificate (http://docs.ceph.com/docs/master/mgr/restful/)

I have a patch that cleans up the scary looking message, I'll pick it
out into a separate PR to get it into 12.2.1
(http://tracker.ceph.com/issues/21292)
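
For the certificate itself, something along these lines should do it (see
the restful docs linked above for the authoritative steps; the key name is
just an example):

  ceph restful create-self-signed-cert
  ceph restful create-key demo-admin

After that the module should come up cleanly on its own.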

John

>
> W dniu 06.09.2017 o 16:31, Piotr Dzionek pisze:
>
> Hi,
> I ran a small test two node ceph cluster - 12.2.0 version. It has 28 osds, 1
> mon and 2 mgr. It runs fine, however I noticed this strange thing in output
> of ceph versions command:
>
> # ceph versions
> {
> "mon": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 1
> },
> "mgr": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 1,
> "unknown": 1
> },
> "osd": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 28
> },
> "mds": {},
> "overall": {
> "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> luminous (rc)": 30,
> "unknown": 1
> }
> }
>
> As you can see one ceph manager is in unknown state. Why is that ? FYI, I
> checked rpms versions and did a restart of all mgr and I still get the same
> result.
>
> Kind regards,
> Piotr Dzionek
>
>
> --
> Piotr Dzionek
> System Administrator
>
> SEQR Poland Sp. z o.o.
> ul. Łąkowa 29, 90-554 Łódź, Poland
> Mobile: +48 79687
> Mail: piotr.dzio...@seqr.com
> www.seqr.com | www.seamless.se
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] Ceph release cadence

2017-09-07 Thread Lars Marowsky-Bree
On 2017-09-06T15:23:34, Sage Weil  wrote:

Hi Sage,

thanks for kicking off this discussion - after the L experience, it was
on my hot list to talk about too.

I do agree that we need predictable releases more than feature-rich
releases. Distributors like to plan, but that's not a reason. However,
we like to plan because *users* like to plan their schedules and
upgrades, and I think that matters more.

> - Not a lot of people seem to run the "odd" releases (e.g., infernalis, 
> kraken).  This limits the value of actually making them.  It also means 
> that those who *do* run them are running riskier code (fewer users -> more 
> bugs).

Yes. Odd releases never really make it to user systems. They're on the
previous LTS release. In the devel releases, the code is often too
unstable, and developers seem to cram everything in. Basically, the odd
releases are long periods working up to the next stable release.

(And they get all the cool names, which I find personally sad. I want my
users to run Infernalis, Kraken, and Mimic. ;-)

> - The more recent requirement that upgrading clusters must make a stop at 
> each LTS (e.g., hammer -> luminous not supported, must go hammer -> jewel 
> -> lumninous) has been hugely helpful on the development side by reducing 
> the amount of cross-version compatibility code to maintain and reducing 
> the number of upgrade combinations to test.

On this, I feel that it might make more sense to phrase this so that
such cross version compatibility is not tied to major releases (which
doesn't really help them plan lifecycles if those releases aren't
reliable), but to time periods.

> - When we try to do a time-based "train" release cadence, there always 
> seems to be some "must-have" thing that delays the release a bit.  This 
> doesn't happen as much with the odd releases, but it definitely happens 
> with the LTS releases.  When the next LTS is a year away, it is hard to 
> suck it up and wait that long.

Yes, I can see that. This is clearly something we'd want to avoid.

> A couple of options:
> 
> * Keep even/odd pattern, and continue being flexible with release dates

I admit I'm not a fan of this one.

> * Drop the odd releases but change nothing else (i.e., 12-month release 
> cadence)
>   + eliminate the confusing odd releases with dubious value

Periods too long for regular users. Admittedly, I suspect for RH and
SUSE with RHCS or SES respectively, this doesn't matter much - but it's
not good for the community as a whole. Also, this means not enough
community / end-user testing will happen for 11 out of those 12 months,
implying such long cycles make it hard to release n+1.0 in high
quality.

I've been doing software development for almost two decades, and no user
really touches betas before one calls it an RC, and even then ...

> * Drop the odd releases, and aim for a ~9 month cadence. This splits the 
> difference between the current even/odd pattern we've been doing.

It's a step up, but the period is still both too long, and unaligned.
This makes lifecycle management for everyone annoying.

> * Drop the odd releases, but relax the "must upgrade through every LTS" to 
> allow upgrades across 2 versions (e.g., luminous -> mimic or luminous -> 
> nautilus).  Shorten release cycle (~6-9 months).
> 
>   + more flexibility for users
>   + downstreams have greater choice in adopting an upstream release
>   - more LTS branches to maintain
>   - more upgrade paths to consider

From the list of options you provide, I like this one the best; the ~6
month release cycle means there should be one about once per year as
well, which makes cycling easier to plan.

> Other options we should consider?  Other thoughts?

With about 20-odd years in software development, I've become a big
believer in schedule-driven releases. If it's feature-based, you never
know when they'll get done.

If the schedule intervals are too long though, the urge to press too
much in (so as not to miss the next merge window) is just too high,
meaning the train gets derailed. (Which cascades into the future,
because the next time the pressure will be even higher based on the
previous experience.) This requires strictness.

We've had a few Linux kernel releases that were effectively feature
driven and never quite made it. 1.3.x? 1.5.x? My memory is bad, but they
were a disaster that eventually led Linus to evolve to the current
model.

That serves them really well, and I believe it might be worth
considering for us.

I'd try to move away from the major milestones. Features get integrated
into the next schedule-driven release when they deemed ready and stable;
when they're not, not a big deal, the next one is coming up "soonish".

(This effectively decouples feature development slightly from the
release schedule.)

We could even go for "a release every 3 months, sharp", merge window for
the first month, stabilization the second, release clean up the third,
ship.

Interoperability hacks for the cluster/server side 

Re: [ceph-users] ceph mgr unknown version

2017-09-07 Thread Piotr Dzionek

Thanks for the explanation.

W dniu 07.09.2017 o 12:06, John Spray pisze:

On Wed, Sep 6, 2017 at 4:47 PM, Piotr Dzionek  wrote:

Oh, I see that this is probably a bug: http://tracker.ceph.com/issues/21260

I also noticed following error in mgr logs:

2017-09-06 16:41:08.537577 7f34c0a7a700  1 mgr send_beacon active
2017-09-06 16:41:08.539161 7f34c0a7a700  1 mgr[restful] Unknown request ''
2017-09-06 16:41:08.543830 7f34a77de700  0 mgr[restful] Traceback (most
recent call last):
   File "/usr/lib64/ceph/mgr/restful/module.py", line 248, in serve
 self._serve()
   File "/usr/lib64/ceph/mgr/restful/module.py", line 299, in _serve
 raise RuntimeError('no certificate configured')
RuntimeError: no certificate configured

Probably not related, but what kind of certificate it might refer to ?


That's the `restful` mgr module that doesn't come up cleanly until it
has an SSL certificate (http://docs.ceph.com/docs/master/mgr/restful/)

I have a patch that cleans up the scary looking message, I'll pick it
out into a separate PR to get it into 12.2.1
(http://tracker.ceph.com/issues/21292)

John


W dniu 06.09.2017 o 16:31, Piotr Dzionek pisze:

Hi,
I ran a small test two node ceph cluster - 12.2.0 version. It has 28 osds, 1
mon and 2 mgr. It runs fine, however I noticed this strange thing in output
of ceph versions command:

# ceph versions
{
 "mon": {
 "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
luminous (rc)": 1
 },
 "mgr": {
 "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
luminous (rc)": 1,
 "unknown": 1
 },
 "osd": {
 "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
luminous (rc)": 28
 },
 "mds": {},
 "overall": {
 "ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
luminous (rc)": 30,
 "unknown": 1
 }
}

As you can see one ceph manager is in unknown state. Why is that ? FYI, I
checked rpms versions and did a restart of all mgr and I still get the same
result.

Kind regards,
Piotr Dzionek


--
Piotr Dzionek
System Administrator

SEQR Poland Sp. z o.o.
ul. Łąkowa 29, 90-554 Łódź, Poland
Mobile: +48 79687
Mail: piotr.dzio...@seqr.com
www.seqr.com | www.seamless.se


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Blocked requests

2017-09-07 Thread Matthew Stroud
After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests for 
‘currently waiting for missing object’. I have tried bouncing the osds and 
rebooting the osd nodes, but that just moves the problems around. Previous to 
this upgrade we had no issues. Any ideas of what to look at?

Thanks,
Matthew Stroud



CONFIDENTIALITY NOTICE: This message is intended only for the use and review of 
the individual or entity to which it is addressed and may contain information 
that is privileged and confidential. If the reader of this message is not the 
intended recipient, or the employee or agent responsible for delivering the 
message solely to the intended recipient, you are hereby notified that any 
dissemination, distribution or copying of this communication is strictly 
prohibited. If you have received this communication in error, please notify 
sender immediately by telephone or return email. Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Separate WAL and DB Partitions for existing OSDs ?

2017-09-07 Thread Christoph Adomeit
To be more precise, what I want to know is:


I have a lot of bluestore osds and now I want to add separate wal and db on new 
nvme partitions.

Would it be enough to just generate empty partitions with parted and make 
symlinks on the osd partition like this:

$ sudo ln -sf /dev/disk/by-partlabel/osd-device-0-db 
/var/lib/ceph/osd/ceph-0/block.db
$ sudo ln -sf /dev/disk/by-partlabel/osd-device-0-wal 
/var/lib/ceph/osd/ceph-0/block.wal

Shall I use special partition ids or flags for db and wal? And how big should 
I make db and wal partitions?


Thanks


Christoph 



On Thu, Sep 07, 2017 at 09:57:16AM +0200, Christoph Adomeit wrote:
> Hi there,
> 
> is it possible to move WAL and DB Data for Existing bluestore OSDs to 
> separate partitions ? 
> 
> I am looking for a method to maybe take an OSD out, do some magic and move 
> some data to new SSD Devices and then take the OSD back in.
> 
> Any Ideas ?
> 
> Thanks
>   Christoph
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Es gibt keine  Cloud, es gibt nur die Computer anderer Leute
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

christoph.adom...@gatworks.de Internetloesungen vom Feinsten
Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
Ok, I've been testing, investigating, researching, etc for the last week
and I don't have any problems with data syncing.  The clients on one side
are creating multipart objects while the multisite sync is creating them as
whole objects and one of the datacenters is slower at cleaning up the
shadow files.  That's the big discrepancy between object counts in the
pools between datacenters.  I created a tool that goes through for each
bucket in a realm and does a recursive listing of all objects in it for
both datacenters and compares the 2 lists for any differences.  The data is
definitely in sync between the 2 datacenters down to the modified time and
byte of each file in s3.

The metadata is still not syncing for the other realm, though.  If I run
`metadata sync init` then the second datacenter will catch up with all of
the new users, but until I do that newly created users on the primary side
don't exist on the secondary side.  `metadata sync status`, `sync status`,
`metadata sync run` (only left running for 30 minutes before I ctrl+c it),
etc don't show any problems... but the new users just don't exist on the
secondary side until I run `metadata sync init`.  I created a new bucket
with the new user and the bucket shows up in the second datacenter, but no
objects because the objects don't have a valid owner.
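
(For reference, the sequence I run against the secondary zone is roughly the
following; substitute your own realm/zonegroup/zone names:

  radosgw-admin metadata sync status --rgw-realm=<realm> --rgw-zone=<secondary zone>
  radosgw-admin metadata sync init   --rgw-realm=<realm> --rgw-zone=<secondary zone>
  # restart the RGW daemons in the secondary datacenter, then watch it catch up:
  radosgw-admin metadata sync status --rgw-realm=<realm> --rgw-zone=<secondary zone>

That catches the users up once, but users created afterwards still never
appear until I repeat it.)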

Thank you all for the help with the data sync issue.  You pushed me into
good directions.  Does anyone have any insight as to what is preventing the
metadata from syncing in the other realm?  I have 2 realms being sync using
multi-site and it's only 1 of them that isn't getting the metadata across.
As far as I can tell it is configured identically.

On Thu, Aug 31, 2017 at 12:46 PM David Turner  wrote:

> All of the messages from sync error list are listed below.  The number on
> the left is how many times the error message is found.
>
>1811 "message": "failed to sync bucket instance:
> (16) Device or resource busy"
>   7 "message": "failed to sync bucket instance:
> (5) Input\/output error"
>  65 "message": "failed to sync object"
>
> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
> wrote:
>
>>
>> Hi David,
>>
>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
>> wrote:
>>
>>> The vast majority of the sync error list is "failed to sync bucket
>>> instance: (16) Device or resource busy".  I can't find anything on Google
>>> about this error message in relation to Ceph.  Does anyone have any idea
>>> what this means? and/or how to fix it?
>>>
>>
>> Those are intermediate errors resulting from several radosgw trying to
>> acquire the same sync log shard lease. It doesn't affect the sync progress.
>> Are there any other errors?
>>
>> Orit
>>
>>>
>>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley  wrote:
>>>
 Hi David,

 The 'data sync init' command won't touch any actual object data, no.
 Resetting the data sync status will just cause a zone to restart a full
 sync of the --source-zone's data changes log. This log only lists which
 buckets/shards have changes in them, which causes radosgw to consider them
 for bucket sync. So while the command may silence the warnings about data
 shards being behind, it's unlikely to resolve the issue with missing
 objects in those buckets.

 When data sync is behind for an extended period of time, it's usually
 because it's stuck retrying previous bucket sync failures. The 'sync error
 list' may help narrow down where those failures are.

 There is also a 'bucket sync init' command to clear the bucket sync
 status. Following that with a 'bucket sync run' should restart a full sync
 on the bucket, pulling in any new objects that are present on the
 source-zone. I'm afraid that those commands haven't seen a lot of polish or
 testing, however.

 Casey

 On 08/24/2017 04:15 PM, David Turner wrote:

 Apparently the data shards that are behind go in both directions, but
 only one zone is aware of the problem.  Each cluster has objects in their
 data pool that the other doesn't have.  I'm thinking about initiating a
 `data sync init` on both sides (one at a time) to get them back on the same
 page.  Does anyone know if that command will overwrite any local data that
 the zone has that the other doesn't if you run `data sync init` on it?

 On Thu, Aug 24, 2017 at 1:51 PM David Turner 
 wrote:

> After restarting the 2 RGW daemons on the second site again,
> everything caught up on the metadata sync.  Is there something about 
> having
> 2 RGW daemons on each side of the multisite that might be causing an issue
> with the sync getting stale?  I have another realm set up the same way 
> that
> is having a hard time with its data shards being behind.  I haven't told
> them to resync, but yesterday I noticed 90 shards were behind.  It's 
> c

Re: [ceph-users] Client features by IP?

2017-09-07 Thread Josh Durgin

On 09/06/2017 04:36 PM, Bryan Stillwell wrote:

I was reading this post by Josh Durgin today and was pretty happy to see we can 
get a summary of features that clients are using with the 'ceph features' 
command:

http://ceph.com/community/new-luminous-upgrade-complete/

However, I haven't found an option to display the IP address of those clients 
with the older feature sets.  Is there a flag I can pass to 'ceph features' to 
list the IPs associated with each feature set?


There is not currently; we should add that - it'll be easy to backport
to luminous too. The only place both features and IP are shown is in
'debug mon = 10' logs right now.
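
If you need it in the meantime, something like this should work (mon id and
log path are examples, and remember to turn the verbosity back down):

  ceph tell mon.a injectargs '--debug-mon 10/10'
  grep -i features /var/log/ceph/ceph-mon.a.log   # client addrs and feature bits show up here
  ceph tell mon.a injectargs '--debug-mon 1/5'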

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Separate WAL and DB Partitions for existing OSDs ?

2017-09-07 Thread David Turner
On filestore you would flush the journal and then, after mapping the new
journal device, use the command to create the journal.  I'm sure there's
something similar for bluestore, but I don't have any experience with it
yet.  Is there a new command similar to flush and create for the WAL and DB?
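
For filestore the procedure I mean looks roughly like this (untested sketch,
OSD id and partition are examples only):

  systemctl stop ceph-osd@0
  ceph-osd -i 0 --flush-journal
  ln -sf /dev/disk/by-partuuid/<new-journal-partuuid> /var/lib/ceph/osd/ceph-0/journal
  ceph-osd -i 0 --mkjournal
  systemctl start ceph-osd@0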

On Thu, Sep 7, 2017 at 12:03 PM Christoph Adomeit <
christoph.adom...@gatworks.de> wrote:

> To be mor eprecise, what I want to know is:
>
>
> I have a lot of bluestore osds and now I want to add separate wal and db
> on new nvme partitions.
>
> Would it be enough to just generate empty partitions with parted and make
> symlinks on the osd partition like this:
>
> $ sudo ln -sf /dev/disk/by-partlabel/osd-device-0-db
> /var/lib/ceph/osd/ceph-0/block.db
> $ sudo ln -sf /dev/disk/by-partlabel/osd-device-0-wal
> /var/lib/ceph/osd/ceph-0/block.wal
>
> Shall I use special partition ids or flags for db and wal ? And how big
> should I make db and wal partitions ?
>
>
> Thanks
>
>
> Christoph
>
>
>
> On Thu, Sep 07, 2017 at 09:57:16AM +0200, Christoph Adomeit wrote:
> > Hi there,
> >
> > is it possible to move WAL and DB Data for Existing bluestore OSDs to
> separate partitions ?
> >
> > I am looking for a method to maybe take an OSD out, do some magic and
> move some data to new SSD Devices and then take the OSD back in.
> >
> > Any Ideas ?
> >
> > Thanks
> >   Christoph
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Es gibt keine  Cloud, es gibt nur die Computer anderer Leute
> Christoph Adomeit
> GATWORKS GmbH
> Reststrauch 191
> 41199 Moenchengladbach
> Sitz: Moenchengladbach
> Amtsgericht Moenchengladbach, HRB 6303
> Geschaeftsfuehrer:
> Christoph Adomeit, Hans Wilhelm Terstappen
>
> christoph.adom...@gatworks.de Internetloesungen vom Feinsten
> Fon. +49 2166 9149-32 <+49%202166%20914932>  Fax. +49
> 2166 9149-10 <+49%202166%20914910>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests

2017-09-07 Thread David Turner
`ceph health detail` will give a little more information into the blocked
requests.  Specifically which OSDs are the requests blocked on and how long
have they actually been blocked (as opposed to '> 32 sec').  I usually find
a pattern after watching that for a time and narrow things down to an OSD,
journal, etc.  Some times I just need to restart a specific OSD and all is
well.
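
Roughly what I look at (osd.5 below is just an example id; run the daemon
commands on the host that owns that OSD):

  ceph health detail | grep 'ops are blocked'
  ceph daemon osd.5 dump_ops_in_flight     # what the blocked requests are waiting on
  ceph daemon osd.5 dump_historic_ops      # recently completed slow ops with timings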

On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud 
wrote:

> After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests
> for ‘currently waiting for missing object’. I have tried bouncing the osds
> and rebooting the osd nodes, but that just moves the problems around.
> Previous to this upgrade we had no issues. Any ideas of what to look at?
>
>
>
> Thanks,
>
> Matthew Stroud
>
> --
>
> CONFIDENTIALITY NOTICE: This message is intended only for the use and
> review of the individual or entity to which it is addressed and may contain
> information that is privileged and confidential. If the reader of this
> message is not the intended recipient, or the employee or agent responsible
> for delivering the message solely to the intended recipient, you are hereby
> notified that any dissemination, distribution or copying of this
> communication is strictly prohibited. If you have received this
> communication in error, please notify sender immediately by telephone or
> return email. Thank you.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests

2017-09-07 Thread David Turner
To be fair, other times I have to go in and tweak configuration settings
and timings to resolve chronic blocked requests.

On Thu, Sep 7, 2017 at 1:32 PM David Turner  wrote:

> `ceph health detail` will give a little more information into the blocked
> requests.  Specifically which OSDs are the requests blocked on and how long
> have they actually been blocked (as opposed to '> 32 sec').  I usually find
> a pattern after watching that for a time and narrow things down to an OSD,
> journal, etc.  Some times I just need to restart a specific OSD and all is
> well.
>
> On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud 
> wrote:
>
>> After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests
>> for ‘currently waiting for missing object’. I have tried bouncing the osds
>> and rebooting the osd nodes, but that just moves the problems around.
>> Previous to this upgrade we had no issues. Any ideas of what to look at?
>>
>>
>>
>> Thanks,
>>
>> Matthew Stroud
>>
>> --
>>
>> CONFIDENTIALITY NOTICE: This message is intended only for the use and
>> review of the individual or entity to which it is addressed and may contain
>> information that is privileged and confidential. If the reader of this
>> message is not the intended recipient, or the employee or agent responsible
>> for delivering the message solely to the intended recipient, you are hereby
>> notified that any dissemination, distribution or copying of this
>> communication is strictly prohibited. If you have received this
>> communication in error, please notify sender immediately by telephone or
>> return email. Thank you.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests

2017-09-07 Thread Matthew Stroud
Well, in the meantime things have gone from bad to worse: now the cluster isn’t 
rebuilding and clients are unable to pass IO to the cluster. When this first 
took place, we started rolling back to 10.2.7; though that was successful, it 
didn’t help with the issue. Here is the command output:

HEALTH_WARN 39 pgs backfill_wait; 5 pgs backfilling; 43 pgs degraded; 43 pgs 
stuck degraded; 44 pgs stuck unclean; 43 pgs stuck undersized; 43 pgs 
undersized; 367 requests are blocked > 32 sec; 14 osds have slow requests; 
recovery 4678/1097738 objects degraded (0.426%); recovery 10364/1097738 objects 
misplaced (0.944%)
pg 3.624 is stuck unclean for 1402.022837, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [12,9]
pg 3.587 is stuck unclean for 2536.693566, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [18,13]
pg 3.45f is stuck unclean for 1421.178244, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [14,10]
pg 3.41a is stuck unclean for 1505.091187, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [9,23]
pg 3.4cc is stuck unclean for 1560.824332, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [18,10]
< snip>
pg 3.188 is stuck degraded for 1207.118130, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [14,17]
pg 3.768 is stuck degraded for 1123.722910, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [11,18]
pg 3.77c is stuck degraded for 1211.981606, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [9,2]
pg 3.7d1 is stuck degraded for 1074.422756, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [10,12]
pg 3.7d1 is active+undersized+degraded+remapped+wait_backfill, acting [10,12]
pg 3.77c is active+undersized+degraded+remapped+wait_backfill, acting [9,2]
pg 3.768 is active+undersized+degraded+remapped+wait_backfill, acting [11,18]
pg 3.709 is active+undersized+degraded+remapped+wait_backfill, acting [10,4]

pg 3.5d8 is active+undersized+degraded+remapped+wait_backfill, acting [2,10]
pg 3.5dc is active+undersized+degraded+remapped+wait_backfill, acting [8,19]
pg 3.5f8 is active+undersized+degraded+remapped+wait_backfill, acting [2,21]
pg 3.624 is active+undersized+degraded+remapped+wait_backfill, acting [12,9]
2 ops are blocked > 1048.58 sec on osd.9
3 ops are blocked > 65.536 sec on osd.9
7 ops are blocked > 1048.58 sec on osd.8
1 ops are blocked > 524.288 sec on osd.8
1 ops are blocked > 131.072 sec on osd.8

1 ops are blocked > 524.288 sec on osd.2
1 ops are blocked > 262.144 sec on osd.2
2 ops are blocked > 65.536 sec on osd.21
9 ops are blocked > 1048.58 sec on osd.5
9 ops are blocked > 524.288 sec on osd.5
71 ops are blocked > 131.072 sec on osd.5
19 ops are blocked > 65.536 sec on osd.5
35 ops are blocked > 32.768 sec on osd.5
14 osds have slow requests
recovery 4678/1097738 objects degraded (0.426%)
recovery 10364/1097738 objects misplaced (0.944%)


From: David Turner 
Date: Thursday, September 7, 2017 at 11:33 AM
To: Matthew Stroud , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Blocked requests

To be fair, other times I have to go in and tweak configuration settings and 
timings to resolve chronic blocked requests.

On Thu, Sep 7, 2017 at 1:32 PM David Turner 
mailto:drakonst...@gmail.com>> wrote:
`ceph health detail` will give a little more information into the blocked 
requests.  Specifically which OSDs are the requests blocked on and how long 
have they actually been blocked (as opposed to '> 32 sec').  I usually find a 
pattern after watching that for a time and narrow things down to an OSD, 
journal, etc.  Some times I just need to restart a specific OSD and all is well.

On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud 
mailto:mattstr...@overstock.com>> wrote:
After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests for 
‘currently waiting for missing object’. I have tried bouncing the osds and 
rebooting the osd nodes, but that just moves the problems around. Previous to 
this upgrade we had no issues. Any ideas of what to look at?

Thanks,
Matthew Stroud



CONFIDENTIALITY NOTICE: This message is intended only for the use and review of 
the individual or entity to which it is addressed and may contain information 
that is privileged and confidential. If the reader of this message is not the 
intended recipient, or the employee or agent responsible for delivering the 
message solely to the intended recipient, you are hereby notified that any 
dissemination, distribution or copying of this communication is strictly 
prohibited. If you have received this communication in error, please notify 
sender immediately by telephone or return email. Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
ht

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 7:44 PM, David Turner  wrote:
> Ok, I've been testing, investigating, researching, etc for the last week and
> I don't have any problems with data syncing.  The clients on one side are
> creating multipart objects while the multisite sync is creating them as
> whole objects and one of the datacenters is slower at cleaning up the shadow
> files.  That's the big discrepancy between object counts in the pools
> between datacenters.  I created a tool that goes through for each bucket in
> a realm and does a recursive listing of all objects in it for both
> datacenters and compares the 2 lists for any differences.  The data is
> definitely in sync between the 2 datacenters down to the modified time and
> byte of each file in s3.
>
> The metadata is still not syncing for the other realm, though.  If I run
> `metadata sync init` then the second datacenter will catch up with all of
> the new users, but until I do that newly created users on the primary side
> don't exist on the secondary side.  `metadata sync status`, `sync status`,
> `metadata sync run` (only left running for 30 minutes before I ctrl+c it),
> etc don't show any problems... but the new users just don't exist on the
> secondary side until I run `metadata sync init`.  I created a new bucket
> with the new user and the bucket shows up in the second datacenter, but no
> objects because the objects don't have a valid owner.
>
> Thank you all for the help with the data sync issue.  You pushed me into
> good directions.  Does anyone have any insight as to what is preventing the
> metadata from syncing in the other realm?  I have 2 realms being sync using
> multi-site and it's only 1 of them that isn't getting the metadata across.
> As far as I can tell it is configured identically.

What do you mean you have two realms? Zones and zonegroups need to
exist in the same realm in order for meta and data sync to happen
correctly. Maybe I'm misunderstanding.

Yehuda

>
> On Thu, Aug 31, 2017 at 12:46 PM David Turner  wrote:
>>
>> All of the messages from sync error list are listed below.  The number on
>> the left is how many times the error message is found.
>>
>>1811 "message": "failed to sync bucket instance:
>> (16) Device or resource busy"
>>   7 "message": "failed to sync bucket instance:
>> (5) Input\/output error"
>>  65 "message": "failed to sync object"
>>
>> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
>> wrote:
>>>
>>>
>>> Hi David,
>>>
>>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
>>> wrote:

 The vast majority of the sync error list is "failed to sync bucket
 instance: (16) Device or resource busy".  I can't find anything on Google
 about this error message in relation to Ceph.  Does anyone have any idea
 what this means? and/or how to fix it?
>>>
>>>
>>> Those are intermediate errors resulting from several radosgw trying to
>>> acquire the same sync log shard lease. It doesn't affect the sync progress.
>>> Are there any other errors?
>>>
>>> Orit


 On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley  wrote:
>
> Hi David,
>
> The 'data sync init' command won't touch any actual object data, no.
> Resetting the data sync status will just cause a zone to restart a full 
> sync
> of the --source-zone's data changes log. This log only lists which
> buckets/shards have changes in them, which causes radosgw to consider them
> for bucket sync. So while the command may silence the warnings about data
> shards being behind, it's unlikely to resolve the issue with missing 
> objects
> in those buckets.
>
> When data sync is behind for an extended period of time, it's usually
> because it's stuck retrying previous bucket sync failures. The 'sync error
> list' may help narrow down where those failures are.
>
> There is also a 'bucket sync init' command to clear the bucket sync
> status. Following that with a 'bucket sync run' should restart a full sync
> on the bucket, pulling in any new objects that are present on the
> source-zone. I'm afraid that those commands haven't seen a lot of polish 
> or
> testing, however.
>
> Casey
>
>
> On 08/24/2017 04:15 PM, David Turner wrote:
>
> Apparently the data shards that are behind go in both directions, but
> only one zone is aware of the problem.  Each cluster has objects in their
> data pool that the other doesn't have.  I'm thinking about initiating a
> `data sync init` on both sides (one at a time) to get them back on the 
> same
> page.  Does anyone know if that command will overwrite any local data that
> the zone has that the other doesn't if you run `data sync init` on it?
>
> On Thu, Aug 24, 2017 at 1:51 PM David Turner 
> wrote:
>>
>> After restarting the 2 RGW daemons on the second site again,
>> everything caught 

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
One realm is called public with a zonegroup called public-zg with a zone
for each datacenter.  The second realm is called internal with a zonegroup
called internal-zg with a zone for each datacenter.  They each have their
own rgw's and load balancers.  The needs of our public facing rgw's and
load balancers vs internal use ones were different enough that we split them
up completely.  We also have a local realm that does not use multisite and
a 4th realm called QA that mimics the public realm as much as possible for
staging configuration changes for the rgw daemons.  All 4 realms have their
own buckets, users, etc and that is all working fine.  For all of the
radosgw-admin commands I am using the proper identifiers to make sure that
each datacenter and realm are running commands on exactly what I expect
them to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
--source-zone=public-dc2).
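
For example, the per-realm status checks look roughly like this (a sketch
reusing the identifiers above; run from a node in the secondary datacenter):

# overall multisite sync status for the public realm, as seen from public-dc2
radosgw-admin sync status --rgw-realm=public --rgw-zonegroup=public-zg \
    --rgw-zone=public-dc2

# metadata sync status only, same realm and zone
radosgw-admin metadata sync status --rgw-realm=public --rgw-zonegroup=public-zg \
    --rgw-zone=public-dc2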

The data sync issue was in the internal realm but running a data sync init
and kickstarting the rgw daemons in each datacenter fixed the data
discrepancies (I'm thinking it had something to do with a power failure a
few months back that I just noticed recently).  The metadata sync issue is
in the public realm.  I have no idea what is causing this to not sync
properly since running a `metadata sync init` catches it back up to the
primary zone, but then it doesn't receive any new users created after that.

On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
> wrote:
> > Ok, I've been testing, investigating, researching, etc for the last week
> and
> > I don't have any problems with data syncing.  The clients on one side are
> > creating multipart objects while the multisite sync is creating them as
> > whole objects and one of the datacenters is slower at cleaning up the
> shadow
> > files.  That's the big discrepancy between object counts in the pools
> > between datacenters.  I created a tool that goes through for each bucket
> in
> > a realm and does a recursive listing of all objects in it for both
> > datacenters and compares the 2 lists for any differences.  The data is
> > definitely in sync between the 2 datacenters down to the modified time
> and
> > byte of each file in s3.
> >
> > The metadata is still not syncing for the other realm, though.  If I run
> > `metadata sync init` then the second datacenter will catch up with all of
> > the new users, but until I do that newly created users on the primary
> side
> > don't exist on the secondary side.  `metadata sync status`, `sync
> status`,
> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
> it),
> > etc don't show any problems... but the new users just don't exist on the
> > secondary side until I run `metadata sync init`.  I created a new bucket
> > with the new user and the bucket shows up in the second datacenter, but
> no
> > objects because the objects don't have a valid owner.
> >
> > Thank you all for the help with the data sync issue.  You pushed me into
> > good directions.  Does anyone have any insight as to what is preventing
> the
> > metadata from syncing in the other realm?  I have 2 realms being sync
> using
> > multi-site and it's only 1 of them that isn't getting the metadata
> across.
> > As far as I can tell it is configured identically.
>
> What do you mean you have two realms? Zones and zonegroups need to
> exist in the same realm in order for meta and data sync to happen
> correctly. Maybe I'm misunderstanding.
>
> Yehuda
>
> >
> > On Thu, Aug 31, 2017 at 12:46 PM David Turner 
> wrote:
> >>
> >> All of the messages from sync error list are listed below.  The number
> on
> >> the left is how many times the error message is found.
> >>
> >>1811 "message": "failed to sync bucket instance:
> >> (16) Device or resource busy"
> >>   7 "message": "failed to sync bucket instance:
> >> (5) Input\/output error"
> >>  65 "message": "failed to sync object"
> >>
> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
> >> wrote:
> >>>
> >>>
> >>> Hi David,
> >>>
> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
> >>> wrote:
> 
>  The vast majority of the sync error list is "failed to sync bucket
>  instance: (16) Device or resource busy".  I can't find anything on
> Google
>  about this error message in relation to Ceph.  Does anyone have any
> idea
>  what this means? and/or how to fix it?
> >>>
> >>>
> >>> Those are intermediate errors resulting from several radosgw trying to
> >>> acquire the same sync log shard lease. It doesn't affect the sync
> progress.
> >>> Are there any other errors?
> >>>
> >>> Orit
> 
> 
>  On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley 
> wrote:
> >
> > Hi David,
> >
> > The 'data sync init' command won't touch any actual object data, no.
> > Resetting the data sync status will just cause a zone to restart a
> 

Re: [ceph-users] Blocked requests

2017-09-07 Thread David Turner
I would recommend pushing forward with the update instead of rolling back.
Ceph doesn't have a good track record with rolling back to a previous version.

I don't have enough information to really make sense of the ceph health
detail output.  For example, are the listed OSDs all on the same host?  As you
watch this output over time, are some of the requests clearing up?  Are there
any other patterns?  I put the following in a script and run it in a watch
command to try to follow patterns when I'm plagued with blocked requests.
cluster=${1:-ceph}  # cluster name; defaults to "ceph" if no argument is given
output=$(ceph --cluster $cluster health detail | grep 'ops are blocked' |
  sort -nrk6 | sed 's/ ops/+ops/' | sed 's/ sec/+sec/' | column -t -s'+')
echo "$output" | grep -v 'on osd'
echo "$output" | grep -Eo 'osd\.[0-9]+' | sort | uniq -c | grep -v ' 1 '
echo "$output" | grep 'on osd'

Why do you have backfilling?  You haven't mentioned that you have any
backfilling yet.  Installing an update shouldn't cause backfilling, but
it's likely related to your blocked requests.

On Thu, Sep 7, 2017 at 2:24 PM Matthew Stroud 
wrote:

> Well, in the meantime things have gone from bad to worse: now the cluster
> isn’t rebuilding and clients are unable to pass IO to the cluster. When
> this first took place, we started rolling back to 10.2.7; though that was
> successful, it didn’t help with the issue. Here is the command output:
>
>
>
> HEALTH_WARN 39 pgs backfill_wait; 5 pgs backfilling; 43 pgs degraded; 43
> pgs stuck degraded; 44 pgs stuck unclean; 43 pgs stuck undersized; 43 pgs
> undersized; 367 requests are blocked > 32 sec; 14 osds have slow requests;
> recovery 4678/1097738 objects degraded (0.426%); recovery 10364/1097738
> objects misplaced (0.944%)
>
> pg 3.624 is stuck unclean for 1402.022837, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [12,9]
>
> pg 3.587 is stuck unclean for 2536.693566, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [18,13]
>
> pg 3.45f is stuck unclean for 1421.178244, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [14,10]
>
> pg 3.41a is stuck unclean for 1505.091187, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [9,23]
>
> pg 3.4cc is stuck unclean for 1560.824332, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [18,10]
>
> < snip>
>
> pg 3.188 is stuck degraded for 1207.118130, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [14,17]
>
> pg 3.768 is stuck degraded for 1123.722910, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [11,18]
>
> pg 3.77c is stuck degraded for 1211.981606, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [9,2]
>
> pg 3.7d1 is stuck degraded for 1074.422756, current state
> active+undersized+degraded+remapped+wait_backfill, last acting [10,12]
>
> pg 3.7d1 is active+undersized+degraded+remapped+wait_backfill, acting
> [10,12]
>
> pg 3.77c is active+undersized+degraded+remapped+wait_backfill, acting [9,2]
>
> pg 3.768 is active+undersized+degraded+remapped+wait_backfill, acting
> [11,18]
>
> pg 3.709 is active+undersized+degraded+remapped+wait_backfill, acting
> [10,4]
>
> 
>
> pg 3.5d8 is active+undersized+degraded+remapped+wait_backfill, acting
> [2,10]
>
> pg 3.5dc is active+undersized+degraded+remapped+wait_backfill, acting
> [8,19]
>
> pg 3.5f8 is active+undersized+degraded+remapped+wait_backfill, acting
> [2,21]
>
> pg 3.624 is active+undersized+degraded+remapped+wait_backfill, acting
> [12,9]
>
> 2 ops are blocked > 1048.58 sec on osd.9
>
> 3 ops are blocked > 65.536 sec on osd.9
>
> 7 ops are blocked > 1048.58 sec on osd.8
>
> 1 ops are blocked > 524.288 sec on osd.8
>
> 1 ops are blocked > 131.072 sec on osd.8
>
> 
>
> 1 ops are blocked > 524.288 sec on osd.2
>
> 1 ops are blocked > 262.144 sec on osd.2
>
> 2 ops are blocked > 65.536 sec on osd.21
>
> 9 ops are blocked > 1048.58 sec on osd.5
>
> 9 ops are blocked > 524.288 sec on osd.5
>
> 71 ops are blocked > 131.072 sec on osd.5
>
> 19 ops are blocked > 65.536 sec on osd.5
>
> 35 ops are blocked > 32.768 sec on osd.5
>
> 14 osds have slow requests
>
> recovery 4678/1097738 objects degraded (0.426%)
>
> recovery 10364/1097738 objects misplaced (0.944%)
>
>
>
>
>
> *From: *David Turner 
> *Date: *Thursday, September 7, 2017 at 11:33 AM
> *To: *Matthew Stroud , "
> ceph-users@lists.ceph.com" 
> *Subject: *Re: [ceph-users] Blocked requests
>
>
>
> To be fair, other times I have to go in and tweak configuration settings
> and timings to resolve chronic blocked requests.
>
>
>
> On Thu, Sep 7, 2017 at 1:32 PM David Turner  wrote:
>
> `ceph health detail` will give a little more information into the blocked
> requests.  Specifically which OSDs are the requests blocked on and how long
> have they actually been blocked (as opposed to '> 32 sec').  I usually find
> a pattern after watching that for a time and narrow things down to an OSD

Re: [ceph-users] Luminous BlueStore EC performance

2017-09-07 Thread Mohamad Gebai
Hi,

These numbers are probably not as detailed as you'd like, but it's
something. They show the overhead of reading and/or writing to EC pools
as compared to 3x replicated pools using 1, 2, 8 and 16 threads (single
client):

Threads   Rep IOPS   EC IOPS    Diff      Slowdown
Read
1         23,325     22,052     -5.46%    1.06
2         27,261     27,147     -0.42%    1.00
8         27,151     27,127     -0.09%    1.00
16        26,793     26,728     -0.24%    1.00
Write
1         19,444      5,708    -70.64%    3.41
2         23,902      5,395    -77.43%    4.43
8         23,912      5,641    -76.41%    4.24
16        24,587      5,643    -77.05%    4.36
RW
1         20,379     11,166    -45.21%    1.83
2         34,246      9,525    -72.19%    3.60
8         33,195      9,300    -71.98%    3.57
16        31,641      9,762    -69.15%    3.24

This is on an all-SSD cluster, with 3 OSD nodes and Bluestore. Ceph
version 12.1.0-671-g2c11b88d14
(2c11b88d14e64bf60c0556c6a4ec8c9eda36ff6a) luminous (rc).
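
(The exact benchmark tool isn't stated above. Purely as a sketch, a comparable
write comparison could be run with rados bench against a replicated and an EC
pool; the pool names, duration and thread count below are placeholders, and
this is not necessarily how the numbers above were produced.)

# 4K writes with 16 concurrent ops against a replicated pool, then the EC pool
rados bench -p bench_rep 60 write -b 4096 -t 16 --no-cleanup
rados bench -p bench_ec 60 write -b 4096 -t 16 --no-cleanup
# clean up the benchmark objects afterwards
rados -p bench_rep cleanup
rados -p bench_ec cleanup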

Mohamad

On 09/06/2017 01:28 AM, Blair Bethwaite wrote:
> Hi all,
>
> (Sorry if this shows up twice - I got auto-unsubscribed and so first
> attempt was blocked)
>
> I'm keen to read up on some performance comparisons for replication
> versus EC on HDD+SSD based setups. So far the only recent thing I've
> found is Sage's Vault17 slides [1], which have a single slide showing
> 3X / EC42 / EC51 for Kraken. I guess there is probably some of this
> data to be found in the performance meeting threads, but it's hard to
> know the currency of those (typically master or wip branch tests) with
> respect to releases. Can anyone point out any other references or
> highlight something that's coming?
>
> I'm sure there are piles of operators and architects out there at the
> moment wondering how they could and should reconfigure their clusters
> once upgraded to Luminous. A couple of things going around in my head
> at the moment:
>
> * We want to get to having the bulk of our online storage in CephFS on
> EC pool/s...
> *-- is overwrite performance on EC acceptable for near-line NAS use-cases?
> *-- recovery implications (currently recovery on our Jewel RGW EC83
> pool is _way_ slower than 3X pools, what does this do to reliability?
> maybe split capacity into multiple pools if it helps to contain failure?)
>
> [1] 
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in/37
>
> -- 
> Cheers,
> ~Blairo
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCIe journal benefit for SSD OSDs

2017-09-07 Thread Alexandre DERUMIER
Hi Stefan

>>Have you already done tests on how the performance changes with bluestore 
>>while putting all 3 block devices on the same ssd?


I'm going to test bluestore with 3 nodes, 18 x Intel S3610 1.6TB, in the coming
weeks.

I'll send results on the mailing.
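
For context, a sketch of what "all 3 block devices on the same SSD" looks like
in practice with the Luminous-era ceph-disk tooling (device paths are
placeholders; double-check the option names against your version's
ceph-disk --help):

# everything (block, db, wal) on one SSD: the default when no separate devices are given
ceph-disk prepare --bluestore /dev/sdb
# db and wal carved out onto a faster NVMe device instead
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1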



- Original Message -
From: "Stefan Priebe, Profihost AG" 
To: "Christian Balzer" , "ceph-users" 
Sent: Thursday, 7 September 2017 08:03:31
Subject: Re: [ceph-users] PCIe journal benefit for SSD OSDs

Hello, 
Am 07.09.2017 um 03:53 schrieb Christian Balzer: 
> 
> Hello, 
> 
> On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote: 
> 
>> We are planning a Jewel filestore based cluster for a performance 
>> sensitive healthcare client, and the conservative OSD choice is 
>> Samsung SM863A. 
>> 
> 
> While I totally see where you're coming from and me having stated that 
> I'll give Luminous and Bluestore some time to mature, I'd also be looking 
> into that if I were being in the planning phase now, with like 3 months 
> before deployment. 
> The inherent performance increase with Bluestore (and having something 
> that hopefully won't need touching/upgrading for a while) shouldn't be 
> ignored. 

Yes, and that's the point I'm at currently as well: thinking about how 
to design a new cluster based on bluestore. 

> The SSDs are fine, I've been starting to use those recently (though not 
> with Ceph yet) as Intel DC S36xx or 37xx are impossible to get. 
> They're a bit slower in the write IOPS department, but good enough for me. 

I've never used the Intel DC ones, only the Samsungs. Are the Intel ones 
really faster? Have you disabled the FLUSH command for the Samsung ones? 
They don't skip the command automatically like the Intels do. Sadly the 
Samsung SM863 got more expensive over the last months. They were a lot 
cheaper in the first months of 2016. Maybe the 2.5" Optane Intel SSDs 
will change the game. 

>> but was wondering if anyone has seen a positive 
>> impact from also using PCIe journals (e.g. Intel P3700 or even the 
>> older 910 series) in front of such SSDs? 
>> 
> NVMe journals (or WAL and DB space for Bluestore) are nice and can 
> certainly help, especially if Ceph is tuned accordingly. 
> Avoid non DC NVMes, I doubt you can still get 910s, they are officially 
> EOL. 
> You want to match capabilities and endurances, a DC P3700 800GB would be 
> an OK match for 3-4 SM863a 960GB for example. 

That's a good point, but it makes the cluster more expensive. Currently, 
while using filestore, I use one SSD for journal and data, which works fine. 

With bluestore we have block, db and wal, so we need 3 block devices per 
OSD. If we need one PCIe or NVMe device per 3-4 devices it gets much 
more expensive per host - currently running 10 OSDs / SSDs per node. 

Have you already done tests on how the performance changes with bluestore 
while putting all 3 block devices on the same SSD? 

Greets, 
Stefan 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client features by IP?

2017-09-07 Thread Bryan Stillwell
On 09/07/2017 10:47 AM, Josh Durgin wrote:
> On 09/06/2017 04:36 PM, Bryan Stillwell wrote:
> > I was reading this post by Josh Durgin today and was pretty happy to
> > see we can get a summary of features that clients are using with the
> > 'ceph features' command:
> >
> > http://ceph.com/community/new-luminous-upgrade-complete/
> >
> > However, I haven't found an option to display the IP address of
> > those clients with the older feature sets.  Is there a flag I can
> > pass to 'ceph features' to list the IPs associated with each feature
> > set?
>
> There is not currently, we should add that - it'll be easy to backport
> to luminous too. The only place both features and IP are shown is in
> 'debug mon = 10' logs right now.

I think that would be great!  The first thing I would want to do after
seeing an old client listed would be to find it and upgrade it.  Having
the IP of the client would make that a ton easier!

Anything I could do to help make that happen?  File a feature request
maybe?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 10:04 PM, David Turner  wrote:
> One realm is called public with a zonegroup called public-zg with a zone for
> each datacenter.  The second realm is called internal with a zonegroup
> called internal-zg with a zone for each datacenter.  they each have their
> own rgw's and load balancers.  The needs of our public facing rgw's and load
> balancers vs internal use ones was different enough that we split them up
> completely.  We also have a local realm that does not use multisite and a
> 4th realm called QA that mimics the public realm as much as possible for
> staging configuration stages for the rgw daemons.  All 4 realms have their
> own buckets, users, etc and that is all working fine.  For all of the
> radosgw-admin commands I am using the proper identifiers to make sure that
> each datacenter and realm are running commands on exactly what I expect them
> to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
> --source-zone=public-dc2).
>
> The data sync issue was in the internal realm but running a data sync init
> and kickstarting the rgw daemons in each datacenter fixed the data
> discrepancies (I'm thinking it had something to do with a power failure a
> few months back that I just noticed recently).  The metadata sync issue is
> in the public realm.  I have no idea what is causing this to not sync
> properly since running a `metadata sync init` catches it back up to the
> primary zone, but then it doesn't receive any new users created after that.
>

Sounds like an issue with the metadata log in the primary master zone.
Not sure what could go wrong there, but maybe the master zone doesn't
know that it is a master zone, or it's set to not log metadata. Or
maybe there's a problem when the secondary is trying to fetch the
metadata log. Maybe some kind of # of shards mismatch (though not
likely).
Try to see if the master logs any changes: you can use the
'radosgw-admin mdlog list' command.
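
As a rough sketch of that check (using the realm identifiers quoted above;
run it against the master zone, then again after creating a throwaway user):

# list the metadata log entries recorded by the master zone of the public realm
radosgw-admin mdlog list --rgw-realm=public --rgw-zonegroup=public-zg \
    --rgw-zone=public-dc1
# a healthy master should show a new "user" entry after a test user is created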

Yehuda

> On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
>> wrote:
>> > Ok, I've been testing, investigating, researching, etc for the last week
>> > and
>> > I don't have any problems with data syncing.  The clients on one side
>> > are
>> > creating multipart objects while the multisite sync is creating them as
>> > whole objects and one of the datacenters is slower at cleaning up the
>> > shadow
>> > files.  That's the big discrepancy between object counts in the pools
>> > between datacenters.  I created a tool that goes through for each bucket
>> > in
>> > a realm and does a recursive listing of all objects in it for both
>> > datacenters and compares the 2 lists for any differences.  The data is
>> > definitely in sync between the 2 datacenters down to the modified time
>> > and
>> > byte of each file in s3.
>> >
>> > The metadata is still not syncing for the other realm, though.  If I run
>> > `metadata sync init` then the second datacenter will catch up with all
>> > of
>> > the new users, but until I do that newly created users on the primary
>> > side
>> > don't exist on the secondary side.  `metadata sync status`, `sync
>> > status`,
>> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
>> > it),
>> > etc don't show any problems... but the new users just don't exist on the
>> > secondary side until I run `metadata sync init`.  I created a new bucket
>> > with the new user and the bucket shows up in the second datacenter, but
>> > no
>> > objects because the objects don't have a valid owner.
>> >
>> > Thank you all for the help with the data sync issue.  You pushed me into
>> > good directions.  Does anyone have any insight as to what is preventing
>> > the
>> > metadata from syncing in the other realm?  I have 2 realms being sync
>> > using
>> > multi-site and it's only 1 of them that isn't getting the metadata
>> > across.
>> > As far as I can tell it is configured identically.
>>
>> What do you mean you have two realms? Zones and zonegroups need to
>> exist in the same realm in order for meta and data sync to happen
>> correctly. Maybe I'm misunderstanding.
>>
>> Yehuda
>>
>> >
>> > On Thu, Aug 31, 2017 at 12:46 PM David Turner 
>> > wrote:
>> >>
>> >> All of the messages from sync error list are listed below.  The number
>> >> on
>> >> the left is how many times the error message is found.
>> >>
>> >>1811 "message": "failed to sync bucket instance:
>> >> (16) Device or resource busy"
>> >>   7 "message": "failed to sync bucket instance:
>> >> (5) Input\/output error"
>> >>  65 "message": "failed to sync object"
>> >>
>> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman 
>> >> wrote:
>> >>>
>> >>>
>> >>> Hi David,
>> >>>
>> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner 
>> >>> wrote:
>> 
>>  The vast majority of the sync error list is "failed to sync bucket
>>  instance: (16) Device or reso

Re: [ceph-users] Client features by IP?

2017-09-07 Thread Josh Durgin

On 09/07/2017 11:31 AM, Bryan Stillwell wrote:

On 09/07/2017 10:47 AM, Josh Durgin wrote:

On 09/06/2017 04:36 PM, Bryan Stillwell wrote:

I was reading this post by Josh Durgin today and was pretty happy to
see we can get a summary of features that clients are using with the
'ceph features' command:

http://ceph.com/community/new-luminous-upgrade-complete/

However, I haven't found an option to display the IP address of
those clients with the older feature sets.  Is there a flag I can
pass to 'ceph features' to list the IPs associated with each feature
set?


There is not currently, we should add that - it'll be easy to backport
to luminous too. The only place both features and IP are shown is in
'debug mon = 10' logs right now.


I think that would be great!  The first thing I would want to do after
seeing an old client listed would be to find it and upgrade it.  Having
the IP of the client would make that a ton easier!


Yup, should've included that in the first place!


Anything I could do to help make that happen?  File a feature request
maybe?


Sure, adding a short tracker.ceph.com ticket would help, that way we can 
track the backport easily too.
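
Until that exists, a stopgap sketch (assumes the mons log to the default
/var/log/ceph path; remember to drop the debug level again afterwards):

# temporarily raise mon debug so client sessions (features plus address) get logged
ceph tell mon.\* injectargs '--debug_mon 10/10'
# after a while, look for feature/address lines in the mon logs
grep -i features /var/log/ceph/ceph-mon.*.log | less
# restore the default level
ceph tell mon.\* injectargs '--debug_mon 1/5'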


Thanks!
Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests

2017-09-07 Thread Matthew Stroud
Here is the output of your snippet:
[root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
  6 osd.10
52  ops are blocked > 4194.3   sec on osd.17
9   ops are blocked > 2097.15  sec on osd.10
4   ops are blocked > 1048.58  sec on osd.10
39  ops are blocked > 262.144  sec on osd.10
19  ops are blocked > 131.072  sec on osd.10
6   ops are blocked > 65.536   sec on osd.10
2   ops are blocked > 32.768   sec on osd.10

Here is some backfilling info:

[root@mon01 ceph-conf]# ceph status
cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
 health HEALTH_WARN
5 pgs backfilling
5 pgs degraded
5 pgs stuck degraded
5 pgs stuck unclean
5 pgs stuck undersized
5 pgs undersized
122 requests are blocked > 32 sec
recovery 2361/1097929 objects degraded (0.215%)
recovery 5578/1097929 objects misplaced (0.508%)
 monmap e1: 3 mons at 
{mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
election epoch 58, quorum 0,1,2 mon01,mon02,mon03
 osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
1005 GB used, 20283 GB / 21288 GB avail
2361/1097929 objects degraded (0.215%)
5578/1097929 objects misplaced (0.508%)
2587 active+clean
   5 active+undersized+degraded+remapped+backfilling
[root@mon01 ceph-conf]# ceph pg dump_stuck unclean
ok
pg_stat  state                                            up         up_primary  acting   acting_primary
3.5c2    active+undersized+degraded+remapped+backfilling  [17,2,10]  17          [17,2]   17
3.54a    active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
5.3b     active+undersized+degraded+remapped+backfilling  [3,19,0]   3           [10,17]  10
5.b3     active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
3.180    active+undersized+degraded+remapped+backfilling  [17,10,6]  17          [22,19]  22

Most of the backfilling was caused by restarting OSDs to clear blocked IO.
Here are some of the blocked IOs:

/var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10 10.20.57.15:6806/7029 
9362 : cluster [WRN] slow request 60.834494 seconds old, received at 2017-09-07 
13:28:36.143920: osd_op(client.114947.0:2039090 5.e637a4b3 (undecoded) 
ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently 
queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10 10.20.57.15:6806/7029 
9363 : cluster [WRN] slow request 240.661052 seconds old, received at 
2017-09-07 13:25:36.317363: osd_op(client.246934107.0:3 5.f69addd6 (undecoded) 
ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10 10.20.57.15:6806/7029 
9364 : cluster [WRN] slow request 240.660763 seconds old, received at 
2017-09-07 13:25:36.317651: osd_op(client.246944377.0:2 5.f69addd6 (undecoded) 
ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978576 osd.10 10.20.57.15:6806/7029 
9365 : cluster [WRN] slow request 240.660675 seconds old, received at 
2017-09-07 13:25:36.317740: osd_op(client.246944377.0:3 5.f69addd6 (undecoded) 
ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979367 osd.10 10.20.57.15:6806/7029 
9366 : cluster [WRN] 72 slow requests, 3 included below; oldest blocked for > 
1820.342287 secs
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979373 osd.10 10.20.57.15:6806/7029 
9367 : cluster [WRN] slow request 30.606290 seconds old, received at 2017-09-07 
13:29:12.372999: osd_op(client.115008.0:996024003 5.e637a4b3 (undecoded) 
ondisk+write+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979377 osd.10 10.20.57.15:6806/7029 
9368 : cluster [WRN] slow request 30.554317 seconds old, received at 2017-09-07 
13:29:12.424972: osd_op(client.115020.0:1831942 5.39f2d3b (undecoded) 
ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979383 osd.10 10.20.57.15:6806/7029 
9369 : cluster [WRN] slow request 30.368086 seconds old, received at 2017-09-07 
13:29:12.611204: osd_op(client.115014.0:73392774 5.e637a4b3 (undecoded) 
ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently 
queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:43.979553 osd.10 10.20.57.15:6806/7029 
9370 : cluster [WRN] 73 slow requests, 1 included below; oldest blocked for > 
1821.342499 secs
/var/log/ceph/ceph.log:2017-09-07 13:29:43.979559 osd.10 10.20.57.15:6806/7029 
9371 : cluster [WRN] slow request 30.452344 seconds old, received at 2017-09-07 
13:29:13.527157: osd_op(client.115011.0:483954528 5.e637a4b3 (undecoded) 
ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently 
queued_for_pg

From: Davi

Re: [ceph-users] Blocked requests

2017-09-07 Thread Brian Andrus
"ceph osd blocked-by" can do the same thing as that provided script.

Can you post relevant osd.10 logs and a pg dump of an affected placement
group? Specifically interested in recovery_state section.

Hopefully you were careful in how you were rebooting OSDs, and not
rebooting multiple in the same failure domain before recovery was able to
occur.
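
For reference, a sketch of the commands being asked for here (pg 3.5c2 is just
one of the PGs from the dump_stuck output quoted below):

# which OSDs are currently blocking others from peering
ceph osd blocked-by
# full query of one affected PG; the recovery_state section is near the end
ceph pg 3.5c2 query > /tmp/pg-3.5c2-query.json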

On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud 
wrote:

> Here is the output of your snippet:
>
> [root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
>
>   6 osd.10
>
> 52  ops are blocked > 4194.3   sec on osd.17
>
> 9   ops are blocked > 2097.15  sec on osd.10
>
> 4   ops are blocked > 1048.58  sec on osd.10
>
> 39  ops are blocked > 262.144  sec on osd.10
>
> 19  ops are blocked > 131.072  sec on osd.10
>
> 6   ops are blocked > 65.536   sec on osd.10
>
> 2   ops are blocked > 32.768   sec on osd.10
>
>
>
> Here is some backfilling info:
>
>
>
> [root@mon01 ceph-conf]# ceph status
>
> cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
>
>  health HEALTH_WARN
>
> 5 pgs backfilling
>
> 5 pgs degraded
>
> 5 pgs stuck degraded
>
> 5 pgs stuck unclean
>
> 5 pgs stuck undersized
>
> 5 pgs undersized
>
> 122 requests are blocked > 32 sec
>
> recovery 2361/1097929 objects degraded (0.215%)
>
> recovery 5578/1097929 objects misplaced (0.508%)
>
>  monmap e1: 3 mons at {mon01=10.20.57.10:6789/0,
> mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
>
> election epoch 58, quorum 0,1,2 mon01,mon02,mon03
>
>  osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
>
> flags sortbitwise,require_jewel_osds
>
>   pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
>
> 1005 GB used, 20283 GB / 21288 GB avail
>
> 2361/1097929 objects degraded (0.215%)
>
> 5578/1097929 objects misplaced (0.508%)
>
> 2587 active+clean
>
>5 active+undersized+degraded+remapped+backfilling
>
> [root@mon01 ceph-conf]# ceph pg dump_stuck unclean
>
> ok
>
> pg_stat state   up  up_primary  acting  acting_primary
>
> 3.5c2   active+undersized+degraded+remapped+backfilling [17,2,10]
> 17  [17,2]  17
>
> 3.54a   active+undersized+degraded+remapped+backfilling [10,19,2]
> 10  [10,17] 10
>
> 5.3bactive+undersized+degraded+remapped+backfilling [3,19,0]
> 3   [10,17] 10
>
> 5.b3active+undersized+degraded+remapped+backfilling [10,19,2]
> 10  [10,17] 10
>
> 3.180   active+undersized+degraded+remapped+backfilling [17,10,6]
> 17  [22,19] 22
>
>
>
> Most of the backfilling was caused by restarting OSDs to clear blocked
> IO. Here are some of the blocked IOs:
>
>
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10
> 10.20.57.15:6806/7029 9362 : cluster [WRN] slow request 60.834494 seconds
> old, received at 2017-09-07 13:28:36.143920: osd_op(client.114947.0:2039090
> 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected
> e6511) currently queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10
> 10.20.57.15:6806/7029 9363 : cluster [WRN] slow request 240.661052
> seconds old, received at 2017-09-07 13:25:36.317363:
> osd_op(client.246934107.0:3 5.f69addd6 (undecoded)
> ack+read+known_if_redirected e6511) currently queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10
> 10.20.57.15:6806/7029 9364 : cluster [WRN] slow request 240.660763
> seconds old, received at 2017-09-07 13:25:36.317651:
> osd_op(client.246944377.0:2 5.f69addd6 (undecoded)
> ack+read+known_if_redirected e6511) currently queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978576 osd.10
> 10.20.57.15:6806/7029 9365 : cluster [WRN] slow request 240.660675
> seconds old, received at 2017-09-07 13:25:36.317740:
> osd_op(client.246944377.0:3 5.f69addd6 (undecoded)
> ack+read+known_if_redirected e6511) currently queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979367 osd.10
> 10.20.57.15:6806/7029 9366 : cluster [WRN] 72 slow requests, 3 included
> below; oldest blocked for > 1820.342287 secs
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979373 osd.10
> 10.20.57.15:6806/7029 9367 : cluster [WRN] slow request 30.606290 seconds
> old, received at 2017-09-07 13:29:12.372999: osd_op(client.115008.0:996024003
> 5.e637a4b3 (undecoded) ondisk+write+skiprwlocks+known_if_redirected
> e6511) currently queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979377 osd.10
> 10.20.57.15:6806/7029 9368 : cluster [WRN] slow request 30.554317 seconds
> old, received at 2017-09-07 13:29:12.424972: osd_op(client.115020.0:1831942
> 5.39f2d3b (undecoded) ack+read+known_if_redirected e6511) currently
> queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979383 osd.10
> 10.20.57.15:6806/7029 9369 : cluster [WRN] slow request 30.368086 seconds
> old, received at 2017-09-07 13:29:12.611204: osd_op(client.115014.0:7

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
I created a test user named 'ice' and then used it to create a bucket named
ice.  The bucket ice can be found in the second dc, but not the user.
 `mdlog list` showed ice for the bucket, but not for the user.  I performed
the same test in the internal realm and it showed the user and bucket both
in `mdlog list`.



On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
> wrote:
> > One realm is called public with a zonegroup called public-zg with a zone
> for
> > each datacenter.  The second realm is called internal with a zonegroup
> > called internal-zg with a zone for each datacenter.  they each have their
> > own rgw's and load balancers.  The needs of our public facing rgw's and
> load
> > balancers vs internal use ones was different enough that we split them up
> > completely.  We also have a local realm that does not use multisite and a
> > 4th realm called QA that mimics the public realm as much as possible for
> > staging configuration stages for the rgw daemons.  All 4 realms have
> their
> > own buckets, users, etc and that is all working fine.  For all of the
> > radosgw-admin commands I am using the proper identifiers to make sure
> that
> > each datacenter and realm are running commands on exactly what I expect
> them
> > to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
> > --source-zone=public-dc2).
> >
> > The data sync issue was in the internal realm but running a data sync
> init
> > and kickstarting the rgw daemons in each datacenter fixed the data
> > discrepancies (I'm thinking it had something to do with a power failure a
> > few months back that I just noticed recently).  The metadata sync issue
> is
> > in the public realm.  I have no idea what is causing this to not sync
> > properly since running a `metadata sync init` catches it back up to the
> > primary zone, but then it doesn't receive any new users created after
> that.
> >
>
> Sounds like an issue with the metadata log in the primary master zone.
> Not sure what could go wrong there, but maybe the master zone doesn't
> know that it is a master zone, or it's set to not log metadata. Or
> maybe there's a problem when the secondary is trying to fetch the
> metadata log. Maybe some kind of # of shards mismatch (though not
> likely).
> Try to see if the master logs any changes: should use the
> 'radosgw-admin mdlog list' command.
>
> Yehuda
>
> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
> >> wrote:
> >> > Ok, I've been testing, investigating, researching, etc for the last
> week
> >> > and
> >> > I don't have any problems with data syncing.  The clients on one side
> >> > are
> >> > creating multipart objects while the multisite sync is creating them
> as
> >> > whole objects and one of the datacenters is slower at cleaning up the
> >> > shadow
> >> > files.  That's the big discrepancy between object counts in the pools
> >> > between datacenters.  I created a tool that goes through for each
> bucket
> >> > in
> >> > a realm and does a recursive listing of all objects in it for both
> >> > datacenters and compares the 2 lists for any differences.  The data is
> >> > definitely in sync between the 2 datacenters down to the modified time
> >> > and
> >> > byte of each file in s3.
> >> >
> >> > The metadata is still not syncing for the other realm, though.  If I
> run
> >> > `metadata sync init` then the second datacenter will catch up with all
> >> > of
> >> > the new users, but until I do that newly created users on the primary
> >> > side
> >> > don't exist on the secondary side.  `metadata sync status`, `sync
> >> > status`,
> >> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
> >> > it),
> >> > etc don't show any problems... but the new users just don't exist on
> the
> >> > secondary side until I run `metadata sync init`.  I created a new
> bucket
> >> > with the new user and the bucket shows up in the second datacenter,
> but
> >> > no
> >> > objects because the objects don't have a valid owner.
> >> >
> >> > Thank you all for the help with the data sync issue.  You pushed me
> into
> >> > good directions.  Does anyone have any insight as to what is
> preventing
> >> > the
> >> > metadata from syncing in the other realm?  I have 2 realms being sync
> >> > using
> >> > multi-site and it's only 1 of them that isn't getting the metadata
> >> > across.
> >> > As far as I can tell it is configured identically.
> >>
> >> What do you mean you have two realms? Zones and zonegroups need to
> >> exist in the same realm in order for meta and data sync to happen
> >> correctly. Maybe I'm misunderstanding.
> >>
> >> Yehuda
> >>
> >> >
> >> > On Thu, Aug 31, 2017 at 12:46 PM David Turner 
> >> > wrote:
> >> >>
> >> >> All of the messages from sync error list are listed below.  The
> number
> >> >> on
> >> >> the left is how many times the error message is found

[ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9

2017-09-07 Thread Robin H. Johnson
Hi,

Our clusters were upgraded to v10.2.9, from ~v10.2.7 (actually a local
git snapshot that was not quite 10.2.7), and since then, we're seeing a
LOT more scrub errors than previously.

The digest logging on the scrub errors, in some cases, is also now maddeningly
short: it doesn't contain ANY information on what the mismatch was, and many of
the errors seem to also be 3-way mismatches in the digest :-(.

I'm wondering if other people have seen similar rises in scrub errors
after the upgrade, and/or the lack of digest output. I did hear one anecdotal
report that 10.2.9 seemed much more likely to fail out marginal disks.

The only two changesets I can spot in Jewel that I think might be related are 
these:
1.
http://tracker.ceph.com/issues/20089
https://github.com/ceph/ceph/pull/15416
2.
http://tracker.ceph.com/issues/19404
https://github.com/ceph/ceph/pull/14204

Two example PGs that are inconsistent (chosen because they didn't convey any 
private information so I didn't have to redact anything except IP):
$ sudo ceph health detail |grep -e 5.3d40 -e 5.f1c0
pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]

$ fgrep 5.3d40 /var/log/ceph/ceph.log
2017-09-07 19:50:16.231523 osd.1322 [REDACTED::8861]:6808/3479303 1736 : 
cluster [INF] osd.1322 pg 5.3d40 Deep scrub errors, upgrading scrub to 
deep-scrub
2017-09-07 19:50:16.231862 osd.1322 [REDACTED::8861]:6808/3479303 1737 : 
cluster [INF] 5.3d40 deep-scrub starts
2017-09-07 19:54:38.631232 osd.1322 [REDACTED::8861]:6808/3479303 1738 : 
cluster [ERR] 5.3d40 shard 655: soid 
5:02bc4def:::.dir.default.64449186.344176:head omap_digest 0x3242b04e != 
omap_digest 0x337cf025 from auth oi 
5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 
osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd  
od 337cf025 alloc_hint [0 0])
2017-09-07 19:54:38.631332 osd.1322 [REDACTED::8861]:6808/3479303 1739 : 
cluster [ERR] 5.3d40 shard 1322: soid 
5:02bc4def:::.dir.default.64449186.344176:head omap_digest 0xc90d06a8 != 
omap_digest 0x3242b04e from shard 655, omap_digest 0xc90d06a8 != omap_digest 
0x337cf025 from auth oi 
5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 
osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd  
od 337cf025 alloc_hint [0 0])
2017-09-07 20:03:54.721681 osd.1322 [REDACTED::8861]:6808/3479303 1740 : 
cluster [ERR] 5.3d40 deep-scrub 0 missing, 1 inconsistent objects
2017-09-07 20:03:54.721687 osd.1322 [REDACTED::8861]:6808/3479303 1741 : 
cluster [ERR] 5.3d40 deep-scrub 3 errors

$ fgrep 5.f1c0   /var/log/ceph/ceph.log
2017-09-07 11:11:36.773986 osd.631 [REDACTED::8877]:6813/4036028 4234 : cluster 
[INF] osd.631 pg 5.f1c0 Deep scrub errors, upgrading scrub to deep-scrub
2017-09-07 11:11:36.774127 osd.631 [REDACTED::8877]:6813/4036028 4235 : cluster 
[INF] 5.f1c0 deep-scrub starts
2017-09-07 11:25:26.231502 osd.631 [REDACTED::8877]:6813/4036028 4236 : cluster 
[ERR] 5.f1c0 deep-scrub 0 missing, 1 inconsistent objects
2017-09-07 11:25:26.231508 osd.631 [REDACTED::8877]:6813/4036028 4237 : cluster 
[ERR] 5.f1c0 deep-scrub 1 errors

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Asst. Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


signature.asc
Description: Digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 11:02 PM, David Turner  wrote:
> I created a test user named 'ice' and then used it to create a bucket named
> ice.  The bucket ice can be found in the second dc, but not the user.
> `mdlog list` showed ice for the bucket, but not for the user.  I performed
> the same test in the internal realm and it showed the user and bucket both
> in `mdlog list`.
>

Maybe your radosgw-admin command is running with a ceph user that
doesn't have permissions to write to the log pool? (probably not,
because you are able to run the sync init commands).
Another very slim explanation would be if you had, for some reason, an
overlapping zones configuration that shared some of the config but not
all of it, with radosgw running against the correct one and
radosgw-admin against the bad one. I don't think it's the second
option.

Yehuda

>
>
> On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
>> wrote:
>> > One realm is called public with a zonegroup called public-zg with a zone
>> > for
>> > each datacenter.  The second realm is called internal with a zonegroup
>> > called internal-zg with a zone for each datacenter.  they each have
>> > their
>> > own rgw's and load balancers.  The needs of our public facing rgw's and
>> > load
>> > balancers vs internal use ones was different enough that we split them
>> > up
>> > completely.  We also have a local realm that does not use multisite and
>> > a
>> > 4th realm called QA that mimics the public realm as much as possible for
>> > staging configuration stages for the rgw daemons.  All 4 realms have
>> > their
>> > own buckets, users, etc and that is all working fine.  For all of the
>> > radosgw-admin commands I am using the proper identifiers to make sure
>> > that
>> > each datacenter and realm are running commands on exactly what I expect
>> > them
>> > to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
>> > --source-zone=public-dc2).
>> >
>> > The data sync issue was in the internal realm but running a data sync
>> > init
>> > and kickstarting the rgw daemons in each datacenter fixed the data
>> > discrepancies (I'm thinking it had something to do with a power failure
>> > a
>> > few months back that I just noticed recently).  The metadata sync issue
>> > is
>> > in the public realm.  I have no idea what is causing this to not sync
>> > properly since running a `metadata sync init` catches it back up to the
>> > primary zone, but then it doesn't receive any new users created after
>> > that.
>> >
>>
>> Sounds like an issue with the metadata log in the primary master zone.
>> Not sure what could go wrong there, but maybe the master zone doesn't
>> know that it is a master zone, or it's set to not log metadata. Or
>> maybe there's a problem when the secondary is trying to fetch the
>> metadata log. Maybe some kind of # of shards mismatch (though not
>> likely).
>> Try to see if the master logs any changes: should use the
>> 'radosgw-admin mdlog list' command.
>>
>> Yehuda
>>
>> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub 
>> > wrote:
>> >>
>> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
>> >> wrote:
>> >> > Ok, I've been testing, investigating, researching, etc for the last
>> >> > week
>> >> > and
>> >> > I don't have any problems with data syncing.  The clients on one side
>> >> > are
>> >> > creating multipart objects while the multisite sync is creating them
>> >> > as
>> >> > whole objects and one of the datacenters is slower at cleaning up the
>> >> > shadow
>> >> > files.  That's the big discrepancy between object counts in the pools
>> >> > between datacenters.  I created a tool that goes through for each
>> >> > bucket
>> >> > in
>> >> > a realm and does a recursive listing of all objects in it for both
>> >> > datacenters and compares the 2 lists for any differences.  The data
>> >> > is
>> >> > definitely in sync between the 2 datacenters down to the modified
>> >> > time
>> >> > and
>> >> > byte of each file in s3.
>> >> >
>> >> > The metadata is still not syncing for the other realm, though.  If I
>> >> > run
>> >> > `metadata sync init` then the second datacenter will catch up with
>> >> > all
>> >> > of
>> >> > the new users, but until I do that newly created users on the primary
>> >> > side
>> >> > don't exist on the secondary side.  `metadata sync status`, `sync
>> >> > status`,
>> >> > `metadata sync run` (only left running for 30 minutes before I ctrl+c
>> >> > it),
>> >> > etc don't show any problems... but the new users just don't exist on
>> >> > the
>> >> > secondary side until I run `metadata sync init`.  I created a new
>> >> > bucket
>> >> > with the new user and the bucket shows up in the second datacenter,
>> >> > but
>> >> > no
>> >> > objects because the objects don't have a valid owner.
>> >> >
>> >> > Thank you all for the help with the data sync issue.  You pushed me
>> >> > into
>> >> > good directions.  Does anyone have any insi

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
I'm pretty sure I'm using the cluster admin user/keyring.  Is there any
output that would be helpful?  Period, zonegroup get, etc?

On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 11:02 PM, David Turner 
> wrote:
> > I created a test user named 'ice' and then used it to create a bucket
> named
> > ice.  The bucket ice can be found in the second dc, but not the user.
> > `mdlog list` showed ice for the bucket, but not for the user.  I
> performed
> > the same test in the internal realm and it showed the user and bucket
> both
> > in `mdlog list`.
> >
>
> Maybe your radosgw-admin command is running with a ceph user that
> doesn't have permissions to write to the log pool? (probably not,
> because you are able to run the sync init commands).
> Another very slim explanation would be if you had for some reason
> overlapping zones configuration that shared some of the config but not
> all of it, having radosgw running against the correct one and
> radosgw-admin against the bad one. I don't think it's the second
> option.
>
> Yehuda
>
> >
> >
> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
> >> wrote:
> >> > One realm is called public with a zonegroup called public-zg with a
> zone
> >> > for
> >> > each datacenter.  The second realm is called internal with a zonegroup
> >> > called internal-zg with a zone for each datacenter.  they each have
> >> > their
> >> > own rgw's and load balancers.  The needs of our public facing rgw's
> and
> >> > load
> >> > balancers vs internal use ones was different enough that we split them
> >> > up
> >> > completely.  We also have a local realm that does not use multisite
> and
> >> > a
> >> > 4th realm called QA that mimics the public realm as much as possible
> for
> >> > staging configuration stages for the rgw daemons.  All 4 realms have
> >> > their
> >> > own buckets, users, etc and that is all working fine.  For all of the
> >> > radosgw-admin commands I am using the proper identifiers to make sure
> >> > that
> >> > each datacenter and realm are running commands on exactly what I
> expect
> >> > them
> >> > to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1
> >> > --source-zone=public-dc2).
> >> >
> >> > The data sync issue was in the internal realm but running a data sync
> >> > init
> >> > and kickstarting the rgw daemons in each datacenter fixed the data
> >> > discrepancies (I'm thinking it had something to do with a power
> failure
> >> > a
> >> > few months back that I just noticed recently).  The metadata sync
> issue
> >> > is
> >> > in the public realm.  I have no idea what is causing this to not sync
> >> > properly since running a `metadata sync init` catches it back up to
> the
> >> > primary zone, but then it doesn't receive any new users created after
> >> > that.
> >> >
> >>
> >> Sounds like an issue with the metadata log in the primary master zone.
> >> Not sure what could go wrong there, but maybe the master zone doesn't
> >> know that it is a master zone, or it's set to not log metadata. Or
> >> maybe there's a problem when the secondary is trying to fetch the
> >> metadata log. Maybe some kind of # of shards mismatch (though not
> >> likely).
> >> Try to see if the master logs any changes: should use the
> >> 'radosgw-admin mdlog list' command.
> >>
> >> Yehuda
> >>
> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub <
> yeh...@redhat.com>
> >> > wrote:
> >> >>
> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
> >> >> wrote:
> >> >> > Ok, I've been testing, investigating, researching, etc for the last
> >> >> > week
> >> >> > and
> >> >> > I don't have any problems with data syncing.  The clients on one
> side
> >> >> > are
> >> >> > creating multipart objects while the multisite sync is creating
> them
> >> >> > as
> >> >> > whole objects and one of the datacenters is slower at cleaning up
> the
> >> >> > shadow
> >> >> > files.  That's the big discrepancy between object counts in the
> pools
> >> >> > between datacenters.  I created a tool that goes through for each
> >> >> > bucket
> >> >> > in
> >> >> > a realm and does a recursive listing of all objects in it for both
> >> >> > datacenters and compares the 2 lists for any differences.  The data
> >> >> > is
> >> >> > definitely in sync between the 2 datacenters down to the modified
> >> >> > time
> >> >> > and
> >> >> > byte of each file in s3.
> >> >> >
> >> >> > The metadata is still not syncing for the other realm, though.  If
> I
> >> >> > run
> >> >> > `metadata sync init` then the second datacenter will catch up with
> >> >> > all
> >> >> > of
> >> >> > the new users, but until I do that newly created users on the
> primary
> >> >> > side
> >> >> > don't exist on the secondary side.  `metadata sync status`, `sync
> >> >> > status`,
> >> >> > `metadata sync run` (only left running for 30 minutes before I
> ctrl+c
> >> >> > it),
> >> >> > etc don'

Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-07 Thread Mclean, Patrick
On 2017-09-05 02:41 PM, Gregory Farnum wrote:
> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas wrote:
>> Hi everyone,
>>
>> with the Luminous release out the door and the Labor Day weekend
>> over, I hope I can kick off a discussion on another issue that has
>> irked me a bit for quite a while. There doesn't seem to be a good
>> documented answer to this: what are Ceph's real limits when it
>> comes to RBD snapshots?
>>
>> For most people, any RBD image will have perhaps a single-digit
>> number of snapshots. For example, in an OpenStack environment we
>> typically have one snapshot per Glance image, a few snapshots per
>> Cinder volume, and perhaps a few snapshots per ephemeral Nova disk
>> (unless clones are configured to flatten immediately). Ceph
>> generally performs well under those circumstances.
>>
>> However, things sometimes start getting problematic when RBD
>> snapshots are generated frequently, and in an automated fashion.
>> I've seen Ceph operators configure snapshots on a daily or even
>> hourly basis, typically when using snapshots as a backup strategy
>> (where they promise to allow for very short RTO and RPO). In
>> combination with thousands or maybe tens of thousands of RBDs,
>> that's a lot of snapshots. And in such scenarios (and only in
>> those), users have been bitten by a few nasty bugs in the past —
>> here's an example where the OSD snap trim queue went berserk in the
>> event of lots of snapshots being deleted:
>>
>> http://tracker.ceph.com/issues/9487
>> https://www.spinics.net/lists/ceph-devel/msg20470.html
>>
>> It seems to me that there still isn't a good recommendation along
>> the lines of "try not to have more than X snapshots per RBD image"
>> or "try not to have more than Y snapshots in the cluster overall".
>> Or is the "correct" recommendation actually "create as many
>> snapshots as you might possibly want, none of that is allowed to
>> create any instability nor performance degradation and if it does,
>> that's a bug"?
>
> I think we're closer to "as many snapshots as you want", but there
> are some known shortages there.
>
> First of all, if you haven't seen my talk from the last OpenStack
> summit on snapshots and you want a bunch of details, go watch that.
> :p
> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1

There are a few dimensions in which there can be failures with snapshots:

> 1) right now the way we mark snapshots as deleted is suboptimal — when
> deleted they go into an interval_set in the OSDMap. So if you have a
> bunch of holes in your deleted snapshots, it is possible to inflate the
> osdmap to a size which causes trouble. But I'm not sure if we've
> actually seen this be an issue yet — it requires both a large cluster,
> and a large map, and probably some other failure causing osdmaps to be
> generated very rapidly.
In our use case, we are severely hampered by the size of removed_snaps
(50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent in
PGPool::update and its interval calculation code. We have a cluster of
around 100k RBDs, with each RBD having up to 25 snapshots and only a small
portion of our RBDs mapped at a time (~500-1000). For size / performance
reasons we try to keep the number of snapshots low (<25) and need to
prune snapshots. Since in our use case RBDs 'age' at different rates,
snapshot pruning creates holes, to the point where the size of the
removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph
clusters. In general, around 2 snapshot removal operations currently
happen per minute, simply because of the volume of snapshots and users we
have.

We found PGPool::update and the interval calculation code to be quite
inefficient. Some small changes made it a lot faster, giving us more
breathing room; we shared these and most have already been applied:
https://github.com/ceph/ceph/pull/17088
https://github.com/ceph/ceph/pull/17121
https://github.com/ceph/ceph/pull/17239
https://github.com/ceph/ceph/pull/17265
https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)

These patches helped our use case, but overall CPU usage in this area is
still high (>70% or so), making the Ceph cluster slow, causing blocked
requests, and making many operations (e.g. rbd map) take a long time.

We are trying to work around these issues by changing our snapshot
strategy. In the short term we are manually defragmenting the interval set
by scanning for holes and deleting snapids between holes to coalesce them.
This is not a nice thing to have to do. In some cases we employ strategies
to 'recreate' old snapshots (as we need to keep them) at higher snapids.
For our use case a 'snapid rename' feature would have been quite helpful.
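
A rough sketch of that defragmentation idea, using the same toy run
representation as above: find the live snapids sitting in the gaps between
two runs, since removing (or recreating at a higher snapid and then
removing) exactly those would merge neighbouring runs and shrink the
interval set. Only the selection logic is shown; mapping snapids back to
RBD snapshots and the actual deletion are left out.

    # Given sorted (start, length) runs of removed snapids, list each gap of
    # still-live snapids between consecutive runs. Deleting all snapids in a
    # gap coalesces the two surrounding runs into one.
    def gaps(runs):
        out = []
        for (s1, l1), (s2, _) in zip(runs, runs[1:]):
            out.append(list(range(s1 + l1, s2)))
        return out

    runs = [(1, 10), (13, 5), (20, 2)]   # removed snapids 1-10, 13-17, 20-21
    print(gaps(runs))                    # -> [[11, 12], [18, 19]]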

I hope this shines some light on practical Ceph clusters in which
performance is bottlenecked not by I/O but by snapshot removal.

> 2) There may be issues with how rbd records what snapshots it is
> associated with? No idea

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 7, 2017 at 11:37 PM, David Turner  wrote:
> I'm pretty sure I'm using the cluster admin user/keyring.  Is there any
> output that would be helpful?  Period, zonegroup get, etc?

 - radosgw-admin period get
 - radosgw-admin zone list
 - radosgw-admin zonegroup list

For each zone, zonegroup in result:
 - radosgw-admin zone get --rgw-zone=
 - radosgw-admin zonegroup get --rgw-zonegroup=

 - rados lspools

Also, create a user with --debug-rgw=20 --debug-ms=1, need to look at the log.

Yehuda
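
(For convenience, a rough Python helper that gathers the outputs requested
above in one pass; it is not from the thread, just a sketch that shells out
to the same commands. The JSON key names for the zone/zonegroup lists can
differ slightly between releases, hence the fallback.)

    import json
    import subprocess

    def run(*cmd):
        print('### ' + ' '.join(cmd))
        out = subprocess.check_output(cmd).decode()
        print(out)
        return out

    run('radosgw-admin', 'period', 'get')
    zones = json.loads(run('radosgw-admin', 'zone', 'list'))
    zgroups = json.loads(run('radosgw-admin', 'zonegroup', 'list'))

    for z in zones.get('zones', []):
        run('radosgw-admin', 'zone', 'get', '--rgw-zone=' + z)
    for zg in zgroups.get('zonegroups', zgroups.get('names', [])):
        run('radosgw-admin', 'zonegroup', 'get', '--rgw-zonegroup=' + zg)

    run('rados', 'lspools')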


>
> On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> On Thu, Sep 7, 2017 at 11:02 PM, David Turner 
>> wrote:
>> > I created a test user named 'ice' and then used it to create a bucket
>> > named
>> > ice.  The bucket ice can be found in the second dc, but not the user.
>> > `mdlog list` showed ice for the bucket, but not for the user.  I
>> > performed
>> > the same test in the internal realm and it showed the user and bucket
>> > both
>> > in `mdlog list`.
>> >
>>
>> Maybe your radosgw-admin command is running with a ceph user that
>> doesn't have permissions to write to the log pool? (probably not,
>> because you are able to run the sync init commands).
>> Another very slim explanation would be if you had for some reason
>> overlapping zones configuration that shared some of the config but not
>> all of it, having radosgw running against the correct one and
>> radosgw-admin against the bad one. I don't think it's the second
>> option.
>>
>> Yehuda
>>
>> >
>> >
>> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub 
>> > wrote:
>> >>
>> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner 
>> >> wrote:
>> >> > One realm is called public with a zonegroup called public-zg with a
>> >> > zone
>> >> > for
>> >> > each datacenter.  The second realm is called internal with a
>> >> > zonegroup
>> >> > called internal-zg with a zone for each datacenter.  they each have
>> >> > their
>> >> > own rgw's and load balancers.  The needs of our public facing rgw's
>> >> > and
>> >> > load
>> >> > balancers vs internal use ones was different enough that we split
>> >> > them
>> >> > up
>> >> > completely.  We also have a local realm that does not use multisite
>> >> > and
>> >> > a
>> >> > 4th realm called QA that mimics the public realm as much as possible
>> >> > for
>> >> > staging configuration stages for the rgw daemons.  All 4 realms have
>> >> > their
>> >> > own buckets, users, etc and that is all working fine.  For all of the
>> >> > radosgw-admin commands I am using the proper identifiers to make sure
>> >> > that
>> >> > each datacenter and realm are running commands on exactly what I
>> >> > expect
>> >> > them
>> >> > to (--rgw-realm=public --rgw-zonegroup=public-zg
>> >> > --rgw-zone=public-dc1
>> >> > --source-zone=public-dc2).
>> >> >
>> >> > The data sync issue was in the internal realm but running a data sync
>> >> > init
>> >> > and kickstarting the rgw daemons in each datacenter fixed the data
>> >> > discrepancies (I'm thinking it had something to do with a power
>> >> > failure
>> >> > a
>> >> > few months back that I just noticed recently).  The metadata sync
>> >> > issue
>> >> > is
>> >> > in the public realm.  I have no idea what is causing this to not sync
>> >> > properly since running a `metadata sync init` catches it back up to
>> >> > the
>> >> > primary zone, but then it doesn't receive any new users created after
>> >> > that.
>> >> >
>> >>
>> >> Sounds like an issue with the metadata log in the primary master zone.
>> >> Not sure what could go wrong there, but maybe the master zone doesn't
>> >> know that it is a master zone, or it's set to not log metadata. Or
>> >> maybe there's a problem when the secondary is trying to fetch the
>> >> metadata log. Maybe some kind of # of shards mismatch (though not
>> >> likely).
>> >> Try to see if the master logs any changes: should use the
>> >> 'radosgw-admin mdlog list' command.
>> >>
>> >> Yehuda
>> >>
>> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner 
>> >> >> wrote:
>> >> >> > Ok, I've been testing, investigating, researching, etc for the
>> >> >> > last
>> >> >> > week
>> >> >> > and
>> >> >> > I don't have any problems with data syncing.  The clients on one
>> >> >> > side
>> >> >> > are
>> >> >> > creating multipart objects while the multisite sync is creating
>> >> >> > them
>> >> >> > as
>> >> >> > whole objects and one of the datacenters is slower at cleaning up
>> >> >> > the
>> >> >> > shadow
>> >> >> > files.  That's the big discrepancy between object counts in the
>> >> >> > pools
>> >> >> > between datacenters.  I created a tool that goes through for each
>> >> >> > bucket
>> >> >> > in
>> >> >> > a realm and does a recursive listing of all objects in it for both
>> >> >> > datacenters and compares the 2 lists for any differences.  The
>> >> >> > data
>> >> >> > is
>> >> >> > definitely in sync between the 2 datacente
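
(A minimal sketch of the kind of comparison tool described above: list the
same bucket through each zone's RGW endpoint and diff the key sets. The
endpoints, bucket name and credentials below are placeholders, and it
assumes plain S3 access to both sites.)

    import boto3

    def list_keys(endpoint, bucket, access_key, secret_key):
        s3 = boto3.client('s3', endpoint_url=endpoint,
                          aws_access_key_id=access_key,
                          aws_secret_access_key=secret_key)
        keys = set()
        for page in s3.get_paginator('list_objects').paginate(Bucket=bucket):
            for obj in page.get('Contents', []):
                keys.add((obj['Key'], obj['Size']))
        return keys

    dc1 = list_keys('http://rgw.dc1.example.com', 'mybucket', 'KEY1', 'SECRET1')
    dc2 = list_keys('http://rgw.dc2.example.com', 'mybucket', 'KEY2', 'SECRET2')
    print('only in dc1:', sorted(dc1 - dc2))
    print('only in dc2:', sorted(dc2 - dc1))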

[ceph-users] Vote re release cadence

2017-09-07 Thread Anthony D'Atri
One vote for:

* Drop the odd releases, and aim for a ~9 month cadence. This splits the 
difference between the current even/odd pattern we've been doing.

We've already been bitten by gotchas with upgrades even between point releases, so 
I favor strategies that limit the number of upgrade paths in the hope that they 
will be more solid.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests

2017-09-07 Thread Matthew Stroud
After some troubleshooting, the issues appear to be caused by gnocchi using 
rados. I’m trying to figure out why.

Thanks,
Matthew Stroud

From: Brian Andrus 
Date: Thursday, September 7, 2017 at 1:53 PM
To: Matthew Stroud 
Cc: David Turner , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Blocked requests

"ceph osd blocked-by" can do the same thing as that provided script.

Can you post relevant osd.10 logs and a pg dump of an affected placement group? 
Specifically interested in recovery_state section.

Hopefully you were careful in how you were rebooting OSDs, and not rebooting 
multiple in the same failure domain before recovery was able to occur.
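
(A quick, illustrative way to pull together what Brian asks for here -- the
blocked-by summary plus the recovery_state of each stuck PG; the PG IDs are
the ones from the dump_stuck output further down.)

    import json
    import subprocess

    def ceph(*args):
        return subprocess.check_output(('ceph',) + args).decode()

    print(ceph('osd', 'blocked-by'))

    for pgid in ['3.5c2', '3.54a', '5.3b', '5.b3', '3.180']:
        state = json.loads(ceph('pg', pgid, 'query')).get('recovery_state', [])
        print(pgid, json.dumps(state, indent=2))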

On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud 
mailto:mattstr...@overstock.com>> wrote:
Here is the output of your snippet:
[root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
  6 osd.10
52  ops are blocked > 4194.3   sec on osd.17
9   ops are blocked > 2097.15  sec on osd.10
4   ops are blocked > 1048.58  sec on osd.10
39  ops are blocked > 262.144  sec on osd.10
19  ops are blocked > 131.072  sec on osd.10
6   ops are blocked > 65.536   sec on osd.10
2   ops are blocked > 32.768   sec on osd.10

Here is some backfilling info:

[root@mon01 ceph-conf]# ceph status
cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
 health HEALTH_WARN
5 pgs backfilling
5 pgs degraded
5 pgs stuck degraded
5 pgs stuck unclean
5 pgs stuck undersized
5 pgs undersized
122 requests are blocked > 32 sec
recovery 2361/1097929 objects degraded (0.215%)
recovery 5578/1097929 objects misplaced (0.508%)
 monmap e1: 3 mons at 
{mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
election epoch 58, quorum 0,1,2 mon01,mon02,mon03
 osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
1005 GB used, 20283 GB / 21288 GB avail
2361/1097929 objects degraded (0.215%)
5578/1097929 objects misplaced (0.508%)
2587 active+clean
   5 active+undersized+degraded+remapped+backfilling
[root@mon01 ceph-conf]# ceph pg dump_stuck unclean
ok
pg_stat state   up  up_primary  acting  acting_primary
3.5c2   active+undersized+degraded+remapped+backfilling [17,2,10]   17  
[17,2]  17
3.54a   active+undersized+degraded+remapped+backfilling [10,19,2]   10  
[10,17] 10
5.3bactive+undersized+degraded+remapped+backfilling [3,19,0]3   
[10,17] 10
5.b3active+undersized+degraded+remapped+backfilling [10,19,2]   10  
[10,17] 10
3.180   active+undersized+degraded+remapped+backfilling [17,10,6]   17  
[22,19] 22

Most of the backfilling was caused by restarting OSDs to clear blocked IO. 
Here are some of the blocked IOs:

/var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10 
10.20.57.15:6806/7029 9362 : cluster [WRN] slow 
request 60.834494 seconds old, received at 2017-09-07 13:28:36.143920: 
osd_op(client.114947.0:2039090 5.e637a4b3 (undecoded) 
ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently 
queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10 
10.20.57.15:6806/7029 9363 : cluster [WRN] slow 
request 240.661052 seconds old, received at 2017-09-07 13:25:36.317363: 
osd_op(client.246934107.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected 
e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10 
10.20.57.15:6806/7029 9364 : cluster [WRN] slow 
request 240.660763 seconds old, received at 2017-09-07 13:25:36.317651: 
osd_op(client.246944377.0:2 5.f69addd6 (undecoded) ack+read+known_if_redirected 
e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978576 osd.10 
10.20.57.15:6806/7029 9365 : cluster [WRN] slow 
request 240.660675 seconds old, received at 2017-09-07 13:25:36.317740: 
osd_op(client.246944377.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected 
e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979367 osd.10 
10.20.57.15:6806/7029 9366 : cluster [WRN] 72 
slow requests, 3 included below; oldest blocked for > 1820.342287 secs
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979373 osd.10 
10.20.57.15:6806/7029 9367 : cluster [WRN] slow 
request 30.606290 seconds old, received at 2017-09-07 13:29:12.372999: 
osd_op(client.115008.0:996024003 5.e637a4b3 (undecoded) 
ondisk+write+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979377 osd.10 
10.20.57.15:6806/7029

Re: [ceph-users] RGW Multisite metadata sync init

2017-09-07 Thread David Turner
I sent the output of all of the files including the logs to you.  Thank you
for your help so far.

On Thu, Sep 7, 2017 at 4:48 PM Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Sep 7, 2017 at 11:37 PM, David Turner 
> wrote:
> > I'm pretty sure I'm using the cluster admin user/keyring.  Is there any
> > output that would be helpful?  Period, zonegroup get, etc?
>
>  - radosgw-admin period get
>  - radosgw-admin zone list
>  - radosgw-admin zonegroup list
>
> For each zone, zonegroup in result:
>  - radosgw-admin zone get --rgw-zone=
>  - radosgw-admin zonegroup get --rgw-zonegroup=
>
>  - rados lspools
>
> Also, create a user with --debug-rgw=20 --debug-ms=1, need to look at the
> log.
>
> Yehuda
>
>
> >
> > On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub 
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 11:02 PM, David Turner 
> >> wrote:
> >> > I created a test user named 'ice' and then used it to create a bucket
> >> > named
> >> > ice.  The bucket ice can be found in the second dc, but not the user.
> >> > `mdlog list` showed ice for the bucket, but not for the user.  I
> >> > performed
> >> > the same test in the internal realm and it showed the user and bucket
> >> > both
> >> > in `mdlog list`.
> >> >
> >>
> >> Maybe your radosgw-admin command is running with a ceph user that
> >> doesn't have permissions to write to the log pool? (probably not,
> >> because you are able to run the sync init commands).
> >> Another very slim explanation would be if you had for some reason
> >> overlapping zones configuration that shared some of the config but not
> >> all of it, having radosgw running against the correct one and
> >> radosgw-admin against the bad one. I don't think it's the second
> >> option.
> >>
> >> Yehuda
> >>
> >> >
> >> >
> >> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub <
> yeh...@redhat.com>
> >> > wrote:
> >> >>
> >> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner  >
> >> >> wrote:
> >> >> > One realm is called public with a zonegroup called public-zg with a
> >> >> > zone
> >> >> > for
> >> >> > each datacenter.  The second realm is called internal with a
> >> >> > zonegroup
> >> >> > called internal-zg with a zone for each datacenter.  they each have
> >> >> > their
> >> >> > own rgw's and load balancers.  The needs of our public facing rgw's
> >> >> > and
> >> >> > load
> >> >> > balancers vs internal use ones was different enough that we split
> >> >> > them
> >> >> > up
> >> >> > completely.  We also have a local realm that does not use multisite
> >> >> > and
> >> >> > a
> >> >> > 4th realm called QA that mimics the public realm as much as
> possible
> >> >> > for
> >> >> > staging configuration stages for the rgw daemons.  All 4 realms
> have
> >> >> > their
> >> >> > own buckets, users, etc and that is all working fine.  For all of
> the
> >> >> > radosgw-admin commands I am using the proper identifiers to make
> sure
> >> >> > that
> >> >> > each datacenter and realm are running commands on exactly what I
> >> >> > expect
> >> >> > them
> >> >> > to (--rgw-realm=public --rgw-zonegroup=public-zg
> >> >> > --rgw-zone=public-dc1
> >> >> > --source-zone=public-dc2).
> >> >> >
> >> >> > The data sync issue was in the internal realm but running a data
> sync
> >> >> > init
> >> >> > and kickstarting the rgw daemons in each datacenter fixed the data
> >> >> > discrepancies (I'm thinking it had something to do with a power
> >> >> > failure
> >> >> > a
> >> >> > few months back that I just noticed recently).  The metadata sync
> >> >> > issue
> >> >> > is
> >> >> > in the public realm.  I have no idea what is causing this to not
> sync
> >> >> > properly since running a `metadata sync init` catches it back up to
> >> >> > the
> >> >> > primary zone, but then it doesn't receive any new users created
> after
> >> >> > that.
> >> >> >
> >> >>
> >> >> Sounds like an issue with the metadata log in the primary master
> zone.
> >> >> Not sure what could go wrong there, but maybe the master zone doesn't
> >> >> know that it is a master zone, or it's set to not log metadata. Or
> >> >> maybe there's a problem when the secondary is trying to fetch the
> >> >> metadata log. Maybe some kind of # of shards mismatch (though not
> >> >> likely).
> >> >> Try to see if the master logs any changes: should use the
> >> >> 'radosgw-admin mdlog list' command.
> >> >>
> >> >> Yehuda
> >> >>
> >> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub
> >> >> > 
> >> >> > wrote:
> >> >> >>
> >> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner <
> drakonst...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Ok, I've been testing, investigating, researching, etc for the
> >> >> >> > last
> >> >> >> > week
> >> >> >> > and
> >> >> >> > I don't have any problems with data syncing.  The clients on one
> >> >> >> > side
> >> >> >> > are
> >> >> >> > creating multipart objects while the multisite sync is creating
> >> >> >> > them
> >> >> >> > as
> >> >> >> > whole objects and one of the datacenters is slower at cle

Re: [ceph-users] Luminous BlueStore EC performance

2017-09-07 Thread Christian Wuerdig
What type of EC config (k+m) was used if I may ask?

On Fri, Sep 8, 2017 at 1:34 AM, Mohamad Gebai  wrote:
> Hi,
>
> These numbers are probably not as detailed as you'd like, but it's
> something. They show the overhead of reading and/or writing to EC pools as
> compared to 3x replicated pools using 1, 2, 8 and 16 threads (single
> client):
>
> Threads   Rep IOPS   EC IOPS    Diff      Slowdown
> Read
>  1        23,325     22,052     -5.46%    1.06
>  2        27,261     27,147     -0.42%    1.00
>  8        27,151     27,127     -0.09%    1.00
> 16        26,793     26,728     -0.24%    1.00
> Write
>  1        19,444      5,708    -70.64%    3.41
>  2        23,902      5,395    -77.43%    4.43
>  8        23,912      5,641    -76.41%    4.24
> 16        24,587      5,643    -77.05%    4.36
> RW
>  1        20,379     11,166    -45.21%    1.83
>  2        34,246      9,525    -72.19%    3.60
>  8        33,195      9,300    -71.98%    3.57
> 16        31,641      9,762    -69.15%    3.24
>
> This is on an all-SSD cluster, with 3 OSD nodes and Bluestore. Ceph version
> 12.1.0-671-g2c11b88d14 (2c11b88d14e64bf60c0556c6a4ec8c9eda36ff6a) luminous
> (rc).
>
> Mohamad
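
(For reference, the Diff and Slowdown columns in the table above follow
directly from the two IOPS columns; a quick check of the single-threaded
write row:)

    rep, ec = 19444, 5708                                     # IOPS from the table
    print('diff     = %+.2f%%' % (100.0 * (ec - rep) / rep))  # -> -70.64%
    print('slowdown = %.2f' % (rep / float(ec)))              # -> 3.41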
>
>
> On 09/06/2017 01:28 AM, Blair Bethwaite wrote:
>
> Hi all,
>
> (Sorry if this shows up twice - I got auto-unsubscribed and so first attempt
> was blocked)
>
> I'm keen to read up on some performance comparisons for replication versus
> EC on HDD+SSD based setups. So far the only recent thing I've found is
> Sage's Vault17 slides [1], which have a single slide showing 3X / EC42 /
> EC51 for Kraken. I guess there is probably some of this data to be found in
> the performance meeting threads, but it's hard to know the currency of those
> (typically master or wip branch tests) with respect to releases. Can anyone
> point out any other references or highlight something that's coming?
>
> I'm sure there are piles of operators and architects out there at the moment
> wondering how they could and should reconfigure their clusters once upgraded
> to Luminous. A couple of things going around in my head at the moment:
>
> * We want to get to having the bulk of our online storage in CephFS on EC
> pool/s...
> *-- is overwrite performance on EC acceptable for near-line NAS use-cases?
> *-- recovery implications (currently recovery on our Jewel RGW EC83 pool is
> _way_ slower than 3X pools, what does this do to reliability? maybe split
> capacity into multiple pools if it helps to contain failure?)
>
> [1]
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in/37
>
> --
> Cheers,
> ~Blairo
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release cadence

2017-09-07 Thread Deepak Naidu
>> Maybe I missed something, but I think Ceph does not support LTS releases for 
>> 3 years.
Yes, you are correct, but it averages out to about 18 months, and sometimes I see 
20 months (Hammer). Anything with a 1-year release cycle is not worth the time, 
and having a near-3-year support model is best for PROD.

http://docs.ceph.com/docs/master/releases/

--
Deepak

-Original Message-
From: Henrik Korkuc [mailto:li...@kirneh.eu] 
Sent: Wednesday, September 06, 2017 10:50 PM
To: Deepak Naidu; Sage Weil; ceph-de...@vger.kernel.org; 
ceph-maintain...@ceph.com; ceph-us...@ceph.com
Subject: Re: [ceph-users] Ceph release cadence

On 17-09-07 02:42, Deepak Naidu wrote:
> Hope collective feedback helps. So here's one.
>
>>> - Not a lot of people seem to run the "odd" releases (e.g., infernalis, 
>>> kraken).
> I think the more obvious reason is that companies/users wanting to use CEPH will 
> stick with LTS versions, as it models the 3yr support cycle.
Maybe I missed something, but I think Ceph does not support LTS releases for 3 
years.

>>> * Drop the odd releases, and aim for a ~9 month cadence. This splits the 
>>> difference between the current even/odd pattern we've been doing.
> Yes, provided an easy upgrade process.
>
>
> --
> Deepak
>
>
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Sage Weil
> Sent: Wednesday, September 06, 2017 8:24 AM
> To: ceph-de...@vger.kernel.org; ceph-maintain...@ceph.com; 
> ceph-us...@ceph.com
> Subject: [ceph-users] Ceph release cadence
>
> Hi everyone,
>
> Traditionally, we have done a major named "stable" release twice a year, and 
> every other such release has been an "LTS" release, with fixes backported for 
> 1-2 years.
>
> With kraken and luminous we missed our schedule by a lot: instead of 
> releasing in October and April we released in January and August.
>
> A few observations:
>
> - Not a lot of people seem to run the "odd" releases (e.g., infernalis, 
> kraken).  This limits the value of actually making them.  It also means that 
> those who *do* run them are running riskier code (fewer users -> more bugs).
>
> - The more recent requirement that upgrading clusters must make a stop 
> at each LTS (e.g., hammer -> luminous not supported, must go hammer -> 
> jewel -> luminous) has been hugely helpful on the development side by 
> reducing the amount of cross-version compatibility code to maintain and 
> reducing the number of upgrade combinations to test.
>
> - When we try to do a time-based "train" release cadence, there always seems 
> to be some "must-have" thing that delays the release a bit.  This doesn't 
> happen as much with the odd releases, but it definitely happens with the LTS 
> releases.  When the next LTS is a year away, it is hard to suck it up and 
> wait that long.
>
> A couple of options:
>
> * Keep even/odd pattern, and continue being flexible with release 
> dates
>
>+ flexible
>- unpredictable
>- odd releases of dubious value
>
> * Keep even/odd pattern, but force a 'train' model with a more regular 
> cadence
>
>+ predictable schedule
>- some features will miss the target and be delayed a year
>
> * Drop the odd releases but change nothing else (i.e., 12-month 
> release
> cadence)
>
>+ eliminate the confusing odd releases with dubious value
>   
> * Drop the odd releases, and aim for a ~9 month cadence. This splits the 
> difference between the current even/odd pattern we've been doing.
>
>+ eliminate the confusing odd releases with dubious value
>+ waiting for the next release isn't quite as bad
>    - required upgrades every 9 months instead of every 12 months
>
> * Drop the odd releases, but relax the "must upgrade through every LTS" to 
> allow upgrades across 2 versions (e.g., luminous -> mimic or luminous -> 
> nautilus).  Shorten release cycle (~6-9 months).
>
>+ more flexibility for users
>    + downstreams have greater choice in adopting an upstream release
>- more LTS branches to maintain
>- more upgrade paths to consider
>
> Other options we should consider?  Other thoughts?
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___

Re: [ceph-users] Blocked requests

2017-09-07 Thread Brad Hubbard
Is it this?

https://bugzilla.redhat.com/show_bug.cgi?id=1430588

On Fri, Sep 8, 2017 at 7:01 AM, Matthew Stroud  wrote:
> After some troubleshooting, the issues appear to be caused by gnocchi using
> rados. I’m trying to figure out why.
>
>
>
> Thanks,
>
> Matthew Stroud
>
>
>
> From: Brian Andrus 
> Date: Thursday, September 7, 2017 at 1:53 PM
> To: Matthew Stroud 
> Cc: David Turner , "ceph-users@lists.ceph.com"
> 
>
>
> Subject: Re: [ceph-users] Blocked requests
>
>
>
> "ceph osd blocked-by" can do the same thing as that provided script.
>
>
>
> Can you post relevant osd.10 logs and a pg dump of an affected placement
> group? Specifically interested in recovery_state section.
>
>
>
> Hopefully you were careful in how you were rebooting OSDs, and not rebooting
> multiple in the same failure domain before recovery was able to occur.
>
>
>
> On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud 
> wrote:
>
> Here is the output of your snippet:
>
> [root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
>
>   6 osd.10
>
> 52  ops are blocked > 4194.3   sec on osd.17
>
> 9   ops are blocked > 2097.15  sec on osd.10
>
> 4   ops are blocked > 1048.58  sec on osd.10
>
> 39  ops are blocked > 262.144  sec on osd.10
>
> 19  ops are blocked > 131.072  sec on osd.10
>
> 6   ops are blocked > 65.536   sec on osd.10
>
> 2   ops are blocked > 32.768   sec on osd.10
>
>
>
> Here is some backfilling info:
>
>
>
> [root@mon01 ceph-conf]# ceph status
>
> cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
>
>  health HEALTH_WARN
>
> 5 pgs backfilling
>
> 5 pgs degraded
>
> 5 pgs stuck degraded
>
> 5 pgs stuck unclean
>
> 5 pgs stuck undersized
>
> 5 pgs undersized
>
> 122 requests are blocked > 32 sec
>
> recovery 2361/1097929 objects degraded (0.215%)
>
> recovery 5578/1097929 objects misplaced (0.508%)
>
>  monmap e1: 3 mons at
> {mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
>
> election epoch 58, quorum 0,1,2 mon01,mon02,mon03
>
>  osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
>
> flags sortbitwise,require_jewel_osds
>
>   pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
>
> 1005 GB used, 20283 GB / 21288 GB avail
>
> 2361/1097929 objects degraded (0.215%)
>
> 5578/1097929 objects misplaced (0.508%)
>
> 2587 active+clean
>
>5 active+undersized+degraded+remapped+backfilling
>
> [root@mon01 ceph-conf]# ceph pg dump_stuck unclean
>
> ok
>
> pg_stat state   up  up_primary  acting  acting_primary
>
> 3.5c2   active+undersized+degraded+remapped+backfilling [17,2,10]   17
> [17,2]  17
>
> 3.54a   active+undersized+degraded+remapped+backfilling [10,19,2]   10
> [10,17] 10
>
> 5.3bactive+undersized+degraded+remapped+backfilling [3,19,0]3
> [10,17] 10
>
> 5.b3active+undersized+degraded+remapped+backfilling [10,19,2]   10
> [10,17] 10
>
> 3.180   active+undersized+degraded+remapped+backfilling [17,10,6]   17
> [22,19] 22
>
>
>
> Most of the backfilling was caused by restarting OSDs to clear blocked
> IO. Here are some of the blocked IOs:
>
>
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10
> 10.20.57.15:6806/7029 9362 : cluster [WRN] slow request 60.834494 seconds
> old, received at 2017-09-07 13:28:36.143920: osd_op(client.114947.0:2039090
> 5.e637a4b3 (undecoded)
> ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently
> queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10
> 10.20.57.15:6806/7029 9363 : cluster [WRN] slow request 240.661052 seconds
> old, received at 2017-09-07 13:25:36.317363: osd_op(client.246934107.0:3
> 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently
> queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10
> 10.20.57.15:6806/7029 9364 : cluster [WRN] slow request 240.660763 seconds
> old, received at 2017-09-07 13:25:36.317651: osd_op(client.246944377.0:2
> 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently
> queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:36.978576 osd.10
> 10.20.57.15:6806/7029 9365 : cluster [WRN] slow request 240.660675 seconds
> old, received at 2017-09-07 13:25:36.317740: osd_op(client.246944377.0:3
> 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently
> queued_for_pg
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979367 osd.10
> 10.20.57.15:6806/7029 9366 : cluster [WRN] 72 slow requests, 3 included
> below; oldest blocked for > 1820.342287 secs
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:42.979373 osd.10
> 10.20.57.15:6806/7029 9367 : cluster [WRN] slow request 30.606290 seconds
> old, received at 2017-09-07 13:29:12.372999:
> osd_op(client.115008.0:996024003 5.e637a4b3 (undecoded)
> ondisk+write+skiprwlocks+known_if_redirected e6511) currently queued_for

[ceph-users] cephfs(Kraken 11.2.1), Unable to write more file when one dir more than 100000 files, mds_bal_fragment_size_max = 5000000

2017-09-07 Thread donglifec...@gmail.com
ZhengYan,

I am testing cephfs (Kraken 11.2.1). I can't write more files when one dir has more 
than 100000 files, even though I have already set "mds_bal_fragment_size_max = 5000000".

why is this case? Is it a bug?

Thanks a lot.



donglifec...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs(Kraken 11.2.1), Unable to write more file when one dir more than 100000 files, mds_bal_fragment_size_max = 5000000

2017-09-07 Thread Marcus Haarmann
It's a feature ... 

http://docs.ceph.com/docs/master/cephfs/dirfrags/ 
https://www.spinics.net/lists/ceph-users/msg31473.html 

Marcus Haarmann 


Von: donglifec...@gmail.com 
An: "zyan"  
CC: "ceph-users"  
Gesendet: Freitag, 8. September 2017 07:30:53 
Betreff: [ceph-users] cephfs(Kraken 11.2.1), Unable to write more file when one 
dir more than 100000 files, mds_bal_fragment_size_max = 5000000 

ZhengYan, 

I am testing cephfs (Kraken 11.2.1). I can't write more files when one dir has more 
than 100000 files, even though I have already set "mds_bal_fragment_size_max = 5000000". 

why is this case? Is it a bug? 

Thanks a lot. 


donglifec...@gmail.com 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs(Kraken 11.2.1), Unable to write more file when one dir more than 100000 files, mds_bal_fragment_size_max = 5000000

2017-09-07 Thread donglifec...@gmail.com
ZhengYan,

I'm sorry, here is a more detailed description of the problem.

When one dir has more than 100000 files, I can continue to write to it, but I can't 
find some files which were written in the past. For example:
1. I write 100000 files named 512k.file$i

2. I continue to write 10000 files named aaa.file$i

3. I continue to write 10000 files named bbb.file$i

4. I continue to write 10000 files named ccc.file$i

5. I continue to write 10000 files named ddd.file$i

6. I can't find all of the ddd.file$i; some ddd.file$i are missing. Such as:

[root@yj43959-ceph-dev scripts]# find /mnt/cephfs/volumes -type f | grep 512k.file | wc -l
100000
[root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/aaa.file* | wc -l
10000
[root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/bbb.file* | wc -l
10000
[root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/ccc.file* | wc -l
10000
[root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/ddd.file* | wc -l   # some files missing
1072
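
(For anyone wanting to reproduce the symptom, a minimal sketch along the
lines of the steps above; the mount point, file name prefix and file count
are placeholders.)

    import os

    mnt = '/mnt/cephfs/volumes'
    n = 140000   # comfortably past the ~100000 mark where files went missing

    for i in range(n):
        with open(os.path.join(mnt, 'test.file%d' % i), 'w') as f:
            f.write('x')

    found = sum(1 for name in os.listdir(mnt) if name.startswith('test.file'))
    print('wrote %d, directory lists %d' % (n, found))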




donglifec...@gmail.com
 
From: donglifec...@gmail.com
Date: 2017-09-08 13:30
To: zyan
CC: ceph-users
Subject: [ceph-users]cephfs(Kraken 11.2.1), Unable to write more file when one 
dir more than 100000 files, mds_bal_fragment_size_max = 5000000
ZhengYan,

I am testing cephfs (Kraken 11.2.1). I can't write more files when one dir has more 
than 100000 files, even though I have already set "mds_bal_fragment_size_max = 5000000".

why is this case? Is it a bug?

Thanks a lot.



donglifec...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com