Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-07-25 Thread Dietmar Rieder
On 7/25/19 11:55 AM, Konstantin Shalygin wrote:
>> we just recently upgraded our cluster from luminous 12.2.10 to nautilus
>> 14.2.1 and I noticed a massive increase of the space used on the cephfs
>> metadata pool although the used space in the 2 data pools  basically did
>> not change. See the attached graph (NOTE: log10 scale on y-axis)
>>
>> Is there any reason that explains this?
> 
> Dietmar, how is your metadata usage now? Has it stopped growing?

it is stable now and only changes as the number of files in the FS changes.
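
For reference, a quick way to keep an eye on this is the per-pool usage
output (the fs name below is ours, adjust to your setup):

# ceph df detail
# ceph fs status cephfs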

Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] Kernel, Distro & Ceph

2019-07-25 Thread Dietmar Rieder
On 7/24/19 10:05 PM, Wido den Hollander wrote:
> 
> 
> On 7/24/19 9:38 PM, dhils...@performair.com wrote:
>> All;
>>
>> There's been a lot of discussion of various kernel versions on this list 
>> lately, so I thought I'd seek some clarification.
>>
>> I prefer to run CentOS, and I prefer to keep the number of "extra" 
>> repositories to a minimum.  Ceph requires adding a Ceph repo, and the EPEL 
>> repo.  Updating the kernel requires (from the research I've done) adding 
>> EL-Repo.  I believe CentOS 7 uses the 3.10 kernel.
>>
> 
> Are you planning on using CephFS? Because only the clients using
> CephFS through the kernel client might require a newer kernel.
> 

We are running CentOS stock kernels on all our HPC nodes, which mount
cephfs via the kernel client (pg-upmap and quota enabled). No problems or
missing features noticed so far.
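
In case it is useful, this is roughly what such a mount looks like on our
side (monitor addresses, client name and secret file below are placeholders,
not our real ones):

# clients must be at least luminous before enabling pg-upmap
ceph osd set-require-min-compat-client luminous

# /etc/fstab entry on the HPC node, using the kernel client
mon1:6789,mon2:6789,mon3:6789:/ /cephfs ceph name=hpc,secretfile=/etc/ceph/hpc.secret,noatime,_netdev 0 0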

Dietmar

> The nodes in the Ceph cluster can run with the stock CentOS kernel.
> 
> Wido
> 
>> Under what circumstances would you recommend adding EL-Repo to CentOS 7.6, 
>> and installing kernel-ml?  Are there certain parts of Ceph which 
>> particularly benefit from kernels newer than 3.10?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA 
>> Director - Information Technology 
>> Perform Air International Inc.
>> dhils...@performair.com 
>> www.PerformAir.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] HEALTH_WARN 1 MDSs report slow metadata IOs

2019-07-17 Thread Dietmar Rieder
Hi,

thanks for the hint!! This did it.

I indeed found stuck requests using "ceph daemon mds.xxx objecter_requests".
I then restarted the OSDs involved in those requests one by one, and now
the problems are gone and the status is back to HEALTH_OK.
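
For the record, roughly what that looked like (our MDS name, one of the
OSD ids as an example):

# on the MDS host: list the in-flight objecter requests and note the OSDs
ceph daemon mds.cephmds-01 objecter_requests

# then restart the involved OSDs one by one, e.g.
systemctl restart ceph-osd@194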

Thanks again

Dietmar


On 7/17/19 9:08 AM, Yan, Zheng wrote:
> Check if there are any hung requests in 'ceph daemon mds.xxx objecter_requests'
> 
> On Tue, Jul 16, 2019 at 11:51 PM Dietmar Rieder
>  wrote:
>>
>> On 7/16/19 4:11 PM, Dietmar Rieder wrote:
>>> Hi,
>>>
>>> We are running ceph version 14.1.2 with cephfs only.
>>>
>>> I just noticed that one of our pgs had scrub errors which I could repair
>>>
>>> # ceph health detail
>>> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
>>> 1 scrub errors; Possible data damage: 1 pg inconsistent
>>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>> mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>>> oldest blocked for 47743 secs
>>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>> mdscephmds-01(mds.0): 2 slow requests are blocked > 30 secs
>>> OSD_SCRUB_ERRORS 1 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>> pg 6.e0b is active+clean+inconsistent, acting
>>> [194,23,116,183,149,82,42,132,26]
>>>
>>>
>>> Apparently I was able to repair the pg:
>>>
>>> #  rados list-inconsistent-pg hdd-ec-data-pool
>>> ["6.e0b"]
>>>
>>> # ceph pg repair 6.e0b
>>> instructing pg 6.e0bs0 on osd.194 to repair
>>>
>>> [...]
>>> 2019-07-16 15:07:13.700 7f851d720700  0 log_channel(cluster) log [DBG] :
>>> 6.e0b repair starts
>>> 2019-07-16 15:10:23.852 7f851d720700  0 log_channel(cluster) log [DBG] :
>>> 6.e0b repair ok, 0 fixed
>>> []
>>>
>>>
>>> However I still have HEALTH_WARN due to slow metadata IOs.
>>>
>>> # ceph health detail
>>> HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
>>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>> mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>>> oldest blocked for 51123 secs
>>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>> mdscephmds-01(mds.0): 5 slow requests are blocked > 30 secs
>>>
>>>
>>> I already rebooted all my client machines accessing the cephfs via
>>> kernel client, but the HEALTH_WARN status is still the one above.
>>>
>>> In the MDS log I see tons of the following messages:
>>>
>>> [...]
>>> 2019-07-16 16:08:17.770 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>>> slow request 1920.184123 seconds old, received at 2019-07-16
>>> 15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs
>>> #0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059,
>>> caller_gid=5{}) currently failed to rdlock, waiting
>>> 2019-07-16 16:08:19.069 7f7282533700  1 mds.cephmds-01 Updating MDS map
>>> to version 12642 from mon.0
>>> 2019-07-16 16:08:22.769 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>>> 5 slow requests, 0 included below; oldest blocked for > 49539.644840 secs
>>> 2019-07-16 16:08:26.683 7f7282533700  1 mds.cephmds-01 Updating MDS map
>>> to version 12643 from mon.0
>>> [...]
>>>
>>> How can I get back to normal?
>>>
>>> I'd be grateful for any help
>>
>>
>> after I restarted the 3 mds daemons I got rid of the blocked client
>> requests but there is still the slow metadata IOs warning:
>>
>>
>> # ceph health detail
>> HEALTH_WARN 1 MDSs report slow metadata IOs
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>> mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 563 secs
>>
>> the mds log has now these messages every ~5 seconds:
>> [...]
>> 2019-07-16 17:31:20.456 7f38947a2700  1 mds.cephmds-01 Updating MDS map
>> to version 13638 from mon.2
>> 2019-07-16 17:31:24.529 7f38947a2700  1 mds.cephmds-01 Updating MDS map
>> to version 13639 from mon.2
>> 2019-07-16 17:31:28.560 7f38947a2700  1 mds.cephmds-01 Updating MDS map
>> to version 13640 from mon.2
>> [...]
>>
>> What does this tell me? Can I do something about it?
>> For now I stopped all IO.
>>
>> Best
>>   Dietmar
>>
>>
>>
>>
>> --
>> _

Re: [ceph-users] HEALTH_WARN 1 MDSs report slow metadata IOs

2019-07-17 Thread Dietmar Rieder
On 7/16/19 5:34 PM, Dietmar Rieder wrote:
> On 7/16/19 4:11 PM, Dietmar Rieder wrote:
>> Hi,
>>
>> We are running ceph version 14.1.2 with cephfs only.
>>
>> I just noticed that one of our pgs had scrub errors which I could repair
>>
>> # ceph health detail
>> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
>> 1 scrub errors; Possible data damage: 1 pg inconsistent
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>> mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 47743 secs
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>> mdscephmds-01(mds.0): 2 slow requests are blocked > 30 secs
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 6.e0b is active+clean+inconsistent, acting
>> [194,23,116,183,149,82,42,132,26]
>>
>>
>> Apparently I was able to repair the pg:
>>
>> #  rados list-inconsistent-pg hdd-ec-data-pool
>> ["6.e0b"]
>>
>> # ceph pg repair 6.e0b
>> instructing pg 6.e0bs0 on osd.194 to repair
>>
>> [...]
>> 2019-07-16 15:07:13.700 7f851d720700  0 log_channel(cluster) log [DBG] :
>> 6.e0b repair starts
>> 2019-07-16 15:10:23.852 7f851d720700  0 log_channel(cluster) log [DBG] :
>> 6.e0b repair ok, 0 fixed
>> []
>>
>>
>> However I still have HEALTH_WARN due to slow metadata IOs.
>>
>> # ceph health detail
>> HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>> mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 51123 secs
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>> mdscephmds-01(mds.0): 5 slow requests are blocked > 30 secs
>>
>>
>> I already rebooted all my client machines accessing the cephfs via
>> kernel client, but the HEALTH_WARN status is still the one above.
>>
>> In the MDS log I see tons of the following messages:
>>
>> [...]
>> 2019-07-16 16:08:17.770 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>> slow request 1920.184123 seconds old, received at 2019-07-16
>> 15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs
>> #0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059,
>> caller_gid=5{}) currently failed to rdlock, waiting
>> 2019-07-16 16:08:19.069 7f7282533700  1 mds.cephmds-01 Updating MDS map
>> to version 12642 from mon.0
>> 2019-07-16 16:08:22.769 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>> 5 slow requests, 0 included below; oldest blocked for > 49539.644840 secs
>> 2019-07-16 16:08:26.683 7f7282533700  1 mds.cephmds-01 Updating MDS map
>> to version 12643 from mon.0
>> [...]
>>
>> How can I get back to normal?
>>
>> I'd be grateful for any help
> 
> 
> after I restarted the 3 mds daemons I got rid of the blocked client
> requests but there is still the slow metadata IOs warning:
> 
> 
> # ceph health detail
> HEALTH_WARN 1 MDSs report slow metadata IOs
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
> mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
> oldest blocked for 563 secs
> 
> the mds log has now these messages every ~5 seconds:
> [...]
> 2019-07-16 17:31:20.456 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13638 from mon.2
> 2019-07-16 17:31:24.529 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13639 from mon.2
> 2019-07-16 17:31:28.560 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13640 from mon.2
> [...]
> 
> What does this tell me? Can I do something about it?
> For now I stopped all IO.
> 

I have now waited about 12 h with no IO (cephfs was mounted, but no users
were accessing it), yet the slow metadata IOs warning is still there:

# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
oldest blocked for 40194 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdscephmds-01(mds.0): 1 slow requests are blocked > 30 secs


Ceph fs dump gives the following output:

# ceph fs dump
dumped fsmap epoch 24544
e24544
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' 

Re: [ceph-users] HEALTH_WARN 1 MDSs report slow metadata IOs

2019-07-16 Thread Dietmar Rieder
On 7/16/19 4:11 PM, Dietmar Rieder wrote:
> Hi,
> 
> We are running ceph version 14.1.2 with cephfs only.
> 
> I just noticed that one of our pgs had scrub errors which I could repair
> 
> # ceph health detail
> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
> 1 scrub errors; Possible data damage: 1 pg inconsistent
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
> mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
> oldest blocked for 47743 secs
> MDS_SLOW_REQUEST 1 MDSs report slow requests
> mdscephmds-01(mds.0): 2 slow requests are blocked > 30 secs
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 6.e0b is active+clean+inconsistent, acting
> [194,23,116,183,149,82,42,132,26]
> 
> 
> Apparently I was able to repair the pg:
> 
> #  rados list-inconsistent-pg hdd-ec-data-pool
> ["6.e0b"]
> 
> # ceph pg repair 6.e0b
> instructing pg 6.e0bs0 on osd.194 to repair
> 
> [...]
> 2019-07-16 15:07:13.700 7f851d720700  0 log_channel(cluster) log [DBG] :
> 6.e0b repair starts
> 2019-07-16 15:10:23.852 7f851d720700  0 log_channel(cluster) log [DBG] :
> 6.e0b repair ok, 0 fixed
> []
> 
> 
> However I still have HEALTH_WARN due to slow metadata IOs.
> 
> # ceph health detail
> HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
> mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
> oldest blocked for 51123 secs
> MDS_SLOW_REQUEST 1 MDSs report slow requests
> mdscephmds-01(mds.0): 5 slow requests are blocked > 30 secs
> 
> 
> I already rebooted all my client machines accessing the cephfs via
> kernel client, but the HEALTH_WARN status is still the one above.
> 
> In the MDS log I see tons of the following messages:
> 
> [...]
> 2019-07-16 16:08:17.770 7f727fd2e700  0 log_channel(cluster) log [WRN] :
> slow request 1920.184123 seconds old, received at 2019-07-16
> 15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs
> #0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059,
> caller_gid=5{}) currently failed to rdlock, waiting
> 2019-07-16 16:08:19.069 7f7282533700  1 mds.cephmds-01 Updating MDS map
> to version 12642 from mon.0
> 2019-07-16 16:08:22.769 7f727fd2e700  0 log_channel(cluster) log [WRN] :
> 5 slow requests, 0 included below; oldest blocked for > 49539.644840 secs
> 2019-07-16 16:08:26.683 7f7282533700  1 mds.cephmds-01 Updating MDS map
> to version 12643 from mon.0
> [...]
> 
> How can I get back to normal?
> 
> I'd be grateful for any help


After I restarted the 3 MDS daemons I got rid of the blocked client
requests, but the slow metadata IOs warning is still there:


# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
oldest blocked for 563 secs

the MDS log now has these messages every ~5 seconds:
[...]
2019-07-16 17:31:20.456 7f38947a2700  1 mds.cephmds-01 Updating MDS map
to version 13638 from mon.2
2019-07-16 17:31:24.529 7f38947a2700  1 mds.cephmds-01 Updating MDS map
to version 13639 from mon.2
2019-07-16 17:31:28.560 7f38947a2700  1 mds.cephmds-01 Updating MDS map
to version 13640 from mon.2
[...]

What does this tell me? Can I do something about it?
For now I stopped all IO.
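
For reference, the MDS admin socket can also show where things are stuck
(a sketch, using our MDS name):

# current in-flight MDS operations
ceph daemon mds.cephmds-01 dump_ops_in_flight

# OSD requests the MDS itself is still waiting on
ceph daemon mds.cephmds-01 objecter_requests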

Best
  Dietmar




-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






[ceph-users] HEALTH_WARN 1 MDSs report slow metadata IOs

2019-07-16 Thread Dietmar Rieder
Hi,

We are running ceph version 14.1.2 with cephfs only.

I just noticed that one of our PGs had scrub errors, which I could repair:

# ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
1 scrub errors; Possible data damage: 1 pg inconsistent
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
oldest blocked for 47743 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdscephmds-01(mds.0): 2 slow requests are blocked > 30 secs
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 6.e0b is active+clean+inconsistent, acting
[194,23,116,183,149,82,42,132,26]


Apparently I was able to repair the pg:

#  rados list-inconsistent-pg hdd-ec-data-pool
["6.e0b"]

# ceph pg repair 6.e0b
instructing pg 6.e0bs0 on osd.194 to repair

[...]
2019-07-16 15:07:13.700 7f851d720700  0 log_channel(cluster) log [DBG] :
6.e0b repair starts
2019-07-16 15:10:23.852 7f851d720700  0 log_channel(cluster) log [DBG] :
6.e0b repair ok, 0 fixed
[]
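
(For completeness, the per-object details of the inconsistency can be
listed as well; same pg id as above:)

# rados list-inconsistent-obj 6.e0b --format=json-pretty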


However, I still have HEALTH_WARN due to slow metadata IOs.

# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
oldest blocked for 51123 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdscephmds-01(mds.0): 5 slow requests are blocked > 30 secs


I already rebooted all my client machines accessing the cephfs via
kernel client, but the HEALTH_WARN status is still the one above.

In the MDS log I see tons of the following messages:

[...]
2019-07-16 16:08:17.770 7f727fd2e700  0 log_channel(cluster) log [WRN] :
slow request 1920.184123 seconds old, received at 2019-07-16
15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs
#0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059,
caller_gid=5{}) currently failed to rdlock, waiting
2019-07-16 16:08:19.069 7f7282533700  1 mds.cephmds-01 Updating MDS map
to version 12642 from mon.0
2019-07-16 16:08:22.769 7f727fd2e700  0 log_channel(cluster) log [WRN] :
5 slow requests, 0 included below; oldest blocked for > 49539.644840 secs
2019-07-16 16:08:26.683 7f7282533700  1 mds.cephmds-01 Updating MDS map
to version 12643 from mon.0
[...]

How can I get back to normal?

I'd be grateful for any help

 Thanks
   Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] Changing the release cadence

2019-06-06 Thread Dietmar Rieder
+1
Operator's view: a 12-month cycle is definitely better than 9 months. March
seems to be a reasonable compromise.

Best
  Dietmar

On 6/6/19 2:31 AM, Linh Vu wrote:
> I think 12 months cycle is much better from the cluster operations
> perspective. I also like March as a release month as well. 
> 
> *From:* ceph-users  on behalf of Sage
> Weil 
> *Sent:* Thursday, 6 June 2019 1:57 AM
> *To:* ceph-us...@ceph.com; ceph-de...@vger.kernel.org; d...@ceph.io
> *Subject:* [ceph-users] Changing the release cadence
>  
> Hi everyone,
> 
> Since luminous, we have had the following release cadence and policy:
>  - release every 9 months
>  - maintain backports for the last two releases
>  - enable upgrades to move either 1 or 2 releases ahead
>    (e.g., luminous -> mimic or nautilus; mimic -> nautilus or octopus; ...)
> 
> This has mostly worked out well, except that the mimic release received
> less attention than we wanted due to the fact that multiple downstream
> Ceph products (from Red Hat and SUSE) decided to base their next release
> on nautilus.  Even though upstream every release is an "LTS" release, as a
> practical matter mimic got less attention than luminous or nautilus.
> 
> We've had several requests/proposals to shift to a 12 month cadence. This
> has several advantages:
> 
>  - Stable/conservative clusters only have to be upgraded every 2 years
>    (instead of every 18 months)
>  - Yearly releases are more likely to intersect with downstream
>    distribution release (e.g., Debian).  In the past there have been
>    problems where the Ceph releases included in consecutive releases of a
>    distro weren't easily upgradeable.
>  - Vendors that make downstream Ceph distributions/products tend to
>    release yearly.  Aligning with those vendors means they are more likely
>    to productize *every* Ceph release.  This will help make every Ceph
>    release an "LTS" release (not just in name but also in terms of
>    maintenance attention).
> 
> So far the balance of opinion seems to favor a shift to a 12 month
> cycle[1], especially among developers, so it seems pretty likely we'll
> make that shift.  (If you do have strong concerns about such a move, now
> is the time to raise them.)
> 
> That brings us to an important decision: what time of year should we
> release?  Once we pick the timing, we'll be releasing at that time *every
> year* for each release (barring another schedule shift, which we want to
> avoid), so let's choose carefully!
> 
> A few options:
> 
>  - November: If we release Octopus 9 months from the Nautilus release
>    (planned for Feb, released in Mar) then we'd target this November.  We
>    could shift to a 12-month cadence after that.
>  - February: That's 12 months from the Nautilus target.
>  - March: That's 12 months from when Nautilus was *actually* released.
> 
> November is nice in the sense that we'd wrap things up before the
> holidays.  It's less good in that users may not be inclined to install the
> new release when many developers will be less available in December.
> 
> February kind of sucked in that the scramble to get the last few things
> done happened during the holidays.  OTOH, we should be doing what we can
> to avoid such scrambles, so that might not be something we should factor
> in.  March may be a bit more balanced, with a solid 3 months before when
> people are productive, and 3 months after before they disappear on holiday
> to address any post-release issues.
> 
> People tend to be somewhat less available over the summer months due to
> holidays etc, so an early or late summer release might also be less than
> ideal.
> 
> Thoughts?  If we can narrow it down to a few options maybe we could do a
> poll to gauge user preferences.
> 
> Thanks!
> sage
> 
> 
> [1]
> https://protect-au.mimecast.com/s/N1l6CROAEns1RN1Zu9Jwts?domain=twitter.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 





Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-05-09 Thread Dietmar Rieder
On 5/8/19 10:52 PM, Gregory Farnum wrote:
> On Wed, May 8, 2019 at 5:33 AM Dietmar Rieder
>  wrote:
>>
>> On 5/8/19 1:55 PM, Paul Emmerich wrote:
>>> Nautilus properly accounts metadata usage, so nothing changed it just
>>> shows up correctly now ;)
>>
>> OK, but then I'm not sure I understand why the increase was not sudden
>> (with the update) but it kept growing steadily over days.
> 
> Tracking the amount of data used by omap (ie, the internal RocksDB)
> isn't really possible to do live, and in the past we haven't done it
> at all. In Nautilus, it gets stats whenever a deep scrub happens so
> the omap data is always stale, but at least lets us approximate what's
> in use for a given PG.
> 
> So when you upgraded to Nautilus, the metadata pool scrubbed PGs over
> a period of days and each time a PG scrub finished the amount of data
> accounted to the pool as a whole increased. :)
> -Greg

Thanks for this clear explanation.
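
So if I understand it correctly, the accounting simply catches up as the
PGs get deep scrubbed. A sketch of how one could nudge that along for the
metadata pool (pool name and pg id below are just placeholders):

# list the PGs of the metadata pool
ceph pg ls-by-pool cephfs_metadata
# deep-scrub them one by one
ceph pg deep-scrub 2.0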

BTW what is the difference between the two following metrics:
ceph_pool_stored_raw
ceph_pool_stored

I expected ceph_pool_stored_raw to show larger values than
ceph_pool_stored, depending on the redundancy/replication level; however,
at least in our case the values are the same.

Best
  Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-05-08 Thread Dietmar Rieder
On 5/8/19 1:55 PM, Paul Emmerich wrote:
> Nautilus properly accounts metadata usage, so nothing changed it just
> shows up correctly now ;)

OK, but then I'm not sure I understand why the increase was not sudden
(with the update) but instead kept growing steadily over days.

~Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






[ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-05-08 Thread Dietmar Rieder
Hi,

we just recently upgraded our cluster from Luminous 12.2.10 to Nautilus
14.2.1, and I noticed a massive increase of the space used on the cephfs
metadata pool, although the used space in the 2 data pools basically did
not change. See the attached graph (NOTE: log10 scale on y-axis).

Is there any reason that explains this?

Thanks
  Dietmar


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at





Re: [ceph-users] Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

2019-05-03 Thread Dietmar Rieder
Hi,

to answer my question and for the record:

It turned out that the "device_health_metrics" pool was using PG 7.0,
which had no objects left after removing the pool, but the PG itself was
somehow not deleted/removed.

[root@cephosd-05 ~]# ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-98 --pgid 7.0 --op list
This did not show any objects.

Issuing the following command (thanks Elise for sharing your experience) on
all three OSDs successfully removed the PG:

[root@cephosd-05 ~]# ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-98 --op export-remove --pgid 7.0 --file
pg_7_0_from_osd_98.bin
Exporting 7.0 info 7.0( v 9010'82 (0'0,9010'82] lb MIN (bitwise)
local-lis/les=9008/9009 n=41 ec=9008/9008 lis/c 9008/9008 les/c/f
9009/9009/0 9019/9019/9019)
Export successful
 marking collection for removal
setting '_remove' omap key
finish_remove_pgs 7.0_head removing 7.0
Remove successful

Now I could start the three OSDs again and the cluster is HEALTHY.

I hope this gets fixed soon; meanwhile one should keep this in mind and be
careful when trying the ceph device monitoring and then deleting the
device_health_metrics pool.
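
(A sketch of the order I would use next time before touching that pool;
command names as I understand them in Nautilus:)

# turn the mgr device health monitoring off first
ceph device monitoring off
# only then remove the pool it created (requires mon_allow_pool_delete)
ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it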

Best
  Dietmar

On 5/3/19 10:09 PM, Dietmar Rieder wrote:
> HI,
> 
> I think I just hit the same problem on Nautilus 14.2.1.
> I tested the ceph device monitoring, which created a new pool
> (device_health_metrics); after looking into the monitoring feature, I
> turned it off again and removed the pool. This resulted in 3 OSDs down
> which cannot be started again since they keep crashing.
> 
> How can I reenable the OSDs?
> 
> I think the following is the relevant log.
> 
> 2019-05-03 21:24:05.265 7f8e96b8a700 -1
> bluestore(/var/lib/ceph/osd/ceph-206) _txc_add_transaction error (39)
> Directory not empty not handled on operation 21 (op 1, counting from 0)
> 2019-05-03 21:24:05.265 7f8e96b8a700  0
> bluestore(/var/lib/ceph/osd/ceph-206) _dump_transaction transaction dump:
> {
> "ops": [
> {
> "op_num": 0,
> "op_name": "remove",
> "collection": "7.0_head",
> "oid": "#7:head#"
> },
> {
> "op_num": 1,
> "op_name": "rmcoll",
> "collection": "7.0_head"
> }
> ]
> }
> 
> 2019-05-03 21:24:05.269 7f8e96b8a700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2
> .1/rpm/el7/BUILD/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function
> 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)' thread 7f8e96b8a700 tim
> e 2019-05-03 21:24:05.266723
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/os/blue
> store/BlueStore.cc: 11089: abort()
> 
>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::string
> const&)+0xd8) [0x55766fbd6cd0]
>  2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)+0x2a85) [0x5576701b6af5]
>  3:
> (BlueStore::queue_transactions(boost::intrusive_ptr&,
> std::vector std::allocator >&, boost::intrusive_
> ptr, ThreadPool::TPHandle*)+0x526) [0x5576701b7866]
>  4:
> (ObjectStore::queue_transaction(boost::intrusive_ptr&,
> ObjectStore::Transaction&&, boost::intrusive_ptr,
> ThreadPool::TPHandle*)+0x7f) [0x55766f
> d9274f]
>  5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x55766fdf577d]
>  6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
> [0x55766fdf6598]
>  7: (boost::statechart::simple_state PG::RecoveryState::ToDelete, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
> :na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statec
> hart::event_base const&, void const*)+0x16a) [0x55766fe355ca]
>  8:
> (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator,
> boost::statechart::null_exception_translator>::process_event(bo
> ost::statechart::event_base const&)+0x5a) [0x55766fe130ca]
>  9: (PG::do_peering_event(std::shared_ptr,
> PG::RecoveryCtx*)+0x119) [0x55766fe02389]
>  10: (OSD::dequeue_peering_evt(OSDShard*, PG*,
> std::shared_ptr, ThreadPool::TPHandle&)+0x1b4)
> [0x55766fd3c3c4]
>  11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
> T

Re: [ceph-users] Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

2019-05-03 Thread Dietmar Rieder
HI,

I think I just hit the same problem on Nautilus 14.2.1.
I tested the ceph device monitoring, which created a new pool
(device_health_metrics); after looking into the monitoring feature, I
turned it off again and removed the pool. This resulted in 3 OSDs down
which cannot be started again since they keep crashing.

How can I reenable the OSDs?

I think the following is the relevant log.

2019-05-03 21:24:05.265 7f8e96b8a700 -1
bluestore(/var/lib/ceph/osd/ceph-206) _txc_add_transaction error (39)
Directory not empty not handled on operation 21 (op 1, counting from 0)
2019-05-03 21:24:05.265 7f8e96b8a700  0
bluestore(/var/lib/ceph/osd/ceph-206) _dump_transaction transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "remove",
"collection": "7.0_head",
"oid": "#7:head#"
},
{
"op_num": 1,
"op_name": "rmcoll",
"collection": "7.0_head"
}
]
}

2019-05-03 21:24:05.269 7f8e96b8a700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2
.1/rpm/el7/BUILD/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function
'void BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ObjectStore::Transaction*)' thread 7f8e96b8a700 tim
e 2019-05-03 21:24:05.266723
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/os/blue
store/BlueStore.cc: 11089: abort()

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
(stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string
const&)+0xd8) [0x55766fbd6cd0]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ObjectStore::Transaction*)+0x2a85) [0x5576701b6af5]
 3:
(BlueStore::queue_transactions(boost::intrusive_ptr&,
std::vector >&, boost::intrusive_
ptr, ThreadPool::TPHandle*)+0x526) [0x5576701b7866]
 4:
(ObjectStore::queue_transaction(boost::intrusive_ptr&,
ObjectStore::Transaction&&, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x7f) [0x55766f
d9274f]
 5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x55766fdf577d]
 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
[0x55766fdf6598]
 7: (boost::statechart::simple_state,
(boost::statechart::history_mode)0>::react_impl(boost::statec
hart::event_base const&, void const*)+0x16a) [0x55766fe355ca]
 8:
(boost::statechart::state_machine,
boost::statechart::null_exception_translator>::process_event(bo
ost::statechart::event_base const&)+0x5a) [0x55766fe130ca]
 9: (PG::do_peering_event(std::shared_ptr,
PG::RecoveryCtx*)+0x119) [0x55766fe02389]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*,
std::shared_ptr, ThreadPool::TPHandle&)+0x1b4)
[0x55766fd3c3c4]
 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
ThreadPool::TPHandle&)+0x234) [0x55766fd3c804]
 12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x9f4) [0x55766fd30b44]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433)
[0x55767032ae93]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55767032df30]
 15: (()+0x7dd5) [0x7f8eb7162dd5]
 16: (clone()+0x6d) [0x7f8eb6028ead]

2019-05-03 21:24:05.274 7f8e96b8a700 -1 *** Caught signal (Aborted) **
 in thread 7f8e96b8a700 thread_name:tp_osd_tp

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
(stable)
 1: (()+0xf5d0) [0x7f8eb716a5d0]
 2: (gsignal()+0x37) [0x7f8eb5f61207]
 3: (abort()+0x148) [0x7f8eb5f628f8]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::string
const&)+0x19c) [0x55766fbd6d94]
 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ObjectStore::Transaction*)+0x2a85) [0x5576701b6af5]
 6:
(BlueStore::queue_transactions(boost::intrusive_ptr&,
std::vector >&, boost::intrusive_
ptr, ThreadPool::TPHandle*)+0x526) [0x5576701b7866]
 7:
(ObjectStore::queue_transaction(boost::intrusive_ptr&,
ObjectStore::Transaction&&, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x7f) [0x55766f
d9274f]
 8: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x55766fdf577d]
 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
[0x55766fdf6598]
 10: (boost::statechart::simple_state,
(boost::statechart::history_mode)0>::react_impl(boost::state
chart::event_base const&, void const*)+0x16a) [0x55766fe355ca]
 11:
(boost::statechart::state_machine,
boost::statechart::null_exception_translator>::process_event(b
oost::statechart::event_base const&)+0x5a) [0x55766fe130ca]
 12: (PG::do_peering_event(std::shared_ptr,
PG::RecoveryCtx*)+0x119) [0x55766fe02389]
 13: (OSD::dequeue_peering_evt(OSDShard*, PG*,
std::shared_ptr, ThreadPool::TPHandle&)+0x1b4)
[0x55766fd3c3c4]
 14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
ThreadPool::TPHandle&)+0x234) [0x55766fd3c804]
 15: (OSD::ShardedOpWQ::_process(unsigned int,

Re: [ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt

2019-01-23 Thread Dietmar Rieder
On 1/23/19 3:05 PM, Alfredo Deza wrote:
> On Wed, Jan 23, 2019 at 8:25 AM Jan Fajerski  wrote:
>>
>> On Wed, Jan 23, 2019 at 10:01:05AM +0100, Manuel Lausch wrote:
>>> Hi,
>>>
>>> thats a bad news.
>>>
>>> round about 5000 OSDs are affected from this issue. It's not realy a
>>> solution to redeploy this OSDs.
>>>
>>> Is it possible to migrate the local keys to the monitors?
>>> I see that the OSDs with the "lockbox feature" has only one key for
>>> data and journal partition and the older OSDs have individual keys for
>>> journal and data. Might this be a problem?
>>>
>>> And a other question.
>>> Is it a good idea to mix ceph-disk and ceph-volume managed OSDs on one
>>> host?
>>> So I could only migrate newer OSDs to ceph-volume and deploy new
>>> ones (after disk replacements) with ceph-volume until hopefuly there is
>>> a solution.
>> I might be wrong on this, since its been a while since I played with that. 
>> But
>> iirc you can't migrate a subset of ceph-disk OSDs to ceph-volume on one host.
>> Once you run ceph-volume simple activate, the ceph-disk systemd units and 
>> udev
>> profiles will be disabled. While the remaining ceph-disk OSDs will continue 
>> to
>> run, they won't come up after a reboot.
> 
> This is correct, once you "activate" ceph-disk OSDs via ceph-volume
> you are disabling all udev/systemd triggers for
> those OSDs, so you must migrate all.
> 
> I was assuming the question was more of a way to keep existing
> ceph-disk OSDs and create new ceph-volume OSDs, which you can, as long
> as this is not Nautilus or newer where ceph-disk doesn't exist
> 

Are there any plans to implement a command in ceph-volume that allows
creating simple volumes like the ones that are migrated from ceph-disk
using ceph-volume's scan and activate commands?
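
For context, the migration path we are talking about is the one below
(the device path is a placeholder):

# capture the metadata of an existing ceph-disk OSD data partition
ceph-volume simple scan /dev/sdb1
# switch the systemd units over so the scanned OSDs come up after a reboot
ceph-volume simple activate --all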


>> I'm sure there's a way to get them running again, but I imagine you'd rather 
>> not
>> manually deal with that.
>>>
>>> Regards
>>> Manuel
>>>
>>>
>>> On Tue, 22 Jan 2019 07:44:02 -0500
>>> Alfredo Deza  wrote:
>>>
>>>
 This is one case we didn't anticipate :/ We supported the wonky
 lockbox setup and thought we wouldn't need to go further back,
 although we did add support for both
 plain and luks keys.

 Looking through the code, it is very tightly couple to
 storing/retrieving keys from the monitors, and I don't know what
 workarounds might be possible here other than throwing away the OSD
 and deploying a new one (I take it this is not an option for you at
 all)


>>> Manuel Lausch
>>>
>>> Systemadministrator
>>> Storage Services
>>>
>>> 1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 |
>>> 76135 Karlsruhe | Germany Phone: +49 721 91374-1847
>>> E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de
>>>
>>> Hauptsitz Montabaur, Amtsgericht Montabaur, HRB 5452
>>>
>>> Geschäftsführer: Thomas Ludwig, Jan Oetjen, Sascha Vollmer
>>>
>>>
>>> Member of United Internet
>>>
>>> Diese E-Mail kann vertrauliche und/oder gesetzlich geschützte
>>> Informationen enthalten. Wenn Sie nicht der bestimmungsgemäße Adressat
>>> sind oder diese E-Mail irrtümlich erhalten haben, unterrichten Sie
>>> bitte den Absender und vernichten Sie diese E-Mail. Anderen als dem
>>> bestimmungsgemäßen Adressaten ist untersagt, diese E-Mail zu speichern,
>>> weiterzuleiten oder ihren Inhalt auf welche Weise auch immer zu
>>> verwenden.
>>>
>>> This e-mail may contain confidential and/or privileged information. If
>>> you are not the intended recipient of this e-mail, you are hereby
>>> notified that saving, distribution or use of the content of this e-mail
>>> in any way is prohibited. If you have received this e-mail in error,
>>> please notify the sender and delete the e-mail.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> --
>> Jan Fajerski
>> Engineer Enterprise Storage
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
>> HRB 21284 (AG Nürnberg)
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] disk controller failure

2018-12-14 Thread Dietmar Rieder
On 12/14/18 1:44 AM, Christian Balzer wrote:
> On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:
> 
>> On 13.12.2018 18:19, Alex Gorbachev wrote:
>>> On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
>>>  wrote:  
>>>> Hi Cephers,
>>>>
>>>> one of our OSD nodes is experiencing a Disk controller problem/failure
>>>> (frequent resetting), so the OSDs on this controller are flapping
>>>> (up/down in/out).
>>>>
>>>> I will hopefully get the replacement part soon.
>>>>
>>>> I have some simple questions, what are the best steps to take now before
>>>> an after replacement of the controller?
>>>>
>>>> - marking down and shutting down all osds on that node?
>>>> - waiting for rebalance is finished
>>>> - replace the controller
>>>> - just restart the osds? Or redeploy them, since they still hold data?
>>>>
>>>> We are running:
>>>>
>>>> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
>>>> (stable)
>>>> CentOS 7.5
>>>>
>>>> Sorry for my naive questions.  
>>> I usually do ceph osd set noout first to prevent any recoveries
>>>
>>> Then replace the hardware and make sure all OSDs come back online
>>>
>>> Then ceph osd unset noout
>>>
>>> Best regards,
>>> Alex  
>>
>>
>> Setting noout prevents the osd's from re-balancing.  ie when you do a 
>> short fix and do not want it to start re-balancing, since you know the 
>> data will be available shortly.. eg a reboot or similar.
>>
>> if osd's are flapping you normally want them out of the cluster, so they 
>> do not impact performance any more.
>>
> I think in this case the question is, how soon is the new controller going
> to be there?
> If it's soon and/or if rebalancing would severely impact the cluster
> performance, I'd set noout and then shut the node down, stopping both the
> flapping and preventing data movement. 
> Of course if it's a long time to repairs and/or a small cluster (is there
> even enough space to rebalance a node worth of data?) things may be
> different.
> 
> I always set "mon_osd_down_out_subtree_limit = host" (and monitor things
> of course) since I reckon a down node can often be brought back way faster
> than a full rebalance.


Thanks Christian for this comment and suggestion.

I think setting noout and shutting down the node is a good option, because
rebalancing would mean that ~22 TB of data has to be moved.
However, the spare part seems to be delayed, so I'm afraid I'll not get it
before Monday.
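
Just for my own notes, the plan would then roughly be (a sketch):

# before shutting the node down
ceph osd set noout
# on the affected node, stop all OSDs (or just power the node off)
systemctl stop ceph-osd.target
# ... replace the controller, boot the node, let the OSDs rejoin ...
ceph osd unset noout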

Best
  Dietmar

> 
> Regards,
> 
> Christian
>>
>> kind regards
>>
>> Ronny Aasen
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] disk controller failure

2018-12-13 Thread Dietmar Rieder
Hi Matthew,

thanks for your reply and advise, I really appreciate it.

So you are saying that there will be no problem when, after the
rebalancing, I restart the stopped OSDs? I mean, they still have the data
on them. (Sorry, I just don't want to mess something up.)

Best
  Dietmar

On 12/13/18 5:11 PM, Matthew Vernon wrote:
> Hi,
> 
> On 13/12/2018 15:48, Dietmar Rieder wrote:
> 
>> one of our OSD nodes is experiencing a Disk controller problem/failure
>> (frequent resetting), so the OSDs on this controller are flapping
>> (up/down in/out).
> 
> Ah, hardware...
> 
>> I have some simple questions, what are the best steps to take now before
>> an after replacement of the controller?
> 
> I would stop all the OSDs on the affected node and let the cluster
> rebalance. Once you've replaced the disk controller, start them up again
> and Ceph will rebalance back again.
> 
> Regards,
> 
> Matthew
> 
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at



[ceph-users] disk controller failure

2018-12-13 Thread Dietmar Rieder
Hi Cephers,

one of our OSD nodes is experiencing a disk controller problem/failure
(frequent resets), so the OSDs on this controller are flapping
(up/down, in/out).

I will hopefully get the replacement part soon.

I have some simple questions: what are the best steps to take now, before
and after replacement of the controller?

- marking down and shutting down all OSDs on that node?
- waiting until the rebalance is finished
- replacing the controller
- just restarting the OSDs? Or redeploying them, since they still hold data?

We are running:

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
(stable)
CentOS 7.5

Sorry for my naive questions.

Thanks for any help
  Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] ceph 12.2.9 release

2018-11-07 Thread Dietmar Rieder
On 11/7/18 11:59 AM, Konstantin Shalygin wrote:
>> I wonder if there is any release announcement for ceph 12.2.9 that I missed.
>> I just found the new packages on download.ceph.com, is this an official
>> release?
> 
> This is because 12.2.9 has several bugs. You should avoid using this
> release and wait for 12.2.10.

Thanks a lot!
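
A sketch of how one can avoid pulling 12.2.9 in by accident until 12.2.10
is out (a one-off exclude on the update run; the package globs below may be
incomplete for your installation):

# yum update --exclude='ceph*' --exclude='librados*' --exclude='librbd*' --exclude='libcephfs*'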

~Dietmar


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






[ceph-users] ceph 12.2.9 release

2018-11-07 Thread Dietmar Rieder
Hi,

I wonder if there is any release announcement for ceph 12.2.9 that I missed.
I just found the new packages on download.ceph.com; is this an official
release?

~ Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Web:   http://www.icbi.at






Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-15 Thread Dietmar Rieder
On 10/15/18 1:17 PM, jes...@krogh.cc wrote:
>> On 10/15/18 12:41 PM, Dietmar Rieder wrote:
>>> No big difference here.
>>> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64
>>
>> ...forgot to mention: all is luminous ceph-12.2.7
> 
> Thanks for your time in testing, this is very valuable to me in the
> debugging. 2 questions:
> 
> Did you "sleep 900" in-between the execution?
> Are you using the kernel client or the fuse client?
> 
> If I run them "right after each other" .. then I get the same behaviour.
> 

Hi, as I stated I'm using the kernel client, and yes I did the sleep 900
between the two runs.

~Dietmar





Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-15 Thread Dietmar Rieder
On 10/15/18 12:41 PM, Dietmar Rieder wrote:
> On 10/15/18 12:02 PM, jes...@krogh.cc wrote:
>>>> On Sun, Oct 14, 2018 at 8:21 PM  wrote:
>>>> how many cephfs mounts that access the file? Is is possible that some
>>>> program opens that file in RW mode (even they just read the file)?
>>>
>>>
>>> The nature of the program is that it is "prepped" by one-set of commands
>>> and queried by another, thus the RW case is extremely unlikely.
>>> I can change permission bits to rewoke the w-bit for the user, they
>>> dont need it anyway... it is just the same service-users that generates
>>> the data and queries it today.
>>
>> Just to remove the suspicion of other clients fiddling with the files I did a
>> more structured test. I have 4 x 10GB files from fio-benchmarking, total
>> 40GB . Hosted on
>>
>> 1) CephFS /ceph/cluster/home/jk
>> 2) NFS /z/home/jk
>>
>> First I read them .. then sleep 900 seconds .. then read again (just with dd)
>>
>> jk@sild12:/ceph/cluster/home/jk$ time  for i in $(seq 0 3); do echo "dd
>> if=test.$i.0 of=/dev/null bs=1M"; done  | parallel -j 4 ; sleep 900; time 
>> for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done  |
>> parallel -j 4
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.56413 s, 4.2 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.82234 s, 3.8 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.9361 s, 3.7 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 3.10397 s, 3.5 GB/s
>>
>> real0m3.449s
>> user0m0.217s
>> sys 0m11.497s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 315.439 s, 34.0 MB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 338.661 s, 31.7 MB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 354.725 s, 30.3 MB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 356.126 s, 30.2 MB/s
>>
>> real5m56.634s
>> user0m0.260s
>> sys 0m16.515s
>> jk@sild12:/ceph/cluster/home/jk$
>>
>>
>> Then NFS:
>>
>> jk@sild12:~$ time  for i in $(seq 0 3); do echo "dd if=test.$i.0
>> of=/dev/null bs=1M"; done  | parallel -j 4 ; sleep 900; time  for i in
>> $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done  | parallel
>> -j 4
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 1.60267 s, 6.7 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.18602 s, 4.9 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.47564 s, 4.3 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.54674 s, 4.2 GB/s
>>
>> real0m2.855s
>> user0m0.185s
>> sys 0m8.888s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 1.68613 s, 6.4 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 1.6983 s, 6.3 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.20059 s, 4.9 GB/s
>> 10240+0 records in
>> 10240+0 records out
>> 10737418240 bytes (11 GB, 10 GiB) copied, 2.58077 s, 4.2 GB/s
>>
>> real0m2.980s
>> user0m0.173s
>> sys 0m8.239s
>> jk@sild12:~$
>>
>>
>> Can I ask one of you to run the same "test" (or similar) .. and report back
>> i you can reproduce it?
> 
> here my test on e EC (6+3) pool using cephfs kernel client:
> 
> 7061+1 records in
> 7061+1 records out
> 7404496985 bytes (7.4 GB) copied, 3.62754 s, 2.0 GB/s
> 7450+1 records in
> 7450+1 records out
> 7812246720 bytes (7.8 GB) copied, 4.11908 s, 1.9 GB/s
> 7761+1 records in
> 7761+1 records out
> 8138636188 bytes (8.1 GB) copied, 4.34788 s, 1.9 GB/s
> 8212+1 records in
> 8212+1 records out
> 8611295220 bytes (8.6 GB) copied, 4.53371 s, 1.9 GB/s
> 
> real0m4.936s
> user0m0.275s
> sys 0m16.828s
> 
> 7061+1 records in
> 7061+1 records out
> 7404496985 bytes (7.4 GB) copied, 3.19726 s, 2.3 GB/s
> 7761+1 records in
> 7761+1 records out
> 8138636188 bytes (8.1 GB) copied, 3.31881 s, 2.5 GB/s
> 7450+1 records in
> 7450+1 records out
> 7812246720 bytes (7.8 GB) copied, 3.36354 s, 2.3 GB/s
> 8212+1 records in
> 8212+1 records out
> 8611295220 bytes (8.6 GB) copied, 3.74418 s, 2.3 GB/s
> 
> 
> No big difference here.
> all CentOS 7.5 official kernel 3.10.0-862.11.6.el7.x86_64

...forgot to mention: all is luminous ceph-12.2.7

~Dietmar





Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-15 Thread Dietmar Rieder
On 10/15/18 12:02 PM, jes...@krogh.cc wrote:
>>> On Sun, Oct 14, 2018 at 8:21 PM  wrote:
>>> how many cephfs mounts that access the file? Is is possible that some
>>> program opens that file in RW mode (even they just read the file)?
>>
>>
>> The nature of the program is that it is "prepped" by one-set of commands
>> and queried by another, thus the RW case is extremely unlikely.
>> I can change permission bits to rewoke the w-bit for the user, they
>> dont need it anyway... it is just the same service-users that generates
>> the data and queries it today.
> 
> Just to remove the suspicion of other clients fiddling with the files I did a
> more structured test. I have 4 x 10GB files from fio-benchmarking, total
> 40GB . Hosted on
> 
> 1) CephFS /ceph/cluster/home/jk
> 2) NFS /z/home/jk
> 
> First I read them .. then sleep 900 seconds .. then read again (just with dd)
> 
> jk@sild12:/ceph/cluster/home/jk$ time  for i in $(seq 0 3); do echo "dd
> if=test.$i.0 of=/dev/null bs=1M"; done  | parallel -j 4 ; sleep 900; time 
> for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done  |
> parallel -j 4
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.56413 s, 4.2 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.82234 s, 3.8 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.9361 s, 3.7 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 3.10397 s, 3.5 GB/s
> 
> real0m3.449s
> user0m0.217s
> sys 0m11.497s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 315.439 s, 34.0 MB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 338.661 s, 31.7 MB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 354.725 s, 30.3 MB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 356.126 s, 30.2 MB/s
> 
> real5m56.634s
> user0m0.260s
> sys 0m16.515s
> jk@sild12:/ceph/cluster/home/jk$
> 
> 
> Then NFS:
> 
> jk@sild12:~$ time  for i in $(seq 0 3); do echo "dd if=test.$i.0
> of=/dev/null bs=1M"; done  | parallel -j 4 ; sleep 900; time  for i in
> $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done  | parallel
> -j 4
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 1.60267 s, 6.7 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.18602 s, 4.9 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.47564 s, 4.3 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.54674 s, 4.2 GB/s
> 
> real0m2.855s
> user0m0.185s
> sys 0m8.888s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 1.68613 s, 6.4 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 1.6983 s, 6.3 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.20059 s, 4.9 GB/s
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 2.58077 s, 4.2 GB/s
> 
> real0m2.980s
> user0m0.173s
> sys 0m8.239s
> jk@sild12:~$
> 
> 
> Can I ask one of you to run the same "test" (or similar) .. and report back
> i you can reproduce it?

Here is my test on an EC (6+3) pool using the cephfs kernel client:

7061+1 records in
7061+1 records out
7404496985 bytes (7.4 GB) copied, 3.62754 s, 2.0 GB/s
7450+1 records in
7450+1 records out
7812246720 bytes (7.8 GB) copied, 4.11908 s, 1.9 GB/s
7761+1 records in
7761+1 records out
8138636188 bytes (8.1 GB) copied, 4.34788 s, 1.9 GB/s
8212+1 records in
8212+1 records out
8611295220 bytes (8.6 GB) copied, 4.53371 s, 1.9 GB/s

real0m4.936s
user0m0.275s
sys 0m16.828s

7061+1 records in
7061+1 records out
7404496985 bytes (7.4 GB) copied, 3.19726 s, 2.3 GB/s
7761+1 records in
7761+1 records out
8138636188 bytes (8.1 GB) copied, 3.31881 s, 2.5 GB/s
7450+1 records in
7450+1 records out
7812246720 bytes (7.8 GB) copied, 3.36354 s, 2.3 GB/s
8212+1 records in
8212+1 records out
8611295220 bytes (8.6 GB) copied, 3.74418 s, 2.3 GB/s


No big difference here.
All on the official CentOS 7.5 kernel 3.10.0-862.11.6.el7.x86_64.
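
For completeness, the loop I ran was essentially the same as yours, just
pointed at four test files of my own (directory and file names below are
placeholders for my scratch area on the EC pool):

  cd /cephfs/scratch/ddtest
  time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4
  sleep 900
  time for i in $(seq 0 3); do echo "dd if=test.$i.0 of=/dev/null bs=1M"; done | parallel -j 4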

HTH
  Dietmar



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs slow 6MB/s and rados bench sort of ok.

2018-08-28 Thread Dietmar Rieder
Try to update to kernel-3.10.0-862.11.6.el7.x86_64.rpm; that should solve the
problem.
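
If the host tracks the stock CentOS repos, something along these lines should
pull it in (it is the regular distribution kernel, no extra repo needed):

  yum update kernel
  reboot
  uname -r    # should now report 3.10.0-862.11.6.el7.x86_64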

Best
 Dietmar

On 28 August 2018 11:50:31 CEST, Marc Roos wrote:
>
>I have an idle test cluster (centos7.5, Linux c04 
>3.10.0-862.9.1.el7.x86_64) and a client with a cephfs kernel mount. 
>
>I tested reading a few files on this cephfs mount and get very low 
>results compared to the rados bench. What could be the issue here?
>
>[@client folder]# dd if=5GB.img of=/dev/null status=progress
>954585600 bytes (955 MB) copied, 157.455633 s, 6.1 MB/s
>
>
>
>I included this rados bench run which shows that the cluster 
>performance is more or less as expected.
>[@c01 ~]# rados bench -p fs_data 10 write
>hints = 1
>Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
>4194304 for up to 10 seconds or 0 objects
>Object prefix: benchmark_data_c01_453883
>  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>    0       0         0         0         0         0            -           0
>    1      16        58        42   167.967       168     0.252071    0.323443
>    2      16       106        90   179.967       192     0.583383    0.324867
>    3      16       139       123   163.973       132     0.170865    0.325976
>    4      16       183       167   166.975       176     0.413676    0.361364
>    5      16       224       208   166.374       164     0.394369    0.365956
>    6      16       254       238   158.642       120     0.698396    0.382729
>    7      16       278       262   149.692        96     0.120742    0.397625
>    8      16       317       301   150.478       156     0.786822    0.411193
>    9      16       360       344   152.867       172     0.601956    0.411577
>   10      16       403       387   154.778       172      0.20342    0.404114
>Total time run: 10.353683
>Total writes made:  404
>Write size: 4194304
>Object size:4194304
>Bandwidth (MB/sec): 156.08
>Stddev Bandwidth:   29.5778
>Max bandwidth (MB/sec): 192
>Min bandwidth (MB/sec): 96
>Average IOPS:   39
>Stddev IOPS:7
>Max IOPS:   48
>Min IOPS:   24
>Average Latency(s): 0.409676
>Stddev Latency(s):  0.243565
>Max latency(s): 1.25028
>Min latency(s): 0.0830112
>Cleaning up (deleting benchmark objects)
>Removed 404 objects
>Clean up completed and total clean up time :0.867185
>
>
>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
___
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web: http://www.icbi.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs client version in RedHat/CentOS 7.5

2018-08-21 Thread Dietmar Rieder
On 08/21/2018 02:22 PM, Ilya Dryomov wrote:
> On Tue, Aug 21, 2018 at 9:12 AM Dietmar Rieder
>  wrote:
>>
>> On 08/20/2018 05:36 PM, Ilya Dryomov wrote:
>>> On Mon, Aug 20, 2018 at 4:52 PM Dietmar Rieder
>>>  wrote:
>>>>
>>>> Hi Cephers,
>>>>
>>>>
>>>> I wonder if the cephfs client in RedHat/CentOS 7.5 will be updated to
>>>> luminous?
>>>> As far as I see there is some luminous related stuff that was
>>>> backported, however,
>>>> the "ceph features" command just reports "jewel" as release of my cephfs
>>>> clients running CentOS 7.5 (kernel 3.10.0-862.11.6.el7.x86_64)
>>>>
>>>>
>>>> {
>>>> "mon": {
>>>> "group": {
>>>> "features": "0x3ffddff8eea4fffb",
>>>> "release": "luminous",
>>>> "num": 3
>>>> }
>>>> },
>>>> "mds": {
>>>> "group": {
>>>> "features": "0x3ffddff8eea4fffb",
>>>> "release": "luminous",
>>>> "num": 3
>>>> }
>>>> },
>>>> "osd": {
>>>> "group": {
>>>> "features": "0x3ffddff8eea4fffb",
>>>> "release": "luminous",
>>>> "num": 240
>>>> }
>>>> },
>>>> "client": {
>>>> "group": {
>>>> "features": "0x7010fb86aa42ada",
>>>> "release": "jewel",
>>>> "num": 23
>>>> },
>>>> "group": {
>>>> "features": "0x3ffddff8eea4fffb",
>>>> "release": "luminous",
>>>> "num": 4
>>>> }
>>>> }
>>>> }
>>>>
>>>>
>>>> This prevents me from running the ceph balancer in upmap mode.
>>>>
>>>>
>>>> Any idea?
>>>
>>> Hi Dietmar,
>>>
>>> All luminous features are supported in RedHat/CentOS 7.5, but it shows
>>> up as jewel due to a technicality.  Just do
>>>
>>>   $ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it
>>>
>>> to override the safety check.
>>>
>>> See https://www.spinics.net/lists/ceph-users/msg45071.html for details.
>>> It references an upstream kernel, but both the problem and the solution
>>> are the same.
>>>
>>
>> Hi Ilya,
>>
>> thank you for your answer.
>>
>> Just to make sure:
>> The thread you are referring to, is about kernel 4.13+, is this also
>> true for the "official" RedHat/CentOS 7.5 kernel 3.10
>> (3.10.0-862.11.6.el7.x86_64) ?
> 
> Yes, it is.
> 

Thanks

Dietmar




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs client version in RedHat/CentOS 7.5

2018-08-21 Thread Dietmar Rieder
On 08/20/2018 05:36 PM, Ilya Dryomov wrote:
> On Mon, Aug 20, 2018 at 4:52 PM Dietmar Rieder
>  wrote:
>>
>> Hi Cephers,
>>
>>
>> I wonder if the cephfs client in RedHat/CentOS 7.5 will be updated to
>> luminous?
>> As far as I see there is some luminous related stuff that was
>> backported, however,
>> the "ceph features" command just reports "jewel" as release of my cephfs
>> clients running CentOS 7.5 (kernel 3.10.0-862.11.6.el7.x86_64)
>>
>>
>> {
>> "mon": {
>> "group": {
>> "features": "0x3ffddff8eea4fffb",
>> "release": "luminous",
>> "num": 3
>> }
>> },
>> "mds": {
>> "group": {
>> "features": "0x3ffddff8eea4fffb",
>> "release": "luminous",
>> "num": 3
>> }
>> },
>> "osd": {
>> "group": {
>> "features": "0x3ffddff8eea4fffb",
>> "release": "luminous",
>> "num": 240
>> }
>> },
>> "client": {
>> "group": {
>> "features": "0x7010fb86aa42ada",
>> "release": "jewel",
>> "num": 23
>> },
>> "group": {
>> "features": "0x3ffddff8eea4fffb",
>> "release": "luminous",
>> "num": 4
>> }
>> }
>> }
>>
>>
>> This prevents me from running the ceph balancer in upmap mode.
>>
>>
>> Any idea?
> 
> Hi Dietmar,
> 
> All luminous features are supported in RedHat/CentOS 7.5, but it shows
> up as jewel due to a technicality.  Just do
> 
>   $ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it
> 
> to override the safety check.
> 
> See https://www.spinics.net/lists/ceph-users/msg45071.html for details.
> It references an upstream kernel, but both the problem and the solution
> are the same.
> 

Hi Ilya,

thank you for your answer.

Just to make sure:
The thread you are referring to, is about kernel 4.13+, is this also
true for the "official" RedHat/CentOS 7.5 kernel 3.10
(3.10.0-862.11.6.el7.x86_64) ?

Best
  Dietmar





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs client version in RedHat/CentOS 7.5

2018-08-20 Thread Dietmar Rieder
Hi Cephers,


I wonder if the cephfs client in RedHat/CentOS 7.5 will be updated to
luminous?
As far as I see there is some luminous related stuff that was
backported, however,
the "ceph features" command just reports "jewel" as release of my cephfs
clients running CentOS 7.5 (kernel 3.10.0-862.11.6.el7.x86_64)


{
"mon": {
"group": {
"features": "0x3ffddff8eea4fffb",
"release": "luminous",
"num": 3
}
},
"mds": {
"group": {
"features": "0x3ffddff8eea4fffb",
"release": "luminous",
"num": 3
}
},
"osd": {
"group": {
"features": "0x3ffddff8eea4fffb",
"release": "luminous",
"num": 240
}
},
"client": {
"group": {
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 23
},
"group": {
"features": "0x3ffddff8eea4fffb",
"release": "luminous",
"num": 4
}
}
}


This prevents me from running the ceph balancer in upmap mode.


Any idea?

  Dietmar



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer: change from crush-compat to upmap

2018-08-17 Thread Dietmar Rieder
Hi Caspar,

did you have a chance yet to proceed with switching from crush-compat to
upmap?
If so, would you mind sharing your experience?
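
In case it helps, the sequence I have in mind for the switch is roughly the
following (untested here so far, so please treat it as a sketch):

  ceph balancer off
  ceph osd crush weight-set rm-compat
  ceph osd set-require-min-compat-client luminous
  ceph balancer mode upmap
  ceph balancer on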

Best
  Dietmar

On 07/18/2018 11:07 AM, Caspar Smit wrote:
> Hi Xavier,
> 
> Not yet, i got a little anxious in changing anything major in the
> cluster after reading about the 12.2.5 regressions since i'm also using
> bluestore and erasure coding.
> 
> So after this cluster is upgraded to 12.2.7 i'm proceeding forward with
> this. 
> 
> Kind regards,
> Caspar
> 
> 2018-07-16 8:34 GMT+02:00 Xavier Trilla  >:
> 
> Hi Caspar,
> 
> __ __
> 
> Did you find any information regarding the migration from
> crush-compat to unmap? I’m facing the same situation.
> 
> __ __
> 
> Thanks!
> 
> __ __
> 
> __ __
> 
> *De:* ceph-users  > *En nombre de *Caspar Smit
> *Enviado el:* lunes, 25 de junio de 2018 12:25
> *Para:* ceph-users  >
> *Asunto:* [ceph-users] Balancer: change from crush-compat to upmap
> 
> __ __
> 
> Hi All,
> 
> __ __
> 
> I've been using the balancer module in crush-compat mode for quite a
> while now and want to switch to upmap mode since all my clients are
> now luminous (v12.2.5)
> 
> __ __
> 
> i've reweighted the compat weight-set back to as close as the
> original crush weights using 'ceph osd crush reweight-compat'
> 
> __ __
> 
> Before i switch to upmap i presume i need to remove the compat
> weight set with:
> 
> __ __
> 
> ceph osd crush weight-set rm-compat
> 
> __ __
> 
> Will this have any significant impact (rebalancing lots of pgs) or
> does this have very little effect since i already reweighted
> everything back close to crush default weights?
> 
> __ __
> 
> Thanks in advance and kind regards,
> 
> Caspar
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RAID question for Ceph

2018-07-19 Thread Dietmar Rieder
On 07/19/2018 04:44 AM, Satish Patel wrote:
> If I have 8 OSD drives in a server with a P410i RAID controller (HP) and I
> want to make this server an OSD node, how should I
> configure RAID?
> 
> 1. Put all drives in RAID-0?
> 2. Put individual HDD in RAID-0 and create 8 individual RAID-0 so OS
> can see 8 separate HDD drives
> 
> What are most people doing in production for Ceph (BlueStore)?


We have P840ar controllers with battery backed cache in our OSD nodes
and configured an individual RAID-0 for each OSD (ceph luminous +
bluestore). We have not seen any problems with this setup so far and
performance is great at least for our workload.
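
For reference, we create the single-disk logical drives with ssacli; the
controller slot and drive bay below are just examples from one of our boxes:

  ssacli ctrl slot=0 create type=ld drives=1I:1:1 raid=0
  ssacli ctrl slot=0 ld all show status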

Dietmar



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Show and Tell: Grafana cluster dashboard

2018-05-07 Thread Dietmar Rieder
+1 for supporting both!

Disclosure: Prometheus user

Dietmar

On 05/07/2018 04:53 PM, Reed Dier wrote:
> I’ll +1 on InfluxDB rather than Prometheus, though I think having a version 
> for each infrastructure path would be best.
> I’m sure plenty here have existing InfluxDB infrastructure as their TSDB of 
> choice, and moving to Prometheus would be less advantageous.
> 
> Conversely, I’m sure all of the Prometheus folks would be less inclined to 
> move to InfluxDB for TSDB, so I think supporting both paths would be the best 
> choice.
> 
> Reed
> 
>> On May 7, 2018, at 3:06 AM, Marc Roos  wrote:
>>
>>
>> Looks nice 
>>
>> - I rather have some dashboards with collectd/influxdb.
>> - Take into account bigger tv/screens eg 65" uhd. I am putting more 
>> stats on them than viewing them locally in a webbrowser.
>> - What is to be considered most important to have on your ceph 
>> dashboard? As a newbie I find it difficult to determine what is 
>> important to monitor.
>> - Maybe also some docs on what metrics you have taken and argumentation 
>> on how you used them (could be usefull if one wants to modify the 
>> dashboard for some other backend)
>>
>> Ceph performance counters description.
>> https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3/paged/administration-guide/chapter-9-performance-counters
>>
>>
>> -Original Message-
>> From: Jan Fajerski [mailto:jfajer...@suse.com] 
>> Sent: maandag 7 mei 2018 12:32
>> To: ceph-devel
>> Cc: ceph-users
>> Subject: [ceph-users] Show and Tell: Grafana cluster dashboard
>>
>> Hi all,
>> I'd like to request comments and feedback about a Grafana dashboard for 
>> Ceph cluster monitoring.
>>
>> https://youtu.be/HJquM127wMY
>>
>> https://github.com/ceph/ceph/pull/21850
>>
>> The goal is to eventually have a set of default dashboards in the Ceph 
>> repository that offer decent monitoring for clusters of various (maybe 
>> even all) sizes and applications, or at least serve as a starting point 
>> for customizations.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Collecting BlueStore per Object DB overhead

2018-04-27 Thread Dietmar Rieder
Hi Wido,

thanks for the tool. Here are some stats from our cluster:

Ceph 12.2.4, 240 OSDs, CephFS only

        onodes   db_used_bytes   avg_obj_size   overhead_per_obj

Mean    214871      1574830080        2082298               7607
Max     309855      3018850304        3349799              17753
Min      61390       203423744         285059               3219
STDEV    63324       561838256         726776               2990

See the attached plot as well.
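
For reference, the summary above was calculated from the per-OSD lines the
script prints, with a small awk helper along these lines (the script name is
just how I saved the gist locally):

  python collect_overhead.py | awk '{ for (i=1; i<=NF; i++) if ($i ~ /^overhead/) { split($i, a, "="); s += a[2]; n++ } } END { printf "mean overhead per onode: %.0f bytes\n", s/n }'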

HTH
   Dietmar

On 04/26/2018 08:35 PM, Wido den Hollander wrote:
> Hi,
> 
> I've been investigating the per object overhead for BlueStore as I've
> seen this has become a topic for a lot of people who want to store a lot
> of small objects in Ceph using BlueStore.
> 
> I've writting a piece of Python code which can be run on a server
> running OSDs and will print the overhead.
> 
> https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> 
> Feedback on this script is welcome, but also the output of what people
> are observing.
> 
> The results from my tests are below, but what I see is that the overhead
> seems to range from 10kB to 30kB per object.
> 
> On RBD-only clusters the overhead seems to be around 11kB, but on
> clusters with a RGW workload the overhead goes higher to 20kB.
> 
> I know that partial overwrites and appends contribute to higher overhead
> on objects and I'm trying to investigate this and share my information
> with the community.
> 
> I have two use-cases who want to store >2 billion objects with a avg
> object size of 50kB (8 - 80kB) and the RocksDB overhead is likely to
> become a big problem.
> 
> Anybody willing to share the overhead they are seeing with what use-case?
> 
> The more data we have on this the better we can estimate how DBs need to
> be sized for BlueStore deployments.
> 
> Wido
> 
> # Cluster #1
> osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529
> overhead=12254
> osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002
> overhead=10996
> osd.10 onodes=195502 db_used_bytes=2395996160 avg_obj_size=6013645
> overhead=12255
> osd.30 onodes=186172 db_used_bytes=2393899008 avg_obj_size=6359453
> overhead=12858
> osd.1 onodes=169911 db_used_bytes=1799356416 avg_obj_size=4890883
> overhead=10589
> osd.0 onodes=199658 db_used_bytes=2028994560 avg_obj_size=4835928
> overhead=10162
> osd.15 onodes=204015 db_used_bytes=2384461824 avg_obj_size=5722715
> overhead=11687
> 
> # Cluster #2
> osd.1 onodes=221735 db_used_bytes=2773483520 avg_obj_size=5742992
> overhead_per_obj=12508
> osd.0 onodes=196817 db_used_bytes=2651848704 avg_obj_size=6454248
> overhead_per_obj=13473
> osd.3 onodes=212401 db_used_bytes=2745171968 avg_obj_size=6004150
> overhead_per_obj=12924
> osd.2 onodes=185757 db_used_bytes=356722 avg_obj_size=5359974
> overhead_per_obj=19203
> osd.5 onodes=198822 db_used_bytes=3033530368 avg_obj_size=6765679
> overhead_per_obj=15257
> osd.4 onodes=161142 db_used_bytes=2136997888 avg_obj_size=6377323
> overhead_per_obj=13261
> osd.7 onodes=158951 db_used_bytes=1836056576 avg_obj_size=5247527
> overhead_per_obj=11551
> osd.6 onodes=178874 db_used_bytes=2542796800 avg_obj_size=6539688
> overhead_per_obj=14215
> osd.9 onodes=195166 db_used_bytes=2538602496 avg_obj_size=6237672
> overhead_per_obj=13007
> osd.8 onodes=203946 db_used_bytes=3279945728 avg_obj_size=6523555
> overhead_per_obj=16082
> 
> # Cluster 3
> osd.133 onodes=68558 db_used_bytes=15868100608 avg_obj_size=14743206
> overhead_per_obj=231455
> osd.132 onodes=60164 db_used_bytes=13911457792 avg_obj_size=14539445
> overhead_per_obj=231225
> osd.137 onodes=62259 db_used_bytes=15597568000 avg_obj_size=15138484
> overhead_per_obj=250527
> osd.136 onodes=70361 db_used_bytes=14540603392 avg_obj_size=13729154
> overhead_per_obj=206657
> osd.135 onodes=68003 db_used_bytes=12285116416 avg_obj_size=12877744
> overhead_per_obj=180655
> osd.134 onodes=64962 db_used_bytes=14056161280 avg_obj_size=15923550
> overhead_per_obj=216375
> osd.139 onodes=68016 db_used_bytes=20782776320 avg_obj_size=13619345
> overhead_per_obj=305557
> osd.138 onodes=66209 db_used_bytes=12850298880 avg_obj_size=14593418
> overhead_per_obj=194086
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy: recommended?

2018-04-05 Thread Dietmar Rieder
On 04/04/2018 08:58 PM, Robert Stanford wrote:
> 
>  I read a couple of versions ago that ceph-deploy was not recommended
> for production clusters.  Why was that?  Is this still the case?  We
> have a lot of problems automating deployment without ceph-deploy.
> 

We are using it in production on our luminous cluster for deploying and
updating. No problems so far.  It is very helpful.
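
For what it is worth, a minimal sketch of the kind of invocation involved
(host and device names are placeholders; the osd create syntax shown is the
ceph-deploy 2.x one):

  ceph-deploy install --release luminous osd-node01
  ceph-deploy osd create --data /dev/sdb osd-node01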

Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-27 Thread Dietmar Rieder
Thanks Brad!

I added some information to the ticket.
Unfortunately I still could not grab a coredump, since there was no
segfault lately.

 http://tracker.ceph.com/issues/23431

Maybe Oliver has something to add as well.


Dietmar


On 03/27/2018 11:37 AM, Brad Hubbard wrote:
> "NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this."
> 
> Have you ever wondered what this means and why it's there? :)
> 
> This is at least something you can try. it may provide useful
> information, it may not.
> 
> This stack looks like it is either corrupted, or possibly not in ceph
> but in one of the linked libraries or glibc itself. If it's the
> former, it probably won't tell us anything. If it's the latter you
> will need the relevant debuginfo installed to get meaningful output
> and note that it will probably take a while. '' in this
> case is ceph-osd of course.
> 
> Alternatively, if you can upload a coredump and an sosreport (so I can
> validate exact versions of all packages installed) I can try and take
> a look.
> 
> On Fri, Mar 23, 2018 at 9:20 PM, Dietmar Rieder
> <dietmar.rie...@i-med.ac.at> wrote:
>> Hi,
>>
>>
>> I encountered one more two days ago, and I opened a ticket:
>>
>> http://tracker.ceph.com/issues/23431
>>
>> In our case it is more like 1 every two weeks, for now...
>> And it is affecting different OSDs on different hosts.
>>
>> Dietmar
>>
>> On 03/23/2018 11:50 AM, Oliver Freyermuth wrote:
>>> Hi together,
>>>
>>> I notice exactly the same, also the same addresses, Luminous 12.2.4, CentOS 
>>> 7.
>>> Sadly, logs are equally unhelpful.
>>>
>>> It happens randomly on an OSD about once per 2-3 days (of the 196 total 
>>> OSDs we have). It's also not a container environment.
>>>
>>> Cheers,
>>>   Oliver
>>>
>>> Am 08.03.2018 um 15:00 schrieb Dietmar Rieder:
>>>> Hi,
>>>>
>>>> I noticed in my client (using cephfs) logs that an osd was unexpectedly
>>>> going down.
>>>> While checking the osd logs for the affected OSD I found that the osd
>>>> was seg faulting:
>>>>
>>>> []
>>>> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
>>>> (Segmentation fault) **
>>>>  in thread 7fd9af370700 thread_name:safe_timer
>>>>
>>>>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>>>> luminous (stable)
>>>>1: (()+0xa3c611) [0x564585904611]
>>>> 2: (()+0xf5e0) [0x7fd9b66305e0]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>> needed to interpret this.
>>>> [...]
>>>>
>>>> Should I open a ticket for this? What additional information is needed?
>>>>
>>>>
>>>> I put the relevant log entries for download under [1], so maybe someone
>>>> with more
>>>> experience can find some useful information therein.
>>>>
>>>> Thanks
>>>>   Dietmar
>>>>
>>>>
>>>> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>>>>
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>> _
>> D i e t m a r  R i e d e r, Mag.Dr.
>> Innsbruck Medical University
>> Biocenter - Division for Bioinformatics
>> Innrain 80, 6020 Innsbruck
>> Phone: +43 512 9003 71402
>> Fax: +43 512 9003 73100
>> Email: dietmar.rie...@i-med.ac.at
>> Web:   http://www.icbi.at
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-23 Thread Dietmar Rieder
Hi,


I encountered one more two days ago, and I opened a ticket:

http://tracker.ceph.com/issues/23431

In our case it is more like 1 every two weeks, for now...
And it is affecting different OSDs on different hosts.

Dietmar

On 03/23/2018 11:50 AM, Oliver Freyermuth wrote:
> Hi together,
> 
> I notice exactly the same, also the same addresses, Luminous 12.2.4, CentOS 
> 7. 
> Sadly, logs are equally unhelpful. 
> 
> It happens randomly on an OSD about once per 2-3 days (of the 196 total OSDs 
> we have). It's also not a container environment. 
> 
> Cheers,
>   Oliver
> 
> Am 08.03.2018 um 15:00 schrieb Dietmar Rieder:
>> Hi,
>>
>> I noticed in my client (using cephfs) logs that an osd was unexpectedly
>> going down.
>> While checking the osd logs for the affected OSD I found that the osd
>> was seg faulting:
>>
>> []
>> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
>> (Segmentation fault) **
>>  in thread 7fd9af370700 thread_name:safe_timer
>>
>>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>> luminous (stable)
>>1: (()+0xa3c611) [0x564585904611]
>> 2: (()+0xf5e0) [0x7fd9b66305e0]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>> [...]
>>
>> Should I open a ticket for this? What additional information is needed?
>>
>>
>> I put the relevant log entries for download under [1], so maybe someone
>> with more
>> experience can find some useful information therein.
>>
>> Thanks
>>   Dietmar
>>
>>
>> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-03-14 Thread Dietmar Rieder
On 03/14/2018 01:48 PM, Lars Marowsky-Bree wrote:
> On 2018-02-28T02:38:34, Patrick Donnelly  wrote:
> 
>> I think it will be necessary to reduce the actives to 1 (max_mds -> 1;
>> deactivate other ranks), shutdown standbys, upgrade the single active,
>> then upgrade/start the standbys.
>>
>> Unfortunately this didn't get flagged in upgrade testing. Thanks for
>> the report Dan.
> 
> This means that - when the single active is being updated - there's a
> time when there's no MDS active, right?
> 
> Is another approach theoretically feasible? Have the updated MDS only go
> into the incompatible mode once there's a quorum of new ones available,
> or something?
> 
> (From the point of view of a distributed system, this is double plus
> ungood.)

here is what I did and what worked without any problem (rough commands below):

we have mons and mds on the same hosts, 3 hosts in total

1. stop the 2 non-active (standby) mds
2. update ceph on all 3 hosts
3. restart the active mds
4. start the mds on the other two hosts
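
Roughly, in commands (the mds instance names are placeholders for our three
combined mon/mds hosts; CentOS, so yum):

  systemctl stop ceph-mds@mds2 ceph-mds@mds3    # the two standbys
  yum update ceph                               # on all 3 hosts
  systemctl restart ceph-mds@mds1               # the active one
  systemctl start ceph-mds@mds2 ceph-mds@mds3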

HTH
  Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mds suicide on upgrade

2018-03-12 Thread Dietmar Rieder
Hi,

See: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/025092.html

Might be of interest.

Dietmar

On 12 March 2018 18:19:51 CET, Reed Dier wrote:
>Figured I would see if anyone has seen this or can see something I am
>doing wrong.
>
>Upgrading all of my daemons from 12.2.2. to 12.2.4.
>
>Followed the documentation, upgraded mons, mgrs, osds, then mds’s in
>that order.
>
>All was fine, until the MDSs.
>
>I have two MDS’s in Active:Standby config. I decided it made sense to
>upgrade the Standby MDS, so I could gracefully step down the current
>active, after the standby was upgraded.
>
>However, when I upgraded the standby, it caused the working active to
>suicide, and the then standby to immediately rejoin as active when it
>restarted, which didn’t leave me feeling warm and fuzzy about upgrading
>MDS’s in the future.
>
>Attaching log entries that would appear to be the culprit.
>
>> 2018-03-12 13:07:38.981339 7ff0cdc40700  0 mds.0 handle_mds_map
>mdsmap compatset compat={},rocompat={},incompat={1=base v0.20,2=client
>writeable ranges,3=default file layouts on dirs,4=dir inode in separate
>object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
>anchor table,9=file layout v2} not writeable with daemon features
>compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>ranges,3=default file layouts on dirs,4=dir inode in separate
>object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
>uses inline data,8=file layout v2}, killing myself
>> 2018-03-12 13:07:38.981353 7ff0cdc40700  1 mds.0 suicide.  wanted
>state up:active
>> 2018-03-12 13:07:40.000753 7ff0cdc40700  1 mds.0.119543 shutdown:
>shutting down rank 0
>> 2018-03-12 13:08:27.325667 7f32cc992200  0 set uid:gid to 64045:64045
>(ceph:ceph)
>> 2018-03-12 13:08:27.325687 7f32cc992200  0 ceph version 12.2.4
>(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process
>(unknown), pid 66854
>> 2018-03-12 13:08:27.326795 7f32cc992200  0 pidfile_write: ignore
>empty --pid-file
>> 2018-03-12 13:08:32.350266 7f32c6440700  1 mds.0 handle_mds_map
>standby
>
>Hopefully there may be some config issue with my mds_map or something
>like that which may be an easy fix to prevent something like this in
>the future.
>
>Thanks,
>
>Reed
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
___
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web: http://www.icbi.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-09 Thread Dietmar Rieder
On 03/09/2018 12:49 AM, Brad Hubbard wrote:
> On Fri, Mar 9, 2018 at 3:54 AM, Subhachandra Chandra
> <schan...@grailbio.com> wrote:
>> I noticed a similar crash too. Unfortunately, I did not get much info in the
>> logs.
>>
>>  *** Caught signal (Segmentation fault) **
>>
>> Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]:  in thread 7f63a0a97700
>> thread_name:safe_timer
>>
>> Mar 07 17:58:28 data7 ceph-osd-run.sh[796380]: docker_exec.sh: line 56:
>> 797138 Segmentation fault  (core dumped) "$@"
> 
> The log isn't very helpful AFAICT. Are these both container
> environments? If so, what are the details (OS, etc.).

In my case (reported in the OP) it is not a container. I'm running

- ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
- CentOS 7.4 (fully updated on 03.02.2017)
- Spectre and Meltdown workarounds disabled (kernel options: noibrs
noibpb nopti; see the grub sketch below)

3x MON/MDS hosts (128GB RAM)
10x OSD hosts 22 HDD + 2 SSD osds + 2 NVME for wal/db each (128GB)

ceph is using bluestore
wal and db are separated on NVME devices (1GB wal, 64GB db)

3 pools:
  1: 3 x replicated (all SSD osds): data
  2: 3 x replicated (all SSD osds): metadata pool for EC pool
  3: 6+3 EC pool (all HDD) -> metadata on pool 2

pools are used for cephfs only
# ceph fs ls
name: cephfs, metadata pool: ssd-rep-metadata-pool, data pools:
[hdd-ec-data-pool ssd-rep-data-pool ]
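
For completeness, the mitigation switches above are plain kernel command line
options; on a stock CentOS 7 grub2 setup (BIOS boot assumed) they are set
roughly like this:

  # append "noibrs noibpb nopti" to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot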


> 
> Can anyone capture a core file? Please feel free to open a tracker on this.

I have no core file available (none was dumped), and so far I've noticed just
that single segfault.
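
One way to be prepared for the next occurrence is to raise the core limit for
the OSD units and point the core pattern somewhere with enough space (sketch
for CentOS 7 / systemd; paths are assumptions, takes effect on the next OSD
restart):

  mkdir -p /etc/systemd/system/ceph-osd@.service.d
  printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/ceph-osd@.service.d/coredump.conf
  systemctl daemon-reload
  echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

For the objdump the crash message asks for, this should do (with
ceph-debuginfo installed so -S can interleave source):

  objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.objdump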

Dietmar

> 
>>
>>
>> Thanks
>>
>> Subhachandra
>>
>>
>>
>> On Thu, Mar 8, 2018 at 6:00 AM, Dietmar Rieder <dietmar.rie...@i-med.ac.at>
>> wrote:
>>>
>>> Hi,
>>>
>>> I noticed in my client (using cephfs) logs that an osd was unexpectedly
>>> going down.
>>> While checking the osd logs for the affected OSD I found that the osd
>>> was seg faulting:
>>>
>>> []
>>> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
>>> (Segmentation fault) **
>>>  in thread 7fd9af370700 thread_name:safe_timer
>>>
>>>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>>> luminous (stable)
>>>1: (()+0xa3c611) [0x564585904611]
>>> 2: (()+0xf5e0) [0x7fd9b66305e0]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> needed to interpret this.
>>> [...]
>>>
>>> Should I open a ticket for this? What additional information is needed?
>>>
>>>
>>> I put the relevant log entries for download under [1], so maybe someone
>>> with more
>>> experience can find some useful information therein.
>>>
>>> Thanks
>>>   Dietmar
>>>
>>>
>>> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>>>
>>> --
>>> _
>>> D i e t m a r  R i e d e r, Mag.Dr.
>>> Innsbruck Medical University
>>> Biocenter - Division for Bioinformatics
>>> Email: dietmar.rie...@i-med.ac.at
>>> Web:   http://www.icbi.at
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-08 Thread Dietmar Rieder
Hi,

I noticed in my client (using cephfs) logs that an osd was unexpectedly
going down.
While checking the osd logs for the affected OSD I found that the osd
was seg faulting:

[]
2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7fd9af370700 thread_name:safe_timer

  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
luminous (stable)
   1: (()+0xa3c611) [0x564585904611]
2: (()+0xf5e0) [0x7fd9b66305e0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
[...]

Should I open a ticket for this? What additional information is needed?


I put the relevant log entries for download under [1], so maybe someone
with more
experience can find some useful information therein.

Thanks
  Dietmar


[1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Dietmar Rieder
Thanks for making this clear.
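
For the archives, a minimal sketch of the manual steps this implies on our
kind of setup (device names, partition numbers and sizes are just examples):

  sgdisk --new=1:0:+1G  --change-name=1:"ceph block.wal" /dev/nvme0n1
  sgdisk --new=2:0:+64G --change-name=2:"ceph block.db"  /dev/nvme0n1
  ceph-volume lvm create --bluestore --data /dev/sdc \
      --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p2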

Dietmar

On 02/27/2018 05:29 PM, Alfredo Deza wrote:
> On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
> <dietmar.rie...@i-med.ac.at> wrote:
>> ... however, it would be nice if ceph-volume would also create the
>> partitions for the WAL and/or DB if needed. Is there a special reason,
>> why this is not implemented?
> 
> Yes, the reason is that this was one of the most painful points in
> ceph-disk (code and maintenance-wise): to be in the business of
> understanding partitions, sizes, requirements, and devices
> is non-trivial.
> 
> One of the reasons ceph-disk did this was because it required quite a
> hefty amount of "special sauce" on partitions so that these would be
> discovered later by mechanisms that included udev.
> 
> If an admin wanted more flexibility, we decided that it had to be up
> to configuration management system (or whatever deployment mechanism)
> to do so. For users that want a simplistic approach (in the case of
> bluestore)
> we have a 1:1 mapping for device->logical volume->OSD
> 
> On the ceph-volume side as well, implementing partitions meant to also
> have a similar support for logical volumes, which have lots of
> variations that can be supported and we were not willing to attempt to
> support them all.
> 
> Even a small subset would inevitably bring up the question of "why is
> setup X not supported by ceph-volume if setup Y is?"
> 
> Configuration management systems are better suited for handling these
> situations, and we would prefer to offload that responsibility to
> those systems.
> 
>>
>> Dietmar
>>
>>
>> On 02/27/2018 04:25 PM, David Turner wrote:
>>> Gotcha.  As a side note, that setting is only used by ceph-disk as
>>> ceph-volume does not create partitions for the WAL or DB.  You need to
>>> create those partitions manually if using anything other than a whole
>>> block device when creating OSDs with ceph-volume.
>>>
>>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit <caspars...@supernas.eu
>>> <mailto:caspars...@supernas.eu>> wrote:
>>>
>>> David,
>>>
>>> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
>>> just to inform other people that Ceph's default of 1GB is pretty low.
>>> Now that i read my own sentence it indeed looks as if i was using
>>> 1GB partitions, sorry for the confusion.
>>>
>>> Caspar
>>>
>>> 2018-02-27 14:11 GMT+01:00 David Turner <drakonst...@gmail.com
>>> <mailto:drakonst...@gmail.com>>:
>>>
>>> If you're only using a 1GB DB partition, there is a very real
>>> possibility it's already 100% full. The safe estimate for DB
>>> size seams to be 10GB/1TB so for a 4TB osd a 40GB DB should work
>>> for most use cases (except loads and loads of small files).
>>> There are a few threads that mention how to check how much of
>>> your DB partition is in use. Once it's full, it spills over to
>>> the HDD.
>>>
>>>
>>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>>> <caspars...@supernas.eu <mailto:caspars...@supernas.eu>> wrote:
>>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
>>> <gfar...@redhat.com <mailto:gfar...@redhat.com>>:
>>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>>> <caspars...@supernas.eu <mailto:caspars...@supernas.eu>>
>>> wrote:
>>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner
>>> <drakonst...@gmail.com <mailto:drakonst...@gmail.com>>:
>>>
>>> Caspar, it looks like your idea should work.
>>> Worst case scenario seems like the osd wouldn't
>>> start, you'd put the old SSD back in and go back
>>> to the idea to weight them to 0, backfilling,
>>> then recreate the osds. Definitely with a try in
>>> my opinion, and I'd love to hear your experience
>>> after.
>>>
>>>
>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this
>>> ML, you're really putting a lot of effort into
>>> answering many

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Dietmar Rieder
... however, it would be nice if ceph-volume would also create the
partitions for the WAL and/or DB if needed. Is there a special reason
why this is not implemented?

Dietmar


On 02/27/2018 04:25 PM, David Turner wrote:
> Gotcha.  As a side note, that setting is only used by ceph-disk as
> ceph-volume does not create partitions for the WAL or DB.  You need to
> create those partitions manually if using anything other than a whole
> block device when creating OSDs with ceph-volume.
> 
> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit  > wrote:
> 
> David,
> 
> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
> just to inform other people that Ceph's default of 1GB is pretty low.
> Now that i read my own sentence it indeed looks as if i was using
> 1GB partitions, sorry for the confusion.
> 
> Caspar
> 
> 2018-02-27 14:11 GMT+01:00 David Turner  >:
> 
> If you're only using a 1GB DB partition, there is a very real
> possibility it's already 100% full. The safe estimate for DB
> size seams to be 10GB/1TB so for a 4TB osd a 40GB DB should work
> for most use cases (except loads and loads of small files).
> There are a few threads that mention how to check how much of
> your DB partition is in use. Once it's full, it spills over to
> the HDD.
> 
> 
> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
> > wrote:
> 
> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
> >:
> 
> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
> >
> wrote:
> 
> 2018-02-24 7:10 GMT+01:00 David Turner
> >:
> 
> Caspar, it looks like your idea should work.
> Worst case scenario seems like the osd wouldn't
> start, you'd put the old SSD back in and go back
> to the idea to weight them to 0, backfilling,
> then recreate the osds. Definitely with a try in
> my opinion, and I'd love to hear your experience
> after.
> 
> 
> Hi David,
> 
> First of all, thank you for ALL your answers on this
> ML, you're really putting a lot of effort into
> answering many questions asked here and very often
> they contain invaluable information.
> 
> 
> To follow up on this post i went out and built a
> very small (proxmox) cluster (3 OSD's per host) to
> test my suggestion of cloning the DB/WAL SDD. And it
> worked!
> Note: this was on Luminous v12.2.2 (all bluestore,
> ceph-disk based OSD's)
> 
> Here's what i did on 1 node:
> 
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop
> osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv  
> /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK
> and all OSD's up/in
> 6) ceph osd unset noout
> 
> I assume that once the ddrescue step is finished a
> 'partprobe' or something similar is triggered and
> udev finds the DB partitions on the new SSD and
> starts the OSD's again (kind of what happens during
> hotplug)
> So it is probably better to clone the SSD in another
> (non-ceph) system to not trigger any udev events.
> 
> I also tested a reboot after this and everything
> still worked.
> 
> 
> The old SSD was 120GB and the new is 256GB (cloning
> took around 4 minutes)
> Delta of data was very low because it was a test
> cluster.
> 
> All in all the OSD's in question were 'down' for
> only 5 minutes (so i stayed within the
> ceph_osd_down_out interval of the default 10 minutes
> and didn't actually need to set noout :)
> 
> 
> I kicked off a brief discussion about this with some of
> the BlueStore guys and they're aware of the problem with
> migrating 

Re: [ceph-users] Rocksdb: Try to delete WAL files size....

2018-02-12 Thread Dietmar Rieder
Anyone?

On 9 February 2018 09:59:54 CET, Dietmar Rieder
<dietmar.rie...@i-med.ac.at> wrote:
>Hi,
>
>we are running ceph version 12.2.2 (10 nodes, 240 OSDs, 3 mon). While
>monitoring the WAL db used bytes we noticed that there are some OSDs
>that use proportionally more WAL db bytes than others (800Mb vs 300Mb).
>These OSDs eventually exceed the WAL db size (1GB in our case) and
>spill
>over to the HDD data device. So it seems flushing the WAL db does not
>free space.
>
>We looked for some hints in the logs of the OSDs in question and
>spotted
>the following entries:
>
>[...]
>2018-02-08 16:17:27.496695 7f0ffce55700  4 rocksdb:
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_write.cc:684]
>reusing log 152 from recycle list
>2018-02-08 16:17:27.496768 7f0ffce55700  4 rocksdb:
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_write.cc:725]
>[default] New memtable created with log file: #162. Immutable
>memtables: 0.
>2018-02-08 16:17:27.496976 7f0fe7e2b700  4 rocksdb: (Original Log Time
>2018/02/08-16:17:27.496841)
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:1158]
>Calling FlushMemTableToOutputFile with column family [default], flush
>slots available 1, compaction slots allowed 1, compaction slots
>scheduled 1
>2018-02-08 16:17:27.496983 7f0fe7e2b700  4 rocksdb:
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:264]
>[default] [JOB 6] Flushing memtable with next log file: 162
>2018-02-08 16:17:27.497001 7f0fe7e2b700  4 rocksdb: EVENT_LOG_v1
>{"time_micros": 1518103047496990, "job": 6, "event": "flush_started",
>"num_memtables": 1, "num_entries": 328542, "num_deletes": 66632,
>"memory_usage": 260058032}
>2018-02-08 16:17:27.497006 7f0fe7e2b700  4 rocksdb:
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:293]
>[default] [JOB 6] Level-0 flush table #163: started
>2018-02-08 16:17:27.627110 7f0fe7e2b700  4 rocksdb: EVENT_LOG_v1
>{"time_micros": 1518103047627094, "cf_name": "default", "job": 6,
>"event": "table_file_creation", "file_number": 163, "file_size":
>5502182, "table_properties": {"data_size": 5160167, "index_size":
>81548,
>"filter_size": 259478, "raw_key_size": 5138655, "raw_average_key_size":
>51, "raw_value_size": 3606384, "raw_average_value_size": 36,
>"num_data_blocks": 1287, "num_entries": 98984, "filter_policy_name":
>"rocksdb.BuiltinBloomFilter", "kDeletedKeys": "66093",
>"kMergeOperands":
>"192"}}
>2018-02-08 16:17:27.627127 7f0fe7e2b700  4 rocksdb:
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:319]
>[default] [JOB 6] Level-0 flush table #163: 5502182 bytes OK
>2018-02-08 16:17:27.627449 7f0fe7e2b700  4 rocksdb:
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_files.cc:242]
>adding log 155 to recycle list
>2018-02-08 16:17:27.627457 7f0fe7e2b700  4 rocksdb: (Original Log Time
>2018/02/08-16:17:27.627136)
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:360]
>[default] Level-0 commit table #163 started
>2018-02-08 16:17:27.627461 7f0fe7e2b700  4 rocksdb: (Original Log Time
>2018/02/08-16:17:27.627402)
>[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:383]
>[default] Level-0 commit 

[ceph-users] Rocksdb: Try to delete WAL files size....

2018-02-09 Thread Dietmar Rieder
Hi,

we are running ceph version 12.2.2 (10 nodes, 240 OSDs, 3 mon). While
monitoring the WAL db used bytes we noticed that there are some OSDs
that use proportionally more WAL db bytes than others (800Mb vs 300Mb).
These OSDs eventually exceed the WAL db size (1GB in our case) and spill
over to the HDD data device. So it seems flushing the WAL db does not
free space.
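
For reference, we read these numbers from the OSD admin sockets, e.g.

  ceph daemon osd.42 perf dump | jq '.bluefs'

(osd.42 is just an example id, jq is optional); the relevant fields are
wal_used_bytes, db_used_bytes and slow_used_bytes.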

We looked for some hints in the logs of the OSDs in question and spotted
the following entries:

[...]
2018-02-08 16:17:27.496695 7f0ffce55700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_write.cc:684]
reusing log 152 from recycle list
2018-02-08 16:17:27.496768 7f0ffce55700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_write.cc:725]
[default] New memtable created with log file: #162. Immutable memtables: 0.
2018-02-08 16:17:27.496976 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.496841)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:1158]
Calling FlushMemTableToOutputFile with column family [default], flush
slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2018-02-08 16:17:27.496983 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:264]
[default] [JOB 6] Flushing memtable with next log file: 162
2018-02-08 16:17:27.497001 7f0fe7e2b700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1518103047496990, "job": 6, "event": "flush_started",
"num_memtables": 1, "num_entries": 328542, "num_deletes": 66632,
"memory_usage": 260058032}
2018-02-08 16:17:27.497006 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:293]
[default] [JOB 6] Level-0 flush table #163: started
2018-02-08 16:17:27.627110 7f0fe7e2b700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1518103047627094, "cf_name": "default", "job": 6,
"event": "table_file_creation", "file_number": 163, "file_size":
5502182, "table_properties": {"data_size": 5160167, "index_size": 81548,
"filter_size": 259478, "raw_key_size": 5138655, "raw_average_key_size":
51, "raw_value_size": 3606384, "raw_average_value_size": 36,
"num_data_blocks": 1287, "num_entries": 98984, "filter_policy_name":
"rocksdb.BuiltinBloomFilter", "kDeletedKeys": "66093", "kMergeOperands":
"192"}}
2018-02-08 16:17:27.627127 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:319]
[default] [JOB 6] Level-0 flush table #163: 5502182 bytes OK
2018-02-08 16:17:27.627449 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_files.cc:242]
adding log 155 to recycle list
2018-02-08 16:17:27.627457 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627136)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:360]
[default] Level-0 commit table #163 started
2018-02-08 16:17:27.627461 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627402)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:383]
[default] Level-0 commit table #163: memtable #1 done
2018-02-08 16:17:27.627474 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627415) EVENT_LOG_v1 {"time_micros":
1518103047627409, "job": 6, "event": "flush_finished", "lsm_state": [1,
2, 3, 0, 0, 0, 0], "immutable_memtables": 0}
2018-02-08 16:17:27.627476 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627435)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:132]
[default] Level summary: base level 1 max bytes base 268435456 files[1 2
3 0 0 0 

Re: [ceph-users] Bluefs WAL : bluefs _allocate failed to allocate on bdev 0

2018-01-29 Thread Dietmar Rieder
Hi,

just for the record:

A reboot of the osd node solved the issue; the wal is now fully purged
and the extra 790MB are gone.

Sorry for the noise.

  Dietmar


On 01/27/2018 11:08 AM, Dietmar Rieder wrote:
> Hi,
> 
> replying to my own message.
> 
> After I restarted the OSD it seems some of the wal partition got purged.
> However there are still ~790MB used. As far as I understand, it should get
> completely emptied. At least this is what happens when I restart another
> OSD, where its associated wal gets completely flushed.
> Is it somehow possible to reinitialize the wal for that OSD in question?
> 
> Thanks
>   Dietmar
> 
> 
> On 01/26/2018 05:11 PM, Dietmar Rieder wrote:
>> Hi all,
>>
>> I've a question regarding bluestore wal.db:
>>
>>
>> We are running a 10 OSD node + 3 MON/MDS node cluster (luminous 12.2.2).
>> Each OSD node has 22xHDD (8TB) OSDs, 2xSSD (1.6TB) OSDs and 2xNVME (800
>> GB) for bluestore wal and db.
>>
>> We have separated wal and db partitions
>> wal partitions are 1GB
>> db partitions are 64GB
>>
>> The cluster is providing cephfs from one HDD (EC 6+3) and one SSD
>> (3xrep) pool.
>> Since the cluster is "new" we have not much data ~30TB (HDD EC) and
>> ~140GB (SSD rep) stored on it yet.
>>
>> I just noticed that the wal db usage for the SSD OSDs is all more or
>> less equal ~518MB. The wal db usage for the HDD OSDs is as well quite
>> balanced at 284-306MB, however there is one OSD whose wal db usage is ~ 1GB
>>
>>
>>"bluefs": {
>> "gift_bytes": 0,
>> "reclaim_bytes": 0,
>> "db_total_bytes": 68719468544,
>> "db_used_bytes": 1114636288,
>> "wal_total_bytes": 1073737728,
>> "wal_used_bytes": 1072693248,
>> "slow_total_bytes": 320057901056,
>> "slow_used_bytes": 0,
>> "num_files": 16,
>> "log_bytes": 862326784,
>> "log_compactions": 0,
>> "logged_bytes": 850575360,
>> "files_written_wal": 2,
>> "files_written_sst": 9,
>> "bytes_written_wal": 744469265,
>> "bytes_written_sst": 568855830
>> },
>>
>>
>> and I got the following log entries:
>>
>> 2018-01-26 16:31:05.484284 7f65ea28a700  1 bluefs _allocate failed to
>> allocate 0x40 on bdev 0, free 0xff000; fallback to bdev 1
>>
>> Is there any reason for this difference ~300MB vs 1GB?
>> I have in mind that 1GB of wal should be enough, and old logs should be
>> purged to free space. (can this be triggered manually?)
>>
>> Could this be related to the fact that the HDD OSD in question was
>> failing some week ago and we replaced it with with a new HDD?
>>
>> Do we have to expect problems/performace reductions, with the falling
>> back to bdev 1?
>>
>> Thanks for any clarifying comment
>>Dietmar
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluefs WAL : bluefs _allocate failed to allocate on bdev 0

2018-01-27 Thread Dietmar Rieder
Hi,

replying to my own message.

After I restarted the OSD it seems some of the wal partition got purged.
However there are still ~790MB used. As far as I understand, it should get
completely emptied. At least this is what happens when I restart another
OSD, where its associated wal gets completely flushed.
Is it somehow possible to reinitialize the wal for that OSD in question?

Thanks
  Dietmar


On 01/26/2018 05:11 PM, Dietmar Rieder wrote:
> Hi all,
> 
> I've a question regarding bluestore wal.db:
> 
> 
> We are running a 10 OSD node + 3 MON/MDS node cluster (luminous 12.2.2).
> Each OSD node has 22xHDD (8TB) OSDs, 2xSSD (1.6TB) OSDs and 2xNVME (800
> GB) for bluestore wal and db.
> 
> We have separated wal and db partitions
> wal partitions are 1GB
> db partitions are 64GB
> 
> The cluster is providing cephfs from one HDD (EC 6+3) and one SSD
> (3xrep) pool.
> Since the cluster is "new" we have not much data ~30TB (HDD EC) and
> ~140GB (SSD rep) stored on it yet.
> 
> I just noticed that the wal db usage for the SSD OSDs is all more or
> less equal ~518MB. The wal db usage for the HDD OSDs is as well quite
> balanced at 284-306MB, however there is one OSD whose wal db usage is ~ 1GB
> 
> 
>"bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 68719468544,
> "db_used_bytes": 1114636288,
> "wal_total_bytes": 1073737728,
> "wal_used_bytes": 1072693248,
> "slow_total_bytes": 320057901056,
> "slow_used_bytes": 0,
> "num_files": 16,
> "log_bytes": 862326784,
> "log_compactions": 0,
> "logged_bytes": 850575360,
> "files_written_wal": 2,
> "files_written_sst": 9,
> "bytes_written_wal": 744469265,
> "bytes_written_sst": 568855830
> },
> 
> 
> and I got the following log entries:
> 
> 2018-01-26 16:31:05.484284 7f65ea28a700  1 bluefs _allocate failed to
> allocate 0x40 on bdev 0, free 0xff000; fallback to bdev 1
> 
> Is there any reason for this difference ~300MB vs 1GB?
> I have in mind that 1GB of wal should be enough, and old logs should be
> purged to free space. (can this be triggered manually?)
> 
> Could this be related to the fact that the HDD OSD in question was
> failing some week ago and we replaced it with with a new HDD?
> 
> Do we have to expect problems/performance reductions with the fallback
> to bdev 1?
> 
> Thanks for any clarifying comment
>Dietmar
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluefs WAL : bluefs _allocate failed to allocate on bdev 0

2018-01-26 Thread Dietmar Rieder
Hi all,

I've a question regarding bluestore wal.db:


We are running a 10 OSD node + 3 MON/MDS node cluster (luminous 12.2.2).
Each OSD node has 22xHDD (8TB) OSDs, 2xSSD (1.6TB) OSDs and 2xNVME (800
GB) for bluestore wal and db.

We have separated wal and db partitions
wal partitions are 1GB
db partitions are 64GB

The cluster is providing cephfs from one HDD (EC 6+3) and one SSD
(3xrep) pool.
Since the cluster is "new" we have not much data ~30TB (HDD EC) and
~140GB (SSD rep) stored on it yet.

I just noticed that the wal db usage for the SSD OSDs is all more or
less equal ~518MB. The wal db usage for the HDD OSDs is as well quite
balanced at 284-306MB, however there is one OSD whose wal db usage is ~ 1GB


   "bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 68719468544,
"db_used_bytes": 1114636288,
"wal_total_bytes": 1073737728,
"wal_used_bytes": 1072693248,
"slow_total_bytes": 320057901056,
"slow_used_bytes": 0,
"num_files": 16,
"log_bytes": 862326784,
"log_compactions": 0,
"logged_bytes": 850575360,
"files_written_wal": 2,
"files_written_sst": 9,
"bytes_written_wal": 744469265,
"bytes_written_sst": 568855830
},
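
(These bluefs counters can be queried on the OSD host via the admin
socket, e.g.

    ceph daemon osd.<id> perf dump | jq '.bluefs'

assuming jq is available; otherwise pipe the output through
"python -m json.tool" and look for the "bluefs" section.)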


and I got the following log entries:

2018-01-26 16:31:05.484284 7f65ea28a700  1 bluefs _allocate failed to
allocate 0x40 on bdev 0, free 0xff000; fallback to bdev 1

Is there any reason for this difference ~300MB vs 1GB?
I have in mind that 1GB of wal should be enough, and old logs should be
purged to free space. (can this be triggered manually?)

Could this be related to the fact that the HDD OSD in question was
failing some week ago and we replaced it with with a new HDD?

Do we have to expect problems/performance reductions with the fallback
to bdev 1?

Thanks for any clarifying comment
   Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-18 Thread Dietmar Rieder
Hi,

I finally found a working way to replace the failed OSD. Everything looks
fine again.

Thanks again for your comments and suggestions.

Dietmar

On 01/12/2018 04:08 PM, Dietmar Rieder wrote:
> Hi,
> 
> can someone, comment/confirm my planned OSD replacement procedure?
> 
> It would be very helpful for me.
> 
> Dietmar
> 
> Am 11. Januar 2018 17:47:50 MEZ schrieb Dietmar Rieder
> <dietmar.rie...@i-med.ac.at>:
> 
> Hi Alfredo,
> 
> thanks for your coments, see my answers inline.
> 
> On 01/11/2018 01:47 PM, Alfredo Deza wrote:
> 
> On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
> <dietmar.rie...@i-med.ac.at> wrote:
> 
> Hello,
> 
> we have failed OSD disk in our Luminous v12.2.2 cluster that
> needs to
> get replaced.
> 
> The cluster was initially deployed using ceph-deploy on Luminous
> v12.2.0. The OSDs were created using
> 
> ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
> --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1
> 
> Note we separated the bluestore data, wal and db.
> 
> We updated to Luminous v12.2.1 and further to Luminous v12.2.2.
> 
> With the last update we also let ceph-volume take over the
> OSDs using
> "ceph-volume simple scan /var/lib/ceph/osd/$osd" and
> "ceph-volume
> simple activate ${osd} ${id}". All of this went smoothly.
> 
> 
> That is good to hear!
> 
> 
> Now wonder what is the correct way to replace a failed OSD
> block disk?
> 
> The docs for luminous [1] say:
> 
> REPLACING AN OSD
> 
> 1. Destroy the OSD first:
> 
> ceph osd destroy {id} --yes-i-really-mean-it
> 
> 2. Zap a disk for the new OSD, if the disk was used before
> for other
> purposes. It’s not necessary for a new disk:
> 
> ceph-disk zap /dev/sdX
> 
> 
> 3. Prepare the disk for replacement by using the previously
> destroyed
> OSD id:
> 
> ceph-disk prepare --bluestore /dev/sdX --osd-id {id}
> --osd-uuid `uuidgen`
> 
> 
> 4. And activate the OSD:
> 
> ceph-disk activate /dev/sdX1
> 
> 
> Initially this seems to be straight forward, but
> 
> 1. I'm not sure if there is something to do with the still
> existing
> bluefs db and wal partitions on the nvme device for the
> failed OSD. Do
> they have to be zapped ? If yes, what is the best way? There
> is nothing
> mentioned in the docs.
> 
> 
> What is your concern here if the activation seems to work?
> 
> 
> I guess on the nvme partitions for bluefs db and bluefs wal there is
> still data related to the failed OSD  block device. I was thinking that
> this data might "interfere" with the new replacement OSD block device,
> which is empty.
> 
> So you are saying that this is no concern, right?
> Are they automatically reused and assigned to the replacement OSD block
> device, or do I have to specify them when running ceph-disk prepare?
> If I need to specify the wal and db partition, how is this done?
> 
> I'm asking this since from the logs of the initial cluster deployment I
> got the following warning:
> 
> [cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
> block.db is not the same device as the osd data
> [...]
> [cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
> block.wal is not the same device as the osd data
> 
> 
> 
> 2. Since we already let "ceph-volume simple" take over our
> OSDs I'm not
> sure if we should now use ceph-volume or again ceph-disk
> (followed by
> "ceph-vloume simple" takeover) to prepare and activate the OSD?
> 
> 
> The `simple` sub-command is meant to help with the activation of
> OSDs
> at boot time, supporting ceph-disk (or manual) created OSDs.
> 
> 
> OK, got this...
> 
> 
> There is no requirement to use `ceph-volume lvm` which is
> intended for
> new OSDs using LVM as devices.
> 
> 
> Fine...
> 
> 
> 3. If we should use ceph-volume, then by looking at the luminous
>

Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-12 Thread Dietmar Rieder
Hi,

can someone, comment/confirm my planned OSD replacement procedure?

It would be very helpful for me.

Dietmar

Am 11. Januar 2018 17:47:50 MEZ schrieb Dietmar Rieder 
<dietmar.rie...@i-med.ac.at>:
>Hi Alfredo,
>
>thanks for your coments, see my answers inline.
>
>On 01/11/2018 01:47 PM, Alfredo Deza wrote:
>> On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
>> <dietmar.rie...@i-med.ac.at> wrote:
>>> Hello,
>>>
>>> we have failed OSD disk in our Luminous v12.2.2 cluster that needs
>to
>>> get replaced.
>>>
>>> The cluster was initially deployed using ceph-deploy on Luminous
>>> v12.2.0. The OSDs were created using
>>>
>>> ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
>>> --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1
>>>
>>> Note we separated the bluestore data, wal and db.
>>>
>>> We updated to Luminous v12.2.1 and further to Luminous v12.2.2.
>>>
>>> With the last update we also let ceph-volume take over the OSDs
>using
>>> "ceph-volume simple scan  /var/lib/ceph/osd/$osd" and "ceph-volume
>>> simple activate ${osd} ${id}". All of this went smoothly.
>> 
>> That is good to hear!
>> 
>>>
>>> Now wonder what is the correct way to replace a failed OSD block
>disk?
>>>
>>> The docs for luminous [1] say:
>>>
>>> REPLACING AN OSD
>>>
>>> 1. Destroy the OSD first:
>>>
>>> ceph osd destroy {id} --yes-i-really-mean-it
>>>
>>> 2. Zap a disk for the new OSD, if the disk was used before for other
>>> purposes. It’s not necessary for a new disk:
>>>
>>> ceph-disk zap /dev/sdX
>>>
>>>
>>> 3. Prepare the disk for replacement by using the previously
>destroyed
>>> OSD id:
>>>
>>> ceph-disk prepare --bluestore /dev/sdX  --osd-id {id} --osd-uuid
>`uuidgen`
>>>
>>>
>>> 4. And activate the OSD:
>>>
>>> ceph-disk activate /dev/sdX1
>>>
>>>
>>> Initially this seems to be straight forward, but
>>>
>>> 1. I'm not sure if there is something to do with the still existing
>>> bluefs db and wal partitions on the nvme device for the failed OSD.
>Do
>>> they have to be zapped ? If yes, what is the best way? There is
>nothing
>>> mentioned in the docs.
>> 
>> What is your concern here if the activation seems to work?
>
>I guess on the nvme partitions for bluefs db and bluefs wal there is
>still data related to the failed OSD  block device. I was thinking that
>this data might "interfere" with the new replacement OSD block device,
>which is empty.
>
>So you are saying that this is no concern, right?
>Are they automatically reused and assigned to the replacement OSD block
>device, or do I have to specify them when running ceph-disk prepare?
>If I need to specify the wal and db partition, how is this done?
>
>I'm asking this since from the logs of the initial cluster deployment I
>got the following warning:
>
>[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
>block.db is not the same device as the osd data
>[...]
>[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
>block.wal is not the same device as the osd data
>
>
>>>
>>> 2. Since we already let "ceph-volume simple" take over our OSDs I'm
>not
>>> sure if we should now use ceph-volume or again ceph-disk (followed
>by
>>> "ceph-vloume simple" takeover) to prepare and activate the OSD?
>> 
>> The `simple` sub-command is meant to help with the activation of OSDs
>> at boot time, supporting ceph-disk (or manual) created OSDs.
>
>OK, got this...
>
>> 
>> There is no requirement to use `ceph-volume lvm` which is intended
>for
>> new OSDs using LVM as devices.
>
>Fine...
>
>>>
>>> 3. If we should use ceph-volume, then by looking at the luminous
>>> ceph-volume docs [2] I find for both,
>>>
>>> ceph-volume lvm prepare
>>> ceph-volume lvm activate
>>>
>>> that the bluestore option is either NOT implemented or NOT supported
>>>
>>> activate:  [--bluestore] filestore (IS THIS A TYPO???) objectstore (not
>>> yet implemented)
>>> prepare: [--bluestore] Use the bluestore objectstore (not currently
>>> supported)
>> 
>> These might be a typo on the man page, will get that addressed.
>Ticket
>> opened at http://tracker.ceph.com/issues/22

Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Dietmar Rieder
Hi Konstantin,

thanks for your answer, see my answer to Alfredo which includes your
suggestions.

~Dietmar

On 01/11/2018 12:57 PM, Konstantin Shalygin wrote:
>> Now wonder what is the correct way to replace a failed OSD block disk?
> 
> The generic way for maintenance (e.g. disk replacement) is to rebalance by
> changing the osd weight:
> 
> ceph osd crush reweight osdid 0
> 
> The cluster will migrate data off this osd.
> When HEALTH_OK you can safely remove this OSD:
> 
> ceph osd out osd_id
> systemctl stop ceph-osd@osd_id
> 
> ceph osd crush remove osd_id
> ceph auth del osd_id
> ceph osd rm osd_id
> 
> 
>> I'm not sure if there is something to do with the still existing bluefs db 
>> and wal partitions on the nvme device for the failed OSD. Do they have to be 
>> zapped ? If yes, what is the best way?
> 
> 
> 1. Find the nvme partition for this OSD. You can do it in several ways:
> ceph-volume, by hand, or with "ceph-disk list" (which is more human
> readable):
> 
> /dev/sda :
>  /dev/sda1 ceph data, active, cluster ceph, osd.0, block /dev/sda2, block.db 
> /dev/nvme2n1p1, block.wal /dev/nvme2n1p2
>  /dev/sda2 ceph block, for /dev/sda1
> 
> 2. Delete partition via parted or fdisk.
> 
> fdisk -u /dev/nvme2n1
> d (delete partitions)
> enter partition number of block.db: 1
> d
> enter partition number of block.wal: 2
> w (write partition table)
> 
> 3. Deploy your new OSD.
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Dietmar Rieder
Hi Alfredo,

thanks for your coments, see my answers inline.

On 01/11/2018 01:47 PM, Alfredo Deza wrote:
> On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
> <dietmar.rie...@i-med.ac.at> wrote:
>> Hello,
>>
>> we have failed OSD disk in our Luminous v12.2.2 cluster that needs to
>> get replaced.
>>
>> The cluster was initially deployed using ceph-deploy on Luminous
>> v12.2.0. The OSDs were created using
>>
>> ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
>> --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1
>>
>> Note we separated the bluestore data, wal and db.
>>
>> We updated to Luminous v12.2.1 and further to Luminous v12.2.2.
>>
>> With the last update we also let ceph-volume take over the OSDs using
>> "ceph-volume simple scan  /var/lib/ceph/osd/$osd" and "ceph-volume
>> simple activate ${osd} ${id}". All of this went smoothly.
> 
> That is good to hear!
> 
>>
>> Now wonder what is the correct way to replace a failed OSD block disk?
>>
>> The docs for luminous [1] say:
>>
>> REPLACING AN OSD
>>
>> 1. Destroy the OSD first:
>>
>> ceph osd destroy {id} --yes-i-really-mean-it
>>
>> 2. Zap a disk for the new OSD, if the disk was used before for other
>> purposes. It’s not necessary for a new disk:
>>
>> ceph-disk zap /dev/sdX
>>
>>
>> 3. Prepare the disk for replacement by using the previously destroyed
>> OSD id:
>>
>> ceph-disk prepare --bluestore /dev/sdX  --osd-id {id} --osd-uuid `uuidgen`
>>
>>
>> 4. And activate the OSD:
>>
>> ceph-disk activate /dev/sdX1
>>
>>
>> Initially this seems to be straight forward, but
>>
>> 1. I'm not sure if there is something to do with the still existing
>> bluefs db and wal partitions on the nvme device for the failed OSD. Do
>> they have to be zapped ? If yes, what is the best way? There is nothing
>> mentioned in the docs.
> 
> What is your concern here if the activation seems to work?

I guess on the nvme partitions for bluefs db and bluefs wal there is
still data related to the failed OSD  block device. I was thinking that
this data might "interfere" with the new replacement OSD block device,
which is empty.

So you are saying that this is no concern, right?
Are they automatically reused and assigned to the replacement OSD block
device, or do I have to specify them when running ceph-disk prepare?
If I need to specify the wal and db partition, how is this done?

I'm asking this since from the logs of the initial cluster deployment I
got the following warning:

[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.db is not the same device as the osd data
[...]
[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.wal is not the same device as the osd data
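
(If it has to be given explicitly, what I have in mind would be something
along these lines, assuming ceph-disk's --block.db/--block.wal options can
be pointed at the existing nvme partitions of that OSD; the device names
are placeholders and I have not verified whether ceph-disk reuses an
existing partition or tries to create a new one:

    ceph-disk prepare --bluestore /dev/sdX \
        --block.db /dev/nvme0n1pY \
        --block.wal /dev/nvme0n1pZ \
        --osd-id <id>
)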


>>
>> 2. Since we already let "ceph-volume simple" take over our OSDs I'm not
>> sure if we should now use ceph-volume or again ceph-disk (followed by
>> "ceph-vloume simple" takeover) to prepare and activate the OSD?
> 
> The `simple` sub-command is meant to help with the activation of OSDs
> at boot time, supporting ceph-disk (or manual) created OSDs.

OK, got this...

> 
> There is no requirement to use `ceph-volume lvm` which is intended for
> new OSDs using LVM as devices.

Fine...

>>
>> 3. If we should use ceph-volume, then by looking at the luminous
>> ceph-volume docs [2] I find for both,
>>
>> ceph-volume lvm prepare
>> ceph-volume lvm activate
>>
>> that the bluestore option is either NOT implemented or NOT supported
>>
>> activate:  [--bluestore] filestore (IS THIS A TYPO???) objectstore (not
>> yet implemented)
>> prepare: [--bluestore] Use the bluestore objectstore (not currently
>> supported)
> 
> These might be a typo on the man page, will get that addressed. Ticket
> opened at http://tracker.ceph.com/issues/22663

Thanks

> bluestore as of 12.2.2 is fully supported and it is the default. The
> --help output in ceph-volume does have the flags updated and correctly
> showing this.

OK

>>
>>
>> So, now I'm completely lost. How is all of this fitting together in
>> order to replace a failed OSD?
> 
> You would need to keep using ceph-disk. Unless you want ceph-volume to
> take over, in which case you would need to follow the steps to deploy
> a new OSD
> with ceph-volume.

OK

> Note that although --osd-id is supported, there is an issue with that
> on 12.2.2 that would prevent you from correctly deploying it
> http://tracker.ceph.com/issues/22642
> 
> The

[ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Dietmar Rieder
Hello,

we have a failed OSD disk in our Luminous v12.2.2 cluster that needs to
get replaced.

The cluster was initially deployed using ceph-deploy on Luminous
v12.2.0. The OSDs were created using

ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
--block-wal /dev/nvme0n1 --block-db /dev/nvme0n1

Note we separated the bluestore data, wal and db.

We updated to Luminous v12.2.1 and further to Luminous v12.2.2.

With the last update we also let ceph-volume take over the OSDs using
"ceph-volume simple scan  /var/lib/ceph/osd/$osd" and "ceph-volume
simple activate ${osd} ${id}". All of this went smoothly.

Now I wonder what is the correct way to replace a failed OSD block disk?

The docs for luminous [1] say:

REPLACING AN OSD

1. Destroy the OSD first:

ceph osd destroy {id} --yes-i-really-mean-it

2. Zap a disk for the new OSD, if the disk was used before for other
purposes. It’s not necessary for a new disk:

ceph-disk zap /dev/sdX


3. Prepare the disk for replacement by using the previously destroyed
OSD id:

ceph-disk prepare --bluestore /dev/sdX  --osd-id {id} --osd-uuid `uuidgen`


4. And activate the OSD:

ceph-disk activate /dev/sdX1


Initially this seems to be straight forward, but

1. I'm not sure if there is something to do with the still existing
bluefs db and wal partitions on the nvme device for the failed OSD. Do
they have to be zapped ? If yes, what is the best way? There is nothing
mentioned in the docs.

2. Since we already let "ceph-volume simple" take over our OSDs I'm not
sure if we should now use ceph-volume or again ceph-disk (followed by
"ceph-vloume simple" takeover) to prepare and activate the OSD?

3. If we should use ceph-volume, then by looking at the luminous
ceph-volume docs [2] I find for both,

ceph-volume lvm prepare
ceph-volume lvm activate

that the bluestore option is either NOT implemented or NOT supported

activate:  [--bluestore] filestore (IS THIS A TYPO???) objectstore (not
yet implemented)
prepare: [--bluestore] Use the bluestore objectstore (not currently
supported)


So, now I'm completely lost. How is all of this fitting together in
order to replace a failed OSD?

4. After reading some recent threads on this list, additional
questions are coming up:

According to the OSD replacement doc [1] :

"When disks fail, [...], OSDs need to be replaced. Unlike Removing the
OSD, replaced OSD’s id and CRUSH map entry need to be keep [TYPO HERE?
keep -> kept] intact after the OSD is destroyed for replacement."

but
http://tracker.ceph.com/issues/22642 seems to say that it is not
possible to reuse an OSD's id


So I'm quite lost with an essential and very basic seemingly simple task
of storage management.

Thanks for any help here.

~Dietmar


[1]: http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-osds/
[2]: http://docs.ceph.com/docs/luminous/man/8/ceph-volume/

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk removal roadmap (was ceph-disk is now deprecated)

2017-12-01 Thread Dietmar Rieder
On 12/01/2017 01:45 PM, Alfredo Deza wrote:
> On Fri, Dec 1, 2017 at 3:28 AM, Stefan Kooman  wrote:
>> Quoting Fabian Grünbichler (f.gruenbich...@proxmox.com):
>>> I think the above roadmap is a good compromise for all involved parties,
>>> and I hope we can use the remainder of Luminous to prepare for a
>>> seam- and painless transition to ceph-volume in time for the Mimic
>>> release, and then finally retire ceph-disk for good!
>>
>> Will the upcoming 12.2.2 release ship with a ceph-volume capable of
>> doing bluestore on top of LVM? Eager to use ceph-volume for that, and
>> skip entirely over ceph-disk and our manual osd prepare process ...
> 
> Yes. I think that for 12.2.1 this was the case as well, in 12.2.2 is
> the default.


...and will ceph-deploy be ceph-volume capable and default to it in the
12.2.2  release?

Dietmar



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-26 Thread Dietmar Rieder
thanks David,

that's confirming what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.

Dietmar

On 09/25/2017 05:10 PM, David Turner wrote:
> db/wal partitions are per OSD.  DB partitions need to be made as big as
> you need them.  If they run out of space, they will fall back to the
> block device.  If the DB and block are on the same device, then there's
> no reason to partition them and figure out the best size.  If they are
> on separate devices, then you need to make it as big as you need to to
> ensure that it won't spill over (or if it does that you're ok with the
> degraded performance while the db partition is full).  I haven't come
> across an equation to judge what size should be used for either
> partition yet.
> 
> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> <dietmar.rie...@i-med.ac.at <mailto:dietmar.rie...@i-med.ac.at>> wrote:
> 
> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> Hi,
> >>
> >> To my understand, the bluestore write workflow is
> >>
> >> For normal big write
> >> 1. Write data to block
> >> 2. Update metadata to rocksdb
> >> 3. Rocksdb write to memory and block.wal
> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >>
> >> For overwrite and small write
> >> 1. Write data and metadata to rocksdb
> >> 2. Apply the data to block
> >>
> >> Seems we don’t have a formula or suggestion to the size of block.db.
> >> It depends on the object size and number of objects in your pool. You
> >> can just give big partition to block.db to ensure all the database
> >> files are on that fast partition. If block.db full, it will use block
> >> to put db files, however, this will slow down the db performance. So
> >> give db size as much as you can.
> >
> > This is basically correct.  What's more, it's not just the object
> size,
> > but the number of extents, checksums, RGW bucket indices, and
> > potentially other random stuff.  I'm skeptical how well we can
> estimate
> > all of this in the long run.  I wonder if we would be better served by
> > just focusing on making it easy to understand how the DB device is
> being
> > used, how much is spilling over to the block device, and make it
> easy to
> > upgrade to a new device once it gets full.
> >
> >>
> >> If you want to put wal and db on same ssd, you don’t need to create
> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> you need block.wal is that you want to separate wal to another disk.
> >
> > I always make explicit partitions, but only because I (potentially
> > illogically) like it that way.  There may actually be some benefits to
> > using a single partition for both if sharing a single device.
> 
> is this "Single db/wal partition" then to be used for all OSDs on a node
> or do you need to create a separate "Single db/wal partition" for each
> OSD  on the node?
> 
> >
> >>
> >> I’m also studying bluestore, this is what I know so far. Any
> >> correction is welcomed.
> >>
> >> Thanks
> >>
> >>
> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>> <richard.hesk...@rd.bbc.co.uk
> <mailto:richard.hesk...@rd.bbc.co.uk>> wrote:
> >>>
> >>> I asked the same question a couple of weeks ago. No response I got
> >>> contradicted the documentation but nobody actively confirmed the
> >>> documentation was correct on this subject, either; my end state was
> >>> that I was relatively confident I wasn't making some horrible
> mistake
> >>> by simply specifying a big DB partition and letting bluestore work
> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>> journals under filestore), but I could not be sure there wasn't some
> >>> sort of performance tuning I was missing out on by not specifying
> >>> them separately.
> >>>
> >>> Rich
> >>>
> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
> >>>> Some of this thread seems to contradict the documentation and
> confuses
> >>>> me.  Is the

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Dietmar Rieder
On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understand, the bluestore write workflow is
>>
>> For normal big write
>> 1. Write data to block
>> 2. Update metadata to rocksdb
>> 3. Rocksdb write to memory and block.wal
>> 4. Once reach threshold, flush entries in block.wal to block.db
>>
>> For overwrite and small write
>> 1. Write data and metadata to rocksdb
>> 2. Apply the data to block
>>
>> Seems we don’t have a formula or suggestion to the size of block.db.
>> It depends on the object size and number of objects in your pool. You
>> can just give big partition to block.db to ensure all the database
>> files are on that fast partition. If block.db full, it will use block
>> to put db files, however, this will slow down the db performance. So
>> give db size as much as you can.
> 
> This is basically correct.  What's more, it's not just the object size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff.  I'm skeptical how well we can estimate
> all of this in the long run.  I wonder if we would be better served by
> just focusing on making it easy to understand how the DB device is being
> used, how much is spilling over to the block device, and make it easy to
> upgrade to a new device once it gets full.
> 
>>
>> If you want to put wal and db on same ssd, you don’t need to create
>> block.wal. It will implicitly use block.db to put wal. The only case
>> you need block.wal is that you want to separate wal to another disk.
> 
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way.  There may actually be some benefits to
> using a single partition for both if sharing a single device.

is this "Single db/wal partition" then to be used for all OSDs on a node
or do you need to create a separate "Single db/wal partition" for each
OSD  on the node?

> 
>>
>> I’m also studying bluestore, this is what I know so far. Any
>> correction is welcomed.
>>
>> Thanks
>>
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>> <richard.hesk...@rd.bbc.co.uk> wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation but nobody actively confirmed the
>>> documentation was correct on this subject, either; my end state was
>>> that I was relatively confident I wasn't making some horrible mistake
>>> by simply specifying a big DB partition and letting bluestore work
>>> itself out (in my case, I've just got HDDs and SSDs that were
>>> journals under filestore), but I could not be sure there wasn't some
>>> sort of performance tuning I was missing out on by not specifying
>>> them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>>>> Some of this thread seems to contradict the documentation and confuses
>>>> me.  Is the statement below correct?
>>>>
>>>> "The BlueStore journal will always be placed on the fastest device
>>>> available, so using a DB device will provide the same benefit that the
>>>> WAL device would while also allowing additional metadata to be stored
>>>> there (if it will fix)."
>>>>
>>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>>>>
>>>>
>>>>  it seems to be saying that there's no reason to create separate WAL
>>>> and DB partitions if they are on the same device.  Specifying one
>>>> large DB partition per OSD will cover both uses.
>>>>
>>>> thanks,
>>>> Ben
>>>>
>>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>>>> <dietmar.rie...@i-med.ac.at> wrote:
>>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>>>>>
>>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm still looking for the answer of these questions. Maybe
>>>>>>>>> someone can
>>>>>>>>> share their thought on these. Any comment will be helpful too.
>>>>>>>>>
>>>>>>>>> Bes

Re: [ceph-users] erasure code profile

2017-09-22 Thread Dietmar Rieder
Hmm...

not sure what happens if you lose 2 disks in 2 different rooms, isn't
there a risk that you lose data?
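
(For reference, the profile being discussed would be created along these
lines; profile and pool names are placeholders, and on current Luminous
releases the key is spelled crush-failure-domain rather than
ruleset-failure-domain:

    ceph osd erasure-code-profile set room-ec k=2 m=1 ruleset-failure-domain=room
    ceph osd erasure-code-profile get room-ec
    ceph osd pool create backups 1024 1024 erasure room-ec

With k=2 m=1 every PG has 3 shards, one per room, and only a single lost
shard can be tolerated, so two failed disks in two different rooms that
happen to hold shards of the same PG would indeed mean data loss for that
PG.)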

Dietmar

On 09/22/2017 10:39 AM, Luis Periquito wrote:
> Hi all,
> 
> I've been trying to think what will be the best erasure code profile,
> but I don't really like the one I came up with...
> 
> I have 3 rooms that are part of the same cluster, and I need to design
> so we can lose any one of the 3.
> 
> As this is a backup cluster I was thinking on doing a k=2 m=1 code,
> with ruleset-failure-domain=room as the OSD tree is correctly built.
> 
> Can anyone think of a better profile?
> 
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Dietmar Rieder
On 09/21/2017 05:03 PM, Mark Nelson wrote:
> 
> 
> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm still looking for the answer of these questions. Maybe someone can
>>>> share their thought on these. Any comment will be helpful too.
>>>>
>>>> Best regards,
>>>>
>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>> <mrxlazuar...@gmail.com <mailto:mrxlazuar...@gmail.com>> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     1. Is it possible configure use osd_data not as small partition on
>>>>     OSD but a folder (ex. on root disk)? If yes, how to do that with
>>>>     ceph-disk and any pros/cons of doing that?
>>>>     2. Is WAL & DB size calculated based on OSD size or expected
>>>>     throughput like on journal device of filestore? If no, what is the
>>>>     default value and pro/cons of adjusting that?
>>>>     3. Is partition alignment matter on Bluestore, including WAL & DB
>>>>     if using separate device for them?
>>>>
>>>>     Best regards,
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> I am also looking for recommendations on wal/db partition sizes. Some
>>> hints:
>>>
>>> ceph-disk defaults used in case it does not find
>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>
>>> wal =  512MB
>>>
>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>> of it else it uses 1G.
>>>
>>> There is also a presentation by Sage back in March, see page 16:
>>>
>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>
>>>
>>> wal: 512 MB
>>>
>>> db: "a few" GB
>>>
>>> the wal size is probably not debatable, it will be like a journal for
>>> small block sizes which are constrained by iops hence 512 MB is more
>>> than enough. Probably we will see more on the db size in the future.
>>
>> This is what I understood so far.
>> I wonder if it makes sense to set the db size as big as possible and
>> divide the entire db device by the number of OSDs it will serve.
>>
>> E.g. 10 OSDs / 1 NVME (800GB)
>>
>>  (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD
>>
>> Is this smart/stupid?
> 
> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
> amp but mean larger memtables and potentially higher overhead scanning
> through memtables).  4x256MB buffers works pretty well, but it means
> memory overhead too.  Beyond that, I'd devote the entire rest of the
> device to DB partitions.
> 

thanks for your suggestion Mark!

So, just to make sure I understood this right:

You'd use a separate 512MB-2GB WAL partition for each OSD and the
entire rest for DB partitions.

In the example case with 10x HDD OSDs and 1 NVME it would then be 10 WAL
partitions of 512MB-2GB each and 10 equal-sized DB partitions
consuming the rest of the NVME.
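
As a rough sketch of how I would carve up one 800GB NVMe for that (device
name, partition numbers and sizes are only an illustration, not something
I have run):

    # 10 x 2GB WAL partitions
    for i in $(seq 1 10); do
        sgdisk --new=$i:0:+2G --change-name=$i:"ceph block.wal" /dev/nvme0n1
    done
    # 10 x ~76GB DB partitions out of the remaining space
    for i in $(seq 11 20); do
        sgdisk --new=$i:0:+76G --change-name=$i:"ceph block.db" /dev/nvme0n1
    done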


Thanks
  Dietmar
-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Dietmar Rieder
On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> 
>> Hi,
>>  
>> I'm still looking for the answer of these questions. Maybe someone can
>> share their thought on these. Any comment will be helpful too.
>>  
>> Best regards,
>>
>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>> > wrote:
>>
>> Hi,
>>  
>> 1. Is it possible configure use osd_data not as small partition on
>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>> ceph-disk and any pros/cons of doing that?
>> 2. Is WAL & DB size calculated based on OSD size or expected
>> throughput like on journal device of filestore? If no, what is the
>> default value and pro/cons of adjusting that?
>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>> if using separate device for them?
>>  
>> Best regards,
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
>  
> 
> I am also looking for recommendations on wal/db partition sizes. Some hints:
> 
> ceph-disk defaults used in case it does not find
> bluestore_block_wal_size or bluestore_block_db_size in config file:
> 
> wal =  512MB
> 
> db = if bluestore_block_size (data size) is in config file it uses 1/100
> of it else it uses 1G.
> 
> There is also a presentation by Sage back in March, see page 16:
> 
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> 
> wal: 512 MB
> 
> db: "a few" GB 
> 
> the wal size is probably not debatable, it will be like a journal for
> small block sizes which are constrained by iops hence 512 MB is more
> than enough. Probably we will see more on the db size in the future.

This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide the entire db device by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

 (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD

Is this smart/stupid?

Dietmar
 --
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2017-02-16 Thread Dietmar Rieder
On 02/16/2017 09:47 AM, John Spray wrote:
> On Thu, Feb 16, 2017 at 8:37 AM, Dietmar Rieder
> <dietmar.rie...@i-med.ac.at> wrote:
>> Hi,
>>
>> On 12/13/2016 12:35 PM, John Spray wrote:
>>> On Tue, Dec 13, 2016 at 7:35 AM, Dietmar Rieder
>>> <dietmar.rie...@i-med.ac.at> wrote:
>>>> Hi,
>>>>
>>>> this is good news! Thanks.
>>>>
>>>> As far as I see the RBD supports (experimentally) now EC data pools. Is
>>>> this true also for CephFS? It is not stated in the announce, so I wonder
>>>> if and when EC pools are planned to be supported by CephFS.
>>>
>>> Nobody has worked on this so far.  For EC data pools, it should mainly
>>> be a case of modifying the pool validation in MDSMonitor that
>>> currently prevents assigning an EC pool.  I strongly suspect we'll get
>>> around to this before Luminous.
>>>
>>
>> since v12.0.0 Luminous (dev) is now released, I just wanted to
>> ask if there was some work done in order to enable cephfs data pools on EC?
> 
> To clarify, the luminous release I was talking about was the stable
> one (what will eventually be 12.2.0 later in the year).

John, thanks for clarifying.

Dietmar


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2017-02-16 Thread Dietmar Rieder
Hi,

On 12/13/2016 12:35 PM, John Spray wrote:
> On Tue, Dec 13, 2016 at 7:35 AM, Dietmar Rieder
> <dietmar.rie...@i-med.ac.at> wrote:
>> Hi,
>>
>> this is good news! Thanks.
>>
>> As far as I see the RBD supports (experimentally) now EC data pools. Is
>> this true also for CephFS? It is not stated in the announce, so I wonder
>> if and when EC pools are planned to be supported by CephFS.
> 
> Nobody has worked on this so far.  For EC data pools, it should mainly
> be a case of modifying the pool validation in MDSMonitor that
> currently prevents assigning an EC pool.  I strongly suspect we'll get
> around to this before Luminous.
> 

since v12.0.0 Luminous (dev) is now released, I just wanted to
ask if there was some work done in order to enable cephfs data pools on EC?

Thanks
  Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-13 Thread Dietmar Rieder
Hi John,

Thanks for your answer.
The mentioned modification of the pool validation would then allow
CephFS to have its data pools on EC while keeping the metadata on a
replicated pool, right?

Dietmar

On 12/13/2016 12:35 PM, John Spray wrote:
> On Tue, Dec 13, 2016 at 7:35 AM, Dietmar Rieder
> <dietmar.rie...@i-med.ac.at> wrote:
>> Hi,
>>
>> this is good news! Thanks.
>>
>> As far as I see the RBD supports (experimentally) now EC data pools. Is
>> this true also for CephFS? It is not stated in the announce, so I wonder
>> if and when EC pools are planned to be supported by CephFS.
> 
> Nobody has worked on this so far.  For EC data pools, it should mainly
> be a case of modifying the pool validation in MDSMonitor that
> currently prevents assigning an EC pool.  I strongly suspect we'll get
> around to this before Luminous.
> 
> John
> 
>> ~regards
>>   Dietmar
>>
>> On 12/13/2016 03:28 AM, Abhishek L wrote:
>>> Hi everyone,
>>>
>>> This is the first release candidate for Kraken, the next stable
>>> release series. There have been major changes from jewel with many
>>> features being added. Please note the upgrade process from jewel,
>>> before upgrading.
>>>
>>> Major Changes from Jewel
>>> 
>>>
>>> - *RADOS*:
>>>
>>>   * The new *BlueStore* backend now has a stable disk format and is
>>> passing our failure and stress testing. Although the backend is
>>> still flagged as experimental, we encourage users to try it out
>>> for non-production clusters and non-critical data sets.
>>>   * RADOS now has experimental support for *overwrites on
>>> erasure-coded* pools. Because the disk format and implementation
>>> are not yet finalized, there is a special pool option that must be
>>> enabled to test the new feature.  Enabling this option on a cluster
>>> will permanently bar that cluster from being upgraded to future
>>> versions.
>>>   * We now default to the AsyncMessenger (``ms type = async``) instead
>>> of the legacy SimpleMessenger.  The most noticeable difference is
>>> that we now use a fixed sized thread pool for network connections
>>> (instead of two threads per socket with SimpleMessenger).
>>>   * Some OSD failures are now detected almost immediately, whereas
>>> previously the heartbeat timeout (which defaults to 20 seconds)
>>> had to expire.  This prevents IO from blocking for an extended
>>> period for failures where the host remains up but the ceph-osd
>>> process is no longer running.
>>>   * There is a new ``ceph-mgr`` daemon.  It is currently collocated with
>>> the monitors by default, and is not yet used for much, but the basic
>>> infrastructure is now in place.
>>>   * The size of encoded OSDMaps has been reduced.
>>>   * The OSDs now quiesce scrubbing when recovery or rebalancing is in 
>>> progress.
>>>
>>> - *RGW*:
>>>
>>>   * RGW now supports a new zone type that can be used for metadata indexing
>>> via Elasticsearch.
>>>   * RGW now supports the S3 multipart object copy-part API.
>>>   * It is possible now to reshard an existing bucket. Note that bucket
>>> resharding currently requires that all IO (especially writes) to
>>> the specific bucket is quiesced.
>>>   * RGW now supports data compression for objects.
>>>   * Civetweb version has been upgraded to 1.8
>>>   * The Swift static website API is now supported (S3 support has been added
>>> previously).
>>>   * S3 bucket lifecycle API has been added. Note that currently it only 
>>> supports
>>> object expiration.
>>>   * Support for custom search filters has been added to the LDAP auth
>>> implementation.
>>>   * Support for NFS version 3 has been added to the RGW NFS gateway.
>>>   * A Python binding has been created for librgw.
>>>
>>> - *RBD*:
>>>
>>>   * RBD now supports images stored in an *erasure-coded* RADOS pool
>>> using the new (experimental) overwrite support. Images must be
>>> created using the new rbd CLI "--data-pool " option to
>>> specify the EC pool where the backing data objects are
>>> stored. Attempting to create an image directly on an EC pool will
>>> not be successful since the image's backing metadata is only
>>> supported on a replicated pool.

Re: [ceph-users] v11.1.0 kraken candidate released

2016-12-12 Thread Dietmar Rieder
Hi,

this is good news! Thanks.

As far as I see the RBD supports (experimentally) now EC data pools. Is
this true also for CephFS? It is not stated in the announce, so I wonder
if and when EC pools are planned to be supported by CephFS.

~regards
  Dietmar

On 12/13/2016 03:28 AM, Abhishek L wrote:
> Hi everyone,
> 
> This is the first release candidate for Kraken, the next stable
> release series. There have been major changes from jewel with many
> features being added. Please note the upgrade process from jewel,
> before upgrading.
> 
> Major Changes from Jewel
> 
> 
> - *RADOS*:
> 
>   * The new *BlueStore* backend now has a stable disk format and is
> passing our failure and stress testing. Although the backend is
> still flagged as experimental, we encourage users to try it out
> for non-production clusters and non-critical data sets.
>   * RADOS now has experimental support for *overwrites on
> erasure-coded* pools. Because the disk format and implementation
> are not yet finalized, there is a special pool option that must be
> enabled to test the new feature.  Enabling this option on a cluster
> will permanently bar that cluster from being upgraded to future
> versions.
>   * We now default to the AsyncMessenger (``ms type = async``) instead
> of the legacy SimpleMessenger.  The most noticeable difference is
> that we now use a fixed sized thread pool for network connections
> (instead of two threads per socket with SimpleMessenger).
>   * Some OSD failures are now detected almost immediately, whereas
> previously the heartbeat timeout (which defaults to 20 seconds)
> had to expire.  This prevents IO from blocking for an extended
> period for failures where the host remains up but the ceph-osd
> process is no longer running.
>   * There is a new ``ceph-mgr`` daemon.  It is currently collocated with
> the monitors by default, and is not yet used for much, but the basic
> infrastructure is now in place.
>   * The size of encoded OSDMaps has been reduced.
>   * The OSDs now quiesce scrubbing when recovery or rebalancing is in 
> progress.
> 
> - *RGW*:
> 
>   * RGW now supports a new zone type that can be used for metadata indexing
> via Elasticsearch.
>   * RGW now supports the S3 multipart object copy-part API.
>   * It is possible now to reshard an existing bucket. Note that bucket
> resharding currently requires that all IO (especially writes) to
> the specific bucket is quiesced.
>   * RGW now supports data compression for objects.
>   * Civetweb version has been upgraded to 1.8
>   * The Swift static website API is now supported (S3 support has been added
> previously).
>   * S3 bucket lifecycle API has been added. Note that currently it only 
> supports
> object expiration.
>   * Support for custom search filters has been added to the LDAP auth
> implementation.
>   * Support for NFS version 3 has been added to the RGW NFS gateway.
>   * A Python binding has been created for librgw.
> 
> - *RBD*:
> 
>   * RBD now supports images stored in an *erasure-coded* RADOS pool
> using the new (experimental) overwrite support. Images must be
> created using the new rbd CLI "--data-pool " option to
> specify the EC pool where the backing data objects are
> stored. Attempting to create an image directly on an EC pool will
> not be successful since the image's backing metadata is only
> supported on a replicated pool.
>   * The rbd-mirror daemon now supports replicating dynamic image
> feature updates and image metadata key/value pairs from the
> primary image to the non-primary image.
>   * The number of image snapshots can be optionally restricted to a
> configurable maximum.
>   * The rbd Python API now supports asynchronous IO operations.
> 
> - *CephFS*:
> 
>   * libcephfs function definitions have been changed to enable proper
> uid/gid control.  The library version has been increased to reflect the
> interface change.
>   * Standby replay MDS daemons now consume less memory on workloads
> doing deletions.
>   * Scrub now repairs backtrace, and populates `damage ls` with
> discovered errors.
>   * A new `pg_files` subcommand to `cephfs-data-scan` can identify
> files affected by a damaged or lost RADOS PG.
>   * The false-positive "failing to respond to cache pressure" warnings have
> been fixed.
> 
> 
> Upgrading from Jewel
> 
> 
> * All clusters must first be upgraded to Jewel 10.2.z before upgrading
>   to Kraken 11.2.z (or, eventually, Luminous 12.2.z).
> 
> * The ``sortbitwise`` flag must be set on the Jewel cluster before upgrading
>   to Kraken.  The latest Jewel (10.2.4+) releases issue a health warning if
>   the flag is not set, so this is probably already set.  If it is not, Kraken
>   OSDs will refuse to start and will print and error message in their log.
> 
> 
> Upgrading
> -
> 
> * 

Re: [ceph-users] cache tiering deprecated in RHCS 2.0

2016-10-24 Thread Dietmar Rieder
On 10/24/2016 03:10 AM, Christian Balzer wrote:

[...]
> There are several items here and I very much would welcome a response from
> a Ceph/RH representative.
> 
> 1. Is that deprecation only in regards to RHCS, as Nick seems to hope?
> Because I very much doubt that, why develop code you just "removed" from
> your milk cow?
> 
> 2. Is that the same kind of deprecation as with the format 1 RBD images,
> as in, will there be a 5 year window where this functionality is NOT
> removed from the code base and a clear, seamless and non-disruptive
> upgrade path?
> 

Let me add one more point:

How will this affect situations in which one has cephfs on EC pools
(which demand a cache tier in front)?
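
(To make it concrete, I mean the usual overlay setup along the lines of

    ceph osd tier add cephfs_data_ec cephfs_cache
    ceph osd tier cache-mode cephfs_cache writeback
    ceph osd tier set-overlay cephfs_data_ec cephfs_cache

with placeholder pool names.)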


Dietmar


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD node memory sizing

2016-05-19 Thread Dietmar Rieder
Hello,

On 05/19/2016 03:36 AM, Christian Balzer wrote:
> 
> Hello again,
> 
> On Wed, 18 May 2016 15:32:50 +0200 Dietmar Rieder wrote:
> 
>> Hello Christian,
>>
>>> Hello,
>>>
>>> On Wed, 18 May 2016 13:57:59 +0200 Dietmar Rieder wrote:
>>>
>>>> Dear Ceph users,
>>>>
>>>> I've a question regarding the memory recommendations for an OSD node.
>>>>
>>>> The official Ceph hardware recommendations say that an OSD node should
>>>> have 1GB Ram / TB OSD [1]
>>>>
>>>> The "Reference Architecture" whitpaper from Red Hat & Supermicro says
>>>> that "typically" 2GB of memory per OSD on a OSD node is used. [2]
>>>>
>>> This question has been asked and answered here countless times.
>>>
>>> Maybe something a bit more detailed ought to be placed in the first
>>> location, or simply a reference to the 2nd one. 
>>> But then again, that would detract from the RH added value.
>>
>> thanks for replying, nonetheless.
>> I checked the list before but I failed to find a definitive answer, may
>> be I was not looking hard enough. Anyway, thanks!
>>
> They tend to hidden sometimes in other threads, but there really is a lot..

It seems so, have to dig deeper into the available discussions...

> 
>>>  
>>>> According to the recommendation in [1] an OSD node with 24x 8TB OSD
>>>> disks is "underpowered "  when it is equipped with 128GB of RAM.
>>>> However, following the "recommendation" in [2] 128GB should be plenty
>>>> enough.
>>>>
>>> It's fine per se, the OSD processes will not consume all of that even
>>> in extreme situations.
>>
>> Ok, if I understood this correctly, then 128GB should be enough also
>> during rebalancing or backfilling.
>>
> Definitely, but realize that during this time of high memory consumption
> cause by backfilling your system is also under strain from objects moving
> in an out, so as per the high-density thread you will want all your dentry
> and other important SLAB objects to stay in RAM.
> 
> That's a lot of objects potentially with 8TB, so when choosing DIMMs pick
> ones that leave you with the option to go to 256GB later if need be.

Good point, I'll keep this in mind

> 
> Also you'll probably have loads of fun playing with CRUSH weights to keep
> the utilization of these 8TB OSDs within 100GB of each other. 

I'm afraid that finding the "optimal" settings will demand a lot of
testing/playing.
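
(I suppose a starting point would be to watch "ceph osd df tree" and, when
the spread between OSDs gets too big, try

    ceph osd test-reweight-by-utilization
    ceph osd reweight-by-utilization

but that is untested guesswork on my side for now.)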

> 
>>>
>>> Very large OSDs and high density storage nodes have other issues and
>>> challenges, tuning and memory wise.
>>> There are several threads about these recently, including today.
>>
>> Thanks, I'll study these...
>>
>>>> I'm wondering which of the two is good enough for a Ceph cluster with
>>>> 10 nodes using EC (6+3)
>>>>
>>> I would spend more time pondering about the CPU power of these machines
>>> (EC need more) and what cache tier to get.
>>
>> We are planing to equip the OSD nodes with 2x2650v4 CPUs (24 cores @
>> 2.2GHz), that is 1 core/OSD. For the cache tier each OSD node gets two
>> 800Gb NVMe's. We hope this setup will give reasonable performance with
>> EC.
>>
> So you have actually 26 OSDs per node then.
> I'd say the CPUs are fine, but EC and the NVMes will eat a fair share of
> it.

You're right, it is 26 OSDs, but still I assume that with these CPUs we
will not be completely underpowered.

> That's why I prefer to have dedicated cache tier nodes with fewer but
> faster cores, unless the cluster is going to be very large.
> With Hammer a 800GB DC S3160 SSD based OSD can easily saturate a 
> "E5-2623 v3" core @3.3GHz (nearly 2 cores to be precise) and Jewel has
> optimization that will both make it faster by itself AND enable it to
> use more CPU resources as well.
> 

That's probably the best solution, but it will not fit within our budget
and rackspace limits for the first setup. However, when expanding later
on it will definitely be something to consider, also depending on the
performance that we obtain with this first setup.

> The NVMes (DC P3700 one presumes?) just for cache tiering, no SSD
> journals for the OSDs?

For now we have an offer for HPE  800GB NVMe MU (mixed use), 880MB/s
write, 2600MB/s read, 3 DW/D. So they are as fast as the DC P3700; we will
probably also check other options.

> What are your network plans then, as in is your node storage bandwidth a
> good match for your network bandwidth? 
>

As network we wi

Re: [ceph-users] OSD node memory sizing

2016-05-18 Thread Dietmar Rieder
Hello Christian,

> Hello,
> 
> On Wed, 18 May 2016 13:57:59 +0200 Dietmar Rieder wrote:
> 
>> Dear Ceph users,
>>
>> I've a question regarding the memory recommendations for an OSD node.
>>
>> The official Ceph hardware recommendations say that an OSD node should
>> have 1GB Ram / TB OSD [1]
>>
>> The "Reference Architecture" whitpaper from Red Hat & Supermicro says
>> that "typically" 2GB of memory per OSD on a OSD node is used. [2]
>>
> This question has been asked and answered here countless times.
> 
> Maybe something a bit more detailed ought to be placed in the first
> location, or simply a reference to the 2nd one. 
> But then again, that would detract from the RH added value.

thanks for replying, nonetheless.
I checked the list before but I failed to find a definitive answer; maybe
I was not looking hard enough. Anyway, thanks!

>  
>> According to the recommendation in [1] an OSD node with 24x 8TB OSD
>> disks is "underpowered "  when it is equipped with 128GB of RAM.
>> However, following the "recommendation" in [2] 128GB should be plenty
>> enough.
>>
> It's fine per se, the OSD processes will not consume all of that even in
> extreme situations.

Ok, if I understood this correctly, then 128GB should be enough also
during rebalancing or backfilling.

> 
> Very large OSDs and high density storage nodes have other issues and
> challenges, tuning and memory wise.
> There are several threads about these recently, including today.

Thanks, I'll study these...

>> I'm wondering which of the two is good enough for a Ceph cluster with 10
>> nodes using EC (6+3)
>>
> I would spend more time pondering about the CPU power of these machines
> (EC need more) and what cache tier to get.

We are planning to equip the OSD nodes with 2x2650v4 CPUs (24 cores @
2.2GHz), that is 1 core/OSD. For the cache tier each OSD node gets two
800Gb NVMe's. We hope this setup will give reasonable performance with EC.

> That is, if performance is a requirement in your use case.

Always, who wouldn't care about performance?  :-)

Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at





signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD node memory sizing

2016-05-18 Thread Dietmar Rieder
Dear Ceph users,

I've a question regarding the memory recommendations for an OSD node.

The official Ceph hardware recommendations say that an OSD node should
have 1GB Ram / TB OSD [1]

The "Reference Architecture" whitpaper from Red Hat & Supermicro says
that "typically" 2GB of memory per OSD on a OSD node is used. [2]

According to the recommendation in [1] an OSD node with 24x 8TB OSD
disks is "underpowered "  when it is equipped with 128GB of RAM.
However, following the "recommendation" in [2] 128GB should be plenty
enough.
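
Spelled out for such a node, the two rules give quite different numbers:

  [1]  24 OSDs x 8 TB x 1 GB/TB  = 192 GB   (128 GB looks undersized)
  [2]  24 OSDs x 2 GB/OSD        =  48 GB   (128 GB is ample)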

I'm wondering which of the two is good enough for a Ceph cluster with 10
nodes using EC (6+3)

Thanks for any comment
  Dietmar

[1] http://docs.ceph.com/docs/jewel/start/hardware-recommendations/
[2]
https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS memory sizing

2016-03-01 Thread Dietmar Rieder
Dear ceph users,


I'm in the very initial phase of planning a ceph cluster an have a
question regarding the RAM recommendation for an MDS.

According to the ceph docs the minimum amount of RAM should be "1 GB
minimum per daemon". Is this per OSD in the cluster or per MDS in the
cluster?
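
(For context, as far as I understand it the MDS memory footprint is driven
mainly by its inode cache, capped by the mds_cache_size option (an inode
count, default 100000), e.g. in ceph.conf:

    [mds]
    mds cache size = 100000

so it depends more on the number of actively open files than on raw
cluster capacity. Please correct me if that picture is wrong.)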

I plan to run 3 ceph-mon on 3 dedicated machines and would like to run 3
ceph-mds on these machines as well. The raw capacity of the cluster
should be ~1.9PB. Would 64GB of RAM then be enough for the
ceph-mon/ceph-mds nodes?

Thanks
  Dietmar

-- 
_
D i e t m a r  R i e d e r



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com