Re: [ceph-users] Active+clean PGs reported many times in log

2017-11-20 Thread Matteo Dacrema
I was running 10.2.7 but I upgraded to 10.2.10 a few days ago.

Here is the pg dump:

https://owncloud.enter.it/index.php/s/AaD5Fc5tA6c8i1G 




> On 19 Nov 2017, at 11:15, Gregory Farnum wrote:
> 
> On Tue, Nov 14, 2017 at 1:09 AM Matteo Dacrema wrote:
> Hi,
> I noticed that sometimes the monitors start to log active+clean pgs many 
> times in the same line. For example I have 18432 PGs and the log shows "2136 
> active+clean, 28 active+clean, 2 active+clean+scrubbing+deep, 16266 
> active+clean;"
> After a minute the monitor starts to log correctly again.
> 
> Is it normal ?
> 
> That definitely looks weird to me, but I can imagine a few ways for it to 
> occur. What version of Ceph are you running? Can you extract the pgmap and 
> post the binary somewhere?
>  
> 
> 2017-11-13 11:05:08.876724 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797105: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 40596 kB/s 
> rd, 89723 kB/s wr, 4899 op/s
> 2017-11-13 11:05:09.911266 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797106: 18432 pgs: 2 active+clean+scrubbing+deep, 18430 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 45931 kB/s 
> rd, 114 MB/s wr, 6179 op/s
> 2017-11-13 11:05:10.751378 7fb359cfb700  0 mon.controller001@0(leader) e1 
> handle_command mon_command({"prefix": "osd pool stats", "format": "json"} v 
> 0) v1
> 2017-11-13 11:05:10.751599 7fb359cfb700  0 log_channel(audit) log [DBG] : 
> from='client.? 10.16.24.127:0/547552484' 
> entity='client.telegraf' cmd=[{"prefix": "osd pool stats", "format": 
> "json"}]: dispatch
> 2017-11-13 11:05:10.926839 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797107: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 47617 kB/s 
> rd, 134 MB/s wr, 7414 op/s
> 2017-11-13 11:05:11.921115 7fb35d17d700  1 mon.controller001@0(leader).osd 
> e120942 e120942: 216 osds: 216 up, 216 in
> 2017-11-13 11:05:11.926818 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> osdmap e120942: 216 osds: 216 up, 216 in
> 2017-11-13 11:05:11.984732 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797109: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 54110 kB/s 
> rd, 115 MB/s wr, 7827 op/s
> 2017-11-13 11:05:13.085799 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797110: 18432 pgs: 973 active+clean, 12 active+clean, 3 
> active+clean+scrubbing+deep, 17444 active+clean; 59520 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 115 MB/s rd, 90498 kB/s wr, 8490 op/s
> 2017-11-13 11:05:14.181219 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797111: 18432 pgs: 2136 active+clean, 28 active+clean, 2 
> active+clean+scrubbing+deep, 16266 active+clean; 59520 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 136 MB/s rd, 94461 kB/s wr, 10237 op/s
> 2017-11-13 11:05:15.324630 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797112: 18432 pgs: 3179 active+clean, 44 active+clean, 2 
> active+clean+scrubbing+deep, 15207 active+clean; 59519 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 184 MB/s rd, 81743 kB/s wr, 13786 op/s
> 2017-11-13 11:05:16.381452 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797113: 18432 pgs: 3600 active+clean, 52 active+clean, 2 
> active+clean+scrubbing+deep, 14778 active+clean; 59518 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 208 MB/s rd, 77342 kB/s wr, 14382 op/s
> 2017-11-13 11:05:17.272757 7fb3570f2700  1 leveldb: Level-0 table #26314650: 
> started
> 2017-11-13 11:05:17.390808 7fb3570f2700  1 leveldb: Level-0 table #26314650: 
> 18281928 bytes OK
> 2017-11-13 11:05:17.392636 7fb3570f2700  1 leveldb: Delete type=0 #26314647
> 
> 2017-11-13 11:05:17.397516 7fb3570f2700  1 leveldb: Manual compaction at 
> level-0 from 'pgmap\x0099796362' @ 72057594037927935 : 1 .. 
> 'pgmap\x0099796613' @ 0 : 0; will stop at 'pgmap_pg\x006.ff' @ 29468156273 : 1
> 
> 
> Thank you
> Matteo
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Rename iscsi target_iqn

2017-11-20 Thread Frank Brendel

Hi Jason,

On 17.11.2017 at 14:09, Jason Dillaman wrote:

how can I rename an iscsi target_iqn?

That operation is not supported via gwcli.

Is there a special reason for that or is it simply not implemented?


And where is the configuration that I made with gwcli stored?

It's stored in a JSON object within the 'rbd' pool named "gateway.conf".
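
For reference, one way to inspect it directly (a rough sketch, assuming the default 'rbd' pool and that Python's json.tool module is available):

  rados -p rbd get gateway.conf /tmp/gateway.conf.json   # dump the object to a file
  python -m json.tool /tmp/gateway.conf.json             # pretty-print the JSON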

To start from scratch I made the following steps:

1. Stop the iSCSI gateway on all nodes 'systemctl stop rbd-target-gw'
2. Remove the iSCSI kernel configuration on all nodes 'targetctl clear'
3. Remove gateway.conf from rbd pool 'rados -p rbd rm gateway.conf'
4. Start the iSCSI gateway on all nodes 'systemctl start rbd-target-api'

Is this the recommended way?

Thank you
Frank
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-20 Thread Ashley Merrick
Hello,


So I tried, as suggested, marking one OSD that continuously failed as lost and 
adding a new OSD to take its place.


However, all this does is make another 2-3 OSDs fail with the exact same error.


Seems this is a pretty huge and nasty bug / issue!


Greg, you'll have to give me some more information about what you need if you want 
me to try and get it.


However, right now the cluster itself is pretty much toast due to the number of 
OSDs now hitting this assert.


,Ashley


From: Gregory Farnum 
Sent: 19 November 2017 09:25:39
To: Ashley Merrick
Cc: David Turner; ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

I only see two asserts (in my local checkout) in that function; one is a metadata check
assert(info.history.same_interval_since != 0);
and the other is a sanity check
assert(!deleting);

Can you open a core dump with gdb and look at what line it's on in the 
start_peering_interval frame? (May need to install the debug packages.)
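
A minimal gdb session for that would look something like this (a sketch; the core file path is a placeholder, and the ceph debug packages need to be installed for line numbers to show):

  gdb /usr/bin/ceph-osd <corefile>
  (gdb) bt          # backtrace; with debug symbols each frame shows file:line
  (gdb) frame <N>   # select the PG::start_peering_interval frame from the backtrace
  (gdb) list        # show the surrounding source lines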

I think we've run across that first assert as an issue before, but both of them 
ought to be dumping out more cleanly about what line they're on.
-Greg


On Sun, Nov 19, 2017 at 1:32 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:

Hello,



So seems noup does not help.



Still have the same error :



2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **in 
thread 7fb4446cd700 thread_name:tp_peering



ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

1: (()+0xa0c554) [0x56547f500554]

2: (()+0x110c0) [0x7fb45cabe0c0]

3: (gsignal()+0xcf) [0x7fb45ba85fcf]

4: (abort()+0x16a) [0x7fb45ba873fa]

5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) 
[0x56547f547f0e]

6: (PG::start_peering_interval(std::shared_ptr, std::vector > const&, int, std::vector > 
const&, int, ObjectStore::Transaction*)+0x1569) [0x56547f029ad9]

7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56547f02a099]

8: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x188) [0x56547f06c6d8]

9: (boost::statechart::state_machine, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
 const&)+0x69) [0x56547f045549]

10: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector >&, int, 
std::vector >&, int, PG::RecoveryCtx*)+0x4a7) 
[0x56547f00e837]

11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, 
std::less >, std::allocator > 
>*)+0x2e7) [0x56547ef56e67]

12: (OSD::process_peering_events(std::__cxx11::list > 
const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]

13: (ThreadPool::BatchWorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]

14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]

15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]

16: (()+0x7494) [0x7fb45cab4494]

17: (clone()+0x3f) [0x7fb45bb3baff]

NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.



I guess even with noup the OSD/PG still has to peer with the other PGs, which 
is the stage that causes the failure; most OSDs seem to stay up for about 30 
seconds, and every time it's a different PG listed in the failure.



,Ashley



From: David Turner [mailto:drakonst...@gmail.com]

Sent: 18 November 2017 22:19
To: Ashley Merrick <ash...@amerrick.co.uk>

Cc: Eric Nelson <ericnel...@gmail.com>; 
ceph-us...@ceph.com

Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous



Does letting the cluster run with noup for a while until all down disks are 
idle, and then letting them come in, help at all?  I don't know your specific 
issue and haven't touched bluestore yet, but that is generally sound advice 
when OSDs won't start.

Also is there any pattern to the osds that are down? Common PGs, common hosts, 
common ssds, etc?



On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:

Hello,



Any further suggestions or workarounds from anyone?



The cluster is hard down now with around 2% of PGs offline; on occasion I am able to 
get an OSD to start for a bit, but then it will seem to do some peering and again 
crash with "*** Caught signal (Aborted) ** in thread 7f3471c55700 
thread_name:tp_peering"



,Ashley



From: Ashley Merrick

Sent: 16 November 2017 17:27
To: Eric Nelson <ericnel...@gmail.com>

Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous



Hello,



Good to hear it's not just me, however have a cluster basically offline due to 
too many OSD's dropping for this issue.



Anybody have any suggestions?



,Ashley



From: Eric Nelson <ericnel...@gmail.com>
Sent: 16 November 2017 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-20 Thread Ashley Merrick
One thing I have been trying on the newly down OSDs is exporting a PG and 
importing it to another OSD using ceph-objectstore-tool.


The export & import go fine; however, when the OSD is then started back up, the PG 
query still shows it is looking for the old down OSD. Should the OSD starting 
with a copy of the PG not communicate that it now holds the data the PG wants?
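
For reference, the export/import was along these lines (a rough sketch; the OSD paths and pgid are placeholders, and both OSDs were stopped while running it):

  # on the down (source) OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<src-id> \
      --pgid <pg.id> --op export --file /tmp/<pg.id>.export

  # on the destination OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<dst-id> \
      --op import --file /tmp/<pg.id>.export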


Or do I need to force it to see this somehow?


I can't mark the old OSD down or lost, as doing that causes further OSDs to go 
down, so I just have to leave them stopped but still listed as OSDs.


,Ashley


From: Ashley Merrick
Sent: 20 November 2017 08:56:15
To: Gregory Farnum
Cc: David Turner; ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Hello,


So I tried, as suggested, marking one OSD that continuously failed as lost and 
adding a new OSD to take its place.


However, all this does is make another 2-3 OSDs fail with the exact same error.


Seems this is a pretty huge and nasty bug / issue!


Greg, you'll have to give me some more information about what you need if you want 
me to try and get it.


However, right now the cluster itself is pretty much toast due to the number of 
OSDs now hitting this assert.


,Ashley


From: Gregory Farnum 
Sent: 19 November 2017 09:25:39
To: Ashley Merrick
Cc: David Turner; ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

I only see two asserts (in my local checkout) in that function; one is a metadata check
assert(info.history.same_interval_since != 0);
and the other is a sanity check
assert(!deleting);

Can you open a core dump with gdb and look at what line it's on in the 
start_peering_interval frame? (May need to install the debug packages.)

I think we've run across that first assert as an issue before, but both of them 
ought to be dumping out more cleanly about what line they're on.
-Greg


On Sun, Nov 19, 2017 at 1:32 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:

Hello,



So seems noup does not help.



Still have the same error :



2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **in 
thread 7fb4446cd700 thread_name:tp_peering



ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

1: (()+0xa0c554) [0x56547f500554]

2: (()+0x110c0) [0x7fb45cabe0c0]

3: (gsignal()+0xcf) [0x7fb45ba85fcf]

4: (abort()+0x16a) [0x7fb45ba873fa]

5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) 
[0x56547f547f0e]

6: (PG::start_peering_interval(std::shared_ptr, std::vector > const&, int, std::vector > 
const&, int, ObjectStore::Transaction*)+0x1569) [0x56547f029ad9]

7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56547f02a099]

8: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x188) [0x56547f06c6d8]

9: (boost::statechart::state_machine, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
 const&)+0x69) [0x56547f045549]

10: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector >&, int, 
std::vector >&, int, PG::RecoveryCtx*)+0x4a7) 
[0x56547f00e837]

11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, 
std::less >, std::allocator > 
>*)+0x2e7) [0x56547ef56e67]

12: (OSD::process_peering_events(std::__cxx11::list > 
const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]

13: (ThreadPool::BatchWorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]

14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]

15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]

16: (()+0x7494) [0x7fb45cab4494]

17: (clone()+0x3f) [0x7fb45bb3baff]

NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.



I guess even with noup the OSD/PG still has to peer with the other PGs, which 
is the stage that causes the failure; most OSDs seem to stay up for about 30 
seconds, and every time it's a different PG listed in the failure.



,Ashley



From: David Turner [mailto:drakonst...@gmail.com]

Sent: 18 November 2017 22:19
To: Ashley Merrick <ash...@amerrick.co.uk>

Cc: Eric Nelson <ericnel...@gmail.com>; 
ceph-us...@ceph.com

Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous



Does letting the cluster run with noup for a while until all down disks are 
idle, and then letting them come in, help at all?  I don't know your specific 
issue and haven't touched bluestore yet, but that is generally sound advice 
when OSDs won't start.

Also is there any pattern to the osds that are down? Common PGs, common hosts, 
common ssds, etc?



On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:

Hello,



Any further suggestions or work aroun

[ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
Hi,

Can someone please help me: how do I improve performance on our Ceph cluster?

The hardware in use are as follows:
3x SuperMicro servers with the following configuration
12Core Dual XEON 2.2Ghz
128GB RAM
2x 400GB Intel DC SSD drives
4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
1x SuperMicro DOM for Proxmox / Debian OS
4x Port 10Gbe NIC
Cisco 10Gbe switch.


root@virt2:~# rados bench -p Data 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for   up to 10 seconds or 0 objects
Object prefix: benchmark_data_virt2_39099
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
0   0 0 0 0 0   -
 0
1  168569   275.979   2760.185576
0.204146
2  16   171   155   309.966   344   0.0625409
0.193558
3  16   243   227   302.633   288   0.0547129
 0.19835
4  16   330   314   313.965   348   0.0959492
0.199825
5  16   413   397   317.565   3320.124908
0.196191
6  16   494   478   318.633   324  0.1556
0.197014
7  15   591   576   329.109   3920.136305
0.192192
8  16   670   654   326.965   312   0.0703808
0.190643
9  16   757   741   329.297   3480.165211
0.192183
   10  16   828   812   324.764   284   0.0935803
0.194041
Total time run: 10.120215
Total writes made:  829
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 327.661
Stddev Bandwidth:   35.8664
Max bandwidth (MB/sec): 392
Min bandwidth (MB/sec): 276
Average IOPS:   81
Stddev IOPS:8
Max IOPS:   98
Min IOPS:   69
Average Latency(s): 0.195191
Stddev Latency(s):  0.0830062
Max latency(s): 0.481448
Min latency(s): 0.0414858
root@virt2:~# hdparm -I /dev/sda



root@virt2:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
-1   72.78290 root default
-3   29.11316 host virt1
 1   hdd  7.27829 osd.1  up  1.0 1.0
 2   hdd  7.27829 osd.2  up  1.0 1.0
 3   hdd  7.27829 osd.3  up  1.0 1.0
 4   hdd  7.27829 osd.4  up  1.0 1.0
-5   21.83487 host virt2
 5   hdd  7.27829 osd.5  up  1.0 1.0
 6   hdd  7.27829 osd.6  up  1.0 1.0
 7   hdd  7.27829 osd.7  up  1.0 1.0
-7   21.83487 host virt3
 8   hdd  7.27829 osd.8  up  1.0 1.0
 9   hdd  7.27829 osd.9  up  1.0 1.0
10   hdd  7.27829 osd.10 up  1.0 1.0
 0  0 osd.0down0 1.0


root@virt2:~# ceph -s
  cluster:
id: 278a2e9c-0578-428f-bd5b-3bb348923c27
health: HEALTH_OK

  services:
mon: 3 daemons, quorum virt1,virt2,virt3
mgr: virt1(active)
osd: 11 osds: 10 up, 10 in

  data:
pools:   1 pools, 512 pgs
objects: 6084 objects, 24105 MB
usage:   92822 MB used, 74438 GB / 74529 GB avail
pgs: 512 active+clean

root@virt2:~# ceph -w
  cluster:
id: 278a2e9c-0578-428f-bd5b-3bb348923c27
health: HEALTH_OK

  services:
mon: 3 daemons, quorum virt1,virt2,virt3
mgr: virt1(active)
osd: 11 osds: 10 up, 10 in

  data:
pools:   1 pools, 512 pgs
objects: 6084 objects, 24105 MB
usage:   92822 MB used, 74438 GB / 74529 GB avail
pgs: 512 active+clean


2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0



The SSD drives are used as journal drives:

root@virt3:~# ceph-disk list | grep /dev/sde | grep osd
 /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
block.db /dev/sde1
root@virt3:~# ceph-disk list | grep /dev/sdf | grep osd
 /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
block.db /dev/sdf1
 /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
block.db /dev/sdf2



I see now /dev/sda doesn't have a journal, though it should have. Not sure
why.
This is the command I used to create it:


 pveceph createosd /dev/sda -bluestore 1  -journal_dev /dev/sde


-- 
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Switch to replica 3

2017-11-20 Thread Matteo Dacrema
Hi,

I need to switch a cluster of over 200 OSDs from replica 2 to replica 3.
There are two different crush maps for HDDs and SSDs, also mapped to two 
different pools.

Is there a best practice to follow? Can this cause any trouble?

Thank you
Matteo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
As a matter of interest, when I ran the test, the network throughput reached
3.98Gb/s:

 ens2f0  /  traffic statistics

   rx |   tx
--+--
  bytes 2.59 GiB  |4.63 GiB
--+--
  max2.29 Gbit/s  | 3.98 Gbit/s
  average  905.58 Mbit/s  | 1.62 Gbit/s
  min 203 kbit/s  |  186 kbit/s
--+--
  packets1980792  | 3354372
--+--
  max 207630 p/s  |  342902 p/s
  average  82533 p/s  |  139765 p/s
  min 51 p/s  |  56 p/s
--+--
  time24 seconds

Some more stats:

root@virt2:~# rados bench -p Data 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
0   0 0 0 0 0   -
 0
1  16   402   386   1543.69  1544  0.00182802
 0.0395421
2  16   773   757   1513.71  1484  0.00243911
 0.0409455
Total time run:   2.340037
Total reads made: 877
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   1499.12
Average IOPS: 374
Stddev IOPS:  10
Max IOPS: 386
Min IOPS: 371
Average Latency(s):   0.0419036
Max latency(s):   0.176739
Min latency(s):   0.00161271




root@virt2:~# rados bench -p Data 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
0   0 0 0 0 0   -
 0
1  16   376   360   1439.71  1440   0.0356502
 0.0409024
2  16   752   736   1471.74  1504   0.0163304
 0.0419063
3  16  1134  1118   1490.43  15280.059643
 0.0417043
4  16  1515  1499   1498.78  1524   0.0502131
 0.0416087
5  15  1880  1865   1491.79  14640.017407
 0.0414158
6  16  2254  2238   1491.79  1492   0.0657474
 0.0420471
7  15  2509  2494   1424.95  1024  0.00182097
 0.0440063
8  15  2873  2858   1428.81  1456   0.0302541
 0.0439319
9  15  3243  3228   1434.47  14800.108037
 0.0438106
   10  16  3616  3600   1439.81  1488   0.0295953
 0.0436184
Total time run:   10.058519
Total reads made: 3616
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   1437.99
Average IOPS: 359
Stddev IOPS:  37
Max IOPS: 382
Min IOPS: 256
Average Latency(s):   0.0438002
Max latency(s):   0.664223
Min latency(s):   0.00156885







On Mon, Nov 20, 2017 at 12:38 PM, Rudi Ahlers  wrote:

> Hi,
>
> Can someone please help me, how do I improve performance on ou CEPH
> cluster?
>
> The hardware in use are as follows:
> 3x SuperMicro servers with the following configuration
> 12Core Dual XEON 2.2Ghz
> 128GB RAM
> 2x 400GB Intel DC SSD drives
> 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> 1x SuperMicro DOM for Proxmox / Debian OS
> 4x Port 10Gbe NIC
> Cisco 10Gbe switch.
>
>
> root@virt2:~# rados bench -p Data 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for   up to 10 seconds or 0 objects
> Object prefix: benchmark_data_virt2_39099
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  168569   275.979   2760.185576
> 0.204146
> 2  16   171   155   309.966   344   0.0625409
> 0.193558
> 3  16   243   227   302.633   288   0.0547129
>  0.19835
> 4  16   330   314   313.965   348   0.0959492
> 0.199825
> 5  16   413   397   317.565   3320.124908
> 0.196191
> 6  16   494   478   318.633   324  0.1556
> 0.197014
> 7  15   591   576   329.109   3920.136305
> 0.192192
> 8  16   670   654   326.965   312   0.0703808
> 0.190643
> 9  16   757   741   329.297   3480.165211
> 0.192183
>10  16   828   812   324.764   284   0.0935803
> 0.194041
> Total time run: 10.120215
> Total writes made:  829
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 327.661
> Stddev Bandwidth:   35.8664
> Max bandwidth (MB/sec): 392
> Min bandwidth (MB/sec): 276
> Average IOPS:   81
> Stddev IOPS:8
> Max IOPS:   98
> Min IOPS:   69
> Average Lat

Re: [ceph-users] OSD is near full and slow in accessing storage from client

2017-11-20 Thread gjprabu
Hi David,



Sorry for the late reply. The OSD sync has completed, and moreover the fourth 
OSD's available size still keeps reducing. Is there any option to check 
or fix this?





ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS 


0 3.29749  1.0  3376G  2320G  1056G 68.71 1.10 144


1 3.26869  1.0  3347G  1871G  1475G 55.92 0.89 134

2 3.27339  1.0  3351G  1699G  1652G 50.69 0.81 134

3 3.24089  1.0  3318G  1865G  1452G 56.22 0.90 142

4 3.24089  1.0  3318G  2839G   478G 85.57 1.37 158

5 3.32669  1.0  3406G  2249G  1156G 66.04 1.06 136

6 3.27800  1.0  3356G  1924G  1432G 57.33 0.92 139

7 3.20470  1.0  3281G  1949G  1331G 59.42 0.95 141

  TOTAL 26757G 16720G 10037G 62.49 

MIN/MAX VAR: 0.81/1.37  STDDEV: 10.26





Regards

Prabu GJ






 On Mon, 13 Nov 2017 00:27:47 +0530 David Turner 
 wrote 




You cannot reduce the PG count for a pool.  So there isn't anything you can 
really do for this unless you create a new FS with better PG counts and migrate 
your data into it.

The problem with having more PGs than you need is in the memory footprint for 
the osd daemon. There are warning thresholds for having too many PGs per osd.  
Also in future expansions, if you need to add pools, you might not be able to 
create the pools with the proper amount of PGs due to older pools that have way 
too many PGs.

It would still be nice to see the output from those commands I asked about.

The built-in reweighting scripts might help your data distribution.  
reweight-by-utilization
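
A hedged sketch of how that is typically invoked (the 120% overload threshold is just an example):

  ceph osd test-reweight-by-utilization 120   # dry run, shows what would change
  ceph osd reweight-by-utilization 120        # actually adjusts the reweights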



On Sun, Nov 12, 2017, 11:41 AM gjprabu  wrote:







Hi David,



Thanks for your valuable reply. Once the backfilling for the new OSD completes, we 
will consider increasing the replica value ASAP. Is it possible to decrease the 
metadata PG count? If the PG count for metadata is the same value as for the data 
pool, what kind of issue may occur? 



Regards

PrabuGJ






 On Sun, 12 Nov 2017 21:25:05 +0530 David 
Turner wrote 





What's the output of `ceph df` to see if your PG counts are good or not?  Like 
everyone else has said, the space on the original osds can't be expected to 
free up until the backfill from adding the new osd has finished.

You don't have anything in your cluster health to indicate that your cluster 
will not be able to finish this backfilling operation on its own.

You might find this URL helpful in calculating your PG counts. 
http://ceph.com/pgcalc/  As a side note. It is generally better to keep your PG 
counts as base 2 numbers (16, 64, 256, etc). When you do not have a base 2 
number then some of your PGs will take up twice as much space as others. In 
your case with 250, you have 244 PGs that are the same size and 6 PGs that are 
twice the size of those 244 PGs.  Bumping that up to 256 will even things out.
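
If you do bump it, a sketch of the commands (the pool name is a placeholder; pgp_num should be raised to match pg_num afterwards):

  ceph osd pool set <data_pool> pg_num 256
  ceph osd pool set <data_pool> pgp_num 256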

Assuming that the metadata pool is for a CephFS volume, you do not need nearly 
so many PGs for that pool. Also, I would recommend changing at least the 
metadata pool to 3 replica_size. If we can talk you into 3 replica for 
everything else, great! But if not, at least do the metadata pool. If you lose 
an object in the data pool, you just lose that file. If you lose an object in 
the metadata pool, you might lose access to the entire CephFS volume.
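
For example, assuming the metadata pool is named 'cephfs_metadata' (adjust to your actual pool name):

  ceph osd pool set cephfs_metadata size 3
  ceph osd pool set cephfs_metadata min_size 2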



On Sun, Nov 12, 2017, 9:39 AM gjprabu  wrote:



Hi Cassiano,



   Thanks for your valuable feedback; we will wait for some time until the new 
OSD sync completes. Also, will increasing the PG count solve the issue? In our 
setup the PG number for the data and metadata pools is 250. Is this 
correct for 7 OSDs with 2 replicas? Also, the currently stored data size is 17TB.



ceph osd df



ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS

0 3.29749  1.0  3376G  2814G  562G 83.35 1.23 165

1 3.26869  1.0  3347G  1923G 1423G 57.48 0.85 152

2 3.27339  1.0  3351G  1980G 1371G 59.10 0.88 161

3 3.24089  1.0  3318G  2131G 1187G 64.23 0.95 168

4 3.24089  1.0  3318G  2998G  319G 90.36 1.34 176

5 3.32669  1.0  3406G  2476G  930G 72.68 1.08 165

6 3.27800  1.0  3356G  1518G 1838G 45.24 0.67 166

  TOTAL 23476G 15843G 7632G 67.49 

MIN/MAX VAR: 0.67/1.34  STDDEV: 14.53



ceph osd tree

ID WEIGHT   TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY

-1 22.92604 root default  

-2  3.29749 host intcfs-osd1  

0  3.29749 osd.0 up  1.0  1.0

-3  3.26869 host intcfs-osd2  

1  3.26869 osd.1 up  1.0  1.0

-4  3.27339 host intcfs-osd3  

2  3.27339 osd.2 up  1.0  1.0

-5  3.24089 host intcfs-osd4  

3  3.24089 osd.3 up  1.0  1.

Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Wido den Hollander

> On 20 November 2017 at 11:56, Matteo Dacrema wrote:
> 
> 
> Hi,
> 
> I need to switch a cluster of over 200 OSDs from replica 2 to replica 3
> There are two different crush maps for HDD and SSDs also mapped to two 
> different pools.
> 
> Is there a best practice to use? Can this provoke troubles?
> 

The command is very simple, but without more information nobody can tell you.
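
For reference, it is just, per pool (a sketch, with <pool> being your pool name):

  ceph osd pool set <pool> size 3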

Can you share (attach) the output of 'ceph osd tree'?

Wido

> Thank you
> Matteo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Christian Balzer

Hello,

On Mon, 20 Nov 2017 11:56:31 +0100 Matteo Dacrema wrote:

> Hi,
> 
> I need to switch a cluster of over 200 OSDs from replica 2 to replica 3
I presume this means the existing cluster and not adding 100 OSDs...
 
> There are two different crush maps for HDD and SSDs also mapped to two 
> different pools.
>
> Is there a best practice to use? Can this provoke troubles?
> 
Are your SSDs a cache-tier or are they a fully separate pool?

As for troubles, how busy is your cluster during the recovery of failed
OSDs or deep scrubs?

There are 2 things to consider here:

1. The re-balancing and additional replication of all the data, which you
can control/ease by the various knobs present (a sketch of the usual ones is
below). Ceph version matters as to which are relevant/useful. It shouldn't
impact things too much, unless your cluster was at the very edge of its
capacity anyway.

2. The little detail that after 1) is done, your cluster will be
noticeably slower than before, especially in the latency department. 
In short, you don't just need to have the disk space to go 3x, but also
enough IOPS/bandwidth reserves.

Christian

> Thank you
> Matteo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Sébastien VIGNERON
Hi,

MTU size? Did you run an iperf test to see raw bandwidth?

Cordialement / Best regards,

Sébastien VIGNERON 
CRIANN, 
Ingénieur / Engineer
Technopôle du Madrillet 
745, avenue de l'Université 
76800 Saint-Etienne du Rouvray - France 
tél. +33 2 32 91 42 91 
fax. +33 2 32 91 42 92 
http://www.criann.fr 
mailto:sebastien.vigne...@criann.fr
support: supp...@criann.fr

> On 20 Nov 2017 at 11:58, Rudi Ahlers wrote:
> 
> As matter of interest, when I ran the test, the network throughput reached 
> 3.98Gb/s:
> 
>  ens2f0  /  traffic statistics
> 
>rx |   tx
> --+--
>   bytes 2.59 GiB  |4.63 GiB
> --+--
>   max2.29 Gbit/s  | 3.98 Gbit/s
>   average  905.58 Mbit/s  | 1.62 Gbit/s
>   min 203 kbit/s  |  186 kbit/s
> --+--
>   packets1980792  | 3354372
> --+--
>   max 207630 p/s  |  342902 p/s
>   average  82533 p/s  |  139765 p/s
>   min 51 p/s  |  56 p/s
> --+--
>   time24 seconds
> 
> Some more stats:
> 
> root@virt2:~# rados bench -p Data 10 seq
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> 0   0 0 0 0 0   -   0
> 1  16   402   386   1543.69  1544  0.00182802   0.0395421
> 2  16   773   757   1513.71  1484  0.00243911   0.0409455
> Total time run:   2.340037
> Total reads made: 877
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1499.12
> Average IOPS: 374
> Stddev IOPS:  10
> Max IOPS: 386
> Min IOPS: 371
> Average Latency(s):   0.0419036
> Max latency(s):   0.176739
> Min latency(s):   0.00161271
> 
> 
> 
> 
> root@virt2:~# rados bench -p Data 10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> 0   0 0 0 0 0   -   0
> 1  16   376   360   1439.71  1440   0.0356502   0.0409024
> 2  16   752   736   1471.74  1504   0.0163304   0.0419063
> 3  16  1134  1118   1490.43  15280.059643   0.0417043
> 4  16  1515  1499   1498.78  1524   0.0502131   0.0416087
> 5  15  1880  1865   1491.79  14640.017407   0.0414158
> 6  16  2254  2238   1491.79  1492   0.0657474   0.0420471
> 7  15  2509  2494   1424.95  1024  0.00182097   0.0440063
> 8  15  2873  2858   1428.81  1456   0.0302541   0.0439319
> 9  15  3243  3228   1434.47  14800.108037   0.0438106
>10  16  3616  3600   1439.81  1488   0.0295953   0.0436184
> Total time run:   10.058519
> Total reads made: 3616
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1437.99
> Average IOPS: 359
> Stddev IOPS:  37
> Max IOPS: 382
> Min IOPS: 256
> Average Latency(s):   0.0438002
> Max latency(s):   0.664223
> Min latency(s):   0.00156885
> 
> 
> 
> 
> 
> 
> 
On Mon, Nov 20, 2017 at 12:38 PM, Rudi Ahlers wrote:
> Hi, 
> 
> Can someone please help me, how do I improve performance on ou CEPH cluster?
> 
> The hardware in use are as follows:
> 3x SuperMicro servers with the following configuration
> 12Core Dual XEON 2.2Ghz
> 128GB RAM
> 2x 400GB Intel DC SSD drives
> 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> 1x SuperMicro DOM for Proxmox / Debian OS
> 4x Port 10Gbe NIC
> Cisco 10Gbe switch. 
> 
> 
> root@virt2:~# rados bench -p Data 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
> for   up to 10 seconds or 0 objects
> Object prefix: benchmark_data_virt2_39099
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> 0   0 0 0 0 0   -   0
> 1  168569   275.979   2760.1855760.204146
> 2  16   171   155   309.966   344   0.06254090.193558
> 3  16   243   227   302.633   288   0.0547129 0.19835
> 4  16   330   314   313.965   348   0.09594920.199825
> 5  16   413   397   317.565   3320.1249080.196191
> 6  16   494   478   318.633   324  0.15560.197014
> 7  15   591   576   329.109

Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Matteo Dacrema
Yes, I mean the existing cluster.
SSDs are on a fully separate pool.
The cluster is not busy during recovery and deep scrubs, but I think it's better to 
limit replication in some way when switching to replica 3.

My question is whether I need to set some option parameters to limit 
the impact of the creation of the new objects. I'm also concerned about disks filling 
up during recovery because of inefficient data balancing.

Here osd tree

ID  WEIGHT    TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10  19.69994 root ssd
-11   5.06998 host ceph101
166   0.98999 osd.166   up  1.0  1.0
167   1.0 osd.167   up  1.0  1.0
168   1.0 osd.168   up  1.0  1.0
169   1.07999 osd.169   up  1.0  1.0
170   1.0 osd.170   up  1.0  1.0
-12   4.92998 host ceph102
171   0.98000 osd.171   up  1.0  1.0
172   0.92999 osd.172   up  1.0  1.0
173   0.98000 osd.173   up  1.0  1.0
174   1.0 osd.174   up  1.0  1.0
175   1.03999 osd.175   up  1.0  1.0
-13   4.69998 host ceph103
176   0.84999 osd.176   up  1.0  1.0
177   0.84999 osd.177   up  1.0  1.0
178   1.0 osd.178   up  1.0  1.0
179   1.0 osd.179   up  1.0  1.0
180   1.0 osd.180   up  1.0  1.0
-14   5.0 host ceph104
181   1.0 osd.181   up  1.0  1.0
182   1.0 osd.182   up  1.0  1.0
183   1.0 osd.183   up  1.0  1.0
184   1.0 osd.184   up  1.0  1.0
185   1.0 osd.185   up  1.0  1.0
 -1 185.19835 root default
 -2  18.39980 host ceph001
 63   0.7 osd.63up  1.0  1.0
 64   0.7 osd.64up  1.0  1.0
 65   0.7 osd.65up  1.0  1.0
146   0.7 osd.146   up  1.0  1.0
147   0.7 osd.147   up  1.0  1.0
148   0.90999 osd.148   up  1.0  1.0
149   0.7 osd.149   up  1.0  1.0
150   0.7 osd.150   up  1.0  1.0
151   0.7 osd.151   up  1.0  1.0
152   0.7 osd.152   up  1.0  1.0
153   0.7 osd.153   up  1.0  1.0
154   0.7 osd.154   up  1.0  1.0
155   0.8 osd.155   up  1.0  1.0
156   0.84999 osd.156   up  1.0  1.0
157   0.7 osd.157   up  1.0  1.0
158   0.7 osd.158   up  1.0  1.0
159   0.84999 osd.159   up  1.0  1.0
160   0.90999 osd.160   up  1.0  1.0
161   0.90999 osd.161   up  1.0  1.0
162   0.90999 osd.162   up  1.0  1.0
163   0.7 osd.163   up  1.0  1.0
164   0.90999 osd.164   up  1.0  1.0
165   0.64999 osd.165   up  1.0  1.0
 -3  19.41982 host ceph002
 23   0.7 osd.23up  1.0  1.0
 24   0.7 osd.24up  1.0  1.0
 25   0.90999 osd.25up  1.0  1.0
 26   0.5 osd.26up  1.0  1.0
 27   0.95000 osd.27up  1.0  1.0
 28   0.64999 osd.28up  1.0  1.0
 29   0.75000 osd.29up  1.0  1.0
 30   0.8 osd.30up  1.0  1.0
 31   0.90999 osd.31up  1.0  1.0
 32   0.90999 osd.32up  1.0  1.0
 33   0.8 osd.33up  1.0  1.0
 34   0.90999 osd.34up  1.0  1.0
 35   0.90999 osd.35up  1.0  1.0
 36   0.84999 osd.36up  1.0  1.0
 37   0.8 osd.37up  1.0  1.0
 38   1.0 osd.38up  1.0  1.0
 39   0.7 osd.39up  1.0  1.0
 40   0.90999 osd.40up  1.0  1.0
 41   0.84999 osd.41up  1.0  1.0
 42   0.84999 osd.42up  1.0  1.0
 43   0.90999 osd.43up  1.0  1.0
 44   0.75000 osd.44up  1.0  1.0
 45   0.7 osd.45  

Re: [ceph-users] how to improve performance

2017-11-20 Thread Christian Balzer
On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:

> Hi,
> 
> Can someone please help me, how do I improve performance on ou CEPH cluster?
> 
> The hardware in use are as follows:
> 3x SuperMicro servers with the following configuration
> 12Core Dual XEON 2.2Ghz
Faster cores is better for Ceph, IMNSHO.
Though with main storage on HDDs, this will do.

> 128GB RAM
Overkill for Ceph but I see something else below...

> 2x 400GB Intel DC SSD drives
Exact model please.

> 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
One hopes that's a non SMR one.
Model please.

> 1x SuperMicro DOM for Proxmox / Debian OS
Ah, Proxmox. 
I'm personally not averse to converged, high density, multi-role clusters
myself, but you:
a) need to know what you're doing and
b) will find a lot of people here who don't approve of it.

I've avoided DOMs so far (non-hotswapable SPOF), even though the SM ones
look good on paper with regards to endurance and IOPS. 
The later being rather important for your monitors. 

> 4x Port 10Gbe NIC
> Cisco 10Gbe switch.
> 
Configuration would be nice for those, LACP?

> 
> root@virt2:~# rados bench -p Data 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for   up to 10 seconds or 0 objects

rados bench is a limited tool, and measuring bandwidth is pointless in nearly
all use cases. 
Latency is where it is at, and testing from inside a VM is more relevant
than synthetic tests of the storage.
But it is a start.
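
For a more latency/IOPS oriented run you could, for example, use a small block size (a sketch; -b is the object size in bytes, -t the number of concurrent ops):

  rados bench -p Data 30 write -t 16 -b 4096 --no-cleanup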

> Object prefix: benchmark_data_virt2_39099
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  168569   275.979   2760.185576
> 0.204146
> 2  16   171   155   309.966   344   0.0625409
> 0.193558
> 3  16   243   227   302.633   288   0.0547129
>  0.19835
> 4  16   330   314   313.965   348   0.0959492
> 0.199825
> 5  16   413   397   317.565   3320.124908
> 0.196191
> 6  16   494   478   318.633   324  0.1556
> 0.197014
> 7  15   591   576   329.109   3920.136305
> 0.192192
> 8  16   670   654   326.965   312   0.0703808
> 0.190643
> 9  16   757   741   329.297   3480.165211
> 0.192183
>10  16   828   812   324.764   284   0.0935803
> 0.194041
> Total time run: 10.120215
> Total writes made:  829
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 327.661
What part of this surprises you?

With a replication of 3, you have effectively the bandwidth of your 2 SSDs
(for small writes, not the case here) and the bandwidth of your 4 HDDs
available. 
Given overhead, other inefficiencies and the fact that this is not a
sequential write from the HDD perspective, 320MB/s isn't all that bad.
Though with your setup I would have expected something faster, but NOT the
theoretical 600MB/s 4 HDDs will do in sequential writes.

> Stddev Bandwidth:   35.8664
> Max bandwidth (MB/sec): 392
> Min bandwidth (MB/sec): 276
> Average IOPS:   81
> Stddev IOPS:8
> Max IOPS:   98
> Min IOPS:   69
> Average Latency(s): 0.195191
> Stddev Latency(s):  0.0830062
> Max latency(s): 0.481448
> Min latency(s): 0.0414858
> root@virt2:~# hdparm -I /dev/sda
> 
> 
> 
> root@virt2:~# ceph osd tree
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -1   72.78290 root default
> -3   29.11316 host virt1
>  1   hdd  7.27829 osd.1  up  1.0 1.0
>  2   hdd  7.27829 osd.2  up  1.0 1.0
>  3   hdd  7.27829 osd.3  up  1.0 1.0
>  4   hdd  7.27829 osd.4  up  1.0 1.0
> -5   21.83487 host virt2
>  5   hdd  7.27829 osd.5  up  1.0 1.0
>  6   hdd  7.27829 osd.6  up  1.0 1.0
>  7   hdd  7.27829 osd.7  up  1.0 1.0
> -7   21.83487 host virt3
>  8   hdd  7.27829 osd.8  up  1.0 1.0
>  9   hdd  7.27829 osd.9  up  1.0 1.0
> 10   hdd  7.27829 osd.10 up  1.0 1.0
>  0  0 osd.0down0 1.0
> 
> 
> root@virt2:~# ceph -s
>   cluster:
> id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> health: HEALTH_OK
> 
>   services:
> mon: 3 daemons, quorum virt1,virt2,virt3
> mgr: virt1(active)
> osd: 11 osds: 10 up, 10 in
> 
>   data:
> pools:   1 pools, 512 pgs
> objects: 6084 objects, 24105 MB
> usage:   92822 MB used, 74438 GB / 74529 GB avail
> pgs: 512 active+clean
> 
> root@virt2:~# ceph -w
>   cluster:
> id: 278a2e9c-0578-428f-bd5b-3bb348923c27
> health: HEALTH_OK
> 
>   services:
> mon: 3 daemons, quorum virt1,virt2,virt3
> mgr: vi

Re: [ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
root@virt2:~# iperf -c 10.10.10.81

Client connecting to 10.10.10.81, TCP port 5001
TCP window size: 1.78 MByte (default)

[  3] local 10.10.10.82 port 57132 connected with 10.10.10.81 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  10.5 GBytes  9.02 Gbits/sec


On Mon, Nov 20, 2017 at 1:22 PM, Sébastien VIGNERON <
sebastien.vigne...@criann.fr> wrote:

> Hi,
>
> MTU size? Did you ran an iperf test to see raw bandwidth?
>
> Cordialement / Best regards,
>
> Sébastien VIGNERON
> CRIANN,
> Ingénieur / Engineer
> Technopôle du Madrillet
> 745, avenue de l'Université
> 
>
> 76800 Saint-Etienne du Rouvray - France
> 
>
> tél. +33 2 32 91 42 91
> fax. +33 2 32 91 42 92
> http://www.criann.fr
> mailto:sebastien.vigne...@criann.fr 
> support: supp...@criann.fr
>
> On 20 Nov 2017 at 11:58, Rudi Ahlers wrote:
>
> As matter of interest, when I ran the test, the network throughput reached
> 3.98Gb/s:
>
>  ens2f0  /  traffic statistics
>
>rx |   tx
> --+--
>   bytes 2.59 GiB  |4.63 GiB
> --+--
>   max2.29 Gbit/s  | 3.98 Gbit/s
>   average  905.58 Mbit/s  | 1.62 Gbit/s
>   min 203 kbit/s  |  186 kbit/s
> --+--
>   packets1980792  | 3354372
> --+--
>   max 207630 p/s  |  342902 p/s
>   average  82533 p/s  |  139765 p/s
>   min 51 p/s  |  56 p/s
> --+--
>   time24 seconds
>
> Some more stats:
>
> root@virt2:~# rados bench -p Data 10 seq
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  16   402   386   1543.69  1544  0.00182802
>  0.0395421
> 2  16   773   757   1513.71  1484  0.00243911
>  0.0409455
> Total time run:   2.340037
> Total reads made: 877
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1499.12
> Average IOPS: 374
> Stddev IOPS:  10
> Max IOPS: 386
> Min IOPS: 371
> Average Latency(s):   0.0419036
> Max latency(s):   0.176739
> Min latency(s):   0.00161271
>
>
>
>
> root@virt2:~# rados bench -p Data 10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  16   376   360   1439.71  1440   0.0356502
>  0.0409024
> 2  16   752   736   1471.74  1504   0.0163304
>  0.0419063
> 3  16  1134  1118   1490.43  15280.059643
>  0.0417043
> 4  16  1515  1499   1498.78  1524   0.0502131
>  0.0416087
> 5  15  1880  1865   1491.79  14640.017407
>  0.0414158
> 6  16  2254  2238   1491.79  1492   0.0657474
>  0.0420471
> 7  15  2509  2494   1424.95  1024  0.00182097
>  0.0440063
> 8  15  2873  2858   1428.81  1456   0.0302541
>  0.0439319
> 9  15  3243  3228   1434.47  14800.108037
>  0.0438106
>10  16  3616  3600   1439.81  1488   0.0295953
>  0.0436184
> Total time run:   10.058519
> Total reads made: 3616
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1437.99
> Average IOPS: 359
> Stddev IOPS:  37
> Max IOPS: 382
> Min IOPS: 256
> Average Latency(s):   0.0438002
> Max latency(s):   0.664223
> Min latency(s):   0.00156885
>
>
>
>
>
>
>
> On Mon, Nov 20, 2017 at 12:38 PM, Rudi Ahlers 
> wrote:
>
>> Hi,
>>
>> Can someone please help me, how do I improve performance on ou CEPH
>> cluster?
>>
>> The hardware in use are as follows:
>> 3x SuperMicro servers with the following configuration
>> 12Core Dual XEON 2.2Ghz
>> 128GB RAM
>> 2x 400GB Intel DC SSD drives
>> 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
>> 1x SuperMicro DOM for Proxmox / Debian OS
>> 4x Port 10Gbe NIC
>> Cisco 10Gbe switch.
>>
>>
>> root@virt2:~# rados bench -p Data 10 write --no-cleanup
>> hints = 1
>> Maintaining 16 concurre

Re: [ceph-users] how to improve performance

2017-11-20 Thread ulembke

Hi Rudi,

On 2017-11-20 11:58, Rudi Ahlers wrote:

...

Some more stats:

root@virt2:~# rados bench -p Data 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
0   0 0 0 0 0   -
 0
1  16   402   386   1543.69  1544  0.00182802
 0.0395421
2  16   773   757   1513.71  1484  0.00243911
 0.0409455


These values are due to cached OSD data on your OSD nodes.

If you flush your cache (on all OSD nodes), your reads will be much 
worse, because they will then come from the HDDs.
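
For example, to drop the page cache on each OSD node before re-running the read test (plain Linux, nothing Ceph-specific):

  sync
  echo 3 > /proc/sys/vm/drop_caches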



Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
We're planning on installing 12X Virtual Machines with some heavy loads.

the SSD drives are  INTEL SSDSC2BA400G4

The SATA drives are ST8000NM0055-1RM112

Please explain your comment, "b) will find a lot of people here who don't
approve of it."

I don't have access to the switches right now, but they're new so whatever
default config ships from factory would be active. Though iperf shows 10.5
GBytes  / 9.02 Gbits/sec throughput.

What speeds would you expect?
"Though with your setup I would have expected something faster, but NOT the
theoretical 600MB/s 4 HDDs will do in sequential writes."



On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
down. Verify and if so fix this and re-test.": how?


On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer  wrote:

> On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
>
> > Hi,
> >
> > Can someone please help me, how do I improve performance on ou CEPH
> cluster?
> >
> > The hardware in use are as follows:
> > 3x SuperMicro servers with the following configuration
> > 12Core Dual XEON 2.2Ghz
> Faster cores is better for Ceph, IMNSHO.
> Though with main storage on HDDs, this will do.
>
> > 128GB RAM
> Overkill for Ceph but I see something else below...
>
> > 2x 400GB Intel DC SSD drives
> Exact model please.
>
> > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> One hopes that's a non SMR one.
> Model please.
>
> > 1x SuperMicro DOM for Proxmox / Debian OS
> Ah, Proxmox.
> I'm personally not averse to converged, high density, multi-role clusters
> myself, but you:
> a) need to know what you're doing and
> b) will find a lot of people here who don't approve of it.
>
> I've avoided DOMs so far (non-hotswapable SPOF), even though the SM ones
> look good on paper with regards to endurance and IOPS.
> The later being rather important for your monitors.
>
> > 4x Port 10Gbe NIC
> > Cisco 10Gbe switch.
> >
> Configuration would be nice for those, LACP?
>
> >
> > root@virt2:~# rados bench -p Data 10 write --no-cleanup
> > hints = 1
> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > 4194304 for   up to 10 seconds or 0 objects
>
> rados bench is limited tool and measuring bandwidth is in nearly all
> the use cases pointless.
> Latency is where it is at and testing from inside a VM is more relevant
> than synthetic tests of the storage.
> But it is a start.
>
> > Object prefix: benchmark_data_virt2_39099
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> > lat(s)
> > 0   0 0 0 0 0   -
> >  0
> > 1  168569   275.979   2760.185576
> > 0.204146
> > 2  16   171   155   309.966   344   0.0625409
> > 0.193558
> > 3  16   243   227   302.633   288   0.0547129
> >  0.19835
> > 4  16   330   314   313.965   348   0.0959492
> > 0.199825
> > 5  16   413   397   317.565   3320.124908
> > 0.196191
> > 6  16   494   478   318.633   324  0.1556
> > 0.197014
> > 7  15   591   576   329.109   3920.136305
> > 0.192192
> > 8  16   670   654   326.965   312   0.0703808
> > 0.190643
> > 9  16   757   741   329.297   3480.165211
> > 0.192183
> >10  16   828   812   324.764   284   0.0935803
> > 0.194041
> > Total time run: 10.120215
> > Total writes made:  829
> > Write size: 4194304
> > Object size:4194304
> > Bandwidth (MB/sec): 327.661
> What part of this surprises you?
>
> With a replication of 3, you have effectively the bandwidth of your 2 SSDs
> (for small writes, not the case here) and the bandwidth of your 4 HDDs
> available.
> Given overhead, other inefficiencies and the fact that this is not a
> sequential write from the HDD perspective, 320MB/s isn't all that bad.
> Though with your setup I would have expected something faster, but NOT the
> theoretical 600MB/s 4 HDDs will do in sequential writes.
>
> > Stddev Bandwidth:   35.8664
> > Max bandwidth (MB/sec): 392
> > Min bandwidth (MB/sec): 276
> > Average IOPS:   81
> > Stddev IOPS:8
> > Max IOPS:   98
> > Min IOPS:   69
> > Average Latency(s): 0.195191
> > Stddev Latency(s):  0.0830062
> > Max latency(s): 0.481448
> > Min latency(s): 0.0414858
> > root@virt2:~# hdparm -I /dev/sda
> >
> >
> >
> > root@virt2:~# ceph osd tree
> > ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> > -1   72.78290 root default
> > -3   29.11316 host virt1
> >  1   hdd  7.27829 osd.1  up  1.0 1.0
> >  2   hdd  7.27829 osd.2  up  1.0 1.0
> >  3   hdd  7.27829 osd.3  up  1.0 1.0
> >  4   hdd  7.27829 osd.4  up  1.0 1.0
> > -5   21.83487 host virt2
> >  5   hdd  7.27829 osd.5  up  1.0 1.

Re: [ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
Hi,

So are you saying this isn't the true speed?

Do I just flush the journal and test again? i.e. ceph-osd -i osd.0
--flush-journal && ceph-osd -i osd.2 --flush-journal && ceph-osd -i osd.3
--flush-journal etc, etc?

On Mon, Nov 20, 2017 at 2:02 PM,  wrote:

> Hi Rudi,
>
> Am 2017-11-20 11:58, schrieb Rudi Ahlers:
>
>> ...
>>
>> Some more stats:
>>
>> root@virt2:~# rados bench -p Data 10 seq
>> hints = 1
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>> lat(s)
>> 0   0 0 0 0 0   -
>>  0
>> 1  16   402   386   1543.69  1544  0.00182802
>>  0.0395421
>> 2  16   773   757   1513.71  1484  0.00243911
>>  0.0409455
>>
>> this values are due cached osd-data on your osd-nodes.
>
> If you flush your cache (on all osd-nodes), your reads will be much worse,
> because they came from the HDDs.
>
>
> Udo
>



-- 
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Sébastien VIGNERON
As a jumbo frame test, can you try the following?

ping -M do -s 8972 -c 4 IP_of_other_node_within_cluster_network

If you have « ping: sendto: Message too long », jumbo frames are not activated.

Cordialement / Best regards,

Sébastien VIGNERON 
CRIANN, 
Ingénieur / Engineer
Technopôle du Madrillet 
745, avenue de l'Université 
76800 Saint-Etienne du Rouvray - France 
tél. +33 2 32 91 42 91 
fax. +33 2 32 91 42 92 
http://www.criann.fr 
mailto:sebastien.vigne...@criann.fr
support: supp...@criann.fr

> On 20 Nov 2017 at 13:02, Rudi Ahlers wrote:
> 
> We're planning on installing 12X Virtual Machines with some heavy loads. 
> 
> the SSD drives are  INTEL SSDSC2BA400G4
> 
> The SATA drives are ST8000NM0055-1RM112
> 
> Please explain your comment, "b) will find a lot of people here who don't 
> approve of it."
> 
> I don't have access to the switches right now, but they're new so whatever 
> default config ships from factory would be active. Though iperf shows 10.5 
> GBytes  / 9.02 Gbits/sec throughput.
> 
> What speeds would you expect?
> "Though with your setup I would have expected something faster, but NOT the
> theoretical 600MB/s 4 HDDs will do in sequential writes."
> 
> 
> 
> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed down. 
> Verify and if so fix this and re-test.": how?
> 
> 
On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer wrote:
> On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> 
> > Hi,
> >
> > Can someone please help me, how do I improve performance on ou CEPH cluster?
> >
> > The hardware in use are as follows:
> > 3x SuperMicro servers with the following configuration
> > 12Core Dual XEON 2.2Ghz
> Faster cores is better for Ceph, IMNSHO.
> Though with main storage on HDDs, this will do.
> 
> > 128GB RAM
> Overkill for Ceph but I see something else below...
> 
> > 2x 400GB Intel DC SSD drives
> Exact model please.
> 
> > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> One hopes that's a non SMR one.
> Model please.
> 
> > 1x SuperMicro DOM for Proxmox / Debian OS
> Ah, Proxmox.
> I'm personally not averse to converged, high density, multi-role clusters
> myself, but you:
> a) need to know what you're doing and
> b) will find a lot of people here who don't approve of it.
> 
> I've avoided DOMs so far (non-hotswapable SPOF), even though the SM ones
> look good on paper with regards to endurance and IOPS.
> The later being rather important for your monitors.
> 
> > 4x Port 10Gbe NIC
> > Cisco 10Gbe switch.
> >
> Configuration would be nice for those, LACP?
> 
> >
> > root@virt2:~# rados bench -p Data 10 write --no-cleanup
> > hints = 1
> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > 4194304 for   up to 10 seconds or 0 objects
> 
> rados bench is limited tool and measuring bandwidth is in nearly all
> the use cases pointless.
> Latency is where it is at and testing from inside a VM is more relevant
> than synthetic tests of the storage.
> But it is a start.
> 
> > Object prefix: benchmark_data_virt2_39099
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> > lat(s)
> > 0   0 0 0 0 0   -
> >  0
> > 1  168569   275.979   2760.185576
> > 0.204146
> > 2  16   171   155   309.966   344   0.0625409
> > 0.193558
> > 3  16   243   227   302.633   288   0.0547129
> >  0.19835
> > 4  16   330   314   313.965   348   0.0959492
> > 0.199825
> > 5  16   413   397   317.565   3320.124908
> > 0.196191
> > 6  16   494   478   318.633   324  0.1556
> > 0.197014
> > 7  15   591   576   329.109   3920.136305
> > 0.192192
> > 8  16   670   654   326.965   312   0.0703808
> > 0.190643
> > 9  16   757   741   329.297   3480.165211
> > 0.192183
> >10  16   828   812   324.764   284   0.0935803
> > 0.194041
> > Total time run: 10.120215
> > Total writes made:  829
> > Write size: 4194304
> > Object size:4194304
> > Bandwidth (MB/sec): 327.661
> What part of this surprises you?
> 
> With a replication of 3, you have effectively the bandwidth of your 2 SSDs
> (for small writes, not the case here) and the bandwidth of your 4 HDDs
> available.
> Given overhead, other inefficiencies and the fact that this is not a
> sequential write from the HDD perspective, 320MB/s isn't all that bad.
> Though with your setup I would have expected something faster, but NOT the
> theoretical 600MB/s 4 HDDs will do in sequential writes.
> 
> > Stddev Bandwidth:   35.8664
> > Max bandwidth (MB/sec): 392
> > Min bandwidth (MB/sec): 276
> > Average IOPS:   81
> > Stddev IOPS:8
> > Max IOPS:   98
> > Min IOPS:   69
> > Average Latency(s): 0.195191
> > Stddev Latency(s):

Re: [ceph-users] how to improve performance

2017-11-20 Thread Christian Balzer
On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:

> We're planning on installing 12X Virtual Machines with some heavy loads.
> 
> the SSD drives are  INTEL SSDSC2BA400G4
> 
Interesting, where did you find those?
Or did you have them lying around?

I've been unable to get DC S3710 SSDs for nearly a year now.

> The SATA drives are ST8000NM0055-1RM112
> 
Note that these (while fast) have an internal flash cache, limiting them to
something like 0.2 DWPD.
Probably not an issue with the WAL/DB on the Intels, but something to keep
in mind.

> Please explain your comment, "b) will find a lot of people here who don't
> approve of it."
> 
Read the archives.
Converged clusters are complex and debugging Ceph when tons of other
things are going on at the same time on the machine even more so.

> I don't have access to the switches right now, but they're new so whatever
> default config ships from factory would be active. Though iperf shows 10.5
> GBytes  / 9.02 Gbits/sec throughput.
> 
Didn't think it was the switches, but for completeness' sake and all that.

> What speeds would you expect?
> "Though with your setup I would have expected something faster, but NOT the
> theoretical 600MB/s 4 HDDs will do in sequential writes."
>
What I wrote.
A 7200RPM HDD, even these, cannot sustain writes much over 170MB/s in
the most optimal circumstances.
So your cluster can NOT exceed about 600MB/s sustained writes with the
effective bandwidth of 4 HDDs.
Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
HDDs of course can and will be faster.

But again, you're missing the point: even if you get 600MB/s writes out of
your cluster, the number of 4k IOPS will be much more relevant to your VMs.
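As a concrete sketch of that 4k point, a fio run from inside one of the VMs
would look roughly like this (assuming fio is installed in the guest and
/root/fio-test.bin sits on the RBD-backed disk; size and runtime are examples):

fio --name=4k-randwrite --filename=/root/fio-test.bin --size=4G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting

Watch the reported IOPS and latency percentiles rather than the MB/s figure.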
  
> 
> 
> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> down. Verify and if so fix this and re-test.": how?
> 
No idea, I don't do bluestore.
You noticed the lack of a WAL/DB for sda, so go and fix it.
If in doubt, by destroying and re-creating.

And if you're looking for a less invasive procedure, check the docs and the ML
archive, but AFAIK there is nothing but re-creation at this time.

Christian
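For the WAL/DB check above, a minimal sketch (assuming Luminous-era ceph-disk
and BlueStore; the metadata field names can vary slightly between releases,
and the device names here are only examples):

ceph osd metadata 0 | grep -E 'bluefs|bdev'   # bluefs_db_partition_path shows where the DB lives
ceph-disk list                                # per-disk view: block.db should point at the SSD

# re-creating with the DB on an SSD partition (destructive):
ceph-disk zap /dev/sda
ceph-disk prepare --bluestore /dev/sda --block.db /dev/sdX1

If bluefs_db_partition_path points back at the HDD itself, that OSD has no
fast DB/WAL.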
> 
> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer  wrote:
> 
> > On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> >  
> > > Hi,
> > >
> > > Can someone please help me, how do I improve performance on ou CEPH  
> > cluster?  
> > >
> > > The hardware in use are as follows:
> > > 3x SuperMicro servers with the following configuration
> > > 12Core Dual XEON 2.2Ghz  
> > Faster cores is better for Ceph, IMNSHO.
> > Though with main storage on HDDs, this will do.
> >  
> > > 128GB RAM  
> > Overkill for Ceph but I see something else below...
> >  
> > > 2x 400GB Intel DC SSD drives  
> > Exact model please.
> >  
> > > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's  
> > One hopes that's a non SMR one.
> > Model please.
> >  
> > > 1x SuperMicro DOM for Proxmox / Debian OS  
> > Ah, Proxmox.
> > I'm personally not averse to converged, high density, multi-role clusters
> > myself, but you:
> > a) need to know what you're doing and
> > b) will find a lot of people here who don't approve of it.
> >
> > I've avoided DOMs so far (non-hotswapable SPOF), even though the SM ones
> > look good on paper with regards to endurance and IOPS.
> > The later being rather important for your monitors.
> >  
> > > 4x Port 10Gbe NIC
> > > Cisco 10Gbe switch.
> > >  
> > Configuration would be nice for those, LACP?
> >  
> > >
> > > root@virt2:~# rados bench -p Data 10 write --no-cleanup
> > > hints = 1
> > > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > > 4194304 for   up to 10 seconds or 0 objects  
> >
> > rados bench is limited tool and measuring bandwidth is in nearly all
> > the use cases pointless.
> > Latency is where it is at and testing from inside a VM is more relevant
> > than synthetic tests of the storage.
> > But it is a start.
> >  
> > > Object prefix: benchmark_data_virt2_39099
> > >   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> > > lat(s)
> > > 0   0 0 0 0 0   -
> > >  0
> > > 1  168569   275.979   2760.185576
> > > 0.204146
> > > 2  16   171   155   309.966   344   0.0625409
> > > 0.193558
> > > 3  16   243   227   302.633   288   0.0547129
> > >  0.19835
> > > 4  16   330   314   313.965   348   0.0959492
> > > 0.199825
> > > 5  16   413   397   317.565   3320.124908
> > > 0.196191
> > > 6  16   494   478   318.633   324  0.1556
> > > 0.197014
> > > 7  15   591   576   329.109   3920.136305
> > > 0.192192
> > > 8  16   670   654   326.965   312   0.0703808
> > > 0.190643
> > > 9  16   757   741   329.297   3480.165211
> > > 0.192183
> > >10  16   828   812   324.764   284   0.

[ceph-users] Migration from filestore to bluestore

2017-11-20 Thread Iban Cabrillo
Hi cephers,
  I was trying to migrate from Filestore to BlueStore following the
instructions, but after the ceph-disk prepare the new OSD has not joined
the cluster again:

   [root@cephadm ~]# ceph osd tree
ID CLASS WEIGHT   TYPE NAMESTATUSREWEIGHT PRI-AFF
-1   58.21509 root default
-7   58.21509 datacenter 10GbpsNet
-2   29.12000 host cephosd01
 1   hdd  3.64000 osd.1   up  1.0 1.0
 3   hdd  3.64000 osd.3   up  1.0 1.0
 5   hdd  3.64000 osd.5   up  1.0 1.0
 7   hdd  3.64000 osd.7   up  1.0 1.0
 9   hdd  3.64000 osd.9   up  1.0 1.0
11   hdd  3.64000 osd.11  up  1.0 1.0
13   hdd  3.64000 osd.13  up  1.0 1.0
15   hdd  3.64000 osd.15  up  1.0 1.0
-3   29.09509 host cephosd02
 0   hdd  3.63689 osd.0destroyed0 1.0
 2   hdd  3.63689 osd.2   up  1.0 1.0
 4   hdd  3.63689 osd.4   up  1.0 1.0
 6   hdd  3.63689 osd.6   up  1.0 1.0
 8   hdd  3.63689 osd.8   up  1.0 1.0
10   hdd  3.63689 osd.10  up  1.0 1.0
12   hdd  3.63689 osd.12  up  1.0 1.0
14   hdd  3.63689 osd.14  up  1.0 1.0
-8  0 datacenter 1GbpsNet


The state is still 'destroyed'

The operation has completed successfully.
[root@cephosd02 ~]# ceph-disk prepare --bluestore /dev/sda --osd-id 0
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
meta-data=/dev/sda1  isize=2048   agcount=4, agsize=6400 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=1finobt=0, sparse=0
data =   bsize=4096   blocks=25600, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
log  =internal log   bsize=4096   blocks=864, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

The metadata was on an SSD disk

In the logs I only see this :

2017-11-20 14:00:48.536252 7fc2d149dd00 -1  ** ERROR: unable to open OSD
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2017-11-20 14:01:08.788158 7f4a9165fd00  0 set uid:gid to 167:167
(ceph:ceph)
2017-11-20 14:01:08.788179 7f4a9165fd00  0 ceph version 12.2.0
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
(unknown), pid 115029

Any advice?

Regards, I

-- 

Iban Cabrillo Bartolome
Instituto de Fisica de Cantabria (IFCA)
Santander, Spain
Tel: +34942200969
PGP PUBLIC KEY:
http://pgp.mit.edu/pks/lookup?op=get&search=0xD9DF0B3D6C8C08AC

Bertrand Russell:*"El problema con el mundo es que los estúpidos están
seguros de todo y los inteligentes están **llenos de dudas*"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migration from filestore to bluestore

2017-11-20 Thread Wido den Hollander

> Op 20 november 2017 om 14:02 schreef Iban Cabrillo :
> 
> 
> Hi cephers,
>   I was trying to migrate from Filestore to bluestore followig the
> instructions but after the ceph-disk prepare the new osd had not join to
> the cluster again:
> 
>[root@cephadm ~]# ceph osd tree
> ID CLASS WEIGHT   TYPE NAMESTATUSREWEIGHT PRI-AFF
> -1   58.21509 root default
> -7   58.21509 datacenter 10GbpsNet
> -2   29.12000 host cephosd01
>  1   hdd  3.64000 osd.1   up  1.0 1.0
>  3   hdd  3.64000 osd.3   up  1.0 1.0
>  5   hdd  3.64000 osd.5   up  1.0 1.0
>  7   hdd  3.64000 osd.7   up  1.0 1.0
>  9   hdd  3.64000 osd.9   up  1.0 1.0
> 11   hdd  3.64000 osd.11  up  1.0 1.0
> 13   hdd  3.64000 osd.13  up  1.0 1.0
> 15   hdd  3.64000 osd.15  up  1.0 1.0
> -3   29.09509 host cephosd02
>  0   hdd  3.63689 osd.0destroyed0 1.0
>  2   hdd  3.63689 osd.2   up  1.0 1.0
>  4   hdd  3.63689 osd.4   up  1.0 1.0
>  6   hdd  3.63689 osd.6   up  1.0 1.0
>  8   hdd  3.63689 osd.8   up  1.0 1.0
> 10   hdd  3.63689 osd.10  up  1.0 1.0
> 12   hdd  3.63689 osd.12  up  1.0 1.0
> 14   hdd  3.63689 osd.14  up  1.0 1.0
> -8  0 datacenter 1GbpsNet
> 
> 
> The state is destroyed yet
> 
> The operation has completed successfully.
> [root@cephosd02 ~]# ceph-disk prepare --bluestore /dev/sda --osd-id 0
> The operation has completed successfully.

Did you wipe the disk yet? Make sure it's completely empty before you re-create 
the OSD.

Wido
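For reference, a minimal wipe sequence before re-preparing; this is only a
sketch, it is destructive, and the device name must be double-checked first:

ceph-disk zap /dev/sda                                    # clear the GPT/partitions ceph-disk created
wipefs -a /dev/sda                                        # drop any leftover filesystem signatures
dd if=/dev/zero of=/dev/sda bs=1M count=100 oflag=direct  # clobber the first 100 MB for good measure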

> The operation has completed successfully.
> The operation has completed successfully.
> meta-data=/dev/sda1  isize=2048   agcount=4, agsize=6400 blks
>  =   sectsz=512   attr=2, projid32bit=1
>  =   crc=1finobt=0, sparse=0
> data =   bsize=4096   blocks=25600, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> log  =internal log   bsize=4096   blocks=864, version=2
>  =   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> 
> The metadata was on SSD disk
> 
> In the logs I only see this :
> 
> 2017-11-20 14:00:48.536252 7fc2d149dd00 -1  ** ERROR: unable to open OSD
> superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
> 2017-11-20 14:01:08.788158 7f4a9165fd00  0 set uid:gid to 167:167
> (ceph:ceph)
> 2017-11-20 14:01:08.788179 7f4a9165fd00  0 ceph version 12.2.0
> (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
> (unknown), pid 115029
> 
> Any Advise?
> 
> Regards, I
> 
> -- 
> 
> Iban Cabrillo Bartolome
> Instituto de Fisica de Cantabria (IFCA)
> Santander, Spain
> Tel: +34942200969
> PGP PUBLIC KEY:
> http://pgp.mit.edu/pks/lookup?op=get&search=0xD9DF0B3D6C8C08AC
> 
> Bertrand Russell:*"El problema con el mundo es que los estúpidos están
> seguros de todo y los inteligentes están **llenos de dudas*"
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migration from filestore to bluestore

2017-11-20 Thread Iban Cabrillo
Hi Wido,
  The disk was empty; I checked that there were no remapped PGs before
running ceph-disk prepare. Should I re-run ceph-disk again?

Regards, i

On Mon, 20 Nov 2017 14:12, Wido den Hollander  wrote:

>
> > Op 20 november 2017 om 14:02 schreef Iban Cabrillo <
> cabri...@ifca.unican.es>:
> >
> >
> > Hi cephers,
> >   I was trying to migrate from Filestore to bluestore followig the
> > instructions but after the ceph-disk prepare the new osd had not join to
> > the cluster again:
> >
> >[root@cephadm ~]# ceph osd tree
> > ID CLASS WEIGHT   TYPE NAMESTATUSREWEIGHT PRI-AFF
> > -1   58.21509 root default
> > -7   58.21509 datacenter 10GbpsNet
> > -2   29.12000 host cephosd01
> >  1   hdd  3.64000 osd.1   up  1.0 1.0
> >  3   hdd  3.64000 osd.3   up  1.0 1.0
> >  5   hdd  3.64000 osd.5   up  1.0 1.0
> >  7   hdd  3.64000 osd.7   up  1.0 1.0
> >  9   hdd  3.64000 osd.9   up  1.0 1.0
> > 11   hdd  3.64000 osd.11  up  1.0 1.0
> > 13   hdd  3.64000 osd.13  up  1.0 1.0
> > 15   hdd  3.64000 osd.15  up  1.0 1.0
> > -3   29.09509 host cephosd02
> >  0   hdd  3.63689 osd.0destroyed0 1.0
> >  2   hdd  3.63689 osd.2   up  1.0 1.0
> >  4   hdd  3.63689 osd.4   up  1.0 1.0
> >  6   hdd  3.63689 osd.6   up  1.0 1.0
> >  8   hdd  3.63689 osd.8   up  1.0 1.0
> > 10   hdd  3.63689 osd.10  up  1.0 1.0
> > 12   hdd  3.63689 osd.12  up  1.0 1.0
> > 14   hdd  3.63689 osd.14  up  1.0 1.0
> > -8  0 datacenter 1GbpsNet
> >
> >
> > The state is destroyed yet
> >
> > The operation has completed successfully.
> > [root@cephosd02 ~]# ceph-disk prepare --bluestore /dev/sda --osd-id 0
> > The operation has completed successfully.
>
> Did you wipe the disk yet? Make sure it's completely empty before you
> re-create the OSD.
>
> Wido
>
> > The operation has completed successfully.
> > The operation has completed successfully.
> > meta-data=/dev/sda1  isize=2048   agcount=4, agsize=6400 blks
> >  =   sectsz=512   attr=2, projid32bit=1
> >  =   crc=1finobt=0, sparse=0
> > data =   bsize=4096   blocks=25600, imaxpct=25
> >  =   sunit=0  swidth=0 blks
> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> > log  =internal log   bsize=4096   blocks=864, version=2
> >  =   sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none   extsz=4096   blocks=0, rtextents=0
> >
> > The metadata was on SSD disk
> >
> > In the logs I only see this :
> >
> > 2017-11-20 14:00:48.536252 7fc2d149dd00 -1  ** ERROR: unable to open OSD
> > superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
> > 2017-11-20 14:01:08.788158 7f4a9165fd00  0 set uid:gid to 167:167
> > (ceph:ceph)
> > 2017-11-20 14:01:08.788179 7f4a9165fd00  0 ceph version 12.2.0
> > (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
> > (unknown), pid 115029
> >
> > Any Advise?
> >
> > Regards, I
> >
> > --
> >
> 
> > Iban Cabrillo Bartolome
> > Instituto de Fisica de Cantabria (IFCA)
> > Santander, Spain
> > Tel: +34942200969
> > PGP PUBLIC KEY:
> > http://pgp.mit.edu/pks/lookup?op=get&search=0xD9DF0B3D6C8C08AC
> >
> 
> > Bertrand Russell:*"El problema con el mundo es que los estúpidos están
> > seguros de todo y los inteligentes están **llenos de dudas*"
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rename iscsi target_iqn

2017-11-20 Thread Jason Dillaman
On Mon, Nov 20, 2017 at 3:30 AM, Frank Brendel
 wrote:
> Hi Jason,
>
> Am 17.11.2017 um 14:09 schrieb Jason Dillaman:
>>>
>>> how can I rename an iscsi target_iqn?
>>
>> That operation is not supported via gwcli.
>
> Is there a special reason for that or is it simply not implemented?

It's not implemented. Is that even supported by targetcli? AFAIK, LIO
would require you to delete and recreate the target if you wanted to
rename it.

>>> And where is the configuration that I made with gwcli stored?
>>
>> It's stored in a JSON object within the 'rbd' pool named "gateway.conf".
>
> To start from scratch I made the following steps:
>
> 1. Stop the iSCSI gateway on all nodes 'systemctl stop rbd-target-gw'
> 2. Remove the iSCSI kernel configuration on all nodes 'targetctl clear'
> 3. Remove gateway.conf from rbd pool 'rados -p rbd rm gateway.conf'
> 4. Start the iSCSI gateway on all nodes 'systemctl start rbd-target-api'
>
> Is this the recommended way?

Recommended way to do what, exactly? If you are attempting to rename
the target while keeping all other settings, at step (3) you could use
"rados get" to get the current config, modify it, and then "rados put"
to upload it before continuing to step 4.
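A sketch of that approach, assuming the default 'rbd' pool and the object
name mentioned above:

rados -p rbd get gateway.conf /tmp/gateway.conf
# edit the target IQN fields in the JSON with your editor of choice
rados -p rbd put gateway.conf /tmp/gateway.conf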

>
> Thank you
> Frank
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - SSD cluster

2017-11-20 Thread David Turner
This topic has been discussed in detail multiple times and from various
angles. Your key points are going to be how CPU limits IOPS, DWPD, IOPS vs
bandwidth, and SSD clusters/pools in general. You should be able to find
everything you need in the archives.
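A back-of-the-envelope DWPD check, with made-up numbers and ignoring write
amplification from journaling/compaction:

# required DWPD = client_writes_per_day * replication / (num_ssds * ssd_capacity_tb)
echo "scale=2; 2 * 3 / (12 * 1.9)" | bc    # ~0.26 DWPD for 2 TB/day, 3x replication, 12x 1.9 TB SSDs

A 1-DWPD drive leaves headroom in that example, while heavier write loads
quickly push toward 3-DWPD class drives.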

On Mon, Nov 20, 2017, 12:56 AM M Ranga Swami Reddy 
wrote:

> Hello,
> We plan to use the ceph cluster with all SSDs. Do we have any
> recommendations for Ceph cluster with Full SSD disks.
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - SSD cluster

2017-11-20 Thread M Ranga Swami Reddy
Thank you...let me dig the archives

Thanks
Swami

On Mon, Nov 20, 2017 at 7:50 PM, David Turner  wrote:
> This topic has been discussed in detail multiple times and from various
> angles. Your key points are going to be CPU limits iops, dwpd, iops vs
> bandwidth, and SSD clusters/pools in general. You should be able to find
> everything you need in the archives.
>
>
> On Mon, Nov 20, 2017, 12:56 AM M Ranga Swami Reddy 
> wrote:
>>
>> Hello,
>> We plan to use the ceph cluster with all SSDs. Do we have any
>> recommendations for Ceph cluster with Full SSD disks.
>>
>> Thanks
>> Swami
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - SSD cluster

2017-11-20 Thread Ansgar Jazdzewski
Hi *,

just one note because we hit it: take a look at your discard options and
make sure it does not run on all OSDs at the same time.
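One way to do that, sketched for filestore OSDs on XFS/ext4: mount without the
'discard' option and trim on a staggered schedule instead, e.g. via
/etc/cron.d entries whose hour differs per host (times here are examples):

# /etc/cron.d/fstrim on host A
0 2 * * 0  root  fstrim -av
# /etc/cron.d/fstrim on host B
0 4 * * 0  root  fstrim -av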

2017-11-20 6:56 GMT+01:00 M Ranga Swami Reddy :
> Hello,
> We plan to use the ceph cluster with all SSDs. Do we have any
> recommendations for Ceph cluster with Full SSD disks.
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migration from filestore to bluestore

2017-11-20 Thread Iban Cabrillo
Hi,
  Looking at the output of ceph-deploy list for the node:

[cephosd02][INFO  ] Running command: /usr/sbin/ceph-disk list
[cephosd02][DEBUG ] /dev/sda :
[cephosd02][DEBUG ]  /dev/sda1 ceph data, prepared, cluster ceph, block
/dev/sda2
[cephosd02][DEBUG ]  /dev/sda2 ceph block, for /dev/sda1
[cephosd02][DEBUG ] /dev/sdb :
[cephosd02][DEBUG ]  /dev/sdb1 ceph data, active, cluster ceph, osd.2,
journal /dev/sdn1

..
[cephosd02][DEBUG ]  /dev/sdm1 ceph journal

/dev/sdm1 should be the journal for the old sda disk (old osd.0)

Would doing this solve the issue?
ceph-disk prepare --bluestore --osd-id 0 /dev/sda /dev/sdm1

Regards, I







2017-11-20 14:34 GMT+01:00 Iban Cabrillo :

> Hi Wido,
>   The disk was empty, I checked that there were no remapped pgs, before
> run ceph-disk prepare. Re-run ceph-disk again?
>
> Regards, i
>
> El lun., 20 nov. 2017 14:12, Wido den Hollander  escribió:
>
>>
>> > Op 20 november 2017 om 14:02 schreef Iban Cabrillo <
>> cabri...@ifca.unican.es>:
>> >
>> >
>> > Hi cephers,
>> >   I was trying to migrate from Filestore to bluestore followig the
>> > instructions but after the ceph-disk prepare the new osd had not join to
>> > the cluster again:
>> >
>> >[root@cephadm ~]# ceph osd tree
>> > ID CLASS WEIGHT   TYPE NAMESTATUSREWEIGHT PRI-AFF
>> > -1   58.21509 root default
>> > -7   58.21509 datacenter 10GbpsNet
>> > -2   29.12000 host cephosd01
>> >  1   hdd  3.64000 osd.1   up  1.0 1.0
>> >  3   hdd  3.64000 osd.3   up  1.0 1.0
>> >  5   hdd  3.64000 osd.5   up  1.0 1.0
>> >  7   hdd  3.64000 osd.7   up  1.0 1.0
>> >  9   hdd  3.64000 osd.9   up  1.0 1.0
>> > 11   hdd  3.64000 osd.11  up  1.0 1.0
>> > 13   hdd  3.64000 osd.13  up  1.0 1.0
>> > 15   hdd  3.64000 osd.15  up  1.0 1.0
>> > -3   29.09509 host cephosd02
>> >  0   hdd  3.63689 osd.0destroyed0 1.0
>> >  2   hdd  3.63689 osd.2   up  1.0 1.0
>> >  4   hdd  3.63689 osd.4   up  1.0 1.0
>> >  6   hdd  3.63689 osd.6   up  1.0 1.0
>> >  8   hdd  3.63689 osd.8   up  1.0 1.0
>> > 10   hdd  3.63689 osd.10  up  1.0 1.0
>> > 12   hdd  3.63689 osd.12  up  1.0 1.0
>> > 14   hdd  3.63689 osd.14  up  1.0 1.0
>> > -8  0 datacenter 1GbpsNet
>> >
>> >
>> > The state is destroyed yet
>> >
>> > The operation has completed successfully.
>> > [root@cephosd02 ~]# ceph-disk prepare --bluestore /dev/sda --osd-id 0
>> > The operation has completed successfully.
>>
>> Did you wipe the disk yet? Make sure it's completely empty before you
>> re-create the OSD.
>>
>> Wido
>>
>> > The operation has completed successfully.
>> > The operation has completed successfully.
>> > meta-data=/dev/sda1  isize=2048   agcount=4, agsize=6400
>> blks
>> >  =   sectsz=512   attr=2, projid32bit=1
>> >  =   crc=1finobt=0, sparse=0
>> > data =   bsize=4096   blocks=25600, imaxpct=25
>> >  =   sunit=0  swidth=0 blks
>> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
>> > log  =internal log   bsize=4096   blocks=864, version=2
>> >  =   sectsz=512   sunit=0 blks, lazy-count=1
>> > realtime =none   extsz=4096   blocks=0, rtextents=0
>> >
>> > The metadata was on SSD disk
>> >
>> > In the logs I only see this :
>> >
>> > 2017-11-20 14:00:48.536252 7fc2d149dd00 -1  ** ERROR: unable to open OSD
>> > superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
>> > 2017-11-20 14:01:08.788158 7f4a9165fd00  0 set uid:gid to 167:167
>> > (ceph:ceph)
>> > 2017-11-20 14:01:08.788179 7f4a9165fd00  0 ceph version 12.2.0
>> > (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
>> > (unknown), pid 115029
>> >
>> > Any Advise?
>> >
>> > Regards, I
>> >
>> > --
>> > 
>> 
>> > Iban Cabrillo Bartolome
>> > Instituto de Fisica de Cantabria (IFCA)
>> > Santander, Spain
>> > Tel: +34942200969 <+34%20942%2020%2009%2069>
>> > PGP PUBLIC KEY:
>> > http://pgp.mit.edu/pks/lookup?op=get&search=0xD9DF0B3D6C8C08AC
>> > 
>> 
>> > Bertrand Russell:*"El problema con el mundo es que los estúpidos están
>> > seguros de todo y los inteligentes están **llenos de dudas*"
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.cep

Re: [ceph-users] Migration from filestore to bluestore

2017-11-20 Thread Gerhard W. Recher
Hi, I have done this: ... your mileage may vary depending on creation parameters


# cat bs.sh

ID=$1
echo "wait for cluster ok"
while ! ceph health | grep HEALTH_OK ; do echo -n "."; sleep 10 ; done
echo "ceph osd out $ID"
ceph osd out $ID
sleep 10
while ! ceph health | grep HEALTH_OK ; do sleep 10 ; done
echo "systemctl stop ceph-osd@$ID.service"
systemctl stop ceph-osd@$ID.service
sleep 60
DEVICE=`mount | grep /var/lib/ceph/osd/ceph-$ID| cut -f1 -d"p"`

umount /var/lib/ceph/osd/ceph-$ID
echo "ceph-disk zap $DEVICE"
ceph-disk zap $DEVICE
ceph osd destroy $ID --yes-i-really-mean-it
echo "ceph-disk prepare --bluestore $DEVICE --osd-id $ID"
ceph-disk prepare --bluestore $DEVICE --osd-id $ID
sleep 10;
ceph osd metadata $ID
ceph -s
echo "wait for cluster ok"
while ! ceph health | grep HEALTH_OK ; do echo -n "."; sleep 10 ; done
ceph -s
echo " proceed with next"




Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 171 4802507
Am 20.11.2017 um 14:34 schrieb Iban Cabrillo:
>
> Hi Wido,
>   The disk was empty, I checked that there were no remapped pgs,
> before run ceph-disk prepare. Re-run ceph-disk again?
>
> Regards, i
>
>
> El lun., 20 nov. 2017 14:12, Wido den Hollander  > escribió:
>
>
> > Op 20 november 2017 om 14:02 schreef Iban Cabrillo
> mailto:cabri...@ifca.unican.es>>:
> >
> >
> > Hi cephers,
> >   I was trying to migrate from Filestore to bluestore followig the
> > instructions but after the ceph-disk prepare the new osd had not
> join to
> > the cluster again:
> >
> >    [root@cephadm ~]# ceph osd tree
> > ID CLASS WEIGHT   TYPE NAME                STATUS    REWEIGHT
> PRI-AFF
> > -1       58.21509 root default
> > -7       58.21509     datacenter 10GbpsNet
> > -2       29.12000         host cephosd01
> >  1   hdd  3.64000             osd.1               up  1.0
> 1.0
> >  3   hdd  3.64000             osd.3               up  1.0
> 1.0
> >  5   hdd  3.64000             osd.5               up  1.0
> 1.0
> >  7   hdd  3.64000             osd.7               up  1.0
> 1.0
> >  9   hdd  3.64000             osd.9               up  1.0
> 1.0
> > 11   hdd  3.64000             osd.11              up  1.0
> 1.0
> > 13   hdd  3.64000             osd.13              up  1.0
> 1.0
> > 15   hdd  3.64000             osd.15              up  1.0
> 1.0
> > -3       29.09509         host cephosd02
> >  0   hdd  3.63689             osd.0        destroyed        0
> 1.0
> >  2   hdd  3.63689             osd.2               up  1.0
> 1.0
> >  4   hdd  3.63689             osd.4               up  1.0
> 1.0
> >  6   hdd  3.63689             osd.6               up  1.0
> 1.0
> >  8   hdd  3.63689             osd.8               up  1.0
> 1.0
> > 10   hdd  3.63689             osd.10              up  1.0
> 1.0
> > 12   hdd  3.63689             osd.12              up  1.0
> 1.0
> > 14   hdd  3.63689             osd.14              up  1.0
> 1.0
> > -8              0     datacenter 1GbpsNet
> >
> >
> > The state is destroyed yet
> >
> > The operation has completed successfully.
> > [root@cephosd02 ~]# ceph-disk prepare --bluestore /dev/sda
> --osd-id 0
> > The operation has completed successfully.
>
> Did you wipe the disk yet? Make sure it's completely empty before
> you re-create the OSD.
>
> Wido
>
> > The operation has completed successfully.
> > The operation has completed successfully.
> > meta-data=/dev/sda1              isize=2048   agcount=4,
> agsize=6400 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=0, sparse=0
> > data     =                       bsize=4096   blocks=25600,
> imaxpct=25
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal log           bsize=4096   blocks=864, version=2
> >          =                       sectsz=512   sunit=0 blks,
> lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> > The metadata was on SSD disk
> >
> > In the logs I only see this :
> >
> > 2017-11-20 14:00:48.536252 7fc2d149dd00 -1  ** ERROR: unable to
> open OSD
> > superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or
> directory
> > 2017-11-20 14:01:08.788158 7f4a9165fd00  0 set uid:gid to 167:167
> > (ceph:ceph)
> > 2017-11-20 14:01:08.788179 7f4a9165fd00  0 ceph version 12.2.0
> > (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
> > (unknown), pid 115029
>

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-20 Thread Gregory Farnum
On Mon, Nov 20, 2017 at 7:56 PM, Ashley Merrick  wrote:
> Hello,
>
>
> So I tried, as suggested, marking one OSD that continuously failed as lost and
> adding a new OSD to take its place.
>
>
> However all this does is make another 2-3 OSD's fail with the exact same
> error.
>
>
> Seems this is a pretty huge and nasty bug / issue!
>
>
> Greg, you'll have to give me some more information about what you need if you
> want me to try and get some information.

Do you know how to use gdb?
Open it with the core dump and osd binary. ("gdb ceph-osd <core file>")
Switch to the frame in the function ("frame 6", if it matches with the
printed backtrace, but it may not)
GDB will print out what line it's on. You can also have it print out
the surrounding code ("list", maybe?)
-Greg
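A minimal session would look something like this (paths are illustrative; the
ceph-osd debuginfo/dbg package should be installed for useful line numbers):

gdb /usr/bin/ceph-osd /path/to/core
(gdb) bt           # full backtrace, to match against the one in the OSD log
(gdb) frame 6      # or whichever frame holds PG::start_peering_interval
(gdb) list         # print the surrounding source lines
(gdb) info locals  # optionally inspect variables in that frame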

>
>
> However right now the cluster it self is pretty much toast due to the amount
> of OSD's now with this assert.
>
>
> ,Ashley
>
> 
> From: Gregory Farnum 
> Sent: 19 November 2017 09:25:39
> To: Ashley Merrick
> Cc: David Turner; ceph-us...@ceph.com
>
> Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous
>
> I only see two asserts (in my local checkout) in that function; one is
> metadata
> assert(info.history.same_interval_since != 0);
> and the other is a sanity check
> assert(!deleting);
>
> Can you open a core dump with gdb and look at what line it's on in the
> start_peering_interval frame? (May need to install the debug packages.)
>
> I think we've run across that first assert as an issue before, but both of
> them ought to be dumping out more cleanly about what line they're on.
> -Greg
>
>
> On Sun, Nov 19, 2017 at 1:32 AM Ashley Merrick 
> wrote:
>
> Hello,
>
>
>
> So seems noup does not help.
>
>
>
> Still have the same error :
>
>
>
> 2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **in
> thread 7fb4446cd700 thread_name:tp_peering
>
>
>
> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>
> 1: (()+0xa0c554) [0x56547f500554]
>
> 2: (()+0x110c0) [0x7fb45cabe0c0]
>
> 3: (gsignal()+0xcf) [0x7fb45ba85fcf]
>
> 4: (abort()+0x16a) [0x7fb45ba873fa]
>
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x28e) [0x56547f547f0e]
>
> 6: (PG::start_peering_interval(std::shared_ptr,
> std::vector > const&, int, std::vector std::allocator > const&, int, ObjectStore::Transaction*)+0x1569)
> [0x56547f029ad9]
>
> 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479)
> [0x56547f02a099]
>
> 8: (boost::statechart::simple_state PG::RecoveryState::RecoveryMachine, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x188) [0x56547f06c6d8]
>
> 9: (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0x69) [0x56547f045549]
>
> 10: (PG::handle_advance_map(std::shared_ptr,
> std::shared_ptr, std::vector >&, int,
> std::vector >&, int, PG::RecoveryCtx*)+0x4a7)
> [0x56547f00e837]
>
> 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> PG::RecoveryCtx*, std::set,
> std::less >,
> std::allocator > >*)+0x2e7) [0x56547ef56e67]
>
> 12: (OSD::process_peering_events(std::__cxx11::list
>> const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
>
> 13: (ThreadPool::BatchWorkQueue::_void_process(void*,
> ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
>
> 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
>
> 15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
>
> 16: (()+0x7494) [0x7fb45cab4494]
>
> 17: (clone()+0x3f) [0x7fb45bb3baff]
>
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
>
>
> I guess even with noup the OSD/PG still has the peer with the other PG’s
> which is the stage that causes the failure, most OSD’s seem to stay up for
> about 30 seconds, and every time it’s a different PG listed on the failure.
>
>
>
> ,Ashley
>
>
>
> From: David Turner [mailto:drakonst...@gmail.com]
>
> Sent: 18 November 2017 22:19
> To: Ashley Merrick 
>
> Cc: Eric Nelson ; ceph-us...@ceph.com
>
>
> Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> Does letting the cluster run with noup for a while until all down disks are
> idle, and then letting them come in help at all?  I don't know your specific
> issue and haven't touched bluestore yet, but that is generally sound advice
> when is won't start.
>
> Also is there any pattern to the osds that are down? Common PGs, common
> hosts, common ssds, etc?
>
>
>
> On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick  wrote:
>
> Hello,
>
>
>
> Any further suggestions or work around’s from anyone?
>
>
>
> Cluster is hard down now with around 2% PG’s of

Re: [ceph-users] Active+clean PGs reported many times in log

2017-11-20 Thread Gregory Farnum
Is this from a time when it was displaying the doubled active+clean
outputs? Otherwise you'll need to retrieve a specific map from when it was.
I'll ask around if anybody's seen this before; Jewel has been out a
while and the pg output like this changed dramatically for Luminous so
it may not be an issue in the latest LTS.
-Greg

On Mon, Nov 20, 2017 at 7:02 PM, Matteo Dacrema  wrote:
> I was running 10.2.7 but I’ve upgraded to 10.2.10 few days ago.
>
> Here Pg dump:
>
> https://owncloud.enter.it/index.php/s/AaD5Fc5tA6c8i1G
>
>
>
> Il giorno 19 nov 2017, alle ore 11:15, Gregory Farnum 
> ha scritto:
>
> On Tue, Nov 14, 2017 at 1:09 AM Matteo Dacrema  wrote:
>>
>> Hi,
>> I noticed that sometimes the monitors start to log active+clean pgs many
>> times in the same line. For example I have 18432 and the logs shows " 2136
>> active+clean, 28 active+clean, 2 active+clean+scrubbing+deep, 16266
>> active+clean;”
>> After a minute monitor start to log correctly again.
>>
>>
>> Is it normal ?
>
>
> That definitely looks weird to me, but I can imagine a few ways for it to
> occur. What version of Ceph are you running? Can you extract the pgmap and
> post the binary somewhere?
>
>>
>>
>> 2017-11-13 11:05:08.876724 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797105: 18432 pgs: 3 active+clean+scrubbing+deep, 18429
>> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 40596 kB/s
>> rd, 89723 kB/s wr, 4899 op/s
>> 2017-11-13 11:05:09.911266 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797106: 18432 pgs: 2 active+clean+scrubbing+deep, 18430
>> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 45931 kB/s
>> rd, 114 MB/s wr, 6179 op/s
>> 2017-11-13 11:05:10.751378 7fb359cfb700  0 mon.controller001@0(leader) e1
>> handle_command mon_command({"prefix": "osd pool stats", "format": "json"} v
>> 0) v1
>> 2017-11-13 11:05:10.751599 7fb359cfb700  0 log_channel(audit) log [DBG] :
>> from='client.? MailScanner warning: numerical links are often malicious:
>> 10.16.24.127:0/547552484' entity='client.telegraf' cmd=[{"prefix": "osd pool
>> stats", "format": "json"}]: dispatch
>> 2017-11-13 11:05:10.926839 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797107: 18432 pgs: 3 active+clean+scrubbing+deep, 18429
>> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 47617 kB/s
>> rd, 134 MB/s wr, 7414 op/s
>> 2017-11-13 11:05:11.921115 7fb35d17d700  1 mon.controller001@0(leader).osd
>> e120942 e120942: 216 osds: 216 up, 216 in
>> 2017-11-13 11:05:11.926818 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : osdmap e120942: 216 osds: 216 up, 216 in
>> 2017-11-13 11:05:11.984732 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797109: 18432 pgs: 3 active+clean+scrubbing+deep, 18429
>> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 54110 kB/s
>> rd, 115 MB/s wr, 7827 op/s
>> 2017-11-13 11:05:13.085799 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797110: 18432 pgs: 973 active+clean, 12 active+clean, 3
>> active+clean+scrubbing+deep, 17444 active+clean; 59520 GB data, 129 TB used,
>> 110 TB / 239 TB avail; 115 MB/s rd, 90498 kB/s wr, 8490 op/s
>> 2017-11-13 11:05:14.181219 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797111: 18432 pgs: 2136 active+clean, 28 active+clean, 2
>> active+clean+scrubbing+deep, 16266 active+clean; 59520 GB data, 129 TB used,
>> 110 TB / 239 TB avail; 136 MB/s rd, 94461 kB/s wr, 10237 op/s
>> 2017-11-13 11:05:15.324630 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797112: 18432 pgs: 3179 active+clean, 44 active+clean, 2
>> active+clean+scrubbing+deep, 15207 active+clean; 59519 GB data, 129 TB used,
>> 110 TB / 239 TB avail; 184 MB/s rd, 81743 kB/s wr, 13786 op/s
>> 2017-11-13 11:05:16.381452 7fb35d17d700  0 log_channel(cluster) log [INF]
>> : pgmap v99797113: 18432 pgs: 3600 active+clean, 52 active+clean, 2
>> active+clean+scrubbing+deep, 14778 active+clean; 59518 GB data, 129 TB used,
>> 110 TB / 239 TB avail; 208 MB/s rd, 77342 kB/s wr, 14382 op/s
>> 2017-11-13 11:05:17.272757 7fb3570f2700  1 leveldb: Level-0 table
>> #26314650: started
>> 2017-11-13 11:05:17.390808 7fb3570f2700  1 leveldb: Level-0 table
>> #26314650: 18281928 bytes OK
>> 2017-11-13 11:05:17.392636 7fb3570f2700  1 leveldb: Delete type=0
>> #26314647
>>
>> 2017-11-13 11:05:17.397516 7fb3570f2700  1 leveldb: Manual compaction at
>> level-0 from 'pgmap\x0099796362' @ 72057594037927935 : 1 ..
>> 'pgmap\x0099796613' @ 0 : 0; will stop at 'pgmap_pg\x006.ff' @ 29468156273 :
>> 1
>>
>>
>> Thank you
>> Matteo
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
___
ceph-user

Re: [ceph-users] Getting errors on erasure pool writes k=2, m=1

2017-11-20 Thread Sage Weil
Hi Marc,

On Fri, 10 Nov 2017, Marc Roos wrote:
>  
> osd's are crashing when putting a (8GB) file in a erasure coded pool, 

I take it you adjusted the osd_max_object_size option in your ceph.conf?  
We can "fix" this by enforcing a hard limit on that option, but that 
will just mean you get an error when you try to write the large 
object or offset instead of a crash.

sage
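For reference, that option would have been raised in ceph.conf along these
lines; the value here is only an example, since the shipped default is far
smaller than 8 GB:

[osd]
osd max object size = 17179869184   # 16 GiB, enough for the 8 GB object above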



> just before finishing. The same osd's are used for replicated pools 
> rbd/cephfs, and seem to do fine. Did I made some error is this a bug? 
> Looks similar to
> https://www.spinics.net/lists/ceph-devel/msg38685.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021045.html
> 
> 
> [@c01 ~]# date ; rados -p ec21 put  $(basename 
> "/mnt/disk/blablablalbalblablalablalb.txt") 
> blablablalbalblablalablalb.txt
> Fri Nov 10 20:27:26 CET 2017
> 
> [Fri Nov 10 20:33:51 2017] libceph: osd9 down
> [Fri Nov 10 20:33:51 2017] libceph: osd9 down
> [Fri Nov 10 20:33:51 2017] libceph: osd0 192.168.10.111:6802 socket 
> closed (con state OPEN)
> [Fri Nov 10 20:33:51 2017] libceph: osd0 192.168.10.111:6802 socket 
> error on write
> [Fri Nov 10 20:33:52 2017] libceph: osd0 down
> [Fri Nov 10 20:33:52 2017] libceph: osd7 down
> [Fri Nov 10 20:33:55 2017] libceph: osd0 down
> [Fri Nov 10 20:33:55 2017] libceph: osd7 down
> [Fri Nov 10 20:34:41 2017] libceph: osd7 up
> [Fri Nov 10 20:34:41 2017] libceph: osd7 up
> [Fri Nov 10 20:35:03 2017] libceph: osd9 up
> [Fri Nov 10 20:35:03 2017] libceph: osd9 up
> [Fri Nov 10 20:35:47 2017] libceph: osd0 up
> [Fri Nov 10 20:35:47 2017] libceph: osd0 up
> 
> [@c02 ~]# rados -p ec21 stat blablablalbalblablalablalb.txt
> 2017-11-10 20:39:31.296101 7f840ad45e40 -1 WARNING: the following 
> dangerous and experimental features are enabled: bluestore
> 2017-11-10 20:39:31.296290 7f840ad45e40 -1 WARNING: the following 
> dangerous and experimental features are enabled: bluestore
> 2017-11-10 20:39:31.331588 7f840ad45e40 -1 WARNING: the following 
> dangerous and experimental features are enabled: bluestore
> ec21/blablablalbalblablalablalb.txt mtime 2017-11-10 20:32:52.00, 
> size 8585740288
> 
> 
> 
> 2017-11-10 20:32:52.287503 7f933028d700  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1510342372287484, "job": 32, "event": "flush_started", 
> "num_memtables": 1, "num_entries": 728747, "num_deletes": 363960, 
> "memory_usage": 263854696}
> 2017-11-10 20:32:52.287509 7f933028d700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/flush_job.cc:293] 
> [default] [JOB 32] Level-0 flush table #25279: started
> 2017-11-10 20:32:52.503311 7f933028d700  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1510342372503293, "cf_name": "default", "job": 32, 
> "event": "table_file_creation", "file_number": 25279, "file_size": 
> 4811948, "table_properties": {"data_size": 4675796, "index_size": 
> 102865, "filter_size": 32302, "raw_key_size": 646440, 
> "raw_average_key_size": 75, "raw_value_size": 4446103, 
> "raw_average_value_size": 519, "num_data_blocks": 1180, "num_entries": 
> 8560, "filter_policy_name": "rocksdb.BuiltinBloomFilter", 
> "kDeletedKeys": "0", "kMergeOperands": "330"}}
> 2017-11-10 20:32:52.503327 7f933028d700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/flush_job.cc:319] 
> [default] [JOB 32] Level-0 flush table #25279: 4811948 bytes OK
> 2017-11-10 20:32:52.572413 7f933028d700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/db_impl_files.cc:242] 
> adding log 25276 to recycle list
> 
> 2017-11-10 20:32:52.572422 7f933028d700  4 rocksdb: (Original Log Time 
> 2017/11/10-20:32:52.503339) 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/memtable_list.cc:360] 
> [default] Level-0 commit table #25279 started
> 2017-11-10 20:32:52.572425 7f933028d700  4 rocksdb: (Original Log Time 
> 2017/11/10-20:32:52.572312) 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> 12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/rocksdb/db/memtable_list.cc:383] 
> [default] Level-0 commit table #25279: memtable #1 done
> 2017-11-10 20:32:52.572428 7f933028d700  4 rocksdb: (Original Log Time 
> 2017/11/10-20:32:52.572328) EVENT_LOG_v1 {"time_micros": 
> 1510342372572321, "job": 32, "event": "flush_finished", "lsm_state": [4, 
> 4, 36, 140, 0, 0, 0], "immutable_memtables": 0}
> 2017-11-10 20:3

Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-20 Thread Gregory Farnum
On Mon, Nov 20, 2017 at 9:27 AM, Michael Schmid  wrote:
> Gregory Farnum wrote:
>> Your hardware and configuration is very relevant.
>> [...]
>> I'd look at whether you have a writeback cache somewhere that isn't
>> reflecting ordering requirements, or if your disk passes a crash consistency
>> tester. (No, I don't know one off-hand. But many disks lie horribly even
>> about stuff like flushes.)
> It may certainly have had something to do with how I managed to end up with
> the broken rocksdb WAL log. Maybe this is not the best possible behavior
> possible when one simulates a crash or drive disconnect. Perhaps if I can
> get this OSD back in action, and the issue occurs entirely predictably on
> another test, I'll eventually start to see a pattern where / how it happens
> & maybe even find out what hardware / configuration changes might be needed
> to prevent the WAL from corrupting. Perhaps.
>
> --
>
> However, my actual & immediate Ceph + ceph-users relevant problem with this
> is basically only that I cannot seem to figure out how one could deal with
> such an already broken rocksdb WAL log.
>
> 1. Ceph's tooling and rocksdb don't *appear* to be capable to deal be able
> to deal with this corrupted WAL file once it has been corrupted, certainly
> not with the commands that I tried.
> I initially had hoped for some tool to be able to do something - drop the
> log, revert to an earlier backup of a consistent db - any option like that
> that I might have missed. Judging by this ML so far, I'm going to guess
> there is no such thing? So the subsequent problem is:

The error is pretty clear: "Corruption: missing start of fragmented record(2)"
What that says to me is that rocksdb has a journal entry saying that
record *does* exist, but it's missing the opening block or something.
ie, during an atomic write it (1) wrote down a lookaside block, (2)
flushed that block to disk, (3), journaled that it had written the
block. But now on restart, it's finding out that (2) apparently didn't
happen.

>
> 2. I do not know how I can get manual, filewise access to the rocksdb WAL
> logs. This may be immensely simple, but I simply don't know how.
> I don't have any indication that either 1. or 2. is failing due to hardware
> or configuration specifics (...beyond having this broken WAL log) so far.

Rocksdb may offer repair tools for this (I have no idea), but the
fundamental issue is that as far as the program can tell, the
underlying hardware lied, the disk state is corrupted, and it has no
idea what data it can trust or not at this point. Same with Ceph; the
OSD has no desire to believe anything a corrupted disk tells it since
that can break all of our invariants.
BlueStore is a custom block device-managing system; we have a way to
mount and poke at it via FUSE but that assumes the data on disk makes
any sense. In this case, it doesn't (RocksDB stores the disk layout
metadata.) Somebody more familiar with bluestore development may know
if there's a way to mount only the "BlueFS" portion that RocksDB
writes its own data to; if there is it's just a bunch of .ldb files or
whatever, but those are again a custom data format that you'll need
rocksdb expertise to do anything with...

Toss this disk and let Ceph do its recovery thing. Look hard at what
your hardware configuration is doing to make sure it doesn't happen
again. *shrug*
-Greg

>
>> As you note, the WAL should be able to handle being incompletely-written
> Yes, I'd also have thought so? But it apparently just isn't able to deal
> with this log file corruption. Maybe it is not an extremely specific bug.
> Maybe a lot of possible WAL corruptions might throw a comparable error and
> prevent replay.
>
>> and both Ceph and RocksDB are designed to handle failures mid-write.
> As far as I can tell, not in this WAL log case, no. It would certainly be
> really interesting to see at this point if just moving or deleting that WAL
> log allows everything to continue and the OSD to go online, and if then
> doing a scrub fixes the entirety of this issue. Maybe everything is
> essentially fine apart from JUST the WAL log replay and maybe one or another
> bit of a page on the OSD.
>
>> That RocksDB *isn't* doing that here implies either 1) there's a fatal bug
>> in rocksdb
> Not so sure. Ultimately rocksdb does seem to throw a fairly indicative
> error: "db/001005.log: dropping 3225 bytes; Corruption: missing start of
> fragmented record(2)".
> Maybe they intend that users use a repair tool at
> https://github.com/facebook/rocksdb/wiki/RocksDB-Repairer . Or maybe it's a
> case for manual interaction with the file.
>
> But my point 2. -namely that I don't even understand how to get filewise
> access to rocksdb's files- has so far prevented me from trying either.
>
>
>
> Thanks for your input!
>
> -Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] findmnt (was Re: Migration from filestore to bluestore)

2017-11-20 Thread Matthew Vernon
Hi,

On 20/11/17 15:00, Gerhard W. Recher wrote:

Just interjecting here because I keep seeing things like this, and
they're often buggy, and there's an easy answer:

> DEVICE=`mount | grep /var/lib/ceph/osd/ceph-$ID| cut -f1 -d"p"`

findmnt(8) is your friend, any time you want to find out about mounted
filesystems, and much more reliable than grepping the output of mount or
/proc/mtab/ or whatever (consider if ID is 1 and you have ceph-1 and
ceph-10 mounted on the host, for example).

findmnt -T "/var/lib/ceph/osd/ceph-$id" -n -o SOURCE

is probably what you wanted here. Findmnt is in util-linux, and should
be in all non-ancient distributions.
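A quick illustration of the difference, on a hypothetical host with both
osd.1 and osd.10 mounted:

mount | grep /var/lib/ceph/osd/ceph-1              # matches ceph-1 AND ceph-10
findmnt -T /var/lib/ceph/osd/ceph-1 -n -o SOURCE   # prints exactly one source device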

Here ends the message from the findmnt(8) appreciation society :)

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
I am not sure why, but I cannot get Jumbo Frames to work properly:


root@virt2:~# ping -M do -s 8972 -c 4 10.10.10.83
PING 10.10.10.83 (10.10.10.83) 8972(9000) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500


Jumbo frames are on, on the switch and on the NICs:

ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
inet 10.10.10.83  netmask 255.255.255.0  broadcast 10.10.10.255
inet6 fe80::ec4:7aff:feea:7b40  prefixlen 64  scopeid 0x20<link>
ether 0c:c4:7a:ea:7b:40  txqueuelen 1000  (Ethernet)
RX packets 166440655  bytes 229547410625 (213.7 GiB)
RX errors 0  dropped 223  overruns 0  frame 0
TX packets 142788790  bytes 188658602086 (175.7 GiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0




root@virt2:~# ifconfig ens2f0
ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
inet 10.10.10.82  netmask 255.255.255.0  broadcast 10.10.10.255
inet6 fe80::ec4:7aff:feea:ff2c  prefixlen 64  scopeid 0x20<link>
ether 0c:c4:7a:ea:ff:2c  txqueuelen 1000  (Ethernet)
RX packets 466774  bytes 385578454 (367.7 MiB)
RX errors 4  dropped 223  overruns 0  frame 3
TX packets 594975  bytes 580053745 (553.1 MiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
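A couple of quick checks, sketched with the addresses and interface names
from the output above:

ip route get 10.10.10.83        # which interface does the kernel actually use for that peer?
ip link show | grep -i mtu      # every bond/bridge/VLAN on that path must also be MTU 9000

If the route goes out via a bridge or bond that is still at MTU 1500, the
"local error ... mtu=1500" above is exactly what you'd see.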



On Mon, Nov 20, 2017 at 2:13 PM, Sébastien VIGNERON <
sebastien.vigne...@criann.fr> wrote:

> As a jumbo frame test, can you try the following?
>
> ping -M do -s 8972 -c 4 IP_of_other_node_within_cluster_network
>
> If you have « ping: sendto: Message too long », jumbo frames are not
> activated.
>
> Cordialement / Best regards,
>
> Sébastien VIGNERON
> CRIANN,
> Ingénieur / Engineer
> Technopôle du Madrillet
> 745, avenue de l'Université
> 
>
> 76800 Saint-Etienne du Rouvray - France
> 
>
> tél. +33 2 32 91 42 91 <+33%202%2032%2091%2042%2091>
> fax. +33 2 32 91 42 92 <+33%202%2032%2091%2042%2092>
> http://www.criann.fr
> mailto:sebastien.vigne...@criann.fr 
> support: supp...@criann.fr
>
> Le 20 nov. 2017 à 13:02, Rudi Ahlers  a écrit :
>
> We're planning on installing 12X Virtual Machines with some heavy loads.
>
> the SSD drives are  INTEL SSDSC2BA400G4
>
> The SATA drives are ST8000NM0055-1RM112
>
> Please explain your comment, "b) will find a lot of people here who don't
> approve of it."
>
> I don't have access to the switches right now, but they're new so whatever
> default config ships from factory would be active. Though iperf shows 10.5
> GBytes  / 9.02 Gbits/sec throughput.
>
> What speeds would you expect?
> "Though with your setup I would have expected something faster, but NOT
> the
> theoretical 600MB/s 4 HDDs will do in sequential writes."
>
>
>
> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> down. Verify and if so fix this and re-test.": how?
>
>
> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer  wrote:
>
>> On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
>>
>> > Hi,
>> >
>> > Can someone please help me, how do I improve performance on ou CEPH
>> cluster?
>> >
>> > The hardware in use are as follows:
>> > 3x SuperMicro servers with the following configuration
>> > 12Core Dual XEON 2.2Ghz
>> Faster cores is better for Ceph, IMNSHO.
>> Though with main storage on HDDs, this will do.
>>
>> > 128GB RAM
>> Overkill for Ceph but I see something else below...
>>
>> > 2x 400GB Intel DC SSD drives
>> Exact model please.
>>
>> > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
>> One hopes that's a non SMR one.
>> Model please.
>>
>> > 1x SuperMicro DOM for Proxmox / Debian OS
>> Ah, Proxmox.
>> I'm personally not averse to converged, high density, multi-role clusters
>> myself, but you:
>> a) need to know what you're doing and
>> b) will find a lot of people here who don't approve of it.
>>
>> I've avoided DOMs so far (non-hotswapable SPOF), even though the SM ones
>> look good on paper with regards to endurance and IOPS.
>> The later being rather important for your monitors.
>>
>> > 4x Port 10Gbe NIC
>> > Cisco 10Gbe switch.
>> >
>> Configuration would be nice for those, LACP?
>>
>> >
>> > root@virt2:~# rados bench -p Data 10 write --no-cleanup
>> > hints = 1
>> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
>> > 4194304 for   up to 10 seconds or 0 objects
>>
>> rados bench is limited tool and measuring bandwidth is in nearly all
>> the use cases pointless.
>> Latency is where it is at and testing from inside a VM is more relevant
>> than synthetic tests of the storage.
>> But it is a start.
>>
>> > Object prefix: benchmark_data_virt2_39099
>> >   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>> > lat(s)
>> > 0   0 0 0 0   

Re: [ceph-users] how to improve performance

2017-11-20 Thread Sébastien VIGNERON
Your performance hit may come from here. When an OSD daemon tries to send a big
frame, the MTU misconfiguration blocks it and the frame must be sent again at a
lower size.
On some switches, you have to set the global and the per-interface MTU sizes.

Cordialement / Best regards,

Sébastien VIGNERON 
CRIANN, 
Ingénieur / Engineer
Technopôle du Madrillet 
745, avenue de l'Université 
76800 Saint-Etienne du Rouvray - France 
tél. +33 2 32 91 42 91 
fax. +33 2 32 91 42 92 
http://www.criann.fr 
mailto:sebastien.vigne...@criann.fr
support: supp...@criann.fr

> Le 20 nov. 2017 à 16:21, Rudi Ahlers  a écrit :
> 
> I am not sure why, but I cannot get Jumbo Frames to work properly:
> 
> 
> root@virt2:~# ping -M do -s 8972 -c 4 10.10.10.83
> PING 10.10.10.83 (10.10.10.83) 8972(9000) bytes of data.
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
> 
> 
> Jumbo Frames is on, on the switch and on the NIC's:
> 
> ens2f0: flags=4163  mtu 9000
> inet 10.10.10.83  netmask 255.255.255.0  broadcast 10.10.10.255
> inet6 fe80::ec4:7aff:feea:7b40  prefixlen 64  scopeid 0x20
> ether 0c:c4:7a:ea:7b:40  txqueuelen 1000  (Ethernet)
> RX packets 166440655  bytes 229547410625 (213.7 GiB)
> RX errors 0  dropped 223  overruns 0  frame 0
> TX packets 142788790  bytes 188658602086 (175.7 GiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> 
> 
> 
> root@virt2:~# ifconfig ens2f0
> ens2f0: flags=4163  mtu 9000
> inet 10.10.10.82  netmask 255.255.255.0  broadcast 10.10.10.255
> inet6 fe80::ec4:7aff:feea:ff2c  prefixlen 64  scopeid 0x20
> ether 0c:c4:7a:ea:ff:2c  txqueuelen 1000  (Ethernet)
> RX packets 466774  bytes 385578454 (367.7 MiB)
> RX errors 4  dropped 223  overruns 0  frame 3
> TX packets 594975  bytes 580053745 (553.1 MiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> 
> 
> On Mon, Nov 20, 2017 at 2:13 PM, Sébastien VIGNERON 
> mailto:sebastien.vigne...@criann.fr>> wrote:
> As a jumbo frame test, can you try the following?
> 
> ping -M do -s 8972 -c 4 IP_of_other_node_within_cluster_network
> 
> If you have « ping: sendto: Message too long », jumbo frames are not 
> activated.
> 
> Cordialement / Best regards,
> 
> Sébastien VIGNERON 
> CRIANN, 
> Ingénieur / Engineer
> Technopôle du Madrillet 
> 745, avenue de l'Université 
> 
>  
> 76800 Saint-Etienne du Rouvray - France 
> 
>  
> tél. +33 2 32 91 42 91  
> fax. +33 2 32 91 42 92  
> http://www.criann.fr  
> mailto:sebastien.vigne...@criann.fr 
> support: supp...@criann.fr 
> 
>> Le 20 nov. 2017 à 13:02, Rudi Ahlers > > a écrit :
>> 
>> We're planning on installing 12X Virtual Machines with some heavy loads. 
>> 
>> the SSD drives are  INTEL SSDSC2BA400G4
>> 
>> The SATA drives are ST8000NM0055-1RM112
>> 
>> Please explain your comment, "b) will find a lot of people here who don't 
>> approve of it."
>> 
>> I don't have access to the switches right now, but they're new so whatever 
>> default config ships from factory would be active. Though iperf shows 10.5 
>> GBytes  / 9.02 Gbits/sec throughput.
>> 
>> What speeds would you expect?
>> "Though with your setup I would have expected something faster, but NOT the
>> theoretical 600MB/s 4 HDDs will do in sequential writes."
>> 
>> 
>> 
>> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed down. 
>> Verify and if so fix this and re-test.": how?
>> 
>> 
>> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer > > wrote:
>> On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
>> 
>> > Hi,
>> >
>> > Can someone please help me, how do I improve performance on ou CEPH 
>> > cluster?
>> >
>> > The hardware in use are as follows:
>> > 3x SuperMicro servers with the following configuration
>> > 12Core Dual XEON 2.2Ghz
>> Faster cores is better for Ceph, IMNSHO.
>> Though with main storage on HDDs, this will do.
>> 
>> > 128GB RAM
>> Overkill for Ceph but I see something else below...
>> 
>> > 2x 400GB Intel DC SSD drives
>> Exact model please.
>> 
>> > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
>> One hopes that's a non SMR one.
>> Model please.
>> 
>> > 1x SuperMicro DOM for Proxmox / Debian OS
>> Ah, Proxmox.
>> I'm personally not averse to converged, high density, multi-role clusters
>> myself, but you:
>> a) need to know what you're doing and
>> b) will find a lot of people here who don't approve of it.
>> 
>> I've avoided DOMs so far (non-hotswapable SPOF), even though the SM ones
>> look good on

Re: [ceph-users] Deleting large pools

2017-11-20 Thread David Turner
I created a bug tracker for this here.  http://tracker.ceph.com/issues/22201
Thank you for your help Gregory.

On Sat, Nov 18, 2017 at 9:20 PM Gregory Farnum  wrote:

> On Wed, Nov 15, 2017 at 6:50 AM David Turner 
> wrote:
>
>> 2 weeks later and things are still deleting, but getting really close to
>> being done.  I tried to use ceph-objectstore-tool to remove one of the
>> PGs.  I only tested on 1 PG on 1 OSD, but it's doing something really
>> weird.  While it was running, my connection to the DC reset and the command
>> died.  Now when I try to run the tool it segfaults and just running the OSD
>> it doesn't try to delete the data.  The data in this PG does not matter and
>> I figure the worst case scenario is that it just sits there taking up 200GB
>> until I redeploy the OSD.
>>
>> However, I like to learn things about Ceph.  Is there anyone with any
>> insight to what is happening with this PG?
>>
>
> Well, this isn't supposed to happen, but backtraces like that generally
> mean the PG is trying to load an OSDMap that has already been trimmed.
>
> If I were to guess, in this case enough of the PG metadata got cleaned up
> that the OSD no longer knows it's there, and it removed the maps. But
> trying to remove the PG is pulling them in.
> Or, alternatively, there's an issue with removing PGs that have lost their
> metadata and it's trying to pull in map epoch 0 or something...
> I'd stick a bug in the tracker in case it comes up in the future or
> somebody takes a fancy to it. :)
> -Greg
>
>
>>
>> [root@osd1 ~] # ceph-objectstore-tool --data-path
>> /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal
>> --pgid 97.314s0 --op remove
>> SG_IO: questionable sense data, results may be incorrect
>> SG_IO: questionable sense data, results may be incorrect
>>  marking collection for removal
>> mark_pg_for_removal warning: peek_map_epoch reported error
>> terminate called after throwing an instance of
>> 'ceph::buffer::end_of_buffer'
>>   what():  buffer::end_of_buffer
>> *** Caught signal (Aborted) **
>>  in thread 7f98ab2dc980 thread_name:ceph-objectstor
>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>  1: (()+0x95209a) [0x7f98abc4b09a]
>>  2: (()+0xf100) [0x7f98a91d7100]
>>  3: (gsignal()+0x37) [0x7f98a7d825f7]
>>  4: (abort()+0x148) [0x7f98a7d83ce8]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f98a86879d5]
>>  6: (()+0x5e946) [0x7f98a8685946]
>>  7: (()+0x5e973) [0x7f98a8685973]
>>  8: (()+0x5eb93) [0x7f98a8685b93]
>>  9: (ceph::buffer::list::iterator_impl::copy(unsigned int,
>> char*)+0xa5) [0x7f98abd498a5]
>>  10: (PG::read_info(ObjectStore*, spg_t, coll_t const&,
>> ceph::buffer::list&, pg_info_t&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned char&)+0x324) [0x7f98ab6d3094]
>>  11: (mark_pg_for_removal(ObjectStore*, spg_t,
>> ObjectStore::Transaction*)+0x87c) [0x7f98ab66615c]
>>  12: (initiate_new_remove_pg(ObjectStore*, spg_t,
>> ObjectStore::Sequencer&)+0x131) [0x7f98ab666a51]
>>  13: (main()+0x39b7) [0x7f98ab610437]
>>  14: (__libc_start_main()+0xf5) [0x7f98a7d6eb15]
>>  15: (()+0x363a57) [0x7f98ab65ca57]
>> Aborted
>>
>> On Thu, Nov 2, 2017 at 12:45 PM Gregory Farnum 
>> wrote:
>>
>>> Deletion is throttled, though I don’t know the configs to change it you
>>> could poke around if you want stuff to go faster.
>>>
>>> Don’t just remove the directory in the filesystem; you need to clean up
>>> the leveldb metadata as well. ;)
>>> Removing the pg via Ceph-objectstore-tool would work fine but I’ve seen
>>> too many people kill the wrong thing to recommend it.
>>> -Greg
>>> On Thu, Nov 2, 2017 at 9:40 AM David Turner 
>>> wrote:
>>>
 Jewel 10.2.7; XFS formatted OSDs; no dmcrypt or LVM.  I have a pool
 that I deleted 16 hours ago that accounted for about 70% of the available
 space on each OSD (averaging 84% full), 370M objects in 8k PGs, ec 4+2
 profile.  Based on the rate that the OSDs are freeing up space after
 deleting the pool, it will take about a week to finish deleting the PGs
 from the OSDs.

 Is there anything I can do to speed this process up?  I feel like there
 may be a way for me to go through the OSDs and delete the PG folders either
 with the objectstore tool or while the OSD is offline.  I'm not sure what
 Ceph is doing to delete the pool, but I don't think that an `rm -Rf` of the
 PG folder would take nearly this long.

 Thank you all for your help.

>>> ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD is near full and slow in accessing storage from client

2017-11-20 Thread David Turner
What is your current `ceph status` and `ceph df`? The status of your
cluster has likely changed a bit in the last week.

On Mon, Nov 20, 2017 at 6:00 AM gjprabu  wrote:

> Hi David,
>
> Sorry for the late reply. The OSD sync has completed, yet the fourth
> OSD's available space keeps shrinking. Is there any option
> to check or fix this?
>
>
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
> 0 3.29749  1.0  3376G  2320G  1056G 68.71 1.10 144
> 1 3.26869  1.0  3347G  1871G  1475G 55.92 0.89 134
> 2 3.27339  1.0  3351G  1699G  1652G 50.69 0.81 134
> 3 3.24089  1.0  3318G  1865G  1452G 56.22 0.90 142
> 4 3.24089  1.0  3318G  2839G   478G 85.57 1.37 158
> 5 3.32669  1.0  3406G  2249G  1156G 66.04 1.06 136
> 6 3.27800  1.0  3356G  1924G  1432G 57.33 0.92 139
> 7 3.20470  1.0  3281G  1949G  1331G 59.42 0.95 141
>   TOTAL 26757G 16720G 10037G 62.49
> MIN/MAX VAR: 0.81/1.37  STDDEV: 10.26
>
>
> Regards
> Prabu GJ
>
>
>
>  On Mon, 13 Nov 2017 00:27:47 +0530 *David Turner
> >* wrote 
>
> You cannot reduce the PG count for a pool.  So there isn't anything you
> can really do for this unless you create a new FS with better PG counts and
> migrate your data into it.
>
> The problem with having more PGs than you need is in the memory footprint
> for the osd daemon. There are warning thresholds for having too many PGs
> per osd.  Also in future expansions, if you need to add pools, you might
> not be able to create the pools with the proper amount of PGs due to older
> pools that have way too many PGs.
>
> It would still be nice to see the output from those commands I asked about.
>
> The built-in reweighting scripts might help your data distribution.
> reweight-by-utilization
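>
> A rough sketch of how that is usually run (dry-run first; thresholds and
> option availability may vary a bit by release):
>
> ceph osd test-reweight-by-utilization 120
> ceph osd reweight-by-utilization 120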
>
> On Sun, Nov 12, 2017, 11:41 AM gjprabu  wrote:
>
>
> Hi David,
>
> Thanks for your valuable reply. Once the backfilling for the new OSD is
> complete, we will consider increasing the replica value as soon as possible.
> Is it possible to decrease the metadata PG count? If the metadata PG count
> is the same as the data PG count, what kind of issues may occur?
>
> Regards
> PrabuGJ
>
>
>  On Sun, 12 Nov 2017 21:25:05 +0530 David Turner
> wrote 
>
> What's the output of `ceph df` to see if your PG counts are good or not?
> Like everyone else has said, the space on the original osds can't be
> expected to free up until the backfill from adding the new osd has finished.
>
> You don't have anything in your cluster health to indicate that your
> cluster will not be able to finish this backfilling operation on its own.
>
> You might find this URL helpful in calculating your PG counts.
> http://ceph.com/pgcalc/  As a side note. It is generally better to keep
> your PG counts as base 2 numbers (16, 64, 256, etc). When you do not have a
> base 2 number then some of your PGs will take up twice as much space as
> others. In your case with 250, you have 244 PGs that are the same size and
> 6 PGs that are twice the size of those 244 PGs.  Bumping that up to 256
> will even things out.
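>
> For example (a sketch, assuming the data pool here is the downloads_data
> pool; pgp_num has to be raised to match, and PG counts can only be
> increased, never decreased):
>
> ceph osd pool set downloads_data pg_num 256
> ceph osd pool set downloads_data pgp_num 256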
>
> Assuming that the metadata pool is for a CephFS volume, you do not need
> nearly so many PGs for that pool. Also, I would recommend changing at least
> the metadata pool to 3 replica_size. If we can talk you into 3 replica for
> everything else, great! But if not, at least do the metadata pool. If you
> lose an object in the data pool, you just lose that file. If you lose an
> object in the metadata pool, you might lose access to the entire CephFS
> volume.
>
> On Sun, Nov 12, 2017, 9:39 AM gjprabu  wrote:
>
>
> Hi Cassiano,
>
>    Thanks for your valuable feedback; we will wait some time until the new
> OSD sync completes. Also, will increasing the PG count solve the issue? In
> our setup the PG number for both the data and metadata pools is 250. Is
> this correct for 7 OSDs with 2 replicas? Also, the currently stored data
> size is 17 TB.
>
> ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR  PGS
> 0 3.29749  1.0  3376G  2814G  562G 83.35 1.23 165
> 1 3.26869  1.0  3347G  1923G 1423G 57.48 0.85 152
> 2 3.27339  1.0  3351G  1980G 1371G 59.10 0.88 161
> 3 3.24089  1.0  3318G  2131G 1187G 64.23 0.95 168
> 4 3.24089  1.0  3318G  2998G  319G 90.36 1.34 176
> 5 3.32669  1.0  3406G  2476G  930G 72.68 1.08 165
> 6 3.27800  1.0  3356G  1518G 1838G 45.24 0.67 166
>   TOTAL 23476G 15843G 7632G 67.49
> MIN/MAX VAR: 0.67/1.34  STDDEV: 14.53
>
> ceph osd tree
> ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 22.92604 root default
> -2  3.29749 host intcfs-osd1
> 0  3.29749 osd.0 up  1.0  1.0
> -3  3.26869 host intcfs-osd2
> 1  3.26869 osd.1 up  1.0  1.0
> -4  3.27339 host intcfs-osd3
> 2  3.27339 osd.2 up  1.0  1.0
> -5  3.24089 host intcfs-osd4
> 3  3.24089 osd.3 up  1.0  1.0
> -6  3.24089 

[ceph-users] Poor libRBD write performance

2017-11-20 Thread Moreno, Orlando
Hi all,

I've been experiencing weird performance behavior when using FIO RBD engine 
directly to an RBD volume with numjobs > 1. For a 4KB random write test at 32 
QD and 1 numjob, I can get about 40K IOPS, but when I increase the numjobs to 
4, it plummets to 2800 IOPS. I tried running the same exact test on a VM using 
FIO libaio targeting a block device (volume) attached through QEMU/RBD and I 
get ~35K-40K IOPS in both situations. In all cases, CPU was not fully utilized 
and there were no signs of any hardware bottlenecks. I did not disable any RBD 
features and most of the Ceph parameters are default (besides auth, debug, pool 
size, etc).

My Ceph cluster is running on 6 nodes, all-NVMe, 22-core, 376GB mem, Luminous 
12.2.1, Ubuntu 16.04, and clients running FIO job/VM on similar HW/SW spec. The 
VM has 16 vCPU, 64GB mem, and the root disk is locally stored while the 
persistent disk comes from an RBD volume serviced by the Ceph cluster.

If anyone has seen this issue or have any suggestions please let me know.

Thanks,
Orlando
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor libRBD write performance

2017-11-20 Thread Jason Dillaman
I suspect you are seeing this issue [1]. TL;DR: never use "numjobs" >
1 against an RBD image that has the exclusive-lock feature enabled.

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012123.html
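
If it is only needed for benchmarking, a rough workaround (a sketch; the
pool and image names below are placeholders) is to disable the feature,
along with any enabled features that depend on it, on the test image:

rbd info rbd/fiotest
rbd feature disable rbd/fiotest object-map fast-diff exclusive-lock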

On Mon, Nov 20, 2017 at 11:06 AM, Moreno, Orlando
 wrote:
> Hi all,
>
>
>
> I’ve been experiencing weird performance behavior when using FIO RBD engine
> directly to an RBD volume with numjobs > 1. For a 4KB random write test at
> 32 QD and 1 numjob, I can get about 40K IOPS, but when I increase the
> numjobs to 4, it plummets to 2800 IOPS. I tried running the same exact test
> on a VM using FIO libaio targeting a block device (volume) attached through
> QEMU/RBD and I get ~35K-40K IOPS in both situations. In all cases, CPU was
> not fully utilized and there were no signs of any hardware bottlenecks. I
> did not disable any RBD features and most of the Ceph parameters are default
> (besides auth, debug, pool size, etc).
>
>
>
> My Ceph cluster is running on 6 nodes, all-NVMe, 22-core, 376GB mem,
> Luminous 12.2.1, Ubuntu 16.04, and clients running FIO job/VM on similar
> HW/SW spec. The VM has 16 vCPU, 64GB mem, and the root disk is locally
> stored while the persistent disk comes from an RBD volume serviced by the
> Ceph cluster.
>
>
>
> If anyone has seen this issue or have any suggestions please let me know.
>
>
>
> Thanks,
>
> Orlando
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Cbt] Poor libRBD write performance

2017-11-20 Thread Mark Nelson

On 11/20/2017 10:06 AM, Moreno, Orlando wrote:

Hi all,



I’ve been experiencing weird performance behavior when using FIO RBD
engine directly to an RBD volume with numjobs > 1. For a 4KB random
write test at 32 QD and 1 numjob, I can get about 40K IOPS, but when I
increase the numjobs to 4, it plummets to 2800 IOPS. I tried running the
same exact test on a VM using FIO libaio targeting a block device
(volume) attached through QEMU/RBD and I get ~35K-40K IOPS in both
situations. In all cases, CPU was not fully utilized and there were no
signs of any hardware bottlenecks. I did not disable any RBD features
and most of the Ceph parameters are default (besides auth, debug, pool
size, etc).



My Ceph cluster is running on 6 nodes, all-NVMe, 22-core, 376GB mem,
Luminous 12.2.1, Ubuntu 16.04, and clients running FIO job/VM on similar
HW/SW spec. The VM has 16 vCPU, 64GB mem, and the root disk is locally
stored while the persistent disk comes from an RBD volume serviced by
the Ceph cluster.



If anyone has seen this issue or have any suggestions please let me know.


Hi Orlando,

Try seeing if disabling the RBD image exclusive lock helps (if only to 
confirm that's what's going on).  I usually test with numjobs=1 and run 
multiple fio instances with higher iodepth values instead to avoid this. 
 See:


https://www.spinics.net/lists/ceph-devel/msg30468.html

and

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004872.html
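
For reference, one instance of such a test can be launched like this (a
sketch only; client, pool and image names are placeholders):

fio --name=rbd-test --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fiotest --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --direct=1

Running several of these in parallel, each against its own image, avoids
the exclusive-lock contention while still scaling up the load.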

Mark





Thanks,

Orlando



___
Cbt mailing list
c...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/cbt-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rename iscsi target_iqn

2017-11-20 Thread Frank Brendel



Am 20.11.2017 um 15:10 schrieb Jason Dillaman:

Recommended way to do what, exactly? If you are attempting to rename
the target while keeping all other settings, at step (3) you could use
"rados get" to get the current config, modify it, and then "rados put"
to upload it before continuing to step 4.
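
Roughly, that step looks like this (an untested sketch; it assumes the
gateway configuration lives in a RADOS object called gateway.conf in the
rbd pool, which may differ in your setup):

rados -p rbd get gateway.conf /tmp/gateway.conf
# edit the target IQN in /tmp/gateway.conf
rados -p rbd put gateway.conf /tmp/gateway.conf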

I am new to this and made some mistakes, so I just wanted to start over.
I had no idea whether I could simply rename the target.
By asking for the "recommended way" I meant "Did I miss
something?", e.g. kernel modules.

I am not an expert. Sorry.


Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread ulembke

Hi,
no, I don't mean flushing the Ceph journal!

I'm talking about the Linux page cache.
If you run free, you can see how much is cached:
like
# free
              total        used        free      shared  buff/cache   available
Mem:       41189692    16665960     4795700      124780    19728032    28247464


To free the cache (normly not done in productional systems):
sync; echo 3 > /proc/sys/vm/drop_caches

look after that with free and run your bench (only read) again.


Udo


Am 2017-11-20 13:06, schrieb Rudi Ahlers:

Hi,

So are you saying this isn't true speed?

Do I just flush the journal and test again? i.e. ceph-osd -i osd.0
--flush-journal && ceph-osd -i osd.2 --flush-journal && ceph-osd -i 
osd.3

--flush-journal etc, etc?

On Mon, Nov 20, 2017 at 2:02 PM,  wrote:


Hi Rudi,

Am 2017-11-20 11:58, schrieb Rudi Ahlers:


...

Some more stats:

root@virt2:~# rados bench -p Data 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  
avg

lat(s)
0   0 0 0 0 0   -
 0
1  16   402   386   1543.69  1544  0.00182802
 0.0395421
2  16   773   757   1513.71  1484  0.00243911
 0.0409455

these values are due to cached OSD data on your OSD nodes.


If you flush your cache (on all OSD nodes), your reads will be much worse,
because they will then come from the HDDs.


Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw bucket rename and change owner

2017-11-20 Thread Kim-Norman Sahm
Is it possible to rename a radosgw bucket and change the owner?
I'm using Ceph as the Swift backend in OpenStack and want to move an old
bucket to a Keystone-based user.
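
What I have found so far (untested, so corrections welcome) is that the
ownership can apparently be re-linked with the commands below, where the
bucket and user names are placeholders; I have not found a supported way
to rename the bucket itself in this release:

radosgw-admin bucket unlink --bucket=oldbucket --uid=olduser
radosgw-admin bucket link --bucket=oldbucket --uid=keystone-user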

br Kim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor libRBD write performance

2017-11-20 Thread Moreno, Orlando
Hi Jason,

You're right, thanks for pointing that out. I could've sworn I saw the same 
problem with exclusive-lock disabled, but after trying it again, disabling the 
feature does fix the write performance :)

So does this mean that when an RBD is attached to a VM, it is considered a 
single client connection?

Thanks,
Orlando

-Original Message-
From: Jason Dillaman [mailto:jdill...@redhat.com] 
Sent: Monday, November 20, 2017 9:10 AM
To: Moreno, Orlando 
Cc: f...@vger.kernel.org; ceph-users@lists.ceph.com; c...@lists.ceph.com
Subject: Re: [ceph-users] Poor libRBD write performance

I suspect you are seeing this issue [1]. TL;DR: never use "numjobs" >
1 against an RBD image that has the exclusive-lock feature enabled.

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012123.html

On Mon, Nov 20, 2017 at 11:06 AM, Moreno, Orlando  
wrote:
> Hi all,
>
>
>
> I’ve been experiencing weird performance behavior when using FIO RBD 
> engine directly to an RBD volume with numjobs > 1. For a 4KB random 
> write test at
> 32 QD and 1 numjob, I can get about 40K IOPS, but when I increase the 
> numjobs to 4, it plummets to 2800 IOPS. I tried running the same exact 
> test on a VM using FIO libaio targeting a block device (volume) 
> attached through QEMU/RBD and I get ~35K-40K IOPS in both situations. 
> In all cases, CPU was not fully utilized and there were no signs of 
> any hardware bottlenecks. I did not disable any RBD features and most 
> of the Ceph parameters are default (besides auth, debug, pool size, etc).
>
>
>
> My Ceph cluster is running on 6 nodes, all-NVMe, 22-core, 376GB mem, 
> Luminous 12.2.1, Ubuntu 16.04, and clients running FIO job/VM on 
> similar HW/SW spec. The VM has 16 vCPU, 64GB mem, and the root disk is 
> locally stored while the persistent disk comes from an RBD volume 
> serviced by the Ceph cluster.
>
>
>
> If anyone has seen this issue or have any suggestions please let me know.
>
>
>
> Thanks,
>
> Orlando
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor libRBD write performance

2017-11-20 Thread Jason Dillaman
On Mon, Nov 20, 2017 at 12:00 PM, Moreno, Orlando
 wrote:
> Hi Jason,
>
> You're right, thanks for pointing that out. I could've sworn I saw the same 
> problem with exclusive-lock disabled, but after trying it again, disabling 
> the feature does fix the write performance :)
>
> So does this mean that when an RBD is attached to a VM, it is considered a 
> single client connection?

Yes, the VM (QEMU) is a single client to librbd.

> Thanks,
> Orlando
>
> -Original Message-
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: Monday, November 20, 2017 9:10 AM
> To: Moreno, Orlando 
> Cc: f...@vger.kernel.org; ceph-users@lists.ceph.com; c...@lists.ceph.com
> Subject: Re: [ceph-users] Poor libRBD write performance
>
> I suspect you are seeing this issue [1]. TL;DR: never use "numjobs" >
> 1 against an RBD image that has the exclusive-lock feature enabled.
>
> [1] 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012123.html
>
> On Mon, Nov 20, 2017 at 11:06 AM, Moreno, Orlando  
> wrote:
>> Hi all,
>>
>>
>>
>> I’ve been experiencing weird performance behavior when using FIO RBD
>> engine directly to an RBD volume with numjobs > 1. For a 4KB random
>> write test at
>> 32 QD and 1 numjob, I can get about 40K IOPS, but when I increase the
>> numjobs to 4, it plummets to 2800 IOPS. I tried running the same exact
>> test on a VM using FIO libaio targeting a block device (volume)
>> attached through QEMU/RBD and I get ~35K-40K IOPS in both situations.
>> In all cases, CPU was not fully utilized and there were no signs of
>> any hardware bottlenecks. I did not disable any RBD features and most
>> of the Ceph parameters are default (besides auth, debug, pool size, etc).
>>
>>
>>
>> My Ceph cluster is running on 6 nodes, all-NVMe, 22-core, 376GB mem,
>> Luminous 12.2.1, Ubuntu 16.04, and clients running FIO job/VM on
>> similar HW/SW spec. The VM has 16 vCPU, 64GB mem, and the root disk is
>> locally stored while the persistent disk comes from an RBD volume
>> serviced by the Ceph cluster.
>>
>>
>>
>> If anyone has seen this issue or have any suggestions please let me know.
>>
>>
>>
>> Thanks,
>>
>> Orlando
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster network bandwidth?

2017-11-20 Thread Anthony Verevkin


> From: "John Spray" 
> Sent: Thursday, November 16, 2017 11:01:35 AM
> 
> On Thu, Nov 16, 2017 at 3:32 PM, David Turner 
> wrote:
> > That depends on another question.  Does the client write all 3
> > copies or
> > does the client send the copy to the primary OSD and then the
> > primary OSD
> > sends the write to the secondaries?  Someone asked this recently,
> > but I
> > don't recall if an answer was given.  I'm not actually certain
> > which is the
> > case.  If it's the latter then the 10Gb pipe from the client is all
> > you
> > need.
> 
> The client sends the write to the primary OSD (via the public
> network)
> and the primary OSD sends it on to the two replicas (via the cluster
> network).
> 
> John


Thank you John! Would you also know if the same is true for Erasure coding?
Is it the client or an OSD that is splitting the request into k+m chunks?
What about reads? Is it the client assembling the erasures or is the primary
OSD proxying each read request?

Also, for replicated sets people often forget that it's not just writes. When
the client is reading data, it comes from the primary OSD only and does not
generate extra traffic on the cluster network. So in the 50/50 read-write use 
case the public and cluster traffic would actually be balanced:
1x read + 1x write on public / 2x write replication on cluster.

Regards,
Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Chris Taylor


On 2017-11-20 3:39 am, Matteo Dacrema wrote:

Yes I mean the existing Cluster.
SSDs are on a fully separate pool.
Cluster is not busy during recovery and deep scrubs but I think it’s
better to limit replication in some way when switching to replica 3.

My question is to understand whether I need to set some option parameters
to limit the impact of the creation of new objects. I’m also concerned
about disks filling up during recovery because of inefficient data
balancing.


You can try using osd_recovery_sleep to slow down the backfilling so it 
does not cause the client io to hang.


ceph tell osd.* injectargs "--osd_recovery_sleep 0.1"
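
Reducing the number of concurrent backfills can also help keep client I/O
responsive while the third replicas are created (a sketch; tune the values
for your cluster and release):

ceph tell osd.* injectargs "--osd_max_backfills 1 --osd_recovery_max_active 1"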




Here osd tree

ID  WEIGHTTYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-10  19.69994 root ssd
-11   5.06998 host ceph101
166   0.98999 osd.166   up  1.0  1.0
167   1.0 osd.167   up  1.0  1.0
168   1.0 osd.168   up  1.0  1.0
169   1.07999 osd.169   up  1.0  1.0
170   1.0 osd.170   up  1.0  1.0
-12   4.92998 host ceph102
171   0.98000 osd.171   up  1.0  1.0
172   0.92999 osd.172   up  1.0  1.0
173   0.98000 osd.173   up  1.0  1.0
174   1.0 osd.174   up  1.0  1.0
175   1.03999 osd.175   up  1.0  1.0
-13   4.69998 host ceph103
176   0.84999 osd.176   up  1.0  1.0
177   0.84999 osd.177   up  1.0  1.0
178   1.0 osd.178   up  1.0  1.0
179   1.0 osd.179   up  1.0  1.0
180   1.0 osd.180   up  1.0  1.0
-14   5.0 host ceph104
181   1.0 osd.181   up  1.0  1.0
182   1.0 osd.182   up  1.0  1.0
183   1.0 osd.183   up  1.0  1.0
184   1.0 osd.184   up  1.0  1.0
185   1.0 osd.185   up  1.0  1.0
 -1 185.19835 root default
 -2  18.39980 host ceph001
 63   0.7 osd.63up  1.0  1.0
 64   0.7 osd.64up  1.0  1.0
 65   0.7 osd.65up  1.0  1.0
146   0.7 osd.146   up  1.0  1.0
147   0.7 osd.147   up  1.0  1.0
148   0.90999 osd.148   up  1.0  1.0
149   0.7 osd.149   up  1.0  1.0
150   0.7 osd.150   up  1.0  1.0
151   0.7 osd.151   up  1.0  1.0
152   0.7 osd.152   up  1.0  1.0
153   0.7 osd.153   up  1.0  1.0
154   0.7 osd.154   up  1.0  1.0
155   0.8 osd.155   up  1.0  1.0
156   0.84999 osd.156   up  1.0  1.0
157   0.7 osd.157   up  1.0  1.0
158   0.7 osd.158   up  1.0  1.0
159   0.84999 osd.159   up  1.0  1.0
160   0.90999 osd.160   up  1.0  1.0
161   0.90999 osd.161   up  1.0  1.0
162   0.90999 osd.162   up  1.0  1.0
163   0.7 osd.163   up  1.0  1.0
164   0.90999 osd.164   up  1.0  1.0
165   0.64999 osd.165   up  1.0  1.0
 -3  19.41982 host ceph002
 23   0.7 osd.23up  1.0  1.0
 24   0.7 osd.24up  1.0  1.0
 25   0.90999 osd.25up  1.0  1.0
 26   0.5 osd.26up  1.0  1.0
 27   0.95000 osd.27up  1.0  1.0
 28   0.64999 osd.28up  1.0  1.0
 29   0.75000 osd.29up  1.0  1.0
 30   0.8 osd.30up  1.0  1.0
 31   0.90999 osd.31up  1.0  1.0
 32   0.90999 osd.32up  1.0  1.0
 33   0.8 osd.33up  1.0  1.0
 34   0.90999 osd.34up  1.0  1.0
 35   0.90999 osd.35up  1.0  1.0
 36   0.84999 osd.36up  1.0  1.0
 37   0.8 osd.37up  1.0  1.0
 38   1.0 osd.38up  1.0  1.0
 39   0.7 osd.39up  1.0  1.0
 40   0.90999 osd.40up  1.0  1.0
 41   0.84999 osd.41up  1.0  1.0
 42   0.

Re: [ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
Ok, so it seems an MTU of 9000 didn't improve anything.

On Mon, Nov 20, 2017 at 5:34 PM, Sébastien VIGNERON <
sebastien.vigne...@criann.fr> wrote:

> Your performance hit can come from here. When an OSD daemon tries to send a
> big frame, an MTU misconfiguration blocks it and the frame must be sent again
> at a smaller size.
> On some switches, you have to set the global and the per-interface MTU
> sizes.
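>
> On the Linux side you can raise the interface MTU for a quick test like
> this (a sketch; ens2f0 is the interface from your mail, and the change
> must also be made persistent in the network config and on every switch
> port in the path):
>
> ip link set dev ens2f0 mtu 9000
> ip link show ens2f0 | grep mtu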
>
> Cordialement / Best regards,
>
> Sébastien VIGNERON
> CRIANN,
> Ingénieur / Engineer
> Technopôle du Madrillet
> 745, avenue de l'Université
> 
>
> 76800 Saint-Etienne du Rouvray - France
> 
>
> tél. +33 2 32 91 42 91 <+33%202%2032%2091%2042%2091>
> fax. +33 2 32 91 42 92 <+33%202%2032%2091%2042%2092>
> http://www.criann.fr
> mailto:sebastien.vigne...@criann.fr 
> support: supp...@criann.fr
>
> Le 20 nov. 2017 à 16:21, Rudi Ahlers  a écrit :
>
> I am not sure why, but I cannot get Jumbo Frames to work properly:
>
>
> root@virt2:~# ping -M do -s 8972 -c 4 10.10.10.83
> PING 10.10.10.83 (10.10.10.83) 8972(9000) bytes of data.
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
>
>
> Jumbo Frames is on, on the switch and on the NIC's:
>
> ens2f0: flags=4163  mtu 9000
> inet 10.10.10.83  netmask 255.255.255.0  broadcast 10.10.10.255
> inet6 fe80::ec4:7aff:feea:7b40  prefixlen 64  scopeid 0x20
> ether 0c:c4:7a:ea:7b:40  txqueuelen 1000  (Ethernet)
> RX packets 166440655  bytes 229547410625 (213.7 GiB)
> RX errors 0  dropped 223  overruns 0  frame 0
> TX packets 142788790  bytes 188658602086 (175.7 GiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
>
>
>
> root@virt2:~# ifconfig ens2f0
> ens2f0: flags=4163  mtu 9000
> inet 10.10.10.82  netmask 255.255.255.0  broadcast 10.10.10.255
> inet6 fe80::ec4:7aff:feea:ff2c  prefixlen 64  scopeid 0x20
> ether 0c:c4:7a:ea:ff:2c  txqueuelen 1000  (Ethernet)
> RX packets 466774  bytes 385578454 (367.7 MiB)
> RX errors 4  dropped 223  overruns 0  frame 3
> TX packets 594975  bytes 580053745 (553.1 MiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
>
>
> On Mon, Nov 20, 2017 at 2:13 PM, Sébastien VIGNERON  criann.fr> wrote:
>
>> As a jumbo frame test, can you try the following?
>>
>> ping -M do -s 8972 -c 4 IP_of_other_node_within_cluster_network
>>
>> If you have « ping: sendto: Message too long », jumbo frames are not
>> activated.
>>
>> Cordialement / Best regards,
>>
>> Sébastien VIGNERON
>> CRIANN,
>> Ingénieur / Engineer
>> Technopôle du Madrillet
>> 745, avenue de l'Université
>> 
>>
>> 76800 Saint-Etienne du Rouvray - France
>> 
>>
>> tél. +33 2 32 91 42 91 <+33%202%2032%2091%2042%2091>
>> fax. +33 2 32 91 42 92 <+33%202%2032%2091%2042%2092>
>> http://www.criann.fr
>> mailto:sebastien.vigne...@criann.fr 
>> support: supp...@criann.fr
>>
>> Le 20 nov. 2017 à 13:02, Rudi Ahlers  a écrit :
>>
>> We're planning on installing 12X Virtual Machines with some heavy loads.
>>
>> the SSD drives are  INTEL SSDSC2BA400G4
>>
>> The SATA drives are ST8000NM0055-1RM112
>>
>> Please explain your comment, "b) will find a lot of people here who
>> don't approve of it."
>>
>> I don't have access to the switches right now, but they're new so
>> whatever default config ships from factory would be active. Though iperf
>> shows 10.5 GBytes  / 9.02 Gbits/sec throughput.
>>
>> What speeds would you expect?
>> "Though with your setup I would have expected something faster, but NOT
>> the
>> theoretical 600MB/s 4 HDDs will do in sequential writes."
>>
>>
>>
>> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
>> down. Verify and if so fix this and re-test.": how?
>>
>>
>> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer  wrote:
>>
>>> On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
>>>
>>> > Hi,
>>> >
>>> > Can someone please help me, how do I improve performance on our Ceph
>>> cluster?
>>> >
>>> > The hardware in use are as follows:
>>> > 3x SuperMicro servers with the following configuration
>>> > 12Core Dual XEON 2.2Ghz
>>> Faster cores is better for Ceph, IMNSHO.
>>> Though with main storage on HDDs, this will do.
>>>
>>> > 128GB RAM
>>> Overkill for Ceph but I see something else below...
>>>
>>> > 2x 400GB Intel DC SSD drives
>>> Exact model please.
>>>
>>> > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
>>> One hopes that's a non S

[ceph-users] Configuring ceph usage statistics

2017-11-20 Thread Richard Cox
Attempting to set up a proof-of-concept Ceph cluster (3 OSDs, 1 mon node), and 
everything is working as far as radosgw and S3 connectivity goes; however, I can't 
seem to get any usage statistics.

Looking at the documentation this is enabled by default, but just in case it 
isn't, I have

[client.radosgw.gateway]

rgw enable usage log = true
rgw usage log tick interval = 30
rgw usage log flush threshold = 1024
rgw usage max shards = 32
rgw usage max user shards = 1

I read and write to the cluster using a set up demo account; however when I try 
to view the usage stats:

radosgw-admin -uid=demo usage show
{
"entries": [],
"summary": []
}

I'm sure there's something blindingly obvious that I'm missing, but I'm at my 
wits end to what it could be.

Thanks for any assistance!

Richard.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - SSD cluster

2017-11-20 Thread Christian Balzer
On Mon, 20 Nov 2017 15:53:31 +0100 Ansgar Jazdzewski wrote:

> Hi *,
> 
> just one note, because we hit it: take a look at your discard options and
> make sure discard does not run on all OSDs at the same time.
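>
> One way to do that is to drop any "discard" mount option and run a
> staggered fstrim instead, scheduled at a different hour on each node
> (a sketch; the mount point is an example):
>
> fstrim -v /var/lib/ceph/osd/ceph-0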
>
Any SSD that actually _requires_ the use of TRIM/DISCARD to maintain
either speed or endurance I'd consider unfit for Ceph to boot.

Christian
 
> 2017-11-20 6:56 GMT+01:00 M Ranga Swami Reddy :
> > Hello,
> > We plan to use the ceph cluster with all SSDs. Do we have any
> > recommendations for Ceph cluster with Full SSD disks.
> >
> > Thanks
> > Swami
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Christian Balzer
On Mon, 20 Nov 2017 10:35:36 -0800 Chris Taylor wrote:

> On 2017-11-20 3:39 am, Matteo Dacrema wrote:
> > Yes I mean the existing Cluster.
> > SSDs are on a fully separate pool.
> > Cluster is not busy during recovery and deep scrubs but I think it’s
> > better to limit replication in some way when switching to replica 3.
> > 
> > My question is to understand whether I need to set some option parameters
> > to limit the impact of the creation of new objects. I’m also concerned
> > about disks filling up during recovery because of inefficient data
> > balancing.  
> 
> You can try using osd_recovery_sleep to slow down the backfilling so it 
> does not cause the client io to hang.
> 
> ceph tell osd.* injectargs "--osd_recovery_sleep 0.1"
> 

Which is one of the things that is version specific and we don't know the
version yet.

The above will work with Hammer and should again with Luminous, but not so
much with the unified queue bits inbetween. 

Christian

> 
> > 
> > Here osd tree
> > 
> > ID  WEIGHTTYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -10  19.69994 root ssd
> > -11   5.06998 host ceph101
> > 166   0.98999 osd.166   up  1.0  1.0
> > 167   1.0 osd.167   up  1.0  1.0
> > 168   1.0 osd.168   up  1.0  1.0
> > 169   1.07999 osd.169   up  1.0  1.0
> > 170   1.0 osd.170   up  1.0  1.0
> > -12   4.92998 host ceph102
> > 171   0.98000 osd.171   up  1.0  1.0
> > 172   0.92999 osd.172   up  1.0  1.0
> > 173   0.98000 osd.173   up  1.0  1.0
> > 174   1.0 osd.174   up  1.0  1.0
> > 175   1.03999 osd.175   up  1.0  1.0
> > -13   4.69998 host ceph103
> > 176   0.84999 osd.176   up  1.0  1.0
> > 177   0.84999 osd.177   up  1.0  1.0
> > 178   1.0 osd.178   up  1.0  1.0
> > 179   1.0 osd.179   up  1.0  1.0
> > 180   1.0 osd.180   up  1.0  1.0
> > -14   5.0 host ceph104
> > 181   1.0 osd.181   up  1.0  1.0
> > 182   1.0 osd.182   up  1.0  1.0
> > 183   1.0 osd.183   up  1.0  1.0
> > 184   1.0 osd.184   up  1.0  1.0
> > 185   1.0 osd.185   up  1.0  1.0
> >  -1 185.19835 root default
> >  -2  18.39980 host ceph001
> >  63   0.7 osd.63up  1.0  1.0
> >  64   0.7 osd.64up  1.0  1.0
> >  65   0.7 osd.65up  1.0  1.0
> > 146   0.7 osd.146   up  1.0  1.0
> > 147   0.7 osd.147   up  1.0  1.0
> > 148   0.90999 osd.148   up  1.0  1.0
> > 149   0.7 osd.149   up  1.0  1.0
> > 150   0.7 osd.150   up  1.0  1.0
> > 151   0.7 osd.151   up  1.0  1.0
> > 152   0.7 osd.152   up  1.0  1.0
> > 153   0.7 osd.153   up  1.0  1.0
> > 154   0.7 osd.154   up  1.0  1.0
> > 155   0.8 osd.155   up  1.0  1.0
> > 156   0.84999 osd.156   up  1.0  1.0
> > 157   0.7 osd.157   up  1.0  1.0
> > 158   0.7 osd.158   up  1.0  1.0
> > 159   0.84999 osd.159   up  1.0  1.0
> > 160   0.90999 osd.160   up  1.0  1.0
> > 161   0.90999 osd.161   up  1.0  1.0
> > 162   0.90999 osd.162   up  1.0  1.0
> > 163   0.7 osd.163   up  1.0  1.0
> > 164   0.90999 osd.164   up  1.0  1.0
> > 165   0.64999 osd.165   up  1.0  1.0
> >  -3  19.41982 host ceph002
> >  23   0.7 osd.23up  1.0  1.0
> >  24   0.7 osd.24up  1.0  1.0
> >  25   0.90999 osd.25up  1.0  1.0
> >  26   0.5 osd.26up  1.0  1.0
> >  27   0.95000 osd.27up  1.0  1.0
> >  28   0.64999 osd.28up  1.0  1.0
> >  29   0.75000 osd.29up  1.0  1.0
> >  30   0.8 osd.30up  1.0  1.0
> >  31   0.90999 osd.31up  1.0  1.0
> >  32   0.90999 osd.32up  1.0  1.0
> >  33 

Re: [ceph-users] Configuring ceph usage statistics

2017-11-20 Thread Jean-Charles Lopez
Hi Richard,

you need to grant admin ops capabilities to a specific user to be able to query 
the usage stats.

radosgw-admin caps add --caps "usage=*;buckets=*;metadata=*;users=*;zone=*" 
--uid=johndoe

* can be replace with “read”, “read, write” depending on what you want the user 
to be able to do.

[root@ex-sem-1 ~]# radosgw-admin caps add --caps 
"usage=*;buckets=*;metadata=*;users=*;zone=*" --uid=johndoe
{
"user_id": "johndoe",
"display_name": "John Doe",
"email": "j...@redhat.com",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "johndoe",
"access_key": “x",
"secret_key": “y"
}
],
"swift_keys": [],
"caps": [
{
"type": "buckets",
"perm": "*"
},
{
"type": "metadata",
"perm": "*"
},
{
"type": "usage",
"perm": "*"
},
{
"type": "users",
"perm": "*"
},
{
"type": "zone",
"perm": "*"
}
],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}

[root@ex-sem-1 ~]# radosgw-admin usage show --uid=johndoe
{
"entries": [
{
"user": "johndoe",
"buckets": [
{
"bucket": "bucket1",
"time": "2017-11-20 22:00:00.00Z",
"epoch": 1511215200,
"owner": "johndoe",
"categories": [
{
"category": "put_obj",
"bytes_sent": 0,
"bytes_received": 3939,
"ops": 3,
"successful_ops": 3
}
]
}
]
}
],
"summary": [
{
"user": "johndoe",
"categories": [
{
"category": "put_obj",
"bytes_sent": 0,
"bytes_received": 3939,
"ops": 3,
"successful_ops": 3
}
],
"total": {
"bytes_sent": 0,
"bytes_received": 3939,
"ops": 3,
"successful_ops": 3
}
}
]
}

Play with the caps to allow what you feel is necessary.

Note that you also have this to check byte usage

[root@ex-sem-1 ~]# radosgw-admin user stats --uid=johndoe
{
"stats": {
"total_entries": 6,
"total_bytes": 37307,
"total_bytes_rounded": 53248
},
"last_stats_sync": "2017-07-22 22:50:37.572798Z",
"last_stats_update": "2017-11-20 22:56:56.311295Z"
}

Best regards
JC

> On Nov 20, 2017, at 13:30, Richard Cox  wrote:
> 
> Attempting to set up a proof-of-concept Ceph cluster (3 OSDs, 1 mon node), 
> and everything is working as far as radosgw and S3 connectivity goes; however I 
> can’t seem to get any usage statistics.
>  
> Looking at the documentation this is enabled by default, but just in case it 
> isn’t, I have 
>  
> [client.radosgw.gateway]
>  
> rgw enable usage log = true
> rgw usage log tick interval = 30
> rgw usage log flush threshold = 1024
> rgw usage max shards = 32
> rgw usage max user shards = 1
>  
> I read and write to the cluster using a set up demo account; however when I 
> try to view the usage stats:
>  
> radosgw-admin –uid=demo usage show
> {
> "entries": [],
> "summary": []
> }
>  
> I’m sure there’s something blindingly obvious that I’m missing, but I’m at my 
> wits end to what it could be.
>  
> Thanks for any assistance!
>  
> Richard.
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Nigel Williams
On 20 November 2017 at 23:36, Christian Balzer  wrote:
> On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
>> The SATA drives are ST8000NM0055-1RM112
>>
> Note that these (while fast) have an internal flash cache, limiting them to
> something like 0.2 DWPD.
> Probably not an issue with the WAL/DB on the Intels, but something to keep
> in mind.

I had forgotten about the flash-cache hybrid drives. Seagate calls
them SSHD (Solid State Hard Drives) and as Christian highlights they
have several GB of SSD as an on-board cache. I looked at the
specifications for the ST8000NM0055 but I cannot see them listed as
SSHD, rather they seem like the usual Seagate Enterprise hard-drive.

https://www.seagate.com/www-content/product-content/enterprise-hdd-fam/enterprise-capacity-3-5-hdd/constellation-es-4/en-us/docs/ent-capacity-3-5-hdd-8tb-ds1863-2-1510us.pdf

Is there something in the specifications that gives them away as SSHD?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Christian Balzer
On Tue, 21 Nov 2017 10:00:28 +1100 Nigel Williams wrote:

> On 20 November 2017 at 23:36, Christian Balzer  wrote:
> > On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:  
> >> The SATA drives are ST8000NM0055-1RM112
> >>  
> > Note that these (while fast) have an internal flash cache, limiting them to
> > something like 0.2 DWPD.
> > Probably not an issue with the WAL/DB on the Intels, but something to keep
> > in mind.  
> 
> I had forgotten about the flash-cache hybrid drives. Seagate calls
> them SSHD (Solid State Hard Drives) and as Christian highlights they
> have several GB of SSD as an on-board cache. I looked at the
> specifications for the ST8000NM0055 but I cannot see them listed as
> SSHD, rather they seem like the usual Seagate Enterprise hard-drive.
> 
> https://www.seagate.com/www-content/product-content/enterprise-hdd-fam/enterprise-capacity-3-5-hdd/constellation-es-4/en-us/docs/ent-capacity-3-5-hdd-8tb-ds1863-2-1510us.pdf
> 
> Is there something in the specifications that gives them away as SSHD?
> 
The 550TB endurance per year for an 8TB drive and the claim of 30% faster
IOPS would be a dead giveaway, one thinks.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Nigel Williams
On 21 November 2017 at 10:07, Christian Balzer  wrote:
> On Tue, 21 Nov 2017 10:00:28 +1100 Nigel Williams wrote:
>> Is there something in the specifications that gives them away as SSHD?
>>
> The 550TB endurance per year for an 8TB drive and the claim of 30% faster
> IOPS would be a dead giveaway, one thinks.

I just found this other answer:

http://products.wdc.com/library/other/2579-772003.pdf

Hard-drive manufacturers introduced workload specifications because
they better model failure rates than MTTF.

I see the drive has 2MB of NOR-flash for write-caching, what happens
when this wears out?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to improve performance

2017-11-20 Thread Christian Balzer
On Tue, 21 Nov 2017 10:35:57 +1100 Nigel Williams wrote:

> On 21 November 2017 at 10:07, Christian Balzer  wrote:
> > On Tue, 21 Nov 2017 10:00:28 +1100 Nigel Williams wrote:  
> >> Is there something in the specifications that gives them away as SSHD?
> >>  
> > The 550TB endurance per year for an 8TB drive and the claim of 30% faster
> > IOPS would be a dead giveaway, one thinks.  
> 
> I just found this other answer:
> 
> http://products.wdc.com/library/other/2579-772003.pdf
> 
> Hard-drive manufacturers introduced workload specifications because
> they better model failure rates than MTTF.
>
I've heard that before, alas if you exceed 550TB/year, do you void your
warranty then? 
If so, another thing to keep in mind.
 
> I see the drive has 2MB of NOR-flash for write-caching, what happens
> when this wears out?
> 
Should have been 30x up there of course.

As for NOR, it is supposed to be very durable, but if it fails it
definitely means dead drive and lost data, like with all caches really.


Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount failed since failed to load ceph kernel module

2017-11-20 Thread Dai Xiang
On Tue, Nov 14, 2017 at 11:12:47AM +0100, Iban Cabrillo wrote:
> HI,
>You should do something like #ceph osd in osd.${num}:
>But If this is your tree, I do not see any osd available at this moment
> in your cluster, should be something similar to this xesample:
> 
> ID CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
> -1   58.21509 root default
> 
> -2   29.12000 host cephosd01
>  1   hdd  3.64000 osd.1up  1.0 1.0
> ..
> -3   29.09509 host cephosd02
>  0   hdd  3.63689 osd.0up  1.0 1.0
> ..
> 
> Please have a look at the guide:
> http://docs.ceph.com/docs/luminous/rados/deployment/ceph-deploy-osd/


In fact I install Ceph inside Docker. Since Docker doesn't support creating
partitions at runtime, I use `parted` to create them first and then start
the container to run the Ceph create step again. The debug log looks all right;
is there anywhere else I can get more detailed info?
> 
> 
> Regards, I
> 
> 2017-11-14 10:58 GMT+01:00 Dai Xiang :
> 
> > On Tue, Nov 14, 2017 at 10:52:00AM +0100, Iban Cabrillo wrote:
> > > Hi Dai Xiang,
> > >   There is no OSD available at this moment in your cluste, then you can't
> > > read/write or mount anything, maybe the osds are configured but they are
> > > out, please could you paste the "#ceph osd tree " command
> > > to see your osd status ?
> >
> > ID CLASS WEIGHT TYPE NAMESTATUS REWEIGHT PRI-AFF
> > -10 root default
> >
> > It is out indeed, but I really do not know how to fix it.
> >
> > --
> > Best Regards
> > Dai Xiang
> > >
> > > Regards, I
> > >
> > >
> > > 2017-11-14 10:39 GMT+01:00 Dai Xiang :
> > >
> > > > On Tue, Nov 14, 2017 at 09:21:56AM +, Linh Vu wrote:
> > > > > Odd, you only got 2 mons and 0 osds? Your cluster build looks
> > incomplete.
> > > >
> > > > But from the log, osd seems normal:
> > > > [172.17.0.4][INFO  ] checking OSD status...
> > > > [172.17.0.4][DEBUG ] find the location of an executable
> > > > [172.17.0.4][INFO  ] Running command: /bin/ceph --cluster=ceph osd stat
> > > > --format=json
> > > > [ceph_deploy.osd][DEBUG ] Host 172.17.0.4 is now ready for osd use.
> > > > ...
> > > >
> > > > [172.17.0.5][INFO  ] Running command: systemctl enable ceph.target
> > > > [172.17.0.5][INFO  ] checking OSD status...
> > > > [172.17.0.5][DEBUG ] find the location of an executable
> > > > [172.17.0.5][INFO  ] Running command: /bin/ceph --cluster=ceph osd stat
> > > > --format=json
> > > > [ceph_deploy.osd][DEBUG ] Host 172.17.0.5 is now ready for osd use.
> > > >
> > > > --
> > > > Best Regards
> > > > Dai Xiang
> > > > >
> > > > > Get Outlook for Android
> > > > >
> > > > > 
> > > > > From: Dai Xiang 
> > > > > Sent: Tuesday, November 14, 2017 6:12:27 PM
> > > > > To: Linh Vu
> > > > > Cc: ceph-users@lists.ceph.com
> > > > > Subject: Re: mount failed since failed to load ceph kernel module
> > > > >
> > > > > On Tue, Nov 14, 2017 at 02:24:06AM +, Linh Vu wrote:
> > > > > > Your kernel is way too old for CephFS Luminous. I'd use one of the
> > > > newer kernels from http://elrepo.org. :) We're on 4.12 here on RHEL
> > 7.4.
> > > > >
> > > > > I had updated kernel version to newest:
> > > > > [root@d32f3a7b6eb8 ~]$ uname -a
> > > > > Linux d32f3a7b6eb8 4.14.0-1.el7.elrepo.x86_64 #1 SMP Sun Nov 12
> > 20:21:04
> > > > EST 2017 x86_64 x86_64 x86_64 GNU/Linux
> > > > > [root@d32f3a7b6eb8 ~]$ cat /etc/redhat-release
> > > > > CentOS Linux release 7.2.1511 (Core)
> > > > >
> > > > > But still failed:
> > > > > [root@d32f3a7b6eb8 ~]$ /bin/mount 172.17.0.4,172.17.0.5:/ /cephfs -t
> > > > ceph -o name=admin,secretfile=/etc/ceph/admin.secret -v
> > > > > failed to load ceph kernel module (1)
> > > > > parsing options: rw,name=admin,secretfile=/etc/ceph/admin.secret
> > > > > mount error 2 = No such file or directory
> > > > > [root@d32f3a7b6eb8 ~]$ ll /cephfs
> > > > > total 0
> > > > >
> > > > > [root@d32f3a7b6eb8 ~]$ ceph -s
> > > > >   cluster:
> > > > > id: a5f1d744-35eb-4e1b-a7c7-cb9871ec559d
> > > > > health: HEALTH_WARN
> > > > > Reduced data availability: 128 pgs inactive
> > > > > Degraded data redundancy: 128 pgs unclean
> > > > >
> > > > >   services:
> > > > > mon: 2 daemons, quorum d32f3a7b6eb8,1d22f2d81028
> > > > > mgr: d32f3a7b6eb8(active), standbys: 1d22f2d81028
> > > > > mds: cephfs-1/1/1 up  {0=1d22f2d81028=up:creating}, 1 up:standby
> > > > > osd: 0 osds: 0 up, 0 in
> > > > >
> > > > >   data:
> > > > > pools:   2 pools, 128 pgs
> > > > > objects: 0 objects, 0 bytes
> > > > > usage:   0 kB used, 0 kB / 0 kB avail
> > > > > pgs: 100.000% pgs unknown
> > > > >  128 unknown
> > > > >
> > > > > [root@d32f3a7b6eb8 ~]$ lsmod | grep ceph
> > > > > ceph  372736  0
> > > > > libceph   315392  1 ceph
> > > > > fscache65536  3 ceph,nfsv4,nfs
> > > > > libcrc32c  163

[ceph-users] rbd: list: (1) Operation not permitted

2017-11-20 Thread Manuel Sopena Ballesteros
Hi all,

I just built a small ceph cluster for Openstack but I am getting a permission 
problem:

[root@zeus-59 ~]# ceph auth list
installed auth entries:
...
client.cinder
key: AQCvaBNawgsXAxAA18S90LWPLIiZ4tCY0Boa/w==
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rdb_children, allow rwx 
pool=volumes,
...
client.glance
key: AQCiaBNaTDOCJxAArUEI6cuqLmiF2TqictGAEA==
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rdb_children, allow rwx 
pool=images
...

[root@zeus-59 ~]# cat /etc/ceph/ceph.client.cinder.keyring
[client.cinder]
key = AQCvaBNawgsXAxAA18S90LWPLIiZ4tCY0Boa/w==

[root@zeus-59 ~]# rbd -p volumes --user cinder ls
rbd: list: (1) Operation not permitted


[root@zeus-59 ~]# rbd -p images --user glance ls
15b87aeb-6482-403d-825b-e7c7bc007679
e972681b-3028-4b44-84c7-3752a93d5518
fc6dd1dc-fe11-4bdd-96f4-28276ecb75c0

I also tried deleting and recreating user and pool but that didn't fix the 
issue.

Ceph looks ok because user glance can list images pool, but I am not sure why 
user cinder doesn't have permission as they both have same permissions to their 
respective pools?

Any advice?

Thank you very much

Manuel Sopena Ballesteros | Big data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: 
manuel...@garvan.org.au

NOTICE
Please consider the environment before printing this email. This message and 
any attachments are intended for the addressee named and may contain legally 
privileged/confidential/copyright information. If you are not the intended 
recipient, you should not read, use, disclose, copy or distribute this 
communication. If you have received this message in error please notify us at 
once by return email and then delete both messages. We accept no liability for 
the distribution of viruses or similar in electronic communications. This 
notice should not be removed.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: list: (1) Operation not permitted

2017-11-20 Thread Manuel Sopena Ballesteros
Ok, I got it working. I got help from the IRC channel; my problem was a typo in 
the command managing the caps for the user.
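
For reference, a working caps line for cinder looks roughly like this (a
sketch; mind the exact spelling of rbd_children and make sure nothing
trails the pool name):

ceph auth caps client.cinder mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes'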

Manuel

From: Manuel Sopena Ballesteros
Sent: Tuesday, November 21, 2017 2:56 PM
To: ceph-users@lists.ceph.com
Subject: rbd: list: (1) Operation not permitted

Hi all,

I just built a small ceph cluster for Openstack but I am getting a permission 
problem:

[root@zeus-59 ~]# ceph auth list
installed auth entries:
...
client.cinder
key: AQCvaBNawgsXAxAA18S90LWPLIiZ4tCY0Boa/w==
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rdb_children, allow rwx 
pool=volumes,
...
client.glance
key: AQCiaBNaTDOCJxAArUEI6cuqLmiF2TqictGAEA==
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rdb_children, allow rwx 
pool=images
...

[root@zeus-59 ~]# cat /etc/ceph/ceph.client.cinder.keyring
[client.cinder]
key = AQCvaBNawgsXAxAA18S90LWPLIiZ4tCY0Boa/w==

[root@zeus-59 ~]# rbd -p volumes --user cinder ls
rbd: list: (1) Operation not permitted


[root@zeus-59 ~]# rbd -p images --user glance ls
15b87aeb-6482-403d-825b-e7c7bc007679
e972681b-3028-4b44-84c7-3752a93d5518
fc6dd1dc-fe11-4bdd-96f4-28276ecb75c0

I also tried deleting and recreating user and pool but that didn't fix the 
issue.

Ceph looks ok because user glance can list images pool, but I am not sure why 
user cinder doesn't have permission as they both have same permissions to their 
respective pools?

Any advice?

Thank you very much

Manuel Sopena Ballesteros | Big data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: 
manuel...@garvan.org.au

NOTICE
Please consider the environment before printing this email. This message and 
any attachments are intended for the addressee named and may contain legally 
privileged/confidential/copyright information. If you are not the intended 
recipient, you should not read, use, disclose, copy or distribute this 
communication. If you have received this message in error please notify us at 
once by return email and then delete both messages. We accept no liability for 
the distribution of viruses or similar in electronic communications. This 
notice should not be removed.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD is near full and slow in accessing storage from client

2017-11-20 Thread gjprabu
Hi David,



   This is our current status.





~]# ceph status

cluster b466e09c-f7ae-4e89-99a7-99d30eba0a13

 health HEALTH_WARN

mds0: Client integ-hm3 failing to respond to cache pressure

mds0: Client integ-hm9-bkp failing to respond to cache pressure

mds0: Client me-build1-bkp failing to respond to cache pressure

 monmap e2: 3 mons at 
{intcfs-mon1=192.168.113.113:6789/0,intcfs-mon2=192.168.113.114:6789/0,intcfs-mon3=192.168.113.72:6789/0}

election epoch 16, quorum 0,1,2 intcfs-mon3,intcfs-mon1,intcfs-mon2

  fsmap e177798: 1/1/1 up {0=intcfs-osd1=up:active}, 1 up:standby

 osdmap e4388: 8 osds: 8 up, 8 in

flags sortbitwise

  pgmap v24129785: 564 pgs, 3 pools, 6885 GB data, 17138 kobjects

14023 GB used, 12734 GB / 26757 GB avail

 560 active+clean

   3 active+clean+scrubbing

   1 active+clean+scrubbing+deep

  client io 47187 kB/s rd, 965 kB/s wr, 125 op/s rd, 525 op/s wr



]# ceph df

GLOBAL:

SIZE   AVAIL  RAW USED %RAW USED

26757G 12735G   14022G 52.41

POOLS:

NAME   ID USED   %USED MAX AVAIL OBJECTS 

rbd0   0 0 3787G0

downloads_data 3   6885G 51.46 3787G 16047944

downloads_metadata 4  84773k 0 3787G  1501805





Regards

Prabu GJ


 On Mon, 20 Nov 2017 21:35:17 +0530 David Turner 
 wrote 




What is your current `ceph status` and `ceph df`? The status of your cluster 
has likely changed a bit in the last week.



On Mon, Nov 20, 2017 at 6:00 AM gjprabu  wrote:



Hi David,



Sorry for the late reply. The OSD sync has completed, yet the fourth 
OSD's available space keeps shrinking. Is there any option to check 
or fix this?









ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
0 3.29749  1.0  3376G  2320G  1056G 68.71 1.10 144
1 3.26869  1.0  3347G  1871G  1475G 55.92 0.89 134
2 3.27339  1.0  3351G  1699G  1652G 50.69 0.81 134
3 3.24089  1.0  3318G  1865G  1452G 56.22 0.90 142
4 3.24089  1.0  3318G  2839G   478G 85.57 1.37 158
5 3.32669  1.0  3406G  2249G  1156G 66.04 1.06 136
6 3.27800  1.0  3356G  1924G  1432G 57.33 0.92 139
7 3.20470  1.0  3281G  1949G  1331G 59.42 0.95 141
  TOTAL 26757G 16720G 10037G 62.49
MIN/MAX VAR: 0.81/1.37  STDDEV: 10.26





Regards

Prabu GJ







 On Mon, 13 Nov 2017 00:27:47 +0530 David Turner 
 wrote 









You cannot reduce the PG count for a pool.  So there isn't anything you can 
really do for this unless you create a new FS with better PG counts and migrate 
your data into it.

The problem with having more PGs than you need is in the memory footprint for 
the osd daemon. There are warning thresholds for having too many PGs per osd.  
Also in future expansions, if you need to add pools, you might not be able to 
create the pools with the proper amount of PGs due to older pools that have way 
too many PGs.

It would still be nice to see the output from those commands I asked about.

The built-in reweighting scripts might help your data distribution.  
reweight-by-utilization



On Sun, Nov 12, 2017, 11:41 AM gjprabu  wrote:







Hi David,



Thanks for your valuable reply. Once the backfilling for the new OSD is complete, 
we will consider increasing the replica value as soon as possible. Is it possible 
to decrease the metadata PG count? If the metadata PG count is the same as the 
data PG count, what kind of issues may occur?



Regards

PrabuGJ






 On Sun, 12 Nov 2017 21:25:05 +0530 David 
Turner wrote 





What's the output of `ceph df` to see if your PG counts are good or not?  Like 
everyone else has said, the space on the original osds can't be expected to 
free up until the backfill from adding the new osd has finished.

You don't have anything in your cluster health to indicate that your cluster 
will not be able to finish this backfilling operation on its own.

You might find this URL helpful in calculating your PG counts. 
http://ceph.com/pgcalc/  As a side note. It is generally better to keep your PG 
counts as base 2 numbers (16, 64, 256, etc). When you do not have a base 2 
number then some of your PGs will take up twice as much space as others. In 
your case with 250, you have 244 PGs that are the same size and 6 PGs that are 
twice the size of those 244 PGs.  Bumping that up to 256 will even things out.
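As an illustration of that last point, bumping a 250-PG pool to 256 is a two-step
change (the pool name below is only an example, and remember pg_num can only ever be
increased, never decreased):

ceph osd pool set downloads_data pg_num 256
ceph osd pool set downloads_data pgp_num 256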

Assuming that the metadata pool is for a CephFS volume, you do not need nearly 
so many PGs for that pool. Also, I would recommend changing at least the 
metadata pool to 3 replica_size. If we can talk you into 3 replica for
everything else, great! But if not, at least do it for the metadata pool.
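A minimal sketch of the size change being suggested for the metadata pool, using the
pool name from the ceph df output earlier in the thread; min_size is a companion
setting worth reviewing at the same time, not something prescribed above:

ceph osd pool set downloads_metadata size 3
ceph osd pool set downloads_metadata min_size 2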

Re: [ceph-users] how to improve performance

2017-11-20 Thread Rudi Ahlers
On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer  wrote:

> On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
>
> > We're planning on installing 12X Virtual Machines with some heavy loads.
> >
> > the SSD drives are  INTEL SSDSC2BA400G4
> >
> Interesting, where did you find those?
> Or did you have them lying around?
>
> I've been unable to get DC S3710 SSDs for nearly a year now.
>

In South Africa, one of our suppliers had some in stock. They're still
fairly new, about 2 months old now.




> > The SATA drives are ST8000NM0055-1RM112
> >
> Note that these (while fast) have an internal flash cache, limiting them to
> something like 0.2 DWPD.
> Probably not an issue with the WAL/DB on the Intels, but something to keep
> in mind.
>


I don't quite understand what you mean. Could you please explain?



> > Please explain your comment, "b) will find a lot of people here who don't
> > approve of it."
> >
> Read the archives.
> Converged clusters are complex and debugging Ceph when tons of other
> things are going on at the same time on the machine even more so.
>


Ok, so I have 4 physical servers and need to set up a highly redundant
cluster. How else would you have done it? There is no budget for a SAN, let
alone a highly available SAN.



>
> > I don't have access to the switches right now, but they're new so whatever
> > default config ships from factory would be active. Though iperf shows 10.5
> > GBytes / 9.02 Gbits/sec throughput.
> >
> Didn't think it was the switches, but completeness sake and all that.
>
> > What speeds would you expect?
> > "Though with your setup I would have expected something faster, but NOT the
> > theoretical 600MB/s 4 HDDs will do in sequential writes."
> >
> What I wrote.
> A 7200RPM HDD, even these, can not sustain writes much over 170MB/s, in
> the most optimal circumstances.
> So your cluster can NOT exceed about 600MB/s sustained writes with the
> effective bandwidth of 4 HDDs.
> Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
> HDDs of course can and will be faster.
>
> But again, you're missing the point, even if you get 600MB/s writes out of
> your cluster, the number of 4k IOPS will be much more relevant to your VMs.
>
>
hdparm shows about 230MB/s:

root@virt2:~# hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   20250 MB in  2.00 seconds = 10134.81 MB/sec
 Timing buffered disk reads: 680 MB in  3.00 seconds = 226.50 MB/sec



600MB/s would be super nice, but in reality even 400MB/s would be nice.
Would it not be achievable?
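For reference, a back-of-envelope version of the ceiling described above, assuming
the 3 hosts x 4 HDDs from the original post, roughly 170 MB/s sustained per drive,
and a 3x replicated pool:

# 12 drives x ~170 MB/s of raw sequential write, divided by 3 replicas
echo $((12 * 170 / 3))   # ~680 MB/s theoretical; ~500-600 MB/s after overhead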



> >
> >
> > On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> > down. Verify and if so fix this and re-test.": how?
> >
> No idea, I don't do bluestore.
> You noticed the lack of a WAL/DB for sda, go and fix it.
> If in doubt, destroy and re-create it.
>
> And if you're looking for a less invasive procedure, docs and the ML
> archive, but AFAIK there is nothing but re-creation at this time.
>


I use Proxmox, which set up a DB device but not a separate WAL device.
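If only a DB device was created, that is normally fine: BlueStore keeps the WAL on
the DB device when no dedicated WAL partition is given. A quick way to see what each
OSD is actually using (standard BlueStore symlink paths; adjust if your OSD
directories differ):

ls -l /var/lib/ceph/osd/ceph-*/block*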




> Christian
> >
> > On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer  wrote:
> >
> > > On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> > >
> > > > Hi,
> > > >
> > > > Can someone please help me: how do I improve performance on our Ceph
> > > > cluster?
> > > >
> > > > The hardware in use are as follows:
> > > > 3x SuperMicro servers with the following configuration
> > > > 12Core Dual XEON 2.2Ghz
> > > Faster cores are better for Ceph, IMNSHO.
> > > Though with main storage on HDDs, this will do.
> > >
> > > > 128GB RAM
> > > Overkill for Ceph but I see something else below...
> > >
> > > > 2x 400GB Intel DC SSD drives
> > > Exact model please.
> > >
> > > > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> > > One hopes that's a non-SMR one.
> > > Model please.
> > >
> > > > 1x SuperMicro DOM for Proxmox / Debian OS
> > > Ah, Proxmox.
> > > I'm personally not averse to converged, high density, multi-role clusters
> > > myself, but you:
> > > a) need to know what you're doing and
> > > b) will find a lot of people here who don't approve of it.
> > >
> > > I've avoided DOMs so far (non-hotswappable SPOF), even though the SM ones
> > > look good on paper with regards to endurance and IOPS.
> > > The latter being rather important for your monitors.
> > >
> > > > 4x Port 10Gbe NIC
> > > > Cisco 10Gbe switch.
> > > >
> > > Configuration would be nice for those, LACP?
> > >
> > > >
> > > > root@virt2:~# rados bench -p Data 10 write --no-cleanup
> > > > hints = 1
> > > > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > > > 4194304 for   up to 10 seconds or 0 objects
> > >
> > > rados bench is a limited tool, and measuring bandwidth is pointless in
> > > nearly all use cases.
> > > Latency is where it is at and testing from inside a VM is more relevant
> > > than synthetic tests of the storage.
> > > But it is a start.
> > >
> > > > Object prefix: benchmark_data_virt2_39099
> > > >   sec Cur ops  
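As an illustration of the latency-first testing being advocated above, a
queue-depth-1 4k write test run from inside a guest says far more about VM
experience than rados bench bandwidth; the file name and size below are placeholders:

fio --name=qd1-write --filename=/root/fio.test --size=1G --runtime=60 --time_based --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=1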

Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Matteo Dacrema
Ok, thank you guys

The version is 10.2.10

Matteo

> Il giorno 20 nov 2017, alle ore 23:15, Christian Balzer  ha 
> scritto:
> 
> On Mon, 20 Nov 2017 10:35:36 -0800 Chris Taylor wrote:
> 
>> On 2017-11-20 3:39 am, Matteo Dacrema wrote:
>>> Yes I mean the existing Cluster.
>>> SSDs are on a fully separate pool.
>>> Cluster is not busy during recovery and deep scrubs but I think it’s
>>> better to limit replication in some way when switching to replica 3.
>>> 
>>> My question is whether I need to set some option parameters to limit the
>>> impact of creating the new objects. I'm also concerned about disks filling
>>> up during recovery because of inefficient data balancing.
>> 
>> You can try using osd_recovery_sleep to slow down the backfilling so it 
>> does not cause the client io to hang.
>> 
>> ceph tell osd.* injectargs "--osd_recovery_sleep 0.1"
>> 
> 
> Which is one of the things that is version specific and we don't know the
> version yet.
> 
> The above will work with Hammer and should again with Luminous, but not so
> much with the unified queue bits in between.
> 
> Christian
> 
>> 
>>> 
>>> Here osd tree
>>> 
>>> ID  WEIGHTTYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> -10  19.69994 root ssd
>>> -11   5.06998 host ceph101
>>> 166   0.98999 osd.166   up  1.0  1.0
>>> 167   1.0 osd.167   up  1.0  1.0
>>> 168   1.0 osd.168   up  1.0  1.0
>>> 169   1.07999 osd.169   up  1.0  1.0
>>> 170   1.0 osd.170   up  1.0  1.0
>>> -12   4.92998 host ceph102
>>> 171   0.98000 osd.171   up  1.0  1.0
>>> 172   0.92999 osd.172   up  1.0  1.0
>>> 173   0.98000 osd.173   up  1.0  1.0
>>> 174   1.0 osd.174   up  1.0  1.0
>>> 175   1.03999 osd.175   up  1.0  1.0
>>> -13   4.69998 host ceph103
>>> 176   0.84999 osd.176   up  1.0  1.0
>>> 177   0.84999 osd.177   up  1.0  1.0
>>> 178   1.0 osd.178   up  1.0  1.0
>>> 179   1.0 osd.179   up  1.0  1.0
>>> 180   1.0 osd.180   up  1.0  1.0
>>> -14   5.0 host ceph104
>>> 181   1.0 osd.181   up  1.0  1.0
>>> 182   1.0 osd.182   up  1.0  1.0
>>> 183   1.0 osd.183   up  1.0  1.0
>>> 184   1.0 osd.184   up  1.0  1.0
>>> 185   1.0 osd.185   up  1.0  1.0
>>> -1 185.19835 root default
>>> -2  18.39980 host ceph001
>>> 63   0.7 osd.63up  1.0  1.0
>>> 64   0.7 osd.64up  1.0  1.0
>>> 65   0.7 osd.65up  1.0  1.0
>>> 146   0.7 osd.146   up  1.0  1.0
>>> 147   0.7 osd.147   up  1.0  1.0
>>> 148   0.90999 osd.148   up  1.0  1.0
>>> 149   0.7 osd.149   up  1.0  1.0
>>> 150   0.7 osd.150   up  1.0  1.0
>>> 151   0.7 osd.151   up  1.0  1.0
>>> 152   0.7 osd.152   up  1.0  1.0
>>> 153   0.7 osd.153   up  1.0  1.0
>>> 154   0.7 osd.154   up  1.0  1.0
>>> 155   0.8 osd.155   up  1.0  1.0
>>> 156   0.84999 osd.156   up  1.0  1.0
>>> 157   0.7 osd.157   up  1.0  1.0
>>> 158   0.7 osd.158   up  1.0  1.0
>>> 159   0.84999 osd.159   up  1.0  1.0
>>> 160   0.90999 osd.160   up  1.0  1.0
>>> 161   0.90999 osd.161   up  1.0  1.0
>>> 162   0.90999 osd.162   up  1.0  1.0
>>> 163   0.7 osd.163   up  1.0  1.0
>>> 164   0.90999 osd.164   up  1.0  1.0
>>> 165   0.64999 osd.165   up  1.0  1.0
>>> -3  19.41982 host ceph002
>>> 23   0.7 osd.23up  1.0  1.0
>>> 24   0.7 osd.24up  1.0  1.0
>>> 25   0.90999 osd.25up  1.0  1.0
>>> 26   0.5 osd.26up  1.0  1.0
>>> 27   0.95000 osd.27up  1.0  1.0
>>> 28   0.64999 osd.28up  1.0  1.0
>>> 29   0.75000 osd.29up  1.0  1.0
>>> 30   0.8 osd.30up  1.0  1.0
>