Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread Ronny Aasen

On 21. sep. 2017 00:35, hjcho616 wrote:

# rados list-inconsistent-pg data
["0.0","0.5","0.a","0.e","0.1c","0.29","0.2c"]
# rados list-inconsistent-pg metadata
["1.d","1.3d"]
# rados list-inconsistent-pg rbd
["2.7"]
# rados list-inconsistent-obj 0.0 --format=json-pretty
{
 "epoch": 23112,
 "inconsistents": []
}
# rados list-inconsistent-obj 0.5 --format=json-pretty
{
 "epoch": 23078,
 "inconsistents": []
}
# rados list-inconsistent-obj 0.a --format=json-pretty
{
 "epoch": 22954,
 "inconsistents": []
}
# rados list-inconsistent-obj 0.e --format=json-pretty
{
 "epoch": 23068,
 "inconsistents": []
}
# rados list-inconsistent-obj 0.1c --format=json-pretty
{
 "epoch": 22954,
 "inconsistents": []
}
# rados list-inconsistent-obj 0.29 --format=json-pretty
{
 "epoch": 22974,
 "inconsistents": []
}
# rados list-inconsistent-obj 0.2c --format=json-pretty
{
 "epoch": 23194,
 "inconsistents": []
}
# rados list-inconsistent-obj 1.d --format=json-pretty
{
 "epoch": 23072,
 "inconsistents": []
}
# rados list-inconsistent-obj 1.3d --format=json-pretty
{
 "epoch": 23221,
 "inconsistents": []
}
# rados list-inconsistent-obj 2.7 --format=json-pretty
{
 "epoch": 23032,
 "inconsistents": []
}

Looks like not much information is there.  Could you elaborate on the 
items you mentioned under "find the object"?  How do I check the metadata?  
What are we looking for with md5sum?


- find the object  :: manually check the objects, check the object 
metadata, run md5sum on them all and compare. check objects on the 
nonrunning osd's and compare there as well. anything to try to determine 
what object is ok and what is bad.


I tried the methods from "Ceph: manually repair object" 
(http://ceph.com/geen-categorie/ceph-manually-repair-object/) on 
PG 2.7 before..  Tried the 3-replica case, which would result in a shard 
missing regardless of which one I moved.  For the 2-replica case, hmm... I 
guess I don't know how long "wait a bit" is; I just turned it back on 
after a minute or so, and it just returns to the same inconsistent message.. 
=P  Are we looking for the entire stopped OSD to map to a different OSD and 
get 3 replicas when starting the stopped OSD again?


Regards,
Hong



since your list-inconsistent-obj output is empty, you need to up debugging on 
all osd's and grep the logs to find the objects with issues. this is 
explained in the link.  ceph pg map [pg]  tells you what osd's to look 
at, and the log will have hints to the reason for the error. keep in 
mind that it can be a while since the scrub errored out, so you may need 
to look at older logs, or trigger a scrub and wait for it to finish so 
you can check the current log.
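for example, a rough sketch of that log hunt (assuming PG 0.29 and osd.6/osd.7 from the pg map; adjust ids and log paths to your setup):

# which osds hold the pg
ceph pg map 0.29

# raise logging on the acting osds, then trigger a fresh deep scrub
ceph tell osd.6 injectargs '--debug_osd 20'
ceph tell osd.7 injectargs '--debug_osd 20'
ceph pg deep-scrub 0.29

# once it finishes, look for the objects the scrub complained about
grep -E '0\.29 .*(shard|soid|deep-scrub|repair)' /var/log/ceph/ceph-osd.6.log

# and turn logging back down afterwards
ceph tell osd.6 injectargs '--debug_osd 1/5'
ceph tell osd.7 injectargs '--debug_osd 1/5'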


once you have the object names you can find them with the find command.

after removing/fixing the broken object, and restarting the osd, you issue 
the repair, and wait for the repair and scrub of that pg to finish. you 
can probably follow along by tailing the log.
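roughly, that step looks like this (only a sketch: osd.6 and the object name are examples taken from your scrub log, and keep a copy of anything you remove):

# on the node that holds the bad replica
systemctl stop ceph-osd@6

# locate the object's file under the pg directory and move it out of the way
find /var/lib/ceph/osd/ceph-6/current/0.29_head -name '*200014ce4c3.028f*'
mkdir -p /root/removed-objects
mv /var/lib/ceph/osd/ceph-6/current/0.29_head/<file-found-above> /root/removed-objects/

systemctl start ceph-osd@6
ceph pg repair 0.29

# follow the repair/scrub in the log
tail -f /var/log/ceph/ceph-osd.6.log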


good luck


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-20 Thread Lazuardi Nasution
Hi,

I'm still looking for the answers to these questions. Maybe someone can
share their thoughts on them. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution 
wrote:

> Hi,
>
> 1. Is it possible to configure osd_data not as a small partition on the OSD but
> as a folder (e.g. on the root disk)? If yes, how do I do that with ceph-disk, and
> what are the pros/cons of doing so?
> 2. Is the WAL & DB size calculated based on OSD size or on expected throughput,
> like the journal device of filestore? If not, what is the default value and what
> are the pros/cons of adjusting it?
> 3. Does partition alignment matter on Bluestore, including for the WAL & DB when
> using a separate device for them?
>
> Best regards,
>


Re: [ceph-users] OSD assert hit suicide timeout

2017-09-20 Thread Brad Hubbard
Start gathering something akin to what sysstat gathers (there are of course
numerous options for this) and go over it carefully. Tools like perf or
oprofile can also provide important clues.

This "looks" like a network issue or some sort of resource shortage.
Comprehensive monitoring and gathering of stats from the period that
coincides with an "event" is definitely where I'd start as well as looking
at ceph data such as dump_historic_ops, dump_ops_in_flight and trying to
establish any commonality or indication of where to focus.

There's a trail there, we just need to find it.
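For reference, those are admin socket commands run on the node hosting the OSD (osd.12 is just an example id):

ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops

# dump_historic_ops keeps the slowest recent ops with per-step timestamps
# (queued_for_pg, reached_pg, commit_sent, ...), which helps show whether the
# wait is local (disk) or on a remote shard (network / peer osd)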

On Thu, Sep 21, 2017 at 7:48 AM, Jordan Share 
wrote:

> We do have our nf_conntrack_max set to 1048576, but our count is actually
> only between 65k and 101k, so your 256k is probably fine.  You'd have
> noticed in dmesg anyway, I figure, but I thought I'd mention it.
>
> I am out of ideas. :)
>
> Jordan
>
> On Wed, Sep 20, 2017 at 1:28 PM, David Turner 
> wrote:
>
>> We have been running with the following settings for months for pid_max
>> and conntrack_max.  Are these at values you would deem too low?
>>   kernel.pid_max = 4194303
>>   net.nf_conntrack_max = 262144
>>
>>
>> I went back in the logs to see where the slow requests and timeouts
>> started.  It seems to have come out of nowhere.  I included the log for 8
>> minutes prior to the osd op thread time outs and 8.5 minutes before the
>> first slow request message.  I don't see any indicators at all that it was
>> scrubbing or anything.  I'm also not seeing any SMART errors on the drives.
>>
>> 2017-09-19 01:14:25.539030 7f134326a700  1 leveldb: Delete type=2 #125088
>>
>> 2017-09-19 01:16:25.544797 7f134326a700  1 leveldb: Level-0 table
>> #125303: started
>> 2017-09-19 01:16:25.571103 7f134326a700  1 leveldb: Level-0 table
>> #125303: 1509406 bytes OK
>> 2017-09-19 01:16:25.572424 7f134326a700  1 leveldb: Delete type=0 #125277
>>
>> 2017-09-19 01:18:14.329887 7f134326a700  1 leveldb: Level-0 table
>> #125305: started
>> 2017-09-19 01:18:14.359580 7f134326a700  1 leveldb: Level-0 table
>> #125305: 1503050 bytes OK
>> 2017-09-19 01:18:14.361107 7f134326a700  1 leveldb: Delete type=0 #125302
>>
>> 2017-09-19 01:20:13.239215 7f134326a700  1 leveldb: Level-0 table
>> #125307: started
>> 2017-09-19 01:20:13.268553 7f134326a700  1 leveldb: Level-0 table
>> #125307: 1500713 bytes OK
>> 2017-09-19 01:20:13.270018 7f134326a700  1 leveldb: Delete type=0 #125304
>>
>> 2017-09-19 01:22:24.023066 7f13c142b700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
>> 2017-09-19 01:22:24.023081 7f13c142b700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>>
>> --- this lines repeated thousands of times before we see the first slow
>> request message ---
>>
>> 2017-09-19 01:22:37.770750 7f134ddac700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
>> 2017-09-19 01:22:37.770751 7f134ddac700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
>> 2017-09-19 01:22:37.772928 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : 25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
>> 2017-09-19 01:22:37.772946 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.726969 seconds old, received at 2017-09-19
>> 01:22:07.045866: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263,
>> reqid=client.12636870.0:480468942, at_version=48111'1958916,
>> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
>> 2017-09-19 01:22:37.772957 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.668230 seconds old, received at 2017-09-19
>> 01:22:07.104606: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264,
>> reqid=client.12636870.0:480468943, at_version=48111'1958917,
>> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
>> 2017-09-19 01:22:37.772984 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.057066 seconds old, received at 2017-09-19
>> 01:22:07.715770: MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867,
>> reqid=client.18422767.0:328744332, at_version=48111'1953782,
>> trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
>> 2017-09-19 01:22:37.772989 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.664709 seconds old, received at 2017-09-19
>> 01:22:07.108126: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265,
>> reqid=client.12636870.0:480468944, at_version=48111'1958918,
>> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
>> 2017-09-19 01:22:37.773000 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.657596 seconds old, received at 2017-09-19
>> 01:22:07.115240: MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858,
>> reqid=client.24892528.0:35633485, at_version=48111'1958524,
>> trim_to=48066'1955458, trim_rollback_to=48111'19

Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-20 Thread Alejandro Comisario
But for example, on the same server I have 3 disk technologies to deploy
pools on: SSD, SAS and SATA.
The NVMe devices were bought with just the journals for SATA and SAS in mind,
since the journals for the SSDs were colocated.

But now, in exactly the same scenario, should I trust the NVMe for the SSD
pool?  Is there that much of a gain versus colocating block.* on the
same SSD?

best.

On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams 
wrote:

> On 21 September 2017 at 04:53, Maximiliano Venesio 
> wrote:
>
>> Hi guys i'm reading different documents about bluestore, and it never
>> recommends to use NVRAM to store the bluefs db, nevertheless the official
>> documentation says that, is better to use the faster device to put the
>> block.db in.
>>
>
> Likely not mentioned since no one yet has had the opportunity to test it.
>
> So how do i have to deploy using bluestore, regarding where i should put
>> block.wal and block.db ?
>>
>
> block.* would be best on your NVRAM device, like this:
>
> ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
> /dev/nvme0n1 --block-db /dev/nvme0n1
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com


Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
# rados list-inconsistent-pg data
["0.0","0.5","0.a","0.e","0.1c","0.29","0.2c"]
# rados list-inconsistent-pg metadata
["1.d","1.3d"]
# rados list-inconsistent-pg rbd
["2.7"]
# rados list-inconsistent-obj 0.0 --format=json-pretty
{ "epoch": 23112, "inconsistents": [] }
# rados list-inconsistent-obj 0.5 --format=json-pretty
{ "epoch": 23078, "inconsistents": [] }
# rados list-inconsistent-obj 0.a --format=json-pretty
{ "epoch": 22954, "inconsistents": [] }
# rados list-inconsistent-obj 0.e --format=json-pretty
{ "epoch": 23068, "inconsistents": [] }
# rados list-inconsistent-obj 0.1c --format=json-pretty
{ "epoch": 22954, "inconsistents": [] }
# rados list-inconsistent-obj 0.29 --format=json-pretty
{ "epoch": 22974, "inconsistents": [] }
# rados list-inconsistent-obj 0.2c --format=json-pretty
{ "epoch": 23194, "inconsistents": [] }
# rados list-inconsistent-obj 1.d --format=json-pretty
{ "epoch": 23072, "inconsistents": [] }
# rados list-inconsistent-obj 1.3d --format=json-pretty
{ "epoch": 23221, "inconsistents": [] }
# rados list-inconsistent-obj 2.7 --format=json-pretty
{ "epoch": 23032, "inconsistents": [] }
Looks like not much information is there.  Could you elaborate on the items you 
mentioned under "find the object"?  How do I check the metadata?  What are we 
looking for with md5sum?
- find the object  :: manually check the objects, check the object metadata, 
run md5sum on them all and compare. check objects on the nonrunning osd's and 
compare there as well. anything to try to determine what object is ok and what 
is bad. 

I tried the methods from "Ceph: manually repair object" 
(http://ceph.com/geen-categorie/ceph-manually-repair-object/) on PG 2.7 
before..  Tried the 3-replica case, which would result in a shard missing 
regardless of which one I moved.  For the 2-replica case, hmm... I guess I don't 
know how long "wait a bit" is; I just turned it back on after a minute or so, 
and it just returns to the same inconsistent message.. =P  Are we looking for 
the entire stopped OSD to map to a different OSD and get 3 replicas when 
starting the stopped OSD again?
Regards,
Hong

 

On Wednesday, September 20, 2017 4:47 PM, hjcho616  
wrote:
 

Thanks Ronny.  I'll try that inconsistent issue soon.
I think the OSD drive that PG 1.28 is sitting on is still ok... just file 
corruption happened when the power outage happened.. =P  As you suggested,

cd /var/lib/ceph/osd/ceph-4/current/
tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz 1.28_*
cd /var/lib/ceph/osd/ceph-10/tmposd
mkdir current
chown ceph.ceph current/
cd current/
tar --xattrs --preserve-permissions -zxvf /var/lib/ceph/osd/ceph-4/current/osd.4.tar.gz
systemctl start ceph-osd@8

I created a temp OSD like I did during import time.  Then set the crush 
reweight to 0.  I noticed the current directory was missing. =P  So I created a 
current directory and copied the content there.
Starting the OSD doesn't appear to show any activity.  Is there any other file I 
need to copy over other than the 1.28_head and 1.28_tail directories?
Regards,
Hong

On Wednesday, September 20, 2017 4:04 PM, Ronny Aasen 
 wrote:
 

i would only tar the pg you have missing objects from, trying to inject older 
objects when the pg is correct can not be good.

scrub errors is kind of the issue with only 2 replicas. when you have 2 
different objects, how do you know which one is correct and which one is bad..
and as you have read on 
http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ 
you need to

- find the pg  ::  rados list-inconsistent-pg [pool]
- find the problem ::  rados list-inconsistent-obj 0.6 --format=json-pretty ; 
gives you the object name, look for hints to what the bad object is
- find the object  :: manually check the objects, check the object metadata, 
run md5sum on them all and compare. check objects on the non-running osd's and 
compare there as well. anything to try to determine which object is ok and which 
is bad.
- fix the problem  :: assuming you find the bad object, stop the affected osd 
with the bad object, remove the object manually, restart osd. issue repair 
command.

if the rados commands do not give you the info you need, do it all 
manually as on http://ceph.com/geen-categorie/ceph-manually-repair-object/

good luck
Ronny Aasen

On 20.09.2017 22:17, hjcho616 wrote:

Thanks Ronny.
I decided to try to tar everything under the current directory.  Is this the 
correct command for it?  Is there any directory we do not want in the new 
drive?  commit_op_seq, meta, nosnap, omap?

tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz .

As far as inconsistent PGs... I am running in to these errors.  I tried 
moving one copy of a pg to another location, but it just says the moved shard is 
missing.  Tried setting 'noout' and turning one of them down, seems to work on 
something but then back to the same error.  Currently trying to move to a different 
osd... making sure the drive is not faulty, got few of th

Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-20 Thread Andras Pataki
Is there any guidance on sizes for the WAL and DB devices when they 
are separated onto an SSD/NVMe?  I understand that there probably isn't a 
one-size-fits-all number, but perhaps something as a function of 
cluster/usage parameters like OSD size and usage pattern (amount of 
writes, number/size of objects, etc.)?
Also, once numbers are chosen and the OSD is in use, is there a way to 
tell what portion of these spaces are used?
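(The closest thing I've found so far is the BlueFS perf counters, which at 
least show current usage -- a sketch, assuming osd.0 and the Luminous admin socket:

ceph daemon osd.0 perf dump | python -m json.tool | \
  grep -E '"(db|wal|slow)_(total|used)_bytes"'

Is reading db_used_bytes against db_total_bytes, and slow_used_bytes for 
spillover onto the data device, the right way to interpret these?)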


Thanks,

Andras


On 09/20/2017 05:36 PM, Nigel Williams wrote:
On 21 September 2017 at 04:53, Maximiliano Venesio 
mailto:mass...@nubeliu.com>> wrote:


Hi guys i'm reading different documents about bluestore, and it
never recommends to use NVRAM to store the bluefs db, nevertheless
the official documentation says that, is better to use the faster
device to put the block.db in.


Likely not mentioned since no one yet has had the opportunity to test 
it.


So how do i have to deploy using bluestore, regarding where i
should put block.wal and block.db ?


block.* would be best on your NVRAM device, like this:

ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal 
/dev/nvme0n1 --block-db /dev/nvme0n1







Re: [ceph-users] OSD assert hit suicide timeout

2017-09-20 Thread Jordan Share
We do have our nf_conntrack_max set to 1048576, but our count is actually
only between 65k and 101k, so your 256k is probably fine.  You'd have
noticed in dmesg anyway, I figure, but I thought I'd mention it.

I am out of ideas. :)

Jordan

On Wed, Sep 20, 2017 at 1:28 PM, David Turner  wrote:

> We have been running with the following settings for months for pid_max
> and conntrack_max.  Are these at values you would deem too low?
>   kernel.pid_max = 4194303
>   net.nf_conntrack_max = 262144
>
>
> I went back in the logs to see where the slow requests and timeouts
> started.  It seems to have come out of nowhere.  I included the log for 8
> minutes prior to the osd op thread time outs and 8.5 minutes before the
> first slow request message.  I don't see any indicators at all that it was
> scrubbing or anything.  I'm also not seeing any SMART errors on the drives.
>
> 2017-09-19 01:14:25.539030 7f134326a700  1 leveldb: Delete type=2 #125088
>
> 2017-09-19 01:16:25.544797 7f134326a700  1 leveldb: Level-0 table #125303:
> started
> 2017-09-19 01:16:25.571103 7f134326a700  1 leveldb: Level-0 table #125303:
> 1509406 bytes OK
> 2017-09-19 01:16:25.572424 7f134326a700  1 leveldb: Delete type=0 #125277
>
> 2017-09-19 01:18:14.329887 7f134326a700  1 leveldb: Level-0 table #125305:
> started
> 2017-09-19 01:18:14.359580 7f134326a700  1 leveldb: Level-0 table #125305:
> 1503050 bytes OK
> 2017-09-19 01:18:14.361107 7f134326a700  1 leveldb: Delete type=0 #125302
>
> 2017-09-19 01:20:13.239215 7f134326a700  1 leveldb: Level-0 table #125307:
> started
> 2017-09-19 01:20:13.268553 7f134326a700  1 leveldb: Level-0 table #125307:
> 1500713 bytes OK
> 2017-09-19 01:20:13.270018 7f134326a700  1 leveldb: Delete type=0 #125304
>
> 2017-09-19 01:22:24.023066 7f13c142b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
> 2017-09-19 01:22:24.023081 7f13c142b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>
> --- this lines repeated thousands of times before we see the first slow
> request message ---
>
> 2017-09-19 01:22:37.770750 7f134ddac700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
> 2017-09-19 01:22:37.770751 7f134ddac700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
> 2017-09-19 01:22:37.772928 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : 25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
> 2017-09-19 01:22:37.772946 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.726969 seconds old, received at 2017-09-19
> 01:22:07.045866: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263,
> reqid=client.12636870.0:480468942, at_version=48111'1958916,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.772957 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.668230 seconds old, received at 2017-09-19
> 01:22:07.104606: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264,
> reqid=client.12636870.0:480468943, at_version=48111'1958917,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.772984 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.057066 seconds old, received at 2017-09-19
> 01:22:07.715770: MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867,
> reqid=client.18422767.0:328744332, at_version=48111'1953782,
> trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
> 2017-09-19 01:22:37.772989 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.664709 seconds old, received at 2017-09-19
> 01:22:07.108126: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265,
> reqid=client.12636870.0:480468944, at_version=48111'1958918,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.773000 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.657596 seconds old, received at 2017-09-19
> 01:22:07.115240: MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858,
> reqid=client.24892528.0:35633485, at_version=48111'1958524,
> trim_to=48066'1955458, trim_rollback_to=48111'1958513)) currently started
> 2017-09-19 01:22:37.785581 7f1381ee9700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
> 2017-09-19 01:22:37.785583 7f1381ee9700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>
>
> On Tue, Sep 19, 2017 at 7:31 PM Jordan Share 
> wrote:
>
>> We had suicide timeouts, but unfortunately I can't remember the specific
>> root cause at this point.
>>
>> It was definitely one of two things:
>>* too low of net.netfilter.nf_conntrack_max (preventing the osds from
>> opening new connections to each other)
>>* too low of kernel.pid_max or kernel.threads-max (preventing new
>> threads from starting)
>>
>> I am p

Re: [ceph-users] monitor takes long time to join quorum: STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER

2017-09-20 Thread Gregory Farnum
That definitely sounds like a time sync issue. Are you *sure* they matched
each other? Is it reproducible on restart?
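A few quick things worth capturing the next time it happens (assuming ntpd and a Luminous mon; ceph time-sync-status may not exist on older releases):

# on each mon node
timedatectl status
ntpq -p

# the monitors' own view of clock skew
ceph time-sync-status
ceph status | grep -i skew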

On Wed, Sep 20, 2017 at 2:50 AM Sean Purdy  wrote:

>
> Hi,
>
>
> Luminous 12.2.0
>
> Three node cluster, 18 OSD, debian stretch.
>
>
> One node is down for maintenance for several hours.  When bringing it back
> up, OSDs rejoin after 5 minutes, but health is still warning.  monitor has
> not joined quorum after 40 minutes and logs show BADAUTHORIZER message
> every time the monitor tries to connect to the leader.
>
> 2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
> s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> l=0).handle_connect_reply connect got BADAUTHORIZER
>
> Then after ~45 minutes monitor *does* join quorum.
>
> I'm presuming this isn't normal behaviour?  Or if it is, let me know and I
> won't worry.
>
> All three nodes are using ntp and look OK timewise.
>
>
> ceph-mon log:
>
> (.43 is leader, .45 is rebooted node, .44 is other live node in quorum)
>
> Boot:
>
> 2017-09-20 09:45:21.874152 7f49efeb8f80  0 ceph version 12.2.0
> (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
> (unknown), pid 2243
>
> 2017-09-20 09:46:01.824708 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> 172.16.0.44:6789/0 conn(0x56007244d000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept connect_seq 3 vs existing csq=0 existing_state=STATE_CONNECTING
> 2017-09-20 09:46:01.824723 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> 172.16.0.44:6789/0 conn(0x56007244d000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept we reset (peer sent cseq 3, 0x5600722c.cseq = 0), sending
> RESETSESSION
> 2017-09-20 09:46:01.825247 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> 172.16.0.44:6789/0 conn(0x56007244d000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
> 2017-09-20 09:46:01.828053 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> 172.16.0.44:6789/0 conn(0x5600722c :-1
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=21872 cs=1 l=0).process
> missed message?  skipped from seq 0 to 552717734
>
> 2017-09-20 09:46:05.580342 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
> 172.16.0.43:6789/0 conn(0x5600720fe800 :-1
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=49261 cs=1 l=0).process
> missed message?  skipped from seq 0 to 1151972199
> 2017-09-20 09:46:05.581097 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
> s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> l=0).handle_connect_reply connect got BADAUTHORIZER
> 2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
> s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> l=0).handle_connect_reply connect got BADAUTHORIZER
> ...
> [message repeats for 45 minutes]
> ...
> 2017-09-20 10:23:38.818767 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
> s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> l=0).handle_connect_reply connect
>  got BADAUTHORIZER
>
>
> At this point, "ceph mon stat" says .45/store03 not in quorum:
>
> e5: 3 mons at {store01=
> 172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0},
> election epoch 376, leader 0 store01, quorum 0,1 store01,store02
>
>
> Then suddenly a valid connection is made and sync happens:
>
> 2017-09-20 10:23:43.041009 7f49e5b2f700  1 mon.store03@2(synchronizing).mds
> e1 Unable to load 'last_metadata'
> 2017-09-20 10:23:43.041967 7f49e5b2f700  1 mon.store03@2(synchronizing).osd
> e2381 e2381: 18 total, 13 up, 14 in
> ...
> 2017-09-20 10:23:43.045961 7f49e5b2f700  1 mon.store03@2(synchronizing).osd
> e2393 e2393: 18 total, 15 up, 15 in
> ...
> 2017-09-20 10:23:43.049255 7f49e5b2f700  1 mon.store03@2(synchronizing).osd
> e2406 e2406: 18 total, 18 up, 18 in
> ...
> 2017-09-20 10:23:43.054828 7f49e5b2f700  0 log_channel(cluster) log [INF]
> : mon.store03 calling new monitor election
> 2017-09-20 10:23:43.054901 7f49e5b2f700  1 
> mon.store03@2(electing).elector(372)
> init, last seen epoch 372
>
>
> Now "ceph mon stat" says:
>
> e5: 3 mons at {store01=
> 172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0},
> election epoch 378, leader 0 store01, quorum 0,1,2 store01,store02,store03
>
> and everything's happy.
>
>
> What should I look for/fix?  It's a fairly vanilla system.
>
>
> Thanks in advance,
>
> Sean Purdy


Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
Thanks Ronny.  I'll try that inconsistent issue soon.
I think the OSD drive that PG 1.28 is sitting on is still ok... just file 
corruption happened when the power outage happened.. =P  As you suggested,

cd /var/lib/ceph/osd/ceph-4/current/
tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz 1.28_*
cd /var/lib/ceph/osd/ceph-10/tmposd
mkdir current
chown ceph.ceph current/
cd current/
tar --xattrs --preserve-permissions -zxvf /var/lib/ceph/osd/ceph-4/current/osd.4.tar.gz
systemctl start ceph-osd@8

I created a temp OSD like I did during import time.  Then set the crush 
reweight to 0.  I noticed the current directory was missing. =P  So I created a 
current directory and copied the content there.
Starting the OSD doesn't appear to show any activity.  Is there any other file I 
need to copy over other than the 1.28_head and 1.28_tail directories?
Regards,
Hong

On Wednesday, September 20, 2017 4:04 PM, Ronny Aasen 
 wrote:
 

i would only tar the pg you have missing objects from, trying to inject older 
objects when the pg is correct can not be good.

scrub errors is kind of the issue with only 2 replicas. when you have 2 
different objects, how do you know which one is correct and which one is bad..
and as you have read on 
http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ 
you need to

- find the pg  ::  rados list-inconsistent-pg [pool]
- find the problem ::  rados list-inconsistent-obj 0.6 --format=json-pretty ; 
gives you the object name, look for hints to what the bad object is
- find the object  :: manually check the objects, check the object metadata, 
run md5sum on them all and compare. check objects on the non-running osd's and 
compare there as well. anything to try to determine which object is ok and which 
is bad.
- fix the problem  :: assuming you find the bad object, stop the affected osd 
with the bad object, remove the object manually, restart osd. issue repair 
command.

if the rados commands do not give you the info you need, do it all 
manually as on http://ceph.com/geen-categorie/ceph-manually-repair-object/

good luck
Ronny Aasen
 
 On 20.09.2017 22:17, hjcho616 wrote:
  
Thanks Ronny.
I decided to try to tar everything under the current directory.  Is this the 
correct command for it?  Is there any directory we do not want in the new 
drive?  commit_op_seq, meta, nosnap, omap?

tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz .

As far as inconsistent PGs... I am running in to these errors.  I tried moving 
one copy of a pg to another location, but it just says the moved shard is 
missing.  Tried setting 'noout' and turning one of them down, seems to work on 
something but then back to the same error.  Currently trying to move to a 
different osd... making sure the drive is not faulty, got few of them.. but 
still persisting..  I've been kicking off ceph pg repair PG#, hoping it would 
fix them. =P  Any other suggestion?

2017-09-20 09:39:48.481400 7f163c5fa700  0 log_channel(cluster) log [INF] : 0.29 repair starts
2017-09-20 09:47:37.384921 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od  alloc_hint [0 0])
2017-09-20 09:47:37.384931 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od  alloc_hint [0 0])
2017-09-20 09:47:37.384936 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97126ead:::200014ce4c3.028f:head: failed to pick suitable auth object
2017-09-20 09:48:11.138566 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od  alloc_hint [0 0])
2017-09-20 09:48:11.138575 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od  alloc_hint [0 0])
2017-09-20 09:48:11.138581 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97d5c15a:::10101b4.6892:head: failed to pick suitable auth object
2017-09-20 09:48:55.584022 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 repair 4 errors, 0 fixed
  Latest health...  H

Re: [ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-20 Thread Nigel Williams
On 21 September 2017 at 04:53, Maximiliano Venesio 
wrote:

> Hi guys i'm reading different documents about bluestore, and it never
> recommends to use NVRAM to store the bluefs db, nevertheless the official
> documentation says that, is better to use the faster device to put the
> block.db in.
>

Likely not mentioned since no one yet has had the opportunity to test it.

So how do i have to deploy using bluestore, regarding where i should put
> block.wal and block.db ?
>

block.* would be best on your NVRAM device, like this:

ceph-deploy osd create --bluestore c0osd-136:/dev/sda --block-wal
/dev/nvme0n1 --block-db /dev/nvme0n1
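If you do split them out, the partition sizes that ceph-disk/ceph-deploy creates can be steered from ceph.conf before creating the OSD -- a sketch only, the sizes below are placeholders rather than a recommendation:

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
# sizes in bytes used when ceph-disk creates the block.db / block.wal partitions
bluestore block db size = 32212254720
bluestore block wal size = 1073741824
EOF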


Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread Ronny Aasen
i would only tar the pg you have missing objects from, trying to inject 
older objects when the pg is correct can not be good.
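roughly what that looks like with the paths used earlier in this thread (only a sketch; keep the source osd stopped while you copy):

cd /var/lib/ceph/osd/ceph-4/current
tar --xattrs --preserve-permissions -zcvf /root/pg1.28.tar.gz 1.28_*

# on the receiving osd, unpack into its current/ directory and fix ownership
cd /var/lib/ceph/osd/ceph-10/current
tar --xattrs --preserve-permissions -zxvf /root/pg1.28.tar.gz
chown -R ceph:ceph .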



scrub errors is kind of the issue with only 2 replicas. when you have 2 
different objects, how do you know which one is correct and which one is bad..
and as you have read on 
http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ 
you need to


- find the pg  ::  rados list-inconsistent-pg [pool]
- find the problem :: rados list-inconsistent-obj 0.6 
--format=json-pretty ; gives you the object name, look for hints to what 
the bad object is
- find the object  :: manually check the objects, check the object 
metadata, run md5sum on them all and compare. check objects on the 
non-running osd's and compare there as well. anything to try to determine 
which object is ok and which is bad.
- fix the problem  :: assuming you find the bad object, stop the 
affected osd with the bad object, remove the object manually, restart 
the osd. issue the repair command.



if the rados commands do not give you the info you need, do it all 
manually as on http://ceph.com/geen-categorie/ceph-manually-repair-object/
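a rough sketch of the md5sum / metadata comparison, using the pg and object name from your log as examples:

# run on every osd host that carries the pg (stopped osds can be checked
# directly on their mounted filestore as well)
find /var/lib/ceph/osd/ceph-*/current/0.29_head -name '*200014ce4c3.028f*' -exec md5sum {} \;

# and compare the ceph xattrs on each copy
find /var/lib/ceph/osd/ceph-*/current/0.29_head -name '*200014ce4c3.028f*' -exec getfattr -d -m '.*' {} \;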


good luck
Ronny Aasen

On 20.09.2017 22:17, hjcho616 wrote:

Thanks Ronny.

I decided to try to tar everything under current directory.  Is this 
correct command for it?  Is there any directory we do not want in the 
new drive?  commit_op_seq, meta, nosnap, omap?


tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz .

As far as inconsistent PGs... I am running in to these errors.  I 
tried moving one copy of pg to other location, but it just says moved 
shard is missing.  Tried setting 'noout ' and turn one of them down, 
seems to work on something but then back to same error.  Currently 
trying to move to different osd... making sure the drive is not 
faulty, got few of them.. but still persisting..  I've been kicking 
off ceph pg repair PG#, hoping it would fix them. =P  Any other 
suggestion?


2017-09-20 09:39:48.481400 7f163c5fa700  0 log_channel(cluster) log 
[INF] : 0.29 repair starts
2017-09-20 09:47:37.384921 7f163c5fa700 -1 log_channel(cluster) log 
[ERR] : 0.29 shard 6: soid 0:97126ead:::200014ce4c3.028f:head 
data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 
0:97126ead:::200014ce4c3.028f:head(19366'539375 
client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 
539375 dd 979f2ed4 od  alloc_hint [0 0])
2017-09-20 09:47:37.384931 7f163c5fa700 -1 log_channel(cluster) log 
[ERR] : 0.29 shard 7: soid 0:97126ead:::200014ce4c3.028f:head 
data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 
0:97126ead:::200014ce4c3.028f:head(19366'539375 
client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 
539375 dd 979f2ed4 od  alloc_hint [0 0])
2017-09-20 09:47:37.384936 7f163c5fa700 -1 log_channel(cluster) log 
[ERR] : 0.29 soid 0:97126ead:::200014ce4c3.028f:head: failed to 
pick suitable auth object
2017-09-20 09:48:11.138566 7f1639df5700 -1 log_channel(cluster) log 
[ERR] : 0.29 shard 6: soid 0:97d5c15a:::10101b4.6892:head 
data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 
0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 
dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od  
alloc_hint [0 0])
2017-09-20 09:48:11.138575 7f1639df5700 -1 log_channel(cluster) log 
[ERR] : 0.29 shard 7: soid 0:97d5c15a:::10101b4.6892:head 
data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 
0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 
dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od  
alloc_hint [0 0])
2017-09-20 09:48:11.138581 7f1639df5700 -1 log_channel(cluster) log 
[ERR] : 0.29 soid 0:97d5c15a:::10101b4.6892:head: failed to 
pick suitable auth object
2017-09-20 09:48:55.584022 7f1639df5700 -1 log_channel(cluster) log 
[ERR] : 0.29 repair 4 errors, 0 fixed


Latest health...
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs 
down; 1 pgs incomplete; 9 pgs inconsistent; 1 pgs repair; 1 pgs stuck 
inactive; 1 pgs stuck unclean; 68 scrub errors; mds rank 0 has failed; 
mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag 
is not set


Regards,
Hong




On Wednesday, September 20, 2017 11:53 AM, Ronny Aasen 
 wrote:



On 20.09.2017 16:49, hjcho616 wrote:

Anyone?  Can this page be saved?  If not what are my options?

Regards,
Hong


On Saturday, September 16, 2017 1:55 AM, hjcho616 
  wrote:



Looking better... working on scrubbing..
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs 
incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 
1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 
30); mds rank 0 has failed; mds cluster is degraded; noout flag(s) 
set; no legacy OSD present but 'sortbitwise' flag is n

Re: [ceph-users] OSD assert hit suicide timeout

2017-09-20 Thread David Turner
We have been running with the following settings for months for pid_max and
conntrack_max.  Are these at values you would deem too low?
  kernel.pid_max = 4194303
  net.nf_conntrack_max = 262144
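A rough way to compare those limits against actual usage (the counts are point-in-time, so worth sampling while the problem is happening):

sysctl kernel.pid_max kernel.threads-max
ps -eLf | wc -l                              # total threads on the box

sysctl net.nf_conntrack_max net.netfilter.nf_conntrack_count
dmesg | grep -i conntrack                    # "table full, dropping packet" if it ever overflowed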


I went back in the logs to see where the slow requests and timeouts
started.  It seems to have come out of nowhere.  I included the log for 8
minutes prior to the osd op thread time outs and 8.5 minutes before the
first slow request message.  I don't see any indicators at all that it was
scrubbing or anything.  I'm also not seeing any SMART errors on the drives.

2017-09-19 01:14:25.539030 7f134326a700  1 leveldb: Delete type=2 #125088

2017-09-19 01:16:25.544797 7f134326a700  1 leveldb: Level-0 table #125303:
started
2017-09-19 01:16:25.571103 7f134326a700  1 leveldb: Level-0 table #125303:
1509406 bytes OK
2017-09-19 01:16:25.572424 7f134326a700  1 leveldb: Delete type=0 #125277

2017-09-19 01:18:14.329887 7f134326a700  1 leveldb: Level-0 table #125305:
started
2017-09-19 01:18:14.359580 7f134326a700  1 leveldb: Level-0 table #125305:
1503050 bytes OK
2017-09-19 01:18:14.361107 7f134326a700  1 leveldb: Delete type=0 #125302

2017-09-19 01:20:13.239215 7f134326a700  1 leveldb: Level-0 table #125307:
started
2017-09-19 01:20:13.268553 7f134326a700  1 leveldb: Level-0 table #125307:
1500713 bytes OK
2017-09-19 01:20:13.270018 7f134326a700  1 leveldb: Delete type=0 #125304

2017-09-19 01:22:24.023066 7f13c142b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
2017-09-19 01:22:24.023081 7f13c142b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15

--- these lines repeated thousands of times before we see the first slow
request message ---

2017-09-19 01:22:37.770750 7f134ddac700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
2017-09-19 01:22:37.770751 7f134ddac700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
2017-09-19 01:22:37.772928 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
2017-09-19 01:22:37.772946 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.726969 seconds old, received at 2017-09-19 01:22:07.045866:
MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263,
reqid=client.12636870.0:480468942, at_version=48111'1958916,
trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.772957 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.668230 seconds old, received at 2017-09-19 01:22:07.104606:
MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264,
reqid=client.12636870.0:480468943, at_version=48111'1958917,
trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.772984 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.057066 seconds old, received at 2017-09-19 01:22:07.715770:
MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867,
reqid=client.18422767.0:328744332, at_version=48111'1953782,
trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
2017-09-19 01:22:37.772989 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.664709 seconds old, received at 2017-09-19 01:22:07.108126:
MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265,
reqid=client.12636870.0:480468944, at_version=48111'1958918,
trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.773000 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.657596 seconds old, received at 2017-09-19 01:22:07.115240:
MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858,
reqid=client.24892528.0:35633485, at_version=48111'1958524,
trim_to=48066'1955458, trim_rollback_to=48111'1958513)) currently started
2017-09-19 01:22:37.785581 7f1381ee9700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
2017-09-19 01:22:37.785583 7f1381ee9700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15


On Tue, Sep 19, 2017 at 7:31 PM Jordan Share  wrote:

> We had suicide timeouts, but unfortunately I can't remember the specific
> root cause at this point.
>
> It was definitely one of two things:
>* too low of net.netfilter.nf_conntrack_max (preventing the osds from
> opening new connections to each other)
>* too low of kernel.pid_max or kernel.threads-max (preventing new
> threads from starting)
>
> I am pretty sure we hit the pid_max early on, but the conntrack_max didn't
> cause trouble until we had enough VMs running to push the total number of
> connections past the default limit.
>
> Our cluster is approximately the same size as yours.
>
> Jordan
>
> On Tue, Sep 19, 2017 at 3:05 PM, Stanley Zhang  > wrote:
>
>> We don't use EC pools, but my experience with similar slow requests on
>> RGW+replicated_pools is that in the logs you need to find out t

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
Thanks Ronny.
I decided to try to tar everything under current directory.  Is this correct 
command for it?  Is there any directory we do not want in the new drive?  
commit_op_seq, meta, nosnap, omap?
tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz .
As far as inconsistent PGs... I am running in to these errors.  I tried moving 
one copy of pg to other location, but it just says moved shard is missing.  
Tried setting 'noout ' and turn one of them down, seems to work on something 
but then back to same error.  Currently trying to move to different osd... 
making sure the drive is not faulty, got few of them.. but still persisting..  
I've been kicking off ceph pg repair PG#, hoping it would fix them. =P  Any 
other suggestion?
2017-09-20 09:39:48.481400 7f163c5fa700  0 log_channel(cluster) log [INF] : 0.29 repair starts
2017-09-20 09:47:37.384921 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od  alloc_hint [0 0])
2017-09-20 09:47:37.384931 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od  alloc_hint [0 0])
2017-09-20 09:47:37.384936 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97126ead:::200014ce4c3.028f:head: failed to pick suitable auth object
2017-09-20 09:48:11.138566 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od  alloc_hint [0 0])
2017-09-20 09:48:11.138575 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od  alloc_hint [0 0])
2017-09-20 09:48:11.138581 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97d5c15a:::10101b4.6892:head: failed to pick suitable auth object
2017-09-20 09:48:55.584022 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 repair 4 errors, 0 fixed

Latest health...
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 1 pgs incomplete; 9 pgs inconsistent; 1 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 68 scrub errors; mds rank 0 has failed; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set

Regards,
Hong

 

On Wednesday, September 20, 2017 11:53 AM, Ronny Aasen 
 wrote:
 

On 20.09.2017 16:49, hjcho616 wrote:

Anyone?  Can this page be saved?  If not what are my options?
Regards,
Hong

On Saturday, September 16, 2017 1:55 AM, hjcho616 wrote:

Looking better... working on scrubbing..
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 
12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 
109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; 
mds cluster is degraded; noout flag(s) set; no legacy OSD present but 
'sortbitwise' flag is not set

Now PG 1.28.. looking at all old osds, dead or alive, the only one with a DIR_* 
directory is osd.4.  This appears to be the metadata pool!  21M of metadata can 
be quite a bit of stuff.. so I would like to rescue this!  But I am not able to 
start this OSD.  Exporting through ceph-objectstore-tool appears to crash, even 
with --skip-journal-replay and --skip-mount-omap (different failure).  As I 
mentioned in an earlier email, that exception thrown message is bogus...

# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export
terminate called after throwing an instance of 'std::domain_error'

[SNIP]

What can I do to save that PG 1.28?  Please let me know if you need more 
information.  So close!... =)
Regards,
Hong

12 inconsistent and 109 scrub errors is something you should fix first of all.
also you can consider using the paid-services of many ceph support companies 
that specialize in these kind of situations.
-- that being said, here are some suggestions...
when it comes to lost object recovery you have come about as far as i have ever 
experienced. so everything after here is just assumptions and wild guesswork to 
what you can

[ceph-users] Bluestore disk colocation using NVRAM, SSD and SATA

2017-09-20 Thread Maximiliano Venesio
Hi guys, I'm reading different documents about BlueStore, and none of them
recommends using NVRAM to store the BlueFS DB; nevertheless the official
documentation says that it is better to put the block.db on the faster
device.

In my cluster I have NVRAM devices of 400GB, SSD disks for high
performance and SATA disks for cold storage.

So how should I deploy with BlueStore, regarding where to put
block.wal and block.db?
Should I use the NVRAM just for the block.wal and put the block.db
on the same SATA and SSD disks as the data, or should I use NVRAM
for both block.wal and block.db?

Any idea if there is some special constraint about putting block.db on
NVRAM, even when the backing disk is an SSD?



Thanks in advance.

*Maximiliano Venesio*
*Chief Cloud Architect | NUBELIU*
E-mail: massimo@nubeliu.com  Cell: +54 9 11 3770 1853
_
www.nubeliu.com


Re: [ceph-users] Bluestore "separate" WAL and DB

2017-09-20 Thread Alejandro Comisario
Bump! I would love to hear thoughts on this!

On Fri, Sep 8, 2017 at 7:44 AM, Richard Hesketh <
richard.hesk...@rd.bbc.co.uk> wrote:

> Hi,
>
> Reading the ceph-users list I'm obviously seeing a lot of people talking
> about using bluestore now that Luminous has been released. I note that many
> users seem to be under the impression that they need separate block devices
> for the bluestore data block, the DB, and the WAL... even when they are
> going to put the DB and the WAL on the same device!
>
> As per the docs at http://docs.ceph.com/docs/master/rados/configuration/
> bluestore-config-ref/ this is nonsense:
>
> > If there is only a small amount of fast storage available (e.g., less
> than a gigabyte), we recommend using it as a WAL device. If there is more,
> provisioning a DB
> > device makes more sense. The BlueStore journal will always be placed on
> the fastest device available, so using a DB device will provide the same
> benefit that the WAL
> > device would while also allowing additional metadata to be stored there
> (if it will fix). [sic, I assume that should be "fit"]
>
> I understand that if you've got three speeds of storage available, there
> may be some sense to dividing these. For instance, if you've got lots of
> HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD,
> DB on SSD and WAL on NVMe may be a sensible division of data. That's not
> the case for most of the examples I'm reading; they're talking about
> putting DB and WAL on the same block device, but in different partitions.
> There's even one example of someone suggesting to try partitioning a single
> SSD to put data/DB/WAL all in separate partitions!
>
> Are the docs wrong and/or I am missing something about optimal bluestore
> setup, or do people simply have the wrong end of the stick? I ask because
> I'm just going through switching all my OSDs over to Bluestore now and I've
> just been reusing the partitions I set up for journals on my SSDs as DB
> devices for Bluestore HDDs without specifying anything to do with the WAL,
> and I'd like to know sooner rather than later if I'm making some sort of
> horrible mistake.
>
> Rich
> --
> Richard Hesketh
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com


Re: [ceph-users] Possible to change the location of run_dir?

2017-09-20 Thread Bryan Banister
But the socket files are not group writeable:

srwxr-xr-x 1 ceph ceph 0 Aug 21 15:58 
/var/run/ceph/ceph-mgr.carf-ceph-osd16.asok

So users in that group can’t write to that admin socket, which is required for 
running `ceph --admin-daemon /var/run/ceph/ceph-osd.9.asok perf dump` (what 
telegraf is trying to run)?

Thanks,
-Bryan

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Wednesday, September 20, 2017 1:34 PM
To: Bryan Banister ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Possible to change the location of run_dir?

Note: External Email

You can always add the telegraf user to the ceph group.  That change will 
persist on reboots and allow the user running the commands to read any 
folder/file that is owned by the group ceph.  I do this for Zabbix and Nagios 
now that the /var/lib/ceph folder is not public readable.

On Wed, Sep 20, 2017 at 2:27 PM Bryan Banister 
mailto:bbanis...@jumptrading.com>> wrote:
We are running telegraf and would like to have the telegraf user read the admin 
sockets from ceph, which is required for the ceph telegraf plugin to apply the 
ceph  related tags to the data.  The ceph admin sockets are by default stored 
in /var/run/ceph, but this is recreated at boot time, so we can’t set 
permissions on these sockets which will persist.

We would like to change the run_dir for ceph to be a persistent directory.  Is 
there a way to do this?

Would be nice if there was a [global] config option or something we could put 
in the /etc/sysconfig/ceph file.

Thanks,
-Bryan





Re: [ceph-users] Possible to change the location of run_dir?

2017-09-20 Thread David Turner
You can always add the telegraf user to the ceph group.  That change will
persist on reboots and allow the user running the commands to read any
folder/file that is owned by the group ceph.  I do this for Zabbix and
Nagios now that the /var/lib/ceph folder is not public readable.
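Something like this (assuming the agent runs as the telegraf user under a systemd unit called telegraf):

usermod -a -G ceph telegraf
# group membership is only picked up by new processes, so restart the agent
systemctl restart telegraf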

On Wed, Sep 20, 2017 at 2:27 PM Bryan Banister 
wrote:

> We are running telegraf and would like to have the telegraf user read the
> admin sockets from ceph, which is required for the ceph telegraf plugin to
> apply the ceph  related tags to the data.  The ceph admin sockets are by
> default stored in /var/run/ceph, but this is recreated at boot time, so we
> can’t set permissions on these sockets which will persist.
>
>
>
> We would like to change the run_dir for ceph to be a persistent
> directory.  Is there a way to do this?
>
>
>
> Would be nice if there was a [global] config option or something we could
> put in the /etc/sysconfig/ceph file.
>
>
>
> Thanks,
>
> -Bryan
>


Re: [ceph-users] Possible to change the location of run_dir?

2017-09-20 Thread Jean-Charles Lopez
Hi

use the run_dir parameter in your /etc/ceph/ceph.conf file. It defaults to 
/var/run/ceph

Or

use admin_socket = /full/path/to/admin/socket for each daemon, or set it once in the 
global section using metavariables such as $type, $id, $cluster and so on
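
For example, something along these lines in ceph.conf (only a sketch; the directory here is an arbitrary example and has to exist and be writable by the ceph user):

[global]
run_dir = /var/lib/ceph/run
admin_socket = /var/lib/ceph/run/$cluster-$name.asok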

Regards
JC

> On Sep 20, 2017, at 11:27, Bryan Banister  wrote:
> 
> We are running telegraf and would like to have the telegraf user read the 
> admin sockets from ceph, which is required for the ceph telegraf plugin to 
> apply the ceph  related tags to the data.  The ceph admin sockets are by 
> default stored in /var/run/ceph, but this is recreated at boot time, so we 
> can’t set permissions on these sockets which will persist.
>  
> We would like to change the run_dir for ceph to be a persistent directory.  
> Is there a way to do this?
>  
> Would be nice if there was a [global] config option or something we could put 
> in the /etc/sysconfig/ceph file.
>  
> Thanks,
> -Bryan
> 
> 
> Note: This email is for the confidential use of the named addressee(s) only 
> and may contain proprietary, confidential or privileged information. If you 
> are not the intended recipient, you are hereby notified that any review, 
> dissemination or copying of this email is strictly prohibited, and to please 
> notify the sender immediately and destroy this email and any attachments. 
> Email transmission cannot be guaranteed to be secure or error-free. The 
> Company, therefore, does not make any guarantees as to the completeness or 
> accuracy of this email or any attachments. This email is for informational 
> purposes only and does not constitute a recommendation, offer, request or 
> solicitation of any kind to buy, sell, subscribe, redeem or perform any type 
> of transaction of a financial product.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Possible to change the location of run_dir?

2017-09-20 Thread Bryan Banister
We are running telegraf and would like to have the telegraf user read the admin 
sockets from ceph, which is required for the ceph telegraf plugin to apply the 
ceph  related tags to the data.  The ceph admin sockets are by default stored 
in /var/run/ceph, but this is recreated at boot time, so we can't set 
permissions on these sockets that will persist across reboots.

We would like to change the run_dir for ceph to be a persistent directory.  Is 
there a way to do this?

Would be nice if there was a [global] config option or something we could put 
in the /etc/sysconfig/ceph file.

Thanks,
-Bryan



Note: This email is for the confidential use of the named addressee(s) only and 
may contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you are hereby notified that any review, dissemination 
or copying of this email is strictly prohibited, and to please notify the 
sender immediately and destroy this email and any attachments. Email 
transmission cannot be guaranteed to be secure or error-free. The Company, 
therefore, does not make any guarantees as to the completeness or accuracy of 
this email or any attachments. This email is for informational purposes only 
and does not constitute a recommendation, offer, request or solicitation of any 
kind to buy, sell, subscribe, redeem or perform any type of transaction of a 
financial product.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
I would recommend stopping the OSD daemon, waiting for the cluster to notice and
recover, and then starting it again.  The cluster has settings for
automatically marking the OSD down and subsequently out on its own.
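Roughly like this (untested sketch; assumes systemd units and osd.6 as in your earlier mails, adjust the id):

# stop the daemon; the cluster will mark it down/out on its own after the timeouts
systemctl stop ceph-osd@6
# or hurry it along explicitly
ceph osd down 6
ceph osd out 6
# measure recovery, then bring it back
systemctl start ceph-osd@6
ceph osd in 6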
On Wed, Sep 20, 2017 at 1:18 PM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Thank you for this detailed information. In fact I am seeing spikes in my
> objects/second recovery. Unfortunately
> I had to shut down my cluster. But thanks for the help!
>
> In order to improve my scenario (marking an osd out to see how the cluster
> recovers),
> what is the best way to simulate an OSD failure without actually shutting
> down or powering off the whole machine?
> I want to do automate the process with bash scripts.
>
> Currently I am just doing *ceph osd out *. Should I additionally
> do *ceph osd down * in combination with
> killing the OSD daemon on the specific host? What I am trying to do is
> basically the following:
>
> - Killing one OSD
> - Measuring recovery time, monitoring, etc.
> - Bringing the OSD back in
> - Again measuring recovery time, monitoring, etc
> - Enjoy a healthy cluster
>
> This is (quite) working as described with *ceph osd out * and *ceph
> osd in *, but I am wondering
> if this produces a realistic behavior.
>
>
> Am 20.09.2017 um 18:06 schrieb David Turner :
>
> When you posted your ceph status, you only had 56 PGs degraded.  Any value
> of osd_max_backfills or osd_recovery_max_active over 56 would not do
> anything.  What these settings do is dictate to each OSD the maximum amount
> of PGs that it can be involved in a recovery process at once.  If you had
> 56 PGs degraded and all of them were on a single OSD, then a value of 56
> would tell all of them to be able to run at the same time.  If they were
> more spread out across your cluster, then a lower setting would still allow
> all of the PGs to recover at the same time.
>
> Now you're talking about how many objects are recovering a second.  Note
> that your PGs are recovering and not backfilling.  Backfilling is moving
> all of the data for a PG from 1 OSD to another.  All objects need to be
> recovered and you'll see a much higher number of objects/second.  Recovery
> is just catching up after an OSD has been down for a bit, but never marked
> out.  It only needs to catch up on the objects that have been altered,
> created, or deleted since it was last caught up for the PG.  When the PG
> finishes it's recovery state and is in a healthy state again, all of the
> objects that were in it but that didn't need to catch up are all at once
> marked recovered and you'll see spikes in your objects/second recovery.
>
> Your scenario (marking an OSD out to see how the cluster rebounds)
> shouldn't have a lot of PGs in recovery they should all be in backfill
> because the data needs to shift between OSDs.  I'm guessing that had
> something to do with the OSD still being up while it was marked down or
> that you had some other OSDs in your cluster be marked down due to not
> responding or possibly being restarted due to an OOM killer from the
> kernel.  What is your current `ceph status`?
>
> On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
>
>> Thank you for the admin socket information and the hint to Luminous, I
>> will try it out when I have the time.
>>
>> What I noticed when looking at ceph -w is that the number of objects per
>> second recovering is still very low.
>> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills
>> to very high numbers (4096, just to be sure).
>> Most of the time it is something like ‚0 objects/s recovering‘ or less
>> than ‚10 objects/s recovering‘, for example:
>>
>> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 42131 kB/s, 3 objects/s recovering
>> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 9655 kB/s, 2 objects/s recovering
>> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 2034 kB/s, 0 objects/s recovering
>> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30553/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 255 MB/s, 0 ob

[ceph-users] Luminous RGW dynamic sharding

2017-09-20 Thread Ryan Leimenstoll
Hi all, 

I noticed Luminous now has dynamic sharding for RGW bucket indices as a 
production option. Does anyone know of any potential caveats or issues we 
should be aware of before enabling this? Beyond the Luminous v12.2.0 release 
notes and a few mailing list entries from during the release candidate phase, I 
haven’t seen much mention of it. For some time now we have been experiencing 
blocked requests when deep scrubbing PGs in our bucket index, so this could be 
quite useful for us. 
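
For context, the knobs and commands I am looking at are roughly these (from memory, so treat the exact option names as assumptions and check the Luminous docs; the bucket name is a placeholder):

# ceph.conf - dynamic resharding is enabled by default in Luminous
rgw_dynamic_resharding = true
rgw_max_objs_per_shard = 100000

# inspect what the resharding thread has queued / completed
radosgw-admin reshard list
radosgw-admin reshard status --bucket=<bucket-name>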

Thanks,
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD crash starting up

2017-09-20 Thread David Turner
My guess is that it's actually just your cluster finding the inconsistent
PGs during its normal scrubbing schedule.  If a PG that was scrubbed and
clean then becomes inconsistent, then yes I would look for a failing disk.
This could be fallout from the failing disk from before.  It could have
been up just long enough before it crashed to cause problems.
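If you want to rule the disk in or out quickly, something like this is usually enough (only a sketch; replace /dev/sdX with the device behind the suspect OSD):

# SMART health and error counters
smartctl -a /dev/sdX | grep -iE 'overall-health|reallocated|pending|uncorrect'
# kernel log entries for that device
dmesg | grep -i 'sdX'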

On Wed, Sep 20, 2017 at 1:12 PM Gonzalo Aguilar Delgado <
gagui...@aguilardelgado.com> wrote:

> Hi David,
>
> Thank you for your support. What can be the cause of
> active+clean+inconsistent still growing up? Bad disk?
>
> Best regards,
>
> On 19/09/17 17:50, David Turner wrote:
>
> Adding the old OSD back in with its data shouldn't help you at all.  Your
> cluster has finished backfilling and has the proper amount of copies of all
> of its data.  The time you would want to add a removed OSD back to a
> cluster is when you have unfound objects.
>
> The scrub errors and inconsistent PGs are what you need to focus on and
> where your current problem is.  The message with too many PGs per OSD is
> just a warning and not causing any issues at this point as long as your OSD
> nodes aren't having any OOM messages.  Once you add in a 6th OSD, that will
> go away on its own.
>
> There are several threads on the Mailing List that you should be able to
> find about recovering from these and the potential dangers of some of the
> commands.  Googling for `ceph-users scrub errors inconsistent pgs` is a
> good place to start.
>
> On Tue, Sep 19, 2017 at 11:28 AM Gonzalo Aguilar Delgado <
> gagui...@aguilardelgado.com> wrote:
>
>> Hi David,
>>
>> What I want is to add the OSD back with its data yes. But avoiding any
>> troubles that can happen from the time it was out.
>>
>> Is it possible? I suppose that some pg has been updated after. Will ceph
>> manage it gracefully?
>>
>> Ceph status is getting worse every day.
>>
>> ceph status
>> cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>>  health HEALTH_ERR
>> 6 pgs inconsistent
>> 31 scrub errors
>> too many PGs per OSD (305 > max 300)
>>  monmap e12: 2 mons at {blue-compute=
>> 172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
>> election epoch 4328, quorum 0,1 red-compute,blue-compute
>>   fsmap e881: 1/1/1 up {0=blue-compute=up:active}
>>  osdmap e7120: 5 osds: 5 up, 5 in
>> flags require_jewel_osds
>>   pgmap v66976120: 764 pgs, 6 pools, 555 GB data, 140 kobjects
>>  GB used, 3068 GB / 4179 GB avail
>>  758 active+clean
>>6 active+clean+inconsistent
>>   client io 384 kB/s wr, 0 op/s rd, 83 op/s wr
>>
>>
>> I want to add the old OSD, rebalance copies are more hosts/osds and
>> remove it out again.
>>
>>
>> Best regards,
>>
>> On 19/09/17 14:47, David Turner wrote:
>>
>> Are you asking to add the osd back with its data or add it back in as a
>> fresh osd.  What is your `ceph status`?
>>
>> On Tue, Sep 19, 2017, 5:23 AM Gonzalo Aguilar Delgado <
>> gagui...@aguilardelgado.com> wrote:
>>
>>> Hi David,
>>>
>>> Thank you for the great explanation of the weights, I thought that ceph
>>> was adjusting them based on disk. But it seems it's not.
>>>
>>> But the problem was not that I think the node was failing because a
>>> software bug because the disk was not full anymeans.
>>>
>>> /dev/sdb1 976284608 172396756   803887852  18%
>>> /var/lib/ceph/osd/ceph-1
>>>
>>> Now the question is to know if I can add again this osd safely. Is it
>>> possible?
>>>
>>> Best regards,
>>>
>>>
>>>
>>> On 14/09/17 23:29, David Turner wrote:
>>>
>>> Your weights should more closely represent the size of the OSDs.  OSD3
>>> and OSD6 are weighted properly, but your other 3 OSDs have the same weight
>>> even though OSD0 is twice the size of OSD2 and OSD4.
>>>
>>> Your OSD weights is what I thought you were referring to when you said
>>> you set the crush map to 1.  At some point it does look like you set all of
>>> your OSD weights to 1, which would apply to OSD1.  If the OSD was too small
>>> for that much data, it would have filled up and be too full to start.  Can
>>> you mount that disk and see how much free space is on it?
>>>
>>> Just so you understand what that weight is, it is how much data the
>>> cluster is going to put on it.  The default is for the weight to be the
>>> size of the OSD in TiB (1024 based instead of TB which is 1000).  If you
>>> set the weight of a 1TB disk and a 4TB disk both to 1, then the cluster
>>> will try and give them the same amount of data.  If you set the 4TB disk to
>>> a weight of 4, then the cluster will try to give it 4x more data than the
>>> 1TB drive (usually what you want).
>>>
>>> In your case, your 926G OSD0 has a weight of 1 and your 460G OSD2 has a
>>> weight of 1 so the cluster thinks they should each receive the same amount
>>> of data (which it did, they each have ~275GB of data).  OSD3 has a weight
>>> of 1.36380 (its size in TiB) and OSD6

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Thank you for this detailed information. In fact I am seeing spikes in my 
objects/second recovery. Unfortunately
I had to shut down my cluster. But thanks for the help! 

In order to improve my scenario (marking an osd out to see how the cluster 
recovers),
what is the best way to simulate an OSD failure without actually shutting down 
or powering off the whole machine?
I want to automate the process with bash scripts.

Currently I am just doing ceph osd out . Should I additionally do ceph 
osd down  in combination with
killing the OSD daemon on the specific host? What I am trying to do is 
basically the following:

- Killing one OSD
- Measuring recovery time, monitoring, etc.
- Bringing the OSD back in
- Again measuring recovery time, monitoring, etc
- Enjoy a healthy cluster

This is (quite) working as described with ceph osd out  and ceph osd in 
, but I am wondering
if this produces realistic behavior.


> Am 20.09.2017 um 18:06 schrieb David Turner :
> 
> When you posted your ceph status, you only had 56 PGs degraded.  Any value of 
> osd_max_backfills or osd_recovery_max_active over 56 would not do anything.  
> What these settings do is dictate to each OSD the maximum amount of PGs that 
> it can be involved in a recovery process at once.  If you had 56 PGs degraded 
> and all of them were on a single OSD, then a value of 56 would tell all of 
> them to be able to run at the same time.  If they were more spread out across 
> your cluster, then a lower setting would still allow all of the PGs to 
> recover at the same time.
> 
> Now you're talking about how many objects are recovering a second.  Note that 
> your PGs are recovering and not backfilling.  Backfilling is moving all of 
> the data for a PG from 1 OSD to another.  All objects need to be recovered 
> and you'll see a much higher number of objects/second.  Recovery is just 
> catching up after an OSD has been down for a bit, but never marked out.  It 
> only needs to catch up on the objects that have been altered, created, or 
> deleted since it was last caught up for the PG.  When the PG finishes it's 
> recovery state and is in a healthy state again, all of the objects that were 
> in it but that didn't need to catch up are all at once marked recovered and 
> you'll see spikes in your objects/second recovery.
> 
> Your scenario (marking an OSD out to see how the cluster rebounds) shouldn't 
> have a lot of PGs in recovery they should all be in backfill because the data 
> needs to shift between OSDs.  I'm guessing that had something to do with the 
> OSD still being up while it was marked down or that you had some other OSDs 
> in your cluster be marked down due to not responding or possibly being 
> restarted due to an OOM killer from the kernel.  What is your current `ceph 
> status`?
> 
> On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic 
> mailto:jonasjaszkowic.w...@gmail.com>> wrote:
> Thank you for the admin socket information and the hint to Luminous, I will 
> try it out when I have the time.
> 
> What I noticed when looking at ceph -w is that the number of objects per 
> second recovering is still very low.
> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills to 
> very high numbers (4096, just to be sure).
> Most of the time it is something like ‚0 objects/s recovering‘ or less than 
> ‚10 objects/s recovering‘, for example:
> 
> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
> degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 42131 kB/s, 3 
> objects/s recovering
> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
> degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 9655 kB/s, 2 
> objects/s recovering
> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
> degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 2034 kB/s, 0 
> objects/s recovering
> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30553/1376215 objects 
> degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 255 MB/s, 0 
> objects/s recovering
> 2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail; 30553/1376215 objects 
> degraded (2.220%); 12203/1376215 objects misplaced (0.887%); 2

Re: [ceph-users] Ceph OSD crash starting up

2017-09-20 Thread Gonzalo Aguilar Delgado

Hi David,

Thank you for your support. What can be the cause of the number of 
active+clean+inconsistent PGs still growing? A bad disk?


Best regards,


On 19/09/17 17:50, David Turner wrote:
Adding the old OSD back in with its data shouldn't help you at all.  
Your cluster has finished backfilling and has the proper amount of 
copies of all of its data.  The time you would want to add a removed 
OSD back to a cluster is when you have unfound objects.


The scrub errors and inconsistent PGs are what you need to focus on 
and where your current problem is.  The message with too many PGs per 
OSD is just a warning and not causing any issues at this point as long 
as your OSD nodes aren't having any OOM messages.  Once you add in a 
6th OSD, that will go away on its own.


There are several threads on the Mailing List that you should be able 
to find about recovering from these and the potential dangers of some 
of the commands.  Googling for `ceph-users scrub errors inconsistent 
pgs` is a good place to start.


On Tue, Sep 19, 2017 at 11:28 AM Gonzalo Aguilar Delgado 
mailto:gagui...@aguilardelgado.com>> wrote:


Hi David,

What I want is to add the OSD back with its data yes. But avoiding
any troubles that can happen from the time it was out.

Is it possible? I suppose that some pg has been updated after.
Will ceph manage it gracefully?

Ceph status is getting worse every day.

ceph status
cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
 health HEALTH_ERR
6 pgs inconsistent
31 scrub errors
too many PGs per OSD (305 > max 300)
 monmap e12: 2 mons at
{blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0
}
election epoch 4328, quorum 0,1 red-compute,blue-compute
  fsmap e881: 1/1/1 up {0=blue-compute=up:active}
 osdmap e7120: 5 osds: 5 up, 5 in
flags require_jewel_osds
  pgmap v66976120: 764 pgs, 6 pools, 555 GB data, 140 kobjects
 GB used, 3068 GB / 4179 GB avail
 758 active+clean
   6 active+clean+inconsistent
  client io 384 kB/s wr, 0 op/s rd, 83 op/s wr


I want to add the old OSD, rebalance copies are more hosts/osds
and remove it out again.


Best regards,


On 19/09/17 14:47, David Turner wrote:


Are you asking to add the osd back with its data or add it back
in as a fresh osd.  What is your `ceph status`?


On Tue, Sep 19, 2017, 5:23 AM Gonzalo Aguilar Delgado
mailto:gagui...@aguilardelgado.com>> wrote:

Hi David,

Thank you for the great explanation of the weights, I thought
that ceph was adjusting them based on disk. But it seems it's
not.

But the problem was not that I think the node was failing
because a software bug because the disk was not full anymeans.

/dev/sdb1 976284608 172396756  
803887852  18% /var/lib/ceph/osd/ceph-1


Now the question is to know if I can add again this osd
safely. Is it possible?

Best regards,



On 14/09/17 23:29, David Turner wrote:

Your weights should more closely represent the size of the
OSDs.  OSD3 and OSD6 are weighted properly, but your other 3
OSDs have the same weight even though OSD0 is twice the size
of OSD2 and OSD4.

Your OSD weights is what I thought you were referring to
when you said you set the crush map to 1.  At some point it
does look like you set all of your OSD weights to 1, which
would apply to OSD1.  If the OSD was too small for that much
data, it would have filled up and be too full to start.  Can
you mount that disk and see how much free space is on it?

Just so you understand what that weight is, it is how much
data the cluster is going to put on it.  The default is for
the weight to be the size of the OSD in TiB (1024 based
instead of TB which is 1000).  If you set the weight of a
1TB disk and a 4TB disk both to 1, then the cluster will try
and give them the same amount of data.  If you set the 4TB
disk to a weight of 4, then the cluster will try to give it
4x more data than the 1TB drive (usually what you want).

In your case, your 926G OSD0 has a weight of 1 and your 460G
OSD2 has a weight of 1 so the cluster thinks they should
each receive the same amount of data (which it did, they
each have ~275GB of data).  OSD3 has a weight of 1.36380
(its size in TiB) and OSD6 has a weight of 0.90919 and they
have basically the same %used space (17%) as opposed to the
same amount of data because the weight is based on their size.

As long as you had enough replicas of your data in the
cluster for it to recover 

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread Ronny Aasen

On 20.09.2017 16:49, hjcho616 wrote:

Anyone?  Can this page be saved?  If not what are my options?

Regards,
Hong


On Saturday, September 16, 2017 1:55 AM, hjcho616  
wrote:



Looking better... working on scrubbing..
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs 
incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 
pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 
30); mds rank 0 has failed; mds cluster is degraded; noout flag(s) 
set; no legacy OSD present but 'sortbitwise' flag is not set


Now PG1.28.. looking at all old osds dead or alive.  Only one with 
DIR_* directory is in osd.4. This appears to be metadata pool!  21M of 
metadata can be quite a bit of stuff.. so I would like to rescue this! 
 But I am not able to start this OSD.  exporting through 
ceph-objectstore-tool appears to crash.  Even with 
--skip-journal-replay and --skip-mount-omap (different failure).  As I 
mentioned in earlier email, that exception thrown message is bogus...
# ceph-objectstore-tool --op export --pgid 1.28  --data-path 
/var/lib/ceph/osd/ceph-4 --journal-path 
/var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export

terminate called after throwing an instance of 'std::domain_error'



[SNIP]
What can I do to save that PG1.28?  Please let me know if you need 
more information.  So close!... =)


Regards,
Hong

12 inconsistent PGs and 109 scrub errors are something you should fix first 
of all.


You can also consider using the paid services of one of the many ceph support 
companies that specialize in this kind of situation.


--

that beeing said, here are some suggestions...

When it comes to lost object recovery you have come about as far as I 
have ever experienced, so everything after here is just assumptions and 
wild guesswork as to what you can try.  I hope others shout out if I tell 
you wildly wrong things.


If you have found data for pg 1.28 on the broken osd, and have checked all 
other working and nonworking drives for that pg, then you need to try 
and extract the pg from the broken drive. As always in recovery cases, 
take a dd clone of the drive and work from the cloned image, to avoid 
more damage to the drive and to allow you to try multiple times.


You should add a temporary injection drive large enough for that pg, and 
set its crush weight to 0 so it always drains. Make sure it is up and 
registered properly in ceph.


The idea is to copy the pg manually from the broken osd to the injection 
drive, since the export/import fails, making sure you get all xattrs 
included.  One can either copy the whole pg, or just the "missing" 
objects; if there are few objects I would go for that, if there are 
many I would take the whole pg. You won't get data from leveldb, so I am 
not at all sure this would work, but it is worth a shot.


- stop your injection osd, verify it is down and the process is not running.
- from the mountpoint of your broken osd go into the current directory 
and tar up the pg1.28 directory; make sure you use -p and --xattrs when you 
create the archive (example commands after this list).
- if tar errors out on unreadable files, just rm those (since you are 
working on a copy of your rescue image, you can always try again)
- copy the tar file to the injection drive and extract it while sitting in 
its current directory (remember --xattrs)

- set debug options on the injection drive in ceph.conf
- start the injection drive, and follow along in the log file. hopefully 
it should scan, locate the pg, and replicate the pg1.28 objects off to 
the current primary drive for pg1.28. and since it have crush weight 0 
it should drain out.
- if that works, verify the injection drive is drained, stop it and 
remove it from ceph.  zap the drive.
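
A rough sketch of the copy itself, assuming the rescue image of the broken osd is mounted read-only at /mnt/broken-osd and the injection osd is osd.20 (both the path and the id are made-up examples):

cd /mnt/broken-osd/current
# archive the pg directory, preserving permissions and xattrs
tar --xattrs -cpf /tmp/pg1.28.tar 1.28_head
# extract into the (stopped) injection osd's current directory
cd /var/lib/ceph/osd/ceph-20/current
tar --xattrs -xpf /tmp/pg1.28.tar
# on jewel and later the osd runs as the ceph user
chown -R ceph:ceph 1.28_head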



This is all, as I said, guesstimates, so your mileage may vary.
Good luck

Ronny Aasen







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds: failed to decode message of type 43 v7: buffer::end_of_buffer

2017-09-20 Thread Christian Salzmann-Jäckel
On 19.09.2017 20:58, Gregory Farnum wrote:
> You've probably run in to http://tracker.ceph.com/issues/16010 — do you have 
> very large directories? (Or perhaps just a whole bunch of unlinked files 
> which the MDS hasn't managed to trim yet?)

Thank you for pointing us in the right direction.
I had found issue #16010 before but had ruled it out as the reason for the outage.
As the cluster's overall health was ok, we decided to bring forward the pending 
upgrade to Luminous, now profiting from multiple active MDSes and directory 
fragmentation.

ciao
Christian



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
Correction, if the OSD had been marked down and been marked out, some of
its PGs would be in a backfill state while others would be in a recovery
state depending on how long the OSD was marked down and how much
backfilling had completed in the cluster.

On Wed, Sep 20, 2017 at 12:06 PM David Turner  wrote:

> When you posted your ceph status, you only had 56 PGs degraded.  Any value
> of osd_max_backfills or osd_recovery_max_active over 56 would not do
> anything.  What these settings do is dictate to each OSD the maximum amount
> of PGs that it can be involved in a recovery process at once.  If you had
> 56 PGs degraded and all of them were on a single OSD, then a value of 56
> would tell all of them to be able to run at the same time.  If they were
> more spread out across your cluster, then a lower setting would still allow
> all of the PGs to recover at the same time.
>
> Now you're talking about how many objects are recovering a second.  Note
> that your PGs are recovering and not backfilling.  Backfilling is moving
> all of the data for a PG from 1 OSD to another.  All objects need to be
> recovered and you'll see a much higher number of objects/second.  Recovery
> is just catching up after an OSD has been down for a bit, but never marked
> out.  It only needs to catch up on the objects that have been altered,
> created, or deleted since it was last caught up for the PG.  When the PG
> finishes it's recovery state and is in a healthy state again, all of the
> objects that were in it but that didn't need to catch up are all at once
> marked recovered and you'll see spikes in your objects/second recovery.
>
> Your scenario (marking an OSD out to see how the cluster rebounds)
> shouldn't have a lot of PGs in recovery they should all be in backfill
> because the data needs to shift between OSDs.  I'm guessing that had
> something to do with the OSD still being up while it was marked down or
> that you had some other OSDs in your cluster be marked down due to not
> responding or possibly being restarted due to an OOM killer from the
> kernel.  What is your current `ceph status`?
>
> On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
>
>> Thank you for the admin socket information and the hint to Luminous, I
>> will try it out when I have the time.
>>
>> What I noticed when looking at ceph -w is that the number of objects per
>> second recovering is still very low.
>> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills
>> to very high numbers (4096, just to be sure).
>> Most of the time it is something like ‚0 objects/s recovering‘ or less
>> than ‚10 objects/s recovering‘, for example:
>>
>> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 42131 kB/s, 3 objects/s recovering
>> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 9655 kB/s, 2 objects/s recovering
>> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 2034 kB/s, 0 objects/s recovering
>> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30553/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 255 MB/s, 0 objects/s recovering
>> 2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
>> 30553/1376215 objects degraded (2.220%); 12203/1376215 objects misplaced
>> (0.887%); 254 MB/s, 0 objects/s recovering
>> 2017-09-20 15:41:17.379183 mon.0 [INF] pgmap v16034: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
>> 30549/1376215 objects degraded (2.220%); 12201/1376215 objects misplaced
>> (0.887%); 21868 kB/s, 3 objects/s recovering
>>
>> Is this an acceptable recovery rate? Unfortunately I have no point of
>> reference. My internal OSD network throughput is 500MBit/s (in a
>> virtualized Amazon EC2 environment).
>>
>> Am 20.09.2017 um 17:45 schrieb David Turner :
>>
>> You can always check what settings your daemons are running by querying
>> the admin

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
When you posted your ceph status, you only had 56 PGs degraded.  Any value
of osd_max_backfills or osd_recovery_max_active over 56 would not do
anything.  What these settings do is dictate to each OSD the maximum amount
of PGs that it can be involved in a recovery process at once.  If you had
56 PGs degraded and all of them were on a single OSD, then a value of 56
would tell all of them to be able to run at the same time.  If they were
more spread out across your cluster, then a lower setting would still allow
all of the PGs to recover at the same time.

Now you're talking about how many objects are recovering a second.  Note
that your PGs are recovering and not backfilling.  Backfilling is moving
all of the data for a PG from 1 OSD to another.  All objects need to be
recovered and you'll see a much higher number of objects/second.  Recovery
is just catching up after an OSD has been down for a bit, but never marked
out.  It only needs to catch up on the objects that have been altered,
created, or deleted since it was last caught up for the PG.  When the PG
finishes its recovery state and is in a healthy state again, all of the
objects that were in it but that didn't need to catch up are all at once
marked recovered and you'll see spikes in your objects/second recovery.

Your scenario (marking an OSD out to see how the cluster rebounds)
shouldn't have a lot of PGs in recovery; they should all be in backfill
because the data needs to shift between OSDs.  I'm guessing that had
something to do with the OSD still being up while it was marked down or
that you had some other OSDs in your cluster be marked down due to not
responding or possibly being restarted due to an OOM killer from the
kernel.  What is your current `ceph status`?

On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Thank you for the admin socket information and the hint to Luminous, I
> will try it out when I have the time.
>
> What I noticed when looking at ceph -w is that the number of objects per
> second recovering is still very low.
> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills
> to very high numbers (4096, just to be sure).
> Most of the time it is something like ‚0 objects/s recovering‘ or less
> than ‚10 objects/s recovering‘, for example:
>
> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
> (0.887%); 42131 kB/s, 3 objects/s recovering
> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
> (0.887%); 9655 kB/s, 2 objects/s recovering
> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30554/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
> (0.887%); 2034 kB/s, 0 objects/s recovering
> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30553/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
> (0.887%); 255 MB/s, 0 objects/s recovering
> 2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
> 30553/1376215 objects degraded (2.220%); 12203/1376215 objects misplaced
> (0.887%); 254 MB/s, 0 objects/s recovering
> 2017-09-20 15:41:17.379183 mon.0 [INF] pgmap v16034: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
> 30549/1376215 objects degraded (2.220%); 12201/1376215 objects misplaced
> (0.887%); 21868 kB/s, 3 objects/s recovering
>
> Is this an acceptable recovery rate? Unfortunately I have no point of
> reference. My internal OSD network throughput is 500MBit/s (in a
> virtualized Amazon EC2 environment).
>
> Am 20.09.2017 um 17:45 schrieb David Turner :
>
> You can always check what settings your daemons are running by querying
> the admin socket.  I'm linking you to the kraken version of the docs.
> AFAIK, the "unchangeable" is wrong, especially for these settings.  I don't
> know why it's there, but you can always query the admin socket to see your
> currently running settings to make sure that they took effect.
>
>
> http://docs.ceph.com/docs/kraken/rados/operations/monitoring/#using-the-admin-socket
>
> On Wed, Sep 20, 2017 at 11:42 AM David T

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Thank you for the admin socket information and the hint to Luminous, I will try 
it out when I have the time.

What I noticed when looking at ceph -w is that the number of objects per second 
recovering is still very low.
Meanwhile I set the options osd_recovery_max_active and osd_max_backfills to 
very high numbers (4096, just to be sure).
Most of the time it is something like ‚0 objects/s recovering‘ or less than ‚10 
objects/s recovering‘, for example:

2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 42131 kB/s, 3 
objects/s recovering
2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 9655 kB/s, 2 
objects/s recovering
2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 2034 kB/s, 0 
objects/s recovering
2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30553/1376215 objects 
degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 255 MB/s, 0 
objects/s recovering
2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail; 30553/1376215 objects 
degraded (2.220%); 12203/1376215 objects misplaced (0.887%); 254 MB/s, 0 
objects/s recovering
2017-09-20 15:41:17.379183 mon.0 [INF] pgmap v16034: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail; 30549/1376215 objects 
degraded (2.220%); 12201/1376215 objects misplaced (0.887%); 21868 kB/s, 3 
objects/s recovering

Is this an acceptable recovery rate? Unfortunately I have no point of 
reference. My internal OSD network throughput is 500MBit/s (in a virtualized 
Amazon EC2 environment).

> Am 20.09.2017 um 17:45 schrieb David Turner :
> 
> You can always check what settings your daemons are running by querying the 
> admin socket.  I'm linking you to the kraken version of the docs.  AFAIK, the 
> "unchangeable" is wrong, especially for these settings.  I don't know why 
> it's there, but you can always query the admin socket to see your currently 
> running settings to make sure that they took effect.
> 
> http://docs.ceph.com/docs/kraken/rados/operations/monitoring/#using-the-admin-socket
>  
> 
> On Wed, Sep 20, 2017 at 11:42 AM David Turner  > wrote:
> You are currently on Kraken, but if you upgrade to Luminous you'll gain 
> access to the new setting `osd_recovery_sleep` which you can tweak.
> 
> The best way to deal with recovery speed vs client IO is to be aware of what 
> your cluster does.  If you have a time of day that you don't have much client 
> IO, then you can increase your recovery during that time.  Otherwise your 
> best bet is to do testing with these settings while watching `iostat -x 1` on 
> your OSDs to see what settings you need to maintaining something around 80% 
> disk utilization while client IO and recovery is happening.  That will ensure 
> that your clients have some overhead to not notice the recovery.  If Client 
> IO isn't so important that they aren't aware of a minor speed decrease during 
> recovery, then you can aim for closer to 100% disk utilization with both 
> client IO and recovery happening.
> 
> On Wed, Sep 20, 2017 at 11:30 AM Jean-Charles Lopez  > wrote:
> Hi,
> 
> you can play with the following 2 parameters:
> osd_recovery_max_active
> osd_max_backfills
> 
> The higher the number the higher the number of PGs being processed at the 
> same time.
> 
> Regards
> Jean-Charles LOPEZ
> jeanchlo...@mac.com 
> 
> 
> 
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com 
> +1 408-680-6959 
> 
>> On Sep 20, 2017, at 08:26, Jonas Jaszkowic > > wrote:
>> 
>> Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
>> option. Recovery is now working faster. 
>> 
>> What is the best way to make recovery as fast as possible assuming that I do 
>> not 

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
You can always check what settings your daemons are running by querying the
admin socket.  I'm linking you to the kraken version of the docs.  AFAIK,
the "unchangeable" is wrong, especially for these settings.  I don't know
why it's there, but you can always query the admin socket to see your
currently running settings to make sure that they took effect.

http://docs.ceph.com/docs/kraken/rados/operations/monitoring/#using-the-admin-socket
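
For example, from the host the OSD runs on (assuming the default socket location):

ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active
# or point at the socket file directly
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'backfills|recovery_max'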

On Wed, Sep 20, 2017 at 11:42 AM David Turner  wrote:

> You are currently on Kraken, but if you upgrade to Luminous you'll gain
> access to the new setting `osd_recovery_sleep` which you can tweak.
>
> The best way to deal with recovery speed vs client IO is to be aware of
> what your cluster does.  If you have a time of day that you don't have much
> client IO, then you can increase your recovery during that time.  Otherwise
> your best bet is to do testing with these settings while watching `iostat
> -x 1` on your OSDs to see what settings you need to maintaining something
> around 80% disk utilization while client IO and recovery is happening.
> That will ensure that your clients have some overhead to not notice the
> recovery.  If Client IO isn't so important that they aren't aware of a
> minor speed decrease during recovery, then you can aim for closer to 100%
> disk utilization with both client IO and recovery happening.
>
> On Wed, Sep 20, 2017 at 11:30 AM Jean-Charles Lopez 
> wrote:
>
>> Hi,
>>
>> you can play with the following 2 parameters:
>> osd_recovery_max_active
>> osd_max_backfills
>>
>> The higher the number the higher the number of PGs being processed at the
>> same time.
>>
>> Regards
>> Jean-Charles LOPEZ
>> jeanchlo...@mac.com
>>
>>
>>
>> JC Lopez
>> Senior Technical Instructor, Global Storage Consulting Practice
>> Red Hat, Inc.
>> jelo...@redhat.com
>> +1 408-680-6959
>>
>> On Sep 20, 2017, at 08:26, Jonas Jaszkowic 
>> wrote:
>>
>> Thank you, that is very helpful. I didn’t know about the *osd_max_backfills
>> *option. Recovery is now working faster.
>>
>> What is the best way to make recovery as fast as possible assuming that I
>> do not care about read/write speed? (Besides
>> setting *osd_max_backfills *as high as possible). Are there any
>> important options that I have to know?
>>
>> What is the best practice to deal with the issue recovery speed vs.
>> read/write speed during a recovery situation? Do you
>> have any suggestions/references/hints how to deal with such situations?
>>
>>
>> Am 20.09.2017 um 16:45 schrieb David Turner :
>>
>> To help things look a little better, I would also stop the daemon for
>> osd.6 and mark it down `ceph osd down 6`.  Note that if the OSD is still
>> running it will likely mark itself back up and in on its own.  I don't
>> think that the OSD still running and being up in the cluster is causing the
>> issue, but it might.  After that, I would increase how many PGs can recover
>> at the same time by increasing osd_max_backfills `ceph tell osd.*
>> injectargs '--osd_max_backfills=5'`.  Note that for production you'll want
>> to set this number to something that doesn't negatively impact your client
>> IO, but high enough to help recover your cluster faster.  You can figure
>> out that number by increasing it 1 at a time and watching the OSD
>> performance with `iostat -x 1` or something to see how heavily used the
>> OSDs are during your normal usage and again during recover while testing
>> the settings.  For testing, you can set it as high as you'd like (probably
>> no need to go above 20 as that will likely saturate your disks'
>> performance) to get the PGs out of the wait status and into active recovery
>> and backfilling.
>>
>> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
>> jonasjaszkowic.w...@gmail.com> wrote:
>>
>>> Output of *ceph status*:
>>>
>>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>>  health HEALTH_WARN
>>> 1 pgs backfill_wait
>>> 56 pgs degraded
>>> 1 pgs recovering
>>> 55 pgs recovery_wait
>>> 56 pgs stuck degraded
>>> 57 pgs stuck unclean
>>> recovery 50570/1369003 objects degraded (3.694%)
>>> recovery 854/1369003 objects misplaced (0.062%)
>>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
>>> election epoch 4, quorum 0 ip-172-31-16-102
>>> mgr active: ip-172-31-16-102
>>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>>> 2923 GB used, 6836 GB / 9760 GB avail
>>> 50570/1369003 objects degraded (3.694%)
>>> 854/1369003 objects misplaced (0.062%)
>>>  199 active+clean
>>>   55 active+recovery_wait+degraded
>>>1 active+remapped+backfill_wait
>>>1 active+recovering+de

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
You are currently on Kraken, but if you upgrade to Luminous you'll gain
access to the new setting `osd_recovery_sleep` which you can tweak.

The best way to deal with recovery speed vs client IO is to be aware of
what your cluster does.  If you have a time of day that you don't have much
client IO, then you can increase your recovery during that time.  Otherwise
your best bet is to do testing with these settings while watching `iostat
-x 1` on your OSDs to see what settings you need to maintaining something
around 80% disk utilization while client IO and recovery is happening.
That will ensure that your clients have some overhead to not notice the
recovery.  If Client IO isn't so important that they aren't aware of a
minor speed decrease during recovery, then you can aim for closer to 100%
disk utilization with both client IO and recovery happening.
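
As a concrete sketch of that testing loop (the numbers are examples to start from, not recommendations):

# on an OSD host, watch %util while recovery is running
iostat -x 1
# raise the recovery knobs one step at a time and re-check
ceph tell osd.* injectargs '--osd_max_backfills=2 --osd_recovery_max_active=3'
# on Luminous you can additionally throttle with a per-op sleep (seconds)
ceph tell osd.* injectargs '--osd_recovery_sleep=0.1'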

On Wed, Sep 20, 2017 at 11:30 AM Jean-Charles Lopez 
wrote:

> Hi,
>
> you can play with the following 2 parameters:
> osd_recovery_max_active
> osd_max_backfills
>
> The higher the number the higher the number of PGs being processed at the
> same time.
>
> Regards
> Jean-Charles LOPEZ
> jeanchlo...@mac.com
>
>
>
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com
> +1 408-680-6959
>
> On Sep 20, 2017, at 08:26, Jonas Jaszkowic 
> wrote:
>
> Thank you, that is very helpful. I didn’t know about the *osd_max_backfills
> *option. Recovery is now working faster.
>
> What is the best way to make recovery as fast as possible assuming that I
> do not care about read/write speed? (Besides
> setting *osd_max_backfills *as high as possible). Are there any important
> options that I have to know?
>
> What is the best practice to deal with the issue recovery speed vs.
> read/write speed during a recovery situation? Do you
> have any suggestions/references/hints how to deal with such situations?
>
>
> Am 20.09.2017 um 16:45 schrieb David Turner :
>
> To help things look a little better, I would also stop the daemon for
> osd.6 and mark it down `ceph osd down 6`.  Note that if the OSD is still
> running it will likely mark itself back up and in on its own.  I don't
> think that the OSD still running and being up in the cluster is causing the
> issue, but it might.  After that, I would increase how many PGs can recover
> at the same time by increasing osd_max_backfills `ceph tell osd.*
> injectargs '--osd_max_backfills=5'`.  Note that for production you'll want
> to set this number to something that doesn't negatively impact your client
> IO, but high enough to help recover your cluster faster.  You can figure
> out that number by increasing it 1 at a time and watching the OSD
> performance with `iostat -x 1` or something to see how heavily used the
> OSDs are during your normal usage and again during recover while testing
> the settings.  For testing, you can set it as high as you'd like (probably
> no need to go above 20 as that will likely saturate your disks'
> performance) to get the PGs out of the wait status and into active recovery
> and backfilling.
>
> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
>
>> Output of *ceph status*:
>>
>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>  health HEALTH_WARN
>> 1 pgs backfill_wait
>> 56 pgs degraded
>> 1 pgs recovering
>> 55 pgs recovery_wait
>> 56 pgs stuck degraded
>> 57 pgs stuck unclean
>> recovery 50570/1369003 objects degraded (3.694%)
>> recovery 854/1369003 objects misplaced (0.062%)
>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
>> election epoch 4, quorum 0 ip-172-31-16-102
>> mgr active: ip-172-31-16-102
>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>> 2923 GB used, 6836 GB / 9760 GB avail
>> 50570/1369003 objects degraded (3.694%)
>> 854/1369003 objects misplaced (0.062%)
>>  199 active+clean
>>   55 active+recovery_wait+degraded
>>1 active+remapped+backfill_wait
>>1 active+recovering+degraded
>>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>>
>> Output of* ceph osd tree*:
>>
>> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>  -1 9.83984 root default
>>  -2 0.30750 host ip-172-31-24-96
>>   0 0.30750 osd.0  up  1.0  1.0
>>  -3 0.30750 host ip-172-31-30-32
>>   1 0.30750 osd.1  up  1.0  1.0
>>  -4 0.30750 host ip-172-31-28-36
>>   2 0.30750 osd.2  up  1.0  1.0
>>  -5 0.30750 host ip-172-31-18-100
>>   3 0.30750 osd.3   

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
So to speed things up I would basically do the following two things:

ceph tell osd.* injectargs "\'--osd_max_backfills=\'"
ceph tell osd.* injectargs "\'--osd_recovery_max_active=\'"

The second command returns the following output:

osd.0: osd_recovery_max_active = '5' (unchangeable)

The error code is 0, but what does the ‚unchangeable‘ mean? Am I inserting the 
options correctly?


> Am 20.09.2017 um 17:30 schrieb Jean-Charles Lopez :
> 
> Hi,
> 
> you can play with the following 2 parameters:
> osd_recovery_max_active
> osd_max_backfills
> 
> The higher the number the higher the number of PGs being processed at the 
> same time.
> 
> Regards
> Jean-Charles LOPEZ
> jeanchlo...@mac.com 
> 
> 
> 
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com 
> +1 408-680-6959
> 
>> On Sep 20, 2017, at 08:26, Jonas Jaszkowic > > wrote:
>> 
>> Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
>> option. Recovery is now working faster. 
>> 
>> What is the best way to make recovery as fast as possible assuming that I do 
>> not care about read/write speed? (Besides
>> setting osd_max_backfills as high as possible). Are there any important 
>> options that I have to know?
>> 
>> What is the best practice to deal with the issue recovery speed vs. 
>> read/write speed during a recovery situation? Do you
>> have any suggestions/references/hints how to deal with such situations?
>> 
>> 
>>> Am 20.09.2017 um 16:45 schrieb David Turner >> >:
>>> 
>>> To help things look a little better, I would also stop the daemon for osd.6 
>>> and mark it down `ceph osd down 6`.  Note that if the OSD is still running 
>>> it will likely mark itself back up and in on its own.  I don't think that 
>>> the OSD still running and being up in the cluster is causing the issue, but 
>>> it might.  After that, I would increase how many PGs can recover at the 
>>> same time by increasing osd_max_backfills `ceph tell osd.* injectargs 
>>> '--osd_max_backfills=5'`.  Note that for production you'll want to set this 
>>> number to something that doesn't negatively impact your client IO, but high 
>>> enough to help recover your cluster faster.  You can figure out that number 
>>> by increasing it 1 at a time and watching the OSD performance with `iostat 
>>> -x 1` or something to see how heavily used the OSDs are during your normal 
>>> usage and again during recover while testing the settings.  For testing, 
>>> you can set it as high as you'd like (probably no need to go above 20 as 
>>> that will likely saturate your disks' performance) to get the PGs out of 
>>> the wait status and into active recovery and backfilling.
>>> 
>>> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic 
>>> mailto:jonasjaszkowic.w...@gmail.com>> 
>>> wrote:
>>> Output of ceph status:
>>> 
>>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>>  health HEALTH_WARN
>>> 1 pgs backfill_wait
>>> 56 pgs degraded
>>> 1 pgs recovering
>>> 55 pgs recovery_wait
>>> 56 pgs stuck degraded
>>> 57 pgs stuck unclean
>>> recovery 50570/1369003 objects degraded (3.694%)
>>> recovery 854/1369003 objects misplaced (0.062%)
>>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0 
>>> }
>>> election epoch 4, quorum 0 ip-172-31-16-102
>>> mgr active: ip-172-31-16-102
>>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>>> 2923 GB used, 6836 GB / 9760 GB avail
>>> 50570/1369003 objects degraded (3.694%)
>>> 854/1369003 objects misplaced (0.062%)
>>>  199 active+clean
>>>   55 active+recovery_wait+degraded
>>>1 active+remapped+backfill_wait
>>>1 active+recovering+degraded
>>>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>>> 
>>> Output of ceph osd tree:
>>> 
>>> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>  -1 9.83984 root default
>>>  -2 0.30750 host ip-172-31-24-96
>>>   0 0.30750 osd.0  up  1.0  1.0
>>>  -3 0.30750 host ip-172-31-30-32
>>>   1 0.30750 osd.1  up  1.0  1.0
>>>  -4 0.30750 host ip-172-31-28-36
>>>   2 0.30750 osd.2  up  1.0  1.0
>>>  -5 0.30750 host ip-172-31-18-100
>>>   3 0.30750 osd.3  up  1.0  1.0
>>>  -6 0.30750 host ip-172-31-25-240
>>>   4 0.30750 osd.4  up  1.0  1.0
>>>  -7 0.30750 host ip-172-31-24

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jean-Charles Lopez
Hi,

you can play with the following 2 parameters:
osd_recovery_max_active
osd_max_backfills

The higher the number the higher the number of PGs being processed at the same 
time.

Regards
Jean-Charles LOPEZ
jeanchlo...@mac.com



JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

> On Sep 20, 2017, at 08:26, Jonas Jaszkowic  
> wrote:
> 
> Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
> option. Recovery is now working faster. 
> 
> What is the best way to make recovery as fast as possible assuming that I do 
> not care about read/write speed? (Besides
> setting osd_max_backfills as high as possible). Are there any important 
> options that I have to know?
> 
> What is the best practice to deal with the issue recovery speed vs. 
> read/write speed during a recovery situation? Do you
> have any suggestions/references/hints how to deal with such situations?
> 
> 
>> Am 20.09.2017 um 16:45 schrieb David Turner > >:
>> 
>> To help things look a little better, I would also stop the daemon for osd.6 
>> and mark it down `ceph osd down 6`.  Note that if the OSD is still running 
>> it will likely mark itself back up and in on its own.  I don't think that 
>> the OSD still running and being up in the cluster is causing the issue, but 
>> it might.  After that, I would increase how many PGs can recover at the same 
>> time by increasing osd_max_backfills `ceph tell osd.* injectargs 
>> '--osd_max_backfills=5'`.  Note that for production you'll want to set this 
>> number to something that doesn't negatively impact your client IO, but high 
>> enough to help recover your cluster faster.  You can figure out that number 
>> by increasing it 1 at a time and watching the OSD performance with `iostat 
>> -x 1` or something to see how heavily used the OSDs are during your normal 
>> usage and again during recover while testing the settings.  For testing, you 
>> can set it as high as you'd like (probably no need to go above 20 as that 
>> will likely saturate your disks' performance) to get the PGs out of the wait 
>> status and into active recovery and backfilling.
>> 
>> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic 
>> mailto:jonasjaszkowic.w...@gmail.com>> wrote:
>> Output of ceph status:
>> 
>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>  health HEALTH_WARN
>> 1 pgs backfill_wait
>> 56 pgs degraded
>> 1 pgs recovering
>> 55 pgs recovery_wait
>> 56 pgs stuck degraded
>> 57 pgs stuck unclean
>> recovery 50570/1369003 objects degraded (3.694%)
>> recovery 854/1369003 objects misplaced (0.062%)
>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0 
>> }
>> election epoch 4, quorum 0 ip-172-31-16-102
>> mgr active: ip-172-31-16-102
>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>> 2923 GB used, 6836 GB / 9760 GB avail
>> 50570/1369003 objects degraded (3.694%)
>> 854/1369003 objects misplaced (0.062%)
>>  199 active+clean
>>   55 active+recovery_wait+degraded
>>1 active+remapped+backfill_wait
>>1 active+recovering+degraded
>>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>> 
>> Output of ceph osd tree:
>> 
>> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>  -1 9.83984 root default
>>  -2 0.30750 host ip-172-31-24-96
>>   0 0.30750 osd.0  up  1.0  1.0
>>  -3 0.30750 host ip-172-31-30-32
>>   1 0.30750 osd.1  up  1.0  1.0
>>  -4 0.30750 host ip-172-31-28-36
>>   2 0.30750 osd.2  up  1.0  1.0
>>  -5 0.30750 host ip-172-31-18-100
>>   3 0.30750 osd.3  up  1.0  1.0
>>  -6 0.30750 host ip-172-31-25-240
>>   4 0.30750 osd.4  up  1.0  1.0
>>  -7 0.30750 host ip-172-31-24-110
>>   5 0.30750 osd.5  up  1.0  1.0
>>  -8 0.30750 host ip-172-31-20-245
>>   6 0.30750 osd.6  up0  1.0
>>  -9 0.30750 host ip-172-31-17-241
>>   7 0.30750 osd.7  up  1.0  1.0
>> -10 0.30750 host ip-172-31-18-107
>>   8 0.30750 osd.8  up  1.0  1.0
>> -11 0.30750 host ip-172-31-21-170
>>   9 0.30750 osd.9  up  1.0  1.0
>> -12 0.30750 host ip-172-31-21-29
>>  10 0.30750 osd.10 up  1.0

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
option. Recovery is now working faster. 

What is the best way to make recovery as fast as possible assuming that I do 
not care about read/write speed? (Besides
setting osd_max_backfills as high as possible). Are there any important options 
that I have to know?

What is the best practice to deal with the issue recovery speed vs. read/write 
speed during a recovery situation? Do you
have any suggestions/references/hints how to deal with such situations?
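
(A hedged sketch of knobs often mentioned for this trade-off; the values are
only examples and defaults differ between releases:)

ceph tell osd.* injectargs '--osd_recovery_op_priority=1'   # deprioritise recovery when client IO matters
ceph tell osd.* injectargs '--osd_client_op_priority=63'    # keep client ops at the highest priority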


> Am 20.09.2017 um 16:45 schrieb David Turner :
> 
> To help things look a little better, I would also stop the daemon for osd.6 
> and mark it down `ceph osd down 6`.  Note that if the OSD is still running it 
> will likely mark itself back up and in on its own.  I don't think that the 
> OSD still running and being up in the cluster is causing the issue, but it 
> might.  After that, I would increase how many PGs can recover at the same 
> time by increasing osd_max_backfills `ceph tell osd.* injectargs 
> '--osd_max_backfills=5'`.  Note that for production you'll want to set this 
> number to something that doesn't negatively impact your client IO, but high 
> enough to help recover your cluster faster.  You can figure out that number 
> by increasing it 1 at a time and watching the OSD performance with `iostat -x 
> 1` or something to see how heavily used the OSDs are during your normal usage 
> and again during recover while testing the settings.  For testing, you can 
> set it as high as you'd like (probably no need to go above 20 as that will 
> likely saturate your disks' performance) to get the PGs out of the wait 
> status and into active recovery and backfilling.
> 
> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic 
> mailto:jonasjaszkowic.w...@gmail.com>> wrote:
> Output of ceph status:
> 
> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>  health HEALTH_WARN
> 1 pgs backfill_wait
> 56 pgs degraded
> 1 pgs recovering
> 55 pgs recovery_wait
> 56 pgs stuck degraded
> 57 pgs stuck unclean
> recovery 50570/1369003 objects degraded (3.694%)
> recovery 854/1369003 objects misplaced (0.062%)
>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0 
> }
> election epoch 4, quorum 0 ip-172-31-16-102
> mgr active: ip-172-31-16-102
>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
> flags sortbitwise,require_jewel_osds,require_kraken_osds
>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
> 2923 GB used, 6836 GB / 9760 GB avail
> 50570/1369003 objects degraded (3.694%)
> 854/1369003 objects misplaced (0.062%)
>  199 active+clean
>   55 active+recovery_wait+degraded
>1 active+remapped+backfill_wait
>1 active+recovering+degraded
>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
> 
> Output of ceph osd tree:
> 
> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 9.83984 root default
>  -2 0.30750 host ip-172-31-24-96
>   0 0.30750 osd.0  up  1.0  1.0
>  -3 0.30750 host ip-172-31-30-32
>   1 0.30750 osd.1  up  1.0  1.0
>  -4 0.30750 host ip-172-31-28-36
>   2 0.30750 osd.2  up  1.0  1.0
>  -5 0.30750 host ip-172-31-18-100
>   3 0.30750 osd.3  up  1.0  1.0
>  -6 0.30750 host ip-172-31-25-240
>   4 0.30750 osd.4  up  1.0  1.0
>  -7 0.30750 host ip-172-31-24-110
>   5 0.30750 osd.5  up  1.0  1.0
>  -8 0.30750 host ip-172-31-20-245
>   6 0.30750 osd.6  up0  1.0
>  -9 0.30750 host ip-172-31-17-241
>   7 0.30750 osd.7  up  1.0  1.0
> -10 0.30750 host ip-172-31-18-107
>   8 0.30750 osd.8  up  1.0  1.0
> -11 0.30750 host ip-172-31-21-170
>   9 0.30750 osd.9  up  1.0  1.0
> -12 0.30750 host ip-172-31-21-29
>  10 0.30750 osd.10 up  1.0  1.0
> -13 0.30750 host ip-172-31-23-220
>  11 0.30750 osd.11 up  1.0  1.0
> -14 0.30750 host ip-172-31-24-154
>  12 0.30750 osd.12 up  1.0  1.0
> -15 0.30750 host ip-172-31-26-25
>  13 0.30750 osd.13 up  1.0  1.0
> -16 0.30750 host ip-172-31-20-28
>  14 0.30750 osd.14 up  1.0  1.0
> -17 0.30750 host ip-172-31-23-90
>  15 0.30750 osd.15 up  1.0   

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
Anyone?  Can this page be saved?  If not what are my options?
Regards,
Hong

On Saturday, September 16, 2017 1:55 AM, hjcho616  
wrote:
 

Looking better... working on scrubbing..

HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete;
12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean;
109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed;
mds cluster is degraded; noout flag(s) set; no legacy OSD present but
'sortbitwise' flag is not set

Now PG1.28.. looking at all old osds dead or alive.  Only one with DIR_*
directory is in osd.4.  This appears to be metadata pool!  21M of metadata can
be quite a bit of stuff.. so I would like to rescue this!  But I am not able to
start this OSD.  Exporting through ceph-objectstore-tool appears to crash.
Even with --skip-journal-replay and --skip-mount-omap (different failure).  As
I mentioned in earlier email, that exception thrown message is bogus...

# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export
terminate called after throwing an instance of 'std::domain_error'
  what():  coll_t::decode(): don't know how to decode version 1
*** Caught signal (Aborted) **
 in thread 7f812e7fb940 thread_name:ceph-objectstor
 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x996a57) [0x55dee175fa57]
 2: (()+0x110c0) [0x7f812d0050c0]
 3: (gsignal()+0xcf) [0x7f812b438fcf]
 4: (abort()+0x16a) [0x7f812b43a3fa]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f812bd1fb3d]
 6: (()+0x5ebb6) [0x7f812bd1dbb6]
 7: (()+0x5ec01) [0x7f812bd1dc01]
 8: (()+0x5ee19) [0x7f812bd1de19]
 9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55dee143001e]
 10: (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) [0x55dee156d5f5]
 11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55dee1562bb9]
 12: (DBObjectMap::init(bool)+0x288) [0x55dee1561eb8]
 13: (FileStore::mount()+0x2525) [0x55dee1498eb5]
 14: (main()+0x28c0) [0x55dee10c9400]
 15: (__libc_start_main()+0xf1) [0x7f812b4262b1]
 16: (()+0x34f747) [0x55dee1118747]
Aborted

# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export --skip-journal-replay
terminate called after throwing an instance of 'std::domain_error'
  what():  coll_t::decode(): don't know how to decode version 1
*** Caught signal (Aborted) **
 in thread 7fa6d087b940 thread_name:ceph-objectstor
 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x996a57) [0x55abd356aa57]
 2: (()+0x110c0) [0x7fa6cf0850c0]
 3: (gsignal()+0xcf) [0x7fa6cd4b8fcf]
 4: (abort()+0x16a) [0x7fa6cd4ba3fa]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fa6cdd9fb3d]
 6: (()+0x5ebb6) [0x7fa6cdd9dbb6]
 7: (()+0x5ec01) [0x7fa6cdd9dc01]
 8: (()+0x5ee19) [0x7fa6cdd9de19]
 9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55abd323b01e]
 10: (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) [0x55abd33785f5]
 11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55abd336dbb9]
 12: (DBObjectMap::init(bool)+0x288) [0x55abd336ceb8]
 13: (FileStore::mount()+0x2525) [0x55abd32a3eb5]
 14: (main()+0x28c0) [0x55abd2ed4400]
 15: (__libc_start_main()+0xf1) [0x7fa6cd4a62b1]
 16: (()+0x34f747) [0x55abd2f23747]
Aborted

# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export --skip-mount-omap
ceph-objectstore-tool: /usr/include/boost/smart_ptr/scoped_ptr.hpp:99: T* boost::scoped_ptr::operator->() const [with T = ObjectMap]: Assertion `px != 0' failed.
*** Caught signal (Aborted) **
 in thread 7f14345c5940 thread_name:ceph-objectstor
 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x996a57) [0x5575b50a9a57]
 2: (()+0x110c0) [0x7f1432dcf0c0]
 3: (gsignal()+0xcf) [0x7f1431202fcf]
 4: (abort()+0x16a) [0x7f14312043fa]
 5: (()+0x2be37) [0x7f14311fbe37]
 6: (()+0x2bee2) [0x7f14311fbee2]
 7: (()+0x2fa19c) [0x5575b4a0d19c]
 8: (FileStore::omap_get_values(coll_t const&, ghobject_t const&, std::set, std::allocator > const&, std::map, std::allocator > >*)+0x6c2) [0x5575b4dc9322]
 9: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x235) [0x5575b4ab3135]
 10: (main()+0x5bd6) [0x5575b4a16716]
 11: (__libc_start_main()+0xf1) [0x7f14311f02b1]
 12: (()+0x34f747) [0x5575b4a62747]

When trying to bring up osd.4 we get this message.  Feels very similar to that
crash in first two above.

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x960e57) [0x5565e564ae57]
 2: (()+0x110c0) [0x7f34aa17e0c0]
 3: (gsignal()+0xcf) [0x7f34a81c4fcf]
 4: (abort()+0x16a) [0x7f34a81c63fa]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f34a8aabb3d]
 6: (()+0x5ebb6) [0x7f34a8aa9bb6]
 7: (()+0x5ec01) [0x7f34a8aa9

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
To help things look a little better, I would also stop the daemon for osd.6
and mark it down `ceph osd down 6`.  Note that if the OSD is still running
it will likely mark itself back up and in on its own.  I don't think that
the OSD still running and being up in the cluster is causing the issue, but
it might.  After that, I would increase how many PGs can recover at the
same time by increasing osd_max_backfills `ceph tell osd.* injectargs
'--osd_max_backfills=5'`.  Note that for production you'll want to set this
number to something that doesn't negatively impact your client IO, but high
enough to help recover your cluster faster.  You can figure out that number
by increasing it 1 at a time and watching the OSD performance with `iostat
-x 1` or something to see how heavily used the OSDs are during your normal
usage and again during recover while testing the settings.  For testing,
you can set it as high as you'd like (probably no need to go above 20 as
that will likely saturate your disks' performance) to get the PGs out of
the wait status and into active recovery and backfilling.
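
(A minimal sketch of the workflow described above; the value 5 is only an
example:)

ceph osd down 6                                     # mark the stopped OSD down
ceph tell osd.* injectargs '--osd_max_backfills=5'  # raise backfill concurrency
ceph -s                                             # watch PGs move from wait to active recovery/backfill
iostat -x 1                                         # on an OSD host, check how heavily the disks are used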

On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Output of *ceph status*:
>
> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>  health HEALTH_WARN
> 1 pgs backfill_wait
> 56 pgs degraded
> 1 pgs recovering
> 55 pgs recovery_wait
> 56 pgs stuck degraded
> 57 pgs stuck unclean
> recovery 50570/1369003 objects degraded (3.694%)
> recovery 854/1369003 objects misplaced (0.062%)
>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
> election epoch 4, quorum 0 ip-172-31-16-102
> mgr active: ip-172-31-16-102
>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
> flags sortbitwise,require_jewel_osds,require_kraken_osds
>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
> 2923 GB used, 6836 GB / 9760 GB avail
> 50570/1369003 objects degraded (3.694%)
> 854/1369003 objects misplaced (0.062%)
>  199 active+clean
>   55 active+recovery_wait+degraded
>1 active+remapped+backfill_wait
>1 active+recovering+degraded
>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>
> Output of* ceph osd tree*:
>
> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 9.83984 root default
>  -2 0.30750 host ip-172-31-24-96
>   0 0.30750 osd.0  up  1.0  1.0
>  -3 0.30750 host ip-172-31-30-32
>   1 0.30750 osd.1  up  1.0  1.0
>  -4 0.30750 host ip-172-31-28-36
>   2 0.30750 osd.2  up  1.0  1.0
>  -5 0.30750 host ip-172-31-18-100
>   3 0.30750 osd.3  up  1.0  1.0
>  -6 0.30750 host ip-172-31-25-240
>   4 0.30750 osd.4  up  1.0  1.0
>  -7 0.30750 host ip-172-31-24-110
>   5 0.30750 osd.5  up  1.0  1.0
>  -8 0.30750 host ip-172-31-20-245
>   6 0.30750 osd.6  up0  1.0
>  -9 0.30750 host ip-172-31-17-241
>   7 0.30750 osd.7  up  1.0  1.0
> -10 0.30750 host ip-172-31-18-107
>   8 0.30750 osd.8  up  1.0  1.0
> -11 0.30750 host ip-172-31-21-170
>   9 0.30750 osd.9  up  1.0  1.0
> -12 0.30750 host ip-172-31-21-29
>  10 0.30750 osd.10 up  1.0  1.0
> -13 0.30750 host ip-172-31-23-220
>  11 0.30750 osd.11 up  1.0  1.0
> -14 0.30750 host ip-172-31-24-154
>  12 0.30750 osd.12 up  1.0  1.0
> -15 0.30750 host ip-172-31-26-25
>  13 0.30750 osd.13 up  1.0  1.0
> -16 0.30750 host ip-172-31-20-28
>  14 0.30750 osd.14 up  1.0  1.0
> -17 0.30750 host ip-172-31-23-90
>  15 0.30750 osd.15 up  1.0  1.0
> -18 0.30750 host ip-172-31-31-197
>  16 0.30750 osd.16 up  1.0  1.0
> -19 0.30750 host ip-172-31-29-195
>  17 0.30750 osd.17 up  1.0  1.0
> -20 0.30750 host ip-172-31-28-9
>  18 0.30750 osd.18 up  1.0  1.0
> -21 0.30750 host ip-172-31-25-199
>  19 0.30750 osd.19 up  1.0  1.0
> -22 0.30750 host ip-172-31-25-187
>  20 0.30750 osd.20 up  1.0  1.0
> -23 0.30750 host ip-172-31-31-57
>  21 0.30750 osd.21 up  1.0  

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Output of ceph status:

cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
 health HEALTH_WARN
1 pgs backfill_wait
56 pgs degraded
1 pgs recovering
55 pgs recovery_wait
56 pgs stuck degraded
57 pgs stuck unclean
recovery 50570/1369003 objects degraded (3.694%)
recovery 854/1369003 objects misplaced (0.062%)
 monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
election epoch 4, quorum 0 ip-172-31-16-102
mgr active: ip-172-31-16-102
 osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
2923 GB used, 6836 GB / 9760 GB avail
50570/1369003 objects degraded (3.694%)
854/1369003 objects misplaced (0.062%)
 199 active+clean
  55 active+recovery_wait+degraded
   1 active+remapped+backfill_wait
   1 active+recovering+degraded
  client io 513 MB/s rd, 131 op/s rd, 0 op/s wr

Output of ceph osd tree:

ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 9.83984 root default
 -2 0.30750 host ip-172-31-24-96
  0 0.30750 osd.0  up  1.0  1.0
 -3 0.30750 host ip-172-31-30-32
  1 0.30750 osd.1  up  1.0  1.0
 -4 0.30750 host ip-172-31-28-36
  2 0.30750 osd.2  up  1.0  1.0
 -5 0.30750 host ip-172-31-18-100
  3 0.30750 osd.3  up  1.0  1.0
 -6 0.30750 host ip-172-31-25-240
  4 0.30750 osd.4  up  1.0  1.0
 -7 0.30750 host ip-172-31-24-110
  5 0.30750 osd.5  up  1.0  1.0
 -8 0.30750 host ip-172-31-20-245
  6 0.30750 osd.6  up0  1.0
 -9 0.30750 host ip-172-31-17-241
  7 0.30750 osd.7  up  1.0  1.0
-10 0.30750 host ip-172-31-18-107
  8 0.30750 osd.8  up  1.0  1.0
-11 0.30750 host ip-172-31-21-170
  9 0.30750 osd.9  up  1.0  1.0
-12 0.30750 host ip-172-31-21-29
 10 0.30750 osd.10 up  1.0  1.0
-13 0.30750 host ip-172-31-23-220
 11 0.30750 osd.11 up  1.0  1.0
-14 0.30750 host ip-172-31-24-154
 12 0.30750 osd.12 up  1.0  1.0
-15 0.30750 host ip-172-31-26-25
 13 0.30750 osd.13 up  1.0  1.0
-16 0.30750 host ip-172-31-20-28
 14 0.30750 osd.14 up  1.0  1.0
-17 0.30750 host ip-172-31-23-90
 15 0.30750 osd.15 up  1.0  1.0
-18 0.30750 host ip-172-31-31-197
 16 0.30750 osd.16 up  1.0  1.0
-19 0.30750 host ip-172-31-29-195
 17 0.30750 osd.17 up  1.0  1.0
-20 0.30750 host ip-172-31-28-9
 18 0.30750 osd.18 up  1.0  1.0
-21 0.30750 host ip-172-31-25-199
 19 0.30750 osd.19 up  1.0  1.0
-22 0.30750 host ip-172-31-25-187
 20 0.30750 osd.20 up  1.0  1.0
-23 0.30750 host ip-172-31-31-57
 21 0.30750 osd.21 up  1.0  1.0
-24 0.30750 host ip-172-31-20-64
 22 0.30750 osd.22 up  1.0  1.0
-25 0.30750 host ip-172-31-26-255
 23 0.30750 osd.23 up  1.0  1.0
-26 0.30750 host ip-172-31-18-146
 24 0.30750 osd.24 up  1.0  1.0
-27 0.30750 host ip-172-31-22-16
 25 0.30750 osd.25 up  1.0  1.0
-28 0.30750 host ip-172-31-26-152
 26 0.30750 osd.26 up  1.0  1.0
-29 0.30750 host ip-172-31-24-215
 27 0.30750 osd.27 up  1.0  1.0
-30 0.30750 host ip-172-31-24-138
 28 0.30750 osd.28 up  1.0  1.0
-31 0.30750 host ip-172-31-24-10
 29 0.30750 osd.29 up  1.0  1.0
-32 0.30750 host ip-172-31-20-79
 30 0.30750 osd.30 up  1.0  1.0
-33 0.30750 host ip-172-31-23-140
 31 0.30750 osd.31 up  1.0  1.0

Output of ceph health detail:

HEALTH_WARN 1 pgs backfill_wait; 55 pgs degraded; 1 pgs recovering; 54 pgs 
recovery_wait; 55 pgs stuck degraded; 56 pgs stuck unclean; recovery 
49688/1369003 objects degraded (3.630%); recovery 

Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-20 Thread Дробышевский , Владимир
Hello!

According to LSI2308 performance I have the following interesting
observation:

At the time of the initial tests I ran a number of benchmarks and chose
Samsung SM863 drives as the core of the all-flash pools and journals. The
tests were done on a workstation with an Intel Z97 chipset via an
internal SATA3 port.
And I got the following result (here is only part of the test; full
results can be found in the comment
to Sebastian Han's blog post, and a full explanation of the situation with the
SAS controller is here
):

$ sudo hdparm -W 0 /dev/sdc
$ sudo fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=10 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test
...
write: io=20243MB, bw=345477KB/s, *iops=86369*, runt= 60001msec
clat (usec): min=28, max=19841, *avg=115.18, stdev=42.61*
...

But after I put the same SSD into a server with the Intel RMS25LB

controller (it's part of the Intel AH2000JF6GKIT),
it shows a quarter fewer IOPS and worse latencies (more predictable, though):

$ sudo hdparm -W 0 /dev/sdc # actually does nothing
$ echo "temporary write through" | sudo tee
/sys/block/sdc/device/scsi_disk/0\:0\:0\:0/cache_type # does the actual
trick with the cache on the RMS25LB controller; without it, results are much worse
$ sudo fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=10 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test
...
write: io=14603MB, bw=249210KB/s, *iops=62302*, runt= 60002msec
clat (usec): min=66, max=947, *avg=158.92, stdev=30.05*
...

I don't have other compatible SAS controllers around to do more tests, and
I'm not sure this isn't connected with misconfiguration, but this is what
I see. I've run tests on different servers and with different SSD samples;
the results are the same.
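
(For reference, a way to check the cache mode the kernel currently reports for
the drive; the exact sysfs path may differ per controller:)

cat /sys/block/sdc/device/scsi_disk/*/cache_type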

RMS25LBs have the latest firmware (20.0).

By the way, if somebody can point me to settings that raise IOPS on the
RMS25LB controller, I'll be happy.

---
Best regards,
Vladimir


2017-09-20 18:08 GMT+05:00 Vincent Tondellier <
tondellier+ml.ceph-us...@dosisoft.fr>:

> Marc Roos wrote:
>
> > We use these :
> > NVDATA Product ID  : SAS9207-8i
> > Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308
> > PCI-Express Fusion-MPT SAS-2 (rev 05)
> >
> > Does someone by any chance know how to turn on the drive identification
> > lights?
>
> Tested with a MegaRAID SAS 2108 / DELL H700 :
>
> megacli -PDList -a0
>
> get the enclosure and drive number :
> Enclosure Device ID: 32
> Slot Number: 0
>
> megacli -PdLocate -start -physdrv '[32:0]' -a0
>
> >
> > -Original Message-
> > From: Jake Young [mailto:jak3kaj-re5jqeeqqe8avxtiumw...@public.gmane.org
> ]
> > Sent: dinsdag 19 september 2017 18:00
> > To: Kees Meijs; ceph-users-qp0ms5ga...@public.gmane.org
> > Subject: Re: [ceph-users] What HBA to choose? To expand or not to
> > expand?
> >
> >
> > On Tue, Sep 19, 2017 at 9:38 AM Kees Meijs
> >  wrote:
> >
> >
> > Hi Jake,
> >
> > On 19-09-17 15:14, Jake Young wrote:
> > > Ideally you actually want fewer disks per server and more
> > servers.
> > > This has been covered extensively in this mailing list. Rule of
> > thumb
> > > is that each server should have 10% or less of the capacity of
> > your
> > > cluster.
> >
> > That's very true, but let's focus on the HBA.
> >
> > > I didn't do extensive research to decide on this HBA, it's simply
> > what
> > > my server vendor offered. There are probably better, faster,
> > cheaper
> > > HBAs out there. A lot of people complain about LSI HBAs, but I am
> > > comfortable with them.
> >
> > Given a configuration our vendor offered it's about LSI/Avago
> > 9300-8i
> > with 8 drives connected individually using SFF8087 on a backplane
> > (e.g.
> > not an expander). Or, 24 drives using three HBAs (6xSFF8087 in
> > total)
> > when using a 4HE SuperMicro chassis with 24 drive bays.
> >
> > But, what are the LSI complaints about? Or, are the complaints
> > generic
> > to HBAs and/or cryptic CLI tools and not LSI specific?
> >
> >
> > Typically people rant about how much Megaraid/LSI support sucks. I've
> > been using LSI or MegaRAID for years and haven't had any big problems.
> >
> > I had some performance issues with Areca onboard SAS chips (non-Ceph
> > setup, 4 disks in a RAID10) and after about 6 months of troubleshooting
> > with the server vendor and Areca support they did patch the firmware and
> > resolve the issue.
> >
> >
> >
> >
> > > There is a management tool called storcli that can fully
> > configure the
> > > HBA in one or two command lines.  There's a command that
> > confi

Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-20 Thread Vincent Tondellier
Marc Roos wrote:

> We use these :
> NVDATA Product ID  : SAS9207-8i
> Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308
> PCI-Express Fusion-MPT SAS-2 (rev 05)
> 
> Does someone by any chance know how to turn on the drive identification
> lights?

Tested with a MegaRAID SAS 2108 / DELL H700 :

megacli -PDList -a0

get the enclosure and drive number :
Enclosure Device ID: 32
Slot Number: 0

megacli -PdLocate -start -physdrv '[32:0]' -a0

> 
> -Original Message-
> From: Jake Young [mailto:jak3kaj-re5jqeeqqe8avxtiumw...@public.gmane.org]
> Sent: dinsdag 19 september 2017 18:00
> To: Kees Meijs; ceph-users-qp0ms5ga...@public.gmane.org
> Subject: Re: [ceph-users] What HBA to choose? To expand or not to
> expand?
> 
> 
> On Tue, Sep 19, 2017 at 9:38 AM Kees Meijs
>  wrote:
> 
> 
> Hi Jake,
> 
> On 19-09-17 15:14, Jake Young wrote:
> > Ideally you actually want fewer disks per server and more
> servers.
> > This has been covered extensively in this mailing list. Rule of
> thumb
> > is that each server should have 10% or less of the capacity of
> your
> > cluster.
> 
> That's very true, but let's focus on the HBA.
> 
> > I didn't do extensive research to decide on this HBA, it's simply
> what
> > my server vendor offered. There are probably better, faster,
> cheaper
> > HBAs out there. A lot of people complain about LSI HBAs, but I am
> > comfortable with them.
> 
> Given a configuration our vendor offered it's about LSI/Avago
> 9300-8i
> with 8 drives connected individually using SFF8087 on a backplane
> (e.g.
> not an expander). Or, 24 drives using three HBAs (6xSFF8087 in
> total)
> when using a 4HE SuperMicro chassis with 24 drive bays.
> 
> But, what are the LSI complaints about? Or, are the complaints
> generic
> to HBAs and/or cryptic CLI tools and not LSI specific?
> 
> 
> Typically people rant about how much Megaraid/LSI support sucks. I've
> been using LSI or MegaRAID for years and haven't had any big problems.
> 
> I had some performance issues with Areca onboard SAS chips (non-Ceph
> setup, 4 disks in a RAID10) and after about 6 months of troubleshooting
> with the server vendor and Areca support they did patch the firmware and
> resolve the issue.
> 
> 
> 
> 
> > There is a management tool called storcli that can fully
> configure the
> > HBA in one or two command lines.  There's a command that
> configures
> > all attached disks as individual RAID0 disk groups. That command
> gets
> > run by salt when I provision a new osd server.
> 
> The thread I read was about Areca in JBOD but still able to utilise
> the
> cache, if I'm not mistaken. I'm not sure anymore if there was
> something
> mentioned about BBU.
> 
> 
> JBOD with WB cache would be nice so you can get smart data directly from
> the disks instead of having interrogate the HBA for the data.  This
> becomes more important once your cluster is stable and in production.
> 
> IMHO if there is unwritten data in a RAM chip, like when you enable WB
> cache, you really, really need a BBU. This is another nice thing about
> using SSD journals instead of HBAs in WB mode, the journaled data is
> safe on the SSD before the write is acknowledged.
> 
> 
> 
> 
> >
> > What many other people are doing is using the least expensive
> JBOD HBA
> > or the on board SAS controller in JBOD mode and then using SSD
> > journals. Save the money you would have spent on the fancy HBA
> for
> > fast, high endurance SSDs.
> 
> Thanks! And obviously I'm very interested in other comments as
> well.
> 
> Regards,
> Kees
> 
> ___
> ceph-users mailing list
> ceph-users-idqoxfivofjgjs9i8mt...@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-20 Thread Jake Young
On Wed, Sep 20, 2017 at 5:31 AM Marc Roos  wrote:

>
>
>
> We use these :
> NVDATA Product ID  : SAS9207-8i
> Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308
> PCI-Express Fusion-MPT SAS-2 (rev 05)
>
> Does someone by any chance know how to turn on the drive identification
> lights?
>

storcli64 /c0/e8/s1 start locate

Where c is the controller id, e is the enclosure id and s is the drive slot

Look for the PD List section in the output to see the enclosure id / slot
id list.

 storcli64 /c0 show
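
(and, presumably, the matching command to turn the light back off again:)

 storcli64 /c0/e8/s1 stop locate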


>
>
>
> -Original Message-
> From: Jake Young [mailto:jak3...@gmail.com]
> Sent: dinsdag 19 september 2017 18:00
> To: Kees Meijs; ceph-us...@ceph.com
> Subject: Re: [ceph-users] What HBA to choose? To expand or not to
> expand?
>
>
> On Tue, Sep 19, 2017 at 9:38 AM Kees Meijs  wrote:
>
>
> Hi Jake,
>
> On 19-09-17 15:14, Jake Young wrote:
> > Ideally you actually want fewer disks per server and more
> servers.
> > This has been covered extensively in this mailing list. Rule of
> thumb
> > is that each server should have 10% or less of the capacity of
> your
> > cluster.
>
> That's very true, but let's focus on the HBA.
>
> > I didn't do extensive research to decide on this HBA, it's simply
> what
> > my server vendor offered. There are probably better, faster,
> cheaper
> > HBAs out there. A lot of people complain about LSI HBAs, but I am
> > comfortable with them.
>
> Given a configuration our vendor offered it's about LSI/Avago
> 9300-8i
> with 8 drives connected individually using SFF8087 on a backplane
> (e.g.
> not an expander). Or, 24 drives using three HBAs (6xSFF8087 in
> total)
> when using a 4HE SuperMicro chassis with 24 drive bays.
>
> But, what are the LSI complaints about? Or, are the complaints
> generic
> to HBAs and/or cryptic CLI tools and not LSI specific?
>
>
> Typically people rant about how much Megaraid/LSI support sucks. I've
> been using LSI or MegaRAID for years and haven't had any big problems.
>
> I had some performance issues with Areca onboard SAS chips (non-Ceph
> setup, 4 disks in a RAID10) and after about 6 months of troubleshooting
> with the server vendor and Areca support they did patch the firmware and
> resolve the issue.
>
>
>
>
> > There is a management tool called storcli that can fully
> configure the
> > HBA in one or two command lines.  There's a command that
> configures
> > all attached disks as individual RAID0 disk groups. That command
> gets
> > run by salt when I provision a new osd server.
>
> The thread I read was about Areca in JBOD but still able to utilise
> the
> cache, if I'm not mistaken. I'm not sure anymore if there was
> something
> mentioned about BBU.
>
>
> JBOD with WB cache would be nice so you can get smart data directly from
> the disks instead of having interrogate the HBA for the data.  This
> becomes more important once your cluster is stable and in production.
>
> IMHO if there is unwritten data in a RAM chip, like when you enable WB
> cache, you really, really need a BBU. This is another nice thing about
> using SSD journals instead of HBAs in WB mode, the journaled data is
> safe on the SSD before the write is acknowledged.
>
>
>
>
> >
> > What many other people are doing is using the least expensive
> JBOD HBA
> > or the on board SAS controller in JBOD mode and then using SSD
> > journals. Save the money you would have spent on the fancy HBA
> for
> > fast, high endurance SSDs.
>
> Thanks! And obviously I'm very interested in other comments as
> well.
>
> Regards,
> Kees
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High reading IOPS in rgw gc pool since upgrade to Luminous

2017-09-20 Thread Uwe Mesecke
Hi,

I have a Ceph cluster with 6 hosts and 24 OSDs (mon and rgw colocated on these 
hosts) that is used just for RGW. Since upgrading from Kraken (11.2.0) to 
Luminous (12.2.0) a week ago I can see lots of IOPS in the default.rgw.gc pool.

The cluster was created with Kraken two months ago and has the following status 
(yeah, I know size=2, but data loss in this cluster is not a problem):
root@ceph01:~# ceph status
  cluster:
id: 14ff6ba0-6c70-4a01-ac80-5059c4956f2e
health: HEALTH_OK
 
  services:
mon: 3 daemons, quorum ceph04,ceph05,ceph06
mgr: ceph04(active), standbys: ceph05, ceph06
osd: 24 osds: 24 up, 24 in
rgw: 4 daemons active
 
  data:
pools:   13 pools, 1280 pgs
objects: 16200k objects, 29355 GB
usage:   59078 GB used, 44005 GB / 100 TB avail
pgs: 1280 active+clean
 
  io:
client:   177 MB/s rd, 24044 kB/s wr, 3260 op/s rd, 148 op/s wr
 
Normal load (before the upgrade) is around 150 IOPS reading + 150 IOPS writing.
Now the read IOPS are between 2000 and 6000 all the time. All of this excessive
load happens in the default.rgw.gc pool, so my guess is that some kind of
garbage collection is running. It started right after restarting the radosgw
processes during the upgrade. Otherwise the cluster is healthy and normal rgw
work does not seem to be affected. The number of objects in the cluster went
down by 1000k since the upgrade. The data size increased slightly, which is
expected given our workload.

When looking at the OSDs I can see a pattern: a single OSD process in the
cluster runs with high CPU for about 60 to 90 minutes, with lots of messages in
its log, and then after a break the load switches to another OSD.
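
(For reference, a sketch of how the pending gc backlog can be inspected and
drained with radosgw-admin; flags as of Jewel/Luminous:)

radosgw-admin gc list --include-all | head   # entries still queued for garbage collection
radosgw-admin gc process                     # run a gc pass in the foreground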

2017-09-20 06:05:02.560587 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.560586174
2017-09-20 06:05:02.562030 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.562028463
2017-09-20 06:05:02.563489 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.563487625
2017-09-20 06:05:02.564869 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.564867424
2017-09-20 06:05:02.566284 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.566282739
2017-09-20 06:05:02.567652 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.567650712
2017-09-20 06:05:02.569123 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.569122362
2017-09-20 06:05:02.570587 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.570585626
2017-09-20 06:05:02.572017 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.572015899
2017-09-20 06:05:02.573471 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.573469906
2017-09-20 06:05:02.574912 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.574910343
2017-09-20 06:05:02.576363 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.576362151
2017-09-20 06:05:02.577807 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.577806002
2017-09-20 06:05:02.579190 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.579189306
2017-09-20 06:05:02.580642 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.580641131
2017-09-20 06:05:02.582086 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.582084531
2017-09-20 06:05:02.583506 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.583505141
2017-09-20 06:05:02.584996 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.584994743
2017-09-20 06:05:02.586429 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.586427627
2017-09-20 06:05:02.587885 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.587884152
2017-09-20 06:05:02.589272 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.589271242
2017-09-20 06:05:02.590766 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_entries 
end_key=1_01505880302.590764473
2017-09-20 06:05:02.592184 7fd1bf114700  0  
/build/ceph-12.2.0/src/cls/rgw/cls_rgw.cc:3251: gc_iterate_ent

Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Sean Purdy
On Wed, 20 Sep 2017, Burkhard Linke said:
> Hi,
> 
> 
> On 09/20/2017 12:24 PM, Sean Purdy wrote:
> >On Wed, 20 Sep 2017, Burkhard Linke said:
> >>The main reason for having a journal with filestore is having a block device
> >>that supports synchronous writes. Writing to a filesystem in a synchronous
> >>way (e.g. including all metadata writes) results in a huge performance
> >>penalty.
> >>
> >>With bluestore the data is also stored on a block devices, and thus also
> >>allows to perform synchronous writes directly (given the backing storage is
> >>handling sync writes correctly and in a consistent way, e.g. no drive
> >>caches, bbu for raid controllers/hbas). And similar to the filestore journal
> >Our Bluestore disks are hosted on RAID controllers.  Should I set cache 
> >policy as WriteThrough for these disks then?
> 
> It depends on the setup and availability of a BBU. If you have a BBU and
> cache on the controller, using write back should be ok if you monitor the
> BBU state. To be on the safe side is using write through and live with the
> performance impact.

We do have BBU and cache and we do monitor state.  Thanks!

Sean
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.0 bluestore - OSD down/crash " internal heartbeat not healthy, dropping ping reques "

2017-09-20 Thread nokia ceph
Hello Henrik,

Thanks for the update. I applied the two changes below and will monitor the
cluster.

~~~

bluestore_deferred_throttle_bytes = 0

bluestore_throttle_bytes = 0

~~~
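
(A sketch of applying the same values at runtime; whether they take effect
without an OSD restart may depend on the release, so a restart is the safe
assumption:)

ceph tell osd.* injectargs '--bluestore_deferred_throttle_bytes=0 --bluestore_throttle_bytes=0'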


Attached gdb trace after installed debug rpm's


Thanks




On Wed, Sep 20, 2017 at 12:18 PM, Henrik Korkuc  wrote:

> On 17-09-20 08:06, nokia ceph wrote:
>
> Hello,
>
> Env:- RHEL 7.2 , 3.10.0-327.el7.x86_64 , EC 4+1 , bluestore
>
> We are writing to ceph via librados C API  . Testing with rados no issues.
>
>
> The same we tested with Jewel/kraken without any issue. Need your view how
> to debug this issue?
>
> maybe similar to http://tracker.ceph.com/issues/21180? It seems it was
> resolved for me with mentioned fix. You could apply mentioned config
> options and see if it helps (and build newer version, if able).
>
>
> >>
>
> OSD.log
> ==
>
> ~~~
>
> 2017-09-18 14:51:59.895746 7f1e744e0700  0 log_channel(cluster) log [WRN]
> : slow request 60.068824 seconds old, received at 2017-09-18
> 14:50:59.826849: MOSDECSubOpWriteReply(1.132s0 1350/1344
> ECSubWriteReply(tid=971, last_complete=1350'153, committed=1, applied=0))
> currently queued_for_pg
> 2017-09-18 14:51:59.895749 7f1e744e0700  0 log_channel(cluster) log [WRN]
> : slow request 60.068737 seconds old, received at 2017-09-18
> 14:50:59.826936: MOSDECSubOpWriteReply(1.132s0 1350/1344
> ECSubWriteReply(tid=971, last_complete=0'0, committed=0, applied=1))
> currently queued_for_pg
> 2017-09-18 14:51:59.895754 7f1e744e0700  0 log_channel(cluster) log [WRN]
> : slow request 60.067539 seconds old, received at 2017-09-18
> 14:50:59.828134: MOSDECSubOpWriteReply(1.132s0 1350/1344
> ECSubWriteReply(tid=971, last_complete=1350'153, committed=1, applied=0))
> currently queued_for_pg
> 2017-09-18 14:51:59.923825 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1359 k (1083 k + 276 k)
> 2017-09-18 14:51:59.923835 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1066 k (1066 k + 0 )
> 2017-09-18 14:51:59.923837 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 643 k (643 k + 0 )
> 2017-09-18 14:51:59.923840 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1049 k (1049 k + 0 )
> 2017-09-18 14:51:59.923842 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 896 k (896 k + 0 )
> 2017-09-18 14:51:59.940780 7f1e77ca5700 20 osd.181 1350 share_map_peer
> 0x7f1e8dbf2800 already has epoch 1350
> 2017-09-18 14:51:59.940855 7f1e78ca7700 20 osd.181 1350 share_map_peer
> 0x7f1e8dbf2800 already has epoch 1350
> 2017-09-18 14:52:00.081390 7f1e6f572700 20 osd.181 1350 OSD::ms_dispatch:
> ping magic: 0 v1
> 2017-09-18 14:52:00.081393 7f1e6f572700 10 osd.181 1350 do_waiters -- start
> 2017-09-18 14:52:00.081394 7f1e6f572700 10 osd.181 1350 do_waiters --
> finish
> 2017-09-18 14:52:00.081395 7f1e6f572700 20 osd.181 1350 _dispatch
> 0x7f1e90923a40 ping magic: 0 v1
> 2017-09-18 14:52:00.081397 7f1e6f572700 10 osd.181 1350 ping from
> client.414556
> 2017-09-18 14:52:00.123908 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1359 k (1083 k + 276 k)
> 2017-09-18 14:52:00.123926 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1066 k (1066 k + 0 )
> 2017-09-18 14:52:00.123932 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 643 k (643 k + 0 )
> 2017-09-18 14:52:00.123937 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1049 k (1049 k + 0 )
> 2017-09-18 14:52:00.123942 7f1e71cdb700 10 trim shard target 102 M
> meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 896 k (896 k + 0 )
> 2017-09-18 14:52:00.145445 7f1e784a6700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
> 2017-09-18 14:52:00.145450 7f1e784a6700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
> 2017-09-18 14:52:00.145496 7f1e784a6700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1e63cbf700' had timed out after 60
> 2017-09-18 14:52:00.145534 7f1e784a6700 10 osd.181 1350 internal heartbeat
> not healthy, dropping ping request
> 2017-09-18 14:52:00.146224 7f1e78ca7700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
> 2017-09-18 14:52:00.146226 7f1e78ca7700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
>
> ~~~
>
>  thread apply all bt
>
> Thread 54 (LWP 479360):
> #0  0x7f1e7b5606d5 in ?? ()
> #1  0x in ?? ()
>
> Thread 53 (LWP 484888):
> #0  0x7f1e7a644b7d in ?? ()
> #1  0x in ?? ()
>
> Thread 52 (LWP 484177):
> #0  0x7f1e7b5606d5 in ?? ()
> #1  0x000a in ?? ()
> #2  0x7f1e88d8df98 in ?? ()
> #3  0x7f1e88d8df48 in ?? ()
> #

Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Burkhard Linke

Hi,


On 09/20/2017 12:24 PM, Sean Purdy wrote:

On Wed, 20 Sep 2017, Burkhard Linke said:

The main reason for having a journal with filestore is having a block device
that supports synchronous writes. Writing to a filesystem in a synchronous
way (e.g. including all metadata writes) results in a huge performance
penalty.

With bluestore the data is also stored on a block devices, and thus also
allows to perform synchronous writes directly (given the backing storage is
handling sync writes correctly and in a consistent way, e.g. no drive
caches, bbu for raid controllers/hbas). And similar to the filestore journal

Our Bluestore disks are hosted on RAID controllers.  Should I set cache policy 
as WriteThrough for these disks then?


It depends on the setup and availability of a BBU. If you have a BBU and 
cache on the controller, using write back should be ok if you monitor 
the BBU state. To be on the safe side is using write through and live 
with the performance impact.
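
(As a sketch, for LSI/Avago controllers managed with storcli; controller index
and syntax are assumptions to adapt:)

storcli64 /c0/vall set wrcache=wt   # force write-through on all virtual drives
storcli64 /c0/vall set wrcache=wb   # write-back, only with a healthy, monitored BBU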


There's also another thread on the mailing list discussing the choice of 
controllers/hba. Maybe there's more information available in that 
thread, especially with regard to vendors, firmware etc.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Sean Purdy
On Wed, 20 Sep 2017, Burkhard Linke said:
> The main reason for having a journal with filestore is having a block device
> that supports synchronous writes. Writing to a filesystem in a synchronous
> way (e.g. including all metadata writes) results in a huge performance
> penalty.
> 
> With bluestore the data is also stored on a block devices, and thus also
> allows to perform synchronous writes directly (given the backing storage is
> handling sync writes correctly and in a consistent way, e.g. no drive
> caches, bbu for raid controllers/hbas). And similar to the filestore journal

Our Bluestore disks are hosted on RAID controllers.  Should I set cache policy 
as WriteThrough for these disks then?


Sean Purdy

> the bluestore wal/rocksdb partitions can be used to allow both faster
> devices (ssd/nvme) and faster sync writes (compared to spinners).
> 
> Regards,
> Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] monitor takes long time to join quorum: STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER

2017-09-20 Thread Sean Purdy

Hi,


Luminous 12.2.0

Three node cluster, 18 OSD, debian stretch.


One node was down for maintenance for several hours.  When bringing it back up, 
the OSDs rejoin after 5 minutes, but health is still warning.  The monitor has 
not joined quorum after 40 minutes and the logs show a BADAUTHORIZER message 
every time the monitor tries to connect to the leader.

2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >> 
172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply 
connect got BADAUTHORIZER

Then after ~45 minutes monitor *does* join quorum.

I'm presuming this isn't normal behaviour?  Or if it is, let me know and I 
won't worry.

All three nodes are using ntp and look OK timewise.


ceph-mon log:

(.43 is leader, .45 is rebooted node, .44 is other live node in quorum)

Boot:

2017-09-20 09:45:21.874152 7f49efeb8f80  0 ceph version 12.2.0 
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), 
pid 2243

2017-09-20 09:46:01.824708 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
172.16.0.44:6789/0 conn(0x56007244d000 :6789 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 3 vs existing csq=0 existing_state=STATE_CONNECTING
2017-09-20 09:46:01.824723 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
172.16.0.44:6789/0 conn(0x56007244d000 :6789 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept we reset (peer sent cseq 3, 0x5600722c.cseq = 0), sending 
RESETSESSION
2017-09-20 09:46:01.825247 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
172.16.0.44:6789/0 conn(0x56007244d000 :6789 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2017-09-20 09:46:01.828053 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
172.16.0.44:6789/0 conn(0x5600722c :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=21872 cs=1 l=0).process 
missed message?  skipped from seq 0 to 552717734

2017-09-20 09:46:05.580342 7f49e1b27700  0 -- 172.16.0.45:6789/0 >> 
172.16.0.43:6789/0 conn(0x5600720fe800 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=49261 cs=1 l=0).process 
missed message?  skipped from seq 0 to 1151972199
2017-09-20 09:46:05.581097 7f49e2b29700  0 -- 172.16.0.45:0/2243 >> 
172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply 
connect got BADAUTHORIZER
2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >> 
172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply 
connect got BADAUTHORIZER
...
[message repeats for 45 minutes]
...
2017-09-20 10:23:38.818767 7f49e2b29700  0 -- 172.16.0.45:0/2243 >> 
172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply 
connect
 got BADAUTHORIZER


At this point, "ceph mon stat" says .45/store03 not in quorum:

e5: 3 mons at 
{store01=172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0},
 election epoch 376, leader 0 store01, quorum 0,1 store01,store02


Then suddenly a valid connection is made and sync happens:

2017-09-20 10:23:43.041009 7f49e5b2f700  1 mon.store03@2(synchronizing).mds e1 
Unable to load 'last_metadata'
2017-09-20 10:23:43.041967 7f49e5b2f700  1 mon.store03@2(synchronizing).osd 
e2381 e2381: 18 total, 13 up, 14 in
...
2017-09-20 10:23:43.045961 7f49e5b2f700  1 mon.store03@2(synchronizing).osd 
e2393 e2393: 18 total, 15 up, 15 in
...
2017-09-20 10:23:43.049255 7f49e5b2f700  1 mon.store03@2(synchronizing).osd 
e2406 e2406: 18 total, 18 up, 18 in
...
2017-09-20 10:23:43.054828 7f49e5b2f700  0 log_channel(cluster) log [INF] : 
mon.store03 calling new monitor election
2017-09-20 10:23:43.054901 7f49e5b2f700  1 mon.store03@2(electing).elector(372) 
init, last seen epoch 372


Now "ceph mon stat" says:

e5: 3 mons at 
{store01=172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0},
 election epoch 378, leader 0 store01, quorum 0,1,2 store01,store02,store03

and everything's happy.


What should I look for/fix?  It's a fairly vanilla system.


Thanks in advance,

Sean Purdy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Vladimir Prokofev
> So, filestore:
> - write data to the journal
> - write metadata
> - move data from journal to definitive storage
>
> Bluestore:
> - write data to definitive storage (free space, not overwriting anything)
> - write metadata

Wait, doesn't that mean that when I move to Bluestore from Filestore I'll
have increased latency for writes?
Currently I'm using filestore with SSD journals, and my write latency is
that of SSD. By that logic in Bluestore my writes will come straight to
HDD, and I'll have HDD write latency?

2017-09-20 12:37 GMT+03:00 Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de>:

> Hi,
>
>
> On 09/20/2017 11:10 AM, Sam Huracan wrote:
>
>> So why do not journal write only metadata?
>> As I've read, it is for ensure consistency of data, but I do not know how
>> to do that in detail? And why BlueStore still ensure consistency without
>> journal?
>>
>
> The main reason for having a journal with filestore is having a block
> device that supports synchronous writes. Writing to a filesystem in a
> synchronous way (e.g. including all metadata writes) results in a huge
> performance penalty.
>
> With bluestore the data is also stored on a block devices, and thus also
> allows to perform synchronous writes directly (given the backing storage is
> handling sync writes correctly and in a consistent way, e.g. no drive
> caches, bbu for raid controllers/hbas). And similar to the filestore
> journal the bluestore wal/rocksdb partitions can be used to allow both
> faster devices (ssd/nvme) and faster sync writes (compared to spinners).
>
> Regards,
> Burkhard
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Burkhard Linke

Hi,


On 09/20/2017 11:10 AM, Sam Huracan wrote:

So why do not journal write only metadata?
As I've read, it is for ensure consistency of data, but I do not know 
how to do that in detail? And why BlueStore still ensure consistency 
without journal?


The main reason for having a journal with filestore is having a block 
device that supports synchronous writes. Writing to a filesystem in a 
synchronous way (e.g. including all metadata writes) results in a huge 
performance penalty.


With bluestore the data is also stored on a block devices, and thus also 
allows to perform synchronous writes directly (given the backing storage 
is handling sync writes correctly and in a consistent way, e.g. no drive 
caches, bbu for raid controllers/hbas). And similar to the filestore 
journal the bluestore wal/rocksdb partitions can be used to allow both 
faster devices (ssd/nvme) and faster sync writes (compared to spinners).
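
(As a sketch of how the wal/db can be put on a faster device at OSD creation
time; ceph-disk syntax as of Luminous, device names are only examples:)

ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1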


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-20 Thread Marc Roos
 


We use these :
NVDATA Product ID  : SAS9207-8i
Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 
PCI-Express Fusion-MPT SAS-2 (rev 05)

Does someone by any chance know how to turn on the drive identification 
lights?




-Original Message-
From: Jake Young [mailto:jak3...@gmail.com] 
Sent: dinsdag 19 september 2017 18:00
To: Kees Meijs; ceph-us...@ceph.com
Subject: Re: [ceph-users] What HBA to choose? To expand or not to 
expand?


On Tue, Sep 19, 2017 at 9:38 AM Kees Meijs  wrote:


Hi Jake,

On 19-09-17 15:14, Jake Young wrote:
> Ideally you actually want fewer disks per server and more 
servers.
> This has been covered extensively in this mailing list. Rule of 
thumb
> is that each server should have 10% or less of the capacity of 
your
> cluster.

That's very true, but let's focus on the HBA.

> I didn't do extensive research to decide on this HBA, it's simply what
> my server vendor offered. There are probably better, faster, cheaper
> HBAs out there. A lot of people complain about LSI HBAs, but I am
> comfortable with them.

Given a configuration our vendor offered it's about LSI/Avago 9300-8i
with 8 drives connected individually using SFF8087 on a backplane (e.g.
not an expander). Or, 24 drives using three HBAs (6x SFF8087 in total)
when using a 4HE SuperMicro chassis with 24 drive bays.

But, what are the LSI complaints about? Or, are the complaints generic
to HBAs and/or cryptic CLI tools and not LSI specific?


Typically people rant about how much Megaraid/LSI support sucks. I've 
been using LSI or MegaRAID for years and haven't had any big problems. 

I had some performance issues with Areca onboard SAS chips (non-Ceph 
setup, 4 disks in a RAID10) and after about 6 months of troubleshooting 
with the server vendor and Areca support they did patch the firmware and 
resolve the issue. 




> There is a management tool called storcli that can fully configure the
> HBA in one or two command lines. There's a command that configures all
> attached disks as individual RAID0 disk groups. That command gets run
> by salt when I provision a new osd server.
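
For reference, the sort of storcli one-liner meant above is roughly the
following; the controller number is a placeholder and the exact syntax
varies between storcli releases, so treat this as a sketch:

# storcli /c0 add vd each r0

which creates one single-drive RAID0 virtual disk per unconfigured drive
(a drives=EnclID:Slot-Slot argument can restrict it to specific slots).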

The thread I read was about Areca in JBOD but still able to utilise the
cache, if I'm not mistaken. I'm not sure anymore if there was something
mentioned about BBU.


JBOD with WB cache would be nice so you can get SMART data directly from
the disks instead of having to interrogate the HBA for the data. This
becomes more important once your cluster is stable and in production.
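
For comparison, interrogating a MegaRAID-style controller for SMART data
typically looks like this with smartmontools, where N is the controller's
own device number for the disk and has to be looked up separately:

# smartctl -a -d megaraid,N /dev/sda

whereas with a plain JBOD HBA the usual smartctl -a /dev/sdX just works.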

IMHO if there is unwritten data in a RAM chip, like when you enable WB
cache, you really, really need a BBU. This is another nice thing about
using SSD journals instead of HBAs in WB mode: the journaled data is
safe on the SSD before the write is acknowledged.




>
> What many other people are doing is using the least expensive JBOD HBA
> or the on board SAS controller in JBOD mode and then using SSD
> journals. Save the money you would have spent on the fancy HBA for
> fast, high endurance SSDs.

Thanks! And obviously I'm very interested in other comments as well.

Regards,
Kees

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Sam Huracan
So why doesn't the journal write only metadata?
As I've read, it is there to ensure data consistency, but I don't
understand how that works in detail. And why does BlueStore still ensure
consistency without a journal?

2017-09-20 16:03 GMT+07:00 :

> On 20/09/2017 10:59, Sam Huracan wrote:
> > Hi Cephers,
> >
> > I've read about the new BlueStore and have 2 questions:
> >
> > 1. The purpose of BlueStore is to eliminate the drawbacks of POSIX that
> > FileStore runs into. These drawbacks are also what make the journal
> > necessary, resulting in a double-write penalty. Could you explain in
> > more detail where POSIX falls short for FileStore, and how BlueStore
> > still guarantees consistency without a journal?
> From what I guess, bluestore uses the definitive storage location as a
> journal
>
> So, filestore:
> - write data to the journal
> - write metadata
> - move data from journal to definitive storage
>
> Bluestore:
> - write data to definitive storage (free space, not overwriting anything)
> - write metadata
>
> >
> > I found a topic on reddit claiming that, because of the journal, Ceph
> > avoids the buffer cache. Is that true? Is that a drawback of POSIX?
> > https://www.reddit.com/r/ceph/comments/5wbp4d/so_will_bluestore_make_ssd_journals_pointless/
> >
> > 2. We have to put the journal on an SSD to avoid the double write, but
> > that means losing several OSDs when the SSD fails. With the BlueStore
> > model we can put the WAL, metadata and data all on one disk, which
> > makes monitoring and maintenance easier. But according to this post by
> > Sebastien Han:
> > https://www.sebastien-han.fr/blog/2016/05/04/Ceph-Jewel-configure-BlueStore-with-multiple-devices/
> > we could put the WAL, metadata and data on separate disks to increase
> > performance, which I think is no different from the FileStore model.
> > Which OSD deployment is the most optimized: everything on one disk, or
> > split across multiple disks?
> >
> > Thanks in advance
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread ceph
On 20/09/2017 10:59, Sam Huracan wrote:
> Hi Cephers,
> 
> I've read about the new BlueStore and have 2 questions:
> 
> 1. The purpose of BlueStore is to eliminate the drawbacks of POSIX that
> FileStore runs into. These drawbacks are also what make the journal
> necessary, resulting in a double-write penalty. Could you explain in
> more detail where POSIX falls short for FileStore, and how BlueStore
> still guarantees consistency without a journal?
From what I guess, bluestore uses the definitive storage location as a
journal

So, filestore:
- write data to the journal
- write metadata
- move data from journal to definitive storage

Bluestore:
- write data to definitive storage (free space, not overwriting anything)
- write metadata
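
A rough back-of-the-envelope, assuming a single spinner that sustains
about 150 MB/s sequentially and ignoring metadata overhead:

  filestore, journal on the same disk : every byte is written twice,
                                        so roughly 150 / 2 = ~75 MB/s usable
  bluestore, everything on one disk   : large writes land once,
                                        so close to ~150 MB/s usable

(Small writes behave differently, since bluestore can defer them through
the WAL, but the double-write penalty above is the core of the difference.)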

> 
> I found a topic on reddit claiming that, because of the journal, Ceph
> avoids the buffer cache. Is that true? Is that a drawback of POSIX?
> https://www.reddit.com/r/ceph/comments/5wbp4d/so_will_bluestore_make_ssd_journals_pointless/
> 
> 2. We have to put the journal on an SSD to avoid the double write, but
> that means losing several OSDs when the SSD fails. With the BlueStore
> model we can put the WAL, metadata and data all on one disk, which makes
> monitoring and maintenance easier. But according to this post by
> Sebastien Han:
> https://www.sebastien-han.fr/blog/2016/05/04/Ceph-Jewel-configure-BlueStore-with-multiple-devices/
> we could put the WAL, metadata and data on separate disks to increase
> performance, which I think is no different from the FileStore model.
> Which OSD deployment is the most optimized: everything on one disk, or
> split across multiple disks?
> 
> Thanks in advance
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: FileStore vs BlueStore

2017-09-20 Thread Sam Huracan
Hi Cephers,

I've read about the new BlueStore and have 2 questions:

1. The purpose of BlueStore is to eliminate the drawbacks of POSIX that
FileStore runs into. These drawbacks are also what make the journal
necessary, resulting in a double-write penalty. Could you explain in more
detail where POSIX falls short for FileStore, and how BlueStore still
guarantees consistency without a journal?

I found a topic on reddit claiming that, because of the journal, Ceph
avoids the buffer cache. Is that true? Is that a drawback of POSIX?
https://www.reddit.com/r/ceph/comments/5wbp4d/so_will_bluestore_make_ssd_journals_pointless/

2. We have to put the journal on an SSD to avoid the double write, but
that means losing several OSDs when the SSD fails. With the BlueStore
model we can put the WAL, metadata and data all on one disk, which makes
monitoring and maintenance easier. But according to this post by
Sebastien Han:
https://www.sebastien-han.fr/blog/2016/05/04/Ceph-Jewel-configure-BlueStore-with-multiple-devices/
we could put the WAL, metadata and data on separate disks to increase
performance, which I think is no different from the FileStore model.
Which OSD deployment is the most optimized: everything on one disk, or
split across multiple disks?

Thanks in advance
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-20 Thread Matthew Vernon
On 19/09/17 10:40, Wido den Hollander wrote:
> 
>> On 19 September 2017 at 10:24, Adrian Saul
>> wrote:
>>
>>
>>> I understand what you mean and it's indeed dangerous, but see:
>>> https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
>>>
>>> Looking at the systemd docs it's difficult though:
>>> https://www.freedesktop.org/software/systemd/man/systemd.service.html
>>>
>>> If the OSD crashes due to another bug you do want it to restart.
>>>
>>> But for systemd it's not possible to see if the crash was due to a disk
>>> I/O error or a bug in the OSD itself or maybe the OOM-killer or something.
>>
>> Perhaps use something like RestartPreventExitStatus and define a
>> specific exit code for the OSD to exit with when it fails due to an I/O
>> error.
>>
> 
> That's a very, very good idea! I didn't know that one existed.
> 
> That would prevent restarts in case of I/O error indeed.

That would depend on the OSD gracefully handling the I/O failure - IME
they quite often seem to end up abort()ing...
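
For the record, a minimal sketch of such an override, assuming the OSD
were ever taught to exit with a dedicated status on I/O errors (the value
42 below is entirely made up):

# /etc/systemd/system/ceph-osd@.service.d/no-restart-on-eio.conf
[Service]
Restart=on-failure
# do not restart an OSD that exited with the (hypothetical) I/O-error status
RestartPreventExitStatus=42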

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very slow start of osds after reboot

2017-09-20 Thread Manuel Lausch
Hi, 

I have the same issue with Ceph Jewel (10.2.9), RedHat 7 and dmcrypt.
Is there any fix, or at least a workaround, available?

Regards,
Manuel 


On Thu, 31 Aug 2017 16:24:10 +0200,
Piotr Dzionek wrote:

> Hi,
> 
> For the last 3 weeks I have been running the latest LTS Luminous Ceph
> release on CentOS 7. I started with the 4th RC and am now on the stable
> release. The cluster runs fine; however, I noticed that if I reboot one
> of the nodes, it takes a really long time for the cluster to return to
> an OK status. The OSDs do start, but not as soon as the server is up:
> they come up one by one over a period of about 5 minutes. I checked the
> logs and all OSDs have the following errors.
> 
> 
> 
> As you can see, the XFS volume (the part with the metadata) is not
> mounted yet. My question is: what mounts it, and why does it take so
> long? Is there perhaps a setting that randomizes the start-up of OSDs
> running on the same node?
> 
> Kind regards,
> Piotr Dzionek
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 |
76135 Karlsruhe | Germany Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Thomas Ludwig, Jan Oetjen


Member of United Internet

This e-mail may contain confidential and/or privileged information. If
you are not the intended recipient of this e-mail, you are hereby
notified that saving, distribution or use of the content of this e-mail
in any way is prohibited. If you have received this e-mail in error,
please notify the sender and delete the e-mail.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com