Re: [ceph-users] osd heartbeat protocol issue on upgrade v12.1.0 -> v12.2.0

2017-09-01 Thread Thomas Gebhardt
Hello,

thank you very much for the hint, you are right!

Kind regards, Thomas

Marc Roos wrote on 30.08.2017 at 14:26:
>  
> I had this also once. If you update all nodes and then systemctl restart 
> 'ceph-osd@*' on all nodes, you should be fine. But first the monitors of 
> course
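
For anyone hitting the same upgrade hiccup, a minimal sketch of the order Marc describes (assuming the stock systemd unit names shipped by the ceph packages) would be:

# on each monitor node first
systemctl restart ceph-mon.target
# then, once all mons are upgraded and back in quorum, on each OSD node
systemctl restart 'ceph-osd@*'
# (or equivalently: systemctl restart ceph-osd.target)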


Re: [ceph-users] Very slow start of osds after reboot

2017-09-01 Thread Piotr Dzionek

Hi,

I have RAID0 for each disk; unfortunately my RAID controller doesn't support JBOD. 
Apart from this I also run a separate cluster with Jewel 10.2.9 on RAID0 
and there is no such problem (I just tested it). Moreover, the cluster that 
has this issue used to run Firefly with RAID0 and everything was fine.



On 31.08.2017 at 18:14, Hervé Ballans wrote:

Hi Piotr,

Just to verify one point: how are your disks connected (physically), 
in a non-RAID or RAID0 mode?


rv

On 31/08/2017 at 16:24, Piotr Dzionek wrote:
For the last 3 weeks I have been running the latest LTS Luminous Ceph 
release on CentOS 7. It started with the 4th RC and now I have the stable 
release.
The cluster runs fine; however, I noticed that if I reboot one of the 
nodes, it takes a really long time for the cluster to return to OK status.
OSDs are starting up, but not as soon as the server is up. They come 
up one by one over a period of about 5 minutes. I checked the logs and 
all OSDs have the following errors.







These are different systemd units (ceph.target and ceph-osd.target), 
so they have separate dependencies. So I guess this is nothing weird.
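
If you want to double-check which target actually pulls a given OSD unit in, plain systemctl can show it; a quick sketch (osd id 48 is just an example taken from the listing below):

# what each target wants directly
systemctl list-dependencies ceph.target
systemctl list-dependencies ceph-osd.target
# which targets reference one OSD instance
systemctl show -p WantedBy ceph-osd@48.service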



On 31.08.2017 at 17:13, Dan van der Ster wrote:

Random theory... I just noticed that the ceph-osd's are listed twice
[1] in the output of systemctl list-dependencies.

Is that correct?!!!

-- dan

[1] > systemctl list-dependencies
...
● ├─ceph-mds.target
● ├─ceph-mon.target
● ├─ceph-osd.target
● │ ├─ceph-osd@48.service
● │ ├─ceph-osd@49.service
● │ ├─ceph-osd@50.service
● │ ├─ceph-osd@51.service
● │ ├─ceph-osd@53.service
● │ ├─ceph-osd@54.service
● │ ├─ceph-osd@55.service
● │ ├─ceph-osd@56.service
● │ ├─ceph-osd@59.service
● │ ├─ceph-osd@61.service
● │ ├─ceph-osd@63.service
● │ ├─ceph-osd@65.service
● │ ├─ceph-osd@68.service
● │ ├─ceph-osd@70.service
● │ ├─ceph-osd@74.service
● │ ├─ceph-osd@80.service
● │ ├─ceph-osd@81.service
● │ ├─ceph-osd@82.service
● │ ├─ceph-osd@83.service
● │ ├─ceph-osd@84.service
● │ ├─ceph-osd@89.service
● │ ├─ceph-osd@90.service
● │ ├─ceph-osd@91.service
● │ └─ceph-osd@92.service
● ├─ceph.target
● │ ├─ceph-mds.target
● │ ├─ceph-mon.target
● │ └─ceph-osd.target
● │   ├─ceph-osd@48.service
● │   ├─ceph-osd@49.service
● │   ├─ceph-osd@50.service
● │   ├─ceph-osd@51.service
● │   ├─ceph-osd@53.service
● │   ├─ceph-osd@54.service
● │   ├─ceph-osd@55.service
● │   ├─ceph-osd@56.service
● │   ├─ceph-osd@59.service
● │   ├─ceph-osd@61.service
● │   ├─ceph-osd@63.service
● │   ├─ceph-osd@65.service
● │   ├─ceph-osd@68.service
● │   ├─ceph-osd@70.service
● │   ├─ceph-osd@74.service
● │   ├─ceph-osd@80.service
● │   ├─ceph-osd@81.service
● │   ├─ceph-osd@82.service
● │   ├─ceph-osd@83.service
● │   ├─ceph-osd@84.service
● │   ├─ceph-osd@89.service
● │   ├─ceph-osd@90.service
● │   ├─ceph-osd@91.service
● │   └─ceph-osd@92.service
● ├─getty.target
...



I tested it on jewel 10.2.9 and I don't see this issue here.


On Thu, Aug 31, 2017 at 4:57 PM, Dan van der Ster  wrote:

Hi,

I see the same with jewel on el7 -- it started in one of the recent point
releases, around 10.2.5, IIRC.

Problem seems to be the same -- daemon is started before the osd is
mounted... then the service waits several seconds before trying again.

Aug 31 15:41:47 ceph-osd: 2017-08-31 15:41:47.267661 7f2e49731800 -1
#033[0;31m ** ERROR: unable to open OSD superblock on
/var/lib/ceph/osd/ceph-89: (2) No such file or directory#033[0m
Aug 31 15:41:47 ceph-osd: starting osd.55 at :/0 osd_data
/var/lib/ceph/osd/ceph-55 /var/lib/ceph/osd/ceph-55/journal
Aug 31 15:41:47 systemd: ceph-osd@89.service: main process exited,
code=exited, status=1/FAILURE
Aug 31 15:41:47 systemd: Unit ceph-osd@89.service entered failed state.
Aug 31 15:41:47 systemd: ceph-osd@89.service failed.
Aug 31 15:41:47 kernel: XFS (sdi1): Ending clean mount
Aug 31 15:41:47 rc.local: Removed symlink
/etc/systemd/system/ceph-osd.target.wants/ceph-osd@54.service.
Aug 31 15:41:47 systemd: Reloading.
Aug 31 15:41:47 systemd: Reloading.
Aug 31 15:41:47 rc.local: Created symlink from
/etc/systemd/system/ceph-osd.target.wants/ceph-osd@54.service to
/usr/lib/systemd/system/ceph-osd@.service.
Aug 31 15:41:47 systemd: Reloading.
Aug 31 15:41:55 ceph-osd: 2017-08-31 15:41:55.425566 7f74b92e1800 -1
osd.55 123659 log_to_monitors {default=true}
Aug 31 15:42:07 systemd: ceph-osd@84.service holdoff time over,
scheduling restart.
Aug 31 15:42:07 systemd: ceph-osd@61.service holdoff time over,
scheduling restart.
Aug 31 15:42:07 systemd: ceph-osd@83.service holdoff time over,
scheduling restart.
Aug 31 15:42:07 systemd: ceph-osd@80.service holdoff time over,
scheduling restart.
Aug 31 15:42:07 systemd: ceph-osd@70.service holdoff time over,
scheduling restart.
Aug 31 15:42:07 systemd: ceph-osd@65.service holdoff time over,
scheduling restart.
Aug 31 15:42:07 systemd: ceph-osd@82.service holdoff time over,
scheduling restart.
Aug 31 15:
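
Since the failure above is just the daemon starting before its data directory is mounted, one workaround worth testing (a sketch only -- whether it helps depends on how the OSD partitions get mounted, udev/ceph-disk vs. fstab) is a drop-in that makes each OSD instance wait for its mount:

# systemctl edit ceph-osd@.service, then add:
[Unit]
RequiresMountsFor=/var/lib/ceph/osd/ceph-%i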

Re: [ceph-users] PGs in peered state?

2017-09-01 Thread Yuri Gorshkov
Hi All,

Is there a known procedure to debug the PG state in case of problems like
this?

Best regards,
Yuri.
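
Not a root-cause answer, but the usual read-only starting points for poking at a stuck PG are these (the pgid 3.c80 is just taken from the dump quoted below):

# which PGs are stuck, and in which state
ceph health detail
ceph pg dump_stuck
# detailed peering/recovery state of one PG
ceph pg 3.c80 query
# where CRUSH maps that PG, plus the current OSD tree
ceph pg map 3.c80
ceph osd tree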

2017-08-28 14:05 GMT+03:00 Yuri Gorshkov :

> Hi.
>
> When trying to take down a host for maintenance purposes I encountered an
> I/O stall along with some PGs marked 'peered' unexpectedly.
>
> Cluster stats: 96/96 OSDs, healthy prior to incident, 5120 PGs, 4 hosts
> consisting of 24 OSDs each. Ceph version 11.2.0, using standard filestore
> (with LVM journals on SSD) and default crush map. All pools are size 3,
> min_size 2.
>
> Steps to reproduce the problem:
> 0. Cluster is healthy, HEALTH_OK
> 1. Set noout flag to prepare for host removal.
> 2. Begin taking OSDs on one of the hosts down: systemctl stop ceph-osd@
> $osd.
> 3. Notice the IO has stalled unexpectedly and about 100 PGs total are in
> degraded+undersized+peered state if the host is down.
>
> AFAIK the 'peered' state means that the PG has not been replicated to
> min_size yet, so there is something strange going on. Since we have 4 hosts
> and are using the default crush map, how is it possible that after taking
> one host (or even just some OSDs on that host) down some PGs in the cluster
> are left with less than 2 copies?
>
> Here's the snippet of 'ceph pg dump_stuck' when this happened. Sadly I
> don't have any more information yet...
>
> # ceph pg dump|grep peered
> dumped all in format plain
> 3.c80   173  0  346   692   0   715341824
> 1004110041 undersized+degraded+remapped+backfill_wait+peered
> 2017-08-02 19:12:39.319222  12124'104727   12409:62777 [62,76,44]
> 62[2]  21642'32485 2017-07-18 22:57:06.263727
>  1008'135 2017-07-09 22:34:40.893182
> 3.204   184  0  368   649   0   769544192
> 1006510065 undersized+degraded+remapped+backfill_wait+peered
> 2017-08-02 19:12:39.334905   12124'13665   12409:37345  [75,52,1]
> 75[2]  2 1375'4316 2017-07-18 00:10:27.601548
> 1371'2740 2017-07-12 07:48:34.953831
> 11.19 25525  051050 78652   0 14829768529
> 1005910059 undersized+degraded+remapped+backfill_wait+peered
> 2017-08-02 19:12:39.311612  12124'156267  12409:137128 [56,26,14]
> 56   [18] 181375'28148 2017-07-17 20:27:04.916079
>   0'0 2017-07-10 16:12:49.270606
>
> --
> Sincerely,
> Yuri Gorshkov
> Systems Engineer
> SmartLabs LLC
> +7 (495) 645-44-46 ext. 6926
> ygorsh...@smartlabs.tv
>
>


-- 
Sincerely,
Yuri Gorshkov
Systems Engineer
SmartLabs LLC
+7 (495) 645-44-46 ext. 6926
ygorsh...@smartlabs.tv


Re: [ceph-users] Changing the failure domain

2017-09-01 Thread David Turner
That is normal to have backfilling because the crush map did change. The
host and the chassis have crush numbers and their own weight which is the
sum of the osds under them.  By moving the host into the chassis you
changed the weight of the chassis and that affects the PG placement even
though you didn't change the failure domain.

Osd_max_backfills = 1 shouldn't impact customer traffic and cause blocked
requests. Most people find that they can use 3-5 before the disks are
active enough to come close to impacting customer traffic.  That would lead
me to think you have a dying drive that you're reading from/writing to in
sectors that are bad or at least slower.
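
If you do want to experiment with the backfill throttle at runtime, the usual knobs can be injected without restarting the OSDs; a sketch, with example values only:

# applies to all OSDs, not persistent across daemon restarts
ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 1'
# keep an eye on blocked requests while backfill runs
ceph health detail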

On Fri, Sep 1, 2017, 6:13 AM Laszlo Budai  wrote:

> Hi David,
>
> Well, most probably the larger part of our PGs will have to be
> reorganized, as we are moving from 9 hosts to 3 chassis. But I was hoping
> to be able to throttle the backfilling to an extent where it has minimal
> impact on our user traffic. Unfortunately I wasn't able to do it. I saw
> that the newer versions of ceph have the "osd recovery sleep" parameter. I
> think this would help, but unfortunately it's not present in hammer ... :(
>
> Also I have an other question: Is it normal to have backfill when we add a
> host to a chassis even if we don't change the CRUSH rule? Let me explain:
> We have the hosts directly assigned to the root bucket. Then we add chassis
> to the root, and then we move a host from the root to the chassis. In all
> this time the rule set remains unchanged, with the host being the failure
> domain.
>
> Kind regards,
> Laszlo
>
>
> On 31.08.2017 17:56, David Turner wrote:
> > How long are you seeing these blocked requests for?  Initially or
> perpetually?  Changing the failure domain causes all PGs to peer at the
> same time.  This would be the cause if it happens really quickly.  There is
> no way to avoid all of them peering while making a change like this.  After
> that, It could easily be caused because a fair majority of your data is
> probably set to move around.  I would check what might be causing the
> blocked requests during this time.  See if there is an OSD that might be
> dying (large backfills have the tendancy to find a couple failing drives)
> which could easily cause things to block.  Also checking if your disks or
> journals are maxed out with iostat could shine some light on any mitigating
> factor.
> >
> > On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai  > wrote:
> >
> > Dear all!
> >
> > In our Hammer cluster we are planning to switch our failure domain
> from host to chassis. We have performed some simulations, and regardless of
> the settings we have used some slow requests have appeared all the time.
> >
> > we had the the following settings:
> >
> >"osd_max_backfills": "1",
> >   "osd_backfill_full_ratio": "0.85",
> >   "osd_backfill_retry_interval": "10",
> >   "osd_backfill_scan_min": "1",
> >   "osd_backfill_scan_max": "4",
> >   "osd_kill_backfill_at": "0",
> >   "osd_debug_skip_full_check_in_backfill_reservation": "false",
> >   "osd_debug_reject_backfill_probability": "0",
> >
> >  "osd_min_recovery_priority": "0",
> >   "osd_allow_recovery_below_min_size": "true",
> >   "osd_recovery_threads": "1",
> >   "osd_recovery_thread_timeout": "60",
> >   "osd_recovery_thread_suicide_timeout": "300",
> >   "osd_recovery_delay_start": "0",
> >   "osd_recovery_max_active": "1",
> >   "osd_recovery_max_single_start": "1",
> >   "osd_recovery_max_chunk": "8388608",
> >   "osd_recovery_forget_lost_objects": "false",
> >   "osd_recovery_op_priority": "1",
> >   "osd_recovery_op_warn_multiple": "16",
> >
> >
> > we have also tested it with the CFQ IO scheduler on the OSDs and the
> following params:
> >   "osd_disk_thread_ioprio_priority": "7"
> >   "osd_disk_thread_ioprio_class": "idle"
> >
> > and the nodeep-scrub set.
> >
> > Is there anything else to try? Is there a good way to switch from
> one kind of failure domain to an other without slow requests?
> >
> > Thank you in advance for any suggestions.
> >
> > Kind regards,
> > Laszlo
> >
> >
> >
>


Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Looks like it has been rescued... Only 1 error as we saw before in the smart log!

# ddrescue -f /dev/sda /dev/sdc ./rescue.log
GNU ddrescue 1.21
Press Ctrl-C to interrupt
     ipos:    1508 GB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:    1508 GB, non-scraped:        0 B,  average rate:  88985 kB/s
non-tried:        0 B,     errsize:     4096 B,      run time:  6h 14m 40s
  rescued:    2000 GB,      errors:        1,  remaining time:         n/a
percent rescued:  99.99%      time since last successful read:         39s
Finished

Still missing partition in the new drive. =P  I found this util called testdisk 
for broken partition tables.  Will try that tonight. =P

Regards,
Hong
 

On Wednesday, August 30, 2017 9:18 AM, Ronny Aasen 
 wrote:
 

  On 30.08.2017 15:32, Steve Taylor wrote:
  
 
I'm not familiar with dd_rescue, but I've just been reading about it. I'm not 
seeing any features that would be beneficial in this scenario that aren't also 
available in dd. What specific features give it "really a far better chance of 
restoring a copy of your disk" than dd? I'm always interested in learning about 
new recovery tools. 
I see I wrote dd_rescue out of old habit, but the package one should use on 
Debian is gddrescue, also called GNU ddrescue.

This page has some details on the differences between dd and the ddrescue variants:
http://www.toad.com/gnu/sysadmin/index.html#ddrescue
 
 kind regards
 Ronny Aasen
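
For reference, the two approaches look roughly like this (device names are examples only, so double-check source and destination before running anything):

# plain dd: keep going past read errors and pad the unreadable blocks with zeros
dd if=/dev/sdb of=/dev/sdc bs=4M conv=noerror,sync
# GNU ddrescue: keeps a map/log file so it can resume and retry the bad areas
ddrescue -f /dev/sdb /dev/sdc rescue.log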
 
 
 
 
Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.

  
 On Tue, 2017-08-29 at 21:49 +0200, Willem Jan Withagen wrote: 
 On 29-8-2017 19:12, Steve Taylor wrote:

Hong,

Probably your best chance at recovering any data without special, expensive, 
forensic procedures is to perform a dd from /dev/sdb to somewhere else large 
enough to hold a full disk image and attempt to repair that. You'll want to use 
'conv=noerror' with your dd command since your disk is failing. Then you could 
either re-attach the OSD from the new source or attempt to retrieve objects 
from the filestore on it.

Like somebody else already pointed out: in problem cases like disk, use 
dd_rescue. It has really a far better chance of restoring a copy of your disk.
--WjW

I have actually done this before by creating an RBD that matches the disk size, 
performing the dd, running xfs_repair, and eventually adding it back to the 
cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for repair 
only, but I'm happy to report that it worked flawlessly in my case. I was able 
to weight the OSD to 0, offload all of its data, then remove it for a full 
recovery, at which point I just deleted the RBD. The possibilities afforded by 
Ceph inception are endless. ☺

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799
If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.

On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
Rule of thumb with batteries is:
- the more "proper temperature" you run them at, the more life you get out of them
- the more a battery is overpowered for your application, the longer it will survive.
Get yourself an LSI 94** controller and use it as HBA and you will be fine. 
But get MORE DRIVES ! …

On 28 Aug 2017, at 23:10, hjcho616 wrote:
Thank you Tomasz and Ronny.  I'll have to order some hdd soon and try these 
out.  Car battery idea is nice!  I may try that.. =)  Do they last longer?  
Ones that fit the UPS original battery spec didn't last very long... part of 
the reason why I gave up on them.. =P  My wife probably won't like the idea of 
a car battery hanging out though ha!
The OSD1 (one with mostly ok OSDs, except that smart failure) motherboard 
doesn't have any additional SATA connectors available.  Would it be safe to 
add another OSD host?
Regards,
Hong

On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz wrote:
Sorry for being brutal … anyway
1. get the battery for UPS (a car battery will do as well, I've modded one ups 
in the past with a truck battery and it was working like a charm :D )
2. get spare drives and put those in, because your cluster CAN NOT get out of 
error due to lack of space
3. follow the advice of Ronny Aasen on how to recover data from hard drives
4. get cooling to the drives or you will lose more!

On 28 Aug 2017, at 22:39, hjcho616 wrote:
Tomasz,
Those machines are behind a surge protector.  Doesn't appear to be a good one!  
I do have a UPS... bu

[ceph-users] a question about use of CEPH_IOC_SYNCIO in write

2017-09-01 Thread sa514164
Hi:
I want to ask a question about the CEPH_IOC_SYNCIO flag.
I know that when using the O_SYNC or O_DIRECT flag, the write call takes 
different code paths than it does with the CEPH_IOC_SYNCIO flag.
And I find the comments about CEPH_IOC_SYNCIO here:

/*
 * CEPH_IOC_SYNCIO - force synchronous IO
 *
 * This ioctl sets a file flag that forces the synchronous IO that
 * bypasses the page cache, even if it is not necessary.  This is
 * essentially the opposite behavior of IOC_LAZYIO.  This forces the
 * same read/write path as a file opened by multiple clients when one
 * or more of those clients is opened for write.
 *
 * Note that this type of sync IO takes a different path than a file
 * opened with O_SYNC/D_SYNC (writes hit the page cache and are
 * immediately flushed on page boundaries).  It is very similar to
 * O_DIRECT (writes bypass the page cache) except that O_DIRECT writes
 * are not copied (user page must remain stable) and O_DIRECT writes
 * have alignment restrictions (on the buffer and file offset).
 */
#define CEPH_IOC_SYNCIO _IO(CEPH_IOCTL_MAGIC, 5)

My questions are: 
1. "This forces the same read/write path as a file opened by multiple 
clients when one or more of those clients is opened for write." -- Does this 
mean multiple clients can execute the same code path when they all use the 
CEPH_IOC_SYNCIO flag? Will using CEPH_IOC_SYNCIO in all clients have effects 
on coherency and performance?
2. "...except that O_DIRECT writes are not copied (user page must remain 
stable)" -- As I understand it, when threads write with the CEPH_IOC_SYNCIO flag, 
the write call blocks until the ceph osd and mds send back responses. So even 
with the CEPH_IOC_SYNCIO flag (the user pages are not locked here, I guess), the 
user cannot use those pages until the write returns. How can the use of the 
CEPH_IOC_SYNCIO flag make better use of user-space memory?



Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai

Hi David,

Well, most probably the larger part of our PGs will have to be reorganized, as we are 
moving from 9 hosts to 3 chassis. But I was hoping to be able to throttle the backfilling 
to an extent where it has minimal impact on our user traffic. Unfortunately I wasn't able 
to do it. I saw that the newer versions of ceph have the "osd recovery sleep" 
parameter. I think this would help, but unfortunately it's not present in hammer ... :(

Also I have another question: Is it normal to have backfill when we add a host 
to a chassis even if we don't change the CRUSH rule? Let me explain: We have 
the hosts directly assigned to the root bucket. Then we add chassis to the 
root, and then we move a host from the root to the chassis. In all this time 
the rule set remains unchanged, with the host being the failure domain.

Kind regards,
Laszlo


On 31.08.2017 17:56, David Turner wrote:

How long are you seeing these blocked requests for?  Initially or perpetually?  
Changing the failure domain causes all PGs to peer at the same time.  This 
would be the cause if it happens really quickly.  There is no way to avoid all 
of them peering while making a change like this.  After that, It could easily 
be caused because a fair majority of your data is probably set to move around.  
I would check what might be causing the blocked requests during this time.  See 
if there is an OSD that might be dying (large backfills have the tendancy to 
find a couple failing drives) which could easily cause things to block.  Also 
checking if your disks or journals are maxed out with iostat could shine some 
light on any mitigating factor.

On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai mailto:las...@componentsoft.eu>> wrote:

Dear all!

In our Hammer cluster we are planning to switch our failure domain from 
host to chassis. We have performed some simulations, and regardless of the 
settings we have used some slow requests have appeared all the time.

we had the the following settings:

   "osd_max_backfills": "1",
  "osd_backfill_full_ratio": "0.85",
  "osd_backfill_retry_interval": "10",
  "osd_backfill_scan_min": "1",
  "osd_backfill_scan_max": "4",
  "osd_kill_backfill_at": "0",
  "osd_debug_skip_full_check_in_backfill_reservation": "false",
  "osd_debug_reject_backfill_probability": "0",

 "osd_min_recovery_priority": "0",
  "osd_allow_recovery_below_min_size": "true",
  "osd_recovery_threads": "1",
  "osd_recovery_thread_timeout": "60",
  "osd_recovery_thread_suicide_timeout": "300",
  "osd_recovery_delay_start": "0",
  "osd_recovery_max_active": "1",
  "osd_recovery_max_single_start": "1",
  "osd_recovery_max_chunk": "8388608",
  "osd_recovery_forget_lost_objects": "false",
  "osd_recovery_op_priority": "1",
  "osd_recovery_op_warn_multiple": "16",


we have also tested it with the CFQ IO scheduler on the OSDs and the 
following params:
  "osd_disk_thread_ioprio_priority": "7"
  "osd_disk_thread_ioprio_class": "idle"

and the nodeep-scrub set.

Is there anything else to try? Is there a good way to switch from one kind 
of failure domain to an other without slow requests?

Thank you in advance for any suggestions.

Kind regards,
Laszlo




Re: [ceph-users] Changing the failure domain

2017-09-01 Thread David Turner
Don't discount failing drives. You can have drives in a "ready-to-fail"
state that doesn't show up in SMART or anywhere easy to track. When
backfilling, the drive is using sectors it may not normally use. I managed
a 1400 osd cluster that would lose 1-3 drives in random nodes when I added
new storage due to the large backfill that took place. We monitored dmesg,
SMART, etc for detection of disk errors, but it would all of a sudden
happen during the large backfill. Several times the osd didn't even have
any SMART errors after it was dead.

It's easiest to track slow requests while they are happening. `ceph health
detail` will report which OSD the request is blocked on and might shed
some light. If a PG is peering for a while, you can also check which
OSD it is stuck waiting on.
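
A rough sketch of chasing one of those blocked requests while it is happening (osd.62 is only a placeholder):

# find which OSDs are reporting blocked/slow requests
ceph health detail
# on the node hosting that OSD, see what the ops are stuck on (admin socket)
ceph daemon osd.62 dump_ops_in_flight
ceph daemon osd.62 dump_historic_ops
# check whether the underlying disk or journal is saturated
iostat -x 5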

On Fri, Sep 1, 2017, 12:09 PM Laszlo Budai  wrote:

> Hello,
>
>
> We have checked all the drives, and there is no problem with them. If
> there would be a failing drive, then I think that the slow requests should
> appear also in the normal traffic as the ceph cluster is using all the OSDs
> as primaries for some PGs. But these slow requests are appearing only
> during the backfill. I will try to dig deeper into the IO operations at the
> next test.
>
> Kind regards,
> Laszlo
>
>
>
> On 01.09.2017 16:08, David Turner wrote:
> > That is normal to have backfilling because the crush map did change. The
> host and the chassis have crush numbers and their own weight which is the
> sum of the osds under them.  By moving the host into the chassis you
> changed the weight of the chassis and that affects the PG placement even
> though you didn't change the failure domain.
> >
> > Osd_max_backfills = 1 shouldn't impact customer traffic and cause
> blocked requests. Most people find that they can use 3-5 before the disks
> are active enough to come close to impacting customer traffic.  That would
> lead me to think you have a dying drive that you're reading from/writing to
> in sectors that are bad or at least slower.
> >
> >
> > On Fri, Sep 1, 2017, 6:13 AM Laszlo Budai  > wrote:
> >
> > Hi David,
> >
> > Well, most probably the larger part of our PGs will have to be
> reorganized, as we are moving from 9 hosts to 3 chassis. But I was hoping
> to be able to throttle the backfilling to an extent where it has minimal
> impact on our user traffic. Unfortunately I wasn't able to do it. I saw
> that the newer versions of ceph have the "osd recovery sleep" parameter. I
> think this would help, but unfortunately it's not present in hammer ... :(
> >
> > Also I have an other question: Is it normal to have backfill when we
> add a host to a chassis even if we don't change the CRUSH rule? Let me
> explain: We have the hosts directly assigned to the root bucket. Then we
> add chassis to the root, and then we move a host from the root to the
> chassis. In all this time the rule set remains unchanged, with the host
> being the failure domain.
> >
> > Kind regards,
> > Laszlo
> >
> >
> > On 31.08.2017 17:56, David Turner wrote:
> >  > How long are you seeing these blocked requests for?  Initially or
> perpetually?  Changing the failure domain causes all PGs to peer at the
> same time.  This would be the cause if it happens really quickly.  There is
> no way to avoid all of them peering while making a change like this.  After
> that, It could easily be caused because a fair majority of your data is
> probably set to move around.  I would check what might be causing the
> blocked requests during this time.  See if there is an OSD that might be
> dying (large backfills have the tendancy to find a couple failing drives)
> which could easily cause things to block.  Also checking if your disks or
> journals are maxed out with iostat could shine some light on any mitigating
> factor.
> >  >
> >  > On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai <
> las...@componentsoft.eu   las...@componentsoft.eu >> wrote:
> >  >
> >  > Dear all!
> >  >
> >  > In our Hammer cluster we are planning to switch our failure
> domain from host to chassis. We have performed some simulations, and
> regardless of the settings we have used some slow requests have appeared
> all the time.
> >  >
> >  > we had the the following settings:
> >  >
> >  >"osd_max_backfills": "1",
> >  >   "osd_backfill_full_ratio": "0.85",
> >  >   "osd_backfill_retry_interval": "10",
> >  >   "osd_backfill_scan_min": "1",
> >  >   "osd_backfill_scan_max": "4",
> >  >   "osd_kill_backfill_at": "0",
> >  >   "osd_debug_skip_full_check_in_backfill_reservation":
> "false",
> >  >   "osd_debug_reject_backfill_probability": "0",
> >  >
> >  >  "osd_min_recovery_priority": "0",
> >  >   "osd_allow_recovery_below_min_size": 

Re: [ceph-users] v12.2.0 Luminous released

2017-09-01 Thread Sage Weil
On Fri, 1 Sep 2017, Felix, Evan J wrote:
> Is there documentation about how to deal with a pool application 
> association that is not one of cephfs, rbd, or rgw? We have multiple 
> pools that have nothing to do with those applications, we just use the 
> objects in them directly using the librados API calls.  I don’t really 
> want health warnings always showing in my status screens.

Hi Evan,

Just

 ceph osd pool application enable $pool myapp

See

 
http://docs.ceph.com/docs/master/rados/operations/pools/#associate-pool-to-application

sage
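
For completeness, a sketch of tagging and then verifying one pool ('mypool' and 'myapp' are placeholders):

ceph osd pool application enable mypool myapp
ceph osd pool application get mypool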


> 
> Evan Felix
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Abhishek Lekshmanan
> Sent: Tuesday, August 29, 2017 11:20 AM
> To: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; 
> ceph-maintain...@ceph.com; ceph-annou...@ceph.com
> Subject: v12.2.0 Luminous released 
> 
> 
> We're glad to announce the first release of Luminous v12.2.x long term stable 
> release series. There have been major changes since Kraken
> (v11.2.z) and Jewel (v10.2.z), and the upgrade process is non-trivial.
> Please read the release notes carefully.
> 
> For more details, links & changelog please refer to the complete release 
> notes entry at the Ceph blog:
> http://ceph.com/releases/v12-2-0-luminous-released/
> 
> 
> Major Changes from Kraken
> -
> 
> - *General*:
>   * Ceph now has a simple, built-in web-based dashboard for monitoring cluster
> status.
> 
> - *RADOS*:
>   * *BlueStore*:
> - The new *BlueStore* backend for *ceph-osd* is now stable and the
>   new default for newly created OSDs.  BlueStore manages data
>   stored by each OSD by directly managing the physical HDDs or
>   SSDs without the use of an intervening file system like XFS.
>   This provides greater performance and features.
> - BlueStore supports full data and metadata checksums
>   of all data stored by Ceph.
> - BlueStore supports inline compression using zlib, snappy, or LZ4. (Ceph
>   also supports zstd for RGW compression but zstd is not recommended for
>   BlueStore for performance reasons.)
> 
>   * *Erasure coded* pools now have full support for overwrites
> allowing them to be used with RBD and CephFS.
> 
>   * *ceph-mgr*:
> - There is a new daemon, *ceph-mgr*, which is a required part of
>   any Ceph deployment.  Although IO can continue when *ceph-mgr*
>   is down, metrics will not refresh and some metrics-related calls
>   (e.g., `ceph df`) may block.  We recommend deploying several
>   instances of *ceph-mgr* for reliability.  See the notes on
>   Upgrading below.
> - The *ceph-mgr* daemon includes a REST-based management API.
>   The API is still experimental and somewhat limited but
>   will form the basis for API-based management of Ceph going forward.
> - ceph-mgr also includes a Prometheus exporter plugin, which can provide 
> Ceph
>   perfcounters to Prometheus.
> - ceph-mgr now has a Zabbix plugin. Using zabbix_sender it sends trapper
>   events to a Zabbix server containing high-level information of the Ceph
>   cluster. This makes it easy to monitor a Ceph cluster's status and send
>   out notifications in case of a malfunction.
> 
>   * The overall *scalability* of the cluster has improved. We have
> successfully tested clusters with up to 10,000 OSDs.
>   * Each OSD can now have a device class associated with
> it (e.g., `hdd` or `ssd`), allowing CRUSH rules to trivially map
> data to a subset of devices in the system.  Manually writing CRUSH
> rules or manual editing of the CRUSH is normally not required.
>   * There is a new upmap exception mechanism that allows individual PGs to be 
> moved around to achieve
> a *perfect distribution* (this requires luminous clients).
>   * Each OSD now adjusts its default configuration based on whether the
> backing device is an HDD or SSD. Manual tuning generally not required.
>   * The prototype mClock QoS queueing algorithm is now available.
>   * There is now a *backoff* mechanism that prevents OSDs from being
> overloaded by requests to objects or PGs that are not currently able to
> process IO.
>   * There is a simplified OSD replacement process that is more robust.
>   * You can query the supported features and (apparent) releases of
> all connected daemons and clients with `ceph features`
>   * You can configure the oldest Ceph client version you wish to allow to
> connect to the cluster via `ceph osd set-require-min-compat-client` and
> Ceph will prevent you from enabling features that will break compatibility
> with those clients.
>   * Several `sleep` settings, include `osd_recovery_sleep`,
> `osd_snap_trim_sleep`, and `osd_scrub_sleep` have been
> reimplemented to work efficiently.  (These are used in some cases
> to work around issues throttling background work.)

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai

Hello,


We have checked all the drives, and there is no problem with them. If there 
were a failing drive, I think the slow requests would also appear in the 
normal traffic, as the ceph cluster uses all the OSDs as primaries for some 
PGs. But these slow requests appear only during the backfill. I will try to 
dig deeper into the IO operations at the next test.

Kind regards,
Laszlo



On 01.09.2017 16:08, David Turner wrote:

That is normal to have backfilling because the crush map did change. The host 
and the chassis have crush numbers and their own weight which is the sum of the 
osds under them.  By moving the host into the chassis you changed the weight of 
the chassis and that affects the PG placement even though you didn't change the 
failure domain.

Osd_max_backfills = 1 shouldn't impact customer traffic and cause blocked 
requests. Most people find that they can use 3-5 before the disks are active 
enough to come close to impacting customer traffic.  That would lead me to 
think you have a dying drive that you're reading from/writing to in sectors 
that are bad or at least slower.


On Fri, Sep 1, 2017, 6:13 AM Laszlo Budai mailto:las...@componentsoft.eu>> wrote:

Hi David,

Well, most probably the larger part of our PGs will have to be reorganized, as we are 
moving from 9 hosts to 3 chassis. But I was hoping to be able to throttle the backfilling 
to an extent where it has minimal impact on our user traffic. Unfortunately I wasn't able 
to do it. I saw that the newer versions of ceph have the "osd recovery sleep" 
parameter. I think this would help, but unfortunately it's not present in hammer ... :(

Also I have an other question: Is it normal to have backfill when we add a 
host to a chassis even if we don't change the CRUSH rule? Let me explain: We 
have the hosts directly assigned to the root bucket. Then we add chassis to the 
root, and then we move a host from the root to the chassis. In all this time 
the rule set remains unchanged, with the host being the failure domain.

Kind regards,
Laszlo


On 31.08.2017 17:56, David Turner wrote:
 > How long are you seeing these blocked requests for?  Initially or 
perpetually?  Changing the failure domain causes all PGs to peer at the same time. 
 This would be the cause if it happens really quickly.  There is no way to avoid 
all of them peering while making a change like this.  After that, It could easily 
be caused because a fair majority of your data is probably set to move around.  I 
would check what might be causing the blocked requests during this time.  See if 
there is an OSD that might be dying (large backfills have the tendancy to find a 
couple failing drives) which could easily cause things to block.  Also checking if 
your disks or journals are maxed out with iostat could shine some light on any 
mitigating factor.
 >
 > On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai mailto:las...@componentsoft.eu> >> wrote:
 >
 > Dear all!
 >
 > In our Hammer cluster we are planning to switch our failure domain 
from host to chassis. We have performed some simulations, and regardless of the 
settings we have used some slow requests have appeared all the time.
 >
 > we had the the following settings:
 >
 >"osd_max_backfills": "1",
 >   "osd_backfill_full_ratio": "0.85",
 >   "osd_backfill_retry_interval": "10",
 >   "osd_backfill_scan_min": "1",
 >   "osd_backfill_scan_max": "4",
 >   "osd_kill_backfill_at": "0",
 >   "osd_debug_skip_full_check_in_backfill_reservation": "false",
 >   "osd_debug_reject_backfill_probability": "0",
 >
 >  "osd_min_recovery_priority": "0",
 >   "osd_allow_recovery_below_min_size": "true",
 >   "osd_recovery_threads": "1",
 >   "osd_recovery_thread_timeout": "60",
 >   "osd_recovery_thread_suicide_timeout": "300",
 >   "osd_recovery_delay_start": "0",
 >   "osd_recovery_max_active": "1",
 >   "osd_recovery_max_single_start": "1",
 >   "osd_recovery_max_chunk": "8388608",
 >   "osd_recovery_forget_lost_objects": "false",
 >   "osd_recovery_op_priority": "1",
 >   "osd_recovery_op_warn_multiple": "16",
 >
 >
 > we have also tested it with the CFQ IO scheduler on the OSDs and the 
following params:
 >   "osd_disk_thread_ioprio_priority": "7"
 >   "osd_disk_thread_ioprio_class": "idle"
 >
 > and the nodeep-scrub set.
 >
 > Is there anything else to try? Is there a good way to switch from 
one kind of failure domain to an other without slow requests?
 >
 > Thank you in advance for any suggestions.
 >
 > Kind regards,
 > Laszlo

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Found the partition, wasn't able to mount the partition right away... Did an 
xfs_repair on that drive.

Got a bunch of messages like this.. =(
entry "10a89fd.__head_AE319A25__0" in shortform directory 845908970 references non-existent inode 605294241
junking entry "10a89fd.__head_AE319A25__0" in directory inode 845908970

Was able to mount.  lost+found has lots of files there. =P  Running du seems to 
show OK files in the current directory.

Will it be safe to attach this one back to the cluster?  Is there a way to 
specify to use this drive if the data is missing? =)  Or am I being paranoid?  
Just plug it in? =)

Regards,
Hong


Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Tried connecting the recovered osd.  Looks like some of the files in the 
lost+found are superblocks.  Below is the log.  What can I do about this?

2017-09-01 22:27:27.634228 7f68837e5800  0 set uid:gid to 1001:1001 (ceph:ceph)
2017-09-01 22:27:27.634245 7f68837e5800  0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 5432
2017-09-01 22:27:27.635456 7f68837e5800  0 pidfile_write: ignore empty --pid-file
2017-09-01 22:27:27.646849 7f68837e5800  0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
2017-09-01 22:27:27.647077 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-09-01 22:27:27.647080 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-09-01 22:27:27.647091 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice is supported
2017-09-01 22:27:27.678937 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-09-01 22:27:27.679044 7f68837e5800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf
2017-09-01 22:27:27.680718 7f68837e5800  1 leveldb: Recovering log #28054
2017-09-01 22:27:27.804501 7f68837e5800  1 leveldb: Delete type=0 #28054
2017-09-01 22:27:27.804579 7f68837e5800  1 leveldb: Delete type=3 #28053
2017-09-01 22:27:35.586725 7f68837e5800  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-09-01 22:27:35.587689 7f68837e5800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 9998729216 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-09-01 22:27:35.589631 7f68837e5800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 9998729216 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-09-01 22:27:35.590041 7f68837e5800  1 filestore(/var/lib/ceph/osd/ceph-0) upgrade
2017-09-01 22:27:35.590149 7f68837e5800 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find #-1:7b3f43c4:::osd_superblock:0# in index: (2) No such file or directory
2017-09-01 22:27:35.590158 7f68837e5800 -1 osd.0 0 OSD::init() : unable to read osd superblock
2017-09-01 22:27:35.590547 7f68837e5800  1 journal close /var/lib/ceph/osd/ceph-0/journal
2017-09-01 22:27:35.611595 7f68837e5800 -1  ** ERROR: osd init failed: (22) Invalid argument

The recovered drive is mounted on /var/lib/ceph/osd/ceph-0.

# df
Filesystem      1K-blocks      Used  Available Use% Mounted on
udev                10240         0      10240   0% /dev
tmpfs             1584780      9172    1575608   1% /run
/dev/sda1        15247760   9319048    5131120  65% /
tmpfs             3961940         0    3961940   0% /dev/shm
tmpfs                5120         0       5120   0% /run/lock
tmpfs             3961940         0    3961940   0% /sys/fs/cgroup
/dev/sdb1      1952559676 634913968 1317645708  33% /var/lib/ceph/osd/ceph-0
/dev/sde1      1952559676 640365952 1312193724  33% /var/lib/ceph/osd/ceph-6
/dev/sdd1      1952559676 712018768 1240540908  37% /var/lib/ceph/osd/ceph-2
/dev/sdc1      1952559676 755827440 1196732236  39% /var/lib/ceph/osd/ceph-1
/dev/sdf1       312417560  42538060  269879500  14% /var/lib/ceph/osd/ceph-7
tmpfs              792392         0     792392   0% /run/user/0
# cd /var/lib/ceph/osd/ceph-0
# ls
activate.monmap  current   journal_uuid  magic          superblock  whoami
active           fsid      keyring       ready          sysvinit
ceph_fsid        journal   lost+found    store_version  type

Regards,
Hong
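
No idea whether a missing osd_superblock can be regenerated safely, but for inspecting what actually survived on that filestore, ceph-objectstore-tool (run while the OSD is stopped) can at least list the objects; a sketch using the paths from the df output above:

# OSD must not be running
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --journal-path /var/lib/ceph/osd/ceph-0/journal --op list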


Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Just realized there is a file called superblock in the ceph directory.  ceph-1 
and ceph-2's superblock files are identical, ceph-6 and ceph-7's are identical, 
but not between the two groups.  When I originally created the OSDs, I created 
ceph-0 through 5.  Can the superblock file be copied over from ceph-1 to ceph-0?

Hmm.. it appears to be doing something in the background even though osd.0 is 
down.  The ceph health output is changing!

# ceph health
HEALTH_ERR 40 pgs are stuck inactive for more than 300 seconds; 14 pgs 
backfill_wait; 21 pgs degraded; 10 pgs down; 2 pgs inconsistent; 10 pgs 
peering; 3 pgs recovering; 2 pgs recovery_wait; 30 pgs stale; 21 pgs stuck 
degraded; 10 pgs stuck inactive; 30 pgs stuck stale; 45 pgs stuck unclean; 16 
pgs stuck undersized; 16 pgs undersized; 2 requests are blocked > 32 sec; 
recovery 221826/2473662 objects degraded (8.968%); recovery 254711/2473662 
objects misplaced (10.297%); recovery 103/2251966 unfound (0.005%); 7 scrub 
errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag 
is not set

Regards,
Hong
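
While that recovery churns in the background, a couple of read-only ways to keep an eye on it:

# live stream of cluster state changes
ceph -w
# current counts of degraded/misplaced/unfound objects
ceph health detail | grep -Ei 'unfound|degraded|misplaced'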
