Can you restart those OSDs with
debug osd = 20
debug filestore = 20
debug ms = 1
in the [osd] section of the ceph.conf file on the respective machines
and upload the logs? Sounds like a bug.
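For reference, the section would look like this (assuming the stock config path /etc/ceph/ceph.conf on each host):

```ini
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```

With default settings the logs should then end up in /var/log/ceph/ceph-osd.<id>.log on each machine.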
-Sam
On Tue, Sep 17, 2013 at 2:05 PM, Matt Thompson wrote:
> Hi All,
>
> I set up a new cluster today w/ 20 OSDs spanning 4 machines (journals not
> stored on separate disks), and a single MON running on a separate server
> (I understand a single MON is not ideal for production environments).
>
> The cluster had the default pools along w/ the ones created by radosgw.
> There was next to no user data on the cluster with the exception of a few
> test files uploaded via swift client.
>
> I ran the following on one node to increase replica size from 2 to 3:
>
> for x in $(rados lspools); do ceph osd pool set $x size 3; done
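A quick way to confirm the change took effect on every pool (the `rep size` field name below is what this release prints in `ceph osd dump`):

```shell
# List each pool's replication size; every pool should now show "rep size 3"
ceph osd dump | grep 'rep size'
```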
>
> After doing this, I noticed that 5 OSDs were down; repeatedly restarting
> them with the following brings them back online momentarily, but then they
> go down / out again:
>
> start ceph-osd id=X
>
> Looking across the affected nodes, I'm seeing errors like this in the
> respective osd logs:
>
> osd/ReplicatedPG.cc: 5405: FAILED assert(ssc)
>
> ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
> 1: (ReplicatedPG::prep_push_to_replica(ObjectContext*, hobject_t const&,
> int, int, PushOp*)+0x8ea)
> [0x5fd50a]
> 2: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t,
> int, std::map<int, std::vector<PushOp, std::allocator<PushOp> >,
> std::less<int>, std::allocator<std::pair<int const, std::vector<PushOp,
> std::allocator<PushOp> > > > >*)+0x722) [0x5fe552]
> 3: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x657)
> [0x5ff487]
> 4: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*,
> ThreadPool::TPHandle&)+0x736) [0x61d9c6]
> 5: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x1b8) [0x6863e8]
> 6: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x11) [0x6c5541]
> 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b8df6]
> 8: (ThreadPool::WorkThread::entry()+0x10) [0x8bac00]
> 9: (()+0x7e9a) [0x7f610c09fe9a]
> 10: (clone()+0x6d) [0x7f610a91dccd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> Have I done something foolish, or am I hitting a legitimate issue here?
>
> On a side note, my cluster is now in the following state:
>
> 2013-09-17 20:47:13.651250 mon.0 [INF] pgmap v1536: 248 pgs: 243
> active+clean, 2 active+recovery_wait, 3 active+recovering; 5497 bytes data,
> 866 MB used, 999 GB / 1000 GB avail; 21/255 degraded (8.235%); 7/85 unfound
> (8.235%)
>
> According to `ceph health detail`, the unfound objects are in the .users.uid
> and .rgw radosgw pools; I suppose I could remove those pools and have radosgw
> recreate them? If this is not recoverable, is it advisable to just format
> the cluster and start again?
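If the unfound objects turn out to be unrecoverable, a sketch of the usual sequence (the pgid 3.0 below is a placeholder; substitute the PGs that `ceph health detail` actually reports):

```shell
# Show which PGs still have unfound objects
ceph health detail
# Inspect the missing objects in one affected PG (placeholder pgid)
ceph pg 3.0 list_missing
# Last resort: tell the cluster to give up on the unfound objects
ceph pg 3.0 mark_unfound_lost revert
```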
>
> Thanks in advance for the help.
>
> Regards,
> Matt
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>