Re: [ceph-users] Cannot start 5/20 OSDs

2013-09-23 Thread Samuel Just
Can you restart those OSDs with

debug osd = 20
debug filestore = 20
debug ms = 1

in the [osd] section of the ceph.conf file on the respective machines
and upload the logs?  Sounds like a bug.
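
For reference, the conf fragment and a rough sequence would look something
like this (the log path assumes the default /var/log/ceph location and the
Upstart job you are already using):

[osd]
  debug osd = 20
  debug filestore = 20
  debug ms = 1

# restart one flapping OSD and collect its log once it crashes again
start ceph-osd id=X
ls -lh /var/log/ceph/ceph-osd.X.log
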
-Sam

On Tue, Sep 17, 2013 at 2:05 PM, Matt Thompson wrote:
> Hi All,
>
> I set up a new cluster today with 20 OSDs spanning 4 machines (journals not
> stored on separate disks), and a single MON running on a separate server (I
> understand a single MON is not ideal for production environments).
>
> The cluster had the default pools along with the ones created by radosgw.
> There was next to no user data on the cluster, with the exception of a few
> test files uploaded via the swift client.
>
> I ran the following on one node to increase replica size from 2 to 3:
>
> for x in $(rados lspools); do ceph osd pool set $x size 3; done
>
> After doing this, I noticed that 5 OSDs were down. Repeatedly restarting
> them with the following command brings them back online momentarily, but
> then they go down / out again:
>
> start ceph-osd id=X
>
> Looking across the affected nodes, I'm seeing errors like this in the
> respective osd logs:
>
> osd/ReplicatedPG.cc: 5405: FAILED assert(ssc)
>
>  ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
>  1: (ReplicatedPG::prep_push_to_replica(ObjectContext*, hobject_t const&,
> int, int, PushOp*)+0x8ea) [0x5fd50a]
>  2: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t,
> int, std::map<int, std::vector<PushOp, std::allocator<PushOp> >,
> std::less<int>, std::allocator<std::pair<int const, std::vector<PushOp,
> std::allocator<PushOp> > > > >*)+0x722) [0x5fe552]
>  3: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x657)
> [0x5ff487]
>  4: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*,
> ThreadPool::TPHandle&)+0x736) [0x61d9c6]
>  5: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x1b8) [0x6863e8]
>  6: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x11) [0x6c5541]
>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b8df6]
>  8: (ThreadPool::WorkThread::entry()+0x10) [0x8bac00]
>  9: (()+0x7e9a) [0x7f610c09fe9a]
>  10: (clone()+0x6d) [0x7f610a91dccd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> Have I done something foolish, or am I hitting a legitimate issue here?
>
> On a side note, my cluster is now in the following state:
>
> 2013-09-17 20:47:13.651250 mon.0 [INF] pgmap v1536: 248 pgs: 243
> active+clean, 2 active+recovery_wait, 3 active+recovering; 5497 bytes data,
> 866 MB used, 999 GB / 1000 GB avail; 21/255 degraded (8.235%); 7/85 unfound
> (8.235%)
>
> According to ceph health detail, the unfound objects are in the .users.uid
> and .rgw radosgw pools; I suppose I can remove those pools and have radosgw
> recreate them?  If this is not recoverable, is it advisable to just format
> the cluster and start again?
>
> Thanks in advance for the help.
>
> Regards,
> Matt
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cannot start 5/20 OSDs

2013-09-17 Thread Matt Thompson
Hi All,

I set up a new cluster today with 20 OSDs spanning 4 machines (journals not
stored on separate disks), and a single MON running on a separate server (I
understand a single MON is not ideal for production environments).

The cluster had the default pools along with the ones created by radosgw.
There was next to no user data on the cluster, with the exception of a few
test files uploaded via the swift client.

I ran the following on one node to increase replica size from 2 to 3:

for x in $(rados lspools); do ceph osd pool set $x size 3; done
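
For reference, something like the following should confirm the new size took
effect on each pool (this assumes the stock ceph CLI; it prints the size
value per pool):

for x in $(rados lspools); do echo -n "$x: "; ceph osd pool get $x size; done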

After doing this, I noticed that 5 OSDs were down. Repeatedly restarting
them with the following command brings them back online momentarily, but
then they go down / out again:

start ceph-osd id=X
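
To watch them drop back out, the usual status commands are enough (nothing
exotic here):

ceph osd tree   # shows which OSDs are currently marked down/out
ceph -w         # follow the cluster log while the OSDs flap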

Looking across the affected nodes, I'm seeing errors like this in the
respective osd logs:

osd/ReplicatedPG.cc: 5405: FAILED assert(ssc)

 ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
 1: (ReplicatedPG::prep_push_to_replica(ObjectContext*, hobject_t const&,
int, int, PushOp*)+0x8ea) [0x5fd50a]
 2: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t,
int, std::map<int, std::vector<PushOp, std::allocator<PushOp> >,
std::less<int>, std::allocator<std::pair<int const, std::vector<PushOp,
std::allocator<PushOp> > > > >*)+0x722) [0x5fe552]
 3: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x657)
[0x5ff487]
 4: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*,
ThreadPool::TPHandle&)+0x736) [0x61d9c6]
 5: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x1b8) [0x6863e8]
 6: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x11) [0x6c5541]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b8df6]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x8bac00]
 9: (()+0x7e9a) [0x7f610c09fe9a]
 10: (clone()+0x6d) [0x7f610a91dccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
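
In case it is useful, this is roughly how such a dump could be generated on
one of the affected nodes (the path assumes the stock package location for
the ceph-osd binary):

objdump -rdS /usr/bin/ceph-osd > ceph-osd-0.67.3.dump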

Have I done something foolish, or am I hitting a legitimate issue here?

On a side note, my cluster is now in the following state:

2013-09-17 20:47:13.651250 mon.0 [INF] pgmap v1536: 248 pgs: 243
active+clean, 2 active+recovery_wait, 3 active+recovering; 5497 bytes data,
866 MB used, 999 GB / 1000 GB avail; 21/255 degraded (8.235%); 7/85 unfound
(8.235%)

According to ceph health detail, the unfound objects are in the .users.uid
and .rgw radosgw pools; I suppose I can remove those pools and have radosgw
recreate them?  If this is not recoverable, is it advisable to just format
the cluster and start again?
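
For reference, I believe the standard way to dig into the unfound objects
before deciding would be something along these lines (PG IDs taken from the
health output):

ceph health detail              # lists the PGs that report unfound objects
ceph pg <pgid> list_missing     # shows which objects that PG cannot find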

Thanks in advance for the help.

Regards,
Matt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com