Re: [ceph-users] blocked ops

2016-08-12 Thread Brad Hubbard
On Fri, Aug 12, 2016 at 07:47:54AM +0100, roeland mertens wrote:
> Hi Brad,
> 
> thank you for that. Unfortunately our immediate concern is the blocked ops
> rather than the broken pg (we know why it's broken).

OK, if you look at the following file it shows not only the declaration of
wait_for_blocked_object (highlighted) but also all of its callers.

https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L500

Multiple callers relate to snapshots, but turning debug logging for the OSDs
right up may give us more information.

# ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 5'

Note: the above will turn up debugging for all OSDs; you may want to focus on
only some, so adjust accordingly.
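
For example, to raise logging on just the OSD reporting the blocked ops and then
drop it back to (roughly) the defaults afterwards, something along these lines
should work:

# ceph tell osd.4 injectargs '--debug_osd 20 --debug_ms 5'
# ceph tell osd.4 injectargs '--debug_osd 0/5 --debug_ms 0/5'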

> I don't think that's specifically crushmap related nor related to the
> broken pg as the osds involved in the blocked ops aren't the ones that were
> hosting the broken pg.
> 
> 
> 
> 
> On 12 August 2016 at 04:12, Brad Hubbard  wrote:
> 
> > On Thu, Aug 11, 2016 at 11:33:29PM +0100, Roeland Mertens wrote:
> > > Hi,
> > >
> > > I was hoping someone on this list may be able to help?
> > >
> > > We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12
> > hours
> > > we've been plagued with blocked requests which completely kills the
> > > performance of the cluster
> > >
> > > # ceph health detail
> > > HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
> > down; 1
> > > pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 sec; 1
> > osds
> > > have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
> > > pg 63.1a18 is stuck inactive for 135133.509820, current state
> > > down+remapped+peering, last acting [2147483647,2147483647,
> > 2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
> >
> > That value (2147483647) is defined in src/crush/crush.h like so;
> >
> > #define CRUSH_ITEM_NONE   0x7fff  /* no result */
> >
> > So this could be due to a bad crush rule or maybe choose_total_tries needs
> > to
> > be higher?
> >
> > $ ceph osd crush rule ls
> >
> > For each rule listed by the above command.
> >
> > $ ceph osd crush rule dump [rule_name]
> >
> > I'd then dump out the crushmap and test it showing any bad mappings with
> > the
> > commands listed here;
> >
> > http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
> >
> > I'd also check the pg numbers for your pool(s) are appropriate as not
> > enough
> > pgs could also be a contributing factor IIRC.
> >
> > That should hopefully give some insight.
> >
> > --
> > HTH,
> > Brad
> >
> > > pg 63.1a18 is down+remapped+peering, acting [2147483647,2147483647,
> > 2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
> > > 100 ops are blocked > 2097.15 sec on osd.4
> > > 1 osds have slow requests
> > > noout,nodeep-scrub,sortbitwise flag(s) set
> > >
> > > the one pg down is due to us running into an odd EC issue which I mailed
> > the
> > > list about earlier, it's the 100 blocked ops that are puzzling us. If we
> > out
> > > the osd in question, they just shift to another osd (on a different
> > host!).
> > > We even tried rebooting the node it's on but to little avail.
> > >
> > > We get a ton of log messages like this:
> > >
> > > 2016-08-11 23:32:10.041174 7fc668d9f700  0 log_channel(cluster) log
> > [WRN] :
> > > 100 slow requests, 5 included below; oldest blocked for > 139.313915 secs
> > > 2016-08-11 23:32:10.041184 7fc668d9f700  0 log_channel(cluster) log
> > [WRN] :
> > > slow request 139.267004 seconds old, received at 2016-08-11
> > 23:29:50.774091:
> > > osd_op(client.9192464.0:485640 66.b96c3a18
> > > default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read
> > > 0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109)
> > > currently waiting for blocked object
> > > 2016-08-11 23:32:10.041189 7fc668d9f700  0 log_channel(cluster) log
> > [WRN] :
> > > slow request 139.244839 seconds old, received at 2016-08-11
> > 23:29:50.796256:
> > > osd_op(client.9192464.0:596033 66.942a5a18
> > > default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write
> > > 1048576~524288] snapc 0=[] RETRY=36
> > > ack+ondisk+retry+write+known_if_redirected e50109) currently waiting for
> > > blocked object
> > >
> > > A dump of the blocked ops tells us very little. Is there anyone who can
> > > shed some light on this? Or at least give us a hint on how we can fix
> > > this?
> > >
> > > # ceph daemon osd.4 dump_blocked_ops
> > > 
> > >
> > >{
> > > "description": "osd_op(client.9192464.0:596030 66.942a5a18
> > > default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [writefull
> > > 0~0] snapc 0=[] RETRY=32 ack+ondisk+retry+write+known_if_redirected
> > > e50092)",
> > > "initiated_at": "2016-08-11 22:58:09.721027",
> > > "age": 1515.105186,
> > > "duration": 1515.113255,
> > > "type_data": [
> > >  
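
PS: the crush-gives-up-too-soon check linked above boils down to something like
the following (the rule number here is illustrative, and --num-rep should match
the pool size, 13 for your k+m EC pool):

# ceph osd getcrushmap -o /tmp/crush.bin
# crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# crushtool -i /tmp/crush.bin --test --show-bad-mappings --rule 1 --num-rep 13 --min-x 0 --max-x 10000

Any lines printed by --show-bad-mappings are inputs CRUSH could not map to the
full set of OSDs, which is what an acting set padded with 2147483647 suggests.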

Re: [ceph-users] High-performance way for access Windows of users to Ceph.

2016-08-12 Thread Nick Fisk
I’m not sure how stable that ceph-dokan is; I would imagine the best way to 
present CephFS to Windows users would be through Samba.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Александр Пивушков
Sent: 12 August 2016 07:53
To: ceph-users 
Subject: [ceph-users] High-performance way for access Windows of users to Ceph.

 

Hello,

I am continuing to design a high-performance, petabyte-scale Ceph cluster.

We plan to purchase a high-performance server running Windows Server 2016 for the clients. 
The clients run in Docker containers:
https://docs.docker.com/engine/installation/windows/
Virtualization or not, it does not matter...

The clients run a program of their own which generates files of various sizes, 
from 1 KB to 200 GB (yes, a scary single-file size). We plan to use 40 GB/s 
InfiniBand between the clients and Ceph. The clients always work with Ceph one 
at a time, and only ever for either writing or reading.

So far I do not understand which Ceph technology is appropriate to use: object, 
block, or file storage (CephFS).
It seems to me that I need to use MDS, CephFS and ceph-dokan:
https://github.com/ketor/ceph-dokan


Please share your experience: how can Windows users be given access to the Ceph 
server with minimal (preferably zero :(  ) overhead?
I.e., how can the files generated by the program on Windows get to Ceph as 
quickly as possible?

-- 
Александр Пивушков

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High-performance way for access Windows of users to Ceph.

2016-08-12 Thread Александр Пивушков

Thanks for the answer!

Samba is very slow.
I have very strict requirements for speed.

I need to write 160 GB per minute from the Windows users to Ceph, and read the 
same amount back ...
So I'm trying to find something :) with zero overhead.


>Friday, 12 August 2016, 10:33 +03:00 from Nick Fisk :
>
>I’m not sure how stable that ceph dokan is, I would imagine the best way to 
>present ceph-fs to windows users would be through samba.
> 
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Александр Пивушков
>Sent: 12 August 2016 07:53
>To: ceph-users < ceph-users@lists.ceph.com >
>Subject: [ceph-users] High-performance way for access Windows of users to Ceph.
> 
>Hello,
>I continue to design high-performance cluster Ceph, petascale.
>Scheduled to purchase a high-performance server, OS Windows 2016, for  
>clients. Clients are in the Dockers.
>https://docs.docker.com/engine/installation/windows/
>Virtualization. It does not matter...
> Clients run program written by them, which generates files of various sizes - 
>from 1 KB to 200 GB (yes, creepy single file size). Network planning to use 
>Infiniband 40 GB/s between clients and Ceph. Clients work with Ceph always on 
>one,  and always only either for record or for reading.
>While I do not understand what Ceph technology is appropriate to use? Object, 
>block, or file storage CephFS.
>So far, it seems to me, i need to use MDS, CephFS and ceph-dokan
>https://github.com/ketor/ceph-dokan
>
>Please share the experience of how it is possible to provide access with 
>minimal overhead (preferably zero :(  ) Windows Ceph users to the server?
> Ie  how to make sure that the files generated by the program on Windows very 
>quickly proved to Ceph.
>
>-- 
>Александр Пивушков


-- 
Александр Пивушков
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High-performance way for access Windows of users to Ceph.

2016-08-12 Thread Maxime Guyot
Hi,


> “Clients run program written by them, which generates files of various sizes 
> - from 1 KB to 200 GB”
If the clients are running custom software on Windows and if at all possible, I 
would consider using 
librados. The 
library is available for C/C++, Java, PHP and Python. The object API is fairly 
simple and would lift the CephFS requirement.
Using Rados your client will be able to talk directly to the cluster (OSDs).
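
If you want a feel for that data path before writing any code, the rados CLI 
(itself built on librados) can exercise it directly; pool and object names here 
are only examples:

# ceph osd pool create testpool 64
# rados -p testpool put bigfile.obj ./bigfile.bin
# rados -p testpool stat bigfile.obj
# rados -p testpool get bigfile.obj /tmp/bigfile.out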

There are other options to access Ceph from Windows, but they require a gateway 
(CephFS to NFS/Samba or RBD to NFS/Samba), which usually ends up being a 
bottleneck and a SPOF.

Regarding the performance, you mentioned 160 GB/min, so that is 2.7 GB/s. That 
shouldn’t be too difficult to reach with journals on SSDs.
In a previous thread you mentioned 468 OSDs. Doing a quick napkin calculation 
with a journal:OSD ratio of 1:6 (usually 1:4 to 1:6), that is 78 journals. If 
you estimate a 400 MB/s journal write speed (like the Intel S3710 series) and a 
replica factor of 3, you have a maximum theoretical write speed of ~10 GB/s. 
Say you get ~50% of the theoretical write speed (I usually reach 50-60%), you 
are still above your target of 2.7 GB/s.
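
Spelled out, the napkin math is:

160 GB/min / 60 ≈ 2.7 GB/s target
468 OSDs / 6 OSDs per journal = 78 journal SSDs
78 journals x 400 MB/s ≈ 31 GB/s raw journal bandwidth
31 GB/s / 3 replicas ≈ 10 GB/s theoretical client write speed
10 GB/s x ~50% ≈ 5 GB/s achievable, roughly twice the 2.7 GB/s target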

Regards
Maxime G.


From: ceph-users  on behalf of Nick Fisk 

Reply-To: "n...@fisk.me.uk" 
Date: Friday 12 August 2016 09:33
To: 'Александр Пивушков' , 'ceph-users' 

Subject: Re: [ceph-users] High-performance way for access Windows of users to 
Ceph.

I’m not sure how stable that ceph dokan is, I would imagine the best way to 
present ceph-fs to windows users would be through samba.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Александр Пивушков
Sent: 12 August 2016 07:53
To: ceph-users 
Subject: [ceph-users] High-performance way for access Windows of users to Ceph.

Hello,

I continue to design high-performance cluster Ceph, petascale.

Scheduled to purchase a high-performance server, OS Windows 2016, for  clients. 
Clients are in the Dockers.
https://docs.docker.com/engine/installation/windows/
Virtualization. It does not matter...

 Clients run program written by them, which generates files of various sizes - 
from 1 KB to 200 GB (yes, creepy single file size). Network planning to use 
Infiniband 40 GB/s between clients and Ceph. Clients work with Ceph always on 
one,  and always only either for record or for reading.

While I do not understand what Ceph technology is appropriate to use? Object, 
block, or file storage CephFS.
So far, it seems to me, i need to use MDS, CephFS and ceph-dokan
https://github.com/ketor/ceph-dokan

Please share the experience of how it is possible to provide access with 
minimal overhead (preferably zero :(  ) Windows Ceph users to the server?
 Ie  how to make sure that the files generated by the program on Windows very 
quickly proved to Ceph.

--
Александр Пивушков


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Include mon restart in logrotate?

2016-08-12 Thread Eugen Block

So yeah, upgrade time.


Well, I guess I'll upgrade then, but not before next week. Thanks for  
your answers, I'll report back as soon as we have some reliable data.
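
A possible stopgap in the meantime (assuming hammer honours these, which I  
believe it does) would be to compact the stores by hand:

# ceph tell mon.mon1 compact

or to set "mon compact on start = true" and restart the MONs one at a time.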



Zitat von Christian Balzer :


Hello,

see below.

On Thu, 11 Aug 2016 12:52:32 +0200 (CEST) Wido den Hollander wrote:



> On 11 August 2016 at 10:18, Eugen Block wrote:
>
>
> Thanks for the really quick response!
>
> > Warning! These are not your regular log files.
>
> Thanks for the warning!
>
> > You shouldn't have to worry about that. The MONs should compact and
> > rotate those logs themselves.
>
> I believe the compaction works fine, but these large LOG files just
> grow until mon restart. Is there no way to limit the size to a desired
> value or anything similar?
>

That's not good. That shouldn't happen. The monitor has to trim  
these logs as well.


How big is your mon store?

$ du -sh /var/lib/ceph/mon/*

> > What version of Ceph are you running exactly?
>
> ceph@node1:~/ceph-deploy> ceph --version
> ceph version 0.94.6-75
>

0.94.7 is already out, might be worth upgrading. The release notes  
don't say anything about this case though.




0.94.5 definitely has that bug (no compaction on either MON or OSD
leveldbs) and I remember a tracker and release note entry about that.

And 0.94.7 definitely doesn't have that problem.
While 0.94.6 has the potential to eat all your data with cache-tiering.

So yeah, upgrade time.

Christian


> > What is the output of ceph -s?
>
> ceph@node1:~/ceph-deploy> ceph -s
>  cluster 655cb05a-435a-41ba-83d9-8549f7c36167
>   health HEALTH_OK
>   monmap e7: 3 mons at
>  
{mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}

>  election epoch 242, quorum 0,1,2 mon1,mon2,mon3
>   osdmap e2377: 19 osds: 19 up, 19 in
>pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects
>  3223 GB used, 4929 GB / 8153 GB avail
>  4336 active+clean
>client io 0 B/s rd, 72112 B/s wr, 7 op/s
>

Ok, that's good. Monitors don't trim the logs when the cluster  
isn't healthy, but yours is.


Wido

>
> Zitat von Wido den Hollander :
>
> >> On 11 August 2016 at 9:56, Eugen Block wrote:
> >>
> >>
> >> Hi list,
> >>
> >> we have a working cluster based on Hammer with 4 nodes, 19  
OSDs and 3 MONs.

> >> Now after a couple of weeks we noticed that we're running out of disk
> >> space on one of the nodes in /var.
> >> Similar to [1] there are two large LOG files in
> >> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are
> >> managed when the respective MON is restarted. But the MONs are not
> >> restarted regularly so the log files can grow for months and fill up
> >> the file system.
> >>
> >
> > Warning! These are not your regular log files. They are binary logs
> > of LevelDB which are mandatory for the MONs to work!
> >
> >> I was thinking about adding another file in /etc/logrotate.d/ and
> >> trigger a monitor restart once a week. But I'm not sure if it's
> >> recommended to restart all MONs at the same time, which could happen
> >> if someone started logrotate manually.
> >> So my question is, how do you guys manage that and how is it supposed
> >> to be handled? I'd really appreciate any insights!
> >>
> > You shouldn't have to worry about that. The MONs should compact and
> > rotate those logs themselves.
> >
> > They compact their store on start, so that works for you, but they
> > should do this while running.
> >
> > What version of Ceph are you running exactly?
> >
> > What is the output of ceph -s? MONs usually only compact when the
> > cluster is healthy.
> >
> > Wido
> >
> >> Regards,
> >> Eugen
> >>
> >> [1]
> >>  
http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor

> >>
> >> --
> >> Eugen Block voice   : +49-40-559 51 75
> >> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> >> Postfach 61 03 15
> >> D-22423 Hamburg e-mail  : ebl...@nde.ag
> >>
> >>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >>Sitz und Registergericht: Hamburg, HRB 90934
> >>Vorstand: Jens-U. Mozdzen
> >> USt-IdNr. DE 814 013 983
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Eugen Block voice   : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> Postfach 61 03 15
> D-22423 Hamburg e-mail  : ebl...@nde.ag
>
>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>Sitz und Registergericht: Hamburg, HRB 90934
>Vorstand: Jens-U. Mozdzen
> USt-IdNr. DE 814 013 983
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] High-performance way for access Windows of users to Ceph.

2016-08-12 Thread Александр Пивушков

Yes, I have seen the capabilities of librados. I will keep it as a last resort; 
I am looking for a ready-made solution.

The purchase is planned. That Intel S3710 series, 1:4 to 6 TB SATA drives.
Truth be told, the cost of 78 SSDs in rubles looks scary :) but that's another 
story ...
Is there a quick turnkey solution for Windows clients?



>Friday, 12 August 2016, 11:08 +03:00 from Maxime Guyot :
>
>Hi,
> 
>> “Clients run program written by them, which generates files of various sizes 
>>- from 1 KB to 200 GB”
>If the clients are running custom software on Windows and if at all possible, 
>I would consider using librados . The library is available for C/C++, Java, 
>PHP and Python. The object API is fairly simple and would lift the CephFS 
>requirement.
>Using Rados your client will be able to talk directly to the cluster (OSDs).
> 
>Some other options to access Ceph form Windows, but require a gateway (CephFS 
>to NFS/Samba or RBD to NFS/Samba) which usually ends up being a bottleneck and 
>a SPOF.
> 
>Regarding the performance, you mentioned 160GB/min, so that is 2.7 GB/s. That 
>shouldn’t be too difficult to reach with Journals on SSDs.
>In a previous thread you mentioned 468 OSDs. Doing a quick napkin calculation 
>with a Journal:OSD ratio of 1:6 (usually 1:4 to 1:6), that should be 78 
>Journals, if you estimate 400MB/s (like the Intel S3710 serie) Journal write 
>speed and a replica factor of 3, you have a maximum theoretical write speed of 
>~10GB/s. Say you get ~50% (I usually reach 50~60% of the theoretical write 
>speed) of the theoretical write speed you are still above your target of 2.7 
>GB/s.
> 
>Regards 
>Maxime G.
>
> 
>From:  ceph-users < ceph-users-boun...@lists.ceph.com > on behalf of Nick Fisk 
>< n...@fisk.me.uk >
>Reply-To:  "n...@fisk.me.uk" < n...@fisk.me.uk >
>Date:  Friday 12 August 2016 09:33
>To:  'Александр Пивушков' < p...@mail.ru >, 'ceph-users' < 
>ceph-users@lists.ceph.com >
>Subject:  Re: [ceph-users] High-performance way for access Windows of users to 
>Ceph.
> 
>I’m not sure how stable that ceph dokan is, I would imagine the best way to 
>present ceph-fs to windows users would be through samba.
> 
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Александр Пивушков
>Sent: 12 August 2016 07:53
>To: ceph-users < ceph-users@lists.ceph.com >
>Subject: [ceph-users] High-performance way for access Windows of users to Ceph.
> 
>Hello,
>I continue to design high-performance cluster Ceph, petascale.
>Scheduled to purchase a high-performance server, OS Windows 2016, for  
>clients. Clients are in the Dockers.
>https://docs.docker.com/engine/installation/windows/
>Virtualization. It does not matter...
> Clients run program written by them, which generates files of various sizes - 
>from 1 KB to 200 GB (yes, creepy single file size). Network planning to use 
>Infiniband 40 GB/s between clients and Ceph. Clients work with Ceph always on 
>one,  and always only either for record or for reading.
>While I do not understand what Ceph technology is appropriate to use? Object, 
>block, or file storage CephFS.
>So far, it seems to me, i need to use MDS, CephFS and ceph-dokan
>https://github.com/ketor/ceph-dokan
>
>Please share the experience of how it is possible to provide access with 
>minimal overhead (preferably zero :(  ) Windows Ceph users to the server?
> Ie  how to make sure that the files generated by the program on Windows very 
>quickly proved to Ceph.
>
>-- 
>Александр Пивушков
>


-- 
Александр Пивушков
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs performance benchmark -- metadata intensive

2016-08-12 Thread John Spray
On Thu, Aug 11, 2016 at 1:24 PM, Brett Niver  wrote:
> Patrick and I had a related question yesterday, are we able to dynamically
> vary cache size to artificially manipulate cache pressure?

Yes -- at the top of MDCache::trim the max size is read straight out
of g_conf so it should pick up on any changes you do with "tell
injectargs".  Things might be a little bit funny though because the
new cache limit wouldn't be reflected in the logic in lru_adjust().
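
In practice that would be something like the following (the value is only an
example; mds_cache_size being the Jewel-era inode-count limit):

# ceph tell mds.<name> injectargs '--mds_cache_size 200000'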

John

> On Thu, Aug 11, 2016 at 6:07 AM, John Spray  wrote:
>>
>> On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen 
>> wrote:
>> > Hi ,
>> >
>> >
>> >  Here is the slide I shared yesterday on performance meeting.
>> > Thanks and hoping for inputs.
>> >
>> >
>> >
>> > http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark
>>
>> These are definitely useful results and I encourage everyone working
>> with cephfs to go and look at Xiaoxi's slides.
>>
>> The main thing that this highlighted for me was our lack of testing so
>> far on systems with full caches.  Too much of our existing testing is
>> done on freshly configured systems that never fill the MDS cache.
>>
>> Test 2.1 notes that we don't enable directory fragmentation by default
>> currently -- this is an issue, and I'm hoping we can switch it on by
>> default in Kraken (see thread "Switching on mds_bal_frag by default").
>> In the meantime we have the fix that Patrick wrote for Jewel which at
>> least prevents people creating dirfrags too large for the OSDs to
>> handle.
>>
>> Test 2.2: since a "failing to respond to cache pressure" bug is
>> affecting this, I would guess we see the performance fall off at about
>> the point where the *client* caches fill up (so they start trimming
>> things even though they're ignoring cache pressure).  It would be
>> interesting to see this chart with addition lines for some related
>> perf counters like mds_log.evtrm and mds.inodes_expired, that might
>> make it pretty obvious where the MDS is entering different stages that
>> see a decrease in the rate of handling client requests.
>>
>> We really need to sort out the "failing to respond to cache pressure"
>> issues that keep popping up, especially if they're still happening on
>> a comparatively simple test that is just creating files.  We have a
>> specific test for this[1] that is currently being run against the fuse
>> client but not the kernel client[2].  This is a good time to try and
>> push that forward so I've kicked off an experimental run here:
>>
>> http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/
>>
>> In the meantime, although there are reports of similar issues with
>> newer kernels, it would be very useful to confirm if the same issue is
>> still occurring with more recent kernels.  Issues with cache trimming
>> have occurred due to various (separate) bugs, so it's possible that
>> while some people are still seeing cache trimming issues with recent
>> kernels, the specific case you're hitting might be fixed.
>>
>> Test 2.3: restarting the MDS doesn't actually give you a completely
>> empty cache (everything in the journal gets replayed to pre-populate
>> the cache on MDS startup).  However, the results are still valid
>> because you're using a different random order in the non-caching test
>> case, and the number of inodes in your journal is probably much
>> smaller than the overall cache size so it's only a little bit
>> populated.  We don't currently have a "drop cache" command built into
>> the MDS but it would be pretty easy to add one for use in testing
>> (basically just call mds->mdcache->trim(0)).
>>
>> As one would imagine, the non-caching case is latency-dominated when
>> the working set is larger than the cache, where each client is waiting
>> for one open to finish before proceeding to the next.  The MDS is
>> probably capable of handling many more operations per second, but it
>> would need more parallel IO operations from the clients.  When a
>> single client is doing opens one by one, you're potentially seeing a
>> full network+disk latency for each one (though in practice the OSD
>> read cache will be helping a lot here).  This non-caching case would
>> be the main argument for giving the metadata pool low latency (SSD)
>> storage.
>>
>> Test 2.5: The observation that the CPU bottleneck makes using fast
>> storage for the metadata pool less useful (in sequential/cached cases)
>> is valid, although it could still be useful to isolate the metadata
>> OSDs (probably SSDs since not so much capacity is needed) to avoid
>> competing with data operations.  For random access in the non-caching
>> cases (2.3, 2.4) I think you would probably see an improvement from
>> SSDs.
>>
>> Thanks again to the team from ebay for sharing all this.
>>
>> John
>>
>>
>>
>> 1.
>> https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96
>> 2. http://tracker.ceph.com/issues/9466
>>
>>
>> >
>> > Xiaoxi
>> > --

[ceph-users] S3 lifecycle support in Jewel

2016-08-12 Thread Henrik Korkuc

Hey,

I noticed that the rgw lifecycle feature made it back into master almost a month 
ago. Is there any chance that it will be backported to Jewel? If not, 
are you aware of any incompatibilities with the Jewel code that would 
prevent/complicate a custom build with that code?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread Félix Barbeira
Hi,

I'm planning to make a ceph cluster but I have a serious doubt. At this
moment we have ~10 servers DELL R730xd with 12x4TB SATA disks. The official
ceph docs says:

"We recommend using a dedicated drive for the operating system and
software, and one drive for each Ceph OSD Daemon you run on the host."

I could use for example 1 disk for the OS and 11 for OSD data. In the
operating system I would run 11 daemons to control the OSDs. But... what
happens to the cluster if the disk with the OS fails?? Maybe the cluster
thinks that 11 OSDs failed and tries to replicate all that data over the
cluster... that sounds no good.

Should I use 2 disks for the OS, making a RAID1? In this case I'm "wasting"
8TB only for the ~10GB that the OS needs.

All the docs that I've been reading say Ceph has no single point
of failure, so I think that this scenario must have an optimal solution;
maybe somebody could help me.

Thanks in advance.

-- 
Félix Barbeira.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs going down when we bring down some OSD nodes Or cut-off the cluster network link between OSD nodes

2016-08-12 Thread Venkata Manojawa Paritala
Hi,

As part of our testing over a period of time, we used a lot of parameters
in Ceph.conf. With that configuration, we observed issues when we pulled
down 2 sites as mentioned earlier.

In the last couple of days, we cleaned up a lot of parameters and
configured only couple of mandatory parameters and we are not seeing any
issues when we bring down 2 sites. FYI..

Thanks & Regards,
Manoj

On Sat, Aug 6, 2016 at 8:23 PM, Venkata Manojawa Paritala <
manojaw...@vedams.com> wrote:

> Hi,
>
> We have configured single Ceph cluster in a lab with the below
> specification.
>
> 1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC). This
> is to simulate that nodes are part of different Data Centers and having
> network connectivity between them for DR.
> 2. Each site operates in a different subnet and each subnet is part of one
> VLAN. We have configured routing so that OSD nodes in one site can
> communicate to OSD nodes in the other 2 sites.
> 3. Each site will have one monitor  node, 2  OSD nodes (to which we have
> disks attached) and IO generating clients.
> 4. We have configured 2 networks.
> 4.1. Public network - To which all the clients, monitors and OSD nodes are
> connected
> 4.2. Cluster network - To which only the OSD nodes are connected for -
> Replication/recovery/hearbeat traffic.
>
> 5. We have 2 issues here.
> 5.1. We are unable sustain IO for clients from individual sites when we
> isolate the OSD nodes by bringing down ONLY the cluster network between
> sites. Logically this will make the individual sites to be in isolation
> with respect to the cluster network. Please note that the public network is
> still connected between the sites.
> 5.2. In a fully functional cluster, when we bring down 2 sites (shutdown
> the OSD services of 2 sites - say Site A OSDs and Site B OSDs) then, OSDs
> in the third site (Site C) are going down (OSD Flapping).
>
> We need workarounds/solutions to  fix the above 2 issues.
>
> Below are some of the parameters we have already mentioned in the
> Cenf.conf to sustain the cluster for a longer time, when we cut-off the
> links between sites. But, they were not successful.
>
> --
> [global]
> public_network = 10.10.0.0/16
> cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
> osd hearbeat address = 172.16.0.0/16
>
> [monitor]
> mon osd report timeout = 1800
>
> [OSD]
> osd heartbeat interval = 12
> osd hearbeat grace = 60
> osd mon heartbeat interval = 60
> osd mon report interval max = 300
> osd mon report interval min = 10
> osd mon act timeout = 60
> .
> .
> 
>
> We also confiured the parameter "osd_heartbeat_addr" and tried with the
> values - 1) Ceph public network (assuming that when we bring down the
> cluster network hearbeat should happen via public network). 2) Provided a
> different network range altogether and had physical connections. But both
> the options did not work.
>
> We have a total of 49 OSDs (14 in Site A, 14 in SiteB, 21 in SiteC) in the
> cluster. One Monitor in each Site.
>
> We need to try the below two options.
>
> A) Increase the "mon osd min down reporters" value. Question is how much.
> Say, if I give this value to 49, then will the client IO sustain when we
> cut-off the cluster network links between sites. In this case one issue
> would be that if the OSD is really down we wouldn't know.
>
> B) Add 2 monitors to each site. This would make each site with 3 monitors
> and the overall cluster will have 9 monitors. The reason we wanted to try
> this is, we think that the OSDs are going down as the the quorum is unable
> to find the minimum number nodes (may be monitors) to sustain.
>
> Thanks & Regards,
> Manoj
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread Brian ::
Hi Felix

If you have R730XD then you should have 2 x 2.5" slots on the back.
You can stick in SSDs in RAID1 for your OS here.



On Fri, Aug 12, 2016 at 12:41 PM, Félix Barbeira  wrote:
> Hi,
>
> I'm planning to make a ceph cluster but I have a serious doubt. At this
> moment we have ~10 servers DELL R730xd with 12x4TB SATA disks. The official
> ceph docs says:
>
> "We recommend using a dedicated drive for the operating system and software,
> and one drive for each Ceph OSD Daemon you run on the host."
>
> I could use for example 1 disk for the OS and 11 for OSD data. In the
> operating system I would run 11 daemons to control the OSDs. But...what
> happen to the cluster if the disk with the OS fails?? maybe the cluster
> thinks that 11 OSD failed and try to replicate all that data over the
> cluster...that sounds no good.
>
> Should I use 2 disks for the OS making a RAID1? in this case I'm "wasting"
> 8TB only for ~10GB that the OS needs.
>
> In all the docs that i've been reading says ceph has no unique single point
> of failure, so I think that this scenario must have a optimal solution,
> maybe somebody could help me.
>
> Thanks in advance.
>
> --
> Félix Barbeira.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread RDS
Mirror the OS disks, use 10 disks for 10 OSD's
> On Aug 12, 2016, at 7:41 AM, Félix Barbeira  wrote:
> 
> Hi,
> 
> I'm planning to make a ceph cluster but I have a serious doubt. At this 
> moment we have ~10 servers DELL R730xd with 12x4TB SATA disks. The official 
> ceph docs says:
> 
> "We recommend using a dedicated drive for the operating system and software, 
> and one drive for each Ceph OSD Daemon you run on the host."
> 
> I could use for example 1 disk for the OS and 11 for OSD data. In the 
> operating system I would run 11 daemons to control the OSDs. But...what 
> happen to the cluster if the disk with the OS fails?? maybe the cluster 
> thinks that 11 OSD failed and try to replicate all that data over the 
> cluster...that sounds no good.
> 
> Should I use 2 disks for the OS making a RAID1? in this case I'm "wasting" 
> 8TB only for ~10GB that the OS needs.
> 
> In all the docs that i've been reading says ceph has no unique single point 
> of failure, so I think that this scenario must have a optimal solution, maybe 
> somebody could help me.
> 
> Thanks in advance.
> 
> -- 
> Félix Barbeira.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Rick Stehno


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread Cybertinus

Hello Felix,

When you put your OS on a single drive and that drive fails, you will 
lose all the OSDs on that machine, because the entire machine goes 
down. The PGs that now miss a partner are going to be replicated again; 
in your case, the PGs that are on those 11 OSDs.
This rebuilding doesn't start right away, so you can safely reboot an 
OSD host without starting a major rebalance of your data.


I would put 2 drives in RAID1 if I were you. Putting 2 SSDs in the back 
2.5" slots, as suggested by Brian, sounds like the best option to me. 
This way you don't lose a massive amount of storage (2x10x8 = 160 TB you 
would lose otherwise, just for the OS installation...)


---
Kind regards,
Cybertinus

On 12-08-2016 13:41, Félix Barbeira wrote:


Hi,

I'm planning to make a ceph cluster but I have a serious doubt. At this 
moment we have ~10 servers DELL R730xd with 12x4TB SATA disks. The 
official ceph docs says:


"We recommend using a dedicated drive for the operating system and 
software, and one drive for each Ceph OSD Daemon you run on the host."


I could use for example 1 disk for the OS and 11 for OSD data. In the 
operating system I would run 11 daemons to control the OSDs. But...what 
happen to the cluster if the disk with the OS fails?? maybe the cluster 
thinks that 11 OSD failed and try to replicate all that data over the 
cluster...that sounds no good.


Should I use 2 disks for the OS making a RAID1? in this case I'm 
"wasting" 8TB only for ~10GB that the OS needs.


In all the docs that i've been reading says ceph has no unique single 
point of failure, so I think that this scenario must have a optimal 
solution, maybe somebody could help me.


Thanks in advance.
--
Félix Barbeira.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] blocked ops

2016-08-12 Thread Roeland Mertens

Hi Brad,


thanks for that. We were able to track the blocked ops down to the cache 
layer in front of our EC rgw pool, which contains the utterly broken pg, 
so it may be related. We removed the cache layer, and now the blocked ops 
have moved to the primary OSD for the broken pg, with the debug logging 
pointing at the broken pg as the culprit.
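
(For anyone wanting to reproduce: removing a writeback cache tier is roughly the 
standard sequence below; the pool names are placeholders:

# ceph osd tier cache-mode cachepool forward
# rados -p cachepool cache-flush-evict-all
# ceph osd tier remove-overlay ecbasepool
# ceph osd tier remove ecbasepool cachepool
)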



The pg is in a horrendous state due to multiple disk losses (exceeding 
the EC m value) and attempts at "fixing" it :(


I've put the output of pg query here : http://pastebin.com/EXjup33D and 
the osd log specifically for the pg here: http://pastebin.com/0UpRJnUZ


and I've put some logs here: http://pastebin.com/Br1u8wTj

If anyone can give us a hand getting this resolved we'd greatly 
appreciate it, as we'd rather not have to zap a pool containing 100TB of 
data just because of a pg containing only about 0.1% of that data.



kind regards,


Roeland



On 12/08/16 08:10, Brad Hubbard wrote:

On Fri, Aug 12, 2016 at 07:47:54AM +0100, roeland mertens wrote:

Hi Brad,

thank you for that. Unfortunately our immediate concern is the blocked ops
rather than the broken pg (we know why it's broken).

OK, if you look at the following file it shows not only the declaration of
wait_for_blocked_object (highlighted) but also all of its callers.

https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L500

Multiple callers relate to snapshots, but turning debug logging for the OSDs
right up may give us more information.

# ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 5'

Note: the above will turn up debugging for all OSDs; you may want to focus on
only some, so adjust accordingly.


I don't think that's specifically crushmap related nor related to the
broken pg as the osds involved in the blocked ops aren't the ones that were
hosting the broken pg.




On 12 August 2016 at 04:12, Brad Hubbard  wrote:


On Thu, Aug 11, 2016 at 11:33:29PM +0100, Roeland Mertens wrote:

Hi,

I was hoping someone on this list may be able to help?

We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12

hours

we've been plagued with blocked requests which completely kills the
performance of the cluster

# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs

down; 1

pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 sec; 1

osds

have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
pg 63.1a18 is stuck inactive for 135133.509820, current state
down+remapped+peering, last acting [2147483647,2147483647,

2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]

That value (2147483647) is defined in src/crush/crush.h like so;

#define CRUSH_ITEM_NONE   0x7fff  /* no result */

So this could be due to a bad crush rule or maybe choose_total_tries needs
to
be higher?

$ ceph osd crush rule ls

For each rule listed by the above command.

$ ceph osd crush rule dump [rule_name]

I'd then dump out the crushmap and test it showing any bad mappings with
the
commands listed here;

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

I'd also check the pg numbers for your pool(s) are appropriate as not
enough
pgs could also be a contributing factor IIRC.

That should hopefully give some insight.

--
HTH,
Brad


pg 63.1a18 is down+remapped+peering, acting [2147483647,2147483647,

2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]

100 ops are blocked > 2097.15 sec on osd.4
1 osds have slow requests
noout,nodeep-scrub,sortbitwise flag(s) set

the one pg down is due to us running into an odd EC issue which I mailed

the

list about earlier, it's the 100 blocked ops that are puzzling us. If we

out

the osd in question, they just shift to another osd (on a different

host!).

We even tried rebooting the node it's on but to little avail.

We get a ton of log messages like this:

2016-08-11 23:32:10.041174 7fc668d9f700  0 log_channel(cluster) log

[WRN] :

100 slow requests, 5 included below; oldest blocked for > 139.313915 secs
2016-08-11 23:32:10.041184 7fc668d9f700  0 log_channel(cluster) log

[WRN] :

slow request 139.267004 seconds old, received at 2016-08-11

23:29:50.774091:

osd_op(client.9192464.0:485640 66.b96c3a18
default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read
0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109)
currently waiting for blocked object
2016-08-11 23:32:10.041189 7fc668d9f700  0 log_channel(cluster) log

[WRN] :

slow request 139.244839 seconds old, received at 2016-08-11

23:29:50.796256:

osd_op(client.9192464.0:596033 66.942a5a18
default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write
1048576~524288] snapc 0=[] RETRY=36
ack+ondisk+retry+write+known_if_redirected e50109) currently waiting for
blocked object

A dump of the blocked ops tells us very little , is there anyone who can
shed some light on this? Or at least give us a hint on how we can 

[ceph-users] radosgw-agent not syncing data as expected

2016-08-12 Thread Edward Hope-Morley
Hi all, I'm having an issue with RGW federation and would really appreciate
some help.

I have two ceph clusters each fronted with their own Rados Gateway. Each
RGW is configured as being in zone1 and zone2 respectively and both within
the default region "default". I have configured zone1 to use rgw pools
created prior to configuring zones (with the exception of the zone root
pool) and zone2 has a whole set of .z2.* pools to itself. When I run the
radosgw-agent from zone1 (master) in order to sync with zone2, it runs
successfully and says that is has done a metadata and data sync yet it does
not appear to have actually done a data sync at all. All my env info and
output is here - http://pastebin.ubuntu.com/23049564/

I am expecting to see the data/objects in .rgw.buckets from z1 synced to
.z2.rgw.buckets in z2 yet the latter remains empty.

This is using Ceph 0.94.7 and radosgw-agent v1.1 on Ubuntu Trusty.

Any ideas?
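
(The quickest check I have for whether anything synced is simply listing the
destination bucket pool, which stays empty:

# rados -p .z2.rgw.buckets ls | wc -l
)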
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread David Turner
Nothing actually happens to your osds if your OS drive fails.  To prevent the 
unnecessary backfilling off of the server with the dead OS drive, you would set 
NOOUT in the cluster, reinstall the OS on a good drive, install ceph on it, and 
then restart the server.  The OSDs have all of the information they need to 
bring themselves back up and into the cluster.  Once they are back up, you 
unset noout and are good to go.

If the drives had already been marked out of the cluster, then set noout and 
manually mark them in via `ceph osd in #` and proceed as above.  It is a very 
simple process to replace the OS drive of a storage node.
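
In command form, that is roughly:

# ceph osd set noout
  ... reinstall the OS, reinstall ceph, reboot the node ...
# ceph osd unset noout

and, if the OSDs had already been marked out, put them back in first with:

# ceph osd in <osd-id>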



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Cybertinus 
[c...@cybertinus.nl]
Sent: Friday, August 12, 2016 7:31 AM
To: Félix Barbeira
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] what happen to the OSDs if the OS disk dies?

Hello Felix,

When you put your OS on a single drive and that drive fails, you will
lose all the OSDs on that machine, because the entire machine goes
down. The PGs that now miss a partner are going to be replicated again.
So, in your case, the PGs that are on those 11 OSDs.
This rebuilding doesn't start right away, so you can safely reboot an
OSD host without starting a major rebalance of your data.

I would put 2 drives in RAID1 if I were you. Putting 2 SSDs in the back
2,5" slots, like suggested by Brian, sounds like the best option to me.
This way you don't lose a massive storage amount (2x10x8 = 160 TB you
would lose otherwise, just for the OS installation...)

---
Kind regards,
Cybertinus

On 12-08-2016 13:41, Félix Barbeira wrote:

> Hi,
>
> I'm planning to make a ceph cluster but I have a serious doubt. At this
> moment we have ~10 servers DELL R730xd with 12x4TB SATA disks. The
> official ceph docs says:
>
> "We recommend using a dedicated drive for the operating system and
> software, and one drive for each Ceph OSD Daemon you run on the host."
>
> I could use for example 1 disk for the OS and 11 for OSD data. In the
> operating system I would run 11 daemons to control the OSDs. But...what
> happen to the cluster if the disk with the OS fails?? maybe the cluster
> thinks that 11 OSD failed and try to replicate all that data over the
> cluster...that sounds no good.
>
> Should I use 2 disks for the OS making a RAID1? in this case I'm
> "wasting" 8TB only for ~10GB that the OS needs.
>
> In all the docs that i've been reading says ceph has no unique single
> point of failure, so I think that this scenario must have a optimal
> solution, maybe somebody could help me.
>
> Thanks in advance.
> --
> Félix Barbeira.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
ceph-monstore-tool? Is that the same as monmaptool? oops! NM found it in
ceph-test package::

I can't seem to get it working :-( dump monmap or any of the other commands;
they all bomb out with the same message:

root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
/var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
store.db/10882319.ldb
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
/var/lib/ceph/mon/ceph-kh10-8 dump-keys
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
store.db/10882319.ldb


I need to clarify as I originally had 2 clusters with this issue and now I
have 1 with all 3 monitors dead and 1 that I was successfully able to
repair. I am about to recap everything I know about the issue and the issue
at hand. Should I start a new email thread about this instead?

The cluster that is currently having issues is on hammer (94.7), and the
monitor stats are the same::
root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
 24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
 ext4 volume comprised of 4x300GB 10k drives in raid 10.
 ubuntu 14.04

root@kh08-8:~# uname -a
Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
root@kh08-8:~# ceph --version
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)


From here: Here are the errors I am getting when starting each of the
monitors::


---
root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
Corruption: error in middle of record
2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data directory
at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
--
root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30
Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/
store.db/10845998.ldb
2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data directory
at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
--
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon
--cluster=ceph -i kh10-8 -d
2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
store.db/10882319.ldb
2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data directory
at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
---


For kh08, a coworker patched leveldb to print and skip on the first error,
and that one is also missing a bunch of files. As such I think kh10-8 is my
most likely candidate to recover, but either way recovery is probably not an
option. I see leveldb has a repair.cc
(https://github.com/google/leveldb/blob/master/db/repair.cc) but I do not see
repair mentioned in the monitor with respect to the dbstore. I tried using the
leveldb python module (plyvel) to attempt a repair but my REPL just ends up dying.

I understand two things: 1.) Without rebuilding the monitor backend
leveldb store (the cluster map, as I understand it), all of the data in the
cluster is essentially lost (right?). 2.) It is possible to rebuild this
database via some form of magic or (source)ry, as all of this data is
essentially held throughout the cluster as well.

We only use radosgw / S3 for this cluster. If there is a way to recover my
data that is easier//more likely than rebuilding the leveldb of a monitor
and starting a single monitor cluster up I would like to switch gears and
focus on that.

Looking at the dev docs:
http://docs.ceph.com/docs/hammer/architecture/#cluster-map
it has 5 main parts::

```
The Monitor Map: Contains the cluster fsid, the position, name address and
port of each monitor. It also indicates the current epoch, when the map was
created, and the last time it changed. To view a monitor map, execute ceph
mon dump.
The OSD Map: Contains the cluster fsid, when the map was created and last
modified, a list of pools, replica sizes, PG numbers, a list of OSDs and
their status (e.g., up, in). To view an OSD map, execute ceph osd dump.
The PG Map: Contains the PG version, its time stamp, the last OSD map
epoch, the full ratios, and details on each placement group such as the PG
ID, the Up Set, the Acting Set, the state of the PG (e.g., active + clean),
and data usage statistics for each pool.
The CRUSH Map: Contains a list of storage devices, the failure domain
hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
traversing the hierarchy when storing data. To view a CRUSH map, execute
ceph osd getcrushmap -o {fil

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
A coworker patched leveldb and we were able to export quite a bit of data
from kh08's leveldb database. At this point I think I need to reconstruct
a new leveldb with whatever values I can. Is it the same leveldb database
across all 3 monitors? I.e., will keys exported from one work in the other?
All should have the same keys/values, although constructed differently,
right? I can't blindly copy /var/lib/ceph/mon/ceph-$(hostname)/store.db/
from one host to another, right? But can I copy the keys/values from one to
another?

On Fri, Aug 12, 2016 at 12:45 PM, Sean Sullivan 
wrote:

> ceph-monstore-tool? Is that the same as monmaptool? oops! NM found it in
> ceph-test package::
>
> I can't seem to get it working :-( dump monmap or any of the commands.
> They all bomb out with the same message:
>
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
> /var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
> /var/lib/ceph/mon/ceph-kh10-8 dump-keys
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
>
>
> I need to clarify as I originally had 2 clusters with this issue and now I
> have 1 with all 3 monitors dead and 1 that I was successfully able to
> repair. I am about to recap everything I know about the issue and the issue
> at hand. Should I start a new email thread about this instead?
>
> The cluster that is currently having issues is on hammer (94.7), and the
> monitor stats are the same::
> root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
>  24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
>  ext4 volume comprised of 4x300GB 10k drives in raid 10.
>  ubuntu 14.04
>
> root@kh08-8:~# uname -a
> Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux
> root@kh08-8:~# ceph --version
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>
>
> From here: Here are the errors I am getting when starting each of the
> monitors::
>
>
> ---
> root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
> 2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
> Corruption: error in middle of record
> 2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
> --
> root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
> 2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30
> Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/
> store.db/10845998.ldb
> 2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
> --
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon
> --cluster=ceph -i kh10-8 -d
> 2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
> 2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
> ---
>
>
> for kh08, a coworker patched leveldb to print and skip on the first error
> and that one is also missing a bunch of files. As such I think kh10-8 is my
> most likely candidate to recover but either way recovery is probably not an
> option. I see leveldb has a repair.cc (https://github.com/google/lev
> eldb/blob/master/db/repair.cc)) but I do not see repair mentioned in
> monitor in respect to the dbstore. I tried using the leveldb python module
> (plyvel) to attempt a repair but my repl just ends up dying.
>
> I understand two things:: 1.) Without rebuilding the monitor backend
> leveldb (the cluster map as I understand it) store all of the data in the
> cluster is essentialy lost (right?)
>  2.) it is possible to rebuild
> this database via some form of magic or (source)ry as all of this data is
> essential held throughout the cluster as well.
>
> We only use radosgw / S3 for this cluster. If there is a way to recover my
> data that is easier//more likely than rebuilding the leveldb of a monitor
> and starting a single monitor cluster up I would like to switch gears and
> focus on that.
>
> Looking at the dev docs:
> http://docs.ceph.com/docs/hammer/architecture/#cluster-map
> it has 5 main parts::
>
> ```
> The Monitor Map: Contains the cluster fsid, the position, name address and
> port of each monitor. It also indicates the current epoch, when the map was
> created, and the last time it changed. To view a monit

Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread Ronny Aasen

On 12.08.2016 13:41, Félix Barbeira wrote:

Hi,

I'm planning to make a ceph cluster but I have a serious doubt. At 
this moment we have ~10 servers DELL R730xd with 12x4TB SATA disks. 
The official ceph docs says:


"We recommend using a dedicated drive for the operating system and 
software, and one drive for each Ceph OSD Daemon you run on the host."


I could use for example 1 disk for the OS and 11 for OSD data. In the 
operating system I would run 11 daemons to control the OSDs. 
But...what happen to the cluster if the disk with the OS fails?? maybe 
the cluster thinks that 11 OSD failed and try to replicate all that 
data over the cluster...that sounds no good.


Should I use 2 disks for the OS making a RAID1? in this case I'm 
"wasting" 8TB only for ~10GB that the OS needs.


In all the docs that i've been reading says ceph has no unique single 
point of failure, so I think that this scenario must have a optimal 
solution, maybe somebody could help me.


Thanks in advance.

--
Félix Barbeira.

If you do not have dedicated slots on the back for OS disks, then I 
would recommend using SATADOM flash modules plugged directly into an internal 
SATA port in the machine. That saves you 2 slots for OSDs and they are quite 
reliable. You could even use 2 SD cards if your machine has the 
internal SD slot.


http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf

kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread Bill Sharer
If all the system disk does is handle the o/s (i.e. osd journals are on 
dedicated or osd drives as well), no problem.  Just rebuild the system 
and copy the ceph.conf back in when you re-install ceph.  Keep a spare 
copy of your original fstab to keep your osd filesystem mounts straight.
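
Something like the sketch below is all that backup really needs to cover
(devices, mount options and OSD ids will differ on your hosts):

# /etc/fstab fragment for xfs-backed OSD data filesystems
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,inode64  0 0
/dev/sdc1  /var/lib/ceph/osd/ceph-1  xfs  noatime,inode64  0 0
# ... one line per OSD; with these back in place after the OS reinstall
# the OSDs can be restarted without touching their data.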


Just keep in mind that you are down 11 osds while that system drive gets 
rebuilt though.  It's safer to do 10 osds and then have a mirror set for 
the system disk.


Bill Sharer


On 08/12/2016 03:33 PM, Ronny Aasen wrote:

On 12.08.2016 13:41, Félix Barbeira wrote:

Hi,

I'm planning to make a ceph cluster but I have a serious doubt. At 
this moment we have ~10 DELL R730xd servers with 12x4TB SATA disks. 
The official ceph docs say:


"We recommend using a dedicated drive for the operating system and 
software, and one drive for each Ceph OSD Daemon you run on the host."


I could use for example 1 disk for the OS and 11 for OSD data. In the 
operating system I would run 11 daemons to control the OSDs. 
But... what happens to the cluster if the disk with the OS fails? 
Maybe the cluster thinks that 11 OSDs failed and tries to replicate all 
that data over the cluster... that sounds no good.


Should I use 2 disks for the OS in a RAID1? In this case I'm 
"wasting" 8TB only for the ~10GB that the OS needs.


All the docs that I've been reading say ceph has no single 
point of failure, so I think that this scenario must have an optimal 
solution; maybe somebody could help me.


Thanks in advance.

--
Félix Barbeira.

if you do not have dedicated slots on the back for OS disks, then I 
would recommend using SATADOM flash modules plugged directly into a SATA 
port internal to the machine. That saves you 2 slots for OSDs, and they 
are quite reliable. You could even use 2 SD cards if your machine has the 
internal SD slot.


http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf

kind regards
Ronny Aasen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: Future Internetworking File System?

2016-08-12 Thread Matthew Walster
I've been following Ceph (and in particular CephFS) for some time now, and
glad to see it coming on in leaps and bounds!

I've been running a small OpenAFS Cell for a while now, and it's really
starting to show its age. I thought I'd ask whether anyone's considered
CephFS for a similar role?

As I understand it, Ceph authentication/authorization is very coarse (i.e.
granularity down to the mount point level only) and doesn't operate any
form of encryption between client and server, so I was wondering whether
anyone was using a form of intermediary proxy to provide these semantics to
the end user?
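
For example, what a CephFS client gets today is roughly the sketch below (the
client name, monitor address and key path are made up): cephx authenticates the
mount, but the granularity is the mounted subtree and the traffic itself is not
encrypted.

mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
  -o name=myclient,secretfile=/etc/ceph/client.myclient.secret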

I was thinking perhaps of a WebDAV gateway (via radosgw or cephfs, and
https via davfs2 for the client side) or NFSv4 (via cephfs... but obviously
then you have to generate keytabs for the client machines, which I don't
have to do for AFS at present) or whether this is just something that isn't
anywhere near the front of mind for developers/users yet?

I realise this is not among the current intended use cases, but I'm interested
in people's opinions, and whether anyone implements such a scheme today.

Many thanks in advance,

Matthew Walster
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread w...@42on.com


> On 13 Aug 2016, at 03:19, Bill Sharer  wrote 
> the following:
> 
> If all the system disk does is handle the o/s (ie osd journals are on 
> dedicated or osd drives as well), no problem.  Just rebuild the system and 
> copy the ceph.conf back in when you re-install ceph.  Keep a spare copy of 
> your original fstab to keep your osd filesystem mounts straight.
> 

With systems deployed with ceph-disk/ceph-deploy you no longer need an fstab. 
Udev handles it.
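
For example, on a ceph-disk style deployment the OSD partitions carry GPT type
codes and the shipped udev rules re-activate them at boot; a quick sketch
(output will of course differ per host):

ceph-disk list                     # shows which partitions belong to which OSD
ls /lib/udev/rules.d/ | grep ceph  # e.g. 95-ceph-osd.rules handles activation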

> Just keep in mind that you are down 11 osds while that system drive gets 
> rebuilt though.  It's safer to do 10 osds and then have a mirror set for the 
> system disk.
> 

In the years that I have been running Ceph I have rarely seen OS disks fail. 
Why bother? Ceph is designed for failure.

I would not sacrifice an OSD slot for an OS disk. Also, let's say an additional 
OS disk is €100.

If you put that disk in 20 machines that's €2.000. For that money you can even 
buy an additional chassis.

No, I would run on a single OS disk. It fails? Let it fail. Re-install and 
you're good again.

Ceph makes sure the data is safe.

Wido

> Bill Sharer
> 
> 
>> On 08/12/2016 03:33 PM, Ronny Aasen wrote:
>>> On 12.08.2016 13:41, Félix Barbeira wrote:
>>> Hi,
>>> 
>>> I'm planning to make a ceph cluster but I have a serious doubt. At this 
>>> moment we have ~10 DELL R730xd servers with 12x4TB SATA disks. The official 
>>> ceph docs say:
>>> 
>>> "We recommend using a dedicated drive for the operating system and 
>>> software, and one drive for each Ceph OSD Daemon you run on the host."
>>> 
>>> I could use for example 1 disk for the OS and 11 for OSD data. In the 
>>> operating system I would run 11 daemons to control the OSDs. But... what 
>>> happens to the cluster if the disk with the OS fails? Maybe the cluster 
>>> thinks that 11 OSDs failed and tries to replicate all that data over the 
>>> cluster... that sounds no good.
>>> 
>>> Should I use 2 disks for the OS in a RAID1? In this case I'm "wasting" 
>>> 8TB only for the ~10GB that the OS needs.
>>> 
>>> All the docs that I've been reading say ceph has no single point 
>>> of failure, so I think that this scenario must have an optimal solution; 
>>> maybe somebody could help me.
>>> 
>>> Thanks in advance.
>>> 
>>> -- 
>>> Félix Barbeira.
>>> 
>> if you do not have dedicated slots on the back for OS disks, then I would 
>> recommend using SATADOM flash modules plugged directly into a SATA port 
>> internal to the machine. That saves you 2 slots for OSDs, and they are quite 
>> reliable. You could even use 2 SD cards if your machine has the internal SD 
>> slot.
>> 
>> http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
>> 
>> kind regards
>> Ronny Aasen
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-12 Thread Georgios Dimitrakakis



On 13 Aug 2016, at 03:19, Bill Sharer  wrote the following:


If all the system disk does is handle the o/s (i.e. osd journals are
on dedicated or osd drives as well), no problem. Just rebuild the
system and copy the ceph.conf back in when you re-install ceph.
Keep a spare copy of your original fstab to keep your osd filesystem
mounts straight.


With systems deployed with ceph-disk/ceph-deploy you no longer need a
fstab. Udev handles it.


Just keep in mind that you are down 11 osds while that system drive
gets rebuilt though. It's safer to do 10 osds and then have a
mirror set for the system disk.


In the years that I have been running Ceph I have rarely seen OS disks
fail. Why bother? Ceph is designed for failure.

I would not sacrifice an OSD slot for an OS disk. Also, let's say an
additional OS disk is €100.

If you put that disk in 20 machines that's €2.000. For that money
you can even buy an additional chassis.

No, I would run on a single OS disk. It fails? Let it fail. Re-install
and you're good again.

Ceph makes sure the data is safe.



Wido,

can you elaborate a little bit more on this? How does CEPH achieve 
that? Is it by redundant MONs?


To my understanding the OSD mapping is needed to get the cluster back. 
In our setup (I assume in others as well) that is stored on the OS 
disk. Furthermore, our MONs are running on the same hosts as OSDs, so if 
an OS disk fails not only do we lose the OSD host but we also lose the 
MON node. Is there another way to be protected against such a failure 
besides additional MONs?
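
For context, on our own OSD hosts each data partition seems to carry the OSD's
identity itself, so I may be wrong about the OS disk being needed for that; a
rough sketch of what lives on the OSD filesystem (OSD id 123 and the paths are
just an example from a ceph-disk deployment):

ls /var/lib/ceph/osd/ceph-123/
# ceph_fsid  fsid  keyring  whoami  current/  ...
cat /var/lib/ceph/osd/ceph-123/whoami
# 123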


We recently had a problem where a user accidentally deleted a volume. 
Of course this has nothing to do with OS disk failure itself, but it has 
prompted us to start looking for other possible failures on our system 
that could jeopardize data, and this thread got my attention.



Warmest regards,

George



Wido

 Bill Sharer

 On 08/12/2016 03:33 PM, Ronny Aasen wrote:


On 12.08.2016 13:41, Félix Barbeira wrote:


Hi,

I'm planning to make a ceph cluster but I have a serious doubt. At
this moment we have ~10 DELL R730xd servers with 12x4TB SATA
disks. The official ceph docs say:

"We recommend using a dedicated drive for the operating system and
software, and one drive for each Ceph OSD Daemon you run on the
host."

I could use for example 1 disk for the OS and 11 for OSD data. In
the operating system I would run 11 daemons to control the OSDs.
But... what happens to the cluster if the disk with the OS fails?
Maybe the cluster thinks that 11 OSDs failed and tries to replicate
all that data over the cluster... that sounds no good.

Should I use 2 disks for the OS in a RAID1? In this case I'm
"wasting" 8TB only for the ~10GB that the OS needs.

All the docs that I've been reading say ceph has no single
point of failure, so I think that this scenario must have an
optimal solution; maybe somebody could help me.

Thanks in advance.

--

Félix Barbeira.

if you do not have dedicated slots on the back for OS disks, then I
would recommend using SATADOM flash modules plugged directly into a SATA
port internal to the machine. That saves you 2 slots for OSDs, and they
are quite reliable. You could even use 2 SD cards if your machine has
the internal SD slot.




http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf

[1]

kind regards
Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com [2]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Links:
--
[1]

http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
[2] mailto:ceph-users@lists.ceph.com
[3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[4] mailto:bsha...@sharerland.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com