Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

2015-08-31 Thread Quentin Hartman
I would say you are probably simply IO starved because you're running too
many VMs.

To follow on from Warren's response, if you spread those 160 available iops
across 15 VMs, you are talking about roughly 10 iops per vm, assuming they
have similar workloads. That's almost certainly too little. I would expect
normal system respiration to consume that without even trying to do any
real work.

The way I like to think of it is "fractions of a spindle" since that is the
most meaningful thing to the people I'm usually talking to. It illustrates
the resources in a more tangible way. You have 6 drives available for VM
operations. That immediately gets cut down by a factor of three because of
the replicas, so you have two available spindles. So, with 15 VMs, you
have about 1/7 of a disk's worth of "attention" that can be paid to each
VM. That's not nearly enough.
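
To put rough numbers on it using Warren's 80-IOPS-per-spinner figure from
below (just a back-of-the-envelope sketch):

# usable spindles = drives / replicas; per-VM budget = spindles * IOPS / VMs
echo "(6 / 3) * 80 / 15" | bc -l    # ~10.7 IOPS per VM
echo "(6 / 3) / 15" | bc -l         # ~0.13 of a spindle per VM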

I run a setup that sounds a lot like a bigger version of what you are
doing, and I've found as a rule of thumb that I need at least 1/3 of a disk
per VM to get decent performance. I've created a reduced-redundancy pool to
store unimportant VMs on, so that reduces the io load of those machines
since they only have two replicas instead of three, so that has freed some
io for "real" work. But side from that, you have to either reduce VMs or
increase spindles...

QH

On Mon, Aug 31, 2015 at 3:39 PM, Wang, Warren  wrote:

> Hey Kenneth, it looks like you're just down the toll road from me. I'm in
> Reston Town Center.
>
> Just as a really rough estimate, I'd say this is your max IOPS:
> 80 IOPS/spinner * 6 drives / 3 replicas = 160ish max sustained IOPS
>
> It's more complicated than that, since you have a reasonable solid state
> journal, lots of memory, etc., but that's a guess, since the backend will
> eventually need to keep up. That being said, almost every time I have seen
> blocked requests, there is some other underlying issue. I would say start
> with implementation checks:
>
> - checking connectivity between OSDs, with and without LACP (overkill for
> your purposes)
> - ensuring that the OSDs target drives are actually mounted instead of
> scribbling to the root drive
> - ensuring that the journal is properly implemented
> - all OSDs on the same version
> - Any OSDs crashing?
> - packet fragmentation? We have to stick with 1500 MTU to prevent frags.
> Don't assume you can run jumbo frames
> - You¹re not running much traffic, so a short capture on both
> sides and
> wireshark should reveal any obvious issues
>
> Is there anything in the ceph.log from a mon host? Grep for WRN. Also look
> at the individual OSD log. This seems more like an implementation issue.
> Happy to help out a local if you need more.
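>
> Roughly the sort of commands I mean for those checks (just a sketch --
> hostnames, device paths, and OSD numbers below are examples, adjust to your
> environment):
>
> grep WRN /var/log/ceph/ceph.log | grep -i "slow request" | tail
> findmnt /var/lib/ceph/osd/ceph-0          # OSD dir is a real mount, not the root disk
> ls -l /var/lib/ceph/osd/ceph-0/journal    # journal points at the intended SSD partition
> ping -M do -s 1472 -c 3 <other-osd-host>  # largest unfragmented payload at 1500 MTU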
>
> --
> Warren Wang
> Comcast Cloud (OpenStack)
>
>
>
> On 8/31/15, 1:28 PM, "ceph-users on behalf of Kenneth Van Alstyne"
> <kvanalst...@knightpoint.com> wrote:
>
> >Christian, et al:
> >
> >Sorry for the lack of information.  I wasn't sure which of our hardware
> >specifications or Ceph configuration details would be useful information at
> >this point.  Thanks for the feedback -- any feedback is appreciated at this
> >point, as I've been beating my head against a wall trying to figure out
> >what's going on.  (If anything.  Maybe the spindle count is indeed our
> >upper limit or our SSDs really suck? :-) )
> >
> >To directly address your questions, see answers below:
> >   - CBT is the Ceph Benchmarking Tool.  Since my question was more generic
> >rather than specific to CBT itself, it was probably more useful to post to
> >the ceph-users list rather than cbt.
> >   - The 8 cores are from 2x quad-core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
> >   - The SSDs are indeed Intel S3500s.  I agree -- not ideal, but supposedly
> >capable of up to 75,000 random 4KB reads/writes.  Throughput and
> >longevity are quite low for an SSD, rated at about 400MB/s reads and
> >100MB/s writes, though.  When we added these as journals in front of the
> >SATA spindles, both VM performance and rados benchmark numbers were
> >relatively unchanged.
> >
> >   - Regarding throughput vs IOPS, indeed -- the throughput that I'm seeing
> >is nearly the worst case scenario, with all I/O being 4KB block size.  With
> >RBD cache enabled and the writeback option set in the VM configuration, I
> >was hoping more coalescing would occur, increasing the I/O block size.
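> >For reference, the settings I'm talking about are along these lines (a
> >sketch of what I believe we have, not a verbatim copy of our config):
> >
> >    [client]
> >    rbd cache = true
> >    rbd cache writethrough until flush = true
> >
> >plus cache='writeback' on the disk in the libvirt domain XML:
> >
> >    <driver name='qemu' type='raw' cache='writeback'/>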
> >
> >As an aside, the orchestration layer on top of KVM is OpenNebula if
> >that's of any interest.
> >
> >VM information:
> >   - Number = 15
> >   - Workload = Mixed (I know, I know -- that's as vague an answer as they
> >come.)  A handful of VMs are running some MySQL databases and some web
> >applications in Apache Tomcat.  One is running a syslog server.
> >Everything else is mostly static web page serving for a low number of
> >users.
> >
> >I can duplicate the blocked request issue pretty consistently, just by
> >running something simple like a "yum -y update" in one VM.  While that is
> >running, ceph -w a

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-03 Thread Quentin Hartman
We also just started having our 850 Pros die one after the other after
about 9 months of service. 3 down, 11 to go... No warning at all, the drive
is fine, and then it's not even visible to the machine. According to the
stats in hdparm and the calcs I did they should have had years of life
left, so it seems that ceph journals definitely do something they do not
like, which is not reflected in their stats.
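
For reference, the sort of check I was doing (a sketch using smartctl rather
than hdparm; on these Samsungs the relevant attributes are Wear_Leveling_Count
and Total_LBAs_Written, but names vary by vendor):

smartctl -A /dev/sdX | egrep 'Wear_Leveling_Count|Total_LBAs_Written'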

QH

On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:

> Hi ,
>
> We got a good deal on the 843T and we are using them in our OpenStack setup
> as journals.
> They have been running for the last six months... No issues.
> When we compared with the Intel SSDs (I think it was the 3700), they were a
> shade slower for our workload and considerably cheaper.
> We did not run any synthetic benchmark since we had a specific use case.
> The performance was better than our old setup, so it was good enough.
>
> hth
>
>
>
> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
> wrote:
>
>> We have some 850 Pro 256GB SSDs if anyone is interested in buying :)
>>
>> And also, there was a new 850 Pro firmware that broke people's disks and was
>> pulled later, etc... I'm sticking with only vacuum cleaners from Samsung
>> for now, maybe... :)
>> On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" <
>> igor.voloshane...@gmail.com> wrote:
>>
>>> To be honest, the Samsung 850 PRO is not a 24/7 series... it's more of a
>>> desktop+ series, but anyway - the results from these drives are very, very
>>> bad in any scenario resembling real life...
>>>
>>> Possibly the 845 PRO is better, but we don't want to experiment anymore...
>>> So we chose the S3500 240G. Yes, it's cheaper than the S3700 (about 2x), and
>>> not as durable for writes, but we think it's better to replace 1 SSD per
>>> year than to pay double the price now.
>>>
>>> 2015-08-25 12:59 GMT+03:00 Andrija Panic :
>>>
 And should I mention that in another Ceph installation we had Samsung
 850 Pro 128GB drives, and all 6 SSDs died within a 2-month period - they simply
 disappeared from the system, so it was not wear-out...

 Never again will we buy Samsung :)
 On Aug 25, 2015 11:57 AM, "Andrija Panic" 
 wrote:

> First read please:
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> We are getting 200 IOPS, in comparison to the Intel S3500's 18,000 IOPS -
> those are constant performance numbers, meaning avoiding the drive's cache and
> running for a longer period of time...
> Also, if checking with FIO you will get better latencies on the Intel S3500
> (the model tested in our case), along with 20X better IOPS results...
>
> We observed the original issue as high speed at the beginning of, e.g., a
> file transfer inside a VM, which then halts to zero... We moved journals back
> to HDDs and performance was acceptable... now we are upgrading to Intel
> S3500...
>
> Best
> any details on that ?
>
> On Tue, 25 Aug 2015 11:42:47 +0200, Andrija Panic
>  wrote:
>
> > Make sure you test whatever you decide. We just learned this the hard way
> > with the Samsung 850 Pro, which is total crap, more than you could imagine...
> >
> > Andrija
> > On Aug 25, 2015 11:25 AM, "Jan Schermer"  wrote:
> >
> > > I would recommend Samsung 845 DC PRO (not EVO, not just PRO).
> > > Very cheap, better than Intel 3610 for sure (and I think it beats
> even
> > > 3700).
> > >
> > > Jan
> > >
> > > > On 25 Aug 2015, at 11:23, Christopher Kunz <
> chrisl...@de-punkt.de>
> > > wrote:
> > > >
> > > > On 25.08.15 at 11:18, Götz Reinicke - IT Koordinator wrote:
> > > >> Hi,
> > > >>
> > > >> most of the time I get the recommendation from resellers to go with
> > > >> the Intel S3700 for journaling.
> > > >>
> > > > Check out the Intel S3610. 3 drive writes per day for 5 years. Plus, it
> > > > is cheaper than the S3700.
> > > >
> > > > Regards,
> > > >
> > > > --ck
>
>
>
> --
> Mariusz Gronczewski, Administrator
>
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczew...@efigence.com
> 
>



>>>

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-03 Thread Quentin Hartman
Yeah, we've ordered some S3700's to replace them already. Should be here
early next week. Hopefully they arrive before we have multiple nodes die at
once and can no longer rebalance successfully.

Most of the drives I have are the 850 Pro 128GB (specifically MZ7KE128HMGA).
There are a couple of 120GB 850 EVOs in there too, but ironically, none of
them have pooped out yet.

On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
wrote:

> I really advise removing the bastards before they die... no rebalancing
> happening, just a temporary OSD down while replacing journals...
>
> What size and model are your Samsungs?
> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
> wrote:
>
>> We also just started having our 850 Pros die one after the other after
>> about 9 months of service. 3 down, 11 to go... No warning at all, the drive
>> is fine, and then it's not even visible to the machine. According to the
>> stats in hdparm and the calcs I did they should have had years of life
>> left, so it seems that ceph journals definitely do something they do not
>> like, which is not reflected in their stats.
>>
>> QH
>>
>> On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:
>>
>>> Hi ,
>>>
>>> We got a good deal on 843T and we are using it in our Openstack setup
>>> ..as journals .
>>> They have been running for last six months ... No issues .
>>> When we compared with  Intel SSDs I think it was 3700 they  were shade
>>> slower for our workload and considerably cheaper.
>>> We did not run any synthetic benchmark since we had a specific use case.
>>> The performance was better than our old setup so it was good enough.
>>>
>>> hth
>>>
>>>
>>>
>>> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic >> > wrote:
>>>
>>>> We have some 850 pro 256gb ssds if anyone interested to buy:)
>>>>
>>>> And also there was new 850 pro firmware that broke peoples disk which
>>>> was revoked later etc... I'm sticking with only vacuum cleaners from
>>>> Samsung for now, maybe... :)
>>>> On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" <
>>>> igor.voloshane...@gmail.com> wrote:
>>>>
>>>>> To be honest, Samsung 850 PRO not 24/7 series... it's something about
>>>>> desktop+ series, but anyway - results from this drives - very very bad in
>>>>> any scenario acceptable by real life...
>>>>>
>>>>> Possible 845 PRO more better, but we don't want to experiment
>>>>> anymore... So we choose S3500 240G. Yes, it's cheaper than S3700 (about 2x
>>>>> times), and no so durable for writes, but we think more better to replace 
>>>>> 1
>>>>> ssd per 1 year than to pay double price now.
>>>>>
>>>>> 2015-08-25 12:59 GMT+03:00 Andrija Panic :
>>>>>
>>>>>> And should I mention that in another CEPH installation we had samsung
>>>>>> 850 pro 128GB and all of 6 ssds died in 2 month period - simply disappear
>>>>>> from the system, so not wear out...
>>>>>>
>>>>>> Never again we buy Samsung :)
>>>>>> On Aug 25, 2015 11:57 AM, "Andrija Panic" 
>>>>>> wrote:
>>>>>>
>>>>>>> First read please:
>>>>>>>
>>>>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>>>>
>>>>>>> We are getting 200 IOPS in comparison to Intels3500 18.000 iops -
>>>>>>> those are  constant performance numbers, meaning avoiding drives cache 
>>>>>>> and
>>>>>>> running for longer period of time...
>>>>>>> Also if checking with FIO you will get better latencies on intel
>>>>>>> s3500 (model tested in our case) along with 20X better IOPS results...
>>>>>>>
>>>>>>> We observed original issue by having high speed at begining of i.e.
>>>>>>> file transfer inside VM, which than halts to zero... We moved journals 
>>>>>>> back
>>>>>>> to HDDs and performans was acceptable...no we are upgrading to intel
>>>>>>> S3500...
>>>>>>>
>>>>>>> Best
>>>>>>> any details on that ?
>>>>>>>
>>>>>>> On

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in
there just because I couldn't find 14 pros at the time we were ordering
hardware. I have 14 nodes, each with a single 128 or 120GB SSD that serves
as the boot drive  and the journal for 3 OSDs. And similarly, mine just
started disappearing a few weeks ago. I've now had four fail (three 850
Pro, one 840 Pro). I expect the rest to fail any day.

As it turns out I had a phone conversation with the support rep who has
been helping me with RMA's today and he's putting together a report with my
pertinent information in it to forward on to someone.

FWIW, I tried to get your 845's for this deploy, but couldn't find them
anywhere, and since the 850's looked about as durable on paper I figured
they would do ok. Seems not to be the case.

QH

On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
wrote:

> Hi James,
>
> I had 3 Ceph nodes as follows: 12 OSDs (HDD) and 2 SSDs (2x 6 journal
> partitions on each SSD) - the SSDs just vanished with no warning, no smartctl
> errors, nothing... so 2 SSDs in 3 servers vanished in... 2-3 weeks, after
> 3-4 months of being in production (VMs/KVM/CloudStack)
>
> Mine were also Samsung 850 PRO 128GB.
>
> Best,
> Andrija
>
> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
> james@ssi.samsung.com> wrote:
>
>> Hi Quentin and Andrija,
>>
>> Thanks so much for reporting the problems with Samsung.
>>
>>
>>
>> Would it be possible to get to know the configuration of your system?  What
>> kind of workload are you running?  You use the Samsung SSDs as separate
>> journaling disks, right?
>>
>>
>>
>> Thanks so much.
>>
>>
>>
>> James
>>
>>
>>
>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
>> Of *Quentin Hartman
>> *Sent:* Thursday, September 03, 2015 1:06 PM
>> *To:* Andrija Panic
>> *Cc:* ceph-users
>> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T
>> vs. Intel s3700
>>
>>
>>
>> Yeah, we've ordered some S3700's to replace them already. Should be here
>> early next week. Hopefully they arrive before we have multiple nodes die at
>> once and can no longer rebalance successfully.
>>
>>
>>
>> Most of the drives I have are the 850 Pro 128GB (specifically
>> MZ7KE128HMGA)
>>
>> There are a couple 120GB 850 EVOs in there too, but ironically, none of
>> them have pooped out yet.
>>
>>
>>
>> On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
>> wrote:
>>
>> I really advise removing the bastards becore they die...no rebalancing
>> hapening just temp osd down while replacing journals...
>>
>> What size and model are yours Samsungs?
>>
>> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
>> wrote:
>>
>> We also just started having our 850 Pros die one after the other after
>> about 9 months of service. 3 down, 11 to go... No warning at all, the drive
>> is fine, and then it's not even visible to the machine. According to the
>> stats in hdparm and the calcs I did they should have had years of life
>> left, so it seems that ceph journals definitely do something they do not
>> like, which is not reflected in their stats.
>>
>>
>>
>> QH
>>
>>
>>
>> On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:
>>
>> Hi ,
>>
>> We got a good deal on 843T and we are using it in our Openstack setup
>> ..as journals .
>> They have been running for last six months ... No issues .
>>
>> When we compared with  Intel SSDs I think it was 3700 they  were shade
>> slower for our workload and considerably cheaper.
>>
>> We did not run any synthetic benchmark since we had a specific use case.
>>
>> The performance was better than our old setup so it was good enough.
>>
>> hth
>>
>>
>>
>> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
>> wrote:
>>
>> We have some 850 pro 256gb ssds if anyone interested to buy:)
>>
>> And also there was new 850 pro firmware that broke peoples disk which was
>> revoked later etc... I'm sticking with only vacuum cleaners from Samsung
>> for now, maybe... :)
>>
>> On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" <
>> igor.voloshane...@gmail.com> wrote:
>>
>> To be honest, Samsung 850 PRO not 24/7 series... it's something about
>> desktop+ series, but anyway - results from this drives - very very bad in
>> any scenario acceptable by real life...
>>

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
Yeah, we've ordered some S3700's since we can't afford to have these sorts
of failures and haven't been able to find any of the DC-rated Samsung
drives anywhere.

fwiw, we didn't have any performance problems with the samsungs, it's
exclusively this sudden failure that's making us look elsewhere.

QH

On Fri, Sep 4, 2015 at 1:20 PM, Andrija Panic 
wrote:

> Quentin,
>
> try fio or dd with the O_DIRECT and D_SYNC flags, and you will see less than
> 1MB/s - that is common for most "home" drives - check the post below to
> understand
>
> We removed all Samsung 850 Pro 256GB drives from our new Ceph installation and
> replaced them with the Intel S3500 (18,000 (4KB) IOPS constant write speed with
> O_DIRECT, D_SYNC, in comparison to 200 IOPS for the Samsung 850 Pro - you can
> imagine the difference...):
>
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
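>
> A minimal fio run in that spirit (a sketch -- /dev/sdX must be a scratch
> device you can overwrite, and results vary a lot between drives):
>
> fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
>     --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test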
>
> Best
>
> On 4 September 2015 at 21:09, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in
>> there just because I couldn't find 14 pros at the time we were ordering
>> hardware. I have 14 nodes, each with a single 128 or 120GB SSD that serves
>> as the boot drive  and the journal for 3 OSDs. And similarly, mine just
>> started disappearing a few weeks ago. I've now had four fail (three 850
>> Pro, one 840 Pro). I expect the rest to fail any day.
>>
>> As it turns out I had a phone conversation with the support rep who has
>> been helping me with RMA's today and he's putting together a report with my
>> pertinent information in it to forward on to someone.
>>
>> FWIW, I tried to get your 845's for this deploy, but couldn't find them
>> anywhere, and since the 850's looked about as durable on paper I figured
>> they would do ok. Seems not to be the case.
>>
>> QH
>>
>> On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
>> wrote:
>>
>>> Hi James,
>>>
>>> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
>>> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
>>> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
>>> 3-4 months of being in production (VMs/KVM/CloudStack)
>>>
>>> Mine were also Samsung 850 PRO 128GB.
>>>
>>> Best,
>>> Andrija
>>>
>>> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
>>> james@ssi.samsung.com> wrote:
>>>
>>>> Hi Quentin and Andrija,
>>>>
>>>> Thanks so much for reporting the problems with Samsung.
>>>>
>>>>
>>>>
>>>> Would be possible to get to know your configuration of your system?
>>>> What kind of workload are you running?  Do you use Samsung SSD as separate
>>>> journaling disk, right?
>>>>
>>>>
>>>>
>>>> Thanks so much.
>>>>
>>>>
>>>>
>>>> James
>>>>
>>>>
>>>>
>>>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
>>>> Behalf Of *Quentin Hartman
>>>> *Sent:* Thursday, September 03, 2015 1:06 PM
>>>> *To:* Andrija Panic
>>>> *Cc:* ceph-users
>>>> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T
>>>> vs. Intel s3700
>>>>
>>>>
>>>>
>>>> Yeah, we've ordered some S3700's to replace them already. Should be
>>>> here early next week. Hopefully they arrive before we have multiple nodes
>>>> die at once and can no longer rebalance successfully.
>>>>
>>>>
>>>>
>>>> Most of the drives I have are the 850 Pro 128GB (specifically
>>>> MZ7KE128HMGA)
>>>>
>>>> There are a couple 120GB 850 EVOs in there too, but ironically, none of
>>>> them have pooped out yet.
>>>>
>>>>
>>>>
>>>> On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
>>>> wrote:
>>>>
>>>> I really advise removing the bastards becore they die...no rebalancing
>>>> hapening just temp osd down while replacing journals...
>>>>
>>>> What size and model are yours Samsungs?
>>>>
>>>> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
>>>> wrote:
>>>>
>>>> We also just started having 

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
_III_Spec.pdf)
which is still less than I'd calculated when I bought these drives. Though
I assumed a much slower wear rate (1TB / month) than what we're actually
apparently getting (3.6TB / month), so my original estimated lifespan of
about 6 years was way off.
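
For anyone doing the same math, the back-of-the-envelope version (the ~72 TB
endurance figure here is just what my original 6-year estimate implied, not a
vendor spec):

# months of life ~= assumed endurance (TB written) / write rate (TB/month)
echo "72 / 1.0" | bc -l    # ~72 months (~6 years) at the 1 TB/month I assumed
echo "72 / 3.6" | bc -l    # ~20 months at the 3.6 TB/month we're actually seeing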

QH


On Fri, Sep 4, 2015 at 1:15 PM, James (Fei) Liu-SSI <
james@ssi.samsung.com> wrote:

> Hi Andrija,
>
> Thanks for your prompt response. Would it be possible to know your hardware
> configuration, including your server information?
> Secondly, is there any way to duplicate your workload with fio-rbd, rbd
> bench, or rados bench?
>
>
>
>   “so 2 SSDs in 3 servers vanished in...2-3 weeks, after a 3-4 months of
> being in production (VMs/KVM/CloudStack)”
>
>
>
>    What you mean over here is that you deployed Ceph with CloudStack, am I
> correct? And the 2 SSDs that vanished in 2-3 weeks were brand new Samsung 850
> Pro 128GB drives, right?
>
>
>
> Thanks,
>
> James
>
>
>
> *From:* Andrija Panic [mailto:andrija.pa...@gmail.com]
> *Sent:* Friday, September 04, 2015 11:53 AM
> *To:* James (Fei) Liu-SSI
> *Cc:* Quentin Hartman; ceph-users
>
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> Hi James,
>
>
>
> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
> 3-4 months of being in production (VMs/KVM/CloudStack)
>
> Mine were also Samsung 850 PRO 128GB.
>
>
>
> Best,
>
> Andrija
>
>
>
> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
> james@ssi.samsung.com> wrote:
>
> Hi Quentin and Andrija,
>
> Thanks so much for reporting the problems with Samsung.
>
>
>
> Would be possible to get to know your configuration of your system?  What
> kind of workload are you running?  Do you use Samsung SSD as separate
> journaling disk, right?
>
>
>
> Thanks so much.
>
>
>
> James
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Quentin Hartman
> *Sent:* Thursday, September 03, 2015 1:06 PM
> *To:* Andrija Panic
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> Yeah, we've ordered some S3700's to replace them already. Should be here
> early next week. Hopefully they arrive before we have multiple nodes die at
> once and can no longer rebalance successfully.
>
>
>
> Most of the drives I have are the 850 Pro 128GB (specifically
> MZ7KE128HMGA)
>
> There are a couple 120GB 850 EVOs in there too, but ironically, none of
> them have pooped out yet.
>
>
>
> On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
> wrote:
>
> I really advise removing the bastards becore they die...no rebalancing
> hapening just temp osd down while replacing journals...
>
> What size and model are yours Samsungs?
>
> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
> wrote:
>
> We also just started having our 850 Pros die one after the other after
> about 9 months of service. 3 down, 11 to go... No warning at all, the drive
> is fine, and then it's not even visible to the machine. According to the
> stats in hdparm and the calcs I did they should have had years of life
> left, so it seems that ceph journals definitely do something they do not
> like, which is not reflected in their stats.
>
>
>
> QH
>
>
>
> On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:
>
> Hi ,
>
> We got a good deal on 843T and we are using it in our Openstack setup ..as
> journals .
> They have been running for last six months ... No issues .
>
> When we compared with  Intel SSDs I think it was 3700 they  were shade
> slower for our workload and considerably cheaper.
>
> We did not run any synthetic benchmark since we had a specific use case.
>
> The performance was better than our old setup so it was good enough.
>
> hth
>
>
>
> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
> wrote:
>
> We have some 850 pro 256gb ssds if anyone interested to buy:)
>
> And also there was new 850 pro firmware that broke peoples disk which was
> revoked later etc... I'm sticking with only vacuum cleaners from Samsung
> for now, maybe... :)
>
> On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" 
> wrote:
>
> To be honest, Samsung 850 PRO not 24/7 series... it's something about
> desktop+ series, but anyway - results from this drives - very very bad in
> any scenario acceptable by real life.

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
I just went through and ran this on all my currently running SSDs:

echo "$(smartctl -a /dev/sda | grep Total_LBAs_Written | awk '{ print $NF
}') * 512 /1025/1024/1024/1024" | bc

which is showing about 32TB written on the oldest nodes, about 20 on the
newer ones, and 1 on the first one I've RMA'd and replaced last week. So
the numbers are in-line with the test I did a few months ago in that they
are even, but looking back when I checked on them last my numbers were off
by 1024.

Note that this invocation of bc only outputs integers so the results will
be rounded.
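
If it helps anyone else, the same thing wrapped in a loop (a sketch -- the
hostnames are placeholders and it assumes the journal SSD is /dev/sda on every
node):

for host in node01 node02 node03; do
  printf '%s: ' "$host"
  ssh "$host" smartctl -a /dev/sda | \
    awk '/Total_LBAs_Written/ {printf "%.1f TiB written\n", $NF*512/1024/1024/1024/1024}'
done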

On Fri, Sep 4, 2015 at 1:40 PM, James (Fei) Liu-SSI <
james@ssi.samsung.com> wrote:

> Hi Anrija,
>
> Your feedback is greatly appreciated.
>
>
>
> Regards,
>
> James
>
>
>
> *From:* Andrija Panic [mailto:andrija.pa...@gmail.com]
> *Sent:* Friday, September 04, 2015 12:39 PM
> *To:* James (Fei) Liu-SSI
> *Cc:* Quentin Hartman; ceph-users
>
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> James,
>
>
>
> there are simple FIO tests or even a DD test on Linux which you can run to
> see how well an SSD will perform as a Ceph journal device (Ceph writes to the
> SSDs with the O_DIRECT and D_SYNC flags) - the Samsung 850 performs extremely
> badly here, as do many, many other vendors' drives (D_SYNC kills performance
> for them...)
>
>
>
> If you are not using the D_SYNC flag, then the Samsung can achieve some nice
> numbers...
>
> dd if=/dev/zero of=/dev/sda bs=4k count=10 oflag=direct,dsync (where
> /dev/sda is raw drive, or replace that with mount point i.e. /root/ddfile)
>
>
>
> Check post for more info please:
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Thanks
>
>
>
> On 4 September 2015 at 21:31, James (Fei) Liu-SSI <
> james@ssi.samsung.com> wrote:
>
> Andrija,
>
> In your email thread, "18.000 (4Kb) IOPS constant write speed" stands for
> 18K IOPS with a 4KB block size, right? However, you can only achieve 200 IOPS
> with the Samsung 850 Pro, right?
>
>
>
> Theoretically, the Samsung 850 Pro can get up to 100,000 IOPS with 4KB random
> reads under certain workloads.  It is a little bit strange over here.
>
>
>
> Regards,
>
> James
>
>
>
>
>
> *From:* Andrija Panic [mailto:andrija.pa...@gmail.com]
> *Sent:* Friday, September 04, 2015 12:21 PM
> *To:* Quentin Hartman
> *Cc:* James (Fei) Liu-SSI; ceph-users
>
>
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> Quentin,
>
>
>
> try fio or dd with O_DIRECT and D_SYNC flags, and you will see less than
> 1MB/s - that is common for most "home" drives - check the post down to
> understand
>
> We removed all Samsung 850 pro 256GB from our new CEPH installation and
> replaced with Intel S3500 (18.000 (4Kb) IOPS constant write speed with
> O_DIRECT, D_SYNC, in comparison to 200 IOPS for Samsun 850pro - you can
> imagine the difference...):
>
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
>
>
> Best
>
>
>
> On 4 September 2015 at 21:09, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
> Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in
> there just because I couldn't find 14 pros at the time we were ordering
> hardware. I have 14 nodes, each with a single 128 or 120GB SSD that serves
> as the boot drive  and the journal for 3 OSDs. And similarly, mine just
> started disappearing a few weeks ago. I've now had four fail (three 850
> Pro, one 840 Pro). I expect the rest to fail any day.
>
>
>
> As it turns out I had a phone conversation with the support rep who has
> been helping me with RMA's today and he's putting together a report with my
> pertinent information in it to forward on to someone.
>
>
>
> FWIW, I tried to get your 845's for this deploy, but couldn't find them
> anywhere, and since the 850's looked about as durable on paper I figured
> they would do ok. Seems not to be the case.
>
>
>
> QH
>
>
>
> On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
> wrote:
>
> Hi James,
>
>
>
> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
> 3-4 months of being in production (VMs/KVM/CloudStack)
>
> Mine were also Samsung 850 PRO 128GB.
>
>
>
> Best,
>
> Andrija
>
>
>
>

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-07 Thread Quentin Hartman
fwiw, I am not confused about the various types of SSDs that Samsung
offers. I knew exactly what I was getting when I ordered them. Based on
their specs and my WAG on how much writing I would be doing they should
have lasted about 6 years. Turns out my estimates were wrong, but even
adjusting for actual use, I should have gotten about 18 months out of these
drives, but I have them dying now at 9 months, with about half of their
theoretical life left.

A list of hardware that is known to work well would be incredibly valuable
to people getting started. It doesn't have to be exhaustive, nor does it
have to provide all the guidance someone could want. A simple "these things
have worked for others" would be sufficient. If nothing else, it will help
people justify more expensive gear when their approval people say "X seems
just as good and is cheaper, why can't we get that?".

To that point, I think perhaps something more important than a list of known
"good" hardware would be a list of known "bad" hardware, and perhaps some more
experience about what kind of write volume people should reasonably expect.
Setting aside for a moment the early death problem the recent Samsung drives
clearly have (I wonder if it's a side-effect of the "3D-NAND" tech?), I
wouldn't have gotten them had my estimates told me I'd
only get 18 months out of them. That would have also provided me the
information I needed to justify the DC-class drives that cost four times as
much to those that approve purchases. Without that critical piece of
information, I'm left trying to justify thousands of extra dollars with
only "because they're better".

Also, I talked to a Samsung rep last week and he told me the DC 845 line
has been discontinued. The DC-class drives from Samsung are now model
pm863. They are theoretically on the market, but I've not been able to find
them in stock anywhere.

QH

On Mon, Sep 7, 2015 at 4:22 AM, Jan Schermer  wrote:

> It is not just a question of which SSD.
> It's the combination of distribution (kernel version), disk controller and
> firmware, SSD revision and firmware.
>
> There are several ways to select hardware
> 1) the most traditional way where you build your BoM on a single vendor -
> so you buy servers including SSDs and HBAs as a single unit and then scream
> at the vendor when it doesn't work. I had a good experience with vendors in
> this scenario.
> 2) based on Hardware Compatibility Lists - usually means you can't use the
> latest hardware. For example LSI doesn't list most SSDs as compatible, or
> they only list really old firmware versions. Unusable, nobody will really
> help you.
> 3) You get a sample and test it, and you hope you will get the same
> hardware when you order in bulk later. We went this route and got nothing
> but trouble when Kingston changed their SSDs completely without changing
> their PN.
>
> Would we recommend s3700/3710 for Ceph? Absolutely. But there are still
> people who have trouble with them in combination with LSI controllers.
> Can we recommend Samsung 845 DC PRO then? I can say it worked nicely with
> my hardware. But surely some people had trouble with it.
>
> I "vote" against creating such a list because of all those reasons, it
> could get someone in trouble.
>
> Jan
>
>
> On 07 Sep 2015, at 11:14, Andrija Panic  wrote:
>
> There is
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> On the other hand, I'm not sure if SSD vendors would be happy to see their
> device listed as performing like total crap (for journaling)... but yes, I
> vote for having some official page if possible!
>
> On 7 September 2015 at 11:12, Eino Tuominen  wrote:
>
>> Hello,
>>
>> Should we (somebody, please?) gather up a comprehensive list of suitable
>> SSD devices to use as ceph journals? This seems to be a FAQ, and it would
>> be nice if all the knowledge and user experiences from several different
>> threads could be referenced easily in the future. I took a look at
>> wiki.ceph.org and there was nothing on this.
>>
>> --
>>   Eino Tuominen
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Jan Schermer
>> Sent: 7. syyskuuta 2015 11:44
>> To: Christian Balzer
>> Cc: ceph-users; Межов Игорь Александрович
>> Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
>> Intel s3700
>>
>> Re: Samsungs - I feel some of you are mixing and confusing different
>> Samsung drives.
>>
>> There is a DC line of Samsung drives meant for DataCenter use. Those have
>> EVO (write once read many) and PRO (write mostly) variants.
>> You don't want to go anywhere near the EVO line with Ceph.
>> Then there are "regular" EVO and PRO drives - they are not meant for
>> server use so don't use them.
>>
>> The main difference is that the "DC" line should provide reliable and
>> stable performance over time, no surprises, while the desktop drives can
>> just pause and perform garbage 

Re: [ceph-users] btrfs ready for production?

2015-09-07 Thread Quentin Hartman
btrfs has been discussed at length here. Search the archives if you want
more detail, but my take on it is that you probably shouldn't use it in
production right now. Also, from
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/

"We currently recommend XFS for production deployments. We recommend btrfs
for testing, development, and any non-critical deployments. We believe that
btrfs has the correct feature set and roadmap to serve Ceph in the
long-term, but XFS and ext4 provide the necessary stability for today’s
deployments. btrfs development is proceeding rapidly: users should be
comfortable installing the latest released upstream kernels and be able to
track development activity for critical bug fixes."

QH

On Mon, Sep 7, 2015 at 1:53 AM, Alan Zhang  wrote:

> hi everyone:
>
> as the ceph docs currently recommend:
>
> *If you use the btrfs file system with Ceph, we recommend using a recent
> Linux kernel (v3.14 or later).*
>
> so, I want to know:
> 1. If we use btrfs based on Linux kernel 3.14 or later, is it ready for
> production?
> 2. Has anyone used btrfs in a production environment?
>
> Thanks.
>


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-08 Thread Quentin Hartman
On Tue, Sep 8, 2015 at 9:05 AM, Mark Nelson  wrote:

> A list of hardware that is known to work well would be incredibly
>> valuable to people getting started. It doesn't have to be exhaustive,
>> nor does it have to provide all the guidance someone could want. A
>> simple "these things have worked for others" would be sufficient. If
>> nothing else, it will help people justify more expensive gear when their
>> approval people say "X seems just as good and is cheaper, why can't we
>> get that?".
>>
>
> So I have my opinions on different drives, but I think we do need to be
> really careful not to appear to endorse or pick on specific vendors. The
> more we can stick to high-level statements like:
>
> - Drives should have high write endurance
> - Drives should perform well with O_DSYNC writes
> - Drives should support power loss protection for data in motion
>
> The better I think.  Once those are established, I think it's reasonable
> to point out that certain drives meet (or do not meet) those criteria and
> get feedback from the community as to whether or not vendor's marketing
> actually reflects reality.  It'd also be really nice to see more
> information available like the actual hardware (capacitors, flash cells,
> etc) used in the drives.  I've had to show photos of the innards of
> specific drives to vendors to get them to give me accurate information
> regarding certain drive capabilities.  Having a database of such things
> available to the community would be really helpful.
>
>
That's probably a very good approach. I think it would be pretty simple to
avoid the appearance of endorsement if the data is presented correctly.


>
>> To that point, I think perhaps though something more important than a
>> list of known "good" hardware would be a list of known "bad" hardware,
>>
>
> I'm rather hesitant to do this unless it's been specifically confirmed by
> the vendor.  It's too easy to point fingers (see the recent kernel trim bug
> situation).


I disagree. I think that only comes into play if you claim to know why the
hardware has problems. In this case, if you simply state "people who have
used this drive have experienced a large number of seemingly premature
failures when using them as journals" that provides sufficient warning to
users, and if the vendor wants to engage the community and potentially pin
down why and help us find a way to make the device work or confirm that
it's just not suited, then that's on them. Samsung seems to be doing
exactly that. It would be great to have them help provide that level of
detail, but again, I don't think it's necessary. We're not saying
"ceph/redhat/$whatever says this hardware sucks" we're saying "The
community has found that using this hardware with ceph has exhibited these
negative behaviors...". At that point you're just relaying experiences and
collecting them in a central location. It's up to the reader to draw
conclusions from it.

But again, I think more important than either of these would be a
collection of use cases with actual journal write volumes that have
occurred in those use cases so that people can make more informed
purchasing decisions. The fact that my small openstack cluster created 3.6T
of writes per month on my journal drives (3 OSD each) is somewhat
mind-blowing. That's almost four times the amount of writes my best guess
estimates indicated we'd be doing. Clearly there's more going on than we
are used to paying attention to. Someone coming to ceph and seeing the cost
of DC-class SSDs versus consumer-class SSDs will almost certainly suffer
from some amount of sticker shock, and even if they don't their purchasing
approval people almost certainly will. This is especially true for people
in smaller organizations where SSDs are still somewhat exotic. And when
they come back with the "Why won't cheaper thing X be OK?" they need to
have sufficient information to answer that. Without a test environment to
generate data with, they will need to rely on the experiences of others,
and right now those experiences don't seem to be documented anywhere, and
if they are, they are not very discoverable.

QH


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-17 Thread Quentin Hartman
I ended up having 7 total die. 5 while in service, 2 more when I hooked
them up to a test machine to collect information from them. To Samsung's
credit, they've been great to deal with and are replacing the failed
drives, on the condition that I don't use them for ceph again. Apparently
they sent some of my failed drives to an engineer in Korea and they did a
failure analysis on them and came to the conclusion that we put them to an
"unintended use". I have seven left that I'm not sure what to do with.

I've honestly always really liked Samsung, and I'm disappointed that I
wasn't able to find anyone with their DC-class drives actually in stock so
I ended up switching to the Intel S3700s. My users will be happy to have
some SSDs to put in their workstations though!

QH

On Thu, Sep 17, 2015 at 4:49 PM, Andrija Panic 
wrote:

> Another one bites the dust...
>
> This is Samsung 850 PRO 256GB... (6 journals on this SSDs just died...)
>
> [root@cs23 ~]# smartctl -a /dev/sda
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.66-1.el6.elrepo.x86_64]
> (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Vendor:   /1:0:0:0
> Product:
> User Capacity:600,332,565,813,390,450 bytes [600 PB]
> Logical block size:   774843950 bytes
> >> Terminate command early due to bad response to IEC mode page
> A mandatory SMART command failed: exiting. To continue, add one or more
> '-T permissive' options
>
> On 8 September 2015 at 18:01, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> On Tue, Sep 8, 2015 at 9:05 AM, Mark Nelson  wrote:
>>
>>> A list of hardware that is known to work well would be incredibly
>>>> valuable to people getting started. It doesn't have to be exhaustive,
>>>> nor does it have to provide all the guidance someone could want. A
>>>> simple "these things have worked for others" would be sufficient. If
>>>> nothing else, it will help people justify more expensive gear when their
>>>> approval people say "X seems just as good and is cheaper, why can't we
>>>> get that?".
>>>>
>>>
>>> So I have my opinions on different drives, but I think we do need to be
>>> really careful not to appear to endorse or pick on specific vendors. The
>>> more we can stick to high-level statements like:
>>>
>>> - Drives should have high write endurance
>>> - Drives should perform well with O_DSYNC writes
>>> - Drives should support power loss protection for data in motion
>>>
>>> The better I think.  Once those are established, I think it's reasonable
>>> to point out that certain drives meet (or do not meet) those criteria and
>>> get feedback from the community as to whether or not vendor's marketing
>>> actually reflects reality.  It'd also be really nice to see more
>>> information available like the actual hardware (capacitors, flash cells,
>>> etc) used in the drives.  I've had to show photos of the innards of
>>> specific drives to vendors to get them to give me accurate information
>>> regarding certain drive capabilities.  Having a database of such things
>>> available to the community would be really helpful.
>>>
>>>
>> That's probably a very good approach. I think it would be pretty simple
>> to avoid the appearance of endorsement if the data is presented correctly.
>>
>>
>>>
>>>> To that point, I think perhaps though something more important than a
>>>> list of known "good" hardware would be a list of known "bad" hardware,
>>>>
>>>
>>> I'm rather hesitant to do this unless it's been specifically confirmed
>>> by the vendor.  It's too easy to point fingers (see the recent kernel trim
>>> bug situation).
>>
>>
>> I disagree. I think that only comes into play if you claim to know why
>> the hardware has problems. In this case, if you simply state "people who
>> have used this drive have experienced a large number of seemingly premature
>> failures when using them as journals" that provides sufficient warning to
>> users, and if the vendor wants to engage the community and potentially pin
>> down why and help us find a way to make the device work or confirm that
>> it's just not suited, then that's on them. Samsung seems to be doing
>> exactly that. It would be great to have them help provide that level of
>> detail, but again, I don't think it's necessary. We&

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-17 Thread Quentin Hartman
Well, if you look at the very very fine print on their warranty statement
and some spec sheets they say they are only supposed to be used in "Client
PCs" and if the application exceeds certain write amounts per day, even if
it's below the total volume of writes the drive is supposed to handle, it
voids the warranty. I expect it's the write rate that is killing them.
Purely by measure of the amount of writes, mine should have been at about
50% life or better.

So, according to the strict letter of their specs, a ceph server would be
an unintended use. Of course, all that detail gets omitted in lots of
places where one would do research. In the end though, they are taking care
of me, and frankly that means a lot more in my book. And for what it's
worth, I have many drives from them in PCs and laptops that have been
rolling happily along for years.

QH

On Thu, Sep 17, 2015 at 5:07 PM, Andrija Panic 
wrote:

> "  came to the conclusion they we put to an "unintended use".   "
> wtf ? : Best to install them inside shutdown workstation... :)
>
> On 18 September 2015 at 01:04, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> I ended up having 7 total die. 5 while in service, 2 more when I hooked
>> them up to a test machine to collect information from them. To Samsung's
>> credit, they've been great to deal with and are replacing the failed
>> drives, on the condition that I don't use them for ceph again. Apparently
>> they sent some of my failed drives to an engineer in Korea and they did a
>> failure analysis on them and came to the conclusion they we put to an
>> "unintended use". I have seven left I'm not sure what to do with.
>>
>> I've honestly always really liked Samsung, and I'm disappointed that I
>> wasn't able to find anyone with their DC-class drives actually in stock so
>> I ended up switching the to Intel S3700s. My users will be happy to have
>> some SSDs to put in their workstations though!
>>
>> QH
>>
>> On Thu, Sep 17, 2015 at 4:49 PM, Andrija Panic 
>> wrote:
>>
>>> Another one bites the dust...
>>>
>>> This is Samsung 850 PRO 256GB... (6 journals on this SSDs just died...)
>>>
>>> [root@cs23 ~]# smartctl -a /dev/sda
>>> smartctl 5.43 2012-06-30 r3573
>>> [x86_64-linux-3.10.66-1.el6.elrepo.x86_64] (local build)
>>> Copyright (C) 2002-12 by Bruce Allen,
>>> http://smartmontools.sourceforge.net
>>>
>>> Vendor:   /1:0:0:0
>>> Product:
>>> User Capacity:600,332,565,813,390,450 bytes [600 PB]
>>> Logical block size:   774843950 bytes
>>> >> Terminate command early due to bad response to IEC mode page
>>> A mandatory SMART command failed: exiting. To continue, add one or more
>>> '-T permissive' options
>>>
>>> On 8 September 2015 at 18:01, Quentin Hartman <
>>> qhart...@direwolfdigital.com> wrote:
>>>
>>>> On Tue, Sep 8, 2015 at 9:05 AM, Mark Nelson  wrote:
>>>>
>>>>> A list of hardware that is known to work well would be incredibly
>>>>>> valuable to people getting started. It doesn't have to be exhaustive,
>>>>>> nor does it have to provide all the guidance someone could want. A
>>>>>> simple "these things have worked for others" would be sufficient. If
>>>>>> nothing else, it will help people justify more expensive gear when
>>>>>> their
>>>>>> approval people say "X seems just as good and is cheaper, why can't we
>>>>>> get that?".
>>>>>>
>>>>>
>>>>> So I have my opinions on different drives, but I think we do need to
>>>>> be really careful not to appear to endorse or pick on specific vendors. 
>>>>> The
>>>>> more we can stick to high-level statements like:
>>>>>
>>>>> - Drives should have high write endurance
>>>>> - Drives should perform well with O_DSYNC writes
>>>>> - Drives should support power loss protection for data in motion
>>>>>
>>>>> The better I think.  Once those are established, I think it's
>>>>> reasonable to point out that certain drives meet (or do not meet) those
>>>>> criteria and get feedback from the community as to whether or not vendor's
>>>>> marketing actually reflects reality.  It'd also be really nice to see more
>>>>> information available like the actu

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-18 Thread Quentin Hartman
No, they are dead dead dead. Can't get anything off of them. If you look
back further on this thread I think the most noteworthy part of this whole
experience is just how far off my write estimates were. The ones that have
not died have somewhere between 24 and 32 TB written to them after 9 months
in service. This is almost 4x what I thought they would get.

QH

On Fri, Sep 18, 2015 at 1:48 AM, Jan Schermer  wrote:

> "850 PRO" is a workstation drive. You shouldn't put it in the server...
> But it should not just die either way, so don't tell them you use it for
> Ceph next time.
>
> Do the drives work when replugged? Can you get anything from SMART?
>
> Jan
>
>
> On 18 Sep 2015, at 02:57, James (Fei) Liu-SSI 
> wrote:
>
> Hi Quentin,
> Samsung has many different types of SSDs for different types of workloads,
> with different SSD media like SLC, MLC, TLC, 3D NAND, etc. They were designed
> for different workloads and different purposes. Thanks for your understanding
> and support.
>
> Regards,
> James
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *Quentin Hartman
> *Sent:* Thursday, September 17, 2015 4:05 PM
> *To:* Andrija Panic
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
> I ended up having 7 total die. 5 while in service, 2 more when I hooked
> them up to a test machine to collect information from them. To Samsung's
> credit, they've been great to deal with and are replacing the failed
> drives, on the condition that I don't use them for ceph again. Apparently
> they sent some of my failed drives to an engineer in Korea and they did a
> failure analysis on them and came to the conclusion they we put to an
> "unintended use". I have seven left I'm not sure what to do with.
>
> I've honestly always really liked Samsung, and I'm disappointed that I
> wasn't able to find anyone with their DC-class drives actually in stock so
> I ended up switching the to Intel S3700s. My users will be happy to have
> some SSDs to put in their workstations though!
>
> QH
>
> On Thu, Sep 17, 2015 at 4:49 PM, Andrija Panic 
> wrote:
> Another one bites the dust...
>
> This is Samsung 850 PRO 256GB... (6 journals on this SSDs just died...)
>
> [root@cs23 ~]# smartctl -a /dev/sda
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.66-1.el6.elrepo.x86_64]
> (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Vendor:   /1:0:0:0
> Product:
> User Capacity:600,332,565,813,390,450 bytes [600 PB]
> Logical block size:   774843950 bytes
> >> Terminate command early due to bad response to IEC mode page
> A mandatory SMART command failed: exiting. To continue, add one or more
> '-T permissive' options
>
> On 8 September 2015 at 18:01, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
> On Tue, Sep 8, 2015 at 9:05 AM, Mark Nelson  wrote:
>
> A list of hardware that is known to work well would be incredibly
> valuable to people getting started. It doesn't have to be exhaustive,
> nor does it have to provide all the guidance someone could want. A
> simple "these things have worked for others" would be sufficient. If
> nothing else, it will help people justify more expensive gear when their
> approval people say "X seems just as good and is cheaper, why can't we
> get that?".
>
>
> So I have my opinions on different drives, but I think we do need to be
> really careful not to appear to endorse or pick on specific vendors. The
> more we can stick to high-level statements like:
>
> - Drives should have high write endurance
> - Drives should perform well with O_DSYNC writes
> - Drives should support power loss protection for data in motion
>
> The better I think.  Once those are established, I think it's reasonable
> to point out that certain drives meet (or do not meet) those criteria and
> get feedback from the community as to whether or not vendor's marketing
> actually reflects reality.  It'd also be really nice to see more
> information available like the actual hardware (capacitors, flash cells,
> etc) used in the drives.  I've had to show photos of the innards of
> specific drives to vendors to get them to give me accurate information
> regarding certain drive capabilities.  Having a database of such things
> available to the community would be really helpful.
>
> That's probably a very good approach. I think it would be pretty simple to
> avoid the appearance of endorsement if the data is presented correctly.
>
>

Re: [ceph-users] Hammer reduce recovery impact

2015-09-18 Thread Quentin Hartman
I just applied the following settings to my cluster and it resulted in much
better behavior in the hosted VMs:

osd_backfill_scan_min = 2
osd_backfill_scan_max = 16
osd_recovery_max_active = 1
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1

On my "canary" VM iowait dropped from a hard 50% or more to recurring wave
of nothing up to 25%, then down again, which is apparently low enough that
my users aren't noticing it. Recovery is of course taking much longer, but
since I can now do OSD maintenance operations during the day, it's a big
win.
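
For what it's worth, these can also be injected at runtime without restarting
the OSDs, roughly like this (same values as above; double-check the option
names against your release):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'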


QH

On Wed, Sep 16, 2015 at 9:42 AM, Robert LeBlanc 
wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I was out of the office for a few days. We have some more hosts to
> add. I'll send some logs for examination.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Sep 11, 2015 at 12:45 AM, GuangYang  wrote:
> > If we are talking about requests being blocked 60+ seconds, those
> tunings might not help (they help a lot for average latency during
> recovering/backfilling).
> >
> > It would be interesting to see the logs for those blocked requests at
> OSD side (they have level 0), pattern to search might be "slow requests \d+
> seconds old".
> >
> > I had a problem where, for a recovery candidate object, all updates to
> that object would be stuck until it was recovered, which might take an
> extremely long time if there are a large number of PGs and objects to recover.
> But I think that was resolved by Sam by allowing writes to degraded objects in Hammer.
> >
> > 
> >> Date: Thu, 10 Sep 2015 14:56:12 -0600
> >> From: rob...@leblancnet.us
> >> To: ceph-users@lists.ceph.com
> >> Subject: [ceph-users] Hammer reduce recovery impact
> >>
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA256
> >>
> >> We are trying to add some additional OSDs to our cluster, but the
> >> impact of the backfilling has been very disruptive to client I/O and
> >> we have been trying to figure out how to reduce the impact. We have
> >> seen some client I/O blocked for more than 60 seconds. There has been
> >> CPU and RAM head room on the OSD nodes, network has been fine, disks
> >> have been busy, but not terrible.
> >>
> >> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> >> (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> >> S51G-1UL.
> >>
> >> Clients are QEMU VMs.
> >>
> >> [ulhglive-root@ceph5 current]# ceph --version
> >> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
> >>
> >> Some nodes are 0.94.3
> >>
> >> [ulhglive-root@ceph5 current]# ceph status
> >> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> >> health HEALTH_WARN
> >> 3 pgs backfill
> >> 1 pgs backfilling
> >> 4 pgs stuck unclean
> >> recovery 2382/33044847 objects degraded (0.007%)
> >> recovery 50872/33044847 objects misplaced (0.154%)
> >> noscrub,nodeep-scrub flag(s) set
> >> monmap e2: 3 mons at
> >> {mon1=
> 10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> >> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> >> osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> >> flags noscrub,nodeep-scrub
> >> pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> >> 128 TB used, 322 TB / 450 TB avail
> >> 2382/33044847 objects degraded (0.007%)
> >> 50872/33044847 objects misplaced (0.154%)
> >> 2300 active+clean
> >> 3 active+remapped+wait_backfill
> >> 1 active+remapped+backfilling
> >> recovery io 70401 kB/s, 16 objects/s
> >> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
> >>
> >> Each pool is size 4 with min_size 2.
> >>
> >> One problem we have is that the requirements of the cluster changed
> >> after setting up our pools, so our PGs are really out of wack. Our
> >> most active pool has only 256 PGs and each PG is about 120 GB is size.
> >> We are trying to clear out a pool that has way too many PGs so that we
> >> can split the PGs in that pool. I think these large PGs are part of our
> >> issues.
> >>
> >> Things I've tried:
> >>
> >> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> >> the max latency, which sometimes reached 3000 ms, down to a max of 500-700 ms.
> >> It has also reduced the huge swings in latency, but has also reduced
> >> throughput somewhat.
> >> * Changed the scheduler from deadline to CFQ. I'm not sure if the
> >> OSD process gives the recovery threads a different disk priority or if
> >> changing the scheduler without restarting the OSD allows the OSD to
> >> use disk priorities.
> >> * Reduced the number of osd_max_backfills from 2 to 1.
> >> * Tried setting noin to give the new OSDs time to get the PG map and
> >> peer before starting the backfill. This caused more problems than
> >> solved as we had blocked I/O (over 200 seconds) until we set the new
> >> OSDs to in.
> >>
> >> Even adding one OSD disk into the cluster is causing these slow I/O
> >> mess

Re: [ceph-users] Poor Read Performance with Ubuntu 14.04 LTS 3.19.0-30 Kernel

2015-10-06 Thread Quentin Hartman
Could you share some of your testing methodology? I'd like to repeat your
tests.

I have a cluster that is currently running mostly 3.13 kernels, but the
latest patch of that version breaks the onboard 1Gb NIC in the servers I'm
using. I recently had to redeploy several of these servers due to SSD
failures, and so those ones I built around the 3.19 vivid kernel. I'm
planning on upgrading all the machines to that kernel and Hammer (from
Giant) in about a week.

I had planned on performing before and after testing on the cluster anyway,
but if I could replicate your tests, they would serve as an even better data point.

QH

On Tue, Oct 6, 2015 at 9:23 AM, Mark Nelson  wrote:

> On 10/06/2015 10:14 AM, MailingLists - EWS wrote:
>
>> I have encountered a rather interesting issue with Ubuntu 14.04 LTS
>> running 3.19.0-30 kernel (Vivid) using Ceph Hammer (0.94.3).
>>
>> With everything else identical in our testing cluster, no other changes
>> other than the kernel (apt-get install linux-image-generic-lts-vivid and
>> then a reboot), we are seeing a substantial drop in read performance as
>> indicated with rados bench.
>>
>> Kernel 3.13.0.-65 (latest for Trusty): Read performance: 575MB/s
>>
>> Kernel 3.19.0-30 (Vivid): Read performance: 175MB/s
>>
>> Has anyone experienced this sort of performance drop?
>>
>
> Hi,
>
> Very interesting!  Did you upgrade the kernel on both the OSDs and clients
> or just some of them?  I remember there were some kernel performance
> regressions a little while back.  You might try running perf during your
> tests and look for differences.  Also, iperf might be worth trying to see
> if it's a network regression.
>
> I also have a script that compares output from sysctl which might be worth
> trying to see if any defaults changes.
>
> https://github.com/ceph/cbt/blob/master/tools/compare_sysctl.py
>
> basically just save sysctl -a with both kernels and pass them as
> arguments to the python script.
>
> Mark
>
>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor Read Performance with Ubuntu 14.04 LTS 3.19.0-30 Kernel

2015-10-20 Thread Quentin Hartman
I performed this kernel upgrade (to 3.19.0-30) over the weekend on my
cluster, and my before / after benchmarks were very close to each other,
about 500MB/s each.
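For anyone chasing the readahead angle Nick raises below, the per-device value is easy to check and change on the fly (device name hypothetical, value in KB, not persistent across reboots):

cat /sys/block/sdb/queue/read_ahead_kb          # 128 is the usual default
echo 4096 > /sys/block/sdb/queue/read_ahead_kb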

On Tue, Oct 6, 2015 at 3:15 PM, Nick Fisk  wrote:

> I'm wondering if you are hitting the "bug" with the readahead changes?
>
> I know the changes to limit readahead to 2MB was introduced in 3.15, but I
> don't know if it was back ported into 3.13 or not. I have a feeling this
> may
> also limit maximum request size to 2MB as well.
>
> If you look in iostat do you see different request sizes between the two
> kernels?
>
> There is a 4.2 kernel with the readahead change reverted, it might be worth
> testing it.
>
>
> http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back
> /
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > MailingLists - EWS
> > Sent: 06 October 2015 18:12
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Poor Read Performance with Ubuntu 14.04 LTS
> > 3.19.0-30 Kernel
> >
> > > Hi,
> > >
> > > Very interesting!  Did you upgrade the kernel on both the OSDs and
> > > clients
> > or
> > > just some of them?  I remember there were some kernel performance
> > > regressions a little while back.  You might try running perf during
> > > your
> > tests
> > > and look for differences.  Also, iperf might be worth trying to see if
> > it's a
> > > network regression.
> > >
> > > I also have a script that compares output from sysctl which might be
> > > worth trying to see if any defaults changes.
> > >
> > > https://github.com/ceph/cbt/blob/master/tools/compare_sysctl.py
> > >
> > > basically just save sysctl -a with both kernels and pass them as
> > arguments to
> > > the python script.
> > >
> > > Mark
> >
> > Mark,
> >
> > The testing was done with 3.19 on the client with 3.13 on the OSD nodes
> > using "rados bench -p bench 50 seq" with an initial "rados bench -p bench
> 50
> > write --no-cleanup". We suspected the network as well and tested with
> iperf
> > as one of our first steps and saw expected speeds (9.9Gb/s as we are
> using
> > bonded X540-T2 interfaces) on both kernels. As an added data point, we
> > have no problem with write performance to the same pool with the same
> > kernel configuration (~1GB/s). We also checked the values of
> > read_ahead_kb of the block devices but both were shown to be the default
> > of 128 (we have since changed these to 4096 in our configuration, but the
> > results were seen with the default of 128).
> >
> > We are in the process of rebuilding the entire cluster to use 3.13 and a
> > completely fresh installation of Ceph to make sure nothing else is at
> play
> > here.
> >
> > We did check a few things in iostat and collectl, but we didn't see any
> read IO
> > against the OSDs, so I am leaning towards something further up the stack.
> >
> > Just a little more background on the cluster configuration:
> >
> > Specific pool created just for benchmarking, using 512 pgs and pgps and 2
> > replicas. Using 3 OSD nodes (also handling MON duties) with 8 SATA 7.2K
> > RPM OSDs and 2 NVMe journals (4 OSD to 1 Journal ratio). 1 x Hex core
> CPUs
> > with 32GB of RAM per OSD node.
> >
> > Tom
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Openstack Nova and Ceph OSD on same machine

2015-10-26 Thread Quentin Hartman
I am. For our workloads it works fine. The biggest trick I found is to make
sure that Nova leaves enough free RAM to not starve the OSDs. In my case,
each node is running three OSDs, so in my nova.cfg I added
"reserved_host_memory_mb = 3072" to help ensure that. Each node has 72GB of
RAM, so there's plenty left for VMs.

If I had it to do all over again I would have pushed for enough budget to
split up the cluster and get ceph-dedicated storage nodes that can support
20-ish disks or so. Not because of any problems we've had with having the
two cohosted on the nodes, but because the number of VMs we can run is
limited by IOPS, so I need more spindles. I've found that when running
three replicas in the ceph pool, it's a good rule of thumb to assume 1 disk
per VM to get good consistent performance once you've eliminated other
bottlenecks. So in my case, I have 42 OSD disks, so I can run at most 42
VMs before performance starts to get weird.
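For reference, the setting mentioned above lives in the compute/scheduler config; a minimal sketch, assuming the usual /etc/nova/nova.conf layout and roughly 1 GB per co-located OSD:

[DEFAULT]
# keep this much host RAM out of the hands of the VM scheduler
reserved_host_memory_mb = 3072

The scheduler then stops placing guests once their combined RAM reaches total host RAM minus this reservation, which is what keeps the OSDs from being squeezed.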

QH

2015-10-26 4:17 GMT-06:00 Stolte, Felix :

> Hi all,
>
>
>
> is anyone running nova compute on ceph OSD Servers and could share his
> experience?
>
>
>
> Thanks and Regards,
>
>
>
> Felix
>
>
>
> Forschungszentrum Juelich GmbH
>
> 52425 Juelich
>
> Sitz der Gesellschaft: Juelich
>
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
>
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>
> Prof. Dr. Sebastian M. Schmidt
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why was osd pool default size changed from 2 to 3.

2015-10-26 Thread Quentin Hartman
TL;DR - Running two copies in my cluster cost me a weekend, and many more
hours of productive time during normal working hours. Networking problems
can be just as destructive as disk problems. I only run 2 copies on
throwaway data.

So, I have personal experience in data loss when running only two copies. I
had a networking problem in my ceph cluster, and it took me a long time to
track it down because it was an intermittent thing that caused the node
with the faulty connection to not only get marked out by it's peers, but
also caused it to incorrectly mark out other nodes. It was a mess, that I
made worse by trying to force recovery before I really knew what the
problem was since it was so elusive.

In the end, the cluster tried to do recovery on PGs that had gotten
degraded, but because there were only two copies it had no way to tell
which one was correct, and when I forced it to choose it often chose wrong.
All of the data was VM images, so in the end, I ended up having small bits
of random corruption across almost all my VMs. It took me about 40 hours of
work over a weekend to get things recovered (onto spare desktop machines
since I still hadn't found the problem and didn't trust the cluster) and
rebuilt to make sure that people could work on monday, and I was cleaning
up little bits of leftover mess for weeks. Once I finally found and
repaired the problem, it was another several days worth of work to get the
cluster rebuilt and the VMs migrated back onto it. Never will I run only
two copies on things I actually care about ever again, regardless of the
quality of the underlying disk hardware. In my case, the disks were fine
all along.
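If you want to check or change this on an existing cluster, replica counts are set per pool; a short sketch using the default rbd pool as an example:

ceph osd dump | grep 'replicated size'   # shows size/min_size for every pool
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

Raising size kicks off backfill to create the extra copies, so expect recovery traffic until it catches up.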

QH

On Sat, Oct 24, 2015 at 8:35 AM, Christian Balzer  wrote:

>
>
> Hello,
>
> There have been COUNTLESS discussions about Ceph reliability, fault
> tolerance and so forth in this very ML.
> Google is very much evil, but in this case it is your friend.
>
> In those threads you will find several reliability calculators, some more
> flawed than others, but penultimately you do not use a replica of 2 for
> the same reasons people don't use RAID5 for anything valuable.
>
> A replication of 2 MAY be fine with very reliable, fast and not too large
> SSDs, but that's about it.
> Spinning rust is never safe with just one copy.
>
> Christian
>
> On Sat, 24 Oct 2015 09:41:35 +0200 Stefan Eriksson wrote:
>
> > > Am 23.10.2015 um 20:53 schrieb Gregory Farnum:
> > >> On Fri, Oct 23, 2015 at 8:17 AM, Stefan Eriksson 
> > wrote:
> > >>
> > >> Nothing changed to make two copies less secure. 3 copies is just so
> > >> much more secure and is the number that all the companies providing
> > >> support recommend, so we changed the default.
> > >> (If you're using it for data you care about, you should really use 3
> > copies!)
> > >> -Greg
> > >
> > > I assume that number really depends on the (number of) OSDs you have in
> > your crush rule for that pool. A replication of
> > > 2 might be ok for a pool spread over 10 osds, but not for one spread
> > > over
> > 100 osds
> > >
> > > Corin
> > >
> >
> > I'm also interested in this, what changes when you add 100+ OSDs (to
> > warrant 3 replicas instead of 2), and the reasoning as to why "the
> > companies providing support recommend 3." ?
> > Theoretically it seems secure to have two replicas.
> > If you have 100+ OSDs, I can see that maintenance will take much longer,
> > and if you use "set noout" then a single PG will be active when the other
> > replica is under maintenance.
> > But if you "crush reweight to 0" before the maintenance this would not be
> > an issue.
> > Is this the main reason?
> >
> > From what I can gather even if you add new OSDs to the cluster and the
> > balancing kicks in, it still maintains its two replicas.
> >
> > thanks.
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Would HEALTH_DISASTER be a good addition?

2015-11-25 Thread Quentin Hartman
I don't have any comment on Greg's specific concerns, but I agree that
conceptually, distinguishing between states that are likely to resolve
themselves and ones that require intervention would be a nice addition.
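Until something like this exists, the distinction can be approximated from the outside; a rough sketch of a check that maps onto the proposed ERR state:

ceph health detail            # lists blocked requests and stuck PGs when things are bad
ceph pg dump_stuck inactive   # PGs that have not been active for longer than the stuck threshold

Alerting on a non-empty result from the second command gets you most of the "not serving I/O" signal Wido describes.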

QH


On Wed, Nov 25, 2015 at 2:46 PM, Gregory Farnum  wrote:

> On Wed, Nov 25, 2015 at 11:09 AM, Wido den Hollander 
> wrote:
> > Hi,
> >
> > Currently we have OK, WARN and ERR as states for a Ceph cluster.
> >
> > Now, it could happen that while a Ceph cluster is in WARN state certain
> > PGs are not available due to being in peering or any non-active+? state.
> >
> > When monitoring a Ceph cluster you usually want to see OK and not worry
> > when a cluster is in WARN.
> >
> > However, with the current situation you need to check if there are any
> > PGs in a non-active state since that means they are currently not doing
> > any I/O.
> >
> > For example, size is to 3, min_size is set to 2. One OSD fails, cluster
> > starts to recover/backfill. A second OSD fails which causes certain PGs
> > to become undersized and no longer serve I/O.
> >
> > I've seen such situations happen multiple times. VMs running and a few
> > PGs become non-active which caused about all I/O to stop effectively.
> >
> > The health stays in WARN, but a certain part of it is not serving I/O.
> >
> > My suggestion would be:
> >
> > OK: All PGs are active+clean and no other issues
> > WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc)
> > ERR: One or more PGs are not active
> > DISASTER: Anything which currently triggers ERR
> >
> > This way you can monitor for ERR. If the cluster goes into >= ERR you
> > know you have to come into action. <= WARN is just a thing you might
> > want to look in to, but not at 03:00 on Sunday morning.
> >
> > Does this sound reasonable?
>
> It sounds like basically you want a way of distinguishing between
> manual intervention required, and bad states which are going to be
> repaired on their own. That sounds like a good idea to me, but I'm not
> sure how feasible the specific thing here is. How long does a PG need
> to be in a not-active state before you shift into the alert mode? They
> can go through peering for a second or so when a node dies, and that
> will block IO but probably shouldn't trigger alerts.
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best setup for SSD

2015-06-12 Thread Quentin Hartman
I don't know the official reason, but I would imagine the disparity in
performance would lead to weird behaviors and very spiky overall
performance. I would think that running a mix of SSD and HDD OSDs in the
same pool would be frowned upon, not just the same server.
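For what it's worth, the usual way people do run both media types in one cluster is to keep them under separate CRUSH roots and point different pools at them; a rough sketch with hypothetical bucket, OSD, and pool names (the hierarchy is built by hand):

ceph osd crush add-bucket ssd root
ceph osd crush add-bucket node1-ssd host
ceph osd crush move node1-ssd root=ssd
ceph osd crush set osd.12 1.0 host=node1-ssd          # an SSD-backed OSD
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd pool set fastpool crush_ruleset 1            # rule id from 'ceph osd crush rule dump'

That keeps the SSDs from being dragged down to spinner latency within a single pool.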

On Fri, Jun 12, 2015 at 9:00 AM, Dominik Zalewski 
wrote:

> Be warned that running SSD and HD based OSDs in the same server is not
>> recommended. If you need the storage capacity, I'd stick to the journals
>> on SSDs plan.
>
>
> Can you please elaborate more why running SSD and HD based OSDs in the
> same server is not
> recommended ?
>
> Thanks
>
> Dominik
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Combining MON & OSD Nodes

2015-06-25 Thread Quentin Hartman
The biggest downside that I've found is the log volume that mons create
eats a lot of io. I was running mons on my OSDs previously, but in my
current deployment I've moved them to other hardware and noticed a
perceptible load reduction on those nodes that were formerly running mons.
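For reference, the write streams competing on a combined node are easy to find and measure (default paths; adjust for your layout):

/var/log/ceph/ceph-mon.*.log and /var/log/ceph/ceph.log   # the mon logging mentioned above
/var/lib/ceph/mon/                                        # the mon's leveldb store, also write-heavy

Running iostat -x 5 against whichever device holds those shows quickly whether they are stealing IO from the OSDs; putting them on a separate device avoids most of the contention.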

QH

On Thu, Jun 25, 2015 at 10:21 AM, Lazuardi Nasution 
wrote:

> Hi,
>
> I'm looking for pros and cons of combining MON and OSD functionality on
> the same nodes. Mostly recommended configuration is to have dedicated, odd
> number MON nodes. What I'm thinking is more like single node deployment but
> consist more than one node, if we have 3 nodes we have 3 MONs with 3 OSDs.
> Since MON will only consume small resources, I think MON load will not
> degrade OSD performance significantly. If we have odd number of nodes, we
> can still maintain the quorum of MON with this way. Any idea?
>
> Best regards,
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mon performance impact on OSDs?

2015-07-01 Thread Quentin Hartman
I've been wrestling with IO performance in my cluster and one area I have
not yet explored thoroughly is whether or not performance constraints on
mon hosts would be likely to have any impact on OSDs. My mons are quite
small, and one in particular has rather high IO waits (frequently 30% or
more) due to the other work it performs, notably hosting postgres for
Openstack which is quite chatty for some reason. Is this likely to
trickle-down into the OSDs performance? Everything I've seen online
indicates the performance between MONs and OSDs  should be decoupled, but
I'd like to hear some real world experiences.

Thanks!

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread Quentin Hartman
As with most disk redundancy systems, the concern is usually the amount of
time it takes to recover, during which you are vulnerable to another
failure. I would assume that is also the concern here.
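On the sizing question below, the rule of thumb from the journal-settings page is roughly

osd journal size >= 2 * (expected throughput * filestore max sync interval)

so for a single SATA OSD doing ~100-150 MB/s with the default 5 second sync interval, that works out to about 1-1.5 GB of journal per OSD. Even generous 10 GB journal partitions leave a 400 GB SSD mostly empty; the unused capacity mainly buys endurance and spare area rather than speed.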


On Wed, Jul 1, 2015 at 5:54 PM, Nate Curry  wrote:

> 4TB is too much to lose?  Why would it matter if you lost one 4TB with the
> redundancy?  Won't it auto recover from the disk failure?
>
> Nate Curry
> On Jul 1, 2015 6:12 PM, "German Anders"  wrote:
>
>> I would probably go with smaller osd disks; 4TB is too much to lose in
>> case of a broken disk, so maybe more osd daemons with smaller disks, maybe 1TB
>> or 2TB each. A 4:1 relationship is good enough, also I think that a 200G disk
>> for the journals would be ok, so you can save some money there, the osd's
>> of course configured them as a JBOD, don't use any RAID under it, and use
>> two different networks for public and cluster net.
>>
>> *German*
>>
>> 2015-07-01 18:49 GMT-03:00 Nate Curry :
>>
>>> I would like to get some clarification on the size of the journal disks
>>> that I should get for my new Ceph cluster I am planning.  I read about the
>>> journal settings on
>>> http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
>>> but that didn't really clarify it for me that or I just didn't get it.  I
>>> found in the Learning Ceph Packt book it states that you should have one
>>> disk for journalling for every 4 OSDs.  Using that as a reference I was
>>> planning on getting multiple systems with 8 x 6TB inline SAS drives for
>>> OSDs with two SSDs for journalling per host as well as 2 hot spares for the
>>> 6TB drives and 2 drives for the OS.  I was thinking of 400GB SSD drives but
>>> am wondering if that is too much.  Any informed opinions would be
>>> appreciated.
>>>
>>> Thanks,
>>>
>>> *Nate Curry*
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-08 Thread Quentin Hartman
Regarding using spinning disks for journals, before I was able to put SSDs
in my deployment I came up with a somewhat novel journal setup that gave my
cluster way more life than having all the journals on a single disk, or
having the journal on the disk with the OSD. I called it "interleaved
journals". Essentially offset the journal location by one disk, so in a 4
disk system:

OS disk sda has journal for sdb OSD
sdb OSD disk has journal for sdc OSD
sdc OSD disk has journal for sdd OSD
sdd OSD disk has no journal on it

This limited the contention substantially. When the cluster got busy enough
that multiple OSDs on the same machine were writing simultaneously it still
took a hit, but it was a big upgrade from the out of the box deployment. I
also tried leaving the OS drive out and only interleaving the journals on
the OSD drives, but that was slightly worse under load than this
configuration. It seems that the contention of the journals and OSDs was
stronger than the contention with logging.

QH

On Fri, Jul 3, 2015 at 1:23 AM, Van Leeuwen, Robert 
wrote:

>  > Another issue is performance : you'll get 4x more IOPS with 4 x 2TB
> drives than with one single 8TB.
> > So if you have a performance target your money might be better spent on
> smaller drives
>
>  Regardless of the discussion if it is smart to have very large spinners:
> Be aware that some of the bigger drives use SMR technology.
> Quoting wikipedia on SMR:
> "shingled recording writes new tracks that overlap part of the previously
> written magnetic track, leaving the previous track thinner and allowing for
> higher track density.”
> and
> "The overlapping-tracks architecture may slow down the writing process
> since writing to one track overwrites adjacent tracks, and requires them to
> be rewritten as well."
>
>  Usually these disks are marketed "for archival use".
> Generally speaking you really should not use these unless you exactly know
> which write workload is hitting the disk and it is just very big sequential
> writes.
>
>  Cheers,
>  Robert van Leeuwen
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-08 Thread Quentin Hartman
I don't see it as being any worse than having multiple journals on a single
drive. If your journal drive tanks, you're out X OSDs as well. It's
arguably better, since the number of affected OSDs per drive failure is
lower. Admittedly, neither deployment is ideal, but it an effective way to
get from A to B for those of us with limited hardware options.

QH

On Wed, Jul 8, 2015 at 10:32 AM, Mark Nelson  wrote:

> The biggest thing to be careful of with this kind of deployment is that
> now a single drive failure will take out 2 OSDs instead of 1 which means
> OSD failure rates and associated recovery traffic go up.  I'm not sure
> that's worth the trade-off...
>
> Mark
>
> On 07/08/2015 11:01 AM, Quentin Hartman wrote:
>
>> Regarding using spinning disks for journals, before I was able to put
>> SSDs in my deployment I came up with a somewhat novel journal setup that
>> gave my cluster way more life than having all the journals on a single
>> disk, or having the journal on the disk with the OSD. I called it
>> "interleaved journals". Essentially offset the journal location by one
>> disk, so in a 4 disk system:
>>
>> OS disk sda has journal for sdb OSD
>> sdb OSD disk has journal for sdc OSD
>> sdc OSD disk has journal for sdd OSD
>> sdd OSD disk has no journal on it
>>
>> This limited the contention substantially. When the cluster got busy
>> enough that multiple OSDs on the same machine were writing
>> simultaneously it still took a hit, but it was a big upgrade from the
>> out of the box deployment. I also tried leaving the OS drive out and
>> only interleaving the journals on the OSD drives, but that was slightly
>> worse under load than this configuration. It seems that the contention
>> of the journals and OSDs was stronger than the contention with logging.
>>
>> QH
>>
>> On Fri, Jul 3, 2015 at 1:23 AM, Van Leeuwen, Robert
>> mailto:rovanleeu...@ebay.com>> wrote:
>>
>> > Another issue is performance : you'll get 4x more IOPS with 4 x 2TB
>> drives than with one single 8TB.
>> > So if you have a performance target your money might be better
>> spent on smaller drives
>>
>> Regardless of the discussion if it is smart to have very large
>> spinners:
>> Be aware that some of the bigger drives use SMR technology.
>> Quoting wikipedia on SMR:
>> "shingled recording writes new tracks that overlap part of the
>> previously written magnetic track, leaving the previous track
>> thinner and allowing for higher track density.”
>> and
>> "The overlapping-tracks architecture may slow down the writing
>> process since writing to one track overwrites adjacent tracks, and
>> requires them to be rewritten as well."
>>
>> Usually these disks are marketed "for archival use".
>> Generally speaking you really should not use these unless you
>> exactly know which write workload is hitting the disk and it is just
>> very big sequential writes.
>>
>> Cheers,
>> Robert van Leeuwen
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor questions

2015-07-09 Thread Quentin Hartman
I have my mons sharing the ceph network, and while I currently do not run
mds or rgw, I have run those on my mon hosts in the past with no
perceptible ill effects.
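For context, the two networks map onto two ceph.conf options; a minimal sketch with made-up subnets:

[global]
public network = 192.168.1.0/24    # clients, mons, and OSD front-side traffic
cluster network = 192.168.2.0/24   # OSD-to-OSD replication and heartbeats

Mons bind only on the public network, so they do not need a leg in the cluster network at all; that one is purely OSD-to-OSD.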

On Thu, Jul 9, 2015 at 3:20 PM, Nate Curry  wrote:

> I have a question in regards to monitor nodes and network layout.  Its my
> understanding that there should be two networks; a ceph only network for
> comms between the various ceph nodes, and a separate storage network where
> other systems will interface with the ceph nodes.  Are the monitor nodes
> supposed to straddle both the ceph only network and the storage network or
> just in the ceph network?
>
> Another question is can I run multiple things on the monitor nodes?  Like
> the RADOS GW and the MDS?
>
>
> Thanks,
>
> *Nate Curry*
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] External XFS Filesystem Journal on OSD

2015-07-09 Thread Quentin Hartman
Thanks for sharing this info. I've been toying with doing this very
thing... How did you measure the performance? I'm specifically looking at
reducing the IO load on my spinners and it seems the xfs journaling process
is eating a lot of my IO. My queues on my OSD drives frequently get into
the 500 ballpark which makes for sad VMs.
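For anyone else wanting to try it, the setup being discussed is XFS's external log device; a minimal sketch with hypothetical device names and mount point:

mkfs.xfs -f -l logdev=/dev/nvme0n1p5,size=128m /dev/sdb1
mount -o logdev=/dev/nvme0n1p5,noatime /dev/sdb1 /var/lib/ceph/osd/ceph-12

The logdev option has to be passed on every mount (fstab included), and as noted in this thread, losing the log device effectively loses the filesystem.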

QH

On Thu, Jul 9, 2015 at 12:05 PM, David Burley 
wrote:

> Converted a few of our OSD's (spinners) over to a config where the OSD
> journal and XFS journal both live on an NVMe drive (Intel P3700). The XFS
> journal might have provided some very minimal performance gains (3%,
> maybe). Given the low gains, we're going to reject this as something to dig
> into deeper and stick with the simpler configuration of just using the NVMe
> drives for OSD journaling and leave the XFS journals on the partition.
>
> --David
>
> On Thu, Jun 4, 2015 at 2:23 PM, Lars Marowsky-Bree  wrote:
>
>> On 2015-06-04T12:42:42, David Burley  wrote:
>>
>> > Are there any safety/consistency or other reasons we wouldn't want to
>> try
>> > using an external XFS log device for our OSDs? I realize if that device
>> > fails the filesystem is pretty much lost, but beyond that?
>>
>> I think with the XFS journal on the same SSD as ceph's OSD journal, that
>> could be a quite interesting setup. Please share performance numbers!
>>
>> I've been meaning to benchmark bcache in front of the OSD backend,
>> especially for SMRs, but haven't gotten around to it yet.
>>
>>
>> Regards,
>> Lars
>>
>> --
>> Architect Storage/HA
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu,
>> Graham Norton, HRB 21284 (AG Nürnberg)
>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> David Burley
> NOC Manager, Sr. Systems Programmer/Analyst
> Slashdot Media
>
> e: da...@slashdotmedia.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Quentin Hartman
So, I was running with size=2, until we had a network interface on an
OSD node go faulty, and start corrupting data. Because ceph couldn't tell
which copy was right it caused all sorts of trouble. I might have been able
to recover more gracefully had I caught the problem sooner and been able to
identify the root right away, but as it was, we ended up labeling every VM
in the cluster suspect destroying the whole thing and restoring from
backups. I didn't end up managing to find the root of the problem until I
was rebuilding the cluster and noticed one node "felt weird" when I was
ssh'd into it. It was painful.

We are currently running "important" vms from a ceph pool with size=3, and
more disposable ones from a size=2 pool, and that seems to be a reasonable
tradeoff so far, giving us a bit more IO headroom than we would have
running 3 for everything, but still having safety where we need it.

QH

On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke <
goetz.reini...@filmakademie.de> wrote:

> Hi Warren,
>
> thanks for that feedback. regarding the 2 or 3 copies we had a lot of
> internal discussions and lots of pros and cons on 2 and 3 :) … and finally
> decided to give 2 copies in the first - now called evaluation cluster - a
> chance to prove.
>
> I bet in 2016 we will see if that was a good decision or a bad one, and data loss
> is in that scenario ok. We evaluate. :)
>
> Regarding one P3700 for 12 SATA disks I do get it right, that if that
> P3700 fails all 12 OSDs are lost… ? So that looks like a bigger risk to me
> from my current knowledge. Or are the P3700 so much more reliable than the
> eg. S3500 or S3700?
>
> Or is the suggestion with the P3700 if we go in the direction of 20+ nodes
> and till than stay without SSDs for journaling.
>
> I really appreciate your thoughts and feedback and I’m aware of the fact
> that building a ceph cluster is some sort of knowing the specs,
> configuration option, math, experience, modification and feedback from best
> practices real world clusters. Finally all clusters are unique in some way
> and what works for one will not work for an other.
>
> Thanks for feedback, 100 kowtows . Götz
>
>
>
> > Am 09.07.2015 um 16:58 schrieb Wang, Warren <
> warren_w...@cable.comcast.com>:
> >
> > You'll take a noticeable hit on write latency. Whether or not it's
> tolerable will be up to you and the workload you have to capture. Large
> file operations are throughput efficient without an SSD journal, as long as
> you have enough spindles.
> >
> > About the Intel P3700, you will only need 1 to keep up with 12 SATA
> drives. The 400 GB is probably okay if you keep the journal sizes small,
> but the 800 is probably safer if you plan on leaving these in production
> for a few years. Depends on the turnover of data on the servers.
> >
> > The dual disk failure comment is pointing out that you are more exposed
> for data loss with 2 copies. You do need to understand that there is a
> possibility for 2 drives to fail either simultaneously, or one before the
> cluster is repaired. As usual, this is going to be a decision you need to
> decide if it's acceptable or not. We have many clusters, and some are 2,
> and others are 3. If your data resides nowhere else, then 3 copies is the
> safe thing to do. That's getting harder and harder to justify though, when
> the price of other storage solutions using erasure coding continues to
> plummet.
> >
> > Warren
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Götz Reinicke - IT Koordinator
> > Sent: Thursday, July 09, 2015 4:47 AM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Real world benefit from SSD Journals for a
> more read than write cluster
> >
> > Hi Christian,
> > Am 09.07.15 um 09:36 schrieb Christian Balzer:
> >>
> >> Hello,
> >>
> >> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
> >>
> >>> Hi again,
> >>>
> >>> time is passing, so is my budget :-/ and I have to recheck the
> >>> options for a "starter" cluster. An expansion next year for may be an
> >>> openstack installation or more performance if the demands rise is
> >>> possible. The "starter" could always be used as test or slow dark
> archive.
> >>>
> >>> At the beginning I was at 16SATA OSDs with 4 SSDs for journal per
> >>> node, but now I'm looking for 12 SATA OSDs without SSD journal. Less
> >>> performance, less capacity I know. But thats ok!
> >>>
> >> Leave the space to upgrade these nodes with SSDs in the future.
> >> If your cluster grows large enough (more than 20 nodes) even a single
> >> P3700 might do the trick and will need only a PCIe slot.
> >
> > If I get you right, the 12Disk is not a bad idea, if there would be the
> need of SSD Journal I can add the PCIe P3700.
> >
> > In the 12 OSD Setup I should get 2 P3700 one per 6 OSDs.
> >
> > God or bad idea?
> >
> >>
> >>> There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
> >>>
> >> Danger, Will Ro

Re: [ceph-users] Monitor questions

2015-07-10 Thread Quentin Hartman
You mean the hardware config? They are older Core2-based servers with 4GB
of RAM. Nothing special. I have one running mon and rgw, one running mon
and mds, and one run just a mon.

QH

On Fri, Jul 10, 2015 at 8:58 AM, Nate Curry  wrote:

> What was your monitor node's configuration when you had multiple ceph
> daemons running on them?
>
> *Nate Curry*
> IT Manager
> ISSM
> *Mosaic ATM*
> mobile: 240.285.7341
> office: 571.223.7036 x226
> cu...@mosaicatm.com
>
> On Thu, Jul 9, 2015 at 5:36 PM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> I have my mons sharing the ceph network, and while I currently do not run
>> mds or rgw, I have run those on my mon hosts in the past with no
>> perceptible ill effects.
>>
>> On Thu, Jul 9, 2015 at 3:20 PM, Nate Curry  wrote:
>>
>>> I have a question in regards to monitor nodes and network layout.  Its
>>> my understanding that there should be two networks; a ceph only network for
>>> comms between the various ceph nodes, and a separate storage network where
>>> other systems will interface with the ceph nodes.  Are the monitor nodes
>>> supposed to straddle both the ceph only network and the storage network or
>>> just in the ceph network?
>>>
>>> Another question is can I run multiple things on the monitor nodes?
>>> Like the RADOS GW and the MDS?
>>>
>>>
>>> Thanks,
>>>
>>> *Nate Curry*
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor questions

2015-07-10 Thread Quentin Hartman
For very small values of production. I never had more than a couple clients
hitting either of them, but they were doing "real work". Ultimately though,
we decided to just use NFS exports from a VM to do what we were trying to
do with rgw and mds.

QH

On Fri, Jul 10, 2015 at 9:47 AM, Nate Curry  wrote:

> Yes that was what I meant.  Thanks.  Was that in a production environment?
>
> Nate Curry
> On Jul 10, 2015 11:21 AM, "Quentin Hartman" 
> wrote:
>
>> You mean the hardware config? They are older Core2-based servers with 4GB
>> of RAM. Nothing special. I have one running mon and rgw, one running mon
>> and mds, and one run just a mon.
>>
>> QH
>>
>> On Fri, Jul 10, 2015 at 8:58 AM, Nate Curry  wrote:
>>
>>> What was your monitor node's configuration when you had multiple ceph
>>> daemons running on them?
>>>
>>> *Nate Curry*
>>> IT Manager
>>> ISSM
>>> *Mosaic ATM*
>>> mobile: 240.285.7341
>>> office: 571.223.7036 x226
>>> cu...@mosaicatm.com
>>>
>>> On Thu, Jul 9, 2015 at 5:36 PM, Quentin Hartman <
>>> qhart...@direwolfdigital.com> wrote:
>>>
>>>> I have my mons sharing the ceph network, and while I currently do not
>>>> run mds or rgw, I have run those on my mon hosts in the past with no
>>>> perceptible ill effects.
>>>>
>>>> On Thu, Jul 9, 2015 at 3:20 PM, Nate Curry  wrote:
>>>>
>>>>> I have a question in regards to monitor nodes and network layout.  Its
>>>>> my understanding that there should be two networks; a ceph only network 
>>>>> for
>>>>> comms between the various ceph nodes, and a separate storage network where
>>>>> other systems will interface with the ceph nodes.  Are the monitor nodes
>>>>> supposed to straddle both the ceph only network and the storage network or
>>>>> just in the ceph network?
>>>>>
>>>>> Another question is can I run multiple things on the monitor nodes?
>>>>> Like the RADOS GW and the MDS?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> *Nate Curry*
>>>>>
>>>>>
>>>>> ___
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>>
>>>>
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
What does "ceph status" say? I had a problem with similar symptoms some
months ago that was accompanied by OSDs getting marked out for no apparent
reason and the cluster going into a HEALTH_WARN state intermittently.
Ultimately the root of the problem ended up being a faulty NIC. Once I took
that out of the picture everything started flying right.

QH

On Fri, Jul 17, 2015 at 8:21 AM, Mark Nelson  wrote:

> On 07/17/2015 08:38 AM, J David wrote:
>
>> This is the same cluster I posted about back in April.  Since then,
>> the situation has gotten significantly worse.
>>
>> Here is what iostat looks like for the one active RBD image on this
>> cluster:
>>
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> vdb   0.00 0.00   14.100.00   685.65 0.00
>> 97.26 3.43  299.40  299.400.00  70.92 100.00
>> vdb   0.00 0.001.100.00   140.80 0.00
>> 256.00 3.00 2753.09 2753.090.00 909.09 100.00
>> vdb   0.00 0.00   17.400.00  2227.20 0.00
>> 256.00 3.00  178.78  178.780.00  57.47 100.00
>> vdb   0.00 0.001.300.00   166.40 0.00
>> 256.00 3.00 2256.62 2256.620.00 769.23 100.00
>> vdb   0.00 0.008.200.00  1049.60 0.00
>> 256.00 3.00  362.10  362.100.00 121.95 100.00
>> vdb   0.00 0.001.100.00   140.80 0.00
>> 256.00 3.00 2517.45 2517.450.00 909.45 100.04
>> vdb   0.00 0.001.100.00   140.66 0.00
>> 256.00 3.00 2863.64 2863.640.00 909.09  99.90
>> vdb   0.00 0.000.700.0089.60 0.00
>> 256.00 3.00 3898.86 3898.860.00 1428.57 100.00
>> vdb   0.00 0.000.600.0076.80 0.00
>> 256.00 3.00 5093.33 5093.330.00 1666.67 100.00
>> vdb   0.00 0.001.200.00   153.60 0.00
>> 256.00 3.00 2568.33 2568.330.00 833.33 100.00
>> vdb   0.00 0.001.300.00   166.40 0.00
>> 256.00 3.00 2457.85 2457.850.00 769.23 100.00
>> vdb   0.00 0.00   13.900.00  1779.20 0.00
>> 256.00 3.00  220.95  220.950.00  71.94 100.00
>> vdb   0.00 0.001.000.00   128.00 0.00
>> 256.00 3.00 2250.40 2250.400.00 1000.00 100.00
>> vdb   0.00 0.001.300.00   166.40 0.00
>> 256.00 3.00 2798.77 2798.770.00 769.23 100.00
>> vdb   0.00 0.000.900.00   115.20 0.00
>> 256.00 3.00 3304.00 3304.000.00 .11 100.00
>> vdb   0.00 0.000.900.00   115.20 0.00
>> 256.00 3.00 3425.33 3425.330.00 .11 100.00
>> vdb   0.00 0.001.300.00   166.40 0.00
>> 256.00 3.00 2290.77 2290.770.00 769.23 100.00
>> vdb   0.00 0.004.300.00   550.40 0.00
>> 256.00 3.00  721.30  721.300.00 232.56 100.00
>> vdb   0.00 0.001.600.00   204.80 0.00
>> 256.00 3.00 1894.75 1894.750.00 625.00 100.00
>> vdb   0.00 0.001.200.00   153.60 0.00
>> 256.00 3.00 2375.00 2375.000.00 833.33 100.00
>> vdb   0.00 0.000.900.00   115.20 0.00
>> 256.00 3.00 3036.44 3036.440.00 .11 100.00
>> vdb   0.00 0.001.100.00   140.80 0.00
>> 256.00 3.00 3086.18 3086.180.00 909.09 100.00
>> vdb   0.00 0.000.900.00   115.20 0.00
>> 256.00 3.00 2480.44 2480.440.00 .11 100.00
>> vdb   0.00 0.001.200.00   153.60 0.00
>> 256.00 3.00 3124.33 3124.330.00 833.67 100.04
>> vdb   0.00 0.000.800.00   102.40 0.00
>> 256.00 3.00 3228.00 3228.000.00 1250.00 100.00
>> vdb   0.00 0.001.200.00   153.60 0.00
>> 256.00 3.00 2439.33 2439.330.00 833.33 100.00
>> vdb   0.00 0.001.300.00   166.40 0.00
>> 256.00 3.00 2567.08 2567.080.00 769.23 100.00
>> vdb   0.00 0.000.800.00   102.40 0.00
>> 256.00 3.00 3023.00 3023.000.00 1250.00 100.00
>> vdb   0.00 0.004.800.00   614.40 0.00
>> 256.00 3.00  712.50  712.500.00 208.33 100.00
>> vdb   0.00 0.001.300.00   118.75 0.00
>> 182.69 3.00 2003.69 2003.690.00 769.23 100.00
>> vdb   0.00 0.00   10.500.00  1344.00 0.00
>> 256.00 3.00  344.46  344.460.00  95.24 100.00
>>
>> So, between 0 and 15 reads per second, no write activity, a constant
>> queue depth of 3+, wait times in seconds, and 100% I/O utilization,
>> all for read performance of 100-200K/sec.  Even trivial writes can
>> hang for 15-60 seconds before completing.
>>
>> Sometimes this behavior will 

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
That looks a lot like what I was seeing initially. The OSDs getting marked
out was relatively rare and it took a bit before I saw it. I ended up
digging into the logs on the OSDs themselves to discover that they were
getting marked out. The messages were like "So-and-so incorrectly marked us
out" IIRC.

I don't remember the commands exactly, but there are tools for digging into
specifics of the health of individual pgs. It's something like "ceph pg
dump summary". Someone else may chime in with more details, but that should be a
good google seed. That should help you isolate the OSDs and PGs that are
being problematic.

On the off chance you are having a NIC problem like I was, use ifconfig to
check the error rates on your interfaces. You will likely have some errors,
but if one of them has way more than the others, it's worth investigating
the NIC / cabling. If nothing else it will help eliminate this as the root
of the problem.
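Concretely, the sort of checks I mean (from memory, so treat as a sketch):

ceph health detail                                  # names the OSDs with blocked/slow requests
ceph pg dump_stuck unclean                          # PGs that are not active+clean
grep 'slow request' /var/log/ceph/ceph-osd.*.log    # shows which op is stuck and which peer OSD it waits on
ip -s link show eth0                                # per-interface RX/TX error and drop counters

Between those you can usually pin a slow cluster down to a specific OSD, disk, or NIC.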

On Fri, Jul 17, 2015 at 9:06 AM, J David  wrote:
>
> On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman
>  wrote:
> > What does "ceph status" say?
>
> Usually it says everything is cool.  However just now it gave this:
>
> cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
>  health HEALTH_WARN 2 requests are blocked > 32 sec
>  monmap e3: 3 mons at
> {f16=
192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
> election epoch 176, quorum 0,1,2 f16,f17,f18
>  osdmap e10705: 28 osds: 28 up, 28 in
>   pgmap v26213984: 4224 pgs, 3 pools, 21637 GB data, 5409 kobjects
> 43344 GB used, 86870 GB / 127 TB avail
> 4224 active+clean
>   client io 368 kB/s rd, 444 kB/s wr, 10 op/s
>
> How do I find out what requests are blocked and why?  Resolving
> whatever that is seems like a very good next step to troubleshooting
> this issue.
>
> > I had a problem with similar symptoms some
> > months ago that was accompanied by OSDs getting marked out for no
apparent
> > reason and the cluster going into a HEALTH_WARN state intermittently.
>
> There's no record of any OSD's being marked out at the time of
> problems.  (E.g. as shown above, it's currently angry about something,
> but all OSD's are up & in.)
>
> Here's the slow requests related info from ceph.log from the last 90
minutes:
>
> 2015-07-17 06:25:14.809606 osd.16 192.168.19.218:6823/29360 4137 :
> [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.051626 secs
> 2015-07-17 06:25:14.809618 osd.16 192.168.19.218:6823/29360 4138 :
> [WRN] slow request 30.051626 seconds old, received at 2015-07-17
> 06:24:44.757909: osd_op(client.32913524.0:3016955
> rbd_data.15322ae8944a.0011f34f [set-alloc-hint object_size
> 4194304 write_size 4194304,write 1041920~3152384] 2.20c2afbe
> ack+ondisk+write e10705) v4 currently waiting for subops from 13
> 2015-07-17 06:25:20.280232 osd.24 192.168.19.218:6826/31177 4095 :
> [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.105786 secs
> 2015-07-17 06:25:20.280239 osd.24 192.168.19.218:6826/31177 4096 :
> [WRN] slow request 30.105786 seconds old, received at 2015-07-17
> 06:24:50.174399: osd_op(client.32913524.0:3016970
> rbd_data.15322ae8944a.0011f355 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 2.66b14219
> ack+ondisk+write e10705) v4 currently waiting for subops from 4
> 2015-07-17 06:25:28.476583 osd.2 192.168.19.216:6808/29552 12827 :
> [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 33.389677 secs
> 2015-07-17 06:25:28.476594 osd.2 192.168.19.216:6808/29552 12828 :
> [WRN] slow request 33.389677 seconds old, received at 2015-07-17
> 06:24:55.086843: osd_sub_op(client.32913524.0:3016982 2.e55
> 53500e55/rbd_data.15322ae8944a.0011f359/head//2 [] v
> 10705'851608 snapset=0=[]:[] snapc=0=[]) v11 currently commit sent
> 2015-07-17 06:25:25.223395 osd.20 192.168.19.218:6810/30518 4133 :
> [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.255906 secs
> 2015-07-17 06:25:25.223406 osd.20 192.168.19.218:6810/30518 4134 :
> [WRN] slow request 30.255906 seconds old, received at 2015-07-17
> 06:24:54.967416: osd_op(client.32913524.0:3016982
> rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~3683840] 2.53500e55
> ack+ondisk+write e10705) v4 currently waiting for subops from 2
> 2015-07-17 06:25:30.224304 osd.20 192.168.19.218:6810/30518 4135 :
> [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 30.574590 secs
> 2015-07-17 06:25:30.224312 osd.20 192.168.19.218:6810/30518 4136 :
> [WRN] slow request 30.574590 seconds old, received at 2015-07-17
> 06:24:59.64965

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
Glad we were able to point you in the right direction! I would suspect a
borderline cable at this point. Did you happen to notice if the interface
had negotiated down to some dumb speed? If it had, I've seen cases where a
dodgy cable has caused an intermittent problem that causes it to negotiate
the speed downward, but then it never tries to come back up until the
interface is restarted.

fwiw, I'm using the same chipset (not onboard) and driver, and they have
been rock solid so far. I would be skeptical of a driver bug as well.

QH

On Fri, Jul 17, 2015 at 2:34 PM, J David  wrote:

> On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson  wrote:
> > Maybe try some iperf tests between the different OSD nodes in your
> > cluster and also the client to the OSDs.
>
> This proved to be an excellent suggestion.  One of these is not like the
> others:
>
> f16 inbound: 6Gbps
> f16 outbound: 6Gbps
> f17 inbound: 6Gbps
> f17 outbound: 6Gbps
> f18 inbound: 6Gbps
> f18 outbound: 1.2Mbps
>
> There is flatly no explanation for the outbound performance on f18.
> There are no errors in ifconfig/netstat, nothing logged on the switch,
> etc.  Even with tcpdump running during iperf, there aren't retransmits
> or anything.  It's just slow.
>
> ifconfig'ing the primary bond interface down immediately resolved the
> problem.  The iostat running in the virtual machine immediately surged
> to 500+ IOPS and 40M-60M/sec.
>
> Weirdly, ifconfig'ing the primary device back up did not bring the
> problem back.  It switched back to that interface, but everything is
> still fine (and iperf gives 6Gbps) at the moment.  There's no way of
> telling if that will last, but it's a solid lead either way.
>
> It's an Intel onboard dual-port X540's using the ixgbe driver.  If it
> were a driver problem, we've got tons of these so I'd expect to see
> this problem elsewhere.  If it's a hardware problem, ifconfig down/up
> doesn't seem like it would "fix" it.  Very mysterious!
>
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] injectargs not working?

2015-07-29 Thread Quentin Hartman
I'm running a 0.87.1 cluster, and my "ceph tell" seems to not be working:

# ceph tell osd.0 injectargs '--osd-scrub-begin-hour 1'
 failed to parse arguments: --osd-scrub-begin-hour,1


I've also tried the daemon config set variant and it also fails:

# ceph daemon osd.0 config set osd_scrub_begin_hour 1
{ "error": "error setting 'osd_scrub_begin_hour' to '1': (2) No such file
or directory"}

I'm guessing I have something goofed in my admin socket client config:

[client]
rbd cache = true
rbd cache writethrough until flush = true
admin socket = /var/run/ceph/$cluster-$type.$id.asok

but that seems to correlate with the structure that exists:

# ls
ceph-osd.24.asok  ceph-osd.25.asok  ceph-osd.26.asok
# pwd
/var/run/ceph

I can show my configs all over the place, but changing them seems to always
fail. It behaves the same if I'm working on a local daemon, or on my config
node trying to make changes globally.

Thanks in advance for any ideas

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] injectargs not working?

2015-07-29 Thread Quentin Hartman
well, that would certainly do it. I _always_ forget to twiddle the little
thing on the web page that changes the version of the docs I'm looking at.

So I guess then my question becomes, "How do I prevent deep scrubs from
happening in the middle of the day and ruining everything?"
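One stopgap that does work on Giant, since the scrub begin/end hour options only arrive with Hammer, is to toggle the global nodeep-scrub flag on a schedule. A sketch as a cron.d file (the times are hypothetical):

# /etc/cron.d/ceph-scrub-window
0 7  * * *  root  ceph osd set nodeep-scrub     # block new deep scrubs during the day
0 22 * * *  root  ceph osd unset nodeep-scrub   # let them run overnight

Deep scrubs already in flight still finish, and the cluster shows HEALTH_WARN while the flag is set, so the monitoring side needs to tolerate that.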

QH


On Wed, Jul 29, 2015 at 5:55 PM, Travis Rhoden  wrote:

> Hi Quentin,
>
> It may be the specific option you are trying to tweak.
> osd-scrub-begin-hour was first introduced in development release
> v0.93, which means it would be in 0.94.x (Hammer), but your cluster is
> 0.87.1 (Giant).
>
> Cheers,
>
>  - Travis
>
> On Wed, Jul 29, 2015 at 4:28 PM, Quentin Hartman
>  wrote:
> > I'm running a 0.87.1 cluster, and my "ceph tell" seems to not be working:
> >
> > # ceph tell osd.0 injectargs '--osd-scrub-begin-hour 1'
> >  failed to parse arguments: --osd-scrub-begin-hour,1
> >
> >
> > I've also tried the daemon config set variant and it also fails:
> >
> > # ceph daemon osd.0 config set osd_scrub_begin_hour 1
> > { "error": "error setting 'osd_scrub_begin_hour' to '1': (2) No such
> file or
> > directory"}
> >
> > I'm guessing I have something goofed in my admin socket client config:
> >
> > [client]
> > rbd cache = true
> > rbd cache writethrough until flush = true
> > admin socket = /var/run/ceph/$cluster-$type.$id.asok
> >
> > but that seems to correlate with the structure that exists:
> >
> > # ls
> > ceph-osd.24.asok  ceph-osd.25.asok  ceph-osd.26.asok
> > # pwd
> > /var/run/ceph
> >
> > I can show my configs all over the place, but changing them seems to
> always
> > fail. It behaves the same if I'm working on a local daemon, or on my
> config
> > node trying to make changes globally.
> >
> > Thanks in advance for any ideas
> >
> > QH
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-29 Thread Quentin Hartman
I just had my ceph cluster, which is running 0.87.1, exhibit this behavior
(two of three mons eat all CPU, cluster becomes unusably slow).

It seems to be tied to deep scrubbing, as the behavior almost immediately
surfaces if that is turned on, but if it is off the behavior eventually
seems to return to normal and stays that way while scrubbing is off. I have
not yet found anything in the cluster to indicate a hardware problem.

Any thoughts or further insights on this subject would be appreciated.
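In case it helps anyone comparing notes, the mon store growth mentioned further down can be checked and compacted without rebuilding the mon (default paths and mon ids assumed):

du -sh /var/lib/ceph/mon/*    # size of each mon's leveldb store
ceph tell mon.0 compact       # ask that mon to compact its store now

There is also a 'mon compact on start = true' option to do the same at every daemon start.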

QH

On Sat, Jul 25, 2015 at 12:31 AM, Luis Periquito 
wrote:

> I think I figured out! All 4 of the OSDs on one host (OSD 107-110) were
> sending massive amounts of auth requests to the monitors, seeming to
> overwhelm them.
>
> Weird bit is that I removed them (osd crush remove, auth del, osd rm), dd
> the box and all of the disks, reinstalled and guess what? They are still
> doing a lot of requests to the MONs... this will require some further
> investigations.
>
> As this is happening during my holidays, I just disabled them, and will
> investigate further when I get back.
>
>
> On Fri, Jul 24, 2015 at 11:11 PM, Kjetil Jørgensen 
> wrote:
>
>> It sounds slightly similar to what I just experienced.
>>
>> I had one monitor out of three, which seemed to essentially run one core
>> at full tilt continuously, and had it's virtual address space allocated at
>> the point where top started calling it Tb. Requests hitting this monitor
>> did not get very timely responses (although; I don't know if this were
>> happening consistently or arbitrarily).
>>
>> I ended up re-building the monitor from the two healthy ones I had, which
>> made the problem go away for me.
>>
>> After the fact inspection of the monitor I ripped out, clocked it in at
>> 1.3Gb compared to the 250Mb of the other two, after rebuild they're all
>> comparable in size.
>>
>> In my case; this started out for me on firefly, and persisted after
>> upgrading to hammer. Which prompted the rebuild, suspecting that in my case
>> it were related to "something" persistent for this monitor.
>>
>> I do not have that much more useful to contribute to this discussion,
>> since I've more-or-less destroyed any evidence by re-building the monitor.
>>
>> Cheers,
>> KJ
>>
>> On Fri, Jul 24, 2015 at 1:55 PM, Luis Periquito 
>> wrote:
>>
>>> The leveldb is smallish: around 70mb.
>>>
>>> I ran debug mon = 10 for a while,  but couldn't find any interesting
>>> information. I would run out of space quite quickly though as the log
>>> partition only has 10g.
>>> On 24 Jul 2015 21:13, "Mark Nelson"  wrote:
>>>
 On 07/24/2015 02:31 PM, Luis Periquito wrote:

> Now it's official,  I have a weird one!
>
> Restarted one of the ceph-mons with jemalloc and it didn't make any
> difference. It's still using a lot of cpu and still not freeing up
> memory...
>
> The issue is that the cluster almost stops responding to requests, and
> if I restart the primary mon (that had almost no memory usage nor cpu)
> the cluster goes back to its merry way responding to requests.
>
> Does anyone have any idea what may be going on? The worst bit is that I
> have several clusters just like this (well they are smaller), and as we
> do everything with puppet, they should be very similar... and all the
> other clusters are just working fine, without any issues whatsoever...
>

 We've seen cases where leveldb can't compact fast enough and memory
 balloons, but it's usually associated with extreme CPU usage as well. It
 would be showing up in perf though if that were the case...


> On 24 Jul 2015 10:11, "Jan Schermer"  > wrote:
>
> You don’t (shouldn’t) need to rebuild the binary to use jemalloc.
> It
> should be possible to do something like
>
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd …
>
> The last time we tried it segfaulted after a few minutes, so YMMV
> and be careful.
>
> Jan
>
>  On 23 Jul 2015, at 18:18, Luis Periquito > > wrote:
>>
>> Hi Greg,
>>
>> I've been looking at the tcmalloc issues, but did seem to affect
>> osd's, and I do notice it in heavy read workloads (even after the
>> patch and
>> increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728). This
>> is affecting the mon process though.
>>
>> looking at perf top I'm getting most of the CPU usage in mutex
>> lock/unlock
>>   5.02%  libpthread-2.19.so  [.] pthread_mutex_unlock
>>   3.82%  libsoftokn3.so      [.] 0x0001e7cb
>>   3.46%  libpthread-2.19.so  [.] pthread_mutex_lock
>>
>> I could try to use jemalloc, are you aware of any built binaries?
>> Can I mix a cluste

Re: [ceph-users] injectargs not working?

2015-07-29 Thread Quentin Hartman
So it looks like the scrub was not actually the root of the problem. It
seems I have some failing hardware that I'm now trying to run down.

QH

On Wed, Jul 29, 2015 at 8:22 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Wed, 29 Jul 2015 17:59:10 -0600 Quentin Hartman wrote:
>
> > well, that would certainly do it. I _always_ forget to twiddle the little
> > thing on the web page that changes the version of the docs I'm looking
> > at.
> >
> > So I guess then my question becomes, "How do i prevent deep scrubs from
> > happening in the middle of the day and ruining everything?"
> >
>
> Firstly a qualification and quantification of "ruining everything" would
> be interesting, but I'll assume it's bad.
>
> I have (had) clusters where even simple scrubs would be detrimental, so I
> can relate.
>
> That being said, if your cluster goes catatonic when being scrubbed, you
> might want to improve it (more, faster OSDs, etc) because a deep scrub
> isn't all that different from the load you'll experience when losing an
> OSD or node even, something your cluster should survive w/o becoming
> totally unusable in regards to client I/O.
>
> The most effective way to keep scrubs from starving client
> I/O is setting "osd_scrub_sleep = 0.1" (the recommended value in
> documentation seems to be far too small to have any beneficial effect for
> most people).
>
> To scrub at a specific time and given that your cluster can deep-scrub
> itself completely during the night, consider issuing a
> "ceph osd deep-scrub \*"
> late on a weekend evening.
>
> My largest cluster can deep scrub itself in 4 hours, so once I kicked that
> off at midnight on a Saturday all scrubs (daily) and deep scrubs
> (weekly) happen in that time frame.
>
> Christian
>
> > QH
> >
> >
> > On Wed, Jul 29, 2015 at 5:55 PM, Travis Rhoden 
> wrote:
> >
> > > Hi Quentin,
> > >
> > > It may be the specific option you are trying to tweak.
> > > osd-scrub-begin-hour was first introduced in development release
> > > v0.93, which means it would be in 0.94.x (Hammer), but your cluster is
> > > 0.87.1 (Giant).
> > >
> > > Cheers,
> > >
> > >  - Travis
> > >
> > > On Wed, Jul 29, 2015 at 4:28 PM, Quentin Hartman
> > >  wrote:
> > > > I'm running a 0.87.1 cluster, and my "ceph tell" seems to not be
> > > > working:
> > > >
> > > > # ceph tell osd.0 injectargs '--osd-scrub-begin-hour 1'
> > > >  failed to parse arguments: --osd-scrub-begin-hour,1
> > > >
> > > >
> > > > I've also tried the daemon config set variant and it also fails:
> > > >
> > > > # ceph daemon osd.0 config set osd_scrub_begin_hour 1
> > > > { "error": "error setting 'osd_scrub_begin_hour' to '1': (2) No such
> > > file or
> > > > directory"}
> > > >
> > > > I'm guessing I have something goofed in my admin socket client
> > > > config:
> > > >
> > > > [client]
> > > > rbd cache = true
> > > > rbd cache writethrough until flush = true
> > > > admin socket = /var/run/ceph/$cluster-$type.$id.asok
> > > >
> > > > but that seems to correlate with the structure that exists:
> > > >
> > > > # ls
> > > > ceph-osd.24.asok  ceph-osd.25.asok  ceph-osd.26.asok
> > > > # pwd
> > > > /var/run/ceph
> > > >
> > > > I can show my configs all over the place, but changing them seems to
> > > always
> > > > fail. It behaves the same if I'm working on a local daemon, or on my
> > > config
> > > > node trying to make changes globally.
> > > >
> > > > Thanks in advance for any ideas
> > > >
> > > > QH
> > > >
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Check networking first?

2015-07-30 Thread Quentin Hartman
Just wanted to drop a note to the group that I had my cluster go sideways
yesterday, and the root of the problem was networking again. Using iperf I
discovered that one of my nodes was only moving data at 1.7Mb/s. Moving
that node to a different switch port with a different cable has resolved
the problem. It took a while to track down because none of the server-side
error metrics for disk or network showed anything was amiss, and I didn't
think to test network performance (as suggested in another thread) until
well into the process.
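
For anyone who hasn't run that kind of test, iperf only needs a listener on
one node and a client on another. A minimal sketch, with a placeholder
address:

  # on the node being checked
  iperf -s

  # on any other node, pointed at the first node's cluster-network address
  iperf -c 10.20.0.11 -t 30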

Check networking first!

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-30 Thread Quentin Hartman
Thanks for the suggestion. NTP is fine in my case. It turns out it was a
networking problem that wasn't triggering error counters on the NICs, so it
took a bit to track down.
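
For reference, the two checks that rule clock skew in or out quickly, as a
small sketch:

  # the monitors flag skew themselves once it exceeds the allowed drift
  ceph health detail | grep -i "clock skew"

  # on each mon host, confirm ntpd is actually synced to a peer
  ntpq -p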

QH

On Thu, Jul 30, 2015 at 4:16 PM, Spillmann, Dieter <
dieter.spillm...@arris.com> wrote:

> I saw this behavior when the servers are not in time sync.
> Check your ntp settings
>
> Dieter
>
> From: ceph-users  on behalf of Quentin
> Hartman 
> Date: Wednesday, July 29, 2015 at 5:47 PM
> To: Luis Periquito 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] ceph-mon cpu usage
>
> I just had my ceph cluster exhibit this behavior (two of three mons eat
> all CPU, cluster becomes unusably slow) which is running 0.87.1
>
> It seems to be tied to deep scrubbing, as the behavior almost immediately
> surfaces if that is turned on, but if it is off the behavior eventually
> seems to return to normal and stays that way while scrubbing is off. I have
> not yet found anything in the cluster to indicate a hardware problem.
>
> Any thoughts or further insights on this subject would be appreciated.
>
> QH
>
> On Sat, Jul 25, 2015 at 12:31 AM, Luis Periquito 
> wrote:
>
>> I think I figured out! All 4 of the OSDs on one host (OSD 107-110) were
>> sending massive amounts of auth requests to the monitors, seeming to
>> overwhelm them.
>>
>> Weird bit is that I removed them (osd crush remove, auth del, osd rm), dd
>> the box and all of the disks, reinstalled and guess what? They are still
>> doing a lot of requests to the MONs... this will require some further
>> investigations.
>>
>> As this is happening during my holidays, I just disabled them, and will
>> investigate further when I get back.
>>
>>
>> On Fri, Jul 24, 2015 at 11:11 PM, Kjetil Jørgensen 
>> wrote:
>>
>>> It sounds slightly similar to what I just experienced.
>>>
>>> I had one monitor out of three, which seemed to essentially run one core
>>> at full tilt continuously, and had it's virtual address space allocated at
>>> the point where top started calling it Tb. Requests hitting this monitor
>>> did not get very timely responses (although; I don't know if this were
>>> happening consistently or arbitrarily).
>>>
>>> I ended up re-building the monitor from the two healthy ones I had,
>>> which made the problem go away for me.
>>>
>>> After the fact inspection of the monitor I ripped out, clocked it in at
>>> 1.3Gb compared to the 250Mb of the other two, after rebuild they're all
>>> comparable in size.
>>>
>>> In my case; this started out for me on firefly, and persisted after
>>> upgrading to hammer. Which prompted the rebuild, suspecting that in my case
>>> it were related to "something" persistent for this monitor.
>>>
>>> I do not have that much more useful to contribute to this discussion,
>>> since I've more-or-less destroyed any evidence by re-building the monitor.
>>>
>>> Cheers,
>>> KJ
>>>
>>> On Fri, Jul 24, 2015 at 1:55 PM, Luis Periquito 
>>> wrote:
>>>
>>>> The leveldb is smallish: around 70mb.
>>>>
>>>> I ran debug mon = 10 for a while,  but couldn't find any interesting
>>>> information. I would run out of space quite quickly though as the log
>>>> partition only has 10g.
>>>> On 24 Jul 2015 21:13, "Mark Nelson"  wrote:
>>>>
>>>>> On 07/24/2015 02:31 PM, Luis Periquito wrote:
>>>>>
>>>>>> Now it's official,  I have a weird one!
>>>>>>
>>>>>> Restarted one of the ceph-mons with jemalloc and it didn't make any
>>>>>> difference. It's still using a lot of cpu and still not freeing up
>>>>>> memory...
>>>>>>
>>>>>> The issue is that the cluster almost stops responding to requests, and
>>>>>> if I restart the primary mon (that had almost no memory usage nor cpu)
>>>>>> the cluster goes back to its merry way responding to requests.
>>>>>>
>>>>>> Does anyone have any idea what may be going on? The worst bit is that
>>>>>> I
>>>>>> have several clusters just like this (well they are smaller), and as
>>>>>> we
>>>>>> do everything with puppet, they should be very similar... and all the
>>>>>> other clusters are just working fine, without any issues whatsoever...
>>>>>>

Re: [ceph-users] Check networking first?

2015-08-03 Thread Quentin Hartman
The problem with this kind of monitoring is that there are so many possible
metrics to watch and so many possible ways to watch them. For myself, I'm
working on implementing a couple of things:
- Watching error counters on servers
- Watching error counters on switches
- Watching performance

My plan for this is to feed these metrics into graphite and then use
Skyline to do anomaly detection on them. The error counters come from
simple collectors on every machine, so that check is very light. The
performance piece is a bit trickier. My intent is to run an iperf test
between two semi-randomly selected nodes in the cluster every 30 minutes.
After a node has been tested successfully it is removed from the pool of
potential nodes to test; once the pool is depleted, it gets reset. If that
proves to be too intense, I'll do something lighter.
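
A rough sketch of that rotation, assuming an iperf server already running on
every node, a one-hostname-per-line nodes.txt, and a carbon relay listening
on port 2003 (hostnames and paths are placeholders); run from cron every 30
minutes:

  #!/bin/bash
  # Pick one not-yet-tested node, test it against a random peer, report the
  # result to graphite, then retire it from the pool; refill when empty.
  NODES=/etc/ceph/nodes.txt
  POOL=/var/tmp/iperf_pool.txt
  GRAPHITE=graphite.example.com

  [ -s "$POOL" ] || cp "$NODES" "$POOL"   # reset the pool once it is depleted

  src=$(shuf -n1 "$POOL")
  dst=$(grep -vx "$src" "$NODES" | shuf -n1)

  # -y C gives CSV output; the last field is the measured bandwidth in bits/sec
  bw=$(ssh "$src" iperf -c "$dst" -t 10 -y C | awk -F, '{print $NF}')

  # short hostnames assumed, so the graphite metric path stays sane
  echo "ceph.net.iperf.${src}.${dst} ${bw} $(date +%s)" | nc -q1 "$GRAPHITE" 2003

  grep -vx "$src" "$POOL" > "${POOL}.tmp" && mv "${POOL}.tmp" "$POOL"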

QH

On Mon, Aug 3, 2015 at 6:21 AM, John Spray  wrote:

> On Mon, Aug 3, 2015 at 12:30 PM, Stijn De Weirdt
>  wrote:
> >> Like a lot of system monitoring stuff, this is the kind of thing that
> >> in an ideal world we wouldn't have to worry about, but the experience
> >> in practice is that people deploy big distributed storage systems
> >> without having really good monitoring in place.  We (people providing
> >
> > not to become completely off-topic but do you have any suggestions for
> such
> > "really good monitoring" that could help monitor the many-to-many
> > communication pattern that is typical for ceph cluster? especially the
> > performance part, not only the funxtional part.
>
> I guess I'm kind of just assuming that ops people have tools for this
> stuff -- I don't run any large systems myself, so can't recommend
> anything.
>
> John
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Check networking first?

2015-08-03 Thread Quentin Hartman
All of the other things that I would be looking at would show a link speed
failure. In the two cases of network shenanigans I've had that effectively
broke ceph the link speed was always correct. That leads me to distrust
link speed as a reliable source of truth. Also, it's testing a proxy for
what you actually care about, not the thing you actually care about. Put
another way, I don't care what speed the link negotiates, I care how fast
it actually moves packets.

Link usage seems like a much more interesting metric to me, but I would be
concerned about generating a lot of false positives. If my network is
usually 20% utilized, but then spikes to 100% for a while because of some
legit activity, I don't want to get an alarm for that. A rule with a very
long normalization period might help, so it only alarms if the link is
pegged for multiple hours. But again, I would think that in that case there
would be other problems evident because of some pathological state on the
network. I don't care if my network is being fully utilized, I care if the
network is being utilized to the point that it's causing IO wait in VMs.
Looking at utilization alone won't tell me that. Still, it could be a good
canary for anomaly detection if your normal state leaves a lot of headroom.
I'm torn on this one.

Also, while we're on the subject, if anyone isn't doing any kind of metric
collection on their ceph networks, I highly recommend installing ganglia.
It's dead simple to get going and creates all sorts of useful system-level
graphs that are helpful in locating and identifying trends and problems.

QH

On Mon, Aug 3, 2015 at 9:45 AM, Antonio Messina 
wrote:

> On Mon, Aug 3, 2015 at 5:10 PM, Quentin Hartman
>  wrote:
> > The problem with this kind of monitoring is that there are so many
> possible
> > metrics to watch and so many possible ways to watch them. For myself, I'm
> > working on implementing a couple of things:
> > - Watching error counters on servers
> > - Watching error counters on switches
> > - Watching performance
>
> I would also check:
>
> - link speed (on both servers and switches)
> - link usage (over 80% issue a warning)
>
> .a.
>
> --
> antonio.mess...@uzh.ch
> S3IT: Services and Support for Science IT    http://www.s3it.uzh.ch/
> University of Zurich Y12 F 84
> Winterthurerstrasse 190
> CH-8057 Zurich Switzerland
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSD's when scrubbing

2015-08-07 Thread Quentin Hartman
That kind of behavior is usually caused by the OSDs getting busy enough
that they aren't answering heartbeats in a timely fashion. It can also
happen if you have any network flakiness and heartbeats are getting lost
because of that.

I think (I'm not positive though) that increasing your heartbeat interval
may help. Also, looking at the number of threads you have for your OSDs,
that seems potentially problematic. If you've got 24 OSDs per machine and
each one is running 12 threads, that's 288 threads on 12 cores for just the
requests. Plus the disk threads, plus the filestore op threads... That
level of thread contention seems like it might be contributing to missing
the heartbeats. But again, that's conjecture. I've not worked with a setup
as dense as yours.
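
A quick way to sanity-check that kind of thread math on a live OSD node, as
a sketch:

  # total threads across all ceph-osd processes on this host
  ps -eLf | grep '[c]eph-osd' | wc -l

  # or per process, via the kernel's thread list
  for p in $(pidof ceph-osd); do echo "$p: $(ls /proc/$p/task | wc -l) threads"; done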

QH

On Fri, Aug 7, 2015 at 11:21 AM, Tuomas Juntunen <
tuomas.juntu...@databasement.fi> wrote:

> Hi
>
>
>
> We are experiencing an annoying problem where scrubs make OSD’s flap down
> and cause Ceph cluster to be unusable for couple of minutes.
>
>
>
> Our cluster consists of three nodes connected with 40gbit infiniband using
> IPoIB, with 2x 6 core X5670 CPU’s and 64GB of memory
>
> Each node has 6 SSD’s for journals to 12 OSD’s 2TB disks (Fast pools) and
> another 12 OSD’s 4TB disks (Archive pools) which have journal on the same
> disk.
>
>
>
> It seems that our cluster is constantly doing scrubbing, we rarely see
> only active+clean, below is the status at the moment.
>
>
>
> cluster a2974742-3805-4cd3-bc79-765f2bddaefe
>
>  health HEALTH_OK
>
>  monmap e16: 4 mons at {lb1=
> 10.20.60.1:6789/0,lb2=10.20.60.2:6789/0,nc1=10.20.50.2:6789/0,nc2=10.20.50.3:6789/0
> }
>
> election epoch 1838, quorum 0,1,2,3 nc1,nc2,lb1,lb2
>
>  mdsmap e7901: 1/1/1 up {0=lb1=up:active}, 4 up:standby
>
>  osdmap e104824: 72 osds: 72 up, 72 in
>
>   pgmap v12941402: 5248 pgs, 9 pools, 19644 GB data, 4810 kobjects
>
> 59067 GB used, 138 TB / 196 TB avail
>
> 5241 active+clean
>
>7 active+clean+scrubbing
>
>
>
> When OSD’s go down, first the load on a node goes high during scrubbing
> and after that some OSD’s go down and 30 secs, they are back up. They are
> not really going down, but are marked as down. Then it takes around couple
> of minutes for everything be OK again.
>
>
>
> Any suggestion how to fix this? We can’t go to production while this
> behavior exists.
>
>
>
> Our config is below:
>
>
>
> [global]
>
> fsid = a2974742-3805-4cd3-bc79-765f2bddaefe
>
> mon_initial_members = lb1,lb2,nc1,nc2
>
> mon_host = 10.20.60.1,10.20.60.2,10.20.50.2,10.20.50.3
>
> auth_cluster_required = cephx
>
> auth_service_required = cephx
>
> auth_client_required = cephx
>
> filestore_xattr_use_omap = true
>
>
>
> osd pool default pg num = 128
>
> osd pool default pgp num = 128
>
>
>
> public network = 10.20.0.0/16
>
>
>
> osd_op_threads = 12
>
> osd_op_num_threads_per_shard = 2
>
> osd_op_num_shards = 6
>
> #osd_op_num_sharded_pool_threads = 25
>
> filestore_op_threads = 12
>
> ms_nocrc = true
>
> filestore_fd_cache_size = 64
>
> filestore_fd_cache_shards = 32
>
> ms_dispatch_throttle_bytes = 0
>
> throttler_perf_counter = false
>
>
>
> mon osd min down reporters = 25
>
>
>
> [osd]
>
> osd scrub max interval = 1209600
>
> osd scrub min interval = 604800
>
> osd scrub load threshold = 3.0
>
> osd max backfills = 1
>
> osd recovery max active = 1
>
> # IO Scheduler settings
>
> osd scrub sleep = 1.0
>
> osd disk thread ioprio class = idle
>
> osd disk thread ioprio priority = 7
>
> osd scrub chunk max = 1
>
> osd scrub chunk min = 1
>
> osd deep scrub stride = 1048576
>
> filestore queue max ops = 1
>
> filestore max sync interval = 30
>
> filestore min sync interval = 29
>
>
>
> osd deep scrub interval = 2592000
>
> osd heartbeat grace = 240
>
> osd heartbeat interval = 12
>
> osd mon report interval max = 120
>
> osd mon report interval min = 5
>
>
>
>osd_client_message_size_cap = 0
>
> osd_client_message_cap = 0
>
> osd_enable_op_tracker = false
>
>
>
> osd crush update on start = false
>
>
>
> [client]
>
> rbd cache = true
>
> rbd cache size = 67108864 # 64mb
>
> rbd cache max dirty = 50331648 # 48mb
>
> rbd cache target dirty = 33554432 # 32mb
>
> rbd cache writethrough until flush = true # It's by default
>
> rbd cache max dirty age = 2
>
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>
>
>
>
>
> Br,
>
> Tuomas
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Quentin Hartman
I would say probably not. btrfs (or, "worse FS" as we call it around my
office) still does weird stuff from time to time, especially in low-memory
conditions. This is based on testing we did on Ubuntu 14.04, running kernel
3.16.something.

I long for the day that btrfs realizes its promise, but I do not think
that day is here.

QH

On Fri, Aug 7, 2015 at 2:05 PM, Ben Hines  wrote:

> Howdy,
>
> The Ceph docs still say btrfs is 'experimental' in one section, but
> say it's the long term ideal for ceph in the later section. Is this
> still accurate with Hammer? Is it mature enough on centos 7.1 for
> production use?
>
> (kernel is  3.10.0-229.7.2.el7.x86_64 )
>
> thanks-
>
> -Ben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph monitoring with graphite

2015-08-26 Thread Quentin Hartman
That would certainly be something we would use.
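
For context, the sort of convertor script being discussed is usually only a
few lines. A rough sketch that feeds per-pool usage into the carbon
plaintext port; the jq paths and the graphite host are assumptions to check
against your own "ceph df --format json" output:

  # push per-pool bytes_used from ceph df to graphite
  # (pool names may need invalid characters stripped, as Dan notes)
  ts=$(date +%s)
  ceph df --format json \
    | jq -r '.pools[] | "ceph.pools.\(.name).bytes_used \(.stats.bytes_used)"' \
    | sed "s/\$/ $ts/" \
    | nc -q1 graphite.example.com 2003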

QH

On Wed, Aug 26, 2015 at 8:33 AM, Dan van der Ster 
wrote:

> Hi Wido,
>
> On Wed, Aug 26, 2015 at 10:36 AM, Wido den Hollander 
> wrote:
> > I'm sending pool statistics to Graphite
>
> We're doing the same -- stripping invalid chars as needed -- and I
> would guess that lots of people have written similar json2graphite
> convertor scripts for Ceph monitoring in the recent months.
>
> It makes me wonder if it might be useful if Ceph had a --format mode
> to output df/stats/perf commands directly in graphite compatible text.
> Shouldn't be too difficult to write.
>
> Cheers, Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Hammer for Production?

2015-08-27 Thread Quentin Hartman
I'm currently running Giant in my cluster, and there are a number of things
in Hammer that look promising. Today's release announcement for Hammer
reminded me to go take a look at it, and I'm getting mixed information
about whether or not it is considered safe for production yet. The main
ceph.com page features it, but all the docs talk about Giant, unless you go
to the dev versions.

Thanks!


QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer for Production?

2015-08-27 Thread Quentin Hartman
Well, assuming you guys are cleaving to the normal Red Hat ethos, that
would be a pretty ringing endorsement of production readiness.

Thanks!

QH

On Thu, Aug 27, 2015 at 11:35 AM, Ian Colle  wrote:

> Quentin,
>
> Red Hat Ceph Storage 1.3 is based upon Hammer. I guess you can take away
> from that that we at Red Hat think it's production ready :-)
>
> Ian
>
> On Thu, Aug 27, 2015 at 10:30 AM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> I'm currently running Giant in my cluster, and there are a number of
>> things in Hammer that look promising. Today's release announcement for
>> Hammer reminded me to go take a look at it, and I'm getting mixed
>> information about whether or not it is considered safe production yet or
>> not. The main ceph.com page features it, but all the docs talk about
>> Giant, unless you go to the dev versions.
>>
>> Thanks!
>>
>>
>> QH
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Ian R. Colle
> Global Director of Software Engineering
> Red Hat, Inc.
> ico...@redhat.com
> +1-303-601-7713
> http://www.linkedin.com/in/ircolle
> http://www.twitter.com/ircolle
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
So I'm in the middle of trying to triage a problem with my ceph cluster
running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
been running happily for about a year. This last weekend, something caused
the box running the MDS to seize hard, and when we came in on Monday,
several OSDs were down or unresponsive. I brought the MDS and the OSDs
back online, and managed to get things running again with minimal data
loss. Had to mark a few objects as lost, but things were apparently running
fine at the end of the day on Monday.

This afternoon, I noticed that one of the OSDs was apparently stuck in a
crash/restart loop, and the cluster was unhappy. Performance was in the
tank and "ceph status" was reporting all manner of problems, as one would
expect if an OSD is misbehaving. I marked the offending OSD out, and the
cluster started rebalancing as expected. However, I noticed a short while
later that another OSD had started into a crash/restart loop. So, I
repeated the process. And it happened again. At this point I noticed that
there were actually two at a time in this state.

It's as if there's some toxic chunk of data that is getting passed around,
and when it lands on an OSD it kills it. Contrary to that, however, I tried
just stopping an OSD when it's in a bad state, and once the cluster starts
to try rebalancing with that OSD down and not previously marked out,
another OSD will start crash-looping.

I've investigated the disk of the first OSD I found with this problem, and
it has no apparent corruption on the file system.

I'll follow up to this shortly with links to pastes of log snippets. Any
input would be appreciated. This is turning into a real cascade failure,
and I haven't any idea how to stop it.

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Ceph health detail - http://pastebin.com/5URX9SsQ
pg dump summary (with active+clean pgs removed) -
http://pastebin.com/Y5ATvWDZ
an osd crash log (in github gist because it was too big for pastebin) -
https://gist.github.com/qhartman/cb0e290df373d284cfb5

And now I've got four OSDs that are looping.

On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> So I'm in the middle of trying to triage a problem with my ceph cluster
> running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
> been running happily for about a year. This last weekend, something caused
> the box running the MDS to sieze hard, and when we came in on monday,
> several OSDs were down or unresponsive. I brought the MDS and the OSDs back
> on online, and managed to get things running again with minimal data loss.
> Had to mark a few objects as lost, but things were apparently running fine
> at the end of the day on Monday.
>
> This afternoon, I noticed that one of the OSDs was apparently stuck in a
> crash/restart loop, and the cluster was unhappy. Performance was in the
> tank and "ceph status" is reporting all manner of problems, as one would
> expect if an OSD is misbehaving. I marked the offending OSD out, and the
> cluster started rebalancing as expected. However, I noticed a short while
> later, another OSD has started into a crash/restart loop. So, I repeat the
> process. And it happens again. At this point I notice, that there are
> actually two at a time which are in this state.
>
> It's as if there's some toxic chunk of data that is getting passed around,
> and when it lands on an OSD it kills it. Contrary to that, however, I tried
> just stopping an OSD when it's in a bad state, and once the cluster starts
> to try rebalancing with that OSD down and not previously marked out,
> another OSD will start crash-looping.
>
> I've investigated the disk of the first OSD I found with this problem, and
> it has no apparent corruption on the file system.
>
> I'll follow up to this shortly with links to pastes of log snippets. Any
> input would be appreciated. This is turning into a real cascade failure,
> and I haven't any idea how to stop it.
>
> QH
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Thanks for the suggestion, but that doesn't seem to have made a difference.

I've shut the entire cluster down and brought it back up, and my config
management system seems to have upgraded ceph to 0.80.8 during the reboot.
Everything seems to have come back up, but I am still seeing the crash
loops, so that seems to indicate that this is definitely something
persistent, probably tied to the OSD data, rather than some weird transient
state.
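
As a side note, if the surprise upgrade itself is unwelcome, the packages
can be held so a config management run can't bump them again. A sketch for
Debian/Ubuntu hosts:

  # hold the ceph packages at their current version until explicitly released
  apt-mark hold ceph ceph-common
  # later: apt-mark unhold ceph ceph-common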


On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil  wrote:

> It looks like you may be able to work around the issue for the moment with
>
>  ceph osd set nodeep-scrub
>
> as it looks like it is scrub that is getting stuck?
>
> sage
>
>
> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>
> > Ceph health detail - http://pastebin.com/5URX9SsQpg dump summary (with
> > active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
> > an osd crash log (in github gist because it was too big for pastebin) -
> > https://gist.github.com/qhartman/cb0e290df373d284cfb5
> >
> > And now I've got four OSDs that are looping.
> >
> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
> >  wrote:
> >   So I'm in the middle of trying to triage a problem with my ceph
> >   cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
> >   The cluster has been running happily for about a year. This last
> >   weekend, something caused the box running the MDS to sieze hard,
> >   and when we came in on monday, several OSDs were down or
> >   unresponsive. I brought the MDS and the OSDs back on online, and
> >   managed to get things running again with minimal data loss. Had
> >   to mark a few objects as lost, but things were apparently
> >   running fine at the end of the day on Monday.
> > This afternoon, I noticed that one of the OSDs was apparently stuck in
> > a crash/restart loop, and the cluster was unhappy. Performance was in
> > the tank and "ceph status" is reporting all manner of problems, as one
> > would expect if an OSD is misbehaving. I marked the offending OSD out,
> > and the cluster started rebalancing as expected. However, I noticed a
> > short while later, another OSD has started into a crash/restart loop.
> > So, I repeat the process. And it happens again. At this point I
> > notice, that there are actually two at a time which are in this state.
> >
> > It's as if there's some toxic chunk of data that is getting passed
> > around, and when it lands on an OSD it kills it. Contrary to that,
> > however, I tried just stopping an OSD when it's in a bad state, and
> > once the cluster starts to try rebalancing with that OSD down and not
> > previously marked out, another OSD will start crash-looping.
> >
> > I've investigated the disk of the first OSD I found with this problem,
> > and it has no apparent corruption on the file system.
> >
> > I'll follow up to this shortly with links to pastes of log snippets.
> > Any input would be appreciated. This is turning into a real cascade
> > failure, and I haven't any idea how to stop it.
> >
> > QH
> >
> >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Finally found an error that seems to provide some direction:

-1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does
not match object info size (4120576) ajusted for ondisk to (4120576)

I'm diving into google now and hoping for something useful. If anyone has a
suggestion, I'm all ears!

QH

On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Thanks for the suggestion, but that doesn't seem to have made a difference.
>
> I've shut the entire cluster down and brought it back up, and my config
> management system seems to have upgraded ceph to 0.80.8 during the reboot.
> Everything seems to have come back up, but I am still seeing the crash
> loops, so that seems to indicate that this is definitely something
> persistent, probably tied to the OSD data, rather than some weird transient
> state.
>
>
> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil  wrote:
>
>> It looks like you may be able to work around the issue for the moment with
>>
>>  ceph osd set nodeep-scrub
>>
>> as it looks like it is scrub that is getting stuck?
>>
>> sage
>>
>>
>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>
>> > Ceph health detail - http://pastebin.com/5URX9SsQpg dump summary (with
>> > active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>> > an osd crash log (in github gist because it was too big for pastebin) -
>> > https://gist.github.com/qhartman/cb0e290df373d284cfb5
>> >
>> > And now I've got four OSDs that are looping.
>> >
>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
>> >  wrote:
>> >   So I'm in the middle of trying to triage a problem with my ceph
>> >   cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
>> >   The cluster has been running happily for about a year. This last
>> >   weekend, something caused the box running the MDS to sieze hard,
>> >   and when we came in on monday, several OSDs were down or
>> >   unresponsive. I brought the MDS and the OSDs back on online, and
>> >   managed to get things running again with minimal data loss. Had
>> >   to mark a few objects as lost, but things were apparently
>> >   running fine at the end of the day on Monday.
>> > This afternoon, I noticed that one of the OSDs was apparently stuck in
>> > a crash/restart loop, and the cluster was unhappy. Performance was in
>> > the tank and "ceph status" is reporting all manner of problems, as one
>> > would expect if an OSD is misbehaving. I marked the offending OSD out,
>> > and the cluster started rebalancing as expected. However, I noticed a
>> > short while later, another OSD has started into a crash/restart loop.
>> > So, I repeat the process. And it happens again. At this point I
>> > notice, that there are actually two at a time which are in this state.
>> >
>> > It's as if there's some toxic chunk of data that is getting passed
>> > around, and when it lands on an OSD it kills it. Contrary to that,
>> > however, I tried just stopping an OSD when it's in a bad state, and
>> > once the cluster starts to try rebalancing with that OSD down and not
>> > previously marked out, another OSD will start crash-looping.
>> >
>> > I've investigated the disk of the first OSD I found with this problem,
>> > and it has no apparent corruption on the file system.
>> >
>> > I'll follow up to this shortly with links to pastes of log snippets.
>> > Any input would be appreciated. This is turning into a real cascade
>> > failure, and I haven't any idea how to stop it.
>> >
>> > QH
>> >
>> >
>> >
>> >
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Alright, I've tried a few suggestions for repairing this state, but I don't
seem to have any PG replicas with good copies of the missing / zero-length
shards. What do I do now? Telling the pgs to repair doesn't seem to help
anything. I can deal with data loss if I can figure out which images might
be damaged; I just need to get the cluster consistent enough that the
things which aren't damaged are usable.
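
One way to answer the "which images" question: the hex id in the object
names (rbd_data.3f7a2ae8944a here) is the image's block_name_prefix, so it
can be matched back to an image. A sketch, assuming the objects belong to
the volumes pool:

  # find the RBD image whose objects are named rbd_data.3f7a2ae8944a.*
  for img in $(rbd ls volumes); do
      rbd info volumes/"$img" | grep -q 'block_name_prefix: rbd_data.3f7a2ae8944a' \
          && echo "$img"
  done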

I'm also seeing similar, but not quite identical, error messages. I assume
they refer to the same root problem:

-1> 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard 22:
soid dd85669d/rbd_data.3f7a2ae8944a.19a5/7//3 size 0 != known
size 4194304



On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Finally found an error that seems to provide some direction:
>
> -1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
> e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does
> not match object info size (4120576) ajusted for ondisk to (4120576)
>
> I'm diving into google now and hoping for something useful. If anyone has
> a suggestion, I'm all ears!
>
> QH
>
> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Thanks for the suggestion, but that doesn't seem to have made a
>> difference.
>>
>> I've shut the entire cluster down and brought it back up, and my config
>> management system seems to have upgraded ceph to 0.80.8 during the reboot.
>> Everything seems to have come back up, but I am still seeing the crash
>> loops, so that seems to indicate that this is definitely something
>> persistent, probably tied to the OSD data, rather than some weird transient
>> state.
>>
>>
>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil  wrote:
>>
>>> It looks like you may be able to work around the issue for the moment
>>> with
>>>
>>>  ceph osd set nodeep-scrub
>>>
>>> as it looks like it is scrub that is getting stuck?
>>>
>>> sage
>>>
>>>
>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>>
>>> > Ceph health detail - http://pastebin.com/5URX9SsQpg dump summary (with
>>> > active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>>> > an osd crash log (in github gist because it was too big for pastebin) -
>>> > https://gist.github.com/qhartman/cb0e290df373d284cfb5
>>> >
>>> > And now I've got four OSDs that are looping.
>>> >
>>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
>>> >  wrote:
>>> >   So I'm in the middle of trying to triage a problem with my ceph
>>> >   cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
>>> >   The cluster has been running happily for about a year. This last
>>> >   weekend, something caused the box running the MDS to sieze hard,
>>> >   and when we came in on monday, several OSDs were down or
>>> >   unresponsive. I brought the MDS and the OSDs back on online, and
>>> >   managed to get things running again with minimal data loss. Had
>>> >   to mark a few objects as lost, but things were apparently
>>> >   running fine at the end of the day on Monday.
>>> > This afternoon, I noticed that one of the OSDs was apparently stuck in
>>> > a crash/restart loop, and the cluster was unhappy. Performance was in
>>> > the tank and "ceph status" is reporting all manner of problems, as one
>>> > would expect if an OSD is misbehaving. I marked the offending OSD out,
>>> > and the cluster started rebalancing as expected. However, I noticed a
>>> > short while later, another OSD has started into a crash/restart loop.
>>> > So, I repeat the process. And it happens again. At this point I
>>> > notice, that there are actually two at a time which are in this state.
>>> >
>>> > It's as if there's some toxic chunk of data that is getting passed
>>> > around, and when it lands on an OSD it kills it. Contrary to that,
>>> > however, I tried just stopping an OSD when it's in a bad state, and
>>> > once the cluster starts to try rebalancing with that OSD down and not
>>> > previously marked out, another OSD will start crash-looping.
>>> >
>>> > I've investigated the disk of the first OSD I found with this problem,
>>> > and it has no apparent corruption on the file system.
>>> >
>>> > I'll follow up to this shortly with links to pastes of log snippets.
>>> > Any input would be appreciated. This is turning into a real cascade
>>> > failure, and I haven't any idea how to stop it.
>>> >
>>> > QH
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Thanks for the response. Is this the post you are referring to?
http://ceph.com/community/incomplete-pgs-oh-my/

For what it's worth, this cluster was running happily for the better part
of a year until the event from this weekend that I described in my first
post, so I doubt it's a configuration issue. I suppose it could be some
edge-casey thing that only came up just now, but that seems unlikely. Our
usage of this cluster has been much heavier in the past than it has been
recently.

And yes, I have what looks to be about 8 pg shards on several OSDs that
seem to be in this state, but it's hard to say for sure. It seems like each
time I look at this more problems are popping up.

On Fri, Mar 6, 2015 at 8:19 PM, Gregory Farnum  wrote:

> This might be related to the backtrace assert, but that's the problem
> you need to focus on. In particular, both of these errors are caused
> by the scrub code, which Sage suggested temporarily disabling — if
> you're still getting these messages, you clearly haven't done so
> successfully.
>
> That said, it looks like the problem is that the object and/or object
> info specified here are just totally busted. You probably want to
> figure out what happened there since these errors are normally a
> misconfiguration somewhere (e.g., setting nobarrier on fs mount and
> then losing power). I'm not sure if there's a good way to repair the
> object, but if you can lose the data I'd grab the ceph-objectstore
> tool and remove the object from each OSD holding it that way. (There's
> a walkthrough of using it for a similar situation in a recent Ceph
> blog post.)
>
> On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
>  wrote:
> > Alright, tried a few suggestions for repairing this state, but I don't
> seem
> > to have any PG replicas that have good copies of the missing / zero
> length
> > shards. What do I do now? telling the pg's to repair doesn't seem to help
> > anything? I can deal with data loss if I can figure out which images
> might
> > be damaged, I just need to get the cluster consistent enough that the
> things
> > which aren't damaged can be usable.
> >
> > Also, I'm seeing these similar, but not quite identical, error messages
> as
> > well. I assume they are referring to the same root problem:
> >
> > -1> 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard
> 22:
> > soid dd85669d/rbd_data.3f7a2ae8944a.19a5/7//3 size 0 != known
> > size 4194304
>
> Mmm, unfortunately that's a different object than the one referenced
> in the earlier crash. Maybe it's repairable, or it might be the same
> issue — looks like maybe you've got some widespread data loss.
> -Greg
>
> >
> >
> >
> > On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
> >  wrote:
> >>
> >> Finally found an error that seems to provide some direction:
> >>
> >> -1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
> >> e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0)
> does
> >> not match object info size (4120576) ajusted for ondisk to (4120576)
> >>
> >> I'm diving into google now and hoping for something useful. If anyone
> has
> >> a suggestion, I'm all ears!
> >>
> >> QH
> >>
> >> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
> >>  wrote:
> >>>
> >>> Thanks for the suggestion, but that doesn't seem to have made a
> >>> difference.
> >>>
> >>> I've shut the entire cluster down and brought it back up, and my config
> >>> management system seems to have upgraded ceph to 0.80.8 during the
> reboot.
> >>> Everything seems to have come back up, but I am still seeing the crash
> >>> loops, so that seems to indicate that this is definitely something
> >>> persistent, probably tied to the OSD data, rather than some weird
> transient
> >>> state.
> >>>
> >>>
> >>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil  wrote:
> >>>>
> >>>> It looks like you may be able to work around the issue for the moment
> >>>> with
> >>>>
> >>>>  ceph osd set nodeep-scrub
> >>>>
> >>>> as it looks like it is scrub that is getting stuck?
> >>>>
> >>>> sage
> >>>>
> >>>>
> >>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
> >>>>
> >>>> > Ceph health detail - 

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
 images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.76 is incomplete, acting [19] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.57 is incomplete, acting [14] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.4c is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 5.18 is incomplete, acting [19] (reducing pool backups min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.13 is incomplete, acting [14] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 0.6 is incomplete, acting [14] (reducing pool data min_size from 2 may
help; search ceph.com/docs for 'incomplete')
pg 3.7dc is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.6b4 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.692 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.5fc is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.5ce is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.4fa is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.4ca is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.4c1 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.4a7 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.460 is incomplete, acting [19] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.453 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.394 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.372 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.343 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.337 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.321 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.2c0 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.27c is incomplete, acting [19] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.27e is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.244 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.207 is incomplete, acting [19] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')


Why would this keep changing? It seems like it would have to be because of
the OSDs running through their crash loops, only accurately reporting from
time to time, making it difficult to get an accurate view of the extent of
the damage.


On Fri, Mar 6, 2015 at 8:30 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Thanks for the response. Is this the post you are referring to?
> http://ceph.com/community/incomplete-pgs-oh-my/
>
> For what it's worth, this cluster was running happily for the better part
> of a year until the event from this weekend that I described in my first
> post, so I doubt it's configuration issue. I suppose it could be some
> edge-casey thing, that only came up just now, but that seems unlikely. Our
> usage of this cluster has been much heavier in the past than it has been
> recently.
>
> And yes, I have what looks to be about 8 pg shards on several OSDs that
> seem to be in this state, but it's hard to say for sure. It seems like each
> time I look at this more problems are popping up.
>
> On Fri, Mar 6, 2015 at 8:19 PM, Gregory Farnum  wrote:
>
>> This might be related to the backtrace assert, but that's the problem
>> you need to focus on. In particular, both of these

Re: [ceph-users] Cascading Failure of OSDs

2015-03-07 Thread Quentin Hartman
Now that I have a better understanding of what's happening, I threw
together a little one-liner to create a report of the errors that the OSDs
are seeing. Lots of missing  / corrupted pg shards:
https://gist.github.com/qhartman/174cc567525060cb462e

I've experimented with exporting / importing the broken pgs with
ceph_objectstore_tool, and while they seem to export correctly, the tool
crashes when trying to import:

root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
--data-path /var/lib/ceph/osd/ceph-19/ --journal-path
/var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
Importing pgid 3.75b
Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3
Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3
Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3
Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3
Write c086075b/rbd_data.42a742ae8944a.02fb/head//3
Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3
Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3
Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3
Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3
Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3
Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3
Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3
Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3
Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3
Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3
Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3
Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3
Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3
Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3
osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&,
const std::set&, MapCacher::Transaction,
ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
osd/SnapMapper.cc: 228: FAILED assert(r == -2)
 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0xb94fbb]
 2: (SnapMapper::add_oid(hobject_t const&, std::set, std::allocator > const&,
MapCacher::Transaction*)+0x63e) [0x7b719e]
 3: (get_attrs(ObjectStore*, coll_t, ghobject_t, ObjectStore::Transaction*,
ceph::buffer::list&, OSDriver&, SnapMapper&)+0x67c) [0x661a1c]
 4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5) [0x661f85]
 5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
 6: (main()+0x2208) [0x63f178]
 7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
 8: ceph_objectstore_tool() [0x659577]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7fba67ff3900
 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
 1: ceph_objectstore_tool() [0xab1cea]
 2: (()+0x10340) [0x7fba66a95340]
 3: (gsignal()+0x39) [0x7fba627c7cc9]
 4: (abort()+0x148) [0x7fba627cb0d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fba630d26b5]
 6: (()+0x5e836) [0x7fba630d0836]
 7: (()+0x5e863) [0x7fba630d0863]
 8: (()+0x5eaa2) [0x7fba630d0aa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x278) [0xb951a8]
 10: (SnapMapper::add_oid(hobject_t const&, std::set, std::allocator > const&,
MapCacher::Transaction*)+0x63e) [0x7b719e]
 11: (get_attrs(ObjectStore*, coll_t, ghobject_t,
ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
SnapMapper&)+0x67c) [0x661a1c]
 12: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
[0x661f85]
 13: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
 14: (main()+0x2208) [0x63f178]
 15: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
 16: ceph_objectstore_tool() [0x659577]
Aborted (core dumped)


That is, I suppose, to be expected if it's importing from bad pg data. At
this point I'm really most interested in what I can do to get this cluster
consistent as quickly as possible, so I can start coping with the data loss
in the VMs and start restoring from backups where needed. Any guidance in
that direction would be appreciated. Something along the lines of "give up
on that busted pg" is what I'm thinking of, but I haven't noticed anything
that seems to approximate that yet.
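
The closest approximation to "give up on that busted pg" is usually to
remove the bad shard with the objectstore tool while the OSD is stopped,
and then ask the cluster to recreate the PG empty. A sketch, reusing the
paths from the import attempt above; it assumes the data in that PG has
been written off and that the OSDs run under Ubuntu upstart:

  stop ceph-osd id=19      # stop the OSD holding the broken shard

  ceph_objectstore_tool --op remove --pgid 3.75b \
      --data-path /var/lib/ceph/osd/ceph-19/ \
      --journal-path /var/lib/ceph/osd/ceph-19/journal

  start ceph-osd id=19

  # then tell the cluster to recreate the placement group as empty
  ceph pg force_create_pg 3.75b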

Thanks

QH




On Fri, Mar 6, 2015 at 8:47 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Here's more information I have been able to glean:
>
> pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last
> acting [24]
> pg 3.690 is stuck inactive for 11991.281739, current state incomplete,
> last acting [24]
> pg 4.ca is stuck inactive for 15905.499058, current state incomple

Re: [ceph-users] Cascading Failure of OSDs

2015-03-07 Thread Quentin Hartman
So I'm not sure what has changed, but in the last 30 minutes the errors,
which were all over the place, have finally settled down to this:
http://pastebin.com/VuCKwLDp

The only thing I can think of is that I also set the noscrub flag in
addition to nodeep-scrub when I first got here, and that finally "took".
Anyway, they've been stable there for some time now, and I've been able to
get a couple of VMs to come up and behave reasonably well. At this point
I'm prepared to wipe the entire cluster and start over if I have to in
order to get it truly consistent again, since my efforts to zap pg 3.75b
haven't borne fruit. However, if anyone has a less nuclear option they'd
like to suggest, I'm all ears.

I've tried to export/re-import the pg and do a force_create. The import
failed, and the force_create just reverted to incomplete after "creating"
for a few minutes.

QH

On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Now that I have a better understanding of what's happening, I threw
> together a little one-liner to create a report of the errors that the OSDs
> are seeing. Lots of missing  / corrupted pg shards:
> https://gist.github.com/qhartman/174cc567525060cb462e
>
> I've experimented with exporting / importing the broken pgs with
> ceph_objectstore_tool, and while they seem to export correctly, the tool
> crashes when trying to import:
>
> root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
> --data-path /var/lib/ceph/osd/ceph-19/ --journal-path
> /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
> Importing pgid 3.75b
> Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3
> Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3
> Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3
> Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3
> Write c086075b/rbd_data.42a742ae8944a.02fb/head//3
> Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3
> Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3
> Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3
> Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3
> Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3
> Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3
> Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3
> Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3
> Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3
> Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3
> Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3
> Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3
> Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3
> Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3
> osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&,
> const std::set&, MapCacher::Transaction,
> ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
> osd/SnapMapper.cc: 228: FAILED assert(r == -2)
>  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xb94fbb]
>  2: (SnapMapper::add_oid(hobject_t const&, std::set std::less, std::allocator > const&,
> MapCacher::Transaction*)+0x63e) [0x7b719e]
>  3: (get_attrs(ObjectStore*, coll_t, ghobject_t,
> ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
> SnapMapper&)+0x67c) [0x661a1c]
>  4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
> [0x661f85]
>  5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
>  6: (main()+0x2208) [0x63f178]
>  7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
>  8: ceph_objectstore_tool() [0x659577]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> *** Caught signal (Aborted) **
>  in thread 7fba67ff3900
>  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
>  1: ceph_objectstore_tool() [0xab1cea]
>  2: (()+0x10340) [0x7fba66a95340]
>  3: (gsignal()+0x39) [0x7fba627c7cc9]
>  4: (abort()+0x148) [0x7fba627cb0d8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fba630d26b5]
>  6: (()+0x5e836) [0x7fba630d0836]
>  7: (()+0x5e863) [0x7fba630d0863]
>  8: (()+0x5eaa2) [0x7fba630d0aa2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x278) [0xb951a8]
>  10: (SnapMapper::add_oid(hobject_t const&, std::set std::less, std::allocator > const&,
> MapCacher::Transaction*)+0x63e) [0x7b719e]

Re: [ceph-users] centos vs ubuntu for production ceph cluster ?

2015-03-20 Thread Quentin Hartman
For all intents and purposes, centos and rhel are equivalent, so I'd not be
too concerned about that distinction. I can't comment as to which distro is
better tested by ceph devs, but assuming that the packages are built
appropriately with similar dependency versions and whatnot, that also
shouldn't matter much. Distro-specific bugs are certainly a thing, but they
are generally rare aside from packaging quirks.

In my experience the biggest differentiator by distro for the quality of a
deployed service is the skill of the people administering it. In other
words, deploy on the one your ops team knows better. Everything else will
come out in the wash.

QH

On Fri, Mar 20, 2015 at 8:16 AM, Alexandre DERUMIER 
wrote:

> Hi,
> I'll build my full ssd production soon,
>
> I wonder which distrib is best tested with inktank and ceph team ?
>
> ceph.com doc is quite old, and don't have reference for giant or hammer
>
> http://ceph.com/docs/master/start/os-recommendations/
>
> Seem than in past only ubuntu and rhel was well tested,
> not sure about centos.
>
>
> Regards,
>
> Alexandre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes

2015-03-26 Thread Quentin Hartman
I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD
(os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for
ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM
unused on each node for OSD / OS overhead. All the VMs are backed by ceph
volumes and things generally work very well. I would prefer a dedicated
storage layer simply because it seems more "right", but I can't say that
any of the common concerns of using this kind of setup have come up for me.
Aside from shaving off that 3GB of RAM, my deployment isn't any more
complex than a split stack deployment would be. After running like this for
the better part of a year, I would have a hard time honestly making a real
business case for the extra hardware a split stack cluster would require.
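For what it's worth, the only OpenStack-side change that reservation needs is
a single nova.conf line on each compute node (value in MB; 3072 is just what
happens to fit my 3 OSDs plus OS overhead, so treat it as an example rather
than a recommendation):

# /etc/nova/nova.conf on each converged compute / OSD node
reserved_host_memory_mb = 3072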

QH

On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson  wrote:

> It's kind of a philosophical question.  Technically there's nothing that
> prevents you from putting ceph and the hypervisor on the same boxes. It's a
> question of whether or not potential cost savings are worth increased risk
> of failure and contention.  You can minimize those things through various
> means (cgroups, restricting NUMA nodes, etc).  What is more difficult is
> isolating disk IO contention (say if you want local SSDs for VMs), memory
> bus and QPI contention, network contention, etc. If the VMs are working
> really hard you can restrict them to their own socket, and you can even
> restrict memory usage to the local socket, but what about remote socket
> network or disk IO? (you will almost certainly want these things on the
> ceph socket)  I wonder as well about increased risk of hardware failure
> with the increased load, but I don't have any statistics.
>
> I'm guessing if you spent enough time at it you could make it work
> relatively well, but at least personally I question how beneficial it
> really is after all of that.  If you are going for cost savings, I suspect
> efficient compute and storage node designs will be nearly as good with much
> less complexity.
>
> Mark
>
>
> On 03/26/2015 07:11 AM, Wido den Hollander wrote:
>
>> On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote:
>>
>>> Hi Wido,
>>> Am 26.03.2015 um 11:59 schrieb Wido den Hollander:
>>>
 On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote:

> Hi,
>
> in the past I read pretty often that it's not a good idea to run ceph
> and qemu / the hypervisors on the same nodes.
>
> But why is this a bad idea? You save space and can better use the
> resources you have in the nodes anyway.
>
>
 Memory pressure during recovery *might* become a problem. If you make
 sure that you don't allocate more than, let's say, 50% for the guests it
 could work.

>>>
>>> mhm sure? I've never seen problems like that. Currently i ran each ceph
>>> node with 64GB of memory and each hypervisor node with around 512GB to
>>> 1TB RAM while having 48 cores.
>>>
>>>
>> Yes, it can happen. You have machines with enough memory, but if you
>> overprovision the machines it can happen.
>>
>>  Using cgroups you could also prevent that the OSDs eat up all memory or
 CPU.

>>> Never seen an OSD doing so crazy things.
>>>
>>>
>> Again, it really depends on the available memory and CPU. If you buy big
>> machines for this purpose it probably won't be a problem.
>>
>>  Stefan
>>>
>>>  So technically it could work, but memory and CPU pressure is something
 which might give you problems.

  Stefan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


>>
>>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes

2015-03-26 Thread Quentin Hartman
That one big server sounds great, but it also sounds like a single point of
failure. It's also not cheap. I've been able to build this cluster for
about $1400 per node, including the 10Gb networking gear, which is less
than what I see the _empty case_ you describe going for new. Even used, the
lowest I've seen (lacking trays at that price) is what I paid for one of my
nodes including CPU and RAM, and drive trays. So, it's been a pretty
inexpensive venture considering what we get out of it. I have no per-node
fault tolerance, but if one of my nodes dies, I just restart the VMs that
were on it somewhere else and wait for ceph to heal. I also benefit from
higher aggregate network bandwidth because I have more ports on the wire.
And better per-U cpu and RAM density (for the money). *shrug* different
strokes.

As for difficulty of management, any screwing around I've done has had
nothing to do with the converged nature of the setup, aside from
discovering and changing the one setting I mentioned. So, for me at least,
it's been a pretty well unqualified net win. I can imagine all sorts of
scenarios where that wouldn't be, but I think it's probably debatable
whether or not those constitute a common case. The higher node count does
add some complexity, but that's easily overcome with some simple
automation. Again though, that has no bearing on the converged setup, it's
just a factor of how much CPU and RAM we need for our use case.

I guess what I'm trying to say is that I don't think the answer is as cut
and dry as you seem to think.

QH

On Thu, Mar 26, 2015 at 9:36 AM, Mark Nelson  wrote:

> I suspect a config like this where you only have 3 OSDs per node would be
> more manageable than something denser.
>
> IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super
> micro chassis for a semi-dense converged solution.  You could attempt to
> restrict the OSDs to one socket and then use a second E5-2697v3 for VMs.
> Maybe after you've got cgroups setup properly and if you've otherwise
> balanced things it would work out ok.  I question though how much you
> really benefit by doing this rather than running a 36 drive storage server
> with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many
> of because you can dedicate both sockets to VMs).
>
> It probably depends quite a bit on how memory, network, and disk intensive
> the VMs are, but my take is that it's better to error on the side of
> simplicity rather than making things overly complicated.  Every second you
> are screwing around trying to make the setup work right eats into any
> savings you might gain by going with the converged setup.
>
> Mark
>
> On 03/26/2015 10:12 AM, Quentin Hartman wrote:
>
>> I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1
>> SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs
>> for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of
>> RAM unused on each node for OSD / OS overhead. All the VMs are backed by
>> ceph volumes and things generally work very well. I would prefer a
>> dedicated storage layer simply because it seems more "right", but I
>> can't say that any of the common concerns of using this kind of setup
>> have come up for me. Aside from shaving off that 3GB of RAM, my
>> deployment isn't any more complex than a split stack deployment would
>> be. After running like this for the better part of a year, I would have
>> a hard time honestly making a real business case for the extra hardware
>> a split stack cluster would require.
>>
>> QH
>>
>> On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson <mnel...@redhat.com> wrote:
>>
>> It's kind of a philosophical question.  Technically there's nothing
>> that prevents you from putting ceph and the hypervisor on the same
>> boxes. It's a question of whether or not potential cost savings are
>> worth increased risk of failure and contention.  You can minimize
>> those things through various means (cgroups, restricting NUMA nodes,
>> etc).  What is more difficult is isolating disk IO contention (say
>> if you want local SSDs for VMs), memory bus and QPI contention,
>> network contention, etc. If the VMs are working really hard you can
>> restrict them to their own socket, and you can even restrict memory
>> usage to the local socket, but what about remote socket network or
>> disk IO? (you will almost certainly want these things on the ceph
>> socket)  I wonder as well about increased risk of hardware failure
>> with the increased load, but I don't have any statistics.

Re: [ceph-users] Calamari Deployment

2015-03-26 Thread Quentin Hartman
I used this as a guide for building calamari packages w/o using vagrant.
Worked great:
http://bryanapperson.com/blog/compiling-calamari-ceph-ubuntu-14-04/

On Thu, Mar 26, 2015 at 10:30 AM, Steffen W Sørensen  wrote:

>
> On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT <
> james.laba...@cigna.com> wrote:
> For that matter, is there a way to build Calamari without going the whole
> vagrant path at all?  Some way of just building it through command-line
> tools?  I would be building it on an Openstack instance, no GUI.  Seems
> silly to have to install an entire virtualbox environment inside something
> that’s already a VM.
>
> Agreed... if you wanted to build it on your server farm/cloud stack env.
> I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a
> bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant ( vagrant
> is an easy disposable built-env:)
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *JESUS CHAVEZ ARGUELLES
> *Sent:* Monday, March 02, 2015 3:00 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] Calamari Deployment
>
>
> Does anybody know how to successfully install Calamari in rhel7? I have
> tried the vagrant thing without success and it seems like a nightmare; there
> is a kind of issue when you do vagrant up where it seems not to find the VM
> path...
>
> Regards
>
> *Jesus Chavez*
> SYSTEMS ENGINEER-C.SALES
>
> jesch...@cisco.com
> Phone: +52 55 5267 3146
> Mobile: +51 1 5538883255
>
> CCIE - 44433
>
>
> --
> CONFIDENTIALITY NOTICE: If you have received this email in error,
> please immediately notify the sender by e-mail at the address shown.
> This email transmission may contain confidential information.  This
> information is intended only for the use of the individual(s) or entity to
>
> whom it is intended even if addressed incorrectly.  Please delete it from
> your files if you are not the intended recipient.  Thank you for your
> compliance.  Copyright (c) 2015 Cigna
>
> ==
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-26 Thread Quentin Hartman
Since I have been in ceph-land today, it reminded me that I needed to close
the loop on this. I was finally able to isolate this problem down to a
faulty NIC on the ceph cluster network. It "worked", but it was
accumulating a huge number of Rx errors. My best guess is some receive
buffer cache failed? Anyway, having a NIC go weird like that is totally
consistent with all the weird problems I was seeing, the corrupted PGs, and
the inability for the cluster to settle down.

As a result we've added NIC error rates to our monitoring suite on the
cluster so we'll hopefully see this coming if it ever happens again.

QH

On Sat, Mar 7, 2015 at 11:36 AM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> So I'm not sure what has changed, but in the last 30 minutes the errors
> which were all over the place, have finally settled down to this:
> http://pastebin.com/VuCKwLDp
>
> The only thing I can think of is that I also set the noscrub flag in
> addition to the nodeep-scrub flag when I first got here, and that finally
> "took". Anyway, they've been stable there for some time now, and I've been
> able to get a couple VMs to come up and behave reasonably well. At this
> point I'm prepared to wipe the entire cluster and start over if I have to
> to get it truly consistent again, since my efforts to zap pg 3.75b haven't
> borne fruit. However, if anyone has a less nuclear option they'd like to
> suggest, I'm all ears.
>
> I've tried to export/re-import the pg and do a force_create. The import
> failed, and the force_create just reverted back to being incomplete after
> "creating" for a few minutes.
>
> QH
>
> On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Now that I have a better understanding of what's happening, I threw
>> together a little one-liner to create a report of the errors that the OSDs
>> are seeing. Lots of missing  / corrupted pg shards:
>> https://gist.github.com/qhartman/174cc567525060cb462e
>>
>> I've experimented with exporting / importing the broken pgs with
>> ceph_objectstore_tool, and while they seem to export correctly, the tool
>> crashes when trying to import:
>>
>> root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
>> --data-path /var/lib/ceph/osd/ceph-19/ --journal-path
>> /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
>> Importing pgid 3.75b
>> Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3
>> Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3
>> Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3
>> Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3
>> Write c086075b/rbd_data.42a742ae8944a.02fb/head//3
>> Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3
>> Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3
>> Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3
>> Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3
>> Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3
>> Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3
>> Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3
>> Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3
>> Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3
>> Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3
>> Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3
>> Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3
>> Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3
>> Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3
>> osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const
>> hobject_t&, const std::set&,
>> MapCacher::Transaction, ceph::buffer::list>*)'
>> thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
>> osd/SnapMapper.cc: 228: FAILED assert(r == -2)
>>  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x8b) [0xb94fbb]
>>  2: (SnapMapper::add_oid(hobject_t const&, std::set> std::less, std::allocator > const&,
>> MapCacher::Transaction*)+0x63e) [0x7b719e]
>>  3: (get_attrs(ObjectStore*, coll_t, ghobject_t,
>> ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
>> SnapMapper&)+0x67c) [0x661a1c]
>>  4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
>> [0x661f85]
>>  5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]

[ceph-users] Weird cluster restart behavior

2015-03-31 Thread Quentin Hartman
I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1.
Last friday I got everything deployed and all was working well, and I set
noout and shut all the OSD nodes down over the weekend. Yesterday when I
spun it back up, the OSDs were behaving very strangely, incorrectly marking
each other down because of missed heartbeats, even though they were up. It
looked like some kind of low-level networking problem, but I couldn't find
any.

After much work, I narrowed the apparent source of the problem down to the
OSDs running on the first host I started in the morning. They were the ones
that logged the most messages about not being able to ping other OSDs,
and the other OSDs were mostly complaining about them. After running out of
other ideas to try, I restarted them, and then everything started working.
It's still working happily this morning. It seems as though when that set
of OSDs started they got stale OSD map information from the MON boxes,
which failed to be updated as the other OSDs came up. Does that make sense?
I still don't consider myself an expert on ceph architecture and would
appreciate any corrections or other possible interpretations of events (I'm
happy to provide whatever additional information I can) so I can get a
deeper understanding of things. If my interpretation of events is correct,
it seems that might point at a bug.
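If anyone wants to check for the same thing, the way I plan to test the
stale-map theory next time (assuming the OSD admin socket's "status" output
still reports the map range the way I think it does) is to compare what each
OSD believes against the monitors:

ceph osd stat                                                # osdmap epoch according to the mons
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok status    # oldest_map / newest_map this OSD has seen

A daemon whose newest_map is well behind the mon's epoch would fit the
behavior I saw.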

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Quentin Hartman
Thanks for the extra info Gregory. I did not also set nodown.

I expect that I will be very rarely shutting everything down in the normal
course of things, but it has come up a couple times when having to do some
physical re-organizing of racks. Little irritants like this aren't a big
deal if people know to expect them, but as it is I lost quite a lot of time
troubleshooting a non-existant problem. What's the best way to get notes to
that effect added to the docs? It seems something in
http://ceph.com/docs/master/rados/operations/operating/ would save some
people some headache. I'm happy to propose edits, but a quick look doesn't
reveal a process for submitting that sort of thing.

My understanding is that the "right" method to take an entire cluster
offline is to set noout and then shut everything down. Is there a
better way?
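Concretely, the sequence I've been using is the obvious one (with Greg's
caveat that this isn't a heavily validated path):

ceph osd set noout
# stop client traffic / VMs, then shut the OSD and mon nodes down
# ... maintenance ...
# after everything is powered back up:
ceph osd unset noout
ceph -s    # and wait for HEALTH_OK before letting clients back in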

QH

On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum  wrote:

> On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
>  wrote:
> > I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1.
> Last
> > friday I got everything deployed and all was working well, and I set
> noout
> > and shut all the OSD nodes down over the weekend. Yesterday when I spun
> it
> > back up, the OSDs were behaving very strangely, incorrectly marking each
> > other down because of missed heartbeats, even though they were up. It looked
> like
> > some kind of low-level networking problem, but I couldn't find any.
> >
> > After much work, I narrowed the apparent source of the problem down to
> the
> > OSDs running on the first host I started in the morning. They were the
> ones
> > that logged the most messages about not being able to ping other
> OSDs,
> > and the other OSDs were mostly complaining about them. After running out
> of
> > other ideas to try, I restarted them, and then everything started
> working.
> > It's still working happily this morning. It seems as though when that
> set of
> > OSDs started they got stale OSD map information from the MON boxes, which
> > failed to be updated as the other OSDs came up. Does that make sense? I
> > still don't consider myself an expert on ceph architecture and would
> > appreciate any corrections or other possible interpretations of events
> (I'm
> > happy to provide whatever additional information I can) so I can get a
> > deeper understanding of things. If my interpretation of events is
> correct,
> > it seems that might point at a bug.
>
> I can't find the ticket now, but I think we did indeed have a bug
> around heartbeat failures when restarting nodes. This has been fixed
> in other branches but might have been missed for giant. (Did you by
> any chance set the nodown flag as well as noout?)
>
> In general Ceph isn't very happy with being shut down completely like
> that and its behaviors aren't validated, so nothing will go seriously
> wrong but you might find little irritants like this. It's particularly
> likely when you're prohibiting state changes with the noout/nodown
> flags.
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Quentin Hartman
On Tue, Mar 31, 2015 at 2:05 PM, Gregory Farnum  wrote:

> Github pull requests. :)
>

Ah, well that's easy:

https://github.com/ceph/ceph/pull/4237


QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-04-01 Thread Quentin Hartman
Right now we're just scraping the output of ifconfig:

ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}'

It's clunky, but it works. I'm sure there's a cleaner way, but this was
expedient.
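If you want something less fragile than parsing ifconfig, the per-interface
kernel counters and ethtool stats should give the same numbers (counter names
vary a bit by driver, so these are examples for my p2p1 interface rather than
anything universal):

cat /sys/class/net/p2p1/statistics/rx_errors
ethtool -S p2p1 | grep -i -e err -e drop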

QH


On Tue, Mar 31, 2015 at 5:05 PM, Francois Lafont  wrote:

> Hi,
>
> Quentin Hartman wrote:
>
> > Since I have been in ceph-land today, it reminded me that I needed to
> close
> > the loop on this. I was finally able to isolate this problem down to a
> > faulty NIC on the ceph cluster network. It "worked", but it was
> > accumulating a huge number of Rx errors. My best guess is some receive
> > buffer cache failed? Anyway, having a NIC go weird like that is totally
> > consistent with all the weird problems I was seeing, the corrupted PGs,
> and
> > the inability for the cluster to settle down.
> >
> > As a result we've added NIC error rates to our monitoring suite on the
> > cluster so we'll hopefully see this coming if it ever happens again.
>
> Good for you. ;)
>
> Could you post here the command that you use to get NIC error rates?
>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Calamari Questions

2015-04-01 Thread Quentin Hartman
You should have a config page in calamari UI where you can accept osd nodes
"into the cluster" as Calamari sees it. If you skipped the little
first-setup window like I did, it's kind of a pain to find.

QH

On Wed, Apr 1, 2015 at 12:34 PM, Bruce McFarland <
bruce.mcfarl...@taec.toshiba.com> wrote:

>  I’ve built the Calamari client, server, and diamond packages from source
> for trusty and centos and installed it on the trusty Master. Installed
> diamond and salt packages on the storage nodes. I can connect to the
> calamari master, accept salt keys from the ceph nodes, but then Calamari
> reports “3 Ceph servers are connected to Calamari, but no Ceph cluster has
> been created yet. Please use ceph-deploy to create a cluster” The 3 Ceph
> nodes are part of an existing Ceph cluster with 90 OSDS. I also built and
> installed the minion package on the Calamari Master under
> /opt/calamari/webapp/content/calamari-minions
>
>
>
> Any ideas what I’ve overlooked in my Calamari bring up?
>
> Thanks,
>
> Bruce
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Openstack

2015-04-01 Thread Quentin Hartman
I am coincidentally going through the same process right now. The best
reference I've found is this: http://ceph.com/docs/master/rbd/rbd-openstack/

When I did Firefly / icehouse, this (seemingly) same guide Just Worked(tm),
but now with Giant / Juno I'm running into similar trouble to what you
describe. Everything _seems_ right, but creating volumes via openstack
just sits and spins forever, never creating anything and (as far as I've
found so far) not logging anything interesting. Normal Rados operations
work fine.
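For comparison, the sanity check I've been using to convince myself the ceph
side is fine is to run rbd straight from the cinder / glance hosts with the
same cephx users the services are configured for (user and pool names here
are the usual ones from that guide, and this assumes the keyrings are in
/etc/ceph where the services expect them):

rbd --id cinder -p volumes ls
rbd --id glance -p images ls

If those work but the services still hang, the problem is presumably on the
OpenStack side of the fence.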

Feel free to hit me up off list if you want to confer and then we can
return here if we come up with anything to be shared with the group.

QH

On Wed, Apr 1, 2015 at 3:43 PM, Iain Geddes  wrote:

> All,
>
> Apologies for my ignorance but I don't seem to be able to search an
> archive.
>
> I've spent a lot of time trying but am having difficulty in integrating
> Ceph (Giant) into Openstack (Juno). I don't appear to be recording any
> errors anywhere, but simply don't seem to be writing to the cluster if I
> try creating a new volume or importing an image. The cluster is good and I
> can create a static rbd mapping so I know the key components are in place.
> My problem is almost certainly finger trouble on my part, but I am completely
> lost and wondered if there was a well thumbed guide to integration?
>
> Thanks
>
>
> Iain
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and Openstack

2015-04-02 Thread Quentin Hartman
s saving. Given that the top one was
>> through the GUI yesterday I'm guessing it's not going to finish any time
>> soon!
>>
>> == Glance images ==
>>
>> | ID                                   | Name                | Disk Format | Container Format | Size     | Status |
>> | f77429b2-17fd-4ef6-97a8-f710862182c6 | Cirros Raw          | raw         | bare             | 41126400 | saving |
>> | 1b12e65a-01cd-4d05-91e8-9e9d86979229 | cirros-0.3.3-x86_64 | raw         | bare             | 41126400 | saving |
>> | fd23c0f3-54b9-4698-b90b-8cdbd6e152c6 | cirros-0.3.3-x86_64 | raw         | bare             | 41126400 | saving |
>> | db297a42-5242-4122-968e-33bf4ad3fe1f | cirros-0.3.3-x86_64 | raw         | bare             | 41126400 | saving |
>>
>> Was there a particular document that you referenced to perform your
>> install Karan? This should be the easy part ... but I've been saying that
>> about nearly everything for the past month or two!!
>>
>> Kind regards
>>
>>
>> Iain
>>
>>
>>
>> On Thu, Apr 2, 2015 at 3:28 AM, Karan Singh  wrote:
>>
>>> Fortunately Ceph Giant + OpenStack Juno works flawlessly for me.
>>>
>>> If you have configured cinder / glance correctly , then after restarting
>>>  cinder and glance services , you should see something like this in cinder
>>> and glance logs.
>>>
>>>
>>> Cinder logs :
>>>
>>> volume.log:2015-04-02 13:20:43.943 2085 INFO cinder.volume.manager
>>> [req-526cb14e-42ef-4c49-b033-e9bf2096be8f - - - - -] Starting volume driver
>>> RBDDriver (1.1.0)
>>>
>>>
>>> Glance Logs:
>>>
>>> api.log:2015-04-02 13:20:50.448 1266 DEBUG glance.common.config [-]
>>> glance_store.default_store = rbd log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG glance.common.config [-]
>>> glance_store.rbd_store_ceph_conf = /etc/ceph/ceph.conf log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG glance.common.config [-]
>>> glance_store.rbd_store_chunk_size = 8 log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG glance.common.config [-]
>>> glance_store.rbd_store_pool= images log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG glance.common.config [-]
>>> glance_store.rbd_store_user= glance log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> api.log:2015-04-02 13:20:50.451 1266 DEBUG glance.common.config [-]
>>> glance_store.stores= ['rbd'] log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>>
>>>
>>> If Cinder and Glance are able to initialize RBD driver , then everything
>>> should work like charm.
>>>
>>>
>>> 
>>> Karan Singh
>>> Systems Specialist , Storage Platforms
>>> CSC - IT Center for Science,
>>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>>> mobile: +358 503 812758
>>> tel. +358 9 4572001
>>> fax +358 9 4572302
>>> http://www.csc.fi/
>>> 
>>>
>>> On 02 Apr 2015, at 03:10, Erik McCormick 
>>> wrote:
>>>
>>> Can you both set Cinder and / or Glance logging to debug and provide
>>> some logs? There was an issue with the first Juno release of Glance in some
>>> vendor packages, so make sure you're fully updated to 2014.2.2
>>> On Apr 1, 2015 7:12 PM, "Quentin Hartman" 
>>> wrote:
>>>
>>>> I am coincidentally going through the same process right now. The best
>>>> reference I've found is this:
>>>> http://ceph.com/docs/master/rbd/rbd-openstack/
>>>>
>>>> When I did Firefly / icehouse, this (seemingly) same guide Just
>>

Re: [ceph-users] Ceph and Openstack

2015-04-02 Thread Quentin Hartman
py:2004
>>> 2015-04-02 10:58:37.142 18302 DEBUG glance.common.config [-]
>>> glance_store.rbd_store_chunk_size = 8 log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> 2015-04-02 10:58:37.142 18302 DEBUG glance.common.config [-]
>>> glance_store.rbd_store_pool= images log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> 2015-04-02 10:58:37.142 18302 DEBUG glance.common.config [-]
>>> glance_store.rbd_store_user= glance log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>> 2015-04-02 10:58:37.143 18302 DEBUG glance.common.config [-]
>>> glance_store.stores= ['rbd'] log_opt_values
>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>>
>>>
>>> Debug of the api really doesn't reveal anything either as far as I can
>>> see. Attempting an image-create from the CLI:
>>>
>>> glance image-create --name "cirros-0.3.3-x86_64" --file
>>> cirros-0.3.3-x86_64-disk.raw --disk-format raw --container-format bare
>>> --is-public True --progress
>>> returns log entries that can be seen in the attached which appears to
>>> show that the process has started ... but progress never moves beyond 4%
>>> and I haven't seen any further log messages. openstack-status shows all the
>>> processes to be up, and Glance images as saving. Given that the top one was
>>> through the GUI yesterday I'm guessing it's not going to finish any time
>>> soon!
>>>
>>> == Glance images ==
>>>
>>> | ID                                   | Name                | Disk Format | Container Format | Size     | Status |
>>> | f77429b2-17fd-4ef6-97a8-f710862182c6 | Cirros Raw          | raw         | bare             | 41126400 | saving |
>>> | 1b12e65a-01cd-4d05-91e8-9e9d86979229 | cirros-0.3.3-x86_64 | raw         | bare             | 41126400 | saving |
>>> | fd23c0f3-54b9-4698-b90b-8cdbd6e152c6 | cirros-0.3.3-x86_64 | raw         | bare             | 41126400 | saving |
>>> | db297a42-5242-4122-968e-33bf4ad3fe1f | cirros-0.3.3-x86_64 | raw         | bare             | 41126400 | saving |
>>>
>>> Was there a particular document that you referenced to perform your
>>> install Karan? This should be the easy part ... but I've been saying that
>>> about nearly everything for the past month or two!!
>>>
>>> Kind regards
>>>
>>>
>>> Iain
>>>
>>>
>>>
>>> On Thu, Apr 2, 2015 at 3:28 AM, Karan Singh  wrote:
>>>
>>>> Fortunately Ceph Giant + OpenStack Juno works flawlessly for me.
>>>>
>>>> If you have configured cinder / glance correctly , then after
>>>> restarting  cinder and glance services , you should see something like this
>>>> in cinder and glance logs.
>>>>
>>>>
>>>> Cinder logs :
>>>>
>>>> volume.log:2015-04-02 13:20:43.943 2085 INFO cinder.volume.manager
>>>> [req-526cb14e-42ef-4c49-b033-e9bf2096be8f - - - - -] Starting volume driver
>>>> RBDDriver (1.1.0)
>>>>
>>>>
>>>> Glance Logs:
>>>>
>>>> api.log:2015-04-02 13:20:50.448 1266 DEBUG glance.common.config [-]
>>>> glance_store.default_store = rbd log_opt_values
>>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG glance.common.config [-]
>>>> glance_store.rbd_store_ceph_conf = /etc/ceph/ceph.conf log_opt_values
>>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG glance.common.config [-]
>>>> glance_store.rbd_store_chunk_size = 8 log_opt_values
>>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG glance.common.config [-]
>>>> glance_store.rbd_store_pool= images log_opt_values
>>>> /usr/lib/python2.7/site-packages/oslo/config/cfg.py:2004
>>>> api.log:2015-04-02 13:20:50.449 1266 DEBUG

Re: [ceph-users] Ceph and Openstack

2015-04-02 Thread Quentin Hartman
Well, 100% may be overstating things. When I try to create a volume from an
image it fails. I'm digging through the logs right now. glance alone works
(I can upload and delete images) and cinder alone works (I can create and
delete volumes) but when cinder tries to get the glance service it fails,
it seems to be trying to contact the completely wrong IP:

2015-04-02 16:39:05.033 24986 TRACE cinder.api.middleware.fault
CommunicationError: Error finding address for
http://192.168.1.18:9292/v2/schemas/image:
HTTPConnectionPool(host='192.168.1.18', port=9292): Max retries exceeded
with url: /v2/schemas/image (Caused by : [Errno 111]
ECONNREFUSED)

Which I would expect to fail, since my glance service is not on that
machine. I assume that cinder gets this information out of keystone's
endpoint registry, but that lists the correct IP for glance:

| cf833cf63944490ba69a49a7af7fa2f5 | office | http://glance-host:9292 | http://192.168.1.20:9292 | http://glance-host:9292 | a2a74e440b134e08bd526d6dd36540d2 |

But this is probably something to move to an Openstack list. Thanks for all
the ideas and talking through things.
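(For the archives: the first thing I'm going to check is whether cinder is
being told where glance actually lives, rather than falling back to a default.
If I'm remembering the Juno option name right, it's something like:

# /etc/cinder/cinder.conf
[DEFAULT]
glance_api_servers = http://glance-host:9292

but treat that as a guess until I've confirmed it on the OpenStack list.)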

QH

On Thu, Apr 2, 2015 at 10:41 AM, Erik McCormick 
wrote:

>
>
> On Thu, Apr 2, 2015 at 12:18 PM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Hm, even lacking the mentions of rbd in the glance docs, and the lack of
>> cephx auth information in the config, glance seems to be working after all.
>> So, hooray! It was probably working all along, I just hadn't gotten to
>> really testing it since I was getting blocked by my typo on the cinder
>> config.
>>
>>
>>
> Glance sets defaults for almost everything, so just enabling the default
> store will work. I thought you needed to specify a username still, but
> maybe that's defaulted now as well. Glad it's working. So Quentin is 100%
> working now and  Iain has no Cinder and slow Glance. Right?
>
>
> Erik -
>>
>> Here's my output for the requested grep (though I am on Ubuntu, so the
>> path was slightly different:
>>
>> cfg.IntOpt('rbd_store_chunk_size', default=DEFAULT_CHUNKSIZE,
>> def __init__(self, name, store, chunk_size=None):
>> self.chunk_size = chunk_size or store.READ_CHUNKSIZE
>> length = min(self.chunk_size, bytes_left)
>> chunk = self.conf.glance_store.rbd_store_chunk_size
>> self.chunk_size = chunk * (1024 ** 2)
>> self.READ_CHUNKSIZE = self.chunk_size
>> def get(self, location, offset=0, chunk_size=None, context=None):
>> return (ImageIterator(loc.image, self, chunk_size=chunk_size),
>> chunk_size or self.get_size(location))
>>
>>
>> This all looks correct, so any slowness isn't the bug I was thinking of.
>
>>
>> QH
>>
>> On Thu, Apr 2, 2015 at 10:06 AM, Erik McCormick <
>> emccorm...@cirrusseven.com> wrote:
>>
>>> The RDO glance-store package had a bug in it that miscalculated the
>>> chunk size. I should hope that it's been patched by Redhat now since the
>>> fix was committed upstream before the first Juno rleease, but perhaps not.
>>> The symptom of the bug was horribly slow uploads to glance.
>>>
>>> Run this and send back the output:
>>>
>>> grep chunk_size
>>> /usr/lib/python2.7/site-packages/glance_store/_drivers/rbd.py
>>>
>>> -Erik
>>>
>>> On Thu, Apr 2, 2015 at 7:34 AM, Iain Geddes 
>>> wrote:
>>>
>>>> Oh, apologies, I missed the versions ...
>>>>
>>>> # glance --version   :   0.14.2
>>>> # cinder --version   :   1.1.1
>>>> # ceph -v:   ceph version 0.87.1
>>>> (283c2e7cfa2457799f534744d7d549f83ea1335e)
>>>>
>>>> From rpm I can confirm that Cinder and Glance are both of the February
>>>> 2014 vintage:
>>>>
>>>> # rpm -qa |grep -e ceph -e glance -e cinder
>>>> ceph-0.87.1-0.el7.x86_64
>>>> libcephfs1-0.87.1-0.el7.x86_64
>>>> ceph-common-0.87.1-0.el7.x86_64
>>>> python-ceph-0.87.1-0.el7.x86_64
>>>> openstack-cinder-2014.2.2-1.el7ost.noarch
>>>> python-cinder-2014.2.2-1.el7ost.noarch
>>>> python-cinderclient-1.1.1-1.el7ost.noarch
>>>> python-glanceclient-0.14.2-2.el7ost.noarch
>>>> python-glance-2014.2.2-1.el7ost.noarch
>>>> python-glance-store-0.1.10-2.el7ost.noarch
>>>> openstack-glance-2014.2.2-1.el7ost.noarch
>>>>
>>>> On Thu, Apr 2, 2015 at 4:24 AM, 

Re: [ceph-users] Why is running OSDs on a Hypervisors a bad idea?

2015-04-06 Thread Quentin Hartman
We just had a fairly extensive discussion about this on the thread "running
Qemu / Hypervisor AND Ceph on the same nodes". Check that out in the
archives.

On Fri, Apr 3, 2015 at 6:08 AM, Piotr Wachowicz <
piotr.wachow...@brightcomputing.com> wrote:
>
> Hey,
>
> We keep hearing that running Hypervisors (KVM) on the OSD nodes is a bad
idea. But why exactly is that the case?
>
> In our usecase, under normal operations our VMs use relatively low
amounts of CPU resources. So are the OSD services, so why not combine them?
(We use ceph for openstack volume/images storage, 7 shared OSD/KVM nodes, 2
pools, 128 PGs per pool, 2 OSDs per node, 10GigE)
>
> I know that during recovery the OSD memory usage spikes. So I guess that
might be one of the reasons.
>
> But are there any other concrete examples of situations when the
hypervisor could compete for CPU/mem resources with the OSD services
running on the same node in a way which would noticeably impact the
performance of either?
>
> Kind Regards,
> Piotr
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What are you doing to locate performance issues in a Ceph cluster?

2015-04-07 Thread Quentin Hartman
I spend a bunch of time figuring out ways to graph dense data sets for my
monitoring, and I have to say that graph is a thing of beauty. I'll
definitely be adding something similar to my ceph cluster monitoring
deployment.
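For a quick first pass before the graphs exist, even just sorting 'ceph osd
perf' output will point at the outlier. Something like the following, with
the caveat that the column positions are from the giant-era output (osd,
fs_commit_latency, fs_apply_latency), so double-check against your version:

ceph osd perf | sort -n -k3 | tail -5    # five worst apply latencies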

QH

On Mon, Apr 6, 2015 at 10:36 PM, Chris Kitzmiller  wrote:

> On Apr 6, 2015, at 7:04 PM, Robert LeBlanc  wrote:
>
> I see that ceph has 'ceph osd perf' that gets the latency of the OSDs.
> Is there a similar command that would provide some performance data
> about RBDs in use? I'm concerned about out ability to determine which
> RBD(s) may be "abusing" our storage at any given time.
>
> What are others doing to locate performance issues in their Ceph clusters?
>
>
> I graph aggregate stats for `ceph --admin-daemon
> /var/run/ceph/ceph-osd.$osdid.asok perf dump`. If the max latency strays
> too far outside of my mean latency I know to go look for the troublemaker.
> My graphs look something like this:
>
>
> So on Thursday just before noon a drive dies. The blue min latency for all
> disks spikes up because all disks are recovering the data on the lost OSD.
> The min drops back down to normal pretty quickly but then the red max line
> spikes way up for that single new disk which replaced the dead drive. It
> stays pretty high until it is done moving data back to itself at which time
> it becomes normal again just before midnight.
>
> I do this style of graphing because I have 30 OSDs per chassis and a chart
> with 30 individual lines on it would be kind of tough to read. Though on
> less dense nodes that would probably be the way to go.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low power single disk nodes

2015-04-09 Thread Quentin Hartman
I'm skeptical about how well this would work, but a Banana Pi might be a
place to start. Like a raspberry pi, but it has a SATA connector:
http://www.bananapi.org/

On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg  wrote:

>
> Hello ceph users,
>
> Is anyone running any low powered single disk nodes with Ceph now? Calxeda
> seems to be no more according to Wikipedia. I do not think HP moonshot is
> what I am looking for - I want stand-alone nodes, not server cartridges
> integrated into server chassis. And I do not want to be locked to a single
> vendor.
>
> I was playing with Raspberry Pi 2 for signage when I thought of my old
> experiments with Ceph.
>
> I am thinking of for example Odroid-C1 or Odroid-XU3 Lite or maybe
> something with a low-power Intel x64/x86 processor. Together with one SSD
> or one low power HDD the node could get all power via PoE (via splitter or
> integrated into board if such boards exist). PoE provide remote power-on
> power-off even for consumer grade nodes.
>
> The cost for a single low power node should be able to compete with
> traditional PC-servers price per disk. Ceph take care of redundancy.
>
> I think simple custom casing should be good enough - maybe just strap or
> velcro everything on trays in the rack, at least for the nodes with SSD.
>
> Kind regards,
> --
> Jerker Nyberg, Uppsala, Sweden.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low power single disk nodes

2015-04-09 Thread Quentin Hartman
Where's the "take my money" button?

On Thu, Apr 9, 2015 at 9:43 AM, Mark Nelson  wrote:

> How about drives that run Linux with an ARM processor, RAM, and an
> ethernet port right on the drive?  Notice the Ceph logo. :)
>
> https://www.hgst.com/science-of-storage/emerging-
> technologies/open-ethernet-drive-architecture
>
> Mark
>
> On 04/09/2015 10:37 AM, Scott Laird wrote:
>
>> Minnowboard Max?  2 atom cores, 1 SATA port, and a real (non-USB)
>> Ethernet port.
>>
>>
>> On Thu, Apr 9, 2015, 8:03 AM p...@philw.com wrote:
>>
>> Rather expensive option:
>>
>> Applied Micro X-Gene, overkill for a single disk, and only really
>> available in a
>> development kit format right now.
>>
>> <https://www.apm.com/products/data-center/x-gene-family/x-c1-development-kits/>
>>
>> Better Option:
>>
>> Ambedded CY7 - 7 nodes in 1U half Depth, 6 positions for SATA disks,
>> and one
>> node with mSATA SSD
>>
>> <http://www.ambedded.com.tw/pt_list.php?CM_ID=20140214001>
>>
>> --phil
>>
>>  > On 09 April 2015 at 15:57 Quentin Hartman <qhart...@direwolfdigital.com> wrote:
>>  >
>>  >  I'm skeptical about how well this would work, but a Banana Pi
>> might be a
>>  > place to start. Like a raspberry pi, but it has a SATA connector:
>>  > http://www.bananapi.org/
>>  >
>>  >  On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg <jer...@update.uu.se> wrote:
>>  >> >Hello ceph users,
>>  > >
>>  > >Is anyone running any low powered single disk nodes with
>> Ceph now?
>>  > > Calxeda seems to be no more according to Wikipedia. I do not
>> think HP
>>  > > moonshot is what I am looking for - I want stand-alone nodes,
>> not server
>>  > > cartridges integrated into server chassis. And I do not want to
>> be locked to
>>  > > a single vendor.
>>  > >
>>  > >I was playing with Raspberry Pi 2 for signage when I thought
>> of my old
>>  > > experiments with Ceph.
>>  > >
>>  > >I am thinking of for example Odroid-C1 or Odroid-XU3 Lite or
>> maybe
>>  > > something with a low-power Intel x64/x86 processor. Together
>> with one SSD or
>>  > > one low power HDD the node could get all power via PoE (via
>> splitter or
>>  > > integrated into board if such boards exist). PoE provide remote
>> power-on
>>  > > power-off even for consumer grade nodes.
>>  > >
>>  > >The cost for a single low power node should be able to
>> compete with
>>  > > traditional PC-servers price per disk. Ceph take care of
>> redundancy.
>>  > >
>>  > >I think simple custom casing should be good enough - maybe
>> just strap or
>>  > > velcro everything on trays in the rack, at least for the nodes
>> with SSD.
>>  > >
>>  > >Kind regards,
>>  > >--
>>  > >Jerker Nyberg, Uppsala, Sweden.
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Managing larger ceph clusters

2015-04-17 Thread Quentin Hartman
I also have a fairly small deployment of 14 nodes, 42 OSDs, but even I use
some automation. I do my OS installs and partitioning with PXE / kickstart,
then use chef for my baseline install of the "normal" server stuff in our
env and admin accounts. Then the ceph-specific stuff I handle by hand and
with ceph-deploy and some light wrapper scripts. Monitoring / alerting is
sensu and graphite. I tried Calamari, and it was nice. But it produced a
lot of load on the admin machine (especially considering the work it should
have been performing) and once I figured out how to get metrics into
"normal" graphite, the appeal of a ceph-specific tool was reduced
substantially.
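For anyone wanting to skip Calamari entirely, "normal" graphite really is
just the carbon plaintext protocol, so getting a number in is trivial. A toy
example (assumes carbon is listening on the default plaintext port 2003; the
metric name and value are placeholders you'd pull from 'ceph osd perf' or the
admin sockets yourself, which is essentially what the diamond / sensu
plumbing does for me):

echo "ceph.osd.12.fs_apply_latency_ms 7 $(date +%s)" | nc -w1 graphite-host 2003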

QH

On Fri, Apr 17, 2015 at 1:07 PM, Steve Anthony  wrote:

>  For reference, I'm currently running 26 nodes (338 OSDs); will be 35
> nodes (455 OSDs) in the near future.
>
> Node/OSD provisioning and replacements:
>
> Mostly I'm using ceph-deploy, at least to do node/osd adds and
> replacements. Right now the process is:
>
> Use FAI (http://fai-project.org) to setup software RAID1/LVM for the OS
> disks, and do a minimal installation, including the salt-minion.
>
> Accept the new minion on the salt-master node and deploy the
> configuration. LDAP auth, nrpe, diamond collector, udev configuration,
> custom python disk add script, and everything on the Ceph preflight page (
> http://ceph.com/docs/firefly/start/quick-start-preflight/)
>
> Insert the journals into the case. Udev triggers my python code, which
> partitions the SSDs and fires a Prowl alert (http://www.prowlapp.com/) to
> my phone when it's finished.
>
> Insert the OSDs into the case. Same thing, udev triggers the python code,
> which selects the next available partition on the journals so OSDs go on
> journal1partA, journal2partA, journal3partA, journal1partB,... for the
> three journals in each node. The code then fires a salt event at the master
> node with the OSD dev path, journal /dev/by-id/ path and node hostname. The
> salt reactor on the master node takes this event and runs a script on the
> admin node which passes those parameters to ceph-deploy, which does the OSD
> deployment. Send Prowl alert on success or fail with details.
>
> Similarity, when an OSD fails, I remove it, and insert the new OSD. The
> same process as above occurs. Logical removal I do manually, since I'm not
> at a scale where it's common yet. Eventually, I imagine I'll write code to
> trigger OSD removal on certain events using the same event/reactor Salt
> framework.
>
> Pool/CRUSH management:
>
> Pool configuration and CRUSH management are mostly one-time operations.
> That is, I'll make a change rarely and when I do it will persist in that
> new state for a long time. Given that and the fact that I can make the
> changes from one node and inject them into the cluster, I haven't needed to
> automate that portion of Ceph as I've added more nodes, at least not yet.
>
> Replacing journals:
>
> I haven't had to do this yet; I'd probably remove/readd all the OSDs if it
> happened today, but will be reading the post you linked.
>
> Upgrading releases:
>
> Change the configuration of /etc/apt/source.list.d/ceph.list to point at
> new release and push to all the nodes with Salt. Then salt -N 'ceph'
> pkg.upgrade to upgrade the packages on all the nodes in the ceph nodegroup.
> Then, use Salt to restart the monitors, then the OSDs on each node, one by
> one. Finally run the following command on all nodes with Salt to verify all
> monitors/OSDs are using the new version:
>
> for i in $(ls /var/run/ceph/ceph-*.asok);do echo $i;ceph --admin-daemon $i
> version;done
>
> Node decommissioning:
>
> I have a script which enumerates all the OSDs on a given host and stores
> that list in a file. Another script (run by cron every 10 minutes) checks
> if the cluster health is OK, and if so pops the next OSD from that file and
> executes the steps to remove it from the host, trickling the node out of
> service.
>
>
>
>
>
> On 04/17/2015 02:18 PM, Craig Lewis wrote:
>
> I'm running a small cluster, but I'll chime in since nobody else has.
>
>  Cern had a presentation a while ago (dumpling time-frame) about their
> deployment.  They go over some of your questions:
> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
>
>  My philosophy on Config Management is that it should save me time.  If
> it's going to take me longer to write a recipe to do something, I'll just
> do it by hand. Since my cluster is small, there are many things I can do
> faster by hand.  This may or may not work for you, depending on your
> documentation / repeatability requirements.  For things that need to be
> documented, I'll usually write the recipe anyway (I accept Chef recipes as
> documentation).
>
>
>  For my clusters, I'm using Chef to setups all nodes and manage
> ceph.conf.  I manually manage my pools, CRUSH map, RadosGW users, and disk
> replacement.  I was using Chef to add new disks, but I ran into load
> problems due to my small cluster.