Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-05-02 Thread Łukasz Jagiełło
Hi,

I tried today to revert [1] from 10.2.7, but the problem is still there even
without the change. Reverting to 10.2.5 fixes the issue instantly.

https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7f4c6d0

On Thu, Apr 27, 2017 at 4:53 AM, Radoslaw Zarzynski  wrote:

> Bingo! From the 10.2.5-admin:
>
>   GET
>
>   Thu, 27 Apr 2017 07:49:59 GMT
>   /
>
> And also:
>
>   2017-04-27 09:49:59.117447 7f4a90ff9700 20 subdomain= domain=
> in_hosted_domain=0 in_hosted_domain_s3website=0
>   2017-04-27 09:49:59.117449 7f4a90ff9700 20 final domain/bucket
> subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0
> s->info.domain= s->info.request_uri=/
>
> The most interesting part is the "final ... in_hosted_domain=0".
> It looks like we need to dig around RGWREST::preprocess(),
> rgw_find_host_in_domains() and company.
>
> There is a commit introduced in v10.2.6 that touches this area [1].
> I'm definitely not saying it's the root cause. It might be that a change
> in the code merely exposed a configuration issue [2].
>
> I will bring the problem up at today's sync-up.
>
> Thanks for the logs!
> Regards,
> Radek
>
> [1] https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7f4c6d0
> [2] http://tracker.ceph.com/issues/17440
>
> On Thu, Apr 27, 2017 at 10:11 AM, Ben Morrice  wrote:
> > Hello Radek,
> >
> > Thank you for your analysis so far! Please find attached logs for both
> > the admin user and a Keystone-backed user from 10.2.5 (same host as
> > before; I have simply downgraded the packages). Both users can
> > authenticate and list buckets on 10.2.5.
> >
> > Also, I tried version 10.2.6 and see the same behavior as 10.2.7, so the
> > bug I'm hitting looks like it was introduced in 10.2.6.
> >
> > Kind regards,
> >
> > Ben Morrice
> >
> > __
> > Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> > EPFL / BBP
> > Biotech Campus
> > Chemin des Mines 9
> > 1202 Geneva
> > Switzerland
> >
> > On 27/04/17 04:45, Radoslaw Zarzynski wrote:
> >>
> >> Thanks for the logs, Ben.
> >>
> >> It looks like two completely different authenticators have failed:
> >> the local, RADOS-backed auth (admin.txt) and the Keystone-based
> >> one as well. In the second case I'm pretty sure that Keystone has
> >> refused [1][2] to authenticate the provided signature/StringToSign.
> >> RGW tried to fall back to the local auth, which obviously didn't have
> >> any chance as the credentials were stored remotely. This explains
> >> the presence of "error reading user info" in user-keystone.txt.
> >>
> >> What is common to both scenarios is the low-level handling of
> >> StringToSign crafting/signature generation on RadosGW's side.
> >> The following one has been composed for the request from admin.txt:
> >>
> >>GET
> >>
> >>
> >>Wed, 26 Apr 2017 09:18:42 GMT
> >>/bbpsrvc15.cscs.ch/
> >>
> >> If you could provide a similar log from v10.2.5, I would be really
> >> grateful.
> >>
> >> Regards,
> >> Radek
> >>
> >> [1]
> >> https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_rest_s3.cc#L3269-L3272
> >> [2] https://github.com/ceph/ceph/blob/v10.2.7/src/rgw/rgw_common.h#L170
> >>
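For illustration, a minimal sketch (not part of the original thread) of how an
S3 v2 signature is derived from the StringToSign; it shows why the extra
"/bbpsrvc15.cscs.ch/" in the canonicalized resource yields a different
signature than the plain "/" and therefore an authentication failure. The
secret key below is a placeholder.

import base64
import hashlib
import hmac

def s3_v2_signature(secret_key, string_to_sign):
    # AWS signature v2: base64(HMAC-SHA1(secret, StringToSign))
    mac = hmac.new(secret_key.encode('utf-8'),
                   string_to_sign.encode('utf-8'),
                   hashlib.sha1)
    return base64.b64encode(mac.digest()).decode('ascii')

secret = 'PLACEHOLDER_SECRET'  # placeholder, not a real key
working = 'GET\n\n\nThu, 27 Apr 2017 07:49:59 GMT\n/'
failing = 'GET\n\n\nThu, 27 Apr 2017 07:49:59 GMT\n/bbpsrvc15.cscs.ch/'
print(s3_v2_signature(secret, working))
print(s3_v2_signature(secret, failing))   # differs, so the request is rejected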
> >> On Wed, Apr 26, 2017 at 11:29 AM, Morrice Ben 
> wrote:
> >>>
> >>> Hello Radek,
> >>>
> >>> Please find attached the failed request for both the admin user and a
> >>> standard user (backed by keystone).
> >>>
> >>> Kind regards,
> >>>
> >>> Ben Morrice
> >>>
> >>> __
> >>> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> >>> EPFL BBP
> >>> Biotech Campus
> >>> Chemin des Mines 9
> >>> 1202 Geneva
> >>> Switzerland
> >>>
> >>> 
> >>> From: Radoslaw Zarzynski 
> >>> Sent: Tuesday, April 25, 2017 7:38 PM
> >>> To: Morrice Ben
> >>> Cc: ceph-users@lists.ceph.com
> >>> Subject: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?
> >>>
> >>> Hello Ben,
> >>>
> >>> Could you provide the full RadosGW log for the failed request?
> >>> I mean the lines starting from the header listing, through the start
> >>> marker ("== starting new request...") to the end marker?
> >>>
> >>> At the moment we can't see any details related to the signature
> >>> calculation.
> >>>
> >>> Regards,
> >>> Radek
> >>>
> >>> On Thu, Apr 20, 2017 at 5:08 PM, Ben Morrice 
> wrote:
> 
>  Hi all,
> 
>  I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7
>  (RHEL7), and authentication is in a very bad state. This installation is
>  part of a multi-gw configuration, and I have just updated one host in the
>  secondary zone (all other hosts/zones are running 10.2.5).
>
>  On the 10.2.7 server I cannot authenticate as a user (normally backed by
>  OpenStack Keystone), but even worse, I also cannot authenticate with an
>  admin user.
>
>  Please see [1] for the resu

Re: [ceph-users] Failed to read JournalPointer - MDS error (mds rank 0 is damaged)

2017-05-02 Thread Patrick Donnelly
Looks like: http://tracker.ceph.com/issues/17236

The fix is in v10.2.6.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread David Turner
I was only interjecting on the comment "So that is 5 Mbyte/sec. Which is real
easy to obtain" and commenting on what the sustained writes into a cluster of
2,000 OSDs would need to be to actually sustain that 5 MBps on each SSD
journal.

My calculation was off because I forgot replica size, but my corrected math
is this...

5 MBps per journal device
8 OSDs per journal (overestimated number as most do 4)
2,000 OSDs based on what you said "Which is real easy to obtain, even with
hardware of 2000."
3 replicas

2,000 OSDs / 8 OSDs per journal = 250 journal SSDs
250 SSDs * 5 MBps = 1,250 MBps / 3 replicas = 416.67 MBps required
sustained cluster write speed to cause each SSD to average 5 MBps on each
journal device.

Swap out any variable you want to match your environment.  For example, if
you only have 4 OSDs per journal device, that number would be double for a
cluster this size to require a cluster write speed of 833.33 MBps to
average 5 MBps on each journal.  Also, if you have fewer than 2,000 OSDs,
then everything shrinks fast.
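As a quick illustration (not part of the original message), the same
arithmetic in a small Python sketch, so the variables can be swapped to match
a given environment:

osd_count        = 2000
osds_per_journal = 8      # most deployments use 4
replicas         = 3
per_journal_mbps = 5.0    # target sustained MB/s on each journal SSD

journal_ssds = osd_count // osds_per_journal               # 250
cluster_mbps = journal_ssds * per_journal_mbps / replicas  # ~416.67
print('journal SSDs: %d' % journal_ssds)
print('required sustained cluster write: %.2f MBps' % cluster_mbps)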


On Tue, May 2, 2017 at 5:39 PM Willem Jan Withagen  wrote:

> On 02-05-17 19:54, David Turner wrote:
> > Are you guys talking about 5Mbytes/sec to each journal device?  Even if
> > you had 8 OSDs per journal and had 2000 osds... you would need a
> > sustained 1.25 Gbytes/sec to average 5Mbytes/sec per journal device.
>
> I'm not sure I'm following this...
> But I'm rather curious.
> Are you saying that the required journal bandwidth versus OSD write
> bandwidth has an approx 1:200 ratio??
>
> Note that I took it the other way.
> Given the Intel specs
>  - What sustained bandwidth is allowed to have the device last its
> lifetime.
>  - How much more usage would a 3710 give in regards to a 3520 SSD per
>dollar spent.
>
> --WjW
>
> > On Tue, May 2, 2017 at 1:47 PM Willem Jan Withagen  > > wrote:
> >
> > On 02-05-17 19:16, Дробышевский, Владимир wrote:
> > > Willem,
> > >
> > >   please note that you use 1.6TB Intel S3520 endurance rating in
> your
> > > calculations but then compare prices with 480GB model, which has
> only
> > > 945TBW or 1.1DWPD (
> > >
> >
> https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
> > > ). It also worth to notice that S3710 has tremendously higher write
> > > speed\IOPS and especially SYNC writes. Haven't seen S3520 real sync
> > > write tests yet but don't think they differ much from S3510 ones.
> >
> > Arrgh, you are right. I guess I had too many pages open, and copied
> the
> > wrong one.
> >
> > But the good news is that the stats were already in favour of the
> 3710
> > so this only increases that conclusion.
> >
> > The bad news is that the sustained write speed goes down with a
> > factor 4.
> > So that is 5Mbyte/sec. Which is real easy to obtain, even with
> hardware
> > 0f 2000.
> >
> > --WjW
> >
> >
> > > Best regards,
> > > Vladimir
> > >
> > > 2017-05-02 21:05 GMT+05:00 Willem Jan Withagen  > 
> > > >>:
> > >
> > > On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > > > Hi,
> > > >
> > > >>> What I'm trying to get from the list is /why/ the
> > "enterprise" drives
> > > >>> are important. Performance? Reliability? Something else?
> > > >
> > > > performance, for sure (for SYNC write,
> >
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > )
> > > >
> > > > Reliabity : yes, enteprise drive have supercapacitor in case
> > of powerfailure, and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> > > >
> > > >
> > > >>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC
> > S3610. Obviously
> > > >>> the single drive leaves more bays free for OSD disks, but
> > is there any
> > > >>> other reason a single S3610 is preferable to 4 S3520s?
> > Wouldn't 4xS3520s
> > > >>> mean:
> > > >
> > > > where do you see this price difference ?
> > > >
> > > > for me , S3520 are around 25-30% cheaper than S3610
> > >
> > > I just checked for the DCS3520 on
> > >
> >
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> > >
> >  <
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> >
> > >
> > > And is has a TBW of 2925 (Terrabytes Write over life time) =
> > 2,9 PB
> > > the warranty is 5 years.
> > >
> > > Now if I do the math:
> > >   2925 * 104 /5 /365 /24 /60 = 1,14 Gbyte/min to be written.
> > >   which 

Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Willem Jan Withagen
On 02-05-17 19:54, David Turner wrote:
> Are you guys talking about 5Mbytes/sec to each journal device?  Even if
> you had 8 OSDs per journal and had 2000 osds... you would need a
> sustained 1.25 Gbytes/sec to average 5Mbytes/sec per journal device.

I'm not sure I'm following this...
But I'm rather curious.
Are you saying that the required journal bandwidth versus OSD write
bandwidth has an approx 1:200 ratio??

Note that I took it the other way.
Given the Intel specs
 - What sustained bandwidth is allowed if the device is to last its lifetime.
 - How much more usage would a 3710 give compared to a 3520 SSD per
   dollar spent?

--WjW

> On Tue, May 2, 2017 at 1:47 PM Willem Jan Withagen  > wrote:
> 
> On 02-05-17 19:16, Дробышевский, Владимир wrote:
> > Willem,
> >
> >   please note that you use 1.6TB Intel S3520 endurance rating in your
> > calculations but then compare prices with 480GB model, which has only
> > 945TBW or 1.1DWPD (
> >
> 
> https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
> > ). It also worth to notice that S3710 has tremendously higher write
> > speed\IOPS and especially SYNC writes. Haven't seen S3520 real sync
> > write tests yet but don't think they differ much from S3510 ones.
> 
> Arrgh, you are right. I guess I had too many pages open, and copied the
> wrong one.
> 
> But the good news is that the stats were already in favour of the 3710
> so this only increases that conclusion.
> 
> The bad news is that the sustained write speed goes down with a
> factor 4.
> So that is 5Mbyte/sec. Which is real easy to obtain, even with hardware
> 0f 2000.
> 
> --WjW
> 
> 
> > Best regards,
> > Vladimir
> >
> > 2017-05-02 21:05 GMT+05:00 Willem Jan Withagen  
> > >>:
> >
> > On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > > Hi,
> > >
> > >>> What I'm trying to get from the list is /why/ the
> "enterprise" drives
> > >>> are important. Performance? Reliability? Something else?
> > >
> > > performance, for sure (for SYNC write,
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >   
>  
> )
> > >
> > > Reliabity : yes, enteprise drive have supercapacitor in case
> of powerfailure, and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> > >
> > >
> > >>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC
> S3610. Obviously
> > >>> the single drive leaves more bays free for OSD disks, but
> is there any
> > >>> other reason a single S3610 is preferable to 4 S3520s?
> Wouldn't 4xS3520s
> > >>> mean:
> > >
> > > where do you see this price difference ?
> > >
> > > for me , S3520 are around 25-30% cheaper than S3610
> >
> > I just checked for the DCS3520 on
> >   
>  
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> >   
>  
> 
> >
> > And is has a TBW of 2925 (Terrabytes Write over life time) =
> 2,9 PB
> > the warranty is 5 years.
> >
> > Now if I do the math:
> >   2925 * 104 /5 /365 /24 /60 = 1,14 Gbyte/min to be written.
> >   which is approx 20Mbyte /sec
> >   or approx 10Gbit/min = 0,15 Gbit/sec
> >
> > And that is only 20% of the capacity of that SATA link.
> > Also writing 20Mbyte/sec sustained is not really that hard for
> modern
> > systems.
> >
> > Now a 400Gb 3710 takes 8.3 PB, which is ruffly 3 times as much.
> > so it will last 3 times longer.
> >
> > Checking Amazone, I get
> > $520 for the DC S3710-400G
> > $300 for the DC S3520-480G
> >
> > So that is less than a factor of 2 for using the S3710's and a
> 3 times
> > longer lifetime. To be exact (8.3/520) / (2,9/300) = 1.65 more
> bang for
> > your buck.
> >
> > But still do not expect your SSDs to last very long if the
> write rate is
> > much over that 20Mbyte/sec
> >
> > --WjW
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> >
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 

Re: [ceph-users] Ceph memory overhead when used with KVM

2017-05-02 Thread Jason Dillaman
Can you share the fio job file that you utilized so I can attempt to
repeat locally?

On Tue, May 2, 2017 at 2:51 AM, nick  wrote:
> Hi Jason,
> thanks for your feedback. I have now done some tests over the weekend to
> verify the memory overhead.
> I was using qemu 2.8 (taken from the Ubuntu Cloud Archive) with librbd 10.2.7
> on Ubuntu 16.04 hosts. I suspected the ceph rbd cache to be the cause of the
> overhead so I just generated a lot of IO with the help of fio in the VMs (with
> a data size of 80GB). All VMs had 3GB of memory. I had to run fio multiple
> times before reaching high RSS values.
> I also noticed that when using larger blocksizes during writes (like 4M) the
> memory overhead in the KVM process increased faster.
> I ran several fio tests (one after another) and the results are:
>
> KVM with writeback RBD cache: max. 85% memory overhead (2.5 GB overhead)
> KVM with writethrough RBD cache: max. 50% memory overhead
> KVM without RBD caching: less than 10% overhead all the time
> KVM with local storage (logical volume used): 8% overhead all the time
>
> I did not reach those >200% memory overhead results that we see on our live
> cluster, but those virtual machines have a way longer uptime as well.
>
> I also tried to reduce the RSS memory value with cache dropping on the
> physical host and in the VM. Neither led to any change. A reboot of the
> VM also does not change anything (reboot in the VM, not a new KVM process).
> The only way to reduce the RSS memory value is a live migration so far. Might
> this be a bug? The memory overhead sounds a bit too much for me.
>
> Best Regards
> Sebastian
>
>
> On Thursday, April 27, 2017 10:08:36 AM you wrote:
>> I know we noticed high memory usage due to librados in the Ceph
>> multipathd checker [1] -- the order of hundreds of megabytes. That
>> client was probably nearly as trivial as an application can get and I
>> just assumed it was due to large monitor maps being sent to the client
>> for whatever reason. Since we changed course on our RBD iSCSI
>> implementation, unfortunately the investigation into this high memory
>> usage fell by the wayside.
>>
>> [1]
>> http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=blob;f=libmultip
>> ath/checkers/rbd.c;h=9ea0572f2b5bd41b80bf2601137b74f92bdc7278;hb=HEAD
>> On Thu, Apr 27, 2017 at 5:26 AM, nick  wrote:
>> > Hi Christian,
>> > thanks for your answer.
>> > The highest value I can see for a local storage VM in our infrastructure
>> > is a memory overhead of 39%. This is big, but the majority (>90%) of our
>> > local storage VMs are using less than 10% memory overhead.
>> > For ceph storage based VMs this looks quite different. The highest value I
>> > can see currently is 244% memory overhead. So that specific allocated 3GB
>> > memory VM is using now 10.3 GB RSS memory on the physical host. This is a
>> > really huge value. In general I can see that the majority of the ceph
>> > based VMs has more than 60% memory overhead.
>> >
>> > Maybe this is also a bug related to qemu+librbd. It would be just nice to
>> > know if other people are seeing those high values as well.
>> >
>> > Cheers
>> > Sebastian
>> >
>> > On Thursday, April 27, 2017 06:10:48 PM you wrote:
>> >> Hello,
>> >>
>> >> Definitely seeing about 20% overhead with Hammer as well, so not version
>> >> specific from where I'm standing.
>> >>
>> >> While non-RBD storage VMs by and large tend to be closer to the specified
>> >> size, I've seen them exceed things by a few % at times, too.
>> >> For example a 4317968KB RSS one that ought to be 4GB.
>> >>
>> >> Regards,
>> >>
>> >> Christian
>> >>
>> >> On Thu, 27 Apr 2017 09:56:48 +0200 nick wrote:
>> >> > Hi,
>> >> > we are running a jewel ceph cluster which serves RBD volumes for our
>> >> > KVM
>> >> > virtual machines. Recently we noticed that our KVM machines use a lot
>> >> > more
>> >> > memory on the physical host system than what they should use. We
>> >> > collect
>> >> > the data with a python script which basically executes 'virsh
>> >> > dommemstat
>> >> > '. We also verified the results of the script
>> >> > with
>> >> > the memory stats of 'cat /proc//status' for each virtual
>> >> > machine
>> >> > and the results are the same.
>> >> >
>> >> > Here is an excerpt for one pysical host where all virtual machines are
>> >> > running since yesterday (virtual machine names removed):
>> >> >
>> >> > """
> >> >> > overhead    actual   percent_overhead   rss
> >> >> > --------    ------   ----------------   ---
>> >> > 423.8 MiB   2.0 GiB 20  2.4 GiB
>> >> > 460.1 MiB   4.0 GiB 11  4.4 GiB
>> >> > 471.5 MiB   1.0 GiB 46  1.5 GiB
>> >> > 472.6 MiB   4.0 GiB 11  4.5 GiB
>> >> > 681.9 MiB   8.0 GiB  8  8.7 GiB
>> >> > 156.1 MiB   1.0 GiB 15  1.2 GiB
>> >> > 278.6 MiB   1.0 GiB 27  1.3 GiB
>> >> > 290.4 MiB   1.0 GiB 28  1.3 GiB
>> >> > 291.5 MiB   1.0 GiB   
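As an aside, a minimal sketch (not from the original posts) of the kind of
collection described above, comparing 'virsh dommemstat' actual vs rss per
domain; the 'actual' and 'rss' fields (in KiB) are assumed to be present in
the dommemstat output as reported by libvirt:

import subprocess

def dommemstat(domain):
    out = subprocess.check_output(['virsh', 'dommemstat', domain]).decode()
    pairs = (line.split() for line in out.splitlines())
    stats = {p[0]: p[1] for p in pairs if len(p) == 2}
    return int(stats['actual']), int(stats['rss'])  # both in KiB

domains = subprocess.check_output(['virsh', 'list', '--name']).decode().split()
print('%-20s %12s %12s %10s' % ('domain', 'actual MiB', 'rss MiB', 'overhead %'))
for dom in domains:
    actual, rss = dommemstat(dom)
    print('%-20s %12.0f %12.0f %10.0f' %
          (dom, actual / 1024.0, rss / 1024.0, 100.0 * (rss - actual) / actual))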

Re: [ceph-users] RBD behavior for reads to a volume with no data written

2017-05-02 Thread Jason Dillaman
If the RBD object map feature is enabled, the read request would never even
be sent to the OSD if the client knows the backing object doesn't exist.
However, if the object map feature is disabled, the read request will be
sent to the OSD.

The OSD isn't my area of expertise, but I can try to explain what occurs to
the best of my knowledge. There is a small in-memory cache for object
contexts within the OSD PG -- which includes a whiteout flag to indicate that
the object is deleted. I believe that the whiteout flag is only really used on
a cache tier to avoid having to attempt to promote a known non-existent
object. Therefore, in the common case the OSD would query the non-existent
object from the object store (FileStore or BlueStore).

In FileStore, it will attempt to open the associated backing file for the
object. If the necessary dentries are cached in the kernel, I'd expect the
-ENOENT error to be bubbled back to RBD without a disk hit. Otherwise,
the kernel would need to read the associated dentries from disk to
determine that the object is missing.

In BlueStore, there is another in-memory cache for onodes that can quickly
detect a missing object. If the object isn't in the cache, the associated
onode will be looked up within the backing RocksDB. If the RocksDB metadata
scan for the object's onode fails since the object is missing, the -ENOENT
error would be bubbled back to the client.
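To make the client-side part concrete, here is a minimal sketch (not from the
original message) using the python-rbd bindings; the pool name 'rbd' and image
name 'test-image' are placeholders. If the object-map feature bit is set, a
read of a never-written extent can be answered without an OSD round trip;
either way the client gets zeros back:

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')           # pool name is an assumption
try:
    image = rbd.Image(ioctx, 'test-image')  # image name is an assumption
    try:
        has_object_map = bool(image.features() & rbd.RBD_FEATURE_OBJECT_MAP)
        print('object-map enabled: %s' % has_object_map)
        # Reading a never-written extent returns zeros; with the object map
        # the zero-fill can be produced client-side.
        buf = image.read(0, 4096)
        print('read %d bytes, all zeros: %s' % (len(buf), buf == b'\x00' * 4096))
    finally:
        image.close()
finally:
    ioctx.close()
    cluster.shutdown()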



On Tue, May 2, 2017 at 1:24 PM, Prashant Murthy 
wrote:

> I wanted to add that I was particularly interested about the behavior with
> filestore, but was also curious how this works on bluestore.
>
> Prashant
>
> On Mon, May 1, 2017 at 10:04 PM, Prashant Murthy 
> wrote:
>
>> Hi all,
>>
>> I was wondering what happens when reads are issued to an RBD device with
>> no previously written data. Can somebody explain how such requests flow
>> from rbd (client) into OSDs and whether any of these reads would hit the
>> disks at all or whether OSD metadata would recognize that there is no data
>> at the offsets requested and returns a bunch of zeros back to the client?
>>
>> Thanks,
>> Prashant
>>
>> --
>> Prashant Murthy
>> Sr Director, Software Engineering | Salesforce
>> Mobile: 919-961-3041 <(919)%20961-3041>
>>
>>
>> --
>>
>
>
>
> --
> Prashant Murthy
> Sr Director, Software Engineering | Salesforce
> Mobile: 919-961-3041 <(919)%20961-3041>
>
>
> --
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread David Turner
Are you guys talking about 5Mbytes/sec to each journal device?  Even if you
had 8 OSDs per journal and had 2000 osds... you would need a sustained 1.25
Gbytes/sec to average 5Mbytes/sec per journal device.

On Tue, May 2, 2017 at 1:47 PM Willem Jan Withagen  wrote:

> On 02-05-17 19:16, Дробышевский, Владимир wrote:
> > Willem,
> >
> >   please note that you use 1.6TB Intel S3520 endurance rating in your
> > calculations but then compare prices with 480GB model, which has only
> > 945TBW or 1.1DWPD (
> >
> https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
> > ). It also worth to notice that S3710 has tremendously higher write
> > speed\IOPS and especially SYNC writes. Haven't seen S3520 real sync
> > write tests yet but don't think they differ much from S3510 ones.
>
> Arrgh, you are right. I guess I had too many pages open, and copied the
> wrong one.
>
> But the good news is that the stats were already in favour of the 3710
> so this only increases that conclusion.
>
> The bad news is that the sustained write speed goes down with a factor 4.
> So that is 5Mbyte/sec. Which is real easy to obtain, even with hardware
> 0f 2000.
>
> --WjW
>
>
> > Best regards,
> > Vladimir
> >
> > 2017-05-02 21:05 GMT+05:00 Willem Jan Withagen  > >:
> >
> > On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > > Hi,
> > >
> > >>> What I'm trying to get from the list is /why/ the "enterprise"
> drives
> > >>> are important. Performance? Reliability? Something else?
> > >
> > > performance, for sure (for SYNC write,
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > <
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >)
> > >
> > > Reliabity : yes, enteprise drive have supercapacitor in case of
> powerfailure, and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> > >
> > >
> > >>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610.
> Obviously
> > >>> the single drive leaves more bays free for OSD disks, but is
> there any
> > >>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't
> 4xS3520s
> > >>> mean:
> > >
> > > where do you see this price difference ?
> > >
> > > for me , S3520 are around 25-30% cheaper than S3610
> >
> > I just checked for the DCS3520 on
> >
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> > <
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> >
> >
> > And is has a TBW of 2925 (Terrabytes Write over life time) = 2,9 PB
> > the warranty is 5 years.
> >
> > Now if I do the math:
> >   2925 * 104 /5 /365 /24 /60 = 1,14 Gbyte/min to be written.
> >   which is approx 20Mbyte /sec
> >   or approx 10Gbit/min = 0,15 Gbit/sec
> >
> > And that is only 20% of the capacity of that SATA link.
> > Also writing 20Mbyte/sec sustained is not really that hard for modern
> > systems.
> >
> > Now a 400Gb 3710 takes 8.3 PB, which is ruffly 3 times as much.
> > so it will last 3 times longer.
> >
> > Checking Amazone, I get
> > $520 for the DC S3710-400G
> > $300 for the DC S3520-480G
> >
> > So that is less than a factor of 2 for using the S3710's and a 3
> times
> > longer lifetime. To be exact (8.3/520) / (2,9/300) = 1.65 more bang
> for
> > your buck.
> >
> > But still do not expect your SSDs to last very long if the write
> rate is
> > much over that 20Mbyte/sec
> >
> > --WjW
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> >
> >
> >
> >
> > --
> >
> > С уважением,
> > Дробышевский Владимир
> > Компания "АйТи Город"
> > +7 343 192 <+7%20343%20222-21-92>
> >
> > ИТ-консалтинг
> > Поставка проектов "под ключ"
> > Аутсорсинг ИТ-услуг
> > Аутсорсинг ИТ-инфраструктуры
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Willem Jan Withagen
On 02-05-17 19:16, Дробышевский, Владимир wrote:
> Willem,
> 
>   please note that you use 1.6TB Intel S3520 endurance rating in your
> calculations but then compare prices with 480GB model, which has only
> 945TBW or 1.1DWPD (
> https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
> ). It also worth to notice that S3710 has tremendously higher write
> speed\IOPS and especially SYNC writes. Haven't seen S3520 real sync
> write tests yet but don't think they differ much from S3510 ones.

Arrgh, you are right. I guess I had too many pages open, and copied the
wrong one.

But the good news is that the stats were already in favour of the 3710,
so this only strengthens that conclusion.

The bad news is that the sustained write speed goes down by a factor of 4.
So that is 5 Mbyte/sec, which is really easy to reach, even with hardware
of 2000.

--WjW


> Best regards,
> Vladimir
> 
> 2017-05-02 21:05 GMT+05:00 Willem Jan Withagen  >:
> 
> On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > Hi,
> >
> >>> What I'm trying to get from the list is /why/ the "enterprise" drives
> >>> are important. Performance? Reliability? Something else?
> >
> > performance, for sure (for SYNC write, 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> )
> >
> > Reliabity : yes, enteprise drive have supercapacitor in case of 
> powerfailure, and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> >
> >
> >>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. 
> Obviously
> >>> the single drive leaves more bays free for OSD disks, but is there any
> >>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 
> 4xS3520s
> >>> mean:
> >
> > where do you see this price difference ?
> >
> > for me , S3520 are around 25-30% cheaper than S3610
> 
> I just checked for the DCS3520 on
> 
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
> 
> 
> 
> And is has a TBW of 2925 (Terrabytes Write over life time) = 2,9 PB
> the warranty is 5 years.
> 
> Now if I do the math:
>   2925 * 104 /5 /365 /24 /60 = 1,14 Gbyte/min to be written.
>   which is approx 20Mbyte /sec
>   or approx 10Gbit/min = 0,15 Gbit/sec
> 
> And that is only 20% of the capacity of that SATA link.
> Also writing 20Mbyte/sec sustained is not really that hard for modern
> systems.
> 
> Now a 400Gb 3710 takes 8.3 PB, which is ruffly 3 times as much.
> so it will last 3 times longer.
> 
> Checking Amazone, I get
> $520 for the DC S3710-400G
> $300 for the DC S3520-480G
> 
> So that is less than a factor of 2 for using the S3710's and a 3 times
> longer lifetime. To be exact (8.3/520) / (2,9/300) = 1.65 more bang for
> your buck.
> 
> But still do not expect your SSDs to last very long if the write rate is
> much over that 20Mbyte/sec
> 
> --WjW
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> 
> -- 
> 
> С уважением,
> Дробышевский Владимир
> Компания "АйТи Город"
> +7 343 192
> 
> ИТ-консалтинг
> Поставка проектов "под ключ"
> Аутсорсинг ИТ-услуг
> Аутсорсинг ИТ-инфраструктуры

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD behavior for reads to a volume with no data written

2017-05-02 Thread Prashant Murthy
I wanted to add that I was particularly interested in the behavior with
filestore, but was also curious how this works on bluestore.

Prashant

On Mon, May 1, 2017 at 10:04 PM, Prashant Murthy 
wrote:

> Hi all,
>
> I was wondering what happens when reads are issued to an RBD device with
> no previously written data. Can somebody explain how such requests flow
> from rbd (client) into OSDs and whether any of these reads would hit the
> disks at all or whether OSD metadata would recognize that there is no data
> at the offsets requested and returns a bunch of zeros back to the client?
>
> Thanks,
> Prashant
>
> --
> Prashant Murthy
> Sr Director, Software Engineering | Salesforce
> Mobile: 919-961-3041 <(919)%20961-3041>
>
>
> --
>



-- 
Prashant Murthy
Sr Director, Software Engineering | Salesforce
Mobile: 919-961-3041 <(919)%20961-3041>


--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Дробышевский , Владимир
Willem,

  please note that you used the 1.6TB Intel S3520 endurance rating in your
calculations but then compared prices with the 480GB model, which has only
945TBW or 1.1DWPD (
https://ark.intel.com/products/93026/Intel-SSD-DC-S3520-Series-480GB-2_5in-SATA-6Gbs-3D1-MLC
). It is also worth noting that the S3710 has tremendously higher write
speed/IOPS, especially for SYNC writes. I haven't seen real S3520 sync write
tests yet, but I don't think they differ much from the S3510 ones.

Best regards,
Vladimir

2017-05-02 21:05 GMT+05:00 Willem Jan Withagen :

> On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> > Hi,
> >
> >>> What I'm trying to get from the list is /why/ the "enterprise" drives
> >>> are important. Performance? Reliability? Something else?
> >
> > performance, for sure (for SYNC write, https://www.sebastien-han.fr/
> blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-
> as-a-journal-device/)
> >
> > Reliabity : yes, enteprise drive have supercapacitor in case of
> powerfailure, and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> >
> >
> >>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
> >>> the single drive leaves more bays free for OSD disks, but is there any
> >>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't
> 4xS3520s
> >>> mean:
> >
> > where do you see this price difference ?
> >
> > for me , S3520 are around 25-30% cheaper than S3610
>
> I just checked for the DCS3520 on
> https://ark.intel.com/nl/products/93005/Intel-SSD-DC-
> S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC
>
> And is has a TBW of 2925 (Terrabytes Write over life time) = 2,9 PB
> the warranty is 5 years.
>
> Now if I do the math:
>   2925 * 104 /5 /365 /24 /60 = 1,14 Gbyte/min to be written.
>   which is approx 20Mbyte /sec
>   or approx 10Gbit/min = 0,15 Gbit/sec
>
> And that is only 20% of the capacity of that SATA link.
> Also writing 20Mbyte/sec sustained is not really that hard for modern
> systems.
>
> Now a 400Gb 3710 takes 8.3 PB, which is ruffly 3 times as much.
> so it will last 3 times longer.
>
> Checking Amazone, I get
> $520 for the DC S3710-400G
> $300 for the DC S3520-480G
>
> So that is less than a factor of 2 for using the S3710's and a 3 times
> longer lifetime. To be exact (8.3/520) / (2,9/300) = 1.65 more bang for
> your buck.
>
> But still do not expect your SSDs to last very long if the write rate is
> much over that 20Mbyte/sec
>
> --WjW
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Best regards,
Vladimir Drobyshevskiy
"IT Gorod" company ("АйТи Город")
+7 343 192

IT consulting
Turnkey project delivery
IT services outsourcing
IT infrastructure outsourcing
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-02 Thread Patrick Dinnen
Hi George,

Also, I should have mentioned before that the results I shared were with a
lowered cache pressure value (in an attempt to keep inodes in cache):
vm.vfs_cache_pressure = 10 (down from the default of 100). The results were a
little ambiguous, but it seemed like that did help somewhat. We haven't
touched vm.swappiness, but I will take a look at that.

Actually, we are running 128GB of RAM on the OSD machines right now; there
may have been some improvement since our 64GB tests, but nothing dramatic.

Packing our small objects is a possibility, though we'd prefer to avoid the
added complexity if possible.

The ten clients are currently theoretical. We're entirely focused on
getting the write performance where we need it right now. We'll likely use
RADOSGW if and when we get to a point where the raw write performance meets
our needs.

vmtouch looks interesting. Do you have any hints about particular ways to
use that?

Thank you, Patrick

On Tue, May 2, 2017 at 8:24 AM, George Mihaiescu 
wrote:

> Hi Patrick,
>
> You could add more RAM to the servers, which will not increase the cost too
> much, probably.
>
> You could change the swappiness value or use something like
> https://hoytech.com/vmtouch/ to pre-cache inode entries.
>
> You could tarball the smaller files before loading them into Ceph maybe.
>
> How are the ten clients accessing Ceph by the way?
>
> On May 1, 2017, at 14:23, Patrick Dinnen  wrote:
>
> One additional detail, we also did filestore testing using Jewel and saw
> substantially similar results to those on Kraken.
>
> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen  wrote:
>
>> Hello Ceph-users,
>>
>> Florian has been helping with some issues on our proof-of-concept
>> cluster, where we've been experiencing these issues. Thanks for the replies
>> so far. I wanted to jump in with some extra details.
>>
>> All of our testing has been with scrubbing turned off, to remove that as
>> a factor.
>>
>> Our use case requires a Ceph cluster to indefinitely store ~10 billion
>> files 20-60KB in size. We’ll begin with 4 billion files migrated from a
>> legacy storage system. Ongoing writes will be handled by ~10 client
>> machines and come in at a fairly steady 10-20 million files/day. Every file
>> (excluding the legacy 4 billion) will be read once by a single client
>> within hours of it’s initial write to the cluster. Future file read
>> requests will come from a single server and with a long-tail distribution,
>> with popular files read thousands of times a year but most read never or
>> virtually never.
>>
>> Our “production” design has 6-nodes, 24-OSDs (expandable to 48 OSDs). SSD
>> journals at a 1:4 ratio with HDDs, Each node looks like this:
>>
>>- 2 x E5-2660 8-core Xeons
>>- 64GB RAM DDR-3 PC1600
>>- 10Gb ceph-internal network (SFP+)
>>- LSI 9210-8i controller (IT mode)
>>- 4 x OSD 8TB HDDs, mix of two types
>>- Seagate ST8000DM002
>>   - HGST HDN728080ALE604
>>   - Mount options = xfs (rw,noatime,attr2,inode64,noquota)
>>   - 1 x SSD journal Intel 200GB DC S3700
>>
>>
>> Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a
>> replication level 2. We’re using rados bench to shotgun a lot of files into
>> our test pools. Specifically following these two steps:
>> ceph osd pool create poolofhopes 2048 2048 replicated ""
>> replicated_ruleset 5
>> rados -p poolofhopes bench -t 32 -b 2 3000 write --no-cleanup
>>
>> We leave the bench running for days at a time and watch the objects in
>> cluster count. We see performance that starts off decent and degrades over
>> time. There’s a very brief initial surge in write performance after which
>> things settle into the downward trending pattern.
>>
>> 1st hour - 2 million objects/hour
>> 20th hour - 1.9 million objects/hour
>> 40th hour - 1.7 million objects/hour
>>
>> This performance is not encouraging for us. We need to be writing 40
>> million objects per day (20 million files, duplicated twice). The rates
>> we’re seeing at the 40th hour of our bench would be sufficient to
>> achieve that. Those write rates are still falling though and we’re only at
>> a fraction of the number of objects in cluster that we need to handle. So,
>> the trends in performance suggests we shouldn’t count on having the write
>> performance we need for too long.
>>
>> If we repeat the process of creating a new pool and running the bench the
>> same pattern holds, good initial performance that gradually degrades.
>>
>> https://postimg.org/image/ovymk7n2d/
>> [caption:90 million objects written to a brand new, pre-split pool
>> (poolofhopes). There are already 330 million objects on the cluster in
>> other pools.]
>>
>> Our working theory is that the degradation over time may be related to
>> inode or dentry lookups that miss cache and lead to additional disk reads
>> and seek activity. There’s a suggestion that filestore directory splitting
>> may exacerbate that problem as additional/longer disk seeks occur

Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-02 Thread Patrick Dinnen
That's interesting Mark. It would be great if anyone has a definitive
answer on the potential syncfs-related downside of caching a lot of
inodes. A lot of our testing so far has been on the assumption that
more cached inodes is a pure good.

On Tue, May 2, 2017 at 9:19 AM, Mark Nelson  wrote:
> I used to advocate that users favor dentry/inode cache, but it turns out
> that it's not necessarily a good idea if you also are using syncfs.  It
> turns out that when syncfs is used, the kernel will iterate through all
> cached inodes, rather than just dirty inodes.  With high numbers of cached
> inodes, it can impact performance enough that it ends up being a problem.
> See Sage's post here:
>
> http://www.spinics.net/lists/ceph-devel/msg25644.html
>
> I don't remember if we ended up ripping syncfs out completely. Bluestore
> ditches the filesystem so we don't have to deal with this anymore
> regardless.  It's something to be aware of though.
>
> Mark
>
> On 05/02/2017 07:24 AM, George Mihaiescu wrote:
>>
>> Hi Patrick,
>>
>> You could add more RAM to the servers witch will not increase the cost
>> too much, probably.
>>
>> You could change swappiness value or use something
>> like https://hoytech.com/vmtouch/ to pre-cache inodes entries.
>>
>> You could tarball the smaller files before loading them into Ceph maybe.
>>
>> How are the ten clients accessing Ceph by the way?
>>
>> On May 1, 2017, at 14:23, Patrick Dinnen > > wrote:
>>
>>> One additional detail, we also did filestore testing using Jewel and
>>> saw substantially similar results to those on Kraken.
>>>
>>> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen >> > wrote:
>>>
>>> Hello Ceph-users,
>>>
>>> Florian has been helping with some issues on our proof-of-concept
>>> cluster, where we've been experiencing these issues. Thanks for
>>> the replies so far. I wanted to jump in with some extra details.
>>>
>>> All of our testing has been with scrubbing turned off, to remove
>>> that as a factor.
>>>
>>> Our use case requires a Ceph cluster to indefinitely store ~10
>>> billion files 20-60KB in size. We’ll begin with 4 billion files
>>> migrated from a legacy storage system. Ongoing writes will be
>>> handled by ~10 client machines and come in at a fairly steady
>>> 10-20 million files/day. Every file (excluding the legacy 4
>>> billion) will be read once by a single client within hours of it’s
>>> initial write to the cluster. Future file read requests will come
>>> from a single server and with a long-tail distribution, with
>>> popular files read thousands of times a year but most read never
>>> or virtually never.
>>>
>>> Our “production” design has 6-nodes, 24-OSDs (expandable to 48
>>> OSDs). SSD journals at a 1:4 ratio with HDDs, Each node looks like
>>> this:
>>>
>>>  *
>>> 2 x E5-2660 8-core Xeons
>>>  *
>>> 64GB RAM DDR-3 PC1600
>>>  *
>>> 10Gb ceph-internal network (SFP+)
>>>  *
>>> LSI 9210-8i controller (IT mode)
>>>  *
>>> 4 x OSD 8TB HDDs, mix of two types
>>>  o
>>> Seagate ST8000DM002
>>>  o
>>> HGST HDN728080ALE604
>>>  o
>>> Mount options = xfs (rw,noatime,attr2,inode64,noquota)
>>>  *
>>>
>>> 1 x SSD journal Intel 200GB DC S3700
>>>
>>>
>>> Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done
>>> with a replication level 2. We’re using rados bench to shotgun a
>>> lot of files into our test pools. Specifically following these two
>>> steps:
>>> ceph osd pool create poolofhopes 2048 2048 replicated ""
>>> replicated_ruleset 5
>>> rados -p poolofhopes bench -t 32 -b 2 3000 write --no-cleanup
>>>
>>> We leave the bench running for days at a time and watch the
>>> objects in cluster count. We see performance that starts off
>>> decent and degrades over time. There’s a very brief initial surge
>>> in write performance after which things settle into the downward
>>> trending pattern.
>>>
>>> 1st hour - 2 million objects/hour
>>> 20th hour - 1.9 million objects/hour
>>> 40th hour - 1.7 million objects/hour
>>>
>>> This performance is not encouraging for us. We need to be writing
>>> 40 million objects per day (20 million files, duplicated twice).
>>> The rates we’re seeing at the 40th hour of our bench would be
>>> suffecient to achieve that. Those write rates are still falling
>>> though and we’re only at a fraction of the number of objects in
>>> cluster that we need to handle. So, the trends in performance
>>> suggests we shouldn’t count on having the write performance we
>>> need for too long.
>>>
>>> If we repeat the process of creating a new pool and running the
>>> bench the same pattern holds, good initial performance that
>>> gradually degrades.
>>>
>>

Re: [ceph-users] Power Failure

2017-05-02 Thread Reed Dier
One scenario I can offer here as it relates to powercut/hard shutdown.

I had my data center get struck by lightning very early on in my Ceph lifespan 
when I was testing and evaluating.

I had 8 OSDs on 8 hosts, and each OSD was a single-drive RAID0 VD on my LSI RAID 
controller.
On the RAID controller, I did not have a BBU. (mistake 1)
On the disks, I was using on-disk cache (pdcache), as well as write back cache 
at the controller level. (mistake 2, mistake 3)

It was a learning experience, as it corrupted leveldb on 6 of the 8 OSDs, because 
the on-disk cache had only partially flushed writes to persistent storage.

So the moral of the story is to make sure pdcache is configured off if 
expecting power failures.
> $ sudo /opt/MegaRAID/storcli/storcli64 /c0 add vd type=raid0 drives=252:0 
> pdcache=off


And a BBU would also reduce the likelihood of writes going missing.

Reed

> On May 2, 2017, at 3:12 AM, Tomáš Kukrál  wrote:
> 
> 
> 
> Hi,
> It really depends on the type of power failure ...
> 
> A normal poweroff of the cluster is fine ... I've been managing a large cluster 
> and we were forced to do a total poweroff twice a year. It worked fine: we 
> just safely unmounted all clients, then set the noout flag and powered machines 
> down.
> 
> A power cut (hard shutdown) can be a big problem and I would expect problems 
> here.
> 
> Tom
> 
> On 04-22 05:04, Santu Roy wrote:
>> Hi
>> 
>> I am very new to Ceph and have been studying for a few days for a deployment
>> of a Ceph cluster. I am going to deploy Ceph in a small data center where
>> power failure is a big problem: we have a single power supply, a single UPS
>> and a standby generator. So what happens if all nodes go down due to a power
>> failure? Will it cause any problem restarting services when power is restored?
>> 
>> looking for your suggestion..
>> 
>> -- 
>> 
>> *Regards*Santu Roy
> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph CBT simulate down OSDs

2017-05-02 Thread Henry Ngo
Mark,

Thanks for the detailed explanation and example. This is exactly what I was
looking for.

Best,
Henry Ngo


On Tue, May 2, 2017 at 9:29 AM, Mark Nelson  wrote:

> Hi Henry,
>
> The recovery test mechanism is basically a state machine launched in
> another thread that runs concurrently during whatever benchmark you want to
> run.  The basic premise is that it waits for a configurable amount of "pre"
> time to let the benchmarks get started, then marks osd down/out, waits
> until the cluster is healthy, then marks them up/in, and waits until they
> are healthy again.  This happens while your chosen background load runs.
> At the end, there is a post phase where you can specify how long you
> would like the benchmark to continue running after the recovery process has
> completed.  ceph health is run every second during this process and
> recorded in a log to keep track of what's happening while the tests are
> running.
>
> Typically once the recovery test is complete, a callback in the benchmark
> module is made to let the benchmark know the recovery test is done.
> Usually this will kill the benchmark (ie you might choose to run a 4 hour
> fio test and then let the recovery process inform the fio benchmark module
> to kill fio).  Alternatively, you can tell it to keep repeating the process
> until the benchmark itself completes with the "repeat" option.
>
> The actual yaml to do this is quite simple.  Simply put a "recovery_test"
> section in your cluster section, tell it which OSDs you want to mark down,
> and optionally give it repeat, pre_time, and post_time options.
>
> Here's an example:
>
> recovery_test:
>   osds: [3,6]
>   repeat: True
>   pre_time: 60
>   post_time: 60
>
> Here's a paper where this functionality was actually used to predict how
> long our thrashing tests in the ceph QA lab would take based on HDDs/SSDs.
> We knew our thrashing tests were using most of the time in the lab and we
> were able to use this to determine how much buying SSDs would speed up the
> QA runs.
>
> https://drive.google.com/open?id=0B2gTBZrkrnpZYVpPb3VpTkw5aFk
>
> See appendix B for the ceph.conf that was used at the time for the tests.
> Also, please do not use the "-n size=64k" mkfs.xfs option in that yaml
> file.  We later found out that it can cause XFS to deadlock and may not be
> safe to use.
>
> Mark
>
>
> On 05/02/2017 10:58 AM, Henry Ngo wrote:
>
>> Hi all,
>>
>> CBT documentation states that this can be achieved. If so, how do I set
>> it up? What do I add in the yaml file? Below is an EC example. Thanks.
>>
>> cluster:
>>
>>   head:"ceph@head"
>>
>>   clients:["ceph@client"]
>>
>>   osds:["ceph@osd"]
>>
>>   mons:["ceph@mon"]
>>
>>   osds_per_node:1
>>
>>   fs:xfs
>>
>>   mkfs_opts:-f -i size=2048
>>
>>   mount_opts:-o inode64,noatime,logbsize=256k
>>
>>   conf_file:/home/ceph/ceph-tools/cbt/example/ceph.conf
>>
>>   ceph.conf:/home/ceph/ceph-tools/cbt/example/ceph.conf
>>
>>   iterations:3
>>
>>   rebuild_every_test:False
>>
>>   tmp_dir:"/tmp/cbt"
>>
>>   pool_profiles:
>>
>> erasure:
>>
>>   pg_size:4096
>>
>>   pgp_size:4096
>>
>>   replication:'erasure'
>>
>>   erasure_profile:'myec'
>>
>> benchmarks:
>>
>>   radosbench:
>>
>> op_size:[4194304, 524288, 4096]
>>
>> write_only:False
>>
>> time:300
>>
>> concurrent_ops:[128]
>>
>> concurrent_procs:1
>>
>> use_existing:True
>>
>> pool_profile:erasure
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph CBT simulate down OSDs

2017-05-02 Thread Mark Nelson

Hi Henry,

The recovery test mechanism is basically a state machine launched in 
another thread that runs concurrently during whatever benchmark you want 
to run.  The basic premise is that it waits for a configurable amount of 
"pre" time to let the benchmarks get started, then marks osd down/out, 
waits until the cluster is healthy, then marks them up/in, and waits 
until they are healthy again.  This happens while your chosen background 
load runs.  At the end, there is a post phase where you can specify 
how long you would like the benchmark to continue running after the 
recovery process has completed.  ceph health is run every second during 
this process and recorded in a log to keep track of what's happening 
while the tests are running.


Typically once the recovery test is complete, a callback in the 
benchmark module is made to let the benchmark know the recovery test is 
done.  Usually this will kill the benchmark (ie you might choose to run 
a 4 hour fio test and then let the recovery process inform the fio 
benchmark module to kill fio).  Alternatively, you can tell it to keep 
repeating the process until the benchmark itself completes with the 
"repeat" option.


The actual yaml to do this is quite simple.  Simply put a 
"recovery_test" section in your cluster section, tell it which OSDs you 
want to mark down, and optionally give it repeat, pre_time, and 
post_time options.


Here's an example:

recovery_test:
  osds: [3,6]
  repeat: True
  pre_time: 60
  post_time: 60

Here's a paper where this functionality was actually used to predict how 
long our thrashing tests in the ceph QA lab would take based on 
HDDs/SSDs.  We knew our thrashing tests were using most of the time in 
the lab and we were able to use this to determine how much buying SSDs 
would speed up the QA runs.


https://drive.google.com/open?id=0B2gTBZrkrnpZYVpPb3VpTkw5aFk

See appendix B for the ceph.conf that was used at the time for the 
tests.  Also, please do not use the "-n size=64k" mkfs.xfs option in 
that yaml file.  We later found out that it can cause XFS to deadlock 
and may not be safe to use.


Mark

On 05/02/2017 10:58 AM, Henry Ngo wrote:

Hi all,

CBT documentation states that this can be achieved. If so, how do I set
it up? What do I add in the yaml file? Below is an EC example. Thanks.

cluster:

  head:"ceph@head"

  clients:["ceph@client"]

  osds:["ceph@osd"]

  mons:["ceph@mon"]

  osds_per_node:1

  fs:xfs

  mkfs_opts:-f -i size=2048

  mount_opts:-o inode64,noatime,logbsize=256k

  conf_file:/home/ceph/ceph-tools/cbt/example/ceph.conf

  ceph.conf:/home/ceph/ceph-tools/cbt/example/ceph.conf

  iterations:3

  rebuild_every_test:False

  tmp_dir:"/tmp/cbt"

  pool_profiles:

erasure:

  pg_size:4096

  pgp_size:4096

  replication:'erasure'

  erasure_profile:'myec'

benchmarks:

  radosbench:

op_size:[4194304, 524288, 4096]

write_only:False

time:300

concurrent_ops:[128]

concurrent_procs:1

use_existing:True

pool_profile:erasure



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Willem Jan Withagen
On 27-4-2017 20:46, Alexandre DERUMIER wrote:
> Hi,
> 
>>> What I'm trying to get from the list is /why/ the "enterprise" drives 
>>> are important. Performance? Reliability? Something else? 
> 
> performance, for sure (for SYNC write, 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/)
> 
> Reliabity : yes, enteprise drive have supercapacitor in case of powerfailure, 
> and endurance (1 DWPD for 3520, 3 DWPD for 3610)
> 
> 
>>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>>> the single drive leaves more bays free for OSD disks, but is there any
>>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>>> mean:
> 
> where do you see this price difference ?
> 
> for me , S3520 are around 25-30% cheaper than S3610

I just checked for the DCS3520 on
https://ark.intel.com/nl/products/93005/Intel-SSD-DC-S3520-Series-1_6TB-2_5in-SATA-6Gbs-3D1-MLC

And it has a TBW of 2925 (terabytes written over its lifetime) = 2.9 PB;
the warranty is 5 years.

Now if I do the math:
  2925 * 1024 / 5 / 365 / 24 / 60 = 1.14 Gbyte/min to be written,
  which is approx 20 Mbyte/sec,
  or approx 10 Gbit/min = 0.15 Gbit/sec.

And that is only 20% of the capacity of that SATA link.
Also writing 20Mbyte/sec sustained is not really that hard for modern
systems.

Now a 400GB 3710 takes 8.3 PB, which is roughly 3 times as much,
so it will last 3 times longer.

Checking Amazon, I get
$520 for the DC S3710-400G
$300 for the DC S3520-480G

So that is less than a factor of 2 in price for using the S3710s, and a 3 times
longer lifetime. To be exact, (8.3/520) / (2.9/300) = 1.65 times more bang for
your buck.

But still, do not expect your SSDs to last very long if the write rate is
much over that 20 Mbyte/sec.
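Restating the endurance arithmetic above as a small sketch (not part of the
original message), using the TBW ratings and the Amazon prices quoted in the
thread:

SECONDS_PER_YEAR = 365 * 24 * 3600

def write_budget_mb_per_s(tbw_terabytes, warranty_years=5):
    # sustained write rate that uses up the rated TBW exactly at end of warranty
    return tbw_terabytes * 1024 * 1024 / (warranty_years * SECONDS_PER_YEAR)

print('S3520 1.6TB: %.1f MB/s' % write_budget_mb_per_s(2925))  # ~19.5, i.e. the ~20 MB/s above
print('S3710 400GB: %.1f MB/s' % write_budget_mb_per_s(8300))  # 8.3 PB rating

# endurance (PB) per dollar: S3710-400G at $520 vs S3520-480G at $300
print('bang for buck: %.2f' % ((8.3 / 520) / (2.9 / 300)))     # ~1.65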

--WjW



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs metadata damage and scrub error

2017-05-02 Thread David Zafman


James,

You have an omap corruption.  It is likely caused by a bug which 
has already been identified.  A fix for that problem is available but it 
is still pending backport for the next Jewel point release.  All 4 of 
your replicas have different "omap_digest" values.


Instead of the xattrs, the ceph-osdomap-tool --command 
dump-objects-with-keys output from OSDs 3, 10, 11, 23 would be 
interesting to compare.


***WARNING*** Please backup your data before doing any repair attempts.

If you can upgrade to Kraken v11.2.0, it will auto repair the omaps on 
ceph-osd start up.  It will likely still require a ceph pg repair to 
make the 4 replicas consistent with each other.  The final result may be 
the reappearance of removed MDS files in the directory.


If you can recover the data, you could remove the directory entirely and 
rebuild it.  The original bug was typically triggered during omap deletion 
in a large directory, which corresponds to an individual unlink 
in CephFS.


If you can build a branch in github to get the newer ceph-osdomap-tool 
you could try to use it to repair the omaps.


David



On 5/2/17 5:05 AM, James Eckersall wrote:

Hi,

I'm having some issues with a ceph cluster.  It's an 8-node cluster running
Jewel ceph-10.2.7-0.el7.x86_64 on CentOS 7.
This cluster provides RBDs and a CephFS filesystem to a number of clients.

ceph health detail is showing the following errors:

pg 2.9 is active+clean+inconsistent, acting [3,10,11,23]
1 scrub errors
mds0: Metadata damage detected


The pg 2.9 is in the cephfs_metadata pool (id 2).

I've looked at the OSD logs for OSD 3, which is the primary for this PG,
but the only thing that appears relating to this PG is the following:

log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors

After initiating a ceph pg repair 2.9, I see the following in the primary
OSD log:

log_channel(cluster) log [ERR] : 2.9 repair 1 errors, 0 fixed
log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors


I found the below command in a previous ceph-users post.  Running this
returns the following:

# rados list-inconsistent-obj 2.9
{"epoch":23738,"inconsistents":[{"object":{"name":"1411194.","nspace":"","locator":"","snap":"head","version":14737091},"errors":["omap_digest_mismatch"],"union_shard_errors":[],"selected_object_info":"2:9758b358:::1411194.:head(33456'14737091
mds.0.214448:248532 dirty|omap|data_digest s 0 uv 14737091 dd
)","shards":[{"osd":3,"errors":[],"size":0,"omap_digest":"0x6748eef3","data_digest":"0x"},{"osd":10,"errors":[],"size":0,"omap_digest":"0xa791d5a4","data_digest":"0x"},{"osd":11,"errors":[],"size":0,"omap_digest":"0x53f46ab0","data_digest":"0x"},{"osd":23,"errors":[],"size":0,"omap_digest":"0x97b80594","data_digest":"0x"}]}]}


So from this, I think that the object in PG 2.9 with the problem is
1411194..

This is what I see on the filesystem on the 4 OSD's this PG resides on:

-rw-r--r--. 1 ceph ceph 0 Apr 27 12:31
/var/lib/ceph/osd/ceph-3/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 15 22:05
/var/lib/ceph/osd/ceph-10/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 15 22:07
/var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 16 03:58
/var/lib/ceph/osd/ceph-23/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2

The extended attrs are as follows, although I have no idea what any of them
mean.

# file:
var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
user.ceph._=0sDwj5BAM1ABQxMDAwMDQxMTE5NC4wMDAwMDAwMP7/6RrNGgAAAgAGAxwCAP8AAP//ABUn4QAAu4IAAK4m4QAAu4IAAAICFQIAAOSZDAAAsEUDjUoIWUgWsQQCAhUVJ+EAABwAAACNSghZESm8BP///w==
user.ceph._@1=0s//8=
user.ceph._layout=0sAgIY//8A
user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAgIjjxFBAAABAAAPdHViZWFtYXRldXIubmV0qdgCAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgACAh4QDUEAAAEAAAoAAAB3cC1jb250ZW50NAMCAhgNDUEAAAEAAAQAAABodG1sIAECAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAICMwAAADkAAQ==
user.ceph._parent@1
=0sAAAfNDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCgIcAQAIcHJvamVjdHPBAgcAAAIAAA==
user.ceph.snapset=0sAgIZAAABAA==
user.cephos.seq=0sAQEQgcAqFgA=
user.cephos.spill_out=0sMAA=
getfattr: Removing leading '/' from absolute path names

# file:
var/lib/ceph/osd/ceph-3/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1AC

[ceph-users] Ceph CBT simulate down OSDs

2017-05-02 Thread Henry Ngo
Hi all,

CBT documentation states that this can be achieved. If so, how do I set it
up? What do I add in the yaml file? Below is an EC example. Thanks.

cluster:
  head: "ceph@head"
  clients: ["ceph@client"]
  osds: ["ceph@osd"]
  mons: ["ceph@mon"]
  osds_per_node: 1
  fs: xfs
  mkfs_opts: -f -i size=2048
  mount_opts: -o inode64,noatime,logbsize=256k
  conf_file: /home/ceph/ceph-tools/cbt/example/ceph.conf
  ceph.conf: /home/ceph/ceph-tools/cbt/example/ceph.conf
  iterations: 3
  rebuild_every_test: False
  tmp_dir: "/tmp/cbt"
  pool_profiles:
    erasure:
      pg_size: 4096
      pgp_size: 4096
      replication: 'erasure'
      erasure_profile: 'myec'
benchmarks:
  radosbench:
    op_size: [ 4194304, 524288, 4096 ]
    write_only: False
    time: 300
    concurrent_ops: [ 128 ]
    concurrent_procs: 1
    use_existing: True
    pool_profile: erasure
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph FS installation issue on ubuntu 16.04

2017-05-02 Thread dheeraj dubey
Hi,

Getting the following error while installing the "ceph-fs-common" package on
Ubuntu 16.04:

$sudo apt-get install ceph-fs-common
Reading package lists...
Building dependency tree...
Reading state information...
You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
 librbd-dev : Depends: librbd1 (= 10.2.6-0ubuntu0.16.04.1) but
10.2.7-1xenial is to be installed


ceph version is 10.2.7

Thanks in Advance,

Regards,
Dheeraj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-02 Thread Mark Nelson
I used to advocate that users favor dentry/inode cache, but it turns out 
that it's not necessarily a good idea if you also are using syncfs.  It 
turns out that when syncfs is used, the kernel will iterate through all 
cached inodes, rather than just dirty inodes.  With high numbers of 
cached inodes, it can impact performance enough that it ends up being a 
problem.  See Sage's post here:


http://www.spinics.net/lists/ceph-devel/msg25644.html

I don't remember if we ended up ripping syncfs out completely. 
Bluestore ditches the filesystem so we don't have to deal with this 
anymore regardless.  It's something to be aware of though.

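A quick way to watch how large the cached dentry/inode population gets on an
OSD node is something like this (a sketch; the slab names assume XFS-backed
filestore OSDs):

  watch -n 10 'sudo slabtop -o | egrep "dentry|xfs_inode"'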

Mark

On 05/02/2017 07:24 AM, George Mihaiescu wrote:

Hi Patrick,

You could add more RAM to the servers, which will not increase the cost
too much, probably.

You could change swappiness value or use something
like https://hoytech.com/vmtouch/ to pre-cache inodes entries.

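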
You could tarball the smaller files before loading them into Ceph maybe.

How are the ten clients accessing Ceph by the way?

On May 1, 2017, at 14:23, Patrick Dinnen <pdin...@gmail.com> wrote:


One additional detail, we also did filestore testing using Jewel and
saw substantially similar results to those on Kraken.

On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdin...@gmail.com> wrote:

Hello Ceph-users,

Florian has been helping with some issues on our proof-of-concept
cluster, where we've been experiencing these issues. Thanks for
the replies so far. I wanted to jump in with some extra details.

All of our testing has been with scrubbing turned off, to remove
that as a factor.

Our use case requires a Ceph cluster to indefinitely store ~10
billion files 20-60KB in size. We’ll begin with 4 billion files
migrated from a legacy storage system. Ongoing writes will be
handled by ~10 client machines and come in at a fairly steady
10-20 million files/day. Every file (excluding the legacy 4
billion) will be read once by a single client within hours of its
initial write to the cluster. Future file read requests will come
from a single server and with a long-tail distribution, with
popular files read thousands of times a year but most read never
or virtually never.

Our “production” design has 6-nodes, 24-OSDs (expandable to 48
OSDs). SSD journals at a 1:4 ratio with HDDs, Each node looks like
this:

  * 2 x E5-2660 8-core Xeons
  * 64GB RAM DDR-3 PC1600
  * 10Gb ceph-internal network (SFP+)
  * LSI 9210-8i controller (IT mode)
  * 4 x OSD 8TB HDDs, mix of two types
      o Seagate ST8000DM002
      o HGST HDN728080ALE604
      o Mount options = xfs (rw,noatime,attr2,inode64,noquota)
  * 1 x SSD journal Intel 200GB DC S3700


Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done
with a replication level 2. We’re using rados bench to shotgun a
lot of files into our test pools. Specifically following these two
steps:
ceph osd pool create poolofhopes 2048 2048 replicated ""
replicated_ruleset 5
rados -p poolofhopes bench -t 32 -b 2 3000 write --no-cleanup

We leave the bench running for days at a time and watch the
objects in cluster count. We see performance that starts off
decent and degrades over time. There’s a very brief initial surge
in write performance after which things settle into the downward
trending pattern.

1st hour - 2 million objects/hour
20th hour - 1.9 million objects/hour
40th hour - 1.7 million objects/hour

This performance is not encouraging for us. We need to be writing
40 million objects per day (20 million files, duplicated twice).
The rates we’re seeing at the 40th hour of our bench would be
sufficient to achieve that. Those write rates are still falling
though and we’re only at a fraction of the number of objects in
cluster that we need to handle. So, the trends in performance
suggests we shouldn’t count on having the write performance we
need for too long.

If we repeat the process of creating a new pool and running the
bench the same pattern holds, good initial performance that
gradually degrades.

https://postimg.org/image/ovymk7n2d/

[caption:90 million objects written to a brand new, pre-split pool
(poolofhopes). There are already 330 million objects on the
cluster in other pools.]

Our working theory is that the degradation over time may be
related to inode or dentry lookups that miss cache and lead to
additional disk reads and seek activity. There’s a suggestion that
filestore directory splitting may exacerbate that problem as
additional/longer disk seeks occur related to what’s in which XFS
allocation group. We have found pre-split pools useful in one
major way, t

Re: [ceph-users] ceph-deploy to a particular version

2017-05-02 Thread German Anders
I think you can do $ ceph-deploy install --release <release> --repo-url
http://download.ceph.com/..., also you
can change the --release flag to --dev or --testing and specify the
version. I've done it with the release and dev flags and it works great :)

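For example (the release name, repo path and hostnames here are illustrative,
not from the original command):

  ceph-deploy install --release jewel \
      --repo-url http://download.ceph.com/rpm-jewel/el7 node1 node2
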
hope it helps

best,


*German*

2017-05-02 10:03 GMT-03:00 David Turner :

> You can indeed install ceph via yum and then utilize ceph-deploy to finish
> things up. You just skip the Ceph install portion. I haven't done it in a
> while and you might need to manually place the config and key on the new
> servers yourself.
>
> On Tue, May 2, 2017, 8:57 AM Puff, Jonathon 
> wrote:
>
>> From what I can find ceph-deploy only allows installs for a release, i.e
>> jewel which is giving me 10.2.7, but I’d like to specify the particular
>> update.  For instance, I want to go to 10.2.3.  Do I need to avoid
>> ceph-deploy entirely to do this or can I install the correct version via
>> yum then leverage ceph-deploy for the remaining configuration?
>>
>>
>>
>> -JP
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy to a particular version

2017-05-02 Thread David Turner
You can indeed install ceph via yum and then utilize ceph-deploy to finish
things up. You just skip the Ceph install portion. I haven't done it in a
while and you might need to manually place the config and key on the new
servers yourself.

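A hedged sketch of that yum-then-ceph-deploy flow (the package names, versions
and hostnames below are placeholders for the download.ceph.com el7 packages,
not tested commands):

  yum install ceph-10.2.3 ceph-common-10.2.3
  ceph-deploy admin <node>                  # push ceph.conf + admin keyring
  ceph-deploy osd prepare <node>:<device>
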
On Tue, May 2, 2017, 8:57 AM Puff, Jonathon 
wrote:

> From what I can find ceph-deploy only allows installs for a release, i.e
> jewel which is giving me 10.2.7, but I’d like to specify the particular
> update.  For instance, I want to go to 10.2.3.  Do I need to avoid
> ceph-deploy entirely to do this or can I install the correct version via
> yum then leverage ceph-deploy for the remaining configuration?
>
>
>
> -JP
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-05-02 Thread David Turner
You would need to have 1TB of SSDs for every 2TB of HDDs used this way. If
you set up your cluster with those ratios, you would fill up evenly.

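For comparison, the primary-affinity approach mentioned further down would
look roughly like this (OSD ids are placeholders, and on pre-Luminous
releases the mon option has to be enabled first, which is an assumption
worth checking against your exact version):

  ceph tell mon.\* injectargs '--mon_osd_allow_primary_affinity=true'
  for osd in 4 5 6 7; do ceph osd primary-affinity osd.$osd 0; done   # HDD OSDs
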
On Tue, May 2, 2017, 8:37 AM George Mihaiescu  wrote:

> One problem that I can see with this setup is that you will fill up the
> SSDs holding the primary replica before the HDD ones, if they are much
> different in size.
>
> Other than that, it's a very inventive solution to increase read speeds
> without using a possibly buggy cache configuration.
>
>
>
> > On Apr 20, 2017, at 05:25, Richard Hesketh 
> wrote:
> >
> >> On 19/04/17 21:08, Reed Dier wrote:
> >> Hi Maxime,
> >>
> >> This is a very interesting concept. Instead of the primary affinity
> being used to choose SSD for primary copy, you set crush rule to first
> choose an osd in the ‘ssd-root’, then the ‘hdd-root’ for the second set.
> >>
> >> And with 'step chooseleaf first {num}’
> >>> If {num} > 0 && < pool-num-replicas, choose that many buckets.
> >> So 1 chooses that bucket
> >>> If {num} < 0, it means pool-num-replicas - {num}
> >> And -1 means it will fill remaining replicas on this bucket.
> >>
> >> This is a very interesting concept, one I had not considered.
> >> Really appreciate this feedback.
> >>
> >> Thanks,
> >>
> >> Reed
> >>
> >>> On Apr 19, 2017, at 12:15 PM, Maxime Guyot 
> wrote:
> >>>
> >>> Hi,
> >>>
> > Assuming production level, we would keep a pretty close 1:2 SSD:HDD
> ratio,
>  1:4-5 is common but depends on your needs and the devices in
> question, ie. assuming LFF drives and that you aren’t using crummy journals.
> >>>
> >>> You might be speaking about different ratios here. I think that
> Anthony is speaking about journal/OSD and Reed speaking about capacity
> ratio between and HDD and SSD tier/root.
> >>>
> >>> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies
> on HDD), like Richard says you’ll get much better random read performance
> with primary OSD on SSD but write performance won’t be amazing since you
> still have 2 HDD copies to write before ACK.
> >>>
> >>> I know the doc suggests using primary affinity but since it’s a OSD
> level setting it does not play well with other storage tiers so I searched
> for other options. From what I have tested, a rule that selects the
> first/primary OSD from the ssd-root then the rest of the copies from the
> hdd-root works. Though I am not sure it is *guaranteed* that the first OSD
> selected will be primary.
> >>>
> >>> “rule hybrid {
> >>> ruleset 2
> >>> type replicated
> >>> min_size 1
> >>> max_size 10
> >>> step take ssd-root
> >>> step chooseleaf firstn 1 type host
> >>> step emit
> >>> step take hdd-root
> >>> step chooseleaf firstn -1 type host
> >>> step emit
> >>> }”
> >>>
> >>> Cheers,
> >>> Maxime
> >
> > FWIW splitting my HDDs and SSDs into two separate roots and using a
> crush rule to first choose a host from the SSD root and take remaining
> replicas on the HDD root was the way I did it, too. By inspection, it did
> seem that all PGs in the pool had an SSD for a primary, so I think this is
> a reliable way of doing it. You would of course end up with an acting
> primary on one of the slow spinners for a brief period if you lost an SSD
> for whatever reason and it needed to rebalance.
> >
> > The only downside is that if you have your SSD and HDD OSDs on the same
> physical hosts I'm not sure how you set up your failure domains and rules
> to make sure that you don't take an SSD primary and HDD replica on the same
> host. In my case, SSDs and HDDs are on different hosts, so it didn't matter
> to me.
> > --
> > Richard Hesketh
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy to a particular version

2017-05-02 Thread Puff, Jonathon
From what I can find, ceph-deploy only allows installs for a release, i.e. jewel,
which is giving me 10.2.7, but I’d like to specify a particular update.  For
instance, I want to go to 10.2.3.  Do I need to avoid ceph-deploy entirely to
do this, or can I install the correct version via yum and then leverage
ceph-deploy for the remaining configuration?

-JP
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large META directory within each OSD's directory

2017-05-02 Thread David Turner
It was fixed in 0.94.8; the release notes say "osd: remove all stale osdmaps
in handle_osd_map() (issue#13990, pr#9090, Kefu Chai)".

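The loop workaround mentioned in the quoted message below could be as simple
as this (a sketch; "rbd" is a placeholder pool name, and min_size should be
set to the value the pool already uses so nothing actually changes):

  while true; do ceph osd pool set rbd min_size 1; sleep 15; done
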
On Mon, May 1, 2017, 10:41 PM 许雪寒  wrote:

> Thanks ☺
>
>
>
> We are using hammer 0.94.5, Which commit is supposed to fix this bug?
> Thank you.
>
>
>
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: April 25, 2017 20:17
> To: 许雪寒; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Large META directory within each OSD's directory
>
>
>
> Which version of Ceph are you running? My guess is Hammer pre-0.94.9.
> There is an osdmap cache bug that was introduced with Hammer that was fixed
> in 0.94.9. The work around is to restart all of the OSDs in your cluster.
> After restarting the OSDs, the cluster will start to clean up osdmaps 20 at
> a time each time you generate a new map. If you don't generate maps often,
> then you can write a loop that does something like setting the min size for
> a pool to the same thing every 10-20 seconds until you catch up. (Note,
> that doesn't change any settings, but it does update the map).
>
>
>
> On Tue, Apr 25, 2017, 4:45 AM 许雪寒  wrote:
>
> Hi, everyone.
>
> Recently, in one of our clusters, we found that the “META” directory in
> each OSD’s working directory is getting extremely large, about 17GB each.
> Why hasn’t the OSD cleared those old osdmaps? How should I deal with this
> problem?
>
> Thank you☺
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-05-02 Thread George Mihaiescu
One problem that I can see with this setup is that you will fill up the SSDs 
holding the primary replica before the HDD ones, if they are much different in 
size.

Other than that, it's a very inventive solution to increase read speeds without 
using a possibly buggy cache configuration.



> On Apr 20, 2017, at 05:25, Richard Hesketh  
> wrote:
> 
>> On 19/04/17 21:08, Reed Dier wrote:
>> Hi Maxime,
>> 
>> This is a very interesting concept. Instead of the primary affinity being 
>> used to choose SSD for primary copy, you set crush rule to first choose an 
>> osd in the ‘ssd-root’, then the ‘hdd-root’ for the second set.
>> 
>> And with 'step chooseleaf first {num}’
>>> If {num} > 0 && < pool-num-replicas, choose that many buckets. 
>> So 1 chooses that bucket
>>> If {num} < 0, it means pool-num-replicas - {num}
>> And -1 means it will fill remaining replicas on this bucket.
>> 
>> This is a very interesting concept, one I had not considered.
>> Really appreciate this feedback.
>> 
>> Thanks,
>> 
>> Reed
>> 
>>> On Apr 19, 2017, at 12:15 PM, Maxime Guyot  wrote:
>>> 
>>> Hi,
>>> 
> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
 1:4-5 is common but depends on your needs and the devices in question, ie. 
 assuming LFF drives and that you aren’t using crummy journals.
>>> 
>>> You might be speaking about different ratios here. I think that Anthony is 
>>> speaking about journal/OSD and Reed speaking about capacity ratio between 
>>> and HDD and SSD tier/root. 
>>> 
>>> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
>>> HDD), like Richard says you’ll get much better random read performance with 
>>> primary OSD on SSD but write performance won’t be amazing since you still 
>>> have 2 HDD copies to write before ACK. 
>>> 
>>> I know the doc suggests using primary affinity but since it’s a OSD level 
>>> setting it does not play well with other storage tiers so I searched for 
>>> other options. From what I have tested, a rule that selects the 
>>> first/primary OSD from the ssd-root then the rest of the copies from the 
>>> hdd-root works. Though I am not sure it is *guaranteed* that the first OSD 
>>> selected will be primary.
>>> 
>>> “rule hybrid {
>>> ruleset 2
>>> type replicated
>>> min_size 1
>>> max_size 10
>>> step take ssd-root
>>> step chooseleaf firstn 1 type host
>>> step emit
>>> step take hdd-root
>>> step chooseleaf firstn -1 type host
>>> step emit
>>> }”
>>> 
>>> Cheers,
>>> Maxime
> 
> FWIW splitting my HDDs and SSDs into two separate roots and using a crush 
> rule to first choose a host from the SSD root and take remaining replicas on 
> the HDD root was the way I did it, too. By inspection, it did seem that all 
> PGs in the pool had an SSD for a primary, so I think this is a reliable way 
> of doing it. You would of course end up with an acting primary on one of the 
> slow spinners for a brief period if you lost an SSD for whatever reason and 
> it needed to rebalance.
> 
> The only downside is that if you have your SSD and HDD OSDs on the same 
> physical hosts I'm not sure how you set up your failure domains and rules to 
> make sure that you don't take an SSD primary and HDD replica on the same 
> host. In my case, SSDs and HDDs are on different hosts, so it didn't matter 
> to me.
> -- 
> Richard Hesketh
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-02 Thread George Mihaiescu
Hi Patrick,

You could add more RAM to the servers, which will not increase the cost too
much, probably.

You could change swappiness value or use something like 
https://hoytech.com/vmtouch/ to pre-cache inodes entries.

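A sketch of those two tweaks (the values and the warmed path are assumptions,
not recommendations):

  sysctl -w vm.swappiness=10
  vmtouch -t /var/lib/ceph/osd/ceph-0/current   # pre-warm a directory tree
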
You could tarball the smaller files before loading them into Ceph maybe.

How are the ten clients accessing Ceph by the way?

> On May 1, 2017, at 14:23, Patrick Dinnen  wrote:
> 
> One additional detail, we also did filestore testing using Jewel and saw 
> substantially similar results to those on Kraken.
> 
>> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen  wrote:
>> Hello Ceph-users,
>> 
>> Florian has been helping with some issues on our proof-of-concept cluster, 
>> where we've been experiencing these issues. Thanks for the replies so far. I 
>> wanted to jump in with some extra details.
>> 
>> All of our testing has been with scrubbing turned off, to remove that as a 
>> factor.
>> 
>> Our use case requires a Ceph cluster to indefinitely store ~10 billion files 
>> 20-60KB in size. We’ll begin with 4 billion files migrated from a legacy 
>> storage system. Ongoing writes will be handled by ~10 client machines and 
>> come in at a fairly steady 10-20 million files/day. Every file (excluding 
>> the legacy 4 billion) will be read once by a single client within hours of 
>> its initial write to the cluster. Future file read requests will come from
>> a single server and with a long-tail distribution, with popular files read 
>> thousands of times a year but most read never or virtually never.
>> 
>> 
>> Our “production” design has 6-nodes, 24-OSDs (expandable to 48 OSDs). SSD 
>> journals at a 1:4 ratio with HDDs, Each node looks like this:
>> 2 x E5-2660 8-core Xeons
>> 64GB RAM DDR-3 PC1600
>> 10Gb ceph-internal network (SFP+) 
>> LSI 9210-8i controller (IT mode)
>> 4 x OSD 8TB HDDs, mix of two types
>> Seagate ST8000DM002
>> HGST HDN728080ALE604
>> Mount options = xfs (rw,noatime,attr2,inode64,noquota) 
>> 1 x SSD journal Intel 200GB DC S3700
>> 
>> Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a 
>> replication level 2. We’re using rados bench to shotgun a lot of files into 
>> our test pools. Specifically following these two steps: 
>> ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 
>> 5
>> rados -p poolofhopes bench -t 32 -b 2 3000 write --no-cleanup
>> 
>> We leave the bench running for days at a time and watch the objects in 
>> cluster count. We see performance that starts off decent and degrades over 
>> time. There’s a very brief initial surge in write performance after which 
>> things settle into the downward trending pattern.
>> 
>> 1st hour - 2 million objects/hour
>> 20th hour - 1.9 million objects/hour 
>> 40th hour - 1.7 million objects/hour
>> 
>> This performance is not encouraging for us. We need to be writing 40 million 
>> objects per day (20 million files, duplicated twice). The rates we’re seeing 
>> at the 40th hour of our bench would be sufficient to achieve that. Those
>> write rates are still falling though and we’re only at a fraction of the 
>> number of objects in cluster that we need to handle. So, the trends in 
>> performance suggests we shouldn’t count on having the write performance we 
>> need for too long.
>> 
>> If we repeat the process of creating a new pool and running the bench the 
>> same pattern holds, good initial performance that gradually degrades.
>> 
>> https://postimg.org/image/ovymk7n2d/
>> [caption:90 million objects written to a brand new, pre-split pool 
>> (poolofhopes). There are already 330 million objects on the cluster in other 
>> pools.]
>> 
>> Our working theory is that the degradation over time may be related to inode 
>> or dentry lookups that miss cache and lead to additional disk reads and seek 
>> activity. There’s a suggestion that filestore directory splitting may 
>> exacerbate that problem as additional/longer disk seeks occur related to 
>> what’s in which XFS allocation group. We have found pre-split pools useful
>> in one major way, they avoid periods of near-zero write performance that we 
>> have put down to the active splitting of directories (the "thundering herd" 
>> effect). The overall downward curve seems to remain the same whether we 
>> pre-split or not.
>> 
>> The thundering herd seems to be kept in check by an appropriate pre-split. 
>> Bluestore may or may not be a solution, but uncertainty and stability within 
>> our fairly tight timeline don't recommend it to us. Right now our big 
>> question is "how can we avoid the gradual degradation in write performance 
>> over time?". 
>> 
>> Thank you, Patrick
>> 
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-02 Thread Mark Nelson

On 05/02/2017 01:32 AM, Frédéric Nass wrote:



Le 28/04/2017 à 17:03, Mark Nelson a écrit :

On 04/28/2017 08:23 AM, Frédéric Nass wrote:


Le 28/04/2017 à 15:19, Frédéric Nass a écrit :


Hi Florian, Wido,

That's interesting. I ran some bluestore benchmarks a few weeks ago on
Luminous dev (1st release) and came to the same (early) conclusion
regarding the performance drop with many small objects on bluestore,
whatever the number of PGs is on a pool. Here is the graph I generated
from the results:



The test was run on a 36 OSDs cluster (3x R730xd with 12x 4TB SAS
drives) with rocksdb and WAL on same SAS drives.
Test consisted of multiple runs of the following command on a size 1
pool : rados bench -p pool-test-mom02h06-2 120 write -b 4K -t 128
--no-cleanup


Correction: test was made on a size 1 pool hosted on a single 12x OSDs
node. The rados bench was run from this single host (to this single
host).

Frédéric.


If you happen to have time, I would be very interested to see what the
compaction statistics look like in rocksdb (available via the osd
logs).  We actually wrote a tool that's in the cbt tools directory
that can parse the data and look at what rocksdb is doing.  Here's
some of the data we collected last fall:

https://drive.google.com/open?id=0B2gTBZrkrnpZRFdiYjFRNmxLblU

The idea there was to try to determine how WAL buffer size / count and
min_alloc size affected the amount of compaction work that rocksdb was
doing.  There are also some more general compaction statistics that
are more human readable in the logs that are worth looking at (ie
things like write amp and such).

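If anyone wants to pull those statistics out, something like this should work
(assuming rocksdb logging is routed into the OSD log, e.g. with a non-zero
debug rocksdb level):

  grep -A 25 "Compaction Stats" /var/log/ceph/ceph-osd.0.log | less
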
The gist of it is that as you do lots of small writes the amount of
metadata that has to be kept track of in rocksdb increases, and
rocksdb ends up doing a *lot* of compaction work, with the associated
read and write amplification.  The only ways to really deal with this
are to either reduce the amount of metadata (onodes, extents, etc) or
see if we can find any ways to reduce the amount of work rocksdb has
to do.

On the first point, increasing the min_alloc size in bluestore tends
to help, but with tradeoffs.  Any io smaller than the min_alloc size
will be doubly-written like with filestore journals, so you trade
reducing metadata for an extra WAL write. We did a bunch of testing
last fall and at least on NVMe it was better to use a 16k min_alloc
size and eat the WAL write than use a 4K min_alloc size, skip the WAL
write, but shove more metadata at rocksdb.  For HDDs, I wouldn't
expect too bad of behavior with the default 64k min alloc size, but it
sounds like it could be a problem based on your results.  That's why
it would be interesting to see if that's what's happening during your
tests.

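For reference, the knobs being discussed look like this in ceph.conf (the
values are just the ones mentioned above, not a recommendation):

  [osd]
  # anything smaller than min_alloc_size is double-written via the WAL
  bluestore min alloc size ssd = 16384
  bluestore min alloc size hdd = 65536
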
Another issue is that short lived WAL writes potentially can leak into
level0 and cause additional compaction work.  Sage has a pretty clever
idea to fix this but we need someone knowledgeable about rocksdb to go
in and try to implement it (or something like it).

Anyway, we still see a significant amount of work being done by
rocksdb due to compaction, most of it being random reads.  We actually
spoke about this quite a bit yesterday at the performance meeting.  If
you look at a wallclock profile of 4K random writes, you'll see a ton
of work being doing on compact (about 70% in total of thread 2):

https://paste.fedoraproject.org/paste/uS3LHRHw2Yma0iUYSkgKOl5M1UNdIGYhyRLivL9gydE=


One thing we are still confused about is why rocksdb is doing
random_reads for compaction rather than sequential reads.  It would be
really great if someone that knows rocksdb well could help us
understand why it's doing this.

Ultimately for something like RBD I suspect the performance will stop
dropping once you've completely filled the disk with 4k random writes.
For RGW type work, the more tiny objects you add the more data rocksdb
has to keep track of and the more rocksdb is going to slow down.  It's
not the same problem filestore suffers from, but it's similar in that
the more keys/bytes/levels rocksdb has to deal with, the more data
gets moved around between levels, the more background work that
happens, the more likely we are waiting on rocksdb before we can write
more data.

Mark



Hi Mark,

This is very interesting. I actually did use "bluefs buffered io = true"
and "bluestore compression mode = aggressive" during the tests, as I saw
these 2 options were improving write performance (4x), but didn't look
at the logs for compaction statistics. The nodes I used for the tests
have made it to production, so I won't be able to reproduce the test any
time soon, but I will when we get new hardware.

Frederic.



FWIW, I spent some time yesterday digging into rocksdb's compaction 
code.  It looks like every time compaction is done, the iterator ends
up walking through doing "random reads" that ultimately work their way 
down to bluefs, but in reality it's seeking almost entirely sequentially 
as it walks through the SST.  In the blktrace I did the behavior

[ceph-users] cephfs metadata damage and scrub error

2017-05-02 Thread James Eckersall
Hi,

I'm having some issues with a ceph cluster.  It's an 8 node cluster running
Jewel ceph-10.2.7-0.el7.x86_64 on CentOS 7.
This cluster provides RBDs and a CephFS filesystem to a number of clients.

ceph health detail is showing the following errors:

pg 2.9 is active+clean+inconsistent, acting [3,10,11,23]
1 scrub errors
mds0: Metadata damage detected


The pg 2.9 is in the cephfs_metadata pool (id 2).

I've looked at the OSD logs for OSD 3, which is the primary for this PG,
but the only thing that appears relating to this PG is the following:

log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors

After initiating a ceph pg repair 2.9, I see the following in the primary
OSD log:

log_channel(cluster) log [ERR] : 2.9 repair 1 errors, 0 fixed
log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors


I found the below command in a previous ceph-users post.  Running this
returns the following:

# rados list-inconsistent-obj 2.9
{"epoch":23738,"inconsistents":[{"object":{"name":"1411194.","nspace":"","locator":"","snap":"head","version":14737091},"errors":["omap_digest_mismatch"],"union_shard_errors":[],"selected_object_info":"2:9758b358:::1411194.:head(33456'14737091
mds.0.214448:248532 dirty|omap|data_digest s 0 uv 14737091 dd
)","shards":[{"osd":3,"errors":[],"size":0,"omap_digest":"0x6748eef3","data_digest":"0x"},{"osd":10,"errors":[],"size":0,"omap_digest":"0xa791d5a4","data_digest":"0x"},{"osd":11,"errors":[],"size":0,"omap_digest":"0x53f46ab0","data_digest":"0x"},{"osd":23,"errors":[],"size":0,"omap_digest":"0x97b80594","data_digest":"0x"}]}]}


So from this, I think that the object in PG 2.9 with the problem is
1411194..

This is what I see on the filesystem on the 4 OSD's this PG resides on:

-rw-r--r--. 1 ceph ceph 0 Apr 27 12:31
/var/lib/ceph/osd/ceph-3/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 15 22:05
/var/lib/ceph/osd/ceph-10/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 15 22:07
/var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 16 03:58
/var/lib/ceph/osd/ceph-23/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2

The extended attrs are as follows, although I have no idea what any of them
mean.

# file:
var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
user.ceph._=0sDwj5BAM1ABQxMDAwMDQxMTE5NC4wMDAwMDAwMP7/6RrNGgAAAgAGAxwCAP8AAP//ABUn4QAAu4IAAK4m4QAAu4IAAAICFQIAAOSZDAAAsEUDjUoIWUgWsQQCAhUVJ+EAABwAAACNSghZESm8BP///w==
user.ceph._@1=0s//8=
user.ceph._layout=0sAgIY//8A
user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAgIjjxFBAAABAAAPdHViZWFtYXRldXIubmV0qdgCAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgACAh4QDUEAAAEAAAoAAAB3cC1jb250ZW50NAMCAhgNDUEAAAEAAAQAAABodG1sIAECAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAICMwAAADkAAQ==
user.ceph._parent@1
=0sAAAfNDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCgIcAQAIcHJvamVjdHPBAgcAAAIAAA==
user.ceph.snapset=0sAgIZAAABAA==
user.cephos.seq=0sAQEQgcAqFgA=
user.cephos.spill_out=0sMAA=
getfattr: Removing leading '/' from absolute path names

# file:
var/lib/ceph/osd/ceph-3/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
user.ceph._=0sDwj5BAM1ABQxMDAwMDQxMTE5NC4wMDAwMDAwMP7/6RrNGgAAAgAGAxwCAP8AAP//ABUn4QAAu4IAAK4m4QAAu4IAAAICFQIAAOSZDAAAsEUDjUoIWUgWsQQCAhUVJ+EAABwAAACNSghZESm8BP///w==
user.ceph._@1=0s//8=
user.ceph._layout=0sAgIY//8A
user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAgIjjxFBAAABAAAPdHViZWFtYXRldXIubmV0qdgCAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgACAh4QDUEAAAEAAAoAAAB3cC1jb250ZW50NAMCAhgNDUEAAAEAAAQAAABodG1sIAECAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAICMwAAADkAAQ==
user.ceph._parent@1
=0sAAAfNDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCgIcAQAIcHJvamVjdHPBAgcAAAIAAA==
user.ceph.snapset=0sAgIZAAABAA==
user.cephos.seq=0sAQEQZaQ9GwAAAgA=
user.cephos.spill_out=0sMAA=
getfattr: Removing leading '/' from absolute path names

# file:
var/lib/

Re: [ceph-users] Power Failure

2017-05-02 Thread Tomáš Kukrál



Hi,
It really depends on the type of power failure ...

Normal poweroff of the cluster is fine ... I've been managing a large
cluster and we were forced to do a total poweroff twice a year. It worked
fine: we just safely unmounted all clients, then set the noout flag
and powered the machines down.

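For a planned shutdown like that, the flag handling is just:

  ceph osd set noout      # before powering the OSD nodes down
  # ... power everything back on, wait for the OSDs to rejoin ...
  ceph osd unset noout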

A power cut (hard shutdown) can be a big problem and I would expect
problems there.


Tom

On 04-22 05:04, Santu Roy wrote:

Hi

I am very new to Ceph. I have been studying for a few days for a deployment
of a Ceph cluster. I am going to deploy Ceph in a small data center where
power failure is a big problem. We have a single power supply, a single UPS
and a standby generator. So what happens if all nodes go down due to a power
failure? Will it create any problems restarting services when power is
restored?

Looking for your suggestions.

--

Regards,
Santu Roy



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-02 Thread Eneko Lacunza

Hi,

Has anyone used the new S3520's? They are 1 DWPD and so much closer to 
the S3610 than previous S35x0's.


Cheers

El 01/05/17 a las 17:41, David Turner escribió:
I can attest to this.  I had a cluster that used 3510's for the first 
rack and then switched to 3710's after that.  We had 3TB drives and 
every single 3510 ran out of writes after 1.5 years.  We noticed 
because we tracked down incredibly slow performance to a subset of 
OSDs and each time they had a common journal.  This happened for about 
2 weeks and 4 journals.  That was when we realized that they were all
3510 journals and SMART showed not only the journals we had tracked 
down, but all of the 3510's were out of writes.  Replacing all of your 
journals every 1.5 years is way more expensive than the increased cost 
of the 3710's.  That was our use case and experience, but I'm pretty 
sure that any cluster large enough to fill at least most of a rack 
will run into this much sooner than later.

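One way to keep an eye on this (a sketch; the attribute names are
vendor-specific and the device path is a placeholder):

  sudo smartctl -A /dev/sdb | egrep -i "Media_Wearout_Indicator|Wear_Leveling"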

On Mon, May 1, 2017 at 11:15 AM Maxime Guyot wrote:


Hi,

Lots of good info on SSD endurance in this thread.

For Ceph journal you should also consider the size of the backing
OSDs: the SSD journal won't last as long if backing 5x8TB OSDs or
5x1TB OSDs.

For example, the S3510 480GB (275TB of endurance), if backing
5x8TB (40TB) OSDs, will provide very little endurance, assuming
triple replication you will be able to fill the OSDs twice and
that's about it (275/(5x8)/3).
On the other end of the scale a 1.2TB S3710 backing 5x1TB will be
able to fill them 1620 times before running out of endurance
(24300/(5x1)/3).

Ultimately it depends on your workload. Some people can get away
with S3510 as journals if the workload is read intensive, but in
most cases the higher endurance is a safe bet (S3710 or S3610).

Cheers,
Maxime


On Mon, 1 May 2017 at 11:04 Jens Dueholm Christensen <j...@ramboll.com> wrote:

Sorry for topposting, but..

The Intel 35xx drives are rated for a much lower DWPD
(drive-writes-per-day) than the 36xx or 37xx models.

Keep in mind that a single SSD that acts as journal for 5 OSDs
will recieve ALL writes for those 5 OSDs before the data is
moved off to the OSDs actual data drives.

This makes for quite a lot of writes, and along with the
consumer/enterprise advice others have written about, your SSD
journal devices will recieve quite a lot of writes over time.

The S3510 is rated for 0.3 DWPD for 5 years

(http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3510-spec.html)
The S3610 is rated for 3 DWPD for 5 years 
(http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3610-spec.html)

The S3710 is rated for 10 DWPD for 5 years

(http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3710-spec.html)

A 480GB S3510 has no endurance left once you have written
0.275PB to it.
A 480GB S3610 has no endurance left once you have written
3.7PB to it.
A 400GB S3710 has no endurance left once you have written
8.3PB to it.

This makes for quite a lot of difference over time - even if a
S3510 wil only act as journal for 1 or 2 OSDs, it will wear
out much much much faster than others.

And I know I've used the xx10 models above, but the xx00
models have all been replaced by those newer models now.

And yes, the xx10 models are using MLC NAND, but so were the
xx00 models, that have a proven trackrecord and delivers what
Intel promised in the datasheet.

You could try and take a look at some of the enterprise SSDs
that Samsung has launched.
Price-wise they are very competitive to Intel, but I want to
see (or at least hear from others) if they can deliver what
their datasheet promises.
Samsungs consumer SSDs did not (840/850 Pro), so I'm only
using S3710s in my cluster.


Before I created our own cluster some time ago, I found these
threads from the mailinglist regarding the exact same disks we
had been expecting to use (Samsung 840/850 Pro), that was
quickly changed to Intel S3710s:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17369.html

A longish thread about Samsung consumer drives:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000572.html
- highlights from that thread:
  -

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000610.html
  -

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000611.html
  -

http://lists.ceph.com/pipermail/ceph-users-ceph.com/201