Re: [ceph-users] RGW pool contents

2015-12-22 Thread Somnath Roy
Thanks for responding; unfortunately, the Cosbench setup is no longer there.
Good to know that there are cleanup steps for Cosbench data.

Regards
Somnath

From: ghislain.cheval...@orange.com [mailto:ghislain.cheval...@orange.com]
Sent: Tuesday, December 22, 2015 11:28 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: RGW pool contents

Hi,
Did you try to use the cleanup and dispose steps of cosbench?
brgds

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Tuesday, 24 November 2015 20:49
To: ceph-users@lists.ceph.com
Subject: [ceph-users] RGW pool contents

Hi Yehuda/RGW experts,
I have one cluster with RGW up and running at the customer site.
I did some heavy performance testing on it with CosBench and, as a result, wrote a 
significant amount of data to showcase performance.
Over time, the customer also wrote a significant amount of data into the cluster 
using the S3 API.
Now, I want to remove the buckets/objects created by CosBench and need some 
help with that.
I ran the following command to list the buckets.

"radosgw-admin bucket list"

The output is the following snippet..

"rgwdef42",
"rgwdefghijklmnop79",
"rgwyzabc43",
"rgwdefgh43",
"rgwdefghijklm200",

..
..

My understanding is that CosBench should create containers with a "mycontainers_" 
prefix and objects with a "myobjects_" prefix (?). But nothing like that shows up in the 
output of the above command.

Next, I tried to list the contents of the different rgw pools..

rados -p .rgw.buckets.index ls

.dir.default.5407.17
.dir.default.6063.24
.dir.default.6068.23
.dir.default.6046.7
.dir.default.6065.44
.dir.default.5409.3
...
...

Nothing with an rgw prefix... Shouldn't the bucket index objects have a prefix 
similar to the bucket names?


Next, I tried to list the actual objects...
rados -p .rgw.buckets ls

default.6662.5_myobjects57862
default.5193.18_myobjects6615
default.5410.5_myobjects68518
default.6661.8_myobjects7407
default.5410.22_myobjects54939
default.6651.6_myobjects23790


...

So, looking at these, it seems the CosBench run created the .dir.default.* index 
objects and the default.*_myobjects* data objects (?)

But these buckets are not listed by the first "radosgw-admin" command. Why is that?

Next, I listed the contents of the .rgw pool and here is the output..

rados -p .rgw ls

.bucket.meta.rgwdefghijklm78:default.6069.18
rgwdef42
rgwdefghijklmnop79
rgwyzabc43
.bucket.meta.rgwdefghijklmnopqr71:default.6655.3
rgwdefgh43
.bucket.meta.rgwdefghijklm119:default.6066.25
rgwdefghijklm200
.bucket.meta.rgwxghi2:default.5203.4
rgwxjk17
rgwdefghijklm196

...
...

It seems this pool has the buckets listed by the radosgw-admin command.
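
I am guessing the ".bucket.meta." entries above tie each bucket name to an internal 
marker like default.6069.18, and that marker is what prefixes the .dir.* index objects 
and the *_myobjects* data objects. If I understand the tooling correctly, something 
like the following, run against one of the buckets above, should confirm the mapping 
(please correct me if I have misread the layout):

radosgw-admin metadata get bucket:rgwdefgh43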

Can anybody explain what the .rgw pool is supposed to contain?

Also, what is the difference between the .users.uid and .users pools?


Appreciate any help on this.
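
Also, assuming the CosBench buckets can be identified, I am hoping something like 
the following would remove a bucket and purge its objects in one go (the bucket name 
is just taken from the listing above); please correct me if that is not the right tool 
for the cleanup:

radosgw-admin bucket rm --bucket=rgwdef42 --purge-objects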

Thanks & Regards
Somnath

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] errors when install-deps.sh

2015-12-22 Thread gongfengguang
Hi all,

   When I run ./install-deps.sh, I get the following errors:

 

   --> Already installed : junit-4.11-8.el7.noarch

No uninstalled build requires

Running virtualenv with interpreter /usr/bin/python2.7

New python executable in
/home/gongfenguang/ceph-infernalis/install-deps-python2.7/bin/python2.7

Also creating executable in
/home/gongfenguang/ceph-infernalis/install-deps-python2.7/bin/python

Installing Setuptools...done.

Installing Pip...done.

Downloading/unpacking distribute>=0.7.3

Cannot fetch index base URL https://pypi.python.org/simple/

Could not find any downloads that satisfy the requirement
distribute>=0.7.3

Cleaning up...

No distributions at all found for distribute>=0.7.3

Storing complete log in /root/.pip/pip.log

 

/root/.pip/pip.log is as follows:

Downloading/unpacking distribute>=0.7.3

 

  Getting page https://pypi.python.org/simple/distribute/

  Could not fetch URL https://pypi.python.org/simple/distribute/: 

  Will skip URL https://pypi.python.org/simple/distribute/ when looking for
download links for distribute>=0.7.3

  Getting page https://pypi.python.org/simple/

  Could not fetch URL https://pypi.python.org/simple/: 

  Will skip URL https://pypi.python.org/simple/ when looking for download
links for distribute>=0.7.3

  Cannot fetch index base URL https://pypi.python.org/simple/

 

  URLs to search for versions for distribute>=0.7.3:

  * https://pypi.python.org/simple/distribute/

  Getting page https://pypi.python.org/simple/distribute/

  Could not fetch URL https://pypi.python.org/simple/distribute/: 

  Will skip URL https://pypi.python.org/simple/distribute/ when looking for
download links for distribute>=0.7.3

  Could not find any downloads that satisfy the requirement
distribute>=0.7.3

 

Cleaning up...

 

  Removing temporary dir
/home/gongfenguang/ceph-infernalis/install-deps-python2.7/build...

No distributions at all found for distribute>=0.7.3

 

Exception information:

Traceback (most recent call last):

  File
"/home/gongfenguang/ceph-infernalis/install-deps-python2.7/lib/python2.7/sit
e-packages/pip/basecommand.py", line 134, in main

status = self.run(options, args)

  File
"/home/gongfenguang/ceph-infernalis/install-deps-python2.7/lib/python2.7/sit
e-packages/pip/commands/install.py", line 236, in run

requirement_set.prepare_files(finder, force_root_egg_info=self.bundle,
bundle=self.bundle)

  File
"/home/gongfenguang/ceph-infernalis/install-deps-python2.7/lib/python2.7/sit
e-packages/pip/req.py", line 1085, in prepare_files

url = finder.find_requirement(req_to_install, upgrade=self.upgrade)

  File
"/home/gongfenguang/ceph-infernalis/install-deps-python2.7/lib/python2.7/sit
e-packages/pip/index.py", line 265, in find_requirement

raise DistributionNotFound('No distributions at all found for %s' % req)

DistributionNotFound: No distributions at all found for distribute>=0.7.3

 

 

Has anyone encountered this situation?
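
From the log it looks like pip simply cannot reach https://pypi.python.org/simple/ from
this host, so I suspect a proxy or firewall issue on my side rather than a problem in
install-deps.sh itself. I plan to retry with the proxy environment exported first,
roughly like this (proxy host and port are hypothetical):

export http_proxy=http://proxy.example.com:3128
export https_proxy=http://proxy.example.com:3128
./install-deps.sh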

 

thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW pool contents

2015-12-22 Thread ghislain.chevalier
Hi,
Did you try to use the cleanup and dispose steps of cosbench?
brgds

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Tuesday, 24 November 2015 20:49
To: ceph-users@lists.ceph.com
Subject: [ceph-users] RGW pool contents

Hi Yehuda/RGW experts,
I have one cluster with RGW up and running at the customer site.
I did some heavy performance testing on it with CosBench and, as a result, wrote a 
significant amount of data to showcase performance.
Over time, the customer also wrote a significant amount of data into the cluster 
using the S3 API.
Now, I want to remove the buckets/objects created by CosBench and need some 
help with that.
I ran the following command to list the buckets.

"radosgw-admin bucket list"

The output is the following snippet..

"rgwdef42",
"rgwdefghijklmnop79",
"rgwyzabc43",
"rgwdefgh43",
"rgwdefghijklm200",

..
..

My understanding is that CosBench should create containers with a "mycontainers_" 
prefix and objects with a "myobjects_" prefix (?). But nothing like that shows up in the 
output of the above command.

Next, I tried to list the contents of the different rgw pools..

rados -p .rgw.buckets.index ls

.dir.default.5407.17
.dir.default.6063.24
.dir.default.6068.23
.dir.default.6046.7
.dir.default.6065.44
.dir.default.5409.3
...
...

Nothing with an rgw prefix... Shouldn't the bucket index objects have a prefix 
similar to the bucket names?


Next, I tried to list the actual objects...
rados -p .rgw.buckets ls

default.6662.5_myobjects57862
default.5193.18_myobjects6615
default.5410.5_myobjects68518
default.6661.8_myobjects7407
default.5410.22_myobjects54939
default.6651.6_myobjects23790


...

So, looking at these, it seems the CosBench run created the .dir.default.* index 
objects and the default.*_myobjects* data objects (?)

But these buckets are not listed by the first "radosgw-admin" command. Why is that?

Next, I listed the contents of the .rgw pool and here is the output..

rados -p .rgw ls

.bucket.meta.rgwdefghijklm78:default.6069.18
rgwdef42
rgwdefghijklmnop79
rgwyzabc43
.bucket.meta.rgwdefghijklmnopqr71:default.6655.3
rgwdefgh43
.bucket.meta.rgwdefghijklm119:default.6066.25
rgwdefghijklm200
.bucket.meta.rgwxghi2:default.5203.4
rgwxjk17
rgwdefghijklm196

...
...

It seems this pool has the buckets listed by the radosgw-admin command.

Can anybody explain what the .rgw pool is supposed to contain?

Also, what is the difference between the .users.uid and .users pools?


Appreciate any help on this.

Thanks & Regards
Somnath


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph journal failed?

2015-12-22 Thread Christian Balzer

Hello,

On Wed, 23 Dec 2015 11:46:58 +0800 yuyang wrote:

> ok, You give me the answer, thanks a lot.
>

Assume that a journal SSD failure means a loss of all associated OSDs.

So in your case a single SSD failure will cause the data loss of a whole
node.

If you have 15 or more of those nodes, your cluster should be able to
handle the resulting I/O storm from recovering 9 OSDs, but with just a few
nodes you will have a severe performance impact and also risk data loss if
other failures occur during recovery.

Lastly, a 1:9 SSD journal to SATA ratio also sounds wrong when it comes to
performance: your SSD would need to be able to handle about 900 MB/s of sync
writes (nine HDDs at roughly 100 MB/s each), and that is very expensive territory.

Christian
 
> But, I don't know the answer to your questions.
> 
> Maybe someone else can answer.
> 
> -- Original --
> From:  "Loris Cuoghi";;
> Date:  Tue, Dec 22, 2015 07:31 PM
> To:  "ceph-users";
> Subject:  Re: [ceph-users] ceph journal failed?
> 
> On 22/12/2015 09:42, yuyang wrote:
> > Hello, everyone,
> [snip snap]
> 
> Hi
> 
> > If the SSD failed or down, can the OSD work?
> > Is the osd down or only can be read?
> 
> If you don't have a journal anymore, the OSD has already quit, as it 
> can't continue writing, nor can it ensure data consistency, since writes 
> have probably been interrupted.
> 
> The Ceph community's general assumption is that a dead journal means a dead
> OSD.
> 
> But.
> 
> http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
> 
> How does this apply in reality?
> Is the solution that Sébastien is proposing viable?
> In most/all cases?
> Will the OSD continue chugging along after this kind of surgery?
> Is it necessary/suggested to deep scrub ASAP the OSD's placement groups?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph journal failed?

2015-12-22 Thread yuyang
ok, You give me the answer, thanks a lot.

But, I don't know the answer to your questions.

Maybe someone else can answer.

-- Original --
From:  "Loris Cuoghi";;
Date:  Tue, Dec 22, 2015 07:31 PM
To:  "ceph-users";
Subject:  Re: [ceph-users] ceph journal failed?

On 22/12/2015 09:42, yuyang wrote:
> Hello, everyone,
[snip snap]

Hi

> If the SSD failed or down, can the OSD work?
> Is the osd down or only can be read?

If you don't have a journal anymore, the OSD has already quit, as it 
can't continue writing, nor can it ensure data consistency, since writes 
have probably been interrupted.

The Ceph community's general assumption is that a dead journal means a dead OSD.

But.

http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/

How does this apply in reality?
Is the solution that Sébastien is proposing viable?
In most/all cases?
Will the OSD continue chugging along after this kind of surgery?
Is it necessary/suggested to deep scrub ASAP the OSD's placement groups?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-22 Thread Yan, Zheng
On Tue, Dec 22, 2015 at 9:29 PM, Francois Lafont  wrote:
> Hello,
>
> On 21/12/2015 04:47, Yan, Zheng wrote:
>
>> fio tests AIO performance in this case. cephfs does not handle AIO
>> properly, AIO is actually SYNC IO. that's why cephfs is so slow in
>> this case.
>
> Ah ok, thanks for this very interesting information.
>
> So, in fact, the question I ask myself is: how do I test my cephfs
> to know whether my performance is correct (or not) for my hardware
> configuration?
>
> Because currently, in fact, I'm unable to say whether I have correct performance
> (not incredible, but in line with my hardware configuration) or whether I
> have a problem. ;)
>

It's hard to tell. Basically, data IO performance on cephfs should be
similar to data IO performance on rbd.

Regards
Yan, Zheng

> --
> François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware for a new installation

2015-12-22 Thread Sam Huracan
Hi,
I think the ratio is based on SSD max throughput / HDD max throughput.

For example, one SSD sustaining 400 MB/s could act as the journal for four SAS
drives writing 100 MB/s each.

This is just my idea; I'm also building Ceph storage for OpenStack.
Could you share your experiences?
On Dec 23, 2015 03:04, "Pshem Kowalczyk"  wrote:

> Hi,
>
> We'll be building our first production-grade ceph cluster to back an
> openstack setup (a few hundred VMs). Initially we'll need only about
> 20-30TB of storage, but that's likely to grow. I'm unsure about required
> IOPs (there are multiple, very different classes of workloads to consider).
> Currently we use a mixture of on-blade disks and commercial storage
> solutions (NetApp and EMC).
>
> We're a Cisco UCS shop for compute and I would like to know if anyone here
> has experience with the C3160 storage server Cisco offers. Any particular
> pitfalls to avoid?
>
> I would like to use SSD for journals, but I'm not sure what's the
> performance (and durability) of the UCS-C3X60-12G0400 really is.
> What's considered a reasonable ratio of HDD to journal SSD? 5:1, 4:1?
>
> kind regards
> Pshem
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Hardware for a new installation

2015-12-22 Thread Pshem Kowalczyk
Hi,

We'll be building our first production-grade ceph cluster to back an
openstack setup (a few hundred VMs). Initially we'll need only about
20-30TB of storage, but that's likely to grow. I'm unsure about required
IOPs (there are multiple, very different classes of workloads to consider).
Currently we use a mixture of on-blade disks and commercial storage
solutions (NetApp and EMC).

We're a Cisco UCS shop for compute and I would like to know if anyone here
has experience with the C3160 storage server Cisco offers. Any particular
pitfalls to avoid?

I would like to use SSD for journals, but I'm not sure what's the
performance (and durability) of the UCS-C3X60-12G0400 really is.
What's considered a reasonable ratio of HDD to journal SSD? 5:1, 4:1?

kind regards
Pshem
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread koukou73gr
Even the cheapest stuff nowadays has a more or less decent wear
leveling algorithm built into its controller, so this won't be a
problem. Wear leveling algorithms cycle the blocks internally so wear
evens out across the whole disk.

-K.

On 12/22/2015 06:57 PM, Alan Johnson wrote:
> I would also add that the journal activity is write intensive so a small part 
> of the drive would get excessive writes if the journal and data are 
> co-located on an SSD. This would also be the case where an SSD has multiple 
> journals associated with many HDDs.
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Lionel Bouton
On 22/12/2015 17:36, Tyler Bishop wrote:
> Write endurance is kinda bullshit.
>
> We have crucial 960gb drives storing data and we've only managed to take 2% 
> off the drives life in the period of a year and hundreds of tb written weekly.

This is not really helpful without more context.

This would help if you stated:
* the exact model including the firmware version,
* how many SSDs are used to handle these hundreds of TB written weekly
(if you use 1000 SSDs your numbers don't mean the same thing that if you
use only 10 of them on your cluster),
* if you use them as journals, storage or both,
* if it is 200TB or 900TB weekly,
* if you include the replication size in the amount written (and/or the
double writes if you use them for both journal and store).

If you imply that the 2% is below what you would expect according to the
total TBW specified for your model you clearly have a problem and I
wouldn't trust these drives: the manufacturer is lying to you one way or
another. If it underestimates the TBW then fine (but why would it look
bad on purpose ?) but if it overestimates the reported life expectancy
because of a bug you can expect a catastrophic failure if you hit the
real limit before replacing the SSDs.

One year is a bit short to have a real experience on endurance too: some
consumer-level drives (Samsung 850 PRO IIRC) have been known to fail
early (far before their expected life fell to 0 according to their
SMART attributes), although I don't remember seeing any reports for
Crucial SSD yet. If you replaced Crucial 960gb by 850 Pro in your
statement I'd clearly worry about your cluster failing badly in the
short future. Without knowing more about the exact model you use and the
real numbers for your cluster I don't know what could happen.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Alan Johnson
I would also add that the journal activity is write intensive so a small part 
of the drive would get excessive writes if the journal and data are co-located 
on an SSD. This would also be the case where an SSD has multiple journals 
associated with many HDDs.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Tuesday, December 22, 2015 11:46 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

On 12/22/2015 05:36 PM, Tyler Bishop wrote:
> Write endurance is kinda bullshit.
> 
> We have crucial 960gb drives storing data and we've only managed to take 2% 
> off the drives life in the period of a year and hundreds of tb written weekly.
> 
> 
> Stuff is way more durable than anyone gives it credit.
> 
> 

No, that is absolutely not true. I've seen multiple SSDs fail in Ceph clusters. 
Small Samsung 850 Pro SSDs worn out within 4 months in heavy write-intensive 
Ceph clusters.

> - Original Message -
> From: "Lionel Bouton" 
> To: "Andrei Mikhailovsky" , "ceph-users" 
> 
> Sent: Tuesday, December 22, 2015 11:04:26 AM
> Subject: Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB 
> fio results
> 
On 22/12/2015 13:43, Andrei Mikhailovsky wrote:
>> Hello guys,
>>
>> Was wondering if anyone has done testing on Samsung PM863 120 GB version to 
>> see how it performs? IMHO the 480GB version seems like a waste for the 
>> journal as you only need to have a small disk size to fit 3-4 osd journals. 
>> Unless you get a far greater durability.
> 
> The problem is endurance. If we use the 480GB for 3 OSDs each on the 
> cluster we might build we expect 3 years (with some margin for error 
> but not including any write amplification at the SSD level) before the 
> SSDs will fail.
> In our context a 120GB model might not even last a year (endurance is 
> 1/4th of the 480GB model). This is why SM863 models will probably be 
> more suitable if you have access to them: you can use smaller ones 
> which cost less and get more endurance (you'll have to check the 
> performance though, usually smaller models have lower IOPS and bandwidth).
> 
>> I am planning to replace my current journal ssds over the next month or so 
>> and would like to find out if there is an a good alternative to the Intel's 
>> 3700/3500 series. 
> 
> 3700 are a safe bet (the 100GB model is rated for ~1.8PBW). 3500 
> models probably don't have enough endurance for many Ceph clusters to 
> be cost effective. The 120GB model is only rated for 70TBW and you 
> have to consider both client writes and rebalance events.
> I'm uneasy with SSDs expected to fail within the life of the system 
> they are in: you can have a cascade effect where an SSD failure brings 
> down several OSDs triggering a rebalance which might make SSDs 
> installed at the same time fail too. In this case in the best scenario 
> you will reach your min_size (>=2) and block any writes which would 
> prevent more SSD failures until you move journals to fresh SSDs. If 
> min_size = 1 you might actually lose data.
> 
> If you expect to replace your current journal SSDs if I were you I 
> would make a staggered deployment over several months/a year to avoid 
> them failing at the same time in case of an unforeseen problem. In 
> addition this would allow to evaluate the performance and behavior of 
> a new SSD model with your hardware (there have been reports of 
> performance problems with some combinations of RAID controllers and 
> SSD models/firmware versions) without impacting your cluster's overall 
> performance too much.
> 
> When using SSDs for journals you have to monitor both :
> * the SSD wear leveling or something equivalent (SMART data may not be 
> available if you use a RAID controller but usually you can get the 
> total amount data written) of each SSD,
> * the client writes on the whole cluster.
> And check periodically what the expected lifespan left there is for 
> each of your SSD based on their current state, average write speed, 
> estimated write amplification (both due to pool's size parameter and 
> the SSD model's inherent write amplification) and the amount of data 
> moved by rebalance events you expect to happen.
> Ideally you should make this computation before choosing the SSD 
> models, but several variables are not always easy to predict and 
> probably will change during the life of your cluster.
> 
> Lionel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list

Re: [ceph-users] requests are blocked

2015-12-22 Thread Dan Nica
Would this behavior go away if I add more OSDs or PGs, or can I do anything 
else besides changing the FS on the OSDs? Is this a known performance issue?

Thanks
--
Dan

On December 22, 2015 4:53:24 PM Wade Holler  wrote:

The hanging kernel tasks under -327 for XFS resulted in LOG verification 
failures and completely locked the hosts.
BTRFS task timeouts we could get around by setting 
kernel.hung_task_timeout_secs = 960

The host would eventually become responsive again; however, that doesn't really 
matter, since the Ceph ops are blocked for so long that it all goes to hell anyway.
I only found stability under high load with EXT4, or on -229 with BTRFS|EXT4.

Bad story, sorry to have to tell it.

-Wade


On Tue, Dec 22, 2015 at 9:44 AM Dan Nica 
mailto:dan.n...@staff.bluematrix.com>> wrote:
That is strange; maybe there is a sysctl option to tweak on the OSDs? This will be 
nasty if it goes into our production!

--
Dan

From: Wade Holler [mailto:wade.hol...@gmail.com]
Sent: Tuesday, December 22, 2015 4:36 PM
To: Dan Nica 
mailto:dan.n...@staff.bluematrix.com>>; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] requests are blocked

I had major host stability problems under load with -327. Repeatable test 
cases under high load with XFS or BTRFS would result in hung kernel tasks and, 
of course, the sympathetic behavior you mention.
"requests are blocked" means that the op tracker in Ceph hasn't received a timely 
response from the OSD, usually.  I'm sure someone more seasoned can provide a 
better explanation.
-Wade

On Tue, Dec 22, 2015 at 9:24 AM Dan Nica 
mailto:dan.n...@staff.bluematrix.com>> wrote:
Hi

I try to run a bench test on a RBD image and I get from time to time the 
following in ceph status

cluster 046b0180-dc3f-4846-924f-41d9729d48c8
 health HEALTH_WARN
2 requests are blocked > 32 sec
 monmap e1: 3 mons at 
{alder=10.6.250.249:6789/0,ash=10.6.250.248:6789/0,aspen=10.6.250.247:6789/0}
election epoch 18, quorum 0,1,2 aspen,ash,alder
 osdmap e114: 6 osds: 6 up, 6 in
flags sortbitwise
  pgmap v3816: 192 pgs, 1 pools, 23062 MB data, 5814 objects
46406 MB used, 44624 GB / 44670 GB avail
 192 active+clean
  client io 6083 B/s rd, 18884 kB/s wr, 75 op/s


What does "requests are blocked" mean? And why does performance drop to almost 0?
I am running infernalis version on Centos 7 kernel 3.10.0-327.3.1.el7.x86_64

Thanks
--
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Wido den Hollander
On 12/22/2015 05:36 PM, Tyler Bishop wrote:
> Write endurance is kinda bullshit.
> 
> We have crucial 960gb drives storing data and we've only managed to take 2% 
> off the drives life in the period of a year and hundreds of tb written weekly.
> 
> 
> Stuff is way more durable than anyone gives it credit.
> 
> 

No, that is absolutely not true. I've seen multiple SSDs fail in Ceph
clusters. Small Samsung 850 Pro SSDs worn out within 4 months in heavy
write-intensive Ceph clusters.

> - Original Message -
> From: "Lionel Bouton" 
> To: "Andrei Mikhailovsky" , "ceph-users" 
> 
> Sent: Tuesday, December 22, 2015 11:04:26 AM
> Subject: Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio 
> results
> 
On 22/12/2015 13:43, Andrei Mikhailovsky wrote:
>> Hello guys,
>>
>> Was wondering if anyone has done testing on Samsung PM863 120 GB version to 
>> see how it performs? IMHO the 480GB version seems like a waste for the 
>> journal as you only need to have a small disk size to fit 3-4 osd journals. 
>> Unless you get a far greater durability.
> 
> The problem is endurance. If we use the 480GB for 3 OSDs each on the
> cluster we might build we expect 3 years (with some margin for error but
> not including any write amplification at the SSD level) before the SSDs
> will fail.
> In our context a 120GB model might not even last a year (endurance is
> 1/4th of the 480GB model). This is why SM863 models will probably be
> more suitable if you have access to them: you can use smaller ones which
> cost less and get more endurance (you'll have to check the performance
> though, usually smaller models have lower IOPS and bandwidth).
> 
>> I am planning to replace my current journal ssds over the next month or so 
>> and would like to find out if there is an a good alternative to the Intel's 
>> 3700/3500 series. 
> 
> 3700 are a safe bet (the 100GB model is rated for ~1.8PBW). 3500 models
> probably don't have enough endurance for many Ceph clusters to be cost
> effective. The 120GB model is only rated for 70TBW and you have to
> consider both client writes and rebalance events.
> I'm uneasy with SSDs expected to fail within the life of the system they
> are in: you can have a cascade effect where an SSD failure brings down
> several OSDs triggering a rebalance which might make SSDs installed at
> the same time fail too. In this case in the best scenario you will reach
> your min_size (>=2) and block any writes which would prevent more SSD
> failures until you move journals to fresh SSDs. If min_size = 1 you
> might actually lose data.
> 
> If you expect to replace your current journal SSDs if I were you I would
> make a staggered deployment over several months/a year to avoid them
> failing at the same time in case of an unforeseen problem. In addition
> this would allow to evaluate the performance and behavior of a new SSD
> model with your hardware (there have been reports of performance
> problems with some combinations of RAID controllers and SSD
> models/firmware versions) without impacting your cluster's overall
> performance too much.
> 
> When using SSDs for journals you have to monitor both :
> * the SSD wear leveling or something equivalent (SMART data may not be
> available if you use a RAID controller but usually you can get the total
> amount data written) of each SSD,
> * the client writes on the whole cluster.
> And check periodically what the expected lifespan left there is for each
> of your SSD based on their current state, average write speed, estimated
> write amplification (both due to pool's size parameter and the SSD
> model's inherent write amplification) and the amount of data moved by
> rebalance events you expect to happen.
> Ideally you should make this computation before choosing the SSD models,
> but several variables are not always easy to predict and probably will
> change during the life of your cluster.
> 
> Lionel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Tyler Bishop
Write endurance is kinda bullshit.

We have crucial 960gb drives storing data and we've only managed to take 2% off 
the drives life in the period of a year and hundreds of tb written weekly.


Stuff is way more durable than anyone gives it credit.


- Original Message -
From: "Lionel Bouton" 
To: "Andrei Mikhailovsky" , "ceph-users" 

Sent: Tuesday, December 22, 2015 11:04:26 AM
Subject: Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

On 22/12/2015 13:43, Andrei Mikhailovsky wrote:
> Hello guys,
>
> Was wondering if anyone has done testing on Samsung PM863 120 GB version to 
> see how it performs? IMHO the 480GB version seems like a waste for the 
> journal as you only need to have a small disk size to fit 3-4 osd journals. 
> Unless you get a far greater durability.

The problem is endurance. If we use the 480GB for 3 OSDs each on the
cluster we might build we expect 3 years (with some margin for error but
not including any write amplification at the SSD level) before the SSDs
will fail.
In our context a 120GB model might not even last a year (endurance is
1/4th of the 480GB model). This is why SM863 models will probably be
more suitable if you have access to them: you can use smaller ones which
cost less and get more endurance (you'll have to check the performance
though, usually smaller models have lower IOPS and bandwidth).

> I am planning to replace my current journal ssds over the next month or so 
> and would like to find out if there is an a good alternative to the Intel's 
> 3700/3500 series. 

3700 are a safe bet (the 100GB model is rated for ~1.8PBW). 3500 models
probably don't have enough endurance for many Ceph clusters to be cost
effective. The 120GB model is only rated for 70TBW and you have to
consider both client writes and rebalance events.
I'm uneasy with SSDs expected to fail within the life of the system they
are in: you can have a cascade effect where an SSD failure brings down
several OSDs triggering a rebalance which might make SSDs installed at
the same time fail too. In this case in the best scenario you will reach
your min_size (>=2) and block any writes which would prevent more SSD
failures until you move journals to fresh SSDs. If min_size = 1 you
might actually lose data.

If you expect to replace your current journal SSDs if I were you I would
make a staggered deployment over several months/a year to avoid them
failing at the same time in case of an unforeseen problem. In addition
this would allow to evaluate the performance and behavior of a new SSD
model with your hardware (there have been reports of performance
problems with some combinations of RAID controllers and SSD
models/firmware versions) without impacting your cluster's overall
performance too much.

When using SSDs for journals you have to monitor both :
* the SSD wear leveling or something equivalent (SMART data may not be
available if you use a RAID controller but usually you can get the total
amount data written) of each SSD,
* the client writes on the whole cluster.
And check periodically what the expected lifespan left there is for each
of your SSD based on their current state, average write speed, estimated
write amplification (both due to pool's size parameter and the SSD
model's inherent write amplification) and the amount of data moved by
rebalance events you expect to happen.
Ideally you should make this computation before choosing the SSD models,
but several variables are not always easy to predict and probably will
change during the life of your cluster.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Lionel Bouton
On 22/12/2015 13:43, Andrei Mikhailovsky wrote:
> Hello guys,
>
> Was wondering if anyone has done testing on Samsung PM863 120 GB version to 
> see how it performs? IMHO the 480GB version seems like a waste for the 
> journal as you only need to have a small disk size to fit 3-4 osd journals. 
> Unless you get a far greater durability.

The problem is endurance. If we use the 480GB for 3 OSDs each on the
cluster we might build we expect 3 years (with some margin for error but
not including any write amplification at the SSD level) before the SSDs
will fail.
In our context a 120GB model might not even last a year (endurance is
1/4th of the 480GB model). This is why SM863 models will probably be
more suitable if you have access to them: you can use smaller ones which
cost less and get more endurance (you'll have to check the performance
though, usually smaller models have lower IOPS and bandwidth).

> I am planning to replace my current journal ssds over the next month or so 
> and would like to find out if there is an a good alternative to the Intel's 
> 3700/3500 series. 

3700 are a safe bet (the 100GB model is rated for ~1.8PBW). 3500 models
probably don't have enough endurance for many Ceph clusters to be cost
effective. The 120GB model is only rated for 70TBW and you have to
consider both client writes and rebalance events.
I'm uneasy with SSDs expected to fail within the life of the system they
are in: you can have a cascade effect where an SSD failure brings down
several OSDs triggering a rebalance which might make SSDs installed at
the same time fail too. In this case in the best scenario you will reach
your min_size (>=2) and block any writes which would prevent more SSD
failures until you move journals to fresh SSDs. If min_size = 1 you
might actually lose data.

If you expect to replace your current journal SSDs if I were you I would
make a staggered deployment over several months/a year to avoid them
failing at the same time in case of an unforeseen problem. In addition
this would allow to evaluate the performance and behavior of a new SSD
model with your hardware (there have been reports of performance
problems with some combinations of RAID controllers and SSD
models/firmware versions) without impacting your cluster's overall
performance too much.

When using SSDs for journals you have to monitor both :
* the SSD wear leveling or something equivalent (SMART data may not be
available if you use a RAID controller but usually you can get the total
amount data written) of each SSD,
* the client writes on the whole cluster.
And check periodically what the expected lifespan left there is for each
of your SSD based on their current state, average write speed, estimated
write amplification (both due to pool's size parameter and the SSD
model's inherent write amplification) and the amount of data moved by
rebalance events you expect to happen.
Ideally you should make this computation before choosing the SSD models,
but several variables are not always easy to predict and probably will
change during the life of your cluster.
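
As a rough sketch of that computation (the numbers are purely illustrative): with an
average of 50 MB/s of client writes and a pool size of 3, the journals collectively
absorb about 150 MB/s. Spread over, say, 4 journal SSDs that is roughly 37 MB/s per
SSD, or about 3.2 TB written per day per SSD, so a drive rated for ~1.8 PBW (the 100GB
3700 mentioned above) would last on the order of a year and a half of pure client
traffic, before adding rebalance events and the SSD's own internal write amplification.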

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] requests are blocked

2015-12-22 Thread Wade Holler
The hanging kernel tasks under -327 for XFS resulted in LOG verification
failures and completely locked the hosts.
BTRFS task timeouts we could get around by setting
kernel.hung_task_timeout_secs = 960

The host would eventually become responsive again; however, that doesn't really
matter, since the Ceph ops are blocked for so long that it all goes to hell
anyway.
I only found stability under high load with EXT4, or on -229 with BTRFS|EXT4.

Bad story, sorry to have to tell it.

-Wade


On Tue, Dec 22, 2015 at 9:44 AM Dan Nica 
wrote:

> That is strange; maybe there is a sysctl option to tweak on the OSDs? This
> will be nasty if it goes into our production!
>
>
>
> --
>
> Dan
>
>
>
> *From:* Wade Holler [mailto:wade.hol...@gmail.com]
> *Sent:* Tuesday, December 22, 2015 4:36 PM
> *To:* Dan Nica ; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] requests are blocked
>
>
>
> I had major host stability problems under load with -327  . Repeatable
> test cases under high load with XFS or BTRFS would result in hung kernel
> tasks and of course the sympathetic behavior you mention.
>
> "requests are blocked" means that the op tracker in Ceph hasn't received a
> timely response from the OSD, usually.  I'm sure someone more seasoned can
> provide a better explanation.
>
> -Wade
>
>
>
> On Tue, Dec 22, 2015 at 9:24 AM Dan Nica 
> wrote:
>
> Hi
>
>
>
> I try to run a bench test on a RBD image and I get from time to time the
> following in ceph status
>
>
>
> cluster 046b0180-dc3f-4846-924f-41d9729d48c8
>
>  health HEALTH_WARN
>
> 2 requests are blocked > 32 sec
>
>  monmap e1: 3 mons at {alder=
> 10.6.250.249:6789/0,ash=10.6.250.248:6789/0,aspen=10.6.250.247:6789/0}
>
> election epoch 18, quorum 0,1,2 aspen,ash,alder
>
>  osdmap e114: 6 osds: 6 up, 6 in
>
> flags sortbitwise
>
>   pgmap v3816: 192 pgs, 1 pools, 23062 MB data, 5814 objects
>
> 46406 MB used, 44624 GB / 44670 GB avail
>
>  192 active+clean
>
>   client io 6083 B/s rd, 18884 kB/s wr, 75 op/s
>
>
>
>
>
> What does "requests are blocked" mean? And why does performance drop to
> almost 0?
>
> I am running infernalis version on Centos 7 kernel
> 3.10.0-327.3.1.el7.x86_64
>
>
>
> Thanks
>
> --
>
> Dan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] requests are blocked

2015-12-22 Thread Dan Nica
That is strange, maybe there is a sysctl option to tweak on OSDs ? this will be 
nasty if it goes into our production!

--
Dan

From: Wade Holler [mailto:wade.hol...@gmail.com]
Sent: Tuesday, December 22, 2015 4:36 PM
To: Dan Nica ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] requests are blocked

I had major host stability problems under load with -327. Repeatable test 
cases under high load with XFS or BTRFS would result in hung kernel tasks and, 
of course, the sympathetic behavior you mention.
"requests are blocked" means that the op tracker in Ceph hasn't received a timely 
response from the OSD, usually.  I'm sure someone more seasoned can provide a 
better explanation.
-Wade

On Tue, Dec 22, 2015 at 9:24 AM Dan Nica 
mailto:dan.n...@staff.bluematrix.com>> wrote:
Hi

I try to run a bench test on a RBD image and I get from time to time the 
following in ceph status

cluster 046b0180-dc3f-4846-924f-41d9729d48c8
 health HEALTH_WARN
2 requests are blocked > 32 sec
 monmap e1: 3 mons at 
{alder=10.6.250.249:6789/0,ash=10.6.250.248:6789/0,aspen=10.6.250.247:6789/0}
election epoch 18, quorum 0,1,2 aspen,ash,alder
 osdmap e114: 6 osds: 6 up, 6 in
flags sortbitwise
  pgmap v3816: 192 pgs, 1 pools, 23062 MB data, 5814 objects
46406 MB used, 44624 GB / 44670 GB avail
 192 active+clean
  client io 6083 B/s rd, 18884 kB/s wr, 75 op/s


What does "requests are blocked" mean? And why does performance drop to almost 0?
I am running infernalis version on Centos 7 kernel 3.10.0-327.3.1.el7.x86_64

Thanks
--
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] requests are blocked

2015-12-22 Thread Wade Holler
I had major host stability problems under load with -327. Repeatable test
cases under high load with XFS or BTRFS would result in hung kernel tasks
and, of course, the sympathetic behavior you mention.

"requests are blocked" means that the op tracker in Ceph hasn't received a
timely response from the OSD, usually.  I'm sure someone more seasoned can
provide a better explanation.
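
If memory serves, "ceph health detail" will point at the OSDs holding the blocked
requests, and the admin socket of one of those OSDs will show the ops themselves,
something like (osd.3 is just an example id):

ceph health detail
ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops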

-Wade


On Tue, Dec 22, 2015 at 9:24 AM Dan Nica 
wrote:

> Hi
>
>
>
> I try to run a bench test on a RBD image and I get from time to time the
> following in ceph status
>
>
>
> cluster 046b0180-dc3f-4846-924f-41d9729d48c8
>
>  health HEALTH_WARN
>
> 2 requests are blocked > 32 sec
>
>  monmap e1: 3 mons at {alder=
> 10.6.250.249:6789/0,ash=10.6.250.248:6789/0,aspen=10.6.250.247:6789/0}
>
> election epoch 18, quorum 0,1,2 aspen,ash,alder
>
>  osdmap e114: 6 osds: 6 up, 6 in
>
> flags sortbitwise
>
>   pgmap v3816: 192 pgs, 1 pools, 23062 MB data, 5814 objects
>
> 46406 MB used, 44624 GB / 44670 GB avail
>
>  192 active+clean
>
>   client io 6083 B/s rd, 18884 kB/s wr, 75 op/s
>
>
>
>
>
> What does "requests are blocked" mean? And why does performance drop to
> almost 0?
>
> I am running infernalis version on Centos 7 kernel
> 3.10.0-327.3.1.el7.x86_64
>
>
>
> Thanks
>
> --
>
> Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] requests are blocked

2015-12-22 Thread Dan Nica
Hi

I try to run a bench test on a RBD image and I get from time to time the 
following in ceph status

cluster 046b0180-dc3f-4846-924f-41d9729d48c8
 health HEALTH_WARN
2 requests are blocked > 32 sec
 monmap e1: 3 mons at 
{alder=10.6.250.249:6789/0,ash=10.6.250.248:6789/0,aspen=10.6.250.247:6789/0}
election epoch 18, quorum 0,1,2 aspen,ash,alder
 osdmap e114: 6 osds: 6 up, 6 in
flags sortbitwise
  pgmap v3816: 192 pgs, 1 pools, 23062 MB data, 5814 objects
46406 MB used, 44624 GB / 44670 GB avail
 192 active+clean
  client io 6083 B/s rd, 18884 kB/s wr, 75 op/s


What does "requests are blocked" mean? And why does performance drop to almost 0?
I am running infernalis version on Centos 7 kernel 3.10.0-327.3.1.el7.x86_64

Thanks
--
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-22 Thread Yan, Zheng
On Tue, Dec 22, 2015 at 7:18 PM, Don Waterloo  wrote:
> On 21 December 2015 at 22:07, Yan, Zheng  wrote:
>>
>>
>> > OK, so i changed fio engine to 'sync' for the comparison of a single
>> > underlying osd vs the cephfs.
>> >
>> > the cephfs w/ sync is ~ 115iops / ~500KB/s.
>>
>> This is normal because you were doing single thread sync IO. If
>> round-trip time for each OSD request is about 10ms (network latency),
>> you can only have about 100 IOPS.
>>
>>
>
> yes... except the RTT is 200us. So that would be 5000 RTT/s.
>

The time the OSD takes to handle each request should also be taken into account. Please
try creating an rbd device, then run "dd if=/dev/zero bs=4k
oflag=direct of=/dev/rbdx"; the performance number should be roughly
the same as cephfs.
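
For completeness, creating and mapping a throwaway test image would look something
along these lines (image name and size are arbitrary, and the mapped device node may
differ, so check the output of "rbd map"):

rbd create perftest --size 1024
rbd map perftest
dd if=/dev/zero bs=4k count=10000 oflag=direct of=/dev/rbd0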

Regards
Yan, Zheng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-22 Thread Francois Lafont
Hello,

On 21/12/2015 04:47, Yan, Zheng wrote:

> fio tests AIO performance in this case. cephfs does not handle AIO
> properly, AIO is actually SYNC IO. that's why cephfs is so slow in
> this case.

Ah ok, thanks for this very interesting information.

So, in fact, the question I ask myself is: how do I test my cephfs
to know whether my performance is correct (or not) for my hardware
configuration?

Because currently, in fact, I'm unable to say whether I have correct performance
(not incredible, but in line with my hardware configuration) or whether I
have a problem. ;)

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another MDS crash... log included

2015-12-22 Thread John Spray
On Tue, Dec 22, 2015 at 12:58 PM, Florent B  wrote:
> Hi,
>
> Today I had another MDS crash, but this time it was an active MDS crash.
>
> Log is here : http://paste.ubuntu.com/14136900/
>
> Infernalis on Debian Jessie (packaged version).
>
> Does anyone know something about it ?

That's not an MDS crash.  The MDS is respawning itself because it got
a message from the monitor indicating that it had been removed from
the MDSMap.  That will happen if the mon and MDS lose contact for too
long, which appears to have happened here (lost contact for about 40
seconds up to the time 13:42:20, when the new mdsmap is received).

You will need to diagnose what caused the mon and MDS to lose contact
during this time window, or possibly what caused the mons to be
unresponsive if that was the case.
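
If memory serves, the mon drops an MDS from the map when it has not seen a beacon
within mds_beacon_grace (15 seconds by default), which would be consistent with the
~40 second gap above. You can check the value in effect on a mon admin socket, e.g.
(mon.a is just an example id):

ceph daemon mon.a config get mds_beacon_grace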

John

> Thank you.
>
> Flo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Wido den Hollander


On 22-12-15 13:43, Andrei Mikhailovsky wrote:
> Hello guys,
> 
> Was wondering if anyone has done testing on Samsung PM863 120 GB version to 
> see how it performs? IMHO the 480GB version seems like a waste for the 
> journal as you only need to have a small disk size to fit 3-4 osd journals. 
> Unless you get a far greater durability.
> 

In that case I would look at the SM863 from Samsung. They are sold as
write-intensive SSDs.

Wido

> I am planning to replace my current journal ssds over the next month or so 
> and would like to find out if there is an a good alternative to the Intel's 
> 3700/3500 series. 
> 
> Thanks
> 
> Andrei
> 
> - Original Message -
>> From: "Wido den Hollander" 
>> To: "ceph-users" 
>> Sent: Monday, 21 December, 2015 19:12:33
>> Subject: Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio 
>> results
> 
>> On 12/21/2015 05:30 PM, Lionel Bouton wrote:
>>> Hi,
>>>
>>> Sébastien Han just added the test results I reported for these SSDs on
>>> the following page :
>>>
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>
>>> The table in the original post has the most important numbers and more
>>> details can be found in the comments.
>>>
>>> To sum things up, both have good performance (this isn't surprising for
>>> the S3710 but AFAIK this had to be confirmed for the PM863 and my
>>> company just purchased 2 of them just for these tests because they are
>>> the only "DC" SSDs available at one of our hosting providers).
>>> PM863 models are not designed for write-intensive applications and we
>>> have yet to see how they behave in the long run (in our case where PM863
>>> endurance is a bit short, if I had a choice we would test SM863 models
>>> if they were available to us).
>>>
>>> So at least for the PM863 please remember that this report is just about
>>> the performance side (on fresh SSDs) which arguably is excellent for the
>>> price but this doesn't address other conditions to check (performance
>>> consistency over the long run, real-world write endurance including
>>> write amplification, large scale testing to detect potential firmware
>>> bugs, ...).
>>>
>>
>> Interesting! I might be able to gain access to some PM863 3.84TB SSDs
>> later this week.
>>
>> I'll run the same tests if I can. Interesting to see how they perform.
>>
>>> Best regards,
>>>
>>> Lionel
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] release of the next Infernalis

2015-12-22 Thread Andrei Mikhailovsky
Hello guys, 

I was planning to upgrade our ceph cluster over the holiday period and was 
wondering when are you planning to release the next point release of the 
Infernalis? Should I wait for it or just roll out 9.2.0 for the time being? 

thanks 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Andrei Mikhailovsky
Hello guys,

Was wondering if anyone has done testing on Samsung PM863 120 GB version to see 
how it performs? IMHO the 480GB version seems like a waste for the journal as 
you only need to have a small disk size to fit 3-4 osd journals. Unless you get 
a far greater durability.

I am planning to replace my current journal SSDs over the next month or so and 
would like to find out if there is a good alternative to Intel's 
3700/3500 series.
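
For comparing candidates I was planning to use the usual journal-style test from 
Sébastien's page referenced below, i.e. single-job 4k direct sync writes, something 
along these lines (device name is an example, and the test overwrites the device):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test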

Thanks

Andrei

- Original Message -
> From: "Wido den Hollander" 
> To: "ceph-users" 
> Sent: Monday, 21 December, 2015 19:12:33
> Subject: Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio 
> results

> On 12/21/2015 05:30 PM, Lionel Bouton wrote:
>> Hi,
>> 
>> Sébastien Han just added the test results I reported for these SSDs on
>> the following page :
>> 
>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>> 
>> The table in the original post has the most important numbers and more
>> details can be found in the comments.
>> 
>> To sum things up, both have good performance (this isn't surprising for
>> the S3710 but AFAIK this had to be confirmed for the PM863 and my
>> company just purchased 2 of them just for these tests because they are
>> the only "DC" SSDs available at one of our hosting providers).
>> PM863 models are not designed for write-intensive applications and we
>> have yet to see how they behave in the long run (in our case where PM863
>> endurance is a bit short, if I had a choice we would test SM863 models
>> if they were available to us).
>> 
>> So at least for the PM863 please remember that this report is just about
>> the performance side (on fresh SSDs) which arguably is excellent for the
>> price but this doesn't address other conditions to check (performance
>> consistency over the long run, real-world write endurance including
>> write amplification, large scale testing to detect potential firmware
>> bugs, ...).
>> 
> 
> Interesting! I might be able to gain access to some PM863 3.84TB SSDs
> later this week.
> 
> I'll run the same tests if I can. Interesting to see how they perform.
> 
>> Best regards,
>> 
>> Lionel
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph journal failed?

2015-12-22 Thread Loris Cuoghi

On 22/12/2015 09:42, yuyang wrote:

Hello, everyone,

[snip snap]

Hi

> If the SSD failed or down, can the OSD work?
> Is the osd down or only can be read?

If you don't have a journal anymore, the OSD has already quit, as it 
can't continue writing, nor can it ensure data consistency, since writes 
have probably been interrupted.


The Ceph community's general assumption is that a dead journal means a dead OSD.

But.

http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/

How does this apply in reality?
Is the solution that Sébastien is proposing viable?
In most/all cases?
Will the OSD continue chugging along after this kind of surgery?
Is it necessary/suggested to deep scrub ASAP the OSD's placement groups?
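
For reference, my reading of that post is that the surgery boils down to: stop the OSD,
point its journal at the replacement device/partition, recreate an empty journal with
something like

ceph-osd -i 12 --mkjournal

(the osd id is an example), then start the OSD again and deep scrub its PGs. I would
love confirmation from someone who has actually done it.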
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-22 Thread Don Waterloo
On 21 December 2015 at 22:07, Yan, Zheng  wrote:

>
> > OK, so i changed fio engine to 'sync' for the comparison of a single
> > underlying osd vs the cephfs.
> >
> > the cephfs w/ sync is ~ 115iops / ~500KB/s.
>
> This is normal because you were doing single thread sync IO. If
> round-trip time for each OSD request is about 10ms (network latency),
> you can only have about 100 IOPS.
>
>
>
yes... except the RTT is 200us. So that would be 5000 RTT/s.
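
Worth noting: single-thread sync IOPS are bounded by the total per-operation 
latency, and ~115 IOPS works out to roughly 1/115 ≈ 8.7 ms per write, so the 
per-op cost is dominated by something other than the raw 200 µs network RTT 
(journal/commit latency on the OSD side being the usual suspect). A sketch of 
the kind of single-thread sync test being compared here (mount point and exact 
flags are illustrative and may differ from the earlier posts in the thread):

# single-threaded sync 4k writes against a file on the CephFS mount
fio --name=cephfs-sync --directory=/mnt/cephfs --size=1G \
    --ioengine=sync --fdatasync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based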
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Metadata Server (MDS) Hardware Suggestions

2015-12-22 Thread John Spray
On Tue, Dec 22, 2015 at 9:00 AM, Simon  Hallam  wrote:
> Thank you both, cleared up a lot.
>
> Is there a performance metric in perf dump on the MDS' that I can see the 
> active number of inodes/dentries? I'm guessing the mds_mem ino and dn metrics 
> are the relevant ones?
> http://paste.fedoraproject.org/303932/77466614/

Those are the number of CInode and CDentry objects allocated,
respectively.  You can also look at mds -> inodes (there's some subtle
difference between that and mds_mem->ino, but I haven't checked what it is) --
any of these are probably fine for a rough impression.
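
For anyone else looking for these, a quick way to pull them off the MDS admin 
socket (run on the MDS host; the daemon name "mds.a" is illustrative):

# prints the cached inode and dentry counts from the mds_mem section
ceph daemon mds.a perf dump | python -c \
  'import json,sys; m=json.load(sys.stdin)["mds_mem"]; print(m["ino"], m["dn"])'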

John

>
> Cheers,
>
> Simon
>
>> -Original Message-
>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>> Sent: 17 December 2015 23:54
>> To: John Spray
>> Cc: Simon Hallam; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Metadata Server (MDS) Hardware Suggestions
>>
>> On Thu, Dec 17, 2015 at 2:06 PM, John Spray  wrote:
>> > On Thu, Dec 17, 2015 at 2:31 PM, Simon  Hallam  wrote:
>> >> Hi all,
>> >>
>> >>
>> >>
>> >> I’m looking at sizing up some new MDS nodes, but I’m not sure if my
>> thought
>> >> process is correct or not:
>> >>
>> >>
>> >>
>> >> CPU: Limited to a maximum 2 cores. The higher the GHz, the more IOPS
>> >> available. So something like a single E5-2637v3 should fulfil this.
>> >
>> > No idea where you're getting the 2 core part.  But a mid range CPU
>> > like the one you're looking at is probably perfectly appropriate.  As
>> > you have probably gathered, the MDS will not make good use of large
>> > core counts (though there are plenty of threads and various
>> > serialisation/deserialisation parts can happen in parallel).
>>
>> There's just not much that happens outside of the big MDS lock right
>> now, besides journaling and some message handling. So basically two
>> cores is all we'll be able to use until that happens. ;)
>>
>> >
>> >> Memory: The more the better, as the metadata can be cached in RAM
>> (how much
>> >> RAM required is dependent on number of files?).
>> >
>> > Correct, the more RAM you have, the higher you can set mds_cache_size,
>> > and the larger your working set will be.
>>
>> Note that "working set" there; it's only the active metadata you need
>> to worry about when sizing things. I think at last count Zheng was
>> seeing ~3KB of memory for each inode/dentry combo.
>> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Metadata Server (MDS) Hardware Suggestions

2015-12-22 Thread Simon Hallam
Thank you both, cleared up a lot. 

Is there a performance metric in perf dump on the MDS' that I can see the 
active number of inodes/dentries? I'm guessing the mds_mem ino and dn metrics 
are the relevant ones?
http://paste.fedoraproject.org/303932/77466614/

Cheers,

Simon

> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: 17 December 2015 23:54
> To: John Spray
> Cc: Simon Hallam; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Metadata Server (MDS) Hardware Suggestions
> 
> On Thu, Dec 17, 2015 at 2:06 PM, John Spray  wrote:
> > On Thu, Dec 17, 2015 at 2:31 PM, Simon  Hallam  wrote:
> >> Hi all,
> >>
> >>
> >>
> >> I’m looking at sizing up some new MDS nodes, but I’m not sure if my
> thought
> >> process is correct or not:
> >>
> >>
> >>
> >> CPU: Limited to a maximum 2 cores. The higher the GHz, the more IOPS
> >> available. So something like a single E5-2637v3 should fulfil this.
> >
> > No idea where you're getting the 2 core part.  But a mid range CPU
> > like the one you're looking at is probably perfectly appropriate.  As
> > you have probably gathered, the MDS will not make good use of large
> > core counts (though there are plenty of threads and various
> > serialisation/deserialisation parts can happen in parallel).
> 
> There's just not much that happens outside of the big MDS lock right
> now, besides journaling and some message handling. So basically two
> cores is all we'll be able to use until that happens. ;)
> 
> >
> >> Memory: The more the better, as the metadata can be cached in RAM
> (how much
> >> RAM required is dependent on number of files?).
> >
> > Correct, the more RAM you have, the higher you can set mds_cache_size,
> > and the larger your working set will be.
> 
> Note that "working set" there; it's only the active metadata you need
> to worry about when sizing things. I think at last count Zheng was
> seeing ~3KB of memory for each inode/dentry combo.
> -Greg
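
As a rough back-of-the-envelope using that ~3KB-per-entry figure (the value 
below is illustrative, not a recommendation):

[mds]
# at roughly 3 KB per cached inode/dentry pair, 1M entries is about 3 GB of
# cache RAM (4M entries ~12 GB), on top of the daemon's baseline memory use
mds cache size = 1000000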



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph journal failed??

2015-12-22 Thread yuyang
Hello, everyone,
I have a ceph cluster with several nodes; every node has 1 SSD and 9 SATA disks.
Every SATA disk is used as an OSD, and to improve IO performance the SSD 
is used as the journal disk.
That is, there are 9 journal files on every SSD.
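
(On a default ceph-disk layout, the OSD-to-journal mapping can be seen with 
something like the following; the path assumes the standard /var/lib/ceph 
locations:)

# each OSD's journal is a symlink pointing at its partition on the shared SSD
ls -l /var/lib/ceph/osd/ceph-*/journal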

If the SSD fails or goes down, can the affected OSDs still work? 
Are the OSDs down, or do they become read-only?

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster raw used problem

2015-12-22 Thread Don Laursen
Hello all,
I have a simple 10 OSD cluster that is running out of space on several OSDs.
Notice that the output below shows a big difference between .rgw.buckets used 
at 10% and raw used at 73%.

Is there something I need to do to purge the space?

I did have a 10TB RBD block image in the data pool that I deleted a month 
ago. Could there be leftover stale objects somewhere?
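
A few read-only checks that might narrow this down (pool names taken from the 
ceph df output below):

# raw used should roughly equal the sum over pools of USED x replication size
ceph osd dump | grep 'replicated size'
rados df
# anything left behind from the deleted image in the data pool?
rados -p Data ls | head
# deleted RGW objects still queued for garbage collection?
radosgw-admin gc list --include-all | head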

root@mon01:~# ceph df
GLOBAL:
SIZE   AVAIL RAW USED %RAW USED
35829G 9467G   26338G 73.51
POOLS:
    NAME               ID USED  %USED MAX AVAIL OBJECTS
    Data               0      0     0     2202G       0
    metadata           1   3224     0     2202G      24
    rbd                2      8     0     2202G       1
    .rgw.root          3    955     0     2202G       4
    .rgw.control       4      0     0     2202G       8
    .rgw               5   8445     0     2202G      51
    .rgw.gc            6      0     0     2202G      32
    .users             7    124     0     2202G       8
    .users.uid         8   2474     0     2202G      13
    .users.email       9      0     0     2202G       0
    .rgw.buckets.index 10     0     0     2202G      40
    .rgw.buckets       11 3785G 10.57     2202G 3103149
                       12     0     0     2202G       0


root@mon01:~# ceph osd df
ID WEIGHT REWEIGHT SIZE   USE    AVAIL  %USE  VAR
 0 3.5    0.28165  3582G  2916G   663G 81.41 1.11
 5 3.5    0.15871  3582G  3096G   484G 86.43 1.18
 1 3.5    0.40709  3582G  3027G   554G 84.49 1.15
 7 3.5    1.0      3582G  1191G  2384G 33.26 0.45
 3 3.5    1.0      3582G  1945G  1635G 54.30 0.74
 4 3.5    0.25082  3582G  3085G   494G 86.12 1.17
 8 3.5    0.19499  3582G  3139G   440G 87.63 1.19
 9 3.5    0.79721  3582G  3140G   441G 87.65 1.19
 2 3.5    1.0      3582G  1686G  1895G 47.06 0.64
 6 3.5    0.43005  3582G  3108G   473G 86.75 1.18
         TOTAL     35829G 26338G  9467G 73.51
MIN/MAX VAR: 0.45/1.19  STDDEV: 23.68



Thanks,
Don Laursen



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com