[ceph-users] OSD Capacity via Python / C API

2016-01-18 Thread Alex Leake
Hello All.


Does anyone know if it's possible to retrieve the remaining OSD capacity via 
the Python or C API?


I can get all other sorts of information, but I thought it would be nice to see 
near-full OSDs via the API.



Kind Regards,

Alex.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and NFS

2016-01-18 Thread david
Hello All.
Does anyone provide Ceph rbd/rgw/cephfs through NFS?  I have a 
requirement for a Ceph cluster that needs to provide NFS service. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Capacity via Python / C API

2016-01-18 Thread Wido den Hollander


On 18-01-16 10:22, Alex Leake wrote:
> Hello All.
> 
> 
> Does anyone know if it's possible to retrieve the remaining OSD capacity
> via the Python or C API?
> 

Using a mon_command in librados you can send an 'osd df' if you want to.

See this snippet: https://gist.github.com/wido/ac53ae01d661dd57f4a8

Here I sent the command 'status', but you can change that to 'osd df'
and you should get back JSON.
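For reference, a rough sketch of that approach with the python-rados bindings
(assuming /etc/ceph/ceph.conf and the client.admin keyring are readable; the
'nodes'/'kb_avail'/'utilization' field names are what my clusters return for
'osd df', so check the raw JSON on your release before relying on them):

    import json
    import rados

    # connect using the default config and the client.admin keyring
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # ask the monitors for 'osd df' and request JSON output
        cmd = json.dumps({"prefix": "osd df", "format": "json"})
        ret, outbuf, outs = cluster.mon_command(cmd, b'', timeout=10)
        if ret != 0:
            raise RuntimeError(outs)
        for node in json.loads(outbuf.decode('utf-8'))["nodes"]:
            print(node["name"], node["kb_avail"], node["utilization"])
    finally:
        cluster.shutdown()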

Wido

> 
> I can get all other sorts of information, but I thought it would be nice
> to see near-full OSDs via the API.
> 
> 
> 
> Kind Regards,
> 
> Alex.
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CentOS 7 iscsi gateway using lrbd

2016-01-18 Thread Dominik Zalewski
Hi,

I'm looking into implementing iscsi gateway with MPIO using lrbd -
https://github.com/swiftgist/lrbd


https://www.suse.com/docrep/documents/kgu61iyowz/suse_enterprise_storage_2_and_iscsi.pdf

https://www.susecon.com/doc/2015/sessions/TUT16512.pdf

From the above examples:

    "For iSCSI failover and load-balancing, these servers must run a kernel
    supporting the target_core_rbd module. This also requires that the target
    servers run at least version 3.12.48-52.27.1 of the kernel-default package.
    Update packages are available from the SUSE Linux Enterprise Server
    maintenance channel."


I understand that lrbd is basically a nice way to configure LIO and rbd
across ceph osd nodes/iscsi gateways. Does CentOS 7 have the same
target_core_rbd module in the kernel, or is this something SUSE Enterprise
Storage specific only?


Basically, will LIO+rbd work the same way on CentOS 7? Has anyone used it
with CentOS?


Thanks


Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-18 Thread Tyler Bishop
Check these out too:  
http://www.seagate.com/internal-hard-drives/solid-state-hybrid/1200-ssd/


- Original Message -
From: "Christian Balzer" 
To: "ceph-users" 
Sent: Sunday, January 17, 2016 10:45:56 PM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

Hello,

On Sat, 16 Jan 2016 19:06:07 +0100 David wrote:

> Hi!
> 
> We’re planning our third ceph cluster and been trying to find how to
> maximize IOPS on this one.
> 
> Our needs:
> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
> servers)
> * Pool for storage of many small files, rbd (probably dovecot maildir
> and dovecot index etc)
>
I'm running dovecot for several 100k users on 2-node DRBD clusters and for
a mail archive server for a few hundred users backed by Ceph/RBD.
The latter works fine (it's not that busy), but I wouldn't consider
replacing the DRBD clusters with Ceph/RBD at this time (higher investment
in storage, 3x vs 2x, and lower performance of course).

Depending on your use case you may be just fine of course.

> So I’ve been reading up on:
> 
> https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance
> 
> and ceph-users from october 2015:
> 
> http://www.spinics.net/lists/ceph-users/msg22494.html
> 
> We’re planning something like 5 OSD servers, with:
> 
> * 4x 1.2TB Intel S3510
I'd be wary of that.
As in, you're spec'ing the best Intel SSDs money can buy below for
journals, but the least write-endurable Intel DC SSDs for OSDs here.
Note that write amplification (beyond Ceph and FS journals) is very much a
thing, especially with small files. 
There's a mail about this by me in the ML archives somewhere:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

Unless you're very sure about this being a read-mostly environment I'd go
with 3610's at least.

> * 8x 4TB HDD
> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
> one for HDD pool journal)
You may be better off (cost and SPOF wise) with 2x 200GB S3700 (not 3710)
for the HDD journals, but then again that won't fit into your case, will
it...
Given the IOPS limits in Ceph as it is, you're unlikely to see much of a
difference if you forgo a journal for the SSDs and use shared journals with
DC S3610 or 3710 OSD SSDs. 
Note that as far as pure throughput is concerned (in most operations the
least critical factor) your single journal SSD will limit things to the
speed of 2 (of your 4) storage SSDs.
But then again, your network is probably saturated before that.

> * 2x 80GB Intel S3510 raid1 for system
> * 256GB RAM
Plenty. ^o^

> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
> 
Not sure about Jewel, but SSD OSDs will eat pretty much any and all CPU
cycles you can throw at them.
This also boils down to the question of whether having mixed HDD/SSD storage nodes
(with the fun of having to set "osd crush update on start = false") is a
good idea or not, as opposed to nodes that are optimized for their
respective storage hardware (CPU, RAM, network wise).

Regards,

Christian
> This cluster will probably run Hammer LTS unless there are huge
> improvements in Infernalis when dealing with 4k IOPS.
> 
> The first link above hints at awesome performance. The second one from
> the list not so much yet.. 
> 
> Is anyone running Hammer or Infernalis with a setup like this?
> Is it a sane setup?
> Will we become CPU constrained or can we just throw more RAM on it? :D
> 
> Kind Regards,
> David Majchrzak

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread Tyler Bishop
You should test out cephfs exported as an NFS target.
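Just as a sketch of what that looks like (hostnames, paths and export options
below are made up for illustration, not a tested recipe): mount CephFS on the
gateway with the kernel client and re-export it with the standard kernel NFS
server:

    # on the NFS gateway: mount CephFS with the kernel client
    mount -t ceph mon1:6789:/ /export/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

    # /etc/exports
    /export/cephfs  *(rw,no_subtree_check,fsid=101)

    exportfs -ra

The fsid= export option just gives knfsd a stable filesystem identifier, since
a CephFS mount has no block device to derive one from.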


- Original Message -
From: "david" 
To: ceph-users@lists.ceph.com
Sent: Monday, January 18, 2016 4:36:17 AM
Subject: [ceph-users] Ceph and NFS

Hello All.
Does anyone provide Ceph rbd/rgw/cephfs through NFS?  I have a 
requirement for a Ceph cluster that needs to provide NFS service. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS 7 iscsi gateway using lrbd

2016-01-18 Thread Tyler Bishop

Well that's interesting. 

I've mounted block devices to the kernel and exported them to iscsi but the 
performance was horrible.. I wonder if this is any different? 
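For context, the plain krbd + LIO path being described looks roughly like this
(pool, image and IQN names are invented for illustration):

    rbd map rbd/iscsi-test                  # appears as e.g. /dev/rbd0
    targetcli /backstores/block create name=iscsi-test dev=/dev/rbd0
    targetcli /iscsi create iqn.2016-01.com.example:iscsi-test
    targetcli /iscsi/iqn.2016-01.com.example:iscsi-test/tpg1/luns create /backstores/block/iscsi-test

As far as I understand the SUSE work, target_core_rbd replaces that generic
block (iblock) backstore with one that talks to the kernel RBD code directly
rather than going through a mapped /dev/rbdX device.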



From: "Dominik Zalewski"  
To: ceph-users@lists.ceph.com 
Sent: Monday, January 18, 2016 6:35:20 AM 
Subject: [ceph-users] CentOS 7 iscsi gateway using lrbd 

Hi, 
I'm looking into implementing iscsi gateway with MPIO using lrbd - 
https://github.com/swiftgist/lrbd 


https://www.suse.com/docrep/documents/kgu61iyowz/suse_enterprise_storage_2_and_iscsi.pdf
 

https://www.susecon.com/doc/2015/sessions/TUT16512.pdf 

From the above examples: 

    "For iSCSI failover and load-balancing, these servers must run a kernel
    supporting the target_core_rbd module. This also requires that the target
    servers run at least version 3.12.48-52.27.1 of the kernel-default package.
    Update packages are available from the SUSE Linux Enterprise Server
    maintenance channel."




I understand that lrbd is basically a nice way to configure LIO and rbd across 
ceph osd nodes/iscsi gateways. Does CentOS 7 have the same target_core_rbd module in 
the kernel, or is this something SUSE Enterprise Storage specific only? 




Basically, will LIO+rbd work the same way on CentOS 7? Has anyone used it with 
CentOS? 




Thanks 




Dominik 





___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread Burkhard Linke

Hi,

On 18.01.2016 10:36, david wrote:

Hello All.
Does anyone provide Ceph rbd/rgw/cephfs through NFS?  I have a 
requirement for a Ceph cluster that needs to provide NFS service.


We export a CephFS mount point on one of our NFS servers. Works out of 
the box with Ubuntu Trusty, a recent kernel and kernel-based cephfs driver.


ceph-fuse did not work that well, and using nfs-ganesha 2.2 instead of 
the standard kernel-based NFSd resulted in segfaults and permission problems.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread Arthur Liu
On Mon, Jan 18, 2016 at 11:34 PM, Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 18.01.2016 10:36, david wrote:
>
>> Hello All.
>> Does anyone provides Ceph rbd/rgw/cephfs through NFS?  I have a
>> requirement about Ceph Cluster which needs to provide NFS service.
>>
>
> We export a CephFS mount point on one of our NFS servers. Works out of the
> box with Ubuntu Trusty, a recent kernel and kernel-based cephfs driver.
>
> ceph-fuse did not work that well, and using nfs-ganesha 2.2 instead of
> standard kernel based NFSd resulted in segfaults and permissions problems.


I've found that using knfsd does not preserve cephfs directory and file
layouts, but using nfs-ganesha does. I'm currently using nfs-ganesha
2.4dev5 and it seems stable so far.
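For anyone curious, a minimal nfs-ganesha export block for the Ceph FSAL looks
roughly like this (a sketch only; check the ganesha-ceph sample config shipped
with your version for the exact options):

    EXPORT {
        Export_ID = 1;
        Path = "/";
        Pseudo = "/cephfs";
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;
        }
    }

Since ganesha goes through libcephfs rather than re-exporting a kernel mount,
that is presumably also part of why it behaves differently with respect to
layouts.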
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are down, don't know why

2016-01-18 Thread Steve Taylor
Do you have a ceph private network defined in your config file? I've seen this 
before in situations where the private network isn't functional. The osds 
can talk to the mon(s) but not to each other, so they report each other as down 
when they're all running just fine.
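For reference, the two settings involved look like this in ceph.conf (subnets
are placeholders); if the cluster network is declared but not actually
reachable between the OSD hosts, you get exactly the symptom described above:

    [global]
    public network  = 192.168.1.0/24
    cluster network = 10.10.10.0/24

If a separate cluster network isn't reliable, it is usually safer to omit it
entirely and let all traffic use the public network.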


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

If you are not the intended recipient of this message, be advised that any 
dissemination or copying of this message is prohibited.
If you received this message erroneously, please notify the sender and delete 
it, together with any attachments.


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jeff 
Epstein
Sent: Friday, January 15, 2016 7:28 PM
To: ceph-users 
Subject: [ceph-users] OSDs are down, don't know why

Hello,

I'm setting up a small test instance of ceph and I'm running into a situation 
where the OSDs are being shown as down, but I don't know why.

Connectivity seems to be working. The OSD hosts are able to communicate with 
the MON hosts; running "ceph status" and "ceph osd in" from an OSD host works 
fine, but with a HEALTH_WARN that I have 2 osds: 0 up, 2 in. 
Both the OSD and MON daemons seem to be running fine. Network connectivity 
seems to be okay: I can nc from the OSD to port 6789 on the MON, and from the 
MON to port 6800-6803 on the OSD (I have constrained the ms bind port min/max 
config options so that the OSDs will use only these ports). Neither OSD nor MON 
logs show anything that seems unusual, nor why the OSD is marked as being down.

Furthermore, using tcpdump I've watched network traffic between the OSD and the 
MON, and it seems that the OSD is sending heartbeats and getting an ack from 
the MON. So I'm definitely not sure why the MON thinks the OSD is down.

Some questions:
- How does the MON determine if the OSD is down?
- Is there a way to get the MON to report on why an OSD is down, e.g. no 
heartbeat?
- Is there any need to open ports other than TCP 6789 and 6800-6803?
- Any other suggestions?

ceph 0.94 on Debian Jessie

Best,
Jeff
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are down, don't know why

2016-01-18 Thread Jeff Epstein

Hi Steve
Thanks for your answer. I don't have a private network defined. 
Furthermore, in my current testing configuration, there is only one OSD, 
so communication between OSDs should be a non-issue.

Do you know how OSD up/down state is determined when there is only one OSD?
Best,
Jeff

On 01/18/2016 03:59 PM, Steve Taylor wrote:

Do you have a ceph private network defined in your config file? I've seen this 
before in that situation where the private network isn't functional. The osds 
can talk to the mon(s) but not to each other, so they report each other as down 
when they're all running just fine.


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

If you are not the intended recipient of this message, be advised that any 
dissemination or copying of this message is prohibited.
If you received this message erroneously, please notify the sender and delete 
it, together with any attachments.


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jeff 
Epstein
Sent: Friday, January 15, 2016 7:28 PM
To: ceph-users 
Subject: [ceph-users] OSDs are down, don't know why

Hello,

I'm setting up a small test instance of ceph and I'm running into a situation 
where the OSDs are being shown as down, but I don't know why.

Connectivity seems to be working. The OSD hosts are able to communicate with the MON hosts; running 
"ceph status" and "ceph osd in" from an OSD host works fine, but with a 
HEALTH_WARN that I have 2 osds: 0 up, 2 in.
Both the OSD and MON daemons seem to be running fine. Network connectivity 
seems to be okay: I can nc from the OSD to port 6789 on the MON, and from the 
MON to port 6800-6803 on the OSD (I have constrained the ms bind port min/max 
config options so that the OSDs will use only these ports). Neither OSD nor MON 
logs show anything that seems unusual, nor why the OSD is marked as being down.

Furthermore, using tcpdump i've watched network traffic between the OSD and the 
MON, and it seems that the OSD is sending heartbeats and getting an ack from 
the MON. So I'm definitely not sure why the MON thinks the OSD is down.

Some questions:
- How does the MON determine if the OSD is down?
- Is there a way to get the MON to report on why an OSD is down, e.g. no 
heartbeat?
- Is there any need to open ports other than TCP 6789 and 6800-6803?
- Any other suggestions?

ceph 0.94 on Debian Jessie

Best,
Jeff
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-18 Thread Mark Nelson



On 01/16/2016 12:06 PM, David wrote:

Hi!

We’re planning our third ceph cluster and been trying to find how to
maximize IOPS on this one.

Our needs:
* Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
servers)
* Pool for storage of many small files, rbd (probably dovecot maildir
and dovecot index etc)

So I’ve been reading up on:

https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance

and ceph-users from october 2015:

http://www.spinics.net/lists/ceph-users/msg22494.html

We’re planning something like 5 OSD servers, with:

* 4x 1.2TB Intel S3510
* 8x 4TB HDD
* 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
one for HDD pool journal)
* 2x 80GB Intel S3510 raid1 for system
* 256GB RAM
* 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better

This cluster will probably run Hammer LTS unless there are huge
improvements in Infernalis when dealing with 4k IOPS.

The first link above hints at awesome performance. The second one from
the list not so much yet..

Is anyone running Hammer or Infernalis with a setup like this?
Is it a sane setup?
Will we become CPU constrained or can we just throw more RAM on it? :D


On the write side you can pretty quickly hit CPU limits, though if you 
upgrade to tcmalloc 2.4 and set a high thread cache or switch to 
jemalloc it will help dramatically.  There's been various posts and 
threads about it here on the mailing list, but generally in CPU 
constrained scenarios people are seeing pretty dramatic improvements. 
(like 4X IOPs on the write side with SSDs/NVMes).
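For reference, the thread cache tuning usually comes down to one environment
variable in the OSDs' environment; on RPM-based installs that typically lives
in /etc/sysconfig/ceph (the value below is just a commonly cited starting
point, not a recommendation):

    # environment for the ceph daemons, e.g. /etc/sysconfig/ceph
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MB

Whether it helps depends on the tcmalloc version the OSDs are actually linked
against, which is another reason to verify on your own hardware.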


We are also seeing a dramatic improvement in small random write 
performance with bluestore, but that's only going to be tech preview in 
jewel.




Kind Regards,
David Majchrzak


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are down, don't know why

2016-01-18 Thread Steve Taylor
With a single osd there shouldn't be much to worry about. It will have to get 
caught up on map epochs before it will report itself as up, but on a new 
cluster that should be pretty immediate.

You'll probably have to look for clues in the osd and mon logs. I would expect 
some sort of error reported in this scenario. It seems likely that it would be 
network-related in this case, but the logs will confirm or debunk that theory.
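If the default logs stay quiet, cranking up debug output while reproducing is
usually the next step; roughly (hammer-era syntax, adjust the daemon IDs to
your own):

    ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1'
    ceph tell mon.* injectargs '--debug-mon 10 --debug-ms 1'
    ceph -w                        # watch for the reason the osd gets marked down
    ceph osd dump | grep '^osd'    # check the recorded up/down state and addresses

Remember to turn the levels back down afterwards; debug 20 logging is very
chatty.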

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

If you are not the intended recipient of this message, be advised that any 
dissemination or copying of this message is prohibited.
If you received this message erroneously, please notify the sender and delete 
it, together with any attachments.


-Original Message-
From: Jeff Epstein [mailto:jeff.epst...@commerceguys.com] 
Sent: Monday, January 18, 2016 8:32 AM
To: Steve Taylor ; ceph-users 

Subject: Re: [ceph-users] OSDs are down, don't know why

Hi Steve
Thanks for your answer. I don't have a private network defined. 
Furthermore, in my current testing configuration, there is only one OSD, so 
communication between OSDs should be a non-issue.
Do you know how OSD up/down state is determined when there is only one OSD?
Best,
Jeff

On 01/18/2016 03:59 PM, Steve Taylor wrote:
> Do you have a ceph private network defined in your config file? I've seen 
> this before in that situation where the private network isn't functional. The 
> osds can talk to the mon(s) but not to each other, so they report each other 
> as down when they're all running just fine.
>
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology 
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 | Fax: 801.545.4705
>
> If you are not the intended recipient of this message, be advised that any 
> dissemination or copying of this message is prohibited.
> If you received this message erroneously, please notify the sender and delete 
> it, together with any attachments.
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Jeff Epstein
> Sent: Friday, January 15, 2016 7:28 PM
> To: ceph-users 
> Subject: [ceph-users] OSDs are down, don't know why
>
> Hello,
>
> I'm setting up a small test instance of ceph and I'm running into a situation 
> where the OSDs are being shown as down, but I don't know why.
>
> Connectivity seems to be working. The OSD hosts are able to communicate with 
> the MON hosts; running "ceph status" and "ceph osd in" from an OSD host works 
> fine, but with a HEALTH_WARN that I have 2 osds: 0 up, 2 in.
> Both the OSD and MON daemons seem to be running fine. Network connectivity 
> seems to be okay: I can nc from the OSD to port 6789 on the MON, and from the 
> MON to port 6800-6803 on the OSD (I have constrained the ms bind port min/max 
> config options so that the OSDs will use only these ports). Neither OSD nor 
> MON logs show anything that seems unusual, nor why the OSD is marked as being 
> down.
>
> Furthermore, using tcpdump i've watched network traffic between the OSD and 
> the MON, and it seems that the OSD is sending heartbeats and getting an ack 
> from the MON. So I'm definitely not sure why the MON thinks the OSD is down.
>
> Some questions:
> - How does the MON determine if the OSD is down?
> - Is there a way to get the MON to report on why an OSD is down, e.g. no 
> heartbeat?
> - Is there any need to open ports other than TCP 6789 and 6800-6803?
> - Any other suggestions?
>
> ceph 0.94 on Debian Jessie
>
> Best,
> Jeff
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cache pool redundancy requirements.

2016-01-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

From what I understand, the scrub only scrubs PG copies in the same
pool, so there would not be much benefit to scrubbing a single-replica
pool until Ceph starts storing the hash of the metadata
and data. Even then you would only know that your data is good or bad,
not be able to repair it automatically.

From what I understand of the documentation, if you are running VMs,
you don't want the cache pool in read-only; it is really intended for
data that is read many times and written hardly ever. If, however, you
find that it works well for your use case, you may have a lot of
manual work to do when a cache OSD dies. Although the cache tier
should just go to the base tier in this situation, when you bring a
new drive into the cache, you may have to tell ceph that all the PGs
that were on that disk are now lost. Hopefully, Ceph will be smart and
just repopulate them as needed without any other issues, but it is a
big unknown for me.

The best thing to do is try it out on a test cluster first and try
different failures to make sure it works as you expect it to.
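For such a test, wiring a read-only tier up and back down only takes a few
commands; a rough sketch with placeholder pool names (and not an endorsement
of size 1 for anything you care about):

    ceph osd tier add base-pool cache-pool
    ceph osd tier cache-mode cache-pool readonly
    ceph osd tier set-overlay base-pool cache-pool
    # ...pull a cache OSD, watch 'ceph -s', check how reads behave...
    ceph osd tier remove-overlay base-pool
    ceph osd tier remove base-pool cache-pool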
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWnRpkCRDmVDuy+mK58QAA6TkP+gKNr8bkXAg3bfOJkHrJ
PIJ98bJzym5oA8Ny6lzJzgcJYA8WZfcd34ZwEN/7pPheSdcDlm4U2cs5am2h
78xR0RGVHKi4hxN/z17OzBzb9FjqJW2ed5Xq36fw9I363HEfoSkPwDbzibqT
77EGBxgZGFuUqL4lcxAd5JkQp+C4M62FEezdBmJ+nVa+OF0kCosAJelvuDpe
D5GO/8MRdmKBHbEoeSUCXX2Tk3S7XaVX/MjjiZ+2UgqLvZk5fIiLHyUKT6jC
Otcx+X7fyHHMRdg8bFxHL6Vu5iRRpT5y8M49VW7BhE5+ACo+AXuvn0yJH1rw
84jT7LzQEZeeZtwxOg1prR0qK/E73u3UeF3sONt1dmUxMp5ZEw+UW4yXRhZO
gHcfTUbVNPF3b2ZTsj4Bs1GMRktgwQvTaVMDAuBlzVLnMttVtonjOhwynk/Z
+GSfoiIdKZeq+8XGfcOS7cFIYiW01cx9KiScDZyBwV88Wwyslop+MU5wyBgM
V80TYDmmtouMvN0KuEqR+HHErVzifjOX7D5QXNdjlAtMhzmPB45D/zcrrEXk
JYpiDTWCCoADtIai+uyqZaXoE311nne4lv89gzaqZQTjXep5bLvkPHQwkUPn
GF6aOgGIZjZf199EkImqsQYmoIyTCXKlwwxcDyFqGcqk5/m43IapcpZ/SnN/
lXPM
=iUPg
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Jan 17, 2016 at 2:08 PM, Tyler Bishop
 wrote:
> Adding to this thought, even if you are using a single replica for the cache
> pool, will ceph scrub the cached block against the base tier?  What if you
> have corruption in your cache?
>
> 
> From: "Tyler Bishop" 
> To: ceph-users@lists.ceph.com
> Cc: "Sebastien han" 
> Sent: Sunday, January 17, 2016 3:47:13 PM
> Subject: Ceph Cache pool redundancy requirements.
>
> Based off Sebastiens design I had some thoughts:
> http://www.sebastien-han.fr/images/ceph-cache-pool-compute-design.png
>
> Hypervisors are for obvious reason more susceptible to crashes and reboots
> for security updates.  Since ceph is utilizing a standard pool for the cache
> tier it creates a requirement for placement group stability.   IE: We cannot
> use a pool with only 1 PG replica required. The ideal configuration would be
> to utilize a single replica ssd cache pool as READ ONLY, and all writes will
> be sent to the base tier ssd journals, this way your getting quick acks and
> fast reads without any lost flash capacity for redundancy.
>
> Has anyone tested a failure with a read only cache pool that utilizes a
> single replica?  Does ceph simply fetch the data and place it to another pg?
> The cache pool should be able to sustain drive failures with 1 replica
> because its not needed for consistency.
>
> Interesting topic here.. curious if anyone has tried this.
>
> Our current architecture utilizes 48 hosts with 2x 1T SSD each as a 2
> replica ssd pool.  We have 4 host with 52x 6T disk for a capacity pool.  We
> would like to run the base tier on the spindles with the SSD as a 100%
> utilized cache tier for busy pools.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS 7 iscsi gateway using lrbd

2016-01-18 Thread Василий Ангапов
https://github.com/swiftgist/lrbd/wiki
According to the lrbd wiki it still uses KRBD (see those /dev/rbd/...
devices in the targetcli config).
I was thinking that Mike Christie developed a librbd module for LIO.
So what is it - KRBD or librbd?

2016-01-18 20:23 GMT+08:00 Tyler Bishop :
>
> Well that's interesting.
>
> I've mounted block devices to the kernel and exported them to iscsi but the
> performance was horrible.. I wonder if this is any different?
>
>
> 
> From: "Dominik Zalewski" 
> To: ceph-users@lists.ceph.com
> Sent: Monday, January 18, 2016 6:35:20 AM
> Subject: [ceph-users] CentOS 7 iscsi gateway using lrbd
>
> Hi,
> I'm looking into implementing iscsi gateway with MPIO using lrbd -
> https://github.com/swiftgist/lrbd
>
>
> https://www.suse.com/docrep/documents/kgu61iyowz/suse_enterprise_storage_2_and_iscsi.pdf
>
> https://www.susecon.com/doc/2015/sessions/TUT16512.pdf
>
> From above examples:
>
> For iSCSI failover and load-balancing,
>
> these servers must run a kernel supporting the target_core_
>
> rbd module. This also requires that the target servers run at
>
> least the version 3.12.48-52.27.1 of the kernel-default package.
>
> Updates packages are available from the SUSE Linux
>
> Enterprise Server maintenance channel.
>
>
> I understand that lrbd is basically a nice way to configure LIO and rbd
> across ceph osd nodes/iscsi gatways. Does CentOS 7 have same target_core_rbd
> module in the kernel or this is something Suse Enterprise Storage specific
> only?
>
>
> Basically will LIO+rbd work the same way on CentOS 7? Has anyone using it
> with CentOS?
>
>
> Thanks
>
>
> Dominik
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'm not sure why you have six monitors. Six monitors buys you nothing
over five monitors other than more power being used, and more latency
and more headache. See
http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
for some more info. Also, I'd consider 5 monitors overkill for this
size cluster; I'd recommend three.

Although this is most likely not the root cause of your problem, you
probably have an error here: "root replicated-T1" is pointing to
b02s08 and b02s12 and "site erbus" is also pointing to b02s08 and
b02s12. You probably meant to have "root replicated-T1" pointing to
erbus instead.

Where I think your problem is, is in your "rule replicated" section.
You can try:
step take replicated-T1
step choose firstn 2 type host
step chooseleaf firstn 2 type osdgroup
step emit

What this does is choose two hosts from the root replicated-T1 (which
happens to be both hosts you have), then chooses an OSD from two
osdgroups on each host.
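Written out in full crush rule syntax (the ruleset number and size bounds
below are only illustrative), that suggestion would look something like:

    rule replicated {
            ruleset 1
            type replicated
            min_size 2
            max_size 4
            step take replicated-T1
            step choose firstn 2 type host
            step chooseleaf firstn 2 type osdgroup
            step emit
    }

You can also sanity-check the resulting mappings offline with
'crushtool --test -i <compiled map> --rule 1 --num-rep 4 --show-mappings'
before injecting the new map into the cluster.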

I believe the problem with your current rule set is that firstn 0 type
host tries to select four hosts, but only two are available. You
should be able to see that with 'ceph pg dump', where only two osds
will be listed in the up set.

I hope that helps.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWnR9kCRDmVDuy+mK58QAA5hUP/iJprG4nGR2sJvL//8l+
V6oLYXTCs8lHeKL3ZPagThE9oh2xDMV37WR3I/xMNTA8735grl8/AAhy8ypW
MDOikbpzfWnlaL0SWs5rIQ5umATwv73Fg/Mf+K2Olt8IGP6D0NMIxfeOjU6E
0Sc3F37nDQFuDEkBYjcVcqZC89PByh7yaId+eOgr7Ot+BZL/3fbpWIZ9kyD5
KoPYdPjtFruoIpc8DJydzbWdmha65DkB65QOZlI3F3lMc6LGXUopm4OP4sQd
txVKFtTcLh97WgUshQMSWIiJiQT7+3D6EqQyPzlnei3O3gACpkpsmUteDPpn
p8CDeJtIpgKnQZjBwfK/bUQXdIGem8Y0x/PC+1ekIhkHCIJeW2sD3mFJduDQ
9loQ9+IsWHfQmEHLMLdeNzRXbgBY2djxP2X70fXTg31fx+dYvbWeulYJHiKi
1fJS4GdbPjoRUp5k4lthk3hDTFD/f5ZuowLDIaexgISb0bIJcObEn9RWlHut
IRVi0fUuRVIX3snGMOKjLmSUe87Od2KSEbULYPTLYDMo/FsWXWHNlP3gVKKd
lQJdxcwXOW7/v5oayY4wiEE6NF4rCupcqt0nPxxmbehmeRPxgkWCKJJs3FNr
VmUdnrdpfxzR5c8dmOELJnpNS6MTT56B8A4kKmqbbHCEKpZ83piG7uwqc+6f
RKkQ
=gp/0
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Jan 17, 2016 at 6:31 PM, deeepdish  wrote:
> Hi Everyone,
>
> Looking for a double check of my logic and crush map..
>
> Overview:
>
> - osdgroup bucket type defines failure domain within a host of 5 OSDs + 1
> SSD.   Therefore 5 OSDs (all utilizing the same journal) constitute an
> osdgroup bucket.   Each host has 4 osdgroups.
> - 6 monitors
> - Two node cluster
> - Each node:
> - 20 OSDs
> -  4 SSDs
> - 4 osdgroups
>
> Desired Crush Rule outcome:
> - Assuming a pool with min_size=2 and size=4, each node would contain a
> redundant copy of each object.   Should any of the hosts fail, access to
> data would be uninterrupted.
>
> Current Crush Rule outcome:
> - There are 4 copies of each object, however I don’t believe each node has a
> redundant copy of each object, when a node fails, data is NOT accessible
> until ceph rebuilds itself / node becomes accessible again.
>
> I suspect my crush is not right, and to remedy it may take some time and
> cause cluster to be unresponsive / unavailable.Is there a way / method
> to apply substantial crush changes gradually to a cluster?
>
> Thanks for your help.
>
>
> Current crush map:
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
> device 28 osd.28
> device 29 osd.29
> device 30 osd.30
> device 31 osd.31
> device 32 osd.32
> device 33 osd.33
> device 34 osd.34
> device 35 osd.35
> device 36 osd.36
> device 37 osd.37
> device 38 osd.38
> device 39 osd.39
>
> # types
> type 0 osd
> type 1 osdgroup
> type 2 host
> type 3 rack
> type 4 site
> type 5 root
>
> # buckets
> osdgroup b02s08-osdgroupA {
> id -81 # do not change unnecessarily
> # weight 18.100
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 3.620
> item osd.1 weight 3.620
> item osd.2 weight 3.620
> item osd.3 weight 3.620
> item osd.4 weight 3.620
> }
> osdgroup b02s08-osdgroupB {
> id -82 # do not change unnecessarily
> # weight 18.100
> alg straw
> hash 0 # rjenkins1
> item osd.5 weight 3.620
> item osd.6 weight 3.620
> item osd.7 weight 3.620
> item osd.8 weight 3.620
> item osd.9 weight 3.620
> }
> osdgroup b02s08-osdg

Re: [ceph-users] OSDs are down, don't know why

2016-01-18 Thread Jeff Epstein
Unfortunately, I haven't seen any obvious suspicious log messages from 
either the OSD or the MON. Is there a way to query detailed information 
on OSD monitoring, e.g. heartbeats?


On 01/18/2016 05:54 PM, Steve Taylor wrote:

With a single osd there shouldn't be much to worry about. It will have to get 
caught up on map epochs before it will report itself as up, but on a new 
cluster that should be pretty immediate.

You'll probably have to look for clues in the osd and mon logs. I would expect 
some sort of error reported in this scenario. It seems likely that it would be 
network-related in this case, but the logs will confirm or debunk that theory.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

If you are not the intended recipient of this message, be advised that any 
dissemination or copying of this message is prohibited.
If you received this message erroneously, please notify the sender and delete 
it, together with any attachments.


-Original Message-
From: Jeff Epstein [mailto:jeff.epst...@commerceguys.com]
Sent: Monday, January 18, 2016 8:32 AM
To: Steve Taylor ; ceph-users 

Subject: Re: [ceph-users] OSDs are down, don't know why

Hi Steve
Thanks for your answer. I don't have a private network defined.
Furthermore, in my current testing configuration, there is only one OSD, so 
communication between OSDs should be a non-issue.
Do you know how OSD up/down state is determined when there is only one OSD?
Best,
Jeff

On 01/18/2016 03:59 PM, Steve Taylor wrote:

Do you have a ceph private network defined in your config file? I've seen this 
before in that situation where the private network isn't functional. The osds 
can talk to the mon(s) but not to each other, so they report each other as down 
when they're all running just fine.


Steve Taylor | Senior Software Engineer | StorageCraft Technology
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

If you are not the intended recipient of this message, be advised that any 
dissemination or copying of this message is prohibited.
If you received this message erroneously, please notify the sender and delete 
it, together with any attachments.


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Jeff Epstein
Sent: Friday, January 15, 2016 7:28 PM
To: ceph-users 
Subject: [ceph-users] OSDs are down, don't know why

Hello,

I'm setting up a small test instance of ceph and I'm running into a situation 
where the OSDs are being shown as down, but I don't know why.

Connectivity seems to be working. The OSD hosts are able to communicate with the MON hosts; running 
"ceph status" and "ceph osd in" from an OSD host works fine, but with a 
HEALTH_WARN that I have 2 osds: 0 up, 2 in.
Both the OSD and MON daemons seem to be running fine. Network connectivity 
seems to be okay: I can nc from the OSD to port 6789 on the MON, and from the 
MON to port 6800-6803 on the OSD (I have constrained the ms bind port min/max 
config options so that the OSDs will use only these ports). Neither OSD nor MON 
logs show anything that seems unusual, nor why the OSD is marked as being down.

Furthermore, using tcpdump i've watched network traffic between the OSD and the 
MON, and it seems that the OSD is sending heartbeats and getting an ack from 
the MON. So I'm definitely not sure why the MON thinks the OSD is down.

Some questions:
- How does the MON determine if the OSD is down?
- Is there a way to get the MON to report on why an OSD is down, e.g. no 
heartbeat?
- Is there any need to open ports other than TCP 6789 and 6800-6803?
- Any other suggestions?

ceph 0.94 on Debian Jessie

Best,
Jeff
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-18 Thread deeepdish
Thanks Robert.   Will definitely try this.   Is there a way to implement 
“gradual CRUSH” changes?   I noticed that whenever cluster-wide changes are pushed 
(crush map, for instance) the cluster immediately attempts to align itself, 
disrupting client access / performance…


> On Jan 18, 2016, at 12:22 , Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> I'm not sure why you have six monitors. Six monitors buys you nothing
> over five monitors other than more power being used, and more latency
> and more headache. See
> http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
> for some more info. Also, I'd consider 5 monitors overkill for this
> size cluster, I'd recommend three.
> 
> Although this is most likely not the root cause of your problem, you
> probably have an error here: "root replicated-T1" is pointing to
> b02s08 and b02s12 and "site erbus" is also pointing to b02s08 and
> b02s12. You probably meant to have "root replicated-T1" pointing to
> erbus instead.
> 
> Where I think your problem is, is in your "rule replicated" section.
> You can try:
> step take replicated-T1
> step choose firstn 2 type host
> step chooseleaf firstn 2 type osdgroup
> step emit
> 
> What this does is choose two hosts from the root replicated-T1 (which
> happens to be both hosts you have), then chooses an OSD from two
> osdgroups on each host.
> 
> I believe the problem with your current rule set is that firstn 0 type
> host tries to select four hosts, but only two are available. You
> should be able to see that with 'ceph pg dump', where only two osds
> will be listed in the up set.
> 
> I hope that helps.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.3
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWnR9kCRDmVDuy+mK58QAA5hUP/iJprG4nGR2sJvL//8l+
> V6oLYXTCs8lHeKL3ZPagThE9oh2xDMV37WR3I/xMNTA8735grl8/AAhy8ypW
> MDOikbpzfWnlaL0SWs5rIQ5umATwv73Fg/Mf+K2Olt8IGP6D0NMIxfeOjU6E
> 0Sc3F37nDQFuDEkBYjcVcqZC89PByh7yaId+eOgr7Ot+BZL/3fbpWIZ9kyD5
> KoPYdPjtFruoIpc8DJydzbWdmha65DkB65QOZlI3F3lMc6LGXUopm4OP4sQd
> txVKFtTcLh97WgUshQMSWIiJiQT7+3D6EqQyPzlnei3O3gACpkpsmUteDPpn
> p8CDeJtIpgKnQZjBwfK/bUQXdIGem8Y0x/PC+1ekIhkHCIJeW2sD3mFJduDQ
> 9loQ9+IsWHfQmEHLMLdeNzRXbgBY2djxP2X70fXTg31fx+dYvbWeulYJHiKi
> 1fJS4GdbPjoRUp5k4lthk3hDTFD/f5ZuowLDIaexgISb0bIJcObEn9RWlHut
> IRVi0fUuRVIX3snGMOKjLmSUe87Od2KSEbULYPTLYDMo/FsWXWHNlP3gVKKd
> lQJdxcwXOW7/v5oayY4wiEE6NF4rCupcqt0nPxxmbehmeRPxgkWCKJJs3FNr
> VmUdnrdpfxzR5c8dmOELJnpNS6MTT56B8A4kKmqbbHCEKpZ83piG7uwqc+6f
> RKkQ
> =gp/0
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Sun, Jan 17, 2016 at 6:31 PM, deeepdish  wrote:
>> Hi Everyone,
>> 
>> Looking for a double check of my logic and crush map..
>> 
>> Overview:
>> 
>> - osdgroup bucket type defines failure domain within a host of 5 OSDs + 1
>> SSD.   Therefore 5 OSDs (all utilizing the same journal) constitute an
>> osdgroup bucket.   Each host has 4 osdgroups.
>> - 6 monitors
>> - Two node cluster
>> - Each node:
>> - 20 OSDs
>> -  4 SSDs
>> - 4 osdgroups
>> 
>> Desired Crush Rule outcome:
>> - Assuming a pool with min_size=2 and size=4, all each node would contain a
>> redundant copy of each object.   Should any of the hosts fail, access to
>> data would be uninterrupted.
>> 
>> Current Crush Rule outcome:
>> - There are 4 copies of each object, however I don’t believe each node has a
>> redundant copy of each object, when a node fails, data is NOT accessible
>> until ceph rebuilds itself / node becomes accessible again.
>> 
>> I susepct my crush is not right, and to remedy it may take some time and
>> cause cluster to be unresponsive / unavailable.Is there a way / method
>> to apply substantial crush changes gradually to a cluster?
>> 
>> Thanks for your help.
>> 
>> 
>> Current crush map:
>> 
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable straw_calc_version 1
>> 
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>> device 6 osd.6
>> device 7 osd.7
>> device 8 osd.8
>> device 9 osd.9
>> device 10 osd.10
>> device 11 osd.11
>> device 12 osd.12
>> device 13 osd.13
>> device 14 osd.14
>> device 15 osd.15
>> device 16 osd.16
>> device 17 osd.17
>> device 18 osd.18
>> device 19 osd.19
>> device 20 osd.20
>> device 21 osd.21
>> device 22 osd.22
>> device 23 osd.23
>> device 24 osd.24
>> device 25 osd.25
>> device 26 osd.26
>> device 27 osd.27
>> device 28 osd.28
>> device 29 osd.29
>> device 30 osd.30
>> device 31 osd.31
>> device 32 osd.32
>> device 33 osd.33
>> device 34 osd.34
>> device 35 osd.35
>> device 36 osd.36
>> device 37 osd.37
>> device 38 osd.38
>> device 39 osd.39
>> 
>> # types
>> type 0 osd
>> type 1 osdgroup
>> type 2 host
>> type 3 rack
>> type 4 site
>> type 5 r

Re: [ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Not that I know of.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jan 18, 2016 at 10:33 AM, deeepdish  wrote:
> Thanks Robert.   Will definitely try this.   Is there a way to implement 
> “gradual CRUSH” changes?   I noticed whenever cluster wide changes are pushed 
> (crush map, for instance) the cluster immediately attempts to align itself 
> disrupting client access / performance…
>
>
>> On Jan 18, 2016, at 12:22 , Robert LeBlanc  wrote:
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I'm not sure why you have six monitors. Six monitors buys you nothing
>> over five monitors other than more power being used, and more latency
>> and more headache. See
>> http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
>> for some more info. Also, I'd consider 5 monitors overkill for this
>> size cluster, I'd recommend three.
>>
>> Although this is most likely not the root cause of your problem, you
>> probably have an error here: "root replicated-T1" is pointing to
>> b02s08 and b02s12 and "site erbus" is also pointing to b02s08 and
>> b02s12. You probably meant to have "root replicated-T1" pointing to
>> erbus instead.
>>
>> Where I think your problem is, is in your "rule replicated" section.
>> You can try:
>> step take replicated-T1
>> step choose firstn 2 type host
>> step chooseleaf firstn 2 type osdgroup
>> step emit
>>
>> What this does is choose two hosts from the root replicated-T1 (which
>> happens to be both hosts you have), then chooses an OSD from two
>> osdgroups on each host.
>>
>> I believe the problem with your current rule set is that firstn 0 type
>> host tries to select four hosts, but only two are available. You
>> should be able to see that with 'ceph pg dump', where only two osds
>> will be listed in the up set.
>>
>> I hope that helps.
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.3.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWnR9kCRDmVDuy+mK58QAA5hUP/iJprG4nGR2sJvL//8l+
>> V6oLYXTCs8lHeKL3ZPagThE9oh2xDMV37WR3I/xMNTA8735grl8/AAhy8ypW
>> MDOikbpzfWnlaL0SWs5rIQ5umATwv73Fg/Mf+K2Olt8IGP6D0NMIxfeOjU6E
>> 0Sc3F37nDQFuDEkBYjcVcqZC89PByh7yaId+eOgr7Ot+BZL/3fbpWIZ9kyD5
>> KoPYdPjtFruoIpc8DJydzbWdmha65DkB65QOZlI3F3lMc6LGXUopm4OP4sQd
>> txVKFtTcLh97WgUshQMSWIiJiQT7+3D6EqQyPzlnei3O3gACpkpsmUteDPpn
>> p8CDeJtIpgKnQZjBwfK/bUQXdIGem8Y0x/PC+1ekIhkHCIJeW2sD3mFJduDQ
>> 9loQ9+IsWHfQmEHLMLdeNzRXbgBY2djxP2X70fXTg31fx+dYvbWeulYJHiKi
>> 1fJS4GdbPjoRUp5k4lthk3hDTFD/f5ZuowLDIaexgISb0bIJcObEn9RWlHut
>> IRVi0fUuRVIX3snGMOKjLmSUe87Od2KSEbULYPTLYDMo/FsWXWHNlP3gVKKd
>> lQJdxcwXOW7/v5oayY4wiEE6NF4rCupcqt0nPxxmbehmeRPxgkWCKJJs3FNr
>> VmUdnrdpfxzR5c8dmOELJnpNS6MTT56B8A4kKmqbbHCEKpZ83piG7uwqc+6f
>> RKkQ
>> =gp/0
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Sun, Jan 17, 2016 at 6:31 PM, deeepdish  wrote:
>>> Hi Everyone,
>>>
>>> Looking for a double check of my logic and crush map..
>>>
>>> Overview:
>>>
>>> - osdgroup bucket type defines failure domain within a host of 5 OSDs + 1
>>> SSD.   Therefore 5 OSDs (all utilizing the same journal) constitute an
>>> osdgroup bucket.   Each host has 4 osdgroups.
>>> - 6 monitors
>>> - Two node cluster
>>> - Each node:
>>> - 20 OSDs
>>> -  4 SSDs
>>> - 4 osdgroups
>>>
>>> Desired Crush Rule outcome:
>>> - Assuming a pool with min_size=2 and size=4, all each node would contain a
>>> redundant copy of each object.   Should any of the hosts fail, access to
>>> data would be uninterrupted.
>>>
>>> Current Crush Rule outcome:
>>> - There are 4 copies of each object, however I don’t believe each node has a
>>> redundant copy of each object, when a node fails, data is NOT accessible
>>> until ceph rebuilds itself / node becomes accessible again.
>>>
>>> I susepct my crush is not right, and to remedy it may take some time and
>>> cause cluster to be unresponsive / unavailable.Is there a way / method
>>> to apply substantial crush changes gradually to a cluster?
>>>
>>> Thanks for your help.
>>>
>>>
>>> Current crush map:
>>>
>>> # begin crush map
>>> tunable choose_local_tries 0
>>> tunable choose_local_fallback_tries 0
>>> tunable choose_total_tries 50
>>> tunable chooseleaf_descend_once 1
>>> tunable straw_calc_version 1
>>>
>>> # devices
>>> device 0 osd.0
>>> device 1 osd.1
>>> device 2 osd.2
>>> device 3 osd.3
>>> device 4 osd.4
>>> device 5 osd.5
>>> device 6 osd.6
>>> device 7 osd.7
>>> device 8 osd.8
>>> device 9 osd.9
>>> device 10 osd.10
>>> device 11 osd.11
>>> device 12 osd.12
>>> device 13 osd.13
>>> device 14 osd.14
>>> device 15 osd.15
>>> device 16 osd.16
>>> device 17 osd.17
>>> device 18 osd.18
>>> device 19 osd.19
>>> device 20 osd.20
>>> device 21 osd.21
>>> device 22 osd.22
>>> device 23 osd.23
>>> device 24 osd.24
>>> device 25 osd.25
>>> device 26 osd.26
>>> device 27 osd.27
>>

[ceph-users] CephFS

2016-01-18 Thread Gregory Farnum
On Sunday, January 17, 2016, James Gallagher wrote:

> Hi,
>
> I'm looking to implement the CephFS on my Firefly release (v0.80) with
> an XFS native file system, but so far I'm having some difficulties. After
> following the ceph/qsg and creating a storage cluster, I have the following
> topology
>
> admin node - mds/mon
>osd1
>osd2
>
> Ceph health is OK, ceph -s shows
>
> monmap e1: 1 mons at {node1=192.168.43.129:6789/0}, election epoch 2,
> quorum 0 node 1
> mdsmap e6: 1/1/1 up {0=node1=up:active}
> osdmap e10: 2 osds: 2 up, 2 in
> active + clean and so on
>
> However, unfortunately when I use the guide here:
>
> http://docs.ceph.com/docs/master/cephfs/kernel/
>
> and try the command
>
> sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs -o
> name=admin-node,secretfile=admin.secret
>
> where admin-node=hostname of admin node and where admin.secret is the
> string taken from ceph.client.admin.keyring without the unnecessary bits
>
> I then get:
> mount: wrong fs type, bad option, bad superblock on 192.168.43.129:6789
> missing codepage or helper program or ...
> for several filesystems e.g. nfs cifs
> need a /sbin./mount.type helper program
>
> This leads me to believe that there is a problem with XFS, but this is
> supported with this version of Ceph so I don't really know anymore.
>
> When I try the command
>
> sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs -o
> name=admin-node,secret={secretkey}
>
> I get libceph: auth method x error -1
> mount: permission denied
>
> and when I try
> sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs
>
> I get
> no secret set ...
> error -22 on auth protocol 2 init
> then the whole mount: wrong fs type, bad option jargon again
>
>
> Any ideas?
>

What kernel are you running? What's the content of your admin.secret file?
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS

2016-01-18 Thread Ilya Dryomov
On Sun, Jan 17, 2016 at 6:34 PM, James Gallagher
 wrote:
> Hi,
>
> I'm looking to implement the CephFS on my Firefly release (v0.80) with an
> XFS native file system, but so far I'm having some difficulties. After
> following the ceph/qsg and creating a storage cluster, I have the following
> topology
>
> admin node - mds/mon
>osd1
>osd2
>
> Ceph health is OK, ceph -s shows
>
> monmap e1: 1 mons at {node1=192.168.43.129:6789/0}, election epoch 2, quorum
> 0 node 1
> mdsmap e6: 1/1/1 up {0=node1=up:active}
> osdmap e10: 2 osds: 2 up, 2 in
> active + clean and so on
>
> However, unfortunately when I use the guide here:
>
> http://docs.ceph.com/docs/master/cephfs/kernel/
>
> and try the command
>
> sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs -o
> name=admin-node,secretfile=admin.secret
>
> where admin-node=hostname of admin node and where admin.secret is the string
> taken from ceph.client.admin.keyring without the unnecessary bits

name is the name of the user, not the hostname.  If you didn't create
a custom user, doing secret= is enough.
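In other words, with the default client.admin user the mount line would look
something like (paths are illustrative):

    sudo mount -t ceph 192.168.43.129:6789:/ /mnt/mycephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

where admin.secret contains only the base64 key from ceph.client.admin.keyring,
not the whole keyring section.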

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-18 Thread Gregory Farnum
On Sun, Jan 17, 2016 at 12:34 PM, Tyler Bishop
 wrote:
> The changes you are looking for are coming from Sandisk in the ceph "Jewel" 
> release coming up.
>
> Based on benchmarks and testing, sandisk has really contributed heavily on 
> the tuning aspects and are promising 90%+ native iop of a drive in the 
> cluster.

Mmmm, they've gotten some very impressive numbers but most people
shouldn't be expecting 90% of an SSD's throughput out of their
workloads. These tests are *very* parallel and tend to run multiple
OSD processes on a single SSD, IIRC.
-Greg

>
> The biggest changes will come from the memory allocation with writes.  
> Latency is going to be a lot lower.
>
>
> - Original Message -
> From: "David" 
> To: "Wido den Hollander" 
> Cc: ceph-users@lists.ceph.com
> Sent: Sunday, January 17, 2016 6:49:25 AM
> Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs
>
> Thanks Wido, those are good pointers indeed :)
> So we just have to make sure the backend storage (SSD/NVMe journals) won’t be 
> saturated (or the controllers) and then go with as many RBD per VM as 
> possible.
>
> Kind Regards,
> David Majchrzak
>
> 16 jan 2016 kl. 22:26 skrev Wido den Hollander :
>
>> On 01/16/2016 07:06 PM, David wrote:
>>> Hi!
>>>
>>> We’re planning our third ceph cluster and been trying to find how to
>>> maximize IOPS on this one.
>>>
>>> Our needs:
>>> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
>>> servers)
>>> * Pool for storage of many small files, rbd (probably dovecot maildir
>>> and dovecot index etc)
>>>
>>
>> Not completely NVMe related, but in this case, make sure you use
>> multiple disks.
>>
>> For MySQL for example:
>>
>> - Root disk for OS
>> - Disk for /var/lib/mysql (data)
>> - Disk for /var/log/mysql (binary log)
>> - Maybe even a InnoDB logfile disk
>>
>> With RBD you gain more performance by sending I/O into the cluster in
>> parallel. So when ever you can, do so!
>>
>> Regarding small files, it might be interesting to play with the stripe
>> count and stripe size there. By default this is 1 and 4MB. But maybe 16
>> and 256k work better here.
>>
>> With Dovecot as well, use a different RBD disk for the indexes and a
>> different one for the Maildir itself.
>>
>> Ceph excels at parallel performance. That is what you want to aim for.
>>
>>> So I’ve been reading up on:
>>>
>>> https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance
>>>
>>> and ceph-users from october 2015:
>>>
>>> http://www.spinics.net/lists/ceph-users/msg22494.html
>>>
>>> We’re planning something like 5 OSD servers, with:
>>>
>>> * 4x 1.2TB Intel S3510
>>> * 8st 4TB HDD
>>> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
>>> one for HDD pool journal)
>>> * 2x 80GB Intel S3510 raid1 for system
>>> * 256GB RAM
>>> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
>>>
>>> This cluster will probably run Hammer LTS unless there are huge
>>> improvements in Infernalis when dealing 4k IOPS.
>>>
>>> The first link above hints at awesome performance. The second one from
>>> the list not so much yet..
>>>
>>> Is anyone running Hammer or Infernalis with a setup like this?
>>> Is it a sane setup?
>>> Will we become CPU constrained or can we just throw more RAM on it? :D
>>>
>>> Kind Regards,
>>> David Majchrzak
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-18 Thread Tyler Bishop
One of the other guys on the list here benchmarked them.  They spanked every 
other ssd on the *recommended* tree..

- Original Message -
From: "Gregory Farnum" 
To: "Tyler Bishop" 
Cc: "David" , "Ceph Users" 
Sent: Monday, January 18, 2016 2:01:44 PM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

On Sun, Jan 17, 2016 at 12:34 PM, Tyler Bishop
 wrote:
> The changes you are looking for are coming from Sandisk in the ceph "Jewel" 
> release coming up.
>
> Based on benchmarks and testing, sandisk has really contributed heavily on 
> the tuning aspects and are promising 90%+ native iop of a drive in the 
> cluster.

Mmmm, they've gotten some very impressive numbers but most people
shouldn't be expecting 90% of an SSD's throughput out of their
workloads. These tests are *very* parallel and tend to run multiple
OSD processes on a single SSD, IIRC.
-Greg

>
> The biggest changes will come from the memory allocation with writes.  
> Latency is going to be a lot lower.
>
>
> - Original Message -
> From: "David" 
> To: "Wido den Hollander" 
> Cc: ceph-users@lists.ceph.com
> Sent: Sunday, January 17, 2016 6:49:25 AM
> Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs
>
> Thanks Wido, those are good pointers indeed :)
> So we just have to make sure the backend storage (SSD/NVMe journals) won’t be 
> saturated (or the controllers) and then go with as many RBD per VM as 
> possible.
>
> Kind Regards,
> David Majchrzak
>
> 16 jan 2016 kl. 22:26 skrev Wido den Hollander :
>
>> On 01/16/2016 07:06 PM, David wrote:
>>> Hi!
>>>
>>> We’re planning our third ceph cluster and been trying to find how to
>>> maximize IOPS on this one.
>>>
>>> Our needs:
>>> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
>>> servers)
>>> * Pool for storage of many small files, rbd (probably dovecot maildir
>>> and dovecot index etc)
>>>
>>
>> Not completely NVMe related, but in this case, make sure you use
>> multiple disks.
>>
>> For MySQL for example:
>>
>> - Root disk for OS
>> - Disk for /var/lib/mysql (data)
>> - Disk for /var/log/mysql (binary log)
>> - Maybe even a InnoDB logfile disk
>>
>> With RBD you gain more performance by sending I/O into the cluster in
>> parallel. So when ever you can, do so!
>>
>> Regarding small files, it might be interesting to play with the stripe
>> count and stripe size there. By default this is 1 and 4MB. But maybe 16
>> and 256k work better here.
>>
>> With Dovecot as well, use a different RBD disk for the indexes and a
>> different one for the Maildir itself.
>>
>> Ceph excels at parallel performance. That is what you want to aim for.
>>
>>> So I’ve been reading up on:
>>>
>>> https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance
>>>
>>> and ceph-users from october 2015:
>>>
>>> http://www.spinics.net/lists/ceph-users/msg22494.html
>>>
>>> We’re planning something like 5 OSD servers, with:
>>>
>>> * 4x 1.2TB Intel S3510
>>> * 8st 4TB HDD
>>> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
>>> one for HDD pool journal)
>>> * 2x 80GB Intel S3510 raid1 for system
>>> * 256GB RAM
>>> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
>>>
>>> This cluster will probably run Hammer LTS unless there are huge
>>> improvements in Infernalis when dealing 4k IOPS.
>>>
>>> The first link above hints at awesome performance. The second one from
>>> the list not so much yet..
>>>
>>> Is anyone running Hammer or Infernalis with a setup like this?
>>> Is it a sane setup?
>>> Will we become CPU constrained or can we just throw more RAM on it? :D
>>>
>>> Kind Regards,
>>> David Majchrzak
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Again - state of Ceph NVMe and SSDs

2016-01-18 Thread Mark Nelson
Take Greg's comments to heart, because he's absolutely correct here. 
Distributed storage systems almost as a rule love parallelism and if you 
have enough you can often hide other issues.  Latency is probably the 
more interesting question, and frankly that's where you'll often start 
seeing the kernel, ceph code, drivers, random acts of god, etc, get in 
the way.  It's very easy for any one of these things to destroy your 
performance, so you have to be *very* *very* careful to understand 
exactly what you are seeing.  As such, don't trust any one benchmark. 
Wait until it's independently verified, possibly by multiple sources, 
before putting too much weight into it.
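
For example, a single-client, queue-depth-1 4k random-write run against an RBD
image is a reasonable way to get a feel for the latency floor (a rough sketch,
assuming fio was built with the rbd engine and using a throwaway test image):

$ rbd create --size 10240 rbd/fio-test
$ fio --name=lat-test --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --time_based --runtime=60

Look at the completion latency percentiles in the output rather than just the
IOPS number, and repeat the run more than once before believing it.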


Mark

On 01/18/2016 01:02 PM, Tyler Bishop wrote:

One of the other guys on the list here benchmarked them.  They spanked every 
other ssd on the *recommended* tree..

- Original Message -
From: "Gregory Farnum" 
To: "Tyler Bishop" 
Cc: "David" , "Ceph Users" 
Sent: Monday, January 18, 2016 2:01:44 PM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

On Sun, Jan 17, 2016 at 12:34 PM, Tyler Bishop
 wrote:

The changes you are looking for are coming from Sandisk in the ceph "Jewel" 
release coming up.

Based on benchmarks and testing, sandisk has really contributed heavily on the 
tuning aspects and are promising 90%+ native iop of a drive in the cluster.


Mmmm, they've gotten some very impressive numbers but most people
shouldn't be expecting 90% of an SSD's throughput out of their
workloads. These tests are *very* parallel and tend to run multiple
OSD processes on a single SSD, IIRC.
-Greg



The biggest changes will come from the memory allocation with writes.  Latency 
is going to be a lot lower.


- Original Message -
From: "David" 
To: "Wido den Hollander" 
Cc: ceph-users@lists.ceph.com
Sent: Sunday, January 17, 2016 6:49:25 AM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

Thanks Wido, those are good pointers indeed :)
So we just have to make sure the backend storage (SSD/NVMe journals) won’t be 
saturated (or the controllers) and then go with as many RBD per VM as possible.

Kind Regards,
David Majchrzak

16 jan 2016 kl. 22:26 skrev Wido den Hollander :


On 01/16/2016 07:06 PM, David wrote:

Hi!

We’re planning our third ceph cluster and been trying to find how to
maximize IOPS on this one.

Our needs:
* Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
servers)
* Pool for storage of many small files, rbd (probably dovecot maildir
and dovecot index etc)



Not completely NVMe related, but in this case, make sure you use
multiple disks.

For MySQL for example:

- Root disk for OS
- Disk for /var/lib/mysql (data)
- Disk for /var/log/mysql (binary log)
- Maybe even a InnoDB logfile disk

With RBD you gain more performance by sending I/O into the cluster in
parallel. So when ever you can, do so!

Regarding small files, it might be interesting to play with the stripe
count and stripe size there. By default this is 1 and 4MB. But maybe 16
and 256k work better here.

With Dovecot as well, use a different RBD disk for the indexes and a
different one for the Maildir itself.

Ceph excels at parallel performance. That is what you want to aim for.


So I’ve been reading up on:

https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance

and ceph-users from october 2015:

http://www.spinics.net/lists/ceph-users/msg22494.html

We’re planning something like 5 OSD servers, with:

* 4x 1.2TB Intel S3510
* 8st 4TB HDD
* 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
one for HDD pool journal)
* 2x 80GB Intel S3510 raid1 for system
* 256GB RAM
* 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better

This cluster will probably run Hammer LTS unless there are huge
improvements in Infernalis when dealing 4k IOPS.

The first link above hints at awesome performance. The second one from
the list not so much yet..

Is anyone running Hammer or Infernalis with a setup like this?
Is it a sane setup?
Will we become CPU constrained or can we just throw more RAM on it? :D

Kind Regards,
David Majchrzak


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] bucket type and crush map

2016-01-18 Thread Pedro Benites

Hello,

I have configured osd_crush_chooseleaf_type = 3 (rack), and I have 6 OSDs
in three hosts and three racks. My tree is this:


            datacenter datacenter1
-7  5.45999     rack rack1
-2  5.45999         host storage1
 0  2.73000             osd.0   up  1.0  1.0
 3  2.73000             osd.3   up  1.0  1.0

-8  5.45999     rack rack2
-3  5.45999         host storage2
 1  2.73000             osd.1   up  1.0  1.0
 4  2.73000             osd.4   up  1.0  1.0

-6  5.45999 datacenter datacenter2
-9  5.45999     rack rack3
-4  5.45999         host storage3
 2  2.73000             osd.2   up  1.0  1.0
 5  2.73000             osd.5   up  1.0  1.0



But when I created my fourth pool I got the message "too many PGs per
OSD (420 > max 300)".
I don't understand that message, because I have 840 PGs and 6 OSDs, i.e. 140
PGs per OSD.

Why does the warning say 420?
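
Could it be that the warning counts PG replicas rather than distinct PGs? Assuming
every pool uses the default size of 3, that would give 840 PGs x 3 replicas / 6 OSDs
= 420 PG copies per OSD, which is exactly the number in the warning.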


Regards,
Pedro.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread david
Hi,
	Is CephFS stable enough to deploy in production environments? And did you
compare the performance of nfs-ganesha against the standard kernel-based NFSd,
both exporting CephFS?

> On Jan 18, 2016, at 20:34, Burkhard Linke 
>  wrote:
> 
> Hi,
> 
> On 18.01.2016 10:36, david wrote:
>> Hello All.
>>  Does anyone provides Ceph rbd/rgw/cephfs through NFS?  I have a 
>> requirement about Ceph Cluster which needs to provide NFS service.
> 
> We export a CephFS mount point on one of our NFS servers. Works out of the 
> box with Ubuntu Trusty, a recent kernel and kernel-based cephfs driver.
> 
> ceph-fuse did not work that well, and using nfs-ganesha 2.2 instead of 
> standard kernel based NFSd resulted in segfaults and permissions problems.
> 
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread david
Hi,
	Thanks for your answer. Is CephFS stable enough to deploy in production
environments? And did you compare the performance of nfs-ganesha against the
standard kernel-based NFSd, both exporting CephFS?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and NFS

2016-01-18 Thread Gregory Farnum
On Mon, Jan 18, 2016 at 4:48 AM, Arthur Liu  wrote:
>
>
> On Mon, Jan 18, 2016 at 11:34 PM, Burkhard Linke
>  wrote:
>>
>> Hi,
>>
>> On 18.01.2016 10:36, david wrote:
>>>
>>> Hello All.
>>> Does anyone provides Ceph rbd/rgw/cephfs through NFS?  I have a
>>> requirement about Ceph Cluster which needs to provide NFS service.
>>
>>
>> We export a CephFS mount point on one of our NFS servers. Works out of the
>> box with Ubuntu Trusty, a recent kernel and kernel-based cephfs driver.
>>
>> ceph-fuse did not work that well, and using nfs-ganesha 2.2 instead of
>> standard kernel based NFSd resulted in segfaults and permissions problems.
>
>
> I've found that using knfsd does not preserve cephfs directory and file
> layouts, but using nfs-ganesha does. I'm currently using nfs-ganesha 2.4dev5
> and seems stable so far.

Can you expand on that? In what manner is it not preserving directory
and file layouts?
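
For reference, the layouts in question are the ones exposed through the cephfs
virtual xattrs, so something like this (paths are just placeholders) should show
whether a file created over the NFS export picked up the directory's layout:

$ getfattr -n ceph.dir.layout /mnt/cephfs/somedir              # layout new files in this directory should inherit
$ getfattr -n ceph.file.layout /mnt/cephfs/somedir/nfs-file    # layout actually recorded on a file created via NFS
$ setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/somedir   # set a non-default layout to make differences obvious
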
-Greg

>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis upgrade breaks when journal on separate partition

2016-01-18 Thread Francois Lafont
Hi,

I have not well followed this thread, so sorry in advance if I'm a little out
of topic. Personally I'm using this udev rule and it works well (servers are
Ubuntu Trusty):

~# cat /etc/udev/rules.d/90-ceph.rules
ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_NAME}=="osd-?*-journal", OWNER="ceph"

Indeed, I'm using GPT and all my journal partitions have this partname pattern:

/osd-[0-9]+-journal/

If you currently don't use GPT (but msdos partitions), I think you can do the same
thing by using _explicit_ "by-id". For instance something like that (not tested!):

ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="xxx", OWNER="ceph"
ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="yyy", OWNER="ceph"
# etc.

where xxx, yyy, etc. are the names of your journal partitions in /dev/disk/by-id/.
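
To check that such a rule really matches, something like this should work (again
untested, and /dev/sdb2 is just an example journal partition):

~# udevadm control --reload-rules
~# udevadm trigger --action=add /dev/sdb2
~# ls -l /dev/sdb2                      # should now show "ceph" as the owner
~# udevadm test /sys/class/block/sdb2   # dry-run of the rule processing for that partition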

HTH. ;)

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-18 Thread Francois Lafont
Hi,

On 18/01/2016 05:00, Adam Tygart wrote:

> As I understand it:

I think you understand well. ;)

> 4.2G is used by ceph (all replication, metadata, et al) it is a sum of
> all the space "used" on the osds.

I confirm that.

> 958M is the actual space the data in cephfs is using (without replication).
> 3.8G means you have some sparse files in cephfs.
> 
> 'ceph df detail' should return something close to 958MB used for your
> cephfs "data" pool. "RAW USED" should be close to 4.2GB

Yes, your predictions are correct. ;)

However, I still have a question. Since my previous message, additional
data have been put into the cephfs and the values have changed, as you can see:

~# du -sh /mnt/cephfs/
1.2G    /mnt/cephfs/

~# du --apparent-size -sh /mnt/cephfs/
6.4G    /mnt/cephfs/

You can see that the difference between "disk usage" and "apparent size"
has really increased, and it seems curious to me that sparse files alone could
explain this difference (in my mind, sparse files are very specific files, and
here the files are essentially images, which don't look like good candidates
for sparse files). I'm not completely sure, but I think the same files have
been put into the cephfs directory several times.
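
(As I understand sparse files, a quick throwaway test on a local filesystem shows
the effect:

~# truncate -s 1G /tmp/sparse-test
~# du -sh --apparent-size /tmp/sparse-test    # ~1.0G logical size
~# du -sh /tmp/sparse-test                    # ~0, almost no blocks allocated

so a big gap between the two numbers normally points at files like this, which is
why the difference surprises me here.)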

Do you think it's possible that the same files, present in different directories
of the cephfs, are stored in only one object in the cephfs pool?

This is my feeling when I see the difference between "apparent size" and
"disk usage" which has increased. Am I wrong?

Anyway, thanks a lot for the explanations Adam.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Keystone PKIZ token support for RadosGW

2016-01-18 Thread Blair Bethwaite
Hi all,

Does anyone know if RGW supports Keystone's PKIZ tokens, or better yet know
a list of the supported token types?

Cheers,

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-18 Thread Francois Lafont
On 19/01/2016 05:19, Francois Lafont wrote:

> However, I still have a question. Since my previous message, supplementary
> data have been put in the cephfs and the values have changes as you can see:
> 
> ~# du -sh /mnt/cephfs/
> 1.2G  /mnt/cephfs/
> 
> ~# du --apparent-size -sh /mnt/cephfs/
> 6.4G  /mnt/cephfs/
> 
> You can see that the difference between "disk usage" and "apparent size"
> has really increased and it seems to me curious that only sparse files can
> explain this difference (in my mind, sparse files are very specific files
> and here the files are essentially images which doesn't seem to me potential
> sparse files). I'm not completely sure but I think that same files are put in
> the cephfs directory.
> 
> Do you think it's possible that the sames file present in different 
> directories
> of the cephfs are stored in only one object in the cephfs pool?
> 
> This is my feeling when I see the difference between "apparent size" and
> "disk usage" which has increased. Am I wrong?

In fact, I'm not so sure. Here is some more information, where /backups is an XFS
partition:

~# du --apparent-size -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
2.8G    /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/

~# du -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
701M    /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/

~# cp -r /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/ /backups/test

~# du -sh /backups/test
701M    /backups/test

~# du --apparent-size -sh /backups/test
701M    /backups/test

So I definitely don't understand how du --apparent-size -sh behaves...


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-18 Thread Adam Tygart
It appears that with --apparent-size, du adds the "size" of the
directories to the total as well. On most filesystems this is the
block size, or the amount of metadata space the directory is using. On
CephFS, this size is fabricated to be the size sum of all sub-files.
i.e. a cheap/free 'du -sh $folder'

$ stat /homes/mozes/tmp/sbatten
  File: '/homes/mozes/tmp/sbatten'
  Size: 138286  Blocks: 0  IO Block: 65536  directory
Device: 0h/0d   Inode: 1099523094368  Links: 1
Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
Access: 2016-01-19 00:12:23.331201000 -0600
Modify: 2015-10-14 13:38:01.098843320 -0500
Change: 2015-10-14 13:38:01.098843320 -0500
 Birth: -
$ stat /tmp/sbatten/
  File: '/tmp/sbatten/'
  Size: 4096Blocks: 8  IO Block: 4096   directory
Device: 803h/2051d  Inode: 9568257 Links: 2
Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
Access: 2016-01-19 00:12:23.331201000 -0600
Modify: 2015-10-14 13:38:01.098843320 -0500
Change: 2016-01-19 00:17:29.658902081 -0600
 Birth: -

$ du -s --apparent-size -B1 /homes/mozes/tmp/sbatten
276572  /homes/mozes/tmp/sbatten
$ du -s -B1 /homes/mozes/tmp/sbatten
147456  /homes/mozes/tmp/sbatten

$ du -s -B1 /tmp/sbatten
225280  /tmp/sbatten
$ du -s --apparent-size -B1 /tmp/sbatten
142382  /tmp/sbatten

Notice how the apparent-size total is *exactly* the directory's Size from stat
plus the apparent sizes of the files it contains?
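
Working the numbers from the listings above (assuming the /tmp copy holds the
same files):

  CephFS:  138286 (files' apparent sizes) + 138286 (dir Size, the fabricated recursive sum) = 276572
  /tmp:    138286 (files' apparent sizes) +   4096 (dir Size, one filesystem block)         = 142382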

--
Adam

On Mon, Jan 18, 2016 at 11:45 PM, Francois Lafont  wrote:
> On 19/01/2016 05:19, Francois Lafont wrote:
>
>> However, I still have a question. Since my previous message, supplementary
>> data have been put in the cephfs and the values have changes as you can see:
>>
>> ~# du -sh /mnt/cephfs/
>> 1.2G  /mnt/cephfs/
>>
>> ~# du --apparent-size -sh /mnt/cephfs/
>> 6.4G  /mnt/cephfs/
>>
>> You can see that the difference between "disk usage" and "apparent size"
>> has really increased and it seems to me curious that only sparse files can
>> explain this difference (in my mind, sparse files are very specific files
>> and here the files are essentially images which doesn't seem to me potential
>> sparse files). I'm not completely sure but I think that same files are put in
>> the cephfs directory.
>>
>> Do you think it's possible that the sames file present in different 
>> directories
>> of the cephfs are stored in only one object in the cephfs pool?
>>
>> This is my feeling when I see the difference between "apparent size" and
>> "disk usage" which has increased. Am I wrong?
>
> In fact, I'm not so sure. Here another information, where /backups is a XFS 
> partition:
>
> ~# du --apparent-size -sh 
> /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
> 2.8G/mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
>
> ~# du -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
> 701M/mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
>
> ~# cp -r /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/ 
> /backups/test
>
> ~# du -sh /backups/test
> 701M/backups/test
>
> ~# du --apparent-size -sh /backups/test
> 701M/backups/test
>
> So I definitively don't understand of du --apparent-size -sh...
>
>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com