Re: [ceph-users] Ceph expansion/deploy via ansible

2019-04-17 Thread Francois Lafont

Hi,

+1 for ceph-ansible too. ;)

--
François (flaf)


[ceph-users] radosgw in Nautilus: message "client_io->complete_request() returned Broken pipe"

2019-04-17 Thread Francois Lafont

Hi @ll,

I have a Nautilus Ceph cluster up and running with radosgw
in a zonegroup. I'm using the Beast web frontend
(the default in Nautilus). Everything seems to work fine,
but in the radosgw log I have this message:

Apr 17 14:02:56 rgw-m-1 ceph-m-rgw.rgw-m-1.rgw0[888]: 2019-04-17 14:02:56.410 
7fe659803700  0 ERROR: client_io->complete_request() returned Broken pipe

approximately every 2-3 minutes (that's an average; the interval
is random, it's not exactly every 2 minutes).
I think the code which generates this message is
here:

https://github.com/ceph/ceph/blob/master/src/rgw/rgw_process.cc#L283-L287

but I'm completely unqualified to understand the code.
What is the meaning of this error message? Should I worry
about it?
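
In case it helps, here is how I plan to raise the rgw log verbosity temporarily to
get more context around the message (just a sketch; the admin socket file name is an
assumption on my side, it may be named differently in your deployment):

~# ls /var/run/ceph/                      # find the rgw admin socket
~# ceph daemon /var/run/ceph/<the-rgw-asok-file> config set debug_rgw 20
# ... wait for the message to show up again, read the log, then lower the level:
~# ceph daemon /var/run/ceph/<the-rgw-asok-file> config set debug_rgw 0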

François (flaf)


PS: just in case, here is my conf:


~$ cat /etc/ceph/ceph-m.conf
[client.rgw.rgw-m-1.rgw0]
host = rgw-m-1
keyring = /var/lib/ceph/radosgw/ceph-m-rgw.rgw-m-1.rgw0/keyring
log file = /var/log/ceph/ceph-m-rgw-rgw-m-1.rgw0.log
rgw frontends = beast endpoint=192.168.222.1:80
rgw thread pool size = 512

[client.rgw.rgw-m-2.rgw0]
host = rgw-m-2
keyring = /var/lib/ceph/radosgw/ceph-m-rgw.rgw-m-2.rgw0/keyring
log file = /var/log/ceph/ceph-m-rgw-rgw-m-2.rgw0.log
rgw frontends = beast endpoint=192.168.222.2:80
rgw thread pool size = 512

# Please do not change this file directly since it is managed by Ansible and 
will be overwritten
[global]
cluster network = 10.90.90.0/24
debug_rgw = 0/5
fsid = bb27079f-f116-4440-8a64-9ed430dc17be
log file = /dev/null
mon cluster log file = /dev/null
mon host = 
[v2:192.168.221.31:3300,v1:192.168.221.31:6789],[v2:192.168.221.32:3300,v1:192.168.221.32:6789],[v2:192.168.221.33:3300,v1:192.168.221.33:6789]
mon_osd_down_out_subtree_limit = host
mon_osd_min_down_reporters = 4
osd_crush_chooseleaf_type = 1
osd_crush_update_on_start = true
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 8
osd_pool_default_pgp_num = 8
osd_pool_default_size = 3
public network = 192.168.221.0/25
rgw_enable_ops_log = true
rgw_log_http_headers = http_x_forwarded_for
rgw_ops_log_socket_path = /var/run/ceph/rgw-opslog.asok
rgw_realm = denmark
rgw_zone = zone-m
rgw_zonegroup = copenhagen


Installed via ceph-ansible (stable-4.0) with a Docker deployment.
ceph_docker_image: v4.0.0-stable-4.0-nautilus-centos-7-x86_64
ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)


Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-17 Thread Francois Lafont

Hi Matt,

On 4/17/19 1:08 AM, Matt Benjamin wrote:


Why is using an explicit unix socket problematic for you?  For what it
does, that decision has always seemed sensible.


In fact, I don't understand why the "ops" logs take a different
path from the logs of the radosgw process itself. Personally, if
radosgw is launched without a foreground option, it would seem
logical to me that the "ops" logs go to "log_file" (i.e.
/var/log/ceph/$cluster-$name.log by default), and if radosgw is
launched with a foreground option (i.e. -d or -f), it would seem
logical that the "ops" logs go to stdout/stderr too.

Is there a specific reason to put the "ops" logs in a different
location from the logs of the radosgw process itself? The "ops"
logs are logs of the "radosgw" process, aren't they?

In my case, I use ceph-ansible with docker containers (it works
fine by the way ;)):

1. a systemd unit launches a docker container
2. the docker container launches a radosgw process with the -d
   option (ie "run in foreground, log to stderr").
3. systemd logs stdout/stderr of the radosgw process to syslog.

It would be handy for me if the "ops" logs were written directly
to stdout/stderr. No?

--
François (flaf)


Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-16 Thread Francois Lafont

Hi @all,

On 4/9/19 12:43 PM, Francois Lafont wrote:


I have tried this config:

-
rgw enable ops log  = true
rgw ops log socket path = /tmp/opslog
rgw log http headers    = http_x_forwarded_for
-

and I have logs in the socket /tmp/opslog like this:

-
{"bucket":"test1","time":"2019-04-09 09:41:18.188350Z","time_local":"2019-04-09 11:41:18.188350","remote_addr":"10.111.222.51","user":"flaf","operation":"GET","uri":"GET /?prefix=toto/&delimiter=%2F 
HTTP/1.1","http_status":"200","error_code":"","bytes_sent":832,"bytes_received":0,"object_size":0,"total_time":39,"user_agent":"DragonDisk 1.05 ( http://www.dragondisk.com 
)","referrer":"","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]},
-

I can see the IP address of the client in the value of HTTP_X_FORWARDED_FOR, 
that's cool.

But I don't understand why there is a specific socket to log that? I'm using radosgw in a Docker container 
(installed via ceph-ansible) and I have logs of the "radosgw" daemon in the 
"/var/log/syslog" file of my host (I'm using the Docker "syslog" log-driver).

1. Why is there a _separate_ log source for that? Indeed, in "/var/log/syslog" 
I have already some logs of civetweb. For instance:

     2019-04-09 12:33:45.926 7f02e021c700  1 civetweb: 0x55876dc9c000: 10.111.222.51 - - 
[09/Apr/2019:12:33:45 +0200] "GET /?prefix=toto/&delimiter=%2F HTTP/1.1" 200 
1014 - DragonDisk 1.05 ( http://www.dragondisk.com )


The fact that radosgw uses a separate log source for the "ops log" (i.e. a specific 
Unix socket) is still very mysterious to me.



2. In my Docker container context, is it possible to put the logs above in the file 
"/var/log/syslog" of my host, in other words is it possible to make sure to log this in 
stdout of the daemon "radosgw"?


It seems impossible to send the ops log to the stdout of the "radosgw" process 
(or, if it is possible, I have not found how). So I have made a workaround. I have set:

rgw_ops_log_socket_path = /var/run/ceph/rgw-opslog.asok

in my ceph.conf and I have created a daemon (via a systemd unit file) which 
runs this loop:

while true; do
    netcat -U "/var/run/ceph/rgw-opslog.asok" | logger -t "rgwops" -p "local5.notice"
done

to retrieve the logs in syslog. It's not very satisfying but it works.
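
For the record, the systemd unit is nothing fancy. Roughly something like this (a
simplified sketch of what I use; the unit name is mine):

~# cat /etc/systemd/system/rgw-opslog.service
[Unit]
Description=Forward the radosgw ops log from its Unix socket to syslog
After=network.target

[Service]
ExecStart=/bin/sh -c 'while true; do netcat -U /var/run/ceph/rgw-opslog.asok | logger -t rgwops -p local5.notice; done'
Restart=always

[Install]
WantedBy=multi-user.target

~# systemctl daemon-reload && systemctl enable --now rgw-opslog.service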

--
François (flaf)


Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-09 Thread Francois Lafont

On 4/9/19 12:43 PM, Francois Lafont wrote:


2. In my Docker container context, is it possible to put the logs above in the file 
"/var/log/syslog" of my host, in other words is it possible to make sure to log this in 
stdout of the daemon "radosgw"?


In brief, is it possible to log the "operations" to a regular file or, better for me, 
to stdout?


--
flaf


Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-09 Thread Francois Lafont

Hi,


On 4/9/19 5:02 AM, Pavan Rallabhandi wrote:


Refer "rgw log http headers" under 
http://docs.ceph.com/docs/nautilus/radosgw/config-ref/

Or even better in the code https://github.com/ceph/ceph/pull/7639



Ok, thx for your help Pavan. I have made progress but I still have some 
problems. With the help of this comment:

https://github.com/ceph/ceph/pull/7639#issuecomment-266893208

I have tried this config:

-
rgw enable ops log  = true
rgw ops log socket path = /tmp/opslog
rgw log http headers = http_x_forwarded_for
-

and I have logs in the socket /tmp/opslog like this:

-
{"bucket":"test1","time":"2019-04-09 09:41:18.188350Z","time_local":"2019-04-09 11:41:18.188350","remote_addr":"10.111.222.51","user":"flaf","operation":"GET","uri":"GET /?prefix=toto/&delimiter=%2F 
HTTP/1.1","http_status":"200","error_code":"","bytes_sent":832,"bytes_received":0,"object_size":0,"total_time":39,"user_agent":"DragonDisk 1.05 ( http://www.dragondisk.com 
)","referrer":"","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]},
-

I can see the IP address of the client in the value of HTTP_X_FORWARDED_FOR, 
that's cool.

But I don't understand why there is a specific socket for logging that. I'm using radosgw in a Docker container 
(installed via ceph-ansible) and the logs of the "radosgw" daemon end up in the 
"/var/log/syslog" file of my host (I'm using the Docker "syslog" log-driver).

1. Why is there a _separate_ log source for that? Indeed, in "/var/log/syslog" 
I already have some civetweb logs. For instance:

2019-04-09 12:33:45.926 7f02e021c700  1 civetweb: 0x55876dc9c000: 10.111.222.51 - - 
[09/Apr/2019:12:33:45 +0200] "GET /?prefix=toto/&delimiter=%2F HTTP/1.1" 200 
1014 - DragonDisk 1.05 ( http://www.dragondisk.com )

2. In my Docker container context, is it possible to put the logs above in the 
"/var/log/syslog" file of my host? In other words, is it possible to make these 
logs go to the stdout of the "radosgw" daemon?

--
flaf


[ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-08 Thread Francois Lafont

Hi @all,

I'm using the Ceph rados gateway installed via ceph-ansible, Nautilus
version. The radosgw instances are behind an haproxy which adds these headers (checked
via tcpdump):

X-Forwarded-Proto: http
X-Forwarded-For: 10.111.222.55

where 10.111.222.55 is the IP address of the client. The radosgw instances use the
civetweb HTTP frontend. Currently, it is the IP address of the haproxy
itself that appears in the logs. I would like the logs to show the IP address
from the X-Forwarded-For HTTP header instead. How can I do that?
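
For completeness, on the haproxy side the header is added with the usual
"option forwardfor"; the relevant part of my configuration looks roughly like this
(simplified sketch, names and addresses are placeholders):

-
defaults
    mode http
    option forwardfor

frontend rgw_front
    bind *:80
    default_backend rgw_back

backend rgw_back
    server rgw-01 <rgw-01-ip>:8080 check
    server rgw-02 <rgw-02-ip>:8080 check
-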

I have tried this option in ceph.conf:

rgw_remote_addr_param = X-Forwarded-For

It doesn't work, but maybe I have misread the doc.

Thx in advance for your help.

PS: I have also tried the "beast" HTTP frontend but, in that case, no HTTP
request seems to be logged.

--
François


Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4

2017-10-21 Thread Francois Lafont
Hi @all,

On 02/08/2017 08:45 PM, Jim Kilborn wrote:

> I have had two ceph monitor nodes generate swap space alerts this week.
> Looking at the memory, I see ceph-mon using a lot of memory and most of the 
> swap space. My ceph nodes have 128GB mem, with 2GB swap  (I know the 
> memory/swap ratio is odd)

I had exactly the same problem here in my little ceph cluster:

- 5 nodes ceph01,02,03,04,05 on Ubuntu Trusty, kernel 3.13 (the kernel from the 
distribution).
- Ceph version Jewel 10.2.9
- 4 OSDs per node
- 3 monitors on ceph01,02,03
- 1 active and 2 standby mds on ceph01,02,03

Yesterday, on _ceph02_, I had:

1. Swap and RAM at 100%.
2. A process kswapd0 which took 100% of 1 CPU.
3. A simple "ceph status" or "ceph --version" on this node (ceph02) failed
   with "ImportError: librados.so.2 cannot map zero-fill pages: Cannot allocate 
memory".

However, on the other nodes, a "ceph status" gave me a fully HEALTH_OK cluster.

Maybe an important point: ceph01, ceph02 and ceph03 are identical servers
(same hardware and configuration via Puppet, 4 OSDs + 1 mon + 1 mds each), but the
_active_ mds has been hosted on ceph02 (for approximately 2 months).

The ceph-mon process on ceph02 was OOM-killed by the kernel last
night and the memory usage is back to normal now.

The data in the monitor working dir are really small as you can see:

            Filesystem  Size  Used  Avail  Use%  Mounted on
  ceph01 => /dev/sda5    30G  126M    30G    1%  /var/lib/ceph/mon/ceph-ceph01
  ceph02 => /dev/sda5    30G  121M    30G    1%  /var/lib/ceph/mon/ceph-ceph02
  ceph03 => /dev/sda5    30G   78M    30G    1%  /var/lib/ceph/mon/ceph-ceph03

It seems to me that the problem builds up gradually over approximately
2 months; it does not appear suddenly.
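
(For the record, this is how I watched the memory of the monitor, nothing exotic:)

~# ps -o pid,rss,vsz,cmd -C ceph-mon
~# grep -E 'VmRSS|VmSwap' /proc/"$(pgrep -o ceph-mon)"/status
~# free -m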

Is it a known issue?

Thanks for your help.

-- 
François Lafont


Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)

2016-12-20 Thread Francois Lafont
On 12/20/2016 10:02 AM, Wido den Hollander wrote:

> I think it is commit 0cdf3bc875447c87fdc0fed29831554277a3774b: 
> https://github.com/ceph/ceph/commit/0cdf3bc875447c87fdc0fed29831554277a3774b

Thanks Wido but in fact I have doubts...

> It invokes a start after the package install/upgrade. Since you have manually 
> stopped the daemons they will be started again.

Yes, that seems logical, but I have checked with the 10.2.1
version, for instance, and I see these lines in the ceph-osd
postinst too:

case "$1" in
configure)
[ -x /sbin/start ] && start ceph-osd-all || :

and I'm pretty sure that during the 10.2.0 => 10.2.1 upgrade
the OSD daemons were not started again (after I had manually
stopped them). I'm fairly confident because, although my process
is manual, it's a checklist I follow to the letter, and I'm sure
I noticed the change with the 10.2.5 version.

However, between 10.2.1 and 10.2.5 I have noticed this diff
in the postinst:


# Automatically added by dh_systemd_enable
# This will only remove masks created by d-s-h on package removal.
deb-systemd-helper unmask ceph-osd.target >/dev/null || true

# was-enabled defaults to true, so new installations run enable.
if deb-systemd-helper --quiet was-enabled ceph-osd.target; then
# Enables the unit on first installation, creates new
# symlinks on upgrades if the unit file has changed.
deb-systemd-helper enable ceph-osd.target >/dev/null || true
else
# Update the statefile to add new symlinks (if any), which need to be
# cleaned up on purge. Also remove old symlinks.
deb-systemd-helper update-state ceph-osd.target >/dev/null || true
fi
# End automatically added section
# Automatically added by dh_systemd_start
if [ -d /run/systemd/system ]; then
systemctl --system daemon-reload >/dev/null || true
deb-systemd-invoke start ceph-osd.target >/dev/null || true
fi
# End automatically added section


I don't know if that can explain the change I have noticed...
Currently I'm lost. As you said Wido, the line "... start ceph-osd-all..."
should restart the OSD daemons I stopped manually, but this line
has been present since 10.2.1 at least, and I'm pretty sure I
didn't have this behavior with 10.2.1.



Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)

2016-12-20 Thread Francois Lafont
On 12/19/2016 09:58 PM, Ken Dreyer wrote:

> I looked into this again on a Trusty VM today. I set up a single
> mon+osd cluster on v10.2.3, with the following:
> 
>   # status ceph-osd id=0
>   ceph-osd (ceph/0) start/running, process 1301
> 
>   #ceph daemon osd.0 version
>   {"version":"10.2.3"}
> 
> I ran "apt-get upgrade" to get go 10.2.3 -> 10.2.5, and the OSD PID
> (1301) and version from the admin socket (v10.2.3) remained the same.

From which repository did you retrieve the 10.2.3 version of ceph?
I could run a test too.

> Could something else be restarting the daemons in your case?

I use Puppet to manage my hosts, but the "ceph" services are all *un*managed
by Puppet, I'm sure of that (the Puppet run is weekly only, and I have noticed the
behavior on all 5 of my nodes). Management of the "ceph" services is completely
manual in my case.


Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)

2016-12-19 Thread Francois Lafont
Hi,

On 12/19/2016 09:58 PM, Ken Dreyer wrote:

> I looked into this again on a Trusty VM today. I set up a single
> mon+osd cluster on v10.2.3, with the following:
> 
>   # status ceph-osd id=0
>   ceph-osd (ceph/0) start/running, process 1301
> 
>   #ceph daemon osd.0 version
>   {"version":"10.2.3"}
> 
> I ran "apt-get upgrade" to get go 10.2.3 -> 10.2.5, and the OSD PID
> (1301) and version from the admin socket (v10.2.3) remained the same.
> 
> Could something else be restarting the daemons in your case?

As Christian said, this is not _exactly_ the "problem" I described
in my first message. You can read it again: I gave _verbatim_ the commands
I run on the host during an upgrade. Personally, I manually stop the
daemons before the "ceph" upgrade (which is not the case in your example
above):

1. I manually stop all OSD daemons on the host.
2. I run the "ceph" upgrade (sudo apt-get update && sudo apt-get upgrade).

Then...

3(i).  Before the 10.2.5 version, the ceph daemons stay stopped.
3(ii). With the 10.2.5 version, the ceph daemons get started 
automatically.

Personally I would prefer the 3(i) scenario (all details are in my first message).
I don't know what exactly, but something has changed with version 10.2.5.

Regards.


Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)

2016-12-13 Thread Francois Lafont
On 12/13/2016 12:42 PM, Francois Lafont wrote:

> But, _by_ _principle_, in the specific case of ceph (I know it's not the
> usual case of packages which provide daemons), I think it would be more
> safe and practical that the ceph packages don't manage the restart of
> daemons.

And I'm saying (even if I think it was relatively clear in my first post) that
*this was already the case* before the 10.2.5 version, so I was surprised by this
change.


[ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)

2016-12-13 Thread Francois Lafont
Hi @all,

I have a little remark concerning at least the Trusty ceph packages (maybe
it concerns other distributions too, I don't know).

I'm pretty sure that before the 10.2.5 version, the restart of the daemons
wasn't handled during the package upgrade, whereas with the 10.2.5 version it
is. Let me explain below.

Personally, during a "ceph" upgrade, I prefer to manage the "ceph" daemons
_myself_. For instance, during a "ceph" upgrade of an Ubuntu Trusty OSD server,
I usually do something like this:


# I stop all the OSD daemons (here, it's an upstart command but it's
# an implementation detail, the idea is just "I stop all OSD"):
sudo stop ceph-osd-all

# And after that, I launch the "ceph" upgrade with something like that:
sudo apt-get update && sudo apt-get upgrade

# (*) Before the 10.2.5 version, the daemons weren't automatically
# restarted by the upgrade and, personally, that was a _good_ thing
# for me. Now, with the 10.2.5 version, the daemons seem to be
# automatically restarted.

# Personally, after a "ceph" upgrade, I always prefer to _reboot_
# the server.
sudo reboot


So now, with the 10.2.5 version, in my process the OSD daemons are stopped,
then automatically restarted by the upgrade, and then stopped again
by the reboot. This is not an optimal process, of course. ;)

I am well aware of workarounds to avoid an automatic restart of the
daemons during the "ceph" upgrades (for instance, in the case of
Trusty, I could temporarily remove the files
/var/lib/ceph/osd/ceph-$id/upstart); another one is sketched below.
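
(Another generic workaround, sketched from memory and not specific to ceph, is the
standard Debian "policy-rc.d" trick; note that it only affects maintainer scripts
which go through invoke-rc.d / deb-systemd-invoke, so it may not cover a raw
upstart "start" call:)

sudo sh -c 'printf "#!/bin/sh\nexit 101\n" > /usr/sbin/policy-rc.d'
sudo chmod +x /usr/sbin/policy-rc.d
sudo apt-get update && sudo apt-get upgrade
sudo rm /usr/sbin/policy-rc.d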

But, _on_ _principle_, in the specific case of ceph (I know it's not the
usual situation for packages which provide daemons), I think it would be
safer and more practical if the ceph packages did not manage the restart of
the daemons.

What do you think about that? Maybe I'm wrong... ;)

François Lafont


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Francois Lafont
On 12/09/2016 06:39 PM, Alex Evonosky wrote:

> Sounds great.  May I asked what procedure you did to upgrade?

Of course. ;)

It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
(I think this link was pointed out by Greg Farnum or Sage Weil in a
previous message).

Personally I use Ubuntu Trusty, so the page above leads me
to use this line in my "sources.list":

deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main

And after that "apt-get update && apt-get upgrade" etc.


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Francois Lafont
Hi,

Just for information: ~30 hours after upgrading my whole cluster
(osd, mon and mds) to version 10.2.4-1-g5d3c76c
(5d3c76c1c6e991649f0beedb80e6823606176d9e), I have had no problem
(it's a small cluster with 5 nodes, 4 OSDs per node and 3 monitors,
and I just use cephfs).

Bye.


Re: [ceph-users] 10.2.4 Jewel released

2016-12-08 Thread Francois Lafont
On 12/08/2016 11:24 AM, Ruben Kerkhof wrote:

> I've been running this on one of my servers now for half an hour, and
> it fixes the issue.

It's the same for me. ;)

~$ ceph -v
ceph version 10.2.4-1-g5d3c76c (5d3c76c1c6e991649f0beedb80e6823606176d9e)

Thanks for the help.
Bye.


Re: [ceph-users] 10.2.4 Jewel released -- IMPORTANT

2016-12-07 Thread Francois Lafont
On 12/08/2016 12:38 AM, Gregory Farnum wrote:

> Yep!

Ok, thanks for the confirmations Greg.
Bye.



Re: [ceph-users] 10.2.4 Jewel released -- IMPORTANT

2016-12-07 Thread Francois Lafont
On 12/08/2016 12:06 AM, Sage Weil wrote:

> Please hold off on upgrading to this release.  It triggers a bug in 
> SimpleMessenger that causes threads for broken connections to spin, eating 
> CPU.
> 
> We're making sure we understand the root cause and preparing a fix.

While waiting for the fix and its release, can you confirm that restarting the OSD 
daemons every 15 minutes is a possible workaround? In my case, I have a little 
cluster (5 nodes with 4 OSDs each) and it's possible for me to restart the daemons 
every 15 minutes without having the cluster completely down (see the sketch below). ;)
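
A sketch of what I have in mind (untested, Trusty/upstart syntax; adjust the OSD ids
to the ones actually hosted on each node):

# /etc/cron.d/ceph-osd-restart-workaround
*/15 * * * * root for id in 0 1 2 3; do restart ceph-osd id=$id; sleep 60; done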


Re: [ceph-users] 10.2.4 Jewel released

2016-12-07 Thread Francois Lafont
On 12/07/2016 11:33 PM, Ruben Kerkhof wrote:

> Thanks, l'll check how long it takes for this to happen on my cluster.
> 
> I did just pause scrub and deep-scrub. Are there scrubs running on
> your cluster now by any chance?

Yes, but normally not right now, because I have:

  osd scrub begin hour = 3
  osd scrub end hour   = 5

in the ceph.conf of all my cluster nodes, so normally there is no scrubbing 
running at the moment.
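
(To double-check, this is roughly how I look for scrubs in flight; a quick sketch:)

~# ceph -s | grep -i scrub
~# ceph pg dump 2>/dev/null | grep -c scrubbing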

Why do you think it's related to scrubbing?



Re: [ceph-users] 10.2.4 Jewel released

2016-12-07 Thread Francois Lafont
On 12/07/2016 11:16 PM, Steve Taylor wrote:
> I'm seeing the same behavior with very similar perf top output. One server 
> with 32 OSDs has a load average approaching 800. No excessive memory usage 
> and no iowait at all.

Exactly!

And another (maybe) interesting piece of information: I have ceph-osd processes with a big 
CPU load (as Steve said, no iowait and no excessive memory usage). If I restart 
a ceph-osd daemon, the CPU load is OK for exactly 15 minutes in my case. After 
15 minutes, the high CPU load comes back. This figure of 15 minutes is curious, 
isn't it?



Re: [ceph-users] 10.2.4 Jewel released

2016-12-07 Thread Francois Lafont
Hi,

On 12/07/2016 01:21 PM, Abhishek L wrote:

> This point release fixes several important bugs in RBD mirroring, RGW
> multi-site, CephFS, and RADOS.
> 
> We recommend that all v10.2.x users upgrade. Also note the following when 
> upgrading from hammer

Well... a little warning: after upgrading from 10.2.3 to 10.2.4, I have a big CPU 
load on the OSDs and the MDS. Something like this:

top - 18:53:40 up  2:11,  1 user,  load average: 32.14, 29.49, 27.36
Tasks: 192 total,   2 running, 190 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.4 us, 80.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32908088 total,  1876820 used, 31031268 free,31464 buffers
KiB Swap:  8388604 total,0 used,  8388604 free.   412340 cached Mem
 
  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 2174 ceph  20   0  492408  79260  8688 S 169.7  0.2 139:49.77 ceph-mds
 2318 ceph  20   0 1081428 166700 25832 S 160.4  0.5 178:32.18 ceph-osd
 2288 ceph  20   0 1256604 241796 22896 S 159.4  0.7 189:25.19 ceph-osd
 2301 ceph  20   0 1261172 261040 23664 S 156.1  0.8 197:11.24 ceph-osd
 2337 ceph  20   0 1247904 260048 19084 S 154.8  0.8 191:01.90 ceph-osd
 2171 ceph  20   0  466160  58292 10992 S   0.3  0.2   0:29.89 ceph-mon

On IRC, two other people have reported the same behavior after the upgrade.

The cluster is HEALTH_OK. I don't see any I/O on disk. If I restart the daemons, all is 
OK, but after a few minutes the CPU load starts again.

I have currently no idea about the problem.
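
(For what it's worth, this is roughly how I am looking at where the CPU time goes,
assuming the perf/linux-tools package is installed:)

~# perf top -p "$(pgrep -d, ceph-osd)"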


[ceph-users] Keep previous versions of ceph in the APT repository

2016-11-29 Thread Francois Lafont
Hi @all,

Ceph team, would it be possible to keep the previous versions of
the ceph* packages in the APT repository?

Indeed, for instance for Ubuntu Trusty, currently we have:

~$ curl -s http://download.ceph.com/debian-jewel/dists/trusty/main/binary-amd64/Packages | grep -A 1 '^Package: ceph$'
Package: ceph
Version: 10.2.3-1trusty

Only the latest version, 10.2.3, is available; for instance,
versions 10.2.2 and 10.2.1 have been removed from the APT
repository.

It could be handy to keep the previous versions too. Personally,
it's useful for me to test an upgrade in a lab: for instance when
I want to set up a lab on 10.2.2 and then test an upgrade to 10.2.3
(and then do the upgrade in production if all is OK).

It seems to me a good thing to keep the old versions in the APT
repository but maybe it's complicated for the Ceph team...
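
(For reference, what I would then do in the lab, assuming the old versions were still
published, is simply install the wanted version explicitly and hold it; the package
names/versions below are only an example:)

~# apt-get install ceph=10.2.2-1trusty ceph-common=10.2.2-1trusty
~# apt-mark hold ceph ceph-common
# ... and later, to test the upgrade path:
~# apt-mark unhold ceph ceph-common && apt-get update && apt-get upgrade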

Thanks for your help.
Regards.

François Lafont


Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2

2016-08-30 Thread Francois Lafont
Hi,

On 08/29/2016 08:30 PM, Gregory Farnum wrote:

> Ha, yep, that's one of the bugs Giancolo found:
> 
>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>  1: (()+0x299152) [0x7f91398dc152]
>  2: (()+0x10330) [0x7f9138bbb330]
>  3: (Client::get_root_ino()+0x10) [0x7f91397df6c0]
>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
> [0x7f91397dd3d5]
>  5: (()+0x19ac09) [0x7f91397ddc09]
>  6: (()+0x14b45) [0x7f91391f7b45]
>  7: (()+0x1522b) [0x7f91391f822b]
>  8: (()+0x11e49) [0x7f91391f4e49]
>  9: (()+0x8184) [0x7f9138bb3184]
>  10: (clone()+0x6d) [0x7f913752237d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> 
> So I that'll be in the next Jewel release if it's not already fixed in 10.2.2.

Judging by Goncalo's previous message in this thread, the bug still exists
in Jewel 10.2.2, so I deduce that it will be fixed in 10.2.3.

Can you tell me where the report for this specific bug is on 
http://tracker.ceph.com?
I have not found it.

Thanks.
François Lafont


Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2

2016-08-27 Thread Francois Lafont
On 08/27/2016 12:01 PM, Francois Lafont wrote:

> I had exactly the same error in my production ceph client node with
> Jewel 10.2.1 in my case.

I forgot to say that the ceph cluster was perfectly HEALTH_OK
before, during and after the error on the client side.

Regards.


Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2

2016-08-27 Thread Francois Lafont
Hi,

I had exactly the same error on my production ceph client node, with
Jewel 10.2.1 in my case.

On the client node:
- Ubuntu 14.04
- kernel 3.13.0-92-generic
- ceph 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
- cephfs via _ceph-fuse_

On the cluster node:
- Ubuntu 14.04
- kernel 3.13.0-92-generic
- ceph 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)

It happened during the execution of a very basic Python (2.7.6) script which
makes some os.makedirs(...) and os.chown(...) calls.

Just in case, the logs are below. I'm sorry they are not verbose at all
and so probably useless for you...

Which settings should I put in my client and cluster configuration to
have relevant logs if the same error happens again?

Regards.
François Lafont

Here are the logs:

1. On the client node: 
http://francois-lafont.ac-versailles.fr/misc/ceph-client.cephfs.log.1.gz

2. On the (active) mds node:

%<%<%<%<%<%<%<%<
~$ sudo zcat /var/log/ceph/ceph-mds.ceph02.log.1.gz
2016-08-22 15:02:03.799037 7f3f9adc1700  0 -- 10.0.2.102:6800/2186 >> 
192.168.23.11:0/431481110 pipe(0x7f3fb3a87400 sd=22 :6800 s=2 pgs=64 cs=1 l=0 
c=0x7f3fb5f10900).fault with nothing to send, going to standby
2016-08-22 15:02:40.236001 7f3f9f7d3700  0 log_channel(cluster) log [WRN] : 1 
slow requests, 1 included below; oldest blocked for > 34.503993 secs
2016-08-22 15:02:40.236026 7f3f9f7d3700  0 log_channel(cluster) log [WRN] : 
slow request 34.503993 seconds old, received at 2016-08-22 15:02:05.731897: 
client_request(client.1442720:650326 getattr pAsLsXsFs #101b6d0 2016-08-22 
15:02:05.731515) currently failed to rdlock, waiting
2016-08-22 15:07:00.245269 7f3f9f7d3700  0 log_channel(cluster) log [INF] : 
closing stale session client.1433176 192.168.23.11:0/431481110 after 304.132797
2016-08-22 15:23:07.970215 7f3f9adc1700  0 -- 10.0.2.102:6800/2186 >> 
192.168.23.11:0/2607326748 pipe(0x7f3fff365400 sd=22 :6800 s=2 pgs=8 cs=1 l=0 
c=0x7f3fb5f10a80).fault, server, going to standby
2016-08-22 15:28:05.281489 7f3f9f7d3700  0 log_channel(cluster) log [INF] : 
closing stale session client.1537178 192.168.23.11:0/2607326748 after 300.588323
%<%<%<%<%<%<%<%<



Re: [ceph-users] ceph-fuse, fio largely better after migration Infernalis to Jewel, is my bench relevant?

2016-06-06 Thread Francois Lafont
On 06/06/2016 18:41, Gregory Farnum wrote:

> We had several metadata caching improvements in ceph-fuse recently which I
> think went in after Infernalis. That could explain it.

Ok, in this case, it could be good news. ;)

I had doubts concerning my fio bench. I know that benchmarks can be tricky,
especially with distributed filesystems.

Thanks for your answer Greg.

-- 
François Lafont


[ceph-users] ceph-fuse, fio largely better after migration Infernalis to Jewel, is my bench relevant?

2016-06-06 Thread Francois Lafont
Hi,

I have a little Ceph cluster in production with 5 cluster nodes and 2
client nodes. The clients are using cephfs via fuse.ceph. Recently, I
have upgraded my cluster from Infernalis to Jewel (servers _and_ clients).

When the cluster was on the Infernalis version, the fio command below gave
me approximately 1100-1300 IOPS.

fio --directory=/mnt/moodle/test/ --name=rwjob --readwrite=randrw \
--rwmixread=50 --gtod_reduce=1 --bs=4k --size=100MB   \
--ioengine=sync --direct=0 --numjobs=4 --group_reporting

I have tested exactly the same fio command after the migration, with all
nodes on the Jewel version, and I get ~2500-3000 IOPS.

I know that benchmarks can be very tricky, so here is my question: is this
significant improvement due to the "Infernalis => Jewel" migration, or
is my test simply not relevant?

Thanks in advance for your help.

-- 
François Lafont


[ceph-users] A radosgw keyring with the minimal rights, which pools have I to create?

2016-06-04 Thread Francois Lafont
Hi,

On a from-scratch Jewel cluster, I'm looking for the exact list of pools I
have to create and the minimal rights I can set for the keyring used
by the radosgw instance. This is for the default zone. I intend to use only
the S3 API of the radosgw.

a) I have read the doc here 
http://docs.ceph.com/docs/jewel/radosgw/config-ref/#pools,
but it doesn't seem to be up to date; am I wrong?

Indeed, I have used a keyring with these rights:

[client.radosgw.gateway]
  key = xx==
  caps mon = "allow rwx"
  caps osd = "allow rwx"

so that the pools are created automatically after radosgw starts.
I have created an S3 account with "radosgw-admin" and I have created a bucket
with this S3 account. After that, here is the list of created pools:

.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
default.rgw.users.uid
default.rgw.users.email
default.rgw.users.keys
default.rgw.meta
default.rgw.buckets.index

It doesn't seem to match the doc. Am I wrong somewhere?


b) By the way, can you confirm that there are changes on this point
between Infernalis and Jewel? Indeed, if I do exactly the same "test" with
a from-scratch Infernalis cluster, here is the list of created pools:

.rgw.root
.rgw.control
.rgw
.rgw.gc
.log
.users.uid
.users.email
.users
.rgw.buckets.index
.rgw.buckets

Why is it different between Infernalis and Jewel? To me, it seems curious
and I have probably missed something, haven't I?

c) Can you confirm that the minimal rights for a radosgw keyring are
something like this:

[client.radosgw.gateway]
  key = xx==
  caps mon = "allow r"
  caps osd = "allow rwx pool=,..., rwx="

and can you tell me the exact list of pools I have to create, i.e. the list
<pool-1>, ..., <pool-n>, because this is not clear to me?
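
To make question c) concrete, here is roughly the command I have in mind; the pool
list is only my guess, based on the pools listed in a) above (plus the buckets data
pool), so please correct it:

ceph auth get-or-create client.radosgw.gateway \
    mon 'allow r' \
    osd 'allow rwx pool=.rgw.root, allow rwx pool=default.rgw.control, allow rwx pool=default.rgw.data.root, allow rwx pool=default.rgw.gc, allow rwx pool=default.rgw.log, allow rwx pool=default.rgw.users.uid, allow rwx pool=default.rgw.users.email, allow rwx pool=default.rgw.users.keys, allow rwx pool=default.rgw.meta, allow rwx pool=default.rgw.buckets.index, allow rwx pool=default.rgw.buckets.data' \
    -o /etc/ceph/ceph.client.radosgw.gateway.keyring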

Just in case, here is the typical conf of my radosgw instance:

[client.radosgw.gateway]
  host   = ceph-rgw
  keyring= /etc/ceph/ceph.client.radosgw.gateway.keyring
  rgw socket path= ""
  log file   = /var/log/ceph/ceph.client.radosgw.gateway.log
  rgw frontends  = civetweb port=8080
  rgw print continue = false
  rgw dns name   = store.domain.tld

Thanks in advance for your help.

-- 
François Lafont


Re: [ceph-users] jewel upgrade and sortbitwise

2016-06-03 Thread Francois Lafont
Hi,

On 03/06/2016 16:29, Samuel Just wrote:

> Sorry, I should have been more clear. The bug actually is due to a
> difference in an on disk encoding from hammer. An infernalis cluster would
> never had had such encodings and is fine.

Ah ok, fine. ;)
Thanks for the answer.
Bye.

-- 
François Lafont


Re: [ceph-users] jewel upgrade and sortbitwise

2016-06-03 Thread Francois Lafont
Hi, 

On 03/06/2016 05:39, Samuel Just wrote:

> Due to http://tracker.ceph.com/issues/16113, it would be best to avoid
> setting the sortbitwise flag on jewel clusters upgraded from previous
> versions until we get a point release out with a fix.
> 
> The symptom is that setting the sortbitwise flag on a jewel cluster
> upgraded from a previous version can result in some pgs reporting
> spurious unfound objects.  Unsetting sortbitwise should cause the PGs
> to go back to normal.  Clusters created at jewel don't need to worry
> about this.

Now, I have an Infernalis cluster in production. It's an Infernalis cluster
installed from scratch (not from an upgrade). I intend to upgrade the
cluster to Jewel. I have noticed that the "sortbitwise" flag was
set by default in my Infernalis cluster. By the way, I don't know exactly
what this flag means, but the cluster is HEALTH_OK with this flag set
by default, so I have not changed it.

If I have well understood, to upgrade my Infernalis cluster, I have 2
options:

a) I unset the "sortbitwise" flag via "ceph osd unset sortbitwise", then
I upgrade the cluster to Jewel 10.2.1, and with the next Jewel release
(I guess 10.2.2) I set the flag again via "ceph osd set sortbitwise".

b) Or I just wait for the next release of Jewel (10.2.2) without worrying
about the flag "sortbitwise".

1. Is that correct?
2. Can there be data movement when we toggle the "sortbitwise" flag?

-- 
François Lafont


Re: [ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?

2016-06-03 Thread Francois Lafont
Hi,

On 02/06/2016 04:44, Francois Lafont wrote:

> ~# grep ceph /etc/fstab
> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ 
> /mnt/ fuse.ceph noatime,nonempty,defaults,_netdev 0 0

[...]

> And I have rebooted. After the reboot, big surprise with this:
> 
> ~# cat /tmp/mount.fuse.ceph.log 
> arguments are 
> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint= 
> /mnt -o rw,_netdev,noatime,nonempty
> ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring 
> --client_mountpoint= /mnt -o rw,noatime,nonempty
> 
> Yes, this is not a misprint, there is no "/" after "client_mountpoint=".

[...]

> Now, my question is: which program gives the arguments to 
> /sbin/mount.fuse.ceph?
> Is it the init program (upstart in my case)? Or does it concern a Ceph 
> programs?

I have definitely found the culprit. In fact, it is not Upstart. It's "/sbin/mountall"
(from the "mountall" package) which is used by Upstart to mount the filesystems in 
fstab.
In the source code "src/mountall.c", there is a line which wrongly removes the 
trailing
"/" from my valid fstab line:

id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ 
/mnt/ fuse.ceph ...

I have made a bug report here where all is explained:
https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/1588594

It may be good to know (I lost half a day on this bug ;)).

-- 
François Lafont


Re: [ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?

2016-06-01 Thread Francois Lafont
Now, I have an explanation and it's _very_ strange, absolutely not related
to a Unix rights problem. For the record, my client node is an up-to-date
Ubuntu Trusty and I use ceph-fuse. Here is my fstab line:

~# grep ceph /etc/fstab
id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ 
/mnt/ fuse.ceph noatime,nonempty,defaults,_netdev 0 0

My VM is in the Infernalis state, where cephfs is correctly mounted
automatically at boot. I have just modified the file /sbin/mount.fuse.ceph
(it's a shell script) to add these 2 lines:


echo arguments are "$@" >/tmp/mount.fuse.ceph.log

[...]

# The command launched by /sbin/mount.fuse.ceph via an "exec".
echo ceph-fuse $cephargs $2 $3 $opts >>/tmp/mount.fuse.ceph.log


And I have rebooted. After the reboot, big surprise with this:

~# cat /tmp/mount.fuse.ceph.log 
arguments are 
id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint= /mnt 
-o rw,_netdev,noatime,nonempty
ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring 
--client_mountpoint= /mnt -o rw,noatime,nonempty

Yes, this is not a misprint, there is no "/" after "client_mountpoint=".
But, with Infernalis, it works even without the "/". 

~# ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring 
--client_mountpoint= /mnt -o rw,noatime,nonempty && echo OK
ceph-fuse[1380]: starting ceph client
2016-06-02 04:09:37.340050 7f69590e9780 -1 init, newargv = 0x7f695b7ae0b0 
newargc=13
ceph-fuse[1380]: starting fuse
OK

And with Jewel, it's simple, I have exactly the same thing, except
that, without the "/", ceph-fuse fails:

~# ceph -v
ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)

~# ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring 
--client_mountpoint= /mnt -o rw,noatime,nonempty && echo OK
ceph-fuse[1302]: starting ceph client
2016-06-02 04:30:25.840514 7f9ec24b2e80 -1 init, newargv = 0x7f9ecba9ffd0 
newargc=13
ceph-fuse[1302]: ceph mount failed with (1) Operation not permitted
ceph-fuse[1300]: mount failed: (1) Operation not permitted

By the way, failing on a malformed option seems like sane behavior 
to me.

So this is not an Infernalis => Jewel regression at all. The problem is that
the arguments given to /sbin/mount.fuse.ceph are bad.

A possible workaround is to just change the place of "client_mountpoint=/"
in the fstab line. For instance, no problem with:

id=cephfs,client_mountpoint=/,keyring=/etc/ceph/ceph.client.cephfs.keyring /mnt 
...
  ^^^

It's definitely curious that a manual mount works well but the mount at boot does not.
My conclusion is that the mechanisms (i.e. the code) which pass arguments to ceph-fuse
from fstab are different in these 2 cases (manual mount vs mount at boot).

Now, my question is: which program gives the arguments to /sbin/mount.fuse.ceph?
Is it the init program (upstart in my case)? Or is it one of the Ceph programs?

-- 
François Lafont


Re: [ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?

2016-06-01 Thread Francois Lafont
Hi,

On 01/06/2016 23:16, Florent B wrote:

> Don't have this problem on Debian migration from Infernalis to Jewel,
> check all permissions...

Ok, that's probably the reason (I hope), but so far I can't find the right Unix
permissions. I have this (which doesn't work):

~# ll -d /etc/ceph
drwxr-xr-x 2 root root 4096 Jun  2 00:17 /etc/ceph/

~# tree -pug /etc/ceph
/etc/ceph
|-- [-rw-rw ceph ceph]  ceph.client.cephfs.keyring
|-- [-rw-rw ceph ceph]  ceph.client.cephfs.secret
`-- [-rw-r--r-- root root]  ceph.conf

Can you give me your Unix permissions so I can compare, please?

-- 
François Lafont


[ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?

2016-06-01 Thread Francois Lafont
Hi,

I have a Jewel Ceph cluster in an OK state and I have a "ceph-fuse" Ubuntu
Trusty client with ceph Infernalis. The cephfs is mounted automatically
and perfectly during boot via ceph-fuse and this line in /etc/fstab:

~# grep ceph /etc/fstab
id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/  
/mnt/  fuse.ceph  noatime,nonempty,defaults,_netdev  0  0

I change my sources.list to get the Jewel version and I install the Jewel
packages via a simple "apt-get update && apt-get upgrade". Now the Jewel
version is installed.

So I reboot the machine. But now the automatic mount of cephfs at boot
no longer works. After the reboot, I have:

~# mountpoint /mnt/
/mnt/ is not a mountpoint

~# tail /var/log/upstart/mountall.log 
[...]
2016-06-01 19:00:55.234594 7f29301dbe80 -1 init, newargv = 0x7f29397b8fd0 
newargc=13
ceph-fuse[362]: starting ceph client
ceph-fuse[362]: ceph mount failed with (1) Operation not permitted
ceph-fuse[319]: mount failed: (1) Operation not permitted  
<== Here!
mountall: mount /mnt [306] terminated with status 255
mountall: Disconnected from Plymouth

The error is very curious because I have absolutely no problem mounting the
cephfs manually:

~# mount /mnt/
ceph-fuse[1279]: starting ceph client
2016-06-01 19:06:23.660419 7f174e336e80 -1 init, newargv = 0x7f1758c4afd0 
newargc=13
ceph-fuse[1279]: starting fuse

~# mountpoint /mnt/
/mnt/ is a mountpoint

~# df /mnt/
Filesystem 1K-blocks   Used Available Use% Mounted on
ceph-fuse   21983232 172032  21811200   1% /mnt

The machine is a basic, up-to-date Ubuntu Trusty. I can reproduce the
problem *systematically*: the machine is a VM with a snapshot in the
Infernalis state where all is OK, and after the upgrade the problem happens
every time. I have tried several reboots and the cephfs is *never*
mounted automatically (but the manual mount is completely OK).

Is it a little "Infernalis => Jewel" regression concerning ceph-fuse or
have I forgotten a new mount option or something like that?

I can reproduce the problem and provide any log if needed.
Thanks in advance for your help.

-- 
François Lafont


Re: [ceph-users] Meaning of the "host" parameter in the section [client.radosgw.{instance-name}] in ceph.conf?

2016-05-28 Thread Francois Lafont
Hi,

On 26/05/2016 23:46, Francois Lafont wrote:

> a) My first question is perfectly summarized in the title. ;)
> Indeed, here is a typical section [client.radosgw.{instance-name}] in
> the ceph.conf of a radosgw server "rgw-01":
> 
> --
> # The instance-name is "gateway" here.
> [client.radosgw.gateway]
>   host   = rgw-01
>   keyring= /etc/ceph/ceph.client.radosgw.gateway.keyring
>   rgw socket path= ""
>   log file   = /var/log/radosgw/ceph.client.radosgw.gateway.log
>   rgw frontends  = civetweb port=8080
>   rgw print continue = false
>   rgw dns name   = rgw-01.domain.tld
> --
> 
> I have tried without the "host" parameter and it seems to work perfectly.
> So what is the meaning of this parameter and what's it for?
> 
> I have found no answer in the documentation but I may be wrong searched...

Can you confirm these 2 points for me?

i) In fact, the "host" parameter is needed only if ceph.conf contains
   multiple [client.radosgw.{instance-name}] sections for different radosgw
   servers. If the only [client.radosgw.{instance-name}] sections are the ones
   which concern the current radosgw server, the "host" parameter is useless.
   Everything happens as if the default value of the "host" parameter
   in [client.radosgw.{instance-name}] were $(hostname).

ii) The {instance-name} in [client.radosgw.{instance-name}] must necessarily be
unique _in_ the cluster, _not_ just unique per radosgw server.

Is it correct?

> b) Is it a bad idea if I use the same keyring (and so the same ceph account)
> in the 2 radosgw servers "rgw-01" and "rgw-02"?

I'm still interested by this question.

I know it's possible to use the same keyring (i.e. the same ceph account) on
multiple radosgw servers, but I don't know whether it's recommended or not.

Thanks in advance.

-- 
François Lafont


[ceph-users] Meaning of the "host" parameter in the section [client.radosgw.{instance-name}] in ceph.conf?

2016-05-26 Thread Francois Lafont
Hi,

a) My first question is perfectly summarized in the title. ;)
Indeed, here is a typical section [client.radosgw.{instance-name}] in
the ceph.conf of a radosgw server "rgw-01":

--
# The instance-name is "gateway" here.
[client.radosgw.gateway]
  host   = rgw-01
  keyring= /etc/ceph/ceph.client.radosgw.gateway.keyring
  rgw socket path= ""
  log file   = /var/log/radosgw/ceph.client.radosgw.gateway.log
  rgw frontends  = civetweb port=8080
  rgw print continue = false
  rgw dns name   = rgw-01.domain.tld
--

I have tried without the "host" parameter and it seems to work perfectly.
So what is the meaning of this parameter and what is it for?

I have found no answer in the documentation, but I may have searched badly...


b) Is it a bad idea if I use the same keyring (and so the same ceph account)
on the 2 radosgw servers "rgw-01" and "rgw-02"?

Thanks in advance.

-- 
François Lafont


Re: [ceph-users] Deprecating ext4 support

2016-04-13 Thread Francois Lafont
Hello,

On 11/04/2016 23:39, Sage Weil wrote:

> [...] Is this reasonable?  [...]

Warning: I'm just a ceph user, and definitely not an expert one.

1. Personally, if you look at the documentation and read the mailing list
and/or IRC a little, it seems _clear_ to me that ext4 is not recommended, even if the
opposite is sometimes mentioned (personally I don't use ext4 in my ceph
cluster, I use xfs as the doc says).

2. I'm not a ceph expert, but I can imagine the monstrous amount of work that
developing software such as ceph represents, and I think it can be reasonable
sometimes to limit that work when it's possible.

So deprecating ext4 seems reasonable to me. I think the comfort of the
users is important but, in the _long_ term, it seems important to me that the
developers can concentrate their work on the important things.

-- 
François Lafont


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-20 Thread Francois Lafont
Hello,

On 20/03/2016 04:47, Christian Balzer wrote:

> That's not protection, that's an "uh-oh, something is wrong, you better
> check it out" notification, after which you get to spend a lot of time
> figuring out which is the good replica 

In fact, I have never been confronted with this case so far and I have a
couple of questions.

1. When it happens (i.e. a deep scrub fails), is it mentioned in the output
of the "ceph status" command and, in that case, can you confirm that the
health of the cluster in the output is different from "HEALTH_OK"?

2. For instance, suppose it happens with PG id == 19.10 and I have 3 OSDs
for this PG (because my pool has replica size == 3), and suppose that the
concerned OSDs are OSD ids 1, 6 and 12. Can you tell me whether this "naive"
method is valid to solve the problem (and, if not, why)?

a) ssh in the node which hosts osd-1 and I launch this command:
~# id=1 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | 
sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
055b0fd18cee4b158a8d336979de74d25fadc1a3  -

b) ssh in the node which hosts osd-6 and I launch this command:
~# id=6 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | 
sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
055b0fd18cee4b158a8d336979de74d25fadc1a3 -

c) ssh in the node which hosts osd-12 and I launch this command:
~# id=12 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | 
sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
3f786850e387550fdab836ed7e6dc881de23001b -

I notice that the result is different for osd-12, so it's the "bad" OSD.
So, on the node which hosts osd-12, I launch this command:

id=12 && rm /var/lib/ceph/osd/ceph-$id/current/19.10_head/*

And now I can launch safely this command:

ceph pg repair 19.10

Is there a problem with this "naive" method?
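
(Just for context, before any repair I would first locate the inconsistent PG with
the usual commands; a sketch, assuming the scrub error shows up in the health output:)

~# ceph status
~# ceph health detail | grep -i inconsist
# ... which should point at the PG (19.10 in my example above), and then:
~# ceph pg repair 19.10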

-- 
François Lafont


Re: [ceph-users] Change Unix rights of /var/lib/ceph/{osd, mon}/$cluster-$id/ directories on Infernalis?

2016-03-14 Thread Francois Lafont
Hi David,

On 14/03/2016 18:33, David Casier wrote:

> "usermod -aG ceph snmp" is better ;)

After thinking about it, the solution of adding "snmp" to the "ceph" group seems
better to me too... _if_ the "ceph" group never has the "w" right in /var/lib/ceph/
(which seems to be the case). So thanks for reassuring me in my choice.

PS: by the way, the "usermod" command always seems complicated to me for adding
a user to a group. I prefer the more readable form below. ;)

gpasswd --add snmp ceph
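
And to check that it took effect, a quick sketch (note that snmpd must be restarted
so that it picks up its new supplementary group):

~# getent group ceph
~# sudo -u snmp stat -f /var/lib/ceph/mon/ceph-ceph01
~# service snmpd restart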

-- 
François Lafont


[ceph-users] Change Unix rights of /var/lib/ceph/{osd, mon}/$cluster-$id/ directories on Infernalis?

2016-03-10 Thread Francois Lafont
Hi,

I have a ceph cluster on Infernalis and I'm using an snmp agent to retrieve
data and generate generic graphs for each cluster node. Currently, I
can see this kind of line in the syslog of each node (every 5 minutes):

Mar 11 03:15:26 ceph01 snmpd[16824]: Cannot statfs 
/var/lib/ceph/mon/ceph-ceph01#012: Permission denied
Mar 11 03:15:26 ceph01 snmpd[16824]: Cannot statfs 
/var/lib/ceph/osd/ceph-16#012: Permission denied

Of course, it's a basic Unix rights problem. The snmp agent uses the
account "snmp" and the Unix rights of the ceph home directory are:

~# ll -d /var/lib/ceph
drwxr-x--- 9 ceph ceph 4096 Nov  4 06:34 /var/lib/ceph/

So, of course, currently the snmp account can't access
/var/lib/ceph/{osd,mon}/$cluster-$id/.

1. Is there a problem (a possible side effect) if I just do this?

chmod o+rx /var/lib/ceph/

Could that be a security problem?


2. Or do you think it's a better idea to just add "snmp" to the Unix group
"ceph"? Maybe better than 1, because I don't change the permissions of the
directory _and_ it seems to me that a member of the "ceph" group never has
the "w" right in /var/lib/ceph/.

Thanks in advance for your help.

-- 
François Lafont


Re: [ceph-users] Cache tier operation clarifications

2016-03-04 Thread Francois Lafont
Hello,

On 04/03/2016 09:17, Christian Balzer wrote:

> Unlike the subject may suggest, I'm mostly going to try and explain how
> things work with cache tiers, as far as I understand them.
> Something of a reference to point to. [...]

I'm currently unqualified when it comes to cache tiering, but I'm pretty
sure that your post is very relevant and I think you should make
a pull request against the Ceph documentation where you could contribute all
these insights. Here, your explanations will be lost in the depths
of the mailing list. ;)

Regards.

-- 
François Lafont


Re: [ceph-users] Cannot mount cephfs after some disaster recovery

2016-03-01 Thread Francois Lafont
On 01/03/2016 18:14, John Spray wrote:

>> And what is the meaning of the first and the second number below?
>>
>> mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
>>^ ^
> 
> Your whitespace got lost here I think, but I guess you're talking
> about the 1/1 part.

Yes indeed.

> The shorthand MDS status is up/in/max_mds
> (https://github.com/ceph/ceph/blob/master/src/mds/MDSMap.cc#L248)
> 
> up: how many daemons are up and holding a rank (they may be active or
> replaying, etc)
> in: how many ranks exist in the MDS cluster
> max_mds: if there are this many MDSs already, new daemons will be made
> standbys instead of having ranks created for them.
> 
> On single-active-daemon systems, this is really just going to be 1/1/1
> or 0/1/1 for whether you have an up MDS or not.

Ok thx John for the explanations.


-- 
François Lafont


Re: [ceph-users] Upgrade to INFERNALIS

2016-03-01 Thread Francois Lafont
Hi,

On 02/03/2016 00:12, Garg, Pankaj wrote:

> I have upgraded my cluster from 0.94.4 as recommended to the just released 
> Infernalis (9.2.1) Update directly (skipped 9.2.0).
> I installed the packaged on each system, manually (.deb files that I built).
> 
> After that I followed the steps :
> 
> Stop ceph-all
> chown -R  ceph:ceph /var/lib/ceph
> start ceph-all

OK, and what about the journals?

> I am still getting errors on starting OSDs.
> 
> 2016-03-01 22:44:45.991043 7fa185f000 -1 filestore(/var/lib/ceph/osd/ceph-69) 
> mount failed to open journal /var/lib/ceph/osd/ceph-69/journal: (13) 
> Permission denied

I suppose your journal is a symlink which points to a raw partition, correct?
In this case, the ceph Unix account currently seems to be unable to read and
write to this partition. If this partition is /dev/sdb2 (for instance), you
have to set the Unix rights of this "file" /dev/sdb2 (manually or via a udev rule).
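
For instance, something like this (untested sketch, /dev/sdb2 and the WWN value
below are just hypothetical examples):

~# chown ceph:ceph /dev/sdb2    # manual fix, lost at the next reboot/udev trigger

# or, to make it persistent, a udev rule:
~# cat /etc/udev/rules.d/90-ceph-journal.rules
ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="0x50014ee2b1234567", OWNER="ceph", GROUP="ceph"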

> 2016-03-01 22:44:46.001112 7fa185f000 -1 osd.69 0 OSD:init: unable to mount 
> object store
> 2016-03-01 22:44:46.001128 7fa185f000 -1  ** ERROR: osd init failed: (13) 
> Permission denied
> 
> 
> What am I missing?

I think you forgot to set the Unix rights of the journal partitions. The ceph
account must be able to read/write in /var/lib/ceph/osd/$cluster-$id/ _and_ in
the journal partitions too.

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot mount cephfs after some disaster recovery

2016-03-01 Thread Francois Lafont
Hi,

On 01/03/2016 10:32, John Spray wrote:

> As Zheng has said, that last number is the "max_mds" setting.

And what is the meaning of the first and the second number below?

mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
   ^ ^

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-20 Thread Francois Lafont
On 21/01/2016 03:40, Francois Lafont wrote:

> Ah ok, interesting. I have tested and I have noticed however that size
> of a directory is not updated immediately. For instance, if I change
> the size of the regular file in a directory (of cephfs) the size of the
> size doesn't change immediately after.
  

Misprint. The "size of the directory" of course.
   ^


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-20 Thread Francois Lafont
Hi,

On 19/01/2016 07:24, Adam Tygart wrote:
> It appears that with --apparent-size, du adds the "size" of the
> directories to the total as well. On most filesystems this is the
> block size, or the amount of metadata space the directory is using. On
> CephFS, this size is fabricated to be the size sum of all sub-files.
> i.e. a cheap/free 'du -sh $folder'

Ah ok, interesting. I have tested and I have noticed however that size
of a directory is not updated immediately. For instance, if I change
the size of the regular file in a directory (of cephfs) the size of the
size doesn't change immediately after.

> $ stat /homes/mozes/tmp/sbatten
>   File: '/homes/mozes/tmp/sbatten'
>   Size: 138286  Blocks: 0  IO Block: 65536  directory
> Device: 0h/0d   Inode: 1099523094368  Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
> Access: 2016-01-19 00:12:23.331201000 -0600
> Modify: 2015-10-14 13:38:01.098843320 -0500
> Change: 2015-10-14 13:38:01.098843320 -0500
>  Birth: -
> $ stat /tmp/sbatten/
>   File: '/tmp/sbatten/'
>   Size: 4096Blocks: 8  IO Block: 4096   directory
> Device: 803h/2051d  Inode: 9568257 Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
> Access: 2016-01-19 00:12:23.331201000 -0600
> Modify: 2015-10-14 13:38:01.098843320 -0500
> Change: 2016-01-19 00:17:29.658902081 -0600
>  Birth: -
> 
> $ du -s --apparent-size -B1 /homes/mozes/tmp/sbatten
> 276572  /homes/mozes/tmp/sbatten
> $ du -s -B1 /homes/mozes/tmp/sbatten
> 147456  /homes/mozes/tmp/sbatten
> 
> $ du -s -B1 /tmp/sbatten
> 225280  /tmp/sbatten
> $ du -s --apparent-size -B1 /tmp/sbatten
> 142382  /tmp/sbatten
> 
> Notice how the apparent-size version is *exactly* the Size from the
> stat + the size from the "proper" du?

Err... exactly? Are you sure?

138286 + 147456 = 285742 which is != 276572, no?
Anyway thx for your help Adam.


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-18 Thread Francois Lafont
On 19/01/2016 05:19, Francois Lafont wrote:

> However, I still have a question. Since my previous message, supplementary
> data have been put in the cephfs and the values have changes as you can see:
> 
> ~# du -sh /mnt/cephfs/
> 1.2G  /mnt/cephfs/
> 
> ~# du --apparent-size -sh /mnt/cephfs/
> 6.4G  /mnt/cephfs/
> 
> You can see that the difference between "disk usage" and "apparent size"
> has really increased and it seems to me curious that only sparse files can
> explain this difference (in my mind, sparse files are very specific files
> and here the files are essentially images which doesn't seem to me potential
> sparse files). I'm not completely sure but I think that same files are put in
> the cephfs directory.
> 
> Do you think it's possible that the sames file present in different 
> directories
> of the cephfs are stored in only one object in the cephfs pool?
> 
> This is my feeling when I see the difference between "apparent size" and
> "disk usage" which has increased. Am I wrong?

In fact, I'm not so sure. Here is another piece of information, where /backups is an XFS
partition:

~# du --apparent-size -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
2.8G    /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/

~# du -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
701M    /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/

~# cp -r /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/ /backups/test

~# du -sh /backups/test
701M    /backups/test

~# du --apparent-size -sh /backups/test
701M    /backups/test

So I definitely don't understand du --apparent-size -sh...


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-18 Thread Francois Lafont
Hi,

On 18/01/2016 05:00, Adam Tygart wrote:

> As I understand it:

I think you understand well. ;)

> 4.2G is used by ceph (all replication, metadata, et al) it is a sum of
> all the space "used" on the osds.

I confirm that.

> 958M is the actual space the data in cephfs is using (without replication).
> 3.8G means you have some sparse files in cephfs.
> 
> 'ceph df detail' should return something close to 958MB used for your
> cephfs "data" pool. "RAW USED" should be close to 4.2GB

Yes, your predictions are correct. ;)

However, I still have a question. Since my previous message, supplementary
data have been put in the cephfs and the values have changed, as you can see:

~# du -sh /mnt/cephfs/
1.2G    /mnt/cephfs/

~# du --apparent-size -sh /mnt/cephfs/
6.4G    /mnt/cephfs/

You can see that the difference between "disk usage" and "apparent size"
has really increased and it seems curious to me that only sparse files can
explain this difference (in my mind, sparse files are very specific files
and here the files are essentially images, which don't seem to me to be
potential sparse files). I'm not completely sure, but I think that the same
files are put in several directories of the cephfs.

Do you think it's possible that the same file present in different directories
of the cephfs is stored in only one object in the cephfs pool?

This is my feeling when I see the difference between "apparent size" and
"disk usage" which has increased. Am I wrong?

Anyway, thanks a lot for the explanations Adam.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis upgrade breaks when journal on separate partition

2016-01-18 Thread Francois Lafont
Hi,

I have not followed this thread closely, so sorry in advance if I'm a little off
topic. Personally I'm using this udev rule and it works well (servers are
Ubuntu Trusty):

~# cat /etc/udev/rules.d/90-ceph.rules
ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_NAME}=="osd-?*-journal", OWNER="ceph"

Indeed, I'm using GPT and all my journal partitions have this partname pattern:

/osd-[0-9]+-journal/

If you currently don't use GPT (but msdos partitions), I think you can do the
same thing by using _explicit_ "by-id" matching. For instance something like
that (not tested!):

ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="xxx", OWNER="ceph"
ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="yyy", OWNER="ceph"
# etc.

where xxx, yyy, etc. are the names of your journal partitions in
/dev/disk/by-id/.
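
To find the right value for a given partition, something like this should work
(from memory, not tested; /dev/sdb2 is just an example):

~# udevadm info --query=property --name=/dev/sdb2 | grep ID_WWN_WITH_EXTENSION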

HTH. ;)

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-17 Thread Francois Lafont
On 18/01/2016 04:19, Francois Lafont wrote:

> ~# du -sh /mnt/cephfs
> 958M  /mnt/cephfs
> 
> ~# df -h /mnt/cephfs/
> Filesystem  Size  Used Avail Use% Mounted on
> ceph-fuse55T  4.2G   55T   1% /mnt/cephfs

Even with the option --apparent-size, the sizes are different (but closer
indeed):

~# du -sh --apparent-size /mnt/cephfs
3.8G    /mnt/cephfs


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Infernalis, cephfs: difference between df and du

2016-01-17 Thread Francois Lafont
Hello,

Can someone explain to me the difference between the df and du commands
concerning the data used in my cephfs? And which is the correct value,
958M or 4.2G?

~# du -sh /mnt/cephfs
958M    /mnt/cephfs

~# df -h /mnt/cephfs/
Filesystem  Size  Used Avail Use% Mounted on
ceph-fuse55T  4.2G   55T   1% /mnt/cephfs

My client node is a "classical" Ubuntu Trusty, kernel 3.13 but as you
can see I'm using ceph-fuse. The cluster nodes are "classical" Ubuntu
Trusty nodes too.

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs (ceph-fuse) and file-layout: "operation not supported" in a client Ubuntu Trusty

2016-01-08 Thread Francois Lafont
Hi,

Some news...

On 08/01/2016 12:42, Francois Lafont wrote:

> ~# mkdir /mnt/cephfs/ssd
> 
> ~# setfattr -n ceph.dir.layout.pool -v poolssd /mnt/cephfs/ssd/
> setfattr: /mnt/cephfs/ssd/: Operation not supported
> 
> ~# getfattr -n ceph.dir.layout /mnt/cephfs/
> /mnt/cephfs/: ceph.dir.layout: Operation not supported
> 
> Here is my fstab line which mount the cephfs:
> 
> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/data1
>  /mnt/cephfs fuse.ceph noatime,defaults,_netdev 0 0

In fact, I have retried the same thing without the "noatime" mount
option and after that it worked. Then I have retried _with_ the "noatime"
option to be sure and... it worked too. Now, it just works with or without
the option.

So I have 2 possible explanations:

1. The act of removing noatime and remounting just once has unblocked
something...

2. Or I have another explanation, embarrassing for me. Maybe during
my first attempt the cephfs was just not mounted, in fact. Indeed,
I now have a doubt on this point because a few minutes after the attempt
I saw that the cephfs was not mounted (and I don't know why).

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs (ceph-fuse) and file-layout: "operation not supported" in a client Ubuntu Trusty

2016-01-08 Thread Francois Lafont
Hi @all,

I'm using ceph Infernalis (9.2.0) in the client and cluster side.
I have a Ubuntu Trusty client where cephfs is mounted via ceph-fuse
and I would like to put a sub-directory of cephfs in a specific pool
(a ssd pool).

In the cluster, I have:

~# ceph auth get client.cephfs
exported keyring for client.cephfs
[client.cephfs]
key = XX==
caps mds = "allow"
caps mon = "allow r"
caps osd = "allow class-read object_prefix rbd_children, allow rwx 
pool=cephfsdata, allow rwx pool=poolssd"

~# ceph fs ls
name: cephfs, metadata pool: cephfsmetadata, data pools: [cephfsdata poolssd ]

Now, in the Ubuntu Trusty client, I have installed the "attr" package
and I try this:

~# mkdir /mnt/cephfs/ssd

~# setfattr -n ceph.dir.layout.pool -v poolssd /mnt/cephfs/ssd/
setfattr: /mnt/cephfs/ssd/: Operation not supported

~# getfattr -n ceph.dir.layout /mnt/cephfs/
/mnt/cephfs/: ceph.dir.layout: Operation not supported

Here is my fstab line which mount the cephfs:

id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/data1 
/mnt/cephfs fuse.ceph noatime,defaults,_netdev 0 0

Where is my problem?
Thanks in advance for your help. ;)

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2016-01-02 Thread Francois Lafont
olc, I think you haven't posted to the ceph-users list.

On 31/12/2015 15:39, olc wrote:

> Same model _and_ same firmware (`smartctl -i /dev/sdX | grep Firmware`)? As 
> far as I've been told, this can make huge differences.

Good idea indeed. I have checked, and the versions are the same. Finally, after
some tests, I think I had probably made a mistake, because now I have identical
performance on the disks (~192 iops SYNC IO O_DIRECT).

> Don't know how important it and if it is relevant in your case is but 
> transfer rate is supposed better when data are located at the periphery of 
> the platters than when they are located at the core of the platters.

Yes indeed, but in my case the spinning hard drives have only one
partition. Only the SSDs have several partitions.

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2016-01-02 Thread Francois Lafont
Hi,

On 31/12/2015 15:30, Robert LeBlanc wrote:

> Because Ceph is not perfectly distributed there will be more PGs/objects in
> one drive than others. That drive will become a bottleneck for the entire
> cluster. The current IO scheduler poses some challenges in this regard.
> I've implemented a new scheduler which I've seen much better drive
> utilization across the cluster as well as 3-17% performance increase and a
> substantial reduction in client performance deviation (all clients are
> getting the same amount of performance). Hopefully we will be able to get
> that into Jewel.

Ok, thx for the information. So I hope too that it will be ready for Jewel.
If I have understood correctly, Jewel will bring many improvements. I'm
following that with attention... ;)

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] In production - Change osd config

2016-01-02 Thread Francois Lafont
Hi,

On 03/01/2016 02:16, Sam Huracan wrote:

> I try restart all osd but not efficient.
> Is there anyway to apply this change transparently to client?

You can use this command (it's an example):

# In a cluster node where the admin account is available.
ceph tell 'osd.*' injectargs '--osd_disk_threads 2'

After, you can check the config in a specific osd. For instance:

ceph daemon osd.5 config show | grep 'osd_disk_threads'

But you must launch this command in the node which hosts the osd.5
daemon.

Furthermore, with "ceph tell osd.\* injectargs ..." it's possible
to set a parameter for all osds from a single cluster node with just
one command, but I don't know if it's possible to just _get_ (not set)
the value of a parameter on all osds with just one command.
Does such a command exist?

Personally, I don't know of such a command and currently I have to
launch "ceph daemon osd.$id config show" for each osd hosted by the
server where I'm connected, and I have to repeat the commands on the
other cluster nodes.
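
The closest workaround I can think of is a little loop over the local admin
sockets on each OSD node, something like this (untested sketch, assuming the
default socket path):

for s in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $s"
    ceph --admin-daemon "$s" config show | grep osd_disk_threads
done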

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-30 Thread Francois Lafont
Hi,

On 30/12/2015 10:23, Yan, Zheng wrote:

>> And it seems to me that I can see the bottleneck of my little cluster (only
>> 5 OSD servers with each 4 osds daemons). According to the "atop" command, I
>> can see that some disks (4TB SATA 7200rpm Western digital WD4000FYYZ) are
>> very busy. It's curious because during the bench I have some disks very busy
>> and some other disks not so busy. But I think the reason is that is a little
>> cluster and with just 15 osds (the 5 other osds are full SSD osds 
>> cephfsmetadata
>> dedicated), I can have a perfect repartition of data, especially when the
>> bench concern just a specific file of few hundred MB.
> 
> do these disks have same size and performance? large disks (with
> higher wights) or slow disks are likely busy.

The disks are exactly the same model with the same size (4TB SATA 7200rpm
Western Digital WD4000FYYZ). I'm not completely sure but it seems to me
that in a specific node I have a disk which is a little slower than the
others (maybe ~50-75 iops less) and it seems to me that it's the busiest
disk during a bench.

Is it possible (or frequent) to have a difference in performance between
disks of exactly the same model?

>> That being said, when you talk about "using buffered IO" I'm not sure to
>> understand the option of fio which is concerns by that. Is it the --buffered
>> option ? Because with this option I have noticed no change concerning iops.
>> Personally, I was able to increase global iops only with the --numjobs 
>> option.
>>
> 
> I didn't make it clear. I actually meant buffered write (add
> --rwmixread=0 option to fio) .

But with fio, if I set "--readwrite=randrw --rwmixread=0", it's completely
equivalent to just setting "--readwrite=randwrite", no?

> In your test case, writes mix with reads. 

Yes indeed.

> read is synchronous when cache miss.

You mean that I have SYNC IO for reading if I set --direct=0, is that correct?
Is that valid for any file system or just for cephfs?

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-29 Thread Francois Lafont
Hi,

On 28/12/2015 09:04, Yan, Zheng wrote:

>> Ok, so in a client node, I have mounted cephfs (via ceph-fuse) and a rados
>> block device formatted in XFS. If I have well understood, cephfs uses sync
>> IO (not async IO) and, with ceph-fuse, cephfs can't make O_DIRECT IO. So, I
>> have tested this fio command in cephfs _and_ in rbd:
>>
>> fio --randrepeat=1 --ioengine=sync --direct=0 --gtod_reduce=1 
>> --name=readwrite \
>> --filename=rw.data --bs=4k --iodepth=1 --size=300MB 
>> --readwrite=randrw \
>>
>>
>> and indeed with cephfs _and_ rbd, I have approximatively the same result:
>> - cephfs => ~516 iops
>> - rbd=> ~587 iops
>>
>> Is it consistent?
>>
> yes

Ok, cool. ;)

>> That being said, I'm unable to know if it's good performance as regard my 
>> hardware
>> configuration. I'm curious to know the result in other clusters with the 
>> same fio
>> command.
> 
> This fio command is check performance of single thread SYNC IO. If you
> want to check overall throughput, you can try using buffered IO or
> increasing thread number.

Ok, I have increased the thread number via the --numjobs option of fio
and indeed, if I add up the iops of each job, it seems to me that I can
reach something like ~1000 iops with ~5 jobs. This result seems to me
more in line with my hardware configuration, isn't it?
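
For the record, it was just the same fio command as before with the --numjobs
option added, something like:

fio --randrepeat=1 --ioengine=sync --direct=0 --gtod_reduce=1 --name=readwrite \
    --filename=rw.data --bs=4k --iodepth=1 --size=300MB --readwrite=randrw \
    --rwmixread=50 --numjobs=5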

And it seems to me that I can see the bottleneck of my little cluster (only
5 OSD servers with 4 osd daemons each). According to the "atop" command, I
can see that some disks (4TB SATA 7200rpm Western Digital WD4000FYYZ) are
very busy. It's curious because during the bench I have some disks very busy
and some other disks not so busy. But I think the reason is that it is a little
cluster and with just 15 osds (the 5 other osds are full SSD osds dedicated to
cephfsmetadata), I can't have a perfect distribution of data, especially when
the bench concerns just a specific file of a few hundred MB.

That being said, when you talk about "using buffered IO" I'm not sure I
understand which fio option is concerned by that. Is it the --buffered
option? Because with this option I have noticed no change concerning iops.
Personally, I was able to increase global iops only with the --numjobs option.

> FYI, I have written a patch to add AIO support to cephfs kernel client:
> https://github.com/ceph/ceph-client/commits/testing

Ok, thanks for the information, but I'm afraid I won't be able to test it
immediately.

>> * --direct=1 => ~1400 iops
>> * --direct=0 => ~570 iops
>>
>> Why I have this behavior? I thought it will be the opposite (better perfs 
>> with
>> --direct=0). Is it normal?
>>
> linux kernel only supports AIO for fd opened in O_DIRECT mode, when
> file is not opened in O_DIRECT mode, AIO is actually SYNC IO.

Ok, so this is not ceph specific, this is a behavior of the Linux kernel.
Another good thing to know.

Anyway, thanks _a_ _lot_ Yan for your very efficient help. I have learned
lots of very interesting things.

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-27 Thread Francois Lafont
Hi,

Sorry for my late answer.

On 23/12/2015 03:49, Yan, Zheng wrote:

>>> fio tests AIO performance in this case. cephfs does not handle AIO
>>> properly, AIO is actually SYNC IO. that's why cephfs is so slow in
>>> this case.
>>
>> Ah ok, thanks for this very interesting information.
>>
>> So, in fact, the question I ask myself is: how to test my cephfs
>> to know if I have correct (or not) perfs as regard my hardware
>> configuration?
>>
>> Because currently, in fact, I'm unable to say if I have correct perf
>> (not incredible but in line with my hardware configuration) or if I
>> have a problem. ;)
>>
> 
> It's hard to tell. basically data IO performance on cephfs should be
> similar to data IO performance on rbd.

Ok, so on a client node, I have mounted cephfs (via ceph-fuse) and a rados
block device formatted in XFS. If I have understood correctly, cephfs uses sync
IO (not async IO) and, with ceph-fuse, cephfs can't do O_DIRECT IO. So, I
have tested this fio command on cephfs _and_ on rbd:

fio --randrepeat=1 --ioengine=sync --direct=0 --gtod_reduce=1 --name=readwrite \
    --filename=rw.data --bs=4k --iodepth=1 --size=300MB --readwrite=randrw \
    --rwmixread=50

and indeed with cephfs _and_ rbd, I have approximately the same result:
- cephfs => ~516 iops
- rbd    => ~587 iops

Is it consistent?

That being said, I'm unable to know if it's good performance as regards my
hardware configuration. I'm curious to know the result on other clusters with
the same fio command.

Another point: I have noticed something which is very strange for me. It's about
the rados block device and this fio command:

# In this case, I use libaio and (direct == 0)
fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=readwrite \
    --filename=rw.data --bs=4k --iodepth=16 --size=300MB --readwrite=randrw \
    --rwmixread=50

This command in the rados block device gives me ~570 iops. But the curious thing
is that I have better iops if I just change "--direct=0" to "--direct=1" in the
command above. In this case, I have ~1400 iops. I don't understand this 
difference.
So, I have better perfs with "--direct=1":

* --direct=1 => ~1400 iops
* --direct=0 => ~570 iops

Why do I have this behavior? I thought it would be the opposite (better perfs with
--direct=0). Is it normal?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-22 Thread Francois Lafont
Hello,

On 21/12/2015 04:47, Yan, Zheng wrote:

> fio tests AIO performance in this case. cephfs does not handle AIO
> properly, AIO is actually SYNC IO. that's why cephfs is so slow in
> this case.

Ah ok, thanks for this very interesting information.

So, in fact, the question I ask myself is: how to test my cephfs
to know if I have correct (or not) perfs as regards my hardware
configuration?

Because currently, in fact, I'm unable to say if I have correct perf
(not incredible but in line with my hardware configuration) or if I
have a problem. ;)

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-20 Thread Francois Lafont
On 20/12/2015 22:51, Don Waterloo wrote:
 
> All nodes have 10Gbps to each other

Even the link client node <---> cluster nodes?

> OSD:
> $ ceph osd tree
> ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 5.48996 root default
> -2 0.8 host nubo-1
>  0 0.8 osd.0 up  1.0  1.0
> -3 0.8 host nubo-2
>  1 0.8 osd.1 up  1.0  1.0
> -4 0.8 host nubo-3
>  2 0.8 osd.2 up  1.0  1.0
> -5 0.92999 host nubo-19
>  3 0.92999 osd.3 up  1.0  1.0
> -6 0.92999 host nubo-20
>  4 0.92999 osd.4 up  1.0  1.0
> -7 0.92999 host nubo-21
>  5 0.92999 osd.5 up  1.0  1.0
> 
> Each contains 1 x Samsung 850 Pro 1TB SSD (on sata)
> 
> Each are Ubuntu 15.10 running 4.3.0-040300-generic kernel.
> Each are running ceph 0.94.5-0ubuntu0.15.10.1
> 
> nubo-1/nubo-2/nubo-3 are 2x X5650 @ 2.67GHz w/ 96GB ram.
> nubo-19/nubo-20/nubo-21 are 2x E5-2699 v3 @ 2.30GHz, w/ 576GB ram.
> 
> the connections are to the chipset sata in each case.
> The fio test to the underlying xfs disk
> (e.g. cd /var/lib/ceph/osd/ceph-1; fio --randrepeat=1 --ioengine=libaio
> --direct=1 --gtod_reduce=1 --name=readwrite --filename=rw.data --bs=4k
> --iodepth=64 --size=5000MB --readwrite=randrw --rwmixread=50)
> shows ~22K IOPS on each disk.
> 
> nubo-1/2/3 are also the mon and the mds:
> $ ceph status
> cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
>  health HEALTH_OK
>  monmap e1: 3 mons at {nubo-1=
> 10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
> election epoch 1104, quorum 0,1,2 nubo-1,nubo-2,nubo-3
>  mdsmap e621: 1/1/1 up {0=nubo-3=up:active}, 2 up:standby
>  osdmap e2459: 6 osds: 6 up, 6 in
>   pgmap v127331: 840 pgs, 6 pools, 144 GB data, 107 kobjects
> 289 GB used, 5332 GB / 5622 GB avail
>  840 active+clean
>   client io 0 B/s rd, 183 kB/s wr, 54 op/s

And you have "replica size == 3" in your cluster, correct?
Do you have specific mount options or specific options in ceph.conf concerning 
ceph-fuse?

So the hardware configuration of your cluster seems to me globally much
better than mine (config given in my first message) because you have
10Gb links (between the client and the cluster I have just 1Gb) and you
have full SSD OSDs.

I have tried to put _all_ of cephfs on my SSDs: ie the pools "cephfsdata" _and_
"cephfsmetadata" are on the SSDs. The performance is slightly improved because
I have ~670 iops now (with the fio command of my first message again) but it
still seems bad to me.

In fact, I'm curious to have the opinion of "cephfs" experts to know what
iops we can expect. Maybe ~700 iops is a correct value for our hardware
configuration and we are searching for a problem which doesn't exist...

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-20 Thread Francois Lafont
On 20/12/2015 21:06, Francois Lafont wrote:

> Ok. Please, can you give us your configuration?
> How many nodes, osds, ceph version, disks (SSD or not, HBA/controller), RAM, 
> CPU, network (1Gb/10Gb) etc.?

And I add this: with cephfs-fuse, did you have some specific conf on the client
side? Specific mount options? Specific parameters in ceph.conf?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-20 Thread Francois Lafont
Hi,

On 20/12/2015 19:47, Don Waterloo wrote:

> I did a bit more work on this.
> 
> On cephfs-fuse, I get ~700 iops.
> On cephfs kernel, I get ~120 iops.
> These were both on 4.3 kernel
> 
> So i backed up to 3.16 kernel on the client. And observed the same results.
> 
> So ~20K iops w/ rbd, ~120iops w/ cephfs.

Ok. Please, can you give us your configuration?
How many nodes, osds, ceph version, disks (SSD or not, HBA/controller), RAM, 
CPU, network (1Gb/10Gb) etc.?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-20 Thread Francois Lafont
Hello,

On 18/12/2015 23:26, Don Waterloo wrote:

> rbd -p mypool create speed-test-image --size 1000
> rbd -p mypool bench-write speed-test-image
> 
> I get
> 
> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>   SEC   OPS   OPS/SEC   BYTES/SEC
> 1 79053  79070.82  323874082.50
> 2144340  72178.81  295644410.60
> 3221975  73997.57  303094057.34
> elapsed:10  ops:   262144  ops/sec: 26129.32  bytes/sec: 107025708.32
> 
> which is *much* faster than the cephfs.

Me too, I have better performance with rbd (~1400 iops with the fio command
in my first message instead of ~575 iops with the same fio command and cephfs).

The question is: is it normal if I have ~575 iops with cephfs and my config?
Indeed, I imagine that rbd has better performance than cephfs and, after all
maybe my value of iops is normal. I don't know...

I have tried to edit the crushmap to put the cephfsmetadata pool only on the
5 SSDs. It seems to slightly improve the performance and, with the fio command
of my first message, I have ~650 iops now, but it still seems bad to me, no?
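
For what it's worth, editing the crushmap for that can be done with something
like this (rough sketch from memory, the rule id and the "ssd" root name are
just examples):

~# ceph osd getcrushmap -o crush.bin
~# crushtool -d crush.bin -o crush.txt
# edit crush.txt: add a "root ssd" hierarchy which contains only the SSD osds
# and a rule which takes from that root
~# crushtool -c crush.txt -o crush.new.bin
~# ceph osd setcrushmap -i crush.new.bin
~# ceph osd pool set cephfsmetadata crush_ruleset 1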

Currently I'm looking for any option in ceph.conf or any mount option to improve
performance with cephfs via ceph-fuse. In the archives of "ceph-users", I have
seen the options "client cache size" and "client oc size" which would be used
by ceph-fuse.

Is it correct?

I don't see anything in the documentation. Where should I put these parameters?
In the ceph.conf of the client which mounts the cephfs via fuse? In the [global]
section? I have tried that but it seems to be ignored. Indeed I have tried to
put these parameters in the [global] section of ceph.conf (on the client node)
and I have set very very small values like this:

[global]
  client cache size = 1024
  client oc size    = 1024

and I thought it would highly decrease the performance, but there is absolutely no
effect and I have the same result (ie ~650 iops), so I think the parameters are
just ignored. Is it the right place to put these parameters?

Furthermore, do you know of mount options which can improve perf (for cephfs
mounted via ceph-fuse)?

It seems to me that the mount option noacl existed, but ceph-fuse doesn't know
this mount option (I have no need for acl). I haven't found the list of mount
options on the web. I can just display a short list with the command
"ceph-fuse -h". I have tried to change the max_* options but without effect.

Thanks in advance for your help.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs, low performances

2015-12-18 Thread Francois Lafont
Hi Christian,

On 18/12/2015 04:16, Christian Balzer wrote:

>> It seems to me very bad. 
> Indeed. 
> Firstly let me state that I don't use CephFS and have no clues how this
> influences things and can/should be tuned.

Ok, no problem. Anyway, thanks for your answer. ;)

> That being said, the fio above running in VM (RBD) gives me 440 IOPS
> against a single OSD storage server (replica 1) with 4 crappy HDDs and
> on-disk journals on my test cluster (1Gb/s links). 
> So yeah, given your configuration that's bad.

I have tried a quick test with a rados block device (size = 4GB with
filesystem EXT4) mounted on the same client node (the client node where
I'm testing cephfs) and the same "fio" command gives me read/write iops
equal to ~1400.

So my problem could be "cephfs" specific, no?

That being said, I don't know if it can be a symptom, but during the bench
the iops are displayed in real time and the value seems to me not very constant.
I sometimes see peaks at 1800 iops, then suddenly the value is 800 iops
and it goes back up to ~1400, etc.

> In comparison I get 3000 IOPS against a production cluster (so not idle)
> with 4 storage nodes. Each with 4 100GB DC S3700 for journals and OS and 8
> SATA HDDs, Infiniband (IPoIB) connectivity for everything.
> 
> All of this is with .80.x (Firefly) on Debian Jessie.

Ok, interesting. My cluster is idle, but I have approximately half as many
disks as your cluster and my SATA disks are directly connected to the
motherboard. So, it seems to me logical that I have ~1400 and you ~3000, no?

> You want to use atop on all your nodes and look for everything from disks
> to network utilization.
> There might be nothing obvious going on, but it needs to be ruled out.

It's a detail but I have noticed that atop (on Ubuntu Trusty) doesn't display
the % of bandwidth of my 10GbE interface.

Anyway, I have tried to inspect the cluster nodes during the cephfs bench,
but I have seen no bottleneck concerning CPU, network or disks.

>> I use Ubuntu 14.04 on each server with the 3.13 kernel (it's the same
>> for the client ceph where I run my bench) and I use Ceph 9.2.0
>> (Infernalis). 
> 
> I seem to recall that this particular kernel has issues, you might want to
> scour the archives here.

But, in my case, I use cephfs-fuse on the client node so the kernel version
is not relevant, I think. And I thought that the kernel version was not very
important on the cluster nodes side. Am I wrong?

>> On the client, cephfs is mounted via cephfs-fuse with this
>> in /etc/fstab:
>>
>> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/   
>> /mnt/cephfs
>> fuse.cephnoatime,defaults,_netdev0   0
>>
>> I have 5 cluster node servers "Supermicro Motherboard X10SLM+-LN4 S1150"
>> with one 1GbE port for the ceph public network and one 10GbE port for
>> the ceph private network:
>>
> For the sake of latency (which becomes the biggest issues when you're not
> exhausting CPU/DISK), you'd be better off with everything on 10GbE, unless
> you need the 1GbE to connect to clients that have no 10Gb/s ports.

Yes, exactly. My client is 1Gb/s only.

>> - 1 x Intel Xeon E3-1265Lv3
>> - 1 SSD DC3710 Series 200GB (with partitions for the OS, the 3
>> OSD-journals and, just for ceph01, ceph02 and ceph03, the SSD contains
>> too a partition for the workdir of a monitor
> The 200GB DC S3700 would have been faster, but that's a moot point and not
> your bottleneck for sure.
> 
>> - 3 HD 4TB Western Digital (WD) SATA 7200rpm
>> - RAM 32GB
>> - NO RAID controlleur
> 
> Which controller are you using?

No controller, the 3 SATA disks of each node are directly connected to
the SATA ports of the motherboard.

> I recently came across an Adaptec SATA3 HBA that delivered only 176 MB/s
> writes with 200GB DC S3700s as opposed to 280MB/s when used with Intel
> onboard SATA-3 ports or a LSI 9211-4i HBA.

Thanks for your help Christian.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs, low performances

2015-12-17 Thread Francois Lafont
Hi,

I have a ceph cluster, currently unused, and I have (to my mind) very low
performance. I'm not an expert in benchmarks; here is an example of a quick bench:

---
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 
--name=readwrite --filename=rw.data --bs=4k --iodepth=64 --size=300MB 
--readwrite=randrw --rwmixread=50
readwrite: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.1.3
Starting 1 process
readwrite: Laying out IO file(s) (1 file(s) / 300MB)
Jobs: 1 (f=1): [m] [100.0% done] [2264KB/2128KB/0KB /s] [566/532/0 iops] [eta 
00m:00s]
readwrite: (groupid=0, jobs=1): err= 0: pid=3783: Fri Dec 18 02:01:13 2015
  read : io=153640KB, bw=2302.9KB/s, iops=575, runt= 66719msec
  write: io=153560KB, bw=2301.7KB/s, iops=575, runt= 66719msec
  cpu  : usr=0.77%, sys=3.07%, ctx=115432, majf=0, minf=604
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=38410/w=38390/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=153640KB, aggrb=2302KB/s, minb=2302KB/s, maxb=2302KB/s, 
mint=66719msec, maxt=66719msec
  WRITE: io=153560KB, aggrb=2301KB/s, minb=2301KB/s, maxb=2301KB/s, 
mint=66719msec, maxt=66719msec
---

It seems to me very bad. Can I hope for better results with my setup (explained
below)? During the bench, I don't see particular symptoms (no CPU blocked at
100%, etc). If you have advice to improve the perf and/or maybe to make smarter
benchmarks, I'm really interested.

Thanks in advance for your help. Here is my conf...

I use Ubuntu 14.04 on each server with the 3.13 kernel (it's the same for the 
client
ceph where I run my bench) and I use Ceph 9.2.0 (Infernalis).
On the client, cephfs is mounted via cephfs-fuse with this in /etc/fstab:

id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/  
/mnt/cephfs fuse.ceph   noatime,defaults,_netdev   0   0

I have 5 cluster node servers "Supermicro Motherboard X10SLM+-LN4 S1150" with
one 1GbE port for the ceph public network and one 10GbE port for the ceph 
private
network:

- 1 x Intel Xeon E3-1265Lv3
- 1 SSD DC S3710 Series 200GB (with partitions for the OS, the 3 OSD journals
  and, just for ceph01, ceph02 and ceph03, a partition for the workdir of a monitor)
- 3 HD 4TB Western Digital (WD) SATA 7200rpm
- RAM 32GB
- NO RAID controller
- Each partition uses XFS with the noatime option, except the OS partition which is EXT4.

Here is my ceph.conf :

---
[global]
  fsid   = 
  cluster network= 192.168.22.0/24
  public network = 10.0.2.0/24
  auth cluster required  = cephx
  auth service required  = cephx
  auth client required   = cephx
  filestore xattr use omap   = true
  osd pool default size  = 3
  osd pool default min size  = 1
  osd pool default pg num= 64
  osd pool default pgp num   = 64
  osd crush chooseleaf type  = 1
  osd journal size   = 0
  osd max backfills  = 1
  osd recovery max active= 1
  osd client op priority = 63
  osd recovery op priority   = 1
  osd op threads = 4
  mds cache size = 100
  osd scrub begin hour   = 3
  osd scrub end hour = 5
  mon allow pool delete  = false
  mon osd down out subtree limit = host
  mon osd min down reporters = 4

[mon.ceph01]
  host = ceph01
  mon addr = 10.0.2.101

[mon.ceph02]
  host = ceph02
  mon addr = 10.0.2.102

[mon.ceph03]
  host = ceph03
  mon addr = 10.0.2.103
---

mds are in active/standby mode.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] about PG_Number

2015-11-13 Thread Francois Lafont
Hi,

On 13/11/2015 09:13, Vickie ch wrote:

> If you have a large amount of OSDs but less pg number. You will find your
> data write unevenly.
> Some OSD have no change to write data.
> In the other side, pg number too large but OSD number too small that have a
> chance to cause data lost.

Data loss, are you sure?

Personally, I would have said:

  few PGs/OSD  <------------------------------->  lots of PGs/OSD
  * Data distribution less evenly                 * Good balanced distribution of data
  * Uses less CPU and RAM                         * Uses lots of CPU and RAM

No?
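
For what it's worth, the usual rule of thumb from the docs is something like:

    total PG count ~= (number of OSDs x 100) / replica size, rounded up to the
    next power of two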


François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.2.0 Infernalis released

2015-11-09 Thread Francois Lafont
Oops, sorry Dan, I meant to send my message to the list.
Sorry.

> On Mon, Nov 9, 2015 at 11:55 AM, Francois Lafont
>>
>> 1. Ok, so, the rank of my monitors are 0, 1, 2 but the its ID are 1, 2, 3
>> (ID chosen by himself because the hosts are called ceph01, ceph02 and
>> ceph03 and these ID seemed to me a good idea). Is it correct ?
>>
>> 2. And, if I understand well, with this command `ceph tell mon.$thing 
>> version`
>> $thing is in fact the rank of the monitor, correct?
>>
>> 3. But with `ceph tell osd.$thing version`, $thing is the ID of the osd, 
>> correct?
>>
>> 4. Why not. But in this case, why with the command `ceph tell mon.* version`,
>> "*" is expanded to the ID of my monitors (ie 1, 2, 3) and not to the ranks ?
>> It seems to me not logical? Am I wrong?
>>
>> But Dan, in your case (monitors have the ID `hostname -s`), the command
>> `ceph tell mon.* version` doesn't work at all, no? Because "*" is expanded
>> to `hostname -s` which doesn't match any rank value, no?
>>
>> Sorry for all these questions, I understand the difference between ID
>> and rank for monitors, but currently I don't understand:
>>
>> - which is $thing (rank or ID?) in the command `ceph tell mon.$thing 
>> version`?
>> - in what "*" is expanded (ranks or IDs?) in the command `ceph tell mon.* 
>> version`?
>>
> 
> Here's the behaviour on hammer. I don't know if this changed in infernalis:
> 
> # ceph mon dump
> ...
> 0: 128.142.xxx:6790/0 mon.p01001532077xxx
> 1: 128.142.yyy:6790/0 mon.p01001532149yyy
> 2: 128.142.zzz:6790/0 mon.p01001532184zzz
> 
> 
> # ceph tell mon.* version
> mon.p01001532077xxx: ceph version 0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43)
> mon.p01001532149yyy: ceph version 0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43)
> mon.p01001532184zzz: ceph version 0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43)
> 
> So, mon.* resolves to the IDs. You can tell directly to the IDs:
> 
> # ceph tell mon.p01001532077xxx version
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> # ceph tell mon.p01001532149yyy version
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> # ceph tell mon.p01001532184zzz version
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> 
> And you can also tell directly to the ranks:
> 
> # ceph tell mon.0 version
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> # ceph tell mon.1 version
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> # ceph tell mon.2 version
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

Ok, thanks Dan for your answer. If I understand well:

1. with `ceph tell mon.* version`, "*" is expanded to the IDs of the monitors.

2. But with `ceph tell mon.$thing version`, if $thing is an integer, $thing
is interpreted as a rank, not as an ID, and if not, $thing is interpreted as
an ID.

Is that correct?

If yes, in conclusion: for the monitor ID, it's better to choose an ID which
is not an integer (even if it's not very dramatic).
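
For instance, something like this in ceph.conf (instead of [mon.1], [mon.2],
[mon.3]):

[mon.ceph01]
  host = ceph01
  mon addr = 10.0.2.101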

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.2.0 Infernalis released

2015-11-08 Thread Francois Lafont
On 09/11/2015 06:28, Francois Lafont wrote:
 
> I have just upgraded a cluster to 9.2.0 from 9.1.0.
> All seems to be well except I have this little error
> message :
> 
> ~# ceph tell mon.* version --format plain
> mon.1: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)
> mon.2: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)
> mon.3: Error ENOENT: problem getting command descriptions from mon.3 < 
> Here. ;)
> mon.3: problem getting command descriptions from mon.3
> 
> Except this little message, all seems to be fine.
> 
> ~# ceph -s
> cluster f875b4c1-535a-4f17-9883-2793079d410a
>  health HEALTH_OK
>  monmap e3: 3 mons at 
> {1=10.0.2.101:6789/0,2=10.0.2.102:6789/0,3=10.0.2.103:6789/0}
> election epoch 104, quorum 0,1,2 1,2,3
>  mdsmap e66: 1/1/1 up {0=3=up:active}, 2 up:standby
>  osdmap e256: 15 osds: 15 up, 15 in
> flags sortbitwise
>   pgmap v1094: 192 pgs, 3 pools, 31798 bytes data, 20 objects
> 560 MB used, 55862 GB / 55863 GB avail
>  192 active+clean
> 
> I have tried to restart mon.3 but no success. Should I ignore the
> message?

In fact, it's curious:

~# ceph mon dump
dumped monmap epoch 3
epoch 3
fsid f875b4c1-535a-4f17-9883-2793079d410a
last_changed 2015-11-04 08:25:37.700420
created 2015-11-04 07:31:38.790832
0: 10.0.2.101:6789/0 mon.1
1: 10.0.2.102:6789/0 mon.2
2: 10.0.2.103:6789/0 mon.3


~# ceph tell mon.1 version 
ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)

~# ceph tell mon.2 version 
ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)

~# ceph tell mon.3 version 
Error ENOENT: problem getting command descriptions from mon.3
[2] root@ceph03 06:35 ~

~# ceph tell mon.0 version 
ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)

Concerning monitors, I have this in my ceph.conf:

 [mon.1]
  host = ceph01
  mon addr = 10.0.2.101

[mon.2]
  host = ceph02
  mon addr = 10.0.2.102

[mon.3]
  host = ceph03
  mon addr = 10.0.2.103

So the IDs of my monitors are 1, 2, 3. But there is a little
problem because I have:

~# ceph tell mon.0 version 
ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)

So what is this mon.0 ??

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.2.0 Infernalis released

2015-11-08 Thread Francois Lafont
Hi,

I have just upgraded a cluster to 9.2.0 from 9.1.0.
All seems to be well except that I have this little error
message:

~# ceph tell mon.* version --format plain
mon.1: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)
mon.2: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58)
mon.3: Error ENOENT: problem getting command descriptions from mon.3   < Here. ;)
mon.3: problem getting command descriptions from mon.3

Except this little message, all seems to be fine.

~# ceph -s
cluster f875b4c1-535a-4f17-9883-2793079d410a
 health HEALTH_OK
 monmap e3: 3 mons at 
{1=10.0.2.101:6789/0,2=10.0.2.102:6789/0,3=10.0.2.103:6789/0}
election epoch 104, quorum 0,1,2 1,2,3
 mdsmap e66: 1/1/1 up {0=3=up:active}, 2 up:standby
 osdmap e256: 15 osds: 15 up, 15 in
flags sortbitwise
  pgmap v1094: 192 pgs, 3 pools, 31798 bytes data, 20 objects
560 MB used, 55862 GB / 55863 GB avail
 192 active+clean

I have tried to restart mon.3 but no success. Should I ignore the
message?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Francois Lafont
Hi,

On 20/10/2015 20:11, Stefan Eriksson wrote:

> A change like this below, where we have to change ownership was not add to a 
> point release for hammer right?

Right. ;)

I have upgraded my ceph cluster from 0.94.3 to 0.94.4 today without any problem.
The daemons used the root account in 0.94.3 and still use it in 0.94.4. I have
not changed the ownership of /var/lib/ceph/ at all for this upgrade.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS file to rados object mapping

2015-10-14 Thread Francois Lafont
Hi,

On 14/10/2015 06:45, Gregory Farnum wrote:

>> Ok, however during my tests I had been careful to replace the correct
>> file by a bad file with *exactly* the same size (the content of the
>> file was just a little string and I have changed it by a string with
>> exactly the same size). I had been careful to undo the mtime update
>> too (I had restore the mtime of the file before the change). Despite
>> this, the "repair" command worked well. Tested twice: 1. with the change
>> on the primary OSD and 2. on the secondary OSD. And I was surprised
>> because I though the test 1. (in primary OSD) will fail.
> 
> Hm. I'm a little confused by that, actually. Exactly what was the path
> to the files you changed, and do you have before-and-after comparisons
> on the content and metadata?

I didn't remember exactly the process I had used, so I have just retried it
today. Here is my process. I have a healthy cluster with 3 nodes (Ubuntu
Trusty) and ceph Hammer (version 0.94.3). I have mounted cephfs on
/mnt on one of the nodes.

~# cat /mnt/file.txt # yes it's a little file. ;)
123456

~# ls -i /mnt/file.txt 
1099511627776 /mnt/file.txt

~# printf "%x\n" 1099511627776
10000000000

~# rados -p data ls - | grep 10000000000
10000000000.00000000

I have the name of the object mapped to my "file.txt".

~# ceph osd map data 10000000000.00000000
osdmap e76 pool 'data' (3) object '10000000000.00000000' -> pg 3.f0b56f30 (3.30) -> up ([1,2], p1) acting ([1,2], p1)

So my object is in the primary OSD OSD-1 and in the secondary OSD OSD-2.
So I open a terminal in the node which hosts the primary OSD OSD-1 and
then:

~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456

~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3

Now, I change the content with this script called "change_content.sh" to
preserve the mtime after the change:

-
#!/bin/sh

f="$1"
f_tmp="${f}.tmp"
content="$2"
cp --preserve=all "$f" "$f_tmp"
echo "$content" >"$f"
touch -r "$f_tmp" "$f" # to restore the mtime after the change
rm "$f_tmp"
-

So, let's go: I replace the content with a new content of exactly
the same size (ie "ABCDEF" in this example):

~# ./change_content.sh /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3 ABCDEF

~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
ABCDEF

~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3

Now, the secondary OSD contains the good version of the object and
the primary a bad version. Now, I launch a "ceph pg repair":

~# ceph pg repair 3.30
instructing pg 3.30 on osd.1 to repair

# I'm in the primary OSD and the file below has been repaired correctly.
~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456

As you can see, the repair command has worked well.
Maybe my little test is too trivial?

>> Greg, if I understand you well, I shouldn't have too much confidence in
>> the "ceph pg repair" command, is it correct?
>>
>> But, if yes, what is the good way to repair a PG?
> 
> Usually what we recommend is for those with 3 copies to find the
> differing copy, delete it, and run a repair — then you know it'll
> repair from a good version. But yeah, it's not as reliable as we'd
> like it to be on its own.

I would like to be sure I understand well. The process could be (in
the case where size == 3):

1. In each of the 3 OSDs where my object is put:

md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

2. Normally, I will have the same result in 2 OSDs, and in the other
OSD, let's call it OSD-X, the result will be different. So, in the OSD-X,
I run:

rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

3. And now I can run the "ceph pg repair" command without risk:

ceph pg repair $pg_id
 
Is it the correct process?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Francois Lafont
Sorry, another remark.

On 13/10/2015 23:01, Sage Weil wrote:

> The v9.1.0 packages are pushed to the development release repositories::
> 
>   http://download.ceph.com/rpm-testing
>   http://download.ceph.com/debian-testing

I don't see the 9.1.0 packages available for Ubuntu Trusty:


http://download.ceph.com/debian-testing/dists/trusty/main/binary-amd64/Packages
(the string "9.1" is not present in this page currently)

The 9.0.3 is available but, after a quick test, this version of
the package doesn't create the ceph unix account.

Have I forgotten something?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.1.0 Infernalis release candidate released

2015-10-14 Thread Francois Lafont
Hi, and thanks to all for this good news. ;)

On 13/10/2015 23:01, Sage Weil wrote:

>#. Fix the data ownership during the upgrade.  This is the preferred 
> option,
>   but is more work.  The process for each host would be to:
> 
>   #. Upgrade the ceph package.  This creates the ceph user and group.  For
>example::
> 
>  ceph-deploy install --stable infernalis HOST
> 
>   #. Stop the daemon(s).::
> 
>  service ceph stop   # fedora, centos, rhel, debian
>  stop ceph-all   # ubuntu
>  
>   #. Fix the ownership::
> 
>  chown -R ceph:ceph /var/lib/ceph
> 
>   #. Restart the daemon(s).::
> 
>  start ceph-all# ubuntu
>  systemctl start ceph.target   # debian, centos, fedora, rhel

With this (preferred) option, if I understand well, I should
repeat the commands above host-by-host. Personally, my monitors
are hosted on the OSD servers (I have no dedicated monitor server).
So, with this option, I will have osd daemons upgraded before
monitor daemons. Is that a problem?

I ask the question because, during a migration to a new release,
it's generally recommended to upgrade _all_ the monitors before
upgrading the first osd daemon.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS file to rados object mapping

2015-10-09 Thread Francois Lafont
Hi,

Thanks for your answer Greg.

On 09/10/2015 04:11, Gregory Farnum wrote:

> The size of the on-disk file didn't match the OSD's record of the
> object size, so it rejected it. This works for that kind of gross
> change, but it won't catch stuff like a partial overwrite or loss of
> data within a file.

Ok, however during my tests I had been careful to replace the correct
file with a bad file of *exactly* the same size (the content of the
file was just a little string and I changed it to a string with
exactly the same size). I had been careful to undo the mtime update
too (I had restored the mtime of the file before the change). Despite
this, the "repair" command worked well. Tested twice: 1. with the change
on the primary OSD and 2. on the secondary OSD. And I was surprised
because I thought test 1. (on the primary OSD) would fail.

Greg, if I understand you well, I shouldn't have too much confidence in
the "ceph pg repair" command, is that correct?

But, if yes, what is the right way to repair a PG?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS file to rados object mapping

2015-10-08 Thread Francois Lafont
Hi,

On 08/10/2015 22:25, Gregory Farnum wrote:

> So that means there's no automated way to guarantee the right copy of
> an object when scrubbing. If you have 3+ copies I'd recommend checking
> each of them and picking the one that's duplicated...

It's curious because I have already tried with cephfs to "corrupt" a
file in the OSD backend. I had a little text file in cephfs mapped to
the object "$inode.$num" and this object was in the PG $pg_id, in the
primary OSD $primary and in the secondary OSD $secondary (I had indeed
size == 2). I thought that the primary OSD was always taken as reference
by the "ceph pg repair" command, so I have tried this:

# Test A
echo "foo blabla..." >/var/lib/ceph/osd/ceph-$primary/current/${pg_id}_head/$inode.$num
ceph pg repair $pg_id

and the "repair" command worked correctly and my file was repaired
correctly. I have tried to change the file in the secondary OSD too with:

# Test B
echo "foo blabla..." >/var/lib/ceph/osd/ceph-$secondary/current/${pg_id}_head/$inode.$num
ceph pg repair $pg_id

and it was the same, the file was repaired correctly too. In these 2
cases, the good OSD was taken as reference (the secondary for the test
A and the primary for the test B).

So, in this case, how did ceph know which copy was the correct object?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: 'ls -alR' performance terrible unless Linux cache flushed

2015-06-16 Thread Francois Lafont
Hi,

On 16/06/2015 18:46, negillen negillen wrote:

> Fixed! At least looks like fixed.

That's cool for you. ;)

> It seems that after migrating every node (both servers and clients) from
> kernel 3.10.80-1 to 4.0.4-1 the issue disappeared.
> Now I get decent speeds both for reading files and for getting stats from
> every node.

It seems to me that an interesting test could be to keep the old kernel on
your client nodes (ie 3.10.80-1), use ceph-fuse instead of the ceph kernel
module and test whether you get decent speeds too.
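
Something like this, roughly (untested on your setup; the monitor
addresses and the keyring location are just examples):

umount /mnt
ceph-fuse -m 10.0.2.150:6789,10.0.2.151:6789,10.0.2.152:6789 /mnt
# assumes a valid client keyring in /etc/ceph/ on the client node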

Bye.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.2 Hammer released

2015-06-11 Thread Francois Lafont
Hi,

On 11/06/2015 19:34, Sage Weil wrote:

> Bug #11442 introduced a change that made rgw objects that start with 
> underscore incompatible with previous versions. The fix to that bug 
> reverts to the previous behavior. In order to be able to access objects 
> that start with an underscore and were created in prior Hammer releases, 
> following the upgrade it is required to run (for each affected bucket)::
> 
> $ radosgw-admin bucket check --check-head-obj-locator \
>  --bucket=<bucket> [--fix]
> 
> You can get a list of buckets with
> 
> $ radosgw-admin bucket list

After the upgrade of my radosgw, I can't fix the problem of rgw objects
that start with an underscore. The command with the --fix option displays
some errors which I don't understand. Here is a (truncated) paste of my
shell below. Have I done something wrong?

Thx in advance for the help.
François Lafont

--
~# radosgw-admin --id=radosgw.gw2 bucket check --check-head-obj-locator --bucket=$bucket
{
"bucket": "moodles-poc-registry",
"check_objects": [
{
"key": {
"name": 
"_multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta",
"instance": ""
},
"oid": 
"default.763616.1___multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta",
"locator": 
"default.763616.1__multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta",
"needs_fixing": true,
"status": "needs_fixing"
},

[snip]

{
"key": {
"name": 
"_multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta",
"instance": ""
},
"oid": 
"default.763616.1___multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta",
"locator": 
"default.763616.1__multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta",
"needs_fixing": true,
"status": "needs_fixing"
}

]
}

~# radosgw-admin --id=radosgw.gw2 bucket check --check-head-obj-locator --bucket=$bucket --fix
2015-06-12 03:01:33.197984 7f3c9130d840 -1 ERROR: 
ioctx.operate(oid=default.763616.1___multipart_registry/images/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta)
 returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
2015-06-12 03:01:33.200428 7f3c9130d840 -1 ERROR: 
ioctx.operate(oid=default.763616.1___multipart_registry/images/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909/layer.2~poMH-PQKCLstUWpMQpji7JuGaBT53Th.meta)
 returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
2015-06-12 03:01:33.206875 7f3c9130d840 -1 ERROR: 
ioctx.operate(oid=default.763616.1___multipart_registry/images/c5a7fc74211188aabf3429539674275645b07717d003c390a943acc44f35c6d0/layer.2~Bg6bkbSOE8GCtV4Mxr0t56vSfTQTCx9.1)
 returned ret=-2
2015-06-12 03:01:33.209293 7f3c9130d840 -1 ERROR: 
ioctx.operate(oid=default.763616.1___multipart_registry/images/c5a7fc74211188aabf3429539674275645b07717d003c390a943acc44f35c6d0/layer.2~Bg6bkbSOE8GCtV4Mxr0t56vSfTQTCx9.2)
 returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2
ERROR: fix_head_object_locator() returned ret=-2

[snip]

2015-06-12 03:01:33.301101 7f3c9130d840 -1 ERROR: 
ioctx.operate(oid=default.763616.1___multipart_registry/images/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta)
 returned ret=-2
{
"bucket": "moodles-poc-registry",
"check_objects": [
{
"key": {
"name": 
"_multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta",
"instance": ""
},
"oid": 
"default.763616.1___multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta",
"locator": 
"default.763616.1__multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta",
"needs_fixing": true,
"status": "needs_fixing"
},

[snip]

{
"key": {
"name": 
"_multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TS

Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)

2015-06-08 Thread Francois Lafont
Hi,

On 27/05/2015 22:34, Gregory Farnum wrote:

> Sorry for the delay; I've been traveling.

No problem, me too, I'm not really fast to answer. ;)

>> Ok, I see. According to the online documentation, the way to close
>> a cephfs client session is:
>>
>> ceph daemon mds.$id session ls # to get the $session_id and the $address
>> ceph osd blacklist add $address
>> ceph osd dump  # to get the $epoch
>> ceph daemon mds.$id osdmap barrier $epoch
>> ceph daemon mds.$id session evict $session_id
>>
>> Is it correct?
>>
>> With the commands above, could I reproduce the client freeze in my testing
>> cluster?
> 
> Yes, I believe so.

In fact, after some tests, the commands above evict the client correctly
(`ceph daemon mds.1 session ls` returns an empty array), but on the client
side a new connection is automatically established as soon as the cephfs
mountpoint is requested. In fact, I haven't succeeded in reproducing the
freeze. ;) I have tried to stop the network on the client side (ifdown -a)
and after a few minutes (more than 60 seconds though), I saw in the
mds log "closing stale session client". But after an `ifup -a`, I got back
a cephfs connection and a mountpoint in good health.

>> And could it be conceivable one day (for instance with an option) to be
>> able to change the behavior of cephfs to be *not*-strictly-consistent,
>> like NFS for instance? It seems to me it could improve performances of
>> cephfs and cephfs could be more flexible concerning short network failure
>> (not really sure for this second point). Ok it's just a remark of a simple
>> and unqualified ceph-user ;) but it seems to me that NFS isn't strictly
>> consistent and generally this not a problem in many use cases. Am I wrong?
> 
> Mmm, this is something we're pretty resistant to.

Ah ok, so I don't insist. ;)

> In particular NFS
> just doesn't make any efforts to be consistent when there are multiple
> writers, and CephFS works *really hard* to behave properly in that
> case. For many use cases it's not a big deal, but for others it is,
> and we target some of them.

Ok. Thanks Greg for your answer.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs: one ceph account per directory?

2015-06-08 Thread Francois Lafont
Hi,

Gregory Farnum wrote:

>> 1. Can you confirm to me that currently it's impossible to restrict the read
>> and write access of a ceph account to a specific directory of a cephfs?
> 
> It's sadly impossible to restrict access to the filesystem hierarchy
> at this time, yes. By making use of the file layouts and assigning
> each user their own pool you can restrict access to the actual file
> data.

In fact, according to my test and with the precious help of John Spray
on IRC (thanks to him), it seems that the file-layouts feature can't protect
a cephfs directory against deletion by a specific ceph account.

I'll try to be more precise. On a client node, if I mount the cephfs with
a specific ceph account, the file-layouts feature makes it possible to
configure a cephfs directory so that "root" (on the node) will not be able
to *read* or *modify* the files contained in the directory, but "root" will
always be able to *remove* the files, because "root" will always have the
capability "to send unlink operations to the MDS and the MDS will purge
the files" (I take the liberty of quoting John Spray from IRC ;) and I have
indeed observed this behaviour).
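
For the record, the pool-per-user setup we discussed looks roughly like
this (the pool, client and directory names are just examples, and I have
not fully tested it):

ceph osd pool create userpool-a 64
ceph mds add_data_pool userpool-a
ceph auth get-or-create client.user-a mon 'allow r' mds 'allow' \
    osd 'allow rw pool=userpool-a'
setfattr -n ceph.dir.layout.pool -v userpool-a /mnt/user-a
# new files created under /mnt/user-a now store their data in userpool-a

But, as said above, this only restricts access to the file *data*: unlink
operations still go through the MDS.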

>> 2. Is it planned to implement a such feature in a next release of Ceph?
> 
> There are a couple students working on these features this summer, and
> many discussions amongst the core team about how to enable secure
> multi-tenancy in CephFS.

Ok, cool. I'll be happy to test this feature when it is released
(I have a knack for falling into bugs by accident ;)).

> Just the file layout/multiple-pool one, right now. Or you could do
> something like set up an NFS export that each user mounts of the
> CephFS, but then you lose all the CephFS goodness on the clients...

Ok, I see. Many thanks Greg for your answer.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mount options nodcache and nofsc

2015-05-21 Thread Francois Lafont
Hi,

Yan, Zheng wrote:
 
> fsc means fs-cache. it's a kernel facility by which a network
> filesystem can cache data locally, trading disk space to gain
> performance improvements for access to slow networks and media. cephfs
> does not use fs-cache by default.

So enabling this option can improve performance, correct?
Is there a downside in return?
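
If I understand correctly, on the client it would be something like this
(assuming cachefilesd is installed and running; I haven't tried it myself,
and the mount options are just those of my own setup):

mount -t ceph 10.0.2.150:6789:/ /mnt -o name=cephfs,secretfile=/etc/ceph/secret,fsc

The downside I can see is that it consumes local disk space for the cache
and adds one more moving part on the client.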

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to backup hundreds or thousands of TB

2015-05-17 Thread Francois Lafont
Hi,

Wido den Hollander wrote:
 
> Aren't snapshots something that should protect you against removal? IF
> snapshots work properly in CephFS you could create a snapshot every hour.

Are you talking about the .snap/ directory in a cephfs directory?
If yes, does it work well? Because, with Hammer, if I want to enable
this feature:

~# ceph mds set allow_new_snaps true
Error EPERM: Snapshots are unstable and will probably break your FS!
Set to --yes-i-really-mean-it if you are sure you want to enable them

I have never tried with the --yes-i-really-mean-it option. The warning
is not very encouraging. ;)

> With the recursive statistics [0] of CephFS you could "easily" backup
> all your data to a different Ceph system or anything not Ceph.

What is the link between this (very interesting) recursive statistics
feature and the backup? I'm not sure I understand. Can you explain it to me?
Maybe you test whether the size of a directory has changed?

> I've done this with a ~700TB CephFS cluster and that is still working
> properly.
> 
> Wido
> 
> [0]:
> http://blog.widodh.nl/2015/04/playing-with-cephfs-recursive-statistics/

Thanks Wido for this very interesting (and very simple) feature.
But does it work well? Because I use Hammer on Ubuntu Trusty
cluster nodes, and on an Ubuntu Trusty client with a 3.16 kernel
and cephfs mounted with the kernel client, I have this:

~# mount | grep cephfs # /mnt is my mounted cephfs
10.0.2.150,10.0.2.151,10.0.2.152:/ on /mnt type ceph (noacl,name=cephfs,key=client.cephfs)

~# ls -lah /mnt/dir1/
total 0
drwxr-xr-x 1 root root  96M May 12 21:06 .
drwxr-xr-x 1 root root 103M May 17 23:56 ..
drwxr-xr-x 1 root root  96M May 12 21:06 8
drwxr-xr-x 1 root root 4.0M May 17 23:57 test

As you can see:
  /mnt/dir1/8/  => 96M
  /mnt/dir1/test/   => 4.0M

But:
  /mnt/dir1/ (ie .) => 96M

I should have:

size("/mnt/dir1/") = size("/mnt/dir1/8/") + size("/mnt/dir1/test/")

and this is not the case. Is it normal?
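
For what it's worth, the recursive statistics can also be read directly as
virtual xattrs on the directories, which may be easier to compare than the
`ls` sizes above:

getfattr -n ceph.dir.rbytes /mnt/dir1
getfattr -n ceph.dir.rbytes /mnt/dir1/8
getfattr -n ceph.dir.rbytes /mnt/dir1/test
getfattr -n ceph.dir.rctime /mnt/dir1   # recursive most recent ctime, handy for incremental backups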

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)

2015-05-17 Thread Francois Lafont
John Spray wrote:

> Greg's response is pretty comprehensive, but for completeness I'll add that 
> the specific case of shutdown blocking is http://tracker.ceph.com/issues/9477

Yes indeed, during the freeze, "INFO: task sync:3132 blocked for more than 120
seconds..." was exactly the message I saw in the VNC console of the client
(it was an OpenStack VM).

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)

2015-05-17 Thread Francois Lafont
Hi,

Sorry for my late answer.

Gregory Farnum wrote:

>> 1. Is this kind of freeze normal? Can I avoid these freezes with a
>> more recent version of the kernel in the client?
> 
> Yes, it's normal. Although you should have been able to do a lazy
> and/or force umount. :)

Ah, I hadn't tried that.
Maybe I'm wrong, but I think a "lazy" or a "force" umount wouldn't
succeed. I'll try to test it if I can reproduce the freeze.

> You can't avoid the freeze with a newer client. :(
> 
> If you notice the problem quickly enough, you should be able to
> reconnect everything by rebooting the MDS — although if the MDS hasn't
> failed the client then things shouldn't be blocking, so actually that
> probably won't help you.

Yes, the mds was completely ok and, after the hard reboot of the client,
the client had access to the cephfs again with exactly the same mds service
on the cluster side (no restart etc).

>> 2. Can I avoid these freezes with ceph-fuse instead of the kernel
>> cephfs module? But in this case, the cephfs performance will be
>> worse. Am I wrong?
> 
> No, ceph-fuse will suffer the same blockage, although obviously in
> userspace it's a bit easier to clean up.

Yes, I suppose that after "kill" commands, I would be able to remount
the cephfs without any reboot etc., wouldn't I?

> Depending on your workload it
> will be slightly faster to a lot slower. Though you'll also get
> updates faster/more easily. ;)

Yes, I imagine that with "ceph-fuse" I get a fully up-to-date cephfs
client (in user space) whereas with the kernel cephfs client I get just
the version available in the current kernel of my client node (3.16 in
my case).

>> 3. Is there a parameter in ceph.conf to tell mds to be more patient
>> before closing the "stale session" of a client?
> 
> Yes. You'll need to increase the "mds session timeout" value on the
> MDS; it currently defaults to 60 seconds. You can increase that to
> whatever values you like. The tradeoff here is that if you have a
> client die, anything it had "capabilities' on (for read/write access)
> will be unavailable for anybody who's doing something that might
> conflict with those capabilities.

Ok, thanks for the warning, it seems logical.
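
So, if I understand correctly, the change would be something like this in
ceph.conf on the mds nodes (300 is just an example value):

[mds]
  # default is 60 seconds; a dead client keeps its capabilities that much longer
  mds session timeout = 300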

> If you've got a new enough MDS (Hammer, probably, but you can check)

Yes, I use Hammer.

> then you can use the admin socket to boot specific sessions, so it may
> suit you to set very large timeouts and manually zap any client which
> actually goes away badly (rather than getting disconnected by the
> network).

Ok, I see. According to the online documentation, the way to close
a cephfs client session is:

ceph daemon mds.$id session ls # to get the $session_id and the $address
ceph osd blacklist add $address
ceph osd dump  # to get the $epoch
ceph daemon mds.$id osdmap barrier $epoch
ceph daemon mds.$id session evict $session_id

Is it correct?

With the commands above, could I reproduce the client freeze in my testing
cluster?

I'll try, because it's convenient to be able to reproduce the problem with
just command lines (without really stopping the network on the client etc).
I would like to test whether, with ceph-fuse, I can easily restore the
situation of my client.

>> I'm in a testing period and a hard reboot of my cephfs clients would
>> be quite annoying for me. Thanks in advance for your help.
> 
> Yeah. Unfortunately there's a basic tradeoff in strictly-consistent
> (aka POSIX) network filesystems here: if the network goes away, you
> can't be consistent any more because the disconnected client can make
> conflicting changes. And you can't tell exactly when the network
> disappeared.

And could it be conceivable one day (for instance with an option) to be
able to change the behavior of cephfs to be *not*-strictly-consistent,
like NFS for instance? It seems to me it could improve the performance of
cephfs, and cephfs could be more tolerant of short network failures
(not really sure about this second point). Ok, it's just a remark from a
simple and unqualified ceph-user ;) but it seems to me that NFS isn't
strictly consistent and generally this is not a problem in many use cases.
Am I wrong?

> So while we hope to make this less painful in the future, the network
> dying that badly is a failure case that you need to be aware of
> meaning that the client might have conflicting information. If it
> *does* have conflicting info, the best we can do about it is be
> polite, return a bunch of error codes, and unmount gracefully. We'll
> get there eventually but it's a lot of work.

Yes, I can imagine the amount of work...
Thanks a lot Greg for your answer. ;)

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)

2015-05-14 Thread Francois Lafont
Hi,

I had a problem with a cephfs freeze on a client. It was impossible to
re-enable the mountpoint. A simple "ls /mnt" command blocked completely
(and of course it was impossible to umount/remount etc.) and I had
to reboot the host. But even a "normal" reboot didn't work, the
host didn't stop. I had to do a hard reboot of the host. In brief,
it was like a big "NFS" freeze. ;)

In the logs, nothing relevant on the client side and just this line
on the cluster side:

~# cat /var/log/ceph/ceph-mds.1.log
[...]
2015-05-14 17:07:17.259866 7f3b5cffc700  0 log_channel(cluster) log [INF] : 
closing stale session client.1342358 192.168.21.207:0/519924348 after 301.329013
[...]

And indeed, the freeze was probably triggered by a little network
interruption.

Here is my configuration:
- OS: Ubuntu 14.04 in the client and in the cluster nodes.
- Kernel: 3.16.0-36-generic in the client and in the cluster nodes.
  (apt-get install linux-image-generic-lts-utopic).
- Ceph version: Hammer in the client and in cluster nodes (0.94.1-1trusty).

In the client, I use the cephfs kernel module (not ceph-fuse). Here
is the fstab line in the client node:

10.0.2.150,10.0.2.151,10.0.2.152:/ /mnt ceph noatime,noacl,name=cephfs,secretfile=/etc/ceph/secret,_netdev 0 0

My only configuration concerning mds in ceph.conf is just:

  mds cache size = 100

That's all.

Here are my questions:

1. Is this kind of freeze normal? Can I avoid these freezes with a
more recent version of the kernel in the client?

2. Can I avoid these freezes with ceph-fuse instead of the kernel
cephfs module? But in this case, the cephfs performance will be
worse. Am I wrong?

3. Is there a parameter in ceph.conf to tell mds to be more patient
before closing the "stale session" of a client?

I'm in a testing period and a hard reboot of my cephfs clients would
be quite annoying for me. Thanks in advance for your help.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Find out the location of OSD Journal

2015-05-07 Thread Francois Lafont
Hi,

Patrik Plank wrote:

> i cant remember on which drive I install which OSD journal :-||
> Is there any command to show this?

It's probably not the answer you hoped for, but why not use a simple:

ls -l /var/lib/ceph/osd/ceph-$id/journal

?
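
And to resolve the symlinks for all the OSDs of a host at once, something
like:

for j in /var/lib/ceph/osd/ceph-*/journal; do
    echo "$j -> $(readlink -f $j)"
done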

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some more numbers - CPU/Memory suggestions for OSDs and Monitors

2015-04-22 Thread Francois Lafont
Mark Nelson wrote:

> I'm not sure who came up with the 1GB for each 1TB of OSD daemons rule, but 
> frankly I don't think it scales well at the extremes.  You can't get by with 
> 256MB of ram for OSDs backed by 256GB SSDs, nor do you need 6GB of ram per 
> OSD for 6TB spinning disks.
> 
> 2-4GB of RAM per OSD is reasonable depending on how much page cache you need. 
>  I wouldn't stray outside of that range myself.

Ok, noted.

> What it really comes down to is that your CPU needs to be fast enough to 
> process your workload.  Small IOs tend to be more CPU intensive than large 
> IOs.  Some processors have higher IPC than others so it's all just kind of a 
> vague guessing game.  With modern Intel XEON processors, 1GHz of 1 core is a 
> good general estimate.  If you are doing lots of small IO with SSD backed 
> OSDs you may need more.  If you are doing high performance erasure coding you 
> may need more.  If you have slow disks with journals on disk, 3x replication, 
> and a mostly read workload, you may be able to get away with less.
> 
> As always, the recommendations above are just recommendations.  It's best if 
> you can test yourself.

Yes, sure. Thx for the explanations Mark. :)
Bye.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs: proportion of data between data pool and metadata pool

2015-04-22 Thread Francois Lafont
Hi,

When I want to have an estimation of the pg_num of a new pool,
I use this very useful page: http://ceph.com/pgcalc/.
In the table, I must give the %data of a pool. For instance, for
a "rados gateway only" use case, I can see that, by default, the
page gives:

- .rgw.buckets => 96.90% of data
- .rgw.control =>  0.10% of data
- etc.

But in the menu, the use case "cephfs only" doesn't exist and I have
no idea of the %data for each of the metadata and data pools. So,
approximately, what is the proportion of %data between the "data" pool and
the "metadata" pool of cephfs in a cephfs-only cluster?

Is it rather metadata=20%, data=80%?
Is it rather metadata=10%, data=90%?
Is it rather metadata= 5%, data=95%?
etc.

Thanks in advance.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw and mds hardware configuration

2015-04-22 Thread Francois Lafont
Hi Cephers, :)

I would like to know if there are some rules to estimate (approximately)
the CPU and RAM needs for:

1. a radosgw server (for instance with Hammer and civetweb).
2. an mds server

If I am not mistaken, for these 2 types of server, there is no particular
storage requirement.

For an mds server, I wonder if this page is up to date:


http://ceph.com/docs/master/start/hardware-recommendations/#minimum-hardware-recommendations

1GB per mds daemon seems very little to me.

Thanks for your help.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] decrease pg number

2015-04-22 Thread Francois Lafont
Hi,

Pavel V. Kaygorodov wrote:

> I have updated my cluster to Hammer and got a warning "too many PGs
> per OSD (2240 > max 300)". I know, that there is no way to decrease
> number of page groups, so I want to re-create my pools with less pg
> number, move all my data to them, delete old pools and rename new
> pools as the old ones. Also I want to preserve the user rights on new
> pools. I have several pools with RBD images, some of them with
> snapshots.
> 
> Which is the proper way to do this?

I'm not a ceph expert and I can just tell you my (small but happy) experience. ;)
I had the same problem with my radosgw pools, ie:

- the .rgw.* pools except ".rgw.buckets", and
- the .users.* pools

So, **warning**, it was for very tiny pools. The version of Ceph
was Hammer 94.1, nodes were Ubuntu 14.04 with 3.16 kernel. These commands
worked well for me:

-
# /!\ Before I have stopped my radosgws (ie the ceph clients of the pools).

old_pool=foo
new_pool=foo.new

ceph osd pool create $new_pool 64
rados cppool $old_pool $new_pool
ceph osd pool delete $old_pool $old_pool --yes-i-really-really-mean-it
ceph osd pool rename $new_pool $old_pool

# And I have restarted my radosgws.
-

That's all. In my case, it was very fast because the pools didn't contain
much data.

And I'd extend your question: is it possible to do the same process
with a cephfs pool? For instance, the metadata pool?

If I try the commands above, I have an error with the delete
command:

~# ceph osd pool delete metadata metadata --yes-i-really-really-mean-it
Error EBUSY: pool 'metadata' is in use by CephFS

However, I'm sure no client uses the cephfs (it's a test cluster).

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some more numbers - CPU/Memory suggestions for OSDs and Monitors

2015-04-22 Thread Francois Lafont
Hi,

Christian Balzer wrote: 

>> thanks for the feedback regarding the network questions. Currently I try
>> to solve the question of how much memory, cores and GHz for OSD nodes
>> and Monitors.
>>
>> My research so far:
>>
>> OSD nodes: 2 GB RAM, 2 GHz, 1 Core (?) per OSD
>>
> RAM is enough, but more helps (page cache on the storage node makes the
> reads of hot objects quite fast and prevents concurrent access to the
> disks).

Personally, I have seen a different rule for the RAM: "1GB for each 1TB
of OSD daemons". This is what I understand from this doc:


http://ceph.com/docs/master/start/hardware-recommendations/#minimum-hardware-recommendations

So, for instance, with (it's just a stupid example):

- 4 OSD daemons of 6TB and
- 5 OSD daemons of 1TB

The needed RAM would be:

1GB x (4 x 6) + 1GB x (5 x 1) = 29GB for the RAM

Is it correct? Because if I follow the "2GB RAM per OSD" rule, I just need:

2GB x 9 = 18GB.

Which rule is correct?

> 1GHz or so for per pure HDD based OSD, at least 2GHz for HDD OSDs with SSD
> journals, as much as you can afford for entirely SSD based OSDs.

Are there links about the "at least 2GHz per OSD with SSD journal" rule?
I have never seen that except on this mailing list. For instance, in the
"HARDWARE CONFIGURATION GUIDE" from Inktank, it just says: "one GHz
per OSD" (https://ceph.com/category/resources/).

Why should SSD journals increase the needed CPU?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is a "dirty" object

2015-04-20 Thread Francois Lafont
Hi,

John Spray wrote:
 
> As far as I can see, this is only meaningful for cache pools, and object is 
> "dirty" in the sense of having been created or modified since their its last 
> flush.  For a non-cache-tier pool, everything is logically dirty since it is 
> never flushed.
> 
> I hadn't noticed that we presented this as nonzero for regular pools before, 
> it is a bit weird.  Perhaps we should show zero here instead for 
> non-cache-tier pools.

Ok, in this case, maybe something like "Not_Relevant" or "NR" could
be more suitable.

Thank you John.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about an example of ceph infrastructure

2015-04-19 Thread Francois Lafont
Hi,

Christian Balzer wrote:

> For starters, make that 5 MONs. 
> It won't really help you with your problem of keeping a quorum when
> loosing a DC, but being able to loose more than 1 monitor will come in
> handy.
> Note that MONs don't really need to be dedicated nodes, if you know what
> you're doing and have enough resources (most importantly fast I/O aka SSD
> for the leveldb) on another machine.

Ok, I'll keep that in mind.

>> In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk.
>> Journals in SSD, there are 2 SSD so 3 journals per SSD.
>> In DC2: the same config.
>>
> Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
> (relatively low) density per node?

Sorry, I have no idea because, in fact, it was just an example to be
concrete. So I took an (imaginary) server with 8 disks and 2 SSDs
(among the 8 disks, 2 for the OS in software RAID1). Currently, I can't be
precise about hardware because we are not at all fixed on the budget
(if we get it!), there are a lot of uncertainties.

> 4 nodes make a pretty small cluster, if you loose a SSD or a whole node
> your cluster will get rather busy and may run out of space if you filled
> it more than 50%.

Yes indeed, it's a relevant remark. If the cluster is ~50% filled and if a
node crashes in a DC, the other node in the same DC will be 100% filled and
the cluster will be blocked. Indeed, the cluster is probably too small.

> Unless you OSDs are RAID1s, a replica of 2 is basically asking Murphy to
> "bless" you with a double disk failure. A very distinct probability with
> 24 HDDs. 

The probability of a *simultaneous* disk failure in DC1 and in DC2 seems to
me relatively low. For instance, if a disk fails in DC1 and if the rebalancing
of data takes ~ 1 or 2 hours, it seems to me acceptable. But maybe I'm too
optimistic... ;)

> With OSDs backed by plain HDDs you really want a replica size of 3.

But the "2-DCs" topology isn't really suitable for a replica size of 3, no?
Is the replica size of 2 so risky?

> Normally you'd configure Ceph to NOT set OSDs out automatically if a DC
> fails (mon_osd_down_out_subtree_limit)

I didn't know about this option. In the online doc, the explanations are not
clear enough for me and I'm not sure I understand its meaning. If I set:

mon_osd_down_out_subtree_limit = datacenter

what are the consequences?

- If all OSDs in DC2 are unreachable, these OSDs will not be marked out
- and if only several OSDs in DC2 are unreachable but not all in DC2,
  these OSDs will be marked out.

Am I correct?

> but in the case of a prolonged DC
> outage you'll want to restore redundancy and set those OSDs out. 
> Which means you will need 3 times the actual data capacity on your
> surviving 2 nodes.
> In other words, if your 24 OSDs are 2TB each you can "safely" only store
> 8TB in your cluster (48TB/3(replica)/2(DCs).

I see, but my idea was just to handle a disaster in DC1 long enough that
I must restart the cluster in degraded mode in DC2, but not long enough
that I must restore full redundancy in DC2. Personally, I hadn't considered
this case and, unfortunately, I think we will never have the budget
to be able to restore full redundancy in just one datacenter. I'm afraid
that is out of reach for us.

> Fiber isn't magical FTL (faster than light) communications and the latency
> depends (mostly) on the length (which you may or may not control) and the
> protocol used. 
> A 2m long GbE link has a much worse latency than the same length in
> Infiniband.

In our case, if we can implement this infrastructure (if we have the
budget etc.), the connection would probably be 2 dark fibers with 10km
between DC1 and DC2. And we'll use Ethernet switches with SFP transceivers
(if you have good references for switches, I'm interested). I suppose it
could be possible to have low latencies in this case, no?

> You will of course need "enough" bandwidth, but what is going to kill
> (making it rather slow) your cluster will be the latency between those DCs.
> 
> Each write will have to be acknowledged and this is where every ms less of
> latency will make a huge difference.

Yes indeed, I understand.
 
>> For instance, I suppose the OSD disks in DC1 (and in DC2) has
>> a throughput equal to 150 MB/s, so with 12 OSD disk in each DC,
>> I have:
>>
>> 12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps
>>
>> So, in the fiber, I need to have 14.4 Mbs. Is it correct? 
> 
> How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?

Sorry, it was a misprint, I wanted to write 14.4 Gb/s of course. ;)

> You to multiply, not divide. 
> And assuming 10 bits (not 8) for a Byte when serialized never hurts. 
> So that's 18 Gb/s.

Yes, indeed. So the "naive" estimation gives 18 Gb/s (Ok for 10 bits
instead of 8).

>> Maybe is it too naive reasoning?
>
> Very much so. Your disks (even with SSD journals) will not write 150MB/s,
> because Ceph doesn't do long sequential writes (though 4MB blobs are
> better than 

[ceph-users] What is a "dirty" object

2015-04-18 Thread Francois Lafont
Hi,

With my testing cluster (Hammer on Ubuntu 14.04), I have this:

--
~# ceph df detail
GLOBAL:
    SIZE  AVAIL RAW USED %RAW USED OBJECTS
    4073G 3897G 176G     4.33      23506
POOLS:
    NAME               ID CATEGORY USED   %USED MAX AVAIL OBJECTS DIRTY READ  WRITE
    data               0  -        20579M 0.49  1934G     6973    6973  597k  2898k
    metadata           1  -        81447k 0     1934G     53      53    243   135k
    volumes            3  -        56090M 1.34  1934G     14393   14393 208k  2416k
    images             4  -        12194M 0.29  1934G     1551    1551  6263  5912
    .rgw.buckets       13 -        362M   0     1934G     445     445   9244  14954
    .users             25 -        26     0     1934G     3       3     0     3
    .users.email       26 -        26     0     1934G     3       3     0     3
    .users.uid         27 -        1059   0     1934G     6       6     12    6
    .rgw.root          28 -        840    0     1934G     3       3     63    3
    .rgw.control       29 -        0      0     1934G     8       8     0     8
    .rgw.buckets.extra 30 -        0      0     1934G     8       8     0     8
    .rgw.buckets.index 31 -        0      0     1934G     11      11    0     11
    .rgw.gc            32 -        0      0     1934G     32      32    0     32
    .rgw               33 -        3064   0     1934G     17      17    0     17
--

If I understand correctly, all objects in the cluster are "dirty".
Is that normal?
What is a "dirty" object?

Thanks for your help.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Questions about an example of ceph infrastructure

2015-04-18 Thread Francois Lafont
Hi,

We are thinking about a ceph infrastructure and I have questions.
Here is the conceived (but not yet implemented) infrastructure:
(please, be careful to read the schema with a monospace font ;))


 +-+
 |  users  |
 |(browser)|
 +++
  |
  |
 +++
 | |
  +--+   WAN   ++
  |  | ||
  |  +-+|
  | |
  | |
+-+-+ +-+-+
|   | |   |
| monitor-1 | | monitor-3 |
| monitor-2 | |   |
|   |  Fiber connection   |   |
|   +-+   |
|  OSD-1| |  OSD-13   |
|  OSD-2| |  OSD-14   |
|   ... | |   ... |
|  OSD-12   | |  OSD-24   |
|   | |   |
| client-a1 | | client-a2 |
| client-b1 | | client-b2 |
|   | |   |
+---+ +---+
 Datacenter1   Datacenter2
(DC1) (DC2)

In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk.
Journals in SSD, there are 2 SSD so 3 journals per SSD.
In DC2: the same config.

You can imagine for instance that:
- client-a1 and client-a2 are radosgw 
- client-b1 and client-b2 are web servers which use the Cephfs of the cluster.

And of course, the principle is to have data dispatched in DC1 and
DC2 (size == 2, one copy of the object in DC1, the other in DC2).
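
For the record, the CRUSH rule I have in mind for that would look roughly
like this (assuming a crush map with a "datacenter" bucket type and the two
datacenters defined under the default root; not tested yet):

rule replicated_2dc {
    ruleset 1
    type replicated
    min_size 2
    max_size 2
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}

# then, for each pool with size == 2:
ceph osd pool set data crush_ruleset 1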


1. If I suppose that the latency between DC1 and DC2 (via the fiber
connection) is ok, I would like to know what throughput I need to
avoid a network bottleneck. Is there a rule to compute the needed
throughput? I suppose it depends on the disk throughputs?

For instance, I suppose the OSD disks in DC1 (and in DC2) has
a throughput equal to 150 MB/s, so with 12 OSD disk in each DC,
I have:

12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps

So, in the fiber, I need to have 14.4 Mbs. Is it correct? Maybe is it
too naive reasoning?

Furthermore, I have not taken the SSDs into account. How can I evaluate the
needed throughput more precisely?


2. I'm thinking about disaster recoveries too. For instance, if there
is a disaster in DC2, DC1 will work (fine). But if there is a disaster
in DC1, DC2 will not work (no quorum).

But now, I suppose there is a long and big disaster in DC1. So I suppose
DC1 is totally unreachable. In this case, I want to start (manually) my
ceph cluster in DC2. No problem with that, I have seen explanations in the
documentation to do that:

- I stop monitor-3
- I extract the monmap
- I remove monitor-1 and monitor-2 from this monmap
- I inject the new monmap in monitor-3 
- I restart monitor-3

After that, I have a DC1 unreachable but DC2 is working (with only one monitor).
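
Concretely, on monitor-3, I imagine it would be something like this
(untested; the ids and the path of the temporary monmap are just my guesses
from the documentation):

stop ceph-mon id=monitor-3
ceph-mon -i monitor-3 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm monitor-1
monmaptool /tmp/monmap --rm monitor-2
ceph-mon -i monitor-3 --inject-monmap /tmp/monmap
start ceph-mon id=monitor-3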

But what happens if DC1 becomes reachable again? What will the behavior of
monitor-1 and monitor-2 be in this case? Will monitor-1 and monitor-2
understand that they no longer belong to the ceph cluster?

And now I imagine the worst scenario: DC1 becomes reachable again but the
switch in DC1 which is connected to the fiber takes a long time to restart,
so that, during a short period, DC1 is reachable but the connection with DC2
is not yet operational. What happens during this period? client-a1 and
client-b1 could write data to the cluster in this case, right? And the data
in the cluster could be compromised because DC1 is not aware of the writes
in DC2. Am I wrong?

My conclusion about that is: in case of a long disaster in DC1, I can restart
the ceph cluster in DC2 with the method described above (removing monitor-1
and monitor-2 from the monmap on monitor-3 etc.) but *only* *if* I can
definitively stop monitor-1 and monitor-2 in DC1 beforehand (and if I can't,
I do nothing and I wait). Is that correct?

Thanks in advance for your explanations.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from Firefly to Hammer

2015-04-14 Thread Francois Lafont
Hi,

Garg, Pankaj wrote:

> I have a small cluster of 7 machines. Can I just individually upgrade each of 
> them (using apt-get upgrade) from Firefly to Hammer release, or there more to 
> it than that?

Not exactly, it's the "individually" part which is not correct. ;)
You should indeed "apt-get upgrade" on each of the nodes 1, ..., 7,
but afterwards you should follow this order:

1. restart the monitor daemons on each node
2. then, restart the osd daemons on each node
3. then, restart the mds daemons on each node
4. then, restart the radosgw daemon on each node
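
On Ubuntu with the upstart jobs, that would be something like this, one
node at a time and in the order above (adapt to your init system and to
the daemons actually running on each node):

apt-get update && apt-get install ceph ceph-common   # upgrade the packages on every node first
restart ceph-mon-all     # on the monitor nodes
restart ceph-osd-all     # then on the osd nodes
restart ceph-mds-all     # then on the mds nodes
service radosgw restart  # then on the radosgw nodes (or your radosgw init script)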

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] norecover and nobackfill

2015-04-14 Thread Francois Lafont
Robert LeBlanc wrote:

> Hmmm... I've been deleting the OSD (ceph osd rm X; ceph osd crush rm osd.X)
> along with removing the auth key. This has caused data movement,

Maybe, but if the "noout" flag is set, removing an OSD from the cluster
doesn't trigger any data movement at all (I tested this with Firefly).

> I'd still like to know the difference between norecover and nobackfill if
> anyone knows.

If I read this page, http://ceph.com/docs/master/rados/operations/pg-states/,
I understand that backfilling is just a special, more "thorough" case of
recovery (but I'm not a ceph expert).

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to dispatch monitors in a multi-site cluster (ie in 2 datacenters)

2015-04-13 Thread Francois Lafont
Joao Eduardo wrote:

> To be more precise, it's the lowest IP:PORT combination:
> 
> 10.0.1.2:6789 = rank 0
> 10.0.1.2:6790 = rank 1
> 10.0.1.3:6789 = rank 3
> 
> and so on.

Ok, so if there are 2 possible quorums, the quorum with the
lowest IP:PORT will be chosen. But what happens if, between the
2 possible quorums, quorum A and quorum B, the monitor which
has the lowest IP:PORT belongs to both quorum A and quorum B?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw: upgrade Firefly to Hammer, impossible to create bucket

2015-04-13 Thread Francois Lafont
Hi,

Yehuda Sadeh-Weinraub wrote:

> The 405 in this case usually means that rgw failed to translate the http 
> hostname header into
> a bucket name. Do you have 'rgw dns name' set correctly? 

Ah, I found it, and indeed it concerned "rgw dns name", as Karan also thought. ;)
But it's a little curious. Explanation:

My s3cmd client uses hostnames of the form below (which resolve correctly
to the IP address of the radosgw host):

<bucket>.ostore.athome.priv

And in the configuration of my radosgw, I had:

---
[client.radosgw.gw1]
  host= ceph-radosgw1
  rgw dns name= ostore
  ...
---

ie just the *short* name of the radosgw's fqdn (its fqdn is ostore.athome.priv).
And with Firefly, it worked well, I never had a problem with this configuration!
But with Hammer, it doesn't work anymore (I don't know why). Now, with Hammer,
I notice that I have to put the fqdn in "rgw dns name", not the short name:

---
[client.radosgw.gw1]
  host= ceph-radosgw1
  rgw dns name= ostore.athome.priv
  ...
---

And with this configuration, it works.

Is that normal? In fact, maybe my configuration with the short name (instead
of the fqdn) was never valid and I was just lucky that it worked so far. Is
that the right conclusion of the story?

In fact, I think I have never really understood the meaning of the "rgw dns name"
parameter. Can you confirm (or not) this:

This parameter is *only* used when an S3 client accesses a bucket with
the method http://<bucket>.<rgw dns name>/... If we don't set this
parameter, such access will not work and an S3 client can access a
bucket only with the method http://<rgw host>/<bucket>.

Is that correct?
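
And, if I'm not mistaken, this method also requires a wildcard DNS entry so
that every bucket subdomain resolves to the radosgw, something like this in
the zone (the IP address is just an example):

ostore.athome.priv.    IN  A      192.168.0.10
*.ostore.athome.priv.  IN  CNAME  ostore.athome.priv.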

Thx Yehuda and thx to Karan (who pointed out the real problem, in fact ;)).

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] norecover and nobackfill

2015-04-13 Thread Francois Lafont
Hi,

Robert LeBlanc wrote:

> What I'm trying to achieve is minimal data movement when I have to service
> a node to replace a failed drive. [...]

I may be saying something stupid, but it seems to me that this is the
goal of the "noout" flag, isn't it?

1. ceph osd set noout
2. An old OSD disk fails; no rebalancing of data because noout is set, the
cluster is just degraded.
3. You remove from the cluster the OSD daemon which used the old disk.
4. You power off the host, replace the old disk with a new disk and restart
the host.
5. You create a new OSD on the new disk.

With these steps, there will be no data movement, except during step 5 when
the data will be recreated on the new disk (but that is normal and desired).
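
In terms of commands, I imagine something like this (the osd id and the
devices are just examples, to adapt of course):

ceph osd set noout                   # step 1
# step 2: osd.12 dies, the cluster is degraded but nothing moves
ceph osd crush remove osd.12         # step 3
ceph auth del osd.12
ceph osd rm 12
# step 4: power off, swap the disk, power on
ceph-disk prepare /dev/sdd /dev/sdb  # step 5: new OSD, journal on the SSD
ceph-disk activate /dev/sdd1
ceph osd unset noout                 # when the new OSD is up and in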

Sorry in advance if there is something I'm missing in your problem.
Regards.


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


  1   2   >