Re: [ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-06-03 Thread Jan Schermer
It should be noted that using "async" with NFS _will_ corrupt your data if the
server crashes or loses power before the data has reached stable storage.
It's OK-ish for something like an image library, but it's most certainly not OK
for VM drives, databases, or any kind of binary blobs that you can't recreate.

If ceph-fuse is fast (you are testing that on the NFS client side, right?) then
it must be completely ignoring the sync semantics the NFS server asks for when doing IO.
I'd call that a serious bug unless it's documented somewhere...
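
For clarity, the difference is just the export option in /etc/exports -- the path
and client range below are made up for illustration:

  # /etc/exports (illustrative only)
  /mnt/cephfs  192.168.3.0/24(rw,sync,no_subtree_check)    # server replies only after data is stable
  /mnt/cephfs  192.168.3.0/24(rw,async,no_subtree_check)   # faster, but unflushed writes are lost on a crash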

Jan


> On 03 Jun 2016, at 06:03, Yan, Zheng  wrote:
> 
> On Mon, May 30, 2016 at 10:29 PM, David  wrote:
>> Hi All
>> 
>> I'm having an issue with slow writes over NFS (v3) when cephfs is mounted
>> with the kernel driver. Writing a single 4K file from the NFS client is
>> taking 3 - 4 seconds, however a 4K write (with sync) into the same folder on
>> the server is fast as you would expect. When mounted with ceph-fuse, I don't
>> get this issue on the NFS client.
>> 
>> Test environment is a small cluster with a single MON and single MDS, all
>> running 10.2.1, CephFS metadata is an ssd pool, data is on spinners. The NFS
>> server is CentOS 7, I've tested with the current shipped kernel (3.10),
>> ELrepo 4.4 and ELrepo 4.6.
>> 
>> More info:
>> 
>> With the kernel driver, I mount the filesystem with "-o name=admin,secret"
>> 
>> I've exported a folder with the following options:
>> 
>> *(rw,root_squash,sync,wdelay,no_subtree_check,fsid=1244,sec=1)
>> 
>> I then mount the folder on a CentOS 6 client with the following options (all
>> default):
>> 
>> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.231,mountvers=3,mountport=597,mountproto=udp,local_lock=none
>> 
>> A small 4k write is taking 3 - 4 secs:
>> 
>> # time dd if=/dev/zero of=testfile bs=4k count=1
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 3.59678 s, 1.1 kB/s
>> 
>> real    0m3.624s
>> user    0m0.000s
>> sys     0m0.001s
>> 
>> But a sync write on the sever directly into the same folder is fast (this is
>> with the kernel driver):
>> 
>> # time dd if=/dev/zero of=testfile2 bs=4k count=1 conv=fdatasync
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 0.0121925 s, 336 kB/s
> 
> 
> Your nfs export has sync option. 'dd if=/dev/zero of=testfile bs=4k
> count=1' on nfs client is equivalent to 'dd if=/dev/zero of=testfile
> bs=4k count=1 conv=fsync' on cephfs. The reason that sync metadata
> operation takes 3~4 seconds is that the MDS flushes its journal every
> 5 seconds.  Adding async option to nfs export can avoid this delay.
> 
>> 
>> real    0m0.015s
>> user    0m0.000s
>> sys     0m0.002s
>> 
>> If I mount cephfs with Fuse instead of the kernel, the NFS client write is
>> fast:
>> 
>> dd if=/dev/zero of=fuse01 bs=4k count=1
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 0.026078 s, 157 kB/s
>> 
> 
> In this case, ceph-fuse sends an extra request (getattr request on
> directory) to MDS. The request causes MDS to flush its journal.
> Whether or not client sends the extra request depends on what
> capabilities it has.  What capabilities client has, in turn, depend on
> how many clients are accessing the directory. In my test, nfs on
> ceph-fuse is not always fast.
> 
> Yan, Zheng
> 
> 
>> Does anyone know what's going on here?
> 
> 
> 
>> 
>> Thanks
>> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Required maintenance for upgraded CephFS filesystems

2016-06-03 Thread John Spray
Hi,

If you do not have a CephFS filesystem that was created with a Ceph
version older than Firefly, then you can ignore this message.

If you have such a filesystem, you need to run a special command at
some point while you are using Jewel, but before upgrading to future
versions.  Please see the documentation here:
http://docs.ceph.com/docs/jewel/cephfs/upgrading/

In Kraken, we are removing all the code that handled legacy TMAP
objects, so this is something you need to take care of during the
Jewel lifetime.
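
For convenience, the command documented on that page is (from memory -- please
treat the linked docs as authoritative) run against the metadata pool:

  cephfs-data-scan tmap_upgrade <metadata pool name>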

Thanks,
John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount error 5 = Input/output error (kernel driver)

2016-06-03 Thread John Spray
On Mon, May 30, 2016 at 8:33 PM, Ilya Dryomov  wrote:
> On Mon, May 30, 2016 at 4:12 PM, Jens Offenbach  wrote:
>> Hallo,
>> in my OpenStack Mitaka, I have installed the additional service "Manila" 
>> with a CephFS backend. Everything is working. All shares are created 
>> successfully:
>>
>> manila show 9dd24065-97fb-4bcd-9ad1-ca63d40bf3a8
>> +---------------------+---------------------------------------------------
>> | Property            | Value
>> +---------------------+---------------------------------------------------
>> | status              | available
>> | share_type_name     | cephfs
>> | description         | None
>> | availability_zone   | nova
>> | share_network_id    | None
>> | export_locations    |
>> |                     | path = 10.152.132.71:6789,10.152.132.72:6789,10.152.132.73:6789:/volumes/_nogroup/b27ad01a-245f-49e2-8974-1ed0ce8e259e
>> |                     | preferred = False
>> |                     | is_admin_only = False
>> |                     | id = 9b7d7e9e-d661-4fa0-89d7-9727efb75554
>> |                     | share_instance_id = b27ad01a-245f-49e2-8974-1ed0ce8e259e
>> | share_server_id     | None
>> | host                | os-sharedfs@cephfs#cephfs
>> | access_rules_status | active
>> | snapshot_id         | None
>> | is_public           | False
>> | task_state          | None
>> | snapshot_support    | True
>> | id                  | 9dd24065-97fb-4bcd-9ad1-ca63d40bf3a8
>> | size                | 1
>> | name                | cephshare1
>> | share_type          | 2a62fda4-82ce-4798-9a85-c800736b01e5
>> | has_replicas        | False
>> | replication_type    | None
>> | created_at          | 2016-05-30T13:09:11.00
>> | share_proto         | CEPHFS
>> | consistenc

Re: [ceph-users] Problems with Calamari setup

2016-06-03 Thread fridifree
I'll check it out
Thank you
On Jun 2, 2016 11:46 PM, "Michael Kuriger"  wrote:

> For me, this same issue was caused by having too new a version of salt.
> I’m running salt-2014.1.5-1 in centos 7.2, so yours will probably be
> different.  But I thought it was worth mentioning.
>
>
>
>
>
>
>
>
> Michael Kuriger
> Sr. Unix Systems Engineer
> mk7...@yp.com | 818-649-7235
>
>
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *fridifree
> *Sent:* Wednesday, June 01, 2016 6:00 AM
> *To:* Ceph Users
> *Subject:* [ceph-users] Problems with Calamari setup
>
>
>
> *Hello, Everyone. *
>
>
>
> I'm trying to install a Calamari server in my organisation and I'm
> encountering some problems.
>
>
>
> I have a small dev environment, just 4 OSD nodes and 5 monitors (one of
> them is also the RADOS GW). We chose to use Ubuntu 14.04 LTS for all our
> servers. The Calamari server is provisioned by VMware for now, the rest of
> the servers are physical.
>
>
>
> The packages' versions are as follows:
>
> - calamari-server - 1.3.1.1-1trusty
>
> - calamari-client - 1.3.1.1-1trusty
>
> - salt - 0.7.15
>
> - diamond - 3.4.67
>
>
>
> I used the Calamari Survival Guide, but without the 'build' part.
>
>
>
> The problem is that I've managed to install the server and the web page, but
> the Calamari server doesn't recognize the cluster. It does show the OSD
> nodes connected to it, but no cluster, even though one exists.
>
>
>
> Also, the output of the "salt '*' ceph.get_heartbeats" command seems to
> look fine, as does the Cthulhu log (but maybe I'm looking for the wrong thing).
> Re-installing the cluster is *not* an option; we want to connect Calamari
> to the cluster as it is, without hurting the Ceph cluster.
>
>
>
> Thanks so much!
>
>
>
> *Jacob Goldenberg, *
>
> *Israel. *
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?

2016-06-03 Thread Francois Lafont
Hi,

On 02/06/2016 04:44, Francois Lafont wrote:

> ~# grep ceph /etc/fstab
> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ 
> /mnt/ fuse.ceph noatime,nonempty,defaults,_netdev 0 0

[...]

> And I have rebooted. After the reboot, big surprise with this:
> 
> ~# cat /tmp/mount.fuse.ceph.log 
> arguments are 
> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint= 
> /mnt -o rw,_netdev,noatime,nonempty
> ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring 
> --client_mountpoint= /mnt -o rw,noatime,nonempty
> 
> Yes, this is not a misprint, there is no "/" after "client_mountpoint=".

[...]

> Now, my question is: which program gives the arguments to 
> /sbin/mount.fuse.ceph?
> Is it the init program (upstart in my case)? Or does it concern a Ceph 
> programs?

I have definitely found the culprit. In fact, it is not Upstart. It's
"/sbin/mountall" (from the "mountall" package), which is used by Upstart to
mount the filesystems listed in fstab. In the source code "src/mountall.c",
there is a line which wrongly removes the trailing "/" from my valid fstab line:

id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ 
/mnt/ fuse.ceph ...

I have made a bug report here where all is explained:
https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/1588594

It could be good to know it (I lost 1/2 day with this bug ;)).
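
Until that is fixed, a possible workaround (untested sketch, reusing the same
id/keyring as above) is to bypass mountall and call ceph-fuse directly, e.g.
from rc.local or a small upstart job:

  ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring \
            --client_mountpoint=/ /mnt -o rw,noatime,nonempty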

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel upgrade and sortbitwise

2016-06-03 Thread Francois Lafont
Hi, 

On 03/06/2016 05:39, Samuel Just wrote:

> Due to http://tracker.ceph.com/issues/16113, it would be best to avoid
> setting the sortbitwise flag on jewel clusters upgraded from previous
> versions until we get a point release out with a fix.
> 
> The symptom is that setting the sortbitwise flag on a jewel cluster
> upgraded from a previous version can result in some pgs reporting
> spurious unfound objects.  Unsetting sortbitwise should cause the PGs
> to go back to normal.  Clusters created at jewel don't need to worry
> about this.

Now, I have an Infernalis cluster in production. It's an Infernalis cluster
installed from scratch (not from an upgrade). I intend to upgrade the
cluster to Jewel. Indeed, I have noticed that the flag "sortbitwise" was
set by default in my Infernalis cluster. By the way, I don't know exactly
the meaning of this flag but the cluster is HEALTH_OK with this flag set
by default so I have not changed it.

If I have understood correctly, to upgrade my Infernalis cluster, I have 2
options:

a) I unset the flag "sortbitwise" via "ceph osd unset sortbitwise", then
I upgrade the cluster to Jewel 10.2.1 and in the next release of Jewel
(I guess 10.2.2) I could set again the flag via "ceph osd set sortbitwise".

b) Or I just wait for the next release of Jewel (10.2.2) without worrying
about the flag "sortbitwise".

1. Is it correct?
2. Can we have data movement when we toggle the flag "sortbitwise"?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what does the 'rbd watch ' mean?

2016-06-03 Thread Jason Dillaman
That command is used for debugging to show the notifications sent by librbd
whenever image properties change.  These notifications are used by other
librbd clients with the same image open to synchronize state (e.g. a
snapshot was created so instruct the other librbd client to refresh the
image's header).
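
A quick way to see it in action (the pool/image names here are just examples):
run the watch in one terminal and change the image from a second one, e.g.:

  rbd watch rbd/myimage            # terminal 1: prints a line per notification
  rbd snap create rbd/myimage@s1   # terminal 2: triggers a notification on the watched image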

On Fri, Jun 3, 2016 at 2:56 AM, dingx...@hotmail.com 
wrote:

>
> everyone:
>
>  hi
>
>I am writing a doc for the rbd command. When I use the command "rbd
> watch " it can only display as follows:
>
> [screenshot of 'rbd watch' output omitted]
>
>
>
> When I create a snap, delete a snap, lock the image, protect a snap, or
> unprotect a snap, it changes like this:
>
>
> So I do not know how to use this command, or what this command is
> monitoring!
>
> Please give me some help.
>
>thanks
> --
> dingx...@hotmail.com
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-06-03 Thread David
Zheng, thanks for looking into this, it makes sense although strangely I've
set up a new nfs server (different hardware, same OS, Kernel etc.) and I'm
unable to recreate the issue. I'm no longer getting the delay, the nfs
export is still using sync. I'm now comparing the servers to see what's
different on the original server. Apologies if I've wasted your time on
this!

Jan, I did some more testing with Fuse on the original server and I was
seeing the same issue, yes I was testing from the nfs client. As above I
think there was something weird with that original server. Noted on sync vs
async, I plan on sticking with sync.

On Fri, Jun 3, 2016 at 5:03 AM, Yan, Zheng  wrote:

> On Mon, May 30, 2016 at 10:29 PM, David  wrote:
> > Hi All
> >
> > I'm having an issue with slow writes over NFS (v3) when cephfs is mounted
> > with the kernel driver. Writing a single 4K file from the NFS client is
> > taking 3 - 4 seconds, however a 4K write (with sync) into the same
> folder on
> > the server is fast as you would expect. When mounted with ceph-fuse, I
> don't
> > get this issue on the NFS client.
> >
> > Test environment is a small cluster with a single MON and single MDS, all
> > running 10.2.1, CephFS metadata is an ssd pool, data is on spinners. The
> NFS
> > server is CentOS 7, I've tested with the current shipped kernel (3.10),
> > ELrepo 4.4 and ELrepo 4.6.
> >
> > More info:
> >
> > With the kernel driver, I mount the filesystem with "-o
> name=admin,secret"
> >
> > I've exported a folder with the following options:
> >
> > *(rw,root_squash,sync,wdelay,no_subtree_check,fsid=1244,sec=1)
> >
> > I then mount the folder on a CentOS 6 client with the following options
> (all
> > default):
> >
> >
> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.231,mountvers=3,mountport=597,mountproto=udp,local_lock=none
> >
> > A small 4k write is taking 3 - 4 secs:
> >
> >  # time dd if=/dev/zero of=testfile bs=4k count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 3.59678 s, 1.1 kB/s
> >
> > real    0m3.624s
> > user    0m0.000s
> > sys     0m0.001s
> >
> > But a sync write on the sever directly into the same folder is fast
> (this is
> > with the kernel driver):
> >
> > # time dd if=/dev/zero of=testfile2 bs=4k count=1 conv=fdatasync
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0121925 s, 336 kB/s
>
>
> Your nfs export has sync option. 'dd if=/dev/zero of=testfile bs=4k
> count=1' on nfs client is equivalent to 'dd if=/dev/zero of=testfile
> bs=4k count=1 conv=fsync' on cephfs. The reason that sync metadata
> operation takes 3~4 seconds is that the MDS flushes its journal every
> 5 seconds.  Adding async option to nfs export can avoid this delay.
>
> >
> > real    0m0.015s
> > user    0m0.000s
> > sys     0m0.002s
> >
> > If I mount cephfs with Fuse instead of the kernel, the NFS client write
> is
> > fast:
> >
> > dd if=/dev/zero of=fuse01 bs=4k count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.026078 s, 157 kB/s
> >
>
> In this case, ceph-fuse sends an extra request (getattr request on
> directory) to MDS. The request causes MDS to flush its journal.
> Whether or not client sends the extra request depends on what
> capabilities it has.  What capabilities client has, in turn, depend on
> how many clients are accessing the directory. In my test, nfs on
> ceph-fuse is not always fast.
>
> Yan, Zheng
>
>
> > Does anyone know what's going on here?
>
>
>
> >
> > Thanks
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel upgrade and sortbitwise

2016-06-03 Thread Samuel Just
Sorry, I should have been more clear.  The bug is actually due to a
difference in an on-disk encoding from hammer.  An infernalis cluster would
never have had such encodings and is fine.
-Sam
On Jun 3, 2016 6:53 AM, "Francois Lafont"  wrote:

> Hi,
>
> On 03/06/2016 05:39, Samuel Just wrote:
>
> > Due to http://tracker.ceph.com/issues/16113, it would be best to avoid
> > setting the sortbitwise flag on jewel clusters upgraded from previous
> > versions until we get a point release out with a fix.
> >
> > The symptom is that setting the sortbitwise flag on a jewel cluster
> > upgraded from a previous version can result in some pgs reporting
> > spurious unfound objects.  Unsetting sortbitwise should cause the PGs
> > to go back to normal.  Clusters created at jewel don't need to worry
> > about this.
>
> Now, I have an Infernalis cluster in production. It's an Infernalis cluster
> installed from scratch (not from an upgrade). I intend to upgrade the
> cluster to Jewel. Indeed, I have noticed that the flag "sortbitwise" was
> set by default in my Infernalis cluster. By the way, I don't know exactly
> the meaning of this flag but the cluster is HEALTH_OK with this flag set
> by default so I have not changed it.
>
> If I have well understood, to upgrade my Infernalis cluster, I have 2
> options:
>
> a) I unset the flag "sortbitwise" via "ceph osd unset sortbitwise", then
> I upgrade the cluster to Jewel 10.2.1 and in the next release of Jewel
> (I guess 10.2.2) I could set again the flag via "ceph osd set sortbitwise".
>
> b) Or I just wait for the next release of Jewel (10.2.2) without worrying
> about the flag "sortbitwise".
>
> 1. Is it correct?
> 2. Can we have data movement when we toggle the flag "sortbitwise"?
>
> --
> François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-06-03 Thread Jan Schermer
I'd be worried about it getting "fast" all of a sudden. Test crash consistency.
If you test something like file creation you should be able to estimate whether
it should be that fast. (It should be some fraction of the theoretical IOPS of
the drives/backing rbd device...)

If it's too fast then maybe the "sync" isn't working properly...
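
A rough sanity check (mount point and write count below are arbitrary) is to
measure per-write sync latency from the NFS client and compare it with what the
backing disks can plausibly deliver:

  # ~1000 small synchronous writes; divide the elapsed time by 1000 for per-write latency
  dd if=/dev/zero of=/mnt/nfs/synctest bs=4k count=1000 oflag=dsync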

Jan

> On 03 Jun 2016, at 16:26, David  wrote:
> 
> Zheng, thanks for looking into this, it makes sense although strangely I've 
> set up a new nfs server (different hardware, same OS, Kernel etc.) and I'm 
> unable to recreate the issue. I'm no longer getting the delay, the nfs export 
> is still using sync. I'm now comparing the servers to see what's different on 
> the original server. Apologies if I've wasted your time on this!
> 
> Jan, I did some more testing with Fuse on the original server and I was 
> seeing the same issue, yes I was testing from the nfs client. As above I 
> think there was something weird with that original server. Noted on sync vs 
> async, I plan on sticking with sync.
> 
On Fri, Jun 3, 2016 at 5:03 AM, Yan, Zheng wrote:
> On Mon, May 30, 2016 at 10:29 PM, David wrote:
> > Hi All
> >
> > I'm having an issue with slow writes over NFS (v3) when cephfs is mounted
> > with the kernel driver. Writing a single 4K file from the NFS client is
> > taking 3 - 4 seconds, however a 4K write (with sync) into the same folder on
> > the server is fast as you would expect. When mounted with ceph-fuse, I don't
> > get this issue on the NFS client.
> >
> > Test environment is a small cluster with a single MON and single MDS, all
> > running 10.2.1, CephFS metadata is an ssd pool, data is on spinners. The NFS
> > server is CentOS 7, I've tested with the current shipped kernel (3.10),
> > ELrepo 4.4 and ELrepo 4.6.
> >
> > More info:
> >
> > With the kernel driver, I mount the filesystem with "-o name=admin,secret"
> >
> > I've exported a folder with the following options:
> >
> > *(rw,root_squash,sync,wdelay,no_subtree_check,fsid=1244,sec=1)
> >
> > I then mount the folder on a CentOS 6 client with the following options (all
> > default):
> >
> > rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.231,mountvers=3,mountport=597,mountproto=udp,local_lock=none
> >
> > A small 4k write is taking 3 - 4 secs:
> >
> >  # time dd if=/dev/zero of=testfile bs=4k count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 3.59678 s, 1.1 kB/s
> >
> > real    0m3.624s
> > user    0m0.000s
> > sys     0m0.001s
> >
> > But a sync write on the sever directly into the same folder is fast (this is
> > with the kernel driver):
> >
> > # time dd if=/dev/zero of=testfile2 bs=4k count=1 conv=fdatasync
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0121925 s, 336 kB/s
> 
> 
> Your nfs export has sync option. 'dd if=/dev/zero of=testfile bs=4k
> count=1' on nfs client is equivalent to 'dd if=/dev/zero of=testfile
> bs=4k count=1 conv=fsync' on cephfs. The reason that sync metadata
> operation takes 3~4 seconds is that the MDS flushes its journal every
> 5 seconds.  Adding async option to nfs export can avoid this delay.
> 
> >
> > real    0m0.015s
> > user    0m0.000s
> > sys     0m0.002s
> >
> > If I mount cephfs with Fuse instead of the kernel, the NFS client write is
> > fast:
> >
> > dd if=/dev/zero of=fuse01 bs=4k count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.026078 s, 157 kB/s
> >
> 
> In this case, ceph-fuse sends an extra request (getattr request on
> directory) to MDS. The request causes MDS to flush its journal.
> Whether or not client sends the extra request depends on what
> capabilities it has.  What capabilities client has, in turn, depend on
> how many clients are accessing the directory. In my test, nfs on
> ceph-fuse is not always fast.
> 
> Yan, Zheng
> 
> 
> > Does anyone know what's going on here?
> 
> 
> 
> >
> > Thanks
> >
> >

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS in the wild

2016-06-03 Thread David
I'm hoping to implement cephfs in production at some point this year so I'd
be interested to hear your progress on this.

Have you considered SSD for your metadata pool? You wouldn't need loads of
capacity although even with reliable SSD I'd probably still do x3
replication for metadata. I've been looking at the intel s3610's for this.
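
If you go that route, a minimal sketch of the pool settings (the pool name and
CRUSH rule id are assumptions, using the jewel-era syntax) would be something like:

  ceph osd pool set cephfs_metadata size 3                       # 3x replication for metadata
  ceph osd pool set cephfs_metadata crush_ruleset <ssd-rule-id>  # pin the pool to an SSD-only CRUSH rule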



On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz  wrote:

> Question:
> I'm curious if there is anybody else out there running CephFS at the scale
> I'm planning for. I'd like to know some of the issues you didn't expect
> that I should be looking out for. I'd also like to simply see when CephFS
> hasn't worked out and why. Basically, give me your war stories.
>
>
> Problem Details:
> Now that I'm out of my design phase and finished testing on VMs, I'm ready
> to drop $100k on a pilot. I'd like to get some sense of confidence from the
> community that this is going to work before I pull the trigger.
>
> I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> CephFS by this time next year (hopefully by December). My workload is a mix
> of small and very large files (100GB+ in size). We do fMRI analysis on
> DICOM image sets as well as other physio data collected from subjects. We
> also have plenty of spreadsheets, scripts, etc. Currently 90% of our
> analysis is I/O bound and generally sequential.
>
> In deploying Ceph, I am hoping to see more throughput than the 7320 can
> currently provide. I'm also looking to get away from traditional
> file-systems that require forklift upgrades. That's where Ceph really
> shines for us.
>
> I don't have a total file count, but I do know that we have about 500k
> directories.
>
>
> Planned Architecture:
>
> Storage Interconnect:
> Brocade VDX 6940 (40 gig)
>
> Access Switches for clients (servers):
> Brocade VDX 6740 (10 gig)
>
> Access Switches for clients (workstations):
> Brocade ICX 7450
>
> 3x MON:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
> 2x MDS:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB (is this necessary?)
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
> 8x OSD:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for Journals
> 24x 6TB Enterprise SATA
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Required maintenance for upgraded CephFS filesystems

2016-06-03 Thread Scottix
Is there anyway to check what it is currently using?

Best,
Scott

On Fri, Jun 3, 2016 at 4:26 AM John Spray  wrote:

> Hi,
>
> If you do not have a CephFS filesystem that was created with a Ceph
> version older than Firefly, then you can ignore this message.
>
> If you have such a filesystem, you need to run a special command at
> some point while you are using Jewel, but before upgrading to future
> versions.  Please see the documentation here:
> http://docs.ceph.com/docs/jewel/cephfs/upgrading/
>
> In Kraken, we are removing all the code that handled legacy TMAP
> objects, so this is something you need to take care of during the
> Jewel lifetime.
>
> Thanks,
> John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-03 Thread Adam Tygart
Is there any way we could have a "leveldb_defrag_on_mount" option for
the osds similar to the "leveldb_compact_on_mount" option?

Also, I've got at least one user that is creating and deleting
thousands of files at a time in some of their directories (keeping
1-2% of them). Could that cause this fragmentation that we think is
the issue?
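
For reference, the existing compaction option I mentioned can be enabled per OSD
in ceph.conf -- at least that is my understanding of how the option is picked up,
so treat this as an assumption rather than verified advice:

  [osd]
  leveldb compact on mount = true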
--
Adam

On Thu, Jun 2, 2016 at 10:32 PM, Adam Tygart  wrote:
> I'm still exporting pgs out of some of the downed osds, but things are
> definitely looking promising.
>
> Marginally related to this thread, as these seem to be most of the
> hanging objects when exporting pgs, what are inodes in the 600 range
> used for within the metadata pool? I know the 200 range is used for
> journaling. 8 of the 13 osds I've got left down are currently trying
> to export objects in the 600 range. Are these just MDS journal objects
> from an mds severely behind on trimming?
>
> --
> Adam
>
> On Thu, Jun 2, 2016 at 6:10 PM, Brad Hubbard  wrote:
>> On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
>>  wrote:
>>
>>> The only way that I was able to get back to Health_OK was to export/import. 
>>>  * Please note, any time you use the ceph_objectstore_tool you risk 
>>> data loss if not done carefully.   Never remove a PG until you have a known 
>>> good export *
>>>
>>> Here are the steps I used:
>>>
>>> 1. set NOOUT, NO BACKFILL
>>> 2. Stop the OSD's that have the erroring PG
>>> 3. Flush the journal and export the primary version of the PG.  This took 1 
>>> minute on a well-behaved PG and 4 hours on the misbehaving PG
>>>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 
>>> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export 
>>> --file /root/32.10c.b.export
>>>
>>> 4. Import the PG into a New / Temporary OSD that is also offline,
>>>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 
>>> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export 
>>> --file /root/32.10c.b.export
>>
>> This should be an import op and presumably to a different data path
>> and journal path more like the following?
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
>> --journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
>> import --file /root/32.10c.b.export
>>
>> Just trying to clarify for anyone that comes across this thread in the 
>> future.
>>
>> Cheers,
>> Brad
>>
>>>
>>> 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your case 
>>> it looks like)
>>> 6. Start cluster OSD's
>>> 7. Start the temporary OSD's and ensure 32.10c backfills correctly to the 3 
>>> OSD's it is supposed to be on.
>>>
>>> This is similar to the recovery process described in this post from 
>>> 04/09/2015: 
>>> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>>>Hopefully it works in your case too and you can get the cluster back to a
>>> state where you can make the CephFS directories smaller.
>>>
>>> - Brandon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Required maintenance for upgraded CephFS filesystems

2016-06-03 Thread John Spray
On Fri, Jun 3, 2016 at 4:49 PM, Scottix  wrote:
> Is there anyway to check what it is currently using?

Since Firefly, the MDS rewrites TMAPs to OMAPs whenever a directory is
updated, so a pre-firefly filesystem might already be all OMAPs, or
might still have some TMAPs -- there's no way to know without scanning
the whole system.

If you think you might have used a pre-firefly version to create
the filesystem, then run the tool: if there aren't any TMAPs in the
system it'll be a no-op.

John

> Best,
> Scott
>
> On Fri, Jun 3, 2016 at 4:26 AM John Spray  wrote:
>>
>> Hi,
>>
>> If you do not have a CephFS filesystem that was created with a Ceph
>> version older than Firefly, then you can ignore this message.
>>
>> If you have such a filesystem, you need to run a special command at
>> some point while you are using Jewel, but before upgrading to future
>> versions.  Please see the documentation here:
>> http://docs.ceph.com/docs/jewel/cephfs/upgrading/
>>
>> In Kraken, we are removing all the code that handled legacy TMAP
>> objects, so this is something you need to take care of during the
>> Jewel lifetime.
>>
>> Thanks,
>> John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-03 Thread Brandon Morris, PMP
Nice catch.  That was a copy-paste error.  Sorry

it should have read:

 3. Flush the journal and export the primary version of the PG.  This took
1 minute on a well-behaved PG and 4 hours on the misbehaving PG
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
--journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
--file /root/32.10c.b.export

  4. Import the PG into a New / Temporary OSD that is also offline,
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
--journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op import
--file /root/32.10c.b.export


On Thu, Jun 2, 2016 at 5:10 PM, Brad Hubbard  wrote:

> On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
>  wrote:
>
> > The only way that I was able to get back to Health_OK was to
> export/import.  * Please note, any time you use the
> ceph_objectstore_tool you risk data loss if not done carefully.   Never
> remove a PG until you have a known good export *
> >
> > Here are the steps I used:
> >
> > 1. set NOOUT, NO BACKFILL
> > 2. Stop the OSD's that have the erroring PG
> > 3. Flush the journal and export the primary version of the PG.  This
> took 1 minute on a well-behaved PG and 4 hours on the misbehaving PG
> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
> >
> > 4. Import the PG into a New / Temporary OSD that is also offline,
> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
>
> This should be an import op and presumably to a different data path
> and journal path more like the following?
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
> --journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
> import --file /root/32.10c.b.export
>
> Just trying to clarify for anyone that comes across this thread in the
> future.
>
> Cheers,
> Brad
>
> >
> > 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your
> case it looks like)
> > 6. Start cluster OSD's
> > 7. Start the temporary OSD's and ensure 32.10c backfills correctly to
> the 3 OSD's it is supposed to be on.
> >
> > This is similar to the recovery process described in this post from
> 04/09/2015:
> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>  Hopefully it works in your case too and you can get the cluster back to a
> state where you can make the CephFS directories smaller.
> >
> > - Brandon
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Required maintenance for upgraded CephFS filesystems

2016-06-03 Thread Scottix
Great thanks.

--Scott

On Fri, Jun 3, 2016 at 8:59 AM John Spray  wrote:

> On Fri, Jun 3, 2016 at 4:49 PM, Scottix  wrote:
> > Is there anyway to check what it is currently using?
>
> Since Firefly, the MDS rewrites TMAPs to OMAPs whenever a directory is
> updated, so a pre-firefly filesystem might already be all OMAPs, or
> might still have some TMAPs -- there's no way to know without scanning
> the whole system.
>
> If you think you might have used a pre-firefly version to create
> the filesystem, then run the tool: if there aren't any TMAPs in the
> system it'll be a no-op.
>
> John
>
> > Best,
> > Scott
> >
> > On Fri, Jun 3, 2016 at 4:26 AM John Spray  wrote:
> >>
> >> Hi,
> >>
> >> If you do not have a CephFS filesystem that was created with a Ceph
> >> version older than Firefly, then you can ignore this message.
> >>
> >> If you have such a filesystem, you need to run a special command at
> >> some point while you are using Jewel, but before upgrading to future
> >> versions.  Please see the documentation here:
> >> http://docs.ceph.com/docs/jewel/cephfs/upgrading/
> >>
> >> In Kraken, we are removing all the code that handled legacy TMAP
> >> objects, so this is something you need to take care of during the
> >> Jewel lifetime.
> >>
> >> Thanks,
> >> John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel upgrade and sortbitwise

2016-06-03 Thread Francois Lafont
Hi,

On 03/06/2016 16:29, Samuel Just wrote:

> Sorry, I should have been more clear. The bug is actually due to a
> difference in an on-disk encoding from hammer. An infernalis cluster would
> never have had such encodings and is fine.

Ah ok, fine. ;)
Thanks for the answer.
Bye.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-03 Thread Adam Tygart
With regards to this export/import process, I've been exporting a pg
from an osd for more than 24 hours now. The entire OSD only has 8.6GB
of data. 3GB of that is in omap. The export for this particular PG is
only 108MB in size right now, after more than 24 hours. How is it
possible that a fragmented database on an ssd capable of 13,000 iops
can be this slow?
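
In case it helps anyone else following along, an offline compaction of an OSD's
omap directory can apparently be attempted with ceph-kvstore-tool (the path below
is the usual FileStore layout; I am writing the invocation from memory, so please
double-check the tool's help output on your version first):

  # stop the OSD first
  ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-16/current/omap compact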

--
Adam

On Fri, Jun 3, 2016 at 11:11 AM, Brandon Morris, PMP
 wrote:
> Nice catch.  That was a copy-paste error.  Sorry
>
> it should have read:
>
>  3. Flush the journal and export the primary version of the PG.  This took 1
> minute on a well-behaved PG and 4 hours on the misbehaving PG
>i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
>
>   4. Import the PG into a New / Temporary OSD that is also offline,
>i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op import
> --file /root/32.10c.b.export
>
>
> On Thu, Jun 2, 2016 at 5:10 PM, Brad Hubbard  wrote:
>>
>> On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
>>  wrote:
>>
>> > The only way that I was able to get back to Health_OK was to
>> > export/import.  * Please note, any time you use the
>> > ceph_objectstore_tool you risk data loss if not done carefully.   Never
>> > remove a PG until you have a known good export *
>> >
>> > Here are the steps I used:
>> >
>> > 1. set NOOUT, NO BACKFILL
>> > 2. Stop the OSD's that have the erroring PG
>> > 3. Flush the journal and export the primary version of the PG.  This
>> > took 1 minute on a well-behaved PG and 4 hours on the misbehaving PG
>> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
>> > --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
>> > --file /root/32.10c.b.export
>> >
>> > 4. Import the PG into a New / Temporary OSD that is also offline,
>> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
>> > --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export
>> > --file /root/32.10c.b.export
>>
>> This should be an import op and presumably to a different data path
>> and journal path more like the following?
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
>> --journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
>> import --file /root/32.10c.b.export
>>
>> Just trying to clarify for anyone that comes across this thread in the
>> future.
>>
>> Cheers,
>> Brad
>>
>> >
>> > 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your
>> > case it looks like)
>> > 6. Start cluster OSD's
>> > 7. Start the temporary OSD's and ensure 32.10c backfills correctly to
>> > the 3 OSD's it is supposed to be on.
>> >
>> > This is similar to the recovery process described in this post from
>> > 04/09/2015:
>> > http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>> > Hopefully it works in your case too and you can get the cluster back to a state
>> > where you can make the CephFS directories smaller.
>> >
>> > - Brandon
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CoreOS Cluster of 7 machines and Ceph

2016-06-03 Thread Michael Shuey
Sorry for the late reply - been traveling.

I'm doing exactly that right now, using the ceph-docker container.
It's just in my test rack for now, but hardware arrived this week to
seed the production version.

I'm using separate containers for each daemon, including a container
for each OSD.  I've got a bit of cloudinit logic to loop over all
disks in a machine, fire off a "prepare" container if the disk isn't
partitioned, then start an "activate" container to bring the OSD up.
Works pretty well; I can power on a new machine, and get a stack of
new OSDs about 5 minutes later.
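
The loop itself is nothing fancy -- roughly the sketch below (the image name,
scenario names and bind mounts are written from memory of ceph-docker, so check
them against its README before using):

  #!/bin/bash
  # crude check: prepare any disk without an existing signature, then activate every OSD disk
  for dev in /dev/sd{b..z}; do
    [ -b "$dev" ] || continue
    if ! blkid "$dev" >/dev/null 2>&1; then
      docker run --rm --privileged -v /dev:/dev -v /etc/ceph:/etc/ceph -v /var/lib/ceph:/var/lib/ceph \
        -e OSD_DEVICE="$dev" ceph/daemon osd_ceph_disk_prepare
    fi
    docker run -d --privileged --pid=host -v /dev:/dev -v /etc/ceph:/etc/ceph -v /var/lib/ceph:/var/lib/ceph \
      -e OSD_DEVICE="$dev" ceph/daemon osd_ceph_disk_activate
  done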

I've opted to not allow ANY containers to run on local disk, and we're
setting up appropriate volume plugins (for NFS, CephFS, and Ceph RBDs)
now.  IN THEORY (so far, so good) volumes will be dynamically mapped
into the container at startup.  This should let us orchestrate
containers with Swarm or Kubernetes, and give us the same volumes
wherever they land.  In a few weeks we'll start experimenting with a
vxlan network plugin as well, to allow similar flexibility with IPs
and subnets.  Once that's done, our registry will become just another
container (though we'll need a master registry, with storage on local
disk SOMEWHERE, to be able to handle a cold-boot of the Ceph
containers).

I'm curious where you're going with your environment.  To me,
Ceph+Docker seems like a nice match; if others are doing this, we
should definitely pool experiences.

--
Mike Shuey


On Thu, May 26, 2016 at 7:00 AM, EnDSgUy EnDSgUy  wrote:
> Hello All,
>
> I am looking for some help to design Ceph for a cluster of 7
> machines running on CoreOS with fleet and docker. I am still thinking
> about what the best way is for the moment.
>
> Has anybody done something similar and could advise on their
> experiences?
>
> The primary purpose is
> - be able to store "docker data containers", and only them, in a
> redundant way (so a specific directory can be mounted into a specific
> container). So Ceph should be available in a container.
> - ideally other types of containers should be running without
> redundancy just on the hard drive
> - docker images (registry) should also be stored in a redundant way
>
> Has anybody done something similar?
>
>
> Dmitry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com