Re: [ceph-users] Recovery stuck after adjusting to recent tunables

2016-07-22 Thread Brad Hubbard
On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas  wrote:
> Hello,
> being in latest Hammer, I think I hit a bug with more recent than
> legacy tunables.
>
> Being in legacy tunables for a while, I decided to experiment with
> "better" tunables. So first I went from argonaut profile to bobtail
> and then to firefly. However, I decided to make the changes on
> chooseleaf_vary_r incrementally (because the remapping from 0 to 5 was
> huge), from 5 down to the best value (1). So when I reached
> chooseleaf_vary_r = 2, I decided to run a simple test before going to
> chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the cluster
> recover. But the recovery never completes and a PG remains stuck,
> reported as undersized+degraded. No OSD is near full and all pools
> have min_size=1.
>
> ceph osd crush show-tunables -f json-pretty
>
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 2,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 22,
> "profile": "unknown",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "require_feature_tunables3": 1,
> "has_v2_rules": 0,
> "has_v3_rules": 0,
> "has_v4_buckets": 0
> }
>
> The really strange thing is that the OSDs of the stuck PG belong to
> other nodes than the one I decided to stop (osd.14).
>
> # ceph pg dump_stuck
> ok
> pg_stat state up up_primary acting acting_primary
> 179.38 active+undersized+degraded [2,8] 2 [2,8] 2

Can you share a query of this pg?

What is the size (not min_size) of this pool? I'm assuming it's 2.
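
Something along these lines should show both (the pool name "rbd" below is just
a placeholder, substitute yours):

ceph pg 179.38 query > pg-179.38-query.txt
ceph osd pool get rbd size
ceph osd pool get rbd min_size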

>
>
> ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 11.19995 root default
> -3 11.19995 rack unknownrack
> -2  0.3 host staging-rd0-03
> 14  0.2 osd.14   up  1.0  1.0
> 15  0.2 osd.15   up  1.0  1.0
> -8  5.19998 host staging-rd0-01
>  6  0.5 osd.6up  1.0  1.0
>  7  0.5 osd.7up  1.0  1.0
>  8  1.0 osd.8up  1.0  1.0
>  9  1.0 osd.9up  1.0  1.0
> 10  1.0 osd.10   up  1.0  1.0
> 11  1.0 osd.11   up  1.0  1.0
> -7  5.19998 host staging-rd0-00
>  0  0.5 osd.0up  1.0  1.0
>  1  0.5 osd.1up  1.0  1.0
>  2  1.0 osd.2up  1.0  1.0
>  3  1.0 osd.3up  1.0  1.0
>  4  1.0 osd.4up  1.0  1.0
>  5  1.0 osd.5up  1.0  1.0
> -4  0.3 host staging-rd0-02
> 12  0.2 osd.12   up  1.0  1.0
> 13  0.2 osd.13   up  1.0  1.0
>
>
> Have you experienced something similar?
>
> Regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Try to install ceph hammer on CentOS7

2016-07-22 Thread Brad Hubbard
On Sat, Jul 23, 2016 at 1:41 AM, Ruben Kerkhof  wrote:
> Please keep the mailing list on the CC.
>
> On Fri, Jul 22, 2016 at 3:40 PM, Manuel Lausch  wrote:
>> oh. This was a copy failure.
>> Of course I checked my config again. Some other variations of the
>> configuration didn't help either.
>>
>> Finally I put the ceph-0.94.7-0.el7.x86_64.rpm into a directory and created
>> the necessary repository index files with createrepo. Even with this as a
>> repository the ceph package is not visible. Other packages in the repository
>> work fine.
>>
>> If I try to install the package with yum install
>> ~/ceph-0.94.7-0.el7.x86_64.rpm, the installation, including the dependencies,
>> is successful.
>>
>> My knowledge of rpm and yum is not as deep as it should be, so I don't know
>> how to debug further.
>
> What does yum repolist show?

This is good advice.

I'd also advise running "yum clean all" before proceeding once you
have confirmed everything is configured correctly.
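
For example, a minimal check sequence (assuming the repo id in your ceph.repo is
"ceph"):

yum clean all
yum repolist -v ceph
yum --disablerepo='*' --enablerepo=ceph list available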

HTH,
Brad

> It looks like the ceph-noarch repo is ok, the ceph repo isn't.
>
>>
>> Regards,
>> Manuel
>
> Regards,
>
> Ruben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Samba VFS RHEL packages

2016-07-22 Thread Blair Bethwaite
Hi Brett,

Don't think so thanks, but we'll shout if we find any breakage/weirdness.

Cheers,

On 23 July 2016 at 05:42, Brett Niver  wrote:
> So Blair,
>
> As far as you know right now, you don't need anything from the CephFS team,
> correct?
> Thanks,
> Brett
>
>
> On Fri, Jul 22, 2016 at 2:18 AM, Yan, Zheng  wrote:
>>
>> On Fri, Jul 22, 2016 at 11:15 AM, Blair Bethwaite
>>  wrote:
>> > Thanks Zheng,
>> >
>> > On 22 July 2016 at 12:12, Yan, Zheng  wrote:
>> >> We actively back-port fixes to the RHEL 7.x kernel.  When RHCS 2.0 is released,
>> >> the RHEL kernel should contain fixes up to the 3.7 upstream kernel.
>> >
>> > You meant 4.7 right?
>>
>> I mean 4.7. sorry
>>
>> >
>> > --
>> > Cheers,
>> > ~Blairo
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph performance calculator

2016-07-22 Thread EP Komarla
Team,

Have a performance related question on Ceph.

I know the performance of a Ceph cluster depends on many factors: the type of 
storage servers, processors (number of processors, raw performance per processor), 
memory, network links, type of disks, journal disks, etc.  On top of the 
hardware, it is also influenced by the type of operation you are doing, such as 
seqRead, seqWrite, block size, and so on.

Today one way we demonstrate performance is with benchmarks and test 
configurations.  As a result, it is difficult to compare performance without 
understanding the underlying system and the use cases.

Now to my question: is there a Ceph performance calculator that takes 
all (or some) of these factors and gives an estimate of the performance you 
can expect for different scenarios?  I was asked this question and didn't know 
how to answer it, so I thought I would check with the wider user group to 
see if someone is aware of such a tool or knows how to do this calculation.  
Any pointers will be appreciated.

Thanks,

- epk

Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Samba VFS RHEL packages

2016-07-22 Thread Brett Niver
So Blair,

As far as you know right now, you don't need anything from the CephFS team,
correct?
Thanks,
Brett


On Fri, Jul 22, 2016 at 2:18 AM, Yan, Zheng  wrote:

> On Fri, Jul 22, 2016 at 11:15 AM, Blair Bethwaite
>  wrote:
> > Thanks Zheng,
> >
> > On 22 July 2016 at 12:12, Yan, Zheng  wrote:
> >> We actively back-port fixes to the RHEL 7.x kernel.  When RHCS 2.0 is released,
> >> the RHEL kernel should contain fixes up to the 3.7 upstream kernel.
> >
> > You meant 4.7 right?
>
> I mean 4.7. sorry
>
> >
> > --
> > Cheers,
> > ~Blairo
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Try to install ceph hammer on CentOS7

2016-07-22 Thread Ruben Kerkhof
Please keep the mailing list on the CC.

On Fri, Jul 22, 2016 at 3:40 PM, Manuel Lausch  wrote:
> oh. This was a copy failure.
> Of course I checked my config again. Some other variations of the
> configuration didn't help either.
>
> Finally I put the ceph-0.94.7-0.el7.x86_64.rpm into a directory and created
> the necessary repository index files with createrepo. Even with this as a
> repository the ceph package is not visible. Other packages in the repository
> work fine.
>
> If I try to install the package with yum install
> ~/ceph-0.94.7-0.el7.x86_64.rpm, the installation, including the dependencies,
> is successful.
>
> My knowledge of rpm and yum is not as deep as it should be, so I don't know
> how to debug further.

What does yum repolist show?
It looks like the ceph-noarch repo is ok, the ceph repo isn't.

>
> Regards,
> Manuel

Regards,

Ruben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Try to install ceph hammer on CentOS7

2016-07-22 Thread Gaurav Goyal
It should be a smooth installation. I have recently installed Hammer on
CentOS 7.


Regards
Gaurav Goyal

On Fri, Jul 22, 2016 at 7:22 AM, Ruben Kerkhof 
wrote:

> On Thu, Jul 21, 2016 at 7:26 PM, Manuel Lausch 
> wrote:
> > Hi,
>
> Hi,
> >
> > I try to install ceph hammer on centos7 but something with the RPM
> > Repository seems to be wrong.
> >
> > In my yum.repos.d/ceph.repo file I have the following configuration:
> >
> > [ceph]
> > name=Ceph packages for $basearch
> > baseurl=baseurl=http://download.ceph.com/rpm-hammer/el7/$basearch
>
> There's your issue. Remove the second baseurl=
>
> Kind regards,
>
> Ruben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-22 Thread Anthony D'Atri
> FWIW, the xfs -n size=64k option is probably not a good idea. 

Agreed, moreover it’s a really bad idea.  You get memory allocation slowdowns 
as described in the linked post, and eventually the OSD dies.

It can be mitigated somewhat by periodically (say every 2 hours, YMMV) flushing 
the system's buffer cache to effectively defragment memory, or possibly by 
raising vm.min_free_kbytes to a value large enough for your amount of physical 
memory.
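
For example (the values below are purely illustrative, not recommendations; test
on your own hardware first):

# reserve more free memory for large/atomic allocations, sized to your physmem
sysctl -w vm.min_free_kbytes=4194304
# or drop the page cache and slab caches periodically (e.g. from cron)
echo 3 > /proc/sys/vm/drop_caches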

For sure, the only real cure is rebuilding all OSDs with the defaults.

— Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 15:13
To: n...@fisk.me.uk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 14:10, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 11:19
To: n...@fisk.me.uk  ; 'Jake Young'  
 ; 'Jan Schermer'  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 11:48, Nick Fisk a écrit :

 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk  ; 'Jake Young'  
 ; 'Jan Schermer'  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 10:23, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk  ; 'Jake Young'  
 ; 'Jan Schermer'  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young   ; Jan Schermer  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer  > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie   > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses scsi persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 14:10, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 11:19
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 11:48, Nick Fisk a écrit :

*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk ; 'Jake Young'
 ; 'Jan Schermer'
 
*Cc:* ceph-users@lists.ceph.com 
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk ; 'Jake Young'
 ; 'Jan Schermer'
 
*Cc:* ceph-users@lists.ceph.com 
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users
[mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of
*Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
; Jan Schermer 

*Cc:* ceph-users@lists.ceph.com

*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer
> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
>
wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with
VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non
HA support though.
>
>>
>> Knowing that HA iSCSI target was on the
roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS
targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses scsi persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the
initial active node is
> fenced off and IO stuck on that node will never
be executed. So do
> something like add it to the ceph 

[ceph-users] Recovery stuck after adjusting to recent tunables

2016-07-22 Thread Kostis Fardelas
Hello,
being in latest Hammer, I think I hit a bug with more recent than
legacy tunables.

Being in legacy tunables for a while, I decided to experiment with
"better" tunables. So first I went from argonaut profile to bobtail
and then to firefly. However, I decided to make the changes on
chooseleaf_vary_r incrementally (because the remapping from 0 to 5 was
huge), from 5 down to the best value (1). So when I reached
chooseleaf_vary_r = 2, I decided to run a simple test before going to
chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the cluster
recover. But the recovery never completes and a PG remains stuck,
reported as undersized+degraded. No OSD is near full and all pools
have min_size=1.
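
For reference, one way to make each incremental chooseleaf_vary_r change is to
edit the decompiled crushmap, roughly like this (file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit the "tunable chooseleaf_vary_r N" line to the next value, then:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new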

ceph osd crush show-tunables -f json-pretty

{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 2,
"straw_calc_version": 1,
"allowed_bucket_algs": 22,
"profile": "unknown",
"optimal_tunables": 0,
"legacy_tunables": 0,
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"require_feature_tunables3": 1,
"has_v2_rules": 0,
"has_v3_rules": 0,
"has_v4_buckets": 0
}

The really strange thing is that the OSDs of the stuck PG belong to
other nodes than the one I decided to stop (osd.14).

# ceph pg dump_stuck
ok
pg_stat state up up_primary acting acting_primary
179.38 active+undersized+degraded [2,8] 2 [2,8] 2


ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 11.19995 root default
-3 11.19995 rack unknownrack
-2  0.3 host staging-rd0-03
14  0.2 osd.14   up  1.0  1.0
15  0.2 osd.15   up  1.0  1.0
-8  5.19998 host staging-rd0-01
 6  0.5 osd.6up  1.0  1.0
 7  0.5 osd.7up  1.0  1.0
 8  1.0 osd.8up  1.0  1.0
 9  1.0 osd.9up  1.0  1.0
10  1.0 osd.10   up  1.0  1.0
11  1.0 osd.11   up  1.0  1.0
-7  5.19998 host staging-rd0-00
 0  0.5 osd.0up  1.0  1.0
 1  0.5 osd.1up  1.0  1.0
 2  1.0 osd.2up  1.0  1.0
 3  1.0 osd.3up  1.0  1.0
 4  1.0 osd.4up  1.0  1.0
 5  1.0 osd.5up  1.0  1.0
-4  0.3 host staging-rd0-02
12  0.2 osd.12   up  1.0  1.0
13  0.2 osd.13   up  1.0  1.0


Have you experienced something similar?

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] change of dns names and IP addresses of cluster members

2016-07-22 Thread Andrei Mikhailovsky
Hi Henrik, 

Many thanks for your answer. 

What settings in the ceph.conf are you referring to? These: 

mon_initial_members = 
mon_host = 

I was under the impression that mon_initial_members is only used when the 
cluster is being set up and is not used by the live cluster. Is this the case? 
If so, should I simply change the mon_host values? Is this value used by ceph 
clients to find the ceph monitors? 
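
For reference, the sort of [global] section I mean (host names and IPs below are
just placeholders):

[global]
mon_initial_members = mon1, mon2, mon3
mon_host = 192.168.0.1,192.168.0.2,192.168.0.3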

thanks 

Andrei 

> From: "Henrik Korkuc" 
> To: "ceph-users" 
> Sent: Friday, 22 July, 2016 13:04:36
> Subject: Re: [ceph-users] change of dns names and IP addresses of cluster
> members

> On 16-07-22 13:33, Andrei Mikhailovsky wrote:

>> Hello

>> We are planning to make changes to our IT infrastructure and as a result the
>> fqdn and IPs of the ceph cluster will change. Could someone suggest the best
>> way of dealing with this to make sure we have a minimal ceph downtime?

> Can old and new networks reach each other? If yes, then you can do it without
> cluster downtime. You can change OSD IPs on a server-by-server basis - stop the
> OSDs, rename the host, change the IP, and start the OSDs. They should connect to
> the cluster with the new IP. Rinse and repeat for all.

> As for the mons, you'll need to remove each mon from the cluster, then re-add it
> with a new name and IP, and redistribute configs with the new mons.

> Depending on your use case it is possible that you may have client downtime as
> sometimes it is not possible to change client's config without a restart (e.g.
> qemu machine config)

>> Many thanks

>> Andrei

>> ___
>> ceph-users mailing list ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uncompactable Monitor Store at 69GB -- Re: Cluster in warn state, not sure what to do next.

2016-07-22 Thread Brian ::
Such great detail in this post David.. This will come in very handy
for people in the future

On Thu, Jul 21, 2016 at 8:24 PM, David Turner
 wrote:
> The mon store is important, and since your cluster isn't healthy, the mons need
> to hold onto it to make sure that when things come up the mon can
> replay everything for them.  Once you fix the 2 down and peering PGs, the
> mon store will fix itself in no time at all.  Ceph is rightly refusing to
> compact that database until your cluster is healthy.
>
> It seems like you have a couple things that might help your setup.  First I
> see something very easy to resolve, and that's the blocked requests.  Try
> running the following command:
>
> ceph osd down 71
>
> That command will tell the cluster that osd.71 is down without restarting
> the actual osd daemon.  Osd.71 will come back and tell the mons it's
> actually up, but in the mean time the operations blocking on osd.71 will go
> to a secondary to get the response and clear up.
>
> Second, osd.53  looks to be causing the never ending peering.  A couple
> questions to check things here.  What is your osd_max_backfills set to?
> That is directly related to how fast osd.53 will fill back up.  Something
> you might do to speed that up is to just inject a higher setting for osd.53
> and not the rest of the cluster:
>
> ceph tell osd.53 injectargs '--osd_max_backfills=20'
>
> If this is the problem and the cluster is just waiting for osd.53 to finish
> backfilling, then this will get you there faster.  I'm unfamiliar with the
> strategy you used to rebuild the data for osd.53.  I would have removed the
> osd from the cluster and added it back in with the same weight.  That way
> the osd would start right away and you would see the pgs backfilling onto
> the osd as opposed to it sitting in a perpetual "booting" state.
>
> To remove the osd with minimal impact to the cluster, the following commands
> should get you there.
>
> ceph osd tree | grep 'osd.53 '
> ceph osd set nobackfill
> ceph osd set norecover
> #on the host with osd.53, stop the daemon
> ceph osd down 53
> ceph osd out 53
> ceph osd crush remove osd.53
> ceph auth rm osd.53
> ceph osd rm 53
>
> At this point osd.53 is completely removed from the cluster and you have the
> original weight of the osd to set it to when you bring the osd back in.  The
> down and peering PGs should now be resolved.  Now, completely re-format and
> add the osd back into the cluster.  Make sure to do whatever you need for
> dmcrypt, journals, etc that are specific to your environment.  Once the osd
> is back in the cluster, up and in, reweight the osd to what it was before
> you removed it and unset norecover and nobackfill.
>
> ceph osd crush reweight osd.53 {{ weight_from_tree_command }}
> ceph osd unset nobackfill
> ceph osd unset norecover
>
> At this point everything is back to the way it was and the osd should start
> receiving data.  The only data movement should be refilling osd.53 with the
> data it used to have and everything else should stay the same.  Increasing
> the backfills for this osd will help it fill up faster, but it will be
> slower for client io if you do.  The mon stores will remain "too big" until
> after backfilling onto osd.53 finishes, but once the data stops moving
> around and all of your osds are up and in, the mon stores will compact in no
> time.
>
> I hope this helps.  Ask questions if you have any, and never run a command
> on your cluster that you don't understand.
>
> David Turner
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Salwasser,
> Zac [zsalw...@akamai.com]
> Sent: Thursday, July 21, 2016 12:54 PM
> To: ceph-users@lists.ceph.com
> Cc: Heller, Chris
> Subject: [ceph-users] Uncompactable Monitor Store at 69GB -- Re: Cluster in
> warn state, not sure what to do next.
>
> Rephrasing for brevity – I have a monitor store that is 69GB and won’t
> compact any further on restart or with ‘tell compact’.  Has anyone dealt
> with this before?
>
>
>
>
>
>
>
> From: "Salwasser, Zac" 
> Date: Thursday, July 21, 2016 at 1:18 PM
> To: "ceph-users@lists.ceph.com" 
> Cc: "Salwasser, Zac" , "Heller, Chris"
> 
> Subject: Cluster in warn state, not sure what to do next.
>
>
>
> Hi,
>
>
>
> I have a cluster that has been in an unhealthy state for a month or so.  We
> realized the OSDs were flapping due to not having user access to enough file
> handles, but it took us a while to realize this and we appear to have done a
> lot of damage to the state of the monitor store in the meantime.
>
>
>
> I’ve been trying to tackle one issue at a time, starting with the size of
> the monitor store.  Compaction, either compact on restart or compact as a
> ‘tell’ operation, does not shrink the size of the monitor store any more
> than it presently is.  Having no luck getting the monitor store 

Re: [ceph-users] rbd export-dif question

2016-07-22 Thread Jason Dillaman
Nothing immediately pops out at me as incorrect when looking at your
scripts. What do you mean when you say the diff is "always a certain
size"? Any chance that these images are clones? If it's the first
write to the backing object from the clone, librbd will need to copy
the full object from the parent up into the clone (e.g. a 512 byte
write could be a 4MB write on the backend).
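
For example, something like this should show whether an image is a clone and
what a diff actually contains (pool/image/snapshot names below are placeholders):

rbd info rbd/myimage                          # a clone shows a "parent:" line
rbd diff --from-snap snap1 rbd/myimage@snap2  # lists the changed extents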

On Wed, Jul 20, 2016 at 11:52 AM, Norman Uittenbogaart
 wrote:
> Hi,
>
> I made a backup script to backup RBD images in the pool by snapshots and
> exporting the
> first snapshot and afterwards only the diffs.
>
> I noticed however that creating a diff from one image to the next is always
> a certain size,
> and it's much bigger than just the changes from one snapshot to the next.
>
> I have shared the script here, https://github.com/normanu/rbd_backup
>
> So you can see what I am doing.
> But I wonder if I am doing something wrong or if there is always an overhead
> in backing up diffs.
>
> Thanks,
>
> Norman
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase

2016-07-22 Thread Jason Dillaman
You aren't, by chance, sharing the same RBD image between multiple
VMs, are you? An order-of-magnitude performance degradation would not
be unexpected if you have multiple clients concurrently accessing the
same image with the "exclusive-lock" feature enabled on the image.
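
For example, roughly (image spec below is a placeholder; only disable features if
sharing the image really is intended, and note that object-map/fast-diff have to
be disabled before exclusive-lock, and only if they are enabled):

rbd info rbd/myimage | grep features
rbd feature disable rbd/myimage fast-diff
rbd feature disable rbd/myimage object-map
rbd feature disable rbd/myimage exclusive-lock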

4000 IOPS for 4K random writes also sounds suspiciously high to me.

On Thu, Jul 21, 2016 at 7:32 PM, Martin Millnert  wrote:
> Hi,
>
> I just upgraded from Infernalis to Jewel and see an approximate 10x
> latency increase.
>
> Quick facts:
>  - 3x replicated pool
>  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610 SSDs,
>  - LSI3008 controller with up-to-date firmware and upstream driver, and
> up-to-date firmware on SSDs.
>  - 40GbE (Mellanox, with up-to-date drivers & firmware)
>  - CentOS 7.2
>
> Physical checks out, both iperf3 for network and e.g. fio over all the
> SSDs. Not done much of Linux tuning yet; but irqbalanced does a pretty
> good job with pairing both NIC and HBA with their respective CPUs.
>
> In performance hunting mode, and today took the next logical step of
> upgrading from Infernalis to Jewel.
>
> Tester is remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image with
> fio. The test scenario is 4K randomwrite, libaio, directIO, QD=1,
> runtime=900s, test-file-size=40GiB.
>
> Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of the
> I/O complete within maximum 250 µsec (~4000 IOPS). This, [2], sees
> 98.95% of the IO at ~4 msec (actually ~300 IOPs).
>
> Between [1] and [2] (simple plots of FIO's E2E-latency metrics), the
> entire cluster including compute nodes code went from Infernalis to
> 10.2.2
>
> What's going on here?
>
> I haven't tuned Ceph OSDs either in config or via Linux kernel at all
> yet; upgrade to Jewel came first. I haven't changed any OSD configs
> between [1] and [2] myself (only minimally before [1], 0 effort on
> performance tuning) , other than updated to Jewel tunables. But the
> difference is very drastic, wouldn't you say?
>
> Best,
> Martin
> [1] 
> http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08/ceph-fio-bench_lat.1.png
> [2] 
> http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10/ceph-fio-bench_lat.1.png
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 11:19
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 11:48, Nick Fisk a écrit :

 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk  ; 'Jake Young'  
 ; 'Jan Schermer'  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 10:23, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk  ; 'Jake Young'  
 ; 'Jan Schermer'  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young   ; Jan Schermer  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer  > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie   > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses scsi persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's possible without 
client support
(you essentialy have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in 

Re: [ceph-users] change of dns names and IP addresses of cluster members

2016-07-22 Thread Henrik Korkuc

On 16-07-22 13:33, Andrei Mikhailovsky wrote:

Hello

We are planning to make changes to our IT infrastructure and as a 
result the fqdn and IPs of the ceph cluster will change. Could someone 
suggest the best way of dealing with this to make sure we have a 
minimal ceph downtime?


Can old and new networks reach each other? If yes, then you can do it
without cluster downtime. You can change OSD IPs on a server-by-server
basis - stop the OSDs, rename the host, change the IP, start the OSDs. They
should connect to the cluster with the new IP. Rinse and repeat for all.


As for the mons, you'll need to remove each mon from the cluster, then re-add
it with a new name and IP, and redistribute configs with the new mons. Rinse
and repeat.
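
A rough sketch of the per-mon cycle (mon ids and paths below are placeholders;
check the add/remove-monitor docs before doing this on a live cluster):

ceph mon remove mon-old
# on the new/renamed host:
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i mon-new --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
# start the new mon, then update mon_host / mon_initial_members in ceph.conf everywhere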


Depending on your use case it is possible that you may have client 
downtime as sometimes it is not possible to change client's config 
without a restart (e.g. qemu machine config)



Many thanks

Andrei


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Try to install ceph hammer on CentOS7

2016-07-22 Thread Ruben Kerkhof
On Thu, Jul 21, 2016 at 7:26 PM, Manuel Lausch  wrote:
> Hi,

Hi,
>
> I try to install ceph hammer on centos7 but something with the RPM
> Repository seems to be wrong.
>
> In my yum.repos.d/ceph.repo file I have the following configuration:
>
> [ceph]
> name=Ceph packages for $basearch
> baseurl=baseurl=http://download.ceph.com/rpm-hammer/el7/$basearch

There's your issue. Remove the second baseurl=
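
A corrected stanza would look roughly like this (the gpgcheck/gpgkey lines are
illustrative, adjust to your setup):

[ceph]
name=Ceph packages for $basearch
baseurl=http://download.ceph.com/rpm-hammer/el7/$basearch
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc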

Kind regards,

Ruben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] change of dns names and IP addresses of cluster members

2016-07-22 Thread Andrei Mikhailovsky
Hello 

We are planning to make changes to our IT infrastructure and as a result the 
fqdn and IPs of the ceph cluster will change. Could someone suggest the best 
way of dealing with this to make sure we have a minimal ceph downtime? 

Many thanks 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 11:48, Nick Fisk a écrit :


*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk ; 'Jake Young'
 ; 'Jan Schermer'
 
*Cc:* ceph-users@lists.ceph.com 
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
; Jan Schermer 

*Cc:* ceph-users@lists.ceph.com 
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets
when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses scsi persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the
initial active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist
and make sure IO on
> that node is flushed and failed before
unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO
is treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you
don't know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from 

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 10:23, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk  ; 'Jake Young'  
 ; 'Jan Schermer'  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young   ; Jan Schermer  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer  > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie   > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses scsi persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's possible without 
client support
(you essentialy have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not 
possible to dynamically make all OS's do what your iSCSI target expects. 

 

Something like VMware does the right thing pretty much all the time 

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 10:23, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young  ;
Jan Schermer  
*Cc:* ceph-users@lists.ceph.com 
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when
it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses scsi persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial
active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist and
make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO is
treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't
know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to
the persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV -
some people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you
need to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there is
no fsck for VMFS...


Some (non opensource) solutions exist, Solaris supposedly
does this in some(?) way, maybe some iSCSI guru
can chime 

[ceph-users] Re: Infernalis -> Jewel, 10x+ RBD latency increase

2016-07-22 Thread Yoann Moulin
Hi,

>>> I just upgraded from Infernalis to Jewel and see an approximate 10x
>>> latency increase.
>>>
>>> Quick facts:
>>>  - 3x replicated pool
>>>  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610
>>> SSDs,
>>>  - LSI3008 controller with up-to-date firmware and upstream driver,
>>> and up-to-date firmware on SSDs.
>>>  - 40GbE (Mellanox, with up-to-date drivers & firmware)
>>>  - CentOS 7.2

Which kernel do you run? I saw a throughput drop (~40%) with kernel 4.4
compared to kernel 4.2. I didn't benchmark latency, but maybe the issue
impacts latency too.


-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase

2016-07-22 Thread Nick Fisk
> -Original Message-
> From: Martin Millnert [mailto:mar...@millnert.se]
> Sent: 22 July 2016 10:32
> To: n...@fisk.me.uk; 'Ceph Users' 
> Subject: Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase
> 
> On Fri, 2016-07-22 at 08:56 +0100, Nick Fisk wrote:
> > >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Martin Millnert
> > > Sent: 22 July 2016 00:33
> > > To: Ceph Users 
> > > Subject: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase
> > >
> > > Hi,
> > >
> > > I just upgraded from Infernalis to Jewel and see an approximate 10x
> > > latency increase.
> > >
> > > Quick facts:
> > >  - 3x replicated pool
> > >  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610
> > > SSDs,
> > >  - LSI3008 controller with up-to-date firmware and upstream driver,
> > > and up-to-date firmware on SSDs.
> > >  - 40GbE (Mellanox, with up-to-date drivers & firmware)
> > >  - CentOS 7.2
> > >
> > > Physical checks out, both iperf3 for network and e.g. fio over all
> > > the SSDs. Not done much of Linux tuning yet; but irqbalanced does a
> > > pretty good job with pairing both NIC and HBA with their respective
> > > CPUs.
> > >
> > > In performance hunting mode, and today took the next logical step of
> > > upgrading from Infernalis to Jewel.
> > >
> > > Tester is remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image
> > > with fio. The test scenario is 4K randomwrite, libaio, directIO,
> > > QD=1, runtime=900s, test-file-size=40GiB.
> > >
> > > Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of
> > > the I/O complete within maximum 250 µsec (~4000 IOPS). This, [2],
> > > sees 98.95% of the IO at ~4 msec (actually ~300 IOPs).
> >
> > I would be suspicious that somehow somewhere you had some sort of
> > caching going on, in the 1st example.
> 
> It wouldn't surprise me either, though to the best of my knowledge I haven't
> actively configured any such write caching anywhere.
> 
> I did forget one brief detail regarding the setup: We run 4x OSDs per 
> SSD-drive, i.e. roughly 400 GB each.
> Consistent 4k random-write performance onto /var/lib/ceph/osd- 
> $num/fiotestfile, with similar test-config as above, is 13k IOPS *per
> partition*.
> 
> > 250us is pretty much unachievable for directio writes with Ceph.
> 
> Thanks for the feedback, though it's disappointing to hear.
> 
> >  I've just built some new nodes with the pure goal of crushing (excuse
> > the pun) write latency and after extensive tuning can't get it much
> > below 600-700us.
> 
> What of the below, or other than the below, have you done, considering the 
> directIO baseline?
>  - SSD only hosts
>  - NIC <-> CPU/NUMA mapping
>  - HBA <-> CPU/NUMA mapping
>  - ceph-osd process <-> CPU/NUMA mapping
>  - Partition SSDs into multiple partitions
>  - Ceph OSD tunings for concurrency (many-clients)
>  - Ceph OSD tunings for latency (many-clients)
>  - async messenger, new in Jewel (not sure what impact is), or, change/tuning 
> of memory allocator
>  - RDMA (e.g. Mellanox) messenger

The things that have made the most difference (i.e. going from 2-3ms down to 
600us) are:

- Fast cores (using Xeon E3 running at 3.6Ghz)
- Which are also single socket so no NUMA to worry about
- NVME journals (get significantly lower device write latency vs SSD)
- Fix CPU freq at 3.6Ghz
- Set max c-state to C1

Most of the OSD tuning is probably more for high concurrency or throughput, it 
seems to have less of an effect vs the above. Of course those CPU tunings do 
increase power usage, so I'm looking at ways to find the best balance.
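
For reference, the sort of thing I mean (tools and parameters vary per
distro/CPU; the values are examples only):

cpupower frequency-set -g performance
# limit C-states via kernel boot parameters, e.g.:
#   intel_idle.max_cstate=1 processor.max_cstate=1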

> 
> I have yet to iron out precisely what those two OSD tunings would be.
> 
> > The 4ms sounds more likely for an untuned cluster. I wonder if any of
> > the RBD or qemu cache settings would have changed between versions?
> 
> I'm curious about this too.  What are relevant OSD-side configs here?
> And how do I check what the librbd clients experience? What parameters from 
> e.g. /etc/ceph/$clustername.conf applies to them?

I think RBD cache is enabled by default, you can check via the admin socket

http://ceph.com/planet/ceph-validate-that-the-rbd-cache-is-active/
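
e.g. something like this, assuming an "admin socket" path is configured for the
librbd client in ceph.conf (the socket path below is a placeholder):

ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config show | grep rbd_cache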


> 
> I'll have to make another pass over the rbd PRs between Infernalis and
> 10.2.2 I suppose.
> 
> 
> > > Between [1] and [2] (simple plots of FIO's E2E-latency metrics), the
> > > entire cluster including compute nodes code went from Infernalis to
> > > 10.2.2
> > >
> > > What's going on here?
> > >
> > > I haven't tuned Ceph OSDs either in config or via Linux kernel at
> > > all yet; upgrade to Jewel came first. I haven't changed any OSD
> > > configs between [1] and [2] myself (only minimally before [1], 0
> > > effort on performance tuning) , other than updated to Jewel
> > > tunables. But the difference is very drastic, wouldn't you say?
> > >
> > > Best,
> > > Martin
> > > [1] 

Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase

2016-07-22 Thread Martin Millnert
On Fri, 2016-07-22 at 08:56 +0100, Nick Fisk wrote:
> > 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > Behalf Of Martin Millnert
> > Sent: 22 July 2016 00:33
> > To: Ceph Users 
> > Subject: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency
> > increase
> > 
> > Hi,
> > 
> > I just upgraded from Infernalis to Jewel and see an approximate 10x
> > latency increase.
> > 
> > Quick facts:
> >  - 3x replicated pool
> >  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610
> > SSDs,
> >  - LSI3008 controller with up-to-date firmware and upstream driver,
> > and up-to-date firmware on SSDs.
> >  - 40GbE (Mellanox, with up-to-date drivers & firmware)
> >  - CentOS 7.2
> > 
> > Physical checks out, both iperf3 for network and e.g. fio over all
> > the SSDs. Not done much of Linux tuning yet; but irqbalanced does a
> > pretty good job with pairing both NIC and HBA with their respective
> > CPUs.
> > 
> > In performance hunting mode, and today took the next logical step
> > of upgrading from Infernalis to Jewel.
> > 
> > Tester is remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image
> > with fio. The test scenario is 4K randomwrite, libaio, directIO,
> > QD=1, runtime=900s, test-file-size=40GiB.
> > 
> > Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of
> > the I/O complete within maximum 250 µsec (~4000 IOPS). This, [2],
> > sees 98.95% of the IO at ~4 msec (actually ~300 IOPs).
> 
> I would be suspicious that somehow somewhere you had some sort of
> caching going on, in the 1st example. 

It wouldn't surprise me either, though to the best of my knowledge I haven't
actively configured any such write caching anywhere.

One detail I forgot to mention about the setup: we run 4 OSDs per SSD drive,
i.e. roughly 400 GB each.
Consistent 4k random-write performance onto
/var/lib/ceph/osd-$num/fiotestfile, with a similar test config as above, is
13k IOPS *per partition*.
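
For reference, the per-partition test is along these lines (fio invocation
approximate; size/runtime trimmed here):

  fio --name=osdtest --filename=/var/lib/ceph/osd-$num/fiotestfile \
      --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=1 \
      --size=10G --runtime=900 --time_based --group_reporting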

> 250us is pretty much unachievable for directio writes with Ceph.

Thanks for the feedback, though it's disappointing to hear.

>  I've just built some new nodes with the pure goal of crushing
> (excuse the pun) write latency and after extensive tuning can't get
> it much below 600-700us. 

What of the below, or other than the below, have you done, considering
the directIO baseline?
 - SSD only hosts
 - NIC <-> CPU/NUMA mapping
 - HBA <-> CPU/NUMA mapping
 - ceph-osd process <-> CPU/NUMA mapping
 - Partition SSDs into multiple partitions
 - Ceph OSD tunings for concurrency (many-clients)
 - Ceph OSD tunings for latency (many-clients)
 - async messenger, new in Jewel (not sure what impact is), or,
change/tuning of memory allocator
 - RDMA (e.g. Mellanox) messenger
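
For the NUMA-mapping items above, I'm thinking of something along these lines
(interface names and core numbers below are examples only):

  cat /sys/class/net/eth0/device/numa_node     # which node the NIC hangs off
  numactl --hardware                           # node/core layout
  taskset -cp 0-11 $(pidof -s ceph-osd)        # pin a running OSD to node 0's cores
  # or start the OSD under: numactl --cpunodebind=0 --membind=0 ...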

I have yet to iron out precisely what those two OSD tunings would be.

> The 4ms sounds more likely for an untuned cluster. I wonder if any of
> the RBD or qemu cache settings would have changed between versions?

I'm curious about this too.  What are relevant OSD-side configs here?
And how do I check what the librbd clients experience? What parameters
from e.g. /etc/ceph/$clustername.conf apply to them?

I'll have to make another pass over the rbd PRs between Infernalis and
10.2.2 I suppose.


> > Between [1] and [2] (simple plots of FIO's E2E-latency metrics),
> > the entire cluster including compute nodes code went from
> > Infernalis
> > to
> > 10.2.2
> > 
> > What's going on here?
> > 
> > I haven't tuned Ceph OSDs either in config or via Linux kernel at
> > all yet; upgrade to Jewel came first. I haven't changed any OSD
> > configs
> > between [1] and [2] myself (only minimally before [1], 0 effort on
> > performance tuning) , other than updated to Jewel tunables. But
> > the difference is very drastic, wouldn't you say?
> > 
> > Best,
> > Martin
> > [1] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08
> > /ceph-fio-bench_lat.1.png
> > [2] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10
> > /ceph-fio-bench_lat.1.png
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 09:47, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young   ; Jan Schermer  
 
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph + vmware

 

 

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer  > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie   > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anthing risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>
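
(For reference, the monitor-side blacklisting mentioned above is roughly the
following; the address is just a placeholder for the old target node's client
address:)

  ceph osd blacklist add 192.168.1.10:0/0
  ceph osd blacklist ls
  ceph osd blacklist rm 192.168.1.10:0/0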

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as
needing to be synchronous. 

 

(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know which in-flight
IO happened before the outage and which didn't. You could end up with only part
of the "transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non open-source) solutions exist; Solaris supposedly does this in some(?)
way. Maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's possible
without client support
(you essentially have to do something like transactions and replay the last
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not 
possible to dynamically make all OS's do what your iSCSI
target expects. 

 

Something like VMware does the right thing pretty much all the time (there are 
some iSCSI initiator bugs in earlier ESXi 5.x).  If
you have control of your ESXi hosts then attempting to set up HA iSCSI targets 
is possible. 

 

If you have a mixed client environment with various versions of Windows 
connecting to the target, you may be better off buying some
SAN appliances.

 


The one time I had to use it I resorted to simply mirroring in via mdraid on 
the client side over two 

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 22/07/2016 09:47, Nick Fisk wrote:


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 08:11
*To:* Jake Young ; Jan Schermer 
*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi
client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support
though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose
iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running:
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP, as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses SCSI persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and that IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only
use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor can't
tell what type of data the VMs are writing, all IO is treated as
needing to be synchronous.

(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know which
in-flight IO happened before the outage and which didn't. You could end
up with only part of the "transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some
people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you need
to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there is no
fsck for VMFS...


Some (non open-source) solutions exist; Solaris supposedly does
this in some(?) way. Maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's
possible without client support
(you essentially have to do something like transactions and
replay the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring
between the targets) without making it synchronous all the way.

This is why the SAN vendors wrote their own clients and drivers.
It is not possible to dynamically make all OS's do what your iSCSI
target expects.

Something like VMware does the right thing pretty much all the
time (there are some iSCSI initiator bugs in earlier ESXi 5.x). 
If you have control 

Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase

2016-07-22 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Martin Millnert
> Sent: 22 July 2016 00:33
> To: Ceph Users 
> Subject: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase
> 
> Hi,
> 
> I just upgraded from Infernalis to Jewel and see an approximate 10x latency 
> increase.
> 
> Quick facts:
>  - 3x replicated pool
>  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610 SSDs,
>  - LSI3008 controller with up-to-date firmware and upstream driver, and 
> up-to-date firmware on SSDs.
>  - 40GbE (Mellanox, with up-to-date drivers & firmware)
>  - CentOS 7.2
> 
> Physical checks out, both iperf3 for network and e.g. fio over all the SSDs. 
> Not done much of Linux tuning yet; but irqbalanced does a
> pretty good job with pairing both NIC and HBA with their respective CPUs.
> 
> In performance hunting mode, and today took the next logical step of 
> upgrading from Infernalis to Jewel.
> 
> Tester is remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image with fio. 
> The test scenario is 4K randomwrite, libaio, directIO,
> QD=1, runtime=900s, test-file-size=40GiB.
> 
> Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of the I/O 
> complete within maximum 250 µsec (~4000 IOPS). This, [2],
> sees 98.95% of the IO at ~4 msec (actually ~300 IOPs).

I would be suspicious that somehow somewhere you had some sort of caching going 
on, in the 1st example. 250us is pretty much unachievable for directio writes 
with Ceph. I've just built some new nodes with the pure goal of crushing 
(excuse the pun) write latency and after extensive tuning can't get it much 
below 600-700us. The 4ms sounds more likely for an untuned cluster. I wonder if 
any of the RBD or qemu cache settings would have changed between versions?
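
It might also be worth double-checking what cache mode the guest disks actually
ended up with after the upgrade, e.g. (domain name is just an example):

  virsh dumpxml guest01 | grep -A4 "<disk"    # check cache='...' on the <driver> element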

> 
> Between [1] and [2] (simple plots of FIO's E2E-latency metrics), the entire 
> cluster including compute nodes code went from Infernalis
> to
> 10.2.2
> 
> What's going on here?
> 
> I haven't tuned Ceph OSDs either in config or via Linux kernel at all yet; 
> upgrade to Jewel came first. I haven't changed any OSD configs
> between [1] and [2] myself (only minimally before [1], 0 effort on 
> performance tuning) , other than updated to Jewel tunables. But
> the difference is very drastic, wouldn't you say?
> 
> Best,
> Martin
> [1] 
> http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08/ceph-fio-bench_lat.1.png
> [2] 
> http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10/ceph-fio-bench_lat.1.png
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young ; Jan Schermer 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer  > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie   > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running:
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP, as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses SCSI persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and that IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as
needing to be synchronous. 

 

(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know which in-flight
IO happened before the outage and which didn't. You could end up with only part
of the "transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non open-source) solutions exist; Solaris supposedly does this in some(?)
way. Maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's possible
without client support
(you essentially have to do something like transactions and replay the last
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not 
possible to dynamically make all OS's do what your iSCSI
target expects. 

 

Something like VMware does the right thing pretty much all the time (there are 
some iSCSI initiator bugs in earlier ESXi 5.x).  If
you have control of your ESXi hosts then attempting to set up HA iSCSI targets 
is possible. 

 

If you have a mixed client environment with various versions of Windows 
connecting to the target, you may be better off buying some
SAN appliances.

 


The one time I had to use it I resorted to simply mirroring in via mdraid on 
the client side over two targets sharing the same
DAS, and this worked fine during testing but never went to production in the 
end.
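
(For illustration, that client-side mirroring amounted to little more than the
following, with the two device names standing in for the two iSCSI LUNs:)

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc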

Jan

>
>>
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
>
> I can't say, because I have not used stgt with rbd bs-type support enough.

 

For starters, STGT doesn't implement VAAI properly and you will need to disable 
VAAI in ESXi.
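
Disabling them per host is roughly the following, if I remember the option
names right (the VMFS locking/ATS one is usually the interesting one):

  esxcli system settings advanced set -i 0 -o /DataMover/HardwareAcceleratedMove
  esxcli system settings advanced set -i 0 -o /DataMover/HardwareAcceleratedInit
  esxcli system settings advanced set -i 0 -o /VMFS3/HardwareAcceleratedLocking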

 

LIO does seem to implement VAAI properly, 

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer > wrote:



> On 20 Jul 2016, at 18:38, Mike Christie > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client
? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI
over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running:
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and runs the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP, as per default).
>>
>> Do you see anything risky regarding this configuration?
>
> If you use an application that uses SCSI persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and that IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use
synchronous IO.


VMware does only use synchronous IO. Since the hypervisor can't tell 
what type of data the VMs are writing, all IO is treated as needing to 
be synchronous.


(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know
which in-flight IO happened before the outage
and which didn't. You could end up with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people
run it like that without realizing
the dangers and have never had a problem, so it may be strictly
theoretical, and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images
on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck 
for VMFS...



Some (non open-source) solutions exist; Solaris supposedly does
this in some(?) way. Maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's
possible without client support
(you essentially have to do something like transactions and replay
the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between
the targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is 
not possible to dynamically make all OS's do what your iSCSI target 
expects.


Something like VMware does the right thing pretty much all the time 
(there are some iSCSI initiator bugs in earlier ESXi 5.x).  If you 
have control of your ESXi hosts then attempting to set up HA iSCSI 
targets is possible.


If you have a mixed client environment with various versions of 
Windows connecting to the target, you may be better off buying some 
SAN appliances.



The one time I had to use it I resorted to simply mirroring in via
mdraid on the client side over two targets sharing the same
DAS, and this worked fine during testing but never went to
production in the end.

Jan

>
>>
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
>
> I can't say, because I have not used stgt with rbd bs-type
support enough.


For starters, STGT doesn't implement VAAI properly and you will need 

Re: [ceph-users] CephFS Samba VFS RHEL packages

2016-07-22 Thread Yan, Zheng
On Fri, Jul 22, 2016 at 11:15 AM, Blair Bethwaite
 wrote:
> Thanks Zheng,
>
> On 22 July 2016 at 12:12, Yan, Zheng  wrote:
>> We actively back-port fixes to the RHEL 7.x kernel.  When RHCS 2.0 is released,
>> the RHEL kernel should contain fixes up to the 3.7 upstream kernel.
>
> You meant 4.7 right?

I mean 4.7. sorry

>
> --
> Cheers,
> ~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com