Re: [ceph-users] Running ceph in docker

2016-07-05 Thread Josef Johansson
Hi,

The docker image is a freshly bootstrapped system with all the binaries
included. However, it's possible to give the container a whole device by
specifying the --device parameter, and then the data will survive a reboot
or even a rebuild of the container.
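
As a rough sketch (the image name and its arguments are assumptions based
on the ceph/daemon container, so adjust to whatever image you use):

# pass the whole disk plus the host's config and state dirs into the
# container; the OSD data lives on /dev/sdb, so it survives rebuilds
docker run -d --name ceph-osd-sdb \
  --net=host --device /dev/sdb \
  -v /etc/ceph:/etc/ceph \
  -v /var/lib/ceph:/var/lib/ceph \
  ceph/daemon osd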

Regards,
Josef

On Tue, 5 Jul 2016, 12:28 Steffen Weißgerber,  wrote:

>
>
> >>> Josef Johansson  schrieb am Donnerstag, 30. Juni
> 2016 um
> 15:23:
> > Hi,
> >
>
> Hi,
>
> > You could actually manage every osd, mon and mds through docker swarm;
> > since it's all just software, it makes sense to deploy it through
> > docker, where you add the disk that is needed.
> >
> > Mons do not need permanent storage either. It's not that a restart of
> > the docker instance would remove the data, but rather that removing the
> > container would.
> >
> > Updates are easy as well: download the latest docker image and rebuild
> > the OSD/MON.
> >
>
> And what about the data? Rebuilding an OSD means dropping it from the
> cluster,
> doesn't it?
>
> Or do you change only the binaries and restart the new docker image on the
> same
> osd device?
>
> Otherwise it's a build up and tear down of cluster components. That's what
> most ceph
> admins try to avoid to keep the cluster performance stable.
>
> > All in all I believe you give the sysop exactly one control plane to
> handle
> > all of the environment.
> >
> > Regards,
> > Josef
> >
>
> Regards
>
> Steffen
>
> > On Thu, 30 Jun 2016, 15:16 xiaoxi chen, 
> wrote:
> >
> >> It makes sense to me to run MDS inside docker or k8s as the MDS is
> >> stateless. But mons and OSDs do have local data, so what's the
> >> motivation to run them in docker?
> >>
> >> > To: ceph-users@lists.ceph.com
> >> > From: d...@redhat.com
> >> > Date: Thu, 30 Jun 2016 08:36:45 -0400
> >> > Subject: Re: [ceph-users] Running ceph in docker
> >>
> >> >
> >> > On 06/30/2016 02:05 AM, F21 wrote:
> >> > > Hey all,
> >> > >
> >> > > I am interested in running ceph in docker containers. This is
> extremely
> >> > > attractive given the recent integration of swarm into the docker
> >> engine,
> >> > > making it really easy to set up a docker cluster.
> >> > >
> >> > > When running ceph in docker, should monitors, radosgw and OSDs all
> be
> >> on
> >> > > separate physical nodes? I watched Sebastian's video on setting up
> ceph
> >> > > in docker here: https://www.youtube.com/watch?v=FUSTjTBA8f8. In the
> >> > > video, there were 6 OSDs, with 2 OSDs running on each node.
> >> > >
> >> > > Is running multiple OSDs on the same node a good idea in production?
> >> Has
> >> > > anyone operated ceph in docker containers in production? Are there
> any
> >> > > things I should watch out for?
> >> > >
> >> > > Cheers,
> >> > > Francis
> >> >
> >> > It's actually quite common to run multiple OSDs on the same physical
> >> > node, since an OSD currently maps to a single block device. Depending
> >> > on your load and traffic, it's usually a good idea to run monitors and
> >> > RGWs on separate nodes.
> >> >
> >> > Daniel
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
>
>
> --
> Klinik-Service Neubrandenburg GmbH
> Allendestr. 30, 17036 Neubrandenburg
> Amtsgericht Neubrandenburg, HRB 2457
> Geschaeftsfuehrerin: Gudrun Kappich
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mounting Ceph RBD image to XenServer 7 as SR

2016-06-30 Thread Josef Johansson
Also, is it possible to recompile the rbd kernel module in XenServer? I am
under the impression that it's open source as well.

Regards,
Josef

On Fri, 1 Jul 2016, 04:52 Mike Jacobacci,  wrote:

> Thanks Somnath and Christian,
>
> Yes, it looks like the latest version of XenServer still runs on an old
> kernel (3.10).  I know the method Christian linked, but it doesn’t work if
> XenServer is installed from iso.  It is really annoying there has been no
> movement on this for 3 years… I really like XenServer and am excited to use
> Ceph, I want this to work.
>
> Since there are no VM’s on it yet, I think I will upgrade the kernel and
> see what happens.
>
> Cheers,
> Mike
>
>
> On Jun 30, 2016, at 7:40 PM, Somnath Roy  wrote:
>
> It seems your client kernel is pretty old ?
> Either upgrade your kernel to 3.15 or later or you need to disable
> CRUSH_TUNABLES3.
> ceph osd crush tunables bobtail or ceph osd crush tunables legacy should
> help. This will start rebalancing, and you will also lose the improvements
> added in Firefly. So it's better to upgrade the client kernel, IMO.
>
> Thanks & Regards
> Somnath
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *Mike Jacobacci
> *Sent:* Thursday, June 30, 2016 7:27 PM
> *To:* Jake Young
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Mounting Ceph RBD image to XenServer 7 as SR
>
> Thanks Jake!  I enabled the epel 7 repo and was able to get ceph-common
> installed.  Here is what happens when I try to map the drive:
>
> rbd map rbd/enterprise-vm0 --name client.admin -m mon0 -k
> /etc/ceph/ceph.client.admin.keyring
> rbd: sysfs write failed
> In some cases useful info is found in syslog - try "dmesg | tail" or so.
> rbd: map failed: (5) Input/output error
>
> dmesg | tail:
>
> [35034.469236] libceph: mon0 192.168.10.187:6789 socket error on read
> [35044.469183] libceph: mon0 192.168.10.187:6789 feature set mismatch, my
> 4a042a42 < server's 2004a042a42, missing 200
> [35044.469199] libceph: mon0 192.168.10.187:6789 socket error on read
> [35054.469076] libceph: mon0 192.168.10.187:6789 feature set mismatch, my
> 4a042a42 < server's 2004a042a42, missing 200
> [35054.469083] libceph: mon0 192.168.10.187:6789 socket error on read
> [35064.469287] libceph: mon0 192.168.10.187:6789 feature set mismatch, my
> 4a042a42 < server's 2004a042a42, missing 200
> [35064.469302] libceph: mon0 192.168.10.187:6789 socket error on read
> [35074.469162] libceph: mon0 192.168.10.187:6789 feature set mismatch, my
> 4a042a42 < server's 2004a042a42, missing 200
> [35074.469178] libceph: mon0 192.168.10.187:6789 socket error on read
>
>
>
>
>
> On Jun 30, 2016, at 6:15 PM, Jake Young  wrote:
>
> See https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17112.html
>
>
> On Thursday, June 30, 2016, Mike Jacobacci  wrote:
> So after adding the ceph repo and enabling the centos-7 repo… It fails
> trying to install ceph-common:
>
> Loaded plugins: fastestmirror
> Loading mirror speeds from cached hostfile
>  * base: mirror.web-ster.com
> Resolving Dependencies
> --> Running transaction check
> ---> Package ceph-common.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: python-cephfs = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: python-rados = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librbd1 = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libcephfs1 = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: python-rbd = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librados2 = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: python-requests for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libboost_program_options-mt.so.1.53.0()(64bit)
> for package: 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librgw.so.2()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libradosstriper.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libbabeltrace-ctf.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libboost_regex-mt.so.1.53.0()(64bit) for
> package: 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libboost_iostreams-mt.so.1.53.0()(64bit) for
> package: 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librbd.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libtcmalloc.so.4()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librados.so.2()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libbabeltrace.so.1()(64bit) for package:
> 1:ceph-common-10.2.

Re: [ceph-users] Running ceph in docker

2016-06-30 Thread Josef Johansson
Hi,

You could actually manage every osd, mon and mds through docker swarm;
since it's all just software, it makes sense to deploy it through docker,
where you add the disk that is needed.

Mons do not need permanent storage either. It's not that a restart of the
docker instance would remove the data, but rather that removing the
container would.

Updates are easy as well: download the latest docker image and rebuild the
OSD/MON (a rough sketch of that is below).

All in all, I believe you give the sysop exactly one control plane to
handle all of the environment.
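
Something like this, assuming the ceph/daemon image and that the OSD keeps
its state on the passed-through disk and under /var/lib/ceph on the host:

docker pull ceph/daemon
docker stop ceph-osd-sdb && docker rm ceph-osd-sdb
# recreate with the same device and host paths; the data on /dev/sdb is
# untouched, so nothing needs to be backfilled
docker run -d --name ceph-osd-sdb --net=host --device /dev/sdb \
  -v /etc/ceph:/etc/ceph -v /var/lib/ceph:/var/lib/ceph \
  ceph/daemon osd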

Regards,
Josef

On Thu, 30 Jun 2016, 15:16 xiaoxi chen,  wrote:

> It makes sense to me to run MDS inside docker or k8s as the MDS is
> stateless. But mons and OSDs do have local data, so what's the motivation
> to run them in docker?
>
> > To: ceph-users@lists.ceph.com
> > From: d...@redhat.com
> > Date: Thu, 30 Jun 2016 08:36:45 -0400
> > Subject: Re: [ceph-users] Running ceph in docker
>
> >
> > On 06/30/2016 02:05 AM, F21 wrote:
> > > Hey all,
> > >
> > > I am interested in running ceph in docker containers. This is extremely
> > > attractive given the recent integration of swarm into the docker
> engine,
> > > making it really easy to set up a docker cluster.
> > >
> > > When running ceph in docker, should monitors, radosgw and OSDs all be
> on
> > > separate physical nodes? I watched Sebastian's video on setting up ceph
> > > in docker here: https://www.youtube.com/watch?v=FUSTjTBA8f8. In the
> > > video, there were 6 OSDs, with 2 OSDs running on each node.
> > >
> > > Is running multiple OSDs on the same node a good idea in production?
> Has
> > > anyone operated ceph in docker containers in production? Are there any
> > > things I should watch out for?
> > >
> > > Cheers,
> > > Francis
> >
> > It's actually quite common to run multiple OSDs on the same physical
> > node, since an OSD currently maps to a single block device. Depending
> > on your load and traffic, it's usually a good idea to run monitors and
> > RGWs on separate nodes.
> >
> > Daniel
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Josef Johansson
Hi,

People ran into this when there were changes in tunables that caused
70-100% data movement; the solution was to find out which values had
changed and increment them in the smallest steps possible (see the sketch
below).

I've found that with major data rearrangement in ceph the VMs do not
necessarily survive (last time on an SSD cluster), so my assumption is that
Linux and I/O timeouts don't mix well. Which is true with any other storage
backend out there ;)
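
By "smallest steps" I mean something like this: decompile the crushmap and
bump one tunable at a time (a sketch; which tunables apply depends on the
profile jump you are doing):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt and change a single value, e.g.
#   tunable chooseleaf_vary_r 1
# then compile and inject it, wait for the rebalance to finish, and repeat
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new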

Regards,
Josef

On Mon, 20 Jun 2016, 19:51 Gregory Farnum,  wrote:

> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>  wrote:
> > We have just updated our third cluster from Infernalis to Jewel, and are
> > experiencing similar issues.
> >
> > We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
> > have seen a lot of D-state processes and even jbd/2 timeouts and kernel
> > stack traces inside the guests. At first I thought the VMs were being
> > starved of IO, but this is still happening after throttling back the
> > recovery with:
> >
> > osd_max_backfills = 1
> > osd_recovery_max_active = 1
> > osd_recovery_op_priority = 1
> >
> > After upgrading the cluster to Jewel, I changed our crushmap to use the
> > newer straw2 algorithm, which resulted in a little data movment, but no
> > problems at that stage.
> >
> > Once the cluster had settled down again, I set tunables to optimal
> > (hammer profile -> jewel profile), which has triggered between 50% and
> > 70% misplaced PGs on our clusters. This is when the trouble started each
> > time, and when we had cascading failures of VMs.
> >
> > However, after performing hard shutdowns on the VMs and restarting them,
> > they seemed to be OK.
> >
> > At this stage, I have a strong suspicion that it is the introduction of
> > "require_feature_tunables5 = 1" in the tunables. This seems to require
> > all RADOS connections to be re-established.
>
> Do you have any evidence of that besides the one restart?
>
> I guess it's possible that we aren't kicking requests if the crush map
> but not the rest of the osdmap changes, but I'd be surprised.
> -Greg
>
> >
> >
> > On 20/06/16 13:54, Andrei Mikhailovsky wrote:
> >> Hi Oliver,
> >>
> >> I am also seeing this as strange behaviour indeed! I was going
> through the logs and I was not able to find any errors or issues. There were
> also no slow/blocked requests that I could see during the recovery process.
> >>
> >> Does anyone have an idea what could be the issue here? I don't want to
> shut down all vms every time there is a new release with updated tunable
> values.
> >>
> >>
> >> Andrei
> >>
> >>
> >>
> >> - Original Message -
> >>> From: "Oliver Dzombic" 
> >>> To: "andrei" , "ceph-users" <
> ceph-users@lists.ceph.com>
> >>> Sent: Sunday, 19 June, 2016 10:14:35
> >>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel
> tunables and client IO optimisations
> >>
> >>> Hi,
> >>>
> >>> so far the key values for that are:
> >>>
> >>> osd_client_op_priority = 63 ( anyway default, but i set it to remember
> it )
> >>> osd_recovery_op_priority = 1
> >>>
> >>>
> >>> In addition i set:
> >>>
> >>> osd_max_backfills = 1
> >>> osd_recovery_max_active = 1
> >>>
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cookbook failed: Where to report that https://git.ceph.com/release.asc is down?

2016-06-18 Thread Josef Johansson
This mailing list should be fine for reporting it; I also said something in
ceph-devel.

Regards,
Josef

On Sat, 18 Jun 2016, 19:18 Soonthorn Ativanichayaphong, <
soont...@getfelix.com> wrote:

> Thank you. I believe it is the same. But we can't update ceph's production
> recipe at the moment. Is there a place we can report that git.ceph.com is
> down? There are many places referencing  https://git.ceph.com/release.asc 
> (e.g.
>
> http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/
> ).
>
> Thank you for your help.
>
>
> On Sat, Jun 18, 2016 at 11:33 AM, Josef Johansson 
> wrote:
>
>> Hi,
>>
>> Shouldn't https://github.com/ceph/ceph/blob/master/keys/release.asc be
>> up to date?
>>
>> Regards,
>> Josef
>>
>> On Sat, 18 Jun 2016, 17:29 Soonthorn Ativanichayaphong, <
>> soont...@getfelix.com> wrote:
>>
>>> Hello,
>>>
> >>> Our chef cookbook fails due to https://ceph.com/git/?p=ceph.git;a=bl
> >>> connection timing out. Does anyone know how to work around this or where
> >>> I can report that the link is down?
>>> Thank you for your help.
>>>
>>> ---
>>>
>>> 
>>> Error executing action `add` on resource 'apt_repository[ceph]'
>>>
>>> 
>>>
>>> Errno::ETIMEDOUT
>>> 
>>> remote_file[/var/cache/chef/release.asc]
>>> (/var/cache/chef/cookbooks/apt/providers/repository.rb line 59) had an
>>> error: Errno::ETIMEDOUT: Error connecting to
>>> https://ceph.com/git/?p=ceph.git;a=bl
>>> ob_plain;f=keys/release.asc - Error connecting to
>>> https://git.ceph.com/?p=ceph.git;a=blob_plain;f=keys/release.asc -
>>> Connection timed out - connect(2)
>>>
>>> Resource Declaration:
>>> -
>>> # In /var/cache/chef/cookbooks/ceph/recipes/apt.rb
>>>
>>>  12: apt_repository 'ceph' do
>>>  13:   repo_name 'ceph'
>>>  14:   uri node['ceph']['debian'][branch]['repository']
>>>  15:   distribution distribution_codename
>>>  16:   components ['main']
>>>  17:   key node['ceph']['debian'][branch]['repository_key']
>>>  18: end
>>>  19:
>>>
>>> Compiled Resource:
>>> --
>>> # Declared in /var/cache/chef/cookbooks/ceph/recipes/apt.rb:12:in
>>> `from_file'
>>>
>>> apt_repository("ceph") do
>>>   action :add
>>>   retries 0
>>>   retry_delay 2
>>>   guard_interpreter :default
>>>   cookbook_name "ceph"
>>>   recipe_name "apt"
>>>   repo_name "ceph"
>>>   uri "http://ceph.com/debian-giant/";
>>>   distribution "trusty"
>>>   components ["main"]
>>>   key "
>>> https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc";
>>>   cache_rebuild true
>>> end
>>> --
>>> Thanks,
>>> Soonthorn Ativanichayaphong
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
> --
> Thanks,
> Soonthorn Ativanichayaphong
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cookbook failed: Where to report that https://git.ceph.com/release.asc is down?

2016-06-18 Thread Josef Johansson
Hi,

Shouldn't https://github.com/ceph/ceph/blob/master/keys/release.asc be up
to date?
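
As a manual workaround in the meantime (assuming the raw GitHub URL for
that file), one could fetch the key from there and add it directly:

wget -qO- https://raw.githubusercontent.com/ceph/ceph/master/keys/release.asc \
  | sudo apt-key add -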

Regards,
Josef

On Sat, 18 Jun 2016, 17:29 Soonthorn Ativanichayaphong, <
soont...@getfelix.com> wrote:

> Hello,
>
> Our chef cookbook fails due to https://ceph.com/git/?p=ceph.git;a=bl
> connection timing out. Does anyone know how to work around this or where I
> can report that the link is down?
> Thank you for your help.
>
> ---
>
> 
> Error executing action `add` on resource 'apt_repository[ceph]'
>
> 
>
> Errno::ETIMEDOUT
> 
> remote_file[/var/cache/chef/release.asc]
> (/var/cache/chef/cookbooks/apt/providers/repository.rb line 59) had an
> error: Errno::ETIMEDOUT: Error connecting to
> https://ceph.com/git/?p=ceph.git;a=bl
> ob_plain;f=keys/release.asc - Error connecting to
> https://git.ceph.com/?p=ceph.git;a=blob_plain;f=keys/release.asc -
> Connection timed out - connect(2)
>
> Resource Declaration:
> -
> # In /var/cache/chef/cookbooks/ceph/recipes/apt.rb
>
>  12: apt_repository 'ceph' do
>  13:   repo_name 'ceph'
>  14:   uri node['ceph']['debian'][branch]['repository']
>  15:   distribution distribution_codename
>  16:   components ['main']
>  17:   key node['ceph']['debian'][branch]['repository_key']
>  18: end
>  19:
>
> Compiled Resource:
> --
> # Declared in /var/cache/chef/cookbooks/ceph/recipes/apt.rb:12:in
> `from_file'
>
> apt_repository("ceph") do
>   action :add
>   retries 0
>   retry_delay 2
>   guard_interpreter :default
>   cookbook_name "ceph"
>   recipe_name "apt"
>   repo_name "ceph"
>   uri "http://ceph.com/debian-giant/";
>   distribution "trusty"
>   components ["main"]
>   key "
> https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc";
>   cache_rebuild true
> end
> --
> Thanks,
> Soonthorn Ativanichayaphong
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-06-08 Thread Josef Johansson
Hi,

Regarding single points of failure in the routing daemon on the host, I was
thinking about doing a cluster setup with e.g. VyOS on KVM machines on the
host, and letting them handle all the OSPF stuff as well. I have not done
any performance benchmarks, but it should at least be possible. It might
even work in docker or straight in LXC, since it's mostly route management
in the kernel.
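
Something along these lines per routing VM, if I remember the VyOS syntax
right (addresses are made up; eth0/eth1 face the two ToR switches and the
loopback /32 is the address that gets advertised for the host):

set interfaces ethernet eth0 address 10.0.1.0/31
set interfaces ethernet eth1 address 10.0.2.0/31
set interfaces loopback lo address 10.255.0.11/32
set protocols ospf parameters router-id 10.255.0.11
set protocols ospf area 0 network 10.0.1.0/31
set protocols ospf area 0 network 10.0.2.0/31
set protocols ospf area 0 network 10.255.0.11/32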

Regards,
Josef

On Mon, 6 Jun 2016, 18:54 Jeremy Hanmer, 
wrote:

> We do the same thing. OSPF between ToR switches, BGP to all of the hosts
> with each one advertising its own /32 (each has 2 NICs).
>
> On Mon, Jun 6, 2016 at 6:29 AM, Luis Periquito 
> wrote:
>
>> Nick,
>>
>> TL;DR: works brilliantly :)
>>
>> Where I work we have all of the ceph nodes (and a lot of other stuff)
>> using OSPF and BGP server attachment. With that we're able to implement
>> solutions like Anycast addresses, removing the need to add load balancers,
>> for the radosgw solution.
>>
>> The biggest issues we've had were around the per-flow vs per-packets
>> traffic load balancing, but as long as you keep it simple you shouldn't
>> have any issues.
>>
>> Currently we have a P2P network between the servers and the ToR switches
>> on a /31 subnet, and then create a virtual loopback address, which is the
>> interface we use for all communications. Running tests like iperf we're
>> able to reach 19Gbps (on a 2x10Gbps network). OTOH we no longer have the
>> ability to separate traffic between public and osd network, but never
>> really felt the need for it.
>>
>> Also spend a bit of time planning how the network will look like and it's
>> topology. If done properly (think details like route summarization) then
>> it's really worth the extra effort.
>>
>>
>>
>> On Mon, Jun 6, 2016 at 11:57 AM, Nick Fisk  wrote:
>>
>>> Hi All,
>>>
>>>
>>>
>>> Has anybody had any experience with running the network routed down all
>>> the way to the host?
>>>
>>>
>>>
>>> I know the standard way most people configured their OSD nodes is to
>>> bond the two nics which will then talk via a VRRP gateway and then probably
>>> from then on the networking is all Layer3. The main disadvantage I see here
>>> is that you need a beefy inter switch link to cope with the amount of
>>> traffic flowing between switches to the VRRP address. I’ve been trying to
>>> design around this by splitting hosts into groups with different VRRP
>>> gateways on either switch, but this relies on using active/passive bonding
>>> on the OSD hosts to make sure traffic goes from the correct Nic to the
>>> directly connected switch.
>>>
>>>
>>>
>>> What I was thinking, instead of terminating the Layer3 part of the
>>> network at the access switches, terminate it at the hosts. If each Nic of
>>> the OSD host had a different subnet and the actual “OSD Server” address
>>> bound to a loopback adapter, OSPF should advertise this loopback adapter
>>> address as reachable via the two L3 links on the physically attached Nic’s.
>>> This should give you a redundant topology which also will respect your
>>> physically layout and potentially give you higher performance due to ECMP.
>>>
>>>
>>>
>>> Any thoughts, any pitfalls?
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MONs fall out of quorum

2016-05-24 Thread Josef Johansson
Hi,

I’m diagnosing a problem where monitors fall out of quorum now and then. It
seems that when two monitors do a new election, one answer is not received
until 5 minutes later. I checked ntpd on the servers, and all of them are spot
on, no sync problems. This is happening a couple of times every day now, and
each of the mons has at some point been the one not answering in due time.

This is from the osd12-logs, the same pattern is repeated on the others:
2016-05-17 08:20:55.851276 mon.1 10.168.7.32:6789/0 157491 : cluster [INF] 
mon.osd12 calling new monitor election
2016-05-17 08:20:56.750915 mon.0 10.168.7.31:6789/0 4179082 : cluster [INF] 
mon.osd11 calling new monitor election
2016-05-17 08:21:02.709111 mon.0 10.168.7.31:6789/0 4179083 : cluster [INF] 
mon.osd11@0 won leader election with quorum 0,1
2016-05-17 08:20:58.916940 mon.2 10.168.7.33:6789/0 157323 : cluster [INF] 
mon.osd13 calling new monitor election
2016-05-17 08:21:03.931656 mon.0 10.168.7.31:6789/0 4179090 : cluster [INF] 
mon.osd11 calling new monitor election
2016-05-17 08:21:03.933038 mon.1 10.168.7.32:6789/0 157495 : cluster [INF] 
mon.osd12 calling new monitor election
2016-05-17 08:21:03.940032 mon.0 10.168.7.31:6789/0 4179091 : cluster [INF] 
mon.osd11@0 won leader election with quorum 0,1,2

Has anyone had similar problems?

We’re about to upgrade to the latest Jewel; this is ceph version 9.2.1
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd).
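
In case it helps, this is roughly what we plan to run the next time it
happens, to catch the election in the logs (a sketch; remember to lower the
debug levels again afterwards):

# raise monitor and messenger debug levels on all mons
ceph tell mon.* injectargs '--debug_mon 10 --debug_ms 1'
# check quorum and election state as seen by one mon (run on its host)
ceph daemon mon.osd12 mon_status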

Thanks for any help,
Josef
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Diagnosing slow requests

2016-05-24 Thread Josef Johansson
Hi,

> On 24 May 2016, at 09:16, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Tue, 24 May 2016 07:03:25 +0000 Josef Johansson wrote:
> 
>> Hi,
>> 
>> You need to monitor latency instead of peak points. As Ceph is writing to
>> two other nodes if you have 3 replicas that is 4x extra the latency
>> compared to one roundtrip to the first OSD from client. So smaller and
>> more IO equals more pain in latency.
>> 
> While very true, I don't think latency (as in network Ceph code related)
> is causing his problems.
> 
> 30+ second slow requests tend to be nearly exclusively the domain of the
> I/O system being unable to keep up.
> And I'm pretty sure it's the HDD based EC pool.
> 
> Not just monitoring Ceph counters but the actual disk stats can be
> helpful, as keeping in mind that it takes only one slow OSD, node, wonky
> link to bring everything to a standstill. 
> 

I agree, more counters are needed to identify the problem. Is there any way
nowadays to automatically find out which OSD is causing the slow requests?
Like, a graphical way of doing it :)

>> And the worst thing is that there is nothing that actually shows this
>> AFAIK, ceph osd perf shows some latency. A slow CPU could hamper
>> performance even if it's showing no sign of it.
>> 
>> I believe you can see which operations is running right now and see where
>> they are waiting for, I think there's a thread on this ML regarding dead
>> lock and skow requests that could be interesting.
>> 
>> I did not see Christians responses either so maybe not a problem with
>> your client.
>> 
> The list (archive) and me sure did:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009464.html 
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009464.html>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009744.html 
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009744.html>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009747.html 
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009747.html>
> 

So another Subject line for that issue. That explains it.

Regards,
Josef

> Christian
> 
>> Regards,
>> Josef
>> 
>> On Tue, 24 May 2016, 08:49 Peter Kerdisle, 
>> wrote:
>> 
>>> Hey Christian,
>>> 
>>> I honestly haven't seen any replies to my earlier message. I will
>>> traverse my email and make sure I find it, my apologies.
>>> 
>>> I am graphing everything with collectd and graphite this is what makes
>>> it so frustrating since I am not seeing any obvious pain points.
>>> 
>>> I am basically using the pool in read-forward now so there should be
>>> almost no promotion from EC to the SSD pool. I will see what options I
>>> have for adding some SSD journals to the OSD nodes to help speed
>>> things along.
>>> 
>>> Thanks, and apologies again for missing your earlier replies.
>>> 
>>> Peter
>>> 
>>> On Tue, May 24, 2016 at 4:25 AM, Christian Balzer 
>>> wrote:
>>> 
>>>> 
>>>> Hello,
>>>> 
>>>> On Mon, 23 May 2016 10:45:41 +0200 Peter Kerdisle wrote:
>>>> 
>>>>> Hey,
>>>>> 
>>>>> Sadly I'm still battling this issue. I did notice one interesting
>>>>> thing.
>>>>> 
>>>>> I changed the cache settings for my cache tier to add redundancy to
>>>>> the pool which means a lot of recover activity on the cache. During
>>>>> all this there were absolutely no slow requests reported. Is there
>>>>> anything I can conclude from that information? Is it possible that
>>>>> not having any SSDs for journals could be the bottleneck on my
>>>>> erasure pool and that's generating slow requests?
>>>>> 
>>>> That's what I suggested in the first of my 3 replies to your original
>>>> thread which seemingly got ignored judging by the lack of a reply.
>>>> 
>>>>> I simply can't imagine why a request can be blocked for 30 or even
>>>>> 60 seconds. It's getting really frustrating not being able to fix
>>>>> this and
>>>> I
>>>>> simply don't know what else I can do at this point.
>>>>> 
>>>> Are you monitoring your cluster, especially the HDD nodes?
>>>> Permanently with collectd and graphite (or similar) and topically
>>>> (especially d

Re: [ceph-users] Diagnosing slow requests

2016-05-24 Thread Josef Johansson
Hi,

You need to monitor latency rather than peak values. As Ceph writes to two
other nodes when you have 3 replicas, that is roughly 4x the latency
compared to a single round trip from the client to the first OSD. So
smaller and more IO means more pain in latency.

And the worst thing is that there is nothing that actually shows this
AFAIK; ceph osd perf shows some latency. A slow CPU could hamper
performance even if it shows no sign of it.

I believe you can see which operations are running right now and what they
are waiting for; I think there's a thread on this ML regarding deadlocks
and slow requests that could be interesting.
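
For a single OSD, the admin socket will at least dump what it is working on
(osd.12 here is just an example; run it on the node hosting that OSD):

# ops currently in flight, with how long each has waited and in what state
ceph daemon osd.12 dump_ops_in_flight
# the slowest recently completed ops and where they spent their time
ceph daemon osd.12 dump_historic_ops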

I did not see Christian's responses either, so maybe it's not a problem
with your client.

Regards,
Josef

On Tue, 24 May 2016, 08:49 Peter Kerdisle,  wrote:

> Hey Christian,
>
> I honestly haven't seen any replies to my earlier message. I will traverse
> my email and make sure I find it, my apologies.
>
> I am graphing everything with collectd and graphite this is what makes it
> so frustrating since I am not seeing any obvious pain points.
>
> I am basically using the pool in read-forward now so there should be
> almost no promotion from EC to the SSD pool. I will see what options I have
> for adding some SSD journals to the OSD nodes to help speed things along.
>
> Thanks, and apologies again for missing your earlier replies.
>
> Peter
>
> On Tue, May 24, 2016 at 4:25 AM, Christian Balzer  wrote:
>
>>
>> Hello,
>>
>> On Mon, 23 May 2016 10:45:41 +0200 Peter Kerdisle wrote:
>>
>> > Hey,
>> >
>> > Sadly I'm still battling this issue. I did notice one interesting thing.
>> >
>> > I changed the cache settings for my cache tier to add redundancy to the
>> > pool which means a lot of recover activity on the cache. During all this
>> > there were absolutely no slow requests reported. Is there anything I can
>> > conclude from that information? Is it possible that not having any SSDs
>> > for journals could be the bottleneck on my erasure pool and that's
>> > generating slow requests?
>> >
>> That's what I suggested in the first of my 3 replies to your original
>> thread which seemingly got ignored judging by the lack of a reply.
>>
>> > I simply can't imagine why a request can be blocked for 30 or even 60
>> > seconds. It's getting really frustrating not being able to fix this and
>> I
>> > simply don't know what else I can do at this point.
>> >
>> Are you monitoring your cluster, especially the HDD nodes?
>> Permanently with collectd and graphite (or similar) and topically
>> (especially during tests) with atop (or iostat, etc)?
>>
>> And one more time, your cache tier can only help you if it is fast and
>> large enough to sufficiently dis-engage you from your slow backing
>> storage.
>> And lets face it EC pool AND no journal SSDs will be quite slow.
>>
>> So if your cache is dirty all the time and has to flush (causing IO on the
>> backing storage) while there is also promotion from the backing storage to
>> the cache going on you're basically down to the base speed of your EC
>> pool, at least for some of your ops.
>>
>> Christian
>>
>> > If anybody has anything I haven't tried before please let me know.
>> >
>> > Peter
>> >
>> > On Thu, May 5, 2016 at 10:30 AM, Peter Kerdisle
>> >  wrote:
>> >
>> > > Hey guys,
>> > >
>> > > I'm running into an issue with my cluster during high activity.
>> > >
>> > > I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data)
>> > > with 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public and
>> > > 10Gbit cluster network for the erasure pool (10x3TB without separate
>> > > journal). This is all on Jewel.
>> > >
>> > > It's working fine with normal load. However when I force increased
>> > > activity by lowering the cache_target_dirty_ratio to make sure my
>> > > files are promoted things start to go amiss.
>> > >
>> > > To give an example: http://pastie.org/private/5k5ml6a8gqkivjshgjcedg
>> > >
>> > > This is especially concering:  pgs: 9 activating+undersized+degraded,
>> > > 48 active+undersized+degraded, 1 stale+active+clean, 27 peering.
>> > >
>> > > Here is an other minute or so where I grepped for warnings:
>> > > http://pastie.org/private/bfv3kxl63cfcduafoaurog
>> > >
>> > > These warnings are generated all over the OSD nodes, not specifically
>> > > one OSD or even one node.
>> > >
>> > > During this time the different OSD logs show various warnings:
>> > >
>> > > 2016-05-05 10:04:10.873603 7f1afaf3b700  1 heartbeat_map is_healthy
>> > > 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>> > > 2016-05-05 10:04:10.873605 7f1afaf3b700  1 heartbeat_map is_healthy
>> > > 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>> > > 2016-05-05 10:04:10.905997 7f1afc73e700  1 heartbeat_map is_healthy
>> > > 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>> > > 2016-05-05 10:04:10.906000 7f1afc73e700  1 heartbeat_map is_healthy
>> > > 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>

Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Josef Johansson
Hi,

response inline

On 20 Apr 2016 7:45 a.m., "Christian Balzer"  wrote:
>
>
> Hello,
>
> On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote:
>
> >
> > OK, you asked ;-)
> >
>
> I certainly did. ^o^
>
> > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > devices in an effort to get data striping across more OSDs, I had been
> > using that setup before adding the cache tier.
> >
> Nods.
> Depending on your use case (sequential writes) actual RADOS striping might
> be more advantageous than this (with 4MB writes still going to the same
> PG/OSD all the time).
>
>
> > 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> > setup with replication size 3. No SSDs involved in those OSDs, since
> > ceph-disk does not let you break a bluestore configuration into more
> > than one device at the moment.
> >
> That's a pity, but supposedly just  a limitation of ceph-disk.
> I'd venture you can work around that with symlinks to a raw SSD
> partition, same as with current filestore journals.
>
> As Sage recently wrote:
> ---
> BlueStore can use as many as three devices: one for the WAL (journal,
> though it can be much smaller than FileStores, e.g., 128MB), one for
> metadata (e.g., an SSD partition), and one for data.
> ---

I believe he also mentioned the use of bcache and friends for the OSD;
maybe that's a way forward in this case?
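
A minimal bcache sketch, if someone wants to try it under an OSD (device
names are made up and this wipes them; needs bcache-tools):

make-bcache -C /dev/nvme0n1p1    # set up the caching device
make-bcache -B /dev/sdb          # set up the backing (data) device
# attach the backing device to the cache set, then build the OSD on
# /dev/bcache0 instead of the raw disk
bcache-super-show /dev/nvme0n1p1 | grep cset.uuid
echo <cset-uuid> > /sys/block/bcache0/bcache/attach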

Regards
Josef
>
> > The 600 Mbytes/sec is an approx sustained number for the data rate I can
> > get going into this pool via RBD, that turns into 3 times that for raw
> > data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> > pushed it harder than that from time to time, but the OSD really wants
> > to use fdatasync a lot and that tends to suck up a lot of the potential
> > of a device, these disks will do 160 Mbytes/sec if you stream data to
> > them.
> >
> > I just checked with rados bench to this set of 33 OSDs with a 3 replica
> > pool, and 600 Mbytes/sec is what it will do from the same client host.
> >
> This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
> journals on 4 nodes with a replica of 3.
>
> So BlueStore is indeed faster than than filestore.
>
> > All the networking is 40 GB ethernet, single port per host, generally I
> > can push 2.2 Gbytes/sec in one direction between two hosts over a single
> > tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> > node. Short of going to RDMA that appears to be about the limit for
> > these processors.
> >
> Yeah, didn't expect your network to be involved here bottleneck wise, but
> a good data point to have nevertheless.
>
> > There are a grand total of 2 400 GB P3700s which are running a pool with
> > a replication factor of 1, these are in 2 other nodes. Once I add in
> > replication perf goes downhill. If I had more hardware I would be
> > running more of these and using replication, but I am out of network
> > cards right now.
> >
> Alright, so at 900MB/s you're pretty close to what one would expect from 2
> of these: 1080MB/s*2/2(journal).
>
> How much downhill is that?
>
> I have a production cache tier with 2 nodes (replica 2 of course) and 4
> 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance
> is pretty much what I would expect.
>
> > So 5 nodes running OSDs, and a 6th node running the RBD client using the
> > kernel implementation.
> >
> I assume there's are reason for use the kernel RBD client (which kernel?),
> given that it tends to be behind the curve in terms of features and speed?
>
> > Complete set of commands for creating the cache tier, I pulled this from
> > history, so the line in the middle was a failed command actually so
> > sorry for the red herring.
> >
> >   982  ceph osd pool create nvme 512 512 replicated_nvme
> >   983  ceph osd pool set nvme size 1
> >   984  ceph osd tier add rbd nvme
> >   985  ceph osd tier cache-mode  nvme writeback
> >   986  ceph osd tier set-overlay rbd nvme
> >   987  ceph osd pool set nvme  hit_set_type bloom
> >   988  ceph osd pool set target_max_bytes 5000 <<—— typo here,
> > so never mind 989  ceph osd pool set nvme target_max_bytes 5000
> >   990  ceph osd pool set nvme target_max_objects 50
> >   991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
> >   992  ceph osd pool set nvme cache_target_full_ratio 0.8
> >
> > I wish the cache tier would cause a health warning if it does not have
> > a max size set, it lets you do that, flushes nothing and fills the OSDs.
> >
> Oh yes, people have been bitten by this over and over again.
> At least it's documented now.
>
> > As for what the actual test is, this is 4K uncompressed DPX video
frames,
> > so 50 Mbyte files written at least 24 a second on a good day, ideally
> > more. This needs to sustain around 1.3 Gbytes/sec in either direction
> > from a single application and needs to do it consistently. There is a
> > certain amount of buffering to deal with fluctuations in perf. I a

Re: [ceph-users] User Interface

2016-03-11 Thread Josef Johansson
Proxmox handles the block storage at least, and I know that ownCloud handles
object storage through rgw nowadays :)

Regards,
Josef

> On 02 Mar 2016, at 20:51, Michał Chybowski  
> wrote:
> 
> Unfortunately, VSM can manage only pools / clusters created by itself.
> Regards,
> Michał Chybowski
> Tiktalik.com
> On 02.03.2016 at 20:23, Василий Ангапов wrote:
>> You may also look at Intel Virtual Storage Manager:
>> https://github.com/01org/virtual-storage-manager 
>> 
>> 
>> 
>> 2016-03-02 13:57 GMT+03:00 John Spray > >:
>> On Tue, Mar 1, 2016 at 2:42 AM, Vlad Blando < 
>> vbla...@morphlabs.com 
>> > wrote:
>> Hi,
>> 
>> We already have a user interface that is admin facing (ex. calamari, kraken, 
>> ceph-dash), how about a client facing interface, that can cater for both 
>> block and object store. For object store I can use Swift via Horizon 
>> dashboard, but for block-store, I'm not sure how.
>> 
>> So you're thinking of something that would be a UI equivalent of the rbd 
>> command line, right?  In an openstack environment I guess you'd be doing 
>> that via the Cinder integration with Horizon.  Outside of openstack, I think 
>> that the people working on  
>> https://github.com/skyrings/skyring 
>>  have ambitions along these lines too.
>> 
>> John
>>  
>> 
>> Thanks.
>> 
>> 
>> /Vlad
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems with starting services on Debian Jessie/Infernalis

2016-03-07 Thread Josef Johansson
Hi,

We’re setting up a new cluster, but we’re having trouble restarting the monitor 
services.

The problem is the difference between the ceph.service and ceph-mon@osd11 
service in our case.


root@osd11:/etc/init.d# /bin/systemctl status ceph.service
● ceph.service - LSB: Start Ceph distributed file system daemons at boot time
   Loaded: loaded (/etc/init.d/ceph)
   Active: active (exited) since Mon 2016-03-07 15:13:28 UTC; 49s ago
  Process: 5168 ExecStart=/etc/init.d/ceph start (code=exited, status=0/SUCCESS)

root@osd11:/etc/init.d# systemctl status ceph-mon@osd11
● ceph-mon@osd11.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled)
   Active: active (running) since Mon 2016-03-07 15:08:27 UTC; 24min ago
 Main PID: 5044 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@osd11.service
   └─5044 /usr/bin/ceph-mon -f --cluster ceph --id osd11 --setuser ceph 
--setgroup ceph

Mar 07 15:08:27 osd11 ceph-mon[5044]: starting mon.osd11 rank 0 at 
10.168.7.31:6789/0 mon_data /var/lib/ceph/mon/ceph-osd11 fsid 
aa6c952d-3149-4679-8e44-a6bf250d8c48


If I were to ‘stop’ the ceph-mon process

root@osd11:/etc/init# service ceph stop -a

The service is still up.

root@osd11:/etc/init# systemctl status ceph-mon@osd11
● ceph-mon@osd11.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled)
   Active: active (running) since Mon 2016-03-07 15:08:27 UTC; 29min ago
 Main PID: 5044 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@osd11.service
   └─5044 /usr/bin/ceph-mon -f --cluster ceph --id osd11 --setuser ceph 
--setgroup ceph

Mar 07 15:08:27 osd11 ceph-mon[5044]: starting mon.osd11 rank 0 at 
10.168.7.31:6789/0 mon_data /var/lib/ceph/mon/ceph-osd11 fsid 
aa6c952d-3149-4679-8e44-a6bf250d8c48


I see a lot of systemd/init.d loops here but could anyone explain how it’s 
supposed to work?
Is there something missing from our confs?

Thanks,
Josef
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-29 Thread Josef Johansson
Hm, I should be a bit more updated now. At least for 
{debian,rpm}-{hammer,infernalis,testing}
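
If you want to pull from it, something like this should do (assuming the
rsync module is called 'ceph', the same as in Wido's mirror-ceph.sh; the
target path is just an example):

rsync -avt --delete se.ceph.com::ceph /var/www/html/ceph/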

/Josef
> On 29 Feb 2016, at 19:19, Wido den Hollander  wrote:
> 
> 
>> Op 29 februari 2016 om 18:22 schreef Austin Johnson 
>> :
>> 
>> 
>> All,
>> 
>> I agree that rsync is down on download.ceph.com. I get a connection timeout
>> as well. Which makes it seem like an issue of the firewall silently
>> dropping packets.
>> 
> 
> Remember that download.ceph.com is THE source of all data. eu.ceph.com syncs
> from there as well.
> 
> That needs to be fixed since there is no other source.
> 
> Wido
> 
>> It has been down for at least a few weeks, forcing me to sync from eu,
>> which seems out of date.
>> 
>> Tyler - Is there any way that beyondhosting.net could turn rsync up for its
>> mirror?
>> 
>> Thanks,
>> Austin
>> 
>> On Mon, Feb 29, 2016 at 7:19 AM, Florent B  wrote:
>> 
>>> I would like to inform you that I have difficulties to set-up a mirror.
>>> 
>>> rsync on download.ceph.com is down
>>> 
>>> # rsync download.ceph.com::
>>> rsync: failed to connect to download.ceph.com (173.236.253.173):
>>> Connection timed out (110)
>>> 
>>> And eu.ceph.com is out of sync for a few weeks.
>>> 
>>> On 01/30/2016 03:14 PM, Wido den Hollander wrote:
 Hi,
 
 My PR was merged with a script to mirror Ceph properly:
 https://github.com/ceph/ceph/tree/master/mirroring
 
 Currently there are 3 (official) locations where you can get Ceph:
 
 - download.ceph.com (Dreamhost, US)
 - eu.ceph.com (PCextreme, Netherlands)
 - au.ceph.com (Digital Pacific, Australia)
 
 I'm looking for more mirrors to become official mirrors so we can easily
 distribute Ceph.
 
 Mirrors do go down and it's always nice to have a mirror local to you.
 
 I'd like to have one or more mirrors in Asia, Africa and/or South
 Ameirca if possible. Anyone able to host there? Other locations are
 welcome as well!
 
 A few things which are required:
 
 - 1Gbit connection or more
 - Native IPv4 and IPv6
 - HTTP access
 - rsync access
 - 2TB of storage or more
 - Monitoring of the mirror/source
 
 You can easily mirror Ceph yourself with this script I wrote:
 https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh
 
 eu.ceph.com and au.ceph.com use it to sync from download.ceph.com. If
 you want to mirror Ceph locally, please pick a mirror local to you.
 
 Please refer to these guidelines:
 https://github.com/ceph/ceph/tree/master/mirroring#guidelines
 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-29 Thread Josef Johansson
Got rpm-infernalis in now, and I’m updating debian-infernalis as well.

/Josef
> On 29 Feb 2016, at 15:44, Josef Johansson  wrote:
> 
> Syncing now.
>> On 29 Feb 2016, at 15:38, Josef Johansson > <mailto:jose...@gmail.com>> wrote:
>> 
>> I’ll check if I can mirror it though http.
>>> On 29 Feb 2016, at 15:31, Josef Johansson >> <mailto:jose...@gmail.com>> wrote:
>>> 
>>> Then we’re all in the same boat.
>>> 
>>>> On 29 Feb 2016, at 15:30, Florent B >>> <mailto:flor...@coppint.com>> wrote:
>>>> 
>>>> Hi and thank you. But for me, you are out of sync as eu.ceph.com 
>>>> <http://eu.ceph.com/>. Can't find Infernalis 9.2.1 on your mirror :(
>>>> 
>>>> On 02/29/2016 03:21 PM, Josef Johansson wrote:
>>>>> You could sync from me instead @ se.ceph.com <http://se.ceph.com/> 
>>>>> As a start.
>>>>> 
>>>>> Regards
>>>>> /Josef
>>>>> 
>>>>>> On 29 Feb 2016, at 15:19, Florent B < 
>>>>>> <mailto:flor...@coppint.com>flor...@coppint.com 
>>>>>> <mailto:flor...@coppint.com>> wrote:
>>>>>> 
>>>>>> I would like to inform you that I have difficulties to set-up a mirror.
>>>>>> 
>>>>>> rsync on download.ceph.com <http://download.ceph.com/> is down
>>>>>> 
>>>>>> # rsync download.ceph.com <http://download.ceph.com/>::
>>>>>> rsync: failed to connect to download.ceph.com 
>>>>>> <http://download.ceph.com/> (173.236.253.173):
>>>>>> Connection timed out (110)
>>>>>> 
>>>>>> And eu.ceph.com <http://eu.ceph.com/> is out of sync for a few weeks.
>>>>>> 
>>>>>> On 01/30/2016 03:14 PM, Wido den Hollander wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> My PR was merged with a script to mirror Ceph properly:
>>>>>>> https://github.com/ceph/ceph/tree/master/mirroring 
>>>>>>> <https://github.com/ceph/ceph/tree/master/mirroring>
>>>>>>> 
>>>>>>> Currently there are 3 (official) locations where you can get Ceph:
>>>>>>> 
>>>>>>> - download.ceph.com <http://download.ceph.com/> (Dreamhost, US)
>>>>>>> - eu.ceph.com <http://eu.ceph.com/> (PCextreme, Netherlands)
>>>>>>> - au.ceph.com <http://au.ceph.com/> (Digital Pacific, Australia)
>>>>>>> 
>>>>>>> I'm looking for more mirrors to become official mirrors so we can easily
>>>>>>> distribute Ceph.
>>>>>>> 
>>>>>>> Mirrors do go down and it's always nice to have a mirror local to you.
>>>>>>> 
>>>>>>> I'd like to have one or more mirrors in Asia, Africa and/or South
>>>>>>> Ameirca if possible. Anyone able to host there? Other locations are
>>>>>>> welcome as well!
>>>>>>> 
>>>>>>> A few things which are required:
>>>>>>> 
>>>>>>> - 1Gbit connection or more
>>>>>>> - Native IPv4 and IPv6
>>>>>>> - HTTP access
>>>>>>> - rsync access
>>>>>>> - 2TB of storage or more
>>>>>>> - Monitoring of the mirror/source
>>>>>>> 
>>>>>>> You can easily mirror Ceph yourself with this script I wrote:
>>>>>>> https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh 
>>>>>>> <https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh>
>>>>>>> 
>>>>>>> eu.ceph.com <http://eu.ceph.com/> and au.ceph.com <http://au.ceph.com/> 
>>>>>>> use it to sync from download.ceph.com <http://download.ceph.com/>. If
>>>>>>> you want to mirror Ceph locally, please pick a mirror local to you.
>>>>>>> 
>>>>>>> Please refer to these guidelines:
>>>>>>> https://github.com/ceph/ceph/tree/master/mirroring#guidelines 
>>>>>>> <https://github.com/ceph/ceph/tree/master/mirroring#guidelines>
>>>>>>> 
>>>>>> 
>>>>>> ___
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>>>> 
>>>> 
>>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-29 Thread Josef Johansson
Syncing now.
> On 29 Feb 2016, at 15:38, Josef Johansson  wrote:
> 
> I’ll check if I can mirror it though http.
>> On 29 Feb 2016, at 15:31, Josef Johansson > <mailto:jose...@gmail.com>> wrote:
>> 
>> Then we’re all in the same boat.
>> 
>>> On 29 Feb 2016, at 15:30, Florent B >> <mailto:flor...@coppint.com>> wrote:
>>> 
>>> Hi and thank you. But for me, you are out of sync as eu.ceph.com 
>>> <http://eu.ceph.com/>. Can't find Infernalis 9.2.1 on your mirror :(
>>> 
>>> On 02/29/2016 03:21 PM, Josef Johansson wrote:
>>>> You could sync from me instead @ se.ceph.com <http://se.ceph.com/> 
>>>> As a start.
>>>> 
>>>> Regards
>>>> /Josef
>>>> 
>>>>> On 29 Feb 2016, at 15:19, Florent B < 
>>>>> <mailto:flor...@coppint.com>flor...@coppint.com 
>>>>> <mailto:flor...@coppint.com>> wrote:
>>>>> 
>>>>> I would like to inform you that I have difficulties to set-up a mirror.
>>>>> 
>>>>> rsync on download.ceph.com <http://download.ceph.com/> is down
>>>>> 
>>>>> # rsync download.ceph.com <http://download.ceph.com/>::
>>>>> rsync: failed to connect to download.ceph.com <http://download.ceph.com/> 
>>>>> (173.236.253.173):
>>>>> Connection timed out (110)
>>>>> 
>>>>> And eu.ceph.com <http://eu.ceph.com/> is out of sync for a few weeks.
>>>>> 
>>>>> On 01/30/2016 03:14 PM, Wido den Hollander wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> My PR was merged with a script to mirror Ceph properly:
>>>>>> https://github.com/ceph/ceph/tree/master/mirroring 
>>>>>> <https://github.com/ceph/ceph/tree/master/mirroring>
>>>>>> 
>>>>>> Currently there are 3 (official) locations where you can get Ceph:
>>>>>> 
>>>>>> - download.ceph.com <http://download.ceph.com/> (Dreamhost, US)
>>>>>> - eu.ceph.com <http://eu.ceph.com/> (PCextreme, Netherlands)
>>>>>> - au.ceph.com <http://au.ceph.com/> (Digital Pacific, Australia)
>>>>>> 
>>>>>> I'm looking for more mirrors to become official mirrors so we can easily
>>>>>> distribute Ceph.
>>>>>> 
>>>>>> Mirrors do go down and it's always nice to have a mirror local to you.
>>>>>> 
>>>>>> I'd like to have one or more mirrors in Asia, Africa and/or South
>>>>>> Ameirca if possible. Anyone able to host there? Other locations are
>>>>>> welcome as well!
>>>>>> 
>>>>>> A few things which are required:
>>>>>> 
>>>>>> - 1Gbit connection or more
>>>>>> - Native IPv4 and IPv6
>>>>>> - HTTP access
>>>>>> - rsync access
>>>>>> - 2TB of storage or more
>>>>>> - Monitoring of the mirror/source
>>>>>> 
>>>>>> You can easily mirror Ceph yourself with this script I wrote:
>>>>>> https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh 
>>>>>> <https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh>
>>>>>> 
>>>>>> eu.ceph.com <http://eu.ceph.com/> and au.ceph.com <http://au.ceph.com/> 
>>>>>> use it to sync from download.ceph.com <http://download.ceph.com/>. If
>>>>>> you want to mirror Ceph locally, please pick a mirror local to you.
>>>>>> 
>>>>>> Please refer to these guidelines:
>>>>>> https://github.com/ceph/ceph/tree/master/mirroring#guidelines 
>>>>>> <https://github.com/ceph/ceph/tree/master/mirroring#guidelines>
>>>>>> 
>>>>> 
>>>>> ___
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>>> 
>>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-29 Thread Josef Johansson
I’ll check if I can mirror it through http.
> On 29 Feb 2016, at 15:31, Josef Johansson  wrote:
> 
> Then we’re all in the same boat.
> 
>> On 29 Feb 2016, at 15:30, Florent B > <mailto:flor...@coppint.com>> wrote:
>> 
>> Hi and thank you. But for me, you are out of sync as eu.ceph.com 
>> <http://eu.ceph.com/>. Can't find Infernalis 9.2.1 on your mirror :(
>> 
>> On 02/29/2016 03:21 PM, Josef Johansson wrote:
>>> You could sync from me instead @ se.ceph.com <http://se.ceph.com/> 
>>> As a start.
>>> 
>>> Regards
>>> /Josef
>>> 
>>>> On 29 Feb 2016, at 15:19, Florent B < 
>>>> <mailto:flor...@coppint.com>flor...@coppint.com 
>>>> <mailto:flor...@coppint.com>> wrote:
>>>> 
>>>> I would like to inform you that I have difficulties to set-up a mirror.
>>>> 
>>>> rsync on download.ceph.com <http://download.ceph.com/> is down
>>>> 
>>>> # rsync download.ceph.com <http://download.ceph.com/>::
>>>> rsync: failed to connect to download.ceph.com <http://download.ceph.com/> 
>>>> (173.236.253.173):
>>>> Connection timed out (110)
>>>> 
>>>> And eu.ceph.com <http://eu.ceph.com/> is out of sync for a few weeks.
>>>> 
>>>> On 01/30/2016 03:14 PM, Wido den Hollander wrote:
>>>>> Hi,
>>>>> 
>>>>> My PR was merged with a script to mirror Ceph properly:
>>>>> https://github.com/ceph/ceph/tree/master/mirroring 
>>>>> <https://github.com/ceph/ceph/tree/master/mirroring>
>>>>> 
>>>>> Currently there are 3 (official) locations where you can get Ceph:
>>>>> 
>>>>> - download.ceph.com <http://download.ceph.com/> (Dreamhost, US)
>>>>> - eu.ceph.com <http://eu.ceph.com/> (PCextreme, Netherlands)
>>>>> - au.ceph.com <http://au.ceph.com/> (Digital Pacific, Australia)
>>>>> 
>>>>> I'm looking for more mirrors to become official mirrors so we can easily
>>>>> distribute Ceph.
>>>>> 
>>>>> Mirrors do go down and it's always nice to have a mirror local to you.
>>>>> 
>>>>> I'd like to have one or more mirrors in Asia, Africa and/or South
>>>>> Ameirca if possible. Anyone able to host there? Other locations are
>>>>> welcome as well!
>>>>> 
>>>>> A few things which are required:
>>>>> 
>>>>> - 1Gbit connection or more
>>>>> - Native IPv4 and IPv6
>>>>> - HTTP access
>>>>> - rsync access
>>>>> - 2TB of storage or more
>>>>> - Monitoring of the mirror/source
>>>>> 
>>>>> You can easily mirror Ceph yourself with this script I wrote:
>>>>> https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh 
>>>>> <https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh>
>>>>> 
>>>>> eu.ceph.com <http://eu.ceph.com/> and au.ceph.com <http://au.ceph.com/> 
>>>>> use it to sync from download.ceph.com <http://download.ceph.com/>. If
>>>>> you want to mirror Ceph locally, please pick a mirror local to you.
>>>>> 
>>>>> Please refer to these guidelines:
>>>>> https://github.com/ceph/ceph/tree/master/mirroring#guidelines 
>>>>> <https://github.com/ceph/ceph/tree/master/mirroring#guidelines>
>>>>> 
>>>> 
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-29 Thread Josef Johansson
Then we’re all in the same boat.

> On 29 Feb 2016, at 15:30, Florent B  wrote:
> 
> Hi and thank you. But for me, you are out of sync as eu.ceph.com. Can't find 
> Infernalis 9.2.1 on your mirror :(
> 
> On 02/29/2016 03:21 PM, Josef Johansson wrote:
>> You could sync from me instead @ se.ceph.com <http://se.ceph.com/> 
>> As a start.
>> 
>> Regards
>> /Josef
>> 
>>> On 29 Feb 2016, at 15:19, Florent B  wrote:
>>> 
>>> I would like to inform you that I am having difficulties setting up a mirror.
>>> 
>>> rsync on download.ceph.com <http://download.ceph.com/> is down
>>> 
>>> # rsync download.ceph.com <http://download.ceph.com/>::
>>> rsync: failed to connect to download.ceph.com <http://download.ceph.com/> 
>>> (173.236.253.173):
>>> Connection timed out (110)
>>> 
>>> And eu.ceph.com <http://eu.ceph.com/> is out of sync for a few weeks.
>>> 
>>> On 01/30/2016 03:14 PM, Wido den Hollander wrote:
>>>> Hi,
>>>> 
>>>> My PR was merged with a script to mirror Ceph properly:
>>>> https://github.com/ceph/ceph/tree/master/mirroring 
>>>> <https://github.com/ceph/ceph/tree/master/mirroring>
>>>> 
>>>> Currently there are 3 (official) locations where you can get Ceph:
>>>> 
>>>> - download.ceph.com (Dreamhost, US)
>>>> - eu.ceph.com (PCextreme, Netherlands)
>>>> - au.ceph.com (Digital Pacific, Australia)
>>>> 
>>>> I'm looking for more mirrors to become official mirrors so we can easily
>>>> distribute Ceph.
>>>> 
>>>> Mirrors do go down and it's always nice to have a mirror local to you.
>>>> 
>>>> I'd like to have one or more mirrors in Asia, Africa and/or South
>>>> America if possible. Anyone able to host there? Other locations are
>>>> welcome as well!
>>>> 
>>>> A few things which are required:
>>>> 
>>>> - 1Gbit connection or more
>>>> - Native IPv4 and IPv6
>>>> - HTTP access
>>>> - rsync access
>>>> - 2TB of storage or more
>>>> - Monitoring of the mirror/source
>>>> 
>>>> You can easily mirror Ceph yourself with this script I wrote:
>>>> https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh 
>>>> <https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh>
>>>> 
>>>> eu.ceph.com and au.ceph.com use it to sync from download.ceph.com. If
>>>> you want to mirror Ceph locally, please pick a mirror local to you.
>>>> 
>>>> Please refer to these guidelines:
>>>> https://github.com/ceph/ceph/tree/master/mirroring#guidelines 
>>>> <https://github.com/ceph/ceph/tree/master/mirroring#guidelines>
>>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-29 Thread Josef Johansson
You could sync from me instead @ se.ceph.com  
As a start.

Regards
/Josef

> On 29 Feb 2016, at 15:19, Florent B  wrote:
> 
> I would like to inform you that I am having difficulties setting up a mirror.
> 
> rsync on download.ceph.com is down
> 
> # rsync download.ceph.com::
> rsync: failed to connect to download.ceph.com (173.236.253.173):
> Connection timed out (110)
> 
> And eu.ceph.com is out of sync for a few weeks.
> 
> On 01/30/2016 03:14 PM, Wido den Hollander wrote:
>> Hi,
>> 
>> My PR was merged with a script to mirror Ceph properly:
>> https://github.com/ceph/ceph/tree/master/mirroring
>> 
>> Currently there are 3 (official) locations where you can get Ceph:
>> 
>> - download.ceph.com (Dreamhost, US)
>> - eu.ceph.com (PCextreme, Netherlands)
>> - au.ceph.com (Digital Pacific, Australia)
>> 
>> I'm looking for more mirrors to become official mirrors so we can easily
>> distribute Ceph.
>> 
>> Mirrors do go down and it's always nice to have a mirror local to you.
>> 
>> I'd like to have one or more mirrors in Asia, Africa and/or South
>> America if possible. Anyone able to host there? Other locations are
>> welcome as well!
>> 
>> A few things which are required:
>> 
>> - 1Gbit connection or more
>> - Native IPv4 and IPv6
>> - HTTP access
>> - rsync access
>> - 2TB of storage or more
>> - Monitoring of the mirror/source
>> 
>> You can easily mirror Ceph yourself with this script I wrote:
>> https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh
>> 
>> eu.ceph.com and au.ceph.com use it to sync from download.ceph.com. If
>> you want to mirror Ceph locally, please pick a mirror local to you.
>> 
>> Please refer to these guidelines:
>> https://github.com/ceph/ceph/tree/master/mirroring#guidelines
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-02-29 Thread Josef Johansson
Maybe the reverse is possible, where we as a community lend out computing 
resources that the central build system could use.

> On 29 Feb 2016, at 14:38, Josef Johansson  wrote:
> 
> Hi,
> 
> There is also https://github.com/jordansissel/fpm/wiki 
> <https://github.com/jordansissel/fpm/wiki>
> 
> I find it quite useful for building deb/rpm.
> 
> What would be useful for the community per se would be if you made a 
> Dockerfile for each type of combination, i.e. Ubuntu trusty / 10.0.3 and so 
> forth.
> 
> That way anyone could just docker run ceph/compile-ubuntu-trusty-10.0.3 and 
> that would be it.
> 
> I don’t think that would even be tough to do.
> 
> I’m unsure how well you can test that it’s not tampered with, but I assume 
> it’s possible to solve, or at least set up trusts between a contributor and 
> the repo. 
> 
> Regards,
> Josef
> 
>> On 29 Feb 2016, at 14:28, Dan van der Ster  wrote:
>> 
>> On Mon, Feb 29, 2016 at 12:30 PM, Odintsov Vladislav  wrote:
>>> Can you please provide right way for building rpm packages?
>> 
>> It's documented here:
>> http://docs.ceph.com/docs/master/install/build-ceph/#rpm-package-manager 
>> <http://docs.ceph.com/docs/master/install/build-ceph/#rpm-package-manager>
>> 
>> For 0.94.6 you need to change the .spec file to use .tar.gz (because
>> there was no .bz2 published for some reason). And then also grab
>> init-ceph.in-fedora.patch from here:
>> https://raw.githubusercontent.com/ceph/ceph/master/rpm/init-ceph.in-fedora.patch
>>  
>> <https://raw.githubusercontent.com/ceph/ceph/master/rpm/init-ceph.in-fedora.patch>
>> 
>> BTW, I've put our build here:
>> http://linuxsoft.cern.ch/internal/repos/ceph6-stable/x86_64/os/ 
>> <http://linuxsoft.cern.ch/internal/repos/ceph6-stable/x86_64/os/>
>> These are unsigned, untested and come with no warranty, no guarantees
>> of any sort. And IMHO, no third party build would ever to give that
>> warm fuzzy trust-it-with-my-data feeling like a ceph.com <http://ceph.com/> 
>> build would
>> ;)
>> 
>> Moving forward, it would be great if the required community effort
>> could be put to work to get ceph.com <http://ceph.com/> el6 (and other) 
>> builds. For el6
>> in particular there is also the option to help out the Centos Storage
>> SIG to produce builds. I don't have a good feeling which direction is
>> better ... maybe both.
>> 
>> -- Dan
>> CERN IT Storage Group
>> 
>> 
>>> 
>>> Regards,
>>> 
>>> Vladislav Odintsov
>>> 
>>> 
>>> From: Shinobu Kinjo 
>>> Sent: Monday, February 29, 2016 14:11
>>> To: Odintsov Vladislav
>>> Cc: Franklin M. Siler; Xiaoxi Chen; ceph-de...@vger.kernel.org; ceph-users; Sage Weil
>>> Subject: Re: [ceph-users] v0.94.6 Hammer released
>>> 
>>> Can we make any kind of general procedure to make packages so that almost 
>>> everyone in community build packages by themselves and reduce developers 
>>> work load caused by too much requirement -;
>>> 
>>> Cheers,
>>> Shinobu
>>> 
>>> - Original Message -
>>> From: "Odintsov Vladislav" mailto:vlodint...@croc.ru>>
>>> To: "Franklin M. Siler" mailto:m...@franksiler.com>>, 
>>> "Xiaoxi Chen" mailto:superdebu...@gmail.com>>
>>> Cc: ceph-de...@vger.kernel.org <mailto:ceph-de...@vger.kernel.org>, 
>>> "ceph-users" mailto:ceph-us...@ceph.com>>, "Sage 
>>> Weil" mailto:s...@redhat.com>>
>>> Sent: Monday, February 29, 2016 6:04:02 PM
>>> Subject: Re: [ceph-users] v0.94.6 Hammer released
>>> 
>>> Hi all,
>>> 
>>> should we build el6 packages ourself or, it's hoped that these packages 
>>> would be built officially by community?
>>> 
>>> 
>>> Regards,
>>> 
>>> Vladislav Odintsov
>>> 
>>> 
>>> From: ceph-devel-ow...@vger.kernel.org  on behalf of Franklin M. Siler 
>>> Sent: Friday,

Re: [ceph-users] v0.94.6 Hammer released

2016-02-29 Thread Josef Johansson
Hi,

There is also https://github.com/jordansissel/fpm/wiki 


I find it quite useful for building deb/rpm.

What would be useful for the community per se would be if you made a Dockerfile 
for each type of combination, i.e. Ubuntu trusty / 10.0.3 and so forth.

That way anyone could just docker run ceph/compile-ubuntu-trusty-10.0.3 and 
that would be it.

I don’t think that would even be tough to do.
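Roughly something like this, as an untested sketch (the base image, tag and build commands are assumptions and would need tuning per release):

"""
cat > Dockerfile <<'EOF'
# Sketch: build Ceph v10.0.3 packages on Ubuntu trusty
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y git build-essential devscripts fakeroot
RUN git clone --recursive --depth 1 --branch v10.0.3 https://github.com/ceph/ceph.git /ceph
WORKDIR /ceph
# install-deps.sh ships with the source; the resulting .debs land in the parent directory
CMD ./install-deps.sh && dpkg-buildpackage -us -uc -j"$(nproc)" && cp ../*.deb /out/
EOF
docker build -t ceph/compile-ubuntu-trusty-10.0.3 .
docker run -v "$PWD/debs:/out" ceph/compile-ubuntu-trusty-10.0.3
"""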

I’m unsure how well you can test that it’s not tampered with, but I assume it’s 
possible to solve, or at least set up trusts between a contributor and the 
repo. 

Regards,
Josef

> On 29 Feb 2016, at 14:28, Dan van der Ster  wrote:
> 
> On Mon, Feb 29, 2016 at 12:30 PM, Odintsov Vladislav  wrote:
>> Can you please provide right way for building rpm packages?
> 
> It's documented here:
> http://docs.ceph.com/docs/master/install/build-ceph/#rpm-package-manager 
> 
> 
> For 0.94.6 you need to change the .spec file to use .tar.gz (because
> there was no .bz2 published for some reason). And then also grab
> init-ceph.in-fedora.patch from here:
> https://raw.githubusercontent.com/ceph/ceph/master/rpm/init-ceph.in-fedora.patch
>  
> 
> 
> BTW, I've put our build here:
> http://linuxsoft.cern.ch/internal/repos/ceph6-stable/x86_64/os/ 
> 
> These are unsigned, untested and come with no warranty, no guarantees
> of any sort. And IMHO, no third party build would ever to give that
> warm fuzzy trust-it-with-my-data feeling like a ceph.com  
> build would
> ;)
> 
> Moving forward, it would be great if the required community effort
> could be put to work to get ceph.com  el6 (and other) 
> builds. For el6
> in particular there is also the option to help out the Centos Storage
> SIG to produce builds. I don't have a good feeling which direction is
> better ... maybe both.
> 
> -- Dan
> CERN IT Storage Group
> 
> 
>> 
>> Regards,
>> 
>> Vladislav Odintsov
>> 
>> 
>> From: Shinobu Kinjo 
>> Sent: Monday, February 29, 2016 14:11
>> To: Odintsov Vladislav
>> Cc: Franklin M. Siler; Xiaoxi Chen; ceph-de...@vger.kernel.org; ceph-users; 
>> Sage Weil
>> Subject: Re: [ceph-users] v0.94.6 Hammer released
>> 
>> Can we make any kind of general procedure to make packages so that almost 
>> everyone in community build packages by themselves and reduce developers 
>> work load caused by too much requirement -;
>> 
>> Cheers,
>> Shinobu
>> 
>> - Original Message -
>> From: "Odintsov Vladislav" 
>> To: "Franklin M. Siler" , "Xiaoxi Chen" 
>> 
>> Cc: ceph-de...@vger.kernel.org, "ceph-users" , "Sage 
>> Weil" 
>> Sent: Monday, February 29, 2016 6:04:02 PM
>> Subject: Re: [ceph-users] v0.94.6 Hammer released
>> 
>> Hi all,
>> 
>> should we build el6 packages ourself or, it's hoped that these packages 
>> would be built officially by community?
>> 
>> 
>> Regards,
>> 
>> Vladislav Odintsov
>> 
>> 
>> From: ceph-devel-ow...@vger.kernel.org  on 
>> behalf of Franklin M. Siler 
>> Sent: Friday, February 26, 2016 05:03
>> To: Xiaoxi Chen
>> Cc: Alfredo Deza; Dan van der Ster; Sage Weil; ceph-de...@vger.kernel.org; 
>> ceph-users
>> Subject: Re: [ceph-users] v0.94.6 Hammer released
>> 
>> On Feb 25, 2016, at 1839, Xiaoxi Chen  wrote:
>> 
>>> Will we build package for ubuntu 12.04 (Precise)?
>>> Seems it also doesn't show in the repo
>> 
>> The Ceph packages provided by Ubuntu are old.  However, the Ceph project 
>> publishes its own packages.
>> 
>> http://download.ceph.com/debian-hammer/dists/precise/
>> 
>> so repo lines for sources.list would be, I think:
>> 
>> deb http://download.ceph.com/debian-hammer/ precise main
>> deb-src http://download.ceph.com/debian-hammer/ precise main
>> 
>> 
>> Cheers,
>> 
>> Frank Siler
>> Siler Industrial Analytics
>> 314.799.9405
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
_

Re: [ceph-users] Dedumplication feature

2016-02-28 Thread Josef Johansson
I assume the author meant deduplication? :-)

Cheers,
Josef
> On 28 Feb 2016, at 02:08, Lindsay Mathieson  
> wrote:
> 
> On 28/02/2016 10:23 AM, Shinobu Kinjo wrote:
>> Does the Ceph have ${subject}?
> 
> Well ceph 0.67 was codename "Dumpling", and we are well past that, so yes I 
> guess ceph has mostly been dedumplified. Which is a shame because I love 
> dumplings! Yum!
> 
> -- 
> Lindsay Mathieson
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Tips for faster openstack instance boot

2016-02-09 Thread Josef Johansson
The biggest question here is whether the OS is using systemd or not. CL7 boots
extremely quickly, but our CL6 instances take up to 90 seconds if the cluster
has work to do.

I know there is also a lot that can be done in the init itself, with boot
profiling etc., that could help.
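For the systemd-based guests, something like this gives a quick picture of where boot time goes (pre-systemd guests would need bootchart or similar instead):

"""
systemd-analyze                  # total time split into kernel/initrd/userspace
systemd-analyze blame            # per-unit startup time, slowest first
systemd-analyze critical-chain   # the chain of units that gated the boot
"""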

/Josef

On Tue, 9 Feb 2016 17:11 Vickey Singh  wrote:

> Guys Thanks a lot for your response.
>
> We are running OpenStack Juno + Ceph 94.5
>
> @Jason Dillaman Can you please explain what do you mean by "Glance is
> configured to cache your RBD image" ? This might give me some clue.
>
> Many Thanks.
>
>
> On Mon, Feb 8, 2016 at 10:33 PM, Jason Dillaman 
> wrote:
>
>> If Nova and Glance are properly configured, it should only require a
>> quick clone of the Glance image to create your Nova ephemeral image.  Have
>> you double-checked your configuration against the documentation [1]?  What
>> version of OpenStack are you using?
>>
>> To answer your questions:
>>
>> > - From Ceph point of view. does COW works cross pool i.e. image from
>> glance
>> > pool ---> (cow) --> instance disk on nova pool
>> Yes, cloning copy-on-write images works across pools
>>
>> > - Will a single pool for glance and nova instead of separate pool .
>> will help
>> > here ?
>> Should be no change -- the creation of the clone is extremely lightweight
>> (add the image to a directory, create a couple metadata objects)
>>
>> > - Is there any tunable parameter from Ceph or OpenStack side that
>> should be
>> > set ?
>> I'd double-check your OpenStack configuration.  Perhaps Glance isn't
>> configured with "show_image_direct_url = True", or Glance is configured to
>> cache your RBD images, or you have an older OpenStack release that requires
>> patches to fully support Nova+RBD.
>>
>> [1] http://docs.ceph.com/docs/master/rbd/rbd-openstack/
>>
>> --
>>
>> Jason Dillaman
>>
>>
>> - Original Message -
>>
>> > From: "Vickey Singh" 
>> > To: ceph-users@lists.ceph.com, "ceph-users" 
>> > Sent: Monday, February 8, 2016 9:10:59 AM
>> > Subject: [ceph-users] Tips for faster openstack instance boot
>>
>> > Hello Community
>>
>> > I need some guidance how can i reduce openstack instance boot time
>> using Ceph
>>
>> > We are using Ceph Storage with openstack ( cinder, glance and nova ).
>> All
>> > OpenStack images and instances are being stored on Ceph in different
>> pools
>> > glance and nova pool respectively.
>>
>> > I assume that Ceph by default uses COW rbd , so for example if an
>> instance is
>> > launched using glance image (which is stored on Ceph) , Ceph should
>> take COW
>> > snapshot of glance image and map it as RBD disk for instance. And this
>> whole
>> > process should be very quick.
>>
>> > In our case , the instance launch is taking 90 seconds. Is this normal
>> ? ( i
>> > know this really depends one's infra , but still )
>>
>> > Is there any way , i can utilize Ceph's power and can launch instances
>> ever
>> > faster.
>>
>> > - From Ceph point of view. does COW works cross pool i.e. image from
>> glance
>> > pool ---> (cow) --> instance disk on nova pool
>> > - Will a single pool for glance and nova instead of separate pool .
>> will help
>> > here ?
>> > - Is there any tunable parameter from Ceph or OpenStack side that
>> should be
>> > set ?
>>
>> > Regards
>> > Vickey
>>
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-07 Thread Josef Johansson
Yes exactly. I can serve it through IPv6 as well.

On Sat, 6 Feb 2016 08:44 Wido den Hollander  wrote:

> Hi,
>
> Great! So that would be se.ceph.com?
>
> There is a ceph-mirrors list for mirror admins, so let me know when you are
> ready to set up so I can add you there.
>
> Wido
>
> > On 6 February 2016 at 8:22, Josef Johansson  wrote:
> >
> >
> > Hi Wido,
> >
> > We're planning on hosting here in Sweden.
> >
> > I can let you know when we're ready.
> >
> > Regards
> >
> >
> > Josef
> >
> > On Sat, 30 Jan 2016 15:15 Wido den Hollander  wrote:
> >
> > > Hi,
> > >
> > > My PR was merged with a script to mirror Ceph properly:
> > > https://github.com/ceph/ceph/tree/master/mirroring
> > >
> > > Currently there are 3 (official) locations where you can get Ceph:
> > >
> > > - download.ceph.com (Dreamhost, US)
> > > - eu.ceph.com (PCextreme, Netherlands)
> > > - au.ceph.com (Digital Pacific, Australia)
> > >
> > > I'm looking for more mirrors to become official mirrors so we can
> easily
> > > distribute Ceph.
> > >
> > > Mirrors do go down and it's always nice to have a mirror local to you.
> > >
> > > I'd like to have one or more mirrors in Asia, Africa and/or South
> > > America if possible. Anyone able to host there? Other locations are
> > > welcome as well!
> > >
> > > A few things which are required:
> > >
> > > - 1Gbit connection or more
> > > - Native IPv4 and IPv6
> > > - HTTP access
> > > - rsync access
> > > - 2TB of storage or more
> > > - Monitoring of the mirror/source
> > >
> > > You can easily mirror Ceph yourself with this script I wrote:
> > > https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh
> > >
> > > eu.ceph.com and au.ceph.com use it to sync from download.ceph.com. If
> > > you want to mirror Ceph locally, please pick a mirror local to you.
> > >
> > > Please refer to these guidelines:
> > > https://github.com/ceph/ceph/tree/master/mirroring#guidelines
> > >
> > > --
> > > Wido den Hollander
> > > 42on B.V.
> > > Ceph trainer and consultant
> > >
> > > Phone: +31 (0)20 700 9902
> > > Skype: contact42on
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mirrors wanted!

2016-02-05 Thread Josef Johansson
Hi Wido,

We're planning on hosting here in Sweden.

I can let you know when we're ready.

Regards


Josef

On Sat, 30 Jan 2016 15:15 Wido den Hollander  wrote:

> Hi,
>
> My PR was merged with a script to mirror Ceph properly:
> https://github.com/ceph/ceph/tree/master/mirroring
>
> Currently there are 3 (official) locations where you can get Ceph:
>
> - download.ceph.com (Dreamhost, US)
> - eu.ceph.com (PCextreme, Netherlands)
> - au.ceph.com (Digital Pacific, Australia)
>
> I'm looking for more mirrors to become official mirrors so we can easily
> distribute Ceph.
>
> Mirrors do go down and it's always nice to have a mirror local to you.
>
> I'd like to have one or more mirrors in Asia, Africa and/or South
> America if possible. Anyone able to host there? Other locations are
> welcome as well!
>
> A few things which are required:
>
> - 1Gbit connection or more
> - Native IPv4 and IPv6
> - HTTP access
> - rsync access
> - 2TB of storage or more
> - Monitoring of the mirror/source
>
> You can easily mirror Ceph yourself with this script I wrote:
> https://github.com/ceph/ceph/blob/master/mirroring/mirror-ceph.sh
>
> eu.ceph.com and au.ceph.com use it to sync from download.ceph.com. If
> you want to mirror Ceph locally, please pick a mirror local to you.
>
> Please refer to these guidelines:
> https://github.com/ceph/ceph/tree/master/mirroring#guidelines
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Tech Talk - High-Performance Production Databases on Ceph

2016-02-03 Thread Josef Johansson
I was fascinated as well. This is how it should be done ☺

We are in the middle of ordering and I saw the notice that they use single
socket systems for the OSDs due to latency issues. I have only seen dual
socket systems on the OSD setups here. Is this something you should do with
new SSD clusters?

Regards,
Josef

On Sat, 30 Jan 2016 09:43 Nick Fisk  wrote:

> Yes, thank you very much. I've just finished going through this and found
> it very interesting. The dynamic nature of the infrastructure from top to
> bottom is fascinating, especially the use of OSPF per container.
>
> One question though, are those latency numbers for writes on Ceph correct?
> 9us is very fast or is it something to do with the 1/100 buffered nature of
> the test?
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Gregory Farnum
> > Sent: 29 January 2016 21:25
> > To: Patrick McGarry 
> > Cc: Ceph Devel ; Ceph-User  > us...@ceph.com>
> > Subject: Re: [ceph-users] Ceph Tech Talk - High-Performance Production
> > Databases on Ceph
> >
> > This is super cool — thanks, Thorvald, for the realistic picture of how
> > databases behave on rbd!
> >
> > On Thu, Jan 28, 2016 at 11:56 AM, Patrick McGarry 
> > wrote:
> > > Hey cephers,
> > >
> > > Here are the links to both the video and the slides from the Ceph Tech
> > > Talk today. Thanks again to Thorvald and Medallia for stepping forward
> > > to present.
> > >
> > > Video: https://youtu.be/OqlC7S3cUKs
> > >
> > > Slides:
> > > http://www.slideshare.net/Inktank_Ceph/2016jan28-high-performance-
> > prod
> > > uction-databases-on-ceph-57620014
> > >
> > >
> > > --
> > >
> > > Best Regards,
> > >
> > > Patrick McGarry
> > > Director Ceph Community || Red Hat
> > > http://ceph.com  ||  http://community.redhat.com @scuttlemonkey ||
> > > @ceph
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majord...@vger.kernel.org More
> > majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] very high OSD RAM usage values

2016-01-08 Thread Josef Johansson
Maybe changing the number of concurrent backfills could limit the memory
usage.
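For example, something like this applies it at runtime without restarting the OSDs (the values are just a conservative starting point; the same keys under [osd] in ceph.conf make it persistent):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'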
On 9 Jan 2016 05:52, "Josef Johansson"  wrote:

> Hi,
>
> I would say this is normal. 1 GB of RAM per 1 TB is what we designed the
> cluster for, and I would expect that an EC pool demands a lot more. Buy more
> RAM and start everything again; 32 GB of RAM is quite little. When the cluster
> is operating OK you'll see that extra RAM getting used as file cache, which
> makes the cluster faster.
>
> Regards,
> Josef
> On 6 Jan 2016 12:12, "Kenneth Waegeman"  wrote:
>
>> Hi all,
>>
>> We experienced some serious trouble with our cluster: A running cluster
>> started failing and started a chain reaction until the ceph cluster was
>> down, as about half the OSDs are down (in a EC pool)
>>
>> Each host has 8 OSDS of 8 TB (i.e. RAID 0 of 2 4TB disk) for an EC pool
>> (10+3, 14 hosts) and 2 cache OSDS and 32 GB of RAM.
>> The reason we have the Raid0 of the disks, is because we tried with 16
>> disk before, but 32GB didn't seem enough to keep the cluster stable
>>
>> We don't know for sure what triggered the chain reaction, but what we
>> certainly see, is that while recovering, our OSDS are using a lot of
>> memory. We've seen some OSDS using almost 8GB of RAM (resident; virtual
>> 11GB)
>> So right now we don't have enough memory to recover the cluster, because
>> the  OSDS  get killed by OOMkiller before they can recover..
>> And I don't know doubling our memory will be enough..
>>
>> A few questions:
>>
>> * Does someone has seen this before?
>> * 2GB was still normal, but 8GB seems a lot, is this expected behaviour?
>> * We didn't see this with an nearly empty cluster. Now it was filled
>> about 1/4 (270TB). I guess it would become worse when filled half or more?
>> * How high can this memory usage become ? Can we calculate the maximum
>> memory of an OSD? Can we limit it ?
>> * We can upgrade/reinstall to infernalis, will that solve anything?
>>
>> This is related to a previous post of me :
>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/22259
>>
>>
>> Thank you very much !!
>>
>> Kenneth
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] very high OSD RAM usage values

2016-01-08 Thread Josef Johansson
Hi,

I would say this is normal. 1 GB of RAM per 1 TB is what we designed the
cluster for, and I would expect that an EC pool demands a lot more. Buy more
RAM and start everything again; 32 GB of RAM is quite little. When the cluster
is operating OK you'll see that extra RAM getting used as file cache, which
makes the cluster faster.
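For reference, a quick way to see per-OSD resident memory on a node:

ps -C ceph-osd -o pid,rss,vsz,cmd --sort=-rss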

Regards,
Josef
On 6 Jan 2016 12:12, "Kenneth Waegeman"  wrote:

> Hi all,
>
> We experienced some serious trouble with our cluster: A running cluster
> started failing and started a chain reaction until the ceph cluster was
> down, as about half the OSDs are down (in a EC pool)
>
> Each host has 8 OSDS of 8 TB (i.e. RAID 0 of 2 4TB disk) for an EC pool
> (10+3, 14 hosts) and 2 cache OSDS and 32 GB of RAM.
> The reason we have the Raid0 of the disks, is because we tried with 16
> disk before, but 32GB didn't seem enough to keep the cluster stable
>
> We don't know for sure what triggered the chain reaction, but what we
> certainly see, is that while recovering, our OSDS are using a lot of
> memory. We've seen some OSDS using almost 8GB of RAM (resident; virtual
> 11GB)
> So right now we don't have enough memory to recover the cluster, because
> the  OSDS  get killed by OOMkiller before they can recover..
> And I don't know doubling our memory will be enough..
>
> A few questions:
>
> * Does someone has seen this before?
> * 2GB was still normal, but 8GB seems a lot, is this expected behaviour?
> * We didn't see this with an nearly empty cluster. Now it was filled about
> 1/4 (270TB). I guess it would become worse when filled half or more?
> * How high can this memory usage become ? Can we calculate the maximum
> memory of an OSD? Can we limit it ?
> * We can upgrade/reinstall to infernalis, will that solve anything?
>
> This is related to a previous post of me :
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/22259
>
>
> Thank you very much !!
>
> Kenneth
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM problems when rebalance occurs

2016-01-07 Thread Josef Johansson
Hi,

How did you benchmark?

I would recommend having a lot of MySQL instances with heavily utilised InnoDB
tables. During a recovery you should at least see the latency rise. Maybe use
one of the tools here:
https://dev.mysql.com/downloads/benchmarks.html
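If a full MySQL benchmark is too much to set up, a small synchronous-write fio job inside a guest is a rough way to watch the latency during recovery (the sizes, runtime and file path here are arbitrary):

fio --name=lat-probe --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 \
    --direct=1 --size=1G --runtime=60 --time_based --filename=/var/tmp/fio.probe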

Regards,
Josef
On 7 Jan 2016 16:36, "Robert LeBlanc"  wrote:

> With these min,max settings, we didn't have any problem going to more
> backfills.
>
> Robert LeBlanc
>
> Sent from a mobile device please excuse any typos.
> On Jan 7, 2016 8:30 AM, "nick"  wrote:
>
>> Heya,
>> thank you for your answers. We will try to set 16/32 as values for
>> osd_backfill_scan_[min|max]. I also set the debug logging config. Here is
>> an
>> excerpt of our new ceph.conf:
>>
>> """
>> [osd]
>> osd max backfills = 1
>> osd backfill scan max = 32
>> osd backfill scan min = 16
>> osd recovery max active = 1
>> osd recovery op priority = 1
>> osd op threads = 8
>>
>> [global]
>> debug optracker = 0/0
>> debug asok = 0/0
>> debug hadoop = 0/0
>> debug mds migrator = 0/0
>> debug objclass = 0/0
>> debug paxos = 0/0
>> debug context = 0/0
>> debug objecter = 0/0
>> debug mds balancer = 0/0
>> debug finisher = 0/0
>> debug auth = 0/0
>> debug buffer = 0/0
>> debug lockdep = 0/0
>> debug mds log = 0/0
>> debug heartbeatmap = 0/0
>> debug journaler = 0/0
>> debug mon = 0/0
>> debug client = 0/0
>> debug mds = 0/0
>> debug throttle = 0/0
>> debug journal = 0/0
>> debug crush = 0/0
>> debug objectcacher = 0/0
>> debug filer = 0/0
>> debug perfcounter = 0/0
>> debug filestore = 0/0
>> debug rgw = 0/0
>> debug monc = 0/0
>> debug rbd = 0/0
>> debug tp = 0/0
>> debug osd = 0/0
>> debug ms = 0/0
>> debug mds locker = 0/0
>> debug timer = 0/0
>> debug mds log expire = 0/0
>> debug rados = 0/0
>> debug striper = 0/0
>> debug rbd replay = 0/0
>> debug none = 0/0
>> debug keyvaluestore = 0/0
>> debug compressor = 0/0
>> debug crypto = 0/0
>> debug xio = 0/0
>> debug civetweb = 0/0
>> debug newstore = 0/0
>> """
>>
>> I already made a benchmark on our staging setup with the new config and
>> fio, but
>> did not really get different results than before.
>>
>> For us it is hardly possible to reproduce the 'stalling' problems on the
>> staging cluster so I will have to wait and test this in production.
>>
>> Does anyone know if 'osd max backfills' > 1 could have an impact as well?
>> The
>> default seems to be 10...
>>
>> Cheers
>> Nick
>>
>>
>>
>> On Wednesday, January 06, 2016 09:17:43 PM Josef Johansson wrote:
>> > Hi,
>> >
>> > Also make sure that you optimize the debug log config. There's a lot on
>> the
>> > ML on how to set them all to low values (0/0).
>> >
>> > Not sure how it's in infernalis but it did a lot in previous versions.
>> >
>> > Regards,
>> > Josef
>> >
>> > On 6 Jan 2016 18:16, "Robert LeBlanc"  wrote:
>> > > -BEGIN PGP SIGNED MESSAGE-
>> > > Hash: SHA256
>> > >
>> > > There has been a lot of "discussion" about osd_backfill_scan[min,max]
>> > > lately. My experience with hammer has been opposite that of what
>> > > people have said before. Increasing those values for us has reduced
>> > > the load of recovery and has prevented a lot of the disruption seen in
>> > > our cluster caused by backfilling. It does increase the amount of time
>> > > to do the recovery (a new node added to the cluster took about 3-4
>> > > hours before, now takes about 24 hours).
>> > >
>> > > We are currently using these values and seem to work well for us.
>> > > osd_max_backfills = 1
>> > > osd_backfill_scan_min = 16
>> > > osd_recovery_max_active = 1
>> > > osd_backfill_scan_max = 32
>> > >
>> > > I would be interested in your results if you try these values.
>> > > -BEGIN PGP SIGNATURE-
>> > > Version: Mailvelope v1.3.2
>> > > Comment: https://www.mailvelope.com
>> > >
>> > > wsFcBAEBCAAQBQJWjUu/CRDmVDuy+mK58QAArdMQAI+0Er/sdN7TF7knGey2
>> > > 5wJ6Ie81KJlrt/X9fIMpFdwkU2g5ET+sdU9R2hK4XcBpkonfGvwS8Ctha5Aq
>> > > XOJPrN4bMMeDK9Z4angK86ioLJevTH7tzp3FZL0U4Kbt1s9ZpwF6t+wlvkKl

Re: [ceph-users] KVM problems when rebalance occurs

2016-01-06 Thread Josef Johansson
Hi,

Also make sure that you optimize the debug log config. There's a lot on the
ML on how to set them all to low values (0/0).
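For example, the noisiest subsystems can be turned down at runtime with something like this (only a subset is shown; the full list of debug_* options is on the ML/docs, and they should go into ceph.conf to persist):

ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0 --debug_filestore 0/0 --debug_journal 0/0'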

Not sure how it's in infernalis but it did a lot in previous versions.

Regards,
Josef
On 6 Jan 2016 18:16, "Robert LeBlanc"  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> There has been a lot of "discussion" about osd_backfill_scan[min,max]
> lately. My experience with hammer has been opposite that of what
> people have said before. Increasing those values for us has reduced
> the load of recovery and has prevented a lot of the disruption seen in
> our cluster caused by backfilling. It does increase the amount of time
> to do the recovery (a new node added to the cluster took about 3-4
> hours before, now takes about 24 hours).
>
> We are currently using these values and seem to work well for us.
> osd_max_backfills = 1
> osd_backfill_scan_min = 16
> osd_recovery_max_active = 1
> osd_backfill_scan_max = 32
>
> I would be interested in your results if you try these values.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.2
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWjUu/CRDmVDuy+mK58QAArdMQAI+0Er/sdN7TF7knGey2
> 5wJ6Ie81KJlrt/X9fIMpFdwkU2g5ET+sdU9R2hK4XcBpkonfGvwS8Ctha5Aq
> XOJPrN4bMMeDK9Z4angK86ioLJevTH7tzp3FZL0U4Kbt1s9ZpwF6t+wlvkKl
> mt6Tkj4VKr0917TuXqk58AYiZTYcEjGAb0QUe/gC24yFwZYrPO0vUVb4gmTQ
> klNKAdTinGSn4Ynj+lBsEstWGVlTJiL3FA6xRBTz1BSjb4vtb2SoIFwHlAp+
> GO+bKSh19YIasXCZfRqC/J2XcNauOIVfb4l4viV23JN2fYavEnLCnJSglYjF
> Rjxr0wK+6NhRl7naJ1yGNtdMkw+h+nu/xsbYhNqT0EVq1d0nhgzh6ZjAhW1w
> oRiHYA4KNn2uWiUgigpISFi4hJSP4CEPToO8jbhXhARs0H6v33oWrR8RYKxO
> dFz+Lxx969rpDkk+1nRks9hTeIF+oFnW7eezSiR6TILYxvCZQ0ThHXQsL4ph
> bvUr0FQmdV3ukC+Xwa/cePIlVY6JsIQfOlqmrtG7caTZWLvLUDwrwcleb272
> 243GXlbWCxoI7+StJDHPnY2k7NHLvbN2yG3f5PZvZaBgqqyAP8Fnq6CDtTIE
> vZ/p+ZcuRw8lqoDgjjdiFyMmhQnFcCtDo3vtIy/UXDw23AVsI5edUyyP/sHt
> ruPt
> =X7SH
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Jan 6, 2016 at 7:13 AM, nick  wrote:
> > Heya,
> > we are using a ceph cluster (6 Nodes with each having 10x4TB HDD + 2x
> SSD (for
> > journal)) in combination with KVM virtualization. All our virtual
> machine hard
> > disks are stored on the ceph cluster. The ceph cluster was updated to the
> > 'infernalis' release recently.
> >
> > We are experiencing problems during cluster maintenance. A normal
> workflow for
> > us looks like this:
> >
> > - set the noout flag for the cluster
> > - stop all OSDs on one node
> > - update the node
> > - reboot the node
> > - start all OSDs
> > - wait for the backfilling to finish
> > - unset the noout flag
> >
> > After we start all OSDs on the node again the cluster backfills and
> tries to
> > get all the OSDs in sync. During the beginning of this process we
> experience
> > 'stalls' in our running virtual machines. On some the load raises to a
> very
> > high value. On others a running webserver responses only with 5xx HTTP
> codes.
> > It takes around 5-6 minutes until all is ok again. After those 5-6
> minutes the
> > cluster is still backfilling, but the virtual machines behave normal
> again.
> >
> > I already set the following parameters in ceph.conf on the nodes to have
> a
> > better rebalance traffic/user traffic ratio:
> >
> > """
> > [osd]
> > osd max backfills = 1
> > osd backfill scan max = 8
> > osd backfill scan min = 4
> > osd recovery max active = 1
> > osd recovery op priority = 1
> > osd op threads = 8
> > """
> >
> > It helped a bit, but we are still experiencing the above written
> problems. It
> > feels like that for a short time some virtual hard disks are locked. Our
> ceph
> > nodes are using bonded 10G network interfaces for the 'OSD network', so
> I do
> > not think that network is a bottleneck.
> >
> > After reading this blog post:
> > http://dachary.org/?p=2182
> > I wonder if there is really a 'read lock' during the object push.
> >
> > Does anyone know more about this or do others have the same problems and
> were
> > able to fix it?
> >
> > Best Regards
> > Nick
> >
> > --
> > Sebastian Nickel
> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help! OSD host failure - recovery without rebuilding OSDs

2015-12-28 Thread Josef Johansson
Did you manage to work this out?
On 25 Dec 2015 9:33 am, "Josef Johansson"  wrote:

> Hi
>
> Someone here will probably lay out a detailed answer but to get you
> started,
>
> All the details for the osd are in the xfs partitions, mirror a new USB
> key and change ip etc and you should be able to recover.
>
> If the journal is linked to a /dev/sdx, make sure it's in the same spot as
> it was before..
>
> All the best of luck
> /Josef
> On 25 Dec 2015 05:39, "deeepdish"  wrote:
>
>> Hello,
>>
>> Had an interesting issue today.
>>
>> My OSD hosts are booting off a USB key which, you guessed it has a root
>> partition on there.   All OSDs are mounted.   My USB key failed on one of
>> my OSD hosts, leaving the data on OSDs inaccessible to the rest of my
>> cluster.   I have multiple monitors running other OSD hosts where data can
>> be recovered to.   However I’m wondering if there’s a way to “restore” /
>> “rebuild” the ceph install that was on this host without having all OSDs
>> resync again.
>>
>> Lesson learned = don’t use USB boot/root drives.   However, now just
>> looking at what needs to be done once the OS and Ceph packages are
>> reinstalled.
>>
>> Thank you.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help! OSD host failure - recovery without rebuilding OSDs

2015-12-25 Thread Josef Johansson
Hi

Someone here will probably lay out a detailed answer but to get you started,

All the details for the osd are in the xfs partitions, mirror a new USB key
and change ip etc and you should be able to recover.

If the journal is linked to a /dev/sdx, make sure it's in the same spot as
it was before..
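A quick way to check where the journals point before bringing the OSDs back up (paths assume the default layout):

"""
ls -l /var/lib/ceph/osd/ceph-*/journal    # which device each journal symlink targets
blkid                                     # match devices by UUID/PARTUUID rather than /dev/sdX
"""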

All the best of luck
/Josef
On 25 Dec 2015 05:39, "deeepdish"  wrote:

> Hello,
>
> Had an interesting issue today.
>
> My OSD hosts are booting off a USB key which, you guessed it has a root
> partition on there.   All OSDs are mounted.   My USB key failed on one of
> my OSD hosts, leaving the data on OSDs inaccessible to the rest of my
> cluster.   I have multiple monitors running other OSD hosts where data can
> be recovered to.   However I’m wondering if there’s a way to “restore” /
> “rebuild” the ceph install that was on this host without having all OSDs
> resync again.
>
> Lesson learned = don’t use USB boot/root drives.   However, now just
> looking at what needs to be done once the OS and Ceph packages are
> reinstalled.
>
> Thank you.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 14.04

2015-12-12 Thread Josef Johansson
Thanks for sharing your solution as well!

Happy holidays
/Josef
On 12 Dec 2015 12:56 pm, "Claes Sahlström"  wrote:

> Just to share with the rest of the list, my problems have been solved now.
>
>
>
> I got this information from Sergey Malinin who had the same problem:
>
> 1. Stop OSD daemons on all nodes.
>
> 2. Check the output of "ceph osd tree". You will see some of OSDs showing
> as "up" - shut them down using "ceph osd down osd.X"
>
> 3. Start OSD daemons on all nodes - your cluster should now become
> operational.
>
>
>
> It is probably the  “mon_osd_min_up_ratio” that messed up my upgrade. I am
> happily running Infernalis now.
>
>
>
> Thanks all for the help and effort.
>
>
>
> /Claes
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 14.04

2015-11-16 Thread Josef Johansson
And if you look through the archives, Sage did release a version of Infernalis
that fixed this even if you didn’t do the upgrade that way.

> On 16 Nov 2015, at 22:15, David Clarke  wrote:
> 
> On 17/11/15 09:46, Claes Sahlström wrote:
>> Did some more logging and for some reason it seems like I do have some
>> problem communicating with my OSDs:
>> 
>> 
>> 
>> “ceph tell osd.* version” gives two different errors that might shed
>> some light on what is going on…
>> 
>> 
>> 
>> osd.0: Error ENXIO: problem getting command descriptions from osd.0
>> 
>> osd.0: problem getting command descriptions from osd.0
>> 
>> osd.3: Error ENXIO: problem getting command descriptions from osd.3
>> 
>> osd.3: problem getting command descriptions from osd.3
>> 
>> 2015-11-16 21:37:17.737671 7fafa05f6700  0 -- 172.16.0.202:0/1780708646 >>
>> 172.16.0.202:7001/8193 pipe(0x7fafa4068e40 sd=4 :0 s=1 pgs=0 cs=0 l=1
>> c=0x7fafa4062920).fault
>> 
>> osd.4: Error EINTR: problem getting command descriptions from osd.4
>> 
>> osd.4: problem getting command descriptions from osd.4
>> 
>> 
>> 
>> I have some quite large logs that I was going through and noticed that I
>> got this in the osd-log:
>> 
>> 2015-11-16 20:27:43.432210 7f6b356a0700  1 osd.0 39502 osdmap indicates
>> one or more pre-v0.94.4 hammer OSDs is running
>> 
>> 
>> 
>> That made me check what version the OSDs were, but I get that log entry
>> is because it cannot check the version of the other OSDs at all.
> 
> From which version did you upgrade?  The release notes for 0.94.4 [0] say:
> 
> "This Hammer point release fixes several important bugs in Hammer, as
> well as fixing interoperability issues that are required before an
> upgrade to Infernalis. That is, all users of earlier version of Hammer
> or any version of Firefly will first need to upgrade to hammer v0.94.4
> or later before upgrading to Infernalis (or future releases)."
> 
> 
> [0] http://docs.ceph.com/docs/master/release-notes/#v0-94-4-hammer 
> 
> 
> 
> -- 
> David Clarke
> Systems Architect
> Catalyst IT
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 14.04

2015-11-16 Thread Josef Johansson
Hi,

That piece of code is keeping your OSD from booting.

Well, you could run the command below to check the version as well. Might do that
with the mons as well, just to be sure.
# /usr/bin/ceph-osd --version
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
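Something like this on every node covers both daemon types (assuming both packages are installed); everything needs to report at least 0.94.4 before the Infernalis upgrade:

/usr/bin/ceph-osd --version
/usr/bin/ceph-mon --version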

Regards,
/Josef
> On 16 Nov 2015, at 21:46, Claes Sahlström  wrote:
> 
> Did some more logging and for some reason it seems like I do have some 
> problem communicating with my OSDs:
>  
> “ceph tell osd.* version” gives two different errors that might shed some 
> light on what is going on…
>  
> osd.0: Error ENXIO: problem getting command descriptions from osd.0
> osd.0: problem getting command descriptions from osd.0
> osd.3: Error ENXIO: problem getting command descriptions from osd.3
> osd.3: problem getting command descriptions from osd.3
> 2015-11-16 21:37:17.737671 7fafa05f6700  0 -- 172.16.0.202:0/1780708646 >> 
> 172.16.0.202:7001/8193 pipe(0x7fafa4068e40 sd=4 :0 s=1 pgs=0 cs=0 l=1 
> c=0x7fafa4062920).fault
> osd.4: Error EINTR: problem getting command descriptions from osd.4
> osd.4: problem getting command descriptions from osd.4
>  
> I have some quite large logs that I was going through and noticed that I got 
> this in the osd-log:
> 2015-11-16 20:27:43.432210 7f6b356a0700  1 osd.0 39502 osdmap indicates one 
> or more pre-v0.94.4 hammer OSDs is running
>  
> That made me check what version the OSDs were, but I get that log entry is 
> because it cannot check the version of the other OSDs at all.
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Claes Sahlström
> Sent: den 16 november 2015 17:51
> To: 'ceph-users' 
> Subject: Re: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 14.04
>  
> After some time 4 more OSD:s from one server dropped out and it now seems 
> that only 3 OSD:s from 1 server (I have 3 servers each with 4 OSD:s) are 
> marked as up the other 9 are down. I have shut the servers down for now since 
> I will not have any time to work with this until the weekend.
>  
> Any suggestion of how to get the system online again are most welcome. The 
> OSD disks have not crashed and I hope to be able to get them to join the 
> cluster again and get the data back.
>  
> I am not sure what I did wrong when doing the upgrade from Hammer to 
> Infernalis, at first I thought that it was that I didn´t remove the ceph user 
> and group when upgrading, but now I have no clue, I do not think I actually 
> had a ceph-user before Infernalis.
>  
> Any help or suggestions what I can try to get the system online is most 
> welcome.
>  
> Thanks,
> Claes
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 14.04

2015-11-15 Thread Josef Johansson
cc the list as well
> On 15 Nov 2015, at 23:41, Josef Johansson  wrote:
> 
> Hi,
> 
> So it’s just frozen at that point?
> 
> You should definatly increase the logging and restart the osd. I believe it’s 
> debug osd 20 and debug mon 20. 
> 
> A quick google brings up a case where UUID was crashing. 
> http://serverfault.com/questions/671372/ceph-osd-always-down-in-ubuntu-14-04-1
> 
> /Josef
>> On 15 Nov 2015, at 23:29, Claes Sahlström  wrote:
>> 
>> Hi and thanks for helping.
>>  
>> None that I can when scanning the logfile, it actually looks to me like it 
>> starts up just fine when I start the OSD. This is the last time I restarted 
>> it:
>>  
>> 2015-11-15 22:58:13.445684 7f6f8f9be940  0 set uid:gid to 0:0
>> 2015-11-15 22:58:13.445854 7f6f8f9be940  0 ceph version 9.2.0 
>> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process ceph-osd, pid 5463
>> 2015-11-15 22:58:13.510385 7f6f8f9be940  0 filestore(/ceph/osd.11) backend 
>> xfs (magic 0x58465342)
>> 2015-11-15 22:58:13.511120 7f6f8f9be940  0 
>> genericfilestorebackend(/ceph/osd.11) detect_features: FIEMAP ioctl is 
>> disabled via 'filestore fiemap' config option
>> 2015-11-15 22:58:13.511129 7f6f8f9be940  0 
>> genericfilestorebackend(/ceph/osd.11) detect_features: SEEK_DATA/SEEK_HOLE 
>> is disabled via 'filestore seek data hole' config option
>> 2015-11-15 22:58:13.511158 7f6f8f9be940  0 
>> genericfilestorebackend(/ceph/osd.11) detect_features: splice is supported
>> 2015-11-15 22:58:13.515688 7f6f8f9be940  0 
>> genericfilestorebackend(/ceph/osd.11) detect_features: syncfs(2) syscall 
>> fully supported (by glibc and kernel)
>> 2015-11-15 22:58:13.515934 7f6f8f9be940  0 xfsfilestorebackend(/ceph/osd.11) 
>> detect_features: extsize is supported and your kernel >= 3.5
>> 2015-11-15 22:58:13.600801 7f6f8f9be940  0 filestore(/ceph/osd.11) mount: 
>> enabling WRITEAHEAD journal mode: checkpoint is not enabled
>> 2015-11-15 22:58:39.150619 7f6f8f9be940  1 journal _open 
>> /dev/orange/journal-osd.11 fd 19: 23622320128 bytes, block size 4096 bytes, 
>> directio = 1, aio = 1
>> 2015-11-15 22:58:39.160621 7f6f8f9be940  1 journal _open 
>> /dev/orange/journal-osd.11 fd 19: 23622320128 bytes, block size 4096 bytes, 
>> directio = 1, aio = 1
>> 2015-11-15 22:58:39.192660 7f6f8f9be940  1 filestore(/ceph/osd.11) upgrade
>> 2015-11-15 22:58:39.200192 7f6f8f9be940  0  
>> cls/cephfs/cls_cephfs.cc:136: loading cephfs_size_scan
>> 2015-11-15 22:58:39.200457 7f6f8f9be940  0  cls/hello/cls_hello.cc:305: 
>> loading cls_hello
>> 2015-11-15 22:58:39.206906 7f6f8f9be940  0 osd.11 35462 crush map has 
>> features 1107558400, adjusting msgr requires for clients
>> 2015-11-15 22:58:39.206983 7f6f8f9be940  0 osd.11 35462 crush map has 
>> features 1107558400 was 8705, adjusting msgr requires for mons
>> 2015-11-15 22:58:39.207030 7f6f8f9be940  0 osd.11 35462 crush map has 
>> features 1107558400, adjusting msgr requires for osds
>> 2015-11-15 22:58:40.712757 7f6f8f9be940  0 osd.11 35462 load_pgs
>> 2015-11-15 22:59:09.980042 7f6f8f9be940  0 osd.11 35462 load_pgs opened 874 
>> pgs
>> 2015-11-15 22:59:09.981963 7f6f8f9be940 -1 osd.11 35462 log_to_monitors 
>> {default=true}
>> 2015-11-15 22:59:09.990204 7f6f71312700  0 osd.11 35462 ignoring osdmap 
>> until we have initialized
>> 2015-11-15 22:59:11.194276 7f6f8f9be940  0 osd.11 35462 done with init, 
>> starting boot process
>>  
>> From: Josef Johansson [mailto:jose...@gmail.com]
>> Sent: den 15 november 2015 23:10
>> To: Claes Sahlström 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 
>> 14.04
>>  
>> Hi,
>>  
>> Could you catch any segmentation faults in /var/log/ceph/ceph-osd.11.log ?
>>  
>> Regards,
>> Josef
>>  
>> On 15 Nov 2015, at 23:06, Claes Sahlström  wrote:
>>  
>> Sorry to almost double post, I noticed that it seems like one mon is down, 
>> but they do actually seem to be ok, the 11 that are in falls out and I am 
>> back at 7 healthy OSD:s again:
>>  
>> root@black:/var/lib/ceph/mon# ceph -s
>> cluster ee8eae7a-5994-48bc-bd43-aa07639a543b
>>  health HEALTH_WARN
>> 108 pgs backfill
>>   

Re: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 14.04

2015-11-15 Thread Josef Johansson
Hi,

Could you catch any segmentation faults in /var/log/ceph/ceph-osd.11.log ?
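Something like this usually surfaces a crash quickly if there is one (the patterns are just the usual suspects):

grep -nE 'Segmentation fault|Caught signal|FAILED assert' /var/log/ceph/ceph-osd.11.log | tail -n 20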

Regards,
Josef

> On 15 Nov 2015, at 23:06, Claes Sahlström  wrote:
> 
> Sorry to almost double post, I noticed that it seems like one mon is down, 
> but they do actually seem to be ok, the 11 that are in falls out and I am 
> back at 7 healthy OSD:s again:
>  
> root@black:/var/lib/ceph/mon# ceph -s
> cluster ee8eae7a-5994-48bc-bd43-aa07639a543b
>  health HEALTH_WARN
> 108 pgs backfill
> 37 pgs backfilling
> 2339 pgs degraded
> 105 pgs down
> 237 pgs peering
> 138 pgs stale
> 765 pgs stuck degraded
> 173 pgs stuck inactive
> 138 pgs stuck stale
> 3327 pgs stuck unclean
> 765 pgs stuck undersized
> 2339 pgs undersized
> recovery 1612956/6242357 objects degraded (25.839%)
> recovery 772311/6242357 objects misplaced (12.372%)
> too many PGs per OSD (561 > max 350)
> 4/11 in osds are down
>  monmap e3: 3 mons at 
> {black=172.16.0.201:6789/0,orange=172.16.0.203:6789/0,purple=172.16.0.202:6789/0}
> election epoch 456, quorum 0,1,2 black,purple,orange
>  mdsmap e5: 0/0/1 up
>  osdmap e35627: 12 osds: 7 up, 11 in; 1201 remapped pgs
>   pgmap v8215121: 4608 pgs, 3 pools, 11897 GB data, 2996 kobjects
> 17203 GB used, 8865 GB / 26069 GB avail
> 1612956/6242357 objects degraded (25.839%)
> 772311/6242357 objects misplaced (12.372%)
> 2137 active+undersized+degraded
> 1052 active+clean
>  783 active+remapped
>  137 stale+active+undersized+degraded
>  104 down+peering
>  102 active+remapped+wait_backfill
>   66 remapped+peering
>   65 peering
>   33 active+remapped+backfilling
>   27 activating+undersized+degraded
>   26 active+undersized+degraded+remapped
>   25 activating
>   16 remapped
>   14 inactive
>7 activating+remapped
>6 active+undersized+degraded+remapped+wait_backfill
>4 active+undersized+degraded+remapped+backfilling
>2 activating+undersized+degraded+remapped
>1 down+remapped+peering
>1 stale+remapped+peering
> recovery io 22108 MB/s, 5581 objects/s
>   client io 1065 MB/s rd, 2317 MB/s wr, 11435 op/s
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Claes Sahlström
> Sent: den 15 november 2015 21:56
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] OSD:s failing out after upgrade to 9.2.0 on Ubuntu 14.04
>  
> Hi,
>  
> I have a problem I hope is possible to solve…
>  
> I upgraded to 9.2.0 a couple of days back and I missed this part:
> “If your systems already have a ceph user, upgrading the package will cause 
> problems. We suggest you first remove or rename the existing ‘ceph’ user and 
> ‘ceph’ group before upgrading.”
>  
> I guess that might be the reason why my OSD:s has started to die on me.
>  
> I can get the osd-services when having the file permissions as root:root  and 
> using:
> setuser match path = /var/lib/ceph/$type/$cluster-$i
>  
> I am really not sure where to look to find out what is wrong.
>  
> First when I had upgraded and the OSD:s were restarted then I got a 
> permission denied on the ods-directories and that was solve then adding the 
> “setuser match” in ceph.conf.
>  
> With 5 of 12 OSD:s down I am starting to worry and since I only have one 
> replica I might lose som data. As I mentioned the OSD-services start and 
> “ceph osd in” does not give me any error but the OSD never comes up.
>  
> Any suggestions or helpful tips are most welcome,
>  
> /Claes
>  
>  
>  
>  
>  
>  
> ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 24.0 root default
> -2  8.0 host black
> 3  2.0 osd.3up  1.0  1.0
> 2  2.0 osd.2up  1.0  1.0
> 0  2.0 osd.0up  1.0  1.0
> 1  2.0 osd.1up  1.0  1.0
> -3  8.0 host purple
> 7  2.0 osd.7  down0  1.0
> 6  2.0 osd.6up  1.0  1.0
> 4  2.0 osd.4up  1.0  1.0
> 5  2.0 osd.5up  1.0  1.0
> -4  8.0 host orange
> 11  2.0 osd.11 down0  1.0
> 10  2.0 osd.10 down0  1.0
> 8  2.0 osd.8  down0  1.0
> 9  2.0 osd.9  down0  1.0
>  
>  
>  
>  
>  
>  
> root@black:/var/log/ceph# ceph -s
>

Re: [ceph-users] Potential OSD deadlock?

2015-10-05 Thread Josef Johansson
0.00   1.09   1.80
> sdg   0.00 2.50   17.00   23.50  1032.00  3224.50   210.20
>  0.256.252.568.91   3.41  13.80
> sdh   0.0010.504.50  241.0066.00  7252.0059.62
> 23.00   91.664.22   93.29   2.11  51.85
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.00 0.503.50   91.0092.00   552.7513.65
> 36.27  479.41   81.57  494.71   5.65  53.35
> sdb   0.00 1.006.00  168.00   224.00   962.5013.64
> 83.35  533.92   62.00  550.77   5.75 100.00
> sdc   0.00 1.003.00  171.0016.00  1640.0019.03
>  1.086.18   11.836.08   3.15  54.80
> sdd   0.00 5.005.00  107.50   132.00  6576.75   119.27
>  0.797.06   18.806.51   5.13  57.70
> sde   0.00 0.000.000.00 0.00 0.00 0.00
>  0.000.000.000.00   0.00   0.00
> sdj   0.00 0.000.00 .50 0.00 22346.0040.21
>  0.270.240.000.24   0.11  12.10
> sdk   0.00 0.000.00 1022.00 0.00 33040.0064.66
>  0.680.670.000.67   0.13  13.60
> sdf   0.00 5.502.50   91.0012.00  4977.25   106.72
>  2.29   24.48   14.40   24.76   2.42  22.60
> sdi   0.00 0.00   10.00   69.50   368.00   858.5030.86
>  7.40  586.415.50  669.99   4.21  33.50
> sdm   0.00 4.008.00  210.00   944.00  5833.5062.18
>  1.577.62   18.627.20   4.57  99.70
> sdl   0.00 0.007.50   22.50   104.00   253.2523.82
>  0.144.825.074.73   4.03  12.10
> sdg   0.00 4.001.00   84.00 4.00  3711.7587.43
>  0.586.88   12.506.81   5.75  48.90
> sdh   0.00 3.507.50   44.0072.00  2954.25   117.52
>  1.54   39.50   61.73   35.72   6.40  32.95
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.00 0.001.000.0020.00 0.0040.00
>  0.01   14.50   14.500.00  14.50   1.45
> sdb   0.00 7.00   10.50  198.50  2164.00  6014.7578.27
>  1.949.29   28.908.25   4.77  99.75
> sdc   0.00 2.004.00   95.50   112.00  5152.25   105.81
>  0.949.46   24.258.84   4.68  46.55
> sdd   0.00 1.002.00  131.0010.00  7167.25   107.93
>  4.55   34.23   83.25   33.48   2.52  33.55
> sde   0.00 0.000.000.50 0.00 2.00 8.00
>  0.000.000.000.00   0.00   0.00
> sdj   0.00 0.000.00  541.50 0.00  6468.0023.89
>  0.050.100.000.10   0.09   5.00
> sdk   0.00 0.000.00  509.00 0.00  7704.0030.27
>  0.070.140.000.14   0.10   4.85
> sdf   0.00 0.000.000.00 0.00 0.00 0.00
>  0.000.000.000.00   0.00   0.00
> sdi   0.00 0.003.500.0090.00 0.0051.43
>  0.04   10.14   10.140.00  10.14   3.55
> sdm   0.00 2.005.00  102.50  1186.00  4583.00   107.33
>  0.817.56   23.206.80   2.78  29.85
> sdl   0.0014.00   10.00  216.00   112.00  3645.5033.25
> 73.45  311.05   46.30  323.31   3.51  79.35
> sdg   0.00 1.000.00   52.50 0.00   240.00 9.14
>  0.254.760.004.76   4.48  23.50
> sdh   0.00 0.003.500.0018.00 0.0010.29
>  0.027.007.000.00   7.00   2.45
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.00 0.001.000.00 4.00 0.00 8.00
>  0.01   14.50   14.500.00  14.50   1.45
> sdb   0.00 9.002.00  292.00   192.00 10925.7575.63
> 36.98  100.27   54.75  100.58   2.95  86.60
> sdc   0.00 9.00   10.50  151.0078.00  6771.2584.82
> 36.06   94.60   26.57   99.33   3.77  60.85
> sdd   0.00 0.005.001.0074.0024.0032.67
>  0.035.006.000.00   5.00   3.00
> sde   0.00 0.000.000.00 0.00 0.00 0.00
>  0.000.000.000.00   0.00   0.00
> sdj   0.00 0.000.00  787.50 0.00  9418.0023.92
>  0.070.100.000.10   0.09   6.70
> sdk   0.00 0.000.00 

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Josef Johansson
 64.66
>  0.680.670.000.67   0.13  13.60
> sdf   0.00 5.502.50   91.0012.00  4977.25   106.72
>  2.29   24.48   14.40   24.76   2.42  22.60
> sdi   0.00 0.00   10.00   69.50   368.00   858.5030.86
>  7.40  586.415.50  669.99   4.21  33.50
> sdm   0.00 4.008.00  210.00   944.00  5833.5062.18
>  1.577.62   18.627.20   4.57  99.70
> sdl   0.00 0.007.50   22.50   104.00   253.2523.82
>  0.144.825.074.73   4.03  12.10
> sdg   0.00 4.001.00   84.00 4.00  3711.7587.43
>  0.586.88   12.506.81   5.75  48.90
> sdh   0.00 3.507.50   44.0072.00  2954.25   117.52
>  1.54   39.50   61.73   35.72   6.40  32.95
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.00 0.001.000.0020.00 0.0040.00
>  0.01   14.50   14.500.00  14.50   1.45
> sdb   0.00 7.00   10.50  198.50  2164.00  6014.7578.27
>  1.949.29   28.908.25   4.77  99.75
> sdc   0.00 2.004.00   95.50   112.00  5152.25   105.81
>  0.949.46   24.258.84   4.68  46.55
> sdd   0.00 1.002.00  131.0010.00  7167.25   107.93
>  4.55   34.23   83.25   33.48   2.52  33.55
> sde   0.00 0.000.000.50 0.00 2.00 8.00
>  0.000.000.000.00   0.00   0.00
> sdj   0.00 0.000.00  541.50 0.00  6468.0023.89
>  0.050.100.000.10   0.09   5.00
> sdk   0.00 0.000.00  509.00 0.00  7704.0030.27
>  0.070.140.000.14   0.10   4.85
> sdf   0.00 0.000.000.00 0.00 0.00 0.00
>  0.000.000.000.00   0.00   0.00
> sdi   0.00 0.003.500.0090.00 0.0051.43
>  0.04   10.14   10.140.00  10.14   3.55
> sdm   0.00 2.005.00  102.50  1186.00  4583.00   107.33
>  0.817.56   23.206.80   2.78  29.85
> sdl   0.0014.00   10.00  216.00   112.00  3645.5033.25
> 73.45  311.05   46.30  323.31   3.51  79.35
> sdg   0.00 1.000.00   52.50 0.00   240.00 9.14
>  0.254.760.004.76   4.48  23.50
> sdh   0.00 0.003.500.0018.00 0.0010.29
>  0.027.007.000.00   7.00   2.45
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.00 0.001.000.00 4.00 0.00 8.00
>  0.01   14.50   14.500.00  14.50   1.45
> sdb   0.00 9.002.00  292.00   192.00 10925.7575.63
> 36.98  100.27   54.75  100.58   2.95  86.60
> sdc   0.00 9.00   10.50  151.0078.00  6771.2584.82
> 36.06   94.60   26.57   99.33   3.77  60.85
> sdd   0.00 0.005.001.0074.0024.0032.67
>  0.035.006.000.00   5.00   3.00
> sde   0.00 0.000.000.00 0.00 0.00 0.00
>  0.000.000.000.00   0.00   0.00
> sdj   0.00     0.000.00  787.50 0.00  9418.0023.92
>  0.070.100.000.10   0.09   6.70
> sdk   0.00 0.000.00  766.50 0.00  9400.0024.53
>  0.080.110.000.11   0.10   7.70
> sdf   0.00 0.000.50   41.50 6.00   391.0018.90
>  0.245.799.005.75   5.50  23.10
> sdi   0.0010.009.00  268.0092.00  1618.7512.35
> 68.20  150.90   15.50  155.45   2.36  65.30
> sdm   0.0011.50   10.00  330.5072.00  3201.2519.23
> 68.83  139.38   37.45  142.46   1.84  62.80
> sdl   0.00 2.502.50  228.5014.00  2526.0021.99
> 90.42  404.71  242.40  406.49   4.33 100.00
> sdg   0.00 5.507.50  298.0068.00  5275.2534.98
> 75.31  174.85   26.73  178.58   2.67  81.60
> sdh   0.00 0.002.502.0028.0024.0023.11
>  0.012.785.000.00   2.78   1.25
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Sun, Oct 4, 2015 at 12:16 AM, Josef Johansson  wrote:
> Hi,
>
> I don't know what brand those 4TB spindles are, but I know that mine are very 
> bad at doing write at the same time as read. Especially small read write.
>
> This has an absurdly bad effect 

Re: [ceph-users] Potential OSD deadlock?

2015-10-03 Thread Josef Johansson
Hi,

I don't know what brand those 4TB spindles are, but I know that mine are
very bad at doing writes at the same time as reads, especially small mixed
reads and writes.

This has an absurdly bad effect when doing maintenance on ceph. That being
said, we see a lot of difference between dumpling and hammer in performance
on these drives, most likely due to hammer being able to read and write degraded PGs.

We have run into two different problems along the way. The first was blocked
requests, where we had to upgrade from 64GB of memory on each node to 256GB.
We thought that was the only safe purchase that would make things better.

I believe it worked because more reads were cached, so we had less mixed
read/write on the nodes, thus giving the spindles more room to breathe. It
was a shot in the dark at the time, but the price is not that high even to
just try it out.. compared to 6 people working on it. I believe the IO on
disk was not huge either, but what kills the disks is high latency. How much
bandwidth are the disks using? We had very low.. 3-5MB/s.

The second problem was fragmentation hitting 70%; lowering that to 6% made a
lot of difference. Depending on the IO pattern it builds up at different rates.
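For reference, checking where an OSD stands is cheap; a rough sketch assuming XFS-backed
OSDs mounted under /var/lib/ceph/osd (xfs_db -r is read-only, so it can run against a
live OSD):

# fragmentation factor per OSD filesystem
for dev in $(grep '/var/lib/ceph/osd' /etc/mtab | cut -d ' ' -f 1); do
    echo -n "$dev: "; xfs_db -c frag -r "$dev"
done
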

TL;DR read kills the 4TB spindles.

Hope you guys get out of the woods.
/Josef
On 3 Oct 2015 10:10 pm, "Robert LeBlanc"  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are some
> certified Ceph consultants in the US with whom we can do both remote and
> on-site engagements, please let us know.
>
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers, number
> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three clusters
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
>
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
>
> I don't know what "no flag points reached" means.
>
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13); we will have hundreds of I/Os blocked for sometimes up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy doing normally 8,000 I/O and peaking
> at 15,000. This is all 4TB spindles with SSD journals and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems as soon as PGs are finished being checked, they
> become active and could be the cause for slow I/O while the other PGs
> are being checked.
>
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD, which is received very quickly as well. I've adjusted
> min_free_kbytes and it seems to keep the OSDs from crashing, but it
> doesn't solve the main problem. We don't have swap and there is 64 GB
> of RAM per nodes for 10 OSDs.
>
> Is there something that could cause the kernel to get a packet but not
> be able to dispatch it to Ceph, which could explain why we
> are seeing this blocked I/O for 30+ seconds? Are there some pointers
> to tracing Ceph messages from the network buffer through the kernel to
> the Ceph process?
>
> We can really use some pointers, no matter how outrageous. We've had
> over 6 people looking into this for weeks now and just can't think of
> anything else.
>
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelo

Re: [ceph-users] help! failed to start ceph-mon daemon

2015-09-20 Thread Josef Johansson
Hi,

No, not myself. Did you manage to compile with the --without-lttng flag?
On 21 Sep 2015 02:51, "Zhen Wang"  wrote:

> BTW, did you successfully build the deb package?⊙▽⊙
>
>
> Sent from NetEase Mail Master <http://u.163.com/signature>
>
>
> On 2015-09-20 17:47 , Josef Johansson  Wrote:
>
> Hi,
>
> I would assume the deb knows more about the startup system, and the
> install assumes you know what you're doing.
>
> Maybe there's a parameter to the configure script to say which startup system to
> use?
>
> Anyhow, producing a deb will make it easier for you in the end, I would say.
>
> Regards
> /Josef
> On 20 Sep 2015 11:35, "wikison"  wrote:
>
>> OS : Ubuntu 14.04
>> I have built the ceph source code and installed it with the
>> following commands:
>>
>> apt-get install a series of dependency packages
>> ./autogen.sh
>> ./configure
>> make
>> make install
>>
>> All these processes went well. When I type:
>>
>> which ceph
>>
>> Console shows:
>>
>> /usr/local/bin/ceph
>>
>> So I think I have installed ceph successfully. But when I try to start the
>> ceph-mon daemon, the console tells me:
>>
>> start: unknown job: ceph-mon
>>
>> And I have checked my service list by typing:
>>
>> initctl list | grep ceph
>>
>> And the output is blank.
>>
>> Somebody could tell me why? And how to deal with it? Do I really need to
>> build the deb package and install it ?
>>
>>
>>
>>
>> --
>> Zhen Wang
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Re: help! failed to start ceph-mon daemon

2015-09-20 Thread Josef Johansson
Posting to the ML as well.
-- Forwarded message --
From: "Josef Johansson" 
Date: 20 Sep 2015 11:47
Subject: Re: [ceph-users] help! failed to start ceph-mon daemon
To: "wikison" 
Cc:

Hi,

I would assume the deb knows more about the startup system, and the install
assumes you know what you're doing.

Maybe there's a parameter to the configure script to say which startup system
to use?

Anyhow, producing a deb will make it easier for you in the end, I would say.
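As a middle ground, a rough sketch of what the packaged init job boils down to, assuming
a monitor named "a" that has already been created with ceph-mon --mkfs and a ceph.conf
under /etc/ceph (paths from a default ./configure install):

# run the monitor directly; without -d it daemonizes on its own
ceph-mon -i a -c /etc/ceph/ceph.conf -d

# the source tree may also ship upstart job files (look for something like
# src/upstart/*.conf); copying them to /etc/init/ is what makes
# "start ceph-mon id=a" stop complaining about an unknown job
cp src/upstart/ceph*.conf /etc/init/
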

Regards
/Josef
On 20 Sep 2015 11:35, "wikison"  wrote:

> OS : Ubuntu 14.04
> I have built the ceph source code and installed it with the following
> commands:
>
> apt-get install a series of dependency packages
> ./autogen.sh
> ./configure
> make
> make install
>
> All these processes went well. When I type:
>
> which ceph
>
> Console shows:
>
> /usr/local/bin/ceph
>
> So I think I have installed ceph successfully. But when I try to start the
> ceph-mon daemon, the console tells me:
>
> start: unknown job: ceph-mon
>
> And I have checked my service list by typing:
>
> initctl list | grep ceph
>
> And the output is blank.
>
> Somebody could tell me why? And how to deal with it? Do I really need to
> build the deb package and install it ?
>
>
>
>
> --
> Zhen Wang
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Check networking first?

2015-08-01 Thread Josef Johansson
Hi,

I did a "big-ping" test to verify the network after last major network
problem. If anyone wants to take a peek I could share.
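It is nothing fancy, essentially an all-to-all sweep along these lines (host names are
placeholders, and the iperf part needs "iperf -s" running on each target first):

HOSTS="osd01 osd02 osd03 mon01"
for h in $HOSTS; do
    echo "== $h =="
    # large, non-fragmenting pings catch MTU mismatches and flaky links
    ping -M do -s 8972 -c 20 -q "$h"
    # raw TCP throughput to spot a single bad NIC, port or cable
    iperf -c "$h" -t 10
done
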

Cheers
Josef

lör 1 aug 2015 02:19 Ben Hines  skrev:

> I encountered a similar problem. Incoming firewall ports were blocked
> on one host. So the other OSDs kept marking that OSD as down. But, it
> could talk out, so it kept saying 'hey, i'm up, mark me up' so then
> the other OSDs started trying to send it data again, causing backed up
> requests.. Which goes on, ad infinitum. I had to figure out the
> connectivity problem myself by looking in the OSD logs.
>
> After a while, the cluster should just say 'no, you're not reachable,
> stop putting yourself back into the cluster'.
>
> -Ben
>
> On Fri, Jul 31, 2015 at 11:21 AM, Jan Schermer  wrote:
> > I remember reading that ScaleIO (I think?) does something like this by
> regularly sending reports to a multicast group, thus any node with issues
> (or just overload) is reweighted or avoided automatically on the client.
> OSD map is the Ceph equivalent I guess. It makes sense to gather metrics
> and prioritize better performing OSDs over those with e.g. worse latencies,
> but it needs to update fast. But I believe that _network_ monitoring itself
> ought to be part of… a network monitoring system you should already have
> :-) and not a storage system that just happens to use network. I don’t
> remember seeing anything but a simple ping/traceroute/dns test in any SAN
> interface. If an OSD has issues it might be anything from a failing drive
> to a swapping OS and a number like “commit latency” (= response time
> average from the clients’ perspective) is maybe the ultimate metric of all
> for this purpose, irrespective of the root cause.
> >
> > Nice option would be to read data from all replicas at once - this would
> of course increase load and cause all sorts of issues if abused, but if you
> have an app that absolutely-always-without-fail-must-get-data-ASAP then you
> could enable this in the client (and I think that would be an easy option
> to add). This is actually used in some systems. Harder part is to fail
> nicely when writing (like waiting only for the remote network buffers on 2
> nodes to get the data instead of waiting for commit on all 3 replicas…)
> >
> > Jan
> >
> >> On 31 Jul 2015, at 19:45, Robert LeBlanc  wrote:
> >>
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA256
> >>
> >> Even just a ping at max MTU set with nodefrag could tell a lot about
> >> connectivity issues and latency without a lot of traffic. Using Ceph
> >> messenger would be even better to check firewall ports. I like the
> >> idea of incorporating simple network checks into Ceph. The monitor can
> >> correlate failures and help determine if the problem is related to one
> >> host from the CRUSH map.
> >> - 
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Thu, Jul 30, 2015 at 11:27 PM, Stijn De Weirdt  wrote:
> >>> wouldn't it be nice that ceph does something like this in background
> (some
> >>> sort of network-scrub). debugging network like this is not that easy
> (can't
> >>> expect admins to install e.g. perfsonar on all nodes and/or clients)
> >>>
> >>> something like: every X min, each service X pick a service Y on
> another host
> >>> (assuming X and Y will exchange some communication at some point; like
> osd
> >>> with other osd), send 1MB of data, and make the timing data available
> so we
> >>> can monitor it and detect underperforming links over time.
> >>>
> >>> ideally clients also do this, but not sure where they should
> report/store
> >>> the data.
> >>>
> >>> interpreting the data can be a bit tricky, but extreme outliers will be
> >>> spotted easily, and the main issue with this sort of debugging is
> collecting
> >>> the data.
> >>>
> >>> simply reporting / keeping track of ongoing communications is already
> a big
> >>> step forward, but then we need to have the size of the exchanged data
> to
> >>> allow interpretation (and the timing should be about the network part,
> not
> >>> e.g. flush data to disk in case of an osd). (and obviously sampling is
> >>> enough, no need to have details of every bit send).
> >>>
> >>>
> >>>
> >>> stijn
> >>>
> >>>
> >>> On 07/30/2015 08:04 PM, Mark Nelson wrote:
> 
>  Thanks for posting this!  We see issues like this more often than
> you'd
>  think.  It's really important too because if you don't figure it out
> the
>  natural inclination is to blame Ceph! :)
> 
>  Mark
> 
>  On 07/30/2015 12:50 PM, Quentin Hartman wrote:
> >
> > Just wanted to drop a note to the group that I had my cluster go
> > sideways yesterday, and the root of the problem was networking again.
> > Using iperf I discovered that one of my nodes was only moving data at
> > 1.7Mb / s. Moving that node to a different switch port with a
> different
> > cable has resolved the problem

Re: [ceph-users] Discuss: New default recovery config settings

2015-05-29 Thread Josef Johansson
Hi,

We did it the other way around instead, defining a period where the load is
lighter and turning backfill/recovery on and off around it. Then you want the
backfill values to be what is the default right now.
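A rough sketch of that kind of scheduling from root's crontab, assuming 01:00-06:00 is
the quiet window (the flag names are the stock ones, the times are made up):

# let backfill/recovery run only at night
0 1 * * *  ceph osd unset nobackfill && ceph osd unset norecover
0 6 * * *  ceph osd set nobackfill   && ceph osd set norecover
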

Also, someone said (I think it was Greg?) that if you have problems with
backfill, your cluster's backing store is not fast enough or under too much load.
If 10 OSDs go down at the same time you want those values to be high to
minimize the downtime.

/Josef

fre 29 maj 2015 23:47 Samuel Just  skrev:

> Many people have reported that they need to lower the osd recovery config
> options to minimize the impact of recovery on client io.  We are talking
> about changing the defaults as follows:
>
> osd_max_backfills to 1 (from 10)
> osd_recovery_max_active to 3 (from 15)
> osd_recovery_op_priority to 1 (from 10)
> osd_recovery_max_single_start to 1 (from 5)
>
> We'd like a bit of feedback first though.  Is anyone happy with the
> current configs?  Is anyone using something between these values and the
> current defaults?  What kind of workload?  I'd guess that lowering
> osd_max_backfills to 1 is probably a good idea, but I wonder whether
> lowering osd_recovery_max_active and osd_recovery_max_single_start will
> cause small objects to recover unacceptably slowly.
>
> Thoughts?
> -Sam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve latencies and per-VM performance and latencies

2015-05-20 Thread Josef Johansson
Hi,

Just to add, there’s also a collectd plugin at
https://github.com/rochaporto/collectd-ceph.

Things to check when you have slow read performance is:

*) how much fragmentation is on those xfs partitions? With some workloads you
get high values pretty quickly.
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 1); do sudo xfs_db -c 
frag -r $osd;done
*) 32/48GB RAM on the OSD nodes could be increased. Since XFS is used and all the
objects are files, ceph uses the linux page cache.
If your data set mostly fits into that cache, you can gain _a lot_ of read
performance since there are pretty much no reads from the drives. We’re at 128GB
per OSD node right now. Compared with the options at hand this could be a cheap way
of increasing the performance. It won’t help you out when you’re doing
deep-scrubs or recovery though.
*) turn off logging
[global]
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[osd]
   debug lockdep = 0/0
   debug context = 0/0
   debug crush = 0/0
   debug buffer = 0/0
   debug timer = 0/0
   debug journaler = 0/0
   debug osd = 0/0
   debug optracker = 0/0
   debug objclass = 0/0
   debug filestore = 0/0
   debug journal = 0/0
   debug ms = 0/0
   debug monc = 0/0
   debug tp = 0/0
   debug auth = 0/0
   debug finisher = 0/0
   debug heartbeatmap = 0/0
   debug perfcounter = 0/0
   debug asok = 0/0
   debug throttle = 0/0

*) run htop or vmstat/iostat to determine whether it’s the CPU that’s getting
maxed out or not.
*) just double check the performance and latencies on the network (do it for 
low and high MTU, just to make sure, it’s tough to optimise a lot and get 
bitten by it ;)

2) I don’t see anything in the help section about it
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$osd.asok help
an easy way of getting the osds if you want to change something globally
for osd in $(grep 'osd/ceph' /etc/mtab | cut -d ' ' -f 2 | cut -d '-' -f 2); do 
echo $osd;done

3) this is on one of the OSDs, about the same size as yours but sata drives for 
backing ( a bit more cpu and memory though):

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump | grep -A 1 -e 
op_latency -e op_[rw]_latency -e op_[rw]_process_latency -e journal_latency
  "journal_latency": { "avgcount": 406051353,
  "sum": 230178.927806000},
--
  "op_latency": { "avgcount": 272537987,
  "sum": 4337608.21104},
--
  "op_r_latency": { "avgcount": 111672059,
  "sum": 758059.732591000},
--
  "op_w_latency": { "avgcount": 9308193,
  "sum": 174762.139637000},
--
  "subop_latency": { "avgcount": 273742609,
  "sum": 1084598.823585000},
--
  "subop_w_latency": { "avgcount": 273742609,
  "sum": 1084598.823585000},

Cheers
Josef

> On 20 May 2015, at 10:20, Межов Игорь Александрович  wrote:
> 
> Hi!
> 
> 1. Use it at your own risk. I'm not responsible for any damage you can get by
> running this script.
> 
> 2. What is it for. 
> Ceph osd daemons have a so-called 'admin socket' - a local (to the osd host) unix
> socket that we can
> use to issue commands to that osd. The script connects to a list of osd hosts
> (currently hardcoded in the
> source code, but easily changeable) by ssh, lists all admin sockets under
> /var/run/ceph, greps the
> socket names for osd numbers, and issues the 'perf dump' command to all osds. The JSON
> output is parsed
> by standard python libs and some latency parameters are extracted from it. They are
> coded in the json as tuples
> containing the total amount of time in milliseconds and the count of events, so
> dividing time by count gives the
> average latency for one or more ceph operations. The min/max/avg are computed
> for every host and the
> whole cluster, and the latency of every osd is compared to the minimal value of the cluster
> (or host) and colorized
> to easily detect too-high values.
> You can check usage example in comments at the top of the script and change 
> hardcoded values,
> that are also gathered at the top.
> 
> 3. I use script on Ceph Firefly 0.80.7, but think that it will work on any 
> release, that supports
> admin socket connection to osd, 'perf dump' command and the same json output 
>

Re: [ceph-users] Find out the location of OSD Journal

2015-05-14 Thread Josef Johansson
I tend to use something along the lines of:

for osd in $(grep osd /etc/mtab | cut -d ' ' -f 2); do echo "$(echo $osd | cut 
-d '-' -f 2): $(readlink -f $(readlink $osd/journal))";done | sort -k 2

Cheers,
Josef
 
> On 08 May 2015, at 02:47, Robert LeBlanc  wrote:
> 
> You may also be able to use `ceph-disk list`.
> 
> On Thu, May 7, 2015 at 3:56 AM, Francois Lafont  > wrote:
> Hi,
> 
> Patrik Plank wrote:
> 
> > i cant remember on which drive I install which OSD journal :-||
> > Is there any command to show this?
> 
> It's probably not the answer you hope, but why don't use a simple:
> 
> ls -l /var/lib/ceph/osd/ceph-$id/journal
> 
> ?
> 
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] defragment xfs-backed OSD

2015-04-26 Thread Josef Johansson
Hi,

I’m seeing high fragmentation on my OSDs; is it safe to run xfs_fsr to
defragment them? Any guidelines on using it?

I would assume doing it in off-hours and using a tmp file to save the last
position for the defrag.
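Concretely, that idea might look something like the following; the two-hour budget and
the leftoff path are arbitrary picks, and -f is exactly that resume file:

# reorganise all mounted XFS filesystems for at most 2h, resuming next run where it stopped
xfs_fsr -v -t 7200 -f /var/tmp/.fsrlast_ceph >> /var/log/xfs_fsr.log 2>&1
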

Thanks!
/Josef
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] IOWait on SATA-backed with SSD-journals

2015-04-25 Thread Josef Johansson
Hi,

With inspiration from all the other performance threads going on here, I 
started to investigate on my own as well.

I’m seeing a lot of iowait on the OSDs, and the journals utilised at 2-7%, with
about 8-30MB/s (mostly around 8MB/s write). This is a dumpling cluster. The
goal here is to increase the utilisation to maybe 50%.

Journals: Intel DC S3700, OSD: HGST 4TB

I did some initial testing to make the wbthrottle keep more in the buffer, and
I think I managed to do it; it didn’t affect the journal utilisation though.

There are 12 cores for the 10 OSDs per machine to utilise, and they use about 20%
of them, so I guess there is no bottleneck there.

Well, that’s the problem: I really can’t see any bottleneck with the current
layout. Maybe it’s our copper 10Gb that’s giving us too much latency?

It would be nice to have some kind of bottleneck troubleshooting guide in the ceph docs :)
I’m guessing I’m not the only one on these kinds of specs, and it would be
interesting to see if there’s optimisation to be done.

Hope you guys have a nice weekend :)

Cheers,
Josef

Ping from a host to OSD:

6 packets transmitted, 6 received, 0% packet loss, time 4998ms
rtt min/avg/max/mdev = 0.063/0.107/0.193/0.048 ms

Setting on the OSD

{ "filestore_wbthrottle_xfs_ios_start_flusher": "5000"}
{ "filestore_wbthrottle_xfs_inodes_start_flusher": "5000"}
{ "filestore_wbthrottle_xfs_ios_hard_limit": "1"}
{ "filestore_wbthrottle_xfs_inodes_hard_limit": "1"}
{ "filestore_max_sync_interval": "30”}

From the standard

{ "filestore_wbthrottle_xfs_ios_start_flusher": "500"}
{ "filestore_wbthrottle_xfs_inodes_start_flusher": "500"}
{ "filestore_wbthrottle_xfs_ios_hard_limit": “5000"}
{ "filestore_wbthrottle_xfs_inodes_hard_limit": “5000"}
{ "filestore_max_sync_interval": “5”}


a single dump_historic_ops

{ "description": "osd_op(client.47765822.0:99270434 
rbd_data.1da982c2eb141f2.5825 [stat,write 2093056~8192] 3.8130048c 
e19290)",
  "rmw_flags": 6,
  "received_at": "2015-04-26 08:24:03.226255",
  "age": "87.026653",
  "duration": "0.801927",
  "flag_point": "commit sent; apply or cleanup",
  "client_info": { "client": "client.47765822",
  "tid": 99270434},
  "events": [
{ "time": "2015-04-26 08:24:03.226329",
  "event": "waiting_for_osdmap"},
{ "time": "2015-04-26 08:24:03.230921",
  "event": "reached_pg"},
{ "time": "2015-04-26 08:24:03.230928",
  "event": "started"},
{ "time": "2015-04-26 08:24:03.230931",
  "event": "started"},
{ "time": "2015-04-26 08:24:03.231791",
  "event": "waiting for subops from [22,48]"},
{ "time": "2015-04-26 08:24:03.231813",
  "event": "commit_queued_for_journal_write"},
{ "time": "2015-04-26 08:24:03.231849",
  "event": "write_thread_in_journal_buffer"},
{ "time": "2015-04-26 08:24:03.232075",
  "event": "journaled_completion_queued"},
{ "time": "2015-04-26 08:24:03.232492",
  "event": "op_commit"},
{ "time": "2015-04-26 08:24:03.233134",
  "event": "sub_op_commit_rec"},
{ "time": "2015-04-26 08:24:03.233183",
  "event": "op_applied"},
{ "time": "2015-04-26 08:24:04.028167",
  "event": "sub_op_commit_rec"},
{ "time": "2015-04-26 08:24:04.028174",
  "event": "commit_sent"},
{ "time": "2015-04-26 08:24:04.028182",
  "event": "done"}]},


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace dead SSD journal

2015-04-18 Thread Josef Johansson
Have you looked into the samsung 845 dc? They are not that expensive last
time I checked.

/Josef
On 18 Apr 2015 13:15, "Andrija Panic"  wrote:

> might be true, yes - we had Intel 128GB (intel S3500 or S3700) - but these
> have horrible random/sequential speeds - Samsung 850 PROs are at least 3 times
> faster on sequential, and more than 3 times faster on random/IOPS
> measures.
> And of course modern enterprise drives = ...
>
> On 18 April 2015 at 12:42, Mark Kirkwood 
> wrote:
>
>> Yes, it sure is - my experience with 'consumer' SSD is that they die with
>> obscure firmware bugs (wrong capacity, zero capacity, not detected in bios
>> anymore) rather than flash wearout. It seems that the 'enterprise' tagged
>> drives are less inclined to suffer this fate.
>>
>> Regards
>>
>> Mark
>>
>> On 18/04/15 22:23, Andrija Panic wrote:
>>
>>> these 2 drives are on the regular SATA (on-board) controller, and besides
>>> this, there are 12 x 4TB on the front of the servers - normal backplane on
>>> the front.
>>>
>>> Anyway, we are going to check those dead SSDs on a pc/laptop or so, just
>>> to confirm they are really dead - but this is the way they die, not wear
>>> out, but simply show a different capacity instead of the real one - these were
>>> only 3 months old when they died...
>>>
>>> On 18 April 2015 at 11:55, Josef Johansson >> <mailto:jose...@gmail.com>> wrote:
>>>
>>> If the same chassi/chip/backplane is behind both drives and maybe
>>> other drives in the chassi have troubles,it may be a defect there as
>>> well.
>>>
>>> On 18 Apr 2015 09:42, "Steffen W Sørensen" >> <mailto:ste...@me.com>> wrote:
>>>
>>>
>>>  > On 17/04/2015, at 21.07, Andrija Panic
>>> mailto:andrija.pa...@gmail.com>>
>>> wrote:
>>>  >
>>>  > nah... Samsung 850 PRO 128GB - dead after 3 months - 2 of these
>>> died... wearing level is 96%, so only 4% wasted... (yes I know
>>> these are not enterprise,etc… )
>>> Damn… but maybe your surname says it all - Don’t Panic :) But
>>> making sure same type of SSD devices ain’t of near same age and
>>> doing preventive replacement rotation might be good practice I
>>> guess.
>>>
>>> /Steffen
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Andrija Panić
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
>
> --
>
> Andrija Panić
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace dead SSD journal

2015-04-18 Thread Josef Johansson
If the same chassi/chip/backplane is behind both drives and maybe other
drives in the chassi have troubles,it may be a defect there as well.
On 18 Apr 2015 09:42, "Steffen W Sørensen"  wrote:

>
> > On 17/04/2015, at 21.07, Andrija Panic  wrote:
> >
> > nah... Samsung 850 PRO 128GB - dead after 3 months - 2 of these died...
> wearing level is 96%, so only 4% wasted... (yes I know these are not
> enterprise,etc… )
> Damn… but maybe your surname says it all - Don’t Panic :) But making sure
> same type of SSD devices ain’t of near same age and doing preventive
> replacement rotation might be good practice I guess.
>
> /Steffen
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace dead SSD journal

2015-04-17 Thread Josef Johansson
The massive rebalancing does not affect the SSDs in a good way either. But
from what I've gathered the Pro should be fine. Massive amounts of write
errors in the logs?
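If the drives still answer at all, a quick look at the SMART counters can tell wear from
outright failure; attribute names vary per vendor, so treat the greps as a sketch and the
device names as placeholders:

for dev in /dev/sdX /dev/sdY; do    # the journal SSDs
    smartctl -A "$dev" | grep -i -e wear -e media -e reallocat -e error
done
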

/Josef
On 17 Apr 2015 21:07, "Andrija Panic"  wrote:

> nah... Samsung 850 PRO 128GB - dead after 3 months - 2 of these died...
> wearing level is 96%, so only 4% wasted... (yes I know these are not
> enterprise,etc... )
>
> On 17 April 2015 at 21:01, Josef Johansson  wrote:
>
>> tough luck, hope everything comes up ok afterwards. What models on the
>> SSD?
>>
>> /Josef
>> On 17 Apr 2015 20:05, "Andrija Panic"  wrote:
>>
>>> An SSD died that hosted journals for 6 OSDs - 2 x SSDs died, so 12 OSDs are
>>> down, and rebalancing is about to finish... after which I need to fix the OSDs.
>>>
>>> On 17 April 2015 at 19:01, Josef Johansson  wrote:
>>>
>>>> Hi,
>>>>
>>>> Did 6 other OSDs go down when re-adding?
>>>>
>>>> /Josef
>>>>
>>>> On 17 Apr 2015, at 18:49, Andrija Panic 
>>>> wrote:
>>>>
>>>> 12 osds down - I expect less work with removing and adding osd?
>>>> On Apr 17, 2015 6:35 PM, "Krzysztof Nowicki" <
>>>> krzysztof.a.nowi...@gmail.com> wrote:
>>>>
>>>>> Why not just wipe out the OSD filesystem, run ceph-osd --mkfs with the
>>>>> existing OSD UUID, copy the keyring and let it populate itself?
>>>>>
>>>>> pt., 17 kwi 2015 o 18:31 użytkownik Andrija Panic <
>>>>> andrija.pa...@gmail.com> napisał:
>>>>>
>>>>>> Thx guys, thats what I will be doing at the end.
>>>>>>
>>>>>> Cheers
>>>>>> On Apr 17, 2015 6:24 PM, "Robert LeBlanc" 
>>>>>> wrote:
>>>>>>
>>>>>>> Delete and re-add all six OSDs.
>>>>>>>
>>>>>>> On Fri, Apr 17, 2015 at 3:36 AM, Andrija Panic <
>>>>>>> andrija.pa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I have 1 SSD that hosted 6 OSD's Journals, that is dead, so 6 OSD
>>>>>>>> down, ceph rebalanced etc.
>>>>>>>>
>>>>>>>> Now I have new SSD inside, and I will partition it etc - but would
>>>>>>>> like to know, how to proceed now, with the journal recreation for 
>>>>>>>> those 6
>>>>>>>> OSDs that are down now.
>>>>>>>>
>>>>>>>> Should I flush journal (where to, journals doesnt still exist...?),
>>>>>>>> or just recreate journal from scratch (making symboliv links again: ln 
>>>>>>>> -s
>>>>>>>> /dev/$DISK$PART /var/lib/ceph/osd/ceph-$ID/journal) and starting OSDs.
>>>>>>>>
>>>>>>>> I expect the folowing procedure, but would like confirmation please:
>>>>>>>>
>>>>>>>> rm /var/lib/ceph/osd/ceph-$ID/journal -f (sym link)
>>>>>>>> ln -s /dev/SDAxxx /var/lib/ceph/osd/ceph-$ID/journal
>>>>>>>> ceph-osd -i $ID --mkjournal
>>>>>>>> ll /var/lib/ceph/osd/ceph-$ID/journal
>>>>>>>> service ceph start osd.$ID
>>>>>>>>
>>>>>>>> Any thought greatly appreciated !
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Andrija Panić
>>>>>>>>
>>>>>>>> ___
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>>>
>>>>>>>  ___
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>  ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Andrija Panić
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>
>
> --
>
> Andrija Panić
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace dead SSD journal

2015-04-17 Thread Josef Johansson
tough luck, hope everything comes up ok afterwards. What models on the SSD?

/Josef
On 17 Apr 2015 20:05, "Andrija Panic"  wrote:

> An SSD died that hosted journals for 6 OSDs - 2 x SSDs died, so 12 OSDs are
> down, and rebalancing is about to finish... after which I need to fix the OSDs.
>
> On 17 April 2015 at 19:01, Josef Johansson  wrote:
>
>> Hi,
>>
>> Did 6 other OSDs go down when re-adding?
>>
>> /Josef
>>
>> On 17 Apr 2015, at 18:49, Andrija Panic  wrote:
>>
>> 12 osds down - I expect less work with removing and adding osd?
>> On Apr 17, 2015 6:35 PM, "Krzysztof Nowicki" <
>> krzysztof.a.nowi...@gmail.com> wrote:
>>
>>> Why not just wipe out the OSD filesystem, run ceph-osd --mkfs with the
>>> existing OSD UUID, copy the keyring and let it populate itself?
>>>
>>> pt., 17 kwi 2015 o 18:31 użytkownik Andrija Panic <
>>> andrija.pa...@gmail.com> napisał:
>>>
>>>> Thx guys, thats what I will be doing at the end.
>>>>
>>>> Cheers
>>>> On Apr 17, 2015 6:24 PM, "Robert LeBlanc"  wrote:
>>>>
>>>>> Delete and re-add all six OSDs.
>>>>>
>>>>> On Fri, Apr 17, 2015 at 3:36 AM, Andrija Panic <
>>>>> andrija.pa...@gmail.com> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have 1 SSD that hosted 6 OSD's Journals, that is dead, so 6 OSD
>>>>>> down, ceph rebalanced etc.
>>>>>>
>>>>>> Now I have new SSD inside, and I will partition it etc - but would
>>>>>> like to know, how to proceed now, with the journal recreation for those 6
>>>>>> OSDs that are down now.
>>>>>>
>>>>>> Should I flush journal (where to, journals doesnt still exist...?),
>>>>>> or just recreate journal from scratch (making symboliv links again: ln -s
>>>>>> /dev/$DISK$PART /var/lib/ceph/osd/ceph-$ID/journal) and starting OSDs.
>>>>>>
>>>>>> I expect the folowing procedure, but would like confirmation please:
>>>>>>
>>>>>> rm /var/lib/ceph/osd/ceph-$ID/journal -f (sym link)
>>>>>> ln -s /dev/SDAxxx /var/lib/ceph/osd/ceph-$ID/journal
>>>>>> ceph-osd -i $ID --mkjournal
>>>>>> ll /var/lib/ceph/osd/ceph-$ID/journal
>>>>>> service ceph start osd.$ID
>>>>>>
>>>>>> Any thought greatly appreciated !
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Andrija Panić
>>>>>>
>>>>>> ___
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>  ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>  ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>
>
> --
>
> Andrija Panić
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace dead SSD journal

2015-04-17 Thread Josef Johansson
Hi,

Did 6 other OSDs go down when re-adding?

/Josef

> On 17 Apr 2015, at 18:49, Andrija Panic  wrote:
> 
> 12 osds down - I expect less work with removing and adding osd?
> 
> On Apr 17, 2015 6:35 PM, "Krzysztof Nowicki"  > wrote:
> Why not just wipe out the OSD filesystem, run ceph-osd --mkfs with the 
> existing OSD UUID, copy the keyring and let it populate itself?
> 
> pt., 17 kwi 2015 o 18:31 użytkownik Andrija Panic  > napisał:
> Thx guys, thats what I will be doing at the end.
> 
> Cheers
> 
> On Apr 17, 2015 6:24 PM, "Robert LeBlanc"  > wrote:
> Delete and re-add all six OSDs.
> 
> On Fri, Apr 17, 2015 at 3:36 AM, Andrija Panic  > wrote:
> Hi guys,
> 
> I have 1 SSD that hosted 6 OSD's Journals, that is dead, so 6 OSD down, ceph 
> rebalanced etc.
> 
> Now I have new SSD inside, and I will partition it etc - but would like to 
> know, how to proceed now, with the journal recreation for those 6 OSDs that 
> are down now.
> 
> Should I flush journal (where to, journals doesnt still exist...?), or just 
> recreate journal from scratch (making symboliv links again: ln -s 
> /dev/$DISK$PART /var/lib/ceph/osd/ceph-$ID/journal) and starting OSDs.
> 
> I expect the folowing procedure, but would like confirmation please:
> 
> rm /var/lib/ceph/osd/ceph-$ID/journal -f (sym link)
> ln -s /dev/SDAxxx /var/lib/ceph/osd/ceph-$ID/journal
> ceph-osd -i $ID --mkjournal
> ll /var/lib/ceph/osd/ceph-$ID/journal
> service ceph start osd.$ID
> 
> Any thought greatly appreciated !
> 
> Thanks,
> 
> -- 
> 
> Andrija Panić
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] metadata management in case of ceph object storage and ceph block storage

2015-04-16 Thread Josef Johansson
Hi,

Maybe others had your mail go into junk as well; that is at least why I
did not see it.

To your question, which I’m not sure I understand completely.

In Ceph you have three distinct types of services,

Mon, Monitors
MDS, Metadata Servers
OSD, Object Storage Devices

And some other concepts

PG, placement group
Object
Pool

So a Pool contains PGs that contain Objects, in that order.
Monitors keep track of pools and PGs, and the objects are kept on the OSDs.

In case you’re running CephFS, the Ceph File System, you also have files, which 
the MDS keeps track of.

So yes, you don’t need the MDS if you just keep track of block storage and 
object storage. (i.e. images for KVM)

So the Mon keeps track of the metadata for the Pools and PGs,
and the MDS keeps track of all the files, hence the MDS should have at least 10x
the memory of what the Mon has.
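One way to see this in practice: an object's location is computed with CRUSH rather than
looked up in any metadata server, and you can replay that calculation from the command
line (pool and object name are just examples, the object does not even have to exist):

ceph osd map rbd some-object
# prints something like:
#   osdmap eNNN pool 'rbd' (N) object 'some-object' -> pg N.xxxxxxxx -> up [a,b], acting [a,b]
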

I’m no Ceph expert, especially not on CephFS, but this is my picture of it :)

Maybe the architecture docs could help you out? 
http://docs.ceph.com/docs/master/architecture/#cluster-map 


Hope that resolves your question.

Cheers,
Josef

> On 06 Apr 2015, at 18:51, pragya jain  wrote:
> 
> Please somebody reply my queries.
> 
> Thank yuo
>  
> -
> Regards
> Pragya Jain
> Department of Computer Science
> University of Delhi
> Delhi, India
> 
> 
> 
> On Saturday, 4 April 2015 3:24 PM, pragya jain  wrote:
> 
> 
> hello all!
> 
> As the documentation said "One of the unique features of Ceph is that it 
> decouples data and metadata".
> for applying the mechanism of decoupling, Ceph uses Metadata Server (MDS) 
> cluster.
> MDS cluster manages metadata operations, like open or rename a file
> 
> On the other hand, Ceph implementation for object storage as a service and 
> block storage as a service does not require MDS implementation.
> 
> My question is:
> In case of object storage and block storage, how does Ceph manage the 
> metadata?
> 
> Please help me to understand this concept more clearly.
> 
> Thank you
>  
> -
> Regards
> Pragya Jain
> Department of Computer Science
> University of Delhi
> Delhi, India
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low power single disk nodes

2015-04-10 Thread Josef Johansson
Hi,

You have these guys as well,
http://www.seagate.com/gb/en/products/enterprise-servers-storage/nearline-storage/kinetic-hdd/

I talked to them during WHD, and they said that it's not fit for ceph if
you pack 70 of them in one chassis because of the noise level. I would
assume that a 1U with a lot of sound and vibration canceling could cut it
though.

They use the SAS interface as an Ethernet port, pretty nifty.

Funny story, the first idea for this came 15 years ago, and they said they
needed at least 16MB of RAM on the chip to make it work. Right :-)

/Josef
On 10 Apr 2015 14:32, "10 minus"  wrote:

> Hi ,
>
> The question is what you want to use it for. As an OSD it won't cut it.
> Maybe as an iscsi target and YMMV
>
> I played around with an OEM product from Taiwan ... I don't remember the
> name.
>
> They had an Armada XP ARM SoC and a SATA port + 2 ethernet. Pretty
> nifty.
> Downsides were:
> - 1GB RAM.
> - OS was using a custom ubuntu.
> - Ceph had to be compiled and deployed.
> - if you put a steady i/o and/or CPU load on it for 15-20 min, the hardware
> will just pause,
> causing ceph to crash.
>
> hth
>
>
> On Fri, Apr 10, 2015 at 10:04 AM, Philip Williams  wrote:
>
>> James Hughes talked about these at MSST last year, and some of his
>> colleagues demonstrated the hardware: <
>> http://storageconference.us/2014/Presentations/Hughes.pdf>
>>
>> For tinkering purposes there is java based simulator: <
>> https://developers.seagate.com/display/KV/Kinetic+Open+Storage+Documentation+Wiki
>> >
>>
>> The drives do use a key/value interface.
>>
>> --phil
>>
>> > On 9 Apr 2015, at 17:01, Mark Nelson  wrote:
>> >
>> > Notice that this is under their emerging technologies section.  I don't
>> think you can buy them yet.  Hopefully we'll know more as time goes on. :)
>> >
>> > Mark
>> >
>> >
>> > On 04/09/2015 10:52 AM, Stillwell, Bryan wrote:
>> >> These are really interesting to me, but how can you buy them?  What's
>> the
>> >> performance like in ceph?  Are they using the keyvaluestore backend, or
>> >> something specific to these drives?  Also what kind of chassis do they
>> go
>> >> into (some kind of ethernet JBOD)?
>> >>
>> >> Bryan
>> >>
>> >> On 4/9/15, 9:43 AM, "Mark Nelson"  wrote:
>> >>
>> >>> How about drives that run Linux with an ARM processor, RAM, and an
>> >>> ethernet port right on the drive?  Notice the Ceph logo. :)
>> >>>
>> >>>
>> https://www.hgst.com/science-of-storage/emerging-technologies/open-etherne
>> >>> t-drive-architecture
>> >>>
>> >>> Mark
>> >>>
>> >>> On 04/09/2015 10:37 AM, Scott Laird wrote:
>>  Minnowboard Max?  2 atom cores, 1 SATA port, and a real (non-USB)
>>  Ethernet port.
>> 
>> 
>>  On Thu, Apr 9, 2015, 8:03 AM p...@philw.com 
>>  mailto:p...@philw.com>> wrote:
>> 
>>  Rather expensive option:
>> 
>>  Applied Micro X-Gene, overkill for a single disk, and only really
>>  available in a
>>  development kit format right now.
>> 
>> 
>>  <
>> https://www.apm.com/products/__data-center/x-gene-family/x-__c1-developm
>>  ent-kits/
>> 
>>  <
>> https://www.apm.com/products/data-center/x-gene-family/x-c1-development-
>>  kits/>>
>> 
>>  Better Option:
>> 
>>  Ambedded CY7 - 7 nodes in 1U half Depth, 6 positions for SATA
>> disks,
>>  and one
>>  node with mSATA SSD
>> 
>>  >  >
>> 
>>  --phil
>> 
>>   > On 09 April 2015 at 15:57 Quentin Hartman
>>  > qhart...@direwolfdigital.com>>
>>   > wrote:
>>   >
>>   >  I'm skeptical about how well this would work, but a Banana Pi
>>  might be a
>>   > place to start. Like a raspberry pi, but it has a SATA
>> connector:
>>   > http://www.bananapi.org/
>>   >
>>   >  On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg
>>  mailto:jer...@update.uu.se>
>>   > > >
>>  wrote:
>>   >> >Hello ceph users,
>>   > >
>>   > >Is anyone running any low powered single disk nodes with
>>  Ceph now?
>>   > > Calxeda seems to be no more according to Wikipedia. I do not
>>  think HP
>>   > > moonshot is what I am looking for - I want stand-alone
>> nodes,
>>  not server
>>   > > cartridges integrated into server chassis. And I do not
>> want to
>>  be locked to
>>   > > a single vendor.
>>   > >
>>   > >I was playing with Raspberry Pi 2 for signage when I
>> thought
>>  of my old
>>   > > experiments with Ceph.
>>   > >
>>   > >I am thinking of for example Odroid-C1 or Odroid-XU3
>> Lite or
>>  maybe
>>   > > something with a low-power In

Re: [ceph-users] Finding out how much data is in the journal

2015-03-23 Thread Josef Johansson

> On 23 Mar 2015, at 03:58, Haomai Wang  wrote:
> 
> On Mon, Mar 23, 2015 at 2:53 AM, Josef Johansson  <mailto:jose...@gmail.com>> wrote:
>> Hi all!
>> 
>> Trying to figure out how much my journals are used, using SSDs as journals 
>> and SATA-drives as storage, I dive into perf dump.
>> But I can’t figure out why journal_queue_bytes is at constant 0. The only 
>> thing that differs is dirtied in WBThrottle.
> 
> journal_queue_bytes means how much journal data is in the queue
> waiting for the journal thread to process it.
> 
> As of now the osd can't tell you how much data is in the journal waiting for
> writeback and sync.
> 
Hm, who knows that then?
Is this the WBThrottle value?

No way of knowing how much journal is used at all?

Maybe I thought of this wrong, so let me check if I understand you correctly:

Data is written to OSD
The journal saves it to the queue
Waits for others to sync the requests as well
Sends a ACK to the client
Starts writing to the filestore buffer
filestore buffer commits when limits are met (inodes/ios-dirtied,
filestore_sync_max_interval)

So if I’m seeing latency and want to see if my journals are lazy, I should
indeed look at journal_queue_bytes; if that’s zero, it’s behaving well.
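To keep an eye on that over time, a small sketch that polls the relevant counters once a
second (same osd.0 admin socket as in the perf dump quoted further down):

watch -n 1 "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | grep -e journal_queue -e _dirtied"
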

Thanks,
Josef

>> 
>> Maybe I’ve disabled that when setting the in-memory debug variables to 0/0?
>> 
>> Thanks,
>> Josef
>> 
>> # ceph --version
>> ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>> 
>> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep 
>> journal
>>  "journaler": "0\/0",
>>  "journal": "0\/0",
>>  "journaler_allow_split_entries": "true",
>>  "journaler_write_head_interval": "15",
>>  "journaler_prefetch_periods": "10",
>>  "journaler_prezero_periods": "5",
>>  "journaler_batch_interval": "0.001",
>>  "journaler_batch_max": "0",
>>  "mds_kill_journal_at": "0",
>>  "mds_kill_journal_expire_at": "0",
>>  "mds_kill_journal_replay_at": "0",
>>  "osd_journal": "\/var\/lib\/ceph\/osd\/ceph-0\/journal",
>>  "osd_journal_size": "25600",
>>  "filestore_fsync_flushes_journal_data": "false",
>>  "filestore_journal_parallel": "false",
>>  "filestore_journal_writeahead": "false",
>>  "filestore_journal_trailing": "false",
>>  "journal_dio": "true",
>>  "journal_aio": "true",
>>  "journal_force_aio": "false",
>>  "journal_max_corrupt_search": "10485760",
>>  "journal_block_align": "true",
>>  "journal_write_header_frequency": "0",
>>  "journal_max_write_bytes": "10485760",
>>  "journal_max_write_entries": "100",
>>  "journal_queue_max_ops": "300",
>>  "journal_queue_max_bytes": "33554432",
>>  "journal_align_min_size": "65536",
>>  "journal_replay_from": "0",
>>  "journal_zero_on_create": "false",
>>  "journal_ignore_corruption": "false",
>> 
>> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
>> { "WBThrottle": { "bytes_dirtied": 32137216,
>>  "bytes_wb": 0,
>>  "ios_dirtied": 1445,
>>  "ios_wb": 0,
>>  "inodes_dirtied": 491,
>>  "inodes_wb": 0},
>>  "filestore": { "journal_queue_max_ops": 300,
>>  "journal_queue_ops": 0,
>>  "journal_ops": 116105073,
>>  "journal_queue_max_bytes": 33554432,
>>  "journal_queue_bytes": 0,
>>  "journal_bytes": 3160504432839,
>>  "journal_latency": { "avgcount": 116105073,
>>  "sum": 64951.260611000},
>>  "journal_wr": 112261141,
>>  "journal_wr_bytes": { "avgcount": 112261141,
>>  "sum": 3426141528064},
>>  "op_queue_max_ops": 50,
>>  "op_queue_ops": 0,
>>  "ops": 116105073,
>>  "op_queue_max_bytes": 104857600,
>>  "op_queue_bytes": 0,
>>  "bytes": 3159111228243,
>>  "ap

[ceph-users] Finding out how much data is in the journal

2015-03-22 Thread Josef Johansson
Hi all!

Trying to figure out how much my journals are used (using SSDs as journals and
SATA drives as storage), I dove into perf dump.
But I can’t figure out why journal_queue_bytes is constantly 0. The only thing
that changes is the dirtied counters in WBThrottle.

Maybe I’ve disabled that when setting the in-memory debug variables to 0/0?

Thanks,
Josef

# ceph --version
ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep journal
  "journaler": "0\/0",
  "journal": "0\/0",
  "journaler_allow_split_entries": "true",
  "journaler_write_head_interval": "15",
  "journaler_prefetch_periods": "10",
  "journaler_prezero_periods": "5",
  "journaler_batch_interval": "0.001",
  "journaler_batch_max": "0",
  "mds_kill_journal_at": "0",
  "mds_kill_journal_expire_at": "0",
  "mds_kill_journal_replay_at": "0",
  "osd_journal": "\/var\/lib\/ceph\/osd\/ceph-0\/journal",
  "osd_journal_size": "25600",
  "filestore_fsync_flushes_journal_data": "false",
  "filestore_journal_parallel": "false",
  "filestore_journal_writeahead": "false",
  "filestore_journal_trailing": "false",
  "journal_dio": "true",
  "journal_aio": "true",
  "journal_force_aio": "false",
  "journal_max_corrupt_search": "10485760",
  "journal_block_align": "true",
  "journal_write_header_frequency": "0",
  "journal_max_write_bytes": "10485760",
  "journal_max_write_entries": "100",
  "journal_queue_max_ops": "300",
  "journal_queue_max_bytes": "33554432",
  "journal_align_min_size": "65536",
  "journal_replay_from": "0",
  "journal_zero_on_create": "false",
  "journal_ignore_corruption": "false",

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
{ "WBThrottle": { "bytes_dirtied": 32137216,
  "bytes_wb": 0,
  "ios_dirtied": 1445,
  "ios_wb": 0,
  "inodes_dirtied": 491,
  "inodes_wb": 0},
  "filestore": { "journal_queue_max_ops": 300,
  "journal_queue_ops": 0,
  "journal_ops": 116105073,
  "journal_queue_max_bytes": 33554432,
  "journal_queue_bytes": 0,
  "journal_bytes": 3160504432839,
  "journal_latency": { "avgcount": 116105073,
  "sum": 64951.260611000},
  "journal_wr": 112261141,
  "journal_wr_bytes": { "avgcount": 112261141,
  "sum": 3426141528064},
  "op_queue_max_ops": 50,
  "op_queue_ops": 0,
  "ops": 116105073,
  "op_queue_max_bytes": 104857600,
  "op_queue_bytes": 0,
  "bytes": 3159111228243,
  "apply_latency": { "avgcount": 116105073,
  "sum": 247410.066048000},
  "committing": 0,
  "commitcycle": 267176,
  "commitcycle_interval": { "avgcount": 267176,
  "sum": 1873193.631124000},
  "commitcycle_latency": { "avgcount": 267176,
  "sum": 390421.06299},
  "journal_full": 0,
  "queue_transaction_latency_avg": { "avgcount": 116105073,
  "sum": 378.948923000}},
  "leveldb": { "leveldb_get": 699871216,
  "leveldb_transaction": 522440246,
  "leveldb_compact": 0,
  "leveldb_compact_range": 0,
  "leveldb_compact_queue_merge": 0,
  "leveldb_compact_queue_len": 0},
  "mutex-FileJournal::completions_lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "mutex-FileJournal::finisher_lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "mutex-FileJournal::write_lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "mutex-FileJournal::writeq_lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "mutex-JOS::ApplyManager::apply_lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "mutex-JOS::ApplyManager::com_lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "mutex-JOS::SubmitManager::lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "mutex-WBThrottle::lock": { "wait": { "avgcount": 0,
  "sum": 0.0}},
  "osd": { "opq": 0,
  "op_wip": 0,
  "op": 83920139,
  "op_in_bytes": 1075345387581,
  "op_out_bytes": 954428806331,
  "op_latency": { "avgcount": 83920139,
  "sum": 1279934.620502000},
  "op_r": 32399024,
  "op_r_out_bytes": 953657617715,
  "op_r_latency": { "avgcount": 32399024,
  "sum": 238792.729743000},
  "op_w": 3321731,
  "op_w_in_bytes": 52637941027,
  "op_w_rlat": { "avgcount": 3321731,
  "sum": 15577.62004},
  "op_w_latency": { "avgcount": 3321731,
  "sum": 62541.746123000},
  "op_rw": 48199384,
  "op_rw_in_bytes": 1022707446554,
  "op_rw_out_bytes": 771188616,
  "op_rw_rlat": { "avgcount": 48199384,
  "sum": 169776.087496000},
  "op_rw_latency": { "avgcount": 48199384,
  "sum": 978600.144636000},
  "subop": 73746080,
  "subop_in_bytes": 2008774955062,
  "subop_latency": { "avgcount": 73746080,
  "sum": 346096.627047000},
  "subop_w": 0,
  "subop_w_in_bytes": 2008774955062,
  "subop_w_latency": { "avgcount": 73746080,
  

Re: [ceph-users] Uneven CPU usage on OSD nodes

2015-03-21 Thread Josef Johansson
I'm neither a dev nor a well-informed Cepher. But I've seen posts suggesting
that the pg count may be set too high, see
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg16205.html

Also, we use 128GB+ in production on the OSD servers with 10 OSDs per server
because it boosts the read cache, so you may want to increase it
anyhow, depending on your kind of load.

Regards,
Josef
On 20 Mar 2015 23:19, "Craig Lewis"  wrote:

> I would say you're a little light on RAM.  With 4TB disks 70% full, I've
> seen some ceph-osd processes using 3.5GB of RAM during recovery.  You'll be
> fine during normal operation, but you might run into issues at the worst
> possible time.
>
> I have 8 OSDs per node, and 32G of RAM.  I've had ceph-osd processes start
> swapping, and that's a great way to get them kicked out for being
> unresponsive.
>
>
> I'm not a dev, but I can make some wild and uninformed guesses :-) .  The
> primary OSD uses more CPU than the replicas, and I suspect that you have
> more primaries on the hot nodes.
>
> Since you're testing, try repeating the test on 3 OSD nodes instead of 4.
> If you don't want to run that test, you can generate a histogram from ceph
> pg dump data, and see if there are more primary osds (the first one in the
> acting array) on the hot nodes.
>
>
>
> On Wed, Mar 18, 2015 at 7:18 AM, f...@univ-lr.fr  wrote:
>
>> Hi to the ceph-users list !
>>
>> We're setting up a new Ceph infrastructure :
>> - 1 MDS admin node
>> - 4 OSD storage nodes (60 OSDs)
>>   each of them running a monitor
>> - 1 client
>>
>> Each 32GB RAM/16 cores OSD node supports 15 x 4TB SAS OSDs (XFS) and 1
>> SSD with 5GB journal partitions, all in JBOD attachement.
>> Every node has 2x10Gb LACP attachement.
>> The OSD nodes are freshly installed with puppet then from the admin node
>> Default OSD weight in the OSD tree
>> 1 test pool with 4096 PGs
>>
>> During setup phase, we're trying to qualify the performance
>> characteristics of our setup.
>> Rados benchmark are done from a client with these commandes :
>> rados -p pool -b 4194304 bench 60 write -t 32 --no-cleanup
>> rados -p pool -b 4194304 bench 60 seq -t 32 --no-cleanup
>>
>> Each time we observed a recurring phenomenon: 2 of the 4 OSD nodes have
>> twice the CPU load:
>> http://www.4shared.com/photo/Ua0umPVbba/UnevenLoad.html
>> (What to look at is the real-time %CPU and the cumulated CPU time per
>> ceph-osd process)
>>
>> And after a fresh complete reinstall to be sure, this twice-as-high CPU
>> load is observed but not on the same 2 nodes :
>> http://www.4shared.com/photo/2AJfd1B_ba/UnevenLoad-v2.html
>>
>> Nothing obvious about the installation seems able to explain that.
>>
>> The CRUSH distribution doesn't have more than 4.5% inequality
>> between the 4 OSD nodes for the primary OSDs of the objects, and less than
>> 3% between the hosts if we consider the whole acting sets for the objects
>> used during the benchmark. These differences are nowhere near large enough
>> to explain the CPU loads. So the cause has to be elsewhere.
>>
>> I cannot be sure it has no impact on performance. Even if we have enough
>> CPU core headroom, logic would say it has to have some consequences on
>> latency and also on performance.
>>
>> Would someone have any idea, or could someone reproduce the test on their
>> setup to see if this is a common behaviour?
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
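
On the histogram of primaries that Craig suggests above, a rough sketch, 
assuming jq is available and that this era's "ceph pg dump --format json" 
exposes a "pg_stats" array with an "acting" list per PG:

ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_stats[].acting[0]' | sort -n | uniq -c | sort -rn | head

Mapping the top OSD ids back to hosts with "ceph osd tree" then shows whether 
the hot nodes simply hold more primaries.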


Re: [ceph-users] SSD Hardware recommendation

2015-03-20 Thread Josef Johansson

> On 19 Mar 2015, at 08:17, Christian Balzer  wrote:
> 
> On Wed, 18 Mar 2015 08:59:14 +0100 Josef Johansson wrote:
> 
>> Hi,
>> 
>>> On 18 Mar 2015, at 05:29, Christian Balzer  wrote:
>>> 
>>> 
>>> Hello,
>>> 
>>> On Wed, 18 Mar 2015 03:52:22 +0100 Josef Johansson wrote:
>> 
> [snip]
>>>> We though of doing a cluster with 3 servers, and any recommendation of
>>>> supermicro servers would be appreciated.
>>>> 
>>> Why 3, replication of 3? 
>>> With Intel SSDs and diligent (SMART/NAGIOS) wear level monitoring I'd
>>> personally feel safe with a replication factor of 2.
>>> 
>> I’ve seen recommendations  of replication 2!  The Intel SSDs are indeed
>> endurable. This is only with Intel SSDs I assume?
> 
> From the specifications and reviews I've seen the Samsung 845DC PRO, the
> SM 843T and even more so the SV843 
> (http://www.samsung.com/global/business/semiconductor/product/flash-ssd/overview
> don't you love it when the same company has different, competing
> products?) should do just fine when it comes to endurance and performance.
> Alas I have no first hand experience with either, just the
> (read-optimized) 845DC EVO.
> 
The 845DC Pro does look really nice, comparable with the S3700 even on TBW.
The price is what really does it, as it’s almost a third of the S3700’s.

With a replication factor of 3 it’s the same price as the S3610 with a 
replication factor of 2.

How enterprise-ish is it to run with a replication factor of 2, according to 
the Inktank guys?

Really thinking of going with the 845DC Pro here, actually.
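
On the "diligent (SMART/NAGIOS) wear level monitoring" Christian mentions, a 
minimal check looks something like this (a sketch only; the attribute differs 
per vendor, e.g. 233 Media_Wearout_Indicator on the Intel DC drives and 177 
Wear_Leveling_Count on most Samsungs, and the device path is an assumption):

smartctl -A /dev/sdX | grep -E 'Media_Wearout_Indicator|Wear_Leveling_Count'

Wrapping that in a Nagios check that alarms when the normalised value drops 
below a threshold is then the easy part.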
> 
>> This 1U
>> http://www.supermicro.com.tw/products/system/1U/1028/SYS-1028U-TR4T_.cfm
>> <http://www.supermicro.com.tw/products/system/1U/1028/SYS-1028U-TR4T_.cfm>
>> is really nice, missing the SuperDOM peripherals though.. 
> While I certainly see use cases for SuperDOM, not all models have 2
> connectors, so no chance to RAID1 things, thus the need to _definitely_
> have to pull the server out (and re-install the OS) should it fail.
Yeah, I fancy using hot swap for OS disks, and with 24 front hot-swap bays 
there’s plenty of room for a couple of OS drives =)
The 2U also has the possibility of an extra 2x10GbE card, totalling 4x10GbE, 
which is needed.
> 
>> so you really
>> get 8 drives if you need two for OS. And the rails.. don’t get me
>> started, but lately they do just snap into the racks! No screws needed.
>> That’s a refresh from earlier 1U SM rails.
>> 
> Ah, the only 1U servers I'm currently deploying from SM are older ones, so
> still no snap-in rails. Everything 2U has been that way for at least 2
> years, though. ^^
It’s awesome I tell you. :)

Cheers,
Josef

> 
> Christian
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Fusion Communications
> http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] World hosting days 2015

2015-03-18 Thread Josef Johansson
Hi!

We’re three guys who are going to attend the whole event; it would indeed be 
nice to meet up with some cephers.

Yeah, it would be nice to talk about Ceph, but meeting up at your booth sounds 
like a good idea. We’re just plain visitors.

Have you seen anything about how much presence there will be from Ceph this 
year?

See you :)
Josef

> On 18 Mar 2015, at 19:59, Pawel Stefanski  wrote:
> 
> hello!
> 
> I will attend whole event, so if you would like to talk about Ceph - please 
> please let me know (or catch me at B20 booth) :)
> 
> There was strong Ceph (from Inktank and the community) representation last 
> years. 
> 
> see you!
> -- 
> Pawel
> 
> On Tue, Mar 17, 2015 at 6:38 PM, Josef Johansson  <mailto:jose...@gmail.com>> wrote:
> Hi,
> 
> I was wondering if any cepher where going to WHD this year?
> 
> Cheers,
> Josef
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Hardware recommendation

2015-03-18 Thread Josef Johansson
Hi Alexandre,

I actually have been searching for this information a couple of times in the ML 
now.

Was hoping that you would’ve been done with it before I ordered :)

I will most likely order this week so I will see it when the stuff is being 
assembled :o

Do you feel that there is something in the setup that could be better if you 
were deciding on the hardware today?

Also, will you try out replication set of 2 as well?

Thanks

Josef

> On 18 Mar 2015, at 08:19, Alexandre DERUMIER  wrote:
> 
> Hi Josef,
> 
> I'm going to benchmark a 3-node cluster with 6 SSDs in each node (2x 10 cores 
> 3.1GHz).
> From my previous bench, you need fast CPUs if you need a lot of iops, and 
> writes are a lot more expensive than reads.
> 
> Now if you are doing only a small number of iops (big blocks / big throughput), 
> you don't need CPUs that are too fast, or too many cores.
> 
> I'm going to use intel s3610 ssd for my production cluster, can't comment 
> about samsung drive.
> 
> 
> I'll try to post benchmark results in coming weeks.
> 
> 
> - Mail original -
> De: "Josef Johansson" 
> À: "ceph-users" 
> Envoyé: Mercredi 18 Mars 2015 03:52:22
> Objet: [ceph-users] SSD Hardware recommendation
> 
> Hi, 
> 
> I’m planning a Ceph SSD cluster, I know that we won’t get the full 
> performance from the SSD in this case, but SATA won’t cut it as backend 
> storage and SAS is the same price as SSD now. 
> 
> The backend network will be a 10GbE active/passive, but will be used mainly 
> for MySQL, so we’re aiming for swallowing IO. 
> 
> So, for 10x SSD drivers, what kind of CPU would that need? Just go all out 
> with two 10x cores 3.5GHz? 
> I read somewhere that you should use as fast CPUs that you can afford. 
> 
> Planning on using the Samsung 845 DC EVO, anyone using these in current ceph 
> clusters? 
> 
> We though of doing a cluster with 3 servers, and any recommendation of 
> supermicro servers would be appreciated. 
> 
> Cheers, 
> Josef 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Hardware recommendation

2015-03-18 Thread Josef Johansson
Hi,

> On 18 Mar 2015, at 05:29, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Wed, 18 Mar 2015 03:52:22 +0100 Josef Johansson wrote:
> 
>> Hi,
>> 
>> I’m planning a Ceph SSD cluster, I know that we won’t get the full
>> performance from the SSD in this case, but SATA won’t cut it as backend
>> storage and SAS is the same price as SSD now.
>> 
> Have you actually tested SATA with SSD journals?
> Given a big enough (number of OSDs) cluster you should be able to come
> close to SSD performance currently achievable with regards to a single
> client.
> 
Yeah.
The problem is really the latency when the backing storage is fully utilised, 
especially while rebalancing data and deep scrubbing.
MySQL is actually living on SSD journals + SATA backing storage atm, so this 
is the problem I’m trying to solve.
>> The backend network will be a 10GbE active/passive, but will be used
>> mainly for MySQL, so we’re aiming for swallowing IO.
>> 
> Is this a single MySQL instance or are we talking various VMs here?
> If you're flexible in regards to the network, Infiniband will give you
> lower latency, especially with the RDMA stuff being developed currently
> for Ceph (I'd guess a year or so out).
> Because with single (or few) clients, IOPS per client/thread are limited
> by the latency of the network (and of course the whole Ceph stack) more
> than anything else, so on a single thread you're never going to see
> performance anywhere near what a local SSD could deliver.
> 
Going to use 150-200 MySQL clients, one on each VM, so the load should be good 
for Ceph =)
And sadly I’m in no position to use RDMA etc, as it’s already decided that 
we’re going with 10GBase-T.
Really liking the SM servers with 4x 10GBase-T =)
Thanks for the recommendation though.

>> So, for 10x SSD drivers, what kind of CPU would that need? Just go all
>> out with two 10x cores 3.5GHz? I read somewhere that you should use as
>> fast CPUs that you can afford.
>> 
> Indeed. 
> With 10 SSDs even that will probably be CPU bound with small IOPS and
> current stable Ceph versions.
> See my list archive URL below.
> 
> What size SSDs?
> Is the number of SSDs a result of needing the space, or is it there to get
> more OSDs and thus performance?
Both performance and space. So 1TB drives (well, 960GB in this case):
100GB of MySQL for 100 VMs, calculated with a replication factor of 3.
> 
>> Planning on using the Samsung 845 DC EVO, anyone using these in current
>> ceph clusters? 
>> 
> I'm using them in a DRBD cluster where they were a good fit as their write
> endurance was a match for the use case, I needed lots of space (960GB ones)
> and the relatively low price was within my budget.
> 
> While I'm not tearing out my hairs and curse the day I ever considered
> using them, their speed, endurance and some spurious errors I'm not seeing
> with Intel DC S3700s in the same server have me considering DC S3610s
> instead for the next cluster of this type I'm currently planning.
> 
> Compare those Intel DC S3610 and DC S3700 with Samsung 845 DC Pro if you're
> building a Ceph cluster, the write amplification I'm seeing with SSD
> backed Ceph clusters will turn your EVO's into scrap metal in no time,.
> 
> Consider what you think your IO load (writes) generated by your client(s)
> will be, multiply that by your replication factor, divide by the number of
> OSDs, that will give you the base load per OSD. 
> Then multiply by 2 (journal on OSD) per OSD.
> Finally based on my experience and measurements (link below) multiply that
> by at least 6, probably 10 to be on safe side. Use that number to find the
> SSD that can handle this write load for the time period you're budgeting
> that cluster for.
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>  
> <http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html>
It feels like I can’t go with anything less than the S3610, especially with a 
replication factor of 2.
I haven’t done much reading about the S3610, so I will go into depth on them.
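
To put some made-up numbers on the arithmetic Christian quotes above (every 
figure here is an assumption, it only illustrates the multiplication):

awk 'BEGIN {
  client = 10; repl = 3; osds = 30        # 10 MB/s aggregate client writes (assumed)
  per_osd = client * repl / osds * 2 * 10 # x2 journal on the same SSD, x10 amplification
  printf "%.1f MB/s per OSD  =>  ~%.0f TB written per OSD per year\n", \
         per_osd, per_osd * 86400 * 365 / 1e6
}'

With numbers like these a read-optimised drive would chew through its rated 
endurance well inside its warranty period, which is exactly the point being made.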
> 
>> We though of doing a cluster with 3 servers, and any recommendation of
>> supermicro servers would be appreciated.
>> 
> Why 3, replication of 3? 
> With Intel SSDs and diligent (SMART/NAGIOS) wear level monitoring I'd
> personally feel safe with a replication factor of 2.
> 
I’ve seen recommendations of replication 2! The Intel SSDs do indeed have the 
endurance for it.
This is only with Intel SSDs, I assume?
> I used one of these chassis for the DRBD cluster mentioned above, the
> version with Infiniband actually:
> http://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-DC0TR.cfm
> 
> It&#

[ceph-users] SSD Hardware recommendation

2015-03-17 Thread Josef Johansson
Hi,

I’m planning a Ceph SSD cluster. I know that we won’t get the full performance 
from the SSDs in this case, but SATA won’t cut it as backend storage and SAS is 
the same price as SSD now.

The backend network will be a 10GbE active/passive, but will be used mainly for 
MySQL, so we’re aiming for swallowing IO.

So, for 10x SSD drives, what kind of CPU would that need? Just go all out with 
two 10-core 3.5GHz CPUs?
I read somewhere that you should use the fastest CPUs you can afford.

Planning on using the Samsung 845 DC EVO, anyone using these in current ceph 
clusters? 

We thought of doing a cluster with 3 servers, and any recommendation of 
Supermicro servers would be appreciated.

Cheers,
Josef
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] World hosting days 2015

2015-03-17 Thread Josef Johansson
Hi,

I was wondering if any cephers were going to WHD this year?

Cheers,
Josef
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD turned itself off

2015-02-16 Thread Josef Johansson
And yeah, it’s the same EIO 5 error.

So OK, the errors don’t show anything useful about the OSD crash.
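
For a drive throwing EIO like this, the usual quick sanity check is SMART (a 
sketch; the device path is an assumption, smartctl comes from smartmontools):

smartctl -a /dev/sdX | grep -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error'

A rising UDMA_CRC_Error_Count with otherwise clean media attributes tends to 
point at the controller or cabling rather than the drive itself.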


> On 16 Feb 2015, at 21:58, Josef Johansson  wrote:
> 
> Well, I knew it had all the correct information since earlier so gave it a 
> shot :)
> 
> Anyway, I think it may be just a bad controller as well. New enterprise 
> drives shouldn’t be giving read errors this early in deployment tbh.
> 
> Cheers,
> Josef
>> On 16 Feb 2015, at 17:37, Greg Farnum > <mailto:gfar...@redhat.com>> wrote:
>> 
>> Woah, major thread necromancy! :)
>> 
>> On Feb 13, 2015, at 3:03 PM, Josef Johansson > <mailto:jo...@oderland.se>> wrote:
>>> 
>>> Hi,
>>> 
>>> I skimmed the logs again, as we’ve had more of this kinda errors,
>>> 
>>> I saw a lot of lossy connections errors,
>>> -2567> 2014-11-24 11:49:40.028755 7f6d49367700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.54:0/1011446 pipe(0x19321b80 sd=44 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x110d2b00).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2564> 2014-11-24 11:49:42.000463 7f6d51df1700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/1015676 pipe(0x22d6000 sd=204 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x16e218c0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2563> 2014-11-24 11:49:47.704467 7f6d4d1a5700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/3029106 pipe(0x231f6780 sd=158 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x136bd1e0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2562> 2014-11-24 11:49:48.180604 7f6d4cb9f700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/2027138 pipe(0x1657f180 sd=254 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x13273340).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2561> 2014-11-24 11:49:48.808604 7f6d4c498700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/2023529 pipe(0x12831900 sd=289 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x12401600).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2559> 2014-11-24 11:49:50.128379 7f6d4b88c700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.53:0/1023180 pipe(0x11cb2280 sd=309 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x1280a000).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2558> 2014-11-24 11:49:52.472867 7f6d425eb700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/3019692 pipe(0x18eb4a00 sd=311 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x10df6b00).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2556> 2014-11-24 11:49:55.100208 7f6d49e72700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/3021273 pipe(0x1bacf680 sd=353 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x164ae2c0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2555> 2014-11-24 11:49:55.776568 7f6d49468700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/3024351 pipe(0x1bacea00 sd=20 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x1887ba20).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2554> 2014-11-24 11:49:57.704437 7f6d49165700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/1023529 pipe(0x1a32ac80 sd=213 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0xfe93b80).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2553> 2014-11-24 11:49:58.694246 7f6d47549700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/3017204 pipe(0x102e5b80 sd=370 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0xfb5a000).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2551> 2014-11-24 11:50:00.412242 7f6d4673b700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.52:0/3027138 pipe(0x1b83b400 sd=250 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x12922dc0).accept replacing existing (lossy) channel (new one lossy=1)
>>> -2387> 2014-11-24 11:50:22.761490 7f6d44fa4700  0 -- 
>>> 10.168.7.23:6840/4010217 >> 10.168.7.25:0/27131 pipe(0xfc60c80 sd=300 :6840 
>>> s=0 pgs=0 cs=0 l=1 c=0x1241d080).accept replacing existing (lossy) channel 
>>> (new one lossy=1)
>>> -2300> 2014-11-24 11:50:31.366214 7f6d517eb700  0 -- 
>>> 10.168.7.23:6840/4010217 >> 10.168.7.22:0/15549 pipe(0x193b3180 sd=214 
>>> :6840 s=0 pgs=0 cs=0 l=1 c=0x10ebbe40).accept replacing existing (lossy) 
>>> channel (new one lossy=1)
>>> -2247> 2014-11-24 11:50:37.372934 7f6d4a276700  0 -- 10.168.7.23:6819/10217 
>>> >> 10.168.7.51:0/1013890 pipe(0x25d4780 sd=112 :6819 s=0 pgs=0 cs=0 l=1 
>>> c=0x10666580).accept replacing existing (lossy) chann

Re: [ceph-users] OSD turned itself off

2015-02-16 Thread Josef Johansson
Well, I knew it had all the correct information since earlier so gave it a shot 
:)

Anyway, I think it may be just a bad controller as well. New enterprise drives 
shouldn’t be giving read errors this early in deployment tbh.

Cheers,
Josef
> On 16 Feb 2015, at 17:37, Greg Farnum  wrote:
> 
> Woah, major thread necromancy! :)
> 
> On Feb 13, 2015, at 3:03 PM, Josef Johansson  <mailto:jo...@oderland.se>> wrote:
>> 
>> Hi,
>> 
>> I skimmed the logs again, as we’ve had more of this kinda errors,
>> 
>> I saw a lot of lossy connections errors,
>> -2567> 2014-11-24 11:49:40.028755 7f6d49367700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.54:0/1011446 pipe(0x19321b80 sd=44 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x110d2b00).accept replacing existing (lossy) channel (new one lossy=1)
>> -2564> 2014-11-24 11:49:42.000463 7f6d51df1700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.51:0/1015676 pipe(0x22d6000 sd=204 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x16e218c0).accept replacing existing (lossy) channel (new one lossy=1)
>> -2563> 2014-11-24 11:49:47.704467 7f6d4d1a5700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.52:0/3029106 pipe(0x231f6780 sd=158 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x136bd1e0).accept replacing existing (lossy) channel (new one lossy=1)
>> -2562> 2014-11-24 11:49:48.180604 7f6d4cb9f700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.52:0/2027138 pipe(0x1657f180 sd=254 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x13273340).accept replacing existing (lossy) channel (new one lossy=1)
>> -2561> 2014-11-24 11:49:48.808604 7f6d4c498700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.52:0/2023529 pipe(0x12831900 sd=289 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x12401600).accept replacing existing (lossy) channel (new one lossy=1)
>> -2559> 2014-11-24 11:49:50.128379 7f6d4b88c700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.53:0/1023180 pipe(0x11cb2280 sd=309 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x1280a000).accept replacing existing (lossy) channel (new one lossy=1)
>> -2558> 2014-11-24 11:49:52.472867 7f6d425eb700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.52:0/3019692 pipe(0x18eb4a00 sd=311 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x10df6b00).accept replacing existing (lossy) channel (new one lossy=1)
>> -2556> 2014-11-24 11:49:55.100208 7f6d49e72700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.51:0/3021273 pipe(0x1bacf680 sd=353 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x164ae2c0).accept replacing existing (lossy) channel (new one lossy=1)
>> -2555> 2014-11-24 11:49:55.776568 7f6d49468700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.51:0/3024351 pipe(0x1bacea00 sd=20 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x1887ba20).accept replacing existing (lossy) channel (new one lossy=1)
>> -2554> 2014-11-24 11:49:57.704437 7f6d49165700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.52:0/1023529 pipe(0x1a32ac80 sd=213 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0xfe93b80).accept replacing existing (lossy) channel (new one lossy=1)
>> -2553> 2014-11-24 11:49:58.694246 7f6d47549700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.51:0/3017204 pipe(0x102e5b80 sd=370 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0xfb5a000).accept replacing existing (lossy) channel (new one lossy=1)
>> -2551> 2014-11-24 11:50:00.412242 7f6d4673b700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.52:0/3027138 pipe(0x1b83b400 sd=250 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x12922dc0).accept replacing existing (lossy) channel (new one lossy=1)
>> -2387> 2014-11-24 11:50:22.761490 7f6d44fa4700  0 -- 
>> 10.168.7.23:6840/4010217 >> 10.168.7.25:0/27131 pipe(0xfc60c80 sd=300 :6840 
>> s=0 pgs=0 cs=0 l=1 c=0x1241d080).accept replacing existing (lossy) channel 
>> (new one lossy=1)
>> -2300> 2014-11-24 11:50:31.366214 7f6d517eb700  0 -- 
>> 10.168.7.23:6840/4010217 >> 10.168.7.22:0/15549 pipe(0x193b3180 sd=214 :6840 
>> s=0 pgs=0 cs=0 l=1 c=0x10ebbe40).accept replacing existing (lossy) channel 
>> (new one lossy=1)
>> -2247> 2014-11-24 11:50:37.372934 7f6d4a276700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.51:0/1013890 pipe(0x25d4780 sd=112 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x10666580).accept replacing existing (lossy) channel (new one lossy=1)
>> -2246> 2014-11-24 11:50:37.738539 7f6d4f6ca700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.51:0/3026502 pipe(0x1338ea00 sd=230 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x123f11e0).accept replacing existing (lossy) channel (new one lossy=1)
>> -2245> 2014-11-24 11:50:38.390093 7f6d48c60700  0 -- 10.168.7.23:6819/10217 
>> >> 10.168.7.51:0/2026502 pipe(0x16ba7400 sd=276 :6819 s=0 pgs=0 cs=0 l=1 
>> c=0x7d4fb80)

Re: [ceph-users] OSD turned itself off

2015-02-13 Thread Josef Johansson
141f2.03ce 
[stat,write 1925120~4096] ondisk = 0) v4 remote, 10.168.7.54:0/2007323, failed 
lossy con, dropping message 0x12989400
  -855> 2015-01-10 22:01:36.589036 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
submit_message osd_op_reply(727627 rbd_data.1cc69413d1b58ba.0055 
[stat,write 2289664~4096] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed 
lossy con, dropping message 0x24f68800
  -819> 2015-01-12 05:25:06.229753 7f6d3646c700  0 -- 10.168.7.23:6819/10217 >> 
10.168.7.53:0/2019809 pipe(0x1f0e9680 sd=460 :6819 s=0 pgs=0 cs=0 l=1 
c=0x13090420).accept replacing existing (lossy) channel (new one lossy=1)
  -818> 2015-01-12 05:25:06.581703 7f6d37534700  0 -- 10.168.7.23:6819/10217 >> 
10.168.7.53:0/1025252 pipe(0x1b67a780 sd=71 :6819 s=0 pgs=0 cs=0 l=1 
c=0x16311e40).accept replacing existing (lossy) channel (new one lossy=1)
  -817> 2015-01-12 05:25:21.342998 7f6d41167700  0 -- 10.168.7.23:6819/10217 >> 
10.168.7.53:0/1025579 pipe(0x114e8000 sd=502 :6819 s=0 pgs=0 cs=0 l=1 
c=0x16310160).accept replacing existing (lossy) channel (new one lossy=1)
  -808> 2015-01-12 16:01:35.783534 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
submit_message osd_op_reply(752034 rbd_data.1cc69413d1b58ba.0055 
[stat,write 2387968~8192] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed 
lossy con, dropping message 0x1fde9a00
  -515> 2015-01-25 18:44:23.303855 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
submit_message osd_op_reply(46402240 rbd_data.4b8e9b3d1b58ba.0471 
[read 1310720~4096] ondisk = 0) v4 remote, 10.168.7.51:0/1017204, failed lossy 
con, dropping message 0x250bce00
  -303> 2015-02-02 22:30:03.140599 7f6d5c155700  0 -- 10.168.7.23:6819/10217 
submit_message osd_op_reply(17710313 rbd_data.1cc69562eb141f2.03ce 
[stat,write 4145152~4096] ondisk = 0) v4 remote, 10.168.7.54:0/2007323, failed 
lossy con, dropping message 0x1c5d4200
  -236> 2015-02-05 15:29:04.945660 7f6d3d357700  0 -- 10.168.7.23:6819/10217 >> 
10.168.7.51:0/1026961 pipe(0x1c63e780 sd=203 :6819 s=0 pgs=0 cs=0 l=1 
c=0x11dc8dc0).accept replacing existing (lossy) channel (new one lossy=1)
   -66> 2015-02-10 20:20:36.673969 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
submit_message osd_op_reply(11088 rbd_data.10b8c82eb141f2.4459 
[stat,write 749568~8192] ondisk = 0) v4 remote, 10.168.7.55:0/1005630, failed 
lossy con, dropping message 0x138db200

Could this have led to the data being erroneous, or is the -5 return code just 
a sign of a broken hard drive?

Cheers,
Josef

> On 14 Jun 2014, at 02:38, Josef Johansson  wrote:
> 
> Thanks for the quick response.
> 
> Cheers,
> Josef
> 
> Gregory Farnum skrev 2014-06-14 02:36:
>> On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson  wrote:
>>> Hi Greg,
>>> 
>>> Thanks for the clarification. I believe the OSD was in the middle of a deep
>>> scrub (sorry for not mentioning this straight away), so then it could've
>>> been a silent error that got wind during scrub?
>> Yeah.
>> 
>>> What's best practice when the store is corrupted like this?
>> Remove the OSD from the cluster, and either reformat the disk or
>> replace as you judge appropriate.
>> -Greg
>> 
>>> Cheers,
>>> Josef
>>> 
>>> Gregory Farnum skrev 2014-06-14 02:21:
>>> 
>>>> The OSD did a read off of the local filesystem and it got back the EIO
>>>> error code. That means the store got corrupted or something, so it
>>>> killed itself to avoid spreading bad data to the rest of the cluster.
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>> 
>>>> 
>>>> On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson 
>>>> wrote:
>>>>> Hey,
>>>>> 
>>>>> Just examing what happened to an OSD, that was just turned off. Data has
>>>>> been moved away from it, so hesitating to turned it back on.
>>>>> 
>>>>> Got the below in the logs, any clues to what the assert talks about?
>>>>> 
>>>>> Cheers,
>>>>> Josef
>>>>> 
>>>>> -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t,
>>>>> const
>>>>> hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88
>>>>> c700 time 2014-06-11 21:13:54.036982
>>>>> os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio
>>>>> ||
>>>>> got != -5)
>>>>> 
>>>>>   ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>>>>>   1: (FileStore::read(coll_t, hobj

Re: [ceph-users] RBD and HA KVM anybody?

2014-12-15 Thread Josef Johansson
Hi,

> On 16 Dec 2014, at 05:00, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Mon, 15 Dec 2014 09:23:23 +0100 Josef Johansson wrote:
> 
>> Hi Christian,
>> 
>> We’re using Proxmox that has support for HA, they do it per-vm.
>> We’re doing it manually right now though, because we like it :). 
>> 
>> When I looked at it I couldn’t see a way of just allowing a set of hosts
>> in the HA (i.e. not the storage nodes), but that’s probably easy to
>> solve.
>> 
> 
> Ah, Proxmox. I test drove this about a year ago and while it has some nice
> features the "black box" approach of taking over bare metal hardware and
> the ancient kernel doesn't mesh with other needs I have here.
The ancient kernel is not needed if you’re running just KVM. They are working 
on a 3.10 kernel if I’m correct though.
As it’s Debian 7 at the bottom now, just put in a backported kernel and you’re 
good to go. 3.14 was bad but 3.15 should be OK.
And it has Ceph support nowadays :)

Cheers,
Josef
> 
> Thanks for reminding me, though.
No problemo :)
> 
> Christian
> 
>> Cheers,
>> Josef
>> 
>>> On 15 Dec 2014, at 04:10, Christian Balzer  wrote:
>>> 
>>> 
>>> Hello,
>>> 
>>> What are people here using to provide HA KVMs (and with that I mean
>>> automatic, fast VM failover in case of host node failure) in with RBD
>>> images?
>>> 
>>> Openstack and ganeti have decent Ceph/RBD support, but no HA (plans
>>> aplenty though).
>>> 
>>> I have plenty of experience with Pacemaker (DRBD backed) but there is
>>> only an unofficial RBD resource agent for it, which also only supports
>>> kernel based RBD. 
>>> And while Pacemaker works great, it scales like leaden porcupines,
>>> things degrade rapidly after 20 or so instances.
>>> 
>>> So what are other people here using to keep their KVM based VMs up and
>>> running all the time?
>>> 
>>> Regards,
>>> 
>>> Christian
>>> -- 
>>> Christian BalzerNetwork/Systems Engineer
>>> ch...@gol.com   Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com <mailto:ch...@gol.com>  Global OnLine Japan/Fusion 
> Communications
> http://www.gol.com/ <http://www.gol.com/>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD and HA KVM anybody?

2014-12-15 Thread Josef Johansson
Hi Christian,

We’re using Proxmox, which has support for HA; they do it per VM.
We’re doing it manually right now though, because we like it :).

When I looked at it I couldn’t see a way of just allowing a set of hosts in the 
HA (i.e. not the storage nodes), but that’s probably easy to solve.

Cheers,
Josef

> On 15 Dec 2014, at 04:10, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> What are people here using to provide HA KVMs (and with that I mean
> automatic, fast VM failover in case of host node failure) in with RBD
> images?
> 
> Openstack and ganeti have decent Ceph/RBD support, but no HA (plans
> aplenty though).
> 
> I have plenty of experience with Pacemaker (DRBD backed) but there is only
> an unofficial RBD resource agent for it, which also only supports kernel
> based RBD. 
> And while Pacemaker works great, it scales like leaden porcupines, things
> degrade rapidly after 20 or so instances.
> 
> So what are other people here using to keep their KVM based VMs up and
> running all the time?
> 
> Regards,
> 
> Christian
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Showing package loss in ceph main log

2014-09-12 Thread Josef Johansson
Hi,

I've stumbled upon this a couple of times, where Ceph just stops
responding, but still works.
The cause has been packet loss on the network layer, but Ceph doesn't
say anything.

Is there a debug flag for showing retransmission of packets, or some way
to see that packets are lost?

Regards,
Josef
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
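
Ceph itself will not log retransmits, but the kernel and NIC counters show 
them. A minimal sketch for an OSD host (interface names are assumptions):

netstat -s | grep -i retrans
ip -s link show bond0            # watch the RX/TX "dropped" and "errors" columns
ethtool -S eth0 | grep -iE 'drop|err|crc'

Graphing those counters alongside the usual Ceph metrics makes this failure 
mode much easier to spot.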


Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson

On 07 Sep 2014, at 04:47, Christian Balzer  wrote:

> On Sat, 6 Sep 2014 19:47:13 +0200 Josef Johansson wrote:
> 
>> 
>> On 06 Sep 2014, at 19:37, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> Unfortunatly the journal tuning did not do much. That’s odd, because I
>>> don’t see much utilisation on OSDs themselves. Now this leads to a
>>> network-issue between the OSDs right?
>>> 
>> To answer my own question. Restarted a bond and it all went up again,
>> found the culprit — packet loss. Everything up and running afterwards.
>> 
> If there were actual errors, that should have been visible in atop as well.
> For utilization it isn't that obvious, as it doesn't know what bandwidth a
> bond device has. Same is true for IPoIB interfaces.
> And FWIW, tap (kvm guest interfaces) are wrongly pegged in the kernel at
> 10Mb/s, so they get to be falsely redlined on compute nodes all the time.
> 
This is the second time I’ve seen Ceph behaving badly due to networking issues. 
Maybe @Inktank has ideas of how to announce in the ceph log that there’s packet 
loss?
Regards,
Josef
>> I’ll be taking that beer now,
> 
> Skol.
> 
> Christian
> 
>> Regards,
>> Josef
>>> On 06 Sep 2014, at 18:17, Josef Johansson  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> On 06 Sep 2014, at 17:59, Christian Balzer  wrote:
>>>> 
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>>>>>>> 
>>>>>>>> We manage to go through the restore, but the performance
>>>>>>>> degradation is still there.
>>>>>>>> 
>>>>>>> Manifesting itself how?
>>>>>>> 
>>>>>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>>>>>> But mostly a lot of iowait.
>>>>>> 
>>>>> I was thinking about the storage nodes. ^^
>>>>> As in, does a particular node or disk seem to be redlined all the
>>>>> time?
>>>> They’re idle, with little io wait.
>>> It also shows it self as earlier, with slow requests now and then.
>>> 
>>> Like this 
>>> 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN]
>>> slow request 31.554785 seconds old, received at 2014-09-06
>>> 19:12:56.914688: osd_op(client.12483520.0:12211087
>>> rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3813376~4096]
>>> 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
>>> 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN]
>>> slow request 31.554736 seconds old, received at 2014-09-06
>>> 19:12:56.914737: osd_op(client.12483520.0:12211088
>>> rbd_data.4b8e9b3d1b58ba.1222 [stat,write 3842048~8192]
>>> 3.3bfab9da e15861) v4 currently waiting for subops from [13,2]
>>> 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN]
>>> slow request 30.691760 seconds old, received at 2014-09-06
>>> 19:12:57.13: osd_op(client.12646408.0:36726433
>>> rbd_data.81ab322eb141f2.ec38 [stat,write 749568~4096]
>>> 3.7ae1c1da e15861) v4 currently waiting for subops from [13,2]
>>> 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN]
>>> 23 slow requests, 2 included below; oldest blocked for > 42.196747
>>> secs 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 :
>>> [WRN] slow request 30.344653 seconds old, received at 2014-09-06
>>> 19:13:01.125248: osd_op(client.18869229.0:100325
>>> rbd_data.41d2eb2eb141f2.2732 [stat,write 2174976~4096]
>>> 3.55d437e e15861) v4 currently waiting for subops from [13,6]
>>> 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN]
>>> slow request 30.344579 seconds old, received at 2014-09-06
>>> 19:13:01.125322: osd_op(client.18869229.0:100326
>>> rbd_data.41d2eb2eb141f2.2732 [stat,write 2920448~4096]
>>> 3.55d437e e15861) v4 currently waiting for subops from [13,6]
>>> 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN]
>>> 24 slow requests, 1 inc

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson

On 06 Sep 2014, at 19:37, Josef Johansson  wrote:

> Hi,
> 
> Unfortunatly the journal tuning did not do much. That’s odd, because I don’t 
> see much utilisation on OSDs themselves. Now this leads to a network-issue 
> between the OSDs right?
> 
To answer my own question. Restarted a bond and it all went up again, found the 
culprit — packet loss. Everything up and running afterwards.
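
For anyone else chasing this, the bond state itself can be eyeballed quickly 
(the interface name is an assumption):

grep -E 'Slave Interface|MII Status|Link Failure Count' /proc/net/bonding/bond0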

I’ll be taking that beer now,
Regards,
Josef
> On 06 Sep 2014, at 18:17, Josef Johansson  wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 17:59, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
>>> 
>>>> Hi,
>>>> 
>>>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>>>> 
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>>>>> 
>>>>>> We manage to go through the restore, but the performance degradation
>>>>>> is still there.
>>>>>> 
>>>>> Manifesting itself how?
>>>>> 
>>>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>>>> But mostly a lot of iowait.
>>>> 
>>> I was thinking about the storage nodes. ^^
>>> As in, does a particular node or disk seem to be redlined all the time?
>> They’re idle, with little io wait.
> It also shows it self as earlier, with slow requests now and then.
> 
> Like this 
> 2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] slow 
> request 31.554785 seconds old, received at 2014-09-06 19:12:56.914688: 
> osd_op(client.12483520.0:12211087 rbd_data.4b8e9b3d1b58ba.1222 
> [stat,write 3813376~4096] 3.3bfab9da e15861) v4 currently waiting for subops 
> from [13,2]
> 2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] slow 
> request 31.554736 seconds old, received at 2014-09-06 19:12:56.914737: 
> osd_op(client.12483520.0:12211088 rbd_data.4b8e9b3d1b58ba.1222 
> [stat,write 3842048~8192] 3.3bfab9da e15861) v4 currently waiting for subops 
> from [13,2]
> 2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] slow 
> request 30.691760 seconds old, received at 2014-09-06 19:12:57.13: 
> osd_op(client.12646408.0:36726433 rbd_data.81ab322eb141f2.ec38 
> [stat,write 749568~4096] 3.7ae1c1da e15861) v4 currently waiting for subops 
> from [13,2]
> 2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] 23 slow 
> requests, 2 included below; oldest blocked for > 42.196747 secs
> 2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : [WRN] slow 
> request 30.344653 seconds old, received at 2014-09-06 19:13:01.125248: 
> osd_op(client.18869229.0:100325 rbd_data.41d2eb2eb141f2.2732 
> [stat,write 2174976~4096] 3.55d437e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] slow 
> request 30.344579 seconds old, received at 2014-09-06 19:13:01.125322: 
> osd_op(client.18869229.0:100326 rbd_data.41d2eb2eb141f2.2732 
> [stat,write 2920448~4096] 3.55d437e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] 24 slow 
> requests, 1 included below; oldest blocked for > 43.196971 secs
> 2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : [WRN] slow 
> request 30.627252 seconds old, received at 2014-09-06 19:13:01.842873: 
> osd_op(client.10785413.0:136148901 rbd_data.96803f2eb141f2.33d7 
> [stat,write 4063232~4096] 3.cf740399 e15861) v4 currently waiting for subops 
> from [1,13]
> 2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] 27 slow 
> requests, 3 included below; oldest blocked for > 48.197700 secs
> 2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : [WRN] slow 
> request 30.769509 seconds old, received at 2014-09-06 19:13:06.701345: 
> osd_op(client.18777372.0:1605468 rbd_data.2f1e4e2eb141f2.3541 
> [stat,write 1118208~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] slow 
> request 30.769458 seconds old, received at 2014-09-06 19:13:06.701396: 
> osd_op(client.18777372.0:1605469 rbd_data.2f1e4e2eb141f2.3541 
> [stat,write 1130496~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
> from [13,6]
> 2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN] slow 
> request 30.266843 seconds old, received at 2014-09-06 19:13:0

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

Unfortunately the journal tuning did not do much. That’s odd, because I don’t 
see much utilisation on the OSDs themselves. So this points to a network issue 
between the OSDs, right?

On 06 Sep 2014, at 18:17, Josef Johansson  wrote:

> Hi,
> 
> On 06 Sep 2014, at 17:59, Christian Balzer  wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
>> 
>>> Hi,
>>> 
>>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>>> 
>>>> 
>>>> Hello,
>>>> 
>>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>>>> 
>>>>> We manage to go through the restore, but the performance degradation
>>>>> is still there.
>>>>> 
>>>> Manifesting itself how?
>>>> 
>>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>>> But mostly a lot of iowait.
>>> 
>> I was thinking about the storage nodes. ^^
>> As in, does a particular node or disk seem to be redlined all the time?
> They’re idle, with little io wait.
It also shows itself as earlier, with slow requests now and then.
Like this:
Like this 
2014-09-06 19:13:28.469533 osd.25 10.168.7.23:6827/11423 362 : [WRN] slow 
request 31.554785 seconds old, received at 2014-09-06 19:12:56.914688: 
osd_op(client.12483520.0:12211087 rbd_data.4b8e9b3d1b58ba.1222 
[stat,write 3813376~4096] 3.3bfab9da e15861) v4 currently waiting for subops 
from [13,2]
2014-09-06 19:13:28.469536 osd.25 10.168.7.23:6827/11423 363 : [WRN] slow 
request 31.554736 seconds old, received at 2014-09-06 19:12:56.914737: 
osd_op(client.12483520.0:12211088 rbd_data.4b8e9b3d1b58ba.1222 
[stat,write 3842048~8192] 3.3bfab9da e15861) v4 currently waiting for subops 
from [13,2]
2014-09-06 19:13:28.469539 osd.25 10.168.7.23:6827/11423 364 : [WRN] slow 
request 30.691760 seconds old, received at 2014-09-06 19:12:57.13: 
osd_op(client.12646408.0:36726433 rbd_data.81ab322eb141f2.ec38 
[stat,write 749568~4096] 3.7ae1c1da e15861) v4 currently waiting for subops 
from [13,2]
2014-09-06 19:13:31.469946 osd.25 10.168.7.23:6827/11423 365 : [WRN] 23 slow 
requests, 2 included below; oldest blocked for > 42.196747 secs
2014-09-06 19:13:31.469951 osd.25 10.168.7.23:6827/11423 366 : [WRN] slow 
request 30.344653 seconds old, received at 2014-09-06 19:13:01.125248: 
osd_op(client.18869229.0:100325 rbd_data.41d2eb2eb141f2.2732 
[stat,write 2174976~4096] 3.55d437e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:31.469954 osd.25 10.168.7.23:6827/11423 367 : [WRN] slow 
request 30.344579 seconds old, received at 2014-09-06 19:13:01.125322: 
osd_op(client.18869229.0:100326 rbd_data.41d2eb2eb141f2.2732 
[stat,write 2920448~4096] 3.55d437e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:32.470156 osd.25 10.168.7.23:6827/11423 368 : [WRN] 24 slow 
requests, 1 included below; oldest blocked for > 43.196971 secs
2014-09-06 19:13:32.470163 osd.25 10.168.7.23:6827/11423 369 : [WRN] slow 
request 30.627252 seconds old, received at 2014-09-06 19:13:01.842873: 
osd_op(client.10785413.0:136148901 rbd_data.96803f2eb141f2.33d7 
[stat,write 4063232~4096] 3.cf740399 e15861) v4 currently waiting for subops 
from [1,13]
2014-09-06 19:13:37.470895 osd.25 10.168.7.23:6827/11423 370 : [WRN] 27 slow 
requests, 3 included below; oldest blocked for > 48.197700 secs
2014-09-06 19:13:37.470902 osd.25 10.168.7.23:6827/11423 371 : [WRN] slow 
request 30.769509 seconds old, received at 2014-09-06 19:13:06.701345: 
osd_op(client.18777372.0:1605468 rbd_data.2f1e4e2eb141f2.3541 
[stat,write 1118208~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:37.470907 osd.25 10.168.7.23:6827/11423 372 : [WRN] slow 
request 30.769458 seconds old, received at 2014-09-06 19:13:06.701396: 
osd_op(client.18777372.0:1605469 rbd_data.2f1e4e2eb141f2.3541 
[stat,write 1130496~4096] 3.db1ca37e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:37.470910 osd.25 10.168.7.23:6827/11423 373 : [WRN] slow 
request 30.266843 seconds old, received at 2014-09-06 19:13:07.204011: 
osd_op(client.18795696.0:847270 rbd_data.30532e2eb141f2.36bd 
[stat,write 3772416~4096] 3.76f1df7e e15861) v4 currently waiting for subops 
from [13,6]
2014-09-06 19:13:38.471152 osd.25 10.168.7.23:6827/11423 374 : [WRN] 30 slow 
requests, 3 included below; oldest blocked for > 49.197952 secs
2014-09-06 19:13:38.471158 osd.25 10.168.7.23:6827/11423 375 : [WRN] slow 
request 30.706236 seconds old, received at 2014-09-06 19:13:07.764870: 
osd_op(client.12483523.0:36628673 rbd_data.4defd32eb141f2.00015200 
[stat,write 2121728~4096] 3.cd82ed8a e15861) v4 currently waiting for subops 
from [0,13]
2014-09-06 19:13:38
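
A quick way to tally which OSDs all those slow requests are waiting on (a 
sketch; assumes the cluster log at its usual path on a monitor host):

grep 'waiting for subops from' /var/log/ceph/ceph.log \
  | grep -o 'from \[[0-9,]*\]' | grep -o '[0-9]\+' | sort -n | uniq -c | sort -rn | head

In the log above, osd.13 shows up in every listed acting set, which is where I 
would look first.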

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 17:59, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:41:02 +0200 Josef Johansson wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>>> 
>>>> We manage to go through the restore, but the performance degradation
>>>> is still there.
>>>> 
>>> Manifesting itself how?
>>> 
>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>> But mostly a lot of iowait.
>> 
> I was thinking about the storage nodes. ^^
> As in, does a particular node or disk seem to be redlined all the time?
They’re idle, with little io wait.
> 
>>>> Looking through the OSDs to pinpoint a source of the degradation and
>>>> hoping the current load will be lowered.
>>>> 
>>> 
>>> You're the one looking at your cluster, the iostat, atop, iotop and
>>> whatnot data.
>>> If one particular OSD/disk stands out, investigate it, as per the "Good
>>> way to monitor detailed latency/throughput" thread. 
>>> 
>> Will read it through.
>>> If you have a spare and idle machine that is identical to your storage
>>> nodes, you could run a fio benchmark on a disk there and then compare
>>> the results to that of your suspect disk after setting your cluster to
>>> noout and stopping that particular OSD.
>> No spare though, but I have a rough idea what it should be, what’s I’m
>> going at right now. Right, so the cluster should be fine after I stop
>> the OSD right? I though of stopping it a little bit to see if the IO was
>> better afterwards from within the VMs. Not sure how good effect it makes
>> though since it may be waiting for the IO to complete what not.
>>> 
> If you set your cluster to noout, as in "ceph osd set noout" per
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
> before shutting down a particular ODS, no data migration will happen.
> 
> Of course you will want to shut it down as little as possible, so that
> recovery traffic when it comes back is minimized. 
> 
Good, yes will do this.
Regards,
Josef
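
For the record, the sequence Christian describes boils down to roughly this (a 
sketch; osd.13 is only picked because it keeps appearing in the slow request 
lines, the init commands differ per distro, and the test file needs a few GB 
of free space):

ceph osd set noout                      # keep the cluster from rebalancing
/etc/init.d/ceph stop osd.13            # or: service ceph stop osd.13
fio --name=osd13-check --directory=/var/lib/ceph/osd/ceph-13 --size=4G \
    --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
    --runtime=60 --time_based --group_reporting
rm -f /var/lib/ceph/osd/ceph-13/osd13-check*   # fio leaves its test file behind
/etc/init.d/ceph start osd.13
ceph osd unset noout

The same fio line on the spare, identical machine gives the baseline to 
compare against.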
> Christian 
> 
>>>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
>>>> tough if the degradation is still there afterwards? i.e. if I set back
>>>> the weight would it move back all the PGs?
>>>> 
>>> Of course.
>>> 
>>> Until you can determine that a specific OSD/disk is the culprit, don't
>>> do that. 
>>> If you have the evidence, go ahead.
>>> 
>> Great, that’s what I though as well.
>>> Regards,
>>> 
>>> Christian
>>> 
>>>> Regards,
>>>> Josef
>>>> 
>>>> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
>>>> 
>>>>> FWI I did restart the OSDs until I saw a server that made impact.
>>>>> Until that server stopped doing impact, I didn’t get lower in the
>>>>> number objects being degraded. After a while it was done with
>>>>> recovering that OSD and happily started with others. I guess I will
>>>>> be seeing the same behaviour when it gets to replicating the same PGs
>>>>> that were causing troubles the first time.
>>>>> 
>>>>> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
>>>>> 
>>>>>> Actually, it only worked with restarting  for a period of time to
>>>>>> get the recovering process going. Can’t get passed the 21k object
>>>>>> mark.
>>>>>> 
>>>>>> I’m uncertain if the disk really is messing this up right now as
>>>>>> well. So I’m not glad to start moving 300k objects around.
>>>>>> 
>>>>>> Regards,
>>>>>> Josef
>>>>>> 
>>>>>> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>>>>>>>> 
>>>>>>>>> Also putting this on the list.
>>>>>>>>> 
>>

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 18:05, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:52:59 +0200 Josef Johansson wrote:
> 
>> Hi,
>> 
>> Just realised that it could also be with a popularity bug as well and
>> lots a small traffic. And seeing that it’s fast it gets popular until it
>> hits the curb.
>> 
> I don't think I ever heard the term "popularity bug" before, care to
> elaborate? 
I did! :D When you start out fine with great numbers, people like it and 
suddenly it’s not so fast anymore, and when you hit the magic number it starts 
to be trouble.
> 
>> I’m seeing this in the stats I think.
>> 
>> Linux 3.13-0.bpo.1-amd64 (osd1)  09/06/2014
>> _x86_64_ (24 CPU)
> Any particular reason you're not running 3.14?
No, just that we don’t have that much time on our hands.
>> 
>> 09/06/2014 05:48:41 PM
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            2.21    0.00    1.00    2.86    0.00   93.93
>> 
>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> sdm        0.02    1.47   7.05  42.72   0.67   1.07     71.43      0.41   8.17     6.41     8.46   3.44  17.13
>> sdn        0.03    1.42   6.17  37.08   0.57   0.92     70.51      0.08   1.76     6.47     0.98   3.46  14.98
>> sdg        0.03    1.44   6.27  36.62   0.56   0.94     71.40      0.34   8.00     6.83     8.20   3.45  14.78
>> sde        0.03    1.23   6.47  39.07   0.59   0.98     70.29      0.43   9.47     6.57     9.95   3.37  15.33
>> sdf        0.02    1.26   6.47  33.77   0.61   0.87     75.30      0.22   5.39     6.00     5.27   3.52  14.17
>> sdl        0.03    1.44   6.44  40.54   0.59   1.08     72.68      0.21   4.49     6.56     4.16   3.40  15.95
>> sdk        0.03    1.41   5.62  35.92   0.52   0.90     70.10      0.15   3.58     6.17     3.17   3.45  14.32
>> sdj        0.03    1.26   6.30  34.23   0.57   0.83     70.84      0.31   7.65     6.56     7.85   3.48  14.10
>> 
>> Seeing that the drives are in pretty good shape but not giving lotsa
>> read, I would assume that I need to tweak the cache to swallow more IO.
>> 
> That looks indeed fine, as in, none of these disks looks suspicious to me.
> 
>> When I tweaked it before production I did not see any performance gains
>> what so ever, so they are pretty low. And it’s odd because we just saw
>> these problems a little while ago. So probably that we hit a limit where
>> the disks are getting lot of IO.
>> 
>> I know that there’s some threads about this that I will read again.
>> 
> URL?
> 
Uhm, I think you’re involved in most of them. I'll post what I do and from 
where.
> Christian
> 
>> Thanks for the hints in looking at bad drives.
>> 
>> Regards,
>> Josef
>> 
>> On 06 Sep 2014, at 17:41, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
>>> 
>>>> 
>>>> Hello,
>>>> 
>>>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>>>> 
>>>>> We manage to go through the restore, but the performance degradation
>>>>> is still there.
>>>>> 
>>>> Manifesting itself how?
>>>> 
>>> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
>>> But mostly a lot of iowait.
>>> 
>>>>> Looking through the OSDs to pinpoint a source of the degradation and
>>>>> hoping the current load will be lowered.
>>>>> 
>>>> 
>>>> You're the one looking at your cluster, the iostat, atop, iotop and
>>>> whatnot data.
>>>> If one particular OSD/disk stands out, investigate it, as per the
>>>> "Good way to monitor detailed latency/throughput" thread. 
>>>> 
>>> Will read it through.
>>>> If you have a spare and idle machine that is identical to your storage
>>>> nodes, you could run a fio benchmark on a disk there and then compare
>>>> the results to that of your suspect disk after setting your cluster
>>>> to noout and stopping that particular OSD.
>>> No spare though, but I have a rough idea what it should be, what’s I’m
>>> going at right now. Right, so the cluster should be fine after I stop
>>> the OSD right? I though of stopping it a little bit to see if the IO
>>> was better afterwards from within the VMs. Not sure how good 

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

Just realised that it could also be a popularity problem, with lots of small 
traffic. Seeing that it’s fast, it gets popular until it hits the curb.

I’m seeing this in the stats I think.

Linux 3.13-0.bpo.1-amd64 (osd1)   09/06/2014   _x86_64_   (24 CPU)

09/06/2014 05:48:41 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.21    0.00    1.00    2.86    0.00   93.93

Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdm        0.02    1.47   7.05  42.72   0.67   1.07     71.43      0.41   8.17     6.41     8.46   3.44  17.13
sdn        0.03    1.42   6.17  37.08   0.57   0.92     70.51      0.08   1.76     6.47     0.98   3.46  14.98
sdg        0.03    1.44   6.27  36.62   0.56   0.94     71.40      0.34   8.00     6.83     8.20   3.45  14.78
sde        0.03    1.23   6.47  39.07   0.59   0.98     70.29      0.43   9.47     6.57     9.95   3.37  15.33
sdf        0.02    1.26   6.47  33.77   0.61   0.87     75.30      0.22   5.39     6.00     5.27   3.52  14.17
sdl        0.03    1.44   6.44  40.54   0.59   1.08     72.68      0.21   4.49     6.56     4.16   3.40  15.95
sdk        0.03    1.41   5.62  35.92   0.52   0.90     70.10      0.15   3.58     6.17     3.17   3.45  14.32
sdj        0.03    1.26   6.30  34.23   0.57   0.83     70.84      0.31   7.65     6.56     7.85   3.48  14.10

Seeing that the drives are in pretty good shape but not delivering much read, I 
would assume that I need to tweak the cache to swallow more IO.

When I tuned it before production I did not see any performance gains whatsoever, 
so the settings are pretty low. And it's odd because these problems only showed 
up a little while ago. So it's probable that we hit a limit where the disks are 
getting a lot of IO.
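
A minimal sketch of what I am watching while the VMs are slow, to see whether it 
really is the disks (assuming a release recent enough to have "ceph osd perf"; 
the 5 second interval is arbitrary):

iostat -xm 5        # per-disk %util and await on each OSD node
ceph osd perf       # filestore commit/apply latency per OSD, to spot one that stands out
ceph -w             # watch for slow request warnings and correlate with the above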

I know that there’s some threads about this that I will read again.

Thanks for the hints in looking at bad drives.

Regards,
Josef

On 06 Sep 2014, at 17:41, Josef Johansson  wrote:

> Hi,
> 
> On 06 Sep 2014, at 17:27, Christian Balzer  wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>> 
>>> We manage to go through the restore, but the performance degradation is
>>> still there.
>>> 
>> Manifesting itself how?
>> 
> Awful slow io on the VMs, and iowait, it’s about 2MB/s or so.
> But mostly a lot of iowait.
> 
>>> Looking through the OSDs to pinpoint a source of the degradation and
>>> hoping the current load will be lowered.
>>> 
>> 
>> You're the one looking at your cluster, the iostat, atop, iotop and
>> whatnot data.
>> If one particular OSD/disk stands out, investigate it, as per the "Good
>> way to monitor detailed latency/throughput" thread. 
>> 
> Will read it through.
>> If you have a spare and idle machine that is identical to your storage
>> nodes, you could run a fio benchmark on a disk there and then compare the
>> results to that of your suspect disk after setting your cluster to noout
>> and stopping that particular OSD.
> No spare though, but I have a rough idea what it should be, what’s I’m going 
> at right now.
> Right, so the cluster should be fine after I stop the OSD right? I though of 
> stopping it a little bit to see if the IO was better afterwards from within 
> the VMs. Not sure how good effect it makes though since it may be waiting for 
> the IO to complete what not.
>> 
>>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
>>> tough if the degradation is still there afterwards? i.e. if I set back
>>> the weight would it move back all the PGs?
>>> 
>> Of course.
>> 
>> Until you can determine that a specific OSD/disk is the culprit, don't do
>> that. 
>> If you have the evidence, go ahead.
>> 
> Great, that’s what I though as well.
>> Regards,
>> 
>> Christian
>> 
>>> Regards,
>>> Josef
>>> 
>>> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
>>> 
>>>> FWI I did restart the OSDs until I saw a server that made impact.
>>>> Until that server stopped doing impact, I didn’t get lower in the
>>>> number objects being degraded. After a while it was done with
>>>> recovering that OSD and happily started with others. I guess I will be
>>>> seeing the same behaviour when it gets to replicating the same PGs
>>>> that were causing troubles the first time.
>>>> 
>>>> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
>>>> 
>>>>> Actually, it only worked wit

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 17:27, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> 
>> We manage to go through the restore, but the performance degradation is
>> still there.
>> 
> Manifesting itself how?
> 
Awfully slow IO on the VMs, and iowait; it's about 2MB/s or so.
But mostly a lot of iowait.

>> Looking through the OSDs to pinpoint a source of the degradation and
>> hoping the current load will be lowered.
>> 
> 
> You're the one looking at your cluster, the iostat, atop, iotop and
> whatnot data.
> If one particular OSD/disk stands out, investigate it, as per the "Good
> way to monitor detailed latency/throughput" thread. 
> 
Will read it through.
> If you have a spare and idle machine that is identical to your storage
> nodes, you could run a fio benchmark on a disk there and then compare the
> results to that of your suspect disk after setting your cluster to noout
> and stopping that particular OSD.
No spare though, but I have a rough idea of what it should be, which is what I'm 
going on right now.
Right, so the cluster should be fine after I stop the OSD, right? I thought of 
stopping it for a little while to see if the IO got better afterwards from within 
the VMs. Not sure how much of an effect that shows though, since it may still be 
waiting for in-flight IO to complete.
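
As a rough sketch of the test I have in mind (osd.12 and the paths are 
placeholders, the fio numbers are just a starting point, and the scratch file is 
removed afterwards):

ceph osd set noout                        # keep the stopped OSD from being marked out
/etc/init.d/ceph stop osd.12              # stop only the suspect OSD
fio --name=suspect-disk --ioengine=libaio --direct=1 \
    --filename=/var/lib/ceph/osd/ceph-12/fio-test --size=1G \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based
rm /var/lib/ceph/osd/ceph-12/fio-test
/etc/init.d/ceph start osd.12
ceph osd unset noout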
> 
>> I’m a bit afraid of doing the 0 to weight of an OSD, wouldn’t it be
>> tough if the degradation is still there afterwards? i.e. if I set back
>> the weight would it move back all the PGs?
>> 
> Of course.
> 
> Until you can determine that a specific OSD/disk is the culprit, don't do
> that. 
> If you have the evidence, go ahead.
> 
Great, that’s what I though as well.
> Regards,
> 
> Christian
> 
>> Regards,
>> Josef
>> 
>> On 06 Sep 2014, at 15:52, Josef Johansson  wrote:
>> 
>>> FWI I did restart the OSDs until I saw a server that made impact.
>>> Until that server stopped doing impact, I didn’t get lower in the
>>> number objects being degraded. After a while it was done with
>>> recovering that OSD and happily started with others. I guess I will be
>>> seeing the same behaviour when it gets to replicating the same PGs
>>> that were causing troubles the first time.
>>> 
>>> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
>>> 
>>>> Actually, it only worked with restarting  for a period of time to get
>>>> the recovering process going. Can’t get passed the 21k object mark.
>>>> 
>>>> I’m uncertain if the disk really is messing this up right now as
>>>> well. So I’m not glad to start moving 300k objects around.
>>>> 
>>>> Regards,
>>>> Josef
>>>> 
>>>> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>>>>> 
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>>>>>> 
>>>>>>> Also putting this on the list.
>>>>>>> 
>>>>>>> On 06 Sep 2014, at 13:36, Josef Johansson 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Same issues again, but I think we found the drive that causes the
>>>>>>>> problems.
>>>>>>>> 
>>>>>>>> But this is causing problems as it’s trying to do a recover to
>>>>>>>> that osd at the moment.
>>>>>>>> 
>>>>>>>> So we’re left with the status message 
>>>>>>>> 
>>>>>>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs:
>>>>>>>> 6841 active+clean, 19 active+remapped+backfilling; 12299 GB data,
>>>>>>>> 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr,
>>>>>>>> 74op/s; 41424/15131923 degraded (0.274%);  recovering 0 o/s,
>>>>>>>> 2035KB/s
>>>>>>>> 
>>>>>>>> 
>>>>>>>> It’s improving, but way too slowly. If I restart the recovery
>>>>>>>> (ceph osd set no recovery /unset) it doesn’t change the osd what
>>>>>>>> I can see.
>>>>>>>> 
>>>>>>>> Any ideas?
>>&

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
We managed to get through the restore, but the performance degradation is still 
there.

I'm looking through the OSDs to pinpoint the source of the degradation, hoping 
the current load will come down.

I'm a bit afraid of setting the weight of an OSD to 0; wouldn't it be rough if 
the degradation is still there afterwards? I.e. if I set the weight back, would 
it move all the PGs back?

Regards,
Josef

On 06 Sep 2014, at 15:52, Josef Johansson  wrote:

> FWI I did restart the OSDs until I saw a server that made impact. Until that 
> server stopped doing impact, I didn’t get lower in the number objects being 
> degraded.
> After a while it was done with recovering that OSD and happily started with 
> others.
> I guess I will be seeing the same behaviour when it gets to replicating the 
> same PGs that were causing troubles the first time.
> 
> On 06 Sep 2014, at 15:04, Josef Johansson  wrote:
> 
>> Actually, it only worked with restarting  for a period of time to get the 
>> recovering process going. Can’t get passed the 21k object mark.
>> 
>> I’m uncertain if the disk really is messing this up right now as well. So 
>> I’m not glad to start moving 300k objects around.
>> 
>> Regards,
>> Josef
>> 
>> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>>> 
>>>> 
>>>> Hello,
>>>> 
>>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>>>> 
>>>>> Also putting this on the list.
>>>>> 
>>>>> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Same issues again, but I think we found the drive that causes the
>>>>>> problems.
>>>>>> 
>>>>>> But this is causing problems as it’s trying to do a recover to that
>>>>>> osd at the moment.
>>>>>> 
>>>>>> So we’re left with the status message 
>>>>>> 
>>>>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
>>>>>> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
>>>>>> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
>>>>>> 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
>>>>>> 
>>>>>> 
>>>>>> It’s improving, but way too slowly. If I restart the recovery (ceph
>>>>>> osd set no recovery /unset) it doesn’t change the osd what I can see.
>>>>>> 
>>>>>> Any ideas?
>>>>>> 
>>>> I don't know the state of your cluster, i.e. what caused the recovery to
>>>> start (how many OSDs went down?).
>>> Performance degradation, databases are the worst impacted. It’s actually a 
>>> OSD that we put in that’s causing it (removed it again though). So the 
>>> cluster in itself is healthy.
>>> 
>>>> If you have a replication of 3 and only one OSD was involved, what is
>>>> stopping you from taking that wonky drive/OSD out?
>>>> 
>>> There’s data that goes missing if I do that, I guess I have to wait for the 
>>> recovery process to complete before I can go any further, this is with rep 
>>> 3.
>>>> If you don't know that or want to play it safe, how about setting the
>>>> weight of that OSD to 0? 
>>>> While that will AFAICT still result in all primary PGs to be evacuated
>>>> off it, no more writes will happen to it and reads might be faster.
>>>> In either case, it shouldn't slow down the rest of your cluster anymore.
>>>> 
>>> That’s actually one idea I haven’t thought off, I wan’t to play it safe 
>>> right now and hope that it goes up again, I actually found one wonky way of 
>>> getting the recovery process from not stalling to a grind, and that was 
>>> restarting OSDs. One at the time.
>>> 
>>> Regards,
>>> Josef
>>>> Regards,
>>>> 
>>>> Christian
>>>>>> Cheers,
>>>>>> Josef
>>>>>> 
>>>>>> On 05 Sep 2014, at 11:26, Luis Periquito 
>>>>>> wrote:
>>>>>> 
>>>>>>> Only time I saw such behaviour was when I was deleting a big chunk of
>>>>>>> data from the cluster: all the client activity was reduced, the op/s
>>>>>>> were al

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
FYI, I did restart the OSDs until I saw a server that made an impact. Until that 
server stopped making an impact, the number of degraded objects did not go down.
After a while it was done recovering that OSD and happily started with others.
I guess I will be seeing the same behaviour when it gets to replicating the same 
PGs that were causing trouble the first time.

On 06 Sep 2014, at 15:04, Josef Johansson  wrote:

> Actually, it only worked with restarting  for a period of time to get the 
> recovering process going. Can’t get passed the 21k object mark.
> 
> I’m uncertain if the disk really is messing this up right now as well. So I’m 
> not glad to start moving 300k objects around.
> 
> Regards,
> Josef
> 
> On 06 Sep 2014, at 14:33, Josef Johansson  wrote:
> 
>> Hi,
>> 
>> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
>> 
>>> 
>>> Hello,
>>> 
>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>>> 
>>>> Also putting this on the list.
>>>> 
>>>> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Same issues again, but I think we found the drive that causes the
>>>>> problems.
>>>>> 
>>>>> But this is causing problems as it’s trying to do a recover to that
>>>>> osd at the moment.
>>>>> 
>>>>> So we’re left with the status message 
>>>>> 
>>>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
>>>>> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
>>>>> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
>>>>> 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
>>>>> 
>>>>> 
>>>>> It’s improving, but way too slowly. If I restart the recovery (ceph
>>>>> osd set no recovery /unset) it doesn’t change the osd what I can see.
>>>>> 
>>>>> Any ideas?
>>>>> 
>>> I don't know the state of your cluster, i.e. what caused the recovery to
>>> start (how many OSDs went down?).
>> Performance degradation, databases are the worst impacted. It’s actually a 
>> OSD that we put in that’s causing it (removed it again though). So the 
>> cluster in itself is healthy.
>> 
>>> If you have a replication of 3 and only one OSD was involved, what is
>>> stopping you from taking that wonky drive/OSD out?
>>> 
>> There’s data that goes missing if I do that, I guess I have to wait for the 
>> recovery process to complete before I can go any further, this is with rep 3.
>>> If you don't know that or want to play it safe, how about setting the
>>> weight of that OSD to 0? 
>>> While that will AFAICT still result in all primary PGs to be evacuated
>>> off it, no more writes will happen to it and reads might be faster.
>>> In either case, it shouldn't slow down the rest of your cluster anymore.
>>> 
>> That’s actually one idea I haven’t thought off, I wan’t to play it safe 
>> right now and hope that it goes up again, I actually found one wonky way of 
>> getting the recovery process from not stalling to a grind, and that was 
>> restarting OSDs. One at the time.
>> 
>> Regards,
>> Josef
>>> Regards,
>>> 
>>> Christian
>>>>> Cheers,
>>>>> Josef
>>>>> 
>>>>> On 05 Sep 2014, at 11:26, Luis Periquito 
>>>>> wrote:
>>>>> 
>>>>>> Only time I saw such behaviour was when I was deleting a big chunk of
>>>>>> data from the cluster: all the client activity was reduced, the op/s
>>>>>> were almost non-existent and there was unjustified delays all over
>>>>>> the cluster. But all the disks were somewhat busy in atop/iotstat.
>>>>>> 
>>>>>> 
>>>>>> On 5 September 2014 09:51, David  wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Indeed strange.
>>>>>> 
>>>>>> That output was when we had issues, seems that most operations were
>>>>>> blocked / slow requests.
>>>>>> 
>>>>>> A ”baseline” output is more like today:
>>>>>> 
>>>>>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Actually, restarting only worked for a period of time to get the recovery 
process going. Can't get past the 21k object mark.

I'm also uncertain whether the disk really is messing this up right now. So I'm 
not keen on starting to move 300k objects around.

Regards,
Josef

On 06 Sep 2014, at 14:33, Josef Johansson  wrote:

> Hi,
> 
> On 06 Sep 2014, at 13:53, Christian Balzer  wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>> 
>>> Also putting this on the list.
>>> 
>>> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Same issues again, but I think we found the drive that causes the
>>>> problems.
>>>> 
>>>> But this is causing problems as it’s trying to do a recover to that
>>>> osd at the moment.
>>>> 
>>>> So we’re left with the status message 
>>>> 
>>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
>>>> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
>>>> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
>>>> 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
>>>> 
>>>> 
>>>> It’s improving, but way too slowly. If I restart the recovery (ceph
>>>> osd set no recovery /unset) it doesn’t change the osd what I can see.
>>>> 
>>>> Any ideas?
>>>> 
>> I don't know the state of your cluster, i.e. what caused the recovery to
>> start (how many OSDs went down?).
> Performance degradation, databases are the worst impacted. It’s actually a 
> OSD that we put in that’s causing it (removed it again though). So the 
> cluster in itself is healthy.
> 
>> If you have a replication of 3 and only one OSD was involved, what is
>> stopping you from taking that wonky drive/OSD out?
>> 
> There’s data that goes missing if I do that, I guess I have to wait for the 
> recovery process to complete before I can go any further, this is with rep 3.
>> If you don't know that or want to play it safe, how about setting the
>> weight of that OSD to 0? 
>> While that will AFAICT still result in all primary PGs to be evacuated
>> off it, no more writes will happen to it and reads might be faster.
>> In either case, it shouldn't slow down the rest of your cluster anymore.
>> 
> That’s actually one idea I haven’t thought off, I wan’t to play it safe right 
> now and hope that it goes up again, I actually found one wonky way of getting 
> the recovery process from not stalling to a grind, and that was restarting 
> OSDs. One at the time.
> 
> Regards,
> Josef
>> Regards,
>> 
>> Christian
>>>> Cheers,
>>>> Josef
>>>> 
>>>> On 05 Sep 2014, at 11:26, Luis Periquito 
>>>> wrote:
>>>> 
>>>>> Only time I saw such behaviour was when I was deleting a big chunk of
>>>>> data from the cluster: all the client activity was reduced, the op/s
>>>>> were almost non-existent and there was unjustified delays all over
>>>>> the cluster. But all the disks were somewhat busy in atop/iotstat.
>>>>> 
>>>>> 
>>>>> On 5 September 2014 09:51, David  wrote:
>>>>> Hi,
>>>>> 
>>>>> Indeed strange.
>>>>> 
>>>>> That output was when we had issues, seems that most operations were
>>>>> blocked / slow requests.
>>>>> 
>>>>> A ”baseline” output is more like today:
>>>>> 
>>>>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637
>>>>> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB
>>>>> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s
>>>>> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
>>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
>>>>> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
>>>>> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
>>>>> 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0
>>>>> [INF] pgmap v12582763: 6

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Hi,

On 06 Sep 2014, at 13:53, Christian Balzer  wrote:

> 
> Hello,
> 
> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
> 
>> Also putting this on the list.
>> 
>> On 06 Sep 2014, at 13:36, Josef Johansson  wrote:
>> 
>>> Hi,
>>> 
>>> Same issues again, but I think we found the drive that causes the
>>> problems.
>>> 
>>> But this is causing problems as it’s trying to do a recover to that
>>> osd at the moment.
>>> 
>>> So we’re left with the status message 
>>> 
>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841
>>> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB
>>> used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s;
>>> 41424/15131923 degraded (0.274%);  recovering 0 o/s, 2035KB/s
>>> 
>>> 
>>> It’s improving, but way too slowly. If I restart the recovery (ceph
>>> osd set no recovery /unset) it doesn’t change the osd what I can see.
>>> 
>>> Any ideas?
>>> 
> I don't know the state of your cluster, i.e. what caused the recovery to
> start (how many OSDs went down?).
Performance degradation; databases are the worst impacted. It's actually an OSD 
that we put in that's causing it (we removed it again though). So the cluster in 
itself is healthy.

> If you have a replication of 3 and only one OSD was involved, what is
> stopping you from taking that wonky drive/OSD out?
> 
There's data that would go missing if I did that; I guess I have to wait for the 
recovery process to complete before I can go any further. This is with rep 3.
> If you don't know that or want to play it safe, how about setting the
> weight of that OSD to 0? 
> While that will AFAICT still result in all primary PGs to be evacuated
> off it, no more writes will happen to it and reads might be faster.
> In either case, it shouldn't slow down the rest of your cluster anymore.
> 
That's actually one idea I hadn't thought of. I want to play it safe right now 
and hope that it picks up again. I did find one wonky way of keeping the recovery 
process from grinding to a halt, and that was restarting OSDs. One at a time.
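
For reference, a minimal sketch of both approaches (osd.12 is a placeholder for 
the suspect OSD):

ceph osd reweight 12 0              # push all PGs off the OSD while it stays up and in
/etc/init.d/ceph restart osd.12     # or just restart it, which is what got recovery moving here

and "ceph osd reweight 12 1" would restore the original weight, which is exactly 
what would move all the PGs back again.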

Regards,
Josef
> Regards,
> 
> Christian
>>> Cheers,
>>> Josef
>>> 
>>> On 05 Sep 2014, at 11:26, Luis Periquito 
>>> wrote:
>>> 
>>>> Only time I saw such behaviour was when I was deleting a big chunk of
>>>> data from the cluster: all the client activity was reduced, the op/s
>>>> were almost non-existent and there was unjustified delays all over
>>>> the cluster. But all the disks were somewhat busy in atop/iotstat.
>>>> 
>>>> 
>>>> On 5 September 2014 09:51, David  wrote:
>>>> Hi,
>>>> 
>>>> Indeed strange.
>>>> 
>>>> That output was when we had issues, seems that most operations were
>>>> blocked / slow requests.
>>>> 
>>>> A ”baseline” output is more like today:
>>>> 
>>>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05 10:44:30.125637
>>>> mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 active+clean; 12253 GB
>>>> data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s rd, 20430KB/s
>>>> wr, 2294op/s 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761:
>>>> 6860 pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB /
>>>> 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
>>>> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
>>>> 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05 10:44:33.161210 mon.0
>>>> [INF] pgmap v12582763: 6860 pgs: 6860 active+clean; 12253 GB data,
>>>> 36574 GB used, 142 TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr,
>>>> 2609op/s 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860
>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>> avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05
>>>> 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860
>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail;
>>>> 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05 10:44:36.213535 mon.0
>>>> [INF] pgmap v12582766: 6860 pgs: 6860 active+clean; 12253 GB data,
>>>> 36574 GB

Re: [ceph-users] Huge issues with slow requests

2014-09-06 Thread Josef Johansson
Also putting this on the list.

On 06 Sep 2014, at 13:36, Josef Johansson  wrote:

> Hi,
> 
> Same issues again, but I think we found the drive that causes the problems.
> 
> But this is causing problems as it’s trying to do a recover to that osd at 
> the moment.
> 
> So we’re left with the status message 
> 
> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs: 6841 
> active+clean, 19 active+remapped+backfilling; 12299 GB data, 36882 GB used, 
> 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr, 74op/s; 41424/15131923 
> degraded (0.274%);  recovering 0 o/s, 2035KB/s
> 
> 
> It’s improving, but way too slowly. If I restart the recovery (ceph osd set 
> no recovery /unset) it doesn’t change the osd what I can see.
> 
> Any ideas?
> 
> Cheers,
> Josef
> 
> On 05 Sep 2014, at 11:26, Luis Periquito  wrote:
> 
>> Only time I saw such behaviour was when I was deleting a big chunk of data 
>> from the cluster: all the client activity was reduced, the op/s were almost 
>> non-existent and there was unjustified delays all over the cluster. But all 
>> the disks were somewhat busy in atop/iotstat.
>> 
>> 
>> On 5 September 2014 09:51, David  wrote:
>> Hi,
>> 
>> Indeed strange.
>> 
>> That output was when we had issues, seems that most operations were blocked 
>> / slow requests.
>> 
>> A ”baseline” output is more like today:
>> 
>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9273KB/s 
>> rd, 24650KB/s wr, 2755op/s
>> 2014-09-05 10:44:30.125637 mon.0 [INF] pgmap v12582760: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9500KB/s 
>> rd, 20430KB/s wr, 2294op/s
>> 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9216KB/s 
>> rd, 20062KB/s wr, 2488op/s
>> 2014-09-05 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12511KB/s 
>> rd, 15739KB/s wr, 2488op/s
>> 2014-09-05 10:44:33.161210 mon.0 [INF] pgmap v12582763: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 18593KB/s 
>> rd, 14880KB/s wr, 2609op/s
>> 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17720KB/s 
>> rd, 22964KB/s wr, 3257op/s
>> 2014-09-05 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 19230KB/s 
>> rd, 18901KB/s wr, 3199op/s
>> 2014-09-05 10:44:36.213535 mon.0 [INF] pgmap v12582766: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17630KB/s 
>> rd, 18855KB/s wr, 3131op/s
>> 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 12262KB/s 
>> rd, 18627KB/s wr, 2595op/s
>> 2014-09-05 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 17697KB/s 
>> rd, 17572KB/s wr, 2156op/s
>> 2014-09-05 10:44:39.239409 mon.0 [INF] pgmap v12582769: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 20300KB/s 
>> rd, 19735KB/s wr, 2197op/s
>> 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 14656KB/s 
>> rd, 15460KB/s wr, 2199op/s
>> 2014-09-05 10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 8969KB/s 
>> rd, 11918KB/s wr, 1951op/s
>> 2014-09-05 10:44:42.276192 mon.0 [INF] pgmap v12582772: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 7272KB/s 
>> rd, 10644KB/s wr, 1832op/s
>> 2014-09-05 10:44:43.291817 mon.0 [INF] pgmap v12582773: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9316KB/s 
>> rd, 16610KB/s wr, 2412op/s
>> 2014-09-05 10:44:44.295469 mon.0 [INF] pgmap v12582774: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9257KB/s 
>> rd, 19953KB/s wr, 2633op/s
>> 2014-09-05 10:44:45.315774 mon.0 [INF] pgmap v12582775: 6860 pgs: 6860 
>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB avail; 9718KB/s 
>> rd, 14298KB/s wr, 2101op/s
>> 2

[ceph-users] Good way to monitor detailed latency/throughput

2014-09-05 Thread Josef Johansson
Hi,

How do you guys monitor the cluster to find disks that behave badly, or
VMs that impact the Ceph cluster?

I'm looking for something that gives a good bird's-eye view of
latency/throughput and uses something easy like SNMP.
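
The raw numbers seem to be there if you poll the OSDs directly, roughly like this 
(osd.0 and the socket path are the defaults, adjust as needed):

ceph osd perf                                                  # commit/apply latency for every OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump    # detailed per-OSD counters as JSON
iostat -xm 5                                                   # per-disk throughput/latency on the nodes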

Regards,
Josef Johansson
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using Ramdisk wi

2014-07-30 Thread Josef Johansson
Hi,

Just chipping in:
As RAM is pretty cheap right now, it could be an idea to fill all the
memory slots in the OSD nodes; there's a bigger chance then that the data
you've requested is already in RAM.

You should go with the DC S3700 400GB for the journals at least.

Cheers,
Josef

On 30/07/14 17:12, Christian Balzer wrote:
> On Wed, 30 Jul 2014 10:50:02 -0400 German Anders wrote:
>
>> Hi Christian,
>>   How are you? Thanks a lot for the answers, mine in red.
>>
> Most certainly not in red on my mail client...
>
>> --- Original message ---
>>> Asunto: Re: [ceph-users] Using Ramdisk wi
>>> De: Christian Balzer 
>>> Para: 
>>> Cc: German Anders 
>>> Fecha: Wednesday, 30/07/2014 11:42
>>>
>>>
>>> Hello,
>>>
>>> On Wed, 30 Jul 2014 09:55:49 -0400 German Anders wrote:
>>>
 Hi Wido,

  How are you? Thanks a lot for the quick response. I know 
 that is
 heavy cost on using ramdisk, but also i want to try that to see if i
 could get better performance, since I'm using a 10GbE network with the
 following configuration and i can't achieve more than 300MB/s of
 throughput on rbd:

>>> Testing the limits of Ceph with a ramdisk based journal to see what is
>>> possible in terms of speed (and you will find that it is CPU/protocol
>>> bound) is fine.
>>> Anything resembling production is a big no-no.
>> Got it, did you try flashcache from facebook or dm-cache?
> No.
>
>>>
>>>
 MON Servers (3):
  2x Intel Xeon E3-1270v3 @3.5Ghz (8C)
  32GB RAM
  2x SSD Intel 120G in RAID1 for OS
  1x 10GbE port

 OSD Servers (4):
  2x Intel Xeon E5-2609v2 @2.5Ghz (8C)
  64GB RAM
  2x SSD Intel 120G in RAID1 for OS
  3x SSD Intel 120G for Journals (3 SAS disks: 1 SSD 
 Journal)
>>> You're not telling us WHICH actual Intel SSDs you're using.
>>> If those are DC3500 ones, then 300MB/s totoal isn't a big surprise at 
>>> all,
>>> as they are capable of 135MB/s writes at most.
>> The SSD model is Intel SSDSC2BB120G4 firm D2010370
> That's not really an answer, but then again Intel could have chosen model
> numbers that resemble their product names.
>
> That is indeed a DC 3500, so my argument stands.
> With those SSDs for your journals, much more than 300MB/s per node is
> simply not possible, never mind how fast or slow the HDDs perform.
>
>>>
>>>
  9x SAS 3TB 6G for OSD
>>> That would be somewhere over 1GB/s in theory, but give file system and
>>> other overheads (what is your replication level?) that's a very
>>> theoretical value indeed.
>> The RF is 2, so perf should be much better, also notice that read perf 
>> is really poor, around 62MB/s...
>>
> A replication factor of 2 means that each write is amplified by 2.
> So half of your theoretical performance is gone already.
>
> Do your tests with atop or iostat running on all storage nodes. 
> Determine where the bottleneck is, the journals SSDs or the HDDs or
> (unlikely) something else.
>
> Read performance sucks balls with RBD (at least individually), it can be
> improved by fondling the readahead value. See:
>
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/8817
>
> This is something the Ceph developers are aware of and hopefully will
> address in the future:
> https://wiki.ceph.com/Planning/Blueprints/Emperor/Kernel_client_read_ahead_optimization
>
> Christian
>
>>>
>>>
>>> Christian
>>>
  2x 10GbE port (1 for Cluster Network, 1 for Public 
 Network)

 - 10GbE Switches (1 for Cluster interconnect and 1 for Public network)
 - Using Ceph Firefly version 0.80.4.

  The thing is that with fio, rados bench and vdbench
 tools we
 only see 300MB/s on writes (rand and seq) with bs of 4m and 16
 threads, that's pretty low actually, yesterday i was talking in the
 ceph irc and i hit with the presentation that someone from Fujitsu do
 on Frankfurt and also with some mails with some config at 10GbE  and
 he achieve almost 795MB/s and more... i would like to know if possible
 how to implement that so we could improve our ceph cluster a little
 bit more, i actually configure the scheduler on the SSD's disks both
 OS and Journal to [noop] but still didn't notice any improvement.
 That's why we would like to try RAMDISK on Journals, i've noticed that
 he implement that on their Ceph cluster.

 I will really appreciate the help on this. Also if you need me to send
 you some more information about the  Ceph scheme please let me know.
 Also if someone could share some detail conf info will really help!

 Thanks a lot,


 German Anders


















> --- Original message ---
> Asunto: Re: [ceph-users] Using Ramdisk wi
> De: Wido den Hollander 
> Para: 
> Fecha: We

[ceph-users] Recommendation to safely avoid problems with osd-failure

2014-07-28 Thread Josef Johansson
Hi,

I'm trying to compile a strategy to avoid performance problems if OSDs
or OSD hosts fail.

If I encounter a re-balance of one OSD during mid-day, there will be
performance problems right away; if I could see the issue and let it
re-balance during the evening, that would be great.

I.e. if two OSD hosts die around the same time, I suspect that the
clients would suffer greatly.

Currently the osd has the following settings

 osd max backfills = 1
 osd recovery max active = 1

Is there any general guidance or recommendation for unexpected outages?
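
What I am considering is along these lines (a sketch only; the injectargs values
are examples, not recommendations):

ceph osd set noout        # a down OSD stays "in", so no re-balance starts on its own
# ...replace/repair the OSD, or simply wait for the evening...
ceph tell osd.\* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 2'   # loosen throttles once clients are quiet
ceph osd unset noout      # now let the re-balance run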

Cheers,
Josef Johansson
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-25 Thread Josef Johansson
Hi,

On 25/06/14 00:27, Mark Kirkwood wrote:
> On 24/06/14 23:39, Mark Nelson wrote:
>> On 06/24/2014 03:45 AM, Mark Kirkwood wrote:
>>> On 24/06/14 18:15, Robert van Leeuwen wrote:
> All of which means that Mysql performance (looking at you binlog) may
> still suffer due to lots of small block size sync writes.

 Which begs the question:
 Anyone running a reasonable busy Mysql server on Ceph backed storage?

 We tried and it did not perform good enough.
 We have a small ceph cluster: 3 machines with 2 SSD journals and 10
 spinning disks each.
 Using ceph trough kvm rbd we were seeing performance equal to about
 1-2 spinning disks.

 Reading this thread it now looks a bit if there are inherent
 architecture + latency issues that would prevent it from performing
 great as a Mysql database store.
 I'd be interested in example setups where people are running busy
 databases on Ceph backed volumes.
>>>
>>> Yes indeed,
>>>
>>> We have looked extensively at Postgres performance on rbd - and
>>> while it
>>> is not Mysql, the underlying mechanism for durable writes (i.e commit)
>>> is essentially very similar (fsync, fdatasync and friends). We achieved
>>> quite reasonable performance (by that I mean sufficiently
>>> encouraging to
>>> be happy to host real datastores for our moderately busy systems - and
>>> we are continuing to investigate using it for our really busy ones).
>>>
>>> I have not experimented exptensively with the various choices of flush
>>> method (called sync method in Postgres but the same idea), as we found
>>> quite good performance with the default (fdatasync). However this is
>>> clearly an area that is worth investigation.
>>
>> FWIW, I ran through the DBT-3 benchmark suite on MariaDB ontop of
>> qemu/kvm RBD with a 3X replication pool on 30 OSDs with 3x replication.
>>   I kept buffer sizes small to try to force disk IO and benchmarked
>> against a local disk passed through to the VM.  We typically did about
>> 3-4x faster on queries than the local disk, but there were a couple of
>> queries were we were slower.  I didn't look at how multiple databases
>> scaled though.  That may have it's own benefits and challenges.
>>
>> I'm encouraged overall though.  It looks like from your comments and
>> from my own testing it's possible to have at least passable performance
>> with a single database and potentially as we reduce latency in Ceph make
>> it even better.  With multiple databases, it's entirely possible that we
>> can do pretty good even now.
>>
>
> Yes - same kind of findings, specifically:
>
> - random read and write (e.g index access) faster than local disk
> - sequential write (e.g batch inserts) similar or faster than local disk
> - sequential read (e.g table scan) slower than local disk
>
Regarding sequential read, I think it was
https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii
that did some tuning with that.
Anyone tried to optimize it the way they did in the article?
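
The only knob I can think of in that direction is plain read-ahead (just my 
assumption about what such tuning involves; devices and values are examples):

blockdev --getra /dev/vda                         # current read-ahead inside the VM, in 512-byte sectors
blockdev --setra 4096 /dev/vda                    # raise it (4096 sectors = 2 MB)
echo 4096 > /sys/block/vda/queue/read_ahead_kb    # same knob via sysfs, note this one is in KB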

Cheers,
Josef
> Regards
>
> Mark
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD turned itself off

2014-06-13 Thread Josef Johansson

Thanks for the quick response.

Cheers,
Josef

Gregory Farnum skrev 2014-06-14 02:36:

On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson  wrote:

Hi Greg,

Thanks for the clarification. I believe the OSD was in the middle of a deep
scrub (sorry for not mentioning this straight away), so then it could've
been a silent error that got wind during scrub?

Yeah.


What's best practice when the store is corrupted like this?

Remove the OSD from the cluster, and either reformat the disk or
replace as you judge appropriate.
-Greg


Cheers,
Josef

Gregory Farnum skrev 2014-06-14 02:21:


The OSD did a read off of the local filesystem and it got back the EIO
error code. That means the store got corrupted or something, so it
killed itself to avoid spreading bad data to the rest of the cluster.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson 
wrote:

Hey,

Just examing what happened to an OSD, that was just turned off. Data has
been moved away from it, so hesitating to turned it back on.

Got the below in the logs, any clues to what the assert talks about?

Cheers,
Josef

-1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t,
const
hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88
c700 time 2014-06-11 21:13:54.036982
os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio
||
got != -5)

   ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
   1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned
long,
ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
   2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*,
std::vector >&)+0x350) [0x708230]
   3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86)
[0x713366]
   4: (ReplicatedPG::do_op(std::tr1::shared_ptr)+0x3095)
[0x71acb5]
   5: (PG::do_request(std::tr1::shared_ptr,
ThreadPool::TPHandle&)+0x3f0) [0x812340]
   6: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x2ea) [0x75c80a]
   7: (OSD::OpWQ::_process(boost::intrusive_ptr,
ThreadPool::TPHandle&)+0x198) [0x770da8]
   8: (ThreadPool::WorkQueueVal,
std::tr1::shared_ptr >, boost::intrusive_ptr

::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7a89

ce]
   9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b5dea]
   10: (ThreadPool::WorkThread::entry()+0x10) [0x9b7040]
   11: (()+0x6b50) [0x7fdadffdfb50]
   12: (clone()+0x6d) [0x7fdade53b0ed]
   NOTE: a copy of the executable, or `objdump -rdS ` is
needed to
interpret this.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD turned itself off

2014-06-13 Thread Josef Johansson

Hi Greg,

Thanks for the clarification. I believe the OSD was in the middle of a 
deep scrub (sorry for not mentioning this straight away), so it could have 
been a silent error that surfaced during the scrub?


What's best practice when the store is corrupted like this?
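
For the record, the sequence I have in mind is roughly this (osd.12 and /dev/sdX 
are placeholders for the affected OSD and its disk):

dmesg | grep -i error        # look for the underlying I/O errors
smartctl -a /dev/sdX         # check reallocated/pending sectors on the disk itself
ceph osd out 12              # let the cluster re-replicate its data
# once the cluster is healthy again:
/etc/init.d/ceph stop osd.12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12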

Cheers,
Josef

Gregory Farnum skrev 2014-06-14 02:21:

The OSD did a read off of the local filesystem and it got back the EIO
error code. That means the store got corrupted or something, so it
killed itself to avoid spreading bad data to the rest of the cluster.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson  wrote:

Hey,

Just examing what happened to an OSD, that was just turned off. Data has
been moved away from it, so hesitating to turned it back on.

Got the below in the logs, any clues to what the assert talks about?

Cheers,
Josef

-1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const
hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88
c700 time 2014-06-11 21:13:54.036982
os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio ||
got != -5)

  ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
  1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long,
ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
  2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector >&)+0x350) [0x708230]
  3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86)
[0x713366]
  4: (ReplicatedPG::do_op(std::tr1::shared_ptr)+0x3095) [0x71acb5]
  5: (PG::do_request(std::tr1::shared_ptr,
ThreadPool::TPHandle&)+0x3f0) [0x812340]
  6: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x2ea) [0x75c80a]
  7: (OSD::OpWQ::_process(boost::intrusive_ptr,
ThreadPool::TPHandle&)+0x198) [0x770da8]
  8: (ThreadPool::WorkQueueVal,
std::tr1::shared_ptr >, boost::intrusive_ptr

::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7a89

ce]
  9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b5dea]
  10: (ThreadPool::WorkThread::entry()+0x10) [0x9b7040]
  11: (()+0x6b50) [0x7fdadffdfb50]
  12: (clone()+0x6d) [0x7fdade53b0ed]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD turned itself off

2014-06-13 Thread Josef Johansson

Hey,

Just examining what happened to an OSD that was just turned off. Data has 
been moved away from it, so I'm hesitating to turn it back on.


Got the below in the logs, any clues to what the assert talks about?

Cheers,
Josef

-1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, 
const hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 
7fdacb88

c700 time 2014-06-11 21:13:54.036982
os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio 
|| got != -5)


 ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
 1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned 
long, ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
 2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, 
std::vector >&)+0x350) [0x708230]
 3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86) 
[0x713366]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr)+0x3095) 
[0x71acb5]
 5: (PG::do_request(std::tr1::shared_ptr, 
ThreadPool::TPHandle&)+0x3f0) [0x812340]
 6: (OSD::dequeue_op(boost::intrusive_ptr, 
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x2ea) [0x75c80a]
 7: (OSD::OpWQ::_process(boost::intrusive_ptr, 
ThreadPool::TPHandle&)+0x198) [0x770da8]
 8: (ThreadPool::WorkQueueVal, 
std::tr1::shared_ptr >, boost::intrusive_ptr 
>::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7a89

ce]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b5dea]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x9b7040]
 11: (()+0x6b50) [0x7fdadffdfb50]
 12: (clone()+0x6d) [0x7fdade53b0ed]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-06-13 Thread Josef Johansson

Hey,

That sounds awful. Have you had any luck in increasing the performance?

Cheers,
Josef

Christian Balzer skrev 2014-05-23 17:57:

For what it's worth (very little in my case)...

Since the cluster wasn't in production yet and Firefly (0.80.1) did hit
Debian Jessie today I upgraded it.

Big mistake...

I did the recommended upgrade song and dance, MONs first, OSDs after that.

Then applied "ceph osd crush tunables default" as per the update
instructions and since "ceph -s" was whining about it.

Lastly I did a "ceph osd pool set rbd hashpspool true" and after that was
finished (people with either a big cluster or slow network probably should
avoid this like the plague) I re-ran the below fio from a VM (old or new
client libraries made no difference) again.

The result, 2800 write IOPS instead of 3200 with Emperor.

So much for improved latency and whatnot...

Christian

On Wed, 14 May 2014 21:33:06 +0900 Christian Balzer wrote:


Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:


Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread.
I don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here
though.


Nods, I do recall that thread.


A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.


That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore ataround 5%) I'd expect Ceph to be the culprit.
  

I'll get back to you with the results, hopefully I'll manage to get
them done during this night.


Looking forward to that. ^^


Christian

Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:

I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal indvidual HDDs
could do.

So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:


On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:


Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.


In the a "picture" being worth a thousand words tradition, I give
you this iostat -x output taken during a fio run:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          50.82    0.00   19.43    0.17    0.00   29.58

Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60

The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.

sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.

The only conclusion I can draw from these numbers and the 

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-06-13 Thread Josef Johansson

Hey,

I did try this but it didn't work, so I think I still have to patch the 
kernel, as user_xattr is not allowed on tmpfs.


Thanks for the description though.

I think the next step is to do it all virtual, maybe on the same hardware 
to avoid the network.
Any problems with doing it all virtual? If it's just memory and the same 
machine, we should see the pure Ceph performance, right?


Anyone done this?

Cheers,
Josef

Stefan Priebe - Profihost AG skrev 2014-05-15 09:58:

Am 15.05.2014 09:56, schrieb Josef Johansson:

On 15/05/14 09:11, Stefan Priebe - Profihost AG wrote:

Am 15.05.2014 00:26, schrieb Josef Johansson:

Hi,

So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.

CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y

Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)

I would create an empty file in tmpfs and then format that file as a
block device.

How do you mean exactly? Creating with dd and mounting with losetup?

mount -t tmpfs -o size=4G /mnt /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mount -o loop /mnt/blockdev_a /ceph/osd.X

Then use /mnt/blockdev_a as the OSD device.


Cheers,
Josef

Created the osd with following:

root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1
filestore(/var/lib/ceph/osd/ceph-50) could not find
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid
c51a2683-55dc-4634-9d9d-f0fec9a6f389
2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file:
/var/lib/ceph/osd/ceph-50/keyring: can't open
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring
/var/lib/ceph/osd/ceph-50/keyring
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 !=
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1  ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument

Cheers,
Josef

Christian Balzer skrev 2014-05-14 14:33:

Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:


Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here
though.


Nods, I do recall that thread.


A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.


That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore ataround 5%) I'd expect Ceph to be the culprit.
  

I'll get back to you with the results, hopefully I'll manage to get them
done during this night.


Looking forward to that. ^^


Christian

Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:

I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in fr

Re: [ceph-users] Ceph networks, to bond or not to bond?

2014-06-07 Thread Josef Johansson

Hi,

Late to the party, but just to be sure, does the switch support MC-LAG 
or MLAG by any chance?

There could be firmware updates that add this.
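
If they don't, active-backup bonding needs nothing special on the switch side; a 
minimal Debian-style sketch (interface names and addresses are placeholders):

# /etc/network/interfaces
auto bond0
iface bond0 inet static
    address 10.0.1.10
    netmask 255.255.255.0
    bond-slaves eth0 eth1       # one leg to each switch
    bond-mode active-backup
    bond-miimon 100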

Cheers,
Josef

Sven Budde skrev 2014-06-06 13:06:

Hi all,

thanks for the replies and heads up for the different bonding options. 
I'll toy around with them in the next days; hopefully there's some 
stable setup possible with provides HA and increased bandwidth together.


Cheers,
Sven
Am 05.06.2014 21:36, schrieb Cedric Lemarchand:
Yes, forgot to mention that, of course LACP and stackable switches is 
the safest and easy way, but sometimes when budget is a constraint 
you have to deal with it. Prices difference between simple Gb 
switches and stackable ones are not negligible. You generally get 
what you paid for ;-)


But I think Linux bonding with a simple network design (2x1Gb for 
each Ceph networks) could do the trick well, with some works 
overhead. Maybe some cephers on this list could confirm that ?


Cheers


Le 05/06/2014 21:21, Scott Laird a écrit :
Doing bonding without LACP is probably going to end up being 
painful.  Sooner or later you're going to end up with one end 
thinking that bonding is working while the other end thinks that 
it's not, and half of your traffic is going to get black-holed.


I've had moderately decent luck running Ceph on top of a weird 
network by carefully controlling the source address that every 
outbound connection uses and then telling Ceph that it's running 
with a 1-network config.  With Linux, the default source address of 
an outbound TCP connection is a function of the route that the 
kernel picks to send traffic to the remote end, and you can override 
it on a per-route basis (it's visible as the the 'src' attribute in 
iproute).  I have a mixed Infiniband+GigE network with each host 
running an OSPF routing daemon (for non-Ceph reasons, mostly), and 
the only two ways that I could get Ceph to be happy were:


1.  Turn off the Infiniband network.  Slow, and causes other problems.
2.  Tell Ceph that there was no cluster network, and tell the OSPF 
daemon to always set src=$eth0_ip on routes that it adds.  Then just 
pretend that the Ethernet network is the only one that exists, and 
sometimes you get a sudden and unexpected boost in bandwidth due to 
/32 routes that send traffic via Infiniband instead of Ethernet.
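
For reference, the per-route source override described above is done with
iproute2; a minimal sketch, with made-up interface names and addresses (in
the setup above the OSPF daemon adds the src attribute automatically, so the
commands only illustrate the mechanism):

    # Assume the Ethernet address is 192.168.1.10 and a peer at 192.168.2.20
    # is reachable via the Infiniband interface ib0. Pin the source address
    # of outbound connections to the Ethernet IP while still routing via ib0:
    ip route replace 192.168.2.20/32 dev ib0 src 192.168.1.10

    # Verify the src attribute on the installed route:
    ip route show 192.168.2.20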


It works, but I wouldn't recommend it for production.  It would have 
been cheaper for me to buy a 10 GigE switch and cards for my garage 
than to have debugged all of this, and that's just for a hobby project.


OTOH, it's probably the only way to get working multipathing for Ceph.


On Thu, Jun 5, 2014 at 10:50 AM, Cedric Lemarchand 
mailto:ced...@yipikai.org>> wrote:


Le 05/06/2014 18:27, Sven Budde a écrit :
> Hi Alexandre,
>
> thanks for the reply. As said, my switches are not stackable,
so using LACP seems not to be my best option.
>
> I'm seeking for an explanation how Ceph is utilizing two (or
more) independent links on both the public and the cluster network.
AFAIK, Ceph does not support multiple IP links in the same "designated
network" (aka the client/OSD networks). Ceph is not aware of link
aggregation; it has to be done at the Ethernet layer, so:

- if your switches are stackable, you can use traditional LACP on both
sides (switch and Ceph)
- if they are not, and as Mariusz said, use the appropriate bonding mode
on the Ceph side and do not use LACP on the switches.

More info here:
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding
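
As a rough illustration of the second option, a non-LACP bond on the Ceph
side could look like this in a Debian-style /etc/network/interfaces (the
interface names, address and the choice of balance-alb are assumptions for
the sketch, not a recommendation for any particular setup):

    # /etc/network/interfaces fragment (assumed names and addresses)
    auto bond0
    iface bond0 inet static
        address 10.0.0.11
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode balance-alb    # no switch-side LACP required
        bond-miimon 100          # link monitoring interval in ms

active-backup (bond-mode active-backup) is the simpler, lower-throughput
alternative mentioned elsewhere in the thread.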

Cheers !
>
> If I configure two IPs for the public network on two NICs,
will Ceph route traffic from its (multiple) OSDs on this node
over both IPs?
>
> Cheers,
> Sven
>
> -Original Message-
> From: Alexandre DERUMIER [mailto:aderum...@odiso.com
]
> Sent: Thursday, 5 June 2014 18:14
> To: Sven Budde
> Cc: ceph-users@lists.ceph.com 
> Subject: Re: [ceph-users] Ceph networks, to bond or not to bond?
>
> Hi,
>
>>> My low-budget setup consists of two gigabit switches,
capable of LACP,
>>> but not stackable. For redundancy, I'd like to have my links
spread
>>> evenly over both switches.
> If you want to do lacp with both switches, they need to be
stackable.
>
> (or use active-backup bonding)
>
>>> My question where I didn't find a conclusive answer in the
>>> documentation and mailing archives:
>>> Will the OSDs utilize both 'single' interfaces per network, if I
>>> assign two IPs per public and per cluster network? Or will
all OSDs
>>> just bind on one IP and use only a single link?
> you just need 1 IP per bond.
>
> with LACP, the load balancing uses a hash algorithm to
load-balance TCP connections.
> (that also mean than 1 connection can'

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-15 Thread Josef Johansson

On 15/05/14 09:11, Stefan Priebe - Profihost AG wrote:
> Am 15.05.2014 00:26, schrieb Josef Johansson:
>> Hi,
>>
>> So, apparently tmpfs does not support non-root xattrs due to a possible
>> DoS vector. The configuration for enabling it is set as far as I can see.
>>
>> CONFIG_TMPFS=y
>> CONFIG_TMPFS_POSIX_ACL=y
>> CONFIG_TMPFS_XATTR=y
>>
>> Anyone know a way around it? Saw that there's a patch for enabling it,
>> but recompiling my kernel is out of reach right now ;)
> I would create an empty file in tmpfs and then format that file as a
> block device.
How do you mean exactly? Creating with dd and mounting with losetup?

Cheers,
Josef
>> Created the osd with following:
>>
>> root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
>> root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
>> root@osd1:/# mkfs.xfs /dev/loop0
>> root@osd1:/# ceph osd create
>> 50
>> root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
>> root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
>> root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
>> --osd-journal=/dev/sdc7 --mkjournal
>> 2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open:
>> aio not supported without directio; disabling aio
>> 2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid
>> bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected
>> b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
>> 2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open:
>> aio not supported without directio; disabling aio
>> 2014-05-15 00:20:29.807237 7f40063bb780 -1
>> filestore(/var/lib/ceph/osd/ceph-50) could not find
>> 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
>> 2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store
>> /var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid
>> c51a2683-55dc-4634-9d9d-f0fec9a6f389
>> 2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file:
>> /var/lib/ceph/osd/ceph-50/keyring: can't open
>> /var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
>> 2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring
>> /var/lib/ceph/osd/ceph-50/keyring
>> root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
>> --osd-journal=/dev/sdc7 --mkjournal
>> 2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open:
>> aio not supported without directio; disabling aio
>> 2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open:
>> aio not supported without directio; disabling aio
>> 2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 !=
>> superblock's -1
>> 2014-05-15 00:20:51.129845 7ff813ba4780 -1  ** ERROR: error creating
>> empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
>>
>> Cheers,
>> Josef
>>
>> Christian Balzer skrev 2014-05-14 14:33:
>>> Hello!
>>>
>>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>>
>>>> Hi Christian,
>>>>
>>>> I missed this thread, haven't been reading the list that well the last
>>>> weeks.
>>>>
>>>> You already know my setup, since we discussed it in an earlier thread. I
>>>> don't have a fast backing store, but I see the slow IOPS when doing
>>>> randwrite inside the VM, with rbd cache. Still running dumpling here
>>>> though.
>>>>
>>> Nods, I do recall that thread.
>>>
>>>> A thought struck me that I could test with a pool that consists of OSDs
>>>> that have tmpfs-based disks, think I have a bit more latency than your
>>>> IPoIB but I've pushed 100k IOPS with the same network devices before.
>>>> This would verify if the problem is with the journal disks. I'll also
>>>> try to run the journal devices in tmpfs as well, as it would test
>>>> purely Ceph itself.
>>>>
>>> That would be interesting indeed.
>>> Given what I've seen (with the journal at 20% utilization and the actual
>>> filestore at around 5%) I'd expect Ceph to be the culprit.
>>>  
>>>> I'll get back to you with the results, hopefully I'll manage to get them
>>>> done during this night.
>>>>
>>> Looking forward to that. ^^
>>>
>>>
>>> Christian
>>>> Cheers,
>>>> Josef
>>>>
>>>> On 13/05/14 11:03, Christian Bal

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-14 Thread Josef Johansson

Hi,

So, apparently tmpfs does not support non-root xattrs due to a possible 
DoS vector. The configuration for enabling it is set as far as I can see.


CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y

Anyone know a way around it? Saw that there's a patch for enabling it, 
but recompiling my kernel is out of reach right now ;)
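
A quick way to check whether user xattrs work on a given mount is to set one
by hand (this assumes the attr package with setfattr/getfattr is installed;
it is only a sanity check, not a workaround):

    touch /dev/shm/xattr-test
    setfattr -n user.test -v 1 /dev/shm/xattr-test && \
        getfattr -n user.test /dev/shm/xattr-test
    # "Operation not supported" means the kernel does not allow user
    # xattrs on this tmpfs mount.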


Created the osd with following:

root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey 
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid 
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected 
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1 
filestore(/var/lib/ceph/osd/ceph-50) could not find 
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store 
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid 
c51a2683-55dc-4634-9d9d-f0fec9a6f389
2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file: 
/var/lib/ceph/osd/ceph-50/keyring: can't open 
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring 
/var/lib/ceph/osd/ceph-50/keyring
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey 
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 != 
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1  ** ERROR: error creating 
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument


Cheers,
Josef

Christian Balzer skrev 2014-05-14 14:33:

Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:


Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here
though.


Nods, I do recall that thread.


A thought struck me that I could test with a pool that consists of OSDs
that have tmpfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as it would test
purely Ceph itself.


That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
  

I'll get back to you with the results, hopefully I'll manage to get them
done during this night.


Looking forward to that. ^^


Christian

Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:

I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal individual HDDs could
do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
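
For reference, the cache-size experiment described here maps to client-side
options roughly like the following (a sketch only; the 512 MB figure simply
illustrates "larger than the test data" and is not a recommendation):

    [client]
        rbd cache = true
        rbd cache size = 536870912       # bytes; the default is 32 MB
        rbd cache max dirty = 402653184  # must stay below the cache size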

On the other hand if people here regularly get thousands or tens of
thousands of IOPS per OSD with the appropriate HW I'm 

Re: [ceph-users] Slow IOPS on RBD compared to journalandbackingdevices

2014-05-14 Thread Josef Johansson
Hi,

Yeah, running with MTU 9000 here, but that test was sequential.

Just ran rbd -p shared-1 bench-write test --io-size $((32*1024*1024))
--io-pattern rand

The cluster itself showed 700MB/s of writes (3x replicas), but the test only
got 45MB/s. But I think rbd is a little bit broken ;)
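
For comparison, a roughly equivalent test from inside a VM could be run with
fio along these lines (a sketch; the device path, sizes and runtime are
assumptions, and /dev/vdb should be a scratch disk since this writes to it):

    fio --name=rbd-randwrite --filename=/dev/vdb --direct=1 \
        --ioengine=libaio --rw=randwrite --bs=32m --iodepth=4 \
        --runtime=60 --time_based --group_reporting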

Cheers,
Josef

On 14/05/14 15:23, German Anders wrote:
> Hi Josef, 
> Thanks a lot for the quick answer.
>
> yes 32M and rand writes
>
> and also, do you get those values, I guess, with an MTU of 9000 or with
> the traditional and beloved MTU 1500?
>
>  
>
> *German Anders*
> /Field Storage Support Engineer/**
>
> Despegar.com - IT Team
>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to
>> journalandbackingdevices
>> *De:* Josef Johansson 
>> *Para:* 
>> *Fecha:* Wednesday, 14/05/2014 10:10
>>
>> Hi,
>>
>> On 14/05/14 14:45, German Anders wrote:
>>
>> I forgot to mention, of course on a 10GbE network
>>  
>>  
>>
>> *German Anders*
>> /Field Storage Support Engineer/**
>>
>> Despegar.com - IT Team
>>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to
>> journal andbackingdevices
>> *De:* German Anders 
>> *Para:* Christian Balzer 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:41
>>
>> Could anyone get a throughput on RBD of 600MB/s
>> or more on (rw) with a block size of 32768k?
>>  
>>
>> Is that 32M then?
>> Sequential or randwrite?
>>
>> I get about those speeds when doing (1M block size) buffered writes
>> from within a VM on 20GbE. The cluster maxes out at about 900MB/s.
>>
>> Cheers,
>> Josef
>>
>>  
>>
>> *German Anders*
>> /Field Storage Support Engineer/**
>>
>> Despegar.com - IT Team
>>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to
>> journal and backingdevices
>> *De:* Christian Balzer 
>> *Para:* Josef Johansson 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:33
>>
>>
>> Hello!
>>
>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>> Hi Christian,
>>
>> I missed this thread, haven't been reading the list
>> that well the last
>> weeks.
>>
>> You already know my setup, since we discussed it in
>> an earlier thread. I
>> don't have a fast backing store, but I see the slow
>> IOPS when doing
>> randwrite inside the VM, with rbd cache. Still
>> running dumpling here
>> though.
>>
>> Nods, I do recall that thread.
>>
>> A thought struck me that I could test with a pool
>> that consists of OSDs
>> that have tmpfs-based disks, think I have a bit more
>> latency than your
>> IPoIB but I've pushed 100k IOPS with the same network
>> devices before.
>> This would verify if the problem is with the journal
>> disks. I'll also
>> try to run the journal devices in tmpfs as well, as
>> it would test
>> purely Ceph itself.
>>
>> That would be interesting indeed.
>> Given what I've seen (with the journal at 20% utilization
>> and the actual
>> filestore at around 5%) I'd expect Ceph to be the culprit.
>>
>> I'll get back to you with the results, hopefully I'll
>> manage to get them
>> done during this night.
>>
>> Looking forward to that. ^^
>>
>>
>> Christian
>>
>> Cheers,
>> Josef
>>
>> On 13/05/14 11:03, Christian Balzer wrote:
>>
>> I'm c

Re: [ceph-users] Slow IOPS on RBD compared to journal andbackingdevices

2014-05-14 Thread Josef Johansson
Hi,

On 14/05/14 14:45, German Anders wrote:
> I forgot to mention, of course on a 10GbE network
>  
>  
>
> *German Anders*
> /Field Storage Support Engineer/**
>
> Despegar.com - IT Team
>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to journal
>> andbackingdevices
>> *De:* German Anders 
>> *Para:* Christian Balzer 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:41
>>
>> Could anyone get a throughput on RBD of 600MB/s or more
>> on (rw) with a block size of 32768k?
>>  
Is that 32M then?
Sequential or randwrite?

I get about those speeds when doing (1M block size) buffered writes from
within a VM on 20GbE. The cluster maxes out at about 900MB/s.

Cheers,
Josef
>>  
>>
>> *German Anders*
>> /Field Storage Support Engineer/**
>>
>> Despegar.com - IT Team
>>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to journal
>> and backingdevices
>> *De:* Christian Balzer 
>> *Para:* Josef Johansson 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:33
>>
>>
>> Hello!
>>
>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>> Hi Christian,
>>
>> I missed this thread, haven't been reading the list that well
>> the last
>> weeks.
>>
>> You already know my setup, since we discussed it in an
>> earlier thread. I
>> don't have a fast backing store, but I see the slow IOPS when
>> doing
>> randwrite inside the VM, with rbd cache. Still running
>> dumpling here
>> though.
>>
>> Nods, I do recall that thread.
>>
>> A thought struck me that I could test with a pool that
>> consists of OSDs
>> that have tmpfs-based disks, think I have a bit more latency
>> than your
>> IPoIB but I've pushed 100k IOPS with the same network devices
>> before.
>> This would verify if the problem is with the journal disks.
>> I'll also
>> try to run the journal devices in tmpfs as well, as it would
>> test
>> purely Ceph itself.
>>
>> That would be interesting indeed.
>> Given what I've seen (with the journal at 20% utilization and the
>> actual
>> filestore at around 5%) I'd expect Ceph to be the culprit.
>>
>> I'll get back to you with the results, hopefully I'll manage
>> to get them
>> done during this night.
>>
>> Looking forward to that. ^^
>>
>>
>> Christian
>>
>> Cheers,
>> Josef
>>
>> On 13/05/14 11:03, Christian Balzer wrote:
>>
>> I'm clearly talking to myself, but whatever.
>>
>> For Greg, I've played with all the pertinent journal and
>> filestore
>> options and TCP nodelay, no changes at all.
>>
>> Is there anybody on this ML who's running a Ceph cluster
>> with a fast
>> network and FAST filestore, so like me with a big HW
>> cache in front of
>> a RAID/JBODs or using SSDs for final storage?
>>
>> If so, what results do you get out of the fio statement
>> below per OSD?
>> In my case with 4 OSDs and 3200 IOPS that's about 800
>> IOPS per OSD,
>> which is of course vastly faster than the normal
>> individual HDDs could
>> do.
>>
>> So I'm wondering if I'm hitting some inherent limitation
>> of how fast a
>> single OSD (as in the software) can handle IOPS, given
>> that everything
>> else has been ruled out from where I stand.
>>
>> This would also explain why none of the option changes or
>> the use of
>> RBD caching has any measurable effect in the test case
>> below.
>> As in, a slow OSD aka single HDD with journal on the same
>> disk would
>> clearly benefit from even the small 32MB standard RBD
>> cache, while in
>>  

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-14 Thread Josef Johansson
Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.

A thought struck me that I could test with a pool that consists of OSDs
that have tmpfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as it would test
purely Ceph itself.

I'll get back to you with the results, hopefully I'll manage to get them
done during this night.

Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:
> I'm clearly talking to myself, but whatever.
>
> For Greg, I've played with all the pertinent journal and filestore options
> and TCP nodelay, no changes at all.
>
> Is there anybody on this ML who's running a Ceph cluster with a fast
> network and FAST filestore, so like me with a big HW cache in front of a
> RAID/JBODs or using SSDs for final storage?
>
> If so, what results do you get out of the fio statement below per OSD?
> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
> is of course vastly faster than the normal individual HDDs could do.
>
> So I'm wondering if I'm hitting some inherent limitation of how fast a
> single OSD (as in the software) can handle IOPS, given that everything else
> has been ruled out from where I stand.
>
> This would also explain why none of the option changes or the use of
> RBD caching has any measurable effect in the test case below. 
> As in, a slow OSD aka single HDD with journal on the same disk would
> clearly benefit from even the small 32MB standard RBD cache, while in my
> test case the only time the caching becomes noticeable is if I increase
> the cache size to something larger than the test data size. ^o^
>
> On the other hand if people here regularly get thousands or tens of
> thousands of IOPS per OSD with the appropriate HW I'm stumped.
>
> Christian
>
> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>
>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>
>>> Oh, I didn't notice that. I bet you aren't getting the expected
>>> throughput on the RAID array with OSD access patterns, and that's
>>> applying back pressure on the journal.
>>>
>> In the a "picture" being worth a thousand words tradition, I give you
>> this iostat -x output taken during a fio run:
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           50.82    0.00   19.43    0.17    0.00   29.58
>>
>> Device:  rrqm/s  wrqm/s   r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>> sda        0.00   51.50  0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
>> sdb        0.00    0.00  0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
>> sdc        0.00    5.00  0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
>> sdd        0.00    6.50  0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
>>
>> The %user CPU utilization is pretty much entirely the 2 OSD processes,
>> note the nearly complete absence of iowait.
>>
>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
>> Look at these numbers, the lack of queues, the low wait and service
>> times (this is in ms) plus overall utilization.
>>
>> The only conclusion I can draw from these numbers and the network results
>> below is that the latency happens within the OSD processes.
>>
>> Regards,
>>
>> Christian
>>> When I suggested other tests, I meant with and without Ceph. One
>>> particular one is OSD bench. That should be interesting to try at a
>>> variety of block sizes. You could also try running RADOS bench and
>>> smalliobench at a few different sizes.
>>> -Greg
>>>
>>> On Wednesday, May 7, 2014, Alexandre DERUMIER 
>>> wrote:
>>>
 Hi Christian,

 Have you tried without RAID6, to have more OSDs?
 (How many disks do you have behind the RAID6?)


 Also, I know that direct I/O can be quite slow with Ceph,

 maybe you can try without --direct=1

 and also enable rbd_cache

 ceph.conf
 [client]
 rbd cache = true




 - Original Message -

 From: "Christian Balzer" >
 To: "Gregory Farnum" >,
 ceph-users@lists.ceph.com 
 Sent: Thursday, 8 May 2014 04:49:16
 Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
 backing devices

 On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:

> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> >
 wrote:
>> Hello,
>>
>> ceph 0.72 on Debian Jessie,

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Josef Johansson

On 11/04/14 09:07, Wido den Hollander wrote:
>
>> On 11 April 2014 at 8:50 Josef Johansson wrote:
>>
>>
>> Hi,
>>
>> On 11/04/14 07:29, Wido den Hollander wrote:
>>>> On 11 April 2014 at 7:13 Greg Poirier wrote:
>>>>
>>>>
>>>> One thing to note
>>>> All of our kvm VMs have to be rebooted. This is something I wasn't
>>>> expecting.  Tried waiting for them to recover on their own, but that's not
>>>> happening. Rebooting them restores service immediately. :/ Not ideal.
>>>>
>>> A reboot isn't really required though. It could be that the VM itself is in
>>> trouble, but from a librados/librbd perspective I/O should simply continue
>>> as
>>> soon as an osdmap has been received without the "full" flag.
>>>
>>> It could be that you have to wait some time before the VM continues. This
>>> can
>>> take up to 15 minutes.
>> With other storage solutions you would have to change the timeout value
>> for each disk, e.g. changing it from 60 secs to 180 secs, for the VMs to
>> survive storage problems.
>> Does Ceph handle this differently somehow?
>>
> It's not that RBD does it differently. Librados simply blocks the I/O, and thus
> so does librbd, which then causes Qemu to block.
>
> I've seen VMs survive RBD issues for longer periods than 60 seconds. Gave them
> some time and they continued again.
>
> Which exact setting are you talking about? I'm talking about a Qemu/KVM VM
> running with a VirtIO drive.
cat /sys/block/*/device/timeout
(http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465)

This file is non-existent for my Ceph VirtIO drive, however, so it seems
RBD handles this.

I only have para-virtualized VMs to compare with right now, and they
don't have it inside the VM, but that's expected. From my understanding
it should've been there if it were an HVM. Whenever the timeout was
reached, an error occurred and the disk was set to read-only mode.
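
For reference, on guests where that sysfs file does exist (e.g. emulated SCSI
or virtio-scsi disks), the timeout is commonly raised per disk like this (the
device name and the 180-second value are just examples):

    # Raise the I/O timeout for one disk from the usual 30-60s to 180s.
    echo 180 > /sys/block/sda/device/timeout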

Cheers,
Josef
> Wido
>
>> Cheers,
>> Josef
>>> Wido
>>>
>>>> On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier
>>>> wrote:
>>>>
>>>>> Going to try increasing the full ratio. Disk utilization wasn't really
>>>>> growing at an unreasonable pace. I'm going to keep an eye on it for the
>>>>> next couple of hours and down/out the OSDs if necessary.
>>>>>
>>>>> We have four more machines that we're in the process of adding (which
>>>>> doubles the number of OSDs), but got held up by some networking nonsense.
>>>>>
>>>>> Thanks for the tips.
>>>>>
>>>>>
>>>>> On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil  wrote:
>>>>>
>>>>>> On Thu, 10 Apr 2014, Greg Poirier wrote:
>>>>>>> Hi,
>>>>>>> I have about 200 VMs with a common RBD volume as their root filesystem
>>>>>> and a
>>>>>>> number of additional filesystems on Ceph.
>>>>>>>
>>>>>>> All of them have stopped responding. One of the OSDs in my cluster is
>>>>>> marked
>>>>>>> full. I tried stopping that OSD to force things to rebalance or at
>>>>>> least go
>>>>>>> to degraded mode, but nothing is responding still.
>>>>>>>
>>>>>>> I'm not exactly sure what to do or how to investigate. Suggestions?
>>>>>> Try marking the osd out or partially out (ceph osd reweight N .9) to move
>>>>>> some data off, and/or adjust the full ratio up (ceph pg set_full_ratio
>>>>>> .95).  Note that this becomes increasingly dangerous as OSDs get closer to
>>>>>> full; add some disks.
>>>>>>
>>>>>> sage
>>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Josef Johansson
Hi,

On 11/04/14 07:29, Wido den Hollander wrote:
>
>> On 11 April 2014 at 7:13 Greg Poirier wrote:
>>
>>
>> One thing to note
>> All of our kvm VMs have to be rebooted. This is something I wasn't
>> expecting.  Tried waiting for them to recover on their own, but that's not
>> happening. Rebooting them restores service immediately. :/ Not ideal.
>>
> A reboot isn't really required though. It could be that the VM itself is in
> trouble, but from a librados/librbd perspective I/O should simply continue as
> soon as an osdmap has been received without the "full" flag.
>
> It could be that you have to wait some time before the VM continues. This can
> take up to 15 minutes.
With other storage solutions you would have to change the timeout value
for each disk, e.g. changing it from 60 secs to 180 secs, for the VMs to
survive storage problems.
Does Ceph handle this differently somehow?

Cheers,
Josef
> Wido
>
>> On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier 
>> wrote:
>>
>>> Going to try increasing the full ratio. Disk utilization wasn't really
>>> growing at an unreasonable pace. I'm going to keep an eye on it for the
>>> next couple of hours and down/out the OSDs if necessary.
>>>
>>> We have four more machines that we're in the process of adding (which
>>> doubles the number of OSDs), but got held up by some networking nonsense.
>>>
>>> Thanks for the tips.
>>>
>>>
>>> On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil  wrote:
>>>
 On Thu, 10 Apr 2014, Greg Poirier wrote:
> Hi,
> I have about 200 VMs with a common RBD volume as their root filesystem
 and a
> number of additional filesystems on Ceph.
>
> All of them have stopped responding. One of the OSDs in my cluster is
 marked
> full. I tried stopping that OSD to force things to rebalance or at
 least go
> to degraded mode, but nothing is responding still.
>
> I'm not exactly sure what to do or how to investigate. Suggestions?
 Try marking the osd out or partially out (ceph osd reweight N .9) to move
 some data off, and/or adjust the full ratio up (ceph pg set_full_ratio
 .95).  Note that this becomes increasingly dangerous as OSDs get closer to
 full; add some disks.

 sage
>>>
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to detect journal problems

2014-04-09 Thread Josef Johansson
Thanks all for helping to clarify this matter :)

On 09/04/14 17:03, Christian Balzer wrote:
> Hello,
>
> On Wed, 9 Apr 2014 07:31:53 -0700 Gregory Farnum wrote:
>
>> journal_max_write_bytes: the maximum amount of data the journal will
>> try to write at once when it's coalescing multiple pending ops in the
>> journal queue.
>> journal_queue_max_bytes: the maximum amount of data allowed to be
>> queued for journal writing.
>>
>> In particular, both of those are about how much is waiting to get into
>> the durable journal, not waiting to get flushed out of it.
> Thanks a bundle for that clarification Greg.
>
> So the tunable to play with when trying to push the backing storage to its
> throughput limits would be "filestore min sync interval" then?
>
> Or can something else cause the journal to be flushed long before it
> becomes full?
This. Because this is what I see: the OSDs writing at 1-3MB/s with
300 w/s, at 100% util, which makes me want to optimize the journal further.

Even if I cram the journal_queue settings higher, it seems to stay that way.

My idea of the journal making everything sequential was that the data
would merge inside the journal and come out to the disk as nice
sequential I/O.

I assume it could also be that it didn't manage to merge the ops
because they were spread out too much. As the objects are 4M, maybe the
4K writes are spread out over different objects.
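
For anyone who wants to experiment with the tunables discussed above, they map
to ceph.conf options roughly like this (a sketch; the values are purely
illustrative and were not part of this thread):

    [osd]
        # maximum amount of data the journal writes in one go
        journal max write bytes = 10485760
        # maximum amount of data allowed to be queued for journal writing
        journal queue max bytes = 33554432
        # let the filestore wait longer between syncs so more ops can coalesce
        filestore min sync interval = 0.1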

Cheers,
Josef
> Christian
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Wed, Apr 9, 2014 at 3:06 AM, Christian Balzer  wrote:
>>> On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:
>>>
>>>> On Tuesday, April 8, 2014, Christian Balzer  wrote:
>>>>
>>>>> On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
>>>>>> On 08/04/14 10:39, Christian Balzer wrote:
>>>>>>> On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
>>>>>>>
>>>>>>>> On 08/04/14 10:04, Christian Balzer wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I am currently benchmarking a standard setup with Intel DC
>>>>>>>>>> S3700 disks as journals and Hitachi 4TB-disks as
>>>>>>>>>> data-drives, all on LACP 10GbE network.
>>>>>>>>>>
>>>>>>>>> Unless that is the 400GB version of the DC3700, you're already
>>>>>>>>> limiting yourself to 365MB/s throughput with the 200GB
>>>>>>>>> variant. If sequential write speed is that important to you
>>>>>>>>> and you think you'll ever get those 5 HDs to write at full
>>>>>>>>> speed with Ceph (unlikely).
>>>>>>>> It's the 400GB version of the DC3700, and yes, I'm aware that I
>>>>>>>> need a 1:3 ratio to max out these disks, as they write
>>>>>>>> sequential data at about 150MB/s.
>>>>>>>> But our thoughts are that it would cover the current demand
>>>>>>>> with a 1:5 ratio, but we could upgrade.
>>>>>>> I'd reckon you'll do fine, as in run out of steam and IOPS
>>>>>>> before hitting that limit.
>>>>>>>
>>>>>>>>>> The size of my journals are 25GB each, and I have two
>>>>>>>>>> journals per machine, with 5 OSDs per journal, with 5
>>>>>>>>>> machines in total. We currently use the tunables optimal and
>>>>>>>>>> the version of ceph is the latest dumpling.
>>>>>>>>>>
>>>>>>>>>> Benchmarking writes with rbd shows that there's no problem
>>>>>>>>>> hitting upper levels on the 4TB-disks with sequential data,
>>>>>>>>>> thus maxing out 10GbE. At this moment we see full utilization
>>>>>>>>>> on the journals.
>>>>>>>>>>
>>>>>>>>>> But lowering the byte-size to 4k shows that the journals are
>>>>>>>>>> utilized to about 20%, and the 4TB-disks 100%. (rados -p
>>>>>>>>>>  -b 4096 -t 256 1
