Re: [ceph-users] osd crash and high server load - ceph-osd crashes with stacktrace

2015-10-25 Thread Jacek Jarosiewicz

We've upgraded ceph to 0.94.4 and the kernel to 3.16.0-51-generic,
but the problem still persists. Lately we see these crashes on a daily 
basis. I'm leaning toward the conclusion that this is a software problem 
- this hardware ran stable before, and we're seeing all four nodes crash 
randomly with the same messages in the log. I'm wondering if this could be 
flashcache related; nothing else comes to mind.
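
To rule flashcache in or out, the next thing I plan to check is something
along these lines (just a sketch; the /proc paths can differ between
flashcache versions and cache names):

  # list the flashcache device-mapper targets and their state
  dmsetup table | grep flashcache
  dmsetup status

  # dump flashcache statistics and error counters
  cat /proc/flashcache/*/flashcache_stats
  cat /proc/flashcache/*/flashcache_errors

  # look for hung task / XFS / flashcache messages around the crash
  dmesg | egrep -i 'hung_task|flashcache|xfs'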


Can anyone take a look at the logs and help?

ceph-osd log: http://pastebin.com/AGGtvHr2
kernel log: http://pastebin.com/jVSa8eme

J

On 10/09/2015 09:15 AM, Jacek Jarosiewicz wrote:

Hi,

We've noticed a problem with our cluster setup:

4 x OSD nodes:
E5-1630 CPU
32 GB RAM
Mellanox MT27520 56Gbps network cards
SATA controller LSI Logic SAS3008
Storage nodes are connected to two SuperMicro chassis: 847E1C-R1K28JBOD
Each node has 2-3 spinning OSDs (6TB drives) and 2 ssd drives (240GB
Intel DC S3710 drives) for journal and cache
3 monitors running on OSD nodes
ceph hammer 0.94.3
Ubuntu 14.04
standard replicated pools with size 2 (min_size 1)
40GB journal per osd on SSD drives, 40GB flashcache per osd.

Everything seems to work fine, but every few days or so one of the nodes
(not always the same node - different nodes each time) gets very high
load, becomes inaccessible and needs to be rebooted.

After a reboot we can start the OSDs and the cluster returns to HEALTH_OK
state pretty quickly.

After looking into logfiles this seems to be related to ceph-osd
processes (links to the logs are at the bottom of this msg).

The cluster is a test setup - not used in production, and at the time the
ceph-osd processes crash, the cluster isn't doing anything.

Any help would be appreciated.

ceph-osd log: http://pastebin.com/AGGtvHr2
kernel log: http://pastebin.com/jVSa8eme

J




--
Jacek Jarosiewicz
Administrator Systemów Informatycznych


SUPERMEDIA Sp. z o.o. z siedzibą w Warszawie
ul. Senatorska 13/15, 00-075 Warszawa
Sąd Rejonowy dla m.st.Warszawy, XII Wydział Gospodarczy Krajowego 
Rejestru Sądowego,

nr KRS 029537; kapitał zakładowy 42.756.000 zł
NIP: 957-05-49-503
Adres korespondencyjny: ul. Jubilerska 10, 04-190 Warszawa


SUPERMEDIA ->   http://www.supermedia.pl
dostep do internetu - hosting - kolokacja - lacza - telefonia
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about hardware and CPU selection

2015-10-25 Thread Christian Balzer

Hello,

There are of course a number of threads in the ML archives about things
like this.

On Sat, 24 Oct 2015 17:48:35 +0200 Mike Miller wrote:

> Hi,
> 
> as I am planning to set up a ceph cluster with 6 OSD nodes with 10 
> harddisks in each node, could you please give me some advice about 
> hardware selection? CPU? RAM?
> I am planning a 10 GBit/s public and a separate 10 GBit/s private
> network.
>

If I read this correctly your OSDs are entirely HDD based (no journal
SSDs).

In that case you'll be lucky to see writes faster than 750MB/s, meaning
your split network is wasted.
IMHO a split cluster/public network only makes sense if you can actually
saturate either link, if not both.

In your case a redundant (LACP) setup would be much more beneficial,
unless your use case is vastly skewed to reads from hot (in page cache)
objects.

As for CPU, pure HDD OSDs will do well with about 1GHz per OSD; the more
small write I/Os you have, the more power you need.
For OSDs with SSD journals my rule of thumb is at least 2GHz per OSD; for
purely SSD-based OSDs, whatever you can afford.

2GB RAM per OSD is generally sufficient; however, more is definitely
better in my book.

This is especially true when you have hot (read) objects that may get
evicted from local (in VM) page caches, but still fit comfortably in the
distributed page caches of your OSD nodes.  
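
As a rough worked example for your planned 10-HDD nodes, using the
rule-of-thumb numbers above (an estimate, not a measurement):

  10 HDD OSDs x ~1 GHz = ~10 GHz of aggregate CPU per node
                         (e.g. a single 6-8 core CPU in the 2+ GHz range)
  10 OSDs x 2 GB RAM   = 20 GB as a baseline; 32-64 GB leaves headroom for
                         page cache and recovery/backfill peaks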

Regards,

Christian 

> For a smaller test cluster with 5 OSD nodes and 4 harddisks each, 2 
> GBit/s public and 4 GBit/s private network, I already tested this using 
> core i5 boxes 16GB RAM installed. In most of my test scenarios including 
> load, node failure, backfilling, etc. the CPU usage was not at all the 
> bottleneck with a maximum of about 25% load per core. The private 
> network was also far from being fully loaded.
> 
> It would be really great to get some advice about hardware choices for 
> my newly planned setup.
> 
> Thanks very much and regards,
> 
> Mike
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 2-Node Cluster - possible scenario?

2015-10-25 Thread Hermann Himmelbauer
Hi,
In a little project of mine I plan to start with a small ceph storage
setup and be able to scale it up later. Perhaps someone can give me
advice on whether the following would work (two nodes with OSDs, a third
node with a monitor only):

- 2 Nodes (enough RAM + CPU), 6*3TB hard disks for OSDs -> 9TB usable
space in case of 3* redundancy, 1 Monitor on each of the nodes
- 1 extra node that has no OSDs but runs a third monitor.
- 10GBit Ethernet as storage backbone

Later I may add more nodes + OSDs to expand the cluster in case more
storage / performance is needed.

Would this work / be stable? Or do I need to spread my OSDs to 3 ceph
nodes (e.g. in order to achieve quorum)? In case one of the two OSD nodes
fails, would the storage still be accessible?

The setup should be used for RBD/QEMU only, no cephfs or the like.

Any hints are appreciated!

Best Regards,
Hermann

-- 
herm...@qwer.tk
PGP/GPG: 299893C7 (on keyservers)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] locked up cluster while recovering OSD

2015-10-25 Thread Ludovico Cavedon
Hi,

we have a Ceph cluster with:
- 12 OSDs on 6 physical nodes, 64 GB RAM
- each OSD has a 6 TB spinning disk and a 10GB journal in ram (tmpfs) [1]
- 3 redundant copies
- 25% space usage so far
- ceph 0.94.2.
- store data via radosgw, using sharded bucket indexes (64 shards).
- 500 PGs per node (as we are planning on scaling the number of nodes
without adding more pools in the future).

We currently have a constant write load (about 60 PUTs per second of small
objects, usually a few KB, but sometimes they can go up to a few MB).

If I restart an OSD, it seems that most operations get stuck for up to
multiple minutes until the OSD is done recovering.
(noout is set, but I understand it does not matter because the OSD is
down for less than 5 minutes).

Most of the "slow operation" messages had the following reasons:
- currently waiting for rw locks
- currently waiting for missing object
- currently waiting for degraded object

And were:
- [call rgw.bucket_prepare_op] ... ondisk+write+known_if_redirected
- [call rgw.bucket_complete_op] ... ondisk+write+known_if_redirected

operating mostly on the bucket index shard objects.

The monitors and gateways look completely unloaded.
On the other hand, it looks like the IO on the OSDs is very intense (the average
disk write completion time is 300 ms) and the disk IO utilization is around
50%.

It looks to me like the storage layer needs to be improved (a RAID controller
with a big write-back cache, maybe?).
However, I do not understand exactly what is going wrong here.
I would expect the operations to keep being served as before, writing either
to the primary PG or to the replica, with the PGs recovering in the
background.
Do you have any ideas?
What path would you follow to understand what the problem is?
I am happy to provide more logs if that helps.
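
For what it is worth, the recovery throttling I am considering trying next
(untested on this cluster, values are just a starting point) is:

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

and the persistent equivalent in ceph.conf under [osd]:

  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1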

Thanks in advance for any help,
Ludovico

[1] We had to disable filestore_fadvise, otherwise two threads per OSD
would get stuck at 100% CPU moving pages from RAM (presumably the journal)
to swap.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2-Node Cluster - possible scenario?

2015-10-25 Thread Alan Johnson
Quorum can be achieved with one monitor node (for testing purposes this would 
be OK, but of course it is a single point of failure). However, the default 
replication for pools is three-way (this can be changed), so it is easier to 
set up three OSD nodes to start with and one monitor node. In your case the 
monitor node would not need to be very powerful, so a lower-spec system could 
be used, allowing your previously suggested mon node to be used instead as a 
third OSD node. 
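
If you do want to stay with two OSD nodes, replication is a per-pool setting
and can be changed roughly like this (pool name "rbd" is just an example, and
size 2 / min_size 1 trades away some safety):

  ceph osd pool set rbd size 2
  ceph osd pool set rbd min_size 1

or as defaults for newly created pools, in ceph.conf:

  osd pool default size = 2
  osd pool default min_size = 1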

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Hermann Himmelbauer
Sent: Monday, October 26, 2015 12:17 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] 2-Node Cluster - possible scenario?

Hi,
In a little project of mine I plan to start ceph storage with a small setup and 
to be able to scale it up later. Perhaps someone can give me any advice if the 
following (two nodes with OSDs, third node with Monitor only):

- 2 Nodes (enough RAM + CPU), 6*3TB Harddisk for OSDs -> 9TB usable space in 
case of 3* redundancy, 1 Monitor on each of the nodes
- 1 extra node that has no OSDs but runs a third monitor.
- 10GBit Ethernet as storage backbone

Later I may add more nodes + OSDs to expand the cluster in case more storage / 
performance is needed.

Would this work / be stable? Or do I need to spread my OSDs to 3 ceph nodes 
(e.g. in order to achieve quorum). In case one of the two OSD nodes fails, would 
the storage still be accessible?

The setup should be used for RBD/QEMU only, no cephfs or the like.

Any hints are appreciated!

Best Regards,
Hermann

--
herm...@qwer.tk
PGP/GPG: 299893C7 (on keyservers)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG won't stay clean

2015-10-25 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I have a 0.94.4 cluster that when I repair/deep-scrub a PG, it comes
back clean, but as soon as I restart any OSD that hosts it, it goes
back to inconsistent. If I deep-scrub that PG it clears up.

I determined that the bad copy was not on the primary and issued a pg
repair command. I have shut down and deleted the PG folder on each OSD
in turn and let it back fill. I tried taking the primary OSD down and
issuing a repair command then. I took an md5sum of all files in the PG
directory and compared all files across the OSDs and it came back
clean. I shut down each OSD in turn and removed any PG_TEMP
directories. I'm just not sure why the cluster is so confused as to
the status of this PG.
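
For reference, the commands I have been cycling through look roughly like
this (pg/osd ids abbreviated; <pgid> is a placeholder):

  ceph health detail | grep inconsistent
  ceph pg <pgid> query
  ceph pg deep-scrub <pgid>
  ceph pg repair <pgid>

  # comparing the on-disk copies across OSDs, on each OSD host:
  find /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head -type f -exec md5sum {} +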

Any ideas?

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWLWr3CRDmVDuy+mK58QAA4HEP/361NhUXujdrKr9xa4d/
Lr/zMCbppT7uof3BueLkkIF2erun19ENNLZV5ehyzcbeWlHVA3UEzJaZlaew
eQrmFN0TKk2BFtQAnXp66KhBVco05tiLKZthkGH9AzwQT33ftf8ErVwT4GXs
aEXDdQLctLGxvqfoyys9woNqalYjG9JtZxJHTWfaVU+t3yXEme3GBJBmMlVE
GSSk8KAyEil8DP1q4PMQJrScQxqYFpfBi1UGnbiQj02pan16OtbkaUkJNLMB
o8XlHdiNfPWmMoyAuOPBMoKSPo1diLBP3uEJN8u3Mw4+9kLZSeMDMRHdkMF0
kmhGA26ihRHcHWVsC+4wevCGJoq7vvPmf8892z+hEjC5vm4eWGAD7UPBjqjl
5BL282XI+AYLbw4VkiDrP4tTL4neOr6IW50mnG8SPVSAvMN+cFJnlMZRpQ/6
SQB4Tv5fr1SEMZDZqC//RacWZYCsBd1XZi6M0VhOhOrGjqmlr/41P2dmrdI1
ldHEyl3l07mJdBANQ0AgIMAeMyuD3dGQ4q0IpJgVMbrfxq8m/lLp/jddOWq3
MhNGT5J1K4Qg3eFlqhuTIw7yLmERACYHlMUBCHGq/8jGCjEQOe6uhuePyH/6
ugUoZ+J4Y+Fxsu1Jsoj+GQtDSSkOTGjUpGhXPp/gqMhxZRlGdsZL5LIEs7hm
Us8M
=gnRK
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crash and high server load - ceph-osd crashes with stacktrace

2015-10-25 Thread Brad Hubbard
- Original Message -
> From: "Jacek Jarosiewicz" 
> To: ceph-users@lists.ceph.com
> Sent: Sunday, 25 October, 2015 8:48:59 PM
> Subject: Re: [ceph-users] osd crash and high server load - ceph-osd crashes 
> with stacktrace
> 
> We've upgraded ceph to 0.94.4 and kernel to 3.16.0-51-generic
> but the problem still persists. Lately we see these crashes on a daily
> basis. I'm leaning toward the conclusion that this is a software problem
> - this hardware ran stable before and we're seeing all four nodes crash
> randomly with the same messages in log.. I'm thinking if this can be
> flashcache related.. nothing else comes to mind..
> 
> can anyone look at the logs and help some?
> 
> ceph-osd log: http://pastebin.com/AGGtvHr2
> kernel log: http://pastebin.com/jVSa8eme

I'd suggest you focus on why the kernel threads are going into D-state
(uninterruptible sleep), since that should probably be addressed first. Ceph is
a userspace application, so it should not be able to "hang" the kernel. The XFS
filesystem code or the underlying storage appears to be implicated here, but it
could be something else. The hung kernel threads are waiting for something; we
need to work out what that is.

It is likely Ceph is just triggering this problem.
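
A rough way to see what those threads are blocked on the next time it happens
(assuming you can still get a shell on the box) would be something like:

  # list tasks in uninterruptible sleep and what they are waiting in
  ps -eo pid,stat,wchan:40,comm | awk '$2 ~ /D/'

  # dump the kernel stack of one hung task (<pid> is a placeholder)
  cat /proc/<pid>/stack

  # or dump all blocked tasks to the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200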

Cheers,
Brad

> 
> J
> 
> On 10/09/2015 09:15 AM, Jacek Jarosiewicz wrote:
> > Hi,
> >
> > We've noticed a problem with our cluster setup:
> >
> > 4 x OSD nodes:
> > E5-1630 CPU
> > 32 GB RAM
> > Mellanox MT27520 56Gbps network cards
> > SATA controller LSI Logic SAS3008
> > Storage nodes are connected to two SuperMicro chassis: 847E1C-R1K28JBOD
> > Each node has 2-3 spinning OSDs (6TB drives) and 2 ssd drives (240GB
> > Intel DC S3710 drives) for journal and cache
> > 3 monitors running on OSD nodes
> > ceph hammer 0.94.3
> > Ubuntu 14.04
> > standard replicated pools with size 2 (min_size 1)
> > 40GB journal per osd on SSD drives, 40GB flashcache per osd.
> >
> > Everything seems to work fine, but every few days or so one of the nodes
> > (not always the same node - different nodes each time) gets very high
> > load, becomes inaccessible and needs to be rebooted.
> >
> > After reboot we can start osd's and the cluster returns to HEALTH_OK
> > state pretty quickly.
> >
> > After looking into logfiles this seems to be related to ceph-osd
> > processes (links to the logs are at the bottom of this msg).
> >
> > The cluster is a test setup - not used in production and at the time the
> > ceph-osd processes crash, the cluster isn't doing anything.
> >
> > Any help would be appreciated.
> >
> > ceph-osd log: http://pastebin.com/AGGtvHr2
> > kernel log: http://pastebin.com/jVSa8eme
> >
> > J
> >
> 
> 
> --
> Jacek Jarosiewicz
> Administrator Systemów Informatycznych
> 
> 
> SUPERMEDIA Sp. z o.o. z siedzibą w Warszawie
> ul. Senatorska 13/15, 00-075 Warszawa
> Sąd Rejonowy dla m.st.Warszawy, XII Wydział Gospodarczy Krajowego
> Rejestru Sądowego,
> nr KRS 029537; kapitał zakładowy 42.756.000 zł
> NIP: 957-05-49-503
> Adres korespondencyjny: ul. Jubilerska 10, 04-190 Warszawa
> 
> 
> SUPERMEDIA ->   http://www.supermedia.pl
> dostep do internetu - hosting - kolokacja - lacza - telefonia
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2-Node Cluster - possible scenario?

2015-10-25 Thread Christian Balzer

Hello,

On Sun, 25 Oct 2015 16:17:02 +0100 Hermann Himmelbauer wrote:

> Hi,
> In a little project of mine I plan to start ceph storage with a small
> setup and to be able to scale it up later. Perhaps someone can give me
> any advice if the following (two nodes with OSDs, third node with
> Monitor only):
> 
> - 2 Nodes (enough RAM + CPU), 6*3TB Harddisk for OSDs -> 9TB usable
> space in case of 3* redundancy, 1 Monitor on each of the nodes

Just for the record, a monitor will be happy with 2GB RAM and 2GHz of CPU
(more is better), but it does a LOT of time-critical writes, so running it on
decent (also in the endurance sense) SSDs is recommended. 

Once you have SSDs in the game, using them for Ceph journals comes
naturally. 

Keep in mind that while you certainly can improve the performance by just
adding more OSDs later on, SSD journals are such a significant improvement
when it comes to writes that you may want to consider them.

> - 1 extra node that has no OSDs but runs a third monitor.

Ceph uses the MON with the lowest IP address as leader, which is busier
(sometimes a lot more so) than the other MONs. 
Plan your nodes with that in mind.

> - 10GBit Ethernet as storage backbone
> 
Good for lower latency. 
I assume "storage backbone" means a single network (the "public" network in
Ceph speak). Having 10GbE for a separate Ceph private network in your case
would be a bit of a waste, though.


> Later I may add more nodes + OSDs to expand the cluster in case more
> storage / performance is needed.
> 
> Would this work / be stable? Or do I need to spread my OSDs to 3 ceph
> nodes (e.g. in order to achieve quorum). In case one of the two OSD nodes
> fails, would the storage still be accessible?
> 
A monitor quorum of 3 is fine; OSDs don't enter that picture.

However, 3 OSD storage nodes are highly advised, because with HDD-only OSDs
(no SSD journals) your performance will already be low.
It also saves you from having to deal with a custom CRUSH map.

As for accessibility: yes, in theory. 
I have certainly tested this with a 2-storage-node cluster and a
replication of 2 (min_size 1). 
With your setup (which needs a custom CRUSH map) you will need a min_size of 1
as well.

So again, 3 storage nodes will give you far fewer headaches.

> The setup should be used for RBD/QEMU only, no cephfs or the like.
>
Depending on what these VMs do and how many of them there are, see my comments
about performance.

Christian
> Any hints are appreciated!
> 
> Best Regards,
> Hermann
> 



-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG won't stay clean

2015-10-25 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I set debug_osd = 20/20 and restarted the primary osd. The logs are at
http://162.144.87.113/files/ceph-osd.110.log.xz .

The PG in question is 9.e3 and it is one of 15 that have this same
behavior. The cluster is currently idle.
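
(For reference, I bumped the logging with something along these lines before
the restart:

  ceph tell osd.110 injectargs '--debug_osd 20/20'

or equivalently "debug osd = 20/20" in ceph.conf so it survives the restart.)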
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Oct 25, 2015 at 5:51 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I have a 0.94.4 cluster that when I repair/deep-scrub a PG, it comes
> back clean, but as soon as I restart any OSD that hosts it, it goes
> back to inconsistent. If I deep-scrub that PG it clears up.
>
> I determined that the bad copy was not on the primary and issued a pg
> repair command. I have shut down and deleted the PG folder on each OSD
> in turn and let it back fill. I tried taking the primary OSD down and
> issuing a repair command then. I took an m5sum of all files in the PG
> directory and compared all files across the OSDs and it came back
> clean. I shut down each OSD in turn and removed any PG_TEMP
> directories. I'm just not sure why the cluster is so confused as to
> the status of this PG.
>
> Any ideas?
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.2
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWLWr3CRDmVDuy+mK58QAA4HEP/361NhUXujdrKr9xa4d/
> Lr/zMCbppT7uof3BueLkkIF2erun19ENNLZV5ehyzcbeWlHVA3UEzJaZlaew
> eQrmFN0TKk2BFtQAnXp66KhBVco05tiLKZthkGH9AzwQT33ftf8ErVwT4GXs
> aEXDdQLctLGxvqfoyys9woNqalYjG9JtZxJHTWfaVU+t3yXEme3GBJBmMlVE
> GSSk8KAyEil8DP1q4PMQJrScQxqYFpfBi1UGnbiQj02pan16OtbkaUkJNLMB
> o8XlHdiNfPWmMoyAuOPBMoKSPo1diLBP3uEJN8u3Mw4+9kLZSeMDMRHdkMF0
> kmhGA26ihRHcHWVsC+4wevCGJoq7vvPmf8892z+hEjC5vm4eWGAD7UPBjqjl
> 5BL282XI+AYLbw4VkiDrP4tTL4neOr6IW50mnG8SPVSAvMN+cFJnlMZRpQ/6
> SQB4Tv5fr1SEMZDZqC//RacWZYCsBd1XZi6M0VhOhOrGjqmlr/41P2dmrdI1
> ldHEyl3l07mJdBANQ0AgIMAeMyuD3dGQ4q0IpJgVMbrfxq8m/lLp/jddOWq3
> MhNGT5J1K4Qg3eFlqhuTIw7yLmERACYHlMUBCHGq/8jGCjEQOe6uhuePyH/6
> ugUoZ+J4Y+Fxsu1Jsoj+GQtDSSkOTGjUpGhXPp/gqMhxZRlGdsZL5LIEs7hm
> Us8M
> =gnRK
> -END PGP SIGNATURE-

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWLaAwCRDmVDuy+mK58QAAAMkP/iJ+OpklHB/O2wgj2Le1
Wudniy7KzpDVaS+kXpPZ8Bhbn+rxXCuN9ySUf1sxM37SdqZBCWHpPdx7GTbC
QDaO2Bucn53iNG6FDcXHf0TDAUuw7f5u381B2+qfuUbc6Q7iJeJRIjrzQsce
1ieBDytn+DKis1YEOY5Rlbj80CBB5MhkiokJlxjNjaj2AZJAORwoLbqoCSSI
u8YnzsbxhkpYxCcCqM3lHf36dsP40vkyXXyqjVWgaW9qThFx9N67ERG/hQSU
VTBWXqY8glAQmbeuvlT/zAhl0e2qEsEOBUBn5r/ydL5M2x+3dFHsLp92ewwt
pyrEeq6n8Wt1mmYklesDZQCgex47uWAy5mDFOWQjzWBbeO7ji8jwM+PpXxW9
h/mRJZFLLTScFHTOONDXfF41GXFV3ZtdukpHdT46k++RRHmlFVRZgcTym3/b
g0pLQKZ8ynKvFzAora2/r9IlN7dDPJEw2jpN2pAYda0GlY8wc6h5i/qUQoGE
VN6b5SNURyw53OMPv6yOx2bvc7RKmpLGWjhnTEHjydI0w+kmbqvAnbT2mG/O
eHeEyteK4m3+Jtf/s+wN9ULr1pNr++37Zt2igfTtvPm4OceqiU4Y02YgrPu+
LgOWGduVSmEmmRRnBE8+gZYU6gSgpOV3JWqP0AQMavujZmQpGs55DjnfMwdv
v8IJ
=4TOB
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-25 Thread hzwuli...@gmail.com
Hi, 

New information: I think the poor performance is due to too many threads in the 
qemu-system-x86 process.

In the normal case, it uses only about 200 threads.

In the abnormal case, it uses about 400 or even 700 threads, and the 
performance ranks as:

200 threads > 400 threads > 700 threads

Now I guess the performance drop is due to contention between the threads, 
as you can see in the perf record I pasted before.

This problem really has us stuck.

So, does anyone know why the number of qemu-system-x86 threads keeps
increasing? And is there any way we could control it?
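
For reference, we are counting the qemu-system-x86 threads roughly like this
(15801 is the qemu pid shown in the earlier mails below):

  # number of threads of the qemu process
  ls /proc/15801/task | wc -l
  ps -o nlwp= -p 15801

  # watch it grow over time
  watch -n 5 'ps -o nlwp= -p 15801'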

Thanks!



hzwuli...@gmail.com
 
From: hzwuli...@gmail.com
Date: 2015-10-23 13:15
To: Alexandre DERUMIER
CC: ceph-users
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Yeah, you are right. Testing the rbd volume from the host is fine.

Now, at least we can confirm it's a qemu or kvm problem, not ceph.



hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-23 12:51
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
>>Anyway, i could try to collect something, maybe there are some clues. 
 
And you don't have problems reading/writing to this rbd from the host with fio-rbd? 
(try a full read of the rbd volume, for example)
 
- Original Message -
From: hzwuli...@gmail.com
To: "aderumier" 
Cc: "ceph-users" 
Sent: Friday, 23 October 2015 06:42:41
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
Oh no, from what we observe, IO in the VM waits on the host to complete. 
The CPU wait in the VM is very high. 
Anyway, I could try to collect something; maybe there are some clues. 
 
 
hzwuli...@gmail.com 
 
 
 
From: Alexandre DERUMIER 
Date: 2015-10-23 12:39 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Have you tried to use perf inside the faulty guest too? 
- Original Message - 
From: hzwuli...@gmail.com 
To: "aderumier"  
Cc: "ceph-users"  
Sent: Friday, 23 October 2015 06:15:07 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
btw, we used perf to track the qemu-system-x86 process (15801); there is an 
abnormal function: 
Samples: 1M of event 'cycles', Event count (approx.): 1057109744252 
- 75.23% qemu-system-x86 [kernel.kallsyms] [k] do_raw_spin_lock 
  - do_raw_spin_lock 
    + 54.44% 0x7fc79fc769d9 
    + 45.31% 0x7fc79fc769ab 
So, maybe it's a kvm problem? 
hzwuli...@gmail.com 
From: hzwuli...@gmail.com 
Date: 2015-10-23 11:54 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, list 
We are still stuck on this problem. When it occurs, the CPU usage of 
qemu-system-x86 is very high (1420%): 
15801 libvirt- 20 0 33.7g 1.4g 11m R 1420 0.6 1322:26 qemu-system-x86 
The qemu-system-x86 process 15801 is responsible for the VM. 
Has anyone run into this problem as well? 
hzwuli...@gmail.com 
From: hzwuli...@gmail.com 
Date: 2015-10-22 10:15 
To: Alexandre DERUMIER 
CC: ceph-users 
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu 
librbd 
Hi, 
Sure, all of those could help, but not that much :-) 
Now we find it's a VM problem; CPU usage on the host is very high. 
Creating a new VM solves the problem, but we don't know why yet. 
Here is the detailed version info: 
Compiled against library: libvirt 1.2.9 
Using library: libvirt 1.2.9 
Using API: QEMU 1.2.9 
Running hypervisor: QEMU 2.1.2 
Are there any known bugs in those versions? 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 18:38 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Here is a libvirt sample to enable iothreads: 
[the libvirt XML snippet was stripped by the list archive] 
With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too) 
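
A minimal sketch of what that sample presumably looked like, reconstructed
from the libvirt documentation rather than the original mail (names and values
are illustrative only):

  <domain type='kvm'>
    ...
    <iothreads>2</iothreads>
    <devices>
      <disk type='network' device='disk'>
        <!-- pin this disk's I/O to iothread 1 -->
        <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
        ...
      </disk>
    </devices>
  </domain>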
- Original Message - 
From: hzwuli...@gmail.com 
To: "aderumier"  
Cc: "ceph-users"  
Sent: Wednesday, 21 October 2015 10:31:56 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
vm config: 
[the domain XML was stripped by the list archive] 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your

Re: [ceph-users] when an osd is started up, IO will be blocked

2015-10-25 Thread wangsongbo

Hi all,

When an OSD is started, I get a lot of slow requests in the corresponding 
OSD log, as follows:


2015-10-26 03:42:51.593961 osd.4 [WRN] slow request 3.967808 seconds 
old, received at 2015-10-26 03:42:47.625968: 
osd_repop(client.2682003.0:2686048 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347845) 
currently commit_sent
2015-10-26 03:42:51.593964 osd.4 [WRN] slow request 3.964537 seconds 
old, received at 2015-10-26 03:42:47.629239: 
osd_repop(client.2682003.0:2686049 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193029) currently commit_sent
2015-10-26 03:42:52.594166 osd.4 [WRN] 40 slow requests, 17 included 
below; oldest blocked for > 53.692556 secs
2015-10-26 03:42:52.594172 osd.4 [WRN] slow request 2.272928 seconds 
old, received at 2015-10-26 03:42:50.321151: 
osd_repop(client.3684690.0:191908 43.540 
f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 9744'63645) 
currently commit_sent
2015-10-26 03:42:52.594175 osd.4 [WRN] slow request 2.270618 seconds 
old, received at 2015-10-26 03:42:50.323461: 
osd_op(client.3684690.0:191911 rbd_data.1fc5ca7429fc17.0209 
[write 2633728~4096] 43.72b9f039 ack+ondisk+write e9744) currently 
commit_sent
2015-10-26 03:42:52.594264 osd.4 [WRN] slow request 4.968252 seconds 
old, received at 2015-10-26 03:42:47.625828: 
osd_repop(client.2682003.0:2686047 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193028) currently commit_sent
2015-10-26 03:42:52.594266 osd.4 [WRN] slow request 4.968111 seconds 
old, received at 2015-10-26 03:42:47.625968: 
osd_repop(client.2682003.0:2686048 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347845) 
currently commit_sent
2015-10-26 03:42:52.594318 osd.4 [WRN] slow request 4.964841 seconds 
old, received at 2015-10-26 03:42:47.629239: 
osd_repop(client.2682003.0:2686049 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193029) currently commit_sent
2015-10-26 03:42:53.594527 osd.4 [WRN] 40 slow requests, 16 included 
below; oldest blocked for > 54.692945 secs
2015-10-26 03:42:53.594533 osd.4 [WRN] slow request 16.004669 seconds 
old, received at 2015-10-26 03:42:37.589800: 
osd_repop(client.2682003.0:2686041 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193024) currently commit_sent
2015-10-26 03:42:53.594536 osd.4 [WRN] slow request 16.003889 seconds 
old, received at 2015-10-26 03:42:37.590580: 
osd_repop(client.2682003.0:2686040 43.fcf 
d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347842) 
currently commit_sent
2015-10-26 03:42:53.594538 osd.4 [WRN] slow request 16.000954 seconds 
old, received at 2015-10-26 03:42:37.593515: 
osd_repop(client.2682003.0:2686042 43.b4b 
cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 
9744'193025) currently commit_sent
2015-10-26 03:42:53.594541 osd.4 [WRN] slow request 29.138828 seconds 
old, received at 2015-10-26 03:42:24.455641: 
osd_repop(client.4764855.0:65121 43.dbe 
169a9dbe/rbd_data.49a7a4633ac0b1.0021/head//43 v 9744'12509) 
currently commit_sent
2015-10-26 03:42:53.594543 osd.4 [WRN] slow request 15.998814 seconds 
old, received at 2015-10-26 03:42:37.595656: 
osd_repop(client.1800547.0:1205399 43.cc5 
9285ecc5/rbd_data.1b794560c6e2ea.00d0/head//43 v 9744'36732) 
currently commit_sent
2015-10-26 03:42:54.594892 osd.4 [WRN] 39 slow requests, 17 included 
below; oldest blocked for > 55.693227 secs
2015-10-26 03:42:54.594908 osd.4 [WRN] slow request 4.273600 seconds 
old, received at 2015-10-26 03:42:50.321151: 
osd_repop(client.3684690.0:191908 43.540 
f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 9744'63645) 
currently commit_sent
2015-10-26 03:42:54.594911 osd.4 [WRN] slow request 4.271290 seconds 
old, received at 2015-10-26 03:42:50.323461: 
osd_op(client.3684690.0:191911 rbd_data.1fc5ca7429fc17.0209 
[write 2633728~4096] 43.72b9f039 ack+ondisk+write e9744) currently 
commit_sent




Meanwhile, I ran a fio process with the rbd ioengine.
The read and write iops were too small for fio to get any IO through;
in other words, when an OSD is started, the IO of the whole cluster is 
blocked.
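
The fio job we use is roughly the following (reconstructed from the output
below; pool and image names are placeholders):

  [ebs]
  ioengine=rbd
  clientname=admin
  pool=<pool>
  rbdname=<image>
  rw=randrw
  bs=8k
  iodepth=64
  direct=1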

Is there some parameter to adjust?
How can this problem be explained?
The results of the fio run were as follows:

ebs: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=rbd, iodepth=64
fio-2.2.9-20-g1520
Starting 1 thread
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [m(1)] [0.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
05h:10m:14s]

ebs: (groupid=0, jobs=1): err= 0: pid=40323: Mon Oct 26 04:02:00 2015
  read : io=10904KB, bw=175183B/s, *iops=21*, runt= 63737msec
slat (usec): min=0, max=61, avg= 1.11, stdev= 3.16
clat (msec): min=1, max=63452, avg=1190.04, stdev=6046.28
 lat (msec): min=1, max=63452, avg=1190.04, stdev=6046.28
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[ 

[ceph-users] randwrite iops of rbd volume in kvm decrease after several hours with qemu threads and cpu usage on host increasing

2015-10-25 Thread Jackie

Hi experts,

When I test the IO performance of an rbd volume in a pure SSD pool with fio
inside a KVM VM, the iops decreased from 15k to 5k, while the number of qemu
threads on the host increased from about 200 to about 700 and the cpu usage
of the qemu process on the host increased from 600% to 1400%.

My test job is as follows:
rw=randwrite
direct=1
numjobs=64
ioengine=sync
bsrange=4k-4k
runtime=180

The version of some packages are as following:
ceph: 0.94.3
qemu-kvm: 2.1.2
host kernel: 3.10

What might be the problem?

Appreciate for any help.

Best Regards,
Jackie


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com