[ceph-users] Ceph stable releases team: call for participation

2015-10-03 Thread Loic Dachary
Hi Ceph,

TL;DR: If you have one day a week to work on the next Ceph stable releases [1], 
your help would be most welcome.

The Ceph "Long Term Stable" (LTS) releases - currently firefly[3] and hammer[4] 
- are used by individuals, non-profits, government agencies and companies for 
their production Ceph clusters. They are also used when Ceph is integrated into 
larger products, such as hardware appliances. Ceph packages for a range of 
supported distributions are available at http://ceph.com/. Before the packages 
for a new stable release are published, they are carefully tested for potential 
regressions or upgrade problems. The Ceph project makes every effort to ensure 
the packages published at http://ceph.com/ can be used and upgraded in 
production.

The Stable release team[5] plays an essential role in the making of each Ceph 
stable release. In addition to maintaining an inventory of bugfixes that are in 
various stages of backporting[6], in most cases we do the actual backporting 
ourselves[7]. We also run integration tests involving hundreds of machines[8] 
and analyze the test results when they fail[9]. The developers of the bugfixes 
only hear from us when we're stuck or when it's time for the final decision on 
whether to merge a backport into the stable branch. Our process is well documented[1] and 
participating is a relaxing experience. Every month or so we have the 
satisfaction of seeing a new stable release published.

Nathan Cutler (SUSE) drives the next Firefly release[10] and Abhishek Varshney 
(Flipkart) drives the next Hammer release. Loic Dachary (Red Hat), one of the 
Ceph core developers, and Abhishek Lekshmanan (Reliance Jio Infocomm Ltd.) 
oversee the process and provide help and advice when necessary. After these 
two releases are published (which should happen in the next few weeks), the 
roles will change and we would like to invite you to participate. If you're 
employed by a company using Ceph or doing business with it, maybe your manager 
could agree to give back to the Ceph community in this way. You can join at any 
time and you will be mentored while the ongoing releases complete. When the 
time comes (and if you feel ready), you will be offered a seat to drive the 
next release.

Cheers

[1] Ceph Stable releases home page 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO
[2] Ceph Releases timeline http://ceph.com/docs/master/releases/
[3] Firefly v0.80.10 http://ceph.com/docs/master/release-notes/#v0-80-10-firefly
[4] Hammer v0.94.3 http://ceph.com/docs/master/release-notes/#v0-94-3-hammer
[5] Stable release team http://tracker.ceph.com/projects/ceph-releases
[6] Hammer backports http://tracker.ceph.com/projects/ceph/issues?query_id=78
[7] Backporting commits 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_backport_commits
[8] Integration tests 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_run_integration_and_upgrade_tests
[9] Forensic analysis of integration tests 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_forensic_analysis_of_integration_and_upgrade_tests
[10] Firefly v0.80.11 http://tracker.ceph.com/issues/11644

-- 
Loïc Dachary, Artisan Logiciel Libre





signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Simultaneous CEPH OSD crashes

2015-10-03 Thread Lionel Bouton
Hi,

On 29/09/2015 19:06, Samuel Just wrote:
> It's an EIO.  The osd got an EIO from the underlying fs.  That's what
> causes those asserts.  You probably want to redirect to the relevant
> fs mailing list.

Thanks.

I haven't gotten any answer on this from the BTRFS developers yet. The problem
seems hard to reproduce, though (we still have the same configuration in
production without any new crash, and we only had a total of 3 OSD crashes).

I'll just say for reference that BTRFS with kernel 3.18.9 looks
suspicious to me (from the events that happened on our mixed 3.18.9/4.0.5
cluster, there is statistically about an 80% chance of a BTRFS bug in
3.18.9 that is solved in 4.0.5).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Crush Ruleset Questions

2015-10-03 Thread Daniel Maraio

Hello,

  I've looked over the CRUSH documentation but I am a little confused. 
Perhaps someone here can help me out!


  I have three chassis with 6 SSD OSDs each that I use for a writeback cache. 
I have removed one OSD from each server and I want to make a new 
replicated ruleset that uses just these three OSDs. I want to segregate the 
IO for the RGW bucket index onto this new ruleset to isolate it from 
scrub, promote, and eviction operations.


  My question is: how do I make a ruleset that will use just these 
three OSDs? My current ruleset for these hosts looks like:


root cache {
id -27  # do not change unnecessarily
# weight 3.780
alg straw
hash 0  # rjenkins1
item osd-cache01 weight 1.260
item osd-cache02 weight 1.260
item osd-cache03 weight 1.260
}

host osd-cache01 {
id -3   # do not change unnecessarily
# weight 1.260
alg straw
hash 0  # rjenkins1
item osd.15 weight 0.210
item osd.18 weight 0.210
item osd.19 weight 0.210
item osd.20 weight 0.210
item osd.21 weight 0.210
item osd.25 weight 0.210
}
host osd-cache02 {
id -4   # do not change unnecessarily
# weight 1.260
alg straw
hash 0  # rjenkins1
item osd.26 weight 0.210
item osd.27 weight 0.210
item osd.28 weight 0.210
item osd.29 weight 0.210
item osd.30 weight 0.210
item osd.31 weight 0.210
}
host osd-cache03 {
id -5   # do not change unnecessarily
# weight 1.260
alg straw
hash 0  # rjenkins1
item osd.32 weight 0.210
item osd.33 weight 0.210
item osd.34 weight 0.210
item osd.35 weight 0.210
item osd.36 weight 0.210
item osd.37 weight 0.210
}
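
Based on my reading of the CRUSH docs, I think the answer involves a separate
root that contains only the three dedicated OSDs, plus a rule that takes that
root. Below is a sketch of what I imagine that would look like; the osd ids
(I'm assuming osd.25/31/37 are the ones removed from the cache buckets), the
bucket ids and the .rgw.buckets.index pool name are only my guesses and I
have not tested this:

host osd-cache01-index {
        id -40  # pick an unused id
        alg straw
        hash 0  # rjenkins1
        item osd.25 weight 0.210
}
host osd-cache02-index {
        id -41  # pick an unused id
        alg straw
        hash 0  # rjenkins1
        item osd.31 weight 0.210
}
host osd-cache03-index {
        id -42  # pick an unused id
        alg straw
        hash 0  # rjenkins1
        item osd.37 weight 0.210
}
root cache-index {
        id -43  # pick an unused id
        alg straw
        hash 0  # rjenkins1
        item osd-cache01-index weight 0.210
        item osd-cache02-index weight 0.210
        item osd-cache03-index weight 0.210
}
rule cache-index {
        ruleset 5       # any unused ruleset number
        type replicated
        min_size 1
        max_size 10
        step take cache-index
        step chooseleaf firstn 0 type host
        step emit
}

and then point the bucket index pool at that rule:

ceph osd pool set .rgw.buckets.index crush_ruleset 5

Is that the right direction, or is there a simpler way?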


- Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to setup Ceph radosgw to support multi-tenancy?

2015-10-03 Thread Christian Sarrasin
What are the best options to set up the Ceph radosgw so it supports 
separate/independent "tenants"? What I'm after:


1. Ensure isolation between tenants, i.e. no overlap/conflict in the bucket 
namespace; something that separate radosgw "users" alone don't achieve

2. Ability to backup/restore tenants' pools individually

Referring to the docs [1], it seems this could possibly be achieved with 
zones: one zone per tenant, leaving out synchronization. That seems a little 
heavy-handed, though, and presumably the overhead is non-negligible.
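
For concreteness, this is roughly what I imagine a zone-per-tenant setup
would involve, going by the federated-config docs. The pool names, zone name
and radosgw instance name below are just my guesses and I have not tried
any of this:

# dedicated pools for tenant "a" (names and pg counts are illustrative)
ceph osd pool create .tenant-a.rgw.buckets 64
ceph osd pool create .tenant-a.rgw.buckets.index 16
# load a zone definition pointing at those pools, then create the
# tenant's users under that zone's radosgw instance
radosgw-admin zone set --rgw-zone=tenant-a --infile tenant-a-zone.json \
    --name client.radosgw.tenant-a
radosgw-admin user create --uid=tenant-a-admin --display-name="Tenant A" \
    --name client.radosgw.tenant-a

Each tenant would then presumably get its own bucket namespace and its own
pools (which would also cover my backup/restore point), at the cost of one
radosgw zone and instance per tenant.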


Is this "supported"? Is there a better way?

I'm running Firefly. I'm also rather new to Ceph so apologies if this is 
already covered somewhere; kindly send pointers if so...


Cheers,
Christian

PS: cross-posted from [2]

[1] http://docs.ceph.com/docs/v0.80/radosgw/federated-config/
[2] 
http://serverfault.com/questions/726491/how-to-setup-ceph-radosgw-to-support-multi-tenancy


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph stable releases team: call for participation

2015-10-03 Thread Robin H. Johnson
On Sat, Oct 03, 2015 at 11:07:22AM +0200, Loic Dachary wrote:
> Hi Ceph,
> 
> TL;DR: If you have one day a week to work on the next Ceph stable releases 
> [1] your help would be most welcome.
I'd like to throw my name in.

As of August, I work on Ceph development for Dreamhost. Most of my work
focuses on RGW, but I also care about getting my RGW fixes out to the
world.

Presently, that means I have to backport to Firefly & Hammer for
production.

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Potential OSD deadlock?

2015-10-03 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are still struggling with this and have tried a lot of different
things. Unfortunately, Inktank (now Red Hat) no longer provides
consulting services for non-Red Hat systems. If there are any
certified Ceph consultants in the US who can do both remote and
on-site engagements, please let us know.

This certainly seems to be network related, but somewhere in the
kernel. We have tried increasing the network and TCP buffers and the
number of TCP sockets, and reducing the FIN_WAIT2 timeout. There is about
25% idle on the boxes; the disks are busy, but not constantly at 100% (they
cycle from <10% up to 100%, but not 100% for more than a few seconds
at a time). There seems to be no reasonable explanation for why I/O is
blocked this frequently for longer than 30 seconds. We have verified
jumbo frames by pinging from/to each node with 9000-byte packets. The
network admins have verified that packets are not being dropped at the
switches for these nodes. We have tried different kernels, including
the recent Google patch to cubic. This is showing up on three clusters
(two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
(from CentOS 7.1) with similar results.
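
For reference, these are the kinds of checks and knobs I mean (the values
here are examples, not necessarily what we are running):

# verify 9000-byte frames pass end to end without fragmentation
ping -M do -s 8972 <peer node>
# TCP buffer / FIN_WAIT2 tuning of the sort we have been trying
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_fin_timeout=10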

The messages seem slightly different:
2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
100.087155 secs
2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
cluster [WRN] slow request 30.041999 seconds old, received at
2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
points reached

I don't know what "no flag points reached" means.

The problem is most pronounced when we have to reboot an OSD node (1
of 13): we will have hundreds of blocked I/Os, sometimes for up to 300
seconds, and it takes a good 15 minutes for things to settle down. The
production cluster is very busy, normally doing 8,000 I/O and peaking
at 15,000. This is all 4TB spindles with SSD journals and the disks
are between 25-50% full. We are currently splitting PGs to distribute
the load better across the disks, but we are having to do this 10 PGs
at a time because we get blocked I/O. We have max_backfills and
max_recovery set to 1, and client op priority is set higher than recovery
priority. We tried increasing the number of op threads but this didn't
seem to help. It seems that as soon as PGs are finished being checked, they
become active, and that could be the cause of slow I/O while the other PGs
are being checked.
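
For reference, the throttling settings mentioned above look roughly like
this in ceph.conf (only the backfill/recovery limits of 1 are exactly what
we run; the priority values are illustrative):

[osd]
    osd max backfills = 1
    osd recovery max active = 1    # what I call max_recovery above
    osd client op priority = 63    # client ops above recovery
    osd recovery op priority = 1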

What I don't understand is why the messages are delayed. As soon as
a message is received by the Ceph OSD process, it is very quickly
committed to the journal and a response is sent back to the primary
OSD, which is received very quickly as well. I've adjusted
min_free_kbytes and it seems to keep the OSDs from crashing, but it
doesn't solve the main problem. We don't have swap, and there is 64 GB
of RAM per node for 10 OSDs.
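
For reference, that is the vm.min_free_kbytes sysctl; the value below is
only an example, not what we settled on:

sysctl -w vm.min_free_kbytes=2097152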

Is there something that could cause the kernel to get a packet but not
be able to dispatch it to Ceph, such that it could explain why we
are seeing this I/O blocked for 30+ seconds? Are there any pointers
to tracing Ceph messages from the network buffer through the kernel to
the Ceph process?

We could really use some pointers, no matter how outrageous. We've had
over 6 people looking into this for weeks now and just can't think of
anything else.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
l7OF
=OI++
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> We dropped the replication on our cluster from 4 to 3 and it looks
> like all the blocked I/O has stopped (no entries in the log for the
> last 12 hours). This makes me believe that there is some issue with
> the number of sockets or some other TCP issue. We have not messed with
> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> processes and 16K system wide.
>
> 

Re: [ceph-users] Ceph stable releases team: call for participation

2015-10-03 Thread Loic Dachary
Hi Robin,

On 03/10/2015 21:38, Robin H. Johnson wrote:
> On Sat, Oct 03, 2015 at 11:07:22AM +0200, Loic Dachary wrote:
>> Hi Ceph,
>>
>> TL;DR: If you have one day a week to work on the next Ceph stable releases 
>> [1] your help would be most welcome.
> I'd like to throw my name in.
> 
> As of August, I work on Ceph development for Dreamhost. Most of my work
> focuses on RGW, but I also care about getting my RGW fixes out to the
> world.
> 
> Presently, that means I have to backport to Firefly & Hammer for
> production.

Sounds like a perfect match :-) When would you like to start?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Potential OSD deadlock?

2015-10-03 Thread Josef Johansson
Hi,

I don't know what brand those 4TB spindles are, but I know that mine are
very bad at doing writes at the same time as reads, especially small mixed
reads and writes.

This has an absurdly bad effect when doing maintenance on Ceph. That being
said, we see a big difference in performance between Dumpling and Hammer on
these drives, most likely due to Hammer being able to read/write degraded PGs.

We have run into two different problems along the way. The first was
blocked requests, where we had to upgrade from 64GB of memory on each node to
256GB. We thought that it was the only safe buy to make things better.

I believe it worked because more reads were cached, so we had less mixed
read/write on the nodes, giving the spindles more room to breathe. It was
a shot in the dark at the time, but the price is not that high to just try
it out, compared to 6 people working on the problem. I believe the IO on
disk was not huge either; what kills the disks is high latency. How much
bandwidth are the disks using? We had very low bandwidth, 3-5MB/s.

The second problem was fragmentation hitting 70%; lowering that to 6%
made a lot of difference. Depending on the IO pattern, it grows at different rates.
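
Assuming XFS-backed OSDs (I don't know what you run, and the device and
mount paths below are just examples), fragmentation can be checked and
reduced with something like:

# read-only check of the fragmentation factor for one OSD's data partition
xfs_db -r -c frag /dev/sdb1
# online defragmentation of a mounted OSD filesystem
xfs_fsr -v /var/lib/ceph/osd/ceph-0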

TL;DR: reads kill the 4TB spindles.

Hope you guys get out of the woods.
/Josef
On 3 Oct 2015 10:10 pm, "Robert LeBlanc"  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are some
> certified Ceph consultants in the US that we can do both remote and
> on-site engagements, please let us know.
>
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers, number
> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three cluster
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
>
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
>
> I don't know what "no flag points reached" means.
>
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13), we will have hundreds of I/O blocked for some times up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy doing normally 8,000 I/O and peaking
> at 15,000. This is all 4TB spindles with SSD journals and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems as soon as PGs are finished being checked, they
> become active and could be the cause for slow I/O while the other PGs
> are being checked.
>
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD which is received very quickly as well. I've adjust
> min_free_kbytes and it seems to keep the OSDs from crashing, but
> doesn't solve the main problem. We don't have swap and there is 64 GB
> of RAM per nodes for 10 OSDs.
>
> Is there something that could cause the kernel to get a packet but not
> be able to dispatch it to Ceph such that it could be explaining why we
> are seeing these blocked I/O for 30+ seconds. Is there some pointers
> to tracing Ceph messages from the network buffer through the kernel to
> the Ceph process?
>
> We can really use some pointers no matter how outrageous. We've have
> over 6 people looking into this for weeks now and just can't think of
> anything else.
>
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelo