Re: [Teuthology] Upgrade hammer on ubuntu : all passed

2015-07-20 Thread David Casier AEVOO


On 07/21/2015 12:24 AM, Loic Dachary wrote:

teuthology-suite --ceph hammer-backports --machine-type openstack --suite 
upgrade/hammer-x --filter ubuntu_14.04 
$HOME/src/ceph-qa-suite_master/machine_types/vps.yaml 
$(pwd)/teuthology/test/integration/archive-on-error.yaml

Hi Loic,
Job started
http://ceph.aevoo.fr:8081/ubuntu-2015-07-21_05:02:44-upgrade:hammer-hammer---basic-openstack/

David


RE: The design of the eviction improvement

2015-07-20 Thread Wang, Zhiqiang


> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, July 21, 2015 6:38 AM
> To: Wang, Zhiqiang
> Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
> Subject: Re: The design of the eviction improvement
> 
> On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > Hi all,
> >
> > This is a follow-up of one of the CDS session at
> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
> ng_eviction. We discussed the drawbacks of the current eviction algorithm and
> several ways to improve it. It seems like an LRU variant is the right way to
> go. I came up with some design points after the CDS, and want to discuss them
> with you.
> It is an approximate 2Q algorithm, combining some benefits of the clock
> algorithm, similar to what the linux kernel does for the page cache.
> 
> Unfortunately I missed this last CDS so I'm behind on the discussion.  I have 
> a
> few questions though...
> 
> > # Design points:
> >
> > ## LRU lists
> > - Maintain LRU lists at the PG level.
> > The SharedLRU and SimpleLRU implementations in the current code have a
> > max_size, which limits the max number of elements in the list. This
> > mostly looks like an MRU, though the name implies they are LRUs. Since
> > the object size may vary in a PG, it's not possible to calculate the
> > total number of objects which the cache tier can hold ahead of time.
> > We need a new LRU implementation with no limit on the size.
> 
> This last sentence seems to me to be the crux of it.  Assuming we have an
> OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of
> O(n) objects in memory.  The current hitset-based approach was taken based
> on the assumption that this wasn't feasible--or at least we didn't know how to
> implement such a thing.  If it is, or we simply want to stipulate that cache
> tier OSDs get gobs of RAM to make it possible, then lots of better options
> become possible...
> 
> Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
> 1 million objects.  At maybe ~100 bytes per object of RAM for an LRU entry
> that's 100MB... so not so unreasonable, perhaps!

I had the same question before proposing this. I did a similar 
calculation and thought it would be OK to use this much memory :-)

> 
> > - Two lists for each PG: active and inactive.
> > Objects are first put into the inactive list when they are accessed, and
> > moved between these two lists based on some criteria.
> > Object flag: active, referenced, unevictable, dirty.
> > - When an object is accessed:
> > 1) If it's not in both of the lists, it's put on the top of the
> > inactive list
> > 2) If it's in the inactive list, and the referenced flag is not set, the 
> > referenced
> flag is set, and it's moved to the top of the inactive list.
> > 3) If it's in the inactive list, and the referenced flag is set, the 
> > referenced flag
> is cleared, and it's removed from the inactive list, and put on top of the 
> active
> list.
> > 4) If it's in the active list, and the referenced flag is not set, the 
> > referenced
> flag is set, and it's moved to the top of the active list.
> > 5) If it's in the active list, and the referenced flag is set, it's moved 
> > to the top
> of the active list.
> > - When selecting objects to evict:
> > 1) Objects at the bottom of the inactive list are selected to evict. They 
> > are
> removed from the inactive list.
> > 2) If the number of the objects in the inactive list becomes low, some of 
> > the
> objects at the bottom of the active list are moved to the inactive list. For 
> those
> objects which have the referenced flag set, they are given one more chance in
> the active list. They are moved to the top of the active list with the 
> referenced
> flag cleared. For those objects which don't have the referenced flag set, they
> are moved to the inactive list, with the referenced flag set. So that they 
> can be
> quickly promoted to the active list when necessary.
> >
> > ## Combine flush with eviction
> > - When evicting an object, if it's dirty, it's flushed first. After 
> > flushing, it's
> evicted. If not dirty, it's evicted directly.
> > - This means that we won't have separate activities and won't set different
> ratios for flush and evict. Is there a need to do so?
> > - Number of objects to evict at a time. 'evict_effort' acts as the 
> > priority, which
> is used to calculate the number of objects to evict.
> 
> As someone else mentioned in a follow-up, the reason we let the dirty level be
> set lower than the full level is that it provides headroom so that objects 
> can be
> quickly evicted (delete, no flush) to make room for new writes or new
> promotions.
> 
> That said, we probably can/should streamline the flush so that an evict can
> immediately follow without waiting for the agent to come around again.
> (I don't think we do that now?)

I was afraid of having to

RE: The design of the eviction improvement

2015-07-20 Thread Wang, Zhiqiang
Hi Nick,

> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Monday, July 20, 2015 5:28 PM
> To: Wang, Zhiqiang; 'Sage Weil'; sj...@redhat.com;
> ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> Hi,
> 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> > ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
> > Sent: 20 July 2015 09:47
> > To: Sage Weil ; sj...@redhat.com; ceph-
> > de...@vger.kernel.org
> > Subject: The design of the eviction improvement
> >
> > Hi all,
> >
> > This is a follow-up of one of the CDS session at
> > http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_
> > tiering_eviction. We discussed the drawbacks of the current eviction
> > algorithm and several ways to improve it. It seems like an LRU variant is
> > the right way to go. I came up with some design points after the CDS, and
> > want to discuss them with you. It is an approximate 2Q algorithm,
> > combining some benefits of the clock algorithm, similar to what the
> > linux kernel does for
> the
> > page cache.
> >
> > # Design points:
> >
> > ## LRU lists
> > - Maintain LRU lists at the PG level.
> > The SharedLRU and SimpleLRU implementations in the current code have a
> > max_size, which limits the max number of elements in the list. This
> > mostly looks like an MRU, though the name implies they are LRUs. Since
> > the object size may vary in a PG, it's not possible to calculate the total
> > number of objects which the cache tier can hold ahead of time. We need a
> > new LRU implementation with no limit on the size.
> > - Two lists for each PG: active and inactive.
> > Objects are first put into the inactive list when they are accessed, and
> > moved between these two lists based on some criteria.
> > Object flag: active, referenced, unevictable, dirty.
> > - When an object is accessed:
> > 1) If it's not in both of the lists, it's put on the top of the
> > inactive
> list
> > 2) If it's in the inactive list, and the referenced flag is not set,
> > the
> referenced
> > flag is set, and it's moved to the top of the inactive list.
> > 3) If it's in the inactive list, and the referenced flag is set, the
> referenced flag
> > is cleared, and it's removed from the inactive list, and put on top of
> > the
> active
> > list.
> > 4) If it's in the active list, and the referenced flag is not set, the
> referenced
> > flag is set, and it's moved to the top of the active list.
> > 5) If it's in the active list, and the referenced flag is set, it's
> > moved
> to the top
> > of the active list.
> > - When selecting objects to evict:
> > 1) Objects at the bottom of the inactive list are selected to evict.
> > They
> are
> > removed from the inactive list.
> > 2) If the number of the objects in the inactive list becomes low, some
> > of
> the
> > objects at the bottom of the active list are moved to the inactive list.
> For
> > those objects which have the referenced flag set, they are given one
> > more chance in the active list. They are moved to the top of the
> > active list
> with the
> > referenced flag cleared. For those objects which don't have the
> > referenced flag set, they are moved to the inactive list, with the
> > referenced flag
> set. So
> > that they can be quickly promoted to the active list when necessary.
> >
> 
> I really like this idea but just out of interest, there must be a point where 
> the
> overhead of managing much larger lists of very cold objects starts to impact 
> on
> the gains of having exactly the right objects in each tier. If 90% of your 
> hot IO is
> in 10% of the total data, how much extra benefit would you get by tracking all
> objects vs just tracking the top 10,20,30%...etc and evicting randomly after
> that?  If these objects are being accessed infrequently, the impact of
> re-promoting is probably minimal and if the promotion code can get to a point
> where it is being a bit more intelligent about what objects are promoted then
> this is probably even more so?

The idea is that the lists only hold the objects in the cache tier. Those 
objects which are cold enough are evicted from the cache tier and removed 
from the lists. Also, the lists are maintained at the PG level. I guess the 
lists won't be extremely large? In your example of the 90%/10% data access, 
it may be right that randomly evicting the 90% cold data is good enough. But we 
need a way to know what the 10% of hot data is. Also, we can't assume the 
90%/10% pattern for every workload.

> 
> > ## Combine flush with eviction
> > - When evicting an object, if it's dirty, it's flushed first. After
> flushing, it's
> > evicted. If not dirty, it's evicted directly.
> > - This means that we won't have separate activities and won't set
> different
> > ratios for flush and evict. Is there a need to do so?
> > - Number of objects to evict at a time. 'evict_effort' acts as the
> prio

Re: The design of the eviction improvement

2015-07-20 Thread Sage Weil
On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> Hi all,
> 
> This is a follow-up of one of the CDS session at 
> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
>  We discussed the drawbacks of the current eviction algorithm and several 
> ways to improve it. It seems like an LRU variant is the right way to go. I 
> came up with some design points after the CDS, and want to discuss them with 
> you. It is an approximate 2Q algorithm, combining some benefits of the clock 
> algorithm, similar to what the linux kernel does for the page cache.

Unfortunately I missed this last CDS so I'm behind on the discussion.  I 
have a few questions though...
 
> # Design points:
> 
> ## LRU lists
> - Maintain LRU lists at the PG level.
> The SharedLRU and SimpleLRU implementations in the current code have a 
> max_size, which limits the max number of elements in the list. This 
> mostly looks like an MRU, though the name implies they are LRUs. Since 
> the object size may vary in a PG, it's not possible to calculate the 
> total number of objects which the cache tier can hold ahead of time. We 
> need a new LRU implementation with no limit on the size.

This last sentence seems to me to be the crux of it.  Assuming we have an 
OSD backed by flash storing O(n) objects, we need a way to maintain an LRU 
of O(n) objects in memory.  The current hitset-based approach was taken 
based on the assumption that this wasn't feasible--or at least we 
didn't know how to implement such a thing.  If it is, or we simply want 
to stipulate that cache tier OSDs get gobs of RAM to make it possible, 
then lots of better options become possible...

Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 
1 million objects.  At maybe ~100 bytes per object of RAM for an LRU entry 
that's 100MB... so not so unreasonable, perhaps!
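
For a rough sanity check of that back-of-the-envelope number, here is a small 
self-contained sketch. The entry layout and the 48-byte allocator/index 
overhead are assumptions made for illustration, not Ceph's actual data 
structures:

#include <cstdint>
#include <cstdio>

// Hypothetical in-memory LRU entry (illustrative only, not a Ceph type).
struct LRUEntry {
  uint8_t  oid_hash[20];   // object identifier (e.g. a hash of the hobject_t)
  uint64_t pool_and_pg;    // owning pool / placement group
  uint32_t flags;          // active / referenced / unevictable / dirty bits
  LRUEntry *prev, *next;   // intrusive doubly-linked list pointers
};

int main() {
  const unsigned long long overhead   = 48;        // assumed allocator + hash-index overhead
  const unsigned long long per_object = sizeof(LRUEntry) + overhead;  // ~104 bytes on LP64
  const unsigned long long objects    = 1000000;   // 1 TB SSD / ~1 MB average object
  printf("bytes per object: ~%llu\n", per_object);
  printf("total: ~%.0f MB\n", per_object * objects / 1e6);
  return 0;
}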

> - Two lists for each PG: active and inactive
> Objects are first put into the inactive list when they are accessed, and 
> moved between these two lists based on some criteria.
> Object flag: active, referenced, unevictable, dirty.
> - When an object is accessed:
> 1) If it's not in both of the lists, it's put on the top of the inactive list
> 2) If it's in the inactive list, and the referenced flag is not set, the 
> referenced flag is set, and it's moved to the top of the inactive list.
> 3) If it's in the inactive list, and the referenced flag is set, the 
> referenced flag is cleared, and it's removed from the inactive list, and put 
> on top of the active list.
> 4) If it's in the active list, and the referenced flag is not set, the 
> referenced flag is set, and it's moved to the top of the active list.
> 5) If it's in the active list, and the referenced flag is set, it's moved to 
> the top of the active list.
> - When selecting objects to evict:
> 1) Objects at the bottom of the inactive list are selected to evict. They are 
> removed from the inactive list.
> 2) If the number of the objects in the inactive list becomes low, some of the 
> objects at the bottom of the active list are moved to the inactive list. For 
> those objects which have the referenced flag set, they are given one more 
> chance in the active list. They are moved to the top of the active list with 
> the referenced flag cleared. For those objects which don't have the 
> referenced flag set, they are moved to the inactive list, with the referenced 
> flag set. So that they can be quickly promoted to the active list when 
> necessary.
> 
> ## Combine flush with eviction
> - When evicting an object, if it's dirty, it's flushed first. After flushing, 
> it's evicted. If not dirty, it's evicted directly.
> - This means that we won't have separate activities and won't set different 
> ratios for flush and evict. Is there a need to do so?
> - Number of objects to evict at a time. 'evict_effort' acts as the priority, 
> which is used to calculate the number of objects to evict.

As someone else mentioned in a follow-up, the reason we let the dirty 
level be set lower than the full level is that it provides headroom so 
that objects can be quickly evicted (delete, no flush) to make room for 
new writes or new promotions.

That said, we probably can/should streamline the flush so that an evict 
can immediately follow without waiting for the agent to come around again.  
(I don't think we do that now?)

sage

 
> ## LRU lists Snapshotting
> - The two lists are snapshotted and persisted periodically.
> - Only one copy needs to be saved. The old copy is removed when persisting 
> the lists. The saved lists are used to restore the LRU lists when the OSD reboots.
> 
> Any comments/feedback are welcome.
> 
> 

RE: The design of the eviction improvement

2015-07-20 Thread Allen Samuels
This seems much better than the current mechanism. Do you have an estimate of 
the memory consumption of the two lists? (In terms of bytes/object?)


Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
Sent: Monday, July 20, 2015 1:47 AM
To: Sage Weil; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: The design of the eviction improvement

Hi all,

This is a follow-up of one of the CDS sessions at 
http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
 We discussed the drawbacks of the current eviction algorithm and several ways 
to improve it. It seems like an LRU variant is the right way to go. I came up 
with some design points after the CDS, and want to discuss them with you. It is 
an approximate 2Q algorithm, combining some benefits of the clock algorithm, 
similar to what the linux kernel does for the page cache.

# Design points:

## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementations in the current code have a max_size, 
which limits the max number of elements in the list. This mostly looks like an 
MRU, though the name implies they are LRUs. Since the object size may vary in a 
PG, it's not possible to calculate the total number of objects which the cache 
tier can hold ahead of time. We need a new LRU implementation with no limit on 
the size.
- Two lists for each PG: active and inactive.
Objects are first put into the inactive list when they are accessed, and moved 
between these two lists based on some criteria.
Object flag: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's not in both of the lists, it's put on the top of the inactive list
2) If it's in the inactive list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set, the referenced 
flag is cleared, and it's removed from the inactive list, and put on top of the 
active list.
4) If it's in the active list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set, it's moved to 
the top of the active list.
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are 
removed from the inactive list.
2) If the number of the objects in the inactive list becomes low, some of the 
objects at the bottom of the active list are moved to the inactive list. For 
those objects which have the referenced flag set, they are given one more 
chance in the active list. They are moved to the top of the active list with 
the referenced flag cleared. For those objects which don't have the referenced 
flag set, they are moved to the inactive list with the referenced flag set, so 
that they can be quickly promoted to the active list when necessary.
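
For illustration, here is a minimal sketch of the two-list scheme described 
above, for a single PG. The names, the container choices, and the decision to 
put demoted objects at the top of the inactive list are assumptions made for 
the sketch; this is not the actual Ceph implementation, and flushing of dirty 
objects is omitted:

#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

struct Entry {
  std::string oid;
  bool referenced;
  bool active;     // true if the entry currently lives on the active list
};

struct PGLists {
  std::list<Entry> active_list, inactive_list;   // front() == "top" of the list
  std::unordered_map<std::string, std::list<Entry>::iterator> index;

  void on_access(const std::string& oid) {
    auto it = index.find(oid);
    if (it == index.end()) {                       // rule 1: not in either list
      inactive_list.push_front(Entry{oid, false, false});
      index[oid] = inactive_list.begin();
      return;
    }
    auto e = it->second;
    if (!e->active) {
      if (!e->referenced) {                        // rule 2: set flag, move to top
        e->referenced = true;
        inactive_list.splice(inactive_list.begin(), inactive_list, e);
      } else {                                     // rule 3: promote to active
        e->referenced = false;
        e->active = true;
        active_list.splice(active_list.begin(), inactive_list, e);
      }
    } else {                                       // rules 4 and 5: move to top
      e->referenced = true;
      active_list.splice(active_list.begin(), active_list, e);
    }
  }

  // Eviction takes victims from the bottom (back) of the inactive list.
  bool evict_one() {
    if (inactive_list.empty()) return false;
    index.erase(inactive_list.back().oid);
    inactive_list.pop_back();                      // flush-if-dirty omitted here
    return true;
  }

  // When the inactive list runs low, demote from the bottom of the active list.
  void refill_inactive() {
    if (active_list.empty()) return;
    auto e = std::prev(active_list.end());
    if (e->referenced) {                           // one more chance on the active list
      e->referenced = false;
      active_list.splice(active_list.begin(), active_list, e);
    } else {
      e->active = false;
      e->referenced = true;                        // so it can re-promote quickly
      inactive_list.splice(inactive_list.begin(), active_list, e);
    }
  }
};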

## Combine flush with eviction
- When evicting an object, if it's dirty, it's flushed first. After flushing, 
it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different 
ratios for flush and evict. Is there a need to do so?
- Number of objects to evict at a time. 'evict_effort' acts as the priority, 
which is used to calculate the number of objects to evict.

## LRU lists Snapshotting
- The two lists are snapshotted and persisted periodically.
- Only one copy needs to be saved. The old copy is removed when persisting the 
lists. The saved lists are used to restore the LRU lists when the OSD reboots.
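
A minimal sketch of that single-copy persistence, assuming the list order is 
serialized as one object name per line and the old snapshot is replaced via an 
atomic rename; the file name and format here are made up for the example, not 
an actual on-disk format:

#include <cstdio>
#include <fstream>
#include <list>
#include <string>

// Write both lists, top-to-bottom, to <path>.tmp, then rename it over the
// previous snapshot so that at most one saved copy exists at any time.
void persist_lists(const std::list<std::string>& active,
                   const std::list<std::string>& inactive,
                   const std::string& path) {
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp, std::ios::trunc);
    out << "active\n";
    for (const auto& oid : active) out << oid << "\n";
    out << "inactive\n";
    for (const auto& oid : inactive) out << oid << "\n";
  }  // stream flushed and closed here
  std::rename(tmp.c_str(), path.c_str());   // atomically replaces the old copy
}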

Any comments/feedback are welcome.





Re: [Teuthology] Upgrade hammer on ubuntu : all passed

2015-07-20 Thread Loic Dachary
Hi David,

Would you agree to run a similar suite against the hammer-backports branch? It 
is already scheduled at 
http://pulpito.ceph.com/loic-2015-07-20_16:52:10-upgrade:hammer-x-hammer-backports-distro-basic-multi/
 but maybe you can complete it faster. The command is:

teuthology-suite --ceph hammer-backports --machine-type openstack --suite 
upgrade/hammer-x --filter ubuntu_14.04 
$HOME/src/ceph-qa-suite_master/machine_types/vps.yaml 
$(pwd)/teuthology/test/integration/archive-on-error.yaml

Cheers

On 20/07/2015 12:56, David Casier AEVOO wrote:
> Hi all,
> Good news for the hammer upgrade on Ubuntu:
> http://ceph.aevoo.fr:8081/ubuntu-2015-07-19_05:44:18-upgrade:hammer-hammer---basic-openstack/
> All jobs passed.
> 
> David

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: dmcrypt with luks keys in hammer

2015-07-20 Thread Wyllys Ingersoll
On Mon, Jul 20, 2015 at 6:21 PM, Sage Weil  wrote:
> On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
>> No luck with ceph-disk-activate (all or just one device).
>>
>> $ sudo ceph-disk-activate /dev/sdv1
>> mount: unknown filesystem type 'crypto_LUKS'
>> ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
>> 'crypto_LUKS', '-o', '', '--', '/dev/sdv1',
>> '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32
>>
>>
>> It's odd that it should complain about the "crypto_LUKS" filesystem not
>> being recognized, because it did mount some of the LUKS systems
>> successfully, though sometimes just the data and not the journal
>> (or vice versa).
>>
>> $ lsblk /dev/sdb
>> NAMEMAJ:MIN RM   SIZE RO
>> TYPE  MOUNTPOINT
>> sdb   8:16   0   3.7T  0 disk
>> ├─sdb18:17   0   3.6T  0 part
>> │ └─e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:00   3.6T  0
>> crypt /var/lib/ceph/osd/ceph-54
>> └─sdb28:18   010G  0 part
>>   └─temporary-cryptsetup-1235 (dm-6)252:60   125K  1 crypt
>>
>>
>> $ blkid /dev/sdb1
>> /dev/sdb1: UUID="d6194096-a219-4732-8d61-d0c125c49393" TYPE="crypto_LUKS"
>>
>>
>> A race condition (or other issue) with udev seems likely given that
>> it's rather random which ones come up and which ones don't.
>
> A race condition during creation or activation?  If it's activation I
> would expect ceph-disk activate ... to work reasonably reliably when
> called manually (on a single device at a time).
>
> sage
>


I'm not sure. I do know that all of the disks *did* work after the
initial installation and activation, but they fail after reboot, and
the failures are non-deterministic.  I'm not really sure how to debug
it any further.


Re: dmcrypt with luks keys in hammer

2015-07-20 Thread Sage Weil
On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
> No luck with ceph-disk-activate (all or just one device).
> 
> $ sudo ceph-disk-activate /dev/sdv1
> mount: unknown filesystem type 'crypto_LUKS'
> ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
> 'crypto_LUKS', '-o', '', '--', '/dev/sdv1',
> '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32
> 
> 
> It's odd that it should complain about the "crypto_LUKS" filesystem not
> being recognized, because it did mount some of the LUKS systems
> successfully, though sometimes just the data and not the journal
> (or vice versa).
> 
> $ lsblk /dev/sdb
> NAMEMAJ:MIN RM   SIZE RO
> TYPE  MOUNTPOINT
> sdb   8:16   0   3.7T  0 disk
> ├─sdb18:17   0   3.6T  0 part
> │ └─e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:00   3.6T  0
> crypt /var/lib/ceph/osd/ceph-54
> └─sdb28:18   010G  0 part
>   └─temporary-cryptsetup-1235 (dm-6)252:60   125K  1 crypt
> 
> 
> $ blkid /dev/sdb1
> /dev/sdb1: UUID="d6194096-a219-4732-8d61-d0c125c49393" TYPE="crypto_LUKS"
> 
> 
> A race condition (or other issue) with udev seems likely given that
> it's rather random which ones come up and which ones don't.

A race condition during creation or activation?  If it's activation I 
would expect ceph-disk activate ... to work reasonably reliably when 
called manually (on a single device at a time).

sage

> 
> 
> 
> 
> On Mon, Jul 20, 2015 at 5:22 PM, Sage Weil  wrote:
> > On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
> >> We're running a cluster with Hammer v0.94.2 and are running into issues
> >> with the LUKS-encrypted OSD data and journal partitions.  The
> >> installation goes smoothly and everything runs OK, but we've had to
> >> reboot a couple of the storage nodes for various reasons and when they
> >> come back online, a large number of OSD processes fail to start
> >> because the LUKS encrypted partitions are not getting mounted
> >> correctly.
> >>
> >> I'm not sure if it is a udev issue or a problem with the OSD process
> >> itself, but the encrypted partitions end up getting mounted as
> >> "temporary-cryptsetup-PID" and they never recover.  From below, you
> >> can see that some of the OSDs did come up correctly, but the majority
> >> do not.   We've seen this problem now on several storage nodes, and it
> >> only occurs for those OSDs that used luks (the new default).  The only
> >> recovery that we've found is to wipe them all out and rebuild them
> >> using "plain" dmcrypt (as it used to be).
> >>
> >> Using "blkid" on a partition that is in the "temporary-cryptsetup"
> >> state, does show that it has the right ID_PART_ENTRY_UUID and TYPE
> >> values and I can confirm that there is an associated key in
> >> /etc/ceph/dmcrypt-keys, but it still isn't mounting correctly.
> >>
> >> $ sudo blkid -p -o udev /dev/sdv2
> >> ID_FS_UUID=87008c17-9e57-487d-8f8b-160f8f803d8b
> >> ID_FS_UUID_ENC=87008c17-9e57-487d-8f8b-160f8f803d8b
> >> ID_FS_VERSION=1
> >> ID_FS_TYPE=crypto_LUKS
> >> ID_FS_USAGE=crypto
> >> ID_PART_ENTRY_SCHEME=gpt
> >> ID_PART_ENTRY_NAME=ceph\x20journal
> >> ID_PART_ENTRY_UUID=e3eda67b-a2e0-4d22-a62e-d9bda5ecf8b1
> >> ID_PART_ENTRY_TYPE=45b0969e-9b03-4f30-b4c6-35865ceff106
> >> ID_PART_ENTRY_NUMBER=2
> >> ID_PART_ENTRY_OFFSET=2048
> >> ID_PART_ENTRY_SIZE=20969473
> >> ID_PART_ENTRY_DISK=65:80
> >>
> >> So I'm checking to see if this is a known issue or if we are missing
> >> something in the installation or configuration that would fix this
> >> problem.
> >
> > This isn't a known issue, although I think we have seen problems in
> > general with hosts with lots of OSDs not always coming up on boot.  If it
> > is specifically a problem with luks+dmcrypt that would be interesting!
> >
> > Does an explicit 'ceph-disk activate /dev/...' on one of the devices make
> > it come up?  And/or a 'ceph-disk activate-all'?  If so that would indicate
> > a race issue in udev.
> >
> > Thanks-
> > sage
> >
> >
> >>
> >> -Wyllys Ingersoll
> >>
> >>
> >> Ex:
> >> $ lsblk -l
> >> NAME MAJ:MIN RM   SIZE RO TYPE
> >>  MOUNTPOINT
> >> sda8:00 111.8G  0 disk
> >> sda1   8:10  15.3G  0 part  
> >> [SWAP]
> >> sda2   8:20 1K  0 part
> >> sda5   8:50  96.5G  0 part  /
> >> sdb8:16   0   3.7T  0 disk
> >> sdb1   8:17   0   3.6T  0 part
> >> e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0)  252:00   3.6T  0 crypt
> >> sdb2   8:18   010G  0 part
> >> temporary-cryptsetup-1235 (dm-6) 252:60   125K  1 crypt
> >> sdc   

[ANN] ceph-deploy 1.5.26 released

2015-07-20 Thread Travis Rhoden
Hi everyone,

This is announcing a new release of ceph-deploy that focuses on usability 
improvements.

 - Most of the help menus for ceph-deploy subcommands (e.g. “ceph-deploy mon” 
and “ceph-deploy osd”) have been improved to be more context aware, such that 
help for “ceph-deploy osd create --help” and “ceph-deploy osd zap --help” 
returns different output specific to the command.  Previously it would show 
generic help for “ceph-deploy osd”.  Additionally, the list of optional 
arguments shown for the command is always correct for the subcommand in 
question.  Previously the options shown were the aggregate of all options.

 - ceph-deploy now points to git.ceph.com for downloading GPG keys

 - ceph-deploy will now work on the Mint Linux distribution (by pointing to 
Ubuntu packages)

 - SUSE distro users will now be pointed to SUSE packages by default, as there 
have not been updated SUSE packages on ceph.com in quite some time.

Full changelog is available at: 
http://ceph.com/ceph-deploy/docs/changelog.html#id1

New packages are available in the usual places of ceph.com hosted repos and 
PyPI.

Cheers,

 - Travis


Re: dmcrypt with luks keys in hammer

2015-07-20 Thread Wyllys Ingersoll
No luck with ceph-disk-activate (all or just one device).

$ sudo ceph-disk-activate /dev/sdv1
mount: unknown filesystem type 'crypto_LUKS'
ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
'crypto_LUKS', '-o', '', '--', '/dev/sdv1',
'/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32


It's odd that it should complain about the "crypto_LUKS" filesystem not
being recognized, because it did mount some of the LUKS systems
successfully, though sometimes just the data and not the journal
(or vice versa).

$ lsblk /dev/sdb
NAMEMAJ:MIN RM   SIZE RO
TYPE  MOUNTPOINT
sdb   8:16   0   3.7T  0 disk
├─sdb18:17   0   3.6T  0 part
│ └─e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:00   3.6T  0
crypt /var/lib/ceph/osd/ceph-54
└─sdb28:18   010G  0 part
  └─temporary-cryptsetup-1235 (dm-6)252:60   125K  1 crypt


$ blkid /dev/sdb1
/dev/sdb1: UUID="d6194096-a219-4732-8d61-d0c125c49393" TYPE="crypto_LUKS"


A race condition (or other issue) with udev seems likely given that
it's rather random which ones come up and which ones don't.




On Mon, Jul 20, 2015 at 5:22 PM, Sage Weil  wrote:
> On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
>> We're running a cluster with Hammer v0.94.2 and are running into issues
>> with the LUKS-encrypted OSD data and journal partitions.  The
>> installation goes smoothly and everything runs OK, but we've had to
>> reboot a couple of the storage nodes for various reasons and when they
>> come back online, a large number of OSD processes fail to start
>> because the LUKS encrypted partitions are not getting mounted
>> correctly.
>>
>> I'm not sure if it is a udev issue or a problem with the OSD process
>> itself, but the encrypted partitions end up getting mounted as
>> "temporary-cryptsetup-PID" and they never recover.  From below, you
>> can see that some of the OSDs did come up correctly, but the majority
>> do not.   We've seen this problem now on several storage nodes, and it
>> only occurs for those OSDs that used luks (the new default).  The only
>> recovery that we've found is to wipe them all out and rebuild them
>> using "plain" dmcrypt (as it used to be).
>>
>> Using "blkid" on a partition that is in the "temporary-cryptsetup"
>> state, does show that it has the right ID_PART_ENTRY_UUID and TYPE
>> values and I can confirm that there is an associated key in
>> /etc/ceph/dmcrypt-keys, but it still isn't mounting correctly.
>>
>> $ sudo blkid -p -o udev /dev/sdv2
>> ID_FS_UUID=87008c17-9e57-487d-8f8b-160f8f803d8b
>> ID_FS_UUID_ENC=87008c17-9e57-487d-8f8b-160f8f803d8b
>> ID_FS_VERSION=1
>> ID_FS_TYPE=crypto_LUKS
>> ID_FS_USAGE=crypto
>> ID_PART_ENTRY_SCHEME=gpt
>> ID_PART_ENTRY_NAME=ceph\x20journal
>> ID_PART_ENTRY_UUID=e3eda67b-a2e0-4d22-a62e-d9bda5ecf8b1
>> ID_PART_ENTRY_TYPE=45b0969e-9b03-4f30-b4c6-35865ceff106
>> ID_PART_ENTRY_NUMBER=2
>> ID_PART_ENTRY_OFFSET=2048
>> ID_PART_ENTRY_SIZE=20969473
>> ID_PART_ENTRY_DISK=65:80
>>
>> So I'm checking to see if this is a known issue or if we are missing
>> something in the installation or configuration that would fix this
>> problem.
>
> This isn't a known issue, although I think we have seen problems in
> general with hosts with lots of OSDs not always coming up on boot.  If it
> is specifically a problem with luks+dmcrypt that would be interesting!
>
> Does an explicit 'ceph-disk activate /dev/...' on one of the devices make
> it come up?  And/or a 'ceph-disk activate-all'?  If so that would indicate
> a race issue in udev.
>
> Thanks-
> sage
>
>
>>
>> -Wyllys Ingersoll
>>
>>
>> Ex:
>> $ lsblk -l
>> NAME MAJ:MIN RM   SIZE RO TYPE
>>  MOUNTPOINT
>> sda8:00 111.8G  0 disk
>> sda1   8:10  15.3G  0 part  
>> [SWAP]
>> sda2   8:20 1K  0 part
>> sda5   8:50  96.5G  0 part  /
>> sdb8:16   0   3.7T  0 disk
>> sdb1   8:17   0   3.6T  0 part
>> e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0)  252:00   3.6T  0 crypt
>> sdb2   8:18   010G  0 part
>> temporary-cryptsetup-1235 (dm-6) 252:60   125K  1 crypt
>> sdc8:32   0   3.7T  0 disk
>> sdc1   8:33   0   3.6T  0 part
>> temporary-cryptsetup-1788 (dm-37)252:37   0   125K  1 crypt
>> sdc2   8:34   010G  0 part
>> temporary-cryptsetup-1789 (dm-36)252:36   0   125K  1 crypt
>> sdd8:48   0   3.7T  0 disk
>> sdd1

Re: dmcrypt with luks keys in hammer

2015-07-20 Thread Sage Weil
On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
> We're running a cluster with Hammer v0.94.2 and are running into issues
> with the LUKS-encrypted OSD data and journal partitions.  The
> installation goes smoothly and everything runs OK, but we've had to
> reboot a couple of the storage nodes for various reasons and when they
> come back online, a large number of OSD processes fail to start
> because the LUKS encrypted partitions are not getting mounted
> correctly.
> 
> I'm not sure if it is a udev issue or a problem with the OSD process
> itself, but the encrypted partitions end up getting mounted as
> "temporary-cryptsetup-PID" and they never recover.  From below, you
> can see that some of the OSDs did come up correctly, but the majority
> do not.   We've seen this problem now on several storage nodes, and it
> only occurs for those OSDs that used luks (the new default).  The only
> recovery that we've found is to wipe them all out and rebuild them
> using "plain" dmcrypt (as it used to be).
> 
> Using "blkid" on a partition that is in the "temporary-cryptsetup"
> state, does show that it has the right ID_PART_ENTRY_UUID and TYPE
> values and I can confirm that there is an associated key in
> /etc/ceph/dmcrypt-keys, but it still isn't mounting correctly.
> 
> $ sudo blkid -p -o udev /dev/sdv2
> ID_FS_UUID=87008c17-9e57-487d-8f8b-160f8f803d8b
> ID_FS_UUID_ENC=87008c17-9e57-487d-8f8b-160f8f803d8b
> ID_FS_VERSION=1
> ID_FS_TYPE=crypto_LUKS
> ID_FS_USAGE=crypto
> ID_PART_ENTRY_SCHEME=gpt
> ID_PART_ENTRY_NAME=ceph\x20journal
> ID_PART_ENTRY_UUID=e3eda67b-a2e0-4d22-a62e-d9bda5ecf8b1
> ID_PART_ENTRY_TYPE=45b0969e-9b03-4f30-b4c6-35865ceff106
> ID_PART_ENTRY_NUMBER=2
> ID_PART_ENTRY_OFFSET=2048
> ID_PART_ENTRY_SIZE=20969473
> ID_PART_ENTRY_DISK=65:80
> 
> So I'm checking to see if this is a known issue or if we are missing
> something in the installation or configuration that would fix this
> problem.

This isn't a known issue, although I think we have seen problems in 
general with hosts with lots of OSDs not always coming up on boot.  If it 
is specifically a problem with luks+dmcrypt that would be interesting!

Does an explicit 'ceph-disk activate /dev/...' on one of the devices make 
it come up?  And/or a 'ceph-disk activate-all'?  If so that would indicate 
a race issue in udev.

Thanks-
sage


> 
> -Wyllys Ingersoll
> 
> 
> Ex:
> $ lsblk -l
> NAME MAJ:MIN RM   SIZE RO TYPE
>  MOUNTPOINT
> sda8:00 111.8G  0 disk
> sda1   8:10  15.3G  0 part  [SWAP]
> sda2   8:20 1K  0 part
> sda5   8:50  96.5G  0 part  /
> sdb8:16   0   3.7T  0 disk
> sdb1   8:17   0   3.6T  0 part
> e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0)  252:00   3.6T  0 crypt
> sdb2   8:18   010G  0 part
> temporary-cryptsetup-1235 (dm-6) 252:60   125K  1 crypt
> sdc8:32   0   3.7T  0 disk
> sdc1   8:33   0   3.6T  0 part
> temporary-cryptsetup-1788 (dm-37)252:37   0   125K  1 crypt
> sdc2   8:34   010G  0 part
> temporary-cryptsetup-1789 (dm-36)252:36   0   125K  1 crypt
> sdd8:48   0   3.7T  0 disk
> sdd1   8:49   0   3.6T  0 part
> temporary-cryptsetup-1252 (dm-1) 252:10   125K  1 crypt
> sdd2   8:50   010G  0 part
> temporary-cryptsetup-1246 (dm-3) 252:30   125K  1 crypt
> sde8:64   0   3.7T  0 disk
> sde1   8:65   0   3.6T  0 part
> temporary-cryptsetup-1260 (dm-14)252:14   0   125K  1 crypt
> sde2   8:66   010G  0 part
> temporary-cryptsetup-1255 (dm-12)252:12   0   125K  1 crypt
> sdf8:80   0   3.7T  0 disk
> sdf1   8:81   0   3.6T  0 part
> temporary-cryptsetup-1268 (dm-15)252:15   0   125K  1 crypt
> sdf2   8:82   010G  0 part
> temporary-cryptsetup-1245 (dm-5) 252:50   125K  1 crypt
> sdg8:96   0   3.7T  0 disk
> sdg1   8:97   0   3.6T  0 part
> temporary-cryptsetup-1271 (dm-17)252:17   0   125K  1 crypt
> sdg2   8:98   010G  0 part
> temporary-cryptsetup-1278 (dm-2) 252:20   125K  1 crypt
> sdh

dmcrypt with luks keys in hammer

2015-07-20 Thread Wyllys Ingersoll
We're running a cluster with Hammer v0.94.2 and are running into issues
with the LUKS-encrypted OSD data and journal partitions.  The
installation goes smoothly and everything runs OK, but we've had to
reboot a couple of the storage nodes for various reasons and when they
come back online, a large number of OSD processes fail to start
because the LUKS encrypted partitions are not getting mounted
correctly.

I'm not sure if it is a udev issue or a problem with the OSD process
itself, but the encrypted partitions end up getting mounted as
"temporary-cryptsetup-PID" and they never recover.  From below, you
can see that some of the OSDs did come up correctly, but the majority
do not.   We've seen this problem now on several storage nodes, and it
only occurs for those OSDs that used LUKS (the new default).  The only
recovery that we've found is to wipe them all out and rebuild them
using "plain" dmcrypt (as it used to be).

Using "blkid" on a partition that is in the "temporary-cryptsetup"
state, does show that it has the right ID_PART_ENTRY_UUID and TYPE
values and I can confirm that there is an associated key in
/etc/ceph/dmcrypt-keys, but it still isn't mounting correctly.

$ sudo blkid -p -o udev /dev/sdv2
ID_FS_UUID=87008c17-9e57-487d-8f8b-160f8f803d8b
ID_FS_UUID_ENC=87008c17-9e57-487d-8f8b-160f8f803d8b
ID_FS_VERSION=1
ID_FS_TYPE=crypto_LUKS
ID_FS_USAGE=crypto
ID_PART_ENTRY_SCHEME=gpt
ID_PART_ENTRY_NAME=ceph\x20journal
ID_PART_ENTRY_UUID=e3eda67b-a2e0-4d22-a62e-d9bda5ecf8b1
ID_PART_ENTRY_TYPE=45b0969e-9b03-4f30-b4c6-35865ceff106
ID_PART_ENTRY_NUMBER=2
ID_PART_ENTRY_OFFSET=2048
ID_PART_ENTRY_SIZE=20969473
ID_PART_ENTRY_DISK=65:80

So I'm checking to see if this is a known issue or if we are missing
something in the installation or configuration that would fix this
problem.

-Wyllys Ingersoll


Ex:
$ lsblk -l
NAME MAJ:MIN RM   SIZE RO TYPE
 MOUNTPOINT
sda8:00 111.8G  0 disk
sda1   8:10  15.3G  0 part  [SWAP]
sda2   8:20 1K  0 part
sda5   8:50  96.5G  0 part  /
sdb8:16   0   3.7T  0 disk
sdb1   8:17   0   3.6T  0 part
e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0)  252:00   3.6T  0 crypt
sdb2   8:18   010G  0 part
temporary-cryptsetup-1235 (dm-6) 252:60   125K  1 crypt
sdc8:32   0   3.7T  0 disk
sdc1   8:33   0   3.6T  0 part
temporary-cryptsetup-1788 (dm-37)252:37   0   125K  1 crypt
sdc2   8:34   010G  0 part
temporary-cryptsetup-1789 (dm-36)252:36   0   125K  1 crypt
sdd8:48   0   3.7T  0 disk
sdd1   8:49   0   3.6T  0 part
temporary-cryptsetup-1252 (dm-1) 252:10   125K  1 crypt
sdd2   8:50   010G  0 part
temporary-cryptsetup-1246 (dm-3) 252:30   125K  1 crypt
sde8:64   0   3.7T  0 disk
sde1   8:65   0   3.6T  0 part
temporary-cryptsetup-1260 (dm-14)252:14   0   125K  1 crypt
sde2   8:66   010G  0 part
temporary-cryptsetup-1255 (dm-12)252:12   0   125K  1 crypt
sdf8:80   0   3.7T  0 disk
sdf1   8:81   0   3.6T  0 part
temporary-cryptsetup-1268 (dm-15)252:15   0   125K  1 crypt
sdf2   8:82   010G  0 part
temporary-cryptsetup-1245 (dm-5) 252:50   125K  1 crypt
sdg8:96   0   3.7T  0 disk
sdg1   8:97   0   3.6T  0 part
temporary-cryptsetup-1271 (dm-17)252:17   0   125K  1 crypt
sdg2   8:98   010G  0 part
temporary-cryptsetup-1278 (dm-2) 252:20   125K  1 crypt
sdh8:112  0   3.7T  0 disk
sdh1   8:113  0   3.6T  0 part
69dcd1e1-6e11-41ec-af19-1e0d90013957 (dm-43) 252:43   0   3.6T  0
crypt /var/lib/ceph/osd/ceph-42
sdh2   8:114  010G  0 part
3382723d-b0d9-4b50-affe-fb9f5df78d6f (dm-45) 252:45   010G  0 crypt
sdi8:128  0   3.7T  0 disk
sdi1   8:129  0   3.6T  0 part
temporary-cryptsetup-1265 (dm-20)252:20   0   125K  1 crypt
sdi2   

Re: [sepia] debian jessie gitbuilder repositories ?

2015-07-20 Thread Sage Weil
On Mon, 20 Jul 2015, Dan Mick wrote:
> On 07/20/2015 07:19 AM, Sage Weil wrote:
> > On Mon, 20 Jul 2015, Alexandre DERUMIER wrote:
> >> Hi,
> >>
>> The Debian jessie gitbuilder has been OK for 2 weeks now,
> >>
> >> http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-jessie-amd64-basic
> >>
> >>
>> Is it possible to push packages to the repositories?
> >>
> >> http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/
> > 
> > 
> > The builds are failing with this:
> > 
> > + GNUPGHOME=/srv/gnupg reprepro --ask-passphrase -b 
> > ../out/output/sha1/6ffb1c4ae43bcde9f5fde40dd97959399135ed86.tmp -C main 
> > --ignore=undef
> > inedtarget --ignore=wrongdistribution include jessie 
> > out~/ceph_0.94.2-50-g6ffb1c4-1jessie_amd64.changes
> > Cannot find definition of distribution 'jessie'!
> > There have been errors!
> > 
> > 
> > I've seen it before a long time ago, but I forget what the resolution is.
> > 
> > sage
> > ___
> > Sepia mailing list
> > se...@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/sepia-ceph.com
> 
> https://github.com/ceph/ceph-build/pull/102, probably

That fixed it, thanks!

sage


Re: [sepia] debian jessie gitbuilder repositories ?

2015-07-20 Thread Dan Mick
On 07/20/2015 07:19 AM, Sage Weil wrote:
> On Mon, 20 Jul 2015, Alexandre DERUMIER wrote:
>> Hi,
>>
>> The Debian jessie gitbuilder has been OK for 2 weeks now,
>>
>> http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-jessie-amd64-basic
>>
>>
>> Is it possible to push packages to the repositories?
>>
>> http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/
> 
> 
> The builds are failing with this:
> 
> + GNUPGHOME=/srv/gnupg reprepro --ask-passphrase -b 
> ../out/output/sha1/6ffb1c4ae43bcde9f5fde40dd97959399135ed86.tmp -C main 
> --ignore=undef
> inedtarget --ignore=wrongdistribution include jessie 
> out~/ceph_0.94.2-50-g6ffb1c4-1jessie_amd64.changes
> Cannot find definition of distribution 'jessie'!
> There have been errors!
> 
> 
> I've seen it before a long time ago, but I forget what the resolution is.
> 
> sage
> ___
> Sepia mailing list
> se...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/sepia-ceph.com

https://github.com/ceph/ceph-build/pull/102, probably


Re: teuthology : 70 workers need more than 8GB RAM / 2 CPUS

2015-07-20 Thread Loic Dachary
Thanks for the feedback. I'll try with PostgreSQL, as it seems the sqlite 
modifications made no real difference.

On 20/07/2015 17:38, Zack Cerza wrote:
> Hi Loic,
> 
> This is definitely something to keep an eye on. It's actually a bit 
> surprising to me, though - I haven't seen ansible-playbook use any 
> significant resources in sepia.
> 
> I wouldn't really recommend running paddles on the same host as teuthology 
> though, to do any serious amount of testing; some teuthology tasks do use 
> large amounts of RAM and/or CPU, and severe load issues could feasibly cause 
> requests to time out, affecting other jobs.
> 
> That's all theory though, as I've always used separate hosts for the two 
> services.
> 
> Zack
> 
> - Original Message -
>> From: "Loic Dachary" 
>> To: "Zack Cerza" , "Andrew Schoen" 
>> Cc: "Ceph Development" 
>> Sent: Sunday, July 19, 2015 9:06:41 AM
>> Subject: Re: teuthology : 70 workers need more than 8GB RAM / 2 CPUS
>>
>> Hi again,
>>
>> I had the same problem when 50 workers kicked in at the same time. I've lowered
>> the number of workers down to 25 and it went well. For a few minutes (~8
>> minutes) the load average stayed around 25 (CPU bound, mainly the ansible
>> playbook competing, see the screenshot of htop), but I did not see any error /
>> timeout. Then I added 15 workers, waited for the load to go back to < 2 (10
>> minutes), then 15 more (10 minutes) to get to 55.
>>
>> That sounds like a lot of CPU used by a single playbook run. Is there a known
>> way to reduce that? If not I'll just upgrade the machine. I just want to make
>> sure I'm not missing a simple solution ;-)
>>
>> Cheers
>>
>> On 19/07/2015 14:22, Loic Dachary wrote:
>>> Hi,
>>>
>>> For the record, I launched a rados suite on an idle teuthology cluster,
>>> with 70 workers running on an 8GB RAM / 2 CPUs / 40GB SSD disk. The load
>>> average reached 40 within a minute or two and some jobs started failing /
>>> timing out. I had pulpito running on the same machine and it failed one
>>> time out of two because of the load (see the top image).
>>>
>>> On Friday I was able to run 70 workers because I gradually added them. The
>>> load peak is when a job starts and all workers kick in at the same time.
>>>
>>> Cheers
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: teuthology : 70 workers need more than 8GB RAM / 2 CPUS

2015-07-20 Thread Zack Cerza
Hi Loic,

This is definitely something to keep an eye on. It's actually a bit surprising 
to me, though - I haven't seen ansible-playbook use any significant resources 
in sepia.

I wouldn't really recommend running paddles on the same host as teuthology 
though, to do any serious amount of testing; some teuthology tasks do use large 
amounts of RAM and/or CPU, and severe load issues could feasibly cause requests 
to time out, affecting other jobs.

That's all theory though, as I've always used separate hosts for the two 
services.

Zack

- Original Message -
> From: "Loic Dachary" 
> To: "Zack Cerza" , "Andrew Schoen" 
> Cc: "Ceph Development" 
> Sent: Sunday, July 19, 2015 9:06:41 AM
> Subject: Re: teuthology : 70 workers need more than 8GB RAM / 2 CPUS
> 
> Hi again,
> 
> I had the same problem when 50 workers kicked in at the same time. I've lowered
> the number of workers down to 25 and it went well. For a few minutes (~8
> minutes) the load average stayed around 25 (CPU bound, mainly the ansible
> playbook competing, see the screenshot of htop), but I did not see any error /
> timeout. Then I added 15 workers, waited for the load to go back to < 2 (10
> minutes), then 15 more (10 minutes) to get to 55.
> 
> That sounds like a lot of CPU used by a single playbook run. Is there a known
> way to reduce that? If not I'll just upgrade the machine. I just want to make
> sure I'm not missing a simple solution ;-)
> 
> Cheers
> 
> On 19/07/2015 14:22, Loic Dachary wrote:
> > Hi,
> > 
> > For the record, I launched a rados suite on an idle teuthology cluster,
> > with 70 workers running on an 8GB RAM / 2 CPUs / 40GB SSD disk. The load
> > average reached 40 within a minute or two and some jobs started failing /
> > timing out. I had pulpito running on the same machine and it failed one
> > time out of two because of the load (see the top image).
> > 
> > On Friday I was able to run 70 workers because I gradually added them. The
> > load peak is when a job starts and all workers kick in at the same time.
> > 
> > Cheers
> > 
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> 


ceph branch status

2015-07-20 Thread ceph branch robot
-- All Branches --

Adam Crume 
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza 
2015-03-23 16:39:48 -0400   wip-11212
2015-03-25 10:10:43 -0400   wip-11065
2015-07-01 08:34:15 -0400   wip-12037

Alfredo Deza 
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Boris Ranto 
2015-04-13 13:51:32 +0200   wip-fix-ceph-dencoder-build
2015-04-14 13:51:49 +0200   wip-fix-ceph-dencoder-build-master
2015-06-23 15:29:45 +0200   wip-user-rebase
2015-07-10 12:34:33 +0200   wip-bash-completion
2015-07-15 18:21:11 +0200   wip-selinux-policy

Chendi.Xue 
2015-06-16 14:39:42 +0800   wip-blkin

Chi Xinze 
2015-05-15 21:47:44 +   XinzeChi-wip-ec-read

Dan Mick 
2013-07-16 23:00:06 -0700   wip-5634

Danny Al-Gaaf 
2015-04-23 16:32:00 +0200   wip-da-SCA-20150421
2015-04-23 17:18:57 +0200   wip-nosetests
2015-04-23 18:20:16 +0200   wip-unify-num_objects_degraded
2015-07-17 10:50:46 +0200   wip-da-SCA-20150601

David Zafman 
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2015-04-24 13:14:23 -0700   wip-cot-giant
2015-06-02 13:46:23 -0700   wip-11511
2015-07-07 18:11:19 -0700   wip-zafman-testing
2015-07-16 19:13:45 -0700   wip-12000-12200

Dongmao Zhang 
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum 
2015-04-29 21:44:11 -0700   wip-init-names
2015-06-11 18:22:55 -0700   greg-fs-testing
2015-07-16 09:28:24 -0700   hammer-12297

Greg Farnum 
2014-10-23 13:33:44 -0700   wip-forward-scrub

Gregory Meno 
2015-02-25 17:30:33 -0800   wip-fix-typo-troubleshooting

Guang G Yang 
2015-06-26 20:31:44 +   wip-ec-readall

Guang Yang 
2014-08-08 10:41:12 +   wip-guangyy-pg-splitting
2014-09-25 00:47:46 +   wip-9008
2014-09-30 10:36:39 +   guangyy-wip-9614

Haomai Wang 
2014-07-27 13:37:49 +0800   wip-flush-set
2015-04-20 00:47:59 +0800   update-organization
2015-04-20 00:48:42 +0800   update-organization-1
2015-07-10 15:46:45 +0800   fio-objectstore

Ilya Dryomov 
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

James Page 
2013-02-27 22:50:38 +   wip-debhelper-8

Jason Dillaman 
2015-05-22 00:52:20 -0400   wip-11625
2015-06-10 12:02:16 -0400   wip-11770-hammer
2015-06-22 11:17:56 -0400   wip-12109-hammer
2015-06-22 16:02:33 -0400   wip-11769-firefly
2015-07-17 12:06:14 -0400   wip-11286
2015-07-17 12:07:32 -0400   wip-11287
2015-07-17 14:17:04 -0400   wip-12384-hammer
2015-07-19 13:44:16 -0400   wip-12237-hammer

Jenkins 
2014-07-29 05:24:39 -0700   wip-nhm-hang
2015-02-02 10:35:28 -0800   wip-sam-v0.92
2015-06-10 15:04:07 -0700   rhcs-v0.80.8
2015-07-01 14:40:49 -0700   rhcs-v0.94.1-ubuntu
2015-07-14 13:10:32 -0700   last

Joao Eduardo Luis 
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis 
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis 
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling

Joao Eduardo Luis 
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7
2015-01-10 02:40:42 +   wip-dho-joao
2015-01-10 02:46:31 +   wip-mon-paxos-fix
2015-01-26 13:00:09 +   wip-mon-datahealth-fix
2015-02-04 22:36:14 +   wip-10643

Joao Eduardo Luis 
2015-05-27 23:48:45 +0100   wip-mon-scrub
2015-05-28 08:12:48 +0100   wip-11786
2015-05-29 12:21:43 +0100   wip-11545
2015-06-05 16:12:57 +0100   wip-10507
2015-06-16 14:34:11 +0100   wip-11470
2015-06-25 00:16:41 +0100   wip-10507-2
2015-07-14 16:52:35 +0100   wip-joao-testing

John Spray 
2015-04-06 17:25:02 +0100   wip-progress-events
2015-05-05 14:29:16 +0100   wip-live-query
2015-05-28 13:31:32 +0100   wip-9963
2015-05-29 13:59:03 +0100   wip-9964-intrapg
2015-05-29 14:19:16 +0100   wip-9964
2015-06-02 18:16:38 +0100   wip-11859
2015-06-02 18:16:38 +0100   wip-damage-table
2015-06-03 10:09:09 +0100   wip-11857
2015-06-04 12:36:09 +0100   wip-nobjectiterator-crash
2015-06-10 14:10:24 +0100   wip-9663-hashorder
2015-06-10 23:50:49 +0100   wip-9964-nosharding
2015-07-15 15:04:42 +0100   wip-mds-refactor
2015-07-20 12:35:21 +0100   wip-scrub-jcs

John Wilkins 
2

Re: debian jessie gitbuilder repositories ?

2015-07-20 Thread Alfredo Deza


- Original Message -
> From: "Sage Weil" 
> To: "Alexandre DERUMIER" 
> Cc: "ceph-devel" , se...@ceph.com
> Sent: Monday, July 20, 2015 10:19:49 AM
> Subject: Re: debian jessie gitbuilder repositories ?
> 
> On Mon, 20 Jul 2015, Alexandre DERUMIER wrote:
> > Hi,
> > 
> > The Debian jessie gitbuilder has been OK for 2 weeks now,
> > 
> > http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-jessie-amd64-basic
> > 
> > 
> > Is it possible to push packages to the repositories?
> > 
> > http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/
> 
> 
> The builds are failing with this:
> 
> + GNUPGHOME=/srv/gnupg reprepro --ask-passphrase -b
> ../out/output/sha1/6ffb1c4ae43bcde9f5fde40dd97959399135ed86.tmp -C main
> --ignore=undef
> inedtarget --ignore=wrongdistribution include jessie
> out~/ceph_0.94.2-50-g6ffb1c4-1jessie_amd64.changes
> Cannot find definition of distribution 'jessie'!
> There have been errors!
> 
> 
> I've seen it before a long time ago, but I forget what the resolution is.

I am not 100% sure how the gitbuilders are set up, but the DEB builders are 
meant to be generic, and they can only be generic when some setup is run to 
create the environments for each distro.

The environments are created/updated with pbuilder, so one needs to exist for 
jessie. There is a script that can do that, called update_pbuilder.sh, but it 
is missing `jessie`:

https://github.com/ceph/ceph-build/blob/master/update_pbuilder.sh#L22-30

A manual call to create the jessie environment for pbuilder would suffice here 
I think.
> 
> sage
> 


Re: rados/thrash on OpenStack

2015-07-20 Thread Loic Dachary
More information about this run. I'll run a rados suite on master on OpenStack 
to get a baseline of what we should expect.

http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/12/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/14/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/15/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/17/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/20/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/21/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/22/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/23/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/26/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/28/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/2/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/5/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/6/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/7/
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/9/

I see

2015-07-20T10:02:10.567 
INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: In function 'bool 
ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t&)' thread 
7f2af94df700 time 2015-07-20 10:02:10.481916
2015-07-20T10:02:10.567 
INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: 412: FAILED 
assert(!actingbackfill.empty())
2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr: ceph version 
9.0.2-799-gba9c2ae (ba9c2ae4bffd3fd7b26a2e0ce843913b77940b8a)
2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0xc45d1b]
2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 2: ceph-osd() 
[0x88535d]
2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 3: 
(ReplicatedPG::hit_set_remove_all()+0x7c) [0x8b039c]
2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 4: 
(ReplicatedPG::on_pool_change()+0x161) [0x8b1a21]
2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 5: 
(PG::handle_advance_map(std::tr1::shared_ptr, 
std::tr1::shared_ptr, std::vector >&, 
int, std::vector >&, int, PG::RecoveryCtx*)+0x60c) 
[0x8348fc]
2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 6: 
(OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, 
std::set, std::less >, 
std::allocator > >*)+0x2c3) [0x6dcc73]
2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 7: 
(OSD::process_peering_events(std::list > const&, 
ThreadPool::TPHandle&)+0x1f1) [0x6dd721]
2015-07-20T10:02:10.572 INFO:tasks.ceph.osd.5.ovh165019.stderr: 8: 
(OSD::PeeringWQ::_process(std::list > const&, 
ThreadPool::TPHandle&)+0x18) [0x7328d8]
2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 9: 
(ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xc3677e]
2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 10: 
(ThreadPool::WorkThread::entry()+0x10) [0xc37820]
2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 11: (()+0x8182) 
[0x7f2b149e3182]
2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 12: 
(clone()+0x6d) [0x7f2b12d2847d]


In

http://149.202.164.239/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/24/

I see the same error as below.

In

http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/8/

it looks like the run was about to finish, just took a long time, and should be 
ignored as a false negative.

On 20/07/2015 14:52, Loic Dachary wrote:
> Hi,
> 
> I checked one of the timeouts (dead) at 
> http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/
> 
> http://149.202.164.239/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/10/config.yaml
> timed out because of
> 
> 
> sd.5 since back 2015-07-20 10:45:28.566308 front 2015-07-20 10:45:28.566308 
> (cutoff 2015-07-20 10:45:33.823074)
> 2015-07-20T10:47:13.921 INFO:tasks.ceph.osd.4.ovh164254.stderr:2015-07-20 
> 10:47:13.899770 7fb4be171700 -1 osd.4 655 heartbeat_check: no reply from 
> osd.5 since back 2015-07-20 10:45:30.719801 front 2015-07-20 10:45:30.719801 
> (cutoff 2015-07-20 10:45:33.899763)
> 2015-07-20T10:47:15.023 
> INFO:tasks.ceph.osd.1.ovh164253.stderr:os

Re: debian jessie gitbuilder repositories ?

2015-07-20 Thread Sage Weil
On Mon, 20 Jul 2015, Alexandre DERUMIER wrote:
> Hi,
> 
> The debian jessie gitbuilder has been OK for 2 weeks now,
> 
> http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-jessie-amd64-basic
> 
> 
> Is it possible to push packages to the repositories?
> 
> http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/


The builds are failing with this:

+ GNUPGHOME=/srv/gnupg reprepro --ask-passphrase -b 
../out/output/sha1/6ffb1c4ae43bcde9f5fde40dd97959399135ed86.tmp -C main 
--ignore=undefinedtarget --ignore=wrongdistribution include jessie 
out~/ceph_0.94.2-50-g6ffb1c4-1jessie_amd64.changes
Cannot find definition of distribution 'jessie'!
There have been errors!


I've seen this before, a long time ago, but I forget what the resolution is.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: start-stop-daemon radosgw

2015-07-20 Thread Pavan Rallabhandi
http://tracker.ceph.com/issues/12407

Thanks,
-Pavan.

-Original Message-
From: Sage Weil [mailto:s...@newdream.net]
Sent: Monday, July 20, 2015 7:19 PM
To: Pavan Rallabhandi
Cc: ceph-devel@vger.kernel.org; Srinivasula Maram; Yehuda Sadeh-Weinraub
Subject: Re: start-stop-daemon radosgw

On Mon, 20 Jul 2015, Pavan Rallabhandi wrote:
> [Resending in plain text format, apologies for the spam]
>
> Hi,
>
> This is with reference to the commit 
> https://github.com/ceph/ceph/commit/f30fa4a364602fb9412babf7319140eca4c64995 
> and tracker http://tracker.ceph.com/issues/11453
>
> On Hammer binaries, we are finding that this fix has regressed the ability to 
> run multiple RGW instances on a single machine. Meaning, with no user specified 
> under the 'client.radosgw.gateway' sections, and with the default user assumed 
> to be 'root', we are unable to get multiple RGW daemons to run on a client 
> machine.
>
> The start-stop-daemon complains that an instance of 'radosgw' is already 
> running; it starts the first daemon in the configuration and bails out of 
> starting further instances:
>
> 
>
> + start-stop-daemon --start -u root -x /usr/bin/radosgw -- -n 
> client.radosgw.gateway-3
> /usr/bin/radosgw already running.
>
> <\snip>
>
> However, by specifying a user in the relevant 'client.radosgw.gateway' 
> sections, one can get around this issue. I wanted to confirm whether this is 
> indeed a regression or whether the fix was expected to behave this way.

This was not intentional. Can you open a tracker ticket?

Thanks!
sage




PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: start-stop-daemon radosgw

2015-07-20 Thread Sage Weil
On Mon, 20 Jul 2015, Pavan Rallabhandi wrote:
> [Resending in plain text format, apologies for the spam]
> 
> Hi,
> 
> This is with reference to the commit 
> https://github.com/ceph/ceph/commit/f30fa4a364602fb9412babf7319140eca4c64995 
> and tracker http://tracker.ceph.com/issues/11453
> 
> On Hammer binaries, we are finding that this fix has regressed the ability to 
> run multiple RGW instances on a single machine. Meaning, with no user specified 
> under the 'client.radosgw.gateway' sections, and with the default user assumed 
> to be 'root', we are unable to get multiple RGW daemons to run on a client 
> machine.
> 
> The start-stop-daemon complains that an instance of 'radosgw' is already 
> running; it starts the first daemon in the configuration and bails out of 
> starting further instances:
> 
> 
> 
> + start-stop-daemon --start -u root -x /usr/bin/radosgw -- -n 
> client.radosgw.gateway-3
> /usr/bin/radosgw already running.
> 
> <\snip>
> 
> However, by specifying a user in the relevant 'client.radosgw.gateway' 
> sections, one can get around this issue. I wanted to confirm whether this is 
> indeed a regression or whether the fix was expected to behave this way.

This was not intentional. Can you open a tracker ticket?

Thanks!
sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Documentation] Hardware recommandation : RAM and PGLog

2015-07-20 Thread Sage Weil
On Sun, 19 Jul 2015, David Casier AEVOO wrote:
> Hi,
> I have a question about PGLog and RAM consumption.
> 
> In the documentation, we read "OSDs do not require as much RAM for regular
> operations (e.g., 500MB of RAM per daemon instance); however, during recovery
> they need significantly more RAM (e.g., ~1GB per 1TB of storage per daemon)"
> 
> But in fact, all pg logs are read at the start of the ceph-osd daemon and put
> into RAM ( pg->read_state(store, bl); )
> 
> Is this normal behavior, or do I have a defect in my environment?

There are two tunables that control how many pg log entries we keep 
around.  When the PG is healthy, we keep ~1000, and when the PG is 
degraded, we keep more, to expand the time window over which a recovering 
OSD will be able to do regular log-based recovery instead of a more 
expensive backfill.  This is one source of additional memory.

Others are the missing sets (lists of missing/degraded objects) and 
messages/data/state associated with objects that are being 
recovered/copied.

Note that the numbers in the documentation are pretty rough rules of 
thumb.  At some point it would be great to build a model for how much RAM 
the osd consumes as a function of the various configurables (pg log size, 
pg count, avg object size, etc.).

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rados/thrash on OpenStack

2015-07-20 Thread Loic Dachary
Hi,

I checked one of the timeouts (dead) at 
http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/

149.202.164.239/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/10/config.yaml
timed out because of


sd.5 since back 2015-07-20 10:45:28.566308 front 2015-07-20 10:45:28.566308 
(cutoff 2015-07-20 10:45:33.823074)
2015-07-20T10:47:13.921 INFO:tasks.ceph.osd.4.ovh164254.stderr:2015-07-20 
10:47:13.899770 7fb4be171700 -1 osd.4 655 heartbeat_check: no reply from osd.5 
since back 2015-07-20 10:45:30.719801 front 2015-07-20 10:45:30.719801 (cutoff 
2015-07-20 10:45:33.899763)
2015-07-20T10:47:15.023 
INFO:tasks.ceph.osd.1.ovh164253.stderr:osd/ReplicatedPG.cc: In function 
'virtual void ReplicatedPG::op_applied(const eversion_t&)' thread 7f92f0244700 
time 2015-07-20 10:47:14.998470
2015-07-20T10:47:15.024 
INFO:tasks.ceph.osd.1.ovh164253.stderr:osd/ReplicatedPG.cc: 7311: FAILED 
assert(applied_version <= info.last_update)
2015-07-20T10:47:15.025 INFO:tasks.ceph.osd.1.ovh164253.stderr: ceph version 
9.0.2-799-gba9c2ae (ba9c2ae4bffd3fd7b26a2e0ce843913b77940b8a)
2015-07-20T10:47:15.025 INFO:tasks.ceph.osd.1.ovh164253.stderr: 1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0xc45d1b]
2015-07-20T10:47:15.025 INFO:tasks.ceph.osd.1.ovh164253.stderr: 2: 
(ReplicatedPG::op_applied(eversion_t const&)+0x6dc) [0x8741ac]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 3: 
(ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0xd0) 
[0xa5cfe0]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 4: 
(Context::complete(int)+0x9) [0x6f4649]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 5: 
(ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8dec54]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 6: 
(Context::complete(int)+0x9) [0x6f4649]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 7: (void 
finish_contexts(CephContext*, std::list >&, int)+0x94) [0x7351d4]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 8: 
(C_ContextsBase::complete(int)+0x9) [0x6f4e89]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 9: 
(Finisher::finisher_thread_entry()+0x158) [0xb6f2b8]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 10: (()+0x8182) 
[0x7f92ff4e7182]
2015-07-20T10:47:15.026 INFO:tasks.ceph.osd.1.ovh164253.stderr: 11: 
(clone()+0x6d) [0x7f92fd82c47d]
2015-07-20T10:47:15.027 INFO:tasks.ceph.osd.1.ovh164253.stderr: NOTE: a copy of 
the executable, or `objdump -rdS ` is needed to interpret this.
2015-07-20T10:47:15.038 INFO:tasks.ceph.osd.1.ovh164253.stderr:2015-07-20 
10:47:15.005862 7f92f0244700 -1 osd/ReplicatedPG.cc: In function 'virtual void 
ReplicatedPG::op_applied(const eversion_t&)' thread 7f92f0244700 time 
2015-07-20 10:47:14.998470
2015-07-20T10:47:15.039 
INFO:tasks.ceph.osd.1.ovh164253.stderr:osd/ReplicatedPG.cc: 7311: FAILED 
assert(applied_version <= info.last_update)
2015-07-20T10:47:15.039 INFO:tasks.ceph.osd.1.ovh164253.stderr:
2015-07-20T10:47:15.039 INFO:tasks.ceph.osd.1.ovh164253.stderr: ceph version 
9.0.2-799-gba9c2ae (ba9c2ae4bffd3fd7b26a2e0ce843913b77940b8a)
2015-07-20T10:47:15.039 INFO:tasks.ceph.osd.1.ovh164253.stderr: 1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0xc45d1b]
2015-07-20T10:47:15.039 INFO:tasks.ceph.osd.1.ovh164253.stderr: 2: 
(ReplicatedPG::op_applied(eversion_t const&)+0x6dc) [0x8741ac]
2015-07-20T10:47:15.039 INFO:tasks.ceph.osd.1.ovh164253.stderr: 3: 
(ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0xd0) 
[0xa5cfe0]
2015-07-20T10:47:15.039 INFO:tasks.ceph.osd.1.ovh164253.stderr: 4: 
(Context::complete(int)+0x9) [0x6f4649]
2015-07-20T10:47:15.039 INFO:tasks.ceph.osd.1.ovh164253.stderr: 5: 
(ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8dec54]
2015-07-20T10:47:15.040 INFO:tasks.ceph.osd.1.ovh164253.stderr: 6: 
(Context::complete(int)+0x9) [0x6f4649]
2015-07-20T10:47:15.040 INFO:tasks.ceph.osd.1.ovh164253.stderr: 7: (void 
finish_contexts(CephContext*, std::list >&, int)+0x94) [0x7351d4]
2015-07-20T10:47:15.040 INFO:tasks.ceph.osd.1.ovh164253.stderr: 8: 
(C_ContextsBase::complete(int)+0x9) [0x6f4e89]
2015-07-20T10:47:15.040 INFO:tasks.ceph.osd.1.ovh164253.stderr: 9: 
(Finisher::finisher_thread_entry()+0x158) [0xb6f2b8]
2015-07-20T10:47:15.040 INFO:tasks.ceph.osd.1.ovh164253.stderr: 10: (()+0x8182) 
[0x7f92ff4e7182]
2015-07-20T10:47:15.040 INFO:tasks.ceph.osd.1.ovh164253.stderr: 11: 
(clone()+0x6d) [0x7f92fd82c47d]
2015-07-20T10:47:15.040 INFO:tasks.ceph.osd.1.ovh164253.stderr: NOTE: a copy of 
the executable, or `objdump -rdS ` is needed to interpret this.
2015-07-20T10:47:15.041 INFO:tasks.ceph.osd.1.ovh164253.stderr:
2015-07-20T10:47:15.212 INFO:tasks.ceph.osd.1.ovh164253.stderr:terminate called 
after throwing an instance of 'ceph::FailedAssertion'
2015-07-2

Re: pulpito slowness

2015-07-20 Thread Alfredo Deza


- Original Message -
> From: "Loic Dachary" 
> To: "Alfredo Deza" 
> Cc: "Ceph Development" 
> Sent: Sunday, July 19, 2015 12:56:12 PM
> Subject: pulpito slowness
> 
> Hi Alfredo,
> 
> After installing pulpito and running it from source with:
> 
> virtualenv ./virtualenv
> source ./virtualenv/bin/activate
> pip install -r requirements.txt
> python run.py &
> 
> I ran a rados suite with 40 workers and 218 jobs. All is well except for some
> slowness from pulpito that I don't quite understand. It takes 9 seconds to
> load, although the load average of the machine is low, the CPUs are not all
> busy, and there is plenty of free RAM.

There are pieces of the setup that might be causing this. Pulpito on its own 
doesn't do much; it is stateless and just serves HTML.

I would look into paddles (pulpito feeds from it) and see how that is doing. 
Ideally, paddles would be set up with PostgreSQL as well; I remember that at 
some point the queries in paddles became very complex and some investigation 
was done to improve their speed.



> 
> ubuntu@teuthology:~$ curl
> http://localhost:8081/ubuntu-2015-07-19_15:57:13-rados-hammer---basic-openstack/
> > /dev/null  % Total% Received % Xferd  Average Speed   TimeTime
> Time  Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100  391k  100  391k0 0  42774  0  0:00:09  0:00:09 --:--:--
> 96305
> 
> Do you have an idea of the reason for this slowness ?
> 
> Cheers
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Teuthology] Upgrade hammer on ubuntu : all passed

2015-07-20 Thread David Casier AEVOO

Hi all,
Good news for the hammer upgrade on Ubuntu:
http://ceph.aevoo.fr:8081/ubuntu-2015-07-19_05:44:18-upgrade:hammer-hammer---basic-openstack/
All jobs passed.

David
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


start-stop-daemon radosgw

2015-07-20 Thread Pavan Rallabhandi
[Resending in plain text format, apologies for the spam]

Hi,

This is with reference to the commit 
https://github.com/ceph/ceph/commit/f30fa4a364602fb9412babf7319140eca4c64995 
and tracker http://tracker.ceph.com/issues/11453

On Hammer binaries, we are finding that this fix has regressed the ability to 
run multiple RGW instances on a single machine. Meaning, with no user specified 
under the 'client.radosgw.gateway' sections, and with the default user assumed 
to be 'root', we are unable to get multiple RGW daemons to run on a client machine.

The start-stop-daemon complains that an instance of 'radosgw' is already 
running; it starts the first daemon in the configuration and bails out of 
starting further instances:



+ start-stop-daemon --start -u root -x /usr/bin/radosgw -- -n 
client.radosgw.gateway-3
/usr/bin/radosgw already running.

<\snip>

However, by specifying a user in the relevant 'client.radosgw.gateway' 
sections, one can get around this issue. I wanted to confirm whether this is 
indeed a regression or whether the fix was expected to behave this way.

Thanks,
-Pavan.




PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


The design of the eviction improvement

2015-07-20 Thread Wang, Zhiqiang
Hi all,

This is a follow-up of one of the CDS sessions at 
http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
We discussed the drawbacks of the current eviction algorithm and several ways 
to improve it. It seems like an LRU variant is the right way to go. I came up 
with some design points after the CDS, and want to discuss them with you. It is 
an approximate 2Q algorithm, combining some benefits of the clock algorithm, 
similar to what the Linux kernel does for the page cache.

# Design points:

## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementations in the current code have a max_size, 
which limits the max number of elements in the list. This mostly looks like an 
MRU, though their names imply they are LRUs. Since the object size may vary in 
a PG, it's not possible to calculate the total number of objects which the 
cache tier can hold ahead of time. We need a new LRU implementation with no 
limit on the size.
- Two lists for each PG: active and inactive
Objects are first put into the inactive list when they are accessed, and moved 
between these two lists based on some criteria.
Object flags: active, referenced, unevictable, dirty.
- When an object is accessed (a rough sketch of this logic follows at the end 
of this section):
1) If it's not in either of the lists, it's put on the top of the inactive list.
2) If it's in the inactive list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set, the referenced 
flag is cleared, and it's removed from the inactive list, and put on top of the 
active list.
4) If it's in the active list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set, it's moved to 
the top of the active list.
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are 
removed from the inactive list.
2) If the number of objects in the inactive list becomes low, some of the 
objects at the bottom of the active list are moved to the inactive list. Those 
objects which have the referenced flag set are given one more chance in the 
active list: they are moved to the top of the active list with the referenced 
flag cleared. Those which don't have the referenced flag set are moved to the 
inactive list with the referenced flag set, so that they can be quickly 
promoted to the active list when necessary.
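
To make the list handling concrete, below is a rough, self-contained C++ sketch 
of the two lists and the rules above. To be clear, none of this is existing 
Ceph code: PGCacheLists, Entry, on_access, pick_evict_candidate and 
refill_inactive are made-up names, objects are keyed by a plain string instead 
of hobject_t, and the "becomes low" threshold is an arbitrary ratio.

#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

class PGCacheLists {
  struct Entry {
    explicit Entry(std::string o) : oid(std::move(o)) {}
    std::string oid;
    bool active = false;       // which list the entry is currently on
    bool referenced = false;
    bool dirty = false;        // used once flush is combined with eviction
    bool unevictable = false;
  };
  // front() of each std::list is the "top" (most recently touched) end.
  std::list<Entry> active_, inactive_;
  // Index from object id to its position in whichever list currently holds it.
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;

  void move_to_front(std::list<Entry>& from, std::list<Entry>& to,
                     std::list<Entry>::iterator it, bool now_active) {
    to.splice(to.begin(), from, it);   // O(1); the iterator stays valid
    it->active = now_active;
  }

public:
  // Rules 1-5 from "When an object is accessed" above.
  void on_access(const std::string& oid) {
    auto p = index_.find(oid);
    if (p == index_.end()) {                           // rule 1
      inactive_.push_front(Entry(oid));
      index_[oid] = inactive_.begin();
      return;
    }
    auto it = p->second;
    if (!it->active) {
      if (!it->referenced) {                           // rule 2
        it->referenced = true;
        move_to_front(inactive_, inactive_, it, false);
      } else {                                         // rule 3
        it->referenced = false;
        move_to_front(inactive_, active_, it, true);
      }
    } else {                                           // rules 4 and 5
      it->referenced = true;
      move_to_front(active_, active_, it, true);
    }
  }

  // "When selecting objects to evict", rule 1: take a victim from the bottom
  // of the inactive list, topping it up from the active list (rule 2) when it
  // gets low.
  bool pick_evict_candidate(std::string* oid) {
    if (inactive_.empty() || inactive_.size() < active_.size() / 4)
      refill_inactive(active_.size() / 8 + 1);         // "low" threshold is arbitrary
    for (auto rit = inactive_.rbegin(); rit != inactive_.rend(); ++rit) {
      if (rit->unevictable)
        continue;
      *oid = rit->oid;
      index_.erase(rit->oid);
      inactive_.erase(std::next(rit).base());          // erases the element *rit
      return true;
    }
    return false;
  }

private:
  // Rule 2: demote from the bottom of the active list. Referenced objects get
  // one more round at the top of the active list; the others move to the
  // inactive list with the referenced flag set, so a single access promotes
  // them straight back to the active list.
  void refill_inactive(std::size_t n) {
    for (std::size_t i = 0; i < n && !active_.empty(); ++i) {
      auto it = std::prev(active_.end());
      if (it->referenced) {
        it->referenced = false;
        move_to_front(active_, active_, it, true);
      } else {
        it->referenced = true;
        move_to_front(active_, inactive_, it, false);
      }
    }
  }
};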

## Combine flush with eviction
- When evicting an object, if it's dirty, it's flushed first and evicted once 
the flush completes. If it's not dirty, it's evicted directly (see the sketch 
at the end of this section).
- This means that we won't have separate activities and won't set different 
ratios for flush and evict. Is there a need to do so?
- Number of objects to evict at a time. 'evict_effort' acts as the priority, 
which is used to calculate the number of objects to evict.
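
A similarly rough sketch of the combined flush + evict path. Again, CacheOps, 
agent_evict_some, batch_max and the evict_effort-to-count mapping are invented 
for illustration, not existing interfaces, and the flush is shown as 
synchronous only to keep the example short; a real implementation would evict 
from the flush completion callback.

#include <algorithm>
#include <cstddef>
#include <string>

struct CacheOps {                  // stand-in for the real tiering machinery
  virtual bool is_dirty(const std::string& oid) = 0;
  virtual void flush_object(const std::string& oid) = 0;  // write dirty data back to the base tier
  virtual void evict_object(const std::string& oid) = 0;  // drop the object from the cache tier
  virtual ~CacheOps() = default;
};

// Evict up to a number of objects derived from evict_effort (0.0 - 1.0).
// Dirty objects are flushed first and then evicted; clean objects are evicted
// directly, so there is no separate flush pass or flush ratio.
template <typename Lists>          // e.g. the PGCacheLists sketch above
void agent_evict_some(Lists& lists, CacheOps& ops, double evict_effort,
                      std::size_t batch_max = 64) {
  const std::size_t todo = std::max<std::size_t>(
      1, static_cast<std::size_t>(evict_effort * batch_max));
  std::string oid;
  for (std::size_t i = 0; i < todo; ++i) {
    if (!lists.pick_evict_candidate(&oid))
      break;                       // nothing evictable left
    if (ops.is_dirty(oid))
      ops.flush_object(oid);       // flush, then evict
    ops.evict_object(oid);
  }
}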

## LRU lists Snapshotting
- The two lists are snapshotted and persisted periodically (a toy sketch of 
this follows below).
- Only one copy needs to be saved. The old copy is removed when persisting the 
lists. The saved lists are used to restore the LRU lists when the OSD reboots.
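
To illustrate the single-copy snapshot idea, here is a toy, file-based sketch 
(again invented for illustration; a real implementation would presumably dump 
the per-PG lists into a meta/omap object inside an ObjectStore transaction 
rather than into a file). Restoring on reboot would just read this back and 
rebuild the two lists in order.

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Persist a snapshot of the two lists atomically: write a fresh copy, then
// rename it over the old one, so exactly one complete copy exists at any time
// and can be read back to rebuild the lists when the OSD restarts.
inline bool persist_lru_snapshot(const std::vector<std::string>& active_oids,
                                 const std::vector<std::string>& inactive_oids,
                                 const std::string& path) {
  const std::string tmp = path + ".new";
  {
    std::ofstream out(tmp, std::ios::trunc);
    if (!out)
      return false;
    out << "active " << active_oids.size() << "\n";
    for (const auto& o : active_oids) out << o << "\n";
    out << "inactive " << inactive_oids.size() << "\n";
    for (const auto& o : inactive_oids) out << o << "\n";
  }
  return std::rename(tmp.c_str(), path.c_str()) == 0;   // replaces the old copy
}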

Any comments/feedback are welcome.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


debian jessie gitbuilder repositories ?

2015-07-20 Thread Alexandre DERUMIER
Hi,

The debian jessie gitbuilder has been OK for 2 weeks now,

http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-jessie-amd64-basic


Is it possible to push packages to the repositories?

http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/

?

Alexandre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html