Re: teuthology : ulimit: error

2013-08-09 Thread Dan Mick
IIRC we had to adjust settings in /etc/security to allow ulimit 
adjustment of at least core:


sed -i 's/^#\*.*soft.*core.*0/\* soft core unlimited/g' /etc/security/limits.conf


or something like that.  That seems to apply to centos/fedora/redhat 
systems.
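
For reference, the uncommented line that substitution is meant to leave behind in
/etc/security/limits.conf looks roughly like this (the exact whitespace between the
fields does not matter):

  *       soft    core    unlimited

After logging back in on the target, running 'ulimit -c' should then report
'unlimited'.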



On 08/08/2013 02:52 PM, Loic Dachary wrote:

Hi,

Trying to use Ubuntu precise virtual machines as teuthology targets ( making 
sure they have 2GB of RAM because ceph-test-dbg will not even install with 1GB 
of RAM ;-) and installing the key with

wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' 
| sudo apt-key add -

teuthology runs with

./virtualenv/bin/teuthology job.yaml

where job.yaml is

check-locks: false
roles:
- - mon.a
  - mon.c
  - osd.0
- - mon.b
  - osd.1
  - client.0
tasks:
- install:
    project: ceph
    branch: wip-5510
- ceph:
    fs: ext4
- workunit:
    clients:
      all:
      - filestore/filestore.sh
targets:
ubuntu@91.121.254.229: ssh-rsa 
B3NzaC1yc2EDAQABAAABAQDOSit20EyZ2AKCvk2tnMsdQ6LmVRutvBmZb0awV9Z2EduJa0fYPrReYRb9ZhGRq2PJe0zgpFPKr8s4gGay+tL0+dFkju5uyABqMGeITlJCifd+RhM0MCVllgIzekDwVb0n6ydTS8k7GVFyYv8ZC0TPgbfcDcEtSEgqJNRJ0o1Bh8swuTn+cipanNDRVK39tOqJdfptUxak+TD+5QY8CGFdXdEQYP7VsYJ+jQHw73O2xbuPgfv5Shbmt+qGWLToxFqKca3owMtkvFeONgYUdujgg9qr7Q9p0+HhCFCXB8v4N2I7NSbWNdpGqyJLdLqwJ70gEeNlOhm5F7IsXfVxTapB
ubuntu@91.121.254.237: ssh-rsa 
B3NzaC1yc2EDAQABAAABAQCXVzhedORtmEmCeZJ4Ssg8wfqpYyH9W/Aa+j6CvPHSAkzZ48zXqVBATxm6S8sIIqfKkz1hWpNssx6uUotbm8k/ZatMddsd932+Di136l/HUhp6O8iIFze56kjWpyDpRPw2MM0V+OKmsiHZDfMj9ATt6ChgXfRsm23MUlmnoFHvtfwvFBnvBjcZPN7qxMpFHDamZzACNvnis/OINJrud9VprOgjhZZ7mxcTbcVZaVgcTcnyC4b9d9PRrMG2aCv0BO1eb/qnlmSogQPfoKEORJcwaUcLgabww+Taa9hJSZ9l8yEQamj+XIgr6yzGKgCvlG4lTdHM2tQdpgATZvR7/pBz

and produces the following output

INFO:teuthology.orchestra.run.err:[91.121.254.229]: 
/home/ubuntu/cephtest/lo1308082328/adjust-ulimits: 4: ulimit: error setting 
limit (Operation not permitted)

and the full output is at

http://pastealacon.com/32957

as if

/home/ubuntu/cephtest/lo1308082328/adjust-ulimits ceph-coverage 
/home/ubuntu/cephtest/lo1308082328/archive/coverage monmaptool --create 
--clobber --add a 91.121.254.229:6789 --add c 91.121.254.229:6790 --add b 
91.121.254.237:6789 --print /home/ubuntu/cephtest/lo1308082328/monmap'

was run without a leading sudo. I tried running sudo adjust-ulimits echo foo 
manually on the target and it works.

I would very much appreciate a hint ;-)

Cheers


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: bug in /etc/init.d/ceph debian

2013-08-09 Thread Sage Weil
On Fri, 9 Aug 2013, James Harper wrote:
> > > But I think this still won't have the desired outcome if you have 2 OSD's.
> > > The possible situations if the resource is supposed to be running are:
> > > . Both running => all good, pacemaker will do nothing
> > > . Both stopped => all good, pacemaker will start the services
> > > . One stopped one running => not good, pacemaker won't make any effort
> > > to start services
> > 
> > If one daemon is stopped and one is running, returning 'not running' seems
> > ok to me, since 'start' at that point will do the right thing.
> 
> Maybe. If the stopped daemon is stopped because it fails to start then 
> pacemaker might get unhappy when subsequent starts also fail, and might even 
> get STONITHy.

This is sounding more like we're trying to fit a square peg in a round 
hole.  Generally speaking there is *never* any need for anything that 
resembles STONITH with Ceph; all of that is handled internally by Ceph 
itself.

I think the only real reason why you would want to use pacemaker here is 
if you just like it better than the normal startup scripts, or perhaps 
because you are using it to control where the standby mdss run.  So maybe 
we are barking up the wrong tree...

sage


> 
> > > . One in error, one running => not good. I'm not sure exactly what will
> > > happen but it won't be what you expect.
> > 
> > I think it's fine for this to be an error condition.
> 
> Again. If pacemaker sees the error it might start doing things you don't
> want.
> 
> Technically, for actual clustered resources, returning "not running" when 
> something is running is about the worst thing you can do because pacemaker 
> might then start up the resource on another node (eg start a VM on two nodes 
> at once, corrupting the fs). The way you'd set this up for ceph though is 
> just a cloned resource on each node so it wouldn't matter anyway.
> 
> > >
> > > The only solution I can see is to manage the services individually, in
> > > which case the init.d script with your patch + setting to 0 if running
> > > does the right thing anyway.
> > 
> > Yeah, managing individually is probably the most robust, but if it works
> > well enough in the generic configuration with no customization that is
> > good.
> 
> Actually it subsequently occurred to me that if I set them up individually 
> then my dependencies will break (eg start ceph before mounting ceph-fs) 
> because there are now different ceph instances per node.
> 
> > 
> > Anyway, I'm fine with whatever variation of your original or my patch you
> > think addresses this.  A comment block in the init-ceph script documenting
> > what the return codes mean (similar to the above) would be nice so that
> > it is clear to the next person who comes along.
> > 
> 
> I might post on the pacemaker list and see what the thoughts are there.
> 
> Maybe it would be better for me to just re-order the init.d scripts so ceph 
> starts in init.d and leave it at that...
> 
> James
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: bug in /etc/init.d/ceph debian

2013-08-09 Thread James Harper
> > But I think this still won't have the desired outcome if you have 2 OSD's.
> > The possible situations if the resource is supposed to be running are:
> > . Both running => all good, pacemaker will do nothing
> > . Both stopped => all good, pacemaker will start the services
> > . One stopped one running => not good, pacemaker won't make any effort
> > to start services
> 
> If one daemon is stopped and one is running, returning 'not running' seems
> ok to me, since 'start' at that point will do the right thing.

Maybe. If the stopped daemon is stopped because it fails to start then 
pacemaker might get unhappy when subsequent starts also fail, and might even 
get STONITHy.

> > . One in error, one running => not good. I'm not sure exactly what will
> > happen but it won't be what you expect.
> 
> I think it's fine for this to be an error condition.

Again. If pacemaker sees the error it might start doing things you don't want.

Technically, for actual clustered resources, returning "not running" when 
something is running is about the worst thing you can do because pacemaker 
might then start up the resource on another node (eg start a VM on two nodes at 
once, corrupting the fs). The way you'd set this up for ceph though is just a 
cloned resource on each node so it wouldn't matter anyway.
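
(For the record, a minimal sketch of the cloned-resource setup being described,
using the crm shell; the resource names here are placeholders:

  crm configure primitive p_ceph lsb:ceph op monitor interval=30s
  crm configure clone cl_ceph p_ceph

With a clone, each node only ever manages its own local instance of the init
script, which is why the "start it on another node" failure mode doesn't apply.)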

> >
> > The only solution I can see is to manage the services individually, in
> > which case the init.d script with your patch + setting to 0 if running
> > does the right thing anyway.
> 
> Yeah, managing individually is probably the most robust, but if it works
> well enough in the generic configuration with no customization that is
> good.

Actually it subsequently occurred to me that if I set them up individually then 
my dependencies will break (eg start ceph before mounting ceph-fs) because 
there are now different ceph instances per node.

> 
> Anyway, I'm fine with whatever variation of your original or my patch you
> think addresses this.  A comment block in the init-ceph script documenting
> what the return codes mean (similar to the above) would be nice so that
> it is clear to the next person who comes along.
> 

I might post on the pacemaker list and see what the thoughts are there.

Maybe it would be better for me to just re-order the init.d scripts so ceph 
starts in init.d and leave it at that...

James
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: still recovery issues with cuttlefish

2013-08-09 Thread Samuel Just
I think Stefan's problem is probably distinct from Mike's.

Stefan: Can you reproduce the problem with

debug osd = 20
debug filestore = 20
debug ms = 1
debug optracker = 20

on a few osds (including the restarted osd), and upload those osd logs
along with the ceph.log from before killing the osd until after the
cluster becomes clean again?
-Sam
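
For reference, a rough sketch of how these settings are usually applied; the osd id
below is a placeholder:

  # at runtime, for osds that are already up:
  ceph tell osd.0 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

  # and in ceph.conf under [osd], so a restarted osd comes back with the same levels:
  [osd]
      debug osd = 20
      debug filestore = 20
      debug ms = 1
      debug optracker = 20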

On Thu, Aug 8, 2013 at 11:13 AM, Stefan Priebe  wrote:
> Hi Mike,
>
> Am 08.08.2013 16:05, schrieb Mike Dawson:
>
>> Stefan,
>>
>> I see the same behavior and I theorize it is linked to an issue detailed
>> in another thread [0]. Do your VM guests ever hang while your cluster is
>> HEALTH_OK like described in that other thread?
>>
>> [0] http://comments.gmane.org/gmane.comp.file-systems.ceph.user/2982
>
>
> mhm no can't see that. All our VMs are working fine even under high load
> while ceph is OK.
>
>
>> A few observations:
>>
>> - The VMs that hang do lots of writes (video surveillance).
>> - I use rbd and qemu. The problem exists in both qemu 1.4.x and 1.5.2.
>> - The problem exists with or without joshd's qemu async flush patch.
>> - Windows VMs seem to be more vulnerable than Linux VMs.
>> - If I restart the qemu-system-x86_64 process, the guest will come back
>> to life.
>> - A partial workaround seems to be console input (NoVNC or 'virsh
>> screenshot'), but restarting qemu-system-x86_64 works better.
>> - The issue of VMs hanging seems worse with RBD writeback cache enabled
>> - I typically run virtio, but I believe I've seen it with e1000, too.
>> - VM guests hang at different times, not all at once on a host (or
>> across all hosts).
>> - I co-mingle VM guests on servers that host ceph OSDs.
>>
>>
>>
>> Oliver,
>>
>> If your cluster has to recover/backfill, do your guest VMs hang with
>> more frequency than under normal HEALTH_OK conditions, even if you
>> prioritize client IO as Sam wrote below?
>>
>>
>> Sam,
>>
>> Turning down all the settings you mentioned certainly does slow the
>> recover/backfill process, but it doesn't prevent the VM guests backed by
>> RBD volumes from hanging. In fact, I often try to prioritize
>> recovery/backfill because my guests tend to hang until I get back to
>> HEALTH_OK. Given this apparent bug, completing recovery/backfill quicker
>> leads to less total outage, it seems.
>>
>>
>> Josh,
>>
>> How can I help you investigate if RBD is the common source of both of
>> these issues?
>>
>>
>> Thanks,
>> Mike Dawson
>>
>>
>> On 8/2/2013 2:46 PM, Stefan Priebe wrote:
>>>
>>> Hi,
>>>
>>>  osd recovery max active = 1
>>>  osd max backfills = 1
>>>  osd recovery op priority = 5
>>>
>>> still no difference...
>>>
>>> Stefan
>>>
>>> Am 02.08.2013 20:21, schrieb Samuel Just:

 Also, you have osd_recovery_op_priority at 50.  That is close to the
 priority of client IO.  You want it below 10 (defaults to 10), perhaps
 at 1.  You can also adjust down osd_recovery_max_active.
 -Sam

 On Fri, Aug 2, 2013 at 11:16 AM, Stefan Priebe 
 wrote:
>
> I already tried both values this makes no difference. The drives are
> not the
> bottleneck.
>
> Am 02.08.2013 19:35, schrieb Samuel Just:
>
>> You might try turning osd_max_backfills to 2 or 1.
>> -Sam
>>
>> On Fri, Aug 2, 2013 at 12:44 AM, Stefan Priebe 
>> wrote:
>>>
>>>
>>> Am 01.08.2013 23:23, schrieb Samuel Just:
>>>> Can you dump your osd settings?
>>>
 sudo ceph --admin-daemon ceph-osd..asok config show
>>>
>>>
>>>
>>> Sure.
>>>
>>>
>>>
>>> { "name": "osd.0",
>>> "cluster": "ceph",
>>> "none": "0\/5",
>>> "lockdep": "0\/0",
>>> "context": "0\/0",
>>> "crush": "0\/0",
>>> "mds": "0\/0",
>>> "mds_balancer": "0\/0",
>>> "mds_locker": "0\/0",
>>> "mds_log": "0\/0",
>>> "mds_log_expire": "0\/0",
>>> "mds_migrator": "0\/0",
>>> "buffer": "0\/0",
>>> "timer": "0\/0",
>>> "filer": "0\/0",
>>> "striper": "0\/1",
>>> "objecter": "0\/0",
>>> "rados": "0\/0",
>>> "rbd": "0\/0",
>>> "journaler": "0\/0",
>>> "objectcacher": "0\/0",
>>> "client": "0\/0",
>>> "osd": "0\/0",
>>> "optracker": "0\/0",
>>> "objclass": "0\/0",
>>> "filestore": "0\/0",
>>> "journal": "0\/0",
>>> "ms": "0\/0",
>>> "mon": "0\/0",
>>> "monc": "0\/0",
>>> "paxos": "0\/0",
>>> "tp": "0\/0",
>>> "auth": "0\/0",
>>> "crypto": "1\/5",
>>> "finisher": "0\/0",
>>> "heartbeatmap": "0\/0",
>>> "perfcounter": "0\/0",
>>> "rgw": "0\/0",
>>> "hadoop": "0\/0",
>>> "javaclient": "1\/5",
>>> "asok": "0\/0",
>>> "throttle": "0\/0",
>>> "host": "cloud1-1268",
>>> "fsid": "----",
>>> "public_addr": "10.255.0.90:0\

Re: [PATCH 0/2] Cleanup invalidate page

2013-08-09 Thread Milosz Tanski
On Fri, Aug 9, 2013 at 3:44 PM, Sage Weil  wrote:
> On Fri, 9 Aug 2013, Milosz Tanski wrote:
>> Sage,
>>
>> Great. Is there some automated testing system that looks for
>> regressions in cephfs that I can be watching for?
>
> Yep, you can join the ceph...@ceph.com email list and watch for the
> kcephfs suite results (see http://ceph.com/resources/mailing-list-irc/).

I'll brace my self for the spam.

>
> BTW I made a few tweaks to the second patch due to a conflict (added
> handling for the length arg to invalidatepage).

Thanks and sorry for the oversight.

Best,
 -Milosz

>
> Thanks!
> sage
>
>>
>> - Milosz
>>
>> On Fri, Aug 9, 2013 at 1:44 PM, Sage Weil  wrote:
>> > Hi Milosz,
>> >
>> > I pulled both these into the testing branch.  Thanks!
>> >
>> > sage
>> >
>> > On Fri, 9 Aug 2013, Milosz Tanski wrote:
>> >
>> >> Currently ceph_invalidatepage is overly eager with its checks, which are
>> >> moot. The second change cleans up the case where offset is non zero.
>> >>
>> >> Please pull from:
>> >>   https://bitbucket.org/adfin/linux-fs.git wip-invalidatepage
>> >>
>> >> This simple patchset came from the changes I made while working on fscache
>> >> support for cephfs. Per Sage's request I split this up into smaller bites for
>> >> testing and review.
>> >>
>> >> Milosz Tanski (2):
>> >>   ceph: Remove bogus check in invalidatepage
>> >>   ceph: cleanup the logic in ceph_invalidatepage
>> >>
>> >>  fs/ceph/addr.c | 33 +++--
>> >>  1 file changed, 15 insertions(+), 18 deletions(-)
>> >>
>> >> --
>> >> 1.8.1.2
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" 
>> >> in
>> >> the body of a message to majord...@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Cleanup invalidate page

2013-08-09 Thread Sage Weil
On Fri, 9 Aug 2013, Milosz Tanski wrote:
> Sage,
> 
> Great. Is there some automated testing system that looks for
> regressions in cephfs that I can be watching for?

Yep, you can join the ceph...@ceph.com email list and watch for the 
kcephfs suite results (see http://ceph.com/resources/mailing-list-irc/).

BTW I made a few tweaks to the second patch due to a conflict (added 
handling for the length arg to invalidatepage).

Thanks!
sage

> 
> - Milosz
> 
> On Fri, Aug 9, 2013 at 1:44 PM, Sage Weil  wrote:
> > Hi Milosz,
> >
> > I pulled both these into the testing branch.  Thanks!
> >
> > sage
> >
> > On Fri, 9 Aug 2013, Milosz Tanski wrote:
> >
> >> Currently ceph_invalidatepage is overly eager with its checks, which are
> >> moot. The second change cleans up the case where offset is non zero.
> >>
> >> Please pull from:
> >>   https://bitbucket.org/adfin/linux-fs.git wip-invalidatepage
> >>
> >> This simple patchset came from the changes I made while working on fscache
> >> support for cephfs. Per Sage's request I split this up into smaller bites for
> >> testing and review.
> >>
> >> Milosz Tanski (2):
> >>   ceph: Remove bogus check in invalidatepage
> >>   ceph: cleanup the logic in ceph_invalidatepage
> >>
> >>  fs/ceph/addr.c | 33 +++--
> >>  1 file changed, 15 insertions(+), 18 deletions(-)
> >>
> >> --
> >> 1.8.1.2
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >> the body of a message to majord...@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Cleanup invalidate page

2013-08-09 Thread Milosz Tanski
Sage,

Great. Is there some automated testing system that looks for
regressions in cephfs that I can be watching for?

- Milosz

On Fri, Aug 9, 2013 at 1:44 PM, Sage Weil  wrote:
> Hi Milosz,
>
> I pulled both these into the testing branch.  Thanks!
>
> sage
>
> On Fri, 9 Aug 2013, Milosz Tanski wrote:
>
>> Currently ceph_invalidatepage is overly eager with its checks, which are
>> moot. The second change cleans up the case where offset is non zero.
>>
>> Please pull from:
>>   https://bitbucket.org/adfin/linux-fs.git wip-invalidatepage
>>
>> This simple patchset came from the changes I made while working on fscache
>> support for cephfs. Per Sage's request I split this up into smaller bites for
>> testing and review.
>>
>> Milosz Tanski (2):
>>   ceph: Remove bogus check in invalidatepage
>>   ceph: cleanup the logic in ceph_invalidatepage
>>
>>  fs/ceph/addr.c | 33 +++--
>>  1 file changed, 15 insertions(+), 18 deletions(-)
>>
>> --
>> 1.8.1.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Blueprint: inline data support (step 2)

2013-08-09 Thread Sage Weil
Hi Li,

Thanks for discussing this at the summit!  As I mentioned, I think email 
will be the easiest way to detail my suggestion for handling the shared 
writer or read/write case.  The notes from the summit are at

  http://pad.ceph.com/p/mds-inline-data

For the single-writer case, it is simple enough for the client to simply 
dirty the buffer with the inline data and write it out with everything 
else.  When it flushes the cap back to the MDS there will be some marker 
(inline_version = 0?) indicating that the data is no longer inlined.

For the multi-writer case:

We normally do reads and writes synchronously to the OSD for simplicity.  
Everything gets ordered there at the object.  I think we can do the same 
for inline data: if there are shared writers, we uninline the data and 
fall back to storing the data in the usual way.

Each writer will have a copy of the *initial* inline data, issued by the 
MDS when they got the capability allowing them to write (or read).

On the *first* read or write operation, the client will first send an 
operation to the object that looks like

  ObjectOperation m;
  m.create(true);   // exclusive create; fails if object exists
  m.write_full(initial_inline_data);
  objecter->mutate(...);

The first client whose op reaches the osd will effectively un-inline the 
data; any others will be no-ops.  This will be immediately followed by 
the actual read or write operation that they are trying to do.

As long as the inline_data size is smaller than the file layout stripe 
unit, this will always be the first object.

When the caps are released to the MDS, if *any* of the clients indicate 
that they uninlined the object, it is uninlined.  (Some clients may not 
have done any IO.)  If a client fails, we need to make the recovery path 
see if the object exists and, if so, drop the inline data.

The one wrinkle I see in this is that the m.create(true) call above isn't 
quite right; the first object will often exist because of the backtrace 
information that the MDS is maintaining (for NFS and future fsck).  We 
need to replace that with some explicit flag on the object that the data 
is inlined, which means some tricky updates and an m.cmpxattr() call.  
Alternatively (and more simply), we can just check if the object has size 
0.  There isn't a rados op that lets us do that right now, but it is 
pretty simple to add.  cmpsize() or similar.

What do you think?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Cleanup invalidate page

2013-08-09 Thread Sage Weil
Hi Milosz,

I pulled both these into the testing branch.  Thanks!

sage

On Fri, 9 Aug 2013, Milosz Tanski wrote:

> Currently ceph_invalidatepage is overly eager with its checks, which are
> moot. The second change cleans up the case where offset is non zero.
> 
> Please pull from:
>   https://bitbucket.org/adfin/linux-fs.git wip-invalidatepage
> 
> This simple patchset came from the changes I made while working on fscache
> support for cephfs. Per Sage's request I split this up into smaller bites for
> testing and review.
> 
> Milosz Tanski (2):
>   ceph: Remove bogus check in invalidatepage
>   ceph: cleanup the logic in ceph_invalidatepage
> 
>  fs/ceph/addr.c | 33 +++--
>  1 file changed, 15 insertions(+), 18 deletions(-)
> 
> -- 
> 1.8.1.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed porting Ceph to RSockets

2013-08-09 Thread Kasper Dieter
Hi Matthew,

please have a look at: 
http://www.spinics.net/lists/linux-rdma/msg16710.html
http://wiki.ceph.com/01Planning/02Blueprints/Emperor/msgr%3A_implement_infiniband_support_via_rsockets

Maybe you should switch this discussion from ceph-user to the ceph-devel ML.

Kind Regards,
-Dieter

On Fri, Aug 09, 2013 at 09:11:07AM +0200, Matthew Anderson wrote:
> So I've had a chance to re-visit this since Bécholey Alexandre was kind
> enough to let me know how to compile Ceph with the RDMACM library
> (thank you again!).
>
> At this stage it compiles and runs but there appears to be a problem with
> calling rshutdown in Pipe as it seems to just wait forever for the pipe to
> close, which causes commands like 'ceph osd tree' to hang indefinitely
> after they work successfully. Debug MS is here -
> http://pastebin.com/WzMJNKZY
>
> I also tried RADOS bench but it appears to be doing something similar.
> Debug MS is here - http://pastebin.com/3aXbjzqS
>
> It seems like it's very close to working... I must be missing something
> small that's causing some grief. You can see the OSD coming up in the ceph
> monitor and the PG's all become active+clean. When shutting down the
> monitor I get the below, which shows it waiting for the pipes to close -
>
> 2013-08-09 15:08:31.339394 7f4643cfd700 20 accepter.accepter closing
> 2013-08-09 15:08:31.382075 7f4643cfd700 10 accepter.accepter stopping
> 2013-08-09 15:08:31.382115 7f464bd397c0 20 -- 172.16.0.1:6789/0 wait: stopped accepter thread
> 2013-08-09 15:08:31.382127 7f464bd397c0 20 -- 172.16.0.1:6789/0 wait: stopping reaper thread
> 2013-08-09 15:08:31.382146 7f4645500700 10 -- 172.16.0.1:6789/0 reaper_entry done
> 2013-08-09 15:08:31.382182 7f464bd397c0 20 -- 172.16.0.1:6789/0 wait: stopped reaper thread
> 2013-08-09 15:08:31.382194 7f464bd397c0 10 -- 172.16.0.1:6789/0 wait: closing pipes
> 2013-08-09 15:08:31.382200 7f464bd397c0 10 -- 172.16.0.1:6789/0 reaper
> 2013-08-09 15:08:31.382205 7f464bd397c0 10 -- 172.16.0.1:6789/0 reaper done
> 2013-08-09 15:08:31.382210 7f464bd397c0 10 -- 172.16.0.1:6789/0 wait: waiting for pipes 0x3014c80,0x3015180,0x3015400 to close
>
> The git repo has been updated if anyone has a few spare minutes to take a
> look - https://github.com/funkBuild/ceph-rsockets
>
> Thanks again
> -Matt
>
> On Thu, Jun 20, 2013 at 5:09 PM, Matthew Anderson
> <manderson8...@gmail.com> wrote:
>
>   Hi All,
>   I've had a few conversations on IRC about getting RDMA support into Ceph
>   and thought I would give it a quick attempt to hopefully spur some
>   interest. What I would like to accomplish is an RSockets only
>   implementation so I'm able to use Ceph, RBD and QEMU at full speed over
>   an Infiniband fabric.
>
>   What I've tried to do is port Pipe.cc and Acceptor.cc to rsockets by
>   replacing the regular socket calls with the rsocket equivalent.
>   Unfortunately it doesn't compile and I get an error of -

[PATCH 2/2] ceph: cleanup the logic in ceph_invalidatepage

2013-08-09 Thread Milosz Tanski
The invalidatepage code bails if it encounters a non-zero page offset. The
current logic that does is non-obvious with multiple if statements.

This should be logically and functionally equivalent.

Signed-off-by: Milosz Tanski 
---
 fs/ceph/addr.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index f1d6c60..341f998 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -150,6 +150,13 @@ static void ceph_invalidatepage(struct page *page, 
unsigned long offset)
struct ceph_snap_context *snapc = page_snap_context(page);
 
inode = page->mapping->host;
+   ci = ceph_inode(inode);
+
+   if (offset != 0) {
+   dout("%p invalidatepage %p idx %lu partial dirty page\n",
+inode, page, page->index);
+   return;
+   }
 
/*
 * We can get non-dirty pages here due to races between
@@ -159,21 +166,15 @@ static void ceph_invalidatepage(struct page *page, 
unsigned long offset)
if (!PageDirty(page))
pr_err("%p invalidatepage %p page not dirty\n", inode, page);
 
-   if (offset == 0)
-   ClearPageChecked(page);
+   ClearPageChecked(page);
 
-   ci = ceph_inode(inode);
-   if (offset == 0) {
-   dout("%p invalidatepage %p idx %lu full dirty page %lu\n",
-inode, page, page->index, offset);
-   ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
-   ceph_put_snap_context(snapc);
-   page->private = 0;
-   ClearPagePrivate(page);
-   } else {
-   dout("%p invalidatepage %p idx %lu partial dirty page\n",
-inode, page, page->index);
-   }
+   dout("%p invalidatepage %p idx %lu full dirty page %lu\n",
+inode, page, page->index, offset);
+
+   ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
+   ceph_put_snap_context(snapc);
+   page->private = 0;
+   ClearPagePrivate(page);
 }
 
 /* just a sanity check */
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] ceph: Remove bogus check in invalidatepage

2013-08-09 Thread Milosz Tanski
The early bug checks are moot because the VMA layer ensures those things.

1. It will not call invalidatepage unless PagePrivate (or PagePrivate2) is set.
2. It will not call invalidatepage without taking a PageLock first.
3. It guarantees that the inode page is mapped.

Signed-off-by: Milosz Tanski 
---
 fs/ceph/addr.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index afb2fc2..f1d6c60 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -149,10 +149,6 @@ static void ceph_invalidatepage(struct page *page, 
unsigned long offset)
struct ceph_inode_info *ci;
struct ceph_snap_context *snapc = page_snap_context(page);
 
-   BUG_ON(!PageLocked(page));
-   BUG_ON(!PagePrivate(page));
-   BUG_ON(!page->mapping);
-
inode = page->mapping->host;
 
/*
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Cleanup invalidate page

2013-08-09 Thread Milosz Tanski
Currently ceph_invalidatepage is overly eager with its checks, which are
moot. The second change cleans up the case where offset is non zero.

Please pull from:
  https://bitbucket.org/adfin/linux-fs.git wip-invalidatepage

This simple patchset came from the changes I made while working on fscache
support for cephfs. Per Sage's request I split this up into smaller bites for
testing and review.

Milosz Tanski (2):
  ceph: Remove bogus check in invalidatepage
  ceph: cleanup the logic in ceph_invalidatepage

 fs/ceph/addr.c | 33 +++--
 1 file changed, 15 insertions(+), 18 deletions(-)

-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] ceph: use fscache as a local presisent cache

2013-08-09 Thread Milosz Tanski
Sage,

I can spin some of this out into another patch; in the things I've
been sending I've been squashing the changes just because I've done so
many less-than-smart things to get to this point.

After reviewing this one more time and going from memory... I believe
the invalidatepage code is overly paranoid in some of its checking
at the beginning. Here are my notes inline.

146 static void ceph_invalidatepage(struct page *page, unsigned long offset)
147 {
148 struct inode *inode;
149 struct ceph_inode_info *ci;
150 struct ceph_snap_context *snapc = page_snap_context(page);
151
152 BUG_ON(!PageLocked(page));

This check is bunk because the VMA locks the page for us as it calls
this. The documentation in Documentation/filesystems/Locking makes the
guarantee.

153 BUG_ON(!PagePrivate(page));

invalidatepage is not going to be called unless there is a PagePrivate
flag on it (or PagePrivate2 which is baked into the VMA for fscache).
In the current code it has no effect; when I added fscache support this
line must go (it could be PagePrivate2 without PagePrivate). I'm using
Documentation/filesystems/vfs.txt as documentation.

154 BUG_ON(!page->mapping);

I also think this check is bunk. I doubt the VMA layer would call
ceph_invalidatepage without it being pre-mapped.

-

Now to address the code movement: I can also refactor that in a different
patch. I did it mainly so I could sandwich the fscache invalidate in
there. But taking the fscache change away, it should still be
functionally the same and simplify the logic a bit by just bailing
early, since all the code logically does is throw its hands up if we
get a truncate with a non-zero offset.

Cheers,
- Milosz

On Fri, Aug 9, 2013 at 12:16 AM, Sage Weil  wrote:
> Hi Milosz!
>
> I have a few comments below on invalidate_page:
>
> On Wed, 7 Aug 2013, Milosz Tanski wrote:
>> Adding support for fscache to the Ceph filesystem. This would bring it to on
>> par with some of the other network filesystems in Linux (like NFS, AFS, 
>> etc...)
>>
>> In order to mount the filesystem with fscache the 'fsc' mount option must be
>> passed.
>>
>> Signed-off-by: Milosz Tanski 
>> ---
>>  fs/ceph/Kconfig  |   9 ++
>>  fs/ceph/Makefile |   2 +
>>  fs/ceph/addr.c   |  70 +
>>  fs/ceph/cache.c  | 306 
>> +++
>>  fs/ceph/cache.h  | 117 +
>>  fs/ceph/caps.c   |  19 +++-
>>  fs/ceph/file.c   |  17 
>>  fs/ceph/inode.c  |  69 -
>>  fs/ceph/super.c  |  48 -
>>  fs/ceph/super.h  |  17 
>>  10 files changed, 646 insertions(+), 28 deletions(-)
>>  create mode 100644 fs/ceph/cache.c
>>  create mode 100644 fs/ceph/cache.h
>>
>> diff --git a/fs/ceph/Kconfig b/fs/ceph/Kconfig
>> index 49bc782..ac9a2ef 100644
>> --- a/fs/ceph/Kconfig
>> +++ b/fs/ceph/Kconfig
>> @@ -16,3 +16,12 @@ config CEPH_FS
>>
>> If unsure, say N.
>>
>> +if CEPH_FS
>> +config CEPH_FSCACHE
>> + bool "Enable Ceph client caching support"
>> + depends on CEPH_FS=m && FSCACHE || CEPH_FS=y && FSCACHE=y
>> + help
>> +   Choose Y here to enable persistent, read-only local
>> +   caching support for Ceph clients using FS-Cache
>> +
>> +endif
>> diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
>> index bd35212..0af0678 100644
>> --- a/fs/ceph/Makefile
>> +++ b/fs/ceph/Makefile
>> @@ -9,3 +9,5 @@ ceph-y := super.o inode.o dir.o file.o locks.o addr.o 
>> ioctl.o \
>>   mds_client.o mdsmap.o strings.o ceph_frag.o \
>>   debugfs.o
>>
>> +ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
>> +
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index afb2fc2..de6de0e 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -11,6 +11,7 @@
>>
>>  #include "super.h"
>>  #include "mds_client.h"
>> +#include "cache.h"
>>  #include 
>>
>>  /*
>> @@ -149,11 +150,23 @@ static void ceph_invalidatepage(struct page *page, 
>> unsigned long offset)
>>   struct ceph_inode_info *ci;
>>   struct ceph_snap_context *snapc = page_snap_context(page);
>>
>> - BUG_ON(!PageLocked(page));
>> - BUG_ON(!PagePrivate(page));
>
> Do these go away because of the fscache change?  Or were they incorrect to
> begin with?
>
>>   BUG_ON(!page->mapping);
>>
>>   inode = page->mapping->host;
>> + ci = ceph_inode(inode);
>> +
>> + if (offset != 0) {
>> + dout("%p invalidatepage %p idx %lu partial dirty page\n",
>> +  inode, page, page->index);
>> + return;
>> + }
>
> It would be nice to factor out the offset != 0 short circuit into a
> separate patch.  Under what circumstances does it actually happen?
>
>> +
>> + ceph_invalidate_fscache_page(inode, page);
>> +
>> + if (!PagePrivate(page))
>> + return;
>> +
>> + BUG_ON(!PageLocked(page));
>>
>>   /*
>>* We can get non-dirty pages here due to races between
>> @@ -163,31 +176,28 @@ static void ceph_invalidatepage(struct page 

Re: cephfs set_layout

2013-08-09 Thread Sage Weil
Hi Dieter,

On Fri, 9 Aug 2013, Kasper Dieter wrote:
> On Fri, Aug 09, 2013 at 03:06:37PM +0200, Yan, Zheng wrote:
> > On Fri, Aug 9, 2013 at 5:03 PM, Kasper Dieter
> >  wrote:
> > > OK,
> > > I found this nice page: http://ceph.com/docs/next/dev/file-striping/
> > > which explains "--stripe_unit --stripe_count --object_size"
> > >
> > > But still I'm not sure about
> > > (1) what is the equivalent command on cephfs to 'rbd create --order 16' ?
> > 
> > you can get/set file layout through virtual xattr. for example:
> > 
> > # getfattr -d -m - targetfile
> > 
> > > (2) how to use those parameters to achieve different optimized layouts on 
> > > CephFS directories
> > > (e.g. for streaming, small sequential IOs, small random IOs)
> > >
> > 
> > ceph directories are not implemented as files. you can't optimize ceph
> > directories by this way.
> 
> In my view 'Directories' in CephFS are similar to 'Volumes' in RBD.
> 
> With 'rbd create --order 16 new-volume' I can assign an object size to a 
> volume.
> With 'cephfs directory set_layout ...'  I can set similar parameters to a 
> directory:
> 
> # mkdir /mnt/cephfs/test-dir
> # cephfs /mnt/cephfs/test-dir show_layout
> layout not specified
> 
> # cephfs /mnt/cephfs/test-dir set_layout -p 3 -s 4194304 -u 4194304 -c 1
> # cephfs /mnt/cephfs/test-dir show_layout
> layout.data_pool: 3
> layout.object_size:   4194304
> layout.stripe_unit:   4194304
> layout.stripe_count:  1
> 
> # echo asd > /mnt/cephfs/test-dir/test-file
> # cephfs /mnt/cephfs/test-dir/test-file show_layout
> layout.data_pool: 3
> layout.object_size:   4194304
> layout.stripe_unit:   4194304
> layout.stripe_count:  1
> 
> The set_layout attribute of a DIR will be inherit to the FILES below.
> 
> My question is: which combination of "--stripe_unit --stripe_count 
> --object_size"
> will be optimal for streaming, small sequential IOs, small random IOs ?
> (in/below a DIR)

Just setting object_size = stripe_unit = 64K will work, except that the 
final objects will be pretty small, which is not especially efficient on 
the back end.  In that case I would do something like 

 object_size = 4M
 stripe_unit = 64K
 stripe_count = 16

so that we stripe over 16 objects until they fill up and then move on to 
the next 16.  Note that you can do the same thing with RBD now too when 
you are using librbd, but the non-trivial striping is not supported by the 
kernel client.

Yan mentioned this, but I'll reiterate: using the virtual xattrs to adjust 
these parameters is generally more convenient than the cephfs tool, and 
works both with ceph-fuse and the kernel client.  See

ceph.git/qa/workunits/misc/layout_vxattrs.sh

to see how they are used.
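
For example, a rough sketch of the vxattr route on a directory (the attribute names
follow that workunit; the mount point and values are placeholders):

  # show the current directory layout, if one has been set
  getfattr -n ceph.dir.layout /mnt/cephfs/test-dir

  # apply the layout suggested above: 4M objects, 64K stripe unit, 16-wide stripes
  setfattr -n ceph.dir.layout.object_size  -v 4194304 /mnt/cephfs/test-dir
  setfattr -n ceph.dir.layout.stripe_unit  -v 65536   /mnt/cephfs/test-dir
  setfattr -n ceph.dir.layout.stripe_count -v 16      /mnt/cephfs/test-dir

New files created under the directory then inherit that layout, as in the cephfs
examples earlier in the thread.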

sage



 > > 
> Best Regards,
> -Dieter
> 
> 
> > 
> > Regards
> > Yan, Zheng
> > 
> > > -Dieter
> > >
> > > On Fri, Aug 09, 2013 at 09:44:57AM +0200, Kasper Dieter wrote:
> > >> Hi,
> > >>
> > >> my goal is to set the 'object size' used in the distribution inside rados
> > >> in an equal (or similar) way between RBD and CephFS.
> > >>
> > >> To set obj_size=64k in RBD I use the command:
> > >> rbd create --size 1024000 --pool SSD-r2 ssd2-1T-64k --order 16
> > >>
> > >> On cephfs set_layout '-s 65536' runs into EINVAL:
> > >> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 4194304 -c 1
> > >> Error setting layout: Invalid argument
> > >>
> > >> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 65536 -c 1
> > >> cephfs /mnt/cephfs/fio-64k/ show_layout
> > >> layout.data_pool: 3
> > >> layout.object_size:   65536
> > >> layout.stripe_unit:   65536
> > >> layout.stripe_count:  1
> > >>
> > >> The man page of cephfs says
> > >> ---snip---
> > >>-u --stripe_unit
> > >>   Set the size of each stripe
> > >>
> > >>-c --stripe_count
> > >>   Set the number of objects to stripe across
> > >>
> > >>-s --object_size
> > >>   Set the size of the objects to stripe across
> > >> ---snip---
> > >>
> > >> What is the equivalent command on cephfs to 'rbd create --order 16' ?
> > >> Can you please give same explanation how "--stripe_unit --stripe_count 
> > >> --object_size"
> > >> should be used in combination to achieve different layouts on CephFS 
> > >> directories
> > >> (e.g. optimized for streaming, small sequential IOs, small random IOs)
> > >> ?
> > >>
> > >> Thanks,
> > >> -Dieter
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: bug in /etc/init.d/ceph debian

2013-08-09 Thread Sage Weil
On Fri, 9 Aug 2013, James Harper wrote:
> > > I haven't tried your patch yet, but can it ever return 0? It seems to
> > > set it to 3 initially, and then change it to 1 if it finds an error. I
> > > can't see that it ever sets it to 0 indicating that daemons are running.
> > > Easy enough to fix by setting the EXIT_STATUS=0 after the check of
> > > daemon_is_running, I think, but it still doesn't allow for the case
> > > where there are three OSD's, one is running, one is stopped, and one is
> > > failed. The EXIT_STATUS in that case appears to be based on the last
> > > daemon checked, eg basically random.
> > 
> > What should it return in that case?
> > 
> 
> I've been thinking about this some more and I'm still not sure. I think my 
> patch says:
> if _any_ are in error then return 1
> else if any are running return 0
> else if all are stopped return 3
> 
> But I think this still won't have the desired outcome if you have 2 OSD's. 
> The possible situations if the resource is supposed to be running are:
> . Both running => all good, pacemaker will do nothing
> . Both stopped => all good, pacemaker will start the services
> . One stopped one running => not good, pacemaker won't make any effort to 
> start services

If one daemon is stopped and one is running, returning 'not running' seems 
ok to me, since 'start' at that point will do the right thing.

> . One in error, one running => not good. I'm not sure exactly what will 
> happen but it won't be what you expect.

I think it's fine for this to be an error condition.

> 
> The only solution I can see is to manage the services individually, in 
> which case the init.d script with your patch + setting to 0 if running 
> does the right thing anyway.

Yeah, managing individually is probably the most robust, but if it works 
well enough in the generic configuration with no customization that is 
good.

Anyway, I'm fine with whatever variation of your original or my patch you 
think addresses this.  A comment block in the init-ceph script documenting 
what the return codes mean (similar to the above) would be nice so that 
it is clear to the next person who comes along.
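
Something along these lines would do, assuming the status action follows the usual
LSB exit-code conventions:

  # 'status' exit codes:
  #   0 - all requested ceph daemons on this host are running
  #   1 - at least one daemon is dead or in an error state
  #   3 - no daemons are running (cleanly stopped)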

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Could we introduce launchpad/gerrit for ceph

2013-08-09 Thread James Page
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

On 09/08/13 14:58, Chen, Xiaoxi wrote:
> I do think Launchpad is a mature alternative, since there are big
> success stories such as Ubuntu and OpenStack :)

Indeed

> OpenStack has already used it for years, and everyone seems happy
> with it. I am also happy with it, but I am not familiar with jira
> so I cannot do a comparison.

Bear in mind that the OpenStack project uses Launchpad for Blueprints,
Bug Tracking and Release Management and github+gerrit for code
management and reviews so its a bit of both worlds but it does seem to
work pretty well.

Monty's OpenStack CI team did some magic in gerrit to make comments in
reviews link back to bugs and blueprints.

HTH

James

- -- 
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQIcBAEBCAAGBQJSBQdEAAoJEL/srsug59jDOHwP/j0P7izbsfggwSZItcgHB4nY
iFJlCh1IEqPjJu1g09pjR/jCQJpAxhV6pMg9Q8u7N9Zwcoql/VIDWy4WZlL7Kgce
7QGEFIOqqvrd+8AzqqV5tsCzcLggUJDIvBfUiz+K3YtDnZeBAgqAFAX1ykG0Ntoi
kaeSkfEC7EDoLKcw1p1mJs9VE02IkSyN3Wl/dGr4SL5Cf8XwkVJEH9GItEEiQCMW
go0n9gsR0HY7N7tXpxyiFLfZODZYKd3pXc3CyerlwAHNyoKOatIAQ7e2amnyvXaF
mDljaRwtoSLEuj3xXaDKUxaoT5ZCMBFfooVtfca3h4YoZLvoOhFQ1YzObX+zqd7L
TbU784cG0Q/l82kvTpliiZvs9lMMJdx2LlONmRVUni/YZAfh74JdEXKlF8G08uN1
cgx4KrgbpgHHd/AA2fCjADzjeeklr8vsbyb6gPCPAGoKW6fAV6ehLe23eeQWPS7A
0sNbRg1xTLD+TrjqNlXkEnIy4B5ZdBMIhc+fLeXTdHwJT26f/P0abQyCYyEUQiNo
qRsm792+mbf9P0AluHiBi+a6/H05TTmPK5FbsAltPfUjOb0KHE5/UCGjiDLuIrV8
8Ym2R7vOpmLnEs9/t1tAMIUDZdZZV4oYAGPg3IyPbgxMay3dcLt9m3ARj4mS3o+f
iwAn2dOGJ3yJIOnla4kD
=bq9T
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RGW getting revoked tokens from keystone not working

2013-08-09 Thread Yehuda Sadeh
Hi,

   sorry for the late response.

On Tue, Aug 6, 2013 at 5:10 AM, Roald van Loon  wrote:
> Hi ceph devs,
>
> I was working with a RGW / keystone implementation, source tree from
> github master. I stumbled against this error from the radosgw log;
>
> 2013-08-06 14:00:02.523331 7f929c011780 10 allocated request req=0x1c31640
> 2013-08-06 14:00:02.556052 7f923bfff700 10 request returned
> 2013-08-06 14:00:02.556238 7f923bfff700  0 ERROR: keystone revocation
> processing returned error r=-22
>
> That was weird (nothing after the "request returned"), because I
> checked the request using curl and it actually returned data. Then I
> found this;
>
> https://github.com/ceph/ceph/blob/master/src/rgw/rgw_swift.cc#L307
>
> Shouldn't read_data be receive_data, because the received_data is
> called by the curl_easy_perform? Or else the bufferlist in
> RGWSwift::check_revoked() will remain empty at all times.
>
> I might be missing something, so that's why I'm mailing the question
> instead of mailing the patch...

Looking at it now, this is a real issue. I recently renamed the
read_data() to receive_data() to avoid any confusion and missed these.
I'll fix it for dumpling.

Thanks,
Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Could we introduce launchpad/gerrit for ceph

2013-08-09 Thread Chen, Xiaoxi
I do think Launchpad is a mature alternative, since there are big success
stories such as Ubuntu and OpenStack :)
OpenStack has already used it for years, and everyone seems happy with it. I am
also happy with it, but I am not familiar with jira so I cannot do a
comparison.

Is there any estimated date or plan for when we will introduce this?
-Original Message-
From: Sage Weil [mailto:s...@inktank.com] 
Sent: Friday, August 09, 2013 1:06 PM
To: Chen, Xiaoxi
Cc: ceph-devel@vger.kernel.org
Subject: Re: Could we introduce launchpad/gerrit for ceph

> Hi,
> Now it's a bit hard for us to track the bugs, review the submits,
> and track the blueprints. We do have a bug tracking system, but most
> of the time it doesn't connect with a github submit link. We have
> email review, pull requests, and also some internal mechanism inside
> inktank, we do need a single entrance for doing a submission. We have
> wiki to track blueprints, but it's really confusing if you want to
> know the status of a BP, and also, the discussion around that BP.
> Basically I think Openstack did a good job to track all these things,
> I do really suggest Ceph could also introduce Launchpad for this.

We had a discussion during CDS about this.  The general output of that seems to 
be:

- gerrit workflow has a lot of good things
- gerrit treatment of individual commits vs branches is very unpleasant
- jenkins + gerrit integration is good
- we want build + smoke tests to run automatically prior to merge

- github pull requests are a decent alternative to gerrit if we
  - have build and smoke test output added as comments on the request
via the github apis
  - have jenkins running tests on pull requests via the plugin api

The bug tracker vs blueprint discussion also came up.  We are strongly 
considering a switch to jira (+ greenhopper etc) which has an integrated 
blueprint function and much better scrum stuff.

I haven't used launchpad much.  Is that a viable alternative?

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cephfs set_layout

2013-08-09 Thread Kasper Dieter
On Fri, Aug 09, 2013 at 03:06:37PM +0200, Yan, Zheng wrote:
> On Fri, Aug 9, 2013 at 5:03 PM, Kasper Dieter
>  wrote:
> > OK,
> > I found this nice page: http://ceph.com/docs/next/dev/file-striping/
> > which explains "--stripe_unit --stripe_count --object_size"
> >
> > But still I'm not sure about
> > (1) what is the equivalent command on cephfs to 'rbd create --order 16' ?
> 
> you can get/set file layout through virtual xattr. for example:
> 
> # getfattr -d -m - targetfile
> 
> > (2) how to use those parameters to achieve different optimized layouts on 
> > CephFS directories
> > (e.g. for streaming, small sequential IOs, small random IOs)
> >
> 
> ceph directories are not implemented as files. you can't optimize ceph
> directories by this way.

In my view 'Directories' in CephFS are similar to 'Volumes' in RBD.

With 'rbd create --order 16 new-volume' I can assign an object size to a volume.
With 'cephfs directory set_layout ...'  I can set similar parameters to a 
directory:

# mkdir /mnt/cephfs/test-dir
# cephfs /mnt/cephfs/test-dir show_layout
layout not specified

# cephfs /mnt/cephfs/test-dir set_layout -p 3 -s 4194304 -u 4194304 -c 1
# cephfs /mnt/cephfs/test-dir show_layout
layout.data_pool: 3
layout.object_size:   4194304
layout.stripe_unit:   4194304
layout.stripe_count:  1

# echo asd > /mnt/cephfs/test-dir/test-file
# cephfs /mnt/cephfs/test-dir/test-file show_layout
layout.data_pool: 3
layout.object_size:   4194304
layout.stripe_unit:   4194304
layout.stripe_count:  1

The set_layout attribute of a DIR will be inherit to the FILES below.

My question is: which combination of "--stripe_unit --stripe_count 
--object_size"
will be optimal for streaming, small sequential IOs, small random IOs ?
(in/below a DIR)


Best Regards,
-Dieter


> 
> Regards
> Yan, Zheng
> 
> > -Dieter
> >
> > On Fri, Aug 09, 2013 at 09:44:57AM +0200, Kasper Dieter wrote:
> >> Hi,
> >>
> >> my goal is to set the 'object size' used in the distribution inside rados
> >> in an equal (or similar) way between RBD and CephFS.
> >>
> >> To set obj_size=64k in RBD I use the command:
> >> rbd create --size 1024000 --pool SSD-r2 ssd2-1T-64k --order 16
> >>
> >> On cephfs set_layout '-s 65536' runs into EINVAL:
> >> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 4194304 -c 1
> >> Error setting layout: Invalid argument
> >>
> >> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 65536 -c 1
> >> cephfs /mnt/cephfs/fio-64k/ show_layout
> >> layout.data_pool: 3
> >> layout.object_size:   65536
> >> layout.stripe_unit:   65536
> >> layout.stripe_count:  1
> >>
> >> The man page of cephfs says
> >> ---snip---
> >>-u --stripe_unit
> >>   Set the size of each stripe
> >>
> >>-c --stripe_count
> >>   Set the number of objects to stripe across
> >>
> >>-s --object_size
> >>   Set the size of the objects to stripe across
> >> ---snip---
> >>
> >> What is the equivalent command on cephfs to 'rbd create --order 16' ?
> >> Can you please give same explanation how "--stripe_unit --stripe_count 
> >> --object_size"
> >> should be used in combination to achieve different layouts on CephFS 
> >> directories
> >> (e.g. optimized for streaming, small sequential IOs, small random IOs)
> >> ?
> >>
> >> Thanks,
> >> -Dieter
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cephfs set_layout - EINVAL - solved

2013-08-09 Thread Yan, Zheng
On Fri, Aug 9, 2013 at 5:03 PM, Kasper Dieter
 wrote:
> OK,
> I found this nice page: http://ceph.com/docs/next/dev/file-striping/
> which explains "--stripe_unit --stripe_count --object_size"
>
> But still I'm not sure about
> (1) what is the equivalent command on cephfs to 'rbd create --order 16' ?

you can get/set file layout through virtual xattr. for example:

# getfattr -d -m - targetfile

> (2) how to use those parameters to achieve different optimized layouts on 
> CephFS directories
> (e.g. for streaming, small sequential IOs, small random IOs)
>

ceph directories are not implemented as files. you can't optimize ceph
directories by this way.

Regards
Yan, Zheng

> -Dieter
>
> On Fri, Aug 09, 2013 at 09:44:57AM +0200, Kasper Dieter wrote:
>> Hi,
>>
>> my goal is to set the 'object size' used in the distribution inside rados
>> in an equal (or similar) way between RBD and CephFS.
>>
>> To set obj_size=64k in RBD I use the command:
>> rbd create --size 1024000 --pool SSD-r2 ssd2-1T-64k --order 16
>>
>> On cephfs set_layout '-s 65536' runs into EINVAL:
>> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 4194304 -c 1
>> Error setting layout: Invalid argument
>>
>> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 65536 -c 1
>> cephfs /mnt/cephfs/fio-64k/ show_layout
>> layout.data_pool: 3
>> layout.object_size:   65536
>> layout.stripe_unit:   65536
>> layout.stripe_count:  1
>>
>> The man page of cephfs says
>> ---snip---
>>-u --stripe_unit
>>   Set the size of each stripe
>>
>>-c --stripe_count
>>   Set the number of objects to stripe across
>>
>>-s --object_size
>>   Set the size of the objects to stripe across
>> ---snip---
>>
>> What is the equivalent command on cephfs to 'rbd create --order 16' ?
>> Can you please give same explanation how "--stripe_unit --stripe_count 
>> --object_size"
>> should be used in combination to achieve different layouts on CephFS 
>> directories
>> (e.g. optimized for streaming, small sequential IOs, small random IOs)
>> ?
>>
>> Thanks,
>> -Dieter
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reduce log verbosity

2013-08-09 Thread Jean-Daniel BUSSY
Thanks for the info!

I now got these settings in my ceph.conf

  mon_cluster_log_file = /dev/null
  mon_cluster_log_to_syslog = true
  clog_to_syslog = true
  log_to_syslog = true
  err_to_syslog = true
  clog_to_syslog_level = "warn"
  mon_cluster_log_to_syslog_level = "warn"

I have the cluster log disabled and much less output (some options
might be unnecessary here).
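
A quick way to confirm which of these actually took effect is the monitor's admin
socket (the socket name below is a guess and will differ per host):

  ceph --admin-daemon /var/run/ceph/ceph-mon.172-23-45-21.asok config show | grep -E 'syslog|clog'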

I still can't find a way to take these out of the ceph-mon.{hostip}.log:

  mon.172-23-45-21@0(leader).data_health(754) update_stats avail 92%
total 138319904 used 2795168 avail 128498408

And during daily scrub there pgmap log messages appears:

  0 log [INF] : pgmap v428062: 3216 pgs: 3216 active+clean; 78649 MB
data, 154 GB used, 1429 GB / 1584 GB avail

but I suppose that will be fixed by a future "mon_cluster_log_level"
option implementation.

Also, when ceph bench is running the stats are not filtered and still
appears in the mon log. That is not really important though.

BUSSY Jean-Daniel
Cloud Engineer


On Thu, Aug 8, 2013 at 12:12 AM, Sage Weil  wrote:
> On Wed, 7 Aug 2013, Jean-Daniel BUSSY wrote:
>> Hi all,
>>
>> We have log aggregation on our ceph cluster and verbosity is a problem
>> on the cluster log file at /etc/ceph/ceph.conf
>> I got advice from some nice folk on IRC and I tried to use syslog and
>> setup the log level to WARN but the INF logs are still pouring.
>> I confirmed settings are changed in the admin socked and got these:
>>
>>   "clog_to_syslog_level": "warn",
>>   "clog_to_syslog_facility": "daemon",
>>   "mon_cluster_log_to_syslog": "true",
>>   "mon_cluster_log_to_syslog_level": "warn",
>>   "mon_cluster_log_to_syslog_facility": "daemon",
>>   "mon_cluster_log_file": "\/var\/log\/ceph\/ceph.log",
>>
>> I guess I am missing some option flag or something. Any hint?
>
> Currently the ceph.log gets the log messages unconditionally.  You can set
> mon_cluster_log_file to /dev/null so that only warn and higher goes to
> syslog?
>
> And we should add a mon_cluster_log_level option :)
>
> sage
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] Xen blktap driver for Ceph RBD : Anybody wants to test ? :p

2013-08-09 Thread Sylvain Munaut
Hi,

> I've had a few occasions where tapdisk has segfaulted:
>
> tapdisk[9180]: segfault at 7f7e3a5c8c10 ip 7f7e387532d4 sp 
> 7f7e3a5c8c10 error 4 in libpthread-2.13.so[7f7e38748000+17000]
> tapdisk:9180 blocked for more than 120 seconds.
> tapdisk D 88043fc13540 0  9180  1 0x
>
> and then like:
>
> end_request: I/O error, dev tdc, sector 472008
>
> I can't be sure but I suspect that when this happened either one OSD was 
> offline, or the cluster lost quorum briefly.

Interesting. There might be an issue if a request ends in error, I'll
have to check that.
I'll have a look on monday.

Cheers,

Sylvain
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cephfs set_layout - EINVAL - solved

2013-08-09 Thread Kasper Dieter
OK,
I found this nice page: http://ceph.com/docs/next/dev/file-striping/
which explains "--stripe_unit --stripe_count --object_size"

But still I'm not sure about
(1) what is the equivalent command on cephfs to 'rbd create --order 16' ?
(2) how to use those parameters to achieve different optimized layouts on 
CephFS directories
(e.g. for streaming, small sequential IOs, small random IOs)

-Dieter

On Fri, Aug 09, 2013 at 09:44:57AM +0200, Kasper Dieter wrote:
> Hi,
> 
> my goal is to set the 'object size' used in the distribution inside rados
> in an equal (or similar) way between RBD and CephFS.
> 
> To set obj_size=64k in RBD I use the command:
> rbd create --size 1024000 --pool SSD-r2 ssd2-1T-64k --order 16  
> 
> On cephfs set_layout '-s 65536' runs into EINVAL:
> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 4194304 -c 1
> Error setting layout: Invalid argument
> 
> cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 65536 -c 1
> cephfs /mnt/cephfs/fio-64k/ show_layout
> layout.data_pool: 3
> layout.object_size:   65536
> layout.stripe_unit:   65536
> layout.stripe_count:  1
> 
> The man page of cephfs says
> ---snip---
>-u --stripe_unit
>   Set the size of each stripe
> 
>-c --stripe_count
>   Set the number of objects to stripe across
> 
>-s --object_size
>   Set the size of the objects to stripe across
> ---snip---
> 
> What is the equivalent command on cephfs to 'rbd create --order 16' ?
> Can you please give same explanation how "--stripe_unit --stripe_count 
> --object_size"
> should be used in combination to achieve different layouts on CephFS 
> directories
> (e.g. optimized for streaming, small sequential IOs, small random IOs)
> ?
> 
> Thanks,
> -Dieter
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


cephfs set_layout - EINVAL

2013-08-09 Thread Kasper Dieter
Hi,

my goal is to set the 'object size' used in the distribution inside rados
in an equal (or similar) way between RBD and CephFS.

To set obj_size=64k in RBD I use the command:
rbd create --size 1024000 --pool SSD-r2 ssd2-1T-64k --order 16  

On cephfs set_layout '-s 65536' runs into EINVAL:
cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 4194304 -c 1
Error setting layout: Invalid argument

cephfs /mnt/cephfs/fio-64k/ set_layout -p 3 -s   65536 -u 65536 -c 1
cephfs /mnt/cephfs/fio-64k/ show_layout
layout.data_pool: 3
layout.object_size:   65536
layout.stripe_unit:   65536
layout.stripe_count:  1

The man page of cephfs says
---snip---
   -u --stripe_unit
  Set the size of each stripe

   -c --stripe_count
  Set the number of objects to stripe across

   -s --object_size
  Set the size of the objects to stripe across
---snip---

What is the equivalent command on cephfs to 'rbd create --order 16' ?
Can you please give same explanation how "--stripe_unit --stripe_count 
--object_size"
should be used in combination to achieve different layouts on CephFS directories
(e.g. optimized for streaming, small sequential IOs, small random IOs)
?

Thanks,
-Dieter
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: bug in /etc/init.d/ceph debian

2013-08-09 Thread James Harper
> > I haven't tried your patch yet, but can it ever return 0? It seems to
> > set it to 3 initially, and then change it to 1 if it finds an error. I
> > can't see that it ever sets it to 0 indicating that daemons are running.
> > Easy enough to fix by setting the EXIT_STATUS=0 after the check of
> > daemon_is_running, I think, but it still doesn't allow for the case
> > where there are three OSD's, one is running, one is stopped, and one is
> > failed. The EXIT_STATUS in that case appears to be based on the last
> > daemon checked, eg basically random.
> 
> What should it return in that case?
> 

I've been thinking about this some more and I'm still not sure. I think my 
patch says:
if _any_ are in error then return 1
else if any are running return 0
else if all are stopped return 3

But I think this still won't have the desired outcome if you have 2 OSD's. The 
possible situations if the resource is supposed to be running are:
. Both running => all good, pacemaker will do nothing
. Both stopped => all good, pacemaker will start the services
. One stopped one running => not good, pacemaker won't make any effort to start 
services
. One in error, one running => not good. I'm not sure exactly what will happen 
but it won't be what you expect.

The only solution I can see is to manage the services individually, in which 
case the init.d script with your patch + setting to 0 if running does the right 
thing anyway.

James
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html