Re: wip-user status

2015-08-05 Thread Milan Broz
On 08/04/2015 10:53 PM, Sage Weil wrote:
 I rebased the wip-user patches from wip-selinux-policy onto 
 wip-selinux-policy-no-user + merge to master so that it sits on top of the 
 newly-merged systemd changes.

Great, so if it is in a build-ready state, I can try it with our virtual cluster 
install.

 Notes/issues:
 
  - ceph-osd-prestart.sh verifies that the osd_data dir is owned by either 
 'root' or 'ceph' or else it exits with an error.  (Presumably systemd will 
 fail to start the unit in this case.)  It prints a helpful message 
 pointing the user at 'ceph-disk chown ...'.
 
  - 'ceph-disk chown ...' is not implemented yet.  Should it take the base 
 device, like activate and prepare?  Or a mounted path?  Or either?

It should be easy to convert between device and mountpoint using findmnt, so I 
would prefer whichever is more consistent with the rest of the user interface...
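
For example (the device and mount paths below are only illustrative):

  findmnt -n -o TARGET /dev/sdb1                  # device -> mountpoint
  findmnt -n -o SOURCE /var/lib/ceph/osd/ceph-0   # mountpoint -> device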

IIRC, if the parameter is a base device, what should happen if the device is not 
mounted?
If it is a mount path, then what about the other data/journal partitions?

It seems to me that the parameter could be the base OSD device and chown would simply
handle all its partitions. (So for an encrypted OSD it needs to get the key to unlock 
it, etc...)

  - Currently ceph-osd@.service unconditionally passes --setuser ceph to 
 ceph-osd... even if the data directory is owned by root.  I don't think 
 systemd is smart enough to do this conditionally unless we make an ugly 
 wrapper script that starts ceph-osd.  Alternatively, we could make 
 ceph-osd conditionally do the setuid based on the ownership of the 
 directory, but... meh.  The idea was to do the setuid *very* early in the 
 startup process so that logging and so on are opened as the ceph user.  
 Ideas?

Well, systemd could do that if the service is generated (like e.g. cryptsetup
activation jobs are generated according to crypttab). But this adds complexity
that we do not need...

Maybe another option is to use an environment variable (CEPH_USER or so), set it
in the service via Environment=/EnvironmentFile..., and have ceph-osd use that...
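
Roughly, the idea could look like the following fragment of ceph-osd@.service
(a sketch only: the CEPH_USER variable and the sysconfig path are invented for
illustration, and the ExecStart line is simplified):

  [Service]
  Environment=CEPH_USER=ceph
  EnvironmentFile=-/etc/sysconfig/ceph
  ExecStart=/usr/bin/ceph-osd -f --id %i --setuser ${CEPH_USER}

(systemd reads EnvironmentFile after Environment=, so the sysconfig file, if
present, would override the default.)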

But I think some systemd gurus will find something better here:)

Thanks,
Milan



Re: Erasure Code Plugins : PLUGINS_V3 feature

2015-08-05 Thread Loic Dachary
Hi Sam,

How does this proposal sound ? It would be great if that was done before the 
feature freeze.

Cheers

On 29/07/2015 11:16, Loic Dachary wrote:
 Hi Sam,
 
 The SHEC plugin[0] has been running in the rados runs[1] in the past few 
 months. It also has a matching corpus verification which runs on every make 
 check[2] as well as its optimized variants. I believe the flag experimental 
 can now be removed. 
 
 In order to do so, we need to use a PLUGINS_V3 feature, in the same way we 
 did back in Giant when the ISA and LRC plugins were introduced[3]. This won't 
 be necessary in the future, when there is a generic plugin mechanism, but 
 right now that's what we need. It would be a commit very similar to the one 
 implementing PLUGINS_V2[4].
 
 Is this agreeable to you ? Or would you rather see another way to resolve 
 this ?
 
 Cheers
 
 [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec
 [1] 
 https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec
 [2] 
 https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52
 [3] http://tracker.ceph.com/issues/9343
 [4] 
 https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: Transitioning Ceph from Autotools to CMake

2015-08-05 Thread Orit Wasserman
On Tue, 2015-08-04 at 00:38 +0100, John Spray wrote:
 OK, here are vstart+ceph.in changes that work well enough in my out of
 tree build:
 https://github.com/ceph/ceph/pull/5457

Great!
 
 John
 
 On Mon, Aug 3, 2015 at 11:09 AM, John Spray jsp...@redhat.com wrote:
  On Sat, Aug 1, 2015 at 8:24 PM, Orit Wasserman owass...@redhat.com wrote:
 
 
  3. no vstart.sh, I started working on this too but have made less progress
  here. At the moment, in order to use vstart I copy the exe and libs to the
  src dir.
 
  I just started playing with CMake on Friday, adding some missing cephfs
  bits.  I was going to fix (3) as well, but I don't want to duplicate work
  -- do you have an existing branch at all?  Presumably this will mostly be a
  case of adding appropriate prefixes to commands.
 
  Cheers,
  John




Re: [ceph-users] Is it safe to increase pg number in a production environment

2015-08-05 Thread Jan Schermer
Hi,
comments inline.

 On 05 Aug 2015, at 05:45, Jevon Qiao qiaojianf...@unitedstack.com wrote:
 
 Hi Jan,
 
 Thank you for the detailed suggestion. Please see my reply in-line.
 On 5/8/15 01:23, Jan Schermer wrote:
 I think I wrote about my experience with this about 3 months ago, including 
 what techniques I used to minimize impact on production.
 
 Basicaly we had to
 1) increase pg_num in small increments only, because creating the placement groups 
 themselves caused slow requests on OSDs
 2) increase pgp_num in small increments and then go higher
 So you completely finished step 1 before jumping into step 2. Have you ever 
 tried mixing them together? Increase pg_number, increase pgp_number, increase 
 pg_number…

Actually we first increased both to 8192 and then decided to go higher, but 
that doesn’t matter.
The only reason for this was that the first step could run unattended at 
night without disturbing the workload.*
The second step had to be attended.

* in other words, we didn’t see “slow requests” because of our threshold 
settings, but while PGs were being created the cluster paused IO for non-trivial 
amounts of time. I suggest you do this in as small steps as possible, depending 
on your SLAs.
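
For what it's worth, the stepped increases can be scripted; a rough sketch only
(the pool name, step size, and the settle check are illustrative and depend
entirely on your cluster and SLAs):

  pool=mypool
  for i in `seq 4160 64 16384` ; do
      ceph osd pool set $pool pgp_num $i
      # wait for the cluster to settle before taking the next step
      while ceph health | grep -Eq 'backfill|recover|peering' ; do sleep 30 ; done
  done

The same loop shape works for the pg_num bumps quoted below.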

 We went from 4096 placement groups up to 16384
 
 pg_num (the number of on-disk created placement groups) was increased like 
 this:
 # for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; 
 sleep 60 ; done
 this ran overnight (and the step was upped to 128 during the night)
 
 Increasing pgp_num was trickier in our case, first because it was heavy 
 production and we wanted to minimize the visible impact and second because 
 of wildly differing free space on the OSDs.
 We did it again in steps and waited for the cluster to settle before 
 continuing.
 Each step upped pgp_num by about 2% and as we got higher (8192) we 
 increased this to much more - the last step was 15360-16384 with the same 
 impact the initial 4096-4160 had.
 The strategy you adopted looks great. I'll do some experiments on a test 
 cluster to evaluate the real impact in each step
 The end result is much better but still nowhere near optimal - bigger impact 
 would be upgrading to a newer Ceph release and setting the new tunables 
 because we’re running Dumpling.
 
 Be aware that PGs cost some space (rough estimate is 5GB per OSD in our 
 case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now 
 while it only had about 1GB before. That’s a lot of memory and space with 
 higher OSD counts...
 This is a good point. So along with the increase in PGs, we also need to 
 take the current status of the cluster (the available disk space and memory 
 for each OSD) into account and evaluate whether we need to add more 
 resources.

Depends on how much free space you have. We had some OSDs at close to 85% 
capacity before we started (and other OSDs at only 30%). When increasing the 
number of PGs the data shuffled greatly - but this depends on what CRUSH rules 
you have (and what version you are running). Newer versions with newer tunables 
will make this a lot easier I guess.

 And while I haven’t calculated the number of _objects_ per PG, we have 
 differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another 
 hosts 1300), and this seems to be the cause of poor data balancing.
 In our environment, we also encountered this imbalanced mapping between PGs and 
 OSDs. What kind of bucket algorithm was used in your environment? Any idea on 
 how to minimize it?

We are using straw because of dumpling. Straw2 should make everything better :-)

Jan

 
 Thanks,
 Jevon
 Jan
 
 
 On 04 Aug 2015, at 18:52, Marek Dohojda mdoho...@altitudedigital.com 
 wrote:
 
 I have done this not that long ago.  My original PG estimates were wrong 
 and I had to increase them.
 
 After increasing the PG numbers the Ceph rebalanced, and that took a while. 
  To be honest in my case the slowdown wasn’t really visible, but it took a 
 while.
 
 My strong suggestion to you would be to do it at a time of low IO, and be 
 prepared that this will take quite a long time to accomplish.  Do it 
 slowly and do not increase multiple pools at once.
 
 It isn’t recommended practice but doable.
 
 
 
 On Aug 4, 2015, at 10:46 AM, Samuel Just sj...@redhat.com wrote:
 
 It will cause a large amount of data movement.  Each new pg after the
 split will relocate.  It might be ok if you do it slowly.  Experiment
 on a test cluster.
 -Sam
 
 On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote:
 Hi Cephers,
 
 This is a greeting from Jevon. Currently, I'm experiencing an issue which
 is troubling me a lot, so I'm writing to ask for your 
 comments/help/suggestions.
 More details are provided below.
 
 Issue:
 I set up a cluster having 24 OSDs and created one pool with 1024 placement
 groups on it for a small startup company. The number 1024 was calculated 
 per
 the equation 'OSDs * 100'/pool size. The cluster has been running 

Re: More ondisk_finisher thread?

2015-08-05 Thread Sage Weil
On Wed, 5 Aug 2015, Ding Dinghua wrote:
 2015-08-05 0:13 GMT+08:00 Somnath Roy somnath@sandisk.com:
  Yes, it has to re-acquire pg_lock today..
  But, between journal write and initiating the ondisk ack, there is one 
  context switch in the code path. So, I guess the pg_lock is not the only 
  one that is causing this 1 ms delay...
  Not sure increasing the finisher threads will help in the pg_lock case as 
  it will be more or less serialized by this pg_lock..
 My concern is that if the pg lock of pg A has been grabbed, not only is the ondisk
 callback of pg A delayed; since ondisk_finisher has only one
 thread, the ondisk callbacks of other pgs will be delayed too.

I wonder if an optimistic approach might help here by making the 
completion synchronous and doing something like

   if (pg->lock.TryLock()) {
     pg->_finish_thing(completion->op);
     delete completion;
   } else {
     finisher.queue(completion);
   }

or whatever.  We'd need to ensure that we aren't holding any lock or 
throttle budget that the pg could deadlock against.

sage


Re: Transitioning Ceph from Autotools to CMake

2015-08-05 Thread Owen Synge
Dear Ali,

my point is no longer relevant, but your reassurances are still very
relevant.

Thanks

Owen

On 08/04/2015 08:26 PM, Ali Maredia wrote:
 Owen,
 
 I understand your concern, and don't think any transition will be made to
 CMake until all the functionality is in it and until it has been thoroughly 
 vetted by the entire community to ensure a smooth transition.
 
 I pushed a branch earlier today called wip-cmake 
 (https://github.com/ceph/ceph/tree/wip-cmake) and plan to continue Orit's 
 make check work, and coordinate with John on the vstart work he's done
 already as my very first action items.
 
 -Ali
 
 - Original Message -
 From: Owen Synge osy...@suse.com
 To: Ali Maredia amare...@redhat.com, ceph-devel@vger.kernel.org
 Sent: Tuesday, August 4, 2015 6:42:31 AM
 Subject: Re: Transitioning Ceph from Autotools to CMake
 
 Dear Ali,
 
 I am glad you are making progress.
 
 Sadly I don't yet know cmake.
 
 Please consider the systemd wip branch. It might be wise to leave
 autotools around a little longer, until all functionality is in CMake.
 
 Best regards
 
 Owen
 
 
 On 07/30/2015 09:01 PM, Ali Maredia wrote:
 After discussing with several other Ceph developers and Sage, I wanted
 to start a discussion about making CMake the primary build system for Ceph.

 CMake works just fine as it is (make -j4 on master with CMake builds
 350% faster than autotools!), but there's more work needed to make it 
 into a first-class build system.

 Short term (1-2 weeks):
  - Making sure CMake works on all supported platforms: Centos7, RHEL7,
Ubuntu 14.04 & 12.04, Fedora 22, Debian Jessie, Debian Wheezy are the
target platforms I have noted to test on already.
  - Adding a target similar to make check
  - Creating CMake targets that build packages (such as for rpm or debian)
  - Writing documentation for those who haven't used CMake before to smooth 
 the
transition over
  - Making sure no targets or dependencies are missing from the
current CMake build, and that CMake supports all current 
targets, configurations and options
  - Replacing the integration autotools has with any automated build/test
systems such as the gitbuilder

 Longer term (2-4 weeks):
  - Removing the current autotools files, to avoid doubling build system
workload
  - Adding more but shorter CMakeLists.txt files to a tree like structure
where a CMakeLists.txt would be in every subdirectory
  
 I'm already working on a target similar to the make check target, and plan
 on working on the other short term goals over the next weeks and beyond.

 I wanted to get feedback from the community on any reasons why someone started 
 using 
 CMake but stopped (ex: lack of functionality), and more broadly, on what 
 other 
 obstacles there might be for the transition.

 -Ali

 

-- 
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
HRB 21284 (AG Nürnberg)
Maxfeldstraße 5, 90409 Nürnberg, Germany


Should pgls filters be part of object classes?

2015-08-05 Thread John Spray
So, I've got this very-cephfs-specific piece of pgls filtering code in
ReplicatedPG:
https://github.com/ceph/ceph/commit/907a3c5a2ba8e3edda18d7edf89ccae7b9d91dc5

I'm not sure I'm sufficiently motivated to create some whole new
plugin framework for these, but what about piggy-backing on object
classes?  I guess it would be an additional
cls_register_filter(myfilter, my_callback_constructor) fn.

tl;dr; How do people feel about extending object classes to include
providing PGLS filters as well?

John


Re: Should pgls filters be part of object classes?

2015-08-05 Thread Sage Weil
On Wed, 5 Aug 2015, John Spray wrote:
 So, I've got this very-cephfs-specific piece of pgls filtering code in
 ReplicatedPG:
 https://github.com/ceph/ceph/commit/907a3c5a2ba8e3edda18d7edf89ccae7b9d91dc5
 
 I'm not sure I'm sufficiently motivated to create some whole new
 plugin framework for these, but what about piggy-backing on object
 classes?  I guess it would be an additional
 cls_register_filter(myfilter, my_callback_constructor) fn.
 
 tl;dr; How do people feel about extending object classes to include
 providing PGLS filters as well?

This seems like the way to do it...

sage


Re: radosgw - stuck ops

2015-08-05 Thread Yehuda Sadeh-Weinraub
On Tue, Aug 4, 2015 at 3:23 PM, GuangYang yguan...@outlook.com wrote:
 Thanks for Sage, Yehuda and Sam's quick reply.

 Given the discussion so far, could I summarize into the following bullet 
 points:

 1 The first step we would like to pursue is to implement the following 
 mechanism to avoid infinite waiting at radosgw side:
   1.1. radosgw - send OP with a *fast_fail* flag
   1.2. OSD - reply with -EAGAIN if the PG is *inactive* and the 
 *fast_fail* flag is set
   1.3. radosgw - upon receiving -EAGAIN, retry till a timeout interval is 
  reached (probably with some back-off?), and if it eventually fails, convert 
  -EAGAIN to some other error code and pass it to the upper layer.

I'm not crazy about the 'fast_fail' name, maybe we can come up with a
better descriptive term. Also, not 100% sure the EAGAIN is the error we
want to see. Maybe the flag on the request could specify what would be
the error code to return in this case?
I think it's a good plan to start with, we can adjust things later.


 2 In terms of management of radosgw's worker threads, I think we either 
 pursue Sage's proposal (which could linearly increase the time it takes to 
 get all worker threads stuck, depending on how many threads we expand to), or 
 simply try sharding the work queue (for which we already have some basic building blocks)?

The problem that I see with that proposal (missed it earlier, only
seeing it now), is that when the threads actually wake up the system
could become unusable. In any case, it's probably a lower priority at
this point, we could rethink this area again later.

Yehuda


 Can I start working on a patch for 1 and then 2 as a lower priority?

 Thanks,
 Guang
 
 Date: Tue, 4 Aug 2015 10:14:06 -0700
 Subject: Re: radosgw - stuck ops
 From: ysade...@redhat.com
 To: sw...@redhat.com
 CC: yguan...@outlook.com; sj...@redhat.com; yeh...@redhat.com; 
 ceph-devel@vger.kernel.org

 On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil sw...@redhat.com wrote:
 On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
 On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil sw...@redhat.com wrote:
 One solution that I can think of is to determine before the read/write
 whether the pg we're about to access is healthy (or has been unhealthy 
 for a
 short period of time), and if not to cancel the request before sending 
 the
 operation. This could mitigate the problem you're seeing at the expense 
 of
 availability in some cases. We'd need to have a way to query pg health
 through librados which we don't have right now afaik.
 Sage / Sam, does that make sense, and/or possible?

 This seems mostly impossible because we don't know ahead of time which
 PG(s) a request is going to touch (it'll generally be a lot of them)?


 Barring pgls() and such, each rados request that radosgw produces will
 only touch a single pg, right?

 Oh, yeah. I thought you meant before each RGW request. If it's at the
 rados level then yeah, you could avoid stuck pgs, although I think a
 better approach would be to make the OSD reply with -EAGAIN in that case
 so that you know the op didn't happen. There would still be cases (though
 more rare) where you weren't sure if the op happened or not (e.g., when
 you send to osd A, it goes down, you resend to osd B, and then you get
 EAGAIN/timeout).

 If done on the client side then we should only make it apply to the
 first request sent. Is it actually a problem if the osd triggered the
 error?


 What would you do when you get that failure/timeout, though? Is it
 practical to abort the rgw request handling completely?


 It should be like any error that happens through the transaction
 (e.g., client disconnection).

 Yehuda


08/05/2015 Weekly Ceph Performance Meeting IS ON!

2015-08-05 Thread Mark Nelson
8AM PST as usual (that's in 13 minutes folks!) No specific topics for 
this week, please feel free to add your own!


Here's the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560(US Toll Free)
  +1 408 317 9253(Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark


civetweb health check

2015-08-05 Thread Srikanth Madugundi
Hi,

We are planning to move our radosgw setup from apache to civetweb. We
were successfully able to set up and run civetweb on a test cluster.

The radosgw instances are fronted by a VIP which currently checks their
health by fetching a /status.html file. After moving to civetweb, the VIP
is unable to get the health of the radosgw server using the /status.html
endpoint and assumes the server is down.

I looked at the Ceph radosgw documentation and did not find any
configuration to rewrite URLs. What is the best approach for the VIP to
get the health of radosgw?

Thanks
Srikanth


Re: Erasure Code Plugins : PLUGINS_V3 feature

2015-08-05 Thread Sage Weil
On Wed, 5 Aug 2015, Loic Dachary wrote:
 Hi Sam,
 
 How does this proposal sound ? It would be great if that was done before 
 the feature freeze.

I think it's a good time.

Takeshi, note that what this really means is that the on-disk encoding 
needs to remain fixed.  If we decide to change it down the line, we'll 
have to make a 'shec2' or similar so that the old format is still 
decodable (or ensure that existing data can still be read in some other 
way).

Sound good?

sage


 
 Cheers
 
 On 29/07/2015 11:16, Loic Dachary wrote:
  Hi Sam,
  
  The SHEC plugin[0] has been running in the rados runs[1] in the past few 
  months. It also has a matching corpus verification which runs on every make 
  check[2] as well as its optimized variants. I believe the flag 
  experimental can now be removed. 
  
  In order to do so, we need to use a PLUGINS_V3 feature, in the same way we 
  did back in Giant when the ISA and LRC plugins were introduced[3]. This 
  won't be necessary in the future, when there is a generic plugin mechanism, 
  but right now that's what we need. It would be a commit very similar to the 
  one implementing PLUGINS_V2[4].
  
  Is this agreeable to you ? Or would you rather see another way to resolve 
  this ?
  
  Cheers
  
  [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec
  [1] 
  https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec
  [2] 
  https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52
  [3] http://tracker.ceph.com/issues/9343
  [4] 
  https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489
  
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 
 

Re: rgw and the next hammer release v0.94.3

2015-08-05 Thread Yehuda Sadeh-Weinraub
On Tue, Aug 4, 2015 at 3:41 AM, Loic Dachary l...@dachary.org wrote:
 Hi Yehuda,

 The next hammer release as found at https://github.com/ceph/ceph/tree/hammer 
 passed the rgw suite (http://tracker.ceph.com/issues/11990#rgw and 
 http://tracker.ceph.com/issues/12502#note-6).

 Do you think the hammer branch is ready for QE to start their own round of 
 testing ?

Looks fine to me.

Yehuda

 Cheers

 P.S. http://tracker.ceph.com/issues/11990#Release-information has direct 
 links to the pull requests merged into hammer since v0.94.2 in case you need 
 more context about one of them.

 --
 Loïc Dachary, Artisan Logiciel Libre



Re: test/run-rbd-unit-tests.sh : pure virtual method called

2015-08-05 Thread Loic Dachary
Hi,

Here is another make check failure. They don't seem to be related. To the best of 
my knowledge these are the only two rbd-related failures in make check during 
the past week.

http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04x86_64/6884/console

[ RUN  ] TestLibRBD.ObjectMapConsistentSnap
using new format!
test/librbd/test_librbd.cc:2790: Failure
Value of: passed
  Actual: false
Expected: true
[  FAILED  ] TestLibRBD.ObjectMapConsistentSnap (396 ms)

[--] Global test environment tear-down
[==] 98 tests from 6 test cases ran. (10554 ms total)
[  PASSED  ] 97 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestLibRBD.ObjectMapConsistentSnap

On 03/08/2015 18:01, Loic Dachary wrote:
 Hi,
 
 test/run-rbd-unit-tests.sh failed today on master on Ubuntu 14.04, when run 
 by the make check bot on an unrelated pull request (modifying do_autogen 
 which is not used by the make check bot).
 
 http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04x86_64/6834/console
 
 [ RUN  ] TestInternal.MultipleResize
 pure virtual method called
 terminate called without an active exception
 
 Cheers
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: 08/05/2015 Weekly Ceph Performance Meeting IS ON!

2015-08-05 Thread kernel neophyte
Hi Mark,

I missed today's call :-(
Could you please point me to the recording? The Etherpad link only shows
recordings until 07/08.


__Neo



On Wed, Aug 5, 2015 at 7:47 AM, Mark Nelson mnel...@redhat.com wrote:
 8AM PST as usual (that's in 13 minutes folks!) No specific topics for this
 week, please feel free to add your own!

 Here's the links:

 Etherpad URL:
 http://pad.ceph.com/p/performance_weekly

 To join the Meeting:
 https://bluejeans.com/268261044

 To join via Browser:
 https://bluejeans.com/268261044/browser

 To join with Lync:
 https://bluejeans.com/268261044/lync


 To join via Room System:
 Video Conferencing System: bjn.vc -or- 199.48.152.152
 Meeting ID: 268261044

 To join via Phone:
 1) Dial:
   +1 408 740 7256
   +1 888 240 2560(US Toll Free)
   +1 408 317 9253(Alternate Number)
   (see all numbers - http://bluejeans.com/numbers)
 2) Enter Conference ID: 268261044

 Mark


FileStore should not use syncfs(2)

2015-08-05 Thread Sage Weil
Today I learned that syncfs(2) does an O(n) search of the superblock's 
inode list searching for dirty items.  I've always assumed that it was 
only traversing dirty inodes (e.g., a list of dirty inodes), but that 
appears not to be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode 
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
servicing a very light workload, and each syncfs(2) call was taking ~7 
seconds (usually to write out a single inode).

A possible workaround for such boxes is to turn 
/proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
pages instead of inodes/dentries)...
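
For example (the value is illustrative; the default is 100, and larger values
make the kernel reclaim dentries/inodes more aggressively):

  sysctl -w vm.vfs_cache_pressure=1000
  echo 'vm.vfs_cache_pressure = 1000' >> /etc/sysctl.conf   # persist across reboots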

I think the take-away though is that we do need to bite the bullet and 
make FileStore f[data]sync all the right things so that the syncfs call 
can be avoided.  This is the path you were originally headed down, 
Somnath, and I think it's the right one.

The main thing to watch out for is that according to POSIX you really need 
to fsync directories.  With XFS that isn't the case since all metadata 
operations are going into the journal and that's fully ordered, but we 
don't want to allow data loss on e.g. ext4 (we need to check what the 
metadata ordering behavior is there) or other file systems.

:(

sage


RE: FileStore should not use syncfs(2)

2015-08-05 Thread Somnath Roy
Thanks Sage for digging down.. I was suspecting something similar.. As I 
mentioned in today's call, even in idle time syncfs is taking ~60ms. I have 64 
GB of RAM in the system.
The workaround I was talking about today is working pretty well so far. In 
this implementation, I am not giving much work to syncfs as each worker thread 
is writing in o_dsync mode. I am issuing syncfs before trimming the journal 
and most of the time I saw it taking < 100 ms.
I have to wake up the sync_thread now after each worker thread finishes 
writing. I will benchmark both approaches. As we discussed earlier, in the case 
of the fsync-only approach, we still need to do a db sync to make sure the leveldb 
stuff is persisted, right?

Thanks & Regards
Somnath

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, August 05, 2015 2:27 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
Subject: FileStore should not use syncfs(2)

Today I learned that syncfs(2) does an O(n) search of the superblock's inode 
list searching for dirty items.  I've always assumed that it was only 
traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to 
be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode 
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it.  
The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing 
a very light workload, and each syncfs(2) call was taking ~7 seconds (usually 
to write out a single inode).

A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure 
way up (so that the kernel favors caching pages instead of inodes/dentries)...

I think the take-away though is that we do need to bite the bullet and make 
FileStore f[data]sync all the right things so that the syncfs call can be 
avoided.  This is the path you were originally headed down, Somnath, and I 
think it's the right one.

The main thing to watch out for is that according to POSIX you really need to 
fsync directories.  With XFS that isn't the case since all metadata operations 
are going into the journal and that's fully ordered, but we don't want to allow 
data loss on e.g. ext4 (we need to check what the metadata ordering behavior is 
there) or other file systems.

:(

sage





Re: FileStore should not use syncfs(2)

2015-08-05 Thread Mark Nelson



On 08/05/2015 04:26 PM, Sage Weil wrote:

Today I learned that syncfs(2) does an O(n) search of the superblock's
inode list searching for dirty items.  I've always assumed that it was
only traversing dirty inodes (e.g., a list of dirty inodes), but that
appears not to be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing
it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40
servicing a very light workload, and each syncfs(2) call was taking ~7
seconds (usually to write out a single inode).

A possible workaround for such boxes is to turn
/proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching
pages instead of inodes/dentries)...


FWIW, I often see performance increase when favoring inode/dentry cache, 
but probably with far fewer inodes than in the setup you just saw.  It 
sounds like there needs to be some maximum limit on the inode/dentry 
cache to prevent this kind of behavior but still favor it up until that 
point.  Having said that, maybe avoiding syncfs is best as you say below.




I think the take-away though is that we do need to bite the bullet and
make FileStore f[data]sync all the right things so that the syncfs call
can be avoided.  This is the path you were originally headed down,
Somnath, and I think it's the right one.

The main thing to watch out for is that according to POSIX you really need
to fsync directories.  With XFS that isn't the case since all metadata
operations are going into the journal and that's fully ordered, but we
don't want to allow data loss on e.g. ext4 (we need to check what the
metadata ordering behavior is there) or other file systems.

:(

sage


Consult some problems of Ceph when reading source code

2015-08-05 Thread 蔡毅

Dear developers,

My name is Cai Yi, and I am a graduate student majoring in CS at Xi’an Jiaotong 
University in China. From Ceph’s homepage, I know Sage is the author of Ceph, 
and I got the email address from your GitHub and Ceph’s official website. 
Because Ceph is an excellent distributed file system, I have recently been reading 
the source code of Ceph (the Hammer edition) to understand the main IO 
path and the performance of Ceph. However, I am facing some problems which I could 
not solve by myself and my partners, or find solutions for on the Internet. So I 
was wondering if you could help us with them. The problems are as 
follows:

1)   In Ceph there is the concept of a transaction. When the OSD 
receives a write request, it is encapsulated in a transaction. But 
when the OSD receives many requests, is there a transaction queue to receive the 
messages? If there is a queue, are the transactions submitted to the next 
operation serially or in parallel? If it is serial, could the transaction 
handling influence the performance?

2)   From some documents about Ceph, if the OSD receives a read request, 
it can only read the data from the primary and then return it to the client. Is that 
description right? Is there any way to read the data from a replica OSD? Do we 
have to request the data from the primary OSD when dealing with a read request? 
If not, and we can read from a replica OSD, how could we guarantee consistency?

3)   When the OSD receives a message, the message may be handled by 
normal dispatch or fast dispatch. What is the difference between normal 
dispatch and fast dispatch? If it goes through normal dispatch, it 
enters the dispatch queue. Is there a single dispatch queue or multiple dispatch 
queues to deal with all the messages?

These are the problems I am facing. Thank you for your patience and cooperation, 
and I look forward to hearing from you.

Yours sincerely

Cai

Re: [ANN] ceph-deploy 1.5.27 released

2015-08-05 Thread Travis Rhoden
Hi Nigel,

On Wed, Aug 5, 2015 at 9:00 PM, Nigel Williams
nigel.willi...@utas.edu.au wrote:
 On 6/08/2015 9:45 AM, Travis Rhoden wrote:

 A new version of ceph-deploy has been released. Version 1.5.27
 includes the following:


 Has the syntax for use of --zap-disk changed? I moved it around but it is no
 longer recognised; worked around by doing a ceph-disk zap before running
 ceph-deploy.

A few things in this area changed with 1.5.26. ceph-deploy's options
are much more strictly attached only to the commands where they make
sense.


 This worked previously:

 ceph-deploy --overwrite-conf osd --zap-disk prepare ceph05:/dev/sdb:/dev/sdd

--zap-disk is an option to 'prepare', not to 'osd'.  ceph-deploy osd
--zap-disk list doesn't make any sense, for example.  The help menus
should make this clear:

# ceph-deploy osd --help
usage: ceph-deploy osd [-h] {list,create,prepare,activate} ...

# ceph-deploy osd prepare --help
usage: ceph-deploy osd prepare [-h] [--zap-disk] [--fs-type FS_TYPE]
   [--dmcrypt] [--dmcrypt-key-dir KEYDIR]
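
So the earlier invocation would become something along these lines (host and
devices taken from your example):

  ceph-deploy --overwrite-conf osd prepare --zap-disk ceph05:/dev/sdb:/dev/sdd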





Re: FileStore should not use syncfs(2)

2015-08-05 Thread Haomai Wang
Agree

On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy somnath@sandisk.com wrote:
 Thanks Sage for digging down..I was suspecting something similar.. As I 
 mentioned in today's call, in idle time also syncfs is taking ~60ms. I have 
 64 GB of RAM in the system.
 The workaround I was talking about today  is working pretty good so far. In 
 this implementation, I am not giving much work to syncfs as each worker 
 thread is writing with o_dsync mode. I am issuing syncfs before trimming the 
 journal and most of the time I saw it is taking < 100 ms.

Actually I would prefer that we don't use syncfs anymore. I would rather use
aio+dio plus a FileStore custom cache to deal with all the syncfs+pagecache
things. That way we can even make the cache smarter and aware of the upper levels
instead of relying on fadvise* calls. Second, we can use a checkpoint method like
mysql innodb: we can know the bandwidth of the frontend (filejournal) and decide
how much and how often we want to flush (using aio+dio).

Anyway, because it's a big project, we may prefer to do this work in newstore
instead of filestore.

 I have to wake up the sync_thread now after each worker thread finished 
 writing. I will benchmark both the approaches. As we discussed earlier, in 
 case of only fsync approach, we still need to do a db sync to make sure the 
 leveldb stuff persisted, right ?

 Thanks & Regards
 Somnath

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Wednesday, August 05, 2015 2:27 PM
 To: Somnath Roy
 Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
 Subject: FileStore should not use syncfs(2)

 Today I learned that syncfs(2) does an O(n) search of the superblock's inode 
 list searching for dirty items.  I've always assumed that it was only 
 traversing dirty inodes (e.g., a list of dirty inodes), but that appears not 
 to be the case, even on the latest kernels.

 That means that the more RAM in the box, the larger (generally) the inode 
 cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
 it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
 servicing a very light workload, and each syncfs(2) call was taking ~7 
 seconds (usually to write out a single inode).

 A possible workaround for such boxes is to turn 
 /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
 pages instead of inodes/dentries)...

 I think the take-away though is that we do need to bite the bullet and make 
 FileStore f[data]sync all the right things so that the syncfs call can be 
 avoided.  This is the path you were originally headed down, Somnath, and I 
 think it's the right one.

 The main thing to watch out for is that according to POSIX you really need to 
 fsync directories.  With XFS that isn't the case since all metadata 
 operations are going into the journal and that's fully ordered, but we don't 
 want to allow data loss on e.g. ext4 (we need to check what the metadata 
 ordering behavior is there) or other file systems.

I guess there are only a few directory-modifying operations, is that true?
Maybe we only need to do syncfs when modifying directories?


 :(

 sage

 




-- 
Best Regards,

Wheat


Re: [ANN] ceph-deploy 1.5.27 released

2015-08-05 Thread Nigel Williams

On 6/08/2015 2:22 PM, Travis Rhoden wrote:

A few things in this area changed with 1.5.26. ceph-deploys options
are much more strictly attached only to the commands where they make
sense.


Oh much better, thanks. I did wonder about that, but as it worked I didn't 
revisit.





Re: [ANN] ceph-deploy 1.5.27 released

2015-08-05 Thread Nigel Williams

On 6/08/2015 9:45 AM, Travis Rhoden wrote:

A new version of ceph-deploy has been released. Version 1.5.27
includes the following:


Has the syntax for use of --zap-disk changed? I moved it around but it is no longer 
recognised; worked around by doing a ceph-disk zap before running ceph-deploy.


This worked previously:

ceph-deploy --overwrite-conf osd --zap-disk prepare ceph05:/dev/sdb:/dev/sdd





Hammer branch for v0.94.3 ready for QE

2015-08-05 Thread Abhishek L
Hi Yuri

The hammer branch for v0.94.3, as found at
https://github.com/ceph/ceph/commits/hammer, is approved by the leads
(Sam, Greg, Josh & Yehuda) and is ready for QE. The
backport itself is tracked at http://tracker.ceph.com/issues/11990,
which has a record of all the test runs so far, and
the tip of the branch is at 88e7ee716fdd7bcf81845087021a677de5a50da8.

Regards
Abhishek




[ANN] ceph-deploy 1.5.27 released

2015-08-05 Thread Travis Rhoden
Hi everyone,

A new version of ceph-deploy has been released. Version 1.5.27
includes the following:

 - a new ceph-deploy repo command that allows for adding and
removing custom repo definitions
 - Makes commands like ceph-deploy install --rgw only install the
RGW component of Ceph.

This works for daemons/components such as --rgw, --mds, and --cli,
depending on how packages are split on your distro.  For example,
Debian packages the Ceph MDS into a separate 'ceph-mds' package, and
therefore if you use install --mds only the ceph-mds package will be
installed.  RPM packages do not do this, so it has to install ceph,
which includes MDS, MON, and OSD daemons.  Further package splits are
coming, but right now we do what we can.  (A short usage example follows
the list below.)

 - Some fixes around using DNF (Fedora >= 22)
 - Early support for systemd (Fedora 22 and development Ceph builds only)
 - Loads of internal changes.
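
As a quick illustration of the component-specific installs mentioned above
(host names are made up):

  ceph-deploy install --rgw gwhost1
  ceph-deploy install --mds mdshost1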

Full changelog is at [1].

Updated packages have been uploaded to
{rpm,debian}-{firefly,hammer,testing} repos on ceph.com, and to PyPI.

Cheers,

 - Travis

[1] http://ceph.com/ceph-deploy/docs/changelog.html#id2


radosgw + civetweb latency issue on Hammer

2015-08-05 Thread Srikanth Madugundi
Hi,

After upgrading to Hammer and moving from apache to civetweb, we
started seeing high PUT latency, on the order of 2 sec for every PUT
request. The GET request lo

Attaching the radosgw logs for a single request. The ceph.conf has the
following configuration for civetweb.

[client.radosgw.gateway]
rgw frontends = civetweb port=5632


Further investigation revealed that the call to get_data() at
https://github.com/ceph/ceph/blob/hammer/src/rgw/rgw_op.cc#L1786 is
taking 2 sec to respond. The cluster is running the Hammer 0.94.2 release.

Did anyone face this issue before? Is there some configuration I am missing?

Regards
Srikanth




RE: Erasure Code Plugins : PLUGINS_V3 feature

2015-08-05 Thread Miyamae, Takeshi
Dear Sage,

 note that what this really means is that the on-disk encoding needs to remain 
 fixed.

Thank you for letting us know about this important point.
We have no plan to change shec's format at this moment, but we will keep the
comment in mind for any future changes.

Best Regards,
Takeshi Miyamae

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Thursday, August 6, 2015 3:45 AM
To: Loic Dachary; Miyamae, Takeshi/宮前 剛
Cc: Samuel Just; Ceph Development
Subject: Re: Erasure Code Plugins : PLUGINS_V3 feature

On Wed, 5 Aug 2015, Loic Dachary wrote:
 Hi Sam,
 
 How does this proposal sound ? It would be great if that was done 
 before the feature freeze.

I think it's a good time.

Takeshi, note that what this really means is that the on-disk encoding needs to 
remain fixed.  If we decide to change it down the line, we'll have to make a 
'shec2' or similar so that the old format is still decodable (or ensure that 
existing data can still be read in some other way).

Sound good?

sage


 
 Cheers
 
 On 29/07/2015 11:16, Loic Dachary wrote:
  Hi Sam,
  
  The SHEC plugin[0] has been running in the rados runs[1] in the past few 
  months. It also has a matching corpus verification which runs on every make 
  check[2] as well as its optimized variants. I believe the flag 
  experimental can now be removed. 
  
  In order to do so, we need to use a PLUGINS_V3 feature, in the same way we 
  did back in Giant when the ISA and LRC plugins were introduced[3]. This 
  won't be necessary in the future, when there is a generic plugin mechanism, 
  but right now that's what we need. It would be a commit very similar to the 
  one implementing PLUGINS_V2[4].
  
  Is this agreeable to you ? Or would you rather see another way to resolve 
  this ?
  
  Cheers
  
  [0] https://github.com/ceph/ceph/tree/master/src/erasure-code/shec
  [1] https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code-shec
  [2] https://github.com/ceph/ceph-erasure-code-corpus/blob/master/v0.92-988/non-regression.sh#L52
  [3] http://tracker.ceph.com/issues/9343
  [4] https://github.com/ceph/ceph/commit/9687150ceac9cc7e506bc227f430d4207a6d7489
  
 
 --
 Loïc Dachary, Artisan Logiciel Libre