Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Owen Synge
On 06/03/2015 06:26 PM, Sage Weil wrote:
 On Wed, 3 Jun 2015, Owen Synge wrote:
 Dear ceph-devel,

 Linux has more than one init system.

 We in SUSE are in the process of upstreaming our spec files, and all
 our releases are systemd-based.

 Ceph seems more tested with sysVinit upstream.

 We have 3 basic options for doing this in a packaged upstream system.

 1) We don't install init scripts/config as part of make install and
 install all the init components via conditionals in the spec file.

 2) We install all init scripts/config for all flavours of init using
 make install and delete unwanted init systems via conditionals in the
 spec file.

 3) We add an autotools conditional for each init system, and only
 install the enabled init system's scripts/config with make install.

<snip/>


 There are many ways to follow policy (3), so I would propose that when no
 init system is specified, policy (1) and policy (3) should appear identical.

 -
 
 Let's do it!

Great :)

<snip/>


 I'm hoping that phase 3 can be avoided entirely.  The upgrade/conversion 
 path (at least for upstream packages) will be firefly -> infernalis; I 
 don't think it will be that useful to build infernalis packages that do 
 sysvinit for systemd distros.  (Maybe this situation gets more 
 complicated if we backport this transition to hammer or downstream does 
 the same, but even then the transition will be an upgrade one.)

Agreed,

<snip/>

 Also, I think we should do 1 and 2 basically at the same time.  I don't 
 think it's worth spending any effort trying to make things behave with 
 just 1 (and not 2).
 
 Am I talking sense?  I can never tell with this stuff.  :)
 
 sage

I think you speak sense,

If I understand right, you favor the user interface as:

--with-init=systemd
--with-init=sysv
--with-init=upstart
--with-init=bsd

This is wiser when you start adding up all the possible init systems
that can exist.

Owen


Re: osd crash with object store set to newstore

2015-06-03 Thread Srikanth Madugundi
Hi Sage,

I saw the crash again; here is the output after adding the debug
message from wip-newstore-debuglist:


   -31 2015-06-03 20:28:18.864496 7fd95976b700 -1
newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
--.7fff..!!!.


Here is the id of the file I posted.

ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

Let me know if you need anything else.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
srikanth.madugu...@gmail.com wrote:
 Hi Sage,

 Unfortunately I purged the cluster yesterday and restarted the
 backfill tool. I did not see the osd crash yet on the cluster. I am
 monitoring the OSDs and will update you once I see the crash.

 With the new backfill run I have reduced the rps by half, not sure if
 this is the reason for not seeing the crash yet.

 Regards
 Srikanth


 On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639, here is the log message


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


 Just before the crash, here are the debug statements printed by the
 method (collection_list_partial)

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I built ceph code from wip-newstore on RHEL7 and am running performance
  tests to compare with filestore. After a few hours of running the tests
  the osd daemons started to crash. Here is the stack trace, the osd
  crashes immediately after the restart. So I could not get the osd up
  and running.
 
  ceph version b8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
  snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) 
  [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) 
  [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
  ThreadPool::TPHandle&)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
  std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
  [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
  needed to interpret this.
 
  Please let me know the cause of this crash. When this crash happens I
  noticed that two osds on separate machines are down. I can bring one
  osd up but restarting the other osd causes both OSDs to crash. My
  understanding is the crash seems to happen when two OSDs try to
  communicate and replicate a particular PG.
 
  Can you include the log lines that precede the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage

Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Ken Dreyer
On 06/03/2015 03:38 PM, Sage Weil wrote:
 On Wed, 3 Jun 2015, Ken Dreyer wrote:
 On 06/03/2015 02:45 PM, Sage Weil wrote:
 Sounds good to me.  It could (should?) even error out if no init system is
 specified?  Otherwise someone will likely be in for a surprise.

 I was picturing that we'd just autodetect based on OS version (eg Ubuntu
 15.04 should default to --with-init=systemd). It's one less thing to get
 wrong during the build process.

 What do you think?
 
 ./configure ... --with-init=`src/ceph-detect-init` ?
 
 sage
 

I should have been clearer. I was thinking that we'd call that
detect-init script inside ./configure itself, unless the user specifies
--with-init=foo.

- Ken


Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Ken Dreyer
On 06/03/2015 03:38 PM, Gregory Farnum wrote:
 We could maybe autodetect if they don't specify one?

Sorry, yes, that's what I meant; my last email was unclear.

- Ken


Re: Discuss: New default recovery config settings

2015-06-03 Thread Sage Weil
On Mon, 1 Jun 2015, Gregory Farnum wrote:
 On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz
 pvonstamw...@us.fujitsu.com wrote:
  On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote:
  On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
   Many people have reported that they need to lower the osd recovery 
   config options to minimize the impact of recovery on client io.  We are 
   talking about changing the defaults as follows:
  
   osd_max_backfills to 1 (from 10)
   osd_recovery_max_active to 3 (from 15)
   osd_recovery_op_priority to 1 (from 10)
   osd_recovery_max_single_start to 1 (from 5)
 
  I'm under the (possibly erroneous) impression that reducing the number of 
  max backfills doesn't actually reduce recovery speed much (but will reduce 
  memory use), but that dropping the op priority can. I'd rather we make 
  users manually adjust values which can have a material impact on their 
  data safety, even if most of them choose to do so.
 
  After all, even under our worst behavior we're still doing a lot better 
  than a resilvering RAID array. ;) -Greg
  --
 
 
  Greg,
  When we set...
 
  osd recovery max active = 1
  osd max backfills = 1
 
  We see rebalance times go down by more than half and client write 
  performance increase significantly while rebalancing. We initially played 
  with these settings to improve client IO expecting recovery time to get 
  worse, but we got a 2-for-1.
  This was with firefly using replication, downing an entire node with lots 
  of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and 
  osd_recovery_max_single_start default.
 
  We dropped osd_recovery_max_active and osd_max_backfills together. If 
  you're right, do you think osd_recovery_max_active=1 is the primary reason for 
  the improvement? (higher osd_max_backfills helps recovery time with erasure 
  coding.)
 
 Well, recovery max active and max backfills are similar in many ways.
 Both are about moving data into a new or outdated copy of the PG - the
 difference is that recovery refers to our log-based recovery (where we
 compare the PG logs and move over the objects which have changed)
 whereas backfill requires us to incrementally move through the entire
 PG's hash space and compare.
 I suspect dropping down max backfills is more important than reducing
 max recovery (gathering recovery metadata happens largely in memory)
 but I don't really know either way.
 
 My comment was meant to convey that I'd prefer we not reduce the
 recovery op priority levels. :)

We could make a less extreme move than to 1, but IMO we have to reduce it 
one way or another.  Every major operator I've talked to does this, our PS 
folks have been recommending it for years, and I've yet to see a single 
complaint about recovery times... meanwhile we're drowning in a sea of 
complaints about the impact on clients.

How about

 osd_max_backfills to 1 (from 10)
 osd_recovery_max_active to 3 (from 15)
 osd_recovery_op_priority to 3 (from 10)
 osd_recovery_max_single_start to 1 (from 5)

(same as above, but 1/3rd the recovery op prio instead of 1/10th)
?

sage
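
For reference, the values proposed above written as a ceph.conf fragment (a
sketch only; these are the existing option names in their usual [osd]-section
form, which operators could apply by hand today):

    [osd]
      osd max backfills = 1
      osd recovery max active = 3
      osd recovery op priority = 3
      osd recovery max single start = 1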


Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Sage Weil
On Wed, 3 Jun 2015, Ken Dreyer wrote:
 On 06/03/2015 02:45 PM, Sage Weil wrote:
  Sounds good to me.  It could (should?) even error out if no init system is
  specified?  Otherwise someone will likely be in for a surprise.
 
 I was picturing that we'd just autodetect based on OS version (eg Ubuntu
 15.04 should default to --with-init=systemd). It's one less thing to get
 wrong during the build process.
 
 What do you think?

./configure ... --with-init=`src/ceph-detect-init` ?

sage


Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Gregory Farnum
On Wed, Jun 3, 2015 at 2:36 PM, Ken Dreyer kdre...@redhat.com wrote:
 On 06/03/2015 02:45 PM, Sage Weil wrote:
 Sounds good to me.  It could (should?) even error out if no init system is
 specified?  Otherwise someone will likely be in for a surprise.

 I was picturing that we'd just autodetect based on OS version (eg Ubuntu
 15.04 should default to --with-init=systemd). It's one less thing to get
 wrong during the build process.

 What do you think?

Debian users will get very angry at us. ;)
We could maybe autodetect if they don't specify one?
-Greg


Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Sage Weil
On Wed, 3 Jun 2015, Ken Dreyer wrote:
 On 06/03/2015 03:38 PM, Sage Weil wrote:
  On Wed, 3 Jun 2015, Ken Dreyer wrote:
  On 06/03/2015 02:45 PM, Sage Weil wrote:
  Sounds good to me.  It could (should?) even error out if no init system is
  specified?  Otherwise someone will likely be in for a surprise.
 
  I was picturing that we'd just autodetect based on OS version (eg Ubuntu
  15.04 should default to --with-init=systemd). It's one less thing to get
  wrong during the build process.
 
  What do you think?
  
  ./configure ... --with-init=`src/ceph-detect-init` ?
  
  sage
  
 
 I should have been clearer. I was thinking that we'd call that
 detect-init script inside ./configure itself, unless the user specifies
 --with-init=foo.

Works for me, as long as there is only 1 piece of code (ceph-detect-init) 
that does the detection!

sage
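
A rough sketch of what this could look like in configure.ac, assuming a
--with-init option and the src/ceph-detect-init helper discussed above (the
option name, values and conditionals are illustrative, not the final
implementation):

    AC_ARG_WITH([init],
                [AS_HELP_STRING([--with-init=TYPE],
                                [init system to install files for (systemd|sysv|upstart|bsd)])],
                [init_system=$withval],
                [init_system=`$srcdir/src/ceph-detect-init`])
    AM_CONDITIONAL([WITH_SYSTEMD],  [test "x$init_system" = "xsystemd"])
    AM_CONDITIONAL([WITH_SYSVINIT], [test "x$init_system" = "xsysv"])
    AM_CONDITIONAL([WITH_UPSTART],  [test "x$init_system" = "xupstart"])

The conditionals could then guard the init-script install rules in the
relevant Makefile.am, which keeps ceph-detect-init as the single piece of
detection code.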



Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Ken Dreyer
On 06/03/2015 02:45 PM, Sage Weil wrote:
 Sounds good to me.  It could (should?) even error out if no init system is
 specified?  Otherwise someone will likely be in for a surprise.

I was picturing that we'd just autodetect based on OS version (eg Ubuntu
15.04 should default to --with-init=systemd). It's one less thing to get
wrong during the build process.

What do you think?

- Ken


Re: RBD journal draft design

2015-06-03 Thread Gregory Farnum
On Wed, Jun 3, 2015 at 9:13 AM, Jason Dillaman dilla...@redhat.com wrote:
  In contrast to the current journal code used by CephFS, the new journal
  code will use sequence numbers to identify journal entries, instead of
  offsets within the journal.

 Am I misremembering what actually got done with our journal v2 format?
 I think this is done — or at least we made a move in this direction.

 Assuming journal v2 is the code in osdc/Journaler.cc, there is a new 
 resilient format that helps in detecting corruption, but it appears to be 
 still largely based upon offsets and using the Filer/Striper for I/O.  This 
 does remind me that I probably want to include a magic preamble value at the 
 start of each journal entry to facilitate recovery.

Ah yeah, I was confusing the changes we did there and in our MDLog
wrapper bits. Ignore me on this bit.


  A new journal object class method will be used to submit journal entry
  append requests.  This will act as a gatekeeper for the concurrent client
  case.

 The object class is going to be a big barrier to using EC pools;
 unless you want to block the use of EC pools on EC pools supporting
 object classes. :(

 Josh mentioned (via Sam) that reads were not currently supported by object 
 classes on EC pools.  Are appends not supported either?

We discussed this briefly and certain object class functions might
work by mistake on EC pools, but you should assume nothing does (is
my recollection of the conclusions). For instance, even if it's
technically possible, the append thing is really hard for this sort of
write; I think I mentioned in Josh's thread about needing to have an
entire stripe at a time (and the smallest you could even think about
doing reasonably is 4KB * N, and really that's not big enough given
metadata overheads).


 A successful append will indicate whether or not the journal is now full
 (larger than the max object size), indicating to the client that a new
 journal object should be used.  If the journal is too large, an error code
 response would alert the client that it needs to write to the current
 active journal object.  In practice, the only time the journaler should
 expect to see such a response would be in the case where multiple clients
 are using the same journal and the active object update notification has
 yet to be received.

 I'm confused. How does this work with the splay count thing you
 mentioned above? Can you define splay count?

 Similar to the stripe width.

Okay, that sort of makes sense but I don't see how you could legally
be writing to different sets so why not just make it an explicit
striping thing and move all journal entries for that set at once?

...Actually, doesn't *not* forcing a coordinated move from one object
set to another mean that you don't actually have an ordering guarantee
across tags if you replay the journal objects in order?



 What happens if users submit sequenced entries substantially out of
 order? It sounds like if you have multiple writers (or even just a
 misbehaving client) it would not be hard for one of them to grab
 sequence value N, for another to fill up one of the journal entry
 objects with sequences in the range [N+1]...[N+x] and then for the
 user of N to get an error response.

 I was thinking that when a client submits their journal entry payload, the 
 journaler will allocate the next available sequence number, compute which 
 active journal object that sequence should be submitted to, and start an AIO 
 append op to write the journal entry.  The next journal entry to be appended 
 to the same journal object would be splay count/width entries later.  This 
 does bring up a good point that if you are generating journal entries fast 
 enough, the delayed response saying the object is full could cause multiple 
 later journal entry ops to need to be resent to the new (non-full) object.  
 Given that, it might be best to scrap the hard error when the journal object 
 gets full and just let the journaler eventually switch to a new object when 
 it receives a response saying the object is now full.

I was misunderstanding where the seqs came from and that they were
associated with the tag, not the journal. So this shouldn't be such a
problem.


 
  Since the journal is designed to be append-only, there needs to be support
  for cases where journal entry needs to be updated out-of-band (e.g. fixing
  a corrupt entry similar to CephFS's current journal recovery tools).  The
  proposed solution is to just append a new journal entry with the same
  sequence number as the record to be replaced to the end of the journal
  (i.e. last entry for a given sequence number wins).  This also protects
  against accidental replays of the original append operation.  An
  alternative suggestion would be to use a compare-and-swap mechanism to
  update the full journal object with the updated contents.

 I'm confused by this bit. It seems to imply that fetching a single
 entry requires checking the 

Re: Discuss: New default recovery config settings

2015-06-03 Thread Gregory Farnum
On Wed, Jun 3, 2015 at 3:44 PM, Sage Weil s...@newdream.net wrote:
 On Mon, 1 Jun 2015, Gregory Farnum wrote:
 On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz
 pvonstamw...@us.fujitsu.com wrote:
  On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote:
  On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
   Many people have reported that they need to lower the osd recovery 
   config options to minimize the impact of recovery on client io.  We are 
   talking about changing the defaults as follows:
  
   osd_max_backfills to 1 (from 10)
   osd_recovery_max_active to 3 (from 15)
   osd_recovery_op_priority to 1 (from 10)
   osd_recovery_max_single_start to 1 (from 5)
 
  I'm under the (possibly erroneous) impression that reducing the number of 
  max backfills doesn't actually reduce recovery speed much (but will 
  reduce memory use), but that dropping the op priority can. I'd rather we 
  make users manually adjust values which can have a material impact on 
  their data safety, even if most of them choose to do so.
 
  After all, even under our worst behavior we're still doing a lot better 
  than a resilvering RAID array. ;) -Greg
  --
 
 
  Greg,
  When we set...
 
  osd recovery max active = 1
  osd max backfills = 1
 
  We see rebalance times go down by more than half and client write 
  performance increase significantly while rebalancing. We initially played 
  with these settings to improve client IO expecting recovery time to get 
  worse, but we got a 2-for-1.
  This was with firefly using replication, downing an entire node with lots 
  of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and 
  osd_recovery_max_single_start default.
 
  We dropped osd_recovery_max_active and osd_max_backfills together. If 
  you're right, do you think osd_recovery_max_active=1 is the primary reason for 
  the improvement? (higher osd_max_backfills helps recovery time with 
  erasure coding.)

 Well, recovery max active and max backfills are similar in many ways.
 Both are about moving data into a new or outdated copy of the PG - the
 difference is that recovery refers to our log-based recovery (where we
 compare the PG logs and move over the objects which have changed)
 whereas backfill requires us to incrementally move through the entire
 PG's hash space and compare.
 I suspect dropping down max backfills is more important than reducing
 max recovery (gathering recovery metadata happens largely in memory)
 but I don't really know either way.

 My comment was meant to convey that I'd prefer we not reduce the
 recovery op priority levels. :)

 We could make a less extreme move than to 1, but IMO we have to reduce it
 one way or another.  Every major operator I've talked to does this, our PS
 folks have been recommending it for years, and I've yet to see a single
 complaint about recovery times... meanwhile we're drowning in a sea of
 complaints about the impact on clients.

 How about

  osd_max_backfills to 1 (from 10)
  osd_recovery_max_active to 3 (from 15)
  osd_recovery_op_priority to 3 (from 10)
  osd_recovery_max_single_start to 1 (from 5)

 (same as above, but 1/3rd the recovery op prio instead of 1/10th)
 ?

Do we actually have numbers for these changes individually? We might,
but I have a suspicion that at some point there was just a "well, you
could turn them all down" comment and that state was preferred to our
defaults.

I mean, I have no real knowledge of how changing the op priority
impacts things, but I don't think many (any?) other people do either,
so I'd rather mutate slowly and see if that works better. :)
Especially given Paul's comment that just the recovery_max and
max_backfills values made a huge positive difference without any
change to priorities.
-Greg


Re: RBD journal draft design

2015-06-03 Thread John Spray



On 02/06/2015 16:11, Jason Dillaman wrote:

I am posting to get wider review/feedback on this draft design.  In support of 
the RBD mirroring feature [1], a new client-side journaling class will be 
developed for use by librbd.  The implementation is designed to carry opaque 
journal entry payloads so it will be possible to be re-used in other 
applications as well in the future.  It will also use the librados API for all 
operations.  At a high level, a single journal will be composed of a journal 
header to store metadata and multiple journal objects to contain the individual 
journal entries.

...
A new journal object class method will be used to submit journal entry append 
requests.  This will act as a gatekeeper for the concurrent client case.  A 
successful append will indicate whether or not the journal is now full (larger 
than the max object size), indicating to the client that a new journal object 
should be used.  If the journal is too large, an error code response would 
alert the client that it needs to write to the current active journal object.  
In practice, the only time the journaler should expect to see such a response 
would be in the case where multiple clients are using the same journal and the 
active object update notification has yet to be received.


Can you clarify the procedure when a client write gets an "I'm full" 
return code from a journal object?  The key part I'm not clear on is 
whether the client will first update the header to add an object to the 
active set (and then write it) or whether it goes ahead and writes 
objects and then lazily updates the header.
* If it's object first, header later, what bounds how far ahead of the 
active set we have to scan when doing recovery?
* If it's header first, object later, that's an uncomfortable bit of 
latency whenever we cross an object bound


Nothing intractable about mitigating either case, just wondering what 
the idea is in this design.




In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers 
to identify journal entries, instead of offsets within the journal.  Additionally, a given journal 
entry will not be striped across multiple journal objects.  Journal entries will be mapped to journal 
objects using the sequence number: sequence number mod splay count == object 
number mod splay count for active journal objects.

The rationale for this difference is to facilitate parallelism for appends as 
journal entries will be splayed across a configurable number of journal 
objects.  The journal API for appending a new journal entry will return a 
future which can be used to retrieve the assigned sequence number for the 
submitted journal entry payload once committed to disk. The use of a future 
allows for asynchronous journal entry submissions by default and can be used to 
simplify integration with the client-side cache writeback handler (and as a 
potential future enhancement to delay appends to the journal in order to satisfy 
EC-pool alignment requirements).


When two clients are both doing splayed writes, and they both send writes in parallel, it 
seems like the per-object fullness check via the object class could result in the writes 
getting staggered across different objects.  E.g. if we have two objects that both only 
have one slot left, then A could end up taking the slot in one (call it 1) and B could 
end up taking the slot in the other (call it 2).  Then when B's write lands at to object 
1, it gets a I'm full response and has to send the entry... where?  I guess 
to some arbitrarily-higher-numbered journal object depending on how many other writes 
landed in the meantime.

This potentially leads to the stripes (splays?) of a given journal entry being 
separated arbitrarily far across different journal objects, which would be fine 
as long as everything was well formed, but will make detecting issues during 
replay harder (would have to remember partially-read entries when looking for 
their remaining stripes through rest of journal).

You could apply the object class behaviour only to the object containing the 
0th splay, but then you'd have to wait for the write there to complete before 
writing to the rest of the splays, so the latency benefit would go away.  Or 
its equally possible that there's a trick in the design that has gone over my 
head :-)

Cheers,
John



Teuthology error 'exception on parallel execution'

2015-06-03 Thread Wang, Zhiqiang
Hi all,

Several teuthology jobs fail with the error 'exception on parallel execution' 
in my testing. I don't see any failures/errors/assertions in the mon/osd logs. 
These failed jobs always have a 'sentry event' log; however, I'm not able to 
open the sentry link.

Here is an example:
http://pulpito.ceph.com/yuan-2015-06-02_20:18:44-rados-wip-proxy-write---basic-multi/918584/
Log: 
http://qa-proxy.ceph.com/teuthology/yuan-2015-06-02_20:18:44-rados-wip-proxy-write---basic-multi/918584/teuthology.log
Sentry event: 
http://sentry.ceph.com/sepia/teuthology/search?q=38a00a22270748e79e8f0dd69a624ab4

2015-06-02T20:27:18.149 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/workunit.py", line 361, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 156, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test cephtool/test.sh) on plana52 
with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- 
/home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 
CEPH_REF=d5a52869dd23ae1fbdf3e80d54e855560e467ff6 
TESTDIR=/home/ubuntu/cephtest CEPH_ID=0 PATH=$PATH:/usr/sbin adjust-ulimits 
ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h 
/home/ubuntu/cephtest/workunit.client.0/cephtool/test.sh'
2015-06-02T20:27:18.150 ERROR:teuthology.run_tasks:Saw exception from tasks.

Is this something wrong with teuthology itself? Does anyone have experience with 
this? Any hints are welcome.


06/03/2015 Weekly Ceph Performance Meeting IS ON!

2015-06-03 Thread Mark Nelson
8AM PST as usual! Discussion topics include:  cache tiering update. 
Please feel free to add your own!


Here are the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560(US Toll Free)
  +1 408 317 9253(Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark


RE: 'Racing read got wrong version' during proxy write testing

2015-06-03 Thread Wang, Zhiqiang
I ran into the 'op not idempotent' problem during the testing today. There is 
one bug in the previous fix. In that fix, we copy the reqids in the final step 
of 'fill_in_copy_get'. If the object is deleted, since the 'copy get' op is a 
read op, it returns earlier with ENOENT in do_op. No reqids will be copied 
during promotion in this case. This again leads to the 'op not idempotent' 
problem. We need a 'smart' way to detect the op is a 'copy get' op (looping the 
ops vector doesn't seem smart?) and copy the reqids in this case.

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Tuesday, May 26, 2015 12:27 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: Re: 'Racing read got wrong version' during proxy write testing

On Mon, 25 May 2015, Wang, Zhiqiang wrote:
 Hi all,
 
 I ran into a problem during the teuthology test of proxy write. It is like 
 this:
 
 - Client sends 3 writes and a read on the same object to base tier
 - Set up cache tiering
 - Client retries ops and sends the 3 writes and 1 read to the cache 
 tier
 - The 3 writes finished on the base tier, say with versions v1, v2 and 
 v3
 - Cache tier proxies the 1st write, and start to promote the object 
 for the 2nd write, the 2nd and 3rd writes and the read are blocked
 - The proxied 1st write finishes on the base tier with version v4, and 
 returns to cache tier. But somehow the cache tier fails to send the 
 reply due to socket failure injecting
 - Client retries the writes and the read again, the writes are 
 identified as dup ops
 - The promotion finishes, it copies the pg_log entries from the base 
 tier and put it in the cache tier's pg_log. This includes the 3 writes 
 on the base tier and the proxied write
 - The writes dispatches after the promotion, they are identified as 
 completed dup ops. Cache tier replies these write ops with the version 
 from the base tier (v1, v2 and v3)
 - In the last, the read dispatches, it reads the version of the 
 proxied write (v4) and replies to client
 - Client complains that 'racing read got wrong version'
 
 In a previous discussion of the 'ops not idempotent' problem, we solved it by 
 copying the pg_log entries in the base tier to cache tier during promotion. 
 Seems like there is still a problem with this approach in the above scenario. 
 My first thought is that when proxying the write, the cache tier should use 
 the original reqid from the client. But currently we don't have a way to pass 
 the original reqid from cache to base. Any ideas?

I agree--I think the correct fix here is to make the proxied op be recognized 
as a dup.  We can either do that by passing in an optional reqid to the 
Objecter, or extending the op somehow so that both reqids are listed.  I think 
the first option will be cleaner, but I think we will also need to make sure 
the 'retry' count is preserved as (I think) we skip the dup check if retry==0.  
And we probably want to preserve the behavior that a given (reqid, retry) only 
exists once in the system.

This probably means adding more optional args to Objecter::read()...?

sage


packaging init systems in a more autotools style way.

2015-06-03 Thread Owen Synge
Dear ceph-devel,

Linux has more than one init system.

We in SUSE are in the process of upstreaming our spec files, and all
our releases are systemd-based.

Ceph seems more tested with sysVinit upstream.

We have 3 basic options for doing this in a packaged upstream system.

1) We don't install init scripts/config as part of make install and
install all the init components via conditionals in the spec file.

2) We install all init scripts/config for all flavours of init using
make install and delete unwanted init systems via conditionals in the
spec file.

3) We add an autotools conditional for each init system, and only
install the enabled init system's scripts/config with make install.

--

We are currently following policy (1)

I propose we follow policy (3) because (1) requires many distribution-specific
conditionals and duplication for each platform for all
files not installed with make install.

-

There are many ways to follow policy (3), so I would propose that when no
init system is specified, policy (1) and policy (3) should appear identical.

-

For a transition period from policy (1) to policy (3):

phase (1)

I would expect we would add a conditional to ceph.spec for suse to add
to the configure step:

--with-init-systemd

And when other distributions want to move to a full systemd flavour they
also add a similar conditional.

phase (2)

We add a new configure level conditional:

--with-init-sysv

All sysV init installs are removed from the spec file and added to
the make install process.

phase (3)

Distributions with more than one init system, or init systems that can
emulate sysVinit, can build packages with either init system and so
migration can be tested.
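
To make phase (1) concrete, a hypothetical ceph.spec fragment (standard RPM
macros; --with-init-systemd is the flag proposed above, so treat the exact
spelling as illustrative):

    %if 0%{?suse_version}
    %configure --with-init-systemd
    %else
    %configure
    %endif

In phase (2) the %else branch would grow a --with-init-sysv (or whichever
flavour the distro prefers) and the per-file init conditionals in the spec
would go away.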

-

Does anyone object to this plan?
Does anyone agree with this plan?
Does anyone see difficulties with the plan?

Best regards

Owen


-- 
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB
21284 (AG
Nürnberg)

Maxfeldstraße 5

90409 Nürnberg

Germany


Re: RBD journal draft design

2015-06-03 Thread Jason Dillaman
  In contrast to the current journal code used by CephFS, the new journal
  code will use sequence numbers to identify journal entries, instead of
  offsets within the journal.
 
 Am I misremembering what actually got done with our journal v2 format?
 I think this is done — or at least we made a move in this direction.

Assuming journal v2 is the code in osdc/Journaler.cc, there is a new 
resilient format that helps in detecting corruption, but it appears to be 
still largely based upon offsets and using the Filer/Striper for I/O.  This 
does remind me that I probably want to include a magic preamble value at the 
start of each journal entry to facilitate recovery.

  A new journal object class method will be used to submit journal entry
  append requests.  This will act as a gatekeeper for the concurrent client
  case.
 
 The object class is going to be a big barrier to using EC pools;
 unless you want to block the use of EC pools on EC pools supporting
 object classes. :(

Josh mentioned (via Sam) that reads were not currently supported by object 
classes on EC pools.  Are appends not supported either?

 A successful append will indicate whether or not the journal is now full
 (larger than the max object size), indicating to the client that a new
 journal object should be used.  If the journal is too large, an error code
 response would alert the client that it needs to write to the current
 active journal object.  In practice, the only time the journaler should
 expect to see such a response would be in the case where multiple clients
 are using the same journal and the active object update notification has
 yet to be received.
 
 I'm confused. How does this work with the splay count thing you
 mentioned above? Can you define splay count?

Similar to the stripe width.

 What happens if users submit sequenced entries substantially out of
 order? It sounds like if you have multiple writers (or even just a
 misbehaving client) it would not be hard for one of them to grab
 sequence value N, for another to fill up one of the journal entry
 objects with sequences in the range [N+1]...[N+x] and then for the
 user of N to get an error response.

I was thinking that when a client submits their journal entry payload, the 
journaler will allocate the next available sequence number, compute which 
active journal object that sequence should be submitted to, and start an AIO 
append op to write the journal entry.  The next journal entry to be appended to 
the same journal object would be splay count/width entries later.  This does 
bring up a good point that if you are generating journal entries fast enough, 
the delayed response saying the object is full could cause multiple later 
journal entry ops to need to be resent to the new (non-full) object.  Given 
that, it might be best to scrap the hard error when the journal object gets 
full and just let the journaler eventually switch to a new object when it 
receives a response saying the object is now full.

 
  Since the journal is designed to be append-only, there needs to be support
  for cases where journal entry needs to be updated out-of-band (e.g. fixing
  a corrupt entry similar to CephFS's current journal recovery tools).  The
  proposed solution is to just append a new journal entry with the same
  sequence number as the record to be replaced to the end of the journal
  (i.e. last entry for a given sequence number wins).  This also protects
  against accidental replays of the original append operation.  An
  alternative suggestion would be to use a compare-and-swap mechanism to
  update the full journal object with the updated contents.
 
 I'm confused by this bit. It seems to imply that fetching a single
 entry requires checking the entire object to make sure there's no
 replacement. Certainly if we were doing replay we couldn't just apply
 each entry sequentially any more because an overwritten entry might
 have its value replaced by a later (by sequence number) entry that
 occurs earlier (by offset) in the journal.

The goal would be to use prefetching on the replay.  Since the whole object is 
already in-memory, scanning for duplicates would be fairly trivial.  If there 
is a way to prevent the OSDs from potentially replaying a duplicate append 
journal entry message, the CAS update technique could be used.
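
A minimal sketch of the "last entry for a given sequence number wins" scan
described above, assuming the replay code already has all entries of one
journal object decoded in memory; the types and names here are hypothetical,
not the actual journaler API:

    // Sketch only: keep the most recently appended entry per sequence
    // number, then replay in sequence order.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct JournalEntry {
      uint64_t seq;
      std::string payload;   // opaque journal entry payload
    };

    std::vector<JournalEntry> dedup_for_replay(
        const std::vector<JournalEntry>& appended) {   // in append order
      std::map<uint64_t, JournalEntry> latest;
      for (const auto& e : appended)
        latest[e.seq] = e;               // a later append wins
      std::vector<JournalEntry> ordered;
      ordered.reserve(latest.size());
      for (const auto& p : latest)       // std::map iterates by seq
        ordered.push_back(p.second);
      return ordered;
    }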

 I'd also like it if we could organize a single Journal implementation
 within the Ceph project, or at least have a blessed one going forward
 that we use for new stuff and might plausibly migrate existing users
 to. The big things I see different from osdc/Journaler are:

Agreed.  While librbd will be the first user of this, I wasn't planning to 
locate it within the librbd library.

 1) (design) class-based
 2) (design) uses librados instead of Objecter (hurray)
 3) (need) should allow multiple writers
 4) (fallout of other choices?) does not stripe entries across multiple
 objects

For striping, I assume this is a function of how large MDS 

Re: RBD journal draft design

2015-06-03 Thread Jason Dillaman
  A new journal object class method will be used to submit journal entry
  append requests.  This will act as a gatekeeper for the concurrent client
  case.  A successful append will indicate whether or not the journal is now
  full (larger than the max object size), indicating to the client that a
  new journal object should be used.  If the journal is too large, an error
  code response would alert the client that it needs to write to the current
  active journal object.  In practice, the only time the journaler should
  expect to see such a response would be in the case where multiple clients
  are using the same journal and the active object update notification has
  yet to be received.
 
  Can you clarify the procedure when a client write gets an "I'm full"
  return code from a journal object?  The key part I'm not clear on is
 whether the client will first update the header to add an object to the
 active set (and then write it) or whether it goes ahead and writes
 objects and then lazily updates the header.
 * If it's object first, header later, what bounds how far ahead of the
 active set we have to scan when doing recovery?
  * If it's header first, object later, that's an uncomfortable bit of
  latency whenever we cross an object bound
 
 Nothing intractable about mitigating either case, just wondering what
 the idea is in this design.

I was thinking object first, header later.  As I mentioned in my response to 
Greg, I now think this "I'm full" should only be used as a guide to kick future 
(un-submitted) requests over to a new journal object.  For example, if you 
submitted 16 4K AIO journal entry append requests, it's possible that the first 
request filled the journal -- so now your soft max size will include an extra 
15 4K journal entries before the response to the first request indicates that 
the journal object is full and future requests should use a new journal object.

  The rationale for this difference is to facilitate parallelism for appends
  as journal entries will be splayed across a configurable number of journal
  objects.  The journal API for appending a new journal entry will return a
  future which can be used to retrieve the assigned sequence number for the
  submitted journal entry payload once committed to disk. The use of a
  future allows for asynchronous journal entry submissions by default and
  can be used to simplify integration with the client-side cache writeback
  handler (and as a potential future enhancement to delay appends to the
  journal in order to satisfy EC-pool alignment requirements).
 
 When two clients are both doing splayed writes, and they both send writes in
 parallel, it seems like the per-object fullness check via the object class
 could result in the writes getting staggered across different objects.  E.g.
 if we have two objects that both only have one slot left, then A could end
 up taking the slot in one (call it 1) and B could end up taking the slot in
 the other (call it 2).  Then when B's write lands at to object 1, it gets a
 I'm full response and has to send the entry... where?  I guess to some
 arbitrarily-higher-numbered journal object depending on how many other
 writes landed in the meantime.

In this case, assuming B sent the request to journal object 0, it would send 
the re-request to journal object 0 + splay width since the request sequence 
number mod splay width must equal object number mod splay width.  
However, at this point I think it would be better to eliminate the "I'm full" 
error code and stick with the extra soft max object size.

 This potentially leads to the stripes (splays?) of a given journal entry
 being separated arbitrarily far across different journal objects, which
 would be fine as long as everything was well formed, but will make detecting
 issues during replay harder (would have to remember partially-read entries
 when looking for their remaining stripes through rest of journal).
 
 You could apply the object class behaviour only to the object containing the
 0th splay, but then you'd have to wait for the write there to complete
 before writing to the rest of the splays, so the latency benefit would go
 away.  Or it's equally possible that there's a trick in the design that has
 gone over my head :-)

I'm probably missing something here.  A journal entry won't be partially 
striped across multiple journal objects.  The journal entry in its entirety 
would be written to one of the splay width active journal objects.
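
A small sketch of the sequence-number-to-object mapping implied by the
invariant above (sequence number mod splay width == object number mod splay
width); the function names and the active-set bookkeeping are assumptions for
illustration, not taken from the design:

    #include <cstdint>

    // active_set_start is assumed to be the lowest object number of the
    // current active set and a multiple of splay_width.
    uint64_t journal_object_for_seq(uint64_t seq,
                                    uint64_t splay_width,
                                    uint64_t active_set_start) {
      return active_set_start + (seq % splay_width);
    }

    // If that object reports it is full, the entry is resent to the object
    // one active set further on, keeping the same residue mod splay_width
    // (e.g. object 0 -> object 0 + splay_width, as described above).
    uint64_t next_object_same_slot(uint64_t object_num, uint64_t splay_width) {
      return object_num + splay_width;
    }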


Re: packaging init systems in a more autotools style way.

2015-06-03 Thread Sage Weil
On Wed, 3 Jun 2015, Owen Synge wrote:
 Dear ceph-devel,
 
 Linux has more than one init system.
 
 We in SUSE are in the process of upstreaming our spec files, and all
 our releases are systemd-based.
 
 Ceph seems more tested with sysVinit upstream.
 
 We have 3 basic options for doing this in a packaged upstream system.
 
 1) We don't install init scripts/config as part of make install and
 install all the init components via conditionals in the spec file.
 
 2) We install all init scripts/config for all flavours of init using
 make install and delete unwanted init systems via conditionals in the
 spec file.
 
 3) We add an autotools conditional for each init system, and only
 install the enabled init system's scripts/config with make install.
 
 --
 
 We are currently following policy (1)
 
 I propose we follow policy (3) because (1) requires many distribution-specific
 conditionals and duplication for each platform for all
 files not installed with make install.
 
 -
 
 There are many ways to follow policy (3), so I would propose that when no
 init system is specified, policy (1) and policy (3) should appear identical.
 
 -

Let's do it!

 For a transition period from policy (1) to policy (3):
 
 phase (1)
 
 I would expect we would add a conditional to ceph.spec for suse to add
 to the configure step:
 
   --with-init-systemd
 
 And when other distributions want to move to a full systemd flavour they
 also add a similar conditional.
 
 phase (2)
 
 We add a new configure level conditional:
 
   --with-init-sysv
 
 All sysV init installs are removed from the spec file and added to
 the make install process.
 
 phase (3)
 
 Distributions with more than one init system, or init systems that can
 emulate sysVinit, can build packages with either init system and so
 migration can be tested.
 
 -
 
 Does anyone object to this plan?
 Does anyone agree with this plan?
 Does anyone see difficulties with the plan?

I'm hoping that phase 3 can be avoided entirely.  The upgrade/conversion 
path (at least for upstream packages) will be firefly -> infernalis; I 
don't think it will be that useful to build infernalis packages that do 
sysvinit for systemd distros.  (Maybe this situation gets more 
complicated if we backport this transition to hammer or downstream does 
the same, but even then the transition will be an upgrade one.)

Ken and I talked a bit about this yesterday and he convinced me that 
catering to multiple init systems w/in a single distro (e.g., by letting 
sysvinit and systemd files coexist) is not worth our time.  This has the 
nice benefit of letting us sunset the 
/var/lib/ceph/*/*/{sysvinit,upstart,systemd} files.

Also, I think we should do 1 and 2 basically at the same time.  I don't 
think it's worth spending any effort trying to make things behave with 
just 1 (and not 2).

Am I talking sense?  I can never tell with this stuff.  :)

sage


RE: 'Racing read got wrong version' during proxy write testing

2015-06-03 Thread Wang, Zhiqiang
Making the 'copy get' op to be a cache op seems like a good idea.

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Thursday, June 4, 2015 9:14 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: RE: 'Racing read got wrong version' during proxy write testing

On Wed, 3 Jun 2015, Wang, Zhiqiang wrote:
 I ran into the 'op not idempotent' problem during the testing today. 
 There is one bug in the previous fix. In that fix, we copy the reqids 
 in the final step of 'fill_in_copy_get'. If the object is deleted, 
 since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op.
 No reqids will be copied during promotion in this case. This again 
 leads to the 'op not idempotent' problem. We need a 'smart' way to 
 detect the op is a 'copy get' op (looping the ops vector doesn't seem 
 smart?) and copy the reqids in this case.

Hmm.  I think the idea here is/was that that ENOENT would somehow include the 
reqid list from PGLog::get_object_reqids().

I think the trick is getting it past the generic check in do_op:

  if (!op->may_write() &&
      !op->may_cache() &&
      (!obc->obs.exists ||
       ((m->get_snapid() != CEPH_SNAPDIR) &&
        obc->obs.oi.is_whiteout()))) {
    reply_ctx(ctx, -ENOENT);
    return;
  }

Maybe we mark these as cache operations so that may_cache is true?

Sam, what do you think?

sage


 
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Tuesday, May 26, 2015 12:27 AM
 To: Wang, Zhiqiang
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: 'Racing read got wrong version' during proxy write 
 testing
 
 On Mon, 25 May 2015, Wang, Zhiqiang wrote:
  Hi all,
  
  I ran into a problem during the teuthology test of proxy write. It is like 
  this:
  
  - Client sends 3 writes and a read on the same object to base tier
  - Set up cache tiering
  - Client retries ops and sends the 3 writes and 1 read to the cache 
  tier
  - The 3 writes finished on the base tier, say with versions v1, v2 
  and
  v3
  - Cache tier proxies the 1st write, and start to promote the object 
  for the 2nd write, the 2nd and 3rd writes and the read are blocked
  - The proxied 1st write finishes on the base tier with version v4, 
  and returns to cache tier. But somehow the cache tier fails to send 
  the reply due to socket failure injecting
  - Client retries the writes and the read again, the writes are 
  identified as dup ops
  - The promotion finishes, it copies the pg_log entries from the base 
  tier and put it in the cache tier's pg_log. This includes the 3 
  writes on the base tier and the proxied write
  - The writes dispatches after the promotion, they are identified as 
  completed dup ops. Cache tier replies these write ops with the 
  version from the base tier (v1, v2 and v3)
  - In the last, the read dispatches, it reads the version of the 
  proxied write (v4) and replies to client
  - Client complains that 'racing read got wrong version'
  
  In a previous discussion of the 'ops not idempotent' problem, we solved it 
  by copying the pg_log entries in the base tier to cache tier during 
  promotion. Seems like there is still a problem with this approach in the 
  above scenario. My first thought is that when proxying the write, the cache 
  tier should use the original reqid from the client. But currently we don't 
  have a way to pass the original reqid from cache to base. Any ideas?
 
 I agree--I think the correct fix here is to make the proxied op be recognized 
 as a dup.  We can either do that by passing in an optional reqid to the 
 Objecter, or extending the op somehow so that both reqids are listed.  I 
 think the first option will be cleaner, but I think we will also need to make 
 sure the 'retry' count is preserved as (I think) we skip the dup check if 
 retry==0.  And we probably want to preserve the behavior that a given (reqid, 
 retry) only exists once in the system.
 
 This probably means adding more optional args to Objecter::read()...?
 
 sage
 
 


Re: 'Racing read got wrong version' during proxy write testing

2015-06-03 Thread David Zafman


I wonder if this issue could be the cause of #11511.  Could a proxy 
write have raced with the fill_in_copy_get() so object_info_t size 
doesn't correspond with the size of the object in the filestore?


David


On 6/3/15 6:22 PM, Wang, Zhiqiang wrote:

Making the 'copy get' op to be a cache op seems like a good idea.

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Thursday, June 4, 2015 9:14 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: RE: 'Racing read got wrong version' during proxy write testing

On Wed, 3 Jun 2015, Wang, Zhiqiang wrote:

I ran into the 'op not idempotent' problem during the testing today.
There is one bug in the previous fix. In that fix, we copy the reqids
in the final step of 'fill_in_copy_get'. If the object is deleted,
since the 'copy get' op is a read op, it returns early with ENOENT in do_op.
No reqids will be copied during promotion in this case. This again
leads to the 'op not idempotent' problem. We need a 'smart' way to
detect that the op is a 'copy get' op (looping over the ops vector doesn't
seem smart?) and copy the reqids in this case.

Hmm.  I think the idea here is/was that the ENOENT reply would somehow include 
the reqid list from PGLog::get_object_reqids().

I think the trick is getting it past the generic check in do_op:

   if (!op->may_write() &&
       !op->may_cache() &&
       (!obc->obs.exists ||
        ((m->get_snapid() != CEPH_SNAPDIR) &&
         obc->obs.oi.is_whiteout()))) {
     reply_ctx(ctx, -ENOENT);
     return;
   }

Maybe we mark these as cache operations so that may_cache is true?

Sam, what do you think?

sage



-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, May 26, 2015 12:27 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: Re: 'Racing read got wrong version' during proxy write
testing

On Mon, 25 May 2015, Wang, Zhiqiang wrote:

Hi all,

I ran into a problem during the teuthology test of proxy write. It is like this:

- Client sends 3 writes and a read on the same object to base tier
- Set up cache tiering
- Client retries ops and sends the 3 writes and 1 read to the cache
tier
- The 3 writes finish on the base tier, say with versions v1, v2 and v3
- Cache tier proxies the 1st write and starts to promote the object
for the 2nd write; the 2nd and 3rd writes and the read are blocked
- The proxied 1st write finishes on the base tier with version v4
and returns to the cache tier, but the cache tier fails to send
the reply due to socket failure injection
- Client retries the writes and the read again; the writes are
identified as dup ops
- The promotion finishes; it copies the pg_log entries from the base
tier and puts them in the cache tier's pg_log. This includes the 3
writes on the base tier and the proxied write
- The writes dispatch after the promotion and are identified as
completed dup ops. Cache tier replies to these write ops with the
versions from the base tier (v1, v2 and v3)
- Lastly, the read dispatches; it reads the version of the
proxied write (v4) and replies to the client
- Client complains that 'racing read got wrong version'

In a previous discussion of the 'ops not idempotent' problem, we solved it by 
copying the pg_log entries in the base tier to cache tier during promotion. 
Seems like there is still a problem with this approach in the above scenario. 
My first thought is that when proxying the write, the cache tier should use the 
original reqid from the client. But currently we don't have a way to pass the 
original reqid from cache to base. Any ideas?

I agree--I think the correct fix here is to make the proxied op be recognized 
as a dup.  We can either do that by passing in an optional reqid to the 
Objecter, or extending the op somehow so that both reqids are listed.  I think 
the first option will be cleaner, but I think we will also need to make sure 
the 'retry' count is preserved as (I think) we skip the dup check if retry==0.  
And we probably want to preserve the behavior that a given (reqid, retry) only 
exists once in the system.

This probably means adding more optional args to Objecter::read()...?

sage


RE: 'Racing read got wrong version' during proxy write testing

2015-06-03 Thread Wang, Zhiqiang
Hi David,

Proxy write hasn't been merged into master yet, so it's not likely to be 
causing #11511.

-Original Message-
From: David Zafman [mailto:dzaf...@redhat.com] 
Sent: Thursday, June 4, 2015 9:46 AM
To: Wang, Zhiqiang; Sage Weil
Cc: ceph-devel@vger.kernel.org
Subject: Re: 'Racing read got wrong version' during proxy write testing


I'm wondering if this issue could be the cause of #11511.  Could a proxy write 
have raced with fill_in_copy_get() so that the object_info_t size doesn't 
correspond to the size of the object in the filestore?

David


On 6/3/15 6:22 PM, Wang, Zhiqiang wrote:
 Making the 'copy get' op to be a cache op seems like a good idea.

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Thursday, June 4, 2015 9:14 AM
 To: Wang, Zhiqiang
 Cc: ceph-devel@vger.kernel.org
 Subject: RE: 'Racing read got wrong version' during proxy write 
 testing

 On Wed, 3 Jun 2015, Wang, Zhiqiang wrote:
 I ran into the 'op not idempotent' problem during the testing today.
 There is one bug in the previous fix. In that fix, we copy the reqids 
 in the final step of 'fill_in_copy_get'. If the object is deleted, 
 since the 'copy get' op is a read op, it returns early with ENOENT in 
 do_op.
 No reqids will be copied during promotion in this case. This again 
 leads to the 'op not idempotent' problem. We need a 'smart' way to 
 detect that the op is a 'copy get' op (looping over the ops vector 
 doesn't seem smart?) and copy the reqids in this case.
 Hmm.  I think the idea here is/was that the ENOENT reply would somehow 
 include the reqid list from PGLog::get_object_reqids().

 I think the trick is getting it past the generic check in do_op:

    if (!op->may_write() &&
        !op->may_cache() &&
        (!obc->obs.exists ||
         ((m->get_snapid() != CEPH_SNAPDIR) &&
          obc->obs.oi.is_whiteout()))) {
      reply_ctx(ctx, -ENOENT);
      return;
    }

 Maybe we mark these as cache operations so that may_cache is true?

 Sam, what do you think?

 sage


 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Tuesday, May 26, 2015 12:27 AM
 To: Wang, Zhiqiang
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: 'Racing read got wrong version' during proxy write 
 testing

 On Mon, 25 May 2015, Wang, Zhiqiang wrote:
 Hi all,

 I ran into a problem during the teuthology test of proxy write. It is like 
 this:

 - Client sends 3 writes and a read on the same object to base tier
 - Set up cache tiering
 - Client retries ops and sends the 3 writes and 1 read to the cache 
 tier
 - The 3 writes finish on the base tier, say with versions v1, v2 and v3
 - Cache tier proxies the 1st write and starts to promote the object 
 for the 2nd write; the 2nd and 3rd writes and the read are blocked
 - The proxied 1st write finishes on the base tier with version v4 
 and returns to the cache tier, but the cache tier fails to send 
 the reply due to socket failure injection
 - Client retries the writes and the read again; the writes are 
 identified as dup ops
 - The promotion finishes; it copies the pg_log entries from the base 
 tier and puts them in the cache tier's pg_log. This includes the 3 
 writes on the base tier and the proxied write
 - The writes dispatch after the promotion and are identified as 
 completed dup ops. Cache tier replies to these write ops with the 
 versions from the base tier (v1, v2 and v3)
 - Lastly, the read dispatches; it reads the version of the 
 proxied write (v4) and replies to the client
 - Client complains that 'racing read got wrong version'

 In a previous discussion of the 'ops not idempotent' problem, we solved it 
 by copying the pg_log entries in the base tier to cache tier during 
 promotion. Seems like there is still a problem with this approach in the 
 above scenario. My first thought is that when proxying the write, the cache 
 tier should use the original reqid from the client. But currently we don't 
 have a way to pass the original reqid from cache to base. Any ideas?
 I agree--I think the correct fix here is to make the proxied op be 
 recognized as a dup.  We can either do that by passing in an optional reqid 
 to the Objecter, or extending the op somehow so that both reqids are listed. 
  I think the first option will be cleaner, but I think we will also need to 
 make sure the 'retry' count is preserved as (I think) we skip the dup check 
 if retry==0.  And we probably want to preserve the behavior that a given 
 (reqid, retry) only exists once in the system.

 This probably means adding more optional args to Objecter::read()...?

 sage

RE: 'Racing read got wrong version' during proxy write testing

2015-06-03 Thread Sage Weil
On Wed, 3 Jun 2015, Wang, Zhiqiang wrote:
 I ran into the 'op not idempotent' problem during the testing today. 
 There is one bug in the previous fix. In that fix, we copy the reqids in 
 the final step of 'fill_in_copy_get'. If the object is deleted, since 
 the 'copy get' op is a read op, it returns early with ENOENT in do_op. 
 No reqids will be copied during promotion in this case. This again leads 
 to the 'op not idempotent' problem. We need a 'smart' way to detect that 
 the op is a 'copy get' op (looping over the ops vector doesn't seem 
 smart?) and copy the reqids in this case.

Hmm.  I think the idea here is/was that the ENOENT reply would somehow include 
the reqid list from PGLog::get_object_reqids().

I think the trick is getting it past the generic check in do_op:

  if (!op->may_write() &&
      !op->may_cache() &&
      (!obc->obs.exists ||
       ((m->get_snapid() != CEPH_SNAPDIR) &&
        obc->obs.oi.is_whiteout()))) {
    reply_ctx(ctx, -ENOENT);
    return;
  }

Maybe we mark these as cache operations so that may_cache is true?

Sam, what do you think?

sage
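
To illustrate the idea (the names below are assumptions for the sketch,
not the real OSD code): if the 'copy get' op counts as a cache op,
may_cache() becomes true, the early-ENOENT branch above no longer
short-circuits it, and the copy-get handler can reply with -ENOENT plus
the reqid list from the pg_log.

  // Assumption-level sketch of the check above, not ReplicatedPG itself.
  struct OpSketch {
    bool write = false;
    bool cache = false;   // would be set when COPY_GET is flagged as a cache op
    bool may_write() const { return write; }
    bool may_cache() const { return cache; }
  };

  struct ObjectStateSketch {
    bool exists   = false;
    bool whiteout = false;
  };

  // Mirrors the generic do_op check: a plain read on a missing/whiteout
  // object is answered with a bare ENOENT before the handler runs.
  bool short_circuits_with_enoent(const OpSketch& op,
                                  const ObjectStateSketch& obs) {
    return !op.may_write() && !op.may_cache() && (!obs.exists || obs.whiteout);
  }

With cache == true this returns false for the deleted object, so the
copy-get path still runs and can attach the reqids gathered via
PGLog::get_object_reqids() to its ENOENT reply.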


 
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com] 
 Sent: Tuesday, May 26, 2015 12:27 AM
 To: Wang, Zhiqiang
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: 'Racing read got wrong version' during proxy write testing
 
 On Mon, 25 May 2015, Wang, Zhiqiang wrote:
  Hi all,
  
  I ran into a problem during the teuthology test of proxy write. It is like 
  this:
  
  - Client sends 3 writes and a read on the same object to base tier
  - Set up cache tiering
  - Client retries ops and sends the 3 writes and 1 read to the cache 
  tier
  - The 3 writes finish on the base tier, say with versions v1, v2 and v3
  - Cache tier proxies the 1st write and starts to promote the object 
  for the 2nd write; the 2nd and 3rd writes and the read are blocked
  - The proxied 1st write finishes on the base tier with version v4 and 
  returns to the cache tier, but the cache tier fails to send the 
  reply due to socket failure injection
  - Client retries the writes and the read again; the writes are 
  identified as dup ops
  - The promotion finishes; it copies the pg_log entries from the base 
  tier and puts them in the cache tier's pg_log. This includes the 3 writes 
  on the base tier and the proxied write
  - The writes dispatch after the promotion and are identified as 
  completed dup ops. Cache tier replies to these write ops with the versions 
  from the base tier (v1, v2 and v3)
  - Lastly, the read dispatches; it reads the version of the 
  proxied write (v4) and replies to the client
  - Client complains that 'racing read got wrong version'
  
  In a previous discussion of the 'ops not idempotent' problem, we solved it 
  by copying the pg_log entries in the base tier to cache tier during 
  promotion. Seems like there is still a problem with this approach in the 
  above scenario. My first thought is that when proxying the write, the cache 
  tier should use the original reqid from the client. But currently we don't 
  have a way to pass the original reqid from cache to base. Any ideas?
 
 I agree--I think the correct fix here is to make the proxied op be recognized 
 as a dup.  We can either do that by passing in an optional reqid to the 
 Objecter, or extending the op somehow so that both reqids are listed.  I 
 think the first option will be cleaner, but I think we will also need to make 
 sure the 'retry' count is preserved as (I think) we skip the dup check if 
 retry==0.  And we probably want to preserve the behavior that a given (reqid, 
 retry) only exists once in the system.
 
 This probably means adding more optional args to Objecter::read()...?
 
 sage


Re: preparing v0.80.11

2015-06-03 Thread Wido den Hollander
On 05/26/2015 10:28 PM, Nathan Cutler wrote:
 Hi Loic:
 
 The first round of 0.80.11 backports, including all trivial backports
 (where trivial is defined as those I was able to do by myself without
 help), is now ready for integration testing in the firefly-backports
 branch of the SUSE fork:
 
 https://github.com/SUSE/ceph/commits/firefly-backports
 
 The non-trivial backports (on which I hereby solicit help) are:
 
 http://tracker.ceph.com/issues/11699 Objecter: resend linger ops on split
 http://tracker.ceph.com/issues/11700 make the all osd/filestore thread
 pool suicide timeouts separately configurable
 http://tracker.ceph.com/issues/11704 erasure-code: misalignment
 http://tracker.ceph.com/issues/11720 rgw deleting S3 objects leaves
 __shadow_ objects behind
 

Could I also ask for this one to be backported?

https://github.com/ceph/ceph/pull/4844

The bug it fixes breaks a couple of setups I know of. The fix isn't in
master yet, but it's very trivial.

 Nathan


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on