Re: timeout 120 teuthology-kill is highly recommended

2015-07-21 Thread Yuri Weinstein
I was thinking of teuthology-nuke, though!


Thx
YuriW

- Original Message -
From: Yuri Weinstein ywein...@redhat.com
To: Loic Dachary l...@dachary.org
Cc: Ceph Development ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 9:33:26 AM
Subject: Re: timeout 120 teuthology-kill is highly recommended

Loic

I don't use teuthology-kill simultaneously, only sequentially.
As far as run time, just as a note, when we use the 'stale' arg and it invokes
the ipmitool interface, it does take a while to finish.


Thx
YuriW

- Original Message -
From: Loic Dachary l...@dachary.org
To: Ceph Development ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 9:13:04 AM
Subject: timeout 120 teuthology-kill is highly recommended

Hi Ceph,

Today I did something wrong and that blocked the lab for a good half hour. 

a) I ran two teuthology-kill simultaneously and that makes them deadlock each 
other
b) I let them run unattended only to come back to the terminal 30 minutes later 
and see them stuck.

Sure, two teuthology-kill simultaneously should not deadlock and that needs to 
be fixed. But the easy workaround to avoid that trouble is to just not let it 
run forever. Even for ~200 jobs it takes at most a minute or two. And if it 
takes longer it probably means another teuthology-kill competes and it should 
be interrupted and restarted later. From now on I'll do

timeout 120 teuthology-kill  || echo FAIL!

as a generic safeguard.

Apologies for the troubles.

-- 
Loïc Dachary, Artisan Logiciel Libre


Re: teuthology rados runs for next

2015-07-21 Thread Loic Dachary
Ok !

teuthology-kill -m multi -r 
teuthology-2015-07-18_21:00:09-rados-next-distro-basic-multi
teuthology-kill -m multi -r 
teuthology-2015-07-20_21:00:09-rados-next-distro-basic-multi

I observed that the older 

http://pulpito.ceph.com/teuthology-2015-07-17_21:00:10-rados-next-distro-basic-multi

is half way through but does not seem to make progress while the one scheduled 
today

http://pulpito.ceph.com/teuthology-2015-07-19_21:00:10-rados-next-distro-basic-multi

has one job running. Do you think it best to kill the newer one so it does not compete
for resources and prevent the older one from finishing? I'd be tempted to
kill the newer one because it's so difficult to get jobs running right now that it
makes sense to preserve a run that has already managed to pass > 100 jobs :-)

Cheers

On 21/07/2015 15:43, Sage Weil wrote:
 On Tue, 21 Jul 2015, Loic Dachary wrote:
 Hi Sam,

 I noticed today that http://pulpito.ceph.com/?suite=rados&branch=next is 
 lagging three days behind. Do we want to keep all the runs or should we 
 kill the older ones ? I suppose there would be value in having the 
 results for all of them but given the current load in the sepia lab it 
 also significantly delays them. What do you think ?
 
 I think it's better to kill old scheduled runs.
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: timeout 120 teuthology-kill is highly recommended

2015-07-21 Thread Yuri Weinstein
Loic

I don't use teuthology-kill simultaneously, only sequentially.
As far as run time, just as a note, when we use the 'stale' arg and it invokes
the ipmitool interface, it does take a while to finish.


Thx
YuriW

- Original Message -
From: Loic Dachary l...@dachary.org
To: Ceph Development ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 9:13:04 AM
Subject: timeout 120 teuthology-kill is highly recommended

Hi Ceph,

Today I did something wrong and that blocked the lab for a good half hour. 

a) I ran two teuthology-kill simultaneously and that makes them deadlock each 
other
b) I let them run unattended only to come back to the terminal 30 minutes later 
and see them stuck.

Sure, two teuthology-kill simultaneously should not deadlock and that needs to 
be fixed. But the easy workaround to avoid that trouble is to just not let it 
run forever. Even for ~200 jobs it takes at most a minute or two. And if it 
takes longer it probably means another teuthology-kill competes and it should 
be interrupted and restarted later. From now on I'll do

timeout 120 teuthology-kill  || echo FAIL!

as a generic safeguard.

Apologies for the troubles.

-- 
Loïc Dachary, Artisan Logiciel Libre


9.0.2 test/perf_local.cc on non-x86 architectures

2015-07-21 Thread Deneau, Tom
I was trying to do an rpmbuild of v9.0.2 for aarch64 and got the following 
error:

test/perf_local.cc: In function 'double div32()':
test/perf_local.cc:396:31: error: impossible constraint in 'asm'
  cc);

Probably should have an #if defined(__i386__) around it.

-- Tom
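
A guard along the lines Tom suggests might look like the sketch below. This is
illustrative only, not the actual patch, and treating x86_64 the same way as
i386 is an assumption here:

  /* Sketch: compile the x86-specific inline asm only on x86 builds and
   * return a sentinel elsewhere so callers can report "not supported". */
  #include <stdio.h>

  static double div32(void)
  {
  #if defined(__i386__) || defined(__x86_64__)
      /* the existing inline-asm benchmark body would stay here unchanged */
      return 0.0;
  #else
      return -1.0;  /* benchmark unavailable on this architecture */
  #endif
  }

  int main(void)
  {
      double r = div32();
      if (r < 0)
          printf("div32: not supported on this architecture\n");
      else
          printf("div32: %.6f\n", r);
      return 0;
  }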



Re: [Documentation] Hardware recommandation : RAM and PGLog

2015-07-21 Thread David Casier AEVOO

OK,
I just understood the need for transactions so that the trim takes place
after changing the settings.
What is the risk of having too low a value for the parameter
osd_min_pg_log_entries (as opposed to osd_max_pg_log_entries, which applies in a
degraded environment)?


David.

On 07/20/2015 03:13 PM, Sage Weil wrote:

On Sun, 19 Jul 2015, David Casier AEVOO wrote:

Hi,
I have a question about PGLog and RAM consumption.

In the documentation, we read OSDs do not require as much RAM for regular
operations (e.g., 500MB of RAM per daemon instance); however, during recovery
they need significantly more RAM (e.g., ~1GB per 1TB of storage per daemon)

But in fact, all pg logs are read at the start of the ceph-osd daemon and put in
RAM ( pg->read_state(store, bl); )

Is this normal behavior, or do I have a defect in my environment?

There are two tunables that control how many pg log entries we keep
around.  When the PG is healthy, we keep ~1000, and when the PG is 
degraded, we keep more, to expand the time window over which a recovering 
OSD will be able to do regular log-based recovery instead of a more 
expensive backfill.  This is one source of additional memory.

Others are the missing sets (lists of missing/degraded objects) and 
messages/data/state associated with objects that are being 
recovered/copied.

Note that the numbers in the documentation are pretty rough rules of 
thumb.  At some point it would be great to build a model for how much RAM
the osd consumes as a function of the various configurables (pg log size,
pg count, avg object size, etc.).

sage
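
(For reference, the two tunables discussed above can be set in the [osd]
section of ceph.conf; the values below are purely illustrative, not
recommendations:)

  [osd]
      # pg log entries kept per PG while the PG is healthy (illustrative value)
      osd min pg log entries = 1000
      # upper bound used while the PG is degraded (illustrative value)
      osd max pg log entries = 10000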


Re: local teuthology testing

2015-07-21 Thread Loic Dachary
Hi,

Since July 18th teuthology no longer uses chef, so this issue has been resolved!
Using ansible requires configuration (http://dachary.org/?p=3752 explains it
briefly; maybe there is something in the documentation but I did not pay enough
attention to be sure). At the end of http://dachary.org/?p=3752 you will see a
list of configurable values, and I suspect Andrew & Zack would be more than
happy to explain how any hardcoded leftovers can be stripped :-)

Cheers

On 21/07/2015 14:58, Shinobu Kinjo wrote:
 Hi,
 
 I think you should show us those URLs, for anyone else who might hit the same
 issue.
 
 Sincerely,
 Kinjo
 
 On Tue, Jul 21, 2015 at 9:52 PM, Zhou, Yuan yuan.z...@intel.com wrote:
 
 Hi David/Loic,
 
 I was also trying to set up some local Teuthology clusters here. The 
 biggest issue I met is in ceph-qa-chef - there are lots of hardcoded URLs 
 related to the sepia lab. I have to trace the code and change them line by 
 line.
 
 Can you please share how you got this working? Is there an 
 easy way to fix this?
 
 Thanks, -yuan
 
 
 
 
 -- 
 Life w/ Linux http://i-shinobu.hatenablog.com/

-- 
Loïc Dachary, Artisan Logiciel Libre





timeout 120 teuthology-kill is highly recommended

2015-07-21 Thread Loic Dachary
Hi Ceph,

Today I did something wrong and that blocked the lab for a good half hour. 

a) I ran two teuthology-kill simultaneously and that makes them deadlock each 
other
b) I let them run unattended only to come back to the terminal 30 minutes later 
and see them stuck.

Sure, two teuthology-kill simultaneously should not deadlock and that needs to 
be fixed. But the easy workaround to avoid that trouble is to just not let it 
run forever. Even for ~200 jobs it takes at most a minute or two. And if it 
takes longer it probably means another teuthology-kill competes and it should 
be interrupted and restarted later. From now on I'll do

timeout 120 teuthology-kill  || echo FAIL!

as a generic safeguard.
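
(A hedged sketch of that safeguard as a tiny wrapper; the teuthology-kill
arguments are whatever you would normally pass. Per Greg's reply elsewhere in
this thread, don't use it when the kill has to nuke jobs that are actually
running:)

  # abort the kill after two minutes and make the failure obvious, so a
  # competing teuthology-kill can be interrupted and retried later
  tkill() {
      timeout 120 teuthology-kill "$@" || echo "FAIL: teuthology-kill $*"
  }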

Apologies for the troubles.

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: timeout 120 teuthology-kill is highly recommended

2015-07-21 Thread Gregory Farnum
On Tue, Jul 21, 2015 at 5:13 PM, Loic Dachary l...@dachary.org wrote:
 Hi Ceph,

 Today I did something wrong and that blocked the lab for a good half hour.

 a) I ran two teuthology-kill simultaneously and that makes them deadlock each 
 other
 b) I let them run unattended only to come back to the terminal 30 minutes 
 later and see them stuck.

 Sure, two teuthology-kill simultaneously should not deadlock and that needs 
 to be fixed. But the easy workaround to avoid that trouble is to just not let 
 it run forever. Even for ~200 jobs it takes at most a minute or two.

Mmm, I'm not sure that's correct if you're killing jobs which are
actually running — teuthology-nuke (which it will invoke) can take a
while and you definitely don't want to time that out! So beware for
in-process runs.
-Greg


Re: timeout 120 teuthology-kill is highly recommended

2015-07-21 Thread Loic Dachary
Greg & Yuri: I stand corrected, I should have been less affirmative on a topic
I know little about. Thanks!

On 21/07/2015 18:33, Yuri Weinstein wrote:
 Loic
 
 I don't use teuthology-kill simultaneously, only sequentially.
 As far as run time, just as a note, when we use the 'stale' arg and it invokes
 the ipmitool interface, it does take a while to finish.
 
 
 Thx
 YuriW
 
 - Original Message -
 From: Loic Dachary l...@dachary.org
 To: Ceph Development ceph-devel@vger.kernel.org
 Sent: Tuesday, July 21, 2015 9:13:04 AM
 Subject: timeout 120 teuthology-kill is highly recommended
 
 Hi Ceph,
 
 Today I did something wrong and that blocked the lab for a good half hour. 
 
 a) I ran two teuthology-kill simultaneously and that makes them deadlock each 
 other
 b) I let them run unattended only to come back to the terminal 30 minutes 
 later and see them stuck.
 
 Sure, two teuthology-kill simultaneously should not deadlock and that needs 
 to be fixed. But the easy workaround to avoid that trouble is to just not let 
 it run forever. Even for ~200 jobs it takes at most a minute or two. And if 
 it takes longer it probably means another teuthology-kill competes and it 
 should be interrupted and restarted later. From now on I'll do
 
 timeout 120 teuthology-kill  || echo FAIL!
 
 as a generic safeguard.
 
 Apologies for the troubles.
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: The design of the eviction improvement

2015-07-21 Thread Matt W. Benjamin
Thanks for the explanations, Greg.

- Gregory Farnum g...@gregs42.com wrote:

 On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com
 wrote:
  Hi,
 
  Couple of points.
 
  1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
 implementation
  with 2 levels currently, plus a pinned queue, that addresses stuff
 like partitioning (sharding), scan resistance, and coordination
 w/lookup tables.  We might extend/re-use it.
 
  2) I'm a bit confused by active/inactive vocabulary, dimensioning of
 cache
  segments (are you proposing to/do we now always cache whole
 objects?), and cost
  of looking for dirty objects;  I suspect that it makes sense to
 amortize the
  cost of locating segments eligible to be flushed, rather than
 minimize
  bookkeeping.
 
 We make caching decisions in terms of whole objects right now, yeah.
 There's really nothing in the system that's capable of doing segments
 within an object, and it's not just about tracking a little more
 metadata about dirty objects — the way we handle snapshots, etc would
 have to be reworked if we were allowing partial-object caching. Plus
 keep in mind the IO cost of the bookkeeping — it needs to be either
 consistently persisted to disk or reconstructable from whatever
 happens to be in the object. That can get expensive really fast.
 -Greg

For current semantics/structure of PGs + specific tier held fixed, makes
sense.  For our object addressing currently, we have a greater requirement
for partial object caching.  (Partly, we did this to achieve periodicity
w/sequential I/O.)  I think broadly, there are large performance
tradeoffs here.  In AFS and DCE, there is full consistency in materialized
caches.  Also, caches are dimensioned by chunks.  If the cache is materialized
in memory, the semantics aren't those of disk.  Basically, consistency
guarantees are policy.  Different snapshot mechanisms, or omitting them, e.g.,
should logically enable relaxed consistency, modulo policy.

Matt

 
 
  Matt
 
  - Zhiqiang Wang zhiqiang.w...@intel.com wrote:
 
   -Original Message-
   From: ceph-devel-ow...@vger.kernel.org
   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
   Sent: Tuesday, July 21, 2015 6:38 AM
   To: Wang, Zhiqiang
   Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
   Subject: Re: The design of the eviction improvement
  
   On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all,
   
This is a follow-up of one of the CDS session at
  
 
 http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
   ng_eviction. We discussed the drawbacks of the current eviction
  algorithm and
   several ways to improve it. Seems like the LRU variants is the
 right
  way to go. I
   come up with some design points after the CDS, and want to
 discuss
  it with you.
   It is an approximate 2Q algorithm, combining some benefits of
 the
  clock
   algorithm, similar to what the linux kernel does for the page
  cache.
  
   Unfortunately I missed this last CDS so I'm behind on the
  discussion.  I have a
   few questions though...
  
# Design points:
   
## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementation in the current code
  have a
max_size, which limits the max number of elements in the list.
  This
mostly looks like a MRU, though its name implies they are
 LRUs.
  Since
the object size may vary in a PG, it's not possible to
 calculate
  the
total number of objects which the cache tier can hold ahead of
  time.
We need a new LRU implementation with no limit on the size.
  
   This last sentence seems to me to be the crux of it.  Assuming
 we
  have an
   OSD backed by flash storing O(n) objects, we need a way to
 maintain
  an LRU of
   O(n) objects in memory.  The current hitset-based approach was
 taken
  based
   on the assumption that this wasn't feasible--or at least we
 didn't
  know how to
   implement such a thing.  If it is, or we simply want to
 stipulate
  that cache
   tier OSDs get gobs of RAM to make it possible, then lots of
 better
  options
   become possible...
  
   Let's say you have a 1TB SSD, with an average object size of 1MB
 --
  that's
   1 million objects.  At maybe ~100bytes per object of RAM for an
 LRU
  entry
   that's 100MB... so not so unreasonable, perhaps!
 
  I was having the same question before proposing this. I did the
  similar calculation and thought it would be ok to use this many
 memory
  :-)
 
  
- Two lists for each PG: active and inactive Objects are first
  put
into the inactive list when they are accessed, and moved
 between
  these two
   lists based on some criteria.
Object flag: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's not in both of the lists, it's put on the top of
 the
inactive list
2) If it's in the inactive list, and the referenced flag is
 not
  set, the referenced
   flag is set, and it's 

Re: MVC in ceph-deploy.

2015-07-21 Thread Travis Rhoden
Hi Owen,

I think the primary concern I have that I want to discuss more about
is cluster state discovery.  I'm worried about how this scales.
Normally when I think about MVC, I think of a long-running application
or something with a persistent data store for the model (or both).
ceph-deploy is neither of these, so the act of querying the cluster
first and loading all the data into the (memory-backed) model upon
every ceph-deploy invocation concerns me.

For querying pools and monitors, it seems fine.  What happens when we
have 1000s of OSDs?

I don't think we'd want to add any kind of data persistence to
ceph-deploy (not that you've suggested it), so that means we would
have to load the model every time.  Right now we take more of a
Pythonic approach of it's better to ask for forgiveness than
permission, meaning that we just try to do the action in question and
deal with any exceptions that arise.  I'm a little hesitant to try to
infer or pull much knowledge of a Ceph cluster's state into
ceph-deploy, as the cluster's state is quite complicated and dynamic
just because it's a large distributed system.  Of course the monitors
deal with that for us, and I think we would just be querying the
monitor(s) for the latest state.

In general that's my primary feedback.  Are there issues with scaling
to large clusters?  Does it become a large overhead to load the model
if someone runs ceph-deploy repeatedly (say, adding OSDs to one node,
then the next, then the next, and they've done it with separate calls
to ceph-deploy each time).

How do we deal with updating the model in failure scenarios?  We have
to re-query the monitor and update the model to make sure our local
representation is accurate.  I suppose that applies even for
non-failure. For example when we are proposing a change to the model,
say we want to add an OSD, we go off to a node to create/deploy
the OSD and everything comes back okay.  But to really be sure, we
probably need to query the monitor and see what the status of the OSD
is (is it defined?  is it up?) and to re-sync the model.  It seems
like a lot of back and forth interaction to keep the model up to date,
and ultimately we lose all that information when the application
exits.

That's my initial feedback.

 - Travis
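
(As a concrete illustration of the re-query-and-re-sync concern above, a hedged
sketch of rebuilding an OSD model straight from the monitor on every
invocation; the function name and the fields kept are assumptions for the
sketch, not ceph-deploy code:)

  # Rebuild the in-memory model from 'ceph osd dump' each time ceph-deploy
  # runs; everything is discarded again when the process exits.
  import json
  import subprocess

  def load_osd_model(cluster='ceph'):
      out = subprocess.check_output(
          ['ceph', '--cluster', cluster, 'osd', 'dump', '--format', 'json'])
      dump = json.loads(out)
      # keep only what the model needs: up/in status per OSD id
      return dict((osd['osd'], {'up': bool(osd['up']), 'in': bool(osd['in'])})
                  for osd in dump['osds'])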

On Fri, Jul 17, 2015 at 2:31 AM, Owen Synge osy...@suse.com wrote:
 On 07/17/2015 08:16 AM, Travis Rhoden wrote:
 Hi Owen,

 I'm still giving this one some thought. I've gone back and reviewed
 https://github.com/ceph/ceph-deploy/pull/320 a few more times.  I do
 understand how it works (it took a couple times through it), and
 cosmetic things notwithstanding I can appreciate what it is doing.  I
 also fully get that the choice of sqlalchemy vs choice of data store
 makes no difference to the merit of the idea.  I'm still formulating
 my opinion, however, but wanted to let you know I was thinking about
 it.

 Thanks for this reply, but please don't get too focused on the patch.

 At the time of writing this patch I thought MVC would be completely
 uncontentious. It was never intended to illustrate the benefits of MVC.
 That is more work than this patch was intended to achieve.

 It was written to show a model can be practically done with a sqlalchemy
 implementation of an MVC's model with no important deployment overhead,
 and illustrate how SQL queries can be mapped easily as an advantage of
 using a RDBMS model, rather than illustrating MVC best practice.

 The rest of the email tries to shoehorn the patch into the discussion
 of the validity of MVC with respect to the patch:

 https://github.com/ceph/ceph-deploy/pull/320

 It's a night's work and only a discussion point; it's half existing code I
 already have (for rgw setup), and half MVC in itself.

 Please think of it more as an aid to a conversation (like slides) rather
 than as final or a good example of MVC best practice. It would need more
 work to be that, and it's probably a day's work to get close to a good example.

 The model is clear, but the views could be clearer in the way they are
 abstracted, and can be extended in consistent ways, so I made a few
 comment on the patch showing where I think the code is not very MVC.

 A good example of how it's not MVC enough to be an MVC example is the
 set operations, which should be in a controller method, and the comparison of
 data in the model; the data in the model should be loaded via views. In
 addition, if the model is based on an RDBMS, the set operations would be
 performed by the RDBMS and not in Python.

 The first post in this thread is more stand alone in where I would see
 us going if we went MVC. The development of the patch helped me see many
 places where we could add consistency by being more MVC, rather than
 actually following the design pattern enough to show best practice.

 I would be happy to chat about the code face to face but reviewing this
 code directly without comments does not show the benefits of MVC.

 In summary about this patch:

 (1) It is 

Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Stefan Priebe


On 21.07.2015 at 16:32, Jason Dillaman wrote:

Any chance that the snapshot was just created prior to the first export and you 
have a process actively writing to the image?



Sadly not. I executed those commands exactly as I've posted them, manually at
a bash prompt.


I can reproduce this on 5 different ceph clusters with 500 VMs each.

Stefan


Ceph Tech Talk next week

2015-07-21 Thread Patrick McGarry
Hey cephers,

Just a reminder that the Ceph Tech Talk on CephFS that was scheduled
for last month (and cancelled due to technical difficulties) has been
rescheduled for this month's talk. It will be happening next Thurs at
17:00 UTC (1p EST) on our Blue Jeans conferencing system. If you have
any questions feel free to let me know. Thanks.

http://ceph.com/ceph-tech-talks/


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Jason Dillaman
Does this still occur if you export the images to the console (i.e. rbd export 
cephstor/disk-116@snap - > dump_file)?

Would it be possible for you to provide logs from the two rbd export runs on 
your smallest VM image?  If so, please add the following to the [client] 
section of your ceph.conf:

  log file = /valid/path/to/logs/$name.$pid.log
  debug rbd = 20

I opened a ticket [1] where you can attach the logs (if they aren't too large).

[1] http://tracker.ceph.com/issues/12422

-- 

Jason Dillaman 
Red Hat 
dilla...@redhat.com 
http://www.redhat.com 


- Original Message -
 From: Stefan Priebe s.pri...@profihost.ag
 To: Jason Dillaman dilla...@redhat.com
 Cc: ceph-devel@vger.kernel.org
 Sent: Tuesday, July 21, 2015 12:55:43 PM
 Subject: Re: upstream/firefly exporting the same snap 2 times results in 
 different exports
 
 
 Am 21.07.2015 um 16:32 schrieb Jason Dillaman:
  Any chance that the snapshot was just created prior to the first export and
  you have a process actively writing to the image?
 
 
 Sadly not. I executed those commands exactly as I've posted them, manually at
 a bash prompt.
 
 I can reproduce this on 5 different ceph clusters with 500 VMs each.
 
 Stefan


Re: hdparm -W redux, bug in _check_disk_write_cache for RHEL6?

2015-07-21 Thread Dan van der Ster
On Tue, Jul 21, 2015 at 4:20 PM, Ilya Dryomov idryo...@gmail.com wrote:
 This one, I think:

 commit ab0a9735e06914ce4d2a94ffa41497dbc142fe7f
 Author: Christoph Hellwig h...@lst.de
 Date:   Thu Oct 29 14:14:04 2009 +0100

     blkdev: flush disk cache on ->fsync

Thanks, that looks relevant! Looks to me that all RHEL 6 kernels have
(a version of) that patch.

Cheers, Dan


RE: The design of the eviction improvement

2015-07-21 Thread Wang, Zhiqiang

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Tuesday, July 21, 2015 9:29 PM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
   -Original Message-
   From: ceph-devel-ow...@vger.kernel.org
   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
   Sent: Tuesday, July 21, 2015 6:38 AM
   To: Wang, Zhiqiang
   Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
   Subject: Re: The design of the eviction improvement
  
   On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all,
   
This is a follow-up of one of the CDS session at
   http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_
   tieri ng_eviction. We discussed the drawbacks of the current
   eviction algorithm and several ways to improve it. Seems like the
   LRU variants is the right way to go. I come up with some design
   points after the CDS, and want to discuss it with you.
   It is an approximate 2Q algorithm, combining some benefits of the
   clock algorithm, similar to what the linux kernel does for the page cache.
  
   Unfortunately I missed this last CDS so I'm behind on the
   discussion.  I have a few questions though...
  
# Design points:
   
## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementation in the current code
have a max_size, which limits the max number of elements in the
list. This mostly looks like a MRU, though its name implies they
are LRUs. Since the object size may vary in a PG, it's not
possible to calculate the total number of objects which the cache tier 
can
 hold ahead of time.
We need a new LRU implementation with no limit on the size.
  
   This last sentence seems to me to be the crux of it.  Assuming we
   have an OSD backed by flash storing O(n) objects, we need a way to
   maintain an LRU of
   O(n) objects in memory.  The current hitset-based approach was taken
   based on the assumption that this wasn't feasible--or at least we
   didn't know how to implement such a thing.  If it is, or we simply
   want to stipulate that cache tier OSDs get gobs of RAM to make it
   possible, then lots of better options become possible...
  
   Let's say you have a 1TB SSD, with an average object size of 1MB --
   that's
   1 million objects.  At maybe ~100bytes per object of RAM for an LRU
   entry that's 100MB... so not so unreasonable, perhaps!
 
  I was having the same question before proposing this. I did the
  similar calculation and thought it would be ok to use this many memory
  :-)
 
 The part that worries me now is the speed with which we can load and manage
 such a list.  Assuming it is several hundred MB, it'll take a while to load 
 that
 into memory and set up all the pointers (assuming a conventional linked list
 structure).  Maybe tens of seconds...

I'm thinking of maintaining the lists at the PG level. That is to say, we have
an active/inactive list for every PG. We can load the lists in parallel during
rebooting. Also, the ~100 MB of lists is split among the different OSD nodes. Perhaps
it would not take such a long time to load them?

 
 I wonder if instead we should construct some sort of flat model where we load
 slabs of contiguous memory, 10's of MB each, and have the next/previous
 pointers be a (slab,position) pair.  That way we can load it into memory in 
 big
 chunks, quickly, and be able to operate on it (adjust
 links) immediately.
 
 Another thought: currently we use the hobject_t hash only instead of the full
 object name.  We could continue to do the same, or we could do a hash pair
 (hobject_t hash + a different hash of the rest of the object) to keep the
representation compact.  With a model like the above, that could get the
 object representation down to 2 u32's.  A link could be a slab + position (2
 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per 
 object.
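
(A minimal sketch of that compact entry layout, with illustrative names -- not
actual Ceph code:)

  #include <cstdint>

  struct lru_link {
    uint32_t slab;          // which slab holds the neighbouring entry
    uint32_t pos;           // slot within that slab
  };
  struct lru_entry {
    uint32_t hobject_hash;  // hobject_t hash, as used today
    uint32_t name_hash;     // second hash over the rest of the object name
    lru_link prev, next;    // links expressed as (slab, position) pairs
  };                        // 6 x 4 = 24 bytes per cached object
  static_assert(sizeof(lru_entry) == 24, "keep the per-object entry compact");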

Looks like for an object, the head and the snapshot version have the same 
hobject hash. Thus we have to use the hash pair instead of just the hobject 
hash. But I still have two questions if we use the hash pair to represent an 
object.
1) Does the hash pair uniquely identify an object? That's to say, is it 
possible for two objects to have the same hash pair?
2) We need a way to get the full object name from the hash pair, so that we 
know what objects to evict. But seems like we don't have a good way to do this?

 
 With fixed-sized slots on the slabs, the slab allocator could be very simple..
 maybe just a bitmap, a free counter, and any other trivial optimizations to
 make finding a slab's next free slot nice and quick.
 
- Two lists for each PG: active and inactive Objects are first put
into the inactive list when they are accessed, and moved between
these two
   lists based on some criteria.
Object flag: active, referenced, 

Re: The design of the eviction improvement

2015-07-21 Thread Matt W. Benjamin
Hi,

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:

 Hi Matt,
 
  -Original Message-
  From: Matt W. Benjamin [mailto:m...@cohortfs.com]
  Sent: Tuesday, July 21, 2015 10:16 PM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
  Subject: Re: The design of the eviction improvement
  
  Hi,
  
  Couple of points.
  
  1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
  implementation with 2 levels currently, plus a pinned queue, that
 addresses
  stuff like partitioning (sharding), scan resistance, and
 coordination w/lookup
  tables.  We might extend/re-use it.
 
 The MQ algorithm is more complex, and seems like it has more overhead
 than 2Q. The approximate 2Q algorithm here combines some benefits of
 the clock algorithm, and works well on the linux kernel. MQ could be
 another choice. There are some other candidates like LIRS, ARC, etc.,
 which have been deployed in some practical systems.

MQ has been deployed in practical systems, and is more general.

 
  
  2) I'm a bit confused by active/inactive vocabulary, dimensioning of
 cache
  segments (are you proposing to/do we now always cache whole
 objects?), and
  cost of looking for dirty objects;  I suspect that it makes sense
 to amortize
  the cost of locating segments eligible to be flushed, rather than
 minimize
  bookkeeping.
 
 Though currently the caching decisions are made in the unit of object
 as Greg pointed out in another mail, I think we still have something
 to improve here. I'll come back to this some time later.
 
  
  Matt
  
  - Zhiqiang Wang zhiqiang.w...@intel.com wrote:
  
-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage
 Weil
Sent: Tuesday, July 21, 2015 6:38 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: Re: The design of the eviction improvement
   
On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
 Hi all,

 This is a follow-up of one of the CDS session at
   
  
 http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_ti
   eri
ng_eviction. We discussed the drawbacks of the current eviction
   algorithm and
several ways to improve it. Seems like the LRU variants is the
 right
   way to go. I
come up with some design points after the CDS, and want to
 discuss
   it with you.
It is an approximate 2Q algorithm, combining some benefits of
 the
   clock
algorithm, similar to what the linux kernel does for the page
   cache.
   
Unfortunately I missed this last CDS so I'm behind on the
   discussion.  I have a
few questions though...
   
 # Design points:

 ## LRU lists
 - Maintain LRU lists at the PG level.
 The SharedLRU and SimpleLRU implementation in the current
 code
   have a
 max_size, which limits the max number of elements in the
 list.
   This
 mostly looks like a MRU, though its name implies they are
 LRUs.
   Since
 the object size may vary in a PG, it's not possible to
 calculate
   the
 total number of objects which the cache tier can hold ahead
 of
   time.
 We need a new LRU implementation with no limit on the size.
   
This last sentence seems to me to be the crux of it.  Assuming
 we
   have an
OSD backed by flash storing O(n) objects, we need a way to
 maintain
   an LRU of
O(n) objects in memory.  The current hitset-based approach was
 taken
   based
on the assumption that this wasn't feasible--or at least we
 didn't
   know how to
implement such a thing.  If it is, or we simply want to
 stipulate
   that cache
tier OSDs get gobs of RAM to make it possible, then lots of
 better
   options
become possible...
   
Let's say you have a 1TB SSD, with an average object size of 1MB
 --
   that's
1 million objects.  At maybe ~100bytes per object of RAM for an
 LRU
   entry
that's 100MB... so not so unreasonable, perhaps!
  
   I was having the same question before proposing this. I did the
   similar calculation and thought it would be ok to use this many
 memory
   :-)
  
   
 - Two lists for each PG: active and inactive Objects are
 first
   put
 into the inactive list when they are accessed, and moved
 between
   these two
lists based on some criteria.
 Object flag: active, referenced, unevictable, dirty.
 - When an object is accessed:
 1) If it's not in both of the lists, it's put on the top of
 the
 inactive list
 2) If it's in the inactive list, and the referenced flag is
 not
   set, the referenced
flag is set, and it's moved to the top of the inactive list.
 3) If it's in the inactive list, and the referenced flag is
 set,
   the referenced flag
is cleared, and it's removed from the inactive list, and put on
 top
   of the active
list.
 4) If it's in the active list, and the referenced flag is not
 set,
   the referenced
flag is 

Subscription to the ceph-devel mailing list

2015-07-21 Thread Surabhi Bhalothia


RE: The design of the eviction improvement

2015-07-21 Thread Wang, Zhiqiang
Hi Matt,

 -Original Message-
 From: Matt W. Benjamin [mailto:m...@cohortfs.com]
 Sent: Tuesday, July 21, 2015 10:16 PM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
 Subject: Re: The design of the eviction improvement
 
 Hi,
 
 Couple of points.
 
 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
 implementation with 2 levels currently, plus a pinned queue, that addresses
 stuff like partitioning (sharding), scan resistance, and coordination 
 w/lookup
 tables.  We might extend/re-use it.

The MQ algorithm is more complex, and seems like it has more overhead than 2Q. 
The approximate 2Q algorithm here combines some benefits of the clock 
algorithm, and works well on the linux kernel. MQ could be another choice. 
There are some other candidates like LIRS, ARC, etc., which have been deployed 
in some practical systems.

 
 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
 segments (are you proposing to/do we now always cache whole objects?), and
 cost of looking for dirty objects;  I suspect that it makes sense to 
 amortize
 the cost of locating segments eligible to be flushed, rather than minimize
 bookkeeping.

Though currently the caching decisions are made in the unit of object as Greg 
pointed out in another mail, I think we still have something to improve here. 
I'll come back to this some time later.

 
 Matt
 
 - Zhiqiang Wang zhiqiang.w...@intel.com wrote:
 
   -Original Message-
   From: ceph-devel-ow...@vger.kernel.org
   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
   Sent: Tuesday, July 21, 2015 6:38 AM
   To: Wang, Zhiqiang
   Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
   Subject: Re: The design of the eviction improvement
  
   On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all,
   
This is a follow-up of one of the CDS session at
  
  http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_ti
  eri
   ng_eviction. We discussed the drawbacks of the current eviction
  algorithm and
   several ways to improve it. Seems like the LRU variants is the right
  way to go. I
   come up with some design points after the CDS, and want to discuss
  it with you.
   It is an approximate 2Q algorithm, combining some benefits of the
  clock
   algorithm, similar to what the linux kernel does for the page
  cache.
  
   Unfortunately I missed this last CDS so I'm behind on the
  discussion.  I have a
   few questions though...
  
# Design points:
   
## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementation in the current code
  have a
max_size, which limits the max number of elements in the list.
  This
mostly looks like a MRU, though its name implies they are LRUs.
  Since
the object size may vary in a PG, it's not possible to calculate
  the
total number of objects which the cache tier can hold ahead of
  time.
We need a new LRU implementation with no limit on the size.
  
   This last sentence seems to me to be the crux of it.  Assuming we
  have an
   OSD backed by flash storing O(n) objects, we need a way to maintain
  an LRU of
   O(n) objects in memory.  The current hitset-based approach was taken
  based
   on the assumption that this wasn't feasible--or at least we didn't
  know how to
   implement such a thing.  If it is, or we simply want to stipulate
  that cache
   tier OSDs get gobs of RAM to make it possible, then lots of better
  options
   become possible...
  
   Let's say you have a 1TB SSD, with an average object size of 1MB --
  that's
   1 million objects.  At maybe ~100bytes per object of RAM for an LRU
  entry
   that's 100MB... so not so unreasonable, perhaps!
 
  I was having the same question before proposing this. I did the
  similar calculation and thought it would be ok to use this many memory
  :-)
 
  
- Two lists for each PG: active and inactive Objects are first
  put
into the inactive list when they are accessed, and moved between
  these two
   lists based on some criteria.
Object flag: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's not in both of the lists, it's put on the top of the
inactive list
2) If it's in the inactive list, and the referenced flag is not
  set, the referenced
   flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set,
  the referenced flag
   is cleared, and it's removed from the inactive list, and put on top
  of the active
   list.
4) If it's in the active list, and the referenced flag is not set,
  the referenced
   flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set,
  it's moved to the top
   of the active list.
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to
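
(For concreteness, a minimal sketch of access rules 1)-5) above, with
illustrative names and plain std::list/std::unordered_map instead of the
intrusive lists an OSD would really want; eviction, trimming from the tail of
the inactive list, and the dirty/unevictable flags are left out:)

  #include <list>
  #include <string>
  #include <unordered_map>

  struct Entry { std::string oid; bool active; bool referenced; };

  struct TwoQLists {
    std::list<Entry> active_list, inactive_list;
    std::unordered_map<std::string, std::list<Entry>::iterator> index;

    void on_access(const std::string& oid) {
      auto it = index.find(oid);
      if (it == index.end()) {                               // rule 1
        inactive_list.push_front(Entry{oid, false, false});
        index[oid] = inactive_list.begin();
        return;
      }
      Entry& e = *it->second;
      if (!e.active && !e.referenced) {                      // rule 2
        e.referenced = true;
        inactive_list.splice(inactive_list.begin(), inactive_list, it->second);
      } else if (!e.active && e.referenced) {                // rule 3
        e.referenced = false;
        e.active = true;
        active_list.splice(active_list.begin(), inactive_list, it->second);
      } else if (e.active && !e.referenced) {                // rule 4
        e.referenced = true;
        active_list.splice(active_list.begin(), active_list, it->second);
      } else {                                               // rule 5
        active_list.splice(active_list.begin(), active_list, it->second);
      }
    }
  };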
  

Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Stefan Priebe


On 21.07.2015 at 21:46, Josh Durgin wrote:

On 07/21/2015 12:22 PM, Stefan Priebe wrote:


On 21.07.2015 at 19:19, Jason Dillaman wrote:

Does this still occur if you export the images to the console (i.e.
rbd export cephstor/disk-116@snap - > dump_file)?

Would it be possible for you to provide logs from the two rbd export
runs on your smallest VM image?  If so, please add the following to
the [client] section of your ceph.conf:

   log file = /valid/path/to/logs/$name.$pid.log
   debug rbd = 20

I opened a ticket [1] where you can attach the logs (if they aren't
too large).

[1] http://tracker.ceph.com/issues/12422


Will post some more details to the tracker in a few hours. It seems it
is related to using discard inside guest but not on the FS the osd is on.


That sounds very odd. Could you verify via 'rados listwatchers' on an
in-use rbd image's header object that there's still a watch established?


How can I do this exactly?


Have you increased pgs in all those clusters recently?


Yes, I bumped from 2048 to 4096 as I doubled the OSDs.

Stefan


Josh


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Stefan Priebe

So this is really this old bug?

http://tracker.ceph.com/issues/9806

Stefan
On 21.07.2015 at 21:46, Josh Durgin wrote:

On 07/21/2015 12:22 PM, Stefan Priebe wrote:


On 21.07.2015 at 19:19, Jason Dillaman wrote:

Does this still occur if you export the images to the console (i.e.
rbd export cephstor/disk-116@snap - > dump_file)?

Would it be possible for you to provide logs from the two rbd export
runs on your smallest VM image?  If so, please add the following to
the [client] section of your ceph.conf:

   log file = /valid/path/to/logs/$name.$pid.log
   debug rbd = 20

I opened a ticket [1] where you can attach the logs (if they aren't
too large).

[1] http://tracker.ceph.com/issues/12422


Will post some more details to the tracker in a few hours. It seems it
is related to using discard inside guest but not on the FS the osd is on.


That sounds very odd. Could you verify via 'rados listwatchers' on an
in-use rbd image's header object that there's still a watch established?

Have you increased pgs in all those clusters recently?

Josh


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Stefan Priebe


On 21.07.2015 at 19:19, Jason Dillaman wrote:

Does this still occur if you export the images to the console (i.e. rbd export 
cephstor/disk-116@snap - > dump_file)?

Would it be possible for you to provide logs from the two rbd export runs on your 
smallest VM image?  If so, please add the following to the [client] section of your 
ceph.conf:

   log file = /valid/path/to/logs/$name.$pid.log
   debug rbd = 20

I opened a ticket [1] where you can attach the logs (if they aren't too large).

[1] http://tracker.ceph.com/issues/12422


Will post some more details to the tracker in a few hours. It seems it 
is related to using discard inside guest but not on the FS the osd is on.


Stefan


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Josh Durgin

On 07/21/2015 12:22 PM, Stefan Priebe wrote:


On 21.07.2015 at 19:19, Jason Dillaman wrote:

Does this still occur if you export the images to the console (i.e.
rbd export cephstor/disk-116@snap - > dump_file)?

Would it be possible for you to provide logs from the two rbd export
runs on your smallest VM image?  If so, please add the following to
the [client] section of your ceph.conf:

   log file = /valid/path/to/logs/$name.$pid.log
   debug rbd = 20

I opened a ticket [1] where you can attach the logs (if they aren't
too large).

[1] http://tracker.ceph.com/issues/12422


Will post some more details to the tracker in a few hours. It seems it
is related to using discard inside guest but not on the FS the osd is on.


That sounds very odd. Could you verify via 'rados listwatchers' on an
in-use rbd image's header object that there's still a watch established?

Have you increased pgs in all those clusters recently?

Josh


Re: Ceph Tech Talk next week

2015-07-21 Thread Gregory Farnum
On Tue, Jul 21, 2015 at 6:09 PM, Patrick McGarry pmcga...@redhat.com wrote:
 Hey cephers,

 Just a reminder that the Ceph Tech Talk on CephFS that was scheduled
 for last month (and cancelled due to technical difficulties) has been
 rescheduled for this month's talk. It will be happening next Thurs at
 17:00 UTC (1p EST)

So that's July 30, according to the website, right? :)


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Josh Durgin

Yes, I'm afraid it sounds like it is. You can double check whether the
watch exists on an image by getting the id of the image from 'rbd info
$pool/$image | grep block_name_prefix':

block_name_prefix: rbd_data.105674b0dc51

The id is the hex number there. Append that to 'rbd_header.' and you
have the header object name. Check whether it has watchers with:

rados listwatchers -p $pool rbd_header.105674b0dc51

If that doesn't show any watchers while the image is in use by a vm,
it's #9806.

I just merged the backport for firefly, so it'll be in 0.80.11.
Sorry it took so long to get to firefly :(. We'll need to be
more vigilant about checking non-trivial backports when we're
going through all the bugs periodically.

Josh

On 07/21/2015 12:52 PM, Stefan Priebe wrote:

So this is really this old bug?

http://tracker.ceph.com/issues/9806

Stefan
On 21.07.2015 at 21:46, Josh Durgin wrote:

On 07/21/2015 12:22 PM, Stefan Priebe wrote:


On 21.07.2015 at 19:19, Jason Dillaman wrote:

Does this still occur if you export the images to the console (i.e.
rbd export cephstor/disk-116@snap - > dump_file)?

Would it be possible for you to provide logs from the two rbd export
runs on your smallest VM image?  If so, please add the following to
the [client] section of your ceph.conf:

   log file = /valid/path/to/logs/$name.$pid.log
   debug rbd = 20

I opened a ticket [1] where you can attach the logs (if they aren't
too large).

[1] http://tracker.ceph.com/issues/12422


Will post some more details to the tracker in a few hours. It seems it
is related to using discard inside guest but not on the FS the osd is
on.


That sounds very odd. Could you verify via 'rados listwatchers' on an
in-use rbd image's header object that there's still a watch established?

Have you increased pgs in all those clusters recently?

Josh


quick way to rebuild deb packages

2015-07-21 Thread Bartłomiej Święcki
Hi all,

I'm currently working on a test environment for ceph where we're using deb
files to deploy new versions on a test cluster.
To make this work efficiently I'd have to quickly build deb packages.

I tried dpkg-buildpackage -nc, which should keep the results of the previous build,
but it ends up with a linking error:

 ...
   CXXLDceph_rgw_jsonparser
 ./.libs/libglobal.a(json_spirit_reader.o): In function `~thread_specific_ptr':
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptrboost::detail::tss_cleanup_function, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptrboost::detail::tss_cleanup_function, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptrboost::detail::tss_cleanup_function, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptrboost::detail::tss_cleanup_function, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptrboost::detail::tss_cleanup_function, void*, bool)'
 ./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79:
  more undefined references to `boost::detail::set_tss_data(void const*, 
 boost::shared_ptrboost::detail::tss_cleanup_function, void*, bool)' follow
 ./.libs/libglobal.a(json_spirit_reader.o): In function `call_oncevoid 
 (*)()':
 ...

Any ideas on what could go wrong here ?

The version I'm compiling is v0.94.1, but I've observed the same results with 9.0.1.

-- 
Bartlomiej Swiecki bartlomiej.swie...@corp.ovh.com


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Jason Dillaman
  That sounds very odd. Could you verify via 'rados listwatchers' on an
  in-use rbd image's header object that there's still a watch established?
 
 How can I do this exactly?
 

You need to determine the RBD header object name.  For format 1 images (the default 
for Firefly), the image header object is named <image name>.rbd.  For format 
2 images, you can determine the header object name via rbd info <image spec> | 
grep 'block_name_prefix' | sed 's/.*rbd_data\.\(.*\)/rbd_header.\1/g'.

Once you have the RBD image header object name, you can run: rados 
listwatchers -p <pool name> <RBD image header name>. 
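
(Putting the two steps together for a format 2 image, with placeholder
pool/image names and the id from Josh's example:)

  # 1) find the image id
  rbd info mypool/myimage | grep block_name_prefix
  #    block_name_prefix: rbd_data.105674b0dc51
  # 2) list watchers on the corresponding header object
  rados listwatchers -p mypool rbd_header.105674b0dc51
  # no output while a VM is using the image means you are hitting #9806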

-- 

Jason Dillaman 
Red Hat 
dilla...@redhat.com 
http://www.redhat.com 


RE: local teuthology testing

2015-07-21 Thread Zhou, Yuan
Loic, thanks for the notes! I will try the new code and report back on the issue I
met.

Thanks, -yuan

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Tuesday, July 21, 2015 11:48 PM
To: shin...@linux.com; Zhou, Yuan
Cc: David Casier AEVOO; Ceph Devel; se...@lists.ceph.com
Subject: Re: local teuthology testing

Hi,

Since July 18th teuthology no longer uses chef, so this issue has been resolved!
Using ansible requires configuration (http://dachary.org/?p=3752 explains it
briefly; maybe there is something in the documentation but I did not pay enough
attention to be sure). At the end of http://dachary.org/?p=3752 you will see a
list of configurable values, and I suspect Andrew & Zack would be more than
happy to explain how any hardcoded leftovers can be stripped :-)

Cheers

On 21/07/2015 14:58, Shinobu Kinjo wrote:
 Hi,
 
 I think you should show us those URLs, for anyone else who might hit the same
 issue.
 
 Sincerely,
 Kinjo
 
 On Tue, Jul 21, 2015 at 9:52 PM, Zhou, Yuan yuan.z...@intel.com wrote:
 
 Hi David/Loic,
 
 I was also trying to set up some local Teuthology clusters here. The 
 biggest issue I met is in ceph-qa-chef - there are lots of hardcoded URLs 
 related to the sepia lab. I have to trace the code and change them line by 
 line.
 
 Can you please share how you got this working? Is there an 
 easy way to fix this?
 
 Thanks, -yuan
 
 
 
 
 -- 
 Life w/ Linux http://i-shinobu.hatenablog.com/

-- 
Loïc Dachary, Artisan Logiciel Libre



Re: rados/thrash on OpenStack

2015-07-21 Thread Loic Dachary
Hi Kefu,

The following runs on OpenStack and the next branch 
http://integration.ceph.dachary.org:8081/ubuntu-2015-07-21_00:04:04-rados-next---basic-openstack/
 and 15 out of the 16 dead jobs (timed out after 3 hours) are from 
rados/thrash. A rados suite run on next dated a few days ago in the sepia lab ( 
http://pulpito.ceph.com/teuthology-2015-07-15_21:00:10-rados-next-distro-basic-multi/
 ) also has a few dead jobs but only two of them are from rados/thrash.

Cheers


On 20/07/2015 16:23, Loic Dachary wrote:
 More information about this run. I'll run a rados suite on master on 
 OpenStack to get a baseline of what we should expect.
 
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/12/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/14/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/15/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/17/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/20/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/21/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/22/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/23/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/26/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/28/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/2/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/5/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/6/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/7/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/9/
 
 I see
 
 2015-07-20T10:02:10.567 
 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: In function 'bool 
 ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t)' thread 
 7f2af94df700 time 2015-07-20 10:02:10.481916
 2015-07-20T10:02:10.567 
 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: 412: FAILED 
 assert(!actingbackfill.empty())
 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr: ceph version 
 9.0.2-799-gba9c2ae (ba9c2ae4bffd3fd7b26a2e0ce843913b77940b8a)
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 1: 
 (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
 [0xc45d1b]
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 2: ceph-osd() 
 [0x88535d]
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 3: 
 (ReplicatedPG::hit_set_remove_all()+0x7c) [0x8b039c]
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 4: 
 (ReplicatedPG::on_pool_change()+0x161) [0x8b1a21]
 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 5: 
 (PG::handle_advance_map(std::tr1::shared_ptrOSDMap const, 
 std::tr1::shared_ptrOSDMap const, std::vectorint, std::allocatorint , 
 int, std::vectorint, std::allocatorint , int, PG::RecoveryCtx*)+0x60c) 
 [0x8348fc]
 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 6: 
 (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle, PG::RecoveryCtx*, 
 std::setboost::intrusive_ptrPG, std::lessboost::intrusive_ptrPG , 
 std::allocatorboost::intrusive_ptrPG  *)+0x2c3) [0x6dcc73]
 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 7: 
 (OSD::process_peering_events(std::listPG*, std::allocatorPG*  const, 
 ThreadPool::TPHandle)+0x1f1) [0x6dd721]
 2015-07-20T10:02:10.572 INFO:tasks.ceph.osd.5.ovh165019.stderr: 8: 
 (OSD::PeeringWQ::_process(std::listPG*, std::allocatorPG*  const, 
 ThreadPool::TPHandle)+0x18) [0x7328d8]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 9: 
 (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xc3677e]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 10: 
 (ThreadPool::WorkThread::entry()+0x10) [0xc37820]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 11: 
 (()+0x8182) [0x7f2b149e3182]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 12: 
 (clone()+0x6d) [0x7f2b12d2847d]
 
 
 In
 
 http://149.202.164.239/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/24/
 
 I see the same error as below.
 
 In
 
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/8/
 
 it looks like the run was about to finish, just took a long time, and should 
 be ignored as a false negative.
 
 On 20/07/2015 14:52, Loic Dachary wrote:
 Hi,

 I checked one of the timeout (dead) at 
 

Re: rados/thrash on OpenStack

2015-07-21 Thread Loic Dachary
Note however that only one of the dead (timed out) jobs has an assert (it looks 
like it's because the file system is not as it should be, which is expected since 
there are no attached disks to the instances, and therefore no way for the job to 
mkfs the file system of choice). All the others timed out just because they either 
need more disk or more time.

On 21/07/2015 09:30, Loic Dachary wrote:
 Hi Kefu,
 
 The following runs on OpenStack and the next branch 
 http://integration.ceph.dachary.org:8081/ubuntu-2015-07-21_00:04:04-rados-next---basic-openstack/
  and 15 out of the 16 dead jobs (timed out after 3 hours) are from 
 rados/thrash. A rados suite run on next dated a few days ago in the sepia lab 
 ( 
 http://pulpito.ceph.com/teuthology-2015-07-15_21:00:10-rados-next-distro-basic-multi/
  ) also has a few dead jobs but only two of them are from rados/thrash.
 
 Cheers
 
 
 On 20/07/2015 16:23, Loic Dachary wrote:
 More information about this run. I'll run a rados suite on master on 
 OpenStack to get a baseline of what we should expect.

 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/12/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/14/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/15/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/17/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/20/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/21/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/22/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/23/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/26/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/28/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/2/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/5/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/6/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/7/
 http://149.202.164.239:8081/ubuntu-2015-07-20_09:21:01-rados-wip-kefu-testing---basic-openstack/9/

 I see

 2015-07-20T10:02:10.567 
 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: In function 
 'bool ReplicatedPG::is_degraded_or_backfilling_object(const hobject_t)' 
 thread 7f2af94df700 time 2015-07-20 10:02:10.481916
 2015-07-20T10:02:10.567 
 INFO:tasks.ceph.osd.5.ovh165019.stderr:osd/ReplicatedPG.cc: 412: FAILED 
 assert(!actingbackfill.empty())
 2015-07-20T10:02:10.567 INFO:tasks.ceph.osd.5.ovh165019.stderr: ceph version 
 9.0.2-799-gba9c2ae (ba9c2ae4bffd3fd7b26a2e0ce843913b77940b8a)
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 1: 
 (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
 [0xc45d1b]
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 2: 
 ceph-osd() [0x88535d]
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 3: 
 (ReplicatedPG::hit_set_remove_all()+0x7c) [0x8b039c]
 2015-07-20T10:02:10.568 INFO:tasks.ceph.osd.5.ovh165019.stderr: 4: 
 (ReplicatedPG::on_pool_change()+0x161) [0x8b1a21]
 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 5: 
 (PG::handle_advance_map(std::tr1::shared_ptrOSDMap const, 
 std::tr1::shared_ptrOSDMap const, std::vectorint, std::allocatorint , 
 int, std::vectorint, std::allocatorint , int, PG::RecoveryCtx*)+0x60c) 
 [0x8348fc]
 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 6: 
 (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle, PG::RecoveryCtx*, 
 std::setboost::intrusive_ptrPG, std::lessboost::intrusive_ptrPG , 
 std::allocatorboost::intrusive_ptrPG  *)+0x2c3) [0x6dcc73]
 2015-07-20T10:02:10.569 INFO:tasks.ceph.osd.5.ovh165019.stderr: 7: 
 (OSD::process_peering_events(std::listPG*, std::allocatorPG*  const, 
 ThreadPool::TPHandle)+0x1f1) [0x6dd721]
 2015-07-20T10:02:10.572 INFO:tasks.ceph.osd.5.ovh165019.stderr: 8: 
 (OSD::PeeringWQ::_process(std::listPG*, std::allocatorPG*  const, 
 ThreadPool::TPHandle)+0x18) [0x7328d8]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 9: 
 (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xc3677e]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 10: 
 (ThreadPool::WorkThread::entry()+0x10) [0xc37820]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 11: 
 (()+0x8182) [0x7f2b149e3182]
 2015-07-20T10:02:10.573 INFO:tasks.ceph.osd.5.ovh165019.stderr: 12: 
 (clone()+0x6d) [0x7f2b12d2847d]


 In

 

Re: dmcrypt with luks keys in hammer

2015-07-21 Thread David Disseldorp
Hi,

On Mon, 20 Jul 2015 15:21:50 -0700 (PDT), Sage Weil wrote:

 On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
  No luck with ceph-disk-activate (all or just one device).
  
  $ sudo ceph-disk-activate /dev/sdv1
  mount: unknown filesystem type 'crypto_LUKS'
  ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
  'crypto_LUKS', '-o', '', '--', '/dev/sdv1',
  '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32
  
  
  Its odd that it should complain about the crypto_LUKS filesystem not
  being recognized, because it did mount some of the LUKS systems
  successfully, though not sometimes just the data and not the journal
  (or vice versa).
  
  $ lsblk /dev/sdb
  NAMEMAJ:MIN RM   SIZE RO
  TYPE  MOUNTPOINT
  sdb   8:16   0   3.7T  0 disk
  ??sdb18:17   0   3.6T  0 part
  ? ??e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:00   3.6T  0
  crypt /var/lib/ceph/osd/ceph-54
  ??sdb28:18   010G  0 part
??temporary-cryptsetup-1235 (dm-6)252:60   125K  1 crypt
  
  
  $ blkid /dev/sdb1
  /dev/sdb1: UUID=d6194096-a219-4732-8d61-d0c125c49393 TYPE=crypto_LUKS
  
  
  A race condition (or other issue) with udev seems likely given that
  its rather random which ones come up and which ones don't.
 
 A race condition during creation or activation?  If it's activation I 
 would expect ceph-disk activate ... to work reasonably reliably when 
 called manually (on a single device at a time).

We encountered similar issues on a non-dmcrypt firefly deployment with
10 OSDs per node.

I've been working on a patch set to defer device activation to systemd
services. ceph-disk activate is extended to support mapping of dmcrypt
devices prior to OSD startup.

The master-based changes aren't ready for upstream yet, but can be found
in my WIP branch at:
https://github.com/ddiss/ceph/tree/wip_bnc926756_split_udev_systemd_master

There are a few things that I'd still like to address before submitting
upstream, mostly covering activate-journal:
- The test/ceph-disk.sh unit tests need to be extended and fixed.
- The activate-journal --dmcrypt changes are less than optimal, and leave
  me with a few unanswered questions:
  + Does get_journal_osd_uuid(dev) return the plaintext or cyphertext
uuid?
  + If a journal is encrypted, is the data partition also always
encrypted?
- dmcrypt journal device mapping should probably also be split out into
  a separate systemd service, as that'll be needed for the future
  network based key retrieval feature.

Feedback on the approach taken would be appreciated.

Cheers, David
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About Fio backend with ObjectStore API

2015-07-21 Thread Haomai Wang
Hi Casey,

I check your commits and know what you fixed. I cherry-picked your new
commits but I still met the same problem.


It's strange that it always hits a segment fault when entering
_fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but
when I go up the stack, td->io_ops is not null. Maybe it's related
to dlopen?


Do you have any hint about this?

On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote:
 Hi Haomai,

 I was able to run this after a couple changes to the filestore.fio job
 file. Two of the config options were using the wrong names. I pushed a
 fix for the job file, as well as a patch that renames everything from
 filestore to objectstore (thanks James), to
 https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore.

 I found that the read support doesn't appear to work anymore, so give
 rw=write a try. And because it does a mkfs(), make sure you're
 pointing it to an empty xfs directory with the directory= option.

 Casey

 On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote:
 Anyone who have successfully ran the fio with this external io engine
 ceph_objectstore?

 It's strange that it always hits a segment fault when entering
 _fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but
 when I go up the stack, td->io_ops is not null. Maybe it's related
 to dlopen?

 On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote:
 I have rebased the branch with master, and push it to ceph upstream
 repo. https://github.com/ceph/ceph/compare/fio-objectstore?expand=1

 Plz let me know if who is working on this. Otherwise, I would like to
 improve this to be merge ready.

 On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote:
 That makes sense.

 Matt

 - James (Fei) Liu-SSI james@ssi.samsung.com wrote:

 Hi Casey,
   Got it. I was directed to the old code base. By the way, Since the
 testing case was used to exercise all of object stores.  Strongly
 recommend to change the name from fio_ceph_filestore.cc to
 fio_ceph_objectstore.cc . And the code in fio_ceph_filestore.cc should
 be refactored to reflect that the whole objectstore will be supported
 by fio_ceph_objectstore.cc. what you think?

 Let me know if you need any help from my side.


 Regards,
 James



 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Thursday, July 09, 2015 12:32 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi James,

 Are you looking at the code from
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It
 uses ObjectStore::create() instead of new FileStore(). This allows us
 to exercise all of the object stores with the same code.

 Casey

 On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
Here is the code in the fio_ceph_filestore.cc. Basically, it
 creates a filestore as backend engine for IO exercises. If we got to
 send IO commands to KeyValue Store or Newstore, we got to change the
 code accordingly, right?  I did not see any other files like
 fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble
 opinion, we might need to create other two fio engines for
 keyvaluestore and newstore if we want to exercise these two, right?
 
  Regards,
  James
 
  static int fio_ceph_filestore_init(struct thread_data *td)
  {
      vector<const char*> args;
      struct ceph_filestore_data *ceph_filestore_data =
          (struct ceph_filestore_data *) td->io_ops->data;
      ObjectStore::Transaction ft;

      global_init(NULL, args, CEPH_ENTITY_TYPE_OSD, CODE_ENVIRONMENT_UTILITY, 0);
      //g_conf->journal_dio = false;
      common_init_finish(g_ceph_context);
      //g_ceph_context->_conf->set_val("debug_filestore", "20");
      //g_ceph_context->_conf->set_val("debug_throttle", "20");
      g_ceph_context->_conf->apply_changes(NULL);

      ceph_filestore_data->osd_path = strdup("/mnt/fio_ceph_filestore.XXX");
      ceph_filestore_data->journal_path =
          strdup("/var/lib/ceph/osd/journal-ram/fio_ceph_filestore.XXX");

      if (!mkdtemp(ceph_filestore_data->osd_path)) {
          cout << "mkdtemp failed: " << strerror(errno) << std::endl;
          return 1;
      }
      //mktemp(ceph_filestore_data->journal_path); // NOSPC issue

      ObjectStore *fs = new FileStore(ceph_filestore_data->osd_path,
                                      ceph_filestore_data->journal_path);
      ceph_filestore_data->fs = fs;

      if (fs->mkfs() < 0) {
          cout << "mkfs failed" << std::endl;
          goto failed;
      }

      if (fs->mount() < 0) {
          cout << "mount failed" << std::endl;
          goto failed;
      }

      ft.create_collection(coll_t());
      fs->apply_transaction(ft);

      return 0;

hdparm -W redux, bug in _check_disk_write_cache for RHEL6?

2015-07-21 Thread Dan van der Ster
Hi,

Following the sf.net corruption report I've been checking our config
w.r.t data consistency. AFAIK the two main recommendations are:

  1) don't mount FileStores with nobarrier
  2) disable write-caching (hdparm -W 0 /dev/sdX) when using block dev
journals and your kernel is < 2.6.33

Obviously we don't do (1) because that would be crazy, but for (2) we
didn't disable yet write-caching, probably because we didn't notice
the doc.

But my lame excuse is that apparently _check_disk_write_cache in
FileJournal.cc doesn't print a warning when it should, because hdparm
-W doesn't always work on partitions rather than whole block devices.
See:

GOOD: ceph 0.94.2, kernel 3.10.0-229.7.2.el7.x86_64, hdparm v9.43:

   10 journal _open_block_device: ignoring osd journal size. We'll use
the entire block device (size: 21474836480)
   20 journal _check_disk_write_cache: disk write cache is on, but
your kernel is new enough to handle it correctly.
(fn:/var/lib/ceph/osd/ceph-96/journal)
1 journal _open /var/lib/ceph/osd/ceph-96/journal fd 20:
21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1


BAD: ceph 0.94.2, kernel 2.6.32-431.29.2.el6.x86_64, hdparm v9.43:

   10 journal _open_block_device: ignoring osd journal size. We'll use
the entire block device (size: 21474836480)
1 journal _open /var/lib/ceph/osd/ceph-56/journal fd 19:
21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1


In other words, running hammer on EL6, _check_disk_write_cache exits
without printing anything, but actually it should log the scary
WARNING: disk write cache is ON.

I guess it's because of this:

GOOD # uname -r && hdparm -W /dev/sda && hdparm -W /dev/sda1
3.10.0-229.7.2.el7.x86_64

/dev/sda1:
 write-caching =  1 (on)

/dev/sda:
 write-caching =  1 (on)


BAD # uname -r && hdparm -W /dev/sda && hdparm -W /dev/sda1
2.6.32-431.23.3.el6.x86_64

/dev/sda:
 write-caching =  1 (on)

/dev/sda1:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device


(in both cases /dev/sda is an INTEL SSDSC2BA20).

So a few questions to end this:
  1) What was the magic patch in 2.6.33 which made write-caching safe?
  2) What's the recommended recourse here: hopefully Red Hat
backported the necessary to their 2.6.32 kernel, but if not should we
fix _check_disk_write_cache and make some publicity for people to
check their configs?

Best Regards,

Dan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


teuthology rados runs for next

2015-07-21 Thread Loic Dachary
Hi Sam,

I noticed today that http://pulpito.ceph.com/?suite=radosbranch=next is 
lagging three days behind. Do we want to keep all the runs or should we kill 
the older ones ? I suppose there would be value in having the results for all 
of them but given the current load in the sepia lab it also significantly 
delays them. What do you think ?

Cheers
-- 
Loïc Dachary, Artisan Logiciel Libre





Re: teuthology rados runs for next

2015-07-21 Thread Sage Weil
On Tue, 21 Jul 2015, Loic Dachary wrote:
 Hi Sam,
 
 I noticed today that http://pulpito.ceph.com/?suite=radosbranch=next is 
 lagging three days behind. Do we want to keep all the runs or should we 
 kill the older ones ? I suppose there would be value in having the 
 results for all of them but given the current load in the sepia lab it 
 also significantly delays them. What do you think ?

I think it's better to kill old scheduled runs.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hdparm -W redux, bug in _check_disk_write_cache for RHEL6?

2015-07-21 Thread Sage Weil
On Tue, 21 Jul 2015, Dan van der Ster wrote:
 Hi,
 
 Following the sf.net corruption report I've been checking our config
 w.r.t data consistency. AFAIK the two main recommendations are:
 
   1) don't mount FileStores with nobarrier
   2) disable write-caching (hdparm -W 0 /dev/sdX) when using block dev
 journals and your kernel is < 2.6.33
 
 Obviously we don't do (1) because that would be crazy, but for (2) we
 didn't disable yet write-caching, probably because we didn't notice
 the doc.
 
 But my lame excuse is that apparently _check_disk_write_cache in
 FileJournal.cc doesn't print a warning when it should, because hdparm
 -W doesn't always work on partitions rather than whole block devices.
 See:
 
 GOOD: ceph 0.94.2, kernel 3.10.0-229.7.2.el7.x86_64, hdparm v9.43:
 
10 journal _open_block_device: ignoring osd journal size. We'll use
 the entire block device (size: 21474836480)
20 journal _check_disk_write_cache: disk write cache is on, but
 your kernel is new enough to handle it correctly.
 (fn:/var/lib/ceph/osd/ceph-96/journal)
 1 journal _open /var/lib/ceph/osd/ceph-96/journal fd 20:
 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1
 
 
 BAD: ceph 0.94.2, kernel 2.6.32-431.29.2.el6.x86_64, hdparm v9.43:
 
10 journal _open_block_device: ignoring osd journal size. We'll use
 the entire block device (size: 21474836480)
 1 journal _open /var/lib/ceph/osd/ceph-56/journal fd 19:
 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1
 
 
 In other words, running hammer on EL6, _check_disk_write_cache exits
 without printing anything, but actually it should log the scary
 WARNING: disk write cache is ON.
 
 I guess it's because of this:
 
 GOOD # uname -r && hdparm -W /dev/sda && hdparm -W /dev/sda1
 3.10.0-229.7.2.el7.x86_64
 
 /dev/sda1:
  write-caching =  1 (on)
 
 /dev/sda:
  write-caching =  1 (on)
 
 
 BAD # uname -r && hdparm -W /dev/sda && hdparm -W /dev/sda1
 2.6.32-431.23.3.el6.x86_64
 
 /dev/sda:
  write-caching =  1 (on)
 
 /dev/sda1:
  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 
 
 (in both cases /dev/sda is an INTEL SSDSC2BA20).
 
 So a few questions to end this:
   1) What was the magic patch in 2.6.33 which made write-caching safe?

The specific behavior is that we want fsync or fdatasync to flush the 
write cache on the underlying device.  Unfortunately I've lost track of 
which commit led me to the magic 2.6.33 number.  However, this reference 
seems to confirm that 2.6.33 is a safe upper bound:

http://monolight.cc/2011/06/barriers-caches-filesystems/

   2) What's the recommended recourse here: hopefully Red Hat
 backported the necessary to their 2.6.32 kernel, but if not should we
 fix _check_disk_write_cache and make some publicity for people to
 check their configs?

I have no doubt that any and all patches related to flushing caches on 
fsync are part of the el6 kernel.

What's embarrassing is that hdparm fails on kernels old enough to fail the 
test :).  The fix is probably to strip off the partition number (ideally 
using the helpers in blkdev.cc so that it works even for weirdly-named 
devices) and run hdparm on that.
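
As a purely illustrative sketch (this is not the existing blkdev.cc helper, and
the names below are made up), the partition-to-whole-disk step could walk the
Linux sysfs layout, where /sys/dev/block/<maj>:<min> is a symlink that ends in
.../block/sda/sda1 for a partition and .../block/sda for a whole disk:

// Rough sketch only: resolve e.g. /dev/sda1 to /dev/sda via sysfs so the
// write cache check can probe the whole disk instead of the partition.
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>
#include <climits>
#include <cstdio>
#include <string>

static int whole_disk_of(const char *partdev, std::string *out)
{
  struct stat st;
  if (stat(partdev, &st) < 0 || !S_ISBLK(st.st_mode))
    return -1;
  char link[64], target[PATH_MAX];
  snprintf(link, sizeof(link), "/sys/dev/block/%u:%u",
           major(st.st_rdev), minor(st.st_rdev));
  ssize_t n = readlink(link, target, sizeof(target) - 1);
  if (n < 0)
    return -1;
  target[n] = '\0';                              // ../../devices/.../block/sda/sda1
  std::string path(target);
  size_t cut = path.rfind('/');
  std::string leaf = path.substr(cut + 1);       // "sda1", or "sda" for a whole disk
  std::string parentdir = path.substr(0, cut);
  std::string parent = parentdir.substr(parentdir.rfind('/') + 1);  // "sda", or "block"
  *out = "/dev/" + (parent == "block" ? leaf : parent);
  return 0;
}

hdparm (or the underlying ioctl) could then be pointed at the resolved
whole-disk node rather than the partition.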

sage


 
 Best Regards,
 
 Dan
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dmcrypt with luks keys in hammer

2015-07-21 Thread Sage Weil
On Tue, 21 Jul 2015, David Disseldorp wrote:
 Hi,
 
 On Mon, 20 Jul 2015 15:21:50 -0700 (PDT), Sage Weil wrote:
 
  On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
   No luck with ceph-disk-activate (all or just one device).
   
   $ sudo ceph-disk-activate /dev/sdv1
   mount: unknown filesystem type 'crypto_LUKS'
   ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
   'crypto_LUKS', '-o', '', '--', '/dev/sdv1',
   '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32
   
   
   Its odd that it should complain about the crypto_LUKS filesystem not
   being recognized, because it did mount some of the LUKS systems
   successfully, though not sometimes just the data and not the journal
   (or vice versa).
   
   $ lsblk /dev/sdb
   NAMEMAJ:MIN RM   SIZE RO
   TYPE  MOUNTPOINT
   sdb   8:16   0   3.7T  0 disk
   ??sdb18:17   0   3.6T  0 part
   ? ??e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:00   3.6T  0
   crypt /var/lib/ceph/osd/ceph-54
   ??sdb28:18   010G  0 part
 ??temporary-cryptsetup-1235 (dm-6)252:60   125K  1 crypt
   
   
   $ blkid /dev/sdb1
   /dev/sdb1: UUID=d6194096-a219-4732-8d61-d0c125c49393 TYPE=crypto_LUKS
   
   
   A race condition (or other issue) with udev seems likely given that
   its rather random which ones come up and which ones don't.
  
  A race condition during creation or activation?  If it's activation I 
  would expect ceph-disk activate ... to work reasonably reliably when 
  called manually (on a single device at a time).
 
 We encountered similar issues on a non-dmcrypt firefly deployment with
 10 OSDs per node.
 
 I've been working on a patch set to defer device activation to systemd
 services. ceph-disk activate is extended to support mapping of dmcrypt
 devices prior to OSD startup.
 
 The master-based changes aren't ready for upstream yet, but can be found
 in my WIP branch at:
 https://github.com/ddiss/ceph/tree/wip_bnc926756_split_udev_systemd_master

This approach looks to be MUCH MUCH better than what we're doing right 
now!
 
 There are a few things that I'd still like to address before submitting
 upstream, mostly covering activate-journal:
 - The test/ceph-disk.sh unit tests need to be extended and fixed.
 - The activate-journal --dmcrypt changes are less than optimal, and leave
   me with a few unanswered questions:
   + Does get_journal_osd_uuid(dev) return the plaintext or cyphertext
 uuid?

The uuid is never encrypted.

   + If a journal is encrypted, is the data partition also always
 encrypted?

Yes (I don't think it's useful to support a mixed encrypted/unencrypted 
OSD).

 - dmcrypt journal device mapping should probably also be split out into
   a separate systemd service, as that'll be needed for the future
   network based key retrieval feature.
 
 Feedback on the approach taken would be appreciated.

My only regret is that it won't help non-systemd cases, but I'm okay with 
leaving those as is (users can use the existing workarounds, like 
'ceph-disk activate-all' in rc.local to mop up stragglers) and focus 
instead on the new systemd world.

Let us know if there's anything else we can do to help!

sage

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: The design of the eviction improvement

2015-07-21 Thread Sage Weil
On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Tuesday, July 21, 2015 6:38 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: Re: The design of the eviction improvement
  
  On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
   Hi all,
  
   This is a follow-up of one of the CDS session at
  http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
  ng_eviction. We discussed the drawbacks of the current eviction algorithm 
  and
  several ways to improve it. Seems like the LRU variants is the right way to 
  go. I
  come up with some design points after the CDS, and want to discuss it with 
  you.
  It is an approximate 2Q algorithm, combining some benefits of the clock
  algorithm, similar to what the linux kernel does for the page cache.
  
  Unfortunately I missed this last CDS so I'm behind on the discussion.  I 
  have a
  few questions though...
  
   # Design points:
  
   ## LRU lists
   - Maintain LRU lists at the PG level.
   The SharedLRU and SimpleLRU implementation in the current code have a
   max_size, which limits the max number of elements in the list. This
   mostly looks like a MRU, though its name implies they are LRUs. Since
   the object size may vary in a PG, it's not possible to calculate the
   total number of objects which the cache tier can hold ahead of time.
   We need a new LRU implementation with no limit on the size.
  
  This last sentence seems to me to be the crux of it.  Assuming we have an
  OSD based by flash storing O(n) objects, we need a way to maintain an LRU of
  O(n) objects in memory.  The current hitset-based approach was taken based
  on the assumption that this wasn't feasible--or at least we didn't know how 
  to
  implmement such a thing.  If it is, or we simply want to stipulate that 
  cache
  tier OSDs get gobs of RAM to make it possible, then lots of better options
  become possible...
  
  Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
  1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry
  that's 100MB... so not so unreasonable, perhaps!
 
 I was having the same question before proposing this. I did a similar 
 calculation and thought it would be ok to use this much memory :-)

The part that worries me now is the speed with which we can load and 
manage such a list.  Assuming it is several hundred MB, it'll take a while 
to load that into memory and set up all the pointers (assuming 
a conventional linked list structure).  Maybe tens of seconds...

I wonder if instead we should construct some sort of flat model where we 
load slabs of contiguous memory, 10's of MB each, and have the 
next/previous pointers be a (slab,position) pair.  That way we can load it 
into memory in big chunks, quickly, and be able to operate on it (adjust 
links) immediately.

Another thought: currently we use the hobject_t hash only instead of the 
full object name.  We could continue to do the same, or we could do a hash 
pair (hobject_t hash + a different hash of the rest of the object) to keep 
the representation compact.  With a model like the above, that could get 
the object representation down to 2 u32's.  A link could be a slab + 
position (2 more u32's), and if we have prev + next that'd be just 6x4=24 
bytes per object.

With fixed-sized slots on the slabs, the slab allocator could be very 
simple... maybe just a bitmap, a free counter, and any other trivial 
optimizations to make finding a slab's next free slot nice and quick.
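
As a purely illustrative sketch of that layout (none of this is existing Ceph
code, and all the names are made up), the fixed 24-byte entry and a slab with a
trivial allocator could look roughly like:

// Sketch only: 6 x u32 per entry, prev/next expressed as (slab, position)
// pairs instead of raw pointers, so a slab loads as one contiguous chunk.
#include <cstdint>
#include <vector>

struct lru_link {
  uint32_t slab;          // which slab the neighbour lives in
  uint32_t pos;           // slot index within that slab
};

struct lru_entry {        // 24 bytes
  uint32_t hobject_hash;  // the hobject_t hash used today
  uint32_t name_hash;     // a second hash over the rest of the object name
  lru_link prev;
  lru_link next;
};

struct lru_slab {
  static const uint32_t SLOTS = (32u << 20) / sizeof(lru_entry);  // ~32MB slab
  std::vector<lru_entry> entries;
  std::vector<bool> used; // stand-in for a real free bitmap
  uint32_t free_count;

  lru_slab() : entries(SLOTS), used(SLOTS, false), free_count(SLOTS) {}

  // naive next-free scan; a real allocator would keep a word-indexed hint
  int64_t alloc() {
    if (free_count == 0)
      return -1;
    for (uint32_t i = 0; i < SLOTS; ++i) {
      if (!used[i]) {
        used[i] = true;
        --free_count;
        return i;
      }
    }
    return -1;
  }

  void release(uint32_t i) {
    used[i] = false;
    ++free_count;
  }
};

Since a slab is just a flat array, it can be read or written as a single
contiguous chunk and the (slab, position) links stay valid without any pointer
fix-up after loading.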

   - Two lists for each PG: active and inactive Objects are first put
   into the inactive list when they are accessed, and moved between these two
  lists based on some criteria.
   Object flag: active, referenced, unevictable, dirty.
   - When an object is accessed:
   1) If it's not in both of the lists, it's put on the top of the
   inactive list
   2) If it's in the inactive list, and the referenced flag is not set, the 
   referenced
  flag is set, and it's moved to the top of the inactive list.
   3) If it's in the inactive list, and the referenced flag is set, the 
   referenced flag
  is cleared, and it's removed from the inactive list, and put on top of the 
  active
  list.
   4) If it's in the active list, and the referenced flag is not set, the 
   referenced
  flag is set, and it's moved to the top of the active list.
   5) If it's in the active list, and the referenced flag is set, it's moved 
   to the top
  of the active list.
   - When selecting objects to evict:
   1) Objects at the bottom of the inactive list are selected to evict. They 
   are
  removed from the inactive list.
   2) If the number of the objects in the inactive list becomes low, some of 
   the
  objects at the bottom of the active list are moved to the inactive list. 
  For those
  objects which have the referenced flag 

upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Stefan Priebe - Profihost AG
Hi,

I remember there was a bug in ceph before - not sure in which release -
where exporting the same rbd snap multiple times resulted in different
raw images.

Currently running upstream/firefly and I'm seeing the same again.


# rbd export cephstor/disk-116@snap dump1
# sleep 10
# rbd export cephstor/disk-116@snap dump2

# md5sum -b dump*
b89198f118de59b3aa832db1bfddaf8f *dump1
f63ed9345ac2d5898483531e473772b1 *dump2

Can anybody help?

Greets,
Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


local teuthology testing

2015-07-21 Thread Zhou, Yuan
Hi David/Loic,

I was also trying to set up some local Teuthology clusters here. The biggest 
issue I met is in ceph-qa-chef - there are lots of hardcoded URLs related 
to the sepia lab, and I have to trace the code and change them line by line. 

Can you please share how you got this working? Is there an easy way 
to fix this?

Thanks, -yuan


Re: The design of the eviction improvement

2015-07-21 Thread Matt W. Benjamin
Hi,

Couple of points.

1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU 
implementation
with 2 levels currently, plus a pinned queue, that addresses stuff like 
partitioning (sharding), scan resistance, and coordination w/lookup tables.  
We might extend/re-use it.

2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
segments (are you proposing to/do we now always cache whole objects?), and cost
of looking for dirty objects;  I suspect that it makes sense to amortize the
cost of locating segments eligible to be flushed, rather than minimize
bookkeeping.

Matt

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:

  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Tuesday, July 21, 2015 6:38 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: Re: The design of the eviction improvement
  
  On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
   Hi all,
  
   This is a follow-up of one of the CDS session at
 
 http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
  ng_eviction. We discussed the drawbacks of the current eviction
 algorithm and
  several ways to improve it. Seems like the LRU variants is the right
 way to go. I
  come up with some design points after the CDS, and want to discuss
 it with you.
  It is an approximate 2Q algorithm, combining some benefits of the
 clock
  algorithm, similar to what the linux kernel does for the page
 cache.
  
  Unfortunately I missed this last CDS so I'm behind on the
 discussion.  I have a
  few questions though...
  
   # Design points:
  
   ## LRU lists
   - Maintain LRU lists at the PG level.
   The SharedLRU and SimpleLRU implementation in the current code
 have a
   max_size, which limits the max number of elements in the list.
 This
   mostly looks like a MRU, though its name implies they are LRUs.
 Since
   the object size may vary in a PG, it's not possible to caculate
 the
   total number of objects which the cache tier can hold ahead of
 time.
   We need a new LRU implementation with no limit on the size.
  
  This last sentence seems to me to be the crux of it.  Assuming we
 have an
  OSD based by flash storing O(n) objects, we need a way to maintain
 an LRU of
  O(n) objects in memory.  The current hitset-based approach was taken
 based
  on the assumption that this wasn't feasible--or at least we didn't
 know how to
  implmement such a thing.  If it is, or we simply want to stipulate
 that cache
  tier OSDs get gobs of RAM to make it possible, then lots of better
 options
  become possible...
  
  Let's say you have a 1TB SSD, with an average object size of 1MB --
 that's
  1 million objects.  At maybe ~100bytes per object of RAM for an LRU
 entry
  that's 100MB... so not so unreasonable, perhaps!
 
 I was having the same question before proposing this. I did the
 similar calculation and thought it would be ok to use this many memory
 :-)
 
  
   - Two lists for each PG: active and inactive Objects are first
 put
   into the inactive list when they are accessed, and moved between
 these two
  lists based on some criteria.
   Object flag: active, referenced, unevictable, dirty.
   - When an object is accessed:
   1) If it's not in both of the lists, it's put on the top of the
   inactive list
   2) If it's in the inactive list, and the referenced flag is not
 set, the referenced
  flag is set, and it's moved to the top of the inactive list.
   3) If it's in the inactive list, and the referenced flag is set,
 the referenced flag
  is cleared, and it's removed from the inactive list, and put on top
 of the active
  list.
   4) If it's in the active list, and the referenced flag is not set,
 the referenced
  flag is set, and it's moved to the top of the active list.
   5) If it's in the active list, and the referenced flag is set,
 it's moved to the top
  of the active list.
   - When selecting objects to evict:
   1) Objects at the bottom of the inactive list are selected to
 evict. They are
  removed from the inactive list.
   2) If the number of the objects in the inactive list becomes low,
 some of the
  objects at the bottom of the active list are moved to the inactive
 list. For those
  objects which have the referenced flag set, they are given one more
 chance in
  the active list. They are moved to the top of the active list with
 the referenced
  flag cleared. For those objects which don't have the referenced flag
 set, they
  are moved to the inactive list, with the referenced flag set. So
 that they can be
  quickly promoted to the active list when necessary.
  
   ## Combine flush with eviction
   - When evicting an object, if it's dirty, it's flushed first.
 After flushing, it's
  evicted. If not dirty, it's evicted directly.
   - This means that we won't have separate activities and won't set
 different
  ratios for flush and evict. Is there a need to do so?
   

Re: hdparm -W redux, bug in _check_disk_write_cache for RHEL6?

2015-07-21 Thread Ilya Dryomov
On Tue, Jul 21, 2015 at 4:54 PM, Sage Weil s...@newdream.net wrote:
 On Tue, 21 Jul 2015, Dan van der Ster wrote:
 Hi,

 Following the sf.net corruption report I've been checking our config
 w.r.t data consistency. AFAIK the two main recommendations are:

   1) don't mount FileStores with nobarrier
   2) disable write-caching (hdparm -W 0 /dev/sdX) when using block dev
 journals and your kernel is < 2.6.33

 Obviously we don't do (1) because that would be crazy, but for (2) we
 didn't disable yet write-caching, probably because we didn't notice
 the doc.

 But my lame excuse is that apparently _check_disk_write_cache in
 FileJournal.cc doesn't print a warning when it should, because hdparm
 -W doesn't always work on partitions rather than whole block devices.
 See:

 GOOD: ceph 0.94.2, kernel 3.10.0-229.7.2.el7.x86_64, hdparm v9.43:

10 journal _open_block_device: ignoring osd journal size. We'll use
 the entire block device (size: 21474836480)
20 journal _check_disk_write_cache: disk write cache is on, but
 your kernel is new enough to handle it correctly.
 (fn:/var/lib/ceph/osd/ceph-96/journal)
 1 journal _open /var/lib/ceph/osd/ceph-96/journal fd 20:
 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1


 BAD: ceph 0.94.2, kernel 2.6.32-431.29.2.el6.x86_64, hdparm v9.43:

10 journal _open_block_device: ignoring osd journal size. We'll use
 the entire block device (size: 21474836480)
 1 journal _open /var/lib/ceph/osd/ceph-56/journal fd 19:
 21474836480 bytes, block size 4096 bytes, directio = 1, aio = 1


 In other words, running hammer on EL6, _check_disk_write_cache exits
 without printing anything, but actually it should log the scary
 WARNING: disk write cache is ON.

 I guess it's because of this:

 GOOD # uname -r && hdparm -W /dev/sda && hdparm -W /dev/sda1
 3.10.0-229.7.2.el7.x86_64

 /dev/sda1:
  write-caching =  1 (on)

 /dev/sda:
  write-caching =  1 (on)


 BAD # uname -r && hdparm -W /dev/sda && hdparm -W /dev/sda1
 2.6.32-431.23.3.el6.x86_64

 /dev/sda:
  write-caching =  1 (on)

 /dev/sda1:
  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device


 (in both cases /dev/sda is an INTEL SSDSC2BA20).

 So a few questions to end this:
   1) What was the magic patch in 2.6.33 which made write-caching safe?

 The specific behavior is that we want fsync or fdatasync to flush the
 write cache on the underlying device.  Unfortunately I've lost track of
 which commit led me to the magic 2.6.33 number.  However, this reference
 seems to confirm that 2.6.33 is a safe upper bound:

 http://monolight.cc/2011/06/barriers-caches-filesystems/

This one, I think:

commit ab0a9735e06914ce4d2a94ffa41497dbc142fe7f
Author: Christoph Hellwig h...@lst.de
Date:   Thu Oct 29 14:14:04 2009 +0100

blkdev: flush disk cache on ->fsync

Currently there is no barrier support in the block device code.  That
means we cannot guarantee any sort of data integerity when using the
block device node with dis kwrite caches enabled.  Using the raw block
device node is a typical use case for virtualization (and I assume
databases, too).  This patch changes block_fsync to issue a cache flush
and thus make fsync on block device nodes actually useful.

Note that in mainline we would also need to add such code to the
->aio_write method for O_SYNC handling, but assuming that Jan's patch
series for the O_SYNC rewrite goes in it will also call into ->fsync
for 2.6.32.

Signed-off-by: Christoph Hellwig h...@lst.de
Signed-off-by: Jens Axboe jens.ax...@oracle.com


   2) What's the recommended recourse here: hopefully Red Hat
 backported the necessary to their 2.6.32 kernel, but if not should we
 fix _check_disk_write_cache and make some publicity for people to
 check their configs?

 I have no doubt that any and all patches related to flushing caches on
 fsync are part of the el6 kernel.

 What's embarrassing is that hdparm fails on kernels old enough to fail the
 test :).  The fix is probably to strip off the partition number (ideally
 using the helpers in blkdev.cc so that it works even for weirdly-named
 devices) and run hdparm on that.

We should look into using libblkid for this and nuking blkdev.cc.  rbd
unmap supports unmap by partition and already relies on libblkid to do
the partition -> whole disk thing.  I can't remember if that function
is old enough to be in el6 base, I can take a stab at this if it is...
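
For reference, a rough sketch of that libblkid route (assuming
blkid_devno_to_wholedisk() from util-linux is available on the platforms we
care about; illustrative only, link with -lblkid):

// Map a partition device (e.g. a journal on /dev/sda1) back to its whole disk
// so the write-cache check can probe the disk rather than the partition.
#include <blkid/blkid.h>
#include <sys/stat.h>
#include <cstdio>

int main(int argc, char **argv)
{
  if (argc != 2) {
    fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
    return 1;
  }
  struct stat st;
  if (stat(argv[1], &st) < 0 || !S_ISBLK(st.st_mode)) {
    fprintf(stderr, "%s is not a block device\n", argv[1]);
    return 1;
  }
  char disk[32];
  dev_t wholedev;
  if (blkid_devno_to_wholedisk(st.st_rdev, disk, sizeof(disk), &wholedev) != 0) {
    fprintf(stderr, "could not resolve the whole disk for %s\n", argv[1]);
    return 1;
  }
  printf("%s -> /dev/%s\n", argv[1], disk);   // e.g. /dev/sda1 -> /dev/sda
  return 0;
}

If that function turns out not to be in the el6 libblkid, a sysfs walk of the
kind sketched earlier in the thread would be a fallback.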

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: The design of the eviction improvement

2015-07-21 Thread Gregory Farnum
On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com wrote:
 Hi,

 Couple of points.

 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU 
 implementation
 with 2 levels currently, plus a pinned queue, that addresses stuff like 
 partitioning (sharding), scan resistance, and coordination w/lookup tables. 
  We might extend/re-use it.

 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
 segments (are you proposing to/do we now always cache whole objects?), and 
 cost
 of looking for dirty objects;  I suspect that it makes sense to amortize the
 cost of locating segments eligible to be flushed, rather than minimize
 bookkeeping.

We make caching decisions in terms of whole objects right now, yeah.
There's really nothing in the system that's capable of doing segments
within an object, and it's not just about tracking a little more
metadata about dirty objects — the way we handle snapshots, etc would
have to be reworked if we were allowing partial-object caching. Plus
keep in mind the IO cost of the bookkeeping — it needs to be either
consistently persisted to disk or reconstructable from whatever
happens to be in the object. That can get expensive really fast.
-Greg


 Matt

 - Zhiqiang Wang zhiqiang.w...@intel.com wrote:

  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Tuesday, July 21, 2015 6:38 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: Re: The design of the eviction improvement
 
  On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
   Hi all,
  
   This is a follow-up of one of the CDS session at
 
 http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
  ng_eviction. We discussed the drawbacks of the current eviction
 algorithm and
  several ways to improve it. Seems like the LRU variants is the right
 way to go. I
  come up with some design points after the CDS, and want to discuss
 it with you.
  It is an approximate 2Q algorithm, combining some benefits of the
 clock
  algorithm, similar to what the linux kernel does for the page
 cache.
 
  Unfortunately I missed this last CDS so I'm behind on the
 discussion.  I have a
  few questions though...
 
   # Design points:
  
   ## LRU lists
   - Maintain LRU lists at the PG level.
   The SharedLRU and SimpleLRU implementation in the current code
 have a
   max_size, which limits the max number of elements in the list.
 This
   mostly looks like a MRU, though its name implies they are LRUs.
 Since
   the object size may vary in a PG, it's not possible to caculate
 the
   total number of objects which the cache tier can hold ahead of
 time.
   We need a new LRU implementation with no limit on the size.
 
  This last sentence seems to me to be the crux of it.  Assuming we
 have an
  OSD based by flash storing O(n) objects, we need a way to maintain
 an LRU of
  O(n) objects in memory.  The current hitset-based approach was taken
 based
  on the assumption that this wasn't feasible--or at least we didn't
 know how to
  implmement such a thing.  If it is, or we simply want to stipulate
 that cache
  tier OSDs get gobs of RAM to make it possible, then lots of better
 options
  become possible...
 
  Let's say you have a 1TB SSD, with an average object size of 1MB --
 that's
  1 million objects.  At maybe ~100bytes per object of RAM for an LRU
 entry
  that's 100MB... so not so unreasonable, perhaps!

 I was having the same question before proposing this. I did the
 similar calculation and thought it would be ok to use this many memory
 :-)

 
   - Two lists for each PG: active and inactive Objects are first
 put
   into the inactive list when they are accessed, and moved between
 these two
  lists based on some criteria.
   Object flag: active, referenced, unevictable, dirty.
   - When an object is accessed:
   1) If it's not in both of the lists, it's put on the top of the
   inactive list
   2) If it's in the inactive list, and the referenced flag is not
 set, the referenced
  flag is set, and it's moved to the top of the inactive list.
   3) If it's in the inactive list, and the referenced flag is set,
 the referenced flag
  is cleared, and it's removed from the inactive list, and put on top
 of the active
  list.
   4) If it's in the active list, and the referenced flag is not set,
 the referenced
  flag is set, and it's moved to the top of the active list.
   5) If it's in the active list, and the referenced flag is set,
 it's moved to the top
  of the active list.
   - When selecting objects to evict:
   1) Objects at the bottom of the inactive list are selected to
 evict. They are
  removed from the inactive list.
   2) If the number of the objects in the inactive list becomes low,
 some of the
  objects at the bottom of the active list are moved to the inactive
 list. For those
  objects which have the referenced flag set, they are given one 

Re: dmcrypt with luks keys in hammer

2015-07-21 Thread Milan Broz
On 07/21/2015 01:14 PM, David Disseldorp wrote:

 A race condition (or other issue) with udev seems likely given that
 its rather random which ones come up and which ones don't

 A race condition during creation or activation?  If it's activation I 
 would expect ceph-disk activate ... to work reasonably reliably when 
 called manually (on a single device at a time).

I still do not understand completely how the dmcrypt activation
in Ceph is designed, but there are clear problems in the current design.

Activation of another device-mapper device inside udev rules (here a LUKS or
plain dmcrypt device) is broken by design; it can only work
with ugly workarounds.

The first reason is correctly mentioned in the wip branch you referenced
(udev RUN is intended for short-running commands. For example,
I think that if you increase the iteration count on a LUKS device, the whole Ceph udev
rule fails completely, because udev's thread processing will kill it on
timeout...).
(Unlocking can take minutes when you simply move an encrypted disk to a very slow
machine.)

The second reason is even more serious - cryptsetup itself uses udev
(through libdevmapper) to create nodes and must synchronize with
some other device-mapper udev rules. So here it is a race by design...
udev waits for another udev process. Ditto for creating /dev/by* links
(created by udev rule as well).

(And add to the mix the +watch rules, which react to close-on-write on every
node by running another udev blkid scan. If you see some leftover
temporary-cryptsetup* devices, something is really wrong. These
devices are internal to libcryptsetup and map keyslots only; they are never
kept open in correct operation.)

So moving activation outside of the udev rules is the correct solution here:
only the processing of device nodes should stay there, and the rest should be
offloaded until after the udev rules have run.

 We encountered similar issues on a non-dmcrypt firefly deployment with
 10 OSDs per node.
 
 I've been working on a patch set to defer device activation to systemd
 services. ceph-disk activate is extended to support mapping of dmcrypt
 devices prior to OSD startup.

Well, using systemd service is one option. But then it should handle all
cryptsetup device activations.

Milan

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dmcrypt with luks keys in hammer

2015-07-21 Thread Wyllys Ingersoll
ceph-disk activate-all does not fix the problem for non-systemd
users.  Once they are into the temporary-cryptsetup-PID state, they
have to be manually cleared and remounted as follows:


1. cryptsetup close all of the ones in the temporary-cryptsetup state
2. find the UUID for each block device (journal and data partitions)
3. cryptsetup luksOpen on those devices individually


for i in `ls /dev/sd?[12] | grep -v sda`
do
   UUID=`sudo blkid -p $i | sed 's/ /\n/g' | grep PART_ENTRY_UUID | cut -f2 -d= | tr -d \"`
   cryptsetup luksOpen $i $UUID --key-file /etc/ceph/dmcrypt-keys/${UUID}.luks.key
done

$ sudo start ceph-osd-all

On Tue, Jul 21, 2015 at 10:00 AM, Sage Weil s...@newdream.net wrote:
 On Tue, 21 Jul 2015, David Disseldorp wrote:
 Hi,

 On Mon, 20 Jul 2015 15:21:50 -0700 (PDT), Sage Weil wrote:

  On Mon, 20 Jul 2015, Wyllys Ingersoll wrote:
   No luck with ceph-disk-activate (all or just one device).
  
   $ sudo ceph-disk-activate /dev/sdv1
   mount: unknown filesystem type 'crypto_LUKS'
   ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t',
   'crypto_LUKS', '-o', '', '--', '/dev/sdv1',
   '/var/lib/ceph/tmp/mnt.QHe3zK']' returned non-zero exit status 32
  
  
   Its odd that it should complain about the crypto_LUKS filesystem not
   being recognized, because it did mount some of the LUKS systems
   successfully, though not sometimes just the data and not the journal
   (or vice versa).
  
   $ lsblk /dev/sdb
   NAMEMAJ:MIN RM   SIZE RO
   TYPE  MOUNTPOINT
   sdb   8:16   0   3.7T  0 disk
   ??sdb18:17   0   3.6T  0 part
   ? ??e8bc1531-a187-4fd2-9e3f-cf90255f89d0 (dm-0) 252:00   3.6T  0
   crypt /var/lib/ceph/osd/ceph-54
   ??sdb28:18   010G  0 part
 ??temporary-cryptsetup-1235 (dm-6)252:60   125K  1 
   crypt
  
  
   $ blkid /dev/sdb1
   /dev/sdb1: UUID=d6194096-a219-4732-8d61-d0c125c49393 TYPE=crypto_LUKS
  
  
   A race condition (or other issue) with udev seems likely given that
   its rather random which ones come up and which ones don't.
 
  A race condition during creation or activation?  If it's activation I
  would expect ceph-disk activate ... to work reasonably reliably when
  called manually (on a single device at a time).

 We encountered similar issues on a non-dmcrypt firefly deployment with
 10 OSDs per node.

 I've been working on a patch set to defer device activation to systemd
 services. ceph-disk activate is extended to support mapping of dmcrypt
 devices prior to OSD startup.

 The master-based changes aren't ready for upstream yet, but can be found
 in my WIP branch at:
 https://github.com/ddiss/ceph/tree/wip_bnc926756_split_udev_systemd_master

 This approach looks to be MUCH MUCH better than what we're doing right
 now!

 There are a few things that I'd still like to address before submitting
 upstream, mostly covering activate-journal:
 - The test/ceph-disk.sh unit tests need to be extended and fixed.
 - The activate-journal --dmcrypt changes are less than optimal, and leave
   me with a few unanswered questions:
   + Does get_journal_osd_uuid(dev) return the plaintext or cyphertext
 uuid?

 The uuid is never encrypted.

   + If a journal is encrypted, is the data partition also always
 encrypted?

 Yes (I don't think it's useful to support a mixed encrypted/unencrypted
 OSD).

 - dmcrypt journal device mapping should probably also be split out into
   a separate systemd service, as that'll be needed for the future
   network based key retrieval feature.

 Feedback on the approach taken would be appreciated.

 My only regret is that it won't help non-systemd cases, but I'm okay with
 leaving those as is (users can use the existing workarounds, like
 'ceph-disk activate-all' in rc.local to mop up stragglers) and focus
 instead on the new systemd world.

 Let us know if there's anything else we can do to help!

 sage

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-21 Thread Jason Dillaman
Any chance that the snapshot was just created prior to the first export and you 
have a process actively writing to the image?

-- 

Jason Dillaman 
Red Hat 
dilla...@redhat.com 
http://www.redhat.com 


- Original Message -
 From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
 To: ceph-devel@vger.kernel.org
 Sent: Tuesday, July 21, 2015 8:29:46 AM
 Subject: upstream/firefly exporting the same snap 2 times results in 
 different exports
 
 Hi,
 
 i remember there was a bug before in ceph not sure in which release
 where exporting the same rbd snap multiple times results in different
 raw images.
 
 Currently running upstream/firefly and i'm seeing the same again.
 
 
 # rbd export cephstor/disk-116@snap dump1
 # sleep 10
 # rbd export cephstor/disk-116@snap dump2
 
 # md5sum -b dump*
 b89198f118de59b3aa832db1bfddaf8f *dump1
 f63ed9345ac2d5898483531e473772b1 *dump2
 
 Can anybody help?
 
 Greets,
 Stefan
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html