Re: ceph replication and data redundancy

2013-01-21 Thread Ulysse 31
Hi everybody,

In fact, while searching the docs I found, in the section on adding/removing a
monitor, information about the Paxos system used for quorum establishment.
Following the documentation, in a catastrophe scenario I need to
remove the monitors configured in the other buildings.
For better efficiency, I think I'll keep 1 monitor per building and,
if the two other buildings fail, I will delete those two monitors from the
configuration in order to access the data again.
I'll simulate that and see if it goes well.
Thanks for your help and advice.

Regards,


--
Gomes do Vale Victor
System, Network and Security engineer.

2013/1/20 Gregory Farnum g...@inktank.com:
 (Sorry for the blank email just now, my client got a little eager!)

 Apart from the things that Wido has mentioned, you say you've set up 4 nodes 
 and each one has a monitor on it. That's why you can't do anything when you 
 bring down two nodes — the monitor cluster requires a strict majority in 
 order to continue operating, which is why we recommend odd numbers. (With 4 
 monitors a strict majority means 3, so losing 2 stalls the cluster; with 5, 
 losing 2 still leaves a majority of 3.) If you set up a different node as a 
 monitor (simulating one in a different data center) and then bring down two 
 nodes, things should keep working.
 -Greg


 On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote:

 Hi,

 On 01/17/2013 10:55 AM, Ulysse 31 wrote:
  Hi all,
 
  I'm not sure if this is the right mailing list; if not, sorry for that, just
  tell me the appropriate one and I'll use it.
  Here is my current project:
  The company I work for has several buildings, each of them linked with
  gigabit trunk links, allowing us to have machines in different buildings on
  the same LAN.
  We need to archive some data (around 5 to 10 TB), but we want that data
  present in each building so that, in case of the loss of a building
  (catastrophe scenario), we still have the data.
  Rather than using simple storage machines synced by rsync, we thought of
  reusing older desktop machines we have in stock and building a clustered
  fs on them:
  In fact, speed is clearly not the goal of this data storage; we would just
  store old projects on it occasionally and access it in rare cases. The most
  important thing is to keep that data archived somewhere.



 Ok, keep that in mind. All writes to RADOS are synchronous, so if you
 experience high latency or some congestion on your network Ceph will
 become slow.

  I was interested in Ceph because we can declare, using the CRUSH map, a
  hierarchical way to place replicated data.
  So for a test, I built a sample cluster composed of 4 nodes, installed
  under Debian Squeeze with the current Bobtail stable version of Ceph.
  In my sample I wanted to simulate 2 nodes per building; each node has a
  2 TB disk and runs mon/osd/mds (I know that is not optimal, but it's just
  a test). The OSDs use XFS on /dev/sda3, and I made a CRUSH map like this:
  ---
  # begin crush map
 
  # devices
  device 0 osd.0
  device 1 osd.1
  device 2 osd.2
  device 3 osd.3
 
  # types
  type 0 osd
  type 1 host
  type 2 rack
  type 3 row
  type 4 room
  type 5 datacenter
  type 6 root
 
  # buckets
  host server-0 {
  id -2 # do not change unnecessarily
  # weight 1.000
  alg straw
  hash 0 # rjenkins1
  item osd.0 weight 1.000
  }
  host server-1 {
  id -5 # do not change unnecessarily
  # weight 1.000
  alg straw
  hash 0 # rjenkins1
  item osd.1 weight 1.000
  }
  host server-2 {
  id -6 # do not change unnecessarily
  # weight 1.000
  alg straw
  hash 0 # rjenkins1
  item osd.2 weight 1.000
  }
  host server-3 {
  id -7 # do not change unnecessarily
  # weight 1.000
  alg straw
  hash 0 # rjenkins1
  item osd.3 weight 1.000
  }
  rack bat0 {
  id -3 # do not change unnecessarily
  # weight 3.000
  alg straw
  hash 0 # rjenkins1
  item server-0 weight 1.000
  item server-1 weight 1.000
  }
  rack bat1 {
  id -4 # do not change unnecessarily
  # weight 3.000
  alg straw
  hash 0 # rjenkins1
  item server-2 weight 1.000
  item server-3 weight 1.000
  }
  root root {
  id -1 # do not change unnecessarily
  # weight 3.000
  alg straw
  hash 0 # rjenkins1
  item bat0 weight 3.000
  item bat1 weight 3.000
  }
 
  # rules
  rule data {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take root
  step chooseleaf firstn 0 type rack
  step emit
  }
  rule metadata {
  ruleset 1
  type replicated
  min_size 1
  max_size 10
  step take root
  step chooseleaf firstn 0 type rack
  step emit
  }
  rule rbd {
  ruleset 2
  type replicated
  min_size 1
  max_size 10
  step take root
  step chooseleaf firstn 0 type rack
  step emit
  }
  # end crush map
  ---
 
  Using this CRUSH map, coupled with a default data pool size of 2
  (replication 2), let me make sure there is a duplicate of all data in
  both sample buildings, bat0 and bat1.
  Then I mounted it on a client with ceph-fuse using: ceph-fuse -m
  server-2:6789 /mnt/mycephfs (server-2 is located in bat1). Everything
  works fine as expected; I can write/read data from one or more clients,
  no problems there.



 Just to 

Re: ceph replication and data redundancy

2013-01-21 Thread Joao Eduardo Luis

On 01/21/2013 08:14 AM, Ulysse 31 wrote:

Hi everybody,

In fact, while searching the docs I found, in the section on adding/removing a
monitor, information about the Paxos system used for quorum establishment.
Following the documentation, in a catastrophe scenario I need to
remove the monitors configured in the other buildings.
For better efficiency, I think I'll keep 1 monitor per building and,
if the two other buildings fail, I will delete those two monitors from the
configuration in order to access the data again.
I'll simulate that and see if it goes well.
Thanks for your help and advice.


If you are set on that approach, you could just as well add a third 
monitor on one of the buildings (whichever you feel to be more 
resilient), and cut down the chances of an unavailable cluster if the 
other fails.


It doesn't solve your problem, but if the building with just one monitor 
fails, your cluster will still be available; if it's the other way 
around, you could do the manual recovery just the same anyway.


  -Joao



[PATCH] Always Signal() the first Cond when changing the maximum

2013-01-21 Thread Loic Dachary
 Removes a test by which the waiting queue is only
 Signal()ed if the new maximum is lower than the current
 maximum.  There is no evidence of a use case where such a
 restriction would be useful. In addition, waking up a thread
 when the maximum increases gives it a chance to immediately
 continue the suspended process instead of waiting for the
 next put().  For additional context see the discussion at
 http://marc.info/?t=13586893831&r=1&w=4

Signed-off-by: Loic Dachary l...@dachary.org
---
 src/common/Throttle.cc |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/common/Throttle.cc b/src/common/Throttle.cc
index 844263a..82ffe7a 100644
--- a/src/common/Throttle.cc
+++ b/src/common/Throttle.cc
@@ -65,7 +65,7 @@ Throttle::~Throttle()
 void Throttle::_reset_max(int64_t m)
 {
   assert(lock.is_locked());
-  if (m < ((int64_t)max.read()) && !cond.empty())
+  if (!cond.empty())
     cond.front()->SignalOne();
   logger->set(l_throttle_max, m);
   max.set((size_t)m);
-- 
1.7.10.4
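
For context, here is a minimal, self-contained illustration of the behavior
the patch enables. This is not the Ceph Throttle/Mutex/Cond code; it is a
sketch using plain std::condition_variable, and all names in it are made up
for the example. The point is that a waiter blocked because its request does
not fit is woken whenever the maximum changes, so it can re-evaluate, including
when the maximum goes up:

// Minimal illustration only -- not Ceph code. Assumes C++11.
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

class MiniThrottle {
  std::mutex mtx;
  std::condition_variable cond;
  int64_t max_val;
  int64_t count = 0;

public:
  explicit MiniThrottle(int64_t m) : max_val(m) {}

  // Block until 'c' units fit under the maximum, then take them.
  void get(int64_t c) {
    std::unique_lock<std::mutex> l(mtx);
    cond.wait(l, [&] { return count + c <= max_val; });
    count += c;
  }

  // Return 'c' units and wake a waiter so it can re-check.
  void put(int64_t c) {
    std::lock_guard<std::mutex> l(mtx);
    count -= c;
    cond.notify_one();
  }

  // The point of the patch: wake a waiter on *any* change of the
  // maximum, not only when it shrinks, so it can re-evaluate.
  void reset_max(int64_t m) {
    std::lock_guard<std::mutex> l(mtx);
    max_val = m;
    cond.notify_one();
  }
};

int main() {
  MiniThrottle t(10);
  t.get(9);                        // current = 9, max = 10
  std::thread waiter([&] {
    t.get(5);                      // 9 + 5 > 10: blocks
    std::cout << "waiter got its 5 units" << std::endl;
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  t.reset_max(20);                 // raise the maximum; the waiter can now proceed
  waiter.join();
  t.put(14);
  return 0;
}

Without the notify_one() in reset_max(), the waiter in this sketch would stay
blocked until some other thread called put(), which is exactly the stall
discussed in the Throttle::wait thread below.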




Re: ceph replication and data redundancy

2013-01-21 Thread Wido den Hollander



On 01/21/2013 02:08 PM, Joao Eduardo Luis wrote:

On 01/21/2013 08:14 AM, Ulysse 31 wrote:

Hi everybody,

In fact, while searching the docs I found, in the section on adding/removing a
monitor, information about the Paxos system used for quorum establishment.
Following the documentation, in a catastrophe scenario I need to
remove the monitors configured in the other buildings.
For better efficiency, I think I'll keep 1 monitor per building and,
if the two other buildings fail, I will delete those two monitors from the
configuration in order to access the data again.
I'll simulate that and see if it goes well.
Thanks for your help and advice.


If you are set on that approach, you could just as well add a third
monitor on one of the buildings (whichever you feel to be more
resilient), and cut down the chances of an unavailable cluster if the
other fails.

It doesn't solve your problem, but if the building with just one monitor
fails, your cluster will still be available; if it's the other way
around, you could do the manual recovery just the same anyway.



Another approach: if possible, try to add a 3rd monitor in a neutral place.

I of course don't know what your network looks like, but you might be able 
to put up a monitor in an external datacenter and do something with a VPN?


Assuming both buildings have their own external internet connection.

Wido


   -Joao
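
As an illustration of that kind of layout, the monitor section of ceph.conf
for two buildings plus a neutral third location could look something like the
following sketch (hostnames and addresses are placeholders, not taken from
this thread):

[mon.a]
    host = mon-building-a
    mon addr = 10.0.1.10:6789

[mon.b]
    host = mon-building-b
    mon addr = 10.0.2.10:6789

[mon.c]
    host = mon-neutral          ; e.g. a small machine reachable over the VPN
    mon addr = 10.0.3.10:6789

With three monitors, any single location can go away and the remaining two
still form a majority, so the cluster stays available without manual monitor
removal.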




Re: Throttle::wait use case clarification

2013-01-21 Thread Loic Dachary

On 01/21/2013 12:02 AM, Gregory Farnum wrote:
 On Sunday, January 20, 2013 at 5:39 AM, Loic Dachary wrote:
 Hi,

 While working on unit tests for Throttle.{cc,h} I tried to figure out a use 
 case related to the Throttle::wait method but couldn't

 https://github.com/ceph/ceph/pull/34/files#L3R258

 Although it was not a blocker and I managed to reach 100% coverage anyway, 
 it got me curious and I would very much appreciate pointers to understand 
 the rationale.

 wait() can be called to set a new maximum before waiting for all pending 
 threads to get what they asked for. Since the maximum has changed, 
 wait() wakes up the first thread: the conditions under which it decided to 
 go to sleep have changed and the conclusion may be different.

 However, it only does so when the new maximum is less than the current one. 
 For instance:

 A) decision does not change

 max = 10, current 9
 thread 1 tries to get 5 but only 1 is available, it goes to sleep
 wait(8)
 max = 8, current 9
 wakes up thread 1
 thread 1 tries to get 5 but current is already beyond the maximum, it goes 
 to sleep

 B) decision changes

 max = 10, current 1
 thread 1 tries to get 10 but only 9 is available, it goes to sleep
 wait(9)
 max = 9, current 1
 wakes up thread 1
 thread 1 tries to get 10, which is above the maximum: it succeeds because 
 current is below the new maximum

 It will not wake up a thread if the maximum increases, for instance:

 max = 10, current 9
 thread 1 tries to get 5 but only 1 is available, it goes to sleep
 wait(20)
 max = 20, current 9
 does *not* wake up thread 1
 it keeps waiting until another thread calls put(N) with N >= 0, although 
 there are now 11 available, which would allow it to get the 5 it asked for

 Why is it not desirable for thread 1 to wake up in this case? When 
 debugging a real-world situation, I think it would show up as a thread blocked 
 even though the throttle it is waiting on has enough to satisfy its request. 
 What am I missing?

 Cheers





 Looking through the history of that test (in _reset_max), I think it's an 
 accident and we actually want to be waking up the front if the maximum 
 increases (or possibly in all cases, in case the front is a very large 
 request we're going to let through anyway). Want to submit a patch? :)
:-) Here it is. make check does not complain. I've not run teuthology + 
qa-suite though. I figured out how to run teuthology but have not yet tried 
qa-suite.

http://marc.info/?l=ceph-devel&m=135877502606311&w=4


 The other possibility I was trying to investigate is that it had something to 
 do with handling get() requests larger than the max correctly, but I can't 
 find any evidence of that one...
I've run the Throttle unit tests after uncommenting
https://github.com/ceph/ceph/pull/34/files#L3R269
and commenting out
https://github.com/ceph/ceph/pull/34/files#L3R266
and it passes.

I'm not sure if I should have posted the proposed Throttle unit test to the 
list instead of proposing it as a pull request:
https://github.com/ceph/ceph/pull/34

What is best?


Re: Ceph docs page down

2013-01-21 Thread Wido den Hollander

Hi,

On 01/21/2013 05:11 PM, Travis Rhoden wrote:

I suspect someone is working on it, but http://ceph.com/docs/master/
is returning HTTP 404.



I don't know if anyone is working on it, but in the meantime, use: 
http://eu.ceph.com/docs/master/


Wido


  - Travis


Re: Ceph docs page down

2013-01-21 Thread Sage Weil
They're back up now.

They're auto-generated from git; the build failed due to an 
intermittent network connectivity problem, and eventually the old content 
timed out because master wasn't updated over the weekend.  We need to 
special-case the docs gitbuilder, or increase the timeout, or something so 
that this doesn't happen again.

sage

On Mon, 21 Jan 2013, Wido den Hollander wrote:

 Hi,
 
 On 01/21/2013 05:11 PM, Travis Rhoden wrote:
  I suspect someone is working on it, but http://ceph.com/docs/master/
  is returning HTTP 404.
  
 
 I don't know if anyone is working on it, but in the meantime, use:
 http://eu.ceph.com/docs/master/
 
 Wido
 
- Travis


Re: Inktank team @ FOSDEM 2013 ?

2013-01-21 Thread Sébastien Han
Hi guys,

See you at FOSDEM ;-)

Cheers,
--
Regards,
Sébastien Han.


On Sun, Jan 20, 2013 at 6:13 PM, Constantinos Venetsanopoulos
c...@grnet.gr wrote:
 Hello Loic, Sebastien, Patrick,

 that's great news! I'm sure we'll have some very interesting stuff to talk
 about.
 Saturday 14:00 @ K.3.201 also seems fine.

 Thanks,
 Constantinos



 On 1/20/2013 6:19 PM, Patrick McGarry wrote:

 Hey guys,

 I will be attending from Inktank and would love to meet all of you
 folks.  Also, for what it's worth, the wiki has been deprecated and
 mostly redirects to the docs.  If you are trying to take a look at
 Sebastien's page you'll most likely have to use:

 http://wiki.ceph.com/deprecated/FOSDEM2013

 Thanks.


 Best Regards,

 Patrick

 On Sun, Jan 20, 2013 at 6:10 AM, Loic Dachary l...@dachary.org wrote:

 On 01/20/2013 10:43 AM, Constantinos Venetsanopoulos wrote:

 Hello,

 I'd like to ask if anybody from the team will be attending FOSDEM 2013 [1],
 held in Brussels. I'm just asking because we will be having a talk there in
 the cloud devroom and will be referencing our experiences with RADOS.
 So it would be nice if we could also meet in person with any of you guys.

 Hi,

 Sebastien Han and I will be there (but we're not with Inktank ;-).
 It would be great to organize an informal meeting to discuss Ceph, and I've
 just added a page to the wiki (defunct but still useful) listing our
 names: http://ceph.com/wiki/FOSDEM2013

 Cheers






Re: Concepts of whole cluster snapshots/backups and backups in general.

2013-01-21 Thread Michael Grosser
For anyone looking for a solution, I want to outline the one I will go with
in my upcoming setup.

First I want to say that I'm looking forward to the geo-replication feature,
which will hopefully provide async replication; and if there is someday some
sort of snapshotted replica, that would be another awesome thing.

But for the time being:

For object storage (radosgw) I will use the admin functions to retrieve
all buckets, retrieve all objects within these buckets, and then back
them up into another offsite Ceph cluster or onto some distributed logic
on top of simple rsync + fs.
For block storage (rbd) I will use the snapshot feature to snapshot
all images for a non-corrupt and up-to-date backup (sketched below), and
then back them up into another offsite Ceph cluster or onto some distributed
logic on top of simple rsync + fs.
For the distributed fs (CephFS) I could use the snapshot feature too, but
I'm not yet using CephFS, so it's just a consideration.
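
A rough sketch of the rbd step above, per image, could look like this (pool,
image and snapshot names are just examples, and the snapshot is only
crash-consistent unless the VM using the image is quiesced first):

# take a point-in-time snapshot of the image
rbd snap create rbd/vm-disk-1@backup-2013-01-21

# export that snapshot to a file (or stream it to the offsite location)
rbd export rbd/vm-disk-1@backup-2013-01-21 /backup/vm-disk-1-2013-01-21.img

# optionally drop the snapshot once the export has been verified
rbd snap rm rbd/vm-disk-1@backup-2013-01-21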

Some considerations:
I will probably go with an offsite Ceph cluster with 2 replicas because this
will distribute the load and make retrieval easier (it could be made
accessible read-only to users...).
The actual data retrieval from the live Ceph cluster will probably be
done by some Chef script running from time to time.
With the approach of using Ceph at the backend, buckets could be named after
the date/time of the actual snapshot and therefore provide reverting
capabilities (planned: monthly and 3 daily).

Some things to keep in mind:
Copying all data will drain network bandwidth.
Keeping snapshots of production data at scale with a replica count of 2
means a lot of storage is thrown away.
- use 3+ replicas in production and perhaps only 1 replica in the offsite backup?

With 3 replicas + 1 offsite replica + 1 monthly snapshot + 2-3 daily
snapshots, the total amount of storage needed for 1 TB of data would
come to 7-8 TB of actual storage. It seems like a crazy idea to keep 4-5
additional copies... (perhaps only keep 1 monthly and the last day or
two)


Just a few ideas I wanted to throw at anyone interested.

Cheers Michael

On Tue, Jan 15, 2013 at 10:36 PM, Michael Grosser
m...@seetheprogress.com wrote:
 Hey,

 within this mail, "data" refers to the RADOS chunks, i.e. the actual
 data behind the fs/object/block storage.

 I was thinking about different scenarios, which could lead to data-loss.

 1. The usual stupid customer deleting some important data.
 2. The not-so-usual totally corrupted cluster after an upgrade or the like.
 3. The fun-to-think-about "datacenter struck by [disaster], nothing
 left" scenario.

 While thinking about these scenarios, I wondered how these disasters
 and the mentioned data-loss could be prevented.
 Telling customers data is lost, be it self inflicted or nature
 inflicted, is nothing you want or should need to do.

 But what are the technical solutions to provide another layer of
 disaster recovery (not just one datacenter with n replicas)?

 Some ideas, which came to mind:

 1. Snapshotting (ability to get user deleted files + revert to old
 state after corruption)
 2. Offsite backup (ability to recover from a lost datacenter)

 With these ideas, a few problems came to mind.
 Is it cost-effective to back up the whole cluster (it would probably
 back up all replicas, which is not good at all)?
 Is there a way to snapshot the current state and back it up to some
 offsite server array? It could be another Ceph cluster or a NAS.
 Do you really want to snapshot the non-readable Ceph objects from RADOS?
 Shouldn't a backup always be readable?

 The simplest solution darkfaded from IRC came up with was using
 special replicas.
 Using additional replicas, which only sync hourly, daily or monthly
 and detach after the sync, could be a solution. But how could that be
 done?
 Some benefits of this solution:
 1. Readable, because it could be a fully functioning cluster. Doable? Would
 the gateways etc. need to be replicated too, or could that be integrated
 into such a special replica backup?
 2. Easy recovery, just make the needed replica the master.
 3. No new system. Ceph in and out \o/
 4. Offsite backup possibility.
 5. Versioned states via different replicas: hourly, daily, monthly

 Some problems:
 1. strain on ceph cluster when sync is done for each special replica
 2. additional disk space needed (could be double the already used
 amount, when using 3 replicas with one current, one daily, one monthly
 replica)
 3. more costs
 4. more complex solution?

 Could someone shed some light on how to have replicas without the
 write being acknowledged by every replica, so that such a replica would
 only be a mirror instead of a full replica?

 Could this replica-based backup be used as a current snapshot in another
 datacenter?

 Wouldn't that be the async replication feature, which isn't yet possible,
 sort of?

 I hope this mail is not too cluttered and I'm looking forward to the
 thread about it.

 Hopefully we can not only collect some ideas and solutions, but hear
 some current implementations from some bigger players.

 Cheers Michael

Re: questions on networks and hardware

2013-01-21 Thread John Nielsen
Thanks all for your responses! Some comments inline.

On Jan 20, 2013, at 10:16 AM, Wido den Hollander w...@widodh.nl wrote:

 On 01/19/2013 12:34 AM, John Nielsen wrote:
 I'm planning a Ceph deployment which will include:
  10Gbit/s public/client network
  10Gbit/s cluster network
  dedicated mon hosts (3 to start)
  dedicated storage hosts (multiple disks, one XFS and OSD per disk, 3-5 
 to start)
  dedicated RADOS gateway host (1 to start)
 
 I've done some initial testing and read through most of the docs but I still 
 have a few questions. Please respond even if you just have a suggestion or 
 response for one of them.
 
 If I have cluster network and public network entries under [global] or 
 [osd], do I still need to specify public addr and cluster addr for each 
 OSD individually?
 
 Already answered, but no. You don't need to. The OSDs will bind to the 
 available IP in that network.

Nice. That's how I hoped/assumed it would work but I have seen some 
configurations on the web that include both so I wanted to make sure.
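
For reference, the minimal form of what is described above, with placeholder
subnets (only the two network options; the addresses are made up):

[global]
    public network = 192.168.10.0/24
    cluster network = 192.168.20.0/24

Each OSD then binds its public and cluster sockets to whichever of its local
addresses fall inside those networks, so no per-daemon public addr / cluster
addr entries are needed.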

 Which network(s) should the monitor hosts be on? If both, is it valid to 
 have more than one mon addr entry per mon host or is there a different way 
 to do it?
 
 They should be on the public network since the clients also need to be able 
 to reach the monitors.

It sounds (from other followups) like there is work on adding some awareness of 
the cluster network to the monitor but it's not there yet. I'll stay tuned. It 
would be nice if the monitors and OSD's together could form a reachability map 
for the cluster and give the option of using the public network for the 
affected subset of OSD traffic in the event of a problem on the cluster 
network. Having the monitor(s) connected to the cluster network might be useful 
for that...

Then again, a human should be involved if there is a network failure anyway; 
manually changing the network settings in the Ceph config as a temporary 
workaround is already an option I suppose.

 I'd like to have 2x 10Gbit/s NIC's on the gateway host and maximize 
 throughput. Any suggestions on how to best do that? I'm assuming it will 
 talk to the OSD's on the Ceph public/client network, so does that imply a 
 third even-more-public network for the gateway's clients?
 
 No, there is no third network. You could trunk the 2 NICs with LACP or 
 something. Since the client will open TCP connections to all the OSDs you 
 will get a pretty good balancing over the available bandwidth.

Good suggestion, thanks. I actually just started using LACP to pair up 1Gb 
NIC's on a small test cluster and it's proven beneficial, even with 
less-than-ideal hashing from the switch.

To put my question here a different way: suppose I really really want to 
segregate the Ceph traffic from the HTTP traffic and I set up the IP's and 
routing necessary to do so (3rd network idea). Is there any reason NOT to do 
that?

 Is it worthwhile to have 10G NIC's on the monitor hosts? (The storage hosts 
 will each have 2x 10Gbit/s NIC's.)
 
 No, not really. 1Gbit should be more then enough for your monitors. 3 
 monitors should also be good. No need to go for 5 or 7.

Maybe I'll have the monitors just be on the public network but use LACP with 
dual 1Gbit NIC's for fault tolerance (since I'll have the NIC's onboard anyway).

 I think this has come up before, but has anyone written up something with 
 more details on setting up gateways? Hardware recommendations, strategies to 
 improve caching and performance, multiple gateway setups with and without a 
 load balancer, etc.
 
 Not that I know. I'm still trying to play with RGW and Varnish in front of 
 it, but haven't really taken the time yet. The goal is to offload a lot of 
 the caching to Varnish and have Varnish 'ban' objects when they change.
 
 You could also use Varnish as a loadbalancer. But in this case you could also 
 use RR-DNS or LVS with Direct Routing, that way only incoming traffic goes 
 through the loadbalancer and return traffic goes directly to your network's 
 gateway.

I definitely plan to look in to Varnish at some point, I'll be sure to follow 
up here if I learn anything interesting.

Thanks again,

JN



Re: flashcache

2013-01-21 Thread John Nielsen
On Jan 19, 2013, at 7:56 PM, Joseph Glanville joseph.glanvi...@orionvm.com.au 
wrote:

 I assume it is now an EoIB driver. Does it replace the IPoIB driver?
 
 Nope, it is upper-layer thing: https://lwn.net/Articles/509448/
 
 Aye, its effectively a NAT translation layer that strips Ethernet
 headers and grafts on IPoIB headers, thus using the same wire protocol
 and allowing communication from EoIB to IPoIB.
 
 However this approach is a little dirty and has been nacked by the
 netdev community so we aren't likely to see it in the mainline
 kernel.. basically ever.

Just to clarify:

EoIB has been around for a while (at least in the Mellanox software, not sure 
about mainline). It uses the mlx4_vnic module and is a true Ethernet 
encapsulation over InfiniBand. Unfortunately the newer Mellanox switches won't 
support it any more, and the ones that do have entered "Limited Support". (Not 
to be confused with mlx4_en, which just turns a ConnectX card into a 10G 
Ethernet NIC.)

IPoIB is IP over InfiniBand without Ethernet (the data link layer is straight 
InfiniBand).

eIPoIB is (or will be, maybe) Ethernet over IP over InfiniBand. It is intended 
to work with both Linux bridging and regular IB switches that support IPoIB. 
(Allowing e.g. unmodified KVM guests on hypervisors connected to an IPoIB 
fabric.) Both Joseph's comments and the LWN link above are referring to eIPoIB. 
Last I heard (from a pretty direct source in the last couple of weeks) Mellanox 
is still working on this but doesn't have anything generally available yet. 
Here's hoping.. but the feedback on netdev was quite negative, to put it mildly.

JN



Consistently reading/writing rados objects via command line

2013-01-21 Thread Nick Bartos
I would like to store some objects in rados, and retrieve them in a
consistent manner.  In my initial tests, if I do a 'rados -p foo put
test /tmp/test', while it is uploading I can do a 'rados -p foo get
test /tmp/blah' on another machine, and it will download a partially
written file without returning an error code, so the downloader cannot
tell the file is corrupt/incomplete.

My question is, how do I read/write objects in rados via the command
line in such a way where the downloader does not get a corrupt or
incomplete file?  It's fine if it just returns an error on the client
and I can try again, I just need to be notified on error.


Re: Consistently reading/writing rados objects via command line

2013-01-21 Thread Gregory Farnum
On Monday, January 21, 2013 at 5:01 PM, Nick Bartos wrote:
 I would like to store some objects in rados, and retrieve them in a
 consistent manner. In my initial tests, if I do a 'rados -p foo put
 test /tmp/test', while it is uploading I can do a 'rados -p foo get
 test /tmp/blah' on another machine, and it will download a partially
 written file without returning an error code, so the downloader cannot
 tell the file is corrupt/incomplete.
  
 My question is, how do I read/write objects in rados via the command
 line in such a way where the downloader does not get a corrupt or
 incomplete file? It's fine if it just returns an error on the client
 and I can try again, I just need to be notified on error.
  
You must be writing large-ish objects? By default the rados tool will upload 
objects 4MB at a time and you're trying to download mid-way through the full 
object upload. You can add a --block-size 20971520 to upload 20MB in a single 
operation, but make sure you don't exceed the osd max write size (90MB by 
default).
This is all client-side stuff, though — from the RADOS object store's 
perspective, the file is complete after each 4MB write. If you want something 
more sophisticated (like handling larger objects) you'll need to do at least 
some minimal tooling of your own, e.g. by setting an object xattr before 
starting and after finishing the file change, then checking for that presence 
when reading (and locking on reads or doing a check when the read completes). 
You can do that with the setxattr, rmxattr, and getxattr options.
-Greg
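
A minimal sketch of that marker pattern with the rados tool might look like
the following (object and attribute names are just examples; this narrows the
window considerably, but it is not a real lock):

# writer: drop the marker, upload, then set it again
rados -p foo rmxattr test complete      # fails harmlessly if the attr is not there yet
rados -p foo put test /tmp/test
rados -p foo setxattr test complete 1

# reader: only trust the download if the marker is present before and after
rados -p foo getxattr test complete || exit 1
rados -p foo get test /tmp/blah
rados -p foo getxattr test complete || exit 1

If either check fails, the reader simply retries later; for true atomicity on
large objects, see the radosgw suggestion in the next message.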



Re: Consistently reading/writing rados objects via command line

2013-01-21 Thread Sage Weil
On Mon, 21 Jan 2013, Gregory Farnum wrote:
 On Monday, January 21, 2013 at 5:01 PM, Nick Bartos wrote:
  I would like to store some objects in rados, and retrieve them in a
  consistent manor. In my initial tests, if I do a 'rados -p foo put
  test /tmp/test', while it is uploading I can do a 'rados -p foo get
  test /tmp/blah' on another machine, and it will download a partially
  written file without returning an error code, so the downloader cannot
  tell the file is corrupt/incomplete.
   
  My question is, how do I read/write objects in rados via the command
  line in such a way where the downloader does not get a corrupt or
  incomplete file? It's fine if it just returns an error on the client
  and I can try again, I just need to be notified on error.
   
 You must be writing large-ish objects? By default the rados tool will upload 
 objects 4MB at a time and you're trying to download mid-way through the full 
 object upload. You can add a --block-size 20971520 to upload 20MB in a 
 single operation, but make sure you don't exceed the osd max write size 
 (90MB by default).
 This is all client-side stuff, though — from the RADOS object store's 
 perspective, the file is complete after each 4MB write. If you want something 
 more sophisticated (like handling larger objects) you'll need to do at least 
 some minimal tooling of your own, e.g. by setting an object xattr before 
 starting and after finishing the file change, then checking for that presence 
 when reading (and locking on reads or doing a check when the read completes). 
 You can do that with the setxattr, rmxattr, and getxattr options.

With a bit of additional support in the rados tool, we could write to 
object $foo.tmp with key $foo, and then clone it into position and delete 
the .tmp.

If they're really big objects, though, you may also be better off with 
radosgw, which provides striping and atomicity..

sage


handling fs errors

2013-01-21 Thread Sage Weil
We observed an interesting situation over the weekend.  The XFS volume 
backing a ceph-osd locked up (the daemon hung in xfs_ilock) for somewhere 
between 2 and 4 minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed 
suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was 
able to restart and continue.

The problem is that during that 180s the OSD was claiming to be alive but 
not able to do any IO.  That heartbeat check is meant as a sanity check 
against a wedged kernel, but waiting so long meant that the ceph-osd 
wasn't failed by the cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat 
interval (currently default is 20s).  That will make ceph-osd much more 
sensitive to fs stalls that may be transient (high load, whatever).
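
For reference, the two timeouts being compared map to config options roughly
as in the sketch below; the option names are from memory and should be checked
against config_opts before relying on them:

[osd]
    ; how long peers wait for heartbeat replies before reporting an OSD down
    osd heartbeat grace = 20
    ; how long an internally wedged ceph-osd waits before committing suicide
    filestore op thread suicide timeout = 180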

Another option would be to make the osd heartbeat replies conditional on 
whether the internal heartbeat is healthy.  Then the heartbeat warnings 
could start at 10-20s, ping replies would pause, but the suicide could 
still be 180s out.  If the stall is short-lived, pings will continue, the 
osd will mark itself back up (if it was marked down) and continue.

Having written that out, the last option sounds like the obvious choice.  
Any other thoughts?

sage


Re: handling fs errors

2013-01-21 Thread Yehuda Sadeh
On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com wrote:
 We observed an interesting situation over the weekend.  The XFS volume
 ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
 minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
 suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was
 able to restart and continue.

 The problem is that during that 180s the OSD was claiming to be alive but
 not able to do any IO.  That heartbeat check is meant as a sanity check
 against a wedged kernel, but waiting so long meant that the ceph-osd
 wasn't failed by the cluster quickly enough and client IO stalled.

 We could simply change that timeout to something close to the heartbeat
 interval (currently default is 20s).  That will make ceph-osd much more
 sensitive to fs stalls that may be transient (high load, whatever).

 Another option would be to make the osd heartbeat replies conditional on
 whether the internal heartbeat is healthy.  Then the heartbeat warnings
 could start at 10-20s, ping replies would pause, but the suicide could
 still be 180s out.  If the stall is short-lived, pings will continue, the
 osd will mark itself back up (if it was marked down) and continue.

 Having written that out, the last option sounds like the obvious choice.
 Any other thoughts?


Another option would be to have the osd reply to the ping with some
health description.

Yehuda


Re: handling fs errors

2013-01-21 Thread Andrey Korolyov
On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil s...@inktank.com wrote:
 We observed an interesting situation over the weekend.  The XFS volume
 ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
 minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
 suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was
 able to restart and continue.

 The problem is that during that 180s the OSD was claiming to be alive but
 not able to do any IO.  That heartbeat check is meant as a sanity check
 against a wedged kernel, but waiting so long meant that the ceph-osd
 wasn't failed by the cluster quickly enough and client IO stalled.

 We could simply change that timeout to something close to the heartbeat
 interval (currently default is 20s).  That will make ceph-osd much more
 sensitive to fs stalls that may be transient (high load, whatever).

 Another option would be to make the osd heartbeat replies conditional on
 whether the internal heartbeat is healthy.  Then the heartbeat warnings
 could start at 10-20s, ping replies would pause, but the suicide could
 still be 180s out.  If the stall is short-lived, pings will continue, the
 osd will mark itself back up (if it was marked down) and continue.

 Having written that out, the last option sounds like the obvious choice.
 Any other thoughts?

 sage

It seems possible to run into domino-style failure markings there if the
lock is triggered frequently enough, and that depends only on the sheer
amount of workload. By the way, was that fs aged, or were you able to catch
the lock on a fresh one? And which kernel were you running there?

Thanks!
