Re: ceph init script didn't stop the ceph.

2012-06-22 Thread ramu
ramu ramu.freesystems at gmail.com writes:
No error messages are displayed, and the command "ceph osd down 1" is also not
working. When I run this command, the error message in ceph-osd.1.log is
"map e38 wrongly marked me down".





Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

2012-06-22 Thread Alexandre DERUMIER
Hi Sage,
thanks for your response.

If you turn off the journal completely, you will see bursty write commits 
from the perspective of the client, because the OSD is periodically doing 
a sync or snapshot and only acking the writes then. 
If you enable the journal, the OSD will reply with a commit as soon as the 
write is stable in the journal. That's one reason why it is there--file 
system commits are heavyweight and slow. 

Yes, of course, I don't want to deactivate the journal; using a journal on a fast 
ssd or nvram is the right way.

If we left the file system to its own devices and did a sync every 10 
seconds, the disk would sit idle while a bunch of dirty data accumulated 
in cache, and then the sync/snapshot would take a really long time. This 
is horribly inefficient (the disk is idle half the time), and useless (the 
delayed write behavior makes sense for local workloads, but not servers 
where there is a client on the other end batching its writes). To prevent 
this, 'filestore flusher' will prod the kernel to flush out any written 
data to the disk quickly. Then, when we get around to doing the 
sync/snapshot it is pretty quick, because only fs metadata and 
just-written data needs to be flushed. 

mmm, I disagree.

If you flush quickly, it works fine with a sequential write workload.

But if you have a lot of random writes with 4k blocks, for example, you are going 
to have a lot of disk seeks.
The way ZFS or NetApp SAN storage works, they take random writes into a fast 
journal and then flush them sequentially to slow storage every 20s.

To compare with ZFS or NetApp, I can achieve around 2io/s on 4K random writes 
with 4GB nvram and 10 x 7200rpm disks.

With Ceph, I'm around 2000io/s with the same config (3 nodes with 10 x 7200rpm disks, 
2x replication), so around the real disk IO limit without any write cache.


So for now, I think I'm going to use SSDs for my OSDs; I have an 80% random write 
workload. (No seeks, so constant random writes are not a problem.)



BTW: maybe the wiki is wrong
http://ceph.com/wiki/OSD_journal
section Motivation
"Enterprise products like NetApp filers cheat by journaling all writes to 
NVRAM and then taking their time to flush things out to disk efficiently. This 
gives you very low-latency writes _and_ efficient disk IO at the expense of 
hardware."

This is why I thought Ceph worked like this.


Thanks again,

-Alexandre







----- Original Message ----- 

From: Sage Weil s...@inktank.com 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: ceph-devel@vger.kernel.org, Mark Nelson mark.nel...@inktank.com, 
Stefan Priebe s.pri...@profihost.ag 
Sent: Thursday, 21 June 2012 18:03:45 
Subject: Re: filestore flusher = false , correct my problem of constant write 
(need info on this parameter) 

Hi Alexandre, 

[Sorry I didn't follow up earlier; I didn't understand your question.] 

If you turn off the journal completely, you will see bursty write commits 
from the perspective of the client, because the OSD is periodically doing 
a sync or snapshot and only acking the writes then. 

If you enable the journal, the OSD will reply with a commit as soon as the 
write is stable in the journal. That's one reason why it is there--file 
system commits are heavyweight and slow. 

If we left the file system to its own devices and did a sync every 10 
seconds, the disk would sit idle while a bunch of dirty data accumulated 
in cache, and then the sync/snapshot would take a really long time. This 
is horribly inefficient (the disk is idle half the time), and useless (the 
delayed write behavior makes sense for local workloads, but not servers 
where there is a client on the other end batching its writes). To prevent 
this, 'filestore flusher' will prod the kernel to flush out any written 
data to the disk quickly. Then, when we get around to doing the 
sync/snapshot it is pretty quick, because only fs metadata and 
just-written data needs to be flushed. 
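For reference, the knobs involved here look something like this in ceph.conf
(option names as discussed above; the values and journal path are purely
illustrative, not tuning advice):

[osd]
    osd journal = /dev/nvram0           ; fast journal device (example path)
    osd journal size = 1000             ; MB
    filestore flusher = true            ; prod the kernel to flush written data early
    filestore max sync interval = 5     ; seconds between sync/snapshot
    filestore min sync interval = 0.01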

So: the behavior you're seeing is normal, and good. 

Did I understand your confusion correctly? 

Thanks! 
sage 


On Wed, 20 Jun 2012, Alexandre DERUMIER wrote: 

 Hi, 
 I have tried to disabe filestore flusher 
 
 filestore flusher = false 
 filestore max sync interval = 30 
 filestore min sync interval = 29 
 
 
 in osd config. 
 
 
 now, I see correct sync each 30s when doing rados bench 
 
 rados -p pool3 bench 60 write -t 16 
 
 
 seekwatcher movie: 
 
 
 before 
 -- 
 http://odisoweb1.odiso.net/seqwrite-radosbench-flusherenable.mpg 
 
 after 
 - 
 http://odisoweb1.odiso.net/seqwrite-radosbench-flusherdisable.mpg 
 
 
 Shouldn't that be the normal behaviour? What exactly is filestore flusher vs 
 syncfs? 
 
 
 
 This seems to work fine with rados bench, 
 but when I launch a benchmark with fio from my guest VM, I again see constant 
 writes. 
 (I'll try to debug that today) 
 
 
 My target is to be able to handle small random writes and write them out every 30s. 
 
 Regards, 
 
 Alexandre 

Re: ceph init script didn't stop the ceph.

2012-06-22 Thread ramu
Hi Dan Mick,
Thanks for reply,
I tried -v also; it can't stop. All the daemons are also still running.






Rolling upgrades possible?

2012-06-22 Thread John Axel Eriksson
I guess this has been asked before, I'm just new to the list and
wondered whether it's possible to do
rolling upgrades of mons, osds and radosgw? We will soon be in the
process of migrating from our current
storage solution to Ceph/RGW. We will only use the object storage,
actually mainly the S3-interface radosgw
supplies.

Right now we have a very small test-installation - 1 mon, 2 osds where
the mon also runs rgw. Next week I've
heard that 0.48 might be released, if we upgrade to that, do we have
to shut down the cluster during the upgrade
or can we do a rolling upgrade while still responding to PUTs and
GETs? If not possible yet, is this in the pipeline?

Best,
John


Recommendations for OSDs, RGW and MON

2012-06-22 Thread John Axel Eriksson
Currently we're running a test cluster with 1 mon, 1 radosgw and 2
osds. RGW runs on the same host as the mon while
the osds recides on two different servers. We have thought of maybe
running more than 1 osd on each storage server, where
the osds use different disks of course - is this something reasonable
or would performance/stability suffer?

Is there any recommendation against running rgw/mon on the same
server? Would a better setup be to put osd/mon/rgw on each
server and load-balancing rgw? Of course we might add more OSDs at
some point and I guess we don't want to run mons/rgw on
those.

Also, on Ubuntu 12.04, does anybody have experience with ceph on btrfs
performance or is the recommendation still to run on xfs?

All this is for running only the object storage part of ceph (only
accessed through RGWs S3-interface).

Thanks,
John


Re: Rolling upgrades possible?

2012-06-22 Thread Wido den Hollander

On 06/22/2012 11:23 AM, John Axel Eriksson wrote:

I guess this has been asked before, I'm just new to the list and
wondered whether it's possible to do
rolling upgrades of mons, osds and radosgw? We will soon be in the
process of migrating from our current
storage solution to Ceph/RGW. We will only use the object storage,
actually mainly the S3-interface radosgw
supplies.

Right now we have a very small test-installation - 1 mon, 2 osds where
the mon also runs rgw. Next week I've
heard that 0.48 might be released, if we upgrade to that, do we have
to shut down the cluster during the upgrade
or can we do a rolling upgrade while still responding to PUTs and
GETs? If not possible yet, is this in the pipeline?


Currently there is no guarantee that rolling upgrades will work; however, I 
suspect that with 0.48 this will become a priority.


With 0.48 there will be an on-disk format change, but I don't know if the 
protocol between the daemons will change.


Towards 0.48 I wouldn't bet on a rolling upgrade, but you can always try 
with a test cluster :)
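If you do try it on a test cluster, it helps to confirm what each daemon is
actually running before and after the upgrade; the binaries report their
version with -v, for example:

    $ ceph -v
    $ ceph-mon -v
    $ ceph-osd -v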


Wido



Best,
John


Re: Rolling upgrades possible?

2012-06-22 Thread Andrey Korolyov
On Fri, Jun 22, 2012 at 1:23 PM, John Axel Eriksson j...@insane.se wrote:
 I guess this has been asked before, I'm just new to the list and
 wondered whether it's possible to do
 rolling upgrades of mons, osds and radosgw? We will soon be in the
 process of migrating from our current
 storage solution to Ceph/RGW. We will only use the object storage,
 actually mainly the S3-interface radosgw
 supplies.

 Right now we have a very small test-installation - 1 mon, 2 osds where
 the mon also runs rgw. Next week I've
 heard that 0.48 might be released, if we upgrade to that, do we have
 to shut down the cluster during the upgrade
 or can we do a rolling upgrade while still responding to PUTs and
 GETs? If not possible yet, is this in the pipeline?

 Best,
 John

This should not be possible with only one mon: you need at least three for
continuous operation, so you can add them right now and then try to
upgrade the cluster nodes one by one.
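A rough sketch of growing from one monitor to three, with command syntax as I
remember it from the docs of that era (the name "b" and the address are
hypothetical; double-check against your version before relying on this):

    # on the new host, e.g. for a new mon.b at 192.168.0.2
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add b 192.168.0.2:6789
    ceph-mon -i b        # start the new monitor, then repeat for mon.c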

By the way, does the recent change to the OSDs' on-disk content mean the data
format is close to stabilization (which would, in theory, allow flawless
per-node upgrades)? If so, is there an approximate timeline? About one and a
half months ago some in-list discussion mentioned such stabilization as coming
very soon(tm), so I'd be happy to get a more exact timeline before pushing a
Ceph-based infrastructure into production.



Re: Recommendations for OSDs, RGW and MON

2012-06-22 Thread Wido den Hollander

On 06/22/2012 11:28 AM, John Axel Eriksson wrote:

Currently we're running a test cluster with 1 mon, 1 radosgw and 2
osds. RGW runs on the same host as the mon while
the osds recides on two different servers. We have thought of maybe
running more than 1 osd on each storage server, where
the osds use different disks of course - is this something reasonable
or would performance/stability suffer?


No, that is not a problem at all. You can run multiple OSDs on one server.

Just make sure you have something like 1GB ~ 2GB of memory available per 
OSD.


See: http://www.ceph.com/docs/master/rec/
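For example, a ceph.conf fragment for two OSDs sharing one storage server,
each on its own disk, might look roughly like this (the hostname, paths and
devices are made up for illustration):

[osd.0]
    host = store1
    osd data = /srv/osd.0      ; filesystem on /dev/sdb1 mounted here

[osd.1]
    host = store1
    osd data = /srv/osd.1      ; second OSD, backed by /dev/sdc1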



Is there any recommendation against running rgw/mon on the same
server? Would a better setup be to put osd/mon/rgw on each
server and loadbalancing rgw? Of course we might add more osd:s at
some point and I guess we don't want to run mons/rgw on
those.



You can mix the RGW and MON daemons, but you are better off letting 
the OSDs run on their own, dedicated machines.


In a later stage you can always move the monitors to new machines.


Also, on Ubuntu 12.04, does anybody have experience with ceph on btrfs
performance or is the recommendation still to run on xfs?


I wouldn't run with the stock 12.04 kernel with btrfs. The story goes 
(see ml archive) that with kernel 3.5 there have been some btrfs 
improvements, but if you are only using the RGW, XFS might be your best 
option.


http://www.ceph.com/docs/master/rec/filesystem/
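If you go the XFS route, preparing an OSD disk is the usual mkfs/mount dance,
along these lines (device and mount point are placeholders):

    mkfs.xfs -f /dev/sdb1
    mount -o rw,noatime,inode64 /dev/sdb1 /srv/osd.0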

Wido



All this is for running only the object storage part of ceph (only
accessed through RGWs S3-interface).

Thanks,
John


Re: RBD layering design draft

2012-06-22 Thread Guido Winkelmann
On Monday, 18 June 2012, 10:00:32 you wrote:
 On Fri, Jun 15, 2012 at 1:48 PM, Josh Durgin josh.dur...@inktank.com 
wrote:
 $ rbd unpreserve pool/image@snap
 Error unpreserving: child images rely on this image
 
 UX nit: this should also say what image it found.
 
 rbd: Cannot unpreserve: Still in use by pool2/image2

What if it's in use by a lot of images? Should it print them all, or should it 
print something like "Still in use by pool2/image2 and 50 others, use 
list_children to see them all"?

Guido


Re: RBD layering design draft

2012-06-22 Thread Guido Winkelmann
 On 06/15/2012 03:48 PM, Josh Durgin wrote:

  Then you can perform the clone:
  $ rbd clone --parent pool/parent@snap pool2/child1
 
 Based on my comments above, if the parent had not been preserved
 it would automatically be at this point, by virtue of the fact it
 has a clone associated with it.
 
 Since there is always exactly one parent and one child, I'd say
 drop the --parent and just have the parent and child be
 defined by their position.  If the parent could be optionally
 skipped for some reason, then make it be the second one.

I think that would be a very bad idea. "clone source target" would be a good 
idea; nearly all similar command-line utilities (cp, mv, ln) work like that. 
"clone target source" would be counterintuitive and probably lead to 
otherwise avoidable mistakes.

Guido


Re: RBD layering design draft

2012-06-22 Thread Guido Winkelmann
On Friday, 22 June 2012, 02:02:38 Alex Elsayed wrote:
 Dan Mick dan.mick at inktank.com writes:
  On 06/18/2012 11:01 AM, Sage Weil wrote:
   On Mon, 18 Jun 2012, Josh Durgin wrote:
$ rbd copyup pool2/child1
  
  disown and adopt?  :)  (actually I started as a joke, but really I
  kinda like that; fits with the parent-child name)
 
 The issue I see with that is that the argument refers to the child rather
 than the parent, so it doesn't match. I personally like 'unshare' since
 it'll also work in the dedup case, but if we stick with the parent/child
 terminology 'emancipate' might work (although it lacks a good reverse).

AFAIK the word started in ancient Rome as meaning to release slaves into 
freedom, so I suppose the opposite would be enslave?

Guido


Re: Unmountable btrfs filesystems

2012-06-22 Thread Guido Winkelmann
On Saturday, 16 June 2012, 14:12:03 Mark Nelson wrote:
 btrfsck might tell you what's wrong.  Sounds like there is a
 btrfs-restore command in the dangerdonteveruse branch you could try.
 Beyond that, I guess it just really comes down to tradeoffs.

I've had similar problems in the recent past. Turns out Ceph makes heavy use 
of btrfs snapshots when running on btrfs, and btrfs-restore will not restore 
those, so it cannot be used to restore a broken osd.

Guido


Re: Unmountable btrfs filesystems

2012-06-22 Thread Guido Winkelmann
On Sunday, 17 June 2012, 15:55:42 Martin Mailand wrote:
 Hi Wido,
 until recently there were still a few bugs in btrfs which could be hit
 quite easily with ceph. The last big one was fixed here
 http://www.spinics.net/lists/ceph-devel/msg06270.html

I keep hearing things along the lines of "yes, btrfs is really really close to 
ready, we just had some really nasty bug in the last release, so you 
absolutely have to run the very latest Linux kernel" since at least Linux 3.1.

I think I will probably wait until there have been at least three major Linux 
releases with no serious btrfs issues before I start using it in production.

Guido


Re: all rbd users: set 'filestore fiemap = false'

2012-06-22 Thread Christoph Hellwig
On Mon, Jun 18, 2012 at 08:32:50AM -0700, Sage Weil wrote:
 On Mon, 18 Jun 2012, Christoph Hellwig wrote:
  On Sun, Jun 17, 2012 at 09:02:15PM -0700, Sage Weil wrote:
   that data over the wire.  We have observed incorrect/changing FIEMAP on 
   both btrfs:
  
  both btrfs and?
 
 Whoops, it was XFS.  :/

If you manage to extract a minimal test case I'd love to see it. FIEMAP
is a complete mess, although most of the time the errors actually are on
the user's side due to its complicated semantics.



Re: RBD layering design draft

2012-06-22 Thread Tommi Virtanen
On Fri, Jun 22, 2012 at 7:36 AM, Guido Winkelmann
guido-c...@thisisnotatest.de wrote:
 rbd: Cannot unpreserve: Still in use by pool2/image2

 What if it's in use by a lot of images? Should it print them all, or should it
 print something like Still in use by pool2/image2 and 50 others, use
 list_children to see them all?

As walking through all the (potential) clones is an expensive
operation, this should abort as soon as possible, and just complain
about the one encountered so far. That could easily be a difference of
a few seconds vs tens of seconds. We don't even know the count,
without paying that cost, so that can't be printed either.


Re: reproducable osd crash

2012-06-22 Thread Stefan Priebe - Profihost AG
I'm still able to crash the ceph cluster by doing a lot of random I/O 
and then shutting down the KVM.


Stefan

On 21.06.2012 21:57, Stefan Priebe wrote:

OK i discovered this time that all osds had the same disk usage before
crash. After starting the osd again i got this one:
/dev/sdb1 224G 23G 191G 11% /srv/osd.30
/dev/sdc1 224G 1,5G 213G 1% /srv/osd.31
/dev/sdd1 224G 1,5G 213G 1% /srv/osd.32
/dev/sde1 224G 1,6G 213G 1% /srv/osd.33

So instead of 1,5GB osd 30 now uses 23G.

Stefan

On 21.06.2012 15:23, Stefan Priebe - Profihost AG wrote:

Hmm, is this normal? (ceph health is NOW OK again)

/dev/sdb1 224G 655M 214G 1% /srv/osd.20
/dev/sdc1 224G 640M 214G 1% /srv/osd.21
/dev/sdd1 224G 34G 181G 16% /srv/osd.22
/dev/sde1 224G 608M 214G 1% /srv/osd.23

Why does one OSD have so much more used space than the others?

On my other OSD nodes all OSDs have around 600MB-700MB. Even when I reformat
/dev/sdd1, after the backfill it again has 34GB?

Stefan

On 21.06.2012 15:13, Stefan Priebe - Profihost AG wrote:

Another strange thing. Why does THIS OSD have 24GB and the others just
650MB?

/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1% /srv/osd.23


When I now start the OSD again it seems to hang forever. Load goes
up to 200 and I/O waits rise from 0% to 20%.

On 21.06.2012 14:55, Stefan Priebe - Profihost AG wrote:

Hello list,

I'm able to reproducibly crash OSD daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run the randwrite stress more than 2 times. In my case, mostly OSD 22
crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k
--size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and
didn't
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan



Re: RBD layering design draft

2012-06-22 Thread Tommi Virtanen
On Thu, Jun 21, 2012 at 2:51 PM, Alex Elder el...@dreamhost.com wrote:
 Before cloning a snapshot, you must mark it as preserved, to prevent
 it from being deleted while child images refer to it:
 ::

     $ rbd preserve pool/image@snap

 Why is it necessary to do this?  I think it may be desirable to

So the snapshot will not be removed.

See this: 
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6595/focus=6675

     $ rbd clone --parent pool/parent@snap pool2/child1

 Based on my comments above, if the parent had not been preserved
 it would automatically be at this point, by virtue of the fact it
 has a clone associated with it.

The client creating the child typically has no write access to the
parent, and cannot do anything to it.

 To delete the parent, you must first mark it unpreserved, which checks
 that there are no children left:
 ::

 Please show what happens here if this is done at this point:

      $ rbd snap rm pool/image@snap

rbd: Cannot remove a preserved snapshot: pool/image@snap

or something like that.

 Note that the preserve and unpreserve operations are
 valid on snapshots, not RBD images or clones.

That's a very good point. Perhaps the command should be "rbd snap
preserve" and "rbd snap unpreserve".
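With that naming, the proposed workflow from the draft would read something
like this (hypothetical syntax and error text; none of it is implemented yet):

    $ rbd snap preserve pool/image@snap
    $ rbd clone --parent pool/image@snap pool2/child1
    $ rbd snap unpreserve pool/image@snap
    rbd: Cannot unpreserve: child images rely on this snapshot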

 In the initial implementation, called 'trivial layering', there will
 be no tracking of which objects exist in a clone. A read that hits a
 non-existent object will attempt to read from the parent object, and
 this will continue recursively until an object exists or an image with
 no parent is found.

 So a non-existent object in a clone is a bit like a hole in a file, but
 instead of implicitly backing it with zeroes it backs it with the data
 found at the same range as the snapshot the clone was based on?

Yes.

Continuation of that: will the clone store sparse objects, or always
copy all the data for that object from the parent? That is, what
happens if I write 1 byte to a fresh clone? (And remember that block
sizes can differ.)

 If a clone had snapshots, does this mean a snapshot can include
 non-existent objects in it?

I don't like the phrase "include non-existent objects", and find that
an overambitious topological exercise, but yes, a snapshot may be
sparse.

Reads fall through toward parents until they find something -- or run
out of parents, in which case they read zeros.

 Does this mean that an attempt to read beyond the end of an RBD snapshot
 is not an error if the read is being done for a clone whose size has
 been increased from what it was originally?  (In that case, the correct
 action would be to read the range as zeroes.)

This was discussed later in the email, and I see you responded to that part.

 In addition to knowing which parent a given image has, we want to be
 able to tell if a preserved image still has children. This is
 accomplished with a new per-pool object, `rbd_children`, which maps
 (parent pool, parent id, parent snapshot id) to a list of child

 My first thought was, why does the parent snapshot need to know the
 *identity* of its descendant clones?  The main thing it seems to need
 is a count of the number of clones it has.

Maintaining that count in a distributed system, without listing the
things that are in it, gets challenging. Idempotent counters are
challenging. Maintaining it as a set is easier, significantly more
debuggable, and unlikely to be too costly. Plus it lets us serve rbd
children faster.

 The other thing though is that you shouldn't store the mapping
 in the rbd_children object.  Instead, you should only store
 the child object ids there, and consult those objects to identify
 their parents.  Otherwise you end up with problems related to
 possible discrepancy between what a child points to and what the
 rbd_children mapping says.

The question we need to ask is who here is a child of $FOO. Needing
an indirection for every member makes that cost a lot more.
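As a concrete (entirely made-up) illustration of that lookup: with the
proposed rbd_children object it is a single key fetch rather than a scan of
every image's header,

    (parent pool id, parent image id, parent snap id) -> [child image ids...]
    e.g.  (3, "10024f5", 2) -> ["11aa3c2", "11aa3d8"]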

 image ids. This is stored in the same pool as the child image
 because the client creating a clone already has read/write access to
 everything in this pool. This lets a client with read-only access to
 one pool clone a snapshot from that pool into a pool they have full
 access to. It increases the cost of unpreserving an image, since this

 This is really a bad feature of this design because it doesn't scale.
 So we ought to be thinking about a better way to do it if possible.

That would be nice. Good luck! We await your email, though not holding
our breath ;)

 To support resizing of layered images, we need to keep track of the
 minimum size the image ever was, so that if a child image is shrunk

 We don't want the minimum size.  We want to know the highest valid
 offset in the image:
 - Upon cloning, the last valid offset of the clone is set to the last
  valid offset of the snapshot.
 - If an image is resized larger, the last valid offset remains the same.
 - If an image is resized smaller, the last valid offset is reduced to
  the new, smaller size.
 - If 

Re: Rolling upgrades possible?

2012-06-22 Thread Sage Weil
A rolling upgrade to 0.48 will be possible, provided the old version is 
reasonably recent (0.45ish or later; I need to confirm that).

The upgrade will be a bit awkward because of the disk format upgrade, 
however.  Each ceph-osd will need to do a conversion on startup which can 
take a while, so you will want to restart them on a per-host basis or 
per-rack basis (depending on how your CRUSH map is structured).
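In practice that per-host restart looks something like the following on each
storage node in turn, waiting for recovery to finish in between (init-script
syntax may differ slightly between distributions):

    /etc/init.d/ceph restart osd    # restart all ceph-osd daemons on this host
    ceph -s                         # wait for HEALTH_OK before moving on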

The monitors are also doing an encoding change, but will only make the 
transition after all members of the quorum run the new code.  If you start 
the upgrade with a degraded cluster and have another failure, you'll need 
to make sure the recovering node(s) run new code.

The goal is to make all future upgrades possible using rolling upgrades.  
It will be tricky with some of the OSD changes coming, but that is the 
goal.

sage


On Fri, 22 Jun 2012, John Axel Eriksson wrote:

 I guess this has been asked before, I'm just new to the list and
 wondered whether it's possible to do
 rolling upgrades of mons, osds and radosgw? We will soon be in the
 process of migrating from our current
 storage solution to Ceph/RGW. We will only use the object storage,
 actually mainly the S3-interface radosgw
 supplies.
 
 Right now we have a very small test-installation - 1 mon, 2 osds where
 the mon also runs rgw. Next week I've
 heard that 0.48 might be released, if we upgrade to that, do we have
 to shut down the cluster during the upgrade
 or can we do a rolling upgrade while still responding to PUTs and
 GETs? If not possible yet, is this in the pipeline?
 
 Best,
 John


[GIT PULL] Ceph fixes for -rc4

2012-06-22 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are a couple of fixes from Yan for bad pointer dereferences in the 
messenger code and when fiddling with page->private after page migration, 
a fix from Alex for a use-after-free in the osd client code, and a couple of 
fixes for the message refcounting and shutdown ordering.

Thanks!
sage



Alex Elder (1):
  libceph: osd_client: don't drop reply reference too early

Sage Weil (2):
  libceph: use con get/put ops from osd_client
  libceph: flush msgr queue during mon_client shutdown

Yan, Zheng (2):
  ceph: check PG_Private flag before accessing page->private
  rbd: Clear ceph_msg->bio_iter for retransmitted message

 fs/ceph/addr.c |   21 -
 net/ceph/ceph_common.c |7 ---
 net/ceph/messenger.c   |4 
 net/ceph/mon_client.c  |8 
 net/ceph/osd_client.c  |   12 ++--
 5 files changed, 30 insertions(+), 22 deletions(-)


[PATCH 1/9] libceph: encapsulate and document connect sequence

2012-06-22 Thread Alex Elder
Encapsulate the code that handles the initial phase of establishing a
ceph connection with a peer, and add a bunch of documentation about
what's involved.  Change process_banner() to return 1 on success
rather than 0, to allow the new ceph_con_connect_response() to
return 0 to indicate the response has not yet been completely read.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   71
++-
 1 file changed, 54 insertions(+), 17 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1472,7 +1472,7 @@ static int process_banner(struct ceph_co
 ceph_pr_addr(con-msgr-inst.addr.in_addr));
}

-   return 0;
+   return 1;
 }

 static void fail_protocol(struct ceph_connection *con)
@@ -1970,6 +1970,57 @@ static void process_message(struct ceph_
prepare_read_tag(con);
 }

+/*
+ * Initiate the first phase of establishing a connection with
+ * the peer (connecting).  This phase consists of:
+ * - client requests TCP connection to server
+ * - server accepts TCP connection from client
+ * - client sends banner to server
+ * - server receives and validates client's banner
+ * - client sends little-endian encoded own socket (IP) address
+ * - server receives, validates, and records client's encoded address
+ * If all is well to this point, then we begin processing the
+ * connect response.
+ */
+static int ceph_con_connect(struct ceph_connection *con)
+{
+   set_bit(CONNECTING, &con->state);
+
+   con_out_kvec_reset(con);
+   prepare_write_banner(con);
+   prepare_read_banner(con);
+
+   BUG_ON(con->in_msg);
+   con->in_tag = CEPH_MSGR_TAG_READY;
+   dout("%s initiating connect on %p new state %lu\n",
+       __func__, con, con->state);
+
+   return ceph_tcp_connect(con);
+}
+
+/*
+ * Handle the response from the first phase of establishing a
+ * connection with the peer.  This consists of:
+ * - server sends banner to client
+ * - client receives and validates server's banner
+ * - server sends little-endian encoded own socket (IP) address
+ * - client receives, validates, and records server's encoded address
+ * - server sends little-endian encoded socket (IP) address for client
+ * - client receives and records its encoded address supplied by server
+ * If all is well to this point, then we can transition to the
+ * NEGOTIATING state.
+ */
+static int ceph_con_connect_response(struct ceph_connection *con)
+{
+   int ret;
+
+   dout("%s connecting\n", __func__);
+   ret = read_partial_banner(con);
+   if (ret > 0)
+       ret = process_banner(con);
+
+   return ret;
+}

 /*
  * Write something to the socket.  Called in a worker thread when the
@@ -1986,17 +2037,7 @@ more:

/* open the socket first? */
if (con-sock == NULL) {
-   set_bit(CONNECTING, con-state);
-
-   con_out_kvec_reset(con);
-   prepare_write_banner(con);
-   prepare_read_banner(con);
-
-   BUG_ON(con-in_msg);
-   con-in_tag = CEPH_MSGR_TAG_READY;
-   dout(try_write initiating connect on %p new state %lu\n,
-con, con-state);
-   ret = ceph_tcp_connect(con);
+   ret = ceph_con_connect(con);
if (ret  0) {
con-error_msg = connect error;
goto out;
@@ -2095,13 +2136,9 @@ more:
}

if (test_bit(CONNECTING, con-state)) {
-   dout(try_read connecting\n);
-   ret = read_partial_banner(con);
+   ret = ceph_con_connect_response(con);
if (ret = 0)
goto out;
-   ret = process_banner(con);
-   if (ret  0)
-   goto out;

clear_bit(CONNECTING, con-state);
set_bit(NEGOTIATING, con-state);


[PATCH 2/9] libceph: encapsulate and document negotiation phase

2012-06-22 Thread Alex Elder
Encapsulate the code that handles the negotiation phase of establishing a
ceph connection with a peer, and add a bunch of documentation about
what's involved.  Change process_connect() to return 1 on success
rather than 0, to allow the new ceph_con_negotiate_response() to
return 0 to indicate the response has not yet been completely read.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |  107
+++
 1 file changed, 91 insertions(+), 16 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1633,7 +1633,7 @@ static int process_connect(struct ceph_c
con-error_msg = protocol error, garbage tag during connect;
return -1;
}
-   return 0;
+   return 1;
 }


@@ -2023,6 +2023,82 @@ static int ceph_con_connect_response(str
 }

 /*
+ * The first phase of connecting with the peer succeeded.  Now start
+ * the second phase (negotiating), which consists of:
+ *  - client sends a connect message to server, specifying
+ *information about itself, including the protocol it intends to
+ *use and the features it supports.
+ *  - if authorizer data is needed for the connection, its length is
+ *recorded in the connect message, and client sends its content
+ *immediately after the connect message
+ *  - server receives the connect message from the client, and if it
+ *indicates authorizer data follows, reads that also.
+ * If all is well to this point, then we begin processing the
+ * negotiation response.
+ */
+static int ceph_con_negotiate(struct ceph_connection *con)
+{
+   int ret;
+
+   clear_bit(CONNECTING, &con->state);
+   set_bit(NEGOTIATING, &con->state);
+
+   /* Banner was good, exchange connection info */
+   ret = prepare_write_connect(con);
+   if (ret >= 0)
+       prepare_read_connect(con);
+
+   return ret;
+}
+
+/*
+ * Handle the response from the negotiating phase of connecting the
+ * peer.  This consists of:
+ *  - server validates the connect message (and possibly authorizer
+ *data), and sends a response to the client:
+ *  - if the protocol version supplied by the client is not what
+ *was expected, response is a BADPROTOVER tag
+ *  - if the features supported by the client are missing
+ *features required by the server, response is a FEATURES
+ *tag.
+ *  - if the features supported by the client are missing
+ *  - if authorizer data is supplied by the client and it is not
+ *valid, response is a BADAUTHORIZER tag.
+ *  - (There are some other conditions related to message and
+ *connection sequence numbers but they are not covered here)
+ *  - Otherwise the response will begin with a READY tag, and
+ *will include a ceph connect reply message, which will
+ *include the features supported by the server, and the
+ *server's own authorization data.
+ *  - client validates the connect message (and possibly authorizer
+ *data) from the server:
+ *  - If the tag indicates a bad protocol or mismatching
+ *features, the connection attempt is abandoned, so the ceph
+ *connection is reset and closed.
+ *  - If the tag indicates a bad authorizer, a second connect
+ *attempt is initiated.  If a second attempt fails due to a
+ *bad authorizer, the connection attempt fails.
+ *  - If the tag indicates READY, the client will check the
+ *features supported by the server.  If the server's
+ *features do not include a feature required by the client,
+ *the connection attempt is abandoned, so the ceph
+ *connection is reset and closed.
+ *  If no failures occurred to this point, the connection is established.
+ */
+static int ceph_con_negotiate_response(struct ceph_connection *con)
+{
+   int ret;
+
+   dout("%s negotiating\n", __func__);
+
+   ret = read_partial_connect(con);
+   if (ret > 0)
+       ret = process_connect(con);
+
+   return ret;
+}
+
+/*
  * Write something to the socket.  Called in a worker thread when the
  * socket appears to be writeable and we have something ready to send.
  */
@@ -2136,31 +2212,30 @@ more:
}

if (test_bit(CONNECTING, con-state)) {
+   /*
+* See if we got the response we expect from our
+* connection request.
+*/
ret = ceph_con_connect_response(con);
if (ret = 0)
goto out;

-   clear_bit(CONNECTING, con-state);
-   set_bit(NEGOTIATING, con-state);
-
-   /* Banner is good, exchange connection info */
-   ret = prepare_write_connect(con);
-   if (ret  0)
-   goto out;
-   

[PATCH 3/9] libceph: close the connection's socket on reset

2012-06-22 Thread Alex Elder
When a ceph connection is reset, all its state is cleared.  However
the underlying socket never actually gets closed.  Do that, to
essentially make the reset process complete.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |1 +
 1 file changed, 1 insertion(+)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -492,6 +492,7 @@ static void reset_connection(struct ceph
}
con-in_seq = 0;
con-in_seq_acked = 0;
+   con_close_socket(con);
 }

 /*


[PATCH 4/9] libceph: don't close socket in OPENING state

2012-06-22 Thread Alex Elder
The only way a socket enters OPENING state is via ceph_con_open().

The only times ceph_con_open() is called are:
  - In fs/ceph/mds_client.c:register_session(), where it occurs
soon after a call to ceph_con_init().
  - In fs/ceph/mds_client.c:send_mds_reconnect().  This is
called in two places.
- In fs/ceph/mds_client.c:check_new_map(), it is called
  after a call to ceph_con_close()
- Or in fs/ceph/mds_client.c:peer_reset(), which is also only
  called after reset_connection, which includes a call to
  ceph_con_close().
  - In net/ceph/mon_client.c:__open_session(), where it's called
right after a call to ceph_con_init().
  - In net/ceph/osd_client.c:__reset_osd(), right after a call
to ceph_con_close().
  - In net/ceph/osd_client.c:__map_request(), shortly after a call
to create_osd(), which includes a call to ceph_con_init().

After a call to ceph_con_init(), the state of a ceph connection is
CLOSED, and its socket pointer is null.

Similarly, after a call to ceph_con_close(), the state of the
connection is CLOSED, the underlying socket is closed, and the
connection's socket pointer is null.

Therefore, there is no reason to call con_close_socket() when a
connection is found to be in OPENING state in con_work(), because
the socket will already be closed, and the connection will already
be in CLOSED state.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |1 -
 1 file changed, 1 deletion(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2387,7 +2387,6 @@ restart:
if (test_and_clear_bit(OPENING, con-state)) {
/* reopen w/ new peer */
dout(con_work OPENING\n);
-   con_close_socket(con);
}

ret = try_read(con);


[PATCH 5/9] libceph: change TAG_CLOSE handling

2012-06-22 Thread Alex Elder
Currently, if a connection is READY in try_read(), and a CLOSE tag
is what is received next, the connection's state changes from
CONNECTED to CLOSED and try_read() returns.

If this happens, control returns to con_work(), and try_write()
is called.  If there was queued data to send, try_write() appears
to attempt to send it despite the receipt of the CLOSE tag.

Eventually, try_write() will return either:
  - A non-negative value, in which case con_work() will end, and
will at some point get triggered to run by an event.
  - -EAGAIN, in which case control returns to the top of con_work()
  - Some other error, which will cause con_work() to call
ceph_fault(), which will close the socket and force a new
connection sequence to be initiated on the next write.

At the top of con_work(), if the connection is in CLOSED state,
the same fault handling will be done as would happen for any
other error.

Instead of messing with the connection state deep inside try_read(),
just have try_read() return a negative value (an errno), and let
the fault handling code in con_work() take care of resetting the
connection right away.  This will also close the connection before
needlessly sending any queued data to the other end.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2273,8 +2273,7 @@ more:
prepare_read_ack(con);
break;
case CEPH_MSGR_TAG_CLOSE:
-   clear_bit(CONNECTED, con-state);
-   set_bit(CLOSED, con-state);   /* fixme */
+   ret = -EIO;
goto out;
default:
goto bad_tag;


[PATCH 6/9] libceph: kill fail_protocol()

2012-06-22 Thread Alex Elder
In the negotiating phase of establishing a connection, the server
can indicate various connection failures using special tag values.
The tags can mean: that the client does not have features needed
by the server; that the protocol advertised by the client is not
what the server expects; or that the authorizer data provided by
the client was not adequate to grant access.

These three cases are handled in process_connect(), which calls
fail_protocol() for all three.  The result of that is that the
connection gets reset, and the connection gets moved to CLOSED
state.

The previous patch description walks through what happens when
a connection gets marked CLOSED within try_read(), and why it's
sufficient (and better) to simply have it return a negative value.

So just do that--don't bother with fail_protocol(), just return a
negative value in these cases and let the caller sort out resetting
things.  Return -EIO in these cases rather than -1 (which can be
confused with -EPERM).

We can get rid of fail_protocol() because it is no longer used.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1476,12 +1476,6 @@ static int process_banner(struct ceph_co
return 1;
 }

-static void fail_protocol(struct ceph_connection *con)
-{
-   reset_connection(con);
-   set_bit(CLOSED, con-state);  /* in case there's queued work */
-}
-
 static int process_connect(struct ceph_connection *con)
 {
u64 sup_feat = con-msgr-supported_features;
@@ -1499,8 +1493,7 @@ static int process_connect(struct ceph_c
   ceph_pr_addr(con-peer_addr.in_addr),
   sup_feat, server_feat, server_feat  ~sup_feat);
con-error_msg = missing required protocol features;
-   fail_protocol(con);
-   return -1;
+   return -EIO;

case CEPH_MSGR_TAG_BADPROTOVER:
pr_err(%s%lld %s protocol version mismatch,
@@ -1510,8 +1503,7 @@ static int process_connect(struct ceph_c
   le32_to_cpu(con-out_connect.protocol_version),
   le32_to_cpu(con-in_reply.protocol_version));
con-error_msg = protocol version mismatch;
-   fail_protocol(con);
-   return -1;
+   return -EIO;

case CEPH_MSGR_TAG_BADAUTHORIZER:
con-auth_retry++;
@@ -1597,8 +1589,7 @@ static int process_connect(struct ceph_c
   ceph_pr_addr(con-peer_addr.in_addr),
   req_feat, server_feat, req_feat  ~server_feat);
con-error_msg = missing required protocol features;
-   fail_protocol(con);
-   return -1;
+   return -EIO;
}
clear_bit(NEGOTIATING, con-state);
set_bit(CONNECTED, con-state);


[PATCH 7/9] libceph: close connection on reset tag

2012-06-22 Thread Alex Elder
When a CEPH_MSGR_TAG_RESETSESSION tag is received, the connection
should be reset, dropping any pending messages and preparing for
a new connection to be negotiated.

Currently, reset_connection() is called to do this, but that only
drops messages.  To really get the connection fully reset, call
ceph_con_close() instead.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1533,7 +1533,8 @@ static int process_connect(struct ceph_c
pr_err(%s%lld %s connection reset\n,
   ENTITY_NAME(con-peer_name),
   ceph_pr_addr(con-peer_addr.in_addr));
-   reset_connection(con);
+   ceph_con_close(con);
+
ret = prepare_write_connect(con);
if (ret  0)
return ret;


[PATCH 8/9] libceph: close connection on connect failure

2012-06-22 Thread Alex Elder
The only time the CLOSED state is set on a ceph connection is in
ceph_con_init() and ceph_con_close().  Both of these will ensure
the connection's socket is closed.  Therefore there is no need
to close the socket in con_work() if the connection is found to
be in CLOSED state.

Rearrange things a bit in ceph_con_close() so we only manipulate
the state and flag bits *after* we've acquired the connection mutex.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -502,6 +502,8 @@ void ceph_con_close(struct ceph_connecti
 {
dout(con_close %p peer %s\n, con,
 ceph_pr_addr(con-peer_addr.in_addr));
+
+   mutex_lock(con-mutex);
clear_bit(NEGOTIATING, con-state);
clear_bit(CONNECTING, con-state);
clear_bit(CONNECTED, con-state);
@@ -512,11 +514,13 @@ void ceph_con_close(struct ceph_connecti
clear_bit(KEEPALIVE_PENDING, con-flags);
clear_bit(WRITE_PENDING, con-flags);

-   mutex_lock(con-mutex);
+   /* Clear everything out */
reset_connection(con);
con-peer_global_seq = 0;
cancel_delayed_work(con-work);
+
mutex_unlock(con-mutex);
+
queue_con(con);
 }
 EXPORT_SYMBOL(ceph_con_close);
@@ -2372,7 +2376,6 @@ restart:
}
if (test_bit(CLOSED, con-state)) { /* e.g. if we are replaced */
dout(con_work CLOSED\n);
-   con_close_socket(con);
goto done;
}
if (test_and_clear_bit(OPENING, con-state)) {


[PATCH 9/9] libceph: set CONNECTING state even earlier

2012-06-22 Thread Alex Elder
Move the setting of the CONNECTING state in a ceph connection
all the way back to where a connection first gets opened.  At
that point the connection's socket pointer is still null, and
the connection sequence is about to begin.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: b/net/ceph/messenger.c
===
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -533,6 +533,7 @@ void ceph_con_open(struct ceph_connectio
dout(con_open %p %s\n, con, ceph_pr_addr(addr-in_addr));
set_bit(OPENING, con-state);
WARN_ON(!test_and_clear_bit(CLOSED, con-state));
+   set_bit(CONNECTING, con-state);

memcpy(con-peer_addr, addr, sizeof(*addr));
con-delay = 0;  /* reset backoff memory */
@@ -1981,8 +1982,6 @@ static void process_message(struct ceph_
  */
 static int ceph_con_connect(struct ceph_connection *con)
 {
-   set_bit(CONNECTING, con-state);
-
con_out_kvec_reset(con);
prepare_write_banner(con);
prepare_read_banner(con);


Re: reproducable osd crash

2012-06-22 Thread Dan Mick

Stefan, I'm looking at your logs and coredump now.

On 06/21/2012 11:43 PM, Stefan Priebe wrote:

Does anybody have an idea? This is right now a showstopper to me.

On 21.06.2012 at 14:55, Stefan Priebe - Profihost AG 
s.pri...@profihost.ag wrote:


Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 
--direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 
--group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3Mceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 
error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
priebe_fio_randwrite_ceph-osd.bz2 =  osd binary

Stefan



Re: reproducable osd crash

2012-06-22 Thread Sam Just
I am still looking into the logs.
-Sam

On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote:
 Stefan, I'm looking at your logs and coredump now.


 On 06/21/2012 11:43 PM, Stefan Priebe wrote:

 Does anybody have an idea? This is right now a showstopper to me.

 On 21.06.2012 at 14:55, Stefan Priebe - Profihost
 AG s.pri...@profihost.ag wrote:

 Hello list,

 i'm able to reproducably crash osd daemons.

 How i can reproduce:

 Kernel: 3.5.0-rc3
 Ceph: 0.47.3
 FS: btrfs
 Journal: 2GB tmpfs per OSD
 OSD: 3x servers with 4x Intel SSD OSDs each
 10GBE Network
 rbd_cache_max_age: 2.0
 rbd_cache_size: 33554432

 Disk is set to writeback.

 Start a KVM VM via PXE with the disk attached in writeback mode.

 Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
 crashes.

 # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; halt

 Strangely exactly THIS OSD also has the most log entries:
 64K     ceph-osd.20.log
 64K     ceph-osd.21.log
 1,3M    ceph-osd.22.log
 64K     ceph-osd.23.log

 But all OSDs are set to debug osd = 20.

 dmesg shows:
 ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

 I uploaded the following files:
 priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't
 crash
 priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
 priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
 priebe_fio_randwrite_ceph-osd.bz2 =  osd binary

 Stefan



Re: reproducable osd crash

2012-06-22 Thread Dan Mick
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, 
which is not quite 0.47.3.  You can get the version with "binary -v", or 
(in my case) by examining strings in the binary.  I'm retrieving that 
version to analyze the core dump.
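For example, both approaches amount to something like this (output omitted;
the binary name is simply whatever was uploaded):

    $ ./ceph-osd -v
    $ strings ./ceph-osd | grep -i version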



On 06/21/2012 11:43 PM, Stefan Priebe wrote:

Does anybody have an idea? This is right now a showstopper to me.

On 21.06.2012 at 14:55, Stefan Priebe - Profihost AG 
s.pri...@profihost.ag wrote:


Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 
--direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 
--group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M    ceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 
error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
priebe_fio_randwrite_ceph-osd.bz2 =  osd binary

Stefan

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Unable to restart Mon after reboot

2012-06-22 Thread David Blundell
Hi all,

I am testing Ceph 0.47.2 on btrfs with three servers running Fedora 17.  
Following a reboot of the servers, one of the mon daemons crashes on startup 
with FAILED assert(r > 0).

MDS and the OSD start and run fine as do the mon daemons on the other two 
servers.

The debug log is at http://pastebin.com/tXwvd44Z

I would really appreciate any comments - especially if I am missing something 
obvious.

David
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to restart Mon after reboot

2012-06-22 Thread Dan Mick

Hi David:

The code there is trying to read some data off the monitor's storage to 
initialize, and apparently failing in an odd way.  It's trying to read 
the file 'latest' from the monitor directory (/data/mon0); the file can 
be opened, and stat says it's 4289 bytes long, but the read succeeds 
without error while returning 0 bytes (i.e., not an error, just an 
immediate end of file).


See if there's a file /data/mon0/latest of length 4289, and see if 
something is odd about its permissions (like maybe the read bits are 
turned off, or maybe the filesystem it's on has errors).
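As a rough sketch, those checks could look like this on the affected 
monitor host (paths are taken from your report; adjust for your layout):

  ls -l /data/mon0/latest                   # expect a regular, readable file of 4289 bytes
  stat /data/mon0/latest                    # compare owner/permissions with a healthy mon
  head -c 4289 /data/mon0/latest | wc -c    # a healthy file prints 4289 here, not 0
  dmesg | grep -i btrfs                     # look for filesystem errors around mount/read time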



On 06/22/2012 05:31 PM, David Blundell wrote:

Hi all,

I am testing Ceph 0.47.2 on btrfs with three servers running Fedora 17.  Following a reboot 
of the servers, one of the mon daemons crashes on startup with FAILED 
assert(r > 0)

MDS and the OSD start and run fine as do the mon daemons on the other two 
servers.

The debug log is at http://pastebin.com/tXwvd44Z

I would really appreciate any comments - especially if I am missing something 
obvious.

David

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance benchmark of rbd

2012-06-22 Thread Alexandre DERUMIER
Hi Eric,

Have you found any clue about the slow random write iops?

I'm doing some benchmarks from a kvm guest with fio, with random 4K blocks:
fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1

The journal is on tmpfs and the storage is a 15k drive.

I can't get more than 1000-2000 iops.

I don't understand why I don't get a lot more iops. 
If the journal is on tmpfs, it should be around 30,000 iops on a gigabit link (using 
all the bandwidth).
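(A quick back-of-the-envelope check of that number: a saturated gigabit link is roughly 125 MB/s, which at 4 KiB per write gives about 30,000 writes/s, before replication and protocol overhead.)

  echo $(( 125000000 / 4096 ))    # ~30517 theoretical 4K writes/s over one fully used gigabit link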

I also tried rbd_caching on my kvm guest; it didn't change anything.

Sequential writes with 4MB blocks can use the full gigabit link (around 
100MB/s).


Is the bottleneck in the rbd protocol?


- Original Message - 

From: Eric YH Chen eric_yh_c...@wiwynn.com 
To: mark nelson mark.nel...@inktank.com 
Cc: ceph-devel@vger.kernel.org, Chris YT Huang chris_yt_hu...@wiwynn.com, 
Victor CY Chang victor_cy_ch...@wiwynn.com 
Sent: Thursday, 14 June 2012 03:26:12 
Objet: RE: Performance benchmark of rbd 

Hi, Mark: 

I forgot to mention one thing: I created the rbd on the same machine and 
tested it there. That means the network latency may be lower than in the normal case. 

1. 
I use ext4 as the backend filesystem with the following mount options: 
data=writeback,noatime,nodiratime,user_xattr 

2. 
I use the default replication number, I think it is 2, right? 

3. 
On my platform, I have 192GB memory 

4. Sorry, the column names were reversed (left-right) before. Here is the 
correct one: 

Object size   Seq-write   Seq-read 
32 KB         23 MB/s     690 MB/s 
512 KB        26 MB/s     960 MB/s 
4 MB          27 MB/s     1290 MB/s 
32 MB         36 MB/s     1435 MB/s 

5. If I put all the journal data on an SSD device (Intel 520), the 
sequential write performance reaches 135MB/s instead of the original 
27MB/s (object size = 4MB). The other results are unchanged, including 
random write. I am curious why the SSD device doesn't help the 
random-write performance. 

6. For the random read/write, the data I provided before was correct, 
but I can give you more detail. Is it higher than what you expected? 

rand-write-4k              rand-write-16k 
bw (KB/s)   iops           bw (KB/s)   iops 
3,524       881            9,032       564 

mix-4k (50/50) 
r:bw (KB/s)   r:iops   w:bw (KB/s)   w:iops 
2,925         731      2,924         731 

mix-8k (50/50) 
r:bw (KB/s)   r:iops   w:bw (KB/s)   w:iops 
4,509         563      4,509         563 

mix-16k (50/50) 
r:bw (KB/s)   r:iops   w:bw (KB/s)   w:iops 
8,366         522      8,345         521 


7. 
Here is the HW RAID cache policy we use now: 
Write Policy: Write Back with BBU 
Read Policy:  ReadAhead 

If you are interested in how the HW RAID helps performance, I can help a 
little, since we also want to know the best configuration for our 
platform. Is there any test you would like me to run? 


Furthermore, do you have any suggestions for our platform that could 
improve the performance? Thanks! 



-Original Message- 
From: Mark Nelson [mailto:mark.nel...@inktank.com] 
Sent: Wednesday, June 13, 2012 8:30 PM 
To: Eric YH Chen/WYHQ/Wiwynn 
Cc: ceph-devel@vger.kernel.org 
Subject: Re: Performance benchmark of rbd 

Hi Eric! 

On 6/13/12 5:06 AM, eric_yh_c...@wiwynn.com wrote: 
 Hi, all: 
 
 I am doing some benchmark of rbd. 
 The platform is on a NAS storage. 
 
 CPU: Intel E5640 2.67GHz 
 Memory: 192 GB 
 Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12 , 7200rpm 
 (H1~ H12) 
 RAID Card: LSI 9260-4i 
 OS: Ubuntu12.04 with Kernel 3.2.0-24 
 Network: 1 Gb/s 
 
 We create 12 OSD on H1 ~ H12 with the journal is put on H0. 

Just to make sure I understand, you have a single node with 12 OSDs and 
3 mons, and all 12 OSDs are using the H0 disk for their journals? What 
filesystem are you using for the OSDs? How much replication? 
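(One quick way to double-check the pool replication level, for what it's worth; the exact output format differs between ceph versions:) 

  ceph osd dump | grep size     # each pool line reports its replication factor ('rep size' / 'size') 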

 We also create 3 MON in the cluster. 
 In brief, we set up an all-in-one ceph cluster, with 3 monitors and 
 12 OSDs. 
 
 The benchmark tool we used is fio 2.0.3. We had 7 basic test case 
 1) sequence write with bs=64k 
 2) sequence read with bs=64k 
 3) random write with bs=4k 
 4) random write with bs=16k 
 5) mix read/write with bs=4k 
 6) mix read/write with bs=8k 
 7) mix read/write with bs=16k 
 
 We create several rbd with different object size for the 
benchmark. 
 
 1. size = 20G, object size = 32KB 
 2. size = 20G, object size = 512KB 
 3. size = 20G, object size = 4MB 
 4. size = 20G, object size = 32MB 

Given how much memory you have, you may want to increase the amount of 
data you are writing during each test to rule out caching. 
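For example, something like this would push well past the page cache (the 
device path mirrors the earlier fio examples, the size/runtime values are 
only illustrative, and the rbd image needs to be big enough to hold the data): 

  fio --filename=/dev/vda1 --direct=1 --rw=write --bs=64k --size=400G \ 
      --runtime=600 --group_reporting --name=seqwrite-nocache 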

 
 We have some conclusion after the benchmark. 
 
 a. We can get better performance of sequence read/write when the 
 object size is bigger. 
 Object size   Seq-read   Seq-write 
 32 KB         23 MB/s    690 MB/s 
 512 KB        26 MB/s    960 MB/s 
 4 MB          27 MB/s    1290 MB/s 
 32 MB         36 MB/s    1435 MB/s 

Which test are these results from? I'm suspicious that the write 
numbers are so high. Figure that even with a local client and 1X 
replication, your journals and data partitions are each writing out a 
copy of the data. You don't have enough disk in that box to sustain 
1.4GB/s to both even under perfectly ideal conditions. Given that it 
sounds like you are using a single 7200rpm disk for 12 journals, I would