Re: qemu-1.7.0 and internal snapshot, Was: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

2013-12-20 Thread Oliver Francke

Hi Wido,

On 12/20/2013 08:06 AM, Wido den Hollander wrote:

On 12/17/2013 05:07 PM, Oliver Francke wrote:

Hi Alexandre and Wido ;)

well, I know this is a pretty old question... but I saw some comments from
you, Wido, as well as a recent patch for qemu-1.7.0 in the
git.proxmox repository ("internal snapshot async port to qemu 1.7 v5").
What I currently did is: apply modify-query-machines.patch and
internal-snapshot-async.patch to the qemu-1.7.0 sources; all hunks succeed.

Now after talking to the QMP with:

 { "execute": "savevm-start", "arguments": { "statefile": "rbd:123/905.save1" } }

or a local file, it spits out:

 qemu-system-x86_64: block.c:4430: bdrv_set_in_use: Assertion
`bs->in_use != in_use' failed

*sigh*

qemu is started with some parameters... and finally drive-specific 
ones:


-device virtio-blk-pci,drive=virtio0 -drive
format=raw,file=rbd:123/vm-905-disk-.rbd:rbd_cache=true:rbd_cache_size=33554432:rbd_cache_max_dirty=16777216:rbd_cache_target_dirty=8388608,cache=writeback,if=none,id=virtio0,media=disk,index=0 




Did I miss a relevant point?
What would be the correct strategy?



I haven't tested this recently, so I'm not sure if this should already 
work.


It would be great if this worked, but I'm not aware of it.


unfortunately I didn't give it a try at the time Alexandre first mentioned
it. This functionality should definitely make it into some upcoming qemu version.
If you get it to work with the current 1.7.0 I would appreciate any further
input ;)


Regards,

Oliver.



Wido


Thnx in advance and kind regards,

Oliver.

P.S.: I don't use libvirt nor proxmox as a complete system.

On 05/24/2013 10:57 PM, Oliver Francke wrote:

Hi Alexandre,

On 24.05.2013 at 17:37, Alexandre DERUMIER
aderum...@odiso.com wrote:



Hi,

For Proxmox, we have made some patches to split the savevm process,

to be able to save the memory to an external volume. (and not the
current volume).

For rbd, we create a new rbd volume to store the memory.

qemu patch is here :
https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD 




- Original message -


wow, sounds very interesting, being on the road for the next 3 days I
will have a closer look next week.

Thnx n regards,

Oliver.


From: Wido den Hollander w...@42on.com
To: Oliver Francke oliver.fran...@filoo.de
Cc: ceph-devel@vger.kernel.org
Sent: Friday, 24 May 2013 17:08:35
Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd
as storage-backend

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi,

with a running VM I encounter this strange behaviour; earlier
qemu versions don't show such an error.
Perhaps this comes from the rbd backend in qemu-1.5.0 in combination
with ceph-0.56.6? Therefore my crosspost.

Even if I have no real live-snapshot available - they know of this
restriction - it's more work for the customers to perform a shutdown
before they wanna do some changes to their VM ;)


Doesn't Qemu try to save the memory state to RBD here as well? That
doesn't work and fails.


Any hints welcome,

Oliver.














RBD/qemu-1.7.0 memory leak with drive_mirror/live-migration

2013-12-17 Thread Oliver Francke

Hi *,

I just tried a feature for live-migration via:

drive_mirror -f virtio0 rbd:123123/virtio0 rbd

in a qemu-monitoring-session.
One can immediately see some memory consumption... which never gets
released again. Additionally, the block job never ends, even after X/X
bytes completed.
You can cancel the job, but the memory stays occupied; after another try
even more RSS memory gets filled.


Same procedure with qcow2 does not need any more memory, and the job 
gets cleared after completion.
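
For anyone who wants to compare numbers, this is roughly how I watch it - a
minimal sketch, assuming the qemu PID is known and an HMP monitor is
available; the PID variable below is a placeholder:

 # sample the resident set size of the qemu process once per second
 while true; do grep VmRSS /proc/$QEMU_PID/status; sleep 1; done

 # in parallel, check from the monitor whether the job ever finishes
 (qemu) info block-jobs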


Possibly @Josh: any idea? Logfiles with debug_what=? needed? ;)

Thnx in @vance,

Oliver.



qemu-1.7.0 and internal snapshot, Was: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

2013-12-17 Thread Oliver Francke

Hi Alexandre and Wido ;)

well, I know this is a pretty old question... but I saw some comments from
you, Wido, as well as a recent patch for qemu-1.7.0 in the
git.proxmox repository ("internal snapshot async port to qemu 1.7 v5").
What I currently did is: apply modify-query-machines.patch and
internal-snapshot-async.patch to the qemu-1.7.0 sources; all hunks succeed.


Now after talking to the QMP with:

{ "execute": "savevm-start", "arguments": { "statefile": "rbd:123/905.save1" } }


or a local file, it spits out:

qemu-system-x86_64: block.c:4430: bdrv_set_in_use: Assertion
`bs->in_use != in_use' failed


*sigh*

qemu is started with some parameters... and finally drive-specific ones:

-device virtio-blk-pci,drive=virtio0 -drive 
format=raw,file=rbd:123/vm-905-disk-.rbd:rbd_cache=true:rbd_cache_size=33554432:rbd_cache_max_dirty=16777216:rbd_cache_target_dirty=8388608,cache=writeback,if=none,id=virtio0,media=disk,index=0


Did I miss a relevant point?
What would be the correct strategy?
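
For reference, this is roughly how I drive the patched interface - a minimal
sketch, with my own assumptions clearly marked: the QMP socket path is made
up, the size of the state volume is a guess, and I assume the statefile
volume has to exist before calling savevm-start (the patch may handle that
differently):

 # create an rbd volume for the memory state (size in MB, just a guess)
 rbd create --size 4096 123/905.save1

 # QMP handshake followed by the savevm-start call added by the patch
 echo '{ "execute": "qmp_capabilities" }
 { "execute": "savevm-start", "arguments": { "statefile": "rbd:123/905.save1" } }' \
 | socat - UNIX-CONNECT:/var/run/qemu/905.qmp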

Thnx in advance and kind regards,

Oliver.

P.S.: I don't use libvirt nor proxmox as a complete system.

On 05/24/2013 10:57 PM, Oliver Francke wrote:

Hi Alexandre,

On 24.05.2013 at 17:37, Alexandre DERUMIER aderum...@odiso.com wrote:


Hi,

For Proxmox, we have made some patches to split the savevm process,

to be able to save the memory to an external volume. (and not the current 
volume).

For rbd, we create a new rbd volume to store the memory.

qemu patch is here :
https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD

- Original message -


wow, sounds very interesting, being on the road for the next 3 days I will have 
a closer look next week.

Thnx n regards,

Oliver.


From: Wido den Hollander w...@42on.com
To: Oliver Francke oliver.fran...@filoo.de
Cc: ceph-devel@vger.kernel.org
Sent: Friday, 24 May 2013 17:08:35
Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as
storage-backend

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi,

with a running VM I encounter this strange behaviour; earlier
qemu versions don't show such an error.
Perhaps this comes from the rbd backend in qemu-1.5.0 in combination
with ceph-0.56.6? Therefore my crosspost.

Even if I have no real live-snapshot available - they know of this
restriction - it's more work for the customers to perform a shutdown
before they wanna do some changes to their VM ;)


Doesn't Qemu try to save the memory state to RBD here as well? That
doesn't work and fails.


Any hints welcome,

Oliver.








Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

2013-06-11 Thread Oliver Francke

Hi Alexandre, Josh,

sorry for coming back so very late. I tried the patch and though I could
not get it to work properly - very likely my fault - how about
integrating it into the rbd handler of qemu?


Josh? I think you are CC'd from another qemu-ticket anyway?

I could just ignore the EOPNOTSUPP or whatever it's called, but a smooth 
integration of live-snapshots would be so cool ;)


Kind regards,

Oliver.

On 05/24/2013 05:37 PM, Alexandre DERUMIER wrote:

Hi,

For Proxmox, we have made some patches to split the savevm process,

to be able to save the memory to an external volume. (and not the current 
volume).

For rbd, we create a new rbd volume to store the memory.

qemu patch is here :
https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD

- Original message -

From: Wido den Hollander w...@42on.com
To: Oliver Francke oliver.fran...@filoo.de
Cc: ceph-devel@vger.kernel.org
Sent: Friday, 24 May 2013 17:08:35
Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as
storage-backend

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi,

with a running VM I encounter this strange behaviour; earlier
qemu versions don't show such an error.
Perhaps this comes from the rbd backend in qemu-1.5.0 in combination
with ceph-0.56.6? Therefore my crosspost.

Even if I have no real live-snapshot available - they know of this
restriction - it's more work for the customers to perform a shutdown
before they wanna do some changes to their VM ;)


Doesn't Qemu try to save the memory state to RBD here as well? That
doesn't work and fails.


Any hints welcome,

Oliver.








qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

2013-05-24 Thread Oliver Francke

Hi,

with a running VM I encounter this strange behaviour; earlier
qemu versions don't show such an error.
Perhaps this comes from the rbd backend in qemu-1.5.0 in combination
with ceph-0.56.6? Therefore my crosspost.

Even if I have no real live-snapshot available - they know of this
restriction - it's more work for the customers to perform a shutdown
before they wanna do some changes to their VM ;)

Any hints welcome,

Oliver.



Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

2013-05-24 Thread Oliver Francke

Well,

On 05/24/2013 05:08 PM, Wido den Hollander wrote:

On 05/24/2013 09:46 AM, Oliver Francke wrote:

Hi,

with a running VM I encounter this strange behaviour; earlier
qemu versions don't show such an error.
Perhaps this comes from the rbd backend in qemu-1.5.0 in combination
with ceph-0.56.6? Therefore my crosspost.

Even if I have no real live-snapshot available - they know of this
restriction - it's more work for the customers to perform a shutdown
before they wanna do some changes to their VM ;)



Doesn't Qemu try to save the memory state to RBD here as well? That 
doesn't work and fails.


true, but qemu should try that in version 1.4.x too, and there it succeeds ;)

Tried to figure out some corresponding QMP-commands, found some 
references like:


{ "execute": "snapshot-create", "arguments": { "name": "vm_before_upgrade" } }


but that fails with unknown command.
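
A quick way to see which snapshot-related commands a given binary actually
exposes is query-commands - a small sketch, assuming a QMP socket at
/tmp/qmp.sock (adjust to your setup):

 echo '{ "execute": "qmp_capabilities" }
 { "execute": "query-commands" }' \
 | socat - UNIX-CONNECT:/tmp/qmp.sock | tr ',' '\n' | grep -i -e snapshot -e savevm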

The convenience of a monitoring command would be that qemu takes care of
all attached block devices; well, it _should_ exclude ide-cd0 devices,
where I get another error with an inserted CD, but... that's another error.


Oliver.




Any hints welcome,

Oliver.









Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as storage-backend

2013-05-24 Thread Oliver Francke
Hi Alexandre,

On 24.05.2013 at 17:37, Alexandre DERUMIER aderum...@odiso.com wrote:

 Hi,
 
 For Proxmox, we have made some patches to split the savevm process,
 
 to be able to save the memory to an external volume. (and not the current 
 volume).
 
 For rbd, we create a new rbd volume to store the memory.
 
 qemu patch is here :
 https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/internal-snapshot-async.patch;h=c67a97ea497fe31ff449acb79e04dc1c53b25578;hb=HEAD
 
 - Original message -
 

wow, sounds very interesting, being on the road for the next 3 days I will have 
a closer look next week.

Thnx n regards,

Oliver.

 From: Wido den Hollander w...@42on.com
 To: Oliver Francke oliver.fran...@filoo.de
 Cc: ceph-devel@vger.kernel.org
 Sent: Friday, 24 May 2013 17:08:35
 Subject: Re: qemu-1.5.0 savevm error -95 while writing vm with ceph-rbd as
 storage-backend
 
 On 05/24/2013 09:46 AM, Oliver Francke wrote:
 Hi,
 
 with a running VM I encounter this strange behaviour; earlier
 qemu versions don't show such an error.
 Perhaps this comes from the rbd backend in qemu-1.5.0 in combination
 with ceph-0.56.6? Therefore my crosspost.
 
 Even if I have no real live-snapshot available - they know of this
 restriction - it's more work for the customers to perform a shutdown
 before they wanna do some changes to their VM ;)
 
 
 Doesn't Qemu try to save the memory state to RBD here as well? That
 doesn't work and fails.
 
 Any hints welcome,
 
 Oliver.
 
 
 


OSD memory leak when scrubbing [0.56.6]

2013-05-21 Thread Oliver Francke

Well,

subject seems familiar, version was 0.48.3 in the last mail.

Some more of the story: before the successful upgrade to the latest bobtail,
everything with regards to scrubbing was disabled.

That is via:
ceph osd tell \* injectargs '--osd-max-scrubs 0'

We are running fine now since the 9th of May. Fine means, though, that we
have not run any scrubbing for ages.
This morning I re-started scrubbing. After a couple of hours I detected
the first OSDs eating up memory.
The top scorer was running with 23 GiB RSS. After stopping scrubbing again
there was no regain of memory.
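
In case others want to compare, this is roughly how I watch the daemons - a
small sketch, assuming the OSDs run on the local node and standard process
names:

 # resident set size of all ceph-osd processes, largest first
 ps -C ceph-osd -o pid,rss,cmd --sort=-rss

 # on tcmalloc builds the daemon can also report its own heap usage
 ceph tell osd.0 heap stats    # replace 0 with the affected osd id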


Is anyone else, perhaps with large pgs, experiencing such behaviour?

Any advice on how to proceed?

Thnx in advance,

Oliver.



Re: OSD memory leak when scrubbing [0.56.6]

2013-05-21 Thread Oliver Francke

Uhm,

to be most correct... there was a follow-up even with version 0.56 ;)

On 05/21/2013 05:24 PM, Oliver Francke wrote:

Well,

subject seems familiar, version was 0.48.3 in the last mail.

Some more of the story: before the successful upgrade to the latest bobtail,
everything with regards to scrubbing was disabled.

That is via:
ceph osd tell \* injectargs '--osd-max-scrubs 0'

We are running fine now since the 9th of May. Fine means, though, that we
have not run any scrubbing for ages.
This morning I re-started scrubbing. After a couple of hours I detected
the first OSDs eating up memory.
The top scorer was running with 23 GiB RSS. After stopping scrubbing again
there was no regain of memory.


Is anyone else, perhaps with large pgs, experiencing such behaviour?

Any advice on how to proceed?

Thnx in advance,

Oliver.






Re: OSD memory leak when scrubbing [0.56.6]

2013-05-21 Thread Oliver Francke

Right,

On 05/21/2013 05:35 PM, Sylvain Munaut wrote:

Hi,


subject seems familiar, version was 0.48.3 in the last mail.

Not anyone else with perhaps large pg's experiencing such behaviour?

Any advice on how to proceed?

I had the same behavior in both argonaut and bobtail, raising sharply
~ 100M or so at each scrub (every 24h).

It's now resolved in cuttlefish AFAICT.

However given the mon leveldb issues I'm having with cuttlefish, I'm
not sure I'd recommend upgrading ...


Not really at this moment, because we have an otherwise stable and fast
cluster for our VMs ;)


Oliver.



Cheers,

 Sylvain





Re: OSD memory leak when scrubbing [0.56.6]

2013-05-21 Thread Oliver Francke
Well,

On 21.05.2013 at 21:31, Sage Weil s...@inktank.com wrote:

 On Tue, 21 May 2013, Stefan Priebe wrote:
 On 21.05.2013 at 17:44, Sage Weil wrote:
 On Tue, 21 May 2013, Stefan Priebe - Profihost AG wrote:
 On 21.05.2013 at 17:35, Sylvain Munaut
 s.mun...@whatever-company.com wrote:
 
 Hi,
 
 subject seems familiar, version was 0.48.3 in the last mail.
 
 Not anyone else with perhaps large pg's experiencing such behaviour?
 
 Any advice on how to proceed?
 
 I had the same behavior in both argonaut and bobtail, raising sharply
 ~ 100M or so at each scrub (every 24h).
 
 It's now resolved in cuttlefish AFAICT.
 
 However given the mon leveldb issues I'm having with cuttlefish, I'm
 not sure I'd recommend upgrading ...
 
 I thought all mon leveldb issues were solved?
 
 Not quite.  I'm now able to reproduce the leveldb growth from Mike
 Dawson's trace (thanks!) but we don't have a fix yet.
 
 Oh OK. Is there a tracker id?
 
 http://tracker.ceph.com/issues/4895
 

true for this issue, but back to topic: we are still not able to safely
(deep-)scrub the whole cluster with 0.56.6.

 
 I thought the scrub memory issues were addressed by
 f80f64cf024bd7519d5a1fb2a5698db97a003ce8 in 0.56.4... :(
 

any advice very welcome; about 1/3 of the cluster is safe in the sense that
we have already deep-scrubbed it.

Best regards,

Oliver.

 sage
 
 
 


Re: Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing

2013-03-26 Thread Oliver Francke

Hi Josh,

thanks for the quick response and...

On 03/26/2013 09:30 AM, Josh Durgin wrote:

On 03/25/2013 03:04 AM, Oliver Francke wrote:

Hi Josh,

logfile is attached...


Thanks. It shows nothing out of the ordinary, but I just reproduced the
incorrect rollback locally, so it shouldn't be hard to track down from
here.

I opened http://tracker.ceph.com/issues/4551 to track it.


the good news.

Oliver.



Josh


On 03/22/2013 08:30 PM, Josh Durgin wrote:

On 03/22/2013 12:09 PM, Oliver Francke wrote:

Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but
perhaps there are some common things.

Today I installed a fresh cluster with mkcephfs, went fine, imported a
master debian 6.0 image with format 2, made a snapshot, protected
it, and made some clones.
Clones mounted with qemu-nbd, fiddled a bit with
IP/interfaces/hosts/net.rules…etc and cleanly unmounted, VM started,
took 2 secs and the VM was up n running. Cool.

Now an ordinary shutdown was performed, made a snapshot of this
image. Started again, did some apt-get update… install s/t….
Shutdown -> rbd rollback -> startup again -> login -> install s/t
else… filesystem showed many ext3-errors, fell into read-only mode,
massive corruption.


This sounds like it might be a bug in rollback. Could you try cloning
and snapshotting again, but export the image before booting, and after
rolling back, and compare the md5sums?


Done, first MD5-mismatch after 32 4MB blocks, checked with dd and a bs
of 4MB.
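
For completeness, the check was roughly the following - a sketch with
placeholder names, not my literal shell history:

 # export both states to local files (image/snapshot names are placeholders)
 rbd export rbd/testclone@before-boot before.img
 rbd export rbd/testclone after-rollback.img

 # compare 4 MB blocks until the first mismatch (1024 blocks = 4 GB test image)
 for i in $(seq 0 1023); do
 a=$(dd if=before.img bs=4M skip=$i count=1 2>/dev/null | md5sum)
 b=$(dd if=after-rollback.img bs=4M skip=$i count=1 2>/dev/null | md5sum)
 [ "$a" != "$b" ] && echo "first mismatch at block $i" && break
 done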



Running the rollback with:

--debug-ms 1 --debug-rbd 20 --log-file rbd-rollback.log

might help too. Does your ceph.conf where you ran the rollback have
anything related to rbd_cache in it?


No cache settings in global ceph.conf.

Hope it helps,

Oliver.




qemu config was with :rbd_cache=false if it matters. The above scenario
is reproducible, and as I pointed out, no crash was detected.

Perhaps it is in the same area as in the crash-thread, otherwise I
will provide logfiles as needed.


It's unrelated, the other thread is an issue with the cache, which does
not cause corruption but triggers a crash.

Josh










Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing

2013-03-22 Thread Oliver Francke
Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but perhaps 
there are some common things.

Today I installed a fresh cluster with mkcephfs, went fine, imported a master
debian 6.0 image with format 2, made a snapshot, protected it, and made some 
clones.
Clones mounted with qemu-nbd, fiddled a bit with 
IP/interfaces/hosts/net.rules…etc and cleanly unmounted, VM started, took 2 
secs and the VM was up n running. Cool.

Now an ordinary shutdown was performed, made a snapshot of this image. Started 
again, did some apt-get update… install s/t….
Shutdown -> rbd rollback -> startup again -> login -> install s/t else…
filesystem showed many ext3-errors, fell into read-only mode, massive
corruption.
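
For the record, the rbd side of the above was roughly the following - a
sketch with placeholder names, written from memory rather than copied from my
shell history:

 # import the master image and prepare it for cloning
 rbd import --image-format 2 debian-6.0.img rbd/master   # older rbd: --format 2
 rbd snap create rbd/master@base
 rbd snap protect rbd/master@base
 rbd clone rbd/master@base rbd/testclone

 # later: snapshot the clone, then roll back after the apt-get run
 rbd snap create rbd/testclone@before-upgrade
 rbd snap rollback rbd/testclone@before-upgrade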

qemu config was with :rbd_cache=false if it matters. The above scenario is
reproducible, and as I pointed out, no crash was detected.

Perhaps it is in the same area as in the crash-thread, otherwise I will provide 
logfiles as needed.

Kind regards,

Oliver.





Re: A couple of OSD-crashes after serious network trouble

2012-12-13 Thread Oliver Francke

Hi Sam,

On 12/13/2012 05:15 AM, Samuel Just wrote:

Apologies, I missed your reply on Monday.  Any attempt to read or


no prob ;) We are busy, too, with preparing new nodes and switching to 10GE
this evening.



write the object will hit the file on the primary (the smaller one
with the newer syslog entries).  If you take down both OSDs (12 and
40) while performing the repair, the vm in question will hang if it
tries to access that block, but should recover when you bring the OSDs
back up.  To expand on the response Sage posted, writes/reads to
that block have been hitting the primary (osd.12) which unfortunately
is the incorrect file.  I would, however, have expected that those
writes would have been replicated to the larger file on osd.40 as
well.  Are you certain that the newer syslog entries on 12 aren't also
present on 40?


well... time heals... I re-checked right now and both files are md5-wise 
identical?!

Not checked the other 5 inconsistencies.
Still having three headers missing and 6 OSD's not checked with scrub, 
though.


Will be back... for sure ;)

Thnx for now,

Oliver.



-Sam

On Tue, Dec 11, 2012 at 11:38 AM, Oliver Francke
oliver.fran...@filoo.de wrote:

Hi Sage,

On 11.12.2012 at 18:04, Sage Weil s...@inktank.com wrote:


On Tue, 11 Dec 2012, Oliver Francke wrote:

Hi Sam,

perhaps you have overlooked my comments further down, beginning with
been there ? ;)

We're pretty swamped with bobtail stuff at the moment, so ceph-devel
inquiries are low on the priority list right now.


100% agree, this thing here is best effort right now, true.


See below:


If so, please have a look, cause I'm clueless 8-)

On 12/10/2012 11:48 AM, Oliver Francke wrote:

Hi Sam,

helpful input.. and... not so...

On 12/07/2012 10:18 PM, Samuel Just wrote:

Ah... unfortunately doing a repair in these 6 cases would probably
result in the wrong object surviving.  It should work, but it might
corrupt the rbd image contents.  If the images are expendable, you
could repair and then delete the images.

The red flag here is that the known size is smaller than the other
size.  This indicates that it most likely chose the wrong file as the
correct one since rbd image blocks usually get bigger over time.  To
fix this, you will need to manually copy the file for the larger of
the two object replicas to replace the smaller of the two object
replicas.

For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65
in pg 65.10:
1) Find the object on the primary and the replica (from above, primary
is 12 and replica is 40).  You can use find in the primary and replica
current/65.10_head directories to look for a file matching
*rb.0.47d9b.1014b7b4.02df*).  The file name should be
'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65' I think.
2) Stop the primary and replica osds
3) Compare the file sizes for the two files -- you should find that
the file sizes do not match.
4) Replace the smaller file with the larger one (you'll probably want
to keep a copy of the smaller one around just in case).
5) Restart the osds and scrub pg 65.10 -- the pg should come up clean
(possibly with a relatively harmless stat mismatch)

been there. on OSD.12 it's
-rw-r--r-- 1 root root 699904 Dec  9 06:25
rb.0.47d9b.1014b7b4.02df__head_87C96F10__41

on OSD.40:
-rw-r--r-- 1 root root 4194304 Dec  9 06:25
rb.0.47d9b.1014b7b4.02df__head_87C96F10__41

going by a short glance into the file, there are some readable
syslog-entries, in both files.
For the bad luck in this example, the shorter file contains the more current
entries?!

It sounds like the larger one was at one point correct, but since they got
out of sync an update was applied to the other.  What fs is this (inside
the VM)?  If we're lucky the whole block is file data, in which case I
would extend the small one with the more recent data out to the full size by taking
the last chunk of the second one.  (Or, if the bytes look like an
unimportant file, just use truncate(1) to extend it, and get zeros for
that region.)  Make backups of the object first, and fsck inside the VM
afterwards.
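
Concretely, and only as a sketch (offsets taken from the sizes above, both
object files copied to one place, backups made first):

 # keep copies of both replicas before touching anything
 cp small_replica small_replica.bak ; cp large_replica large_replica.bak

 # option 1: pad the smaller file with zeros up to the full 4 MB object size
 truncate -s 4194304 small_replica

 # option 2: append the tail of the larger file instead of zeros
 # (699904 = 1367 * 512, so copy from that offset onwards)
 dd if=large_replica of=small_replica bs=512 skip=1367 seek=1367 conv=notrunc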

--

We've seen this issue bite twice now, both times on argonaut.  So far
nobody using anything more recent... but that is a smaller pool of people,
so no real comfort there.  Working on setting up a higher-stress long-term
testing cluster to trigger this.

Can you remind me what kernel version you are using?

one of the affected nodes is driven by 3.5.4, the newer ones are nowadays
Ubuntu 12.04.1 LTS with a self-compiled 3.6.6.
Inside the VM's you can imagine all flavors, less forgiving CentOS 5.8, some 
debian5.0 ( ext3)… mostly ext3, I think. Not optimum, at least.

Couple of problems caused by slow requests, I can see in some log-files customers 
pressing the RESET button, implemented via qemu-monitor.
Destructive as can be, with having some megs of cache with the rbd-device.

Thnx n regards,

Oliver.


sage



What exactly happens, if I try to copy or export the file? Which

Re: A couple of OSD-crashes after serious network trouble

2012-12-11 Thread Oliver Francke

Hi Sam,

perhaps you have overlooked my comments further down, beginning with
"been there"? ;)

If so, please have a look, cause I'm clueless 8-)

On 12/10/2012 11:48 AM, Oliver Francke wrote:

Hi Sam,

helpful input.. and... not so...

On 12/07/2012 10:18 PM, Samuel Just wrote:

Ah... unfortunately doing a repair in these 6 cases would probably
result in the wrong object surviving.  It should work, but it might
corrupt the rbd image contents.  If the images are expendable, you
could repair and then delete the images.

The red flag here is that the known size is smaller than the other
size.  This indicates that it most likely chose the wrong file as the
correct one since rbd image blocks usually get bigger over time.  To
fix this, you will need to manually copy the file for the larger of
the two object replicas to replace the smaller of the two object
replicas.

For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65
in pg 65.10:
1) Find the object on the primary and the replica (from above, primary
is 12 and replica is 40).  You can use find in the primary and replica
current/65.10_head directories to look for a file matching
*rb.0.47d9b.1014b7b4.02df*).  The file name should be
'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65' I think.
2) Stop the primary and replica osds
3) Compare the file sizes for the two files -- you should find that
the file sizes do not match.
4) Replace the smaller file with the larger one (you'll probably want
to keep a copy of the smaller one around just in case).
5) Restart the osds and scrub pg 65.10 -- the pg should come up clean
(possibly with a relatively harmless stat mismatch)


been there. on OSD.12 it's
-rw-r--r-- 1 root root 699904 Dec  9 06:25 
rb.0.47d9b.1014b7b4.02df__head_87C96F10__41


on OSD.40:
-rw-r--r-- 1 root root 4194304 Dec  9 06:25 
rb.0.47d9b.1014b7b4.02df__head_87C96F10__41


going by a short glance into the file, there are some readable 
syslog-entries, in both files.
For the bad luck in this example, the shorter file contains the more 
current entries?!


What exactly happens, if I try to copy or export the file? Which block 
will be chosen?

VM is running as I'm writing, so flexibility reduced.

Regards,

Oliver.


If this worked out correctly, you can repeat for the other 5 cases.

Let me know if you have any questions.
-Sam

On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke 
oliver.fran...@filoo.de wrote:

Hi Sam,

On 07.12.2012 at 19:37, Samuel Just sam.j...@inktank.com wrote:


That is very likely to be one of the merge_log bugs fixed between 0.48
and 0.55.  I could confirm with a stacktrace from gdb with line
numbers or the remainder of the logging dumped when the daemon
crashed.

My understanding of your situation is that currently all pgs are
active+clean but you are missing some rbd image headers and some rbd
images appear to be corrupted.  Is that accurate?
-Sam


thnx for dropping in.

Uhm almost correct, there are now 6 pg in state inconsistent:

HEALTH_WARN 6 pgs inconsistent
pg 65.da is active+clean+inconsistent, acting [1,33]
pg 65.d7 is active+clean+inconsistent, acting [13,42]
pg 65.10 is active+clean+inconsistent, acting [12,40]
pg 65.f is active+clean+inconsistent, acting [13,31]
pg 65.75 is active+clean+inconsistent, acting [1,33]
pg 65.6a is active+clean+inconsistent, acting [13,31]

I know which images are affected, but does a repair help?

0 log [ERR] : 65.10 osd.40: soid 
87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 size 4194304 != 
known size 699904
0 log [ERR] : 65.6a osd.31: soid 
19a2526a/rb.0.2dcf2.1da2a31e.0737/head//65 size 4191744 != 
known size 2757632
0 log [ERR] : 65.75 osd.33: soid 
20550575/rb.0.2d520.5c17a6e3.0339/head//65 size 4194304 != 
known size 1238016
0 log [ERR] : 65.d7 osd.42: soid 
fa3a5d7/rb.0.2c2a8.12ec359d.205c/head//65 size 4194304 != 
known size 1382912
0 log [ERR] : 65.da osd.33: soid 
c2a344da/rb.0.2be17.cb4bd69.0081/head//65 size 4191744 != 
known size 1815552
0 log [ERR] : 65.f osd.31: soid 
e8d2430f/rb.0.2d1e9.1339c5dd.0c41/head//65 size 2424832 != 
known size 2331648


or make things worse?

I could only check 14 out of 20 OSD's so far, cause from two older 
nodes a scrub leads to slow-requests…  couple of minutes, so VM's 
got stalled… customers pressing the reset-button, so losing caches…


Comments welcome,

Oliver.

On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke 
oliver.fran...@filoo.de wrote:

Hi,

is the following a known one, too? Would be good to get it out 
of my head:



/var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() 
[0x706c59]

/var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0]
/var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) 
[0x7f7f2f35f1b5]
/var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) 
[0x7f7f2f361fc0]

/var/log/ceph/ceph-osd.40.log.1.gz: 5:
(__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5]
/var/log/ceph/ceph-osd.40.log.1.gz: 6

Re: A couple of OSD-crashes after serious network trouble

2012-12-11 Thread Oliver Francke
Hi Sage,

On 11.12.2012 at 18:04, Sage Weil s...@inktank.com wrote:

 On Tue, 11 Dec 2012, Oliver Francke wrote:
 Hi Sam,
 
 perhaps you have overlooked my comments further down, beginning with
 been there ? ;)
 
 We're pretty swamped with bobtail stuff at the moment, so ceph-devel 
 inquiries are low on the priority list right now.
 

100% agree, this thing here is best effort right now, true.

 See below:
 
 
 If so, please have a look, cause I'm clueless 8-)
 
 On 12/10/2012 11:48 AM, Oliver Francke wrote:
 Hi Sam,
 
 helpful input.. and... not so...
 
 On 12/07/2012 10:18 PM, Samuel Just wrote:
 Ah... unfortunately doing a repair in these 6 cases would probably
 result in the wrong object surviving.  It should work, but it might
 corrupt the rbd image contents.  If the images are expendable, you
 could repair and then delete the images.
 
 The red flag here is that the known size is smaller than the other
 size.  This indicates that it most likely chose the wrong file as the
 correct one since rbd image blocks usually get bigger over time.  To
 fix this, you will need to manually copy the file for the larger of
 the two object replicas to replace the smaller of the two object
 replicas.
 
 For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65
 in pg 65.10:
 1) Find the object on the primary and the replica (from above, primary
 is 12 and replica is 40).  You can use find in the primary and replica
 current/65.10_head directories to look for a file matching
 *rb.0.47d9b.1014b7b4.02df*).  The file name should be
 'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65' I think.
 2) Stop the primary and replica osds
 3) Compare the file sizes for the two files -- you should find that
 the file sizes do not match.
 4) Replace the smaller file with the larger one (you'll probably want
 to keep a copy of the smaller one around just in case).
 5) Restart the osds and scrub pg 65.10 -- the pg should come up clean
 (possibly with a relatively harmless stat mismatch)
 
 been there. on OSD.12 it's
 -rw-r--r-- 1 root root 699904 Dec  9 06:25
 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41
 
 on OSD.40:
 -rw-r--r-- 1 root root 4194304 Dec  9 06:25
 rb.0.47d9b.1014b7b4.02df__head_87C96F10__41
 
 going by a short glance into the file, there are some readable
 syslog-entries, in both files.
 For the bad luck in this example, the shorter file contains the more current
 entries?!
 
 It sounds like the larger one was at one point correct, but since they got 
 out of sync an update was applied to the other.  What fs is this (inside 
 the VM)?  If we're lucky the whole block is file data, in which case I 
 would extend the small one with the more recent data out to the full size by taking 
 the last chunk of the second one.  (Or, if the bytes look like an 
 unimportant file, just use truncate(1) to extend it, and get zeros for 
 that region.)  Make backups of the object first, and fsck inside the VM 
 afterwards.
 
 --
 
 We've seen this issue bite twice now, both times on argonaut.  So far 
 nobody using anything more recent... but that is a smaller pool of people, 
 so no real comfort there.  Working on setting up a higher-stress long-term 
 testing cluster to trigger this.
 
 Can you remind me what kernel version you are using?

one of the affected nodes is driven by 3.5.4, the newer ones are nowadays
Ubuntu 12.04.1 LTS with a self-compiled 3.6.6.
Inside the VM's you can imagine all flavors, less forgiving CentOS 5.8, some 
debian5.0 ( ext3)… mostly ext3, I think. Not optimum, at least.

Couple of problems caused by slow requests, I can see in some log-files 
customers pressing the RESET button, implemented via qemu-monitor.
Destructive as can be, with having some megs of cache with the rbd-device.

Thnx n regards,

Oliver.

 
 sage
 
 
 
 What exactly happens, if I try to copy or export the file? Which block will
 be chosen?
 VM is running as I'm writing, so flexibility reduced.
 
 Regards,
 
 Oliver.
 
 If this worked out correctly, you can repeat for the other 5 cases.
 
 Let me know if you have any questions.
 -Sam
 
 On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke oliver.fran...@filoo.de
 wrote:
 Hi Sam,
 
 On 07.12.2012 at 19:37, Samuel Just sam.j...@inktank.com wrote:
 
 That is very likely to be one of the merge_log bugs fixed between 0.48
 and 0.55.  I could confirm with a stacktrace from gdb with line
 numbers or the remainder of the logging dumped when the daemon
 crashed.
 
 My understanding of your situation is that currently all pgs are
 active+clean but you are missing some rbd image headers and some rbd
 images appear to be corrupted.  Is that accurate?
 -Sam
 
 thnx for dropping in.
 
 Uhm almost correct, there are now 6 pg in state inconsistent:
 
 HEALTH_WARN 6 pgs inconsistent
 pg 65.da is active+clean+inconsistent, acting [1,33]
 pg 65.d7 is active+clean+inconsistent, acting [13,42]
 pg 65.10 is active+clean+inconsistent, acting [12,40]
 pg 65.f is active+clean

Re: A couple of OSD-crashes after serious network trouble

2012-12-10 Thread Oliver Francke

Hi Sam,

helpful input.. and... not so...

On 12/07/2012 10:18 PM, Samuel Just wrote:

Ah... unfortunately doing a repair in these 6 cases would probably
result in the wrong object surviving.  It should work, but it might
corrupt the rbd image contents.  If the images are expendable, you
could repair and then delete the images.

The red flag here is that the known size is smaller than the other
size.  This indicates that it most likely chose the wrong file as the
correct one since rbd image blocks usually get bigger over time.  To
fix this, you will need to manually copy the file for the larger of
the two object replicas to replace the smaller of the two object
replicas.

For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.02df/head//65
in pg 65.10:
1) Find the object on the primary and the replica (from above, primary
is 12 and replica is 40).  You can use find in the primary and replica
current/65.10_head directories to look for a file matching
*rb.0.47d9b.1014b7b4.02df*).  The file name should be
'rb.0.47d9b.1014b7b4.02df__head_87C96F10__65' I think.
2) Stop the primary and replica osds
3) Compare the file sizes for the two files -- you should find that
the file sizes do not match.
4) Replace the smaller file with the larger one (you'll probably want
to keep a copy of the smaller one around just in case).
5) Restart the osds and scrub pg 65.10 -- the pg should come up clean
(possibly with a relatively harmless stat mismatch)
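
Spelled out as commands, a sketch only - assuming the default data path
/var/lib/ceph/osd/ceph-<id> and sysvinit scripts, and running each command on
the host that carries the respective OSD; adjust to your deployment:

 # 1) locate the object on the primary (osd.12) and the replica (osd.40)
 find /var/lib/ceph/osd/ceph-12/current/65.10_head -name '*rb.0.47d9b.1014b7b4.02df*'
 find /var/lib/ceph/osd/ceph-40/current/65.10_head -name '*rb.0.47d9b.1014b7b4.02df*'
 # 2) stop both OSDs
 /etc/init.d/ceph stop osd.12 ; /etc/init.d/ceph stop osd.40
 # 3)+4) compare sizes, back up the smaller file, then overwrite it with the larger one
 ls -l $SMALL $LARGE ; cp $SMALL $SMALL.bak ; cp $LARGE $SMALL
 # 5) restart the OSDs and re-scrub the pg
 /etc/init.d/ceph start osd.12 ; /etc/init.d/ceph start osd.40
 ceph pg scrub 65.10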


been there. on OSD.12 it's
-rw-r--r-- 1 root root 699904 Dec  9 06:25 
rb.0.47d9b.1014b7b4.02df__head_87C96F10__41


on OSD.40:
-rw-r--r-- 1 root root 4194304 Dec  9 06:25 
rb.0.47d9b.1014b7b4.02df__head_87C96F10__41


going by a short glance into the file, there are some readable 
syslog-entries, in both files.
For the bad luck in this example, the shorter file contains the more 
current entries?!


What exactly happens, if I try to copy or export the file? Which block 
will be chosen?

VM is running as I'm writing, so flexibility reduced.

Regards,

Oliver.


If this worked out correctly, you can repeat for the other 5 cases.

Let me know if you have any questions.
-Sam

On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke oliver.fran...@filoo.de wrote:

Hi Sam,

On 07.12.2012 at 19:37, Samuel Just sam.j...@inktank.com wrote:


That is very likely to be one of the merge_log bugs fixed between 0.48
and 0.55.  I could confirm with a stacktrace from gdb with line
numbers or the remainder of the logging dumped when the daemon
crashed.

My understanding of your situation is that currently all pgs are
active+clean but you are missing some rbd image headers and some rbd
images appear to be corrupted.  Is that accurate?
-Sam


thnx for dropping in.

Uhm almost correct, there are now 6 pg in state inconsistent:

HEALTH_WARN 6 pgs inconsistent
pg 65.da is active+clean+inconsistent, acting [1,33]
pg 65.d7 is active+clean+inconsistent, acting [13,42]
pg 65.10 is active+clean+inconsistent, acting [12,40]
pg 65.f is active+clean+inconsistent, acting [13,31]
pg 65.75 is active+clean+inconsistent, acting [1,33]
pg 65.6a is active+clean+inconsistent, acting [13,31]

I know which images are affected, but does a repair help?

0 log [ERR] : 65.10 osd.40: soid 
87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 size 4194304 != known size 
699904
0 log [ERR] : 65.6a osd.31: soid 
19a2526a/rb.0.2dcf2.1da2a31e.0737/head//65 size 4191744 != known size 
2757632
0 log [ERR] : 65.75 osd.33: soid 
20550575/rb.0.2d520.5c17a6e3.0339/head//65 size 4194304 != known size 
1238016
0 log [ERR] : 65.d7 osd.42: soid 
fa3a5d7/rb.0.2c2a8.12ec359d.205c/head//65 size 4194304 != known size 
1382912
0 log [ERR] : 65.da osd.33: soid 
c2a344da/rb.0.2be17.cb4bd69.0081/head//65 size 4191744 != known size 
1815552
0 log [ERR] : 65.f osd.31: soid 
e8d2430f/rb.0.2d1e9.1339c5dd.0c41/head//65 size 2424832 != known size 
2331648

or make things worse?

I could only check 14 out of 20 OSD's so far, cause from two older nodes a scrub leads to 
slow-requests…  couple of minutes, so VM's got stalled… customers pressing the 
reset-button, so losing caches…

Comments welcome,

Oliver.


On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke oliver.fran...@filoo.de wrote:

Hi,

is the following a known one, too? Would be good to get it out of my head:



/var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59]
/var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0]
/var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5]
/var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0]
/var/log/ceph/ceph-osd.40.log.1.gz: 5:
(__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5]
/var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166]
/var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193]
/var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e]
/var/log/ceph/ceph-osd.40.log.1.gz

Re: A couple of OSD-crashes after serious network trouble

2012-12-07 Thread Oliver Francke

Hi,

is the following a known one, too? Would be good to get it out of my head:


/var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59]
/var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0]
/var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5]
/var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0]
/var/log/ceph/ceph-osd.40.log.1.gz: 5: 
(__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5]

/var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166]
/var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193]
/var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e]
/var/log/ceph/ceph-osd.40.log.1.gz: 9: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x793) [0x77e903]
/var/log/ceph/ceph-osd.40.log.1.gz: 10: 
(PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, 
int)+0x1de3) [0x63db93]
/var/log/ceph/ceph-osd.40.log.1.gz: 11: 
(PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec 
const&)+0x2cc) [0x63e00c]
/var/log/ceph/ceph-osd.40.log.1.gz: 12: 
(boost::statechart::simple_state<PG::RecoveryState::Stray, 
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x203) [0x658a63]
/var/log/ceph/ceph-osd.40.log.1.gz: 13: 
(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
PG::RecoveryState::Initial, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base 
const&)+0x6b) [0x650b4b]
/var/log/ceph/ceph-osd.40.log.1.gz: 14: 
(PG::RecoveryState::handle_log(int, MOSDPGLog*, 
PG::RecoveryCtx*)+0x190) [0x60a520]
/var/log/ceph/ceph-osd.40.log.1.gz: 15: 
(OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5c62e6]
/var/log/ceph/ceph-osd.40.log.1.gz: 16: 
(OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5c6f3b]
/var/log/ceph/ceph-osd.40.log.1.gz: 17: 
(OSD::_dispatch(Message*)+0x173) [0x5d1983]
/var/log/ceph/ceph-osd.40.log.1.gz: 18: 
(OSD::ms_dispatch(Message*)+0x184) [0x5d2254]
/var/log/ceph/ceph-osd.40.log.1.gz: 19: 
(SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09]
/var/log/ceph/ceph-osd.40.log.1.gz: 20: 
(SimpleMessenger::dispatch_entry()+0x15) [0x7d5195]
/var/log/ceph/ceph-osd.40.log.1.gz: 21: 
(SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad]

/var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca) [0x7f7f306b88ca]
/var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d) [0x7f7f2f3fc92d]



Thnx for looking,

Oliver.



Re: A couple of OSD-crashes after serious network trouble

2012-12-07 Thread Oliver Francke
Hi Sam,

On 07.12.2012 at 19:37, Samuel Just sam.j...@inktank.com wrote:

 That is very likely to be one of the merge_log bugs fixed between 0.48
 and 0.55.  I could confirm with a stacktrace from gdb with line
 numbers or the remainder of the logging dumped when the daemon
 crashed.
 
 My understanding of your situation is that currently all pgs are
 active+clean but you are missing some rbd image headers and some rbd
 images appear to be corrupted.  Is that accurate?
 -Sam
 

thnx for dropping in.

Uhm almost correct, there are now 6 pg in state inconsistent:

HEALTH_WARN 6 pgs inconsistent
pg 65.da is active+clean+inconsistent, acting [1,33]
pg 65.d7 is active+clean+inconsistent, acting [13,42]
pg 65.10 is active+clean+inconsistent, acting [12,40]
pg 65.f is active+clean+inconsistent, acting [13,31]
pg 65.75 is active+clean+inconsistent, acting [1,33]
pg 65.6a is active+clean+inconsistent, acting [13,31]

I know which images are affected, but does a repair help?

0 log [ERR] : 65.10 osd.40: soid 
87c96f10/rb.0.47d9b.1014b7b4.02df/head//65 size 4194304 != known size 
699904
0 log [ERR] : 65.6a osd.31: soid 
19a2526a/rb.0.2dcf2.1da2a31e.0737/head//65 size 4191744 != known size 
2757632
0 log [ERR] : 65.75 osd.33: soid 
20550575/rb.0.2d520.5c17a6e3.0339/head//65 size 4194304 != known size 
1238016
0 log [ERR] : 65.d7 osd.42: soid 
fa3a5d7/rb.0.2c2a8.12ec359d.205c/head//65 size 4194304 != known size 
1382912
0 log [ERR] : 65.da osd.33: soid 
c2a344da/rb.0.2be17.cb4bd69.0081/head//65 size 4191744 != known size 
1815552
0 log [ERR] : 65.f osd.31: soid 
e8d2430f/rb.0.2d1e9.1339c5dd.0c41/head//65 size 2424832 != known size 
2331648

or make things worse?

I could only check 14 out of 20 OSD's so far, cause from two older nodes a 
scrub leads to slow-requests…  couple of minutes, so VM's got stalled… 
customers pressing the reset-button, so losing caches…

Comments welcome,

Oliver.

 On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke oliver.fran...@filoo.de 
 wrote:
 Hi,
 
 is the following a known one, too? Would be good to get it out of my head:
 
 
 /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59]
 /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0]
 /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5]
 /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0]
 /var/log/ceph/ceph-osd.40.log.1.gz: 5:
 (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5]
 /var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166]
 /var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193]
 /var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e]
 /var/log/ceph/ceph-osd.40.log.1.gz: 9: (ceph::__ceph_assert_fail(char
 const*, char const*, int, char const*)+0x793) [0x77e903]
 /var/log/ceph/ceph-osd.40.log.1.gz: 10:
 (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
 int)+0x1de3) [0x63db93]
 /var/log/ceph/ceph-osd.40.log.1.gz: 11:
 (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec const&)+0x2cc)
 [0x63e00c]
 /var/log/ceph/ceph-osd.40.log.1.gz: 12:
 (boost::statechart::simple_state<PG::RecoveryState::Stray,
 PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
 (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
 const&, void const*)+0x203) [0x658a63]
 /var/log/ceph/ceph-osd.40.log.1.gz: 13:
 (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
 PG::RecoveryState::Initial, std::allocator<void>,
 boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
 const&)+0x6b) [0x650b4b]
 /var/log/ceph/ceph-osd.40.log.1.gz: 14:
 (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x190)
 [0x60a520]
 /var/log/ceph/ceph-osd.40.log.1.gz: 15:
 (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5c62e6]
 /var/log/ceph/ceph-osd.40.log.1.gz: 16:
 (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5c6f3b]
 /var/log/ceph/ceph-osd.40.log.1.gz: 17: (OSD::_dispatch(Message*)+0x173)
 [0x5d1983]
 /var/log/ceph/ceph-osd.40.log.1.gz: 18: (OSD::ms_dispatch(Message*)+0x184)
 [0x5d2254]
 /var/log/ceph/ceph-osd.40.log.1.gz: 19:
 (SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09]
 /var/log/ceph/ceph-osd.40.log.1.gz: 20:
 (SimpleMessenger::dispatch_entry()+0x15) [0x7d5195]
 /var/log/ceph/ceph-osd.40.log.1.gz: 21:
 (SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad]
 /var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca) [0x7f7f306b88ca]
 /var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d) [0x7f7f2f3fc92d]
 
 
 Thnx for looking,
 
 
 Oliver.
 

Re: A couple of OSD-crashes after serious network trouble

2012-12-06 Thread Oliver Francke

Hi,

On 12/05/2012 03:54 PM, Sage Weil wrote:

On Wed, 5 Dec 2012, Oliver Francke wrote:

Hi *,

around midnight yesterday we faced some layer-2 network problems. OSD's
started to lose heartbeats and so on. Slow requests... you name it.
So, after all OSD's doing their work, we had in sum around 6 of them crashed,
2 had to be restarted after first start. Should be 8 crashes in total.

The recover_got() crash has definitely been resolved in the recent code.
The others are hard to read since they've been sorted/summed; the full
backtrace is better for identifying the crash.  Do you have those
available?


There is the other pattern:

/var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59]
/var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0]
/var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5]
/var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0]
/var/log/ceph/ceph-osd.40.log.1.gz: 5: 
(__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5]

/var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166]
/var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193]
/var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e]
/var/log/ceph/ceph-osd.40.log.1.gz: 9: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x793) [0x77e903]
/var/log/ceph/ceph-osd.40.log.1.gz: 10: 
(PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, 
int)+0x1de3) [0x63db93]
/var/log/ceph/ceph-osd.40.log.1.gz: 11: 
(PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec 
const&)+0x2cc) [0x63e00c]
/var/log/ceph/ceph-osd.40.log.1.gz: 12: 
(boost::statechart::simple_state<PG::RecoveryState::Stray, 
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x203) [0x658a63]
/var/log/ceph/ceph-osd.40.log.1.gz: 13: 
(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
PG::RecoveryState::Initial, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base 
const&)+0x6b) [0x650b4b]
/var/log/ceph/ceph-osd.40.log.1.gz: 14: 
(PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x190) 
[0x60a520]
/var/log/ceph/ceph-osd.40.log.1.gz: 15: 
(OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5c62e6]
/var/log/ceph/ceph-osd.40.log.1.gz: 16: 
(OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5c6f3b]
/var/log/ceph/ceph-osd.40.log.1.gz: 17: (OSD::_dispatch(Message*)+0x173) 
[0x5d1983]
/var/log/ceph/ceph-osd.40.log.1.gz: 18: 
(OSD::ms_dispatch(Message*)+0x184) [0x5d2254]
/var/log/ceph/ceph-osd.40.log.1.gz: 19: 
(SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09]
/var/log/ceph/ceph-osd.40.log.1.gz: 20: 
(SimpleMessenger::dispatch_entry()+0x15) [0x7d5195]
/var/log/ceph/ceph-osd.40.log.1.gz: 21: 
(SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad]

/var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca) [0x7f7f306b88ca]
/var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d) [0x7f7f2f3fc92d]

State at the end of the day: active+clean;

Unfortunately... after some scrubbing today, we see again 
inconsistencies... *sigh*


End-of-year syndrome?

Tried to get onto one OSD which crashed yesterday and fired off a
ceph osd scrub 0.

And then ceph osd repair 0.

2012-12-06 16:46:29.818551 7f49f1923700  0 log [ERR] : 65.ad repair stat 
mismatch, got 4204/4205 objects, 0/0 clones, 16466529280/16470149632 bytes.
2012-12-06 16:46:29.818734 7f49f1923700  0 log [ERR] : 65.ad repair 1 
errors, 1 fixed
2012-12-06 16:46:30.104722 7f49f2124700  0 log [ERR] : 65.23 repair stat 
mismatch, got 4258/4259 objects, 0/0 clones, 16686233712/16690428016 bytes.
2012-12-06 16:46:30.104890 7f49f2124700  0 log [ERR] : 65.23 repair 1 
errors, 1 fixed
2012-12-06 16:51:26.973407 7f49f2124700  0 log [ERR] : 6.1 osd.31: soid 
bafe2559/rb.0.1adf5.6733efe2.07ce/head//6 size 4194304 != known 
size 3046912
2012-12-06 16:51:26.973426 7f49f2124700  0 log [ERR] : 6.1 repair 0 
missing, 1 inconsistent objects
2012-12-06 16:51:26.981234 7f49f2124700  0 log [ERR] : 6.1 repair stat 
mismatch, got 2153/2154 objects, 0/0 clones, 7013002752/7017197056 bytes.
2012-12-06 16:51:26.981402 7f49f2124700  0 log [ERR] : 6.1 repair 1 
errors, 1 fixed


Um... is it repaired? Really? Is everything cool now for OSD.0?
Additionally there are - again - half a dozen headers missing. If the
corresponding VMs are stopped now, they will not restart, of course.


First tickets are being raised by customers seeing something like filesystem
errors... mounted read-only... on the console, and that kind of crap...
again.


Well then, should one now do a ceph osd repair \* ? Fix the headers? Is
there a best practice?
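For what it's worth, a minimal per-PG sketch instead of whole-OSD repairs, using pg 6.1 from the log above and assuming ceph pg scrub/repair behave as documented (repair only copies from the primary, so it may paper over the real cause):

ceph pg dump | grep inconsistent   # list PGs currently flagged inconsistent
ceph pg scrub 6.1                  # re-check a single PG
ceph pg repair 6.1                 # then repair just that PG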

A couple of OSD-crashes after serious network trouble

2012-12-05 Thread Oliver Francke
::shared_ptrOpRequest)+0x32e)
 16 (ReplicatedPG::do_sub_op(std::tr1::shared_ptrOpRequest)+0x3f7)
 12 
(ReplicatedPG::handle_pull_response(std::tr1::shared_ptrOpRequest)+0x4d4)
 16 
(ReplicatedPG::handle_pull_response(std::tr1::shared_ptrOpRequest)+0xb24)

  4 (ReplicatedPG::handle_push(std::tr1::shared_ptrOpRequest)+0x263)
 32 (ReplicatedPG::recover_got(hobject_t,
 32 (ReplicatedPG::submit_push_complete(ObjectRecoveryInfo,
 12 (ReplicatedPG::sub_op_push(std::tr1::shared_ptrOpRequest)+0x98)
 16 (ReplicatedPG::sub_op_push(std::tr1::shared_ptrOpRequest)+0xa2)
  4 (ReplicatedPG::sub_op_push(std::tr1::shared_ptrOpRequest)+0xf3)
  4 (SimpleMessenger::dispatch_entry()+0x15)
  4 (SimpleMessenger::DispatchQueue::entry()+0x5e9)
  4 (SimpleMessenger::DispatchThread::entry()+0xd)
 16 (ThreadPool::worker()+0x4d5)
 16 (ThreadPool::worker()+0x76f)
 32 (ThreadPool::WorkThread::entry()+0xd)

=== 8- ===

Everything has cleared up so far, so that's some good news ;)

Comments welcome,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Best practice with 0.48.2 to take a node into maintenance

2012-12-03 Thread Oliver Francke
Hi *,

well, even if 0.48.2 is really stable and reliable, that is not always the
case with the Linux kernel. We have a couple of nodes where an update would make
life better.
So, as our OSD nodes have to care for VMs too, it's no problem to let
them drain, i.e. migrate all of the VMs to other nodes.
Just reboot? Perhaps not, because all OSDs will begin to remap/backfill, as they
are instructed to do. Well, declare them as osd lost?
Dangerous. Is there another way I'm missing for doing node maintenance? Will we have
to wait for bobtail for far less hassle with all the remapping and resources?

Thnx for comments,

Oliver.

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Best practice with 0.48.2 to take a node into maintenance

2012-12-03 Thread Oliver Francke
Hi Josh,

On 03.12.2012 at 20:14, Josh Durgin josh.dur...@inktank.com wrote:

 On 12/03/2012 11:05 AM, Oliver Francke wrote:
 Hi *,
 
 well, even if 0.48.2 is really stable and reliable, it is not everytime the 
 case with linux kernel. We have a couple of nodes, where an update would 
 make life better.
 So, as our OSD-nodes have to care for VM's too, it's not the problem to let 
 them drain so migrate all of them to other nodes.
 Just reboot? Perhaps not, cause all OSD's will begin to remap/backfill, they 
 are instructed to do so. Well, declare them as osd lost?
 Dangerous. Is there another way I miss in doing node-maintenance? Will we 
 have to wait for bobtail for far less hassle with all remapping and 
 resources?
 
 By default the monitors won't mark an OSD out in the time it takes to
 reboot, but if maintenance takes longer, you can drain data from the
 node.
 
 A simple way to rate limit it yourself is by slowly lowering the
 weights of the OSDs on the host you want to update, e.g. by 0.1 at a
 time and waiting for recovery to complete before lowering again. Once
 they're at 0 and the cluster is healthy, they're not responsible for
 any data anymore, and the node can be rebooted.
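 A rough sketch of that rate-limited drain, assuming the node carries, say,
 osd.12, and that the crush reweight command below is available in this release:

 ceph osd crush reweight osd.12 0.9   # lower the weight by 0.1
 ceph -w                              # wait for recovery to finish / HEALTH_OK
 ceph osd crush reweight osd.12 0.8   # repeat down to 0, same for every OSD on the node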
 

True. I should have mentioned that I know about the smooth way. But for a planned
reboot this takes way too much time ;)
But if it's recommended, it's recommended ;)

Oliver.

 Josh
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Best practice with 0.48.2 to take a node into maintenance

2012-12-03 Thread Oliver Francke
Hi Florian,

On 03.12.2012 at 20:45, Smart Weblications GmbH - Florian Wiessner
f.wiess...@smart-weblications.de wrote:

 On 03.12.2012 20:21, Oliver Francke wrote:
 Hi Josh,
 
  On 03.12.2012 at 20:14, Josh Durgin josh.dur...@inktank.com wrote:
 
 On 12/03/2012 11:05 AM, Oliver Francke wrote:
 Hi *,
 
 well, even if 0.48.2 is really stable and reliable, it is not everytime 
 the case with linux kernel. We have a couple of nodes, where an update 
 would make life better.
 So, as our OSD-nodes have to care for VM's too, it's not the problem to 
 let them drain so migrate all of them to other nodes.
 Just reboot? Perhaps not, cause all OSD's will begin to remap/backfill, 
 they are instructed to do so. Well, declare them as osd lost?
 Dangerous. Is there another way I miss in doing node-maintenance? Will we 
 have to wait for bobtail for far less hassle with all remapping and 
 resources?
 
 By default the monitors won't mark an OSD out in the time it takes to
 reboot, but if maintenance takes longer, you can drain data from the
 node.
 
 A simple way to rate limit it yourself is by slowly lowering the
 weights of the OSDs on the host you want to update, e.g. by 0.1 at a
 time and waiting for recovery to complete before lowering again. Once
 they're at 0 and the cluster is healthy, they're not responsible for
 any data anymore, and the node can be rebooted.
 
 
 true. Should have mentioned knowing smooth way. But for a planned reboot 
 this take way too much time ;)
 But if it's recommended, it's recommended ;)
 
 
 
 I did rolling reboots of our whole cluster a few days ago (3.4.20). When the
 system reboots and no fsck is done, ceph won't start to backfill in my setup.
 
 I had some nodes do fsck after upgrade so ceph marked the osd as down and
 started to backfill, but once the missing osd was back up running again, the
 backfill stopped and ceph did just a little bit of peering and was healthy in 
 a
 few minutes again (2-5 minutes)…
 

if you factor in all the BIOS, POST and RAID-controller checks, the Linux boot,
the openvswitch STP setup and so on, one can imagine that a reboot takes a
couple of minutes; normally with our setup the cluster will detect the outage
after 30 seconds and start to do its work.
Everything's fine, but perhaps we could avoid the big load in the cluster from
remapping and re-remapping (theme: slow requests), so I have to ask, in terms of
QoS, for a better way ;)
All that stuff had a big customer impact in the past… Time to ask.
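What I would actually hope for, purely as an assumption on my side, is raising the
window in which the monitors keep a down OSD from being marked out, so a slow reboot
never triggers remapping at all, roughly:

[mon]
    mon osd down out interval = 900   # default is 300 seconds, if I read the docs right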

Kind reg's

Oliver.

 

 
 
 
 -- 
 
  With kind regards,
 
 Florian Wiessner
 
 Smart Weblications GmbH
 Martinsberger Str. 1
 D-95119 Naila
 
 fon.: +49 9282 9638 200
 fax.: +49 9282 9638 205
 24/7: +49 900 144 000 00 - 0,99 EUR/Min*
 http://www.smart-weblications.de
 
 --
 Sitz der Gesellschaft: Naila
 Geschäftsführer: Florian Wiessner
 HRB-Nr.: HRB 3840 Amtsgericht Hof
 *aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd STDIN import does not work / wip-rbd-export-stdout

2012-11-26 Thread Oliver Francke

Well...

On 11/26/2012 02:20 PM, Stefan Priebe - Profihost AG wrote:

Hello list,

i know branch wip-rbd-export-stdout is work in progress but it is more 
than useful ;-)


When i try to import an image i get:

# gzip -dc vm-101-disk-1.img.gz | rbd import --format=2 
--size=42949672960 - kvmpool1/vm-101-disk-1

rbd: error reading file: (29) Illegal seek
Importing image: 0% complete...failed.
rbd: import failed: (29) Illegal seek

Anything i've tried wrong?


I would assume that size is already in MiB? That seems to be a slightly too
big value... Haven't tried it myself, though...
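So, assuming --size really is in MiB, something along these lines might work
(42949672960 bytes = 40960 MiB):

gzip -dc vm-101-disk-1.img.gz | rbd import --format=2 --size=40960 - kvmpool1/vm-101-disk-1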


Oliver.



Greets,
Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Ubuntu 12.04.1 + xfs + syncfs is still not our friend

2012-11-06 Thread Oliver Francke

Hi *,

anybody out there who's running Ubuntu 12.04.1 in connection with libc + xfs
+ syncfs?


We bit the bullet and reinstalled two new nodes from debian to precise 
in favour of possible performance increase?!


*sigh*, still getting:

2012-11-06 17:05:51.863921 7f5cc52e3780  0 filestore(/data/osd6-3) mount 
syncfs(2) syscall not support by glibc
2012-11-06 17:05:51.863925 7f5cc52e3780  0 filestore(/data/osd6-3) mount 
no syncfs(2), must use sync(2).


as a show-stopper.

That's with 3.2.* and 3.6.6 kernel. Should be new enough. And 
ceph-0.48.2 ( eu.ceph.com mirror was installing 0.48.1.. though *ops* ;) )


But both should be capable of this system-call?!
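A quick sanity check I would run on such a node, just as a sketch: if I remember
correctly, syncfs(2) needs a kernel >= 2.6.39 and a glibc >= 2.14, and presumably
the ceph binaries must have been built against such a glibc, too:

uname -r                   # kernel version, should be >= 2.6.39
getconf GNU_LIBC_VERSION   # glibc version, should report >= 2.14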

We've been so very near... but are now clueless, so please shed some light in 8-)

( I hope this is not a very obvious human RTFM error ;) )

And while I'm writing: should one enable directio + aio, as we have
the journals on an SSD partition?
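For reference, the knobs I have in mind would go into the [osd] section, something
like this (option names as I understand them, not verified here):

[osd]
    journal dio = true
    journal aio = true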


Kind regards,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ubuntu 12.04.1 + xfs + syncfs is still not our friend

2012-11-06 Thread Oliver Francke
Hi Jens,

sorry for the double work… answered off-list already ;)

Oliver.

On 06.11.2012 at 19:46, Jens Rehpöhler jens.rehpoeh...@filoo.de wrote:

 On 06.11.2012 18:33, Gandalf Corvotempesta wrote:
 2012/11/6 Oliver Francke oliver.fran...@filoo.de:
 2012-11-06 17:05:51.863921 7f5cc52e3780  0 filestore(/data/osd6-3) mount
 syncfs(2) syscall not support by glibc
 2012-11-06 17:05:51.863925 7f5cc52e3780  0 filestore(/data/osd6-3) mount no
 syncfs(2), must use sync(2).
 Could you please try to run ldd --version and test if man syncfs
 is working ?
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Here is the output:
 
 ldd --version
 ldd (Ubuntu EGLIBC 2.15-0ubuntu10.3) 2.15
 
 After installing manpages-dev i get the man page for syncfs (with  man
 syncfs)
 
 Distribution is a normal Ubuntu 12.04.1
 
 -- 
  with kind regards
 
 Jens Rehpöhler
 
 --
 Filoo GmbH
 Moltkestr. 25a
 0 Gütersloh
 HRB4355 AG Gütersloh
 
 Geschäftsführer: S.Grewing | J.Rehpöhler | Dr. C.Kunz
 Telefon: +49 5241 8673012 | Mobil: +49 151 54645798
 Fax: +49 5241 8673020
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Reduce bandwidth for remapping/backfill/recover?

2012-11-03 Thread Oliver Francke
Hi *

anybody out there who can help with an idea for reducing bandwidth when
incorporating 2 new nodes into a cluster?
I know of osd recovery max active = X (default 5), but with 4 OSDs per
node there is enough potential to saturate our backnet (1 Gbit at the moment).
Any other way to not disturb users too much?
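What I am thinking of is something like the following in the [osd] section; note
that 'osd max backfills' only arrives with the 0.53 backfill reservation work, if
I read the release notes right:

[osd]
    osd recovery max active = 1
    osd max backfills = 1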

Thnx in @vance,

Oliver.


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v0.53 released

2012-10-19 Thread Oliver Francke

Hi Josh,

On 10/19/2012 07:42 AM, Josh Durgin wrote:

On 10/17/2012 04:26 AM, Oliver Francke wrote:

Hi Sage, *,

after having some trouble with the journals - had to erase the partition
and redo a ceph... --mkjournal - I started my testing... Everything 
fine.


This would be due to the change in default osd journal size. In 0.53
it's 1024MB, even for block devices. Previously it defaulted to
the entire block device.

I already fixed this to use the entire block device in 0.54, and
didn't realize the fix wasn't included in 0.53.

You can restore the correct behaviour for block devices by setting
this in the [osd] section of your ceph.conf:

osd journal size = 0


thnx for the explanation, gives me a better feeling for the next stable 
to come to the stores ;)
Uhm, may it be impertinent to bring 
http://tracker.newdream.net/issues/2573 to your attention, as it's still 
ongoing at least in 0.48.2argonaut?


Thnx in advance,

Oliver.



Josh



--- 8- ---
2012-10-17 12:54:11.167782 7febab24a780  0 filestore(/data/osd0) mount:
enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and
'filestore btrfs snap' mode is enabled
2012-10-17 12:54:11.191723 7febab24a780  0 journal  kernel version is 
3.5.0

2012-10-17 12:54:11.191907 7febab24a780  1 journal _open /dev/sdb1 fd
27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
2012-10-17 12:54:11.201764 7febab24a780  0 journal  kernel version is 
3.5.0

2012-10-17 12:54:11.201924 7febab24a780  1 journal _open /dev/sdb1 fd
27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
--- 8- ---

And the other minute I started my fairly destructive testing, 0.52 never
ever failed on that. And then a loop started with
--- 8- ---

2012-10-17 12:59:15.403247 7feba5fed700  0 -- 10.0.0.11:6801/29042 
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :57922 pgs=3 cs=1 l=0).fault,
initiating reconnect
2012-10-17 12:59:17.280143 7feb950cc700  0 -- 10.0.0.11:6801/29042 
10.0.0.12:6804/17972 pipe(0x17f2240 sd=29 :49431 pgs=3 cs=1 l=0).fault
with nothing to send, going to standby
2012-10-17 12:59:18.288902 7feb951cd700  0 -- 10.0.0.11:6801/29042 
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :37519 pgs=3 cs=2 l=0).connect
claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
2012-10-17 12:59:18.297663 7feb951cd700  0 -- 10.0.0.11:6801/29042 
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :34833 pgs=3 cs=2 l=0).connect
claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
2012-10-17 12:59:18.303215 7feb951cd700  0 -- 10.0.0.11:6801/29042 
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :35169 pgs=3 cs=2 l=0).connect
claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
--- 8- ---

leading to high CPU-load on node2 ( IP 10.0.0.11). The destructive part
happens on node3 ( IP 10.0.0.12).

Procedure is as always just kill some OSDs and start over again...
Happened now twice, so I would call it reproducable ;)

Kind regards,

Oliver.


On 10/17/2012 01:48 AM, Sage Weil wrote:

Another development release of Ceph is ready, v0.53. We are getting
pretty
close to what will be frozen for the next stable release (bobtail), 
so if

you would like a preview, give this one a go. Notable changes include:

  * librbd: image locking
  * rbd: fix list command when more than 1024 (format 2) images
  * osd: backfill reservation framework (to avoid flooding new osds 
with

backfill data)
  * osd, mon: honor new 'nobackfill' and 'norecover' osdmap flags
  * osd: new 'deep scrub' will compare object content across replicas
(once
per week by default)
  * osd: crush performance improvements
  * osd: some performance improvements related to request queuing
  * osd: capability syntax improvements, bug fixes
  * osd: misc recovery fixes
  * osd: fix memory leak on certain error paths
  * osd: default journal size to 1 GB
  * crush: default root of tree type is now 'root' instead of 'pool' 
(to

avoid confusion wrt rados pools)
  * ceph-fuse: fix handling for .. in root directory
  * librados: some locking fixes
  * mon: some election bug fixes
  * mon: some additional on-disk metadata to facilitate future mon
changes
(post-bobtail)
  * mon: throttle osd flapping based on osd history (limits osdmap
thrashing on overloaded or unhappy clusters)
  * mon: new 'osd crush create-or-move ...' command
  * radosgw: fix copy-object vs attributes
  * radosgw: fix bug in bucket stat updates
  * mds: fix ino release on abort session close, relative getattr
path, mds
shutdown, other misc items
  * upstart: stop jobs on shutdown
  * common: thread pool sizes can now be adjusted at runtime
  * build fixes for Fedora 18, CentOS/RHEL 6

The big items are locking support in RBD, and OSD improvements like 
deep

scrub (which verify object data across replicas) and backfill
reservations
(which limit load on expanding clusters). And a huge swath of bugfixes
and
cleanups, many due to feeding the code through scan.coverity.com (they
offer free static

Re: v0.53 released

2012-10-19 Thread Oliver Francke
Hi Sage,

On 19.10.2012 at 17:48, Sage Weil s...@inktank.com wrote:

 On Fri, 19 Oct 2012, Oliver Francke wrote:
 Hi Josh,
 
 On 10/19/2012 07:42 AM, Josh Durgin wrote:
 On 10/17/2012 04:26 AM, Oliver Francke wrote:
 Hi Sage, *,
 
 after having some trouble with the journals - had to erase the partition
 and redo a ceph... --mkjournal - I started my testing... Everything fine.
 
 This would be due to the change in default osd journal size. In 0.53
 it's 1024MB, even for block devices. Previously it defaulted to
 the entire block device.
 
 I already fixed this to use the entire block device in 0.54, and
 didn't realize the fix wasn't included in 0.53.
 
 You can restore the correct behaviour for block devices by setting
 this in the [osd] section of your ceph.conf:
 
 osd journal size = 0
 
 thnx for the explanation, gives me a better feeling for the next stable to
 come to the stores ;)
 Uhm, may it be impertinant to bring http://tracker.newdream.net/issues/2573 
 to
 your attention, as it's still ongoing at least in 0.48.2argonaut?
 
 Do you mean these messages?
 
 2012-10-11 10:51:25.879084 7f25d08dc700 0 osd.13 1353 pg[6.5( v 
 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 
 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] 
 watch: ctx-obc=0x6381000 cookie=1 oi.version=2301953 
 ctx-at_version=1353'2567563
 2012-10-11 10:51:25.879133 7f25d08dc700 0 osd.13 1353 pg[6.5( v 
 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 
 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] 
 watch: oi.user_version=2301951
 
 They're fixed in master; I'll backport the cleanup to stable.  It's 
 useless noise.
 

uhm, I mean more the following:

Oct 19 15:28:13 fcmsnode1 kernel: [1483536.141269] libceph: osd13 
10.10.10.22:6812 socket closed
Oct 19 15:43:13 fcmsnode1 kernel: [1484435.176280] libceph: osd13 
10.10.10.22:6812 socket closed
Oct 19 15:58:13 fcmsnode1 kernel: [1485334.382798] libceph: osd13 
10.10.10.22:6812 socket closed

It's kind of new, because I would have noticed it before. And we have 4 OSDs on
every node, so why only from one OSD?
Same picture on two other nodes. If I read the ticket correctly, no data is lost, just
a socket being closed? But then a kern.log entry is far too much? ;)

Oliver.


 sage
 
 
 
 
 Thnx in advance,
 
 Oliver.
 
 
 Josh
 
 
 --- 8- ---
 2012-10-17 12:54:11.167782 7febab24a780  0 filestore(/data/osd0) mount:
 enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and
 'filestore btrfs snap' mode is enabled
 2012-10-17 12:54:11.191723 7febab24a780  0 journal  kernel version is
 3.5.0
 2012-10-17 12:54:11.191907 7febab24a780  1 journal _open /dev/sdb1 fd
 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
 2012-10-17 12:54:11.201764 7febab24a780  0 journal  kernel version is
 3.5.0
 2012-10-17 12:54:11.201924 7febab24a780  1 journal _open /dev/sdb1 fd
 27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
 --- 8- ---
 
 And the other minute I started my fairly destructive testing, 0.52 never
 ever failed on that. And then a loop started with
 --- 8- ---
 
 2012-10-17 12:59:15.403247 7feba5fed700  0 -- 10.0.0.11:6801/29042 
 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :57922 pgs=3 cs=1 l=0).fault,
 initiating reconnect
 2012-10-17 12:59:17.280143 7feb950cc700  0 -- 10.0.0.11:6801/29042 
 10.0.0.12:6804/17972 pipe(0x17f2240 sd=29 :49431 pgs=3 cs=1 l=0).fault
 with nothing to send, going to standby
 2012-10-17 12:59:18.288902 7feb951cd700  0 -- 10.0.0.11:6801/29042 
 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :37519 pgs=3 cs=2 l=0).connect
 claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
 2012-10-17 12:59:18.297663 7feb951cd700  0 -- 10.0.0.11:6801/29042 
 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :34833 pgs=3 cs=2 l=0).connect
 claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
 2012-10-17 12:59:18.303215 7feb951cd700  0 -- 10.0.0.11:6801/29042 
 10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :35169 pgs=3 cs=2 l=0).connect
 claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
 --- 8- ---
 
 leading to high CPU-load on node2 ( IP 10.0.0.11). The destructive part
 happens on node3 ( IP 10.0.0.12).
 
 Procedure is as always just kill some OSDs and start over again...
 Happened now twice, so I would call it reproducable ;)
 
 Kind regards,
 
 Oliver.
 
 
 On 10/17/2012 01:48 AM, Sage Weil wrote:
 Another development release of Ceph is ready, v0.53. We are getting
 pretty
 close to what will be frozen for the next stable release (bobtail), so
 if
 you would like a preview, give this one a go. Notable changes include:
 
  * librbd: image locking
  * rbd: fix list command when more than 1024 (format 2) images
  * osd: backfill reservation framework (to avoid flooding new osds with
backfill data)
  * osd, mon: honor new 'nobackfill' and 'norecover' osdmap flags
  * osd: new 'deep scrub' will compare object content across

Re: v0.53 released

2012-10-17 Thread Oliver Francke

Hi Sage, *,

after having some trouble with the journals - had to erase the partition 
and redo a ceph... --mkjournal - I started my testing... Everything fine.


--- 8- ---
2012-10-17 12:54:11.167782 7febab24a780  0 filestore(/data/osd0) mount: 
enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and 
'filestore btrfs snap' mode is enabled

2012-10-17 12:54:11.191723 7febab24a780  0 journal  kernel version is 3.5.0
2012-10-17 12:54:11.191907 7febab24a780  1 journal _open /dev/sdb1 fd 
27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1

2012-10-17 12:54:11.201764 7febab24a780  0 journal  kernel version is 3.5.0
2012-10-17 12:54:11.201924 7febab24a780  1 journal _open /dev/sdb1 fd 
27: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1

--- 8- ---

And the next minute I started my fairly destructive testing; 0.52 never
ever failed on that. And then a loop started with

--- 8- ---

2012-10-17 12:59:15.403247 7feba5fed700  0 -- 10.0.0.11:6801/29042  
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :57922 pgs=3 cs=1 l=0).fault, 
initiating reconnect
2012-10-17 12:59:17.280143 7feb950cc700  0 -- 10.0.0.11:6801/29042  
10.0.0.12:6804/17972 pipe(0x17f2240 sd=29 :49431 pgs=3 cs=1 l=0).fault 
with nothing to send, going to standby
2012-10-17 12:59:18.288902 7feb951cd700  0 -- 10.0.0.11:6801/29042  
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :37519 pgs=3 cs=2 l=0).connect 
claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
2012-10-17 12:59:18.297663 7feb951cd700  0 -- 10.0.0.11:6801/29042  
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :34833 pgs=3 cs=2 l=0).connect 
claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!
2012-10-17 12:59:18.303215 7feb951cd700  0 -- 10.0.0.11:6801/29042  
10.0.0.12:6801/17706 pipe(0x55a2240 sd=34 :35169 pgs=3 cs=2 l=0).connect 
claims to be 0.0.0.0:6801/5738 not 10.0.0.12:6801/17706 - wrong node!

--- 8- ---

leading to high CPU-load on node2 ( IP 10.0.0.11). The destructive part 
happens on node3 ( IP 10.0.0.12).


Procedure is as always just kill some OSDs and start over again... 
Happened twice now, so I would call it reproducible ;)


Kind regards,

Oliver.


On 10/17/2012 01:48 AM, Sage Weil wrote:

Another development release of Ceph is ready, v0.53. We are getting pretty
close to what will be frozen for the next stable release (bobtail), so if
you would like a preview, give this one a go. Notable changes include:

  * librbd: image locking
  * rbd: fix list command when more than 1024 (format 2) images
  * osd: backfill reservation framework (to avoid flooding new osds with
backfill data)
  * osd, mon: honor new 'nobackfill' and 'norecover' osdmap flags
  * osd: new 'deep scrub' will compare object content across replicas (once
per week by default)
  * osd: crush performance improvements
  * osd: some performance improvements related to request queuing
  * osd: capability syntax improvements, bug fixes
  * osd: misc recovery fixes
  * osd: fix memory leak on certain error paths
  * osd: default journal size to 1 GB
  * crush: default root of tree type is now 'root' instead of 'pool' (to
avoid confusion wrt rados pools)
  * ceph-fuse: fix handling for .. in root directory
  * librados: some locking fixes
  * mon: some election bug fixes
  * mon: some additional on-disk metadata to facilitate future mon changes
(post-bobtail)
  * mon: throttle osd flapping based on osd history (limits osdmap
thrashing on overloaded or unhappy clusters)
  * mon: new 'osd crush create-or-move ...' command
  * radosgw: fix copy-object vs attributes
  * radosgw: fix bug in bucket stat updates
  * mds: fix ino release on abort session close, relative getattr path, mds
shutdown, other misc items
  * upstart: stop jobs on shutdown
  * common: thread pool sizes can now be adjusted at runtime
  * build fixes for Fedora 18, CentOS/RHEL 6

The big items are locking support in RBD, and OSD improvements like deep
scrub (which verify object data across replicas) and backfill reservations
(which limit load on expanding clusters). And a huge swath of bugfixes and
cleanups, many due to feeding the code through scan.coverity.com (they
offer free static code analysis for open source projects).

v0.54 is now frozen, and will include many deployment-related fixes
(including a new ceph-deploy tool to replace mkcephfs), more bugfixes for
libcephfs, ceph-fuse, and the MDS, and the fruits of some performance work
on the OSD.

You can get v0.53 from the usual locations:

  * Git at git://github.com/ceph/ceph.git
  * Tarball at http://ceph.com/download/ceph-0.53.tar.gz
  * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
  * For RPMs, see http://ceph.com/docs/master/install/rpm
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--

Oliver Francke

filoo GmbH

Re: v0.48.2 argonaut update released

2012-10-01 Thread Oliver Francke

Hi *,

with reference to the below mentioned

objecter: misc fixes for op reordering


I assumed it could have something to do with slow requests not being
resolved for too long. I just didn't see them anymore in 0.51 in our
testing environment.

But today we took one of our nodes into maintenance, and I see many of:

--- 8

2012-10-01 11:58:46.766999 osd.10 [WRN] 38 slow requests, 1 included 
below; oldest blocked for  1189.312605 secs
2012-10-01 11:58:46.767013 osd.10 [WRN] slow request 240.183032 seconds 
old, received at 2012-10-01 11:54:46.583860: 
osd_op(client.110046.0:2143984 rb.0.1adf5.6733efe2.061a [write 
208384~4096] 6.f511e801) v4 currently delayed

2

--- 8- ---

which is bad, as I assume that some of the VMs are now stalled.
Has anybody else experienced such things?
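To see what such a blocked op is actually waiting on, I would dump the in-flight ops
on the affected OSD via its admin socket, assuming the default socket path and the
command name from the 0.50-era releases:

ceph --admin-daemon /var/run/ceph/ceph-osd.10.asok dump_ops_in_flight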


Thnx n regards,

Oliver.

On 09/20/2012 06:52 PM, Sage Weil wrote:

Hi everyone,

Another update to the stable argonaut series has been released.  This
fixes a few important bugs in rbd and radosgw and includes a series of
changes to upstart and deployment related scripts that will allow the
upcoming 'ceph-deploy' tool to work with the argonaut release.

Upgrading:

  * The default search path for keyring files now includes
/etc/ceph/ceph.$name.keyring. If such files are present on your
cluster, be aware that by default they may now be used.
  * There are several changes to the upstart init files. These have not
been previously documented or recommended. Any existing users should
review the changes before upgrading.
  * The ceph-disk-prepare and ceph-disk-active scripts have been updated
significantly. These have not been previously documented or
recommended. Any existing users should review the changes before upgrading.

Notable changes include:

  * mkcephfs: fix keyring generation for mds, osd when default paths are
used
  * radosgw: fix bug causing occasional corruption of per-bucket stats
  * radosgw: workaround to avoid previously corrupted stats from going
negative
  * radosgw: fix bug in usage stats reporting on busy buckets
  * radosgw: fix Content-Range: header for objects bigger than 2 GB.
  * rbd: avoid leaving watch acting when command line tool errors out
(avoids 30s delay on subsequent operations)
  * rbd: friendlier use of pool/image options for import (old calling
convention still works)
  * librbd: fix rare snapshot creation race (could lose a snap when
creation is concurrent)
  * librbd: fix discard handling when spanning holes
  * librbd: fix memory leak on discard when caching is enabled
  * objecter: misc fixes for op reordering
  * objecter: fix for rare startup-time deadlock waiting for osdmap
  * ceph: fix usage
  * mon: reduce log noise about check_sub
  * ceph-disk-activate: misc fixes, improvements
  * ceph-disk-prepare: partition and format osd disks automatically
  * upstart: start everyone on a reboot
  * upstart: always update the osd crush location on start if specified in
the config
  * config: add /etc/ceph/ceph.$name.keyring to default keyring search path
  * ceph.spec: don't package crush headers

You can get this release from the usual locations:

  * Git at git://github.com/ceph/ceph.git
  * Tarball at http://ceph.newdream.net/download/ceph-0.48.2.tar.gz
  * For Debian/Ubuntu packages, see 
http://ceph.newdream.net/docs/master/install/debian
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v0.48.2 argonaut update released

2012-10-01 Thread Oliver Francke
Well,

On 01.10.2012 at 18:07, Sage Weil s...@inktank.com wrote:

 On Mon, 1 Oct 2012, Oliver Francke wrote:
 Hi *,
 
 with reference to the below mentioned
 
 objecter: misc fixes for op reordering
 
 
 I assumed it could have something to do with slow requests not being solved
 for too long. I just not saw it anymore in 0.51 in our testing-evironment.
 But today we took one of our nodes into maintenance, and I see many of:
 
 --- 8
 
 2012-10-01 11:58:46.766999 osd.10 [WRN] 38 slow requests, 1 included below;
 oldest blocked for  1189.312605 secs
 2012-10-01 11:58:46.767013 osd.10 [WRN] slow request 240.183032 seconds old,
 received at 2012-10-01 11:54:46.583860: osd_op(client.110046.0:2143984
 rb.0.1adf5.6733efe2.061a [write 208384~4096] 6.f511e801) v4 currently
 delayed
 2
 
 --- 8- ---
 
 which is bad, as I assume, that some of the VM's are now stalled. Anybody 
 else
 experienced such things?
 
 You see this on v0.51?  For v0.52 we merged in a large series of messenger 
 fixes and cleanups that could very easily explain this.  Can you try 
 v0.52?
 

just checking… up to now no slow req. is lasting for longer than 30… seconds.

 The same series has not been merged into the argonaut stable series; I'm 
 unsure yet whether that's a good idea (it's a lot of refactoring mixed in 
 with the fixes).  Perhaps after it has proven itself in v0.52 for longer, 
 and/or if we get reports of msgr problems in argonaut deployments.

I think most people are just happy campers these days… Not so destructive as 
myself ;)

 
 sage
 

Oliver.

 
 Thnx n regards,
 
 Oliver.
 
 On 09/20/2012 06:52 PM, Sage Weil wrote:
 Hi everyone,
 
 Another update to the stable argonaut series has been released.  This
 fixes a few important bugs in rbd and radosgw and includes a series of
 changes to upstart and deployment related scripts that will allow the
 upcoming 'ceph-deploy' tool to work with the argonaut release.
 
 Upgrading:
 
  * The default search path for keyring files now includes
/etc/ceph/ceph.$name.keyring. If such files are present on your
cluster, be aware that by default they may now be used.
  * There are several changes to the upstart init files. These have not
been previously documented or recommended. Any existing users should
review the changes before upgrading.
  * The ceph-disk-prepare and ceph-disk-active scripts have been updated
significantly. These have not been previously documented or
recommended. Any existing users should review the changes before
 upgrading.
 
 Notable changes include:
 
  * mkcephfs: fix keyring generation for mds, osd when default paths are
used
  * radosgw: fix bug causing occasional corruption of per-bucket stats
  * radosgw: workaround to avoid previously corrupted stats from going
negative
  * radosgw: fix bug in usage stats reporting on busy buckets
  * radosgw: fix Content-Range: header for objects bigger than 2 GB.
  * rbd: avoid leaving watch acting when command line tool errors out
(avoids 30s delay on subsequent operations)
  * rbd: friendlier use of pool/image options for import (old calling
convention still works)
  * librbd: fix rare snapshot creation race (could lose a snap when
creation is concurrent)
  * librbd: fix discard handling when spanning holes
  * librbd: fix memory leak on discard when caching is enabled
  * objecter: misc fixes for op reordering
  * objecter: fix for rare startup-time deadlock waiting for osdmap
  * ceph: fix usage
  * mon: reduce log noise about check_sub
  * ceph-disk-activate: misc fixes, improvements
  * ceph-disk-prepare: partition and format osd disks automatically
  * upstart: start everyone on a reboot
  * upstart: always update the osd crush location on start if specified in
the config
  * config: add /etc/ceph/ceph.$name.keyring to default keyring search path
  * ceph.spec: don't package crush headers
 
 You can get this release from the usual locations:
 
  * Git at git://github.com/ceph/ceph.git
  * Tarball at http://ceph.newdream.net/download/ceph-0.48.2.tar.gz
  * For Debian/Ubuntu packages, see
 http://ceph.newdream.net/docs/master/install/debian
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 -- 
 
 Oliver Francke
 
 filoo GmbH
  Moltkestraße 25a
  0 Gütersloh
  HRB4355 AG Gütersloh
  
  Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
 
 Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


OSD-crash on 0.48.1argonaut, error void ReplicatedPG::recover_got(hobject_t, eversion_t) not seen on list

2012-09-19 Thread Oliver Francke

Hi all,

after adding a new node into our ceph-cluster yesterday, we had a crash 
of one OSD.


I have found this kind of message in the bugtracker marked as solved
(http://tracker.newdream.net/issues/2075),
so I will update that ticket for my convenience and attach the corresponding log
(as it is a production site, there is no more-verbose debug output available, sorry).

Other than that, everything went almost smoothly, except for the annoying
slow requests,
which are hopefully fixed in 0.51 and beyond... when do we expect the next
stable, btw?


The replication was fast, thanks to an SSD-cached LSI controller and 4 OSDs per
node, one per HDD;

1 Gbit was completely saturated, so time for the next step towards 10 Gbit ;)

Regards,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v0.48.1 argonaut stable update released

2012-08-15 Thread Oliver Francke

Well,

On 08/14/2012 09:29 PM, Sage Weil wrote:

On Tue, 14 Aug 2012, Oliver Francke wrote:

Hi Sage,

I just updated to debian-testing/0.50 this afternoon, after some hint:

* osd: better tracking of recent slow operations

This is actually about the admin socket command to dump operations in
flight (more useful information is reported for diagnosis/debugging).


and it is hereby confirmed to be better in my testing environment.
 Before I had requests, which could be there for 480 seconds… not any
 more.

That great news!  That is probably Sam's refactor of the OSD threading at
work.  There were also a few bugs fixed in 0.48.1 that were causing
somewhat similar symptoms (ops blocked indefinitely) due to peering
problems, but that doesn't sound like it's the same thing.


How's about this fix in 0.48.X?

It's a huge set of changes, and definitely won't go into the 0.48 series,
sorry!  (In fact, the pending change was one motivation for doing 0.48
when we did.)  It will be in bobtail, though, which is probably about a
month away from freeze.

Please let us know what your experience is like with 0.50 (and beyond).


the more detailed picture is: it works and is stable, so far no problems 
with my torture-tests.

Sporadically I see a line ala:

--- 8- ---
delete error: image still has watchers
This means the image is still open or the client using it crashed. Try 
again after closing/unmapping it or waiting 30s for the crashed client 
to timeout.
2012-08-15 15:57:22.072729 7f9fe82a2760 -1 librbd: error removing 
header: (16) Device or resource busy

--- 8- ---

even from VMs that were stopped long ago.

Regards,

Oliver.



Thanks!
sage



Thnx in @vance,

Oliver - Thus being too lazy to read all change logs - Francke.

On 14.08.2012 at 20:18, Sage Weil s...@inktank.com wrote:


We've built and pushed the first update to the argonaut stable release.
This branch has a range of small fixes for stability, compatibility, and
performance, but no major changes in functionality.  The stability fixes
are particularly important for large clusters with many OSDs, and for
network environments where intermittent network failures are more common.

The highlights include:

* mkcephfs: use default `keyring', `osd data', `osd journal' paths when
   not specified in conf
* msgr: various fixes to socket error handling
* osd: reduce scrub overhead
* osd: misc peering fixes (past_interval sharing, pgs stuck in `peering'
   states)
* osd: fail on EIO in read path (do not silently ignore read errors from
   failing disks)
* osd: avoid internal heartbeat errors by breaking some large
   transactions into pieces
* osd: fix osdmap catch-up during startup (catch up and then add daemon
   to osdmap)
* osd: fix spurious `misdirected op' messages
* osd: report scrub status via `pg # query'
* rbd: fix race when watch registrations are resent
* rbd: fix rbd image id assignment scheme (new image data objects have
   slightly different names)
* rbd: fix perf stats for cache hit rate
* rbd tool: fix off-by-one in key name (crash when empty key specified)
* rbd: more robust udev rules
* rados tool: copy object, pool commands
* radosgw: fix in usage stats trimming
* radosgw: misc compatibility fixes (date strings, ETag quoting, swift
   headers, etc.)
* ceph-fuse: fix locking in read/write paths
* mon: fix rare race corrupting on-disk data
* config: fix admin socket `config set' command
* log: fix in-memory log event gathering
* debian: remove crush headers, include librados-config
* rpm: add ceph-disk-{activate, prepare}

The fix for the radosgw usage trimming is incompatible with v0.48 (which
was effectively broken).  You now need to use the v0.48.1 version of
radosgw-admin to initiate usage stats trimming.

There are a range of smaller bug fixes as well.  For a complete list of
what went into this release, please see the release notes and changelog.

You can get this stable update from the usual locations:

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.newdream.net/download/ceph-0.48.1.tar.gz
* For Debian/Ubuntu packages, see 
http://ceph.newdream.net/docs/master/install/debian

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v0.48.1 argonaut stable update released

2012-08-14 Thread Oliver Francke
Hi Sage,

I just updated to debian-testing/0.50 this afternoon, after some hint:

* osd: better tracking of recent slow operations

and it is hereby confirmed to be better in my testing environment. Before I had 
requests, which could be there for 480 seconds… not any more.

How's about this fix in 0.48.X?

Thnx in @vance,

Oliver - Thus being too lazy to read all change logs - Francke.

On 14.08.2012 at 20:18, Sage Weil s...@inktank.com wrote:

 We've built and pushed the first update to the argonaut stable release.  
 This branch has a range of small fixes for stability, compatibility, and 
 performance, but no major changes in functionality.  The stability fixes 
 are particularly important for large clusters with many OSDs, and for 
 network environments where intermittent network failures are more common.
 
 The highlights include:
 
 * mkcephfs: use default `keyring', `osd data', `osd journal' paths when 
   not specified in conf
 * msgr: various fixes to socket error handling
 * osd: reduce scrub overhead
 * osd: misc peering fixes (past_interval sharing, pgs stuck in `peering' 
   states)
 * osd: fail on EIO in read path (do not silently ignore read errors from 
   failing disks)
 * osd: avoid internal heartbeat errors by breaking some large 
   transactions into pieces
 * osd: fix osdmap catch-up during startup (catch up and then add daemon 
   to osdmap)
 * osd: fix spurious `misdirected op' messages
 * osd: report scrub status via `pg # query'
 * rbd: fix race when watch registrations are resent
 * rbd: fix rbd image id assignment scheme (new image data objects have 
   slightly different names)
 * rbd: fix perf stats for cache hit rate
 * rbd tool: fix off-by-one in key name (crash when empty key specified)
 * rbd: more robust udev rules
 * rados tool: copy object, pool commands
 * radosgw: fix in usage stats trimming
 * radosgw: misc compatibility fixes (date strings, ETag quoting, swift 
   headers, etc.)
 * ceph-fuse: fix locking in read/write paths
 * mon: fix rare race corrupting on-disk data
 * config: fix admin socket `config set' command
 * log: fix in-memory log event gathering
 * debian: remove crush headers, include librados-config
 * rpm: add ceph-disk-{activate, prepare}
 
 The fix for the radosgw usage trimming is incompatible with v0.48 (which 
 was effectively broken).  You now need to use the v0.48.1 version of 
 radosgw-admin to initiate usage stats trimming.
 
 There are a range of smaller bug fixes as well.  For a complete list of 
 what went into this release, please see the release notes and changelog.
 
 You can get this stable update from the usual locations:
 
 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.newdream.net/download/ceph-0.48.1.tar.gz
 * For Debian/Ubuntu packages, see 
 http://ceph.newdream.net/docs/master/install/debian
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Some findings on 0.48, qemu-1.0.1 eating up RDB-write-cache memory

2012-07-09 Thread Oliver Francke

Hi *,

as I have read many postings from users using qemu, too, I would like 
them to keep an eye on memory consumption.


I'm with qemu-1.0.1 and qemu-1.1.0-1 and linux-kernel 3.4.2/3.5.0-rc2.

If I restart a VM from cold, I take some readings until memory is
fully used (cache/buffers); that is, the VM is started with:


-m 1024

and I can see RSS of 1.1g in top.
After doing some normal IOps testing with:
spew -v --raw -P -t -i 5 -b 4k -p random -B 4k 2G /tmp/doof.dat

so a 2G file tested for IOPS performance with 4k blocks; I get a pretty
good value for 5x write/read-after-write:


Total iterations:5
Total runtime:00:04:43
Total write transfer time (WTT):  00:02:15
Total write transfer rate (WTR):77480.53 KiB/s
Total write IOPS:   19370.13 IOPS
Total read transfer time (RTT):   00:01:40
Total read transfer rate (RTR):103823.12 KiB/s
Total read IOPS:25955.78 IOPS

but at the cost of approx. 400 MiB more memory used, now showing 1.5g.
Though it's not proportional: after the next run I get 1.6g, then the
process slows down... another two runs and we break the 1.7g border...
But with the following settings in the global section of ceph.conf:


   rbd_cache = true
   rbd_cache_size=16777216
   rbd_cache_max_dirty=8388608
   rbd_cache_target_dirty=4194304

I cannot see why we should waste 500+ MiB of memory ;) (multiplied
by the approx. 100 VMs running).


If the same VM is started with:
:rbd_cache=false
everything stays as it should.

Anybody with similar setup willing to do some testing?
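What I do to watch it, just as a crude sketch; the pgrep pattern is of course
specific to my test VM and purely illustrative:

pid=$(pgrep -f 'qemu.*-m 1024' | head -n1)      # hypothetical pattern matching the test VM
while sleep 60; do ps -o rss= -p "$pid"; done   # print its resident memory (KiB) once a minute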

Other than that: fast and stable release, it seems ;)

Thnx in @vance,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: all rbd users: set 'filestore fiemap = false'

2012-06-18 Thread Oliver Francke

Hi Sage,

On 06/18/2012 06:02 AM, Sage Weil wrote:

If you are using RBD, and want to avoid potential image corruption, add

filestore fiemap = false

to the [osd] section of your ceph.conf and restart your OSDs.


as far as it goes this heals some trouble, but I really don't understand...



We've tracked down the source of some corruption to racy/buggy FIEMAP
ioctl behavior.  The RBD client (when caching is diabled--the default)
uses a 'sparse read' operation that the OSD implements by doing an fsync
on the object file, mapping which extents are allocated, and sending only
that data over the wire.  We have observed incorrect/changing FIEMAP on
both btrfs:

fsync
fiemap returns mapping
time passes, no modifications to file
fiemap returns different mapping


... that even an initial start of a VM leads to corruption of the read data?

I get something like:

--- 8- ---

Loading, please wait
/sbin/init: relocation error: ...
 not defined in file libc.so.6...
[ 0.81...] Kernel panic - not snycing: Attempted to kill init!

--- 8- ---

host-kernel is now 3.4.1 + qemu-1.0.1, but shows failures with other 
kernel/qemu-versions, too.


Keeping fingers crossed for Josh, though ;-)
Give me a shout if I can do some debugging,

regards,

Oliver.



Josh is still tracking down which kernels and file system are affected;
fortunately it is relatively easy to reproduce with the test_librbd_fsx
tool.  In the meantime, the (mis)feature can be safely disabled. It will
default to off in 0.48. It is unclear whether it's really much of a
performance win anyway.

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Oliver Francke

Hi Guido,

yeah, there is something weird going on. I just started to set up
some test VMs, freshly imported from running *.qcow2 images.

Kernel panic with INIT, seg-faults and other funny stuff.

Just added the rbd_cache=true in my config, voila. All is 
fast-n-up-n-running...
All my testing was done with cache enabled... Since our errors all came 
from rbd_writeback from former ceph-versions...


Josh? Sage? Help?!

Oliver.

On 06/08/2012 02:55 PM, Guido Winkelmann wrote:

Am Donnerstag, 7. Juni 2012, 12:48:05 schrieben Sie:

On 06/07/2012 11:04 AM, Guido Winkelmann wrote:

Hi,

I'm using Ceph with RBD to provide network-transparent disk images for
KVM-
based virtual servers. The last two days, I've been hunting some weird
elusive bug where data in the virtual machines would be corrupted in
weird ways. It usually manifests in files having some random data -
usually zeroes - at the start before the actual contents that should be
in there start.

I definitely want to figure out what's going on with this.
A few questions:

Are you using rbd caching? If so, what settings?

In either case, does the corruption still occur if you
switch caching on/off? There are different I/O paths here,
and this might tell us if the problem is on the client side.

Okay, I've tried enabling rbd caching now, and so far, the problem appears to
be gone.

I am using libvirt for starting and managing the virtual machines, and what I
did was change thesource  element for the virtual disk from

source protocol='rbd' name='rbd/name_of_image'

to

source protocol='rbd' name='rbd/name_of_image:rbd_cache=true'

and then restart the VM.
(I found that in one of your mails on this list; there does not appear to be
any proper documentation on this...)

The iotester does not find any corruptions with these settings.

The VM ist still horribly broken, but that's probably lingering filesystem
damage from yesterday. I'll try with a fresh image next.

I did not change anything else in the setup. In particular, the OSDs still use
btrfs. One of the OSD has been restarted, though. I will run another test with
a VM without rbd caching, to make sure it wasn't by random chance restarting
that one osd that made the real difference.

Enabling btrfs did not appear to make any difference wrt performance, but
that's probably because my tests mostly create sustained sequential IO, for
which caches are generally not very helpful.

Enabling rbd caching is not a solution I particularly like, for two reasons:

1. In my setup, migrating VMs from one host to another is a normal part of
operation, and I still don't know ho to prevent data corruption (in the form
of silently lost writes) when combining rbd caching and migration.

2. I'm not really looking into speeding up single VM, I'm really more
interested in just how many VMs I can run before performance starts degrading
for everyone, and I don't think rbd caching will help with that.

Regards,
Guido

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Oliver Francke

Well then,

quite busy with some other stuff too, but...


On 06/08/2012 04:50 PM, Josh Durgin wrote:

On 06/08/2012 06:55 AM, Sage Weil wrote:

On Fri, 8 Jun 2012, Oliver Francke wrote:

Hi Guido,

yeah, there is something weird going on. I just started to establish some
test-VM's. Freshly imported from running *.qcow2 images.
Kernel panic with INIT, seg-faults and other funny stuff.

Just added the rbd_cache=true in my config, voila. All is
fast-n-up-n-running...
All my testing was done with cache enabled... Since our errors all came from
rbd_writeback from former ceph-versions...


Are you guys able to reproduce the corruption with 'debug osd = 20' and
'debug ms = 1'?  Ideally we'd like to:

  - reproduce from a fresh vm, with osd logs
  - identify the bad file
  - map that file to a block offset (see
http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
  - use that to identify the badness in the log
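
For the mapping step, a rough sketch with plain filefrag from e2fsprogs inside
the guest might also do (paths and numbers are only placeholders):

    # inside the VM: list the logical/physical extents of the corrupted file
    filefrag -v /mnt/test/bad_file
    # physical block * filesystem block size (+ partition start offset) gives
    # the byte offset on the virtual disk; dividing by 4 MiB gives the index of
    # the rbd object to look for in the osd logs, e.g.:
    echo $(( (123456 * 4096 + 1048576) / (4 * 1024 * 1024) ))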


a logfile with debugging is available at our local store...



I suspect the cache is just masking the problem because it submits fewer
IOs...


The cache also doesn't do sparse reads. Is it still reproducible with
a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
and run without rbd caching?
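
That is, roughly, in ceph.conf on the osd nodes before restarting them (a
sketch, using the option name given above):

    [osd]
        filestore fiemap threshold = 0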


restarted OSDs with this setting, but without rbd_cache I still get 
errors. *sigh*



Oliver.



Josh


sage




Josh? Sage? Help?!

Oliver.

On 06/08/2012 02:55 PM, Guido Winkelmann wrote:

On Thursday, June 7, 2012, at 12:48:05, you wrote:

On 06/07/2012 11:04 AM, Guido Winkelmann wrote:

Hi,

I'm using Ceph with RBD to provide network-transparent disk images for
KVM-based virtual servers. For the last two days, I've been hunting some weird,
elusive bug where data in the virtual machines gets corrupted in
strange ways. It usually manifests as files having some random data -
usually zeroes - at the start, before the actual contents begin.

I definitely want to figure out what's going on with this.
A few questions:

Are you using rbd caching? If so, what settings?

In either case, does the corruption still occur if you
switch caching on/off? There are different I/O paths here,
and this might tell us if the problem is on the client side.

Okay, I've tried enabling rbd caching now, and so far, the problem appears
to be gone.

I am using libvirt for starting and managing the virtual machines, and what I
did was change the <source> element for the virtual disk from

<source protocol='rbd' name='rbd/name_of_image'>

to

<source protocol='rbd' name='rbd/name_of_image:rbd_cache=true'>

and then restart the VM.
(I found that in one of your mails on this list; there does not appear to be
any proper documentation on this...)

The iotester does not find any corruptions with these settings.

The VM is still horribly broken, but that's probably lingering filesystem
damage from yesterday. I'll try with a fresh image next.

I did not change anything else in the setup. In particular, the OSDs still use
btrfs. One of the OSDs has been restarted, though. I will run another test with
a VM without rbd caching, to make sure it wasn't restarting that one OSD that
made the real difference by random chance.

Enabling rbd caching did not appear to make any difference wrt performance, but
that's probably because my tests mostly create sustained sequential IO, for
which caches are generally not very helpful.

Enabling rbd caching is not a solution I particularly like, for two reasons:

1. In my setup, migrating VMs from one host to another is a normal part of
operation, and I still don't know how to prevent data corruption (in the form
of silently lost writes) when combining rbd caching and migration.

2. I'm not really looking into speeding up a single VM; I'm more
interested in just how many VMs I can run before performance starts degrading
for everyone, and I don't think rbd caching will help with that.

Regards,
Guido

--
To unsubscribe from this list: send the line unsubscribe 
ceph-devel in

the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe 
ceph-devel in

the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html






--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Oliver Francke
Hi Guido,

unfortunately this sounds very familiar to me. We have been on a long road with
similar weird errors.
Our setup is something like this: start a couple of VMs (qemu-*), let each of them
create a 1G file and randomly seek and write 4MB blocks filled with md5sums of
the block as payload, to be verifiable after being completely written.
Furthermore, create some files every now and then and try to remove them
after the verify run.
This produced the same symptoms as you are experiencing - zeroed blocks -
with the main difference that my tests are now clean with 0.47-2 and friends,
after a couple of hundreds of runs.
Our setup uses XFS as the OSD data partition, as we had too many errors with
btrfs in the past.
My assumption would be that this is somehow related to your filesystem…?!

Would be cool if you are able to change your setup to XFS. At least that would 
be a starting-point for further investigations.

Regards,

Oliver.

On 07.06.2012 at 20:04, Guido Winkelmann wrote:

 Hi,
 
 I'm using Ceph with RBD to provide network-transparent disk images for
 KVM-based virtual servers. For the last two days, I've been hunting some weird,
 elusive bug where data in the virtual machines gets corrupted in strange ways.
 It usually manifests as files having some random data - usually zeroes - at the
 start, before the actual contents begin.
 
 To track this down, I wrote a simple io tester. It does the following:
 
 - Create 1 Megabyte of random data
 - Calculate the SHA256 hash of that data
 - Write the data to a file on the harddisk, in a given directory, using the 
 hash as the filename
 - Repeat until the disk is full
 - Delete the last file (because it is very likely to be incompletely written)
 - Read and delete all the files just written while checking that their sha256 
 sums are equal to their filenames
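 
 Not the original iotester, but a minimal shell sketch of the same idea - the
 target directory and file count are placeholders, and it writes a fixed number
 of files instead of filling the disk:
 
     #!/bin/bash
     # write 1 MiB files of random data, each named after its own sha256 sum
     dir=/mnt/test/iotest; mkdir -p "$dir"
     for i in $(seq 1 1000); do
         f=$(mktemp "$dir/tmp.XXXXXX")
         dd if=/dev/urandom of="$f" bs=1M count=1 2>/dev/null
         mv "$f" "$dir/$(sha256sum "$f" | cut -d' ' -f1)"
     done
     # read everything back and verify the content still matches the file name
     for f in "$dir"/*; do
         [ "$(sha256sum "$f" | cut -d' ' -f1)" = "$(basename "$f")" ] || echo "CORRUPT: $f"
         rm -f "$f"
     done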
 
 When running this io tester in a VM that uses a qcow2 file on a local harddisk
 for its virtual disk, no errors are found. When the same VM is running using
 rbd, the io tester finds on average about one corruption every 200 Megabytes,
 reproducibly.
 
 (As an interesting aside, the io tester also prints how long it took to
 read or write 100 MB, and it turns out reading the data back in again is about
 three times slower than writing it in the first place...)
 
 Ceph is version 0.47.2. Qemu KVM is 1.0, compiled with the spec file from 
 http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=summary
 (And compiled after ceph 0.47.2 was installed on that machine, so it would 
 use 
 the correct headers...)
 Both the Ceph cluster and the KVM host machines are running on Fedora 16, 
 with 
 a fairly recent 3.3.x kernel.
 The ceph cluster uses btrfs for the OSDs' data dirs. The journal is on a
 tmpfs.
 (This is not a production setup - luckily.)
 The virtual machine is using ext4 as its filesystem.
 There were no obvious other problems with either the ceph cluster or the KVM 
 host machines.
 
 I have attached a copy of the ceph.conf in use, in case it might be helpful.
 
 This is a huge problem, and any help in tracking it down would be much 
 appreciated.
 
 Regards,
 
   Guido

 [attachment: ceph.conf]

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: q. about rbd-header

2012-03-15 Thread Oliver Francke

Hi Josh,

On 03/14/2012 10:59 PM, Josh Durgin wrote:

On 03/14/2012 01:49 PM, Oliver Francke wrote:

Well,

nobody able to shed some light on this?
Did some math and found out how to fill in the size bytes.


Sorry I didn't respond faster.


But, one question never got answered:
 - why is - with busy VMs - frequently the first block affected,
   with the result of damaged 
grub-loaders/partition-tables/filesystems?

   Is this some NULL/zero pointer thingy in case of ceph-failure?


My guess is that this is not the first object affected, but it's where 
the loss of an object is most easily noticeable - if an object doesn't 
exist, it's treated as being full of zeros, which might go undetected 
for a long time if it's e.g. some temp or log file that's not reread 
and verified.
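
A quick way to check that by hand might be something like this (pool and image
name are examples from this thread; for a format-1 image the header object is
simply <imagename>.rbd):

    # does the header object for the image still exist?
    rados -p rbd stat vm-266-disk-1.rbd
    # list the data objects that are still there, by their block name prefix
    rados -p rbd ls | grep '^rb\.0\.0\.'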


well, I responded to Sage with some more infos from one of the images 
where the header is missing... Did not want to bother the list ;)





If you need some broken images… we have many of them to investigate,
unfortunately.


We'd really like to find the root cause of the problem. One 
possibility is some bad interaction between osds running different 
versions. This caused one issue with recovery stxShadow saw yesterday, 
for example (http://tracker.newdream.net/issues/2132). Had you been 
doing rolling upgrades of osds before these problems appeared? If so, 
do you know which versions you had running concurrently?


Are your osds often restarting?

What we'd need to diagnose this are osd logs during recovery with:

debug osd = 20
debug ms = 1

Once you detect the problem, a log from each replica storing the pg 
the bad/missing object is in should be enough.
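
To see which osds hold those replicas, something like this should do (pool and
object name are only examples):

    # maps an object to its pg and the set of osds storing it
    ceph osd map rbd vm-266-disk-1.rbd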


And just to make sure, you aren't writing to these rbd images from 
multiple places, right? This wouldn't cause the missing header 
objects, but is likely to cause corruption of the image data. This 
could happen, for example, by rolling an image back to a snapshot 
while a vm is running on it.


Currently we don't use snapshots. And of course we ensure a VM is only running
in one place at a time ;-) We did do a rolling upgrade, but this was
_after_ the trouble/crashes occurred.


Oliver.



Josh

Maybe this sounds a bit harsh, after the 5th night-shift trying to 
repair images

and keep customers calm, I think this is forgivable.

Oliver.

On 14.03.2012 at 16:05, Oliver Francke wrote:


Hey,

anybody out there who could explain the structure of an rbd header? After the
last crash we have about 10 images with a:
   2012-03-14 15:22:47.998790 7f45a61e3760 librbd: Error reading
header: 2 No such file or directory
error opening image vm-266-disk-1.rbd: 2 No such file or directory
... error?
I understand the rb.x.y prefix and the 2^0x16 (= 2^22 = 4 MB) block size, but
the size/count encoding is not intuitive ;)

Besides one file, where I created a header and put it back into the pool via
rados put, and got some files back, many of the other images with lost headers
have different sizes.

We had bad luck again: too many crashed VMs, too much data loss...

Comments welcome ;)

Oliver.
--
To unsubscribe from this list: send the line unsubscribe 
ceph-devel in

the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


q. about rbd-header

2012-03-14 Thread Oliver Francke
Hey,

anybody out there who could explain the structure of an rbd header? After the
last crash we have about 10 images with a:
   2012-03-14 15:22:47.998790 7f45a61e3760 librbd: Error reading
header: 2 No such file or directory
error opening image vm-266-disk-1.rbd: 2 No such file or directory
... error?
I understand the rb.x.y prefix and the 2^0x16 (= 2^22 = 4 MB) block size, but
the size/count encoding is not intuitive ;)
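
For anyone else poking at this, a rough way to eyeball a header object and the
block-size math (pool/image name are examples, and this assumes the 0x16 above
is the rbd 'order' field):

    # pull the header object of an image and look at the raw bytes
    rados -p rbd get vm-266-disk-1.rbd /tmp/header.bin
    hexdump -C /tmp/header.bin | head -20
    # block size is 2^order; order 0x16 = 22 decimal:
    echo $((2 ** 0x16))     # 4194304 bytes, i.e. 4 MB objects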

Besides one file, where I created a header and put it back into the pool via
rados put, and got some files back, many of the other images with lost headers
have different sizes.

We had bad luck again: too many crashed VMs, too much data loss...

Comments welcome ;)

Oliver.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: q. about rbd-header

2012-03-14 Thread Oliver Francke
Well,

nobody able to shed some light on this?
I did some math and found out how to fill in the size bytes.

But one question never got answered:
- why is it that - with busy VMs - the first block is frequently the one affected,
  with the result of damaged grub loaders/partition tables/filesystems?
  Is this some NULL/zero pointer thingy in case of a ceph failure?

If you need some broken images… we have many of them to investigate,
unfortunately.

Maybe this sounds a bit harsh, but after the 5th night shift spent trying to repair
images and keep customers calm, I think it is forgivable.

Oliver.

On 14.03.2012 at 16:05, Oliver Francke wrote:

 Hey,
 
 anybody out there who could explain the structure of an rbd header? After the
 last crash we have about 10 images with a:
   2012-03-14 15:22:47.998790 7f45a61e3760 librbd: Error reading
 header: 2 No such file or directory
 error opening image vm-266-disk-1.rbd: 2 No such file or directory
 ... error?
 I understand the rb.x.y prefix and the 2^0x16 (= 2^22 = 4 MB) block size, but
 the size/count encoding is not intuitive ;)
 
 Besides one file, where I created a header and put it back into the pool via
 rados put, and got some files back, many of the other images with lost headers
 have different sizes.
 
 We had bad luck again: too many crashed VMs, too much data loss...
 
 Comments welcome ;)
 
 Oliver.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Still inconsistant pg's, ceph-osd crashes reliably after trying to repair

2012-03-01 Thread Oliver Francke

Hi *,

after some crashes we still had to care for some remaining 
inconsistencies reported via

ceph -w
and friends.
Well, we traced one of them down via
ceph pg dump

and we picked 79, pg=79.7, and found the corresponding file named in
/var/log/ceph/osd.2.log:

/data/osd4/current/79.7_head/rb.0.0.136c__head_9FB2FA17
and the dup on
/data/osd2/...
Strange though, they had the same checksum but reported a stat-error. 
Anyway. Decided to do a:

ceph pg repair 79.7
... byebye ceph-osd on node2!

Here the trace:

=== 8- ===

2012-03-01 17:49:13.024571 7f3944584700 -- 10.10.10.14:6802/4892 >> 
10.10.10.10:6802/19139 pipe(0xfcd2c80 sd=16 pgs=0 cs=0 l=0).connect 
protocol version mismatch, my 9 != 0
2012-03-01 17:49:23.674162 7f395001b700 log [ERR] : 79.7 osd.4: soid 
9fb2fa17/rb.0.0.136c/head extra attr _, extra attr snapset
2012-03-01 17:49:23.674222 7f395001b700 log [ERR] : 79.7 repair 0 
missing, 1 inconsistent objects

*** Caught signal (Aborted) **
 in thread 7f395001b700
 ceph version 0.42-142-gc9416e6 
(commit:c9416e6184905501159e96115f734bdf65a74d28)

 1: /usr/bin/ceph-osd() [0x5a6b89]
 2: (()+0xeff0) [0x7f3960ca5ff0]
 3: (gsignal()+0x35) [0x7f395f2841b5]
 4: (abort()+0x180) [0x7f395f286fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f395fb18dc5]
 6: (()+0xcb166) [0x7f395fb17166]
 7: (()+0xcb193) [0x7f395fb17193]
 8: (()+0xcb28e) [0x7f395fb1728e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) 
[0x67c5ce]

 10: (object_info_t::decode(ceph::buffer::list::iterator)+0x2c) [0x61663c]
 11: (PG::repair_object(hobject_t const, ScrubMap::object*, int, 
int)+0x3be) [0x68d96e]

 12: (PG::scrub_finalize()+0x1438) [0x6b8568]
 13: (OSD::ScrubFinalizeWQ::_process(PG*)+0xc) [0x588edc]
 14: (ThreadPool::worker()+0xa26) [0x5bc426]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x585f0d]
 16: (()+0x68ca) [0x7f3960c9d8ca]
 17: (clone()+0x6d) [0x7f395f32186d]
2012-03-01 17:49:30.017269 7f81b662b780 ceph version 0.42-142-gc9416e6 
(commit:c9416e6184905501159e96115f734bdf65a74d28), process ceph-osd, pid 
3111
2012-03-01 17:49:30.085426 7f81b662b780 filestore(/data/osd2) mount 
FIEMAP ioctl is NOT supported
2012-03-01 17:49:30.085466 7f81b662b780 filestore(/data/osd2) mount did 
NOT detect btrfs
2012-03-01 17:49:30.110409 7f81b662b780 filestore(/data/osd2) mount 
found snaps 
2012-03-01 17:49:30.110476 7f81b662b780 filestore(/data/osd2) mount: 
enabling WRITEAHEAD journal mode: btrfs not detected
2012-03-01 17:49:31.964977 7f81b662b780 journal _open /dev/sdc1 fd 16: 
10737942528 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-03-01 17:49:31.967549 7f81b662b780 journal read_entry 929464 : 
seq 67841857 11225 bytes


=== 8- ===

... after some journal-replay things calmed down, but:

2012-03-01 17:58:29.470446   log 2012-03-01 17:58:24.242369 osd.2 
10.10.10.14:6801/3111 368 : [WRN] bad locator @56 on object @79 loc @56 
op osd_op(client.44350.0:1412387 rb.0.0.136c [write 
2465792~49152] 56.9fb2fa17) v4


We see this type of message every so often... It corresponds, but in
what way?


Can't we assume that if both snippets rb.0.0... are identical, life is
good?
We had some other inconsistencies, where we had to delete the whole pool
to get rid of crappy blocks. The ceph-osd died there, too: after doing a
rbd rm pool/image
the one block in question remained, visible via
rados ls -p pool

Any idea, or better, a clue? ;-)

Kind reg's,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Still inconsistant pg's, ceph-osd crashes reliably after trying to repair

2012-03-01 Thread Oliver Francke
Well,

On 01.03.2012 at 18:15, Oliver Francke wrote:

 Hi *,
 
 after some crashes we still had to care for some remaining inconsistancies 
 reported via
ceph -w
 and friends.
 Well, we traced one of them down via
ceph pg dump
 
 and we picked 79. pg=79.7 and found the corresponding file in the 
 /var/log/ceph/osd.2.log.
/data/osd4/current/79.7_head/rb.0.0.136c__head_9FB2FA17
 and the dup on
/data/osd2/...
 Strange though, they had the same checksum but reported a stat-error. Anyway. 
 Decided to do a:
ceph pg repair 79.7
 ... byebye ceph-osd on node2!
 
 Here the trace:
 
 === 8- ===
 
 2012-03-01 17:49:13.024571 7f3944584700 -- 10.10.10.14:6802/4892 >> 
 10.10.10.10:6802/19139 pipe(0xfcd2c80 sd=16 pgs=0 cs=0 l=0).connect protocol 
 version mismatch, my 9 != 0
 2012-03-01 17:49:23.674162 7f395001b700 log [ERR] : 79.7 osd.4: soid 
 9fb2fa17/rb.0.0.136c/head extra attr _, extra attr snapset

One clarification we figured out ourselves: one copy is missing the xattrs,
checked via getfattr.
But why can't it be corrected, and worse, why does this crash happen?
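
A rough sketch of that check, with example paths from above (run it on each
node holding a replica; the object file name is abbreviated here):

    # checksum the object file and dump the xattrs ceph keeps on it
    md5sum /data/osd2/current/79.7_head/rb.0.0.*136c*__head_9FB2FA17
    getfattr -d -m '.*' /data/osd2/current/79.7_head/rb.0.0.*136c*__head_9FB2FA17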

 2012-03-01 17:49:23.674222 7f395001b700 log [ERR] : 79.7 repair 0 missing, 1 
 inconsistent objects
 *** Caught signal (Aborted) **
 in thread 7f395001b700
 ceph version 0.42-142-gc9416e6 
 (commit:c9416e6184905501159e96115f734bdf65a74d28)
 1: /usr/bin/ceph-osd() [0x5a6b89]
 2: (()+0xeff0) [0x7f3960ca5ff0]
 3: (gsignal()+0x35) [0x7f395f2841b5]
 4: (abort()+0x180) [0x7f395f286fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f395fb18dc5]
 6: (()+0xcb166) [0x7f395fb17166]
 7: (()+0xcb193) [0x7f395fb17193]
 8: (()+0xcb28e) [0x7f395fb1728e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x67c5ce]
 10: (object_info_t::decode(ceph::buffer::list::iterator)+0x2c) [0x61663c]
 11: (PG::repair_object(hobject_t const, ScrubMap::object*, int, int)+0x3be) 
 [0x68d96e]
 12: (PG::scrub_finalize()+0x1438) [0x6b8568]
 13: (OSD::ScrubFinalizeWQ::_process(PG*)+0xc) [0x588edc]
 14: (ThreadPool::worker()+0xa26) [0x5bc426]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x585f0d]
 16: (()+0x68ca) [0x7f3960c9d8ca]
 17: (clone()+0x6d) [0x7f395f32186d]
 2012-03-01 17:49:30.017269 7f81b662b780 ceph version 0.42-142-gc9416e6 
 (commit:c9416e6184905501159e96115f734bdf65a74d28), process ceph-osd, pid 3111
 2012-03-01 17:49:30.085426 7f81b662b780 filestore(/data/osd2) mount FIEMAP 
 ioctl is NOT supported
 2012-03-01 17:49:30.085466 7f81b662b780 filestore(/data/osd2) mount did NOT 
 detect btrfs
 2012-03-01 17:49:30.110409 7f81b662b780 filestore(/data/osd2) mount found 
 snaps 
 2012-03-01 17:49:30.110476 7f81b662b780 filestore(/data/osd2) mount: enabling 
 WRITEAHEAD journal mode: btrfs not detected
 2012-03-01 17:49:31.964977 7f81b662b780 journal _open /dev/sdc1 fd 16: 
 10737942528 bytes, block size 4096 bytes, directio = 1, aio = 0
 2012-03-01 17:49:31.967549 7f81b662b780 journal read_entry 929464 : seq 
 67841857 11225 bytes
 
 === 8- ===
 
 ... after some journal-replay things calmed down, but:
 
2012-03-01 17:58:29.470446   log 2012-03-01 17:58:24.242369 osd.2 
 10.10.10.14:6801/3111 368 : [WRN] bad locator @56 on object @79 loc @56 op 
 osd_op(client.44350.0:1412387 rb.0.0.136c [write 2465792~49152] 
 56.9fb2fa17) v4
 
 these type of messages we see ever so often... It corresponds, but in what 
 way?
 
 Can't we assume, if both snipplets rb.0.0... are identical, that life's 
 good?
 We had some other inconsistancies, where we had to delete the whole pool to 
 get rid of crappy
 blocks. The ceph-osd died, too, after doing some
rbd rm pool/image
 the one block in question remained, visable via
rados ls -p pool
 
 Any idea, o better clue? ;-)
 
 Kind reg's,
 
 Oliver.
 
 -- 
 
 Oliver Francke
 
 filoo GmbH
 Moltkestraße 25a
 0 Gütersloh
 HRB4355 AG Gütersloh
 
 Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
 
 Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Recommended number of pools, one Q. ever wanted to ask

2012-02-28 Thread Oliver Francke

Hi *,

well, there was once a comment on our layout along the lines of "too many pools".
Our setup is to have a pool per customer, to simplify the view on used storage
capacity.
So, if we have - in a couple of months, we hope - more than a few hundred
customers, this setup was said to be not recommended, because the whole system
is not designed for handling that. (Sage)

What does "not recommended" mean? Is it that the memory used per OSD will be
too high?
Is this a general performance issue?

Well, when we read "pool", it gave us the basic idea/concept of putting all
per-customer data into it.
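
Just to illustrate what we do per customer (pool name and pg count are only
examples):

    # one pool per customer...
    ceph osd pool create customer-4711 64
    # ...so that per-customer usage shows up directly in the per-pool statistics
    rados df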

Please shed some light on this 8-)

Kind regards,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended number of pools, one Q. ever wanted to ask

2012-02-28 Thread Oliver Francke

Well,

On 02/28/2012 10:42 AM, Wido den Hollander wrote:

Hi,

On 02/28/2012 10:35 AM, Oliver Francke wrote:

Hi *,

well, there was once a comment on our layout in means of too many 
pools.

Our setup is to have a pool per customer, to simplify the view on used
storage
capacity.
So, if we have - in a couple of months, we hope - more then some hundred
customers, this setup was not recommended, cause the whole system is not
designed for handling that. ( Sage)

What does not recommended mean? Is it, that per OSD the used memory
will be
too high?


Yes. Every new pool you create will consume some memory on the OSD. So 
if you start creating a lot of pools, you will also start consuming 
more and more memory.


I haven't followed this lately, but that is the current information I 
have.


The number of objects in a pool is also not a problem, you can have 
millions without any issues. It's the number of pools which will haunt 
you later on.


thnx for the quick reply. So if the number of pools per OSD is
the limiting factor and we don't have more than, let's say, ~100, that means
we should be safe.



Wido


best regards,

Oliver.




Is this a general performance issue?

Well, if we read pool, this gave us the basic idea/concept to put all
per-customers
data into it.

Please sched some light in 8-)

Kind regards,

Oliver.






--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with inconsistent PG

2012-02-17 Thread Oliver Francke
Well,

On 17.02.2012 at 18:54, Sage Weil wrote:

 On Fri, 17 Feb 2012, Oliver Francke wrote:
 Well then,
 
 found it via the ceph osd dump and the pool id, thanks. The customer in question
 opened a ticket this morning for not being able to boot his VM after
 shutdown, so I had to do some testdisk/fsck and tar the content into a new image.
 
 I hope there are no other bad blocks that are not visible as
 inconsistencies.
 
 As these faulty images were easy to detect because the boot block was affected,
 how big is the chance that there are more rb.* fragments being corrupted within
 an image, in reference to what you mentioned below:
 
 ...transactions to leak across checkpoint/snapshot boundaries.
 
 Do we have a chance to detect that? I fear not, because it will perhaps only be
 visible when doing an fsck inside the VM?!
 
 It is hard to say.  There is a small chance that it will trigger any time 
 ceph-osd is restarted.  The bug is fixed in the next release (which should 
 be out today), but of course upgrading involves shutting down :(.  
 Alternatively, you can cherry-pick the fixes, 
 1009d1a016f049e19ad729a0c00a354a3956caf7 and 
 93d7ef96316f30d3d7caefe07a5a747ce883ca2d.  v0.42 includes some encoding 
 changes that means you can upgrade but you can't downgrade again.  (These 
 encoding changes are being made so that in the future, you _can_ 
 downgrade.)
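 
 That is, roughly, in the ceph source tree you build from (a sketch):
 
     git cherry-pick 1009d1a016f049e19ad729a0c00a354a3956caf7
     git cherry-pick 93d7ef96316f30d3d7caefe07a5a747ce883ca2d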
 
 Here's what I suggest:
 
 - don't restart any ceph-osds if you can help it
 - wait for v0.42 to come out, and wait until Monday at least
 - pause read/write traffic to the cluster with
 
 ceph osd pause
 
 - wait at least 30 seconds for osds to do a commit without any load.  
   this makes it extremely unlikely you'd trigger the bug.
 - upgrade to v0.42, or restart with a patched ceph-osd.
 - unpause io with
 
 ceph osd unpause
 

that sounds reasonable, cool stuff ;-)

Thnx again,

Oliver.

 sage
 
 
 
 
 Anyway, thanks for your help and best regards,
 
 Oliver.
 
On 16.02.2012 at 19:02, Sage Weil wrote:
 
 On Thu, 16 Feb 2012, Oliver Francke wrote:
 Hi Sage,
 
 thnx for the quick response,
 
On 16.02.2012 at 18:17, Sage Weil wrote:
 
 On Thu, 16 Feb 2012, Oliver Francke wrote:
 Hi Sage, *,
 
 your tip with truncating from below did not solve the problem. Just to 
 recap:
 
 we had two inconsistencies, which we could break down to something like:
 
 rb.0.0.__head_DA680EE2
 
 according to the ceph dump from below. Walking to the node with the OSD 
 mounted on /data/osd3
 for example, and a stupid find ? brings up a couple of them, so the pg 
 number is relevant too -
 makes sense - we went into lets say /data/osd3/current/84.2_head/ and 
 did a hex dump from the file, looked really
 like the head, in means of signs from an installed grub-loader. But a 
 corrupted partition-table.
 From other of these files one could do a fdisk -l file and at least 
 a partition-table could have been
 found.
 Two days later we got a customers big complaint about not being able to 
 boot his VM anymore. The point now is,
 from such a file with name and pg, how can we identify the real file 
 being associated with, cause there is another
 customer with a potential problem with next reboot ( second 
 inconsistency).
 
 We also had some VM's in a big test-phase with similar problems? grub 
 going into rescue-prompt, invalid/corrupted
 partition tables, so all in the first head-file?
 Would be cool to get some more infos? and sched some light into the 
 structures ( myself not really being a good code-reader
 anymore ;) ).
 
 'head' in this case means the object hasn't been COWed (snapshotted and 
 then overwritten), and  means its the first 4MB block of the 
 rbd image/disk.
 
 
 yes, true,
 
 Were you able to use the 'rbd info' in the previous email to identify which 
 image it is?  Is that what you mean by 'identify the real file'?
 
 
 that's the point, from the object I would like to identify the complete 
 image location ala:
 
 pool/image
 
 from there I'd know, which customer's rbd disk-image is affected.
 
 For pool, look at the pgid, in this case '109.6'.  109 is the pool id.  
 Look at the pool list from 'ceph osd dump' output to see which pool name 
 that is.
 
 For the image, rb.0.0 is the image prefix.  Look at each rbd image in that 
 pool, and check for the image whose prefix matches.  e.g.,
 
 for img in `rbd -p poolname list` ; do rbd info $img -p poolname | grep 
 -q rb.0.0 && echo found $img ; done
 
 BTW, are you creating a pool per customer here?  You need to be a little 
 bit careful about creating large numbers of pools; the system isn't really 
 designed to be used that way.  You should use a pool if you have a 
 distinct data placement requirement (e.g., put these objects on this set 
 of ceph-osds).  But because of the way things work internally creating 
 hundreds/thousands of them won't be very efficient.
 
 sage
 
 
 
 Thnx for your patience,
 
 Oliver.
 
 I'm not sure I understand exactly what your question is.  I would

Re: Problem with inconsistent PG

2012-02-16 Thread Oliver Francke
Hi Sage, *,

your tip with truncating from below did not solve the problem. Just to recap:

we had two inconsistencies, which we could break down to something like:

rb.0.0.__head_DA680EE2

according to the ceph dump from below. Walking over to the node with the OSD
mounted on /data/osd3, for example, a stupid find … brings up a couple of them,
so the pg number is relevant too - makes sense. We went into, let's say,
/data/osd3/current/84.2_head/ and did a hex dump of the file; it really looked
like the head, in the sense of showing signs of an installed grub loader, but
with a corrupted partition table.
On other of these files one could run fdisk -l <file> and at least a
partition table could be found.
Two days later we got a customer's big complaint about not being able to boot
his VM anymore. The point now is: from such a file, with its name and pg, how
can we identify the real image it is associated with? Because there is another
customer with a potential problem at the next reboot (the second inconsistency).

We also had some VMs in a big test phase with similar problems… grub going into
the rescue prompt, invalid/corrupted partition tables, all in the first head file?
Would be cool to get some more info… and to shed some light on the structures
(myself not really being a good code reader anymore ;) ).

Thanks in advance and kind regards,

Oliver.

On 13.02.2012 at 18:13, Sage Weil wrote:

 On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
 
 Hi Liste,
 
 today i've got another problem.
 
 ceph -w shows up with an inconsistent PG over night:
 
 2012-02-10 08:38:48.701775pg v441251: 1982 pgs: 1981 active+clean, 1
 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
 GB avail
 2012-02-10 08:38:49.702789pg v441252: 1982 pgs: 1981 active+clean, 1
 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
 GB avail
 
 I've identified it with ceph pg dump - | grep inconsistent
 
 109.6   141000463820288111780111780   active+clean+inconsistent   485'7115   480'7301   [3,4]   [3,4]   485'7061   2012-02-10 08:02:12.043986
 
 Now I've tried to repair it with: ceph pg repair 109.6
 
 2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
 2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to
 repair' (0)
 
 but i only get the following result:
 
 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
 1ef398ce/rb.0.0.00bd/head size 2736128 != known size 3145728
 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
 objects
 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
 
 Can someone please explain me what to do in this case and how to recover
 the pg ?
 
 So the fix is just to truncate the file to the expected size, 3145728,
 by finding it in the current/ directory.  The name/path will be slightly
 weird; look for 'rb.0.0.00bd'.
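 
 A rough sketch of that (the osd data path is only an example; double-check the
 match before touching anything):
 
     # locate the object file for that block in the osd's data directory
     obj=$(find /data/osd3/current -name '*rb.0.0*00bd*__head_*')
     echo "$obj"                     # make sure this is the right file
     # then truncate it to the size the cluster expects
     truncate -s 3145728 "$obj"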
 
 The data is still suspect, though.  Did the ceph-osd restart or crash
 recently?  I would do that, repair (it should succeed), and then fsck the
 file system in that rbd image.
 
 We just fixed a bug that was causing transactions to leak across
 checkpoint/snapshot boundaries.  That could be responsible for causing all
 sorts of subtle corruptions, including this one.  It'll be included in
 v0.42 (out next week).
 
 sage
 
 Hi Sarge,
 
 no ... the osd didn't crash. I had to do some hardware maintenance and pushed it
 out of the distribution with ceph osd out 3. After a short while I used
 /etc/init.d/ceph stop on that osd.
 Then, after my work, I started ceph again and pushed it back into the distribution
 with ceph osd in 3.
 
 For the bug I'm worried about, stopping the daemon and crashing are 
 equivalent.  In both cases, a transaction may have been only partially 
 included in the checkpoint.
 
 Could you please tell me if this is the right way to get an osd out for
 maintenance? Is there
 anything else I should do to keep data consistent?
 
 You followed the right procedure.  There is (hopefully, was!) just a bug.
 
 sage
 
 
 My structure is: 3 MDS/MON servers on separate hardware nodes and 3 OSD nodes,
 each with a total capacity of 8 TB. Journaling is done on a separate SSD per node.
 The whole thing is a data store for a KVM virtualisation farm.
 The farm accesses the data directly via rbd.
 
 Thank you
 
 Jens
 
 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 --
 To unsubscribe from this list: send 

Re: Problem with inconsistent PG

2012-02-16 Thread Oliver Francke
Hi Sage,

thnx for the quick response,

On 16.02.2012 at 18:17, Sage Weil wrote:

 On Thu, 16 Feb 2012, Oliver Francke wrote:
 Hi Sage, *,
 
 your tip with truncating from below did not solve the problem. Just to recap:
 
 we had two inconsistencies, which we could break down to something like:
 
 rb.0.0.__head_DA680EE2
 
 according to the ceph dump from below. Walking to the node with the OSD 
 mounted on /data/osd3
 for example, and a stupid find ? brings up a couple of them, so the pg 
 number is relevant too -
 makes sense - we went into lets say /data/osd3/current/84.2_head/ and did 
 a hex dump from the file, looked really
 like the head, in means of signs from an installed grub-loader. But a 
 corrupted partition-table.
 From other of these files one could do a fdisk -l file and at least a 
 partition-table could have been
 found.
 Two days later we got a customers big complaint about not being able to boot 
 his VM anymore. The point now is,
 from such a file with name and pg, how can we identify the real file being 
 associated with, cause there is another
 customer with a potential problem with next reboot ( second inconsistency).
 
 We also had some VM's in a big test-phase with similar problems? grub going 
 into rescue-prompt, invalid/corrupted
 partition tables, so all in the first head-file?
 Would be cool to get some more infos? and sched some light into the 
 structures ( myself not really being a good code-reader
 anymore ;) ).
 
 'head' in this case means the object hasn't been COWed (snapshotted and 
 then overwritten), and  means its the first 4MB block of the 
 rbd image/disk.
 

yes, true,

 Were you able to use the 'rbd info' in the previous email to identify which 
 image it is?  Is that what you mean by 'identify the real file'?
 

that's the point: from the object I would like to identify the complete image
location, a la:

pool/image

from there I'd know which customer's rbd disk image is affected.

Thnx for your patience,

Oliver.

 I'm not sure I understand exactly what your question is.  I would have 
 expected modifying the file with fdisk -l to work (if fdisk sees a valid 
 partition table, it should be able to write it too).
 
 sage
 
 
 
 Thanks in@vance and kind regards,
 
 Oliver.
 
On 13.02.2012 at 18:13, Sage Weil wrote:
 
 On Sun, 12 Feb 2012, Jens Rehpoehler wrote:
 
 Hi Liste,
 
 today i've got another problem.
 
 ceph -w shows up with an inconsistent PG over night:
 
 2012-02-10 08:38:48.701775pg v441251: 1982 pgs: 1981 active+clean, 1
 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
 GB avail
 2012-02-10 08:38:49.702789pg v441252: 1982 pgs: 1981 active+clean, 1
 active+clean+inconsistent; 1790 GB data, 3368 GB used, 18977 GB / 22345
 GB avail
 
 I've identified it with ceph pg dump - | grep inconsistent
 
  109.6   141000463820288111780111780   active+clean+inconsistent   485'7115   480'7301   [3,4]   [3,4]   485'7061   2012-02-10 08:02:12.043986
 
 Now I've tried to repair it with: ceph pg repair 109.6
 
  2012-02-10 08:35:52.276325 mon <- [pg,repair,109.6]
  2012-02-10 08:35:52.276776 mon.1 -> 'instructing pg 109.6 on osd.3 to
  repair' (0)
 
 but i only get the following result:
 
 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455420 osd.3
 10.10.10.8:6801/25980 6913 : [ERR] 109.6 osd.4: soid
  1ef398ce/rb.0.0.00bd/head size 2736128 != known size 3145728
 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455426 osd.3
 10.10.10.8:6801/25980 6914 : [ERR] 109.6 scrub 0 missing, 1 inconsistent
 objects
 2012-02-10 08:36:18.447553   log 2012-02-10 08:36:08.455799 osd.3
 10.10.10.8:6801/25980 6915 : [ERR] 109.6 scrub 1 errors
 
 Can someone please explain me what to do in this case and how to recover
 the pg ?
 
 So the fix is just to truncate the file to the expected size, 3145728,
 by finding it in the current/ directory.  The name/path will be slightly
 weird; look for 'rb.0.0.00bd'.
 
 The data is still suspect, though.  Did the ceph-osd restart or crash
 recently?  I would do that, repair (it should succeed), and then fsck the
 file system in that rbd image.
 
 We just fixed a bug that was causing transactions to leak across
 checkpoint/snapshot boundaries.  That could be responsible for causing all
 sorts of subtle corruptions, including this one.  It'll be included in
 v0.42 (out next week).
 
 sage
 
 Hi Sarge,
 
 no ... the osd didn't crash. I had to do some hardware maintainance and 
 push
 it
 out of distribution with ceph osd out 3. After a short while i used
 /etc/init.d/ceph stop on that osd.
 Then, after my work i've started ceph and push it in the distribution with
 ceph osd in 3.
 
 For the bug I'm worried about, stopping the daemon